Two Threads, One Counter
Python’s counter += 1 isn’t atomic. It compiles to multiple bytecode
instructions (LOAD, ADD, STORE), and the interpreter can switch threads between
any two of them. Two threads incrementing a shared counter can lose updates.
Fix it with a threading.Lock(), or better yet, don’t share mutable
state at all.
the bug
Here’s code that looks correct and isn’t:
import threading

counter = 0

def increment():
    global counter
    for _ in range(100_000):
        counter += 1

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start()
t2.start()
t1.join()
t2.join()
print(counter)  # expected: 200000

Run this a few times. You’ll get numbers like 134,291 or 167,440. Never 200,000. The counter is losing increments.
why it breaks
counter += 1 looks like one operation. It isn’t. Python compiles it to
four bytecode instructions (opcode names shown for CPython 3.10; newer
versions use BINARY_OP, but the shape is the same):

    LOAD_GLOBAL counter    # read counter onto the stack
    LOAD_CONST 1
    INPLACE_ADD            # compute counter + 1
    STORE_GLOBAL counter   # write the result back

The GIL (Global Interpreter Lock) protects Python’s internal state, but it can be released between any two bytecodes. If Thread 1 does the LOAD (reads 0), then the GIL switches to Thread 2, which also reads 0, both threads compute 0 + 1 = 1, and both write 1 back. Two increments happened, but the counter only went up by one. That’s a lost update.
The simulation above shows exactly this interleaving. Step through “Unsafe” mode and watch both threads read the same stale value.
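If you want to see the breakdown on your own interpreter, the standard-library dis module will disassemble the function (the exact opcode names it prints depend on your CPython version):

import dis

counter = 0

def increment():
    global counter
    counter += 1

# Prints the bytecode for increment(); look for the LOAD / add / STORE
# sequence that implements counter += 1.
dis.dis(increment)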
the fix
Wrap the critical section in a threading.Lock():
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100_000):
        with lock:
            counter += 1

with lock acquires before entering and releases on exit. If Thread 2 tries
to acquire while Thread 1 holds it, Thread 2 blocks until Thread 1 releases.
No two threads can be inside the critical section at the same time. Switch
the simulation to “Lock” mode to see this in action.
the better fix
Don’t share mutable state. If neither thread writes to a shared variable, there’s nothing to race on.
from queue import Queue

q = Queue()

def producer():
    for _ in range(100_000):
        q.put(1)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=producer)
t1.start()
t2.start()
t1.join()
t2.join()

total = 0
while not q.empty():
    total += q.get()
print(total)  # always 200000

queue.Queue is thread-safe internally. Each thread only writes to the
queue, and a single consumer reads from it. No shared mutable state, no
race condition. Switch to “Queue” mode in the simulation to see the
difference.
+ is Python actually parallel?
Not with threads. The GIL means only one thread runs Python bytecode at a time, even on a 32-core machine. CPU-bound threads take turns; they don’t run simultaneously.
But IO-bound threads do benefit from threading. When a thread does a
socket.recv() or file.read(), it releases the GIL while waiting for
the OS to return data. Other threads run during that wait. This is why
threading works well for web scrapers, API clients, and network servers.
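As a rough sketch of the IO-bound case (the URL list here is just a placeholder; any slow network endpoint shows the same effect), a thread pool lets the waits overlap:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Placeholder URLs; swap in whatever endpoints you actually need.
urls = ["https://example.com"] * 10

def fetch(url):
    # urlopen releases the GIL while blocked on the socket,
    # so the other workers keep running during the wait.
    with urlopen(url) as resp:
        return len(resp.read())

with ThreadPoolExecutor(max_workers=5) as pool:
    sizes = list(pool.map(fetch, urls))

print(sizes)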
For true CPU parallelism in Python, use multiprocessing. Each process
gets its own interpreter and its own GIL. The tradeoff: processes don’t
share memory by default, so you need multiprocessing.Queue,
multiprocessing.Value, or shared memory to communicate between them.
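A minimal sketch of the CPU-bound case, assuming the work splits into independent chunks that each worker can compute on its own:

from multiprocessing import Pool

def count_up(n):
    # Pure-Python CPU work; each worker process has its own interpreter
    # and its own GIL, so the two chunks really run in parallel.
    total = 0
    for _ in range(n):
        total += 1
    return total

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        results = pool.map(count_up, [100_000, 100_000])
    print(sum(results))  # 200000, and nothing shared to race on

Each worker returns its own partial result and the parent aggregates them, which is the “don’t share mutable state” pattern again.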
+ why does the GIL exist?
CPython uses reference counting for memory management. Every object has a
reference count, and when it hits zero, the object is freed immediately.
Without a GIL, every Py_INCREF and Py_DECREF would need to be atomic,
which would slow down single-threaded code (the common case) significantly.
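You can watch the reference count from Python itself (sys.getrefcount reports one extra reference because passing the object into the call creates a temporary one):

import sys

x = []
print(sys.getrefcount(x))  # includes the temporary reference from this call

y = x                      # a second name for the same list
print(sys.getrefcount(x))  # count goes up by one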
The GIL also simplifies C extensions. Extension authors don’t need to worry about thread safety for most operations because the GIL serializes access to Python objects.
There have been multiple attempts to remove the GIL (Larry Hastings’ “Gilectomy”, Sam Gross’s “nogil” fork). PEP 703 was accepted in 2023 to make the GIL optional in CPython 3.13+, but it’s experimental and disabled by default. The challenge isn’t just removing the lock. It’s maintaining single-threaded performance and C extension compatibility while doing it.
+ atomicity in other languages
This problem isn’t Python-specific. It’s everywhere that threads share mutable state. The solutions vary:
Java has AtomicInteger with methods like incrementAndGet() that
use CPU-level compare-and-swap (CAS) instructions. No lock needed.
synchronized blocks serve the same purpose as Python’s Lock.
Go has sync/atomic for atomic operations and sync.Mutex for
locks. But Go’s preferred pattern is CSP (Communicating Sequential
Processes): goroutines communicate through channels instead of sharing
memory. “Don’t communicate by sharing memory; share memory by
communicating.”
Rust makes data races a compile-time error. You can’t share mutable
data between threads unless it’s wrapped in Arc<Mutex<T>> or uses
atomics. The borrow checker enforces this at compile time, not at runtime.
+ what about asyncio?
asyncio is single-threaded concurrency. There’s only one thread, so nothing
preempts your code mid-statement; shared state can only change when you let it.
Scheduling is cooperative: tasks yield control at await points. Between
await points, your code runs uninterrupted. This means counter += 1
in an async function is safe, as long as there’s no await between the
read and the write.
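A small sketch of that distinction, using asyncio.sleep(0) to stand in for any real await:

import asyncio

counter = 0

async def safe_increment():
    global counter
    for _ in range(100_000):
        counter += 1              # no await between read and write: safe

async def unsafe_increment():
    global counter
    for _ in range(100_000):
        value = counter
        await asyncio.sleep(0)    # yields to the event loop mid-update
        counter = value + 1       # may overwrite another task's increment

async def main():
    global counter
    counter = 0
    await asyncio.gather(safe_increment(), safe_increment())
    print(counter)                # 200000

    counter = 0
    await asyncio.gather(unsafe_increment(), unsafe_increment())
    print(counter)                # well below 200000: updates lost at the await

asyncio.run(main())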
The tradeoff: asyncio can’t use multiple CPU cores (it’s one thread), and a CPU-intensive task blocks the entire event loop. It’s designed for IO-bound workloads where you’re mostly waiting on network or disk.
The simulation shows you the textbook version of a race condition. Two threads, one counter, a lost update. What follows is what the textbooks tend to skip.
the real problem isn’t the GIL
Python developers sometimes think the GIL is the source of their threading problems. It’s actually hiding the harder ones.
In languages without a GIL (Java, Go, C++), counter += 1 with two threads
is still broken on modern hardware. Not because of bytecode interleaving,
but because of CPU architecture.
Each core has its own L1/L2 cache. When Thread 1 on Core 0 writes
counter = 1, that value sits in Core 0’s cache. Thread 2 on Core 1 might
still see the old value in its own cache. The CPU’s cache coherence protocol
(MESI and its variants) will eventually propagate the update, but “eventually” is the
problem. Between the write and the propagation, both cores have different
views of the same memory address.
It gets worse. Modern CPUs reorder instructions for performance. A store followed by a load might execute as a load followed by a store if the CPU decides that’s faster. Your source code says one thing. The CPU does another. On x86 this is relatively tame (stores are ordered with respect to other stores). On ARM, the reordering is much more aggressive.
The GIL hides all of this. Because only one thread runs at a time, there’s
no concurrent access to memory, no cache coherence issues, no reordering
surprises. The moment you reach for multiprocessing with shared memory, or
call into C extensions that release the GIL, these problems reappear in full
force.
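A sketch of how the race returns once the GIL is out of the picture, using a multiprocessing.Value with its lock disabled:

from multiprocessing import Process, Value

def increment(counter):
    for _ in range(100_000):
        counter.value += 1        # read-modify-write on shared memory, no lock

if __name__ == "__main__":
    counter = Value("i", 0, lock=False)   # raw shared int, no synchronization
    p1 = Process(target=increment, args=(counter,))
    p2 = Process(target=increment, args=(counter,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print(counter.value)          # usually well below 200000

Even with the default lock=True, += is still a separate read and write, so the documented fix is to hold the wrapper’s lock explicitly: with counter.get_lock(): counter.value += 1.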
memory models
A memory model defines what one thread is guaranteed to see when another thread writes to memory. It’s a contract between the programmer, the compiler, and the hardware.
“Happens-before” is the key concept. If operation A happens-before operation B, then B is guaranteed to see the effects of A. Releasing a lock happens-before the next acquisition of that same lock by another thread. Starting a thread happens-before any operation in that thread. These relationships form a partial order on operations, and only operations connected by happens-before have visibility guarantees.
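A minimal sketch of what a happens-before edge looks like in Python code. The ordering here comes from the lock inside threading.Event (and, today, from the GIL): set() happens-before a wait() that returns, so the consumer is guaranteed to see the write that preceded the set:

import threading

data = None
ready = threading.Event()

def producer():
    global data
    data = 42          # write the payload first...
    ready.set()        # ...then publish it

def consumer():
    ready.wait()       # once this returns, the write above is visible
    print(data)        # prints 42, never None

threading.Thread(target=producer).start()
threading.Thread(target=consumer).start()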
Java formalized this in the Java Memory Model (JSR-133, 2004). It was
groundbreaking because it gave developers a way to reason about concurrent
code without understanding CPU cache protocols. C++ followed with
std::memory_order in C++11, offering a spectrum from memory_order_relaxed
(no guarantees beyond atomicity) to memory_order_seq_cst (full sequential
consistency, the default for std::atomic).
Python doesn’t have a formal memory model. It doesn’t need one, because the GIL provides sequential consistency for free. Every bytecode instruction completes before the next one starts, and thread switches happen only between bytecodes. But this is an implementation detail of CPython, not a language guarantee. PyPy, Jython, and IronPython have different threading behavior.
If Python removes the GIL (PEP 703), it will need a memory model. The
Python developers will have to decide what guarantees threading.Lock,
queue.Queue, and threading.Event provide about memory visibility. These
are the questions that Java and C++ spent years working through, and they’re
harder than they look.
Understanding memory models isn’t about memorizing rules. It’s about recognizing that concurrent programs don’t execute the way your source code reads, and that locks and atomics aren’t just about mutual exclusion. They’re about making writes visible across threads. Without them, your code isn’t wrong because of a bug. It’s wrong because the hardware makes no promises.