Two Threads, One Counter
Python’s counter += 1 isn’t atomic. It compiles to multiple bytecode
instructions (LOAD, ADD, STORE), and the interpreter can switch threads between
any two of them. Two threads incrementing a shared counter can lose updates.
Fix it with a threading.Lock(), or better yet, don’t share mutable
state at all.
the bug
Here’s code that looks correct and isn’t:
import threading

counter = 0

def increment():
    global counter
    for _ in range(100_000):
        counter += 1

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start()
t2.start()
t1.join()
t2.join()
print(counter)  # expected: 200000

Run this a few times. You’ll get numbers like 134,291 or 167,440. Never 200,000. The counter is losing increments.
why it breaks
counter += 1 looks like one operation. It isn’t. Python compiles it to
four bytecode instructions (opcode names shown for CPython 3.10; newer
versions use BINARY_OP, but the shape is the same):

    LOAD_GLOBAL counter    # read counter onto the stack
    LOAD_CONST 1
    INPLACE_ADD            # compute counter + 1
    STORE_GLOBAL counter   # write the result back

The GIL (Global Interpreter Lock) protects Python’s internal state, but it can be released between any two bytecodes. If Thread 1 does the LOAD (reads 0), then the GIL switches to Thread 2, which also reads 0, both threads compute 0 + 1 = 1, and both write 1 back. Two increments happened, but the counter only went up by one. That’s a lost update.
The simulation above shows exactly this interleaving. Step through “Unsafe” mode and watch both threads read the same stale value.
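If you want to see the breakdown on your own interpreter, the standard-library dis module will disassemble the function (the exact opcode names it prints depend on your CPython version):

import dis

counter = 0

def increment():
    global counter
    counter += 1

# Prints the bytecode for increment(); look for the LOAD / add / STORE
# sequence that implements counter += 1.
dis.dis(increment)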
the fix
Wrap the critical section in a threading.Lock():
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100_000):
        with lock:
            counter += 1

with lock acquires before entering and releases on exit. If Thread 2 tries
to acquire while Thread 1 holds it, Thread 2 blocks until Thread 1 releases.
No two threads can be inside the critical section at the same time. Switch
the simulation to “Lock” mode to see this in action.
the better fix
Don’t share mutable state. If neither thread writes to a shared variable, there’s nothing to race on.
from queue import Queue

q = Queue()

def producer():
    for _ in range(100_000):
        q.put(1)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=producer)
t1.start()
t2.start()
t1.join()
t2.join()

total = 0
while not q.empty():
    total += q.get()
print(total)  # always 200000

queue.Queue is thread-safe internally. Each thread only writes to the
queue, and a single consumer reads from it. No shared mutable state, no
race condition. Switch to “Queue” mode in the simulation to see the
difference.
+ is Python actually parallel?
Not with threads. The GIL means only one thread runs Python bytecode at a time, even on a 32-core machine. CPU-bound threads take turns; they don’t run simultaneously.
But IO-bound threads do benefit from threading. When a thread does a
socket.recv() or file.read(), it releases the GIL while waiting for
the OS to return data. Other threads run during that wait. This is why
threading works well for web scrapers, API clients, and network servers.
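As a rough sketch of the IO-bound case (the URL list here is just a placeholder; any slow network endpoint shows the same effect), a thread pool lets the waits overlap:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Placeholder URLs; swap in whatever endpoints you actually need.
urls = ["https://example.com"] * 10

def fetch(url):
    # urlopen releases the GIL while blocked on the socket,
    # so the other workers keep running during the wait.
    with urlopen(url) as resp:
        return len(resp.read())

with ThreadPoolExecutor(max_workers=5) as pool:
    sizes = list(pool.map(fetch, urls))

print(sizes)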
For true CPU parallelism in Python, use multiprocessing. Each process
gets its own interpreter and its own GIL. The tradeoff: processes don’t
share memory by default, so you need multiprocessing.Queue,
multiprocessing.Value, or shared memory to communicate between them.
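A minimal sketch of the CPU-bound case, assuming the work splits into independent chunks that each worker can compute on its own:

from multiprocessing import Pool

def count_up(n):
    # Pure-Python CPU work; each worker process has its own interpreter
    # and its own GIL, so the two chunks really run in parallel.
    total = 0
    for _ in range(n):
        total += 1
    return total

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        results = pool.map(count_up, [100_000, 100_000])
    print(sum(results))  # 200000, and nothing shared to race on

Each worker returns its own partial result and the parent aggregates them, which is the “don’t share mutable state” pattern again.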
+ why does the GIL exist?
CPython uses reference counting for memory management. Every object has a
reference count, and when it hits zero, the object is freed immediately.
Without a GIL, every Py_INCREF and Py_DECREF would need to be atomic,
which would slow down single-threaded code (the common case) significantly.
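You can watch the reference count from Python itself (sys.getrefcount reports one extra reference because passing the object into the call creates a temporary one):

import sys

x = []
print(sys.getrefcount(x))  # includes the temporary reference from this call

y = x                      # a second name for the same list
print(sys.getrefcount(x))  # count goes up by one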
The GIL also simplifies C extensions. Extension authors don’t need to worry about thread safety for most operations because the GIL serializes access to Python objects.
There have been multiple attempts to remove the GIL (Larry Hastings’ “Gilectomy”, Sam Gross’s “nogil” fork). PEP 703 was accepted in 2023 to make the GIL optional in CPython 3.13+, but it’s experimental and disabled by default. The challenge isn’t just removing the lock. It’s maintaining single-threaded performance and C extension compatibility while doing it.
+ atomicity in other languages
This problem isn’t Python-specific. It’s everywhere that threads share mutable state. The solutions vary:
Java has AtomicInteger with methods like incrementAndGet() that
use CPU-level compare-and-swap (CAS) instructions. No lock needed.
synchronized blocks serve the same purpose as Python’s Lock.
Go has sync/atomic for atomic operations and sync.Mutex for
locks. But Go’s preferred pattern is CSP (Communicating Sequential
Processes): goroutines communicate through channels instead of sharing
memory. “Don’t communicate by sharing memory; share memory by
communicating.”
Rust makes data races a compile-time error. You can’t share mutable
data between threads unless it’s wrapped in Arc<Mutex<T>> or uses
atomics. The borrow checker enforces this at compile time, not at runtime.
+ what about asyncio?
asyncio is single-threaded concurrency. There’s only one thread, so nothing
preempts your code mid-statement; shared state can only change when you let it.
Scheduling is cooperative: tasks yield control at await points. Between
await points, your code runs uninterrupted. This means counter += 1
in an async function is safe, as long as there’s no await between the
read and the write.
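A small sketch of that distinction, using asyncio.sleep(0) to stand in for any real await:

import asyncio

counter = 0

async def safe_increment():
    global counter
    for _ in range(100_000):
        counter += 1              # no await between read and write: safe

async def unsafe_increment():
    global counter
    for _ in range(100_000):
        value = counter
        await asyncio.sleep(0)    # yields to the event loop mid-update
        counter = value + 1       # may overwrite another task's increment

async def main():
    global counter
    counter = 0
    await asyncio.gather(safe_increment(), safe_increment())
    print(counter)                # 200000

    counter = 0
    await asyncio.gather(unsafe_increment(), unsafe_increment())
    print(counter)                # well below 200000: updates lost at the await

asyncio.run(main())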
The tradeoff: asyncio can’t use multiple CPU cores (it’s one thread), and a CPU-intensive task blocks the entire event loop. It’s designed for IO-bound workloads where you’re mostly waiting on network or disk.
The simulation shows you the textbook version of a race condition. Two threads, one counter, a lost update. What follows is what the textbooks tend to skip.
the real problem isn’t the GIL
Python developers sometimes think the GIL is the source of their threading problems. It’s actually hiding the harder ones.
In languages without a GIL (Java, Go, C++), counter += 1 with two threads
is still broken on modern hardware. Not because of bytecode interleaving,
but because of CPU architecture.
Each core has its own L1/L2 cache. When Thread 1 on Core 0 writes
counter = 1, that value sits in Core 0’s cache. Thread 2 on Core 1 might
still see the old value in its own cache. The CPU’s cache coherence protocol
(MESI and its variants) will eventually propagate the update, but “eventually” is the
problem. Between the write and the propagation, both cores have different
views of the same memory address.
It gets worse. Modern CPUs reorder instructions for performance. A store followed by a load might execute as a load followed by a store if the CPU decides that’s faster. Your source code says one thing. The CPU does another. On x86 this is relatively tame (stores are ordered with respect to other stores). On ARM, the reordering is much more aggressive.
The GIL hides all of this. Because only one thread runs at a time, there’s
no concurrent access to memory, no cache coherence issues, no reordering
surprises. The moment you reach for multiprocessing with shared memory, or
call into C extensions that release the GIL, these problems reappear in full
force.
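A sketch of how the race returns once the GIL is out of the picture, using a multiprocessing.Value with its lock disabled:

from multiprocessing import Process, Value

def increment(counter):
    for _ in range(100_000):
        counter.value += 1        # read-modify-write on shared memory, no lock

if __name__ == "__main__":
    counter = Value("i", 0, lock=False)   # raw shared int, no synchronization
    p1 = Process(target=increment, args=(counter,))
    p2 = Process(target=increment, args=(counter,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print(counter.value)          # usually well below 200000

Even with the default lock=True, += is still a separate read and write, so the documented fix is to hold the wrapper’s lock explicitly: with counter.get_lock(): counter.value += 1.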
memory models
A memory model defines what one thread is guaranteed to see when another thread writes to memory. It’s a contract between the programmer, the compiler, and the hardware.
“Happens-before” is the key concept. If operation A happens-before operation B, then B is guaranteed to see the effects of A. Releasing a lock happens-before the next acquisition of that same lock by another thread. Starting a thread happens-before any operation in that thread. These relationships form a partial order on operations, and only operations connected by happens-before have visibility guarantees.
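A minimal sketch of what a happens-before edge looks like in Python code. The ordering here comes from the lock inside threading.Event (and, today, from the GIL): set() happens-before a wait() that returns, so the consumer is guaranteed to see the write that preceded the set:

import threading

data = None
ready = threading.Event()

def producer():
    global data
    data = 42          # write the payload first...
    ready.set()        # ...then publish it

def consumer():
    ready.wait()       # once this returns, the write above is visible
    print(data)        # prints 42, never None

threading.Thread(target=producer).start()
threading.Thread(target=consumer).start()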
Java formalized this in the Java Memory Model (JSR-133, 2004). It was
groundbreaking because it gave developers a way to reason about concurrent
code without understanding CPU cache protocols. C++ followed with
std::memory_order in C++11, offering a spectrum from memory_order_relaxed
(no guarantees beyond atomicity) to memory_order_seq_cst (full sequential
consistency, the default for std::atomic).
Python doesn’t have a formal memory model. It doesn’t need one, because the GIL provides sequential consistency for free. Every bytecode instruction completes before the next one starts, and thread switches happen only between bytecodes. But this is an implementation detail of CPython, not a language guarantee. PyPy, Jython, and IronPython have different threading behavior.
If Python removes the GIL (PEP 703), it will need a memory model. The
Python developers will have to decide what guarantees threading.Lock,
queue.Queue, and threading.Event provide about memory visibility. These
are the questions that Java and C++ spent years working through, and they’re
harder than they look.
Understanding memory models isn’t about memorizing rules. It’s about recognizing that concurrent programs don’t execute the way your source code reads, and that locks and atomics aren’t just about mutual exclusion. They’re about making writes visible across threads. Without them, your code isn’t wrong because of a bug. It’s wrong because the hardware makes no promises.