Memory Models and Happens-Before
When Thread 1 writes x = 1, Thread 2 might not see it. Not because of
a bug, but because CPUs have caches and reorder instructions. A memory
model defines what visibility guarantees you get. Locks provide those
guarantees. Without locks or atomics, you have no guarantees at all.
the broken flag
This is the most common anti-pattern in concurrent code. One thread sets a flag, another thread watches for it:
import threading

data = 0
ready = False

def writer():
    global data, ready
    data = 42
    ready = True

def reader():
    while not ready:
        pass
    print(data)  # might print 0!

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t2.start()
t1.start()
t1.join()
t2.join()

The programmer’s intent is clear. Write data first, then set ready.
The reader spins until ready is True, then prints data. Should
always print 42.
It might print 0.
Two things can go wrong. First, the CPU has store buffers. When the writer
sets data = 42, that value goes into Core 0’s store buffer, not directly
to main memory. Core 1 (running the reader) might see the stale value.
The store buffer for ready = True might flush before the one for
data = 42, so the reader sees ready as True but data as 0.
Second, the CPU (and the compiler) can reorder instructions. data = 42
and ready = True don’t depend on each other from the CPU’s perspective.
The processor is free to execute them in either order. If ready = True
executes first, the reader thread wakes up and reads data before it’s
been written.
Both problems have the same root cause: without explicit synchronization, one thread’s writes have no guaranteed order when observed by another thread.
why locks fix it
threading.Lock() does more than mutual exclusion. It also acts as a
memory fence:
import threading

lock = threading.Lock()
data = 0

def writer():
    global data
    with lock:
        data = 42

def reader():
    with lock:
        print(data)  # always 42

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t1.start()
t1.join()
t2.start()
t2.join()

When Thread 1 releases the lock, all of its pending writes are flushed to main memory. When Thread 2 acquires the lock, its CPU cache is invalidated, forcing fresh reads from main memory. Every write before the release is visible to every read after the next acquire.
This is the happens-before relationship. Lock release happens-before the next lock acquire of the same lock. It’s not just about preventing two threads from entering a critical section at the same time. It’s about making writes visible across cores.
happens-before rules
Python doesn’t have a formal memory model specification (yet), but the threading primitives provide these ordering guarantees in CPython:
- Lock release happens-before the next acquire of the same lock. All writes before the release are visible after the acquire.
- Thread start happens-before any operation in the started thread. If you set x = 10 before calling t.start(), the new thread sees x = 10.
- Thread join happens-before the code after join() returns. If the thread wrote result = 42 before finishing, the parent thread sees it after join().
- Queue.put() happens-before Queue.get() for the same item. Data passed through a Queue is always visible to the consumer.
- Event.set() happens-before Event.wait() returns. Any writes before set() are visible to the thread that returns from wait().
If two operations are not connected by any of these rules, you have no visibility guarantee. The operations are concurrent, and either thread might see stale data. This is why the broken flag pattern fails: there’s no happens-before edge between the writer and the reader.
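The fix is to route the signal through one of the primitives above. Here is a minimal sketch using threading.Event, relying on the set()/wait() edge to order the write to data before the read (the names mirror the broken example, but this variant is my illustration, not code from earlier in the series):

import threading

data = 0
ready = threading.Event()

def writer():
    global data
    data = 42        # ordered before set() by the happens-before rule
    ready.set()

def reader():
    ready.wait()     # returns only after set(); the write to data is visible
    print(data)      # prints 42

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t2.start()
t1.start()
t1.join()
t2.join()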
To recap: a shared boolean flag with no synchronization is broken. The writer sets data = 42 then ready = True, but the reader might see ready as True while data is still 0, because store buffers and instruction reordering mean one thread’s writes have no guaranteed order when seen by another. A lock fixes it by acting as a memory fence: release flushes all pending writes, acquire invalidates the cache and forces fresh reads, and together they create a happens-before edge. The edges CPython gives you are lock release/acquire, thread start, thread join, queue put/get, and event set/wait. No edge, no guarantee.
why the GIL hides this
In CPython, the GIL (Global Interpreter Lock) provides sequential consistency for free. Only one thread executes Python bytecode at a time. There’s no true concurrent memory access, so there are no store buffer surprises and no reordering visible to Python code.
This is why the broken flag pattern actually works in CPython most of the time. The GIL serializes the bytecode instructions, and its acquire/release cycle acts as a memory barrier. Thread 2 sees Thread 1’s writes because the GIL forces them to take turns.
But this is a CPython implementation detail, not a language guarantee. PyPy uses a different GIL implementation with different timing characteristics. Jython and IronPython run on the JVM and CLR respectively, with their own memory models. And the experimental no-GIL build (PEP 703, available as a build option in CPython 3.13+) removes the GIL entirely.
If the GIL goes away, Python code that relies on accidental sequential consistency will break. The broken flag pattern will produce real bugs. Python will need a formal memory model, and developers who never had to think about happens-before will suddenly need to. Every other language that removed or never had a GIL went through this transition. Python’s turn is coming.
the Java Memory Model
Java formalized its memory model in JSR-133 (2004). It was the first mainstream language to specify exactly what one thread is guaranteed to see when another thread writes to memory.
Key rules: volatile reads and writes have happens-before guarantees. A
write to a volatile variable happens-before every subsequent read of
that variable. synchronized blocks provide acquire/release semantics,
just like Python’s locks.
The canonical example of why this matters: double-checked locking for singletons.
// BROKEN without volatile
class Singleton {
    private static Singleton instance;

    public static Singleton getInstance() {
        if (instance == null) {
            synchronized (Singleton.class) {
                if (instance == null) {
                    instance = new Singleton();
                }
            }
        }
        return instance;
    }
}

This looks correct. But instance = new Singleton() involves three steps: allocate memory, run the constructor, assign the reference. The CPU can reorder these so the reference is assigned before the constructor finishes. Thread 2 sees a non-null instance and returns an object whose fields are still uninitialized.
The fix: declare instance as volatile. The JMM guarantees that a
volatile write publishes all prior writes, and a volatile read sees
all those writes. The constructor finishes before the reference becomes
visible.
This bug was so widespread that it took years and a formal memory model to explain why it was broken and how to fix it.
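Python has no volatile, so the analogue of the fixed singleton is to keep both the check and the construction under a lock and let the release/acquire edge publish the finished object. A minimal sketch (the Config class and get_instance name are invented for illustration, not taken from the Java example):

import threading

class Config:
    # stand-in for an expensive-to-construct singleton
    def __init__(self):
        self.value = 42

_instance = None
_instance_lock = threading.Lock()

def get_instance():
    global _instance
    # Check and construct under the same lock: the release happens-before
    # the next acquire, so no thread can observe a half-built Config.
    with _instance_lock:
        if _instance is None:
            _instance = Config()
        return _instance

This gives up the lock-free fast path that double-checked locking was chasing; the tradeoff is simplicity over an uncontended-read optimization.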
compare-and-swap
At the CPU instruction level, atomic operations work through compare-and-swap (CAS). The idea: read a value, compute a new value, write the new value only if the original hasn’t changed since you read it. If it has changed, retry.
# pseudocode for CAS
def compare_and_swap(location, expected, new_value):
    # this entire function is ONE atomic CPU instruction
    current = read(location)
    if current == expected:
        write(location, new_value)
        return True
    return False

A lock-free counter using CAS:
# pseudocode (not real Python, but shows the pattern)
def atomic_increment(counter):
    while True:
        old = counter.value
        new = old + 1
        if compare_and_swap(counter, old, new):
            return new
        # another thread changed it, retry

No lock is held. No thread blocks. If contention is low, CAS succeeds on the first try. Under high contention, threads spin and retry, which can be worse than a lock.
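Pure Python has no user-level CAS, but the retry loop can be sketched by letting a lock stand in for the atomic instruction. This is purely illustrative (a real CAS needs no lock, and SimulatedAtomicInt is a made-up name):

import threading

class SimulatedAtomicInt:
    # Illustration only: the lock plays the role of the single CAS instruction.
    def __init__(self, value=0):
        self.value = value
        self._cas_lock = threading.Lock()

    def compare_and_swap(self, expected, new_value):
        with self._cas_lock:
            if self.value == expected:
                self.value = new_value
                return True
            return False

def atomic_increment(counter):
    # The retry loop from the pseudocode above.
    while True:
        old = counter.value
        if counter.compare_and_swap(old, old + 1):
            return old + 1
        # another thread won the race; reload and retry

counter = SimulatedAtomicInt()

def worker():
    for _ in range(10_000):
        atomic_increment(counter)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)  # 40000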
x86 implements this as CMPXCHG (compare and exchange). ARM uses a
load-exclusive/store-exclusive pair: LDXR reads the value and marks
the cache line as exclusive, STXR writes the new value only if the
exclusive mark is still held. If another core touched that cache line,
the store fails and you retry.
Python exposes this indirectly through threading primitives. In
C/C++, you have std::atomic with compare_exchange_strong. In Java,
AtomicInteger.compareAndSet. In Rust, AtomicI32::compare_exchange.
These are the building blocks of lock-free queues, lock-free hash maps,
and the concurrent data structures you find in java.util.concurrent.
The examples above show what happens when two threads share memory without synchronization: the writes are there, but the reads don’t see them. This final post in the series takes a step back from Python and looks at where these rules come from.
Lamport’s logical clocks
In 1978, Leslie Lamport published “Time, Clocks, and the Ordering of Events in a Distributed System.” It’s one of the most cited papers in computer science, and it established the happens-before relation that underpins everything in this post.
Lamport’s insight: in a distributed system (or a multi-core CPU), there is no single global clock. You cannot say “event A happened at time T1 and event B happened at time T2, so A happened first.” Clocks drift. Messages have latency. There is no universal “now.”
Instead, you can only reason about ordering through causal relationships. If process A sends a message and process B receives it, then A’s send happens-before B’s receive. If A does operation X before operation Y in program order, then X happens-before Y. These relationships are transitive: if X happens-before Y and Y happens-before Z, then X happens-before Z.
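Lamport’s construction turns these rules into numbers. A minimal sketch of a logical clock (my illustration of the algorithm from the paper; the class name is invented):

class LamportClock:
    # Assigns counters consistent with happens-before:
    # if A happens-before B, then clock(A) < clock(B).
    def __init__(self):
        self.time = 0

    def local_event(self):
        # program order: every local step advances the clock
        self.time += 1
        return self.time

    def send(self):
        # the message carries the sender's timestamp
        self.time += 1
        return self.time

    def receive(self, message_time):
        # jump past anything the sender had already seen
        self.time = max(self.time, message_time) + 1
        return self.time

The converse does not hold: a smaller timestamp does not mean the event happened first. Concurrent events simply get an arbitrary but consistent numbering.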
The critical part: if two events have no causal relationship, they are concurrent. Neither happened first. Both orderings are valid depending on the observer. This is not a limitation of our measurement tools. It is a fundamental property of distributed computation.
This is exactly the same insight that makes memory models necessary. Two threads on two cores, with no lock or fence between them, are a distributed system. Their writes are concurrent. Either ordering is valid. The hardware chooses whichever is fastest, and “fastest” can change between runs, between cores, between microarchitecture revisions.
When you acquire a lock, you create a causal edge. The release happens-before the acquire. The writes before the release are ordered before the reads after the acquire. Without that edge, you’re in Lamport’s concurrent zone, and the hardware owes you nothing.
the spectrum of consistency
Memory models sit on a spectrum. From strongest to weakest:
Sequential consistency. Every operation appears to execute in some global order that is consistent with each thread’s program order. If Thread 1 writes A then B, and Thread 2 writes C then D, the global order must preserve A-before-B and C-before-D, but A and C can interleave in any valid way. This is the model most programmers assume. It’s also the most expensive to implement, because it requires every write to be immediately visible to every core.
Total store order (TSO). x86 uses this. Stores from a single core are seen in program order by all other cores. But a core can read its own stores before they become visible to others (through the store buffer). This is slightly weaker than sequential consistency, but strong enough that most code works without explicit fences. It’s why x86 programmers rarely encounter memory ordering bugs.
Weak/relaxed ordering. ARM and RISC-V use this. Stores and loads can be reordered freely unless you insert explicit barrier instructions. This gives the hardware maximum freedom to optimize, but puts the burden on the programmer (or the compiler) to insert barriers where ordering matters. Code that works on x86 can break on ARM because ARM provides fewer guarantees by default.
Causal consistency. Only causally related operations are ordered. If A causes B, everyone sees A before B. But concurrent operations can appear in any order to different observers.
Eventual consistency. All replicas will converge to the same state, but there are no ordering guarantees in between. A write might not be visible for seconds, minutes, or longer. This is the model used by many distributed databases (DynamoDB, Cassandra) for performance.
The tradeoff is always the same: stronger consistency requires more synchronization, which means more latency and less throughput. CPUs chose weak memory models because fence instructions on every store would cut performance in half. Distributed databases offer tunable consistency because sequential consistency across data centers is prohibitively slow.
Understanding this spectrum is what separates knowing how to use locks from knowing when you need them. A lock gives you sequential consistency within its critical section. Outside the lock, you get whatever the hardware provides. Knowing what the hardware provides, and what it doesn’t, is the difference between code that works and code that works on your machine, on your CPU, today.
This is the last post in the series. We started with a counter that loses increments and ended with the theoretical foundations that explain why. Locks, condition variables, semaphores, read-write locks, thread pools, event loops, memory fences. They’re all tools for imposing order on systems that have no natural ordering. The hardware is fast because it makes no promises. Your job, as the programmer, is to know exactly which promises you need and to demand them explicitly. Nothing more, nothing less.