Deadlock at the Dinner Table
A deadlock happens when threads form a cycle of resource dependencies. Thread A holds lock 1 and wants lock 2. Thread B holds lock 2 and wants lock 1. Neither can proceed. Four conditions must all be true for deadlock: mutual exclusion, hold-and-wait, no preemption, and circular wait. Break any one of them and deadlock becomes impossible. The simplest fix: always acquire locks in the same order.
the bug
Here are two threads that will never finish:
```python
import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def worker_1():
    with lock_a:
        time.sleep(0.1)  # simulate some work
        with lock_b:     # blocks forever
            print("worker 1 done")

def worker_2():
    with lock_b:
        time.sleep(0.1)
        with lock_a:     # blocks forever
            print("worker 2 done")

t1 = threading.Thread(target=worker_1)
t2 = threading.Thread(target=worker_2)
t1.start()
t2.start()
t1.join()  # hangs forever
```

Run this and it hangs. No error message, no crash, no stack trace. Just silence. Both threads are alive but neither can make progress. That’s a deadlock.
The time.sleep(0.1) makes it nearly deterministic. Without it, you might
get lucky and have one thread finish before the other starts. But in
production, “getting lucky” isn’t a concurrency strategy.
why it happens
In 1971, Edward Coffman Jr. identified four conditions that must all hold simultaneously for deadlock to occur. Map them onto the code above:
- Mutual exclusion. Only one thread can hold a Lock at a time. That’s the entire point of a lock.
- Hold and wait. worker_1 holds lock_a while waiting for lock_b. It won’t release what it has.
- No preemption. Python won’t forcibly take a lock from a thread. Once acquired, a lock stays held until the thread explicitly releases it.
- Circular wait. worker_1 wants lock_b (held by worker_2). worker_2 wants lock_a (held by worker_1). A cycle.
All four must be true simultaneously. Break any one and deadlock is impossible. That gives you four potential strategies. In practice, two of them actually work well.
the fix: lock ordering
```python
def worker_1():
    with lock_a:      # always acquire a first
        with lock_b:
            print("worker 1 done")

def worker_2():
    with lock_a:      # same order as worker_1
        with lock_b:
            print("worker 2 done")
```

Both threads acquire lock_a first, then lock_b. No circular wait is
possible. If worker_2 can’t get lock_a, it waits without holding any
other lock. No hold-and-wait either. Two conditions broken at once.
This is the most common fix in practice because it’s simple to reason about and has no runtime overhead. Switch to “Resource Ordering” mode in the simulation to see this in action.
the fix: timeout
```python
def worker_1():
    lock_a.acquire()
    if not lock_b.acquire(timeout=1):
        lock_a.release()    # back off
        return worker_1()   # retry
    try:
        print("worker 1 done")
    finally:
        lock_b.release()
        lock_a.release()
```

If a thread can’t get the second lock within a timeout, it releases the first lock and retries. This breaks the “hold and wait” condition. The thread refuses to hold a lock indefinitely while waiting for another.
The trade-off: retries add latency, and under high contention you can get livelock (both threads repeatedly backing off and retrying without making progress). But it’s a solid fallback when lock ordering is impractical. Switch to “Timeout” mode in the simulation to watch the retry behavior.
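One way to reduce that livelock risk is randomized backoff: after releasing the first lock, sleep for a random jitter before retrying so that contending threads fall out of lockstep. A sketch under that idea (the helper name, timeout, and retry cap are illustrative, not a standard API):

```python
import random
import threading
import time

def run_with_backoff(first, second, work, max_tries=50):
    """Acquire `first` then `second`, and run `work` holding both.

    If the second lock can't be taken within the timeout, release
    the first and sleep a random jitter before retrying, so two
    contending threads stop backing off in perfect synchrony.
    """
    for _ in range(max_tries):
        first.acquire()
        if second.acquire(timeout=0.05):
            try:
                return work()
            finally:
                second.release()
                first.release()
        first.release()                      # back off
        time.sleep(random.uniform(0, 0.05))  # jitter breaks the symmetry
    raise RuntimeError("could not acquire both locks")
```

Even when two threads name the locks in opposite orders, the jitter makes one of them grab both locks first with overwhelming probability.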
the four Coffman conditions
Edward Coffman Jr. identified these in 1971. All four are necessary: a deadlock cannot occur unless every one of them holds, which is why breaking any single condition is enough to rule deadlock out.
Mutual exclusion: the resource can only be held by one thread. You can’t eliminate this for most locks. But read-write locks allow multiple readers, and lock-free data structures avoid mutual exclusion entirely.
Hold and wait: a thread holds one resource while waiting for another. Prevent this by requiring all-or-nothing acquisition. The downside: it reduces concurrency.
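All-or-nothing acquisition fits in a few lines. A sketch (the helper below is illustrative, not a standard API): try every lock without blocking, and on the first failure release everything taken so far.

```python
import threading

def acquire_all_or_nothing(locks):
    """Take every lock without blocking, or none of them.

    A thread either ends up holding all the locks or holding
    nothing, so it never holds one lock while waiting for another.
    """
    taken = []
    for lock in locks:
        if lock.acquire(blocking=False):
            taken.append(lock)
        else:
            for held in reversed(taken):  # undo the partial acquisition
                held.release()
            return False
    return True
```

The caller has to be prepared for False and retry later, which is exactly the lost concurrency the paragraph above mentions.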
No preemption: resources can’t be forcibly taken. Break this with
tryLock() and timeout. If acquisition fails, release and retry.
Circular wait: a cycle in the wait-for graph. Resource ordering prevents this. If every thread acquires locks in the same order, cycles are impossible.
lock ordering in practice
Lock ordering sounds simple. In practice, it’s hard to enforce across a large codebase.
- Number your locks and always acquire in ascending order.
- Use a lock hierarchy: “database lock” > “table lock” > “row lock”. Never acquire a lower-level lock while holding a higher-level one.
- Analysis tools catch some violations. Java’s FindBugs can flag inconsistent locking, and ThreadSanitizer reports lock-order inversions at runtime.
- contextlib.ExitStack in Python helps when acquiring a variable number of locks in a consistent order.
Python’s threading module doesn’t enforce ordering. It’s up to you.
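A sketch of ordered acquisition in Python, combining the numbering idea with contextlib.ExitStack (sorting by id() stands in for an explicit lock numbering; any fixed global key works, and the helper name is made up):

```python
import threading
from contextlib import ExitStack

def acquire_in_order(*locks):
    """Acquire any number of locks in one globally consistent order.

    Every thread sorts by the same key before acquiring, so no
    matter how callers list their locks, cycles are impossible.
    Returns an ExitStack that releases them (in reverse) on exit.
    """
    stack = ExitStack()
    for lock in sorted(locks, key=id):
        stack.enter_context(lock)
    return stack

lock_a = threading.Lock()
lock_b = threading.Lock()

def worker_1():
    with acquire_in_order(lock_a, lock_b):
        print("worker 1 done")

def worker_2():
    with acquire_in_order(lock_b, lock_a):  # reversed call order, same acquisition order
        print("worker 2 done")
```

Note that id() is only a stable key while the locks are alive; a long-lived registry of numbered locks is the more explicit version of the same idea.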
the fix: timeout
def worker_1():
lock_a.acquire()
if not lock_b.acquire(timeout=1):
lock_a.release() # back off
return worker_1() # retry
try:
print("worker 1 done")
finally:
lock_b.release()
lock_a.release()Breaks “hold and wait”. The trade-off: retries add latency, and under high contention you get livelock. Switch to “Timeout” mode in the simulation.
deadlock detection in databases
Databases face deadlock constantly. Two transactions that update the same rows in different orders can deadlock.
The approach: maintain a wait-for graph. Nodes are transactions. An edge from T1 to T2 means T1 is waiting for a lock held by T2. Run cycle detection periodically. When a cycle is found, kill the youngest transaction and let the other proceed.
Postgres checks after deadlock_timeout (default 1 second). It doesn’t
check immediately because most lock waits resolve quickly.
This is detection, not prevention. The database allows deadlocks and then resolves them. This works because transactions can be retried.
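The cycle check itself is small when each transaction waits on at most one lock holder. A sketch (the transaction names and graph shape are made up for illustration):

```python
def find_cycle(wait_for):
    """Return a list of transactions forming a cycle, or None.

    wait_for maps each transaction to the transaction it is waiting
    on (None if it isn't waiting), i.e. a wait-for graph in which
    every node has at most one outgoing edge. Follow the chain from
    each node; revisiting a node on the current walk means a cycle.
    """
    for start in wait_for:
        seen = []
        node = start
        while node is not None and node not in seen:
            seen.append(node)
            node = wait_for.get(node)
        if node is not None:              # walked back onto the chain: cycle
            return seen[seen.index(node):]
    return None

# T1 waits on T2 and T2 waits on T1: deadlock. T3 isn't waiting.
graph = {"T1": "T2", "T2": "T1", "T3": None}
print(find_cycle(graph))  # ['T1', 'T2']
```

A real lock manager's graph has multi-edge nodes and needs full depth-first search, but the shape of the check is the same: find a cycle, pick a victim in it, abort and retry.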
The simulation gives you the clean version. Locks, threads, forks on a table. In production, deadlocks rarely announce themselves so clearly. What follows are the cases where the problem gets more subtle.
priority inversion
In 1997, NASA’s Mars Pathfinder lander started rebooting itself on the surface of Mars. The software had three threads running at different priority levels: a high-priority bus management task, a medium-priority communications task, and a low-priority meteorological data collector.
The low-priority weather thread held a mutex protecting a shared data bus. The high-priority bus management thread needed that same mutex and blocked waiting for it. So far, normal. The low-priority thread would finish its work, release the mutex, and the high-priority thread would proceed.
Except the medium-priority communications thread preempted the low-priority thread. It had higher priority than the weather thread, so the scheduler ran it instead. The weather thread couldn’t finish. The mutex stayed held. The high-priority bus management thread stayed blocked. A watchdog timer fired and rebooted the entire system.
This wasn’t a deadlock in the classical sense. No circular wait. But it produced the same symptom: a critical thread couldn’t make progress.
The fix was priority inheritance. When a high-priority thread blocks on a mutex held by a low-priority thread, you temporarily boost the low-priority thread’s priority so it can finish and release the mutex. The scheduler can’t preempt it with medium-priority work anymore. JPL engineers patched the Pathfinder’s VxWorks config to enable priority inheritance on that mutex, and the reboots stopped.
The lesson extends beyond real-time systems. Whenever you have threads at different priority levels sharing resources, you need to think about whether a low-priority holder can be starved by medium-priority work. The resource dependency graph matters even when you’ve technically prevented circular waits.
beyond locks
Locks cause deadlocks because threads hold one resource while waiting for another. What if you didn’t use locks at all?
Lock-free algorithms replace locks with atomic compare-and-swap (CAS) operations. A CAS atomically reads a memory location, compares it to an expected value, and writes a new value only if the comparison succeeds. No lock is ever held, so no deadlock is possible.
```python
def atomic_increment(counter_ptr):
    # compare_and_swap is illustrative pseudocode: CPython doesn't
    # expose a raw CAS primitive, but C, Java, and Rust all do.
    while True:
        old = counter_ptr.value
        new = old + 1
        if compare_and_swap(counter_ptr, old, new):
            return new
        # someone else changed it, retry
```

The thread never blocks. If another thread modified the value between the read and the CAS, it retries. No waiting, no holding, no deadlock.
But lock-free programming is notoriously difficult to get right. You have to reason about every possible interleaving of concurrent operations. The ABA problem (a value changes from A to B and back to A, making your CAS succeed when it shouldn’t) is just one of many pitfalls. Memory ordering on modern CPUs adds another layer of complexity. Most engineers should use locks with proper ordering rather than attempting lock-free code.
There’s a more radical approach: don’t share state at all. The actor model
gives each thread (or “actor”) its own private state and has them
communicate exclusively through message passing. If threads don’t share
mutable state, they can’t deadlock on shared resources. Erlang built an
entire telecom infrastructure on this idea. Python’s multiprocessing
module with Queue objects gives you a basic version of it.
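A minimal sketch of the idea, using threads and queue.Queue rather than processes to keep it self-contained (the actor and its messages are illustrative): the counter is private to the actor, and the only way to touch it is to send a message.

```python
import queue
import threading

def counter_actor(inbox, outbox):
    """An actor that owns a private counter.

    No locks anywhere: the actor processes one message at a time,
    so its state is never touched by two threads concurrently.
    """
    count = 0
    while True:
        msg = inbox.get()
        if msg == "increment":
            count += 1
        elif msg == "stop":
            outbox.put(count)  # reply with the final state, then exit
            return

inbox, outbox = queue.Queue(), queue.Queue()
actor = threading.Thread(target=counter_actor, args=(inbox, outbox))
actor.start()

for _ in range(3):
    inbox.put("increment")
inbox.put("stop")
actor.join()
result = outbox.get()
print(result)  # 3
```

Swapping queue.Queue for multiprocessing.Queue and the thread for a Process gives the shared-nothing version the paragraph above describes.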
The deadlock problem is fundamentally about shared mutable state. Locks are one way to manage it, but they bring the risk of circular dependencies. Lock ordering manages that risk. Lock-free algorithms eliminate it. The actor model sidesteps it entirely by removing shared state from the equation. Each approach trades one kind of complexity for another.