Deadlock at the Dinner Table
A deadlock happens when threads form a cycle of resource dependencies. Thread A holds lock 1 and wants lock 2. Thread B holds lock 2 and wants lock 1. Neither can proceed. Four conditions must all be true for deadlock: mutual exclusion, hold-and-wait, no preemption, and circular wait. Break any one of them and deadlock becomes impossible. The simplest fix: always acquire locks in the same order.
the bug
Here are two threads that will never finish:
```python
import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def worker_1():
    with lock_a:
        time.sleep(0.1)  # simulate some work
        with lock_b:     # blocks forever
            print("worker 1 done")

def worker_2():
    with lock_b:
        time.sleep(0.1)
        with lock_a:     # blocks forever
            print("worker 2 done")

t1 = threading.Thread(target=worker_1)
t2 = threading.Thread(target=worker_2)
t1.start()
t2.start()
t1.join()  # hangs forever
```

Run this and it hangs. No error message, no crash, no stack trace. Just silence. Both threads are alive but neither can make progress. That’s a deadlock.
The time.sleep(0.1) makes it nearly deterministic. Without it, you might
get lucky and have one thread finish before the other starts. But in
production, “getting lucky” isn’t a concurrency strategy.
why it happens
In 1971, Edward Coffman Jr. identified four conditions that must all hold simultaneously for deadlock to occur. Map them onto the code above:
- Mutual exclusion. Only one thread can hold a Lock at a time. That’s the entire point of a lock.
- Hold and wait. worker_1 holds lock_a while waiting for lock_b. It won’t release what it has.
- No preemption. Python won’t forcibly take a lock from a thread. Once acquired, a lock stays held until the thread explicitly releases it.
- Circular wait. worker_1 wants lock_b (held by worker_2). worker_2 wants lock_a (held by worker_1). A cycle.
All four must be true simultaneously. Break any one and deadlock is impossible. That gives you four potential strategies. In practice, two of them actually work well.
the fix: lock ordering
```python
def worker_1():
    with lock_a:      # always acquire a first
        with lock_b:
            print("worker 1 done")

def worker_2():
    with lock_a:      # same order as worker_1
        with lock_b:
            print("worker 2 done")
```

Both threads acquire lock_a first, then lock_b. No circular wait is
possible. If worker_2 can’t get lock_a, it waits without holding any
other lock. No hold-and-wait either. Two conditions broken at once.
This is the most common fix in practice because it’s simple to reason about and has no runtime overhead. Switch to “Resource Ordering” mode in the simulation to see this in action.
the fix: timeout
```python
def worker_1():
    lock_a.acquire()
    if not lock_b.acquire(timeout=1):
        lock_a.release()    # back off
        return worker_1()   # retry
    try:
        print("worker 1 done")
    finally:
        lock_b.release()
        lock_a.release()
```

If a thread can’t get the second lock within a timeout, it releases the first lock and retries. This breaks the “hold and wait” condition. The thread refuses to hold a lock indefinitely while waiting for another.
The trade-off: retries add latency, and under high contention you can get livelock (both threads repeatedly backing off and retrying without making progress). But it’s a solid fallback when lock ordering is impractical. Switch to “Timeout” mode in the simulation to watch the retry behavior.
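One way to reduce that livelock risk is randomized backoff: after releasing the first lock, sleep for a random jitter before retrying so that contending threads fall out of lockstep. A sketch under that idea (the helper name, timeout, and retry cap are illustrative, not a standard API):

```python
import random
import threading
import time

def run_with_backoff(first, second, work, max_tries=50):
    """Acquire `first` then `second`, and run `work` holding both.

    If the second lock can't be taken within the timeout, release
    the first and sleep a random jitter before retrying, so two
    contending threads stop backing off in perfect synchrony.
    """
    for _ in range(max_tries):
        first.acquire()
        if second.acquire(timeout=0.05):
            try:
                return work()
            finally:
                second.release()
                first.release()
        first.release()                      # back off
        time.sleep(random.uniform(0, 0.05))  # jitter breaks the symmetry
    raise RuntimeError("could not acquire both locks")
```

Even when two threads name the locks in opposite orders, the jitter makes one of them grab both locks first with overwhelming probability.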
the four Coffman conditions
Edward Coffman Jr. identified these in 1971. All four are necessary: a deadlock cannot occur unless every one of them holds, which is why breaking any single condition is enough to rule deadlock out.
Mutual exclusion: the resource can only be held by one thread. You can’t eliminate this for most locks. But read-write locks allow multiple readers, and lock-free data structures avoid mutual exclusion entirely.
Hold and wait: a thread holds one resource while waiting for another. Prevent this by requiring all-or-nothing acquisition. The downside: it reduces concurrency.
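All-or-nothing acquisition fits in a few lines. A sketch (the helper below is illustrative, not a standard API): try every lock without blocking, and on the first failure release everything taken so far.

```python
import threading

def acquire_all_or_nothing(locks):
    """Take every lock without blocking, or none of them.

    A thread either ends up holding all the locks or holding
    nothing, so it never holds one lock while waiting for another.
    """
    taken = []
    for lock in locks:
        if lock.acquire(blocking=False):
            taken.append(lock)
        else:
            for held in reversed(taken):  # undo the partial acquisition
                held.release()
            return False
    return True
```

The caller has to be prepared for False and retry later, which is exactly the lost concurrency the paragraph above mentions.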
No preemption: resources can’t be forcibly taken. Break this with
tryLock() and timeout. If acquisition fails, release and retry.
Circular wait: a cycle in the wait-for graph. Resource ordering prevents this. If every thread acquires locks in the same order, cycles are impossible.
lock ordering in practice
Lock ordering sounds simple. In practice, it’s hard to enforce across a large codebase.
- Number your locks and always acquire in ascending order.
- Use a lock hierarchy: “database lock” > “table lock” > “row lock”. Never acquire a lower-level lock while holding a higher-level one.
- Analysis tools catch some violations. Java’s FindBugs can flag inconsistent locking, and ThreadSanitizer reports lock-order inversions at runtime.
- contextlib.ExitStack in Python helps when acquiring a variable number of locks in a consistent order.
Python’s threading module doesn’t enforce ordering. It’s up to you.
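A sketch of ordered acquisition in Python, combining the numbering idea with contextlib.ExitStack (sorting by id() stands in for an explicit lock numbering; any fixed global key works, and the helper name is made up):

```python
import threading
from contextlib import ExitStack

def acquire_in_order(*locks):
    """Acquire any number of locks in one globally consistent order.

    Every thread sorts by the same key before acquiring, so no
    matter how callers list their locks, cycles are impossible.
    Returns an ExitStack that releases them (in reverse) on exit.
    """
    stack = ExitStack()
    for lock in sorted(locks, key=id):
        stack.enter_context(lock)
    return stack

lock_a = threading.Lock()
lock_b = threading.Lock()

def worker_1():
    with acquire_in_order(lock_a, lock_b):
        print("worker 1 done")

def worker_2():
    with acquire_in_order(lock_b, lock_a):  # reversed call order, same acquisition order
        print("worker 2 done")
```

Note that id() is only a stable key while the locks are alive; a long-lived registry of numbered locks is the more explicit version of the same idea.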
the fix: timeout
def worker_1():
lock_a.acquire()
if not lock_b.acquire(timeout=1):
lock_a.release() # back off
return worker_1() # retry
try:
print("worker 1 done")
finally:
lock_b.release()
lock_a.release()Breaks “hold and wait”. The trade-off: retries add latency, and under high contention you get livelock. Switch to “Timeout” mode in the simulation.
deadlock detection in databases
Databases face deadlock constantly. Two transactions that update the same rows in different orders can deadlock.
The approach: maintain a wait-for graph. Nodes are transactions. An edge from T1 to T2 means T1 is waiting for a lock held by T2. Run cycle detection periodically. When a cycle is found, kill the youngest transaction and let the other proceed.
Postgres checks after deadlock_timeout (default 1 second). It doesn’t
check immediately because most lock waits resolve quickly.
This is detection, not prevention. The database allows deadlocks and then resolves them. This works because transactions can be retried.
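The cycle check itself is small when each transaction waits on at most one lock holder. A sketch (the transaction names and graph shape are made up for illustration):

```python
def find_cycle(wait_for):
    """Return a list of transactions forming a cycle, or None.

    wait_for maps each transaction to the transaction it is waiting
    on (None if it isn't waiting), i.e. a wait-for graph in which
    every node has at most one outgoing edge. Follow the chain from
    each node; revisiting a node on the current walk means a cycle.
    """
    for start in wait_for:
        seen = []
        node = start
        while node is not None and node not in seen:
            seen.append(node)
            node = wait_for.get(node)
        if node is not None:              # walked back onto the chain: cycle
            return seen[seen.index(node):]
    return None

# T1 waits on T2 and T2 waits on T1: deadlock. T3 isn't waiting.
graph = {"T1": "T2", "T2": "T1", "T3": None}
print(find_cycle(graph))  # ['T1', 'T2']
```

A real lock manager's graph has multi-edge nodes and needs full depth-first search, but the shape of the check is the same: find a cycle, pick a victim in it, abort and retry.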
The simulation gives you the clean version. Locks, threads, forks on a table. In production, deadlocks rarely announce themselves so clearly. What follows are the cases where the problem gets more subtle.
priority inversion
In 1997, NASA’s Mars Pathfinder lander started rebooting itself on the surface of Mars. The software had three threads running at different priority levels: a high-priority bus management task, a medium-priority communications task, and a low-priority meteorological data collector.
The low-priority weather thread held a mutex protecting a shared data bus. The high-priority bus management thread needed that same mutex and blocked waiting for it. So far, normal. The low-priority thread would finish its work, release the mutex, and the high-priority thread would proceed.
Except the medium-priority communications thread preempted the low-priority thread. It had higher priority than the weather thread, so the scheduler ran it instead. The weather thread couldn’t finish. The mutex stayed held. The high-priority bus management thread stayed blocked. A watchdog timer fired and rebooted the entire system.
This wasn’t a deadlock in the classical sense. No circular wait. But it produced the same symptom: a critical thread couldn’t make progress.
The fix was priority inheritance. When a high-priority thread blocks on a mutex held by a low-priority thread, you temporarily boost the low-priority thread’s priority so it can finish and release the mutex. The scheduler can’t preempt it with medium-priority work anymore. JPL engineers patched the Pathfinder’s VxWorks config to enable priority inheritance on that mutex, and the reboots stopped.
The lesson extends beyond real-time systems. Whenever you have threads at different priority levels sharing resources, you need to think about whether a low-priority holder can be starved by medium-priority work. The resource dependency graph matters even when you’ve technically prevented circular waits.
beyond locks
Locks cause deadlocks because threads hold one resource while waiting for another. What if you didn’t use locks at all?
Lock-free algorithms replace locks with atomic compare-and-swap (CAS) operations. A CAS atomically reads a memory location, compares it to an expected value, and writes a new value only if the comparison succeeds. No lock is ever held, so no deadlock is possible.
```python
def atomic_increment(counter_ptr):
    # compare_and_swap is illustrative pseudocode: CPython doesn't
    # expose a raw CAS primitive, but C, Java, and Rust all do.
    while True:
        old = counter_ptr.value
        new = old + 1
        if compare_and_swap(counter_ptr, old, new):
            return new
        # someone else changed it, retry
```

The thread never blocks. If another thread modified the value between the read and the CAS, it retries. No waiting, no holding, no deadlock.
But lock-free programming is notoriously difficult to get right. You have to reason about every possible interleaving of concurrent operations. The ABA problem (a value changes from A to B and back to A, making your CAS succeed when it shouldn’t) is just one of many pitfalls. Memory ordering on modern CPUs adds another layer of complexity. Most engineers should use locks with proper ordering rather than attempting lock-free code.
There’s a more radical approach: don’t share state at all. The actor model
gives each thread (or “actor”) its own private state and has them
communicate exclusively through message passing. If threads don’t share
mutable state, they can’t deadlock on shared resources. Erlang built an
entire telecom infrastructure on this idea. Python’s multiprocessing
module with Queue objects gives you a basic version of it.
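A minimal sketch of the idea, using threads and queue.Queue rather than processes to keep it self-contained (the actor and its messages are illustrative): the counter is private to the actor, and the only way to touch it is to send a message.

```python
import queue
import threading

def counter_actor(inbox, outbox):
    """An actor that owns a private counter.

    No locks anywhere: the actor processes one message at a time,
    so its state is never touched by two threads concurrently.
    """
    count = 0
    while True:
        msg = inbox.get()
        if msg == "increment":
            count += 1
        elif msg == "stop":
            outbox.put(count)  # reply with the final state, then exit
            return

inbox, outbox = queue.Queue(), queue.Queue()
actor = threading.Thread(target=counter_actor, args=(inbox, outbox))
actor.start()

for _ in range(3):
    inbox.put("increment")
inbox.put("stop")
actor.join()
result = outbox.get()
print(result)  # 3
```

Swapping queue.Queue for multiprocessing.Queue and the thread for a Process gives the shared-nothing version the paragraph above describes.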
The deadlock problem is fundamentally about shared mutable state. Locks are one way to manage it, but they bring the risk of circular dependencies. Lock ordering manages that risk. Lock-free algorithms eliminate it. The actor model sidesteps it entirely by removing shared state from the equation. Each approach trades one kind of complexity for another.