The Bounded Buffer
Producers create data. Consumers process it. A bounded buffer sits between
them with a fixed capacity. When the buffer is full, producers wait. When
it’s empty, consumers wait. Condition variables let threads sleep until the
buffer state changes instead of burning CPU cycles polling. The
queue.Queue in Python’s standard library does all of this for you.
the pattern
Imagine a web scraper feeding URLs into a pool of workers. The scraper discovers pages faster than the workers can process them. Without a buffer, the scraper has to wait for a worker to finish before handing off the next URL. Producer and consumer are coupled, forced to run in lockstep.
A bounded buffer decouples them. The producer drops items into the buffer and keeps going. The consumer pulls items out when it’s ready. The buffer absorbs bursts from the producer and smooths out pauses from the consumer. But it has a fixed capacity, so a runaway producer can’t consume infinite memory.
from queue import Queue
import threading

buffer = Queue(maxsize=5)

def producer(name):
    for i in range(10):
        item = f"{name}-item-{i}"
        buffer.put(item)  # blocks if the buffer is full
        print(f"{name} produced {item}")

def consumer(name):
    while True:
        item = buffer.get()  # blocks if the buffer is empty
        print(f"{name} consumed {item}")
        buffer.task_done()

# Start two producers and two consumers
producers = [threading.Thread(target=producer, args=(f"P{i+1}",)) for i in range(2)]
for t in producers:
    t.start()
for i in range(2):
    threading.Thread(target=consumer, args=(f"C{i+1}",), daemon=True).start()

for t in producers:
    t.join()       # make sure the producers have put everything
buffer.join()      # then wait until every item has been processed

Queue.put() blocks when the buffer is full. Queue.get() blocks when
it’s empty. No busy-waiting. No race conditions. The threads sleep until
there’s work to do, and the OS wakes them up when the buffer state changes.
building it yourself
queue.Queue is the right choice for production code. But understanding
what it does internally matters. Here’s a bounded buffer built with
threading.Condition:
import threading

class BoundedBuffer:
    def __init__(self, capacity):
        self.buffer = []
        self.capacity = capacity
        self.lock = threading.Lock()
        # Both conditions wrap the same lock, so put() and get() never overlap
        self.not_full = threading.Condition(self.lock)
        self.not_empty = threading.Condition(self.lock)

    def put(self, item):
        with self.not_full:
            while len(self.buffer) >= self.capacity:
                self.not_full.wait()      # sleep until a consumer frees a slot
            self.buffer.append(item)
            self.not_empty.notify()       # wake one waiting consumer

    def get(self):
        with self.not_empty:
            while len(self.buffer) == 0:
                self.not_empty.wait()     # sleep until a producer adds an item
            item = self.buffer.pop(0)
            self.not_full.notify()        # wake one waiting producer
            return item

Two condition variables share a single lock. not_full is for producers
waiting for space. not_empty is for consumers waiting for items. When a
producer adds an item, it signals not_empty to wake a waiting consumer.
When a consumer removes an item, it signals not_full to wake a waiting
producer.
Notice the while loop around the wait() call. You might think a plain if would work. It won’t, for two reasons. First, spurious wakeups: the OS can
wake a thread even without a corresponding notify(). The POSIX spec
explicitly allows this. Second, with multiple producers or consumers,
another thread might grab the item between the notify() and the wakeup.
Thread A gets notified, but before it reacquires the lock, Thread B swoops
in and takes the last item. Thread A wakes up to an empty buffer. The
while loop re-checks the condition and goes back to sleep if needed. It
is not optional.
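Here’s a quick usage sketch, assuming the BoundedBuffer class above is in scope (the thread setup and item count are illustrative, not part of the original example):

buf = BoundedBuffer(capacity=3)

def drain():
    for _ in range(5):
        print("got", buf.get())   # blocks until an item is available

t = threading.Thread(target=drain)
t.start()
for i in range(5):
    buf.put(i)                    # blocks whenever three items are outstanding
t.join()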
semaphores
Condition variables require you to write the predicate check yourself. Semaphores bake the counting into the primitive:
import threading

class BoundedBufferSemaphore:
    def __init__(self, capacity):
        self.buffer = []
        self.mutex = threading.Lock()
        self.empty_slots = threading.Semaphore(capacity)  # slots still available
        self.full_slots = threading.Semaphore(0)          # items ready to consume

    def put(self, item):
        self.empty_slots.acquire()       # wait for an empty slot
        with self.mutex:
            self.buffer.append(item)
        self.full_slots.release()        # signal one more full slot

    def get(self):
        self.full_slots.acquire()        # wait for a full slot
        with self.mutex:
            item = self.buffer.pop(0)
        self.empty_slots.release()       # signal one more empty slot
        return item

empty_slots starts at capacity. Every put() decrements it. When it
hits zero, the next producer blocks. full_slots starts at zero. Every
put() increments it. When a consumer calls acquire() on full_slots
and it’s zero, the consumer blocks. The counting replaces the while loop
condition check entirely. No spurious wakeup issue because
semaphore.acquire() handles the counting atomically.
Notice that the semaphore acquires happen outside the mutex. This is
deliberate. If you put empty_slots.acquire() inside with self.mutex,
a producer holding the mutex could block on a full buffer, preventing any
consumer from acquiring the mutex to remove an item. Deadlock.
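To make the deadlock concrete, here’s a sketch of what the wrong version of put() would look like (hypothetical code, not in the class above):

# DEADLOCK-PRONE: acquiring the slot while holding the mutex
def put_bad(self, item):
    with self.mutex:
        self.empty_slots.acquire()   # may block while the mutex is held...
        self.buffer.append(item)     # ...so get() can never run to free a slot
    self.full_slots.release()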
+ why not just use a list?
A Python list is not thread-safe for concurrent access. list.append()
and list.pop() are individually atomic under the GIL, but checking
len(buffer) and then calling append() is two separate operations.
Between the check and the append, another thread could have filled the
buffer past capacity.
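A minimal sketch of that race, with hypothetical buffer and capacity names:

# UNSAFE: check-then-act on a plain list shared between threads
if len(buffer) < capacity:    # thread A and thread B can both pass this check...
    buffer.append(item)       # ...then both append, overshooting the capacity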
Even worse, list.pop(0) on a large list is O(n) because it shifts all
elements. collections.deque is O(1) for both ends and its append()
and popleft() are atomic under the GIL. But relying on GIL atomicity
is fragile. It’s an implementation detail of CPython, not a language
guarantee.
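For reference, a short deque sketch (this shows only the data structure, not the locking you would still need):

from collections import deque

buffer = deque()
buffer.append("item-0")     # O(1) at the right end
buffer.append("item-1")
first = buffer.popleft()    # O(1) at the left end, unlike list.pop(0)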
Use queue.Queue. It handles all the synchronization internally, uses a
deque underneath, and provides a clean API. Don’t reinvent it unless
you’re learning how it works (which is what this post is for).
+ semaphores vs condition variables
Both solve the same problem. The choice depends on what you’re coordinating.
Condition variables are for predicate-based waiting. “Wait until this
condition is true.” They’re flexible: the condition can be anything
(buffer not full, temperature above threshold, user logged in). But you
must write the predicate check yourself, and you must use while, not
if.
Semaphores are for counting-based waiting. “Wait until a resource is
available.” They track a count internally. acquire() decrements and
blocks if zero. release() increments and wakes a waiter. The counting
is built in, so there’s less room for bugs.
Rule of thumb: if you’re counting resources (buffer slots, connection pool slots, permits), use a semaphore. If you’re waiting on an arbitrary condition, use a condition variable.
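As an illustration of the counting case, a semaphore used as a pool of permits (the pool size and task setup here are made up for the example):

import threading
import time

permits = threading.Semaphore(3)    # at most three tasks run at once

def worker(task_id):
    with permits:                    # acquire a permit; released on exit
        time.sleep(0.1)              # stand-in for real work
        print(f"task {task_id} done")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()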
+ backpressure in real systems
The bounded buffer is the simplest form of backpressure: when the buffer is full, the producer slows down. This pattern appears everywhere.
Kafka uses consumer lag as backpressure. If consumers fall behind, messages pile up in the topic. Producers aren’t directly blocked, but the system has a retention limit. Eventually old messages are deleted whether or not they were consumed.
TCP has a receive window. The receiver advertises how much buffer space it has. When the window fills up, the sender stops transmitting. This is backpressure at the transport layer.
Go channels with a fixed buffer size are bounded buffers. Writing to a full channel blocks the goroutine. This makes backpressure automatic in Go pipelines.
The alternative to backpressure is dropping. If the buffer is full, discard the new item (or the oldest item). This is what happens with UDP, logging systems, and metrics collection under load. Whether you block or drop depends on whether data loss is acceptable.
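With Python’s queue.Queue, the two policies are one call apart. A sketch (the buffer size and function names are illustrative):

import queue

buffer = queue.Queue(maxsize=100)

def submit_blocking(item):
    buffer.put(item)              # backpressure: block until a slot frees up

def submit_or_drop(item):
    try:
        buffer.put_nowait(item)   # drop policy: never block the producer
    except queue.Full:
        pass                      # discard the new item under load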
Everything so far is the textbook bounded buffer. Producers fill slots, consumers drain them, and everything blocks politely at the boundaries. What follows is what happens when polite isn’t fast enough.
the mechanical sympathy argument
In 2011, LMAX (a financial exchange in London) published the Disruptor, a lock-free ring buffer that processes 6 million transactions per second on a single thread. Not a cluster. Not a distributed system. One thread, one core, one ring buffer.
The Disruptor works because it’s designed around how CPUs actually work, not how programming languages abstract them. Three key decisions make it fast.
First, sequential memory access. The ring buffer is a contiguous array. Producers and consumers walk through it in order, one slot at a time. This is exactly the access pattern that CPU cache prefetchers are optimized for. The hardware predicts that you’ll access element N+1 after element N and loads it into L1 cache before you ask for it. A linked-list-based queue scatters nodes across the heap, defeating prefetch at every step.
Second, cache line padding. On most modern CPUs, a cache line is 64 bytes. When one core writes to a variable, the entire 64-byte cache line is invalidated on all other cores (the MESI protocol). If the producer’s write pointer and the consumer’s read pointer happen to sit on the same cache line, every write by the producer forces the consumer’s core to reload that line from main memory. This is called false sharing, and it destroys performance. The Disruptor pads its sequence counters to ensure they live on separate cache lines.
Third, memory barriers instead of locks. The Disruptor uses CPU-level memory fence instructions to enforce ordering between reads and writes. These are cheaper than OS-level locks by orders of magnitude. A contended lock involves a system call, a context switch to kernel mode, and potentially parking the thread on a wait queue. A memory barrier is a single CPU instruction that prevents the processor from reordering stores and loads across it.
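To make the first point concrete, here’s a sketch of the slot-indexing idea only (plain Python, nothing like the Disruptor’s lock-free coordination):

capacity = 8                      # a power of two, so modulo becomes a bit mask
ring = [None] * capacity

def slot(sequence):
    return sequence & (capacity - 1)   # sequences 0, 1, 2, ... walk the array in order

ring[slot(0)] = "event-0"
ring[slot(9)] = "event-9"         # wraps around to index 1 after one full lap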
The bounded buffer in this post uses locks. That’s the right choice for
the vast majority of applications. Python’s queue.Queue can handle
thousands of items per second, which is more than enough for web scrapers,
task queues, and data pipelines. But when you need to process millions of
events per second with microsecond latency, understanding the hardware
matters. The Disruptor’s numbers demonstrate that a ring buffer with
careful memory layout can outperform a lock-based queue by 10-100x.
This connects to a broader principle: the choice of synchronization mechanism is not just about correctness. It’s about understanding the performance characteristics of the hardware your code runs on. Locks involve context switches (expensive). Condition variables involve system calls (expensive). CAS (compare-and-swap) operations stay in user space (cheaper). Sequential memory access hits L1 cache (cheapest). Knowing which level you need is what separates adequate concurrent code from excellent concurrent code.
The bounded buffer is where most engineers should start. The Disruptor is where you go when the bounded buffer becomes the bottleneck. The important thing is knowing the path exists.