The Bounded Buffer
Producers create data. Consumers process it. A bounded buffer sits between
them with a fixed capacity. When the buffer is full, producers wait. When
it’s empty, consumers wait. Condition variables let threads sleep until the
buffer state changes instead of burning CPU cycles polling. The
queue.Queue in Python’s standard library does all of this for you.
the pattern
Imagine a web scraper feeding URLs into a pool of workers. The scraper discovers pages faster than the workers can process them. Without a buffer, the scraper has to wait for a worker to finish before handing off the next URL. Producer and consumer are coupled, forced to run in lockstep.
A bounded buffer decouples them. The producer drops items into the buffer and keeps going. The consumer pulls items out when it’s ready. The buffer absorbs bursts from the producer and smooths out pauses from the consumer. But it has a fixed capacity, so a runaway producer can’t consume infinite memory.
from queue import Queue
import threading

buffer = Queue(maxsize=5)

def producer(name):
    for i in range(10):
        item = f"{name}-item-{i}"
        buffer.put(item)  # blocks if the buffer is full
        print(f"{name} produced {item}")

def consumer(name):
    while True:
        item = buffer.get()  # blocks if the buffer is empty
        print(f"{name} consumed {item}")
        buffer.task_done()

# Start two producers and two consumers
producers = [threading.Thread(target=producer, args=(f"P{i+1}",)) for i in range(2)]
for t in producers:
    t.start()
for i in range(2):
    threading.Thread(target=consumer, args=(f"C{i+1}",), daemon=True).start()

for t in producers:
    t.join()       # make sure the producers have put everything
buffer.join()      # then wait until every item has been processed

Queue.put() blocks when the buffer is full. Queue.get() blocks when
it’s empty. No busy-waiting. No race conditions. The threads sleep until
there’s work to do, and the OS wakes them up when the buffer state changes.
building it yourself
queue.Queue is the right choice for production code. But understanding
what it does internally matters. Here’s a bounded buffer built with
threading.Condition:
import threading

class BoundedBuffer:
    def __init__(self, capacity):
        self.buffer = []
        self.capacity = capacity
        self.lock = threading.Lock()
        # Both conditions wrap the same lock, so put() and get() never overlap
        self.not_full = threading.Condition(self.lock)
        self.not_empty = threading.Condition(self.lock)

    def put(self, item):
        with self.not_full:
            while len(self.buffer) >= self.capacity:
                self.not_full.wait()      # sleep until a consumer frees a slot
            self.buffer.append(item)
            self.not_empty.notify()       # wake one waiting consumer

    def get(self):
        with self.not_empty:
            while len(self.buffer) == 0:
                self.not_empty.wait()     # sleep until a producer adds an item
            item = self.buffer.pop(0)
            self.not_full.notify()        # wake one waiting producer
            return item

Two condition variables share a single lock. not_full is for producers
waiting for space. not_empty is for consumers waiting for items. When a
producer adds an item, it signals not_empty to wake a waiting consumer.
When a consumer removes an item, it signals not_full to wake a waiting
producer.
Notice the while loop around the wait() call. You might think a plain if would work. It won’t, for two reasons. First, spurious wakeups: the OS can
wake a thread even without a corresponding notify(). The POSIX spec
explicitly allows this. Second, with multiple producers or consumers,
another thread might grab the item between the notify() and the wakeup.
Thread A gets notified, but before it reacquires the lock, Thread B swoops
in and takes the last item. Thread A wakes up to an empty buffer. The
while loop re-checks the condition and goes back to sleep if needed. It
is not optional.
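Here’s a quick usage sketch, assuming the BoundedBuffer class above is in scope (the thread setup and item count are illustrative, not part of the original example):

buf = BoundedBuffer(capacity=3)

def drain():
    for _ in range(5):
        print("got", buf.get())   # blocks until an item is available

t = threading.Thread(target=drain)
t.start()
for i in range(5):
    buf.put(i)                    # blocks whenever three items are outstanding
t.join()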
semaphores
Condition variables require you to write the predicate check yourself. Semaphores bake the counting into the primitive:
import threading

class BoundedBufferSemaphore:
    def __init__(self, capacity):
        self.buffer = []
        self.mutex = threading.Lock()
        self.empty_slots = threading.Semaphore(capacity)  # slots still available
        self.full_slots = threading.Semaphore(0)          # items ready to consume

    def put(self, item):
        self.empty_slots.acquire()       # wait for an empty slot
        with self.mutex:
            self.buffer.append(item)
        self.full_slots.release()        # signal one more full slot

    def get(self):
        self.full_slots.acquire()        # wait for a full slot
        with self.mutex:
            item = self.buffer.pop(0)
        self.empty_slots.release()       # signal one more empty slot
        return item

empty_slots starts at capacity. Every put() decrements it. When it
hits zero, the next producer blocks. full_slots starts at zero. Every
put() increments it. When a consumer calls acquire() on full_slots
and it’s zero, the consumer blocks. The counting replaces the while loop
condition check entirely. No spurious wakeup issue because
semaphore.acquire() handles the counting atomically.
Notice that the semaphore acquires happen outside the mutex. This is
deliberate. If you put empty_slots.acquire() inside with self.mutex,
a producer holding the mutex could block on a full buffer, preventing any
consumer from acquiring the mutex to remove an item. Deadlock.
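To make the deadlock concrete, here’s a sketch of what the wrong version of put() would look like (hypothetical code, not in the class above):

# DEADLOCK-PRONE: acquiring the slot while holding the mutex
def put_bad(self, item):
    with self.mutex:
        self.empty_slots.acquire()   # may block while the mutex is held...
        self.buffer.append(item)     # ...so get() can never run to free a slot
    self.full_slots.release()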
+ why not just use a list?
A Python list is not thread-safe for concurrent access. list.append()
and list.pop() are individually atomic under the GIL, but checking
len(buffer) and then calling append() is two separate operations.
Between the check and the append, another thread could have filled the
buffer past capacity.
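A minimal sketch of that race, with hypothetical buffer and capacity names:

# UNSAFE: check-then-act on a plain list shared between threads
if len(buffer) < capacity:    # thread A and thread B can both pass this check...
    buffer.append(item)       # ...then both append, overshooting the capacity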
Even worse, list.pop(0) on a large list is O(n) because it shifts all
elements. collections.deque is O(1) for both ends and its append()
and popleft() are atomic under the GIL. But relying on GIL atomicity
is fragile. It’s an implementation detail of CPython, not a language
guarantee.
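For reference, a short deque sketch (this shows only the data structure, not the locking you would still need):

from collections import deque

buffer = deque()
buffer.append("item-0")     # O(1) at the right end
buffer.append("item-1")
first = buffer.popleft()    # O(1) at the left end, unlike list.pop(0)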
Use queue.Queue. It handles all the synchronization internally, uses a
deque underneath, and provides a clean API. Don’t reinvent it unless
you’re learning how it works (which is what this post is for).
+ semaphores vs condition variables
Both solve the same problem. The choice depends on what you’re coordinating.
Condition variables are for predicate-based waiting. “Wait until this
condition is true.” They’re flexible: the condition can be anything
(buffer not full, temperature above threshold, user logged in). But you
must write the predicate check yourself, and you must use while, not
if.
Semaphores are for counting-based waiting. “Wait until a resource is
available.” They track a count internally. acquire() decrements and
blocks if zero. release() increments and wakes a waiter. The counting
is built in, so there’s less room for bugs.
Rule of thumb: if you’re counting resources (buffer slots, connection pool slots, permits), use a semaphore. If you’re waiting on an arbitrary condition, use a condition variable.
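As an illustration of the counting case, a semaphore used as a pool of permits (the pool size and task setup here are made up for the example):

import threading
import time

permits = threading.Semaphore(3)    # at most three tasks run at once

def worker(task_id):
    with permits:                    # acquire a permit; released on exit
        time.sleep(0.1)              # stand-in for real work
        print(f"task {task_id} done")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()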
+ backpressure in real systems
The bounded buffer is the simplest form of backpressure: when the buffer is full, the producer slows down. This pattern appears everywhere.
Kafka uses consumer lag as backpressure. If consumers fall behind, messages pile up in the topic. Producers aren’t directly blocked, but the system has a retention limit. Eventually old messages are deleted whether or not they were consumed.
TCP has a receive window. The receiver advertises how much buffer space it has. When the window fills up, the sender stops transmitting. This is backpressure at the transport layer.
Go channels with a fixed buffer size are bounded buffers. Writing to a full channel blocks the goroutine. This makes backpressure automatic in Go pipelines.
The alternative to backpressure is dropping. If the buffer is full, discard the new item (or the oldest item). This is what happens with UDP, logging systems, and metrics collection under load. Whether you block or drop depends on whether data loss is acceptable.
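With Python’s queue.Queue, the two policies are one call apart. A sketch (the buffer size and function names are illustrative):

import queue

buffer = queue.Queue(maxsize=100)

def submit_blocking(item):
    buffer.put(item)              # backpressure: block until a slot frees up

def submit_or_drop(item):
    try:
        buffer.put_nowait(item)   # drop policy: never block the producer
    except queue.Full:
        pass                      # discard the new item under load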
Everything so far is the textbook bounded buffer. Producers fill slots, consumers drain them, and everything blocks politely at the boundaries. What follows is what happens when polite isn’t fast enough.
the mechanical sympathy argument
In 2011, LMAX (a financial exchange in London) published the Disruptor, a lock-free ring buffer that processes 6 million transactions per second on a single thread. Not a cluster. Not a distributed system. One thread, one core, one ring buffer.
The Disruptor works because it’s designed around how CPUs actually work, not how programming languages abstract them. Three key decisions make it fast.
First, sequential memory access. The ring buffer is a contiguous array. Producers and consumers walk through it in order, one slot at a time. This is exactly the access pattern that CPU cache prefetchers are optimized for. The hardware predicts that you’ll access element N+1 after element N and loads it into L1 cache before you ask for it. A linked-list-based queue scatters nodes across the heap, defeating prefetch at every step.
Second, cache line padding. On most modern CPUs, a cache line is 64 bytes. When one core writes to a variable, the entire 64-byte cache line is invalidated on all other cores (the MESI protocol). If the producer’s write pointer and the consumer’s read pointer happen to sit on the same cache line, every write by the producer forces the consumer’s core to reload that line from main memory. This is called false sharing, and it destroys performance. The Disruptor pads its sequence counters to ensure they live on separate cache lines.
Third, memory barriers instead of locks. The Disruptor uses CPU-level memory fence instructions to enforce ordering between reads and writes. These are cheaper than OS-level locks by orders of magnitude. A contended lock involves a system call, a context switch to kernel mode, and potentially parking the thread on a wait queue. A memory barrier is a single CPU instruction that prevents the processor from reordering stores and loads across it.
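To make the first point concrete, here’s a sketch of the slot-indexing idea only (plain Python, nothing like the Disruptor’s lock-free coordination):

capacity = 8                      # a power of two, so modulo becomes a bit mask
ring = [None] * capacity

def slot(sequence):
    return sequence & (capacity - 1)   # sequences 0, 1, 2, ... walk the array in order

ring[slot(0)] = "event-0"
ring[slot(9)] = "event-9"         # wraps around to index 1 after one full lap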
The bounded buffer in this post uses locks. That’s the right choice for
the vast majority of applications. Python’s queue.Queue can handle
thousands of items per second, which is more than enough for web scrapers,
task queues, and data pipelines. But when you need to process millions of
events per second with microsecond latency, understanding the hardware
matters. The Disruptor’s numbers demonstrate that a ring buffer with
careful memory layout can outperform a lock-based queue by 10-100x.
This connects to a broader principle: the choice of synchronization mechanism is not just about correctness. It’s about understanding the performance characteristics of the hardware your code runs on. Locks involve context switches (expensive). Condition variables involve system calls (expensive). CAS (compare-and-swap) operations stay in user space (cheaper). Sequential memory access hits L1 cache (cheapest). Knowing which level you need is what separates adequate concurrent code from excellent concurrent code.
The bounded buffer is where most engineers should start. The Disruptor is where you go when the bounded buffer becomes the bottleneck. The important thing is knowing the path exists.