
The Thread Pool

[interactive simulation: a four-worker thread pool you can step through, watching tasks queue, run, and complete]

the cost of threads

Every time you create a thread, the OS allocates a stack (often 1-8MB of reserved address space, depending on the platform), sets up kernel data structures, and registers the thread with the scheduler. On Linux this costs somewhere between tens of microseconds and a millisecond. That sounds small until you’re handling 10,000 requests per second: at 1ms per thread, that’s 10 seconds of work per second just on thread creation. You’re underwater.
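If you want to see the number on your own machine, here is a rough benchmark. It measures the full start/join cycle, so it overstates pure creation cost, and results vary widely by OS and load:

```python
import threading
import time

def noop():
    pass

n = 200
start = time.perf_counter()
for _ in range(n):
    t = threading.Thread(target=noop)
    t.start()
    t.join()
per_thread_us = (time.perf_counter() - start) / n * 1e6
print(f"~{per_thread_us:.0f} µs per thread (create + start + join)")
```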

Creation isn’t the only cost. Every thread competes for CPU time. The OS context-switches between them, saving and restoring registers and trashing CPU caches along the way. With hundreds of threads, the CPU spends more time switching than working. Here’s the naive approach:

import threading
import time

def handle_request(request_id):
    time.sleep(0.1)  # simulate IO
    return f"done-{request_id}"

threads = []
for i in range(1000):
    t = threading.Thread(target=handle_request, args=(i,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

This creates 1,000 threads, each with its own stack. It works for small numbers. At scale, you run out of memory or hit the OS thread limit.
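You can check the per-user ceiling yourself on Unix systems. resource.RLIMIT_NPROC is the per-user process cap, and on Linux threads count against it too; the resource module and this constant are not available on every platform:

```python
import resource

# RLIMIT_NPROC caps processes per user; on Linux, threads count
# against it as well. -1 means unlimited.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"per-user process/thread limit: soft={soft}, hard={hard}")
```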

ThreadPoolExecutor

A thread pool creates the threads once and reuses them. Tasks go into a queue. Workers pull tasks off, execute them, and go back to waiting:

import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def handle_request(request_id):
    time.sleep(0.1)  # simulate IO
    return f"done-{request_id}"

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(handle_request, i) for i in range(1000)]
    for future in as_completed(futures):
        result = future.result()

Ten threads handle 1,000 tasks. submit() returns a Future. Call .result() to get the return value (blocks until done). as_completed() yields futures as they finish, so you process results in completion order.
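What about exceptions? If a task raises, the exception is captured, stored on the Future, and re-raised when you call .result(). A small sketch, where flaky is a made-up stand-in for a task that sometimes fails:

```python
from concurrent.futures import ThreadPoolExecutor

def flaky(n):
    if n == 3:
        raise ValueError("bad input")
    return n * 2

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(flaky, i) for i in range(5)]
    for f in futures:
        try:
            print(f.result())
        except ValueError as e:
            print(f"task failed: {e}")  # the worker's exception surfaces here
```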

For uniform work, map() is simpler:

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(handle_request, range(1000)))

map() preserves input order. Use as_completed() when order doesn’t matter and you want results as fast as possible.
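A quick way to convince yourself of the ordering guarantee: give each task a random delay so they finish out of order, and check that map() still returns results in input order (work is an illustrative stand-in):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def work(i):
    time.sleep(random.uniform(0, 0.05))  # tasks finish in a scrambled order
    return i

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(work, range(20)))

print(results == list(range(20)))  # always True: input order is preserved
```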

what happens when the queue fills

Python’s ThreadPoolExecutor uses an unbounded internal queue, and the constructor offers no way to cap it. If your producer submits faster than workers can process, the queue grows without limit until the process runs out of memory. For controlled backpressure, bound it yourself:

from concurrent.futures import ThreadPoolExecutor
import threading

class BoundedPool:
    def __init__(self, max_workers, queue_size):
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        # Capacity = running tasks (max_workers) + waiting tasks (queue_size).
        self.semaphore = threading.Semaphore(queue_size + max_workers)

    def submit(self, fn, *args, **kwargs):
        self.semaphore.acquire()  # blocks if too many pending
        future = self.pool.submit(fn, *args, **kwargs)
        future.add_done_callback(lambda _: self.semaphore.release())
        return future

    def shutdown(self, **kwargs):
        self.pool.shutdown(**kwargs)

The semaphore limits in-flight tasks. When the limit is reached, submit() blocks until a worker finishes. Other strategies: reject tasks with an exception, discard the oldest queued task, or have the caller execute the task itself. Java’s ThreadPoolExecutor has named policies for these (AbortPolicy, CallerRunsPolicy, DiscardOldestPolicy). Python leaves it to you.
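The reject-with-an-exception strategy (Java’s AbortPolicy) needs only a one-line change to the BoundedPool idea: acquire the semaphore without blocking and raise on failure. A sketch, with RejectingPool as a hypothetical name:

```python
import threading
from concurrent.futures import ThreadPoolExecutor, Future

class RejectingPool:
    def __init__(self, max_workers, queue_size):
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.semaphore = threading.Semaphore(queue_size + max_workers)

    def submit(self, fn, *args, **kwargs) -> Future:
        # Non-blocking acquire: fail fast instead of waiting for capacity.
        if not self.semaphore.acquire(blocking=False):
            raise RuntimeError("pool saturated, task rejected")
        future = self.pool.submit(fn, *args, **kwargs)
        future.add_done_callback(lambda _: self.semaphore.release())
        return future

    def shutdown(self, **kwargs):
        self.pool.shutdown(**kwargs)
```

The caller then decides what a rejection means: retry with backoff, shed the load, or return an error upstream.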

Try the simulation above to see how tasks flow through the pool.