The Thread Pool
Creating a new thread per task is expensive. Each thread costs ~1MB of
stack memory and ~1ms of OS overhead to create. A thread pool reuses a
fixed set of worker threads. Tasks go into a queue, workers pull from it.
When the queue fills up, you need a strategy: block, reject, or grow the
pool. Python’s concurrent.futures.ThreadPoolExecutor handles the pooling and
queueing for you; the overflow strategy is yours to choose.
the cost of threads
Every time you create a thread, the OS allocates a stack (typically 1MB), sets up kernel data structures, and registers the thread with the scheduler. On Linux, this takes roughly 1ms. That sounds small until you’re handling 10,000 requests per second. Now you’re spending 10 seconds per second just on thread creation. You’re underwater.
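Don’t take the numbers on faith. A rough check on your own machine, just timing a batch of do-nothing threads (results vary with OS and load, and the total includes start, run, and join, so it slightly overstates pure creation cost):

import threading
import time

def noop():
    pass

start = time.perf_counter()
threads = [threading.Thread(target=noop) for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"{elapsed / len(threads) * 1000:.3f} ms per thread")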
Creation isn’t the only cost. Every thread competes for CPU time. The OS context-switches between them, saving and restoring registers, flushing caches. With hundreds of threads, the CPU spends more time switching than working. Here’s the naive approach:
import threading
import time

def handle_request(request_id):
    time.sleep(0.1)  # simulate IO
    return f"done-{request_id}"

threads = []
for i in range(1000):
    t = threading.Thread(target=handle_request, args=(i,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

This creates 1,000 threads, each with its own stack. It works for small numbers. At scale, you run out of memory or hit the OS thread limit.
ThreadPoolExecutor
A thread pool creates the threads once and reuses them. Tasks go into a queue. Workers pull tasks off, execute them, and go back to waiting:
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def handle_request(request_id):
    time.sleep(0.1)  # simulate IO
    return f"done-{request_id}"

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(handle_request, i) for i in range(1000)]
    for future in as_completed(futures):
        result = future.result()

Ten threads handle 1,000 tasks. submit() returns a Future. Call
.result() to get the return value (blocks until done). as_completed()
yields futures as they finish, so you process results in completion order.
For uniform work, map() is simpler:
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(handle_request, range(1000)))

map() preserves input order. Use as_completed() when order doesn’t
matter and you want results as fast as possible.
what happens when the queue fills
Python’s ThreadPoolExecutor uses an unbounded queue; there is no built-in way
to cap it. If your producer submits faster than workers can process, the queue
grows without limit until the process crashes. For controlled backpressure,
bound it yourself:
from concurrent.futures import ThreadPoolExecutor
import threading

class BoundedPool:
    def __init__(self, max_workers, queue_size):
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.semaphore = threading.Semaphore(queue_size + max_workers)

    def submit(self, fn, *args, **kwargs):
        self.semaphore.acquire()  # blocks if too many pending
        future = self.pool.submit(fn, *args, **kwargs)
        future.add_done_callback(lambda _: self.semaphore.release())
        return future

    def shutdown(self, **kwargs):
        self.pool.shutdown(**kwargs)

The semaphore limits in-flight tasks. When the limit is reached,
submit() blocks until a worker finishes. Other strategies: reject tasks
with an exception, discard the oldest queued task, or have the caller
execute the task itself. Java’s ThreadPoolExecutor has named policies
for these (AbortPolicy, CallerRunsPolicy, DiscardOldestPolicy). Python
leaves it to you.
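If you’d rather reject than block, the same semaphore works with a non-blocking acquire. A sketch building on the BoundedPool above (the class name and exception are made up for illustration, not a stdlib feature):

class RejectingPool(BoundedPool):
    def submit(self, fn, *args, **kwargs):
        # fail fast instead of waiting for capacity
        if not self.semaphore.acquire(blocking=False):
            raise RuntimeError("pool saturated, task rejected")
        future = self.pool.submit(fn, *args, **kwargs)
        future.add_done_callback(lambda _: self.semaphore.release())
        return future

The caller decides what rejection means: retry with backoff, log and drop, or return an error upstream.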
Try the simulation above to see how tasks flow through the pool.
the cost of threads
Each OS thread costs ~1MB of stack memory and ~1ms to create. Context switching burns CPU on register saves and cache flushes. Thread-per-task breaks at scale. A thread pool creates workers once and reuses them.
ThreadPoolExecutor
submit() queues a task and returns a Future. as_completed() yields
futures in completion order. map() preserves input order. The with
block calls shutdown(wait=True) automatically.
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(handle_request, i) for i in range(1000)]
    for future in as_completed(futures):
        result = future.result()

sizing your pool
CPU-bound tasks: set workers equal to core count. (With the GIL,
use ProcessPoolExecutor instead for true parallelism.)
import os

max_workers = os.cpu_count()

IO-bound tasks: threads spend most time waiting, so you can have many more than cores. The formula:

N = cores * (1 + wait_time / compute_time)

A task spending 90ms on IO and 10ms computing on 4 cores gives N = 4 * (1 + 90/10) = 40 threads. Profile and adjust from there.
Little’s Law offers another angle: to sustain L requests/second at W seconds each, you need L * W workers minimum. 100 req/s at 200ms = 20 workers.
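The same arithmetic in code, using the numbers from the two examples above (nothing here is measured; it is just the formulas written out):

cores = 4
wait_ms, compute_ms = 90, 10
n_io_bound = cores * (1 + wait_ms / compute_ms)   # 4 * (1 + 9) = 40 threads

req_per_sec = 100    # L in Little's Law
secs_per_req = 0.2   # W
n_littles_law = req_per_sec * secs_per_req        # 20 workers minimum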
ThreadPoolExecutor’s default max_workers is min(32, os.cpu_count() + 4).
Reasonable but not tuned. Measure before guessing.
what happens when the queue fills up
Python’s unbounded queue means every submit() succeeds, even if
workers are buried. Memory grows until the process dies. Strategies:
Block the caller. A semaphore limits in-flight tasks. The producer slows to match the consumer.
Reject the task. Raise an exception. The caller retries, logs, or
drops. Java calls this AbortPolicy.
Caller runs it. The submitting thread executes the task itself,
naturally throttling submission. Java: CallerRunsPolicy.
Discard the oldest. Drop the oldest queued task for the new one.
Good for real-time systems where stale data is worse than missing
data. Java: DiscardOldestPolicy.
Pick based on your failure mode. If losing tasks is unacceptable, block or reject. If freshness matters more, discard.
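Caller-runs is a few lines on top of the BoundedPool from earlier (again a sketch, not a stdlib policy): when there is no capacity, run the task on the submitting thread and hand back an already-completed Future.

from concurrent.futures import Future

class CallerRunsPool(BoundedPool):
    def submit(self, fn, *args, **kwargs):
        if not self.semaphore.acquire(blocking=False):
            # no capacity: execute inline, which throttles the producer
            # to the pool's pace
            future = Future()
            try:
                future.set_result(fn(*args, **kwargs))
            except Exception as exc:
                future.set_exception(exc)
            return future
        future = self.pool.submit(fn, *args, **kwargs)
        future.add_done_callback(lambda _: self.semaphore.release())
        return future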
shutdown gracefully
shutdown(wait=True) stops accepting new tasks and waits for running
and queued tasks to finish. In Python 3.9+, cancel_futures=True
cancels queued (but not running) tasks for faster draining.
For SIGTERM (containers, Kubernetes):
import signal

def handle_sigterm(signum, frame):
    pool.shutdown(wait=False, cancel_futures=True)

signal.signal(signal.SIGTERM, handle_sigterm)

The hardest part is stuck tasks. A thread blocked on socket.recv()
won’t respond to shutdown. Set timeouts on blocking calls, and design
tasks to check a cancellation flag:
shutdown_event = threading.Event()

def task():
    while not shutdown_event.is_set():
        chunk = process_next_chunk()
        if chunk is None:
            break

The simulation shows the standard model: a fixed pool pulling from a shared queue. One queue, N workers, FIFO. What follows is what happens when that isn’t enough.
work-stealing
A single shared queue creates contention. Every worker locks the queue to dequeue. With 64 workers, the lock becomes a bottleneck.
Work-stealing gives each worker its own queue. When a worker’s queue empties, it steals from another worker’s tail. Stealing is rare; most of the time workers consume locally with no contention.
Go’s goroutine scheduler uses work-stealing. Each OS thread has a local
run queue. When empty, it steals from other threads or falls back to a
global queue. Java’s ForkJoinPool uses a similar design with per-worker
deques: workers push and pop from the head (LIFO for cache locality),
stealers take from the tail (FIFO for load balance).
Python’s ThreadPoolExecutor doesn’t do work-stealing. For IO-bound
Python, the GIL is a bigger serialization point than the task queue.
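Still, the mechanism fits in a page. A toy sketch, purely illustrative (it uses a coarse lock per deque where real schedulers use lock-free deques, and busy-waits where they park idle workers): each worker pops from its own queue and steals from a random victim when it runs dry.

import collections
import random
import threading
import time

class StealingWorkers:
    def __init__(self, num_workers):
        self.queues = [collections.deque() for _ in range(num_workers)]
        self.locks = [threading.Lock() for _ in range(num_workers)]
        for i in range(num_workers):
            threading.Thread(target=self._run, args=(i,), daemon=True).start()

    def submit(self, worker_id, fn):
        with self.locks[worker_id]:
            self.queues[worker_id].append(fn)

    def _run(self, my_id):
        while True:
            fn = self._pop_local(my_id) or self._steal(my_id)
            if fn is not None:
                fn()
            else:
                time.sleep(0.001)  # idle backoff; a real scheduler parks the worker

    def _pop_local(self, my_id):
        with self.locks[my_id]:
            q = self.queues[my_id]
            return q.pop() if q else None       # owner pops LIFO: newest task, cache-warm

    def _steal(self, my_id):
        victim = random.randrange(len(self.queues))
        if victim == my_id:
            return None
        with self.locks[victim]:
            q = self.queues[victim]
            return q.popleft() if q else None   # thief steals FIFO: oldest task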
M:N threading
Thread pools use OS threads. The M:N model maps M user-level threads onto N OS threads, where M is much larger than N.
Go creates millions of goroutines multiplexed onto GOMAXPROCS OS
threads. Goroutines start at ~2KB of stack (vs ~1MB for OS threads) and
grow dynamically. When a goroutine blocks on IO, the scheduler parks it
and runs another on the same OS thread. No kernel context switch.
Erlang’s BEAM VM does something similar with lightweight processes: own heaps, message-passing communication, preemptive scheduling at the VM level. Millions of processes on a handful of scheduler threads.
Python uses OS threads directly. One Python thread equals one kernel thread with a full-sized stack. This is why Python pools are sized in the tens, not thousands.
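CPython exposes that stack reservation through threading.stack_size(); you can inspect it and, within platform limits, shrink it for threads created afterwards. A quick check:

import threading

print(threading.stack_size())  # 0 means the platform default is used

# threads created after this call get a smaller stack;
# most platforms require 0 or at least 32 KiB
threading.stack_size(256 * 1024)

t = threading.Thread(target=lambda: None)
t.start()
t.join()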
virtual threads
Java 21 (Project Loom) adds virtual threads: user-level threads managed by the JVM, scheduled onto a small pool of carrier OS threads. You can create millions of them.
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (int i = 0; i < 1_000_000; i++) {
        executor.submit(() -> {
            Thread.sleep(Duration.ofSeconds(1));
            return "done";
        });
    }
}

One million virtual threads sleeping for a second. With OS threads that would need ~1TB of stack memory. Virtual threads start at a few hundred bytes. When one blocks on IO, the JVM unmounts it and mounts another on the same carrier. The carrier never blocks.
This flips the thread pool argument. If threads are cheap, you don’t need a pool. The thread-per-task model we dismissed at the start becomes viable again. The costs that made pools necessary (stack allocation, kernel registration, context switching) disappear when the runtime handles scheduling. You still need locks for shared mutable state, but the resource management problem goes away.
why Python’s thread pool is simpler
Java’s ThreadPoolExecutor has core size, max size, keep-alive time,
custom queues, and four rejection policies. Python has max_workers and
an unbounded queue. That’s it.
This is usually enough. Python thread pools handle IO-bound concurrency:
HTTP requests, database queries, file reads. The GIL prevents CPU-bound
parallelism, so the scheduling sophistication Java and Go need doesn’t
apply. When you outgrow a basic pool, reach for asyncio (high-concurrency
IO), multiprocessing (CPU parallelism), or Celery (distributed tasks).
The thread pool sits in the middle: good enough for moderate concurrency,
simple enough to reason about, and limited enough that you know when
you’ve outgrown it.