The Thread Pool
Creating a new thread per task is expensive. Each thread costs ~1MB of
stack memory and ~1ms of OS overhead to create. A thread pool reuses a
fixed set of worker threads. Tasks go into a queue, workers pull from it.
When the queue fills up, you need a strategy: block, reject, or grow the
pool. Python’s concurrent.futures.ThreadPoolExecutor handles the pooling and
queueing for you; the overflow strategy is yours to choose.
the cost of threads
Every time you create a thread, the OS allocates a stack (typically 1MB), sets up kernel data structures, and registers the thread with the scheduler. On Linux, this takes roughly 1ms. That sounds small until you’re handling 10,000 requests per second. Now you’re spending 10 seconds per second just on thread creation. You’re underwater.
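Don’t take the numbers on faith. A rough check on your own machine, just timing a batch of do-nothing threads (results vary with OS and load, and the total includes start, run, and join, so it slightly overstates pure creation cost):

import threading
import time

def noop():
    pass

start = time.perf_counter()
threads = [threading.Thread(target=noop) for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"{elapsed / len(threads) * 1000:.3f} ms per thread")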
Creation isn’t the only cost. Every thread competes for CPU time. The OS context-switches between them, saving and restoring registers, flushing caches. With hundreds of threads, the CPU spends more time switching than working. Here’s the naive approach:
import threading
import time

def handle_request(request_id):
    time.sleep(0.1)  # simulate IO
    return f"done-{request_id}"

threads = []
for i in range(1000):
    t = threading.Thread(target=handle_request, args=(i,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

This creates 1,000 threads, each with its own stack. It works for small numbers. At scale, you run out of memory or hit the OS thread limit.
ThreadPoolExecutor
A thread pool creates the threads once and reuses them. Tasks go into a queue. Workers pull tasks off, execute them, and go back to waiting:
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def handle_request(request_id):
    time.sleep(0.1)  # simulate IO
    return f"done-{request_id}"

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(handle_request, i) for i in range(1000)]
    for future in as_completed(futures):
        result = future.result()

Ten threads handle 1,000 tasks. submit() returns a Future. Call
.result() to get the return value (blocks until done). as_completed()
yields futures as they finish, so you process results in completion order.
For uniform work, map() is simpler:
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(handle_request, range(1000)))

map() preserves input order. Use as_completed() when order doesn’t
matter and you want results as fast as possible.
what happens when the queue fills
Python’s ThreadPoolExecutor uses an unbounded queue; there is no built-in way
to cap it. If your producer submits faster than workers can process, the queue
grows without limit until the process crashes. For controlled backpressure,
bound it yourself:
from concurrent.futures import ThreadPoolExecutor
import threading

class BoundedPool:
    def __init__(self, max_workers, queue_size):
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.semaphore = threading.Semaphore(queue_size + max_workers)

    def submit(self, fn, *args, **kwargs):
        self.semaphore.acquire()  # blocks if too many pending
        future = self.pool.submit(fn, *args, **kwargs)
        future.add_done_callback(lambda _: self.semaphore.release())
        return future

    def shutdown(self, **kwargs):
        self.pool.shutdown(**kwargs)

The semaphore limits in-flight tasks. When the limit is reached,
submit() blocks until a worker finishes. Other strategies: reject tasks
with an exception, discard the oldest queued task, or have the caller
execute the task itself. Java’s ThreadPoolExecutor has named policies
for these (AbortPolicy, CallerRunsPolicy, DiscardOldestPolicy). Python
leaves it to you.
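If you’d rather reject than block, the same semaphore works with a non-blocking acquire. A sketch building on the BoundedPool above (the class name and exception are made up for illustration, not a stdlib feature):

class RejectingPool(BoundedPool):
    def submit(self, fn, *args, **kwargs):
        # fail fast instead of waiting for capacity
        if not self.semaphore.acquire(blocking=False):
            raise RuntimeError("pool saturated, task rejected")
        future = self.pool.submit(fn, *args, **kwargs)
        future.add_done_callback(lambda _: self.semaphore.release())
        return future

The caller decides what rejection means: retry with backoff, log and drop, or return an error upstream.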
Try the simulation above to see how tasks flow through the pool.
the cost of threads
Each OS thread costs ~1MB of stack memory and ~1ms to create. Context switching burns CPU on register saves and cache flushes. Thread-per-task breaks at scale. A thread pool creates workers once and reuses them.
ThreadPoolExecutor
submit() queues a task and returns a Future. as_completed() yields
futures in completion order. map() preserves input order. The with
block calls shutdown(wait=True) automatically.
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(handle_request, i) for i in range(1000)]
    for future in as_completed(futures):
        result = future.result()

sizing your pool
CPU-bound tasks: set workers equal to core count. (With the GIL,
use ProcessPoolExecutor instead for true parallelism.)
import os

max_workers = os.cpu_count()

IO-bound tasks: threads spend most time waiting, so you can have many more than cores. The formula:

N = cores * (1 + wait_time / compute_time)

A task spending 90ms on IO and 10ms computing on 4 cores gives N = 4 * (1 + 90/10) = 40 threads. Profile and adjust from there.
Little’s Law offers another angle: to sustain L requests/second at W seconds each, you need L * W workers minimum. 100 req/s at 200ms = 20 workers.
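The same arithmetic in code, using the numbers from the two examples above (nothing here is measured; it is just the formulas written out):

cores = 4
wait_ms, compute_ms = 90, 10
n_io_bound = cores * (1 + wait_ms / compute_ms)   # 4 * (1 + 9) = 40 threads

req_per_sec = 100    # L in Little's Law
secs_per_req = 0.2   # W
n_littles_law = req_per_sec * secs_per_req        # 20 workers minimum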
ThreadPoolExecutor’s default max_workers is min(32, os.cpu_count() + 4).
Reasonable but not tuned. Measure before guessing.
what happens when the queue fills up
Python’s unbounded queue means every submit() succeeds, even if
workers are buried. Memory grows until the process dies. Strategies:
Block the caller. A semaphore limits in-flight tasks. The producer slows to match the consumer.
Reject the task. Raise an exception. The caller retries, logs, or
drops. Java calls this AbortPolicy.
Caller runs it. The submitting thread executes the task itself,
naturally throttling submission. Java: CallerRunsPolicy.
Discard the oldest. Drop the oldest queued task for the new one.
Good for real-time systems where stale data is worse than missing
data. Java: DiscardOldestPolicy.
Pick based on your failure mode. If losing tasks is unacceptable, block or reject. If freshness matters more, discard.
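Caller-runs is a few lines on top of the BoundedPool from earlier (again a sketch, not a stdlib policy): when there is no capacity, run the task on the submitting thread and hand back an already-completed Future.

from concurrent.futures import Future

class CallerRunsPool(BoundedPool):
    def submit(self, fn, *args, **kwargs):
        if not self.semaphore.acquire(blocking=False):
            # no capacity: execute inline, which throttles the producer
            # to the pool's pace
            future = Future()
            try:
                future.set_result(fn(*args, **kwargs))
            except Exception as exc:
                future.set_exception(exc)
            return future
        future = self.pool.submit(fn, *args, **kwargs)
        future.add_done_callback(lambda _: self.semaphore.release())
        return future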
shutdown gracefully
shutdown(wait=True) stops accepting new tasks and waits for running
and queued tasks to finish. In Python 3.9+, cancel_futures=True
cancels queued (but not running) tasks for faster draining.
For SIGTERM (containers, Kubernetes):
import signal

def handle_sigterm(signum, frame):
    pool.shutdown(wait=False, cancel_futures=True)

signal.signal(signal.SIGTERM, handle_sigterm)

The hardest part is stuck tasks. A thread blocked on socket.recv()
won’t respond to shutdown. Set timeouts on blocking calls, and design
tasks to check a cancellation flag:
shutdown_event = threading.Event()

def task():
    while not shutdown_event.is_set():
        chunk = process_next_chunk()
        if chunk is None:
            break

The simulation shows the standard model: a fixed pool pulling from a shared queue. One queue, N workers, FIFO. What follows is what happens when that isn’t enough.
work-stealing
A single shared queue creates contention. Every worker locks the queue to dequeue. With 64 workers, the lock becomes a bottleneck.
Work-stealing gives each worker its own queue. When a worker’s queue empties, it steals from another worker’s tail. Stealing is rare; most of the time workers consume locally with no contention.
Go’s goroutine scheduler uses work-stealing. Each OS thread has a local
run queue. When empty, it steals from other threads or falls back to a
global queue. Java’s ForkJoinPool uses a similar design with per-worker
deques: workers push and pop from the head (LIFO for cache locality),
stealers take from the tail (FIFO for load balance).
Python’s ThreadPoolExecutor doesn’t do work-stealing. For IO-bound
Python, the GIL is a bigger serialization point than the task queue.
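Still, the mechanism fits in a page. A toy sketch, purely illustrative (it uses a coarse lock per deque where real schedulers use lock-free deques, and busy-waits where they park idle workers): each worker pops from its own queue and steals from a random victim when it runs dry.

import collections
import random
import threading
import time

class StealingWorkers:
    def __init__(self, num_workers):
        self.queues = [collections.deque() for _ in range(num_workers)]
        self.locks = [threading.Lock() for _ in range(num_workers)]
        for i in range(num_workers):
            threading.Thread(target=self._run, args=(i,), daemon=True).start()

    def submit(self, worker_id, fn):
        with self.locks[worker_id]:
            self.queues[worker_id].append(fn)

    def _run(self, my_id):
        while True:
            fn = self._pop_local(my_id) or self._steal(my_id)
            if fn is not None:
                fn()
            else:
                time.sleep(0.001)  # idle backoff; a real scheduler parks the worker

    def _pop_local(self, my_id):
        with self.locks[my_id]:
            q = self.queues[my_id]
            return q.pop() if q else None       # owner pops LIFO: newest task, cache-warm

    def _steal(self, my_id):
        victim = random.randrange(len(self.queues))
        if victim == my_id:
            return None
        with self.locks[victim]:
            q = self.queues[victim]
            return q.popleft() if q else None   # thief steals FIFO: oldest task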
M:N threading
Thread pools use OS threads. The M:N model maps M user-level threads onto N OS threads, where M is much larger than N.
Go creates millions of goroutines multiplexed onto GOMAXPROCS OS
threads. Goroutines start at ~2KB of stack (vs ~1MB for OS threads) and
grow dynamically. When a goroutine blocks on IO, the scheduler parks it
and runs another on the same OS thread. No kernel context switch.
Erlang’s BEAM VM does something similar with lightweight processes: own heaps, message-passing communication, preemptive scheduling at the VM level. Millions of processes on a handful of scheduler threads.
Python uses OS threads directly. One Python thread equals one kernel thread with a full-sized stack. This is why Python pools are sized in the tens, not thousands.
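CPython exposes that stack reservation through threading.stack_size(); you can inspect it and, within platform limits, shrink it for threads created afterwards. A quick check:

import threading

print(threading.stack_size())  # 0 means the platform default is used

# threads created after this call get a smaller stack;
# most platforms require 0 or at least 32 KiB
threading.stack_size(256 * 1024)

t = threading.Thread(target=lambda: None)
t.start()
t.join()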
virtual threads
Java 21 (Project Loom) adds virtual threads: user-level threads managed by the JVM, scheduled onto a small pool of carrier OS threads. You can create millions of them.
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (int i = 0; i < 1_000_000; i++) {
        executor.submit(() -> {
            Thread.sleep(Duration.ofSeconds(1));
            return "done";
        });
    }
}

One million virtual threads sleeping for a second. With OS threads that would need ~1TB of stack memory. Virtual threads start at a few hundred bytes. When one blocks on IO, the JVM unmounts it and mounts another on the same carrier. The carrier never blocks.
This flips the thread pool argument. If threads are cheap, you don’t need a pool. The thread-per-task model we dismissed at the start becomes viable again. The costs that made pools necessary (stack allocation, kernel registration, context switching) disappear when the runtime handles scheduling. You still need locks for shared mutable state, but the resource management problem goes away.
why Python’s thread pool is simpler
Java’s ThreadPoolExecutor has core size, max size, keep-alive time,
custom queues, and four rejection policies. Python has max_workers and
an unbounded queue. That’s it.
This is usually enough. Python thread pools handle IO-bound concurrency:
HTTP requests, database queries, file reads. The GIL prevents CPU-bound
parallelism, so the scheduling sophistication Java and Go need doesn’t
apply. When you outgrow a basic pool, reach for asyncio (high-concurrency
IO), multiprocessing (CPU parallelism), or Celery (distributed tasks).
The thread pool sits in the middle: good enough for moderate concurrency,
simple enough to reason about, and limited enough that you know when
you’ve outgrown it.