The Throttle

the problem

Every service has a breaking point. A database can handle 10,000 queries per second. An API server can process 5,000 requests per second. A payment processor can handle 500 transactions per second. Beyond that, latency spikes, errors increase, and the service degrades or crashes.

Rate limiting puts a ceiling on request volume. When a client exceeds its quota, the service rejects the excess with HTTP 429 (Too Many Requests) rather than trying to process everything and failing for everyone.

Without rate limiting, a single misbehaving client (or a DDoS attack) can take down the entire service. Rate limiting is the immune system of distributed services.

token bucket

The most widely used algorithm. A bucket holds tokens, up to a maximum capacity. Tokens refill at a fixed rate (for example, 10 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected.

// refill first: tokens = min(capacity, tokens + rate * elapsed)
if tokens >= 1:
  tokens -= 1
  accept request
else:
  reject request (429)

The key property: token buckets allow bursts. If the bucket is full (say, 50 tokens), a client can send 50 requests instantly. After the burst, the client is limited to the refill rate. This makes token buckets well-suited for APIs where occasional bursts are normal but sustained high rates are not.

Token buckets are simple to implement, cheap to evaluate (one comparison and decrement), and naturally handle bursty traffic patterns. AWS API Gateway, Stripe, and most cloud rate limiters use token buckets.
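
Here is a minimal single-process token bucket in Python as a concrete reference. It uses lazy refill (tokens are credited on each check rather than by a timer); the class and method names are illustrative, and a real deployment would add locking and shared state.

import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate               # tokens refilled per second
        self.capacity = capacity       # maximum burst size
        self.tokens = capacity         # start full
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # lazy refill: credit tokens for the time since the last check
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True        # accept request
        return False           # reject request (429)

With TokenBucket(rate=10, capacity=50), a client with a full bucket can burst 50 requests at once, then settles to 10 per second, matching the behavior described above.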

sliding window

Count the number of requests in a rolling time window. If the count exceeds the limit, reject new requests until old ones fall outside the window.

window = requests in the last 60 seconds
if count(window) < limit:
  accept request
else:
  reject request (429)

Unlike token buckets, sliding windows provide smooth rate enforcement. There is no burst allowance. If the limit is 100 requests per minute, the 101st request in any 60-second window is rejected.

The implementation trades precision against memory. A true sliding window log tracks every request timestamp, which is exact but grows with traffic. A sliding window counter approximates it by weighting the current and previous fixed windows. The approximation is good enough for most use cases and uses constant memory.
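
A sketch of the sliding window counter variant, assuming a single process (the names are illustrative): it keeps only two counters and weights the previous fixed window by how much of it still overlaps the rolling window.

import time

class SlidingWindowCounter:
    def __init__(self, limit, window=60.0):
        self.limit = limit
        self.window = window
        self.curr_start = time.monotonic()
        self.curr_count = 0
        self.prev_count = 0

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.curr_start
        if elapsed >= self.window:
            # roll fixed windows forward; a gap of two or more
            # windows means the previous count is stale
            self.prev_count = self.curr_count if elapsed < 2 * self.window else 0
            self.curr_count = 0
            self.curr_start += self.window * (elapsed // self.window)
            elapsed = now - self.curr_start
        # weight the previous window by its remaining overlap
        estimated = self.prev_count * (1.0 - elapsed / self.window) + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False           # reject request (429)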

leaky bucket

The leaky bucket is the dual of the token bucket. Requests enter a queue (the bucket). The queue drains at a constant rate. If the queue is full, new requests are dropped.

if queue is not full:
  add request to queue
else:
  drop request (429)

// separately, drain queue at constant rate
process one request from queue

The key property: output rate is always constant, regardless of input burstiness. A burst of 50 requests does not cause 50 simultaneous backend hits. Instead, they queue up and drain one at a time. The leaky bucket smooths traffic.

Leaky buckets are useful when the downstream service cannot handle bursts. A payment processor that handles exactly 100 transactions per second benefits from a leaky bucket that absorbs bursts and feeds transactions at a steady rate.
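
A minimal sketch of the queue-and-drain behavior, assuming a single process: requests land in a bounded deque and a background thread drains one per interval. The worker callback and in-process thread are illustrative; production systems usually drain from a shared queue.

import collections
import threading
import time

class LeakyBucket:
    def __init__(self, drain_rate, capacity, worker):
        self.queue = collections.deque()
        self.capacity = capacity
        self.interval = 1.0 / drain_rate   # seconds between drains
        self.worker = worker               # callback that handles one request
        self.lock = threading.Lock()
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, request):
        with self.lock:
            if len(self.queue) >= self.capacity:
                return False               # bucket full: drop request (429)
            self.queue.append(request)
            return True

    def _drain(self):
        # one request per interval, regardless of arrival bursts
        while True:
            time.sleep(self.interval)
            with self.lock:
                request = self.queue.popleft() if self.queue else None
            if request is not None:
                self.worker(request)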

where to enforce

Rate limiting can be applied at multiple layers:

  • API gateway: The outermost layer. Catches abusive traffic before it reaches your services. Per-client or per-API-key limits. Most common.
  • Per-service: Each service enforces its own limits. Protects against internal services that generate too much traffic.
  • Per-user: Limits tied to authenticated users. Prevents individual users from monopolizing resources.
  • Global: A centralized rate limiter (often backed by Redis) that all services consult. Consistent limits across a distributed system.

Most architectures use multiple layers. The API gateway catches external abuse. Per-service limits protect against internal overload. Per-user limits ensure fairness.
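
As a sketch of the per-user layer, reusing the TokenBucket class from above: one bucket per authenticated user, keyed by whatever identity the gateway provides (the user_id parameter is a placeholder).

from collections import defaultdict

class PerUserLimiter:
    def __init__(self, rate, capacity):
        # one lazily created bucket per user
        self.buckets = defaultdict(lambda: TokenBucket(rate, capacity))

    def allow(self, user_id):
        return self.buckets[user_id].allow()

In practice the dictionary needs eviction, since idle users would otherwise accumulate forever, which is one reason these limiters are often backed by Redis with key expiry.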

where it shows up

  • API gateways (Kong, AWS API Gateway): Token bucket rate limiting per API key, per route, or per IP. Configurable limits and burst sizes.
  • Nginx: The limit_req module implements a leaky bucket. Configurable rate and burst size per zone (IP, path, etc.).
  • Redis-based limiters: Lua scripts in Redis implement token buckets or sliding windows with atomic operations. Used when rate limiting must be consistent across multiple application servers (a minimal sketch follows this list).
  • Cloud provider limits: AWS, GCP, and Azure impose rate limits on every API. EC2 API calls, S3 requests, Lambda invocations all have limits. These limits protect the cloud provider’s infrastructure and ensure fair sharing among tenants.
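
For the Redis case, one common pattern is a fixed-window counter: INCR a per-window key and let it expire. The sketch below uses redis-py; it is a fixed window rather than a true sliding one, the key naming and connection details are assumptions, and a Lua script would make the INCR/EXPIRE pair fully atomic.

import time
import redis

r = redis.Redis()   # assumes a local Redis; adjust host/port as needed

def allow(client_id, limit, window=60):
    # the key changes every window, so the count resets automatically
    key = f"ratelimit:{client_id}:{int(time.time() // window)}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window * 2)   # clean up keys after the window passes
    count, _ = pipe.execute()
    return count <= limit

Because every application server talks to the same Redis, the limit holds across the whole fleet rather than per process.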