The Throttle
Rate limiting protects services from being overwhelmed by too many requests. The token bucket algorithm refills tokens at a steady rate and allows bursts up to the bucket capacity. The sliding window counts requests in a rolling time window for smooth enforcement. The leaky bucket queues requests and drains them at a constant rate, smoothing bursty traffic. When the limit is exceeded, the answer is 429 Too Many Requests.
the problem
Every service has a breaking point. A database can handle 10,000 queries per second. An API server can process 5,000 requests per second. A payment processor can handle 500 transactions per second. Beyond that, latency spikes, errors increase, and the service degrades or crashes.
Rate limiting puts a ceiling on request volume. When a client exceeds its quota, the service rejects the excess with HTTP 429 (Too Many Requests) rather than trying to process everything and failing for everyone.
Without rate limiting, a single misbehaving client (or a DDoS attack) can take down the entire service. Rate limiting is the immune system of distributed services.
token bucket
The most widely used algorithm. A bucket holds tokens, up to a maximum capacity. Tokens refill at a fixed rate (for example, 10 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected.
if tokens > 0:
    tokens -= 1
    accept request
else:
    reject request (429)

The key property: token buckets allow bursts. If the bucket is full (say, 50 tokens), a client can send 50 requests instantly. After the burst, the client is limited to the refill rate. This makes token buckets well-suited for APIs where occasional bursts are normal but sustained high rates are not.
Token buckets are simple to implement, cheap to evaluate (one comparison and decrement), and naturally handle bursty traffic patterns. AWS API Gateway, Stripe, and most cloud rate limiters use token buckets.
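A minimal single-process sketch in Python (the class and parameter names are illustrative, not taken from any particular library):

import time

class TokenBucket:
    """Refills continuously at refill_rate; allows bursts up to capacity."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity              # maximum tokens (burst size)
        self.refill_rate = refill_rate        # tokens added per second
        self.tokens = capacity                # start full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True    # accept
        return False       # reject with 429

# bucket = TokenBucket(capacity=50, refill_rate=10)   # 10 req/s steady, bursts of 50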
sliding window
Count the number of requests in a rolling time window. If the count exceeds the limit, reject new requests until old ones fall outside the window.
window = requests in the last 60 seconds
if count(window) < limit:
    accept request
else:
    reject request (429)

Unlike token buckets, sliding windows provide smooth rate enforcement. There is no burst allowance. If the limit is 100 requests per minute, the 101st request in any 60-second window is rejected.
The implementation trades off precision for memory. A true sliding window tracks every request timestamp. A sliding window counter approximates by weighting the current and previous fixed windows. The approximation is good enough for most use cases and uses constant memory.
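A sketch of the sliding window counter approximation, weighting the previous fixed window by how much of it still overlaps the rolling window (names and window size are illustrative):

import time

class SlidingWindowCounter:
    """Approximate sliding window using weighted current + previous fixed windows."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self.current_index = 0      # which fixed window we are in
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = time.time()
        index = int(now // self.window)
        if index != self.current_index:
            # Moved into a new fixed window; shift the counts.
            self.previous_count = self.current_count if index == self.current_index + 1 else 0
            self.current_count = 0
            self.current_index = index
        # Weight the previous window by how much of it still overlaps the rolling window.
        elapsed_fraction = (now % self.window) / self.window
        estimated = self.previous_count * (1.0 - elapsed_fraction) + self.current_count
        if estimated < self.limit:
            self.current_count += 1
            return True
        return False    # reject with 429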
leaky bucket
The leaky bucket is the dual of the token bucket. Requests enter a queue (the bucket). The queue drains at a constant rate. If the queue is full, new requests are dropped.
if queue is not full:
    add request to queue
else:
    drop request (429)

// separately, drain queue at constant rate
process one request from queue

The key property: output rate is always constant, regardless of input burstiness. A burst of 50 requests does not cause 50 simultaneous backend hits. Instead, they queue up and drain one at a time. The leaky bucket smooths traffic.
Leaky buckets are useful when the downstream service cannot handle bursts. A payment processor that handles exactly 100 transactions per second benefits from a leaky bucket that absorbs bursts and feeds transactions at a steady rate.
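A minimal leaky bucket sketch in Python, assuming a bounded in-process queue and a placeholder handle() function standing in for the downstream call:

import queue
import threading
import time

incoming = queue.Queue(maxsize=100)    # the bucket: a bounded queue

def handle(request):
    ...                                # placeholder for the downstream call

def submit(request) -> bool:
    try:
        incoming.put_nowait(request)
        return True                    # queued
    except queue.Full:
        return False                   # bucket full: drop with 429

def drain(drain_rate: float = 100.0):
    # Drain at a constant rate, regardless of how bursty the input is.
    interval = 1.0 / drain_rate
    while True:
        handle(incoming.get())
        time.sleep(interval)

threading.Thread(target=drain, daemon=True).start()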
where to enforce
Rate limiting can be applied at multiple layers:
- API gateway: The outermost layer. Catches abusive traffic before it reaches your services. Per-client or per-API-key limits. Most common.
- Per-service: Each service enforces its own limits. Protects against internal services that generate too much traffic.
- Per-user: Limits tied to authenticated users. Prevents individual users from monopolizing resources.
- Global: A centralized rate limiter (often backed by Redis) that all services consult. Consistent limits across a distributed system.
Most architectures use multiple layers. The API gateway catches external abuse. Per-service limits protect against internal overload. Per-user limits ensure fairness.
where it shows up
- API gateways (Kong, AWS API Gateway): Token bucket rate limiting per API key, per route, or per IP. Configurable limits and burst sizes.
- Nginx: The limit_req module implements a leaky bucket. Configurable rate and burst size per zone (IP, path, etc.).
- Redis-based limiters: Lua scripts in Redis implement token buckets or sliding windows with atomic operations. Used when rate limiting must be consistent across multiple application servers.
- Cloud provider limits: AWS, GCP, and Azure impose rate limits on every API. EC2 API calls, S3 requests, Lambda invocations all have limits. These limits protect the cloud provider’s infrastructure and ensure fair sharing among tenants.
+ distributed rate limiting with Redis
When your API runs on 20 servers, each server needs to agree on the current request count. A local counter on each server allows 20x the intended rate (each server thinks it is the only one counting).
The solution: a centralized counter in Redis. Every request increments a Redis key. If the count exceeds the limit, reject. Redis is fast enough (100K+ operations per second) to serve as a centralized rate limiter without becoming a bottleneck.
The standard Redis implementation uses a Lua script for atomicity:
-- Token bucket in Redis (Lua, run atomically via EVAL)
-- KEYS[1] = token count, KEYS[2] = last-refill timestamp
-- ARGV[1] = max_tokens, ARGV[2] = refill_rate (tokens/sec), ARGV[3] = now (seconds)
local max_tokens  = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now         = tonumber(ARGV[3])
local tokens      = tonumber(redis.call('GET', KEYS[1])) or max_tokens
local last_refill = tonumber(redis.call('GET', KEYS[2])) or now
tokens = math.min(max_tokens, tokens + (now - last_refill) * refill_rate)
if tokens >= 1 then
    tokens = tokens - 1
    redis.call('SET', KEYS[1], tokens)
    redis.call('SET', KEYS[2], now)
    return 1   -- allowed
else
    return 0   -- denied
end

The Lua script runs atomically in Redis, so there are no race conditions between concurrent requests. This pattern is used by Stripe, GitHub, and most cloud API providers.
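For illustration, a minimal caller using the redis-py client, assuming the Lua script above is saved as token_bucket.lua and that the key naming scheme is an invention of this sketch:

import time
import redis

LUA_TOKEN_BUCKET = open("token_bucket.lua").read()    # the script above

r = redis.Redis()
token_bucket = r.register_script(LUA_TOKEN_BUCKET)    # runs atomically via EVALSHA

def allowed(client_id: str, max_tokens: int = 50, refill_rate: float = 10.0) -> bool:
    return token_bucket(
        keys=[f"rl:{client_id}:tokens", f"rl:{client_id}:ts"],
        args=[max_tokens, refill_rate, time.time()],
    ) == 1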
For even higher throughput, you can shard rate limiters across multiple Redis instances, each handling a subset of clients.
+ backpressure vs rate limiting
Rate limiting and backpressure solve the same problem (overload) from different directions.
Rate limiting is proactive: it caps the input rate regardless of system state. Even if the system is healthy and underloaded, the limit applies. It is a hard boundary.
Backpressure is reactive: it slows down producers when consumers cannot keep up. When a queue fills up, the producer is told to slow down or wait. When the queue drains, the producer can speed up. It is a dynamic feedback mechanism.
Rate limiting says “no more than 1000 per second, ever.” Backpressure says “slow down, I am falling behind.”
In practice, use both. Rate limiting protects against external abuse (clients you cannot control). Backpressure protects against internal overload (services you can control). TCP congestion control is backpressure. API quotas are rate limiting.
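A toy contrast in Python: a bounded queue gives backpressure (the producer waits), while a fixed cap gives rate limiting (the producer is rejected). The names, the TooManyRequests exception, and the bucket object are illustrative:

import queue

class TooManyRequests(Exception):
    pass

jobs = queue.Queue(maxsize=1000)       # bounded buffer between producer and consumer

def produce_with_backpressure(job):
    # Backpressure: block when the consumer falls behind and the queue is full,
    # then speed up automatically once it drains.
    jobs.put(job, block=True)

def produce_with_rate_limit(job, bucket):
    # Rate limiting: reject once the fixed cap is exceeded, even if the system
    # downstream is perfectly healthy (bucket is e.g. the TokenBucket sketch above).
    if not bucket.allow():
        raise TooManyRequests()        # surfaces as a 429
    jobs.put_nowait(job)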
+ adaptive rate limiting and AIMD
Fixed rate limits are blunt instruments. A limit of 1000 requests per second might be too low during off-peak hours (wasting capacity) and too high during peak hours (allowing overload).
Adaptive rate limiting adjusts the limit based on system health. When the system is healthy (low latency, low error rate), the limit increases. When the system is stressed, the limit decreases.
AIMD (Additive Increase, Multiplicative Decrease) is the classic adaptive algorithm. It is the foundation of TCP congestion control:
- Additive Increase: When things are going well, increase the limit by a fixed amount each interval (e.g., +10 requests/second).
- Multiplicative Decrease: When things are going badly (errors, timeouts), cut the limit by a fraction (e.g., halve it).
This creates a sawtooth pattern: the limit slowly ramps up, then quickly drops when problems are detected. Netflix’s concurrency-limits library implements AIMD for service-to-service rate limiting. It automatically finds the optimal concurrency for each service without manual tuning.
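A minimal AIMD sketch; the step size, backoff factor, and the choice of errors and timeouts as the health signal are assumptions:

class AIMDLimit:
    """Additive increase, multiplicative decrease over a request or concurrency limit."""

    def __init__(self, initial=100, step=10, backoff=0.5, floor=1, ceiling=10_000):
        self.limit = initial
        self.step = step           # additive increase per healthy interval
        self.backoff = backoff     # multiplicative decrease when stressed
        self.floor = floor
        self.ceiling = ceiling

    def on_interval(self, errors: int, timeouts: int) -> int:
        if errors == 0 and timeouts == 0:
            self.limit = min(self.ceiling, self.limit + self.step)           # ramp up slowly
        else:
            self.limit = max(self.floor, int(self.limit * self.backoff))     # cut quickly
        return self.limit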
+ rate limiting in API design (headers, retry-after)
Good rate limiting tells clients what is happening. HTTP headers communicate limit information:
- X-RateLimit-Limit: The maximum number of requests allowed in the window.
- X-RateLimit-Remaining: How many requests the client has left.
- X-RateLimit-Reset: When the limit resets (usually a Unix timestamp).
- Retry-After: How many seconds the client should wait before retrying (included in 429 responses).
These headers let well-behaved clients throttle themselves before hitting the limit. A client that sees X-RateLimit-Remaining: 5 can slow down proactively.
The Retry-After header is especially important. Without it, clients that get a 429 immediately retry, creating a retry storm. With Retry-After: 30, clients know exactly when to try again. Adding jitter (a small random delay) prevents synchronized retries from many clients hitting the server at the same time.
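A client-side sketch that honors these headers, using the Python requests library; the jitter range and the remaining-quota threshold are assumptions:

import random
import time
import requests

def polite_get(url: str):
    resp = requests.get(url)
    if resp.status_code == 429:
        # The server said exactly how long to wait; add jitter so many clients
        # do not all retry in the same instant.
        wait = float(resp.headers.get("Retry-After", 1)) + random.uniform(0, 1)
        time.sleep(wait)
        resp = requests.get(url)
    elif int(resp.headers.get("X-RateLimit-Remaining", 100)) < 5:
        time.sleep(1)              # quota nearly exhausted: throttle proactively
    return resp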
production stories
API rate limiting in practice
Most major APIs use tiered rate limiting. Free tiers get lower limits. Paid tiers get higher limits. Enterprise tiers get custom limits. The rate limiter is the enforcement point for the business model.
The implementation challenge is not the algorithm; it is the policy. Which dimension do you limit on? Per API key? Per IP? Per user? Per endpoint? Most systems limit on multiple dimensions simultaneously. A single API key might have a global limit of 10,000 requests per minute and a per-endpoint limit of 1,000 per minute for expensive operations.
The operational challenge is setting the right limits. Too strict and you frustrate legitimate users. Too lenient and you let abuse through. The best approach is to start with generous limits, monitor usage patterns, and tighten where needed.
DDoS protection
DDoS attacks send millions of requests per second from thousands of IPs. Traditional rate limiting (per-IP token bucket) is necessary but not sufficient. Attackers rotate IPs and distribute traffic to stay under per-IP limits.
Defense in depth combines multiple strategies: IP reputation (known bad actors are blocked before rate limiting), geographic filtering (block traffic from regions with no legitimate users), challenge-response (CAPTCHAs for suspicious traffic), and traffic analysis (detect abnormal patterns in real time).
CDN providers like Cloudflare and AWS Shield handle DDoS at the network edge, absorbing attack traffic across their global network before it reaches your servers. This is a problem best solved by specialists, not by application-level rate limiting alone.
TCP congestion control as rate limiting
TCP’s congestion control is the original rate limiter. When packets are lost (a signal of network congestion), TCP reduces its sending rate. When packets are acknowledged successfully, it increases the rate.
The algorithms have evolved: TCP Tahoe (1988) and Reno (1990) used AIMD. TCP CUBIC (2008, Linux default) uses a cubic function for faster recovery. TCP BBR (2016, Google) models the network’s bandwidth and RTT directly, avoiding loss-based signals entirely.
The parallel to application rate limiting is direct. TCP rate limits individual connections based on network capacity. Application rate limiters limit clients based on service capacity. Both use feedback loops to find the sustainable rate.
when rate limiting goes wrong
Rate limiters can cause cascading failures. Scenario: Service A rate-limits Service B. Service B retries. Service A sees more load and tightens its limits. Service B retries harder. The system enters a death spiral of retries.
The fix: exponential backoff with jitter. Instead of retrying immediately, wait 1 second, then 2, then 4, then 8, with a random jitter added to each delay. This spreads retries over time and prevents synchronized storms.
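A sketch of exponential backoff with jitter; the base delay and attempt count are illustrative:

import random
import time

def retry_with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0):
    # Wait roughly 1s, 2s, 4s, 8s... plus random jitter between attempts.
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:                  # narrow this to your client's rate-limit error
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)              # jitter spreads retries and prevents synchronized storms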
Circuit breakers complement rate limiting. When a downstream service is overwhelmed, the circuit breaker trips and immediately rejects requests without even sending them. This gives the downstream service time to recover. The circuit breaker periodically lets a few requests through to check if the service has recovered.
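A minimal circuit breaker sketch; the failure threshold and cooldown period are illustrative:

import time

class CircuitBreaker:
    """Trips after repeated failures, fails fast during a cooldown, then probes."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")    # no downstream call at all
            self.opened_at = None          # cooldown elapsed: let a trial request through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()                   # trip the breaker
            raise
        self.failures = 0                  # success resets the count and keeps the circuit closed
        return result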