The Balancer
A load balancer distributes incoming requests across backend servers so no single server becomes a bottleneck. Round-robin rotates through servers sequentially, simple but blind to load. Least-connections sends traffic to the server with the fewest active connections, adapting to real workload. Weighted routing assigns proportional traffic based on server capacity. Health checks remove unhealthy servers from rotation.
the problem
One server handles 1,000 requests per second. You need to handle 10,000. You could buy a bigger server (vertical scaling), but there is a ceiling. At some point, you need multiple servers.
Now you have a new problem: which server handles which request? If you send everything to server 1, you have not scaled at all. If you split traffic unevenly, some servers are overloaded while others sit idle. A load balancer sits in front of your servers and makes this decision for every incoming request.
round-robin
The simplest algorithm. Requests go to servers in order: S1, S2, S3, S4, S1, S2, S3, S4, and so on. Almost no state to track (just the current position), no per-request computation, and a perfectly even split assuming all servers are identical and all requests cost the same.
The problem: requests are not equal. A lightweight health check and a heavy database query both count as one request. Round-robin sends the heavy query to S3 even if S3 is already handling three heavy queries while S1 is idle. It distributes requests evenly but not load.
Round-robin works well when requests have similar cost and servers have similar capacity. For homogeneous, stateless API servers behind a gateway, it is often good enough.
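A minimal sketch of the rotation in Python, using the S1 through S4 servers from the example (lowercased as identifiers); a real balancer does this inside the proxy for every incoming connection:

```python
import itertools

# Round-robin: cycle through a fixed list of backends in order.
servers = ["s1", "s2", "s3", "s4"]
rotation = itertools.cycle(servers)

def pick_server():
    return next(rotation)

# The first eight requests land on s1, s2, s3, s4, s1, s2, s3, s4.
print([pick_server() for _ in range(8)])
```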
least connections
Instead of rotating blindly, track how many active connections each server has. Send the next request to the server with the fewest. This adapts to reality: if S1 finishes fast, it gets more traffic. If S3 is stuck on a slow query, it gets less.
Least-connections requires the balancer to track connection state, which adds a small overhead. But the improvement in load distribution is significant for workloads with variable request latency.
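A sketch of that bookkeeping in Python (hypothetical counts): the balancer increments a server's count when a connection opens, decrements it when the connection closes, and picks the minimum for each new request.

```python
# Active connection counts per server, maintained by the balancer.
active = {"s1": 0, "s2": 0, "s3": 0, "s4": 0}

def pick_server():
    # Route to the server with the fewest active connections.
    return min(active, key=active.get)

def on_connection_open(server):
    active[server] += 1

def on_connection_close(server):
    active[server] -= 1

server = pick_server()
on_connection_open(server)
# ... proxy the request ...
on_connection_close(server)
```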
Most production load balancers support least-connections or a variant of it: Nginx offers least_conn, HAProxy offers leastconn, and AWS ALB has a least-outstanding-requests mode.
weighted routing
Not all servers are equal. A 16-core machine can handle more than a 4-core machine. Weighted routing assigns a weight to each server and distributes traffic proportionally.
With weights S1=1, S2=2, S3=3, S4=1, a batch of 7 requests splits: S1 gets 1, S2 gets 2, S3 gets 3, S4 gets 1. The beefy server handles more traffic because it can.
Weights are also useful for canary deployments. Give the canary weight 1 while the rest of the pool has a combined weight of 99, and the canary gets roughly 1% of traffic. If it performs well, gradually increase its weight.
Setting a weight to 0 removes a server from rotation without marking it unhealthy. Useful for graceful draining during deployments.
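A naive sketch of weighted rotation in Python, using the weights from the example above; note that a weight of 0 drops the server from the rotation exactly as described:

```python
import itertools

weights = {"s1": 1, "s2": 2, "s3": 3, "s4": 1}

def build_rotation(weights):
    # Repeat each server according to its weight, then cycle.
    expanded = [s for s, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)

rotation = build_rotation(weights)
# One full cycle of 7 picks: 1 for s1, 2 for s2, 3 for s3, 1 for s4.
print([next(rotation) for _ in range(7)])
```

Production balancers interleave the picks rather than clustering them (Nginx uses a smooth weighted round-robin for this), but the proportions come out the same.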
health checks
A load balancer is only as good as its health checks. Sending traffic to a dead server means failed requests for users.
Active health checks: The balancer periodically pings each server (HTTP GET /health, TCP connect). If a server fails N consecutive checks, it is removed from the pool. When it passes again, it is added back.
Passive health checks: The balancer monitors actual traffic. If a server returns too many 5xx errors or times out too often, it is marked unhealthy. No extra traffic needed, but slower to detect failures.
Most production setups use both. Active checks catch servers that are completely down. Passive checks catch servers that are up but misbehaving.
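A toy active checker in Python; the backend addresses, threshold, and interval are made up, and a real balancer runs these probes concurrently rather than in a single loop:

```python
import time
import urllib.request

BACKENDS = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]  # illustrative
FAIL_THRESHOLD = 3     # consecutive failures before removal
CHECK_INTERVAL = 5     # seconds between rounds

failures = {b: 0 for b in BACKENDS}
healthy = set(BACKENDS)

def probe(backend):
    try:
        with urllib.request.urlopen(backend + "/health", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def health_check_loop():
    while True:
        for backend in BACKENDS:
            if probe(backend):
                failures[backend] = 0
                healthy.add(backend)          # passing again: back in the pool
            else:
                failures[backend] += 1
                if failures[backend] >= FAIL_THRESHOLD:
                    healthy.discard(backend)  # N consecutive failures: removed
        time.sleep(CHECK_INTERVAL)
```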
L4 vs L7
Load balancers operate at different layers of the network stack.
L4 (transport layer): Routes based on IP addresses and TCP/UDP ports. Fast because it does not inspect the request payload. Cannot make routing decisions based on HTTP headers, URLs, or cookies.
L7 (application layer): Routes based on HTTP headers, URL paths, cookies, and request content. Can do path-based routing (/api goes to API servers, /static goes to CDN), header-based routing (mobile vs desktop), and cookie-based sticky sessions.
L4 is faster. L7 is smarter. Most modern load balancers (Nginx, Envoy, AWS ALB) operate at L7 by default.
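To make the difference concrete, here is a toy L7 routing decision in Python; the pool names and addresses are hypothetical. An L4 balancer could not do this because it never parses the HTTP request:

```python
# Backend pools keyed by role (addresses are placeholders).
POOLS = {
    "api":    ["10.0.1.1:8080", "10.0.1.2:8080"],
    "static": ["10.0.2.1:8080"],
    "admin":  ["10.0.3.1:8080"],
    "web":    ["10.0.4.1:8080"],
}

def choose_pool(path, host):
    if host.startswith("admin."):
        return POOLS["admin"]      # host/header-based routing
    if path.startswith("/api"):
        return POOLS["api"]        # path-based routing
    if path.startswith("/static"):
        return POOLS["static"]
    return POOLS["web"]

print(choose_pool("/api/users", "example.com"))  # routes to the API pool
```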
where it shows up
- Nginx: The most widely deployed reverse proxy and load balancer. Supports round-robin, least-connections, IP hash. Runs as L7 by default, can do L4 with stream module.
- HAProxy: High-performance TCP/HTTP load balancer. Known for reliability and battle-tested configurations. Powers many high-traffic sites.
- AWS ALB/NLB: ALB is L7 (HTTP routing, path-based, host-based). NLB is L4 (TCP/UDP, ultra-low latency). ELB Classic is the older combined version.
- Envoy: Modern L7 proxy designed for microservice architectures. Used as the data plane in service meshes (Istio). Supports advanced features like circuit breaking, retries, and observability.
+ L4 vs L7 in depth
The difference matters more than you might expect.
L4 load balancing works at the TCP level. The balancer sees a TCP SYN packet, picks a backend, and forwards all packets for that connection to the chosen backend. It never inspects the HTTP request inside. This makes it extremely fast (millions of connections per second on commodity hardware) but limited in routing intelligence.
L7 load balancing terminates the TCP connection at the balancer, parses the HTTP request, and opens a new connection to the backend. This gives the balancer full visibility: it can route based on URL path, Host header, cookies, JWT claims, or any request attribute. The cost is higher latency (two TCP connections instead of one) and more CPU usage (HTTP parsing).
In practice, many architectures use both: an L4 balancer (like AWS NLB or IPVS) at the edge for raw throughput, with L7 balancers (Nginx, Envoy) behind it for intelligent routing.
Direct Server Return (DSR) is an L4 optimization where the backend responds directly to the client, bypassing the load balancer for the response path. This dramatically reduces balancer bandwidth usage for response-heavy workloads (like video streaming).
+ consistent hashing for load balancing
When backends are added or removed, traditional hash-based routing (hash(request) % N) remaps nearly every request to a different backend. This destroys any server-side caches or session state.
Consistent hashing solves this by mapping both requests and servers to a ring. Adding a server only remaps the requests that land on the new server’s segment. Removing a server remaps only its requests to the next server on the ring.
This is the same consistent hashing from the first post in this series, applied to load balancing instead of data partitioning. Envoy, Nginx (with the consistent hash module), and many service meshes support consistent hash load balancing.
Use consistent hashing when: backends have local caches and you want cache affinity, or when you need sticky sessions without cookies.
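A compact sketch of the ring in Python, assuming a fixed number of virtual nodes per server to keep the distribution even (real implementations tune the replica count and hash function):

```python
import bisect
import hashlib

REPLICAS = 100  # virtual nodes per server, an assumed value

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        # Place REPLICAS points per server on the ring, sorted by hash.
        self._ring = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(REPLICAS)
        )
        self._points = [h for h, _ in self._ring]

    def pick(self, request_key):
        # Walk clockwise to the first server point at or after the key.
        idx = bisect.bisect(self._points, _hash(request_key)) % len(self._points)
        return self._ring[idx][1]

ring = HashRing(["s1", "s2", "s3", "s4"])
print(ring.pick("user:42"))  # the same key always maps to the same backend
```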
+ global server load balancing (GSLB)
GSLB operates at the DNS level, routing users to the nearest datacenter. When a user in Tokyo resolves api.example.com, the DNS server returns the IP of the Tokyo datacenter. A user in London gets the London datacenter IP.
GSLB considers geography, datacenter health, and capacity. If the Tokyo datacenter is down, Tokyo users get routed to the next closest healthy datacenter (perhaps Singapore or Seoul).
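A toy version of that failover decision in Python; the regions, datacenter names, and health data are all made up for illustration:

```python
# Ordered datacenter preferences per client region (illustrative).
PREFERENCES = {
    "tokyo":  ["tokyo", "seoul", "singapore", "us-west"],
    "london": ["london", "frankfurt", "us-east"],
}

# Tokyo is marked unhealthy in this example.
HEALTHY = {"seoul", "singapore", "london", "frankfurt", "us-east", "us-west"}

def resolve(client_region):
    for dc in PREFERENCES[client_region]:
        if dc in HEALTHY:
            return dc
    raise RuntimeError("no healthy datacenter available")

print(resolve("tokyo"))   # 'seoul', the next closest healthy datacenter
print(resolve("london"))  # 'london', its own datacenter is healthy
```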
Implementation options include DNS-based (Route 53, Cloudflare DNS), anycast (same IP advertised from multiple locations, BGP routing picks the closest), and HTTP-redirect (the first request goes to a global endpoint that redirects to the optimal regional endpoint).
Anycast is increasingly popular because failover does not have to wait for DNS caches and TTLs to expire. The same IP works everywhere, and the network automatically routes to the closest healthy origin.
+ service mesh and sidecar proxies
In a microservice architecture, every service talks to dozens of other services. Each service needs load balancing, retries, timeouts, circuit breaking, and observability. Implementing this in every service is impractical.
A service mesh moves this logic into a sidecar proxy (like Envoy) that runs alongside each service instance. The application sends requests to localhost, and the sidecar handles routing, load balancing, retries, mTLS, and telemetry.
Istio is the most widely deployed service mesh. It uses Envoy sidecars with a control plane that distributes configuration. The control plane handles service discovery, traffic policies, and certificate management. The data plane (Envoy sidecars) handles the actual request routing.
The overhead of sidecar proxies (one Envoy per pod, two extra network hops per request) is the main criticism. Newer approaches like ambient mesh (Istio’s ztunnel) and eBPF-based meshes (Cilium) aim to reduce this overhead.
production stories
Nginx at scale
Nginx handles millions of concurrent connections with a small memory footprint. Its event-driven architecture (non-blocking I/O with epoll/kqueue) makes it efficient as both a web server and a reverse proxy.
For load balancing, the typical Nginx configuration is straightforward: define an upstream block with backend servers, set the balancing algorithm, and configure health checks. The least_conn directive is the most common choice for API workloads.
The operational insight: Nginx’s worker process model (one worker per CPU core, each handling thousands of connections) means you rarely need to tune it. The default configuration handles most workloads. When you do need to tune, worker_connections and keepalive are the two knobs that matter most.
the power of two random choices
A deceptively simple algorithm that performs nearly as well as least-connections with much less state: pick two random servers, send the request to whichever has fewer connections.
The mathematics behind this is striking. For n requests spread across n servers, purely random selection gives an expected maximum load of O(log n / log log n). Two random choices bring this down to O(log log n), an exponential improvement from one extra comparison.
This algorithm is used in Envoy, HAProxy, and many internal load balancers at large companies. It is particularly useful in distributed load balancing where maintaining global connection counts is expensive. Each balancer only needs to check two servers per request, and the result is nearly optimal.
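The whole algorithm fits in a few lines; a sketch in Python with the same hypothetical connection counts as before:

```python
import random

# Per-server active connection counts, maintained by the balancer.
active = {"s1": 0, "s2": 0, "s3": 0, "s4": 0}

def pick_server():
    # Sample two distinct servers, keep whichever has fewer connections.
    a, b = random.sample(list(active), 2)
    return a if active[a] <= active[b] else b

server = pick_server()
active[server] += 1   # ...and decrement when the request completes
```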
graceful deployments
Load balancers are essential for zero-downtime deployments. The pattern: start new servers with the new version, add them to the pool, drain connections from old servers, then remove old servers.
Connection draining means the old server stops accepting new connections but finishes processing existing ones. The load balancer stops sending new requests to the draining server. Once all active connections complete (or a timeout expires), the server can safely shut down.
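A sketch of that drain step in Python; active and in_rotation stand in for balancer state, and the timeout value is just an example:

```python
import time

DRAIN_TIMEOUT = 300  # seconds, an example value

def drain(server, active, in_rotation):
    in_rotation.discard(server)             # stop sending new requests
    deadline = time.time() + DRAIN_TIMEOUT
    while active[server] > 0 and time.time() < deadline:
        time.sleep(1)                       # let in-flight requests finish
    return active[server] == 0              # True means safe to shut down
```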
AWS ALB supports connection draining natively with a configurable timeout (default 300 seconds). Kubernetes uses a readiness probe and a termination grace period to achieve the same result with any load balancer.
when load balancing is not enough
Load balancers distribute traffic, but they do not solve capacity problems. If all four servers are at 100% CPU, adding a load balancer does not help. You need more servers.
Auto-scaling solves this by adding servers when load increases and removing them when load decreases. The load balancer and the auto-scaler work together: the auto-scaler watches metrics (CPU, request rate, queue depth) and adjusts the server count. The load balancer distributes traffic across whatever servers exist.
The key metric for auto-scaling is not CPU. It is request latency. CPU can spike briefly without affecting users. But when latency crosses a threshold, users notice immediately. Scale on the metric that matters to users.
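A toy scaling decision along those lines in Python; the latency thresholds are illustrative, and a real autoscaler would add cooldowns and min/max bounds:

```python
SCALE_OUT_P99_MS = 250  # above this, users are feeling it
SCALE_IN_P99_MS = 100   # below this, there is headroom to spare

def desired_server_count(p99_latency_ms, current_servers):
    if p99_latency_ms > SCALE_OUT_P99_MS:
        return current_servers + 1   # add capacity
    if p99_latency_ms < SCALE_IN_P99_MS and current_servers > 1:
        return current_servers - 1   # shed capacity, save cost
    return current_servers
```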