The Consensus Problem
Five nodes need to agree on one value. Consensus protocols solve this with terms, votes, and quorums. Raft makes a leader responsible for log ordering. When the leader dies, nodes elect a new one. Split-brain happens when two nodes both think they’re leader. Quorum voting prevents it.
the problem
You have five servers. A client writes x = 5. All five servers need to agree that x = 5, in the same order, even if one of them crashes mid-write. That is the consensus problem.
It sounds simple until you consider: messages get delayed, reordered, or lost. Servers crash and restart. The network partitions, splitting your cluster in half. Each half thinks the other is dead.
Without consensus, each server might end up with a different value. Your database returns different answers depending on which server you ask. That is not a database. That is five databases pretending to be one.
the leader
Raft’s key insight: elect a single leader and funnel all writes through it. The leader decides the order of operations. Followers just replicate what the leader tells them.
This simplifies everything. There is no need to resolve conflicting writes from multiple servers. The leader is the single source of truth for ordering. If you want to write, talk to the leader. If you want to read (with strong consistency), talk to the leader.
Client -> Leader -> Followers
            |
            v
    "x=5 is entry #47"

The leader assigns each write a sequential log index. Followers apply entries in that order. Every node ends up with the same log.
terms and votes
What happens when the leader dies? The remaining nodes need to elect a new one. Raft uses terms, monotonically increasing logical clocks, to prevent two leaders in the same term.
Each term has at most one leader. When a follower's election timeout expires (no contact from a leader), it increments its term and becomes a candidate. It votes for itself and sends RequestVote RPCs to all other nodes.
Term 1: N3 is leader
Term 1: N3 crashes
Term 2: N1 starts election, votes for self
Term 2: N2, N4, N5 grant votes
Term 2: N1 becomes leader (4/5 votes, majority = 3)

A node grants its vote to the first candidate it hears from in a given term. One vote per term per node. This guarantees at most one leader per term. If a node sees a higher term number, it immediately steps down to follower. No arguments, no negotiation.
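The one-vote-per-term rule and the step-down-on-higher-term rule can be sketched as a small vote handler (hypothetical names, assuming a simplified RequestVote without the log check covered later):

```python
# Sketch of RequestVote handling: one vote per term, step down on higher term.
class Node:
    def __init__(self):
        self.current_term = 0
        self.voted_for = None  # candidate voted for in current_term

    def handle_request_vote(self, term, candidate_id):
        # A higher term resets local state: adopt it and clear the old vote.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # Reject requests from stale terms outright.
        if term < self.current_term:
            return False
        # Grant at most one vote per term: first candidate wins.
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

n = Node()
assert n.handle_request_vote(2, "N1") is True   # first request in term 2
assert n.handle_request_vote(2, "N4") is False  # vote already spent this term
assert n.handle_request_vote(3, "N4") is True   # new term, new vote
```

Because each node holds exactly one vote per term, two candidates can never both collect a majority in the same term.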
log replication
Once elected, the leader accepts client requests and appends them to its log. It then sends AppendEntries RPCs to all followers with the new entries.
When a majority of nodes (the quorum) have written the entry to their logs, the leader considers it committed. Committed entries are durable. They will survive any future leader election because the election restriction (below) guarantees the new leader has all committed entries.
Leader log: [1:x=5] [2:y=3]
Follower N1: [1:x=5] [2:y=3] ack
Follower N2: [1:x=5] [2:y=3] ack
Follower N4: [1:x=5] slow
Follower N5: [1:x=5] [2:y=3] ack
4/5 >= quorum (3) -> entry 2 committed

Slow followers catch up. The leader retries AppendEntries until every follower's log matches. If a follower has stale or divergent entries, the leader overwrites them. The leader's log is always authoritative.
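The commit rule reduces to simple arithmetic over each node's replication progress. A sketch (the `match_index` list is the highest log index known to be stored on each node, an invented name for this illustration):

```python
# Sketch: compute the commit index from per-node replication progress.
def commit_index(match_index, cluster_size):
    # An entry is committed once a majority of nodes have stored it.
    # Sort progress descending; the quorum-th highest value is the
    # largest index present on at least a majority of nodes.
    quorum = cluster_size // 2 + 1
    return sorted(match_index, reverse=True)[quorum - 1]

# Leader plus three followers have entry 2; one slow follower has only entry 1.
assert commit_index([2, 2, 2, 1, 2], cluster_size=5) == 2
```

This is why a single slow follower never blocks progress: the quorum moves on without it.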
safety: the election restriction
Not just any node can become leader. A voter grants its vote only if the candidate's log is at least as up-to-date as the voter's own (comparing the term of the last entry first, then log length). Since winning requires a majority of votes, the new leader's log is at least as up-to-date as a majority of the cluster.
This is the key safety property: a new leader always has every committed entry. Without this restriction, a newly elected leader could overwrite committed data, violating consensus.
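Raft's "at least as up-to-date" comparison is precise: last entry's term wins first, and log length breaks ties. A sketch with hypothetical parameter names:

```python
# Sketch of Raft's election restriction: is the candidate's log
# at least as up-to-date as this voter's log?
def candidate_up_to_date(cand_last_term, cand_last_index,
                         my_last_term, my_last_index):
    # Compare the terms of the last log entries first.
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    # On a term tie, the longer log is more up-to-date.
    return cand_last_index >= my_last_index

assert candidate_up_to_date(3, 10, 2, 50) is True   # higher last term wins
assert candidate_up_to_date(2, 49, 2, 50) is False  # shorter log loses
assert candidate_up_to_date(2, 50, 2, 50) is True   # equal logs are fine
```

A committed entry lives on a majority of nodes, and a candidate needs votes from a majority, so at least one voter has the entry and will reject any candidate missing it.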
where it shows up
- etcd: The coordination store behind Kubernetes. Uses Raft for all key-value writes. Every kubectl apply ultimately goes through Raft consensus.
- ZooKeeper: Uses ZAB (ZooKeeper Atomic Broadcast), a protocol closely related to Raft. Coordinates distributed locks, configuration, and leader election for other systems.
- CockroachDB: Runs a Raft group per range (chunk of data). Thousands of independent Raft groups operating in parallel across the cluster.
- Consul: HashiCorp’s service mesh uses Raft for its key-value store and service catalog. Every service registration is a Raft write.
- Kafka (KRaft): Kafka replaced ZooKeeper with its own Raft implementation (KRaft) for metadata management. Broker membership, topic configuration, and partition assignments all go through Raft.
+ Raft vs Paxos
Paxos came first (Lamport's "The Part-Time Parliament", circulated in 1989, published 1998) and is provably correct. It is also notoriously difficult to understand and implement. The original paper uses a metaphor about a Greek parliament that confuses more than it clarifies.
Raft was designed explicitly for understandability (Ongaro & Ousterhout, 2014). It decomposes consensus into three sub-problems: leader election, log replication, and safety. Each can be understood independently.
The key differences:
Leader requirement: Raft requires a leader at all times. Paxos can make progress with any proposer (Multi-Paxos optimizes by using a stable leader, but the protocol does not require one).
Log structure: Raft maintains a sequential, gap-free log. Paxos allows gaps, which means entries can be committed out of order and the log needs compaction.
Membership changes: Raft has explicit joint consensus for cluster membership changes. Paxos leaves this as an exercise for the implementer, which is where most Paxos implementations diverge.
In practice, most production systems use Raft or a Raft-like protocol. etcd, CockroachDB, TiKV, and Consul all chose Raft. Google’s Chubby and Spanner use Paxos variants, but with significant custom engineering on top.
+ joint consensus and membership changes
Changing cluster membership (adding or removing nodes) during operation is tricky. If you switch from a 3-node to a 5-node cluster, there is a moment when two different majorities could exist: a majority of the old 3-node config and a majority of the new 5-node config.
Raft solves this with joint consensus. The leader proposes a transitional configuration that includes both old and new members. During the transition, decisions require a majority of both the old and new configurations.
Phase 1: C_old (3 nodes: N1, N2, N3)
Phase 2: C_old,new (requires majority of {N1,N2,N3} AND {N1,N2,N3,N4,N5})
Phase 3: C_new (5 nodes: N1, N2, N3, N4, N5)

Once the joint configuration is committed, the leader switches to the new configuration. At no point can two independent majorities make conflicting decisions.
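The majority-of-both rule during the transition is a two-way set check. A sketch (hypothetical helper; node sets match the phases above):

```python
# Sketch: during joint consensus, a decision commits only with a
# majority of BOTH the old and the new configuration.
def joint_commit(acks, old_config, new_config):
    def majority(config):
        return len(acks & config) > len(config) // 2
    return majority(old_config) and majority(new_config)

old = {"N1", "N2", "N3"}
new = {"N1", "N2", "N3", "N4", "N5"}
assert joint_commit({"N1", "N2", "N4"}, old, new) is True   # 2/3 old, 3/5 new
assert joint_commit({"N3", "N4", "N5"}, old, new) is False  # lacks old majority
```

The second case is the danger the protocol exists to prevent: the new nodes alone can never outvote the old configuration mid-transition.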
Single-server membership changes (adding or removing one node at a time) are simpler and avoid the need for joint consensus entirely. Most practical deployments use this approach.
+ log compaction and snapshots
The Raft log grows forever. Every write appends an entry. A server that has been running for months has millions of log entries. New nodes joining the cluster would need to replay all of them.
Snapshots solve this. Periodically, each node serializes its current state to disk and discards all log entries up to that point. The snapshot represents the cumulative effect of all discarded entries.
When a new node joins (or a slow follower falls too far behind), the leader sends its snapshot instead of replaying thousands of log entries. The follower loads the snapshot and then applies any subsequent log entries.
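Compaction itself is a truncation: record the last index the snapshot covers, drop everything at or below it. A sketch with invented names and a dict-based log:

```python
# Sketch of log compaction: fold entries up to snapshot_index into a
# snapshot and discard them from the log. Names are illustrative.
def compact(log, state, snapshot_index):
    # `log` maps index -> command; `state` is the serialized machine state.
    snapshot = {"last_included_index": snapshot_index, "state": state}
    trimmed = {i: cmd for i, cmd in log.items() if i > snapshot_index}
    return snapshot, trimmed

log = {1: "x=5", 2: "y=3", 3: "x=7"}
snap, remaining = compact(log, state={"x": 5, "y": 3}, snapshot_index=2)
assert remaining == {3: "x=7"}          # only entries after the snapshot survive
assert snap["last_included_index"] == 2
```

A joining follower loads `snap`, then replays `remaining`, and ends up in the same state as replaying the full log.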
etcd takes snapshots every 100,000 entries by default (the --snapshot-count flag; older releases defaulted to 10,000). CockroachDB uses Raft snapshots for range rebalancing, sending a snapshot of the range data to the new replica.
+ Byzantine vs crash-fault consensus
Raft handles crash faults: nodes that stop responding. It assumes nodes are honest. A crashed node might lose state, but it never lies.
Byzantine faults are worse: nodes can behave arbitrarily. They can send conflicting messages to different peers, forge votes, or corrupt data. Handling Byzantine faults requires protocols like PBFT (Practical Byzantine Fault Tolerance), which needs 3f+1 nodes to tolerate f Byzantine failures (compared to Raft’s 2f+1 for f crash failures).
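The two fault models reduce to different minimum cluster sizes for the same failure budget. A one-function sketch of the 2f+1 vs 3f+1 arithmetic:

```python
# Sketch: minimum cluster size to tolerate f failures under each fault model.
def min_nodes(failures, byzantine):
    # Crash-fault consensus (Raft-style) needs 2f+1 nodes;
    # Byzantine fault tolerance (PBFT-style) needs 3f+1.
    return 3 * failures + 1 if byzantine else 2 * failures + 1

assert min_nodes(1, byzantine=False) == 3  # 3-node Raft survives 1 crash
assert min_nodes(2, byzantine=False) == 5  # the common 5-node deployment
assert min_nodes(1, byzantine=True) == 4   # PBFT needs 4 nodes for 1 traitor
```

The gap widens with f: tolerating two Byzantine nodes already takes seven servers, which is part of why BFT stays out of ordinary datacenter stacks.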
Most internal infrastructure uses crash-fault tolerance only. Byzantine fault tolerance matters for blockchain networks, multi-party computation, and systems where you do not trust all participants. The overhead of BFT protocols (more messages, more nodes, higher latency) makes them impractical for typical datacenter deployments.
production stories
etcd and Kubernetes
Every Kubernetes cluster runs etcd, a distributed key-value store built on Raft. When you run kubectl apply, the API server writes to etcd. That write goes through Raft consensus across typically 3 or 5 etcd nodes.
etcd’s performance is bounded by disk fsync latency because Raft requires entries to be persisted before acknowledging. The official recommendation is dedicated SSDs for etcd. Running etcd on shared disks with other workloads is the most common cause of Kubernetes control plane instability.
Cluster size matters. Three nodes tolerate one failure. Five nodes tolerate two. Seven nodes tolerate three but add latency (more nodes must acknowledge each write). Most production Kubernetes deployments use three or five etcd nodes. Going beyond five is rarely worth the latency cost.
CockroachDB’s per-range Raft
CockroachDB does not run one Raft group for the entire database. It splits data into ranges (512 MB by default) and runs an independent Raft group for each range. A cluster with 1 TB of data has roughly 2,000 Raft groups operating simultaneously.
This design gives per-range consistency without global coordination. Two transactions touching different ranges never contend on Raft. The system scales horizontally because adding nodes adds capacity for more Raft groups.
The challenge is Raft group management. Each node might participate in hundreds of Raft groups. CockroachDB batches Raft messages and heartbeats to reduce network overhead. Without batching, the heartbeat traffic alone would saturate the network.
Kafka’s KRaft migration
For years, Kafka depended on ZooKeeper for metadata management: broker registration, topic configuration, partition leader election. ZooKeeper was a separate distributed system that Kafka operators had to deploy, monitor, and maintain alongside Kafka itself.
KRaft (Kafka Raft) replaced ZooKeeper with a built-in Raft-based metadata store. A subset of Kafka brokers (the controllers) form a Raft quorum and manage all metadata internally. No external dependency.
The migration from ZooKeeper to KRaft was a multi-year effort. Kafka 3.3 (2022) made KRaft production-ready. Kafka 4.0 removed ZooKeeper support entirely. The lesson: swapping the consensus layer underneath a running system is one of the hardest migrations in distributed systems.
when consensus is overkill
Not every system needs consensus. If your data can tolerate brief inconsistency (a social media feed, a product catalog, a cache), eventual consistency with conflict resolution is simpler and faster.
Consensus protocols add latency (waiting for quorum acknowledgment), limit write throughput (single leader bottleneck), and complicate operations (managing quorum health). Use them when correctness demands it: financial transactions, lock management, configuration stores, leader election. Skip them when availability and latency matter more than strict ordering.