
The Consensus

[interactive demo: consensus inspector. Click Step or Play to inspect cluster state (simulation.log).]

the problem

You have five servers. A client writes x = 5. All five servers need to agree that x = 5, in the same order, even if one of them crashes mid-write. That is the consensus problem.

It sounds simple until you consider: messages get delayed, reordered, or lost. Servers crash and restart. The network partitions, splitting your cluster in half. Each half thinks the other is dead.

Without consensus, each server might end up with a different value. Your database returns different answers depending on which server you ask. That is not a database. That is five databases pretending to be one.

the leader

Raft’s key insight: elect a single leader and funnel all writes through it. The leader decides the order of operations. Followers just replicate what the leader tells them.

This simplifies everything. There is no need to resolve conflicting writes from multiple servers. The leader is the single source of truth for ordering. If you want to write, talk to the leader. If you want to read (with strong consistency), talk to the leader.

Client -> Leader -> Followers
            |
            v
       "x=5 is entry #47"

The leader assigns each write a sequential log index. Followers apply entries in that order. Every node ends up with the same log.
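Here is that idea as a minimal Go sketch. LogEntry, Leader, and Append are names I made up for illustration, not any real library's API:

// Sketch: a leader assigning sequential log indexes.
package main

import "fmt"

type LogEntry struct {
	Term    int    // term in which the leader received the command
	Index   int    // position in the log; defines the global order
	Command string // e.g. "x=5"
}

type Leader struct {
	currentTerm int
	log         []LogEntry
}

// Append assigns the next sequential index and appends the entry.
// Followers replicate entries in exactly this order.
func (l *Leader) Append(cmd string) LogEntry {
	entry := LogEntry{
		Term:    l.currentTerm,
		Index:   len(l.log) + 1, // Raft indexes conventionally start at 1
		Command: cmd,
	}
	l.log = append(l.log, entry)
	return entry
}

func main() {
	leader := &Leader{currentTerm: 2}
	fmt.Println(leader.Append("x=5")) // {2 1 x=5}
	fmt.Println(leader.Append("y=3")) // {2 2 y=3}
}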

terms and votes

What happens when the leader dies? The remaining nodes need to elect a new one. Raft uses terms, monotonically increasing numbers that act as a logical clock. Every election and every log entry is tagged with a term, which is how nodes tell a current leader from a stale one.

Each term has at most one leader. When a follower’s election timeout expires without hearing from a leader, it increments its term and becomes a candidate. It votes for itself and sends RequestVote RPCs to all other nodes.

Term 1: N3 is leader
Term 1: N3 crashes
Term 2: N1 starts election, votes for self
Term 2: N2, N4, N5 grant votes
Term 2: N1 becomes leader (4/5 votes, majority = 3)
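
The candidate side, sketched in Go. Node, RequestVoteArgs, and startElection are illustrative names, not a real implementation:

// Sketch: on election timeout, bump the term, vote for yourself,
// and ask every peer for its vote.
package main

import "fmt"

type RequestVoteArgs struct {
	Term        int
	CandidateID int
}

type Node struct {
	id          int
	currentTerm int
	votedFor    int // -1 means no vote cast this term
	peers       []int
}

// startElection is what a follower does when its election timeout fires.
func (n *Node) startElection() []RequestVoteArgs {
	n.currentTerm++   // move to a new term
	n.votedFor = n.id // vote for self; one vote toward the majority

	// Build a RequestVote RPC for every peer. A real node sends these over
	// the network and becomes leader once a majority grant their votes.
	rpcs := make([]RequestVoteArgs, 0, len(n.peers))
	for range n.peers {
		rpcs = append(rpcs, RequestVoteArgs{Term: n.currentTerm, CandidateID: n.id})
	}
	return rpcs
}

func main() {
	n1 := &Node{id: 1, currentTerm: 1, votedFor: -1, peers: []int{2, 3, 4, 5}}
	fmt.Println(n1.startElection()) // four RequestVote RPCs for term 2
}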

A node grants its vote to the first candidate it hears from in a given term (subject to the election restriction below). One vote per term per node. This guarantees at most one leader per term. If a node sees a higher term number, it immediately steps down to follower. No arguments, no negotiation.
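
The voter side, same caveat: a hedged sketch, not a library API. One vote per term, the first qualified candidate gets it, and a higher term forces an immediate step-down. A real implementation would also apply the election restriction described further down.

// Sketch: deciding whether to grant a vote.
package main

import "fmt"

type RequestVoteArgs struct {
	Term        int
	CandidateID int
}

type Node struct {
	currentTerm int
	votedFor    int // candidate ID voted for in currentTerm; -1 = none
	isLeader    bool
}

// handleRequestVote decides whether to grant a vote to the candidate.
func (n *Node) handleRequestVote(args RequestVoteArgs) bool {
	// A higher term always wins: adopt it and step down.
	if args.Term > n.currentTerm {
		n.currentTerm = args.Term
		n.votedFor = -1
		n.isLeader = false
	}
	// Never vote for a candidate from an older term.
	if args.Term < n.currentTerm {
		return false
	}
	// One vote per term: grant it only if we have not voted yet
	// (or this is a retry from the same candidate).
	if n.votedFor == -1 || n.votedFor == args.CandidateID {
		n.votedFor = args.CandidateID
		return true
	}
	return false
}

func main() {
	n2 := &Node{currentTerm: 1, votedFor: -1}
	fmt.Println(n2.handleRequestVote(RequestVoteArgs{Term: 2, CandidateID: 1})) // true
	fmt.Println(n2.handleRequestVote(RequestVoteArgs{Term: 2, CandidateID: 4})) // false: already voted in term 2
}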

log replication

Once elected, the leader accepts client requests and appends them to its log. It then sends AppendEntries RPCs to all followers with the new entries.

When a majority of nodes (the quorum) have written the entry to their logs, the leader considers it committed. Committed entries are durable. They will survive any future leader election because the election restriction (below) guarantees the new leader has all committed entries.

Leader log:    [1:x=5] [2:y=3]
Follower N1:   [1:x=5] [2:y=3]  ack
Follower N2:   [1:x=5] [2:y=3]  ack
Follower N4:   [1:x=5]          slow
Follower N5:   [1:x=5] [2:y=3]  ack

4/5 have entry 2 (quorum = 3) -> entry 2 committed
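
Roughly how a leader could compute that commit point, assuming it tracks a matchIndex per follower (the highest log index known to be replicated there). This sketch skips one real-Raft detail: a leader only commits entries from its own term by counting replicas.

// Sketch: an entry is committed once a majority of logs contain it.
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest index stored on a majority of nodes.
// The leader's own log counts toward the quorum.
func commitIndex(leaderLastIndex int, matchIndex []int) int {
	// Collect everyone's highest replicated index, leader included.
	all := append([]int{leaderLastIndex}, matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(all)))

	// The value at position (majority - 1) is held by at least a majority.
	majority := len(all)/2 + 1
	return all[majority-1]
}

func main() {
	// Leader has entries 1..2; N1, N2, N5 have both, N4 only has entry 1.
	fmt.Println(commitIndex(2, []int{2, 2, 1, 2})) // 2 -> entry 2 is committed
}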

Slow followers catch up. The leader retries AppendEntries until every follower’s log matches. If a follower has stale or divergent entries, the leader overwrites them. The leader’s log is always authoritative.
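
And a sketch of the follower’s half of that repair. The consistency check (prevIndex/prevTerm) is what lets the leader find where the logs agree; everything after that point is overwritten. A real implementation only truncates from the first actual conflict, but the effect is the same. Types and names here are illustrative:

// Sketch: the follower side of AppendEntries.
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

type Follower struct {
	log []LogEntry // log[0] is index 1
}

// appendEntries returns false if the leader must retry with an earlier prevIndex.
func (f *Follower) appendEntries(prevIndex, prevTerm int, entries []LogEntry) bool {
	// Consistency check: we must already hold the entry just before the new ones.
	if prevIndex > 0 {
		if len(f.log) < prevIndex || f.log[prevIndex-1].Term != prevTerm {
			return false // leader will decrement prevIndex and retry
		}
	}
	// The leader's log is authoritative: drop anything after prevIndex,
	// then append the leader's entries.
	f.log = append(f.log[:prevIndex], entries...)
	return true
}

func main() {
	f := &Follower{log: []LogEntry{{Term: 1, Command: "x=5"}, {Term: 1, Command: "x=9"}}}
	// Leader sends entry 2 ("y=3") with prevIndex=1, prevTerm=1; the stale "x=9" is overwritten.
	fmt.Println(f.appendEntries(1, 1, []LogEntry{{Term: 2, Command: "y=3"}}), f.log)
}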

safety: the election restriction

Not just any node can become leader. A voter grants its vote only if the candidate’s log is at least as up-to-date as its own: the candidate’s last entry must have a higher term, or the same term and a log at least as long. Voters reject candidates with shorter or older logs.

This is the key safety property: a new leader always has every committed entry. A committed entry lives on a majority of nodes, and a candidate needs votes from a majority, so at least one voter already holds the entry and will refuse any candidate whose log is missing it. Without this restriction, a newly elected leader could overwrite committed data, violating consensus.
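
The up-to-date comparison itself is two rules: compare the term of the last log entry first, and fall back to log length on a tie. A sketch:

// Sketch: the election restriction's "at least as up-to-date" check.
package main

import "fmt"

// candidateIsUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's: higher last term wins; on equal terms, the
// longer (or equal-length) log wins.
func candidateIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	fmt.Println(candidateIsUpToDate(3, 10, 2, 50)) // true: newer last term wins
	fmt.Println(candidateIsUpToDate(2, 40, 2, 50)) // false: same term, shorter log
}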

where it shows up

  • etcd: The coordination store behind Kubernetes. Uses Raft for all key-value writes. Every kubectl apply ultimately goes through Raft consensus.
  • ZooKeeper: Uses ZAB (ZooKeeper Atomic Broadcast), a protocol closely related to Raft. Coordinates distributed locks, configuration, and leader election for other systems.
  • CockroachDB: Runs a Raft group per range (chunk of data). Thousands of independent Raft groups operating in parallel across the cluster.
  • Consul: HashiCorp’s service mesh uses Raft for its key-value store and service catalog. Every service registration is a Raft write.
  • Kafka (KRaft): Kafka replaced ZooKeeper with its own Raft implementation (KRaft) for metadata management. Broker membership, topic configuration, and partition assignments all go through Raft.