The Replica
One copy of your data is a single point of failure. Replicas fix availability but introduce lag, stale reads, and conflicts. Leader-follower is simplest, multi-leader handles geo-distribution, and leaderless gives you tunable consistency with the W+R>N formula.
the problem
Your database lives on one server. That server has a disk, and disks fail. When it goes down, your entire application goes with it. Even if you have backups, restoring takes time: minutes to hours of downtime.
The fix sounds simple: keep copies. But the moment you have two copies of the same data, you have a new problem: keeping them in sync. Every write has to reach all copies, and that takes time. During that window, different copies have different data. Welcome to replication.
leader-follower
The most common setup: one node is the leader (also called primary or master). All writes go to the leader. The leader sends a stream of changes to followers (replicas, secondaries). Reads can go to any node.
# write path — always through the leader
def write(key, value):
    leader.apply(key, value)
    for follower in followers:
        follower.replicate(key, value)  # async

# read path — any node
def read(key):
    return any_node.get(key)
This is how PostgreSQL streaming replication, MySQL replication, and MongoDB replica sets work.
replication lag
If replication is asynchronous (and it almost always is), followers lag behind. A client that writes to the leader and immediately reads from a follower might get stale data.
This gap between what the leader knows and what followers know is replication lag. Under normal load it’s milliseconds. Under heavy write traffic or network issues, it can be seconds or more.
After a write, followers don’t update instantly; a read that lands on a lagging follower returns the old value.
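A minimal sketch of the effect, using in-memory dicts as stand-ins for a leader and one follower (nothing here is a real driver API):
import threading
import time

leader_store = {}
follower_store = {}

def replicate_async(key, value, lag_seconds=0.5):
    # simulate replication lag: the follower applies the write later
    def apply_later():
        time.sleep(lag_seconds)
        follower_store[key] = value
    threading.Thread(target=apply_later, daemon=True).start()

def write(key, value):
    leader_store[key] = value        # the leader acknowledges immediately
    replicate_async(key, value)      # the follower catches up later

write("user:1", "Alice")
print(follower_store.get("user:1"))  # None: the follower hasn't applied it yet
time.sleep(1)
print(follower_store.get("user:1"))  # 'Alice': the lag has closed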
failover
When the leader dies, a follower must be promoted. This is failover:
- Detect the leader is down (usually via heartbeat timeout)
- Choose a follower, typically the one with the least lag
- Promote it to leader
- Redirect all writes to the new leader
The dangerous part: if the old leader had writes that didn’t replicate before it died, those writes are lost. The new leader has a slightly older view of the data. In practice, systems like PostgreSQL use synchronous replication for at least one follower to prevent this.
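A rough sketch of that sequence, assuming each node object exposes a last heartbeat time, a last-applied log position, and promote/follow methods (all hypothetical names, not a real client API):
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before we declare the leader dead

def leader_is_down(leader):
    return time.time() - leader.last_heartbeat > HEARTBEAT_TIMEOUT

def failover(leader, followers):
    if not leader_is_down(leader):
        return leader
    # promote the follower that has applied the most of the leader's log
    candidate = max(followers, key=lambda f: f.last_applied_position)
    candidate.promote()                  # start accepting writes
    for follower in followers:
        if follower is not candidate:
            follower.follow(candidate)   # repoint the remaining followers
    return candidate                     # clients must send writes here now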
multi-leader
What if your users are in New York and London? With a single leader in New York, London writes cross the Atlantic, adding 80ms+ of latency to every write.
Multi-leader replication puts a leader in each datacenter. Local writes are fast. Leaders sync with each other asynchronously.
The problem: two leaders can accept conflicting writes to the same key. Client A writes user:1='Alice' in New York while Client B writes user:1='Bob' in London. When the leaders sync, they disagree.
Conflict resolution strategies:
- Last-Write-Wins (LWW): highest timestamp wins. Simple but lossy; one write is silently discarded.
- Merge: combine values (works for counters, sets; doesn’t work for arbitrary data).
- Application-level: surface the conflict to the user (like Google Docs showing concurrent edits).
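A minimal last-write-wins sketch of the New York/London example above: each write carries a timestamp, and the merge keeps the highest one (the timestamps are made up):
# each replica stores (value, timestamp) per key
def lww_merge(local, remote):
    """Return the winning (value, timestamp) pair for one key."""
    return local if local[1] >= remote[1] else remote

new_york = {"user:1": ("Alice", 1700000000.120)}
london   = {"user:1": ("Bob",   1700000000.450)}

winner = lww_merge(new_york["user:1"], london["user:1"])
print(winner)  # ('Bob', 1700000000.45): Alice's write is silently discarded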
leaderless replication
Instead of routing writes through a leader, send them to all replicas. A write succeeds when W out of N nodes acknowledge it. A read queries R nodes and takes the latest version.
The consistency formula: W + R > N guarantees you’ll always read the latest write. At least one of the R nodes you read from must have the latest W-acknowledged write.
N = 3 replicas
W = 2 (write to 2 nodes)
R = 2 (read from 2 nodes)
W + R = 4 > 3 = N ✓ guaranteed overlap
Break the formula (W=1, R=1) and you can read stale data: the one node you read from might not be the one that got the write.
This is how Cassandra and DynamoDB’s consistency model works. You tune W and R per query to trade consistency for latency.
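A toy version of the quorum math in Python. It isn't the Cassandra or DynamoDB protocol; it just models "only W replicas got the write, only R replicas answer the read" to show why the overlap is guaranteed:
import random

N, W, R = 3, 2, 2
replicas = [dict() for _ in range(N)]    # each replica maps key -> (version, value)

def quorum_write(key, value, version):
    # the write goes to all replicas, but only W are guaranteed to have it;
    # model that by writing to a random W-sized subset
    for replica in random.sample(replicas, W):
        replica[key] = (version, value)

def quorum_read(key):
    # query R replicas and keep the value with the highest version seen
    responses = [r[key] for r in random.sample(replicas, R) if key in r]
    return max(responses)[1] if responses else None

quorum_write("user:1", "Alice", version=1)
print(quorum_read("user:1"))   # always 'Alice': with W + R = 4 > 3 = N,
                               # the read set must overlap the write set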
where it shows up
- PostgreSQL: streaming replication with synchronous/async modes. pg_basebackup for initial follower setup.
- MySQL: binlog-based replication. Group Replication adds multi-leader with conflict detection.
- MongoDB: replica sets with automatic failover. Write concern and read preference control consistency.
- Cassandra: leaderless with tunable consistency. QUORUM reads/writes give you W+R>N automatically.
- Amazon DynamoDB: leaderless under the hood. ConsistentRead=true forces reads from the leader partition.
sync vs async in depth
The fundamental tradeoff in replication is durability vs latency.
Synchronous replication: the leader waits for at least one follower to confirm before acknowledging the write to the client. If the leader dies, no data is lost; the synced follower has everything. But every write pays the network round-trip penalty.
Asynchronous replication: the leader acknowledges immediately, then streams changes to followers. Writes are fast, but if the leader dies before replicating, those writes are gone.
Most production systems use semi-synchronous: one follower is synchronous (the “sync standby”), the rest are async. This gives you durability for all committed writes without the latency of syncing to every follower.
Semi-synchronous:
Client → Leader → Sync Follower (wait for ACK)
                → Async Follower 1 (fire and forget)
                → Async Follower 2 (fire and forget)
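A sketch of that acknowledgement logic, assuming one designated sync standby and fire-and-forget calls to the rest (the follower objects and their methods are illustrative):
leader_log = []

def commit(write, sync_follower, async_followers, timeout=1.0):
    leader_log.append(write)                      # apply on the leader first
    if not sync_follower.replicate(write, timeout=timeout):
        raise TimeoutError("sync standby didn't confirm; the write isn't durable yet")
    for follower in async_followers:
        follower.replicate_async(write)           # fire and forget
    return "committed"                            # ACK the client only after the sync standby confirms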
+ conflict resolution strategies
When multi-leader or leaderless systems produce conflicts, you need a resolution strategy:
Last-Write-Wins (LWW): attach a timestamp to each write. Highest timestamp wins. Simple but dangerous; clocks drift between datacenters, and you silently lose writes. Cassandra uses this by default.
Version vectors: each node maintains a version counter. On write, increment your counter. On merge, take the element-wise maximum. Conflicts are detected when neither version dominates the other. Riak used this approach.
CRDTs (Conflict-free Replicated Data Types): data structures designed to merge automatically. A G-Counter (grow-only counter) tracks per-node counts and sums them. A G-Set (grow-only set) takes the union. More complex CRDTs handle add/remove (OR-Set) and even text editing (RGA). Automerge and Yjs are popular CRDT libraries.
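A G-Counter is small enough to write out in full: each node increments only its own slot, and merge takes the element-wise maximum, so replicas converge no matter the merge order.
class GCounter:
    """Grow-only counter CRDT: one slot per node, merge is element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}                 # node_id -> count

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# two replicas count independently, then sync in both directions
a, b = GCounter("ny"), GCounter("ldn")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
print(a.value(), b.value())   # 5 5: both converge to the same total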
Custom application logic: surface conflicts to the application. CouchDB stores all conflicting revisions and lets the app pick a winner. This is the most flexible but pushes complexity to the developer.
+ chain replication
An alternative to broadcast replication: arrange nodes in a chain. Writes go to the head, propagate through each node in sequence, and are acknowledged by the tail. Reads go to the tail, which has all committed writes.
Write → Head → Node2 → Node3 → Tail → ACK
Read → Tail (always up-to-date for committed writes)
Benefits: strong consistency without quorum overhead. The tail always has the latest committed data, so reads are always consistent. Throughput scales because each node only sends to one downstream neighbor.
Used by HDFS write pipelines (blocks propagate from DataNode to DataNode) and Azure Storage’s stream layer; the approach was formalized in van Renesse and Schneider’s chain replication paper (2004). Facebook’s f4 blob storage uses a variant.
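A toy chain in Python: writes enter at the head and are acknowledged only when they reach the tail, and reads always go to the tail (the class and node names are illustrative):
class ChainNode:
    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor
        self.store = {}

    def write(self, key, value):
        self.store[key] = value
        if self.successor:                   # not the tail yet: pass the write down the chain
            return self.successor.write(key, value)
        return "ack"                         # tail reached: the write is committed

tail = ChainNode("tail")
mid  = ChainNode("node2", successor=tail)
head = ChainNode("head",  successor=mid)

print(head.write("user:1", "Alice"))   # 'ack' only after every node has the write
print(tail.store["user:1"])            # reads hit the tail, which has all committed writes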
+ read-your-writes consistency
A user writes a comment and refreshes the page. If the read goes to an async follower that hasn’t replicated yet, the comment disappears. This violates read-your-writes consistency: the user should always see their own writes.
Fixes:
- Route reads to the leader for data the user recently modified (e.g., “read from leader if the user wrote within the last 10 seconds”).
- Track replication position: the client remembers the leader’s log position at write time. On read, wait until the follower has caught up to that position.
- Session stickiness: pin the user’s reads to the follower they wrote through. Simple but breaks if the follower dies.
PostgreSQL’s synchronous_commit with remote_apply guarantees read-your-writes by waiting until followers have applied the change, not just received it.
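A sketch of the "track replication position" fix from the list above: the client remembers the leader's log position at write time, and a follower serves the read only once it has replayed past that position (the objects and methods are hypothetical):
import time

def write_comment(leader, user_id, text):
    leader.insert("comments", {"user": user_id, "text": text})
    return leader.current_log_position()      # remember where the leader's log was after this write

def read_comments(follower, user_id, min_position, timeout=2.0):
    deadline = time.time() + timeout
    while follower.replay_position() < min_position:
        if time.time() > deadline:
            raise TimeoutError("follower too far behind; fall back to the leader")
        time.sleep(0.05)                      # wait for the follower to catch up
    return follower.query("comments", user=user_id)

# usage: the position returned at write time gates the subsequent read
# pos = write_comment(leader, "user:1", "nice post")
# comments = read_comments(follower, "user:1", min_position=pos)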
+ replication topologies
How changes flow between nodes matters:
Star (hub-and-spoke): one central leader, all followers connect to it. Simple but the leader is a bottleneck.
All-to-all: every leader replicates to every other leader. Maximum redundancy but O(n²) connections.
Circular: each leader forwards changes to the next. Low overhead but a single failure breaks the chain.
MySQL’s multi-source replication supports all three. MongoDB uses oplog tailing from the primary. PostgreSQL logical replication supports pub/sub topologies.
production stories
PostgreSQL streaming replication
PostgreSQL writes all changes to a Write-Ahead Log (WAL). Streaming replication sends WAL records to followers in real-time. The follower replays them to stay in sync.
The synchronous_standby_names setting controls which followers are synchronous. With FIRST 1 (standby1, standby2), at least one must confirm each commit. If both sync standbys die, writes block; availability traded for durability.
Hot standby lets followers serve read queries while replaying WAL. This is how most PostgreSQL read scaling works: one writer, N readers. Replication lag is typically sub-millisecond on the same network.
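One way to watch that lag from Python, assuming a psycopg2 connection to the leader (the view and functions are standard PostgreSQL; the connection string is a placeholder):
import psycopg2

conn = psycopg2.connect("dbname=app host=leader.internal")  # placeholder DSN
with conn.cursor() as cur:
    # pg_stat_replication on the leader has one row per connected standby
    cur.execute("""
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication
    """)
    for name, lag_bytes in cur.fetchall():
        print(f"{name}: {lag_bytes} bytes behind")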
MongoDB replica sets
MongoDB groups 3-7 nodes into a replica set. One is primary, the rest are secondaries. The primary uses an oplog (operation log), a capped collection that secondaries tail.
Failover is automatic: if the primary’s heartbeat stops, secondaries hold an election. The secondary with the most recent oplog entry (highest optime) usually wins. Elections take 1-2 seconds with MongoDB 4.0+.
Write concern controls durability: w: "majority" waits for a majority of nodes. Read preference controls consistency: primaryPreferred reads from the primary when available, falls back to secondaries.
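In pymongo these are per-collection (or per-operation) options, roughly like this (the URI and collection names are placeholders):
from pymongo import MongoClient, ReadPreference
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://node1,node2,node3/?replicaSet=rs0")  # placeholder URI
db = client.app

comments = db.get_collection(
    "comments",
    write_concern=WriteConcern(w="majority"),            # wait for a majority of nodes
    read_preference=ReadPreference.PRIMARY_PREFERRED,    # read from the primary when it's up
)

comments.insert_one({"user": "user:1", "text": "hello"})
print(comments.find_one({"user": "user:1"}))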
The risk: during failover, a secondary might not have all the primary’s writes. MongoDB handles this by rolling back the un-replicated writes on the old primary when it rejoins the set.
Cassandra’s tunable consistency
Cassandra is leaderless; every node can accept reads and writes. Consistency is tuned per query:
- ONE: fastest, least consistent (W=1 or R=1)
- QUORUM: majority of replicas (W=⌊N/2⌋+1, R=⌊N/2⌋+1). Guarantees W+R>N.
- ALL: every replica must respond. Strongest consistency, worst availability.
- LOCAL_QUORUM: quorum within the local datacenter only. Good for multi-DC setups.
In practice, most Cassandra deployments use QUORUM for writes and QUORUM for reads. This gives strong consistency while tolerating the failure of a minority of replicas (one node at the typical replication factor of 3).
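With the DataStax Python driver, the consistency level is attached per statement, roughly like this (contact points, keyspace, and table are placeholders):
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])  # placeholder contact points
session = cluster.connect("app")                          # placeholder keyspace

write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,            # W = majority of replicas
)
session.execute(write, ("user:1", "Alice"))

read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,            # R = majority, so W + R > N
)
print(session.execute(read, ("user:1",)).one())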
Read repair happens automatically: when a read returns stale data from some replicas, the coordinator sends the latest value back to the stale nodes. Anti-entropy repair (nodetool repair) periodically syncs all replicas using Merkle trees.
when NOT to replicate
Replication adds complexity. Sometimes it’s not worth it:
- Stateless services: if your service has no local state, just run more instances. No replication needed.
- Cache layers: cache invalidation is already hard. Replicating caches adds another consistency layer. It’s often simpler to let each cache instance warm independently.
- Ephemeral data: session tokens, rate limit counters, temporary uploads. If losing them means a user logs in again, that’s acceptable.
- Small datasets: if your entire dataset fits in a single node with comfortable margin, a simple backup/restore strategy might be simpler than live replication.
The right question: what happens if this data is unavailable for 5 minutes? If the answer is “nothing critical,” simple backups beat live replication.