The Shard
When one machine can’t hold all your data, you split it across multiple machines: that’s sharding. Range-based sharding preserves order but creates hotspots. Hash-based sharding distributes evenly but kills range queries. Rebalancing is expensive either way; plan for it.
the problem
Vertical scaling has a ceiling. The biggest single server you can buy has finite RAM, finite disk, finite CPU. When your data outgrows it, you have two choices: delete data or split it across machines.
Splitting is sharding (also called partitioning). Each machine holds a subset of the data. The hard part: deciding which data goes where, and what happens when you need to change that decision.
range sharding
The simplest approach: divide the key space into contiguous ranges. Users A-F go to shard 1, G-M to shard 2, and so on.
def find_shard(key):
    # Route on the first character of the key (assumes uppercase keys).
    if key[0] <= 'F':   return shard_1
    elif key[0] <= 'M': return shard_2
    elif key[0] <= 'S': return shard_3
    else:               return shard_4

Range sharding preserves order: a range query like “all users from A to D” hits only one shard. But real data isn’t uniformly distributed. Names cluster around certain letters. Products starting with “iPhone” all land on the same shard. One shard gets hammered while others sit idle.
hash sharding
Hash the key, then assign based on the hash value. Keys that were neighbors in the alphabet end up on different shards.
MAX_HASH = 2**64    # size of the hash space we divide into ranges

def find_shard(key, num_shards):
    h = hash(key) % MAX_HASH                      # hash() can be negative; map into [0, MAX_HASH)
    shard_size = MAX_HASH // num_shards
    return min(h // shard_size, num_shards - 1)   # clamp the last sliver into the final shard

Hash sharding distributes data evenly regardless of key patterns. The same names that clustered on one range shard now spread across all shards.
The tradeoff: you’ve destroyed ordering. A range query “all users A-D” must now check every shard, a scatter-gather operation. Each shard scans its local data, the coordinator merges results. Latency is bounded by the slowest shard.
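A minimal sketch of that scatter-gather path, treating each shard as a plain in-memory collection of keys (a stand-in for a real shard’s local index):

def range_query(shards, lo, hi):
    # Fan out: each shard scans its own keys for the requested range.
    partials = [[k for k in shard if lo <= k <= hi] for shard in shards]
    # Gather: the coordinator merges the pieces and restores global order.
    return sorted(k for partial in partials for k in partial)

In practice the fan-out runs in parallel, which is why overall latency tracks the slowest shard rather than the sum of all of them.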
the hotspot problem
Even hash sharding can’t prevent all hotspots. If one key is extremely popular (a viral tweet, a celebrity’s profile), all reads for that key hit the same shard. The keys are evenly distributed, but the load isn’t.
Mitigations:
- Key splitting: Append a random suffix (0-9) to hot keys. Reads scatter across 10 shards and merge (see the sketch after this list). Instagram uses this for celebrity accounts.
- Caching: Put a cache layer in front of hot shards.
- Read replicas: Replicate the hot shard for read scaling.
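A toy sketch of key splitting for a hot counter, using one plain dict in place of the sharded store; SPLITS, incr_hot_counter, and read_hot_counter are illustrative names, not any real API:

import random

SPLITS = 10   # the hot key is spread across 10 sub-keys, which hash to different shards

def incr_hot_counter(store, key):
    # Each write lands on one randomly chosen sub-key.
    sub_key = f"{key}#{random.randrange(SPLITS)}"
    store[sub_key] = store.get(sub_key, 0) + 1

def read_hot_counter(store, key):
    # Reads scatter across all sub-keys and merge (here: a sum).
    return sum(store.get(f"{key}#{i}", 0) for i in range(SPLITS))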
rebalancing
Your cluster needs a new shard. Now what? You have to move data: some items from existing shards to the new one. This is rebalancing.
The naive approach: rehash everything with the new shard count. With hash(key) % N routing, changing N reassigns most keys, so most of your data moves. Terrible.
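A quick way to see how bad it is: count how many keys land on a different shard when hash(key) % N goes from 4 shards to 5.

keys = [f"user:{i}" for i in range(100_000)]

moved = sum(1 for k in keys if hash(k) % 4 != hash(k) % 5)
print(f"{moved / len(keys):.0%} of keys change shards")   # roughly 80%

Roughly four out of five keys move, even though you only added one shard.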
Better approaches:
- Fixed partitions: Pre-split your data into many more partitions than shards (e.g., 1000 partitions across 4 shards). Adding a shard just reassigns whole partitions; no item-level scanning (see the sketch after this list).
- Dynamic splitting: When a partition gets too large, split it in two. When too small, merge neighbors. DynamoDB does this automatically.
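A minimal sketch of the fixed-partitions idea, with a made-up partition_map dict standing in for whatever metadata store the cluster actually uses:

NUM_PARTITIONS = 1000                 # chosen up front, never changes

def partition_for(key):
    return hash(key) % NUM_PARTITIONS

# partition -> shard assignment; rebalancing only edits this map
partition_map = {p: p % 4 for p in range(NUM_PARTITIONS)}    # 4 shards

def find_shard(key):
    return partition_map[partition_for(key)]

# Adding a fifth shard: hand it every fifth partition, wholesale.
for p in range(0, NUM_PARTITIONS, 5):
    partition_map[p] = 4

Keys never rehash; only the partition-to-shard map changes, and whole partitions move as units.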
where it shows up
- MongoDB: Range or hash sharding. Chunks (64 MB default) are the unit of migration. The balancer moves chunks between shards automatically.
- Vitess: MySQL sharding layer built at YouTube. Supports range and hash-based sharding with automatic resharding.
- CockroachDB: Range-based with automatic splitting and rebalancing. Ranges are 512 MB by default.
- DynamoDB: Hash-based with automatic partition splitting. Partition keys determine distribution. Hot partitions are split adaptively.
shard key selection
The shard key is the most important decision in a sharded system. It determines data distribution, query efficiency, and operational complexity.
A good shard key has:
- High cardinality: Many distinct values. Sharding by a boolean field gives you 2 shards max.
- Even distribution: Values should spread data uniformly. Sharding by country puts 80% of data on the US shard.
- Query alignment: Queries should target one shard when possible. If you always query by user_id, shard by user_id.
A bad shard key creates hotspots, forces scatter-gather queries, or both.
+ directory-based sharding
Instead of computing the shard from the key, maintain a lookup table that maps keys (or key ranges) to shards. This is directory-based or lookup-based sharding.
Directory:
user:1-1000 → shard_1
user:1001-2000 → shard_2
user:2001-3000 → shard_1 (after rebalance)

Advantages: complete flexibility in data placement. You can move any range to any shard without changing the routing logic.
Disadvantages: the directory is a single point of failure and a bottleneck. Every read and write must consult it. In practice, the directory is replicated and cached.
Apache HBase uses a directory (the META table) to map row ranges to RegionServers. Vitess uses a VSchema that acts as a sharding directory.
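A sketch of that routing over the ranges above, assuming integer user ids and inclusive upper bounds:

import bisect

# Upper bounds of each range and the shard it currently maps to.
# Rebalancing edits this table; the routing code never changes.
range_bounds = [1000, 2000, 3000]
range_shards = ["shard_1", "shard_2", "shard_1"]

def find_shard(user_id):
    idx = bisect.bisect_left(range_bounds, user_id)
    if idx == len(range_bounds):
        raise KeyError(f"no range covers user {user_id}")
    return range_shards[idx]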
+ compound shard keys
A single field often isn’t enough. Compound shard keys combine multiple fields:
shard_key = (tenant_id, timestamp)

The first field determines the shard (hash of tenant_id). The second field provides ordering within the shard. You can efficiently query “all events for tenant X between dates A and B”; it hits one shard and uses the index.
MongoDB supports compound shard keys natively. The first field is the partition key (determines shard), subsequent fields are the sort key (determines order within the shard).
DynamoDB’s partition key + sort key model is the same concept. The partition key is hashed for distribution; the sort key enables range queries within a partition.
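A toy version of that routing, modeling each shard as a list of event dicts; the tenant_id and ts field names are illustrative:

NUM_SHARDS = 4

def route(tenant_id):
    # Only the leading field of the compound key picks the shard.
    return hash(tenant_id) % NUM_SHARDS

def events_for_tenant(shards, tenant_id, start_ts, end_ts):
    # Hits exactly one shard; the timestamp range scan stays local to it.
    shard = shards[route(tenant_id)]
    return [e for e in shard
            if e["tenant_id"] == tenant_id and start_ts <= e["ts"] <= end_ts]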
+ geo-sharding
Shard by geographic region to keep data close to users. European users’ data stays in EU datacenters, US users in US datacenters.
shard_key = region(user.country)
EU shard: Frankfurt datacenter
US shard: Virginia datacenter
APAC shard: Singapore datacenter

This reduces read latency (data is local) and can help with data residency requirements (GDPR restricts moving EU personal data outside the EU without adequate safeguards).
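Routing itself is a small lookup; the hard part is everything around it. A sketch with made-up country and region tables:

REGION_OF_COUNTRY = {"DE": "eu", "FR": "eu", "US": "us", "SG": "apac"}
SHARD_OF_REGION = {"eu": "frankfurt", "us": "virginia", "apac": "singapore"}

def find_shard(country_code):
    region = REGION_OF_COUNTRY.get(country_code, "us")   # arbitrary fallback region
    return SHARD_OF_REGION[region]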
CockroachDB supports geo-partitioning natively. You annotate tables with locality rules, and the system automatically places data in the right region. Spanner uses similar placement policies.
The challenge: cross-region queries. A report aggregating all users globally must contact every shard. And user migration (EU user moves to US) requires data migration.
+ cross-shard queries and joins
The hardest problem in sharded databases: queries that span multiple shards.
Scatter-gather: Send the query to all shards, collect results, merge. Simple but latency is bounded by the slowest shard. Common for search, analytics, and range queries on non-shard-key fields.
Cross-shard joins: Joining data from different shards is expensive. If users are on shard 1 and orders are on shard 3, joining them requires a network round-trip. Solutions:
- Denormalize: Store user data alongside each order. Trades storage for query speed.
- Co-locate: Shard both tables by the same key (user_id). Related data lands on the same shard (see the sketch after this list). Vitess and Citus (PostgreSQL) support this.
- Global tables: Small reference tables (countries, currencies) replicated to every shard. Joins against them are always local.
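A sketch of co-location: users and orders are both sharded by user_id, so a per-user join never leaves its shard (the dict-and-list shard layout is just for illustration):

NUM_SHARDS = 4

def find_shard(user_id):
    return hash(user_id) % NUM_SHARDS

def user_with_orders(user_shards, order_shards, user_id):
    s = find_shard(user_id)          # same shard index for both tables
    user = user_shards[s][user_id]
    orders = [o for o in order_shards[s] if o["user_id"] == user_id]
    return user, orders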
Cross-shard transactions: even harder than joins. Two-phase commit (2PC) works but is slow and blocks on coordinator failure. Most sharded systems avoid cross-shard transactions entirely or provide weaker guarantees (eventual consistency).
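For completeness, the shape of two-phase commit, with prepare/commit/abort as hypothetical methods on a shard client; a real coordinator also has to log its decision durably and survive crashes, which is exactly where 2PC gets painful:

def two_phase_commit(shards, txn):
    # Phase 1: every shard votes. A single "no" dooms the transaction.
    votes = [shard.prepare(txn) for shard in shards]
    if all(votes):
        for shard in shards:
            shard.commit(txn)    # Phase 2: everyone commits
        return True
    for shard in shards:
        shard.abort(txn)         # roll back; assumed safe even if this shard never prepared
    return False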
production stories
DynamoDB’s partition evolution
DynamoDB originally gave each table a fixed number of partitions based on provisioned throughput. Each partition handled up to 3,000 reads or 1,000 writes per second. If your table had 10,000 write capacity, you got 10 partitions.
The problem was hot partitions. If traffic was unevenly distributed across partition keys, one partition would throttle while others had spare capacity. Users saw throttling even when their total throughput was well below the provisioned limit.
The fix came in 2018 with adaptive capacity. DynamoDB now detects hot partitions and borrows capacity from cold ones. A hot partition can burst to the full table-level throughput if other partitions aren’t busy.
Later, on-demand mode removed provisioning entirely. DynamoDB splits and merges partitions automatically based on traffic patterns. The partition key still matters (a single hot key can still throttle), but the system is far more forgiving.
MongoDB’s chunk migration
MongoDB divides each sharded collection into chunks, contiguous ranges of shard key values (64 MB by default). The balancer runs continuously, moving chunks between shards to keep them even.
Chunk migration is a multi-step process: the source shard copies documents to the destination, then deletes its copies. During migration, the source shard is still serving reads and writes for that chunk. Writes during migration are forwarded to the destination.
The operational lesson: chunk migration is I/O intensive. Running it during peak hours can degrade performance. MongoDB lets you set a balancing window so migrations run only during off-peak hours.
The bigger lesson: choosing the right shard key avoids most migration pain. A well-chosen key creates naturally balanced chunks. A poorly chosen key (like a monotonically increasing timestamp) creates a single hot chunk that constantly splits and migrates.
Vitess at YouTube
YouTube’s MySQL infrastructure was monolithic: one giant MySQL instance. As it grew, they built Vitess to transparently shard MySQL without changing application code.
Vitess introduces VSchema, a sharding schema that maps tables to keyspaces and defines shard key columns. The vtgate proxy routes queries to the right shard(s), handling scatter-gather for cross-shard queries.
Resharding in Vitess is online: it uses MySQL replication to build the new shards in parallel, then cuts over traffic. The application never sees downtime.
Key design decisions: Vitess avoids cross-shard joins at the database level. Instead, it pushes join logic to the application or uses lookup tables (essentially directory-based sharding) to co-locate related data.
when NOT to shard
Sharding is a last resort. It adds enormous operational complexity:
- Vertical scaling first: A single PostgreSQL instance can handle hundreds of millions of rows. A beefy server with NVMe SSDs and 512 GB RAM gets you surprisingly far.
- Read replicas first: If reads are the bottleneck, replicas are simpler than sharding.
- Caching first: A Redis layer in front of your database handles most read hotspots.
- Archive old data: If your table is huge because of historical data nobody reads, move old rows to cold storage.
Only shard when: (1) a single machine genuinely cannot handle your write throughput, (2) your data is too large for any single machine, or (3) you need data locality across regions. If none of these are true, you’re adding complexity for no benefit.