The Engine
A storage engine determines how data is organized on disk. B-trees keep data sorted in a balanced tree structure, making reads fast with O(log N) lookups but paying a write-amplification penalty for in-place page updates. LSM-trees buffer writes in memory and flush sorted runs to disk, making writes fast but paying for it with read amplification across multiple levels. The choice between them depends on your workload: read-heavy favors B-trees, write-heavy favors LSM-trees.
the problem
Databases store data on disk. Disk is slow. The storage engine’s job is to organize data on disk so that reads and writes are as efficient as possible. The catch: you cannot optimize for both at the same time.
Every storage engine makes a tradeoff between read performance and write performance. B-trees optimize for reads by keeping data sorted and accessible through balanced tree lookups. LSM-trees optimize for writes by converting random writes into sequential ones. Understanding this tradeoff is the key to choosing the right database for your workload.
B-tree
The B-tree is the most widely used data structure in databases. PostgreSQL, MySQL (InnoDB), SQLite, and most relational databases use B-trees as their primary storage engine.
A B-tree organizes data in a balanced tree of fixed-size pages (typically 4KB or 16KB). The root page contains key ranges that point to child pages. Internal pages subdivide the range further. Leaf pages contain the actual data (or pointers to it).
Reads: Start at the root, follow pointers based on key comparisons, and arrive at the leaf containing the target key. The number of page reads equals the tree height, which is O(log N). A tree with a branching factor of 500 and 4 levels can index 500^4 = 62.5 billion keys.
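Here is a toy sketch of that read path, with plain Python objects standing in for 4KB disk pages and `bisect` doing the within-page search (names and structure are illustrative, not any real engine's layout):

```python
# Toy B-tree lookup: descend from the root, picking the child whose key
# range contains the target, until a leaf is reached. Each loop iteration
# corresponds to one page read, so the count equals the tree height.
from bisect import bisect_right

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted separator keys (internal) or row keys (leaf)
        self.children = children    # child pages, len(keys) + 1 of them (internal only)
        self.values = values        # row data, parallel to keys (leaf only)

def search(page, key):
    while page.children is not None:                    # descend internal pages
        page = page.children[bisect_right(page.keys, key)]
    i = bisect_right(page.keys, key) - 1                # binary search within the leaf
    if i >= 0 and page.keys[i] == key:
        return page.values[i]
    return None

leaf1 = Page(keys=[1, 5], values=["a", "b"])
leaf2 = Page(keys=[10, 17], values=["c", "d"])
root = Page(keys=[10], children=[leaf1, leaf2])
assert search(root, 17) == "d"
assert search(root, 7) is None
```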
Writes: Navigate to the correct leaf page (same path as a read), then update the page in place. If the page is full, it splits into two pages and the parent is updated. Splits can cascade up the tree, but this is rare with typical branching factors.
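The split step in isolation, assuming a leaf is just a sorted list of keys with a fixed capacity:

```python
# Splitting a full leaf: keep the lower half in place, move the upper half
# to a new page, and hand the first key of the new page up to the parent
# as a separator. Cascading splits happen when the parent is also full.
def split_leaf(keys):
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    separator = right[0]     # key the parent will use to route lookups
    return left, right, separator

left, right, sep = split_leaf([3, 8, 14, 21, 30, 42])
# left = [3, 8, 14], right = [21, 30, 42], sep = 21
```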
The cost of B-tree writes is write amplification. A single logical write (inserting one row) may require multiple disk writes: the leaf page, the parent page if the leaf splits, and the WAL. Each page write persists a full 4KB or 16KB page even if the logical change was a 100-byte row. In practice, write amplification is 10-30x for B-trees.
LSM-tree
The Log-Structured Merge-tree takes the opposite approach. Instead of updating data in place, it appends all writes to an in-memory buffer (the memtable). When the memtable is full, it is flushed to disk as a sorted, immutable file (an SSTable).
Writes: Append to the memtable (a sorted in-memory structure such as a red-black tree or skip list). This is a single memory write, plus a sequential append to the write-ahead log for durability (covered later); there is no random disk I/O until the memtable fills up. Writes are fast.
Reads: Check the memtable first (memory lookup). If not found, check the most recent SSTable on disk. If not found, check older SSTables. In the worst case, a read must check every level. This is read amplification.
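A toy sketch of both paths, with a dict standing in for the memtable and dicts standing in for sorted, immutable SSTables (hypothetical names, not any real engine's API):

```python
# Toy LSM-tree write/read path. Newest data wins on reads.
class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []              # flushed runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value      # pure in-memory write
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Flush the memtable to "disk" as an immutable sorted run.
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:        # 1. memory lookup
            return self.memtable[key]
        for sstable in self.sstables:   # 2. newest to oldest: read amplification
            if key in sstable:
                return sstable[key]
        return None

db = ToyLSM()
for i in range(10):
    db.put(i, f"value-{i}")
assert db.get(3) == "value-3"           # found in an older SSTable
```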
Compaction: Over time, SSTables accumulate at each level. Compaction merges overlapping SSTables into a single, larger SSTable, reducing the number of files a read must check. Compaction runs in the background and is the main source of write amplification in LSM-trees.
RocksDB, LevelDB, Cassandra, and ScyllaDB use LSM-trees. They excel at write-heavy workloads (time-series data, event logs, messaging systems).
read amplification vs write amplification
Read amplification is the number of disk reads required for a single logical read. B-trees have low read amplification (O(log N), typically 3-4 disk reads). LSM-trees have higher read amplification (must check multiple levels, though bloom filters help).
Write amplification is the number of disk writes required for a single logical write. B-trees have moderate write amplification (in-place update + WAL). LSM-trees have lower write amplification for the initial write (append only) but compaction adds background write amplification.
Space amplification is how much extra disk space the engine uses beyond the raw data size. B-trees waste space on partially filled pages (typically 50-70% utilization). LSM-trees waste space on stale versions of keys and tombstones that linger in older SSTables until compaction removes them.
The tradeoff is clear:
- B-tree: low read amp, high write amp, moderate space amp
- LSM-tree: high read amp, low write amp (for fresh writes), variable space amp
where it shows up
- InnoDB (MySQL): B-tree storage engine. Clustered index stores rows sorted by primary key. Secondary indexes point to primary key values. Default for MySQL.
- RocksDB: LSM-tree engine by Facebook. Used as the storage backend for many databases (CockroachDB, TiKV, Yugabyte). Optimized for SSDs with configurable compaction strategies.
- LevelDB: Google’s original LSM-tree implementation. Simpler than RocksDB, used as a building block. LevelDB inspired RocksDB.
- Cassandra: Uses an LSM-tree storage engine. Optimized for write-heavy workloads and time-series data. Compaction strategies are configurable (size-tiered, leveled).
- PostgreSQL: B-tree indexes by default. Also supports hash indexes, GiST, GIN, and BRIN for specialized workloads. The heap (table storage) is separate from indexes.
+ bloom filters and fence pointers
LSM-trees have a read amplification problem: a point lookup might check every SSTable at every level. Two optimizations dramatically reduce this cost.
Bloom filters are probabilistic data structures that can tell you “definitely not in this SSTable” or “maybe in this SSTable.” Each SSTable gets a bloom filter built when it is written (at flush or compaction time). Before reading an SSTable from disk, check its bloom filter. If the filter says “not here,” skip the SSTable entirely. A false positive rate of 1% means 99% of unnecessary disk reads are avoided.
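A minimal bloom filter sketch; the bit-array size and hash count below are arbitrary, whereas a real engine derives them from the expected key count and target false-positive rate:

```python
# k hash functions set k bits per key. A lookup that finds any unset bit
# means "definitely not present"; all bits set means "maybe present"
# (false positives are possible, false negatives are not).
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
assert bf.might_contain("user:42")   # always true for inserted keys
# bf.might_contain("user:999") is usually False, occasionally a false positive
```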
Fence pointers (the entries in an SSTable's index block) store the min and max key for each data block within the SSTable. Before scanning a block, check if the target key falls within its range. This turns a full SSTable scan into a binary search over block ranges.
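A sketch of that lookup, assuming the index stores only the first key of each block (a common simplification of min/max fences):

```python
# Binary search over fence keys identifies the single block that could
# contain the target, so only that block is read and scanned.
from bisect import bisect_right

def locate_block(fence_keys, key):
    """fence_keys[i] is the smallest key in block i (sorted)."""
    i = bisect_right(fence_keys, key) - 1
    return i if i >= 0 else None       # None: key is below the table's range

fence_keys = ["apple", "mango", "peach"]         # blocks 0, 1, 2
assert locate_block(fence_keys, "banana") == 0   # falls in block 0's range
assert locate_block(fence_keys, "plum") == 2
assert locate_block(fence_keys, "aardvark") is None
```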
Together, bloom filters and fence pointers reduce LSM-tree read amplification from “check every SSTable” to “check 1-2 SSTables” for point lookups. Range scans still require checking multiple SSTables.
+ compaction strategies (leveled vs tiered)
Compaction determines how SSTables are merged, and it has a major impact on performance.
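Both strategies rest on the same primitive: merge sorted runs, newest value wins. A toy version (real engines stream a k-way merge rather than materializing dicts, and also drop tombstones and obsolete versions as they go):

```python
# Merge several sorted runs into one, keeping only the newest value per key.
def merge_sstables(runs):
    """runs: list of dicts standing in for sorted files, newest first."""
    merged = {}
    for run in reversed(runs):      # oldest to newest, so newer values overwrite
        merged.update(run)
    return dict(sorted(merged.items()))

newer = {"a": 2, "c": 9}
older = {"a": 1, "b": 7}
assert merge_sstables([newer, older]) == {"a": 2, "b": 7, "c": 9}
```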
Size-tiered compaction (STCS): When enough SSTables of similar size accumulate, merge them into one larger SSTable. Simple, good write throughput, but can cause temporary space amplification (old and new SSTables coexist during compaction) and read amplification (many overlapping SSTables at each size tier).
Leveled compaction (LCS): SSTables are organized into levels. Level 0 holds SSTables freshly flushed from the memtable, and these may overlap one another. Each deeper level is roughly 10x larger than the one above it, and SSTables within a level (beyond level 0) do not overlap. When a level fills up, one SSTable is picked and merged into the next level. The non-overlapping key ranges at each level bound read amplification.
The tradeoff: leveled compaction has better read performance and more predictable space usage, but higher write amplification (each key is rewritten at least once per level as it migrates downward). Size-tiered has better write throughput but worse read latency and space amplification.
RocksDB defaults to leveled compaction. Cassandra offers both and recommends size-tiered for write-heavy workloads and leveled for read-heavy ones.
+ write-ahead logs and crash recovery
Both B-trees and LSM-trees need crash recovery. A power failure during a write could leave the data in a corrupted state.
The write-ahead log (WAL) solves this. Before any modification to the data structure, the change is written to a sequential log file. If the system crashes, it replays the WAL to reconstruct any changes that were not yet written to the data structure.
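A miniature WAL, just enough to show the append-then-replay cycle (the file path and record format here are made up for the sketch):

```python
# Every change is appended to a log file and fsync'd before the in-memory
# state is touched. After a crash, replaying the log rebuilds whatever had
# not yet reached the main data structure.
import json, os

class WAL:
    def __init__(self, path):
        self.path = path
        self.f = open(path, "a")

    def append(self, key, value):
        self.f.write(json.dumps({"k": key, "v": value}) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())     # durable before the write is acknowledged

    def replay(self):
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]

wal = WAL("/tmp/toy.wal")
wal.append("user:1", "alice")
memtable = {}
for entry in wal.replay():            # crash recovery: rebuild lost state
    memtable[entry["k"]] = entry["v"]
```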
For B-trees, the WAL protects against partially written pages (a crash during a page split could leave the tree in an inconsistent state). For LSM-trees, the WAL protects the memtable (which lives in memory and would be lost on crash). When the memtable is flushed to an SSTable, the corresponding WAL entries can be discarded.
WAL writes are sequential, which makes them fast on both HDDs and SSDs. The WAL is the reason databases can offer durability guarantees despite caching writes in memory.
Some engines (like LMDB) use copy-on-write B-trees instead of WAL. Modified pages are written to new locations, and the root pointer is atomically updated. This avoids the WAL entirely but increases write amplification.
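A sketch of the copy-on-write idea, with a dict of page ids standing in for disk pages and the returned root id standing in for the atomically published root pointer (the names and two-level shape are invented for illustration):

```python
# Never modify a live page: write a new copy of the changed leaf and of every
# page on the path to it, then swap the root pointer. Readers still holding
# the old root keep seeing an intact snapshot.
from bisect import bisect_right

pages = {                                   # page id -> page content
    "root-v1": {"keys": [10], "children": ["leaf-A", "leaf-B"]},
    "leaf-A": {"rows": {5: "a"}},
    "leaf-B": {"rows": {15: "b"}},
}
root = "root-v1"

def cow_insert(root_id, key, value, version):
    node = pages[root_id]
    child_idx = bisect_right(node["keys"], key)
    old_leaf_id = node["children"][child_idx]
    new_leaf_id = f"{old_leaf_id}-v{version}"
    pages[new_leaf_id] = {"rows": {**pages[old_leaf_id]["rows"], key: value}}
    new_children = list(node["children"])
    new_children[child_idx] = new_leaf_id
    new_root_id = f"root-v{version}"
    pages[new_root_id] = {"keys": node["keys"], "children": new_children}
    return new_root_id                      # publishing this id is the atomic commit

root = cow_insert(root, 17, "c", version=2)   # root-v1, leaf-A, leaf-B untouched
```

Note the write amplification in miniature: inserting one row rewrote both the leaf and the root.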
+ column stores vs row stores
The storage engines discussed so far are row stores: they store all columns of a row together. This is efficient for OLTP workloads (fetch a complete row by primary key) but wasteful for analytics (scan one column across millions of rows).
Column stores store each column separately. A query that sums a single column reads only that column’s data, skipping all others. For a table with 100 similarly sized columns, that is roughly a 100x reduction in I/O for single-column queries.
Column stores also enable better compression. Values in a column have the same type and often similar values (e.g., a “country” column with many repeated values). Run-length encoding, dictionary encoding, and delta encoding compress columns efficiently.
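Two of those encodings in a few lines, on a hypothetical “country” column:

```python
# Run-length and dictionary encoding on a column of repeated values.
# Columns with few distinct values compress to a fraction of their raw size.
def run_length_encode(column):
    runs, prev, count = [], None, 0
    for v in column:
        if v == prev:
            count += 1
        else:
            if prev is not None:
                runs.append((prev, count))
            prev, count = v, 1
    if prev is not None:
        runs.append((prev, count))
    return runs

def dictionary_encode(column):
    mapping = {v: i for i, v in enumerate(dict.fromkeys(column))}
    return mapping, [mapping[v] for v in column]

country = ["US", "US", "US", "DE", "DE", "US"]
assert run_length_encode(country) == [("US", 3), ("DE", 2), ("US", 1)]
codes, encoded = dictionary_encode(country)
assert encoded == [0, 0, 0, 1, 1, 0]    # small integers instead of strings
```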
Examples: ClickHouse, Apache Parquet (file format), Amazon Redshift, Google BigQuery. These are the engines behind analytical databases and data warehouses.
The hybrid approach (HTAP, Hybrid Transactional/Analytical Processing) uses a row store for OLTP and a column store for analytics, with a change data capture pipeline connecting them. TiDB and CockroachDB are moving in this direction.
production stories
RocksDB everywhere
RocksDB started as Facebook’s fork of LevelDB, optimized for fast storage (SSDs). It has since become the default embedded storage engine for a remarkable number of databases: CockroachDB, TiKV (the storage layer of TiDB), YugabyteDB, and many others embed RocksDB as their storage backend.
The reason is simple: building a production-quality storage engine is extremely difficult. RocksDB handles compaction, bloom filters, compression, rate limiting of background I/O, and dozens of other concerns that are easy to get wrong. By embedding RocksDB, database developers get a battle-tested storage layer and can focus on the distributed systems layer above it.
The operational cost of RocksDB is compaction tuning. The default settings work for many workloads, but write-heavy workloads may need tuning of compaction triggers, write buffer sizes, and background thread counts. Compaction that falls behind causes write stalls, which manifest as latency spikes.
WiredTiger in MongoDB
MongoDB replaced its original MMAPv1 storage engine with WiredTiger, which became the default in version 3.2. WiredTiger is a B-tree engine with copy-on-write internals, MVCC concurrency control, compression, and both B-tree and LSM-tree table types.
The switch was transformative. WiredTiger’s document-level concurrency control replaced MMAP’s collection-level locking. Compression (snappy by default, zlib and zstd available) reduced storage usage by 50-80%. Write performance improved dramatically because WiredTiger uses a WAL for durability instead of syncing entire memory-mapped files.
The architectural lesson: the storage engine matters more than most people realize. Changing the storage engine changed MongoDB’s performance characteristics more than years of incremental optimization.
choosing the right engine
The decision tree is simpler than it appears:
Read-heavy, point lookups, OLTP: B-tree. PostgreSQL, MySQL/InnoDB, SQLite. The read path is O(log N) and predictable. Writes are slower but acceptable for most web applications.
Write-heavy, time-series, event logs: LSM-tree. Cassandra, RocksDB, ScyllaDB. Writes are fast (append-only), and compaction keeps reads reasonable. Ideal for workloads that write far more than they read.
Analytical, column scans: Column store. ClickHouse, Parquet, Redshift. When your queries touch one column across millions of rows, column stores are orders of magnitude faster.
Mixed workloads: This is where it gets hard. Most applications start as B-tree (PostgreSQL or MySQL) and stay there. If write throughput becomes a bottleneck, consider moving the write-heavy portion to an LSM-based store. If analytics become slow, add a column store for reporting.
The worst choice is no choice. Understanding what your storage engine optimizes for lets you predict where it will struggle, before it struggles in production.