user@techdebt:~/blog$_
$ cd ..

The Engine

mode:
$ engine inspector
click Step or Play to inspect storage engine state
$ simulation.log

the problem

Databases store data on disk. Disk is slow. The storage engine’s job is to organize data on disk so that reads and writes are as efficient as possible. The catch: you cannot optimize for both at the same time.

Every storage engine makes a tradeoff between read performance and write performance. B-trees optimize for reads by keeping data sorted and accessible through balanced tree lookups. LSM-trees optimize for writes by converting random writes into sequential ones. Understanding this tradeoff is the key to choosing the right database for your workload.

B-tree

The B-tree is the most widely used data structure in databases. PostgreSQL, MySQL (InnoDB), SQLite, and most relational databases use B-trees as their primary storage engine.

A B-tree organizes data in a balanced tree of fixed-size pages (typically 4KB or 16KB). The root page contains key ranges that point to child pages. Internal pages subdivide the range further. Leaf pages contain the actual data (or pointers to it).

Reads: Start at the root, follow pointers based on key comparisons, and arrive at the leaf containing the target key. The number of page reads equals the tree height, which is O(log N). A tree with a branching factor of 500 and 4 levels can index 500^4 = 62.5 billion keys.

Writes: Navigate to the correct leaf page (same path as a read), then update the page in place. If the page is full, it splits into two pages and the parent is updated. Splits can cascade up the tree, but this is rare with typical branching factors.

The cost of B-tree writes is write amplification. A single logical write (insert one row) may require multiple disk writes (update the leaf page, update the parent if split, update the WAL). In practice, write amplification is 10-30x for B-trees.

LSM-tree

The Log-Structured Merge-tree takes the opposite approach. Instead of updating data in place, it appends all writes to an in-memory buffer (the memtable). When the memtable is full, it is flushed to disk as a sorted, immutable file (an SSTable).

Writes: Append to the memtable (a sorted data structure like a red-black tree or skip list). This is a single memory write, no disk I/O until the memtable is full. Writes are fast.

Reads: Check the memtable first (memory lookup). If not found, check the most recent SSTable on disk. If not found, check older SSTables. In the worst case, a read must check every level. This is read amplification.

Compaction: Over time, SSTables accumulate at each level. Compaction merges overlapping SSTables into a single, larger SSTable, reducing the number of files a read must check. Compaction runs in the background and is the main source of write amplification in LSM-trees.

RocksDB, LevelDB, Cassandra, and ScyllaDB use LSM-trees. They excel at write-heavy workloads (time-series data, event logs, messaging systems).

read amplification vs write amplification

Read amplification is the number of disk reads required for a single logical read. B-trees have low read amplification (O(log N), typically 3-4 disk reads). LSM-trees have higher read amplification (must check multiple levels, though bloom filters help).

Write amplification is the number of disk writes required for a single logical write. B-trees have moderate write amplification (in-place update + WAL). LSM-trees have lower write amplification for the initial write (append only) but compaction adds background write amplification.

Space amplification is how much extra disk space the engine uses beyond the raw data size. B-trees waste space on partially filled pages (typically 50-70% utilization). LSM-trees waste space on overlapping SSTables before compaction.

The tradeoff is clear:

  • B-tree: low read amp, high write amp, moderate space amp
  • LSM-tree: high read amp, low write amp (for fresh writes), variable space amp

where it shows up

  • InnoDB (MySQL): B-tree storage engine. Clustered index stores rows sorted by primary key. Secondary indexes point to primary key values. Default for MySQL.
  • RocksDB: LSM-tree engine by Facebook. Used as the storage backend for many databases (CockroachDB, TiKV, Yugabyte). Optimized for SSDs with configurable compaction strategies.
  • LevelDB: Google’s original LSM-tree implementation. Simpler than RocksDB, used as a building block. LevelDB inspired RocksDB.
  • Cassandra: Uses an LSM-tree storage engine. Optimized for write-heavy workloads and time-series data. Compaction strategies are configurable (size-tiered, leveled).
  • PostgreSQL: B-tree indexes by default. Also supports hash indexes, GiST, GIN, and BRIN for specialized workloads. The heap (table storage) is separate from indexes.