Cassandra in less than 5 mins

4 minute read

The database discussion section of a system design interview is often the crux of most interview questions. In my experience, for most systems that do not have ACID requirements, NoSQL database is an appropriate choice. Of course, this is in the context of system design interviews and the real world scenario might require specific database solutions.

Traditionally, you always start with what type of database you are going to be using. But before you can even thing about the type of database to use (relational vs non-relational), try thinking in terms of the data.

Specifically ask yourself: (not an exhaustive list, will keep updating this list.)

What kind of data are we dealing with (metadata, multimedia etc)
What are some of the common data access patterns?
- Read heavy vs write heavy
What do we need to optimize in this system?
- Read optimized or write optimized
Storage cost
ACID guarantees
Consistency vs Availability
Do we need to optimize for range queries?
Time series data?
Replication - leaderless, single leader, multi leader etc.
Sequential reads or writes?

Cassandra

Column oriented database (also known as wide column database)
Supports high read and write throughput (explanation below)
Highly available and scalable (horizontally scalable)

Use Cases

Write heavy applications

If data is generally self contained, and only needs to be fetched with other data from its partition - can also use arrays, sets in rows
Examples:
- Sensor readings (same partitioning key for the same sensor),
- Chat messages (chatID - partition key, timestamp - ordering key),
- User activity tracking

Storage Engine (what makes Cassandra fast)

CommitLog The writes first go to WAL (write ahead log) or CommitLog before being written to a disk (SSTable).

Why not directly update the write to the disk instead of updating WAL?

WAL updates are cheaper as it’s append-only and doesn’t require any restructuring of data on disk.

This also provides durability in the case of unexpected shutdown. On startup, any mutations in the commit log will be applied to memtables.

There are two primary tree data structures used in database: B-Trees and LSM Trees Cassandra uses LSM trees:

Writes are extremely fast as they are first sent to memory.
- Depending on how often log compaction is performed, frequent updates or deletes of a row can take up a lot of extra disk space and cpu resources
- Tombstones for deletion further optimizes writes
performance of reads can a bit slower compared to B-trees.
- Cassandra uses bloom filters as an optimization (approximates the contents of a set)

Partitioning

Cassandra partitioning key in conjunction with consistent hashing (optimized re-balancing when nodes are deleted or added to a cluster) to determine which replica to send a given row to.

PartitionKey

Each shard is replicated with the number of replicas specified by the user.
Adding / removing nodes will cause the recomputing tokens - some nodes can become hotspots as it will take on broader range.
Re-balancing and distribution of data results in moving a lot of data.
VNodes are used to spread the load more evenly across the physical nodes on the cluster by dividing the hash ranges into smaller sub-ranges. This speeds up the re-balancing process after adding or removing nodes.

Replication

Leaderless replication

Every single write is sent to every single replica for a given partition (as opposed to one leader node) and the admin can choose how many replicas need to respond with a success before returning to the client (can use quorums).
- Client sends writes to one coordinator node in cassandra cluster
This means that there will be inevitably be conflicting writes
- last write wins (most recent timestamps)
  - Clocks are not reliable, can use something like NTP (network time protocol)
  - Read repair on reads, if client sees a replica with an outdates value it will update it
  - Anti-entropy process running in background ensures eventual consistency of replicas - eventual consistency
- If for some reason replicas cannot handle writes in a given moment, the coordinator will store the write to be sent to them later, and then perform a hinted handoff

Fault Tolerance

Data stored on replicas:
- Each replica added should allow read and write throughput to scale linearly.
- Customizable replication topology - number of replicas in addition to location such as different rack or data center.
Faults detected via gossip protocol
- Nodes keep track of heartbeats received by other nodes in the cluster and their timestamp and send these via gossip, if a node has not gossiped in a long time it is presumed dead

Hinted Handoff

When a node is down or does not respond to a write request, the coordinator node writes a hint in a text file on the local disk. This hint contains the data itself along with information about which node the data belongs to.
When the coordinator node discovers from the Gossiper that a node for which it holds hints has recovered, it forwards the write requests for each hint to the target.
Cassandra can fix this stale data while serving a read request. Cassandra can issue a Read Repair when it sees stale data

Indexes

Cassandra uses the clustering keys to create indexes of the data within a partition. - Note that these are only local indexes not global indexes. If you have many clustering keys in order to achieve multiple different sort orders, Cassandra will de-normalize the data such that it keeps two copies of it, slow down writes - Secondary indexes are not global.

Conclusion

Ideal for write-heavy applications with millions or billions of users, where performance is a primary concern (e.g., messaging, sensor readings, activity tracking).
Main pitfalls include a lack of strong consistency and the inability to support data relationships beyond sorting within a partition.

Twitter Facebook LinkedIn

Cassandra in less than 5 mins

Cassandra

Use Cases

Storage Engine (what makes Cassandra fast)

Partitioning

Replication

Fault Tolerance

Hinted Handoff

Indexes

Conclusion

You May Also Enjoy

Design Yelp / Proximity Service!

Top K Heavy Hitters

Design Leetcode Contest

Design a Webcrawler - V2