Apache Cassandra — Internal Architecture

A developer's guide to how Cassandra actually works under the hood: cluster architecture, the storage engine, replication, the Log-Structured Merge tree, and how data moves when nodes join or leave.

Apache Cassandra is a distributed, wide-column NoSQL database designed for high write throughput, linear horizontal scalability, and no single point of failure. Its design borrows two big ideas: the data distribution and replication model from Amazon Dynamo (consistent hashing, tunable consistency, gossip, hinted handoff) and the on-disk storage model from Google Bigtable (the Log-Structured Merge tree, memtables, and SSTables). Understanding Cassandra means understanding how those two halves fit together.

Contents

  1. Design Goals and Core Ideas
  2. Cluster Architecture
  3. Data Distribution & the Token Ring
  4. Replication
  5. Storage Engine: LSM Tree
  6. Write Path
  7. Read Path
  8. Membership & Failure Detection
  9. Adding a Node
  10. Removing a Node
  11. Summary

1. Design Goals and Core Ideas

Every internal decision in Cassandra traces back to a small set of goals. Keeping them in mind makes the rest of the architecture predictable.

GoalHow Cassandra achieves it
No single point of failureFully peer-to-peer. Every node is identical; there is no master, no name node, no config server.
Linear scalabilityData is partitioned across nodes by a hash ring. Adding nodes adds capacity proportionally.
High write throughputWrites are append-only (commit log + in-memory memtable). No read-before-write, no in-place updates.
Tunable consistencyThe client chooses, per query, how many replicas must respond. Trade latency against consistency.
Always-on availabilityAn AP system (in CAP terms). It favors availability and partition tolerance; consistency is tunable, not absolute.

2. Cluster Architecture

A Cassandra cluster is a ring of identical, equal peers. There is no coordinator process and no leader election for data ownership. Any node can accept any request; the node that receives a client request becomes the coordinator for that single request and routes it to the nodes that own the data.

Peer-to-peer ring
Every node is identical and gossips with every other node. The node a client hits becomes the coordinator for that request.

The key components running on every node are:

3. Data Distribution: Partitioning and the Token Ring

Cassandra distributes data using consistent hashing. The entire output space of the hash function (a 64-bit signed integer for Murmur3) is treated as a ring. Each node is assigned one or more tokens — positions on the ring. A node owns the range of tokens from the previous node's token (exclusive) up to its own token (inclusive).

Token ring
The partition key is hashed to a token; the token determines which node owns the row.

To locate a row, Cassandra hashes its partition key to a token, then walks clockwise around the ring to find the first node whose range contains that token. That node is the first replica; the next N-1 nodes clockwise hold the additional copies.

A concrete walk-clockwise example

Picture the ring as the numbers 0 to 100 wrapping back to 0, with four single-token nodes: A at token 10, B at 40, C at 70, and D at 100. Each node owns the range from its predecessor's token (exclusive) up to its own token (inclusive), so A owns (100,10], B owns (10,40], C owns (40,70], and D owns (70,100]. If hash('user:42') resolves to 55, you start at 55 and move in the direction of increasing tokens; the first token you reach is 70, so Node C owns the row. For RF=3 you keep walking clockwise to the next distinct physical nodes, which gives the replica set { C, D, A }.

Walking clockwise example
hash('user:42') = 55 lands in C's range (40,70]; the next two distinct nodes clockwise complete the RF=3 replica set.

Virtual nodes (vnodes): naming and mapping

Instead of one large token range per machine, each physical node is assigned many small ranges scattered around the ring (16 in modern Cassandra, 256 historically). This spreads data and load more evenly and, critically, makes rebalancing faster — when a node joins or leaves, many small ranges are streamed from many peers in parallel rather than one huge range from a single neighbor.

A virtual node is not a named process — it is simply one token, identified by its token value (a 64-bit integer for Murmur3). Each physical node is configured with num_tokens and on join is assigned that many tokens. Each assigned token defines one vnode range, so "owning 16 vnodes" means owning the 16 ranges whose upper-bound tokens belong to that node. The cluster maintains a single sorted token-to-endpoint map — each physical node has a host-ID (a UUID) and an address — and gossip distributes this map so any node can resolve any token to its owning machine. The mapping is many-to-one and interleaved: consecutive tokens on the ring usually belong to different physical nodes, which is exactly what spreads data evenly and lets bootstrap and decommission stream from many peers in parallel.

vnodes to physical nodes
Three physical nodes, each owning several interleaved tokens. Each dot is one vnode (one token); colors show the owning machine.

The routing logic on the coordinator is straightforward:

function route(partition_key):
  token = murmur3(partition_key)        # 64-bit signed int
  primary = ring.find_owner_clockwise(token)
  replicas = []
  node = primary
  while len(replicas) < replication_factor:
    if placement_strategy.accepts(node, replicas):   # rack/DC aware
      replicas.append(node)
    node = ring.next_clockwise(node)
  return replicas

4. Replication

The replication factor (RF) defines how many copies of each row exist. With RF=3, every row is stored on three nodes. Replica placement is governed by a strategy:

Replication and consistency
RF=3 places three copies on consecutive distinct nodes. The consistency level chosen per query decides how many must respond.

Tunable consistency is the second half of the Dynamo model. Each query specifies a consistency level (CL) — how many replicas must acknowledge before the operation is considered successful. The fundamental rule is that if R + W > RF (read replicas plus write replicas exceed the replication factor), every read is guaranteed to see the latest acknowledged write, because the read and write quorums always overlap on at least one node.

Cassandra keeps replicas consistent over time through three anti-entropy mechanisms:

MechanismWhen it runsWhat it does
Read repairDuring a readIf replicas return different values, the coordinator pushes the newest version back to the stale replicas.
Hinted handoffDuring a write, when a replica is downThe coordinator stores a "hint" and replays the write to the replica once it comes back online.
Repair (Merkle trees)Scheduled / manual (nodetool repair)Replicas compare Merkle-tree hashes of their data and stream only the differing ranges to reconcile.

The coordinator's write logic under a consistency level:

function write(key, value, CL):
  replicas = route(key)
  required = consistency_count(CL, replication_factor)   # QUORUM -> floor(RF/2)+1
  acks = 0
  for r in replicas:
    if r.is_alive():
      send_async(r, key, value)     # r appends to commit log + memtable
    else:
      store_hint(r, key, value)     # hinted handoff
  wait until acks >= required        # count only successful replica responses
  return SUCCESS                     # remaining replicas keep applying in the background

5. The Storage Engine: Log-Structured Merge Tree

Cassandra's on-disk engine is a Log-Structured Merge tree (LSM). The central idea: never modify data in place. All writes are sequential appends, which is far faster than random disk I/O. Updates and deletes become new entries; the system reconciles versions at read time and during background compaction.

LSM tree and compaction
Writes land in the memtable (in memory) and commit log (on disk), flush to immutable SSTables, then merge during compaction.

The three core structures are:

Compaction is the background process that merges multiple SSTables into fewer, larger ones. It performs a merge-sort, keeps only the newest version of each cell, and physically removes tombstones (deletion markers) once they are older than gc_grace_seconds. Compaction is what keeps read amplification under control. The main strategies are:

StrategyBest for
STCS (Size-Tiered)Write-heavy workloads. Merges SSTables of similar size. Default.
LCS (Leveled)Read-heavy workloads. Keeps small, non-overlapping levels for predictable read latency.
TWCS (Time-Window)Time-series data with TTLs. Groups data by time window for cheap expiry.

6. Write Path

A write is acknowledged the moment it is durable in the commit log and visible in the memtable. There is no read-before-write and no random disk seek on the hot path, which is why Cassandra writes are so fast.

Write path
Inside a replica: append to commit log, update the memtable, acknowledge, and flush to an SSTable later.

Inside a replica, a single write proceeds as:

function apply_write(key, value):
  commit_log.append(key, value)     # 1. durability (sequential disk write)
  memtable.put(key, value)          # 2. in-memory, sorted
  ack()                             # 3. respond to coordinator

  if memtable.size() > threshold:   # 4. flush, asynchronously
    sstable = memtable.flush_to_disk()             # immutable, sorted
    commit_log.discard(segments_covered_by(sstable))

7. Read Path

A read is more involved than a write because the current value of a row may be spread across the memtable and several SSTables (each holding a different version or column). The coordinator gathers candidate data from all sources and merges by timestamp — last write wins at the cell level.

Read path
Caches and bloom filters cut work; the memtable and surviving SSTables are merged by timestamp into the result.

On a single replica, the local read works like this:

function read(key):
  if row_cache.contains(key): return row_cache.get(key)   # fast path
  candidates = []
  candidates += memtable.get(key)   # newest, in memory
  for sstable in sstables:
    if not sstable.bloom_filter.might_contain(key):
      continue                      # skip files that cannot hold the key
    offset = sstable.index.lookup(key)     # partition index / summary
    candidates += sstable.read_at(offset)
  row = merge_by_timestamp(candidates)     # last write wins; drop tombstoned cells
  row_cache.put(key, row)
  return row

The bloom filter is the key optimization: it lets a replica skip SSTables that definitely do not contain the requested key, avoiding pointless disk reads. A delete does not erase data; it writes a tombstone that shadows older versions until compaction physically removes them.

8. Membership and Failure Detection

Because there is no master, nodes must agree on who is alive entirely through gossip. Once per second, each node picks a few random peers and exchanges a digest of everything it knows about every node — heartbeat counters and application state. Newer information (higher version numbers) overwrites older information, so the cluster state converges quickly.

Failure detection is not a simple timeout. Cassandra uses a Phi Accrual Failure Detector, which outputs a continuous suspicion level (phi) based on the statistical distribution of recent heartbeat inter-arrival times, rather than a binary up/down. A node is marked down only when phi crosses a configured threshold, which adapts to real network conditions and reduces false positives.

every 1 second on each node:
  my_heartbeat += 1
  peers = pick_random_live_peers(count=3)
  for p in peers:
    send(p, my_view_of_cluster)     # digests + states
  # on receiving a peer's view, merge: keep entries with the higher version

function is_down(node):
  phi = -log10(P(now - last_heartbeat[node]))   # accrual
  return phi > PHI_THRESHOLD          # e.g. 8

9. Adding a Node (Scaling Out)

Adding a node is called bootstrapping. The new node is assigned tokens (with vnodes, many small ranges spread around the ring). It now becomes a replica for those ranges, so it must receive the corresponding data from the nodes that currently hold it before it can serve reads.

Adding a node
The new node takes over token ranges; current owners stream the relevant data to it before it serves traffic.

The bootstrap sequence is:

  1. The new node starts, contacts a seed node, and learns the ring topology through gossip.
  2. It calculates the token ranges it will own and announces a JOINING state.
  3. Neighbors that currently own those ranges stream the relevant SSTable data to the new node, in parallel across many vnode ranges.
  4. Once streaming completes, the node transitions to NORMAL and begins serving reads and writes for its ranges.
  5. The previous owners now hold redundant copies. An operator runs nodetool cleanup on them to delete data they no longer own and reclaim disk.
The old owners keep serving the data until streaming finishes, so there is no availability gap and no data loss during scale-out.
on new_node bootstrap:
  contact_seed(); learn_ring_via_gossip()
  my_ranges = compute_owned_ranges(my_tokens)
  announce(state = JOINING)
  for range in my_ranges:
    source = current_owner(range)
    stream_sstables_from(source, range)   # parallel
  announce(state = NORMAL)                 # now serves traffic
# operator later: nodetool cleanup on old owners

10. Removing a Node (Scaling In or Replacing Failure)

Removing a node must preserve the replication factor: every range the departing node held must still have RF copies afterward. The procedure depends on whether the node is alive.

Removing a node
A live node decommissions by streaming its data out; a dead node is removed and survivors rebuild the missing copies.
ScenarioCommandWhat happens
Node is alive (graceful scale-in)nodetool decommissionThe leaving node itself streams each of its ranges to the new owners (the next replica clockwise), then leaves the ring. RF is preserved throughout.
Node is dead (hardware failure)nodetool removenode <id>The dead node cannot stream anything. The surviving replicas that still hold copies of its ranges rebuild the now-missing third copy by streaming among themselves.
Replace in placereplace_address flagA fresh node takes over the dead node's exact tokens and bootstraps the data from the remaining replicas.

In all cases the ring re-forms automatically through gossip, and the affected token ranges end up with the full replication factor of copies again. No central coordinator orchestrates this; it is driven by the nodes that own the affected ranges.

decommission (node alive):
  announce(state = LEAVING)
  for range in my_ranges:
    new_owner = next_replica_clockwise(range)
    stream_sstables_to(new_owner, range)
  announce(state = LEFT); shut_down()

removenode (node dead):
  for range owned by dead_node:
    survivors = live_replicas(range)
    new_owner = next_replica_clockwise(range)
    stream_sstables(from = survivors, to = new_owner)

11. Summary

The whole system is built from a few reinforcing ideas:

ConcernMechanism
Where does a row live?Consistent hashing of the partition key onto a token ring; vnodes for even spread.
How many copies?Replication factor, placed by a topology-aware strategy.
How consistent?Tunable consistency level per query; R + W > RF gives strong reads.
How are writes so fast?Append-only commit log + in-memory memtable; no in-place updates.
How is data stored and reclaimed?Immutable SSTables (LSM), reconciled and compacted in the background.
How do nodes coordinate?Gossip + Phi Accrual failure detection; no master.
How does it scale?Bootstrap streams ranges in; decommission/removenode streams them out. RF is preserved throughout.
The recurring theme: every operation is decentralized, append-oriented, and eventually reconciled. There is no master to lose, no in-place mutation to lock, and no rebalancing step that takes the cluster offline.