← Back to index

Apache Kafka — Internal Architecture

A developer's guide to how Kafka actually works under the hood: the distributed commit log, brokers and partitions, on-disk segments, replication and the in-sync replica set, and how records flow from producer to consumer.

Apache Kafka is a distributed, durable, append-only commit log built for high throughput and strict per-partition ordering. It is not a queue in the traditional sense and not a database — it is a replicated log that producers append to and consumers read from at their own pace. Almost every design decision follows from one core abstraction: the partition is an ordered, immutable sequence of records identified by monotonically increasing offsets. Understanding Kafka means understanding how that log is partitioned across brokers, persisted on disk, replicated for durability, and consumed in parallel.

Contents

  1. Design Goals and Core Ideas
  2. Cluster Architecture
  3. The Log on Disk
  4. The Producer
  5. Replication & the ISR
  6. Write Path
  7. Consumers & Groups
  8. Read Path & Performance
  9. Summary

1. Design Goals and Core Ideas

Every internal decision in Kafka traces back to a small set of goals. Keeping them in mind makes the rest of the architecture predictable.

GoalHow Kafka achieves it
DurabilityRecords are appended to an on-disk log and replicated to multiple brokers before being acknowledged. Nothing lives only in memory.
Strict orderingWithin a single partition, records are stored and delivered in the exact order they were appended. Offsets never go backwards.
High throughputSequential disk writes, batching, compression, the OS page cache, and zero-copy reads. Disk is treated as a fast sequential medium, not random-access storage.
Horizontal scalabilityA topic is split into partitions spread across brokers. Partitions are the unit of both parallelism and ordering.
Replay & multiple readersReading does not consume. The log is retained; each consumer tracks its own offset, so many independent readers can re-read the same data.
The distributed commit log
Producers append to the tail of an ordered, offset-indexed log. Reads do not delete; each consumer tracks its own offset and can read at a different position.

The single most important mental model is the log itself: an append-only, ordered file where each record gets the next integer offset. Producers only ever write to the tail. Consumers read forward from a position they control. Because a read is just "give me records starting at offset N," the same record can be delivered to many consumers, re-read after a failure, or reprocessed days later — as long as it is still within the retention window.

2. Cluster Architecture

A Kafka cluster is a set of brokers — servers that store partition data and serve produce and fetch requests. Unlike a fully peer-to-peer system, Kafka has explicit roles: one broker is the leader for a given partition while others are followers, and a small controller quorum manages cluster metadata and leader assignment.

Cluster architecture
Topics are split into partitions; each partition has one leader and several followers spread across brokers. The KRaft controller quorum holds cluster metadata and assigns leaders.

The key components are:

KRaft vs legacy ZooKeeper

Historically Kafka stored all cluster metadata (broker membership, topic configs, partition leadership, ISR state) in an external ZooKeeper ensemble, and a single elected controller broker watched ZooKeeper for changes. That added an operational dependency and made metadata changes a bottleneck at large partition counts.

Modern Kafka uses KRaft (Kafka Raft), which removes ZooKeeper entirely. A dedicated set of controller nodes runs a Raft consensus quorum and stores all metadata in an internal, replicated metadata log — the controllers literally use a Kafka-style log to record metadata changes. Brokers subscribe to that metadata log and apply updates as events. The result is faster failover, far higher partition scalability, and one fewer system to operate.

3. The Log on Disk

Each partition replica on a broker is a directory on disk. The log is not one giant file — it is split into segments, and each segment is accompanied by sparse index files that make random lookups by offset or timestamp cheap.

Log segments and indexes
A partition is a sequence of segments. Each segment has a .log plus sparse .index and .timeindex files. Old data is reclaimed by retention (delete) or compaction (keep latest per key).

The on-disk structures are:

Reclaiming space happens in one of two ways, chosen per topic by its cleanup.policy:

PolicyWhat it doesBest for
delete (retention)Deletes whole segments once they are older than retention.ms or the partition exceeds retention.bytes. Deletion is by segment, never by individual record, which keeps it cheap.Event streams where only recent data matters.
compact (log compaction)A background cleaner rewrites segments keeping only the latest value per key. Older values for the same key are discarded; a record with a null value (a tombstone) marks a key for deletion.Changelogs and state — where you want the current value for every key, kept forever.
Compaction makes a Kafka topic behave like a durable key-value table: replay the whole topic and you reconstruct the latest state for every key. This is exactly how Kafka Streams and Connect store their state.

4. The Producer

The producer decides which partition each record goes to, batches records for efficiency, and chooses how strong a durability guarantee it wants via the acks setting.

Producer internals
Records are routed by the partitioner, accumulated into per-partition batches, and flushed by a background sender thread when a batch fills or linger.ms elapses.

Partitioning

The partitioner maps each record to a partition:

Batching and linger

The producer does not send one record at a time. Records accumulate in per-partition batches in the record accumulator. A batch is sent when it reaches batch.size or when linger.ms elapses — a deliberate small delay that trades a little latency for much larger, more efficient batches (and better compression).

acks and the durability ladder

acksWho must acknowledgeTrade-off
0Nobody — fire and forgetLowest latency, can silently lose data.
1The partition leader onlyFast, but a record acked by a leader that then crashes before followers replicate it is lost.
all (-1)The leader and all in-sync replicasStrongest durability; the record survives as long as one in-sync replica survives.

Two further guarantees build on top of acks:

function send(record):
  if record.key is not None:
    partition = murmur2(record.key) % num_partitions   # same key -> same partition
  else:
    partition = sticky_partition()                     # fill one batch, then switch
  batch = accumulator.batch_for(partition)
  batch.append(record, seq = next_seq(partition))      # seq# enables idempotence
  if batch.full() or batch.age() > linger_ms:
    sender.enqueue(batch)            # background thread sends to the leader
  # on send: broker acks per the acks setting (0 / 1 / all)

5. Replication and the In-Sync Replica Set

Durability comes from replication. Each partition is configured with a replication factor — the number of copies. One replica is the leader; the rest are followers that continuously fetch from the leader to stay caught up. All reads and writes for a partition go through its leader; followers exist purely for redundancy and failover.

Replication, ISR and high watermark
Followers fetch from the leader. Replicas that are caught up form the ISR. The high watermark is the highest offset replicated to all in-sync replicas; consumers only see up to it.

The central concepts are:

A write under acks=all is acknowledged only once the high watermark advances past it, which means it is durably on every in-sync replica. The minimum size of the ISR can be enforced with min.insync.replicas: if the ISR shrinks below that, the leader refuses acks=all writes rather than accept data it cannot safely replicate.

6. Write Path

Putting the producer and replication together, here is the full life of a write under acks=all. The key property is that an acknowledgement means the record is committed — durable on every in-sync replica and visible to consumers.

Write path
The producer sends to the leader, the leader appends locally, followers fetch, the ISR offsets advance the high watermark, and only then is the producer acked.

Step by step:

  1. The producer sends the record (or batch) to the partition's leader broker.
  2. The leader appends it to its local .log — a sequential disk write — advancing its log end offset.
  3. Followers continuously fetch from the leader (replication is pull-based, just like a consumer) and append the new records to their own logs.
  4. Each follower's fetch request tells the leader how far it has replicated. The leader tracks these offsets.
  5. Once every in-sync replica has the record, the leader advances the high watermark to min of the ISR's offsets. The record is now committed.
  6. The leader acks the producer (immediately, for acks=1; after the HW advances, for acks=all).
on leader, produce(record, acks):
  leader_log.append(record)                # 1-2. sequential write, LEO++
  if acks == 1:
    ack()                                  # leader durable, may lose on failover
  # followers pull in the background:
  on follower_fetch(follower, fetch_offset):
    follower.offset = fetch_offset         # 3-4. report replicated position
    new_hw = min(offset of r for r in ISR) # 5. advance only over in-sync replicas
    high_watermark = new_hw
    if acks == "all" and record.offset <= high_watermark:
      ack()                                # 6. committed on all ISR -> safe
Replication is pull-based: followers fetch from the leader exactly like consumers do. This reuses one code path and lets the leader stay simple — it never pushes, it just serves fetch requests and watches how far each follower has gotten.

7. Consumers and Consumer Groups

Consumers read by issuing fetch requests to partition leaders. The scaling unit is the consumer group: a set of consumers that cooperate to read a topic, with each partition assigned to exactly one consumer in the group. That is how Kafka delivers each record once per group while still reading partitions in parallel.

Consumers and groups
Within a group, each partition is owned by exactly one consumer. Adding or removing a consumer triggers a rebalance. Committed offsets are stored in the internal __consumer_offsets topic.

The mechanics:

Rebalancing

When a consumer joins, leaves, or dies, the group must reassign partitions — a rebalance. There are two protocols:

ProtocolBehavior
EagerStop-the-world: every consumer revokes all its partitions, then the coordinator reassigns everything. Simple, but the whole group pauses during the rebalance.
Cooperative (incremental)Only the partitions that actually need to move are revoked; consumers keep processing the partitions they retain. Much smaller disruption, now the default for most workloads.

Offset commits

A consumer records how far it has read by committing offsets. These are written to an internal compacted topic, __consumer_offsets, keyed by (group, topic, partition) — so the latest commit per partition is always retained. On restart or after a rebalance, a consumer resumes from its last committed offset. Committing after processing gives at-least-once delivery; committing before processing gives at-most-once. The offset is just data in a Kafka topic, which is why it survives broker restarts with no extra machinery.

consumer in group G subscribed to topic T:
  partitions = coordinator.assign(G, me)   # my share of T's partitions
  for p in partitions:
    pos[p] = committed_offset(G, p) or reset_policy   # resume point
  loop:
    records = fetch(leader_of(p), pos[p])  # pull from each partition leader
    process(records)
    pos[p] += len(records)
    commit(G, p, pos[p])                   # write to __consumer_offsets
  # on member change: coordinator triggers a rebalance (eager or cooperative)

8. Read Path and Performance

A consumer fetch is served entirely by the partition leader. The reason Kafka can saturate network cards on commodity hardware is that the read path barely touches user space at all — it leans on the operating system.

Read path and zero-copy
The leader serves fetches from the OS page cache. On a cache hit, sendfile() copies data straight from page cache to the network socket without ever entering the broker's user-space memory.

Two operating-system features do the heavy lifting:

This works precisely because records are stored on disk in exactly the wire format consumers expect — no per-record deserialization or transformation is needed on the broker, so the bytes can be streamed untouched. Combined with sequential reads (consumers march forward through the log) and large batched fetches, the broker mostly acts as a fast pipe from page cache to socket.

on leader, fetch(partition, offset, max_bytes):
  pos = index.lookup(offset)        # sparse index -> byte position
  # the OS page cache already holds hot (tail) segments in RAM
  return sendfile(log_file, pos, max_bytes, socket)
  # kernel copies page cache -> socket directly; bytes never enter the JVM

9. Summary

The whole system is built from a few reinforcing ideas:

ConcernMechanism
What is the core abstraction?An append-only, ordered, offset-indexed commit log, split into partitions.
How does it scale and order?Partitions: the unit of parallelism and the unit of ordering. Order holds within a partition, never across.
How is data stored and reclaimed?Immutable segments with sparse offset/time indexes; reclaimed by retention (delete) or compaction (latest per key).
How do producers control durability?acks 0/1/all, batching with linger.ms, idempotent IDs, and optional transactions.
How is data made durable?Leader/follower replication; the ISR plus the high watermark define what is committed and readable.
How do many readers share work?Consumer groups assign one partition to one consumer; offsets are committed to __consumer_offsets.
How is it so fast?Sequential I/O, the OS page cache, and zero-copy sendfile from cache straight to the socket.
How do brokers coordinate?A KRaft (Raft) controller quorum holds metadata and assigns leaders — no more ZooKeeper.
The recurring theme: everything is a log. Storage is a log, replication is followers reading a log, consumer progress is a log, and even cluster metadata under KRaft is a log. Master the log abstraction and the rest of Kafka falls into place.