A developer's guide to how Kafka actually works under the hood: the distributed commit log, brokers and partitions, on-disk segments, replication and the in-sync replica set, and how records flow from producer to consumer.
Apache Kafka is a distributed, durable, append-only commit log built for high throughput and strict per-partition ordering. It is not a queue in the traditional sense and not a database — it is a replicated log that producers append to and consumers read from at their own pace. Almost every design decision follows from one core abstraction: the partition is an ordered, immutable sequence of records identified by monotonically increasing offsets. Understanding Kafka means understanding how that log is partitioned across brokers, persisted on disk, replicated for durability, and consumed in parallel.
Every internal decision in Kafka traces back to a small set of goals. Keeping them in mind makes the rest of the architecture predictable.
| Goal | How Kafka achieves it |
|---|---|
| Durability | Records are appended to an on-disk log and replicated to multiple brokers before being acknowledged. Nothing lives only in memory. |
| Strict ordering | Within a single partition, records are stored and delivered in the exact order they were appended. Offsets never go backwards. |
| High throughput | Sequential disk writes, batching, compression, the OS page cache, and zero-copy reads. Disk is treated as a fast sequential medium, not random-access storage. |
| Horizontal scalability | A topic is split into partitions spread across brokers. Partitions are the unit of both parallelism and ordering. |
| Replay & multiple readers | Reading does not consume. The log is retained; each consumer tracks its own offset, so many independent readers can re-read the same data. |

The single most important mental model is the log itself: an append-only, ordered file where each record gets the next integer offset. Producers only ever write to the tail. Consumers read forward from a position they control. Because a read is just "give me records starting at offset N," the same record can be delivered to many consumers, re-read after a failure, or reprocessed days later — as long as it is still within the retention window.
A Kafka cluster is a set of brokers — servers that store partition data and serve produce and fetch requests. Unlike a fully peer-to-peer system, Kafka has explicit roles: one broker is the leader for a given partition while others are followers, and a small controller quorum manages cluster metadata and leader assignment.

The key components are:
Historically Kafka stored all cluster metadata (broker membership, topic configs, partition leadership, ISR state) in an external ZooKeeper ensemble, and a single elected controller broker watched ZooKeeper for changes. That added an operational dependency and made metadata changes a bottleneck at large partition counts.
Modern Kafka uses KRaft (Kafka Raft), which removes ZooKeeper entirely. A dedicated set of controller nodes runs a Raft consensus quorum and stores all metadata in an internal, replicated metadata log — the controllers literally use a Kafka-style log to record metadata changes. Brokers subscribe to that metadata log and apply updates as events. The result is faster failover, far higher partition scalability, and one fewer system to operate.
Each partition replica on a broker is a directory on disk. The log is not one giant file — it is split into segments, and each segment is accompanied by sparse index files that make random lookups by offset or timestamp cheap.

.log plus sparse .index and .timeindex files. Old data is reclaimed by retention (delete) or compaction (keep latest per key).The on-disk structures are:
.log file: the actual records, appended sequentially. The file name is the base offset of the first record it contains. Only the newest segment — the active segment — is written to; older segments are immutable, which makes them safe to read, cache, and delete wholesale..index (offset index): a sparse map from a relative offset to a byte position in the .log file. It holds one entry every few kilobytes of data, not one per record. To find offset N, Kafka binary-searches the index for the closest preceding entry, seeks to that byte position, then scans forward in the log. Because it is sparse it is tiny and can be memory-mapped..timeindex (time index): a sparse map from a timestamp to an offset, enabling "give me records from time T" lookups (used by time-based seeks and retention).Reclaiming space happens in one of two ways, chosen per topic by its cleanup.policy:
| Policy | What it does | Best for |
|---|---|---|
| delete (retention) | Deletes whole segments once they are older than retention.ms or the partition exceeds retention.bytes. Deletion is by segment, never by individual record, which keeps it cheap. | Event streams where only recent data matters. |
| compact (log compaction) | A background cleaner rewrites segments keeping only the latest value per key. Older values for the same key are discarded; a record with a null value (a tombstone) marks a key for deletion. | Changelogs and state — where you want the current value for every key, kept forever. |
The producer decides which partition each record goes to, batches records for efficiency, and chooses how strong a durability guarantee it wants via the acks setting.

linger.ms elapses.The partitioner maps each record to a partition:
hash(key) % num_partitions (a deterministic murmur2 hash). All records with the same key land in the same partition, which is how Kafka guarantees per-key ordering.The producer does not send one record at a time. Records accumulate in per-partition batches in the record accumulator. A batch is sent when it reaches batch.size or when linger.ms elapses — a deliberate small delay that trades a little latency for much larger, more efficient batches (and better compression).
acks | Who must acknowledge | Trade-off |
|---|---|---|
| 0 | Nobody — fire and forget | Lowest latency, can silently lose data. |
| 1 | The partition leader only | Fast, but a record acked by a leader that then crashes before followers replicate it is lost. |
| all (-1) | The leader and all in-sync replicas | Strongest durability; the record survives as long as one in-sync replica survives. |
Two further guarantees build on top of acks:
function send(record):
if record.key is not None:
partition = murmur2(record.key) % num_partitions # same key -> same partition
else:
partition = sticky_partition() # fill one batch, then switch
batch = accumulator.batch_for(partition)
batch.append(record, seq = next_seq(partition)) # seq# enables idempotence
if batch.full() or batch.age() > linger_ms:
sender.enqueue(batch) # background thread sends to the leader
# on send: broker acks per the acks setting (0 / 1 / all)
Durability comes from replication. Each partition is configured with a replication factor — the number of copies. One replica is the leader; the rest are followers that continuously fetch from the leader to stay caught up. All reads and writes for a partition go through its leader; followers exist purely for redundancy and failover.

The central concepts are:
replica.lag.time.max.ms). A follower that falls behind is dropped from the ISR; when it catches up again it rejoins. The leader is always in its own ISR.A write under acks=all is acknowledged only once the high watermark advances past it, which means it is durably on every in-sync replica. The minimum size of the ISR can be enforced with min.insync.replicas: if the ISR shrinks below that, the leader refuses acks=all writes rather than accept data it cannot safely replicate.
Putting the producer and replication together, here is the full life of a write under acks=all. The key property is that an acknowledgement means the record is committed — durable on every in-sync replica and visible to consumers.

Step by step:
.log — a sequential disk write — advancing its log end offset.min of the ISR's offsets. The record is now committed.acks=1; after the HW advances, for acks=all).on leader, produce(record, acks):
leader_log.append(record) # 1-2. sequential write, LEO++
if acks == 1:
ack() # leader durable, may lose on failover
# followers pull in the background:
on follower_fetch(follower, fetch_offset):
follower.offset = fetch_offset # 3-4. report replicated position
new_hw = min(offset of r for r in ISR) # 5. advance only over in-sync replicas
high_watermark = new_hw
if acks == "all" and record.offset <= high_watermark:
ack() # 6. committed on all ISR -> safe
Consumers read by issuing fetch requests to partition leaders. The scaling unit is the consumer group: a set of consumers that cooperate to read a topic, with each partition assigned to exactly one consumer in the group. That is how Kafka delivers each record once per group while still reading partitions in parallel.

__consumer_offsets topic.The mechanics:
When a consumer joins, leaves, or dies, the group must reassign partitions — a rebalance. There are two protocols:
| Protocol | Behavior |
|---|---|
| Eager | Stop-the-world: every consumer revokes all its partitions, then the coordinator reassigns everything. Simple, but the whole group pauses during the rebalance. |
| Cooperative (incremental) | Only the partitions that actually need to move are revoked; consumers keep processing the partitions they retain. Much smaller disruption, now the default for most workloads. |
A consumer records how far it has read by committing offsets. These are written to an internal compacted topic, __consumer_offsets, keyed by (group, topic, partition) — so the latest commit per partition is always retained. On restart or after a rebalance, a consumer resumes from its last committed offset. Committing after processing gives at-least-once delivery; committing before processing gives at-most-once. The offset is just data in a Kafka topic, which is why it survives broker restarts with no extra machinery.
consumer in group G subscribed to topic T:
partitions = coordinator.assign(G, me) # my share of T's partitions
for p in partitions:
pos[p] = committed_offset(G, p) or reset_policy # resume point
loop:
records = fetch(leader_of(p), pos[p]) # pull from each partition leader
process(records)
pos[p] += len(records)
commit(G, p, pos[p]) # write to __consumer_offsets
# on member change: coordinator triggers a rebalance (eager or cooperative)
A consumer fetch is served entirely by the partition leader. The reason Kafka can saturate network cards on commodity hardware is that the read path barely touches user space at all — it leans on the operating system.

sendfile() copies data straight from page cache to the network socket without ever entering the broker's user-space memory.Two operating-system features do the heavy lifting:
sendfile(): normally, sending a file over a socket copies bytes from the kernel page cache into a user-space buffer, then back into a kernel socket buffer. Kafka instead uses the sendfile system call, which tells the kernel to copy the requested bytes directly from the page cache to the network socket. The data never enters the JVM, eliminating two copies and the associated CPU and garbage-collection cost.This works precisely because records are stored on disk in exactly the wire format consumers expect — no per-record deserialization or transformation is needed on the broker, so the bytes can be streamed untouched. Combined with sequential reads (consumers march forward through the log) and large batched fetches, the broker mostly acts as a fast pipe from page cache to socket.
on leader, fetch(partition, offset, max_bytes):
pos = index.lookup(offset) # sparse index -> byte position
# the OS page cache already holds hot (tail) segments in RAM
return sendfile(log_file, pos, max_bytes, socket)
# kernel copies page cache -> socket directly; bytes never enter the JVM
The whole system is built from a few reinforcing ideas:
| Concern | Mechanism |
|---|---|
| What is the core abstraction? | An append-only, ordered, offset-indexed commit log, split into partitions. |
| How does it scale and order? | Partitions: the unit of parallelism and the unit of ordering. Order holds within a partition, never across. |
| How is data stored and reclaimed? | Immutable segments with sparse offset/time indexes; reclaimed by retention (delete) or compaction (latest per key). |
| How do producers control durability? | acks 0/1/all, batching with linger.ms, idempotent IDs, and optional transactions. |
| How is data made durable? | Leader/follower replication; the ISR plus the high watermark define what is committed and readable. |
| How do many readers share work? | Consumer groups assign one partition to one consumer; offsets are committed to __consumer_offsets. |
| How is it so fast? | Sequential I/O, the OS page cache, and zero-copy sendfile from cache straight to the socket. |
| How do brokers coordinate? | A KRaft (Raft) controller quorum holds metadata and assigns leaders — no more ZooKeeper. |