← Back to index

Elasticsearch — Internal Architecture

A developer's guide to how Elasticsearch actually works under the hood: the distributed layer over Lucene, node roles and cluster state, shards and routing, the inverted index, the indexing and query paths, and how near-real-time search is achieved.

Elasticsearch is a distributed search and analytics engine built on top of Apache Lucene. Lucene is a single-machine, full-text search library: it stores an inverted index and answers term queries fast, but it knows nothing about clusters, networks, or replication. Elasticsearch wraps many Lucene indexes and adds the distributed half: it shards data across nodes, replicates each shard for safety, elects a master to manage cluster state, and exposes everything through a JSON-over-HTTP API. Understanding Elasticsearch means understanding how that distributed layer coordinates a fleet of independent Lucene engines.

Contents

  1. Design Goals and Core Ideas
  2. Cluster Architecture & Node Roles
  3. Indices, Shards & Routing
  4. Lucene Internals: The Inverted Index
  5. Indexing Path
  6. Near-Real-Time Search
  7. Replication Model
  8. Query Path
  9. Analysis: Text to Terms
  10. Summary

1. Design Goals and Core Ideas

Every design choice in Elasticsearch follows from one ambition: take Lucene's fast single-machine full-text search and make it distributed, horizontally scalable, and near-real-time, without giving up the rich query and aggregation capabilities developers expect.

GoalHow Elasticsearch achieves it
Full-text search at scaleEach shard is a self-contained Lucene index with an inverted index; queries run in parallel across shards.
Horizontal scalabilityAn index is split into shards spread across data nodes. Add nodes and shards relocate to use the new capacity.
High availabilityEvery primary shard has one or more replicas on other nodes. Lose a node and a replica is promoted.
Near-real-timeNew documents become searchable within a refresh interval (about one second), not instantly, trading a little latency for throughput.
Analytics, not just searchDoc-values columnar storage and aggregations let the same engine compute metrics, histograms, and facets.
Elasticsearch over Lucene
Elasticsearch is the distributed layer (sharding, replication, cluster state, REST API) wrapped around per-shard Lucene engines.

2. Cluster Architecture and Node Roles

An Elasticsearch cluster is a set of nodes that share a cluster name. Unlike a fully peer-to-peer system, Elasticsearch has an elected master that owns the cluster's metadata. The master does not sit in the data path — it manages the cluster state: the list of nodes, index mappings and settings, and which shard lives on which node. Every node holds a copy of the cluster state, and the master publishes updates to all of them.

Cluster architecture and node roles
The elected master manages cluster state; data nodes hold shards; coordinating nodes scatter and gather requests.

Nodes are configured with one or more roles:

Discovery is how nodes find each other and form the cluster. On startup a node contacts a configured list of seed hosts, learns about other master-eligible nodes, and participates in master election. Once a master is elected, it admits new nodes by adding them to the cluster state and publishing it. Failure detection runs continuously: the master pings the other nodes and they ping the master; a node that stops responding is removed from the cluster state, and the master reallocates its shards.

3. Indices, Shards and Routing

An index is a logical collection of documents (for example, all your logs). Physically, an index is split into primary shards, and each primary can have one or more replica shards. A shard is a complete, independent Lucene index. Splitting an index into shards is what lets a single index outgrow one machine and lets queries run in parallel.

Index, shards and routing
An index of three primaries, each with one replica. A document's _routing key hashes to exactly one primary.

Which shard does a document go to? Elasticsearch routes it deterministically:

function route(document):
  routing = document._routing or document._id   # default routing key is the doc id
  shard   = hash(routing) % num_primary_shards  # pick exactly one primary
  return shard

Two consequences matter in practice. First, the same key always lands on the same shard, so a get-by-id or a routed query touches only one shard. Second, num_primary_shards is fixed when the index is created — it is in the denominator of the routing formula, so changing it would relocate nearly every document. (Elasticsearch offers explicit split and shrink operations that rebuild the index when you genuinely need a different primary count.)

Replicas do not change the routing math. A replica is a byte-for-byte copy of its primary on a different node. Replicas serve two purposes: they take over if the node holding the primary fails (a replica is promoted to primary), and they add read throughput because searches can run against either the primary or a replica.

A useful mental model: primaries determine how much data you can hold and how parallel writes are; replicas determine how much you can survive losing and how parallel reads are.

4. Lucene Internals: The Inverted Index

At the bottom of every shard is Lucene, and at the heart of Lucene is the inverted index. A forward index maps a document to its words; an inverted index maps each term to the list of documents that contain it (the postings list). That inversion is what makes "find every document containing fox" a single dictionary lookup instead of a scan.

Inverted index
Documents are analyzed into terms; the inverted index maps each term to the documents (postings) that contain it.

Lucene stores this efficiently:

Because segments are immutable, a delete cannot remove a document in place. Instead Lucene records the doc id in a per-segment deleted-documents bitset and filters it out of results. An update is just a delete plus a fresh insert. The space is reclaimed later, during merge.

5. Indexing Path

When a document arrives at the primary shard, Elasticsearch must make it both durable (survive a crash) and eventually searchable, while keeping the write fast. It does this with an in-memory buffer plus a write-ahead log called the translog, and two background events: refresh and flush.

Indexing path
A document is added to the in-memory buffer and the translog; refresh makes it searchable in a new segment; flush fsyncs and clears the translog; merge consolidates segments.

The stages are:

  1. Buffer + translog. The document is added to an in-memory buffer and, for durability, appended to the translog on disk. The write is acknowledged once the translog write is safe. The document is durable but not yet searchable.
  2. Refresh. Periodically (default every second) the buffer is turned into a new Lucene segment and that segment is opened for search. This is when the document becomes visible to queries. Refresh does not fsync, so it is cheap; the new segment lives in the OS file cache.
  3. Flush. Less often, Elasticsearch performs a Lucene commit: it fsyncs the segments to disk, writes a commit point, and truncates the translog. After a flush, the operations recorded in the translog are safely on disk, so the translog can be cleared.
  4. Merge. Refreshes produce many small segments. A background merge process combines them into fewer, larger segments, and in doing so physically drops documents marked deleted. Merging keeps the number of segments — and therefore search cost — under control.
function index(doc):
  buffer.add(doc)                 # 1. in-memory, not yet searchable
  translog.append(doc)            # 1. durability (write-ahead log)
  ack()                           # acknowledged once translog is safe

every refresh_interval:           # 2. ~1s
  segment = buffer.to_segment()
  open_for_search(segment)        # now searchable; no fsync yet
  buffer.clear()

on flush:                         # 3. periodic Lucene commit
  fsync(all_segments)
  write_commit_point()
  translog.truncate()             # ops are durable in segments now

background:                       # 4. merge
  merge_small_segments()          # fewer files; purge deleted docs

The translog is what bridges the gap between durability and searchability: a document is durable the moment it is in the translog, even though it will not appear in search results until the next refresh, and its segment will not be fsynced until the next flush. On restart after a crash, Elasticsearch replays the translog to recover any operations that had not yet been flushed.

6. Near-Real-Time Search

Elasticsearch is near-real-time, not real-time. A freshly indexed document is not instantly searchable; it becomes visible only after the next refresh. The reason is structural: searches run against Lucene segments, and a document is not in any segment until refresh builds one from the in-memory buffer.

Near-real-time search and the refresh interval
A document indexed at t=0.2s is durable immediately but only becomes searchable at the next refresh (default 1s).

The refresh_interval setting controls the trade-off:

This is a deliberate design choice. Making every document instantly searchable would mean committing a tiny segment per document, which would crush indexing throughput and flood the merge process. By batching new documents into a segment once per interval, Elasticsearch amortizes that cost. The translog still guarantees durability in the meantime, so "not yet searchable" never means "not yet safe."

7. Replication Model

Elasticsearch replicates each shard with a primary-backup model. All writes for a shard go to its primary first; the primary then forwards them to the replicas. This is different from Cassandra's quorum model — there is a designated primary that orders the writes, which gives Elasticsearch a clear, single source of truth for each shard.

Replication model
Writes go to the primary, which indexes locally and forwards the operation to each replica before acknowledging the client.

The write sequence for a single document is:

function replicate_write(doc):
  primary.validate(doc)
  primary.index(doc)                       # 1. apply locally, assign a seq#
  for replica in in_sync_replicas:
    send(replica, op = doc, seq = seq_no)   # 2. forward the same operation
  wait_for_replica_acks()                   # 3. replicas apply and ack
  return ack_to_client()                    # 4. ack once replicas confirm

Two bookkeeping concepts keep replicas in step and make recovery cheap:

ConceptWhat it is
Sequence numberThe primary stamps every operation with a monotonically increasing seq#, so every copy can tell exactly which operations it has applied and in what order.
Local checkpointPer shard copy: the highest seq# below which that copy has applied every operation with no gaps.
Global checkpointThe minimum local checkpoint across all in-sync copies. Every operation at or below it is guaranteed present on all in-sync copies.

When a replica rejoins after a brief outage, it does not copy the whole shard. It compares checkpoints with the primary and replays only the operations after its checkpoint from the translog — a fast recovery of just the gap. The set of in-sync copies is tracked in the cluster state; a replica that falls too far behind is dropped from that set and must do a full file-based recovery instead.

8. Query Path

A search must consult every shard of the index, because any shard could hold a matching document. The coordinating node fans the query out to the shards (scatter) and combines their answers (gather). To avoid shipping full documents from every shard, Elasticsearch splits a search into two phases: query-then-fetch.

Query path: scatter-gather, query-then-fetch
Query phase: each shard returns top-K doc ids and scores. Fetch phase: only the globally winning documents are retrieved.
function search(query, size):
  # --- query phase: scatter ---
  for shard in index.shards:
    copy = pick_copy(shard)                 # primary or a replica
    partial[shard] = copy.query(query, top = size)   # ids + scores only
  top_ids = merge_and_sort(partial)[:size]  # global top-K

  # --- fetch phase: gather full docs ---
  for id in top_ids:
    docs.add(shard_of(id).fetch(id))        # retrieve only the winners
  return docs

The reason for two phases is bandwidth. If each shard returned full documents, the coordinating node would receive shards × size documents and throw most of them away. Returning lightweight ids and scores first, then fetching only the final winners, keeps the network cost proportional to the page size, not to the shard count. This is also why deep pagination is expensive: to return page 1000, every shard must still rank and return the top (page × size) ids for the coordinating node to merge.

9. Analysis: Turning Text Into Terms

Before text can go into an inverted index, it must be broken into terms. That job belongs to an analyzer, a pipeline of three stages applied in order.

Analysis pipeline
An analyzer runs character filters, then a tokenizer, then token filters, producing the terms stored in the inverted index.
StageWhat it doesExample
Character filtersPre-process the raw string before tokenizing.Strip HTML tags; map & to "and".
TokenizerSplit the stream into tokens.Standard tokenizer breaks on whitespace and punctuation: "Quick-Brown" → "Quick", "Brown".
Token filtersTransform, add, or drop tokens.Lowercase; remove stop words; stem "running" → "run".

The subtle but crucial point is that analysis happens at two times, and they must agree:

If the two used different analyzers, matches would silently fail. Index "FOX" as the lowercased term fox but search for the unanalyzed term FOX, and there is no match. Using the same analyzer on both sides — lowercasing, stemming, and stop-word removal applied identically — is what makes a search for "the Foxes" find a document that said "fox." (Fields meant for exact matching, like ids or tags, use the keyword type, which skips analysis and stores the value verbatim.)

10. Summary

Elasticsearch is best understood as a thin-but-clever distributed layer over many Lucene engines. The pieces reinforce each other:

ConcernMechanism
What does the searching?Per-shard Lucene inverted indexes made of immutable segments.
Who manages the cluster?An elected master owns and publishes the cluster state; it stays out of the data path.
Where does a document live?shard = hash(_routing) % num_primaries; primary count is fixed at create time.
How is it durable yet fast?In-memory buffer + translog; acknowledge on translog write, fsync later on flush.
Why not instant search?Documents are searchable only after refresh builds a new segment (about one second).
How are copies kept safe?Primary-backup replication with sequence numbers and local/global checkpoints for fast recovery.
How does a search work?Coordinating node scatter-gathers; query phase finds ids, fetch phase retrieves the winners.
How does text become searchable?Analyzers (char filters → tokenizer → token filters) at both index and search time.
The recurring theme: Lucene does the searching, Elasticsearch does the distributing. Immutable segments, an append-only translog, batched refreshes, and a single ordering primary per shard are what let it be fast, durable, and near-real-time all at once.