Apache Lucene — Internal Architecture

A developer's guide to how Lucene actually works under the hood: the inverted index, immutable segments, the analyzer pipeline, the indexing and search paths, BM25 scoring, deletes and merging, near-real-time reopen, and the on-disk codec files that engines like Elasticsearch and Solr are built on.

Apache Lucene is a single-machine, embeddable full-text search library written in Java. It is not a server: it has no cluster, no network layer, and no query language in its core (the classic query-parser syntax lives in a separate module). What it offers is an API for building an inverted index on local disk and answering ranked queries against it quickly. Engines like Elasticsearch and Solr supply the distributed half; Lucene is the search engine inside each shard. Almost everything about Lucene follows from two decisions: the core data structure is an inverted index, and that index is stored as a set of immutable segments. Once you understand those two ideas, the indexing path, deletes, merging, and near-real-time search all fall out as consequences.

Contents

  1. Design Goals and Core Ideas
  2. The Index Model: Documents & Fields
  3. Analysis: Text to Terms
  4. The Inverted Index
  5. Segments & Commits
  6. The Indexing Path
  7. The Search Path
  8. Relevance Scoring (BM25)
  9. Deletes & Updates
  10. Segment Merging
  11. Near-Real-Time Search
  12. Segment Files & Codecs
  13. Summary

1. Design Goals and Core Ideas

Every internal choice in Lucene serves fast, ranked retrieval over large text collections from a single process. Keeping the goals in mind makes the rest of the design predictable.

Goal | How Lucene achieves it
Fast full-text search | An inverted index maps each term directly to the documents that contain it, so a query reads only the postings for its terms, never the whole collection.
Write-once, read-many storage | Index data lives in immutable segments. Nothing already written is ever modified in place, which makes reads lock-free and files trivially cacheable.
Compact on disk | Terms are stored once per field in each segment's term dictionary, which is indexed by an FST; postings and doc IDs are delta-encoded and compressed. The index is typically a fraction of the original text size.
Exact and ranked retrieval | Boolean queries find the matching set exactly; a similarity model (BM25 by default) scores and ranks those matches by relevance.
Near-real-time visibility | New documents become searchable by opening a reader over freshly flushed segments, without a full, durable commit.
Embeddable | It is a library you call in-process, not a service you connect to. The distributed concerns (sharding, replication) are left to whoever embeds it.

Figure: Lucene core design ideas. Documents are analyzed into terms, terms are stored in an inverted index spread across immutable segments, and a searcher answers ranked queries over them.

A useful mental model: Lucene is a library for one index on one machine. If you ever need more than one machine, you do not change Lucene — you run many Lucene indexes and coordinate them above, which is exactly what Elasticsearch and Solr do.
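
The "compact on disk" goal above rests partly on delta encoding. As a rough illustration only, the sketch below stores the gaps between ascending doc IDs and writes each gap as a variable-length integer (7 payload bits per byte). The function names are hypothetical, and Lucene's real postings format is more elaborate than this; treat it as a picture of the idea, not of the file layout.

```python
def encode_postings(doc_ids):
    """Delta-encode ascending doc IDs, then write each gap as a variable-length int."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev          # gaps stay small when the list is dense
        prev = doc_id
        while gap >= 0x80:           # 7 payload bits per byte, high bit = "more follows"
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings: rebuild absolute doc IDs from the gaps."""
    doc_ids, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:              # continuation bit set: this gap has more bytes
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap, shift = 0, 0
    return doc_ids


postings = [3, 7, 8, 300, 100000]
encoded = encode_postings(postings)
```

Here the five doc IDs take 8 bytes instead of the 20 that fixed 32-bit integers would need, and the saving grows as postings lists get denser.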

2. The Index Model: Documents & Fields

The unit you add to Lucene is a document: an ordered list of fields, where each field has a name, a value, and a type that decides what Lucene does with it. Lucene has no fixed schema — two documents in the same index can carry different fields — but each field's type controls which on-disk structures it populates.

Figure: Lucene document and field model. A field can opt into several independent capabilities. You pay storage only for the ones you enable.

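A single field can contribute to up to four different structures, each answering a different kind of question:

  - Inverted index: which documents contain this term?
  - Stored fields: what was the original value, to return with a hit?
  - Doc values: what is this document's value, for sorting, faceting, and aggregation?
  - Points: which documents fall inside this numeric, date, or geo range?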

The same logical field, say price, is often configured for several of these at once — a point for range filters, a doc value for sorting, and stored for display — because each backs a different access pattern.

3. Analysis: Text to Terms

Text is not indexed as-is. Before it reaches the inverted index it passes through an analyzer, a pipeline that turns a raw string into a stream of normalized terms. The analyzer has three stages.

Figure: Lucene analysis pipeline. Character filters clean the raw text, the tokenizer splits it into tokens, and token filters normalize each token into the final indexed terms.

function analyze(text, analyzer):
  text   = analyzer.char_filters.apply(text)   # e.g. strip HTML
  tokens = analyzer.tokenizer.split(text)       # "Quick-Brown" -> Quick, Brown
  for f in analyzer.token_filters:
    tokens = f.apply(tokens)                     # lowercase, stop, stem
  return tokens                                  # -> [quick, brown, fox]

The same analyzer runs at index time and at query time. If "FOX" is indexed as the term fox, then a search for "Fox" must be analyzed to fox too, or it would never match. Mismatched index-time and query-time analysis is a very common cause of "why doesn't my search find anything?".
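
To make the three stages concrete, here is a minimal runnable sketch of such a pipeline. Everything in it is an assumption made for illustration: the regex-based HTML stripping, the alphanumeric tokenizer, and the tiny stop-word set are stand-ins, not Lucene's analyzers.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of"}   # illustrative only


def strip_html(text):
    """Character filter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split on anything that is not a letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


def analyze(text):
    """Run char filter, tokenizer, then token filters (lowercase, stop words)."""
    tokens = tokenize(strip_html(text))
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


indexed_terms = analyze("<b>The Quick-Brown FOX</b>")
query_terms = analyze("Fox")
```

Because both sides go through the same `analyze`, the query term `fox` matches the indexed term `fox`; searching for the raw string "Fox" against these terms would find nothing.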

4. The Inverted Index

The inverted index is the heart of Lucene. Instead of mapping documents to the terms they contain (a "forward" index), it maps each term to the list of documents that contain it — its postings list. Answering a query then means reading a few short postings lists rather than scanning every document.

Figure: Lucene inverted index. Documents are analyzed into terms; each term points to a postings list of the documents that contain it, with term frequencies (and optionally positions).

The index has two parts. The term dictionary is the sorted set of all terms in a field; Lucene holds an in-memory index into it as a finite state transducer (FST), a compact, prefix-sharing structure that maps a term to the on-disk location of its postings. The postings themselves store, for each term, the documents that contain it — and optionally the term frequency, positions, and offsets used for ranking and phrase queries.

Because the term dictionary is sorted and postings are stored as ascending document IDs, Boolean queries become efficient list operations. An AND of two terms is a leapfrog intersection: whichever list is behind skips forward to the other's current document, so long runs of non-matching documents are never read one by one:

function intersect(postingsA, postingsB):     # docs containing BOTH terms
  result = []
  a = postingsA.first();  b = postingsB.first()
  while a != END and b != END:
    if a.doc == b.doc:
      result.add(a.doc);  a = postingsA.next();  b = postingsB.next()
    elif a.doc < b.doc:
      a = postingsA.advance(b.doc)            # skip-list jump, not linear scan
    else:
      b = postingsB.advance(a.doc)
  return result

Postings carry skip lists, so advance(target) can jump ahead instead of stepping one document at a time — which is what keeps multi-term queries fast even on long lists.
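
A small runnable version of the same idea, under simplifying assumptions: postings are plain Python lists of doc IDs, whitespace splitting stands in for analysis, and a binary search stands in for the skip list behind `advance`. The helper names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(docs):
    """Map each term to an ascending list of doc IDs (its postings list)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)   # doc IDs arrive in ascending order
    return index


def advance(postings, start, target):
    """Position of the first doc ID >= target, searching from `start`."""
    return bisect_left(postings, target, start)


def intersect(a, b):
    """Doc IDs present in both ascending lists."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j])      # jump, do not step one doc at a time
        else:
            j = advance(b, j, a[i])
    return result


docs = ["the quick brown fox", "the lazy dog", "quick brown dogs", "a quick fox"]
index = build_index(docs)
both = intersect(index["quick"], index["fox"])
```

With these four documents, `quick` has postings `[0, 2, 3]`, `fox` has `[0, 3]`, and their intersection is `[0, 3]`.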

5. Segments & Commits

A Lucene index is not one big file. It is a directory containing a set of segments, each a small, complete, self-contained inverted index over a subset of the documents. Crucially, a segment is immutable: once written, its files never change.

Figure: Lucene segments and commit point. An index is a set of immutable segments plus a commit point (segments_N) that names the ones currently live.

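Immutability is the design decision that everything else rests on:

  - Reads are lock-free, because no reader can see a file change underneath it.
  - Segment files are trivially cacheable.
  - Leaf readers over unchanged segments can be shared safely when a reader is reopened (see §11).
  - A commit only has to publish a new list of segment names, since the segments themselves never change.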

What ties the segments together is the commit point: a small file named segments_N that lists exactly which segments are currently part of the index. A commit fsyncs the new segment files and then atomically writes a new segments_N+1. Until that file is written, a crash simply leaves the index at the previous commit — durability is the atomic swap of one tiny file.
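
The write-then-atomically-publish pattern described above can be sketched with ordinary file operations. This is an illustration under stated assumptions, not Lucene's on-disk format: the commit file here is a plain newline-separated list, the generation is written in decimal, the `pending_` prefix is illustrative, and the directory itself is not fsynced.

```python
import os
import tempfile


def commit(index_dir, generation, segment_names):
    """Publish a new commit point naming the live segments."""
    final_path = os.path.join(index_dir, f"segments_{generation}")
    pending_path = os.path.join(index_dir, f"pending_segments_{generation}")
    with open(pending_path, "w", encoding="utf-8") as f:
        f.write("\n".join(segment_names))
        f.flush()
        os.fsync(f.fileno())              # contents reach disk before the rename
    os.replace(pending_path, final_path)  # atomic rename: all or nothing
    return final_path


def latest_commit(index_dir):
    """Read the highest-numbered segments_N, ignoring unfinished pending files."""
    gens = [int(name.split("_")[1]) for name in os.listdir(index_dir)
            if name.startswith("segments_")]
    if not gens:
        return None, []
    gen = max(gens)
    with open(os.path.join(index_dir, f"segments_{gen}"), encoding="utf-8") as f:
        return gen, f.read().split("\n")


index_dir = tempfile.mkdtemp()
commit(index_dir, 1, ["_0", "_1"])
commit(index_dir, 2, ["_0", "_1", "_2"])
```

If the process dies before the rename, only a pending file is left behind, and `latest_commit` still returns the previous generation.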

6. The Indexing Path

The class that writes to an index is the IndexWriter. Only one may have a given index open for writing at a time (enforced by a write lock). Adding documents flows through an in-memory buffer and out to a new segment.

Figure: Lucene indexing path. Documents are analyzed and buffered in RAM per indexing thread, flushed to a new immutable segment, and made durable by a commit.

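Step by step:

  1. Each document is analyzed into terms.
  2. The terms are buffered in RAM, in the indexing thread's DocumentsWriterPerThread (DWPT).
  3. When the buffer passes the flush threshold, it is written to disk as a new immutable segment.
  4. A commit fsyncs the segment files and writes a new segments_N, which makes them durable.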

writer = IndexWriter(directory, config)
for doc in source:
  writer.addDocument(analyze(doc))   # into this thread's DWPT (RAM)
  if dwpt.ram_used > flush_threshold:
    segment = dwpt.flush_to_disk()   # new immutable segment, not yet durable
writer.commit()                      # fsync + write segments_N (durable)
writer.close()

Flushing and committing are different events. A flush writes a new segment, which a reopened reader can then search; a commit makes that segment durable, so it survives a crash. Near-real-time search exploits exactly this gap (see §11).
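
The difference between the two events is easy to see in a toy model. The class below is hypothetical and keeps everything in memory: a document-count threshold stands in for the RAM-based flush trigger, tuples stand in for immutable segments, and `crash` simply discards whatever was not committed.

```python
class ToyWriter:
    """Buffers docs in RAM, flushes them to immutable segments, commits on request."""

    def __init__(self, max_buffered_docs=2):
        self.max_buffered_docs = max_buffered_docs
        self.buffer = []        # stands in for a DWPT's RAM buffer
        self.flushed = []       # segments that exist but may not be durable yet
        self.committed = []     # segments named by the last commit point

    def add_document(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed.append(tuple(self.buffer))   # tuple: never modified again
            self.buffer = []

    def commit(self):
        self.flush()
        self.committed = list(self.flushed)           # the "segments_N" swap

    def crash(self):
        """Simulate a crash: only committed segments survive."""
        self.buffer = []
        self.flushed = list(self.committed)


writer = ToyWriter(max_buffered_docs=2)
for doc in ["a", "b", "c"]:
    writer.add_document(doc)
```

After the loop, one segment `("a", "b")` has been flushed, `"c"` is still buffered, and nothing is committed, so a crash at this point would lose all three documents.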

7. The Search Path

Reading is done through a DirectoryReader, which opens one leaf reader per segment, wrapped by an IndexSearcher. A query is not executed directly; it is compiled into objects that know how to iterate postings and produce scores.

Figure: Lucene search path. A query is rewritten into a Weight, executed as a per-segment Scorer over postings, collected into a top-K queue, and only the winners have their stored fields fetched.

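The pipeline is:

  1. The query is rewritten and compiled into a Weight, which captures the collection statistics needed for scoring.
  2. For each segment, the Weight produces a Scorer that iterates the matching documents in that segment's postings.
  3. Each live match is scored and offered to a top-K priority queue.
  4. Stored fields are fetched only for the K winners.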

searcher = IndexSearcher(DirectoryReader.open(directory))
weight   = searcher.createWeight(query)        # stats for scoring
topk     = PriorityQueue(maxsize = K)          # keep best K by score

for leaf in searcher.leaves():                 # one per segment
  scorer = weight.scorer(leaf)
  for doc in scorer:                           # skip-list iteration over postings
    if leaf.live_docs.get(doc):                # skip deleted docs
      topk.insert(doc, scorer.score())

return [fetch_stored_fields(hit) for hit in topk.sorted()]   # only the K winners
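
The collection step can be made runnable with a bounded min-heap. This is a sketch under simplifying assumptions: segments are plain dicts, the hypothetical `score_fn` is applied to every live document rather than driven by postings, and doc IDs are made global by adding each segment's base offset.

```python
import heapq


def search(segments, score_fn, k):
    """Return the k best (score, global_doc_id) pairs across all segments."""
    if k <= 0:
        return []
    heap = []                                   # min-heap: weakest kept hit on top
    doc_base = 0
    for seg in segments:
        for local_id, doc in enumerate(seg["docs"]):
            if not seg["live"][local_id]:       # deleted document: skip it
                continue
            score = score_fn(doc)
            if score <= 0:                      # not a match
                continue
            hit = (score, -(doc_base + local_id))   # negated ID: lower ID wins ties
            if len(heap) < k:
                heapq.heappush(heap, hit)
            elif hit > heap[0]:
                heapq.heapreplace(heap, hit)    # evict the current weakest hit
        doc_base += len(seg["docs"])
    return [(score, -neg_id) for score, neg_id in sorted(heap, reverse=True)]


def count_fox(doc):
    """Toy scoring function: number of times 'fox' occurs."""
    return doc.split().count("fox")


segments = [
    {"docs": ["fox fox", "dog", "fox"], "live": [True, True, False]},
    {"docs": ["fox fox fox", "cat"], "live": [True, True]},
]
hits = search(segments, count_fox, k=2)
```

The heap never holds more than K entries, so memory stays bounded no matter how many documents match, and the deleted document in the first segment never reaches the queue.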

8. Relevance Scoring (BM25)

Boolean matching decides which documents qualify; a similarity decides their order. Lucene's default is BM25, which scores a document for a query term from three intuitive signals.

Signal | Meaning | Effect on score
Term frequency (tf) | How often the term appears in the document. | More occurrences raise the score, but with diminishing returns: the tenth occurrence adds far less than the second.
Inverse document frequency (idf) | How rare the term is across the whole index. | Rare terms are more discriminating, so they count for more; common terms count for little.
Field length | Length of the field relative to the average length. | A match in a short field is worth more than the same match buried in a long one (length normalization).

score(doc, query) = Σ over query terms t:
    idf(t) · ( tf(t,doc) · (k1 + 1) )
            / ( tf(t,doc) + k1 · (1 - b + b · docLen / avgDocLen) )

  idf(t) = ln( 1 + (N - n_t + 0.5) / (n_t + 0.5) )   # N = #docs, n_t = #docs with t
  k1 ≈ 1.2   # tf saturation: higher = tf matters longer
  b  ≈ 0.75  # length normalization: 0 = ignore length, 1 = full

The two tunables are k1 (how quickly extra term occurrences stop helping) and b (how strongly long fields are penalized). The idf values come from the collection statistics captured in the Weight, which is why the Weight has to be created before any postings are iterated.
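
A direct transcription of the formula as written above, for one query term, makes the three signals easy to experiment with. The function and parameter names are chosen for this sketch; the defaults mirror the k1 and b values quoted in the formula.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency: large for rare terms, small for common ones."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of a single query term to one document's score."""
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf(num_docs, doc_freq) * (tf * (k1 + 1)) / (tf + length_norm)


# Same term (in 10 of 1000 docs), average-length field, increasing tf.
scores_by_tf = [bm25_term(tf, 100, 100, 1000, 10) for tf in (1, 2, 9, 10)]
gain_early = scores_by_tf[1] - scores_by_tf[0]   # second occurrence
gain_late = scores_by_tf[3] - scores_by_tf[2]    # tenth occurrence
```

For a field of exactly average length and tf = 1 the term score reduces to idf(t), and however large tf grows the score stays below idf(t) · (k1 + 1), which is the saturation the table describes.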

9. Deletes & Updates

Since segments are immutable, Lucene cannot physically remove a document on request. Instead a delete is recorded as a tombstone: the document's bit is cleared in the segment's live-docs bitset (.liv). The data stays on disk; searches simply skip any document whose live bit is off (the live_docs.get(doc) check in §7).

An update is therefore not an in-place edit at all. updateDocument(term, doc) is a delete-by-term followed by an add, applied atomically: the old document is tombstoned and a new version is written into the current in-memory buffer, landing in a future segment.

function updateDocument(term, newDoc):
  deleteDocuments(term)        # mark old doc's bit off in .liv
  addDocument(newDoc)          # new version -> RAM buffer -> new segment

This is why a heavily updated index can hold far more data on disk than its live document count suggests: deleted and superseded documents linger inside their segments until a merge rewrites them out. Reclaiming that space is one of the jobs of merging.
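
The lingering of superseded documents shows up clearly in a toy model. The class below is hypothetical: it keeps one append-only document list with a parallel list of live bits, and it ignores segments and atomicity entirely.

```python
class ToyIndex:
    """Append-only document store with a live-docs bit per document."""

    def __init__(self):
        self.docs = []     # entries are never edited or removed
        self.live = []     # plays the role of the .liv bitset

    def add_document(self, doc):
        self.docs.append(doc)
        self.live.append(True)

    def delete_documents(self, field, value):
        for doc_id, doc in enumerate(self.docs):
            if self.live[doc_id] and doc.get(field) == value:
                self.live[doc_id] = False      # tombstone: data stays in place

    def update_document(self, field, value, new_doc):
        self.delete_documents(field, value)
        self.add_document(new_doc)

    def live_docs(self):
        return [d for d, alive in zip(self.docs, self.live) if alive]


index = ToyIndex()
index.add_document({"id": "1", "title": "old"})
index.add_document({"id": "2", "title": "other"})
index.update_document("id", "1", {"id": "1", "title": "new"})
```

After one update the store holds three documents but only two are live; the old version of document 1 still occupies space.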

Lucene also supports soft deletes, where the tombstone is a doc-values marker rather than a hard removal, so the old version can be retained (for example to support point-in-time or change-tracking use cases) until a retention policy lets it be merged away.

10. Segment Merging

Flushing constantly produces new, small segments, and deletes leave dead documents behind. Left alone, an index would drift toward thousands of tiny segments — and every query has to visit every segment. Merging is the background process that keeps this in check by combining several segments into one larger segment.

Figure: Lucene segment merging. Many small segments (some carrying deletes) are merged into fewer, larger ones, and documents tombstoned in .liv are physically dropped in the process.

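Merging does two things at once:

  - It bounds the segment count, so each query has fewer segments to visit.
  - It reclaims space, because documents tombstoned in .liv are not copied into the merged segment.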

A MergePolicy decides which segments to merge and when. The default, TieredMergePolicy, groups segments into size tiers and merges within a tier, preferring segments with many deletes. Merges run on background threads, reading the inputs and writing one new segment before the old ones are dropped at the next commit — so search continues uninterrupted while a merge is in flight.

function maybeMerge(segments, policy):       # runs continuously in background
  candidates = policy.findMerges(segments)   # by size tier + delete ratio
  for group in candidates:
    new_seg = merge(group.live_documents())  # dead docs excluded here
    atomically_replace(group -> new_seg)     # visible at next commit
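
A runnable miniature of the merge step itself, leaving the policy out: segments are plain dicts with a `docs` list and a parallel `live` list (an assumption of this sketch), and the merged segment contains only the survivors.

```python
def merge(segments):
    """Combine segments into one, copying only live documents."""
    merged_docs = []
    for seg in segments:
        for doc, alive in zip(seg["docs"], seg["live"]):
            if alive:
                merged_docs.append(doc)     # survivors get new, dense doc IDs
    return {"docs": merged_docs, "live": [True] * len(merged_docs)}


def replace_segments(all_segments, merged_group, new_segment):
    """Build the new segment list: drop the merged inputs, add the merged output."""
    kept = [s for s in all_segments if not any(s is m for m in merged_group)]
    return kept + [new_segment]


segments = [
    {"docs": ["a", "b"], "live": [True, False]},
    {"docs": ["c"], "live": [True]},
    {"docs": ["d", "e"], "live": [False, True]},
]
group = segments[:2]                        # pretend the policy picked these two
merged = merge(group)
segments_after = replace_segments(segments, group, merged)
```

The input segments are never modified; the index simply switches from a list naming the old segments to a list naming the new one, which is why searches can keep running during the merge.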

11. Near-Real-Time Search

A new document is searchable as soon as it is in a segment — and a flush creates a segment without a durable commit. Near-real-time (NRT) search exploits that gap: instead of committing (an fsync, which is slow), you open a reader over the just-flushed segments straight from the OS file cache.

Figure: Lucene near-real-time reopen. A flushed segment in the OS cache is picked up by reopening the reader, with no durable commit required, so the new document becomes visible in milliseconds.

The mechanism is reader reopen. Rather than building a fresh reader from scratch, DirectoryReader.openIfChanged(oldReader) returns a new reader that reuses the leaf readers for unchanged segments and only opens leaves for the new ones. Because segments are immutable, this sharing is safe and cheap.

reader = DirectoryReader.open(writer)        # NRT reader tied to the writer
... index more documents ...
newReader = DirectoryReader.openIfChanged(reader)   # picks up flushed segments
if newReader != null:
  reader.close()                             # old leaves not reused are released
  reader = newReader
searcher = IndexSearcher(reader)             # now sees the new documents

NRT decouples visibility from durability. Reopening makes documents visible in milliseconds without fsync; a separate, less frequent commit provides crash durability. Engines built on Lucene add a write-ahead log of their own (for example Elasticsearch's translog) to cover the window between commits.
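
Why reopening is cheap can be shown with a toy reader that reuses leaf objects. The classes and the `open_if_changed` function below are hypothetical stand-ins that only mimic the shape of the real call: segments are identified by name, and a leaf is just an object holding that name.

```python
class LeafReader:
    """Stands in for the per-segment reader that is costly to open."""

    def __init__(self, segment_name):
        self.segment_name = segment_name


class Reader:
    """A point-in-time view over a fixed list of leaves."""

    def __init__(self, leaves):
        self.leaves = leaves

    def segment_names(self):
        return [leaf.segment_name for leaf in self.leaves]


def open_if_changed(old_reader, current_segments):
    """Return a new Reader if the segment list changed, else None."""
    if old_reader.segment_names() == list(current_segments):
        return None
    by_name = {leaf.segment_name: leaf for leaf in old_reader.leaves}
    # Reuse the existing leaf for every unchanged segment; open only new ones.
    leaves = [by_name.get(name) or LeafReader(name) for name in current_segments]
    return Reader(leaves)


reader = Reader([LeafReader("_0"), LeafReader("_1")])
new_reader = open_if_changed(reader, ["_0", "_1", "_2"])   # "_2" was just flushed
```

Only the leaf for `_2` is newly created; the leaves for `_0` and `_1` are the very same objects the old reader held, which is safe precisely because their segments cannot change.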

12. Segment Files & Codecs

A single segment is physically a group of files that share one base name (_0.tim, _0.doc, …), each holding one part of the inverted index. The component that reads and writes these files is the Codec, and it is pluggable.

Figure: Lucene segment files / codec. The files that make up one segment, each written by the codec for a specific structure.

Files | Hold
.tim / .tip | Term dictionary and its FST term index.
.doc / .pos / .pay | Postings: document IDs and frequencies, term positions, and payloads/offsets.
.fdt / .fdx | Stored field values and their index (returned in hits).
.dvd / .dvm | Doc values: the per-field column store for sort, facet, and aggregate.
.kdd / .kdi / .kdm | Points: the BKD tree for numeric, date, and geo fields.
.vec / .vex | kNN vector values and the HNSW graph for nearest-neighbor search.
.liv | Live-docs bitset recording deletions for this segment.
.fnm | Field infos: the per-segment record of which fields exist and how they were indexed.
.si | Segment info: metadata such as document count and the codec used.

Because the codec is recorded per segment, an index can hold segments written by different codec versions at once. When Lucene upgrades its default format, existing segments keep the codec they were written with until a merge rewrites them in the current one — which is why merging is also how a format upgrade physically happens.

13. Summary

Lucene is a small set of ideas that compose into fast, ranked, durable search on a single machine:

Concern | Mechanism
What makes search fast? | An inverted index: each term points straight to a postings list of its documents, read with skip lists.
How is the index stored? | As immutable segments (small, complete sub-indexes) named by an atomic commit point (segments_N).
How does text become searchable? | An analyzer (char filters → tokenizer → token filters) turns it into terms, with the same analysis at index and query time.
How are documents written? | IndexWriter buffers them per thread (DWPT), flushes to a new segment, and commits to make them durable.
How are results ranked? | BM25 scoring from term frequency, inverse document frequency, and field-length normalization.
How do deletes and updates work? | Tombstones in a live-docs bitset; an update is a delete-by-term plus an add. Space is reclaimed only on merge.
How does the index stay fast over time? | Background merging bounds the segment count and physically drops deleted documents.
How are new documents seen quickly? | Near-real-time reopen over flushed segments, reusing unchanged leaf readers: visibility without an fsync.
How is the on-disk format defined? | A pluggable per-segment codec writing one file group per structure (terms, postings, stored, doc values, points, vectors).

The recurring theme: Lucene writes immutable segments and never edits them. Indexing appends new segments, deletes layer a bitset on top, merging rewrites them, and search reads them lock-free. Everything else — durability, near-real-time visibility, even format upgrades — is built on that one decision.