← Back to index

OpenObserve — Internal Architecture

A developer's guide to how OpenObserve actually works under the hood: the node roles, the schema-on-ingest pipeline, columnar Parquet on object storage, and how a query prunes its way down to the few files it actually needs.

OpenObserve is an observability platform — logs, metrics, and traces in one engine. Its design is built around a single, simple bet: object storage is the only durable store you need. There is no separate time-series database cluster, no dedicated search cluster, and no per-node attached disk that holds the source of truth. Data lands as compressed columnar Parquet files on an object store (S3, GCS, Azure Blob, MinIO, or even a local filesystem), and every other component is effectively a stateless worker that reads and writes those files. That choice — borrowing the data-lake storage model and pairing it with an Apache Arrow / DataFusion query engine — is what makes it cheap to run and is the thread that ties the rest of the architecture together.

Contents

  1. Design Goals and Core Ideas
  2. Architecture
  3. Data Model
  4. Storage Format
  5. Ingestion Path
  6. Indexing & Pruning
  7. Query Path
  8. Compaction & Retention
  9. Summary

1. Design Goals and Core Ideas

Every internal decision in OpenObserve traces back to a small set of goals. Keeping them in mind makes the rest of the architecture predictable.

GoalHow OpenObserve achieves it
Cheap observabilityObject storage is the only required durable store. No separate database cluster, no expensive local SSDs holding the source of truth. Data is compressed columnar Parquet, so storage and scan costs stay low.
Unified signalsLogs, metrics, and traces all flow through one ingestion pipeline and land in the same Parquet-on-object-store format, queried by one engine. There is no separate stack per signal type.
Stateless computeNode roles (router, ingester, querier, compactor) keep no durable state of their own. Everything that must survive a restart lives in the object store or the metadata store, so compute scales independently.
Schema-on-ingestYou do not declare a schema up front. Columns are inferred from the first records and extended as new fields appear, so any JSON shape can be ingested immediately.
SQL-first queryingQueries are SQL (and PromQL for metrics), planned and executed by DataFusion over Arrow. The same engine serves logs, metrics, and traces.
Unified signals on object storage
Logs, metrics, and traces feed one engine; object storage is the only required durable store, which is what makes it cheap to run.

2. Architecture

OpenObserve runs in two deployment shapes from the same binary. In single-node mode every role runs in one process backed by an embedded SQLite metadata store and a local-disk object store. In HA (cluster) mode the roles run as separate, independently scalable deployments, the metadata store moves to PostgreSQL, MySQL, or etcd, and the object store is a real bucket (S3, GCS, Azure, or MinIO).

Node roles, metadata store, object store
The router fronts all traffic; specialized roles read and write Parquet on the object store and coordinate through the metadata store.

The components are:

Two stores sit behind the roles:

Because the roles are stateless, scaling is a matter of running more of whichever role is the bottleneck. More ingest? Add ingesters. More query load? Add queriers. The data and the catalog are untouched.

3. Data Model

The data model has three levels. An organization is the top-level tenant boundary — it isolates data, users, and quotas. Inside an organization you have streams, which are the equivalent of tables. Each stream has a type: logs, metrics, or traces. A stream is a sequence of records sharing a schema and a retention policy.

Organizations, streams, time partitions
Organizations contain typed streams; schemas are inferred on ingest, and each stream is partitioned by time plus optional partition keys.

Schema inference is what makes ingestion frictionless. You do not run a CREATE TABLE. The first records to arrive on a stream define its columns and their types; when a later record carries a field the schema has not seen, the column is added and old rows simply have a null for it. This is schema-on-ingest — the schema is a living description maintained in the metadata store, not a fixed contract.

Partitioning is what makes querying fast. Every stream is partitioned by time (records are grouped into time buckets, typically hourly) and optionally by one or more partition keys (for example, a service or k8s_namespace column). Partitioning is expressed in the object-store key layout — each partition is its own prefix:

# object key layout for a partitioned stream
org/acme/logs/app_logs/
  2026/06/25/10/service=api/  # time bucket + partition key
    7f3a...parquet
    9b21...parquet
  2026/06/25/10/service=web/
    a44c...parquet
  2026/06/25/11/service=api/
    ...

Because partitions are encoded in the key prefix, a query with a time range and a service filter can be narrowed to a handful of prefixes before any file is opened. That is the first and cheapest layer of pruning.

4. Storage Format

On the object store, all data is Apache Parquet — a columnar file format. Inside a Parquet file the values of each column are stored together and compressed independently, and the file is divided into row groups, each carrying per-column statistics (min, max, null count) in its footer. Columnar layout is the right fit for observability queries, which usually touch a few columns out of hundreds and filter on a small number of them.

WAL, in-memory buffer, columnar Parquet
Records are first appended to a WAL for durability and held in an in-memory buffer; the buffer is flushed to a columnar Parquet object.

The write side of the storage engine has three structures, the same shape as a log-structured store:

The footer of every Parquet file is itself an index. Alongside the row-group statistics, OpenObserve records each file's overall min/max for the timestamp and partition columns in the metadata store, so the planner can reason about a file without reading it.

5. Ingestion Path

Ingestion is a straight pipeline. A record is durable the moment it is in the WAL; everything after that — buffering, Parquet conversion, the object-store upload — happens asynchronously and does not block the acknowledgement.

Ingestion path from API to object store
Ingest API to WAL to in-memory buffer to Parquet to object store. The ack returns after the WAL write; schema is inferred as rows arrive.

The sequence inside an ingester is:

function ingest(stream, records):
  for r in records:
    schema = infer_and_merge(stream, r)   # 1. schema-on-ingest
    wal.append(stream, r)                  # 2. durable; ack can return now
    buffer[stream].add(r)                  # 3. in-memory, columnar
  ack()                                    # client released after WAL write

  # asynchronously, on size/time threshold:
  if buffer[stream].should_flush():
    file = buffer[stream].to_parquet()     # 4. compress, columnar
    key  = partition_key(stream, file)     # time + partition keys
    object_store.put(key, file)            # upload to S3/GCS/...
    metadata.register_file(key, file.stats)  # min/max, rows, columns
    wal.truncate(covered_by(file))

The key detail is the order: durability (WAL) comes before visibility (buffer) and long before the Parquet flush. The flush registers the new file and its statistics in the metadata store in the same step, which is what makes the file immediately discoverable by queriers and prunable by the planner.

6. Indexing & Pruning

OpenObserve does not scan everything. The whole performance story is about reading as few bytes from the object store as possible, because object-store reads are the slow and metered part of a query. It does this with layered pruning, from coarse to fine.

Indexing and pruning candidate files
The pruner uses partition prefixes, file-level min/max stats, bloom filters, and inverted indexes to drop files that cannot match before any are read.

The layers, applied in order:

LayerGranularityWhat it eliminates
Partition pruningprefix / fileTime-range and partition-key filters narrow the candidate set to a few object-store prefixes before any file is touched.
Min/max column statsfile & row groupEach Parquet file (and each row group) carries per-column min/max. If a filter's value falls outside a file's range, the whole file is skipped.
Bloom filtersfileFor high-cardinality equality filters (e.g. a specific trace_id), a bloom filter says "this value is definitely not here" so the file is skipped without reading it.
Inverted indexterms within a fileFor full-text search over log fields, an inverted index maps terms to the records that contain them, so a text query reads only matching rows rather than scanning the column.

Critically, the first three layers run on metadata only — the planner consults statistics held in the metadata store and the Parquet footers, deciding which files matter before downloading a single byte of column data. A query against a stream holding thousands of files commonly reads only a handful.

7. Query Path

A query is SQL (or PromQL for metrics). It is parsed into a logical plan, pruned down to a file set, then executed in parallel by queriers that fetch only the needed Parquet from the object store and run the plan with DataFusion over Arrow record batches.

Query path: plan, prune, fetch, execute
SQL/PromQL is planned and pruned, then scattered across queriers that pull only the needed files from the object store and execute on Arrow.

The end-to-end flow is:

function query(sql):
  plan  = planner.build(sql)              # logical plan + filters
  files = prune(plan, metadata)           # partitions + min/max + bloom
  shards = split(files, num_queriers)     # scatter the file set
  partials = []
  for q in queriers:                      # run in parallel
    partials += q.execute(plan, shards[q])
  return merge(plan, partials)            # gather + final aggregation

# inside a single querier:
function execute(plan, files):
  for f in files:
    cols  = f.read_columns(plan.projected_columns)  # only needed columns
    batch = arrow.decode(cols)            # Parquet -> Arrow
    emit(datafusion.run(plan, batch))     # filter / aggregate

Two things keep this fast. First, scatter-gather: in HA mode the pruned file set is split across many queriers, so a large scan parallelizes across machines and the partial results are merged at the end. Second, columnar projection: each querier reads only the columns the query references out of each Parquet file, so a SELECT count(*) WHERE service='api' never pays to read the log body. DataFusion does the actual relational work — filtering, joining, aggregating — on Arrow batches, which are a columnar in-memory format that maps almost directly onto Parquet.

8. Compaction & Retention

The flush path optimizes for write speed, so it produces many small Parquet files — every ingester flush is its own object. Many small files are bad for reads: each one costs a separate object-store request and carries fixed footer overhead. The compactor fixes this in the background, the same role compaction plays in any LSM-style store.

Compaction, downsampling, retention
The compactor merges small files into fewer large ones, downsamples old metrics, and deletes files past the retention window.

The compactor does three jobs:

JobWhat it does
MergeReads many small Parquet files for a partition, merge-sorts their rows, deduplicates, and writes one larger, well-ordered Parquet file. Fewer files means fewer object-store requests and tighter min/max ranges, so reads get faster.
DownsampleFor metric streams, old high-resolution samples are rolled up to coarser intervals (for example 10-second points aggregated to 5-minute points). Recent data stays fine-grained; long-term data shrinks dramatically.
RetentionFiles older than the stream's retention window are deleted from the object store and de-registered from the metadata store. Because partitions are time-prefixed, expiry is a cheap prefix delete — no row-by-row scanning.

All three operations are rewrite-and-replace: the compactor writes new files, updates the metadata store to point at them, and then removes the superseded files. Queriers always read whatever the metadata store currently lists, so compaction is invisible to in-flight queries and never blocks ingestion.

on compaction for (stream, partition):
  small = metadata.files(stream, partition, max_size = SMALL)
  if len(small) < MIN_TO_MERGE: return
  merged = merge_sort_dedup(small)            # one larger file
  object_store.put(merged.key, merged)
  metadata.register_file(merged.key, merged.stats)
  metadata.deregister(small)                  # atomic swap in catalog
  object_store.delete(small)                  # reclaim space

on retention sweep:
  for f in metadata.files(stream):
    if f.max_ts < now - stream.retention:
      metadata.deregister(f); object_store.delete(f)

9. Summary

The whole system is built from a few reinforcing ideas:

ConcernMechanism
Where does data live?Compressed columnar Parquet on an object store — the only required durable store. The metadata store holds just the catalog.
How is it organized?Organizations contain typed streams (logs/metrics/traces); each stream is partitioned by time plus optional partition keys.
How does ingest stay fast?Append to WAL, ack, buffer in memory, then flush asynchronously to Parquet. Schema is inferred on ingest, never declared.
How are reads so cheap?Layered pruning — partition prefixes, file min/max stats, bloom filters, inverted index — drops files on metadata alone, then reads only the needed columns.
How does it execute?SQL/PromQL planned by DataFusion, scattered across stateless queriers, executed on Arrow batches, gathered at the end.
How is storage reclaimed?The compactor merges small files, downsamples old metrics, and deletes expired partitions by prefix — all rewrite-and-replace in the catalog.
How does it scale?Roles are stateless; add ingesters for write load, queriers for read load. Data and catalog are untouched.
The recurring theme: object storage is the source of truth, compute is stateless and disposable, and every operation is rewrite-and-replace plus metadata-driven pruning. There is no database cluster to scale, no in-place mutation to lock, and no expensive index that has to hold the whole dataset in memory.