Embedding DuckDB inside the AGNT5 runtime to serve listings, filters, and analytics over S3 Parquet — without running a separate warehouse.
Every platform that handles durable runs eventually needs to answer the same two kinds of questions. The first is “what is this run doing right now?” — cheap, point-lookup, lives in the hot tier. The second is “show me the last 200 failed runs for project X with tool-call counts greater than five, grouped by model.” That one is an analytics query, and if you try to serve it from the same index that serves the hot lookups, both paths get worse.
We reached for DuckDB. Not DuckDB-the-standalone-warehouse — DuckDB-the-embedded-library, living inside the same runtime binary as the gateway and the engine.
The query crate boots like this:
```rust
use duckdb::{Config, DuckdbConnectionManager};

// Cache HTTP metadata so repeated scans of the same Parquet files skip
// redundant S3 round trips.
let config = Config::default().with("enable_http_metadata_cache", "true")?;
let manager = DuckdbConnectionManager::memory_with_flags(config)?;
let pool = r2d2::Pool::builder()
    .max_size(s3_cfg.pool_size)
    .build(manager)?;

// Connection init: load httpfs and point it at the bucket. Note this runs
// against the one connection checked out here; the customizer sketch below
// covers running it on every pooled connection.
let conn = pool.get()?;
conn.execute_batch(&format!(
    "INSTALL httpfs; LOAD httpfs;
     SET s3_region = '{region}';
     SET s3_endpoint = '{endpoint}';
     SET s3_access_key_id = '{access_key}';
     SET s3_secret_access_key = '{secret_key}';
     SET s3_use_ssl = {use_ssl};",
    region = s3_cfg.region,
    endpoint = s3_cfg.endpoint,
    access_key = s3_cfg.access_key,
    secret_key = s3_cfg.secret_key,
    use_ssl = s3_cfg.use_ssl,
))?;
```

That is the whole setup. Once a connection has run that init batch, it can read Parquet straight from S3 or MinIO.
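One wrinkle: `pool.get()` initializes only the connection it happens to check out. To run the same batch on every connection the pool ever creates, r2d2's connection-customizer hook is the natural place. A minimal sketch, assuming `init_sql` holds the `INSTALL`/`LOAD`/`SET` batch from above; the `S3InitCustomizer` name is ours, for illustration:

```rust
use duckdb::{Connection, DuckdbConnectionManager};
use r2d2::CustomizeConnection;

// Illustrative customizer: r2d2 calls on_acquire on each new connection it
// opens, so every pooled connection gets httpfs and the S3 credentials.
#[derive(Debug)]
struct S3InitCustomizer {
    init_sql: String,
}

impl CustomizeConnection<Connection, duckdb::Error> for S3InitCustomizer {
    fn on_acquire(&self, conn: &mut Connection) -> Result<(), duckdb::Error> {
        conn.execute_batch(&self.init_sql)
    }
}

let pool = r2d2::Pool::builder()
    .max_size(s3_cfg.pool_size)
    .connection_customizer(Box::new(S3InitCustomizer { init_sql }))
    .build(manager)?;
```

With that in place, every `pool.get()` hands back a connection that already knows how to talk to the bucket.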
The runs listing endpoint executes a query like this:

```sql
SELECT run_id, status, started_at_ms, completed_at_ms, duration_ms, model
FROM read_parquet(
  's3://agnt5-engine/engine/runs/tenant=proj_abc123/**/*.parquet',
  hive_partitioning = true
)
WHERE status = 'failed'
  AND completed_at_ms >= $1
ORDER BY completed_at_ms DESC
LIMIT 50;
```

The Hive partitioning convention (tenant=.../day=.../) means DuckDB can prune day partitions at plan time, provided the query also filters on the day partition column rather than only on a row-level timestamp. For a single-tenant query constrained to the last week of partitions, it opens exactly seven prefixes, reads only the columns it needs, and returns in double-digit milliseconds even against a cold cache.
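To make that pruning explicit, the endpoint can pass the day cutoff as its own predicate next to the timestamp filter. A sketch against the pooled connection from above; the `day >= ?` plumbing and the cutoff values are illustrative, not necessarily how the real endpoint binds its parameters:

```rust
use duckdb::params;

let day_cutoff = "2025-01-08";         // illustrative: seven days back
let ms_cutoff = 1_736_294_400_000_i64; // illustrative: same instant in epoch ms

// `day` is a Hive partition column, so this predicate is matched against
// directory names at plan time; zero-padded ISO dates compare correctly as
// strings. Only the surviving prefixes are ever opened.
let mut stmt = conn.prepare(
    "SELECT run_id, status, completed_at_ms
     FROM read_parquet(
       's3://agnt5-engine/engine/runs/tenant=proj_abc123/**/*.parquet',
       hive_partitioning = true
     )
     WHERE status = 'failed'
       AND day >= ?
       AND completed_at_ms >= ?
     ORDER BY completed_at_ms DESC
     LIMIT 50",
)?;
let failed_runs = stmt
    .query_map(params![day_cutoff, ms_cutoff], |row| row.get::<_, String>(0))?
    .collect::<Result<Vec<_>, _>>()?;
```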
Why embedded beats a separate service
We considered running DuckDB as a sidecar, or pushing analytics into a dedicated warehouse. Both lose on the same axis: they add a network hop and an operational dependency to a feature that is already bottlenecked on object storage round trips.
Embedding means the connection pool, the S3 client, and the schema cache all live in-process. It also means we can ship one binary. The self-hosted operator and the local dev experience get the analytics surface “for free” — no separate container to explain, no ClickHouse tuning to document. Managed deployments scale the same binary horizontally; each replica has its own DuckDB pool reading the same immutable Parquet files.
The tradeoff is honest: DuckDB is single-node. A single query runs on one replica’s cores. For the workloads we care about — listings up to a few tens of thousands of rows, grouped aggregates over a tenant’s last month — that is fine. If someone ever writes a query that needs to grind over billions of rows, they should be writing it against a warehouse, not a runtime query endpoint. The runtime’s job is not to be a warehouse.
The shape of the catalog
We do not materialize tables. There is no schema migration step. The Parquet files written by the processor’s s3_writer are the source of truth, and the query layer reads them directly via read_parquet(...). New fields in the Run proto become new Parquet columns the next time the writer flushes, and DuckDB picks them up on the next query.
This is the payoff of picking a columnar format with a rich type system on disk. The schema evolves forward without ceremony. As long as the read unifies schemas by name, older Parquet files that predate a column simply return NULL for it; queries that filter on it still work. We do not write migration code for historical data.
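Concretely, that rides on DuckDB's schema unification at the read site. A short sketch, assuming the same pooled connection, with `cost_usd` standing in for any column added after older files were written:

```rust
// union_by_name = true unifies schemas across files by column name: files
// written before `cost_usd` existed surface it as NULL instead of failing
// the scan with a schema mismatch.
let mut stmt = conn.prepare(
    "SELECT run_id, cost_usd
     FROM read_parquet(
       's3://agnt5-engine/engine/runs/tenant=proj_abc123/**/*.parquet',
       hive_partitioning = true,
       union_by_name = true
     )
     WHERE cost_usd IS NOT NULL",
)?;
```

Rows from the pre-`cost_usd` files carry NULL and drop out of the predicate, which is exactly the forward-compatible behavior the no-migration stance depends on.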
What the query layer powers
The runs listing UI in Studio pulls directly from this layer. Token-attribution dashboards use it to sum tokens_in, tokens_out, and cost_usd columns by project and model. Eval leaderboards query it to correlate run outcomes with prompt versions. When a support engineer needs to find “all runs that failed with this error signature in the last 48 hours,” they hit the same endpoint.
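The token-attribution rollup is the same pattern with a GROUP BY. A sketch, assuming the tenant partition carries the project id (as in the path above) and that `tokens_in`, `tokens_out`, and `cost_usd` are columns in the run schema alongside `model`:

```rust
// Grouped aggregate over one tenant's runs: token totals and spend per
// model. The tenant predicate prunes to a single partition prefix before
// any Parquet data is read.
let mut stmt = conn.prepare(
    "SELECT model,
            SUM(tokens_in)  AS tokens_in,
            SUM(tokens_out) AS tokens_out,
            SUM(cost_usd)   AS total_cost_usd
     FROM read_parquet(
       's3://agnt5-engine/engine/runs/**/*.parquet',
       hive_partitioning = true,
       union_by_name = true
     )
     WHERE tenant = ?
     GROUP BY model
     ORDER BY total_cost_usd DESC",
)?;
```

Because the Parquet files are immutable, each replica can serve this aggregate independently with no cache coherence to manage.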
Anything requiring point lookup on a live run — SSE streaming, resume after crash, step-level timeline — bypasses DuckDB entirely and hits the RocksDB hot tier. The split matters. Each tier does one thing.
Why this matters
A query layer is not the glamorous part of a durable execution platform. But it is the part users touch every day. Picking DuckDB let us keep the platform honest: completed runs live in open, inspectable Parquet files in a bucket you control. The query engine is a library, not a vendor. Your dashboards are SQL, not a proprietary API. If someone wants to pull last quarter’s runs into a notebook and do their own analysis, they can — the files are already there.
Reliability shows up in the hot path. Usability shows up in the query path. DuckDB is how we kept the second one from becoming its own project.