A Dataset is a curated collection of inputs the agent should handle correctly — the cases you want to evaluate against. Datasets can be authored by hand, generated, or captured directly from production Runs. Capture is the common case: a Run that surfaces a novel failure mode becomes a Dataset entry in a single action, and future Experiments are measured against it.
You’ll find Datasets under Improve → Datasets in Studio.
Datasets grow as production surfaces new cases. Every Experiment is measured against a Dataset version, so comparisons are stable even as the Dataset itself evolves — new cases don’t silently invalidate old scores.
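The versioning behavior described above can be sketched in miniature. Note that `Dataset`, `DatasetVersion`, and the method names below are hypothetical illustrations of the concept (immutable snapshots that Experiments pin to), not the product's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetVersion:
    """An immutable snapshot of the Dataset at a point in time (hypothetical)."""
    version: int
    cases: tuple  # frozen tuple of inputs; never mutated after creation

class Dataset:
    """Hypothetical model: every added case produces a new version,
    so an Experiment pinned to an older version keeps a stable score basis."""

    def __init__(self):
        self._versions = [DatasetVersion(1, ())]

    @property
    def latest(self):
        return self._versions[-1]

    def add_case(self, case):
        # New cases extend a copy; earlier versions are untouched.
        prev = self.latest
        new = DatasetVersion(prev.version + 1, prev.cases + (case,))
        self._versions.append(new)
        return new

    def version(self, n):
        return self._versions[n - 1]

ds = Dataset()
pinned = ds.add_case("run-123: novel failure input")   # creates version 2
ds.add_case("run-456: another captured case")          # creates version 3

# An Experiment pinned to version 2 still sees exactly one case,
# even though the Dataset has since grown to two.
assert len(ds.version(pinned.version).cases) == 1
assert len(ds.latest.cases) == 2
```

The design choice the sketch highlights: because each version is immutable, comparing two Experiments run against the same version number is always apples-to-apples, regardless of how much the Dataset has grown since.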
A deeper guide is in progress. For the full improvement loop, see the Improve overview.