---
title: Datasets
description: Curate eval datasets from production runs, manual examples, or file uploads, then publish immutable versions for experiments.
last_verified: 2026-06-07
---

A **dataset** is a curated collection of test cases — inputs, expected outputs, and optional trace events — that [experiments](/docs/improve/experiments.md) run your components against. You build a dataset from real production runs, hand-written examples, or bulk file uploads, then publish it as an immutable version so experiment results stay comparable over time.

---

## Drafts and versions

Every dataset has one mutable **draft** and zero or more immutable **versions**.

- **Draft**: the working set. Adding runs, uploading files, and removing duplicates all edit the draft.
- **Version**: a numbered, immutable snapshot of the draft. Experiments run against a version, never the draft, so a result always points at the exact items it was scored on.
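The draft/version relationship can be pictured with a toy in-memory model. This is only an illustrative sketch, not how agnt5 stores data: the `Dataset` class and its methods are hypothetical names, and numbering versions from 1 is an assumption.

```python
import copy


class Dataset:
    """Toy model: one mutable draft, plus numbered snapshots of it."""

    def __init__(self):
        self.draft = []       # mutable working set
        self._versions = []   # published snapshots; index 0 holds version 1 (assumed numbering)

    def publish(self):
        """Snapshot the current draft and return the new version number."""
        self._versions.append(copy.deepcopy(self.draft))
        return len(self._versions)

    def version(self, number):
        """Return a copy of a published version so callers cannot mutate the snapshot."""
        return copy.deepcopy(self._versions[number - 1])


ds = Dataset()
ds.draft.append({"input": {"message": "Cancel my subscription"}})
v1 = ds.publish()

# Editing the draft after publishing leaves version 1 untouched.
ds.draft.append({"input": {"message": "Where is my order #4512?"}})
```

Because each snapshot is a deep copy, a result scored against version 1 keeps pointing at the same items no matter how the draft changes afterwards.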

Each **dataset item** carries:

| Field | Description |
|---|---|
| `input` | JSON input passed to the component under test |
| `expected_output` | Optional JSON the scorers compare against |
| `metadata` | Optional JSON object for your own labels |
| `events` | Optional trace events, captured when importing from a run |
| `split` | Optional partition label (e.g. `train`, `test`) |
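As a rough local check before adding items, the field table can be turned into a small validator. This is a sketch under stated assumptions: `validate_item` is a hypothetical helper, `input` is treated as required because it is the only field not marked optional, and flagging unknown fields is a strictness choice of this sketch rather than documented CLI behavior.

```python
import json

ALLOWED_FIELDS = {"input", "expected_output", "metadata", "events", "split"}


def validate_item(item):
    """Return a list of problems found in one dataset item (empty if none)."""
    if not isinstance(item, dict):
        return ["item must be a JSON object"]
    problems = []
    if "input" not in item:
        problems.append("missing required field: input")  # assumption: input is required
    unknown = sorted(set(item) - ALLOWED_FIELDS)
    if unknown:
        problems.append("unknown fields: " + ", ".join(unknown))  # sketch-only strictness
    if item.get("metadata") is not None and not isinstance(item["metadata"], dict):
        problems.append("metadata must be a JSON object")
    if item.get("split") is not None and not isinstance(item["split"], str):
        problems.append("split must be a string")
    try:
        json.dumps(item)
    except (TypeError, ValueError) as exc:
        problems.append(f"not JSON-serializable: {exc}")
    return problems


good = {"input": {"message": "Where is my order #4512?"}, "split": "test"}
bad = {"expected_output": {"intent": "order_status"}, "metadata": "vip"}
good_problems = validate_item(good)
bad_problems = validate_item(bad)
```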

---

## Create a dataset

Via the CLI:

```bash
agnt5 datasets create \
  --name support-agent-golden-set \
  --description "Curated support conversations with verified answers"
```

Via Studio: open your project, go to **Evaluate** -> **Datasets**, and create a dataset from there.

The commands in the rest of this page take a dataset ID. To look it up later, search for the dataset:

```bash
agnt5 datasets list --search support-agent
```

---

## Add examples

### From production runs

Importing a run captures its input, output, and trace events, so trace-level scorers (tool calls, token budgets) can score it later.

1. Find the run you want with `agnt5 inspect runs ls`, or copy the run ID from the **Runs** page in Studio.
2. Import it into the draft:

```bash
agnt5 datasets add-run <dataset-id> <run-id> \
  --expected-output '{"answer": "Refund issued within 5 business days"}'
```

The run's recorded output becomes the item's expected output unless you override it with `--expected-output`.
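The defaulting rule can be sketched as a small function. The run record shape used here (`input`, `output`, `events` keys) and the name `item_from_run` are hypothetical; only the rule itself (recorded output unless overridden) comes from the text above.

```python
_UNSET = object()  # sentinel: distinguishes "no override" from an explicit null override


def item_from_run(run, expected_output=_UNSET):
    """Build a dataset item from a run record (hypothetical shape)."""
    if expected_output is _UNSET:
        expected_output = run["output"]  # default: the run's recorded output
    return {
        "input": run["input"],
        "expected_output": expected_output,
        "events": run.get("events", []),
    }


run = {
    "input": {"message": "I want a refund"},
    "output": {"answer": "Refund requested"},
    "events": [{"type": "tool_call", "name": "lookup_order"}],
}

default_item = item_from_run(run)
overridden_item = item_from_run(
    run, expected_output={"answer": "Refund issued within 5 business days"}
)
```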

### Manual examples

Add a hand-written example straight to the draft:

```bash
agnt5 datasets add-example <dataset-id> \
  --input '{"message": "Where is my order #4512?"}' \
  --expected-output '{"intent": "order_status"}'
```

### Bulk upload from JSONL

One JSON object per line, with the same fields as a dataset item:

```jsonl
{"input": {"message": "Cancel my subscription"}, "expected_output": {"intent": "cancellation"}}
{"input": {"message": "Where is my order #4512?"}, "expected_output": {"intent": "order_status"}, "split": "test"}
```

```bash
agnt5 datasets upload <dataset-id> --file examples.jsonl

# Or pipe from stdin
cat examples.jsonl | agnt5 datasets upload <dataset-id>
```
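If the examples live in code rather than in a file, a few lines of Python can produce the JSONL. This is a minimal sketch: `write_jsonl` is a hypothetical helper, and it writes to an in-memory buffer here so the snippet has no side effects.

```python
import io
import json


def write_jsonl(items, fp):
    """Write each item as one JSON object per line; return the number written."""
    count = 0
    for item in items:
        # json.dumps escapes embedded newlines, so each item stays on a single line.
        fp.write(json.dumps(item) + "\n")
        count += 1
    return count


items = [
    {"input": {"message": "Cancel my subscription"},
     "expected_output": {"intent": "cancellation"}},
    {"input": {"message": "Where is my order #4512?"},
     "expected_output": {"intent": "order_status"}, "split": "test"},
]

buffer = io.StringIO()
written = write_jsonl(items, buffer)
lines = buffer.getvalue().splitlines()
```

To produce a real file, pass a handle opened with `open("examples.jsonl", "w", encoding="utf-8")` instead of the buffer, then upload it with the command above.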

### Bulk upload from CSV

Map CSV columns to item fields. Columns headed `input` and `expected_output` map directly by default; pass the column flags when your headers differ:

```bash
agnt5 datasets upload-csv <dataset-id> --file examples.csv \
  --input-column question \
  --expected-output-column answer \
  --split-column split
```

> **Note:** Valid rows import even when some rows are rejected (`--partial` defaults to true). Pass `--partial=false` to make the upload all-or-nothing.
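The difference between partial and all-or-nothing imports can be sketched as follows. This only models the semantics described in the note: `import_rows` is a hypothetical function, and the rejection rule (a row with an empty `input`) is a stand-in, since the CLI's actual validation rules are not described here.

```python
def import_rows(rows, partial=True):
    """Return (accepted, rejected) rows, mimicking the --partial behavior."""
    valid, rejected = [], []
    for index, row in enumerate(rows):
        if row.get("input"):  # hypothetical rule: input must be present and non-empty
            valid.append(row)
        else:
            rejected.append((index, "missing input"))
    if rejected and not partial:
        return [], rejected  # all-or-nothing: one bad row blocks the whole upload
    return valid, rejected


rows = [
    {"input": "How do I reset my password?", "expected_output": "password_reset"},
    {"input": "", "expected_output": "unknown"},
    {"input": "Cancel my subscription", "expected_output": "cancellation"},
]

accepted_partial, rejected_partial = import_rows(rows, partial=True)
accepted_strict, rejected_strict = import_rows(rows, partial=False)
```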

---

## Remove duplicates

Imports from multiple sources can land the same example twice. Preview duplicates before removing anything:

```bash
# See duplicate groups in the draft
agnt5 datasets dedup preview <dataset-id> --include-preview

# Keep the earliest item in each group, remove the rest
agnt5 datasets dedup apply <dataset-id>

# Or remove specific items
agnt5 datasets dedup apply <dataset-id> --remove-item-id <item-id>
```
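The keep-the-earliest rule can be illustrated locally. This sketch assumes two items are duplicates when their `input` and `expected_output` are identical JSON, and that "earliest" means first in list order; how agnt5 actually detects duplicates is not specified here, and the function names are hypothetical.

```python
import json


def canonical(value):
    """Serialize JSON with sorted keys so key order does not affect equality."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def duplicate_groups(items):
    """Return lists of positions that share the same input and expected_output."""
    groups = {}
    for position, item in enumerate(items):
        key = canonical([item.get("input"), item.get("expected_output")])
        groups.setdefault(key, []).append(position)
    return [positions for positions in groups.values() if len(positions) > 1]


def dedup_keep_earliest(items):
    """Drop every item in a duplicate group except the first one."""
    to_remove = {p for group in duplicate_groups(items) for p in group[1:]}
    return [item for position, item in enumerate(items) if position not in to_remove]


items = [
    {"input": {"message": "Cancel my subscription", "lang": "en"},
     "expected_output": {"intent": "cancellation"}},
    {"input": {"message": "Where is my order #4512?"},
     "expected_output": {"intent": "order_status"}},
    {"input": {"lang": "en", "message": "Cancel my subscription"},
     "expected_output": {"intent": "cancellation"}},
]

groups = duplicate_groups(items)
deduped = dedup_keep_earliest(items)
```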

---

## Publish a version

Publishing snapshots the draft as the next immutable version:

```bash
agnt5 datasets publish <dataset-id> \
  --description "Adds 40 cancellation cases from last week's runs"
```

The draft stays editable after publishing — keep curating and publish again when ready.

Manage versions:

```bash
# List versions
agnt5 datasets versions list <dataset-id>

# Export a version as JSONL
agnt5 datasets versions export <dataset-id> <version-id>

# Compare two versions (added, removed, changed items)
agnt5 datasets versions compare <dataset-id> <base-version-id> <compare-version-id>

# Reset the draft to a published version
agnt5 datasets versions restore-draft <dataset-id> <version-id>
```
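The added/removed/changed breakdown can be reproduced on two exported JSONL files. This is a sketch under one notable assumption: items are matched across versions by their `input`, whereas the CLI may well match on an item ID instead. `load_jsonl` and `compare_versions` are hypothetical helpers.

```python
import json


def load_jsonl(text):
    """Parse JSONL text into a list of items, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def _key(item):
    # Assumption: an item's identity is its input, serialized with sorted keys.
    return json.dumps(item["input"], sort_keys=True)


def compare_versions(base_items, compare_items):
    """Classify items as added, removed, or changed between two versions."""
    base = {_key(item): item for item in base_items}
    other = {_key(item): item for item in compare_items}
    return {
        "added": [other[k] for k in other if k not in base],
        "removed": [base[k] for k in base if k not in other],
        "changed": [(base[k], other[k]) for k in base
                    if k in other and base[k] != other[k]],
    }


base_version = load_jsonl(
    '{"input": {"q": "refund"}, "expected_output": {"intent": "refund"}}\n'
    '{"input": {"q": "cancel"}, "expected_output": {"intent": "cancel"}}\n'
)
compare_version = load_jsonl(
    '{"input": {"q": "cancel"}, "expected_output": {"intent": "cancellation"}}\n'
    '{"input": {"q": "order"}, "expected_output": {"intent": "order_status"}}\n'
)

diff = compare_versions(base_version, compare_version)
```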

```bash
# Inspect items in the draft or a specific version
agnt5 datasets examples list <dataset-id> --version 2 --include-payload
```


- **Commands**: `agnt5 datasets {list,create,add-run,add-example,upload,upload-csv,dedup preview|apply,publish,versions list|export|compare|restore-draft,examples list}`
- **Required**: authenticated session (`agnt5 auth login`); project binding (run inside the project directory)
- **Item schema (JSONL)**: `{"input": <json>, "expected_output": <json?>, "metadata": <object?>, "events": <json?>, "split": <string?>}`
- **Model**: one mutable draft per dataset; `publish` creates immutable numbered versions; experiments reference a dataset version ID.


## Next steps

- [Scorers](/docs/improve/scorers.md): define how outputs in your dataset get judged.
- [Experiments](/docs/improve/experiments.md): run a component against a dataset version and score the results.
- [Deploying](/docs/run/deploying.md): produce the production runs that feed dataset curation.
