---
title: Experiments
description: Run a deployed component against a dataset version, score every item, compare runs, and gate CI on the results.
last_verified: 2026-06-07
---

An **experiment** binds a target — a deployed component or a prompt — to a [dataset](/docs/improve/datasets.md) version and a set of [scorers](/docs/improve/scorers.md). Each **experiment run** executes the target against every dataset item, scores the outputs, and produces a pass/fail summary you can compare across runs or enforce in CI.

---

## Experiment versions

- An experiment's definition (target, scorers, config) is versioned. Starting a run snapshots it as an immutable **experiment version**, so older runs always show the exact configuration that produced them.
- A run pairs one experiment version with one dataset version. Because both sides are immutable, two runs over the same dataset version are directly comparable.
- Each dataset item becomes a **run item** with its own output and per-scorer scores.

---

## Create an experiment

The target component must already be [deployed](/docs/run/deploying.md) before you create the experiment:

```bash
agnt5 experiments create \
  --name support-agent-quality \
  --dataset-id <dataset-id> \
  --dataset-version-id <dataset-version-id> \
  --target-type component \
  --deployment-id <deployment-id> \
  --component-name support_agent \
  --component-type agent \
  --builtin-scorer json_valid \
  --builtin-scorer '{"name":"tool_called","config":{"tool":"search_orders"}}' \
  --config '{"passed_threshold":1}'
```

- `--builtin-scorer` attaches a [built-in scorer](/docs/improve/scorers.md#built-in-deterministic-scorers) by name, or by JSON object when it needs config. Repeat for each scorer.
- `--scorer-id` attaches a deployed [custom scorer](/docs/improve/scorers.md#custom-scorers) by UUID.
- `--target-type prompt` with `--prompt-id` targets a [Prompt](/docs/build/prompts.md) instead of a component.
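
Because `--builtin-scorer` takes either a bare name or a JSON object, a small shell helper can keep the quoting in one place. The sketch below is illustrative only: `builtin_scorer_spec` is a hypothetical function, not part of the `agnt5` CLI, and it assumes the config you pass is already a valid JSON object.

```bash
# Hypothetical helper (not part of the agnt5 CLI).
# Prints a value suitable for --builtin-scorer:
#   - the bare scorer name when no config is given
#   - a {"name":...,"config":...} object when a config is given
builtin_scorer_spec() {
  if [ "$#" -ge 2 ]; then
    # $2 must already be a valid JSON object, e.g. '{"tool":"search_orders"}'
    printf '{"name":"%s","config":%s}\n' "$1" "$2"
  else
    printf '%s\n' "$1"
  fi
}
```

It could then be used inline, for example `--builtin-scorer "$(builtin_scorer_spec tool_called '{"tool":"search_orders"}')"`.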

Via Studio: open your project, go to **Evaluate** -> **Experiments**, and create the experiment with the same choices — dataset version, target, and scorers.

---

## Run an experiment

```bash
# Fire and forget
agnt5 experiments run <experiment-id>

# Block until the run finishes; exit non-zero if the gate fails
agnt5 experiments run <experiment-id> --wait --timeout 15m
```

Useful overrides:

| Flag | Purpose |
|---|---|
| `--name` | Label the run (e.g. `pr-1234`) |
| `--deployment-id` | Run against a candidate deployment of the same component, so the result can be compared with the baseline |
| `--experiment-version-id` | Re-run an older immutable experiment version |
| `--fail-on-gate` | Return non-zero when the CI gate fails (default true with `--wait`) |
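
The overrides combine naturally in a per-pull-request wrapper. This is a minimal sketch, not an official recipe: `run_candidate` and the `EVAL_TIMEOUT` variable are hypothetical names, and every flag passed to `agnt5` is one already listed on this page.

```bash
# Hypothetical wrapper (not part of the agnt5 CLI).
# Runs an experiment against a candidate deployment, labels the run,
# and blocks until it finishes. The exit status of `agnt5` is passed
# through, so a failed gate fails the calling script.
run_candidate() {
  # $1 = experiment ID, $2 = candidate deployment ID, $3 = run label
  agnt5 experiments run "$1" \
    --deployment-id "$2" \
    --name "$3" \
    --wait --timeout "${EVAL_TIMEOUT:-15m}"   # EVAL_TIMEOUT is an assumed variable
}
```

A call might look like `run_candidate <experiment-id> "$CANDIDATE_DEPLOYMENT_ID" pr-1234`.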

---

## Inspect results

```bash
# Runs for an experiment
agnt5 experiments runs list <experiment-id>

# One run's summary: status, pass rate, per-scorer aggregates
agnt5 experiments runs show <run-id>

# Failed items only
agnt5 reports failures <run-id>

# Why a specific score failed
agnt5 scores list --run-id <run-id>
agnt5 scores evidence <score-id> --include scorer_input,scorer_output,evidence
```

In Studio, the experiment run page shows the run timeline, per-item results, and per-scorer scores; click any item to see its output and score evidence side by side.

Cancel a stuck run with `agnt5 experiments runs cancel <experiment-id> <run-id>`.
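
When triaging a run by hand, it can help to chain the inspection commands so the summary, the failed items, and the scores appear in one pass. The function below is a sketch under that assumption: `triage_run` is a hypothetical name, and it does not parse the CLI output, whose format is not documented here.

```bash
# Hypothetical helper (not part of the agnt5 CLI).
# Prints the run summary, the failed items, and the per-item scores
# for one run, stopping early if a command fails.
triage_run() {
  # $1 = run ID
  echo "== summary =="
  agnt5 experiments runs show "$1" || return 1
  echo "== failed items =="
  agnt5 reports failures "$1" || return 1
  echo "== scores =="
  agnt5 scores list --run-id "$1"
}
```

From the scores listing, pick a score ID and pass it to `agnt5 scores evidence` to see why that score failed.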

---

## Compare runs

Compare a candidate against a baseline over the same dataset version:

```bash
agnt5 experiments runs compare <base-run-id> <compare-run-id>
```

The comparison shows aggregate score movement and which items flipped between pass and fail. Studio renders the same comparison under **Evaluate** -> **Experiments** when you select two runs.
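
Since a comparison is only meaningful over the same dataset version, a script can check that before calling `compare`. The guard below is a hypothetical sketch: `compare_runs` is not a CLI command, and it assumes the caller already knows each run's dataset version ID (how you look those up is outside this example).

```bash
# Hypothetical guard (not part of the agnt5 CLI).
# Refuses to compare two runs unless both used the same dataset version.
compare_runs() {
  # $1 = base run ID,            $2 = candidate run ID
  # $3 = base dataset version ID, $4 = candidate dataset version ID
  if [ "$3" != "$4" ]; then
    echo "refusing to compare: dataset versions differ ($3 vs $4)" >&2
    return 2
  fi
  agnt5 experiments runs compare "$1" "$2"
}
```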

---

## Gate CI on eval results

Use `experiments run --wait` or the `reports` commands in a CI job to block a merge or deploy on eval regressions:

```bash
# Start a run for the freshly deployed candidate, wait, and gate
agnt5 experiments run <experiment-id> \
  --deployment-id "$CANDIDATE_DEPLOYMENT_ID" \
  --name "ci-$GIT_SHA" \
  --wait --fail-on-gate

# Or wait on an already-started run
agnt5 reports wait <run-id> --timeout 15m

# Print the gate verdict for a finished run
agnt5 reports ci <run-id>

# Export full artifacts for the build log
agnt5 reports export <run-id> --artifact-format csv -o eval-results.csv
```

The gate verdict comes from the experiment's config (for example `{"passed_threshold": 1}`), so the threshold lives with the experiment, not the pipeline.
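
In a pipeline you usually want the exported artifacts even when the gate fails. One way to arrange that, sketched here with the hypothetical function `ci_gate`, is to record the exit status of `reports ci`, export regardless, and return the recorded status at the end. It assumes the run has already finished.

```bash
# Hypothetical wrapper (not part of the agnt5 CLI).
# Checks the gate for a finished run, always exports the results,
# and returns the gate's exit status to the caller.
ci_gate() {
  # $1 = run ID, $2 = output path for the CSV export
  gate_status=0
  # `reports ci` exits non-zero when the gate fails; keep that status
  agnt5 reports ci "$1" || gate_status=$?
  # Export either way so the build log has the evidence
  agnt5 reports export "$1" --artifact-format csv -o "$2" \
    || echo "warning: export failed" >&2
  return "$gate_status"
}
```

Calling `ci_gate <run-id> eval-results.csv` as the last step of the job then fails the job exactly when the gate fails.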

---

## Turn failures into a regression dataset

After a run with failures, capture the failing items as a new dataset so fixes stay fixed:

```bash
agnt5 experiments runs regression-dataset <run-id> \
  --name support-agent-regressions \
  --start-run --wait
```

This creates a dataset from the failed items (all of them, or specific ones via repeatable `--run-item-id`), a regression experiment over it, and — with `--start-run` — kicks off the first rerun immediately.
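
To capture only selected failures, `--run-item-id` has to be repeated once per item. The sketch below builds those flags from a plain argument list; `regression_from_items` is a hypothetical helper, and with no item IDs it falls back to the default of using every failed item.

```bash
# Hypothetical helper (not part of the agnt5 CLI).
# Creates a regression dataset from specific run items by repeating
# --run-item-id for each ID passed after the dataset name.
regression_from_items() {
  # $1 = run ID, $2 = dataset name, remaining args = run item IDs
  run_id=$1
  name=$2
  shift 2
  # Rewrite "$@" from: id1 id2 ...  to: --run-item-id id1 --run-item-id id2 ...
  n=$#
  while [ "$n" -gt 0 ]; do
    set -- "$@" --run-item-id "$1"
    shift
    n=$((n - 1))
  done
  agnt5 experiments runs regression-dataset "$run_id" --name "$name" "$@"
}
```

Append `--start-run --wait` to the final command if the first rerun should start immediately.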


- **Commands**: `agnt5 experiments {list,create,run,runs list|show|cancel|compare|regression-dataset}`, `agnt5 reports {summary,failures,ci,wait,export}`, `agnt5 scores {list,evidence}`
- **Create requires**: `--name`, `--dataset-id`, `--dataset-version-id`, target (`--target-type component --deployment-id --component-name --component-type`, or `--target-type prompt --prompt-id`), at least one `--builtin-scorer <name|json>` or `--scorer-id <uuid>`
- **CI gating**: `experiments run --wait --fail-on-gate` and `reports ci|wait` exit non-zero on gate failure; threshold set via experiment `--config '{"passed_threshold": <0..1>}'`
- **Model**: experiment versions and dataset versions are immutable; a run = one experiment version × one dataset version; per-item scores queryable via `scores list --run-id`.


## Next steps

* [Datasets](/docs/improve/datasets.md): grow the dataset behind the experiment and publish a new version.
* [Scorers](/docs/improve/scorers.md): tighten what pass/fail means, or write a custom scorer.
* [Deploying](/docs/run/deploying.md): ship the candidate deployment your experiment runs against.
* [Prompts](/docs/build/prompts.md): version the prompt artifacts that prompt-target experiments compare.
