Experiments

Run a deployed component or prompt against a dataset version, score every item, compare runs, and gate CI on the results.

An experiment binds a target — a deployed component or a prompt — to a dataset version and a set of scorers. Each experiment run executes the target against every dataset item, scores the outputs, and produces a pass/fail summary you can compare across runs or enforce in CI.

Experiment versions

An experiment’s definition (target, scorers, config) is versioned. Starting a run snapshots it as an immutable experiment version, so older runs always show the exact configuration that produced them.
A run pairs one experiment version with one dataset version. Because both sides are immutable, two runs over the same dataset version are directly comparable.
Each dataset item becomes a run item with its own output and per-scorer scores.

Create an experiment

The target component must already be deployed. Once it is, create the experiment:

agnt5 experiments create \
  --name support-agent-quality \
  --dataset-id <dataset-id> \
  --dataset-version-id <dataset-version-id> \
  --target-type component \
  --deployment-id <deployment-id> \
  --component-name support_agent \
  --component-type agent \
  --builtin-scorer json_valid \
  --builtin-scorer '{"name":"tool_called","config":{"tool":"search_orders"}}' \
  --config '{"passed_threshold":1}'

--builtin-scorer attaches a built-in scorer by name, or by JSON object when it needs config. Repeat for each scorer.
--scorer-id attaches a deployed custom scorer by UUID.
--target-type prompt with --prompt-id targets a Prompt instead of a component.
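
For a prompt target scored by a custom scorer, the flags above might combine as in the sketch below. It assumes the dataset and config flags carry over unchanged from the component example and that the component-specific flags (--deployment-id, --component-name, --component-type) are simply left out. The experiment name and the wrapper function are hypothetical.

```shell
# Sketch: create a prompt-target experiment with a deployed custom scorer.
# Assumption: only flags shown in this guide are used; the name
# "support-prompt-quality" and this wrapper function are illustrative.
create_prompt_experiment() {
  dataset_id="$1"
  dataset_version_id="$2"
  prompt_id="$3"
  scorer_id="$4"

  agnt5 experiments create \
    --name support-prompt-quality \
    --dataset-id "$dataset_id" \
    --dataset-version-id "$dataset_version_id" \
    --target-type prompt \
    --prompt-id "$prompt_id" \
    --scorer-id "$scorer_id" \
    --builtin-scorer json_valid \
    --config '{"passed_threshold":1}'
}
```

Call it as create_prompt_experiment <dataset-id> <dataset-version-id> <prompt-id> <scorer-id>.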

Via Studio: open your project, go to Evaluate -> Experiments, and create the experiment with the same choices — dataset version, target, and scorers.

Run an experiment

# Fire and forget
agnt5 experiments run <experiment-id>

# Block until the run finishes; exit non-zero if the gate fails
agnt5 experiments run <experiment-id> --wait --timeout 15m

Useful overrides:

Flag	Purpose
`--name`	Label the run (e.g. `pr-1234`)
`--deployment-id`	Run against a candidate deployment of the same component so the result can be compared with the baseline
`--experiment-version-id`	Re-run an older immutable experiment version
`--fail-on-gate`	Return non-zero when the CI gate fails (default true with `--wait`)

Inspect results

# Runs for an experiment
agnt5 experiments runs list <experiment-id>

# One run's summary: status, pass rate, per-scorer aggregates
agnt5 experiments runs show <run-id>

# Failed items only
agnt5 reports failures <run-id>

# Why a specific score failed
agnt5 scores list --run-id <run-id>
agnt5 scores evidence <score-id> --include scorer_input,scorer_output,evidence

In Studio, the experiment run page shows the run timeline, per-item results, and per-scorer scores; click any item to see its output and score evidence side by side.

Cancel a stuck run with agnt5 experiments runs cancel <experiment-id> <run-id>.

Compare runs

Compare a candidate against a baseline over the same dataset version:

agnt5 experiments runs compare <base-run-id> <compare-run-id>

The comparison shows aggregate score movement and which items flipped between pass and fail. Studio renders the same comparison under Evaluate -> Experiments when you select two runs.

Gate CI on eval results

In a CI job, use experiments run with --wait, or the reports commands, to block a merge or deploy on eval regressions:

# Start a run for the freshly deployed candidate, wait, and gate
agnt5 experiments run <experiment-id> \
  --deployment-id "$CANDIDATE_DEPLOYMENT_ID" \
  --name "ci-$GIT_SHA" \
  --wait --fail-on-gate

# Or wait on an already-started run
agnt5 reports wait <run-id> --timeout 15m

# Print the gate verdict for a finished run
agnt5 reports ci <run-id>

# Export full artifacts for the build log
agnt5 reports export <run-id> --artifact-format csv -o eval-results.csv

The gate verdict comes from the experiment’s config (for example {"passed_threshold": 1}), so the threshold lives with the experiment, not the pipeline.
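
Because the gate surfaces as an exit status, a CI step can branch on it directly. The sketch below wraps the first command above in a function. It assumes a non-zero exit from the run command means the gate failed or the run itself errored, and that --timeout can be combined with --fail-on-gate. The function name and log messages are illustrative.

```shell
# Sketch of a CI step: run the experiment against the candidate deployment
# and turn the command's exit status into a pass/fail message.
# Assumption: non-zero exit means the gate failed (or the run errored);
# `ci_eval_gate` is an illustrative helper, not part of the CLI.
ci_eval_gate() {
  experiment_id="$1"
  deployment_id="$2"
  run_name="$3"

  if agnt5 experiments run "$experiment_id" \
      --deployment-id "$deployment_id" \
      --name "$run_name" \
      --wait --timeout 15m --fail-on-gate; then
    echo "eval gate passed for $run_name"
    return 0
  fi

  # Report on stderr and propagate the failure to the CI job.
  echo "eval gate failed for $run_name" >&2
  return 1
}
```

A pipeline would call it as ci_eval_gate <experiment-id> "$CANDIDATE_DEPLOYMENT_ID" "ci-$GIT_SHA" and let the non-zero return fail the job.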

Turn failures into a regression dataset

After a run with failures, capture the failing items as a new dataset so fixes stay fixed:

agnt5 experiments runs regression-dataset <run-id> \
  --name support-agent-regressions \
  --start-run --wait

This creates a dataset from the failed items (all of them, or specific ones via repeatable --run-item-id) and a regression experiment over it. With --start-run, it also kicks off the first rerun immediately.
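
To capture only selected failures, --run-item-id is repeated once per item. The sketch below builds those repeated flags from a list of IDs. The wrapper function is hypothetical and uses only the flags shown above; with no item IDs it falls back to the default of all failed items.

```shell
# Sketch: build a regression dataset from selected failed run items.
# Assumption: `regression_from_items` is an illustrative wrapper; it
# passes --run-item-id once for each item ID given after the run ID.
regression_from_items() {
  run_id="$1"
  shift  # remaining arguments are run item IDs

  # Rewrite the positional parameters in place: consume each item ID from
  # the front and append "--run-item-id <id>" at the back.
  n=$#
  while [ "$n" -gt 0 ]; do
    item="$1"
    shift
    set -- "$@" --run-item-id "$item"
    n=$((n - 1))
  done

  agnt5 experiments runs regression-dataset "$run_id" \
    --name support-agent-regressions \
    --start-run --wait "$@"
}
```

For example, regression_from_items <run-id> <run-item-id-1> <run-item-id-2> passes two --run-item-id flags.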

Next steps

Datasets: grow the dataset behind the experiment and publish a new version.
Scorers: tighten what pass/fail means, or write a custom scorer.
Deploying: ship the candidate deployment your experiment runs against.
Prompts: version the prompt artifacts that prompt-target experiments compare.