Experiments
Run a deployed component against a dataset version, score every item, compare runs, and gate CI on the results.
An experiment binds a target — a deployed component or a prompt — to a dataset version and a set of scorers. Each experiment run executes the target against every dataset item, scores the outputs, and produces a pass/fail summary you can compare across runs or enforce in CI.
Experiment versions
- An experiment’s definition (target, scorers, config) is versioned. Starting a run snapshots it as an immutable experiment version, so older runs always show the exact configuration that produced them.
- A run pairs one experiment version with one dataset version. Because both sides are immutable, two runs over the same dataset version are directly comparable.
- Each dataset item becomes a run item with its own output and per-scorer scores.
Create an experiment
The target component must already be deployed. Then:
agnt5 experiments create \
  --name support-agent-quality \
  --dataset-id <dataset-id> \
  --dataset-version-id <dataset-version-id> \
  --target-type component \
  --deployment-id <deployment-id> \
  --component-name support_agent \
  --component-type agent \
  --builtin-scorer json_valid \
  --builtin-scorer '{"name":"tool_called","config":{"tool":"search_orders"}}' \
  --config '{"passed_threshold":1}'

- --builtin-scorer attaches a built-in scorer by name, or by JSON object when it needs config. Repeat for each scorer.
- --scorer-id attaches a deployed custom scorer by UUID.
- --target-type prompt with --prompt-id targets a Prompt instead of a component.
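For a prompt target scored by a custom scorer, the create call might look like the sketch below. It assumes the naming, dataset, and config flags carry over unchanged from the component example and that the component-specific flags (--deployment-id, --component-name, --component-type) are simply dropped. The function name, the experiment name support-prompt-quality, and the argument order are illustrative and not part of the CLI.

```shell
# Hypothetical helper: create an experiment that targets a Prompt and is
# scored by one deployed custom scorer. Only flags named in this section are
# used; whether a prompt target needs additional flags is an assumption.
create_prompt_experiment() {
  local dataset_id="$1" dataset_version_id="$2" prompt_id="$3" scorer_id="$4"

  agnt5 experiments create \
    --name support-prompt-quality \
    --dataset-id "$dataset_id" \
    --dataset-version-id "$dataset_version_id" \
    --target-type prompt \
    --prompt-id "$prompt_id" \
    --scorer-id "$scorer_id" \
    --config '{"passed_threshold":1}'
}
```

Call it with the four IDs in order: dataset, dataset version, prompt, scorer.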
Via Studio: open your project, go to Evaluate -> Experiments, and create the experiment with the same choices — dataset version, target, and scorers.
Run an experiment
# Fire and forget
agnt5 experiments run <experiment-id>
# Block until the run finishes; exit non-zero if the gate fails
agnt5 experiments run <experiment-id> --wait --timeout 15m

Useful overrides:
| Flag | Purpose |
|---|---|
| --name | Label the run (e.g. pr-1234) |
| --deployment-id | Compare a candidate deployment against the baseline for the same component |
| --experiment-version-id | Re-run an older immutable experiment version |
| --fail-on-gate | Return non-zero when the CI gate fails (default true with --wait) |
Inspect results
# Runs for an experiment
agnt5 experiments runs list <experiment-id>
# One run's summary: status, pass rate, per-scorer aggregates
agnt5 experiments runs show <run-id>
# Failed items only
agnt5 reports failures <run-id>
# Why a specific score failed
agnt5 scores list --run-id <run-id>
agnt5 scores evidence <score-id> --include scorer_input,scorer_output,evidence

In Studio, the experiment run page shows the run timeline, per-item results, and per-scorer scores; click any item to see its output and score evidence side by side.
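When several scores need a closer look, the evidence command can be looped from the CLI. The sketch below is a small convenience wrapper, not a built-in command: the function name is hypothetical, and the score IDs are expected to be copied by hand from the scores list output, whose format is not assumed here.

```shell
# Hypothetical helper: print evidence for each score ID given as an argument.
# Score IDs come from `agnt5 scores list --run-id <run-id>`; nothing about
# that command's output format is assumed.
show_score_evidence() {
  local score_id
  for score_id in "$@"; do
    echo "== evidence for score $score_id =="
    agnt5 scores evidence "$score_id" \
      --include scorer_input,scorer_output,evidence
  done
}
```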
Cancel a stuck run with agnt5 experiments runs cancel <experiment-id> <run-id>.
Compare runs
Compare a candidate against a baseline over the same dataset version:
agnt5 experiments runs compare <base-run-id> <compare-run-id>

The comparison shows aggregate score movement and which items flipped between pass and fail. Studio renders the same comparison under Evaluate -> Experiments when you select two runs.
Gate CI on eval results
Use agnt5 experiments run with --wait, together with the reports commands, in a CI job to block a merge or deploy on eval regressions:
# Start a run for the freshly deployed candidate, wait, and gate
agnt5 experiments run <experiment-id> \
  --deployment-id "$CANDIDATE_DEPLOYMENT_ID" \
  --name "ci-$GIT_SHA" \
  --wait --fail-on-gate
# Or wait on an already-started run
agnt5 reports wait <run-id> --timeout 15m
# Print the gate verdict for a finished run
agnt5 reports ci <run-id>
# Export full artifacts for the build log
agnt5 reports export <run-id> --artifact-format csv -o eval-results.csv

The gate verdict comes from the experiment’s config (for example {"passed_threshold": 1}), so the threshold lives with the experiment, not the pipeline.
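In a pipeline script, the gated run can be wrapped so that its exit status is captured, logged, and passed on to the CI step. The following is a minimal sketch under stated assumptions: the function name and argument order are hypothetical, and it assumes --timeout can be combined with --fail-on-gate in one invocation, which this page shows only separately.

```shell
# Hypothetical CI helper built from the commands above.
# Arguments: experiment ID, candidate deployment ID, run label.
run_eval_gate() {
  local experiment_id="$1" deployment_id="$2" label="$3"
  local rc=0

  # Capture the exit status instead of letting `set -e` abort the script.
  agnt5 experiments run "$experiment_id" \
    --deployment-id "$deployment_id" \
    --name "$label" \
    --wait --timeout 15m --fail-on-gate || rc=$?

  if [ "$rc" -eq 0 ]; then
    echo "eval gate passed for $label"
  else
    # A non-zero status may also mean a timeout or CLI error,
    # not only a failed gate.
    echo "eval run for $label exited with status $rc" >&2
  fi
  return "$rc"
}
```

A job step could then call run_eval_gate "$EXPERIMENT_ID" "$CANDIDATE_DEPLOYMENT_ID" "ci-$GIT_SHA", where EXPERIMENT_ID is a variable you would define yourself, and let the non-zero return fail the step.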
Turn failures into a regression dataset
After a run with failures, capture the failing items as a new dataset so fixes stay fixed:
agnt5 experiments runs regression-dataset <run-id> \
  --name support-agent-regressions \
  --start-run --wait

This creates a dataset from the failed items (all of them, or specific ones via the repeatable --run-item-id flag) and a regression experiment over it. With --start-run, it also kicks off the first rerun immediately.
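To limit the regression dataset to chosen items, --run-item-id is repeated once per item. The sketch below builds that flag list from plain arguments; the function name and argument order are hypothetical, and it leaves out --start-run and --wait so that it only creates the dataset and experiment.

```shell
# Hypothetical helper: build a regression dataset from a run, optionally
# limited to specific run items. With no item IDs, no --run-item-id flag is
# passed, which per the text above selects all failed items.
create_regression_dataset() {
  local run_id="$1" name="$2" item_id
  shift 2

  # Turn each remaining argument into a "--run-item-id <id>" pair: append the
  # pair to the end of the positional parameters, then drop the original ID
  # from the front. The loop's word list is expanded once, before the first
  # iteration, so modifying "$@" inside the loop is safe.
  for item_id in "$@"; do
    set -- "$@" --run-item-id "$item_id"
    shift
  done

  agnt5 experiments runs regression-dataset "$run_id" \
    --name "$name" \
    "$@"
}
```

For example, create_regression_dataset <run-id> support-agent-regressions <item-a> <item-b> would pass two --run-item-id flags, one per item.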