A Dataset is a curated collection of inputs the agent should handle correctly — the cases you want to evaluate against. Datasets can be authored by hand, generated, or captured directly from production Runs. Capture is the common case: a Run that surfaces a novel failure mode becomes a Dataset entry in a single action, and future Experiments are measured against it.
You’ll find Datasets under Improve → Datasets in Studio.
Datasets grow as production surfaces new cases. Every Experiment is measured against a Dataset version, so comparisons are stable even as the Dataset itself evolves — new cases don’t silently invalidate old scores.
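The versioning behavior described above can be sketched in miniature. Note that `Dataset`, `DatasetVersion`, and the method names below are hypothetical illustrations of the concept (immutable snapshots that Experiments pin to), not the product's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetVersion:
    """An immutable snapshot of the Dataset at a point in time (hypothetical)."""
    version: int
    cases: tuple  # frozen tuple of inputs; never mutated after creation

class Dataset:
    """Hypothetical model: every added case produces a new version,
    so an Experiment pinned to an older version keeps a stable score basis."""

    def __init__(self):
        self._versions = [DatasetVersion(1, ())]

    @property
    def latest(self):
        return self._versions[-1]

    def add_case(self, case):
        # New cases extend a copy; earlier versions are untouched.
        prev = self.latest
        new = DatasetVersion(prev.version + 1, prev.cases + (case,))
        self._versions.append(new)
        return new

    def version(self, n):
        return self._versions[n - 1]

ds = Dataset()
pinned = ds.add_case("run-123: novel failure input")   # creates version 2
ds.add_case("run-456: another captured case")          # creates version 3

# An Experiment pinned to version 2 still sees exactly one case,
# even though the Dataset has since grown to two.
assert len(ds.version(pinned.version).cases) == 1
assert len(ds.latest.cases) == 2
```

The design choice the sketch highlights: because each version is immutable, comparing two Experiments run against the same version number is always apples-to-apples, regardless of how much the Dataset has grown since.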
A deeper guide is in progress. For the full improvement loop, see the Improve overview.