Improve
Close the loop — add an eval, fix the failure, see the diff.
This is stage 5 of 5 of the AGNT5 loop — the part that makes the loop a loop. You already see runs in Observe; this stage turns observation into action.
The flow:
- Pick a bad run from Studio (a regression, a model that hallucinated, a tool that timed out).
- Capture its input into a dataset.
- Write an eval — a function that grades a run’s output against expected behavior.
- Make a change — prompt, model, retry policy, or code.
- Replay the dataset against the new version. Read the diff in Studio.
- Gate the deploy on the eval if you want it enforced in CI.
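The flow above can be sketched in plain Python. This is a minimal illustration, not the AGNT5 SDK: this page does not show the real dataset, eval, or replay APIs, so every name here (`Case`, `contains_expected`, `replay`, `diff`, `gate`, `agent_v1`, `agent_v2`) is hypothetical, and the two "agents" are stand-in functions that make no model calls. It assumes an eval is a function returning pass or fail for one output, and that a diff is the set of cases whose result changed between two versions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One captured run: the input that failed and what a good output must contain."""
    case_id: str
    prompt: str
    must_contain: str


def contains_expected(output: str, case: Case) -> bool:
    """Eval function: pass if the expected text appears in the output (case-insensitive)."""
    return case.must_contain.lower() in output.lower()


def replay(agent: Callable[[str], str], dataset: list[Case]) -> dict[str, bool]:
    """Run every captured input through one agent version and grade each output."""
    return {case.case_id: contains_expected(agent(case.prompt), case) for case in dataset}


def diff(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two replays of the same dataset: which cases were fixed, which regressed."""
    return {
        "fixed": sorted(c for c in before if not before[c] and after[c]),
        "regressed": sorted(c for c in before if before[c] and not after[c]),
    }


def gate(results: dict[str, bool], min_pass_rate: float) -> bool:
    """CI gate: True when the pass rate meets the threshold. An empty replay fails."""
    if not results:
        return False
    return sum(results.values()) / len(results) >= min_pass_rate


# Hypothetical stand-ins for two versions of the same agent.
def agent_v1(prompt: str) -> str:
    return "I am not sure."


def agent_v2(prompt: str) -> str:
    if "refund" in prompt:
        return "Refunds are available within 30 days."
    return "I am not sure."


# Illustrative dataset: inputs captured from bad runs, with the expected behavior.
DATASET = [
    Case("refund-window", "What is the refund window?", "30 days"),
    Case("shipping-time", "How long does shipping take?", "5 business days"),
]


def main() -> int:
    """Replay both versions, print the diff, and return a process exit code."""
    before = replay(agent_v1, DATASET)
    after = replay(agent_v2, DATASET)
    print(diff(before, after))
    return 0 if gate(after, min_pass_rate=0.5) else 1
```

In a CI job, a script like this would end with `sys.exit(main())` so that a failed gate fails the build. The substring check is deliberately the simplest possible grader; a real eval might compare structured fields or use a rubric, and the threshold is a policy choice rather than a fixed value.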
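This is how model swaps such as gpt-5-mini → claude stop being scary and become measurable: the same dataset and eval run against both versions, so the effect of the change shows up as a diff instead of a guess. Deeper material on datasets, eval functions, and CI gating is still being written.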