Improve

Close the loop — add an eval, fix the failure, see the diff.

This is stage 5 of 5 in the AGNT5 loop, the part that makes the loop a loop. You already see runs in Observe; this stage turns observation into action.

The flow:

1. Pick a bad run from Studio (a regression, a model that hallucinated, a tool that timed out).
2. Capture its input into a dataset.
3. Write an eval: a function that grades a run’s output against expected behavior.
4. Make a change: prompt, model, retry policy, or code.
5. Replay the dataset against the new version. Read the diff in Studio.
6. Gate the deploy on the eval if you want it enforced in CI.
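
To make the flow concrete, here is a minimal sketch in plain Python. It does not use the AGNT5 SDK: `Case`, `contains_expected`, `replay`, `diff`, and `gate` are hypothetical names chosen for illustration, the two agents are hard-coded stand-ins for real model calls, and the substring check is just one simple way to grade an output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One captured input plus the behavior we expect (hypothetical shape)."""
    input: str
    expected_substring: str


def contains_expected(output: str, case: Case) -> bool:
    """Eval: pass when the output mentions the expected text, ignoring case."""
    return case.expected_substring.lower() in output.lower()


def replay(agent: Callable[[str], str], dataset: list[Case]) -> list[bool]:
    """Run every captured input through one agent version and grade each output."""
    return [contains_expected(agent(case.input), case) for case in dataset]


def diff(before: list[bool], after: list[bool]) -> dict[str, list[int]]:
    """Indices of cases whose grade changed between the two versions."""
    pairs = list(enumerate(zip(before, after)))
    return {
        "fixed": [i for i, (b, a) in pairs if not b and a],
        "regressed": [i for i, (b, a) in pairs if b and not a],
    }


def gate(results: list[bool], min_pass_rate: float = 1.0) -> bool:
    """CI gate: True only when the pass rate meets the threshold."""
    return bool(results) and sum(results) / len(results) >= min_pass_rate


# Stand-ins for two versions of an agent; a real one would call a model.
def old_agent(question: str) -> str:
    return "I'm not sure."


def new_agent(question: str) -> str:
    answers = {"What is the capital of France?": "The capital of France is Paris."}
    return answers.get(question, "I'm not sure.")


# Inputs captured from bad runs, paired with the expected behavior.
dataset = [
    Case("What is the capital of France?", "Paris"),
    Case("What is the capital of Japan?", "Tokyo"),
]

before = replay(old_agent, dataset)
after = replay(new_agent, dataset)
print(diff(before, after))             # {'fixed': [0], 'regressed': []}
print("deploy allowed:", gate(after))  # deploy allowed: False
```

In this sketch the change fixes the first case and regresses none, but the gate still blocks the deploy because the second case fails and the default threshold is a 100% pass rate. A CI job could call `gate` and exit non-zero when it returns `False`; lowering `min_pass_rate` is a policy choice, not something the flow above prescribes.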

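This is how a model swap such as gpt-5-mini → claude stops being scary and becomes measurable: the same dataset and eval grade both versions, so the diff shows exactly which cases changed. Deeper material on datasets, eval functions, and CI gating is being filled in.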