Turn a failed production AI run into an eval
Capture a bad production run, convert it into an eval case, and compare fixed prompts before release.
The most useful eval cases often start as production failures. This cookbook shows how to capture a bad run, preserve its prompt, tools, state, and output, then replay it against a fixed prompt or model before promoting the change.
Scenario
A workflow classifies enterprise support tickets. A customer reports that a security-sensitive ticket was routed to the wrong queue. The run exists in production with the original input, tool results, and model output.
What you build
- A production failure review flow.
- An eval case derived from the failed run.
- A scorer that captures the expected behavior.
- A replay comparison between current and candidate workflow versions.
- A promotion gate based on the fixed case.
Capture the run
Start from the production run, not from a handwritten reproduction.
```shell
agnt5 runs describe run_01JSECURITY
agnt5 eval cases create --from-run run_01JSECURITY --dataset support-routing-regressions
```

The generated case should include:
- workflow input,
- relevant tool results,
- the model output,
- the expected routing outcome,
- metadata linking back to the production run.
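A captured case bundling those pieces might look like the following sketch. The field names and values here are illustrative stand-ins, not the actual agnt5 case schema:

```python
# Illustrative sketch of a captured eval case. Field names and values
# are hypothetical; they are not the actual agnt5 case schema.
captured_case = {
    "id": "case_security_misroute",
    "source_run": "run_01JSECURITY",  # metadata linking back to production
    "input": {
        "subject": "Possible credential leak in audit logs",
        "body": "We found API keys stored in plaintext.",
    },
    "tool_results": [
        {"tool": "account_lookup", "result": {"tier": "enterprise"}},
    ],
    "actual_output": {"queue": "billing", "severity": "low"},  # the failure
    "expected": {"queue": "security", "severity": "high"},     # the fix target
}

# The case is only useful if the captured output disagrees with the
# expected routing outcome.
assert captured_case["actual_output"]["queue"] != captured_case["expected"]["queue"]
```

Keeping the production run ID on the case makes the lineage auditable: anyone reviewing the dataset can trace the expected outcome back to a real failure.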
Write the scorer
Use a deterministic scorer for routing when possible.
```python
@scorer(name="routes_security_ticket")
def routes_security_ticket(ctx: EvalContext) -> ScorerResultPy:
    output = SupportRoute.model_validate(ctx.output)
    passed = output.queue == "security" and output.severity in {"high", "critical"}
    return ScorerResultPy(score=1.0 if passed else 0.0, passed=passed)
```

The scorer turns the production failure into a guardrail that runs on every future prompt, model, or tool change.
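The scorer depends on a `SupportRoute` model and the agnt5 `EvalContext`/`ScorerResultPy` types. A minimal stdlib-only sketch of the same pass/fail logic, with a plain dataclass standing in for the real model (which is likely a pydantic model in the actual SDK):

```python
from dataclasses import dataclass


# Stand-in for the real SupportRoute model; a hypothetical shape,
# not the actual SDK type.
@dataclass
class SupportRoute:
    queue: str
    severity: str


def is_correct_route(output: dict) -> bool:
    """Deterministic check mirroring the scorer's pass condition."""
    route = SupportRoute(**output)
    return route.queue == "security" and route.severity in {"high", "critical"}


# The failed production output fails; a corrected output passes.
assert not is_correct_route({"queue": "billing", "severity": "low"})
assert is_correct_route({"queue": "security", "severity": "high"})
```

Because the check is a pure function of the output, it behaves identically in local runs and CI; there is no judge model to drift.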
Replay the candidate
Change the routing prompt, model, or tool policy in a candidate workflow version. Replay the captured case before promoting.
```shell
agnt5 eval run support-routing-regressions --workflow-version candidate
agnt5 eval compare --baseline production --candidate candidate
```

The comparison should show the failed case passing without regressing the rest of the dataset.
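The promotion condition behind such a comparison can be sketched as: the fixed case must flip to passing while no previously passing case flips to failing. A hypothetical check, assuming per-case pass/fail maps from both runs (this is illustrative, not the agnt5 compare implementation):

```python
def safe_to_promote(baseline: dict[str, bool],
                    candidate: dict[str, bool],
                    fixed_case: str) -> bool:
    """Illustrative promotion gate: the captured failure must pass on the
    candidate, and no case that passed on the baseline may regress."""
    if not candidate.get(fixed_case, False):
        return False  # the fix did not actually fix the captured case
    regressions = [case for case, passed in baseline.items()
                   if passed and not candidate.get(case, False)]
    return not regressions


baseline = {"case_security_misroute": False, "case_refund": True}
candidate = {"case_security_misroute": True, "case_refund": True}
assert safe_to_promote(baseline, candidate, "case_security_misroute")
```

Treating both conditions as hard requirements keeps the dataset monotonic: each captured failure can only ever tighten the gate, never loosen it.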
Production checks
- The eval case links back to the original run.
- The case contains enough state to reproduce the failure offline.
- The scorer fails on the production version.
- The scorer passes on the candidate version.
- CI or a release checklist blocks promotion if this case regresses.
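The scorer-related checks above can be encoded directly as a CI regression test. In this sketch, `run_captured_case` is a hypothetical helper that replays the captured case against a named workflow version and returns its routing output:

```python
def run_captured_case(version: str) -> dict:
    # Hypothetical stand-in for replaying the captured case against a
    # workflow version; outputs here are illustrative.
    outputs = {
        "production": {"queue": "billing", "severity": "low"},
        "candidate": {"queue": "security", "severity": "high"},
    }
    return outputs[version]


def is_correct_route(output: dict) -> bool:
    return output["queue"] == "security" and output["severity"] in {"high", "critical"}


# The scorer must fail on the production version and pass on the
# candidate; if either assertion fails, CI blocks promotion.
assert not is_correct_route(run_captured_case("production"))
assert is_correct_route(run_captured_case("candidate"))
```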