---
title: Turn a failed production AI run into an eval
description: Capture a bad production run, convert it into an eval case, and compare fixed prompts before release.
tags: ["Evals", "Replay", "Production"]
date: 2026-05-13
last_verified: 2026-05-13
audience: both
---

The most useful eval cases often start as production failures. This cookbook
shows how to capture a bad run, preserve its prompt, tool results, state, and
output, and then replay it against a fixed prompt or model before promoting
the change.

## Scenario

A workflow classifies enterprise support tickets. A customer reports that a
security-sensitive ticket was routed to the wrong queue. The run exists in
production with the original input, tool results, and model output.

## What you build

- A production failure review flow.
- An eval case derived from the failed run.
- A scorer that captures the expected behavior.
- A replay comparison between current and candidate workflow versions.
- A promotion gate based on the fixed case.

## Capture the run

Start from the production run, not from a handwritten reproduction.

```bash
agnt5 runs describe run_01JSECURITY
agnt5 eval cases create --from-run run_01JSECURITY --dataset support-routing-regressions
```

The generated case should include:

- workflow input,
- relevant tool results,
- the model output,
- the expected routing outcome,
- metadata linking back to the production run.
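
For illustration, here is a hypothetical sketch of the captured state as a
plain Python dict. The field names and values are assumptions, not the exact
`agnt5` case schema; the point is that everything the model saw in production
travels with the case.

```python
# Illustrative shape only; field names and values are assumptions.
case = {
    "input": {"subject": "Unauthorized access attempt", "body": "..."},
    "tool_results": [{"tool": "account_lookup", "result": "..."}],  # frozen from the run
    "model_output": {"queue": "billing", "severity": "low"},  # the bad routing
    "expected": {"queue": "security", "severity": "high"},
    "metadata": {"source_run": "run_01JSECURITY"},
}
```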

## Write the scorer

Use a deterministic scorer for routing when possible.

```python
# Import paths are assumed here; adjust them to match your agnt5 SDK version.
from agnt5.evals import EvalContext, ScorerResultPy, scorer

@scorer(name="routes_security_ticket")
def routes_security_ticket(ctx: EvalContext) -> ScorerResultPy:
    # Validate the raw workflow output against the routing schema.
    output = SupportRoute.model_validate(ctx.output)
    # A security ticket must land in the security queue at high severity.
    passed = output.queue == "security" and output.severity in {"high", "critical"}
    return ScorerResultPy(score=1.0 if passed else 0.0, passed=passed)
```
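
The scorer validates against `SupportRoute`, the workflow's structured routing
output. A minimal sketch of what that Pydantic model might look like, with
fields inferred from the checks above:

```python
from pydantic import BaseModel

class SupportRoute(BaseModel):
    # Queue the ticket was routed to, e.g. "security" or "billing".
    queue: str
    # Severity assigned by the workflow, e.g. "high" or "critical".
    severity: str
```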

The scorer turns the production failure into a guardrail that runs on every
future prompt, model, or tool change.
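
Before registering the case, you can sanity-check the pass condition in plain
Python, no eval harness needed; the values here are the illustrative ones from
the case sketch above:

```python
# The captured production output should fail the check...
bad = SupportRoute(queue="billing", severity="low")
assert not (bad.queue == "security" and bad.severity in {"high", "critical"})

# ...while the expected outcome passes it.
good = SupportRoute(queue="security", severity="high")
assert good.queue == "security" and good.severity in {"high", "critical"}
```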

## Replay the candidate

Change the routing prompt, model, or tool policy in a candidate workflow
version. Replay the captured case before promoting.

```bash
agnt5 eval run support-routing-regressions --workflow-version candidate
agnt5 eval compare --baseline production --candidate candidate
```

The comparison should show the failed case passing without regressing the rest
of the dataset.
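
If your pipeline consumes the comparison programmatically (assuming a per-case
pass/fail export is available; the format below is an assumption, not the
`agnt5` comparison schema), the promotion rule is mechanical: the captured case
must flip to passing, and no case that passed on the baseline may flip to
failing. A minimal sketch of that rule:

```python
# Hypothetical per-case pass/fail results keyed by case id.
def safe_to_promote(
    baseline: dict[str, bool], candidate: dict[str, bool], fixed_case: str
) -> bool:
    # The captured regression must now pass on the candidate...
    if not candidate.get(fixed_case, False):
        return False
    # ...and every case that passed on the baseline must still pass.
    return all(candidate.get(case_id, False)
               for case_id, passed in baseline.items() if passed)
```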

## Production checks

- The eval case links back to the original run.
- The case contains enough state to reproduce the failure offline.
- The scorer fails on the production version.
- The scorer passes on the candidate version.
- CI or a release checklist blocks promotion if this case regresses (a minimal
  gate sketch follows this list).
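
A minimal CI gate sketch, assuming `agnt5 eval run` exits nonzero when any case
in the dataset fails; verify that exit-code contract against your CLI version:

```python
import subprocess
import sys

# Rerun the regression dataset against the candidate before release.
result = subprocess.run(
    ["agnt5", "eval", "run", "support-routing-regressions",
     "--workflow-version", "candidate"],
)
if result.returncode != 0:
    sys.exit("support-routing-regressions failed; blocking promotion")
```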

## Next steps

- [Build a model comparison workflow](/cookbooks/model-comparison.md)
- [Debug and replay a failed AI workflow](/cookbooks/debug-production-run.md)
- [Build a customer support agent](/cookbooks/customer-support-agent.md)
