---
title: Build a model comparison workflow
description: Run the same case through multiple models, score outputs, and promote the release candidate.
tags: ["Evals", "Models", "CI"]
date: 2026-05-13
last_verified: 2026-05-13
audience: both
---

Model changes are production changes. This cookbook builds a workflow for
comparing model candidates against the same inputs, scoring outputs, and
promoting a winner only when it clears the eval gate.

## Scenario

You want to move a classification workflow to a cheaper or stronger model. The
team needs evidence that quality does not regress on real production cases.

## What you build

- A candidate list of models.
- A replayable eval dataset.
- A comparison workflow that runs each case through each model.
- Deterministic and judge-based scorers.
- A release gate for promotion.

## Workflow shape

```python
@workflow
async def compare_models(
    ctx: WorkflowContext, case_id: str, models: list[str]
) -> ModelComparison:
    # Load the case once so every model sees identical input.
    case = await ctx.step(load_eval_case, case_id)
    outputs = []
    for model in models:
        # Run and score each model as separate steps so each
        # result is recorded (and replayable) on its own.
        output = await ctx.step(run_case_with_model, case, model)
        score = await ctx.step(score_model_output, case.expected, output)
        outputs.append(ModelOutput(model=model, output=output, score=score))
    return ModelComparison(case_id=case_id, outputs=outputs)
```
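
The result types are plain containers. A minimal sketch of the fields the
workflow above assumes (these are illustrative, not part of any SDK):

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ModelOutput:
    model: str
    output: Any    # raw model response for this case
    score: float   # scorer verdict, e.g. 0.0 to 1.0


@dataclass
class ModelComparison:
    case_id: str
    outputs: list[ModelOutput]
```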

For larger datasets, fan out by case and aggregate scores in a separate step.
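
A minimal fan-out sketch, assuming asyncio-style concurrency: `compare_models`
comes from above, while `DatasetComparison`, `aggregate_scores`, and the
`gather` call stand in for whatever fan-out and aggregation primitives your
workflow runtime provides.

```python
import asyncio


@workflow
async def compare_models_dataset(
    ctx: WorkflowContext, case_ids: list[str], models: list[str]
) -> DatasetComparison:
    # Fan out: one comparison per case, run concurrently.
    comparisons = await asyncio.gather(
        *(compare_models(ctx, case_id, models) for case_id in case_ids)
    )
    # Aggregate in a separate step so per-model summaries are
    # recorded independently of the individual runs.
    return await ctx.step(aggregate_scores, comparisons)
```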

## Scoring strategy

Use deterministic scorers when the expected output is structured (a sketch follows this list):

- exact class match,
- required fields present,
- forbidden terms absent,
- citation coverage.
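
A minimal sketch of a deterministic scorer covering the first three checks;
the `label`, `required_fields`, and `forbidden_terms` field names are
assumptions, and citation coverage depends on your citation format:

```python
def score_classification(expected: dict, output: dict) -> float:
    """Deterministic scorer: structural checks, no model in the loop."""
    rendered = str(output)
    checks = [
        output.get("label") == expected["label"],                      # exact class match
        all(f in output for f in expected["required_fields"]),         # required fields present
        not any(t in rendered for t in expected["forbidden_terms"]),   # forbidden terms absent
    ]
    return sum(checks) / len(checks)
```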

Use an LLM judge for subjective dimensions, but keep the judge prompt versioned
and trace-visible.
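
One way to keep the judge versioned and trace-visible is to pin the prompt
text to a version string and record both with every verdict. A sketch, where
`call_judge_model` is a hypothetical client call:

```python
JUDGE_PROMPT_VERSION = "judge-v2"  # bump whenever the prompt text changes
JUDGE_PROMPT = (
    "Grade the candidate answer against the expected answer. "
    "Return a score from 0 to 1 and a one-line rationale."
)


async def judge_output(expected: str, output: str) -> dict:
    # call_judge_model is a placeholder; swap in your own client.
    verdict = await call_judge_model(JUDGE_PROMPT, expected, output)
    # Record the prompt version alongside the verdict so traces show
    # exactly which judge graded which run.
    return {
        "prompt_version": JUDGE_PROMPT_VERSION,
        "score": verdict.score,
        "rationale": verdict.rationale,
    }
```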

## Promotion checks

- Candidate model beats or matches baseline on critical cases.
- Cost and latency stay inside thresholds.
- Failures link to traces for inspection.
- Known production failures are included in the dataset.
- CI blocks the release when the score drops below the threshold (see the gate sketch below).
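
A minimal sketch of the release gate; the summary dicts, field names, and
threshold values are assumptions to adapt to your own metrics:

```python
def promotion_gate(candidate: dict, baseline: dict,
                   min_score: float = 0.95,
                   max_cost_ratio: float = 1.0,
                   max_latency_ratio: float = 1.2) -> bool:
    """Return True only when the candidate clears every release check."""
    return (
        candidate["score"] >= max(min_score, baseline["score"])       # no quality regression
        and candidate["cost"] <= baseline["cost"] * max_cost_ratio    # cost inside threshold
        and candidate["p95_latency"] <= baseline["p95_latency"] * max_latency_ratio
    )


# In CI, after the comparison workflow finishes:
# if not promotion_gate(candidate_summary, baseline_summary):
#     raise SystemExit("Promotion blocked: candidate failed the eval gate.")
```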

## Next steps

- [Turn a failed production AI run into an eval](/cookbooks/production-run-to-eval.md)
- [Build a data extraction workflow](/cookbooks/data-extraction.md)
- [Debug and replay a failed AI workflow](/cookbooks/debug-production-run.md)
