May 13, 2026 · Evals · Models · CI

Build a model comparison workflow

Run the same case through multiple models, score outputs, and promote the release candidate.

Model changes are production changes. This cookbook builds a workflow for comparing model candidates against the same inputs, scoring outputs, and promoting a winner only when it clears the eval gate.

Scenario

You want to move a classification workflow to a cheaper or stronger model. The team needs evidence that quality does not regress on real production cases.

What you build

  • A candidate list of models.
  • A replayable eval dataset.
  • A comparison workflow that runs each case through each model.
  • Deterministic and judge-based scorers.
  • A release gate for promotion.

Workflow shape

@workflow
async def compare_models(ctx: WorkflowContext, case_id: str, models: list[str]) -> ModelComparison:
    # Load the case once, then run and score it against every candidate model.
    case = await ctx.step(load_eval_case, case_id)
    outputs = []
    for model in models:
        output = await ctx.step(run_case_with_model, case, model)
        score = await ctx.step(score_model_output, case.expected, output)
        outputs.append(ModelOutput(model=model, output=output, score=score))
    return ModelComparison(case_id=case_id, outputs=outputs)
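The ModelOutput and ModelComparison containers are not defined in the snippet; a minimal version could be plain dataclasses (the field types here are assumptions):

from dataclasses import dataclass
from typing import Any

@dataclass
class ModelOutput:
    model: str    # candidate model identifier
    output: Any   # raw model response for the case
    score: float  # scorer result for this output

@dataclass
class ModelComparison:
    case_id: str
    outputs: list[ModelOutput]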

For larger datasets, fan out by case and aggregate scores in a separate step.
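
A fan-out version might look like the sketch below. It assumes concurrent ctx.step calls can be awaited with asyncio.gather and that aggregate_scores is a step you define; check your runtime's guidance for its preferred fan-out primitive.

import asyncio

@workflow
async def compare_models_batch(ctx: WorkflowContext, case_ids: list[str], models: list[str]) -> list[ModelComparison]:
    async def one_case(case_id: str) -> ModelComparison:
        case = await ctx.step(load_eval_case, case_id)
        outputs = []
        for model in models:
            output = await ctx.step(run_case_with_model, case, model)
            score = await ctx.step(score_model_output, case.expected, output)
            outputs.append(ModelOutput(model=model, output=output, score=score))
        return ModelComparison(case_id=case_id, outputs=outputs)

    # Fan out by case; each case runs its model and scoring steps independently.
    comparisons = list(await asyncio.gather(*(one_case(cid) for cid in case_ids)))
    # Aggregate in a separate step so the summary gets its own trace span.
    await ctx.step(aggregate_scores, comparisons)
    return comparisons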

Scoring strategy

Use deterministic scorers when the expected output is structured (a minimal scorer is sketched after the list):

  • exact class match,
  • required fields present,
  • forbidden terms absent,
  • citation coverage.
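
A deterministic scorer can be a plain function over the case and the model output. The field names below (label, required_fields, forbidden_terms, rationale) are illustrative, not part of any SDK:

def score_classification(expected, output: dict) -> float:
    # Exact class match is the primary signal.
    if output.get("label") != expected.label:
        return 0.0
    # Required fields must be present and forbidden terms absent.
    missing = [f for f in expected.required_fields if f not in output]
    text = str(output.get("rationale", "")).lower()
    hits = [t for t in expected.forbidden_terms if t.lower() in text]
    if missing or hits:
        return 0.5
    return 1.0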

Use an LLM judge for subjective dimensions, but keep the judge prompt versioned and trace-visible.
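
One way to keep the judge prompt versioned and visible in traces is to attach the prompt version to every judge result. In this sketch, call_llm and the judge-model name are placeholders for whatever client and model you use; they are assumptions, not SDK APIs.

import json

JUDGE_PROMPT_VERSION = "judge-v3"
JUDGE_PROMPT = (
    "You are grading a model answer against a reference. "
    'Return JSON: {"score": <0 to 1>, "reason": "<one sentence>"}'
)

async def judge_output(expected: str, output: str) -> dict:
    # call_llm stands in for whichever client calls the judge model.
    response = await call_llm(
        model="judge-model",
        system=JUDGE_PROMPT,
        user=f"Reference:\n{expected}\n\nCandidate:\n{output}",
    )
    result = json.loads(response)
    # Attach the prompt version so traces show which judge graded each run.
    result["judge_prompt_version"] = JUDGE_PROMPT_VERSION
    return result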

Promotion checks

  • The candidate model beats or matches the baseline on critical cases.
  • Cost and latency stay inside thresholds.
  • Failures link to traces for inspection.
  • Known production failures are included in the dataset.
  • CI blocks the release when the score drops below the threshold (a minimal gate check is sketched below).
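
The gate itself can be a small check that exits non-zero in CI when the candidate slips. The score fields and thresholds below are illustrative; wire them to whatever your aggregation step produces.

def check_release_gate(baseline: dict, candidate: dict,
                       min_score: float = 0.90, max_cost_ratio: float = 1.2) -> None:
    # The candidate must beat or match the baseline on critical cases.
    if candidate["critical_score"] < baseline["critical_score"]:
        raise SystemExit("gate failed: candidate regresses on critical cases")
    # An absolute floor, so two weak models do not pass by matching each other.
    if candidate["overall_score"] < min_score:
        raise SystemExit(f"gate failed: overall score below {min_score}")
    # Cost must stay inside the agreed ratio of the baseline.
    if candidate["cost_per_case"] > baseline["cost_per_case"] * max_cost_ratio:
        raise SystemExit("gate failed: cost per case exceeds threshold")
    print("gate passed: candidate can be promoted")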

