May 13, 2026 · Evals · Models · CI
Build a model comparison workflow
Run the same case through multiple models, score outputs, and promote the release candidate.
Model changes are production changes. This cookbook builds a workflow for comparing model candidates against the same inputs, scoring outputs, and promoting a winner only when it clears the eval gate.
Scenario
You want to move a classification workflow to a cheaper or stronger model. The team needs evidence that quality does not regress on real production cases.
What you build
- A candidate list of models.
- A replayable eval dataset.
- A comparison workflow that runs each case through each model.
- Deterministic and judge-based scorers.
- A release gate for promotion.
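A minimal sketch of the data model those pieces share, assuming simple dataclasses; the field names (input, expected, score) and the candidate list are illustrative, not a fixed schema:

from dataclasses import dataclass, field

# Hypothetical data model for the comparison workflow; adapt field names to your own schema.
@dataclass
class EvalCase:
    case_id: str
    input: str       # the production input to replay
    expected: str    # expected class label or reference output

@dataclass
class ModelOutput:
    model: str
    output: str
    score: float

@dataclass
class ModelComparison:
    case_id: str
    outputs: list[ModelOutput] = field(default_factory=list)

# Candidate list: the current baseline plus the models under evaluation.
CANDIDATE_MODELS = ["baseline-model", "candidate-model-a", "candidate-model-b"]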
Workflow shape
@workflow
async def compare_models(ctx: WorkflowContext, case_id: str, models: list[str]) -> ModelComparison:
    case = await ctx.step(load_eval_case, case_id)
    outputs = []
    for model in models:
        output = await ctx.step(run_case_with_model, case, model)
        score = await ctx.step(score_model_output, case.expected, output)
        outputs.append(ModelOutput(model=model, output=output, score=score))
    return ModelComparison(case_id=case_id, outputs=outputs)

For larger datasets, fan out by case and aggregate scores in a separate step.
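One way that fan-out could look, assuming the workflow SDK lets a step invoke a child workflow and lets you gather steps concurrently; compare_models_over_dataset, aggregate_scores, and the asyncio.gather pattern are illustrations, not a documented API:

import asyncio

# Hypothetical fan-out: run the comparison for every case concurrently,
# then compute mean scores per model in a separate aggregation step.
@workflow
async def compare_models_over_dataset(
    ctx: WorkflowContext, case_ids: list[str], models: list[str]
) -> dict[str, float]:
    comparisons = await asyncio.gather(
        *(ctx.step(compare_models, case_id, models) for case_id in case_ids)
    )
    return await ctx.step(aggregate_scores, comparisons)

def aggregate_scores(comparisons: list[ModelComparison]) -> dict[str, float]:
    # Mean score per model across all cases.
    totals: dict[str, list[float]] = {}
    for comparison in comparisons:
        for out in comparison.outputs:
            totals.setdefault(out.model, []).append(out.score)
    return {model: sum(scores) / len(scores) for model, scores in totals.items()}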
Scoring strategy
Use deterministic scorers when the expected output is structured:
- exact class match,
- required fields present,
- forbidden terms absent,
- citation coverage.
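A minimal sketch of deterministic scorers along those lines; the output shapes (plain strings, a dict for structured outputs) are assumptions to adapt to your case format:

def exact_class_match(expected: str, output: str) -> bool:
    # Normalize whitespace and case before comparing class labels.
    return expected.strip().lower() == output.strip().lower()

def required_fields_present(output: dict, required: list[str]) -> bool:
    # Structured outputs must contain every required field with a non-empty value.
    return all(name in output and output[name] not in (None, "") for name in required)

def forbidden_terms_absent(output: str, forbidden: list[str]) -> bool:
    lowered = output.lower()
    return not any(term.lower() in lowered for term in forbidden)

def citation_coverage(output: str, source_ids: list[str]) -> float:
    # Fraction of expected source ids that the output actually cites.
    cited = [sid for sid in source_ids if sid in output]
    return len(cited) / len(source_ids) if source_ids else 1.0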
Use an LLM judge for subjective dimensions, but keep the judge prompt versioned and trace-visible.
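One way to keep the judge prompt versioned and trace-visible, sketched below; call_judge_model is a placeholder for your LLM client, and the prompt and 0-to-1 scale are illustrative:

# The judge prompt is pinned to a version so traces show which rubric produced each score.
JUDGE_PROMPT_VERSION = "v3"
JUDGE_PROMPT = """You are grading a classification output.
Expected: {expected}
Actual: {actual}
Return a single number from 0 to 1 for correctness."""

async def judge_score(expected: str, actual: str) -> dict:
    # call_judge_model is a placeholder for your LLM client call.
    raw = await call_judge_model(JUDGE_PROMPT.format(expected=expected, actual=actual))
    return {
        "score": float(raw.strip()),
        "judge_prompt_version": JUDGE_PROMPT_VERSION,  # recorded alongside the score
    }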
Promotion checks
- Candidate model beats or matches baseline on critical cases.
- Cost and latency stay inside thresholds.
- Failures link to traces for inspection.
- Known production failures are included in the dataset.
- CI blocks the release when the score drops below the threshold.
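A sketch of what that gate can look like as a CI step; the thresholds, field names, and candidate_summary/baseline_summary inputs are illustrative and would come from your aggregation step:

# Hypothetical release gate: thresholds are placeholders, not recommendations.
SCORE_THRESHOLD = 0.95      # minimum mean score on critical cases
MAX_COST_RATIO = 1.2        # candidate may cost at most 20% more than the baseline
MAX_P95_LATENCY_MS = 2000

def promote(candidate: dict, baseline: dict) -> bool:
    # Candidate must match or beat the baseline on critical cases...
    if candidate["critical_score"] < baseline["critical_score"]:
        return False
    # ...clear the absolute score threshold that CI enforces...
    if candidate["critical_score"] < SCORE_THRESHOLD:
        return False
    # ...and stay inside the cost and latency budgets.
    if candidate["cost_per_case"] > baseline["cost_per_case"] * MAX_COST_RATIO:
        return False
    if candidate["p95_latency_ms"] > MAX_P95_LATENCY_MS:
        return False
    return True

if not promote(candidate_summary, baseline_summary):
    raise SystemExit("Eval gate failed: release blocked")  # non-zero exit blocks CI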