May 13, 2026 · Evals · Models · CI
Build a model comparison workflow
Run the same case through multiple models, score outputs, and promote the release candidate.
Model changes are production changes. This cookbook builds a workflow for comparing model candidates against the same inputs, scoring outputs, and promoting a winner only when it clears the eval gate.
Scenario
You want to move a classification workflow to a cheaper or stronger model. The team needs evidence that quality does not regress on real production cases.
What you build
- A candidate list of models.
- A replayable eval dataset.
- A comparison workflow that runs each case through each model.
- Deterministic and judge-based scorers.
- A release gate for promotion.
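A minimal sketch of the data model those pieces share, assuming simple dataclasses; the field names (input, expected, score) and the candidate list are illustrative, not a fixed schema:

from dataclasses import dataclass, field

# Hypothetical data model for the comparison workflow; adapt field names to your own schema.
@dataclass
class EvalCase:
    case_id: str
    input: str       # the production input to replay
    expected: str    # expected class label or reference output

@dataclass
class ModelOutput:
    model: str
    output: str
    score: float

@dataclass
class ModelComparison:
    case_id: str
    outputs: list[ModelOutput] = field(default_factory=list)

# Candidate list: the current baseline plus the models under evaluation.
CANDIDATE_MODELS = ["baseline-model", "candidate-model-a", "candidate-model-b"]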
Workflow shape
@workflow
async def compare_models(ctx: WorkflowContext, case_id: str, models: list[str]) -> ModelComparison:
    case = await ctx.step(load_eval_case, case_id)
    outputs = []
    for model in models:
        output = await ctx.step(run_case_with_model, case, model)
        score = await ctx.step(score_model_output, case.expected, output)
        outputs.append(ModelOutput(model=model, output=output, score=score))
    return ModelComparison(case_id=case_id, outputs=outputs)

For larger datasets, fan out by case and aggregate scores in a separate step.
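One way that fan-out could look, assuming the workflow SDK lets a step invoke a child workflow and lets you gather steps concurrently; compare_models_over_dataset, aggregate_scores, and the asyncio.gather pattern are illustrations, not a documented API:

import asyncio

# Hypothetical fan-out: run the comparison for every case concurrently,
# then compute mean scores per model in a separate aggregation step.
@workflow
async def compare_models_over_dataset(
    ctx: WorkflowContext, case_ids: list[str], models: list[str]
) -> dict[str, float]:
    comparisons = await asyncio.gather(
        *(ctx.step(compare_models, case_id, models) for case_id in case_ids)
    )
    return await ctx.step(aggregate_scores, comparisons)

def aggregate_scores(comparisons: list[ModelComparison]) -> dict[str, float]:
    # Mean score per model across all cases.
    totals: dict[str, list[float]] = {}
    for comparison in comparisons:
        for out in comparison.outputs:
            totals.setdefault(out.model, []).append(out.score)
    return {model: sum(scores) / len(scores) for model, scores in totals.items()}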
Scoring strategy
Use deterministic scorers when the expected output is structured:
- exact class match,
- required fields present,
- forbidden terms absent,
- citation coverage.
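A minimal sketch of deterministic scorers along those lines; the output shapes (plain strings, a dict for structured outputs) are assumptions to adapt to your case format:

def exact_class_match(expected: str, output: str) -> bool:
    # Normalize whitespace and case before comparing class labels.
    return expected.strip().lower() == output.strip().lower()

def required_fields_present(output: dict, required: list[str]) -> bool:
    # Structured outputs must contain every required field with a non-empty value.
    return all(name in output and output[name] not in (None, "") for name in required)

def forbidden_terms_absent(output: str, forbidden: list[str]) -> bool:
    lowered = output.lower()
    return not any(term.lower() in lowered for term in forbidden)

def citation_coverage(output: str, source_ids: list[str]) -> float:
    # Fraction of expected source ids that the output actually cites.
    cited = [sid for sid in source_ids if sid in output]
    return len(cited) / len(source_ids) if source_ids else 1.0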
Use an LLM judge for subjective dimensions, but keep the judge prompt versioned and trace-visible.
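One way to keep the judge prompt versioned and trace-visible, sketched below; call_judge_model is a placeholder for your LLM client, and the prompt and 0-to-1 scale are illustrative:

# The judge prompt is pinned to a version so traces show which rubric produced each score.
JUDGE_PROMPT_VERSION = "v3"
JUDGE_PROMPT = """You are grading a classification output.
Expected: {expected}
Actual: {actual}
Return a single number from 0 to 1 for correctness."""

async def judge_score(expected: str, actual: str) -> dict:
    # call_judge_model is a placeholder for your LLM client call.
    raw = await call_judge_model(JUDGE_PROMPT.format(expected=expected, actual=actual))
    return {
        "score": float(raw.strip()),
        "judge_prompt_version": JUDGE_PROMPT_VERSION,  # recorded alongside the score
    }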
Promotion checks
- Candidate model beats or matches baseline on critical cases.
- Cost and latency stay inside thresholds.
- Failures link to traces for inspection.
- Known production failures are included in the dataset.
- CI blocks the release when the score drops below the threshold.
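A sketch of what that gate can look like as a CI step; the thresholds, field names, and candidate_summary/baseline_summary inputs are illustrative and would come from your aggregation step:

# Hypothetical release gate: thresholds are placeholders, not recommendations.
SCORE_THRESHOLD = 0.95      # minimum mean score on critical cases
MAX_COST_RATIO = 1.2        # candidate may cost at most 20% more than the baseline
MAX_P95_LATENCY_MS = 2000

def promote(candidate: dict, baseline: dict) -> bool:
    # Candidate must match or beat the baseline on critical cases...
    if candidate["critical_score"] < baseline["critical_score"]:
        return False
    # ...clear the absolute score threshold that CI enforces...
    if candidate["critical_score"] < SCORE_THRESHOLD:
        return False
    # ...and stay inside the cost and latency budgets.
    if candidate["cost_per_case"] > baseline["cost_per_case"] * MAX_COST_RATIO:
        return False
    if candidate["p95_latency_ms"] > MAX_P95_LATENCY_MS:
        return False
    return True

if not promote(candidate_summary, baseline_summary):
    raise SystemExit("Eval gate failed: release blocked")  # non-zero exit blocks CI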