Scorers

Score component outputs with built-in deterministic checks, trace assertions, LLM-as-judge presets, or your own custom scorer code.

A scorer decides whether a component’s output meets the target behavior. Each scorer returns a score between 0.0 and 1.0, a pass/fail verdict, and an optional explanation. Experiments attach one or more scorers and run them against every dataset item.

Scorer classes

AGNT5 has exactly three scorer classes:

Class	Examples	Needs deployment?
Built-in deterministic	`exact_match`, `json_schema`, `tool_called`	No — runs as AGNT5-owned logic
Built-in LLM-as-judge	`llm_judge`, `correctness`, `faithfulness`	No — AGNT5-owned, configurable model and rubric
Custom	Your `@scorer` functions	Yes — registered and deployed with your worker

Built-ins work out of the box: select them in Studio or pass them to `agnt5 experiments create --builtin-scorer <name>`. Only custom scorers require your own code.

Scorers also differ by what they evaluate:

Output scorers compare a single item’s output against its input and expected output.
Trace scorers assert on execution behavior — which tools were called, how many LLM calls were made, how long the run took. They need trace events, which dataset items carry when imported from production runs.

Built-in deterministic scorers

Output scorers:

Scorer	Checks
`exact_match`	Output equals the expected output exactly
`contains`	Output contains a substring
`regex_match`	Output matches a regular expression
`json_valid`	Output is well-formed JSON
`json_schema`	Output validates against a JSON Schema
`numeric_range`	Numeric output falls within a range
`levenshtein`	Output is similar to expected by edit distance
`structured_assertions`	Configured assertions over input, output, and expected JSON
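
To make the `levenshtein` row concrete, here is a minimal sketch of how an edit-distance similarity can be mapped onto the 0.0 to 1.0 score range. It is an illustration only: the function names and the normalization by the longer string's length are assumptions, not AGNT5's actual implementation.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(output: str, expected: str) -> float:
    """Map edit distance to [0.0, 1.0]; 1.0 means the strings are identical.

    ASSUMPTION: normalizing by the longer string is one common choice,
    not necessarily what the built-in scorer does.
    """
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(output, expected) / longest
```

A pass/fail verdict would then come from comparing the similarity to a threshold; where that threshold lives in the built-in's configuration is not specified here.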

Trace scorers:

Scorer	Checks
`tool_called` / `tool_not_called`	A named tool was (or was not) called
`tool_sequence` / `tool_sequence_in_order`	Tools were called in the configured order
`tool_sequence_exact`	The tool trajectory matches exactly
`tool_sequence_any_order`	Configured tools all appear, in any order
`tool_trajectory`	Tools match a selected trajectory pattern
`tool_params_match`	Tool-call arguments match configured parameters
`max_tool_calls` / `max_llm_calls`	Total tool or LLM calls stay under a limit
`max_tokens`	Total LLM tokens stay under a budget
`duration_under`	Session duration stays under a limit
`no_errors`	The execution produced no errors
`state_equals`	A named state snapshot equals an expected value
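
The three `tool_sequence_*` variants differ only in how strictly they compare the observed tool calls with the configured list. The sketch below shows one plausible reading of each over a plain list of tool names. The function names are hypothetical, and treating "in order" as a subsequence match (other calls may interleave) is an assumption rather than documented behavior.

```python
from typing import Sequence


def in_order(called: Sequence[str], expected: Sequence[str]) -> bool:
    """True if `expected` appears within `called` in the same relative order.

    ASSUMPTION: unrelated calls may sit between the expected ones.
    """
    remaining = iter(called)
    # `tool in remaining` advances the iterator, so each match must come
    # after the previous one.
    return all(tool in remaining for tool in expected)


def exact(called: Sequence[str], expected: Sequence[str]) -> bool:
    """True only if the trajectories are identical, call for call."""
    return list(called) == list(expected)


def any_order(called: Sequence[str], expected: Sequence[str]) -> bool:
    """True if every expected tool was called at least once, in any order."""
    return set(expected) <= set(called)


# Hypothetical trajectory; only search_orders appears elsewhere in this page.
called = ["lookup_customer", "search_orders", "send_reply"]
print(in_order(called, ["lookup_customer", "send_reply"]))
print(exact(called, ["lookup_customer", "send_reply"]))
print(any_order(called, ["send_reply", "lookup_customer"]))
```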

Pass a bare name for default behavior, or a JSON object for configuration:

agnt5 experiments create \
  --name support-agent-quality \
  --dataset-id <dataset-id> \
  --dataset-version-id <dataset-version-id> \
  --deployment-id <deployment-id> \
  --component-name support_agent \
  --component-type agent \
  --builtin-scorer json_valid \
  --builtin-scorer '{"name":"tool_called","config":{"tool":"search_orders"}}' \
  --builtin-scorer '{"name":"max_llm_calls","config":{"max":5}}'
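
When the command is generated from a script, hand-writing the JSON is error-prone. The following sketch builds the same three `--builtin-scorer` values with the standard library and quotes them for a POSIX shell. The helper name `builtin_scorer_arg` is hypothetical; the `name` and `config` keys mirror the command above.

```python
import json
import shlex
from typing import Any, Optional


def builtin_scorer_arg(name: str, config: Optional[dict[str, Any]] = None) -> str:
    """Return a bare scorer name, or a compact JSON object when configured."""
    if config is None:
        return name
    return json.dumps({"name": name, "config": config}, separators=(",", ":"))


specs = [
    builtin_scorer_arg("json_valid"),
    builtin_scorer_arg("tool_called", {"tool": "search_orders"}),
    builtin_scorer_arg("max_llm_calls", {"max": 5}),
]

# shlex.quote wraps values containing shell-special characters in single
# quotes and leaves plain names such as json_valid untouched.
flags = " ".join(f"--builtin-scorer {shlex.quote(spec)}" for spec in specs)
print(flags)
```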

Built-in LLM-as-judge scorers

LLM-as-judge scorers use a language model to grade outputs against criteria. All three are AGNT5-owned — no registration or deployment needed — and accept overridable judge settings: model, provider, prompt, and rubric.

llm_judge: generic judge. You supply the criteria.
correctness: managed preset that grades the output against the item’s input and expected output.
faithfulness: managed preset that grades whether the output stays faithful to configured context fields.

Note: Judge scorers call an LLM provider, so the project needs the provider credential configured (for example OPENAI_API_KEY as a project secret).

Custom scorers

When built-ins can’t express your check, write a custom scorer — user code that receives the eval context and returns a result. Custom scorers are components: they register with your worker and deploy with your code.

Python:

from agnt5.eval import scorer, EvalContext, ScorerResult

@scorer(name="cites_order_id", description="Reply must cite the order ID from the input")
def cites_order_id(ctx: EvalContext) -> ScorerResult:
    order_id = str(ctx.input.get("order_id", ""))
    # An empty ID is a substring of every string, so require a non-empty ID.
    cited = bool(order_id) and order_id in str(ctx.output)
    return ScorerResult(
        score=1.0 if cited else 0.0,
        passed=cited,
        explanation=f"Order ID {order_id} {'found' if cited else 'missing'} in reply",
    )

The EvalContext carries input, output, expected, run_id, trace_id, and events. The events field holds trace events for trace-level scorers, which you declare with scope="trace".
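
As a rough illustration of a trace-level check that reads events, the sketch below fails a run when any single tool is called more than a fixed number of times. It runs without the SDK by using local stand-ins for EvalContext and ScorerResult. Several details are assumptions: the event shape (dicts with "type" and "tool" keys), the scorer name, the threshold, and the exact way scope="trace" is passed to the decorator.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Optional


# Local stand-ins so the sketch runs on its own; in a worker you would
# import EvalContext and ScorerResult from agnt5.eval instead.
@dataclass
class EvalContext:
    input: Any = None
    output: Any = None
    expected: Any = None
    run_id: str = ""
    trace_id: str = ""
    events: list = field(default_factory=list)


@dataclass
class ScorerResult:
    score: float
    passed: bool
    explanation: Optional[str] = None


MAX_REPEATS = 2  # hypothetical threshold


# In a worker this function would presumably be decorated along the lines of
# @scorer(name="no_tool_retry_storm", scope="trace"); that form is assumed.
def no_tool_retry_storm(ctx: EvalContext) -> ScorerResult:
    # ASSUMPTION: each event is a dict with "type" and "tool" keys.
    calls = Counter(
        event["tool"]
        for event in ctx.events
        if event.get("type") == "tool_call" and "tool" in event
    )
    worst = max(calls.values(), default=0)
    passed = worst <= MAX_REPEATS
    return ScorerResult(
        score=1.0 if passed else 0.0,
        passed=passed,
        explanation=f"Most-called tool ran {worst} time(s); limit is {MAX_REPEATS}",
    )
```

Check the real event schema in your own traces before relying on field names like these.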

TypeScript:

import { scorer, ScorerResult } from "@agnt5/sdk";

const citesOrderId = scorer("cites_order_id", "Reply must cite the order ID from the input")(
  async (ctx, request) => {
    const orderId = (request.input as { order_id?: string }).order_id ?? "";
    // An empty ID is a substring of every string, so require a non-empty ID.
    const cited = orderId !== "" && String(request.output).includes(orderId);
    return new ScorerResult({ score: cited ? 1 : 0, passed: cited });
  },
);

Custom scorers register with the worker like any other component — auto-registration picks up decorated scorers, or pass them explicitly via the worker’s scorers list. After deploying, the scorer appears in Studio under Evaluate -> Scorers. Attach it to an experiment by ID:

agnt5 experiments create ... --scorer-id <scorer-id>

Inspect scores

Every scorer execution produces a score record with evidence: the inputs the scorer saw and the reasoning behind its verdict.

# List scores for a run or experiment subject
agnt5 scores list

# Show evidence for one score
agnt5 scores evidence <score-id>

In Studio, open Evaluate -> Experiments, select a run, and click into any item to see its per-scorer results and evidence.

Next steps

Experiments: attach scorers to an experiment and run them against a dataset version.
Datasets: curate the items your scorers grade, including trace events for trace scorers.
Agents: structure agent tool use so trace assertions have meaningful events to check.