Scorers
Score component outputs with built-in deterministic checks, trace assertions, LLM-as-judge presets, or your own custom scorer code.
A scorer decides whether a component’s output meets the target behavior. Each scorer returns a score between 0.0 and 1.0, a pass/fail verdict, and an optional explanation. Experiments attach one or more scorers and run them against every dataset item.
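
To make the result shape concrete, here is a minimal, SDK-free sketch of running several scorers over dataset items. `Result`, `run_scorers`, and the two toy scorers are hypothetical stand-ins rather than AGNT5 types; only the three result fields (score, pass/fail verdict, optional explanation) come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    """Hypothetical stand-in for a scorer result (not the AGNT5 type)."""
    score: float                       # between 0.0 and 1.0
    passed: bool                       # pass/fail verdict
    explanation: Optional[str] = None  # optional explanation


# In this sketch a scorer is any function of (input, output, expected) -> Result.
Scorer = Callable[[dict, str, str], Result]


def exact(inp: dict, out: str, exp: str) -> Result:
    ok = out == exp
    return Result(1.0 if ok else 0.0, ok)


def non_empty(inp: dict, out: str, exp: str) -> Result:
    ok = bool(out.strip())
    return Result(1.0 if ok else 0.0, ok, None if ok else "output is empty")


def run_scorers(items: list[dict], scorers: dict[str, Scorer]) -> dict[str, float]:
    """Run every scorer against every item and return the pass rate per scorer."""
    passes = {name: 0 for name in scorers}
    for item in items:
        for name, fn in scorers.items():
            result = fn(item["input"], item["output"], item["expected"])
            passes[name] += int(result.passed)
    total = len(items) or 1  # avoid dividing by zero on an empty dataset
    return {name: count / total for name, count in passes.items()}


items = [
    {"input": {}, "output": "42", "expected": "42"},
    {"input": {}, "output": "41", "expected": "42"},
]
rates = run_scorers(items, {"exact": exact, "non_empty": non_empty})
print(rates)  # {'exact': 0.5, 'non_empty': 1.0}
```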
Scorer classes
AGNT5 has exactly three scorer classes:
| Class | Examples | Needs deployment? |
|---|---|---|
| Built-in deterministic | exact_match, json_schema, tool_called | No — runs as AGNT5-owned logic |
| Built-in LLM-as-judge | llm_judge, correctness, faithfulness | No — AGNT5-owned, configurable model and rubric |
| Custom | Your @scorer functions | Yes — registered and deployed with your worker |
Built-ins work out of the box: select them in Studio or pass them to agnt5 experiments create --builtin-scorer <name>. Only custom scorers require your own code.
Scorers also differ by what they evaluate:
- Output scorers compare a single item’s output against its input and expected output.
- Trace scorers assert on execution behavior — which tools were called, how many LLM calls were made, how long the run took. They need trace events, which dataset items carry when imported from production runs.
Built-in deterministic scorers
Output scorers:
| Scorer | Checks |
|---|---|
| exact_match | Output equals the expected output exactly |
| contains | Output contains a substring |
| regex_match | Output matches a regular expression |
| json_valid | Output is well-formed JSON |
| json_schema | Output validates against a JSON Schema |
| numeric_range | Numeric output falls within a range |
| levenshtein | Output is similar to expected by edit distance |
| structured_assertions | Configured assertions over input, output, and expected JSON |
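
The `levenshtein` row is the least self-explanatory of these checks. The sketch below shows one common way to turn edit distance into a 0.0 to 1.0 similarity. Normalizing by the longer string's length and the `threshold` value are assumptions made for illustration, not AGNT5's documented behavior.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def similarity(output: str, expected: str) -> float:
    """Map edit distance to [0.0, 1.0]; 1.0 means identical strings."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - edit_distance(output, expected) / longest


# Hypothetical pass threshold, chosen only for this example.
threshold = 0.8
score = similarity("kitten", "sitting")
print(round(score, 3), score >= threshold)  # 0.571 False
```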
Trace scorers:
| Scorer | Checks |
|---|---|
| tool_called / tool_not_called | A named tool was (or was not) called |
| tool_sequence / tool_sequence_in_order | Tools were called in the configured order |
| tool_sequence_exact | The tool trajectory matches exactly |
| tool_sequence_any_order | Configured tools all appear, in any order |
| tool_trajectory | Tools match a selected trajectory pattern |
| tool_params_match | Tool-call arguments match configured parameters |
| max_tool_calls / max_llm_calls | Total tool or LLM calls stay under a limit |
| max_tokens | Total LLM tokens stay under a budget |
| duration_under | Session duration stays under a limit |
| no_errors | The execution produced no errors |
| state_equals | A named state snapshot equals an expected value |
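
The sequence scorers differ mainly in how strictly they compare the configured tools with the tools actually called. The sketch below spells out one plausible reading of the three variants over a plain list of tool names. The function names, the list-of-names representation, and every tool name except `search_orders` are illustrative assumptions; the real scorers operate on trace events.

```python
def in_order(called: list[str], expected: list[str]) -> bool:
    """Expected tools appear in order; other calls may be interleaved."""
    remaining = iter(called)
    # `tool in remaining` consumes the iterator up to the first match,
    # so each expected tool must occur after the previous one.
    return all(tool in remaining for tool in expected)


def exact_sequence(called: list[str], expected: list[str]) -> bool:
    """Same tools, same order, nothing extra."""
    return called == expected


def any_order(called: list[str], expected: list[str]) -> bool:
    """Every expected tool appears at least once, in any order."""
    return set(expected) <= set(called)


called = ["lookup_customer", "search_orders", "send_reply"]
expected = ["search_orders", "send_reply"]

print(in_order(called, expected))                          # True
print(exact_sequence(called, expected))                    # False
print(any_order(called, ["send_reply", "search_orders"]))  # True
```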
Pass a bare name for default behavior, or a JSON object for configuration:
agnt5 experiments create \
  --name support-agent-quality \
  --dataset-id <dataset-id> \
  --dataset-version-id <dataset-version-id> \
  --deployment-id <deployment-id> \
  --component-name support_agent \
  --component-type agent \
  --builtin-scorer json_valid \
  --builtin-scorer '{"name":"tool_called","config":{"tool":"search_orders"}}' \
  --builtin-scorer '{"name":"max_llm_calls","config":{"max":5}}'
Built-in LLM-as-judge scorers
LLM-as-judge scorers use a language model to grade outputs against criteria. All three are AGNT5-owned — no registration or deployment needed — and accept overridable judge settings: model, provider, prompt, and rubric.
- llm_judge: generic judge. You supply the criteria.
- correctness: managed preset that grades the output against the item’s input and expected output.
- faithfulness: managed preset that grades whether the output stays faithful to configured context fields.
Note: Judge scorers call an LLM provider, so the project needs the provider credential configured (for example OPENAI_API_KEY as a project secret).
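
The `--builtin-scorer` flag takes either a bare name or a JSON object, and judge settings are passed as configuration in the same way. The helper below is a small sketch for generating those flag values from Python. `builtin_scorer_arg` is a hypothetical helper, and the `model` and `rubric` keys inside `config` are assumed spellings of the judge settings listed above, so check them against the CLI before relying on them.

```python
import json
import shlex
from typing import Optional


def builtin_scorer_arg(name: str, config: Optional[dict] = None) -> str:
    """Return the value for one --builtin-scorer flag.

    A bare name selects default behavior; a config dict is encoded as a
    {"name": ..., "config": {...}} JSON object.
    """
    if not config:
        return name
    return json.dumps({"name": name, "config": config}, separators=(",", ":"))


args = [
    builtin_scorer_arg("json_valid"),
    builtin_scorer_arg("max_llm_calls", {"max": 5}),
    # Assumed config keys for a judge scorer; the values are placeholders.
    builtin_scorer_arg(
        "llm_judge",
        {"model": "<model>", "rubric": "Reply is polite and cites the order ID"},
    ),
]

# shlex.quote makes each value safe to paste into a POSIX shell command.
flags = " ".join(f"--builtin-scorer {shlex.quote(a)}" for a in args)
print(flags)
```

Generating the JSON with `json.dumps` avoids hand-escaping quotes when a rubric or prompt contains punctuation.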
Custom scorers
When built-ins can’t express your check, write a custom scorer — user code that receives the eval context and returns a result. Custom scorers are components: they register with your worker and deploy with your code.
Python:
from agnt5.eval import scorer, EvalContext, ScorerResult

@scorer(name="cites_order_id", description="Reply must cite the order ID from the input")
def cites_order_id(ctx: EvalContext) -> ScorerResult:
    order_id = str(ctx.input.get("order_id", ""))
    # An empty ID is a substring of every string, so require a non-empty one.
    cited = bool(order_id) and order_id in str(ctx.output)
    return ScorerResult(
        score=1.0 if cited else 0.0,
        passed=cited,
        explanation=f"Order ID {order_id} {'found' if cited else 'missing'} in reply",
    )

The EvalContext carries input, output, expected, run_id, trace_id, and events (trace events for trace-level scorers, declared with scope="trace").
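
The check inside `cites_order_id` is plain Python, so it can help to keep it in a helper that is testable without the SDK or a deployment. The sketch below is one way to do that. `order_id_cited` and the sample order ID are hypothetical, and treating a missing or empty order ID as a failure is a design choice rather than something the SDK requires.

```python
def order_id_cited(input_data: dict, output: object) -> tuple[bool, str]:
    """Return (passed, explanation) for the 'reply cites the order ID' check."""
    order_id = str(input_data.get("order_id", ""))
    if not order_id:
        # An empty string is a substring of everything, so fail explicitly.
        return False, "No order ID in input"
    cited = order_id in str(output)
    return cited, f"Order ID {order_id} {'found' if cited else 'missing'} in reply"


# Quick local checks before wiring the helper into a @scorer function.
print(order_id_cited({"order_id": "A-1001"}, "Your order A-1001 has shipped."))
print(order_id_cited({"order_id": "A-1001"}, "Your order has shipped."))
print(order_id_cited({}, "Your order has shipped."))
```

Inside the decorated scorer, the two return values would map onto `passed` (with a `score` of 1.0 or 0.0) and `explanation`.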
TypeScript:
import { scorer, ScorerResult } from "@agnt5/sdk";

const citesOrderId = scorer("cites_order_id", "Reply must cite the order ID from the input")(
  async (ctx, request) => {
    const orderId = (request.input as { order_id?: string }).order_id ?? "";
    // includes("") is always true, so require a non-empty ID.
    const cited = orderId !== "" && String(request.output).includes(orderId);
    return new ScorerResult({ score: cited ? 1 : 0, passed: cited });
  },
);

Custom scorers register with the worker like any other component: auto-registration picks up decorated scorers, or you can pass them explicitly via the worker’s scorers list. After deploying, the scorer appears in Studio under Evaluate -> Scorers. Attach it to an experiment by ID:
agnt5 experiments create ... --scorer-id <scorer-id>
Inspect scores
Every scorer execution produces a score record with evidence — the inputs the scorer saw and why it decided what it decided.
# List scores for a run or experiment subject
agnt5 scores list
# Show evidence for one score
agnt5 scores evidence <score-id>

In Studio, open Evaluate -> Experiments, select a run, and click into any item to see its per-scorer results and evidence.
Next steps
- Experiments: attach scorers to an experiment and run them against a dataset version.
- Datasets: curate the items your scorers grade, including trace events for trace scorers.
- Agents: structure agent tool use so trace assertions have meaningful events to check.