
Scorers

Score component outputs with built-in deterministic checks, trace assertions, LLM-as-judge presets, or your own custom scorer code.

A scorer decides whether a component’s output meets the target behavior. Each scorer returns a score between 0.0 and 1.0, a pass/fail verdict, and an optional explanation. Experiments attach one or more scorers and run them against every dataset item.


Scorer classes

AGNT5 has exactly three scorer classes:

| Class | Examples | Needs deployment? |
| --- | --- | --- |
| Built-in deterministic | exact_match, json_schema, tool_called | No (runs as AGNT5-owned logic) |
| Built-in LLM-as-judge | llm_judge, correctness, faithfulness | No (AGNT5-owned, configurable model and rubric) |
| Custom | Your @scorer functions | Yes (registered and deployed with your worker) |

Built-ins work out of the box: select them in Studio or pass them to agnt5 experiments create --builtin-scorer <name>. Only custom scorers require your own code.

Scorers also differ by what they evaluate:

  • Output scorers compare a single item’s output against its input and expected output.
  • Trace scorers assert on execution behavior — which tools were called, how many LLM calls were made, how long the run took. They need trace events, which dataset items carry when imported from production runs.

Built-in deterministic scorers

Output scorers:

| Scorer | Checks |
| --- | --- |
| exact_match | Output equals the expected output exactly |
| contains | Output contains a substring |
| regex_match | Output matches a regular expression |
| json_valid | Output is well-formed JSON |
| json_schema | Output validates against a JSON Schema |
| numeric_range | Numeric output falls within a range |
| levenshtein | Output is similar to expected by edit distance |
| structured_assertions | Configured assertions over input, output, and expected JSON |
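
To make the levenshtein row concrete, here is a minimal sketch of how an edit distance can be mapped onto the 0.0 to 1.0 score range. It is not AGNT5's implementation: the normalization (1 minus distance divided by the longer string's length) and the 0.8 pass threshold are assumptions chosen for illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity_score(output: str, expected: str) -> float:
    """Map edit distance to [0.0, 1.0]; 1.0 means the strings are identical."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(output, expected) / longest


def passes(output: str, expected: str, threshold: float = 0.8) -> bool:
    """Hypothetical verdict rule: pass when similarity reaches the threshold."""
    return similarity_score(output, expected) >= threshold
```

A near-miss such as a missing trailing period scores close to 1.0 under this scheme, whereas exact_match would fail it outright.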

Trace scorers:

| Scorer | Checks |
| --- | --- |
| tool_called / tool_not_called | A named tool was (or was not) called |
| tool_sequence / tool_sequence_in_order | Tools were called in the configured order |
| tool_sequence_exact | The tool trajectory matches exactly |
| tool_sequence_any_order | Configured tools all appear, in any order |
| tool_trajectory | Tools match a selected trajectory pattern |
| tool_params_match | Tool-call arguments match configured parameters |
| max_tool_calls / max_llm_calls | Total tool or LLM calls stay under a limit |
| max_tokens | Total LLM tokens stay under a budget |
| duration_under | Session duration stays under a limit |
| no_errors | The execution produced no errors |
| state_equals | A named state snapshot equals an expected value |
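
The three sequence scorers differ only in how strictly they compare the observed tool calls with the configured list. The sketch below illustrates one plausible reading of each, using plain lists of tool names. It is an assumption, not AGNT5's code: in particular, whether the in-order check tolerates extra calls between the configured tools is not stated here, and the tool names other than search_orders are hypothetical.

```python
from typing import Sequence


def in_order(actual: Sequence[str], expected: Sequence[str]) -> bool:
    """True if `expected` appears in `actual` as a subsequence (gaps allowed)."""
    remaining = iter(actual)
    # `name in remaining` consumes the iterator up to the first match,
    # so each later name must be found after the earlier ones.
    return all(name in remaining for name in expected)


def exact(actual: Sequence[str], expected: Sequence[str]) -> bool:
    """True only if the full trajectory matches, with no extra calls."""
    return list(actual) == list(expected)


def any_order(actual: Sequence[str], expected: Sequence[str]) -> bool:
    """True if every configured tool was called at least once."""
    return set(expected) <= set(actual)


# Hypothetical trajectory extracted from one run's trace events.
calls = ["lookup_customer", "search_orders", "issue_refund"]

print(in_order(calls, ["search_orders", "issue_refund"]))   # gaps before are fine
print(exact(calls, ["search_orders", "issue_refund"]))      # extra call breaks it
print(any_order(calls, ["issue_refund", "search_orders"]))  # order is ignored
```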

Pass a bare name for default behavior, or a JSON object for configuration:

agnt5 experiments create \
  --name support-agent-quality \
  --dataset-id <dataset-id> \
  --dataset-version-id <dataset-version-id> \
  --deployment-id <deployment-id> \
  --component-name support_agent \
  --component-type agent \
  --builtin-scorer json_valid \
  --builtin-scorer '{"name":"tool_called","config":{"tool":"search_orders"}}' \
  --builtin-scorer '{"name":"max_llm_calls","config":{"max":5}}'
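
Hand-writing JSON inside shell quotes is easy to get wrong, so when experiments are created from a script it can help to generate the flag values. The following is a small sketch using only the Python standard library; the helper name builtin_scorer_arg is hypothetical, and the scorer names and config keys are the ones from the command above.

```python
import json
import shlex
from typing import Optional


def builtin_scorer_arg(name: str, config: Optional[dict] = None) -> str:
    """Return the value for one --builtin-scorer flag."""
    if not config:
        return name  # bare name: default behavior
    # Compact separators mirror the style used in the command above.
    return json.dumps({"name": name, "config": config}, separators=(",", ":"))


scorers = [
    builtin_scorer_arg("json_valid"),
    builtin_scorer_arg("tool_called", {"tool": "search_orders"}),
    builtin_scorer_arg("max_llm_calls", {"max": 5}),
]

# shlex.quote wraps each JSON value in single quotes for a POSIX shell.
flags = " ".join(f"--builtin-scorer {shlex.quote(s)}" for s in scorers)
print(flags)
```

The printed string can be appended to the rest of the agnt5 experiments create arguments.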

Built-in LLM-as-judge scorers

LLM-as-judge scorers use a language model to grade outputs against criteria. All three are AGNT5-owned — no registration or deployment needed — and accept overridable judge settings: model, provider, prompt, and rubric.

  • llm_judge: generic judge. You supply the criteria.
  • correctness: managed preset that grades the output against the item’s input and expected output.
  • faithfulness: managed preset that grades whether the output stays faithful to configured context fields.

Note: Judge scorers call an LLM provider, so the project needs the provider credential configured (for example OPENAI_API_KEY as a project secret).


Custom scorers

When built-ins can’t express your check, write a custom scorer — user code that receives the eval context and returns a result. Custom scorers are components: they register with your worker and deploy with your code.

Python:

from agnt5.eval import scorer, EvalContext, ScorerResult

@scorer(name="cites_order_id", description="Reply must cite the order ID from the input")
def cites_order_id(ctx: EvalContext) -> ScorerResult:
    order_id = ctx.input.get("order_id", "")
    # An empty ID is a substring of every string, so require a non-empty ID.
    cited = bool(order_id) and order_id in str(ctx.output)
    return ScorerResult(
        score=1.0 if cited else 0.0,
        passed=cited,
        explanation=f"Order ID {order_id} {'found' if cited else 'missing'} in reply",
    )

The EvalContext carries input, output, expected, run_id, trace_id, and events. The events field holds trace events and is intended for trace-level scorers, which are declared with scope="trace".
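
A trace-level scorer follows the same pattern but reads the events field instead of the output. The sketch below uses stand-in dataclasses rather than the real agnt5.eval types so that it runs on its own. The event shape (dicts with "type" and "name" keys) and the tool name issue_refund are assumptions, so check the actual event structure before adapting it. In a worker, the function would take the real EvalContext and be decorated with @scorer using scope="trace".

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FakeScorerResult:  # stand-in for agnt5.eval.ScorerResult
    score: float
    passed: bool
    explanation: Optional[str] = None


@dataclass
class FakeEvalContext:  # stand-in for agnt5.eval.EvalContext
    input: Any = None
    output: Any = None
    expected: Any = None
    events: list = field(default_factory=list)


def refund_called_at_most_once(ctx: FakeEvalContext) -> FakeScorerResult:
    """Fail the run if the (hypothetical) issue_refund tool ran more than once."""
    # Assumed event shape: {"type": "tool_call", "name": "<tool name>"}
    calls = [
        event for event in ctx.events
        if event.get("type") == "tool_call" and event.get("name") == "issue_refund"
    ]
    ok = len(calls) <= 1
    return FakeScorerResult(
        score=1.0 if ok else 0.0,
        passed=ok,
        explanation=f"issue_refund called {len(calls)} time(s)",
    )
```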

TypeScript:

import { scorer, ScorerResult } from "@agnt5/sdk";

const citesOrderId = scorer("cites_order_id", "Reply must cite the order ID from the input")(
  async (ctx, request) => {
    const orderId = (request.input as { order_id?: string }).order_id ?? "";
    // includes("") is always true, so require a non-empty ID before matching.
    const cited = orderId !== "" && String(request.output).includes(orderId);
    return new ScorerResult({ score: cited ? 1 : 0, passed: cited });
  },
);

Custom scorers register with the worker like any other component — auto-registration picks up decorated scorers, or pass them explicitly via the worker’s scorers list. After deploying, the scorer appears in Studio under Evaluate -> Scorers. Attach it to an experiment by ID:

agnt5 experiments create ... --scorer-id <scorer-id>

Inspect scores

Every scorer execution produces a score record with evidence: the inputs the scorer saw and the reasoning behind its verdict.

# List scores for a run or experiment subject
agnt5 scores list

# Show evidence for one score
agnt5 scores evidence <score-id>

In Studio, open Evaluate -> Experiments, select a run, and click into any item to see its per-scorer results and evidence.

Next steps

  • Experiments: attach scorers to an experiment and run them against a dataset version.
  • Datasets: curate the items your scorers grade, including trace events for trace scorers.
  • Agents: structure agent tool use so trace assertions have meaningful events to check.
© 2026 AGNT5