> For the complete documentation index, see [llms.txt](/llms.txt).
> A full single-fetch corpus is available at [llms-full.txt](/llms-full.txt).
---
title: Scorers
description: Score component outputs with built-in deterministic checks, trace assertions, LLM-as-judge presets, or your own custom scorer code.
last_verified: 2026-06-07
---

A **scorer** decides whether a component's output meets the target behavior. Each scorer returns a score between 0.0 and 1.0, a pass/fail verdict, and an optional explanation. [Experiments](/docs/improve/experiments.md) attach one or more scorers and run them against every [dataset](/docs/improve/datasets.md) item.

---

## Scorer classes

AGNT5 has exactly three scorer classes:

| Class | Examples | Needs deployment? |
|---|---|---|
| **Built-in deterministic** | `exact_match`, `json_schema`, `tool_called` | No — runs as AGNT5-owned logic |
| **Built-in LLM-as-judge** | `llm_judge`, `correctness`, `faithfulness` | No — AGNT5-owned, configurable model and rubric |
| **Custom** | Your `@scorer` functions | Yes — registered and deployed with your worker |

Built-ins work out of the box: select them in Studio or pass them to `agnt5 experiments create --builtin-scorer <name>`. Only custom scorers require your own code.

Scorers also differ by what they evaluate:

- **Output scorers** compare a single item's output against its input and expected output.
- **Trace scorers** assert on execution behavior — which tools were called, how many LLM calls were made, how long the run took. They need trace events, which dataset items carry when [imported from production runs](/docs/improve/datasets.md#from-production-runs).

---

## Built-in deterministic scorers

Output scorers:

| Scorer | Checks |
|---|---|
| `exact_match` | Output equals the expected output exactly |
| `contains` | Output contains a substring |
| `regex_match` | Output matches a regular expression |
| `json_valid` | Output is well-formed JSON |
| `json_schema` | Output validates against a JSON Schema |
| `numeric_range` | Numeric output falls within a range |
| `levenshtein` | Output is similar to expected by edit distance |
| `structured_assertions` | Configured assertions over input, output, and expected JSON |
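
To make the `levenshtein` row concrete, here is a minimal sketch of an edit-distance similarity in plain Python. The normalization (1 minus the distance divided by the longer length) and the function names are assumptions for illustration; this page does not specify how AGNT5 computes or thresholds the score.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using a rolling one-row DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(output: str, expected: str) -> float:
    """Map edit distance onto 0.0-1.0 (assumed normalization)."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - edit_distance(output, expected) / longest
```

Under this normalization, identical strings score 1.0 and strings with no characters in common at any position score 0.0.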

Trace scorers:

| Scorer | Checks |
|---|---|
| `tool_called` / `tool_not_called` | A named tool was (or was not) called |
| `tool_sequence` / `tool_sequence_in_order` | Tools were called in the configured order |
| `tool_sequence_exact` | The tool trajectory matches exactly |
| `tool_sequence_any_order` | Configured tools all appear, in any order |
| `tool_trajectory` | Tools match a selected trajectory pattern |
| `tool_params_match` | Tool-call arguments match configured parameters |
| `max_tool_calls` / `max_llm_calls` | Total tool or LLM calls stay under a limit |
| `max_tokens` | Total LLM tokens stay under a budget |
| `duration_under` | Session duration stays under a limit |
| `no_errors` | The execution produced no errors |
| `state_equals` | A named state snapshot equals an expected value |
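
The sequence scorers differ in how strictly they compare the observed tool calls against the configured list. The sketch below illustrates one plausible reading of the three modes over plain lists of tool names. The function names are hypothetical, and the exact matching rules AGNT5 applies (for example, how repeated calls are treated) are an assumption here.

```python
from collections import Counter


def in_order(called: list[str], expected: list[str]) -> bool:
    """Expected tools appear as a subsequence; extra calls in between are allowed."""
    remaining = iter(called)
    # `tool in remaining` advances the iterator, so order is enforced.
    return all(tool in remaining for tool in expected)


def exact(called: list[str], expected: list[str]) -> bool:
    """The trajectory matches call for call."""
    return called == expected


def any_order(called: list[str], expected: list[str]) -> bool:
    """Every expected tool appears at least as often as configured, in any order."""
    seen = Counter(called)
    return all(seen[tool] >= n for tool, n in Counter(expected).items())
```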

Pass a bare name for default behavior, or a JSON object for configuration:

```bash
agnt5 experiments create \
  --name support-agent-quality \
  --dataset-id <dataset-id> \
  --dataset-version-id <dataset-version-id> \
  --deployment-id <deployment-id> \
  --component-name support_agent \
  --component-type agent \
  --builtin-scorer json_valid \
  --builtin-scorer '{"name":"tool_called","config":{"tool":"search_orders"}}' \
  --builtin-scorer '{"name":"max_llm_calls","config":{"max":5}}'
```
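
When experiments are created from a script, serializing the configuration with `json.dumps` avoids hand-quoting mistakes. The following is a minimal sketch that builds the same three `--builtin-scorer` arguments as the command above. The helper name `builtin_scorer_args` is hypothetical, and only the scorer names and config keys already shown on this page are used.

```python
import json
import shlex


def builtin_scorer_args(scorers: list) -> list[str]:
    """Turn bare names or (name, config) pairs into --builtin-scorer flags."""
    args = []
    for entry in scorers:
        if isinstance(entry, str):
            value = entry  # bare name: default behavior
        else:
            name, config = entry
            value = json.dumps({"name": name, "config": config}, separators=(",", ":"))
        args += ["--builtin-scorer", value]
    return args


args = builtin_scorer_args([
    "json_valid",
    ("tool_called", {"tool": "search_orders"}),
    ("max_llm_calls", {"max": 5}),
])
# shlex.join quotes each argument for safe pasting into a POSIX shell.
print(shlex.join(args))
```

The resulting list can be appended to the rest of the `agnt5 experiments create` arguments and passed to `subprocess.run` without going through a shell at all.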

---

## Built-in LLM-as-judge scorers

LLM-as-judge scorers use a language model to grade outputs against criteria. All three are AGNT5-owned — no registration or deployment needed — and accept overridable judge settings: model, provider, prompt, and rubric.

- **`llm_judge`**: generic judge. You supply the criteria.
- **`correctness`**: managed preset that grades the output against the item's input and expected output.
- **`faithfulness`**: managed preset that grades whether the output stays faithful to configured context fields.

> **Note:** Judge scorers call an LLM provider, so the project needs the provider credential configured (for example `OPENAI_API_KEY` as a [project secret](/docs/run/deploying.md#secrets)).

---

## Custom scorers

When built-ins can't express your check, write a **custom scorer** — user code that receives the eval context and returns a result. Custom scorers are components: they register with your worker and deploy with your code.

Python:

```python
from agnt5.eval import scorer, EvalContext, ScorerResult

@scorer(name="cites_order_id", description="Reply must cite the order ID from the input")
def cites_order_id(ctx: EvalContext) -> ScorerResult:
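    # Coerce to a string so numeric IDs also work; a missing ID becomes "".
    order_id = str(ctx.input.get("order_id") or "")
    # "" is a substring of every string, so a missing ID must count as not cited.
    cited = bool(order_id) and order_id in str(ctx.output)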
    return ScorerResult(
        score=1.0 if cited else 0.0,
        passed=cited,
        explanation=f"Order ID {order_id} {'found' if cited else 'missing'} in reply",
    )
```

The `EvalContext` carries `input`, `output`, `expected`, `run_id`, `trace_id`, and `events`. The `events` field holds the trace events used by trace-level scorers, which are declared with `scope="trace"`.
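
A trace-level check reads `ctx.events` rather than `ctx.output`. The sketch below keeps the scoring logic in plain functions so it can be unit-tested without a worker. The event shape (dicts with `type` and `name` keys) and the `FakeContext` stand-in are assumptions for illustration, not the SDK's actual types; in a real worker the function would be decorated with `@scorer(..., scope="trace")` and return a `ScorerResult`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeContext:
    """Stand-in for EvalContext with only the fields this check reads."""
    output: object = None
    events: list = field(default_factory=list)


def count_tool_calls(ctx, tool: str) -> int:
    # Assumed event shape: {"type": "tool_call", "name": "<tool name>"}.
    return sum(
        1 for e in ctx.events
        if e.get("type") == "tool_call" and e.get("name") == tool
    )


def searched_once(ctx) -> dict:
    """Pass only if search_orders was called exactly once."""
    calls = count_tool_calls(ctx, "search_orders")
    passed = calls == 1
    return {
        "score": 1.0 if passed else 0.0,
        "passed": passed,
        "explanation": f"search_orders called {calls} time(s)",
    }
```

Separating the check from the decorator this way makes it easy to run against hand-written event lists before deploying.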

TypeScript:

```typescript
const citesOrderId = scorer("cites_order_id", "Reply must cite the order ID from the input")(
  async (ctx, request) => {
    const orderId = (request.input as { order_id?: string }).order_id ?? "";
    // includes("") is always true, so a missing ID must count as not cited.
    const cited = orderId !== "" && String(request.output).includes(orderId);
    return new ScorerResult({ score: cited ? 1 : 0, passed: cited });
  },
);
```

Custom scorers register with the worker like any other component — auto-registration picks up decorated scorers, or pass them explicitly via the worker's `scorers` list. After [deploying](/docs/run/deploying.md), the scorer appears in Studio under **Evaluate** -> **Scorers**. Attach it to an experiment by ID:

```bash
agnt5 experiments create ... --scorer-id <scorer-id>
```

---

## Inspect scores

Every scorer execution produces a **score** record with evidence — the inputs the scorer saw and why it decided what it decided.

```bash
# List scores for a run or experiment subject
agnt5 scores list

# Show evidence for one score
agnt5 scores evidence <score-id>
```

In Studio, open **Evaluate** -> **Experiments**, select a run, and click into any item to see its per-scorer results and evidence.


- **Scorer classes**: built-in deterministic (SDK-core owned, no registration), built-in LLM-as-judge (`llm_judge`, `correctness`, `faithfulness`; configurable model/prompt/rubric), custom (user components, require worker registration + deployment).
- **Built-in deterministic names**: `exact_match`, `contains`, `regex_match`, `json_valid`, `json_schema`, `numeric_range`, `levenshtein`, `structured_assertions`, `tool_called`, `tool_not_called`, `tool_sequence`, `tool_sequence_in_order`, `tool_sequence_exact`, `tool_sequence_any_order`, `tool_trajectory`, `tool_params_match`, `max_tool_calls`, `max_llm_calls`, `max_tokens`, `duration_under`, `no_errors`, `state_equals`.
- **Result shape**: `{score: 0.0–1.0, passed: bool, label?: string, explanation?: string, metadata?: object}`.
- **Code primitives**: `@scorer(name, description, scope)` decorator + `EvalContext` -> `ScorerResult` (Python); `scorer(name, description, scope)(handler)` (TypeScript).
- **Errors**: built-ins fail with typed scorer errors (`input_error`, `config_error`, `provider_error`, `auth_error`, `artifact_error`, `timeout_error`); only custom scorers can return `scorer_not_found`.


## Next steps

* [Experiments](/docs/improve/experiments.md): attach scorers to an experiment and run them against a dataset version.
* [Datasets](/docs/improve/datasets.md): curate the items your scorers grade, including trace events for trace scorers.
* [Agents](/docs/build/agents.md): structure agent tool use so trace assertions have meaningful events to check.