Improve with evals

Score your investigator agent against a custom scorer, change the prompt, and measure whether the score moved.

You ran the quickstart and saw a brief. Was it a good brief? “Looks fine” is the answer most teams give to this question. Evals turn that judgment into a measurement you can repeat.

This page walks you through writing a scorer for the quickstart workflow, running it, changing the prompt, and comparing the score. You’ll use the same pattern when you change the model, swap a tool, or add a step.

Time: about 15 minutes.

You’ll learn how to:

  • Write a custom scorer with the @scorer decorator
  • Run it against the investigate_with_review workflow with client.eval
  • Change the prompt and compare the score before and after

Prerequisites:

  • Completed the quickstart and the Build locally walkthrough.
  • agnt5 dev is running for the my-investigator project.

Step 1: Decide what “good” means

A brief is good if it has all five section headers (Answer, Evidence, Risks, Recommendation, Open questions) and at least one actual open question under the last header. That’s a structural property — score it deterministically, no LLM judge needed for the first pass.

Sketch the rule out loud:

A brief passes if every section header is present and the open-questions section has at least one bullet. Score is the fraction of sections found.

That’s a scorer.
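
Before wiring anything into the SDK, the rule fits in a few lines of plain Python. A minimal sketch (the structure_score name and the sample brief are illustrative):

REQUIRED_SECTIONS = ("Answer:", "Evidence:", "Risks:", "Recommendation:", "Open questions:")


def structure_score(brief: str) -> float:
    """Fraction of required section headers present in the brief."""
    found = sum(1 for section in REQUIRED_SECTIONS if section in brief)
    return found / len(REQUIRED_SECTIONS)


# Four of the five headers present, no "Open questions:" -> 0.8
print(structure_score("Answer: Yes.\nEvidence: ...\nRisks: ...\nRecommendation: Migrate."))

Step 2 turns this rule into a scorer the SDK can discover and run.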

Step 2: Write the scorer

Add a new file src/agnt5_quickstart/scorers.py:

import re

from agnt5.eval import scorer
from agnt5.eval.types import EvalContext, ScorerResultPy


REQUIRED_SECTIONS = ("Answer:", "Evidence:", "Risks:", "Recommendation:", "Open questions:")


@scorer(name="brief_structure")
def brief_has_required_sections(ctx: EvalContext) -> ScorerResultPy:
    """Score whether the brief contains all required sections plus at least one open question."""
    output = str(ctx.output or "")

    found = [s for s in REQUIRED_SECTIONS if s in output]

    # Require at least one "- " bullet on a line after the "Open questions:" header.
    open_qs_match = re.search(
        r"Open questions:\s*(?:\n+\s*-\s*\S.*)+",
        output,
        flags=re.MULTILINE,
    )
    has_open_questions = open_qs_match is not None

    score = len(found) / len(REQUIRED_SECTIONS)
    if not has_open_questions and score == 1.0:
        score = 0.8  # all headers present but no actual open questions

    passed = score == 1.0 and has_open_questions

    missing = [s for s in REQUIRED_SECTIONS if s not in found]
    explanation = (
        f"Found {len(found)}/{len(REQUIRED_SECTIONS)} sections. "
        f"Missing: {missing or 'none'}. "
        f"Open questions present: {has_open_questions}."
    )
    return ScorerResultPy(score=score, passed=passed, explanation=explanation)

Importing the module registers the scorer with the SDK. You can confirm registration once at the start of an eval script:

import agnt5_quickstart.scorers  # noqa: F401  — register the scorer

from agnt5.eval import list_custom_scorers
print(list_custom_scorers())  # ["brief_structure", ...]

Step 3: Run the eval

Create eval_brief.py at the project root:

import asyncio

from agnt5 import Client

import agnt5_quickstart.scorers  # noqa: F401  — register brief_structure


async def main() -> None:
    client = Client()
    result = await client.eval(
        component="investigate_with_review",
        component_type="workflow",
        input_data={"question": "Should we migrate from Redis to Valkey?"},
        scorers=["brief_structure"],
    )
    for score in result.scores:
        print(f"{score.scorer}: score={score.score:.2f} passed={score.passed}")
        print(f"  {score.explanation}")


if __name__ == "__main__":
    asyncio.run(main())

Run it:

python eval_brief.py

client.eval runs the workflow end-to-end through agnt5 dev, captures the output, and applies the scorer. The workflow still pauses at the human review step — approve in Studio to let the eval finish.

Expected output on a healthy run:

brief_structure: score=1.00 passed=True
  Found 5/5 sections. Missing: none. Open questions present: True.

Step 4: Change the prompt and re-run

Edit INVESTIGATOR_PROMPT in workflows.py and remove the line that lists the required sections:

INVESTIGATOR_PROMPT = (
    "You investigate technical and operational questions for an engineering team. "
    "Use the DeepWiki MCP tools to read documentation and ask questions about "
    "GitHub repositories — that's your primary evidence source. "
    # Removed: "Return a concise brief: answer, evidence, risks, recommendation, open questions."
)

Hot reload picks up the change. Run the eval again:

python eval_brief.py

The score drops because the model no longer knows the required structure:

brief_structure: score=0.40 passed=False
  Found 2/5 sections. Missing: ['Risks:', 'Recommendation:', 'Open questions:']. Open questions present: False.

You have a measurement. The structural prompt instruction was load-bearing — removing it cost three sections.

Restore the line. The score returns to 1.00.
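
If you want the before/after numbers in one place instead of your terminal scrollback, you can wrap the same client.eval call in a small script that logs each run under a label. A minimal sketch, here called compare_brief.py; the script name, the eval_log.txt file, and the label convention are all illustrative:

import asyncio
import sys

from agnt5 import Client

import agnt5_quickstart.scorers  # noqa: F401  (registers brief_structure)


async def main(label: str) -> None:
    client = Client()
    result = await client.eval(
        component="investigate_with_review",
        component_type="workflow",
        input_data={"question": "Should we migrate from Redis to Valkey?"},
        scorers=["brief_structure"],
    )
    with open("eval_log.txt", "a") as log:  # running record across prompt edits
        for score in result.scores:
            line = f"{label}  {score.scorer}: score={score.score:.2f} passed={score.passed}"
            print(line)
            log.write(line + "\n")


if __name__ == "__main__":
    asyncio.run(main(sys.argv[1] if len(sys.argv) > 1 else "baseline"))

Run python compare_brief.py baseline before the edit and python compare_brief.py no-structure-line after it, and eval_log.txt holds both numbers side by side.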

Step 5: Beyond structure — LLM-as-judge

Structural scoring catches format regressions. Quality regressions need a model in the loop. Add an LLM judge alongside the structural scorer:

from agnt5.eval import LLMJudge

result = await client.eval(
    component="investigate_with_review",
    component_type="workflow",
    input_data={"question": "Should we migrate from Redis to Valkey?"},
    scorers=[
        "brief_structure",
        LLMJudge(
            criteria=(
                "Does the brief separate first-party evidence (docs, source) "
                "from public commentary, and does the recommendation follow "
                "from the evidence?"
            ),
        ),
    ],
)

Run both scorers in the same eval. Treat the LLM judge’s score as a noisy signal — useful in aggregate over many cases, less reliable on any single case.
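
One way to get that aggregate is to run the same eval over a handful of questions and average per scorer. A sketch that drops into the async main from eval_brief.py, assuming you’ve assigned the LLMJudge from the snippet above to a judge variable; the question list is illustrative:

from collections import defaultdict

questions = [
    "Should we migrate from Redis to Valkey?",
    "Should we adopt OpenTelemetry for tracing?",
    "Should we pin our base images by digest?",
]

# Collect every score by scorer name, then compare the means.
by_scorer: dict[str, list[float]] = defaultdict(list)
for question in questions:
    result = await client.eval(
        component="investigate_with_review",
        component_type="workflow",
        input_data={"question": question},
        scorers=["brief_structure", judge],
    )
    for score in result.scores:
        by_scorer[score.scorer].append(score.score)

for name, values in by_scorer.items():
    print(f"{name}: mean={sum(values) / len(values):.2f} over {len(values)} runs")

Each run still pauses at the human review step, so expect one approval in Studio per question.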

What you built

You wrote a deterministic scorer, ran it against a real workflow, made a change, and watched the score move. That loop — write a scorer, eval, change, eval again, compare — is how you guard a workflow against regressions when you change a prompt, model, or tool.

What you did not write or configure:

  • A workflow runner — client.eval reuses your dev session
  • A scorer registry — the @scorer decorator handles registration
  • An LLM-judge prompt template — LLMJudge ships one, configurable

Next steps

  • Workflows — the durable-execution model that makes runs reproducible enough to score.
  • Templates — start from a workflow close to what you want to build.