Improve with evals
Score your investigator agent against a custom scorer, change the prompt, and measure whether the score moved.
You ran the quickstart and saw a brief. Was it a good brief? “Looks fine” is the answer most teams give to this question. Evals turn that judgment into a measurement you can repeat.
This page walks you through writing a scorer for the quickstart workflow, running it, changing the prompt, and comparing the score. The same pattern is what you’ll use when you change the model, swap a tool, or add a step.
Time: about 15 minutes.
You’ll learn:

- Write a custom scorer with the `@scorer` decorator
- Run it against the `investigate_with_review` workflow with `client.eval`
- Change the prompt and compare the score before and after
Prerequisites:

- Completed the quickstart and the Build locally walkthrough.
- `agnt5 dev` is running for the `my-investigator` project.
Step 1: Decide what “good” means
A brief is good if it has all four sections (Answer, Evidence, Risks, Recommendation) and at least one open question. That’s a structural property — score it deterministically, no LLM judge needed for the first pass.
Sketch the rule out loud:
A brief passes if every section header is present and the open-questions section has at least one bullet. Score is the fraction of sections found.
That’s a scorer.
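If you want to sanity-check the rule before wiring it into the SDK, a few lines of plain Python are enough. This is a standalone sketch of the rule itself, with no agnt5 imports; the sample brief text is invented for illustration:

```python
# Standalone sketch of the structural rule, standard library only.
REQUIRED = ("Answer:", "Evidence:", "Risks:", "Recommendation:", "Open questions:")

# Invented sample brief, used only to exercise the rule.
sample_brief = """Answer: Probably yes.
Evidence: Valkey tracks the Redis 7.2 API surface.
Risks: Client library lag.
Recommendation: Pilot on a non-critical cache.
Open questions:
- How fast do managed providers ship new Valkey versions?
"""

found = [s for s in REQUIRED if s in sample_brief]
print(len(found) / len(REQUIRED))  # 1.0: every section header is present
```

Step 2 packages the same rule as a scorer the SDK can register and run.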
Step 2: Write the scorer
Add a new file `src/agnt5_quickstart/scorers.py`:
```python
import re

from agnt5.eval import scorer
from agnt5.eval.types import EvalContext, ScorerResultPy

REQUIRED_SECTIONS = ("Answer:", "Evidence:", "Risks:", "Recommendation:", "Open questions:")


@scorer(name="brief_structure")
def brief_has_required_sections(ctx: EvalContext) -> ScorerResultPy:
    """Score whether the brief contains all required sections plus at least one open question."""
    output = str(ctx.output or "")
    found = [s for s in REQUIRED_SECTIONS if s in output]

    # Match "Open questions:" followed by at least one non-empty "- " bullet line.
    open_qs_match = re.search(
        r"Open questions:\s*(?:\n+\s*-\s*\S.*)+",
        output,
        flags=re.MULTILINE,
    )
    has_open_questions = open_qs_match is not None

    score = len(found) / len(REQUIRED_SECTIONS)
    if not has_open_questions and score == 1.0:
        score = 0.8  # all headers present but no actual open questions
    passed = score == 1.0 and has_open_questions

    missing = [s for s in REQUIRED_SECTIONS if s not in found]
    explanation = (
        f"Found {len(found)}/{len(REQUIRED_SECTIONS)} sections. "
        f"Missing: {missing or 'none'}. "
        f"Open questions present: {has_open_questions}."
    )
    return ScorerResultPy(score=score, passed=passed, explanation=explanation)
```

Importing the module registers the scorer with the SDK. You can confirm registration once at the start of an eval script:
```python
import agnt5_quickstart.scorers  # noqa: F401 — register the scorer

from agnt5.eval import list_custom_scorers

print(list_custom_scorers())  # ["brief_structure", ...]
```

Step 3: Run the eval
Create `eval_brief.py` at the project root:
```python
import asyncio

from agnt5 import Client

import agnt5_quickstart.scorers  # noqa: F401 — register brief_structure


async def main() -> None:
    client = Client()
    result = await client.eval(
        component="investigate_with_review",
        component_type="workflow",
        input_data={"question": "Should we migrate from Redis to Valkey?"},
        scorers=["brief_structure"],
    )
    for score in result.scores:
        print(f"{score.scorer}: score={score.score:.2f} passed={score.passed}")
        print(f" {score.explanation}")


if __name__ == "__main__":
    asyncio.run(main())
```

Run it:
```bash
python eval_brief.py
```

`client.eval` runs the workflow end-to-end through `agnt5 dev`, captures the output, and applies the scorer. The workflow still pauses at the human review step — approve in Studio to let the eval finish.
Expected output on a healthy run:
```
brief_structure: score=1.00 passed=True
 Found 5/5 sections. Missing: none. Open questions present: True.
```

Step 4: Change the prompt and re-run
Edit `INVESTIGATOR_PROMPT` in `workflows.py` and remove the line that lists the required sections:
```python
INVESTIGATOR_PROMPT = (
    "You investigate technical and operational questions for an engineering team. "
    "Use the DeepWiki MCP tools to read documentation and ask questions about "
    "GitHub repositories — that's your primary evidence source. "
    # Removed: "Return a concise brief: answer, evidence, risks, recommendation, open questions."
)
```

Hot reload picks up the change. Run the eval again:
```bash
python eval_brief.py
```

The score drops because the model no longer knows the required structure:
```
brief_structure: score=0.40 passed=False
 Found 2/5 sections. Missing: ['Risks:', 'Recommendation:', 'Open questions:']. Open questions present: False.
```

You have a measurement. The structural prompt instruction was load-bearing — removing it cost three sections.
Restore the line. The score returns to 1.00.
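A single question on a single run is a thin basis for a before/after comparison. For a steadier signal, run the same scorer over a small batch and compare averages. The sketch below reuses `client.eval` exactly as above; the question list is invented, and each run still pauses at the human review step, so expect to approve each one in Studio:

```python
import asyncio

from agnt5 import Client

import agnt5_quickstart.scorers  # noqa: F401, register brief_structure

# Hypothetical question set; swap in questions your team actually asks.
QUESTIONS = [
    "Should we migrate from Redis to Valkey?",
    "Is pgvector good enough for our semantic search?",
    "Should we pin our base images by digest?",
]


async def main() -> None:
    client = Client()
    scores = []
    for question in QUESTIONS:
        result = await client.eval(
            component="investigate_with_review",
            component_type="workflow",
            input_data={"question": question},
            scorers=["brief_structure"],
        )
        scores.extend(s.score for s in result.scores if s.scorer == "brief_structure")
    print(f"brief_structure mean over {len(scores)} runs: {sum(scores) / len(scores):.2f}")


if __name__ == "__main__":
    asyncio.run(main())
```

Run it once with the original prompt and once with the edited prompt, then compare the two means instead of two single scores.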
Step 5: Beyond structure — LLM-as-judge
Structural scoring catches format regressions. Quality regressions need a model in the loop. Add an LLM judge alongside the structural scorer:
```python
from agnt5.eval import LLMJudge

result = await client.eval(
    component="investigate_with_review",
    component_type="workflow",
    input_data={"question": "Should we migrate from Redis to Valkey?"},
    scorers=[
        "brief_structure",
        LLMJudge(
            criteria=(
                "Does the brief separate first-party evidence (docs, source) "
                "from public commentary, and does the recommendation follow "
                "from the evidence?"
            ),
        ),
    ],
)
```

Run both scorers in the same eval. Treat the LLM judge’s score as a noisy signal — useful in aggregate over many cases, less reliable on any single case.
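To lean on the judge in aggregate rather than trusting a single reading, repeat the eval and average each scorer's score. A minimal sketch, assuming it runs inside `main()` from `eval_brief.py` (so `client` is in scope and `await` is legal); the run count of 5 and the variable name `evidence_judge` are arbitrary choices for illustration:

```python
from collections import defaultdict

from agnt5.eval import LLMJudge

evidence_judge = LLMJudge(
    criteria=(
        "Does the brief separate first-party evidence (docs, source) "
        "from public commentary, and does the recommendation follow "
        "from the evidence?"
    ),
)

# Inside main(): repeat the eval and collect every scorer's score per run.
RUNS = 5  # arbitrary; more runs dampen judge noise at the cost of more approvals
totals = defaultdict(list)
for _ in range(RUNS):
    result = await client.eval(
        component="investigate_with_review",
        component_type="workflow",
        input_data={"question": "Should we migrate from Redis to Valkey?"},
        scorers=["brief_structure", evidence_judge],
    )
    for score in result.scores:
        totals[score.scorer].append(score.score)

# Report each scorer's mean across runs.
for name, values in totals.items():
    print(f"{name}: mean={sum(values) / len(values):.2f} over {len(values)} runs")
```

With a human review step in the loop, every repeat costs an approval in Studio, so keep the repeat count small while a human is approving each run.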
What you built
You wrote a deterministic scorer, ran it against a real workflow, made a change that moved the score, and saw the score move. That loop — write a scorer, eval, change, eval again, compare — is how you guard a workflow against regressions when you change a prompt, model, or tool.
What you did not write or configure:

- A workflow runner — `client.eval` reuses your dev session
- A scorer registry — the `@scorer` decorator handles registration
- An LLM-judge prompt template — `LLMJudge` ships one, configurable