Improve

Turn observation into structured improvement — evaluate, iterate, ship

Most teams shipping agents get stuck in the same place. The first version works well enough on the cases they thought of. Production turns up cases they didn’t. They fix a prompt, redeploy, and hope nothing else broke. They can’t tell whether the new version is actually better, only that the last complaint went away.

This is not a prompting problem. It’s a missing loop. Observing production shows you where the agent is wrong. Fixing it requires a way to measure whether a change made things better, on a representative set of inputs, before the change reaches anyone who cares.

That loop is the Improve section of AGNT5. It turns the same Run data you already have into a structured workflow for measuring and improving agent behavior — without stitching together an external eval harness or a separate observability vendor.

The flywheel

Improvement in AGNT5 is organized around four primitives that compose into a single loop:

Datasets of inputs, scored by Scorers, run through Experiments against a candidate Prompt — compared to a baseline.

Each primitive has one job:

  • Datasets — curated collections of inputs. The cases you want the agent to handle correctly, captured from real production Runs or authored by hand. A Dataset is the thing you’re evaluating against.
  • Scorers — how you judge whether an output is good. Built-in LLM-as-judge scorers for open-ended outputs, exact-match and structured comparison for deterministic outputs, custom scorers for anything domain-specific.
  • Prompts — versioned prompt artifacts. Not strings buried inside code, but first-class objects you can iterate on, reference from an Agent, roll back, and compare across versions without redeploying.
  • Experiments — a single run of Agent × Dataset × Scorer, producing a comparable result. Experiments are the unit of measurement. Two Experiments on the same Dataset and Scorer, with different Prompts, tell you whether the change made things better.
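The way these primitives compose can be sketched as a minimal data model. Everything below is illustrative Python, not the AGNT5 SDK; the names and shapes are assumptions made only to show how the pieces fit together:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch only; these names are assumptions, not the AGNT5 SDK.

@dataclass
class DatasetCase:
    input: str      # what the agent receives
    expected: str   # the behavior you want to see

@dataclass
class Prompt:
    version: str    # a first-class, versioned artifact
    template: str

# A Scorer judges an (output, expected) pair and returns a score.
Scorer = Callable[[str, str], float]

def exact_match(output: str, expected: str) -> float:
    """Deterministic scorer: 1.0 on an exact match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

@dataclass
class Experiment:
    """One run of Agent x Dataset x Scorer, reduced to comparable scores."""
    prompt: Prompt
    scores: list[float]

    @property
    def mean_score(self) -> float:
        return sum(self.scores) / len(self.scores)
```

The key property is that an Experiment reduces to a comparable number: two Experiments over the same Dataset and Scorer differ only in the Prompt, so any difference in score is attributable to the Prompt change.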

One full turn of the loop looks like this:

  1. A production Run surfaces a failure mode you didn’t have coverage for.
  2. You add the input (and the expected behavior) to a Dataset.
  3. You draft a new Prompt version — still in Studio, no code change yet.
  4. You run an Experiment against the updated Dataset, with a Scorer that captures whether the new version actually handles the failure case.
  5. You compare the Experiment to the baseline. If it’s an improvement, you ship the new Prompt version and move on. If it’s not, you iterate on the Prompt — not the code.
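One turn of the loop can be sketched end to end. The agent here is a stub and every name is hypothetical, not the AGNT5 SDK; the point is the shape of the comparison, not the API:

```python
# Illustrative sketch only; the agent is a stub and all names are
# assumptions, not the AGNT5 SDK.

def agent(prompt: str, case_input: str) -> str:
    """Stand-in for a real agent run: obeys an UPPERCASE instruction."""
    text = case_input.upper() if "UPPERCASE" in prompt else case_input
    return f"Reply: {text}"

def scorer(output: str, expected: str) -> float:
    """Exact-match scorer for deterministic outputs."""
    return 1.0 if output == expected else 0.0

def run_experiment(prompt: str, dataset: list[tuple[str, str]]) -> float:
    """One Experiment: prompt x dataset x scorer -> one comparable score."""
    scores = [scorer(agent(prompt, inp), want) for inp, want in dataset]
    return sum(scores) / len(scores)

# Steps 1-2: the production failure becomes a Dataset case.
dataset = [
    ("hello", "Reply: HELLO"),
    ("hi there", "Reply: HI THERE"),  # the newly captured failure case
]

# Step 3: draft a candidate Prompt version alongside the baseline.
baseline = "Respond with 'Reply: <input>'."
candidate = "Respond with 'Reply: <input>' in UPPERCASE."

# Steps 4-5: run both Experiments and compare before shipping.
baseline_score = run_experiment(baseline, dataset)    # 0.0
candidate_score = run_experiment(candidate, dataset)  # 1.0
ship = candidate_score > baseline_score               # True: ship the Prompt
```

If the candidate does not beat the baseline, only the Prompt is iterated on; the Dataset and Scorer stay fixed so results remain comparable across attempts.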

None of this requires leaving AGNT5. The Dataset came from Runs you already had. The Prompt is referenced from the Agent without a redeploy. The Experiment’s execution uses the same durable runtime as production. The Scorer output is stored alongside the Run, ready to be drilled into.

Why it lives in one platform

Most teams assemble this loop from three or four separate tools: one for execution, one for tracing, one for evaluation, one for prompt management. Each piece works on its own. The seams between them are where the loop breaks: a trace from one system can’t be replayed as an eval case in another; a Prompt version tracked in one place isn’t the one actually being called in production; an evaluation score can’t be correlated with the production Run that inspired it.

AGNT5 collapses the loop because every stage runs against the same execution record:

  • A Run captured by the runtime can become a Dataset entry with a single action.
  • A Prompt version referenced from an Agent can be swapped without redeploying code.
  • An Experiment is executed by the same engine that runs production — same code path, same observability, same guarantees.
  • An Experiment result links back to the individual Runs it produced, which are inspectable the same way production Runs are.

This is what it means to run and improve in one platform. The improvement loop isn’t a separate surface bolted onto the execution engine — it’s the execution engine turned inward, replaying its own history to measure itself against itself.

Where to go next

  • Experiments — the unit of measurement.
  • Datasets — the cases you’re measuring against.
  • Scorers — how you decide what “better” means.
  • Prompts — the versioned artifact you’re iterating on.

If you haven’t yet, read the Run overview first — it’s the substrate the Improve loop builds on.