Enable online evals

Sample production runs, score them asynchronously, and alert when pass rate drops below a threshold.

Online evals let you measure production behavior without blocking user traffic. You attach a published scorer (a rule that grades a run or trace) to a deployment (a running AGNT5 worker). AGNT5 then samples completed runs and scores them in the background. After setup, Studio shows live score aggregates, sample decisions, alerts, and recent scores for that deployment.

Prerequisites

  • A deployment that is receiving production runs.
  • At least one enabled scorer with a published version.
  • A user or service token with developer access to the project.

Set up online evals in Studio

  1. Open the deployment in Studio.
  2. Select the Quality tab.
  3. Choose a Scorer.
  4. Set Sample % for ordinary runs.
  5. Set Slow-run % and Slow-run threshold ms when slow runs should be sampled more often.
  6. Set Pass-rate floor and Min count for the optional alert.
  7. Select Preview to estimate observed runs, selected runs, scorer jobs, and alert status.
  8. Select Enable to create the online eval policy and alert.

Use the policy table to enable or disable a policy or its alert, or select Edit to create a new policy version with updated sampling and scorer settings.
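
Step 6 pairs a pass-rate floor with a minimum count. The sketch below shows one plausible reading of that pairing: the alert fires only when the window holds at least Min count scores and the share of passing scores is below the floor. The function name alert_fires and the exact evaluation rule are illustrative assumptions, not documented AGNT5 behavior.

```shell
# alert_fires PASSED TOTAL FLOOR MIN_COUNT
# Prints "fire" when the window has enough scores and the pass rate is
# below the floor; prints "ok" otherwise.
# ASSUMPTION: Min count gates the alert so that small samples stay quiet.
alert_fires() {
  awk -v passed="$1" -v total="$2" -v floor="$3" -v need="$4" 'BEGIN {
    if (total > 0 && total >= need && passed / total < floor) print "fire"
    else print "ok"
  }'
}

alert_fires 40 50 0.9 50   # 0.80 pass rate over 50 scores, prints: fire
alert_fires 30 40 0.9 50   # only 40 scores, below Min count, prints: ok
```

A higher Min count makes the alert slower to react on low-traffic deployments but less likely to fire on a handful of unlucky samples.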

Preview with the API

Previewing does not create a policy. It evaluates the proposed sampling config against recent completed runs and returns selected/skipped counts, reason buckets, scorer job estimates, and optional alert status.

curl -X POST "https://api.agnt5.com/api/v1/projects/<project-id>/eval/online/preview" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "lookback_seconds": 86400,
    "max_runs": 1000,
    "policy": {
      "mode": "async_online",
      "binding_scope": "deployment",
      "deployment_id": "<deployment-id>",
      "sampling_config": {
        "type": "uniform",
        "rate": 0.02,
        "boost": [
          {
            "field": "duration_ms",
            "op": "gte",
            "value": 30000,
            "rate": 0.10
          }
        ]
      },
      "scorers": [
        {
          "scorer_id": "<scorer-id>",
          "scorer_version_id": "<scorer-version-id>",
          "scope": "run",
          "ordinal": 1,
          "required": true,
          "threshold": 0.9,
          "weight": 1
        }
      ]
    },
    "alert": {
      "name": "Production quality drop",
      "severity": "warning",
      "deployment_id": "<deployment-id>",
      "scorer_id": "<scorer-id>",
      "scorer_version_id": "<scorer-version-id>",
      "window_seconds": 1800,
      "metric": "pass_rate",
      "operator": "lt",
      "threshold": 0.9,
      "min_count": 50,
      "action_type": "notify"
    }
  }'

The preview response includes job volume, not a price estimate. Scorer cost depends on model/provider configuration and custom scorer runtime.
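
Because the preview reports job volume rather than cost, a quick hand estimate can help sanity-check its numbers. The sketch below rests on two assumptions that this page does not confirm: a run matching the boost rule is sampled at the boost rate instead of the base rate, and each selected run produces one job per scorer. The function name expected_jobs is hypothetical.

```shell
# expected_jobs TOTAL_RUNS SLOW_RUNS BASE_RATE BOOST_RATE SCORER_COUNT
# Estimates scorer jobs for a lookback window.
# ASSUMPTION: slow runs use BOOST_RATE in place of BASE_RATE, and every
# selected run is scored once per attached scorer.
expected_jobs() {
  awk -v total="$1" -v slow="$2" -v base="$3" -v boost="$4" -v scorers="$5" 'BEGIN {
    ordinary = total - slow
    selected = ordinary * base + slow * boost
    printf "%.1f\n", selected * scorers
  }'
}

# 10,000 runs in the window, 400 at or above the 30,000 ms threshold,
# 2% base rate, 10% slow-run rate, one scorer.
expected_jobs 10000 400 0.02 0.10 1   # prints: 232.0
```

If the preview's selected count differs widely from an estimate like this, the boost rule may combine with the base rate differently than assumed here.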

Enable with the API

Create the policy first, then create an alert linked to the returned policy ID.

curl -X POST "https://api.agnt5.com/api/v1/projects/<project-id>/eval/policies" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "async_online",
    "binding_scope": "deployment",
    "deployment_id": "<deployment-id>",
    "sampling_config": {
      "type": "uniform",
      "rate": 0.02,
      "boost": [
        {
          "field": "duration_ms",
          "op": "gte",
          "value": 30000,
          "rate": 0.10
        }
      ]
    },
    "scorers": [
      {
        "scorer_id": "<scorer-id>",
        "scorer_version_id": "<scorer-version-id>",
        "scope": "run",
        "ordinal": 1,
        "required": true,
        "threshold": 0.9,
        "weight": 1
      }
    ]
  }'

curl -X POST "https://api.agnt5.com/api/v1/projects/<project-id>/eval/online/alerts" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production quality drop",
    "severity": "warning",
    "evaluation_policy_id": "<policy-id>",
    "deployment_id": "<deployment-id>",
    "scorer_id": "<scorer-id>",
    "scorer_version_id": "<scorer-version-id>",
    "window_seconds": 1800,
    "metric": "pass_rate",
    "operator": "lt",
    "threshold": 0.9,
    "min_count": 50,
    "action_type": "notify"
  }'
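
The two calls can be chained so the alert picks up the policy ID automatically. This is a minimal sketch, assuming the create-policy response is JSON with the new ID in a top-level "id" field; check the real response shape before relying on it. The environment variables AGNT5_TOKEN and AGNT5_PROJECT_ID and both function names are hypothetical, and the script needs curl and jq.

```shell
API_BASE="https://api.agnt5.com/api/v1"

# alert_payload POLICY_ID DEPLOYMENT_ID SCORER_ID SCORER_VERSION_ID
# Prints the alert body; fixed values mirror the alert example above.
alert_payload() {
  cat <<EOF
{
  "name": "Production quality drop",
  "severity": "warning",
  "evaluation_policy_id": "$1",
  "deployment_id": "$2",
  "scorer_id": "$3",
  "scorer_version_id": "$4",
  "window_seconds": 1800,
  "metric": "pass_rate",
  "operator": "lt",
  "threshold": 0.9,
  "min_count": 50,
  "action_type": "notify"
}
EOF
}

# create_policy_then_alert POLICY_JSON_FILE DEPLOYMENT_ID SCORER_ID SCORER_VERSION_ID
create_policy_then_alert() {
  # ASSUMPTION: the new policy ID is returned as a top-level "id" field.
  policy_id=$(curl -fsS -X POST "$API_BASE/projects/$AGNT5_PROJECT_ID/eval/policies" \
    -H "Authorization: Bearer $AGNT5_TOKEN" \
    -H "Content-Type: application/json" \
    -d @"$1" | jq -r '.id')
  if [ -z "$policy_id" ] || [ "$policy_id" = "null" ]; then
    echo "policy creation failed or returned no id" >&2
    return 1
  fi
  # Send the alert body on stdin so the policy ID is filled in for us.
  alert_payload "$policy_id" "$2" "$3" "$4" |
    curl -fsS -X POST "$API_BASE/projects/$AGNT5_PROJECT_ID/eval/online/alerts" \
      -H "Authorization: Bearer $AGNT5_TOKEN" \
      -H "Content-Type: application/json" \
      -d @-
}
```

With the policy body saved as policy.json, a call would look like: create_policy_then_alert policy.json "<deployment-id>" "<scorer-id>" "<scorer-version-id>". Stopping when no ID comes back avoids creating an alert that is not linked to any policy.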

Operate online evals

Use these endpoints to inspect or change a running setup:

Task                         Endpoint
List policies                GET /api/v1/projects/<project-id>/eval/policies?mode=async_online&deployment_id=<deployment-id>
Disable a policy             PATCH /api/v1/projects/<project-id>/eval/policies/<policy-id> with { "enabled": false }
Create a new policy version  PATCH /api/v1/projects/<project-id>/eval/policies/<policy-id> with new sampling_config or scorers
List sample decisions        GET /api/v1/projects/<project-id>/eval/online/sample-decisions?deployment_id=<deployment-id>
List live scores             GET /api/v1/projects/<project-id>/eval/scores?source=live&deployment_id=<deployment-id>
Get live score aggregate     GET /api/v1/projects/<project-id>/eval/online/scores/aggregate?source=live&deployment_id=<deployment-id>
Disable an alert             PATCH /api/v1/projects/<project-id>/eval/online/alerts/<alert-id> with { "enabled": false }
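
The two disable operations share one request shape, so a small wrapper can cover both. This is a sketch only: the function names and the AGNT5_TOKEN and AGNT5_PROJECT_ID variables are hypothetical, and re-enabling by sending { "enabled": true } is an assumption inferred from the Studio toggle rather than something listed above.

```shell
API_BASE="https://api.agnt5.com/api/v1"

# resource_url policy|alert ID
# Prints the PATCH URL for a policy or an alert, using the paths listed above.
resource_url() {
  case "$1" in
    policy) printf '%s/projects/%s/eval/policies/%s\n' "$API_BASE" "$AGNT5_PROJECT_ID" "$2" ;;
    alert)  printf '%s/projects/%s/eval/online/alerts/%s\n' "$API_BASE" "$AGNT5_PROJECT_ID" "$2" ;;
    *)      echo "unknown resource kind: $1" >&2; return 1 ;;
  esac
}

# set_enabled policy|alert ID true|false
set_enabled() {
  url=$(resource_url "$1" "$2") || return 1
  curl -fsS -X PATCH "$url" \
    -H "Authorization: Bearer $AGNT5_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"enabled\": $3}"
}
```

For example, set_enabled policy "<policy-id>" false pauses sampling for that policy, and set_enabled alert "<alert-id>" false silences its alert while scoring continues.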

Next steps

  • Improve with AGNT5: understand how production runs feed scorers, datasets, and experiments.
  • Deploying: deploy a worker before attaching online evals.
  • Workflows: structure runs so scorers can inspect stable inputs and outputs.