Build a data extraction workflow

This cookbook builds a structured extraction workflow for AI outputs that must be parsed, validated, retried, and explained.

Scenario

An analyst submits free-form notes. The workflow extracts accounts, contacts, dates, and next actions as JSON, validates the result, and stores the structured record.

What you build

A structured-output prompt.
A schema validator.
A repair step for malformed JSON.
A retry policy for transient model failures.
A trace that shows raw and parsed outputs.

Workflow shape

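@workflow
async def extract_account_update(ctx: WorkflowContext, note_id: str) -> ExtractionResult:
    note = await ctx.step(load_note, note_id)
    # The model call is its own step, so its raw output can be inspected separately.
    raw = await ctx.step(call_extraction_agent, note.text)
    # Parsing and validation run as a separate step that can fail on its own.
    parsed = await ctx.step(parse_and_validate_update, raw)
    # Only validated data reaches the storage step.
    receipt = await ctx.step(store_update_once, note.id, parsed)
    return ExtractionResult(update_id=receipt.id)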

Keeping the model call and the parse step as separate steps makes malformed output easy to inspect: the raw text is available on its own even when parsing fails.
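
The retry policy for transient model failures would wrap the model-call step. The sketch below is one way to do that with bounded attempts and exponential backoff. `TransientModelError` and `call_with_retry` are hypothetical names, not part of any framework shown here; if the workflow framework has its own per-step retry setting, that is probably the better place for this logic.

```python
import asyncio
import random


class TransientModelError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


async def call_with_retry(call, *args, max_attempts=3, base_delay=0.5):
    """Await call(*args), retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call(*args)
        except TransientModelError:
            if attempt == max_attempts:
                raise  # Budget exhausted: surface the failure to the caller.
            # The delay doubles per attempt; jitter spreads out concurrent retries.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

Only errors marked as transient are retried. Validation failures are not transient and should go to the repair path instead.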

Schema-first extraction

Define the expected output before writing the prompt.

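from datetime import date

from pydantic import BaseModel


class AccountUpdate(BaseModel):
    account_name: str
    contacts: list[str]
    next_action: str
    # Required but nullable: the output must contain null when the note has no date.
    due_date: date | None
    confidence: float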
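
With the schema fixed, the prompt can name the same fields and ask for nothing but JSON. The sketch below is one possible shape. The prompt wording, the field descriptions, and the `build_extraction_prompt` helper are assumptions for illustration; only the field names come from the schema above.

```python
import json

# Hypothetical field descriptions; the keys mirror the AccountUpdate fields.
FIELD_SPEC = {
    "account_name": "string, the account the note is about",
    "contacts": "array of strings, people mentioned in the note",
    "next_action": "string, the follow-up the analyst describes",
    "due_date": "string in YYYY-MM-DD format, or null if no date is given",
    "confidence": "number between 0 and 1",
}


def build_extraction_prompt(note_text: str) -> str:
    """Return a prompt that asks for a single JSON object and nothing else."""
    spec = json.dumps(FIELD_SPEC, indent=2)
    return (
        "Extract the account update from the analyst note below.\n"
        "Respond with one JSON object and no surrounding text.\n"
        f"Use exactly these keys:\n{spec}\n\n"
        f"Note:\n{note_text}"
    )
```

Keeping the field list in one place makes it harder for the prompt and the schema to drift apart.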

The validator should reject missing required fields and values that violate business rules. Type annotations alone catch only missing or mistyped fields, so business rules need explicit checks.
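
This cookbook does not spell out the business rules, so the rules in the sketch below are assumptions: non-empty names and actions, a confidence between 0 and 1, and no due date in the past. It is a dependency-free illustration that runs after type validation, so it takes an already-typed dict. `business_rule_violations` is a hypothetical helper.

```python
from datetime import date


def business_rule_violations(update: dict, today: date) -> list[str]:
    """Return a list of rule violations; an empty list means the update passes.

    Assumes the field types were already validated (due_date is a date or None).
    """
    problems = []
    if not update["account_name"].strip():
        problems.append("account_name is empty")
    if not update["next_action"].strip():
        problems.append("next_action is empty")
    if not 0.0 <= update["confidence"] <= 1.0:
        problems.append("confidence must be between 0 and 1")
    if update["due_date"] is not None and update["due_date"] < today:
        problems.append("due_date is in the past")
    return problems
```

Returning every violation at once, rather than stopping at the first, gives the trace a fuller picture of what the model got wrong. With the Pydantic model, the same checks could instead live in validators on the class.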

Malformed output recovery

If parsing fails, run a bounded repair step and keep both the raw and the repaired output in the trace.

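from pydantic import ValidationError


@function
async def parse_and_validate_update(raw: str) -> AccountUpdate:
    try:
        return AccountUpdate.model_validate_json(raw)
    except ValidationError:
        # One repair attempt only. If the repaired text is still invalid,
        # the second ValidationError propagates and the step fails.
        repaired = await repair_json(raw)
        return AccountUpdate.model_validate_json(repaired)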
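
The cookbook does not show what the repair step does. The sketch below assumes a simple deterministic repair that strips chatter or code fences around the outermost JSON object, applies it a bounded number of times, and returns every text it tried so the caller can put them in the trace. `strip_to_json_object` and `parse_with_bounded_repair` are hypothetical names, and a real repair step might call a model instead.

```python
import json


def strip_to_json_object(raw: str) -> str:
    """Drop code fences or chatter around the outermost {...} span, if one exists."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        return raw  # Nothing recognizable to repair.
    return raw[start : end + 1]


def parse_with_bounded_repair(raw: str, max_repairs: int = 1):
    """Return (parsed, attempts); attempts lists every text that was tried."""
    attempts = [raw]
    for remaining in range(max_repairs, -1, -1):
        try:
            return json.loads(attempts[-1]), attempts
        except json.JSONDecodeError:
            if remaining == 0:
                break  # Repair budget spent.
            repaired = strip_to_json_object(attempts[-1])
            if repaired == attempts[-1]:
                break  # The repair made no progress; stop early.
            attempts.append(repaired)
    raise ValueError(f"unparseable after {len(attempts) - 1} repair(s)")
```

The first entry in `attempts` is always the untouched model output, which is the version most useful for debugging and for building eval cases later.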

Production checks

Raw model output and parsed output are both trace-visible.
Repair attempts are bounded.
Invalid data fails before the storage step.
The storage step is idempotent.
Failed extractions can be converted into eval cases.
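
As an illustration of the idempotent storage check, the sketch below uses SQLite from the standard library and treats the note ID as the idempotency key, so a retried workflow run cannot write a second record. The table name, `open_store`, and `store_update_once_sketch` are assumptions; the cookbook's own `store_update_once` may work differently.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS account_updates ("
        "note_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn


def store_update_once_sketch(conn: sqlite3.Connection, note_id: str, update: dict) -> str:
    """Insert the update for note_id at most once; repeated calls are no-ops.

    Assumes update is JSON-serializable (dates already converted to strings).
    """
    with conn:  # Commits on success, rolls back on error.
        conn.execute(
            "INSERT OR IGNORE INTO account_updates (note_id, payload) VALUES (?, ?)",
            (note_id, json.dumps(update)),
        )
    return note_id
```

The primary key does the deduplication, so the guarantee holds even if two retries race each other.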