---
title: Build a data extraction workflow
description: Call tools, force JSON outputs, recover from malformed responses, and inspect every extraction step.
tags: ["Structured output", "Tools", "Traces"]
date: 2026-05-13
last_verified: 2026-05-13
audience: both
---

This cookbook builds a structured extraction workflow for AI outputs that must
be parsed, validated, retried, and explained.

## Scenario

An analyst submits free-form notes. The workflow extracts accounts, contacts,
dates, and next actions as JSON, validates the result, and stores the structured
record.

## What you build

- A structured-output prompt.
- A schema validator.
- A repair step for malformed JSON.
- A retry policy for transient model failures.
- A trace that shows raw and parsed outputs.

## Workflow shape

```python
@workflow
async def extract_account_update(ctx: WorkflowContext, note_id: str) -> ExtractionResult:
    note = await ctx.step(load_note, note_id)
    raw = await ctx.step(call_extraction_agent, note.text)
    parsed = await ctx.step(parse_and_validate_update, raw)
    receipt = await ctx.step(store_update_once, note.id, parsed)
    return ExtractionResult(update_id=receipt.id)
```

Separating the model call from the parse step means that each one is recorded
as its own step, so malformed output is easy to inspect.
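Because the model call is its own step, a retry policy for transient failures can wrap that call alone without re-running parsing or storage. The following is a minimal, framework-independent sketch. `TransientModelError` and `call_with_retry` are hypothetical names introduced for illustration, and the attempt count and delays are placeholder values. If your workflow runtime offers step-level retry settings, those are probably the better place for this policy.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientModelError(Exception):
    """Hypothetical error type for timeouts, rate limits, and similar failures."""


async def call_with_retry(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Retry `call` on TransientModelError with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientModelError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # The delay doubles on each attempt; jitter spreads out
            # retries from concurrent workflows.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")
```

Only errors classified as transient are retried. Any other exception, including a validation failure, propagates on the first attempt.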

## Schema-first extraction

Define the expected output before writing the prompt.

```python
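from datetime import date

from pydantic import BaseModel


class AccountUpdate(BaseModel):
    account_name: str
    contacts: list[str]
    next_action: str
    due_date: date | None  # key is required; null when the note gives no date
    confidence: float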
```
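With the schema fixed, the structured-output prompt can be derived from it rather than written by hand. The sketch below uses only the standard library and a hand-written JSON Schema that mirrors the `AccountUpdate` fields. `ACCOUNT_UPDATE_SCHEMA`, `build_extraction_prompt`, and the prompt wording are assumptions for illustration. With Pydantic v2, `AccountUpdate.model_json_schema()` can generate an equivalent schema so the two definitions cannot drift.

```python
import json

# Hand-written JSON Schema mirroring the AccountUpdate fields (an assumption
# for this sketch; a schema library could generate it instead).
ACCOUNT_UPDATE_SCHEMA = {
    "type": "object",
    "properties": {
        "account_name": {"type": "string"},
        "contacts": {"type": "array", "items": {"type": "string"}},
        "next_action": {"type": "string"},
        "due_date": {"type": ["string", "null"], "format": "date"},
        "confidence": {"type": "number"},
    },
    "required": [
        "account_name",
        "contacts",
        "next_action",
        "due_date",
        "confidence",
    ],
    "additionalProperties": False,
}


def build_extraction_prompt(note_text: str) -> str:
    """Return a prompt that asks for one JSON object matching the schema."""
    schema = json.dumps(ACCOUNT_UPDATE_SCHEMA, indent=2)
    return (
        "Extract the account update from the analyst note below.\n"
        "Respond with a single JSON object and no surrounding text.\n"
        "Use null for due_date when the note gives no date.\n\n"
        f"JSON Schema:\n{schema}\n\n"
        f"Note:\n{note_text}"
    )
```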

The validator should reject missing required fields and values that do not match
business rules.
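Type checks alone do not cover business rules. The sketch below shows one way to express them as a plain function over already-parsed values (for example, the result of `model_dump()` on the model above). The specific rules here (non-blank text, confidence between 0 and 1, no due dates in the past) and the name `business_rule_violations` are illustrative assumptions, not rules stated by this cookbook.

```python
from datetime import date


def business_rule_violations(update: dict, today: date) -> list[str]:
    """Return human-readable violations; an empty list means the update passes.

    Assumes `due_date` is already a `date` or None, not a raw string.
    """
    problems: list[str] = []
    if not str(update.get("account_name", "")).strip():
        problems.append("account_name must not be blank")
    if not str(update.get("next_action", "")).strip():
        problems.append("next_action must not be blank")
    confidence = update.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number between 0 and 1")
    due = update.get("due_date")
    if due is not None and due < today:
        problems.append("due_date must not be in the past")
    return problems
```

Returning every violation at once, rather than raising on the first, gives the trace a complete picture of why a record was rejected.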

## Malformed output recovery

If parsing fails, run a bounded repair step and keep both the raw and the
repaired output in the trace.

```python
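from pydantic import ValidationError


@function
async def parse_and_validate_update(raw: str) -> AccountUpdate:
    try:
        return AccountUpdate.model_validate_json(raw)
    except ValidationError:
        # A single repair attempt keeps the step bounded. If the repaired text
        # is still invalid, the second ValidationError propagates and fails
        # the step before anything is stored.
        repaired = await repair_json(raw)
        return AccountUpdate.model_validate_json(repaired)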
```
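To make the bound explicit and keep every version available for the trace, the repair loop can return the candidates it tried alongside the result. The following is a simplified, synchronous sketch using only the standard library. `ParseTrace`, `parse_with_bounded_repair`, and `keep_outermost_object` are hypothetical names, and the repair shown is a simple string trim rather than the `repair_json` step above, whose implementation this cookbook does not show.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ParseTrace:
    """Every candidate string that was tried, in order, plus each parse error."""
    attempts: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)


def keep_outermost_object(text: str) -> str:
    """Hypothetical repair: drop chatter before the first '{' and after the last '}'."""
    start, end = text.find("{"), text.rfind("}")
    return text[start : end + 1] if 0 <= start < end else text


def parse_with_bounded_repair(
    raw: str,
    repair: Callable[[str], str],
    max_repairs: int = 2,
) -> tuple[Optional[dict], ParseTrace]:
    """Parse `raw` as a JSON object, repairing at most `max_repairs` times."""
    trace = ParseTrace()
    candidate = raw
    for attempt in range(max_repairs + 1):
        trace.attempts.append(candidate)
        try:
            parsed = json.loads(candidate)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            return parsed, trace
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            trace.errors.append(str(exc))
            if attempt < max_repairs:
                candidate = repair(candidate)
    return None, trace  # repair budget exhausted
```

The returned `ParseTrace` always starts with the untouched raw output, so a reviewer can compare it with each repaired version. Schema validation would still run on the parsed dictionary afterwards.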

## Production checks

- Raw model output and parsed output are both trace-visible.
- Repair attempts are bounded.
- Invalid data fails before the storage step.
- The storage step is idempotent.
- Failed extractions can be converted into eval cases.
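
The idempotency check can be illustrated with a key derived from the note ID, so that a replayed or retried workflow run returns the original receipt instead of writing a second record. This is a minimal in-memory sketch: `Receipt`, `InMemoryUpdateStore`, and the key format are assumptions, and a real store would typically enforce the same rule with a unique constraint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    id: str
    created: bool  # False when an earlier attempt already stored this note


class InMemoryUpdateStore:
    """Stand-in for a database table with a unique key per note."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}

    def store_update_once(self, note_id: str, update: dict) -> Receipt:
        # One key per note: the first write wins and later calls are no-ops.
        key = f"update:{note_id}"
        if key in self._rows:
            return Receipt(id=key, created=False)
        self._rows[key] = update
        return Receipt(id=key, created=True)

    def count(self) -> int:
        return len(self._rows)
```

Keying on the note ID alone means a replay that produces a different extraction does not overwrite the stored record. Whether that is the right policy depends on how corrections are meant to flow through the system.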

## Next steps

- [Build a document processing pipeline](/cookbooks/document-processing.md)
- [Debug and replay a failed AI workflow](/cookbooks/debug-production-run.md)
- [Debug AI workflows with traces, not scattered logs](/cookbooks/workflow-native-observability.md)
