---
title: Build a document processing pipeline
description: Extract structured fields, validate them, pause for review, and retry failed document steps safely.
tags: ["Documents", "Structured output", "Review"]
date: 2026-05-13
last_verified: 2026-05-13
audience: both
---

Document workflows fail in predictable ways: bad scans, missing fields,
malformed model output, and partial external writes. This cookbook builds a
pipeline that makes each failure inspectable and recoverable.

## Scenario

An operations team uploads invoices. The workflow extracts fields, validates the
result, pauses for review when confidence is low, and stores approved data in a
system of record.

## What you build

- A document ingestion workflow.
- OCR or text extraction.
- Structured field extraction.
- Validation and confidence checks.
- Human review for exceptions.
- An idempotent write to the destination system.

## Workflow shape

```python
@workflow
async def process_invoice(ctx: WorkflowContext, document_id: str) -> InvoiceOutcome:
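    # Each stage runs as its own step, so a failure can be inspected and
    # retried without repeating the stages before it.
    document = await ctx.step(load_document, document_id)
    text = await ctx.step(extract_text, document)
    invoice = await ctx.step(extract_invoice_fields, text)
    validation = await ctx.step(validate_invoice, invoice)

    if validation.needs_review:
        # Pause until a reviewer responds. The metadata tells the reviewer
        # which document is waiting and which checks failed.
        decision = await ctx.wait_for_signal(
            "invoice_review",
            timeout="10d",
            metadata={"document_id": document_id, "issues": validation.issues},
        )
        # Continue with the reviewer's corrected fields.
        invoice = decision.corrected_invoice

    # The write is idempotent, so retrying this step does not duplicate records.
    receipt = await ctx.step(store_invoice_once, document_id, invoice)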
    return InvoiceOutcome(status="stored", receipt_id=receipt.id)
```

The review path is part of the workflow, not an out-of-band spreadsheet.
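
The workflow shape above assumes the reviewer always returns a corrected invoice. A review can also end in a rejection, or never arrive before the timeout. The following is a minimal sketch of that branching as a plain function that can be tested outside the workflow. `ReviewDecision`, its `action` values, the status strings, and the convention that a timed-out wait is represented as `None` are all assumptions made for illustration; check how your workflow engine actually reports a timed-out signal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewDecision:
    """Hypothetical payload a reviewer sends back with the signal."""
    action: str  # assumed values: "approve" or "reject"
    corrected_invoice: Optional[dict] = None


def resolve_review(
    original: dict, decision: Optional[ReviewDecision]
) -> tuple[str, Optional[dict]]:
    """Map a review result to (status, invoice to store or None)."""
    if decision is None:
        # Assumption: the caller passes None when the wait timed out.
        return "review_timed_out", None
    if decision.action == "reject":
        return "rejected", None
    if decision.action == "approve":
        # Prefer the reviewer's corrections; fall back to the extracted fields
        # when the reviewer approved the invoice unchanged.
        if decision.corrected_invoice is not None:
            return "approved", decision.corrected_invoice
        return "approved", original
    raise ValueError(f"unknown review action: {decision.action!r}")
```

With a helper like this, the workflow would store the invoice only when the status is `"approved"` and return a different outcome otherwise.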

## Validation rules

Use deterministic validation before asking another model to judge the output.

- Required fields are present.
- Totals add up.
- Currency is supported.
- Vendor is recognized.
- Confidence passes the threshold.
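
The rules above can be sketched as one deterministic function. This is a minimal illustration, not the cookbook's actual `validate_invoice`: the `Invoice` and `LineItem` fields, the currency and vendor sets, and the `0.85` threshold are all hypothetical, and how the function is registered as a workflow step depends on your engine.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Optional

# Hypothetical values; a real pipeline would load these from configuration.
SUPPORTED_CURRENCIES = {"USD", "EUR", "GBP"}
KNOWN_VENDORS = {"acme-supplies", "globex"}
CONFIDENCE_THRESHOLD = 0.85


@dataclass
class LineItem:
    description: str
    amount: Decimal


@dataclass
class Invoice:
    vendor_id: Optional[str]
    currency: Optional[str]
    total: Optional[Decimal]
    line_items: list[LineItem] = field(default_factory=list)
    confidence: float = 0.0


@dataclass
class ValidationResult:
    issues: list[str]

    @property
    def needs_review(self) -> bool:
        # Any issue at all routes the invoice to human review.
        return bool(self.issues)


def validate_invoice(invoice: Invoice) -> ValidationResult:
    issues: list[str] = []

    # Required fields are present.
    for name in ("vendor_id", "currency", "total"):
        if getattr(invoice, name) is None:
            issues.append(f"missing field: {name}")

    # Totals add up. Decimal keeps the comparison exact.
    if invoice.total is not None:
        line_sum = sum((item.amount for item in invoice.line_items), Decimal("0"))
        if line_sum != invoice.total:
            issues.append(f"total mismatch: lines sum to {line_sum}, total is {invoice.total}")

    # Currency is supported.
    if invoice.currency is not None and invoice.currency not in SUPPORTED_CURRENCIES:
        issues.append(f"unsupported currency: {invoice.currency}")

    # Vendor is recognized.
    if invoice.vendor_id is not None and invoice.vendor_id not in KNOWN_VENDORS:
        issues.append(f"unknown vendor: {invoice.vendor_id}")

    # Confidence passes the threshold.
    if invoice.confidence < CONFIDENCE_THRESHOLD:
        issues.append(f"low confidence: {invoice.confidence:.2f}")

    return ValidationResult(issues)
```

Collecting every issue, rather than stopping at the first, gives the reviewer the full list in one pass.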

## Production checks

- Raw document, extracted text, structured output, and validation errors are in
  the trace.
- Low-confidence extractions pause for review.
- The store step uses a stable idempotency key.
- Reprocessing a document does not duplicate destination records.
- Corrected review output can become an eval case.
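
The two checks on the store step can be illustrated with a small sketch. It assumes the idempotency key is derived only from the document ID and that the destination supports an insert-if-absent operation. `InMemoryInvoiceStore`, `put_if_absent`, `idempotency_key`, and the extra `store` parameter are hypothetical stand-ins; a real system of record would typically enforce this with a unique constraint or a native idempotency-key feature.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    id: str


class InMemoryInvoiceStore:
    """Stand-in for the system of record, for illustration only."""

    def __init__(self) -> None:
        self._records: dict[str, dict] = {}

    def put_if_absent(self, key: str, record: dict) -> bool:
        """Insert the record unless the key already exists. True if inserted."""
        if key in self._records:
            return False
        self._records[key] = record
        return True

    def __len__(self) -> int:
        return len(self._records)


def idempotency_key(document_id: str) -> str:
    # Derived only from the document ID, so every retry and every
    # reprocessing run of the same document yields the same key.
    return hashlib.sha256(f"invoice:{document_id}".encode("utf-8")).hexdigest()


def store_invoice_once(
    store: InMemoryInvoiceStore, document_id: str, invoice: dict
) -> Receipt:
    key = idempotency_key(document_id)
    # A repeated call finds the key already present and writes nothing.
    store.put_if_absent(key, invoice)
    return Receipt(id=key)
```

A key built only from the document ID means the first write wins: reprocessing the same document with different fields will not update the stored record. If corrections must overwrite earlier data, that is a separate, deliberate update path rather than a retry.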

## Next steps

- [Build a data extraction workflow](/cookbooks/data-extraction.md)
- [Retry AI workflow steps without duplicate side effects](/cookbooks/retry-without-duplicate-side-effects.md)
- [Turn a failed production AI run into an eval](/cookbooks/production-run-to-eval.md)
