Durable Execution
Checkpointing, retries, and crash recovery as runtime guarantees
Durable execution is the runtime guarantee that once a workflow starts, it finishes — even when individual calls fail, Workers crash, or the platform redeploys underneath it. You write the code as if nothing will go wrong; AGNT5 catches the cases where it does.
Three mechanics make this work:
- Checkpointing — every function call, step, and state transition is recorded to a durable journal before it’s acknowledged. The journal is the single source of truth for what a workflow has done.
- Replay — when a Worker picks up a partially-executed workflow (after a crash, restart, or redeploy), it replays the journal to the last checkpoint and resumes from there. No state is lost; no step runs twice unless you asked it to.
- Retries with backoff — transient failures (LLM timeouts, API 503s, rate limits) retry automatically with configurable backoff policies. Non-transient failures surface immediately.
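The checkpointing and replay mechanics above can be sketched with a toy, file-backed journal. This is a minimal illustration and not AGNT5's implementation or API: the `Journal` class, its `step` method, the step names, and the simulated crash are all hypothetical.

```python
import json
import os
import tempfile


class Journal:
    """Toy append-only journal of completed steps (hypothetical, not the AGNT5 API)."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        # Load every step that was recorded before this process started.
        if os.path.exists(path):
            with open(path) as f:
                for line in f:
                    entry = json.loads(line)
                    self.results[entry["step"]] = entry["result"]

    def step(self, name, fn):
        # Replay: a step already in the journal returns its recorded result
        # without running its body again.
        if name in self.results:
            return self.results[name]
        result = fn()
        # Checkpoint: persist the result before handing it back to the caller.
        with open(self.path, "a") as f:
            f.write(json.dumps({"step": name, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.results[name] = result
        return result


calls = []  # names of the step bodies that actually executed


def tracked(name, value):
    """Build a step body that records its own execution and returns `value`."""
    def body():
        calls.append(name)
        return value
    return body


def run_workflow(journal, crash_before_ship=False):
    reservation = journal.step("reserve", tracked("reserve", "res-1"))
    amount = journal.step("charge", tracked("charge", 42))
    if crash_before_ship:
        raise RuntimeError("simulated Worker crash")
    shipment = journal.step("ship", tracked("ship", "shipped"))
    return reservation, amount, shipment


journal_path = os.path.join(tempfile.mkdtemp(), "journal.jsonl")

# First attempt: two steps are checkpointed, then the process "crashes".
try:
    run_workflow(Journal(journal_path), crash_before_ship=True)
except RuntimeError:
    pass

# Second attempt: a fresh Journal stands in for a new Worker picking up the workflow.
final = run_workflow(Journal(journal_path))
```

On the second run, `reserve` and `charge` are served from the journal and only `ship` executes, so each step body runs exactly once across both attempts.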
The result: application code that reads like a happy-path script, but behaves like a production-grade distributed system.
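To make the happy-path claim concrete, here is a hedged sketch of retries with exponential backoff wrapped around an ordinary function. The `with_retries` decorator, the `TransientError` class, and `call_model` are hypothetical names chosen for illustration; AGNT5's actual retry configuration may look quite different.

```python
import functools
import time


class TransientError(Exception):
    """A failure worth retrying, such as a timeout or an HTTP 503 (hypothetical)."""


def with_retries(max_attempts=4, base_delay=0.5, factor=2.0, sleep=time.sleep):
    """Retry TransientError with exponential backoff; let other errors propagate."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except TransientError:
                    # Out of attempts: surface the last transient failure.
                    if attempt == max_attempts:
                        raise
                    sleep(delay)
                    delay *= factor
        return wrapper
    return decorator


delays = []               # captured backoff delays, so the demo never really sleeps
attempts = {"count": 0}   # how many times the function body ran


@with_retries(max_attempts=4, base_delay=0.5, sleep=delays.append)
def call_model(prompt):
    # The body is plain happy-path code; the first two calls fail on purpose.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated 503")
    return f"answer to {prompt}"


reply = call_model("hello")
```

The function body contains no retry logic of its own. Passing `sleep=delays.append` is only a testing convenience that records the backoff schedule instead of waiting; any exception other than `TransientError` is raised on the first attempt.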
A deeper guide is in progress. For the full execution model, see the Workflows foundation and Functions foundation.