Durable Execution

Checkpointing, retries, and crash recovery as runtime guarantees

Durable execution is the runtime guarantee that once a workflow starts, it finishes — even when individual calls fail, Workers crash, or the platform redeploys underneath it. You write the code as if nothing will go wrong; AGNT5 catches the cases where it does.

Three mechanics make this work:

  • Checkpointing — every function call, step, and state transition is recorded to a durable journal before it’s acknowledged. The journal is the single source of truth for what a workflow has done.
  • Replay — when a Worker picks up a partially-executed workflow (after a crash, restart, or redeploy), it replays the journal to the last checkpoint and resumes from there. No state is lost; no step runs twice unless you asked it to.
  • Retries with backoff — transient failures (LLM timeouts, API 503s, rate limits) retry automatically with configurable backoff policies. Non-transient failures surface immediately.
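
To make the three mechanics concrete, here is a minimal sketch in plain Python. It is not the AGNT5 SDK: `Journal`, `run_step`, `TransientError`, and `workflow` are hypothetical names invented for illustration, and the journal is an in-memory dict where a real runtime would write to durable storage. The sketch assumes that each step has a stable name and a JSON-serializable result.

```python
import json
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s, rate limits)."""


class Journal:
    """Toy in-memory journal; a real one would persist to durable storage."""

    def __init__(self):
        self.entries = {}  # step name -> JSON-encoded result

    def has(self, name):
        return name in self.entries

    def load(self, name):
        return json.loads(self.entries[name])

    def record(self, name, result):
        self.entries[name] = json.dumps(result)


def run_step(journal, name, fn, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Replay a recorded result if one exists; otherwise execute fn with retries."""
    if journal.has(name):
        return journal.load(name)  # replay: skip work that already completed
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
            break
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        # Any other exception type is not caught and surfaces immediately.
    journal.record(name, result)  # checkpoint before handing the result back
    return result


def workflow(journal, log, crash_before_summary=False):
    """Happy-path style workflow; `log` records which step bodies actually ran."""

    def fetch():
        log.append("fetch")
        return {"user": "ada"}

    data = run_step(journal, "fetch", fetch)

    if crash_before_summary:
        raise RuntimeError("simulated worker crash")

    def summarize():
        log.append("summarize")
        if log.count("summarize") < 3:
            raise TransientError("simulated 503")  # fails twice, then succeeds
        return f"hello {data['user']}"

    return run_step(journal, "summarize", summarize)


if __name__ == "__main__":
    journal, log = Journal(), []
    try:
        workflow(journal, log, crash_before_summary=True)
    except RuntimeError:
        pass  # the first worker died after the "fetch" checkpoint
    print(workflow(journal, log))  # a second worker resumes from the journal
    print(log)
```

In this sketch the second call to `workflow` never re-runs `fetch`, because its result is already in the journal, while `summarize` is attempted three times before it succeeds. A `ValueError` raised inside a step would bypass the retry loop entirely, which mirrors the distinction between transient and non-transient failures described above.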

The result: application code that reads like a happy-path script, but behaves like a production-grade distributed system.

A deeper guide is in progress. For the full execution model, see the Workflows foundation and Functions foundation.