Durable runtime for AI agents & workflows
Crashes resume. Failures replay. Fixes ship proven.
Every tool call, branch, and pause in every run is written to one durable journal. That journal is the recovery layer, the trace, and the eval harness. What you debug is exactly what ran.
The production gap
Building agentic workflows is easy. Keeping them working isn't.
Your first agent ran in a notebook. But in production, agents and workflows make decisions, call tools, wait on humans, and run for hours. When they fail, your stack has no answer, because it treats an eight-hour agentic workflow like a 200ms API request.
Long-running work loses state when it fails
Agentic workflows aren't request-response functions. They branch, loop, call external services, and pause for human input, sometimes for hours. When something fails mid-run, the execution context and completed work are gone. You can't resume from step forty-one. You start over from step one.
The debug-fix-verify loop is broken
Something went wrong, but your tools don't agree on what. Your traces capture inputs and outputs but miss what happened in between. So you piece it together across three dashboards, make a fix, push a deploy, wait, and find out it didn't help. Then you do it again.
The agent runtime
One journal. Not four tools.
Recovery, replay, observability, and evals all come from the same journal that runs your agents and workflows. Ship to production, replay when it breaks, and prove fixes against the runs that failed.
Build
Write agents that survive production
Write agents and workflows in Python or TypeScript. Add a decorator and every completed step is checkpointed. After a crash, the run resumes at the step where it stopped. The runtime handles state, recovery, and coordination, so you can focus on what your agent actually does.
Write your first agent →
Durable SDKs for Python and TypeScript
Add @durable.function and your function gains automatic checkpointing, retries, and crash recovery. The learning surface is just two APIs and a decorator, but it changes what your code can survive.
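To make the idea of step checkpointing concrete, here is a minimal sketch in plain Python. It is not the agnt5 SDK: the `Journal` class, the `checkpointed` decorator, and the JSON-lines file format are all hypothetical stand-ins, assumed only for illustration. It shows one plausible way a journal lets a second attempt skip work that already completed.

```python
import json
import os
import tempfile
from functools import wraps


class Journal:
    """Toy append-only journal (hypothetical): one JSON record per completed step."""

    def __init__(self, path):
        self.path = path
        self.completed = {}
        if os.path.exists(path):
            with open(path) as f:
                for line in f:
                    record = json.loads(line)
                    self.completed[record["step"]] = record["result"]

    def record(self, step, result):
        # Append and fsync so the record survives a process crash.
        with open(self.path, "a") as f:
            f.write(json.dumps({"step": step, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.completed[step] = result


def checkpointed(journal, step):
    """Hypothetical decorator: return the journaled result if the step already ran."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if step in journal.completed:
                return journal.completed[step]
            result = fn(*args, **kwargs)
            journal.record(step, result)
            return result
        return wrapper
    return decorator


calls = []  # which step bodies actually executed, across both attempts


def run(journal_path, fail_at=None):
    journal = Journal(journal_path)

    @checkpointed(journal, "fetch")
    def fetch():
        calls.append("fetch")
        return "raw data"

    @checkpointed(journal, "summarize")
    def summarize(text):
        calls.append("summarize")
        if fail_at == "summarize":
            raise RuntimeError("simulated crash")
        return text.upper()

    return summarize(fetch())


journal_path = os.path.join(tempfile.mkdtemp(), "run.jsonl")
try:
    run(journal_path, fail_at="summarize")  # first attempt crashes mid-run
except RuntimeError:
    pass
result = run(journal_path)  # second attempt reuses the journaled "fetch" result
```

In this sketch, `fetch` executes once even though the run was attempted twice; only the step that failed is executed again.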
Human-in-the-loop that actually works
Build workflows that pause for human approval and resume where they left off. The runtime suspends the run's full state, persists it, and picks up from where it stopped when the decision comes back.
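As a rough illustration of suspend and resume, the sketch below persists a run's state to a file when it reaches an approval step and reloads it when the decision arrives. The function names, the refund scenario, and the JSON state file are assumptions made for this example; they are not the runtime's actual mechanism or API.

```python
import json
import os
import tempfile


def start_refund(state_path, amount):
    """Run until a human decision is needed, then persist state and suspend."""
    state = {"amount": amount, "status": "awaiting_approval"}
    with open(state_path, "w") as f:
        json.dump(state, f)
    return state["status"]


def resume_refund(state_path, approved):
    """Reload the suspended state and continue from the approval step."""
    with open(state_path) as f:
        state = json.load(f)
    if state["status"] != "awaiting_approval":
        raise ValueError("run is not waiting on a decision")
    state["status"] = "refunded" if approved else "rejected"
    with open(state_path, "w") as f:
        json.dump(state, f)
    return state


state_path = os.path.join(tempfile.mkdtemp(), "refund.json")
status = start_refund(state_path, amount=120)
# ...hours later, possibly in a different process...
final = resume_refund(state_path, approved=True)
```

Because the state lives on disk rather than in memory, the process that resumes the run does not have to be the one that started it.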
Run
Ship fast, recover from anything
Your agents and workflows run on a Rust runtime that records every step and recovers from crashes automatically. Deploy from your laptop to production with one command.
Deploy your first agent →
Deploy anywhere, from a laptop to a cluster
The entire runtime ships as a single binary. Run it on your laptop during development, deploy to a VPS for production, or scale out behind a Kubernetes operator when you outgrow a single node.
Crashes don't lose work
When a run crashes partway, the runtime picks up where it left off. Every step is recorded as it happens, so completed work isn't lost and doesn't need to be re-executed.
Improve
See what happened. Fix it. Prove it works.
Every run is recorded automatically. When something goes wrong you have the full picture, and the runs you've already served become the eval set that proves the fix.
Explore replay & evals →
Replay any run, locally or in Studio
Pull any production run to your laptop with agnt5 replay and step through every decision, tool call, and state change exactly as it happened. Find the failure in minutes instead of hours.
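One way to picture replay is an agent re-run whose tool calls are answered from the recorded run instead of live services. The sketch below assumes a hypothetical event shape (`tool`, `args`, `result`) and a hypothetical `ReplayTools` helper; it does not describe how `agnt5 replay` is implemented.

```python
# Hypothetical recording of one production run's tool calls.
recorded = [
    {"tool": "search", "args": {"q": "order 42"}, "result": "order 42: shipped"},
    {"tool": "lookup_eta", "args": {"order": 42}, "result": "unknown"},
]


class ReplayTools:
    """Answers tool calls from a recorded run instead of calling live services."""

    def __init__(self, events):
        self.events = list(events)
        self.cursor = 0

    def call(self, tool, **args):
        event = self.events[self.cursor]
        # If the code under replay asks for something the original run did not,
        # stop: the replay no longer matches what actually happened.
        if event["tool"] != tool or event["args"] != args:
            raise RuntimeError(f"replay diverged at step {self.cursor}")
        self.cursor += 1
        return event["result"]


def agent(tools):
    status = tools.call("search", q="order 42")
    eta = tools.call("lookup_eta", order=42)
    return f"{status}; eta {eta}"


answer = agent(ReplayTools(recorded))
```

Here the replay reproduces the original answer locally, including the unhelpful `unknown` ETA that would be the starting point for debugging.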
Fix a prompt and prove it works before it ships
Change a prompt version and set it active. Future runs pick up the new version without a redeploy. Re-run the production runs that failed against the updated prompt and score the results with built-in evaluators.
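The loop described above can be sketched in a few lines: keep prompt versions in a registry, re-run the failed inputs against a candidate version, score the outputs, and only then mark the candidate active. Everything here is a hypothetical stand-in, including the `prompts` registry, the `fake_model` stub used in place of a real model call, and the `one_word` evaluator.

```python
# Hypothetical prompt registry with one active version.
prompts = {
    "v1": "Answer the question: {question}",
    "v2": "Answer in one word: {question}",
}
active = {"version": "v1"}


def fake_model(prompt):
    """Stand-in for a real model call, so the example runs offline."""
    if "one word" in prompt:
        return "Yes"
    return "Yes, I believe that is correct."


def run(question, version=None):
    template = prompts[version or active["version"]]
    return fake_model(template.format(question=question))


def one_word(output):
    """Toy evaluator: passes only if the output is a single word."""
    return len(output.split()) == 1


# Inputs from runs that failed in production (made up for this example).
failed_runs = [
    {"question": "Is the order shipped?"},
    {"question": "Is the refund approved?"},
]


def score(version):
    passed = sum(one_word(run(r["question"], version)) for r in failed_runs)
    return passed / len(failed_runs)


before = score("v1")
after = score("v2")
if after > before:
    active["version"] = "v2"  # future runs pick up v2 without a code change
```

The point of the design is that the eval set is not synthetic: it is the set of inputs that already failed, so a passing score is evidence about the actual incident.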
Your first workflow, deployed before lunch.
Start free. Add a decorator and ship. Every run is journaled from day one, so when the first incident hits, you'll be glad it was recording.
from agnt5 import durable

@durable
def my_agent(query: str) -> str:
    # your agent logic
    return answer