Replay Semantics: How Deterministic Execution Actually Works

AGNT5 Team
4 min read

A walk through what happens when an AGNT5 run resumes after a crash — the journal, memoization, and the contract the SDK upholds.

A workflow that “resumes after a crash” sounds like magic until you own one. The hard part is not catching the crash. The hard part is guaranteeing that when the worker picks the run back up, the second execution arrives at the same state as the first — without re-running the LLM call that cost $2.30 the first time around.

AGNT5 handles this with a pair of ideas that are boring on their own and powerful together: an append-only per-run journal, and a strict contract that every non-deterministic operation must be wrapped in ctx.step().

Here is the contract, as it looks from a Python worker:

from agnt5 import workflow, WorkflowContext

@workflow
async def research_report(ctx: WorkflowContext, topic: str) -> dict:
    # Each step is memoized by its name within this run.
    sources = await ctx.step("gather_sources", lambda: search_web(topic))

    summaries = await ctx.step(
        "summarize_each",
        lambda: summarize_documents(sources, model="gpt-4o-mini"),
    )

    report = await ctx.step(
        "write_report",
        lambda: draft_report(topic, summaries, model="gpt-4o"),
    )

    return {"topic": topic, "report": report, "source_count": len(sources)}

Three steps, three names. Those names are load-bearing.

What the runtime actually records

When the worker first executes gather_sources, it sends a step_started entry to the coordinator, which appends it to the run’s journal. The SDK runs the lambda, captures the result, and sends a step_completed entry carrying the output and token/cost metadata. Both entries are durably synced into the segment’s RocksDB WAL before the coordinator acknowledges. Same pattern for the other two steps.
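
To make the journal concrete, here is roughly what those two entry types carry. The field names are illustrative, not the wire format:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class StepStarted:
    run_id: str
    step_name: str        # e.g. "gather_sources"
    execution_epoch: int  # which attempt wrote this entry

@dataclass
class StepCompleted:
    run_id: str
    step_name: str
    output: Any                                   # the serialized step result
    metadata: dict = field(default_factory=dict)  # tokens, cost, latency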

If the worker process dies halfway through write_report, the journal still contains completed entries for gather_sources and summarize_each. It contains a step_started for write_report but no matching step_completed.
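
Laid out in order, the journal at the moment of the crash reads schematically like this:

1  step_started    gather_sources
2  step_completed  gather_sources   (output + metadata)
3  step_started    summarize_each
4  step_completed  summarize_each   (output + metadata)
5  step_started    write_report     (no completion follows)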

What happens on resume

The coordinator detects the dead worker, picks a new one, and dispatches the run with an execution_epoch bumped by one. The SDK does not care that this is a resume. It calls the same workflow function with the same input.

When the SDK hits ctx.step("gather_sources", ...), it looks the step name up in the journal. It finds a step_completed entry. It returns the cached output without running the lambda. Same for summarize_each. When it reaches write_report, the journal has a step_started but no completion, so the SDK runs the lambda fresh and appends a new completion.
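
The decision itself is small. Here is a minimal sketch of that lookup, with a plain dict standing in for the per-run journal and the step_started bookkeeping omitted; the real SDK internals will differ:

import inspect

async def replay_step(journal: dict, name: str, fn):
    # Replay path: a completed entry pins this step's output.
    if name in journal:
        return journal[name]

    # Live path: run the closure and append the completion.
    result = fn()
    if inspect.isawaitable(result):  # closures may be sync or async
        result = await result
    journal[name] = result
    return result

On the resumed run, the journal already holds gather_sources and summarize_each, so only the write_report closure actually executes.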

From outside the workflow, the run looks like one execution. From inside, it took two — but the second one skipped the expensive parts.

The rules the author signs up to

Memoization works because the author accepts three constraints:

  1. Non-determinism lives inside steps. Anything that talks to the outside world — LLMs, HTTP, databases, clocks, random — goes through ctx.step, ctx.sleep, ctx.now, or a similar typed primitive. Naked time.time() in the body of a workflow will produce different values on replay, and the replayed workflow will diverge from the journal (a before-and-after sketch follows this list).

  2. Step names are stable. If you rename summarize_each to summarize_documents, an in-flight run can no longer find its cached output. It will re-execute. This is the version hash problem in miniature — rename steps only in versioned deploys.

  3. Workflow code is a plan, not a payload. The workflow function describes the ordering of steps. The actual work — network calls, model inference, database writes — happens inside the step closures. If you find yourself doing real work in the workflow body, pull it into a step.
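
Here is that divergence in miniature, assuming ctx.now is awaitable like ctx.step; the workflow name is made up:

from agnt5 import workflow, WorkflowContext

@workflow
async def audit_run(ctx: WorkflowContext) -> dict:
    # WRONG: started_at = time.time()
    # The clock runs again on replay and returns a new value, so the
    # replayed workflow diverges from what the journal recorded.

    # RIGHT: the first execution pins the value in the journal, and
    # every replay reads it back instead of asking the clock again.
    started_at = await ctx.now()
    return {"started_at": started_at}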

What “deterministic” means here

We do not replay an LLM call and expect it to produce the same tokens. Non-deterministic operations are not replayed — they are memoized. The guarantee is that the workflow function traverses the same control flow, in the same order, with the same step outputs, on any number of re-executions. The LLM was called once, its output is pinned in the journal, and every subsequent replay of the workflow sees that pinned output.
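
You can watch that guarantee hold with the toy replay_step from earlier. Two executions against the same journal produce identical outputs, even though the closures are random:

import asyncio
import random

async def flaky_workflow(journal: dict) -> tuple:
    a = await replay_step(journal, "roll", lambda: random.random())
    b = await replay_step(journal, "roll_again", lambda: random.random())
    return (a, b)

journal = {}
first = asyncio.run(flaky_workflow(journal))
second = asyncio.run(flaky_workflow(journal))  # pure replay
assert first == second                          # outputs are pinned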

This is the same model Temporal, Restate, and Inngest use. The terminology varies — “history,” “journal,” “state” — but the shape is identical. Record inputs and outputs of effectful calls. Skip them on replay. Let pure control flow run free.

Why this matters in practice

The practical payoff shows up when you need to change a prompt mid-run. With replay semantics in place, you can checkpoint a production workflow at step 12, change the prompt at step 13, and finish the run under new behavior — no custom checkpointing code, no lost context. You can rerun a historical workflow against a new model and compare outputs, step by step. You can deploy a hotfix to a worker without draining in-flight runs, because they will resume on the new worker from wherever they left off.

Durable execution is not about avoiding failures. Failures happen on a long enough timeline. It is about making the second, third, or tenth attempt cheap and correct — which is mostly a matter of writing the journal down and reading it back.