Determinism — why workflows have rules

The contract on workflow code — replay must arrive at the same step calls in the same order — and how to keep your code on the right side of it.

Workflow code is deterministic by contract: given the same inputs and journal, it produces the same sequence of ctx.step(...) calls. Anything that varies between runs has to live inside a step, where its result is journaled.

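from datetime import datetime

from agnt5 import WorkflowContext, workflow

# load_rows and summarize are step functions assumed to be defined elsewhere.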

# WRONG — clock read in workflow body
@workflow
async def daily_summary_bad(ctx: WorkflowContext) -> str:
    today = datetime.utcnow().date()      # different value on replay
    rows = await ctx.step(load_rows, today)
    return await ctx.step(summarize, rows)


# RIGHT — clock read inside a step
@workflow
async def daily_summary_good(ctx: WorkflowContext) -> str:
    today = await ctx.step("today", lambda: datetime.utcnow().date())
    rows = await ctx.step(load_rows, today)
    return await ctx.step(summarize, rows)

The bad version replays differently on a Tuesday than it did on a Monday: load_rows was journaled with Monday’s date and is then re-called with Tuesday’s, so the runtime sees two different inputs at the same call site and raises a replay drift error. The good version journals the date as a step result, so replay reads the original Monday value and reaches the same load_rows call.

The mental model

Replay walks the workflow body and matches each ctx.step(...) call to a journal entry by call order. If your code reaches the same calls in the same order on every run, replay works. If the code’s behavior depends on something that changes between runs — a clock, a random number, a network response, the iteration order of a Python set — replay reaches different calls and the runtime cannot tell which journal entry belongs to which call site.
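
The call-order matching described above can be sketched with a toy journal. This is a simplified model for illustration, not the AGNT5 runtime: ToyContext, ReplayDriftError, the synchronous step(name, fn, *args) signature, and the journal layout are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class ReplayDriftError(RuntimeError):
    """Raised when a replayed call does not match the journal entry at its position."""


@dataclass
class ToyContext:
    # Each journal entry is (step_name, args, result), stored in call order.
    journal: list[tuple[str, tuple, Any]] = field(default_factory=list)
    cursor: int = 0      # position of the next step call
    live_runs: int = 0   # number of steps that actually executed

    def step(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        if self.cursor < len(self.journal):
            # Replay: the entry at this position must describe the same call.
            j_name, j_args, j_result = self.journal[self.cursor]
            if (j_name, j_args) != (name, args):
                raise ReplayDriftError(
                    f"call #{self.cursor}: journal has {j_name}{j_args}, "
                    f"code reached {name}{args}"
                )
            self.cursor += 1
            return j_result
        # Past the end of the journal: run the step live and record the outcome.
        result = fn(*args)
        self.live_runs += 1
        self.journal.append((name, args, result))
        self.cursor += 1
        return result


def workflow(ctx: ToyContext, day: str) -> str:
    rows = ctx.step("load_rows", lambda d: [d, d], day)
    return ctx.step("summarize", lambda r: f"{len(r)} rows", rows)


first = ToyContext()
workflow(first, "mon")                            # both steps run live

replay = ToyContext(journal=list(first.journal))
workflow(replay, "mon")                           # same calls, served from the journal
```

In this model, replaying with the same input executes no step a second time (replay.live_runs stays at 0), while calling workflow with "tue" against the "mon" journal raises ReplayDriftError at the first call, because the arguments no longer match the entry at that position.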

The fix is always the same: move the non-deterministic value into a step. Once it is journaled, replay reads the original value and the workflow body is deterministic again. ctx.step("name", lambda: ...) exists for exactly this purpose — it lets you wrap an arbitrary expression so its result is captured.

This contract is not enforced at compile time. No Python type system can prove a function is deterministic. Violations show up as replay-drift errors at runtime, often only when a worker crash forces a real replay. Treat the rule as a discipline; tests that simulate replay (worker restart mid-run) are the cheapest way to catch drift before production.
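
A replay test along these lines might look like the sketch below. It does not use any AGNT5 testing API: FakeContext, DriftError, read_clock, and replays_cleanly are hypothetical names, and the fake step(name, fn, *args) signature is a simplification. The fake clock returns a new value on every read, so any read outside a step is guaranteed to differ between the first run and the replay.

```python
import asyncio
import itertools
from typing import Any, Callable, Optional

_days = itertools.count(1)


def read_clock() -> int:
    """Stand-in for a clock read: a different value on every call."""
    return next(_days)


class DriftError(RuntimeError):
    pass


class FakeContext:
    """Hypothetical test double; not the agnt5 WorkflowContext."""

    def __init__(self, journal: Optional[list] = None) -> None:
        self.journal = journal if journal is not None else []
        self.cursor = 0

    async def step(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        if self.cursor < len(self.journal):
            # Replay: reuse the journaled result if the call matches.
            j_name, j_args, result = self.journal[self.cursor]
            if (j_name, j_args) != (name, args):
                raise DriftError(f"step #{self.cursor}: {j_name}{j_args} != {name}{args}")
        else:
            # First run: execute the step and journal it.
            result = fn(*args)
            self.journal.append((name, args, result))
        self.cursor += 1
        return result


async def summary_bad(ctx: FakeContext) -> str:
    today = read_clock()  # clock read in the workflow body
    return await ctx.step("load_rows", lambda d: f"rows for day {d}", today)


async def summary_good(ctx: FakeContext) -> str:
    today = await ctx.step("today", read_clock)  # clock read inside a step
    return await ctx.step("load_rows", lambda d: f"rows for day {d}", today)


def replays_cleanly(workflow_fn: Callable[[FakeContext], Any]) -> bool:
    """Run once to build a journal, then replay it on a fresh context."""
    first = FakeContext()
    original = asyncio.run(workflow_fn(first))
    try:
        replayed = asyncio.run(workflow_fn(FakeContext(list(first.journal))))
    except DriftError:
        return False
    return replayed == original
```

With these definitions, replays_cleanly(summary_good) returns True and replays_cleanly(summary_bad) returns False. This sketch replays the full journal; a closer simulation of a mid-run crash would replay only a prefix of it and let the remaining steps run live.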

Why it works this way

Determinism is the price AGNT5 pays for not persisting full process memory at each step. The runtime needs a stable mapping from “where am I in the recipe” to “what should I do next” — and the only sustainable mapping is: walk the recipe deterministically, match calls in order, read journaled outcomes for completed calls, run the next call live.

The alternatives (full memory snapshots, distributed transactions, or hash-based call-site identification) are each slower, more fragile, or both. The workflow-body constraint is small in practice (most logic is naturally deterministic) and explicit (you can see exactly which calls would violate it).

Edge cases and gotchas

  • Common offenders to move into steps:
    • time.time(), datetime.utcnow(), datetime.now(), any clock read
    • random.choice(...), random.random(), uuid.uuid4()
    • Network calls, file I/O, database reads
    • Reading environment variables that may change between runs
    • Iterating over a set (Python randomizes string hashes per process by default) or over a dict whose key insertion order differs between runs
  • Loops are fine; their bounds must be deterministic. for item in journaled_list: ... is safe — the loop count comes from a journaled value. for _ in range(some_random_count) is not.
  • Conditional ctx.step(...) calls are fine if the condition is deterministic. A branch whose condition reads a journaled value (or the workflow input) takes the same path on replay. A branch whose condition reads a clock or RNG does not.
  • In-process caches are a hidden source of drift. A module-level _cache: dict = {} populated during the original run is empty on a fresh worker. Any code that depends on cache state will reach different call sites. Caches must live inside steps if their values matter.
  • Replay drift errors point at the call site, not the source. When you see a drift exception, the offending non-determinism is somewhere upstream of the named step — the step itself is fine; the inputs reaching it differ from what was journaled.
  • agnt5 inspect trace shows the exact step sequence. When debugging suspected drift, compare the trace from the original run to the trace from replay. The first call site that differs is where the non-determinism lives.
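
The trace comparison from the last gotcha can be sketched as a small helper. The output format of agnt5 inspect trace is not shown here, so this example assumes each trace has already been converted by hand into a list of (step_name, args) tuples; first_divergence is a hypothetical name.

```python
from itertools import zip_longest
from typing import Optional, Sequence

Call = tuple[str, tuple]  # (step_name, args); an assumed trace representation


def first_divergence(original: Sequence[Call], replay: Sequence[Call]) -> Optional[int]:
    """Return the index of the first step call that differs, or None if identical."""
    missing = object()  # sentinel so a shorter trace counts as a difference
    for i, (a, b) in enumerate(zip_longest(original, replay, fillvalue=missing)):
        if a != b:
            return i
    return None


original_trace = [("today", ()), ("load_rows", ("2024-01-01",)), ("summarize", (3,))]
replay_trace = [("today", ()), ("load_rows", ("2024-01-02",)), ("summarize", (3,))]
index = first_divergence(original_trace, replay_trace)
```

Here index is 1: the load_rows call received a different date on replay, so the non-determinism lies in whatever produced that argument, upstream of the step itself.
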

Related

  • Event sourcing and replay: the mechanism that makes determinism necessary.
  • Workflows: where the determinism contract applies.
  • Functions: the host for non-determinism (functions are free to be as non-deterministic as you need).