Building the Python SDK on a Shared Rust Core

AGNT5 Team
5 min read

Why the AGNT5 SDKs share a Rust implementation under the hood, what lives on which side of the FFI boundary, and what it costs.

Most platforms that support multiple SDK languages pick one of two patterns. They either write each SDK from scratch in the target language — full duplication, no shared code — or they write a single SDK in one language and ship the others as thin HTTP clients. Both have costs. Full duplication means every bug gets fixed N times. Thin HTTP clients mean you rebuild flow control, reconnection, and backpressure in every language, badly.

The AGNT5 SDKs take a third path. The protocol, transport, and connection management live in a Rust core crate. Every language SDK — Python, TypeScript, Go, Java, Kotlin — is a thin idiomatic wrapper over that core, invoked through language-native FFI.

The split is not subtle. Here is how the Python package resolves a call:

from agnt5 import function, FunctionContext

@function
async def analyze(ctx: FunctionContext, doc_id: str) -> dict:
    content = await ctx.step("fetch", lambda: fetch_doc(doc_id))
    summary = await ctx.step(
        "summarize",
        lambda: summarize(content, model="gpt-4o-mini"),
    )
    return {"doc_id": doc_id, "summary": summary}

The @function decorator is Python. The FunctionContext is a Python class. The ctx.step call serializes its arguments and hands them across a PyO3 boundary to Rust, which owns the connection to the coordinator, writes the journal entry, and returns the memoized output (if any) back across the boundary.

Roughly: Python handles decorators, async semantics, Pydantic types, and user code. Rust handles the wire.
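A minimal sketch of that handoff, in Python, may make the split concrete. Everything here is illustrative: the `_core` module below is a stand-in for the compiled PyO3 extension, and its `step_lookup`/`step_record` names are assumptions, not the SDK's real binding surface.

```python
import asyncio
import functools
import json

class _FakeCore:
    """Stand-in for the PyO3 extension module (_core.abi3.so in the real SDK).

    The real core owns the coordinator connection and durable journal;
    this fake just memoizes step results in a dict.
    """
    def __init__(self):
        self._journal = {}

    def step_lookup(self, name: str):
        # Return the memoized JSON payload if this step already ran.
        return self._journal.get(name)

    def step_record(self, name: str, payload: str):
        # The real core writes a journal entry over gRPC here.
        self._journal[name] = payload

_core = _FakeCore()

class FunctionContext:
    async def step(self, name, fn):
        cached = _core.step_lookup(name)   # cross the FFI boundary
        if cached is not None:
            return json.loads(cached)      # replay: memoized output, user code skipped
        result = fn()
        if asyncio.iscoroutine(result):
            result = await result          # user code runs entirely in Python
        _core.step_record(name, json.dumps(result))
        return result

def function(fn):
    # Python owns the decorator; it only injects a context.
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        return await fn(FunctionContext(), *args, **kwargs)
    return wrapper
```

The shape is the point: the only things crossing the boundary are a step name going in and a serialized result coming back.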

Why the boundary sits where it does

The boundary is not arbitrary. It is drawn to keep the hard-to-get-right parts on one side and the easy-to-write-idiomatically parts on the other.

Protocol buffers and gRPC live in Rust. Generating proto code in five languages and keeping them all consistent is a maintenance tax we did not want to pay. Rust’s prost and tonic handle every wire interaction. Language SDKs see opaque request/response handles.

Connection management lives in Rust. The worker coordinator connection is a long-lived bidirectional stream with keepalive, reconnect, and backpressure logic. Writing that once in Rust — with the concurrency primitives tokio gives us — is far more reliable than writing it five times in five languages with five different runtime models.
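The shape of that logic, sketched in Python purely for illustration — the real implementation is Rust on tokio, and the `connect` callable and backoff constants here are assumptions:

```python
import asyncio
import random

async def run_connection(connect, handle_message, *,
                         initial_backoff=0.5, max_backoff=30.0):
    """Maintain a long-lived stream, reconnecting with jittered exponential backoff.

    `connect` is assumed to return an async iterator of inbound messages and
    to raise ConnectionError on failure.
    """
    backoff = initial_backoff
    while True:
        try:
            stream = await connect()
            backoff = initial_backoff          # healthy connection: reset backoff
            async for msg in stream:
                await handle_message(msg)      # no read-ahead: natural backpressure
        except ConnectionError:
            # Jittered exponential delay before the next dial attempt.
            await asyncio.sleep(backoff * (0.5 + random.random() / 2))
            backoff = min(backoff * 2, max_backoff)
```

Writing this once, against one runtime model, is exactly the argument for keeping it in the core.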

Framework integration lives in each language. The Python SDK has a RuntimeAdapter that handles ASGI, Flask, and standalone worker modes. The TypeScript SDK has a different adapter that handles Node, edge runtimes, and serverless environments. These are not portable concerns. A single Rust core cannot know how a Python worker wants to plug into FastAPI’s lifecycle. So it does not try.
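To make "not portable" concrete, here is what one adapter mode could look like. The `RuntimeAdapter` name comes from the SDK, but this implementation and the `start()`/`stop()` worker surface are assumptions for illustration:

```python
import asyncio

class RuntimeAdapter:
    """Illustrative sketch: bind a worker's lifetime to a host framework."""

    def __init__(self, worker):
        self.worker = worker  # assumed to expose async start()/stop()

    def as_asgi(self):
        # ASGI lifespan protocol: start the worker on startup, stop on shutdown.
        # This is the hook FastAPI/Starlette use; nothing here could live in Rust.
        async def app(scope, receive, send):
            if scope["type"] != "lifespan":
                return
            while True:
                event = await receive()
                if event["type"] == "lifespan.startup":
                    await self.worker.start()
                    await send({"type": "lifespan.startup.complete"})
                elif event["type"] == "lifespan.shutdown":
                    await self.worker.stop()
                    await send({"type": "lifespan.shutdown.complete"})
                    return
        return app

    def run_standalone(self):
        # Standalone mode: the worker owns the event loop.
        async def main():
            await self.worker.start()
            # real code would block on a shutdown signal here
            await self.worker.stop()
        asyncio.run(main())
```

The Flask and serverless modes would be different glue around the same worker, which is the whole reason this layer stays in each language.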

This is the principle: the boundary between the Rust core and each language SDK is a simple message-handler interface. The core calls into the language to invoke user code, and the language calls into the core to write journal entries and talk to the coordinator. Nothing more.
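That interface can be written down. A sketch in Python type terms — the names are assumptions, and the real contract lives on the Rust side — but the narrowness is the point:

```python
from typing import Optional, Protocol

class Invocation(Protocol):
    """What the core hands the language when user code must run."""
    handler_name: str
    payload: bytes

class CoreHandle(Protocol):
    """What the language may ask of the core: journal writes, coordinator I/O."""
    def journal_append(self, step_name: str, output: bytes) -> None: ...
    def journal_lookup(self, step_name: str) -> Optional[bytes]: ...

class LanguageHandler(Protocol):
    """What the core may ask of the language: invoke user code. Nothing more."""
    async def invoke(self, invocation: Invocation, core: CoreHandle) -> bytes: ...
```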

What FFI costs

PyO3 is fast. For a typical ctx.step call, the FFI overhead — marshaling the step name, wrapping the result handle, releasing the GIL — is under a hundred microseconds on commodity hardware. The actual journal append (network + durable write) is 3–15ms depending on the deployment mode. The FFI is noise against the wire.
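If you want to check that number in your own environment, a per-call timing harness is enough. The lambda below is a stand-in; to measure the real thing you would swap in a cheap call on the actual binding:

```python
import time

def per_call_overhead_us(fn, iterations=50_000):
    """Average cost of calling fn once, in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e6

# Stand-in for a no-op boundary call; replace with a real _core call to
# measure actual FFI cost (marshaling, handle wrapping, GIL release).
overhead_us = per_call_overhead_us(lambda: None)
```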

It is not noise for tight loops. If a Python worker calls the core a million times a second, the GIL and the marshaling start to matter. We do not see that pattern in practice — a worker at a thousand steps per second is already doing a lot of work — but it is the upper bound. Someone writing a high-throughput ingestion worker in Python should understand that the FFI is there.
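One pattern that amortizes the crossing — illustrative only, not a documented SDK API — is buffering on the Python side and flushing N entries per FFI call:

```python
class BatchingJournal:
    """Buffer journal writes in Python, flush them across the boundary in bulk.

    `core_append_batch` stands in for an assumed core entry point that
    accepts many (step_name, output) entries in one crossing.
    """

    def __init__(self, core_append_batch, batch_size=256):
        self._append_batch = core_append_batch
        self._batch_size = batch_size
        self._pending = []

    def append(self, step_name, output):
        self._pending.append((step_name, output))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._pending:
            self._append_batch(self._pending)  # one FFI crossing for N entries
            self._pending = []
```

Whether batching is acceptable depends on durability requirements — buffered entries are not journaled until flush — so this is a throughput/latency trade, not a free win.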

The other cost is build complexity. Shipping a Python wheel that embeds a Rust extension means a multi-arch CI pipeline, musl and glibc builds, and manylinux compliance. We use maturin and a matrix of cross-builds. It works, and the wheels on PyPI install with a single pip install agnt5 like anything else, but the pipeline has more moving parts than a pure-Python package would.
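Much of that is declared once in packaging config. A representative pyproject.toml fragment for a maturin-built wheel — version pins and the feature name are illustrative, not AGNT5's actual config:

```toml
[build-system]
requires = ["maturin>=1.0"]
build-backend = "maturin"

[tool.maturin]
# Build an abi3 wheel so one binary covers many CPython versions.
features = ["pyo3/abi3-py38"]
```

The abi3 setting is what lets a single _core.abi3.so serve every supported Python minor version instead of one wheel per version.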

What the developer sees

None of this, ideally. The Python SDK has Python semantics — async functions, Pydantic models, decorators, context managers, typed exceptions. A user never imports a Rust-looking type or thinks about a .so file. The _core.abi3.so sits quietly next to __init__.py and does its job.

Typed state in entities uses Pydantic. Logging flows through the standard logging module. Exceptions are real Python exceptions with tracebacks that stop at the boundary — the Rust side cannot give you a Python traceback, so we materialize one on re-entry when a step raises.
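A sketch of that materialization step — the error-payload shape and exception names here are assumptions, not the SDK's real types:

```python
class StepError(Exception):
    """Base for errors surfaced from the core."""

class RetryableStepError(StepError):
    """An error the platform may retry."""

# Assumed mapping from a wire-level error kind to a Python exception type.
_ERROR_TYPES = {
    "retryable": RetryableStepError,
    "terminal": StepError,
}

def raise_from_core(payload: dict) -> None:
    """Rebuild a Python exception from the error payload the core hands back.

    The Rust side cannot produce a Python traceback, so the traceback the
    user sees starts here, at the boundary.
    """
    exc_type = _ERROR_TYPES.get(payload.get("kind"), StepError)
    exc = exc_type(payload.get("message", "step failed"))
    # Attach remote context for debugging without faking a local stack frame.
    exc.remote_step = payload.get("step")
    raise exc
```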

The guarantee we are chasing is consistency: a workflow written against the Python SDK should have identical semantics if you rewrite it against the TypeScript SDK. Step names replay the same way. Retry policies behave the same way. Entity locking works the same way. That consistency is only possible because one Rust core is deciding what those semantics mean — the language wrappers cannot drift, because they do not implement them.

The tradeoff we accepted

A shared Rust core means a change in the protocol needs a release of the core, a release of every language SDK, and (for Python and TypeScript) a rebuild of native binaries. That is slower than bumping a constant in a single-language SDK. We accept that slowness because the alternative is feature drift: a capability that lands in the Python SDK in March, the TypeScript SDK in June, and the Go SDK never.

AI workloads are polyglot. Backends are Python. Edge functions are TypeScript. Platform code is Go. If the durable-execution SDK works cleanly in exactly one of those and adequately in the others, people will route around it. We want the opposite — the SDK to be a good choice in whichever language the team already writes.

One core, many wrappers. The boring choice, done carefully.