Agents SDK Adds Durable Harness for Long-Running Agent Work

Steve CoffeyOpenAIThursday, May 28, 202617 min read

OpenAI’s Steve Coffey and Nish Singaraju present the updated Agents SDK as a way to move long-running agent work out of hand-built orchestration loops and into a model-native harness. Their case is that production agents increasingly need durable state, file-system access, tools, skills, sandboxing, and resumability, while the actual compute environment should remain replaceable and ephemeral. Coffey distinguishes this from one-shot Responses API calls and hosted shell use, arguing that the SDK is meant for agents operating across files, systems, and multi-step workflows.

The Agents SDK moves long-running agent work out of brittle orchestration code

Steve Coffey framed the updated Agents SDK around a practical gap: models can now work for much longer stretches, but production agent infrastructure is still difficult to build and maintain. Codex is the reference case: an agentic coding tool that can complete features, refactors, migrations, and other end-to-end software tasks. Users may have seen Codex run for minutes or an hour, he said, while internally people have gotten Codex to run “for days up to a week on tasks.”

The same pattern is appearing beyond coding. OpenAI has a Codex-based security agent that scans repositories and dependencies for vulnerabilities, including in legacy software. It also has an internal data analysis agent connected to data lakes, letting employees ask questions such as how many people are using the Agents SDK or how many Responses API requests occurred two days earlier. Coffey contrasted that with a previous workflow where he might have spent an hour writing SQL; now, he said, it can be “a simple prompt” followed by an answer a few minutes later.

The SDK update is a response to the hard parts of building production agents: maximizing performance across models, keeping agents “in distribution” with the harnesses models are trained around, managing runtime state, and making the framework flexible enough for an application’s own data, tools, and workflows.

Coffey described the older direct-to-LLM pattern as a hand-built loop: receive a request, route to a model, inspect whether the model called a tool, execute the tool, update context, and repeat until the model stops calling tools. Around that loop, teams often add web search, file search, MCP, code execution, skills, message handling, context management, and a tool manager. The SDK’s job is to absorb that common runtime layer so teams can spend more of their time on product-specific behavior.

You might be layering in stuff in addition to that like web search, file search, MCP, kind of like spending most of your time building this orchestration layer around this loop where actually you should be spending most of the time building products.

Steve Coffey · Source

The updated Agents SDK provides what Coffey called a “model-native harness” that handles the loop, tools, and durable execution while still allowing developers to plug in their own tools and infrastructure. Its built-in tool surface includes web search, file search, MCP, code interpreter, skills, remote MCP, computer use, function calling, hosted or networked containers, and shell. OpenAI’s materials also listed sub-agents as “coming soon.” The SDK also supports server-side compaction, agent memory, text, and voice modalities.

Coffey emphasized that the SDK remains open source and customizable. It has been open source since its release, he said, and is “model agnostic” in the sense that other model providers can be used if they support the Responses API format. But he also argued that harnesses are becoming more important to model performance. Because models are increasingly tailored to the harnesses they are trained with, Coffey said it is “very very possible” that the best performance will come from the harnesses those models were trained with.

Responses API, hosted shell, and SDK agents solve different runtime problems

The SDK is not the right abstraction for every job. Coffey separated direct Responses API use, hosted shell, and harness-based agents by the amount of runtime responsibility a developer wants the system to take on.

The Responses API remains appropriate for a wide range of bounded tasks. Coffey said it is useful when a team is working in a language without a preferred agentic framework — he named Elixir and Haskell as examples — and wants to build directly against the API. It is also well suited to one-shot tasks such as translating a document or turning unstructured data into JSON. For those cases, hosted tools can provide “really light agentic loops” in a single API call.

The hosted shell tool is the middle ground. It lets a model run commands in an isolated container through the Responses API. Coffey described it as a lightweight version of the Agents SDK: make an API call, provide files, let OpenAI spin up a container, allow the model to write code or perform transformations inside it, and return results. OpenAI positioned hosted shell for CLI-grade transformations such as CSV, JSON, and log processing; build and test steps; linting; repository analysis; and deterministic workflows with mounted files and skills.

There are two hosted shell modes. In auto mode, the API creates a fresh container per request using container_auto. In reusable mode, the developer creates a container once and uses it across requests with container_reference. Coffey also described a containers endpoint where a developer can create a container, upload files, spin it up, and attach it to a Responses API request.

Harness-based agents are for longer-lived work where the agent needs durable state, files, tools, sandbox lifecycle management, and resumability. That is where the SDK’s SandboxAgent, snapshotting, manifests, function tools, skills, and sandbox providers become load-bearing.

Pattern	Runtime responsibility	Best fit described by Coffey
Responses API	Single request or developer-managed loop	One-shot tasks such as translation or structuring unstructured data into JSON
Hosted shell	Responses API runs commands in an isolated container	Bounded CLI-grade work such as transformations, linting, build/test steps, or repository analysis
SandboxAgent	Agents SDK manages sandbox lifetime, snapshots, tools, and resumability unless the developer owns the sandbox	Longer tasks over files, tools, state, and multi-step workflows

Coffey separated one-shot API use from hosted shell containers and full SDK-managed sandbox agents.

OpenAI has also added network controls for hosted containers. Hosted containers can make outbound network requests only when networking is enabled for the organization and, optionally, allowed at the container or request level. Network access can be controlled through organization-level policy in the Platform Dashboard, API-level allowlists scoped to a container or request, and validation rules that can further restrict access but cannot widen it.

Separating the harness from compute is the architectural move

The update Coffey said he was “most excited about” is the separation of the harness from the compute environment. In the older or simpler pattern, the agent loop and the file system it operates on are colocated. Codex running on a developer’s laptop is the easy mental model: the harness, tools, shell, files, and execution environment are all together.

That arrangement becomes harder in production. If the agent loop and file system live inside the same container, the sandbox becomes load-bearing. If the container dies, expires, or disappears, so can the state needed to continue the task. It also complicates secrets management. Developers ideally do not want secrets in an untrusted coding sandbox, Coffey said, because that creates exposure to prompt-injection or exfiltration risks.

If you kind of split those things up, then you can treat the sandbox as this totally ephemeral thing that you don't really have to worry if it lives or dies, it can expire, go away.

Steve Coffey · Source

The updated SDK instead treats the sandbox as ephemeral. The harness can run wherever the developer’s backend already runs — Coffey mentioned Temporal and AWS as examples — while the sandbox can run in a provider such as E2B, Modal, Cloudflare, Vercel, Bloxal, Daytona, Docker, or a local machine during testing. The sandbox can execute commands and read and write files, but trusted access to databases, APIs, and secrets can remain on the server side.

This separation changes failure handling. If a sandbox goes away, the SDK can spin up a new one, restore the file system from a snapshot, and continue the run. The model, in Coffey’s description, is unaware that it is operating in a fresh container: after rehydration, the file system appears the same as when the prior sandbox stopped.

The SDK supports both SDK-owned and user-owned sandbox lifecycles. In the SDK-owned pattern, the developer provides a recipe for creating a sandbox — for example, a Docker client or E2B client — and the SDK starts the sandbox for a turn, then tears it down afterward. Coffey said this is the default, partly to avoid developers accidentally leaving many cloud sandboxes running. In the user-owned pattern, the developer creates a sandbox out of band and passes its reference to the agent. The SDK then does not automatically tear it down, allowing a team to keep a sandbox warm across multiple turns.

The practical claim is not that every agent needs a heavyweight runtime. It is that long-horizon agents operating over files, tools, and state need a durable harness that is not itself trapped inside the sandbox where untrusted work is occurring.

The Codex-style harness brings shell, patching, compaction, skills, and memory into the SDK

Coffey described the updated harness as “Codex-style,” but not identical to the Codex product. The SDK brings over several patterns that make Codex useful: shell interaction, file editing, compaction, skills, and memory.

One central element is the shell loop. Coffey described it as an asynchronous shell interaction loop: the model can write a command, wait for it, leave it running, do something else, and return later. The harness keeps track of the commands currently running. That differs from a naïve single-call tool loop where a model requests a command, blocks, and waits. The Codex-style loop is designed for longer-running work where inspection, execution, and editing happen repeatedly.

Another default capability is file-system access. The SDK’s default capability set includes the ability for the model to view images and use apply-patch tools to edit files inline. The shell capability is a combination of two tools, according to Coffey’s explanation: one to execute commands and another to write to standard input. Compaction allows the model to keep going when it exceeds the context window by compacting its context, which Coffey said is what lets agents continue working for very long periods — “hours, weeks technically,” with no fixed limit described.

The SDK also adds first-class support for skills. A skill is a versioned bundle of files with a required SKILL.md manifest. The manifest frontmatter defines the skill’s name and description, including when and how to use it. Skills can include scripts, templates, data folders, and other resources that help the model complete a specific workflow.

Coffey’s example was a tax-prep skill for tax year 2025: the bundle might include rules the agent needs to know, scripts for processing documents, and logic to fill out a 1040. In the new Skills API, such a bundle can be uploaded, versioned, assigned a default version, and mounted into a hosted shell environment. OpenAI showed a curl example uploading a zip file to /v1/skills and receiving a skill ID, creation time, name, description, default version, and latest version.

The Skills API supports several reference patterns: a skill reference by ID, optionally pinned to a version; a curated skill by name, such as “spreadsheets,” described as audited and safe to mount; an inline base64 zip for quick packaging without persistent upload; and local skills bound from directories without upload.

Coffey also argued that GitHub is a strong place to store skills. In the task-tracker implementation, he used a public GitHub repository for a “Conference Program Editor Skill,” then configured the agent’s skills capability to load from that repo and main ref. His reason was operational: Git gives teams version control, pull requests, and review workflows for skill changes. The SDK, he said, has first-class support for loading skills from Git repositories, with GitHub as the default host but not the only possible one.

Runtime portability means changing providers without changing the agent’s job

Coffey used a small “Conference Launch Desk” app for a fictional conference to demonstrate the same runtime moving across local and cloud infrastructure. The important pattern was that a SandboxAgent could move from local Docker to Modal, and from local file-system snapshots to Cloudflare R2 snapshots, while preserving the agent’s working state.

The agent was defined as a SandboxAgent called “Program Editor,” using gpt-4.5-turbo and a set of program editor instructions. Coffey described SandboxAgent as a new type that subclasses the existing Agent class, so many of the same parameters from the existing Agents SDK still apply. Assigned to inspect files and edit them for clarity, the SDK created a Docker sandbox using a Python 3.12 image, uploaded the task files, and restricted the agent to the relevant workspace. The agent inspected files, executed commands, and returned a log of its work.

The local Docker path established the basic lifecycle. A Docker sandbox provider supplied a DockerSandboxClient and image setting. When a task stopped, the SDK stopped the container and snapshotted the file system to a developer-defined location. In Coffey’s local setup, snapshots were tar archives on his laptop. When a task resumed, the SDK retrieved the tarball, started a new container if needed, and rehydrated the file system. The point was that the model would see the same workspace even though the underlying container might be new.

The same runtime then moved to Modal. Coffey changed the provider to return a ModalSandboxClient and options, still using the Python 3.12 image. Snapshots moved from his local machine to Cloudflare R2. Modal hosted the sandbox; R2 stored the snapshot tar files. The SDK handled applying the workspace files to the Modal sandbox and later storing the resulting file-system state in R2.

This is the portability argument: the sandbox provider and snapshot store can be changed independently. Docker and local tarballs are useful for local development. Modal and R2 are a cloud deployment pattern. Coffey also named E2B, Cloudflare, Vercel, Daytona, and Bloxal as supported sandbox options, and said teams can bring their own implementation.

The same portability applies to skills and capabilities. Coffey added a Skills capability that loaded a skill from sdoffey/conference-program-editor-skill at the main ref, then asked the agent to read from its skill file. The capability object, as he described it, can bundle tools, additional instructions, manifest changes, and setup that should be placed into the computer when it starts. In other words, runtime portability is not only about where the container runs; it is also about packaging the agent’s working environment so it can be recreated across providers and sessions.

Manifests make the agent workspace explicit

The SDK’s manifest object defines the file system shape that should exist when an agent starts. In the task-tracker implementation, the manifest created a directory tree with folders such as original, working, output, and handoffs. Coffey described it as a simple class for saying: this is what the file system should look like when the sandbox spins up, and this is how files are attached to a task in a way that can be shared across tasks or agents.

The manifest can do more than create empty directories. It can copy files from wherever the harness is running, attach an R2 bucket, S3 bucket, Azure Blob Storage account, GitHub repository, or other source, and place those files into the sandbox. When the agent starts, the SDK renders a version of the file tree to the model, giving it an immediate view of its workspace so it does not need to spend as much time discovering the file system with ls or grep. Developers can also add descriptions to manifest entries, though Coffey did not go deeply into that.

The storage strategy can change without changing the agent’s conceptual workspace. Coffey first used uploaded files that were copied into the sandbox. Later, file attachment handling changed so uploaded files were stored directly in R2 rather than copied first to his laptop. The manifest then used an S3Mount for the task input folder. Although the bucket was Cloudflare R2, the S3 mount object worked because R2 and S3 APIs are compatible.

With Modal, the setup used a provider-native cloud bucket mount strategy. For more generic containers, Coffey said the SDK provides Rclone and FUSE options out of the box. The result is that a sandbox can read and write to a mounted volume backed by R2 as if the files were local. Coffey noted that this is opaque to the model: it operates over the mounted path like a normal file system even though the data may be coming over the network.

The tradeoff is between startup time, runtime performance, and freshness. Copying all files into the sandbox may make reads faster during execution, but startup is slower, especially for large file sets. Mounting external storage can make startup faster and preserve freshness, but runtime file operations may incur network latency. Nish Singaraju made the same distinction: copying a large blob into the sandbox slows startup but gives the model faster file reads during runtime; mounting makes startup very fast but requires reading files over the network.

This matters most when the agent needs access to many files or frequently changing data. Coffey gave the example of hundreds of PDFs that would be cumbersome to copy into each sandbox, or data with a strong freshness constraint where copying could make the data stale before the agent starts working.

Function tools are how the agent changes application state

The file system is only one side of the runtime. The other is application behavior: the agent needs controlled ways to act on the product around it. In Coffey’s task-tracker implementation, that meant three application-level tools: update_status, update_assignee, and search_assignees. These let the agent change task state, assign work to people or agents, and search possible assignees.

Function tools work the same way they have in the Agents SDK. In Python, they are ordinary functions decorated with a function-tool decorator; in TypeScript, they are TypeScript functions. The decorator lets the SDK translate the function into the API representation the model sees, including parameters. When the API returns an instruction to call a tool, the SDK routes that instruction to the local function, calls it, and sends the response back to the model. The developer does not need to implement the function-calling loop manually.

That pattern turns a file-editing sandbox into a controlled application actor. When Coffey asked the agent to assign a task to Steve and mark it ready for review, the application state changed through the exposed tools. The agent did not get direct uncontrolled access to the application database; it used the functions the developer made available.

The function-tool interface also supports more control than simply exposing a callable. Developers can override the name the model sees, provide descriptions to steer tool use, add guardrails around specific argument combinations, add hooks into execution flow, set timeouts, and dynamically enable or disable tools, for example based on a feature flag.

Tool-call approvals add a human-in-the-loop gate for sensitive operations. Coffey configured update_status with a needs_approval function, requires_completion_approval, which returned true when the requested status was DONE. His reason was straightforward: “I’m not gonna let you mark something as done without taking a quick look at it first.” When he asked the agent to mark a task done, the activity feed showed an approval request for update_status({"status": "done"}) with decline and approve buttons. Approving it resumed the flow and allowed the status change.

The same application-tool pattern also supported a simple handoff between agents. Coffey created a task asking the Program Editor Agent to refine content and then assign it to an Asset Producer Agent. The intended flow was that the Program Editor would do its work, reassign the task, and then the Asset Producer would create assets and place them back into the sandbox. This was not described as a full multi-agent framework, but as an example of agents coordinating through application state, assignments, and a shared workspace.

Persistence covers both the file system and the rollout

The SDK’s pause-and-resume story has two state components. One is the file system: the sandbox contents, captured as a snapshot and restored into a later sandbox. The other is the run state: the messages in the conversation so far, or what Coffey called the rollout, which the agent needs in order to continue.

The SDK manages both. Coffey said the snapshot mechanism works out of the box; the developer defines where snapshots should be stored, such as local disk by default or R2 in the cloud. The SDK can take one sandbox’s file system, zip it or tar it, store it elsewhere, and restore it into another sandbox later. The run state can be stored as a JSON object in the developer’s database, including the full rollout and a pointer to the snapshot location. Reloading that object lets the agent resume “losslessly,” including in a multinode system sharing state through a database.

Essentially we store the rollout and we also snapshot the file system which is a little bit unique compared to other SDKs.

Nish Singaraju · Source

Singaraju said that combination should make it possible to resume very long-running, even day-long agents. Coffey’s view was that most of the engineering work should then shift away from orchestration and toward product-specific decisions: which tools the agent should have, which skills it needs, and what context should be available. Starting, stopping, and resuming tasks are intended to be mostly out-of-the-box behavior.

The SDK is close to Codex, but not the Codex binary

Asked to compare the SDK’s code-executing harness with the actual Codex harness, Nish Singaraju said there are “small differences” between the Codex binary and the Agents SDK. The SDK includes many core Codex-style capabilities: bash, Codex-style file editing, skills, and memories. But the Codex binary has some features that are still to be built into the SDK.

One difference Singaraju named was Codex’s multi-threaded behavior: the binary can run multiple sub-agents. The Agents SDK has handoffs, which he said can produce something similar, but the SDK is not identical to Codex. His overall position was that developers should be able to do “a lot of the things” possible in the Codex binary, and that the Agents SDK is “very much in distribution” to the Codex harness today.

Steve Coffey also addressed broader multi-agent coordination. Asked whether a supervisor agent could monitor and coordinate hundreds of specialized agents simultaneously, he said yes, though it would take work today. He said more multi-agent framework work is coming in the “near weeks and months,” which should make the pattern easier. Existing coordination patterns include giving a parent agent a tool to inspect sub-agent status and check progress, or having agents communicate through a shared file system, database, or other shared state.

Coffey’s forward-looking claim was that “massively parallel work” will become “the default in the near future.” In the current SDK, the building blocks he pointed to are handoffs, shared state, file-system coordination, databases, and parent-agent tools for checking child-agent progress — not a fully packaged supervisor-agent system.

AI Application Architecture Inference and Deployment Agents and Autonomy Coding Assistants