
Persistent Sandboxes Make Agents Remember, Plan, and Reuse Their Work

Nico Albanese · AI Engineer · Tuesday, May 12, 2026 · 20 min read

Nico Albanese, a Vercel engineer working on the AI SDK, argues that agents become more reliable when they are given a persistent sandboxed computer, not just a runtime and tools. In his workshop, he builds that pattern with AI SDK 6, Vercel’s named sandboxes, a bash tool, and a file-backed memory system, showing how an agent can plan in files, preserve context across sessions, and create reusable scripts without a separate memory layer.

The file system changed the agent’s behavior, not just its storage

Nico Albanese framed the work around a specific claim from Vercel’s internal agent systems: an agent runtime and tools are not enough. The third building block is “a computer” — some sandboxed environment with a file system where the agent can execute code and persist state across work sessions.

The important part, in Albanese’s account, was not that a file system gave the agent somewhere to save logs. It changed the way the agent behaved. He described an internal Vercel agent called D0, built for data work, that already had broad access: chat-with-your-data systems, much of the Vercel backend, an admin panel, Salesforce, and other internal tools. Before the file system was added, it might use five or ten tools in a run and return an answer that was “somewhat hallucinated.” After it was given a file system and instructions about how to use it, the behavior improved.

The mechanism was simple. Each session got a scratchpad in the file system. The agent wrote an initial plan there, with the objective at the top. Its instructions told it to follow that plan file “to a T” and check items off as it worked. Research went into a separate directory. Instead of relying only on a long context window, where early intent could be buried or dropped, the agent repeatedly read a concrete artifact reminding it what it was supposed to do.

And all of a sudden now you have this fascinating thing where the agent is reading and pulling in and reminding itself at almost every single step, okay, this is my objective.

Nico Albanese

The result, in Albanese’s telling, was that the agent started following through on entire tasks. It stayed on track. At the end, the user also got an artifact showing what the agent had done and what work had gone into it.

He said Vercel has since seen the same file-system-backed pattern across “pretty much every agent” built internally: a go-to-market agent, the data agent, and a customer support agent. He described the support-agent result informally, saying it pushed customer support tickets down by “like 90%,” and that he thought about 95% of people were “actually saying thank you,” a response he said they had not really seen before. The broader claim was that agents became more reliable when their work could be externalized into files they could read, write, search, and execute against.

That set up the technical direction: the agent would start as a basic chat loop, gain web search, gain a bash tool, run inside a persistent Vercel Sandbox, read and write a memories.md file, and begin generating reusable scripts for repeatable tasks. By the end of the build, the agent would not only answer questions. It would accumulate context and tools inside its own named computer.

AI SDK 6 makes the agent definition the source of truth

Nico Albanese contrasted AI SDK 6 with earlier SDK usage where most work was organized around primitives such as generateText, streamText, generateObject, and streamObject. Those still exist, he said, and structured outputs have increasingly been pushed into the text-generation functions. But AI SDK 6 also adds a more object-oriented way to define agents.

The architectural motivation was code organization. Albanese said Vercel’s own applications had started to balloon because LLM logic lived at the call site. A Next.js api/chat/route.ts file could end up with 2,000 lines of code because tools and the system prompt were defined inline.

AI SDK’s counterpoint is still “lightweight JavaScript”: define an agent once in code, in a monorepo if needed, and use it from a Next.js app, a Bun server, or another JavaScript environment. The initial agent definition was intentionally small:

import { ToolLoopAgent } from "ai";

export const myAgent = new ToolLoopAgent({
  model: "openai/gpt-5.4-mini",
});

Albanese called ToolLoopAgent “one of our shorter APIs,” joking about Vercel’s German influence on naming, but said the name was deliberately literal: it is an agent that runs a tool loop.

The model string demonstrated another AI SDK 6 feature: a global provider. In previous AI SDK usage, developers might import a provider such as openai from @ai-sdk/openai, create a provider instance, and then specify a model ID. In AI SDK 6, Albanese said, a global provider can be attached to every AI SDK function in an application. By default, that is the AI Gateway, which means plain strings can address models available through the gateway. Developers can override the global provider if they want, but the default makes a first agent definition terse.

The HTTP route was kept narrow. In app/api/chat/route.ts, the route imported createAgentUIStreamResponse, accepted messages from the request body, and passed those messages alongside the agent. Albanese described the helper as an abstraction over the same underlying pattern AI SDK users may know from streamText and result.toUIMessageStreamResponse(). The distinction is architectural: the route owns streaming concerns, while the agent definition owns the model, instructions, tools, and later call-time options.
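A minimal sketch of that route, using the names from the workshop; the exact createAgentUIStreamResponse options may differ in AI SDK 6:

// app/api/chat/route.ts — sketch only; assumes the helper takes the agent and
// the UI messages from the request body, as described above.
import { createAgentUIStreamResponse } from "ai";
import { myAgent } from "@/lib/agent";

export async function POST(req: Request) {
  const { messages } = await req.json();

  // The route owns streaming concerns; the agent owns model, instructions, and tools.
  return createAgentUIStreamResponse({
    agent: myAgent,
    messages,
  });
}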

On the client, the page used the useChat hook from @ai-sdk/react, which managed message state, errors, and sendMessage. At that point, the app was already a working chatbot. Albanese added the first behavior change with an instructions property on the agent: “Respond like a cowboy.” The next “hi” produced a cowboy-flavored “Howdy, partner!” response.
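A sketch of that client wiring, assuming the useChat shape from @ai-sdk/react; the markup is illustrative rather than the workshop's exact UI:

"use client";

import { useChat } from "@ai-sdk/react";
import { useState } from "react";

export default function Chat() {
  // useChat manages message state, errors, and sendMessage.
  const { messages, sendMessage } = useChat();
  const [input, setInput] = useState("");

  return (
    <div>
      {messages.map((message) => (
        <div key={message.id}>
          {message.parts.map((part, i) =>
            part.type === "text" ? <span key={i}>{part.text}</span> : null,
          )}
        </div>
      ))}
      <form
        onSubmit={(e) => {
          e.preventDefault();
          sendMessage({ text: input });
          setInput("");
        }}
      >
        <input value={input} onChange={(e) => setInput(e.target.value)} />
      </form>
    </div>
  );
}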

The cowboy example was disposable, but the point was not. Albanese argued that system prompts are still a core agent-building component, even if some developers dismiss them as a 2023 artifact. In his framing, instructions matter most when combined with the other two building blocks: tools and a computer. The rest of the implementation showed instructions becoming less like style guidance and more like operating procedure.

Provider-executed tools buy speed at the cost of portability

The first tool Albanese added was web search, because it is a common way to augment an agent with fresh external context. To do that, he installed @ai-sdk/openai, not to call OpenAI directly in the old provider-instance style, but to use OpenAI’s built-in web search tool definition.

He divided AI SDK tools into three categories. Custom tools are defined entirely by the developer: description, input schema, and execute function. They are provider-agnostic and give the developer full control over what the agent can call.

Provider-defined tools sit between full control and full delegation. The provider supplies the input schema and description, but execution happens on the developer’s side. Albanese cited Anthropic’s bash and text-editor tools as examples. The provider has worked on tool descriptions and schemas, often post-training models to use them effectively, while the application decides what to do when the tool is called.

Provider-executed tools run on the provider’s own infrastructure. Web search is the canonical example. OpenAI has one; Anthropic has one as well. The developer opts into the tool and may configure it, but does not provide the execution function. If the model decides to use it, the provider executes it, adds the result to message state, and returns the expanded state.

For the implementation, the agent could gain search with little code:

tools: {
  webSearch: openai.tools.webSearch(),
}

Albanese was explicit about the trade-off. The benefit is speed: “we don’t have to write any more code.” The downside is provider lock-in: a provider-executed tool ties the agent to one provider’s implementation. For this use, he considered that acceptable.

The more important product problem was visibility. During search, the app appeared to hang. The agent was doing multiple steps and calling a tool, but the user saw no indication. Albanese added rendering for tool parts in the chat UI, using AI SDK’s convention of tool- plus the tool name, such as tool-webSearch.

That led into AI SDK’s end-to-end type system. The agent definition is intended to be the source of truth. Albanese used InferAgentUIMessage<typeof myAgent> to define a typed message type, then used that type in the route handler and in useChat. Once wired through, the UI could discriminate on the webSearch tool part and get typed input and output. Albanese showed the chat rendering a “Searching the web...” state while answering “Who is Nico Albanese?” with a response identifying him as a developer or product engineer at Vercel.
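A sketch of that type wiring, using the helper named in the talk (exact AI SDK 6 exports may differ):

// Derive the UI message type from the agent definition, then reuse it end to end.
import { InferAgentUIMessage } from "ai";
import { myAgent } from "@/lib/agent";

export type MyAgentUIMessage = InferAgentUIMessage<typeof myAgent>;

// In the client, the same type flows into the hook, so tool parts narrow to
// known shapes when the renderer switches on part.type === "tool-webSearch":
//   const { messages, sendMessage } = useChat<MyAgentUIMessage>();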

For Albanese, the type system is not cosmetic. If the agent definition controls the tools, the UI should know exactly what tool states can appear and what shape their data has. The tool loop, route, and client renderer should not each maintain their own parallel understanding of the agent.

Persistent named sandboxes remove lifecycle code from the app

Nico Albanese then moved to the part he considered most interesting: giving the agent a sandboxed computer. The implementation used @vercel/sandbox@beta, specifically Vercel’s named persistent sandboxes.

The underlying problem is that most remote sandboxes are ephemeral. A provider might keep a sandbox alive for five hours or 30 days, but at some point it stops. Albanese said he had previously built an entire lifecycle system to work around this: tar the file system after every request, store it in blob storage, then restore it when needed. He called that approach “terrible.”

Vercel’s named sandbox model abstracts that away. Each sandbox has a name. A sandbox can have sessions, which are instances of the underlying sandbox. Application code references the sandbox by name. Behind the scenes, Vercel checks for an active instance and routes to it, or starts a new one and routes to that. When the sandbox stops after inactivity, Vercel snapshots the file system. Later requests resume from that snapshotted state.

The product effect is that code can treat a remote sandbox like a specific computer that persists across invocations. Albanese said the machine will be spun down, but its state remains, so “it effectively feels like the exact same machine.”

The implementation created a utility in lib/sandbox.ts that either retrieves an existing named sandbox or creates a new one. In a real application, Albanese said, the sandbox name would likely be tied to a user or session. In the workshop, it was just a fixed ID.
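A sketch of that utility; the name option below is an assumption about the beta named-sandbox API rather than a documented signature:

// lib/sandbox.ts — sketch only; the exact named-sandbox call in
// @vercel/sandbox@beta may differ.
import { Sandbox } from "@vercel/sandbox";

// In a real app this would be derived from the user or session.
const SANDBOX_NAME = "workshop-agent";

export async function createOrGetSandbox() {
  // Vercel routes the name to an active instance, or starts a new one from the
  // snapshotted file system, so callers just ask for the named machine.
  return Sandbox.create({ name: SANDBOX_NAME });
}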

Adding the sandbox to the agent introduced another AI SDK 6 concept: call options. Albanese said model behavior is rarely determined only by conversation context. Applications often pass structured inputs at call time that should change the agent’s behavior. In a customer support agent, that might be a customer ID or customer type. In his example, a travel company might choose a smaller model for a low-tier customer and a more capable model for a high-tier customer.

Previously, he said, developers often handled this functionally, with createAgent(input) and conditional logic. AI SDK 6 instead lets an agent define a call-options schema. In the sandbox case, the agent expected a Vercel Sandbox instance:

import { z } from "zod";
import { Sandbox } from "@vercel/sandbox";

export const callOptionsSchema = z.object({
  sandbox: z.instanceof(Sandbox),
});

That schema was passed into the ToolLoopAgent definition. The route handler then had to fetch the named sandbox and pass it into createAgentUIStreamResponse under options. If it failed to do so, TypeScript produced an error. Albanese repeatedly emphasized this as end-to-end type safety: if the agent says it needs a sandbox, the call site must provide one.
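With the schema in place, the route sketch grows by one fetch and one option; omitting the sandbox is the type error Albanese demonstrated:

// app/api/chat/route.ts — sketch; option names follow the workshop.
import { createAgentUIStreamResponse } from "ai";
import { myAgent } from "@/lib/agent";
import { createOrGetSandbox } from "@/lib/sandbox";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const sandbox = await createOrGetSandbox();

  return createAgentUIStreamResponse({
    agent: myAgent,
    messages,
    // Required by the agent's call-options schema; leaving it out fails type-checking.
    options: { sandbox },
  });
}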

Tools also needed access to the sandbox at execution time. Albanese used AI SDK runtime context for that. He cautioned that “context” here is closer to React context than to “agent context.” It is arbitrary runtime state — data, variables, functions — made available to nested tool execution. A prepareCall function runs once when the agent is called. It takes the sandbox from call options and injects it into the runtime context so tools can use it across the run.
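In the agent definition, that injection might look like the following sketch; the prepareCall return shape and the experimental_context key follow the talk and the pre-AI-SDK-7 naming:

export const myAgent = new ToolLoopAgent({
  model: "openai/gpt-5.4-mini",
  callOptionsSchema,
  // prepareCall runs once per agent invocation: it lifts the sandbox out of the
  // typed call options and into runtime context for the tools to use.
  prepareCall: ({ options }) => ({
    experimental_context: { sandbox: options.sandbox },
  }),
  // instructions and tools omitted here for brevity
});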

That separation matters. The route knows how to get the sandbox. The agent knows it requires one. The tool knows how to run commands in one. The pieces remain separately defined but type-connected.

A single bash tool turns the sandbox into an operating surface

The first sandbox tool was a bash tool, defined in tools.ts. Albanese said Vercel “feel very strongly that bash is all you need” in many cases, because agents are good at writing bash commands.

The tool definition followed the three standard parts: a description, an input schema, and an execute function. The description told the model the tool could “Run a bash command in the sandbox environment.” The input schema accepted a single string, command. The execute function parsed the runtime context, retrieved the sandbox, and ran the command through bash. It returned stdout, stderr, and the exit code.

Albanese stressed the importance of the description. It is what the model uses to decide whether to call the tool and how to use it. The input schema is the contract. The execute function is the application code that actually runs when the model calls the tool.
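A sketch of that tool; the runtime-context access and the sandbox runCommand shape follow the workshop's description and may differ slightly from the shipped APIs:

// tools.ts — sketch of the bash tool.
import { tool } from "ai";
import { z } from "zod";
import type { Sandbox } from "@vercel/sandbox";

export const bash = tool({
  // The description is what the model uses to decide whether and how to call it.
  description: "Run a bash command in the sandbox environment.",
  inputSchema: z.object({
    command: z.string().describe("The bash command to run"),
  }),
  execute: async ({ command }, { experimental_context }) => {
    // Retrieve the sandbox injected into runtime context by prepareCall.
    const { sandbox } = experimental_context as { sandbox: Sandbox };

    const result = await sandbox.runCommand({
      cmd: "bash",
      args: ["-c", command],
    });

    return {
      stdout: await result.stdout(),
      stderr: await result.stderr(),
      exitCode: result.exitCode,
    };
  },
});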

He also noted an AI SDK 7 improvement planned for context typing. In the version shown, the tool parses experimental_context; in AI SDK 7, he said, the main context will be typed at the top-level agent. If a tool expects a certain context and the agent does not provide it, that would become an error at the agent definition level.

After Albanese wired the pieces together — the agent received the bash tool, the route fetched a named sandbox with createOrGetSandbox, and options: { sandbox } was passed into the agent invocation — he added a UI renderer for tool-bash so the chat interface could show terminal execution inline. When he asked the agent to run ls -la, the UI displayed a terminal block and returned an empty directory inside a Vercel sandbox. That was the mechanical proof: the agent could now execute commands in an isolated cloud environment rather than on the local machine.

But the first behavioral test failed. When Albanese asked “what do you see?”, the agent answered as though it had no visual feed and asked for an uploaded image or screenshot. It had a bash tool, but it did not know when to use it.

The fix was instruction-level behavior shaping. Albanese added a simple instruction: “You are an agent with a computer you can access with bash. If the user asks what you can see, use ls.” After that, asking “what do you see?” caused the agent to run ls -la and answer based on the file listing.

The example was intentionally basic, and Albanese called the prompt terrible. But it illustrated his earlier point about instructions. Tools do not automatically create useful behavior. The model needs operating guidance: what the tool means, when to use it, and how to interpret user intent through it.

Memory becomes a file the agent can inspect and edit

Albanese’s next step was to use the persistent file system for memory. His “hot take” was direct: memory is a file in the sandbox.

Rather than treating memory as an opaque product feature, Albanese proposed deterministic code around ordinary files. A memories.md file can be read before each agent run and injected into the system prompt. The file system can also contain structured stores for different kinds of memory, such as a core memories.md that is always sent to the model and a conversations.jsonl file for searchable conversation history.

The advantage, in his view, is that the file system becomes a structured playground. Agents are good at shell commands, and can use find, ls, grep, globs, and related tools to inspect and manage it.

The implementation put memory in prepareCall. Before each agent run, the code read memories.md from the sandbox, decoded it if it existed, and returned instructions assembled from strings. The agent was told it was a coding agent with access to a computer via bash; that it had a memories.md file it could read and write; and that it should add facts the user shares to memories.md. If memories existed, they were injected under “Here are your current memories.” Otherwise, the prompt said there were no memories yet.
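A sketch of that prepareCall wiring; reading the file via cat is a simplification of the workshop's direct file read, and the exact option names may differ:

prepareCall: async ({ options }) => {
  const { sandbox } = options;

  // Read memories.md from the sandbox before the run (a missing file is tolerated).
  const result = await sandbox.runCommand({ cmd: "cat", args: ["memories.md"] });
  const memories = result.exitCode === 0 ? await result.stdout() : null;

  return {
    experimental_context: { sandbox },
    instructions: [
      "You are a coding agent with access to a computer via bash.",
      "You have a memories.md file you can read and write.",
      "Add important facts the user shares to memories.md.",
      memories
        ? `Here are your current memories:\n${memories}`
        : "There are no memories yet.",
    ].join("\n\n"),
  };
},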

The first test worked mechanically. Albanese said, “hey my name is Nico.” The agent appended - User name: Nico to memories.md. After refreshing and saying “hi,” the agent responded, “Hey Nico! Nice to meet you.”

But the next behavior exposed the brittleness of vague instructions. The agent also wrote - User greeted with: hi to the memory file. An audience member pointed out that Albanese had told it to always add any fact, and a greeting was technically a fact. Albanese agreed and changed the instruction to “important facts the user shares” and “only record important memories.” The agent still saved the greeting. He then tried adding “don’t share greetings,” while noting that this is often a bad prompting pattern because mentioning unwanted behavior can activate it.

The exchange became a small demonstration of why agent behavior is often wrong even when the underlying tool works. The memory mechanism was functioning. The agent could read and write memories.md, persist it across refreshes, and clear it when asked. The issue was policy: what counts as memory, and how precisely that policy is expressed.

Albanese later pasted a more developed memory prompt. It told the agent not to save trivial interactions like greetings, small talk, or information derivable from the codebase. It told the agent to save user preferences, project context, important facts about the user or work, and corrections or feedback about its behavior. It also instructed the agent to proactively ask questions, one at a time, to learn the user’s name, role, current work, experience level, preferred tools, and coding style.

With that prompt, the agent asked for Albanese’s name, saved it, asked what he did, and saved that he worked on the AI SDK at Vercel. In a new chat, the persisted memory remained available.

The important design point was not the exact prompt. It was the arrangement: deterministic application code reads a file; instructions tell the agent how to maintain it; the bash tool lets the agent edit it; the named sandbox makes it persist across sessions.

The agent can accumulate reusable procedures, not just facts

Once the agent had a persistent file system and bash access, Albanese extended the memory idea from user facts to reusable capabilities. He added instructions telling the agent to create Python scripts for repeatable tasks, record descriptions of those scripts in memories.md, and check existing scripts before doing a task from scratch.

The prompt structure was explicit: before doing any task, check whether a script listed in “Scripts for common tasks” already handles it; if a script exists, run it; if no script exists, do the task using web search, bash, or other available tools; after completing a task, consider writing a reusable Python script for it.

Albanese connected this to why coding agents have advanced quickly. Coding environments provide a strong feedback loop: write code, run it, inspect output, type-check, compile, iterate. He said agents modifying or extending themselves is part of what made systems like OpenDevin exciting, but clarified that the practical value is often not mystical self-modification. It is giving the agent an environment where it can create reusable artifacts, evaluate them, and improve them.

The task was weather. Albanese asked for the weather in London and told the agent to use Python. The agent created a scripts directory, wrote scripts/get_weather.py, fetched JSON from wttr.in, parsed current conditions, and printed a small weather report. It returned London weather as sunny, 14°C, 63% humidity, and 6 km/h SE wind.

Then the agent updated memories.md with a “Scripts for common tasks” section describing scripts/get_weather.py and how to run it. That made the tool discoverable to the agent in future runs because the memory file is injected into its instructions.

The next request showed both the promise and fragility of this approach. Albanese asked, “get weather in sf?” The agent used the script with "San Francisco, CA", but the weather service resolved the location incorrectly to Quebec-Ouest, Quebec, Canada, returning -11°C. The agent noticed the result was clearly wrong for San Francisco and suggested retrying with "San Francisco, California, USA" or a ZIP code. When rerun with the fuller location, it returned San Francisco weather as clear, 12°C, with 86% humidity and 11 km/h wind from the west.

The agent then asked whether Albanese preferred Fahrenheit or Celsius. He answered that he preferred Celsius. In the design being demonstrated, that preference could become another memory and influence future weather responses.

The point was not that the weather tool was robust. It was that a single bash tool plus persistent files enabled a pattern: the agent can create small programs, store them, document them for itself, and reuse them in later interactions. Memory becomes both semantic context and an index of available procedures.

Context compaction is a trade-off, not an automatic win

An audience question shifted the discussion to context management: should every agent request send the full message history, or should the application trim messages to avoid irrelevant or excessive context?

Albanese said the default in the workshop app was to send the entire message history on every sendMessage. In the client-server transport, however, sending all messages over the wire every time is not usually the production pattern. A more classic approach is to send only the most recent message from the client, fetch the full conversation on the server, combine them there, and then send the relevant state to the model.

The harder question is not transport but context engineering: deciding which old material should remain available to the model. Albanese said the SDK gives developers several ways to do that, but the trade-offs are application-specific and not something the framework should impose.

Under the hood, AI SDK converts UI messages into model messages, stripping UI-specific metadata such as timestamps and IDs. Developers can own and transform the message array before it reaches the model: map through messages, remove specific tool calls, keep user messages, or otherwise alter context.
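A sketch of owning that transformation; convertToModelMessages is the AI SDK conversion helper, uiMessages stands in for the conversation history, and the filtering policy shown is just an example of dropping old webSearch tool parts:

import { convertToModelMessages } from "ai";

// uiMessages is the aggregated UI message history for the conversation.
const trimmed = uiMessages.map((message) => ({
  ...message,
  parts: message.parts.filter((part) => part.type !== "tool-webSearch"),
}));

// Strip UI-specific metadata and hand the model only what it needs.
const modelMessages = convertToModelMessages(trimmed);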

For per-step control, Albanese pointed to prepareStep, a callback that runs before every individual agent step. In a tool loop, a single assistant turn can involve multiple steps: receive a user message, call web search, receive the result, decide whether to call another tool, and so on. prepareStep receives the messages, context, model, step number, and step data. It can return modified top-level parameters for that specific step.

A simple sliding-window example was:

prepareStep: ({ stepNumber, messages }) => {
  if (stepNumber > 20) return { messages: messages.slice(-5) };
},

That would keep only the last five messages after step 20. Albanese emphasized that the API is functional by design. The callback starts fresh each invocation; incoming messages are the aggregated set; the developer can change what is sent into a step without necessarily mutating the persisted final history.

But Albanese was skeptical of aggressive compaction in his own recent coding-agent work. He showed a run that lasted 104 minutes, made 316 tool calls, changed 29 files, and used only 32% of GPT-5.4’s context window, with no compaction. He said he had previously stripped earlier tool calls once he hit thresholds, trying to stay between 40% and 60% of the context window. The problem was cache invalidation: modifying the input history can invalidate the model’s input cache. With million-token windows, he said, compaction had become less urgent in his experience than it was with 400k-token windows.

104m 56s
single coding-agent turn Albanese showed, with 316 tool calls and 29 files changed

His preferred direction is to push independent work into sub-agents. A sub-agent can take a bounded objective, use its own context thread, and return a concise summary — “just a thousand tokens” — to the main thread. He also mentioned a hand-off tool pattern, where an agent can generate context for a fully new thread that becomes the next main thread.

The concern with summarization-based compaction is that it is lossy. Albanese cited a public anecdote about an email agent that was asked to archive emails from yesterday and instead deleted an entire inbox. His explanation was that an inefficient tool brought in too many emails, triggering auto-compaction; the volume of email content overwhelmed the original instruction, and the constraint not to delete was lost. That was the failure mode he worries about: the summary preserves the wrong thing.

Albanese’s practical conclusion was narrow and experience-based. After using his coding agent “14 hours a day” for work over several months, compaction had not been the limiting issue for him. A high cache-token read ratio, he argued, was more valuable for speed, performance, and cost.

The same pattern scales into a background coding-agent system

Albanese showed a more complex system he had built on the same ideas, hosted at open-agents.dev. He described it as effectively “Cursor background agents,” using the same underlying patterns from the workshop: AI SDK, AI Gateway for inference, sandboxes, long-running workflows, and sub-agents.

The system used Vercel Workflow so it could run indefinitely, with each LLM step matched to a durable workflow step. If a step failed, he said, it retried until it got a result. The interface showed active and archived sessions, an internal leaderboard, usage metrics, and individual agent sessions.

The usage dashboard shown in the demo listed 3.8 billion total tokens, 2,823 messages, 56,732 tool calls, 373 tracked PRs, a 91% cache read ratio, and 508,806 total lines of code churn. Albanese said 23 people at Vercel were using it and that he had put 3.8 billion tokens through it in the prior month or two. He singled out the 91% cache read ratio as something he was proud of. These were demo and dashboard figures, not an external benchmark, but they showed the scale at which he was applying the same pattern.

He also demonstrated the sub-agent pattern. In a new session, he asked the system to spin up a sub-agent to explore the project. The sub-agent ran off the main thread, globbed files, inspected package files and configuration, read relevant source files, and returned a compact summary: the stack was Next.js App Router, React 19, TypeScript, Tailwind, and MDX; the package manager was pnpm; scripts included dev, build, and start.

Albanese said that sub-agent used about 30,000 tokens but returned about 500 tokens to the main thread, keeping the main agent thread around 7,000 tokens. That was the concrete version of his earlier context-management argument: isolate independent exploration, then return only the useful synthesis.

The simple agent and the larger background-agent system differed in scale, not in kind. Both depended on the same primitives: define the agent once, type its messages and options end to end, pass call-time state into runtime context, expose a small set of powerful tools, and give the agent a persistent computer where it can externalize work.
