Long-Running Agents Need Separate Builders, Evaluators, and Disposable Scaffolding

Ash Prabaker Andrew WilsonAI EngineerMonday, May 18, 202619 min read

Anthropic’s Ash Prabaker and Andrew Wilson argue that long-running agents are a harness-design problem, not a matter of writing longer prompts. Their case is that agents can run for hours only when building, judging, planning and state management are separated: adversarial evaluators should test live behavior, work should be decomposed into explicit contracts, and durable state should live outside the model’s context. They also warn that this scaffolding is provisional, because each new model release changes which supports are useful and which have become dead weight.

The durable pattern is not a longer prompt; it is a builder, a judge, and scaffolding you are willing to delete

Ash Prabaker and Andrew Wilson described long-running agents as a systems problem, not a demo problem. The target is not an agent that can produce a plausible browser app in a few minutes. It is an agent that can run for five or six hours, sometimes longer, without losing the thread of the work.

The most consequential pattern Prabaker presented was simple: separate the builder from the judge. A generator builds the thing. An evaluator grades the thing. The evaluator is not just another pass over a diff; in the frontend and full-stack examples, it opens the live application with Playwright, clicks through it, inspects behavior, scores against a rubric, and sends critique back to the generator. The generator then refines, or, when the attempt is not improving, throws the work away and starts again.

The second claim is just as important: none of this scaffolding is permanent. Wilson framed the last year of Claude agent work as a co-evolution between models and harnesses. Anthropic would find a model weakness, fill the gap with harness machinery, train or post-train models against that pattern, and then sometimes remove the machinery when the next model made it unnecessary. Prabaker put the point more operationally: a harness that was right for Opus 4.5 could become dead weight on Opus 4.6.

Long-running agents fail, in Wilson’s framing, because three problems compound: context, planning, and verification. Context windows are finite; new sessions begin with amnesia, and long sessions can lose coherence as the window fills. Wilson used “context rot” for that degradation and “context anxiety” for the model’s tendency to rush toward completion as it nears the end of its context budget. Planning failures are different: a model may try to one-shot an entire project, build half a feature and call it done, or run out of context mid-feature. Verification is the least obvious failure and, in this talk, the most load-bearing one. Models are poor judges of their own work. They may render a button with no backend, run shallow tests — “curl, not click” — and move on.

The generator/evaluator pattern attacks the verification problem by refusing to ask the same model stream to be both proud builder and skeptical reviewer.

Tuning a standalone critic to be harsh is actually very tractable. But tuning a builder to be somewhat self-critical is not.

Ash Prabaker · Source

Prabaker compared this to human judgment: it is easier to critique a meal or a piece of artwork than to make one. The harness exploits the same gap in model behavior. The evaluator is still an LLM and still has biases toward LLM-style outputs, but its job, prompt, context, and tools can be tuned for skepticism.

Wilson made the broader harness point explicitly:

The harness doesn't just disappear as the models get better, it's really evolving as the models change over time.

Andrew Wilson

The bottlenecks moved from planning to context to verification

Wilson’s release history was less a chronology of Claude Code than an argument about shifting bottlenecks. The same three failures — context, planning, verification — kept reappearing, but each model generation changed which one the harness had to compensate for.

Before Claude Code, Sonnet 3.5 and Artifacts created an early closed feedback loop: generated code could render live next to the chat, giving the model something to inspect and iterate on. Computer use and MCP then extended the loop by giving the model browser-like interaction — screenshots, cursor movement, clicking, typing — and a standard way to plug in tools.

Claude Code arrived with Sonnet 3.7 in February 2025. Wilson highlighted a line from the announcement: the goal of Claude Code was to better understand how developers use Claude for coding in order to inform future model improvements. In his reading, the harness was not merely a product surface. It was experimental infrastructure for discovering where the model failed and feeding those lessons back into model development.

By May 2025, Opus 4 and Sonnet 4 improved context management and task completion, while Claude Code became generally available and the Code SDK shipped. Wilson’s slide paired those releases with API and product primitives such as code execution, MCP connector support, Files API, extended prompt caching, GitHub Actions, and IDE integrations. The point was that model releases and harness releases were coupled.

The “Ralph technique” made one harness pattern explicit: put the agent in a loop, constrain each iteration, and prefer predictable failure over unpredictable success. Wilson showed Geoffrey Huntley’s July 2025 shell loop:

while :; do cat PROMPT.md | claude-code done

Wilson said the common shorthand misses some of the structure. The pattern involved planning, breaking a prompt into features, selecting one task, and working with a fresh context window. Anthropic later implemented a related loop inside Claude Code with a stop hook, safe word, and iteration guard. Wilson noted the major difference: Anthropic’s version ran inside one Claude Code session rather than creating a fresh context window each time, relying on compaction over time.

Sonnet 4.5 shifted the context problem again. The model became more context-aware, tracking token consumption and managing its own context as the session advanced. Claude Code 2.0 added checkpoints and /rewind, and the Claude Code SDK became the Agent SDK because Anthropic saw the harness as more general than coding. Wilson said Sonnet 4.5 could sustain about 30 hours of focus by that point.

The October-November 2025 releases changed the economics of subagents and context. Haiku 4.5 made subagent swarms cheaper, according to Wilson’s slide, by matching Sonnet 4 on coding, computer use, and agent tasks at about one-fifth the cost. Opus 4.5 became stronger at planning. Wilson described a common pattern: use Opus 4.5 for planning and Sonnet 4.5 as the execution workhorse. Skills added progressive disclosure, loading front matter first and only pulling in full procedural knowledge when needed. Programmatic tool calling reduced context load by letting the model write code to run a series of tool calls and return only the final result.

By February 2026, Opus 4.6 and Sonnet 4.6 changed the harness calculus again. Wilson described Opus 4.6 as particularly agentic: better at planning, tool choice, debugging, and long-running work. Sonnet 4.6 offered what he characterized as Opus-level coding and computer-use capability at Sonnet price. Agent Teams gave subagents a way to coordinate directly instead of reporting every detail back to the main agent. Server-side compaction and 1 million context availability further reduced the need for fresh sessions in some workloads.

Wilson used METR Horizon-v1.1 to show the model-side trend. He described the metric as the human-time length of a task a model completes with 50% reliability in a minimal scaffold. The progression he presented moved from roughly hour-scale autonomous completion to about 12 hours by Opus 4.6.

~12 hours

METR p50 task-completion horizon Wilson cited for Opus 4.6 in a minimal scaffold

Release period	Model / harness shift	Primary bottleneck attacked
Jun–Nov 2024	Sonnet 3.5, Artifacts, Computer Use API, MCP	Closed feedback loops, tool use, visual/browser interaction
Feb 2025	Sonnet 3.7 + Claude Code research preview	Planning and real developer feedback into model improvement
May 2025	Opus 4 / Sonnet 4 + Claude Code GA and SDK	Context, planning, and verification around coding tasks
Sep 2025	Sonnet 4.5 + Claude Code 2.0, checkpoints, Agent SDK	Context awareness, memory, rewindable sessions
Oct–Nov 2025	Haiku 4.5 / Opus 4.5, Skills, programmatic tool calling	Cheaper subagents, progressive disclosure, stronger planning
Feb 2026	Opus 4.6 / Sonnet 4.6, Agent Teams, server compaction, 1M context	Longer autonomous runs, verification, multi-agent coordination

Wilson presented model and harness releases as coupled responses to shifting planning, context, and verification bottlenecks.

The useful lesson is not that one release made harness design obsolete. It is that the harness should target the model’s current weak spot, and the weak spot changes.

Externalized state lets agents survive context boundaries

The first long-running harness Wilson described came from Anthropic’s November 2025 engineering work. It was designed for vague prompts such as “write me a browser,” “create a Slack clone,” or “create a Salesforce clone.” The harness did not ask one agent to hold the whole product in its context until completion. It externalized the work into persistent artifacts and repeated sessions.

An initializer agent expanded the short prompt into a richer taskable specification. It wrote a feature_list.json, a progress file, a git repository, and an init script. Wilson said the team found models were more likely to overwrite Markdown files than JSON files, so the feature list lived in JSON. Each feature began with a passes: false flag.

Each later session started in a fresh context window by getting its bearings: checking the working directory, reading progress, and looking at git history. It then ran the init script as a smoke test so it did not have to rediscover how to start the app. It selected one failing feature, implemented only that feature, tested it as a user would — Wilson named Puppeteer for browser-based testing — and, if it passed, flipped the flag, committed the change, and wrote progress notes. If failing features remained, the harness spawned another session.

The design mapped directly to the three failure modes. Fresh sessions limited context rot. Persistent artifacts carried state across context windows. The feature list constrained planning. Browser-based testing forced verification beyond superficial checks. Git commits and progress notes became structured handoff points.

Prabaker later preserved the same principle even when Opus 4.6 made fresh sessions less necessary: shared state should live on disk, not in the assumed continuity of a model’s context. In the simplified harness, agents still coordinated through files such as spec.md, the app directory and git history, and findings.md. They did not need to share a context window to collaborate.

Subjective quality became gradable once the rubric became explicit

Prabaker rejected the common claim that “taste” cannot be graded. His narrower claim was that if you have a strong enough view, you can write it down in a form an evaluator can apply.

For frontend work, Anthropic used rubrics with four criteria: design quality, originality, craft, and functionality. The weighting favored design and originality because, in Prabaker’s view, Opus 4.6 was already relatively strong at craft and functionality. The remaining problem was preventing generic LLM aesthetics — purple gradients, library defaults, and “AI slop” — from being accepted as good enough. They calibrated the evaluator with few-shot examples on reference sites so its taste would converge toward the team’s.

The frontend loop ran five to fifteen iterations over roughly four-hour runs. The evaluator launched Playwright, navigated the live page, took screenshots, scored the work against the rubric, and wrote critique. The generator then decided whether to refine the current artifact or pivot entirely.

That pivot behavior was important. Prabaker said a single-agent self-review loop tends to patch the same thing repeatedly. In the adversarial harness, if the generator kept scoring low on originality, the system could discard the whole attempt and try again. In Q&A, he said newer models were surprisingly willing to throw away work after many passes if the evaluator’s rubric showed that the current approach was not improving. That was a behavior Anthropic had not reliably seen from a generator judging itself.

The evaluator’s separation was also why Prabaker was hesitant to give it the generator’s full trace. Anthropic tried variants where the critic saw more of the builder’s reasoning, but he said that muddied the two model streams. It was more effective for the evaluator to judge the output and say, in effect, “this is an issue,” leaving the generator to infer the cause and fix it. If the evaluator absorbs the generator’s self-justifying story, it may inherit the same mistaken confidence.

Full-stack work needs negotiated contracts, not only a plan

To move from attractive pages to working applications, Prabaker added a planner. The resulting structure had three roles: planner, generator, evaluator. He compared it to a simple organization: PM, individual contributor, QA. Anthropic did not invent those roles; it gave each role its own context window.

The planner’s job was intentionally high-level. It took a one-line prompt and turned it into a product specification and a sequence of sprints. It did not attempt to specify granular technical details. Prabaker argued that detailed technical planning up front is dangerous because a planner error can cascade through every sprint and become amplified over a multi-hour run.

The more important mechanism was the “sprint contract” between generator and evaluator. Before the generator wrote code, the two agents negotiated what “done” meant for that chunk of work. The generator might propose a feature and tests. The evaluator might push back that the scope was too large, the tests were too weak, or edge cases were missing. They communicated through files on disk — one writes a Markdown file, the other reads and responds — until both agreed.

The evaluator then graded against the contract, not the planner’s original high-level spec. Prabaker described the contract as the bridge between user stories and testable behaviors. It avoided forcing the planner to overspecify, while still giving the evaluator concrete assertions to check.

This was the part he said the Ralph-style loop lacked. A fixed plan.md can guide a main loop, but there is no adversarial party arguing with the builder before implementation begins. The contract creates that pressure before code exists, not only after the app fails.

The retro game maker example showed the difference between looking done and working

Prabaker used a “build a retro game maker” prompt to compare a solo agent against the fuller harness. He cautioned that the harness was neither cheap nor efficient. The solo run took about 20 minutes and cost $9. The full harness ran for about six hours and cost about $200. The point was not cost-effectiveness; it was whether the result worked.

Metric	Solo agent	Full harness
Duration	20 minutes	6 hours
Cost	$9	$200
Play mode works?	No — entity wiring broken	Yes
Sprite editor	Basic	Full palette, zoom, AI-assisted
Features in spec	4 from prompt	16 across 10 sprints

Prabaker compared the same retro game maker prompt under a solo agent and the planner-generator-evaluator harness.

The solo agent produced a plausible opening screen: a dark retro-themed interface with project creation and a clear call to action. Prabaker said that if this were the whole app, it would look shippable. That was the bait.

The sprite editor also looked reasonable at first. It had a canvas, a palette, a frame timeline, and live preview. It was cramped, and the color picker was just black swatches, but the agent had understood the broad shape of the product.

The failure appeared in play mode. Entities rendered. There was a score, health, pause, and reset. But pressing arrow keys did nothing; pressing space did nothing. The app looked done at the surface and failed at the core interaction. Prabaker said the agent did not understand how to test what it meant to play a game.

With the harness, the same model and prompt produced an app that named itself RetroForge. The new-project dialog asked for canvas resolution, tile size, and color palette — product decisions introduced by the planner. The sprite editor used the project’s 54-color 8-bit palette and showed previews at multiple scales. An AI level assistant let a user prompt for a castle with sprites guarding it, then produced nine placed entities with options to try another layout, discard, or accept.

The play mode was the critical difference. The generated app included a debug HUD showing position, velocity, and ground state. Prabaker said the HUD was clearly useful to the evaluator. The physics loop was running, arrow keys moved the player, and the player collided with castle walls. The evaluator had launched and played the game, not merely inspected whether UI elements existed.

The evaluator caught concrete bugs. One contract criterion required click-drag rectangle fill; the finding was that the app only placed tiles at the drag start and end because fillRectangle existed but was not triggered on mouseUp. Another required the delete key to remove placed entities; the evaluator found a Boolean logic bug where the handler required both a selection and selectedEntityId, while clicking only set one. A third required reordering animation frames via API; the evaluator found that a FastAPI route for /frames/reorder was defined after /{frame_id}, so FastAPI matched reorder as a frame ID integer and returned 422.

Prabaker emphasized that these were not exotic bugs. They were ordinary product defects a real user or QA pass would find and a superficial CI loop might miss. The specificity came from contracts. For the level-editor sprint, the agents produced 27 contract criteria.

contract criteria for the retro game maker's level-editor sprint

Prabaker said that granularity was necessary. Vague criteria lead to vague critiques; granular criteria tell the generator what to fix.

The evaluator had to be trained out of generosity

Prabaker was explicit that Claude was not a good QA agent out of the box. In early runs, it would identify legitimate issues and then talk itself into approving the work anyway. The development note he showed put it plainly: “Out of the box, Claude is a poor QA agent. It would identify legitimate issues, then talk itself into deciding they weren't a big deal and approve the work anyway.”

The same generosity bias that appears in LLM-as-judge systems appeared in app QA. A QA agent might find a bug, decide it could be fixed later, and pass the work. Anthropic spent significant time tuning prompts around small layout bugs, edge cases, and concrete rejection behavior.

There was no secret technique beyond reading traces. The debugging loop was: read the evaluator’s logs, find where its judgment diverges from yours, update the prompt, repeat. Prabaker compared it to reading a stack trace. The trace is where the harness developer sees what the model actually did, why it accepted or rejected something, and where its standard diverged from the desired one.

He described some tooling around that process: pipe agent transcripts into files, grep them, point another agent at the traces, and have that agent identify likely loop failures or propose prompt updates. But in Q&A he returned to the stronger point: by far the best internal approach was reading traces by hand. A trace is not just an audit log. It is how a harness builder develops intuition for what the model is experiencing.

Asked whether evaluator tuning is project-specific or reusable, Prabaker said the goal was reusable patterns. Tuning an evaluator for one specific app is possible, but that is not much different from manually prompting Claude Code. The useful work is identifying common weak points — generic frontend design, library defaults, weak QA standards — and encoding judgments that generalize across web apps and potentially other domains.

Opus 4.6 made the harness smaller, not unnecessary

Prabaker argued that harnesses should be revised aggressively as models improve. Opus 4.6 made about half of the previously shown harness “dead weight.” The lesson, in his framing, was not that the older harness had been wrong. It had been right for Opus 4.5. The frontier moved.

On Opus 4.5, context resets between sessions were essential because of context anxiety. On Opus 4.6, Anthropic dropped them entirely for this pattern. A single continuous session with compaction was enough for long runs.

Sprint decomposition was also load-bearing on 4.5. The model needed to be force-fed one feature at a time to stay coherent. On 4.6, Prabaker said the model could hold a two-hour continuous build coherently, so Anthropic removed strict sprint planning in the simplified version.

Evaluator cadence changed too. Previously, the evaluator ran per sprint. In the simplified harness, it ran after a one-shot generation, with one or two passes of findings and fixes. The planner remained required.

Component	On Opus 4.5	On Opus 4.6
Context resets between sessions	Essential because of context anxiety	Dropped entirely
Sprint decomposition	Load-bearing	Removed for a single continuous build
Evaluator cadence	Per sprint	Single pass at end, with one or two follow-up passes
Planner	Required	Still required

Prabaker described the 4.6 harness as a simplified version of the 4.5 harness, not a repudiation of it.

The simplified setup still had three agents and no shared context. The planner ran once for about five minutes and wrote spec.md. The generator performed one continuous build over two or more hours and produced the app and git history. The evaluator used Playwright and a rubric at the end, writing findings.md. Coordination happened through the file system.

Prabaker showed a browser music production app built with the simplified harness. The prompt was to create a music production app with a built-in agent to compose songs. The planner phase took 4.7 minutes and cost $0.46. The first build took 2 hours 7 minutes and cost $71.08. The first QA round took 8.8 minutes and cost $3.24. Build and QA rounds two and three took 1 hour 30 minutes and cost $50. Total runtime was 3 hours 50 minutes and total cost was $124.70.

Phase	Duration	Cost
Planner	4.7 min	$0.46
Build, round 1	2 hr 7 min	$71.08
QA, round 1	8.8 min	$3.24
Build + QA, rounds 2 and 3	1 hr 30 min	$50
Total	3 hr 50 min	$124.70

The simplified browser music production app run cost $124.70 over 3 hours and 50 minutes.

The app could set tempo and key, lay down melody, build a drum track, adjust mixer levels, and add reverb from natural-language prompts. Prabaker noted that Claude could not hear the music and that the music itself was “pretty trash,” but the app was fleshed out. His point was that a model generation earlier, this kind of result would not have worked with so little scaffolding.

For long-lived projects, persist the things a future agent can actually use

Several Q&A exchanges clarified how far Anthropic’s pattern should be taken outside greenfield demos. Prabaker’s default orientation was to push toward fully autonomous harnesses, not to make human review the primary control mechanism. Asked whether a sprint-review-like human checkpoint should be added after a few hours, he said hooks could implement that: an evaluator could hit a stop condition, hand control to a human, receive a developer message, and continue.

But Anthropic’s exploration was aimed at finding what could be made fully autonomous. When a questioner argued that a mid-run review might steer the result toward the desired product, Prabaker distinguished between a permanent harness feature and development-time tuning. Anthropic would run several generations, inspect failures, read traces, adjust the main harness prompts, and run again until the team was comfortable leaving it alone. The preference was to bake steerability into the harness rather than insert a human every time the harness was uncertain.

For long-lived products, Prabaker said the current pattern is to persist breadcrumbs for the next model and the next human. That means JSON files recording attempted fixes, evaluator findings, whether a fix worked, timestamps, and final state. It also means lightweight live documentation of the file structure. He prefers JSON because models are less likely to overwrite it. Wilson had earlier emphasized git history, commits, and progress notes as handoff artifacts.

The practical set of durable artifacts is therefore not mysterious: contracts, evaluator findings, git history, JSON breadcrumbs, and lightweight architecture notes. Contracts preserve what “done” meant. Findings preserve what failed and why. Git history preserves implementation sequence and rollback points. JSON breadcrumbs preserve state across sessions without relying on a model to remember. Architecture notes give the next agent enough map to avoid rediscovering the codebase.

Asked whether the pattern works for brownfield codebases or production features, Wilson said the demonstrated harness is better suited to greenfield applications. Brownfield work often requires more control and project-specific rubrics. But he described adjacent brownfield automation patterns across the software development lifecycle: autonomous monitoring could create issues or feature requests, an agent could produce a pull request, another review loop could inspect it, and a human could approve before merge. Prabaker added that the principles — separate generation from evaluation, use subagents, apply evaluator loops to bug fixes or monitoring — can be lifted into real workflows even if the exact six-hour greenfield harness is not the right unit.

The relationship between Agent Teams and explicit generator/evaluator loops also came up. Wilson said Claude Code and the Agent SDK share underlying harness primitives, so this kind of pattern should be buildable in Claude Code. Agent Teams could express it: the generator might be the main agent and the evaluator a team member, or the two might coordinate directly. Claude Code is a useful testing ground; the Agent SDK can run in a cloud or sandboxed environment for long periods, without relying on a developer’s local machine staying awake.

Prabaker described generator/evaluator as a subset of team-based subagent design, not a contradiction of it. A team might have frontend, backend, and integration agents. Each could have its own critic pairing. The broader principle is adversarial pressure between a builder and a judge, not a fixed agent topology.

The loop is buildable with shipping primitives

Prabaker said developers do not need Anthropic’s internal harness to start applying the pattern. The primitives are already available in Claude Code or adjacent tooling.

Auto mode handles unattended permissioning more safely than simply skipping permissions. Custom subagents can represent evaluator or QA roles. Playwright MCP or Chrome MCP can give the evaluator browser control for click-through verification; for native apps, computer use can play a similar role. Skills can package rubrics and procedural grading knowledge into the development flow.

Asked about Playwright specifically, Prabaker recommended Playwright MCP or Claude for Chrome MCP. He resisted the premise that the user should watch the browser constantly. Watching is useful early, when building trust and reading traces. But the workflow Anthropic is pursuing is one where the user can set work in motion, have the agent navigate, read network and console errors, inspect visual issues such as overlapping text, and return later with enough confidence to evaluate the result.

The final takeaways were direct: self-evaluation is a trap, use an adversarial evaluator; compaction is not coherence, so structured handoffs matter; subjective quality can be made gradable if the rubric is explicit; traces are the primary debugging loop; and scaffolding should be deleted when the model catches up.

AI Application Architecture Evals and Benchmarks Agents and Autonomy Model Releases Coding Assistants