Agents Move From Chat Windows To Accountable Work Product
Cognition’s Devin, OpenAI’s Agents SDK, Accenture’s governance framing, Braintrust’s observability work, and Neo4j’s context-graph model all point to the same shift: agents are being treated less as interfaces and more as production workers. The question is no longer only whether a model can act, but whether its runtime, permissions, approvals, traces, memory, and review process make that work trustworthy.

The agent has become a production worker, not a chat window
The useful distinction in agentic AI is moving from whether a model can complete an impressive task under close supervision to whether an agent can enter a real workflow, operate in the background, produce work product, and leave enough evidence for a human or organization to trust what happened.
Cognition’s Devin is the clearest concrete case because the evidence is operational rather than theatrical. Walden Yan says Devin’s merged PR usage inside Cognition grew 7x over roughly two or three months while engineering headcount rose by about 10%. Across three Devin repositories, non-merge commits attributed to Devin rose from 16% in January to 80% by mid-March.
That figure matters less as a claim about Devin specifically than as a signal about where coding agents are moving. Cole Murray frames the shift as a move from foreground, IDE-based prompting to cloud agents that take a specification and return a tested pull request. The older copilot model kept the human in the loop continuously: accept this completion, correct that step, paste that error, steer the model again. The background-agent model asks a different question: can the human specify the change, let the agent run in a controlled environment, and review the result?
That turns the agent into something closer to a junior production worker inside the engineering system. But the commit share only becomes meaningful because of the machinery around it. Yan and Murray spend far more time on cloud sandboxes, scoped secrets, repo setup, testing, Slack and GitHub workflows, memory, review, and rollback than on model prompting. Their shared point is that autonomy does not stand on its own. It requires a reproducible environment and an institutional workflow that can absorb agent output without letting the codebase degrade.
The architecture discussion makes that concrete. Murray describes the first background-agent design choice as whether the harness runs inside the sandbox or outside it. In the in-box design, the agent, process, files, and execution state are localized. That is simpler, but it can put secrets and control logic inside the same environment where untrusted model-generated code is running. In the out-of-box design, the agent’s “brain” runs in a control plane while the sandbox acts as the machine that receives actions. Yan says Devin was built around a similar principle Cognition calls separating “the brain from the machine,” which lets customers scope what credentials live on the machine and avoid exposing the more privileged parts of the system to the code-execution environment.
The production-agent question, across Yan and Murray’s account, is not whether the model can act. It is what has to exist around the model before its actions can be trusted.
Testing is where the production frame becomes especially visible. Yan argues that testing an app with an agent is not mainly a “computer use” problem. Clicking the right button is a small part of the task. The harder problem is knowing how to run the relevant services, which user state or feature flag is needed, what behavior should be triggered, and how the change should be demonstrated. Murray makes the same distinction: a background coding agent has the codebase locally, but knowing how to test a change is codebase-specific.
That is why screenshots, videos, annotations, and PR comments become part of the product surface. Devin returning a video of a Vercel preview is not just a convenience; it is evidence for the merge decision. Yan says annotations under the video made a large difference because reviewers need to know what the agent was trying to prove. Murray says OpenInspect added screenshot support after customer demand, with video possible through plugins.
The warning is equally important. Yan says Cognition has tried building real products through pure “vibe coding” on top of itself: auto-merge, no human review, no inspection. In the state of the art he describes from December, the codebase survived about two weeks before it had to be trashed and rewritten. Murray’s broader formulation is that, in AI-native coding, “your codebase regresses to your worst engineer” if generated code is not audited. Bad patterns become precedent; future AI work copies them; duplicate helpers and sprawling conditional logic multiply.
The lesson is not that a single coding agent has won. It is that the unit of adoption has changed. A production agent is a model plus a control plane, sandbox, permission model, repo environment, integration layer, testing surface, memory system, and review practice. Without those, autonomy can accelerate decay as quickly as output.
Durable runtimes are becoming the agent operating layer
OpenAI’s updated Agents SDK points to the same systems-level transition from another angle. Steve Coffey presents the SDK as a response to a practical gap: models can now work for longer stretches, but many teams are still hand-building orchestration loops around model calls. Codex is his reference case. Users may have seen Codex run for minutes or an hour; internally, Coffey says people have gotten Codex to run “for days up to a week on tasks.”
The important distinction is between three runtime patterns. A one-shot model call can translate a document or turn unstructured data into JSON. A hosted shell can run bounded command-line work inside an isolated container: transform a CSV, lint a repo, run a build step, inspect logs. A full SDK-managed agent is for work where the hard parts are durable state, file-system access, tools, sandbox lifecycle, skills, and resumability.
| Pattern | What it assumes | What it is for |
|---|---|---|
| One-shot model call | The task fits inside a bounded request or developer-managed loop | Translation, extraction, structuring, or other limited transformations |
| Hosted shell | The model needs an isolated container but not a full durable agent runtime | CLI-grade work such as linting, build/test steps, repository analysis, or data transformations |
| Durable agent harness | The agent needs state, files, tools, snapshots, skills, and resumability across longer work | Multi-step workflows over real code, files, systems, and application state |
The architectural move is the same one Yan and Murray describe: separate the harness from the compute environment. Coffey says the sandbox should be treated as ephemeral. The durable harness can live in a trusted backend; the sandbox can be replaced, expired, killed, or recreated. If the sandbox disappears, the SDK can spin up a new one, restore the file system from a snapshot, reload the run state, and continue. The model need not know that the underlying container changed.
That matters for both reliability and security. If the agent loop and file system are trapped inside a single sandbox, that sandbox becomes the source of truth. If it dies, the task may die with it. If it holds secrets, the environment where untrusted code executes becomes more sensitive. Coffey’s preferred separation lets trusted application tools and secrets remain server-side while the sandbox handles execution, file reads, and file writes.
A harness is becoming a product surface in its own right. Coffey describes built-in support for web search, file search, MCP, code interpreter, skills, hosted or networked containers, shell, function calling, compaction, memory, and tool approvals. The SDK’s SandboxAgent can manage sandbox lifetime and snapshots unless the developer chooses to own the sandbox lifecycle. Providers such as Docker, Modal, E2B, Cloudflare, Vercel, Daytona, and others can supply the execution environment; snapshot storage can move from a local tarball to cloud storage such as Cloudflare R2.
The line Coffey is drawing is not “every agent needs the heaviest runtime.” It is that long-running agents require the runtime to become explicit. They need a file system that can be recreated, a manifest that defines the workspace, skills that package instructions and scripts, and function tools that mediate changes to application state. Tool-call approvals then become a control point for sensitive operations. In Coffey’s task-tracker example, the agent could update status and assignment, but marking a task done required explicit approval.
Skills are another sign of the production turn. A skill is not just a prompt. It is a versioned bundle of instructions, scripts, templates, and resources, with a manifest describing when and how to use it. Coffey’s tax-prep example includes rules, scripts, and logic to fill out a 1040. He also argues that GitHub is a useful home for skills because teams already have version control, pull requests, and review workflows there. That makes the agent’s operating knowledge governable in the same way code is governed.
Persistence has two parts in the SDK account: the file system snapshot and the rollout state. Nish Singaraju says the SDK stores the rollout and snapshots the file system, which is what allows long-running agents to resume. The practical goal is to move engineering effort away from rebuilding the orchestration loop and toward product choices: which tools to expose, what state to persist, which skills to mount, where human approval is needed, and which sandbox provider fits the workload.
This mirrors the Devin discussion without depending on the same vendor or product. Both accounts converge on a pattern: the sandbox should be replaceable; the durable harness should preserve state and enforce boundaries. As agents work longer and touch more systems, the runtime stops being hidden plumbing and becomes the agent operating layer.
Enterprises are discovering that governance is the real bottleneck
If background agents and durable runtimes increase the supply of deployable work, enterprise governance becomes the next constraint. Accenture’s Jess Grogan-Avignon and Jack Wang argue that many agentic AI projects fail not because the agent cannot be built, but because the institution around it cannot approve, deploy, monitor, and learn fast enough.
Wang’s example is deliberately plain: an agentic application took about two weeks to build and another 12 months to get into production. The year was not spent inventing the application logic. It was spent aligning infrastructure, security, centralized AI gateway, data governance, and application teams. His comparison is to Google Search if search results had to wait for several teams, legal review, and a quarter-end change freeze before appearing.
Grogan-Avignon and Wang do not argue for removing governance. Their point is that governance designed for human-speed delivery becomes a bottleneck when AI coding and agents increase the amount of code, workflows, and experiments an organization can produce. Grogan-Avignon says the “real tech debt” is not only old application code; it is underinvestment in CI/CD, engineering automation, and approval infrastructure. Her prescription is governance-as-code: control mechanisms that are executable, adaptive, and able to move at the speed of the systems being deployed.
The enterprise funding model has the same problem. Grogan-Avignon says conventional business cases assume scope, solution, expected value, cost, and timeline can be known upfront. Agentic AI often reverses that order. The value is discovered through prototyping and experimentation, especially when AI makes previously uneconomic work possible. Wang extends that into a portfolio model: the CFO should think more like a venture capitalist, placing enough bets to find the ones that compound rather than requiring each project to arrive with a guaranteed three-year payback.
That delivery logic also changes what counts as progress. Wang argues for hypothesis-driven loops: build, evaluate, iterate, and generate statistical confidence. A conventional milestone program measures whether activities are complete. An agentic program has to measure what has been learned and how much confidence the team has earned in the system’s behavior.
The strongest deployment metaphor is progressive autonomy. Grogan-Avignon describes trust as something shipped through staged exposure: shadow mode, advisory mode, controlled autonomy, expanded autonomy, and eventually broader autonomy. In shadow mode, the agent runs alongside human processes without affecting outcomes. In advisory mode, it recommends while humans approve or reject. In controlled autonomy, it acts only in narrow, lower-risk scenarios with limits and kill switches. Each stage is gated by evidence, not by the completion of a project plan.
- Shadow modeThe agent runs beside the human process and its outputs are compared with human decisions.
- Advisory modeThe agent recommends in live workflows, while humans approve, reject, or correct.
- Controlled autonomyThe agent acts in narrow, lower-risk situations with limits and kill switches.
- Expanded autonomyThe agent receives broader authority only after outcome evidence accumulates.
This governance argument ties directly back to the runtime and background-worker sections. Once agents can operate for hours or days, write code, triage incidents, change task state, or recommend customer-facing actions, governance cannot remain a quarterly committee process. But the Accenture argument is also not a blank check for speed. It is a claim that control has to become more granular, more executable, and more evidence-based.
Wang’s “living memory” idea also anticipates the later memory and observability discussions. He argues that the defensible enterprise asset is not yesterday’s system of record but the feedback generated when customers and employees interact with products in the company’s context. Every feature should either generate feedback signals or act on what prior signals taught the organization. That turns deployment into the start of learning, not the end of a project.
Observability now has to answer whether the agent did the right thing
Phil Hetzel of Braintrust sharpens the operating-stack problem by changing the question observability is supposed to answer. Traditional observability asks whether a system is up, fast, cheap enough, and technically healthy: latency, throughput, error rates, cost, 400s, 500s. Agent observability still needs those measures, but Hetzel argues that they are no longer sufficient. The central question becomes whether the agent behaved correctly.
That means tracking whether the answer was grounded in the right context, whether the agent used the expected tools, whether it followed the policy encoded in the system prompt, whether it matched the intended tone, whether it resolved the issue, and whether it escalated appropriately. A dashboard can be green on latency and error rates while the product is failing because the agent gave bad medical advice, mishandled a refund, violated brand policy, or failed to use the available account-balance tool.
The data shape is part of the problem. Hetzel says Braintrust has seen customer agent traces exceed a gigabyte and individual spans reach 20 megabytes. These traces are not just timings and function names. They include natural-language inputs, model outputs, tool calls, classifications, system instructions, labels, costs, token counts, and expert judgments.
That makes agent traces semantically loaded. A customer-service trace may include the user’s issue, the expected response, sentiment, tool use, escalation decisions, issue resolution, and the underlying instructions governing refunds, payment problems, account balances, and tone. Traditional observability systems were built for deterministic code paths and operational measurements. Agent observability has to support judgments about behavior.
The users change too. Hetzel says the best teams building agents include nontechnical domain experts because those experts know what “correct” means in context. Braintrust customers include clinicians, registered nurses, wealth advisers, and lawyers reviewing traces. Their annotations are not merely manual QA. A product manager, lawyer, or clinician can mark a trace good or bad and explain why; those explanations then become material for scalable scoring functions and future evaluations.
This is where observability and evals converge. Hetzel says Braintrust treats them as the same problem with different timing. In evals, the team knows the inputs ahead of time and runs them in batch. In observability, the inputs arrive live from production. Production traces can then be moved into offline datasets and rerun as evaluations, so real failures become repeatable test cases.
Scores handle known failure modes: tone, completeness, accuracy, policy compliance, escalation. Topic modeling and clustering look for unknown ones. Hetzel describes lightweight LLM processing over incoming traces to group interactions by user intent, sentiment, and issue category. The purpose is not just reporting. It is to shorten the loop between live behavior, expert review, offline experimentation, and improved agent performance.
The governance connection is direct. Progressive autonomy needs evidence. An enterprise cannot responsibly move from advisory mode to controlled autonomy if it cannot see whether the agent is following rules, using tools correctly, and producing acceptable outcomes. Durable runtimes let the agent work longer. Observability supplies the record needed to decide whether that longer work is safe, useful, and improving.
Decision memory is the next layer: rules, precedent, and authority
Neo4j’s context-graph argument adds the institutional memory layer missing from runtime, governance, and observability alone. Zaid Zaim and Andreas Kollegger frame the problem this way: agents may have language ability, tools, retrieval, and logs, but consequential decisions require rules, precedent, causal context, authority boundaries, and reasoning traces.
Zaim distinguishes a context graph from a basic audit log. An audit log may record that a transaction was rejected at 14:32. A context graph is meant to preserve why: relevant entities, relationships, policies, prior decisions, causal chains, tools used, and reasoning steps. That matters for an agent recommending a $100,000 credit line increase, Zaim’s example. The question is not only whether the recommendation is right; it is whether the system can explain which rules applied, what comparable cases existed, and why it had authority to recommend or act.
Neo4j’s memory model has three layers: short-term memory for the current interaction, long-term memory for durable entities and relationships, and reasoning memory for decision traces, tool calls, provenance, and similar prior cases.
| Layer | What it stores | Why it matters |
|---|---|---|
| Short-term memory | Conversation history and session context | Keeps the immediate request coherent |
| Long-term memory | People, organizations, locations, preferences, and other durable relationships | Gives the agent institutional context |
| Reasoning memory | Decision traces, tools used, comparable cases, provenance, and causal steps | Lets future agents reuse precedent rather than rediscover constraints |
Kollegger’s decision workflow turns that memory into an operating process. The agent first frames the decision: objective, causal path, and environment. It then checks guidance: prior precedent and current hard or soft rules. It assesses risk and value, including whether the action is reversible and what the cost of error would be. It determines authority: whether it can act, must escalate, or needs oversight. Finally, it records the outcome back into the graph as future precedent.
The authority point is especially important for production agents. Correctness and permission are separate questions. An agent may know what action would likely solve a problem and still lack authority to take it. A medical, financial, legal, or customer-impacting decision may require escalation even if the model is confident. Kollegger’s framework forces the system to ask who is allowed to decide, not just what should be decided.
Reference-class validation gives the risk discussion its sharpest edge. Kollegger’s medical example is that a drug may be right for 99% of cases with a symptom and dangerous or fatal for the remaining 1%. The relevant question is not the aggregate success rate; it is whether this case belongs to the dangerous class. The same structure applies outside medicine: the same action may be safe for one customer, contract, incident, or workflow and unsafe for another because a boundary condition applies.
This section also explains why “memory” in agent systems cannot mean only stored preferences or chat history. Yan and Murray describe coding-agent memory as still unsolved because the system must decide what to remember, how general a correction should become, when to retrieve it, and when to prune or update it. Neo4j extends that concern from coding practice to institutional decision-making. A production agent needs to remember not just facts, but decisions and the reasons behind them.
Elad Gil’s commercial argument explains why the bar is rising. He says AI startups are increasingly selling “human labor equivalents” — work hours, labor hours, or “bits of cognition” — rather than software seats. SaaS sold tools and workflow systems that humans used. In Gil’s framing, the new AI company sells work product itself: legal analysis, code, research, support handling, or another chunk of cognition that previously would have been purchased as labor.
That commercial shift puts pressure on every layer described above. If a buyer is purchasing completed work rather than a suggestion in a UI, the buyer needs to know where the work ran, which data and tools it touched, who approved sensitive actions, whether the behavior was evaluated by the right experts, and what precedent or policy shaped the output. Runtime, governance, observability, and decision memory stop being backend implementation details. They become part of what the customer is buying when the product is labor-equivalent output.
None of the material settles which vendor architecture will dominate. Yan and Murray emphasize background coding infrastructure; Coffey emphasizes durable agent harnesses; Grogan-Avignon and Wang emphasize executable governance and evidence-based autonomy; Hetzel emphasizes semantic observability; Zaim and Kollegger emphasize context graphs for rules, precedent, and authority; Gil emphasizes the shift from software seats to work product. The common pattern is broader than any one implementation: as agents become production workers, the stack around the model becomes the product.











