Enterprise AI Agents Need Harnesses, Traces, and Controlled Runtimes
LangChain co-founder and CEO Harrison Chase argues that enterprise AI agents are becoming an architectural problem rather than a question of adding autonomy wherever possible. In an NVIDIA AI Podcast interview, he says systems such as Claude Code, Manus, and Deep Research share a common “deep agent” pattern: an LLM in a tool-calling loop, supported by a reusable harness, a workspace, subagents, and planning. For enterprises, Chase says trust depends on choosing the right level of autonomy and surrounding agents with observability, evaluation, secure runtimes, and ongoing iteration.

The enterprise agent problem is now architectural
Harrison Chase traces LangChain’s original premise to a pattern he saw early in the generative AI cycle: developers were building applications around LLMs, and the systems around those models already had recognizable similarities. The models mattered, but so did the data connections, tools, workflows, and control structures that made an application usable.
Podcast host Noah Kravitz framed LangChain as one of the more striking infrastructure stories of the period, saying the company began about three years ago and had passed one billion downloads. Chase’s origin story was more practical: the team saw that LLM applications would become complex, and that developers would need common tooling for what are now called agents.
The term Chase emphasized was “deep agents.” He said the idea emerged from a pattern LangChain saw about a year ago across systems such as Claude Code, Manus, and Deep Research. Under the hood, these systems shared a general architecture: an LLM running in a loop and calling tools, with recurring pieces such as a file system, subagents, and planning.
That observation led LangChain to release Deep Agents about nine months before the interview. Chase described it as an open source, model-agnostic, general-purpose agent harness: something developers can customize with prompts and tools instead of reinventing the same surrounding system for each application.
And so Deep Agents is really this new type of agent harness that we think is really general purpose and that you can customize to do different things. But it's not like you're reinventing the scaffolding each time.
The appeal, in Chase’s account, is not that the harness is elaborate. It is that it is simple enough to reuse. The underlying loop is still an LLM interacting with tools, but the surrounding environment gives it more autonomy and a more capable workspace. He pointed to OpenHands as another example powered by this type of harness.
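The loop Chase describes is simple enough to sketch. The following is a minimal, hypothetical harness in Python, not the Deep Agents API: a stubbed model picks tool calls, the harness executes them against a small workspace with a file system and a plan, and the run ends when the model stops calling tools. Every name here (call_model, Workspace, TOOLS) is an illustrative assumption.

```python
# Minimal sketch of the "deep agent" pattern: an LLM in a tool-calling loop,
# with a workspace (an in-memory file system), a planning scratchpad, and a
# stop condition. The model is stubbed; a real harness would call an LLM and
# parse the tool call it returns.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class Workspace:
    files: dict = field(default_factory=dict)   # hypothetical in-memory file system
    plan: list = field(default_factory=list)    # hypothetical planning scratchpad

def write_file(ws: Workspace, path: str, text: str) -> str:
    ws.files[path] = text
    return f"wrote {path}"

def write_plan(ws: Workspace, steps: list) -> str:
    ws.plan = steps
    return f"plan has {len(steps)} steps"

TOOLS = {"write_file": write_file, "write_plan": write_plan}

def call_model(history: list) -> Optional[ToolCall]:
    """Stub for the LLM: scripts two tool calls, then stops."""
    scripted = [
        ToolCall("write_plan", {"steps": ["outline report", "draft report"]}),
        ToolCall("write_file", {"path": "report.md", "text": "# Draft"}),
    ]
    return scripted[len(history)] if len(history) < len(scripted) else None

def run_agent(task: str) -> Workspace:
    ws, history = Workspace(), []
    while (call := call_model(history)) is not None:   # the tool-calling loop
        result = TOOLS[call.name](ws, **call.args)
        history.append((call, result))                  # a record of every step taken
    return ws

print(run_agent("write a short report").files)
```

The point of the sketch is the reuse Chase emphasizes: the loop, workspace, and planning pieces stay the same, and only the prompts and tools change per application.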
For enterprise agents, that reusable harness sits alongside observability, evaluation, secure runtimes, and model choice. The more autonomy a system receives, the more important it becomes to define the tools it can call, the environment that constrains it, and the tests that show whether it is doing what the business intended.
Enterprises do not need autonomy everywhere
Harrison Chase separated the enthusiasm around autonomous agents from the practical question of where they belong. “Not everything needs an autonomous agent,” he said. In some enterprise cases, LangGraph, LangChain’s framework for combining LLM autonomy with more directed and controllable workflows, remains the better fit.
That distinction matters because many enterprise discussions are not simply about whether agents can do more. They are about what level of autonomy is appropriate, what should be controlled by workflow, and what should be left open-ended. Some enterprise customers, Chase noted, tell LangChain they prefer LangGraph and plan to stay with it. His response was that this is fine: different use cases require different tools.
The deeper enterprise concern is visibility. Chase contrasted agents with conventional software by saying the interaction space for agents is far more open-ended. Text input is effectively unbounded. A traditional UI constrains users to buttons and fields; an agent can be asked almost anything. At the same time, models are non-deterministic and sensitive to small wording changes.
That combination is why observability and evaluation sit at the center of enterprise trust. If the model can take many possible paths, the organization needs to see what path it took. If a change in wording can change the answer, the organization needs ways to test behavior across more than one example.
LangSmith is LangChain’s answer to that layer. Chase described the agent development life cycle as “build, test, run, manage.” The build layer is where teams choose the agent framework: LangGraph, Deep Agents, or another modular tool. LangSmith then covers the operational loop around that build: testing and evaluating the system, deploying it at scale, observing what happens in production, and managing it over time.
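As a rough picture of the observe step, the sketch below assumes the langsmith Python SDK’s traceable decorator and tracing configured through environment variables (for example LANGSMITH_API_KEY); the agent and its tool are stubs, and a production LangSmith setup may differ.

```python
# Sketch of the observability layer: decorating each step so a run shows up as
# a tree of steps that can be inspected later. Assumes tracing is enabled via
# environment variables; the "agent" here is a stub, not a real LLM call.

from langsmith import traceable

@traceable(run_type="tool", name="search_docs")
def search_docs(query: str) -> str:
    return f"3 documents matched '{query}'"          # stand-in for a real tool

@traceable(run_type="chain", name="support_agent")
def support_agent(question: str) -> str:
    evidence = search_docs(question)                  # child step, nested in the trace
    return f"Answer based on: {evidence}"             # stand-in for the model's reply

if __name__ == "__main__":
    print(support_agent("How do I rotate my API key?"))
```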
Trust comes from traces, scenarios, and iteration
For Harrison Chase, trust in an enterprise means the agent is doing what the organization wants it to do. Teams build that trust in two main ways. The first is traceability: inspecting an agent run and seeing exactly what steps it took. The second is evaluation: creating scenarios, running the agent against them, and judging the results.
Chase called this “evaluation-driven development.” He pushed back on the idea that teams need a large benchmark before evals become useful. Starting with five or ten scenarios, he said, can still clarify what the agent is supposed to do.
Agents can do anything, but they shouldn't do everything.
Writing evaluation scenarios is also product work. Teams identify the questions they expect the agent to receive, define what a good response looks like, and define what a bad response looks like. Those examples then become a benchmark for changes. If a team changes a prompt, it can run the agent against the same scenarios and see whether performance improved or regressed.
The evaluation set is not static. Once an agent reaches users, even a small group, people may use it in unexpected ways. Some uses may call for guardrails. Others may be legitimate use cases the team had not anticipated. In those cases, new data points should be added to the eval set so future prompt or architecture changes preserve the agent’s ability to handle them.
That iterative loop matters because Chase sees agent building as moving too quickly for long, closed development cycles. He described a failure mode in which an enterprise spends three months scoping examples, another three months building, and another three months having humans review everything. By then, the space may have moved enough that there is a better way to build the same idea.
The enterprises that do best still control risk. They ship in limited ways: internally, to alpha customers, or to a small percentage of users. But they do ship, learn, and revise. Enterprises are more cautious than generative AI-native startups, Chase said, but agent systems still require iteration.
Chase went further: given the pace of change, teams may need to revisit their agent architecture roughly every nine months.
You have to basically redo your agent every nine months at the pace that things have been.
His point was not only about performance. It was also about scope. A year and a half earlier, an agent may have been limited to a narrow task because the larger task was not feasible. If a newer harness can now handle the larger task, he argued, teams should reevaluate the architecture rather than preserve a smaller design by default.
Skills package knowledge, tools, and runtime choices
Asked about skills, Harrison Chase described them as a way to package knowledge, instructions, and tools for an agent. The pattern began in coding agents, where a skill could consist of a markdown file with instructions plus scripts the agent could run.
The same structure has broader use because coding agents are, in Chase’s words, general purpose in many ways. Some skills are informational: the agent reads a markdown file to learn how to handle a task. Others do things. A script might hit a URL, or it might run GPU-accelerated compute.
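As a rough illustration of that packaging, the hypothetical sketch below treats a skill as a folder containing a SKILL.md instruction file plus executable scripts. The layout, file names, and loader are assumptions for illustration, not any specific product’s skill format.

```python
# Hypothetical sketch of a "skill" as a folder: a markdown file the agent reads
# for instructions (the informational part) and scripts it can execute (the
# part that does things). A harness would expose the first as context and the
# second as tools.

from pathlib import Path
import subprocess

def load_skill(skill_dir: str) -> dict:
    root = Path(skill_dir)
    instructions = (root / "SKILL.md").read_text()   # informational part
    scripts = sorted(root.glob("scripts/*.py"))      # executable part
    return {"instructions": instructions, "scripts": scripts}

def run_skill_script(script: Path, *args: str) -> str:
    # The harness decides where this runs: locally, in a sandbox, or on GPUs.
    result = subprocess.run(["python", str(script), *args],
                            capture_output=True, text=True, timeout=60)
    return result.stdout or result.stderr

# Usage (assuming such a folder exists):
#   skill = load_skill("skills/quarterly_report")
#   print(skill["instructions"][:200])
#   print(run_skill_script(skill["scripts"][0], "--quarter", "Q3"))
```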
This led Chase to a three-part model of an agent system: the model, the harness, and the environment or runtime where the agent operates. In that model, Deep Agents is the harness. The runtime governs where and how the agent executes actions. Chase mentioned NVIDIA’s OpenShell as a secure runtime, and he connected runtime choices to deployment choices: whether the agent runs on a Mac mini, in a GPU-accelerated environment, or in the cloud.
Security is not handled by a single control. It depends on what the model can do, what the harness exposes, what tools and skills are packaged for it, and what runtime constrains its execution. For enterprise agents, that environment layer becomes part of the architecture rather than an implementation detail.
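What the runtime layer contributes can be sketched with a toy policy: the harness asks to run a command, and the runtime applies an allowlist, a timeout, and a confined working directory. This is an illustration of the idea only, not NVIDIA’s runtime or any particular sandbox product.

```python
# Toy runtime policy: commands from the agent are only executed if they are on
# an allowlist, and they run with a timeout inside a dedicated working
# directory. Real sandboxes isolate far more (network, filesystem, users).

import shlex
import subprocess
from pathlib import Path

ALLOWED_COMMANDS = {"ls", "cat", "python"}      # assumption: a small allowlist
WORKDIR = Path("./agent_workspace")

def run_in_runtime(command: str, timeout: int = 30) -> str:
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        return f"denied: '{argv[0] if argv else command}' is not an allowed command"
    WORKDIR.mkdir(exist_ok=True)
    done = subprocess.run(argv, cwd=WORKDIR, capture_output=True,
                          text=True, timeout=timeout)
    return done.stdout or done.stderr

print(run_in_runtime("ls"))
print(run_in_runtime("rm -rf /"))   # rejected by the allowlist
```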
Open models are becoming useful inside agent systems
Harrison Chase described model choice as increasingly heterogeneous. In a deep research system, for example, there may be an orchestrator agent using a frontier model while subagents use fine-tuned or open source models for cost or latency reasons. In a large agentic system, one part might use a frontier model, another an open model, and another a fine-tuned model.
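One way to picture that heterogeneity is a routing table that assigns a model to each role for capability, cost, or latency reasons. The model names and the dispatch function below are placeholders, not recommendations from the interview.

```python
# Sketch of heterogeneous model choice inside one agentic system: each role is
# mapped to a model class, and a real system would dispatch to the matching
# provider or local server.

MODEL_FOR_ROLE = {
    "orchestrator": {"model": "frontier-large",   "reason": "planning and delegation"},
    "web_research": {"model": "open-weights-8b",  "reason": "cheap, parallel subagents"},
    "summarizer":   {"model": "fine-tuned-small", "reason": "narrow, latency-sensitive task"},
}

def call(role: str, prompt: str) -> str:
    choice = MODEL_FOR_ROLE[role]
    return f"[{choice['model']}] response to: {prompt!r}"   # placeholder dispatch

print(call("orchestrator", "Plan the research for the quarterly brief"))
print(call("web_research", "Find recent sources on agent harnesses"))
```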
LangChain had been paying more attention to open source models recently for two reasons. First, open models are getting good enough to drive the harness in some cases. They are still behind frontier models, Chase said, but they can increasingly use the tools and structures that make a system agentic.
When Kravitz asked what qualities a model needs to drive a harness successfully, Chase gave a broad answer first: it needs to be intelligent and good. Then he named a more specific and, in his view, underappreciated quality: coding ability. Because the harness resembles a coding agent environment — with a file system and a bash tool — models that are strong at coding can be stronger general-purpose agents. He gave Qwen Coder as an example of a coding model that LangChain has seen perform better as a general-purpose model than the broader Qwen series.
The second reason open models matter is cost, especially for agents that run proactively or continuously. Chase used the example of a coding agent that a user starts twenty times a day; paying more for that may be acceptable. But if an agent runs every ten minutes, or if several agents run continuously, cost becomes a major constraint.
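The arithmetic behind that concern is straightforward. The numbers below (tokens per run, per-token prices) are invented for illustration, but they show how an agent firing every ten minutes changes the cost picture compared with one started twenty times a day.

```python
# Back-of-the-envelope sketch of why run frequency drives model choice.
# All figures are assumptions for illustration, not quotes from the interview
# or from any provider's price list.

TOKENS_PER_RUN = 50_000     # assumed tokens consumed per agent run
FRONTIER_PRICE = 10.0       # assumed $ per 1M tokens, frontier model
OPEN_PRICE = 0.50           # assumed $ per 1M tokens, self-hosted open model

def monthly_cost(runs_per_day: int, price_per_million: float) -> float:
    return runs_per_day * 30 * TOKENS_PER_RUN / 1_000_000 * price_per_million

for label, runs in [("started 20x/day", 20), ("every 10 minutes", 24 * 6)]:
    print(f"{label:>18}: frontier ${monthly_cost(runs, FRONTIER_PRICE):,.0f}/mo, "
          f"open ${monthly_cost(runs, OPEN_PRICE):,.0f}/mo")
```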
That is where he placed the importance of the Nemotron Coalition. Chase said LangChain joined because “we need open models and we need harnesses that they can run in.” LangChain believes it can provide the harness, while working with NVIDIA and other coalition members on models that run well in that harness.
Open models still lag frontier models at driving the most capable workloads, in Chase’s view. But if they can handle those expensive workloads, he argued, the impact could be substantial: more use with sensitive data, lower cost, and broader customer offerings. Chase also added a third component beyond model and harness: an open runtime, which he said is important even if it was not originally part of the coalition framing.
The next agent interface may be an orchestrator managing background workers
Looking ahead, Harrison Chase identified asynchronous subagents as a near-term development, potentially arriving within a month or two of the interview. Today, when an agent starts a subagent, it generally waits for that subagent to return. That works for short tasks. But as subagents become long-running, they should be able to run in the background while a manager or orchestrator agent checks in, updates them, and reports progress.
Chase expects this pattern to appear in coding. Instead of interacting directly with the agent doing the coding, a user may talk to an orchestrator. That orchestrator spins up background coding agents, tracks experiments or features, and answers questions about what is happening.
He was precise about where the productivity gain comes from. Asynchronous subagents only matter if the subagents themselves run for a while. If a subagent returns in one second, synchronous execution is enough. The larger change is that agents become long-running workers; the orchestrator is a better interface on top of that work.
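A minimal asyncio sketch captures the shape of that pattern: subagents run as background tasks while the orchestrator reports on their progress in the meantime. The work is simulated with sleeps, and the task names are illustrative.

```python
# Sketch of the orchestrator pattern: long-running subagents execute in the
# background while the orchestrator can be asked how things are going.

import asyncio

async def subagent(name: str, seconds: int) -> str:
    await asyncio.sleep(seconds)                      # stand-in for long-running work
    return f"{name}: done"

async def orchestrator() -> None:
    jobs = {
        "feature-branch-agent": asyncio.create_task(subagent("feature-branch-agent", 3)),
        "test-fix-agent": asyncio.create_task(subagent("test-fix-agent", 1)),
    }
    while any(not t.done() for t in jobs.values()):
        status = {name: ("done" if t.done() else "running") for name, t in jobs.items()}
        print("progress check:", status)              # what a user would ask the orchestrator
        await asyncio.sleep(1)
    for name, task in jobs.items():
        print("result:", await task)

asyncio.run(orchestrator())
```

If every subagent returned in a second, the background machinery would add nothing, which is Chase’s point: the orchestrator only pays off once the underlying work is long-running.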
Chase then added another development he said may be even more impactful: proactive, always-on, event-driven agents. He described using an email agent that runs in the background, watches his email, and drafts responses. A human remains in the loop: the agent flags a draft and asks whether he wants to approve or change it. The productivity gain, in his example, comes from avoiding the manual workflow of copying an email into ChatGPT, asking for a response, and copying the result back.
And so I think like these always-on, asynchronous, event-driven agents, that will be a really big productivity unlock. And especially at enterprises, there's so many events that are just triggering, triggering, triggering.
For enterprises, Chase sees the event-driven pattern as especially significant because organizations already produce streams of events. If agents can listen to those events and act with appropriate controls, he said, the gain could be “massive.”
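The email example reduces to a small event handler with a human approval gate. The sketch below is hypothetical: the inbox is a hard-coded list, and the drafting and approval steps are stubs standing in for an LLM call and a review step.

```python
# Sketch of a proactive, event-driven agent with a human in the loop: events
# arrive, the agent drafts a reply, and nothing is sent until a person
# approves the draft.

INCOMING = [
    {"from": "customer@example.com", "subject": "Invoice question"},
    {"from": "partner@example.com", "subject": "Meeting next week?"},
]

def draft_reply(email: dict) -> str:
    # Placeholder for an LLM call that drafts a reply in the user's voice.
    return f"Thanks for your note about '{email['subject']}'. I'll follow up shortly."

def human_approves(draft: str) -> bool:
    # Placeholder for a review UI; here every draft is approved automatically.
    return True

def on_new_email(email: dict) -> None:
    draft = draft_reply(email)
    if human_approves(draft):
        print(f"queued reply to {email['from']}: {draft}")
    else:
        print(f"draft for {email['from']} held for edits")

for event in INCOMING:   # in production this would be an event subscription, not a list
    on_new_email(event)
```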
Memory and identity are unresolved parts of enterprise agents
Harrison Chase also named agent memory as a coming area of importance. AutoGPT showed an early version of this idea, he said, but he expects more agents to remember interactions, update their own tools and skills, and learn over time.
That expectation is also why he does not expect these systems to become fully autonomous. In his view, learning requires interaction with the environment and with humans. Human-in-the-loop workflows are not only a safety mechanism; they are part of how the agent improves.
The other unresolved issue is agent identity. Chase framed the problem through enterprise credentials: if both he and another employee chat with the same agent, whose credentials should it use? Should it act with the user’s credentials, with the other user’s credentials, or with a fixed set of credentials?
Before AutoGPT, he said, most systems used an “on behalf of” model. The agent acted on behalf of the end user, passing through credentials such as Slack permissions, which meant different users could receive different answers. AutoGPT changed the mental model by making people think of agents as entities with their own identity.
Chase used the example of a marketing agent named Tom. Multiple people could chat with Tom. Tom could have persistent memory, his own credentials, and his own accounts in services such as Slack or Gmail. “Tom is Tom,” Chase said — not merely a temporary agent acting on behalf of a particular user.
He did not claim the industry has solved this. He said he had spoken with a SaaS provider that, during the AutoGPT surge, made it easier for people to create accounts for their agents. But those were still accounts. Whether the market moves toward ordinary accounts for agents, special agent accounts, or something else remains open.
AutoGPT changed the ambition, but not the enterprise control problem
Asked how AutoGPT affected the agent market, Harrison Chase said it changed expectations. He recalled NVIDIA CEO Jensen Huang saying something like every enterprise needs an AutoGPT strategy, and said LangChain is seeing that dynamic.
AutoGPT set a new North Star for what agents could and should be able to do. It also made the ideas easier to communicate. But the very reason AutoGPT took off — that it could do everything — is also why enterprises need more controls when bringing similar systems into production.
For weekend projects and hobbyists, broad autonomy can be the point. In an enterprise, teams understandably want more control. That is why Chase connects the AutoGPT moment back to agent identity, observability, evals, runtime security, and harness design. The ambition expanded; the control requirements did not disappear.


