Enterprise AI Agents Need Sandboxed Runtimes and Deny-By-Default Governance

Adel Hallak, Joe Davis, Alex Kantrowitz · Tuesday, May 26, 2026 · 12 min read

In a ServiceNow-sponsored interview, ServiceNow AI engineering executive Joe Davis and Nvidia agentic AI product chief Adel Hallak argue that enterprise AI agents should be built as governed systems, not as single models with broad autonomy. They describe agents as layered architectures of models, harnesses, tools, sandboxed runtimes, permissions and control towers, with default-deny access replacing trust in the model’s judgment. Davis points to ServiceNow’s internal automation of 90% of some IT support requests as the practical proof point; Hallak frames Nvidia’s OpenShell and model stack as infrastructure for making that kind of autonomy enforceable.

Enterprise agents are systems, not single models

This interview was presented as a ServiceNow-sponsored conversation with Joe Davis, ServiceNow’s EVP of AI Engineering and Delivery, and Adel Hallak, Nvidia’s vice president of product management for agentic AI. The central claim both made is that practical enterprise agents should not be understood as one model wrapped in a user interface. In Nvidia’s work with ServiceNow, Hallak described agents as systems composed of multiple models, runtimes, tools, policies, and specialized sub-agents. Some models may be proprietary frontier models; others may be open-source models customized for a domain.

That matters because the user sees one interface, while the work may be distributed across a more complicated architecture. Hallak gave the example of Nvidia’s “IQ” deep-research blueprint. It is presented as a single agentic capability, but he said it is made up of “no less than seven agents.” One agent acts as an orchestrator or team leader. A planner sits beside it to maintain tasks and to-do lists. Underneath are multiple sub-research agents, each fine-tuned for a specialty.

For orchestration, Hallak said Nvidia has seen strong results with Anthropic’s Opus or OpenAI’s GPT-4o. The sub-research agents, by contrast, can be built with Nvidia’s open-source Nemotron models. He described different sub-agents as having distinct strengths: critiquing, gathering facts, or looking ahead to anticipate what may come next.
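The division of labor Hallak describes (an orchestrator, a planner that keeps the to-do list, and specialized sub-agents) can be pictured with a minimal sketch. Everything here is a stand-in: the specialist names, the fixed plan, and the string outputs are assumptions for illustration, not Nvidia’s blueprint, and real sub-agents would call models rather than return canned text.

```python
from typing import Callable

# Hypothetical specialists standing in for fine-tuned sub-research agents.
SPECIALISTS: dict[str, Callable[[str], str]] = {
    "facts": lambda topic: f"facts about {topic}",
    "critique": lambda topic: f"critique of {topic}",
    "forecast": lambda topic: f"outlook for {topic}",
}


def plan(topic: str) -> list[str]:
    """Planner: maintain an ordered to-do list of specialist tasks.

    A real planner would derive the list from the topic; this one is fixed.
    """
    return ["facts", "critique", "forecast"]


def orchestrate(topic: str) -> list[str]:
    """Orchestrator: dispatch each planned task to the matching specialist."""
    return [SPECIALISTS[task](topic) for task in plan(topic)]


report = orchestrate("GPU supply")
```

The user would see only `report`; the planner and the three specialists stay behind the single interface.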

Davis emphasized that this is not merely a conceptual architecture.

That’s not a theory. That’s how ServiceNow works.

Joe Davis

ServiceNow uses a range of models for different use cases, including frontier models and fine-tuned models running on Nvidia software inside ServiceNow’s own GPU clusters. The operating question is not simply which model is best in the abstract, but which model should receive which work, given accuracy and efficiency constraints.
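One way to picture that routing question is a small function that picks the cheapest model meeting a task’s accuracy floor. The sketch below is illustrative only: the model labels, accuracy scores, and costs are invented for the example and do not describe how ServiceNow or Nvidia actually route work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str             # hypothetical label, not a real product name
    accuracy: float       # assumed score on this task type, 0..1
    cost_per_call: float  # assumed relative cost


def route(options: list[ModelOption], min_accuracy: float) -> ModelOption:
    """Return the cheapest model that meets the accuracy floor."""
    eligible = [m for m in options if m.accuracy >= min_accuracy]
    if not eligible:
        raise ValueError("no model meets the accuracy requirement")
    return min(eligible, key=lambda m: m.cost_per_call)


OPTIONS = [
    ModelOption("frontier-model", accuracy=0.95, cost_per_call=10.0),
    ModelOption("fine-tuned-small-model", accuracy=0.88, cost_per_call=1.0),
]

routine = route(OPTIONS, min_accuracy=0.85)  # the cheaper model qualifies
hard = route(OPTIONS, min_accuracy=0.93)     # only the frontier model qualifies
```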

That is also why Davis said ServiceNow’s partnership with Nvidia is not limited to chip purchasing or inference infrastructure. The companies have worked together for years on fine-tuning models, open-sourcing models, and publishing benchmarks aimed at enterprise use cases. Davis said those benchmarks can then be used by frontier-model providers to improve on the kinds of problems ServiceNow and Nvidia care about. When Alex Kantrowitz joked that this amounted to inviting model makers to “benchmark hack,” Davis accepted the practical benefit: if models improve on those benchmarks, ServiceNow’s customers benefit.

The partnership is therefore positioned around the layers above and around the model as much as the model itself. Hallak invoked Jensen Huang’s “five-layer cake” description of AI, from energy at the bottom to agents and applications at the top. Nvidia, in this framing, is not just a supplier to model labs. It is also a collaborator on the software and governance layers that Davis and Hallak say are needed to make agents usable inside large companies.

The problem is not autonomy itself, but unbounded autonomy

Joe Davis defined the agent category at issue as “claw”-style digital assistants: 24x7 software workers that run on a computer, have access to the same information an employee has, and can take action on that employee’s behalf. They can manage inboxes, Teams or Slack channels, research reports, customer communications, and other administrative work that accumulates during the day.

The attraction is obvious: these assistants can take over tedious work. The risk is the flip side of the same capability. Davis said people do not yet trust them in enterprise settings because they are powerful, ungoverned, and unbounded. The enterprise task is to combine autonomy with control.

Adel Hallak adopted Davis’s phrase “unbounded autonomy” to describe what recent open-source agent projects made visible. He referred to OpenClaw as showing what happens when stronger models, better harnesses, and more secure runtimes are combined. Hallak claimed OpenClaw became “the fastest growing GitHub project on GitHub” and said it surpassed Linux and React “in a few weeks.” His point was the intensity of developer interest around agents that can pursue goals with broad tool access.

The examples were not just chatbots answering questions. Kantrowitz described users running these systems on separate machines, attaching an email address, and asking them to complete goals such as finding and booking a Vespa rental in Italy. Hallak said such stories were not theoretical: if the agent could not communicate with a Vespa shop online, it could download a text-to-speech model and call the shop to complete the booking. The user gives the goal; the agent figures out the route.

Davis compared these agents to “mini engineers.” Like engineers, they can read public information, access private context, write code, and in some cases deploy code. The difference is that they can operate continuously and “at the speed of machines,” as Hallak put it. They can also learn or adapt to solve tasks they were not explicitly built to handle.

That capability is precisely what makes the enterprise version difficult. Hallak described the “lethal trifecta” that worries CISOs and compliance officers: unfettered internet access, access to internal knowledge bases, and access to a coding terminal. Any two may be manageable. All three together create a different risk profile.
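The “any two, but not all three” rule lends itself to a simple check over an agent’s granted capabilities. This is a sketch under assumptions: the capability labels are invented here, and a real review would look at much finer-grained permissions.

```python
# Hypothetical capability labels for the three risky grants Hallak lists.
TRIFECTA = frozenset({"internet", "internal_knowledge", "coding_terminal"})


def has_lethal_trifecta(granted: set[str]) -> bool:
    """True only when all three risky capabilities are granted together."""
    return TRIFECTA.issubset(granted)


research_agent = {"internet", "internal_knowledge"}
dev_agent = {"internet", "internal_knowledge", "coding_terminal", "email"}

flags = {
    "research_agent": has_lethal_trifecta(research_agent),
    "dev_agent": has_lethal_trifecta(dev_agent),
}
```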

Davis said ServiceNow’s Project Arc is aimed at that boundary. The goal is not a personal assistant used casually on a separate machine, but an enterprise agent for meaningful work inside the largest companies. The central design question is how to create a safe environment in which the enterprise explicitly chooses what data, files, systems, and actions the agent can access, and which policies govern those actions.

The runtime is where policy becomes enforcement

Adel Hallak described Nvidia’s OpenShell as an open-source secure runtime that sits between infrastructure and the agent. Its job is to define what an agent can do while it is operating: what it can access, what it can read from, what it can write to, which APIs it can call, and which LLMs it should route work to. It also spins up sandboxes and runs agents inside them so that enterprise policies can be enforced during execution.

The analogy Hallak used was employee onboarding. A company does not give a new employee access to every machine and internal system. Access depends on role, task, and scope. OpenShell is meant to provide a comparable runtime environment for agents.

Joe Davis said ServiceNow builds on that runtime through a desktop application inside the sandbox that connects to ServiceNow’s cloud. OpenShell enforces local actions. ServiceNow’s AI Control Tower gives the enterprise central visibility across agents running on users’ machines and adds cloud-level governance. Davis said this lets a company see possible attack surfaces across individual machines in one place. Hallak added that policies can be pushed down from the AI Control Tower to govern how agents behave across the company.

The governance model becomes more important because, Davis agreed, models do not possess human ethics or morals. Alex Kantrowitz raised Mark Cuban’s analogy of AI as a young child that does not understand consequences. Davis accepted the premise: a person probably knows not to hack into Workday to look at a salary; an agent may not. The answer, in his view, is not to expect moral understanding from the model, but to enforce permissions and trust at runtime.

Kantrowitz pressed the core question: how can a company know the agent will stay inside the rules? Davis’s answer was deterministic enforcement. Every agent has an identity and permissions, so if an AI decides it wants to update a salary in Workday, ServiceNow controls whether it can access the system or execute the action.

We can either block it or prevent it. It’s a deterministic thing that we 100% control.

Joe Davis

Hallak framed the same principle more bluntly: the default is no. When OpenShell spins up an agent inside a sandbox, it does not begin with broad permission. Access to specific processes or actions must be explicitly granted. He called the posture “deny by default,” and compared it to zero trust.

Deny by default. Zero trust is another way to think about it.

Adel Hallak
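A deny-by-default posture can be sketched as an allow-list tied to an agent identity: an action is permitted only if it was explicitly granted, and everything else is refused. The class, method names, and resource labels below are hypothetical and are not OpenShell’s or ServiceNow’s API; the interview does not show the real policy format.

```python
from dataclasses import dataclass, field


@dataclass
class SandboxPolicy:
    """Hypothetical allow-list: anything not explicitly granted is denied."""
    agent_id: str
    allowed: set[tuple[str, str]] = field(default_factory=set)  # (resource, action)

    def grant(self, resource: str, action: str) -> None:
        """Explicit opt-in for one action on one resource."""
        self.allowed.add((resource, action))

    def is_allowed(self, resource: str, action: str) -> bool:
        """Deterministic check: a plain set lookup, no model judgment involved."""
        return (resource, action) in self.allowed


policy = SandboxPolicy(agent_id="it-helper-01")  # starts with no permissions
policy.grant("ticketing", "read")

checks = {
    "read ticket": policy.is_allowed("ticketing", "read"),
    "write ticket": policy.is_allowed("ticketing", "write"),
    "update salary": policy.is_allowed("hr_system", "update_salary"),
}
```

Only the granted read succeeds. The other two requests fail because nothing ever allowed them, not because the agent chose to hold back.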

Davis treated stories of agents “breaking out” of guardrails as evidence of missing sandbox enforcement. If an agent somehow emails a developer despite supposedly lacking internet access, his interpretation is that the environment likely gave the agent access to email, perhaps unknowingly, or operated with “default yes” capabilities. The safety posture he argued for is explicit opt-in.

Hallak also stressed that runtime controls do not replace other guardrail layers. LLM guardrails still matter for content and behavior, such as preventing the model from saying unethical things. OpenShell’s role is different: it controls what the agent can and cannot do in the environment. In his view, the layers augment one another.

A harness turns a model into an agent

Alex Kantrowitz asked for a plain-English definition of “harness,” a term both speakers used repeatedly. Joe Davis noted that some people also use “orchestration,” while Adel Hallak said another term had been “scaffolding.” Hallak argued for “harness” because it captures the set of tooling given to the model.

His short definition was: model plus harness, in a runtime, equals an agent. The harness includes access to file systems, prescribed tools such as a code interpreter, MCP tools, and skills. It is the defined and opinionated set of capabilities that work with the model. Hallak said that while earlier discussion focused heavily on models improving, current agent progress is increasingly about “harness engineering.” Better harnesses can translate into better agent outcomes.

Davis added that a harness is a loop trying to accomplish a defined task with tool access. Which tools the harness can use changes what the agent can do. If one harness can write code, it can behave dynamically. If another cannot write code, its behavior is more static and predefined. In an enterprise deployment, that makes the harness a control surface: it determines not only capability, but also how far the agent can go.

That is also how Davis answered the probabilistic-versus-deterministic concern. The LLM at the core of an agent reasons and plans probabilistically, but the harness around it can add determinism through governance, security, trust, integrations, and permissions. The model may generate variable plans. The surrounding system controls which actions can actually be executed.
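The split between a probabilistic planner and a deterministic gate can be shown with a toy harness loop. This is a minimal sketch: `fake_model_plan` stands in for an LLM call, and the tool names and permission set are assumptions made up for the example.

```python
from typing import Callable


def fake_model_plan(task: str) -> list[tuple[str, str]]:
    """Stand-in for an LLM: proposes steps, including one it should not take."""
    return [("search_kb", task), ("update_salary", "+10%"), ("summarize", task)]


# Hypothetical tools the harness knows how to run.
TOOLS: dict[str, Callable[[str], str]] = {
    "search_kb": lambda query: f"results for {query}",
    "summarize": lambda text: f"summary of {text}",
    "update_salary": lambda change: f"salary changed {change}",
}

# Assumed permissions for this agent; update_salary is deliberately absent.
PERMITTED = {"search_kb", "summarize"}


def run_harness(plan: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Run each proposed step only if the tool exists and is permitted."""
    log = []
    for tool_name, arg in plan:
        if tool_name not in TOOLS or tool_name not in PERMITTED:
            log.append((tool_name, "blocked"))
            continue
        log.append((tool_name, TOOLS[tool_name](arg)))
    return log


log = run_harness(fake_model_plan("vpn outage"))
```

Whatever plan the model produces, the salary update is never executed, because the gate depends only on the permission set.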

Hallak extended the point to “skills.” In a self-evolving agent, the user may describe an outcome and provide a few hints about tools, then let the agent figure out how to complete the task. But if a task recurs, the system can encapsulate it as a skill: a human-language set of instructions for getting to that outcome. His example was scheduling lunch every Friday with specific people. The agent should not have to rediscover who the people are, which calendars to check, or where to find the relevant systems each time. The workflow can become a reusable skill.
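A skill in this sense can be sketched as saved plain-language instructions that are prepended to the prompt when a matching task recurs. The registry, the substring matching rule, and the instruction text are all hypothetical; a real harness would likely match tasks to skills in a more robust way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str          # short phrase used to recognize the recurring task
    instructions: str  # human-language steps the agent reuses


SKILLS: dict[str, Skill] = {}


def save_skill(skill: Skill) -> None:
    SKILLS[skill.name] = skill


def build_prompt(task: str) -> str:
    """Prepend a saved skill's instructions when the task mentions its name."""
    for skill in SKILLS.values():
        if skill.name in task.lower():
            return f"{skill.instructions}\n\nTask: {task}"
    return f"Task: {task}"


save_skill(Skill(
    name="friday lunch",
    instructions="Check the usual attendees' calendars, then send the invite.",
))

known = build_prompt("Schedule Friday lunch for this week")
unknown = build_prompt("Reset my password")
```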

Skills do not make an LLM perfectly deterministic, Hallak said, but they make repeated workflows more efficient and more consistent. They also sit within the harness, alongside tools and permissions. In the ServiceNow-Nvidia partnership, he said these capabilities complement ServiceNow’s domain-specific autonomous agents, including what he described as 20 autonomous agents ServiceNow has discussed making available.

The first production proof point is internal IT support

Joe Davis used ServiceNow’s L1 AI IT specialist as the clearest production example. The use case is familiar: employees need access to an application, cannot open email or a browser, or run into some other workplace-technology problem. Historically, they file a support request or incident. A human support worker picks it up, logs into another system, grants access or investigates the problem, and closes the ticket. Because of backlogs, that can take days.

The AI specialist is designed as an ambient first-level triager. It is always running in the background. When requests come in, it evaluates them, does research, and determines whether it can solve the problem directly. If an employee needs access to Zoom, for example, it may simply grant that access if policy allows it.

Davis said the harness governs whether the AI can take that action. If the AI can resolve the issue, resolution time can fall dramatically. He said ServiceNow has seen resolution times reduced by as much as 99% when the AI can complete the task in minutes rather than leaving the employee waiting for a human support queue.

99%

maximum reduction in resolution time Davis said ServiceNow has seen when AI can resolve an IT issue

Adel Hallak said the system is also useful when it cannot solve the ticket. It can read documents, inspect screenshots, do deep research, and then provide context to the human support engineer. It might determine that the problem is not a password reset or a simple Zoom-access issue, and instead flag a more likely cause. The human starts with a researched case summary rather than a blank ticket.
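The two outcomes described here, resolving directly or handing a researched summary to a human, can be sketched as a small triage function. The ticket fields, request types, and policy set are assumptions for illustration and do not reflect ServiceNow’s data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    requester: str
    request_type: str  # hypothetical category label
    details: str


# Assumed policy: request types the AI may resolve without a human.
AUTO_RESOLVABLE = {"app_access"}


def triage(ticket: Ticket) -> dict[str, str]:
    """Resolve the ticket if policy allows; otherwise escalate with context."""
    if ticket.request_type in AUTO_RESOLVABLE:
        return {"status": "resolved",
                "note": f"Granted access for {ticket.requester}"}
    return {"status": "escalated",
            "note": f"Summary for support engineer: {ticket.details}"}


granted = triage(Ticket("dana", "app_access", "Needs Zoom"))
handoff = triage(Ticket("lee", "network", "VPN drops after login"))
```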

Kantrowitz asked whether ServiceNow had automated 80% to 90% of these issues. Hallak responded that he had heard ServiceNow CEO Bill McDermott say at GTC that 90% of L1 tickets were solved. Davis then made a separate claim about ServiceNow’s own internal use: the company uses ServiceNow to run ServiceNow, and he said it has automated 90% of support requests in this area.

90%

support requests in this area that Davis said ServiceNow has automated internally

For Davis, this is the pattern for broader enterprise deployment. The question is not which isolated tasks are impressive demos, but which jobs people already do today. ServiceNow looks for places where an AI can be added to the team the way another human would be: HR service desks, CRM call centers, IT service management, and related enterprise workflows.

Kantrowitz gave another example from ServiceNow: salespeople asking about compensation. In the old process, a salesperson might ask HR to calculate commission earned so far, wait several days, and only then get the answer. Davis agreed that this is the kind of process that can be automated so the employee receives an immediate response.

The next phase is adoption, reliability, and eventually physical systems

Joe Davis kept his near-term outlook focused on enterprise adoption. He said only a fraction of real enterprises have meaningful AI adoption across the company. Over the next couple of years, he expects the work to center on deploying AI into complex business scenarios, not just demonstrating isolated capabilities.

He also said he would not be surprised if something beyond LLMs became popularized in the next few years. He pointed to research aimed at breaking through the reliability limits that people have come to understand, and often work around, in current LLMs. In his framing, hallucination, reliability, and accuracy remain problems where breakthroughs may still come.

Adel Hallak looked further toward what he called “physical AI.” Kantrowitz translated that loosely as robots; Hallak accepted the connection while avoiding the broad label. He said current work is making agentic AI usable in software: LLMs have improved, harnesses have improved, runtimes are becoming more secure, and compliance is becoming clearer. The next frontier, in his view, includes AI governing or coordinating physical assets.

He imagined looking back in a few years and finding it strange that humans once operated heavy machinery. He also connected this to the AI Control Tower concept: today it can help humans see what agents are doing, and it can help agents understand what other agents are doing and report back to humans. In a few years, he suggested, a comparable governance layer may help govern humans, agents, robots, and physical assets.

Alex Kantrowitz offered a different prediction: more internal and business communication will become visual. He pointed to the rise of AI image models and continued development of AI video models, and imagined enterprise users generating on-demand infographics or video explainers for complex, individualized topics. Hallak agreed with the direction, describing it as an efficiency and optimization problem where the underlying technology is present but needs scale, efficiency, and better models. Davis also endorsed the idea.

The shared throughline was not that agents become useful by becoming unconstrained. It was the opposite: the more capable they become, the more the surrounding system matters. Models reason and plan. Harnesses supply tools. Runtimes enforce boundaries. Control towers provide visibility and policy. For Davis and Hallak, that layered architecture is what moves enterprise agents from impressive experiments into deployable workers.
