Google’s Agent Scaling Problem Is Quota, Observability, and Evaluation

Ian Ballantyne, KP Murphy-Sawhney · AI Engineer · Sunday, May 24, 2026 · 11 min read

KP Murphy-Sawhney and Ian Ballantyne describe Google DeepMind’s agent work as an infrastructure problem rather than a single-agent breakthrough. Their account centers on the constraints that appear when thousands of heavy users and agent workflows run at once: quota management, scarce compute, traceability, skills governance, evaluation, and review. Sawhney argues the next step for Deep Research is to move away from passing giant context blobs through a pipeline and toward shared workspaces where components can collaborate more like human researchers.

At Google scale, quota becomes part of the architecture

The scaling problem KP Murphy-Sawhney put first was token consumption. Agentic systems are “token-hungry,” so quota management on a per-user or per-team basis becomes important infrastructure rather than an afterthought.
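
As a rough illustration of per-team quota treated as infrastructure, the sketch below charges token usage against a budget before a request proceeds. The class name, team names, and limits are all hypothetical; nothing here reflects how Google actually meters usage.

```python
from dataclasses import dataclass, field


class QuotaExceeded(Exception):
    """Raised when a request would push a team past its token budget."""


@dataclass
class TokenQuota:
    # Hypothetical budgets: tokens allowed per team in the current window.
    limits: dict
    used: dict = field(default_factory=dict)

    def charge(self, team: str, tokens: int) -> int:
        """Record usage and return the team's remaining budget."""
        spent = self.used.get(team, 0) + tokens
        if spent > self.limits.get(team, 0):
            # Refuse before the request runs, so usage is never recorded.
            raise QuotaExceeded(f"{team} would exceed its token budget")
        self.used[team] = spent
        return self.limits[team] - spent


quota = TokenQuota(limits={"research": 1_000, "coding": 500})
remaining = quota.charge("research", 400)  # 600 tokens left for "research"
```

Checking the budget before the call, rather than after, is the design choice that makes quota part of the request path instead of a report that arrives later.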

That pressure shapes model selection. Sawhney expects users to mix and match models, including what he called “Gemma 4,” which he described as effectively free from a quota perspective when users run it on whatever GPUs or TPUs they have. More advanced models, in that view, can be reserved for specific components of an agentic system.

Evaluation has the same cost problem. Sawhney said Google DeepMind is looking at ways to test the harness and agentic flow without consuming large amounts of TPU time, including “mock TPUs.” The goal is to evaluate complicated workflows while minimizing the cost of evaluation itself.
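
One common way to exercise a harness without spending accelerator time is to swap the model for a scripted stand-in. The sketch below assumes a toy harness loop and a hypothetical `scripted_model` helper; it illustrates the general idea and is not a description of Google DeepMind’s “mock TPUs.”

```python
from typing import Callable, List

# Assumption: a "model" is any callable that maps a prompt to a text action.
Model = Callable[[str], str]


def run_agent(model: Model, task: str, max_steps: int = 5) -> List[str]:
    """Toy harness loop: ask the model for actions until it says DONE."""
    trace: List[str] = []
    for _ in range(max_steps):
        action = model(f"task={task}; history={trace}")
        trace.append(action)
        if action == "DONE":
            break
    return trace


def scripted_model(script: List[str]) -> Model:
    """Return a stand-in model that replays canned responses."""
    replies = iter(script)

    def model(prompt: str) -> str:
        # Once the script runs out, signal completion.
        return next(replies, "DONE")

    return model


trace = run_agent(scripted_model(["open_file", "edit_file", "DONE"]), "fix bug")
```

Because the harness only sees a callable, the same loop can later be pointed at a real model; the scripted version tests step limits, termination, and trace handling at near-zero cost.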

Ian Ballantyne connected this to the physical constraint underneath the product experience: compute is scarce. When an audience member asked how to prevent one user from taking down a system by spawning multiple instances or a “team” of agents, Sawhney said the current answer is still partly brute force. Google DeepMind has “real power users,” and at some point the response becomes: “you’ve gotta just stop right now.”

That operational reality also affects pricing models. Sawhney pointed to Anthropic blocking “open Claude” activity as an example of the tension created by token-hungry agentic systems. In his view, the subscription model does not work cleanly for that usage pattern.

Ballantyne added a Google-internal version of the same point. When he joined Google, he wondered how teams knew when they were using too much data-center capacity. A colleague told him, “they’ll tell you.” In practice, Ballantyne said, people monitor spikes and graphs around the clock, and SRE teams will reach out and ask a team to stop a job running on a particular cluster.

The quota issue also shows up directly in the user experience. An audience member said Antigravity’s browser testing was useful “when it works,” but that they struggled with limits and could not tell whether a failure was caused by the network or by hitting a usage limit. Sawhney answered that Google employees can have worse limits than customers because customers are prioritized over internal users. He said some of his repeated clicking during the live demo was because the system recognized him as a Googler.

Sawhney described the direction he expects from the harness: limits will remain, especially across tiers, but a user should be able to run out of credit or capacity on one model and move to another without breaking the workflow. A task might move from a Pro model to Flash, or, if everything in a subscription is exhausted, to a local model. The failure mode he wants to avoid is a task sitting for an hour doing nothing because a limit was hit.
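
The fallback behavior described here can be sketched as an ordered list of backends tried in turn. The backend names, the `LimitReached` exception, and the stub functions are assumptions for illustration, not Antigravity’s actual interface.

```python
class LimitReached(Exception):
    """Hypothetical signal that a backend's quota or credit is exhausted."""


def run_with_fallback(task, backends):
    """Try each (name, call) pair in order, moving on when a limit is hit."""
    for name, call in backends:
        try:
            return name, call(task)
        except LimitReached:
            continue  # fall through to the next, cheaper backend
    raise LimitReached("all backends exhausted")


def exhausted(task):
    # Stand-in for a hosted model whose quota has run out.
    raise LimitReached("quota used up")


def local_model(task):
    # Stand-in for a model running on the user's own hardware.
    return f"local answer for {task!r}"


used, result = run_with_fallback(
    "summarize logs",
    [("pro", exhausted), ("flash", exhausted), ("local", local_model)],
)
```

The point of the ordering is that hitting a limit becomes a routing decision rather than a stall: the task keeps moving, possibly at lower quality, instead of sitting idle.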

Deep Research could use shared workspaces instead of giant context handoffs

KP Murphy-Sawhney had worked a few months earlier on the Deep Research agent, which he said is now available through the “interactions API.” His current focus has shifted toward making better internal use of the Antigravity harness and generalizing it beyond coding.

The architectural issue he singled out was context handling. Today, Sawhney said, Deep Research passes “a huge amount of context” through the system. Searches produce large bodies of text, and those are passed along the pipeline. That gets expensive, consumes context, and constrains what the system can do.

His proposed direction is to make components collaborate in a shared workspace rather than handing off huge text blobs. For Deep Research, that could mean different parts of the pipeline working through a shared file system, closer to the way human researchers would collaborate on a deep investigation.

“Why not have the different parts of that pipeline collaborate in a shared file system?”

KP Murphy-Sawhney

Sawhney did not present that change only as cost reduction. He said it could make Deep Research faster, cheaper, and potentially better if it can be orchestrated through the same harness. If each element of the system behaves more like a collaborator in a workspace, it may open the possibility of infographics, additional supporting artifacts, and documents.

The same harness, in this framing, could be useful for coding and for research because both require orchestration. A task is broken across parts of a system; those parts need state, tools, and a way to coordinate without repeatedly passing everything forward as context.
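
A minimal sketch of the shared-workspace idea, assuming two hypothetical pipeline steps that coordinate through files in a temporary directory: each step passes along only a path, while the bulky search output stays in the workspace. The step names and file names are invented for the example.

```python
import tempfile
from pathlib import Path


def search_step(workspace: Path, query: str) -> Path:
    """Write (fake) search results to the workspace and return only the path."""
    out = workspace / "search_results.txt"
    out.write_text(f"results for {query}\n" + "long document text\n" * 1000)
    return out


def summarize_step(workspace: Path, results: Path) -> Path:
    """Read only what is needed from the workspace and leave a short note."""
    first_line = results.read_text().splitlines()[0]
    note = workspace / "summary.md"
    note.write_text(f"Summary: {first_line}\n")
    return note


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    results = search_step(workspace, "agent quotas")
    note = summarize_step(workspace, results)
    summary = note.read_text()
    handoff_size = len(str(results))       # characters passed between steps
    payload_size = results.stat().st_size  # bytes that stay in the workspace
```

In this toy version the handoff is a short path string while the payload is roughly 19 KB, which is the asymmetry the shared-workspace argument relies on.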

Sawhney did not give release plans for bringing Gemini Deep Research into Antigravity. When asked directly whether Deep Research would become available there, he said it was something Google DeepMind was “actively exploring.”

Antigravity is the harness model: plan, act, test, and report back

Ian Ballantyne described Antigravity as more than a Visual Studio-style coding interface. The familiar surface is the IDE: files, codebase navigation, and a chat panel. The layer he emphasized in the demo was the agent manager behind it, which lets a user spawn multiple agents against different projects and have them plan, edit, test, and report back from inside the same environment.

In the demo, Ballantyne asked the agent to “build an example of this spec,” attached a spec file, and sent the task to Gemini Flash. The first attempt hit an error, which he worked around live. Once running, the agent inspected the project, found existing files, generated an implementation plan, and opened a browser under Antigravity’s control so it could test the application it was modifying.

The browser-control layer was not incidental. The screen shown to the audience said the agent can “click, scroll, type, and navigate web pages automatically,” and that the browser will be used when appropriate from an Antigravity conversation. Ballantyne added that the agent can inspect the DOM, look for errors, and feed those observations back into the task. At completion, it can produce review artifacts such as an implementation report, screenshot, or video for interaction-heavy changes.

The generated implementation plan for the game “Nebula Drift” showed the practical pattern Ballantyne was emphasizing: the agent proposes changes before applying them. The visible plan listed proposed modifications to index.html and game.js, organized under headings including “Core Foundation,” “Visuals,” “Architecture,” and “Automated Tests.” Ballantyne said the user can edit a line in the plan, reject behavior, or say “that’s not what I meant at all” before pressing proceed.

The demo also showed the browser-control output and the agent’s own work trace. Antigravity’s controlled browser opened a “Nebula Drift” start screen, reached a game-over state, and the IDE showed a scratchpad with a generated test plan. Ballantyne described the scratchpad as the agent’s notes while it works through a task, giving the user a trace of the behavior it is trying to figure out and the option to interrupt.

For Ballantyne, this workflow is common to many current agent harnesses. The specific point was how Google DeepMind uses this kind of harness with Gemini models for internal development: planning, tool use, browser automation, traceability, review artifacts, and user supervision are assembled into a repeatable workflow.

Observability has to preserve the agent trajectory

Asked how Google DeepMind handles observability and traceability for agent infrastructure, KP Murphy-Sawhney described a custom internal web application. There is one agent backend system used for a lot of work at Google. When a user issues a query to an agent hosted on that system, the request automatically appears in a UI where engineers can drill into the hierarchy of the system.

That hierarchy can be inspected at multiple levels. If needed, Sawhney said, engineers can go all the way down to the raw predict request made to the model. For coding workflows, Google also has an “agent trajectory store,” because coding agents can produce many steps and because failures are often temporal. Engineers need to know where looping began, or exactly when the model “went off the rails.”

The contrast with the audience was instructive. One attendee said they were currently outputting traces to a log file and inspecting those. Sawhney did not describe an external, general-purpose observability product. He said the tooling is custom internally “for now,” and asked what others were using.

The trajectory store matters in Sawhney’s account because the failure is often in the path, not only in the final answer. A run can include decisions, tool calls, observations, edits, retries, and loops. For systems that act in browsers, edit code, inspect DOM state, and submit changes, the debugging object is the sequence of work.
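
To make the “where did looping begin” question concrete, the sketch below scans a recorded trajectory for the first run of identical consecutive steps. The `Step` record and the repeat threshold are assumptions; the internal trajectory store’s schema was not described.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    index: int
    action: str
    observation: str


def first_loop_start(steps: List[Step], repeats: int = 3) -> Optional[int]:
    """Return the index where the first run of `repeats` identical steps begins."""
    run_start, run_len = 0, 1
    for i in range(1, len(steps)):
        prev, cur = steps[i - 1], steps[i]
        if (cur.action, cur.observation) == (prev.action, prev.observation):
            run_len += 1
            if run_len == repeats:
                return steps[run_start].index
        else:
            # A different step breaks the run; start counting again here.
            run_start, run_len = i, 1
    return None


actions = ["read_file", "run_tests", "edit", "run_tests", "run_tests", "run_tests"]
results = ["ok", "fail", "ok", "fail", "fail", "fail"]
trajectory = [Step(i, a, o) for i, (a, o) in enumerate(zip(actions, results))]
loop_at = first_loop_start(trajectory)
```

Exact-match repetition is a deliberately crude definition of a loop; real agent loops often vary slightly between iterations, so a production check would likely need a fuzzier comparison.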

The skills library is useful only if sprawl is controlled

KP Murphy-Sawhney described a large internal effort to build a library of skills that help Google and DeepMind employees do their work. He did not define skills formally, but he described them as a mechanism that lets agents and users draw on contributed expertise. In an organization the size of Google, however, the same mechanism creates sprawl.

The answer Sawhney gave was deliberately evolutionary: improve the skills, and make sure “only the best ones survive.” He described the internal skills ecosystem as almost Darwinian. The goal is not simply to allow everyone to publish agent capabilities, but to keep the library from sprawling out of control.

This has direct implications for how knowledge moves inside a large engineering organization. Sawhney said the advantage of skills in a company of Google’s size is that they can be contributed by people who are experts in a particular area. When the skill is good, both Sawhney and the agent get that expert knowledge “for free.”

He gave a concrete example from his own work: a skill for debugging raw logs, with much of the work done through the CLI. His preference is a combination of skills and guardrailed command-line interactions. That setup has worked well for him and speeds up his job significantly.

The audience asked where Antigravity sits in the debate among skills, MCPs, and CLIs. Sawhney said he is “team skills.” He called MCP useful from an authentication perspective, which he described as powerful, but said he had thought MCP might be “a little bit of a flash in the pan.” Ballantyne was more ecumenical: Antigravity supports both, and his expectation is that the harness will continue to support what the community uses.

The two answers reflect different roles. Sawhney gave a practitioner’s preference: skills plus guardrailed CLI. Ballantyne gave the platform answer: support the interfaces developers adopt, and make sure they work with the harness and models.

The bottleneck for better skills is evaluation

The Darwinian skills model immediately raises the question of fitness: how does Google DeepMind decide which skills are good enough to survive?

KP Murphy-Sawhney said evaluating this work is hard. Even the mechanical setup is nontrivial: spinning up sandboxed environments configured correctly for a particular problem set is itself a burden. Beyond that, the harder problem is creating new datasets.

He distinguished between general benchmarks and skill-specific evaluation. There are useful open-source datasets for external benchmarking, but a specific internal skill often requires a test designed around that skill’s purpose. In that case, the burden can fall on the skill author to provide some form of test.
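
One way to picture author-supplied tests feeding a “survival” decision is a registry that keeps only skills whose own test cases pass above a threshold. The `Skill` structure, the case format, and the 0.8 threshold are hypothetical; the talk did not describe an actual selection mechanism.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Skill:
    name: str
    run: Callable[[str], str]
    # Author-supplied (input, expected output) pairs; a hypothetical format.
    cases: List[Tuple[str, str]]

    def pass_rate(self) -> float:
        if not self.cases:
            return 0.0  # no tests means no evidence of fitness
        passed = sum(self.run(inp) == want for inp, want in self.cases)
        return passed / len(self.cases)


def surviving(skills: List[Skill], threshold: float = 0.8) -> List[str]:
    """Return the names of skills whose pass rate meets the threshold."""
    return [s.name for s in skills if s.pass_rate() >= threshold]


skills = [
    Skill("upper", str.upper, [("a", "A"), ("bc", "BC")]),
    Skill("broken_upper", str.lower, [("a", "A"), ("bc", "BC")]),
    Skill("untested", str.strip, []),
]
kept = surviving(skills)
```

Treating an untested skill as failing puts the burden on the author, which matches the article’s point that a skill-specific test often has to come from whoever wrote the skill.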

There is also experimentation with agents designing those evaluations. Sawhney called that “a little bit meta” and said there is still “a lot of work to do” in the space. Google DeepMind’s account was not that it has a finished methodology for skill survival; it was that scaling skills turns evaluation into one of the central problems.

Sub-agents are less visible than the workflow they support

An audience member asked how many hierarchical layers of sub-agents Antigravity can run, that is, how many levels of copilot or sub-agent delegation are possible. Ian Ballantyne said the short answer was that he did not know. He described Antigravity’s current user model as multiple simultaneous agents working on different tracks, rather than a visibly massive parallel hierarchy for a single task.

In the current presentation, Ballantyne said, the sub-agent structure is somewhat opaque. A user can assign different trains of operation within a project, and jobs can overlap, but he did not characterize it as a highly legible nested system where the user sees each sub-agent and its place in a hierarchy.

KP Murphy-Sawhney said he does think agent-to-agent communication is the future. The problem, in his framing, is how to make that communication efficient and how to give the human enough control to shape it. His metaphor was “a supervisor on a digital assembly line.”

That metaphor does not resolve the implementation question. It marks the direction Sawhney expects: more agent-to-agent work, with the human role shifting toward supervision of the system’s coordination rather than direct execution of every step.

Code review is already entering the agent loop

The final audience question asked about pull requests at scale: whether Google has considered fine-tuning PR models that read tasks, comments, and commits to review code and reduce reviewer load.

KP Murphy-Sawhney said Google already has strong infrastructure in place. On a per-language basis, there are specific auto-review models fine-tuned on style guides and previous good examples of code. On a product-area or product basis, teams add their own specific instructions and prompts so reviewers get a better signal about code quality.

Sawhney also described a recent experience: the previous day, he sent a PR for review and did not have to trigger the auto-review tool. An agent someone else had spun up commented on the PR with what he considered a good suggestion.

He connected the review problem to the scale of engineering work inside Google: many engineers submit large amounts of code, and agents may submit more. Sawhney also mentioned Jules, a web interface for working on GitHub PRs with review components.

The review loop brought Sawhney back to the assembly-line framing. If engineers are supervising agent-produced work, they also need help supervising the review process. The agents may take over some boring work, but the system still needs mechanisms to decide what is correct, what follows style, what fits the product, and what deserves human attention.
