Models Will Absorb Today’s Agent Harnesses Within a Year
Logan Kilpatrick, who leads Google AI Studio and the Gemini API, argues that the current rush to build agent harnesses may have a short shelf life. In an interview with Sequoia Capital’s Sonya Huang, he says models are absorbing the scaffolding around agents and could make much of today’s custom harness layer less distinctive within about 12 months. Google’s own strategy runs on both sides of that claim: Antigravity has become a shared agent layer across products, while Kilpatrick says the durable advantage for builders will move to focus, domain knowledge, risk tolerance and useful outcomes for users.

The agent harness has become Google’s new throughline
Logan Kilpatrick describes Google’s “agentic Gemini era” less as a slogan about chatbots than as a product architecture shift. Gemini, the model family, had become the common AI layer across Google’s many products. Now, Kilpatrick says, a second common layer is emerging: Antigravity, an agent harness used not only as a coding environment but as infrastructure for agentic behavior across Google.
The underlying thesis is sharper than “agents are coming.” Google is building around the harness as a practical shared layer today, while Kilpatrick also expects some of what harnesses do to be absorbed into model systems over time.
Sonya Huang presses on what Antigravity is, because the answer is not simply “an IDE.” Kilpatrick says it includes a core IDE, an agent-first web experience, a CLI, and an SDK. It can also be accessed through the Gemini API as a managed agent for developers who do not want to build the infrastructure themselves. The more consequential point, in his account, is that the same harness is meant to power agentic features in Search, the Gemini app, Cloud, and AI Studio.
Huang summarizes the shift: the Gemini API used to be the way AI was baked into Google products; now the coding harness, or more broadly the agent harness, is becoming the shared layer. Kilpatrick accepts that description but widens it. Coding is a specialized use case of the agent harness, he says, but it has also become the general-purpose harness for agents. The base layer may be “80% of the same stuff,” then specialized for the use case: AI Studio’s version is tuned for vibe coding; the Gemini app’s version is tuned for a consumer, always-on agent.
That distinction matters because Kilpatrick is not saying every product becomes a coding product. He is saying coding has supplied the most functional template Google has for agentic systems, while the same underlying harness can be adapted for products that take action on behalf of users rather than simply answer them.
Google is still in the crawl phase because billions of users change the risk calculus
Asked to grade how agentic Google’s products are on a crawl-walk-run scale, Logan Kilpatrick’s answer is direct.
It’s definitely like crawl right now.
He distinguishes between Google’s frontier or Labs-like environments, where some experiences may be closer to walking or running, and the main product surfaces used by billions of people.
The constraint is not only model capability. It is stewardship. Kilpatrick says Google has 13 products with more than a billion users, and the long tail of those users is not necessarily ready for AI to “run” and do everything. Many users still want to be in the driver’s seat. Search is his clearest example of a product where Google has to bring users along carefully rather than abruptly change how they interact with the internet.
The products he sees as closest to “walk” are the Gemini app and Antigravity. The Gemini app is closer because it can become a 24/7 always-on agent capable of taking actions for a user. Antigravity is close because autonomous coding agents can run for long periods, consume billions of tokens, and spend thousands of dollars on a user’s behalf. Google DeepMind, he says, is taking a frontier view on these kinds of agents, while the rest of Google’s products are moving more incrementally.
Whether Google ends up with a few AI surfaces or thousands remains unresolved in Kilpatrick’s view. But he makes one important product claim: humans may still prefer specialized surfaces because general-purpose interfaces can increase cognitive load. A product that does everything for a user can require more work from the user to specify what they actually want. There is still value, he says, in opening a calendar app and having it simply show the calendar.
Huang offers slide decks as an analogy: information being in the same place, in the same form, has persistence because people are used to it. Kilpatrick agrees that generative interfaces can add cognitive overhead in some cases. He allows that someone may invent an experience that makes general-purpose AI feel more natural, but for the ecosystem he expects many more specialized products rather than one universal interface.
Cannibalization is framed as an outcomes question, not an eyeballs question
Sonya Huang raises the business tension directly. If agents can go through a user’s email and reply on their behalf, maybe the user spends fewer hours inside Google products. Logan Kilpatrick answers by pointing to the early assumption that AI answers would be negative-sum for Search. In his account, the opposite has happened: AI has been “incredibly positive sum” for Search, with people searching more and agents also searching.
He does not claim the long-term outcome is settled. Human time is finite, and he says the next one to two years are clearer than the three-to-five-year horizon, when technology and products will look different. But he frames Google’s success criterion as maximizing outcomes for customers, not maximizing time spent looking at Google interfaces.
Success for Google probably doesn’t look like maximizing eyeball time in front of our products. It’s like maximizing outcome for customers to do the thing that they want to do so that they can go and live their life.
Huang introduces “agent-led growth”: in coding, she says, she lets agents choose infrastructure on her behalf, including decisions such as which database to use. She asks how that changes advertising and value capture for aggregators if similar delegation reaches shopping and other consumer domains.
Kilpatrick’s answer is that the change may be less radical than it currently feels. He sees correlation between older mechanisms such as SEO and newer ideas like “generative engine optimization,” describing these systems as proxies that compound on top of one another. He does not offer a detailed ad-market theory; his position is that the visible shift may be more continuous than discontinuous.
The substantive contrast is clear. Huang is pointing at a world where agents mediate user intent and therefore reshape distribution. Kilpatrick is arguing that, at least so far, AI has expanded use rather than simply replacing human attention, and that Google’s product philosophy can be stated as outcome maximization rather than screen-time maximization.
Coding is the clearest working agent case, and Gemini is rebuilding its flywheel
Coding remains the place where agents have most visibly crossed from demo to work. Enterprise companies tell Sonya Huang that everyone talks about agentic AI, but the only place agents clearly work is coding. Logan Kilpatrick says the answer depends on the bar for “working.” If the task is complex and the domain has not crossed the quality threshold, agents will not solve it. But he thinks the length of agent runs is increasing quickly and wants the ecosystem to measure it more directly: not just total token consumption, but the average duration of an agent task.
Public claims from model labs about multi-day autonomous work are the extreme case. Kilpatrick argues that the same trend is “trickling up” in practice. Enterprises may not have felt it outside coding yet, he says, but they will this year as other use cases improve.
Coding matters inside DeepMind both as a product category and as an accelerant. Kilpatrick says long-running agents matter, and coding agents specifically matter because a strong coding model accelerates every other part of the business.
The uncomfortable market question is Gemini’s standing with developers. Huang says many of her developer friends used Claude; after OpenAI’s Codex improvements, they are split between Claude and Codex; she does not hear many of them using Gemini. Kilpatrick accepts that the observation is reasonable. He adds context from the narrative cycle: when Gemini 3 landed in December, he says, the narrative was that Google had taken a major leap. Then agentic coding became the next wave over the holidays and into January, reminding him how quickly the market story can move.
His explanation for Google’s current position is partly product-flywheel based. It is hard to build a great coding model for long-running software engineering work, he says, without a product that does that work. Kilpatrick says that realization is why the Windsurf deal happened: those people came over and ultimately built Antigravity. Google is now using Antigravity internally, and Kilpatrick refers to a graph Sundar Pichai showed at I/O of token-consumption growth inside Google.
He also points to the timing of large pre-training runs. From the outside, a lab may look behind without observers knowing where its major runs are in the training cycle. DeepMind, he says, has historically been strong at pre-training, while Gemini 3.5 Flash represented a major post-training gain: a Flash model better for coding than any Pro model Google had previously released.
Dogfooding is central to the strategy, but not in a closed-world way. Kilpatrick says it is healthy for DeepMind employees to use other models because otherwise they cannot understand the ecosystem. He uses all the models and products himself. But he also says people have to use Gemini models because internal use creates a feedback flywheel. Google and DeepMind have more than 100,000 engineers using the models, giving feedback, and enabling A/B tests and live experiments. That scale should be a competitive advantage.
Agentic coding is already changing ambition before it changes research itself
Sonya Huang asks whether Logan Kilpatrick believes in the “soft takeoff” narrative: once coding agents are good enough, they accelerate research progress, creating a self-reinforcing cycle. Kilpatrick says it seems obvious that this is true, though he caveats that he may have “drunken too much Kool-Aid.”
He sees signs of it more clearly in product development than in model research. For large training runs, humans remain in the driver’s seat because resource allocation is significant; no one should accidentally kick off a job on 10,000 TPUs. But on the product side, he says, the effect is already visible. His team has built mobile apps with Antigravity that he expects to launch faster than any Google team has previously built mobile apps. He cites Josh’s team delivering the Gemini macOS app faster than any team had delivered a Mac app at Google, attributing that speed to agentic coding.
The larger claim is that coding now feels like a form of narrow superintelligence. Kilpatrick does not try to formalize that threshold, but he says coding is “just so good” that it feels that way. He worries that the focus on AGI can obscure the impact of present capabilities. General-purpose AI remains important, but the ability to build with code is already powerful enough to change what individual developers attempt.
For Kilpatrick personally, coding agents have not reduced the role of the human developer. They have increased his sense of agency. Ideas that once felt slightly out of reach now feel feasible. The burden has shifted: instead of asking whether he can make an MVP, he asks whether he should make the idea more ambitious because the technology enables it.
I have the opposite problem, which is I’m kicking around an idea and I’m like I could probably make this even more ambitious.
He expects other “vertical superintelligence” domains to arrive before full AGI. The pattern, he says, will be jagged: very strong performance in particular domains rather than uniform general capability everywhere. The next candidates are areas with better verifiability. He names math, finance, and science, with science especially interesting because early positive impact could help people understand AI’s potential benefits. He references current work around math proofs but notes that he is not a mathematician and the details are over his head.
Vibe-coded games are close, but the missing layer is product scaffolding
Sonya Huang reads back a prediction Logan Kilpatrick made: everyone would be able to vibe code video games by the end of 2025. Asked whether it came true, he says it feels close, while excluding AAA games such as the next Call of Duty or GTA.
His reason is not simply that models need to be better at writing game code. Games require additional assets and systems that coding agents do not solve by themselves. He mentions 3.js as an enabling tool, but says there are still rough edges: sprite generation, orchestration layers, reliability, replayability, depth, and taste. The model may be capable enough, but the experience requires scaffolding.
AI Studio usage data prompted his earlier prediction. At one point, he says, roughly 20% of apps being made in AI Studio were games. Games are no longer the most popular category, because the ecosystem and user base have shifted. The current prominent categories include finance-related apps, much of it apparently around crypto; personal productivity; and generative media. He also notes DeepMind’s historical connection to games through Demis Hassabis and says Kaggle works with DeepMind on a game arena used as a way to test progress toward AGI with games as a proxy.
For a random person with a good idea to vibe code a fun playable game, Kilpatrick wants to say “this year.” He thinks model capability makes it possible. The gap is more likely product knowledge: someone who understands what makes a great game has to assemble the scaffolding correctly. Some of the remaining gap may be awareness — people do not know they can do it — and some may be model categories that are weeks or months away from crossing a threshold.
When Huang asks whether vibe-coded games are more likely to be built with game engines plus coding agents or with world models, Kilpatrick says the definitions will blur. In the short term, though, he expects more “alpha” from coding agents plus a game engine. World models remain too open-ended for recurring, reliable game experiences unless someone builds the right scaffolding around them.
Omni blurs the line between world model and media model
Logan Kilpatrick uses Google’s Omni model to explain why “world model” is becoming a blurry term. At I/O, Google described Omni as a system that can take any input and create any output. Demis Hassabis framed it as a world model because of its level of world understanding. Kilpatrick says that is different from the more technical historical meaning of world model, which he describes as something closer to an action-conditioned video model, such as Genie.
The shift is from “world model” as a specific architecture to “world model” as a model that has understanding of the world. Omni is not real-time in the way some traditional world-model visions imply, but Kilpatrick says it can visually create many of the same use cases people would associate with such systems. That is why he expects the distinction between video models and world models to change.
Under the hood, the important point is that Omni is a single model. Kilpatrick says Google historically trained multiple models for different modalities: Gemini for text, audio models, music models such as Lyria, image models such as Imagen, video models such as Veo, and other audio systems. Omni was built from the desire to replace that collection with one model capable of handling those tasks.
He emphasizes that it is not routing to a bundle of separate models under one brand. Google could have built something like that earlier and called it an Omni model, he says, but this is a “true Omni model.” The first available use case is video editing because that is where the quality works best right now. The model is technically functional for other outputs, but the quality is not yet state of the art, so those capabilities have not been rolled out. The current release is “Omni Flash,” which he describes as the first crank of the model turn, with more capable versions to come.
The example Kilpatrick returns to is an edited talk in which a dog was inserted onstage. In the generated version, other guests look down, see the dog, and chuckle. The dog jumps into Kilpatrick’s lap; he acknowledges it, continues talking, and pets it. What impressed him was the subtlety: the surrounding people reacted to the inserted object in context, while the talk continued.
That example leads into his view of generative media. Kilpatrick says he has historically been reluctant to use AI for his own content. He wants his words, voice, image, and presence to remain his own, because he sees “so much alpha in authenticity.” What he likes about Omni is that it can alter the environment — the set, the coffee table, the non-personal elements — without replacing the person.
It’s the original content, it’s the person, it’s like the personhood is there, it’s just different and amplified.
In his preferred version of generative media, the original content and personhood remain, while the surrounding presentation is changed or amplified. Sonya Huang, who says she is highly bullish on generative media, frames the appeal from a creator’s perspective: visuals matter as much as the content because they catch attention first. The two agree to try prompts after the recording, including the possibility of editing the set or adding a dog.
AI Studio is becoming a gateway into the rest of Google’s builder ecosystem
Google has added the ability to vibe code Android apps inside AI Studio. Logan Kilpatrick presents this not just as an Android feature, but as part of AI Studio’s strategic role: exposing builders to the broader Google ecosystem without forcing them through many separate Google interfaces.
Android is a useful example because it lets people build apps they otherwise would not have built. Kilpatrick says he built his first Android app in AI Studio: a gardening app prompted by planting trees in his backyard. He had not yet found his breakthrough mobile-app idea, but he describes the experience as a way to “kick the tires.”
The scale of use is already large by his account. He says that, as of numbers reviewed that morning, roughly 350,000 Android apps had been built in AI Studio since the prior week. He does not claim these are all commercial products. In fact, the important point is the opposite: many are personal apps that probably would not have existed before.
That leads into a broader claim about software creation. Kilpatrick thinks “Gen UI” may be further out, but the ability to build software for a personal problem is already real. AI Studio is making it possible for users to unlock native phone capabilities and use context spread across the device. He frames Android as becoming a platform for builders.
Sonya Huang asks whether it matters that something is an app rather than a web product, given how powerful the web has become. Kilpatrick says the web is powerful, but operating systems still offer native richness that the web cannot fully unlock. His example is messaging: the texting experience inside major operating systems feels richer to him than any AI chat app. He would rather talk to AI inside the texting app he already uses than go to a separate AI app, because users are conditioned by operating-system surfaces.
The model eats the harness, but startups still have room to build around focus and risk
Sonya Huang asks directly about the idea that “the model eats the harness” or “the model eats the scaffolding.” Logan Kilpatrick says it is true, but first redefines what a model now is. Two years ago, he says, an LLM was basically a set of weights: send tokens in, get tokens out. Today, products still refer to “Gemini,” “GPT,” or “Claude” as models, but the thing being used is an expanding system around the weights: tool calling, hosted tools, search, code execution, containers, and agent harnesses.
Scaffolding often sits a few steps ahead of what is baked directly into the model. Then, over time, the model digests the scaffolding and it becomes part of the native model system. External scaffolding can still have value. Search and code execution are his examples: even if the model can natively use search, users may want different search providers for different use cases; even if code execution is built in, external versions may still matter.
But he thinks the agent harness is the current quintessential case. Many companies are racing to build harnesses because they believe the harness is where the advantage is. Kilpatrick’s view is that this may not remain true in the way people currently imagine it.
I think that perhaps won’t be true, at least in the way that we think of the harness today, in 12 months.
His rough time horizon is 12 months. He expects models to natively perform much of what custom harnesses now provide, reducing the advantage of spinning one’s own harness.
I think the models will have sort of just like digested a bunch of that, it’ll be upstreamed into the model, and the alpha will be somewhere else now.
Huang objects that application companies build their own harnesses partly to avoid lock-in to any one model provider. Kilpatrick agrees that this can start out true, but argues it becomes less true as models become more capable. A generalized model should be able to use another harness. He proposes the need for something like “harness bench”: an ecosystem benchmark measuring how well different models adapt to different harnesses. If a harness is completely out of distribution, he says, it will remain hard regardless of whether it is proprietary.
The startup question follows naturally: if models eat the harness and surrounding scaffolding, where can independent companies survive? Kilpatrick holds two claims at once. Models are doing more than ever. But he also sees more opportunity than ever for startups.
His answer is focus. Model companies go after general problems; startups can go deep in vertical domains where they know the customers and ecosystem.
Focus is the superpower of startups. Like if you can focus, you can do anything.
Google, by contrast, has many products, users, and obligations, which makes intense focus in one domain difficult even when strategically justified.
He also points to capability overhang as an opportunity. Coding agents help startups close the gap with larger companies that have established codebases, because small teams can now write software faster. Agentic primitives create new product categories. And risk appetite differs: a startup willing to take more risk in a domain can win users who also want to take that risk, while larger companies may have to move more cautiously.
DeepMind’s culture matters because it links research ambition to Google-scale deployment
Inside Google DeepMind, Logan Kilpatrick sees a culture shaped by two pressures that run through the rest of his argument: scientific ambition and product deployment at Google scale.
The first pressure is focus. DeepMind has a large portfolio, which Kilpatrick describes as exciting but not free. Other labs can pull ahead in areas where Google underinvested or lacked focus. What he values is the organization’s response: gather smart people and attack the problem. He connects that approach to the DeepMind culture shown in the Demis Hassabis documentary The Thinking Game, saying the old culture of focused “strikes” still resembles how the organization works today.
The second pressure is the scientific worldview he associates with Hassabis. Kilpatrick characterizes Hassabis as a Nobel Prize scientist and an original figure in the field, and says that perspective permeates DeepMind. The mission he values is grounded in scientific purpose: solving problems such as disease, not merely pushing benchmark numbers higher. The competitive race over measures such as SWE-bench can obscure that purpose.
Kilpatrick and Sonya Huang joke about the Silicon Valley quote from Gavin Belson: “We can’t let other people make the world a better place more than we can.” Framed that way, he says, the race feels goofy; the work is not zero-sum.
The third pressure is operational. DeepMind also operates as what Kilpatrick calls the “engine room of Google.” On one side is the lab culture; on the other are partnerships across Android, Cloud, Gmail, Workspace, and other Google products. The applied work of deploying Gemini into billion-user products creates a kind of problem that, he says, only two companies in the world have. Google has 13 such products.
That scale also shapes how Google tells its story. Huang asks whether Kilpatrick’s public tweeting caused internal heartburn. He says Google’s marketing and communications teams have been strong partners whose job is to protect the company and help tell the right story, but that he has not had to get every tweet approved. He sees part of his role as telling a more authentic developer-facing story than a large company can easily tell through many layers of process.



