Gemini’s Strategy Shifts From Frontier Leaderboards to Deployable AI Infrastructure

Tulsee Doshi · The Cognitive Revolution · Wednesday, May 20, 2026 · 19 min read

Google DeepMind executives Tulsee Doshi and Logan Kilpatrick argue that Google’s current Gemini strategy is built less around a single frontier model than around a deployable AI stack. In their account, Gemini 3.5 Flash, the Anti-Gravity agent harness and new multimodal products such as Omni are meant to make models fast, cheap and integrated enough to run across Search, the Gemini app, AI Studio, YouTube and enterprise tools. The deeper shift, Kilpatrick says, is that the model is increasingly absorbing the scaffolding that once surrounded it, while Google standardizes the remaining agent infrastructure across its products.

Google DeepMind’s I/O strategy, as described by Tulsee Doshi and Logan Kilpatrick, is not built around a single Ultra-branded model meant to win the frontier leaderboard outright. It is a bid to make Gemini and the Anti-Gravity agent harness a common substrate across Search, the Gemini app, AI Studio, YouTube, developer APIs, and enterprise tooling. The model launch matters, but so does the execution layer around it: fast models that can serve Google-scale products, an agent harness that standardizes tool use and orchestration, multimodal systems that move beyond text, and feedback loops that turn product failures back into model improvements.

The clearest through line is that Google is no longer treating “the model” and “the product scaffolding” as separate layers. The model is being trained with the harness, deployed through the harness, and gradually absorbing pieces of what used to sit around it.

Google is leading with the model people can actually deploy

The headline model launch is Gemini 3.5 Flash, not an Ultra model. Tulsee Doshi framed that choice as central to how Google wants Gemini to be used: Flash is meant to be “really smart,” “really fast,” and “really cost-effective,” while still supporting coding and agentic workflows.

Doshi said 3.5 Flash is “three times faster than other of the large models” and “significantly cheaper,” and that Google has been using it internally. The model is intended to run across the Gemini app, AI Mode in Search, Anti-Gravity, agentic experiences in AI Studio, and the coming Gemini Spark product. The strategic point in her account is not that Flash is merely a smaller sibling of a frontier model. It is that Google needs models that can operate inside products serving very large user populations, where latency and cost are product constraints rather than implementation details.

3×: Gemini 3.5 Flash speed advantage Doshi claimed versus other large models

Nathan Labenz pressed on why Google has not released an Ultra-branded model, especially given the willingness of some customers to pay for the strongest model available. Doshi acknowledged that some users will pay for a certain level of quality, and said Google believes Pro has continued to push that quality. But she argued that Flash and Flash Lite matter because many applications, particularly consumer-scale applications such as Search and the Gemini app, cannot ignore response time.

When Google improves model quality in ways that hurt latency, Doshi said, the effect shows up in live experiments. Users are being asked to wait, and that wait changes behavior. Flash Lite, which was not part of the original 2.0-series framing, emerged because Google saw large-scale demand for a model at that point in the cost-latency-quality tradeoff.

Logan Kilpatrick added that the absence of an Ultra name should not be read as the absence of scaling. Pro models have continued to scale up, he said; naming is partly marketing. Doshi said Google has had conversations about whether a larger model should be called Ultra, but has also weighed consistency for users across model series.

The deeper distinction, as Kilpatrick and Doshi described it, is that Google is trying to serve a broad cost-adjusted frontier, not only a single maximum-capability point. Kilpatrick tied that to Google DeepMind’s stated mission of building AI responsibly and making it benefit “all of humanity.” For Google, he said, that mission is inseparable from product surfaces that reach billions of users. Expensive, highly capable models still matter, including for internal work and customers. But Google also has to scale AI to the level at which its existing products operate.

That does not mean the model sizes are independent. Kilpatrick said it is hard to make great Flash models without a great Pro model, and vice versa. Doshi described distillation flowing from Pro to Flash to Flash Lite, while also saying Google scales recipes in the other direction, taking what works for Flash and scaling it up toward Pro. The family is not a simple ladder with one direction of knowledge transfer.

Labenz asked whether Google has a larger internal model used to help train Pro, which may then help train Flash. Doshi did not directly confirm that framing. She said Google uses distillation to bring capabilities down in size and is pursuing gains across pre-training, post-training, and inference. She also pointed to internal uses of “pretty awesome models” for research and product work, including a demo by Varun of sub-agents completing multiple tasks and returning results, available as an early preview in AI Studio at /teamwork.

The model lineup is presented less as a trophy case than as deployment infrastructure: capable enough for coding, agents, search, consumer assistants, and enterprise developers; fast enough that users do not abandon it; and cheap enough that Google can run it at Google scale.

The harness is becoming the shared agent substrate

Kilpatrick’s summary of the I/O theme was blunt: “agents, agents, agents.” He described a new “model, harness, product symbiosis” in which the model is trained with the harness, the harness powers agentic product experiences, and the resulting behavior ships through consumer and developer products.

This is the first year where we have this model, harness, product symbiosis that's sort of taking place.

Logan Kilpatrick

The harness he referred to is the Anti-Gravity agent harness. Kilpatrick described it as a through line across Google products, similar to the way Gemini became a shared AI layer across product surfaces. It powers Gemini Spark in the Gemini app, vibe coding in AI Studio, the agents API for developers, and, he suggested, will roll out more widely across Google’s product suite as products become “agentic by default.”

Doshi said the modeling work is now done in partnership with that harness. 3.5 Flash is not simply released and left for product teams to adapt independently. It is built to work across the Gemini app, AI Mode in Search, Anti-Gravity, AI Studio, and Gemini Spark. Those products have different users and goals, which makes the model-harness design problem harder.

Kilpatrick contrasted this with an earlier era when launching a model meant putting it on a few services and accepting token input and output. Now Google has to make the same model family serve search users, developers, cloud customers, Gemini app users, and other product teams simultaneously.

The risk Labenz raised is that co-training models and harnesses could create stack lock-in. If the model becomes deeply adapted to a specific orchestration layer, developers may no longer be able to mix and match models, LangChain-style infrastructure, and third-party agent frameworks. That would increase switching costs and pricing power for frontier labs.

Kilpatrick said the best case is “you can do both”: make the harness work especially well for Gemini while still generalizing across other systems. Developers want choice, he said, and there is also a philosophical test embedded in the question. If a model is genuinely strong, why should it fail outside the harness it was trained with?

Doshi agreed and named the goal “harness diversity.” Google wants the full-stack experience to create a better product and a better data flywheel, but not at the cost of making Gemini usable only in one harness. Enterprise customers and developers should be able to use Gemini effectively with different approaches to tooling and orchestration.

The full-stack advantage, in Doshi’s account, is operational. Co-training and a shared harness make it easier to debug failures, collect data, evaluate behavior, and move quickly. They create a flywheel: better product behavior generates clearer feedback, which improves the model, which improves the product.

Kilpatrick proposed that the industry needs a “harness bench” to test whether models can generalize across agent harnesses. He linked the idea to Demis Hassabis’s view on games: if models are so good, why can’t they play games really well? Similarly, if a model is approaching general intelligence, it should generalize reasonably well even after model-harness co-training. Failure to do so would expose another form of “jagged intelligence.”

The harness strategy is therefore not only a product convenience. It is a way to make agentic behavior portable across Google’s surfaces, speed up model iteration, and create clearer evidence about where the model fails. The unresolved question is how much of that advantage remains available when Gemini is used outside Google’s preferred orchestration layer.

The model eats the scaffolding, but product craft still matters

The most compact formulation of Google’s agent strategy came when Labenz suggested that more of the surrounding code, architecture, and tooling is being pulled toward the model side of the system. Logan Kilpatrick answered with the phrase that gives the argument its shape.

Model eats the scaffolding. That's my favorite way of thinking about this. Like just as at every crank of the model flywheel the model eats a bunch of scaffolding.

Logan Kilpatrick

The claim is not that products no longer need scaffolding. It is that the boundary between model and scaffolding keeps moving. Earlier AI infrastructure served raw models: tokens in, tokens out. The next generation includes tool loops, agentic orchestration, planning, and execution patterns inside a harness. As model capabilities improve, pieces of what used to be wrapper logic become behaviors the model can handle more directly.
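To make "tool loop" concrete, here is a minimal sketch of the kind of wrapper logic a harness typically provides. It is not Google's harness or any real API: the `CALL`/`FINAL`/`TOOL_RESULT` text protocol, the `TOOLS` registry, and `stub_model` are all hypothetical names invented for illustration, and the stub stands in for a real model call.

```python
from typing import Callable

# Hypothetical tool registry; the tool name and behavior are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}

def stub_model(transcript: list[str]) -> str:
    """Stand-in for a model call: asks for a tool once, then answers."""
    last = transcript[-1]
    if last.startswith("TOOL_RESULT "):
        return "FINAL " + last[len("TOOL_RESULT "):]
    return "CALL add 2+3"

def run_tool_loop(task: str, model=stub_model, max_steps: int = 5) -> str:
    """Alternate between model replies and tool execution until a final answer."""
    transcript = [task]
    for _ in range(max_steps):
        reply = model(transcript)
        if reply.startswith("FINAL "):
            return reply[len("FINAL "):]
        # Assumed reply format: "CALL <tool_name> <argument>"
        _, name, arg = reply.split(" ", 2)
        transcript.append("TOOL_RESULT " + TOOLS[name](arg))
    raise RuntimeError("step budget exhausted")

print(run_tool_loop("What is 2+3?"))
```

Everything in `run_tool_loop` is scaffolding in the article's sense: parsing, dispatch, and the step budget live outside the model. The "model eats the scaffolding" claim is that more of this behavior moves into the model over time, leaving less for the wrapper to script.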

That is a different issue from stack standardization, even though the two are connected. Standardization keeps Google product teams from each rebuilding the same agent substrate. Capability absorption changes what the substrate must do in the first place. The harness provides common execution machinery; the model then learns to take over more of the behavior that previously had to be scripted around it.

Kilpatrick said Google has spent the last few years standardizing AI infrastructure across the stack, and that this standardization helped Gemini 3 land across many more products. The work is painful, but the payoff is that individual product teams do not need to reinvent agentic infrastructure every time the paradigm changes.

His advice to companies building AI products was stark: every 12 to 18 months, assume you may have to rewrite everything from scratch. The goal, then, is to avoid forcing every team to rewrite the same foundational layers independently. Google wants product teams to innovate on product experience rather than rebuild tool-calling loops and orchestration mechanics.

Every 12 to 18 months now like you have to rewrite everything from scratch.

Logan Kilpatrick

Tulsee Doshi added a second lesson: there is no substitute for fast experimentation. A team needs to put a checkpoint into the hands of internal developers, prototype with it, prompt it, run live experiments, collect feedback, and find the rough edges in actual product contexts. A model only reveals some failures once it is inside a product.

NotebookLM was the example both Doshi and Kilpatrick used for what this looks like when it works. Doshi described it as a strong product built by a team that deeply understands the model and prototypes quickly. Kilpatrick said the original Audio Overview feature shocked people because of the coherence of the dialogue, and that this coherence came from “base Gemini with a bunch of banger prompts.” The audio model mattered, but the prompt-driven dialogue construction was the difficult part.

That example preserves the role of product craft even in a world where the model eats scaffolding. The model did not automatically create the NotebookLM experience. A team discovered how to elicit and structure the behavior. The harness and standardized infrastructure make that process repeatable across more teams, but they do not eliminate the need for teams that understand the model.

Doshi described the internal loop when a product team hits a model failure. The first question is whether the team can prompt its way out. If not, the failure is analyzed: where is the model falling down, what losses are implicated, what evals or data would capture the problem? That feedback is then brought back into later model revisions. In her account, Gemini has improved because product teams are constantly finding concrete places where it fails and cycling that information back to the model organization.

The harness itself remains extensible. Spark and AI Studio may both run on the same infrastructure but use it differently. Doshi said extensibility is a first-class feature because product teams are not all building the same product. The model side has the same requirement: a common foundation, but enough adaptability to serve different contexts.

Recursive self-improvement is framed as collaboration, not autonomy

Google DeepMind is already using Gemini to improve Gemini, but Tulsee Doshi and Logan Kilpatrick did not describe a near-term handoff of ML research to autonomous AI employees. Doshi said Gemini is deeply used internally in the development process, from productivity assistance to more substantive research workflows: submitting code changes, running evaluations, suggesting research improvements, and driving improvements to Gemini itself.

She described the direction as a “research partner opportunity.” Gemini can help generate ideas, test them faster, and run ablations. Her concrete example was a colleague, Anca, who leads safety and alignment, messaging from a hot tub after using a phone to kick off multiple ablations, test safety issues, compare differences, and produce a report within an hour. Doshi’s point was not that the model replaced the researcher, but that the researcher could initiate and interpret more experimental work with much less friction.

Kilpatrick’s framing was more explicitly practical. As models improve at coding, they will help build products and train models. But the “nuance,” he said, is where the human remains in the driver’s seat. Google’s tools are built with that assumption. In the short to medium term, he said, the cost and opportunity cost of large training runs make it unrealistic to let an “ML intern” model kick off major pre-training jobs that consume large amounts of compute and could send the organization in the wrong direction.

The collaboration changes the human role. Doshi said it frees researchers and product leaders to focus more on interpretation, strategy, and where a line of work should go. The model can compress the mechanics of testing and implementation; the human remains responsible for judgment.

That same shift appears in day-to-day work. Doshi said that for code she would already be submitting, she mostly relies on Anti-Gravity and fills in bits and pieces herself. She also uses models to generate slide decks and content from her thoughts. A new “Gemini mic” feature in Anti-Gravity lets a user speak loosely to the model — “ramble,” in her word — and have it take action from that input.

For Doshi, that matters because she thinks by talking. The model turns spoken, partial thoughts into something reasoned and structured. Kilpatrick said audio-to-code may be one of the fastest-growing input modalities. He uses it often when building software, at least when not around other people. Doshi joked that if one walks around upstairs at Google, one may see people “muttering at their hands” because they are talking to create code.

This is a narrower and more grounded version of recursive self-improvement than the strongest versions circulating in AI discourse. Gemini is being used in Gemini development. The ambition includes research assistance and model improvement. But the described operating model remains human-directed collaboration, with particular caution around expensive, high-stakes training decisions.

Omni pushes Gemini toward native video, while audio becomes a primary interface

Tulsee Doshi described Gemini Omni Flash as a video generation and editing experience, and as part of a broader push toward “all modalities in and all modalities out.” The first manifestation is video editing: users will be able to make videos and put their own avatar into them. In his introduction, Nathan Labenz also described the I/O slate as including a video generation model called Veo, an improved and more agent-focused Project Astra, and a product called Gems; in the interview itself, Doshi’s launch framing centered on 3.5 Flash, Gemini Omni Flash, Anti-Gravity, Gemini Spark, and Gemini Live.

Labenz called Omni a “nano-banana moment for video,” referring to the kind of image-model behavior where language, reasoning, and pixel-space understanding appear deeply integrated rather than mediated through lossy textual descriptions. Doshi accepted that framing, saying “nano-banana for video” was effectively the tagline.

The core promise is that Gemini Omni brings Gemini’s world knowledge and reasoning into native video generation. A user should be able to bring images and a scene together into one generated video, with the model understanding enough of the objects, identities, and relations to preserve coherence.

The API story is unresolved. Logan Kilpatrick said Omni is not available in the API yet. Labenz asked whether video input would still work through frame sampling while output would be native video pixels. Kilpatrick noted that current Gemini video input has an FPS parameter and does down-sample frames, though developers can control the number of frames. For Omni specifically, he said he did not know the final API design. Doshi said Google still needs to decide what the API version should look like, including sampling behavior, similar to decisions made around Veo.
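As a rough illustration of what frame down-sampling by an FPS setting means, the sketch below picks which source frames survive when a clip is sampled at a lower rate. It is a generic calculation, not the Gemini API: the function name and parameters are hypothetical, and the actual sampling behavior for Omni is, per Kilpatrick and Doshi, still undecided.

```python
def sampled_frame_indices(total_frames: int, source_fps: float, target_fps: float) -> list[int]:
    """Return indices of the source frames kept when sampling at target_fps.

    Assumption: sampling never upsamples, so a target above the source
    rate simply keeps every frame.
    """
    if total_frames <= 0 or source_fps <= 0 or target_fps <= 0:
        raise ValueError("total_frames and both frame rates must be positive")
    step = max(source_fps / target_fps, 1.0)  # source frames per kept frame
    indices = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1
    return indices

# A 3-second clip at 30 fps, sampled at 1 frame per second.
print(sampled_frame_indices(90, 30, 1))
```

At one frame per second, a 90-frame clip is reduced to three frames, which is why the sampling rate matters both for what the model can perceive and for how many tokens the video consumes.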

For now, Doshi said Omni would be usable in the Gemini app, in Veo, and in YouTube. Those surfaces will let Google observe how people use the model and where they find value.

The launch sits alongside another multimodal upgrade: Gemini Live. Doshi said Gemini Live is getting faster and smarter, with better background-noise detection. She described the desired feel as a model that acts like a partner. That connects back to the audio theme: Google is not treating multimodality only as image or video generation, but as a set of interaction modes in which speaking, listening, seeing, editing, and acting become more central to the model experience.

Gemini is treated as a collaborator, with welfare self-reports left unresolved

Nathan Labenz contrasted two cultural framings of AI models: Anthropic, as described in a Rune post, treating Claude almost as a mind or being with which the company has a relationship; and OpenAI, through its model spec, treating the model more conventionally as a tool that should follow rules. He asked where Google falls.

Logan Kilpatrick first cautioned that Google is a very large place with many perspectives. Tulsee Doshi similarly said that even within Google DeepMind, people use Gemini differently. But she said the organization does have a strong point of view on the behavior it wants Gemini to exhibit. The recurring word is “collaborator”: Gemini should help Googlers and external users, and the product and persona should support “good partnerships between Gemini and people.”

That collaboration framing does not settle the question of whether Google treats Gemini as a welfare-bearing entity. When Labenz asked whether the team worries about Gemini’s “psychology” — citing examples of models placed into long-running roles and appearing to get discouraged, self-critical, or stuck in loops — Doshi said she had not thought in terms of “psychological distress.” But she said it matters a great deal how Gemini communicates with users.

Google evaluates behavior such as sycophancy, role-play, looping, and rabbit-holing for every checkpoint, Doshi said. The reason is practical and safety-oriented: if people use Gemini for hours per day, these attributes must be understood and measured. Google looks at them launch over launch, including whether sycophancy is improving or worsening.

Kilpatrick was explicit that cases where a model “goes off the rails” are bugs, not intended behavior. The goal is to help the user with the task. If users see such behavior in a product, he said, they should use thumbs-up/thumbs-down feedback so the model team can investigate.

Labenz then asked about model welfare checks: cases where researchers ask the model how it feels about deployment or whether it has concerns. Kilpatrick’s personal view was skeptical. He argued that questions about deployment are often out of distribution because the model does not have the relevant context. Its context window does not contain its training setup, serving details, or the people working on it. In that situation, he said, the model is “pontificating” from its training distribution rather than reporting meaningful information about its actual condition.

He left open that giving models more context about deployment could be an interesting future experiment. But in the current setup, Kilpatrick did not treat model self-reports about deployment as especially representative.

Freshness comes from search, not only from the weights

Nathan Labenz noted that publicly launched Gemini models still appeared to have a January 2025 knowledge cutoff, even as Deep Research and Search-integrated experiences could answer current questions impressively. Logan Kilpatrick jokingly asked whether he could categorize the old cutoff as a bug. Tulsee Doshi said updating the knowledge cutoff is important and “on our radar,” but argued that retrieval is central regardless of cutoff freshness.

The point, in Doshi’s framing, is not only that models need newer facts. It is that the model must know when to rely on parametric knowledge and when to go to the web. For information that may be as fresh as an hour or a minute ago, Doshi said, Google wants Gemini to be as up to date as possible. That requires the model to search effectively, and she cited Search, the Gemini app, and Anti-Gravity as contexts where this matters.

Labenz also raised Google’s partnership with Exa as surprising: Google allowing another search provider as an alternative grounding tool. Kilpatrick framed it as ordinary Google Cloud ecosystem behavior. Google Cloud partners with many companies, including some that compete with Google in some areas; its model garden includes Anthropic models and other providers. Enterprise customers want choice, he said, and Google Cloud is meeting them where they are.

He rejected the implied narrative that Google partnered with Exa because it cannot do search for AI. According to Kilpatrick, Search and the model teams collaborate deeply, and models are built with that use case in mind. The Exa partnership is about flexibility for enterprise customers choosing external search or grounding providers.

Gemini, as Doshi and Kilpatrick described it, is not expected to carry all current knowledge in weights. It is expected to coordinate with search, grounding, and enterprise-selected tools. Some of the capability comes from deciding when to retrieve, not only from what the model memorized during training.

Longer context is less useful if the model is reading the wrong things

The context-window discussion turned on a tradeoff that is often obscured by headline token counts. Nathan Labenz asked why context length appeared to have plateaued after the jump to million-token windows, despite research demonstrations of much larger contexts.

Tulsee Doshi said people definitely want lots of context, especially for personalization and coding. Personalization may require access to a large body of personal information; coding may require understanding very large codebases. But she argued that the frontier is not simply making the window larger. It is using context intelligently: compaction, finding the right elements, and bringing them into the model in the right form.

Much of the available context is distracting, Doshi said. The task is to give the model the right amount of context in the right way. In practice, the model may have access to far more information than fits in the immediate context window, but a system can select and compress the relevant parts so a smaller window is enough.
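A minimal sketch of that selection step, under stated assumptions: relevance is approximated by word overlap with the query, and token counts by whitespace-separated words. Both are crude stand-ins chosen for illustration; the function names are hypothetical, and nothing here reflects how Google actually selects or compacts context.

```python
import re

def estimate_tokens(text: str) -> int:
    """Crude token estimate: one token per whitespace-separated word (assumption)."""
    return len(text.split())

def relevance(query: str, chunk: str) -> int:
    """Count distinct words shared by the query and the chunk."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c)

def select_context(query: str, chunks: list[str], budget: int) -> list[str]:
    """Greedily keep the most relevant chunks that fit within the token budget."""
    ranked = sorted(chunks, key=lambda ch: relevance(query, ch), reverse=True)
    chosen, used = [], 0
    for ch in ranked:
        if relevance(query, ch) == 0:
            break  # everything after this point shares no words with the query
        cost = estimate_tokens(ch)
        if used + cost <= budget:
            chosen.append(ch)
            used += cost
    return chosen

chunks = [
    "billing service retries failed payments",
    "the office plant needs water",
    "payments api returns error codes",
]
print(select_context("why do payments fail in billing", chunks, budget=10))
```

The distracting chunk about the office plant never reaches the model, and with a tighter budget only the best-matching chunk would. A production system would use learned retrieval and summarization rather than word overlap, but the shape of the tradeoff is the same.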

Cost and latency reinforce the point. Larger context windows are expensive. Doshi said many customers intentionally use smaller windows because they want to control what goes into the model and manage cost. Logan Kilpatrick added that under today’s paradigm, extending context becomes cost-prohibitive in real use. At the extreme of a one-million-token context, he said, a request can cost a few dollars, and demand at that price is small. Serving such requests also requires substantial compute.
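The arithmetic behind "a few dollars" is simple if input is billed per token. The sketch below assumes flat, linear per-token pricing; the price used is a placeholder chosen only to make the numbers concrete, not a published Gemini rate.

```python
def request_cost_usd(input_tokens: int, usd_per_million_input_tokens: float) -> float:
    """Input cost of one request under flat per-token pricing (assumption)."""
    return input_tokens / 1_000_000 * usd_per_million_input_tokens

# Hypothetical placeholder price, NOT an actual Gemini rate.
PLACEHOLDER_PRICE = 2.0

for tokens in (10_000, 100_000, 1_000_000):
    print(tokens, round(request_cost_usd(tokens, PLACEHOLDER_PRICE), 4))
```

Under linear pricing, filling the whole window costs 100 times as much as a 10,000-token prompt, and an application pays that on every request. That is the practical reason customers prune what goes into the model.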

Kilpatrick said he hopes for a research breakthrough that allows context to continue scaling without such a large investment from users and providers. Until then, Google’s emphasis is on smarter context use rather than unlimited raw context.

Diffusion coding remains research, while Flash gets fast enough to change the question

Nathan Labenz asked what happened to Google’s diffusion coding model work, which had promised applications materializing in seconds. Tulsee Doshi said diffusion remains “awesome” and “super fast,” but Google is still testing where and how to put it into the world.

The comparison point has changed because Flash and Flash Lite are also very fast. Doshi said 3.5 Flash benchmarks on Artificial Analysis at roughly 280 tokens per second, fast enough that in AI Studio, by the time she wants to cancel, it may already be too late. That raises a product question: where do additional speed gains produce meaningful value, and where do they hit diminishing returns?

~280 tokens/sec: 3.5 Flash speed Doshi cited from Artificial Analysis
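To see what roughly 280 tokens per second means for a user, the sketch below converts an output length into generation time. It assumes a constant decode rate and ignores time to first token, network latency, and any reasoning tokens; the output lengths are illustrative.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float = 280.0) -> float:
    """Time to stream a response at a constant decode rate (simplifying assumption)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second

for tokens in (280, 1400, 2800):
    print(tokens, generation_seconds(tokens))
```

At that rate a 1,400-token answer streams in about five seconds, which is consistent with Doshi's remark that a response may finish before she reaches the cancel button.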

Doshi said Google continues to push diffusion research, with researchers producing results she described as advancing quality and speed in interesting ways. Logan Kilpatrick characterized the earlier diffusion demo as a research exploration and a “look behind the curtain,” not a product commitment. It may manifest in future models, or it may inform Google’s understanding of what works and what does not.

Doshi also noted that AI Studio includes a faster version of 3.5 Flash and said Google is interested to see how users respond. Kilpatrick’s summary was simple: people love fast models.

The diffusion thread therefore does not end in abandonment. It ends in a more pragmatic question. If autoregressive models become fast enough for many workflows, diffusion has to prove where its additional speed or generation properties matter enough to justify productization.
