Gemma Is Google’s On-Device Extension of Gemini Research

Omar SansevieroLatent SpaceMonday, May 25, 202613 min read

Google DeepMind’s Omar Sanseviero argues that Gemma is not a parallel alternative to Gemini but the open, local and on-device expression of the same research stream. He presents Gemma 4 as a model family optimized for efficiency, developer integration and emerging agentic use cases, while drawing a clear boundary around Gemini as Google’s route for frontier capability, broad factual knowledge and long-running tasks.

Google’s open-model strategy, as Omar Sanseviero described it, is not a separate track from Gemini research. Gemma is the local and on-device expression of the same research stream: smaller, open, efficient, integrated into developer tools, and increasingly capable at agentic behavior. But Sanseviero kept a clear boundary around what it is not. Gemini remains the path for flagship capability, broad factual knowledge, and long-running frontier tasks.

Gemma is the on-device path for Gemini-era research, not a Gemini replacement

Sanseviero described Gemma 4 as Google DeepMind’s most capable open model so far, built to “compact as much intelligence per parameter” as possible while adding multimodal capability. The architectural detail he emphasized is the distinction between effective parameters and total parameters.

In a traditional transformer, he said, there is a large embedding layer. Gemma 4 changes the transformer block by adding a per-layer embedding: at every layer, the model has an embedding table. Because that table is used as a lookup rather than through full matrix multiplication, a model can carry more total parameters than it needs to keep loaded on the GPU during inference.

For the small Gemma 4 model Sanseviero described as “e2b,” the effective load is two billion parameters on the GPU, while the model has almost five billion parameters in total. The remaining roughly three billion parameters can sit in CPU memory or on disk, allowing fast inference with a smaller active GPU footprint.

effective parameters loaded into the GPU for the small Gemma 4 model Sanseviero described

Asked why every model would not offload parameters this way, Sanseviero framed it as a research choice optimized for on-device use: phones, Android devices, Raspberry Pis, and similar environments. At larger sizes, he said, teams usually want to compact intelligence differently, using more dense architectures or mixture-of-experts approaches.

The same parameter-offloading idea appeared in Gemma 3N, which Sanseviero connected directly to Gemini Nano. Pixel phones and high-end Samsung phones, he said, already ship with Gemini Nano “baked into the operating system,” and Gemini Nano is built on top of Gemma with additional training and adaptations for traditional on-device use cases. In his telling, Google’s on-device products and open Gemma models share a research lineage rather than sitting in wholly separate tracks.

When Vibhu Sapra asked whether the Gemma 4 approach had been pushed into larger models, Sanseviero gave no numbers or roadmap. “We are doing lots of experiments,” he said, adding only: “Stay tuned.”

The open model launch is an ecosystem integration problem

Shipping Gemma 4 is not just the work of training, post-training, distillation, and related model development. The Gemma team itself is relatively small — Sanseviero described it as two or three product managers, one marketing person, and the rest engineers and researchers — but a release depends on a large external and internal integration surface.

For the Gemma 4 launch, Sanseviero said Google worked with “almost 50 external partners.” He named llama.cpp, Ollama, MLX, Hugging Face, vLLM, Nvidia, and AMD among them. Internally, the work spanned Google Cloud, Vertex, Vertex Models as a Service, ADK, and Android.

Launch surface	Examples Sanseviero named	Role in the release
External open-source and platform partners	llama.cpp, Ollama, MLX, Hugging Face, vLLM, Nvidia, AMD	Make Gemma 4 usable across common inference stacks and hardware paths
Internal Google product surfaces	Google Cloud, Vertex, Vertex Models as a Service, ADK, Android	Connect the model release to Google’s developer and platform products
Developer use case	Android Studio agent mode with llama.cpp, vLLM, or any OpenAI-compatible endpoint	Let developers use Gemma 4 locally to help write Android applications

Sanseviero framed Gemma 4 shipping as a model release plus a broad integration effort.

The Android integration shows what this work is meant to enable. With Gemma 4, Google released an integration with Android Studio’s agent mode, where a model can help developers write code and operate inside the IDE. Sanseviero said Android Studio shipped support for offline models through llama.cpp, vLLM, or any OpenAI-compatible endpoint, allowing developers to use Gemma 4 locally for Android application development.

The justification for using local Gemma instead of Gemini remains narrow but meaningful: offline operation and privacy. Sapra pressed on whether there was a reason beyond those obvious cases. Sanseviero’s answer stayed practical: if a developer wants the full setup local and does not want to send any code to any API, Gemma is the path.

That does not mean Sanseviero sees local models as replacing frontier cloud models. He said Gemma 4 is roughly matching state of the art from one to one and a half years earlier “for most things,” and local models can now deliver agentic behavior, function calling, system instructions, and conversation. But he drew a line between capability and knowledge. Knowledge, factuality, and broad world understanding still favor a larger model like Gemini.

With local models or models that you can run in your own hardware, you can get capability, so you can get agentic capabilities, function calling, system instructions, conversational, and that kind of stuff. Knowledge is much trickier.

Omar Sanseviero · Source

Sanseviero’s expected future is not cannibalization but a shift in which many agentic tasks become feasible on device. He imagined a one- to two-year future in which something like a “Gemini 3 Pro powerful model” can run directly on a phone. Flagship long-running tasks and high-factuality work would still use Gemini, but many product experiences could move locally once on-device models are strong enough.

Gemma 4 inherits multimodality from Gemini research, with clear limits

Gemma 4 was built on the same research as Gemini 3, Sanseviero said, and therefore benefited from Gemini 3’s improvements. On the smaller models, multimodality includes understanding audio, images, and short video. He defined short video and audio as roughly 30 to 60 seconds, which Shawn Wang noted is already “quite long.”

The audio capabilities Sanseviero listed were speech recognition, speech-to-text translation, and some speech understanding — including asking questions about an audio file. He emphasized that these are optimized for on-device phone use cases.

On vision, Sanseviero said Gemma 4 improved object detection, pointing, and captioning. But he also identified two missing capabilities. Image segmentation is not supported, despite being something many users have asked for. Video with audio is also not yet supported in a single prompt: the model can understand video input or audio input separately, but not the combined visual and audio stream together. He suggested additional fine-tuning could produce a good baseline for that combined use case.

Audio output remains undisclosed. Asked about speech out, Sanseviero said Google is exploring things but had nothing to share. The surrounding exchange contrasted excitement about native speech-to-speech systems with the practical persistence of pipeline approaches, but Sanseviero did not claim a near-term Gemma speech-output feature.

Multilinguality is another part of the model design Sanseviero singled out. Gemma supports 140 languages, and he credited the tokenizer as a major reason it can adapt well to additional languages. The tokenizer is based on the Gemini tokenizer, which he described as “extremely good” and able to capture useful token structure across languages.

The notable claim is that tokenizer quality can matter even when the base model is not the best general model. Sanseviero said that, in the Gemma 3 generation, other models may have been stronger as general-purpose base models, but if all were fine-tuned for a specific Southeast Asian language such as Vietnamese, Gemma could yield better results because of the tokenizer.

Fine-tuning is becoming less central for general behavior, but not obsolete

Sanseviero described a real shift in fine-tuning culture. Around 2023 or 2024, he said, fine-tuning communities were especially active. Over the last two years, that enthusiasm has changed because models have become much better out of the box.

The Gemma 4 partner program gave him a concrete example. Some of the 50 to 60 partners planned to fine-tune the 27B model for vision tasks, then found that the model already worked well enough without fine-tuning. Sanseviero said he saw “lots” of cases like that.

His distinction is between general conversational behavior and specialized domains. For changing how a model behaves conversationally, prompting now handles much of what fine-tuning once did. For capability gaps involving specific domains — finance, healthcare, or data the model did not see — fine-tuning still has a role.

Med-Gemma is Google’s own example of that domain-specific path. Sanseviero said Med-Gemma 1.5, released three months earlier, is based on Gemma 3 with additional training on medical datasets.

The on-device case introduces a harder deployment problem. One speaker raised the possibility of task-specific LoRAs for phone models. Sanseviero’s response was less about whether LoRAs can work technically and more about the product and developer lifecycle. If 20 apps on a phone each ship their own LoRA, then a base-model update may require updating all 20 LoRAs. Keeping 20 different base models on device would be bad for battery and system resources, but coordinating many LoRA updates across mobile release cycles is also difficult.

For Sanseviero, that makes on-device ML deployment an industry-level developer experience problem. It is not enough to make small models efficient; the ecosystem needs a workable pattern for how apps, base models, adapters, updates, and platform constraints coexist.

Dense, sparse, and offloaded models make “intelligence per parameter” harder to compare

The discussion of larger Gemma models used several spoken reference points rather than a fully reconciled product taxonomy. Sapra first referred to Gemma 4 as having “a 29B and a 31B,” with one MoE and one dense. Later, the exchange centered on a 31B dense model and a 27B mixture-of-experts model with 4B activated parameters. Sanseviero described the dense model as offering the most “raw intelligence,” while the MoE model is built for very fast inference within developer-friendly constraints.

The sizing decisions were not arbitrary, in his account. The 31B dense model is the largest size that, when quantized, would fit on a consumer GPU. The 27B MoE trades total parameters against active parameters, with 4B activated, to make inference faster within the same broad class of constraints.

MoEs are attractive for inference but harder for fine-tuning. Sanseviero said they are “not as easy to fine-tune for instruction following,” and that standard recipes and hyperparameters may not transfer cleanly. When asked for the intuition — whether routing disrupts backpropagation — he was careful not to overclaim. His intuition was that routing and distribution shifts both matter: a fine-tuning distribution may affect the router differently than it would affect a dense model. He also pointed to variables such as how many experts are triggered and whether the router is frozen.

The broader metric Sanseviero returned to is “intelligence per parameter.” Across Gemma 2, 3, and 4, Google has kept a roughly 30B size class while increasing capability. That makes progress visible without simply increasing parameter count.

Gemma, we have done the same size, right? 27, like almost 30 billion, around 30 billion parameters for Gemma 2, 3, and 4. And the intelligence is much higher, right? Like we have not increased the model size.

Omar Sanseviero · Source

But sparsity and offloading complicate the comparison. Sanseviero said MoEs and dense models are not apples to apples. There are “napkin calculations,” but no simple equivalence. A model with roughly 30B total parameters, a smaller number of active parameters, and another with dense parameters cannot be compared by total size alone.

The same complication applies to future limits. Sanseviero said he could imagine a 30B parameter model becoming “extremely powerful” within three years, especially for agentic tasks. But he again separated reasoning or agency from knowledge storage. A smaller model may do “super wild agentic stuff” while still lacking niche factual knowledge, such as obscure historical or country-specific facts. The limitation is also about the role people expect model weights to play as a database.

Diffusion text generation is still mostly a speed bet

Google brought researchers to AI Engineer London not only to discuss Gemma, but also areas such as diffusion transformer models for text generation and mechanistic interpretability. Sanseviero referred to Gemini Diffusion, announced at Google I/O the previous year, as a text-generation diffusion model that can generate code extremely quickly.

When pressed on whether diffusion text models offer something beyond speed — for example, a qualitatively different way to fill in code structure or handle “fill in the middle” tasks — Sanseviero kept the answer restrained: “It’s mostly speed.” On whether text diffusion might overtake autoregressive models, he called the area “very experimental.” He said Google would share more research around diffusion text-generation models, but current quality is still worse than what a normal autoregressive model can produce. He also noted that diffusion transformer models are difficult to fine-tune.

The discussion around fill-in-the-middle exposed how much has changed in code models. Previously, Sanseviero said, companies treated fill-in-the-middle as a special generation task, often involving strict formats, special prompt structure, or dataset rearrangement. If users deviated from the training format, performance could degrade. Now, in his view, general autoregressive models often provide strong fill-in-the-middle behavior out of the box.

Some possibilities remained speculative. One speaker imagined a system-one/system-two split, with a diffusion model acting as a fast planner or executor and an autoregressive model handling another part of the agent loop. Sanseviero said he could see a world with a strong agent-manager setup and diffusion-based executors for specific coding tasks, but the surrounding exchange treated that as hypothetical rather than a product claim.

Mechanistic interpretability is one route from engineering into research

Mechanistic interpretability was the other research doorway Sanseviero emphasized. Gemma Scope, released in December, lets users analyze activations across different layers based on tokens. The team generated activation data for every layer across all Gemma 3 models, producing a very large dataset — Sanseviero was unsure whether it was a couple of terabytes or as much as one petabyte.

He characterized mechanistic interpretability as a niche field but a good opportunity, especially because it does not require large compute to get started. Engineers can experiment with activations, build or use open-source tools, and get a sense of how transformer architectures behave internally.

That fit into a broader point about why researchers belonged at an applied AI engineering conference. One speaker argued that engineers want to understand how the models they use were trained, even if they never train models themselves, because it makes the systems more intelligible and trustworthy. Sanseviero agreed with the premise that a large part of research is empirical experimentation. Many researchers, he said, are doing ablations: moving pieces around, seeing what works and what does not, and iterating. Some research is deeper architecture design, but much of the day-to-day process resembles engineering more than a clean separation between scientist and implementer.

The same theme carried into “auto-research.” Sanseviero was skeptical of older AutoML framing, which he described as largely parameter search — a greedy search over a hyperparameter space. The current wave looks different because coding agents can run more of the experimental loop, but he did not claim that deep research will soon be fully automated.

He expects the next generation of fine-tuners to include people who do not code fine-tuning pipelines themselves. A year earlier, he said, people had to write code using Transformers, Unsloth, or another library. Going forward, many people will prompt agents equipped with skills from Hugging Face or other tools to launch fine-tuning experiments and compare results. But for deeper architectural research, his hunch is that it will not be automatable in the next one or two years.

DeepMind’s developer organization is meant to carry community signal back into research

Sanseviero described Google DeepMind’s developer experience and DevRel work as an attempt to redefine what developer relations should look like inside an AI-centric frontier research lab. The team is hiring “high agency” people who can build, engage with the community, and operate close to research.

Geography matters, but mainly because of proximity to DeepMind research hubs. Sanseviero said the team is growing in Singapore and India, with Singapore becoming a small but fast-growing DeepMind hub. He emphasized that the goal is not to create isolated sales offices or single-person outposts. New developer-facing hires should ideally be co-located with DeepMind researchers, even if those researchers are working on different projects. He listed Paris, London, Zurich, San Francisco, New York, and now Singapore as relevant hubs.

He also described a broader organizational shift: DeepMind historically did not do much product, but now has AI Studio, the Gemini API, and Kaggle. Kaggle had recently joined DeepMind, and Sanseviero connected it to both community hackathons and benchmarks.

Kaggle’s benchmark role matters because of evaluation quality. Sanseviero said many benchmarks can be “benchmaxed” and gamed. The opportunity he sees is to use Kaggle’s community and leaderboard dynamics to identify capabilities that Gemini has, lacks, or could improve. He mentioned a new experimental agent-evaluation system where agents can take an exam and compete on a leaderboard.

The feedback loop he described is central to the way he framed Gemma, Gemini, and related tools: feedback from startups, community developers, forums, social media, events, and benchmarks is brought back to the modeling teams. The developer experience team’s job is to collect that signal and make it useful to the people building the models.

Data and Training AI Labs and Strategy Evals and Benchmarks AI Research Methods Inference and Deployment Agents and Autonomy Multimodal AI Open Models Model Releases Coding Assistants