Gemma 4 Moves On-Device AI From Chatbots to Local Agents
Chintan Parikh of Google DeepMind argues that on-device AI is moving from local chatbots toward local agents, as smaller Gemma 4 edge models become capable of tool calling, structured output and reasoning on phones, laptops and embedded hardware. With Weiyi Wang joining the Q&A, Parikh presents LiteRT as the deployment layer for that shift across Android, iOS, desktop, web and IoT. His case is pragmatic rather than absolute: edge inference can improve latency, privacy, offline use and cost, but teams still have to manage memory, quantization, accelerator support and when to call the cloud.

The edge target is shifting from chat to local agents
Chintan Parikh framed Google DeepMind’s edge AI work around a practical change: smaller models are becoming capable enough to move beyond local chat interfaces and into agentic workflows that run on the device itself. He was joined by Weiyi Wang for the Q&A, but the main walkthrough centered on Gemma 4 edge models and the LiteRT deployment stack.
Parikh said the focus was specifically on Gemma 4 E2B and E4B, the edge-oriented variants of Google DeepMind’s recently launched Gemma 4 family. The broader Gemma family also includes smaller options, such as Gemma 3 models down to 270 million parameters, which he positioned as useful for teams that need extremely small models they can fine-tune. But the emphasis here was the new edge pair: models intended to run locally while supporting more sophisticated agent behavior.
The shift Parikh described is not only about model size. It is about what local inference can now do. He said Gemma 4’s edge evolution is “moving from chatbot-type capabilities to more autonomous agents” with reasoning and more sophisticated features, while still being deployable across different devices.
The reason to put this work on-device, in Parikh’s framing, is not ideological. It is a set of tradeoffs: latency, privacy, offline use, and cost. For real-time camera features such as filters, background replacement, and video calls, he said “real-time latency is king,” and on-device execution can help. For document summarization or other sensitive workflows, privacy is the reason not to send data off-device. For use cases with poor connectivity, offline operation matters. And for teams watching token bills rise, local inference offers a way to rebalance cloud and device workloads.
Parikh did not present edge inference as a replacement for cloud AI in all cases. He described it as a hybrid option: run what can be run locally, call out when needed, and look for the best balance between local execution and cloud services.
Gemma 4 E2B and E4B divide the edge market by memory and task
The two Gemma 4 edge models serve different hardware and product envelopes. E2B is the smaller edge specialist; E4B is the heavier local assistant for richer workflows.
| Model | Positioning | Memory guidance | Use cases highlighted |
|---|---|---|---|
| Gemma 4 E2B | Efficient specialist for smartphones and IoT | Roughly 1–2GB RAM when quantized | Light background tasks, voice interfaces, summarization, low-latency local processing |
| Gemma 4 E4B | Pro assistant for deeper reasoning | 3GB+ RAM on higher-end laptops or edge servers after quantization | Agentic workflows, complex coding assistance, advanced vision-to-action logic |
Chintan Parikh qualified the memory numbers as dependent on the end use case and the chosen quantization. E2B, in his description, is plausible for smartphones and IoT-style deployments. E4B is “more heavy-duty,” suitable for larger devices and workflows that need deeper reasoning.
The capabilities he emphasized were agentic rather than purely conversational. Gemma 4 E2B and E4B support built-in function calling, so a local model can invoke tools, APIs, or system-level functions. They support structured JSON output natively, which Parikh distinguished from trying to force formatting through prompt engineering. E4B also includes a “thinking” mode intended to show the model’s reasoning process in the Gallery app.
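In practice, that pattern looks roughly like the sketch below. The tool declaration, the set_alarm helper, and the dispatcher are illustrative only, not the actual Gemma 4 or LiteRT function-calling API; the point is that a model emitting valid, schema-shaped JSON can be wired to device functions with a parse-and-dispatch step rather than prompt-engineered string cleanup.

```python
import json

# Hypothetical tool declaration: the schema the local model is asked to target
# when it decides to call a tool. Names and fields are illustrative, not the
# actual Gemma 4 / LiteRT function-calling interface.
SET_ALARM_TOOL = {
    "name": "set_alarm",
    "description": "Set a device alarm at a given time.",
    "parameters": {
        "type": "object",
        "properties": {
            "hour": {"type": "integer", "minimum": 0, "maximum": 23},
            "minute": {"type": "integer", "minimum": 0, "maximum": 59},
        },
        "required": ["hour", "minute"],
    },
}

def set_alarm(hour: int, minute: int) -> str:
    """Stand-in for a system-level API the local agent is allowed to invoke."""
    return f"Alarm set for {hour:02d}:{minute:02d}"

TOOL_REGISTRY = {"set_alarm": set_alarm}

def dispatch(model_output: str) -> str:
    """Parse the model's structured JSON output and run the named tool.

    With native structured output the model emits valid JSON directly, so this
    step is a parse-and-validate rather than regex cleanup of free-form text.
    """
    call = json.loads(model_output)
    fn = TOOL_REGISTRY[call["name"]]
    return fn(**call["arguments"])

# e.g. the local model, given the tool schema, might emit:
print(dispatch('{"name": "set_alarm", "arguments": {"hour": 7, "minute": 30}}'))
```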
The fourth capability was hardware-native deployment. Parikh’s point was that these models are meant to run across a wide platform range rather than being bound to a single operating system or accelerator. That hardware story becomes central later in the stack: Android, iOS, desktop operating systems, web, IoT boards, CPUs, GPUs, and NPUs all appear in the deployment path.
He also said the Gemma 4 LiteRT models are available through Google’s Hugging Face LiteRT community page, with the Gemma 4 E2B and E4B LiteRT language-model entries visible in the slides. He described them as Apache 2.0 licensed and ready for developers to download and build with.
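For orientation, fetching one of those model bundles typically looks like the snippet below, using the standard Hugging Face Hub client. The repository id is a placeholder; the exact Gemma 4 E2B and E4B entries should be taken from the LiteRT community page itself.

```python
from huggingface_hub import snapshot_download

# Placeholder repository id: substitute the actual Gemma 4 E2B or E4B entry
# listed on Google's LiteRT community page on Hugging Face.
local_dir = snapshot_download(repo_id="litert-community/<gemma-4-edge-model>")
print("Model files downloaded to:", local_dir)
```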
The Gallery app shows the local model as controller
The Google AI Edge Gallery app was the main way the talk showed Gemma 4 edge models in action. The common architecture behind its demos matters more than any single example: the model performs local inference, while “skills” let it call tools, local APIs, or other capabilities. In that pattern, the model becomes a controller for actions on the device, rather than only a local chat interface.
Chintan Parikh described the app as a playground for developers. The slides presented it as “your local AI agent,” with on-device capabilities including chat, agent skills, Ask Image, and Audio Scribe. Parikh said the app has sample code, can be forked, and is meant to help developers understand what the models can do and then build their own experiences.
One demonstration showed knowledge augmentation. The on-screen example asked for detailed information about the movie “Dune” using the Brave Search API, then returned a summary. Parikh described this as letting the agent query Wikipedia or another knowledge source to answer encyclopedia-style questions, extending beyond the model’s initial training data. The core inferencing, in his framing, remains on the edge, while the skill can call outward when the product requires it.
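The skill pattern itself is simple to sketch. The snippet below is illustrative rather than the Gallery app’s actual skill interface: a search skill fetches a few results from a web search endpoint (the Brave endpoint and response shape shown are assumptions to verify against current documentation), and the local model only sees the condensed text it is asked to summarize.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

def web_search_skill(query: str, api_key: str) -> str:
    """Illustrative search skill: fetch a few results from a web search API
    and return a compact text block for the local model to summarize.

    The endpoint and response fields follow Brave's documented web search API,
    but treat both as assumptions and check the current docs before relying on them.
    """
    req = Request(
        "https://api.search.brave.com/res/v1/web/search?q=" + quote(query),
        headers={"X-Subscription-Token": api_key, "Accept": "application/json"},
    )
    with urlopen(req) as resp:
        data = json.loads(resp.read())
    results = data.get("web", {}).get("results", [])[:3]
    return "\n".join(f"- {r.get('title')}: {r.get('description')}" for r in results)

def summarize_locally(question: str, context: str) -> str:
    """Stand-in for the on-device step: in the Gallery pattern this prompt
    would go to the local Gemma model; here it is only assembled and returned."""
    return f"Using only this context:\n{context}\n\nAnswer the question: {question}"
```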
Another example turned journal-style mood entries into a trend analysis. The visible prompt included a morning journal entry with a score of 9, a comment about eight hours of sleep, and a request to analyze the trend over seven days. The app returned a general mood trend, highlighted high-scoring days, and noted a lower-scoring day associated with feeling tired and stressed. Parikh used this example to point to the newer reasoning and thinking capabilities in the model.
A third example combined image understanding with generated media. The user sent a breakfast photo and asked the app to “pair this vibe with some music.” The skill called a “vibe music generator” and returned a low-energy pop suggestion labeled “Vibe Lo-Fi.” Parikh said this showed how Gemma 4’s core capabilities can be expanded by integrating with other models, such as text-to-speech, image generation, or music synthesis.
The app also supports developers creating their own skills. Parikh said many of these skills can be written inside the app itself, with instructions included. Slides showed community skills in GitHub Discussions, including Brave Web Search, DuckDuckGo API Search, Google News, Weather Query, and Python via Pyodide. The sample app is open source on GitHub, and Parikh said users are already posting skills for others to use.
LiteRT is the deployment layer beneath the demos
After the model and app examples, the talk moved to deployment. The underlying framework for the Gallery app examples, Chintan Parikh said, is LiteRT: Google’s on-device framework for high-performance machine learning and GenAI deployment on edge platforms.
LiteRT is an evolution of TensorFlow Lite. Parikh said it inherits TensorFlow Lite’s foundation and core functionality and uses the same TensorFlow Lite model format. For developers already using TensorFlow Lite, his message was continuity: existing TFLite-format models continue to run.
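That continuity is easy to illustrate. The sketch below assumes the ai-edge-litert Python package, whose interpreter is intended as a drop-in for the classic TFLite interpreter API; an existing .tflite file loads and runs unchanged.

```python
import numpy as np
from ai_edge_litert.interpreter import Interpreter  # drop-in for tf.lite.Interpreter

# An existing TensorFlow Lite model file keeps working under LiteRT.
interpreter = Interpreter(model_path="existing_model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed dummy input matching the model's expected shape and dtype, then run.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]).shape)
```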
The scale claims on Google DeepMind’s slide were large and explicitly attributed to internal Android data: more than 100,000 third-party Android apps, more than 2 billion active Android devices, more than 2,000 Google first-party projects, and millions of daily interpreter invocations in Android. Parikh presented those numbers as evidence that LiteRT is being built on a widely deployed foundation.
The rebranding from TensorFlow Lite to LiteRT also reflects broader model support, according to Parikh. He said LiteRT is meant for “bringing your own models,” including PyTorch and JAX models, not only TensorFlow models. The deployment flow he described is straightforward: convert the model, optimize it, deploy it at the edge, and run inference at the edge.
| Stack layer | Component shown | Role in the deployment path |
|---|---|---|
| Model conversion | LiteRT Torch | Convert models, including bring-your-own-model workflows, into the TFLite format |
| Quantization | AI Edge Quantizer | Reduce or tune model precision for edge deployment |
| Pipelines | LiteRT LM | Run LLM-oriented on-device pipelines |
| Runtime and kernels | LiteRT | Execute inference at the edge across supported hardware |
| Tools | Model Explorer, AI Edge Portal | Inspect model graphs and benchmark across device fleets |
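As a rough sketch of that convert-then-deploy flow, assuming the ai-edge-torch package behaves as its documentation describes, a bring-your-own PyTorch model can be converted into the TFLite format that the LiteRT runtime executes:

```python
import torch
import ai_edge_torch  # PyTorch -> TFLite-format conversion path

# A small PyTorch model standing in for a bring-your-own model.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
).eval()

sample_input = (torch.randn(1, 16),)

# Convert to the TFLite model format the LiteRT runtime executes,
# then write it out for deployment on the device.
edge_model = ai_edge_torch.convert(model, sample_input)
edge_model.export("converted_model.tflite")

# Quick parity check on the host before shipping to the edge.
print(edge_model(*sample_input).shape)
```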
Parikh described Model Explorer as a way to inspect a graph and decide which parts to change or quantize, including mixed-precision choices. He described AI Edge Portal as a cloud-based benchmarking service for developers who need to test models across broad Android device fleets, including older phones. The practical question he posed was the one product teams eventually face: if a model works on one device, how do they know it will work reliably on five- or six-year-old phones?
That question also affects compilation strategy. Parikh said AI Edge Portal can help teams decide whether ahead-of-time compilation or just-in-time, on-device compilation is the right recipe for a model and target fleet.
Acceleration is strongest on prefill and time to first token
Cross-platform support was a recurring claim: one LiteRT Gemma model deployable across mobile, desktop, IoT, and embedded targets, with CPU and GPU broadly available and NPU support depending on platform. The slide listed Android with CPU, GPU, and NPU; iOS with CPU and GPU; macOS, Windows, and Linux with CPU and GPU, with Windows also listing NPU; Raspberry Pi with CPU; Nvidia Jetson Orin with GPU; and Qualcomm IQ-8275 with NPU.
For CPU and GPU, Chintan Parikh described the libraries as relatively universal. For NPU acceleration, he said Google had completed integration with Qualcomm and MediaTek and was focusing on additional partner integrations. The slide also listed Google Tensor, Qualcomm Snapdragon, Samsung Exynos, Intel, and MediaTek under NPU ecosystem coverage.
His strongest claim concerned the performance and power implications of NPUs for real-time workloads. For ASR, TTS, camera feeds, AR/VR, and other applications requiring real-time capability, Parikh said NPUs could deliver “at least like a 3 to 10x improvement in performance,” with meaningful energy benefits.
The benchmark slides showed a more specific hierarchy. On the Android result, the biggest differences were in prefill throughput and time-to-first-token. Decode improved less dramatically and varied by backend. On the S24 Ultra, CPU prefill was 557 tokens per second, OpenCL prefill was 3,808 tokens per second, and NPU prefill was 7,463 tokens per second. Time-to-first-token moved from 1.8 seconds on CPU to 0.3 seconds on OpenCL and 0.1 seconds on NPU. Decode was much closer: 46.9 tokens per second on CPU, 52.1 on OpenCL, and 48.1 on NPU.
| Platform and device | Backend | Prefill | Decode | Time to first token |
|---|---|---|---|---|
| Android + S24 Ultra | CPU | 557 t/s | 46.9 t/s | 1.8 s |
| Android + S24 Ultra | OpenCL | 3,808 t/s | 52.1 t/s | 0.3 s |
| Android + S24 Ultra | NPU | 7,463 t/s | 48.1 t/s | 0.1 s |
| iOS + iPhone 17 Pro | CPU | 532 t/s | 26.0 t/s | 1.9 s |
| iOS + iPhone 17 Pro | Metal | 2,878 t/s | 56.5 t/s | 0.3 s |
The Android slide stated that the S24 Ultra NPU provided a 13.4x boost over the CPU baseline. Parikh summarized it as “up to a 13x boost in some cases,” and noted that iOS support was part of the same story.
Desktop and embedded results showed that platform performance varies materially. On Linux with an NVIDIA backend through WebGPU, the slide showed 11,234 prefill tokens per second and 143.4 decode tokens per second. On a MacBook Pro M4 using WebGPU, it showed 7,835 prefill tokens per second and 160.2 decode tokens per second. Windows with Intel and WebGPU was much lower in the shown table, at 472 prefill tokens per second and 18.6 decode tokens per second, with a 2.2-second time-to-first-token, the same as that machine’s CPU backend.
For IoT and embedded, the slide showed Raspberry Pi 5 running Gemma 4 E2B on CPU at 133 prefill tokens per second, 7.6 decode tokens per second, and 7.8 seconds to first token. A Qualcomm Dragonwing IQ8 board using NPU showed 3,747 prefill tokens per second, 31.7 decode tokens per second, and 0.3 seconds to first token.
The benchmarking notes matter. The desktop and embedded slides said the results were benchmarked via LiteRT LM with 1,024 prefill tokens, 128 decode tokens, and a 2,048 context length, from warm cache. The model can support up to a 32k context length, according to the slide. Time-to-first-token did not include initialization time.
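Those settings also make the headline numbers easy to sanity-check: with 1,024 prefill tokens, time to first token is roughly the prefill token count divided by prefill throughput, and the NPU boost is the ratio of the two prefill rates. A quick check against the Android table:

```python
PREFILL_TOKENS = 1024  # benchmark setting stated on the slide

# S24 Ultra prefill throughput in tokens/second, per the Android slide.
prefill_tps = {"CPU": 557, "OpenCL": 3808, "NPU": 7463}

for backend, tps in prefill_tps.items():
    print(f"{backend}: est. time to first token ~ {PREFILL_TOKENS / tps:.2f} s")
# CPU ~1.84 s, OpenCL ~0.27 s, NPU ~0.14 s -- in line with the 1.8 / 0.3 / 0.1 s shown

print(f"NPU vs CPU prefill speedup: {prefill_tps['NPU'] / prefill_tps['CPU']:.1f}x")
# ~13.4x, matching the slide's stated boost
```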
Parikh also compared LiteRT-LM with llama.cpp on mobile. The slide claimed LiteRT is up to 35x faster on mobile CPU/GPU for Gemma 4 E2B on an S24 Ultra. It showed CPU prefill of 557 tokens per second for LiteRT-LM versus 18.6 for llama.cpp, and GPU prefill of 3,808.8 versus 108.6. Decode results were also higher for LiteRT-LM: 26.6 versus 3.5 on CPU, and 49.7 versus 13.5 on GPU. Parikh summarized the comparison by saying LiteRT is “up to like 35 times faster” on mobile, “at par” on desktop, and about 3x on IoT.
The Q&A narrowed the claims to concrete deployment questions
The Q&A made the edge tradeoffs more concrete. One attendee asked about a home security camera that recognizes whether his son came home, with a local computer sending a phone notification only when a known person is recognized. The motivation was cost: sending the stream to the cloud would be expensive.
Chintan Parikh answered by pointing to face unlock on phones. On iPhones and Pixels, he said, face unlock typically runs locally already. On Google devices, he said, it uses the same framework, while Apple’s face unlock, in his description, runs on-device using Core ML. For a home camera, he cautioned that the deployment would be different from a phone and would require some algorithm to check authenticity, but he agreed that a local Raspberry Pi connected to a camera could run the recognition locally and send the phone message only when triggered.
Another attendee asked how LiteRT compares with ONNX Runtime. The answer was limited. An unidentified speaker said that for the web API they use either WebGPU or OpenGL under the hood, translating to the OS API directly. Parikh said he did not have ONNX material in the slide deck and thought it might be on Hugging Face, but did not give a direct comparison in the captured exchange.
A third question asked about multiple nodes running a model, where one trigger connects into a hierarchy and another agent processes the input. Parikh said he did not have examples to share, but mentioned groups working on speaker-and-thinking-agent-style architectures and orchestration that decides what should run locally versus elsewhere. An unidentified speaker added that a common pattern is to use a small classifier or routing model to determine intent and drop requests that do not need further processing.
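That routing pattern is straightforward to sketch. The classifier below is a placeholder standing in for whatever small local model or rule set a team would actually use; the structure is the point: classify, drop what needs nothing, keep light work local, escalate the rest.

```python
from enum import Enum

class Route(Enum):
    DROP = "drop"    # no further processing needed
    LOCAL = "local"  # handle with the on-device model
    CLOUD = "cloud"  # escalate to a larger hosted model

def classify_intent(request: str) -> Route:
    """Placeholder routing classifier. In practice this would be a small
    local model (or keyword rules) tuned to the product's actual intents."""
    text = request.lower().strip()
    if not text:
        return Route.DROP
    if any(word in text for word in ("summarize", "remind", "note")):
        return Route.LOCAL
    return Route.CLOUD

def handle(request: str) -> str:
    route = classify_intent(request)
    if route is Route.DROP:
        return "ignored"
    if route is Route.LOCAL:
        return f"[on-device model handles] {request}"
    return f"[cloud model handles] {request}"

print(handle("Summarize my notes from today"))
print(handle("Write a detailed market analysis of the EV sector"))
```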
The final captured question came from a developer using Gemini Live API for audio-to-text in a mobile app. Parikh said the session had focused on Gemma because it was part of the DeepMind track, but LiteRT can support other open-weight models if they can be put into the right file format and if their sizes fit the application. Some open-weight models are provided for ease of use on the Hugging Face page; others developers can convert and use themselves.
