Android Makes Gemini Nano a Shared System Service for Apps

Oli Gaymond · AI Engineer · Friday, May 22, 2026 · 11 min read

Google’s Florina Muntenescu and Oli Gaymond argue that Android’s on-device AI strategy depends on treating Gemini Nano as a shared system service, not something each app ships and manages itself. In their account, AICore centralizes the three-to-four-gigabyte model, scheduling, battery management and privacy boundaries, while developers call higher-level ML Kit GenAI APIs. The constraint is reach: those APIs need recent flagship-class devices, so Google is positioning hybrid cloud fallback and LiteRT-LM as alternatives when local Gemini Nano is unavailable or too limiting.

Gemini Nano is a system asset, not an app payload

Gemini Nano is too large and resource-sensitive for Android to treat it like a normal per-app dependency. Oli Gaymond said the smallest useful models are around one gigabyte, while the Gemini Nano models Google is shipping are closer to three or four gigabytes in total. That is the practical reason Android puts the model behind AICore, a system service, rather than asking every developer to package and manage it independently.

3–4 GB: approximate total size of the Gemini Nano models Google is shipping, according to Gaymond
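
The storage argument is easy to put in numbers. The sketch below assumes a hypothetical phone with ten apps that each want a generative model, and uses 3.5 GB as a midpoint of Gaymond's three-to-four-gigabyte figure. Both values are illustrative assumptions, not figures from the talk.

```python
# Back-of-the-envelope storage comparison: per-app model copies vs one shared copy.
# MODEL_SIZE_GB is an assumed midpoint of the 3-4 GB range quoted by Gaymond;
# APP_COUNT is a hypothetical number of apps that use the model.

MODEL_SIZE_GB = 3.5
APP_COUNT = 10


def bundled_storage_gb(app_count: int, model_size_gb: float) -> float:
    """Storage used if every app ships its own copy of the model."""
    return app_count * model_size_gb


def shared_storage_gb(app_count: int, model_size_gb: float) -> float:
    """Storage used if one system-level copy is shared by all apps."""
    return model_size_gb if app_count > 0 else 0.0


if __name__ == "__main__":
    bundled = bundled_storage_gb(APP_COUNT, MODEL_SIZE_GB)
    shared = shared_storage_gb(APP_COUNT, MODEL_SIZE_GB)
    print(f"bundled per app: {bundled:.1f} GB, shared: {shared:.1f} GB")
```

Under these assumptions the per-app approach grows linearly with the number of apps, while the shared copy stays constant.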

The design creates a layered AI stack. Florina Muntenescu described three deployment paths for Android apps: on-device inference, hybrid inference, and cloud inference. On device means prompts and data are processed locally through Gemini Nano when supported. Hybrid means the app uses the on-device model when it is available and falls back to cloud inference when it is not. Cloud inference means using hosted Gemini models through Firebase AI Logic.

Inference path	Where the work runs	What the source emphasized
On-device inference	Local Android hardware	Prompts and data are processed without being sent to a server
Hybrid inference	On device when Gemini Nano is available; cloud fallback when it is not	A way to increase device reach while still using local inference where possible
Cloud inference	Google-hosted models through Firebase AI Logic	Access to more powerful models such as Pro, Flash, and Flash-Lite

The three Android AI deployment paths described by Muntenescu

On-device inference gives developers the benefits normally associated with local processing: sensitive data can remain on the phone, features can work offline, and inference does not add server or cloud-compute cost. Muntenescu’s examples were workloads where those properties matter or where a small local model is sufficient: banking-like sensitive data, personalization, short-context work, translation, summaries, and snippets.

Android exposes the on-device route in two broad ways. Developers can use ML Kit GenAI APIs to access Gemini Nano, Google’s on-device model, or use LiteRT-LM when they need to deploy customized LLMs. Muntenescu focused the Android developer path around ML Kit GenAI; LiteRT-LM is the more customizable route.

Gemini Nano is Google’s “most efficient model for on-device tasks.” Muntenescu said it uses the same architecture as Gemma 4, but is optimized for Android devices. The model arrives through AICore, which centralizes deployment and execution so that one system-level model can be shared by apps.

Gaymond compared AICore to a managed cloud service brought onto the device. Developers do not set up the LLM, manage TPU inference, tune runtime behavior per device, or handle model delivery. They focus on the feature and the prompt; the system handles deployment, optimization, hardware selection at runtime, and execution. As he put it, “You just focus on your feature, your prompt, and then the service provides everything. We’re doing the same thing for on-device.”

AICore is also where Android places privacy and safety boundaries for Gemini Nano. Muntenescu said requests are run in isolation so one app’s request is not mixed with another’s. The slide stated that input and output data are never stored on the device. In her wording, the goal is to make the flow “as private and secure as possible.”

ML Kit GenAI is the high-level route; LiteRT-LM is the customizable one

ML Kit is broader than Gemini Nano. Florina Muntenescu described the GenAI APIs as part of the larger ML Kit surface, which already includes Vision and Natural Language APIs. The listed ML Kit capabilities included text recognition, face detection, pose detection, selfie segmentation, document scanning, barcode scanning, image labeling, object detection and tracking, digital ink recognition, language identification, translation, smart reply, and entity extraction.

The GenAI portion adds task-oriented APIs such as summarization, proofreading, rewriting, image description, speech recognition, and a prompt API. Muntenescu called the prompt API “the most powerful” of the GenAI APIs because it lets developers send natural-language requests to Gemini Nano. At the time of the discussion, the alpha release supported text and images as input and text as output.

The use cases she associated with the prompt API were broad: image understanding, intelligent document scanning, content analysis, creative assistance, and entity extraction. Her concrete examples included classifying photos as “birthday party” or “hiking trip”; taking a picture of a receipt, extracting text with a Text Recognition model, then using Gemini Nano to categorize expenses; labeling reviews or feedback as positive or negative; automatically tagging notes, photos, or emails; brainstorming content ideas; and extracting names, phone numbers, emails, addresses, and event details.
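
The receipt example is a two-stage pipeline: text recognition first, then a categorization prompt. The sketch below shows that shape in platform-neutral Python. The `recognize_text` and `run_prompt` callables are hypothetical stand-ins for a text-recognition model and the prompt API, and the category list is invented for illustration; none of these names come from ML Kit.

```python
from typing import Callable

# Hypothetical expense labels; a real app would define its own.
CATEGORIES = ["groceries", "travel", "dining", "other"]


def build_expense_prompt(receipt_text: str) -> str:
    """Compose a classification prompt from recognized receipt text."""
    labels = ", ".join(CATEGORIES)
    return (
        f"Classify this receipt into exactly one of: {labels}.\n"
        "Answer with the category name only.\n\n"
        f"Receipt text:\n{receipt_text.strip()}"
    )


def parse_category(model_output: str) -> str:
    """Map free-form model output onto a known label, defaulting to 'other'."""
    answer = model_output.strip().lower()
    for label in CATEGORIES:
        if label in answer:
            return label
    return "other"


def categorize_receipt(
    image: bytes,
    recognize_text: Callable[[bytes], str],
    run_prompt: Callable[[str], str],
) -> str:
    """Stage 1: recognize text in the image. Stage 2: ask the model for a category."""
    text = recognize_text(image)
    return parse_category(run_prompt(build_expense_prompt(text)))
```

Parsing the answer against a fixed label set is a design choice in this sketch: small models can add punctuation or extra words, so the app normalizes the output instead of trusting it verbatim.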

Muntenescu’s practical claim was that, for many app-level AI features, the prompt API is the place to start. It abstracts the model behind a natural-language interface while staying on device when Gemini Nano is available.

LiteRT-LM sits on the other side of the tradeoff. Oli Gaymond later clarified that if the AICore-supported device reach is insufficient, and a developer wants to test and support a wider range of devices with their own models, LiteRT-LM can help. But that path moves more responsibility back to the developer. Android can provide tools for profiling and testing, but custom deployment requires the app team to do that work.

The biggest constraint is not the API; it is device reach

The main limitation of Gemini Nano through ML Kit GenAI is hardware availability. Florina Muntenescu said Gemini Nano models are available on Pixel 9 and Pixel 10 generation devices and on comparable devices from other OEMs. Not every Android phone can run the same local GenAI feature.

Hybrid inference is Google’s answer to that reach problem. The model runs on device when Gemini Nano is available; otherwise the request can fall back to the cloud. Muntenescu said hybrid inference through Firebase AI Logic had launched a couple of weeks before the discussion.

The hybrid architecture shown placed an Android device on one side and a Firebase backend on the other. Requests go through a Hybrid API. Gemini Nano handles the on-device path; the cloud path connects through Firebase to Gemini Flash or Pro. Muntenescu described the developer choice at that level: use Gemini Nano when it is available on the device, and use the cloud path when it is not. She also said this becomes especially useful as the next generation of Gemini Nano becomes available, because developers can aim for a similar experience between on-device Gemini Nano 4 and cloud Gemini Flash models.
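
The routing rule Muntenescu described (use Gemini Nano when it is available, use the cloud when it is not) can be sketched in a few lines. This is a platform-neutral illustration, not the Firebase AI Logic API: `hybrid_generate`, `on_device`, and `cloud` are hypothetical names, and availability is modeled simply as whether a local model callable exists.

```python
from typing import Callable, Optional, Tuple


def hybrid_generate(
    prompt: str,
    on_device: Optional[Callable[[str], str]],
    cloud: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (route, text), preferring the local model when one is available.

    `on_device` is None when the device has no local model; in that case the
    request falls back to the cloud callable.
    """
    if on_device is not None:
        return "on-device", on_device(prompt)
    return "cloud", cloud(prompt)


if __name__ == "__main__":
    local_model = lambda p: f"[local] {p}"
    cloud_model = lambda p: f"[cloud] {p}"
    print(hybrid_generate("Summarize my note", local_model, cloud_model))
    print(hybrid_generate("Summarize my note", None, cloud_model))
```

Returning the route alongside the text lets the calling code record which path served the request, which matters when the app wants to tell users whether their data left the device.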

Cloud inference is the third option, for cases where developers want access to more powerful hosted models such as Gemini Pro, Flash, or Flash-Lite. Muntenescu described Pro as the flagship model, Flash as the workhorse, and Flash-Lite as the cost-efficient model, following the labels used for the model variants. Firebase AI Logic can connect Firebase SDKs through Firebase servers to providers including Gemini API in Vertex AI and the Gemini Developer API.

Oli Gaymond emphasized that Android is trying to make these paths consistent enough for developers to blend them: local inference for low latency and privacy, cloud inference for the most powerful models, and hybrid inference for the middle ground.

AICore centralizes the battery and scheduling problem

A question about RAM, battery, and the practical consequences of running Gemini Nano or LiteRT on a phone brought out the operational side of AICore.

Oli Gaymond said Google has shrunk the models as much as possible while trying to preserve capability, but the models still require flagship-class devices today. He did not deny battery impact: if an app runs inference nonstop, it will drain the battery quickly. His distinction was between interactive use and batch use.

The common pattern Google sees today, Gaymond said, is a user entering a flow, asking a question, manipulating data, and receiving an answer at that moment. If that happens 10 or 20 times per day, he said, it is “really not concerning” for battery life.

Batch workloads are different. If an app has many items to process and the work is not latency-sensitive, Gaymond said developers often run it in the background, on charge, overnight. That lets processing continue until the job is complete without degrading the active user experience.
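
The "in the background, on charge, overnight" pattern amounts to a gate in front of the batch loop. The sketch below models it with a hypothetical `DeviceState` and a `get_state` callback; these names are assumptions for illustration. On Android, this kind of condition is usually expressed declaratively, for example as WorkManager constraints, rather than with hand-written checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DeviceState:
    is_charging: bool
    is_idle: bool  # e.g. screen off and no active user session


def should_run_batch(state: DeviceState, pending_items: int) -> bool:
    """Run background inference only when there is work and the device is charging and idle."""
    return pending_items > 0 and state.is_charging and state.is_idle


def process_batch(
    items: List[str],
    summarize: Callable[[str], str],
    get_state: Callable[[], DeviceState],
) -> Tuple[List[str], List[str]]:
    """Process items until done or until the device leaves the charging-and-idle state.

    Returns (results, remaining) so unfinished work can resume later.
    """
    results: List[str] = []
    remaining = list(items)
    while remaining and should_run_batch(get_state(), len(remaining)):
        results.append(summarize(remaining.pop(0)))
    return results, remaining
```

Re-checking the state before every item means the job pauses as soon as the phone is unplugged or picked up, and the returned `remaining` list is what the next overnight run would continue from.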

The shared-model design also answers a scale question: what happens when many apps all use Gemini Nano? Gaymond said this was “exactly” one of the reasons AICore is implemented at the system level. Centralizing the model solves the storage and deployment problem, but it also gives Android a single management layer for queues, battery attribution, and competing workloads.

From the developer’s perspective, Florina Muntenescu summarized the desired experience bluntly: “you just do your inference,” and AICore does the heavy lifting. Gaymond added that Android handles scheduling and queues requests appropriately.

The prioritization rule he described is simple. If an app is in the background, its requests may be queued. If it is in the foreground, it receives top priority because it is the app the user is actively using. “If you’re in the foreground, of course, you’re going to have top priority,” Gaymond said. “Whichever app is currently being used actively by the user is obviously going to be prioritized by the system.”

That does not mean every request is instantaneous. It means the system distinguishes between active, user-facing inference and background work. The developer does not schedule the model directly; the platform arbitrates access.
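
A toy model makes the arbitration rule concrete. The talk did not describe how AICore's scheduler is implemented, so the sketch below is only an assumption-laden illustration of "foreground first, background queued": a priority queue where foreground requests always drain before background ones and ties are served in arrival order. `InferenceQueue` and its methods are hypothetical names.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional

FOREGROUND = 0  # lower number is served first
BACKGROUND = 1


@dataclass(order=True)
class Request:
    priority: int
    seq: int  # arrival order, used to break ties within a priority level
    app: str = field(compare=False)
    prompt: str = field(compare=False)


class InferenceQueue:
    """Toy arbiter: one shared model, many apps, foreground requests first."""

    def __init__(self) -> None:
        self._heap: List[Request] = []
        self._counter = itertools.count()

    def submit(self, app: str, prompt: str, in_foreground: bool) -> None:
        priority = FOREGROUND if in_foreground else BACKGROUND
        heapq.heappush(self._heap, Request(priority, next(self._counter), app, prompt))

    def next_request(self) -> Optional[Request]:
        """Pop the highest-priority request, or None if the queue is empty."""
        return heapq.heappop(self._heap) if self._heap else None
```

In this model a background app never blocks the app the user is looking at, but its requests are not dropped either; they wait their turn, which matches the "may be queued" wording above.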

Gaymond compared the user-facing tradeoff to resources such as GPS or Wi-Fi. If users feel they are getting value from an app, they are often willing to spend battery on it. If an app consumes resources without providing clear value, users may not want that behavior. Android’s role, in his framing, is to make the capability available, make sensible usage easy, and attribute resource impact.

The Gemini app is separate from the developer APIs

A default Android assistant-style interaction, such as “Hey Google, what’s the temperature outside?”, is not necessarily using the developer-facing on-device APIs described here. Florina Muntenescu said she was not sure whether that particular interaction ran locally or remotely, but she expected it to run remotely, depending on whether the user was interacting with the Gemini app or with the search app integrated with Gemini APIs.

That uncertainty led to an important separation: the APIs discussed are not a way for developers to modify the Gemini app or install local “skills” into Google’s assistant experience. Muntenescu said those are different user journeys. A developer building their own app does not interact with the Gemini app; the app interacts with the model through the provided APIs.

When an audience member asked whether they could install the local capabilities and write skills to improve the answers, Oli Gaymond reframed “skills” as application logic and prompt composition. A skill, in this sense, is information or instructions included alongside the user’s request. The Android APIs are lower-level building blocks that allow an app to construct those prompts and run them through Gemini Nano or another configured path.

His example was an app such as Pocket Casts: the app would take the relevant “skills,” compose them into a prompt, and submit that through the API. Google’s focus in this part of the stack is building the lower layers so developers can build those higher-level experiences.
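
Gaymond's framing of a skill as instructions placed alongside the user's request can be sketched as plain string composition. The `Skill` type, the bracketed section layout, and the podcast-themed example text are all hypothetical; the talk did not specify a prompt format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Skill:
    name: str
    instructions: str


def compose_prompt(skills: List[Skill], user_request: str) -> str:
    """Place each skill's instructions before the user's request in a single prompt."""
    sections = [f"[{s.name}]\n{s.instructions.strip()}" for s in skills]
    sections.append(f"[User request]\n{user_request.strip()}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    skills = [
        Skill("Episode lookup", "Only mention episodes from the user's subscriptions."),
        Skill("Tone", "Answer in two sentences or fewer."),
    ]
    print(compose_prompt(skills, "What should I listen to on my commute?"))
```

The resulting string is what the app would hand to whichever inference path it uses; the skill logic itself lives entirely in the app.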

This distinction matters because it sets the boundary of AICore. It is not presented as an end-user customization layer for Google’s own assistant. It is an infrastructure layer for app developers.

RAG-like patterns are possible, but embeddings are not yet in the API

Gemini Nano itself is contained by AICore. Oli Gaymond said developers access it through APIs; they do not configure or manage the model directly.

For RAG-like solutions, Gaymond said the prompt API can already be used. An app can gather information it has access to, place it into the prompt, and run inference the same way it would with a server-based model. If the app has access to files, photos, or other user-granted local data, it can pass that content into the API as input.
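
A minimal sketch of that pattern, assuming the app already holds some local notes as strings: pick the notes that share words with the question, fit them into a character budget, and place them in the prompt. Keyword overlap is used here as a stand-in for retrieval, and the function names and the 500-character budget are hypothetical.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase word set used for naive keyword matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_context(question: str, documents: List[str], max_chars: int = 500) -> List[str]:
    """Pick documents sharing words with the question, best match first, within a size budget."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    chosen: List[str] = []
    used = 0
    for doc in ranked:
        if not (q & tokenize(doc)):
            break  # sorted by overlap, so everything after this has no overlap either
        if used + len(doc) > max_chars:
            continue  # too large for the remaining budget; try smaller documents
        chosen.append(doc)
        used += len(doc)
    return chosen


def build_rag_prompt(question: str, documents: List[str], max_chars: int = 500) -> str:
    """Place the selected notes ahead of the question in a single prompt string."""
    notes = "\n".join(f"- {d}" for d in select_context(question, documents, max_chars))
    return f"Use only the notes below to answer.\n\nNotes:\n{notes}\n\nQuestion: {question}"
```

The budget matters because the article describes the local model as suited to short-context work, so the app has to decide what to leave out before it calls the model.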

Because the current prompt API supports text and image input, Florina Muntenescu said an app could pass images into it. In response to an audience member asking about summarizing many photos taken during the day into a note, Gaymond clarified that the app can use whatever files it normally has access to and feed that information into the model.

Embeddings are the missing piece for more structured local retrieval workflows. One audience member asked whether a vectorizing embedding model was available for notes, similarity search, and related use cases. Gaymond said there was not yet an embedding API, but one was coming soon. He specifically referenced making something like the Gemma embedding model available directly from the API.
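
For readers unfamiliar with what an embedding API would enable, the sketch below shows the similarity-search step that follows once notes have been turned into vectors. The three-dimensional vectors are hand-written toy values, since no on-device embedding API was available at the time described; a real embedding model would produce much longer vectors.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy vectors standing in for embedded notes.
    notes = {
        "recipe": [1.0, 0.0, 0.0],
        "trip": [0.0, 1.0, 0.0],
        "dinner": [0.9, 0.1, 0.0],
    }
    print(top_k([1.0, 0.0, 0.0], notes))
```

A linear scan like this is adequate for a few thousand notes on a phone; larger collections would typically need an approximate nearest-neighbor index.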

AI Edge Gallery, in Gaymond’s explanation, occupies a different role. It is a showcase for what is possible, including experiences backed by AICore or by custom models. He described it as a way to show the frontier of on-device AI, but said it requires more work from the developer. AICore, by contrast, is meant to make production app development easier by hiding setup and letting developers focus on prompts and features. Muntenescu added that AICore is aimed at building “production level apps.”

Classic ML Kit reaches far more devices than GenAI

Device diversity and model diversity are separate problems. OCR, vision, LLM prompting, embeddings, and custom models have different reach and support expectations.

Oli Gaymond separated the older ML Kit surface from the new GenAI one. Classical ML Kit models for OCR, vision, and related tasks are much smaller and can run across a very large range of devices. He said they can run on “billion plus devices, no problem.”

The GenAI APIs are different. They currently require flagship-class devices from the last couple of years. AICore packages the supported path so that, in Gaymond’s words, developers can call the API and, if it is available, know “it’s going to run well.”

For developers who need broader reach than AICore provides, Gaymond again pointed to LiteRT-LM. That path can let teams target a wider range of devices, but they inherit the testing burden. Android may provide tools to make the testing easier, but AICore is the path where Google covers that work.

Path	What it gives developers	Main constraint
ML Kit classic APIs	Vision, OCR, translation, smart reply, entity extraction, and other smaller models	Not the same as running Gemini Nano-style GenAI
ML Kit GenAI through AICore	Gemini Nano with system-managed deployment, scheduling, optimization, and privacy isolation	Requires recent flagship-class devices
Hybrid inference with Firebase AI Logic	On-device inference when available, cloud fallback when not	Fallback path relies on Firebase and cloud providers rather than purely local processing
LiteRT-LM	Custom model deployment and wider experimentation	Developers must handle more profiling, testing, and optimization work

How Muntenescu and Gaymond differentiated the Android AI deployment options
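
One way to read the table is as a decision procedure. The function below encodes one possible reading, with hypothetical parameter names; the speakers did not present the choice as a fixed algorithm, so the ordering of the checks is an assumption.

```python
def choose_path(
    needs_generative: bool,
    needs_custom_model: bool,
    device_supports_nano: bool,
    cloud_allowed: bool,
) -> str:
    """Map an app's requirements onto one of the four paths in the table above."""
    if not needs_generative:
        # OCR, vision, translation and similar tasks run on the broad device base.
        return "ML Kit classic APIs"
    if needs_custom_model:
        return "LiteRT-LM"
    if cloud_allowed:
        # Runs on device where Gemini Nano exists, falls back to cloud elsewhere.
        return "Hybrid inference with Firebase AI Logic"
    if device_supports_nano:
        return "ML Kit GenAI through AICore"
    # Local-only requirement on a device without Gemini Nano: bring your own model.
    return "LiteRT-LM"
```

The last branch reflects Gaymond's point that LiteRT-LM is the option when AICore's device reach is insufficient, at the cost of the app team doing its own profiling and testing.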

The result is a layered strategy rather than a single answer. Android can already run many smaller ML features on a very broad device base. Gemini Nano brings local generative AI to a narrower set of capable phones. Hybrid inference expands reach by falling back to cloud. LiteRT-LM is available when a developer wants more control and is prepared to carry more responsibility.
