Gemini Omni Flash Replaces Veo as Google’s Default Video Model

ElevenLabs | Thursday, May 21, 2026 | 6 min read

ElevenLabs’ breakdown of Google’s I/O 2026 launch presents Gemini Omni as a major reset of Google’s AI video stack, with Omni Flash already replacing Veo as the default video model in the Gemini app. The source argues that the significance is not just better text-to-video generation, but a shift toward multimodal, conversational video creation: users can combine text, images, audio, video, and reference photos, then revise clips through successive instructions while preserving characters and scenes.

Omni Flash replaces Veo as Google’s default Gemini video model

Google’s Gemini Omni is described as a new family of Google DeepMind models built around a single premise: one model that can create from text, images, audio, and video, beginning with video generation. The first released version is Gemini Omni Flash, which is available now and has already replaced Veo as the default video model inside the Gemini app.

That default-model shift is the most immediate product consequence of the launch. Omni is not presented as a side experiment adjacent to Google’s video stack; it is now the video model Gemini users encounter by default. Omni Flash is also rolling out across the Gemini app, Google Flow, and YouTube Shorts. Free access is described as coming to YouTube Shorts, while the accompanying description names YouTube Create and says access arrives later in the week.

Google’s shorthand for the product is “Nano Banana for video.” The comparison is meant to make Omni legible as an approachable creative model, but the technical claim is broader: a single system that can accept multiple media types as inputs rather than relying on text prompting alone. It is compared at a high level to Seedance 2.0 because both work with text, image, audio, and video inputs.

The Gemini interface shown in the source supports video-based editing prompts such as: “Edit the video to place me on clouds.” Omni is being presented less as a conventional text-to-video generator and more as a multimodal video system: a user can bring in existing media, add instructions, and generate or revise a finished clip from that combined context.

Reference photos and conversational edits are aimed at consistency

One of the more concrete capabilities is support for up to five reference photos. The stated purpose is consistency: a user can attach multiple character references, object references, and a location reference, then generate a video that keeps those elements aligned.

That same consistency claim carries into editing. Gemini Omni is presented not only as a generator but as a system for iterative video modification. A user gives an instruction, Omni makes the change, and the next instruction builds on the previous state. The intended result is that characters remain consistent and the scene remembers what came before.

A violinist example illustrates the workflow. The initial prompt is “A video of a violinist playing a song.” A second instruction says, “Transport the violinist to the image environment.” A third says, “Make the violin invisible.” The example shows the ability to layer edits while preserving the subject and scene across turns.
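The turn-by-turn workflow can be pictured with a small local sketch. It only models the bookkeeping described above, where each instruction is added to the context the next edit builds on, plus the five-photo reference limit. It is not Google's API: the `EditSession` class, its methods, and the file name are hypothetical names introduced for illustration, and no model is called.

```python
from dataclasses import dataclass, field

MAX_REFERENCE_PHOTOS = 5  # limit stated in the article


@dataclass
class EditSession:
    """Hypothetical local model of a conversational edit; not a Google API."""

    reference_photos: list[str] = field(default_factory=list)
    instructions: list[str] = field(default_factory=list)

    def add_reference(self, path: str) -> None:
        # Reject a sixth reference photo, mirroring the stated limit.
        if len(self.reference_photos) >= MAX_REFERENCE_PHOTOS:
            raise ValueError("at most five reference photos are supported")
        self.reference_photos.append(path)

    def instruct(self, text: str) -> list[str]:
        # Record the turn and return the full history the next edit builds on.
        self.instructions.append(text)
        return list(self.instructions)


session = EditSession()
session.add_reference("environment.jpg")  # hypothetical file name
session.instruct("A video of a violinist playing a song.")
session.instruct("Transport the violinist to the image environment.")
context = session.instruct("Make the violin invisible.")
print(len(context))  # 3
```

The point of the sketch is that the third instruction is interpreted against the two before it, which is what lets the subject and scene persist across turns.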

With Omni you edit video through conversation: you give it an instruction, it makes the change, and every instruction builds on the last one.

Omni is also described as able to modify real footage filmed by the user. The claim is specific: a user can ask the model to change what is happening inside a video without changing the surroundings. In practical terms, Google is pushing an instruction-based editing interface rather than a traditional timeline workflow.

The physics claim is central to the release

Gemini Omni is described as being grounded in improved Gemini world knowledge, especially around physical behavior. Google is said to have improved the model’s understanding of gravity, momentum, and fluid dynamics, with the goal of reducing strange hallucinations and obviously bad physics in generated video.

The marble demo is the clearest example of that claim. Google footage shows a glass marble rolling through a complex wooden Rube Goldberg-style track. The emphasis is on believable motion and native sound effects for each bounce. The physics and the audio are treated as linked parts of the generation rather than as separate post-production layers.

A second Google-attributed demo shows a claymation-style explainer of protein folding generated from a single short prompt. The animation depicts colorful clay-like balls folding into a spiral structure, with visible labels including “AMINO ACIDS” and “ALPHA HELIX.” The example broadens the use case beyond cinematic or social-media clips: Omni is also being shown for explanatory video.

The demos support Google’s claim that Omni can make scenes hold together more realistically. Physical coherence is a front-and-center product claim for generated video, not a secondary detail.

Native audio and avatars move Omni beyond silent clips

Omni generates audio natively. In the marble example, that means sound effects are produced for the bounces as part of the generated result rather than added afterward as a separate track. Native audio is part of the broader Omni claim: the output is not merely moving images, but video with synchronized sound.

Google has also introduced AI avatars alongside Omni. These are described as reusable digital versions of yourself that can appear in videos you create. The release treats avatars as part of a repeatable creation workflow, where a user can keep bringing the same digital self into different generated scenes.

Omni is therefore being positioned as a video creation and editing system rather than a narrow clip generator. It accepts media references, changes existing footage, produces audio, and can place a reusable avatar into generated scenes.

The first release is live, but constrained

Gemini Omni Flash is available now to users on a Google AI Plus plan or higher. It has replaced Veo as the default video model inside the Gemini app, and it is rolling out across the Gemini app, Google Flow, and YouTube Shorts. Free access is coming to YouTube Shorts, with YouTube Create specified in the accompanying release description.

The current clip length cap is 10 seconds. Google characterizes that as a product decision rather than a model limit, and longer videos are said to be on the way. Every video generated with Omni also receives a SynthID watermark. The on-screen explanation states: “SynthID uses invisible watermarks to help identify AI-generated content.”

Two pieces remain pending for production workflows. The developer API is expected within the next few weeks, which ElevenLabs says would allow Omni to be integrated into ElevenCreative once available. API access is not part of the release described here.

Google has also teased a more powerful model called Omni Pro. No release timing or performance detail is given in the narration beyond that tease, but the displayed article excerpt adds that “the more professional use cases might be better served by the Omni Pro model,” which “should perform better across all Omni tasks.” The same excerpt says Google has not announced a release date for Pro and quotes Brichtova saying it will arrive when “we feel like we’re at a point where we have a step change above Flash.”
