Gemini Becomes the Prompt Engineer for Google’s Gen Media Stack

Paige Bailey · AI Engineer · Monday, May 18, 2026 · 19 min read

Google DeepMind developer advocate Guillaume Vernade demonstrates a gen-media workflow built around Gemini as the orchestrator rather than as a one-shot generator. Using The Wind in the Willows, he shows Gemini reading the full book, producing structured prompts and scripts, and handing them to Nano Banana, Veo, Lyria and TTS models for images, video, music and narration. His broader case is that multimodal production depends less on a single model than on schemas, reference assets, state management, cost controls and prompt handoffs between specialist systems.

The workflow depends on intermediate artifacts, not one-shot generation

Guillaume Vernade showed a gen media workflow whose central mechanism was not one model producing one asset. The pattern was model-to-model prompting: Gemini reads a long source, produces structured intermediate artifacts, keeps or retrieves context, and hands specialized requests to image, video, music, and speech models.

The live build used Kenneth Grahame’s The Wind in the Willows, downloaded from Project Gutenberg, as the source material. Gemini received the full book through the File API, then generated prompts for character portraits, chapter illustrations, video animation, chapter-specific music, and dialogue narration. Nano Banana generated the images. Veo animated chapter images into video clips. Lyria generated music. Gemini’s TTS model read extracted dialogue.

The important implementation choice was that Gemini was not merely asked to summarize the book. It was asked to produce machine-usable artifacts: JSON objects with names and prompts, later extended to include lists of characters appearing in each chapter. Vernade defined a simple Pydantic schema with two fields, name and prompt, and configured Gemini to return JSON matching that schema. That let the notebook treat Gemini’s output as an input contract for downstream generation calls.
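The input-contract idea can be sketched without the SDK. The block below is a minimal illustration, not the notebook's code: the talk used a Pydantic schema, while this sketch swaps in standard-library dataclasses so it runs with no dependencies. The names PromptItem and parse_prompt_items and the sample JSON are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptItem:
    """One prompt object: a display name plus the text sent to a generator."""
    name: str
    prompt: str


def parse_prompt_items(raw_json: str) -> list[PromptItem]:
    """Parse a JSON array of {"name", "prompt"} objects and reject anything else."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of prompt objects")
    items = []
    for index, entry in enumerate(data):
        if not isinstance(entry, dict) or set(entry) != {"name", "prompt"}:
            raise ValueError(f"item {index} must have exactly 'name' and 'prompt'")
        for key in ("name", "prompt"):
            if not isinstance(entry[key], str) or not entry[key].strip():
                raise ValueError(f"item {index} has an empty or non-string '{key}'")
        items.append(PromptItem(name=entry["name"], prompt=entry["prompt"]))
    return items


# Hypothetical model output, shortened for illustration.
sample_output = """[
  {"name": "Mole", "prompt": "A small, gentle mole with black fur, in a colorful building-block style."},
  {"name": "Mr. Toad", "prompt": "A short, stout toad in driving clothes, in a colorful building-block style."}
]"""

characters = parse_prompt_items(sample_output)
for item in characters:
    print(f"{item.name}: {item.prompt}")
```

Validating at this boundary means a malformed response fails before any paid image call is made, which is the practical value of treating the JSON as a contract.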

The same idea repeated across the stack. For character generation, Gemini produced descriptions of Mole, Water Rat, Mr. Toad, Mr. Badger, and others. For chapter illustration, it produced one prompt per chapter. For video, it wrote a new prompt describing what should happen in the next few seconds after the still image. For music, it wrote Lyria prompts that described instrumentation, mood, tempo, and chapter-specific themes. For TTS, it extracted a scene from the book and reformatted it as a play script, including parenthetical style instructions for each character.

The source screens made the chain visible. Colab showed Gemini returning JSON arrays of character prompts; Nano Banana then rendered the Mole, Water Rat, Mr. Toad, and Mr. Badger in a colorful building-block style. Later outputs showed chapter-level JSON, including chapter names, prompts, and character lists, followed by generated chapter images. The implementation pattern was source text to schema-constrained prompt objects, prompt objects to generated media, and generated media back into later prompts as context or reference.

Vernade said the pattern works partly because the gen media teams use Gemini in creating training data for those models. He described multiple model teams inside DeepMind working together, and said a “big part” of the training data for gen media models is made with Gemini’s help. He then put the point more colloquially: the models have been trained to listen to Gemini-written prompts.

All of our gen media models are trained with, like, prompts that are written by Gemini. So that's also why Gemini is quite good at creating the prompts for the gen media models.

Guillaume Vernade

He also noted that prompt rewriting happens internally before generation in many cases. A one-line prompt often does not give the model enough direction to produce an interesting result, so longer prompts tend to work better and tend to be followed more closely. In the notebook, Gemini’s role was to produce those longer prompts on behalf of the developer.

Specialist models remain the product surface, even if the destination is multimodal

Guillaume Vernade framed Google DeepMind’s gen media work as part of a larger attempt to build world models: systems that can ingest and output many modalities, including text, image, audio, video, speech, code, sensors, and, eventually, more. A slide labeled “Gemini inputs and outputs” showed system instructions, text, code, images, audio, video, speech, and API information flowing into Gemini, and text, image, audio, video, function calls, Search, MCP, and Maps flowing out.

The practical product surface is still separated into model families: Gemini for general multimodal reasoning and generation, Nano Banana for images, Veo for video, Lyria for music, Gemini Robotics, agents such as Mariner and Jules, open models such as Gemma, and research-focused models such as AlphaFold, AlphaGeometry, and WeatherNext.

The separation is partly a release-management decision. Vernade said the deeper goal is one model that encompasses the modalities, but it is easier and safer to release specialized models than to update a single main model and risk breaking other capabilities at the same time.

His account of Gemini’s multimodal history was deliberately informal. He said Gemini 1.0, or “Gemini 1.1” in his telling, had been meant to be multimodal because Google’s models had “always been multimodal,” but image understanding was removed before release, for reasons he did not claim to know. Gemini 1.5 then restored multimodal input. Even then, residual training from the previous model sometimes caused it to respond to images by saying it was “just an LLM” and could not handle them. By Gemini 2.0, that problem had been addressed. The time horizon was short: what felt novel a year and a half earlier now felt like a minimum expectation.

The release cadence is fast enough to shape developer experience. Vernade said DeepMind ships something, on average, every five days across the organization, and gen media models have shipped more often than once a month. The visible release timeline included Imagen, Veo, Genie, Nano Banana, Lyria, Chirp, and Gemini TTS and live audio releases across 2024, 2025, and early 2026. That cadence is why, Vernade said, much of his work as a developer advocate is documentation, code samples, demos, prompt guides, and feedback to internal teams about what developers can actually use.

He was explicit that this feedback role sometimes means arguing for product simplification. His example was the old split between Imagen and Nano Banana APIs. In his view, a normal developer should be able to swap a model name and keep the rest of the API behavior consistent. He said he fought for that and “never managed to win.” He then joked that because the Imagen brand, in his characterization, no longer exists, he “kind of” won by default.

Cost controls are part of the implementation

The notebook was meant to be runnable, but Guillaume Vernade warned that it used paid gen media models. Running the full notebook would cost about $1, with video generation making up most of that cost. The notebook included opt-in checkboxes for paid sections such as Veo, Lyria, and TTS, and limited the number of generated character and chapter images for demonstration and billing reasons.

The setup used the Google Gen AI SDK, an API key from AI Studio, and model identifiers selected in the notebook. Vernade selected Nano Banana 2 through gemini-3.1-flash-image-preview, Gemini 3 Flash for text orchestration, lyria-3-clip-preview for 30-second music clips, and a Gemini 2.5 Pro preview TTS model. He said users without a paid API key could still use the image generation examples by swapping to the earlier Nano Banana model with a free tier.

The code also used retry options on the SDK client. Vernade said Nano Banana 2 could be overloaded, especially “in the evening when the US wakes up,” so the notebook configured automatic retries for errors such as 429 and 5xx responses. For the live run, he also used a newly released service_tier='priority' option. He cautioned attendees not to copy that line casually: priority costs twice as much but gives a fast track. The other tiers he described were normal pricing and “flex,” which gives a 50% discount in exchange for allowing requests to be delayed, potentially by minutes. He was unsure whether priority applied to Veo, the most expensive model in the notebook, or to Lyria.
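The retry behavior can be illustrated independently of the SDK. The following is a hand-rolled sketch of retrying on 429 and 5xx with exponential backoff, not the SDK's built-in retry option used in the notebook. TransientAPIError, call_with_retries, and the delay values are assumptions chosen for illustration.

```python
import random
import time


class TransientAPIError(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    def __init__(self, status: int):
        super().__init__(f"API error {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    """Retry on rate limiting (429) and on server-side errors (5xx)."""
    return status == 429 or 500 <= status <= 599


def call_with_retries(call, attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run call(); on a retryable error, wait with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except TransientAPIError as error:
            # Give up on non-retryable errors or when the budget is spent.
            if not is_retryable(error.status) or attempt == attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))


# Usage with a stand-in call that fails twice with 429 before succeeding.
attempt_log = []


def flaky_generate():
    attempt_log.append("try")
    if len(attempt_log) < 3:
        raise TransientAPIError(429)
    return "image-bytes"


result = call_with_retries(flaky_generate, sleep=lambda seconds: None)
print(result, "after", len(attempt_log), "attempts")
```

Passing the sleep function as a parameter keeps the sketch testable without real waiting; a production client would more likely rely on the SDK's own retry configuration.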

2×: the price multiplier Vernade cited for the priority service tier.

Vernade also paused to explain the difference between Google’s AI product surfaces. Consumer apps such as Gemini and NotebookLM are easy to use but give developers less control over model choice, parameters, and features. Vertex AI is the enterprise end: more control over data centers, buckets, access control, and security, but more setup burden. The Gemini Developer API and AI Studio sit between those poles: API-key access, easier prototyping, and a unified SDK that can move between Developer API and Vertex. The tradeoff is that API keys carry leakage risk if not managed carefully.

The File API mattered because the build used the Gemini Developer API rather than Vertex. On Vertex, developers would normally think about buckets and access control. The Gemini API hides that complexity by letting the developer upload a file and then pass it to the model as accessible context. In this case, that file was the full text of The Wind in the Willows.

Long context makes the book usable, but state management is still changing

The first Gemini chat received the uploaded book and an instruction not to respond yet because further instructions would follow. Guillaume Vernade used chat mode because it keeps history across turns. That was useful for a book-illustration workflow: the model needed to remember the source text, the selected visual style, the character descriptions it had already generated, and the instructions that should apply to later prompts.

The first explicit creative choice was style. Vernade said he often leaves the style blank and lets Gemini choose one appropriate to the book, but for this run he typed “a colorful building block style” because he was tired of seeing the same style. He also added system instructions functioning like a negative prompt: no text in the image, no cover-page look, no borders, no titles, no descriptions, no panels, family-friendly output, and uplifting colors. He had added these because early versions of the workflow often produced book covers or panel pages when he wanted single full illustrations.

Gemini then generated character prompts. The visible output described Mole as a small gentle mole with black fur and mud, Water Rat as a brown water rat with a grave round face, Mr. Toad as a short stout character in driving clothes, and Mr. Badger as a large imposing grey badger in a dressing gown and slippers. Those prompt objects were passed to a separate image-generation chat using Nano Banana, configured for image output and a 9:16 aspect ratio.

Vernade intentionally kept the image model in chat mode too, so it would remember prior images and maintain stylistic and character consistency. The generated outputs showed the characters in the chosen building-block style: Mole in a meadow by an underground home, Water Rat in a blue-and-white boat, Mr. Toad in motoring clothes beside a car, and Mr. Badger in a brick underground hallway. Vernade noted that Toad appeared much bigger than the car and that the implementation was not optimized; in a production version he would save artifacts in a more structured way and make other improvements.

For chapter illustrations, Gemini was asked to produce one prompt per chapter, to make each a single image rather than a multi-paneled page, to be descriptive about the characters, and to list characters who appeared. Nano Banana then generated chapter images using either chat history or explicit reference images. The first method relied on image chat history: since the model had already generated the character portraits, it should “remember” how they looked. The second method used a more production-like structured output: each chapter object included a list of character names, and the code looked up only those character images and passed them into a stateless generate_content call as references.
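The second, reference-based method amounts to a lookup step before each stateless call. The sketch below shows one way to do that lookup; it is an assumption about structure, not the notebook's code. ChapterPrompt, build_request_parts, and the file names are hypothetical, and the actual generate_content call is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ChapterPrompt:
    """Chapter object as described: a name, a prompt, and the characters present."""
    name: str
    prompt: str
    characters: list[str] = field(default_factory=list)


def select_references(chapter, portraits):
    """Split the chapter's characters into found portrait references and missing names."""
    found = [portraits[name] for name in chapter.characters if name in portraits]
    missing = [name for name in chapter.characters if name not in portraits]
    return found, missing


def build_request_parts(chapter, portraits):
    """Build the contents list for one stateless call: the prompt, then only relevant references."""
    references, missing = select_references(chapter, portraits)
    if missing:
        raise ValueError(f"no portrait stored for: {', '.join(missing)}")
    return [chapter.prompt, *references]


# Hypothetical stored portraits, keyed by character name.
portraits = {
    "Mole": "mole.png",
    "Water Rat": "water_rat.png",
    "Mr. Toad": "mr_toad.png",
}
chapter = ChapterPrompt(
    name="The Wild Wood",
    prompt="Water Rat finds Mole in the snowy wood.",
    characters=["Mole", "Water Rat"],
)
parts = build_request_parts(chapter, portraits)
print(parts)
```

Failing loudly on a missing portrait is a design choice here: it avoids paying for a scene in which a character is rendered with no reference at all.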

Stage	Gemini’s intermediate artifact	Downstream use
Character portraits	JSON objects with character names and visual prompts	Nano Banana generated individual character images
Chapter illustrations	Chapter prompts, later extended with character lists	Nano Banana generated scene images using chat history or selected reference images
Video clips	A prompt describing the next few seconds after the still image	Veo animated the chapter image from its first frame
Chapter music	Lyria prompts with mood, instrumentation, tempo, and lyrics when requested	Lyria generated 30-second chapter-specific music
Dialogue narration	A play-script transcript with parenthetical style instructions	TTS read narrator and character lines with differentiated delivery

The Colab workflow used Gemini to produce structured prompts and scripts that other gen media models consumed.

That second method was more controlled. Vernade said if he were scaling the workflow, he would probably generate more than one reference image per character: a portrait, a full-body image, perhaps side and back views. Then he would ask the model to determine exactly how the character would appear in a given scene and pass the most relevant references to Nano Banana.

The run also exposed a limitation: a generated image showed Water Rat holding a pistol in a snowy forest, looking as if he might be attacking Mole. The prompt did not say that; it described rescuing him. Vernade used that mismatch to motivate a more specific video prompt later.

The workflow’s context strategy raised a performance and cost issue. Chat mode keeps history by sending the accumulated history back to the model on each call. In this case, that meant the full book could effectively be resent repeatedly. Vernade said Google had released a newer Interactions API that changes this by supporting stateful interactions in addition to stateless ones. Each call returns an interaction ID that can be reused so the server can recover context without the client reuploading or resending the same book on every turn. It also supports forking: for example, generating lyrics once, then branching one path into cover image generation and another into song generation.
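The difference between resending history and reusing a server-side ID can be shown with a toy model. This is not the Interactions API and makes no claim about its signatures. StatelessChat and StatefulServer are hypothetical classes that only count characters sent, and model replies are ignored for simplicity.

```python
import itertools


class StatelessChat:
    """Client-side history: every turn resends everything said so far."""

    def __init__(self):
        self.history = []
        self.chars_sent = 0

    def send(self, message):
        self.history.append(message)
        self.chars_sent += sum(len(m) for m in self.history)


class StatefulServer:
    """Toy server that keeps context by ID, so each turn sends only the new message."""

    def __init__(self):
        self._contexts = {}
        self._ids = itertools.count(1)
        self.chars_sent = 0

    def send(self, message, previous_id=None):
        # Start from the stored context of the parent turn, if any.
        parent = [] if previous_id is None else self._contexts[previous_id]
        new_id = next(self._ids)
        self._contexts[new_id] = [*parent, message]
        self.chars_sent += len(message)
        return new_id

    def context(self, interaction_id):
        return list(self._contexts[interaction_id])


book = "x" * 300_000  # stand-in for the uploaded book text
chat, server = StatelessChat(), StatefulServer()

chat.send(book)
chat.send("Write lyrics for chapter 1.")

root = server.send(book)
lyrics = server.send("Write lyrics for chapter 1.", previous_id=root)
# Forking: two branches reuse the same parent ID.
cover = server.send("Describe a cover image for these lyrics.", previous_id=lyrics)
song = server.send("Write a music prompt for these lyrics.", previous_id=lyrics)

print("stateless chars sent:", chat.chars_sent)
print("stateful chars sent: ", server.chars_sent)
```

In this toy, two stateless turns already send the book twice, while the stateful path sends it once across four turns and both branches share the same lyrics context.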

An attendee asked how long the session is stored. Vernade said he thought it was “two days,” but he was uncertain and noted the API was still in preview. He said there was a “good chance” it would become the default API at Google I/O, but he did not present that as settled. Because the API knows context will be reused, it can also cache that context automatically and make repeated calls cheaper, though he said the normal API also does some caching.

Veo needed a prompt for what happens after the first frame

The video step used Veo to animate one of the chapter illustrations. Guillaume Vernade selected the larger Veo 3.1 model for the run, though the notebook also offered fast and lite variants. He described Veo 3.1 as the main model; versions labeled “fast” or “lite” are smaller, run faster, and do fewer generation turns. Earlier he had said Veo 3.1 Lite cost about five cents per second, or about 40 cents for a video, making it useful for iterating on prompts before upscaling.

The initial approach sent Veo the same chapter prompt and the chapter image. The image served as the first frame of the generated video. Vernade said this is a natural fit because much of video generation depends on the first frame: in his words, “the most important part of generating a video is generating the first frame,” because the model needs to know where to start before deciding what to do.

The first video did animate the scene and generated dialogue. The on-screen dialogue included: “Quick, give me your hand! We must get out of this dreadful place,” and “Oh, thank you, Water Rat. I thought I was done for.” Vernade’s reaction was that it was “not that bad,” but the wrong character spoke. He attributed that to using the same prompt for image, video, and audio: the model did not have enough extra context about exactly what was supposed to happen next.

The improved version added one more Gemini step. Gemini received the still image and was asked to create a Veo prompt describing what happens in the next few seconds after that initial image. The generated prompt said Water Rat lowers his silver pistol, pats Mole reassuringly, Mole exhales a puff of white plastic vapor, and the two walk together through the snowy woods as glowing eyes in hollow blocky trees blink and vanish. Vernade pointed out two useful properties. First, Gemini inferred that Water Rat was rescuing Mole, not attacking him. Second, because the earlier instructions emphasized describing characters, the video prompt repeated character clothing and appearance, helping consistency.

The before-and-after mattered because the generated still had left the scene ambiguous. The first video reused an under-specified chapter prompt and produced misassigned speech. The second asked Gemini to describe motion from the current frame, and the resulting clip better matched Vernade’s intent, though it lacked dialogue. His caution was practical: video generation can get expensive quickly, so it is not a model to run a hundred times casually.
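The cost caution is easy to make concrete with the approximate Veo 3.1 Lite figure Vernade cited, about five cents per second. The sketch below is arithmetic only. The 8-second clip length is an assumption inferred from his "about 40 cents for a video" figure, and the function names are hypothetical. Amounts are kept in integer cents to avoid floating-point rounding.

```python
LITE_CENTS_PER_SECOND = 5  # approximate figure cited in the talk for Veo 3.1 Lite
CLIP_SECONDS = 8           # assumption: implied by "about 40 cents for a video"


def veo_lite_cost_cents(clips, seconds_per_clip=CLIP_SECONDS,
                        cents_per_second=LITE_CENTS_PER_SECOND):
    """Estimated cost in cents for a number of clips at a flat per-second price."""
    if clips < 0 or seconds_per_clip < 0:
        raise ValueError("clips and seconds_per_clip must be non-negative")
    return clips * seconds_per_clip * cents_per_second


def format_dollars(cents):
    """Render integer cents as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"


for clips in (1, 10, 100):
    print(f"{clips:>3} clips: {format_dollars(veo_lite_cost_cents(clips))}")
```

Under these assumptions, a hundred iterations on the cheapest variant come to about $40, and the larger model used in the live run would cost more.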

Lyria puts most control into the prompt

Lyria was introduced as a prompt-to-music generator with two main variants in the demo: a 30-second clip model and a full-song model that can create songs of up to three minutes. Guillaume Vernade said the clip model cost about four cents per song and the full-song model about twice that. For the book workflow, he used the clip model.

The pattern was again Gemini-to-generator. Gemini was asked to create instrumental Lyria prompts for each chapter, preserving some consistency across chapters while highlighting what was specific in each one. The visible prompts were musically specific. “The River Bank” became an orchestral pastoral suite with bubbling flute and oboe over harp and warm strings. “The Open Road” became a jaunty instrumental march with horse-trot percussion, brass fanfares for Toad’s ego, and a whimsical bassoon line. “The Wild Wood” became a dark orchestral piece with low strings, dissonant flute trills, and violin tremolos.

Vernade emphasized that Lyria, at least in the demonstrated API, is controlled almost entirely through prompt language rather than separate parameters. Duration, scale, BPM, structure, intro, outro, chorus, bridge, and lyrics can all be described in the prompt. For longer songs, the prompt can specify what happens in the first 30 seconds, what changes later, which part repeats, and how the chorus should behave.

He then modified the Gemini instruction to add lyrics describing each chapter. The generated prompts embedded lyrics directly inside the Lyria prompt, such as Mole flinging down his brush, running to the sun, meeting Water Rat by the river, and learning that a boat is “the only life true.” Vernade noted that the prompts became somewhat anchored to the earlier instrumental versions because chat mode remembered the previous prompts. That was “the price” of using chat history in this particular way.

The model output can include both lyrics and audio when configured with audio and text modalities. Vernade later showed that streaming is useful for Lyria: generate_content_stream can return the lyrics first and the music afterward. That lets an application create a title, image, or other artifact based on the lyrics while the audio is still generating. The lyrics output includes timing, which Vernade said could be used to build karaoke-style applications.
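The lyrics-first streaming pattern can be sketched with a simulated stream. The block does not call generate_content_stream: Chunk and consume_stream are hypothetical stand-ins, and the chunk layout (text parts first, then audio bytes) is an assumption based on the behavior described in the talk.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Chunk:
    """Hypothetical stream chunk carrying either text or audio bytes."""
    text: Optional[str] = None
    audio: Optional[bytes] = None


def consume_stream(chunks: Iterable[Chunk], on_lyrics: Callable[[str], None]):
    """Collect lyrics and audio, firing on_lyrics once, when the first audio chunk arrives."""
    lyrics_parts = []
    audio = bytearray()
    lyrics_reported = False
    for chunk in chunks:
        if chunk.text is not None:
            lyrics_parts.append(chunk.text)
        if chunk.audio is not None:
            if not lyrics_reported:
                # Lyrics are assumed complete here, so dependent work can start now.
                on_lyrics("".join(lyrics_parts))
                lyrics_reported = True
            audio.extend(chunk.audio)
    if not lyrics_reported:
        on_lyrics("".join(lyrics_parts))  # stream contained no audio
    return "".join(lyrics_parts), bytes(audio)


# Simulated stream: two text chunks followed by two audio chunks.
fake_stream = [
    Chunk(text="Mole flings "),
    Chunk(text="down his brush"),
    Chunk(audio=b"\x00\x01"),
    Chunk(audio=b"\x02"),
]
events = []
lyrics, audio = consume_stream(fake_stream, on_lyrics=lambda text: events.append(("lyrics", text)))
print(events, len(audio))
```

In an application, the on_lyrics callback is where a title or cover-image request could be kicked off while the remaining audio chunks are still arriving.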

His practical distinction between clip and full-song generation was about structure. The full-song model is better for complex prompts with timestamps, chords, intensity levels, mood changes, or longer musical form. The 30-second model is faster, but it takes shortcuts to make something interesting quickly.

The TTS trick separates speaker identity from speaking style

The speech example showed how two actual TTS voices can be made to sound like more than two characters. Guillaume Vernade started by asking Gemini to extract a specific dialogue from the first chapter, beginning with “Small, neat ears and thick silky hair” and ending with “his heels in the air.” Rather than simply read it, Gemini reformatted it as a play script with a narrator and character lines.

The key instruction was that each character line should include a specific speaking style in parentheses: tone, accent, speed, hesitation, emotion, or other delivery cues. The narrator used one TTS voice. All characters used another. But because each character line carried a different style instruction, the same underlying character voice could be steered into sounding like different speakers.

The visible script assigned the narrator plain narration. One character had “a lazy, deep-toned nautical drawl, punctuated by long poetic pauses.” Another had “short, breathless sentences with a frequent humble stutter.” The resulting audio read the scene with enough differentiation that Vernade said listeners could not guess the two characters used the same voice.

He also mentioned a practical gotcha: the TTS prompt must begin with something like “read this text.” He said he lost about 15 minutes the previous evening because if he only passed the text, the model did not reliably know it should read it aloud. The configuration then mapped the narrator to one voice, Sulafat, and the character role to another, Fenrir.
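The script format and the "read this text" prefix can be combined in a small prompt builder. This is a sketch under stated assumptions: the "Speaker: (style) text" layout, the Line class, and build_tts_prompt are hypothetical, and the dialogue lines are placeholders rather than text from the book. Only the two voice names and the two style descriptions come from the demo.

```python
from dataclasses import dataclass

# Voice names as shown in the demo: one for the narrator, one shared by all characters.
VOICES = {"Narrator": "Sulafat", "Character": "Fenrir"}
INSTRUCTION = "Read this text:"


@dataclass
class Line:
    speaker: str     # must be a key of VOICES
    text: str
    style: str = ""  # optional delivery cue, rendered in parentheses


def build_tts_prompt(lines, voices=VOICES, instruction=INSTRUCTION):
    """Render a play script, starting with an explicit instruction to read it aloud."""
    rendered = [instruction]
    for line in lines:
        if line.speaker not in voices:
            raise ValueError(f"no voice configured for speaker {line.speaker!r}")
        style = f"({line.style}) " if line.style else ""
        rendered.append(f"{line.speaker}: {style}{line.text}")
    return "\n".join(rendered)


script = [
    Line("Narrator", "Placeholder narration introducing the scene."),
    Line("Character", "Placeholder line for the first character.",
         style="a lazy, deep-toned nautical drawl, punctuated by long poetic pauses"),
    Line("Character", "Placeholder line for the second character.",
         style="short, breathless sentences with a frequent humble stutter"),
]
prompt = build_tts_prompt(script)
print(prompt)
```

Both character lines map to the same voice, so any difference between them in the audio would have to come from the parenthetical style cues alone.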

Vernade was clear that this was a demonstration trick, not how he would build a production audiobook pipeline. At scale, he would create a full transcript with actual character names, keep separate style prompts for each character, and adjust those prompts depending on the local scene, because a normally slow-speaking character may still need to sound excited in a particular moment. But the example showed that the TTS model can take fine-grained stylistic instructions and use them to create the impression of multiple voices from one base voice.

Lyria Realtime turns prompting into continuous control

Guillaume Vernade described Lyria Realtime as his favorite model, partly because he thinks it is underused. Unlike the prompt-to-clip model, it is, in his description, a live predictive model: it continues producing music until stopped, and new prompts can be sent while it is running. The model then mixes or transitions toward the new direction “like a DJ.” He said “real time” effectively means a delay of about two seconds, which is still interactive enough to play with.

In AI Studio, he showed a PromptDJ-style interface with circular nodes for styles such as Bossa Nova, Chillwave, Drum and Bass, Post Punk, Shoegaze, Funk, Chiptune, K-Pop, Neo Soul, Trip Hop, and other musical attributes. Changing the prompt weights altered the music while it played. He removed post punk because he did not know what it was, added more K-pop and drums, and then moved toward something more chill.

His application intuition came from his previous work in video games. A game could generate adaptive music based on where the player is, what biome they are in, whether they are jumping, cooking, or fighting, and how much HP they have. The music could shift in real time rather than relying only on preauthored loops.
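The game-music idea reduces to mapping game state onto weighted prompts, in the spirit of the prompt weights shown in the PromptDJ interface. The sketch below covers only that mapping and does not connect to Lyria Realtime. GameState, BIOME_STYLES, the style strings, and the weighting rule are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical style prompts per biome.
BIOME_STYLES = {
    "forest": "pastoral flute and harp",
    "cave": "dark ambient drones",
    "village": "jaunty folk march",
}


@dataclass
class GameState:
    biome: str
    in_combat: bool
    hp_fraction: float  # 0.0 (no health left) to 1.0 (full health)


def music_prompts(state: GameState) -> dict[str, float]:
    """Map game state to prompt weights that sum to 1."""
    hp = min(1.0, max(0.0, state.hp_fraction))
    raw = {BIOME_STYLES.get(state.biome, "calm ambient pad"): 1.0}
    if state.in_combat:
        raw["driving drums"] = 1.0
    if hp < 1.0:
        # Lower health pushes more weight toward a tense layer.
        raw["tense low strings"] = 1.0 - hp
    total = sum(raw.values())
    return {text: weight / total for text, weight in raw.items()}


calm = music_prompts(GameState(biome="forest", in_combat=False, hp_fraction=1.0))
fight = music_prompts(GameState(biome="cave", in_combat=True, hp_fraction=0.25))
print(calm)
print(fight)
```

A game loop could recompute these weights whenever the state changes and send the new set to the running music session, leaving the transition itself to the model.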

He also showed the “Space DJ” example he had mentioned earlier. It presents a 3D space where stars or nodes are labeled with genres. As the user moves through the space, nearby genres influence the music. Similar genres are positioned near each other in the embedding space; odd combinations can happen when unlike styles are close enough in the user’s path. Vernade joked about Christmas songs near Viking metal. The displayed Space DJ interface showed metal-related prompts with weights, including glam metal, stoner metal, nu metal, speed metal, heavy metal, melodic death metal, black metal, djent, groove metal, and technical death metal. He said the online demo session ends after 10 minutes so users do not run it indefinitely.

European access remains constrained by preview-model policy

The most substantive audience question concerned model availability in Europe. An attendee said their company offers models to employees but remains on Nano Banana 1 because they can only offer models hosted in Europe, while newer models are still in preview and unavailable to them. They asked whether that would change.

Guillaume Vernade’s short answer was “no,” but he said he had expected the question because it is painful for European developers. He described it as one of the fights he is currently fighting internally. The underlying issue, as he stated it, is a Google Cloud rule that every preview model is available only on global endpoints. He said that rule is unlikely to change.

The path he saw was not making preview models regionally hosted, but making models generally available faster. He said the problem with Nano Banana 2, Nano Banana Pro, and Gemini 1.5 Pro was that models were released too quickly back-to-back. Instead of taking Gemini 3 to general availability, for example, a Gemini 3.1 release resets the preview counter. That creates a lag for organizations that need European hosting because they cannot adopt preview models on global endpoints.

It's kind of my P0 thing that I want to change.

Guillaume Vernade

The answer acknowledged the constraint without resolving it: Europe’s data privacy and data sovereignty requirements matter, developers have a real access problem, and the current internal policy around preview models is the bottleneck. Vernade did not promise a date or a specific fix.

Messier source material needs more guardrails

An attendee said they had run the notebook with Frankenstein in a retro gaming style and found the generated characters looked very video game-like. Guillaume Vernade replied that books like Frankenstein present another practical difficulty: the model may not allow graphic content that occurs in the source. It may tone it down, or refuse to generate an image. That was one reason he settled on children’s books for the example. Even that had earlier been complicated by restrictions on generating images of children in Europe, though he said the current limitation is more specific: creating images from scratch with children is allowed, but editing images with children is not.

The successful path depended on several forms of guardrail and simplification. The book was public-domain and family-friendly. The notebook limited the number of characters and chapters. The image instructions ruled out text, panels, and cover-like compositions. The style was constrained. Video was demonstrated on a single chapter image. Paid sections were opt-in. For scale, Vernade pointed to sturdier versions of the same pattern: store generated assets properly, create multiple reference views per character, pass only relevant references per scene, use structured outputs with character lists, and use the newer Interactions API for stateful context rather than resending everything.

The live run also surfaced the workflow’s weak points. Mr. Toad’s scale was off. Water Rat’s pistol made a rescue scene look like an attack. The first Veo clip had the wrong character speaking. Lyria prompts became anchored by earlier chat context. TTS needed an explicit “read this text” instruction. Vernade treated those issues as implementation problems to reduce through orchestration: better schemas, better references, more specific prompts, and clearer state management.
