Orply.

Google’s GenAI Stack Turns Multimodal Prompts Into Application Pipelines

Paige BaileyGuillaume VernadeIan ValentineAI EngineerSaturday, May 23, 202623 min read

Google DeepMind’s Paige Bailey and Guillaume Vernade argue that Google’s generative AI stack is being organized as an application pipeline rather than a set of isolated models. In a three-hour workshop, Bailey showed AI Studio turning multimodal Gemini prompts into inspectable API calls and generated apps with auth and Firestore, while Vernade used Gemini, Nano Banana, Veo and Lyria to illustrate, animate and score The Wind in the Willows. Their case is that builders can now orchestrate prompt, code, media generation and deployment in one workflow, even as the demos exposed seams that still require engineering discipline.

Google’s developer stack turns prompts into inspectable application pipelines

Paige Bailey’s central claim was operational: AI Studio, Gemini, Google’s generative media models, and Gemma are being arranged so builders can move from a multimodal prompt to application code, then into tools, auth, storage, generated media, and deployable workflows with less manual translation between prototype and API.

The model catalog matters, but Bailey treated it as part of a broader developer workflow. Gemini was described as a natively multimodal system that can take video, images, audio, text, and code as inputs, and can return text, code, images, edited images, interleaved image-and-text responses, and audio tokens. AI Studio was the surface around that capability: a place to test prompts, attach media, turn on tools, compare models, inspect costs, and then export code that preserves the model name, configuration, inputs, offsets, prompts, and tool calls.

if you can get it working in AI Studio, you can get it working as part of your app, all you have to do is click the get code button.
Paige Bailey · Source

Bailey listed a fast-expanding set of releases: Gemini 3.1 Flash Live for real-time conversations, Gemini 3.1 Pro and Flash-Lite, Nano Banana 2 for image generation and editing, Embeddings 2.0 for shared embedding spaces across video, audio, images, code, and text, Lyria 3 for music generation, Genie 3 for world-model experiences, AI Studio’s full-stack runtime, Gemma 4 open models, and Veo 3.1 Lite for video generation. She described Gemini 3.1 Pro as the larger reasoning model and Flash-Lite as the much smaller, cost-effective model for high-volume work.

That pace is itself a developer problem. Guillaume Vernade said that across the generative media models alone, Google was releasing a new model or capability on average every five days. Including AI Studio changes, pricing changes, and smaller features, he said the rate felt more like two to three new features a week. His point was not simply that the catalog is large; it was that even people inside Google need structured walkthroughs to keep track of what exists.

Vernade gave the architectural reason for the separate model families. His definition of a world model is a system that can ingest as many modalities as possible, understand them in something like the way the senses give humans access to the world, and then output across modalities as well. In his account, that has been central to DeepMind’s view of generative AI. Google ships distinct models for images, video, music, speech, agents, open models, and Gemini because shipping and updating one model per modality is operationally easier than updating a single model that does everything.

That distinction matters for builders. Google’s stack is not one monolithic model that can do every creative or application step reliably end to end. It is orchestration: Gemini interpreting a book and producing structured prompts, Nano Banana producing character and chapter images, Veo animating an image into video, Lyria scoring a chapter, text-to-speech turning a rewritten passage into performed narration, and AI Studio Build wiring a web app around model calls, auth, and Firestore. Some handoffs were clean. Others exposed current seams.

Bailey’s opening exchange with an audience member framed the same issue from a product-strategy angle. She argued that when “everybody is sprinting to do the same thing,” it is often a sign that the thing is either the wrong place to build or likely to be absorbed into base-model capability. Her examples were vector databases built around small context windows, fine-tuned models for language support, agent frameworks, and MCP servers. In her view, many workarounds become less durable as models gain longer context, broader language ability, and more built-in agentic behavior.

An audience member pushed back that highly specific domains, such as mathematics, may not be absorbed by a general provider because Google will cater to general applications. Bailey’s counterexample was medical fine-tuning: early PaLM 2 and Gemini work required Med-PaLM and Med-PaLM-style fine tunes for medical use cases, she said, while users who previously needed those fine tunes now often use Gemini out of the box with retrieval or a custom prompt because the relevant data is incorporated into Gemini itself.

The disagreement did not vanish. The audience member raised reproducibility and stochasticity. Bailey answered that no large language model is deterministic, so the problem applies generally, but she conceded that much of the “magic” will come from having an opinionated view of a use case and working directly with customers to solve their problems. Her claim was not that application value disappears. It was that scaffolding built only around today’s model constraints may be short-lived.

AI Studio makes multimodal calls inspectable and portable

Bailey loaded a five-minute slice of a YouTube dinosaur video into AI Studio, selected Gemini 3.1 Flash-Lite Preview for speed and cost, turned on Google Search grounding, and asked for a table of dinosaur and prehistoric-creature sightings by timestamp with a fun fact for each. The video segment registered as 30,901 tokens.

30,901
tokens in the five-minute dinosaur video slice loaded into AI Studio

The model returned a table identifying a Brachiosaurus, Tyrannosaurus Rex, Pteranodon, raptor, Triceratops, and Carnotaurus, with first-seen timestamps and short facts. Bailey noted that the model was not merely reading YouTube metadata; behind the scenes, it was sent the video itself frame by frame for inference. The answer also included a caveat that Pteranodons are often grouped with dinosaurs in children’s media but are technically pterosaurs, a separate group of flying reptiles.

The important part was the export path. AI Studio generated a TypeScript snippet configured with the model name, the YouTube URL as a file URI, video metadata with an end offset of 300 seconds, and the exact prompt. Bailey’s takeaway was that a working playground interaction should not need to be reverse-engineered into an API call. The playground becomes a reproducible configuration.

She then used compare mode for a tool-heavy image task. With code execution enabled for both Gemini 3.1 Flash-Lite Preview and Gemini 3 Flash Preview, she uploaded an image of LEGO bricks and prompted the models to draw bounding boxes around all green LEGO bricks using Python and display the image with the boxes. Code execution, as Bailey described it, gives Gemini access to a sandboxed Python environment with preinstalled data-science libraries. The model can write code, run it, and use the result as a tool against the media supplied in the prompt.

Gemini 3.1 Flash-Lite quickly produced an image with green rectangles around the green LEGO bricks. AI Studio’s visible cost estimate showed an input token cost of $0.000554, an output token cost of $0.000002, and a total estimated API cost of $0.000555.

$0.000555
estimated API cost for the LEGO bounding-box request with Gemini 3.1 Flash-Lite and code execution

Bailey emphasized that the same tool pattern could be used for counting objects, determining orientations, finding items across video frames, or returning timestamps. The useful unit is not just visual understanding. It is visual understanding plus executable analysis plus cost visibility.

Gemini 3 Flash Preview took longer but did more verification. Bailey described it as checking its work: it drew a segmentation mask to identify green-spectrum regions and verified coordinates before producing bounding boxes. That increased cost and tool calls, but she said it still remained “on the order of pennies.” Her recommendation was to test Gemini 3.1 Flash-Lite for use cases that had previously relied on Gemini 2.0 Flash or 2.5 Flash, especially where speed and cost dominate.

The tools panel showed AI Studio as more than a prompt box. Bailey called out structured outputs, code execution, function calling, custom functions, Google Search grounding, Google Maps, and URL context as API-accessible one-liners. The design implication from the demo is that model capability and surrounding tools are meant to travel together: the prototype configuration is also the application configuration.

Build can wire auth and storage, but generated apps still need reviewable failure modes

The full-stack app path began with a prompt: create an app that lets a user log in with Google, upload a photo of a bookshelf, identify visible books from the spines, use Google Search grounding to fill in missing details, and save title, author, description, and genre to a database tied to the user’s account. Bailey called it “a tall ask.”

AI Studio Build opened an IDE-like environment where the model planned the app, generated files, tracked version history, and exposed settings for secrets and integrations. Bailey pointed out that she had pre-added a Gemini API key, but developers could also add keys for services such as Supabase or n8n. Integrations included OAuth and GitHub syncing to public or private repositories. When database support was required, AI Studio asked her to enable Firebase and select a region.

The generated “ShelfScan AI” app had a landing page, Google sign-in, Firestore connection, an upload flow, and an interface for turning a bookshelf photo into a digital library. Bailey found an image of a shelf with cookbooks and tried to upload it. The app initially failed to save, reporting insufficient permissions in Firestore. The failure was materially useful because the generated environment attempted to debug itself.

The model investigated the user-creation process and Firestore security rules. It identified a check on photoURL size, reasoned that Google profile photo URLs could be longer than expected, and later identified an imageUrl length comparison as the root cause: the app had truncated to 1000 characters while the rule required the size to be less than 1000, causing the comparison to fail. The proposed fix was to change the rule to allow <=.

Bailey’s interest was not that the generated app was perfect. It was that the model located the files that needed modification and reasoned through validation logic instead of returning a generic permissions explanation. Once the issue was fixed, she uploaded the bookshelf image again. The app populated a collection of seven books, with dates, book type, details, title, author, and related metadata. Bailey noted that some of the details were not available from the visible spines alone, implying that grounding filled in missing information. Logging out and back in preserved the collection, demonstrating account-linked persistence.

Vernade later showed the same Build path applied to his book-illustration notebook. He pasted the contents of a Python notebook into AI Studio and asked it to build an app that illustrated a book using the generative media models described there. He said an earlier run took about 1000 seconds, roughly 15 minutes, and produced a “GenMedia Book Illustrator” app with file upload, a book URL field, process logs, and a preview. He did not use it as the main pipeline implementation, but he said it was doing essentially the same thing as the notebook “on the first try.”

The more durable lesson was his generated-code hygiene. Vernade keeps AI Studio system instructions for “vibecoding” that ask the model to create a file per feature or related feature, split as much as possible, add docstrings to all functions, start each file with a long comment explaining the feature and use cases, maintain a Design.md document at the root of the app, centralize configurable items such as model names, and always create a way to test setup without altering data.

His rationale was reviewability. If the model is asked to modify one feature and edits an unrelated file, the developer gets an early signal that something is wrong. He also argued that generated apps should include logs by default, because an error message alone often does not show what happened immediately before the failure or where the failure occurred.

Live multimodality combines screen, camera, speech, language control, and exportable code

Gemini Live uses the same prototype-to-code pattern in a real-time setting. Bailey shared her screen with a Google Images search for LEGO bricks and asked, “Hey there, Gemini, what do you see on the screen?” The model answered that it saw a Google search for LEGO bricks and pieces, image results showing different kinds of LEGO pieces and sets, and a larger preview of brightly colored LEGO brick illustrations from Freepik.

She then modified the system instructions to tell the model to respond only in Spanish and asked again. The model answered in Spanish. After removing the instruction, she asked within the conversation for a response in Castilian Spanish, and the model complied. Bailey’s point was that language and dialect behavior can be controlled either through system instructions or conversationally, depending on the developer’s need.

The same Live surface supported camera input. Bailey held up two fingers and asked Gemini how many fingers she was holding up and to compose a poem about her. The model answered that it saw two fingers, “like the peace sign,” and produced a short poem. Bailey then showed the “Get code” overlay for Gemini Live, with Python code using models/gemini-3.1-flash-live-preview and a default camera mode.

Bailey described Gemini Live as a speech-to-text, LLM-understanding, and text-to-speech pipeline stacked together, with video feeds and screen shares incorporated into the interaction. The primitives shown were screen sharing, video feed, spoken interaction, system instructions, dialect control, and code export.

Vernade’s text-to-speech work used the same prompting idea in a different form. For his book pipeline, he asked Gemini to extract dialogue from The Wind in the Willows and rewrite it as a play transcript. The narrator used one voice, while all other characters shared another voice but were assigned different speaking styles, such as fast-paced posh British delivery. When the generated audio played, Vernade argued that listeners would not guess the two speaking characters were using the same base voice, because prompting changed the performance enough to differentiate them.

He gave one concrete TTS caveat: prompts to the text-to-speech model should begin with an instruction such as “read this” or “tell me this.” If the text is sent without that instruction, he said the model may ignore it. That instruction is also where a developer can add performance context, such as reading in a scary way or making characters sound excited.

Genie produces playable pixels, not a conventional game engine

Genie 3 clarified what Google currently means by a “world model” product. Bailey described Genie as a composition of models — including Nano Banana, Veo, and Gemini for prompting — that lets a user describe an environment and a character, generate an initial sketch, and then create a short playable world. It does not generate a Unity project or an Unreal Engine environment. It generates a frame-by-frame pixel experience that responds to keyboard input.

Her prompt asked for Regent’s Canal on a sunny day, dolphins swimming in the canal, boats with pirate flags, and a pink sparkly squirrel with purple feet and a pirate hat as the character. Genie first produced a sketch matching those elements. Bailey then created the world and navigated it with WASD and arrow keys, using the space bar to jump.

The generated squirrel appeared to hop or walk on water, jump onto boats, and move around a canal scene with bicycles, people, pirate flags, and dolphins that were visible but not moving. Bailey noted that the system did not seem to understand that Regent’s Canal has deep water and said she should have specified that in her prompt. That observation captured the current character of the system: it can produce an interactive visual experience from a strange prompt, but its physical assumptions are not guaranteed.

Bailey contrasted this approach with world-model companies such as World Labs, which she said are taking a different path by building actual Unity or Unreal Engine environments. Genie’s generated worlds are not stored as 3D game assets. They are raw pixels incorporated into the experience. She found it striking that each part of the experience was generated dynamically as she moved.

The public implementation she described allowed a 60-second interactive world, available through an Ultra subscription in some parts of the world. A video is created at the end and can be downloaded and reviewed. Genie, as shown, is not a production game engine. It is a promptable interactive visual simulator with enough temporal persistence to make a short playable scene.

The book pipeline relies on structured prompts and selective context

Vernade’s generative media pipeline used The Wind in the Willows as source material because it could be downloaded from Project Gutenberg, which he described as an open-source library of books. He noted that direct access did not work for him while in the UK and suggested the notebook server might still be able to retrieve it from another country. The pipeline illustrated the book’s characters, generated chapter images, animated scenes, created music per chapter, and produced narrated dialogue.

The notebook installed the SDK, initialized a client with an API key, and selected model IDs. Vernade mentioned adding retry behavior when initializing the client, which he said is useful for Nano Banana 2 because demand can make successful generation harder when the U.S. wakes up. He also included safeguards against accidental spending. Because the generative media models are paid, he normally adds checkboxes so users do not run expensive cells by mistake. He said the whole notebook excluding video generation should cost roughly one euro to run, while a Veo notebook could cost around $20.

He also clarified Google’s developer product split. Consumer Gemini apps are broad and easy to use, but developers cannot necessarily choose exact models, parameters, or tool calls. Vertex AI is the enterprise end, with much more control, including data-center choices that matter for European data-residency requirements, but it is harder to set up and better suited to teams already using GCP or with DevOps support. AI Studio and the developer APIs sit in the middle: easy API-key access, enough control to build and test, without requiring full enterprise setup.

The notebook used the client file upload API so Vernade did not need to configure storage buckets. Behind the scenes, he said, it creates a bucket, but the developer only uploads a file and then uses it in Gemini prompts. He used structured output because he wanted Gemini to return only parseable prompt objects, not conversational filler such as “Sure, I can do that.” He used chat mode to preserve the book in history so subsequent requests could refer back to it without repeatedly uploading the file.

That context management became central once images entered the pipeline. Vernade first asked Gemini to generate descriptions and image prompts for the main characters, using a defined style: dark fantasy, black-and-white backgrounds, colored characters. He added system instructions for Nano Banana to avoid borders, titles, descriptions, or comic panels, because earlier image models would sometimes treat portrait-format book illustrations as covers and add text. He also asked for family-friendly output, which he acknowledged might conflict with “dark fantasy.”

Nano Banana generated images for Mole, Water Rat, Toad, and Badger. The backgrounds were black and white, while the characters were colored. Vernade then asked Gemini for prompts for chapter illustrations and generated images such as the characters picnicking by the river, a caravan on the road, and a forest scene. In one forest image, however, Toad did not match the earlier character design. Vernade used that failure to explain a more scalable architecture.

Instead of asking for a chapter prompt alone, he defined a structured chapter output with a name, prompt, and list of characters meant to appear in the image. For each chapter image, the system then passed only the relevant character reference images into Nano Banana. With five characters, passing all images might be fine; with a real book containing 40 characters, Vernade said it would not be sustainable to expect the model to manage the entire context perfectly. At scale, he would create multiple reference images per character — front, back, and sides — and pass only the views needed for a particular scene.

The engineering lesson is specific: character consistency was not treated as a magical property of the image model. It improved through structured extraction, explicit character lists, and selective context.

Video and music generation work, but the seams are still visible

For video, Vernade used Veo to animate chapter images. He chose the largest Veo model for the live run because he did not have to pay; otherwise, he said, he would use the smallest model. The prompt reused the same chapter prompt that generated the still image, and the generated image was passed as the starting frame. He requested portrait output at 720p to keep it smaller.

The first generated animation showed Mole and Water Rat in a forest, with the rat drawing a sword. The clip included generated speech: “Stand back. You shall not pass. Leave this place little ones. You don’t belong here.” Vernade thought the result was quite good, but he also explained why this method is under-specified. A prompt written for a still image does not necessarily tell the video model what should happen next.

His improved method asked Gemini to look at the image and generate a video-specific continuation prompt. In the example, Gemini described Mole shivering and clutching his scarf while Water Rat bravely drew his cutlass and stepped forward to protect him. That produced a new video, but Vernade said the result was not as good in that case. He also observed an unexplained behavior from his tests: when he used the same prompt, Veo tended to add text, but when he had Gemini generate a new prompt, it did not, even though both cases involved passing a prompt and an image.

For music, the pipeline was cleaner. Vernade used Lyria to generate music prompts for each chapter and then called the model to create music. The first chapter, “The River Bank,” produced an orchestral acoustic folk style with flowing acoustic guitar and flute; the second, “The Open Road,” was jaunty and adventurous with fiddle and acoustic bass; the third, “The Wild Wood,” was suspenseful with creepy melody and staccato pizzicato strings. Vernade said he had run the notebook before but had not checked the specific music outputs in advance, and that this kind of generation had worked reliably for him.

He attributed some of that reliability to the relationship between Gemini and the generative media models. A lot of internal training data for the Gen Media models, he said, is made using Gemini, which means Gemini is relatively good at generating prompts those models understand.

Lyria RealTime was a separate primitive. Vernade described it as a live model that creates music indefinitely and can be prompted into different musical directions in real time, “like a DJ.” He showed a “Space DJ” interface in which prompts were arranged as planets in a star field; moving near different planets changed the generated music. He was surprised more people were not using it, because the model can be left on autopilot while the music shifts over time.

Audience questions exposed current orchestration limits. One person asked whether Veo’s generated video, music, and text were produced by Veo alone or by stitched models. Vernade answered that it was just Veo. Asked whether a developer could orchestrate Lyria music into a Veo video without using a tool such as FFMPEG, he said there was no direct way to do that because Veo 3 was not designed to ingest audio files. He described the future direction as every model being able to ingest all modalities, including Lyria using audio references or supporting multi-turn edits such as “make the ending more epic,” but said that is not possible at the moment.

Another audience member asked how well Veo performs when prompted very specifically for music. Vernade said he would not try to use Veo for music. He believes its background-music training data is likely light because it tends to produce similar non-music patterns. He expected future generations to improve because models are trained more together and share some training data, but described this as a generational limitation of the current Veo.

He also surfaced a cost-control feature: service_tier. In the notebook, he set service_tier: 'priority' for demo purposes, which gives higher priority and more reliability but costs twice as much. A flex tier allows requests to take a few minutes but costs half as much, similar in spirit to batch API pricing. His advice for people running the notebook themselves was to remove the priority setting to save money.

Open models and media models have different constraints

Vernade was asked whether Google planned open-weight generative media models, “like Gemma but for art.” He said he did not think so for image and video generation. His reason was control: image and video generation involve checks on what users ask for and what the model actually generates, and Google blocks outputs that are not aligned with its vision. Open-weight models make that kind of enforcement much harder, which he described as “open bar.”

He left more room for music. For music generation, especially real-time generation, he said on-device execution would make sense because it could be faster. That might come at some point. But for image and video, he expected company values and safety constraints to limit open-weight release.

That distinction set up Ian Valentine’s Gemma material. Valentine’s focus was the opposite side of the stack: open models that run on phones, laptops, local GPUs, and single-GPU cloud deployments. His throughline was that the interesting part of Gemma 4 is not that it can do things frontier cloud models can do, but that more of those things can now happen locally or on hardware a developer controls.

Gemma 4 moves agentic loops onto local hardware

Valentine described Gemma 4 as a family of four models released the prior Thursday. The E2B and E4B “effective” models are designed for mobile phones, Raspberry Pis, Jetson Nanos, and other small edge hardware. The “effective” label refers to an architecture with per-layer embeddings that do not need to be fully loaded as part of the model. They can live on flash and be paged in as needed. The “brain” of the model is about 2 billion or 4 billion parameters, while loading everything into RAM looks more like a 5B or 8B model.

The larger models are a 26B mixture-of-experts model with about 4B active parameters and a 31B dense model. Valentine described the 26B as fast because only a fraction of parameters activate, while still being more intelligent than a simple 4B model. The 31B is the flagship quality model and, in the size slide shown, was described as the best model that can fit a single consumer GPU. Later, AI Studio listed Gemma 4 variants including “Gemma 4 31B A4B IT” and “Gemma 4 31B IT”; the sizing slide itself distinguished the 26B A4B mixture-of-experts model from the 31B dense model.

ModelActive or effective parametersGPU consumption at 8-bitUse cases
E2B2B4.6GBEdge and on-device; Android, iOS, Raspberry Pi, Jetson Nano, etc.
E4B4B7.5GBEdge and on-device; Android, iOS, Raspberry Pi, Jetson Nano, etc.
26B A4B3.8B26GBVery fast inference; only a fraction of parameters active
31B31.3B31GBMaximum quality; easier to fine-tune; best model shown as fitting a single consumer GPU
Gemma 4 model sizes and intended deployment targets shown in Valentine’s size slide

Valentine said the Gemma 4 models focus on the agentic side. They have thinking built in and are multimodal. The effective models can understand audio, while the larger models can understand image and video. He said Google is seeing performance comparable to models roughly 10 times larger in parameter size. A radar chart attributed to Arena LMSYS showed Gemma 4 31B improving over Gemma 3 27B and Gemma 2 27B across text categories such as coding, math, hard prompts, instruction following, roleplay, and occupational domains. Valentine’s explanation was that action-taking, function calling, and coding are designed into the model architecture rather than relying only on strong instruction following.

His first phone example used Google AI Edge Gallery. The app lets users test Gemma models locally and now includes “Agent Skills.” In one example, the model ran on GPU on a Pixel, interpreted a user’s note about AlphaGo and Move 37, called a JavaScript script named research-tracker/index.html, and logged the thought under the AlphaGo paper. Another screen showed a research tracker UI with a thoughts list and mention activity chart. The point was that the model can move from chat to local function selection: Android intents, JavaScript skills, web views, and app-specific functions can be exposed, and the model decides which to call.

A second phone example showed a virtual piano skill. The user wrote, “I want to play the piano,” the model called virtual-piano/index.html, and a keyboard loaded in the chat interface. Another showed on-device “vibecoding” with Gemma 4 E2B generating a simple calculator app in HTML, CSS, and JavaScript, including divide-by-zero handling. Valentine said small web apps are within reach on-device; large architectural systems are not the point. Turning on thinking improves quality because it adds a planning step before execution.

Valentine then moved to a laptop. In LM Studio, he loaded google/gemma-4-26b-a4b, using about 17.99GB of memory in the interface and roughly 22GB when accounting for context. On an M4 Mac with unified memory, he could serve the model locally through an OpenAI-compatible endpoint on port 1234. Any app capable of calling a chat-completions API could then point at the local machine instead of a cloud endpoint.

To demonstrate parallel local agents, he ran a script that created one orchestrator and 10 sub-agents, each in its own terminal, to generate SVGs. The orchestrator distributed drawing tasks such as a rocket ship, crescent moon, alien spacecraft, star, and planet. A throughput monitor showed combined token generation. The final page displayed an SVG art gallery, including an animated spinning planet. Valentine emphasized that the work did not require internet access and that the same pattern could apply to file sorting, code implementation, subdivided research, data analysis, or other agentic tasks on a local machine.

He then showed OpenCode configured to use the same local endpoint. The configuration required specifying a provider name, schema, exposed model, any parameters, and a URL pointing to localhost. He also noted that if a developer lacks enough local RAM, Google has a guide for deploying Gemma 4 to Cloud Run on a single GPU, including an RTX Pro 6000, and that Gemma models can be tested directly in AI Studio as well.

The coding example used a game spec for “Nebula Drift,” generated by the 31B model, and asked the local 26B model in OpenCode to implement it. The model created files, but the first browser run failed with an “Unexpected token” syntax error. Valentine copied the error back into the agent. The model investigated, edited files, and eventually produced a working version with a starfield and a movable ship, though Valentine noted he did not see asteroids. The point was not that the game was finished. It was that a local model could participate in the same edit-run-debug loop cloud coding agents use.

He then asked the larger model to write a spec for “a game where you can build your own game inside the game,” using the earlier spec as a reference. The generated spec was titled “OmniForge.” He asked for an implementation as a single HTML file, and the resulting preview included a grid editor and a playable triangle that could reach a goal, triggering a “VICTORY! You reached the goal!” overlay. He also showed Gemma 4 recreating a DeepMind web page from a screenshot as a single index.html, matching the layout and typography closely enough to make the point.

Valentine’s final examples extended the same local-agent theme into other environments: a browser-based Open Duck robot simulator in which the E2B model ran in WebGPU and interpreted commands into robot actions; Android Studio integration using the 26B model for app-building assistance; and Gemma working with Google’s Agent Development Kit, with thinking loops and feedback systems for longer-running tasks.

The frontier, in your inbox tomorrow at 08:00.

Sign up free. Pick the industry Briefs you want. Tomorrow morning, they land. No credit card.

Sign up free