Any-to-Any Agents Rely on Orchestrated Multimodal Models, Not One Network

Patrick LoeberAI EngineerWednesday, May 20, 202610 min read

Google DeepMind’s Patrick Löber presents “any-to-any” agents as an orchestration problem rather than a claim that one model already handles every modality. In his architecture, Gemini reads and reasons across PDFs, images, audio, video and other sources, then uses function calling to invoke specialized native models for images, speech, live audio, video or embeddings. Löber argues that the useful shift is not generating every possible format, but letting an agent decide when a diagram, spoken explanation or other output is warranted.

Any-to-any is a goal, not yet a single model

Patrick Loeber framed “any-to-any” as the practical ambition behind Gemini’s multimodal stack: agents should be able to accept text, code, images, audio, video, PDFs, URLs, API information, and search results, then produce text, images, speech, video, function calls, and code. The pattern he described is an agent that reasons over multiple sources, decides what output formats would help, and calls the right generation model to create them.

He also drew an important boundary around the current architecture. The stack is not yet “one multimodal model” that does everything. Gemini is the main understanding and reasoning model, and in the version Löber described it understands multiple modalities but outputs text. Other native generation models handle specialized outputs: Nano Banana for image generation, Gemini text-to-speech for speech, Gemini Flash Live for real-time audio interaction, Veo for video with audio, Gemini Embedding 2 for multimodal embeddings, and Gemma for local multimodal agents.

The agentic architecture depends on that division. Gemini acts as the reasoning layer: it can read the sources, synthesize a study guide, decide that a difficult concept needs a visual diagram or an audio explanation, and then call a specialized image or speech generator as a tool. The “any-to-any” behavior comes from orchestration across native multimodal models, not from treating the system as a single all-purpose network.

Capability	Model or model family	Inputs described	Outputs described
Multimodal understanding	Gemini	Text, images, audio, video, documents	Text
Native image generation	Nano Banana / Gemini Flash Image	Text	Text and image
Native speech generation	Gemini TTS	Text	Speech
Real-time dialogue	Gemini 3.1 Flash Live	Text, audio, image, video	Text and audio
Local multimodal agents	Gemma 4	Text, audio, image, video	Text
Video generation	Veo 3	Text, audio, image, video	Video with audio
Multimodal search	Gemini Embedding 2	Text, audio, image, video, PDF	Embeddings

Löber described any-to-any behavior as a stack of specialized native multimodal models rather than one model handling every transformation.

The example application was a small Notebook LM-like research partner. Given a collection of sources — a paper, handwritten notes, a lecture or tutorial video, and a voice memo — the system would synthesize a study guide, then enrich it with visuals and audio where useful. Löber’s emphasis was that this should not be a hardcoded linear workflow. A workflow would mechanically generate a summary, a set of images, and audio in a fixed sequence. The agentic version reasons, decides, calls tools, evaluates the result, and either loops or finishes.

The first phase is cross-modal understanding over long source material

The base layer of the application is multimodal understanding. Patrick Loeber used sources related to the “Attention Is All You Need” paper: a PDF research paper, a whiteboard image, a lecture video, and an audio note. The desired behavior is cross-modal understanding — the ability to draw on information from all of those sources together and make connections across them.

In the Google GenAI SDK, he presented this as a simple file-and-content call. Larger files can be uploaded through client.files.upload(), while smaller files can be passed inline. The example uploaded paper.pdf, lecture.mp4, and voice.mp3, read whiteboard.jpg as bytes, and then called client.models.generate_content with model='gemini-3-flash-preview' and a contents list containing all of those inputs plus an instruction such as “Analyze...”.

The practical details were as important as the code. Löber said Gemini Flash and even Flash Lite are strong enough for audio transcription if prompted directly — for example, “generate a transcript of this file.” He gave a conversion rate of one minute of audio to 1,920 tokens. With a 1 million token limit, that translates, by his calculation, to more than nine hours of audio content in context.

1,920 tokens

per minute of audio, according to Löber’s API guidance

For video, he described the rough limit as about one hour, but said developers can stretch this to roughly three hours by adjusting media resolution and frame rate, including media_resolution=low and lower FPS. He also noted that prompts can target timestamps directly, such as “analyze from 5:00 to 15:00,” and that the API can accept YouTube URLs directly.

The cost and repetition problem is handled through context caching. Löber described it as built into the API and especially useful when repeatedly querying long files loaded into Gemini. He said it can save 90% of costs in that pattern.

90%

cost savings Löber attributed to context caching for repeated queries over long files

At the end of this phase, the agent has not yet generated a podcast, diagram, or presentation artifact. It has produced text analysis: a study guide synthesized from PDF, image, video, and audio. That text output then becomes the input for the second phase.

Generation is delegated to tools the reasoning model chooses

The generation layer begins with the study guide rather than the original raw sources. Patrick Loeber’s architecture uses Gemini 3 Flash as the reasoning model, connected through function calling to image and speech generation functions. Gemini reviews the study guide, identifies which concepts are complex enough to benefit from visual explanation, identifies which sections would work better as spoken explanations, and calls the relevant tools.

The image path uses a native image generation model. The slide’s code example used gemini-3.1-flash-image-preview; in the spoken explanation, Löber appeared to say a different version number and identified the model as Nano Banana 2, the more familiar name. The code pattern remained the same as the understanding call: client.models.generate_content, with a prompt such as “Create a picture of ...,” then iteration over response parts to find inline image data and save it as generated_image.png.

He emphasized educational infographics as a strong use case. In the example shown, the generated graphic explained the Transformer architecture: the sequential bottleneck of RNNs, vanishing memory, multi-head self-attention, positional encoding, constant path length, and the relation between translation quality and training cost. Löber’s point was not just that the image model can draw. It can render explanatory material when prompted to create an infographic.

For speech, the text-to-speech call used gemini-2.5-flash-preview-tts, configured with response_modalities=["AUDIO"] and a speech_config containing a prebuilt voice name, shown as "Kore". Löber said the model can be configured for two-speaker audio, enabling a podcast-style output. The example audio began with an explanation of Transformers as neural network architecture introduced in the 2017 Google paper “Attention Is All You Need,” and described them as the models behind GPT-3, T5, and BERT.

The bridge between the reasoning model and the generation models is the function schema. Löber showed a generate_image declaration with a name, a description, and a JSON schema for parameters. The description told the model to use the function for images or diagrams that illustrate concepts, especially complex concepts that benefit from visual explanation. The single required parameter was a string prompt, described as a detailed image description including style guidance such as “educational diagram” or “concept map.” A parallel function would handle speech generation.

The agent prompt makes the tool policy explicit. It tells Gemini it has a comprehensive study guide synthesized from PDFs, lecture video, audio notes, and whiteboard photos. It defines the role as a research partner agent whose job is to enhance the guide with multimodal materials. It instructs the model to decide which concepts are complex enough to need a visual diagram and call generate_image for them; decide which sections would benefit from an audio summary and call generate_speech for them; be selective; and finally summarize what it created and why.

That selectivity is the distinction Löber drew between a multimodal workflow and a multimodal agent. The goal is not to generate every possible asset. The agent should decide that some parts are best left as text, some require a diagram, and some work as audio.

Native generation matters because the generator inherits world understanding

Patrick Loeber used “native” to mean more than “the model can output an image or voice.” For image generation, he said the models are based on Gemini, so much of the training and world understanding from the main Gemini models is available in the generator. That is what enables cases where the model needs to understand a scene, a map, math, or code before producing an output.

The most compact example was a displayed @SirajRaval tweet describing Nano Banana drawing “what does the red arrow see” over Google Maps. In Löber’s explanation, a user can draw arrows on a map and ask the model to create a picture of what is seen from that position. Because Gemini understands the world, he said, the model correctly generated an image of the Golden Gate Bridge. The important point was that the generator used the map context and the arrows to infer what view should be created.

He paired that with an education example: using Nano Banana 2 to correct math homework directly in an image. Löber said the model can create pictures with corrections because it understands math, and added that it can even generate code on images. The thread connecting these examples is that the generation model is useful when the image is an output of reasoning, not just a visual rendering of a prompt.

These models understand the world.

Patrick Loeber

He made a similar argument for speech. The audio models are multilingual and understand accents and tone. The generated audio examples included a British accent and a Bavarian German accent. Löber did not claim every accent is available, but said users can ask for different accents. The British sample opened, “Hey up lads and lasses. We’re getting stuck into building these multi-model agents today. No faffing about.” The Bavarian sample began, “Servus mitnand! Heit schauma uns o, wia ma diese multimodalen Agenten zamschrauben...”

The demonstration was about controllability: speech output can be guided not only by text content but by style, tone, accent, and speaker configuration. In the agent architecture, that makes speech generation another tool the model can call when spoken explanation is the better medium.

Real-time interaction removes the cascaded voice pipeline

After the understanding and generation layers, Patrick Loeber introduced Gemini 3.1 Flash Live for real-time interaction. He described it as a new audio-to-audio model based on Gemini and optimized for real-time dialogue and voice-first AI applications.

The architectural claim was specific: audio goes in and audio goes out through one architecture. Löber contrasted that with a cascaded pipeline involving separate models. In a cascaded voice assistant, audio might be transcribed to text, passed to a language model, then converted back into speech. In the Live API model he described, input and output are native audio in one model, enabling more natural interactions.

The AI Studio live demo URL was ai.studio/live. Löber did not run a live demo, but showed a colleague using the Google AI Studio Playground. The system instruction was “Speak in a friendly Irish accent,” with the voice set to “Aoede” and the stream marked live. The user asked, “Can you see me?” The model responded in voice that it could see him “as plain as day,” describing short hair, a beard, a dark jacket over a blue shirt, and saying he was “looking well.”

For Löber, the striking feature was that the model can understand a live video feed while conversing. He called this one of the most impressive features and encouraged trying it directly in AI Studio.

The code example used an async WebSocket-style connection through the GenAI SDK: client.aio.live.connect(model=model, config=config) with model = "gemini-3.1-flash-live-preview" and config = {"response_modalities": ["AUDIO"]}. As with the earlier Gemini examples, Löber pointed to a Live API skill in the google-gemini/gemini-skills repository so developers would not need to memorize the setup.

The same pattern can move beyond study tools

The Notebook LM clone was the working example, but Patrick Loeber argued that the pattern transfers to other fields. He named field service agents, creative brief agents, and medical intake agents as examples. The shared structure is: ingest multimodal inputs, synthesize an internal representation or guide, reason about what additional artifacts are needed, call native generation tools, evaluate, and return multimodal outputs.

He also pointed to extensions of the stack. Gemini Embedding 2 can embed text, audio, images, video, and PDFs into one unified vector space, enabling multimodal search. Gemma 4 can run locally while still providing multimodal understanding over text, audio, images, and video. Veo 3 adds video generation with native audio.

The closing formulation was: “The world is multimodal — your agents should be too.” In Löber’s technical framing, that does not mean every agent must always produce every modality. It means agents should treat modality as part of the reasoning problem: understand the user’s available sources, choose the output form that best explains the content, and call the right native model when text alone is not enough.

AI Application Architecture RAG and Knowledge Systems Voice and Audio AI Agents and Autonomy Multimodal AI Image and Video Generation