Native Multimodal Models Extend LLMs but Still Lack Unified Representations

Steven Feng, Victoria Lin | Stanford Online | Thursday, June 4, 2026 | 19 min read

Victoria Lin of Thinking Machines uses a Stanford CS25 seminar to argue that native multimodal models have extended much of the large-language-model recipe into images, audio, video and action, but have not yet unified multimodal intelligence. Her account is that tokenization, Transformers, autoregressive conditioning and scaling transfer only partly: images, video and action require different representations, objectives and sometimes modality-specific parameters. The result, she says, is a field moving beyond text-only systems while still relying on text as its strongest abstraction for reasoning.

Native multimodal models extend the language-model recipe, but do not solve multimodal intelligence

Victoria Lin frames native multimodal intelligence as an attempt to carry the most productive parts of the large-language-model paradigm into images, audio, video, and eventually action. The starting point is familiar: modern language models are trained by next-token prediction over symbolic information, using large corpora from web pages, books, articles, code repositories, forums, dialogs, news, and other domains. As data and model size scale, these systems acquire knowledge and can be post-trained to follow instructions, perform chain-of-thought reasoning, and increasingly plan complicated tasks.

Her central claim is that language modeling alone is not enough because neither the digital world nor the physical world is text-only. Digital interfaces contain images, audio, video, documents, and mixed media. The physical world presents sensory streams and real-time interactions. An AI system meant to interact with people in these environments must handle multimodal information rather than only symbolic text.

The architectural bet behind many current systems is that the language-model recipe can be generalized by converting different forms of information into token-like units and processing them through a Transformer. Text can be tokenized with byte-pair encoding into subword units. Images can be patchified into fixed-size regions, such as 16-by-16-pixel patches, encoded into vector representations, and then serialized as visual tokens. Audio can be transformed from waveform into frames or spectrogram-like representations and tokenized into audio units. Video can be treated as a sequence of images: patchify each frame, concatenate the patches over time, and represent the result as a sequence.
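The patchify step can be sketched in a few lines. This is a minimal illustration, assuming a single-channel image stored as a list of rows; the function name `patchify` and the 64-by-64 toy image are hypothetical, and a real system would pass each flattened patch through a learned encoder before handing it to the Transformer.

```python
def patchify(image, patch=16):
    """Split an H x W image (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch region into a single vector.
            vec = [image[r][c]
                   for r in range(top, top + patch)
                   for c in range(left, left + patch)]
            patches.append(vec)
    return patches

# Toy 64 x 64 grayscale image (illustrative values only).
image = [[(r * 64 + c) % 256 for c in range(64)] for r in range(64)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 16 patches, each with 256 values

# Video as a sequence of images: patchify each frame, then concatenate over time.
frames = [image, image, image]
video_tokens = [p for frame in frames for p in patchify(frame)]
print(len(video_tokens))  # 48
```

The sequence length grows with both resolution and frame count, which is one reason visual inputs consume context much faster than text.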

“Token” in this context does not always mean a discrete symbol. A dense vector embedding can function as a token for the Transformer. The broader move is to build a mixed sequence: text tokens, image patches, audio frames, video patches, and other units arranged so that the model can condition on them and generate outputs autoregressively.

That view divides today’s multimodal language models into two broad categories. The first accepts multimodal inputs but produces text outputs. In training, such a model conditions on a mixed sequence but computes losses only on text tokens. This supports tasks where the model receives an image or video and answers in text. Lin identifies many familiar current systems, including Gemini, Qwen, and Kimi, as operating in this mode for core products, while noting that companies may also have separate models capable of multimodal generation.
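The "condition on everything, score only text" setup can be illustrated with a masked cross-entropy. This is a sketch under assumptions: the function name `masked_text_loss`, the three-way toy vocabulary, and the `is_text` flags are all hypothetical, and production systems would typically do this with a tensor library rather than plain lists.

```python
import math

def masked_text_loss(logits, targets, is_text):
    """Mean cross-entropy over positions whose target is a text token."""
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, is_text):
        if not keep:
            continue  # non-text positions are conditioned on but not scored
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else 0.0

# Position 1 stands in for an image token, so it contributes no loss.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
targets = [0, 2, 1]
is_text = [True, False, True]
loss = masked_text_loss(logits, targets, is_text)
print(loss)
```

Changing the logits at the masked position leaves the loss unchanged, which is the sense in which the model is not trained to generate the non-text modality.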

The second category is “omni” models: systems that accept multimodal inputs and also generate non-text modalities such as images or audio. Lin cites GPT-4o as an omni-style example the audience would recognize, because it can generate images. Omni models raise harder research questions: how should a model generate non-text modalities, and does training for understanding help generation, or vice versa?

The promise of treating multimodal data as tokenized sequences is that prompting, instruction following, planning, and reasoning can transfer from the LLM paradigm. These models can be prompted with mixed-modal prompts and can plan or reason with multimodal information. Scaling also appears to transfer in broad strokes: increasing data and model size improves performance, and further scaling may produce additional emergent capabilities. But Lin distinguishes this rough trend from the more mature state of text scaling laws. For language models, researchers have fit power-law equations and coefficients with considerable precision; for multimodal models, exact scaling-law study remains under-explored.
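To make "fitting power-law equations and coefficients" concrete, here is a minimal sketch of a log-log least-squares fit. The functional form `loss = a * size**(-b)` is one common simplification and is an assumption here, not an equation from the talk; the data points are synthetic values generated from known coefficients, not measured results.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic points generated from a = 5.0, b = 0.1 (illustrative only).
sizes = [1e8, 1e9, 1e10]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(round(a, 6), round(b, 6))
```

The fit recovers the generating coefficients because the toy data is noise-free; with real training runs the interesting question is how well such a curve extrapolates, which is the part Lin describes as under-explored for multimodal models.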

Architectural scaling techniques also transfer. Mixture-of-Experts, attention sinks, sparse routing, and related approaches can be applied to multimodal systems to improve scaling efficiency. But the rest of Lin’s argument complicates the idea that multimodality is merely “LLMs plus more tokens.” Different modalities have different information density, different loss landscapes, and different relationships between understanding and generation. The language-model analogy is useful, but not complete.

Discrete image tokens made Chameleon elegant, but the representation was too lossy

Chameleon is Lin’s cleanest example of the “model the multimodal world as a sequence of discrete tokens” hypothesis. It asks whether every modality can simply be converted into discrete tokens, then modeled with the same next-token-prediction machinery that made language models powerful.

For images, the key step is discretization. Starting from an image, the model divides it into patches, runs continuous encoders to obtain embeddings for those patches, and then maps each embedding to the closest entry in a learned vector codebook. The corresponding codebook index becomes a discrete token. Lin identifies this as the VQ-VAE approach: a vector-quantized variational autoencoder turns continuous image information into discrete codebook indices. The model then interleaves text tokens and image tokens and trains with a cross-entropy language-modeling objective.
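The codebook lookup at the heart of this step can be sketched as a nearest-neighbor search. This is a minimal illustration, assuming squared Euclidean distance and a tiny hand-written codebook; in a VQ-VAE the codebook is learned, and the names `quantize`, `codebook`, and `patch_embeddings` are hypothetical.

```python
def quantize(embeddings, codebook):
    """Map each embedding to the index of its nearest codebook vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(e, codebook[k]))
            for e in embeddings]

# Hand-written 4-entry codebook standing in for a learned one.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_embeddings = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.9]]

image_tokens = quantize(patch_embeddings, codebook)
print(image_tokens)  # [0, 3, 2]

# Decoding from the indices returns codebook vectors, not the original embeddings.
reconstructed = [codebook[i] for i in image_tokens]
print(reconstructed[0], "vs original", patch_embeddings[0])
```

The gap between `reconstructed` and `patch_embeddings` is the quantization error: whatever detail falls between codebook entries is discarded before the language model ever sees the image.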

The appeal is architectural simplicity. If text and images can both be rendered as discrete sequences, then the model can learn arbitrary interleavings: text followed by image, image followed by text, or mixed documents where images and text appear in any order. The Chameleon examples include a prompt asking for “cool, quirky-looking birds” and short descriptions; the model generates a mixed document containing images and explanatory text for birds such as the Keel-Billed Toucan, Puffin, and Golden Pheasant. Lin also describes multitasking behavior: chatting, brainstorming, comparing images, giving advice, producing explanations, writing articles or stories, and reasoning over mixed-modal prompts.

Her assessment is that Chameleon was one of the first models to show that training from scratch on interleaved text-image sequences could induce multimodal capabilities while preserving strong text-only performance. That matters because it suggests adding modalities need not come at the expense of text quality.

But the more important lesson is that Chameleon’s elegant unification exposed a representation problem. Discretizing images causes significant information loss for image-understanding tasks. Compared with state-of-the-art multimodal language models that use continuous image encodings, such as SigLIP-style features, VQ-VAE image tokens leave a performance gap in fine-grained visual semantics. The same discrete representation is also inefficient for generation. Generating images token by token requires a large token budget and substantial data before the model can sample well-formed images.

The conclusion is not that Chameleon failed, but that its core assumption was too strong. Treating images as fully discrete token streams brings the architecture closer to language modeling, but images may not tolerate that compression without losing information that matters for understanding and generation.

Transfusion keeps autoregression for text and uses diffusion where images need it

Transfusion addresses Chameleon’s limitation by refusing to force image generation into a purely discrete next-token objective. It still models interleaved text and image sequences, but it uses different objectives for different parts of the sequence: autoregressive modeling for text and diffusion-based generation for images.

Victoria Lin describes diffusion models as image generators that start with noise and iteratively predict the noise to remove until a clear image emerges. Transfusion combines that process with language modeling inside a single Transformer. Text portions of the sequence are trained with the standard autoregressive next-token objective. Image portions are represented continuously and trained with a diffusion objective: the model performs multiple diffusion steps over an image representation segment, arrives at a clear image, then proceeds autoregressively through the mixed sequence.

This requires architectural adaptation. Text can use causal attention, preserving the left-to-right structure of language modeling. Images, for better image performance, use bidirectional attention within the image segment. The model thus operates with different attention patterns and objectives depending on modality, while remaining a single multimodal architecture over an interleaved stream.
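The mixed attention pattern can be sketched as a boolean mask. This is an illustration under assumptions: the function name `transfusion_style_mask` and the per-position `modalities` and `segment_ids` lists are hypothetical, and the sketch covers only the masking rule, not the diffusion objective.

```python
def transfusion_style_mask(modalities, segment_ids):
    """mask[i][j] is True when position i may attend to position j."""
    n = len(modalities)
    # Start from a causal mask: each position sees itself and earlier positions.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same_image = (modalities[i] == "image"
                          and modalities[j] == "image"
                          and segment_ids[i] == segment_ids[j])
            if same_image:
                mask[i][j] = True  # bidirectional within one image segment
    return mask

modalities = ["text", "text", "image", "image", "image", "text"]
segment_ids = [0, 0, 1, 1, 1, 2]
mask = transfusion_style_mask(modalities, segment_ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the image positions form a fully filled square block while text rows stay lower-triangular, so text never sees future tokens but patches of the same image see each other.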

Transfusion demonstrated substantially better image quality and token efficiency than discrete token-based image generation, according to Lin’s summary of the paper. The sample prompts in the Transfusion slides include an avocado-shaped armchair, a blue jay on a basket of rainbow macarons, a corgi, the word “Transfusion” written on a blackboard, a close-up of a human hand, and clouds shaped like bunnies playing with a ball. Her emphasis is not on the aesthetics of the individual samples but on the modeling result: using continuous image representations plus a diffusion objective produces better images with a much smaller token budget.

The remaining problem is representation mismatch. Transfusion’s VAE-style representations are efficient for image generation, but Lin says they are not effective for image understanding. That creates a dilemma: the representation that works well for generating images is not necessarily the representation that works well for interpreting them. As a result, state-of-the-art omni models often adopt separate image encodings for understanding and generation, while the field continues trying to unify them.

That unresolved split becomes a recurring theme. Language modeling has a single text representation for input and output: the same symbolic token stream can be read and generated. Image modeling has not yet reached that state. Understanding wants rich semantic visual embeddings; generation often wants representations aligned with diffusion or VAE-style image synthesis. Lin treats unifying those representations as an important open research problem.

Mixture-of-Transformers separates modality parameters without abandoning shared attention

The Mixture-of-Transformers architecture is Lin’s main example of modifying the Transformer backbone itself for multimodality. The motivation is that modalities differ in information density and data structure. Text, images, audio, and actions may not benefit from being processed by the exact same parameters at every layer. The question is whether a model should use a unified parameter set for all modalities, or whether each modality should have specialized parameters while still communicating through a shared architecture.

Mixture-of-Transformers, or MoT, gives each modality an independent set of Transformer parameters. Lin’s examples include separate QKV projection matrices in the attention layer and separate feed-forward-network parameters. Routing is deterministic: if the token is text, activate the text parameters; if it is an image token, activate the image parameters; if it is an audio or speech token, activate the speech parameters.

This does not mean each modality is isolated. After modality-specific QKV projections, the tokens participate in joint attention in a shared feature space. Once the attention output is unified, the model passes tokens through modality-specific feed-forward layers, again selected by token type. From the outside, the layer still looks like a Transformer block with token inputs and outputs. Internally, different modalities are processed with different parameter sets.
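A heavily simplified sketch of one such block follows. It assumes a single attention head, no residual connections or normalization, non-causal attention, and a two-layer ReLU feed-forward network; the names `mot_block`, `params`, `w_in`, and `w_out` are hypothetical and not taken from the MoT paper.

```python
import math

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mot_block(tokens, modalities, params):
    """Per-modality projections and feed-forward, with joint attention between."""
    # Deterministic routing: the modality tag alone selects the parameter set.
    q = [matvec(params[m]["wq"], x) for x, m in zip(tokens, modalities)]
    k = [matvec(params[m]["wk"], x) for x, m in zip(tokens, modalities)]
    v = [matvec(params[m]["wv"], x) for x, m in zip(tokens, modalities)]
    scale = math.sqrt(len(q[0]))
    outputs = []
    for qi, m in zip(q, modalities):
        # Joint attention: each token attends over keys from every modality.
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        mixed = [sum(w * vj[d] for w, vj in zip(weights, v))
                 for d in range(len(v[0]))]
        # Modality-specific feed-forward, again chosen by token type.
        hidden = [max(0.0, h) for h in matvec(params[m]["w_in"], mixed)]
        outputs.append(matvec(params[m]["w_out"], hidden))
    return outputs

def scaled_identity(s, dim=2):
    return [[s if r == c else 0.0 for c in range(dim)] for r in range(dim)]

names = ("wq", "wk", "wv", "w_in", "w_out")
params = {
    "text": {name: scaled_identity(1.0) for name in names},
    "image": {name: scaled_identity(0.5) for name in names},
}
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
modalities = ["text", "image", "text"]
out = mot_block(tokens, modalities, params)
print(out)
```

Because attention is shared, changing only the image parameters still changes the outputs at text positions; the modalities are parameter-separated but not isolated.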

MoT can be combined with both earlier families. With a Chameleon-style model, each modality can use autoregressive cross-entropy over discrete tokens. With a Transfusion-style model, text can use autoregressive loss while images use a diffusion objective. MoT is not a replacement for the generation objective; it is a modality-aware way of allocating model capacity inside the Transformer.

In the experiments Lin presents from the MoT work, models were trained across a scaling ladder from 163 million parameters to 7 billion parameters. The comparison includes a dense baseline and also a four-expert Mixture-of-Experts baseline, since comparing only against the dense model would give MoT a parameter-count advantage. The training configuration shown uses sequence length 4,096, 250,000 training steps, and about 0.524 trillion tokens across model sizes.

Model size  Hidden dimension  Layers  Heads  Sequence length  Training tokens
163M        768               16      12     4,096            0.524T
760M        1,536             24      24     4,096            0.524T
1.4B        2,048             24      16     4,096            0.524T
7B          4,096             32      32     4,096            0.524T

The MoT scaling ladder Lin presented for the Transfusion setting.
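The configuration figures above can be cross-checked with a little arithmetic. The batch size is not stated in the source; the sketch below only derives what the three reported numbers imply, assuming every position in every sequence counts toward the token total.

```python
SEQ_LEN = 4_096          # reported sequence length
STEPS = 250_000          # reported training steps
TOTAL_TOKENS = 0.524e12  # reported token budget (about 0.524 trillion)

tokens_per_step = TOTAL_TOKENS / STEPS
implied_sequences_per_step = tokens_per_step / SEQ_LEN

print(tokens_per_step)             # 2096000.0
print(implied_sequences_per_step)  # 511.71875

# A batch of 512 sequences (an inference, not a reported value) gives:
print(512 * SEQ_LEN * STEPS)       # 524288000000
```

So the reported numbers are consistent with roughly 512 sequences, or about 2.1 million tokens, per step.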

The result Lin highlights is that MoT is especially effective for non-text generation. In the paper, she says, text performance remains comparable to the dense variant, while image generation improves. The MoT model shows lower image generation loss and stronger sampling-based image-evaluation metrics than the dense baseline. Qualitative comparisons use prompts such as a lychee-inspired spherical chair, a sushi city on a wooden table, an anthropomorphic cheeseburger relaxing on a couch, and a chrome-plated duck arguing with an angry turtle. The MoT outputs follow fine-grained instructions better and generate more detailed objects.

Her explanation is capacity competition. Non-text generation may require capabilities sufficiently different from text generation that placing all modalities inside a single dense Transformer creates interference. If the model can allocate separate parameters to separate modalities while still sharing attention, it can scale non-text generation without sacrificing text.

MoT can also be combined with Mixture-of-Experts. Instead of keeping one parameter set per modality, the architecture can assign expert groups to each modality, with separate routers and potentially different numbers of experts for text and image. Lin says earlier findings suggested that adding more experts was especially helpful for text performance, while image generation did not scale as fast with additional experts. That motivates customized expert allocation by modality rather than treating every modality identically.
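The combination can be pictured as two-stage routing: the modality tag picks an expert group, then a learned router picks an expert within that group. The sketch below assumes top-1 routing with linear router scores; the expert counts (four for text, two for image) and the router weights are illustrative values, not figures from the talk.

```python
def route(token, modality, routers):
    """Return (modality, expert index) using that modality's own router."""
    router = routers[modality]  # one score vector per expert in this group
    scores = [sum(w * x for w, x in zip(weights, token)) for weights in router]
    expert = max(range(len(scores)), key=scores.__getitem__)
    return modality, expert

routers = {
    # Hypothetical allocation: more experts for text than for image.
    "text": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    "image": [[1.0, 1.0], [-1.0, -1.0]],
}

print(route([0.2, 0.9], "text", routers))   # ('text', 1)
print(route([0.2, 0.9], "image", routers))  # ('image', 0)
print(route([-3.0, 0.5], "text", routers))  # ('text', 2)
```

The same token vector lands on different experts depending on its modality, and a text token can never be sent to an image expert, which is what separates this design from a single shared MoE router.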

The practical benefit of modality-specific parameters is not only performance. MoT improves stability and controllability in mixed-modal training and can support asynchronous training across modalities. If a strong text model already exists, a team might freeze the text parameters, add new image or speech parameters, and train those new modality-specific components without damaging the original text capability. In that use case, MoT becomes a technique for extending an existing language model into additional modality generation rather than retraining the whole system.
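The freeze-and-extend idea reduces to updating only the new modality's parameter groups. This is a minimal sketch using plain gradient descent on toy values; the parameter names (`text.ffn`, `image.ffn`), the gradients, and the learning rate are hypothetical.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Update only parameter groups marked trainable; frozen groups are copied."""
    updated = {}
    for name, values in params.items():
        if trainable.get(name, False):
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen: gradient is ignored
    return updated

params = {"text.ffn": [1.0, 2.0], "image.ffn": [0.0, 0.0]}
grads = {"text.ffn": [0.5, 0.5], "image.ffn": [1.0, -1.0]}
trainable = {"text.ffn": False, "image.ffn": True}

new_params = sgd_step(params, grads, trainable)
print(new_params["text.ffn"])   # unchanged: [1.0, 2.0]
print(new_params["image.ffn"])  # moved against the gradient
```

Since text tokens are routed only through text parameters in an MoT-style layer, freezing those groups leaves the text-processing weights exactly as they were while the new components train.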

Understanding helps generation more than generation helps understanding

Image understanding and image generation may look like inverse tasks, but Lin treats them as different capabilities with different representation needs. Image-to-text resembles a multimodal language-model understanding problem; text-to-image resembles diffusion-style generation. The question is whether multimodality requires a single model that does both well, or whether the field is still working with a partially split design.

Victoria Lin is careful on this point. In the MoT context, when she refers to “image modality,” she is often talking about image generation. She says her group did not find modality separation to help image understanding. For text-output tasks, if the model is outputting text, the input tokens may be better off passing through the same set of parameters whether those inputs are text or images. That is different from image-output tasks, where specialized image-generation parameters help.

This distinction leads to a broader claim about transfer. Understanding appears to help generation strongly. If the base model has better multimodal understanding, it has better information processing, planning, and reasoning capabilities. Those capabilities help it generate images with fine-grained details and generate infographics with more accurate information and less hallucination. In the Bagel examples, the model can generate a textual “thinking” trace before generating the final image, using that trace to plan composition, identify objects, decide details, and structure the scene before synthesis. Lin says this kind of “thinking before generation” is used by many state-of-the-art image-generation models today.

The reverse direction is much less established. Lin says there is little work showing that training an omni model further for non-text generation reliably improves understanding. In plain terms: spending a large token budget training for image generation does not necessarily improve the model’s ability to answer image-understanding questions.

If we train the model for image generation and we train it on a lot of token budget, it doesn't necessarily improve the model's ability to perform image understanding tasks.

Victoria Lin

Bagel is Lin’s example of a current omni model that reflects the field’s compromise. It uses an architecture closely related to MoT, with separate parameters for image generation, while its base model is a multimodal language model that can understand both text and images. It also uses different image representations for understanding and generation. Lin sees this as evidence that the field has not yet found an effective unified approach for image understanding and image generation, even though mixed-modal autoregressive modeling enables useful behaviors like planning before generating.

The same architectural idea has appeared in embodied AI. Lin points to robotics and vision-language-action models that use MoT-style architectures to predict action vectors. In that setting, action is another modality. The model may use a specialized parameter set for action prediction while relying on the broader language-model and self-attention structure to transfer world knowledge into action. The example shown, Pi 0.7, combines vision, language, action, observation memory, task instructions, subtask instructions, subgoal images, metadata, and an action expert. Lin’s point is that modality-aware architectures are not limited to images and speech; they are also relevant when the output is an action vector.

Language is not just another modality in the training signal

Lin uses Sergey Levine’s observation about language and video to explain why multimodal learning may not inherit all of language modeling’s dynamics. The tweet shown in the slides says: “I always found it puzzling how language models learn so much from next-token prediction, while video models learn so little from next frame prediction. Maybe it's because LLMs are actually brain scanners in disguise.”

Lin treats the observation as suggestive rather than settled. One hypothesis is that language is fundamentally different because it is a highly compressed abstraction of human cognition. Text contains reasoning, interpretations, intentions, and descriptions of actions. Training on next-token prediction over language is therefore not just training on surface symbol continuation; it is training on traces of human reasoning and decision-making.

Images and videos are different. They are sensory data and passive observations of the world, not necessarily subjective interpretations. A frame does not carry the same compressed cognitive structure as a sentence. Training a model to predict the next frame can therefore be much less informative than training a model to predict the next word in a reasoning-heavy text corpus.

Lin adds two other hypotheses. First, the loss landscape for images and videos is more complicated. She says experiments have shown cases where image or video generation does not look good to humans even though the loss has started to look good, suggesting the training objective may not align well with perceptual quality or semantic correctness. Second, next-frame prediction may be highly redundant because adjacent video frames often repeat most of the same information. A model can improve loss by modeling local frame correlations without learning much about causality, intent, or high-level structure.
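The redundancy hypothesis can be illustrated with a trivial baseline that copies the previous frame. The "video" below is a synthetic toy (100 pixels, one pixel changing per frame), not real data, and mean squared error stands in for whatever objective a real video model would use.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def copy_baseline_loss(frames):
    """Average loss when each frame is predicted as a copy of the previous one."""
    pairs = list(zip(frames, frames[1:]))
    return sum(mse(prev, nxt) for prev, nxt in pairs) / len(pairs)

def blank_baseline_loss(frames):
    """Average loss when every next frame is predicted as all zeros."""
    blank = [0.0] * len(frames[0])
    return sum(mse(blank, nxt) for nxt in frames[1:]) / (len(frames) - 1)

# Toy video: each frame switches on one more pixel than the last.
frames = []
frame = [0.0] * 100
for t in range(10):
    frame = list(frame)
    frame[t] = 1.0
    frames.append(frame)

copy_loss = copy_baseline_loss(frames)
blank_loss = blank_baseline_loss(frames)
print(copy_loss, blank_loss)
```

Copying the last frame already scores several times better than predicting a blank frame, even though it encodes nothing about why the scene changes. A model can capture most of the available loss reduction this way before learning anything about causality or intent.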

The broader implication is that success in applying language-model ideas to multimodal modeling should not be mistaken for closure. Tokenization, autoregression, scaling, instruction tuning, and Transformer backbones all transfer to some degree. But the modalities do not carry the same kind of information, and the same objective may not produce the same kind of intelligence.

In the Q&A, Lin returns to this point when asked whether multimodal training, especially video generation, might improve general knowledge work and agent capabilities. Her short answer is that, so far, pure video-generation training has not proved very effective for transferring back into knowledge work. There are more promising signals when video models are used inside robotics pipelines for future-state prediction, feedback, or action selection. But she says the field has not yet seen a model become massively better at agentic tasks simply by being trained for video generation.

The frontier is physical-world intelligence, not just better mixed-media documents

Victoria Lin draws a boundary between what current multimodal LLMs do well and what remains largely open. Models like Chameleon, Transfusion, and MoT address an important subset of multimodal intelligence: digital multimodal information processing. They can interpret and reason across text, images, audio, and sometimes video; they can generate content across modalities; they can support multimodal prompting and mixed-modal documents.

But this is still far short of full physical-world multimodal intelligence. The under-captured frontier includes spatial-temporal understanding, real-world grounding, direct interaction with physical environments, real-time perception-action loops, navigation and control, and long-horizon embodied feedback. Today’s systems can process multimodal information, but they do not yet have a full model of physical interaction.

This distinction shapes Lin’s answer to a question about JEPA-style world models and more abstract visual representations. She sees JEPA as an interesting architecture designed to model relations more efficiently, especially for real-world physical understanding. The multimodal space is rich enough that different applications may need different schools of representation. For infographic understanding, PDF understanding, and visual coding, the current patchify-and-encoder paradigm seems to work well. But infographics are not the same as real-world images, especially when the task requires physical and spatial understanding. For those problems, architectures designed for richer relational or semantic representations may be worth exploring.

When asked whether higher-level, object-oriented visual embeddings could help, Lin says yes. Humans naturally group parts that move together as one object, and semantic scene representations could be a valuable direction. She connects this again to JEPA-like research themes.

On spatial reasoning specifically, Lin is cautious, saying she is not an expert in the area. She notes progress in robotics labs and vision-language-action models, and observes that some of these systems are accelerated by multimodal language models: physical-intelligence-style systems can use vision-language models as backbones rather than training from scratch. That suggests positive transfer from multimodal language modeling to spatial understanding. But she also acknowledges approaches that move further away from language dependence and focus more heavily on vision signals, while saying she is less familiar with them.

Her conclusion is that multimodal models are computationally heavier than text-only models, creating additional training and inference infrastructure challenges. Because the problem space is complex, she expects more specialization in the near term: models customized for embodiment, infographics understanding, robotics, visual coding, speech, or other capabilities. The longer-term research question is how to unify those specialized systems into a coherent architecture.

Text remains the strongest abstraction, but not necessarily the final one

The strongest challenge to Lin’s framing is whether text should remain the central abstraction at all. Autoregressive next-token prediction has proved powerful, but one audience question pressed on a cognitive mismatch: humans appear to learn in a structured, hierarchical way, not by simply predicting the next token over a massive sequence.

Lin’s response is that alternatives exist, including diffusion language models and discrete diffusion models, but next-token prediction has proved highly effective. She suggests the apparent simplicity of the objective may be misleading. The surface task is next-token prediction, but the Transformer’s internal connections are complex enough that structure learning may emerge in latent space. The network may be doing more than the training objective visibly states.

A related challenge is whether text could be rendered as images, so that the model learns from image patches rather than symbolic text tokens. The motivation is unification: text formatting, highlighting, italics, layout, and infographic structure are naturally visual; perhaps representing text as images could reduce modality-specific engineering and simplify downstream architecture.

Lin says she has not run those experiments herself, but recalls seeing a paper that trained language models with images as input, allowing prompts to be supplied as OCR-like images, with good results. Her prediction, explicitly a subjective one rather than an experimental result, is that image-rendered text will be less efficient than raw text because text already has a useful symbolic structure. Rendering introduces choices about appearance, paragraph layout, and visual form that may be irrelevant to the underlying reasoning. She acknowledges a possible counterargument: screenshots can represent many words in a shorter context than tokenizing an entire paragraph. But her view remains that clean symbolic text is likely more efficient for scaling reasoning and agentic capabilities.

Based on what the field has learned so far, Lin says, it is better to align other modalities onto text than to align text onto other modalities. Text appears to drive the underlying reasoning and agentic capabilities. Representing text as OCR-like image patches might slow the scaling of those capabilities.

This does not mean reasoning must always happen only in text. A later exchange turns on whether video reasoning effectively requires synthetic augmentation through text — text in frames, text in audio, or text accompanying video — because video data lacks the explicit reasoning traces found in written corpora. Lin agrees that language currently serves as a useful skeleton for visual reasoning and that synthetic data is important for vision tasks. But she does not rule out pure vision models that can reason. If a video model could accurately predict future states from visual input alone, she says, that would count as a kind of reasoning.

The practical constraint is efficiency. Generated video frames are shown directly to users; unlike text chain-of-thought, there is no obvious hidden “reasoning frame” that can be stripped out after generation. Lin agrees that language-as-skeleton is currently much more efficient, partly because language is more abstract than vision. Models already move between visual signals and abstract space through captioning and similar training. She leaves open the possibility that, with enough compute or better representations, reasoning could occur in pure visual space, but treats that as an open question rather than a current fact.
