Vision-Language Models Understand Multimodal Inputs but Still Generate Text
Stanford’s CS336 lecture on alignment and multimodality, led by Percy Liang with Tatsunori Hashimoto, argues that the core problem in vision-language systems is still how to turn non-text data into tokens a Transformer can use. The lecture traces the field from CLIP and SigLIP through LLaVA and Qwen, presenting modern VLMs as largely built around a stable template: a vision encoder, an adapter, and a pretrained language model that generates text. Liang’s larger point is that these systems are powerful multimodal input models, but not true omni models; representing images and video without losing fine detail remains the central technical constraint.

The central problem is still tokenization
Multimodality enters as the missing capability in a course otherwise focused on language models. Text-to-text models are already broad: they can handle natural language, code, poetry, or even sequences such as DNA if those are represented as text. But the world also contains images, audio, and video, and the “North Star” is an omni model: a system that can accept any combination of modalities and produce any combination of modalities.
The constraint is architectural. Transformers work well enough that the practical question is not whether to replace them but how to feed them. Transformers “speak tokens.” In text, those tokens are usually discrete subwords. For multimodal systems, the word “token” has to broaden to include continuous embeddings, but the essential requirement remains: a token should represent some semantic unit of information.
That is easy to forget with text because tokenization has become routine. A BPE tokenizer is imperfect, but it gives a workable bridge from text into a Transformer. For non-text modalities, the equivalent bridge is harder. A pixel is not a semantic unit in the way a subword often is. The multimodal problem therefore splits into two questions: how to input non-text data into a Transformer, and how to output non-text data from one. The emphasis here is mostly on the first question: understanding images, and later images plus video, by converting them into representations a language model can consume.
One asymmetry governs the whole technical stack: understanding and generation may demand different encodings. A classifier may need only high-level semantic information: dog, car, chart, person, text. OCR or image generation needs fine detail: small letters, layout, colors, edges, high-frequency information. Much of modern vision-language modeling is a set of engineering compromises around that difference.
“The fundamental challenge I think when you're dealing with multimodality is how do you handle non-text modalities?”
That distinction also clarifies the boundary between most vision-language models and true omni models. LLaVA and Qwen-style systems can ingest images, multiple images, and video, then generate text about them. They are multimodal on the input side. They are not, in the architectures discussed, systems that generate images or video.
CLIP made web captions a semantic supervision signal
CLIP, introduced by Radford and colleagues in 2021, is presented as foundational for modern vision-language models. Its historical setting matters. Language modeling had already moved into the foundation-model era, with GPT-style systems trained on large amounts of noisy web text. Computer vision, by contrast, had long depended on annotated image datasets such as ImageNet and models trained for classification with carefully labeled examples.
CLIP asked whether vision could use the much larger supply of naturally occurring image-caption pairs on the web. The method is simple in outline. Given a batch of image-text pairs, encode each image with an image encoder and each corresponding text with a text encoder. If image embedding I1 belongs with text embedding T1, train the model so their dot product is larger than the dot product between I1 and the other texts in the batch. Symmetrically, train T1 to prefer its paired image over the other images.
In code terms, the image encoder produces image features, the text encoder produces text features, both are projected into a joint multimodal embedding space, L2-normalized, and compared with scaled pairwise cosine similarities. The labels are the diagonal of the resulting image-by-text similarity matrix. The loss is a pair of cross-entropy objectives: image-to-text and text-to-image, averaged together.
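The recipe above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the inputs stand in for already-projected features, and the temperature is a fixed illustrative value here, whereas CLIP learns it during training.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is one aligned pair."""
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text: each row is an n-way classification; label = diagonal.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: the same classification over columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2
```

With toy two-dimensional embeddings, a correctly paired batch gives a loss near zero, while swapping the two texts gives a large loss, because every image now ranks the wrong caption first.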
The visual explanation in the lecture is an image-by-text matrix. The aligned pairs sit on the diagonal; every off-diagonal entry is treated as a negative comparison. That structure is the key to the method: each batch becomes many classification problems without requiring a human-labeled object category for each image.
This is not caption generation. CLIP-style contrastive ranking was much more compute efficient than alternatives that tried to predict text from images directly, either as bag-of-words prediction or as a language model. The point is not to model the exact caption sequence. For ImageNet-style recognition, it is enough to learn a rough semantic representation of the image.
That distinction also explains why image-text training helps more than image-only augmentation. A student asks why one would train on image-text pairs instead of images alone. The answer contrasts CLIP with methods such as SimCLR, where an image is augmented by cropping, perturbing, or rotating it and the model is trained to treat the augmented views as equivalent. That can teach low-level invariance. But it will not “data augment your way from one type of dog to another dog.” Text supplies higher-level semantic grouping.
CLIP’s data was large and noisy. The paper trained on 400 million image-text pairs, derived by searching for roughly 500,000 queries and collecting up to 20,000 pairs per query. The dataset was not released. OpenCLIP later reproduced and extended the approach using LAION-5B, a five-billion-image dataset with textual descriptions, with CLIP itself used in filtering. That creates a bootstrapping loop: CLIP helps clean data used to train CLIP-like models.
The noise is not incidental. A student asks whether the contrastive objective is confused when a batch contains multiple dog images and multiple captions that could plausibly apply to dogs. The answer is that the process is noisy, but the average signal is still useful at scale. Web captions and alt text often do not literally describe the visible content. If an image shows a dog, the nearby text may not say “a dog” at all. Arbitrary web image-text pairs would be too noisy, so filtering matters. But at sufficient scale, the model can still learn.
The first image tokens were patches, not pixels
CLIP’s vision encoder was tested with ResNets and Vision Transformers. The stronger version used a Vision Transformer, and when people refer to CLIP, they usually mean the ViT version. A ViT breaks an image into patches — 16 by 16 in the original Vision Transformer paper, 14 by 14 in CLIP’s common configuration — flattens and projects each patch, adds positional embeddings, and passes the sequence through a standard Transformer encoder.
Each patch becomes a token-like unit. It is still not a full semantic object, but it is far closer to something a Transformer can process than individual pixels. CLIP’s best model is identified as ViT-L/14@336px: a large Vision Transformer using 14-by-14 RGB patches and 336-by-336 resolution images. The lecturer notes that CLIP trained at lower resolution first and then used higher resolution later, and suggests this was for speed because high-resolution images take longer.
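To make the patch arithmetic concrete, here is a small sketch of patch-grid counting and patch flattening. The helper names are hypothetical, the image is a nested list of pixels rather than a tensor, and the learned projection and positional embeddings are left out.

```python
def patch_grid(height, width, patch=14):
    """Return (rows, cols, n_tokens) for a non-overlapping patch grid."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols

def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patch vectors."""
    height, width = len(image), len(image[0])
    rows, cols, _ = patch_grid(height, width, patch)
    patches = []
    for r in range(rows):
        for c in range(cols):
            flat = []
            for y in range(r * patch, (r + 1) * patch):
                for x in range(c * patch, (c + 1) * patch):
                    flat.extend(image[y][x])  # append this pixel's channels
            patches.append(flat)
    return patches
```

Under these assumptions a 336-by-336 image with 14-by-14 patches yields a 24-by-24 grid, so 576 patch tokens, and each flattened RGB patch has 14 × 14 × 3 = 588 values before projection.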
Image preprocessing reveals the early assumptions. Internet images arrive in arbitrary sizes and aspect ratios, while neural nets prefer fixed shapes. CLIP resized images so the shorter side became 336 pixels, then center-cropped to 336 by 336. This is a convenience-driven heuristic. It made sense for ImageNet-style classification, where the object is often centered and cropping background may not hurt much. It is much less satisfactory for OCR or documents, where cropping or downsampling can destroy the information the model needs.
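A rough sketch of the resize-then-center-crop geometry shows how much of an image this heuristic can discard. The function names and the rounding choices are assumptions for illustration; actual preprocessing code may round or interpolate differently.

```python
def resize_then_center_crop_box(width, height, target=336):
    """Return the resized (width, height) and the crop box (left, top, right, bottom)."""
    scale = target / min(width, height)  # shorter side becomes `target`
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

def fraction_kept(width, height, target=336):
    """Share of the resized image area that survives the center crop."""
    (new_w, new_h), _ = resize_then_center_crop_box(width, height, target)
    return (target * target) / (new_w * new_h)
```

For a 4:1 panorama such as 1344 by 336 pixels, only a quarter of the image survives the crop. That is harmless when the object is centered and costly when the content is a wide table or a long line of text.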
The text encoder in CLIP was a GPT-2-style Transformer with 63 million parameters and 12 layers. To get a single text representation, CLIP encoded a sequence with beginning- and end-of-sequence tokens and used the final-layer activation at EOS. The image encoder similarly produces a single vector, using attention pooling rather than a simple average over patch activations.
The headline result was that zero-shot CLIP outperformed a ResNet-50 trained on 1.2 million ImageNet images. The zero-shot procedure is direct: encode the image, encode candidate label prompts such as “a photo of a dog,” and pick the text whose embedding has the highest similarity to the image embedding. The importance of the result is not merely benchmark performance. It showed that web-scale image-text data could compete with a heavily annotated supervised dataset on a canonical vision task.
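The zero-shot procedure reduces to a nearest-neighbor lookup in the joint embedding space. The sketch below assumes the embeddings have already been produced by the two encoders; the function names and the toy vectors in the usage note are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, prompt_embeddings):
    """prompt_embeddings maps a prompt string to its text embedding.

    Returns the prompt whose embedding is most similar to the image.
    """
    return max(prompt_embeddings,
               key=lambda prompt: cosine(image_embedding, prompt_embeddings[prompt]))
```

For example, with made-up prompt embeddings `{"a photo of a dog": [0.9, 0.1, 0.0], "a photo of a cat": [0.1, 0.9, 0.0]}` and an image embedding `[0.8, 0.2, 0.1]`, the function returns the dog prompt. No classifier head is trained; changing the label set only means encoding different prompts.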
Still, CLIP’s design decisions were made for image classification, not fine-grained visual reasoning. It captures semantics from noisy text because text usually describes semantic content. But it is not optimized to preserve every detail in the image.
SigLIP changed the loss to make training more practical
CLIP’s contrastive objective has a systems drawback: it relies on large batches and a softmax over the full batch. If the batch size is one, the objective is meaningless; if it is small, it degrades. Because each image competes against every text in the batch and vice versa, the loss is tied to batch size and less decomposable than ordinary language-model training.
SigLIP, from Zhai and colleagues in 2023, addresses this with a sigmoid loss. It is described as an improved version of CLIP. Instead of multiclass classification — the aligned pair versus all alternatives in the batch — SigLIP treats each image-text pair as a binary classification problem: are they aligned or not? In the batch matrix, the diagonal entries are positive examples and the off-diagonal entries are negative examples.
The implementation still computes normalized image and text embeddings, forms dot-product logits with a learnable temperature and bias, and constructs a matrix of labels. But the loss is a sum of log-sigmoid terms rather than a softmax classification over the batch. A student asks whether this requires sophisticated sampling because most possible image-text pairs are not aligned. The answer is that in the initial paper, the setup was straightforward and used the same type of matrix; hard-negative sampling and balancing are possible concerns for contrastive methods generally, but not the point of the initial SigLIP result.
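A minimal sketch of the sigmoid variant follows, for comparison with the softmax version. The names are hypothetical, and the `scale` and `bias` defaults are illustrative fixed values standing in for the learnable temperature and bias described above.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def siglip_loss(image_feats, text_feats, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss; row i of each input is one aligned pair."""
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = scale * sum(a * b for a, b in zip(imgs[i], txts[j])) + bias
            label = 1.0 if i == j else -1.0  # diagonal = aligned, else not
            total += log_sigmoid(label * logit)  # one independent binary term
    return -total / n
```

The loss is a plain sum of per-pair terms, with no normalization across a row or column. That is why the batch matrix can be split into blocks and computed piecewise.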
The data came from Google’s WebLI dataset: on the order of a billion image-text pairs scraped from the internet, with automatic OCR used to extract text from images, filtering to the highest-quality 10%, and support for 100 languages. The efficiency claim is stark. CLIP trained for 10 days on 256 TPUv3s; SigLIP trained for 5 days on 32 TPUv4s. Newer hardware does not explain the gap by itself: TPUv4s are not simply faster in FLOPs per second, and their advantage is more about pod scale and interconnect.
| Model | Objective | Training setup described |
|---|---|---|
| CLIP | Multiclass contrastive classification over the batch | 10 days on 256 TPUv3 |
| SigLIP | Binary aligned/not-aligned sigmoid loss over image-text pairs | 5 days on 32 TPUv4 |
The important conceptual shift is that SigLIP decouples batch size from the definition of the loss. With CLIP, changing the batch size changes the classification problem. With SigLIP, smaller batches increase variance but preserve the same expected objective. The SigLIP paper explored very large batches, up to one million, but found that 32,000 was effectively enough.
The dominant VLM template is encoder, adapter, language model
Once CLIP or SigLIP can encode an image, the next question is how to use those encodings inside an LLM. The standard vision-language-model template is: vision encoder, projector or adapter, language model. The vision encoder converts images into visual features. The projector maps those features into the embedding space expected by the language model. The language model then generates text.
This template is powerful but also limiting. It makes images, documents, charts, GUIs, and videos usable as context for a language model. It does not, by itself, make the model an image or video generator. Across LLaVA and Qwen-style systems, the multimodal material is on the input side and the output remains text, sometimes including structured textual references such as bounding boxes.
| Family | Visual representation | Adapter or fusion | Output modality discussed | Training emphasis |
|---|---|---|---|---|
| LLaVA | CLIP ViT-L/14 image encoding | Linear projection W into LLM embedding space | Text | Two-stage alignment and fine-tuning on GPT-4-synthesized instruction data |
| LLaVA OneVision | SigLIP encodings with AnyRes crops for single images, multi-image inputs, and video | Two-layer MLP projector | Text | Targeted task data, visual instruction tuning, and transfer across input structures |
| Qwen-VL / Qwen2-VL / Qwen3-VL | OpenCLIP, larger ViTs, dynamic resolution, SigLIP-2, image and video token sequences | Cross-attention, M-RoPE, interleaved M-RoPE, explicit timestamps, DeepStack | Text | Escalating pretraining, task data, long context, post-training, distillation, and RL |
| Chameleon | Discrete image tokens from VQ-VAE-style tokenization | Mixed-modal autoregressive LM rather than the same encoder-adapter pattern | Interleaved text and image tokens | Standard autoregressive modeling over mixed-modal token streams, with stability challenges |
LLaVA, introduced in 2023, is the simplest detailed example. It used CLIP as the vision encoder and Vicuna as the text decoder. Vicuna was a LLaMA model fine-tuned on ShareGPT conversations, meaning conversations people had with ChatGPT and shared online. LLaVA’s data came from MS COCO images, which had human annotations such as captions and bounding boxes. The LLaVA authors prompted GPT-4 with those captions or detected objects and asked it to generate conversations, detailed descriptions, and complex reasoning examples. These GPT-4 generations were paired back with the original images, producing 158,000 examples.
The architecture is deliberately simple. Encode the image with CLIP’s ViT-L/14. Apply a learned linear projection W so the visual vector lies in the same space as text-token embeddings. Concatenate the resulting visual embeddings with the embedded text instruction. Feed the sequence through the language model and train it to produce the language response.
Training has two stages. In the alignment stage, freeze both the vision encoder and the language model and train only W. The purpose is to make the visual embeddings look like something the pretrained language model can use. In the fine-tuning stage, keep the vision encoder frozen but train both W and the language model on image-plus-text-to-text examples.
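The data flow and the two training stages can be sketched without any deep-learning library. Everything here is a toy stand-in: vectors are Python lists, `W` is a small matrix, and the stage and module names are hypothetical labels for the components described above.

```python
def project(visual_vectors, W):
    """Apply the linear map W (d_llm rows, d_vision columns) to each vector."""
    return [[sum(w * x for w, x in zip(row, v)) for row in W]
            for v in visual_vectors]

def build_input_sequence(visual_vectors, W, text_embeddings):
    """Projected visual embeddings, followed by the embedded text instruction."""
    return project(visual_vectors, W) + text_embeddings

def trainable_modules(stage):
    """Which components receive gradient updates in each stage."""
    if stage == "alignment":
        return {"projector"}                    # encoder and LLM frozen
    if stage == "finetune":
        return {"projector", "language_model"}  # encoder still frozen
    raise ValueError(f"unknown stage: {stage}")
```

The language model sees one sequence of same-width vectors and has no separate code path for images. The projection is the only component trained in both stages.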
LLaVA’s demonstration example is the “extreme ironing” image: a man ironing clothes on the back or roof of a vehicle. Asked what is unusual about the image, LLaVA identifies the unconventional and unsafe ironing setup. The displayed comparison showed GPT-4 also recognizing the unusual premise, while BLIP-2 and OpenFlamingo missed or distorted the point.
The broader lesson is that an open model could achieve visible visual reasoning behavior with a comparatively modular recipe. It was not GPT-4-level, but it made the mechanism inspectable: pretrained vision encoder, pretrained language model, small adapter, and synthetic instruction data.
Resolution and task transfer became the practical bottlenecks
Tatsunori Hashimoto describes LLaVA OneVision as a 2024 successor in the LLaVA line after LLaVA 1.5 and LLaVA-Next. The architecture is the same broad template but with upgraded parts: SigLIP as the vision encoder, Qwen2 as the text decoder, and a two-layer MLP instead of a linear projection as the adapter. The model handles single images, multiple images, and video.
The key data-processing problem is high resolution. OCR makes this unavoidable. If a document image is resized and cropped to 336 by 336, small text becomes unreadable or disappears. LLaVA’s solution, introduced earlier in LLaVA 1.5, is AnyRes. Rather than force the whole image into one fixed-size crop, break the image into multiple pieces, each matching the vision encoder’s expected resolution, encode the pieces, and concatenate the resulting vectors. A downsampled full-image path can preserve global context, while crops preserve local detail. If the original image is too high-resolution and creates too many tokens, the system reduces the representation with interpolation.
The lecture’s AnyRes diagram shows this visually: a document-like image is split into a grid of local crops while a downsampled whole-image path is encoded in parallel. The local patch embeddings are interpolated, flattened, and concatenated before reaching the language model. The point is not just higher resolution; it is preserving both global layout and local details within a token budget.
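A simplified token-budget sketch of the AnyRes idea follows. It is not the LLaVA implementation: the tile size, tokens per tile, and the grid-shrinking rule are all assumptions chosen for illustration, and the interpolation step is modeled only as a reduced per-crop token count.

```python
import math

def anyres_token_plan(width, height, tile=336, tokens_per_tile=576,
                      max_crops=9, max_tokens=None):
    """Return (rows, cols, total_tokens) for one full view plus a crop grid."""
    cols = max(1, math.ceil(width / tile))
    rows = max(1, math.ceil(height / tile))
    # Shrink the longer grid side until the crop count fits the budget.
    while rows * cols > max_crops:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    n_crops = rows * cols
    per_crop = tokens_per_tile
    if max_tokens is not None:
        # Stand-in for interpolation: fewer tokens per crop, never more.
        budget = max(1, (max_tokens - tokens_per_tile) // n_crops)
        per_crop = min(tokens_per_tile, budget)
    total = tokens_per_tile + n_crops * per_crop  # full view + local crops
    return rows, cols, total
```

Under these assumptions a 672-by-672 image becomes a 2-by-2 crop grid plus the downsampled full view, for 2,880 tokens. A much larger image is held to nine crops and, if a token cap is given, to fewer tokens per crop.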
Hashimoto draws the analogy to language length. Transformers already handle variable-length text sequences. AnyRes piggybacks on that dynamic capacity: images can now produce variable-length sequences too.
For different modalities, LLaVA OneVision deliberately “puts its thumb on the scale.” A single image may get a downsampled full view plus up to nine crops. Multiple images are encoded more coarsely, because there are several of them. Video is encoded even more compactly per frame because long videos would otherwise exhaust context length and dominate the dataset with repetitive frames. The goal is not to treat every visual input identically, but to produce roughly comparable token lengths across single image, multi-image, and video examples.
| Input type | Token strategy described | Reason |
|---|---|---|
| Single image | Downsampled full image plus crops, up to nine crops in the shown example | Preserve local detail for high-resolution perception |
| Multiple images | Base-resolution representation for each image | Keep several images within a manageable token budget |
| Video | Fewer tokens per frame, with sampled frames | Avoid long videos dominating context and training |
The dataset reflects post-training rather than broad unsupervised pretraining. LLaVA OneVision’s stated philosophy is quality over quantity, but Hashimoto interprets that as targeted task data: visual question answering, chart QA, document QA, math reasoning, video tasks, multi-image tasks, and similar benchmarks. The work leans heavily on synthesized, task-specific examples and distills GPT-4 models where annotation budgets are unavailable.
Training also becomes staged. An initial language-image alignment stage trains only the projector, a high-quality knowledge-learning stage follows, and visual instruction tuning comes last. Hashimoto says it is not clear there is a principled reason for exactly this three-stage structure beyond moving from easier to harder and from knowledge-focused examples toward downstream-task-like examples.
The reassuring finding is transfer. LLaVA OneVision trains on single-image data for diagrams and charts, yet can generalize to multi-image settings where one image contains a table and another contains a diagram. It trains on OCR for single images and relational reasoning for multi-image data, yet can generalize to GUI-agent examples where sequential mobile screenshots must be interpreted as tap operations. It trains visual prompting — circles highlighting regions — on single images, yet can apply it to videos, following a highlighted soccer player across frames.
Hashimoto’s reaction is ambivalent but positive. At first glance, targeted task data can look like old-fashioned supervised learning. But if there are enough tasks, the models show transfer across input structures.
Qwen’s gains come from handling space, time, context, and fusion better
Tatsunori Hashimoto presents the Qwen vision-language models as variations on the same template, not as a clean architectural break. The useful through-line is not the release sequence itself, but the recurring pressure points: variable resolution, spatial and temporal position, long video context, data weighting, and the depth at which visual information enters the language model.
The first Qwen-VL already shows the pattern. It used OpenCLIP’s ViT-bigG vision encoder with 14-by-14 patches and an adapter based on one layer of cross-attention with 2D positional encodings, mapping visual information to a fixed length of 256. It also introduced special tokens for images, bounding boxes, and references. Its staged training began with large-scale lower-quality image-text data while freezing the language model and training the vision encoder plus adapter, then moved to higher-quality task-specific data and instruction tuning. The examples shown included OCR, comparing images, reading signs, and producing bounding boxes. Hashimoto’s framing is that Qwen aimed to add vision while retaining language-model capabilities such as code.
Qwen2-VL addresses the fixed-resolution weakness more directly. Its visual encoder is larger, and its image processing becomes dynamic. One image in the shown architecture can map to more than 11,000 tokens, while a tiny equation image can map to only 8. The processing is similar in spirit to AnyRes: encode 224-by-224 patches with a ViT/14, compress every 2-by-2 group, and produce a variable visual token sequence. For video, Qwen2 samples 2 frames per second and caps the visual sequence at 16,384 tokens.
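The token arithmetic behind dynamic resolution can be sketched as follows. This is an approximation under stated assumptions: each side is rounded to the nearest multiple of 28 pixels (one 14-pixel patch times the 2-by-2 merge), special boundary tokens are ignored, and the per-frame video budget is a simple even split of the cap. The function names and example sizes are hypothetical.

```python
def image_tokens(height, width, patch=14, merge=2):
    """Approximate visual token count after patching and 2x2 merging."""
    unit = patch * merge  # pixels covered by one merged token per side
    rows = max(1, (height + unit // 2) // unit)  # round to nearest unit
    cols = max(1, (width + unit // 2) // unit)
    return rows * cols

def per_frame_budget(seconds, fps=2, cap=16384):
    """Return (sampled_frames, tokens_available_per_frame) under the cap."""
    frames = max(1, int(seconds * fps))
    return frames, cap // frames
```

A 224-by-224 image comes out at 64 tokens, a thin 28-by-224 strip at 8, and a very wide 1092-by-8204 image at more than 11,000. A one-minute clip sampled at 2 frames per second leaves roughly 136 tokens per frame, which is why video must be encoded far more coarsely than a single image.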
The same shift forces a better account of position. Standard RoPE was introduced earlier in the course as a way to encode one-dimensional token distance. Images and videos have height, width, and time. Qwen2’s multimodal RoPE assigns each patch a triple of coordinates and computes rotary embeddings along those dimensions. Hashimoto describes the idea as straightforward, while noting that Qwen3 later changes the implementation because the first version is suboptimal.
Qwen3-VL keeps the broad architecture but sharpens several pieces. It uses Qwen3 dense and mixture-of-experts language models, including the displayed 235B-A22B configuration, and supports long-context understanding up to 256K (262,144) tokens. That length is especially relevant for long video. It also uses SigLIP-2 as the vision encoder, described as an improved but backward-compatible SigLIP architecture.
The positional encoding change is small in description but important in logic. Qwen2 grouped dimensions by axis: time, then width, then height. But RoPE dimensions correspond to different frequencies. If an axis gets only one band, it may be represented mostly at low or high frequency. Qwen3 interleaves temporal, width, and height axes across frequency bands so all axes get exposure to both low and high frequencies.
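The difference between grouping and interleaving is easiest to see by listing which axis owns which rotary frequency band. The sketch below is an illustration only: it splits the bands evenly across axes, which is an assumption, and the function names are hypothetical.

```python
def band_frequencies(n_bands, base=10000.0):
    """Standard RoPE frequencies, from highest (band 0) to lowest."""
    return [base ** (-i / n_bands) for i in range(n_bands)]

def chunked_axes(n_bands, axes=("time", "width", "height")):
    """Grouped layout: each axis owns one contiguous block of bands."""
    if n_bands < len(axes):
        raise ValueError("need at least one band per axis")
    per_axis = n_bands // len(axes)
    return [axes[min(i // per_axis, len(axes) - 1)] for i in range(n_bands)]

def interleaved_axes(n_bands, axes=("time", "width", "height")):
    """Interleaved layout: axes alternate across the whole spectrum."""
    return [axes[i % len(axes)] for i in range(n_bands)]

def frequency_span(assignment, axis, base=10000.0):
    """Return (lowest, highest) frequency assigned to one axis."""
    freqs = band_frequencies(len(assignment), base)
    own = [f for a, f in zip(assignment, freqs) if a == axis]
    return min(own), max(own)
```

With 12 bands, the grouped layout gives the first axis only the highest frequencies and the last axis only the lowest, so their spans do not overlap. The interleaved layout gives every axis a span from near the top of the spectrum to near the bottom.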
Qwen3 also adds explicit video timestamps as text tokens. Previously, time was implicit in positional encodings. In Qwen3, tokens such as “+0.5 seconds” appear before frame embeddings. Hashimoto’s interpretation is that explicit timestamps help because users can directly refer to time: “what happened at two seconds?”
Another training adjustment is square-root-normalized per-token loss. Video examples are long. If every token contributes equally, video can dominate training. Hashimoto says the details are not fully clear, but his reading is that Qwen3 normalizes each example by roughly the square root of its length to down-weight very long examples.
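Since the exact formulation is described as unclear, the following is only one plausible reading: divide each example's summed token loss by the square root of its length, so that an example's influence grows with length but sublinearly. The names and the three comparison modes are assumptions for illustration.

```python
import math

def example_weight(n_tokens, mode):
    """Effective weight of one example relative to a single average token loss."""
    if mode == "per_token":    # summed token losses: weight grows linearly
        return float(n_tokens)
    if mode == "per_example":  # mean over tokens: every example counts once
        return 1.0
    if mode == "sqrt":         # sum divided by sqrt(length)
        return math.sqrt(n_tokens)
    raise ValueError(f"unknown mode: {mode}")

def sqrt_normalized_loss(token_losses):
    """Summed token losses divided by the square root of the sequence length."""
    return sum(token_losses) / math.sqrt(len(token_losses))
```

Under this reading, a 10,000-token video example outweighs a 100-token text example by 100 to 1 with per-token weighting, by 10 to 1 with square-root normalization, and not at all with per-example averaging. The square-root rule sits between the two extremes.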
Percy Liang then explains the adapter change in Qwen3 as deeper fusion between the vision encoder and language model. Qwen3 uses DeepStack, a cross-layer fusion mechanism introduced in the DeepStack paper by Meng and colleagues. Earlier adapters treated the vision encoder as a black box that outputs a sequence of vectors to the language model. DeepStack instead takes the stack of vision embeddings from multiple encoder levels and injects visual information directly into the residual stream of multiple language-model layers.
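A toy sketch of the injection pattern, under heavy simplification: hidden states are lists of vectors, each language-model layer is any function from hidden states to hidden states, and one level of visual features is added to the visual token positions before each of the first few layers. The function names, the additive rule, and the layer schedule are assumptions, not the published design.

```python
def inject(hidden, visual_feats, visual_positions):
    """Add one level of visual features into the residual stream (copying)."""
    out = [row[:] for row in hidden]
    for pos, feat in zip(visual_positions, visual_feats):
        out[pos] = [h + f for h, f in zip(out[pos], feat)]
    return out

def run_with_deep_fusion(hidden, layers, visual_stack, visual_positions):
    """Run the layers, injecting visual_stack[i] before layer i while levels remain."""
    for i, layer in enumerate(layers):
        if i < len(visual_stack):
            hidden = inject(hidden, visual_stack[i], visual_positions)
        hidden = layer(hidden)
    return hidden
```

The contrast with the earlier adapters is that visual information enters at several depths rather than only as input embeddings at the bottom of the stack. Text positions are left untouched by the injection itself.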
The training pipeline has become elaborate. Qwen3 pretraining has four stages: vision-language alignment, multimodal pretraining at 8K sequence length, long-context pretraining at 32K, and ultra-long-context adaptation at 262,144 tokens. Post-training includes supervised fine-tuning on long chain-of-thought data, knowledge distillation, and reinforcement learning. Liang characterizes Qwen3-VL at this point as “really kind of a systems paper,” with many details and benchmark results shown as competitive with closed models such as Gemini, GPT-5, and Claude Opus 4.1 on many rows in the displayed table.
The conclusion is not that Qwen found a wholly different recipe. The broad framework remains stable. The gains come from sharpening the representation and the training system: preserve resolution where it matters, prevent video from overwhelming loss and context budgets, encode space and time more naturally, fuse visual signals deeper into the language model, and scale both data and context length.
These systems understand multimodal inputs but still generate text
A student asks how the model knows whether to output video or text. Percy Liang answers directly: the VLMs discussed do not generate video or images. The multimodal side is on the input. The output is text. In ordinary supervised stages, every output token is supervised by the dataset. There is no inherent decision by the model to choose an output modality because there is only one output modality in the architecture. Reinforcement learning can introduce different reward structures, but it does not change that these models, as discussed, are text-generating VLMs.
That clarification is important because “multimodal” and “omni” can blur together. LLaVA, LLaVA OneVision, and Qwen-VL variants can ingest images, multiple images, and video. They can answer about them, describe them, read text, reason over charts, and sometimes output structured textual references such as bounding boxes. But they are not image or video generators.
Students also ask whether multimodal training is harder from a systems perspective. Liang says it is certainly not easier. Video data is large, and even loading it can become a bottleneck, unlike text loading, which is comparatively cheap in the language-model setting. The same systems principles apply — asynchronous data loading, balancing compute and input pipelines — but the data modality makes them more pressing.
The token balance question recurs. Images and especially videos can produce many tokens, but Liang cautions against assuming multimodal tokens vastly outnumber text tokens; large language models are trained on enormous text corpora. The issue is not only total count but weighting. If a video example produces a long sequence, it can receive disproportionate loss weight unless down-weighted or normalized. This connects Qwen3’s square-root-normalized loss to the broader data-mixture problem.
Another student asks about alignment and whether the language model must be pretrained. Liang says it must be. If the language model has no pretrained embedding space or linguistic capability, “alignment” has nothing to align to. In the alignment stage, the language model is frozen and the adapter is trained to connect a given vision encoder to a given pretrained language model. The token budget, such as the 67 billion tokens shown for Qwen3’s adapter stage, is a training choice rather than an adaptive threshold.
A final architectural question concerns parameter scale. Are vision encoders much smaller than the language model? In general, Liang says yes. The projector is much smaller, and the ViT is generally under a billion parameters. His explanation is that the vision encoder performs a more local operation: it looks at patches and extracts visual features. Most of the model’s knowledge and reasoning capability remains in the language model.
Chameleon pursued a cleaner but harder route to text-and-image generation
Chameleon, from Meta’s Chameleon Team in 2024, represents a different design. Instead of encoding images as continuous vectors and injecting them into an LLM, it maps everything into discrete tokens. Percy Liang presents this as aesthetically appealing, especially from a language-modeling perspective: text and images can be analyzed and generated by the same autoregressive model. A prompt can contain text and image tokens; the output can interleave text and image tokens. In the paper’s example, the model responds to a request for quirky-looking birds with text descriptions and generated bird images interleaved.
This is closer to text-and-image generation in one autoregressive model than the input-only VLMs discussed earlier. The price is that images must be discretized. Chameleon uses the older idea of a Vector Quantized Variational Autoencoder, or VQ-VAE. An image is encoded into continuous latent features, each latent vector is rounded to the nearest entry in a learned codebook, and the decoder reconstructs the image from those discrete code indices. The codebook might contain 8,000 entries, each a prototypical vector for a visual patch-like unit. The training objective includes reconstruction loss plus vector-quantization terms needed because the rounding operation is not directly differentiable.
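The rounding step at the heart of vector quantization is a nearest-neighbor lookup. The sketch below covers only that step, with a tiny hand-written codebook. The encoder, decoder, and training losses are omitted, and the function names are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(vector, codebook[k]))

def quantize(latents, codebook):
    """Map each continuous latent vector to a discrete code index."""
    return [nearest_code(v, codebook) for v in latents]

def dequantize(indices, codebook):
    """Look the code vectors back up; this is all the decoder gets to see."""
    return [codebook[k] for k in indices]
```

With the codebook `[[0, 0], [1, 1], [5, 5]]`, the latents `[0.1, -0.2]`, `[0.9, 1.2]`, and `[4.0, 4.5]` map to indices 0, 1, and 2. Whatever distinguished `[4.0, 4.5]` from `[5, 5]` is gone after the round trip, which is the information loss that discrete tokenization accepts.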
The lecture’s Chameleon diagram shows the appeal of the approach: an image tokenizer and image de-tokenizer wrap a mixed-modal autoregressive language model. Text tokens and image tokens flow through one sequence model, allowing mixed-modal pretraining and mixed-modal generation with interleaved outputs.
In Chameleon, a 512-by-512 image becomes 1,024 tokens drawn from a codebook of size 8,192. The model also trains a new BPE tokenizer, because the combined text-image token stream differs from ordinary natural language. Once everything is tokenized, training is conceptually straightforward: standard autoregressive language-model training over mixed-modal sequences. Stage 1 is large-scale unsupervised training with 2.9 trillion text tokens, 1.5 trillion text/image tokens, and 400 billion interleaved text/image tokens. Stage 2 mixes half of that stage-one data with high-quality data.
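A quick back-of-envelope calculation shows how aggressive this compression is. The sketch assumes the 1,024 tokens form a square grid and compares against raw 8-bit RGB pixels; both are illustrative assumptions, and the function name is hypothetical.

```python
import math

def image_token_stats(side=512, n_tokens=1024, codebook_size=8192):
    """Return (grid_side, pixels_per_token_side, token_bits, compression_ratio)."""
    grid = math.isqrt(n_tokens)                 # tokens per side of the grid
    pixels_per_token_side = side // grid        # pixels each token covers per side
    token_bits = n_tokens * math.log2(codebook_size)
    raw_bits = side * side * 3 * 8              # 8-bit RGB pixels
    return grid, pixels_per_token_side, token_bits, raw_bits / token_bits
```

Under these assumptions each token stands for a 16-by-16 pixel region and carries 13 bits, so the whole image is described by 13,312 bits against roughly 6.3 million raw bits. That is a reduction of more than 470 times, which is fine for overall appearance and risky for small text.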
The elegance does not eliminate modality differences. Chameleon encountered training stability problems because text tokens and image tokens have different entropy. Next-word prediction in text is often relatively low-entropy; many words are predictable from context. Image-token prediction is higher-entropy; the exact shade or local detail is harder to predict. Liang says this led to norm growth and logit drift. The fixes included QK norm and z-loss regularization.
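Of the two fixes, z-loss is the simpler to sketch: it penalizes the squared log of the softmax normalizer, which discourages logits from drifting to large magnitudes. The coefficient below is an illustrative assumption rather than the value Chameleon used, and the function names are hypothetical.

```python
import math

def log_partition(logits):
    """Stable log of the softmax normalizer, log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def z_loss(logits, coef=1e-4):
    """Auxiliary penalty that pulls log(Z) toward zero."""
    return coef * log_partition(logits) ** 2
```

Adding a constant to every logit leaves the softmax probabilities unchanged but raises this penalty, so the term removes a degree of freedom along which logits could otherwise drift without affecting the main loss.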
The deeper limitation is information loss. Discretization compresses images into a finite set of codes. That may be acceptable for some generation tasks, and VQ-VAE-style tokenization was once popular for transformer-based image generation. But for OCR or fine-grained perception, small details can vanish. Liang does not dwell on Chameleon’s results because his point is structural: the unified-token approach is elegant, but in this form it was not as performant, and training multiple modalities in one autoregressive stream is tricky.
The likely frontier recipe is hybrid, not pure
Percy Liang closes with a pragmatic view. Frontier systems are now expected to be multimodal, and increasingly described as natively multimodal or omni. But public frontier releases rarely disclose how they are built. He speculates that strong current systems likely combine continuous encoders, Transformer trunks, and diffusion models for generation, but presents that as speculation rather than disclosed fact.
The reason is the asymmetry running through the material. CLIP-style encoders are still useful because they capture image semantics well. Even years later, Liang says CLIP or similar ideas remain the go-to way to encode vision for understanding. Transformers remain the central sequence-processing trunk. Diffusion models, though not covered in detail, are strong for generation because they can optimize fine-grained visual detail.
Purely discrete autoregressive image-token modeling may still improve with scale. Liang explicitly leaves open the possibility that standard LLM-style generative pretraining over discretized image tokens could eventually win. But in his current framing, continuous representations still “hold favor,” particularly for preserving information in understanding tasks, while diffusion remains attractive for high-detail generation.
The main engineering lessons are stable across the model families. First, the non-text modality must be converted into something token-like. Second, the representation must match the task: semantic classification, OCR, chart reasoning, GUI control, video understanding, or generation. Third, data curation and data mixture matter as much as architecture, especially when video and images can dominate by token count while carrying lower information density than text. Fourth, much of the field’s progress has come not from replacing the template but from sharpening it: better encoders, better adapters, dynamic resolution, long-context training, explicit timestamps, more carefully weighted losses, and larger pretrained language models.



