Open Image Models Converge on Flow Matching and DiT Architectures

Shervine Amidi | Stanford Online | Monday, June 1, 2026 | 23 min read

Stanford adjunct lecturer Shervine Amidi uses Lecture 8 of CME296 to argue that modern visual generation is best understood as a stack of choices for transporting noise into data: the paradigm, representation, architecture, training procedure, and evaluation method. He presents flow matching as the current default for image-generation systems, diffusion transformers as the dominant architectural direction, and latent spaces as a practical compression tradeoff now being challenged by scaled pixel-space models.

The course’s unifying problem was transport from an easy distribution to a hard one

Shervine Amidi framed the quarter around a deceptively simple objective: given a prompt such as “a teddy bear reading a book,” generate an image aligned with that prompt. The machinery underneath that objective was split into tractable pieces: the generation paradigm, the image representation, the model architecture, the training process, and evaluation.

The mathematical core treated image generation as a problem of moving from an easy distribution to a hard one. The data distribution of natural images is complex and unknown; a Gaussian distribution is easy to sample. Diffusion, score matching, and flow matching were presented as three ways of formalizing the route from one to the other.

In the diffusion formulation, a forward process corrupts a clean image into Gaussian noise, and the model’s task is to learn the reverse process. The derivation begins from maximum likelihood, which is not directly tractable, and proceeds through an evidence lower bound. The resulting objective becomes a simple regression problem: given a noised image and a noise level, predict the noise that was added. The lecture displayed the loss as an expectation over a squared error between the model’s predicted noise and the actual noise added to the image:

L = E_{t, x_0, ε}[ || ε_θ(√ᾱ_t x_0 + √(1-ᾱ_t) ε, t) - ε ||² ]

That formula mattered less as an isolated equation than as a pattern. A model does not need to understand an image “from scratch”; it needs to learn, at many noise levels, what part of the input is noise and how to remove it.
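As a rough illustration of the regression target in the loss above, the following sketch noises a toy vector with the forward process and scores two candidate predictions. It uses only the Python standard library; the toy vector, the value chosen for ᾱ_t, and the function names are assumptions made for illustration, and no neural network is involved.

```python
import math
import random

def noise_image(x0, alpha_bar, rng):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

def noise_prediction_loss(predicted_eps, eps):
    """Squared L2 distance between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, eps))

rng = random.Random(0)
x0 = [0.5, -0.2, 0.9, 0.1]   # toy "image" flattened to a vector (assumption)
alpha_bar = 0.7              # assumed cumulative signal level at some step t
xt, eps = noise_image(x0, alpha_bar, rng)

# An oracle that knows x0 can invert the forward process and recover eps.
oracle = [(x - math.sqrt(alpha_bar) * c) / math.sqrt(1.0 - alpha_bar)
          for x, c in zip(xt, x0)]
loss_oracle = noise_prediction_loss(oracle, eps)

# A model that always predicts "no noise" pays the full squared norm of eps.
loss_zero = noise_prediction_loss([0.0] * len(eps), eps)
```

A trained network sits between these two extremes: it sees only the noised vector and the noise level, never the clean input.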

Score matching gave a second interpretation. Instead of asking what noise to remove, the model asks where to move. The score is the gradient of the log data density, ∇_x log p_data(x). Amidi described it as a compass: if one knew the score, Langevin dynamics would allow samples to move from noise toward the data distribution. The difficulty is that the true score is unknown.
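The compass picture can be made concrete in a case where the score is known in closed form. The sketch below runs unadjusted Langevin dynamics toward a one-dimensional Gaussian; the target parameters, step size, and function names are assumptions chosen for illustration. For real images the score is unknown, which is exactly the difficulty described above.

```python
import math
import random

def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / (sigma ** 2)

def langevin_sample(score, x_init, step_size, num_steps, rng):
    """Unadjusted Langevin dynamics: follow the score, plus injected noise."""
    x = x_init
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * noise
    return x

rng = random.Random(0)
mu, sigma = 3.0, 0.5   # hypothetical "data" distribution

# Each chain starts from standard Gaussian noise and drifts toward the target.
samples = [
    langevin_sample(lambda x: gaussian_score(x, mu, sigma),
                    x_init=rng.gauss(0.0, 1.0),
                    step_size=0.05, num_steps=300, rng=rng)
    for _ in range(500)
]
sample_mean = sum(samples) / len(samples)
```

The chains start centered at 0 and end centered near `mu`, because every step moves along the score before adding fresh noise.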

The workaround was to add Gaussian noise to data and estimate the score of the resulting noisy distribution. That creates a tradeoff. With heavy noise, the score is easier to estimate but farther from the true data distribution. With light noise, the target is closer to the desired score but harder to estimate. The resolution was to estimate a score conditioned both on position and on noise level. The denoising score matching loss trains a model to estimate the score of the noisy distribution from a noisy image and a noise level.

The two views then meet. In the forward process, the score of a noisy sample conditioned on its clean source is proportional to the negative of the noise that was added:

∇_{x_t} log q(x_t|x_0) = -ε / √(1-ᾱ_t)

That identity is why the noise-prediction and score-prediction views are so close. Diffusion asks the model to predict what to remove; score matching asks it to estimate where the data lies.
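The identity follows in a few lines from the Gaussian form of the forward process that appears inside the noise-prediction loss. The derivation below is the standard one, written out for convenience rather than transcribed from the lecture slides; it assumes x_t = √ᾱ_t x_0 + √(1-ᾱ_t) ε with ε drawn from a standard Gaussian.

```latex
\begin{align}
q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right) \\
\log q(x_t \mid x_0) &= -\frac{\lVert x_t - \sqrt{\bar\alpha_t}\,x_0 \rVert^2}{2\,(1-\bar\alpha_t)} + \text{const} \\
\nabla_{x_t} \log q(x_t \mid x_0) &= -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t}
 = -\frac{\sqrt{1-\bar\alpha_t}\,\varepsilon}{1-\bar\alpha_t}
 = -\frac{\varepsilon}{\sqrt{1-\bar\alpha_t}}
\end{align}
```

The last step substitutes x_t - √ᾱ_t x_0 = √(1-ᾱ_t) ε, so a network that predicts ε can be rescaled into a score estimate and vice versa.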

The discrete formulation also raised a design burden: how many steps, what schedule, and how to choose noise levels. The continuous formulation placed diffusion inside stochastic differential equations. A forward process can be written as a deterministic drift term plus a stochastic diffusion term:

dx = f(x,t)dt + g(t)dW

A corresponding reverse process exists, but requires knowing the score:

dx = [f(x,t) - g(t)^2 ∇_x log p_t(x)]dt + g(t)dW̄

The DDPM formulation was described as a variance-preserving special case. Noise-conditioned score networks were described as a variance-exploding special case. The continuous formulation made those earlier approaches part of a wider family.

Flow matching then reframed the same goal as mass transport. Rather than beginning with “noise removal,” it treats the noisy distribution as an initial distribution and the data distribution as a target distribution. The key object is a vector field, or velocity, u_t(x), defined at every position and time. At the level of individual samples, one follows an ordinary differential equation:

dx/dt = u_t(x)

At the level of distributions, the evolution is governed by the continuity equation:

∂p_t(x)/∂t = -∇ · (p_t u_t)(x)

The training problem becomes estimating the vector field with a model u_t^θ(x). The inference procedure is then to sample from the easy initial distribution and numerically solve the ODE from time 0 to time 1:

x_1 = x_0 + ∫_0^1 u_t^θ(x_t)dt

Conditional flow matching makes the target vector field tractable by considering paths from the initial distribution to individual target samples, then learning an aggregate field. Amidi’s practical recommendation was explicit: in 2026, flow matching is the default for modern image-generation systems, and if students wanted to master one paradigm, it should be flow matching. In particular, he pointed to rectified flow, a variant designed to make paths straighter so that inference can use fewer numerical solver steps.
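A minimal sketch of the conditional construction and of Euler sampling is given below, assuming the straight-line path x_t = (1 - t) x_0 + t x_1 that rectified flow uses. The single target point, the toy velocity field standing in for a trained model, and all function names are assumptions for illustration. Here x_0 denotes the noise sample and x_1 the data sample, matching the ODE above.

```python
import random

def conditional_path(x0, x1, t):
    """Straight-line path from a noise sample x0 to a data sample x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def conditional_velocity(x0, x1):
    """Regression target along that path: d x_t / dt = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x, dt = list(x0), 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        u = velocity(x, t)
        x = [xi + dt * ui for xi, ui in zip(x, u)]
    return x

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(3)]   # sample from the easy distribution
x1 = [1.0, -2.0, 0.5]                          # one hypothetical data point

# Stand-in for a trained model: the exact field that carries any x to x1.
def toy_velocity(x, t):
    return [(b - xi) / (1.0 - t) for xi, b in zip(x, x1)]

one_step = euler_sample(toy_velocity, x0, num_steps=1)
ten_steps = euler_sample(toy_velocity, x0, num_steps=10)
```

Because the path is a straight line, one Euler step lands on the target as accurately as ten. That is the motivation for straightening paths: curved trajectories need many small solver steps, straight ones do not.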

People nowadays, they use flow matching by default.

Shervine Amidi

The mathematical core therefore reduced to three equivalent-looking questions: What noise should be removed? Where should a sample move? What velocity field transports one distribution into another?

Latent space solved dimensionality by accepting a fidelity tradeoff

The early derivations treated images as vectors. That abstraction was useful mathematically, but insufficient for building systems. A pixel-space image is high-dimensional and redundant: neighboring pixels are strongly correlated, and much of the raw representation is not useful for generation. The practical question became how to represent images in a smaller, more learnable space.

The autoencoder was introduced as the first answer. It compresses an image through an encoder into a latent space, then reconstructs it through a decoder. The bottleneck forces the latent representation to retain useful information while discarding redundancy. But a plain autoencoder does not control the shape of the latent space. If the latent distribution contains isolated spikes and empty gaps, generation remains hard.

The variational autoencoder added a regularization term. Its loss combines reconstruction with a term that pushes the latent distribution toward a prior:

L_VAE = -E_z[log p_θ(x|z)] + KL(q_φ(z|x)||p(z))

The first term preserves the ability to reconstruct the input. The second structures the latent space so that sampling and generation become easier. This reintroduced the ELBO trick from the diffusion derivation, but for representation learning rather than reverse-process learning.
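The two terms can be sketched numerically under common modeling assumptions that the lecture summary does not spell out: a diagonal Gaussian encoder q_φ(z|x), a standard normal prior p(z), and a unit-variance Gaussian decoder, so that the reconstruction term reduces to half a squared error up to a constant. Function names are illustrative.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction term plus KL regularizer for one example."""
    # -log p(x|z) for a unit-variance Gaussian decoder, dropping constants.
    reconstruction = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return reconstruction + kl_to_standard_normal(mu, log_var)

# A posterior that already equals the prior pays no regularization cost.
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting the posterior mean by 1 in one dimension costs 0.5 nats.
kl_shifted = kl_to_standard_normal([1.0], [0.0])
```

The KL term grows as the encoder pushes latents away from the prior, which is what discourages the isolated spikes and empty gaps of a plain autoencoder.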

Representation was then extended beyond images alone. Transformer-based encoders such as vision transformers were discussed, as were multimodal embedding spaces such as CLIP, which use contrastive losses to align modalities. The conditioning problem was handled through methods such as classifier-free guidance, which steer generation toward the prompt without relying on a separate classifier.

The architecture question followed naturally. If a model receives a noisy latent, a noise level, and a condition, what should it output? Under the flow matching paradigm, the answer is velocity: u_θ(x_t, t, c).
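The conditional velocity is also where classifier-free guidance is usually applied at inference time. The sketch below uses the commonly cited extrapolation form, which is an assumption here since the lecture summary does not give the formula; the guidance scale `w` and the example vectors are hypothetical.

```python
def guided_velocity(u_cond, u_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and past) the conditional one."""
    return [uu + w * (uc - uu) for uc, uu in zip(u_cond, u_uncond)]

# Hypothetical model outputs for the same x_t and t, with and without the prompt.
u_with_prompt = [0.8, -0.1, 0.3]
u_without_prompt = [0.5, 0.0, 0.1]

no_guidance = guided_velocity(u_with_prompt, u_without_prompt, w=0.0)
plain_conditional = guided_velocity(u_with_prompt, u_without_prompt, w=1.0)
amplified = guided_velocity(u_with_prompt, u_without_prompt, w=3.0)
```

With `w = 1` the result is the ordinary conditional prediction; larger values push the sample further in the direction the prompt adds, using only the generator itself and no separate classifier.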

U-Nets were presented as the earlier dominant architecture. Their downsampling path gives the model wider receptive fields and global understanding; their upsampling path restores shape; skip connections preserve lower-level detail. But convolutional U-Nets have a limitation: distant patches in the image cannot directly interact at full resolution.

Diffusion transformers addressed that limitation by using self-attention over patches. The example Amidi gave was a teddy bear looking at itself in a mirror: lower-level details in distant regions may need to be coordinated. DiT-style models can provide direct interactions between patches through attention. Conditions can be injected through mechanisms such as adaptive layer normalization, which modulates patch embeddings.
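Adaptive layer normalization can be sketched for a single patch embedding as follows. The scale-and-shift form mirrors the modulation used in the original DiT code; in a real model the scale and shift come from a small network applied to the timestep and prompt embeddings, whereas here they are hypothetical constants supplied by hand.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one embedding to zero mean and (nearly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_modulate(patch_embedding, scale, shift):
    """Normalize, then apply a condition-dependent scale and shift."""
    normed = layer_norm(patch_embedding)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]

patch = [0.2, 1.5, -0.7, 0.9]        # toy patch embedding

# With zero scale and shift, the block behaves like a plain layer norm.
neutral = adaln_modulate(patch, [0.0] * 4, [0.0] * 4)

# A different condition yields a different affine transform of the same patch.
conditioned = adaln_modulate(patch, [1.0] * 4, [0.5] * 4)
```

The condition never enters as extra tokens in this mechanism; it only rescales and shifts the normalized features in each block.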

By 2026, Amidi said, image-generation models were largely DiT-based or based on variants. The historical sketch placed U-Net-based systems earlier, including DDPM-style pixel-space U-Nets, latent diffusion, and Stable Diffusion XL. Later systems such as FLUX.1, Qwen-Image, Stable Diffusion 3, and Z-Image were placed on the DiT side of the timeline. The timeline was not exhaustive, but the direction was the point: the field had moved strongly toward transformer-based image generation.

Training choices decide which parts of generation get more effort

Shervine Amidi described model training as more than minimizing the generation loss. Before discussing training stages, he revisited how time steps are sampled. Earlier derivations sampled time uniformly, but that is not optimal when noise levels are not all equally difficult.

The hardest regions are neither nearly clean nor completely noisy. They are the middle noise levels, where the model must make consequential decisions about where the sample should go. That motivated use of a logit-normal distribution, which places more mass on middle steps.
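A logit-normal time step is simply a Gaussian sample passed through a sigmoid. The sketch below compares how often the sample falls in the middle half of the interval; the mean of 0 and standard deviation of 1 are assumed defaults, not values quoted in the lecture.

```python
import math
import random

def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by squashing a Gaussian sample through a sigmoid."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
timesteps = [sample_logit_normal(rng) for _ in range(10000)]

# Share of samples landing in the middle half of the time interval.
# A uniform sampler would put about 0.5 of its mass here.
middle_fraction = sum(0.25 <= t <= 0.75 for t in timesteps) / len(timesteps)
```

Shifting `mean` moves the emphasis toward noisier or cleaner steps, and shrinking `std` concentrates training even more tightly around the middle.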

Resolution also changes perceived noise. For a fixed noise level, a lower-resolution image appears noisier than a higher-resolution image, because there are fewer nearby pixels from which one can infer the underlying value. Higher-resolution training therefore calls for somewhat more noise. The intuition is spatial correlation: if an image area is represented by more pixels, some noisy pixels can be compensated for by nearby pixels that still reveal structure.

Training itself was divided into stages.

Pre-training teaches the model how to generate images at all. It is the most expensive and time-consuming stage because it requires a large, high-quality corpus with the right mixture of examples. The pre-training data distribution matters because it defines what the model can learn to generate.

Post-training teaches the model how to generate good images. “Good” can mean aesthetically pleasing, but it can also mean aligned with a target domain. Continued training can adapt a pre-trained model to a field or object category that was underrepresented in the original corpus.

Tuning is optional and narrower. If a user wants a model to repeatedly generate a particular subject, object, or person, the model can be tuned on a small set of images. DreamBooth was given as the example: collect roughly five to ten images, associate the target subject with a rare token, and train the model so that the token evokes the subject. Because full model tuning is expensive for large systems, low-rank adaptation, or LoRA, can instead freeze the original weights and train only a small set of added low-rank matrices.
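The saving from low-rank adaptation can be sketched with plain nested lists. The effective weight is the frozen matrix plus a product of two thin matrices; the 4096-wide layer, the rank of 8, and the helper names are assumptions for illustration, and the usual scaling factor is reduced to a single `scale` argument.

```python
def matmul(left, right):
    """Plain nested-list matrix product."""
    return [[sum(left[i][k] * right[k][j] for k in range(len(right)))
             for j in range(len(right[0]))]
            for i in range(len(left))]

def lora_effective_weight(w, a, b, scale=1.0):
    """Frozen weight w (d_out x d_in) plus low-rank update b @ a.
    Shapes: b is (d_out x r), a is (r x d_in); only a and b are trained."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, rank):
    """Number of trained values in the two low-rank factors."""
    return rank * (d_out + d_in)

w = [[1.0, 0.0], [0.0, 1.0]]     # frozen pre-trained weight (toy)
a = [[1.0, 2.0]]                 # rank-1 factor, shape (1 x 2)
b = [[1.0], [0.0]]               # rank-1 factor, shape (2 x 1)
adapted = lora_effective_weight(w, a, b)

full = 4096 * 4096                          # values in one hypothetical layer
low_rank = trainable_params(4096, 4096, 8)  # values trained at rank 8
```

For that hypothetical layer the adapter trains 256 times fewer values than full tuning, which is why a handful of DreamBooth-style images can be enough.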

Distillation addresses deployment. A production model must be cheaper and faster, so distillation methods shorten the number of steps needed to generate samples. Progressive distillation was named as one example among several.

Evaluation closed the loop. Without a way to measure whether generated images are good, one cannot know where to improve. Human pairwise comparisons remain central to leaderboards, and the course emphasized Elo as a better metric than raw win rate because it accounts for opponent strength. Beating a weak model is less informative than beating a strong one. Elo updates ratings by comparing expected outcomes, derived from current ratings, with actual outcomes; the surprise in that comparison determines the update.
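The update rule can be sketched in a few lines. The logistic expected-score formula with a 400-point scale and the K-factor of 32 are the conventional chess defaults, assumed here because the lecture summary does not state which constants the leaderboard uses; the ratings are made up.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B implied by the current ratings."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings after one comparison.
    score_a is 1.0 if A's image was preferred, 0.0 if B's was."""
    e_a = expected_score(rating_a, rating_b)
    surprise = score_a - e_a
    return rating_a + k * surprise, rating_b - k * surprise

# The same win is worth more against a stronger opponent.
gain_vs_strong = elo_update(1200.0, 1300.0, 1.0)[0] - 1200.0
gain_vs_weak = elo_update(1200.0, 1100.0, 1.0)[0] - 1200.0
```

A raw win rate would count both victories equally, while the rating change is driven by how unexpected each outcome was.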

Automated metrics fill the gap when human ratings are unavailable. Fréchet Inception Distance compares the distribution of generated images with the distribution of real images after both are embedded by a pre-trained encoder. The metric assumes Gaussian distributions in the embedding space. Lower FID is better, but Amidi stressed that it is a proxy, not a perfect metric.

A newer evaluation loop uses multimodal large language models as judges. Given an image and a prompt, an MLLM can score or describe alignment and quality. That enables faster iteration before spending on human evaluation. Automated judges can tighten the loop; they do not remove the value of human ratings.

The best open systems mostly use the course’s playbook; the newest exception points back to pixels

The state-of-the-art discussion tested the abstractions against 2026 leaderboards. The top text-to-image models shown from the Artificial Analysis leaderboard were closed systems from major AI labs: OpenAI, Google, and xAI. Their rankings used Elo scores, but their internal methods were not public, so Amidi did not try to infer details that had not been published.

Rank shown	Creator	Model	Elo	Released	API pricing
1	OpenAI	GPT Image 2 (High)	1,339	Apr 2026	$211.0/1k imgs
2-3	OpenAI	GPT Image 1.5 (High)	1,266	Dec 2025	$133.0/1k imgs
2-3	Google	Nano Banana 2 (Gemini 3.1 Flash Image Preview)	1,264	Feb 2026	$67.0/1k imgs
4-5	Google	Nano Banana Pro (Gemini 3 Pro Image)	1,220	Nov 2025	$134.0/1k imgs
4-5	xAI	grok-image-image-quality	1,211	Apr 2028 as shown in the screenshot	$50.0/1k imgs

Closed-source text-to-image models shown from an Artificial Analysis leaderboard screenshot dated May 24, 2026

The xAI release date appeared on the screenshot as “Apr 2028,” even though the lecture itself was dated 2026. The table preserves it as visible screenshot text rather than treating it as an independently verified release date.

The open-weights leaderboard was more revealing. The top open models shown included HiDream-O1-Image-dev-2604, Qwen Image Max 2512, FLUX.2 [dev], FLUX.2 [dev] Turbo, and FLUX.2 [dev] Flash. These systems had public technical material, allowing Amidi to compare their design choices with the course framework.

Open-weights rank shown	Creator	Model	Elo	Released	API pricing
1	HiDream	HiDream-O1-Image-dev-2604	1,187	May 2026	No API available
2-4	Alibaba	Qwen Image Max 2512	1,159	Dec 2025	$20.0/1k imgs
2-4	Black Forest Labs	FLUX.2 [dev]	1,159	Nov 2025	$12.0/1k imgs
2-4	Fal	FLUX.2 [dev] Turbo	1,159	Dec 2025	$8.0/1k imgs
5	Fal	FLUX.2 [dev] Flash	1,143	Dec 2025	$5.0/1k imgs

Open-weights text-to-image models shown from the Artificial Analysis leaderboard screenshot used in the lecture

Model family	Generation paradigm	Architecture	Representation	Text conditioning
FLUX.2	Rectified flow	Hybrid dual/single-stream multimodal DiT	VAE latent space	Mistral 3 embeddings
Qwen Image	Flow matching loss	Double-stream multimodal DiT	VAE latent space	Qwen2.5VL embeddings
HiDream-O1-Image	Flow matching loss	Transformer-based model	Pixel space; VAE removed	No disjoint pre-trained text encoder

How the open-weights systems discussed in the lecture map onto the course abstractions

FLUX.2 was described as “the usual”: rectified flow, a diffusion-transformer-based architecture, a VAE for latent representation, and a pre-trained text encoder. Qwen Image was also “the usual”: flow matching, a multimodal diffusion transformer, a VAE, and Qwen-based text embeddings.

HiDream-O1-Image was the interesting deviation. It still used a flow matching loss and a transformer-based model, but removed the VAE and did not rely on a disjoint pre-trained text encoder. Amidi treated this not as a refutation of the course, but as a live tradeoff.

The VAE makes generation easier by compressing images into a smoother, smaller latent space. But it is lossy. The decoder may fail to reconstruct fine details faithfully. Removing the VAE preserves fidelity by operating in pixel space, but makes the transformer’s job harder because valid images are more isolated in pixel space and the dimensionality is higher.

HiDream’s apparent solution, in Amidi’s interpretation, was scale and patching. Instead of the smaller patches used in latent-space systems, the paper used much larger 32-by-32 patches, making pixel-space computation more tractable. Amidi said he believed the model had been scaled to 8 billion and also to 200 billion parameters. He presented that as his interpretation of why the approach could work: if the transformer is scaled enough, some of the learnability burden that the VAE previously absorbed may be moved into the generation model itself.

The live question is whether this becomes a broader move back toward pixel-space diffusion. The VAE’s costs are real, especially fidelity loss. But the VAE’s benefits are also real, especially tractability and learnability. HiDream showed, in Amidi’s reading, that VAE-free systems can be competitive when scaled aggressively; he explicitly declined to turn that into a settled rule. In a few months, he said, the field might again find that the VAE is necessary.

A student asked what replaces a pre-trained text encoder. Amidi clarified that the keyword was “pre-trained.” HiDream still has a text encoder; it is trained as part of the system rather than taken off the shelf. He also noted that the system uses prompt enhancement, making implicit prompt details more explicit—lighting, camera position, and similar details—which can make the conditioning problem easier.

Video generation reuses image generation, but time changes the representation problem

Videos were presented as the natural adjacent field. A video is a sequence of frames, so it adds a time dimension to image generation. That addition creates three immediate concerns: temporal consistency, tractability, and evaluation.

Temporal consistency means a plausible frame is not enough. The sequence must also make sense. A teddy bear reading a book should not suddenly acquire a hat and sunglasses in the next frame unless the video explains how. Video models must preserve identity, objects, and scene continuity across time.

Computation grows because height, width, and time all contribute to dimensionality. If one simply treats video as a stack of images, the representation becomes much larger. Evaluation also changes. The lecture described Fréchet Video Distance as an extension of FID: instead of embedding images with an Inception-style encoder, it embeds videos with a pre-trained video encoder and compares generated and real video distributions. The slide displayed the formula:

FVD = ||μ_r - μ_g||² + Tr(Σ_r + Σ_g - 2(Σ_rΣ_g)^{1/2})

As with FID, the metric is a proxy. Amidi again emphasized human-in-the-loop evaluation.
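The formula above can be evaluated by hand in a simplified setting. The sketch below assumes diagonal covariances, in which case the trace term collapses to a sum of squared differences between per-dimension standard deviations. That is a deliberate simplification for illustration: real FID and FVD implementations use full covariance matrices and a matrix square root, and the feature vectors here are made up rather than produced by an encoder.

```python
import statistics

def diagonal_frechet_distance(real_features, generated_features):
    """Frechet distance between two Gaussians fitted per dimension,
    ignoring correlations between dimensions (diagonal covariances)."""
    distance = 0.0
    for d in range(len(real_features[0])):
        real = [f[d] for f in real_features]
        gen = [f[d] for f in generated_features]
        mean_term = (statistics.fmean(real) - statistics.fmean(gen)) ** 2
        std_term = (statistics.pstdev(real) - statistics.pstdev(gen)) ** 2
        distance += mean_term + std_term
    return distance

real = [[0.0, 0.0], [2.0, 2.0]]          # toy embeddings of real samples
shifted = [[1.0, 1.0], [3.0, 3.0]]       # same spread, mean moved by 1 per dim

same = diagonal_frechet_distance(real, real)
moved = diagonal_frechet_distance(real, shifted)
```

The score is zero only when the fitted means and spreads match, which is also why it stays a proxy: two sets of samples can share those statistics and still look different to a person.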

The architectural answer was a spatio-temporal version of the latent pipeline. A video model can use a VAE-like encoder and decoder, but it compresses not only height and width; it also compresses time. Spatial compression ratios around 8 were discussed by analogy with image VAEs. Temporal compression is motivated by the same redundancy argument: consecutive frames often share most of their information.

Quantity	Shape shown in the Wan example	Interpretation
Input video	[1+T, H, W, 3]	Frames in pixel space, with the first frame treated specially
Latent space	[1+T/4, H/8, W/8, C]	Spatial compression by 8 and temporal compression by 4, while preserving an anchor frame
Patches used by the DiT	Space-time patches	Latent patches represent regions across both space and time

The spatio-temporal VAE representation shown for Wan-style video generation

In the Wan architecture example, the latent shape was shown as [1 + T/4, H/8, W/8, C] for an input video of shape [1 + T, H, W, 3]. The “1 plus” term was explained as special treatment for the first frame. The first frame acts as an anchor that is kept without temporal compression, so the rest of the video can continue naturally from it.
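The shape bookkeeping can be checked with a small helper. The compression factors of 4 and 8 come from the shapes shown in the lecture; the 81-frame 480-by-832 clip and the 16 latent channels are assumed example values, and the function name is hypothetical.

```python
def video_latent_shape(num_frames, height, width, latent_channels,
                       temporal_factor=4, spatial_factor=8):
    """Map a [1 + T, H, W, 3] video to its [1 + T/4, H/8, W/8, C] latent shape.
    num_frames counts every frame, including the anchor frame."""
    remaining = num_frames - 1            # T: frames after the anchor
    if remaining % temporal_factor:
        raise ValueError("T must be divisible by the temporal factor")
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("H and W must be divisible by the spatial factor")
    return (1 + remaining // temporal_factor,
            height // spatial_factor,
            width // spatial_factor,
            latent_channels)

def num_elements(shape):
    """Total number of values stored in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

pixel_shape = (81, 480, 832, 3)           # assumed example clip
latent_shape = video_latent_shape(81, 480, 832, latent_channels=16)
reduction = num_elements(pixel_shape) / num_elements(latent_shape)
```

The `reduction` ratio is what makes attention over space-time patches affordable; it depends on the assumed channel count as well as on the three compression factors.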

The VAE becomes a 3D, or causal, VAE. “Causal” here means that features for a frame should depend only on that frame and previous frames, not future frames. Ordinary convolutions are symmetric in time, drawing on both past and future frames; causal temporal processing is asymmetric. One practical reason is streaming: if a frame’s representation does not depend on future frames, encoding and decoding can be performed incrementally, reducing memory pressure. Amidi also noted that stacked convolutions widen the receptive field, so the causal constraint matters not just for a single operation but for the range of frames that can influence a representation.
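The asymmetry can be sketched with a one-dimensional convolution along the time axis, where each "frame" is reduced to a single number. Padding only on the past side is one common way to obtain causality; the kernel values and frame values below are arbitrary illustrations.

```python
def causal_conv1d(frames, kernel):
    """Temporal convolution padded on the past side only:
    output[t] depends on frames[t - k + 1 .. t]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(frames))]

def symmetric_conv1d(frames, kernel):
    """Ordinary 'same' convolution (odd kernel length):
    output[t] depends on past and future frames alike."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(frames) + [0.0] * half
    return [sum(kernel[j] * padded[t + j] for j in range(len(kernel)))
            for t in range(len(frames))]

kernel = [0.25, 0.5, 0.25]
video = [1.0, 2.0, 3.0, 4.0, 5.0]
video_new_ending = video[:-1] + [50.0]    # only the final frame differs

causal_a = causal_conv1d(video, kernel)
causal_b = causal_conv1d(video_new_ending, kernel)
symmetric_a = symmetric_conv1d(video, kernel)
symmetric_b = symmetric_conv1d(video_new_ending, kernel)
```

Changing the last frame leaves every earlier causal output untouched, so those frames could have been encoded before the last one arrived. The symmetric version changes its second-to-last output as well.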

A student asked why earlier frames are needed at all if one has the last frame. Amidi gave the example of a teddy bear walking across the street while a pedestrian passes behind it and later reappears. Information from earlier frames can matter for maintaining consistency when objects leave and re-enter view. A receptive field of one frame would be too narrow.

Another student asked about efficiency. Amidi said long videos cannot be generated as one arbitrarily large sequence; a common approach is to generate fixed-length clips, then use the last frame as the anchor for the next segment, along with conditioning. That converts an unbounded generation problem into a sequence of bounded ones.

The diffusion transformer also changes in video. It does not operate on purely spatial patches; it operates on space-time patches. A patch can represent some part of the latent video across time, not just a region in a single image. Self-attention lets those patches interact across space and time so that the generated video remains coherent.

When asked whether masked attention, as in causal LLMs, could be used, Amidi distinguished the causal VAE from the attention pattern inside the generation model. In image generation, all parts of the image are typically allowed to interact with one another. In video generation, the same motivation remains: the model needs consistency between every part of the video. For that reason, people typically keep full self-attention, though Amidi left room for empirical testing.

The suggested readings reflected how much of video generation is an extension of the image stack rather than a wholly separate field: Video Diffusion Models, MagicVideo, Stable Video Diffusion, OpenAI’s “world simulators” discussion, Movie Gen, HunyuanVideo, and Wan were all listed as pointers rather than an exhaustive map.

Editing is safer when the model chooses constrained actions instead of redrawing the image

If a user asks a model to “make this image black and white,” they usually expect the same image with a color transformation. A text-image-to-image generation model can treat the task as conditional generation, but that is still from-scratch generation. It may preserve the image, or it may change content. In the shown example, the generated black-and-white teddy bear had raised its right arm, which was not requested.

The proposed alternative was to reformulate editing as constrained action selection. Instead of asking a generative model to redraw the image, one can ask a vision-language model to infer editing actions and pass them to editing software. If the allowed action set is constrained to transformations such as coloring actions, the system is operating with guardrails intended to preserve the original image rather than regenerate it.

The advantage follows from the allow-list framing in the lecture: if the system is limited to harmless coloring actions, it has far less room to alter the image’s content than an unconstrained generator does. Amidi described this as giving “some guarantee” under that constraint, not as a universal guarantee that editing systems cannot fail. The challenge is that the VLM must know the editing action space well enough to produce meaningful operations. It must translate user intent into executable commands, such as changing brightness, luminosity, or color properties in an editing tool.
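One way to picture the allow-list is as a validation layer between the VLM's proposed plan and the editing tool. Everything in the sketch below is hypothetical: the action names, the parameter ranges, and the plan format are invented for illustration and do not correspond to any particular editing software or to the systems cited in the lecture.

```python
# Hypothetical action space: name -> allowed (min, max) parameter range.
ALLOWED_ACTIONS = {
    "set_saturation": (-100.0, 100.0),
    "set_brightness": (-100.0, 100.0),
    "set_contrast": (-100.0, 100.0),
}

def validate_plan(plan):
    """Accept a list of (action_name, value) pairs only if every action
    is on the allow-list and every value is inside its range."""
    for name, value in plan:
        if name not in ALLOWED_ACTIONS:
            raise ValueError(f"action not allowed: {name}")
        low, high = ALLOWED_ACTIONS[name]
        if not low <= value <= high:
            raise ValueError(f"value out of range for {name}: {value}")
    return plan

# A plan a VLM might propose for "make this image black and white".
black_and_white = validate_plan([("set_saturation", -100.0)])

# A plan that tries to redraw content is rejected before it reaches the tool.
try:
    validate_plan([("inpaint_region", 1.0)])
    rejected = False
except ValueError:
    rejected = True
```

The guarantee is only as strong as the action set: global color operations cannot raise a teddy bear's arm, but a richer set of allowed actions would weaken that property.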

Amidi described one data strategy. Editing software logs contain initial images, sequences of user actions, and final images. What they typically lack is the user’s stated intent. One can infer that intent by giving an off-the-shelf VLM the before and after images and asking what changed. For a color-to-black-and-white example, the inferred intent might be “make this image black and white.” A system can then be tuned on triples: initial image, inferred intent, and editing actions. The goal is for the VLM to output actions that correspond to user intent.

The loss, in this framing, does not discover intent by itself. A student asked how the loss could reflect user intent, and Amidi clarified that the intent can be generated upstream. The model can be trained with inferred or annotated intent as part of the supervised data: given the initial image and the instruction, output the sequence of editing actions that produced the final image.

The suggested readings—SmartEdit, MonetGPT, JarvisArt, and RetouchIQ—were presented as evidence that this is an active research area, with papers from 2024 through 2026. The larger point was direct: not every visual task should be solved by unconstrained generation. Some tasks are better cast as constrained transformation, especially when the user wants preservation more than imagination.

Diffusion for language trades token-by-token latency for harder training

Much of modern vision generation borrowed from the text world: transformers, post-training ideas such as DPO, and reinforcement-style methods such as GRPO, which have diffusion analogues including Diffusion-DPO and Flow-GRPO. The reverse transfer is diffusion for LLMs.

Autoregressive language models generate one token at a time. Each new token depends on previous tokens. This makes the number of sequential decoding steps scale with output length: O(output tokens). For long outputs—such as large code generations—that sequential process becomes a bottleneck.

Diffusion language models propose a different generation pattern. Start with a noised text sequence and denoise the entire sequence iteratively. The inference iteration count then scales with diffusion steps rather than output tokens:

O(output tokens) → O(diffusion steps)

Amidi acknowledged that this feels unnatural because humans speak sequentially. But he offered a writing analogy: drafting a speech is often coarse-to-fine. One starts with a rough structure, then refines. Diffusion for text similarly moves from rough, masked content toward a final sequence.

The key complication is that text is discrete. Image noise can be Gaussian; text tokens have semantic meanings. Replacing a token with a random token may inject unintended meaning. The common approach Amidi described is a dedicated mask token, representing an unknown or noised text token.

Training then resembles learning to fill blanks. Take a clean sentence, sample a noise level t, mask approximately a t fraction of tokens, and train the model to reconstruct the masked tokens from the unmasked context. This resembles BERT’s masked-token task, with an important difference: the masking ratio varies according to the noise level, ranging from lightly masked to heavily masked sequences.
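The corruption step can be sketched as follows, assuming each token is masked independently with probability t, which is one simple way to hit a masking fraction of roughly t. The whitespace tokenization and the example sentence are illustrative only.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, t, rng):
    """Replace each token with [MASK] with probability t.
    Returns the noised sequence and the reconstruction targets,
    where unmasked positions have target None (no loss there)."""
    noised, targets = [], []
    for token in tokens:
        if rng.random() < t:
            noised.append(MASK)
            targets.append(token)
        else:
            noised.append(token)
            targets.append(None)
    return noised, targets

rng = random.Random(0)
sentence = "a teddy bear is reading a book in the park".split()

t = rng.random()                               # sample a noise level in [0, 1)
noised, targets = mask_tokens(sentence, t, rng)

clean, _ = mask_tokens(sentence, 0.0, rng)     # t = 0: nothing is masked
blank, _ = mask_tokens(sentence, 1.0, rng)     # t = 1: everything is masked
```

A fixed masking ratio would recover a BERT-style objective; drawing a fresh `t` per example is what exposes the model to the whole range from nearly clean to fully masked inputs.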

Stage	Diffusion language model mechanism	Purpose
Training	Sample a noise level and replace a corresponding fraction of tokens with [MASK]	Teach the model to reconstruct masked tokens from the visible context
Initial inference	Start from a sequence made entirely of [MASK] tokens	Represent 100% noise in the text setting
Iterative inference	Predict masked tokens, commit some predictions, and remask others	Move from coarse text toward a refined final output
Revision policy	Remask randomly or according to confidence scores	Give uncertain tokens another denoising attempt

The diffusion-for-text procedure described in the lecture

At inference, the model starts from a fully masked sequence. It predicts masked tokens all at once, then may remask some tokens for revision. Remasking can be random, or based on confidence: if the model is uncertain about a token, mask it again and try another denoising step. Repeating this process yields a final unmasked output.
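The loop can be sketched with a stand-in for the trained model. In the version below, every masked position is predicted at each step, the most confident predictions are committed, and the rest are remasked for the next step. The toy model, its random confidence scores, and the linear commit schedule are all assumptions for illustration rather than the procedure of any specific paper.

```python
import math
import random

MASK = "[MASK]"
TARGET = ["def", "add", "(", "a", ",", "b", ")", ":"]

def toy_model(sequence, rng):
    """Hypothetical denoiser: returns a (token, confidence) pair for every
    position. A real model would condition on the visible tokens."""
    return [(TARGET[i], rng.random()) for i in range(len(sequence))]

def diffusion_decode(length, num_steps, rng):
    sequence = [MASK] * length                     # start from 100% noise
    for step in range(1, num_steps + 1):
        predictions = toy_model(sequence, rng)     # all positions in parallel
        masked = [i for i, tok in enumerate(sequence) if tok == MASK]
        # How many positions should be committed once this step is done.
        committed_goal = math.ceil(length * step / num_steps)
        num_to_commit = committed_goal - (length - len(masked))
        # Keep the most confident predictions; the others stay masked,
        # which gives them another denoising attempt at the next step.
        by_confidence = sorted(masked, key=lambda i: predictions[i][1],
                               reverse=True)
        for i in by_confidence[:num_to_commit]:
            sequence[i] = predictions[i][0]
    return sequence

rng = random.Random(0)
decoded = diffusion_decode(length=len(TARGET), num_steps=4, rng=rng)
```

Here eight tokens are produced with four model calls instead of eight, which is the source of the latency argument; the number of steps is a quality and speed knob that is independent of output length.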

The advantage is speed. Amidi said that in one paper he was reading, diffusion language models could reach speedups of up to 10x over the traditional autoregressive approach. Coding is a natural use case because latency matters for coding agents and because code often involves “fill in the middle” tasks: inserting logic between existing functions, rather than generating everything left-to-right.

The challenges are substantial. Training is significantly more expensive. Autoregressive training can parallelize next-token prediction efficiently across a sequence; diffusion-style text training does not reuse the same scheme as cleanly. Many post-training and alignment techniques are also designed around autoregressive models and must be adapted. Hybrid approaches such as block diffusion attempt to combine both paradigms.

A student asked how variable-length output is handled. Amidi said that in practice one sets an output length and stops after an end-of-sequence token. This can waste computation when outputs are shorter than the allocated length. Block diffusion offers one workaround: generate a fixed-size block with diffusion, then, if the output has not ended, condition the next block on the previous one and continue until an end token appears.

Another student asked whether the approach applies to other discrete domains. Amidi said he was sure it could, though the lecture did not cover it; the relevant mathematical work involves adapting diffusion to discrete state spaces.

A third student asked whether text could instead be treated as images of text and processed with OCR-like mechanisms. Amidi called it a promising direction and mentioned a recent DeepSeek-OCR paper as related, noting that there may be savings in how many visual tokens are needed compared with text tokens.

Speculative use cases depend on concrete bottlenecks being solved

Shervine Amidi closed with a forward-looking view of where image and vision generation may go. He said he had looked across major labs and estimated that top image-generation APIs cost roughly 10 cents per megapixel for high-quality images. The visible pricing screenshots mixed per-megapixel, per-image, per-token, and credit-based pricing examples from Google, OpenAI, Black Forest Labs, and Stability AI, so the 10-cent figure should be read as Amidi’s synthesis of that pricing research rather than a directly displayed uniform tariff. He suggested tracking price per megapixel over time as a way to watch generation move from “nice thing” toward commodity. In practice, users may rely on distilled versions rather than the top models, but the top-model price provides an upper bound on what people appear ready to pay for quality.

~$0.10

Shervine Amidi’s estimate of price per megapixel for top image generation, based on pricing research shown in the lecture

The speculative opportunities were separated by time horizon. Near-term progress may come from importing wins across modalities. Reasoning over images remains underdeveloped compared with text. If asked for a diagram, many models produce a visual projection of the request rather than a genuinely insightful, precise diagram. Amidi described this as an area where image generation could benefit from the kind of reasoning progress seen in language models.

Image editing may also benefit from agentic tool use. Rather than unconstrained generation, systems could use existing tools such as Photoshop or CAD, along with human expertise encoded as actions. This connects directly to the earlier editing section: models can become controllers of constrained tools rather than replacing those tools with raw generation.

Multimodal synthesis was another near-term opportunity. A class contains slides, video, audio, and text. Today, pieces can be assembled one by one, but Amidi suggested there is room for systems that coherently synthesize across all modalities.

The longer-term examples were robotics, medicine, and visual design. Robotics may benefit from generated environments and from lessons learned in text and image generation. Medicine and other domains may move more slowly because of approval processes and institutional inertia. He also raised a more speculative possibility: a future model that assembles a perfectly filmed, coherent, pedagogical lecture. He remained cautious, saying that knowledge transmission still requires taste and opinion that even text systems do not yet have.

The concrete bottlenecks were cost, data quality, trust, and safety.

Costs remain a central challenge. Distillation helps, but hardware may also change. One cited direction was analog in-memory computing for attention mechanisms. Current hardware is built around matrix multiplications, while transformers depend heavily on attention, a structured sequence of operations that might be simplified in hardware.

Data quality is another concern. As generated images proliferate online, future models may have difficulty accessing the “true” data distribution assumed in the mathematical formulations. Amidi discussed model collapse as an echo chamber of mistakes: when models train on generated data, errors can accumulate over generations, eventually making real and synthesized distributions separable in embedding space.

Trust is linked to both data quality and social reliability. The border between real and generated images is becoming thin. One response is provenance metadata, such as the C2PA standard, which can attach content credentials or “AI info” labels to generated media. But metadata has a trivial flaw: taking a screenshot or stripping the metadata removes it.

Watermarking is another response. Amidi mentioned SynthID from Google DeepMind as an example of hiding patterns in pixels that can reveal origin, even when metadata is absent.

Safety remains a separate challenge. Harmful image generation, including deepfakes, has societal implications. Amidi described two fronts: model-side policies adopted by companies, and legal systems that are trying to catch up.

The practical advice for staying current was concrete. Follow the computer vision section of arXiv, though the volume can reach hundreds of submissions per day. Use major machine learning and vision venues—NeurIPS, ICML, ICLR, CVPR, ECCV—as partial filters. Read code, not just papers; authors increasingly release GitHub repositories, and working through them with coding-assistant tools can clarify what formulas obscure. Papers with Code, Stanford courses such as CS231n, technical blogs from labs and companies, YouTube explainers, and communities on Twitter were all listed as useful complements.

Amidi also pointed students back to the course study guide, which the instructors intend to keep updated, and to a related fall course on transformers and large language models.
