Orply.

Text-to-Image Training Is Becoming a Problem of Signal Allocation

Shervine Amidi · Stanford Online · Tuesday, May 19, 2026 · 21 min read

Stanford adjunct lecturers Shervine Amidi and Afshine Amidi present text-to-image model training as a problem of allocating scarce learning signal across the full model lifecycle, not simply choosing a diffusion or flow-matching loss. In Lecture 6 of Stanford’s CME296 course, they argue that practical training depends on emphasizing hard timesteps, adjusting for resolution, using data curricula and representation alignment, then applying post-training, personalization, and distillation methods to improve control and reduce inference cost.

Training is signal allocation, not just loss selection

In the lifecycle presented by Shervine Amidi and Afshine Amidi, training a text-to-image model is not only a matter of choosing the right objective. The image generator sits inside a larger system with a variational autoencoder that moves images into latent space and embedding models that encode conditioning signals such as text. The scope of the lecture was narrower: training the image generation model itself, the denoising or flow model that moves noisy latents toward image latents.

The lifecycle separates four interventions. Pre-training teaches the model to generate images at all. Post-training teaches it to generate good images, where “good” may mean added domain knowledge, aesthetics, lighting, instruction following, or human preference. Tuning adapts the model to a special subject or use case. Distillation compresses the generation process so the model can produce useful images with fewer steps and lower latency.

| Stage | Operational goal | Primary constraint |
|---|---|---|
| Pre-training | Learn how to generate images | Data mixture, filtering, resolution handling, curriculum |
| Post-training | Learn how to generate good images | Knowledge, behavior, instruction following, human preference |
| Tuning | Learn how to generate images for a special case | Subject fidelity without overfitting or full-model retraining |
| Distillation | Learn how to generate images fast | Reduce inference steps while preserving acceptable quality |

The lecture’s lifecycle separates broad capability, alignment, personalization, and production efficiency.

That framing rests on the architectural shift described in the recap: from U-Net-style diffusion models to Diffusion Transformers. U-Nets dominated the early diffusion era because their downsampling path supplied global context, their upsampling path restored resolution, and skip connections preserved local details. Their weakness is that distant local regions do not directly communicate. The example used was a person looking at themselves in a mirror: details in two separate image regions must be coordinated.

Diffusion Transformers address this by using self-attention over image patches, allowing each patch to attend to the others. Conditioning can be injected through adaptive layer normalization, where the condition and timestep modulate token embeddings through shift, scale, and gate factors. But that applies the conditioning signal uniformly across patch embeddings. Multimodal Diffusion Transformer variants instead treat text as a separate modality, reflecting what the lecture described as a newer trend in image-generation architectures.

The timeline shown was explicitly “not meant to be exhaustive,” but it established the direction the instructors wanted students to retain: original U-Net in 2015, DDPM-style U-Nets in pixel space around 2020, latent diffusion around 2021, Diffusion Transformers in 2022, Stable Diffusion XL in 2023, Stable Diffusion 3 in 2024, Z-Image and FLUX.1 around 2025, and Qwen-Image in 2026. The instructors summarized U-Net variants as largely dominant in the early 2020s and DiT-based variants as dominating many newer releases from late 2022 onward.

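Within that architecture shift, Shervine treated flow matching as the practical default for the rest of the lecture. DDPM diffusion corrupts clean images step by step and trains a model to predict the noise to remove:

$$\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{x,\,\epsilon,\,t}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$$

Score matching reframes denoising as estimating the score, which Shervine described as a compass pointing toward clean image regions:

$$s_\theta(x_t, t) \approx \nabla_{x_t} \log p_t(x_t)$$

Flow matching casts generation as transport from noise to data. A point on the path is an interpolation between a noise sample $x_0$ and a data sample $x_1$:

$$x_t = (1 - t)\,x_0 + t\,x_1$$

The model predicts a velocity field $v_\theta$, and sampling can be performed with an ODE solver such as Euler’s method:

$$x_{t + \Delta t} = x_t + \Delta t\; v_\theta(x_t, t)$$

The conditional flow matching loss regresses that velocity toward the displacement between the data point and the noise point:

$$\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t,\,x_0,\,x_1}\left[\lVert v_\theta(x_t, t) - (x_1 - x_0) \rVert^2\right]$$

In the simple version, $t \sim \mathcal{U}(0, 1)$, $x_0 \sim \mathcal{N}(0, I)$, and $x_1 \sim p_{\text{data}}$. The practical problem starts where that simple version is too blunt: not all timesteps, resolutions, or training examples deserve equal probability.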
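As a rough illustration of the flow-matching pieces described above, the sketch below assumes the linear path x_t = (1 - t)·x0 + t·x1, with x0 a noise sample and x1 a data sample, and works in one dimension. The function names and the `oracle_velocity` stand-in for a trained network are hypothetical, not taken from the lecture.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target: displacement from the noise point to the data point."""
    return x1 - x0

def cfm_loss(v_pred, x0, x1):
    """Squared error between a predicted velocity and the target displacement."""
    return (v_pred - target_velocity(x0, x1)) ** 2

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

# Toy stand-in for a trained model (assumption): if every data sample equals
# DATA_POINT, the ideal velocity at (x, t) is (DATA_POINT - x) / (1 - t).
DATA_POINT = 2.0

def oracle_velocity(x, t):
    return (DATA_POINT - x) / (1.0 - t)

noise = random.gauss(0.0, 1.0)
print(euler_sample(oracle_velocity, noise, num_steps=4))  # close to DATA_POINT
```

Because the toy path is perfectly straight, a handful of Euler steps is enough here; real models have curved paths, which is part of what the later distillation discussion is about.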

Hard timesteps and high resolutions need different probability mass

Uniform timestep sampling wastes probability on tasks the model can learn comparatively easily. In the convention used for the timestep-difficulty discussion, $t = 0$ is pure noise and $t = 1$ is a clean image.

A timestep close to zero is easy because the input contains almost no useful image information. The model cannot decide exact eyes, texture, limbs, lighting, or layout; it is incentivized to predict something like an average direction toward the target distribution. A timestep close to one is also easy because the image is already nearly clean and needs only small refinements.

The difficult region is the middle. A partially denoised teddy bear contains enough structure to require consequential decisions but not enough structure to make them obvious. The model may know roughly where the head, body, and arms are, while still needing to decide texture, eyes, and other local details.

“And for that reason, t equal to 0.5 is a hard, hard task for the model.”

Shervine Amidi

The practical fix is to replace uniform timestep sampling with a distribution that emphasizes the middle. Shervine said that the distribution commonly used in this setting is the logit-normal distribution:

$$t = \operatorname{sigmoid}(u) = \frac{1}{1 + e^{-u}}$$

with

$$u \sim \mathcal{N}(\mu, s^2)$$

The logit transformation solves a constraint problem. A normal distribution can take values across the whole real line, but timesteps must lie between zero and one. Sampling in logit space and mapping back through the sigmoid keeps samples in the valid interval while allowing the density to concentrate where the training task is harder. The mean and standard deviation can be adjusted to emphasize earlier, middle, or later parts of the trajectory.
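A minimal sketch of logit-normal timestep sampling, assuming a mean of 0 and a standard deviation of 1 in logit space (both values are illustrative choices, not the lecture's settings). It draws a normal sample, maps it through the sigmoid, and compares how much probability lands in the middle of the interval against uniform sampling. The helper `fraction_in_middle` is hypothetical and exists only for the comparison.

```python
import math
import random

def sigmoid(u):
    """Map a real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def sample_logit_normal(mean=0.0, std=1.0, rng=random):
    """Draw u ~ Normal(mean, std) in logit space and return t = sigmoid(u)."""
    return sigmoid(rng.gauss(mean, std))

def fraction_in_middle(samples, low=0.25, high=0.75):
    """Share of timesteps that fall inside [low, high]."""
    return sum(low <= t <= high for t in samples) / len(samples)

rng = random.Random(0)  # fixed seed so the comparison is repeatable
logit_normal = [sample_logit_normal(0.0, 1.0, rng) for _ in range(20000)]
uniform = [rng.random() for _ in range(20000)]

print("logit-normal share in [0.25, 0.75]:", fraction_in_middle(logit_normal))
print("uniform share in [0.25, 0.75]:     ", fraction_in_middle(uniform))
```

Shifting `mean` away from zero moves the emphasis toward one end of the trajectory, and a larger `std` spreads mass back toward the endpoints.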

Resolution adds another distortion. The same nominal noise level is not perceived the same way at different resolutions. A low-resolution teddy bear and a high-resolution teddy bear were shown with the same noise level; the lower-resolution image appeared noisier.

The explanation is spatial correlation. Nearby pixels in natural images tend to have similar values. In a low-resolution image, if one of the few pixels representing a region is corrupted, important information can be lost. In a higher-resolution image, many neighboring pixels represent the same region. Since the added Gaussian noise has mean zero, corruptions can average out across those pixels, preserving a better estimate of the underlying value.

The lecture derived the intuition with a toy image in which every pixel has the same unknown value $c$. Using the timestep convention from the cited high-resolution rectified-flow paper, $t = 0$ is clean and $t = 1$ is noise. For pixel index $i$, the noised value is:

$$z_{t,i} = (1 - t)\,c + t\,\epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, 1)$$

Averaging over $n$ pixels gives:

$$\bar{z}_t = \frac{1}{n}\sum_{i=1}^{n} z_{t,i} = (1 - t)\,c + \frac{t}{n}\sum_{i=1}^{n}\epsilon_i$$

The mean of $\bar{z}_t$ is $(1 - t)\,c$, and its variance is $t^2 / n$. A natural estimator for $c$ is $\hat{c} = \bar{z}_t / (1 - t)$, whose standard deviation is:

$$\sigma(t, n) = \frac{t}{1 - t}\cdot\frac{1}{\sqrt{n}}$$

The key result is not the algebra itself but the dependence on resolution: uncertainty in estimating the underlying pixel value decreases as resolution increases. To keep perceived noise comparable across resolutions $n$ and $m$, the timestep must be shifted so those uncertainty terms match. When asked whether $n$ and $m$ refer to the latent space because diffusion happens there, the answer was yes.
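One way to make the matching condition concrete is sketched below. It assumes the toy model above with t = 0 clean and t = 1 noise, where the estimate of a constant pixel value from n noisy pixels has standard deviation t / ((1 - t)·sqrt(n)). Setting that quantity equal at pixel counts n and m and solving for the shifted timestep gives the closed form in `shift_timestep`. The function names and the pixel counts in the usage lines are illustrative.

```python
import math

def estimator_std(t, n):
    """Std of the constant-pixel estimate from n noisy pixels (t=0 clean, t=1 noise)."""
    return t / (1.0 - t) / math.sqrt(n)

def shift_timestep(t_n, n, m):
    """Timestep at pixel count m whose uncertainty matches timestep t_n at pixel count n.

    Derived by solving estimator_std(t_m, m) == estimator_std(t_n, n) for t_m.
    """
    ratio = math.sqrt(m / n)
    return ratio * t_n / (1.0 + (ratio - 1.0) * t_n)

# Illustrative numbers: going from 1024 to 4096 pixels pushes t toward the noisy end.
t_low = 0.5
t_high = shift_timestep(t_low, n=1024, m=4096)
print(t_high, estimator_std(t_low, 1024), estimator_std(t_high, 4096))
```

The higher-resolution input needs a larger nominal noise level to be as uncertain as the lower-resolution one, which matches the observation that the same noise level looks milder at high resolution.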

The first major training decision is therefore allocation: move probability away from easy endpoints and adjust timesteps across resolution regimes, because the nominal loss does not by itself encode where learning is hardest.

Representation alignment and data curriculum accelerate the expensive part

Pre-training teaches the model to generate images, but the expensive part is not only gradient descent. It is data mixture, filtering, cleaning, resolution management, and curriculum. A Qwen-Image technical report chart was used to show that data composition itself is an engineering object: the visible mixture included Human at 30.73%, Objects at 21.57%, Text at 18.06%, Artwork at 4.60%, Nature at 3.87%, Design at 2.52%, Animals at 2.50%, Screenshot at 2.45%, Plants at 1.10%, Portrait at 0.94%, Cityscape at 0.81%, Indoor at 0.65%, Food at 0.58%, Scene at 0.28%, and Sport at 0.14%.


The point was not that these percentages are universal. It was that broad image generation depends on deliberate data pipelines, not a generic scrape plus a loss.

Curriculum learning imposes an order on that data. The model first sees easier examples: low resolution, fixed square aspect ratios, and simple prompts such as “a teddy bear.” Harder examples come later: high resolution, variable aspect ratios, and complex prompts. The hard prompt shown described a plush teddy bear in a beige trench coat, plaid scarf, and crossbody bag walking along a rain-soaked Parisian boulevard at dusk, with boutique lights, Haussmann architecture, glowing streetlamps, wet pavement reflections, and a cinematic cozy luxury travel scene.

Different resolutions fit naturally into DiT-style architectures. The image is divided into patches of size $p \times p$. For a square image of dimension $H \times H$, the number of patches is:

$$N = \left(\frac{H}{p}\right)^2$$

Each patch becomes an embedding token. Low-resolution inputs produce fewer tokens; high-resolution inputs produce more. A student compared this to long context, and the instructor accepted the analogy.
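A small sketch of how token count grows with resolution, assuming non-overlapping square patches that tile the input exactly. The sizes in the usage lines are illustrative and not taken from any specific model.

```python
def num_patch_tokens(height, width, patch_size):
    """Number of patch tokens for an input tiled by non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("input size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)

# Doubling the side length quadruples the sequence length the transformer sees.
print(num_patch_tokens(64, 64, 2))    # 1024
print(num_patch_tokens(128, 128, 2))  # 4096
print(num_patch_tokens(64, 96, 2))    # 1536, a non-square aspect ratio
```

Since self-attention cost grows with sequence length, this is one reason a curriculum that starts at low resolution is cheaper in its early stages.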

Afshine Amidi introduced representation alignment, or REPA, as another way to reduce the cost of learning useful internal structure. REPA uses representations already learned by a large pretrained visual encoder and encourages the diffusion transformer to align with them. The analogy was giving a learner a book about the topic: the model still learns, but it is guided by an existing representation.

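REPA augments the usual velocity loss:

$$\mathcal{L} = \mathcal{L}_{\text{velocity}} + \lambda\,\mathcal{L}_{\text{REPA}}$$

For a patch $n$, the generator’s internal representation $h_t^{[n]}$ is passed through a trainable projection head $h_\phi$ and compared with the pretrained encoder’s representation of the corresponding clean patch, $y_*^{[n]}$:

$$\mathcal{L}_{\text{REPA}} = -\,\mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N} \operatorname{sim}\!\left(y_*^{[n]},\, h_\phi\!\left(h_t^{[n]}\right)\right)\right]$$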

The generator is penalized when its projected internal representation of a noised patch is dissimilar from the pretrained encoder’s representation of the clean patch.
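The alignment penalty can be sketched as a negative mean cosine similarity over patches. This is a simplified illustration under several assumptions: the projection head is a single linear map rather than a small network, vectors are plain Python lists, and the weight `lam` on the alignment term is a hypothetical value.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def project(hidden, weight):
    """Linear projection head (assumption): weight is a list of rows."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def repa_loss(generator_patches, encoder_patches, weight):
    """Negative mean similarity between projected generator features of the
    noised input and frozen-encoder features of the clean input, per patch."""
    sims = [
        cosine_similarity(project(h, weight), y)
        for h, y in zip(generator_patches, encoder_patches)
    ]
    return -sum(sims) / len(sims)

def total_loss(velocity_loss, alignment_loss, lam=0.5):
    """Velocity objective plus the weighted alignment term."""
    return velocity_loss + lam * alignment_loss

identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 2.0], [3.0, -1.0]]
print(repa_loss(features, features, identity))  # about -1: perfectly aligned
```

Lower is better: perfectly aligned patches give a loss near -1, and opposed patches give a loss near +1.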

The results shown from “Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think” reported a significant speedup, with the chart marking “17.5x faster.” The slide also stated that aligning only the first layers leads to the best performance and that bigger diffusion models benefit even more from REPA.


Afshine offered an interpretation of why earlier layers work better. Later layers of the diffusion transformer may be closer to local details needed for the final prediction, while pretrained visual encoders often provide semantically meaningful representations. Aligning an earlier, more semantically oriented generator layer with a semantically meaningful pretrained representation is more natural than aligning the very late layers. He also noted that REPA, published in 2024, is a technique he has personally seen in many newer papers.

The common pattern across curriculum, data mixtures, resolution handling, and REPA is that pre-training quality is not won by the base objective alone. The objective supplies the target; the training system decides which examples, resolutions, timesteps, and auxiliary representations make that target reachable.

Post-training splits quality into knowledge, behavior, and preference

Post-training starts from a model that can generate images and asks what “good” should mean. The lecture separated three signals: knowledge, behavior, and preference.

Continued training focuses on knowledge. If a model has been broadly pretrained on nature images, cars, and other domains, but the target use case is teddy bears, continued training uses a teddy bear dataset to increase capability in that domain. In the cooking analogy used later, this is like reading more recipes in the relevant cuisine.

Supervised fine-tuning focuses on behavior. It improves properties such as lighting, aesthetics, and instruction following. In the same analogy, this is about presentation and delivery rather than merely knowing more recipes.

Preference tuning teaches what to do more of and what to avoid. For a given prompt, two or more generated images can be compared, with one preferred over another. The preference signal can come from human ratings or from a model. Afshine Amidi emphasized that this remains a developing field and that techniques from LLM post-training are being adapted to text-to-image generation.

Reward Feedback Learning begins by collecting prompts and generated images, then annotating them through ratings, pairwise comparisons, or listwise rankings. A shown example used a ranking such as A > B > C > D. A reward model is trained to take a prompt and image and output a score for how good that image is with respect to the prompt. Shervine said this can use a pairwise loss based on the Bradley-Terry formulation, though the lecture did not go through the math.
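Since the lecture did not go through the math, the following is only a sketch of the standard Bradley-Terry pairwise form: the probability that image A beats image B is the sigmoid of the difference in their reward scores, and the loss is the negative log of that probability for the preferred image. The function names are illustrative, and the scores here are plain numbers standing in for a reward model's output.

```python
import math

def bradley_terry_prob(score_a, score_b):
    """Probability that item A is preferred over item B given scalar scores."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def pairwise_loss(score_preferred, score_rejected):
    """Negative log-likelihood of the observed preference."""
    return -math.log(bradley_terry_prob(score_preferred, score_rejected))

def ranking_to_pairs(ranking):
    """Expand a listwise ranking (best first) into all (winner, loser) pairs."""
    return [
        (ranking[i], ranking[j])
        for i in range(len(ranking))
        for j in range(i + 1, len(ranking))
    ]

print(ranking_to_pairs(["A", "B", "C", "D"]))  # six ordered pairs
print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

A ranking such as A > B > C > D therefore yields six training pairs, and the loss is small when the reward model already scores the preferred image higher.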

Once trained, the reward model becomes a differentiable signal. The generator produces an image from a noisy latent, the reward model scores it, and the reward-maximizing objective can be backpropagated into the image generation model. The hope is that the reward model aligns with human preference, so increasing the reward increases preferred outputs.

Flow-GRPO adapts Group Relative Policy Optimization to flow matching text-to-image models. For a prompt such as “A photo of four cups,” the model generates a diverse group of images. The authors frame generation through SDE sampling as a Markov decision process, partly to increase diversity. Each image receives a reward from a black-box reward model. The algorithm computes an advantage: the reward of an image relative to other images in its group. The image generation model is then updated as the policy.
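The group-relative advantage can be sketched as a within-group standardization of rewards. This assumes the common GRPO-style form (reward minus group mean, divided by group standard deviation); the use of the population standard deviation and the small `eps` constant are assumptions made for the illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample relative to the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four images generated from one prompt,
# e.g. 1.0 when the image shows the requested number of cups, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

Images that beat their group average get a positive advantage and are reinforced; when every image in the group receives the same reward, all advantages are zero and the group contributes no learning signal.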

The risk is reward hacking. If the reward is not exactly aligned with the desired outcome, the model can over-optimize the proxy. Flow-GRPO mitigates this with a KL divergence term between the current policy and the old policy:

$$D_{\text{KL}}\!\left(\pi_\theta \,\Vert\, \pi_{\text{old}}\right)$$

That term discourages updates that move too far from the previous model.

Diffusion-DPO adapts Direct Preference Optimization to diffusion. Its equations compare how well the policy and a reference model predict velocities for winning and losing images. The practical interpretation given was simple: incentivize the model to do a better job predicting velocities associated with winning images and a worse job predicting velocities associated with losing images. That should increase the chance of preferred images and decrease the chance of dispreferred ones.
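The interpretation above can be sketched with scalar prediction errors. This is a simplified form under stated assumptions: each argument is a squared error between a predicted and a target velocity for one noised image, and a single hypothetical `beta` absorbs the timestep-dependent constants that appear in the full objective.

```python
import math

def diffusion_dpo_loss(err_win_policy, err_win_ref,
                       err_lose_policy, err_lose_ref, beta=1.0):
    """Preference loss on prediction errors, relative to a frozen reference model.

    The margin is negative when the policy improves on the reference for the
    winning image more than it does for the losing image.
    """
    margin = (err_win_policy - err_win_ref) - (err_lose_policy - err_lose_ref)
    # -log(sigmoid(-beta * margin)) written as log(1 + exp(beta * margin))
    return math.log1p(math.exp(beta * margin))

same_as_reference = diffusion_dpo_loss(1.0, 1.0, 1.0, 1.0)
better_on_winner = diffusion_dpo_loss(0.5, 1.0, 1.5, 1.0)
print(same_as_reference, better_on_winner)
```

When the policy matches the reference the loss is log 2; it drops as the policy predicts winning images better and losing images worse than the reference does.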

Prompt enhancement addresses a different post-training mismatch. Users often write short prompts, while aligned image models may be trained or tuned on long, detailed prompts. A high-quality image of a teddy bear reading a book was paired with the prompt that produced it. The visible prompt began, “A warm, intimate indoor reading scene featuring a plush teddy bear as the central subject,” then specified the chair, beige fur, knitted brown sweater, eyeglasses, a book titled The Tales of Woodland, table-lamp lighting, shallow depth of field, a ceramic mug, and a wooden side table. The visible text was truncated; the instructor said the full prompt was about four times longer than what appeared on the slide.

Ordinary users would likely type “a teddy bear reading a book.” Prompt enhancement inserts an intermediary that rewrites the user request into a detailed, in-distribution prompt that better uses the model’s learned capabilities.

The cooking analogy tied these interventions together. Pre-training is learning about food in general. Continued training is reading recipes for a specific domain. Supervised fine-tuning is improving presentation and following the order. Preference tuning is a food inspector tasting multiple attempts and saying which is better. Prompt enhancement is the waiter translating “I want meat” into the precise language the cook has learned to execute.

Personalization works when it preserves what the model already knew

Tuning adapts a model to a special case, such as generating one specific teddy bear rather than generic teddy bears. The method used to explain this was DreamBooth.

DreamBooth uses a few images of the target subject in different poses and associates that subject with a rare token, shown as [V]. A training prompt might be “a photo of a [V] teddy bear.” After tuning, inference can place that specific subject in a new context, such as “[V] teddy bear attending CME 296.”

The failure mode is overfitting. If the model trains only on the few subject images, it may specialize too aggressively and forget broader concepts it previously knew. DreamBooth addresses this through prior preservation. Alongside instance examples using the rare token, it includes generic class examples such as “a photo of a teddy bear.” The combined loss is:

$$\mathcal{L} = \mathcal{L}_{\text{instance}} + \lambda\,\mathcal{L}_{\text{prior}}$$

The prior term helps prevent the model from losing its previous ability to generate generic teddy bears and related concepts.

Updating all weights is often the wrong unit of work. Image models can contain billions of parameters, and Afshine said the biggest open-source text-to-image models he had in mind were on the order of tens of billions. LoRA, or low-rank adaptation, reduces the adaptation cost by representing the adapted weights as the base weights plus the product of two low-rank matrices:

$$W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$$

Only a small fraction of parameters needs to be trained while preserving similar performance. This is why combinations such as DreamBooth LoRA appear in practice.
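A sketch of the low-rank update and the resulting parameter savings, using plain nested lists so it runs without extra libraries. The layer size (4096 by 4096) and rank (16) in the usage lines are illustrative assumptions, and the `scale` argument is a hypothetical stand-in for the scaling factor LoRA implementations usually expose.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_weight(w0, b, a, scale=1.0):
    """Adapted weight: frozen base matrix plus a scaled low-rank product B @ A."""
    delta = matmul(b, a)
    return [
        [w + scale * d for w, d in zip(row_w, row_d)]
        for row_w, row_d in zip(w0, delta)
    ]

def trainable_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters as a share of the full weight matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)

w0 = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weights
b = [[1.0], [0.0]]              # 2 x 1, trainable
a = [[0.5, -0.5]]               # 1 x 2, trainable
print(lora_weight(w0, b, a))                 # [[1.5, 1.5], [3.0, 4.0]]
print(trainable_fraction(4096, 4096, 16))    # 0.0078125
```

At rank 16 on a 4096 by 4096 layer, the trainable matrices hold under 1% of the layer's parameters, which is the kind of saving that makes DreamBooth LoRA combinations practical.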

The trade-off is narrow but useful. DreamBooth-style tuning has minimal inference overhead and can preserve high fidelity to reference images. But training is time-consuming and costly, and the result is not very reusable for unrelated subjects. It is best suited when the goal is to generate many images of one specific subject, person, or object.

Distillation is a latency strategy with several incompatible shortcuts

Distillation attacks a different bottleneck from the earlier stages. Pre-training, post-training, and tuning are mostly about capability and control. Distillation is about the cost of using the model after those capabilities exist. The taxonomy matters because every method relocates the difficulty: some compress a fixed trajectory, some straighten the trajectory first, some enforce consistency along it, and some replace pointwise matching with distributional or adversarial pressure.

The production constraint is straightforward. Real systems face high-volume usage, limited user money, limited compute, limited time, and sometimes real-time requirements such as interactions or animations. A model that needs hundreds or thousands of denoising steps per image may be state of the art but still impractical.

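In traditional distillation, a large teacher and a smaller student produce output distributions, and the student is trained to match the teacher, often with KL divergence:

$$\mathcal{L}_{\text{KD}} = D_{\text{KL}}\!\left(p_{\text{teacher}} \,\Vert\, p_{\text{student}}\right)$$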
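For discrete output distributions, this traditional objective can be sketched as follows. The logits are made-up numbers, and the `temperature` argument is a common but optional ingredient that the lecture summary does not mention; both are assumptions for the illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution, optionally softened."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = softmax([2.0, 1.0, 0.1])
student = softmax([1.0, 1.0, 1.0])
print(kl_divergence(teacher, student))  # positive: the student has not matched yet
print(kl_divergence(teacher, teacher))  # 0.0: identical distributions
```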

The language-model example was BERT and DistilBERT: roughly half the parameters, around 60% savings, and 97% retained performance. In the image-generation setup described by Shervine Amidi, simply shrinking the student tends to degrade quality, so the more attractive knob is often the number of generation steps. The teacher may take many steps; the student aims for few, ideally one. Teacher and student are often the same size, differing mainly in step count.

A naive one-step student is too hard. Asking a model to approximate the teacher’s final output directly from noise in a single pass is like asking a painter to reproduce a detailed artwork in one stroke. The methods that follow differ in how they make that one-step or few-step problem less impossible.

| Method | Core move | Cost or constraint | Failure mode or caveat |
|---|---|---|---|
| Progressive Distillation | Halve the number of steps iteratively by training a student to match a teacher’s multi-step transition | Requires repeated distillation rounds and fixed discrete sampling | Accepts the original curved path as given |
| InstaFlow | Use reflow to straighten paths, then distill an N-step model into one step | Requires expensive teacher pre-work to pave the path | More reflow can introduce discretization errors |
| Consistency Models | Train predictions from different points on a deterministic ODE path to map to the same endpoint | Needs boundary conditions and stop-gradient machinery | Naive consistency could collapse to a constant output |
| DMD | Match teacher and student output distributions using score-difference guidance plus regression | Student advice loop is itself an expensive learned diffusion process | Teacher output is not ground truth; loss balance matters |
| ADD | Bring in a GAN-like discriminator to make real/fake supervision sharper | Adversarial training dynamics are harder to manage | Pixel-latent back-and-forth can be costly |
| LADD | Move adversarial diffusion distillation into latent space | Still inherits teacher-student and discriminator structure | Presented as a mitigation of pixel-latent transition overhead |

The distillation methods differ by where they move the difficulty: path compression, path straightening, local consistency, distribution matching, adversarial pressure, or latent-space reformulation.

Progressive distillation halves the number of steps iteratively. The teacher takes two small steps; the student learns to fit that transition in one larger step. Then the student becomes the new teacher, and the process repeats until the desired number of function evaluations is reached. The visual interpretation was secant lines along a curved path: each local approximation is tractable, and repeated halving compresses the sampling process.
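The halving loop can be sketched in two parts: the sequence of step counts across rounds, and the target a student is asked to hit in one step. The `step_fn` signature and the toy dynamics in the usage lines are hypothetical; a real setup would use the teacher network's sampler step.

```python
def halving_schedule(teacher_steps, target_steps):
    """Step counts across distillation rounds, halving until the target is reached."""
    schedule = [teacher_steps]
    while schedule[-1] > target_steps:
        if schedule[-1] % 2:
            raise ValueError("step count must stay even to be halved")
        schedule.append(schedule[-1] // 2)
    return schedule

def two_step_target(step_fn, x, t, dt):
    """Teacher output after two steps of size dt / 2.

    The student is trained to reproduce this in a single step of size dt.
    """
    midpoint = step_fn(x, t, dt / 2)
    return step_fn(midpoint, t + dt / 2, dt / 2)

rounds = halving_schedule(1024, 1)
print(rounds, "->", len(rounds) - 1, "distillation rounds")

# Toy teacher step (assumption): one Euler step on dx/dt = -x.
toy_step = lambda x, t, dt: x + dt * (-x)
print(two_step_target(toy_step, 1.0, 0.0, 0.5))  # 0.5625
```

Going from 1024 steps to 1 takes ten rounds, which is the repeated-distillation cost noted in the comparison table.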

The results shown from Salimans et al. reported better quality than DDIM at comparable numbers of function evaluations. Progressive Distillation had FID 9.12 with 1 model evaluation, 4.51 with 2, 3.00 with 4, and 2.57 with 8. DDIM had FID 13.36 with 10 evaluations, 6.84 with 20, 4.67 with 50, and 4.16 with 100.

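| Method | Model evaluations | FID |
|---|---|---|
| Progressive Distillation | 1 | 9.12 |
| Progressive Distillation | 2 | 4.51 |
| Progressive Distillation | 4 | 3.00 |
| Progressive Distillation | 8 | 2.57 |
| DDIM | 10 | 13.36 |
| DDIM | 20 | 6.84 |
| DDIM | 50 | 4.67 |
| DDIM | 100 | 4.16 |
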
The progressive distillation slide reported better FID than DDIM at the listed sampling budgets.

Its limitation is that it accepts the original curved path as fixed. It requires discrete, fixed sampling at inference time and needs rounds of distillation. That motivates a different shortcut: make the path itself easier.

InstaFlow builds on rectified flow and the reflow procedure. In reflow, one samples from the easy distribution, integrates the ODE to obtain a sample, pairs inputs with outputs, and fits a new model. The result is typically straighter trajectories, which require fewer Euler steps. InstaFlow starts from a pretrained model, applies reflow, and then distills an N-step model into a single-step one.

More reflow is not automatically better. Shervine called this a trick question: if the path is not fully straight, one might want more reflow steps, but additional reflows introduce discretization errors in the generated pairs. The preferred move is to let a teacher integrate the ODE and train a student to fit that output.

InstaFlow’s distillation uses a two-stage distance. The warm-up stage uses a simple metric such as MSE. The polish stage uses LPIPS, Learned Perceptual Image Patch Similarity. Rather than comparing raw pixel values, LPIPS passes images through a frozen pretrained model and compares L2 distances between feature maps across layers. The slide on rectified flows showed that most straightening gains arrive after the first reflow, reducing the incentive for repeated reflow.

The InstaFlow results table supported the claim that distillation on straightened paths works better than reflow alone. The highlighted comparison was InstaFlow-0.9B at one step versus the one-step reflow-only line above it: InstaFlow-0.9B had inference time 0.09, FID-5k 23.4, and CLIP 0.304; 2-RF one-step had inference time 0.09, FID-5k 47.0, and CLIP 0.271. Stable Diffusion 1.5 at 25 steps was listed with inference time 0.88, FID-5k 20.1, and CLIP 0.318.

The cost is pre-work. InstaFlow needs a teacher to “pave the entire way” first, and the learned path is then skipped by the one-step student. That raises the question of whether local consistency can be exploited more directly.

Consistency models impose the property that every point on a deterministic ODE path maps to the same clean endpoint:

$$f_\theta(x_t, t) = f_\theta(x_{t'}, t') \quad \text{for all } t, t' \text{ on the same trajectory}$$

If the model learns that mapping, it can predict the clean image from any point along the path, including very noisy states. Consistency training noises a clean image to two nearby levels, $t$ and $t'$, passes both through the same model, and trains their outputs to match. Consistency distillation uses a teacher model to provide the nearby transition, then trains the student’s predictions from those nearby points to be consistent.

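A collapse concern is obvious: a model could always output zero and be consistent. Consistency models avoid this with boundary conditions and stop-gradient systems. The parameterization shown was, with $F_\theta$ the network and $\epsilon$ the smallest time on the path:

$$f_\theta(x, t) = c_{\text{skip}}(t)\,x + c_{\text{out}}(t)\,F_\theta(x, t), \qquad c_{\text{skip}}(\epsilon) = 1, \quad c_{\text{out}}(\epsilon) = 0$$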

The design enforces boundary behavior, and the stop-gradient setup avoids symmetric updates through both branches by keeping one branch as a slow-moving average while the other performs the consistency prediction task.
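The boundary condition can be illustrated with a skip-connection parameterization of the form f(x, t) = c_skip(t)·x + c_out(t)·F(x, t). The sketch assumes the convention in which the smallest time `EPS` is the nearly clean end of the path and larger t is noisier; the coefficient formulas and the constants `EPS` and `SIGMA_DATA` are placeholder choices for the illustration, not values from the lecture.

```python
import math

EPS = 0.002        # smallest time on the path (assumed value)
SIGMA_DATA = 0.5   # assumed data scale

def c_skip(t):
    """Weight on the input; equals 1 at t = EPS."""
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)

def c_out(t):
    """Weight on the network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)

def consistency_fn(network, x, t):
    """Boundary-respecting output: returns x itself when t = EPS."""
    return c_skip(t) * x + c_out(t) * network(x, t)

# Even a degenerate network that always predicts 0 cannot make the model
# output a constant at the boundary: the input passes straight through.
always_zero = lambda x, t: 0.0
print(consistency_fn(always_zero, 0.7, EPS))   # 0.7
print(consistency_fn(always_zero, -1.2, EPS))  # -1.2
```

Because the output at the boundary is pinned to the input, the trivial constant solution is ruled out there, and the consistency constraint carries that anchor along the rest of the path.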

The decision map is therefore not “which distillation method is best.” Progressive distillation compresses a fixed path through repeated local halving. InstaFlow first straightens the path, then distills. Consistency models train a direct-to-endpoint mapping along deterministic paths. Each moves difficulty to a different part of the system: repeated rounds, expensive reflow precomputation, or self-consistency constraints that must avoid collapse.

Distributional and adversarial distillation trade softer supervision for sharper samples

Distribution Matching Distillation, or DMD, shifts from matching a teacher’s individual output to matching the teacher and student output distributions. Instead of forcing the student toward a single point, it asks how the student should move in sample space so its outputs look like the desired distribution.

The setup gives noise to both teacher and student. The student produces an output, which is noised again. A teacher-like denoising process and a student-like denoising process then provide “advice” for reaching the clean image. The update uses the difference between teacher and student advice: more of what the teacher says, less of what the student says. A regression loss also encourages the student prediction to roughly match the teacher prediction, which the lecture said helps stabilize training.

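The distribution loss is a KL divergence between fake and real distributions:

$$D_{\text{KL}}\!\left(p_{\text{fake}} \,\Vert\, p_{\text{real}}\right) = \mathbb{E}_{x \sim p_{\text{fake}}}\left[\log \frac{p_{\text{fake}}(x)}{p_{\text{real}}(x)}\right]$$

That expression was labeled intractable. But its gradient with respect to the parameters $\theta$ of the student generator $G_\theta$ has an interpretable score-difference form, with $s_{\text{real}}$ and $s_{\text{fake}}$ the scores of the two distributions:

$$\nabla_\theta D_{\text{KL}} = \mathbb{E}_{z}\left[-\big(s_{\text{real}}(x) - s_{\text{fake}}(x)\big)\,\nabla_\theta G_\theta(z)\right], \qquad x = G_\theta(z)$$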

In flow-matching terms, the score difference can be read as a difference in velocity-like advice. The result is a parameter update that moves the student toward what the teacher distribution recommends and away from what the student distribution already says.
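A one-dimensional toy makes the score-difference direction concrete. It assumes both the real and fake distributions are Gaussians with known mean and standard deviation, so their scores have a closed form; in the actual method both scores come from learned diffusion models. The function names and the step size `lr` are illustrative.

```python
def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of a 1D Gaussian at x."""
    return -(x - mean) / (std * std)

def dmd_direction(x, real, fake):
    """Teacher advice minus student advice at a generated sample x."""
    return gaussian_score(x, *real) - gaussian_score(x, *fake)

def nudge(x, real, fake, lr=0.1):
    """Move a generated sample a small step along the score difference."""
    return x + lr * dmd_direction(x, real, fake)

real = (2.0, 1.0)  # (mean, std) the student should reach
fake = (0.0, 1.0)  # (mean, std) the student currently produces

sample = 0.3
print(dmd_direction(sample, real, fake))  # positive: move toward the real mean
print(nudge(sample, real, fake))
print(dmd_direction(sample, real, real))  # zero once the distributions match
```

The direction vanishes when the two distributions coincide, which is why the signal can become soft as the student approaches the teacher.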

The results table from “One-step Diffusion with Distribution Matching Distillation” showed DMD with 1 forward pass and FID 2.62, compared with an EDM teacher using 512 forward passes and FID 2.32. Other one-pass methods in the table had worse FID values, including Progressive Distillation at 15.39, DFNO at 7.83, Consistency Model at 6.20, and Diff-Instruct at 5.57.

| Method | Forward passes | FID |
|---|---|---|
| DMD | 1 | 2.62 |
| EDM teacher | 512 | 2.32 |
| Consistency Model | 1 | 6.20 |
| Progressive Distillation | 1 | 15.39 |

The DMD slide positioned one-pass distribution matching close to a much more expensive teacher.

When asked what “real” and “fake” mean, the answer was that the paper uses GAN-style terminology. “Real” corresponds to the true or teacher distribution one wants to reach, while “fake” corresponds to the student distribution being moved closer to it. On collapse, the answer was conditional: collapse could occur without comparing teacher and student advice; without regression to the teacher output, the student may fail to reach the desired teacher distribution. Balancing the regression and distribution losses requires hyperparameter tuning.

The lecture’s targeted complaints about DMD were that the teacher output is not ground truth, the student advice loop is a learned diffusion process and therefore expensive, the student-advice model duplicates the concept of the student being trained, and the teacher-minus-student signal can be soft rather than actionable.

Adversarial Diffusion Distillation, or ADD, reframes the problem with a GAN-like discriminator. It introduces a detector that classifies outputs as real or fake, making the student’s prediction task sharper and encouraging crisper samples. The trade-off is that adversarial training brings harder-to-manage dynamics. ADD also involves back-and-forth movement between pixel and latent space.

Latent Adversarial Diffusion Distillation, or LADD, mitigates that by reformulating the discriminator loss in latent space. Shervine described this as a direction used quite a bit in the methods discussed in the lecture: keep the adversarial pressure, but avoid unnecessary pixel-latent transitions.

The final methodological pattern was explicit. Teacher-student combinations are powerful. Complexity and quality trade off. Regression losses such as MSE and LPIPS recur. Distributional and adversarial pressure can improve quality. And the lecture identified latent-space training as one of the recurring ingredients in current distillation models.
