Text-to-Image Evaluation Requires Metrics Matched to Specific Failure Modes

Shervine Amidi | Stanford Online | Thursday, May 28, 2026 | 19 min read

Stanford adjunct lecturers Afshine Amidi and Shervine Amidi argue that evaluating text-to-image models starts with separating aesthetic quality from prompt adherence, then choosing metrics suited to the failure being tested. In Lecture 7 of Stanford’s CME296 course on diffusion and large vision models, they treat human ratings, FID, CLIPScore, reference-based measures, multimodal judges, and benchmarks as imperfect instruments, each suited to particular failure modes, rather than as stand-ins for a universal image-quality score. Their central warning is practical: automated and qualitative evaluations can be useful, but only when their assumptions, calibration, and failure modes are made explicit.

Evaluation begins with two separable failures

Afshine Amidi frames evaluation as the prerequisite for improving a text-to-image model: before deciding what to change, one needs a way to decide whether the generated output is good. The practical scope is narrowed to two central, non-exhaustive dimensions: aesthetics and prompt adherence.

The distinction matters because a generated image can fail in different ways. For the prompt “A teddy bear reading a book,” one output is rejected because it does not look real or high-quality. Its failure is primarily aesthetic. A second image is visually pleasing but shows a teddy bear walking down a street holding a cell phone. Its failure is prompt adherence: it does not depict the requested teddy bear reading a book. A third image, showing a teddy bear sitting at a desk and reading an open book, is treated as good enough because it both follows the prompt and is more aesthetically pleasing.

Aesthetics asks whether the picture is good on its own. The lecture lists physical plausibility, cleanliness, perceptual quality, and realism as examples. The teddy bear’s book should be on the table rather than floating; the image should look plausible and clean.

Prompt adherence asks whether the model followed the input. That includes object recall, counting, text rendering, and style adherence. If the prompt says the teddy bear is reading a book about CME 296, then “CME 296” should appear on the book. If the prompt asks for a particular style, that style should be present.

These two buckets are not exhaustive. Safety, diversity, memorization, and bias are also evaluation dimensions. A model should not generate unsafe depictions; it should not always produce the same kind of image for a prompt; it should not merely reproduce what it saw during training; and it should avoid output patterns that could be interpreted as biased. But aesthetics and prompt adherence serve as the tractable core for the rest of the evaluation methods.

Evaluation approach	Best use case	Main failure mode or limitation
Human ratings	Directly measuring human preference or quality judgments	Expensive, slow, subjective, and affected by rater noise
Elo from pairwise comparisons	Ranking models when comparisons are relative rather than absolute	Depends on comparison design and still requires human or judge preferences
FID	Comparing generated and real image distributions for aesthetics or realism	Proxy metric with encoder, sample-size, reference-data, and Gaussian-assumption dependencies
CLIPScore / PickScore	Automated image-text alignment or preference scoring	Compresses complex failures into opaque scalar scores
Reference-based metrics	Reconstruction, editing, or tasks with a target image	Pixel and patch metrics can be sensitive to shifts; learned metrics may be hard to interpret
MLLM-as-a-Judge	Rubric-based, explainable evaluation of prompt-image pairs	Needs calibration against human judgment and careful prompting
Benchmarks	Testing specific failure modes such as objects, dense prompts, text, or edits	Coverage is specialized rather than universal

The lecture positions evaluation methods as tools for different failure modes, not as interchangeable replacements for one universal image score.

Human ratings are intuitive, but the protocol determines what the score means

Shervine Amidi starts with the most natural evaluation method: ask people. But the design of the human-rating task changes both the reliability of the result and the meaning of the score.

One option is to ask raters to score an image on a one-to-five scale, from “very bad” to “very good.” Across a dataset, the model’s score is the average rating: sum of ratings divided by number of ratings. The appeal is nuance. A rater can distinguish a very good image from a merely good one. The drawback is noise. Different raters may interpret the scale differently: one person’s five may be another person’s four or three. Shervine also notes that for many problems it is genuinely hard for a person to decide whether something is a four rather than a five.

A second option collapses the task to a binary rating: good or bad. The score is still the sum of ratings divided by the number of ratings, but now it represents the pass rate: the proportion of images that clear the bar. This is easier for raters and less noisy than a nuanced scale. But it still asks people to judge on an absolute scale, and absolute judgment is difficult without a reference.
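Both absolute protocols reduce to the same arithmetic, and only the meaning of the number changes. A minimal sketch, assuming hypothetical ratings for five images (the values and function names are illustrative, not from the lecture):

```python
def mean_rating(ratings):
    """Average of 1-to-5 ratings: sum of ratings divided by number of ratings."""
    return sum(ratings) / len(ratings)


def pass_rate(binary_ratings):
    """Share of images rated good (1) rather than bad (0)."""
    return sum(binary_ratings) / len(binary_ratings)


# Hypothetical ratings for the same five images under both protocols.
likert = [5, 4, 4, 2, 3]
binary = [1, 1, 1, 0, 0]

print(mean_rating(likert))  # 3.6
print(pass_rate(binary))    # 0.6
```

The first number is a position on the rating scale; the second is the proportion of images that clear the bar.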

Pairwise comparison addresses that difficulty. Given the same prompt, a rater sees two generated images and chooses which is better, or declares them equal. Shervine argues that this is easier than deciding whether a single image is above some absolute threshold. The teddy-bear example illustrates the point: students may differ on whether a single image is “good,” but when shown two images side by side, it can be obvious which one is better.

The naive metric for pairwise comparison is win rate: number of wins divided by number of comparisons. But win rate depends on the opponent. Winning against a weak model says little; winning against a strong model says much more. A leaderboard makes this operationally difficult. If models enter and leave the list, a raw win-rate approach would require repeated all-against-all comparison to keep the results comparable.

Elo scoring is introduced as a way to account for opponent strength. Each model has a rating, R. Suppose Model A starts at R_A = 1000 and is compared with a weaker model at R_B = 600. The expected score for Model A is:

E_A = 1 / (1 + 10^((R_B - R_A) / 400)) ≈ 0.9

That quantity is interpreted as a 90% expected chance of winning. The actual score S_A is 0 for a loss, 0.5 for a tie, and 1 for a win. The update is based on the difference between actual and expected performance:

Δ = S_A - E_A
R_A ← R_A + K × Δ

Here K is a constant step size that sets how far a single comparison can move a rating. If Model A wins, Δ is only about +0.1, because the win was expected. If it ties, Δ is about -0.4. If it loses, Δ is about -0.9, because losing to a weak model is strong negative evidence. The lecture points to the shown text-to-image leaderboard as an example of this style of scoring and says Elo helps avoid reevaluating everyone against everyone whenever the leaderboard changes. It also notes that Elo is the name of the person who devised the method, not an acronym.
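The update can be sketched in a few lines. The step size K = 32 below is an assumption chosen for illustration (the lecture summary does not give a value), and the function names are hypothetical:

```python
def expected_score(r_a, r_b):
    """Expected score of model A against model B under the Elo formula."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a, r_b, s_a, k=32):
    """Return A's new rating; s_a is 1 for a win, 0.5 for a tie, 0 for a loss."""
    delta = s_a - expected_score(r_a, r_b)
    return r_a + k * delta  # k = 32 is an assumed step size


# Model A at 1000 faces a weaker Model B at 600.
for outcome, s_a in [("win", 1.0), ("tie", 0.5), ("loss", 0.0)]:
    print(outcome, round(elo_update(1000, 600, s_a), 1))
# win 1002.9
# tie 986.9
# loss 970.9
```

An expected win barely moves the rating, while a loss to the weaker model costs about ten times as much.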

Human ratings remain limited. They are expensive, slow, subjective, and not necessarily ground truth. Fatigue, cultural bias, time of day, and external conditions can affect raters. Those limitations motivate automated metrics, not because automated metrics are perfect, but because relying on humans for every evaluation is impractical.

FID compares distributions, not individual prompt correctness

Reference-free metrics are introduced with a caveat: “reference-free” does not mean no reference data exists. It means the generated image is not compared against a single target image. In text-to-image generation, many different outputs can validly satisfy the same prompt. A single reference image would be unfair because it would privilege one possible realization.

For aesthetics, the lecture compares distributions instead. Take many real images and many generated images. Pass both sets through the same pre-trained encoder, placing them in a feature space. Then characterize the real-image distribution by a mean and covariance, written μ_r and Σ_r, and the generated-image distribution by μ_g and Σ_g. The mean captures location; the covariance captures spread and direction of spread. Spread matters because a model should generate diverse images, not a collapsed set of near-identical outputs.

The Fréchet Inception Distance, or FID, quantifies the distance between these two distributions:

FID = ||μ_r - μ_g||² + Tr(Σ_r + Σ_g - 2(Σ_rΣ_g)^(1/2))

The first term is the location difference: how far apart the distributions’ means are. The second term is the shape difference: how their covariance structures differ. The lower the FID, the better.
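To see the two terms at work, here is a deliberately simplified sketch. It assumes the covariances are diagonal (feature dimensions treated as uncorrelated), in which case the trace term reduces to a per-dimension sum. Real FID uses full covariance matrices over Inception features; the toy 4-dimensional "features" below are random stand-ins, not encoder outputs.

```python
import math
import random
import statistics


def diagonal_fid(real_feats, gen_feats):
    """FID under an assumed diagonal covariance (a simplification of the full formula)."""
    total = 0.0
    for d in range(len(real_feats[0])):
        real_d = [f[d] for f in real_feats]
        gen_d = [f[d] for f in gen_feats]
        mu_r, mu_g = statistics.fmean(real_d), statistics.fmean(gen_d)
        var_r, var_g = statistics.variance(real_d), statistics.variance(gen_d)
        total += (mu_r - mu_g) ** 2                              # location term
        total += var_r + var_g - 2.0 * math.sqrt(var_r * var_g)  # shape term
    return total


rng = random.Random(0)


def sample(mean, std, n=2000, dims=4):
    """Draw toy feature vectors; a stand-in for encoded images."""
    return [[rng.gauss(mean, std) for _ in range(dims)] for _ in range(n)]


real = sample(0.0, 1.0)
similar = sample(0.0, 1.0)    # same distribution as the real set
shifted = sample(1.0, 1.0)    # different location
collapsed = sample(0.0, 0.1)  # same location, far less spread

print(diagonal_fid(real, similar))    # near zero
print(diagonal_fid(real, shifted))    # large: location term dominates
print(diagonal_fid(real, collapsed))  # large: shape term dominates
```

The collapsed set has the right mean but is still penalized, which is the diversity signal described above.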

The metric is derived from the Wasserstein distance, which the lecture describes as measuring the effort required to transport one distribution into another. In general that distance does not have a closed-form solution. But if both the real and generated image feature distributions are assumed Gaussian, it can be expressed in the closed form used by FID. That Gaussian assumption is also one of the main criticisms of FID: real and generated image distributions are typically not Gaussian.

The “Inception” in FID comes from the pre-trained encoder used. If the goal is to compare with other models, one must use the same representation space; otherwise the numbers are not comparable. A pixel-space diffusion model can still be evaluated with FID: generate images in pixel space, pass those images through the Inception model, and compute the statistics in feature space.

The reference data should match the task of interest. If a model is intended for faces, nature scenes, or indoor scenes, the real-image distribution should represent that target. The lecture mentions ImageNet and MS COCO as examples of datasets whose images, captions, or class conditions can define the relevant distribution.

Sample size matters. The class had already seen FID-50K plots in earlier lectures, where 50,000 generated images are compared with 50,000 real images. The lecture notes that 30,000 is also seen, but the typical order is tens of thousands.

FID is widely used, “for better or worse.” In response to a student’s concern that mean and covariance may not fully represent quality, the answer is that FID is not perfect and the community knows it. It is a good-enough proxy, or at least treated that way by the community, and it persists partly because papers need to compare against prior work that already used FID. That inertia makes it difficult to replace. Papers often supplement FID with sample images, cherry-picked or not, to show the model’s qualitative behavior.

The practical reading is cautious. A large location difference may indicate different quality or style. A shape difference may indicate lack of diversity or variety. But FID remains a proxy, dependent on sample size, reference distribution, encoder choice, and the normality assumption.

CLIPScore and PickScore automate alignment and preference, but remain opaque

For prompt adherence, the lecture returns to CLIP, introduced earlier in the course as Contrastive Language-Image Pretraining. CLIP separately encodes text and images, then uses a contrastive objective that pushes matching image-text pairs together and nonmatching pairs apart.

CLIPScore applies that setup directly to evaluation. Given the original prompt and the generated image, the raw CLIP model produces a score for alignment between text and image. This is a natural reference-free prompt-adherence metric: the prompt is the condition, not a single reference image.
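At its core this is a similarity between two embedding vectors. A minimal sketch, assuming the text and image embeddings have already been produced by CLIP's encoders: the three-dimensional vectors below are made-up stand-ins, and the clamp at zero follows how the score is commonly defined (some implementations also rescale by a constant).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_style_score(text_emb, image_emb):
    """Alignment score: cosine similarity, clamped so it is never negative."""
    return max(cosine_similarity(text_emb, image_emb), 0.0)


# Hypothetical embeddings; real ones come from the CLIP text and image encoders.
prompt_emb = [0.9, 0.1, 0.2]
matching_image_emb = [0.8, 0.2, 0.1]
off_prompt_image_emb = [0.1, 0.9, 0.3]

print(clip_style_score(prompt_emb, matching_image_emb))    # higher
print(clip_style_score(prompt_emb, off_prompt_image_emb))  # lower
```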

PickScore is related but more preference-oriented. Instead of using a raw CLIP model only to estimate image-text alignment, PickScore uses a CLIP-like model trained on preference data. The input is still text plus image, but the output is meant to predict human preference. Shervine describes it as a holistic score that combines aesthetics, prompt adherence, and other factors into a measure of likely human satisfaction with the generated output.

These metrics sit between hand-coded mathematical distances and direct human judgment. They automate evaluation and use learned image-text representations, but they still compress a complex judgment into a number. If a score is low or unexpectedly high, the score itself often does not explain why.

Reference-based metrics are useful when a target image exists

Reference-based metrics apply when there is a specific image the output should match. The clearest example is VAE reconstruction: an encoder and decoder attempt to reconstruct an original input image. In that setting, the original image x is a legitimate reference, and the reconstruction x-hat can be compared against it.

The same logic can apply beyond VAEs. In image editing, for example, one may want to change some aspect of an image while preserving everything else. A reference-based metric can help verify that the output remains close to the original where it should.

The simplest metric is mean squared error, or MSE: a pixel-wise distance between x and x-hat.

MSE(x, x̂) = (1 / HWC) Σ_i Σ_j Σ_c (x_ijc - x̂_ijc)²

MSE is appropriate when exact pixel reconstruction is the target and alignment is controlled. Its limitations are severe when perceptual similarity is the goal. A perfect reconstruction shifted a few pixels to the right can receive a terrible MSE. The number is also not directly interpretable because it depends on pixel scaling: values might be represented between 0 and 1 or between 0 and 255.
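Both limitations are easy to reproduce. The sketch below uses a hypothetical 8×8 grayscale stripe pattern (a single channel, so C = 1) and a wrap-around shift as a stand-in for a small translation:

```python
def mse(x, x_hat):
    """Mean squared error between two equally sized 2-D grayscale images."""
    h, w = len(x), len(x[0])
    return sum((x[i][j] - x_hat[i][j]) ** 2 for i in range(h) for j in range(w)) / (h * w)


def shift_right(image, pixels):
    """Shift every row right with wrap-around (a stand-in for a small translation)."""
    return [row[-pixels:] + row[:-pixels] for row in image]


def rescale(image, factor):
    """Change the pixel range, e.g. from [0, 1] to [0, 255]."""
    return [[factor * p for p in row] for row in image]


# Vertical stripes two pixels wide, values in [0, 1].
stripes = [[1.0 if (j // 2) % 2 == 0 else 0.0 for j in range(8)] for _ in range(8)]
shifted = shift_right(stripes, 2)

print(mse(stripes, stripes))                             # 0.0
print(mse(stripes, shifted))                             # 1.0
print(mse(rescale(stripes, 255), rescale(shifted, 255)))  # 65025.0
```

The shifted image contains exactly the same pattern, yet it receives the worst possible error for this pixel range, and the same pair of images scores 1.0 or 65025.0 depending only on how pixels are scaled.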

Peak signal-to-noise ratio, or PSNR, normalizes MSE by the square of the maximum possible pixel value and wraps the result in a logarithm:

PSNR(x, x̂) = 10 log10(MAX_I² / MSE(x, x̂))

PSNR is better when one still wants a pixel-wise measure but needs a more interpretable normalized scale. The logarithm reflects a perceptual intuition: the same absolute increase in error matters more when the image is already very accurate than when it is already poor. Afshine uses a lighting analogy. In a dark room, turning on one light bulb makes a large perceptual difference. In a room already full of light, adding one more light bulb matters much less. PSNR remains sensitive to spatial shifts, so it should not be mistaken for a robust perceptual metric.
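A short sketch of the formula shows both properties: the score no longer depends on the pixel range, and equal ratios of error map to equal steps in decibels. The MSE values are hypothetical inputs chosen for illustration.

```python
import math


def psnr(mse_value, max_pixel=1.0):
    """Peak signal-to-noise ratio in decibels; infinite for a perfect match."""
    if mse_value == 0:
        return math.inf
    return 10.0 * math.log10(max_pixel ** 2 / mse_value)


# The same reconstruction error expressed in two pixel ranges.
print(psnr(0.01, max_pixel=1.0))             # about 20 dB
print(psnr(0.01 * 255 ** 2, max_pixel=255))  # about 20 dB as well

# Cutting the error by a factor of 100 adds about 20 dB at any starting level.
print(psnr(0.0001, max_pixel=1.0))           # about 40 dB
```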

Structural Similarity, or SSIM, is more appropriate when the question is whether local structure is preserved. It compares corresponding patches using luminance, contrast, and structure. Luminance asks whether the patches have the same brightness, represented by the mean. Contrast asks whether they have the same variance. Structure asks whether pixels vary together, represented through covariance or correlation.

The lecture demystifies the similarity form behind these terms with the expression:

2AB / (A² + B²)

For positive A and B, this can be rewritten as:

1 - ((A - B)² / (A² + B²))

That shows why the quantity lies between 0 and 1 and reaches 1 when A = B. It also captures relative difference. A = 10 and B = 20 gives a similarity around 0.8, while A = 100 and B = 110 gives a value around 0.995, even though both pairs differ by 10. The point is the same as in the logarithmic PSNR intuition: the level at which a difference occurs matters.
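The two forms and the quoted values can be checked directly (function names are illustrative):

```python
def ratio_similarity(a, b):
    """Similarity in the form 2AB / (A^2 + B^2)."""
    return 2 * a * b / (a ** 2 + b ** 2)


def ratio_similarity_rewritten(a, b):
    """Equivalent form 1 - (A - B)^2 / (A^2 + B^2)."""
    return 1 - (a - b) ** 2 / (a ** 2 + b ** 2)


print(ratio_similarity(10, 20))              # 0.8
print(round(ratio_similarity(100, 110), 4))  # 0.9955
print(ratio_similarity(7, 7))                # 1.0
```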

SSIM multiplies luminance, contrast, and structure similarity for each patch, then averages across patches:

SSIM(x, x̂) = (1 / M) Σ_j SSIM(x_j, x̂_j)

The score ranges from -1 to 1, with values closer to 1 indicating greater structural similarity. SSIM is useful when patch-level structure matters more than exact pixel equality, but it is still vulnerable to spatial shift, especially if the shift exceeds the patch structure being compared.
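A patch-level sketch makes the three factors concrete. The stabilizing constants c1, c2, and c3 are assumptions taken from the commonly used SSIM formulation (they are not given in the lecture summary), patches are passed as flat lists of pixel values, and plain rather than window-weighted statistics are used for brevity.

```python
import math


def patch_ssim(p, q, max_pixel=1.0):
    """SSIM of two flattened patches as luminance * contrast * structure."""
    n = len(p)
    mu_p, mu_q = sum(p) / n, sum(q) / n
    var_p = sum((a - mu_p) ** 2 for a in p) / n
    var_q = sum((b - mu_q) ** 2 for b in q) / n
    cov = sum((a - mu_p) * (b - mu_q) for a, b in zip(p, q)) / n
    sd_p, sd_q = math.sqrt(var_p), math.sqrt(var_q)
    # Assumed constants from the standard formulation; they avoid division by zero.
    c1, c2 = (0.01 * max_pixel) ** 2, (0.03 * max_pixel) ** 2
    c3 = c2 / 2
    luminance = (2 * mu_p * mu_q + c1) / (mu_p ** 2 + mu_q ** 2 + c1)
    contrast = (2 * sd_p * sd_q + c2) / (var_p + var_q + c2)
    structure = (cov + c3) / (sd_p * sd_q + c3)
    return luminance * contrast * structure


def mean_ssim(patches_x, patches_x_hat):
    """Average the patch scores over corresponding patches."""
    scores = [patch_ssim(p, q) for p, q in zip(patches_x, patches_x_hat)]
    return sum(scores) / len(scores)


patch = [0.1, 0.9, 0.1, 0.9]
inverted = [0.9, 0.1, 0.9, 0.1]  # same brightness and contrast, opposite pattern

print(patch_ssim(patch, patch))     # approximately 1
print(patch_ssim(patch, inverted))  # close to -1
print(mean_ssim([patch, patch], [patch, inverted]))
```

The inverted patch matches on luminance and contrast, so only the structure term drives the score negative, which is why the overall range extends below zero.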

LPIPS, or Learned Perceptual Image Patch Similarity, is the better fit when perceptual similarity is the target and exact alignment is too brittle. Instead of measuring distance directly between x and x-hat in pixel space, both images are passed through a pre-trained encoder. Distances are computed between feature maps at different layers, with learned weights chosen to align the metric with human perceptual judgments:

LPIPS(x, x̂) = Σ_l (1 / (H_l W_l)) ||w^l ⊙ (φ_l(x) - φ_l(x̂))||²_2

The lecture says LPIPS commonly uses encoders such as VGG or AlexNet, and that the layer weights depend on the encoder. LPIPS is widely used, but its limitation is interpretability: if the score is bad, the number alone does not say what went wrong.

The practical hierarchy is straightforward. Use MSE or PSNR when exact reconstruction and pixel alignment are central. Use SSIM when preserving local structure is the concern and modest patch-level comparison is acceptable. Use LPIPS when the desired proxy is closer to human perceptual similarity, while accepting that the result will be less diagnostic.

Multimodal LLMs make evaluation conversational rather than purely scalar

Fixed metrics each operate at a different level and capture only a fragment of meaning, so some misalignment with human judgment is inevitable. They may give a number without explaining the failure. Multimodal LLMs are introduced because they can take image and text inputs and produce text.

The desired setup is simple: provide an image, ask a question about it, and receive an interpretable answer. Given a teddy bear image, one might ask, “How cute is this teddy bear?” and receive “Very cute!” The question can guide the evaluation, and the output can be read as language rather than only as a scalar.

Two architecture patterns are contrasted. The first reuses the original Transformer’s cross-attention. Text tokens feed a decoder, while image embeddings are used as cross-attention keys and values. Flamingo is given as an example: images are provided as keys and values, and placeholder tokens gate when the model should attend to them. This is computationally efficient because the fixed context length is spent only on text tokens; images enter through cross-attention instead of occupying context positions.

The caveat is that the latest large language models are often decoder-only and have removed cross-attention. To use those models directly, one would need to re-engineer cross-attention into the architecture.

The second approach treats multimodal inputs as regular LLM inputs. Image patches are encoded, for example with a ViT-like encoder, and their embeddings are fed into the decoder-only model along with text embeddings. LLaVA is cited as an example. Shervine says this latter technique is often what people use in practice for recent MLLMs, because it reuses a proven decoder-only design and the “wins” of modern LLM architecture.

The capabilities that matter for evaluation include working across resolutions, spatial awareness, OCR, additional modalities, and reasoning. OCR matters because image-generation models may be asked to render readable text inside images. Reasoning matters because evaluation should not be a black box that emits only a number; it should provide a rationale behind the grade.

TIFA makes faithfulness debuggable by turning a prompt into claims

Traditional metrics can be holistic and opaque. A generated image might receive a CLIPScore of 0.922, but the score does not say why it is not higher or what aspect failed. TIFA — Text-to-Image Faithfulness Evaluation with QA — is presented as a way to make faithfulness evaluation interpretable.

The idea is to decompose the prompt into atomic questions. For the prompt “A cute teddy bear is reading a book,” the questions might be:

Is there a teddy bear?
Is the teddy bear cute?
Is there a book?
Is the teddy bear reading the book?

Each question has a simple yes-or-no answer. A multimodal LLM can assess each dimension independently against the generated image. If all four are satisfied, the TIFA score shown in the lecture is 100.0. If one fails, the evaluator can see exactly which claim failed.
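The scoring step is a pass rate over the atomic questions. A minimal sketch, assuming the yes/no answers have already been returned by a multimodal judge (the answers below are hypothetical, for an image that contains no book):

```python
def tifa_score(answers):
    """Percentage of atomic yes/no questions answered 'yes'."""
    passed = sum(1 for ok in answers.values() if ok)
    return 100.0 * passed / len(answers)


def failed_claims(answers):
    """List the questions that failed, which is what makes the score debuggable."""
    return [question for question, ok in answers.items() if not ok]


# Hypothetical judge answers for one generated image.
answers = {
    "Is there a teddy bear?": True,
    "Is the teddy bear cute?": True,
    "Is there a book?": False,
    "Is the teddy bear reading the book?": False,
}

print(tifa_score(answers))     # 50.0
print(failed_claims(answers))  # the two book-related questions
```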

The benefit is simplicity and debuggability. Unlike CLIPScore, which may indicate partial mismatch without explaining it, TIFA can identify whether the model forgot the book, failed to depict the teddy bear as cute, or missed the reading relation.

The drawbacks come from the rubric itself. The questions are bespoke to each prompt, so generating them across a dataset is expensive and error-prone. The weighting of claims may also fail to match their importance. If an image has no book at all, one or two questions may fail and the score may be roughly half of the maximum, but that may not reflect how severe the failure should be for the intended use case. TIFA improves interpretability, but it does not eliminate rubric design.

VQAScore targets compositional failures that CLIPScore can miss

Shervine Amidi gives a second complaint about traditional metrics: they may not carry composition well. CLIPScore can preserve a broad semantic match while missing important relational changes.

The example is deliberately simple. The sentence “A cute teddy bear is reading a book” and the sentence “A cute book is reading a teddy bear” contain similar words but describe very different scenes. Intuitively, swapping “book” and “teddy bear” changes the image one should expect. But from a CLIPScore standpoint, the scores may not change much.

Two explanations are offered. First, the image and sentence are projected into vectors, and those vectors may not richly encode all semantic subtleties. Second, CLIP’s contrastive training uses in-batch negatives: the model is incentivized to separate captions and images that have little to do with each other, but it is not necessarily incentivized to distinguish fine compositional differences among prompts with overlapping concepts.

VQAScore — Visual Question Answering Score — uses a different template. Given an image and a prompt, ask:

“Does this figure show [Prompt]? Please answer yes or no.”

Then inspect the next-token probability assigned to “yes.” That probability becomes the score. If the image matches the sentence, the probability should be high; otherwise, low.
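The mechanics can be sketched without a model. The code below assumes an MLLM has already returned next-token logits for the question; the three-token vocabulary and the logit values are invented for illustration, and real vocabularies are far larger.

```python
import math


def build_question(prompt):
    """Fill the template with the prompt used to generate the image."""
    return f"Does this figure show {prompt}? Please answer yes or no."


def softmax(logits):
    """Convert a token-to-logit mapping into probabilities."""
    peak = max(logits.values())
    exps = {token: math.exp(value - peak) for token, value in logits.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


def vqa_score(next_token_logits, yes_token="yes"):
    """Score is the probability mass the model puts on answering 'yes'."""
    return softmax(next_token_logits)[yes_token]


# Hypothetical logits for an image that matches and one that does not.
matching = {"yes": 3.0, "no": 0.5, "maybe": -1.0}
mismatched = {"yes": 0.2, "no": 2.8, "maybe": -1.0}

print(build_question("a cute teddy bear reading a book"))
print(round(vqa_score(matching), 2))    # high
print(round(vqa_score(mismatched), 2))  # low
```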

This uses an MLLM’s language-generation structure rather than only comparing global embeddings. It can better distinguish the original teddy-bear sentence from the swapped sentence in which the book reads the teddy bear. A student asks whether a regular LLM can do this; Shervine answers that, strictly speaking, no, because the input includes an image. But an off-the-shelf MLLM can be used zero-shot because it is trained to understand image-text concepts.

VQAScore has practical limitations. It assumes access to next-token probabilities, which many closed models do not expose. The lecture notes that token probabilities are sensitive in distillation contexts because they reveal useful information about a model. VQAScore also requires one MLLM call per question. If combined with a TIFA-style decomposition, cost scales with the number of dimensions.

This motivates a broader shift: instead of engineering many prompt-specific outcomes and then inferring what went wrong, define what “good” means conceptually through a rubric and ask a multimodal judge to evaluate against that rubric.

MLLM-as-a-Judge depends on rubrics, calibration, and disciplined prompting

VIEScore — Visual Instruction-guided Explainable Score — is presented as a representative rubric-based approach. It takes a prompt, a generated image, and a rubric, then asks an MLLM to produce evidence and a score. The two broad dimensions from the beginning reappear: semantic consistency, meaning alignment with the prompt, and perceptual quality, meaning whether the image looks authentic and natural.

The value of the judge is not only the score but the evidence. A structured output can include a rationale and numeric rating, often returned in JSON so downstream systems can parse fields. The example structure is: provide the input, provide rubric dimensions with guidelines, and ask for a structured result containing fields such as rationale and score.
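That structure can be sketched as a prompt builder plus a strict parser. Everything specific here is an assumption for illustration: the 0-to-10 scale, the rubric wording, the field layout, and the sample reply, which stands in for what an MLLM might return.

```python
import json

# Hypothetical rubric with the two dimensions named above.
RUBRIC = {
    "semantic_consistency": "Does the image match the prompt?",
    "perceptual_quality": "Does the image look authentic and natural?",
}


def build_judge_prompt(prompt, rubric):
    """Assemble input, rubric guidelines, and the requested output structure."""
    lines = [f"Prompt given to the image model: {prompt}", "Rubric:"]
    lines += [f"- {name}: {guideline}" for name, guideline in rubric.items()]
    lines.append(
        "Return JSON mapping each dimension to an object with a "
        '"rationale" string followed by a "score" from 0 to 10.'
    )
    return "\n".join(lines)


def parse_judge_reply(reply_text, rubric):
    """Parse the judge's JSON and reject missing fields or out-of-range scores."""
    data = json.loads(reply_text)
    parsed = {}
    for name in rubric:
        score = float(data[name]["score"])
        if not 0 <= score <= 10:
            raise ValueError(f"score out of range for {name}: {score}")
        parsed[name] = {"rationale": str(data[name]["rationale"]), "score": score}
    return parsed


# Stand-in for a model reply.
reply = json.dumps({
    "semantic_consistency": {"rationale": "Bear present, but no book.", "score": 4},
    "perceptual_quality": {"rationale": "Clean and plausible lighting.", "score": 8},
})

print(build_judge_prompt("A teddy bear reading a book", RUBRIC))
print(parse_judge_reply(reply, RUBRIC))
```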

The central problem is calibration. How does one ensure that the judge’s rationale and score match human intuition?

The proposed workflow has three stages. First, seed the process with human expertise: collect prompts, generated images, and human ratings for the task and rubric dimensions of interest. Second, calibrate the MLLM-as-a-Judge by crafting, or using a model to help craft, rubric instructions that align judge scores with human ratings. Third, automate: use the calibrated judge on new prompt-image pairs.
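The calibration stage needs some measure of agreement between judge and human scores. One possible sketch compares two candidate rubric instructions by correlation and mean absolute gap; these two statistics are an assumption (the lecture summary does not name one), and all ratings below are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_abs_gap(xs, ys):
    """Average absolute difference between judge and human scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical seed set: human ratings and two judge prompt variants.
human = [5, 4, 2, 1, 3, 4]
judge_v1 = [3, 3, 3, 2, 3, 3]  # vague rubric: scores bunch in the middle
judge_v2 = [5, 4, 1, 1, 3, 5]  # refined rubric

for name, scores in [("v1", judge_v1), ("v2", judge_v2)]:
    print(name, round(pearson(human, scores), 2), round(mean_abs_gap(human, scores), 2))
```

Under these made-up numbers the refined rubric tracks the human ratings more closely on both measures, so it would be the one carried into the automation stage.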

Three use cases are distinguished. Pointwise evaluation scores a single response and is useful for diagnostics, because the rationale can identify what worked and what failed. Pairwise evaluation asks which of two responses is better and is useful for comparing model versions, such as deciding whether a new model iteration should replace the old one. Batch ranking asks the model to rank several responses, but Shervine says it is not typically used in practice because ordering can be too sensitive and high-variance.

The judge section closes with operating guidance:

Break the score into atomic criteria rather than relying on one broad number.
Ask the MLLM to describe evidence before giving a score.
Use low temperature and structured outputs for deterministic judging.
Randomize A/B order in pairwise judging to reduce position bias.
Validate judge scores against human ratings before trusting them.
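The order-randomization item is simple to implement. In this sketch the judge is a hypothetical callable that returns "first", "second", or "tie"; the extreme case of a judge that always prefers the first slot shows how shuffling turns a systematic advantage into noise.

```python
import random


def judge_pair_with_random_order(image_a, image_b, judge, rng):
    """Show the pair in random order, then map the verdict back to A or B."""
    swapped = rng.random() < 0.5
    first, second = (image_b, image_a) if swapped else (image_a, image_b)
    verdict = judge(first, second)  # hypothetical judge: "first", "second", or "tie"
    if verdict == "tie":
        return "tie"
    picked_first = verdict == "first"
    if swapped:
        return "B" if picked_first else "A"
    return "A" if picked_first else "B"


def first_slot_biased_judge(first, second):
    """Worst-case stand-in: always prefers whatever is shown first."""
    return "first"


rng = random.Random(0)
outcomes = [
    judge_pair_with_random_order("img_a", "img_b", first_slot_biased_judge, rng)
    for _ in range(1000)
]
a_share = outcomes.count("A") / len(outcomes)
print(a_share)  # close to 0.5 instead of 1.0
```

Without the shuffle, the same judge would award every comparison to whichever model happened to be listed first.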

The final warning is explicit: do not trust an MLLM-as-a-Judge blindly. It is only useful once its scores have been checked against human judgment for the task at hand.

Benchmarks are strongest when tied to specific failure modes

The benchmark examples are organized around capabilities rather than a single universal score.

GenEval targets object alignment. Its goal is to test whether a model renders specific objects and attributes. The lecture describes roughly 600 prompts across six tasks: single object, two objects, counting, colors, position, and color attribution. Evaluation is yes-or-no per task and uses object detection, geometry, and color classification. A sample prompt asks for “a photo of a purple backpack and a white umbrella”; the evaluation must verify both objects and their correctly attributed colors.

DPG-Bench targets dense prompt following. It asks whether a model remembers every detail in a long prompt. The benchmark decomposes dense prompts into a graph of yes-or-no questions over entities, attributes, and relations, with about 14,000 questions across about 1,000 prompts. Its graph structure matters: if a prerequisite is absent, downstream questions need not be checked. In the example involving an invisible man, floating horn-rimmed glasses, a pearl bead necklace, a smartphone, a couch, magazines, and a remote control, the evaluation proceeds coarse-to-fine through logical dependencies.
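The dependency logic can be sketched as a walk over questions in prerequisite order. The question names and answers are hypothetical, loosely based on the example above, and counting a skipped question as failed is an assumption about scoring rather than something the lecture summary specifies.

```python
def evaluate_question_graph(questions, answer_fn):
    """questions: (question_id, parent_id or None) pairs, parents listed first."""
    results = {}
    for question_id, parent_id in questions:
        if parent_id is not None and not results[parent_id]:
            results[question_id] = False  # prerequisite absent: skip the check
            continue
        results[question_id] = answer_fn(question_id)
    return results


questions = [
    ("glasses present", None),
    ("glasses are floating", "glasses present"),
    ("couch present", None),
    ("magazines on couch", "couch present"),
]

# Hypothetical judge answers; the glasses are missing from the image.
hypothetical_answers = {
    "glasses present": False,
    "glasses are floating": True,
    "couch present": True,
    "magazines on couch": True,
}
asked = []


def answer_fn(question_id):
    """Stand-in for a VQA call; records which questions were actually asked."""
    asked.append(question_id)
    return hypothetical_answers[question_id]


results = evaluate_question_graph(questions, answer_fn)
print(asked)                                 # the floating question is never asked
print(sum(results.values()) / len(results))  # 0.5
```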

LongText-Bench targets text rendering. It tests whether models can generate readable multi-line text in images. The lecture describes 160 English and 160 Chinese prompts across eight scenarios: signs, labels, printed materials, webpages, slides, posters, captions, and dialogues. Evaluation uses OCR extraction of the rendered text and compares it with the reference text used in the prompt.

GEdit-Bench targets image editing. The goal is to edit an image without destroying the original context. The lecture describes about 600 editing examples across 11 categories, including background, color, material, motion, style, text change, photoshopping, subject add/remove/replace, and tone transfer. Evaluation uses a VIEScore-style MLLM-as-a-Judge over perceptual quality and semantic consistency. Example edits include restoring and colorizing an old photo, replacing the text “TRAIN” with “PLANE,” transforming an image into Ghibli style, removing a beard, and making a person give a thumbs-up.

Benchmark	Evaluation focus	Mechanism described
GenEval	Object alignment	Object detection, geometry, and color classification over object, count, color, position, and attribution tasks
DPG-Bench	Dense prompt following	A graph of yes/no questions over entities, attributes, and relations, judged with VQA
LongText-Bench	Text rendering	OCR extraction of rendered text and matching against the reference text
GEdit-Bench	Image editing	VIEScore-style MLLM-as-a-Judge over semantic consistency and perceptual quality

The benchmark examples map evaluation to concrete failure modes rather than a generic notion of image quality.

Sample images are evidence, not proof

Qualitative examples remain useful, but they can be misleading as proof of model quality. Metrics are imperfect: quantitative scores are proxies; MLLM-as-a-Judge depends on calibration; benchmarks cover selected dimensions. But selected sample images also do not tell the whole story.

The lecture gives a statistical thought experiment. Suppose a model’s generated image quality is represented by a random variable:

X_i ~ N(0, 1)

If one displays the best of three independent samples, the displayed quality is:

Z = max(X_1, X_2, X_3)

If one displays the worst of three, it is:

Y = min(X_1, X_2, X_3)

These distributions differ even though the underlying image-generation model is the same. The sampling procedure can change the impression of model quality. Showing selected images can demonstrate capabilities, but it should not be treated as proof of overall performance.
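A quick simulation of the thought experiment shows the size of the effect. The trial count and seed are arbitrary choices; the "model" is the same standard normal in both cases.

```python
import random
import statistics


def simulate_showcase(pick, trials=20000, samples_per_trial=3, seed=0):
    """Average displayed quality when `pick` chooses one of three N(0, 1) draws."""
    rng = random.Random(seed)
    shown = [
        pick(rng.gauss(0, 1) for _ in range(samples_per_trial))
        for _ in range(trials)
    ]
    return statistics.fmean(shown)


best_of_three = simulate_showcase(max)
worst_of_three = simulate_showcase(min)

print(round(best_of_three, 2))   # roughly +0.85
print(round(worst_of_three, 2))  # roughly -0.85
```

The underlying mean is 0, yet the showcased average moves by roughly 0.85 standard deviations in either direction purely from the selection rule.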
