AI Evaluation Benchmarks Measure Different Questions, Not One Scoreboard

Tatsunori Hashimoto | Stanford Online | Wednesday, May 20, 2026 | 19 min read

Stanford’s CS336 lecture on evaluation, led by Percy Liang with sections from Tatsunori Hashimoto, argues that model evaluation is not a single scoreboard but a choice about what behavior is being measured and for what purpose. The lecture treats perplexity, exam benchmarks, chat preferences, agent tasks, reasoning puzzles, safety tests and realistic professional evaluations as different instruments with different failure modes. Its central claim is procedural: before reading or designing a benchmark, define the object being evaluated, the use case it serves and the trade-offs among difficulty, realism and validity.

Evaluation is where an abstract goal becomes a metric

Percy Liang framed evaluation as the question that has to be answered before choosing training data. Data shapes behavior: a model trained on code should become better at code; one trained only on DNA sequences likely will not speak English. But deciding what data to use depends first on deciding what behavior is wanted.

Evaluation asks a simple question — given a trained model, how good is it? — but the simplicity is misleading. The mechanical version is straightforward: define prompts, send them to the model, get responses, compute accuracy. The deeper problem is converting an abstract construct such as “good at conversation” or “good at reasoning” into concrete prompts, environments, and metrics.

That conversion matters because evaluations become north stars. Model developers, open and closed, look to benchmark scores as measures of progress. A metric does not merely report development; it shapes it.

Several incompatible but plausible notions of “good” can coexist. A model might be good if it ranks highly on an aggregate benchmark such as the Artificial Analysis Intelligence Index, which combines multiple applications. It might be good if it performs well relative to inference cost, since more expensive models often score higher but not in a perfectly linear way. It might be good if people prefer its responses, the premise behind Arena AI’s pairwise preference rankings. Or it might be good if people choose to use and pay for it, as reflected in OpenRouter usage statistics. No one of these was treated as the answer. The point was that “good” is not a primitive property of a model; it is a lens.

The governing point is that every evaluation involves trade-offs, and which ones are acceptable depends on what is being measured and why. Difficulty, realism, validity, and the evaluated object (method, model, agent scaffold, user distribution, or policy risk) pull in different directions. A score from an exam benchmark, a chat preference arena, a coding-agent task, and a safety red-team suite can all be useful, but they do not mean the same thing.

Perplexity is the natural metric for language models, and also an incomplete one

The most native evaluation for a language model starts from the definition of the object being trained: a probability distribution over token sequences. Perplexity asks how much probability mass the model assigns to a dataset, normalized into an interpretable scale. In pretraining, models minimize perplexity on the training set; historically, language modeling research measured perplexity on a test set from the same distribution.
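As a minimal sketch of that definition, perplexity can be computed as the exponential of the average negative log-probability per token. The function name and the toy log-probabilities below are illustrative assumptions, not something shown in the lecture.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Toy case: a model that assigns probability 1/4 to each of 8 tokens.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))  # close to 4: as uncertain as a 4-way guess
```

Read this way, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.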

That older paradigm was simple. Researchers trained on a train split and evaluated on a test split from standard datasets such as Penn Treebank, WikiText-103, and the One Billion Word Benchmark. Liang described the 2016 result applying pure CNNs and LSTMs to the One Billion Word Benchmark — reducing perplexity from 51.3 to 30.0 — as an important signal at a time when n-gram and hybrid models were still part of the conversation. In that setting, progress meant lower test perplexity.

GPT-2 changed the evaluation style. OpenAI trained it on WebText, a 40GB dataset of websites linked from Reddit, and evaluated it zero-shot on standard datasets. That made the evaluation out-of-distribution: the model was not trained on those benchmark train splits and then tested on their test splits. GPT-2’s gains were especially strong on small datasets where transfer helped, such as Penn Treebank, while it did less well against state-of-the-art systems on larger in-distribution benchmarks such as One Billion Word. Liang also cautioned that overlap was uncertain: he did not know whether careful train-test decontamination had been done, and some benchmark material might have appeared indirectly in training data.

Liang then described a “perplexity is all you need” argument, calling it “more faith than science” but important as a mindset behind scaling language models. If there is a true distribution t and a model distribution p, the best possible perplexity is achieved when p = t. If the model has learned the true distribution, then it can in principle condition on a problem and generate the solution, or condition on a question and generate the answer.

By pushing down on perplexity, we will eventually reach AGI, is the argument.

Percy Liang

He immediately qualified the argument. Perplexity may be more than is needed because it penalizes prediction on every token, not only the tokens relevant to a desired behavior. In the sentence “Stanford was founded in 1885,” predicting “1885” reflects useful factual knowledge; predicting the first word or even “founded” may be less relevant. Conditional perplexity — measuring the likelihood of a response given a prompt — can focus evaluation on the tokens that matter.
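One way to picture conditional perplexity is to score only the response tokens and skip the prompt. The sketch below assumes per-token log-probabilities are already available; the numbers are made up for illustration and do not come from any real model.

```python
import math

def conditional_perplexity(token_logprobs, prompt_len):
    """Perplexity over response tokens only; prompt tokens are skipped."""
    response = token_logprobs[prompt_len:]
    if not response:
        raise ValueError("no response tokens to score")
    return math.exp(-sum(response) / len(response))

# Hypothetical log-probs for the prompt "Stanford was founded in" + answer "1885".
logprobs = [math.log(p) for p in (0.01, 0.2, 0.1, 0.5, 0.8)]
full = math.exp(-sum(logprobs) / len(logprobs))
answer_only = conditional_perplexity(logprobs, prompt_len=4)
print(full, answer_only)
```

In this toy case the full-sequence number is dominated by hard-to-predict prompt tokens, while the conditional number reflects only how well the model predicts the fact being asked about.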

Some benchmarks that appear different from perplexity are, in Liang’s phrase, “perplexity in disguise.” LAMBADA is a cloze task: given a long context and a target sentence with a missing final word, predict the word. Its examples require broad discourse context, so it focuses next-token prediction on long-range dependencies. HellaSwag, a multiple-choice sentence-completion benchmark, similarly asks which continuation best fits a context such as a woman trying to bathe a dog. It is scored as multiple choice, but the underlying structure still resembles language modeling.
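The "perplexity in disguise" idea can be sketched as follows: a multiple-choice completion item is answered by picking the continuation with the highest log-probability under the model. The `logprob` callable, the lookup table of scores, and the per-character normalization are all assumptions for illustration, not the official scoring code of any benchmark.

```python
from typing import Callable, Sequence

def pick_continuation(context: str, choices: Sequence[str],
                      logprob: Callable[[str, str], float],
                      length_normalize: bool = True) -> int:
    """Return the index of the choice the model finds most likely."""
    def score(choice: str) -> float:
        lp = logprob(context, choice)
        # Per-character normalization keeps long endings from being penalized.
        return lp / max(len(choice), 1) if length_normalize else lp
    return max(range(len(choices)), key=lambda i: score(choices[i]))

# Hypothetical log-probabilities standing in for a real model.
TOY_LOGPROBS = {
    "She rinses the dog with a hose.": -12.0,
    "She paints the fence blue.": -30.0,
}

def toy_logprob(context: str, continuation: str) -> float:
    return TOY_LOGPROBS[continuation]

context = "A woman is outside with a bucket and a dog."
choices = list(TOY_LOGPROBS)
best = pick_continuation(context, choices, toy_logprob)
print(choices[best])
```

The benchmark reports accuracy, but the quantity being compared underneath is still the model's likelihood of a continuation.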

There is also a practical warning for perplexity leaderboards. If participants submit a model and the evaluator asks for log probabilities on test data, the evaluator must trust that the probabilities are valid and sum to one. Otherwise a submitted “model” could simply return probability one for everything. Downstream task evaluation is easier to black-box: send a prompt, receive a response, compute accuracy. Perplexity evaluates distributions, so it requires more trust in the model or code, and with models such as VAEs or systems that only provide bounds, the evaluator may also need to trust the math.
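A small sketch of the trust problem, assuming the evaluator can see the full next-token distribution (which a real leaderboard may not be able to): an honest distribution sums to one, while a submission that claims probability one for every token does not. The function name and tolerance are hypothetical.

```python
import math

def is_valid_distribution(logprobs_over_vocab, tol=1e-6):
    """True if the exponentiated log-probs sum to one within a tolerance."""
    total = sum(math.exp(lp) for lp in logprobs_over_vocab)
    return abs(total - 1.0) <= tol

honest = [math.log(p) for p in (0.7, 0.2, 0.1)]
cheater = [0.0, 0.0, 0.0]  # log(1) for every token in a 3-token vocabulary

print(is_valid_distribution(honest), is_valid_distribution(cheater))
```

If the evaluator only receives the log-probability of the observed token, this check is impossible, which is why perplexity leaderboards need access to, or trust in, the submitted code.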

The bottom line was not that perplexity is obsolete. It is still heavily used in language model development, especially because it varies smoothly with scale and is useful for scaling laws. But for those not already convinced by low perplexity, real-world benchmarks remain necessary.

Exam benchmarks keep getting harder because models keep saturating them

Exam-style benchmarks borrow from human testing. Their advantages are control and gradeability: the evaluator can choose subject and difficulty and design questions with unambiguous answers. MMLU (Massive Multitask Language Understanding) was Liang’s influential early example. It covered 57 subjects including math, U.S. history, law, and morality, using multiple-choice questions collected by graduate and undergraduate students from freely available online sources.

Despite its name, Liang said, MMLU is less about language understanding than about knowledge and reasoning. It was evaluated on GPT-3 using few-shot prompting: the prompt includes examples of multiple-choice questions and answers, then asks the model to answer a new one. That setup looks mundane now, but around GPT-3’s release it was radical to expect a language model to solve such tasks from in-context examples. Smaller models were barely above chance, while the largest GPT-3 models were meaningfully above chance.
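A few-shot multiple-choice prompt of this kind can be sketched as below. The exact template used for MMLU is not given here, so the layout, the function names, and the sample questions are assumptions for illustration.

```python
def format_question(question, choices, answer=None):
    """Render one four-choice question; leave the answer blank if unknown."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)

def few_shot_prompt(examples, target):
    """examples: (question, choices, answer) triples; target: (question, choices)."""
    blocks = [format_question(q, c, a) for q, c, a in examples]
    blocks.append(format_question(*target))
    return "\n\n".join(blocks)

examples = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
target = ("3 * 3 = ?", ["6", "7", "8", "9"])
prompt = few_shot_prompt(examples, target)
print(prompt)
```

The model is then expected to continue the text after the final "Answer:" with a letter, which is compared against the key.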

MMLU then saturated. The leaderboard snapshot for MMLU had GPT-3.5 Turbo at 0.698 in March 2023, GPT-4 at 0.864 in June 2023, and GPT-5 at 0.925 in August 2025. Once the benchmark became too easy, MMLU-Pro tried to make it harder by removing noisy and trivial questions, expanding answer choices from four to ten, and evaluating with chain-of-thought reasoning. Model accuracy dropped by 16% to 33% compared with MMLU, but a 2026 leaderboard snapshot had Qwen3.6 Plus at 0.885.

GPQA pushed further. The name stands for Graduate-Level Google-Proof Q&A, built from questions written by 61 PhD contractors on Upwork. The construction process was labor-intensive: a question writer produced a question and answer choices, expert validators answered and reviewed it, the writer revised it, and non-experts with Google access attempted it. A “diamond” subset required agreement from two expert validators and at most one correct answer among three non-experts. PhD experts achieved 65% accuracy; non-experts with Google access achieved 34%; GPT-4 achieved 39%. Yet the leaderboard snapshot Liang displayed had Claude Opus 4.7 at 0.942.
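The diamond filter described above can be written as a simple predicate. This sketch reads "agreement from two expert validators" as both experts answering correctly, which is an assumption; the function name is hypothetical.

```python
def in_diamond_subset(expert_correct, nonexpert_correct):
    """expert_correct: two booleans; nonexpert_correct: three booleans."""
    experts_agree = all(expert_correct)
    hard_for_nonexperts = sum(nonexpert_correct) <= 1
    return experts_agree and hard_for_nonexperts

print(in_diamond_subset([True, True], [False, True, False]))
print(in_diamond_subset([True, True], [True, True, False]))
```

The second call fails the filter because two of three non-experts answered correctly, so the question is not considered "Google-proof" enough.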

Humanity’s Last Exam was the next escalation. It contained 2,500 multimodal, multi-subject questions, including multiple-choice and short-answer formats. Question creators were incentivized with a $500,000 prize pool and co-authorship; submissions were filtered by frontier models and multiple review stages; a private set was held out to reduce contamination risk. The examples ranged from Roman inscriptions and ecology to category theory and graph Markov chains. The familiar pattern appeared again: previous benchmarks look too easy, then a new dataset is introduced on which models perform poorly. In the 2026 snapshot, Humanity’s Last Exam still had room: Claude Mythos Preview led at 0.647.

Benchmark	What was highlighted	Leaderboard snapshot
MMLU	57-subject multiple-choice knowledge and reasoning benchmark	GPT-5: 0.925
MMLU-Pro	MMLU variant with cleaned questions, 10 answer choices, and chain-of-thought evaluation	Qwen3.6 Plus: 0.885
GPQA	Graduate-level “Google-proof” questions validated by experts and non-experts with Google access	Claude Opus 4.7: 0.942
Humanity’s Last Exam	2,500 multimodal, multi-subject questions with public and private sets	Claude Mythos Preview: 0.647

Exam benchmarks were presented as a repeated progression toward harder questions after earlier benchmarks saturate.

Each score has to be read against the possibility that the public benchmark, its sources, or close variants have entered training data.

A student asked how one knows the models were not contaminated by test data. Liang’s answer was blunt: “we don’t know,” because outsiders do not know what is in training sets. He added that contamination is subtle. It is not necessarily literal training on the benchmark; questions can be derived from sources that were themselves in training data. The right stance, he told students, is to take the numbers with a grain of salt.

The lesson was not that multiple choice is inherently easy. Liang explicitly rejected that. Multiple-choice questions can be made arbitrarily difficult. The limitation is that the format restricts what can be asked, and exam questions do not resemble ordinary use. People generally do not ask an AI assistant HLE-style questions except during evaluation. Real prompts are open-ended, sometimes ill-formed, and often lack a single correct answer.

Open-ended chat evaluation replaces answer keys with preferences, judges, and rubrics

For chat, the evaluation problem changes. A user might ask, “I would like to make a beet salad with goat cheese. What kind of herbs would work well and what would not work well?” The answer is open-ended. There is no exact match target and no single ground truth.

Chatbot Arena’s answer was to ask humans. A random internet user enters a prompt and receives two anonymized model responses, then votes: A is better, B is better, both are good, or both are bad. Those pairwise comparisons are converted into Elo-style rankings. The probability that model A beats model B is modeled as a function of their Elo score difference, and the scores are fit to maximize the likelihood of observed pairwise preferences.
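The fitting step can be sketched with a small maximum-likelihood routine. This is a simplified illustration, not the Arena's actual code: it ignores "both good" and "both bad" votes, uses plain gradient ascent, and the learning rate, step count, and base rating are arbitrary assumptions.

```python
import math

def win_prob(r_a, r_b, scale=400.0):
    """Elo-style probability that A beats B given the rating difference."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

def fit_ratings(battles, steps=2000, lr=50.0, base=1000.0):
    """Gradient ascent on the log-likelihood of (winner, loser) pairs."""
    models = sorted({m for pair in battles for m in pair})
    ratings = {m: base for m in models}
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for winner, loser in battles:
            # Gradient is proportional to how surprising the outcome was.
            surprise = 1.0 - win_prob(ratings[winner], ratings[loser])
            grad[winner] += surprise
            grad[loser] -= surprise
        for m in models:
            ratings[m] += lr * grad[m] / len(battles)
    return ratings

# Toy data: A beats B three times, B beats A once.
battles = [("A", "B")] * 3 + [("B", "A")]
ratings = fit_ratings(battles)
print(ratings["A"] - ratings["B"], 400 * math.log10(3))
```

With a 3-to-1 record, the fitted gap should approach the rating difference at which the predicted win probability is 0.75. A model that never loses has no finite maximum-likelihood rating, one reason real systems need enough connected comparisons.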

The strengths are practical. The prompts are at least plausibly real-world because users come for free model access and may be trying to accomplish something. The system does not need every model to answer every prompt; as in chess, a connected graph of pairwise comparisons can support rankings. It can also incorporate new prompts and models over time.

But the weaknesses are substantial. The distribution of “random people on the internet who come to LLM Arena” is not controlled. Demographics do not fully characterize it. There can be bias, spam, and gaming by people who want a submitted model to do well. Binary preference also conflates style and correctness. Unlike chess, where winning is unambiguous, “better” in chat is underspecified. Users may not know the correct answer to the question they asked. Pleasing or sycophantic answers may beat correct but less flattering ones.

AlpacaEval replaced human judges with an LLM judge. It used 805 instructions from various sources, generated responses from a tested model and a baseline such as GPT-4 Preview, and asked GPT-4 Preview to judge which was better. That raised an obvious concern about judge bias. LLM-as-judge methods can mitigate bias with multiple judges or ensembles, but the first version of AlpacaEval exposed another failure: LLM judges favored longer responses, producing leaderboard gaming by fine-tuned models that answered at greater length. AlpacaEval 2.0 used regression to debias the metric.

That led to a meta-evaluation question: how do you evaluate the metric itself? There is no final answer. A common sanity check is correlation with other metrics. AlpacaEval 2.0 had a Spearman correlation of 0.98 with Chatbot Arena for the model set shown, suggesting it could approximate Arena rankings in that setting. Liang cautioned that this correlation may not hold for stronger models than the judge baseline or outside the model set used to establish it.
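For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of ranks, so it measures whether two metrics order the models the same way. The sketch below is self-contained; the scores are made-up numbers, not real leaderboard values.

```python
def ranks(values):
    """Ranks from 1 to n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# Hypothetical scores for five models under two different metrics.
judge_win_rate = [10, 20, 30, 40, 50]
arena_rating = [1000, 1100, 1050, 1200, 1300]
print(spearman(judge_win_rate, arena_rating))
```

Here the two metrics disagree on the order of one adjacent pair of models, which lowers the correlation slightly below one. A high value says the rankings agree on the models tested, not that the cheaper metric will keep agreeing on new ones.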

WildBench added another idea: checklists. It sourced 1,024 examples from 1 million human-chatbot conversations, then used GPT-4 Turbo and GPT-4 as judges. Its main innovation was a prompt-specific checklist, akin to chain-of-thought for judging. A generic request to judge whether a response is “good” is ill-defined; a rubric scopes the task and improves reliability. Liang generalized this point beyond LLM judges. Humans also produce noisy evaluations if asked to rate outputs without a rubric.

The chat-evaluation lesson was cautious. Pairwise comparisons between similar responses provide more signal than absolute scores such as “7 out of 10.” Human and LLM judges both have biases. Rubrics and checklists improve reliability. But the evaluation target is not fully well-defined, because “good open-ended answer” depends on context, intent, correctness, and style.

Agent benchmarks evaluate systems, not just models

Agentic benchmarks move from what language models say to what systems do. An agent, in this framing, is a language model plus a scaffold: the logic for deciding how the model is called, what tools it can use, how it iterates, and how context is managed. That means evaluation is no longer purely model evaluation. It measures the model and the scaffold together.

SWE-bench is the canonical coding example. The agent receives a codebase and a GitHub issue description and must submit a pull request. Evaluation is based on unit tests: some tests fail before the fix, and the agent’s patch should make them pass without breaking others. The benchmark originally had 2,294 tasks across 12 Python repositories. In the SWE-bench Verified leaderboard snapshot, DeepSeek-V2.5 was at 0.168 in May 2024 and Claude Mythos Preview was at 0.939.
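The unit-test criterion can be sketched as a predicate over two groups of tests: those that must go from failing to passing, and those that must keep passing. The function and argument names are illustrative assumptions, and the test names are invented.

```python
def is_resolved(fail_to_pass, pass_to_pass, results_after_patch):
    """results_after_patch maps a test name to True if it passed."""
    fixed = all(results_after_patch.get(t, False) for t in fail_to_pass)
    unbroken = all(results_after_patch.get(t, False) for t in pass_to_pass)
    return fixed and unbroken

fail_to_pass = ["test_issue_regression"]
pass_to_pass = ["test_existing_a", "test_existing_b"]

good_patch = {"test_issue_regression": True,
              "test_existing_a": True, "test_existing_b": True}
breaking_patch = {"test_issue_regression": True,
                  "test_existing_a": False, "test_existing_b": True}

print(is_resolved(fail_to_pass, pass_to_pass, good_patch))
print(is_resolved(fail_to_pass, pass_to_pass, breaking_patch))
```

A missing test result counts as a failure here, which is a conservative choice made for this sketch. The criterion is only as strong as the tests themselves: a patch can satisfy it and still be wrong.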

TerminalBench broadened the agent setting to computer terminals. Its environments are simple and universal: tasks can be done by typing commands. The benchmark had 229 tasks crowdsourced from 93 contributors, with 89 tasks constituting Terminal-Bench 2.0. The task-time table showed wide difficulty variation, from under an hour to more than a week depending on whether the worker was an expert or junior. The same model can score differently depending on the agent wrapped around it.

CyBench used 40 capture-the-flag cybersecurity tasks. An agent interacts with an environment, inspects source code, accesses a web server, and tries to extract a flag — a unique string proving success. The source showed a simple scaffold where memory grows as the agent produces actions and receives environment feedback, but the point was that such histories become long and require better context management.
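A simple scaffold of the kind described, where memory grows by one action and one observation per step, might look like the sketch below. The scripted policy and toy environment are stand-ins for a language model and a real capture-the-flag environment; every name and string here is hypothetical.

```python
def run_agent(policy, env_step, max_steps=10):
    """Loop: choose an action from memory, act, append the feedback."""
    memory = []  # grows by one (action, observation) pair per step
    for _ in range(max_steps):
        action = policy(memory)
        observation, done = env_step(action)
        memory.append((action, observation))
        if done:
            break
    return memory

SCRIPT = ["ls", "cat flag.txt"]

def toy_policy(memory):
    # A real scaffold would call a language model with the memory as context.
    return SCRIPT[min(len(memory), len(SCRIPT) - 1)]

def toy_env(action):
    if action == "ls":
        return "flag.txt", False
    if action == "cat flag.txt":
        return "FLAG{example}", True
    return "command not found", False

history = run_agent(toy_policy, toy_env)
print(history)
```

Because the whole history is passed back in at every step, the context grows linearly with the number of steps, which is the limitation that motivates better context management.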

MLEBench used 75 Kaggle competitions involving data processing, model training, debugging, and submission. Again, the leaderboard varied both by underlying LLM and by agent scaffold.

Benchmark	Task setting	Evaluation signal
SWE-bench	Fix GitHub issues in Python repositories	Unit tests
TerminalBench	Complete tasks in a computer terminal	Task completion accuracy
CyBench	Solve capture-the-flag cybersecurity tasks	Extract the flag
MLEBench	Compete in Kaggle-style machine-learning tasks	Submission grading

Agentic benchmarks evaluate actions in environments, so the scaffold is part of the measured system.

This distinction changes how to read a score. A high SWE-bench, TerminalBench, CyBench, or MLEBench result is not only evidence about the base model’s competence. It also reflects tool use, memory, planning, prompting, orchestration, and the engineering choices that determine how the model is applied over time.

The scaffolds matter because long-horizon tasks cannot be solved by a naive stream of consciousness. Liang pointed to explicit planning, such as maintaining and checking off a to-do list; hierarchical delegation, where a master agent calls sub-agents with clean context and receives only the result; persistent memory through files; and “extreme context engineering” about when to delegate, when to switch strategies, and what to store.

The implication is structural. For chat or exam benchmarks, one can often treat the unit of evaluation as the model. For agents, the evaluated object is the system: model plus tools, memory, prompting, orchestration, and process.

Pure reasoning benchmarks try to remove knowledge, but cannot fully escape priors

Pure reasoning benchmarks ask whether evaluation can isolate reasoning from linguistic and world knowledge. The ARC-AGI line was the central example. It began in 2019 with tasks intended to be 100% solvable by humans but hard for AI. Each task was meant to be unique, so memorization and previous problem exposure should help little.

Pretrained language models initially did not move the needle. That was part of the point: GPT-2-era language models learned internet facts and linguistic patterns, which did not directly solve grid-based visual reasoning puzzles. Then reasoning models such as OpenAI’s o1 and o3 changed the trajectory. ARC-AGI-1 became effectively solved, and ARC-AGI-2, released in March 2025 with more multi-step reasoning, appeared to be on a path toward saturation. ARC-AGI-3, released in March 2026, moved to interactive environments. The leaderboard snapshot was again extremely low: Anthropic Opus 4.6 Max at 0.50%, Gemini 3.1 Pro Preview at 0.40%, GPT 5.4 High at 0.20%, and Grok-4.20 Beta reasoning at 0.10%.

Tatsunori Hashimoto emphasized both the value and the limits of this benchmark family. ARC-AGI tries to disentangle reasoning from knowledge, and it may be one of the best attempts, but it is not clear that full decoupling is possible. The tasks still come from some prior. If people care enough, they can “bench max” almost anything. It is also explicitly constrained to human-solvable reasoning, not superhuman reasoning such as winning IMO gold medals or solving open math problems. Even so, he said, it clearly exposes gaps in current models.

A student asked whether ARC-AGI-3’s graphical interface means multimodal input is required. Hashimoto said the input can be provided as an image or as ASCII art or another textual representation, but either way there is a spatial reasoning element that is not ordinary English.

Safety evaluation lacks the settled meaning that car safety has

Tatsunori Hashimoto introduced AI safety evaluation by analogy to vehicle safety ratings. For cars, decades of lobbying and standard-setting have produced tests such as crash tests with airbags and dummies. For AI, he said, there is no comparably settled answer to what safety means.

HarmBench represents one common framing: prompt the model with harmful requests and expect it to refuse. It is based on 510 harmful behaviors that violate laws or norms and is oriented toward automated red-teaming and robust refusal. This is probably the dominant intuitive meaning of safety in language models: prevent bad actors from using the model for harmful purposes.

AIR-Bench takes a broader approach. It draws on regulatory frameworks in the EU, China, and the U.S., along with company policies, and taxonomizes risks into 314 categories with 5,694 prompts. The point is to treat safety as a wider set of legal, rights-related, control, operational, and social risks rather than a single refusal test.

Jailbreaking is a related subproblem. Language models are trained to refuse harmful instructions, but prompts can be optimized to bypass those refusals. Hashimoto discussed Greedy Coordinate Gradient, an automatic method that optimizes prompts, and noted that attacks optimized on open-weight models such as Llama transferred to closed models such as GPT-4. The examples shown used gibberish-like prompts to elicit “step-by-step plan to destroy humanity” responses from models. Hashimoto did not claim the resulting plans were necessarily practically harmful, but said the expected behavior should be refusal.

The larger difficulty is that safety is contextual. It depends on politics, law, and social norms, which vary by country. The risks are also heterogeneous: hallucinations in medical, legal, or financial settings; sycophancy; abetting crimes; inequality; loss of critical thinking. Some risks correlate with capability — more accurate models may hallucinate less — while others may be in tension with capability. Cybersecurity agents illustrate dual use: the same capability can hack a system or perform penetration testing to secure it.

Realism asks whether the benchmark resembles actual use

Ecological validity asks whether an evaluation captures real-world use. Hashimoto argued that exam benchmarks such as GPQA are far from real usage. Chatbot Arena prompts come from real people, but the distribution is uncontrolled.

Several newer benchmarks try to improve realism at the use-case level. GDPVal, from OpenAI, selected 44 occupations from the top nine sectors of U.S. GDP and asked experienced professionals — about 14 years of experience on average — to create tasks. Examples included a manufacturing engineer designing a 3D cable reel stand, a financial analyst creating a competitor landscape for last-mile delivery, a registered nurse assessing skin lesion images and writing a consultation report, a concierge planning a luxury Bahamas itinerary, and a real estate agent designing a sales brochure.

MedHELM addresses the same issue in medicine. Earlier medical benchmarks were based on standardized exams. Hashimoto argued that passing medical exams should not imply deployment readiness for language models, just as human medical training has a longer supervised process. MedHELM sourced 121 clinical tasks from 29 clinicians, aiming to capture what clinicians would actually ask language models to do: clinical decision support, note generation, patient communication, workflow, research, operations, and planning.

Anthropic’s Clio project addressed realism through real user data. Because model developers have the actual query streams, they are best positioned to understand how people use models. Privacy prevents simply exposing user data, so Clio uses language models to analyze and summarize general patterns. Hashimoto pointed out the trade-off: the most realistic evaluation would sample actual queries and inspect errors deeply, but that runs into privacy constraints.

Validity breaks down through contamination, stale data, and flawed labels

Validity raises a different concern: whether the evaluation scientifically measures what it claims to measure. The first issue is train-test overlap. Before foundation models, datasets such as ImageNet and SQuAD had clear train and test splits. Today, frontier models are trained on the internet and more, and providers often do not disclose training data. That changes the interpretation of benchmark scores.

There are four responses. One is to infer contamination from model behavior. A method from Oren et al. exploits exchangeability: if benchmark questions are randomly ordered, a model should not prefer the canonical order over a shuffled order. Higher log probability for the canonical order can reveal possible contamination.
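The exchangeability idea can be sketched as a permutation test: compare the model's log-probability of the benchmark in its canonical order against shuffled orders. This is a simplified illustration in the spirit of that method, not the authors' exact procedure; `sequence_logprob` is a hypothetical callable, and the two toy models below are contrived stand-ins.

```python
import random

def contamination_p_value(examples, sequence_logprob, num_shuffles=99, seed=0):
    """Fraction of orderings scoring at least as high as the canonical one."""
    rng = random.Random(seed)
    canonical = sequence_logprob(examples)
    at_least = 0
    for _ in range(num_shuffles):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        if sequence_logprob(shuffled) >= canonical:
            at_least += 1
    return (1 + at_least) / (1 + num_shuffles)

def memorizing_model(seq):
    # Toy stand-in: rewards adjacent items that appear in canonical order.
    return float(sum(1 for a, b in zip(seq, seq[1:]) if b == a + 1))

def exchangeable_model(seq):
    # Toy stand-in: indifferent to ordering, as an uncontaminated model should be.
    return 0.0

benchmark = list(range(8))  # example IDs in their published order
print(contamination_p_value(benchmark, memorizing_model))
print(contamination_p_value(benchmark, exchangeable_model))
```

A small p-value means the canonical order is unusually likely under the model, which is evidence that the model has seen the benchmark file in that order. A large p-value does not prove the absence of contamination, since derived or reordered copies would not trigger the test.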

A second route is reporting norms: model providers should disclose train-test overlap when reporting scores, just as statistical practice often reports confidence intervals. A third route is fresh evaluations such as LiveCodeBench or UncheatableEval, which scrape new webpages, papers, or GitHub material after a model’s cutoff date. But timestamps are imperfect because new material can be copied or derived from old sources. A fourth route is private evaluations: companies can use internal codebases not on the internet, and individuals can use private writings. Private datasets were described as especially convenient for perplexity evaluation because only log probabilities over a dataset are needed.

Dataset quality is another validity problem. SWE-bench Verified exists because the original SWE-bench included tasks with problems, such as insufficiently rigorous tests. Similar audits have found broken examples in GSM8K, MMLU, and other benchmarks: mislabeled questions, logical contradictions, ambiguous images, missing equations. Agentic benchmarks are harder to audit because the task is embedded in an environment. Unit tests may be incomplete; a patch can pass tests and still be wrong. In ToolBench, a trivial empty response reportedly achieved 38%, showing how benchmark artifacts can create false success.

The recommendation was a qualitative complement to quantitative scores: look at outputs and traces. A tool called Docent uses language models to inspect agent traces for problems, but the broader point was manual auditing. Anyone developing or running a benchmark should inspect whether it is measuring what they think it is measuring.

The purpose determines the benchmark

A company may need a purchase decision: model A or model B for a particular use case such as customer service chatbots. A researcher may want to measure raw capability or “intelligence.” A business or policymaker may want to understand benefits and harms. A model developer may want feedback to improve the model. Those goals do not imply the same benchmark.

There is also a shift in what is being evaluated. Before foundation models, research often evaluated methods under standardized train-test splits. The data and split were fixed; the algorithm varied. Today, evaluation mostly targets models and systems where “anything goes.” That is useful for downstream users because the shipped system is what matters, but it is a different game.

There are exceptions. The nanoGPT speedrun is an evaluation of algorithmic training efficiency: fixed data, fixed target validation loss, and a competition to get there faster or with fewer tokens. But most language model evaluation is about end systems.

The final rule was procedural rather than metric-specific: declare the game. Say whether the evaluation is about a method, a model, or an agent. Say what the purpose is. Then choose among difficulty, realism, and validity knowing there are trade-offs. It is difficult to have an evaluation that is simultaneously hard, realistic, ecologically valid, uncontaminated, cheap, and easy to grade. The right compromise depends on what the evaluation is for.
