Model Behavior Depends More on Post-Training Data Than Algorithms

Tatsunori Hashimoto | Stanford Online | Wednesday, May 27, 2026 | 23 min read

Stanford computer scientist Tatsunori Hashimoto’s CS336 lecture argues that post-training is less a matter of exotic algorithms than of choosing the data and feedback that turn a broadly capable pretrained model into a controllable product. He presents supervised fine-tuning as a way to extract behaviors already latent in pretraining, and RLHF as preference optimization whose results depend heavily on annotators, reward models, safety data and evaluation incentives. The lecture’s central warning is that style, refusals, hallucination, and reward hacking are not side issues; they are consequences of the data pipeline that shapes what users actually see.

Post-training is where a pretrained model becomes controllable

Tatsunori Hashimoto frames mid- and post-training as the point where language modeling leaves the relatively clean scaling story of pretraining and enters “some of the messier parts” of the field. Pretraining can produce something like a stronger GPT-3: a large base model with broad capability. But the practical utility of such a model is limited without further steering. GPT-3 could support copywriting and lightweight novelty tasks, but it did not give users reliable fine-grained control. The shift from GPT-3 to ChatGPT was not merely “more flops” or more pretraining data; it required extracting particular behaviors from the pretrained model.

The target behavior is instruction following: the ability to accept a long, specific, sometimes programmatic prompt and produce a reasonable answer in one shot. Before GPT-3.5 and GPT-4, users mostly steered models through few-shot examples and hoped the examples were good enough. Later systems could follow much more detailed instructions, including examples where GPT-4 is asked to generate Python plotting code and returns code that produces the requested visual output.

The central question is what it takes to produce that control. Hashimoto breaks it into three parts: what kind of data should be collected, how that data should be used, and whether post-training needs scale in the same way pretraining does. His answer is deliberately qualified. Pretraining remains the foundation; “if we ignore pretraining and we just try to post-train our way to victory, we will get none of the things that we want.” But once pretraining has produced broad latent capabilities, post-training becomes the process of selecting and shaping the behaviors that will actually be exposed to users.

The public record is uneven. The richest information still comes from the period before ChatGPT intensified commercial competition: Stiennon et al.’s “Learning to Summarize from Human Feedback” and Anthropic’s 2022 helpful-harmless work include details such as annotation guidelines and safety setup. Modern frontier post-training, by contrast, is mostly opaque. Open-source models disclose more, but many open recipes rely heavily on distillation from stronger models, which Hashimoto treats as meaningfully different from frontier labs’ human data collection.

That opacity matters because the algorithms are not where he locates most of the leverage. The standard recipe is public: supervised fine-tuning followed by reinforcement learning from feedback. The leverage is in the data: the prompts, demonstrations, comparisons, annotators, guidelines, safety mixes, and all the small decisions that determine what behaviors the model learns to emit.

The causal chain running through the material is simple, even if the practice is not: pretraining creates broad capability; supervised fine-tuning extracts and formats behavior; midtraining blurs where that extraction begins; RLHF shapes preferences over outputs; and reward optimization introduces its own failure modes.

SFT mostly extracts behaviors already latent in pretraining

The supervised fine-tuning phase is, algorithmically, almost ordinary next-token prediction. The model is trained on input-output pairs: a prompt and a desired response. Aside from minor variations, “we all know how to SFT a model”; the important question is what goes into the training data.
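As a rough illustration of "almost ordinary next-token prediction," the sketch below computes an SFT loss from toy per-token probabilities rather than a real model. The function name `sft_loss` is hypothetical, and masking the prompt tokens out of the loss is a common convention assumed here, not something the lecture specifies.

```python
import math

def sft_loss(token_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each correct next token.
    is_response: True where the token belongs to the desired response,
                 False where it belongs to the prompt. Masking prompt tokens
                 is an assumption of this sketch, not a fixed rule.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# Toy example: three prompt tokens followed by two response tokens.
probs = [0.9, 0.8, 0.7, 0.5, 0.25]
mask = [False, False, False, True, True]
loss = sft_loss(probs, mask)
print(round(loss, 4))
```

Only the last two probabilities contribute, which is why the choice of responses in the dataset, rather than the loss itself, carries most of the design weight.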

The open-source history of SFT data shows a shift from repurposed NLP benchmarks to chat-style demonstrations and now to agentic tool-use traces. FLAN was early and, in Hashimoto’s view, visionary: it took existing supervised NLP datasets and turned them into instruction-following data. An email body becomes a prompt to “write a subject line”; a news article becomes a classification or summarization task; structured restaurant data becomes a sentence-generation task. The idea was reasonable: if supervised tasks already exist, train the model across all of them.
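The FLAN-style conversion can be pictured as filling a template from an existing labeled record. The following is a minimal sketch under that reading; the helper `to_instruction_pair`, the template wording, and the example email are all hypothetical and are not taken from the FLAN release.

```python
def to_instruction_pair(record, template, target_field):
    """Turn a labeled dataset record into a prompt/response pair.

    record: dict holding the raw fields of a supervised example.
    template: format string that references the input fields by name.
    target_field: the field whose value becomes the desired response.
    """
    prompt = template.format(**record)
    return {"prompt": prompt, "response": record[target_field]}

# Hypothetical record from an email subject-line dataset.
email = {
    "body": "The quarterly review moves to Friday at 3pm.",
    "subject": "Quarterly review rescheduled",
}
pair = to_instruction_pair(
    email,
    "{body}\n\nWrite a subject line for this email.",
    "subject",
)
print(pair["prompt"])
print(pair["response"])
```

Note that the instruction lands after the input text, which is one of the unnatural prompt shapes such templated data tends to produce.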

But the artifacts of those source datasets carried through. FLAN prompts often look unnatural compared with how people talk to chatbots. Instructions may appear at the end. Summaries can be very short, and benchmark summaries often contain hallucinated details. The dataset inherits both the structure and the deficiencies of the NLP datasets from which it is built. FLAN also reflected an early assumption that post-training might require very large scale, analogous to pretraining. Later experience suggested a different point on the quality-quantity tradeoff: with a strong pretrained model, a much smaller number of high-quality examples can pull out instruction-following behavior.

Alpaca marked a different regime. It distilled ChatGPT-style traces into input-output pairs and produced more natural prompts with longer, chattier outputs. Hashimoto says this reliably induced ChatGPT-like behavior when applied to the original LLaMA models, underscoring that both pretraining and post-training have to work. Once the base model was strong enough, chat-style data could push it most of the way toward the desired interface.

Open Assistant then represented a more human-driven attempt to build a high-quality instruction dataset. Volunteers wrote difficult prompts and long expert-like responses, in a Wikipedia-like crowdsourcing model. Hashimoto describes it as admirable and impressive, while also using it as an example of the pitfalls of high-detail data.

The newest SFT data has moved beyond plain chat. In agentic systems, users want not only text but tool calls, plans, and structured interaction with external systems. Nemotron-style examples include both natural language and JSON-like tool calls. In such data, tool use is not emergent from a separate mechanism; it is explicitly supervised into the model.

Across these datasets, three dimensions matter. First is “chattiness”: classic NLP datasets may be valid, but “people don’t really want to talk to an NLP benchmark.” Later datasets move toward longer, more detailed, more human-like responses. Second is detail and factual content: Open Assistant-style responses often contain citations and domain knowledge. Third is tool use: the recent emphasis is on downstream agentic applications rather than only a conversational answer.

A student asks whether the correctness of the input-output pairs matters. The answer is nuanced. At the top level, one should collect the highest-quality responses possible because bad data teaches bad behavior. But strong pretrained models can learn instruction following from surprisingly strange or low-quality supervision. Hashimoto mentions work from Percy Liang’s former student suggesting that models can be trained to follow instructions even without normal response data. Pretraining generalization lets SFT “get away with” worse data than one might expect.

The practical implication is that SFT is not primarily a way to create arbitrary new competence. It is strongest when it finds modes already made available by pretraining and turns them into repeatable user-facing behavior.

Style can look like capability if evaluation rewards it

Response style is one of the most consequential and easily confounded parts of post-training. Length, bullet points, tone, and formatting are not incidental. They are decisions made through data collection, and they strongly influence preference evaluations.

A table from Wang 2023 shows wide variation across instruction datasets. FLAN V2 completions average 31.2 tokens, SuperNI 38.7, Alpaca 64.6, Open Assistant 212.5, and ShareGPT 357.8. The variation is not just in content but in the model behavior these datasets encourage.

Dataset	Source	Instances	Avg. turns	Avg. prompt length	Avg. completion length
SuperNI	NLP datasets + human-written instructions	96,913	1.0	291.1	38.7
FLAN V2	NLP datasets + human-written instructions	100,000	1.0	355.7	31.2
Dolly	Human-written from scratch	15,011	1.0	118.1	91.3
Open Assistant 1	Human-written from scratch	34,795	1.6	34.8	212.5
Alpaca	Generated with Davinci-003	52,002	1.0	27.8	64.6
GPT4-Alpaca	Generated with Davinci-003 + GPT-4	52,002	1.0	28.0	161.8
ShareGPT	User prompts + outputs from various models	168,864	3.2	71.0	357.8

Instruction datasets shown in the Wang 2023 slide differ sharply in response length and dialogue structure.

When outputs are judged by preference, these style factors can dominate. Results attributed to Dubois et al. 2023 show strong length effects in both human and GPT-based evaluations. Annotators often choose answers with bullet lists or more detail. That preference is not irrational: in a side-by-side evaluation, a longer, more structured answer may genuinely seem more useful. But it can distort the resulting model’s tone and create the appearance of improvement where the underlying capability has not changed.
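One simple way to check for this kind of length effect in a set of pairwise judgments is to count how often the longer response wins. The sketch below does that on made-up data; the function name and the comparison records are hypothetical, and this is a diagnostic illustration rather than the analysis used in the cited work.

```python
def longer_wins_rate(comparisons):
    """Fraction of comparisons won by the longer response.

    comparisons: list of (len_a, len_b, winner) where winner is 'a' or 'b'.
    Pairs with equal lengths are skipped because neither side is longer.
    """
    decided = [(la, lb, w) for la, lb, w in comparisons if la != lb]
    if not decided:
        raise ValueError("no comparisons with differing lengths")
    hits = sum(1 for la, lb, w in decided if (w == "a") == (la > lb))
    return hits / len(decided)

# Hypothetical judgments: token lengths of responses A and B, then the winner.
judgments = [
    (120, 40, "a"),
    (35, 200, "b"),
    (80, 90, "a"),
    (300, 60, "a"),
    (50, 50, "b"),
]
rate = longer_wins_rate(judgments)
print(rate)
```

A rate far above 0.5 would suggest the judge, human or model, is rewarding length, though it cannot by itself show whether the longer answers were also better.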

This is why Hashimoto separates style control from capability control. In benchmark comparisons across instruction-tuning datasets, some datasets substantially improve open-ended preference metrics such as AlpacaEval while not necessarily improving standard factuality, reasoning, multilingual, or coding benchmarks. A model can become more preferred without becoming smarter. For companies relying on engagement signals, it is easy to mistake a style shift for a capability gain.

The same problem appears later in RLHF. Both human and model judges can reward longer answers, and Hashimoto points to studies showing that models can improve win rates by pushing response length outward. One paper, in his telling, showed that optimizing for length alone could do well on many benchmarks. Length and formatting are therefore not superficial presentation issues; they are variables that training and evaluation pipelines can inadvertently optimize.

Adding correct facts during SFT can make hallucination worse

High-quality SFT data is not always good data for a particular model. In an Open Assistant example about monopsony, the response gives a definition, discusses labor-market monopsony, and includes a citation to Bivens and Mishel. Tatsunori Hashimoto uses it to show that next-token prediction teaches two things at once: the factual content of the citation and the behavior “when asked for research, output citations.”

That coupling can be dangerous. If the model does not actually “know” the citation or the relevant tail facts, SFT may teach it to produce citation-shaped text even when it lacks the underlying knowledge. Hashimoto describes this as a folklore claim with empirical support: fine-tuning a model on facts it does not know can make it hallucinate. The model is learning both a knowledge item and a format, and it may generalize the format without the knowledge.

You might not want to train on the highest quality data if the model doesn’t already know that data.

Tatsunori Hashimoto

This is one reason RL can be useful. Hashimoto summarizes John Schulman’s argument that teaching a model what it knows and does not know may require policy-dependent feedback. In SFT, an external demonstrator supplies the target sequence. The model is penalized for assigning probability to anything other than that sequence, even if the broader lesson it extracts is “emit a reference.” RL, by contrast, samples from the model’s own policy and can reward or punish outputs based on whether the model’s own answer is correct or calibrated.

Hashimoto offers a folk story: suppose the model has some internal “I know this” direction in its activations. SFT might force it to emit references regardless of that internal state. RL could reward references when the model is in the “I know” direction and penalize them when it is in the “I don’t know” direction, allowing the output policy to depend on that latent signal. But the limitation is equally important: if the model has no internal knowledge of whether it knows something, RL cannot create that calibration from nothing.

When asked to define “tail knowledge,” Hashimoto says there is no formal definition. In prior work, something like the length of a Wikipedia article could proxy for how well-known an entity is. Less well-known material tends to be harder for the model and can increase hallucination when used in SFT. But knowledge itself is not cleanly pinned down.

The practical point is that post-training is not simply about packing in as much correct information as possible. SFT works best when extracting behaviors already present in pretraining. Adding factually correct data can hurt if it pushes the model to emit knowledge it cannot reliably support. Small amounts of the right behavioral data—safety, instruction following, style—can have large effects, but long-tail distinctions still benefit from more data.

Safety tuning is a Pareto problem, not a refusal switch

Once a model is deployed to users, post-training becomes the last line of defense against misuse. Hashimoto contrasts pretraining’s “compress the world” perspective with the post-training burden of handling scams, spam, political manipulation, disinformation, and individualized spear phishing. The standard operational mechanism is refusal: train the model not to comply with malicious prompts.

But safety tuning is not just “refuse more.” The tradeoff is between violation rate and false refusal rate. A model should not help write a scam email asking for a deceased family member’s Social Security number under the pretext of COVID-19 funeral assistance, as shown in a Kang 2023 example. But it also should not refuse benign queries such as “how do I kill a python process.” The objective is a Pareto frontier: reduce unsafe compliance without over-refusing legitimate requests.
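The two rates and the Pareto framing can be made concrete with a small sketch. The evaluation records, checkpoint names, and numbers below are hypothetical; the code only illustrates how violation rate and false refusal rate are computed and how dominated checkpoints drop off the frontier.

```python
def safety_rates(results):
    """results: list of (is_harmful_prompt, model_refused) booleans."""
    harmful = [refused for is_harmful, refused in results if is_harmful]
    benign = [refused for is_harmful, refused in results if not is_harmful]
    # Violation: a harmful prompt the model complied with.
    violation_rate = sum(1 for r in harmful if not r) / len(harmful)
    # False refusal: a benign prompt the model refused.
    false_refusal_rate = sum(1 for r in benign if r) / len(benign)
    return violation_rate, false_refusal_rate

def pareto_front(points):
    """Names of (name, violation, false_refusal) points no other point dominates."""
    front = []
    for name, v, f in points:
        dominated = any(
            v2 <= v and f2 <= f and (v2 < v or f2 < f)
            for _, v2, f2 in points
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical evaluation: four harmful prompts, then four benign prompts.
results = [(True, True), (True, False), (True, True), (True, True),
           (False, False), (False, True), (False, False), (False, False)]
rates = safety_rates(results)

# Hypothetical checkpoints as (name, violation rate, false refusal rate).
checkpoints = [("A", 0.30, 0.02), ("B", 0.10, 0.05),
               ("C", 0.12, 0.08), ("D", 0.02, 0.30)]
front = pareto_front(checkpoints)
print(rates, front)
```

Checkpoint C is worse than B on both axes, so it falls off the frontier; choosing among A, B, and D is a policy decision rather than a technical one.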

Public detail on safety SFT is even sparser than for capability SFT. Hashimoto presents Llama 2’s safety description as among the more detailed public accounts, while noting that it still leaves out basic information such as the exact number of examples used for safety tuning. He says such efforts usually involve a few thousand to tens of thousands of examples, and that for Llama 2 he believes the number was on the order of a few thousand.

Tülu 3 is presented as one of the best public references for a reasonably performant post-training pipeline. Its safety and non-compliance components include CoCoNot, WildJailbreak, and WildGuardMix. The slide lists 10,983 CoCoNot examples, while WildJailbreak and WildGuardMix are each shown as a 50,000-example source filtered to 26,356 in the Tülu 3 column of the displayed table.

Safety source shown	Displayed scale
Tülu 3 CoCoNot	10,983
Tülu 3 WildJailbreak	50,000 source examples; 26,356 shown in the Tülu 3 column
Tülu 3 WildGuardMix	50,000 source examples; 26,356 shown in the Tülu 3 column

The Tülu 3 safety slide gives one of the more concrete public views of safety and non-compliance data scale.

The strategy is straightforward: mine real user interactions for unsafe requests and jailbreak attempts, then create preferred responses that resist the jailbreak or refuse the unsafe request. Similar “whack-a-mole” processes appear, at least from public technical reports and model cards, to be happening in closed-source labs: look at usage, identify unsafe behavior, have annotators create counterexamples and refusals.

The scale needed for coarse steering can be surprisingly small. Hashimoto cites a safety-tuning result where adding roughly 500 Alpaca-style safety examples significantly reduced compliance with malicious instructions, hate speech prompts, and other unsafe categories. His interpretation is consistent with his larger SFT thesis: a capable pretrained model may already contain a “safe versus unsafe” behavioral axis, so a small amount of supervision can pull it out.

~500

safety examples shown as sufficient for significant broad safety improvements

That does not mean frontier safety tuning can be done with 500 examples. Fine-grained safety policy—especially distinctions that matter to companies such as OpenAI or Anthropic—requires much larger and more carefully designed data collection. Small data can steer broad behavior; large data is still needed for long-tail policy boundaries.

Midtraining blurs the boundary between base models and chat models

The method for SFT itself is almost trivial: “just do gradient descent.” Hashimoto shows a standard PyTorch training loop as a joke because, in many academic settings, that is essentially the algorithm. The more important methodological shift is that instruction tuning is increasingly being mixed into the end of pretraining.

The older mental model separated pretraining and post-training. First train on web-scale text; then fine-tune on instructions. Hashimoto says many current systems instead mix high-quality and instruction-tuning data into the tail end of pretraining, especially during the learning-rate decay phase, and then perform a short instruction-tuning round. This “midtraining” or two-phase training lets teams scale up instruction tuning without catastrophic forgetting and emphasize higher-quality data close to deployment.
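A two-phase schedule of this kind can be sketched as a learning rate and a data mixture that both switch at the start of the decay phase. Everything numeric below is a hypothetical placeholder (the 80% switch point, the peak learning rate, the category names, and the weights), and warmup is omitted for brevity.

```python
def learning_rate(step, total_steps, peak_lr=3e-4, decay_start=0.8):
    """Constant learning rate, then linear decay to zero in the final phase."""
    start = int(total_steps * decay_start)
    if step < start:
        return peak_lr
    remaining = total_steps - step
    return peak_lr * remaining / (total_steps - start)

def data_mixture(step, total_steps, decay_start=0.8):
    """Sampling weights per data category for the current phase."""
    if step < int(total_steps * decay_start):
        # Phase 1: mostly web-scale text, no instruction data.
        return {"web": 0.85, "code": 0.10, "curated": 0.05, "instruction": 0.0}
    # Phase 2 (decay): shift weight toward curated and instruction data.
    return {"web": 0.40, "code": 0.15, "curated": 0.20, "instruction": 0.25}

total = 1000
for step in (0, 799, 800, 900, 1000):
    print(step, learning_rate(step, total), data_mixture(step, total)["instruction"])
```

Because the second phase is short, rerunning it with different weights is far cheaper than repeating the whole run, which is what makes decay-phase ablations practical.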

The effect, in Hashimoto’s characterization, is that the term “base model” becomes misleading. His pet peeve is that some models called base models may already have seen chat and instruction data such as UltraChat during training. They are not base models in the older sense of simply predicting internet text. The boundary between pretraining, high-quality data filtering, and instruction tuning has blurred.

A Stanford slide gives an example of a two-stage mixture with categories including conversational/chat, web, Wikipedia, math, code, math-instruct, reasoning, books, Stack Exchange, science, news, and social media. The decay phase does not use lower-quality data. Usually the intuition is the opposite: the decay phase is closest to deployment, uses the lowest learning rate, and should contain the highest-quality material available.

When asked how the mixture is decided, Hashimoto says both pretraining and post-training data mixtures remain heavily trial-and-error. Algorithms exist, but he describes them as unreliable or brittle. The advantage of midtraining is that it is much shorter than full pretraining, so teams can run many more ablations. A common pattern, as he describes it, is to ablate domains during the decay phase, estimate their impact, and reflect those estimates back into the first-stage pretraining mixture. The reason not to make all pretraining “high quality” is token supply: there is not enough Wikipedia, books, or other curated material to fill the whole run.

Midtraining is the hinge between SFT and RLHF in Hashimoto’s account. It shows that behavioral extraction is no longer confined to a clean post-training stage; teams increasingly prepare the model for instruction-following before the formal SFT and preference-optimization pipeline begins.

RLHF replaces imitation with optimization

Supervised fine-tuning imitates a reference distribution: fit $p(y \mid x)$ to demonstrations. RLHF changes the problem. The model is no longer only trying to mimic a dataset; it is treated as a policy that should maximize a reward.

Hashimoto stresses the conceptual difference. In pretraining and SFT, the model is a generative model of a sequence distribution. In RLHF, the objective is to find a policy that maximizes expected reward. That policy could, in principle, collapse to a single answer per prompt if that answer receives high reward. Diversity is no longer guaranteed by the modeling objective.

Why optimize rather than continue collecting demonstrations? One reason is the gap between what people generate and what they prefer. Hashimoto describes a study in which freelance writers wrote news summaries, then some preferred InstructDavinci summaries over their own. The writers were competent, and their summaries were checked for quality, but they sometimes judged the model’s summary to be better after seeing it. Humans are not optimal demonstration-generating systems. Rating outputs can reveal preferences that demonstrations do not express.

Another reason is that verification can be easier than generation. Math is the example Hashimoto defers to the next lecture: verifying a proof or solution can be easier than producing it. That motivates reinforcement-style approaches and self-verification in reasoning models.

The standard RLHF data loop begins with an SFT model that can follow instructions. Prompts are sampled, the model generates multiple outputs—often with temperature one because the SFT model remains diverse—and a rater ranks them. A reward model is trained on those rankings, often pairwise, and the policy is optimized against that reward model.
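The reward-model step is commonly implemented with a Bradley-Terry style logistic loss on each chosen/rejected pair. The sketch below shows that loss on scalar rewards; the text above only says the training is "often pairwise," so treating it as this specific loss is an assumption, and the function name is hypothetical.

```python
import math

def pairwise_rm_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss is small when the chosen response already scores higher,
# and large when the reward model ranks the pair the wrong way round.
print(pairwise_rm_loss(2.0, 0.0))
print(pairwise_rm_loss(0.0, 0.0))
print(pairwise_rm_loss(0.0, 2.0))
```

Only the difference between the two rewards matters, so the reward model's absolute scale is not pinned down by this loss.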

The old InstructGPT annotation guidelines show the omnibus objective: helpful, truthful, harmless. Helpfulness includes following the user’s intent, writing clearly, handling ambiguous prompts sensibly, respecting international context, and avoiding rambling. Truthfulness includes not hallucinating, especially in summarization or factual questions. Harmlessness includes avoiding abuse, threats, violence, bad real-world advice, and illegal activity. The guidelines explicitly say harmlessness and truthfulness are usually more important than helpfulness, though tradeoffs require judgment.

Leaked Bard annotation guidelines use a similar structure, with helpfulness and presentation. They include ratings from “Not Helpful” to “Extremely Helpful” and presentation judgments from poor to excellent. They reward organization, formatting, directness, neutral tone, grammar, and avoiding filler. Again, style and substance are entangled in the human feedback process.

The annotator distribution becomes part of the model

RLHF does not only encode abstract preferences. It encodes the people, incentives, time limits, demographics, expertise, and tools involved in producing the preference data. Because post-training is one of the last shaping steps before deployment, the annotator distribution can materially shift model behavior.

The labor market for annotation has become bifurcated. A survey from Oxford Economics, covering one Scale AI platform, reports many annotators with bachelor’s or master’s degrees, a modal age around the mid-30s, and domains such as language, technical writing, creative writing, coding, math, and data science. Hashimoto cautions that this is only one platform and not representative of all language-model annotation, but he treats it as a reasonable subset.

At the high end, bespoke expert annotation has grown. A Business Insider report cited in the lecture describes Project Stagecraft, where freelancers were reportedly paid at least $50 an hour to create materials for ChatGPT to understand occupations. A pay chart shows specialist annotator rates varying by domain, with some expert roles above $100 an hour. The mental model of annotators as only low-cost overseas pairwise raters is incomplete. The system is more like a pyramid: expert annotation is growing, but lower-cost scalable annotation remains.

The reason is not simply that highly educated annotators have lower variance. For some tasks, the work requires domain knowledge: a lawyer may be needed to check a Bluebook citation; a domain expert may be needed to evaluate tacit professional practice. LLMs have also made low-quality annotation harder to police. People increasingly use ChatGPT in survey responses and annotation workloads, and some annotation shops now sell verification that “real people” are doing the work.

The practical difficulty is severe. It is hard to verify annotator expertise. It is hard to make annotators truly check correctness under time pressure. In the Bard case, annotators reportedly complained that they were expected to check long responses for correctness in under a minute, which made the written standards impossible to follow. And the ethical concerns remain: Hashimoto points to reporting about OpenAI’s use of Kenyan workers paid less than $2 per hour to make ChatGPT less toxic and to MIT Technology Review’s coverage of an “AI underclass.” Expert annotation has not replaced low-paid work; both exist.

Demographics can move model behavior, but Hashimoto presents the specific demographic explanation as an observed alignment and interpretation rather than a proven causal account. He describes a study he did with collaborators in the early days of instruction tuning: they asked opinion-poll questions to language models and compared the answers to human demographic groups. After post-training, the models became less similar to Protestants and Roman Catholics and more similar to Buddhists, Hindus, and atheists in the opinions they gave. Hashimoto says this roughly lined up with the annotator demographics disclosed in the InstructGPT appendix, including many Southeast Asian annotators and people on the U.S. West Coast.

He adds that subtle preferences can transmit through data in hard-to-detect ways. In “emergent misalignment” examples, innocuous-looking data generated by a model trained to like owls can cause another model trained on that data to inherit an owl preference. The broader warning is that data carries latent biases and preferences even when they are not visible as explicit labels.

Expertise changes error detection too. Hashimoto cites Hosking, Blunsom, and Bartolo 2024, which compared crowdsourced annotations with expert annotations. Non-expert annotators tended to emphasize formatting, while expert annotators more often detected factuality and inconsistency errors. Annotators were also less likely to spot those errors in assertive outputs. The problem is not merely ideological bias; it is also what different annotators are capable of noticing.

When asked how to measure annotator quality, Hashimoto gives two imperfect answers. One is compliance with detailed semi-objective guidelines: if factuality is defined and a web search contradicts a response, one can identify a failure. The other is inter-annotator agreement, which measures variance in a population. But agreement does not measure bias. If everyone uses ChatGPT, agreement can be high for the wrong reason. And for subjective preference tasks, variance may be inherent rather than a sign of poor quality.
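The gap between agreement and bias is easy to see on a toy example. In the sketch below, two hypothetical annotators always pick response "a" (say, because it is longer), so they agree perfectly with each other while matching a hypothetical expert reference only half the time. All labels are invented for illustration.

```python
def agreement(labels_a, labels_b):
    """Fraction of items on which two label lists match."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

# Hypothetical expert reference labels and two annotators sharing one bias.
expert = ["a", "b", "a", "b"]
annotator_1 = ["a", "a", "a", "a"]
annotator_2 = ["a", "a", "a", "a"]

inter_annotator = agreement(annotator_1, annotator_2)
accuracy_vs_expert = agreement(annotator_1, expert)
print(inter_annotator, accuracy_vs_expert)
```

High inter-annotator agreement here reflects a shared shortcut, not label quality, which is the failure the paragraph above describes.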

AI feedback dominates when the goal is catching up, not moving the frontier

Model-generated feedback has become central to open post-training. Tatsunori Hashimoto says modern models are “surprisingly good” pairwise feedback systems. In work with students during the GPT-4 era, GPT-4 judgments correlated strongly with carefully curated human annotations at the system-ranking level. The slide reports a Spearman correlation of 0.98 and $R^{2} = 0.87$, with agreement near human inter-annotator levels and much lower cost.
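System-level Spearman correlation compares how two judges rank a set of systems, ignoring the exact win-rate values. The following sketch implements it from scratch for the no-ties case; the win rates are hypothetical and are not the data behind the reported 0.98.

```python
def ranks(values):
    """Rank positions (1 = smallest), assuming no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical win rates for five systems under each judge.
human_win_rate = [0.62, 0.48, 0.55, 0.30, 0.71]
model_win_rate = [0.66, 0.50, 0.47, 0.28, 0.74]
rho = spearman(human_win_rate, model_win_rate)
print(rho)
```

In this toy data the two judges swap only the second and third systems, which is enough to pull the correlation below 1. A high system-level correlation still says nothing about agreement on individual examples.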

0.98

Spearman correlation shown for GPT-4 pairwise feedback versus human win-rate at the system level

At the time, Hashimoto says it was unclear whether open models would move toward expensive human annotation or toward distillation and AI feedback. His view now is conditional but strong: for non-frontier actors trying to catch up to frontier capabilities, model-based annotation leaves little room for human-collected data as the main strategy. Strong models are scalable, cheap relative to human experts, and often better than random crowd workers.

Zephyr is his example. Hugging Face initially tried to avoid model distillation and collect human feedback through vendors, but found the process costly, time-consuming, and not better than model-based annotations. They ultimately used AI feedback. Open pipelines such as UltraChat, UltraFeedback, and Tülu 3 likewise rely heavily on model-generated data and feedback.

That does not mean human data is obsolete. If the goal is to push the frontier, especially into domains requiring lawyers, scientists, doctors, or other experts, model feedback cannot conjure missing world knowledge. Human annotators remain necessary for new expertise and for tasks where the model is not already competent enough to judge.

Hashimoto also notes self-training approaches that are not purely distillation. Constitutional AI used a model to generate critiques and revisions for safety data, then trained on that prompted self-improvement loop. Self-Instruct is a capability-oriented version of the broader idea: use models to bootstrap post-training data. But these loops are bounded by what the models can already generate and evaluate.

This is the feedback-shaping stage of the pipeline: after SFT has made a model usable, preference data—human, model-generated, or hybrid—decides which usable behaviors become more likely.

PPO is the original RLHF optimizer; DPO made preference training accessible

The classic RLHF objective is to maximize reward while staying close to a reference model. In InstructGPT, this appears as a reward term minus a KL penalty against the SFT model, with an additional pretraining term. Stiennon et al.’s summarization paper has the same structure: train a reward model from pairwise preferences, then optimize a policy for reward subject to a KL constraint.

The KL term is not cosmetic. It prevents the model from drifting too far from the pretrained or SFT policy and becoming degenerate. In practice, RLHF is a relatively simple bandit-like setting rather than rich multi-turn RL, but optimizing language-model policies is still finicky.
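To see what the KL term buys, the sketch below evaluates "expected reward minus a KL penalty" for a toy policy over three possible responses to a single prompt. The distributions, rewards, and penalty weights are hypothetical, and the pretraining term mentioned for InstructGPT is left out.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; zero-probability terms contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_objective(policy, reference, rewards, beta):
    """Expected reward under the policy minus beta times KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

reference = [0.5, 0.3, 0.2]   # hypothetical SFT policy over three responses
rewards = [0.0, 1.0, 2.0]     # hypothetical reward-model scores
collapsed = [0.0, 0.0, 1.0]   # policy that always emits the top-reward response

for beta in (1.0, 0.1):
    stay = regularized_objective(reference, reference, rewards, beta)
    jump = regularized_objective(collapsed, reference, rewards, beta)
    print(beta, round(stay, 3), round(jump, 3))
```

With the larger penalty, staying at the reference scores higher than collapsing onto one response; with the smaller penalty the collapsed policy wins, which illustrates how the penalty weight controls drift.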

PPO is the original workhorse. Hashimoto sketches the progression: policy gradients let one write the gradient of expected reward as reward-weighted gradients of log probability, which looks somewhat like weighted SFT. But policy gradients require sampling for each optimization step, and sampling is expensive. TRPO allows reuse of rollouts by constraining the new policy to stay close to the old one. PPO replaces the hard trust-region constraint with a clipping heuristic that discourages updates from moving too far. The conceptual line is policy gradients, off-policy reuse, then PPO.
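The clipping heuristic can be shown for a single sample. The sketch below is the standard PPO clipped surrogate term, where `ratio` is the new policy's probability of the sampled output divided by the old policy's; the value 0.2 for `epsilon` is a commonly used default and an assumption here, not a number from the lecture.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good output (positive advantage): gains stop growing once ratio exceeds 1 + eps.
print(ppo_clip_term(1.5, 1.0))
# Bad output (negative advantage): no extra credit for pushing ratio below 1 - eps.
print(ppo_clip_term(0.5, -1.0))
# Moves in the wrong direction are not clipped, so they are still fully penalized.
print(ppo_clip_term(1.5, -1.0))
```

Because the objective goes flat outside the clip range in the favorable direction, the gradient there is zero, which is what discourages large steps away from the policy that generated the rollouts.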

Because PPO is complex and finicky, many researchers tried to eliminate “real RL.” Hashimoto lists failed or insufficiently strong ideas: train with a control token by prepending GOOD to preferred outputs and BAD to rejected outputs; train only on preferred outputs; train a reward model, sample outputs, and SFT on the preferred ones; sample many outputs and keep the best. Some work partially, but not as well.

DPO—Direct Preference Optimization—is the simpler alternative that “works pretty well” and looks much closer to SFT. Its intuition is simple: increase the likelihood of the preferred response and decrease the likelihood of the rejected response, with appropriate weighting. Hashimoto derives it from the KL-regularized RLHF objective by making a strong nonparametric assumption: allow the policy to range over all possible policies rather than a neural-network family. Under that assumption, the optimal policy is the reference policy exponentially tilted by the reward. Solving for the implied reward lets one plug that expression into the pairwise reward-model loss and obtain the DPO objective.
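The derivation described above can be written out in three steps. This is a sketch of the standard argument, assuming a Bradley-Terry model for the pairwise reward-model loss; the partition function $Z(x)$ is notation introduced here.

Start from the KL-regularized objective for a prompt $x$:

$$\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[r(x, y)\right] - \beta \, \mathrm{KL}\left(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)$$

If $\pi$ may be any distribution, the maximizer is the reference policy tilted by the reward:

$$\pi^{*}(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{r(x, y)}{\beta}\right), \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{r(x, y)}{\beta}\right)$$

Solving for the reward gives:

$$r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$

Under Bradley-Terry, $p(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l)\right)$. The $\beta \log Z(x)$ terms cancel in the difference, so substituting $\pi_{\theta}$ for $\pi^{*}$ and taking the negative log-likelihood over preference pairs leaves a loss that depends only on the policy and reference log-probabilities of the two responses.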

$$\mathcal{L}_{\mathrm{DPO}}(\pi_{\theta}; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Mechanistically, the DPO gradient increases log probability for the winning response and decreases it for the losing response. The update is scaled by how wrong the model’s implied reward estimate is. If the model already strongly prefers the winner, the step is small. If it treats winner and loser as similar, the step is larger.
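The scaling behavior can be checked numerically on one preference pair. The sketch below computes the DPO loss and the weight that multiplies the gradient step, taking sequence log-probabilities as plain numbers; the function name, the log-probability values, and `beta=0.1` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_terms(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Return (loss, gradient weight) for one (winner, loser) pair.

    The implied reward of a response is beta * (policy logp - reference logp).
    The gradient weight is sigmoid(loser implied reward - winner implied reward):
    large when the model ranks the pair wrongly, small when it is already right.
    """
    implied_w = beta * (logp_w - ref_logp_w)
    implied_l = beta * (logp_l - ref_logp_l)
    margin = implied_w - implied_l
    loss = -math.log(sigmoid(margin))
    grad_weight = sigmoid(-margin)
    return loss, grad_weight

# Policy identical to the reference: winner and loser are treated alike.
neutral = dpo_terms(-50.0, -50.0, -50.0, -50.0)
# Policy already favors the winner strongly relative to the reference.
confident = dpo_terms(-30.0, -70.0, -50.0, -50.0)
# Policy favors the loser: the update is pushed hardest here.
wrong = dpo_terms(-70.0, -30.0, -50.0, -50.0)
print(neutral, confident, wrong)
```

The weights come out ordered as wrong, then neutral, then confident, matching the description of a step size that shrinks as the implied reward ranking becomes correct.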

Hashimoto’s practical view is that DPO is good enough for many purposes. LLaMA used DPO as a core RLHF primitive inside a broader expert-iteration loop: SFT, DPO, generate candidates, rejection sample, repeat. Many variants have appeared, including SimPO, which removes the reference policy and changes weighting, and length-normalized DPO, which tries to address length hacking. Hashimoto does not treat these variants as decisive. Results comparing DPO and PPO are highly contingent on setup; in some papers PPO wins, in others DPO done carefully wins. The stable lesson is that the core preference-gradient idea is close enough to be useful if step sizes and details are handled well.

Reward optimization can overfit and collapse the policy

Two failure modes follow from turning a language model into a reward-maximizing policy: overoptimization and mode collapse.

Overoptimization follows directly from using a learned reward model. If the policy is pushed too hard, it can overfit the reward model rather than improve under the true human preference distribution. Hashimoto says that when InstructGPT appeared, it was natural to wonder whether one could simply collect enough thumbs-up/thumbs-down data and RLHF toward much more intelligent systems. In practice, optimizing a proxy reward past a point can degrade real evaluation performance. The KL regularizer is therefore critical; it keeps the model from exploiting weaknesses in the reward model.

Mode collapse follows from the conceptual shift from distribution modeling to reward maximization. An RLHF policy does not need to represent the diversity of plausible human outputs. It can concentrate probability mass on a narrower set of high-reward responses. Hashimoto notes that RLHF models have often shown reduced diversity and entropy.
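A toy calculation shows the loss of entropy. It uses the exponentially tilted form of the KL-regularized optimum (reference probability times the exponential of reward over the penalty weight) on four hypothetical responses; the rewards and penalty weights are made up for illustration.

```python
import math

def tilted_policy(reference, rewards, beta):
    """Reference policy reweighted by exp(reward / beta), then renormalized."""
    weights = [p * math.exp(r / beta) for p, r in zip(reference, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist if p > 0)

reference = [0.25, 0.25, 0.25, 0.25]   # hypothetical uniform SFT policy
rewards = [0.0, 0.5, 1.0, 1.5]         # hypothetical reward-model scores

entropies = {}
for beta in (10.0, 1.0, 0.1):
    policy = tilted_policy(reference, rewards, beta)
    entropies[beta] = entropy(policy)
    print(beta, [round(p, 3) for p in policy], round(entropies[beta], 3))
```

As the penalty weight shrinks, nearly all probability mass moves to the single highest-reward response and the entropy approaches zero, even though the reward gaps between responses are modest.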

This also affects calibration. OpenAI, in the GPT-4 era, identified post-RLHF uncalibration as an open problem, and Hashimoto says he does not think anyone has fully solved it. Anthropic, he says, has argued that the uncalibration is natural: one can sometimes recalibrate, but not always. The issue becomes especially important for reasoning-model training, where entropy and exploration matter. If a model’s policy collapses too much, it may not explore enough candidate solutions to improve on hard problems.

The transition to the next topic is therefore about rewards that may be less prone to overoptimization: settings where more compute can be applied and performance may improve more monotonically. That is the appeal Hashimoto assigns to RLVR, which he leaves for the following session.
