Reasoning Gains Persist When Models Learn Them During Pretraining
Shrimai Prabhumoye of NVIDIA used a Stanford CS25 seminar to argue that large-language-model pretraining is becoming less a matter of adding tokens and more a question of training strategy. Drawing on studies of curriculum ordering, early reasoning data, and reinforcement as a pretraining objective, she said base models improve when they see broad data before high-quality data, encounter reasoning traces during pretraining rather than only post-training, and are rewarded for intermediate thoughts that improve prediction.

Pretraining is becoming a question of learning strategy, not just token volume
Shrimai Prabhumoye framed state-of-the-art language-model development as four coupled problems: smart data, smart architecture, smart algorithms, and smart collaboration. The focus of her Stanford seminar was the algorithmic part of that recipe, especially the design of pretraining procedures that make the same data more useful.
Her premise was that large models are approaching a data constraint. The Epoch AI chart she showed visibly marked a median date around 2027 for “5x overtraining” and around 2028 for “full stock use,” while Prabhumoye characterized the broader implication as LLMs consuming more than 95% of human-generated data somewhere around 2030. The same chart placed GPT-3-era training in the hundreds of billions of tokens and Llama 3-era training in the tens of trillions. In that setting, simply adding more human text is not the main lever. The question becomes how to weigh, order, and use the available data.
Prabhumoye used four fictional learners — Pascal, Volta, Ampere, and Hopper — to keep the distinctions clear. All four have access to the same data. Pascal reads it randomly, does not use the available high-quality reasoning data, and does not learn through thinking. Volta follows a curriculum but does not front-load reasoning or learn through thinking. Ampere uses curriculum and reasoning-rich data early but does not learn through thinking. Hopper uses all three: curriculum, front-loaded reasoning, and learning through explicit thought.
| Learner | Curriculum | Front-loaded reasoning | Learning through thinking |
|---|---|---|---|
| Pascal | No | No | No |
| Volta | Yes | No | No |
| Ampere | Yes | Yes | No |
| Hopper | Yes | Yes | Yes |
The comparison was not just rhetorical. Those names mapped onto the experimental progression in the talk: natural-distribution pretraining versus two-phase data ordering; no-reason versus reason-base models; and ordinary next-token continuation versus reinforcement as a pretraining objective.
Prabhumoye’s larger claim was that pretraining should not be treated as passive next-token exposure followed by reasoning added late through supervised fine-tuning and reinforcement learning. In her account, pretraining can be made more strategic in three ways: by sequencing data from broad diversity toward high-quality concentration, by placing reasoning data into the base model’s early learning rather than reserving it for post-training, and by giving the model a reward signal for useful intermediate thoughts while it is still learning from pretraining corpora.
The two-phase curriculum starts broad, then concentrates on high-quality data
The first algorithmic result came from work titled “Maximize Your Data’s Potential: Enhancing LLM Accuracy with Two-Phase Pretraining,” coauthored by Steven Feng, Prabhumoye, and collaborators at NVIDIA, Stanford, and Boston University.
The problem begins with mixture construction. Pretraining data comes from heterogeneous sources: legal documents, books, papers, web crawl, math documents, code, and task-specific sources. Prabhumoye separated two decisions that are often collapsed into “make a data blend”: how much each source should count, and when the model should see it.
For weighting, the proposed pipeline estimates dataset quality and the number of useful repeats. Quality estimation is meant to ensure that sources of similar quality receive similar mixture weights, while higher-quality sources are weighted above medium- and low-quality sources. Epoch estimation asks how many times a high-quality source can be repeated before it stops improving downstream tasks. In her explanation, some datasets may give diminishing returns after two repeats; others may remain useful for four or six.
The curriculum then applies those weights across two phases. Phase 1 emphasizes diversity. It includes a large amount of web crawl, including medium- and low-quality crawl, and uses lower epochs of high-quality sources. Phase 2 emphasizes high-quality data: more epochs of sources such as math, Wikipedia, code, and task-specific data.
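To make the two-decision framing concrete, here is a minimal sketch of how per-source weights and a two-phase split could be assembled from quality and epoch estimates. The source names, token counts, quality tiers, and multipliers are illustrative assumptions, not the paper's actual values or recipe.

```python
# Illustrative sketch of a two-phase data blend (assumed names and heuristics,
# not the recipe from the two-phase pretraining paper).

SOURCES = {
    # name: (available tokens in billions, quality tier, useful epochs)
    "web_crawl_low":  (5000, "low",    1),
    "web_crawl_med":  (3000, "medium", 1),
    "wikipedia":      (5,    "high",   4),
    "math":           (30,   "high",   4),
    "code":           (200,  "high",   2),
    "task_specific":  (10,   "high",   4),
}

def blend(phase: str) -> dict[str, float]:
    """Return normalized mixture weights for one phase.

    Phase 1 keeps diverse crawl dominant and uses fewer repeats of high-quality
    sources; Phase 2 spends the useful repeats of high-quality sources and
    down-weights crawl. The multipliers are made up for illustration.
    """
    weights = {}
    for name, (tokens, quality, epochs) in SOURCES.items():
        if phase == "phase1":
            reps, boost = 1, 1.0                       # diversity first, low epochs
        else:
            reps = epochs                              # use the estimated useful repeats
            boost = 3.0 if quality == "high" else 0.3  # concentrate on quality
        weights[name] = tokens * reps * boost
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

print(blend("phase1"))
print(blend("phase2"))
```

Note that the Phase 1 weights in this cartoon are simply proportional to available tokens, which is also how the natural-distribution baseline below assigns weights globally.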
The baselines clarified what each part contributes. A “natural distribution” baseline ignores quality, epochs, and ordering. Its mixture weight for a dataset is simply its token count divided by total available tokens, so a very large low-quality source can dominate because it is large. An “optimal blend, random order” baseline uses quality and epoch estimates but does not sequence the data. The proposed “optimal blend, two-phase” condition uses quality, epoch count, and order.
| Condition | Quality used | Epochs estimated | Order used | Description |
|---|---|---|---|---|
| Natural distribution | No | No | No | Weights follow available token counts. |
| Optimal blend, random order | Yes | Yes | No | Dataset weights reflect quality and useful repeats, but presentation order is random. |
| Optimal blend, two-phase | Yes | Yes | Yes | The same optimal blend is sequenced so Phase 1 is diverse and Phase 2 is high quality. |
The load-bearing result slide compared Natural Distribution, Random Order, and Two-Phase bars across MMLU, reasoning, GSM8K, code, and overall accuracy. Its visible caption stated that the two-phase approach improved average accuracy by 3.4% compared with Random Order and 17% compared with Natural Distribution.
The reported results compared Pascal and Volta. Pascal corresponds to the natural-distribution setting: no quality estimate, no useful-repeat estimate, no curriculum. Volta corresponds to the two-phase approach with an optimal data mixture. Volta was 17% better on average than Pascal and 3.4% better than an optimal mixture shown in random order.
One audience member asked whether the reverse curriculum — quality first, diversity afterward — had been tried. Prabhumoye said it had: the ablation swapped the order, and it did not perform as well. In this work, exploration through data diversity came first, followed by exploitation of higher-quality data.
She also clarified how automated the recipe construction is. Quality estimation and epoch estimation are automated, but the final mixture creation is “a mixture of both” automated and handcrafted. Epoch estimation itself is done through ablations: start with a base mixture using one repeat of a data category, increase to two or four repeats, and shift weight away from lower-quality crawl or another source to keep the overall budget fixed. In practice, she said, these ablations are done over domain categories such as math rather than every individual source within a domain.
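A small sketch of the fixed-budget bookkeeping she described follows: when a category's repeat count goes up in an ablation, the added share is paid for by shifting weight away from a lower-quality source such as crawl. The weights, the donor choice, and the function name are assumptions for illustration.

```python
def rebalance(mixture: dict[str, float], category: str, old_reps: int,
              new_reps: int, donor: str = "web_crawl_low") -> dict[str, float]:
    """Increase repeats of one data category and shift weight away from a donor
    source (e.g. low-quality crawl) so the overall token budget stays fixed.
    Weights are fractions of the fixed pretraining budget."""
    m = dict(mixture)
    extra = m[category] * (new_reps - old_reps) / old_reps  # added share of the budget
    m[category] += extra
    m[donor] -= extra                                       # pay for it from the donor
    assert m[donor] >= 0, "donor source cannot cover the extra repeats"
    return m

# e.g. an ablation that goes from 2 to 4 repeats of the math category
base = {"web_crawl_low": 0.60, "web_crawl_med": 0.25, "math": 0.05,
        "code": 0.07, "wikipedia": 0.03}
print(rebalance(base, "math", old_reps=2, new_reps=4))
```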
Reasoning data works differently when it appears before post-training
The second thread challenged a common pretraining/post-training division. In Shrimai Prabhumoye’s description of current practice, pretraining learns language and factual patterns, supervised fine-tuning learns to mimic reasoning-like answers or formatting, and reinforcement learning finally teaches the model to evaluate and refine reasoning. Her criticism was that this creates an “unreasoning foundation”: reasoning is added post hoc rather than treated as part of the base model’s early learning.
Front-loading reasoning proposes a different division. During pretraining, the model learns general knowledge plus reasoning. Post-training then amplifies and refines an existing skill rather than trying to bolt it onto a base that never saw reasoning-rich traces.
The study injected reasoning-style data at different stages — pretraining, supervised fine-tuning, and reinforcement learning — and varied the reasoning data across three axes: diversity, quality, and quantity. Prabhumoye used two labels throughout: a “no-reason base,” where no reasoning data appears during pretraining, and a “reason base,” where some reasoning data is included during pretraining.
The evaluation differed by training phase. Base models were evaluated on general-purpose reasoning benchmarks such as ARC-C, HellaSwag, WinoGrande, and RACE; math reasoning benchmarks such as GSM8K and Math-500; science benchmarks including MMLU and MMLU-Pro; and code benchmarks including HumanEval and MBPP variants. Post-trained models were evaluated on instruction following, math, science, and code benchmarks including IFEval, AIME24, AIME25, GPQA-Diamond, and LiveCodeBench.
The first result was immediate: adding reasoning data during pretraining improved the base model. Ampere, the reason-base model, was 16% better on average than Volta, the no-reason baseline, after pretraining.
The more important question was durability. If the same or similar reasoning data is later used in SFT, early exposure might be redundant or overfit. Prabhumoye reported that the advantage persisted. After SFT, Ampere remained 9.3% better on average than Volta. The point was that the gain from early reasoning was not washed away by SFT.
The most complete durability result she reported within the front-loading reasoning (FLR) study came after the full post-training sequence: pretraining, SFT, and reinforcement learning with verifiable rewards, or RLVR. Ampere remained 19% better on average than Volta. On difficult competition math benchmarks such as AIME, the advantage “ballooned” to 39.3%.
Prabhumoye also argued that high-quality reasoning data can have a delayed effect. The study compared three reasoning-data regimes: SHQ, small but high quality; LDQ, large and diverse but lower quality; and LMQ, a combination of SHQ and LDQ. Immediately after pretraining, LMQ and LDQ performed similarly on average, making the added high-quality data appear to have little benefit. After SFT, however, LMQ delivered a 4.25% boost over diverse-only pretraining. The interpretation was that high-quality pretraining data can create latent gains that supervised fine-tuning unlocks later.
This point mattered because the same audience question came up in several forms: why not just put the reasoning data into SFT? Prabhumoye described two tests. In one, the no-reason baseline received twice as many SFT epochs, while the reason-base model used one epoch. Extra SFT compute improved the baseline by 4.09%, but it still fell short of the weakest reason model by 3.32%. In another, the total reasoning-data budget was fixed: either keep the reasoning data for SFT, or split it so some appears in pretraining and the highest-quality data appears during SFT. The split condition, with a reason base, performed 12% better on average.
Her conclusion was direct: pretraining without reasoning data did not catch up merely by spending more compute in SFT, and when the reasoning-data budget was fixed, the reported experiments favored using some of that data during pretraining.
The reasoning data in the study is mostly community-defined and STEM-heavy
The Q&A exposed an important boundary around the term “reasoning data.” A questioner asked how Shrimai Prabhumoye distinguishes small high-quality reasoning data from medium- or lower-quality reasoning data, and how the work classifies data as reasoning at all. She said the study used open-source datasets already released by the community as reasoning datasets, including OpenThoughts and the Nemotron SFT dataset.
The quality distinction was not presented as a universal taxonomy. OpenThoughts was treated as higher quality because it was more heavily filtered. Nemotron SFT was treated as more diverse and lower quality “in some sense” because it mixed many datasets, spanned more domains, and had not gone through the same heavy filtering. Other possible quality metrics include complexity, question length, and available difficulty labels.
On the definition of reasoning data, Prabhumoye said the work follows the community’s current usage. Such data typically consists of a question, a long reasoning trace, and a final solution. Math Olympiad-style data was her example. She also acknowledged that the domains were not broad enough to settle the question of legal reasoning or other domain-specific forms of reasoning. The Nemotron SFT data was not domain-filtered, but she did not think it contained legal-domain reasoning in any significant sense. The composition was mostly STEM: math, code, and other areas relevant to benchmarks such as MMLU.
That clarification matters for interpreting the reported gains. The talk’s claim was not that every domain-specific form of professional reasoning has been covered. It was that reasoning-rich traces, as currently represented in open-source reasoning datasets and largely STEM-oriented corpora, improved model foundations when introduced during pretraining in the reported experiments.
RLP turns pretraining into a place for exploration, not just imitation
The final technical contribution, RLP — reinforcement as a pretraining objective — pushed the early-reasoning idea further. Instead of merely including reasoning-style examples in pretraining, Shrimai Prabhumoye described a method that gives the model an incentive to generate useful intermediate thoughts before predicting the next token.
Prabhumoye motivated the idea with a “tale of two learners.” Leo learns by doing; Bolt learns by observing. When asked to build a bridge for a toy car, Leo makes a simple structure that works. Bolt analyzes every bridge ever made and builds a beautiful suspension bridge that is too small for the car. The lesson was that pattern matching can miss purpose. Standard models, in this analogy, “watch text” by predicting the next token. RLP tries to teach models to reason through their own thoughts, not merely observe text.
In standard pretraining, the model receives a context and predicts the next token. Her toy example was a sentence about photosynthesis: “Photosynthesis is the process plants, algae and some bacteria use to make their own food using ____.” Vanilla pretraining estimates the next token from the context and predicts “sunlight.” RLP instead gives the model an opportunity to generate a thought, such as “Photosynthesis relies on solar energy. Hence the next token must be sunlight,” and then predicts the token conditioned on both the context and the thought.
The slide summarized the key difference this way: RLP “produces an explicit reasoning trace before predicting the next token,” making the “why” visible and trainable rather than only the final answer. In Prabhumoye’s explanation, this is what turns ordinary pattern completion into reasoning-driven prediction.
The prompt used for thought generation instructed the model to act as a continuation-and-reasoning assistant: receive a prefix of a context, problem, solution, or derivation; briefly think between think tags about what should come next; then continue in the same style as the prefix, focusing on the next few steps rather than jumping to a final boxed answer. The prompt also told the model not to restate the question or add metadata commentary.
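As a rough paraphrase, reconstructed only from that description rather than quoted from the paper, the instruction could read along these lines; the exact wording and tags in the released work may differ.

```python
# Paraphrased from the talk's description; not the verbatim RLP prompt.
THOUGHT_PROMPT = """You are a continuation-and-reasoning assistant.
You will be given a prefix of a context, problem, solution, or derivation.
First, think briefly between <think> and </think> tags about what should come next.
Then continue in the same style as the prefix, focusing on the next few steps
rather than jumping to a final boxed answer.
Do not restate the question and do not add metadata or commentary."""
```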
The RLP mechanism has two policies. The “thought policy” receives the prompt plus the input prefix, explores different thoughts through rollouts, and produces both a thought and a next-token prediction. A “no-think baseline” receives the same input context without the prompt and produces an ordinary next-token probability distribution. The reward is based on information gain: the log probability of the next token under the thought-conditioned policy minus the log probability under the no-think baseline.
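Written out, with x_{<t} the context up to position t, c_t the generated thought, π_θ the thought policy, and π_EMA the lagged no-think baseline (notation chosen here for illustration, not taken from the paper), the reward she described corresponds to:

$$ r_t \;=\; \log \pi_\theta\!\left(x_t \mid x_{<t},\, c_t\right) \;-\; \log \pi_{\mathrm{EMA}}\!\left(x_t \mid x_{<t}\right) $$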
The reward is positive only when the generated thought meaningfully improves next-token prediction relative to not using the thought. If the thought is irrelevant or harmful, the reward can be zero or negative. Prabhumoye emphasized two properties. First, the reward is dense and non-binary rather than a sparse yes/no reward. Second, it can be applied at every position in the document without an external selection process.
The no-think baseline is updated through an exponential moving average of the thought policy. The reason, she said, is to make it current enough to provide useful comparisons while keeping it intentionally lagged to mitigate reward hacking.
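A minimal sketch of how the reward and the lagged baseline could fit together, assuming a Hugging-Face-style causal LM that returns `.logits`; the helper names, the EMA decay of 0.999, and the no-grad handling of the baseline are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def next_token_logprob(model, context_ids, target_id, thought_ids=None):
    """Log-probability of the target token given the context, optionally
    conditioned on a generated thought appended to the context."""
    ids = context_ids if thought_ids is None else torch.cat([context_ids, thought_ids])
    logits = model(ids.unsqueeze(0)).logits[0, -1]   # next-token distribution
    return F.log_softmax(logits, dim=-1)[target_id]

def rlp_reward(thought_policy, no_think_baseline, context_ids, thought_ids, target_id):
    """Information-gain reward: how much the thought improves next-token prediction
    relative to the lagged no-think baseline. Dense, per-position, can be negative."""
    with_thought = next_token_logprob(thought_policy, context_ids, target_id, thought_ids)
    with torch.no_grad():
        without_thought = next_token_logprob(no_think_baseline, context_ids, target_id)
    return with_thought - without_thought

@torch.no_grad()
def update_baseline(no_think_baseline, thought_policy, decay=0.999):
    """Exponential moving average of the thought policy: current enough to give a
    useful comparison, lagged enough to mitigate reward hacking. Decay is assumed."""
    for p_base, p_pol in zip(no_think_baseline.parameters(), thought_policy.parameters()):
        p_base.mul_(decay).add_(p_pol, alpha=1.0 - decay)
```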
In response to a question comparing RLP with GRPO, Prabhumoye said the work actually uses the GRPO technique. The difference is the reward. In ordinary post-training GRPO, the reward may be zero-one; in RLP, the reward is information gain. The rest of the advantage calculation, including reasoning-token probabilities, remains similar.
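For concreteness, a group-relative advantage of the kind GRPO uses, fed with dense information-gain rewards rather than 0/1 outcomes, can be computed along these lines; this is a generic GRPO-style normalization, not necessarily the paper's exact estimator.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages over a group of rollouts for the same prefix: center and
    scale each rollout's reward by the group statistics. Here the rewards are dense
    information-gain values rather than binary correctness scores."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```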
RLP beat token-matched and compute-matched continuation baselines
The first RLP experiment asked whether a base model’s reasoning ability can improve without task-specific tuning. The setup used Qwen3-1.7B-Base at its final checkpoint, trained with RLP for 1 billion tokens on general pretraining corpora — not special reasoning-only data. Comparisons included the base model, the base model after post-training, token-matched continued pretraining using ordinary next-token prediction, and corresponding post-trained versions. Post-training used SFT on OpenThoughts and RLVR on OmniMath.
RLP improved the base model by 19% on average across math and science benchmarks. Compared with token-matched continued pretraining — the same 1 billion tokens used for ordinary next-token prediction — RLP was 17% better on average. After identical SFT and RLVR, the RLP advantage compounded: Base + RLP + Post was 8% better in relative terms than Base + Post and 7% better than Base + CPT + Post.
The result slide for this comparison showed paired bar charts over math, science, science pass@1, and overall scores. The visible annotations stated that RLP outperformed BASE by 19% and CPT by 17% on average before post-training, and that after identical SFT plus RLVR it retained a relative 8% advantage over BASE + Post and 7% over CPT + Post.
The natural objection is compute. RLP requires rollouts and thought generation, so a token-matched comparison does not fully represent the cost of the RLP condition. Prabhumoye therefore described a flop-matched comparison. For RLP trained on 170 million tokens, the compute-equivalent next-token-prediction baseline used 6 billion tokens. The continued-pretraining data was Nemotron-CrossThink. Even when the next-token-prediction baseline saw 35 times more data, RLP outperformed it by 14% on average.
| Comparison | RLP setup | Baseline | Reported result |
|---|---|---|---|
| Token matched | 1B tokens on general pretraining corpora | 1B tokens of ordinary continued pretraining | RLP was +17% over CPT and +19% over base on average. |
| Post-training matched | RLP followed by identical SFT + RLVR | Base or CPT followed by the same post-training | RLP + Post was +8% over Base + Post and +7% over CPT + Post. |
| Flop matched | 170M RLP tokens | 6B CPT tokens on Nemotron-CrossThink | RLP was +14% over CPT despite CPT seeing 35x more data. |
Prabhumoye also reported a larger-scale test on NVIDIA-Nemotron-Nano-12B-v2-Base, a hybrid Mamba-2-based model rather than a pure transformer. The RLP model started from an intermediate checkpoint trained to 19.8 trillion tokens and then received 250 million RLP tokens on general pretraining corpora. It was compared with a base model trained to 20 trillion tokens, so the RLP model had seen roughly 200 billion fewer tokens.
Despite seeing fewer tokens overall, the RLP model achieved a 35% average gain over the 20T-token base, with the largest boost in science: +23% absolute. After identical post-training, the RLP model still outperformed the base by a 3% absolute margin. Prabhumoye described this as evidence that RLP benefits persist and may amplify across larger models and different architectures in the reported text-model settings.
An online question asked how far back in pretraining RLP can be applied and still beat a continued-pretraining baseline, given that the 19.8T checkpoint is already very late. Prabhumoye said the paper presents the 19.8T result, but the team also ran experiments at 20% of pretraining. For a 20T-token full run, that means a 4T-token checkpoint, and they still saw gains from RLP compared with the baseline.
RLP differs from earlier reinforcement-pretraining approaches by making reward intrinsic and dense
Shrimai Prabhumoye situated RLP alongside three related efforts: Quiet-STaR, Reinforcement Pre-Training, or RPT, and Reinforcement Learning on Pre-training Data, or RLPT. Her comparison focused on reward source, reward granularity, and reasoning emergence.
Next-token prediction has no reinforcement reward and produces reasoning only implicitly, if at all. RPT and RLPT use external verifiers and sparse binary rewards. RLP uses an intrinsic, verifier-free reward based on information gain and applies it densely. In her qualitative table, RPT/RLPT produced “explicit/weak” reasoning emergence, while RLP produced “explicit/strong” reasoning emergence.
The quantitative comparison used Qwen3-1.7B-Base, OmniMath, and a 170 million-token budget. RLP was 4% better on average than RPT. Prabhumoye attributed the difference to reward design. RPT uses an external filter — a separate model that goes through pretraining data and selects the tokens where reward can be applied. It then uses a sparse binary reward, reinforcing only selected tokens and ignoring the reasoning steps. RLP applies its reward to any token, uses a dense per-token information-gain signal, and accounts for the reasoning trace itself when calculating reward.
The ablations she highlighted all pointed in the same direction. Giving the model a reward for intermediate thoughts outperformed simple next-token prediction. Position-wise credit at every token was more effective than sparse end-of-sequence reward. And RLP was token efficient in the Nemotron-12B setting: a model given 250 million RLP tokens from the 19.8T checkpoint beat the 20T-token base comparison by 35% on average, despite seeing about 200 billion fewer tokens overall.
A later Q&A clarified how tokens are chosen for RLP in practice. Theoretically, Prabhumoye said, RLP can be applied to any token in a document. In the experiments, the team takes a document, randomly selects a token, applies RLP, backpropagates the reward, discards the document, and moves to a new one. They do not select tokens with entropy-based or other “fancy” techniques. She also corrected the assumption that RLP is run only on the highest-quality data: it is applied to the pretraining data mixture, including web crawl and other general sources, not only reasoning data.
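Putting those answers together, one pass of RLP over a corpus could look like the loop below, reusing the rlp_reward and update_baseline helpers sketched earlier. The tokenizer interface, the four rollouts per position, and the apply_grpo_update placeholder are assumptions added for illustration; only the random per-document token selection and the use of general pretraining documents come from her description.

```python
import random

def rlp_pass_over_corpus(documents, tokenizer, thought_policy, no_think_baseline,
                         generate_thought, apply_grpo_update):
    """One pass of RLP over a stream of general pretraining documents.
    No entropy-based or other 'fancy' token selection: one random position per
    document, then the document is discarded."""
    for doc in documents:                        # web crawl and other general sources
        ids = tokenizer(doc, return_tensors="pt").input_ids[0]
        if len(ids) < 2:
            continue
        t = random.randrange(1, len(ids))        # random target position in the document
        context, target = ids[:t], ids[t]
        thoughts = [generate_thought(thought_policy, context) for _ in range(4)]  # rollouts
        rewards = [rlp_reward(thought_policy, no_think_baseline, context, th, target)
                   for th in thoughts]
        apply_grpo_update(thought_policy, thoughts, rewards)  # placeholder for the RL step
        update_baseline(no_think_baseline, thought_policy)
```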
The reported work changes where reasoning belongs in the pipeline
The practical implications Prabhumoye drew are mostly about how the reported work allocates data and training effort across stages. For data-mixture design, the two-phase result argues against treating a blend as only a static global distribution. The same weighted mixture can behave differently depending on when the model sees each source. Her recipe is broad exposure first, then concentration: use the early phase to cover diverse web crawl and other sources, and reserve the later phase for more repetitions of high-quality data such as math, Wikipedia, code, and task-specific corpora.
For SFT allocation, the front-loading results caution against assuming that all reasoning examples can be held until post-training without cost. In the reported fixed-budget experiment, splitting reasoning data between pretraining and post-training beat using the reasoning data only in SFT. In the extra-compute experiment, giving the no-reason baseline more SFT epochs improved it but did not close the gap with a reason-base model. In Prabhumoye’s framing, SFT amplified and refined reasoning more effectively when the base had already been exposed to reasoning-rich data.
For continued pretraining, RLP changes the comparison class. A token-matched comparison is not enough because RLP consumes additional compute through rollouts and thought generation. Prabhumoye therefore reported both token-matched and flop-matched baselines. Her claim was not merely that RLP beats ordinary continuation when both see the same number of tokens; it was that RLP still beat a compute-equivalent CPT baseline that saw far more data.
The scope limits are equally important. The front-loading reasoning data was largely community-defined and STEM-heavy. RLP was not shown on vision-language alignment. RLHF, hallucination mitigation, and alignment bias came up in Q&A, but Prabhumoye treated them as adjacent issues rather than claims established by the presented work. Asked about RLHF’s role in current model pipelines, she said that, independent of RLP, her understanding is that some RLHF data is still mixed into RLVR or other RL stages, though different model families use it differently and RLHF has become a smaller part of the pipeline.
Across the three bodies of work, the claim was cumulative but bounded: in the reported text-based LLM settings, curriculum ordering improved accuracy over random or natural data mixtures; reasoning traces introduced during pretraining created advantages that SFT and RL did not erase; and reinforcement-style exploration could be moved into pretraining itself, so the model was rewarded for thoughts that improved prediction rather than only imitating text until late-stage alignment.

