Language Model Scaling Depends on Controlling Hyperparameter Drift

Tatsunori Hashimoto · Stanford Online · Tuesday, May 19, 2026 · 19 min read

Stanford’s CS336 scaling-laws lecture, taught by Tatsunori Hashimoto, argues that modern language-model scaling is less about accepting a single Chinchilla-style rule than about controlling which training choices drift with size. Hashimoto presents scaling laws as useful empirical tools for choosing model/data tradeoffs, learning rates, batch sizes, sparsity, optimizers, and architectures, but repeatedly cautions that their transfer depends on the regime that produced them. Techniques such as µP and WSD schedules can reduce some uncertainty, he says, while data mixtures, optimizer details, weight decay, architecture changes, and post-training can still break clean extrapolations.

Scaling laws are useful, but the hard part is controlling what changes with scale

Tatsunori Hashimoto frames practical language-model scaling as a problem of controlling drift: as models get larger, some hyperparameters can be assumed stable, some can be made more stable by reparameterization, and some have to be empirically fitted. The basic Chinchilla-style question — how to trade model size against data under a compute budget — remains central, but the more operational questions are often learning rate, batch size, initialization, optimizer choice, sparsity, and architecture.

The “classical” scaling-law canon — Kaplan, Hestness, Chinchilla — only gets a builder to roughly 2022. Since then, Hashimoto says, there have been a number of scaling papers published by people who train big models, though fewer in recent years, and many of the detailed recent public papers have come from the Chinese open-source community. He treats these papers not as a complete frontier audit, but as public evidence of how serious model builders reason about scale when they are actually training models.

The practical questions are narrower than “do scaling laws exist?” They are questions like: does Chinchilla’s scaling recipe still work at open frontier scale? Can the compute cost of fitting scaling laws be reduced? Should learning rates and batch sizes be extrapolated with their own scaling laws? Can a parameterization such as µP make learning rates transfer across sizes? Do optimizer gains seen in small benchmarks survive larger runs?

Hashimoto’s answer is deliberately mixed. Chinchilla-style analyses have been replicated at meaningful open-model scales, and small-model scaling studies can predict larger-model losses surprisingly well. But scaling work is not a mechanical guarantee. Architecture details, data mixtures, optimizers, weight decay, and post-training can all disturb the transfer. Scaling laws provide a way to make better bets, not a way to remove judgment.

Scaling laws kind of have this very scientific feel to them. It's like, oh, yes, fit these lines and extrapolate them, but ultimately a big part of scaling laws is still vibes.

Tatsunori Hashimoto

That line comes in response to a student asking whether one should just apply published scaling laws rather than running a new grid search. Hashimoto’s answer is conditional: if a published law was fit in a regime similar to one’s own, it may be a good default. But for a serious pre-training run, even minor differences — architecture, weight decay, data, optimizer — can justify redoing the key sweeps. The point is not that published laws are useless; it is that transfer is an empirical assumption.

MiniCPM uses µP to make learning-rate tuning transfer

MiniCPM is Hashimoto’s first detailed case because it exposes several now-standard practical tricks. It was a 2024 effort from ModelBest and Tsinghua-affiliated authors to train high-performing small language models in the 1–2.5B parameter range. Hashimoto emphasizes that MiniCPM is not state of the art by 2026 standards, but it is valuable because it documents scaling decisions that later papers often take for granted.

The paper’s first important move is using µP to stabilize learning-rate transfer. The goal is straightforward: choose initialization and per-parameter scaling rules so that the optimal base learning rate stays roughly constant as width changes. MiniCPM’s recipe includes scaling embedding outputs, scaling residual increments by a depth-dependent factor, changing initialization for matrix-shaped tensors, applying per-tensor learning-rate scaling, and scaling the language-model head.

Hashimoto stresses that this is not just a cosmetic initialization choice. It is meant to remove one of the most sensitive tuning dimensions. In the MiniCPM experiments, after applying µP, loss-versus-learning-rate curves for several model sizes have minima close to the same learning rate, around $10^{-2}$. The smallest model’s optimum shifts slightly, but the larger point is that the minima align enough to make a single transferred learning rate plausible.

MiniCPM then scales through a ladder of smaller models before training the final release model. The displayed ladder ranges from roughly 9M to 0.5B parameters, while Hashimoto describes the gap between the largest ladder model and the actual model as about 5x. The reason to build the ladder is to avoid brute-forcing the final run: train many cheaper models, fit the sensitive scaling behavior, then extrapolate.

Problem	MiniCPM strategy	Why it matters
Learning rate	Use µP-style initialization and per-parameter scaling	Try to make the optimal learning rate stable across size
Batch size	Fit optimal batch as a function of loss and data/model scale	Batch remains scale-sensitive even if learning rate transfers
Data/model tradeoff	Use WSD schedules to make repeated data-axis measurements cheaper	Avoid retraining from scratch for every target token count

MiniCPM’s scaling recipe separates parameters it tries to stabilize from parameters it still fits empirically.

Batch size does not get the same simple invariance. Hashimoto describes MiniCPM’s batch-size analysis as a Kaplan-style critical-batch exercise: for several model sizes, the authors train with different fixed batch sizes and token counts, then identify loss-minimizing batch sizes at different points. The trend is clean enough to fit a power law: as target loss decreases, optimal batch size increases polynomially. The displayed fitted relation is $\log_{10}(BS) = -6.24 \log_{10}(L) + 20.93$.
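To make the fitted relation concrete, the short sketch below simply evaluates it at a few target losses. The function name `optimal_batch_size` and the example loss values are illustrative assumptions, and the unit of the batch size is whatever the original fit used, which this article does not state.

```python
import math


def optimal_batch_size(target_loss: float) -> float:
    """Evaluate the displayed fit: log10(BS) = -6.24 * log10(L) + 20.93."""
    if target_loss <= 0:
        raise ValueError("target_loss must be positive")
    return 10 ** (-6.24 * math.log10(target_loss) + 20.93)


# Hypothetical target losses, chosen only to show the direction of the trend.
for loss in (3.5, 3.0, 2.5):
    print(f"target loss {loss:.1f} -> fitted optimal batch size ~ {optimal_batch_size(loss):.3g}")
```

Because the slope is -6.24 in log-log space, halving the target loss multiplies the fitted batch size by $2^{6.24}$, which is why small improvements in loss call for much larger batches under this fit.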

The deeper lesson is that even if µP stabilizes learning rate, it does not make all scale-dependent tuning disappear. It reduces one nuisance dimension. Batch size still changes with the regime, and the data/model tradeoff still has to be measured.

WSD schedules make Chinchilla sweeps less wasteful

A major practical annoyance in Chinchilla-style scaling is that cosine learning-rate schedules are tied to the total training horizon. If a model is trained for four million sequences with one cosine schedule, that run cannot simply be extended into an eight-million-sequence cosine run; the schedule would have been different from the start. For scaling-law measurement, that means repeatedly training from scratch for different data budgets.

MiniCPM’s workaround is the warmup-stable-decay schedule, or WSD. Instead of a full-horizon cosine, the learning rate warms up for a fixed number of steps, stays constant for most of training, and then decays rapidly near the end. Hashimoto describes a typical decay phase as about 10–20% of the total run, often ending around 10% of the maximum learning rate.
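A warmup-stable-decay schedule can be sketched in a few lines. This is a minimal illustration, not MiniCPM's implementation: the linear warmup, the linear decay shape, the default step counts, and the function name `wsd_lr` are all assumptions, with the decay fraction and final ratio set to the roughly 10% figures mentioned above.

```python
def wsd_lr(step, total_steps, max_lr, warmup_steps=1000, decay_frac=0.1, final_ratio=0.1):
    """Warmup-stable-decay learning rate at a given step (illustrative sketch)."""
    decay_steps = round(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup (assumed shape).
        return max_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: constant, and independent of total_steps.
        return max_lr
    # Decay phase: linear ramp from max_lr down to final_ratio * max_lr (assumed shape).
    progress = (step - decay_start) / max(1, decay_steps)
    return max_lr * (1.0 - (1.0 - final_ratio) * progress)


# The stable phase is identical for different horizons, which is what allows checkpoint reuse.
print(wsd_lr(3000, 10_000, 1e-2), wsd_lr(3000, 20_000, 1e-2))
```

The key property is that nothing before the decay phase depends on `total_steps`, unlike a cosine schedule, whose value at every step depends on the planned horizon.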

The practical advantage is checkpoint reuse. A model can be trained through a long stable phase, then rewound to different stable checkpoints and decayed separately. That allows the builder to sample several effective data budgets without rerunning the entire pre-training trajectory from scratch. The cost is not zero — each endpoint needs its own decay — but repeatedly decaying is far cheaper than repeatedly retraining the whole run.
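A rough cost comparison shows why restart-and-decay is cheaper. The sketch below assumes each endpoint spends 10% of its token budget in decay and shares one stable trunk with all other endpoints; the budgets and function names are hypothetical and measured in arbitrary token units.

```python
def cosine_sweep_cost(budgets):
    """Each data budget needs its own from-scratch run under a full-horizon cosine."""
    return sum(budgets)


def wsd_sweep_cost(budgets, decay_frac=0.1):
    """One shared stable trunk, plus a separate decay branch per endpoint."""
    stable_trunk = max(b * (1 - decay_frac) for b in budgets)
    decay_branches = sum(b * decay_frac for b in budgets)
    return stable_trunk + decay_branches


# Hypothetical data budgets (for example, billions of tokens).
budgets = [10, 20, 40, 80]
print("cosine sweep cost:", cosine_sweep_cost(budgets))
print("WSD sweep cost:   ", wsd_sweep_cost(budgets))
```

Under these assumptions the cosine sweep costs the sum of all budgets, while the WSD sweep costs roughly the longest run plus a small decay overhead per endpoint.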

Hashimoto notes that WSD curves can look worse than cosine during the stable phase, then recover quickly during decay. The decay phase is doing substantial work. Anecdotally, he says many practitioners still find cosine slightly better in some cases, but WSD is often close enough, versatile enough, and operationally useful enough that it has become a common default. It is especially useful when the aim is not just one training run, but efficient scaling-law measurement along the data axis.

With WSD in place, MiniCPM can conduct Chinchilla-style analyses more cheaply. The paper fits losses as a function of model size $N$ and data size $D$ using a form like:

$L(N, D) = C_N N^{-\alpha} + C_D D^{-\beta} + L_0$

Given a compute budget $C = 6ND$, the fitted exponents imply an optimal model/data tradeoff. Hashimoto is cautious about MiniCPM’s particular fit. He says their lower-envelope and joint-fit curves look reasonable, but he is unsure whether their claim for much higher data-to-model ratios than Chinchilla reflects a real effect or an artifact of somewhat strange fits relative to the original Chinchilla paper.
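The step from fitted exponents to an optimal tradeoff can be sketched as follows, assuming a Chinchilla-style additive power-law loss $C_N N^{-\alpha} + C_D D^{-\beta} + L_0$ and the constraint $C = 6ND$. Substituting $D = C/(6N)$ and setting the derivative with respect to $N$ to zero gives the closed form used below. All constants here are placeholders for illustration, not MiniCPM's fitted values.

```python
def loss(N, D, C_N, C_D, alpha, beta, L0):
    """Assumed Chinchilla-style loss surface."""
    return C_N * N ** -alpha + C_D * D ** -beta + L0


def compute_optimal(C, C_N, C_D, alpha, beta):
    """Minimize the loss over N subject to C = 6 * N * D."""
    # Stationarity gives N**(alpha + beta) = (alpha * C_N) / (beta * C_D) * (C / 6)**beta.
    N_opt = (alpha * C_N / (beta * C_D)) ** (1 / (alpha + beta)) * (C / 6) ** (beta / (alpha + beta))
    D_opt = C / (6 * N_opt)
    return N_opt, D_opt


# Placeholder constants (hypothetical, for illustration only).
params = dict(C_N=406.4, C_D=410.7, alpha=0.34, beta=0.28)
for C in (1e19, 1e21, 1e23):
    N_opt, D_opt = compute_optimal(C, **params)
    print(f"C={C:.0e}: N_opt ~ {N_opt:.3g}, D_opt ~ {D_opt:.3g}, tokens per parameter ~ {D_opt / N_opt:.1f}")
```

In this form the optimal model size grows as $C^{\beta/(\alpha+\beta)}$ and the optimal data size as $C^{\alpha/(\alpha+\beta)}$, so small changes in the fitted exponents can move the implied data-to-model ratio substantially. That sensitivity is one reason to be cautious about any single reported ratio.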

The useful part is less the exact ratio and more the method: WSD turns repeated Chinchilla-style data sweeps from an expensive retraining problem into a cheaper restart-and-decay problem.

DeepSeek fits learning-rate and batch scaling directly instead of stabilizing them

DeepSeek LLM represents the alternative strategy. Rather than using µP to make the learning rate invariant, DeepSeek directly estimates how learning rate and batch size should change with scale. Hashimoto describes it as one of the more carefully executed public scaling analyses in open models.

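DeepSeek runs extensive grids over learning rate and batch size at smaller compute budgets, finds near-optimal regions, and then fits scaling laws for the optimal hyperparameters. The formulas shown are power laws in the compute budget $C$, of the form:

$\eta_{\text{opt}} = a \cdot C^{-\alpha}$ and $B_{\text{opt}} = b \cdot C^{\beta}$, where $a$, $b$, $\alpha$, and $\beta$ are positive fitted constants.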

Here, learning rate decreases with non-embedding training FLOPs, while batch size increases. Hashimoto finds the batch-size fit more visually convincing than the learning-rate fit. The learning-rate plot has more scatter and, in his view, looks somewhat questionable as a clean line. His concern is not that the resulting DeepSeek models failed — they trained and performed reasonably — but that grid-based learning-rate extrapolation can suffer from quantization error if the grid is not placed well.
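Fitting a hyperparameter scaling law of this kind amounts to a straight-line fit in log-log space. The sketch below uses synthetic data generated from made-up constants, so none of the numbers are DeepSeek's; `fit_power_law` is a hypothetical helper name.

```python
import numpy as np


def fit_power_law(compute, optimum):
    """Fit optimum = a * compute**b by least squares on log10-transformed data."""
    slope, intercept = np.polyfit(np.log10(compute), np.log10(optimum), deg=1)
    return 10 ** intercept, slope


# Synthetic example: made-up "best learning rate" at each compute budget, with noise.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19, 3e19])
true_a, true_b = 0.5, -0.1  # hypothetical ground truth, for illustration only
rng = np.random.default_rng(0)
best_lr = true_a * compute ** true_b * np.exp(rng.normal(0.0, 0.05, compute.size))

a, b = fit_power_law(compute, best_lr)
print(f"fitted exponent: {b:.3f}")
print(f"extrapolated learning rate at C=1e21: {a * 1e21 ** b:.3g}")
```

In a real sweep each "best" value is only known to the resolution of the grid, so the points fed into the fit carry the quantization error Hashimoto describes in addition to ordinary noise.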

A student asks why there can be multiple optimal learning-rate points for the same number of training FLOPs. Hashimoto attributes this to changes in model size and related variables: the FLOPs axis alone is not the whole state of the system.

DeepSeek also uses a WSD-like schedule for Chinchilla analysis, but with a multi-step decay. The described schedule warms up over 2,000 steps, stays high until 80% of training tokens, decays to 31.6% of the maximum learning rate, then decays further to 10% after 90% of tokens. Hashimoto says he is not sure why DeepSeek uses two decay stages, and that this specific variation has not obviously become standard. The important point is that it serves the same role as WSD: enabling cheaper Chinchilla-style measurement.
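The multi-step schedule described above can be written as a small piecewise function. This sketch assumes a fixed batch size, so that the fraction of steps equals the fraction of tokens, and assumes a linear warmup; the function name `multistep_lr` is hypothetical.

```python
def multistep_lr(step, total_steps, max_lr, warmup_steps=2000):
    """Two-stage step decay: full rate, then 31.6%, then 10% of max_lr."""
    if step < warmup_steps:
        # Linear warmup (assumed shape).
        return max_lr * (step + 1) / warmup_steps
    progress = step / total_steps  # stands in for the fraction of tokens seen
    if progress < 0.8:
        return max_lr
    if progress < 0.9:
        return 0.316 * max_lr
    return 0.1 * max_lr


for step in (1_000, 50_000, 85_000, 95_000):
    print(step, multistep_lr(step, total_steps=100_000, max_lr=1.0))
```

Note that 0.316 is approximately $\sqrt{0.1}$, so the two stages each reduce the learning rate by about the same multiplicative factor.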

DeepSeek’s IsoFLOP curves are cleaner than MiniCPM’s, in part because of its analysis choice. The paper fits model/data tradeoffs and then checks whether small-scale scaling curves predict the final 7B and 67B model losses. Hashimoto treats this as one of the strongest demonstrations in the material: small curated runs, extrapolated with a power law, land close to the actual losses of larger open-source models.

That is why DeepSeek matters. It shows the “fit the sensitive hyperparameters” school: assume most transformer hyperparameters are stable enough, but explicitly model learning rate, batch size, and model sizing as scale-dependent quantities.

Recent reports use scaling laws to justify architecture, sparsity, and downstream expectations

Hashimoto says the detailed public scaling sections have become less common partly because the core machinery is now widely understood. Qwen 2.5 and Qwen 3 are presented only briefly: in his description, Qwen 2.5 says it runs scaling experiments to tune optimal batch sizes and learning rates, and Qwen 3 largely reuses that recipe. The point is not that the lecture documents Qwen’s procedure in depth; it is that these details now appear as sparse references to a standard practice rather than extended scaling-law sections.

The newer frontier for scaling-law work is often mixture-of-experts and architecture selection. Kimi K2, in Hashimoto’s description from the lecture slide, uses sparsity scaling laws to reason about how sparse its MoE should be. The basic logic is to vary FLOPs and sparsity levels, observe validation loss, and quantify how sparsity changes the loss/FLOP tradeoff. Hashimoto says that, looking across sparsity levels, sparser networks reach lower validation loss at a given FLOP budget, and that the point of the analysis is to measure diminishing returns. He describes the analysis as supporting a 4/8 sparsity choice, but does not unpack the underlying result further.

Hunyuan performs a related MoE scaling exercise, but fixes sparsity and studies active parameter sizing. Hashimoto notes that Hunyuan reports an optimal 96:1 data-to-active-parameter ratio. Llama 3 has IsoFLOP-style scaling with a reported 39:1 token-to-model ratio, but Hashimoto finds its more interesting contribution to be a compute-to-downstream analysis: better log loss maps to higher downstream accuracy, roughly via a sigmoid. He does not treat the fitted curve as a universal truth, because the plotted points show systematic deviations, but he considers the connection useful. The material has mostly discussed log loss, and Llama 3 offers evidence that log loss can be tightly coupled to benchmark accuracy in some cases.

MiniMax-01 uses scaling laws for architecture choice. The team compares Lightning attention, full Softmax attention, and a hybrid. Hashimoto says the plots suggest the variants are broadly comparable in scaling behavior and require similar model sizes at given compute levels. That supports choosing the hybrid architecture for the deployed system. This is the kind of decision scaling laws are well suited for: not only predicting final loss, but ruling out an architecture that would degrade badly at scale.

Model/report	Scaling use described	Hashimoto’s emphasis
Qwen 2.5 / Qwen 3	Learning-rate and batch-size scaling, with few details in the lecture	Now part of a standard recipe rather than extensively documented
Kimi K2	MoE sparsity scaling	Used, in Hashimoto’s description, to choose a sparsity level under diminishing returns
Hunyuan	MoE active-parameter scaling	Reports a 96:1 data-to-active-parameter ratio
Llama 3	IsoFLOPs plus compute-to-downstream mapping	Shows loss can connect to downstream accuracy, though not perfectly
MiniMax-01	Architecture scaling across attention variants	Uses scaling to justify a hybrid architecture

Recent scaling work often shifts from proving Chinchilla-style laws to making concrete architecture and sparsity decisions.

A student asks how this changes once post-training is standard. Hashimoto’s answer is that it remains an open question. There is no great fully integrated scaling framework that accounts for post-training. Post-training can change what pre-training should optimize for, and the closest related work he mentions involves notions like coverage or diversity in pre-training that might predict post-training outcomes. But he calls that work nascent and says there is no good general answer yet.

Learning rate and batch size are smooth enough to fit, but the variables are still disputed

The StepFun scaling-law study is used as a more recent and large-scale version of the DeepSeek-style hyperparameter program. Hashimoto says he does not think there is a definitive robust study of hyperparameter scaling, especially learning rates, but StepFun is “getting somewhat close” because the team trains credible large models and spends substantial compute grid-searching the space.

The core uncertainty is not only the exponent, but the functional form and inputs. Kaplan-style critical batch size scales batch as a function of terminal loss. DeepSeek scales learning rate and batch as powers of compute. StepFun argues for a different formulation, including a batch-size dependence primarily on data size. Hashimoto warns against treating any row in these comparison tables as gospel: the specific fits are brittle, but the empirical phenomena are informative.

StepFun’s method is simple in concept: train many small models across a grid of learning rates and batch sizes, at different model sizes and data sizes, then map the validation-loss surface. The high-resolution grids show a favorable property: for fixed data and model size, pre-training loss over learning rate and batch size is fairly smooth and convex-looking. Slices through the landscape are not jagged. That makes the search problem more viable; if the surface were chaotic, gridded scaling-law fitting would be much less trustworthy.

Hashimoto highlights one trend as especially interesting: optimal batch size appears to depend mainly on the amount of data being trained on. In the StepFun results, different model sizes fall roughly along the same log-log trend when batch size is plotted against dataset size. This is one of the few observations he says he has not seen clearly contradicted.

Learning rate is more complicated. In the StepFun results, larger models have lower optimal learning rates, while larger datasets at fixed model size have higher optimal learning rates — a counterintuitive pattern. Hashimoto says this may be fragile; other papers argue for a reversed dependence on data. But if one assumes Chinchilla-like scaling, where model size and data size both grow with compute, the effects can combine into the familiar pattern: as compute rises, optimal learning rate goes down and batch size goes up, as in DeepSeek.
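The way the two opposing effects combine can be made explicit with a little exponent arithmetic. Suppose the optimal learning rate scales as $N^{a} D^{b}$ with $a < 0$ and $b > 0$, and that model and data sizes grow as $N \propto C^{s}$ and $D \propto C^{1-s}$. The exponents below are hypothetical, chosen only to show the mechanics, and `exponent_in_compute` is an illustrative helper.

```python
def exponent_in_compute(model_exp, data_exp, model_share=0.5):
    """Exponent of C when the quantity scales as N**model_exp * D**data_exp,
    with N proportional to C**model_share and D to C**(1 - model_share)."""
    return model_exp * model_share + data_exp * (1 - model_share)


# Hypothetical exponents: learning rate falls with model size and rises with data size.
lr_exp = exponent_in_compute(model_exp=-0.7, data_exp=0.3)
# Hypothetical exponents: batch size depends only on data size.
bs_exp = exponent_in_compute(model_exp=0.0, data_exp=0.6)
print(f"learning rate scales as C^{lr_exp:+.2f}, batch size as C^{bs_exp:+.2f}")
```

The net learning-rate exponent is negative only when the model-size effect outweighs the data-size effect under the assumed split, so the familiar compute-level trend depends on both the fitted exponents and the allocation rule.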

The study also tests robustness. The learned scaling laws transfer to MoEs to some extent when controlling for active parameters. But shifting the training data — bilingual, code-integrated, code-dominant mixtures — shifts optimal learning rates and batch sizes. That reinforces the earlier warning: scaling laws depend on the regime and data recipe that produced them.

Optimizer gains are scale-dependent, and small benchmarks can mislead

Optimizer choice is a particularly delicate scaling problem. Hashimoto uses Muon as the main example because it showed large gains on the NanoGPT speedrun benchmark and later appeared in a large-scale model, Kimi K2.

The small-scale story is compelling: in the NanoGPT speedrun, Muon significantly improves time-to-loss relative to Adam-like baselines. But optimizer performance can shrink with scale. Hashimoto points to scaling studies in which alternatives such as Muon, SOAP, and NAdamW show speedups over AdamW at small model sizes, but the speedup decreases as model size increases. That does not mean the optimizers are bad; it means the conclusion “this wins at small scale” is not enough.

Two confounders matter. The first is compute scale itself. Any algorithmic claim should be checked as model size and compute rise. The second is the Chinchilla ratio — the ratio of data to parameters. Some methods may work better when models are overparameterized relative to data; others may work better when there is more data per parameter. Hashimoto says even papers with good model-size scaling hygiene often neglect this axis because it is expensive. In the optimizer study he discusses, the gains are fairly consistent across Chinchilla ratios, but he stresses that this is not guaranteed.

Hyperparameter fairness is also a major issue. A poorly tuned AdamW baseline can make a new optimizer look much better than it is. In the “Fantastic Pretraining Optimizers and Where to Find Them” example, tuning AdamW’s learning rate can yield a 2x speedup, erasing apparent gains from alternatives. Weight decay can also be optimizer-specific: the material says Lion performs best around a weight decay of approximately 0.6 in one setting. If all optimizers are tested with the same weight decay, the comparison may be unfair.

Hashimoto also shows a case from Will Held’s Delphi work where scaling appears clean over many orders of magnitude, then breaks. The run uses cautious AdamC, square-root batch-size scaling of learning rates, and other standard-looking choices. Up to a dashed threshold, the scaling law looks beautiful. Beyond it, performance degrades, then a run diverges. The fix involved more careful parameterization, scaling, and optimizer changes. The point is that even long, attractive scaling trends can fail suddenly.

Muon itself is conceptually different from Adam-like scalar adaptive methods. It treats matrix-valued parameters differently from vector-valued ones. In a standard momentum optimizer, the update momentum $B_t$ would be applied directly. Muon instead approximately orthogonalizes the matrix update before applying it. If $B_t = U S V^T$, the idealized operation replaces the singular values $S$ with ones, producing $U V^T$. This makes update directions unit-sized in a spectral sense, rather than normalizing individual coordinates as Adam-like methods do.
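The idealized operation can be written directly with an SVD. This is only a reference sketch of the math described above, not how Muon is implemented in practice; the matrix shape and the name `orthogonalize_svd` are arbitrary choices for illustration.

```python
import numpy as np


def orthogonalize_svd(B):
    """Replace the singular values of B with ones: B = U S V^T  ->  U V^T."""
    U, _, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ Vt


rng = np.random.default_rng(0)
B = rng.normal(size=(6, 4))  # stand-in for a momentum matrix
O = orthogonalize_svd(B)

# Every singular value of the result equals one (up to floating-point error).
singular_values = np.linalg.svd(O, compute_uv=False)
print(singular_values)
```

The result keeps the singular directions of the original update while discarding their relative magnitudes.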

A student asks whether singular value decomposition is fast on GPUs. Hashimoto clarifies that Muon is not doing full SVD. It uses Newton-Schulz, a finite-iteration, matrix-multiply-only approximation to orthogonalization. That systems detail is part of why the method is practical.
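A matrix-multiply-only orthogonalization can be sketched with the textbook cubic Newton-Schulz iteration. This is an assumption-laden illustration: Muon implementations may use a different polynomial and far fewer iterations, and the test matrix below is constructed with known singular values purely so the result can be checked.

```python
import numpy as np


def newton_schulz_orthogonalize(B, num_iters=20):
    """Approximate U V^T for B = U S V^T using only matrix multiplications."""
    # Dividing by the Frobenius norm guarantees every singular value is at most 1.
    X = B / (np.linalg.norm(B) + 1e-12)
    for _ in range(num_iters):
        # Each step maps every singular value s to 1.5*s - 0.5*s**3, which pushes it toward 1.
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X


# Build a matrix with known singular vectors and values (hypothetical example).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(6, 4)))
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
B = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T

approx = newton_schulz_orthogonalize(B)
max_error = np.abs(approx - U @ V.T).max()
print("max deviation from U V^T:", max_error)
```

Because the loop contains nothing but matrix products, it maps well onto GPU hardware, which is the systems point Hashimoto makes.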

Muon is only directly meaningful for matrix-valued parameters. Vector-valued parameters, such as normalization parameters, may still use AdamW or another optimizer. Another student asks whether the hyperparameters are shared with Adam for those vector parameters; Hashimoto says no, they are different. He notes that one possible future direction is increasingly parameter- or layer-specific optimizers and learning rates, though he adds that he does not want to tune all of that manually.

The final status of Muon is unresolved but important. Some scaling studies suggest its gains diminish with model size. Kimi K2 nevertheless trained a strong large model with Muon, after adding stability measures. Hashimoto’s conclusion is careful: Kimi K2 shows Muon works at scale, but without large-scale ablations, it does not prove Muon is better than AdamW at that scale.

µP tries to replace empirical drift with scaling invariants

The final technical thread returns to µP in depth. Tatsunori Hashimoto calls it “a bit of a mysterious object”: many papers and implementations exist, and they do not all agree on the math or the exact implementation. But the shared program is clear enough. µP seeks scale-invariant hyperparameter tuning, especially learning-rate transfer across width.

The motivating picture is simple. Under standard practice, as width increases, the optimal learning rate shifts. Under µP, the desired outcome is that the optimum stays fixed. To achieve that, the method changes initializations, per-parameter learning rates, output scaling, and sometimes residual or attention scaling as a function of width.

CerebrasGPT is one empirical example. The slide describes a broader CerebrasGPT family of 0.1B to 13B models trained with the Chinchilla recipe, and separately says the team used µTransfer by first tuning a small 40M-parameter µP model and transferring hyperparameters along a µP scaling law up to 2.7B parameters. Hashimoto says the µP parameterization made scaling-law fits more stable and made projected µP trends line up closely with actual models. MiniCPM provides another example. Hashimoto treats these as evidence that µP-style parameterization is useful enough to be part of the practical toolkit, while still emphasizing that the theory and implementations are not fully settled.

He then gives a conceptual derivation, drawing on “A Spectral Condition for Feature Learning” by Greg Yang, James B. Simon, and Jeremy Bernstein, which he describes as an accessible “µP for babies” paper. The core assertions are:

Activations at initialization should remain $\Theta(1)$ per coordinate as width changes.
After one gradient step, the change in activations should also remain $\Theta(1)$ per coordinate.

If each coordinate is $\Theta(1)$, then the Euclidean norm of an activation vector of width $n_l$ is $\Theta(\sqrt{n_l})$, because the squared norm sums $n_l$ order-one terms. This norm accounting drives the scaling rules.

For a simple deep linear network $h_l = W_l h_{l-1}$, with Gaussian initialization, the aim is to choose the standard deviation so that activation norms stay at the right scale layer by layer. Hashimoto describes his derivation as order-of-magnitude reasoning, with worst-case operator-norm assumptions, not as a rigorous theorem. The result differs from standard initialization when fan-out and fan-in are imbalanced.
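A quick numerical check illustrates the norm accounting for the simplest case, a stack of square layers. It assumes Gaussian weights with standard deviation $1/\sqrt{\text{fan-in}}$, which in the square case coincides with conventional initialization; the depth, widths, and function names are arbitrary choices.

```python
import numpy as np


def rms(x):
    """Root-mean-square size of the coordinates of x."""
    return float(np.sqrt(np.mean(x ** 2)))


def simulate_linear_stack(width, depth=8, seed=0):
    """Push a random input through `depth` square Gaussian layers, h_l = W_l h_{l-1}."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)  # order-one coordinates
    input_norm_ratio = float(np.linalg.norm(h) / np.sqrt(width))
    for _ in range(depth):
        W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
        h = W @ h
    return input_norm_ratio, rms(h)


for width in (64, 256, 1024):
    ratio, out_rms = simulate_linear_stack(width)
    print(f"width={width}: ||h0|| / sqrt(n) = {ratio:.2f}, output coordinate RMS = {out_rms:.2f}")
```

Both printed quantities should stay near one as width grows, which is the invariant being asserted. The interesting cases for µP are the non-square layers, where this choice of standard deviation needs adjusting.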

For updates, the aim is similar: choose learning-rate scaling so that one gradient step changes activations by the right order of magnitude. In the simplified SGD derivation, the layer update has a rank-one form involving the loss gradient and the previous activation. Requiring the resulting activation-change terms to stay comparable leads to a learning-rate scaling like fan-out over fan-in. For Adam, Hashimoto says the corresponding rule changes, giving layer-specific behavior more closely tied to fan-in.
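The resulting rules can be collected into a small helper. This is a sketch of the scalings as commonly stated in the spectral-condition framing, with all constants dropped: the initialization formula and the inverse-fan-in Adam rule are assumptions on my part rather than formulas quoted in the lecture, and the function names are hypothetical.

```python
import math


def spectral_init_std(fan_in, fan_out):
    """Init scale: (1 / sqrt(fan_in)) * min(1, sqrt(fan_out / fan_in)) (assumed form)."""
    return (1.0 / math.sqrt(fan_in)) * min(1.0, math.sqrt(fan_out / fan_in))


def sgd_lr_scale(fan_in, fan_out):
    """SGD learning-rate multiplier proportional to fan_out / fan_in."""
    return fan_out / fan_in


def adam_lr_scale(fan_in, base_fan_in):
    """Adam learning-rate multiplier relative to a tuned base width, proportional to 1 / fan_in (assumed)."""
    return base_fan_in / fan_in


# Hypothetical layer shapes: a square layer, an up-projection, and a down-projection.
for fan_in, fan_out in [(1024, 1024), (1024, 4096), (4096, 1024)]:
    print(
        f"{fan_in}->{fan_out}: std={spectral_init_std(fan_in, fan_out):.5f}, "
        f"sgd_scale={sgd_lr_scale(fan_in, fan_out):.2f}, "
        f"adam_scale={adam_lr_scale(fan_in, base_fan_in=256):.3f}"
    )
```

For square and widening layers the initialization matches the familiar $1/\sqrt{\text{fan-in}}$ rule; only the narrowing layer gets a smaller standard deviation, which is where this rule departs from standard practice.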

Quantity	Standard parameterization	µP-style implication in the lecture
Activation scale	Kept reasonable by conventional initialization	Explicitly required to remain Θ(1) per coordinate across width
Change in activation after a step	Not usually the organizing invariant	Required to remain Θ(1) per coordinate so the model keeps feature-learning behavior
Initialization	Often scales mainly with fan-in	Adjusted when fan-in and fan-out imbalance would disturb activation scaling
Adam learning rates	Often treated as a shared base rate	Can become layer- or parameter-type-specific under µP rules

Hashimoto’s µP explanation centers on preserving activation scale and update scale as width changes.

Transformer implementations then turn these principles into tables of rules for embeddings, attention projections, MLP matrices, and the softmax/language-model head. Some variants also use attention scaling like $1/D$ rather than the usual $1/\sqrt{D}$. Hashimoto’s emphasis is not only that µP yields useful rules, but that it represents a distinctive way to design algorithms: take a scaling limit, assert invariants, add assumptions, and derive constraints on hyperparameters.

µP works in controlled settings, but not every modern component respects it

Empirically, µP can do what it promises under controlled width scaling. Hashimoto cites a stress-testing paper by an independent researcher that replicates the basic result: when width is scaled by 4x at each step, making models 16x larger, the smallest model’s optimal base learning rate transfers reliably to larger models under baseline µP. A variant involving projection biases also transfers.

But modern language models include many components that deviate from the simplified µP theory: SwiGLU and squared ReLU activations, varying batch sizes, zero-attention initialization, RMSNorm gains, exotic optimizers, and regularizers. Hashimoto says most of these do not necessarily break µP, but some do.

Learnable RMSNorm gains are one failure mode. In the cited stress test, adding vector or scalar learned RMSNorm gains makes learning-rate transfer unreliable. Hashimoto notes that such gains can often be removed with little loss of performance, so this may be manageable.

Exotic optimizers are another issue. Lion, which relies heavily on the sign of the gradient, fails to transfer in the displayed experiments. Hashimoto says this is spiritually related to sign- or matrix-normalized update ideas like Muon, suggesting that other optimizer changes may also interact badly with µP.

Strong decoupled weight decay appears to be the most concerning failure in the stress tests. With weight decay around 0.1, µP transfer breaks significantly. This matters because weight decay is often treated as a secondary hyperparameter, but in scaling contexts it may alter the invariance assumptions.

The overall conclusion is still positive but limited. Standard parameterization becomes much more unstable as width increases, with optimal learning rates shifting sharply and some large-width settings failing. µP tends to make learning-rate transfer easier. But it is a tool, not a solved theory of all transformer training.
