Pre-Training Scale Is Losing Ground to Adaptive AI Systems

Sara Hooker | Hugging Face | Thursday, May 21, 2026 | 20 min read

Sara Hooker, co-founder of Adaption Labs, argues in a Hugging Face ML Club India talk that AI progress is moving away from ever-larger pre-training runs as the default path and toward systems that adapt more efficiently after deployment. She says compute still matters, but the higher-return questions now concern data curation, post-training, test-time compute, interfaces, routing, and how cheaply models can learn from new information. Her case is that monolithic, one-size-fits-all models push the cost of adaptation onto users and concentrate participation among labs with the largest compute clusters.

AI competition is moving from larger pre-training runs to adaptive systems

Sara Hooker’s central claim is not that compute has stopped mattering. It is narrower and more consequential: the return on scaling model size through pre-training no longer looks like the dominant path of progress, and the locus of competition is moving toward adaptive systems — data, inference, interfaces, automated training loops, and the cost of learning from new information.

She framed the last decade of AI as a “bigger is better” race across compute, parameters, and data. The intellectual backbone of that race, in her telling, is Rich Sutton’s “Bitter Lesson”: over long horizons, general methods that leverage computation tend to win over human-crafted domain knowledge. Hooker described this as “a punch to the ego of every computer scientist,” because it implies that elegant human ideas matter less than methods that can absorb more computation.

That belief did not merely shape technical strategy. It reorganized the ecosystem. Hooker pointed to jokes about being “GPU rich” or “GPU poor,” Michael Jordan’s line that “today we can’t think without holding a piece of metal,” and the migration of talent and resources from academia into industry labs. It also narrowed who gets to participate. If progress depends primarily on ever-larger training runs, contribution is gated by access to large, reliable, colocated compute. Compute becomes not just a research input but a national priority.

The belief is sticky for organizational reasons as much as technical ones. Throwing compute at a problem is seen as lower-risk than algorithmic or architectural work. It fits quarterly planning. It matches the story told in large capital raises. Once an organization has raised money on the premise that compute is the most important ingredient, it becomes awkward to admit that more compute may not be the best next use of capital.

Hooker’s disagreement with the scaling orthodoxy rests on several kinds of evidence. First, models of the same size have become far more capable over time. Using historical data from the retired Hugging Face Open LLM Leaderboard, she showed that models under 13B parameters steadily improved over 2022–2024. More pointedly, she showed a chart in which small models frequently outperformed larger models submitted to the same leaderboard. Her conclusion was not that small models always beat large ones, but that size alone is no longer a reliable recipe.

Second, she argued that neural networks contain severe redundancy. She cited Denil et al.’s 2014 paper, which found that a small set of weights could be used to predict 95% of the weights in a network. She also pointed to pruning work, including her own 2019 sparsity paper with Trevor Gale and Erich Elsen, showing that a ResNet-50 could lose about 90% of its weights after training while losing only a few percentage points of test-set accuracy under certain pruning methods. The implication is that large parameter counts may be needed for optimization and convergence, but many of those weights are not needed at inference in the same way.
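The pruning claim is easier to picture with a toy example. The sketch below applies global magnitude pruning, one simple and common baseline, to a list of random weights. It is an illustration only: the function name and the random weights are assumptions, it does not reproduce the methods or results of the cited papers, and it says nothing about accuracy.

```python
import random

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    n_prune = round(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

random.seed(0)
# Toy stand-in for one layer's trained weights (an assumption, not real data).
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]
pruned = magnitude_prune(weights, sparsity=0.9)
kept = sum(1 for w in pruned if w != 0.0)
print(f"{kept} of {len(weights)} weights remain")
```

In a real network the pruned model would then be evaluated, and often fine-tuned, to measure how much accuracy the removed weights were actually carrying.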

Third, she emphasized data quality. Better curation, deduplication, pruning, and synthetic data can reduce the amount of compute needed or improve training dynamics. In her framing, this matters because large models pay heavily to learn the long tail: rare features, rare domains, rare cases, rare languages, and rare behaviors.

Recent model releases, in Hooker’s account, have made the problem harder to ignore. She described GPT-4.5 as widely regarded as not a dramatic stepwise improvement despite being much larger and more expensive to serve. She said it was only briefly productionized and then replaced with routing because, in her view, the serving cost was disproportionate to its value. She made a similar point about reactions to Llama 4. She also said Mythos appeared useful for a specialized part of the distribution but that she doubted it would be served at scale because of cost. Her strongest inference was that frontier labs are unlikely to keep quadrupling model size this year.

The shift she described is not away from compute altogether. It is away from training-time model-size scaling as the default answer. She separated that from post-training, test-time scaling, adaptive compute, and hardware-aware serving, all of which she sees as more promising places to spend effort. Training-time compute, she said, is “currently the least interesting idea to throw at a problem.” Test-time compute is different: the question is how to make it task-adaptive, how to spend it where uncertainty is high, and how to avoid treating every query as if it deserves the same computational budget.

Monolithic AI pushes adaptation costs onto the user

Sara Hooker introduced adaptive intelligence with a deliberately mundane failure: she asked ChatGPT to create slides for a talk about why adaptive intelligence matters. It produced a bombastic opening slide with a chameleon, a futuristic skyline, and copy about markets shifting, customers evolving, and disruption accelerating. When she asked it to introduce “Sara Hooker,” it produced a polished speaker slide with the wrong woman’s photo and generic language: “AI strategist,” “innovation leader,” “adaptive intelligence advocate.”

For Hooker, the point was not that the model made an amusing mistake. The point was that today’s AI systems generally give users two unsatisfying options. A user can give a thumbs up or thumbs down and hope that the feedback eventually enters a future reinforcement learning pipeline. Or the user can become a prompt engineer, spending hours trying to coax the system into producing the desired result.

That is what she means by monolithic AI: frontier labs build the best general model they can, ship the same model to everyone, and expect end users to make it work. The model is static relative to the user’s task. The interface is static. The computational pattern is static. The user absorbs the burden of adaptation.

Hooker sees two costs in that design. The first is a human cost: users around the world are forced into “acrobatics” with prompts, even when their needs are local, specialized, linguistic, cultural, or domain-specific. The second is a systems cost: the same kind of model and the same amount of compute are applied regardless of problem. A simple task, a rare-domain task, a high-uncertainty task, and a task requiring long-horizon exploration are not treated as fundamentally different allocation problems.

The alternative she described is an adaptive stack: data, models, and interfaces that can change based on objective, task, distribution, and feedback. Adaption Labs’ internal framing has three pillars: Adaptive Data, Adaptive Intelligence, and Adaptive Interfaces. Hooker said the company’s “single North Star” is to make the whole stack adaptable.

Adaptive intelligence, as she defined it in discussion, is closely related to continual learning but not identical to it. Continual learning typically emphasizes a time factor: adding capabilities over a long horizon without forgetting prior knowledge. That becomes urgent when models move into long-horizon tasks, where they must absorb information, make decisions, and proceed through a sequence. Hooker’s emphasis is adaptation at every step: model behavior should change as new information arrives, so the search space becomes more controlled and less wasteful.

The present alternative, she said, is often massive rollout: systems explore inefficiently and receive useful signal only at the end. Efficient adaptation would let them incorporate information during the trajectory rather than waiting until the terminal reward or final outcome.

She was explicit that real-time adaptation, in Adaption’s intended sense, should not require gradient updates at test time. Post-training and alignment can make a model more flexible, and models can be trained with the expectation that their behavior will need to change. But once deployed, she said, the goal is to adapt using gradient-free methods: context, retrieval, search control, external memory, task inference, and other mechanisms that do not require updating the model’s weights in real time.
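As a rough illustration of what gradient-free adaptation can look like, the sketch below keeps user corrections in an external memory and retrieves the relevant ones into the prompt, so behavior changes without any weight update. Everything here is hypothetical: the `FeedbackMemory` class, the word-overlap retrieval, and the prompt layout are assumptions chosen for brevity, not a description of Adaption Labs’ system.

```python
class FeedbackMemory:
    """Stores corrections and retrieves the most relevant ones by word overlap."""

    def __init__(self):
        self.notes = []

    def add(self, note):
        self.notes.append(note)

    def retrieve(self, query, k=2):
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(n.lower().split())), n) for n in self.notes]
        # Keep only notes that share at least one word with the query.
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [note for _, note in scored[:k]]

def build_prompt(memory, query):
    """Prepend retrieved corrections to the task; the model weights stay frozen."""
    notes = memory.retrieve(query)
    lines = []
    if notes:
        lines.append("Known corrections:")
        lines.extend(f"- {note}" for note in notes)
    lines.append(f"Task: {query}")
    return "\n".join(lines)

memory = FeedbackMemory()
memory.add("Speaker bio for Sara Hooker: co-founder of Adaption Labs")
memory.add("Slides should avoid stock imagery")
prompt = build_prompt(memory, "Write a speaker bio slide for Sara Hooker")
print(prompt)
```

A production system would likely replace word overlap with learned retrieval, but the design point is the same: new information changes the next output immediately, at the cost of a lookup rather than a training step.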

That distinction matters because adaptation is not just a capability goal. It is an efficiency goal. If a model is interacting with the world, the speed at which it learns from new information becomes central. Hooker argued that what matters most now is “the cost of adaption” and who can make adaptation and learning from new incoming information as efficient as possible.

The data space has become a place to optimize, not just a thing to collect

Sara Hooker argued that data has become more malleable than the field’s older assumptions allow. Classical machine learning often begins with a dataset as a random sample from a distribution. The expensive work is collection and annotation. Since that work is slow and costly, the categories and objectives must be chosen carefully up front.

Hooker argued that synthetic data and modern data pipelines change this. It is now cheap enough to steer in the data space. Instead of passively accepting the long tail as rare and expensive, she said, one can “make your data space the long tail.” Rare examples can be oversampled synthetically. Missing parts of the distribution can be created or expanded. Data can be shaped toward target model behavior, including non-differentiable objectives, by penalizing or selecting based on desired data characteristics.
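A minimal sketch of steering the data distribution toward the tail follows. It rebalances a toy corpus so that rare groups are as well represented as the head. The corpus, the group labels, and the `rebalance` helper are all assumptions for illustration, and plain resampling with replacement stands in for synthetic generation, which would create new variants rather than repeat rows.

```python
import random
from collections import Counter

def rebalance(examples, key, target_per_group, rng):
    """Resample so every group contributes `target_per_group` examples.

    Rare groups are oversampled with replacement and the head is
    downsampled. A real pipeline would synthesize new tail examples
    instead of repeating existing ones.
    """
    groups = {}
    for ex in examples:
        groups.setdefault(key(ex), []).append(ex)
    balanced = []
    for members in groups.values():
        balanced.extend(rng.choices(members, k=target_per_group))
    return balanced

rng = random.Random(0)
# Hypothetical corpus: one head language and two tail languages.
corpus = ([("en", i) for i in range(900)]
          + [("sw", i) for i in range(80)]
          + [("pt-MZ", i) for i in range(20)])
balanced = rebalance(corpus, key=lambda ex: ex[0], target_per_group=300, rng=rng)
print(Counter(lang for lang, _ in balanced))
```

The target distribution is itself a design choice: uniform here, but it could just as easily be weighted toward whatever behavior the model is meant to acquire.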

That is the logic behind Adaption Labs’ first public pillar, Adaptive Data, which Hooker said was released in partnership with Hugging Face. She described it as making available data research techniques previously reserved for frontier labs: shaping datasets toward target model behavior, expanding what is missing, and evolving existing data. One Adaption Labs slide described “self-hosted extremely fast scalable adaptation of data.” Another internal slide claimed that AI-ready data could be created quickly and showed figures of 82% average quality gains, 242 languages covered, and 27,063,096 data artifacts processed in about three weeks.

242 languages covered, according to an Adaption Labs Adaptive Data slide

Hooker treated language coverage not as a side note but as part of the broader claim about participation. If the scaling era concentrated power in a handful of labs with large compute clusters, adaptive data is supposed to lower the cost of targeting specific distributions — including distributions that large general-purpose models often under-serve.

The same idea applies across the training lifecycle. Asked what data curation and generation look like in a compute- and data-efficient paradigm across pre-, mid-, and post-training, Hooker returned to the central point: once the data space is cheap to steer, it becomes one of the most efficient levers for behavior. It can be used in gradient-free ways, such as retrieval or context, or in gradient-based training. The key is the ability to specify, scale, and create data on the fly.

She also mentioned an upcoming Adaption capability she called “Inventive Dataset,” which she described as allowing users to summon missing parts of the distribution. The claim is not merely that better datasets improve models. It is that data generation and curation become active optimization processes, closer to model design than to static dataset preparation.

This shifts how one thinks about capacity. If transformers are expensive partly because they struggle to learn rare features from naturally occurring data, then an adaptive data pipeline can work around that limitation by changing the data distribution itself. In Hooker’s terms, this is one way to leverage capacity better rather than simply adding more capacity.

AutoScientist is presented as automation of the training loop, not just automated fine-tuning

Sara Hooker placed Adaption Labs’ AutoScientist inside the same shift from static training to adaptive optimization. Most fine-tuning outside frontier labs, she said, fails. People lack the right data, find training too expensive, and do not know how to configure the process. Much of that know-how remains locked inside frontier labs.

AutoScientist is Adaption’s attempt to automate the full research loop behind model training and alignment. The slide Hooker showed framed it as “Your Data + Model Training,” with the system making decisions over algorithm, epochs, LoRA rank, alpha, target layers, optimizer, warmup ratio, and gradient stopping. In her description, the important feature is that it optimizes across multiple steps: first the data, then the training process, then the next choice based on what was learned.
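To make the shape of that decision space concrete, the sketch below runs a plain random search over a handful of the knobs named on the slide. This is not AutoScientist’s method, which the talk described as optimizing across multiple steps. Random search is a deliberately simple stand-in, the candidate values are illustrative, and `train_and_score` is a placeholder that returns a made-up score instead of training anything.

```python
import random

# Hypothetical search space: knob names follow the slide, values are illustrative.
SEARCH_SPACE = {
    "algorithm": ["sft", "dpo"],
    "epochs": [1, 2, 3],
    "lora_rank": [8, 16, 32, 64],
    "lora_alpha": [16, 32, 64],
    "target_layers": ["attention", "attention+mlp"],
    "optimizer": ["adamw", "adafactor"],
    "warmup_ratio": [0.0, 0.03, 0.1],
}

def sample_config(rng):
    """Pick one value per knob, uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def train_and_score(config):
    """Placeholder for an expensive train-and-evaluate run.

    Returns a made-up score so the sketch executes. A real system would
    fine-tune a model with `config` and measure win rate on held-out prompts.
    """
    score = 0.4
    score += 0.05 * (config["lora_rank"] >= 16)
    score += 0.05 * (config["warmup_ratio"] > 0.0)
    score += 0.02 * config["epochs"]
    return score

def random_search(n_trials, seed=0):
    """Try `n_trials` random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = train_and_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search(n_trials=20)
print(best_config, round(best_score, 2))
```

Even this small space has 864 combinations, which gives a sense of why hand configuration tends to settle on a familiar corner of it.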

Hooker said Adaption benchmarked AutoScientist against its own researchers. In the released chart she showed, AutoScientist outperformed researcher-set configurations: a 64% average win rate versus 48% for human-configured training. The slide reported this as a 35% relative improvement; the rounded win rates shown work out to about 33% ((64 − 48) / 48), so the reported figure presumably reflects unrounded values.

Configuration              Average win rate shown
Human-configured training  48%
AutoScientist              64%

Source: Adaption Labs’ slide comparing human and AutoScientist training configurations

Hooker attributed part of the advantage to search breadth. Human research specialists often develop deep expertise around a single model family. When she worked on Aya and language models, for example, her team knew how to configure that stack and architecture. AutoScientist, by contrast, searches across “any model type” among frontier open-weight models. That is harder for a human researcher to do without prior experience across those model families.

Another Adaption chart compared AutoScientist against original models across verticals including business, finance, legal, marketing, medical, news, technology, and communication. AutoScientist’s win rates were shown in the 59–69% range, compared with original-model win rates in the low- to mid-30s. Hooker noted that the 60s ceiling partly reflected an imposed budget rule: the search stopped once performance exceeded 60. She said the company had removed that limit and was interested in seeing what expanded search budgets would produce.

Her larger point was that AutoScientist is an early example of a longer arc: automation of research and development itself. It is not only about tuning a model once. It is about automating how to set up a harness, how to explore, how to allocate search budget, and how to adapt the training or inference process based on the problem type.

That connects to her answer to Sayak Paul’s question about the relationship among pre-training, post-training, test-time scaling, and test-time training. Hooker said test-time compute is especially valuable when applied to examples where the model is most uncertain. That is the premise of adaptive compute: spend more on high-uncertainty cases. Decoding is one example — how many samples to draw in parallel, how to ensemble them, and how to decide when additional sampling is worth the cost.
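One simple way to spend more samples only where the model is unsure is to keep sampling until the answers agree. The sketch below does that with majority voting. It assumes agreement among samples is a usable proxy for confidence, and `toy_model`, the thresholds, and the function name are hypothetical stand-ins rather than anything described in the talk.

```python
import random
from collections import Counter

def adaptive_majority_vote(generate, prompt, initial=4, step=4,
                           max_samples=16, agreement=0.75):
    """Sample until the top answer's share reaches `agreement` or the
    sample budget is spent. Returns (answer, samples_used)."""
    answers = [generate(prompt) for _ in range(initial)]
    while True:
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= agreement or len(answers) >= max_samples:
            return top, len(answers)
        # Low agreement signals uncertainty, so buy more samples.
        answers.extend(generate(prompt) for _ in range(step))

rng = random.Random(0)

def toy_model(prompt):
    # Stand-in for a sampled model call: consistent on "easy", noisy on "hard".
    if prompt == "easy":
        return "4"
    return rng.choice(["A", "B", "C"])

print(adaptive_majority_vote(toy_model, "easy"))  # ('4', 4)
print(adaptive_majority_vote(toy_model, "hard"))
```

The easy prompt stops after the initial four samples, while the noisy one can draw up to the full budget of sixteen. That is the allocation pattern the paragraph describes.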

But she also argued that the field lacks a mature understanding of how data mixes across training stages interact with test-time behavior. Some things are known: repeated data across continued pre-training, post-training, and RLHF can be harmful; new data should be injected at different stages to keep training fresh and dynamic; continued pre-training has become associated with introducing reasoning, math, and code. But the deeper question is what knowledge should live in parameters, what should live in context, what should be retrieved externally, and what should guide search budget.

Facts that are stable and easy to retrieve may not be a good use of parameter capacity. Skills and navigation abilities may need to be taught more generally, because hard-coding the exact tools a model should use can make it brittle when new tools appear. Hooker argued that models may need to know that tools can be used without encoding exactly which tools must be used. The broader research problem is the allocation of knowledge across weights, context, retrieval, memory, and search.

The hardware lottery still punishes alternatives to transformer-era computation

Asked why frontier investment continues to optimize LLMs rather than shift toward alternatives such as Yann LeCun’s JEPA, Sara Hooker gave two answers that sit in tension. First, even if model-size scaling is plateauing, there remain useful optimization paths around LLMs: post-training, test-time scaling, context use, model collaboration, and agentic systems. Current architectures are expressive and can generate and collaborate with other models in powerful ways. So the shift away from pre-training scale does not automatically mean abandoning LLMs.

Second, she argued that alternatives are structurally disadvantaged by hardware. She referred to her “Hardware Lottery” paper and said the problem has worsened. Modern accelerators are deeply optimized for matrix multiplication, and matrix multiplies make up the overwhelming bulk of modern neural networks. Hooker compared deep neural networks to humans being mostly water: deep nets are primarily matrix multiplies.

When hardware is over-optimized for one computational pattern, other ideas can be hard to make empirically successful even if they are conceptually interesting. She used capsule networks as an example. Sara Sabour and Geoffrey Hinton’s capsule work attempted to address limitations in convolutional networks, such as invariance to position and information loss through max pooling, by representing structure differently. But operations such as squashing did not map cleanly onto hardware optimized for standard deep learning primitives.

Sparsity is another example. Structured sparsity can work with current hardware. But the strongest compression results often come from unstructured sparsity, which current hardware does not handle well in practice. As a result, an approach can be theoretically or empirically attractive in isolation and still fail to “scale” in the system that matters commercially.
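The trade-off can be shown with two masks at the same 50% sparsity. The sketch below compares an unstructured mask with a 2-of-4 pattern, in which two weights are kept in every consecutive group of four. The 2-of-4 pattern is supported by recent NVIDIA GPUs, but choosing it here is an assumption for illustration, since the talk did not name a specific pattern. Total kept magnitude is only a crude proxy for how much a mask preserves and is not a measure of accuracy.

```python
import random

def unstructured_mask(weights, keep_fraction):
    """Keep the largest-magnitude weights anywhere in the tensor."""
    k = round(len(weights) * keep_fraction)
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(weights))]

def two_of_four_mask(weights):
    """Keep the 2 largest-magnitude weights in each consecutive group of 4."""
    mask = [False] * len(weights)
    for start in range(0, len(weights), 4):
        group = range(start, min(start + 4, len(weights)))
        for i in sorted(group, key=lambda i: abs(weights[i]), reverse=True)[:2]:
            mask[i] = True
    return mask

def kept_magnitude(weights, mask):
    """Sum of absolute values of the weights that survive the mask."""
    return sum(abs(w) for w, keep in zip(weights, mask) if keep)

random.seed(0)
# Toy weights (an assumption); both masks keep exactly half of them.
w = [random.gauss(0.0, 1.0) for _ in range(1024)]
free = kept_magnitude(w, unstructured_mask(w, 0.5))
fixed = kept_magnitude(w, two_of_four_mask(w))
print(free >= fixed)  # True
```

The unstructured mask can never keep less magnitude than the constrained one, because it chooses the best half globally. The structured pattern gives up some of that freedom in exchange for a layout that hardware can exploit.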

Hooker therefore welcomed what she called a new wave of “neo labs” betting on the next era of intelligence. She said the field needs more diversity in approaches. But she was also clear that transformers are not the final answer. Their inefficiency is severe, especially because training dynamics force models to pay heavily to learn the long tail. The difficulty is that hardware has become overfit to the current paradigm, slowing the emergence of alternatives.

That also shaped her answer about undervalued research domains. Alternative architectures and sparsity are undervalued because they do not fit the scaling narrative or the hardware path that scaling has reinforced. The penalty is not only intellectual fashion. It is empirical: if an idea cannot run well on the dominant hardware, it struggles to demonstrate value at the scale where the field now expects evidence.

Smaller labs may gain room as progress shifts from colocated pre-training to recipes, routing, and interaction

For a small lab with strong data and training strategy but limited model size, Sara Hooker said the moment is “fun again.” Her reasoning depends on the difference between pre-training scale and the newer optimization targets.

The last decade’s frontier progress required massive, colocated compute: enough GPUs in the same data center, stable interconnects, and reliable training at large scale. As long as frontier progress meant doubling, quadrupling, or increasing model size by 10x, a small lab had little chance of competing directly.

But if no frontier lab is doubling or quadrupling model size this year, then more of the competitive surface moves to post-training, interaction, context, routing, agentic harnesses, and adaptive compute. Those workloads have different infrastructure characteristics. They can tolerate more redundancy. They can use distributed data centers. They can make use of multiple providers. The recipe matters more.

Hooker did not say compute becomes irrelevant. She said the rate of return is higher in test-time and dynamic post-training compute, and those forms of compute do not require as much dedicated capacity in one place. That opens more room for distributed innovation and, in her words, global innovation.

This is also where her social argument reappears. Hooker described it as “sad” that ambitious researchers might feel they must aim for a handful of foundation labs. She connected that to her own path: growing up in Mozambique, receiving a scholarship to the United States, and being lucky enough for that sequence to open doors. Looking at colleagues from Stanford, Berkeley, and a few other institutions, she sees an absurd narrowing of who gets to build the technology.

Automation of R&D, in her view, can change those dynamics in the same way code generation changed who can build software. If the cost of adaptation and experimentation falls, then more people can try ideas, run useful experiments, and contribute without being inside one of a few labs. Her hope is that “the best idea starts to win,” rather than the person with the right school, scholarship, or institutional access.

That said, Hooker also emphasized that the frontier is harder to see from the outside than it used to be. When she joined Google Brain, she experienced the field as unusually open: NeurIPS felt like a pilgrimage, papers and techniques were shared, and the Deep Learning textbook was a central object of study. Now, she said, unless someone is inside a frontier AI lab, there is a sense that they do not know the “secret sauce.” At NeurIPS, frontier lab researchers may meet privately and play games like “underrated, overrated,” trying to infer what others are actually working on from what they call overrated or underrated.

So the extremes have sharpened. It is easier than before to get started and to make an impact with tools such as code generation and automated training workflows. It is harder than before to know the true state of frontier practice.

Interfaces are part of the learning system

Sara Hooker’s third pillar, Adaptive Interfaces, follows from a claim about feedback. Code accelerated because engineers had interfaces where they were comfortable giving rich feedback. Design saw similar acceleration because designers had expressive interfaces and strong preferences. Most other tasks, she argued, were pushed into a chat interface with thumbs up and thumbs down.

That feedback channel is too thin for broad adaptation. If systems need to learn from interaction, the interface is not decoration; it determines what signal the model receives. A better interface can make feedback more frequent, more structured, more task-specific, and more useful.

Hooker distinguished this from one common direction in agent work: browser agents. Browser-agent approaches try to impart knowledge to agents by observing how humans traverse the existing web. She found this less interesting because it mimics human interaction with today’s internet rather than creating a new collaborative paradigm between humans and AI. It may produce lots of data, but not necessarily the right data.

Her preferred direction is to create useful interfaces for people doing their everyday tasks in collaboration with AI. The future internet, she suggested, should probably look different: users should be able to “summon” the right interface for a given task at the right time. That would make the interface itself adaptive, rather than forcing all tasks through chat.

The stakes become higher in agentic workflows. Hooker argued that sequential processes compound error in unexpected ways. A model that is slightly wrong for a user’s language, database structure, tone, domain, or slice of the distribution may become much worse when used repeatedly in a workflow. That is one reason she expects renewed interest in customization, pools of models, and dynamic switching. Prompt engineering alone is not enough when users need stronger levers of control.
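A back-of-the-envelope calculation shows why small per-step errors matter in long workflows. The sketch below assumes each step succeeds independently with the same probability. That is a simplification introduced here, not a claim from the talk, and real workflows can compound error in less predictable ways.

```python
def workflow_success(per_step_success, n_steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** n_steps

for p in (0.99, 0.95, 0.90):
    rates = [round(workflow_success(p, n), 3) for n in (1, 10, 50)]
    print(f"per-step {p}: 1, 10, 50 steps -> {rates}")
```

Under this assumption, a model that is right 95% of the time per step completes a 10-step workflow cleanly only about 60% of the time, and a 50-step workflow less than 10% of the time.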

She also linked customization’s earlier failures to the scaling era. Fine-tuning was often slow, data preparation was painful, results were disappointing, and then the next larger general model erased many of the gains. Now three changes are altering the calculus: usage-based billing is making API costs more visible; auto-research and auto-R&D are making customization faster and less failure-prone; and agentic workflows make the shortcomings of one-size-fits-all models more costly.

The strawberry problem is not just a trick question

The question about models counting the number of “r”s in “strawberry” drew a longer answer because Sara Hooker saw it as a useful miniature of the broader allocation problem. Many models struggle because of tokenization: the relevant letters can be collapsed inside tokens rather than represented in the way humans perceive the word. Some frontier providers, she said, have solved it by placing a rule on top, which she found funny because it acknowledges the underlying shortcoming.

The deeper question is whether that kind of knowledge should be learned in pre-training at all. If a failure comes from tokenization, then solving it parametrically may require addressing it from the beginning. But how much compute should be spent in pre-training so a model can answer an esoteric example that humans find obvious?

For Hooker, this returns to the same allocation problem: what belongs in the weights, and what should be handled at test time? Some failures reveal limitations. But not every limitation should be solved by spending more pre-training compute. A rule, a tool, a character-level operation, a retrieval process, or a test-time reasoning strategy may be more efficient than forcing the model to internalize every such edge case.
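The character-level option is about as cheap as a tool can be. The sketch below shows the operation itself, next to a hypothetical subword split that illustrates why the letters are not directly visible to a model. The token boundaries and ids are made up for illustration; real tokenizers differ by model.

```python
def count_letter(word, letter):
    """Count occurrences of a single character, ignoring case."""
    return word.lower().count(letter.lower())

# Hypothetical subword split and made-up ids: the model operates on the ids,
# so the individual letters inside each token are not explicit inputs.
hypothetical_tokens = ["str", "aw", "berry"]
hypothetical_ids = [101, 102, 103]

print(count_letter("strawberry", "r"))  # 3
print(count_letter("".join(hypothetical_tokens), "r"))  # 3
```

Routing the question to an operation like this costs almost nothing at test time, which is the comparison Hooker draws against spending pre-training compute on the same edge case.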

She also used the example to push back on the assumption that the goal should be to replicate human intelligence exactly. Models are already better than most humans at some forms of math and sequential processing. Humans remain much more efficient at processing new information. Humans also perform cheap global updates through social mechanisms: Hooker cited COVID lockdowns as an example of many people changing behavior globally because humans care about social respect and coordination.

The better objective, in her view, may not be to make models human-like in every respect. It may be to make them useful complements: systems that solve problems humans are poor at, while relying on humans or interfaces for the forms of judgment, update, and collaboration where humans remain efficient.

The theoretical opening is optimization stability

Sara Hooker closed the technical discussion by identifying an optimization problem that she sees as especially important. The field often needs large models to train successfully, but can remove many weights afterward. That suggests large parameterizations are useful not only because all parameters are needed for final performance, but because optimization is unstable and overparameterization helps convergence.

If models could start small and train stably, the compute-return paradigm would change. That is a theoretical and practical optimization problem. It sits underneath the evidence on pruning and redundancy: the waste may not be that the final model needs every parameter, but that the training process currently depends on excess capacity to find a good solution.

Hooker cautioned that optimizer research is difficult and has defeated many attempts. But she also described it as an area of massive return if done well. In the larger arc of her argument, this is the deepest version of the same claim: the next gains come less from spending more compute in the old place and more from understanding why the old process is inefficient.
