Inference Hardware and Continual Learning Are Replacing Data as AI Bottlenecks

Jeff Dean · Two Minute Papers · Monday, June 1, 2026 · 13 min read

Google chief scientist Jeff Dean argues in a Two Minute Papers interview that AI progress is constrained less by running out of public text than by systems work: extracting more from existing data, building inference-specialized hardware, distilling large models into smaller ones, and giving models access to much larger context. Dean frames the next phase less as better chatbots than as action-driven, agentic systems that can test, simulate, and learn under controlled safety gates, while acknowledging unresolved problems in continual learning, healthcare deployment, and infrastructure reliability at Google scale.

Dean does not see data exhaustion as the limiting wall

Jeff Dean, introduced as Google’s chief scientist and as a co-creator of MapReduce and TensorFlow, rejects the simple version of the claim that large language models are about to run out of useful training data. He concedes the premise in a narrow sense: “quite a lot” of the world’s public text data has already been used. But he argues that this is not the same as exhausting the sources of capability improvement.

I’m not too worried about that as like an impediment to making progress. It seems like there’s lots and lots of things we can do.

Jeff Dean

His answer has four parts. First, models are still not making full use of video data. Second, synthetic data can be generated and used for training. Third, existing data can be revisited in more productive ways, including making additional passes over it. Fourth, better algorithms can extract more information from each piece of data already available.

That last point is central to Dean’s view. When Károly Zsolnai-Fehér presses him on the concern that future models may increasingly train on AI-generated material, with everyone learning from the same recycled substrate, Dean answers through a concrete compute-and-filtering example.

His example is coding. A system doing reinforcement learning on a high-level coding task might generate 100 or 1,000 candidate solutions. Many can be discarded quickly: code that does not compile, code that fails unit tests, code that performs poorly. The remaining candidates can be judged against the desired reward signal and added back into the training process. In that setting, Dean’s point is that additional compute can produce more attempts, and filters can select the attempts that are measurably better.
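The generate-and-filter loop Dean describes can be sketched in a few lines. This is a toy illustration, not Google's pipeline: the candidate strings, the `solve` entry point, the tests, and the length-based reward are all hypothetical stand-ins, and a real system would run candidates in a sandbox rather than with `exec`.

```python
import ast


def compiles(source: str) -> bool:
    """Cheap first filter: reject candidates that are not valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def passes_tests(source: str, tests) -> bool:
    """Second filter: run the candidate and check it against unit tests."""
    namespace = {}
    try:
        exec(source, namespace)  # toy example only; real systems sandbox this
        return all(namespace["solve"](*args) == expected for args, expected in tests)
    except Exception:
        return False


def filter_candidates(candidates, tests, reward):
    """Keep candidates that compile and pass tests, highest reward first."""
    survivors = [c for c in candidates if compiles(c) and passes_tests(c, tests)]
    return sorted(survivors, key=reward, reverse=True)


# Hypothetical candidates for "add two numbers".
candidates = [
    "def solve(a, b): return a + b",             # correct
    "def solve(a, b) return a + b",              # does not compile
    "def solve(a, b): return a - b",             # fails the unit tests
    "def solve(a, b):\n    return sum([a, b])",  # correct, but longer
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

# Hypothetical reward: shorter source, standing in for "performs better".
kept = filter_candidates(candidates, tests, reward=lambda src: -len(src))
```

Here `kept` holds the two correct candidates, shortest first. More compute means a longer `candidates` list; the filters decide which attempts are worth feeding back into training.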

Dean also describes data augmentation in a stronger sense than the older computer-vision habit of shifting or cropping images. If a working Python codebase already specifies the behavior of a system, then translating it into Go is not just a loose paraphrase. The existing program and its tests act as a detailed behavioral specification. In internal Google tools, he says, people have used Python code plus tests to ask for different versions of the same tool and have found “much faster solutions.”

Zsolnai-Fehér frames this as getting far more out of the same amount of data. Dean agrees. That is why, in his view, data scarcity is not the immediate obstacle: the frontier is not only how much raw text remains, but how much structure, validation, action, and augmentation can be built around the data already available.

Inference is becoming the hardware problem

The shift in machine-learning workloads changes the hardware design problem. Károly Zsolnai-Fehér cites Bill Dally’s claim that something like 90% of what happens in modern data centers is now inference rather than training, and asks how that changes Google’s hardware thinking. Jeff Dean first narrows the claim: data centers do many things besides training and inference, including search, Gmail, and other applications. But within machine-learning workloads, he agrees with the direction of travel. Training is becoming a smaller share of the compute Google wants to run, because inference demand is growing so quickly.

Dean’s definition of inference includes offline inference, reinforcement-learning rollouts during training, online inference for user requests, and agent-based behavior. These workloads differ from training in ways that matter for chips and systems. Model weights may be fixed during inference. Precision requirements can be lower. Request volume can be very large. Latency, energy, and cost become defining constraints.

That pushes hardware toward specialization. Dean says it now makes “a ton more sense” to design chips specifically for inference workloads, because specialization can produce much better energy efficiency. He points to Google’s TPU v5e and v5p chips as examples of already-announced work in this direction, and says he expects still more specialization.

One striking part of the discussion is how low precision has become viable. Zsolnai-Fehér says FP4 still sounds implausibly small to him, especially given the quality of the resulting intelligence. Dean jokes that a computer scientist from 15 years earlier would respond, “FP what?” because “that’s not enough numbers.” Yet he treats the fact that it works as a meaningful sign.

The discussion then moves below FP4. Dean says people are experimenting with formats that use even lower precision, paired with scaling factors shared across groups of weights. He mentions possibilities such as 2-bit integers or 1-bit integers, while noting he has not heard anyone say “2-bit float,” because it is unclear what that would mean. The design question becomes how often the higher-precision scaling factor is needed: every 64, 128, or 256 weights, for example. The hardware frontier, in his framing, is not just faster chips; it is co-designing numerical formats, model architectures, and serving systems around the actual shape of inference demand.
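To make the shared-scale idea concrete, here is a minimal sketch of block-wise integer quantization. It assumes symmetric signed integers and one 16-bit scale per block; the function names, the 16-bit scale width, and the rounding scheme are illustrative assumptions, not a description of any specific Google or TPU format.

```python
def quantize_blocks(weights, block_size=64, bits=4):
    """Quantize to signed `bits`-bit integers with one float scale per block."""
    if bits < 2:
        raise ValueError("this sketch handles 2 or more bits per weight")
    qmax = 2 ** (bits - 1) - 1   # 1 for 2-bit, 7 for 4-bit
    qmin = -(2 ** (bits - 1))
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes = [min(qmax, max(qmin, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blocks(blocks):
    """Recover approximate weights: shared scale times the integer code."""
    return [scale * q for scale, codes in blocks for q in codes]


def bits_per_weight(block_size, bits, scale_bits=16):
    """Average storage per weight once the shared scale is amortized."""
    return bits + scale_bits / block_size


# Synthetic weights in [-1, 1], purely for illustration.
weights = [0.02 * ((i * 37) % 101 - 50) for i in range(256)]
blocks = quantize_blocks(weights, block_size=64, bits=4)
restored = dequantize_blocks(blocks)
```

Under these assumptions, 2-bit codes cost 2.25 bits per weight with a scale every 64 weights, 2.125 with a scale every 128, and 2.0625 with a scale every 256. That trade-off between scale overhead and how well one scale fits a larger group is the design question Dean points to.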

Pre-training and post-training are an unsatisfying split

Jeff Dean calls the current separation between pre-training and post-training “intellectually dissatisfying.” His objection is conceptual. Pre-training mostly exposes a model to passive streams of tokens. Post-training then tries to shape behavior afterward. Dean thinks the more natural system would interleave observation with action: periods of absorbing data, periods of trying to use what has been learned, and then learning again from the consequences.

Károly Zsolnai-Fehér connects this to experience replay in deep reinforcement learning. Dean extends the analogy to agents acting in environments: simulated environments, robotic settings, coding tasks, or other domains where the system can take an action and observe what happens. A model can learn more, he argues, from writing code and seeing whether it works than from only watching tokens pass by.

Zsolnai-Fehér then raises the deployment concern. Today, a model can be trained, post-trained, red-teamed, safety-tested, packaged, and released. If the model keeps learning continuously, he asks, how do you know an intermediate state is safe?

Dean does not dismiss that concern. He suggests a more discrete version of continual learning: many interleaved steps, perhaps 100 or 1,000, which begin to resemble a continuous process mathematically, but still allow safety gates before user-facing releases. A model might continue learning behind the scenes, then undergo safety protocols and red-teaming before a new version is exposed to users.
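One way to picture the discrete version is a loop in which learning proceeds in rounds but only snapshots that pass a gate are ever released. The sketch below is purely schematic: the "model" is an integer counter, and `learn_step` and `passes_safety_checks` are hypothetical placeholders for training and for the red-teaming and safety protocols Dean mentions.

```python
def train_with_release_gates(model, rounds, learn_step, passes_safety_checks):
    """Interleave learning rounds; only gated snapshots become the released version."""
    released = model
    for _ in range(rounds):
        model = learn_step(model)        # learning continues behind the scenes
        if passes_safety_checks(model):  # stand-in for red-teaming and safety tests
            released = model             # promote this snapshot to users
    return model, released


# Toy stand-ins: the gate approves only every 10th snapshot.
latest, released = train_with_release_gates(
    model=0,
    rounds=105,
    learn_step=lambda m: m + 1,
    passes_safety_checks=lambda m: m % 10 == 0,
)
```

After 105 rounds the internal model is at step 105 while users still see the snapshot from step 100, which is the separation between ongoing learning and gated release that the passage describes.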

Dean is not describing a production model that mutates freely in front of users. He is describing a research direction in which learning becomes more action-driven and more interleaved, while deployment still requires controlled evaluation. Asked later for a problem he has tried to solve but has not cracked, he returns to the same theme.

We still don’t have an answer to how do you do continual learning appropriately. Something I’ve thought about a little. I’ve dabbled a little bit with some techniques along with colleagues, but I think if we’re able to crack that, it’s gonna be amazing. But it’s not there yet.

Jeff Dean

A million-fold compute leap points toward autonomous engineering, not just better chat

Asked what another million-fold increase in compute over the next decade might make possible, Jeff Dean refuses to give a precise forecast, but he does not expect the rate of progress to slow. Looking back ten years, he notes that the field was only beginning to have useful language models. Sequence-to-sequence models had appeared. Transformers had not yet arrived. LSTMs were still popular. Compared with today’s systems, those models now look far less capable.

Projecting forward by a comparable magnitude, Dean expects major gains from new hardware, new research techniques, and the increased attention being paid to AI. The capability he emphasizes is complex autonomous workflows.

He points to multi-agent systems that can begin to work on complicated tasks. One demonstration he cites from Google I/O involved an AI system writing an operating system from a relatively simple prompt and producing something capable of running Doom. Dean adds an important caveat: there is a lot of operating-system-like material in training data, so the task is not completely out of distribution. Still, he calls the result “pretty amazing.”

Dean is particularly excited by scientific and engineering acceleration. He asks whether systems with the right simulation environments, the right agent scaffolding, and the ability to break work into smaller tasks could accomplish projects that currently take many people many years. His example is designing an airplane in five days. He gives similar examples in chip design, computer systems, and new hardware.

He is careful not to claim that such systems already exist. “We’re not there yet,” he says. But the aspiration is clear: the compute leap matters because it may allow AI systems to explore, test, simulate, coordinate, and iterate at scales that make previously long engineering cycles much shorter.

Small fast models depend on large expensive teachers

Open and smaller models, in Dean’s account, are benefiting substantially from distillation. Asked whether open models would continue improving as quickly if frontier models stopped being released, Jeff Dean says “a bunch” of current progress is driven by distillation data. Google’s Gemma models, he says, are “definitely distilled from higher quality larger scale models,” and he believes many other open-source models benefit from distillation as well.

The same logic applies inside Google’s model lineup. Dean says Flash models are capable for their size because Pro models can be used to teach them. This makes the closed-versus-open distinction less central than the large-versus-small one. If the goal is to build small, highly capable, inference-efficient models, Dean argues, the field still needs to build larger models that are more capable but less efficient. The knowledge from those larger models can then be transferred into smaller footprints.
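Dean does not say which distillation recipe Google uses, and his phrase "distillation data" may simply mean training on teacher-generated outputs. As one common illustration, the classic soft-label formulation trains the student to match the teacher's temperature-softened output distribution. The logits and temperature below are made up for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.0]        # hypothetical teacher logits for one token
student_near = [3.5, 1.2, 0.1]   # student that mostly agrees with the teacher
student_far = [0.0, 1.0, 4.0]    # student that disagrees

loss_near = distillation_loss(teacher, student_near)
loss_far = distillation_loss(teacher, student_far)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so `loss_near` is smaller than `loss_far`. Minimizing it over many examples is what moves the larger model's behavior into the smaller footprint.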

Károly Zsolnai-Fehér asks about moments when fast models have seemed surprisingly close to frontier models, with only small gaps on hard benchmarks, and suggests there may be “magic sauce” beyond distillation. Dean allows that there is always some unrevealed “magic sauce,” but he emphasizes that distillation is one of the key mechanisms making smaller, cheaper, faster models nearly as good as the frontier models.

The pattern he describes is a repeating cycle: build a more capable frontier model, distill its knowledge into a lighter-weight model, then push the frontier forward again. Flash-style models matter because they are the “workhorse” models people generally want to use: much cheaper and faster while remaining close enough in capability for many tasks.

The context-window dream requires a hierarchy, not just quadratic attention

Jeff Dean identifies several machine-learning trends he is most excited about: continual learning, agents and multi-agent systems, inference hardware, and the co-design of hardware and model architectures. But one of the most concrete technical problems he discusses is the context window.

The ideal, in Dean’s description, is that a model should appear to have all relevant information at its fingertips. At web scale, that might mean the whole internet. At a personal scale, for a user who has opted in, it might mean email, photos, watched videos, and other digital history. For an internal Google developer, it might mean the entire Google codebase, which Dean estimates at around 10 billion lines of code and perhaps 100 billion tokens.

The current transformer attention mechanism makes that impossible to do directly at such scales because of its quadratic cost. Dean’s proposed direction is a cascade: retrieve a broad candidate set, use cheaper mechanisms to narrow it, and reserve the expensive context window of a larger model for the items most likely to matter. He gives a concrete sketch: from 10 billion documents, identify perhaps 30,000 relevant ones; then use a lighter model to reduce those to 117 highly relevant items; then pass those into the more expensive model. The goal is to hide this orchestration from the user.
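The cascade can be sketched as two successively more expensive scoring passes. This is a toy under stated assumptions: the corpus is five strings rather than 10 billion documents, and keyword overlap and overlap density are hypothetical stand-ins for the cheap retrieval stage and the lighter model.

```python
import heapq


def cheap_score(query_terms, doc):
    """Stage 1: inexpensive keyword overlap, applied to the whole corpus."""
    return len(query_terms & set(doc.lower().split()))


def finer_score(query_terms, doc):
    """Stage 2: a costlier stand-in for a lighter model (overlap density)."""
    words = doc.lower().split()
    return sum(w in query_terms for w in words) / len(words) if words else 0.0


def cascade(query, corpus, broad_k, narrow_k):
    """Narrow the corpus in two passes; return what the big model would see."""
    terms = set(query.lower().split())
    broad = heapq.nlargest(broad_k, corpus, key=lambda d: cheap_score(terms, d))
    return heapq.nlargest(narrow_k, broad, key=lambda d: finer_score(terms, d))


corpus = [
    "tpu inference chips lower energy cost",
    "emacs configuration tips",
    "inference latency on tpu",
    "recipe for sourdough bread",
    "tpu inference",
]
context = cascade("tpu inference", corpus, broad_k=3, narrow_k=2)
```

Only the documents in `context` would be placed in the expensive model's context window. The user sees a single answer; the staged narrowing stays hidden, which is the orchestration goal Dean describes.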

Károly Zsolnai-Fehér presses the algorithmic question: are we still stuck with O(n²) attention, or can the field move to cheaper approaches such as O(n log n)? Dean says there is already a large body of work, perhaps 100 papers, on more efficient context algorithms. The standard quadratic approach works very well, so alternatives face a high bar. But he sees traction in methods that lower algorithmic costs or reduce large constant factors, and he believes multiple approaches can be combined to make attention over many more tokens much cheaper.
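A quick back-of-the-envelope calculation shows why the quadratic term dominates at the scale Dean mentions. The sketch counts only token-pair scores for a single attention pass and ignores constant factors, layers, and heads; the n log n figure corresponds to the host's hypothetical, not to a specific published algorithm.

```python
import math


def pairwise_scores(n_tokens):
    """Full self-attention scores every token against every token: n * n."""
    return n_tokens * n_tokens


def n_log_n(n_tokens):
    """Work for a hypothetical n log n method (base-2 log, no constants)."""
    return n_tokens * math.log2(n_tokens)


n = 100_000_000_000          # Dean's rough token estimate for the Google codebase
full = pairwise_scores(n)    # 10**22 pair scores
cheaper = n_log_n(n)         # roughly 3.65e12
ratio = full / cheaper       # billions of times less work
```

At 100 billion tokens the quadratic count is 10^22 pair scores, against a few trillion operations for an n log n method. That gap of roughly nine orders of magnitude is why Dean pairs algorithmic work with the retrieval cascade rather than relying on either alone.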

The ambition here is what Zsolnai-Fehér calls a “lifetime AI” — a system that can find needles in a very large personal or professional haystack. Dean endorses the idea, joking that all he wants is “a hundred billion tokens of attention.” The joke captures the engineering challenge: the desired result is broad, seamless access to relevant context, but the system required underneath is a layered retrieval, ranking, compression, and attention stack.

At Google scale, improbable failures become routine engineering inputs

Jeff Dean treats data-center reliability as a problem of building dependable systems from unreliable parts. At Google’s scale, he says, “lots of things that are very, very unexpected happen,” often because one failure coincides with another and produces a cascade. A software system may stop working. A bus bar may overheat. Too much power may reach a rack. In rare cases, something may catch fire.

He recalls an internal chat group called “data centers on fire,” which collected “exciting events” and sometimes videos. The point is not spectacle; it is that infrastructure design must assume failures that look exotic at small scale.

Google’s early infrastructure philosophy, Dean says, was shaped by this constraint. In the earliest days, Google bought consumer machines without ECC memory — not even parity — and consumer motherboards without redundant power supplies. That was viable only because systems were designed to handle faults at a higher software level.

Károly Zsolnai-Fehér asks about one of his favorite failure-mode images: a distant supernova, a cosmic ray, a memory cell, and a bit flipping from zero to one. Dean answers that bit flips do happen, and grounds the answer in observed memory-error monitoring rather than in the host’s supernova framing.

Alpha particles definitely can flip, you know, DRAM state. We’ve actually observed this because we have monitoring data of how many ECC errors and like single bit errors that are corrected and two bit errors that are not corrected are happening in all of our machines.

Jeff Dean

Dean describes seeing clusters in particular orientations on Earth experience much higher error rates for a brief period, such as ten minutes, while clusters elsewhere do not show the same pattern.

For a single laptop, Dean says, the risk is generally not too bad. Zsolnai-Fehér raises MacBook Pros as an example and says that, as far as he knows, they do not have ECC memory. Dean answers cautiously: “I think they have parity,” so at least some single-bit errors are typically detected, though not necessarily corrected. ECC, he says, usually gives single-bit correction and dual-bit detection. But at tens of thousands of machines, the problem must be engineered around. When Google used machines without parity, it built software-based checksumming for large amounts of data. In a web-crawling and indexing pipeline, if a particular record was detected as corrupted, it was often acceptable to ignore that record.
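The difference between detecting and correcting errors, and the "skip the bad record" approach, can be sketched briefly. The parity function shows why a single parity bit catches one flipped bit but misses two. The checksum part uses CRC32 from Python's standard library as a stand-in; the source does not say which checksum Google used, and the record names are invented.

```python
import zlib


def parity_bit(byte: int) -> int:
    """Even parity: 1 if the byte has an odd number of set bits."""
    return bin(byte).count("1") % 2


def pack(record: bytes):
    """Store a CRC32 checksum alongside the record."""
    return record, zlib.crc32(record)


def valid_records(stored):
    """Yield records whose checksum still matches; silently skip the rest."""
    for record, checksum in stored:
        if zlib.crc32(record) == checksum:
            yield record


stored = [pack(b"page-1"), pack(b"page-2"), pack(b"page-3")]

# Simulate a single bit flip in the second record after it was written.
record, checksum = stored[1]
flipped = bytes([record[0] ^ 0b00000001]) + record[1:]
stored[1] = (flipped, checksum)

survivors = list(valid_records(stored))
```

Here `survivors` contains the first and third records only. For a crawl-and-index pipeline, dropping one corrupted page out of billions is an acceptable loss, which is why detection without correction was enough at the software level.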

Healthcare AI is slowed by deployment reality, not only technical capability

Asked for something he was wrong about and later changed his mind on, Jeff Dean points to healthcare AI. He still believes AI will influence healthcare dramatically and sees “tremendous world benefit” in doing it. But he now thinks the difficulty is not necessarily technical. The hard part is getting systems into regulated industries with privacy constraints, safety concerns, and high stakes. He expected progress to happen faster; instead, he says, it is taking longer than he hoped because it must be done carefully and safely.

That answer sits beside, but is not identical to, his answer on continual learning. In healthcare, the obstacle Dean emphasizes is the regulated, privacy-constrained environment into which AI systems must be introduced. In continual learning, the issue is more specifically how to evaluate and release a system that keeps changing: how to apply safety protocols, red-teaming, and final testing before the newest version reaches users.

The lighter exchanges at the end mostly serve as context for Dean’s public persona. The source shows a GitHub page collecting “Jeff Dean Facts,” described on-screen as programmer-humor equivalents of Chuck Norris jokes. Visible examples include jokes that Dean’s PIN is the last four digits of pi, that he proved P=NP when he solved all NP problems in polynomial time on a whiteboard, and that he was promoted to level 11 in a system whose maximum level is 10. Asked whether he enjoys these jokes, Dean says he does, calling them an April Fool’s joke by colleagues in 2009 that went awry — “very both flattering and kind of embarrassing.”

The final developer-culture question is less weighty but unambiguous. Asked Vim or Emacs, Dean chooses Emacs. Károly Zsolnai-Fehér, a Vim user, objects in good humor. Dean concedes that one can spend a lot of time customizing Emacs and learning tricks. The exchange is minor, but it fits the broader tone: even where the subject is massive compute, large models, and data-center-scale failure, Dean tends to return to systems that must be made usable, maintainable, and safe under real constraints.

