AI’s Next Training Paradigm Depends on Learning From Deployment
Dwarkesh Patel argues that frontier AI labs are betting too much on reinforcement learning from verifiable rewards: training models across vast numbers of checkable, replayable tasks in the hope that this produces general agents. In his account, verifiability is not enough; the domains that matter most are often too slow, messy, and non-repeatable to be “grindable” training environments. The next paradigm, Patel suggests, will depend on whether models can turn scarce deployment experience into durable updates to their weights, through continual learning methods such as on-policy self-distillation, “dreaming,” or something not yet invented.

The labs are betting that verifiable practice can become general intelligence
The current frontier-lab research bet, as Dwarkesh Patel describes it, is that AGI can be built by training models across “millions of verifiable tasks” in “thousands of diverse RL environments.” The intended product is not just a model that scores well on those tasks. It is a problem-solving agent: a system that can keep making progress on open-ended work for weeks, despite errors, ambiguity, failed attempts, and changing information.
The optimistic case treats today’s deficits as scale problems. Patel focuses on two: extreme data inefficiency during training, and the lack of continual learning, meaning weights do not update from what the model learns in deployment. Supporters of the present paradigm, in his account, would say these weaknesses can be “steamrolled” by scaling training, much as many earlier natural-language-processing problems collapsed once enough compute was thrown into large language models.
The tolerance for sample inefficiency depends on where it occurs. Patel has argued elsewhere that models are roughly one-millionth as sample-efficient as humans. But defenders of the current paradigm can reply that this is true during training, and training is a one-time cost amortized across billions of user sessions. What matters is whether the deployed model is smart, general, and sample-efficient inside a session. Patel grants that this has been improving: RL-trained agents are solving more ambitious coding tasks across longer time spans, something he says is obvious to anyone using these models for coding.
The parallel argument about continual learning is that it may not be needed if context becomes long and effective enough. Employees are often not net productive until they have spent months learning an organization; perhaps a model can simply hold that equivalent experience in context. Patel points to architectural innovation that increases how much information a transformer can store, and he grants the possibility that a few more years of progress could produce context windows that feel “infinitely large.”
That is the bet: verifiable training at sufficient scale produces an agent that can learn what it needs inside context, without requiring deployment experience to be compressed back into the weights. Patel’s question is where that bet may break.
A domain must be grindable, not merely verifiable
A domain can be verifiable without being easy to train on. Computer use is Patel’s central example. It is plainly possible to verify many computer-use outcomes: whether an Etsy item was ordered, whether a venue was booked, whether taxes were submitted. Yet progress in computer use has been slower than in coding and math.
One reason, he says, is likely the scarcity of high-quality multimodal pre-training data. But the more revealing reason is that verifiability is not enough. The domain must also be “grindable”: it must support large numbers of parallel rollouts in deterministic, replayable simulators, preferably from the same starting state.
Grindability is just as important as verifiability.
Coding has this property. A lab can define a container that holds a software repository with a missing feature, then run a thousand agents against identical copies of that container. Each agent gets the same starting point; the environment can be reset; failures and successes can be compared.
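The property Patel describes can be made concrete with a toy sketch. The Python below is a minimal illustration, not any lab's training stack: `RepoEnv`, `CANDIDATE_PATCHES`, `rollout`, and `grind` are hypothetical names, the "container" is just an in-memory dict, and the "agents" simply pick a patch at random. What it shows is the shape of a grindable environment: one frozen snapshot, many independent rollouts from identical copies, and a deterministic check that scores each one.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class RepoEnv:
    """Hypothetical stand-in for a container holding a repo with a missing feature."""
    files: dict = field(
        default_factory=lambda: {"add.py": "def add(a, b):\n    pass\n"}
    )

    def apply(self, patch: str) -> None:
        self.files["add.py"] = patch

    def verify(self) -> bool:
        # Verifiable reward: run the patched code against a fixed test.
        namespace = {}
        try:
            exec(self.files["add.py"], namespace)
            return namespace["add"](2, 3) == 5
        except Exception:
            return False


# Toy "policy": each agent picks one of these patches at random.
CANDIDATE_PATCHES = [
    "def add(a, b):\n    return a + b\n",   # correct
    "def add(a, b):\n    return a - b\n",   # wrong result
    "def add(a, b):\n    return a * b\n",   # wrong result
    "def add(a, b)\n    return a + b\n",    # syntax error
]


def rollout(snapshot: RepoEnv, seed: int):
    env = copy.deepcopy(snapshot)      # identical starting state for every agent
    rng = random.Random(seed)          # seeded, so the rollout can be replayed
    patch = rng.choice(CANDIDATE_PATCHES)
    env.apply(patch)
    return seed, patch, env.verify()


def grind(snapshot: RepoEnv, n_agents: int):
    return [rollout(snapshot, seed) for seed in range(n_agents)]


snapshot = RepoEnv()
results = grind(snapshot, 1000)
pass_rate = sum(ok for _, _, ok in results) / len(results)
print(f"pass rate over {len(results)} rollouts: {pass_rate:.2f}")
```

Because every rollout starts from a deep copy of the same snapshot and uses its own seed, any single attempt can be reproduced exactly and compared against the others. A live checkout flow offers none of that.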
Computer use does not work this way, “at least not trivially.” Patel’s example is a checkout flow on Amazon. A lab cannot simply send a thousand agents through the same flow to improve their ability to use websites, because the site operator would detect and block the bots. One workaround is to create cloned versions of common applications such as Slack and Gmail, but Patel describes that as labor-intensive and unscalable today. He expects this bottleneck could ease once AIs are strong enough at coding to build high-fidelity application clones themselves, and notes that making AIs rebuild whole applications would itself be a useful RL objective for coding.
The deeper point is that computer use’s current sluggishness reveals a structural constraint: unless a domain can be turned into a replayable training target, models will struggle to improve rapidly. This is not because the desired outcome cannot be checked. It is because current models are extremely sample-inefficient during training, and grindable environments are the way labs compensate.
Many of the skills people would want from powerful AI cannot be put into deterministic data-center simulators. Building a business from scratch, winning court cases, having a profitable trading day, or helping a candidate win an election all require interaction with the real world. Their verification loops may take months or years. The relevant outcome cannot be replayed thousands of times with slight action perturbations to isolate what worked.
Reset-free, non-stationary environments are already a known open problem in reinforcement learning. Patel’s emphasis is not novelty but importance: most real domains contain sparse, idiosyncratic data, and proficiency requires sample efficiency. If AIs are to develop human-like skills, or skills humans do not have, they must learn from scarce real-world interactions where feedback is unstructured, ambiguous, and sometimes unverifiable.
Patel asks, in substance: what is the RL environment for making an AI as good at politics as Lyndon Johnson, or as good at building a space-launch business as Elon Musk? The question exposes the gap between verifiable toy worlds and the real domains where transformative competence would matter.
The crucial empirical question is how far RLVR generalizes
The labs’ bet, as Patel frames it, is that RLVR—reinforcement learning from verifiable rewards—will generalize far beyond the training environments. If a model is trained on enough containerized, reproducible tasks, perhaps it develops a general ability to orient to new problems, make plans, execute them, learn rapidly from new information, and acquire new skills within a single session.
He illustrates the strongest version of the claim with intentionally expansive examples. If such an endlessly RLVR-trained AI were dropped into Texas politics in 1948, it might advise better than Lyndon Johnson on winning a Senate seat. If given $100 million in 2002, it might build SpaceX.
He does not present this as settled. He calls it an empirical question. Would moving from billions of dollars of RL environments to a trillion produce “a fully human-like general intelligence within the context window”? Patel thinks a comment from Dario Amodei suggests at least one reason for doubt.
“There's two things. There's the context length you train at, and there's a context length that you serve at. If you train at a small context length and then try to serve at a long context length, like, maybe you get these degradations.”
Patel is careful that he may be reading too much into the remark. But he takes it to imply that short-horizon RL training may not automatically generalize to long-horizon RL performance. If that generalization fails even within the dimension of time horizon, he asks, why assume that training on a range of white-collar tasks will generalize all the way to building a business from scratch at Sam Walton’s level?
This becomes sharper when paired with the continual-learning problem. Even if an AI could, after enough in-context experience, become like Henry Ford or Albert Einstein within a session, those gains would be ephemeral if they could not be written back into the weights. The model might learn something during deployment, but the base system would not permanently improve from it.
There is also a compute-efficiency problem. Around 30% to 50% of a lab’s compute, Patel says, goes to inference, and that inference compute currently does not productively improve the model. He finds this waste especially striking because deployment is where the most valuable information appears: what is happening inside organizations, how people are actually using the model, and what real-world mistakes it tends to make.
His analogy is a “genius grad student” who is never allowed to take an internship and is instead given more classroom case studies. Models are already deployed broadly across the economy, involved in many tasks, and exposed to tacit organization-specific knowledge. Yet they cannot use that experience to update themselves in the durable way people mean when they talk about learning on the job.
Context is not the same as learning
Dwarkesh Patel rejects the idea that ever-growing context is a sufficient substitute for continual learning. AIs cannot keep accumulating a larger and larger KV cache from every user and every task. He argues that this is not scalable and not analogous to human learning.
Human brains do not maintain a clean separation between parameters and activations, and people’s skulls do not expand as they learn. Learning involves compression: observations are consolidated into intuitions, abstractions, and big-picture knowledge. Patel contrasts this with rare savant-like memory for arbitrary tables or nonsense syllables—the kind of high-fidelity recall that resembles information sitting in a model’s context. In his telling, that kind of volume can impair abstraction and metaphor. Human continual learning is less like having every observation immediately available and more like “chiseling the right intuitions and big picture knowledge back into the weights.”
But moving information into weights creates a hard tradeoff. In-context learning can be sample-efficient because attention builds “fast weights” on the fly, but those fast weights scale poorly in memory. Gradient updates, by contrast, are highly sample-inefficient. Successfully shipped online-learning systems have therefore learned the same objective across millions of users, rather than learning distinct things for individual deployments.
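The memory side of that tradeoff is simple arithmetic. The following is a back-of-the-envelope sketch, assuming a standard transformer that caches one key and one value vector per token, per layer, per KV head, with no compaction or sparse attention. The model dimensions are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of a plain KV cache: keys and values for every token at every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model: 80 layers, 8 KV heads of dimension 128, 16-bit values.
MODEL = dict(layers=80, kv_heads=8, head_dim=128)

for tokens in (1_000, 1_000_000, 100_000_000):
    gb = kv_cache_bytes(tokens, **MODEL) / 1e9
    print(f"{tokens:>11,} tokens -> {gb:,.2f} GB of cache")
```

Under these assumptions each token costs 327,680 bytes, so a million tokens of retained experience is roughly 328 GB for a single session, and the cost grows linearly from there. The weights, by contrast, stay the same size no matter how much the model has seen.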
His example is Cursor Tab. The model can online-learn from more than 400 million requests per day because it is repeatedly predicting the same kind of target: which edits users actually accept. That is useful, but it is not the kind of deployment-specific learning people need from a generally intelligent assistant.
The missing capability is narrower and more valuable: learning the particular structure of a given job, company, or problem. Patel’s examples are organizational tacit knowledge—how the parts of a company fit together, how to cooperate with existing infrastructure and people, what common failure modes occur, and how to make progress on a larger project. These cannot simply be folded into a shared training run if each deployment is different.
This is why sample efficiency and continual learning are deeply connected. There is relatively little job-specific data available to the model. Learning from it requires sample efficiency. Current models can achieve that inside context, but not in a memory-scalable way. Durable learning requires updates to weights, but present gradient-based updating is too data-hungry.
The bottleneck may be the update rule, not the architecture
Dwarkesh Patel does not think architecture is obviously the fundamental blocker. He points to existing and ongoing work on sparse attention, KV-cache compaction, and other architectural optimizations that try to bridge the gap between context and durable memory. New papers appear constantly with proposed improvements.
If architecture is not the bottleneck, the loss function may be. The problem becomes: how should a model update its weights based on information learned in one particular session?
One candidate is on-policy self-distillation, or OPSD. Patel refers to a Sasha Rush blackboard lecture he recorded for a fuller explanation, but summarizes the idea this way: encourage the base model to make the same predictions on a real-world problem that the context-rich “veteran” version of the model would make after a long session. In other words, distill what the model learned during a session back into the weights.
He gives OPSD two advantages over RLVR for this application. First, it does not require an outer-loop verifiable reward. It only requires that the model, within its context window, can learn the right things during the session. If it can, the base model can be trained to match the experienced teacher model.
Second, OPSD gives a denser supervision signal than naive reinforcement learning. Instead of sending one reward signal backward through an entire trajectory, it can train on per-token probability discrepancies between teacher and student. That supplies a more local signal about how the experienced model’s behavior differs from the base model’s behavior.
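To make the density contrast concrete, here is a minimal sketch of a per-token distillation signal. It assumes the discrepancy is measured as KL(student || teacher) at each position of a sequence the student sampled, which is one common choice in on-policy distillation; the source does not specify the divergence. The logits are made-up numbers rather than model outputs, and `softmax` and `per_token_kl` are illustrative helper names.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position of a student-sampled sequence."""
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p, q = softmax(s_row), softmax(t_row)
        losses.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
    return losses


# Three positions, vocabulary of three tokens (illustrative values).
# In OPSD the teacher would be the same model conditioned on the long session;
# here it simply disagrees with the student at position 1.
student = [[2.0, 0.5, 0.1], [1.0, 1.0, 1.0], [0.2, 3.0, 0.4]]
teacher = [[2.0, 0.5, 0.1], [0.1, 3.0, 0.2], [0.2, 3.0, 0.4]]

signal = per_token_kl(student, teacher)
print([round(x, 4) for x in signal])
```

A single outcome reward would give one number for the whole trajectory. Here each position gets its own value, and positions where the experienced model agrees with the base model contribute nothing to the update.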
Patel also distinguishes OPSD from supervised fine-tuning. The naive SFT approach would train the base model to predict all tokens observed during the session. He argues that this is the wrong target. Getting better at a job is not memorizing the transcript of each day. It is consolidating the few insights and pieces of knowledge that actually improve performance.
RL has a useful property here: it concentrates updates on what matters for achieving the outcome. Patel says very few parameters are changed during an RL training step, which he sees as important for continual learning because the system should not overwrite everything the base model already knows.
During this argument, Patel refers to a chart titled “RL is even more information inefficient than you thought,” comparing supervised learning and reinforcement learning by “bits learned” as pass rate changes. He connects it to a prior argument that RL learns much less information per sample than supervised learning. In the continual-learning setting, that apparent inefficiency may be useful: the update should change the model only as much as necessary to achieve the outcome.
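One way to see why a pass/fail reward carries little information is the binary entropy bound: an outcome that succeeds with probability p can reveal at most H(p) bits per episode. The sketch below computes that bound next to the surprisal of a supervised label to which the model assigned probability p. Treating these two quantities as the comparison behind the chart is an assumption; the chart's exact definition of "bits learned" is not given in the text here.

```python
import math


def binary_entropy_bits(p):
    """Upper bound on information in one pass/fail outcome with pass rate p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def surprisal_bits(p):
    """Information in being shown a correct label the model gave probability p."""
    return -math.log2(p)


for p in (0.01, 0.1, 0.5, 0.9):
    print(
        f"p={p:<4}  reward bound: {binary_entropy_bits(p):.3f} bits"
        f"  label surprisal: {surprisal_bits(p):.3f} bits"
    )
```

At a 1% pass rate the bound is about 0.08 bits per episode, and it peaks at one bit when the pass rate is 50%. Whatever the exact accounting, a scalar outcome can only nudge the model, which is the property Patel treats as a feature for continual learning.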
| Method | Patel’s claim for continual learning |
|---|---|
| Naive supervised fine-tuning | A poor target if it trains the model to predict the session transcript rather than consolidate the few insights that improve the job. |
| Reinforcement learning | Information-inefficient per sample, but useful because it concentrates the update on what is needed to improve the outcome. |
| On-policy self-distillation | A possible way to distill what a context-rich model learned in a session back into the base model, without requiring an outer-loop verifiable reward. |
The claim for OPSD is still presented as a candidate, not a finished recipe. Patel says it preserves a useful RL-like property: rather than supervised learning’s tendency to move toward the teacher distribution wholesale, it would extract the knowledge needed to get the same results as the teacher on real-world problems. The broader point is that continual learning needs a selective compression mechanism, not a transcript memorizer.
Dreaming would turn scarce experience into simulated practice
Patel then introduces a more speculative route around the sample-efficiency problem: “dreaming.” The idea is that a model could build a simulation of reality, rehearse skills or alternative strategies inside it, and reinforce what works. If successful, it would let AIs gain orders of magnitude more simulated samples in the same wall-clock time.
He compares this to EfficientZero, a model built a few years after DeepMind’s AlphaZero. The displayed paper, “Mastering Atari Games with Limited Data,” describes EfficientZero as a sample-efficient model-based visual RL algorithm based on MuZero. Its abstract says it achieved 194.3% mean human performance and 109.0% median performance on the Atari 100k benchmark with only two hours of real-time game experience, and that its performance came close to DQN’s at 200 million frames while consuming 500 times less data.
| EfficientZero result | Value |
|---|---|
| Mean human performance on Atari 100k | 194.3% |
| Median human performance on Atari 100k | 109.0% |
| Real-time game experience used | Two hours |
| Data use compared with DQN at 200 million frames | 500 times less data |
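The two-hour and 500x figures follow from the benchmark's budget. The following is a quick arithmetic check, assuming the usual Atari conventions of 4 emulator frames per agent step and 60 frames per second; neither is stated in the text above.

```python
AGENT_STEPS = 100_000        # the "100k" in Atari 100k
FRAME_SKIP = 4               # assumed: emulator frames per agent step
FPS = 60                     # assumed: emulator frame rate
DQN_FRAMES = 200_000_000     # DQN budget cited in the abstract

frames = AGENT_STEPS * FRAME_SKIP    # frames of real game experience
hours = frames / FPS / 3600          # wall-clock play time
ratio = DQN_FRAMES / frames          # how much more data DQN consumed

print(frames, round(hours, 2), ratio)  # 400000 1.85 500.0
```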
Patel’s interpretation is careful. If EfficientZero and a novice human each got two hours with a new Atari game simulator, the model would “probably” beat the human, he says. But that does not straightforwardly prove greater sample efficiency, because for each real game step, EfficientZero is playing many simulated games internally.
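The accounting issue is easiest to see in a much simpler model-based method. The sketch below is tabular Dyna-Q on a toy six-state corridor, not EfficientZero, which uses learned latent dynamics and tree search. The environment, function names, and hyperparameters are all illustrative assumptions. After each real step the agent records the transition in a model and then replays `planning_steps` remembered transitions, so the count of simulated updates is a multiple of the real ones.

```python
import random

N_STATES = 6          # corridor: start at state 0, goal at state 5
ACTIONS = (0, 1)      # 0 = left, 1 = right


def step(state, action):
    """Real environment: deterministic move, reward 1.0 on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def dyna_q(real_steps, planning_steps, seed=0, alpha=0.5, gamma=0.9, eps=0.2):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}        # remembered transitions: (s, a) -> (reward, next_state, done)
    simulated = 0
    state = 0

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(real_steps):
        # Epsilon-greedy action in the real environment, ties broken at random.
        if rng.random() < eps:
            action = rng.choice(ACTIONS)
        else:
            best = max(q[(state, b)] for b in ACTIONS)
            action = rng.choice([b for b in ACTIONS if q[(state, b)] == best])
        nxt, reward, done = step(state, action)
        update(state, action, reward, nxt, done)
        model[(state, action)] = (reward, nxt, done)

        # "Dreaming": replay remembered transitions without touching the environment.
        for _ in range(planning_steps):
            s, a = rng.choice(list(model))
            r, s2, d = model[(s, a)]
            update(s, a, r, s2, d)
            simulated += 1

        state = 0 if done else nxt
    return q, simulated


q, simulated = dyna_q(real_steps=50, planning_steps=20)
print(f"real steps: 50, simulated updates: {simulated}")  # simulated updates: 1000
```

Counted by real interactions, this agent has seen 50 samples; counted by value updates, it has done over a thousand. That is the sense in which a two-hour comparison with a human does not cleanly isolate sample efficiency.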
Future LLMs might do something analogous. They might consume less real-world data while practicing extensively against environments they construct themselves. The obvious difficulty is that simulating the whole world is much harder than emulating Go or Atari. That is why Patel labels the idea speculative.
If it worked, he says, dreaming would become a fourth axis of scaling alongside pre-training, RL, and inference-time compute. It could also be described as test-time training. The model would spend compute writing RL environments, training against them, and rehearsing the skills that will be used in production for a specific user.
Patel contrasts this with today’s context-management features in coding tools. A command such as “/compact” in Codex, Cursor, or Claude uses a small amount of compute to summarize context, producing what he calls a simulacrum of continual learning. A “/dream” command, by contrast, would burn large amounts of compute to build and train against a “video game version” of what the model is observing in the real world.
The 2027 scenario depends on deployment becoming the training source
Dwarkesh Patel’s concrete scenario for 2027 or 2028 begins with RLVR doing something important but incomplete. It produces an agent that can orient itself when thrown at unfamiliar problems, try strategies, iterate through roadblocks, and get enough real-world experience to be worth learning from. RLVR supplies the initial competence required for deployment beyond the training distribution.
In this scenario, effective context lengths have expanded enough for an AI to co-work with a user for a full week of wall-clock time. At the end of the week, the user gives a thumbs up or thumbs down—a work review. If the result is positive, the base model distills what the deployed AI learned during that session. The technique might be OPSD, dreaming, another unknown method, or a combination.
The important dynamic is compounding adjacency. Once the model can learn from deployed work, it improves in domains adjacent to those it was explicitly trained on with RLVR. In the next round, it can improve in domains adjacent to what it previously learned online. The model’s skill set can therefore expand beyond the original set of verifiable environments.
Patel analogizes the sequence to the transition from pre-training to RLVR. Pre-training created a base intelligence that could become a competent agent with enough RLVR. RLVR then creates an agent competent enough to be broadly deployed. Broad deployment supplies the experience required for on-the-job learning—once the missing continual-learning recipe arrives.
The result would be a different mode of AI improvement. Today, models primarily get better from training before public release. In Patel’s scenario, the main improvement comes after release, from accumulated experience across the economy. Each interaction would make the AI smarter, not only because it remembers or learns from a user’s previous sessions, but because it is learning from interactions with all other users as well.
Patel ends by calling that prospect “very scary and exciting and different from the way that AI improves right now.” The core claim is not that RLVR is useless, nor that context expansion is irrelevant. It is that verifiable, grindable training may be only the bootstrap. The next paradigm, if it arrives, would be defined by models that can turn scarce, messy deployment experience into durable improvements in their own weights.



