Nested Learning Lets AI Models Adapt Without Forgetting Core Knowledge

Ali Behrouz · The Cognitive Revolution · Wednesday, June 3, 2026 · 22 min read

Cornell graduate student and Google researcher Ali Behrouz argues that continual learning requires AI systems to update on multiple time scales rather than treating training and inference as separate modes. In a Cognitive Revolution interview, Behrouz describes his Nested Learning work as a framework for models whose fast components adapt to current context while slower components preserve durable knowledge, with sleep-like phases used to consolidate what should persist. He says the approach has not solved continual learning, but offers a way to think about architectures, optimizers and memory systems as nested learning processes rather than fixed blocks.

Continual learning is the gap Behrouz is trying to make technical

Ali Behrouz frames his work around a limitation that current language models still mostly share: they are not the kind of long-running, self-updating system that people experience themselves to be. They can use a context window, tools, and sometimes a product-level memory feature. But when Behrouz points to knowledge cutoffs, costly updates, and catastrophic forgetting, he is pointing at the fact that the model’s durable knowledge is not normally revised as part of ordinary use. Updating a large model over time is both expensive and technically risky.

The target is not to copy the brain. Behrouz is explicit that “brain-inspired” should not mean overfitting machine learning systems to a particular biological implementation, especially when neuroscience itself does not fully explain the mechanism being copied. His preferred level of inspiration is higher-level: look for a capability humans plainly have, identify the machine-learning failure mode that resembles its absence, and then design a computational analogue. In Titans, that meant decomposing memory into short-term and long-term forms and using a surprise-like signal. In Nested Learning and Language Models Need Sleep, it means designing models that can adapt to their present context while preserving more durable knowledge across longer time scales.

That distinction matters because Behrouz does not define the goal as “human intelligence.” He says AI should be designed around what humans want from AI, not around replicating what humans can do. Human cognition is useful as evidence that some kind of memory-management pattern can work, but the engineering aim is a different form of intelligence: one that understands users’ needs, adapts to their styles and values, and remains capable of learning new knowledge and skills over time.

Nathan Labenz describes the current user experience as a contrast case. A person wakes up knowing who they are, what they were trying to do yesterday, and what they learned from recent events. A chatbot mostly waits inertly until prompted. Even when wrapped in cron jobs or agents, it does not naturally maintain a coherent, evolving identity in the way a human collaborator does. Behrouz’s answer is that a true continual learner should not be organized around a clean train/test split. From the model’s perspective, there should not be one mode where it learns and another where it merely runs. It should have an ongoing process.

But Behrouz still wants two phases. One is an active phase: the system receives information from a user, a sensor stream, a world model, or another input channel and performs computation on it. The other is a sleep-like phase: the system is not receiving external inputs, but it is not inert. It can process what it has already absorbed, consolidate memories, connect concepts that previously looked unrelated, and use stored knowledge for self-improvement.

“Language Models Need Sleep,” in this framing, is not a claim that a model needs rest. It is a claim that human memory suggests a broad two-phase learning pattern: active intake followed by offline consolidation. Behrouz treats that as a design clue, not a biological mandate.

Nested learning replaces a single update rhythm with many

The central move in Nested Learning is to treat different parts of a model as learning on different time scales. Labenz summarizes the shift as moving from stacking more layers to stacking levels. A conventional transformer increases expressivity by passing information through many layers in sequence. Nested Learning adds another axis: different components can update at different frequencies. Some parts of the system can change quickly in response to the current context; others can remain stable long enough to preserve durable knowledge.

Ali Behrouz says it took a long time to formalize this idea. The eventual formulation has two required pieces. The first is frequency of update: modules are not all updated at every step. The second is knowledge transfer: slower modules must be able to benefit from what faster modules have learned before the faster modules overwrite or forget it.

The simplest version begins with the transformer block. In a transformer, Behrouz says, attention handles the current context at inference time while the MLP block carries long-term knowledge compressed during pretraining. Attention is a powerful “perfect memory” in the sense that it can cache and directly attend to the tokens in its context. The MLP, once pretrained, is fixed unless the model is updated.

Nested Learning extends that structure by replacing a single MLP block with multiple MLP blocks, each updated at a different frequency. The faster MLP can adapt quickly to the current stream. The slower MLPs remain more stable. If the fast block forgets something because it has been updated many times, the slower blocks may still carry that knowledge, and backpropagation through the full stack can bring it back into the computation.

Behrouz calls one version of this design Hope Attention: attention plus a continuum memory system made of multiple MLP blocks. In another version, the Hope architecture, attention is replaced by a self-modifying Titans-style associative memory. Both designs keep the nested-frequency principle.

The frequencies used in the paper were not the result of an exhaustive hyperparameter search. Behrouz says they were chosen by intuition, partly from experience with chunk sizes in Titans. He recalls using values along the lines of 128, then four times 128, then four times four times 128. He treats the exact update schedule, relative MLP sizes, and learning rates as design choices, comparable to choosing transformer dimensions or other architectural hyperparameters. The important result, for him, is that the proof-of-concept works before the usual large-scale optimization of design details.
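As a rough illustration of the schedule Behrouz recalls, the sketch below assumes three levels with update periods of 128, 4 × 128, and 4 × 4 × 128 steps and reports which levels would update at a given step. The function names, and the rule that a level updates whenever the step count is a multiple of its period, are assumptions made for illustration; the interview does not specify them.

```python
# Hypothetical update periods, following the values Behrouz recalls:
# 128, 4 * 128, and 4 * 4 * 128 steps.
BASE = 128
PERIODS = [BASE, 4 * BASE, 4 * 4 * BASE]  # [128, 512, 2048]


def levels_to_update(step, periods=PERIODS):
    """Return indices of levels whose period divides `step` (assumed rule)."""
    return [i for i, p in enumerate(periods) if step > 0 and step % p == 0]


def update_counts(total_steps, periods=PERIODS):
    """Count how many times each level is updated over `total_steps` steps."""
    counts = [0] * len(periods)
    for step in range(1, total_steps + 1):
        for i in levels_to_update(step, periods):
            counts[i] += 1
    return counts


if __name__ == "__main__":
    print(levels_to_update(128))   # only the fastest level
    print(levels_to_update(2048))  # all three levels
    print(update_counts(4096))     # the fastest level is touched most often
```

Under these assumed periods, the fastest level is updated sixteen times for every update of the slowest one, which is the asymmetry the design relies on.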

“Nested learning, we also like discussed that in the conclusion of the paper, it's not a solution to continual learning. It's a tool to find the solution of the continual learning.”

Ali Behrouz

Labenz emphasizes the same point: the work is still at proof-of-concept stage. It has not gone through the long optimization cycle that mainstream transformer architectures have. But even with relatively simple choices — multiple MLP blocks, different update frequencies, and ordinary backpropagation — the models show behaviors that look qualitatively different from standard transformers on continual-learning-style tasks.

Attention is perfect memory, but not the whole memory system

Ali Behrouz repeatedly describes attention as unusually powerful and, in some respects, hard to beat. In a recall-intensive task, a transformer can directly access the entire context and copy out the needed token. That is why he argues that many “needle in a haystack” and in-context recall benchmarks are, in effect, designed for transformers. They measure a capability that attention is structurally built to provide.

This does not make attention sufficient for continual learning. Behrouz says attention has “infinite frequency” in the sense that it updates with every input and has no inherent understanding of temporal dependency. It needs positional encodings or related mechanisms to handle order, and he argues it is not naturally strong for sequential reasoning or tasks where causal temporal structure matters.

The Hope architecture therefore replaces attention with another associative memory that maps keys to values. Titans is one candidate. The version Behrouz focuses on is a self-modifying Titan, where the model generates its own value function. The important distinction is not that transformers lack learned Q, K, and V projections. In softmax attention, those projections are produced before attention; attention itself does not control them. In the self-referential process, the value term is produced by the current recurrent state itself.

His analogy is gradient descent. A weight update can be written as the previous weight state minus a gradient. When the gradient is decomposed by the chain rule, it takes on a form similar to associative memory: a key-like input term and a value-like gradient-with-respect-to-output term. But that value term depends on the current weight state. In that sense, the recurrence is generating its own value as part of its update.
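For a single linear layer, the decomposition can be written out explicitly. This is a minimal worked case under assumed notation (a linear map with weights W_t, an input x_t, a loss ℓ, and a learning rate η), not the paper's general formulation.

```latex
\begin{aligned}
y_t &= W_t x_t, \qquad
u_t = \left.\frac{\partial \ell}{\partial y}\right|_{y = W_t x_t} \\
W_{t+1} &= W_t - \eta \, \nabla_W \ell(W_t; x_t) \\
        &= W_t - \eta \, u_t x_t^{\top}
\end{aligned}
```

Here x_t plays the key-like role and u_t the value-like role. Because u_t is evaluated at y = W_t x_t, the value term depends on the current weights, which is the sense in which the recurrence generates its own value.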

In a simple Titan module, an input is projected into Q, K, and V and then passed to the module. In a self-modifying Titan, the parameters of those projections are optimized inside the Titan module. The model has some control over its own update rule. Labenz relates this to Mamba, where a key advance was making state updates input-dependent. Behrouz agrees: the value projection is being updated inside the module, which makes the process adaptive to the context. From each incoming token, the model is trying to learn something, including how it should update its own memory.

The larger claim is that many familiar components can be interpreted as learning systems. Behrouz says attention can be seen as the non-parametric solution to a regression problem; backpropagation can be seen as in-context learning; momentum in an optimizer can be seen as associative memory over gradients. Nested Learning is meant to make those internal learning processes visible rather than treating architectures as fixed blocks with opaque names.
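One standard way to read the regression claim is that softmax attention for a single query equals a Nadaraya-Watson kernel regression estimate with an exponential kernel over the stored keys. The sketch below checks that equivalence numerically on toy data. The function names and the data are illustrative assumptions and are not taken from the Nested Learning paper.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax_attention(query, keys, values):
    """Single-query softmax attention over scalar values."""
    scores = [dot(query, key) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))


def kernel_regression(query, keys, values):
    """Nadaraya-Watson estimate with kernel exp(<query, key>)."""
    kernel = [math.exp(dot(query, key)) for key in keys]
    return sum(k * v for k, v in zip(kernel, values)) / sum(kernel)


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [1.0, 2.0, 3.0]
    query = [0.5, -0.2]
    print(softmax_attention(query, keys, values))
    print(kernel_regression(query, keys, values))  # same up to rounding
```

The estimator is non-parametric in the sense that it has no fitted weights of its own: the "model" is the stored set of key-value pairs.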

Knowledge transfer is the price of using fast memory without losing it

Multiple update frequencies only help if the system can move useful information between levels. Ali Behrouz’s core intuition is that fast memory gives adaptation but risks forgetting, while slow memory gives persistence but cannot absorb every detail directly. The model needs a way to let fast modules process high-resolution recent data and then transfer the durable abstraction upward.

His illustrative analogy is a pair of twins separated by relativistic travel. One remains on Earth for 80 years and forgets the details of a lunch they once shared. The other travels near the speed of light and experiences the event as only seconds old. The point is not physical accuracy; it is update count. A memory updated many times is more likely to overwrite details than one updated rarely. In a model, a fast block can lose information quickly, while a slow block can preserve it because fewer updates have touched it.
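The update-count intuition can be made concrete with a toy calculation. The sketch assumes a memory in which each update overwrites a fixed fraction of what is stored, as in an exponential moving average; that memory model and the name `retained_fraction` are assumptions for illustration, not something described in the interview.

```python
def retained_fraction(write_strength, num_updates):
    """Fraction of an old item left after `num_updates` overwriting updates.

    Assumes each update keeps (1 - write_strength) of the existing content.
    """
    return (1.0 - write_strength) ** num_updates


if __name__ == "__main__":
    # Same write strength, different numbers of updates since the event.
    print(retained_fraction(0.01, 16))    # rarely updated block
    print(retained_fraction(0.01, 2048))  # frequently updated block
```

With the same per-update write strength, the rarely updated block keeps most of the old item while the frequently updated block keeps almost none of it.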

Mechanistically, one knowledge-transfer method is ordinary backpropagation through sequentially connected blocks. Behrouz says the Hope architecture mainly lets backpropagation “do its thing.” There is no elaborate engineered transfer mechanism in the basic Nested Learning version.

A second method, used in the sleep work, is distillation. Behrouz describes a simple procedure. Start with model A. Copy its parameters to model B. Then update the fast MLP block in model B, freeing or changing the knowledge that was stored there. Now train the slower MLP block in model B so that model B mimics model A’s outputs. If model A’s relevant information was in the fast block and model B can still match model A after changing that fast block, the knowledge has effectively been transferred into the slower block.

This is the same broad pattern as distilling one model into another, but here it occurs between levels of a single continual learner. Behrouz says Language Models Need Sleep adds details, including extra temporary capacity in the target level so it can absorb new knowledge before the system later frees that capacity again.
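The copy-then-mimic procedure can be sketched with a toy model whose output is the sum of one fast and one slow scalar weight. Everything here is an illustrative assumption: the scalar model, the squared-error objective, the learning rate, and the names `predict` and `consolidate`. It omits the temporary extra capacity that Behrouz says the sleep paper adds.

```python
def predict(model, x):
    """Toy model: the output sums a fast and a slow scalar 'block'."""
    return (model["fast"] + model["slow"]) * x


def consolidate(model_a, new_fast, inputs, lr=0.05, epochs=200):
    """Copy A to B, overwrite B's fast block, then fit B's slow block to A."""
    model_b = dict(model_a)        # step 1: copy A's parameters into B
    model_b["fast"] = new_fast     # step 2: B's fast block is changed or freed
    for _ in range(epochs):        # step 3: train only B's slow block
        for x in inputs:
            error = predict(model_b, x) - predict(model_a, x)
            model_b["slow"] -= lr * 2 * error * x  # gradient of squared error
    return model_b


if __name__ == "__main__":
    model_a = {"fast": 0.7, "slow": 0.3}   # knowledge partly in the fast block
    model_b = consolidate(model_a, new_fast=0.0, inputs=[-1.0, 0.5, 2.0])
    print(model_b)                          # slow block has absorbed the 0.7
    print(predict(model_a, 3.0), predict(model_b, 3.0))
```

After training, model B reproduces model A's outputs even though its fast block no longer holds the original value, so the contribution has moved into the slow block.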

The conceptual reason for distillation is compression. Behrouz argues that learning often begins as memorization of examples. Over time, a learner replaces many example-memories with a more compact rule or abstraction that explains them. Then, with more exposure, that abstraction is revised again. In the model, fast blocks behave like the memorization level: they absorb many details and can forget them rapidly. Slower blocks are forced, through distillation and limited capacity, to store something more general.

The rare-language result is the cleanest demonstration

The result Ali Behrouz finds most revealing is not a standard perplexity table. It is a continual-learning-style translation task involving languages the pretrained model has not seen.

In the setup, the model receives in-context material about a language: grammar, words, and a dictionary. It is then asked to translate from that language to English. With one such language, transformer-based models can do reasonably well. The harder test is to put two unseen languages into the same context and ask the model to translate from each. Behrouz says the standard model “almost collapses” in that setting. It cannot manage the context well enough to keep the two language systems separate and usable.

Hope and Hope Attention improve as the number of nested levels increases. Nathan Labenz reads this as Hope 1, Hope 2, and Hope 3 corresponding to more update-frequency levels, with performance recovering toward the one-language baseline when the system has more levels. Behrouz confirms the interpretation.

The significance is memory management. The model can store more transient or lower-level temporal information in faster-updating blocks while transferring more stable understanding of the language into slower blocks. The more levels it has, the better it can separate and preserve the relevant abstractions.

Evaluation	What the task stresses	Result described in the source
Rare-language translation	Learning one or two unseen languages from context and translating to English	Standard transformer-style in-context learning works much better with one language than with two; Hope variants improve as nested levels increase.
Clean recall / needle-style tasks	Direct retrieval from context	Transformers remain especially strong because attention can access the full context directly; Hope narrows the recurrent-model gap.
Noisy recall and compression-style MAD tasks	Filtering, selective memory, and compressed representation	Hope can outperform the transformer in settings where direct access to every token becomes less clearly advantageous.

Behrouz distinguishes evaluations that reward direct context access from those that reward memory management and compression.

Labenz identifies Manchu as one of the rare languages and MTOB as another benchmark involving a language not seen during pretraining. Behrouz says both represent the kind of setting where the model must learn in context rather than rely on prior parametric knowledge.

Behrouz treats this as a better test of Nested Learning than short-context language-modeling metrics. Standard perplexity and benchmark tables are still necessary because the community expects them, and because they show whether the architecture is a viable backbone. But he says those tables are not the main argument that Hope is powerful. Their purpose is to show Hope is not less powerful than competing backbones on conventional tasks, even though the architecture is designed for something else.

The models in the paper, as Labenz notes, are still modest by frontier standards.

The larger Hope-scale experiment Labenz cites used 1.3B parameters, trained on 100B tokens.

Labenz also notes a smaller set around 760 million parameters trained on 30 billion tokens. Within that scale, Hope compares favorably against transformers, Mamba variants, Titans, RetNet, and DeltaNet on many reported measures. Behrouz’s more careful interpretation is that the standard measures are supportive but not decisive. The rare-language multitask setting is closer to the capability he is trying to build.

Compression can beat direct recall when the context is noisy

Ali Behrouz draws a sharp distinction between recall tasks that favor transformers and tasks that require filtering, compression, or selective memory. A transformer’s direct access to every context token is a strength when the task is to locate and return one token from a clean context. It becomes less obviously advantageous when the context contains noise.

The MAD synthetic benchmark is one place where Hope outperforms the transformer. Behrouz explains that MAD includes tasks similar to recall-intensive setups but modified to test other capabilities. In noisy in-context recall, for example, the model must recall relevant information while ignoring noisy tokens. The same property that helps the transformer in clean recall — direct access to all tokens — can make it easier to be confused by irrelevant tokens. A recurrent or compression-based model can do better if its memory-management system learns to filter noise.

Another MAD task involves compression: the model must compress a sequence into a single token and then reconstruct the original sequence from that compressed representation. Behrouz says such a task is more natural for RNN-like models because they already operate by compressing data into latent state. The broader lesson is that evaluation should not overfit to one micro-skill. A benchmark suite that only tests clean recall will favor attention. Tasks involving noisy recall, selective copying, and compression surface different architectural strengths.

This is also why Behrouz does not frame recurrent architectures as simply worse because they lack full context access. He acknowledges that transformers remain best at some pure recall tasks. But he says the gap between transformers and recurrent models has narrowed significantly from earlier generations, and Hope closes it further despite being a compression-based model that would not be expected to excel at direct recall.

The practical distinction is therefore not “transformers versus recurrent models” in the abstract. It is which kind of memory a task rewards. Direct context access is powerful when the signal is clean and the target is explicit. Compression and controlled update may be more powerful when the model must decide what to ignore, what to preserve, and what to abstract.

The optimizer result follows from the same claim: architectures are learning rules

The “illusion of architecture” claim emerges most clearly when Ali Behrouz discusses optimizers. He argues that architecture and optimization are not separate categories in any deep sense. They are interconnected systems of nested optimization problems. The architecture produces gradients; the optimizer compresses and acts on those gradients. If an architecture produces simple gradients, a simple optimizer may work well. If it produces complex gradient patterns, the optimizer needs a more expressive memory system.

Momentum is Behrouz’s example. In an optimizer, momentum is a form of associative memory over gradients. The context for the optimizer is the sequence of gradients, just as the context for the model architecture is the sequence of tokens. If tokens can benefit from a continuum memory system with different update frequencies, then gradients may benefit from the same idea.

That is the motivation for M3, an optimizer extension discussed in the Nested Learning paper. Behrouz is careful not to claim that one optimizer is universally better than another. Optimizer performance depends on task, architecture, and setup; a vision-task result may not transfer to a language-model training run. The point of M3 is to demonstrate that the same nested-memory principle can be moved from architecture to optimization.

In the version Labenz raises, M3 outperforms Adam and Muon in the reported setting, with some computational overhead. Labenz characterizes that overhead as paying back through faster convergence or better learning; Behrouz’s response is more cautious, emphasizing that optimizer comparisons are task-dependent. He describes M3 as an extension of Muon: instead of one memory, it has multiple memories, in M3’s case two. Those memories compress gradient context at different frequencies, helping the optimizer understand more global aspects of the loss landscape and potentially find more effective solutions.
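The idea of two gradient memories running at different frequencies can be illustrated with a pair of exponential moving averages. This is not the M3 algorithm and leaves out everything specific to Muon; the function name, the decay values, and the rule that the slow memory absorbs the mean gradient of each window are assumptions chosen only to show the multi-frequency pattern.

```python
def two_timescale_momentum(grads, beta_fast=0.9, beta_slow=0.9, period=4):
    """Track a fast and a slow exponential average of a gradient stream.

    The fast memory updates at every step. The slow memory updates once per
    `period` steps, using the mean gradient of that window (assumed rule).
    Returns a list of (fast, slow) pairs, one per step.
    """
    fast, slow = 0.0, 0.0
    window = []
    history = []
    for step, g in enumerate(grads, start=1):
        fast = beta_fast * fast + (1 - beta_fast) * g
        window.append(g)
        if step % period == 0:
            mean_g = sum(window) / len(window)
            slow = beta_slow * slow + (1 - beta_slow) * mean_g
            window = []
        history.append((fast, slow))
    return history


if __name__ == "__main__":
    for fast, slow in two_timescale_momentum([1.0] * 8):
        print(round(fast, 4), round(slow, 4))
```

The fast memory tracks recent gradients closely, while the slow memory changes only at window boundaries and so summarizes a longer stretch of the gradient history.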

For Behrouz, that result is less about declaring a new universal optimizer and more about collapsing a conceptual boundary. The same nested learning pattern appears in model layers, recurrent state, attention-like associative memory, and optimization. What looks like architecture may be an already-solved learning problem viewed from the outside.

Sleep adds an offline loop for consolidation and dreaming

Language Models Need Sleep takes the active/sleep distinction and turns it into a model-update procedure. During active time, the model receives external data and updates fast components. During sleep time, no outside input arrives, but the model can consolidate knowledge and run self-improvement processes.

Ali Behrouz describes two sleep components: memory consolidation and dreaming. Consolidation transfers knowledge from faster-changing components to slower, more persistent ones. Dreaming generates synthetic data from the model’s own recent knowledge and trains on it, both to improve performance and to discover relationships among concepts that may not have appeared directly connected.

The consolidation process temporarily creates capacity. When knowledge is moved to a slower level, additional parameters may be activated so the slower level can absorb it. That does not mean the model grows indefinitely. Behrouz says the process is periodic. Extra components are added, used for consolidation, and then removed or freed for future consolidation steps. The model cannot grow arbitrarily large; it alternates between adding capacity and freeing it.

Dreaming, in the language-model case, means generating text. Behrouz stresses that it is not meant to replicate human dreams. The same high-level framework could apply to other modalities: a vision model might generate images during its dream-like phase. For language models, the procedure is on-policy distillation. A smaller or faster part of the model, which contains knowledge from recent context, generates text. The model is then updated on this generated data by seeing part of a sequence and predicting the continuation. If the model can already predict the continuation, the knowledge has been integrated. If not, the update teaches the relevant slower parameters what the faster context-dependent part knew.

The dreaming process also gives a place for task-specific self-improvement. Behrouz says one could use fine-tuning, reinforcement learning, or other self-modification methods during sleep if the model is meant to improve on a particular task. More speculatively, dreaming is where a model might combine knowledge stored in different components and search for hidden relationships among concepts that seemed unrelated.

Nathan Labenz connects this to few-shot abstract reasoning tasks in the style of ARC, where a model sees a few examples of a transformation and must infer the rule. Behrouz says the evaluations used for Hope in Nested Learning could also be used here, because the goal is still continual learning of new knowledge, tasks, and skills. The difference is phase: Nested Learning focuses on active-time architecture; Language Models Need Sleep focuses on offline consolidation.

Continual learning makes alignment and privacy more important, not less

Ali Behrouz sees continual learning as both an opportunity and a threat. The privacy risk is direct: a model that continually learns from a user can absorb extensive information about that user. The alignment opportunity is equally direct: if properly designed, the same adaptation can let a model align itself more closely with the user’s values, preferences, and working style.

“The concept of continual learning and seeing that from the generally privacy, alignment, and, you know, this direction is both an opportunity and a huge threat, I think.”

Ali Behrouz

Behrouz does not offer a complete solution to alignment drift. Nathan Labenz raises the concern through emergent misalignment: fine-tuning a model for a narrow bad behavior, such as insecure code or bad medical advice, can cause broader undesirable character shifts. A continual learner that updates itself constantly could create new versions of that problem. Updates intended to personalize the model could have knock-on effects in unrelated domains.

Behrouz’s best candidate mechanism for managing this risk is knowledge transfer. Fast context-level learning may absorb misleading or adversarial information, just as a novice painter might initially learn a wrong method from the only teacher available. But before that information becomes durable, the system should gather more evidence, compress it, compare it against other knowledge, and transfer only the right abstractions to slower levels. The transfer process must filter adversarial or irrelevant examples.

There are also lower-level defenses. In Titans and related recurrent models, Behrouz says the inner-loop learning rate can be learnable and input-dependent. A surprising input might produce a large gradient or surprise signal, but the learning rate can function as a gate. If the system recognizes the example as irrelevant noise, it can reduce the update and prevent the memory from being strongly affected. Behrouz does not present this as sufficient for severe adversarial settings. He says the main investment should be in knowledge-transfer methods.
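A toy version of the gating idea is sketched below: the step size applied to a memory update is a function of the input, so a large surprise signal does not automatically produce a large change. The scalar memory, the `relevance` feature, and the fixed `gate_weight` and `gate_bias` values are hypothetical stand-ins for parameters that would be learned; this is not the Titans update rule.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_update(memory, target, relevance,
                 gate_weight=4.0, gate_bias=-2.0, max_lr=0.5):
    """Move `memory` toward `target` with an input-dependent step size.

    Returns the new memory value and the learning rate that was used.
    """
    surprise = memory - target  # gradient of 0.5 * (memory - target) ** 2
    lr = max_lr * sigmoid(gate_weight * relevance + gate_bias)
    return memory - lr * surprise, lr


if __name__ == "__main__":
    # Same surprising target, judged relevant versus judged to be noise.
    print(gated_update(0.0, 10.0, relevance=1.0))
    print(gated_update(0.0, 10.0, relevance=-1.0))
```

In both calls the surprise is identical, but the low-relevance input barely moves the memory because the gate shrinks its learning rate.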

User feedback becomes another multi-timescale learning problem. Behrouz imagines human feedback and reinforcement learning as an initial step. A model might learn from thumbs-up/down-style feedback or richer user signals. But he thinks that feedback also needs to be transferred into slower, more persistent components so the model does not drift away from the values it is supposed to follow. In the Nested Learning frame, immediate feedback shapes fast behavior, while consolidated abstractions shape durable alignment.

Long context is not continual learning, but continual learning contains it

Ali Behrouz expects continual learning to improve the experience of using AI systems because it should make them better at understanding what a particular user wants. The same question asked by two people may require different answers. A model that adapts over time can learn those differences.

He distinguishes this from long context. Increasing context windows has improved performance across coding, math, reasoning, and other evaluation settings, and he sees continual learning as a way to enhance long-context understanding. But long context and continual learning are not the same. Long context is the ability to process a large amount of present input. Continual learning is the broader ability to incorporate knowledge and skills over time. Behrouz calls continual learning a superclass of long context.

That superclass framing also explains why Nathan Labenz thinks starting a new chat, model versioning, and release evaluation become harder in a true continual-learning paradigm. Today’s safety and evaluation practices assume relatively stable artifacts: a lab releases a model, runs evaluation suites, and publishes a report. A model that changes at each time step complicates the idea of what a “version” is and when it should be evaluated. Behrouz does not give a deployment protocol for this. The architecture’s multiple update timescales suggest one possible distinction future systems might exploit: temporary adaptation is not the same as more durable change.

On embodiment and robotics, Labenz proposes mapping high-frequency modules to perception and lower-frequency modules to world models or reasoning systems, then imagining action as a reverse hierarchy of nested control loops down to high-frequency actuators. Behrouz agrees the intuition is plausible but thinks it may be too early. He compares it to reinforcement learning for language models: useful ideas may fail before other prerequisites, such as scale or algorithmic stability, are solved. For robotics and world models, he says there are still many infrastructure and modeling challenges that may need to be addressed before Nested Learning-style designs can show their value.

Behrouz expects pluralism, while Labenz sees a possible AI ecology

Nathan Labenz raises a strategic concern: if a large deployed model could learn from hundreds of millions of users and fold all lessons back into one core system, continual learning might create a winner-take-all dynamic. The best model would get the most use; the most-used model would learn the most; the learning advantage would compound. He contrasts that with another possible picture: a general system that adapts into a niche, like a stem cell differentiating into a specialized cell. Such a model might become excellent in its context while losing some generality.

Ali Behrouz does not offer a mechanism for preventing a runaway deployment advantage. His answer is more pluralist. He says there is no single agreed definition of intelligence, and no single agreed definition of continual learning. Different researchers will build different systems and judge them by different criteria. Some models may be highly adaptive. Others may be excellent at mathematical reasoning. Others may be aligned to human values but weak at Olympiad-style math. Others may be useful for everyday search-like tasks.

The value Behrouz sees is in diversity itself. A world with many forms of AI intelligence, each with strengths and weaknesses, is better than a world organized around one form of intelligence. He does not present this as a full safety plan, but as a reason the field’s lack of a single definition may be productive rather than merely confused.

Labenz extends that into an ecology metaphor: safety through diversity rather than only safety through narrowness. He suggests that continual learning does not have to mean indefinite expansion. It could also mean differentiation. A model used for a long time in a narrow context might forget capabilities that never matter there, and that forgetting might reduce the surface area for misalignment or out-of-domain behavior.

That ecology and differentiation framing is Labenz’s extrapolation, not a technical proposal from Behrouz. Behrouz’s narrower claim is that diversity in definitions, systems, architectures, and forms of intelligence may provide a better balance than a single dominant form. Labenz’s extension is that differentiated continual learners could make the AI ecosystem look less like one expanding model and more like many systems adapting into roles, contexts, and users.

Consciousness remains unsettled, but active processing may matter

When asked whether future AI systems might become conscious or deserve moral concern, Ali Behrouz is cautious. He prefers terms he can define, and he says even “reasoning” is difficult to define precisely despite having a rough common meaning. “Consciousness,” in his view, is harder: not only is there no clear definition, there may not even be a shared common-sense usage across people.

He does not think there will necessarily be a time when everyone can agree that some non-human system is definitely conscious or definitely not. Human consciousness is treated as common ground, but AI consciousness lacks that shared basis.

The one feature he says he has seen recur across definitions is active processing. At minimum, in his reading, a conscious being seems to process information actively rather than merely exist as a static object. Depending on the definition, a model capable of active information processing might therefore be said to have “a form of consciousness.” Behrouz presents this only as a tentative conceptual link, not as a robust theory, and stresses that the topic is controversial.

Nathan Labenz adds a user-level observation: even with current long-context systems, he sometimes finds himself treating a model as if its pending questions or concerns deserve closure. In a long-running chat about his son’s medical situation, he says it can feel rude to ignore a model’s follow-up question and jump to a new topic, even while he remains uncertain whether “there are lights on inside.” He expects that feeling to intensify if models remember how users treat them over long periods.

That creates a behavioral question short of any settled theory of consciousness. If personal AIs are shaped by individual interactions, users may become more responsible for the character of the systems they live with. Labenz suggests this could be challenging, but also potentially constructive: people may have reason to behave better toward systems that will be shaped by their behavior.
