Language Models Generalize Differently From Parameters Than From Context

Andrew Lampinen · Stanford Online · Wednesday, May 20, 2026 · 18 min read

In a Stanford CS25 seminar, Anthropic researcher Andrew Lampinen argues that language models generalize differently depending on whether information is stored in their parameters or supplied in context. His experiments find that models can often use relations flexibly when the relevant facts are visible in the prompt, but fail to make the same reversals, syllogistic inferences, or codebook translations when those facts have only been learned through training. Lampinen presents augmentation, retrieval, and reinforcement-learned recall as partial ways to make latent implications more usable, while stressing that parametric learning and in-context learning remain complementary rather than substitutes.

The same model learns differently from parameters and from context

Andrew Lampinen framed the central problem as a controlled way to ask how language models generalize from information they have learned. A model can be taught through two routes: by optimizing its parameters over training data, or by putting information directly into its context. Those routes are often discussed as if they were functionally similar, especially in theoretical work suggesting that in-context learning can approximate gradient descent in settings where gradient descent is the right learning algorithm. Lampinen’s experiments show that, for important classes of relational information, the two routes produce sharply different generalization.

The motivating example was the “reversal curse.” In the original setup Lampinen described, a model is fine-tuned on facts in one direction, such as “Daphne Barrington is the director of A Journey Through Time.” It can then answer a question that preserves the learned direction — “Who is Daphne Barrington?” — but struggles when asked to reverse the relation: “Who directed A Journey Through Time?” The model has stored something about the fact, but not in a form that makes the inverse relation readily available.

The same kind of reversal is trivial for a chat model when the fact is supplied in context. If the prompt says “Zepax are bigger than Quarples,” the model can answer that Quarples are smaller than Zepax. In an earlier reasoning paper with simpler models, Lampinen said, this had been an easy baseline condition that models handled at ceiling.

That contrast led to the first controlled question: which generalizes better, a language model fine-tuned on a new dataset, or one given the whole dataset in context? The experiments used datasets with thousands of documents and hundreds of thousands of tokens. In one condition, the model was fine-tuned on the dataset. In the other, the entire dataset was concatenated into the model’s context, with document separators. Both models were then tested on generalizations from the data — reversals, syllogistic implications, and other structured tasks.

On the reversal curse dataset, the results replicated the original finding on a different model. A pretrained model was at chance. Fine-tuning on the relations in only one direction made the model slightly worse than chance on reversals. But placing the entire dataset in context let the model answer reversals with 99% accuracy.

The same pattern appeared in syllogistic generalization. Lampinen used nonsense nouns to ensure the facts were not already in the model’s training data. A training example might include “All zamp are snaff” and “No snaff are plusk,” with the held-out implication “No zamp are plusk” used as a test. Pretrained models were at chance because the facts were made up. Fine-tuned models showed only modest above-chance generalization. Models given the dataset in context generalized better.
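As a rough illustration of what "held-out implication" means here, the sketch below derives conclusions that follow from pairs of nonsense-noun premises but are never stated. The `Premise` tuple, the `derive_implications` helper, and the restriction to two inference patterns are assumptions made for this example, not details of Lampinen's dataset.

```python
from typing import NamedTuple


class Premise(NamedTuple):
    quantifier: str  # "All" or "No"
    subject: str
    predicate: str


def derive_implications(premises):
    """Return conclusions implied by pairs of premises but not stated in them."""
    stated = set(premises)
    derived = set()
    for first in premises:
        for second in premises:
            # Chain only when the first premise is "All A are B" and the
            # second premise says something about B.
            if first.quantifier != "All" or first.predicate != second.subject:
                continue
            # All A are B + All B are C  =>  All A are C
            # All A are B + No B are C   =>  No A are C
            conclusion = Premise(second.quantifier, first.subject, second.predicate)
            if conclusion not in stated and conclusion.subject != conclusion.predicate:
                derived.add(conclusion)
    return derived


# Training premises taken from the example in the text.
train = [Premise("All", "zamp", "snaff"), Premise("No", "snaff", "plusk")]
held_out = derive_implications(train)
print(held_out)  # the single implied statement: No zamp are plusk
```

The conclusion is fully determined by the training premises, which is why it can serve as a test of latent generalization rather than of new knowledge.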

Task	Parametric route	Context route	Resulting contrast
Reversal curse	Fine-tuning on one direction made reversal performance slightly worse than chance	The whole dataset in context produced 99% reversal accuracy	The same relation was much more usable when visible in context
Syllogisms	Fine-tuning produced only modest above-chance generalization	The dataset in context generalized better	Premises in context supported logical implications more flexibly
Codebooks	Definitions could be recalled, but novel latent indices were not reliably encoded	Definitions in context could be used for novel encodings	Stored definitions were less usable than visible definitions

Across the controlled tasks, information in context supported more flexible use of latent implications than information stored only in parameters

The immediate explanation was not that models lack the relevant reasoning procedures. Reversals and syllogistic structures are common contextual patterns on the internet: documents, essays, proofs, and explanations often invert relations or draw implied conclusions. It therefore makes sense that models learn to use these structures when the premises are visible in context. The harder question is why the same models do not reliably generalize that way after the relevant information has been absorbed into parameters.

The failure persists even when models are trained from scratch

One possible explanation was that fine-tuning itself was the problem. Maybe models do not learn enough during fine-tuning, or maybe the relevant representations are already too fixed. To test that, Andrew Lampinen described training a small language model from scratch on a synthetic reversal setting.

The model had roughly 20 million parameters and was trained on 20,000 relationships. Most reversals appeared during training, but 1% were held out. The data included filler prefix tokens for diversity, trained relations such as “X contain Y,” and in-context learning sequences where a forward relation appeared together with its reverse, such as “W contain Z; Z subset_of W.” Held-out tests asked for reversals such as “Y subset_of X” that had not appeared in training.
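A minimal sketch of how a dataset with this shape could be generated is shown below. The entity and filler token names, the exact sequence formats, and the choice to pair forward and reverse statements only for non-held-out relations are assumptions for illustration; the talk did not specify these details.

```python
import random


def build_reversal_data(num_relations=20_000, holdout_frac=0.01, seed=0):
    rng = random.Random(seed)
    # Each relation links a unique pair of synthetic entity tokens (hypothetical names).
    pairs = [(f"e{2 * i}", f"e{2 * i + 1}") for i in range(num_relations)]
    rng.shuffle(pairs)
    num_held_out = round(num_relations * holdout_frac)
    held_out, seen = pairs[:num_held_out], pairs[num_held_out:]

    def with_filler(text):
        # Zero to three random filler tokens are prepended for surface diversity.
        prefix = " ".join(f"f{rng.randrange(100)}" for _ in range(rng.randint(0, 3)))
        return f"{prefix} {text}".strip()

    train = []
    for x, y in pairs:
        # Every forward relation is trained, including the held-out ones.
        train.append(with_filler(f"{x} contain {y}"))
    for x, y in seen:
        # Forward and reverse appear together only for non-held-out relations.
        train.append(with_filler(f"{x} contain {y} ; {y} subset_of {x}"))

    # Held-out tests ask for reversals that never appear in training.
    test = [f"{y} subset_of {x}" for x, y in held_out]
    return train, test


train, test = build_reversal_data()
print(len(train), len(test))
```

The key property is that each held-out reversal has its forward relation in the training data, so the answer is latent in what the model has seen even though the reversed statement itself never occurs.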

The result was stark. The model remembered the trained forward relations, even in new contexts. If a trained forward relation was placed in context, it could perform the reversal perfectly. But without that forward relation in context, it achieved zero generalization to the held-out reversals. Lampinen emphasized that this was not merely a fine-tuning artifact. It reflected something more basic about how these causal next-token models generalize from relational information in their training data.

A related “Codebooks” task showed the same pattern in a translation-like, multi-hop setting. The model was trained on definitions of codebooks — mappings from symbols to letters — and on examples using some of the symbols. At test time, it had to use other defined symbols that had not appeared in training examples. The model could recall definitions. If the relevant definition was in context, it could use it flexibly. But it could not reliably combine the stored definition with the required operation when the information had to come only from parameters.
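The split described above can be sketched as follows. The document formats, the codebook name "A", the symbol names, and the direction of translation (symbol to letter) are all assumptions made for this example; only the overall structure (every symbol defined, some symbols never used in training examples) comes from the talk.

```python
import random
import string


def build_codebook_task(num_symbols=10, num_latent=3, seed=0):
    rng = random.Random(seed)
    letters = rng.sample(string.ascii_lowercase, num_symbols)
    # The codebook maps each symbol to a letter.
    codebook = {f"s{i}": letter for i, letter in enumerate(letters)}
    symbols = list(codebook)
    latent = symbols[:num_latent]    # defined, but never used in training examples
    trained = symbols[num_latent:]   # defined and used in training examples

    # Every symbol, latent or not, gets a definition document.
    definitions = [f"codebook A : {s} means {codebook[s]}" for s in symbols]

    usage_examples = []
    for _ in range(20):
        seq = rng.choices(trained, k=4)
        encoded = " ".join(codebook[s] for s in seq)
        usage_examples.append(f"codebook A : translate {' '.join(seq)} -> {encoded}")

    # Test prompts name the codebook but do not repeat the definition.
    latent_test = [(f"codebook A : translate {s} ->", codebook[s]) for s in latent]
    return definitions + usage_examples, latent_test


train_docs, latent_test = build_codebook_task()
```

Answering a latent test prompt requires combining a stored definition with the translation operation, which is the multi-hop step the text describes.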

During questions, Lampinen clarified what “no context” meant in this setting. The prompt still identified the codebook and asked for a translation, but the model had to rely on what it had learned from training rather than having the definition supplied again in the prompt. He also noted that known encodings in novel sequences generalized well, so the failure was not simply rote overfitting to training examples.

Asked whether the issue depended on the representation being English, Lampinen distinguished the experiments. For pretrained models, the propositions were expressed in English. For models trained from scratch, the tokens had no preexisting English meaning; “contain” could be thought of as a formal operator. He did not think ambiguity in English was the core issue, though he allowed that formatting details could matter.

Asked whether the behavior was architecture-independent, he said it was not. Bidirectional Transformers fix the reversal issue, and changes to the learning objective can also fix it. But he described the failure as a systematic feature of architectures doing causal next-token prediction without additional tricks to avoid it.

The broader term for the problem was “latent” information. Training data often convey more than their explicit surface content. “X is Y’s parent” explicitly states one direction of a relation, but it also latently answers “Who is Y’s parent?” Two parent relations can latently imply a grandchild relation. A navigation instruction to one goal can convey structure relevant to another goal. A relation expressed in one language can imply a relation asked in another. The claim was not that models never learn from these data, but that when trained only on the explicit bits, they may fail to encode the latent information in a reliably usable form.

Parametric learning is not worse; it is less flexible in specific ways

Andrew Lampinen repeatedly resisted the interpretation that in-context learning is simply better than parametric learning. The two modes, in his account, are complementary.

Parametric learning consolidates information across many documents. That consolidation is valuable: it extracts statistical structure, cached facts, recurring patterns, and procedures for using information in context. But the consolidated information can be tied to the explicit forms and tasks in which it appeared. In-context learning, by contrast, can use fewer documents but preserve them in richer detail, making specific information more flexible at test time.

This distinction also explains why the failures are not obvious in everyday model behavior. Language models often generalize above 0% in practice because real data contain dense statistical cues. Lampinen gave a simple example: a model trained on “All birds have wings” and “Eagles are a type of bird” might have trouble deriving “Eagles have wings” purely as a syllogism from parameters. But in natural data the model will also see “Eagles fly,” “Pigeons are a type of bird,” “Pigeons have wings,” “Pigeons fly,” “Hawks are a type of bird,” “Hawks have wings,” and so on. The test sentence “Eagles have wings” can then be supported by co-occurrence structure rather than by a clean symbolic inference.

An audience member objected that this can go badly wrong. Lampinen agreed. Statistical generalization is useful in expectation, but any given instance can be wrong. He gave the example of “penguins fly”: co-occurrence-like structures might push toward the wrong conclusion, and young children also sometimes get such statements wrong. More refined understanding, in humans or models, can repair these errors with more data or other reasoning methods.

Another questioner asked whether this was a compression story: perhaps the forward relation is retained, but the reverse is lost in compression. Lampinen accepted that as partly right, while emphasizing the tension. Compression can help generalization by extracting useful structure and making it available. But if the compression is lossy, it can discard details that would be important for some future generalizations.

The practical question became whether models can be made to learn what is latent even when statistical cues are absent. Lampinen presented three routes: train-time augmentation, test-time retrieval, and test-time thinking learned through reinforcement learning.

Augmentation makes latent implications explicit before training

The first intervention was deliberately simple. If a model can generalize better from information in context, use that ability before training to generate the latent implications, then fine-tune on the augmented data.

The procedure was: put the full training set in context, pull out individual documents one at a time, and ask the model to make connections between the selected document and the rest of the corpus. For example, the model might infer: “I see that Dax are a type of Fep, and all Fep can zarp; thus, Dax can zarp.” These generated reasoning traces and conclusions were added back into the original dataset. The model was then fine-tuned on the combined data.
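The loop described above might look roughly like this. The `generate` parameter stands in for any text-generation call and is hypothetical, as are the prompt wording and the document separator; `toy_generate` is a canned placeholder so the sketch runs without a model.

```python
from typing import Callable, List


def augment_corpus(documents: List[str], generate: Callable[[str], str]) -> List[str]:
    """Return the original documents plus one generated trace per document."""
    corpus = "\n---\n".join(documents)  # full training set, with separators
    augmented = list(documents)
    for doc in documents:
        prompt = (
            "Here are some documents:\n" + corpus
            + "\n\nNow consider this document:\n" + doc
            + "\n\nRewrite it, including connections to the other documents."
        )
        # The generated trace is added back alongside the original data.
        augmented.append(generate(prompt))
    return augmented


def toy_generate(prompt: str) -> str:
    # Stand-in for a language model call; returns a canned inference.
    return "I see that Dax are a type of Fep, and all Fep can zarp; thus, Dax can zarp."


docs = ["Dax are a type of Fep.", "All Fep can zarp."]
augmented = augment_corpus(docs, toy_generate)
```

In a real pipeline the augmented list would then be used as the fine-tuning set, and sampling several traces per document would correspond to the "many reasoning traces" the talk credits for the gains on syllogisms.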

Andrew Lampinen said this augmented fine-tuning performed as well as in-context learning on reversal-like tasks and sometimes better on syllogisms. The likely reason was sampling: in the in-context learning case, the model gets one chain of reasoning at test time. In augmentation, the system generates many reasoning traces from many documents, increasing the chance that one trace links the statements needed for a later test.

He addressed a common objection through a tweet shown on the slides from David Pfau: “I am absolutely begging AI researchers to learn the data processing inequality,” followed by the claim that discovering new knowledge from synthetic data generated from existing knowledge is the information equivalent of a perpetual motion machine. Lampinen said the tweet was “not wrong” but misleading for this case. The augmentation is not creating new information from nothing. It is making explicit what was already implied. If the corpus contains “All X are Y” and “No Y are Z,” then “No X are Z” is already present as a logical consequence; augmentation makes it accessible.

That said, augmentation has an obvious limitation. It requires anticipating how information might be useful later. Lampinen illustrated this with his own experience reading papers during his dissertation. He might have thought to connect a paper to his dissertation work, but he would not have anticipated that five years later it might matter for research on how language models use information in context. Offline augmentation can only generate the future uses one thinks to ask for.

In the later Q&A, Lampinen added several caveats. Offline augmentation is much more expensive than a straightforward fine-tuning pass over a small dataset — he guessed at least ten times as expensive, because it requires multiple inference-generation passes and then training on a larger dataset. It can also introduce hallucinations. In their examples, he said, hallucinations occurred at a low enough rate that augmentation remained worthwhile: if correct inferences are more frequent than hallucinated ones, and hallucinations are independent, the model can learn the right answer on average. Fine-tuning on a larger augmented dataset can also distort other model knowledge, though regularization can limit drift.

Asked whether the augmentation prompts were too tailored to the task, Lampinen said the same prompts were used across experiments and that prompt ablations mattered less than expected. Still, the prompts were tied to factual and entity-based settings: “Here’s a bunch of documents, now here’s a particular document; rewrite it and include connections to other documents, rephrasings, and relations between entities.” He acknowledged that doing this fully generally is hard, and that prompts useful for factual reversals or syllogisms might not work for math reasoning.

Retrieval turns parametric memories back into context

The second route was retrieval. If the model uses information more flexibly when it is in context, then a system with episodic memory could retrieve the relevant learning experiences at test time and place them back into the prompt.

Andrew Lampinen explicitly called the experiment “cheating.” The retrieval system was an oracle episodic memory: it had perfect recall but imperfect precision. It always retrieved at least one relevant memory, along with several irrelevant distractors. The point was not to solve general retrieval, which he said the work did not do, but to isolate what happens when the needed training information is made available again in context.
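A toy version of such an oracle is sketched below: the relevant memory is always returned because the lookup uses a ground-truth key, and random distractors are mixed in. The `oracle_retrieve` function, the memory keys, and the prompt layout are assumptions for illustration, not the system used in the experiments.

```python
import random


def oracle_retrieve(query_key, memory, num_distractors=3, seed=0):
    """Return the relevant memory plus random distractors, in shuffled order."""
    rng = random.Random(seed)
    relevant = memory[query_key]  # oracle: ground-truth key guarantees recall
    others = [doc for key, doc in memory.items() if key != query_key]
    distractors = rng.sample(others, min(num_distractors, len(others)))
    retrieved = [relevant] + distractors
    rng.shuffle(retrieved)  # the model is not told which item is relevant
    return retrieved


# Memory of training documents; the keys are hypothetical labels.
memory = {
    "film": "Daphne Barrington is the director of A Journey Through Time.",
    "size": "Zepax are bigger than Quarples.",
    "syllogism_1": "All zamp are snaff.",
    "syllogism_2": "No snaff are plusk.",
}
question = "Who directed A Journey Through Time?"
context_docs = oracle_retrieve("film", memory)
prompt = "\n".join(context_docs) + "\n\n" + question
```

Because the forward fact is now visible in the prompt, answering the reversed question becomes an in-context task rather than a parametric one.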

On reversal tasks, parametric learning and episodic retrieval both did well on forward generalization. On reverse tests, parametric learning failed, while the episodic system generalized well despite distractors. On the Codebooks task, both systems did well on validation-style splits such as definition recall and novel sequences with trained indices. On the latent encoding test, only the episodic system generalized well.

The interpretation was that retrieval-augmented generation may help not merely because it supplies facts the model lacks, but because it changes the form in which already-learned information is available. Even documents that have been trained into parameters can become more useful when retrieved back into context, because the model can reason over them with the richer, more flexible machinery of in-context learning.

A questioner asked whether it was fair to say the model can extract literal content from pretraining well but not relationships. Lampinen said the situation is subtler. The model can extract relations from pretraining if cued in the right way. If asked in the forward direction, even in a new context, it may continue the relation. What it lacks is a good way of going backward unless the relevant information is visible in context.

Retrieval’s strengths and weaknesses are different from augmentation’s. It is free at training time, but it increases inference cost because test contexts become longer. And, in Lampinen’s words, the work did not answer how to do retrieval well in a general way; it gestured at where a solution might be.

Reinforcement learning can teach models to recall, but not all recall is practical

The third route was to ask whether a model could learn to retrieve implicitly, without an external memory system. The intuition was that the model already has the pieces. It knows the forward relations. It knows how to reverse a relation when the forward relation is placed in context. The missing skill is to regenerate the necessary information in its own chain of thought, thereby bringing it back into context.

Andrew Lampinen described an experiment that fine-tuned a model on multiple non-overlapping datasets of new knowledge, each with holdouts. Then reinforcement learning was applied on one dataset to teach reasoning, and evaluation was performed on holdouts from another dataset. This distinguished a learned reasoning strategy from simply augmenting one dataset’s facts.

Reinforcement learning on dataset A improved reasoning generalization on dataset B. Augmenting only dataset A did not. The amount of improvement varied across conditions. Syllogism-like structures benefited substantially because the model could follow chains: “What do I know about dax? All dax are fep,” and so on. Reversals benefited much less.

The reason reversals remained difficult is revealing. To solve a reversal without external retrieval, the model has to recall the forward relation from the reverse cue. But that is exactly the reversal problem. Lampinen showed an example of a model enumerating many remembered entities and descriptions until it hit the right one: Cassidy Hammond, Tessa Montgomery, Maxwell Alderwood, Tyler Oakridge, and eventually Dominic Mullins, the record-breaking free-diver who swam with the mythical Kraken. This can work above chance in a synthetic test set with only thousands of entities if the model enumerates enough candidates. It is not a solution that scales in practice.
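The scaling problem can be made concrete with a small sketch. Here `recall_forward` plays the role of the model's forward-only memory (entity to description), and answering a reversed query means trying entities one by one. The function and the synthetic `person_i` and `description_i` facts are hypothetical stand-ins, not data from the talk.

```python
def reverse_by_enumeration(recall_forward, known_entities, target_description):
    """Find the entity whose recalled description matches, by brute force."""
    for steps, entity in enumerate(known_entities, start=1):
        # Each step is one forward recall, the only direction memory supports.
        if recall_forward(entity) == target_description:
            return entity, steps
    return None, len(known_entities)


# Hypothetical synthetic facts: entity -> description (forward direction only).
facts = {f"person_{i}": f"description_{i}" for i in range(3000)}
entity, steps = reverse_by_enumeration(facts.get, list(facts), "description_2999")
print(entity, steps)  # person_2999 3000
```

The number of recall steps grows linearly with the number of stored entities, which is tolerable for a few thousand synthetic names and implausible for the entities a deployed model knows about.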

In response to a question about whether this differs from augmentation, Lampinen said the key distinction is when the work happens and where the information is. Augmentation happens at train time, when the documents are present and the model only has to generate plausible consequences. Reinforcement-learned test-time thinking requires the model to pull out the relevant information on demand. For syllogisms, this can be feasible because links can be followed. For reversals, the model may have to search broadly through memory.

The three methods therefore occupy different points in a cost and generality trade-off.

Method	Where the extra work happens	What Lampinen said it can do	Main limitation
Train-time augmentation	Training time	Can match or exceed in-context learning on questions about augmented data	Requires anticipating useful future inferences and spending substantial extra compute
Test-time retrieval	Test time	Can perform as well as in-context learning when relevant memories are retrieved	Requires longer inference contexts and does not solve general retrieval
Test-time thinking via RL	Training time for RL and test time for longer reasoning traces	Generalizes beyond explicitly augmented or retrieved distributions for many structures	Much worse than in-context learning on some structures, especially reversals

Lampinen’s comparison of three ways to exploit in-context reasoning for better generalization

The analogy to hippocampus and cortex is computational, not literal

Andrew Lampinen connected the language-model results to a broader cognitive-science question: whether natural intelligence faces analogous generalization problems, and whether the brain uses complementary systems to bridge them.

His proposed mapping was rough. Neocortical learning resembles parametric learning: slow, integrative, able to extract statistical structure across many experiences, but potentially tied to the format and task in which information was learned. Hippocampal episodic memory resembles a nonparametric store of specific experiences: richer in detail, more rapidly written, and available for later retrieval when a new problem makes a past experience relevant.

He pointed to neuroscience work on hippocampal replay to suggest that natural intelligence may use both offline and online strategies. Replay does not simply reproduce experience as it occurred. The hippocampus can represent alternative possibilities, future problems, and reorganized sequences that may support planning and generalization. In Lampinen’s terms, this looks partly like offline augmentation. Online replay or retrieval may also help solve current problems by bringing relevant experiences into an active context.

The resulting picture is not that one system dominates. Parametric consolidation supplies cached facts, statistical regularities, and procedures for reasoning flexibly. Episodic memory preserves experiences in enough detail that information not known to be important at encoding time can still be used later. In language-model systems, parameters and context can serve analogous roles, especially when retrieval-augmented methods put past documents back into the context window.

Asked where the analogy between Transformers and brains breaks down, Lampinen was direct: the implementation is different. The hippocampus is not doing Transformer key-query-value attention. But at a computational level, both can be described as retrieving information from a sequence or store of past states through similarity-based lookup, and that makes the comparison interesting.

He also emphasized scale and generativity. A language model cannot fit a whole lifetime of experience into its context. The hippocampus, in interaction with cortex, preserves a much larger array of information, though not as an independent literal archive for an entire life. Human episodic memory is also generative and confabulatory: many remembered “episodes” did not happen exactly as recalled, and repeated recollection can strengthen constructed memories. Lampinen suggested that this generative nature may make hippocampal memory more flexible in some ways than Transformer attention, even though in-context learning can also hallucinate.

Statistical shortcuts are not just bugs

A recurring tension in the questions was whether models should be trained toward abstract, structure-only generalization. One audience member asked whether, in theory, a model could be pretrained on completely abstract sequences and acquire all the generalization capabilities it needed without relying on real-world corpora.

Andrew Lampinen said there is some value in this direction: work has shown that pretraining on formal languages can improve later learning on real language. But purely abstract training would miss something essential. Statistical structure is not merely a source of spurious shortcuts; it is often what makes inference tractable.

He connected this to earlier work on content effects in reasoning. Language models’ reasoning about logical structures depends on the content of those structures. Humans show similar content effects on classic reasoning problems. Lampinen’s interpretation is normative: if the goal is accurate inference in the world, a system should use all available information, not only abstract logical form. If formal reasoning is hard and integrating statistical information is relatively easy, heuristics based on content can improve expected performance.

This is also why failures of latent generalization should not be read as a simple indictment of parametric learning. Parametric learning’s integration of statistical structure lets models generalize around the absence of explicit symbolic inference in many cases. The same mechanism can produce wrong answers in particular cases, but without it the system would face a combinatorial explosion of possible inferences. Lampinen compared a purely structure-based approach to older symbolic AI approaches that were not tractable for real-world problems with too many possible implications.

The challenge, then, is not to eliminate statistical generalization. It is to preserve its benefits while making latent structure more accessible when the statistical cues are absent, misleading, or insufficient.

Reliability depends on reconciling sources of information

Several questions pushed the implications for deployed systems. One asked whether context could ever match pretraining for behavioral reliability, such as following a constitution or system prompt.

Andrew Lampinen gave a conditional answer. The experiments he described involved consistent information in context, and models used it effectively. Reliability becomes harder when context contains inconsistent sources: a constitution says one thing, a user demands another, and the model’s parameters contain still other learned patterns. Because language models are strong in-context learners, they can adapt to strange or adversarial contexts. That makes it difficult to guarantee reliability from either parametric training or context alone.

Another question asked how to think about continual learning: updating weights in real time while also maintaining episodic memories. Lampinen returned to the hippocampus-cortex analogy. Slow cortical learning helps integrate information across many experiences without interference, supporting statistical generalization. But a system also needs fast learning from single events — if one almost gets into a car accident, it should not require 50,000 repetitions to learn. Episodic memory can store an experience rapidly, then replay it over time so it is integrated gradually with other knowledge.

By analogy, making a large parameter update from one experience may damage other model knowledge if the update is strong enough to force generalization. A better approach may be to store the experience explicitly, retrieve it when needed, or replay it over time. Lampinen also suggested that models may be too context-sensitive in what they learn: a feedback signal such as “I don’t like this kind of response” may be applied narrowly to a specific type of prompt rather than generalized to nearby situations.

Asked about context length, parameter count, and sparsity, he said context effects are model-dependent and content-dependent. With one relevant fact in context, models usually get it right. With hundreds of thousands of tokens, performance remains decent but varies by problem. In one ablation, replacing plausible names in the reversal curse dataset with nonsense words substantially reduced performance with the same dataset in context. Tokenization differences may explain part of the effect, but Lampinen thought the effect was too large to be only that. Retrieval from context is content-sensitive: models are better at pulling out information of types they have often seen in pretraining, such as real entities, than arbitrary nonsense strings.

On parameters, his short answer was that larger models are better at these tasks, as in many other settings. On sparsity — mixture-of-experts versus dense models — he had no direct experiments. He would not expect a large difference if sparsity is mainly in MLP layers rather than attention, because attention is central to the in-context effects, but he did not rule it out.
