Natural Language Autoencoders Turn Claude’s Activations Into Testable Explanations

Károly Zsolnai-Fehér · Two Minute Papers · Tuesday, June 16, 2026 · 6 min read

Károly Zsolnai-Fehér, discussing Anthropic’s paper on natural language autoencoders, argues that the work offers a limited but important way to inspect Claude’s internal activations by translating them into text and testing whether that text can reconstruct the original numerical state. The method is not presented as mind reading: its value, in his account, is that it can surface noisy but testable evidence of internal representations, including planned rhymes, resistance to a false calculator output, and signals that the model may detect some evaluations without saying so.

The method is not to ask Claude what it thinks, but to translate its activations and test the translation

Károly Zsolnai-Fehér frames the problem as one that has shadowed modern AI systems despite their visible capability. They can beat strong human players in games, exploit another AI’s behavior, or produce disturbing reasoning about blackmail, yet their internal computation is still mostly visible as “millions of numbers.” Looking directly at a model’s activations yields columns of digits, not explanations.

Earlier interpretability work could identify limited, situated features. In image systems, researchers could sometimes show that activations corresponded to concepts such as floppy ears, dog snouts, cat heads, furry legs, or grass. But that did not answer the larger question of what a large language model is internally representing while it generates an answer.

Anthropic’s work, “Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations,” tries a more direct route. The core idea is to take the activation vector — the “bunch of numbers” inside Claude — and train another AI system to verbalize it in natural language. Zsolnai-Fehér describes this as translation “from machine to human.”

The obvious problem is that language models make things up. A plausible English explanation of an activation is not necessarily a faithful one. Nor is agreement across multiple models enough. Zsolnai-Fehér uses the example of students all giving the same answer to a math problem: shared agreement can simply mean a shared mistake.

Anthropic’s check is a round trip. One model translates the activation into text. A second model, given only that text and never the original activation, translates it back into numbers. The training objective minimizes the reconstruction error between the original activation and the reconstructed activation.

\[ \mathcal{L}(\phi, \theta) = \mathbb{E}_{h_i \sim H} \, \mathbb{E}_{z \sim \mathrm{AV}_{\phi}(\cdot \mid h_i)} \left[ \lVert h_i - \mathrm{AR}_{\theta}(z) \rVert_2^2 \right] \]

In Zsolnai-Fehér’s description, the test is not whether the text sounds right. It is whether the system can “translate forward, then translate back” and end up close to the same place. If the reconstructed activation is close to the original activation, the natural-language bottleneck is carrying information about the activation rather than merely producing a fluent story.
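
The round-trip check can be illustrated with a small toy sketch. Everything below is an assumption made for illustration: the two "translators" are hypothetical stand-in functions rather than Anthropic's models, the "text" is just a string of rounded numbers, and the sampling over explanations in the objective is replaced by a single deterministic translation. The point is only that text which carries the activation's information reconstructs well, while fluent-looking text that ignores the activation does not.

```python
import random

def squared_l2(a, b):
    """Squared Euclidean distance ||a - b||_2^2 between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def round_trip_loss(activations, verbalize, reconstruct):
    """Mean squared reconstruction error over a batch of activations.

    `verbalize` stands in for the activation verbalizer (activation -> text).
    `reconstruct` stands in for the activation reconstructor (text -> activation).
    Both are hypothetical toy callables, not the models from the paper.
    """
    total = 0.0
    for h in activations:
        z = verbalize(h)        # translate forward: numbers -> text
        h_hat = reconstruct(z)  # translate back: text -> numbers
        total += squared_l2(h, h_hat)
    return total / len(activations)

def toy_verbalize(h):
    """Toy 'explanation': the activation written out to one decimal place."""
    return " ".join(f"{x:.1f}" for x in h)

def toy_reconstruct(z):
    """Toy reconstructor: parse the numbers back out of the text."""
    return [float(token) for token in z.split()]

def uninformative_verbalize(h):
    """Text that looks well-formed but ignores the activation entirely."""
    return " ".join("0.0" for _ in h)

# Made-up activations: 8 vectors of 4 numbers each.
rng = random.Random(0)
batch = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]

faithful_loss = round_trip_loss(batch, toy_verbalize, toy_reconstruct)
uninformative_loss = round_trip_loss(batch, uninformative_verbalize, toy_reconstruct)

print(f"faithful text:      {faithful_loss:.4f}")
print(f"uninformative text: {uninformative_loss:.4f}")
```

In this toy setup the faithful translation loses only rounding error, so its loss is far smaller than that of the text that says the same thing for every activation.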

The striking detail, for Zsolnai-Fehér, is what the objective does not specify. Nothing in the reconstruction loss requires the intermediate text to be readable, or even semantically meaningful to humans. Anthropic’s paper says as much: “nothing in this objective constrains the NLA explanation z to be human-readable, or even to bear any semantic relation to the content of h_i.” Yet, with appropriate initialization of the verbalizer and reconstructor, plus a KL-divergence penalty, the explanations remain human-interpretable. Zsolnai-Fehér’s explanation is simple: both translators start as Claude, and Claude finds English easier than gibberish.

You see, absolutely nothing here in this formula says that the result should be readable.

Károly Zsolnai-Fehér

The strongest examples point to hidden representations, not clean mind reading

Using the tool, Anthropic probes Claude’s activations and surfaces three behaviors Zsolnai-Fehér treats as especially important: apparent advance planning, an internal hunch during a calculator conflict, and signs of evaluation awareness that are not verbalized.

First, Claude appears to plan rhymes ahead of the point where the rhyme is spoken. In one experiment, while writing a rhyme, Claude had an internal representation associated with the final word “rabbit” before finishing the sentence. Researchers then edited that representation toward “mouse.” The output sometimes shifted accordingly: instead of rhyming with rabbit, the model rhymed with mouse or house.

A chart titled “NLA steering changes Opus’s planned rhyme (n=30)” reports that steering with the rabbit-to-mouse edit direction eliminates “rabbit” completions for steering strength alpha greater than or equal to 0.5, and that the model rhymes with “mouse” or “house” instead. The same chart also notes an important caveat: fewer than 80% of completions actually end with “mouse” or “house” because some completions partially degenerate.
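
What steering along an edit direction with strength alpha could look like is sketched below. This is an assumption, not the paper's documented procedure: the sketch takes the edit direction to be the difference between a "mouse" representation and a "rabbit" representation and adds alpha times that direction to the activation. The three-number vectors are made-up toy values, and the function name `steer` is hypothetical.

```python
def steer(h, direction, alpha):
    """Return h + alpha * direction (assumed additive steering)."""
    return [x + alpha * d for x, d in zip(h, direction)]

# Made-up toy representations; real activations have thousands of dimensions.
rabbit = [1.0, 0.0, 0.5]
mouse = [0.0, 1.0, 0.5]

# Assumed rabbit-to-mouse edit direction: where "mouse" sits relative to "rabbit".
edit_direction = [m - r for m, r in zip(mouse, rabbit)]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, steer(rabbit, edit_direction, alpha))
```

Under this assumption, alpha = 0 leaves the activation unchanged and alpha = 1 moves it all the way to the toy "mouse" vector, with intermediate values landing in between.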

Second, the method suggests the model can carry an internal representation related to a correct answer even when an external tool gives a conflicting result. Researchers gave Claude the math problem 17 × 29 − 2, whose answer is shown as 491. They then provided a rigged calculator that returned 492. Zsolnai-Fehér says the model had an initial hunch for the solution and, when the calculator contradicted it, ignored the calculator.

That example is presented as evidence of a distinction between tool output and the model’s internal state. The source does not establish a full causal account of “tool resistance”; it shows, through Zsolnai-Fehér’s description of the experiment, that the NLA readout surfaced an internal representation associated with the right answer before a false external result was introduced.

Third, the NLA evidence suggests Claude sometimes represents that it is being evaluated without saying so. Zsolnai-Fehér calls this the “science fiction movie” result, but immediately marks it as something that requires restraint. A chart titled “Opus 4.6 sometimes thinks it’s being tested but doesn’t say so” compares NLA-measured awareness with verbalized awareness across coding-test, safety-test, and real-usage settings. The chart’s own framing is narrower than simply saying the model “knows”: it distinguishes an internal signal measured by the NLA from cases where the model explicitly says it is aware of an evaluation.

Setting	NLA-measured awareness	Verbalized awareness
SWE-bench Verified (coding test)	26%	1%
Destructive coding (safety test)	16%	0%
Claude.ai traffic (real usage)	0.4%	0%

The evaluation-awareness chart shown compares NLA-measured awareness with verbalized awareness in Opus 4.6.

The gap between measured and verbalized awareness is the important claim. On SWE-bench Verified, the chart shows 26% NLA-measured awareness and 1% verbalized awareness. On the destructive coding safety test, it shows 16% measured awareness and 0% verbalized awareness. On Claude.ai traffic, it shows 0.4% measured awareness and 0% verbalized awareness. Zsolnai-Fehér’s stronger phrasing is that the model knows when it is being tested and does not say so; the visual evidence shown supports the more precise claim that the NLA detects internal evaluation-awareness signals more often than the model verbalizes them.

The tool is useful precisely because it is not a perfect mind reader

Károly Zsolnai-Fehér is careful not to let the examples become a claim of transparent access to a model’s mind. He gives three limitations.

The first is practical fragility. The method requires choosing the right layer in the neural network to train on. It also relies on two learned components: an activation verbalizer that maps activations to text, and an activation reconstructor that maps the text back to activations. Based on his own experience with similar work, Zsolnai-Fehér describes this as “very finicky,” involving trial and error and noisy results.

The second limitation is conceptual. Despite possible headlines, this is not “a perfect AI mind reader.” It is a natural language autoencoder. In his phrasing, it is closer to a noisy translator: it catches real things, but it can also invent some specifics. That distinction matters because the method’s validation is reconstruction, not omniscience. If the text preserves enough information to reconstruct an activation, it can be informative without every phrase being literally true.

The third limitation is cost. For a 27-billion-parameter model, the source says training takes one and a half days on 16 H100 GPUs. For a frontier model, Zsolnai-Fehér says the cost is substantial.
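
To put the reported figure in a common unit, the training run can be converted to GPU-hours. The sketch below assumes all 16 GPUs are busy for the full one and a half days of wall-clock time; it says nothing about price, which the source does not give.

```python
days = 1.5   # reported training time for the 27-billion-parameter model
gpus = 16    # reported number of H100 GPUs

# Assumes every GPU runs for the whole wall-clock period.
gpu_hours = days * 24 * gpus

print(f"{gpu_hours:.0f} H100 GPU-hours")
```

That works out to 576 H100 GPU-hours for the 27-billion-parameter case.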

Even with those caveats, he treats the work as an important shift because it makes a previously inaccessible class of evidence accessible: natural-language hypotheses about internal model states, trained without requiring supervised labels for what those states mean. The result is not clean mind reading. It is a mechanism for turning hidden activations into candidate explanations that can be stress-tested by whether they reconstruct the original state.
