Shannon’s Entropy Limit Frames Language Models as Text Compressors

Grant Sanderson3Blue1BrownSunday, June 7, 202613 min read

Grant Sanderson’s 3Blue1Brown video uses the question of how far English can be compressed to rebuild Shannon’s definitions of information and entropy. Sanderson argues that prediction and compression are mathematically equivalent: a good language predictor is, in principle, a good text compressor, and Shannon’s estimate of roughly one bit per English character frames the limit such systems are trying to approach. The result is a narrower version of the slogan “compression is intelligence”: not a definition of intelligence, but an explanation of why compression theory sits so close to modern language-model training.

The compression question becomes a prediction question

Text can be encoded wastefully or efficiently, but the harder question is whether there is a fundamental limit to how efficient it can get. ASCII uses a fixed eight bits for each character. Huffman coding improves on that by assigning shorter bit strings to common characters, bringing the average shown in the source to roughly 4.2 bits per character. Neural methods are described as doing better still, around 1.1 bits per character in the motivating example, by using patterns across longer sequences of text.

Method or estimate	Bits per character	What it represents
ASCII	8.0	Fixed-width character encoding
Huffman	~4.2	Shorter codes for more common characters
Neural methods	~1.1	Visual example using long-range text patterns
Shannon estimate for English	≈1.0	Estimated entropy with at least 100 preceding letters of context

The comparison motivating the question of language’s fundamental compressibility

The number is not the main point. Grant Sanderson uses the question of text compression to reconstruct why information theory needed the definitions of information and entropy in the first place. Claude Shannon’s work in the 1940s is not treated merely as historical background, but as the mathematical root of a question that now reappears in machine learning: large language models are usually described as being trained to predict the next token, using “cross-entropy” loss. That term comes from information theory.

The key claim is that prediction and compression are mathematically equivalent. They are “two sides of the same coin.” If so, the pre-training objective for a language model can be reframed: not merely next-token prediction, but construction of an efficient text compressor.

That reframing is why the slogan “compression is intelligence” is treated carefully. Sanderson says the phrase is hard to judge rigorously because intelligence is too ill-defined. The safer claim is narrower: the mathematical theory of compression is unusually relevant to artificial intelligence. The path to that claim starts with the limit of compression and with the definitions Shannon’s theory made central.

A toy robot makes the coding constraint visible

A deliberately simple signal carries the essential structure of the problem. A robot on a distant moon receives one of four movement instructions: up, down, left, or right. The instructions are not equally likely. Half are up, one quarter are down, one eighth are left, and one eighth are right. Each instruction is sampled independently of the previous ones.

If transmission is costly and slow, the problem is to encode those instructions as efficiently as possible in bits. The straightforward answer uses two bits per instruction: 00, 01, 10, and 11. That is easy for the robot to decode, since it can split the incoming bit stream into fixed two-bit chunks.

The better answer takes advantage of the unequal probabilities. Assign the most common instruction, up, to the one-bit code word 0. Assign down to 10. Assign left to 110 and right to 111. The average number of bits per instruction is then:

rac{1}{2}cdot 1 + rac{1}{4}cdot 2 + rac{1}{8}cdot 3 + rac{1}{8}cdot 3 = 1.75

That beats the fixed two-bit scheme. The saving comes from spending only one bit on the most common instruction and accepting longer code words for rarer instructions.

The catch is decoding. Once different symbols have different code-word lengths, the receiver must know where one instruction ends and the next begins. The proposed code works because no code word is the prefix of another. If the robot reads a 0, it can immediately decode “up,” because no other code word begins with 0. If it reads 1, it waits. A following 0 gives 10, which is “down.” If it instead reads 11, it still waits, and the next bit distinguishes 110 from 111.

That property is called a prefix-free code, or more commonly, a prefix code. Sanderson’s visualization of all binary strings makes the constraint concrete. One-bit strings sit on the first layer, two-bit strings above them, and longer strings keep branching upward. Any chosen string claims not only its own position but also every longer string above it that begins with the same prefix. Choosing 0 as a code word consumes the entire half of the space beginning with 0. Choosing 10 consumes a quarter. Choosing 110 and 111 consumes one eighth each. For the robot distribution, those consumed regions match the probabilities exactly: 1/2, 1/4, 1/8, 1/8.

Instruction	Probability	Code word	Code length
Up	1/2	0	1 bit
Down	1/4	10	2 bits
Left	1/8	110	3 bits
Right	1/8	111	3 bits

The prefix-free robot code aligns code-space consumption with instruction probabilities

This alignment suggests more than cleverness. It suggests perfection. But Sanderson pauses on the burden of that claim: how can we know there is not a more elaborate scheme that encodes long sequences of instructions and achieves an even lower average?

Perfect compression should look like random noise

The theoretical move is to argue from the other end: random noise should be incompressible, so a perfect compressor should produce output indistinguishable from random noise. Here “random noise” means that each bit is independently 0 or 1 with probability 50%.

The robot code has this property. Half the time, the first bit is 0 because the instruction is up; otherwise it is 1. Conditional on seeing a first bit of 1, half of the remaining probability is down, so the next bit is equally likely to be 0 or 1. Continuing the same reasoning, each new bit behaves like an independent coin flip. To someone looking only at the compressed bit stream, it looks random.

The incompressibility argument is made from the receiver’s perspective. A string of $n$ bits is one among $2^{n}$ possible strings of that length. If the compressed bit stream is truly random-looking, all those $2^{n}$ strings are equally likely. Therefore each corresponding underlying message has probability:

P(m_i)= rac{1}{2^n}

Trying to improve on that by giving one of those equally likely messages a shorter encoding creates a trade-off. In the binary-string diagram, moving a message to a shorter string means it occupies a larger region of prefix space. Other messages must move out. At minimum, saving one bit for one message pushes other messages to longer encodings. Sanderson describes it as pushing down a bump in a rug only to see it pop up worse somewhere else.

For equally likely messages, the optimal encoding gives each message the same number of bits. That observation leads directly to the central formula. In a perfect compression scheme, a message encoded with $n$ bits has probability $2^{- n}$ . Solving for $n$ gives:

p = 2^{- n} q u a d R i g h t a r r o w q u a d - l o g_{2} (p) = n

Sanderson calls the negative logarithm “the fundamental formula for all of information theory.” It is not introduced as an arbitrary definition but as what falls out when perfect compression is required to produce random-looking output.

Information is measured by how many halvings locate an event

Shannon defined the information of an event with probability $p$ as:

I = - l o g_{2} (p)

One intuitive reading is: how many times must the space of possibilities be cut in half to isolate an event of that probability? An event with probability 1/2 carries 1 bit of information; probability 1/4 carries 2 bits; probability 1/8 carries 3 bits. Sanderson notes that the same expression can be written as $l o g_{2} (1/ p)$ , and, less conventionally, as $l o g_{1/2} (p)$ .

Unlikely messages carry more information. Highly expected messages carry very little. But the point is not merely to rescale probabilities into a new unit. In a perfect compression scheme, the number of bits assigned to a full message is exactly its information content. More generally, when perfect compression is not possible, the information content gives a lower bound on how much compression can achieve at least when averaged over possible messages. Sanderson is not claiming that every individual message, in isolation, has an absolute compression floor; a compressor can be overfit to a particular case. The bound is about the average behavior of a coding scheme across the possible messages.

This is where fractional bits become meaningful. For the toy robot, all probabilities are powers of two, so the information values are whole numbers: 1 bit, 2 bits, 3 bits. Language does not behave that way. In the example, a small local GPT assigns probabilities to successive letters in a phrase, producing fractional information values such as 4.19 bits for one letter and 0.03 bits for a highly predictable one.

No literal code word is 4.19 bits long. The meaning becomes clear only when encoding full messages. The probability of a phrase is the product of the probabilities of each successive letter, each conditioned on the preceding context:

P (e x t p h r a se) = p_{0} p_{1} p_{2} c d o t s p_{n}

Taking negative log base 2 turns that product into a sum:

- l o g_{2} P (e x t p h r a se) = - l o g_{2} (p_{0}) - l o g_{2} (p_{1}) - c d o t s - l o g_{2} (p_{n})

That is the operational reason fractional information values are useful. They add cleanly across successive events, even though actual encoded bit strings ultimately have whole-number lengths. Sanderson says a later part will show a specific compression algorithm that can encode text within one or two bits of this total information value. The immediate lesson is that information theory works at a higher abstraction where information is continuous, while physical bit strings remain discrete.

Language makes the probability model the hard part

For natural language, the mathematical expression is only as good as the probability distribution behind it. Grant Sanderson explicitly flags the issue: using a language model’s probabilities in an animation is not necessarily the same as using the true probabilities of language. More fundamentally, it is not obvious what “the true probabilities of language” would even mean.

Shannon’s early work addressed the problem without modern language models. One approach examined short character sequences, or N-grams. If one scans books and records what tends to follow the letters “TH,” one can estimate a distribution over next letters such as E, A, I, O, or R. That works for short contexts. It breaks down for long strings, because many long strings never appear in the analyzed books.

But long contexts are precisely where language becomes most predictable, and therefore where the greatest compression is possible. A string may never have appeared in a corpus, yet a human can still have strong expectations about what should follow it.

Shannon therefore used human prediction as a way to probe the information content of English. In one story, he asked his wife Betty Shannon to guess each next letter from a book. When she guessed incorrectly, he wrote the correct letter. When she guessed correctly, he wrote a dash. The resulting transcription used fewer actual letters than the original, but it preserved enough information in a specific sense: if Shannon could somehow use an exact duplicate of Betty to replay the guessing process, the reduced text would provide the prompts needed to reconstruct the original.

That experiment conveyed the compression value of predictability, but it was not a precise measurement. In his 1950 paper “Prediction and Entropy of Printed English,” Shannon used a more robust design. He interviewed more people and recorded not just whether a guess was right or wrong, but how many guesses were required to identify the correct next letter. He then associated the number of guesses with an implicit probability assigned to the true next letter.

Sanderson’s point is that Shannon was not simply computing statistics over a corpus. He was probing a model of language, namely a human brain. The interviewees were treated as black boxes with sophisticated but hard-to-describe expectations about language. In Sanderson’s framing, the modern change is that “we have gone from merely interrogating black boxes that process language, to designing them.”

Entropy is average information per symbol

Once information is defined as $- l o g_{2} (p)$ , entropy is the next expression to rediscover. For a signal made of symbols — robot instructions, English characters, or anything else — entropy asks for the average information per symbol.

For a fixed distribution with probabilities $p_{i}$ , the entropy is:

H (p) = s u m_{i} p_{i} l e f t (- l o g_{2} (p_{i}) i g h t)

This is the same calculation performed earlier for the robot code. Because the robot’s encoding was perfect, each symbol’s code-word length equaled its information content. The weighted average of code-word lengths was also the average information per symbol: 1.75 bits.

Sanderson presents the expression geometrically as probability width times information height. Imagine horizontal probability bars whose total width is one. Above each bar sits a rectangle whose height is the information value, $- l o g_{2} (p_{i})$ . The entropy is the total area of those rectangles, summed over all symbols. That visualization matters because it keeps the formula tied to the compression story: each probability contributes its share of the average, and each height is the number of bits implied by that probability.

The naming story is handled with some caution. John von Neumann supposedly advised Shannon to call the quantity “entropy,” partly because it resembled the entropy expression in statistical mechanics and partly because “Nobody knows what entropy really is, so, in an argument, you’ll always have the advantage.” Sanderson says he looked into the story and regards it as probably apocryphal, though with “a grain of truth.”

Whatever the naming story, Shannon denoted the quantity by $H$ . Its qualitative interpretation is uncertainty: more evenly spread distributions have higher entropy, while distributions dominated by one highly probable event have lower entropy. If the probability space is divided among more possible symbols, entropy can rise because each individual outcome carries more information.

But Sanderson emphasizes the precise interpretation over the vague one. For the fixed-distribution setting just described, entropy gives the lower bound on the average number of bits per symbol needed to encode messages following that distribution. That is the substance of Shannon’s noiseless coding theorem as presented here: no encoding can be more efficient than the entropy limit, and it is possible to get arbitrarily close to that limit.

The vague, qualitative intuition here is that entropy measures the amount of uncertainty in a distribution. But the more precise understanding, justifying why it deserves to be given in units of bits, is that it describes the minimum number of bits per symbol necessary to encode a message following this distribution.

Grant Sanderson · Source

English requires an entropy rate, not just an entropy formula

The entropy formula above applies cleanly when each new symbol follows the same distribution, as in the robot example. English does not. The probability of each new character depends heavily on context. Shannon’s broader target was therefore the entropy rate of a stochastic process: the average information per symbol when averaging over whole messages rather than over repeated independent draws from one fixed distribution.

The entropy-rate expression is:

lim_{n oinfty} rac{1}{n}H(X_1,X_2,ldots,X_n)

The question remains the same: what is the average information per symbol? But for language, Sanderson says this is not something one can calculate exactly from a neat formula. There is no simple probability distribution that “describes language.” Reference to the entropy of language is therefore already beyond exact calculation and into estimation.

Shannon’s estimates came from observation and human prediction. When his interviewees had at least 100 preceding letters of context, he estimated the entropy of English to be about one bit per character. Sanderson calls this “a pretty wild number,” because it suggests that English could, in principle, be compressed to roughly one yes-or-no answer per character.

≈1.0 bit/ch

Shannon’s estimated entropy of English with at least 100 preceding letters of context

That estimate gives the opening comparison its theoretical shape: the deep question is not just how clever a compressor is, but how well it captures the structure that makes the next character predictable. ASCII ignores that structure. Huffman coding uses simple frequency differences. Neural methods use longer-range patterns. Shannon’s human-prediction experiments treated people as models with rich contextual expectations, long before those expectations could be approximated by artificial systems.

Cross-entropy is introduced as the next mathematical step, not as the same thing as entropy or entropy rate. Entropy, in the fixed-distribution setting, is the average information under the distribution itself. Entropy rate extends that idea to a context-dependent process such as language. Cross-entropy asks what happens when one distribution is used to encode or predict data generated by another distribution, which is why it becomes central to training large language models. Sanderson sets that up as the bridge from compression limits to modern machine learning, along with related topics such as model distillation and the ability of gzip to recover structure between distinct languages.

AI Research Methods