KV Cache Movement Has Become the Core Inference Bottleneck
Stanford’s CS336 lecture on inference, taught by Percy Liang with Tatsunori Hashimoto, argues that serving language models is now a core systems problem rather than an afterthought to training. Liang’s central claim is that autoregressive Transformer generation is sequential and often memory-bound, especially because attention must repeatedly move KV-cache data rather than perform dense, easily parallelized computation. The lecture treats batching, grouped-query and latent attention, quantization, pruning, speculative decoding, continuous batching, and PagedAttention as different attempts to move fewer bytes, reuse memory better, or trade latency for throughput without degrading model quality too much.

Inference is no longer the afterthought after training
Percy Liang frames inference as the problem that begins after training: a trained model receives a prompt and must produce a response “as accurately and as quickly as you can.” The simplicity of that definition hides why inference has become a central systems problem. Training is expensive, but it is a one-time cost. Inference is paid repeatedly: in chatbots, code completion, agents, batch data processing, model evaluation, and reinforcement learning rollouts.
Liang’s motivating comparison is token volume. OpenAI is estimated to process roughly 8.6 trillion tokens per day, while DeepSeek V4 was trained on 32 trillion tokens. Liang’s point is not that the workloads are identical, but that daily inference token volume can reach the scale of a major training run in only a few days.
The importance of inference rises further in agentic settings. In a chatbot, most generated tokens are intended for human reading, and humans are the bottleneck. In an agent, a query can trigger internal traces, reasoning, tool calls, and introspection before any final human-facing output. The blunt accounting is: “Tokens generated = compute spent.” Unlike a chatbot response that only needs to stream at a human-readable pace, an agent may find value in generating far more internal tokens if those tokens buy better work.
Three speed metrics matter, and they do not always point in the same direction. Time-to-first-token is the wait before any generation appears; this matters most in interactive products because a blank wait is bad user experience. Latency, measured as seconds per token, describes how quickly tokens appear for a single query. Throughput, measured in tokens per second, describes how quickly many queries are processed. Latency and throughput often improve together under some interventions, but they can also trade off sharply, especially through batching.
The core difference from training is sequential generation
The main systems difference between training and inference is not the model architecture; it is the dependency structure of the work. During supervised training, all tokens in a sequence are visible. Transformer computations can parallelize across the sequence dimension: attention and MLP operations become large tensor operations. During autoregressive inference, output tokens must be generated one at a time. The next token depends on the previous generated tokens, so the sequence dimension no longer provides the same parallelism.
Percy Liang states this as the high-level fact to remember: training can parallelize over the sequence, while inference cannot parallelize over generation. That is why inference is “a very different workload than training,” and why it is harder to achieve high arithmetic intensity or fully use accelerator compute.
The Transformer notation makes the bottleneck precise. The lecture uses B for batch, T for sequence length, D for model dimension, H for head dimension, L for number of layers, S for key/value sequence length, N for query heads, K for key/value heads, and G for query heads per key/value head. In training, S and T are the same; in generation, T becomes 1 while S is the history being conditioned on.
A student catches a notation issue around grouped-query attention: whether keys and values are split across G groups or K groups. Liang checks the definitions and agrees that K should be the number of groups and G the number of heads per group, saying he will fix it later. The correction matters because the later KV-cache arguments depend on how many key/value heads are stored.
Arithmetic intensity explains why generation is memory-bound
Percy Liang reviews arithmetic intensity as compute per byte transferred. The simple warm-up is multiplying X with shape B × D by W with shape D × F. With bf16 values, reading X costs 2BD bytes, reading W costs 2DF bytes, writing Y costs 2BF bytes, and the matmul costs 2BDF FLOPs. When B is much smaller than D and F, the arithmetic intensity simplifies to B.
That simplification is used to compare the computation to accelerator intensity. For an H100, the lecture computes accelerator intensity from 989e12 FLOPs per second divided by 3.35e12 bytes per second of memory bandwidth, giving about 295. If the operation’s arithmetic intensity is above that, it is compute-bound; if below, it is memory-bound. In the warm-up matmul, the operation becomes compute-bound only if B is greater than 295. The extreme case B = 1 is a matrix-vector product with arithmetic intensity 1, which is memory-bound. Liang says that is “basically what happens with inference”: the model is often asked to process very thin tensors rather than large square-ish matrices.
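A minimal sketch of that warm-up calculation, using the lecture's bf16 byte counts and the H100 numbers quoted above (the function name and the example shapes are illustrative, not from the lecture):
```python
# Arithmetic intensity of Y = X @ W with X: (B, D), W: (D, F), all stored in bf16 (2 bytes).
def matmul_intensity(B: int, D: int, F: int) -> float:
    flops = 2 * B * D * F                             # multiply-accumulate count
    bytes_moved = 2 * B * D + 2 * D * F + 2 * B * F   # read X, read W, write Y
    return flops / bytes_moved

H100_INTENSITY = 989e12 / 3.35e12  # peak bf16 FLOP/s over memory bandwidth, about 295

for B in (1, 64, 512):
    ai = matmul_intensity(B, D=5120, F=13824)
    regime = "compute-bound" if ai > H100_INTENSITY else "memory-bound"
    print(f"B={B:4d}: intensity ~ {ai:.1f} ({regime})")
# When B is much smaller than D and F the intensity is approximately B,
# so this matmul only becomes compute-bound once B exceeds roughly 295.
```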
The KV cache is the basic fix for the naïve generation algorithm. Without caching, generating each token would require feeding the entire history back into the Transformer. Because each forward pass has attention work that grows quadratically in sequence length, generating T tokens naïvely costs O(T³) FLOPs. But in a causal Transformer, the key/value activations for previous tokens do not change when new tokens are appended. The system can store those key/value vectors in high-bandwidth memory and reuse them.
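A toy single-head sketch of that reuse, assuming NumPy and random stand-in weights; the only point is that each decode step appends one key and one value row and reads back all earlier ones rather than recomputing them:
```python
import numpy as np

D = 64  # toy head dimension
Wq, Wk, Wv = (np.random.randn(D, D) * 0.02 for _ in range(3))

def decode_step(x, k_cache, v_cache):
    """One decode step for a single new token's hidden state x of shape (D,)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)            # extend the cache instead of re-encoding the history
    v_cache.append(v)
    K = np.stack(k_cache)        # (S, D): all keys generated so far
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(D)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V             # attention output for the newest position only

k_cache, v_cache = [], []
for _ in range(5):               # five decode steps, each reusing the stored K/V rows
    x = np.random.randn(D)       # stand-in for the current token's hidden state
    out = decode_step(x, k_cache, v_cache)
```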
| Stage | What happens | Parallelism | Bottleneck implication |
|---|---|---|---|
| Prefill | Encode the prompt and populate the KV cache | Parallelizable over the prompt, like training | Easier to make compute-bound |
| Generation | Produce new response tokens one at a time while extending the KV cache | Sequential over generated tokens | Often memory-bound, especially in attention |
For MLP layers, under the lecture’s assumptions, arithmetic intensity scales as B × T. During prefill, large batches and long sequences can make this favorable. During generation, T = 1, so the MLP intensity becomes B. That can still be workable if there are enough concurrent requests.
Attention is worse. With FlashAttention-style accounting, attention intensity is S × T / (S + T). During prefill, where T = S, that becomes S / 2, which can be good for long prompts. During generation, where T = 1, it becomes S / (S + 1), which is less than 1. That is far below the H100 comparison point of about 295.
| Computation | Prefill intensity | Generation intensity | Why it matters |
|---|---|---|---|
| MLP | B × S | B | Batching can reuse shared MLP weights across requests |
| Attention | S / 2 | < 1 | Each request has its own KV cache, so batching does not raise generation-time attention intensity |
The reason batching helps MLPs but not generation-time attention is that MLP weights are shared across sequences. Loading the same weights once and applying them across a batch increases useful work per byte moved. Attention does not share the relevant KV cache across batch elements: every sequence has its own K and V vectors. Increasing B therefore means doing more independent low-intensity attention work, not reusing one shared memory read in the same way.
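The two intensity formulas can be evaluated directly; this is just the lecture's algebra plugged into numbers, with the H100 threshold of about 295 as the reference point:
```python
H100_INTENSITY = 989e12 / 3.35e12     # about 295

def mlp_intensity(B, T):              # lecture approximation: B * T
    return B * T

def attention_intensity(S, T):        # FlashAttention-style accounting: S*T/(S+T)
    return S * T / (S + T)

S = 1024
print("prefill  MLP :", mlp_intensity(B=64, T=S))        # 65536, well above ~295
print("prefill  attn:", attention_intensity(S=S, T=S))   # S/2 = 512, also compute-bound
print("generate MLP :", mlp_intensity(B=64, T=1))        # 64, below ~295 unless B grows
print("generate attn:", attention_intensity(S=S, T=1))   # S/(S+1) < 1, firmly memory-bound
```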
Prefill is compute-bound, generation is memory-bound.
Liang’s summary is specific: prefill MLP intensity is B × S; prefill attention intensity is S / 2; generation MLP intensity is B and requires concurrent requests; generation attention intensity is less than 1 and is the fundamental bottleneck if one stays with a Transformer. “Whenever you hear people say ‘oh inference is memory bound,’ you know why,” he says.
Batch size buys throughput by charging latency and memory
The Llama 2 13B example makes the latency-throughput tradeoff concrete. Tatsunori Hashimoto works with a configuration using sequence length 1024, model dimension 5120, feed-forward dimension 13824, 40 query heads, 40 key/value heads, head dimension 128, 40 layers, vocabulary size 32,000, and H100 memory bandwidth of 3.35e12 bytes per second.
There are two main memory terms: model parameters and KV cache. The number of parameters computes to 13,015,449,600, a sanity check for a 13B model. The parameter memory is twice that count in bytes under bf16. The KV cache per sequence grows with sequence length, key/value heads, head dimension, layers, two tensors for key and value, and two bytes per bf16 value. Total memory is therefore parameter memory plus B times KV-cache-per-sequence.
Since generation is memory-bound under the lecture’s assumptions, latency is approximated as total memory divided by memory bandwidth. Throughput is B divided by latency, because B tokens are generated in parallel for a batch of B sequences.
| Batch size | Memory | Latency | Throughput |
|---|---|---|---|
| 1 | 26,869,760,000 bytes | 0.0080 seconds/token | 124.6755 tokens/second |
| 64 | 79,717,990,400 bytes | 0.0238 seconds/token | 2689.4807 tokens/second |
| 256 | 240,779,264,000 bytes | 0.0719 seconds/token | 3561.7685 tokens/second |
Increasing B worsens latency because the KV cache is larger and must be read and written. Hashimoto compares this to waiting for a bus: each individual waits longer, but the system moves more people at once. Throughput improves because parameter reads are amortized across the batch, but the gains diminish and memory capacity becomes a hard limit. In the batch-256 example, the configuration does not fit in H100 memory.
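A back-of-the-envelope script reproduces the table above under the lecture's assumptions (bf16 everywhere, memory-bound generation, latency equals bytes moved divided by bandwidth). The parameter-count formula assumes a standard Llama-style layer with a SwiGLU MLP of three D×F matrices and untied input/output embeddings, which is what makes the count land on 13,015,449,600:
```python
S, D, F = 1024, 5120, 13824              # sequence length, model dim, feed-forward dim
N, K, H, L, V = 40, 40, 128, 40, 32000   # query heads, KV heads, head dim, layers, vocab
BW = 3.35e12                             # H100 memory bandwidth, bytes per second
BYTES = 2                                # bf16

def num_params(K):
    attn = D * (N * H) + 2 * D * (K * H) + (N * H) * D   # Q, K, V, O projections
    mlp = 3 * D * F                                       # SwiGLU: gate, up, down
    return V * D + L * (attn + mlp) + D * V               # embeddings + layers + unembedding

def kv_cache_per_seq(K):
    return BYTES * 2 * S * K * H * L                      # keys and values, every layer

def stats(B, K=40):
    memory = BYTES * num_params(K) + B * kv_cache_per_seq(K)
    latency = memory / BW                                  # seconds per decoded token
    return memory, latency, B / latency                    # throughput in tokens/second

for B in (1, 64, 256):
    print(B, stats(B))   # reproduces roughly 26.9e9 / 79.7e9 / 240.8e9 bytes and the table's rates
```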
Time-to-first-token is described in the lecture as essentially a function of prefill time. Within the simplified latency framework used here, smaller batch sizes during prefill improve TTFT, while larger batch sizes during generation improve throughput. The operational tension is not that all “fast” metrics agree; it is that an interactive system and a batch-processing system may choose different batch sizes because they optimize different definitions of speed.
Parallelism is another dimension. Launching M copies of the model is the easy version: latency stays the same and throughput increases by M. Harder parallelism involves sharding the model and KV cache, which is pointed back to the Scaling Book chapter rather than covered in detail.
The first design question is how much KV cache the model really needs
Once the bottleneck is memory, the obvious target is the KV cache. Tatsunori Hashimoto says the cache can become larger than the parameters at sufficiently large batch size. The goal, repeated across the lecture, is to reduce inference complexity without hurting accuracy too much.
Grouped-query attention is the first architectural shortcut. In standard multi-head attention, there are N query heads and K = N key/value heads. In multi-query attention, K = 1; Hashimoto says in an aside that “no one uses” it because it is “really bad.” The useful version he emphasizes is grouped-query attention, which sits between those cases: N query heads share a smaller number K of key/value heads, reducing the KV cache by a factor of N/K.
The slide attributes the GQA comparison to Ainslie et al. 2023. The time-per-sample chart shown in the lecture places MQA as fastest, MHA as slowest, and GQA as a set of intermediate choices depending on the number of groups. The evaluation table shown for the same work reports that GQA-8-XXL has much lower inference time than MHA-XXL while retaining a similar average on the displayed benchmark set in that paper’s table.
Applied to the Llama 2 13B example at batch size 64, changing from K = 40 to K = 8 reduces memory and improves both latency and throughput in the calculation. The lecture frames the improvement through the KV-cache reduction: less memory to move in a memory-bound workload means better latency and throughput.
| Configuration | Parameters | Memory | Latency | Throughput |
|---|---|---|---|---|
| MHA, K=40, B=64 | 13,015,449,600 | 79,717,990,400 bytes | 0.0238 seconds/token | 2689.4807 tokens/second |
| GQA, K=8, B=64 | 11,337,728,000 | 33,412,874,240 bytes | 0.0100 seconds/token | 6416.6883 tokens/second |
| GQA, K=8, B=256 | 11,337,728,000 | 65,625,128,960 bytes | 0.0196 seconds/token | 13068.1648 tokens/second |
Hashimoto draws an important distinction: latency and throughput are not inherently opposed. Reducing total memory can improve both. The tension appears mainly when increasing batch size.
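Reusing `num_params` and `stats` from the Llama sketch above with K = 8 instead of K = 40 (everything else unchanged) reproduces the GQA rows; only the K/V projection sizes and the per-sequence cache term shrink:
```python
# Continuation of the earlier sketch: GQA shrinks the K/V projections and the KV cache.
print(num_params(K=8))       # 11,337,728,000 parameters
print(stats(B=64, K=8))      # ~33.4e9 bytes, ~0.0100 s/token, ~6.4e3 tokens/s
print(stats(B=256, K=8))     # ~65.6e9 bytes, ~0.0196 s/token, ~1.31e4 tokens/s
```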
But the accuracy tradeoff is not settled by one paper. Hashimoto warns that evaluation claims should be taken “with a grain of salt” when they are not mathematical. A slide attributed to DeepSeek-AI 2024 reports that, on several hard benchmarks in that table, a dense 7B model with MHA outperforms GQA and MQA. The same DeepSeek material motivates multi-head latent attention, or MLA: instead of storing full K and V vectors, store a compressed latent vector c and project up to K and V when needed. Hashimoto states that DeepSeek V2 reduced N × H = 16,384 dimensions to C = 512, with an added 64 dimensions to handle RoPE, for 576 total dimensions.
MLA’s appeal is the same memory argument: smaller cache, faster inference. The DeepSeek-AI 2024 table shown for MLA reports MLA as comparable to or slightly better than MHA on the displayed MoE benchmark rows while using far fewer KV-cache elements per token. Hashimoto does not claim that lowering model dimension would produce the same effect. When asked how MLA compares to reducing the model dimension, he says the ablations shown do not answer it; his guess is that indiscriminately reducing model dimension makes things worse, and the trick is to find places where the model can be squeezed.
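A rough sketch of the MLA caching pattern, with made-up NumPy weight names (`W_down`, `W_uk`, `W_uv`) and the DeepSeek V2 sizes quoted above; the 64 RoPE-carrying dimensions are omitted for brevity:
```python
import numpy as np

D_MODEL = 5120          # hidden size (illustrative)
N_H = 16384             # K/V width per token that full MHA would cache (N x H)
C = 512                 # compressed latent width actually cached under MLA

rng = np.random.default_rng(0)
W_down = rng.normal(size=(D_MODEL, C)) * 0.02   # compress hidden state into the latent
W_uk = rng.normal(size=(C, N_H)) * 0.02         # reconstruct keys from the latent
W_uv = rng.normal(size=(C, N_H)) * 0.02         # reconstruct values from the latent

latent_cache = []                                # one C-dimensional vector per past token

def cache_token(h):
    latent_cache.append(h @ W_down)              # store 512 values, not 2 x 16384

def materialize_kv():
    c = np.stack(latent_cache)                   # (S, C)
    return c @ W_uk, c @ W_uv                    # full K and V, rebuilt only when needed

for _ in range(4):
    cache_token(rng.normal(size=D_MODEL))
K, V = materialize_kv()                          # shapes (4, 16384) each
```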
The same decision pattern appears in other cache-reduction methods. Cross-layer attention, on a slide attributed to Min et al. 2024, shares KVs across layers rather than only across heads. Local or sliding-window attention truncates attention to a fixed window of the most recent tokens, making the effective KV cache independent of total sequence length, which is especially attractive for long context. Liang notes that stacking multiple layers lets information propagate farther than a single window, but he also says local attention reduces expressivity and can hurt accuracy, so practical systems may interleave local and global attention.
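A minimal sketch of the sliding-window version of the cache (the window size and the deque-based eviction are illustrative choices, not from the lecture):
```python
from collections import deque

WINDOW = 4096  # keep only the most recent WINDOW tokens' keys and values

# Each entry holds the (key, value) pair for one past token in one layer.
kv_window = deque(maxlen=WINDOW)

def append_kv(k, v):
    """Appending a new token silently evicts the oldest one once the window is full,
    so cache size -- and attention cost per decode step -- stops growing with context."""
    kv_window.append((k, v))
```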
A student asks how sliding window compares with linear attention variants. Liang says linear attention and related models compress the history into a state rather than storing the full KV cache. Naive linear attention might sum KV values into a single vector; more powerful variants include gated nets, delta nets, and Mamba. His qualitative distinction is that sliding window preserves local high-resolution information, while linear-attention or state-space approaches can offer broader summaries of the past. For needle-in-a-haystack long-context tasks, compressing the entire history into a small state can lose information.
DeepSeek V4’s attention slide is treated as another example of the same pattern. The visual says the design supports 1M context length and lists components such as sliding window attention, dilated sliding window, global-plus-sliding-window attention, compressed sparse attention, DeepSeek sparse attention, and heavily compressed attention. Liang does not unpack every acronym, but the decision framework remains stable: reduce KV-cache size because inference is memory-bound, and then verify that the model still behaves well enough.
| Technique | What is reduced or shared | Caveat emphasized in the lecture |
|---|---|---|
| GQA | Key/value heads are shared across groups of query heads | Accuracy results vary across the displayed Ainslie and DeepSeek tables |
| MLA | Full K and V are replaced by a compressed latent KV representation | Requires architectural handling for RoPE |
| CLA | Key/value cache is shared across layers | Presented as an empirical Pareto-frontier improvement in the cited slide |
| Sliding window / local attention | Only a local window of KV history is retained | Can reduce expressivity; hybrid local/global layers are used to recover accuracy |
| Linear attention / state-space variants | History is compressed into a state | May lose information needed for exact long-context retrieval |
Quantization and pruning change what must be stored and moved
Quantization attacks the same bottleneck from a systems angle rather than primarily an attention-architecture angle. Percy Liang defines the key idea as reducing number precision. Less memory should improve latency and throughput in a memory-bound workload, but accuracy must be checked.
The number formats compared run from fp32 to bf16, fp8, int8, and int4. fp32 is needed for parameters and optimizer states during training; bf16 is described as the default for inference; fp8 is one byte and can be used for training “if you dare”; int8 and int4 are cheaper but less accurate and presented as inference-only options.
There are two broad ways to quantize. Quantization-aware training simulates quantization errors during the forward pass while training by quantizing and dequantizing. The advantage is that weights are trained to work under quantization; the disadvantage is expensive large-scale training. Post-training quantization is done after training and is therefore much cheaper. A naive version determines scale and zero point for each layer or tensor using sample data. GPTQ, as described in the lecture and attributed on the slide to Frantar et al. 2022, uses Hessian information to update non-quantized weights to account for quantization error.
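A sketch of the naive post-training recipe described above: pick a scale and zero point per tensor from sample data, round to int8, and dequantize at use time. This is the round-to-nearest baseline, not GPTQ's Hessian-based correction:
```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Asymmetric per-tensor int8 quantization: w is approximated by scale * (q - zero_point)."""
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 255.0
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero_point), 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(4096, 4096).astype(np.float32)
q, s, z = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s, z)).max())  # small, on the order of the scale
```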
Activation-aware quantization is more selective. Liang says some activation channels are large, and weights interacting with those channels matter more. AWQ, attributed on the slide to Lin et al. 2023, therefore keeps a small fraction of salient weights—shown as 0.1% to 1%—in higher precision or scales weights before quantization. The slide states that FP16 to int3 can produce 4× lower memory and 3.2× speedup. The AWQ comparison contrasts naive round-to-nearest int3 quantization with much worse perplexity against AWQ variants that preserve or scale salient weights and recover lower perplexity.
Model pruning changes the model itself. Liang describes the idea bluntly: take a large model, remove pieces of it, and then repair it. The slide attributes this pruning method to an NVIDIA paper by Muralidharan et al. 2024. The procedure starts with a trained LLM, estimates importance on a small calibration dataset, ranks layers, heads, or hidden dimensions, removes less important components, and distills the pruned model from the original. When asked how important layers are distinguished, Liang says the high-level approach is to pass calibration inputs through the model and inspect activation magnitudes; dead or near-zero units are easier to remove. When a student asks whether always-high activations are necessarily important, Liang says variance-like measures may matter too: if a neuron is always 100, removing it breaks things, but if it has high mean and low variance there may be ways to fold it into a bias.
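A toy version of the importance pass Liang describes: run calibration inputs through a layer, record per-unit activation statistics, and drop the least-used hidden units. The mean-absolute-activation criterion here is one simple choice, not necessarily the one in the Muralidharan et al. recipe:
```python
import numpy as np

def neuron_importance(activations: np.ndarray) -> np.ndarray:
    """activations: (num_calibration_tokens, hidden_dim) from one MLP layer."""
    return np.abs(activations).mean(axis=0)          # near-zero means "rarely used"

def prune_hidden_units(W_down: np.ndarray, importance: np.ndarray, keep_fraction=0.75):
    """Keep only the most important hidden units by slicing the down-projection rows."""
    keep = np.sort(np.argsort(importance)[-int(keep_fraction * len(importance)):])
    return W_down[keep, :], keep

acts = np.random.rand(10_000, 13824) * (np.random.rand(13824) > 0.1)  # some units are dead
W_down = np.random.randn(13824, 5120)
W_pruned, kept = prune_hidden_units(W_down, neuron_importance(acts))
print(W_pruned.shape)  # (10368, 5120); the matching up-projection columns must be pruned too
```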
The broader taxonomy is two recipes. One can define a faster architecture and train it from scratch. Or one can define a faster architecture, initialize it using parts of the original model even if the architecture differs, and repair it through distillation. In both cases the objective is the same: reduce parameters or KV cache without losing too much quality.
Speculative decoding is lossless because checking is easier than generation
Speculative sampling is presented as a different kind of shortcut: not lossy compression, but a lossless way to exploit the asymmetry between prefill-like checking and autoregressive generation. Percy Liang’s premise is that if someone gives the target model a sequence, it can score the tokens in parallel relatively quickly. Generating those tokens one by one is slower.
Checking is faster than generation.
The method uses a cheaper draft model p to propose several tokens, such as four. The larger target model q then evaluates those proposed tokens in parallel. Tokens are accepted with a probability based on the ratio q/p; if rejected, the algorithm samples from a residual distribution. Liang describes this as modified rejection sampling. The key property is that it is guaranteed to produce an exact sample from the target model.
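A sketch of the accept/reject core, with `p_draft` and `q_target` standing for per-token probability vectors from the draft and target models; the names and the single-token granularity are illustrative, since a real implementation scores all draft tokens with one target forward pass:
```python
import numpy as np

rng = np.random.default_rng(0)

def accept_or_resample(token: int, p_draft: np.ndarray, q_target: np.ndarray):
    """Modified rejection sampling: the returned token is an exact sample from q_target."""
    if rng.random() < min(1.0, q_target[token] / p_draft[token]):
        return token, True                        # keep the draft model's proposal
    residual = np.maximum(q_target - p_draft, 0)  # resample only where target exceeds draft
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual), False
```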
The intuition is operational. Without speculative decoding, the large model emits one token at a time. With speculative decoding, the small model generates a burst of candidate tokens and the large model critiques them, allowing accepted bursts to move the output forward more quickly. Initial speculative-sampling results shown in the lecture for Chinchilla report that, for XSum with batch size 1 and K = 4, mean token time drops from 14.1 ms/token under autoregressive nucleus sampling to 7.35 ms/token under speculative sampling, a 1.92× speedup, with similar reported ROUGE-2.
There is a sweet spot in the number of draft tokens. Too few draft tokens fail to exploit parallel checking by the target model. Too many increase rejection. In the example Liang cites, the favorable region is around three or four draft tokens.
In practice, the draft model must be much smaller than the target model—examples shown include a 70B target with an 8B draft, or an 8B target with a 1B draft—but also close enough in behavior that proposals are often accepted. Liang connects this back to distillation: if a compressed or pruned model is good enough, serve it directly; if not, use it as a draft model and let the main model correct it. Medusa, where a draft mechanism generates multiple tokens in parallel, and EAGLE, where the draft model uses high-level features from the target model, are named as extensions.
Live inference requires scheduling and memory management, not just model math
The final systems problem is dynamic workload handling. Live serving is not a neat training batch. Requests arrive at different times, share prefixes, and have different lengths. Waiting to form a batch hurts early requests; padding ragged sequences wastes work; shared system prompts and repeated sampling create opportunities for reuse.
Continuous batching, attributed to the Orca system in the slide, schedules decoding at the iteration level. Instead of waiting for a full request to finish before changing the batch, the server decodes one token for all active sequences, removes finished sequences, and inserts newly arrived requests. The batch is continuously updated.
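A schematic of iteration-level scheduling, with a made-up `Request` record and a `decode_one_token` stand-in; the point is that admission and eviction happen every decode step, not once per request:
```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    tokens: list = field(default_factory=list)
    done: bool = False

def decode_one_token(batch):            # stand-in for one batched forward pass
    for r in batch:
        r.tokens.append("<tok>")
        r.done = len(r.tokens) >= 8     # pretend every sequence finishes at 8 tokens

def serve(waiting, max_batch=64):
    active = []
    while waiting or active:
        # Iteration-level scheduling: refill the batch from the queue at every step...
        while waiting and len(active) < max_batch:
            active.append(waiting.pop(0))
        decode_one_token(active)
        # ...and retire finished sequences immediately, freeing their slots.
        active = [r for r in active if not r.done]

serve([Request("hello"), Request("write a poem")])
```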
Different sequence lengths complicate tensor batching. Percy Liang describes selective batching as the solution: attention computations may need to process sequences separately because their lengths differ, but non-attention computations such as MLP layers can concatenate all tokens into one larger tensor and process them together.
PagedAttention, introduced in the vLLM paper according to the lecture slide attributed to Kwon et al. 2023, addresses KV-cache memory management. The previous status quo allocated a contiguous KV-cache region for the prompt and possible response up to a maximum length. If generation stopped early, much of that reserved space was never used: internal fragmentation. Gaps between allocations created external fragmentation.
PagedAttention borrows the operating-system paging idea. The KV cache for a sequence is divided into non-contiguous blocks. A sequence such as “Four score and seven years ago our fathers brought forth” can be chunked into fixed-size blocks that live anywhere in physical memory, as long as the system maintains the mapping from logical to physical blocks.
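A sketch of the block-table bookkeeping (the block size and the trivial allocator are invented for illustration): logical blocks of a sequence's KV cache map to arbitrary physical blocks, so nothing needs to be contiguous or reserved up front:
```python
BLOCK_SIZE = 16                     # tokens per KV block (illustrative)

free_blocks = list(range(1024))     # physical block ids in the KV-cache pool
block_tables = {}                   # sequence id -> list of physical block ids

def append_token(seq_id: int, position: int):
    """Allocate a new physical block only when a sequence crosses a block boundary."""
    table = block_tables.setdefault(seq_id, [])
    if position % BLOCK_SIZE == 0:                # previous logical block is full (or first token)
        table.append(free_blocks.pop())           # grab any free physical block, wherever it is
    logical = position // BLOCK_SIZE
    return table[logical], position % BLOCK_SIZE  # where this token's K/V actually live

for pos in range(40):                             # "Four score and seven years ago our fathers ..."
    block, offset = append_token(seq_id=0, position=pos)
```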
| Serving problem | Why it happens | Technique discussed |
|---|---|---|
| Requests arrive and finish at different times | A fixed batch would make early requests wait or waste slots after completion | Continuous batching at the decode-iteration level |
| Requests have different sequence lengths | Attention depends on per-sequence history length, producing ragged arrays | Selective batching: handle attention separately, concatenate non-attention work |
| KV-cache allocations waste memory | Reserved maximum-length regions create internal fragmentation; gaps create external fragmentation | PagedAttention with non-contiguous KV blocks |
| Requests share prefixes | System prompts or multiple samples from one prompt repeat the same prefix work | Prefix sharing and copy-on-write at the block level |
PagedAttention also enables prefix sharing. If many requests share a system prompt, the server can compute and store that prompt’s KV cache once. If one prompt is used to sample multiple responses, all samples can share the prefix cache until they diverge. Liang describes copy-on-write semantics at the block level: shared blocks carry references, and when two continuations generate different tokens, the relevant block is split rather than duplicating the entire prefix.
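A rough sketch of block-level copy-on-write in the same spirit as the block table above (the refcount dictionary and `write_token` helper are illustrative):
```python
block_tables = {"sample_a": [7, 12], "sample_b": [7, 12]}   # two samples sharing a prefix
ref_count = {7: 2, 12: 2}                                   # physical block id -> users
free_blocks = list(range(100, 200))

def write_token(seq_id: str, logical_block: int):
    """Before writing into a shared block, give this sequence its own private copy."""
    table = block_tables[seq_id]
    phys = table[logical_block]
    if ref_count[phys] > 1:                       # block is shared: copy on write
        new_block = free_blocks.pop()
        # (a real system would copy the K/V contents of `phys` into `new_block` here)
        ref_count[phys] -= 1
        ref_count[new_block] = 1
        table[logical_block] = new_block

write_token("sample_a", logical_block=1)          # the two samples diverge at block 1
print(block_tables)   # sample_a now points at its own copy; block 7 stays shared
```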
Other vLLM optimizations are mentioned but not detailed: fusing block reads and attention into one kernel to cut launch overhead, using kernels such as FlashAttention and FlashDecoding, and using CUDA graphs to avoid launching many small kernels. Liang’s summary is that dynamic inference serving benefits from operating-systems ideas such as paging, just as speculative decoding borrows from speculative execution and rejection sampling.
The Transformer is powerful but not inference-friendly
Percy Liang’s closing point is that all the techniques covered are responses to a structural mismatch. The Transformer used for inference is the same model as in training, but the workload is different: generation is sequential, memory-bound, and dynamic. The techniques discussed—new attention architectures, quantization, pruning, distillation, speculative sampling, continuous batching, and PagedAttention—are all ways of reducing memory movement, especially KV-cache movement, or making better use of the memory that must be moved.
He says the governing principle is to “reduce your KV cache, but don’t hurt accuracy too much.” Lossy methods must be checked empirically because benchmark results can differ across papers and models. Lossless methods such as speculative sampling depend on mathematical correction but still require practical draft-model design. Systems methods do not change the model’s distribution but determine whether live traffic can be served efficiently.
The final architectural claim is open-ended. Liang argues that new architectures may have large potential because the KV cache and attention structure make the Transformer “fundamentally” inference-unfriendly. State-space models, linear attention, and diffusion-style generation are named as possible directions. The implication is not that one specific alternative has won, but that inference efficiency may require architectures designed for the serving regime rather than inherited from training convenience.



