KV Cache Movement Has Become the Core Inference Bottleneck
Stanford’s CS336 lecture on inference, from the course co-taught by Percy Liang and Tatsunori Hashimoto, argues that serving language models is now a core systems problem rather than an afterthought to training. Liang’s central claim is that autoregressive Transformer generation is inherently sequential and usually memory-bandwidth-bound: at each decoding step, attention must re-read the ever-growing KV cache rather than perform dense, easily parallelized computation. The lecture frames batching, grouped-query and multi-head latent attention, quantization, pruning, speculative decoding, continuous batching, and PagedAttention as different strategies toward the same goal: move fewer bytes, reuse memory better, or trade latency for throughput without an unacceptable loss in model quality.
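To make the “move fewer bytes” framing concrete, the sketch below estimates how much KV-cache data attention must stream in per generated token. The model shape (32 layers, 128-dim heads, 8 KV heads under grouped-query attention) is an illustrative assumption, roughly in the family of a Llama-style 7B model, not a figure from the lecture.

```python
# Back-of-the-envelope KV-cache traffic per generated token.
# All shape parameters are illustrative assumptions, not lecture figures.

BYTES_PER_VALUE = 2   # fp16 / bf16
N_LAYERS = 32
N_QUERY_HEADS = 32    # full multi-head attention (MHA) baseline
N_KV_HEADS = 8        # shared KV heads under grouped-query attention (GQA)
HEAD_DIM = 128

def kv_bytes_per_token(n_kv_heads: int) -> int:
    """Bytes of K and V cached per token across all layers (x2 for K and V)."""
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * BYTES_PER_VALUE

def decode_read_bytes(context_len: int, n_kv_heads: int) -> int:
    """Bytes of KV cache attention must stream in to emit one new token."""
    return context_len * kv_bytes_per_token(n_kv_heads)

if __name__ == "__main__":
    ctx = 4096
    mha = decode_read_bytes(ctx, N_QUERY_HEADS)
    gqa = decode_read_bytes(ctx, N_KV_HEADS)
    print(f"MHA: {mha / 1e9:.2f} GB of KV cache read per token")
    print(f"GQA: {gqa / 1e9:.2f} GB of KV cache read per token")
```

Under these assumptions, the MHA baseline streams roughly 2 GB of cache for every new token at a 4,096-token context, on the order of a millisecond of pure memory traffic even at multi-TB/s HBM bandwidth, and the cost grows linearly with both context length and batch size. Cutting KV heads 4x under GQA cuts that traffic 4x, which is the byte-movement logic the lecture applies to the other techniques as well.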
Stanford Online · May 12, 2026 · 17 min read