Inference Constraints Are Reshaping Language Model Architecture

Dan Fu · Stanford Online · Friday, June 5, 2026 · 22 min read

In a Stanford CS336 guest lecture, Dan Fu argued that language-model inference is no longer downstream plumbing but a central research and design constraint. Fu described serving as the machinery that turns a trained model into a usable system, where schedulers, KV caches, GPU kernels, routing policies and hardware choices determine which architectures are practical, economical and reliable at scale.

Inference is where model capability becomes an engineering system

Dan Fu framed inference as the machinery that “turns electricity into intelligence”: the layer between a trained model as a mathematical object and a service that returns tokens to users. Training produces a model that can “talk back to you,” but serving that model requires a separate stack of schedulers, caches, GPU kernels, parallelism strategies, debugging tools, and hardware-aware choices.

The central claim was not just that inference matters operationally. Fu’s stronger point was that understanding inference and GPU kernels enables “full-stack innovation” in machine learning algorithms. The serving stack is not merely implementation detail. It changes what architectures are practical, what workloads can be served economically, what bottlenecks matter, and what research questions become visible.

Understanding inference and GPU kernels enables full-stack innovation in ML algorithms.

Dan Fu

He situated that claim against the broader scaling story. The slide described a shift from roughly 100 million-parameter models in 2018 to 500 billion-parameter models by 2022, and Fu said open-weight models now reach trillion-parameter scale, with frontier models, in his estimate, probably in the 5 trillion to 10 trillion parameter range. That scaling has coincided with capabilities shown on the slide as “human-level text generation,” along with code generation, image and video generation and understanding, and applications in science, biology, health, and DNA modeling.

Fu used the transition from horses to cars in Manhattan as an analogy for the speed of change. In 1902, he said, Manhattan had 130,000 working horses, each producing 22 pounds of manure a day, and conferences were held to discuss the manure problem. By 1912, cars outnumbered horses in Manhattan. For Fu’s own coding work, he said, the analogous “1912 moment” was 2025: he began writing the majority of his code using language models, and most people on his team do the same.

The scale behind that change is increasingly physical. Fu called GPUs “the new oil,” pointing to hundreds of billions of dollars of investment in GPU data centers and sovereign-scale AI infrastructure. But GPUs alone are inert. The useful conversion happens through inference engines and GPU kernels that map a model’s directed acyclic graph of operations onto hardware with high-bandwidth memory, streaming multiprocessors, caches, interconnects, and all the associated constraints.

A request becomes tokens through scheduling, cache lookup, prefill, and decode

A user request entering an inference system first gets tokenized, then scheduled, then run through two distinct phases of model computation: prefill and decode. Fu emphasized that those two phases are “very different beasts,” and many serving decisions follow from that difference.

Prefill handles the prompt. If a user submits 10,000 tokens that have not been seen before, the system computes over those 10,000 tokens and produces the first next-token distribution. Fu described this as compute-bound and similar to training, except without the backward pass. The workload can use the parallelism of the GPU relatively well.

Decode is the token-by-token generation loop after the prompt has been processed. Each generated token must be fed back through the model to produce the next token. That pass has relatively few floating-point operations compared with prefill, but it still requires loading model weights. The result is a memory-bandwidth-bound workload: the GPU becomes, in Fu’s phrasing, close to a “glorified memory loader.”
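The compute-bound versus bandwidth-bound split can be made concrete with a rough roofline-style estimate. The sketch below is a back-of-the-envelope illustration, not Fu's own calculation: it assumes a dense model, about 2 FLOPs per parameter per token, BF16 weights read once per forward pass, and it ignores attention and KV cache traffic. The accelerator figures are hypothetical placeholders.

```python
def arithmetic_intensity(tokens_per_pass, n_params, bytes_per_param=2.0):
    """FLOPs performed per byte of weights loaded in one forward pass.

    Weights-only approximation: ignores attention and KV cache traffic.
    """
    flops = 2.0 * n_params * tokens_per_pass   # ~2 FLOPs per parameter per token
    bytes_moved = n_params * bytes_per_param   # weights are read once per pass
    return flops / bytes_moved


def regime(intensity, peak_flops, peak_bandwidth):
    """Compare intensity with the hardware ridge point (FLOPs per byte)."""
    ridge = peak_flops / peak_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"


# Hypothetical accelerator: 1e15 FLOP/s of compute, 3e12 bytes/s of memory bandwidth.
PEAK_FLOPS = 1e15
PEAK_BANDWIDTH = 3e12

prefill = arithmetic_intensity(tokens_per_pass=10_000, n_params=1e9)
decode = arithmetic_intensity(tokens_per_pass=1, n_params=1e9)

print("prefill:", prefill, regime(prefill, PEAK_FLOPS, PEAK_BANDWIDTH))
print("decode: ", decode, regime(decode, PEAK_FLOPS, PEAK_BANDWIDTH))
```

Under these assumptions the parameter count cancels out: what separates the two phases is how many tokens share each load of the weights.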

The inference engine is therefore a loop: schedule, execute, sample, repeat. After model execution, the resulting token ID is detokenized into text, stop tokens and safety conditions may be checked, and the output is streamed back to the user.

The system cannot optimize this loop in the abstract. It has to optimize for workload shape and service-level targets. Fu distinguished coding workloads, narrative summarization, general chat, voice interaction, and long-running agentic workflows. A coding assistant may receive tens of thousands of input tokens from a codebase and then generate shorter or longer outputs depending on model behavior. A summarization workflow may involve an entire book pasted into context. A voice chat application demands quick, interactive responses. An autonomous agent may run for long stretches, use tools, pause, ask for help, and resume later.

Those workload shapes determine how much input arrives, how much output is expected, how many turns occur in a session, and how long the system should retain intermediate state between turns. Fu gave one personal example: a ChatGPT thread about his workouts that he revisits roughly every other week. That session has a very different caching and scheduling profile from a fast coding loop or a one-off question.

Service-level targets similarly differ. One application may prioritize time to first token under a second so the user sees that the model is responding. Another may care more about completing 500 generated tokens within a particular wall-clock interval. A provider also cares about throughput per GPU. Fu’s point was that the same latency target can be easy or impossible depending on input length, output length, cache hit rate, number of turns, and time between turns.

Term	What Fu emphasized	Primary bottleneck
Prefill	Processes the prompt, such as 10,000 new input tokens, before generation begins	Compute / FLOPs
Decode	Generates one token at a time after the prompt has been processed	Memory bandwidth
Time to first token	Latency before the user sees the model begin responding	Prefill speed and scheduling
Time between tokens	Latency between streamed generated tokens	Decode speed

Fu separated inference into phases with different bottlenecks and service-level implications

The request path shown on the slide was request, tokenize, schedule, prefill, decode loop, detokenizer, and streaming output tokens. Its key line was that “the engine is a loop: schedule -> execute -> sample -> repeat.”
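The loop on the slide can be sketched in a few lines. This is a toy illustration under stated assumptions, not the code of any real inference engine: `toy_forward` is a hypothetical stand-in for the model, the vocabulary has eight tokens, token 0 is an assumed stop token, and scheduling is plain round-robin.

```python
from collections import deque
from dataclasses import dataclass, field

VOCAB_SIZE = 8
EOS_ID = 0  # hypothetical stop token


@dataclass
class Request:
    prompt_ids: list
    max_new_tokens: int
    output_ids: list = field(default_factory=list)


def toy_forward(token_ids):
    """Stand-in for the model: deterministic fake scores over a tiny vocabulary."""
    seed = sum(token_ids) + len(token_ids)
    return [(seed * (i + 3)) % 11 for i in range(VOCAB_SIZE)]


def sample(scores):
    """Greedy sampling: pick the highest-scoring token ID."""
    return max(range(len(scores)), key=scores.__getitem__)


def run_engine(requests):
    """Schedule -> execute -> sample -> repeat until every request stops."""
    queue = deque(enumerate(requests))
    streamed = []  # (request index, token ID) in emission order
    while queue:
        idx, req = queue.popleft()                  # schedule (round-robin)
        # A real engine would reuse the KV cache here instead of
        # reprocessing the whole context on every step.
        scores = toy_forward(req.prompt_ids + req.output_ids)  # execute
        token = sample(scores)                      # sample
        req.output_ids.append(token)
        streamed.append((idx, token))               # stream back to the user
        done = token == EOS_ID or len(req.output_ids) >= req.max_new_tokens
        if not done:
            queue.append((idx, req))                # repeat
    return streamed


requests = [Request([1, 2, 3], max_new_tokens=5), Request([4, 5], max_new_tokens=3)]
stream = run_engine(requests)
```

The first pass over each request plays the role of prefill, and every later pass is one decode step. Tokens from the two requests interleave in `stream` because the scheduler alternates between them.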

Continuous batching and KV cache management turn inference into an operating-systems problem

Serving many users at once makes inference a scheduling system. Fu described continuous batching as one of the basic techniques: rather than waiting for a fixed batch of requests to complete, the engine admits new requests as old ones finish. A long request may be generating tokens while a short request enters, completes, and frees a slot. Another request may be blocked because there is not enough GPU memory to hold its key-value cache. Long prompts can also be split into chunks and interleaved with decode steps from other requests.
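A small simulation shows why admitting requests as slots free up helps. It is a simplified sketch with assumptions of its own: each request is reduced to the number of tokens it will generate, one engine step produces one token for every running request, and KV cache memory limits and chunked prefill are ignored.

```python
from collections import deque


def static_batching_steps(output_lengths, max_batch):
    """Fixed batches: every batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), max_batch):
        steps += max(output_lengths[start:start + max_batch])
    return steps


def continuous_batching_steps(output_lengths, max_batch):
    """Admit a waiting request as soon as a running one frees its slot."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        running = [left - 1 for left in running]          # one decode step
        running = [left for left in running if left > 0]  # finished requests leave
        steps += 1
    return steps


# One long generation and five short ones, with two slots available.
lengths = [100, 5, 5, 5, 5, 5]
static_steps = static_batching_steps(lengths, max_batch=2)
continuous_steps = continuous_batching_steps(lengths, max_batch=2)
print(static_steps, continuous_steps)
```

In this example the short requests cycle through the second slot while the long request keeps generating, so the continuous schedule finishes when the long request does rather than paying for each batch's straggler.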

The KV cache is central to that system. During attention, a model stores keys and values for previous tokens. If later turns in a conversation share a prefix with earlier turns, the system does not need to recompute the prompt-side work. It can reuse cached KV blocks for shared prefixes and compute only the newly added tokens. Fu compared this to a traditional data structure: a radix-like tree over token blocks, where a shared system prompt or conversation prefix can branch into many sessions. The engine traverses hashes, finds matching blocks, reuses cache hits, and computes misses.
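One way to picture the lookup is to hash each fixed-size block of tokens together with the hash of the block before it, so that a matching hash implies a matching prefix. The sketch below assumes that scheme; the block size, the `PrefixCache` class, and the string placeholder standing in for a KV block are all hypothetical, and real engines differ in detail.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cached block (illustrative)


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash full blocks so each hash identifies the whole prefix up to it."""
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore a partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(f"{parent}:{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # block hash -> placeholder for the stored KV block

    def match(self, token_ids):
        """Return how many leading prompt tokens are covered by cached blocks."""
        hits = 0
        for h in block_hashes(token_ids):
            if h not in self.blocks:
                break  # first miss ends the shared prefix
            hits += 1
        return hits * BLOCK_SIZE

    def insert(self, token_ids):
        for h in block_hashes(token_ids):
            self.blocks.setdefault(h, "kv-block")


cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
turn_1 = system_prompt + [9, 10, 11, 12]
cache.insert(turn_1)

turn_2 = turn_1 + [13, 14, 15, 16, 17]        # later turn in the same conversation
other_session = system_prompt + [20, 21, 22, 23]  # new session, same system prompt

print(cache.match(turn_2), cache.match(other_session))
```

Because the hashes are chained, sessions that share a system prompt share its blocks and then branch, which gives the tree-like structure without storing an explicit tree.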

This matters because modern usage is heavily multi-turn. A user who pastes a book once and then asks follow-up questions should not force the system to prefill the book every time. Similarly, many users may share common prefixes such as system prompts. Prefix sharing can save large amounts of compute.

But the cache itself becomes a resource to schedule. The ideal is to keep all useful KV cache blocks in GPU memory, but GPU HBM is limited. Systems therefore spill blocks to CPU DRAM and, eventually, to NVMe disk. Fu described a memory hierarchy of hot blocks in GPU memory, warm blocks in pinned CPU memory, and cold blocks on disk, with eviction and prefetching policies moving blocks between tiers. The target is simple: by the time a request is actually scheduled to run, the relevant KV cache should already be back in GPU memory.
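The tiering can be sketched as three maps with least-recently-used demotion between them. This is a minimal illustration under assumed names and capacities, counting capacity in blocks rather than bytes and leaving out the asynchronous copies a real system would need.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier cache: GPU (hot), CPU (warm), disk (cold), LRU demotion."""

    def __init__(self, gpu_capacity, cpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity
        self.gpu = OrderedDict()  # oldest entry first
        self.cpu = OrderedDict()
        self.disk = {}            # treated as unbounded here

    def _demote(self):
        # Push least-recently-used blocks down one tier at a time.
        while len(self.gpu) > self.gpu_capacity:
            key, block = self.gpu.popitem(last=False)
            self.cpu[key] = block
        while len(self.cpu) > self.cpu_capacity:
            key, block = self.cpu.popitem(last=False)
            self.disk[key] = block

    def put(self, key, block):
        self.cpu.pop(key, None)
        self.disk.pop(key, None)
        self.gpu[key] = block
        self.gpu.move_to_end(key)  # mark as most recently used
        self._demote()

    def prefetch(self, key):
        """Bring a block back to GPU before its request runs; return where it was."""
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return "gpu"
        for name, tier in (("cpu", self.cpu), ("disk", self.disk)):
            if key in tier:
                self.put(key, tier.pop(key))
                return name
        return "miss"

    def tier_of(self, key):
        for name, tier in (("gpu", self.gpu), ("cpu", self.cpu), ("disk", self.disk)):
            if key in tier:
                return name
        return "miss"


cache = TieredKVCache(gpu_capacity=2, cpu_capacity=1)
for session in ["a", "b", "c", "d"]:
    cache.put(session, f"kv-for-{session}")

found_in = cache.prefetch("a")  # e.g. the user reopened an old conversation
print(found_in, cache.tier_of("a"))
```

A scheduler built on something like this would call `prefetch` when it learns a request is coming, so the copy overlaps with other work instead of stalling the request.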

A student asked whether offloading to CPU or SSD is only appropriate for special workloads, since SSD access is slow. Fu answered by explicitly connecting inference to classic operating systems problems. The GPU-specific version, he said, resembles memory-management diagrams from the 1970s or 1980s: when too many applications exceed CPU memory, the system pages to disk. The inference version pages KV cache across GPU, CPU, and disk.

Least-recently-used eviction is a reasonable heuristic, though Fu described the best possible policy as predicting the future. In some cases, that may be possible. If a user opens an old conversation in a chat interface, that is a strong signal they may ask about it, so the system could prefetch the associated cache. In the absence of future knowledge, the provider uses heuristics, subject to service-level objectives. Fu added that he has never spoken to anyone who wants to put less traffic onto their GPUs; the question is how much traffic can be served within the required latency and quality constraints.

Prefill and decode are different enough to be split across machines

Because prefill and decode have different bottlenecks, Fu said, production systems increasingly split them onto different GPUs or machines. Prefill is FLOP-heavy and can maximize compute utilization. Decode is memory-bandwidth-heavy and repeated once per generated token. A prompt is prefilled once; a long response may require hundreds or thousands of decode steps.

That split enables different parallelism strategies. Very large models may be divided with tensor parallelism, where each GPU holds a shard of every layer’s tensors and devices all-reduce activations. Mixture-of-experts models can use expert parallelism, where different GPUs own full experts and a learned router sends tokens to them. The choice affects bottlenecks, communication, the number of GPUs required, and the number of sessions that can be served simultaneously.
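The all-reduce in tensor parallelism can be illustrated with a single sharded matrix-vector product. The sketch below simulates the devices inside one process with plain Python lists. It assumes one particular sharding (splitting the weight matrix along its input dimension) and an input size divisible by the device count; real systems combine several sharding patterns.

```python
def matvec(x, W):
    """Compute x @ W for a length-k vector x and a k x n matrix W (list of rows)."""
    n_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n_out)]


def all_reduce_sum(partials):
    """Element-wise sum of every device's partial output."""
    return [sum(values) for values in zip(*partials)]


def tensor_parallel_matvec(x, W, n_devices):
    """Each simulated device holds a slice of W's rows and the matching slice of x."""
    shard = len(x) // n_devices  # assumes len(x) is divisible by n_devices
    partials = []
    for device in range(n_devices):
        rows = slice(device * shard, (device + 1) * shard)
        partials.append(matvec(x[rows], W[rows]))  # local partial result
    return all_reduce_sum(partials)                # communication step


x = [1, 2, 3, 4]
W = [[1, 0],
     [0, 1],
     [2, 2],
     [1, 3]]

reference = matvec(x, W)
sharded = tensor_parallel_matvec(x, W, n_devices=2)
print(reference, sharded)
```

Each device stores only its share of the weights, and the price is the summation step across devices, which is the communication cost the text refers to.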

Fu connected this distinction to hardware strategy, explicitly as an inference-driven reading of where the hardware market may be going. Decode’s characteristics are different enough from prefill’s, he said, that specialized chips may be attractive for it. In Fu’s account, Nvidia’s purchase of Groq matters because decode differs so much from prefill that future systems could use GPUs for prefill and Groq LPU chips for decode. He also cited Cerebras, OpenAI’s compute partnership with Cerebras, and companies such as SambaNova as examples of bets across different parts of the inference space.

The same logic applies inside Nvidia’s own systems, as Fu described them. He discussed NVL72 Grace Blackwell systems: 72 GPUs connected by fast interconnect. Such systems raise questions about how to split a trillion-parameter model across many GPUs, whether the split makes sense, and how to handle failures. At 32 or more GPUs per worker, failures become routine enough that automated detection, expert migration, request replay, and hot spares become part of the serving problem. Fu mentioned flaky NVLink connections caused by physical connectors as one practical source of failure.

Long context adds another axis. Models with one million or more tokens of context require the system to process and store enormous KV caches. Fu pointed to context parallelism: splitting context across GPUs and handling cross-partition attention. He described million-token context as a useful capability that can be “gated purely by engineering.”

At production scale, rare bugs become normal events

Inference systems serving trillions of tokens a day expose bugs that may never appear in small tests. Fu described several production debugging cases from open-source inference engines, mostly from the previous year, where rare failures produced visible model behavior.

One class involved NaNs from a slightly wrong kernel. The triggering conditions were rare, but once logits became NaNs midway through computation, models could start outputting the same token repeatedly. Fu mentioned outputs devolving into “hi hi hi hi hi” or chains of exclamation points.

Another involved tool-call handling. Tool calls are the mechanism by which a model asks the harness to perform an external action, such as an internet search. In one engine, a change caused tool calls not to be processed correctly. The visible symptom was a spike in completion length. Instead of emitting a tool call and stopping so external code could execute it, the model kept asking for the same search, producing a long loop that could run for tens of thousands of tokens.

A third bug caused models to suddenly emit Chinese characters in English contexts. Fu said this affected multiple inference providers and was blamed by some on quantization. The underlying cause, in his account, was an off-by-one kernel error that read uninitialized GPU memory, ran it through attention, and produced a random Chinese character. The model then interpreted its own output as a signal that it should continue in Chinese. Fu cautioned that sometimes models have genuinely been trained to think or answer in Chinese; in this case, the cause was an implementation bug.

His broader point was that observability in production catches what unit tests miss. At scale, a failure rate of 0.001% can become a recurring operational problem.
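The arithmetic behind that point is short. In the sketch below the 0.001% rate is the figure from the talk, while the daily request volume is a hypothetical number chosen only for illustration, and failures are assumed to be independent.

```python
import math


def expected_failures(n_requests, failure_rate):
    """Mean number of failing requests, assuming independent failures."""
    return n_requests * failure_rate


def prob_at_least_one(n_requests, failure_rate):
    """P(at least one failure) = 1 - (1 - p)^n, computed in a numerically stable way."""
    return -math.expm1(n_requests * math.log1p(-failure_rate))


RATE = 0.001 / 100          # 0.001% expressed as a probability
DAILY_REQUESTS = 10_000_000  # hypothetical production volume

print("unit-test scale:", prob_at_least_one(1_000, RATE))
print("production scale:", prob_at_least_one(DAILY_REQUESTS, RATE))
print("expected failures per day:", expected_failures(DAILY_REQUESTS, RATE))
```

At a thousand test requests the bug most likely never appears. At the assumed production volume it is expected to appear on the order of a hundred times a day.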

A two-line cache-aware router can improve long-context serving

Fu used cache-aware prefill-decode disaggregation, or CPD, as an example of a systems-level inference optimization. The idea is simple: route cold and warm requests differently.

A cold request has a low cache hit rate. It may be the start of a new conversation or a user pasting thousands of tokens for the first time. A warm request has a high cache hit rate, often because it is a later turn in an existing conversation. Fu argued that these requests should not necessarily run on the same prefill nodes. A user pasting a book and asking to discuss it should not compete on the same GPUs with a short follow-up inside an existing conversation about why 1 + 1 = 2.

The CPD design shown places a cache-aware router in front of prefill and decode nodes. Low-cache-hit requests are sent to one set of GPUs; warm requests are sent to another. The slide showed asynchronous reads and writes to a distributed KV cache over high-speed RDMA. Fu described the routing change as “like two lines of code” in the routing layer, but the result shown on the slide was up to 40% faster long-context LLM serving, measured in queries per second, compared with baseline prefill/decode configurations.

40%: reported maximum serving speedup from cache-aware prefill-decode disaggregation

Request type	Cache profile	Routing idea
Cold request	Low cache hit rate; often a fresh conversation or large new prompt	Send to prefill nodes suited to expensive new computation
Warm request	High cache hit rate; often a later turn in an existing session	Send to nodes where cached prefixes can be reused efficiently

CPD separates cold and warm requests instead of mixing them on the same prefill path
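A routing rule of the kind Fu described might look like the sketch below. It is an illustration only: the pool names, the 0.5 threshold, and the idea of measuring warmth as the cached fraction of the prompt are assumptions, not details taken from the CPD work.

```python
def route(prompt_tokens, cached_tokens, warm_threshold=0.5):
    """Pick a prefill pool from the fraction of the prompt already in the KV cache."""
    hit_rate = cached_tokens / prompt_tokens if prompt_tokens else 0.0
    return "warm_pool" if hit_rate >= warm_threshold else "cold_pool"


# A user pasting a book for the first time: nothing is cached yet.
book_paste = route(prompt_tokens=120_000, cached_tokens=0)

# A short follow-up in an existing conversation: almost everything is cached.
follow_up = route(prompt_tokens=4_200, cached_tokens=4_096)

print(book_paste, follow_up)
```

The decision itself is tiny. The work lies in knowing `cached_tokens` at routing time, which requires the router to see the state of the distributed KV cache.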

Fu characterized this kind of result as evidence that the field is still early. In 10 or 20 years, he said, such routing may look obvious. It is not obvious now because production traffic patterns are new, and providers are only beginning to see how multi-turn, cache-heavy, long-context workloads behave at scale.

Megakernels attack decode by treating the GPU as a distributed system

The first deeper research project Fu described was fast LLM decode with megakernels, a collaboration between Stanford and Together. The problem starts from decode’s basic inefficiency: to generate one token, the system must run the whole model. The ordinary programming model worsens this because kernels are usually written one operation at a time. A transformer block may have separate kernels for RMSNorm, QKV matrix multiplication, RoPE, attention, output projection, residual connections, feed-forward layers, SiLU, and down projection.

This modularity makes kernels easier to write and reason about. It also creates downtime. Fu showed Gantt-style execution traces where GPU streaming multiprocessors perform useful work during some intervals and sit idle during others. Kernel launch and teardown create gaps. Tail effects occur when some work finishes earlier and waits for stragglers; the same phenomenon appears at many levels, from request batching to attention over mixed sequence lengths. When a model runs across many separate kernels, these gaps accumulate.

The megakernel approach fuses multiple operations into a single larger kernel. Fu compared this to the fusion in FlashAttention, but more aggressive and spanning more of the model. The conceptual shift is to stop treating the GPU as a device running one operation at a time and start treating it as a distributed system: many streaming multiprocessors, many dependent tasks, and a scheduling problem over available work. Some tasks can begin before others finish if their dependencies are satisfied.
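The scheduling view can be illustrated with a toy simulation that compares running operations as separate kernels with letting any task start once its dependencies finish. Everything in it is hypothetical: the task names, durations, worker count, and launch overhead are invented for illustration, and the simulated workers stand in only loosely for streaming multiprocessors.

```python
import heapq


def schedule_makespan(tasks, n_workers):
    """Greedy list scheduling: a task starts once its deps are done and a worker is free.

    tasks maps name -> (duration, [dependency names]) and must form a DAG.
    """
    finish = {}
    workers = [0.0] * n_workers  # min-heap of times at which each worker is free
    heapq.heapify(workers)
    remaining = dict(tasks)
    while remaining:
        ready = [
            (max((finish[d] for d in deps), default=0.0), name)
            for name, (_, deps) in remaining.items()
            if all(d in finish for d in deps)
        ]
        ready_time, name = min(ready)  # earliest-ready task first
        start = max(heapq.heappop(workers), ready_time)
        finish[name] = start + remaining.pop(name)[0]
        heapq.heappush(workers, finish[name])
    return max(finish.values())


def separate_kernels_makespan(tasks, kernels, n_workers, launch_overhead):
    """One kernel at a time: a barrier and a launch cost between consecutive kernels."""
    total = 0.0
    for kernel in kernels:
        inside = set(kernel)
        subset = {
            name: (tasks[name][0], [d for d in tasks[name][1] if d in inside])
            for name in kernel
        }
        total += launch_overhead + schedule_makespan(subset, n_workers)
    return total


TASKS = {
    "qkv_rope": (3.0, []),
    "load_kv": (4.0, []),
    "attention": (2.0, ["qkv_rope", "load_kv"]),
    "load_o_weights": (3.0, []),
    "o_proj": (1.0, ["attention", "load_o_weights"]),
}
KERNELS = [["qkv_rope"], ["load_kv", "attention"], ["load_o_weights", "o_proj"]]

separate = separate_kernels_makespan(TASKS, KERNELS, n_workers=3, launch_overhead=0.5)
fused = schedule_makespan(TASKS, n_workers=3)
print(separate, fused)
```

In the fused schedule the loads overlap with earlier compute, which is the effect the idle gaps in the execution traces point to. The hard part in practice is expressing this on a real GPU, which the toy model does not attempt.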

For an attention inference kernel, Fu reported 1.3x to 1.7x speedups. In the whole-model case, the team placed an entire Llama-1B layer into one kernel. The execution trace showed operations overlapped in non-obvious ways. For example, during attention, the system can start loading KV cache before QKV plus RoPE has finished. Once QKV completes and the query is available, attention can proceed. Similarly, the output projection can begin loading weights before attention is fully over.

That kind of control required a CUDA framework built around an instruction-based abstraction, where each “sub-kernel” can be implemented separately, plus virtualized shared memory to coordinate overlapping I/O. Fu also described ThunderKittens, a kernel-writing library developed for these patterns. He compared it to Triton but lower-level, with finer-grained control.

The payoff shown was near “speed-of-light” decode. On a Llama-1B BF16 batch-size-one decoding slide, the comparison was against vLLM and SGLang on H100 and B200 GPUs, with the megakernel shown as the faster implementation. A later summary slide used different comparison labels — vLLM, DeepSpeed, and Megakernel — on A100 and H100. Fu’s specific utilization claim was that the H100 implementation achieved 72% bandwidth utilization, which he described as close to the physical speed limit for that operation on the GPU.

72%: reported H100 bandwidth utilization for the megakernel Llama-1B decode benchmark

The trade-off is labor. In response to a question, Fu said the cost of megakernels is “people’s blood, sweat, and tears.” A strong kernel engineer might spend a year writing megakernels for one hardware target, two or three models, and batch sizes 1 through 16. Batch size 17 may require starting over. Together, he said, is working on compilers to automate some of the process, but megakernels remain difficult. If they are done well, they are extremely fast; the engineering burden is the price.

Another question asked how megakernels interact with multi-GPU communication. Fu said there is early work showing NCCL calls can be fused into a megakernel if configured correctly, though his team has not yet found a “killer use case.” DeepSeek-V2, he noted, included a megakernel for the mixture-of-experts inference layer that fused some communication. His expectation is that future systems may use megakernels for parts of computation rather than necessarily for entire models, unless teams are willing to pay the full engineering cost.

Parcae makes parameter reuse a model-design lever

Parcae’s model-design implication is that parameter count need not be the only way to buy more computation. Fu described Parcae as work from his UCSD lab, led by Hayden Prairie in collaboration with Zachary Novak and Taylor Berg-Kirkpatrick. The project asks whether models can reuse parameters through recurrence: instead of making the network deeper by adding distinct layers, run some transformer blocks repeatedly.

In a looped transformer, activations enter a recurrent block, pass through it, and then pass back through the same block multiple times. This increases FLOPs while keeping parameter count constant. If additional compute improves quality, looping gives the model designer a dial for spending compute without increasing the memory footprint of weights.
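The dial can be made concrete with a rough cost model. The sketch assumes a common approximation of about 12 × d_model² parameters per transformer block and about 2 FLOPs per parameter per token for each application of a block, and it ignores embeddings and attention over the sequence. The sizes are illustrative and not taken from Parcae.

```python
def block_params(d_model):
    """Rough parameter count of one block: ~4*d^2 for attention plus ~8*d^2 for the MLP."""
    return 12 * d_model ** 2


def looped_cost(d_model, n_unique_blocks, n_loops):
    """Return (parameters, forward FLOPs per token) when each block is applied n_loops times."""
    params = n_unique_blocks * block_params(d_model)
    flops_per_token = 2 * params * n_loops  # every loop re-applies the same weights
    return params, flops_per_token


# 16 distinct layers versus 4 shared blocks applied 4 times each.
deep_params, deep_flops = looped_cost(d_model=1024, n_unique_blocks=16, n_loops=1)
loop_params, loop_flops = looped_cost(d_model=1024, n_unique_blocks=4, n_loops=4)

print("deep:  ", deep_params, deep_flops)
print("looped:", loop_params, loop_flops)
```

Under this approximation the two configurations spend the same compute per token while the looped one stores a quarter of the weights. Whether the quality matches is the empirical question the project investigates.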

Fu gave two reasons to care about that dial. First, looped models may offer better quality per parameter — “intelligence per parameter,” as he put it. Second, prior work suggested higher expressivity: some functions can be represented more effectively with recurrence than with a non-recurrent model of the same parameter count.

The incentive is especially clear from the inference side. Fewer parameters can mean more room for KV cache, less communication across GPUs, and more flexibility in fitting models onto particular serving hardware. The architecture question is therefore not only whether recurrence improves benchmark quality; it is whether recurrence can move a model to a different serving regime.

Early signs made the idea worth investigating. Fu cited work from Tom Goldstein’s group at Maryland on scaling test-time compute with latent reasoning through recurrent depth, including ARC task results. He also mentioned social-media speculation that Claude Mythos was a looped language model, which he said he did not believe. According to Fu, the person who made the claim later wrote that he had made it up. Fu’s group had already been working on recurrence before that speculation, but the episode reflected wider interest in looped models.

The obstacle was instability. When Fu’s group tried training looped models and changed hyperparameters even slightly, models often failed to converge. Learning-rate sweeps produced NaNs and large loss spikes. Prior work used fixes such as inserting norms in every layer or choosing a particular learning rate and avoiding others. Fu treated the loss spikes as evidence that something deeper in the recurrence dynamics was wrong.

Parcae stabilizes looping by constraining the recurrence dynamics

Parcae’s technical move is to make the loop stable by construction. Fu’s group approached the problem through a state-space-model-like analysis rather than treating the recurrent transformer block as an opaque stack of attention, nonlinearities, softmax, GeLU, RoPE, and parameters.

Their empirical observation was that the residual activation did not change dramatically from block to block. That made it possible to model the residual dynamics more simply. They wrote the recurrent update as a dynamical system:

$$h_{t+1} = A h_t + B e + R(h_t, e)$$

In this expression, the complicated nonlinear transformer behavior is placed in $R$. The remaining terms are the matrices $A$ and $B$: $B$ transforms the initial vector $e$ injected into the loop, while $A$ determines how the residual $h_t$ transforms across loop iterations.

Fu said that when they empirically examined the system, the $A$ and $B$ terms dominated the magnitude. Dropping the nonlinear residual term produces a linear system with a closed-form solution. The key quantity is the spectral radius of $A$, which Fu described roughly as a norm-like measure. Because $A$ is repeatedly powered through the loop, an unstable value can blow up quickly. As a scalar analogy, if the recurrence multiplied by 2 for 16 steps, the activation would grow as $2^{16}$. This helps explain loss spikes.
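For reference, the closed form follows from unrolling the linearized update. The derivation below is standard linear-systems algebra rather than something quoted from the slides, and $n$, the number of loop iterations, and $\rho(A)$, the spectral radius, are notation introduced here.

$$h_{t+1} = A h_t + B e \quad\Longrightarrow\quad h_n = A^n h_0 + \sum_{k=0}^{n-1} A^k B e$$

If $\rho(A) < 1$, then $A^n \to 0$ and the sum converges, so the state settles toward a fixed point:

$$\lim_{n \to \infty} h_n = (I - A)^{-1} B e$$

If $\rho(A) > 1$, the $A^n h_0$ term generically grows geometrically. In the scalar case with $A = 2$, $B e = 0$ and $h_0 = 1$, this gives $h_{16} = 2^{16} = 65536$.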

In previous looped transformer designs, Fu said, choices for $A$ and $B$ were marginally stable or unstable. Parcae constrains them. For $A$, the group effectively made the matrix a negative diagonal form such that the powered term eventually goes to zero rather than exploding. For $B$, they applied a simple linear norm, noting that $B$ is only applied once and does not itself blow up. The resulting spectral radius is less than 1, making the linearized system stable.
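A toy diagonal recurrence shows the difference between an unstable and a constrained $A$. The parameterization in `stable_diagonal`, which maps unconstrained values through exp(-step · exp(p)) so that every entry lies strictly between 0 and 1, is an assumption borrowed from common state-space-model practice. It is one way to get a spectral radius below 1 and is not a confirmed detail of Parcae.

```python
import math


def run_loop(a_diag, b_e, n_steps, h0=None):
    """Iterate h <- A h + B e for a diagonal A (the nonlinear term R is dropped)."""
    h = list(h0) if h0 is not None else [0.0] * len(a_diag)
    for _ in range(n_steps):
        h = [a * x + u for a, x, u in zip(a_diag, h, b_e)]
    return h


def stable_diagonal(raw_params, step=1.0):
    """Hypothetical constraint: map any real value into (0, 1) via exp(-step * exp(p))."""
    return [math.exp(-step * math.exp(p)) for p in raw_params]


def spectral_radius_diag(a_diag):
    """For a diagonal matrix, the spectral radius is the largest absolute entry."""
    return max(abs(a) for a in a_diag)


# The scalar analogy from the talk: multiplying by 2 for 16 steps.
unstable = run_loop(a_diag=[2.0], b_e=[0.0], n_steps=16, h0=[1.0])

# Constrained A: whatever the raw parameters are, the radius stays below 1.
a_stable = stable_diagonal([0.0, -2.0, 1.5])
settled = run_loop(a_stable, b_e=[1.0, 1.0, 1.0], n_steps=500)
fixed_point = [1.0 / (1.0 - a) for a in a_stable]  # scalar form of (I - A)^{-1} B e

print(unstable, spectral_radius_diag(a_stable))
print(settled, fixed_point)
```

Because stability is built into the parameterization, it holds however the raw parameters move during training.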

The empirical result was stable loss curves, including at a learning rate of 6e-4 that caused other recurrent models to fail. Fu contrasted three behaviors. An unconstrained baseline can blow up to norms on the order of $10^{19}$. A model with normalization may keep activation norms controlled, but the model still tries to expand activations while the norm forces them back down; that tension appears as loss spikes. Parcae’s reparameterization stabilizes the recurrent residual without relying on residual norm in the same way.

The stability change also improved quality. Fu reported that Parcae outperformed recurrent depth model baselines across validation, WikiText, HellaSwag, ARC-C, and ARC-E results shown in the table. It also outperformed strong transformer baselines based on a fast-learning nano-chat-style transformer architecture: taking the same basic transformer, looping it, and stabilizing it produced better perplexities and downstream quality.

The scaling-law result is that more data favors more recurrence

Parcae’s scaling-law result turns recurrence into a knob that should move with data, not a fixed architectural novelty. Fu connected this to the familiar scaling-law question: given a training FLOP budget, should a model get more parameters, more data, or both? Prior scaling-law plots, in his shorthand, slope “down and to the right,” meaning compute-optimal scaling increases both model size and data.

For recurrence, the question becomes whether to keep recurrence fixed, increase it aggressively, or scale it jointly with data and parameters. Fu’s group ran early scaling-law experiments using iso-parameter and iso-FLOP curves. Model size was fixed within a curve, while recurrence and training data varied. The experiments shown used 140M and 370M parameter models over FLOP budgets from 1e18 to 128e18.

The pattern, Fu said, was again “down and to the right”: as data increases, optimal recurrence increases. Recurrence followed predictable power-law relationships. The slide reported recurrence scaling approximately as $\mu_{rec} \propto C^{0.40}$ for 140M and $\mu_{rec} \propto C^{0.38}$ for 370M, while tokens scaled approximately as $D \propto C^{0.77}$ and $D \propto C^{0.78}$. Fu summarized this as FLOP-optimal scaling laws that scale recurrences and tokens jointly.

Model size	Optimal recurrence scaling	Optimal token scaling
140M parameters	μ_rec ∝ C^0.40	D ∝ C^0.77
370M parameters	μ_rec ∝ C^0.38	D ∝ C^0.78

The Parcae scaling-law slide reported recurrence and token-count power laws that increase together with compute
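To get a feel for what those exponents imply, the sketch below applies them across the range of budgets in the experiments, from 1e18 to 128e18 FLOPs. It takes the reported exponents at face value and computes only relative growth, since the proportionality constants are not given in the source.

```python
def growth_factor(compute_ratio, exponent):
    """If a quantity scales as C^exponent, return its growth when C grows by compute_ratio."""
    return compute_ratio ** exponent


COMPUTE_RATIO = 128e18 / 1e18  # largest budget divided by smallest budget

EXPONENTS = {
    "140M": {"recurrence": 0.40, "tokens": 0.77},
    "370M": {"recurrence": 0.38, "tokens": 0.78},
}

growth = {
    model: {name: growth_factor(COMPUTE_RATIO, exp) for name, exp in exps.items()}
    for model, exps in EXPONENTS.items()
}

for model, factors in growth.items():
    print(model, {name: round(value, 1) for name, value in factors.items()})
```

Across a 128-fold increase in compute, the fitted laws put optimal recurrence at roughly 6 to 7 times higher and optimal token count at roughly 40 times higher, so both grow, with data growing faster.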

A student asked about jointly scaling model size as well. Fu said the team had a complex 3D figure over recurrence, data, and parameters that was hard to read, but if believed, it suggested scaling all three together. The cleaner result was for fixed model size: if data increases, recurrence should increase too. Fu noted that current large models, as far as he knows, have no recurrence and sit at the far left of these curves while using enormous amounts of data. That suggests there may be a better training regime available.

The comparison between predicted optimal looping and predicted fixed depth reinforced the point. For fixed model size and fixed FLOP budgets, the orange fixed-depth curve represented a traditional transformer, while the blue curve represented the looping model’s optimal recurrence. The dots compared models trained with the same number of FLOPs but different mixes of data and recurrence. Increasing recurrence as well as data produced lower validation loss. Fu’s tentative conclusion was that large pretraining runs may benefit from looping, though he presented it as an implication of early scaling-law evidence rather than a settled rule.

Serving economics decide whether recurrence is worth using

The practical case for recurrence depends on what constraint the model is designed around. In Q&A, a student asked whether, for compute-optimal training, looping is ever preferable to simply adding more parameters, or whether it is mainly an inference-cost trick. Dan Fu answered that compute optimality is partly contrived because if higher quality is the only objective, one can usually increase the FLOP budget, increase model size, train longer, or choose a better overtrained point.

The design decision depends on constraints: serving cost, model size, open-source release targets, whether users can run the model on a laptop, and how the model will be deployed. Bigger models trained on more data generally improve, but recurrence becomes interesting at particular design points where parameter count or serving footprint matters.

That is also why Fu is interested in pre-trained models that can be looped after the fact. A student asked whether Parcae trains from scratch or whether one could take a pre-trained model and loop it. Fu said Parcae’s work trains from scratch, but he also described a blog post in which, according to Fu, someone looped two or three layers in a Qwen model without training and saw higher quality on some math tasks. Fu called the result weird and said his group has some related work under investigation. His hope was to examine activations and weights to understand why looping a pre-trained model could improve behavior at all.

Asked about inference implications, Fu said one reason he is excited about looped models is that GPU memory is a major bottleneck in serving. Fewer parameters can leave more room for KV cache, reduce communication, or allow the model to fit on fewer GPUs. He also imagined a recurrent block small enough to be implemented as a very fast megakernel loop, though his group had not yet made the blocks small enough.

Specialized hardware sharpens that point in Fu’s telling. He said future Groq LPU chips associated with the Nvidia systems he described may have around 256 megabytes of memory, meaning only very small weights can fit. If an architecture can be designed to fit those constraints, weights could stay in memory while activations run through quickly. Crossing such thresholds can create nonlinear benefits.

Architecture choices increasingly depend on the serving hardware and workload

The co-design implication is that model architecture is increasingly a serving decision. Asked how to design a model for known serving platforms such as Groq or Cerebras, Dan Fu said memory comes first. For a Cerebras chip, one should inspect the wafer memory and size the model so it fits with enough room for KV cache or other requirements.

Fu also said some Chinese models appear, in his view, to be making choices that may reflect Huawei hardware constraints. Quantization formats are another hardware-dependent choice in his account: a model intended for Nvidia GPUs might train in NVFP4, an Nvidia-specific FP4 format, while AMD-oriented systems may use MXFP4. He described the formats as having different trade-offs.

Workload also shapes architecture. In agentic workflows, keeping KV cache hot matters greatly because sessions involve repeated turns over shared context. In batch processing where each document is seen once, KV cache reuse may matter much less. Fu pointed to DeepSeek’s MLA attention as a radical compression of KV cache relative to other models, and to FP8 or FP4 KV-cache processing as major departures for workflows sensitive to cache size.

The largest architectural distinction is causal versus non-causal attention. For batch processing, bidirectional models such as BERT can process input once, produce a vector, store it in a database, and avoid decode entirely. Fu said Google used BERT models for search for a long time and added that he thinks it probably still does. Chat workflows necessarily have an autoregressive decode portion. Intermediate encoder-decoder choices such as T5 sit between those poles.

Across these examples, Fu’s argument was consistent: inference is not downstream plumbing. Serving constraints reach back into routing, kernels, architecture, quantization, recurrence, model size, and hardware selection.
