Text Diffusion Trades Batch Throughput for Faster, Revisable Generation

Brendon DillonAI EngineerThursday, June 4, 202611 min read

Google DeepMind’s Brendon Dillon argues that text diffusion changes language generation by refining blocks of tokens rather than committing to one token at a time. In his account, that gives diffusion models lower latency and the ability to revise earlier text after later reasoning emerges, but it also creates a serving problem: weaker throughput when many requests are batched at scale. Dillon frames the technology as most compelling today for on-device and interaction-heavy products, where fast, revisable generation matters more than large-batch economics.

Text diffusion buys latency and revisability by spending compute on blocks

Text diffusion changes the unit of language generation. Instead of committing to one next token at a time, the model starts with a noisy block of tokens and repeatedly refines the whole block into coherent text. Brendon Dillon describes it as the language analogue of image diffusion: during training, a clean sequence is corrupted at different noise levels — for example by replacing tokens with random tokens — and the model learns to correct the corrupted text. At inference time, it begins from random discrete tokens from the vocabulary and denoises them over multiple passes.

Standard Gemini, Gemma, GPT-style generation conditions on a context and produces one token, then the next token conditioned on the previous one, and so on. A diffusion language model receives the context, initializes a “canvas” of tokens as noise, and updates that canvas jointly over a number of denoising steps.

That means a diffusion model is not limited to causal attention inside the generated block. It can attend bidirectionally within the canvas: earlier positions can be revised after later positions become clearer, and later positions can inform earlier ones. Dillon frames that as the basis for faster inference, bidirectional reasoning, self-correction, adaptive computation, fast in-place editing, and a different latency-throughput tradeoff.

The tradeoff is the central constraint. Dillon says the largest practical reason text diffusion is not used everywhere today is lower throughput for large batches. Autoregressive models are slow for each individual user, but a serving system can batch many user requests together and keep GPU or TPU utilization high. Text diffusion lowers latency for an individual request, but because it performs multiple forward passes over the same generated block, it can hit a compute threshold earlier and become more expensive to serve at scale.

In Dillon’s formulation, this is not primarily a quality objection. It is a serving economics objection: low latency for one user, worse throughput when many users are batched.

The latency gain comes from moving less model state through memory

The hardware explanation turns on memory bandwidth, not just raw compute. Brendon Dillon says large transformer inference on GPUs and TPUs is often memory-bound. Tensor cores can perform large matrix multiplies quickly, but weights, activations, and the KV cache must be streamed from high-bandwidth memory into the compute units. The bandwidth channel is tight relative to available FLOPs, and Dillon says that ratio is becoming more pronounced in future GPU generations: it is easier to add compute than to add memory bandwidth.

Autoregressive generation pays that memory-transfer cost once per generated token. At batch size one, each new token requires streaming over the model and relevant cache to produce a single token, then repeating the process for the next token.

Diffusion changes the unit of generation. If a model generates a 256-token block over 20 forward passes, it does not stream the model once per token. It streams the model once per denoising step. Dillon’s rough arithmetic is that 20 forward passes to generate 256 tokens can mean about 10 times fewer memory transfers than an autoregressive model. If inference is truly memory-bound, that translates into a comparable latency gain.

~2,000 tokens/s

reported browser-visible Gemini Diffusion generation speed in the research demo

The Gemini Diffusion research preview, released roughly a year before the talk to about 100,000 people, reportedly reached around 2,000 tokens per second in the browser. Dillon emphasizes that the displayed tokens-per-second figure included prefill and other overhead; it was not a synthetic decode-only number. He also notes that the longer the generated sequence, the less the request is dominated by prefill. For very short outputs, the benefit narrows because the prefill cost dominates.

A benchmark slide from the older research preview compared Gemini Diffusion with Gemini 2.0 Flash-Lite, Mercury Small, ChatGLM, and LLaDa across coding, general, science, math, multilingual, and latency categories. Dillon cautions not to “fixate” on the numbers because they were a year old, which he called “prehistoric times” in the field. The relevant claim was narrower: the Gemini Diffusion model, branched from Gemini 2.0 Flash-Lite, had broadly similar quality at substantially better latency, with some advantage in code and some disadvantages elsewhere.

Bidirectional attention lets the model revise an answer after seeing its own reasoning

The clearest demonstration of the architectural difference was a math prompt: compute an expression involving the square root of 81, (2/3)^2, and a second arithmetic term, and provide the answer before deriving the solution. Brendon Dillon says the correct answer is 39.

Gemini Diffusion did not get it right immediately. After one denoising step, the answer line said 60 while the reasoning underneath was still partially incoherent. After two denoising steps, the answer changed to 49 and the reasoning became more complete, though still unstable.

The third displayed denoising state carried the key transition: the reasoning at the bottom had reached the correct conclusion — “36 + 3 = 39” — but the answer line at the top still read 49. In the final displayed state, the model had gone back and edited the answer line itself to 39 while also cleaning up the intermediate reasoning.

It had a mistake twice, 60 and 49. But once it finished the reasoning, it was able to return back and fix the mistake that it made at the start.

Brendon Dillon · Source

The point is not that the model never makes a bad early guess. It is that the early answer is not fixed once emitted. Because the model is refining a whole block rather than committing to a left-to-right stream, it can use later reasoning to edit earlier text.

Dillon contrasted that with larger autoregressive models on the same prompt. ChatGPT-4o initially gave 40, then produced a correction saying the answer was actually 39. Gemini 2.5 Flash gave 42 and, according to the slide, retained the error even after writing an invalid step that added 6 and 36 to get 42. Dillon’s interpretation is that autoregressive models can incorporate an early mistake into later reasoning because the earlier token is already part of the fixed past.

He acknowledges that modern reasoning or “thinking” models can mitigate this, but describes that as moving the problem elsewhere rather than changing the underlying one-token-at-a-time constraint. The diffusion model’s advantage in this example is structural: future generated content can influence earlier generated content before the final answer is returned.

More denoising steps can buy quality, but the model can also stop early

Brendon Dillon separates two related ideas: dynamic computation and adaptive computation.

Dynamic computation means giving the diffusion model more denoising steps at inference time. In internal coding evaluations shown on a slide, performance generally increased as the number of denoising steps per token increased. Dillon says the relationship is not exactly monotonic, but is “roughly monotonic” across the monitored evaluations: even when an answer is nearly clean, another pass can reveal and correct a remaining mistake.

Adaptive computation means training the model to decide when it is done. The user or serving system can still impose a maximum number of denoising steps, but the model may stop earlier. Dillon showed three examples from Gemini Diffusion: generating the first 100 digits of pi took four denoising steps; writing FizzBuzz in Python took 18; explaining quantum mechanics in one paragraph took 31.

Prompt	Denoising steps
First 100 digits of pi	4
Write FizzBuzz in Python	18
Explain quantum mechanics in a single paragraph	31

Gemini Diffusion examples of adaptive computation

Dillon’s explanation is that the pi prompt is easy for the model because the response can be memorized and emitted as a block. An autoregressive model, by contrast, would only have produced four tokens after four steps. Code generation required more passes. The quantum mechanics paragraph required more still.

He says the same pattern appeared in evaluations: harder tasks, such as GPQA Diamond for the model size they were targeting, took longer; easier programming tasks, such as MBPP, took less time. The step count was determined by the model rather than by a hand-coded prompt category.

In the Q&A, Dillon added that larger diffusion models tend to require fewer denoising steps for the same output. Even though a larger model has more FLOPs per forward pass, it may need fewer passes, creating what he called a kind of diminishing serving cost for bigger models.

In-place editing follows naturally from the same non-causal generation pattern

Text in-place editing follows from the same structure as image diffusion. Brendon Dillon points to image-editing diffusion models that can see the whole image, a prompt, and a masked or corrupted region, then fill in the missing content consistently with the surrounding pixels. They are not raster-scanning left to right, top to bottom. Every pixel can condition on every other pixel.

Text diffusion has the analogous property. A code example can be presented with a bug, and the model can modify the relevant indices directly rather than regenerate the whole file one token at a time. A user can ask for documentation to be added, and the model can insert text in the appropriate places. In prose, a user can provide a first and third paragraph and ask for a middle paragraph; because the model can see both sides, it can fill the missing section consistently with the surrounding story.

The architectural claim is the same as in the math example: the generated sequence is not a fixed past plus an unknown future. It is a canvas that can be refined in place.

Low latency makes generated interfaces feel less like waiting for text

Low latency is not merely the same chatbot experience made faster. Brendon Dillon argues it can support interfaces where generation happens behind every interaction.

One application pattern is a Wikipedia-like site in which the page content and HTML are generated on the fly. Clicking through the interface produces new pages quickly enough to resemble browsing a real encyclopedia page.

Another is a fake Reddit clone, “GREDDIT,” where posts, comments, images, and page structure are generated dynamically. The Gemini Diffusion model generates the text and HTML; a separate image model produces images more slowly, appearing later. Dillon jokes that all the replies to a user’s posts would now be from bots, “if they weren’t already,” but the demonstration shows an interactive site synthesized around an arbitrary topic.

A third pattern is a simulated operating system. Each click generates the next screen: desktop, readme files, web pages, navigation back to the desktop, and other interface states. Dillon calls it his favorite example because the whole operating-system surface is generated in response to user actions.

The last example came from someone outside Dillon’s team using the Gemini Diffusion web page for voice-driven “vibe coding.” The spoken request asked for a to-do app, 10 random to-dos, completed states, four random completed items, sorting by name and state, later deletion and addition behavior, and a dark mode conversion. The demonstrator said it was “literally 15 seconds of work.”

Dillon’s broader claim is that once latency falls enough, generation can be part of the interaction loop rather than an output the user waits for after submitting a prompt. He frames the next generation of diffusion models as an invitation to AI engineers to find new product forms, not just faster versions of current ones.

The strongest deployment case is where batching economics do not dominate

The operational details are still closer to research infrastructure than a finished product surface. Brendon Dillon says the team uses the same data as for autoregressive models, though the algorithms change. Reinforcement learning from human feedback and other RL-style approaches are possible, but require algorithmic changes. Distillation is possible too, though Dillon was unsure whether any relevant techniques from the team had been published.

For output length, the simplest approach is to fix a window length and iterate within that window. Longer outputs can be generated as multiple blocks in a blockwise autoregressive fashion; Dillon says the previous window could potentially be revisited, but their approach sets it “in stone” and continues. A length-prediction head is also possible. The same blockwise framing explains hybrid autoregressive-diffusion systems: prefill is the usual context prefill, and the generation step then proceeds in fixed-size diffusion blocks — 512, 1,000, 32, or whatever size is chosen.

The models Dillon discusses are discrete diffusion models: tokens in, tokens out. The simplest way to return token IDs is a vanilla logits prediction head at the top. Latent-space text diffusion is possible and has been explored elsewhere, but he says most text diffusion literature today uses discrete diffusion. On availability, he says there is no open-source version from his team, though there is outside literature, and that the team is “gonna release something soon.” He did not give pricing details; when asked how to price an API when tokens no longer capture the cost structure cleanly, he answered that they had not reached that point and that he is a research scientist.

The deployment case Dillon emphasized is on-device and other low-batch settings. If a model runs on a phone, a robot, or another device, it is not being batched with thousands of server-side requests. In that setting, the throughput penalty matters less, and low latency matters more. Dillon says diffusion models are already in “a couple” of on-device applications within the Alphabet ecosystem, including robotics-related settings.

When asked whether diffusion and autoregressive models will occupy different use cases, Dillon’s answer was yes, at least for now. When pressed on whether diffusion models can reach the quality of current frontier models, his answer was that quality is not the concern; throughput in large-batch serving is.

Quality isn't the concern. It's the throughput for serving in a big batch setting.