Orply.

DeepSeek’s DualPath Raises GPU Utilization by Rerouting KV-Cache Traffic

Károly Zsolnai-Fehér · Two Minute Papers · Monday, June 22, 2026 · 5 min read

Károly Zsolnai-Fehér presents DeepSeek’s DualPath paper as an infrastructure fix for a specific bottleneck in agentic LLM serving: GPUs can sit underused because KV-cache data cannot reach the model fast enough. The work, from researchers affiliated with Peking University, Tsinghua University and DeepSeek-AI, argues that routing memory traffic through underused decoding machines can relieve the congested prefill path while keeping computation traffic prioritized. In the demonstrated setting, Zsolnai-Fehér says GPU utilization rises from about 40% to about 80% without adding compute.

More GPUs do not help if KV-cache I/O is the constraint

More compute does not automatically make an AI assistant answer faster. Károly Zsolnai-Fehér frames the problem as a counterintuitive failure mode in AI serving: companies can spend more on GPUs and still hit a speed plateau because the bottleneck is not always computation. In the visual he uses, speed rises with money spent on GPUs and then flattens.

The limiting factor, in his explanation, is the path by which information reaches the model. His analogy is deliberately simple: imagine trying to discuss a book while forgetting the characters every time the page turns. If the book is one page, that barely matters. If the book is huge, the system must keep rereading. The “brain” can be large and power-hungry, but useful work is gated by how fast context can be brought back in.

The paper screenshot names the issue directly: “DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference.” Its visible abstract says the performance of multi-turn, agentic LLM inference is increasingly dominated by KV-cache storage I/O rather than computation, and that in prevalent disaggregated architectures, loading massive KV-cache data from external storage creates a fundamental imbalance. The visible affiliations include Peking University, Tsinghua University, and DeepSeek-AI.

In Zsolnai-Fehér’s terms, today’s GPUs can spend too much time waiting for data to arrive “through a straw” instead of doing useful computation. A gauge shown alongside a cost scale from $0.1B to $2.9B puts utilization at 40%, which he calls “a horror story”: billions of dollars of hardware sitting at only partial utilization.

40%
GPU utilization shown before the optimization

The fix is a second path, but only if computation keeps priority

The key move, as Zsolnai-Fehér explains it, depends on a distinction inside the serving network. “Prefill machines” do the reading; they are the straws carrying context into the system. In the scenario he describes, those straws are jammed. “Decoding machines,” by contrast, are different machines in the network whose own data paths are often close to empty.

The researchers’ answer is to use those underused decoding machines to help with the reading. Instead of relying only on the congested prefill path, the reading work can take a second path to the prefill machines. Zsolnai-Fehér calls it “a clever detour that lets the brain do its job.”
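A toy calculation makes the detour concrete. The sketch below is not the paper's routing algorithm; it assumes a simplified model in which a KV-cache load is split across the prefill path and the spare decode-side bandwidth in proportion to what each can carry. The function name `split_kv_load` and all of the numbers are hypothetical.

```python
def split_kv_load(kv_gb, prefill_gbs, decode_spare_gbs):
    """Toy model: split a KV-cache load across two paths in proportion to
    each path's available bandwidth (GB/s), so both finish together."""
    total_gbs = prefill_gbs + decode_spare_gbs
    via_prefill = kv_gb * prefill_gbs / total_gbs
    via_decode = kv_gb - via_prefill
    seconds = kv_gb / total_gbs
    return via_prefill, via_decode, seconds

# Hypothetical numbers, not taken from the paper.
print(split_kv_load(120, 10, 0))  # single path: (120.0, 0.0, 12.0)
print(split_kv_load(120, 10, 5))  # with detour: (80.0, 40.0, 8.0)
```

In this simplified model the load finishes sooner only because bandwidth that was sitting idle on the decode side is now carrying part of the data.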

The detour has an obvious risk: the shortcut uses the same high-speed roads the system needs for thinking. If unmanaged, it could fix one traffic jam by creating another. The mechanism therefore depends on traffic control. Computation gets priority; memory movement uses leftover space.
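The priority rule can be illustrated with a minimal sketch, assuming a single shared link with a fixed capacity. This is an illustration of strict prioritization in general, not the scheduler described in the paper; `allocate_link` and the bandwidth figures are made up for the example.

```python
def allocate_link(capacity, compute_demand, kv_demand):
    """Strict priority on one shared link: computation traffic is served
    first, and KV-cache movement only gets whatever capacity is left."""
    compute = min(compute_demand, capacity)
    leftover = capacity - compute
    kv = min(kv_demand, leftover)
    return compute, kv

# Hypothetical demands in the same bandwidth unit as capacity.
print(allocate_link(400, 300, 250))  # (300, 100): KV traffic fills the gap
print(allocate_link(400, 450, 250))  # (400, 0): computation takes the link
```

When computation needs the whole link, memory movement on that link waits, which is the behavior that keeps the detour from creating a new traffic jam.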

It does not give you more compute. No, it gives you access to the compute that you already have.

Károly Zsolnai-Fehér

That is why the claim is about utilization rather than raw model capability. The work is not presented as a new model, a better reasoning algorithm, or a larger neural network. It is a serving-system technique for extracting more work from infrastructure that has already been purchased.

The demonstrated gain is roughly twice as much useful work on the same hardware

The reported result is a jump in utilization from around 40% to about 80%. The utilization graphic moves from an “underutilized” region through intermediate values — 2%, 11%, 24%, 34%, 40%, and 61% — into an “optimized” region around 76% to 80%.

~80%
GPU utilization Zsolnai-Fehér says the technique reaches in the demonstrated setting

He translates that into operational terms: “almost twice as much work from the machine you already bought.” The caveat is explicit and important: this is not a general promise that every AI agent runs twice as fast. It is situational.
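The "almost twice" figure follows from the ratio of the two utilization numbers. A quick check, assuming useful work scales linearly with utilization on fixed hardware (a simplification; `work_ratio` is a hypothetical helper):

```python
def work_ratio(util_before, util_after):
    """Relative useful work on the same hardware, assuming work is
    proportional to GPU utilization."""
    return round(util_after / util_before, 2)

print(work_ratio(0.40, 0.76))  # 1.9
print(work_ratio(0.40, 0.80))  # 2.0
```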

The situations that matter are long, multi-turn, agentic workloads: long conversations, lots of data, and hard problems where the system repeatedly needs large amounts of context. In Zsolnai-Fehér’s explanation, those are also the conditions under which current systems slow down most sharply.

A performance chart attributed on screen to Wu et al. 2025 reports offline inference experiments under varying numbers of agents and maximum agent context lengths. The visible settings include 32k, 48k, and 64k maximum agent lengths, across models labeled DS 27B, DS 660B, and Qwen 32B. Some entries are marked N/A because they ran into an error before finishing.

The narrower claim is therefore not “all AI gets twice as fast.” It is that when KV-cache storage I/O dominates agentic inference, routing the work differently inside the serving system can make already-bought machines do substantially more useful work.

The value is infrastructure-level, not consumer-facing

Károly Zsolnai-Fehér distinguishes the work from a “shiny new AI system.” It is “not the brain,” he says, but “a better road system to the brain.” That makes it less headline-friendly than a new chatbot or model release, but potentially important for the economics of serving large AI systems.

The technique would be implemented in a data center, not experienced by users as a new interface. If adopted in real serving systems, it might contribute to cheaper AI inference in the future. The economic logic is conditional but direct: if expensive GPU fleets are bottlenecked by memory movement rather than computation, raising utilization on already-owned hardware can change the cost structure of inference.
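The cost argument can be sketched the same way. The example below assumes a fixed hourly price per GPU and treats only the utilized fraction of each hour as useful; the $2.00 rate is a placeholder, not a figure from the video or the paper.

```python
def cost_per_useful_hour(price_per_gpu_hour, utilization):
    """Effective cost of one fully useful GPU-hour when only a fraction
    of each paid hour does useful work."""
    return price_per_gpu_hour / utilization

# Placeholder hourly price; only the ratio between the two results matters.
before = cost_per_useful_hour(2.00, 0.40)
after = cost_per_useful_hour(2.00, 0.80)
print(round(before, 2), round(after, 2))  # 5.0 2.5
```

Under these assumptions, doubling utilization halves the effective cost of useful compute without changing what was paid for the hardware.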

The work is also described as openly available. Zsolnai-Fehér says the researchers “give this technique away for all of us, for free, forever.” The visible paper lists authors from Peking University, Tsinghua University, and DeepSeek-AI.

The billion-dollar problem, in this framing, is not only the cost of buying more GPUs. It is the waste created when those GPUs cannot be kept busy on the workloads they were bought to serve. DualPath is presented as a way to attack that waste by routing memory traffic more intelligently, preserving priority for computation, and raising utilization in the long-context agentic settings where the old path becomes the bottleneck.
