Orply.

Together AI Targets 100-Millisecond Responses With Full-Stack NVIDIA Inference

Dan Fu · NVIDIA · Tuesday, June 30, 2026 · 5 min read

Together AI’s Dan Fu argues that low-latency inference is a stack-wide engineering problem, not a single model optimization. In NVIDIA’s account of the company’s work, Fu says Together AI uses NVIDIA GPUs and software including CUDA, CUTLASS, TensorRT-LLM and Dynamo to support projects such as a megakernel for returning the first 64 words of a voice-agent response within 100 milliseconds and ATLAS, a system for adapting speculative decoding as traffic changes.

The latency target is concrete: 64 words in 100 milliseconds

Dan Fu describes Together AI as both “an AI cloud and an inference provider,” with research embedded throughout its stack. This is Together AI’s own account of its work: the company’s focus, as Fu frames it, is making models run better on GPUs, and on NVIDIA GPUs in particular.

The clearest example is a real-time voice-agent requirement: a customer asks Together AI to return the first 64 words within 100 milliseconds. That is a first-response latency constraint, not a vague benchmark. The system has to produce enough usable output fast enough for an interactive voice experience to feel immediate. The “megakernel” is the project Fu names as the answer to that kind of deadline.
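To make the deadline concrete, the arithmetic can be sketched as follows. The 64-word and 100-millisecond figures come from Fu's example; the tokens-per-word ratio is an assumed rule of thumb, not a figure from the source, and the sketch treats the whole 100 milliseconds as available for generation, ignoring network and prompt-processing time.

```python
# Rough time budget implied by "first 64 words within 100 milliseconds".
TARGET_WORDS = 64        # from Fu's voice-agent example
BUDGET_MS = 100.0        # from Fu's voice-agent example
TOKENS_PER_WORD = 1.3    # ASSUMPTION: rough rule of thumb for English text


def per_unit_budget_ms(budget_ms: float, units: float) -> float:
    """Milliseconds available per unit if the budget is split evenly."""
    return budget_ms / units


ms_per_word = per_unit_budget_ms(BUDGET_MS, TARGET_WORDS)
tokens_needed = TARGET_WORDS * TOKENS_PER_WORD
ms_per_token = per_unit_budget_ms(BUDGET_MS, tokens_needed)
required_tokens_per_second = tokens_needed / (BUDGET_MS / 1000.0)

print(f"{ms_per_word:.4f} ms per word")
print(f"{ms_per_token:.4f} ms per token (assuming {TOKENS_PER_WORD} tokens per word)")
print(f"{required_tokens_per_second:.0f} tokens per second sustained over the window")
```

Under these assumptions the budget works out to roughly 1.6 milliseconds per word, which suggests why a per-step overhead of even a fraction of a millisecond would matter.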

100 ms: target time to return the first 64 words in Fu’s real-time voice-agent example

The megakernel, as Fu describes it, can take “a whole model” and put it “into a single kernel.” The point is tied to a specific product constraint rather than a generic performance claim: real-time voice agents need a very fast initial response.
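Fu does not describe the megakernel's internals, so the following is only a toy cost model of one commonly cited reason for fusing work into fewer GPU kernels: each kernel launch carries a fixed overhead. Every number and name here is hypothetical and chosen for illustration, and the model ignores effects such as overlapped launches and memory traffic.

```python
def step_latency_us(num_launches: int, launch_overhead_us: float, compute_us: float) -> float:
    """Toy model: one decode step costs fixed compute plus a per-launch overhead."""
    return num_launches * launch_overhead_us + compute_us


# HYPOTHETICAL values, not measurements of any real model or GPU.
LAYERS = 32
KERNELS_PER_LAYER = 10
LAUNCH_OVERHEAD_US = 5.0
COMPUTE_US = 800.0

many_kernels = step_latency_us(LAYERS * KERNELS_PER_LAYER, LAUNCH_OVERHEAD_US, COMPUTE_US)
single_kernel = step_latency_us(1, LAUNCH_OVERHEAD_US, COMPUTE_US)

print(f"many small kernels: {many_kernels:.0f} us per step")
print(f"one fused kernel:   {single_kernel:.0f} us per step")
```

In this toy setting the launch overhead, not the compute, dominates the unfused step, which is the kind of cost a single-kernel design would aim to remove.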

“So there's real-time voice agents where a customer will come to us and say, hey, I need you to get the first 64 words back to me within 100 milliseconds. And we say, okay, hey, I've got a megakernel for that.”
Dan Fu

That example sets the shape of Together AI’s inference work. Fu is not presenting latency as a single optimization pass. He is describing a stack problem that spans model execution, decoding, traffic adaptation, GPU kernels, and the NVIDIA software and hardware environment underneath them.

ATLAS is meant to keep speedups as traffic changes

Fu’s second named project is Together ATLAS. He characterizes it as a system for adapting speculative decoders to user traffic over time. The operational premise is that deployed workloads do not stay fixed: traffic may change, users may change, and the serving system still has to preserve performance.

In Fu’s phrasing, ATLAS can “take the speculative decoders that come with our models and adapt them to user traffic over time.” The promised result is that, even as those patterns shift, “you get all the speed up.”

That is more specific than saying ATLAS makes inference faster in general. The claim is about adaptation: the decoding strategy changes with the traffic it is seeing, rather than assuming one static pattern of use. In the structure Fu lays out, the megakernel addresses a hard first-token or first-response deadline, while ATLAS addresses the fact that the distribution of requests behind that deadline can move. One is framed around fitting the model execution into a tighter serving path; the other is framed around preserving acceleration as users and traffic patterns change.
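A short sketch can show why traffic shifts threaten speculative-decoding speedups. It is not ATLAS and not Together AI's method. It uses the standard textbook analysis of speculative decoding, which assumes each drafted token is accepted independently with the same probability. The function names and the cost ratio are hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each drafted token is accepted independently with probability
    `acceptance_rate`. The result includes the one token the target model
    contributes on every pass: (1 - a**(k + 1)) / (1 - a).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


def speedup(acceptance_rate: float, draft_len: int, draft_cost_ratio: float) -> float:
    """Tokens per unit time relative to plain decoding.

    `draft_cost_ratio` (hypothetical parameter) is the cost of one draft-model
    token relative to one target-model pass.
    """
    tokens = expected_tokens_per_step(acceptance_rate, draft_len)
    return tokens / (1.0 + draft_len * draft_cost_ratio)


def best_draft_len(acceptance_rate: float, draft_cost_ratio: float, max_len: int = 8) -> int:
    """Draft length in [0, max_len] that maximizes the modeled speedup."""
    return max(range(max_len + 1), key=lambda k: speedup(acceptance_rate, k, draft_cost_ratio))


# Same drafter, two hypothetical traffic mixes with different acceptance rates.
for rate in (0.8, 0.5):
    k = best_draft_len(rate, draft_cost_ratio=0.05)
    print(f"acceptance={rate}: best draft length={k}, "
          f"speedup={speedup(rate, k, 0.05):.2f}x")
```

In this model, a drafter whose acceptance rate falls from 0.8 to 0.5 loses a large share of its speedup even at its best draft length. That is the kind of erosion an adaptive system would need to counteract, whether by retuning settings or, as Fu describes for ATLAS, by adapting the speculative decoders themselves.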

Together, those examples make Fu’s latency argument more concrete. The performance problem is not only that models must run quickly on idealized benchmarks. They must run quickly for real applications, under changing demand, while still producing responses at the pace users expect.

NVIDIA’s stack is the substrate for Together AI’s deeper optimizations

Dan Fu ties Together AI’s work directly to NVIDIA’s ecosystem. He names CUDA and CUTLASS among the libraries and technologies that let the company “build very deep things in AI.” Later, he also points to NVIDIA Dynamo and NVIDIA TRT as “huge parts” of the ecosystem.

The relationship he describes is not just access to GPUs. It is the software-and-hardware environment around NVIDIA GPUs that supports low-level and systems-level inference work. CUDA and CUTLASS appear in Fu’s account as tools for building close to the GPU. Dynamo and NVIDIA TRT appear as major pieces of the broader ecosystem Together AI uses.

Fu also refers to NVIDIA Blackwell and Vera Rubin. He says Together AI has been “really excited” about Blackwell “for a long time” and is also excited for Vera Rubin and how the company will be able to use it. He does not expand on those platforms in the transcript, but he places them in the same ecosystem as Dynamo and TRT.

| Named technology | How Fu frames its role |
| --- | --- |
| CUDA | Named as part of the NVIDIA software ecosystem that enables deep AI systems work |
| CUTLASS | Named alongside CUDA as technology Together AI values for building deep AI infrastructure |
| NVIDIA Dynamo | Described by Fu as a huge part of the NVIDIA ecosystem Together AI uses |
| NVIDIA TRT | Named by Fu as another huge part of the NVIDIA ecosystem |
| Blackwell | An NVIDIA platform Fu says Together AI has been excited about for a long time |
| Vera Rubin | A future NVIDIA platform Fu says the team is excited to use |

NVIDIA technologies Fu names as part of Together AI’s inference stack

Fu’s emphasis is that Together AI can take work that was “relatively niche” for a long time and apply it to customer-facing systems. In his telling, the NVIDIA ecosystem is what lets research-level kernel and inference work become part of deployed infrastructure.

Code generation raises the latency bar because the interaction is continuous

Dan Fu singles out code generation as one of the most exciting current use cases. His reason is specific: these are “long-context multi-turn applications” that need very low latency.

Cursor is his example. Fu describes it as one of Together AI’s customers and says that when users call models on Cursor’s platform, they expect “a really fast response” and “really fast tokens.” The example matters because code generation is not a one-shot request in Fu’s framing. It is an ongoing interaction, with long context and multiple turns, where delay accumulates and becomes visible to the user.
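The way delay accumulates over a multi-turn session can be sketched with simple arithmetic. The turn counts, token counts, and throughput figures below are hypothetical, and the model assumes each turn's wait is time-to-first-token plus streaming time at a constant rate.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    ttft_ms: float        # time to first token for this turn
    output_tokens: int    # tokens streamed back in this turn


def session_wait_ms(turns: list[Turn], tokens_per_second: float) -> float:
    """Total time the user spends waiting across every turn of a session."""
    return sum(t.ttft_ms + 1000.0 * t.output_tokens / tokens_per_second for t in turns)


# HYPOTHETICAL session: 10 turns, each with a 200 ms first-token delay and 100 output tokens.
turns = [Turn(ttft_ms=200.0, output_tokens=100) for _ in range(10)]

slow = session_wait_ms(turns, tokens_per_second=50.0)
fast = session_wait_ms(turns, tokens_per_second=200.0)

print(f"at 50 tokens/s:  {slow / 1000:.1f} s of waiting per session")
print(f"at 200 tokens/s: {fast / 1000:.1f} s of waiting per session")
```

Because the per-turn cost is paid on every turn, a difference that looks small for one request grows linearly with the length of the session.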

The latency requirement here takes a different form from the one in the voice-agent example. In voice, Fu’s concrete target is the first 64 words within 100 milliseconds. In code generation, the pressure is responsiveness across an interactive workflow: the model has to keep up while the user is working, asking follow-up questions, and relying on fast token delivery. In ATLAS, the pressure is maintaining speedups as the traffic itself changes.

Those are distinct technical problems, but Fu connects them through one requirement: deployed AI systems have to respond quickly enough for the application’s interaction model. The infrastructure has to support fast starts, fast streams, and adaptation to changing workloads.
