Untied Ulysses Pushes Llama-3-8B Training to 5 Million Tokens

Max RyabininAI EngineerMonday, June 8, 202611 min read

Together AI’s Max Ryabinin argues that training transformers at multi-million-token context lengths is chiefly a memory-scheduling problem, not a matter of applying a single long-context technique. Using a Llama 3-8B run on an 8xH100 node as the example, he shows how fully sharded data parallelism, DeepSpeed Ulysses, activation checkpointing, CPU offloading and chunked sequence training each remove one bottleneck and expose the next. His proposed addition, Untied Ulysses, chunks attention heads and reuses context-parallelism buffers, with the presented results claiming scaling to 5 million tokens with limited throughput loss.

The practical limit is memory before it is ambition

Max Ryabinin framed multi-million-token training as a systems problem that appears simple only at the level of desire: agents want more context, video models need temporal continuity, and developers want models that can actually use large inputs at training time. The difficulty is not just that long context is expensive. It is that the dominant memory bottleneck changes as each obvious fix is applied.

Ryabinin described two reasons long-context training has become a pressing target. Agentic systems can consume context quickly: tool definitions, system prompts, messages, agent state, and buffers all compete for the same window. A slide showed one long-context usage example at 163,000 of 200,000 tokens, with MCP tools alone accounting for 94,300 tokens, or 47.1% of the context. Video generation creates a different pressure: multiple frames, potentially multiple frames per second, must remain available if the model is to maintain temporal consistency over seconds or minutes.

The underlying transformer problem has two parts. Computation grows quadratically because attention involves pairwise interactions across sequence elements. Memory grows with sequence length as well, and although Ryabinin described the memory growth as “linear” in the training stack he was walking through, he emphasized that it is still difficult to manage without a set of targeted techniques. The abstract displayed on the final paper slide sharpened the same point around attention activations: for extreme sequence lengths, intermediate activations, especially those tied to attention, quickly exhaust individual GPU memory even when common optimizations are used.

The demonstration target was concrete in the memory charts: train a Llama 3-8B model at 3 million tokens of sequence length and fit the run on a single 8xH100 node. Those chart labels are used here for the memory progression and benchmark numbers.

With default settings, the attempt fails before the full long-context problem is even reached. The slide showed an out-of-memory result at 119 GiB peak memory, and Ryabinin said that even placing the model parameters is enough to exhaust GPU memory.

3M tokens

initial sequence-length target shown for Llama 3-8B on an 8xH100 node

No single technique solves that target. Each intervention removes one memory pressure and exposes the next one.

Sharding the model exposes the attention activations

The first intervention was fully sharded data parallelism, or FSDP. Ryabinin described it as chunking model parameters across available GPUs. That addresses the immediate problem that the model itself cannot fit, and the slide showed model memory dropping to 15.0 GiB under FSDP.

But the run still failed. Once parameters were sharded, attention activations became the dominant pressure. In Ryabinin’s account, long-context training is not blocked by a single large object that can be moved once. It is blocked by parameters, activations, checkpoint state, offloaded inputs, and large sequence-wide buffers at different stages of the stack.

The next technique was context parallelism, specifically DeepSpeed Ulysses, introduced by Microsoft. In Ryabinin’s explanation, standard multi-head attention does not have to be computed independently on every GPU for the whole sequence. Ulysses redistributes the work so that different heads can be computed on different GPUs, with activations communicated as needed. In the simplified diagram he showed, one GPU is responsible for one attention head while still computing attention over the whole sequence.

That arrangement also preserves compatibility with optimized attention kernels. Ryabinin noted that Ulysses can use the “best possible attention implementation,” naming Flash Attention versions 1 through 4, before aggregating results as usual. The Ulysses diagrams in this portion of the presentation were attributed on the slides to Jacobs et al., 2023.

The effect was large but insufficient. Adding Ulysses to FSDP reduced utilization by about 8x, “as it should,” Ryabinin said. Yet the 3-million-token run still did not fit on the single 8xH100 node. The slide showed another out-of-memory point, with FSDP + Ulysses still failing.

Configuration	Model memory shown	Outcome shown
Default	Not fit; OOM at 119 GiB peak	Out of memory
FSDP	15.0 GiB	Out of memory
FSDP + Ulysses	15.0 GiB	Out of memory

The early stack removes parameter pressure, then exposes attention activation pressure.

The comparison does not make Ulysses look weak. It shows that at multi-million-token scale, even an 8x reduction in one category of memory does not complete the job.

Activation checkpointing and offloading reduce what must remain on GPU

After FSDP and Ulysses, Ryabinin turned to activations more directly. The first move was activation checkpointing: instead of keeping all intermediate activations in memory for the backward pass, the system recomputes them when needed.

Ryabinin described checkpointing as broadly available in modern deep learning frameworks, but with an implementation caveat: it must be enabled “in a correct way” so it does not impose too much computational burden. In the presented stack, activation checkpointing reduced activation usage by another factor of 8. The run still failed, with the slide showing FSDP + Ulysses + activation checkpointing ending in an out-of-memory result.

The next optimization was activation offloading. Some inputs to each transformer block cannot be cheaply recomputed, so instead of holding them on GPU, they can be moved to CPU memory when not needed and prefetched before the backward pass reaches the corresponding layer. Ryabinin said this is not very impactful for performance because the offload and prefetch can be scheduled around the backpropagation path.

He attributed this optimization, to the best of Together AI’s knowledge, to Unsloth’s gradient checkpointing work, shown on a slide as “Unsloth Gradient Checkpointing — 4x longer context windows.” In the presented memory progression, adding activation offloading brought the stack to a point where 37.5 GiB of data was shown alongside the 15.0 GiB model allocation. Yet it still failed.

That remaining failure came from a different source: large buffers created by sequence-wide operations. Ryabinin then introduced Arctic Long Sequence Training, or ALST, citing Bekman et al., 2025 on the slide. The idea is to tile element-wise computations along the sequence axis. Loss computations and MLPs do not necessarily need buffers with a dimension of 3 million allocated all at once; they can be chunked to avoid creating tensors that span the entire sequence in a single allocation.

With ALST, the 3-million-token target finally fit. The slide showed the stacked configuration with 15.4 GiB in model memory, 37.5 GiB in attention activations, and 15.0 GiB in other memory.

Added technique	What it targets	Effect described
Activation checkpointing	Stored activations needed for backward pass	Further 8x activation-memory reduction, but still OOM
Activation offloading	Transformer-block inputs that cannot be quickly recomputed	Moves them to CPU and prefetches later; still OOM
Arctic Long Sequence Training	Large sequence-wide buffers for loss and MLP computation	Tiles computations along sequence axis; 3M tokens fits

The later stack shifts from reducing attention activations to avoiding large sequence-wide allocations.

The sequence of interventions makes the systems point explicit: after parameters, attention activations; after attention activations, checkpoint state; after checkpoint state, offloaded block inputs; after those, buffers from apparently simpler computations such as loss and MLPs.

Untied Ulysses cuts deeper into the context-parallelism buffer

The 3-million-token case could be made to fit with known techniques. Going further required the contribution Ryabinin centered in the presentation: Untied Ulysses, a modification of context parallelism through headwise chunking.

The problem he identified was intermediate buffers. In standard Ulysses-style context parallelism, even after distributing attention work, the system may allocate large buffers for all query, key, and value tensors, with additional pressure from all-to-all communication. The slide summarized the issue as: “Intermediate buffers are too large! (all Q,K,V, x2 because of all-to-all).”

Untied Ulysses uses the multi-head structure of attention more aggressively. Ryabinin said Together AI found that computing one set of heads at a time is already enough to saturate the GPU’s computational capacity within one iteration. If multiple head groups are scheduled on one GPU, those groups do not need to be materialized together. They can be divided into chunks and processed over time.

In the first stage, one group of heads is computed, attention is run, and a partial result is stored. In the next stage, the system processes another group of heads while reusing the buffers allocated for the previous group. The advantage is not a new attention formula, but a different memory schedule: allocate a smaller buffer, reuse it across multiple iterations, and avoid holding all head groups in memory at once.

Instead of allocating this huge buffer as you would have before, like here in this slide, you allocate a buffer which is smaller, but you reuse it across two or more different iterations.

Max Ryabinin · Source

Ryabinin said this saved activation memory “without any significant impact to the throughput at smaller scales.” The displayed paper material described the method as tailored to multi-query and grouped-query attention, and as decoupling the all-to-all communication of keys and values from queries to reduce peak attention-activation memory. The slide title identified the work as “Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking,” by Max Ryabinin, Alexander Borzunov, Sergey Ivanov, and Anton Babenko, with affiliations including Together AI, Hugging Face, Inria, and the Higher School of Economics. The Untied Ulysses slides were attributed to Ghadia et al., 2025.

The final stacked-memory slide showed what this adds on top of the earlier techniques. For Llama 3-8B at 3 million tokens on 8xH100, the ALST configuration showed 15.4 GiB of model memory, 37.5 GiB of attention activations, and 15.0 GiB of other memory. The slide labels the added configuration “UPipe”; with that configuration on top of the stack, attention activations fell to 19.5 GiB while model memory remained 15.4 GiB and other memory remained 15.0 GiB.

Configuration	Model	Attention activations	Other	Outcome implied by slide
ALST	15.4 GiB	37.5 GiB	15.0 GiB	3M tokens fits
Configuration labeled UPipe	15.4 GiB	19.5 GiB	15.0 GiB	Additional memory freed

The final stacked-memory chart showed the UPipe-labeled configuration cutting attention-activation memory versus ALST alone.

Ryabinin said the freed memory could be reinvested elsewhere in training, “for example among the stages,” or used to extend the sequence length target from 3 million to 5 million tokens. The final paper slide claimed that Untied Ulysses enables scaling up to 5 million tokens on a standard 8xH100 node with minimal throughput degradation.

The throughput tradeoff is controlled by chunk size

The benchmark slides compared context-parallelism methods across sequence lengths for both Llama-3-8B and Qwen2-32B. Ryabinin’s spoken summary was that, at both the 8B and 32B scale, the work matched the most memory-optimized transformer training implementations closely while being able to scale further, and sometimes performed better at shorter context lengths.

The Llama-3-8B table showed Native PyTorch running through 1 million tokens before out-of-memory, Ring and Ulysses reaching 3 million before failing at 4 million, UPDT reaching 4 million, and UPipe reaching 5 million with 98.25 TPS. At shorter lengths, UPipe was close to Ulysses and Ring: at 1 million tokens, UPipe was shown at 472.53 TPS, while Ulysses was 475.33 TPS and Ring was 458.51 TPS. At 3 million tokens, UPipe was shown at 166.32 TPS, compared with 162.41 TPS for Ulysses and 159.96 TPS for Ring.

Method	Highest sequence length shown without OOM	Selected throughput figures shown
Native PyTorch on Llama-3-8B	1M	249.85 TPS at 1M; OOM at 2M
Ring on Llama-3-8B	3M	159.96 TPS at 3M; OOM at 4M
Ulysses on Llama-3-8B	3M	162.41 TPS at 3M; OOM at 4M
UPDT on Llama-3-8B	4M	119.76 TPS at 4M
UPipe on Llama-3-8B	5M	125.56 TPS at 4M; 98.25 TPS at 5M

The Llama-3-8B benchmark table showed the UPipe row reaching 5M tokens.

The Qwen2-32B table was different at the largest lengths. It showed Native PyTorch failing at 1 million tokens, Ring and Ulysses failing at 3 million, FPDT reaching 4 million, and UPipe marked out-of-memory at 3 million and 4 million. It also showed UPipe competitive with Ring and Ulysses at shorter lengths: at 1 million tokens, UPipe was 113.26 TPS, Ulysses was 117.02 TPS, and Ring was 110.27 TPS. At 2 million tokens, UPipe was 59.56 TPS and Ulysses was 59.98 TPS.

Together, the benchmark slides support two related claims. For Llama-3-8B, the UPipe row reaches 5 million tokens. For Qwen2-32B, the comparison emphasizes close throughput to Ring and Ulysses at shorter lengths while the more memory-optimized FPDT row reaches 4 million. Ryabinin’s broader statement was that the results matched highly memory-optimized implementations at both 8B and 32B scale while enabling longer contexts in the reported work, and the paper slide stated that Untied Ulysses scales up to 5 million tokens on a standard 8xH100 node.

He also showed a chunk-size ablation. The relationship was straightforward: larger chunks compute more heads at the same time, increasing memory utilization, but they can run the model somewhat faster. Smaller chunks save memory at the cost of throughput. The slide’s memory chart rose as chunk size moved from 4 to 32, while the throughput chart also rose modestly, roughly from just over 800 TPS to the mid-830s.

That ablation matters because Untied Ulysses is not presented as eliminating the memory-throughput tradeoff. It moves the control surface. Instead of being forced to allocate buffers for all relevant head groups together, the training run can choose a headwise chunk size that fits the memory budget and accept the corresponding throughput.

Profiling matters because the bottleneck keeps moving

Ryabinin closed the technical arc by emphasizing that the bottlenecks “might appear where you least expect.” The memory progression illustrated why: the run first fails on model parameters. FSDP fixes that and exposes attention activations. Ulysses cuts those down and still fails. Activation checkpointing and offloading reduce what must remain resident on GPU, but still do not remove all large allocations. ALST handles sequence-wide buffers for losses and MLPs. Untied Ulysses then goes back into context parallelism and reduces intermediate attention buffers by chunking heads and reusing memory.

His recommended tool was the PyTorch Memory Profiler, which he said Together AI elaborated on extensively in the paper. He added that profiling can help even in “much smaller setups” that run out of memory, not only at multi-million-token scale.

The brief Q&A reinforced one point about what Untied Ulysses is and is not. An audience member asked whether the QKV part involved a quantization parameter. Ryabinin said not exactly: Q, K, and V referred to the query, key, and value matrices in the transformer layer. The complexity comes from the attention operation, where queries interact pairwise with keys. At a sequence length of 3 million, the vanilla approach would allocate a huge tensor with 3 million along one axis, which is already significant enough to require a stack of techniques rather than a single fix.

That’s the key idea and the key challenge of working with transformers at this scale.