
Ultra-Scale Training Depends on Memory Sharding and Communication Overlap

Steven Feng, Nouamane Tazi, Karan Singh | Stanford Online | Monday, May 11, 2026 | 18 min read

Nouamane Tazi of Hugging Face uses a Stanford CS25 seminar to argue that ultra-scale model training is less a question of adding GPUs than of managing memory, communication, batch size, and hardware topology. His central case is that 5D parallelism—data, tensor, pipeline, context, and expert parallelism—lets training runs span massive clusters only when each axis is chosen for a specific bottleneck. The practical rule, he says, is conservative: shard only as much as the workload requires, because every added parallelism dimension buys scale by spending communication, complexity, or both.

Ultra-scale training is mostly a memory-and-communication problem

Nouamane Tazi framed the practical problem of training modern language models as a set of infrastructure constraints, not as an abstract desire for more GPUs. Current frontier-scale training, in his description, involves models around 1 trillion parameters, training corpora around 15 trillion tokens, and context lengths reaching 1 million tokens. Those numbers create three different pressures: continuously reading enormous datasets from storage, keeping each training iteration fast enough despite limited accelerator memory, and periodically writing very large checkpoints back to storage.

1T / 15T / 1M
parameters, training tokens, and context length used to describe current frontier-scale training pressure

The central constraint is the training iteration. The baseline is the single-GPU loop: forward pass, backward pass, gradient computation, optimizer step, updated model. That loop becomes inadequate in two ways. First, the model may not fit in a single GPU’s memory. Second, the global batch size used in large language-model training can be in the range of 1 million to 10 million tokens, and sometimes as high as 50 million tokens. If that batch is handled sequentially by gradient accumulation on one GPU, the run may be mathematically valid, but it is too slow.
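For reference, a minimal single-GPU gradient-accumulation loop looks like the following sketch (toy model and sizes, not from the talk): the arithmetic matches one large batch, but the microbatches run strictly one after another, which is the slowness described above.

```python
import torch
import torch.nn as nn

# Toy single-GPU gradient accumulation: mathematically equivalent to one large
# batch, but microbatches are processed sequentially. Model and sizes are illustrative.
model = nn.Linear(32, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 8  # microbatches per optimizer step

for step in range(64):
    x = torch.randn(4, 32)                          # one microbatch
    loss = model(x).pow(2).mean() / accum_steps     # scale so the accumulated sum matches the big batch
    loss.backward()                                 # gradients accumulate in .grad across microbatches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```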

The goal of distributed training is therefore twofold: shard the model, gradients, and optimizer states so the training step fits in memory; and shard the data so the system does not simply wait through a long sequence of microbatches. The “5D parallelism” taxonomy names five axes that do this in different ways: data parallelism, tensor parallelism, pipeline parallelism, context parallelism, and expert parallelism. The methods apply beyond GPUs, including TPUs and other accelerators, and they are relevant not only to pretraining but also to post-training, distillation, and inference. The practical threshold is not “thousands of GPUs”; once a workload spans more than one accelerator, these choices begin to matter.

A repeated theme was that scaling is not just about finding another way to split the work. The split must keep GPUs busy. Communication that cannot be hidden behind computation turns accelerators into expensive idle hardware. Much of the engineering task is arranging collectives, buckets, modules, and schedules so communication overlaps with useful compute.

When you have a lot of GPUs, you definitely want to overlap efficiently your computation with communication so that you avoid any idle time.
Nouamane Tazi

Data parallelism is simple until batch size and memory stop cooperating

In ordinary data parallelism, every GPU gets a full copy of the model and a different shard of the batch. Each GPU performs its own forward and backward pass, producing different gradients because it saw different data. To keep the model copies synchronized, the gradients must be accumulated across GPUs. The relevant collective operation is AllReduce: effectively a distributed sum.

In PyTorch, this is what DistributedDataParallel handles when a model is wrapped with DDP. During backward, DDP AllReduces gradients in the background. The abstraction is intentionally easy: duplicate the model, split the data, synchronize gradients, run the same optimizer step everywhere.

The performance issue appears in the timeline. A profiler view showed three constraints: GPU compute streams should not go idle; the CPU should remain ahead of the GPU so it can launch kernels in time; and communication should overlap with computation where possible. A naive DDP implementation leaves the GPU waiting for gradient synchronization. The fix is not to wait until the entire backward pass is complete. Once a bucket of gradients has been computed, communication for that bucket can begin while the rest of the backward pass continues.

The size of those buckets is tunable in PyTorch through bucket_cap_mb, whose default was shown as 25 MiB when unset. The tuning objective is better overlap: bucket size affects when communication begins and how efficiently it can be hidden behind backward computation.
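A minimal sketch of the wrapping and the bucket knob, assuming a torchrun launch with one process per GPU and a placeholder model:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch via `torchrun --nproc-per-node=N script.py`, which sets the
# environment variables that init_process_group and LOCAL_RANK rely on.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

# DDP replicates the model on every rank and AllReduces gradient buckets in the
# background during backward; bucket_cap_mb (default 25) controls bucket size and
# therefore how early communication can start overlapping with backward compute.
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

x = torch.randn(8, 1024, device="cuda")
ddp_model(x).sum().backward()   # gradient AllReduce overlaps with this backward
```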

Data parallelism has clear advantages. It is easy to implement, can produce little idle time when overlap is good, and is largely model-agnostic because the model is simply replicated. But its limitations are structural. The optimizer step is duplicated across GPUs. Global batch size scales with the number of data-parallel workers, so it cannot increase indefinitely. Most importantly, basic data parallelism assumes the full training step fits in each GPU’s memory.

That last limitation leads to ZeRO-style data parallelism, which Tazi presented as a sequence of increasingly aggressive ways to remove redundant memory.

ZeRO-1 shards optimizer states. Instead of every GPU holding all optimizer state and doing the same optimizer step, each rank owns one shard of the optimizer states and updates only that shard of parameters. The communication pattern changes from a plain AllReduce to a ReduceScatter followed later by an AllGather. The important intuition is that this does not add communication, because AllReduce can be decomposed into ReduceScatter plus AllGather. ZeRO-1 changes where the pieces happen so each rank can avoid duplicated optimizer work and memory.
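The decomposition is easy to check numerically without a cluster; the following single-process sketch simulates the ranks with a list of tensors.

```python
import torch

# Single-process simulation of the identity AllReduce = ReduceScatter + AllGather.
world_size = 4
grads = [torch.randn(8) for _ in range(world_size)]   # each rank's local gradients

# AllReduce: every rank ends up with the full summed gradient.
allreduce_result = sum(grads)

# ReduceScatter: rank r ends up with only its shard of the summed gradient ...
shards = [sum(g.chunk(world_size)[r] for g in grads) for r in range(world_size)]
# ... which is exactly what a ZeRO-1 rank needs to update its slice of the parameters.
# AllGather: concatenating the shards reproduces the AllReduce result, so no
# communication volume has been added, only rearranged.
assert torch.allclose(torch.cat(shards), allreduce_result)
```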

ZeRO-2 goes further by discarding the gradient shards that the local optimizer shard does not need. After ReduceScatter, the rank only needs the gradients for the parameters it will optimize, so it can free the rest. Tazi noted that in practice people often use ZeRO-1 and keep gradients because the implementation details of discarding them can be annoying.

The optimizer can constrain how sharding is done. Tazi singled out Muon, saying it requires full tensor gradients. A common approach that flattens all parameters and shards the flattened vector loses the notion of individual tensors. For Muon-style requirements, ranks should hold full tensors — for example, the first five full tensors on one GPU and the next five on another — rather than arbitrary fragments of flattened parameters.

ZeRO-3, also known as Fully Sharded Data Parallel, shards parameters themselves. The apparent problem is that forward and backward computation require the parameters. The trick is to never materialize the full model permanently. When a layer is needed, the rank AllGathers the missing parameter shards, computes the layer, and then frees those extra parameters. The same idea applies in backward. To avoid turning this into a communication stall at every layer, the system prefetches the next layer’s parameters while computing the current layer.

This was the strongest form of data-parallel memory optimization presented: optimizer states, gradients, and model parameters are all sharded, and parameter AllGathers can be overlapped with compute. In PyTorch, Tazi distinguished older FSDP1, using an FSDP wrapper and an auto-wrap policy, from FSDP2, using fully_shard. FSDP2’s use of DTensor and its ability to preserve full tensors make it easier to combine with other forms of parallelism and avoid the flattening problem for optimizers that need tensor structure.
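A minimal FSDP2 sketch, assuming a torchrun launch and a placeholder stack of blocks (the fully_shard import path has moved between PyTorch releases; older versions expose it under torch.distributed._composable.fsdp):

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard  # import path varies by PyTorch version

# Assumes launch via torchrun; the model is a placeholder stack of blocks.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).cuda()

# Shard each block so its parameters are AllGathered just before its compute and
# freed afterwards (with prefetch handled by FSDP2), then shard the root so any
# remaining parameters are covered too.
for block in model:
    fully_shard(block)
fully_shard(model)
```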

The warning was just as important as the technique: do not use ZeRO-3 merely because it exists. It trades memory for communication. If a model fits with ZeRO-1, Tazi said, ZeRO-1 should be faster than ZeRO-3 because it avoids the additional parameter AllGather traffic. ZeRO-3 is useful when memory requires it, not as a default badge of sophistication.

At larger scale, even ZeRO-3’s overlap may fail to hide communication, especially on slower networks. Tazi pointed to hybrid sharding as the next move: use FSDP within a smaller group and vanilla data parallelism across the larger group. But data parallelism still has the global-batch-size problem. If a run reaches its desired batch size at 1,000 GPUs and the cluster has 10,000 GPUs, adding more pure data parallelism is no longer the right answer.

Tensor parallelism spends communication to avoid spending more samples

Tensor parallelism attacks the scaling problem from a different direction. Instead of giving different GPUs different data, it gives them the same batch and shards the model computation across them. For a tensor-parallel size of two, each GPU might hold half the model parameters, half the gradients, and half the optimizer states for the relevant tensors. The global batch size does not need to grow with the number of tensor-parallel ranks.

Tazi introduced the idea through matrix multiplication. For a single matrix multiply, one can split the weight matrix by columns, compute partial outputs on different GPUs using the same input, and AllGather the outputs. For two sequential matrix multiplications, one can split the first weight matrix by columns and the second by rows, so each GPU performs half the compute and holds half the weights, with an AllReduce needed to restore the correct result.
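The column/row trick can be verified on a single device by simulating two ranks; this sketch uses toy shapes and replaces the AllReduce with an explicit sum.

```python
import torch

# Two sequential matmuls, Y = X @ W1 @ W2, simulated across two "ranks":
# W1 is split by columns, W2 by rows, and the partial outputs are summed
# (the AllReduce in a real tensor-parallel implementation).
torch.manual_seed(0)
X = torch.randn(4, 16)
W1 = torch.randn(16, 32)
W2 = torch.randn(32, 16)

reference = X @ W1 @ W2

W1_cols = W1.chunk(2, dim=1)   # column split: each rank holds a (16, 16) piece
W2_rows = W2.chunk(2, dim=0)   # row split:    each rank holds a (16, 16) piece

partials = [(X @ W1_cols[r]) @ W2_rows[r] for r in range(2)]
assert torch.allclose(sum(partials), reference, atol=1e-5)
```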

The same logic applies in backward under an assumption: the ranks must have the same upstream gradients. Across two matrix multiplications, the distributed implementation needs one AllReduce in forward and one AllReduce in backward, assuming the same input and the same upstream gradient.

For an MLP block, this maps naturally to the up and down projections. The first projection is column-parallel, the second row-parallel, and an AllReduce restores correctness. The hidden dimension is sharded inside this tensor-parallel region. The correctness conditions matter: the ranks need the same input entering the tensor-parallel region and the same upstream gradient coming back.
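PyTorch exposes these two styles directly; the sketch below applies them to a toy MLP, assuming a torchrun launch with eight GPUs. The submodule names up_proj and down_proj are placeholders, not anything specified in the talk.

```python
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel, RowwiseParallel, parallelize_module)

class MLP(nn.Module):
    def __init__(self, hidden, ffn):
        super().__init__()
        self.up_proj = nn.Linear(hidden, ffn)     # column-parallel candidate
        self.down_proj = nn.Linear(ffn, hidden)   # row-parallel candidate
    def forward(self, x):
        return self.down_proj(torch.relu(self.up_proj(x)))

# Assumes torchrun with 8 processes, one per GPU.
tp_mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("tp",))
mlp = MLP(1024, 4096).cuda()

# Column-parallel up projection, row-parallel down projection: each rank holds a
# shard of both weights, and the library inserts the boundary collectives.
parallelize_module(mlp, tp_mesh, {"up_proj": ColwiseParallel(),
                                  "down_proj": RowwiseParallel()})
```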

Attention requires more care. The QKV projections and output projection can be parallelized, but the sharding dimension matters. Tazi gave the practical correctness question: would the GPUs compute the same activations as without tensor parallelism? If attention were sharded along the per-head hidden dimension, the softmax would be computed from a reduced hidden dimension and produce different values than the non-parallel computation. Sharding along the number of heads works because attention heads are independent. The principle is less “split the tensor somewhere” than “split it where the operation’s semantics remain unchanged.”

Tensor parallelism is memory efficient and sample efficient. It shards model and compute without requiring more data-parallel batches, and because ranks are jointly computing the same operation, it does not require the same kind of gradient AllReduce as data parallelism. But it is communication-heavy. The collectives occur inside every MLP and attention block, on the critical path of forward and backward computation.

There is also a correctness problem at the boundaries. Parameters outside the tensor-parallel region, such as LayerNorm parameters in Tazi’s example, are duplicated. They should remain synchronized so that the assumptions of same input and same upstream gradient hold. In theory this may not require explicit synchronization, but numerical imprecision can make duplicated parameters drift apart. In practice, he said, gradients for those parameters are AllReduced.

This is one reason tensor parallelism is usually kept within a node. Tazi described the common recommendation from distributed training libraries: use tensor parallelism at sizes less than or equal to the number of GPUs in a node, such as eight, so the frequent communication stays on the fast intra-node interconnect rather than crossing slower links.

Sequence parallelism is a variation that removes one of tensor parallelism’s annoyances. In plain tensor parallelism, activations inside the tensor-parallel region have a shape like (sequence, batch, hidden / TP). But outside that region, operations such as LayerNorm may store full hidden activations on every GPU. Sequence parallelism replaces the AllReduce at the tensor-parallel boundary with ReduceScatter and AllGather, using the identity that AllReduce equals ReduceScatter plus AllGather.

The result is a controlled movement between two shardings: hidden-dimension sharding inside tensor-parallel regions and sequence-dimension sharding in sequence-parallel regions. Instead of storing full activations of shape (S, B, H) at duplicated points, ranks store activations divided by the tensor-parallel degree. Communication volume is not increased, because the AllReduce has merely been decomposed.
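The hand-off can be illustrated on one device by simulating the ranks; the sketch below uses illustrative sizes and replaces the collectives with their single-process equivalents.

```python
import torch

# Single-process sketch of the sharding hand-off at a TP/SP boundary (sizes illustrative).
tp, S, B, H = 4, 8, 2, 16

# After a row-parallel matmul, each TP rank holds a partial sum of shape (S, B, H).
partials = [torch.randn(S, B, H) for _ in range(tp)]

# Plain TP: AllReduce, so every rank stores the full (S, B, H) activation.
full = sum(partials)

# Sequence parallelism: ReduceScatter along the sequence dimension instead, so each
# rank keeps only an (S/tp, B, H) slice of the same sum ...
sp_shards = full.chunk(tp, dim=0)
# ... and the AllGather on re-entering the TP region (a concatenation here)
# reconstructs exactly the AllReduce result, with no extra communication volume.
assert torch.equal(torch.cat(list(sp_shards), dim=0), full)
```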

The backward pass also cleans up the LayerNorm synchronization issue. The backward of ReduceScatter is AllGather, and the backward of AllGather is ReduceScatter. Because ReduceScatter appears before the sequence-parallel regions in backward, it synchronizes the relevant gradients. In Tazi’s formulation, sequence parallelism means LayerNorm gradients no longer need a separate AllReduce.

The axes are orthogonal, but the network decides how far to push them

Tazi’s explanation of combining parallelisms rested on an orthogonality principle. Data parallelism shards the batch and replicates the model along other axes. Tensor parallelism shards the model and replicates the data along other axes. In a mesh containing both dimensions, a batch may be split across the data-parallel axis while the model is split across the tensor-parallel axis.

This matters because distributed training libraries have to know which ranks participate in which collective. Tazi said that libraries such as TorchTitan, Megatron, and Nanotron initialize process groups corresponding to these axes. Once the groups are defined, an AllReduce can be issued along the tensor-parallel group or the data-parallel group, depending on what the operation requires. The same physical cluster becomes a logical mesh of parallelism axes.
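In PyTorch, the logical mesh can be expressed with init_device_mesh; the sketch below assumes a torchrun launch with dp × tp processes and placeholder sizes.

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Turn the physical cluster into a logical mesh of parallelism axes.
# Assumes a torchrun launch with dp * tp processes; the sizes are placeholders.
dp, tp = 4, 8
mesh = init_device_mesh("cuda", (dp, tp), mesh_dim_names=("dp", "tp"))

# Each named dimension exposes a process group, so a collective can be issued
# along just the data-parallel ranks or just the tensor-parallel ranks.
dp_group = mesh.get_group("dp")
tp_group = mesh.get_group("tp")

t = torch.ones(4, device="cuda")
dist.all_reduce(t, group=tp_group)   # sums only across the 8 tensor-parallel ranks
```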

The clean abstraction does not remove hardware constraints. In the question period, Karan Singh relayed a question asking whether there is an automatic way to decide the best parallelism strategy. Tazi said there have been papers on the problem and pointed to the JAX scaling book as an example of work exploring automatic choices on TPUs. But he stressed that the decision depends on model size, amount of data, global batch size, and the network.

His hardware examples were informal, but the point was direct. GPU systems differ in how much fast interconnect they expose — he referred to NVLink configurations with counts such as 4, 8, and, for newer systems, 32 or 72 — and those differences affect whether communication-heavy strategies are practical. For TPUs, he said people often allow tensor parallelism over 32-device pods. The general rule is that hardware topology and batch-size constraints determine which communication costs are acceptable.

The same answer also clarified what the parallelism strategies are not supposed to change. When asked whether different strategies affect scaling efficiency, model performance, or convergence rather than merely enabling scale, Tazi answered that the forwards and backwards should be unchanged if the implementation is correct. From a computation point of view, the same scaling laws should apply, and using pipeline parallelism rather than expert parallelism should not change the model’s mathematical behavior. The strategies are meant to reproduce the one-GPU computation at scale, not alter the objective.

Pipeline parallelism shards layers and fights bubbles

Pipeline parallelism splits the model vertically by layers. Some layers live on one GPU, later layers on another. Its appeal is that it is model-agnostic: as long as the model has layers and ranks can send activations forward and gradients backward, the particular architecture matters less than it does for tensor parallelism. Communication is also comparatively cheap: adjacent pipeline stages exchange activations and gradients rather than performing frequent global collectives.

The cost is idle time. If the first GPU owns early layers and the second owns later layers, the second GPU cannot begin until it receives activations. Once the first GPU has sent activations onward, it may also wait while later stages continue. This idle region is the pipeline bubble.

Schedulers determine how microbatches move through the pipeline. Tazi first described an all-forward all-backward scheduler. With one layer per GPU as a simplifying assumption, the first microbatch moves forward stage by stage, then additional microbatches are scheduled so devices do not remain idle, and eventually the backward passes move in reverse, sending gradients to previous stages. Scheduling many microbatches reduces idle time but does not eliminate the bubble.

A one-forward one-backward scheduler prioritizes backward work once it becomes available, interleaving forward and backward passes in the middle of the pipeline. Tazi said this helps memory but still leaves the same pipeline bubble.
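A standard back-of-the-envelope estimate, not a figure from the talk, makes the microbatch dependence concrete: with p stages, m microbatches, and unit-time forward and backward passes, the idle fraction of the schedule is (p - 1) / (p + m - 1).

```python
# Back-of-the-envelope bubble estimate for a p-stage pipeline with m microbatches,
# assuming unit-time forward and backward per microbatch per stage (a standard
# simplification, not a number from the talk).
def bubble_fraction(p, m):
    total = 2 * (p + m - 1)   # time until the last backward finishes
    ideal = 2 * m             # time if every stage were always busy
    return (total - ideal) / total

print(bubble_fraction(p=8, m=8))    # ~0.47: nearly half the schedule is idle
print(bubble_fraction(p=8, m=64))   # ~0.10: more microbatches shrink the bubble
```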

The more aggressive answer is to inject work from both ends. Tazi cited DeepSeek’s DualPipe as an advanced scheduler in which layers are distributed in a round-robin fashion over devices so forward passes can begin from both pipeline ends. Backward passes also move in both directions, with overlap managed between them. The advantage is a smaller bubble; the cost is implementation complexity. The system must track batches, forward activations, and backward gradients carefully enough to send the right data to the right stage at the right time.

Pipeline parallelism therefore has a different failure mode from tensor parallelism. Its communications are not the main headache; activation storage and scheduling are. To hide bubbles, the system needs multiple microbatches, which pushes on global batch size in a way reminiscent of data parallelism. It also needs to save activations for multiple microbatches before backward. Tazi noted that for pipeline parallelism and FSDP, activation pressure often leads to activation checkpointing or even CPU offloading, where activations are moved off GPU memory and reloaded when needed.

Context parallelism exists because long sequences make activations explode

Context parallelism is more specialized. Its purpose is long-context training, where sequence length drives activation memory sharply upward. Tazi described it as the only parallelism that efficiently partitions memory for large sequences.

The analogy is data parallelism, but the shard dimension changes. Data parallelism splits a batch across GPUs. Context parallelism splits the sequence length. Each GPU holds a chunk of the sequence rather than a different sample. The immediate problem is attention: if each rank sees only part of the sequence, attention over the full context is no longer locally available.

Ring Attention solves this by exchanging keys and values among GPUs and using an online softmax approach similar to the one used in FlashAttention. Each GPU computes locally, exchanges K and V with neighboring ranks, updates the softmax, and continues. The system does not need to materialize the whole attention computation in one place at once.
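The online-softmax bookkeeping can be simulated on one device; the sketch below rotates K/V blocks among simulated ranks and checks the result against full attention. It illustrates the idea rather than Ring Attention's actual kernels or communication.

```python
import torch

def ring_attention_sim(q, k, v, world_size):
    # q, k, v: (seq, dim). Each simulated "rank" owns a contiguous chunk of the
    # sequence; K/V chunks arrive one per step (the ring exchange) while an
    # online softmax keeps a running max, running sum, and partial output.
    seq, dim = q.shape
    q_chunks, k_chunks, v_chunks = q.chunk(world_size), k.chunk(world_size), v.chunk(world_size)
    outputs = []
    for rank in range(world_size):
        q_local = q_chunks[rank]
        running_max = torch.full((q_local.shape[0], 1), float("-inf"))
        running_sum = torch.zeros(q_local.shape[0], 1)
        acc = torch.zeros_like(q_local)
        for step in range(world_size):
            src = (rank + step) % world_size              # whose K/V arrive this step
            scores = q_local @ k_chunks[src].T / dim ** 0.5
            new_max = torch.maximum(running_max, scores.max(dim=-1, keepdim=True).values)
            correction = torch.exp(running_max - new_max)  # rescale what we had so far
            probs = torch.exp(scores - new_max)
            acc = acc * correction + probs @ v_chunks[src]
            running_sum = running_sum * correction + probs.sum(-1, keepdim=True)
            running_max = new_max
        outputs.append(acc / running_sum)
    return torch.cat(outputs)

q, k, v = torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8)
reference = torch.softmax(q @ k.T / 8 ** 0.5, dim=-1) @ v
assert torch.allclose(ring_attention_sim(q, k, v, 4), reference, atol=1e-5)
```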

The cost is communication inside every attention block. Tazi said the K and V exchange can be implemented with send-receive operations, and AllGather is another possible mechanism. But the communication lies on the critical path, much like tensor parallelism’s block-internal collectives. Since each GPU sees different sequence data, gradients also need to be AllReduced.

For short sequences, context parallelism does not help much. Its rule of use was narrow: use it if and only if long-context training requires it.

Expert parallelism is the efficient route for distributing MoE experts

Expert parallelism is the MoE-specific axis. In a mixture-of-experts model, tokens are routed to different feed-forward experts. Expert parallelism shards those experts across GPUs and uses communication to send tokens to the ranks that own their selected experts.

The relevant collective is All-to-All. Tazi described it as the most generic and most complex exchange pattern: each process sends distinct messages to every other process and receives distinct messages from every other process. For MoE routing, that maps directly onto the need for a rank owning a given expert to receive all tokens routed to that expert, no matter which GPU originally held those tokens.

Expert parallelism is not orthogonal to data parallelism in quite the same way tensor parallelism is. Tazi said only the experts are sharded by expert parallelism; attention is not touched. To avoid duplicated attention work, data is also sharded across GPUs. The MoE block then needs All-to-All dispatch to route tokens to experts and a corresponding combine operation to reconstruct the original sequence structure after the experts have run.

The biggest practical problem he identified was not merely the All-to-All volume. It was knowing the buffer sizes for dispatch. The router determines how many tokens each expert, and therefore each rank, will receive. That information is computed on the GPU. The CPU must wait for the GPU to finish router scoring and dispatch preprocessing so it can know buffer sizes and tokens per expert before launching the dispatch operation. That CPU-GPU synchronization can slow training substantially.
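A hedged sketch of why the synchronization appears: with torch.distributed.all_to_all_single, the split sizes must be plain Python integers, so the router's token counts have to reach the CPU before dispatch can be launched. The function below assumes a torchrun launch, an initialized NCCL group, top-1 routing, and CUDA tensors; real MoE stacks use fused kernels and top-k routing.

```python
import torch
import torch.distributed as dist

def moe_dispatch(tokens, dest_rank, world_size):
    # tokens: (num_tokens, hidden); dest_rank: (num_tokens,) rank owning each token's expert.
    order = torch.argsort(dest_rank)              # group tokens by destination rank
    tokens_sorted = tokens[order]
    send_counts = torch.bincount(dest_rank, minlength=world_size)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)   # exchange per-rank token counts

    # .tolist() forces a GPU -> CPU copy: the CPU must wait for the router's counts
    # before it can size the buffers, which is the synchronization point the talk
    # blames for many slow MoE trainings.
    in_splits = send_counts.tolist()
    out_splits = recv_counts.tolist()

    received = tokens.new_empty(sum(out_splits), tokens.shape[1])
    dist.all_to_all_single(received, tokens_sorted,
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits)
    return received, order   # `order` is needed later to undo the combine
```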

A slide attributed to Nvidia’s HybridEP blog, shown in the Stanford deck, illustrated the CPU waiting on GPU-produced routing information before dispatch. Tazi pointed to DeepSeek’s DeepEP and Nvidia’s HybridEP as attempts to address this class of problem using recent hardware capabilities such as IBGDA and RDMA. His claim was explicitly practical rather than a general hardware audit: most current solutions he had in mind for avoiding the CPU-GPU synchronization path rely on recent hardware. He said that without the relevant InfiniBand-related capability, systems can remain stuck with CPU-GPU synchronization, and he attributed many slow MoE trainings to this hardware problem. He also said DeepSeek’s open-sourced DeepEP works only with IBGDA.

Expert parallelism also faces load imbalance. In the room, an audience member asked what happens if tokens are not evenly routed across ranks: if all tokens go to one GPU, the rest are idle. Tazi answered that load balancing is meant to fix this. One approach is a load-balancing loss that penalizes uneven token distribution. He also mentioned auxiliary-loss-free approaches that add a bias term during router computation, automatically adjusting distribution so tokens are more evenly spread across GPUs.
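As one concrete instance of the first approach, a Switch Transformer-style auxiliary loss multiplies each expert's dispatched token fraction by its mean routing probability; the exact form below is an illustration, not something specified in the talk.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, num_experts):
    # router_logits: (num_tokens, num_experts). Illustrative Switch-style auxiliary loss.
    probs = F.softmax(router_logits, dim=-1)            # routing probabilities
    top1 = probs.argmax(dim=-1)                         # chosen expert per token (top-1)
    f = F.one_hot(top1, num_experts).float().mean(dim=0)   # fraction of tokens per expert
    p = probs.mean(dim=0)                                   # mean routing probability per expert
    return num_experts * torch.sum(f * p)               # minimized when both are uniform
```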

The practical mitigation for expert-parallel communication is to combine it with pipeline parallelism. Tazi described overlapping the dispatch and combine communication for one batch with computation from another batch under a one-forward one-backward pipeline scheduler. In an MoE block, while dispatch for one batch is occurring, the system can perform backward computation for another batch. He said this kind of overlap is already exposed in some libraries, including Megatron through relevant flags.

Expert parallelism’s rule of use was as narrow as context parallelism’s: use it if and only if training MoEs requires it. It is the efficient way to distribute MoE experts, but it introduces All-to-All communication in every MoE block, requires gradient synchronization because ranks see different data, and may be bottlenecked by CPU-GPU synchronization unless the hardware and kernels avoid that path.

The five dimensions compose into a training recipe, not a hierarchy

The five headline axes can be combined because each touches a different part of the training computation. Data parallelism splits the batch; ZeRO and FSDP remove redundancy from optimizer states, gradients, and parameters within that data-parallel frame. Tensor parallelism shards hidden-dimension computation in attention and MLP blocks. Sequence parallelism is a tensor-parallel variant that moves between hidden and sequence sharding to reduce activation duplication. Pipeline parallelism shards layers. Context parallelism shards sequence length for long-context attention. Expert parallelism shards MoE experts and routes tokens with All-to-All.

The 5D parallelism map shown in the lecture placed these methods inside the model stack rather than as abstract alternatives: TP applies inside attention and MLP projections; CP touches self-attention for long context; EP touches MoE expert blocks; PP divides layers; and FSDP or ZeRO-style sharding operates across data-parallel groups. The point of the map was compositional: TP, CP, EP, PP, and FSDP can be present in the same training run, with different communications appearing at different points in the step.

Method | What it shards | Main communication cost | Practical use in the talk
Plain data parallelism | Batch; model is replicated | Gradient AllReduce | Simple baseline when the training step fits in memory and global batch size can grow
ZeRO / FSDP | Optimizer states, gradients, and, in ZeRO-3/FSDP, parameters | ReduceScatter and AllGather; ZeRO-3 adds parameter AllGathers around modules | Use the lowest ZeRO degree needed to fit memory; ZeRO-3 trades memory for more communication
Tensor parallelism | Hidden dimension and model compute inside attention and MLP regions | Collectives inside every attention and MLP block | Useful when global batch size should not grow and fast intra-node communication is available
Sequence parallelism | Sequence dimension outside TP regions, as a TP variant | AllReduce decomposed into ReduceScatter and AllGather | Reduces duplicated activation storage and avoids separate LayerNorm gradient sync
Pipeline parallelism | Model layers | Activation and gradient sends between pipeline stages | Useful when layers can be split and bubbles can be managed with microbatches or advanced schedulers
Context parallelism | Sequence length | K/V exchange inside attention blocks | Long-context training
Expert parallelism | MoE experts | All-to-All dispatch and combine | MoE training, with routing balance and CPU-GPU sync as key engineering issues
Tazi’s parallelism axes solve different bottlenecks and introduce different communication patterns; sequence parallelism is treated as a TP variant rather than one of the five headline axes.

The operational advice was conservative. Use the least aggressive sharding that solves the memory problem. Keep communication-heavy tensor parallelism close to fast links when possible. Use context parallelism only when sequence length demands it. Use expert parallelism only for MoEs, and expect routing, balance, and CPU-GPU synchronization to dominate the engineering work. Combine axes only because the workload forces it, not because a taxonomy makes all five available.

In response to a question about GPU idle time during RL training when CPU-side checkpointing or environment data processing becomes sequential, Tazi gave a smaller but consistent version of the same principle: move bottlenecking work ahead of the training step where possible. For text workloads, he suggested using PyTorch dataloader workers to preprocess data before the training iteration reaches it.

The final note was about responsibility rather than technique. After describing how to scale training to thousands of GPUs, Tazi closed by asking that such scaling be done with awareness of energy impact.
