Ultra-Scale Training Depends on Memory Sharding and Communication Overlap
Nouamane Tazi of Hugging Face uses a Stanford CS25 seminar to argue that ultra-scale model training is less a question of adding GPUs than of managing memory, communication, batch size, and hardware topology. His central case is that 5D parallelism (data, tensor, pipeline, context, and expert) lets training runs span massive clusters only when each axis is chosen for a specific bottleneck. The practical rule, he says, is conservative: shard only as much as the workload requires, because every added parallelism dimension buys scale by spending communication, complexity, or both.
Stanford Online · May 11, 2026 · 18 min read
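To make the five axes concrete, here is a minimal Python sketch of how a 5D parallelism configuration multiplies out to a cluster size and what it implies for per-GPU weight memory. The class, field names, and the simplified memory model are illustrative assumptions for this summary, not code from the talk or any Hugging Face API.

```python
from dataclasses import dataclass


@dataclass
class ParallelismConfig:
    """Degrees for each of the five parallelism axes (hypothetical names)."""
    dp: int = 1  # data parallelism: replicates the model, splits the batch
    tp: int = 1  # tensor parallelism: splits individual weight matrices
    pp: int = 1  # pipeline parallelism: splits the model by layer
    cp: int = 1  # context parallelism: splits the sequence dimension
    ep: int = 1  # expert parallelism: spreads MoE experts across ranks

    def world_size(self) -> int:
        # Every axis multiplies the GPU count; the product must match the cluster.
        return self.dp * self.tp * self.pp * self.cp * self.ep


def params_per_gpu(total_params: float, cfg: ParallelismConfig) -> float:
    """Rough per-GPU parameter count under a simplified memory model.

    Assumption: tensor, pipeline, and expert parallelism shard the weights,
    while plain data and context parallelism replicate them. Real frameworks
    (e.g. ZeRO-style sharded data parallelism) can behave differently.
    """
    return total_params / (cfg.tp * cfg.pp * cfg.ep)


if __name__ == "__main__":
    cfg = ParallelismConfig(dp=8, tp=4, pp=4, cp=2, ep=1)
    print(f"world size: {cfg.world_size()} GPUs")  # 8*4*4*2*1 = 256
    print(f"params/GPU for a 70B model: {params_per_gpu(70e9, cfg) / 1e9:.1f}B")
```

The sketch encodes the talk's conservative rule in miniature: only some axes cut per-GPU weight memory, while the others (absent ZeRO-style sharding) mostly buy throughput, so each added dimension should answer a specific bottleneck rather than be turned on by default.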