Multipath Reliable Connection Keeps Massive GPU Training Clusters in Sync
OpenAI’s Mark Handley and Greg Steinbrecher argue that frontier AI training has outgrown conventional data-center networking because synchronized GPU clusters are constrained by their worst congestion or failure, not average throughput. They present Multipath Reliable Connection, developed with major hardware and cloud partners, as OpenAI’s answer: a protocol that spreads traffic across many paths, detects loss quickly, routes around failures from the endpoints, and is being pushed as an open standard for the wider industry.

Training no longer looks like ordinary data-center traffic
AI training turns worst-case network congestion and ordinary hardware failures into model-training bottlenecks. In OpenAI’s account, Multipath Reliable Connection, or MRC, is a new network protocol designed to keep very large GPU clusters moving in lockstep by spreading traffic across many paths, detecting congestion without guesswork, and routing around failures at the endpoints.
Mark Handley, from OpenAI’s Core Networking team, framed the core problem as a mismatch between old network assumptions and frontier training. Traditional data centers and internet-derived systems handle many separate conversations at once. As more users and flows are added, traffic tends to smooth out statistically because the activity is independent.
Training a large model is different. Handley described it as “a lot of the world’s fastest GPUs” working together on a single task. The communication among GPUs is not incidental traffic around the computation; it is part of the computation itself.
The communication between the GPUs is actually part of the computation.
Greg Steinbrecher, from Workload Systems, put the consequence in operational terms: if one GPU goes slow, the other GPUs wait. If one GPU stops because of a bit flip or another fault, the training step may become unusable, and the system may have to stop, assess what happened, and possibly roll back. During that time, the GPUs are not doing useful work.
That synchronized character is what makes the network design problem severe. The workload proceeds in lockstep, so the relevant question is not whether the average pair of GPUs can communicate quickly. Steinbrecher said the limiting case is the worst bottleneck in the system: the single link that became most congested can set the pace for the whole job.
He described this as moving from average statistics to “the tail of the tail,” or what he called P100, the hundredth percentile. In conventional settings, the law of large numbers may help. In frontier training, the worst case is often the case that matters.
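To see why the hundredth percentile takes over at scale, consider a toy model (the numbers below are invented for illustration, not OpenAI measurements): if a synchronized step finishes only when its slowest transfer finishes, step time tracks the maximum over all links, and a hiccup that is rare per link becomes near-certain per step once there are enough links.

```python
import random

# Toy model of lockstep training: a step finishes only when the slowest
# of its transfers finishes. Per-link times here are invented: usually
# 1 ms, with a 1-in-1000 chance of a 10 ms hiccup.
random.seed(0)

def step_time(n_links):
    times = [10.0 if random.random() < 0.001 else 1.0 for _ in range(n_links)]
    return max(times)  # lockstep: everyone waits for the worst link

for n in (100, 10_000, 100_000):
    steps = [step_time(n) for _ in range(20)]
    slow = sum(t > 1.0 for t in steps)
    print(f"{n:>7} links: {slow}/20 steps hit the 10x tail")
```

At a hundred links, most steps dodge the tail; at a hundred thousand, essentially every step pays for it. The mean never changed; only the count did.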
Scale turns small failures into constant background noise
The reliability math gets worse as systems grow, Greg Steinbrecher said. If failures are independent, doubling the size of a system roughly halves the mean time between failures. That is already a problem for GPUs, but the network compounds it because each GPU depends on many more network components than a simple GPU count suggests.
Even a single GPU connected to a network adapter can involve multiple optical components. Steinbrecher described a network adapter with an optical transceiver that might have four lasers, with another four lasers at the far end of the link. That is already an order of magnitude more lasers than GPUs before adding layers of switching. At training-cluster scale, the network contains far more components than the GPUs at its edge, because the system needs enough bandwidth to keep those GPUs fed.
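Back-of-envelope arithmetic makes the compounding visible. Every figure in this sketch is an assumption chosen for illustration, not an OpenAI number:

```python
# With independent failures, system MTBF is roughly component MTBF divided
# by the component count, so doubling the system halves time-to-failure.
# All figures below are illustrative assumptions.
component_mtbf_hours = 5_000_000       # assumed MTBF of a single laser

gpus = 100_000                         # hypothetical cluster size
lasers = gpus * 4 * 2                  # four lasers per transceiver, both ends
# ...and this ignores the lasers buried in the switching layers entirely.

system_mtbf_hours = component_mtbf_hours / lasers
print(f"{lasers:,} lasers -> a laser failure every "
      f"{system_mtbf_hours:.1f} hours on average")
print(f"double the cluster -> every {system_mtbf_hours / 2:.1f} hours")
```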
Mark Handley summarized the scale bluntly: “You have literally millions of optical links within the same building.”
The network also cannot be built as a single switch or a simple hierarchy. Handley said the required bandwidth demands “hierarchies of hierarchies of switches,” producing thousands of possible paths through the network between one GPU and another. In one such building, he said, there are several thousand switches.
That creates a path-selection problem. If a GPU-to-GPU transfer chooses a path and no one else chooses the same path, performance is fine. If two transfers collide on the same path, both slow down. If ten collide, the slowdown is worse. The statistical multiplexing that helped internet-style workloads does not solve the synchronized training case.
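The collision risk follows birthday-problem math. A rough sketch, with an assumed path count, of how quickly independent random choices start to overlap:

```python
# Birthday-problem sketch: F transfers each pick one of P equal-cost paths
# independently at random. The path count here is an assumption.
def p_no_collision(paths, flows):
    p = 1.0
    for k in range(flows):
        p *= (paths - k) / paths
    return p

paths = 1000  # hypothetical equal-cost paths between a pair of GPUs
for flows in (10, 40, 100):
    shared = 1 - p_no_collision(paths, flows)
    print(f"{flows:>3} flows over {paths} paths: "
          f"P(some path carries two) = {shared:.2f}")
```

With a thousand paths, a hundred independent flows are almost guaranteed to double up somewhere, and in a synchronized job that somewhere sets the pace.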
Failures make the same problem sharper. Links fail. Switches get confused and need to be rebooted. Routing can take time to reconverge after a failure, and that delay can cause a glitch. In the worst case, Handley said, a single failed transfer can crash the whole job.
The goal, then, is not merely more bandwidth. It is a network that can keep synchronized training moving when congestion happens and when parts of the network fail. Handley argued that this cannot be retrofitted onto existing network protocols; it requires protocols designed differently from the start.
MRC sprays traffic, trims packets, and removes ambiguity
Mark Handley described MRC as a combination of ideas from prior networking research rather than a wholly new invention. The design pulls those ideas together into a feature set for large AI-training networks.
The first idea is to spray packets across many paths. If traffic is spread across the network carefully, the paths can be load-balanced more evenly, avoiding hot spots when the topology has enough capacity. That still leaves one unavoidable congestion case: multiple senders trying to reach the same destination at the same time.
Spreading packets across many paths also creates a new difficulty. Packets can arrive out of order because they took different routes. If the network gets congested and a packet appears to be missing, the receiver may not know whether it was lost or is merely delayed by reordering.
MRC addresses that with packet trimming. Handley said that when congestion would overflow a queue, the network does not simply drop the packet. Instead, it removes the packet’s payload and forwards the small header to the destination. The receiver then immediately knows that the packet needs retransmission.
That matters because it removes ambiguity. The receiver no longer has to guess whether to keep waiting for a reordered packet or request that it be sent again. It can request retransmission immediately.
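MRC's wire format is not spelled out in the conversation, so the following is only a schematic sketch of the receiver-side logic that trimming makes possible: a trimmed header triggers an immediate retransmission request, while an ordinary out-of-order arrival is simply buffered.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    seq: int
    payload: Optional[bytes]  # None = trimmed: header arrived, payload cut

class Receiver:
    """Sketch of receiver logic under packet trimming. MRC's real wire
    format and state machine aren't described in the article."""
    def __init__(self):
        self.reorder_buffer = {}  # out-of-order arrivals wait here

    def on_packet(self, pkt: Packet):
        if pkt.payload is None:
            # A trimmed header is a positive signal: the network itself
            # says the payload was dropped. No timeout, no guessing.
            return ("NACK", pkt.seq)
        # A full packet arriving out of order is normal under multipath
        # spraying: buffer it and acknowledge, rather than inferring loss.
        self.reorder_buffer[pkt.seq] = pkt.payload
        return ("ACK", pkt.seq)

rx = Receiver()
print(rx.on_packet(Packet(seq=7, payload=b"gradients")))  # ('ACK', 7)
print(rx.on_packet(Packet(seq=8, payload=None)))          # ('NACK', 8)
```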
For Greg Steinbrecher, the significance is practical rather than abstract. MRC lets OpenAI accelerate research and deployment, reduce job failures, reduce sensitivity to where a job is scheduled, and train frontier models faster and more reliably. For users, he said, the effect should be “better models, more intelligent models faster from OpenAI.”
Failures become local decisions instead of global coordination problems
Greg Steinbrecher said Handley had “mildly undersold” one important effect of MRC: it breaks the need for broad network coordination after a link fails.
In conventional routing, when a link goes down, a switch on one side of the link notices and has to tell its neighbors. Those neighbors tell their neighbors, and so on. Steinbrecher described this as a distributed-systems problem conventionally addressed with Border Gateway Protocol, or BGP, a gossip-like system that eventually spreads the information that a path is no longer usable. That convergence takes time. He said it can take seconds, and in the tail, tens of seconds.
MRC changes the failure response. Each endpoint independently detects that a path should not be used and stops using it. Instead of waiting for the whole network to converge on a new view, endpoints route around the problem within milliseconds, in Steinbrecher’s description.
Mark Handley said that when a failure affects flows across the network, each flow is affected only a little. Within a few round trips across the network, traffic stops using the failed link. Steinbrecher called the behavior “self-annealing.”
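The conversation does not give MRC's actual thresholds or heuristics, but the endpoint-local behavior can be sketched as a per-path scoreboard: bench a path after repeated trouble, keep spraying over the healthy ones, and probe benched paths again later. The constants below are assumptions.

```python
import time

# Sketch of endpoint-local failure handling. Thresholds and heuristics
# here are assumptions for illustration, not MRC's actual algorithm.
BAD_AFTER = 3        # consecutive losses before a path is benched
RETRY_AFTER_S = 1.0  # how long before a benched path is probed again

class PathScoreboard:
    def __init__(self, n_paths):
        self.losses = [0] * n_paths
        self.benched_at = [None] * n_paths

    def on_loss(self, path):
        self.losses[path] += 1
        if self.losses[path] >= BAD_AFTER:
            # Local decision: stop spraying onto this path. No neighbor
            # notification, no network-wide routing convergence.
            self.benched_at[path] = time.monotonic()

    def on_delivery(self, path):
        self.losses[path] = 0
        self.benched_at[path] = None

    def usable(self, path):
        t = self.benched_at[path]
        return t is None or time.monotonic() - t > RETRY_AFTER_S

board = PathScoreboard(n_paths=4)
for _ in range(3):
    board.on_loss(2)                        # three strikes on path 2...
print([board.usable(p) for p in range(4)])  # [True, True, False, True]
```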
This was not only a design claim. Steinbrecher said that as one data center was being built, links were going up and down frequently because of manual construction work: fibers from one data hall coming in, technicians assembling another, and other shared physical work. He said MRC handled it without researchers noticing.
That reliability allowed another simplification. Handley said OpenAI realized MRC itself could determine which paths still worked, so the team turned off dynamic routing protocols and used static routing in the switches at the largest scale. Switches boot with a configuration and do not change routing tables afterward.
The complexity moves to the edge. Handley said MRC source-routes packets through the network using IPv6 segment routing, so each packet’s address lists the precise set of switches it should traverse. The switches in the middle can be “really dumb,” which he framed as a reliability advantage: simplifying the middle of the network helps when scaling.
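As a schematic illustration (simplified relative to RFC 8754, with placeholder documentation-prefix addresses rather than anything MRC-specific), source routing looks like this: the endpoint writes the entire hop list into the packet, and a core switch only pops the next entry.

```python
from dataclasses import dataclass

# Schematic sketch in the spirit of IPv6 segment routing, simplified
# relative to RFC 8754. Addresses use the 2001:db8::/32 documentation
# prefix; nothing here is MRC's actual header layout.

@dataclass
class SegmentRoutedPacket:
    segments: list       # the endpoint's chosen hops, in travel order
    segments_left: int   # hops not yet visited
    payload: bytes

def next_hop(pkt):
    # A core switch keeps no routing state: it reads the next listed
    # segment and decrements the counter. Path intelligence stays at the edge.
    pkt.segments_left -= 1
    return pkt.segments[len(pkt.segments) - 1 - pkt.segments_left]

pkt = SegmentRoutedPacket(
    segments=["2001:db8::a1", "2001:db8::b7", "2001:db8::c2"],
    segments_left=3,
    payload=b"gradient shard",
)
print(next_hop(pkt))  # 2001:db8::a1 -- first switch on the chosen path
print(next_hop(pkt))  # 2001:db8::b7
```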
OpenAI wants MRC standardized, not kept proprietary
Mark Handley said OpenAI is working with Microsoft, Nvidia, Broadcom, AMD, and Intel on the MRC specification and on hardware for new supercomputers. He said the specification is due out through the Open Compute Project (OCP) as an open standard.
OpenAI’s stated incentive is practical alignment. Handley said OpenAI builds its networks on Ethernet, itself an open standard, and benefits when the wider industry can move quickly. If the best solutions become shared infrastructure, the industry does not need to repeatedly reinvent the wheel.
Greg Steinbrecher made the same point in supply-chain terms. Given the scale of AI build-out, he said it would be a shame for the supply chain to fracture because different groups invest in different underlying technologies for small advantages. Infrastructure, in his words, is a shared fate for the industry.
He also argued that an OpenAI-exclusive MRC would not do as well. Ethernet’s strength came partly from open standardization and broad industry investment. Steinbrecher said OpenAI wants “the next layer” to be ready for AI’s systems challenges and widely adopted.
There is still an OpenAI-specific benefit. Steinbrecher said MRC has removed one of the key barriers to continued scaling. Researchers have given positive feedback on the stability of clusters using MRC, and the ideal state is that they do not need to know which network protocol a cluster is using.
We know we've won when researchers stop needing to know what network protocol this particular cluster is using.
That does not mean infrastructure disappears. Steinbrecher emphasized that OpenAI is still pushing limits and that “plenty of other things” break. But if networking no longer dominates researchers’ attention, the teams can focus on the next constraints.
The protocol also changes cost, power, and future model scaling
One less obvious advantage of MRC, according to Greg Steinbrecher, is that multipath spraying can enable simpler and smaller networks. Because traffic can be distributed across paths more effectively, OpenAI can build flatter networks with fewer layers of switches. Those networks use less power and cost less, and more of the power budget goes directly to GPUs doing useful work rather than to extra switching layers.
He framed this as an efficiency concern. If OpenAI is going to spend power running these systems, he said, it should be used productively.
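Standard Clos-topology arithmetic, with an assumed switch radix, suggests the shape of the saving. The comparison below normalizes by host count because the two fabrics support different maximum sizes:

```python
# Rough Clos arithmetic with an assumed switch radix (k ports per switch);
# illustrative only. Fewer tiers means fewer switches, and fewer watts,
# per host -- which is why load balancing that permits a flatter fabric
# feeds straight into the power budget.
k = 64  # assumed switch radix

# Two-tier leaf-spine at full bisection: k^2/2 hosts, 3k/2 switches.
hosts_2t, switches_2t = k**2 // 2, 3 * k // 2
# Three-tier fat tree at full bisection: k^3/4 hosts, 5k^2/4 switches.
hosts_3t, switches_3t = k**3 // 4, 5 * k**2 // 4

print(f"2-tier: {switches_2t / hosts_2t:.3f} switches per host")
print(f"3-tier: {switches_3t / hosts_3t:.3f} switches per host")
```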
The demands are not standing still. Asked whether multimodal models change the training-system requirements, Steinbrecher declined to discuss model-architecture details, but said that as models become more advanced, the systems-engineering demands become substantially harder. The data movement required, and the penalty when the network is slightly too slow, both get worse as clusters grow and the rest of the stack improves.
That creates a moving target for networking teams. If GPU work becomes faster because researchers and infrastructure teams improve other parts of the system, the network has less time to move data before it becomes the bottleneck. Steinbrecher said the work “never stops.”
He also warned that adding more paths without MRC can worsen tail statistics. With the same amount of bandwidth spread across more paths, the worst path can diverge more from the average path. That is why deterministic routing and careful load balancing across a large number of links matter.
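That divergence is easy to demonstrate in simulation. The jitter model and figures below are invented for illustration; only the trend matters:

```python
import random

# Fix total bandwidth, split it across more links, and the worst link
# drifts further from the average one. The jitter model here is invented.
random.seed(1)

def worst_vs_mean(n_paths, trials=2000):
    ratios = []
    for _ in range(trials):
        times = [random.lognormvariate(0.0, 0.3) for _ in range(n_paths)]
        ratios.append(max(times) / (sum(times) / n_paths))
    return sum(ratios) / trials

for n in (8, 64, 1024):
    print(f"{n:>5} paths: worst path runs {worst_vs_mean(n):.2f}x the average")
```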
Mark Handley tied the future of MRC to Ethernet’s future. Ethernet, he said, has changed dramatically over decades, and MRC benefits from that wider industry development. Because MRC puts intelligence at the edge of the network, the network core can scale as Ethernet scales. Handley said there is no obvious reason Ethernet will not keep scaling in at least the near future.
There are still physical limits. Steinbrecher pointed to the speed of light as a fundamental constraint. Faster links change how much data can be outstanding on a connection at a given time, which means each hardware generation will require more engineering to use well. But he presented MRC as a flexible base for the next few generations.
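The constraint is the bandwidth-delay product: round-trip time has a light-speed floor, so every jump in link rate multiplies the data that must be in flight to keep a link busy. A quick calculation with an assumed in-building round trip:

```python
# Bandwidth-delay product with illustrative numbers: the data that must be
# outstanding to keep a link busy is rate x round-trip time.
rtt_s = 10e-6  # assumed 10 microsecond round trip within a building

for gbps in (100, 400, 800, 1600):
    in_flight_bytes = gbps * 1e9 * rtt_s / 8
    print(f"{gbps:>5} Gb/s link: {in_flight_bytes / 1024:.0f} KiB outstanding")
```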
Space compute remains hard to justify for training
Mark Handley was skeptical that the kind of training OpenAI does in Stargate data centers could be done in space. Latency would be a major problem, he said, as would background failure rates. He also noted the practical issue that Microsoft and Oracle technicians go into terrestrial data centers constantly to fix things. Doing that in orbit would be hard.
Greg Steinbrecher said smart people have made reasonable arguments on both sides of the space-compute question, and the dreamer side of him finds it compelling. But the practical side sees severe barriers.
The major barrier he emphasized was failure. Each GPU generation makes individual GPUs more powerful, more expensive, and more important. OpenAI is working on Earth to route around failures automatically, but Steinbrecher said a space deployment could leave operators with a lot of failed hardware that could not be repaired or replaced quickly. Unless one can also put technicians in space, the maintenance problem remains.
His bottom line was that these systems are already very hard to build and operate on Earth. MRC itself required close collaboration with engineers at multiple companies, hands-on testing, and hardware fixes. Adding the complications of space would require a very strong reason.
When the interviewer reduced the implication to “build more terrestrial compute centers,” Steinbrecher’s answer was simple: “Please.” He said the goal is to build a lot of compute so OpenAI can increase the net amount of intelligence in the world.