NVIDIA Says Vera Rubin Is in Full Production for Agentic AI

NVIDIA | Tuesday, June 2, 2026 | 5 min read

NVIDIA says its Vera Rubin platform is now in full production, positioning it as a pod-scale “AI factory” for agentic workloads rather than a conventional accelerator launch. The company argues that agents shift the bottleneck from model execution to full-system orchestration (reasoning, memory, tool use, low-latency token generation, storage, networking and power) and that Vera Rubin addresses this through five connected rack-scale systems. NVIDIA frames the milestone as both a technical and a manufacturing claim, built on extreme co-design across chips, racks, data centers and Taiwan’s supply chain.

Agentic workloads make AI a systems problem, not just a model problem

NVIDIA’s case for Vera Rubin begins with a change in workload shape. Large language models generate answers; AI agents are expected to do work. That means observing, reasoning, planning, using tools, managing context, maintaining working and long-term memory, and calling specialized sub-agents on demand.

A software-building assistant illustrates the burden NVIDIA has in mind. A user asks how to build a web management application with authentication, project management, and a test suite; the interface positions the assistant as able to plan architecture, research and review, compare approaches, and summarize. The harder problem is the orchestration implied by an agent moving among prompt handling, context, observation, reasoning, action, memory, tools, skills, and interaction with an external environment.
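The loop implied by that description can be sketched in a few lines of Python. This is a toy illustration only, not NVIDIA's software: the `Agent` class, its `plan` stub, and the two tools are hypothetical names, and a real agent would call a model where the stub returns a fixed plan.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    """Toy agent: plan, act through tools, and record observations in memory."""
    tools: dict[str, Callable[[str], str]]
    memory: list[str] = field(default_factory=list)  # working memory

    def plan(self, goal: str) -> list[tuple[str, str]]:
        # Stand-in for model reasoning: a fixed list of (tool, argument) steps.
        return [("research", goal), ("summarize", goal)]

    def run(self, goal: str) -> list[str]:
        self.memory.append(f"goal: {goal}")               # prompt and context
        for tool_name, arg in self.plan(goal):            # reasoning and planning
            result = self.tools[tool_name](arg)           # action through a tool
            self.memory.append(f"{tool_name}: {result}")  # observation kept in memory
        return self.memory


# Hypothetical tools; real ones might be search, code execution, or sub-agents.
tools = {
    "research": lambda q: f"notes on {q}",
    "summarize": lambda q: f"summary of {q}",
}

agent = Agent(tools=tools)
trace = agent.run("web management app")
```

Even in this reduced form, every step touches a different resource (model reasoning, tool execution, memory), which is the sense in which the workload becomes a systems problem.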

Vera Rubin is NVIDIA’s answer to that systems problem: a “multi-rack pod-scale system” built to process agentic AI and now in full production. NVIDIA presents the platform as more than a single accelerator: it combines compute, CPU orchestration, low-latency token generation, storage and memory processing, networking, manufacturing automation, and supply-chain coordination.

Vera Rubin is described as five connected rack-scale systems: “a supercomputer for AI agents.”

Vera Rubin extends the DGX-1 line from rack scale to pod scale

NVIDIA places Vera Rubin in a decade-long progression that begins with DGX-1, which it calls the first AI supercomputer. The company compresses that history into a few system milestones: Pascal and the first NVLink, followed by Grace Blackwell, which it calls the first rack-scale AI supercomputer.

Vera Rubin is the next jump: “the first multi-rack pod-scale supercomputer built for the agentic age.” In NVIDIA’s logic, DGX-1 starts the AI-supercomputing line, Grace Blackwell moves it to rack scale, and Vera Rubin moves it to pod scale around agentic workloads.

The manufacturing chain starts at TSMC, where NVIDIA says the seven new chips in Vera Rubin are produced through hundreds of processing steps. The named ingredients include a 3-nanometer process, CoWoS-R and CoWoS-L packaging, and HBM4 memory supplied by Micron, SK Hynix, and Samsung. The compute board contains 6 trillion transistors and more than 18,000 components on a single board, according to NVIDIA.

6 trillion transistors on the Vera Rubin compute board

This is the baseline for NVIDIA’s own “extreme co-design” framing. The evidence the company offers is not a benchmark alone, but the accumulation of fabrication, packaging, memory, board, rack, networking, and factory-floor details. Vera Rubin is cast as something whose feasibility depends on designing chips, boards, racks, data-center power, and manufacturing flow together.

NVL72 ties performance to rack-scale physical integration

The Vera Rubin NVL72 rack handles the core “thinking” role: prompt and context understanding, reasoning, and planning. NVIDIA pairs that role with a visible “10X lower cost per token” claim.

The rack architecture is part of the same performance argument. NVL72 uses a modular compute tray and a new PCB mid-plane design. Superchips, ConnectX-9 SuperNICs, and BlueField-4 DPUs mate in place without cables, a design choice NVIDIA links to resiliency at AI factory scale.

The rack configuration includes 18 compute trays and 9 hot-swappable NVLink switch trays. It also includes high-efficiency manifolds and liquid-cooled bus bars carrying more than 5,000 amps, which NVIDIA compares to 20 electric cars at full acceleration. In total, NVIDIA says 1.3 million components make up this third-generation MGX rack design.

NVL72 element	Role described by NVIDIA	Quantitative detail
Modular compute trays	Rack-scale compute assembly using a new PCB mid-plane	18 compute trays
NVLink switch trays	Hot-swappable switching inside the rack	9 trays
Cableless mating design	Superchips, ConnectX-9 SuperNICs, and BlueField-4 DPUs mate in place for resiliency	No cables
Liquid-cooled bus bars	Rack power delivery	Over 5,000 amps
Third-generation MGX rack	Integrated physical assembly	1.3 million components

NVIDIA presents NVL72 performance as inseparable from rack-scale physical integration.
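A few ratios follow directly from the figures in the table. The short sketch below only restates NVIDIA's numbers and divides them; the per-car current assumes the "20 electric cars" comparison splits the stated 5,000 amps evenly, which NVIDIA does not spell out, and 5,000 amps is a lower bound.

```python
# Figures as stated by NVIDIA for the Vera Rubin NVL72 rack.
COMPUTE_TRAYS = 18
SWITCH_TRAYS = 9
BUS_BAR_AMPS_MIN = 5_000   # "more than 5,000 amps", so treat as a lower bound
CARS_IN_COMPARISON = 20

total_trays = COMPUTE_TRAYS + SWITCH_TRAYS

# Compute trays served per NVLink switch tray.
compute_per_switch = COMPUTE_TRAYS / SWITCH_TRAYS

# Assumption: the car comparison divides the current evenly.
amps_per_car = BUS_BAR_AMPS_MIN / CARS_IN_COMPARISON

print(total_trays, compute_per_switch, amps_per_car)
```

That works out to 27 trays in the rack, two compute trays per switch tray, and an implied figure of at least 250 amps per car in NVIDIA's comparison.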

Early engineering-rack deployments are part of the production claim. NVIDIA congratulates Microsoft on an operational Vera Rubin NVL72 engineering rack, shown in footage labeled “Foxconn for Microsoft.” It also names Dell and CoreWeave for standing up a Vera Rubin NVL72 engineering rack, with a visual label reading “Dell | Wistron for CoreWeave.”

The five-rack platform divides agent work across compute, orchestration, latency, memory, and networking

Vera Rubin is not one rack doing every job. NVIDIA’s lineup graphic shows five rack-scale systems: Vera Rubin NVL72, Vera CPU, Groq 3 LPU, Vera BlueField-4 STX, and Spectrum-X SFX.

The Vera CPU rack serves as the orchestration layer: 256 CPUs in a single liquid-cooled rack, responsible for orchestrating models, shuffling memory, and launching tools. The on-screen performance claim is “1.8X the agentic sandbox performance of x86 CPUs.”

The Groq system has a latency-specific role. NVIDIA’s lineup graphic labels it “Groq 3 LPU,” while the spoken narration says “Groq 3 LPX” takes shape at Foxconn and Quanta. The stated configuration is 256 Groq 3 LPUs across 16 trays and 40 petabytes per second of SRAM bandwidth for ultra-low latency. NVIDIA draws a distinction between two token-generation functions: NVL72 generates tokens at the highest throughput, while Groq LPX generates them at the lowest latency. The visible claim says Groq 3 LPU provides “35X higher throughput at low latency with Vera Rubin NVL72.”
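The stated Groq configuration implies some per-tray and per-chip figures, worked out below. Two assumptions are not in NVIDIA's material: that the 256 LPUs and the 40 PB/s of SRAM bandwidth are spread evenly, and that a petabyte is counted as 1,000 terabytes.

```python
# Figures as stated by NVIDIA for the Groq rack.
LPUS = 256
TRAYS = 16
SRAM_BW_PB_S = 40  # aggregate SRAM bandwidth, petabytes per second

# Assumption: LPUs are distributed evenly across trays.
lpus_per_tray = LPUS // TRAYS

# Assumption: bandwidth divides evenly; 1 PB = 1,000 TB (decimal units).
bw_per_lpu_tb_s = SRAM_BW_PB_S * 1000 / LPUS
bw_per_tray_pb_s = SRAM_BW_PB_S / TRAYS

print(lpus_per_tray, bw_per_lpu_tb_s, bw_per_tray_pb_s)
```

Under those assumptions, each tray holds 16 LPUs with 2.5 PB/s of SRAM bandwidth, or about 156 TB/s per LPU.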

Vera BlueField-4 STX is where, in NVIDIA’s phrasing, “AI keeps its memory”: storage processing accelerated by BlueField-4, connecting memory, storage, and in-silicon security. Spectrum-X SFX Ethernet supplies the networking layer, which NVIDIA describes as the world’s first Ethernet switch with 200 gigabit co-packaged optics, using TSMC’s COUPE process, chip-scale packaging, and ultra-high-powered laser dies on indium phosphide.

Rack-scale system	Function described by NVIDIA	Metric shown or stated
Vera Rubin NVL72	Reasoning, planning, prompt and context understanding	10X lower cost per token
Vera CPU	Model orchestration, memory movement, tool launching	256 CPUs; 1.8X agentic sandbox performance of x86 CPUs
Groq 3 LPU / LPX	Ultra-low-latency token generation	256 LPUs across 16 trays; 40 PB/s SRAM bandwidth; 35X higher throughput at low latency with NVL72
Vera BlueField-4 STX	Storage processing, memory and storage connection, in-silicon security	5X higher tokens per second
Spectrum-X SFX Ethernet	AI networking with co-packaged optics	200 Gb co-packaged optics; 1.6X higher AI network performance

NVIDIA breaks the agentic AI factory into five connected rack-scale systems.
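One way to read the table is as a routing map from kinds of agent work to rack types. The sketch below is purely illustrative: the task labels and the `route` function are hypothetical, and NVIDIA has not described a scheduler that works this way.

```python
from enum import Enum


class Rack(Enum):
    NVL72 = "Vera Rubin NVL72"
    VERA_CPU = "Vera CPU"
    GROQ_LPU = "Groq 3 LPU"
    STX = "Vera BlueField-4 STX"
    SFX = "Spectrum-X SFX"


# Hypothetical task labels, mapped to the functions listed in the table.
ROLE_TO_RACK = {
    "reasoning": Rack.NVL72,
    "planning": Rack.NVL72,
    "orchestration": Rack.VERA_CPU,
    "tool_launch": Rack.VERA_CPU,
    "low_latency_decode": Rack.GROQ_LPU,
    "context_storage": Rack.STX,
    "networking": Rack.SFX,
}


def route(task: str) -> Rack:
    """Return the rack type whose described role covers the given task label."""
    try:
        return ROLE_TO_RACK[task]
    except KeyError:
        raise ValueError(f"no rack role defined for task {task!r}") from None


# A single agent request can touch several rack types in sequence.
steps = ["planning", "tool_launch", "low_latency_decode", "context_storage"]
placement = [route(step).value for step in steps]
```

Here `placement` lists four different racks for one short request, which is the division of labor the five-rack design is meant to express.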

Full production depends on the manufacturing system around the platform

Full production is a supply-chain claim as much as a product milestone. NVIDIA names Foxconn, TSMC, Quanta, Wistron, Microsoft, Dell, and CoreWeave in the production and engineering-rack story. It says the platform spans 150 supply-chain partners across Taiwan, millions of square feet of factory floor, and hundreds of sites.

Those details support NVIDIA’s chosen phrase: “extreme co-design.” The term is used here for chips, packages, systems, and data centers being pushed together to the limits of size, power, and scale. A data-center layout with a chart labeled “Total Projected Power (MW)” reinforces that Vera Rubin is being discussed at data-center scale rather than as a standalone server announcement.

NVIDIA closes by explicitly crediting Taiwan: “We did this with Taiwan.” The company says Taiwan was present at the beginning of its AI supercomputing journey and remains central as Vera Rubin is brought to market. The claim is not only that Vera Rubin is in full production, but that its production depends on a manufacturing ecosystem capable of fabricating, packaging, assembling, and scaling the system.
