NVIDIA Frames Cosmos 3 as Compute-Generated Data for Physical AI

NVIDIATuesday, June 2, 20265 min read

NVIDIA presents Cosmos 3 as an open foundation model for physical AI, built to address what it frames as a data-scaling problem in robotics, autonomous vehicles and other systems that operate in the physical world. The company argues that real-world data cannot capture enough variability on its own, so compute must generate usable training and evaluation signals: synthetic video, predicted sensor outputs, simulation loops and action plans. Cosmos 3 is positioned as a post-trainable mixture-of-transformers system that combines multimodal reasoning with generation to support perception, prediction, simulation and action.

Physical AI’s data problem is framed as a compute problem

NVIDIA introduces Cosmos 3 around a constraint: physical AI needs data from a world that is “infinite and unpredictable,” but real-world data cannot be scaled indefinitely. The proposed answer is a model system in which compute produces signals physical AI systems can use: generated video, next sensor outputs, simulation loops, and action plans. The shorthand is blunt:

For physical AI, compute is data.

Cosmos 3 is introduced as “an open frontier omni-model for physical AI,” built on a new “mixture of transformers” architecture. The source description points developers to a press release, Cosmos learning materials, and a Hugging Face collection for trying the model.

The architecture diagram places text, image, video, audio, and action on both the input and output sides of the system. Inside the “Cosmos Mixture of Transformers” block, the model is split into an “Autoregressive Reasoner” and a “Diffusion Generator.”

The division of labor is explicit. Pixels, action, sound, and language flow into the autoregressive transformer, which reasons, plans, and instructs the diffusion transformer. The diffusion transformer then generates what comes next.

That makes Cosmos more than a single-purpose video generator in the source’s framing. It is described as a system that takes multimodal physical-world signals as inputs, reasons over them, and uses generation as part of a broader loop involving perception, prediction, planning, and action.

Post-training is the mechanism for adapting one foundation model to many physical systems

Cosmos is described as a post-trainable foundation rather than a finished application. Developers “post-train Cosmos across embodiments and use cases,” adapting it for different machines, environments, and tasks.

As a VLM, Cosmos “watches the physical world,” understands what is happening, describes scenes, and flags what matters. The visual example is an aerial timelapse of a traffic intersection. The frame includes bounding boxes around people and vehicles and a generated report interface. The visible prompt asks: “Generate a report on this drone timelapse of a traffic intersection. Include observed behaviors and 3 actionable insights for a traffic engineer.” The report begins with an overview of “a busy urban intersection with multiple lanes and crosswalks.”

The vision-model example is not merely object labeling. It shows a physical scene being interpreted over time and converted into a report meant for an operational user: a traffic engineer.

As a world model, Cosmos generates “physics-accurate synthetic video” from an image, text, or video. Compute-generated video is offered as a scalable data source for physical AI, consistent with the opening claim that real-world data cannot scale to the variability of the physical world.

As a simulator, Cosmos “closes the loop for policy training and evaluation.” That loop matters because the source’s premise is that physical AI cannot rely only on scaling real-world data. A model that can generate future physical states is shown as part of training and evaluation workflows.

OmniDreams makes the prediction loop action-conditioned

Cosmos is also positioned as the foundation of NVIDIA OmniDreams, described as “an action-conditioned world model.” The OmniDreams diagram places “AlpaSim Simulation Runtime,” “Text,” and a “Cache of History Frames” as inputs into “Post-Trained Cosmos.” The model performs “Autoregressive Causal Video Generation” and outputs the “Next Sensor Output.”

The loop is more specific than ordinary text-to-video generation. In the OmniDreams framing, Cosmos receives simulator context, language, and prior frames, then predicts the next sensor output. The source describes this as predicting “the future frame by frame.”

That frame-by-frame prediction is the basis for the simulator and policy-evaluation claim. The model is useful in this account not only because it represents a scene, but because it can respond to action-conditioned context and produce the next observable state.

Post-trained further, Cosmos is said to become a “world action model”: perceiving, reasoning, planning, and generating actions. The example is a robotic arm manipulating tools on a table. The task text reads: “Pick up the screwdriver and put it on the top rack.” The generated plan is procedural: lower the gripper to grasp the screwdriver, lift it from the table, move it horizontally toward the white rack, and place it on the designated shelf.

The example compresses the intended stack into one scene. The system must recognize the screwdriver and rack, understand the instruction, break it into physical steps, and produce a plan tied to robot motion. Cosmos is offered as a bridge between perception, reasoning, and action.

The foundation claim is for everything that moves

The closing claim is deliberately broad: Cosmos is for “robots of every kind” and “everything that moves.” It is called “a new kind of data” and “a new kind of teacher,” generated by compute.

That claim ties back to the opening constraint. If real-world data cannot scale to match the variability of the physical world, Cosmos is offered as a way to generate synthetic video, next sensor outputs, simulation loops, and action plans. The product positioning is not simply that Cosmos can generate media; it is that physical AI needs models that integrate perception, prediction, simulation, and action.

Cosmos 3 is described as open, post-trainable, and designed for developers building in robotics, autonomous vehicles, vision AI, synthetic data generation, closed-loop simulation, and world action models.

Data and Training AI in Robotics and Physical Systems AI Research Methods Agents and Autonomy Multimodal AI Open Models Image and Video Generation Model Releases

Physical AI’s data problem is framed as a compute problem

Post-training is the mechanism for adapting one foundation model to many physical systems

OmniDreams makes the prediction loop action-conditioned

The foundation claim is for everything that moves

The frontier, in your inbox tomorrow at 08:00.