Language Models Are Becoming the Bottleneck in Video Generation
Ethan He, who worked on NVIDIA’s Cosmos world model and xAI’s Grok Imagine, argues that the next major gains in video generation will come less from diffusion models alone than from language models, agents, and context management around them. In an interview with swyx and Vibhu Sapra, He describes Grok Imagine as a fast-built example of that shift: diffusion renders pixels, while language systems increasingly rewrite prompts, plan clips, call tools, manage memory, and turn short generations into longer, editable video.

The bottleneck in video generation is moving toward language
Ethan He’s strongest claim is that the next gains in video generation will come less from diffusion itself and more from language models: prompt rewriting, tool use, long-horizon planning, and agentic editing. Diffusion models still matter. They are the machinery that produces pixels. But He described mature video diffusion models as “kind of dumb” in a specific sense: they take their instructions literally.
“The visual intelligence are actually mostly coming from language,” He said.
That literalness is a consequence of how they are trained. To build a text-to-image or text-to-video model, teams first need paired examples of language and pixels. Internet videos do not naturally arrive with useful captions. A YouTube title, description, or comment thread may have little relation to the visual content. He gave the example of a natural mountain scene whose title might be “I’m so happy today.” The text is not a dense description of the scene; it is social metadata.
So teams generate synthetic language-video pairs. In Cosmos, He said, human labelers were instructed to describe images or videos with enough detail that a blind person could reconstruct the scene mentally from the text. Those captions become the language conditioning for the generative model. The model then learns a relationship between detailed descriptions and visual outputs. If a user later prompts it with “a cat,” the model has no reason to infer the missing details. It may produce a cat on a white background, not moving, because the user did not specify the background, motion, lighting, camera, or surrounding scene.
That is why prompt rewriting matters. In Cosmos, He said the video model itself was 7B parameters, while the prompt rewriter used a larger language model such as Llama or Mixtral. The language model’s job was to convert the user’s sparse instruction into the kind of dense visual description the diffusion model had learned to follow. He recalled generating a “happy sheep” in Cosmos: without rewriting, it looked “so CGI”; with rewriting, and without joint training, it looked much better.
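A prompt rewriter of this kind can be sketched in a few lines. This is a minimal illustration, not the Cosmos implementation: the instruction wording, the `rewrite_prompt` and `build_rewrite_instruction` names, and the `fake_llm` stand-in are all hypothetical, and the language model is passed in as a plain text-in, text-out callable so the sketch runs offline.

```python
from typing import Callable

# Details the article says a sparse prompt such as "a cat" leaves unspecified.
MISSING_ASPECTS = ["background", "motion", "lighting", "camera", "surrounding scene"]

def build_rewrite_instruction(user_prompt: str) -> str:
    """Compose the instruction sent to the language model (hypothetical wording)."""
    aspects = ", ".join(MISSING_ASPECTS)
    return (
        "Rewrite the following video prompt as a dense visual description. "
        f"Explicitly specify: {aspects}. "
        f"Prompt: {user_prompt!r}"
    )

def rewrite_prompt(user_prompt: str, llm: Callable[[str], str]) -> str:
    """Return the dense caption; `llm` is any text-in, text-out callable."""
    return llm(build_rewrite_instruction(user_prompt)).strip()

def fake_llm(instruction: str) -> str:
    # Placeholder for a real model call, so the sketch runs without a network.
    return " A tabby cat walks across a sunlit kitchen floor, handheld camera. "

dense_prompt = rewrite_prompt("a cat", fake_llm)
print(dense_prompt)
```

The dense caption, not the user's original text, is what would be handed to the diffusion model as conditioning.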
He generalized that point beyond prompt expansion. In his view, the “thinking process” in modern image and video generation increasingly sits in the language layer. If a user asks for an image of today’s news, the system may need to fetch current information, digest it, decide on a layout, and only then render an image. That planning is not diffusion. It is language-model reasoning and tool use wrapped around a generative media model.
The interview opened with this claim over generated video examples: a preview labeled “Grok image video 1.5 preview” showing Elon Musk in military gear looking out of a transport aircraft, followed by a side-by-side comparison attributed on screen to Shutterstock, contrasting a distorted March 2023 generation of Will Smith eating spaghetti with a more realistic January 2026 generation of Will Smith chopping garlic. He used that kind of improvement to make the point that, now that diffusion technology is more mature, many visible gains in video systems come from language-model improvements rather than from the video diffusion model alone.
Vibhu Sapra raised the concern that this makes video agents, rather than new video architectures, the real future. He called the view somewhat dissatisfying if the implication is that video models are “tapped out” and the remaining work is just harnesses. He’s response was that the harness is not necessarily trivial. A video agent can know how to prompt different models, use deterministic tools where generation is weak, and iteratively refine outputs the way human creators do. It might call a video model, an image editor, Photoshop-like tools, FFmpeg, or other editing systems, rather than expecting the generative model to do everything in one shot.
That distinction mattered in the demonstrated Grok Imagine Agent Mode flow. The interface was shown with the title “Imagine Agent Mode - Grok,” options such as “Create Video,” “Short Film,” “UGC Product Stories,” and “Brand Identity,” and a prompt box asking, “What would you like to create?” A user then typed a request to “generate a 1 min video of a dog going on a walk, sniffing around, getting a pup cup, meeting a friend.” The agent’s visible steps included drafting prompts, structuring clips, generating clips, stitching clips, evaluating them with FFmpeg, preparing a concatenation command, verifying properties, and executing the video stitch. The output shown was 36 seconds, not one minute, but He argued the pattern was already present: long-form video generation becomes an agentic composition problem.
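The stitching and verification steps can be sketched as follows. This is an assumption about what such an agent might do, not the Grok Imagine implementation: the helper names and the six 6-second clips are hypothetical, and file names are assumed to contain no single quotes. The FFmpeg arguments use the concat demuxer with stream copy, which only works when all clips share the same codec parameters.

```python
from typing import List

def concat_list_text(clips: List[str]) -> str:
    """Build the list file read by FFmpeg's concat demuxer, one clip per line."""
    return "".join(f"file '{clip}'\n" for clip in clips)

def concat_command(list_file: str, output: str) -> List[str]:
    """Argument list for concatenating clips without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]

def remaining_seconds(durations: List[float], target: float) -> float:
    """How much footage is still missing relative to the requested length."""
    return max(0.0, target - sum(durations))

clips = [f"clip_{i}.mp4" for i in range(6)]   # hypothetical file names
durations = [6.0] * 6                          # hypothetical clip lengths
print(concat_list_text(clips), end="")
print(" ".join(concat_command("clips.txt", "walk.mp4")))
print("seconds still missing:", remaining_seconds(durations, 60.0))
```

A check like `remaining_seconds` is the kind of signal an agent could use to decide whether to plan and generate more clips before stitching again.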
This also reframes “video editing.” When Grok Imagine first shipped editing, He said some users tried prompts like “edit this video to be one minute.” For a conventional video editing model, that is not what editing means; editing typically covers removal, addition, replacement, restyling, or motion control. But under a video-agent assumption, “make this one minute” becomes a valid long-horizon task. The agent can plan clips, generate alternatives, stitch, evaluate, and revise.
He predicted that video agents could become a major category by the end of the year if they cross a usability threshold: production-grade outputs that can be presented and distributed in ads. He framed that as the point where enterprise budgets would expand for video systems, even though agents are inherently more expensive than single video generations because they iterate, generate many variations, and use tools.
Grok Imagine was built under two different infrastructure realities
Ethan He joined xAI in mid-2025 after working on Cosmos at NVIDIA. Cosmos, as He described it, was a giant video foundation model intended to simulate the world and serve as a base for roboticists. A displayed NVIDIA Cosmos page described it as “an open platform for physical AI with world foundation models,” with Cosmos Predict, Cosmos Transfer, Cosmos Reason, data processing, evaluation, and post-training frameworks.
At NVIDIA, He concluded that video models appeared to have scaling laws similar to language models and that they needed to be scaled further. That pushed him toward a compute-rich environment. When he arrived at xAI, he said the company was about to build video and multimodal models. For that new effort, he described the starting point as “no infra, no data, and no model.” But he also said xAI already had strong foundations in data infrastructure and model infrastructure that could support model development. The distinction matters: the video/multimodal product line had to be built from zero to one, while the company’s broader platform gave the team a base to move quickly.
A small group built the first Grok Imagine model, Grok Imagine 0.9, in three months. He attributed part of that speed to having already helped build Cosmos, where the first pass took about a year. By the second time, he had a rough map of what had to be built, what order the pieces came in, and where mistakes were likely to hide.
He was careful not to describe xAI’s internal process in detail. Instead, he framed the standard path using Cosmos as the example: build or acquire data, caption it, train a compressor/tokenizer, train an image model, then bootstrap the video model. But on the organizational side, he was direct about what made the timeline possible.
The first factor was talent density. The team was small, strong, and aligned. That reduced communication bandwidth. He described the cadence as roughly one daily sync and then building. The second factor was infrastructure. xAI’s broader data and model foundations made rapid experimentation possible. For He, the key metric in model development is not how many meetings happen or how elegant the algorithm sounds; it is how many full cycles the team can run per day.
An iteration, in his definition, runs from acquiring or changing data, through designing or changing an algorithm, training a model at some scale, and evaluating it, to deciding whether it beats the previous version. More iterations give the team more chances to discover what is actually limiting quality. He emphasized that many of the largest improvements did not come from new algorithms at all. They came from finding small bugs in data pipelines and training pipelines.
That answer created an apparent tension. A small team can move quickly, but more people might seem better suited to finding bugs. He did not resolve that tension as a universal rule. In this case, he argued that a strong, closely collaborating team with good infrastructure could move faster because communication overhead was low and iteration speed was high.
Coding models have changed that bottleneck. In mid-2025, He said, coding models were useful but not yet consistently maintainable. They could generate large, tangled codebases that neither the human nor the model could easily improve. By December 2025, he found them much better. As implementation accelerates from weeks to hours, compute becomes the bottleneck again: if engineers can generate synthetic data pipelines or implement algorithmic ideas quickly, they immediately want to train models to test them.
The stress of that environment is real. Shawn Wang (swyx) described the pressure as a job where there is always another experiment one “should” be trying. Sapra added that consuming thousands of GPUs per hour carries pressure because compute is finite and could go to other researchers. He acknowledged the stress but described it as a marathon: coding models can automate more work, but researchers still need health and regular schedules.
The culture, in He’s account, intensified that speed. He later summarized xAI’s operating principles as “move fast,” “build,” “no goal is too ambitious,” and “first principle.” Goals such as building something in three months initially felt impossible to him. But he said the team would reason backward from operational constraints: how fast videos can be acquired, how long model training takes end to end, how much adding GPUs accelerates the timeline, and how quickly human data can return. That, he said, is first-principles thinking applied to model development rather than physics alone.
Image models usually come first because they teach language-to-visual grounding more cheaply
Ethan He described a common path for serious video model training: build the image model first. The reason is not only cost, though image models are cheaper. It is density of language grounding.
An image model can be trained on a billion text-image pairs, giving it many opportunities to associate language tokens with visual concepts. Training on a billion text-video pairs is much more expensive because videos contain far more tokens. If a video model is trained on too few videos, He said, it may not see enough language variation to understand human intent well. The image diffusion model supplies a cheaper base of language-visual mapping, and the video model can be bootstrapped from it.
The pipeline has several core steps. First, teams build synthetic caption pairs. If no vision-language model exists, humans must create the initial captions. Once a VLM is available, it can caption images and videos at scale. Shawn Wang raised the possibility of an unsupervised unlock from interspersed image and text, rather than only human-curated supervision. He agreed that both kinds of data are useful, and said generative model training often includes a small percentage of unlabeled data where the model is instructed to generate without text, which can improve generalization.
Second, teams train a tokenizer or compressor, often a VAE. Training a transformer directly on pixels is possible in theory but impractical at normal image sizes: a 1000-by-1000 image contains about one million pixels, and treating each pixel as a token would make the sequence far too long. A VAE maps images or videos into a latent space, then maps from latent space back to pixels. He described the latent representation as continuous rather than a discrete vocabulary: patches of pixels are mapped into fixed-length vectors, such as 16- or 48-dimensional vectors.
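The size of the saving is easy to work out. The sketch below assumes an 8-by-8 spatial compression factor, which is an assumption for illustration (the article only mentions that factor later, for Wan 2.1), and a 16-dimensional latent vector per position.

```python
import math

def latent_grid(height: int, width: int, spatial: int = 8) -> tuple:
    """Rows and columns of the latent grid after spatial compression."""
    return math.ceil(height / spatial), math.ceil(width / spatial)

HEIGHT, WIDTH = 1000, 1000
pixel_positions = HEIGHT * WIDTH                 # one token per pixel
rows, cols = latent_grid(HEIGHT, WIDTH)          # assumed 8x8 compression
latent_positions = rows * cols                   # one token per latent vector
latent_values = latent_positions * 16            # 16 numbers per latent vector

print(pixel_positions, latent_positions, latent_values)
```

Under these assumptions the sequence shrinks from 1,000,000 positions to 15,625, a 64-fold reduction, before any further patching the transformer itself might apply.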
Third, teams train the diffusion transformer over visual latent tokens and language tokens. He described this as similar to language model training except that the inputs and outputs are visual tokens and the model learns through denoising. Noise is added to visual tokens, and the model learns to remove it. At inference time, the model starts from noise and iteratively denoises toward a clean image or video.
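The noising and denoising loop can be illustrated with a toy example. It assumes the linear interpolation between data and noise that is common in flow-matching formulations; the article does not say which noise schedule any of these models use. The `oracle` function stands in for a trained network by returning the exact training target, so the sketch only shows the mechanics, not learning.

```python
import random
from typing import Callable, List

Vec = List[float]

def add_noise(x0: Vec, eps: Vec, t: float) -> Vec:
    """Blend clean latents with noise: t=0 is clean, t=1 is pure noise."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]

def training_target(x0: Vec, eps: Vec) -> Vec:
    """Velocity a network would be regressed onto under linear interpolation."""
    return [e - a for a, e in zip(x0, eps)]

def sample(velocity: Callable[[Vec, float], Vec], noise: Vec, steps: int) -> Vec:
    """Euler integration from t=1 (noise) down to t=0 (clean)."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
clean = [0.5, -1.0, 2.0]                        # toy "latent tokens"
noise = [rng.gauss(0.0, 1.0) for _ in clean]

def oracle(x: Vec, t: float) -> Vec:
    # Stand-in for a trained model: returns the exact target velocity.
    return training_target(clean, noise)

recovered = sample(oracle, noise, steps=8)
print([round(v, 6) for v in recovered])
```

A real model only approximates the velocity, which is one reason more steps generally give better results before distillation.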
Video adds temporal compression decisions. Wang asked whether video generators could use something like MP4 compression, where neighboring frames are stored as deltas rather than full frames. He said people have tried using MP4-like tokens, but the latent space is difficult for models to learn. VAEs are used because they create a more continuous and comprehensible latent space.
Within VAEs, there are choices. A model can compress frame by frame, or compress across time. Temporal compression exploits redundancy between adjacent frames. He cited Wan 2.1’s VAE as using an 8-by-8-by-4 compression rate: 8-by-8 spatially, with four frames compressed into one along the time axis. That saves context length. Frame-by-frame compression, such as 8-by-8-by-1, leaves the context length roughly four times larger.
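The effect on context length can be checked with a short calculation. The clip below (120 frames at 720 by 1280) is a hypothetical example, and the formula is a simplification that ignores details such as how a particular VAE treats the first frame or any extra patching in the transformer.

```python
import math

def video_latent_tokens(frames: int, height: int, width: int,
                        spatial: int = 8, temporal: int = 4) -> int:
    """Latent positions for a clip under a spatial x spatial x temporal rate."""
    latent_frames = math.ceil(frames / temporal)
    per_frame = math.ceil(height / spatial) * math.ceil(width / spatial)
    return latent_frames * per_frame

FRAMES, HEIGHT, WIDTH = 120, 720, 1280   # hypothetical 5-second clip at 24 fps

temporal_tokens = video_latent_tokens(FRAMES, HEIGHT, WIDTH, temporal=4)
framewise_tokens = video_latent_tokens(FRAMES, HEIGHT, WIDTH, temporal=1)

print(temporal_tokens, framewise_tokens, framewise_tokens / temporal_tokens)
```

For this clip the 8-by-8-by-4 setting gives 432,000 positions and the 8-by-8-by-1 setting gives 1,728,000, which matches the fourfold difference described above.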
The tradeoff is interactivity. Temporal compression introduces lag because the model emits or reasons over chunks of time. If the goal is a real-time interactive system, frame-by-frame streaming has advantages: the model can respond to user input immediately. That tradeoff later becomes central to He’s definition of a world model.
| Design choice | Advantage He described | Cost or limitation He described |
|---|---|---|
| Temporal VAE compression | Much higher compression by exploiting redundancy between frames | Introduces lag and makes immediate response harder |
| Frame-by-frame compression | Better suited to streaming and real-time interactivity | Context length becomes much larger |
| MP4-like tokens | Use existing video compression ideas | Latent space is hard for models to understand and train on |
| VAE latent tokens | More continuous latent space for generative training | Requires separate compression training and storage of features |
The VAE decision also affects the rest of the training system. Once teams compress video into continuous features, they often store those features rather than recomputing them constantly. That creates another storage burden, but it makes training runs practical. He’s broader point was that the tokenizer is not an implementation detail on the side; it determines the token budget, the latency profile, the data system, and the feasibility of long-horizon video.
The cost of video models is not just GPU hours
Video training costs are comparable to medium-scale language models, according to Ethan He, but the cost structure differs. Storage, network transfer, feature storage, and IO matter much more than many simple GPU-hour estimates capture.
He offered a back-of-the-envelope example. A billion videos at 5 megabytes each require about 5 petabytes just to store the raw videos. But teams often also store the VAE-compressed continuous features, which may be comparable in size to the videos themselves. That can push storage into the tens of petabytes.
During the discussion, Shawn Wang looked up cloud storage and transfer prices live and estimated 5 petabytes of S3 Standard storage at about $100,000 per month. He noted that tens of petabytes would scale that number up, and that network costs could be even more painful. Wang then estimated AWS egress for 5 petabytes at about $230,000, more expensive than storing the same amount for a month. He’s broader point was that storing and moving the data can reach millions per month before GPU training is counted.
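The arithmetic behind these estimates can be reproduced directly. The per-gigabyte rates below are not quoted prices; they are back-derived from the two totals mentioned in the discussion, and the doubling for stored VAE features assumes the features are the same size as the raw videos. Decimal units are used throughout.

```python
GB_PER_PB = 1_000_000  # decimal units

def dataset_petabytes(num_videos: int, mb_per_video: float) -> float:
    """Raw dataset size in petabytes."""
    return num_videos * mb_per_video / 1_000_000_000  # MB -> PB

def implied_rate_per_gb(total_dollars: float, petabytes: float) -> float:
    """Per-GB rate implied by a quoted total."""
    return total_dollars / (petabytes * GB_PER_PB)

def cost(petabytes: float, rate_per_gb: float) -> float:
    return petabytes * GB_PER_PB * rate_per_gb

raw_pb = dataset_petabytes(1_000_000_000, 5)       # a billion 5 MB videos
storage_rate = implied_rate_per_gb(100_000, 5)     # dollars per GB-month
egress_rate = implied_rate_per_gb(230_000, 5)      # dollars per GB moved

with_features_pb = raw_pb * 2                      # assumed: features as large as raw video
monthly_storage = cost(with_features_pb, storage_rate)
one_full_transfer = cost(with_features_pb, egress_rate)

print(f"raw data: {raw_pb:.0f} PB")
print(f"storage with features: ${monthly_storage:,.0f} per month")
print(f"moving everything once: ${one_full_transfer:,.0f}")
```

Even at 10 petabytes the combined storage and single-transfer figure is already in the high hundreds of thousands of dollars, so the jump to millions per month at tens of petabytes follows from the same rates.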
This makes video training more IO-bound than ordinary language-model training. Wang summarized the implication: data loading, caching, and transfer become central. He said Cosmos included substantial optimization to avoid being IO-bound.
The model sizes are also familiar to LLM practitioners. He cited open-source video models such as LTX at 19B dense parameters, and said people are exploring MoE versions with perhaps 20B active parameters and hundreds of billions total. Cosmos disclosed tens of trillions of visual tokens, he said. Put together, the parameter scale and token counts resemble medium-sized LLM training, but with different infrastructure and possibly lower efficiency.
The storage issue also changes what it means to build “compute.” Sapra asked whether, if companies build GPU data centers, they should also build storage infrastructure rather than relying on cloud storage and paying egress. He said that is a reasonable idea, but that it comes with its own challenges: people building GPU data centers may not plan for this much storage, while storage systems are often built elsewhere around CPUs. For video, the training cluster and the data system have to be designed together.
Inference has a different set of levers. Training cost is harder to reduce; inference can be reduced dramatically through step distillation. Flow matching models may require about 100 steps to generate a strong result, and diffusion models can require even more. Step distillation trains a student model to produce comparable outputs in fewer steps, learning from the teacher model rather than from the full internet distribution.
He’s intuition is that this works because the teacher’s output distribution is simpler than the original data distribution. The teacher has already compressed the complexity of internet images and videos into a fixed model. The student only needs to imitate that model. In production, He said, models usually run in a few steps. Cosmos had four-step and eight-step versions, and simpler image-to-image translation in Cosmos Transfer could run in one step.
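The inference saving from step distillation is roughly proportional to the step count. The sketch assumes each step costs one model call of equal cost, which ignores extras such as guidance passes, and it uses the article's figure of about 100 steps as the teacher baseline; that baseline is not stated to be the actual Cosmos teacher setting.

```python
def call_reduction(teacher_steps: int, student_steps: int) -> float:
    """Factor by which model calls per sample drop after distillation."""
    return teacher_steps / student_steps

TEACHER_STEPS = 100  # "about 100 steps" for a flow matching model

for student_steps in (8, 4, 1):
    factor = call_reduction(TEACHER_STEPS, student_steps)
    print(f"{student_steps}-step student: {factor:.1f}x fewer model calls")
```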
Wang brought up consistency models, and the screen showed an OpenAI research page titled “Simplifying, stabilizing, and scaling continuous-time consistency models.” The visible example compared a consistency model completing a butterfly image in two steps with a diffusion model taking 63 steps. He placed consistency models alongside GANs and distribution matching distillation. GANs, he said, were the original one-step generative approach: the generator produces an image and a discriminator judges whether it looks real. Distribution matching and GAN-style losses can be combined with consistency and distillation approaches to reach few-step generation.
Audio-video generation adds a harder alignment problem than text-video generation
Ethan He said he believed Grok Imagine 0.9 was xAI’s first large-scale deployed joint audio-video model. A displayed xAI page described the Grok Imagine API as “state-of-the-art video generation across quality, cost, and latency,” “a world-class video generation model,” “a breakthrough video editing model,” and “our most powerful video-audio generative model yet.” Another shown xAI post announced Imagine v0.9 with “native audio + video generation,” using a roaring dragon example with synced sound.
The hard part, in He’s account, is modality alignment. Text-to-video alignment is already loose. A caption can describe a clip overall without specifying what occurs at each timestamp. Audio-video alignment is more demanding because the two modalities must correspond in time. A sound effect, word, beat, or impact must occur when the relevant visual event occurs.
Audio itself is mixed in character. Speech has a discrete component: words, or text-like tokens with acoustic characteristics. Music is much more continuous and harder to model as language-like tokens. Most VLMs, He said, understand images and videos more than audio, and many LLMs are poor at detailed music recognition. They may identify a song or provide a general description, but they struggle to describe beat, tone, and fine musical details.
The synthetic-data challenge repeats, but in a harder form. For images, Cosmos asked annotators to describe a scene so a blind person could reconstruct it. For audio, He framed the analogous task as describing music and dialogue so a deaf person could reconstruct how it sounds without hearing it. That requires detailed captions of music, dialogue, and sound.
This matters because the model needs more than a label like “dramatic music” or “people talking.” If the audio is tied to visual action, the captioning system has to capture when the relevant sound happens, what it corresponds to visually, and how it should evolve. A sword hit, a footstep, a dragon roar, a line of dialogue, and a musical beat all impose timing requirements. The looser text-video captioning regime does not provide that granularity by default.
Temporal alignment also pushes models toward time awareness. He contrasted video/audio models with language models, which often do not have a direct sense of elapsed time. If asked how long a task will take, an LLM may answer based on internet text about human time estimates rather than its own operational time. Vibhu Sapra pushed back somewhat, arguing that this is grounded in the training corpus: humans have written about how long tasks take, and models inherit that distribution. He accepted that point but maintained the broader distinction: for audio-video generation, the model design must explicitly handle time-based correspondence between modalities.
A world model, for He, is real-time, interactive, and long-horizon video
Ethan He did not try to settle the general definition of “world model.” From a multimodal and video perspective, he defined it as “real-time, interactive, long-horizon videos.”
Each term carries a practical constraint.
Interactive means the user can act through keyboard, mouse, voice, or other modalities, and the model responds reasonably. The examples shown were Flipbook and NeuralOS. Flipbook appeared as a browser-like interface from flipbook.page where pages about “Architecture of the Great Pyramid of Giza” and “Engineering the Grand Gallery” were generated visually in response to interaction. He described it as a web browser where the UI is generated by a real-time image model. The pages do not exist; the model imagines them. Users can click into descriptions and receive newly generated subpages.
NeuralOS appeared as a simulated operating system from neural-os.com with icons for Trash, Firefox, Terminal, and Doom, plus instructions for mouse and keyboard interaction. Shawn Wang found it less impressive in one sense because it simulates an OS that already exists, but the generated Doom interaction was “shockingly fast.” He compared it with older first-person-shooter demos from image models that lacked consistency and speed. Sapra framed the demo as useful less because it was already better than a game engine and more because it gave a visual representation of the kind of future He was describing.
For He, these demos point toward generative UI. Instead of a language model writing code, compiling or rendering it, and then showing pixels, a future system could go directly from user intention to pixels. If a user wants email to behave like TikTok, with swiping left and right, the UI could be generated that way. If a user dislikes an Instagram-like button because they click it accidentally, the interface could be generated without it.
Wang summarized the architecture as a “diffusion frontend” with a deterministic backend. He agreed, while acknowledging cost. He estimated that if an H100 cost $1 per hour and a person used such a system eight hours a day for 30 days, the monthly compute cost would be $240, which users would not likely want to pay. But he expected falling compute costs, faster models, smaller models, and greater efficiency to make this plausible within a few years. Sapra separated pure compute-cost decline from the combined effect of cheaper compute, faster hardware, smarter models, and smaller models.
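He's estimate is simple to reproduce, and the same function shows how sensitive it is to the hourly price. The calculation assumes one dedicated GPU per user for the whole session; the cheaper rate in the second call is a hypothetical value, not a forecast from the discussion.

```python
def monthly_compute_cost(dollars_per_gpu_hour: float,
                         hours_per_day: float, days: int = 30) -> float:
    """Cost of keeping one GPU busy for a single user."""
    return dollars_per_gpu_hour * hours_per_day * days

baseline = monthly_compute_cost(1.0, 8)    # He's example: 240.0
cheaper = monthly_compute_cost(0.25, 8)    # hypothetical 4x cheaper compute: 60.0

print(baseline, cheaper)
```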
The bandwidth argument is also central. He said humans have maximum input bandwidth when looking at images and video, and maximum output bandwidth when talking. Before Neuralink-like interfaces, the highest-bandwidth AI interaction may be speaking to a model and receiving a generated visual interface in response. Sapra added that generative UI can still include text, so the interface could adapt to users who are more or less visually oriented.
Real-time means latency low enough for the interaction. A game-like model may need extreme responsiveness. He mentioned professional CS:GO players and sub-10-millisecond response, with 300 FPS implying roughly 3 milliseconds per frame. Most video models cannot do that. A digital human has a more forgiving latency budget, perhaps around 200 milliseconds for real-time voice interaction, but even that is hard if temporal VAE compression introduces lag. Avoiding temporal compression makes context length explode, so real-time world models also inherit long-context problems.
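These latency budgets can be made concrete. The per-frame figure follows directly from the frame rate. The chunk calculation is an illustration under assumed values (four frames per temporal chunk, 24 frames per second) of why temporal compression eats into a 200-millisecond budget: the model cannot respond to input at a finer grain than one chunk.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame at a given frame rate."""
    return 1000.0 / fps

def chunk_span_ms(frames_per_chunk: int, fps: float) -> float:
    """Wall-clock time covered by one temporally compressed chunk."""
    return frames_per_chunk * 1000.0 / fps

game_budget = frame_budget_ms(300)      # about 3.3 ms per frame
chunk_lag = chunk_span_ms(4, 24)        # assumed 4-frame chunks at 24 fps
voice_budget_ms = 200.0                 # He's rough figure for a digital human

print(f"{game_budget:.2f} ms per frame at 300 FPS")
print(f"{chunk_lag:.1f} ms per chunk, {voice_budget_ms - chunk_lag:.1f} ms left in the budget")
```

Under these assumptions the chunk alone spans about 167 milliseconds, leaving little room for model compute, audio, and network time.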
Long horizon means minutes or hours, not a few seconds. Most video models generate short clips. A world model needs to maintain continuity over extended interaction. That is why He described Grok Imagine’s video extension work as a first step toward world models.
Long-horizon video is a context-management problem
Creators have long tried to extend video generations by feeding the last frame of one clip as the first frame of the next. Ethan He called it a fun hack, but the quality degrades after repeated extension because the model only sees the last frame. It loses the broader temporal context. If two characters are speaking, their voices or identities may drift, especially when the conditioning covers only a short window.
He said he remembered Veo 3 as having about a one-second context of the previous video. In his account, that is better than a single last frame but still vulnerable to similar degradation when extended repeatedly toward a minute.
Grok Imagine’s video extension, as He described it, keeps historical context across previous generated videos: who is speaking, what objects have appeared, and what has happened. The naive implementation would put all prior video tokens into context, but video context explodes quickly. In Cosmos, he said five seconds of video could be 50,000 to 60,000 tokens. Fifty seconds could be 500,000 tokens. Longer horizons can reach millions of tokens.
| Video duration | Approximate visual tokens in He’s Cosmos example |
|---|---|
| 5 seconds | 50,000–60,000 tokens |
| 50 seconds | About 500,000 tokens |
| Longer horizons | Can reach millions of tokens |
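The figures in the table extrapolate linearly. The sketch below assumes token count grows in proportion to duration at the rate implied by the five-second Cosmos example, which is a simplification.

```python
LOW_TOKENS, HIGH_TOKENS = 50_000, 60_000   # five-second Cosmos example
CLIP_SECONDS = 5

def token_range(seconds: int) -> tuple:
    """Low and high token estimates for a clip of the given duration."""
    return (LOW_TOKENS * seconds // CLIP_SECONDS,
            HIGH_TOKENS * seconds // CLIP_SECONDS)

for label, seconds in [("50 seconds", 50), ("5 minutes", 300), ("1 hour", 3600)]:
    low, high = token_range(seconds)
    print(f"{label}: {low:,} to {high:,} tokens")
```

At this rate an hour of video would need tens of millions of tokens of context, which is why putting all prior video into context does not scale.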
Reference-to-video is one intermediate workaround. He said the feature allows users to upload up to seven reference images as conditioning for characters, objects, or scenes. A LinkedIn post shown on screen from He described the feature: “You can drop in up to 7 image references and turn them into a video. The images can be characters, objects, or anything!” The visible generated example included He, Tesla Optimus, Albert Einstein, and an anime-style character. A Grok search result shown on screen summarized the feature as accepting up to seven condition or reference images for characters, objects, scenes, styles, and similar inputs.
This is not full memory, and He accepted the characterization that it is partly a “cheat.” But it addresses a real redundancy: not every part of video history is needed at every moment. If a character appears in the first clip, disappears, and returns near the end, the model does not need all intervening tokens. It needs the relevant reference when the character returns.
He connected this to active research on context selection. A paper shown on screen from arXiv was titled “Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models.” Its visible figure described FramePack variants using time proximity and feature similarity. He described one heuristic: keep the most recent history at high fidelity, and compress older history into smaller representations, maintaining a fixed maximum sequence length. The farther a frame is from the present, the smaller it becomes.
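One way to satisfy that constraint is a geometric schedule, where each older frame gets a fixed fraction of the budget of the frame after it. The sketch below is an illustration of the heuristic as He described it, not the paper's exact scheme; the halving ratio and the 1,536-token full-fidelity budget are assumptions.

```python
from typing import List

def packed_budgets(num_frames: int, full_tokens: int, ratio: int = 2) -> List[int]:
    """Token budget per history frame, newest first.

    Each older frame gets 1/ratio of the previous budget (rounded down),
    so very old frames shrink to zero tokens and drop out of context.
    """
    budgets = []
    budget = full_tokens
    for _ in range(num_frames):
        budgets.append(budget)
        budget //= ratio
    return budgets

recent = packed_budgets(3, 1536)
long_history = packed_budgets(100, 1536)

print(recent)
print("total context for 100 frames:", sum(long_history))
```

Because the budgets form a geometric series, the total stays below twice the full-fidelity budget no matter how long the history grows.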
He thinks that should become more automatic. Rather than hand-designed heuristics, the model should know which historical parts to retrieve. He also argued that video research may be ahead of language models on this specific long-context issue. In LLM agents, context grows through tool calls, file reads, and intermediate results. Agent harnesses prune tool results or limit file displays to top lines, but those are still heuristic. He expects a breakthrough in continual learning or memory to involve models managing their own context.
The analogy to human attention was explicit. He said standard attention attends to all tokens unless sparse attention or another mechanism intervenes. Humans, by contrast, have a small attention span but can dynamically pull context from different places. Wang described this as pattern matching and feature detection; He agreed that humans are strong pattern matchers. The model equivalent would not simply be a larger context window. It would be a learned mechanism for deciding what context matters now.
That idea returned near the end of the discussion, when he described one reason he is moving toward language models. In current language-model products, context compaction may trigger automatically when the context window nears a threshold, but the model may not know it is happening. Some systems attach local time to user messages, giving the model time awareness through the harness. Tool-call results may be pruned, context may be added, and summaries may be compacted outside the model. He expects such heuristic harness engineering to be absorbed into the model itself.
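The harness heuristics He described are typically a few lines of code outside the model. The sketch below is a generic illustration, not any product's implementation: the function names, the 200-line cap, and the 80 percent threshold are all hypothetical.

```python
def truncate_tool_result(text: str, max_lines: int = 200) -> str:
    """Keep only the top lines of a tool result and note what was dropped."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    kept = lines[:max_lines]
    kept.append(f"[{len(lines) - max_lines} more lines omitted]")
    return "\n".join(kept)

def needs_compaction(used_tokens: int, window_tokens: int,
                     threshold: float = 0.8) -> bool:
    """Trigger summarization when the context window is nearly full."""
    return used_tokens >= threshold * window_tokens

big_file = "\n".join(f"line {i}" for i in range(1000))
shown = truncate_tool_result(big_file)

print(shown.splitlines()[-1])
print(needs_compaction(170_000, 200_000))
```

Both decisions are fixed rules applied from outside, which is the kind of logic He expects the model to eventually make for itself.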
He imagined a model that can see and modify the agent harness code governing its own operation. If the harness is short enough to fit in the system prompt, the model could decide how to spawn a future version of itself, how to read a long document, whether to process it in chunks, summarize, or inspect only the first 200 lines. In his framing, the model would program its own test-time behavior online.
xAI’s public story misses some of the engineering story
Vibhu Sapra repeatedly argued that xAI’s public communication did not convey the technical significance of Grok Imagine. The product pages shown emphasized broad claims: “A world-class video generation model,” “A breakthrough video editing model,” and benchmark tables comparing Grok Imagine with models such as Veo and Sora on text-to-video ranking and video editing quality. Sapra thought the more interesting details were not prominent: full-context extension, reference-to-video, and the path toward long-horizon world models.
Ethan He did not criticize the company directly. He said different labs have different communication styles. But the hosts’ point was that public product pages with generated cookies and benchmark tables can hide substantial underlying systems work. The screen showed an xAI page with “Performance and Benchmarks,” including “Artificial Analysis: Text-to-Video Rankings” and “Video Editing Benchmark,” while Sapra argued that the product story did not explain enough of what the system could do or how much work sat underneath.
He described xAI’s culture in three short principles: move fast and build, no goal is too ambitious, and first-principles thinking. The stated goals often initially felt impossible to him, such as building a model in three months. But the team would reason backward from physical and operational constraints: how fast videos can be acquired, how fast models can be trained end to end, how additional GPUs change the timeline, how long human data takes to arrive. That, he said, is first-principles thinking applied to model development.
He also said Elon Musk was hands-on, as people imagine online, and that working at xAI gave employees more chances to interact with him. The discussion briefly showed Musk quoting He’s post that “Grok has the best voice mode” with “True,” a post that the screen showed with 50 million views. He did not work on the voice-mode team, but Shawn Wang described Grok voice as strong, especially interruption and high-speed speech in a Tesla driving context.
On safety, He said Grok Imagine used watermarks in countries that require generative AI videos to be watermarked, and that takedowns happened extremely fast. On watermarking more generally, he said detection will get harder. SynthID was initially Google’s, and now other labs are adopting it, but the paper and technology are public enough that people can reverse engineer ways to remove it. Wang noted that people online have analyzed the pattern Google applies.
As generated media improves, He said he now often judges whether a video is real by whether it makes logical sense. Visual artifacts like extra fingers are less reliable tells than they were. Sapra said he currently looks at audio flaws, especially mismatched or stylistically similar generated audio, but treated that as a temporary imperfection. Ethan He linked the trend to GAN-style training: if the model is trained against a discriminator whose task is to judge whether an image looks real, the model will become better at defeating visual detection.
The agentic path changes what a video model must do
Ethan He’s video-agent argument is not simply that an LLM can wrap a video generator. It changes the allocation of responsibility across the system. The generator does not have to solve every production requirement directly. The agent can decide when to call a generative model, when to use deterministic editing tools, when to rewrite a prompt, and when to inspect or revise the result.
That matters for tasks where diffusion is weak or imprecise. He gave the example of adding a block of text at an exact timestamp. A video model might not follow that instruction precisely. A deterministic tool can. An agent can route the task to the tool that is best suited to it, instead of asking a single generative model to absorb every capability.
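A minimal sketch of that routing idea, under stated assumptions: the tool names, the `Task` shape, and the default of 24 frames per second are all hypothetical, and `generate_clip` is a stub rather than a call to any real video model. The deterministic overlay converts a timestamp into an exact frame index, which is the kind of precision a generative model cannot be relied on to deliver.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Task:
    kind: str     # which tool the planning model picked
    params: dict  # arguments for that tool


def generate_clip(params: dict) -> dict:
    # Stub for a generative video call; its output is not frame-exact.
    return {"tool": "generate", "prompt": params["prompt"]}


def overlay_text(params: dict) -> dict:
    # Deterministic edit: the timestamp maps to one exact frame index.
    fps = params.get("fps", 24)  # assumed default frame rate
    start_frame = round(params["at_seconds"] * fps)
    return {"tool": "overlay_text", "text": params["text"], "start_frame": start_frame}


TOOLS: Dict[str, Callable[[dict], dict]] = {
    "generate": generate_clip,
    "overlay_text": overlay_text,
}


def route(task: Task) -> dict:
    # Dispatch the task to the tool registered for its kind.
    tool = TOOLS.get(task.kind)
    if tool is None:
        raise ValueError(f"no tool registered for task kind {task.kind!r}")
    return tool(task.params)
```

For example, `route(Task("overlay_text", {"text": "SALE", "at_seconds": 3.5}))` places the text at frame 84 under the assumed 24 fps, with no generation step involved.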
He also argued that AI models may prompt AI models better than most humans do. Most users are not expert prompt engineers. A language model can learn the quirks of different generation models and write prompts in the form each model follows best. If systems are jointly trained, that model-to-model prompting could improve further.
Sapra’s concern remained: if the alpha comes from the language model, perhaps video progress becomes mostly a matter of waiting for better reasoning models and building better harnesses. Ethan He agreed that language is becoming the main source of gains, but he did not reduce the system to a trivial wrapper. He described a future in which language models use generative models, editing tools, and traditional media tools as a set of instruments, iteratively creating higher-quality video the way human creators do.
The comparison to coding assistants was explicit. Early AI coding looked like assisted completion. Later systems such as Cursor-like agents became more autonomous, taking larger tasks and operating over files and tools. He sees Grok Imagine Agent Mode as an early version of the same transition for video. Users can still enter and edit the workflow manually, but as the model improves, more of the process can become automated.
The economics follow the capability. A single video generation can be relatively bounded. An agentic workflow may generate many clips, evaluate them, stitch them, regenerate failures, add deterministic edits, and run more tools. That makes it more expensive. He’s prediction was that enterprises will pay when the output crosses the threshold into production-grade usefulness, especially for ads and distributable media.
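The cost difference is easy to see in a sketch of the loop. The code below is an illustration under assumptions, not a description of Grok Imagine: `generate` and `score` are caller-supplied stand-ins for a video model and an evaluator, and the threshold and retry limit are arbitrary. It counts generation calls, since that is the unit that turns one bounded request into many.

```python
from typing import Callable, List, Tuple


def run_shot(
    generate: Callable[[str, int], str],
    score: Callable[[str], float],
    prompt: str,
    threshold: float,
    max_attempts: int,
) -> Tuple[str, int]:
    # Regenerate until the evaluator accepts a clip or attempts run out.
    best_clip, best_score = "", float("-inf")
    for attempt in range(1, max_attempts + 1):
        clip = generate(prompt, attempt)
        clip_score = score(clip)
        if clip_score > best_score:
            best_clip, best_score = clip, clip_score
        if clip_score >= threshold:
            return best_clip, attempt
    return best_clip, max_attempts  # fall back to the best rejected clip


def run_workflow(
    shots: List[str],
    generate: Callable[[str, int], str],
    score: Callable[[str], float],
    threshold: float = 0.8,
    max_attempts: int = 3,
) -> Tuple[List[str], int]:
    # Stitch one accepted clip per shot and total the generation calls.
    timeline: List[str] = []
    calls = 0
    for prompt in shots:
        clip, used = run_shot(generate, score, prompt, threshold, max_attempts)
        timeline.append(clip)
        calls += used
    return timeline, calls
```

In the worst case the workflow makes `len(shots) * max_attempts` generation calls, plus one evaluation per call, where a single-shot request makes one.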
Robotics may arrive through video-first world models rather than direct embodiment
Shawn Wang noted that many world-model researchers frame robotics as the end game: real-time, interactive models matter because robots must act in the physical world. Ethan He’s emphasis was different. He is interested in video generation and world models for the interface and media implications, but he agreed robotics will be a major part of the story.
His prediction is that physical AI might be solved without needing to begin in the real world. A language model with very strong video capability could learn real-time, interactive, long-horizon dynamics first through screen recordings and computer interfaces. Once such a model can use computers and predict future computer states well, robots could become another tool for the AI to use. Embodiment might emerge as an extension of a general world model controlling tools, rather than as a separate robotics-only path.
That view is consistent with his broader claim: video, language, agents, and tool use converge. A system that can plan, retrieve context, generate pixels, evaluate outputs, and operate tools in software may eventually control physical embodiments as another tool domain.
Leaving xAI follows the same trajectory: the center of leverage has shifted
Ethan He said he left xAI because there is research he wants to do that is difficult inside a company, and because company priorities and objectives can change quickly. The research he wants to pursue is now more on the language-model side.
That may sound like a sharp pivot from video and computer vision, but He framed it as the continuation of his path. Ten years earlier he worked with ResNet authors Xiangyu Zhang and Jian Sun on computer vision tasks such as image recognition, object detection, object tracking, and neural network compression. He applied to top PhD programs with several first-author papers but was rejected by all of them, which led him into industry. At FAIR, under Yann LeCun, he moved into self-supervised learning.
At NVIDIA, he worked on Cosmos and on scaling, including video diffusion models and MoEs. He said Megatron MoE was the first open-source framework able to train very large MoEs, from hundreds of billions to trillions of parameters, efficiently at around 40% MFU. Moving to xAI was another scaling move, toward even larger compute.
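For readers unfamiliar with the metric, MFU (model FLOPs utilization) is the fraction of the hardware's peak throughput that a training run actually spends on model computation. The sketch below uses the common approximation of about 6 FLOPs per parameter per training token, which ignores attention FLOPs. For an MoE, only the parameters active for each token are counted. The source does not say how the 40% figure was measured, and the numbers in the example are round illustrative values, not real hardware specifications.

```python
def mfu(
    active_params: float,
    tokens_per_second: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    # Approximate training cost: 6 FLOPs per active parameter per token
    # (forward plus backward pass), ignoring attention.
    achieved_flops_per_second = 6.0 * active_params * tokens_per_second
    peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Illustrative numbers only: 1e9 active parameters, 1e6 tokens/s,
# 10 accelerators at a made-up peak of 1.5e15 FLOP/s each.
example = mfu(1e9, 1e6, 10, 1.5e15)
```

With those made-up inputs the run achieves 6e15 FLOP/s against a peak of 1.5e16 FLOP/s, an MFU of 0.4.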
Looking back, he said machine learning careers may be more transferable than people assume. Researchers often think they must stay within “computer vision” or “language,” but the core principles of training large models are largely similar. He worked on both language-model MoEs and video models at NVIDIA. Now, because he believes the bottleneck for video models is increasingly language, agents, and context management, LLM work does not feel like a huge jump.
Shawn Wang suggested that stepping away from the big-lab path (train larger models, get more compute, train better models) is a career risk. Ethan He did not dispute that directly. His answer was biographical: his career has already involved several large transitions, and the current transition follows where he sees the bottleneck.
The language-model research direction he described is self-managed context. Today, much of that management sits in the harness. The product or agent framework decides when to compact, summarize, prune, attach timestamps, or hide tool output. He wants models to become aware of those operations and eventually control them. The same problem that appears in long-horizon video — what history matters, what can be compressed, what should be retrieved — appears in LLM agents as tool histories and task contexts grow.
In that sense, his move from video to language is less a departure than a bet on a shared underlying problem. The model must know what it is doing over time, what context it has, what context it lacks, and how to reorganize its own work.




