Language Models Are Becoming the Bottleneck in Video Generation
Ethan He, who worked on NVIDIA’s Cosmos world model and xAI’s Grok Imagine, argues that the next major gains in video generation will come less from diffusion models themselves than from the language models, agents, and context management built around them. In an interview with swyx and Vibhu Sapra, He describes Grok Imagine as a quickly built example of that shift: diffusion renders the pixels, while language systems increasingly rewrite prompts, plan clips, call tools, manage memory, and turn short generations into longer, editable video.
Latent Space·Jun 1, 2026·28 min read