Ethan He

AI engineer at xAI working on Grok Imagine, video generation, and world models; previously a senior deep learning algorithm engineer at NVIDIA focused on large-scale training frameworks, multimodal models, and mixture-of-experts, and a coauthor of NVIDIA Cosmos research.

Language Models Are Becoming the Bottleneck in Video Generation

Ethan He, who worked on NVIDIA’s Cosmos world model and xAI’s Grok Imagine, argues that the next major gains in video generation will come less from diffusion models alone than from the language models, agents, and context management around them. In an interview with swyx and Vibhu Sapra, He describes Grok Imagine as a quickly built example of that shift: diffusion renders pixels, while language systems increasingly rewrite prompts, plan clips, call tools, manage memory, and turn short generations into longer, editable video.

Latent Space · Jun 1, 2026 · 28 min read