Luma Is Rebuilding Video AI Around a Unified Multimodal Transformer
In a Stanford CS153 guest lecture, Luma AI co-founder and chief executive Amit Jain argues that generative video is only a staging point toward "unified intelligence": models that understand and generate across text, images, video, audio, code, and tools in a single work loop. Jain traces Luma's path from Apple-era LiDAR and 3D capture to internet-scale video, saying the company followed the data but now regards prettier clips as insufficient on their own. The destination, he says, is a multimodal AI factory for professional creative and physical work, in which human skills, tool use, feedback, and unified transformer architectures combine to produce full campaigns, schematics, productions, and eventually robotics workflows.
Stanford Online · May 7, 2026 · 19 min read