Luma Is Rebuilding Video AI Around a Unified Multimodal Transformer
In a Stanford CS153 guest lecture, Luma AI co-founder and chief executive Amit Jain argues that generative video is only a staging point toward “unified intelligence”: models that understand and generate across text, images, video, audio, code, and tools in a single work loop. Jain traces Luma’s path from Apple-era LiDAR and 3D capture to internet-scale video, saying the company followed the data but now sees pretty clips alone as insufficient. The destination, he says, is a multimodal AI factory for professional creative and physical work, where human skills, tool use, feedback, and unified transformer architectures produce full campaigns, schematics, productions, and eventually robotics workflows.

Unified models are Luma’s answer to the limits of video generation
Amit Jain framed Luma’s current work as a move beyond image and video generation toward “unified intelligence”: systems that can reason across text, images, video, audio, code, and tools, then produce end-to-end work rather than isolated assets.
The problem, as he described it, is that creative and physical work depends on information that does not live cleanly in text. Language models are producing substantial value in adjacent tasks such as coding and system design, but creative work also involves visual information, auditory information, temporal continuity, physical causality, and the trace of how a final output was reached. Robotics has a related problem: text models, vision-language models, and vision-language-action models may be useful starting points, but Jain argued they will not generalize unless they become more end-to-end systems that can jointly handle perception, language, control, and feedback.
That is why Luma’s target has shifted from generating short clips to building systems that can complete larger creative or professional tasks. Customers now ask, he said, why a model should only make a five-second video rather than a full shot; why it should not make an entire advertising campaign; or, in robotics, why it cannot produce an action, evaluate whether it was correct, and reason about force and outcome.
So now the Luma Factory is about building systems that can do end-to-end work in multimodal domains.
The distinction matters because Jain does not treat “beautiful pixels” as the same thing as intelligence. He compared non-text generative models to language models: words can be arranged into a meaningless poem or a mathematical proof; pixels can be arranged into a pretty image or into a visual explanation, schematic, slide, or production plan. In his account, the intelligence is not in the medium. It is in whether the system can organize the medium to carry useful information.
That claim showed up in the slide-generation workflow he described. Jain said the presentation slides he used had been created with Luma’s Uni 1 system. He gave it an existing “frontier AI factory” slide as a style reference, added his own rough thoughts in a mind-map-like workspace, and then asked the system to produce slides in that style. According to Jain, the system produced the slides essentially in one shot, though he deleted one output he did not like.
For Jain, this was not just a demo of slide generation. It was an example of the broader thesis: if language is the convenient output, a model should output language; if slides, images, or video are the right medium, it should output those. The goal is not a better image model standing next to a language model. It is a model that can express intelligence through whichever medium the task requires.
The company began with 3D, then followed the data to video
Amit Jain traced Luma’s origin to his time at Apple, where he first worked on LiDAR systems that later shipped on iPhones and had originally been built for Titan, Apple’s canceled car project. After Titan, the work moved into Vision Pro, which included multiple LiDARs. Jain said this period made it clear to him that future computers would need different interfaces, different media, and different ways of capturing and creating content.
In 2020, before language-model scaling was widely understood to be working and before DALL-E, he and others at Apple began exploring generative models. NeRF had already appeared, and Jain connected that work in differentiable 3D with the possibility of language scaling. His question was what would happen if these systems could be combined: if observations of the world could be learned differentiably, then they could be understood and eventually generated.
The instructor asked him to define “learn the world in a differentiable manner.” Jain’s answer was straightforward: differentiable means a system can be placed in a training loop, with a loss function that can be iteratively optimized through gradient descent. If a function is not differentiable, “deep learning doesn’t work” in the current compute-and-gradient-descent paradigm. Differentiability, for Jain, was the property that made a world representation trainable at scale.
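To make that definition concrete, here is a minimal PyTorch sketch of the loop Jain described: a toy differentiable “renderer” stands in for a real world model, and gradient descent fits its parameters to observations. The quadratic stand-in and all names are illustrative only, not Luma’s code.

```python
import torch

# Toy "world representation": three parameters we can optimize because
# every operation from input to loss is differentiable.
params = torch.randn(3, requires_grad=True)

def render(x):
    # Stand-in for a differentiable forward model (e.g., a NeRF-style renderer).
    return params[0] * x**2 + params[1] * x + params[2]

xs = torch.linspace(-1, 1, 64)
target = 2 * xs**2 - xs + 0.5            # Observations to explain.

opt = torch.optim.SGD([params], lr=0.1)
for step in range(500):
    loss = ((render(xs) - target) ** 2).mean()  # Scalar loss.
    opt.zero_grad()
    loss.backward()        # Possible only because render() is differentiable.
    opt.step()

print(params.detach())     # Should approach (2, -1, 0.5).
```

If render() contained a non-differentiable step, backward() could not propagate gradients through it, and this loop, the whole of deep learning in Jain’s framing, would stall.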
Luma’s first plan followed that logic into 3D capture. The company intended to collect a very large amount of 3D data, build a flywheel around user capture, and use that to train world-simulation systems. Luma released its 3D Capture app, which Jain said was popular because it productionized NeRFs and Gaussian splats. Matthew Tancik, whose NeRF work Jain had referenced earlier, later joined Luma to push that frontier forward.
But the plan ran into a scale problem. It did not matter how many people used a company’s capture app, Jain said; the data would never reach the level needed to learn enough about the world. Internet-scale text, photos, and video had already accumulated over decades. A single company could not distribute a capture product at comparable scale.
You have to design the algorithms around where the data is, not the other way around.
That lesson drove Luma’s 2023 pivot toward generative video. Jain said the company concluded that whatever theoretical advantages one modality might have, the “physics of scale” would dominate. Video has two spatial dimensions and one time dimension, and he argued that humans learn 3D structure through time, which makes video a workable proxy for 3D data. After Nvidia’s Hopper architecture was announced, Luma began to believe video could be learned at sufficient scale. In 2023, Jiaming Song, then at Nvidia and a Stanford graduate, joined the company along with others from Stanford and Berkeley to build the infrastructure for video.
In June 2024, Luma released Dream Machine, its first video model. Jain said it reached 6 million users in its first three to four weeks. He attributed the demand partly to the fact that OpenAI’s Sora had been announced but not released, so many users had not yet experienced generative video directly.
That was not the final stop. Jain said Luma had a similar realization in early 2025: video alone was not enough. Video could represent motion and appearance, but it did not pair those with human logic: why an event matters, what a sequence of events means, or what an action should lead to. Using language models merely as embedding components was not sufficient. That is where Luma’s current emphasis on unified intelligence began.
The frontier factory is a feedback system, not just a training run
When the instructor asked how Luma bootstrapped its video flywheel after Dream Machine, Amit Jain described the core problem as moving from the broad distribution learned in pretraining to the narrow and uneven distribution humans actually find useful.
A pretrained model may be able to produce many things. Usefulness, in Jain’s account, lies in “pockets of greatness” inside that distribution: outputs aligned with human aesthetics, use cases, and value systems. The practical question is how to elicit and reinforce those pockets.
Dream Machine’s early user base created the first signals. Luma treated liked and downloaded videos as preference data, because those actions suggested users preferred those outputs. The signal was noisy. Some people downloaded bad videos to show how bad AI video could be, and Jain said the model learned from that too. Luma then had to build systems with paid human reviewers to filter and label outputs.
From that experience, the shape of a frontier lab became clearer. A frontier lab is not only data, compute, and algorithms. It also includes skills, trainers, tutors, labelers, human annotation pipelines, and products designed to emit useful feedback. The product must produce information that helps the next model become better than the previous one, which improves the experience, attracts more users, and generates more preference data.
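A small sketch of how such noisy signals might be distilled into training data, under an invented event schema (nothing here is Luma’s actual pipeline): reviewer labels, where they exist, override the raw like-and-download proxy.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    prompt_id: str
    output_id: str
    liked: bool
    downloaded: bool
    reviewer_score: float | None = None   # Paid-reviewer label, when available.

def preference_pairs(events: list[Interaction]) -> list[tuple[str, str]]:
    """Group outputs by prompt and emit (chosen, rejected) pairs."""
    by_prompt: dict[str, list[Interaction]] = {}
    for e in events:
        by_prompt.setdefault(e.prompt_id, []).append(e)

    pairs = []
    for outputs in by_prompt.values():
        def score(e: Interaction) -> float:
            if e.reviewer_score is not None:
                return e.reviewer_score    # Trusted human label wins.
            # Noisy proxy: a download is weak evidence (it may be mockery).
            return float(e.liked) + 0.5 * float(e.downloaded)

        ranked = sorted(outputs, key=score, reverse=True)
        if len(ranked) >= 2 and score(ranked[0]) > score(ranked[-1]):
            pairs.append((ranked[0].output_id, ranked[-1].output_id))
    return pairs
```

Pairs like these are the standard raw material for preference optimization; the interesting engineering, on Jain’s account, is in the human review layer that cleans the signal.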
The frontier-factory diagram shown during this part of the lecture described the loop as “Pre-train -> Mid-train -> Post-train -> Rinse & Repeat.” It placed data, compute, and algorithms at the base; foundation models and mid-trained models in the middle; post-training through SFT and reinforcement learning above that; and agents producing “agentic traces” that feed the cycle again.
| Factory layer | What Jain said Luma is doing |
|---|---|
| Pretraining | Learning jointly from large-scale video, images, text, and other multimodal data. |
| Post-training | Using customer data, user preference data, and human annotations. |
| Deployment | Putting systems into production for creative and professional workflows. |
| Reinforcement and continual learning | Learning from interaction traces and execution feedback while respecting deployment constraints. |
Jain described Luma’s current agent system as producing a large amount of feedback from user interactions: whether users like or dislike an output, how they like or dislike it, whether the model’s chain of thought and chain of work are useful, and which elements fail.
He also gave a sense of the scale. Luma’s final trainable data comes to roughly 30 petabytes, he said. The company currently trains on H100s and expects to move to GB300 GPUs soon, at roughly the “10k scale.” Jain characterized this as similar to a second-tier language-model training effort, while noting that Luma is not yet at trillion-parameter scale because that scaling has not been figured out for its current approach.
The instructor then pushed on deployment constraints for large studios. If a studio uses Luma on sensitive production material, it may be willing to train with its data for its own project but not have that material appear in another studio’s model loop. Jain said Luma works with both Netflix and Amazon Prime Studio, which he called “arch nemeses,” and therefore has to guarantee no data overlap. He mentioned standard internal controls such as SOC 2 and additional AI-lab-specific controls governing what does and does not enter training.
His example was a blockbuster production: if something is marked so it should not be trained on, Jain said it will not appear in training data or related loops. But Luma can still learn from product traces — the interaction data generated as users work with the agents — separate from the visual artifacts themselves. The instructor clarified that this meant interaction data from people working in the agent interface, and Jain agreed.
Why Luma is betting on one backbone instead of many towers
The technical center of Amit Jain’s argument was a contrast between fused, multi-tower systems and unified architectures.
He described Luma’s systems as of 2025 as “disparate towers”: a language tower, an image tower, a video tower, and an audio tower, later joined through fusion techniques. He compared that pattern to systems like Stable Diffusion, where a relatively small language component provides embeddings for visual generation. That structure, he argued, was not enough for professional creative work.
The instructor asked why an ordinary LLM could not produce the kind of slide output Jain had shown. Jain’s answer was that an LLM does not generate images and does not see in the way required for visual reasoning. It can be asked to use a computer to make images, but he said this falls apart because the model is effectively blind: it sees everything as a sequence, without direct access to the grid nature of images and visual information. Vision-language models add image understanding, but Jain said they are still not generative. Image generators can produce beautiful images, but they lack language-model-like understanding.
In language, the situation is different: an LLM understands and generates text in one system. There is not a separate understanding model and a separate generation model connected by a bridge. For world models, Jain said, that gap has to close.
The instructor raised models that generate both language and image tokens, including Nano Banana. Jain said that from what Luma knows of Google’s architecture, Nano Banana is still a fused system: a large diffusion tower generates images, a large language tower generates text, and a thin bridge connects them. He described the process as generating an “enhanced prompt” in text, then handing that to an image model whose encoder is narrow by comparison.
Luma’s approach, by contrast, is one transformer backbone, with information from text, image, audio, video, and code encoded into the same space, communicating through self-attention. The unified-transformer slide shown on screen represented text, image, audio, video, and code flowing into one tower labeled “Unified Transformer,” with the line: “All modalities in one tower — communicating through self-attention.”
Jain called transformers the single architecture he would bring back in time if he could, because they do not fundamentally care what kind of information passes through them; the hard parts are the encoders and decoders before and after the backbone. He compared the design loosely to the human brain: different areas may process visual or auditory information, but reasoning and judgment converge in one place.
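The contrast reads compactly in code. Below is an illustrative PyTorch sketch, not Luma’s architecture: small per-modality encoders project text tokens, image patches, and audio frames into one shared width, and a single transformer attends across the concatenated sequence.

```python
import torch
import torch.nn as nn

D = 512  # Shared embedding width for every modality.

text_emb  = nn.Embedding(32_000, D)   # Text tokens -> shared space.
image_enc = nn.Linear(768, D)         # Image patch features -> shared space.
audio_enc = nn.Linear(128, D)         # Audio frames -> shared space.

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=6,
)

text  = text_emb(torch.randint(0, 32_000, (1, 16)))  # 16 text tokens.
image = image_enc(torch.randn(1, 64, 768))           # 64 image patches.
audio = audio_enc(torch.randn(1, 32, 128))           # 32 audio frames.

# One sequence, one tower: self-attention lets any token attend to any
# other token, regardless of the modality it came from.
tokens = torch.cat([text, image, audio], dim=1)      # Shape (1, 112, 512).
hidden = backbone(tokens)
print(hidden.shape)
```

In a fused system, by contrast, the image pathway sees only a compressed text embedding across a narrow bridge; here every image token can attend directly to every text token, which is the property the “one tower” slide emphasizes.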
| Approach | Jain’s critique or rationale |
|---|---|
| Separate models | Specialized models pass outputs to one another, often with an orchestrator or judge on top. |
| Fused towers | Language and diffusion systems are connected by a bridge, but understanding and generation remain split. |
| Unified model | One backbone reasons over modalities in a shared representation, with context flowing through self-attention. |
He said it took Luma about a year and many failed attempts to reach an architecture it is comfortable scaling. The company now believes it can build hundreds-of-billions-of-parameter models with this unified architecture, including language, and expects it to scale.
Jain also argued that diffusion models are not the final architecture. In the Q&A, when asked about GANs, he said Luma still uses GAN techniques, especially in distillation and potentially real-time systems, but GANs are finicky and have not shown transformer-like scaling. Then he added that diffusion models are also “on the way out,” because their scaling behavior is not bearing out. Luma, he said, is moving toward hybrid autoregressive and diffusion regimes in its unified models.
End-to-end work requires loops, skills, tools, and memory
Amit Jain used the REPL loop — read, eval, print — as the operating metaphor for deploying these models. If a model is meant to do work rather than produce a single text or image output, it needs to read context, evaluate before the next step, print or generate output, and continue iterating. The REPL framework slide called this “the fundamental operating loop.”
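A minimal sketch of that operating loop, with placeholder scaffolding (the state dict, the dummy model) standing in for whatever Luma actually runs:

```python
def agent_repl(task, model, max_steps=10):
    """Read context, evaluate, generate, and iterate until the task is done."""
    state = {"task": task, "history": []}
    for _ in range(max_steps):
        context = build_context(state)     # READ: task plus prior work.
        step = model(context)              # EVAL: decide and produce the next step.
        state["history"].append(step)      # PRINT: emit output back into the loop.
        if step.get("done"):               # LOOP: otherwise go around again.
            break
    return state["history"]

def build_context(state):
    # Hypothetical: a real system would assemble multimodal context,
    # skills, and tool results here, not a bare dict.
    return {"task": state["task"], "history": state["history"]}

# Dummy model that "finishes" after two steps, to show the loop's shape.
calls = {"n": 0}
def dummy_model(context):
    calls["n"] += 1
    return {"output": f"step {calls['n']}", "done": calls["n"] >= 2}

print(agent_repl("draft a slide deck", dummy_model))
```

The point of the loop is that output is not terminal: each generation becomes context for the next evaluation, which is what separates doing work from emitting a single asset.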
In Luma’s design, that loop sits inside a broader stack: skills, tool harness, and unified models. The “Flywheel” slide shown during the lecture put those three layers in a vertical stack, with human instruction, visual examples, and actions feeding skills; traces from execution improving the next generation of models; and “more + higher value tasks” driving continual learning.
The skills layer contains domain-specific understanding. Jain gave the example of teaching a robot how to assemble an iPhone: that knowledge does not necessarily need to be built into the base model or tool layer. It can be provided as context or skill information. The tool harness provides the ability to use Linux, call APIs, run code, deploy code, or interact with other systems. The unified model sits underneath, orchestrating the work: interpreting multimodal information, deciding which skills to use, generating tool calls, and producing outputs.
| Layer | Role in Jain’s architecture |
|---|---|
| Skills | Domain-specific knowledge and taste supplied by humans or experts. |
| Tool harness | Access to systems such as APIs, Linux, coding tools, execution, and deployment. |
| Unified model | The reasoning and orchestration layer that interprets context, selects skills, calls tools, and generates outputs. |
The instructor asked Jain to map that architecture to the slide-generation demo. The skill layer, Jain said, was a roughly 50-page internal document written by someone on Luma’s team who was very good at slide design — a “best in class slide creation skill.” That document taught the system what good slide design meant. The model layer generated and orchestrated the work. The tool layer was relatively light in that specific example, though Jain said the system likely ran OCR on the slide reference. If he had asked for an interactive webpage animating the material, he said the agent would have called coding tools, run code, and deployed it.
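The layering can be sketched as three small pieces of code, with invented names standing in for Luma’s actual skill documents and tool harness:

```python
# Skill layer: domain expertise travels as context, not as model weights.
SKILLS = {
    "slide_design": (
        "Stand-in for the ~50-page expert document: rules for hierarchy, "
        "contrast, pacing, and visual rhythm on slides."
    ),
}

# Tool harness: side effects the model can request but not perform itself.
TOOLS = {
    "run_ocr":  lambda path: f"<text extracted from {path}>",
    "run_code": lambda src: f"<result of executing {len(src)} chars of code>",
}

def run_task(model, task, skill_name, max_steps=8):
    """Unified model orchestrates: reads the skill, calls tools, produces output."""
    context = {"task": task, "skill": SKILLS[skill_name], "tool_results": []}
    for _ in range(max_steps):
        action = model(context)
        if action["type"] == "tool_call":
            result = TOOLS[action["name"]](action["arg"])
            context["tool_results"].append(result)   # Feed results back in.
        else:
            return action["content"]                  # Final deliverable.
    raise RuntimeError("step budget exhausted")
```

In the slide demo, on this reading, the skill entry was the 50-page design document, run_ocr was the only tool invoked, and a webpage request would simply have routed more of the loop through run_code.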
That mapping is central to Jain’s view of where human expertise belongs. In a later Q&A about human creativity, he rejected the premise that the model itself should be labeled creative or not creative. “Whether it’s creative or not creative is for humans to judge,” he said. The human role, practically, is in the “fat skills area”: teaching the model what good looks like, repeatedly and with domain knowledge.
He compared this to programming leverage. Programmers have long written something once and had it run many times. Artists and other creatives, in his view, historically produced one thing and then had only that thing. With the skills architecture, a creative can teach a model once and have that taste, judgment, and process reused across many contexts.
This is actually an explosion of creative potential that just never has been, right? So the skills and human creativity are much more important.
Jain said this will “weed out” mediocre people and elevate people who are very good, because their work can be rerun at enormous scale. The claim is not that the model replaces judgment; it is that judgment becomes more leveraged.
The business case is professional work language models do not yet serve
The instructor asked why Luma’s work is so capital-intensive if it is not operating at the same scale as the largest language-model efforts. Amit Jain corrected the funding figure first: Luma had raised $1.5 billion total, he said, not $1 billion. Then he argued that doing unified models correctly is ultimately larger-scale than language, because it is a superset of language work. Luma can operate at a smaller scale today, he said, because it is not trying to be best at everything language models already do, such as coding. It can focus on domains where language models are weak.
His examples were professional and multimodal. One customer in the energy industry, which he did not name, had grid diagrams, grid code, and planning needs. Jain said Luma ingested that material, and its systems became better at producing schematics and planning than Anthropic’s coding models in that context, because those coding models could not read and reason over the visual layout in the same way. For studios, the equivalent point is that a story is not only text; it also includes physical action, scene dynamics, visual continuity, and world behavior.
Jain said Luma now works with some of the largest studios, along with Publicis and Coke. He described Coke as moving $3 billion of annual content production to Luma. He also said a new Prime Video show called “Old Stories,” about Moses and starring Sir Ben Kingsley, is a $4.5 million-per-episode production “pretty much all produced using Luma agents.” Jain stressed that it is “a proper production,” not simply “an AI video.”
He put the market in terms of labor as well. Jain estimated there are about 120 million professional creatives in the world, two to three times his estimate of the number of coders. Their daily work, he said, involves replicating the physics of the real world into computers. Luma’s goal is to build systems for that work.
The instructor observed that at a recent Luma event in San Francisco, the audience was not primarily machine-learning people but artists, Hollywood figures, designers, and creators — including people who had previously opposed the technology. Jain’s explanation was that earlier technology was not good enough. Creatives saw systems using their data and producing poor outputs, so the value proposition was not visible. Now, he said, Luma can sit with a company, open a board, and generate useful production material in front of them.
His example was Savvy Games, which he said produces Monopoly Go and some of the most played games in the world. In a meeting where the benefits were initially hypothetical — scale up production, reduce cost — Jain said he took some of their assets and produced a campaign on the order of 500 pieces while they were sitting there. Seeing that kind of output, he argued, changes minds.
For creatives themselves, Jain said the benefit is exploration. In current production systems, execution is expensive, so teams spend a large amount of effort validating an idea before committing resources. With AI systems that reduce execution cost, the equation changes: execute many ideas, then see which one is strong. Jain described great creators across domains — Einstein, Archimedes, Mozart — as prolific rather than one-shot. His point was that creative greatness often comes from producing a large body of work, not from perfectly predicting the best idea before execution.
In his view, the industrial creative system constrains artists by measuring each action against immediate output. Luma’s tools make them feel less constrained because they can explore more ideas with less slog.
For Sora, Jain pointed to focus rather than market size
During Q&A, Amit Jain was asked for his hypothesis about why Sora “shut down” and what it meant for Luma, the industry, and creatives. He emphasized that he did not know what was happening inside OpenAI and could only offer a hypothesis.
His answer was focus. OpenAI, he said, is fundamentally a large-language-model lab that is very good at chat. Chat, in his description, is a vertical with about 8 billion potential customers, because nearly all of humanity may want to talk to computers. Executing at that scale is hard enough by itself. Trying to do “literally everything,” he argued, is not good for OpenAI’s business.
Jain said Luma had learned a similar lesson in its early days, when it explored too many paths before becoming clear about execution. He also said Apple taught him that large companies choose not to do far more things than they choose to do, because money and headcount do not eliminate organizational physics. A company has only so much attention.
He challenged the idea that OpenAI had been the largest player in video generation, saying Google is doubling down on video, images, and visual generation through Gemini. The Sora decision, in his view, did not indicate that the market was small. It indicated, as a hypothesis, that OpenAI was being forced to focus while Google, Anthropic, and others pressed hard in markets OpenAI cared about.
For Luma, Jain said, that would be good news because it validates the thesis that a company can only do so many things at once. Luma has chosen visual and multimodal work because Jain believes it is a large market with many people who treat it as their profession.
Copyright and Hollywood are production-model questions
Asked about copyright in a world where anyone can generate video about anything, Amit Jain argued that copyright and capability are orthogonal. A skilled person can already make Mickey Mouse in Photoshop, he said, but does not have the legal right to do so. Generative AI may make infringement easier, but in Jain’s view it does not change copyright “in any way, shape or form” on the output side.
He also said platform responsibility is similar to Photoshop’s: it is not Photoshop’s responsibility to prevent someone from making Mickey Mouse; it is the user’s responsibility not to violate the law. If Luma receives a DMCA notice for hosted content, Jain said it will take the material down. But if someone used Luma to create infringing work and Luma receives a complaint, he said it is not Luma’s responsibility to point law enforcement to the user unless the law requires it.
Jain treated Hollywood as a separate but related production question. He said Hollywood is “default dead” for reasons that have little to do with AI. Its business model, he argued, has been deteriorating for 30 years; Covid accelerated the problem, and the writers’ strike was “the nail in the coffin.” He said production has moved to places with tax incentives — Greece, Canada, Ireland, and elsewhere — and summarized the shift this way: Hollywood finances movies but does not make them.
His criticism was aimed at what he called a private-equity mindset: taking a successful franchise and extending it into sequels, crossovers, and rent-seeking around existing assets. He contrasted that with Netflix, which he said produces 800 productions a year, compared with far fewer from large studios. Those productions may have smaller budgets, but Jain argued they allow more kinds of stories to be told and appeal to broader audiences.
AI, in this account, is not what broke Hollywood. It is a chance to change the production model: reduce the cost and time sinks, try more ideas, and make work that audiences want to watch. The instructor connected this to a broader pattern: capital markets seek predictability, and that can stagnate innovation.
The remaining gap is intelligence
Near the end, Amit Jain was asked what separates current world models or video models from being as generally useful as language models. His one-word answer was “intelligence.”
Current image and video models, he said, are “really, really stupid” in a non-derogatory sense. He defined that by analogy to working with a person who does not seem intelligent: they forget what was said, require the same instruction repeatedly, misread literal words without understanding context, and can do small things but fail when asked for more. That, in Jain’s view, describes today’s image and video models. They can generate beautiful pixels but lack memory, multi-turn interaction, introspection, physics understanding, and coherent grasp of what they are making.
He compared the missing step to the difference between LLMs as research projects and generally useful chat systems. RLHF enabled chat and multi-turn interaction. Without multi-turn, he said, ChatGPT would be intolerable: users would have to repeat the whole context each time. Image and video models need the equivalent ability to remember, revise, and reason over prior outputs.
Jain said unified models are designed to solve that gap. The target is not stock footage. It is end-to-end work: educational explanations, visual planning, professional schematics, full creative campaigns, production workflows, and eventually robotics. His history-class example made the ambition concrete: instead of teaching history only as text, a system could show events unfolding and explore alternatives — what if the Rubicon had not been crossed, Caesar had not been murdered, or Archduke Franz Ferdinand had not been shot. The point was temporal understanding, causal coherence, and the ability to reason visually across time.