Computing Is Shifting From Prerecorded Execution to Continuous Generation

Jensen Huang · Stanford Online · Wednesday, May 13, 2026 · 19 min read

In a Stanford CS153 Frontier Systems lecture, NVIDIA chief executive Jensen Huang argues that AI is forcing the first fundamental reinvention of computing in decades, moving the industry from prerecorded, on-demand execution to continuous real-time generation. Huang says that shift requires rebuilding the full stack — chips, compilers, networks, storage, systems and institutions — around new bottlenecks, with NVIDIA’s co-design approach producing gains that conventional Moore’s Law scaling cannot match.

Computing is moving from prerecorded execution to continuous generation

Jensen Huang framed the current AI transition as the first deep reinvention of computing in roughly 64 years. The reference point was the IBM System/360: for decades, he said, the industry changed form factors and distribution models — PCs, internet, mobile, cloud — without changing the basic mental model of what a computer is, how software is written, how it is run, and what applications are assumed to do.

That model was largely “pre-recorded.” The content was pre-recorded: images, video, documents. The software was pre-recorded: compiled binaries expressing instructions ahead of time. The user invoked computation on demand, and the machine executed.

AI changes that model because computing is now moving toward real-time generation. Huang emphasized not only the familiar ability to generate text or images, but the broader possibility that generated outputs can be contextually relevant, contextually consistent, and responsive to intention rather than only to explicit instruction. The computer is no longer merely retrieving and executing what has already been prepared.

The more important implication is that the whole stack has to be reconsidered. If software is not only compiled code but neural networks running and generating tokens, then the development process changes, the role of software engineers changes, company organization changes, computer architecture changes, networking and storage change, cloud services change, and the applications worth building change.

Huang treated GPT as the inflection point not because it made image generation, summarization, or translation impressive, but because he saw it as evidence that AI could “think.” His framing was specific: if a model can generate tokens externally for users, it can also generate tokens internally as thoughts. Tool use, in that framing, is external token generation applied to action. Once that was clear, he said, the emergence of agentic systems was “fairly easy to predict,” even if the engineering required to get there was substantial.

The next shift is from on-demand computing to continuous computing. Cloud computing inherited the time-sharing idea that users consume compute when they ask for it. Agentic systems are different: computers are now “continuously running.” They reason, call tools, and perform work over time. That pushes the industry toward what Huang described as “generative computing in a continuous way,” rather than retrieval-based computing initiated per use.
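
The contrast can be sketched schematically. Below is a toy Python illustration of the two models Huang describes: a prerecorded routine invoked per request, versus a loop that keeps reasoning and acting over time. Both functions are illustrative stand-ins, not a real runtime.

```python
# Toy contrast between on-demand execution and continuous generation.
# Neither function models a real system; they illustrate the two shapes.

def on_demand(request: str) -> str:
    # Classic model: the user invokes computation, a prerecorded routine
    # executes once, and the machine is idle until the next request.
    return f"retrieved and executed a prepared answer for {request!r}"

def continuous_agent(goal: str, steps: int = 3) -> list[str]:
    # Agentic model: the system keeps running, generating "internal tokens"
    # (reasoning) and "external tokens" (actions, tool calls) over time.
    log = []
    for t in range(steps):
        log.append(f"step {t}: reason about {goal!r}")         # internal tokens
        log.append(f"step {t}: call a tool, observe result")   # external tokens
    return log

print(on_demand("weather in Palo Alto"))
print("\n".join(continuous_agent("file the quarterly report")))
```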

Everything about computer science has changed. And everything about every field of science has changed because of the things that we’ve changed.

Jensen Huang

That claim was not presented as a slogan about AI adoption. Huang’s argument was architectural: once computing becomes generated rather than prerecorded, and once systems become continuously active rather than merely on demand, every layer of computing has to be rebuilt around different bottlenecks.

Co-design is NVIDIA’s answer to the end of easy scaling

Jensen Huang’s answer to why co-design matters started with a contrast between abstraction and integration. In the old model, microprocessor designers, compiler writers, language designers, systems engineers, and application developers could largely operate as separate fields. That abstraction was productive, but for Huang it is no longer sufficient for the hardest computational problems.

He used Stanford’s own RISC heritage as an example. John Hennessy’s work mattered, Huang said, because it treated compiler design and microprocessor architecture harmoniously. A processor that is individually elegant but hard to compile for can lose to a simpler instruction set that exposes the right structure to the compiler. The point was not nostalgia; it was the principle that optimizing subsystems independently can produce a worse system.

NVIDIA’s version of that principle is “extreme co-design.” Huang said the company co-designs chips, CPUs, GPUs, networking, switches, compilers, frameworks, storage, and systems together. The reason is that the most important workloads — computer graphics earlier, and now molecular dynamics, quantum chemistry, fluid dynamics, multiphysics simulation, and deep learning — are too computationally intense to be treated as ordinary general-purpose computing problems.

The claim Huang made for the result was stark. Moore’s Law in its strong form, he said, offered roughly 2x every 18 months, or about 10x every five years and 100x every decade. With Dennard scaling exhausted and Moore’s Law slowing, ordinary microprocessor scaling over the past decade might have delivered much less, perhaps closer to 10x in practice. NVIDIA’s co-design approach, he said, delivered between 100,000x and one million times improvement over 10 years.
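
The compounding arithmetic behind those figures is easy to verify, with no assumption beyond the stated 18-month doubling:

```python
# Verify Huang's framing: 2x every 18 months compounds to roughly 10x per
# five years and roughly 100x per decade.
def moores_law_gain(months: float, doubling_months: float = 18) -> float:
    return 2 ** (months / doubling_months)

print(f"{moores_law_gain(60):.1f}x over 5 years")    # ~10.1x
print(f"{moores_law_gain(120):.1f}x over 10 years")  # ~101.6x
```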

1,000,000x
Huang’s upper-end estimate of NVIDIA’s co-design-driven computing speedup over 10 years

The consequence of that scale-up, in Huang’s account, was not merely faster hardware. It changed what AI researchers could attempt. If computation becomes a million times faster, the question of carefully curating small datasets gives way to a different idea: take the internet, take the world’s data, and train at unprecedented scale. Huang compared it to transportation at near-light speed: if travel time collapses, assumptions about where people live and how society is organized change.

This is why he presented co-design as more than technical optimization. It is the mechanism by which a computationally impossible class of problems becomes thinkable, then economically plausible, then widely deployed.

The right system metric is not FLOPs utilization

A central disagreement emerged around compute scarcity and utilization. The Stanford host raised open questions about scaling, scarce compute, and a reported xAI Memphis cluster memo claiming 11% MFU. Jensen Huang rejected the metric itself.

MFU is just simply wrong.

Jensen Huang

MFU, or model FLOPs utilization, measures the fraction of a system’s peak floating-point throughput that a workload actually achieves. Huang’s objection was that large-scale AI systems are constrained by many resources at once: FLOPs, memory bandwidth, memory capacity, network capacity, and more. At any given moment, some part of the system is the bottleneck. If a system is properly over-provisioned to avoid one bottleneck dominating the whole workload, then some resources will look idle.
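
For concreteness, here is a minimal MFU calculation using the common 6 × parameters × tokens approximation for training FLOPs. The model size, throughput, and peak-FLOPs figures are illustrative assumptions, not measurements from the cluster in the memo.

```python
# Minimal MFU sketch: achieved training FLOPs over peak available FLOPs.
def mfu(params: float, tokens_per_second: float, peak_flops: float) -> float:
    achieved = 6 * params * tokens_per_second  # standard training-FLOPs estimate
    return achieved / peak_flops

# Hypothetical numbers: a 70B-parameter model, 1M tokens/s across the cluster,
# 1,000 accelerators at ~1e15 dense FLOP/s each.
print(f"MFU = {mfu(70e9, 1e6, 1000 * 1e15):.1%}")  # 42.0%
```

Huang’s point is that this single ratio says nothing about whether memory bandwidth, capacity, or the network was the binding constraint at any moment.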

Huang said he would personally prefer low MFU if it meant the system was over-provisioned intelligently for the work. The host pushed back: if a cluster is provisioned for peak workloads, expensive FLOPs may sit idle during base loads. Huang’s response was that peak moments matter. If a workload needs a large burst and the FLOPs are not available during that window, the short period becomes a long period. In his phrase, “flops are cheap” — not because H100 systems are cheap, but because the value of those systems is not captured by FLOPs alone. Hopper’s value, he said, comes from bandwidth, architecture, and the surrounding system.

The car analogy made his objection more concrete. Asking only about FLOPs is like asking only about horsepower. It may have once been a crude proxy for capability, but it is not how a sophisticated buyer should evaluate a vehicle. The right measure, Huang said, is performance — but performance against a real evaluation, not a contrived internal metric that teams can improve without making the system more useful.

When the host suggested “intelligence per watt,” Huang accepted the direction and focused on tokens per watt. For decoding large language models, he said, the decisive factor is not peak FLOPs but aggregate bandwidth across the NVLink 72 system. Decode is mostly memory-bandwidth constrained, not FLOPs constrained. A system can therefore deliver very high tokens per watt while showing low MFU.

That matters because inference is not a monolithic operation. Huang distinguished prefill — context and attention processing — from decode, the repeated generation of tokens. In disaggregated inference, those stages can be separated. The system that maximizes useful token generation may look inefficient through the lens of FLOPs utilization.
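
A roofline-style sketch shows why decode can score poorly on MFU while saturating the resource that actually matters. For batch size one, every generated token streams the full model weights through memory, so bandwidth, not FLOPs, sets the ceiling. The hardware figures below are assumptions for illustration, not NVIDIA specifications.

```python
# Upper bound on decode throughput when each token requires one full pass
# over the weights: tokens/s <= bandwidth / model_bytes.
def decode_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    return bandwidth_bytes_per_s / model_bytes

model_bytes = 70e9 * 2       # 70B parameters stored in FP16 (assumed)
single_gpu_bw = 3.35e12      # ~3.35 TB/s HBM, H100-class (assumed)
rack_bw = 72 * 8e12          # 72 GPUs at a hypothetical 8 TB/s each

print(f"single GPU: {decode_tokens_per_second(model_bytes, single_gpu_bw):,.0f} tok/s")
print(f"72-GPU rack: {decode_tokens_per_second(model_bytes, rack_bw):,.0f} tok/s")
```

Larger batches raise arithmetic intensity and shift the bound, but the aggregate-bandwidth logic behind NVLink 72 is the same.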

The host then pressed a harder point: not all tokens are equally valuable. Coding tokens may be worth more than some other generated tokens, and different customers optimize for different kinds of intelligence. Huang’s answer was that evaluation design becomes the core problem. NVIDIA cannot optimize only for one domain, because a system overfit to a single workload may be excellent but commercially too narrow to fund the required R&D. But if the platform is good at everything, it becomes merely general purpose and “good at nothing.” Finding the balance between broad applicability and domain-specific excellence, Huang said, is “artistry.”

Each NVIDIA generation targets a different AI workload

Jensen Huang described NVIDIA’s recent and future systems as responses to successive AI workload regimes: pre-training, inference, agents, and then swarms of agents.

Hopper was built for pre-training. At the time, NVIDIA reasoned that pre-training would require systems larger than the largest scientific supercomputers. Huang said the largest supercomputer in the world was around $350 million, while NVIDIA was designing for multi-billion-dollar systems — effectively a market with “precisely zero customers” at the time. The company built for that future anyway.

Grace Blackwell NVLink 72 was designed around the next bottleneck: inference, especially decode. Huang said the goal of AI is not training but inference. Generating tokens requires memory bandwidth beyond what a single chip can provide, so NVIDIA built a rack-scale computer by ganging together 72 GPUs with new switching, interconnects, and SerDes. He described the speedup over the previous generation as 50x in two years, compared with roughly 2x from Moore’s Law over the same period.

Generation | Primary workload Huang emphasized | Key architectural idea
--- | --- | ---
Hopper | Pre-training | Build for training systems larger than prior scientific supercomputers
Grace Blackwell NVLink 72 | Inference and decode | Aggregate bandwidth across 72 GPUs in a rack-scale computer
Vera Rubin | Agents | Low-latency CPUs, fabric-connected storage, and tool-use-oriented system design
Feynman | Swarms of agents | Systems for agents, sub-agents, and sub-agents of sub-agents
Huang described NVIDIA’s roadmap as a sequence of workload-specific architectural bets.

Vera Rubin, the next step, is designed for agents. The premise is that the purpose of AI is not merely to think, but to do work. Agent systems need long-term memory, working memory, tool use, and fast interaction between storage, CPUs, and GPUs.

That changes CPU requirements. Cloud CPUs with hundreds of cores are built for a different pattern. In an agentic system, Huang said, a multi-billion-dollar GPU supercomputer may send an instruction to a CPU-based tool and then wait for the result. In that situation, single-thread latency matters intensely. He said NVIDIA designed Vera’s CPU for highly performant single-threaded code because an agentic GPU system cannot afford to wait on slow tool execution.
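
A back-of-envelope latency model captures the argument. If each agent step serializes GPU token generation with a CPU tool call, the tool’s single-thread latency directly determines how long the expensive GPU system sits idle. The timings are invented for illustration.

```python
# Toy model of one agent step: generate a plan on the GPU, then block on a
# CPU-based tool before the next step can begin.
def step_seconds(gpu_decode_s: float, cpu_tool_s: float) -> float:
    return gpu_decode_s + cpu_tool_s  # fully serialized, worst case

gpu_decode_s = 0.5  # time to generate the tokens that plan one tool call
for cpu_tool_s in (0.05, 0.5, 2.0):
    total = step_seconds(gpu_decode_s, cpu_tool_s)
    print(f"tool latency {cpu_tool_s:>4}s -> step {total:.2f}s, "
          f"GPU idle {cpu_tool_s / total:.0%}")
```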

Storage also changes. Long-term memory may live in storage, but Huang argued that the system cannot afford to copy that data through conventional network-storage paths. The storage needs to connect directly into the processor fabric.

Feynman, the generation after Vera Rubin, was described more speculatively. Huang said agents today may be understood as tomorrow’s software modules. Future systems will include agents with sub-agents, sub-agents with their own sub-agents, and ultimately swarms. Feynman is likely about the computer architecture required for that swarm-like software structure.

The pattern across these generations is the same co-design logic: infer the emerging compute pattern, identify the bottleneck one generation ahead, and build a system for it before the market is fully visible.

Open models are a strategy for domains the frontier labs will not cover alone

Jensen Huang separated two positions that are often conflated: using closed frontier products and building open models. NVIDIA, he said, uses Anthropic and OpenAI tokens heavily. All NVIDIA engineers are now “agentically supported,” and Huang recommended using frontier products such as Claude and Claude Code because they work well and improve quickly. He did not suggest that downloading an open model from GitHub would generally provide the same product experience.

NVIDIA’s commitment to open models comes from a different argument. Language models are one expression of AI’s broader function: learning the representation, meaning, and structure of information. If a model learns the structure of language, it can manipulate and generate language. But biological systems, chemicals, proteins, genes, physics, robotics, and climate systems also have structure. Their representations differ from language, their dimensionality differs, and their training strategies differ because there is no internet-scale corpus of text-equivalent data for each domain.

NVIDIA’s open model work is meant to create first artifacts in domains where individual scientific communities may lack the scale or infrastructure to build foundation models themselves. Huang named Nemotron for language, BioNeMo for biology, Alpamayo for autonomous vehicles and navigation, GR00T for humanoid robotics, and climate-science models for mesoscale multiphysics.

The goal is activation of downstream ecosystems. Huang said NVIDIA’s work in these areas activated healthcare and life sciences, and that the company is working with self-driving-car and robotics companies across those ecosystems. Without a foundation model and associated data and training approach, he argued, many domain specialists would be unable to start.

Nemotron has a second purpose: language coverage. Huang argued that many societies speak languages too small to become top priorities for closed frontier labs. Major languages will be handled; many others may not. He used Swedish as an example of a language likely to be understood by frontier systems but not necessarily treated as a top priority, and noted that India has many dialects beyond those likely to receive intense commercial attention. NVIDIA’s intent with Nemotron, he said, is to provide a near-frontier base that others can fine-tune for languages they care about.
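
As a sketch of what that fine-tuning path could look like for a language community, here is a minimal LoRA recipe using the open-source transformers, peft, and datasets libraries. The checkpoint name, corpus file, and hyperparameters are placeholders, not an NVIDIA-published recipe.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "org/near-frontier-base"  # hypothetical open checkpoint name

tok = AutoTokenizer.from_pretrained(BASE)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # many causal LMs ship without a pad token

model = AutoModelForCausalLM.from_pretrained(BASE)
# Train only small low-rank adapters, keeping the base weights frozen:
# the affordable path for a community without frontier-scale budgets.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         task_type="CAUSAL_LM"))

ds = load_dataset("text", data_files={"train": "my_language_corpus.txt"})

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=1024)

train = ds["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tuned-model",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```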

The third purpose is fusing language with domain models. Huang described Alpamayo as a language model fused with a world model for self-driving. It detects cars, roads, and relevant physical structure, but the language component provides human priors and reasoning ability. Huang said this reduces the amount of experience needed to produce a safe driving system. He claimed Alpamayo is probably one of the most effective self-driving-car systems in the world despite having only a few million miles of experience, not billions.

For safety, Huang argues that capable systems must be inspectable

Jensen Huang’s defense of open models also rested on safety and security. His position was direct: if AI is to be safe and secure, it has to be open. A black box, he said, cannot be defended against, secured, or safely embedded into critical systems when its capabilities are opaque.

He acknowledged that opacity can be addressed in partial ways. A system can be required to explain its plan before acting or reason step by step before doing anything. But Huang noted that a system could lie. Transparency gives researchers and defenders a better chance to interrogate behavior rather than merely trust reported reasoning.

His cybersecurity example was a swarm defense model. If future threats are highly agentic, Huang argued that defense should not become a sequential arms race in which a defender is vulnerable until it produces a model one version better than the attacker’s. Instead, he described an approach in which vast numbers of cheap, specialized AIs surround and detect threats, like a dome. Nemotron-Nano, he said, is being used for cybersecurity because it is fast and cost-effective enough for firms to train and deploy in very large numbers.
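
The dome pattern is architectural rather than tied to any one model. Here is a toy sketch of the fan-out, where each detector is a trivial stand-in for a small specialized model such as Nemotron-Nano.

```python
# Toy "dome": many cheap, narrow detectors examine the same event in
# parallel, and any positive vote flags it for escalation.
from concurrent.futures import ThreadPoolExecutor

def make_detector(signature: str):
    # Each detector specializes in one narrow threat signature; a real
    # deployment would use a small fine-tuned model instead of a substring.
    return lambda event: signature in event["payload"]

detectors = [make_detector(s) for s in ("exfil", "priv-esc", "c2-beacon")]

def dome_scan(event: dict) -> bool:
    with ThreadPoolExecutor() as pool:
        return any(pool.map(lambda detect: detect(event), detectors))

print(dome_scan({"payload": "outbound c2-beacon to 203.0.113.7"}))  # True
```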

The argument was not that openness eliminates risk. It was that black-box capability is the wrong substrate for defending systems, and that cheap, inspectable, deployable models are part of the defense architecture.

Energy demand is the bottleneck NVIDIA can only partly control

Jensen Huang treated energy as an unavoidable constraint but split the problem into what NVIDIA can control and what the broader system must build.

The controllable part is efficiency. He returned to tokens per watt and said NVIDIA had improved it by 50x, then added that the company would have to keep improving it by significant factors. Earlier, he had described Grace Blackwell NVLink 72 as 50x faster than the previous generation in two years, largely because of architectural insight around inference and decode. In the energy discussion, his point was less a narrow benchmark claim than a general operating principle: NVIDIA’s first lever is to keep compounding efficiency through co-design and architecture.
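
Tokens per watt is itself a simple ratio, which makes the compounding claim easy to express. The baseline numbers below are illustrative; only the 50x factor comes from Huang’s remarks.

```python
# Tokens per watt (equivalently, tokens per joule) as a plain ratio.
def tokens_per_watt(tokens_per_second: float, power_watts: float) -> float:
    return tokens_per_second / power_watts

baseline = tokens_per_watt(10_000, 100_000)  # hypothetical prior-gen rack
improved = baseline * 50                     # Huang's claimed generational gain
print(f"baseline: {baseline:.2f} tok/J, after 50x: {improved:.1f} tok/J")
```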

The uncontrollable part is the energy system itself. Huang said he has spent years trying to educate people about the amount of compute likely to arrive. His estimate was that computing may need roughly a thousand times more energy than it uses today, and he said he would not be surprised if that estimate were off by a couple of orders of magnitude. The reasoning follows from the earlier computing-model shift: future computing is generated rather than prerecorded, and continuous rather than initiated only per use.

~1,000x
Huang’s rough estimate for how much more energy compute may need

That demand, in Huang’s view, makes the present an unusually strong market moment for sustainable energy and grid upgrades. In the past, he said, solar farms and nuclear plants often required government subsidies because market demand was insufficient. Now, he argued, the market will pay for new energy. He called this “the best time ever in the history of humanity” to invest in sustainable energy, upgrade the grid, and add sustainable sources of all kinds.

The claim was not that AI’s energy use is small. Huang’s argument was the opposite: demand will be enormous, and therefore the market signal for new energy infrastructure is unusually powerful.

Compute scarcity at universities is a budgeting and organization problem

When the discussion turned to whether American researchers, startups, and universities should get priority access to scarce compute before chips are sold abroad, Jensen Huang agreed in principle — “absolutely” — but rejected the premise that orders are being denied because chips are unavailable. If Stanford’s president placed an order, Huang said, NVIDIA would deliver it.

His diagnosis was that the university system is not organized to buy and operate massive-scale compute. Research departments raise separate grants and maintain separate budgets. No individual grant is large enough to fund the kind of shared AI supercomputer that researchers might only need intermittently, but need intensely when they do. The world moved away from centralized computing environments toward laptops and fragmented access, while the AI era moved back toward very large shared systems.

The host asked whose fault that was. Huang answered: Stanford’s. He framed that not as blame for its own sake, but as empowerment. If the problem is Stanford’s, Stanford can solve it. The university, he said, needs to change budgeting, aggregate demand, and build a campus-wide shared AI supercomputer, or contract for one as a cloud service.

He was explicit about scale. Stanford needs on the order of a billion dollars of compute, not a small departmental cluster. When the host mentioned Stanford’s roughly $40 billion endowment, Huang said he would cut a billion dollars from it immediately, give it to someone as a cloud service, and ensure every student and researcher had access to AI supercomputers.

His grocery-store analogy was aimed at procurement timing. If someone wants a billion dollars of tomatoes, they cannot show up at the store and accuse the grocer of withholding tomatoes when the inventory is not immediately available. The same applies to billion-dollar compute: universities need to plan demand and say what they will need next year.

The broader point was that compute scarcity is not only a chip-supply issue. In research institutions, it is also a governance, funding, and aggregation problem.

Huang rejects the bomb analogy for AI chips

The sharpest policy answer came in response to a question about adversarial countries getting access to NVIDIA chips. Jensen Huang objected first to the analogy between GPUs and atomic bombs. NVIDIA makes GPUs, he said, and GPUs are used for video games, medical imaging, logistics, and many ordinary computational tasks. He said NVIDIA is present in every medical imaging system in the world. A billion people have NVIDIA GPUs; he advocates GPUs to students, family, and people he loves. He does not advocate atomic bombs to anyone.

Starting from the bomb analogy, he argued, makes coherent policy reasoning impossible.

He also rejected the idea that American companies should avoid foreign markets because they will lose anyway. Competition, in his view, serves markets and strengthens companies. He said he does not subscribe to a “we are going to lose anyways” posture: if competitors want NVIDIA to lose, they will have to “deal it” to the company.

Huang then made a broader industrial-policy argument. NVIDIA, he said, is a general-purpose computing company. Depriving large parts of the world of that capability so that one or two companies benefit makes no sense to him. He warned that, in his view, if American policy were to concede “two-thirds of the world” to other companies, students graduating into the technology industry may enter a weakened version of what should be one of America’s strongest industries.

His historical analogy was telecommunications. Huang argued that similar policy reasoning helped push fundamental telecommunications technology out of America, leaving the country without that core capability. He presented himself as fighting to prevent the same outcome in AI computing.

He also criticized what he sees as speculative “singularity” rhetoric: the idea that AI will arrive suddenly, become infinitely powerful in a flash, and end society at an unknown hour. Huang called that irresponsible science fiction when presented as public reasoning. He insisted it is not true that researchers have no idea how these systems work, not true that the technology will become infinitely powerful in a nanosecond, and not true that there is no way to defend against it.

Everybody should have AI, nobody should have nuclear bombs.

Jensen Huang

That was the distinction he wanted students to hold: AI chips are general-purpose computing infrastructure, not weapons analogous to nuclear bombs.

Strategy starts with observation, first principles, and optionality

Asked how he forecasts under uncertainty, Jensen Huang described a method built from observation, first-principles reasoning, and backward planning. The first question is what he is observing. The second is whether it matters. The third is what follows if the observation continues or generalizes.

He used AlexNet as the example. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton produced a neural-network model that dramatically surpassed decades of computer vision work. Huang’s question was: is that a big deal? His answer was yes, because the step change in quality and performance was too large to treat as incremental. The next questions were how far the method could go, what else it could solve, what it implied for computing, and how neural-network processing differs from traditional floating-point and integer computation.

From that reasoning, Huang said, he builds a mental model of the future: how large models might become, what computers will need to look like, where self-driving cars and robotics might fit, and what role NVIDIA should play. He does not expect to be completely right. He distinguishes between things that will likely happen, things that will absolutely happen, and things that may happen, then moves in the general direction while feeling the way forward.

The business skill is managing opportunity cost. Strategy consumes time, energy, and money that cannot be spent elsewhere. Huang said the task is to reduce opportunity cost while increasing optionality — to make the journey pay for itself.

That principle also shaped his answer about mistakes. He did not count NVIDIA’s first failed graphics architecture as a pure mistake, even though he said the company made almost every technical choice wrong: curved surfaces instead of triangles, no Z-buffer instead of a Z-buffer, forward texture mapping instead of inverse texture mapping, and no floating point inside. Those choices nearly destroyed the company, but the recovery taught him strategy, competition, resource conservation, and maneuvering.

The mistake he did identify was mobile. NVIDIA shifted resources into mobile devices after being approached by important companies in the space. The business grew to about a billion dollars, but during the 3G-to-4G transition NVIDIA was locked out by Qualcomm’s modem leadership. In hindsight, Huang said, he should have recognized that NVIDIA might have an interesting opportunity for a couple of years but would likely be shut out afterward.

Even there, the loss created assets. The low-power and energy-efficiency expertise from mobile became useful for robotics. Huang described Thor as a descendant of the chips NVIDIA built for mobile devices. But he was clear that this was rationalization: entering mobile was, in his view, a strategic mistake.

AI-native education still needs first principles

Jensen Huang’s answer on education belonged to the same larger argument about a changed computing model. Textbooks take years to write and revise; AI-mediated knowledge changes in real time. AI should therefore be part of the curriculum in two senses: students should learn AI, and they should use AI to learn everything else.

Huang described his own learning process as AI-native. He has AI read papers, then asks it to read related papers, summarize, and answer basic questions. In his description, once the AI has absorbed enough context from the paper and the related literature, he can interact with it like a dedicated researcher rather than merely use it as a compression tool.
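
That loop is easy to express in outline. The sketch below uses a placeholder ask function rather than any particular vendor’s API, since Huang did not describe a specific tool.

```python
# Outline of an AI-native reading loop: load a paper, pull in related work,
# summarize, then question the accumulated context like a dedicated researcher.
def ask(context: list[str], question: str) -> str:
    # Placeholder: in practice, send the context and question to an LLM here.
    return f"[model answer to: {question}]"

def study(paper: str, related: list[str], questions: list[str]):
    context = [paper, *related]
    summary = ask(context, "Summarize the core claims, methods, and limits.")
    answers = {q: ask(context + [summary], q) for q in questions}
    return summary, answers

summary, answers = study("main paper text", ["related paper text"],
                         ["What assumption does the key result depend on?"])
print(summary)
```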

At the same time, he defended first principles. Mead and Conway, semiconductor-design foundations, and older architecture knowledge remain valuable even if the scaling assumptions behind parts of the historical semiconductor methodology have been exhausted. It is still useful to know where the field came from.

Huang connected this to his own Stanford experience while working at AMD. He was learning first-principles methods in school while designing microprocessors in practice. The two views reinforced each other. Students can now get a similar duality by pairing first-principles education with AI tools that expose contemporary practice and real-world context.
