Frontier AI Has Become a Gigawatt-Scale Industrial Infrastructure Race

Sachin Katti · Stanford Online · Wednesday, May 27, 2026 · 20 min read

In a Stanford MS&E seminar on the economics of the AI supercycle, OpenAI infrastructure executive Sachin Katti argued that frontier AI has become an industrial systems problem, not a GPU procurement problem. Katti said usable compute now depends on synchronizing chips, memory, networking, power, cooling, buildings, land, suppliers, and operators at gigawatt scale. His broader case was that OpenAI’s model and revenue ambitions depend on how quickly it can turn that whole chain into reliable infrastructure for training, inference, and agentic workloads.

Compute is no longer just chips

Sachin Katti described OpenAI’s infrastructure problem as a full-stack industrial problem, not a procurement problem for GPUs. His remit at OpenAI is “industrial compute”: delivering the compute the company needs across training and inference. In that framing, compute means chips, memory, networking, power, cooling, data center buildings, power generation, power distribution, and land. All of those have to arrive together for a gigawatt-scale system to become usable.

That distinction matters because the hard part, in Katti’s account, starts after the visible dealmaking. Signing contracts is the easy, “fun” part. The work is making sure suppliers deliver, engineering systems so they operate together at scale, and keeping the resulting cluster running at high performance. A gigawatt, he said, is roughly half a million GPUs. When the planning question becomes six or ten gigawatts, the job becomes orchestrating enormous numbers of chips, power systems, cooling systems, networking components, buildings, and operators into a working machine.

~500,000

GPUs in a gigawatt-scale AI compute build, in Katti’s estimate
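As a rough sanity check on that estimate, dividing one gigawatt by half a million GPUs gives the all-in site power available per GPU. The sketch below is illustrative only: the function names are invented for this example, and scaling the same ratio linearly to 6, 10, or 30 gigawatts is an assumption rather than something Katti stated.

```python
# Back-of-envelope check of "one gigawatt is roughly half a million GPUs".
# All names are hypothetical; linear scaling to larger sites is an assumption.

WATTS_PER_GIGAWATT = 1_000_000_000
GPUS_PER_GIGAWATT = 500_000  # Katti's rough figure


def site_watts_per_gpu(gigawatts: float, gpu_count: int) -> float:
    """All-in site power per GPU: chips plus cooling, networking, and losses."""
    return gigawatts * WATTS_PER_GIGAWATT / gpu_count


def gpus_for_site(gigawatts: float) -> int:
    """GPU count for a site, assuming the same GPUs-per-gigawatt ratio holds."""
    return round(gigawatts * GPUS_PER_GIGAWATT)


if __name__ == "__main__":
    print(f"{site_watts_per_gpu(1, GPUS_PER_GIGAWATT):.0f} W per GPU, all-in")
    for gw in (1, 6, 10, 30):
        print(f"{gw:>2} GW -> {gpus_for_site(gw):,} GPUs")
```

Under these assumptions the budget comes to about two kilowatts per GPU, which sits at the top of the one-to-two-kilowatt range Katti cited later in the talk.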

The systems are not robust commodities yet. Katti said today’s chips are “very brittle”: they are sensitive to cooling and power fluctuations and can throttle quickly, cutting the useful compute they deliver. The metric that matters is not nominal capacity alone but usable compute: how much of the installed base can actually be run, at high utilization, for training and inference workloads.

That is why he treated infrastructure as an execution discipline rather than a simple capacity target. The bottleneck is not one item. Asked whether the constraint is power, energy, chips, land, or sourcing, he answered: “All of the above.” At gigawatt scale, a missing component anywhere in the chain can prevent the entire system from becoming operational.

The class showed an OpenAI compute-ambitions chart that pointed to a 30-gigawatt target by the end of the decade. Katti called that number aspirational and said it would be split across research and products. He did not describe OpenAI as a company simply maximizing near-term revenue from the compute it can obtain. He described it as still a research lab, trying to make as much compute as possible available so researchers are not constrained in exploring new models, new ideas, and new ways of pushing the frontier of intelligence.

That said, he also made clear why the compute target is commercially central. Over the previous three years, OpenAI’s compute had tripled year over year, and revenue had tripled as well. Katti called revenue a lagging indicator for frontier labs: a function of how much compute they have and how well utilized that compute is.

Every year we have tripled compute year over year. Right. And revenue has tripled. We don’t see any end in sight to the correlation yet.

Sachin Katti

The example he used was Codex after the release of 5.5. He said Codex had seen meaningful double-digit growth in roughly two weeks and was being used not only for coding but for general-purpose knowledge work. In his telling, the growth mechanism is simple: more users, more tokens, more complex tasks, and therefore more revenue, so long as the compute exists to serve the demand.

For operators and suppliers, that makes AI capacity different from ordinary cloud expansion. Katti’s version of compute is not a rack count or a GPU allocation. It is a synchronized industrial delivery schedule. Chips without power are not compute. Power without cooling is not compute. A signed contract without operational performance is not compute. The constraint is the whole chain landing together.

Inference is becoming the dominant compute workload

Apoorv Agrawal pressed Sachin Katti on the split between training and inference, because the economics of infrastructure change if inference becomes the dominant use of compute. Katti’s answer was that the old training-versus-inference distinction is becoming less useful.

Scaling laws were first discussed mostly in the context of pre-training. Katti said they now apply across the lifecycle of compute. Pre-training remains one part of the system. But post-training with reinforcement learning is primarily an inference workload. Synthetic data generation is also primarily inference, and Katti said synthetic data has become necessary because “we have run out of real world data to train models on.” Product usage — ChatGPT, Codex, and related systems — is inference as well.

The result is that inference is already the majority of OpenAI’s compute, and Katti expects that share to grow. He projected that “80% plus” of future compute would be inference. But he cautioned against equating inference with product serving. A large part of research and “training the next level of intelligence” is also inference.

That shift changes the revenue question. Agrawal suggested that if inference becomes a larger share, monetization per gigawatt may rise because inference is what customers directly consume. Katti agreed that more token consumption should lead to more revenue, but he immediately separated monetization from OpenAI’s technical objective. The mission, as he put it, is to make tokens cheaper.

He broke that objective into three dimensions. OpenAI wants to make each token cheaper to generate through hardware and software improvements; make each token more intelligent through model capability improvements; and reduce the number of tokens required to complete a given task. Those are not just cost optimizations. Katti tied them to the company’s stated goal of making intelligence more widely accessible.
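One way to see how the three dimensions combine is to treat the cost of a completed task as a product of three terms. The sketch below is a simplification and not a model Katti presented: the class name is hypothetical, "more intelligent tokens" is approximated by a per-attempt success rate, and the numbers are made up purely to show the levers multiplying.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEconomics:
    cost_per_token: float      # serving cost of one token (arbitrary units)
    tokens_per_attempt: float  # tokens spent on one attempt at the task
    success_rate: float        # chance one attempt completes the task, in (0, 1]

    def cost_per_completed_task(self) -> float:
        if not 0 < self.success_rate <= 1:
            raise ValueError("success_rate must be in (0, 1]")
        # With independent attempts, expected attempts until success = 1 / success_rate.
        expected_attempts = 1 / self.success_rate
        return self.cost_per_token * self.tokens_per_attempt * expected_attempts


# Hypothetical numbers chosen only to show how the levers combine.
baseline = TaskEconomics(cost_per_token=1.0, tokens_per_attempt=1000, success_rate=0.5)
improved = TaskEconomics(cost_per_token=0.5, tokens_per_attempt=800, success_rate=0.8)

if __name__ == "__main__":
    ratio = baseline.cost_per_completed_task() / improved.cost_per_completed_task()
    print(f"cost per completed task falls by about {ratio:.1f}x")
```

Because the terms multiply, modest gains on each axis compound: in this made-up case a 2x cheaper token, 20% fewer tokens, and a higher success rate together cut the cost per completed task by roughly 4x.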

That formulation also explains why compute advantage appears in the user experience. Asked about rumors that OpenAI has a significant compute advantage over other labs, Katti pointed to 5.5. He described it as a large model that is expensive to serve, and said “there are no limits,” meaning users can try it without the token caps that often constrain access to large models. He also said OpenAI is more generous with subscription token allocations and often resets limits so people can use the systems more heavily.

The important point was not only that compute helps build frontier models, but that it also determines whether intelligence can be distributed. “It’s no good if you build the intelligence but you can’t really deliver it at scale,” Katti said. In his account, the compute advantage shows up when a user can keep working with fewer artificial constraints, not merely when a benchmark improves.

Agents turn inference into a graph, not a call

Sachin Katti explained agentic workloads as the shift from one-shot answers to closed-loop work. ChatGPT, in its initial form, was “one shot inference”: a user asks a question, the model gives an answer, and the interaction ends. Reasoning models added more internal inference. The system could think longer, introspect, and produce better answers. But even reasoning systems remained passive in Katti’s account: a user asks, the model responds.

Agents are different because they have agency to act. Katti defined the shift as closing the loop: not only thinking and suggesting, but trying, observing the output, iterating, and then trying a refined answer. For coding, that means generating code, spinning up a virtual machine, running tests, inspecting the result, and returning to the model. For other knowledge work, it may mean searching, retrieving relevant data, opening tools such as Excel or PowerPoint, testing an output, and iterating.

That changes the compute graph. In the chatbot world, Katti said, the graph is simple: a user, one inference call, and an answer. Reasoning models add multiple inference calls. Agents create what he called a more complex directed acyclic graph: inference calls, tool calls, databases or search queries, RL or VM environments, then more inference calls, and so on.
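The difference between a single call and an agentic graph can be sketched as a small dependency graph. Everything in this example is hypothetical: the node names, the node kinds, and the shape of the coding task are invented for illustration, and the agent's retry loop is unrolled into separate nodes so the graph stays acyclic. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9 or later).

```python
from collections import Counter
from graphlib import TopologicalSorter

# Each node maps to (kind, set of nodes it depends on). All names are hypothetical.
chatbot = {"answer": ("inference", set())}

coding_agent = {
    "plan":        ("inference", set()),
    "search_docs": ("search", {"plan"}),
    "write_code":  ("inference", {"plan", "search_docs"}),
    "run_tests":   ("vm", {"write_code"}),
    "read_result": ("inference", {"run_tests"}),
    "patch_code":  ("inference", {"read_result"}),
    "rerun_tests": ("vm", {"patch_code"}),
    "final":       ("inference", {"rerun_tests"}),
}


def execution_order(graph):
    """One valid order in which the nodes can run, dependencies first."""
    deps = {node: preds for node, (_, preds) in graph.items()}
    return list(TopologicalSorter(deps).static_order())


def work_by_kind(graph):
    """How many nodes of each kind (inference, search, vm) the graph contains."""
    return Counter(kind for kind, _ in graph.values())


if __name__ == "__main__":
    print(work_by_kind(chatbot))
    print(work_by_kind(coding_agent))
    print(" -> ".join(execution_order(coding_agent)))
```

Even this toy agent mixes three kinds of work in one request, which is the property that makes a single hardware type a poor fit for the whole graph.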

The class slide placed this shift in an “AI Evolution” sequence: chatbots in 2023, reasoners in 2024, agents in 2025, innovators in 2026, and then organizations. The examples on the slide moved from GPT-4o and ChatGPT to o1, o3, search, deep research, ChatGPT Agent, Codex, specialized agents, OpenAI for Science, and autonomous researcher. The slide — not Katti’s spoken explanation at that moment — also described GPT-5.3-Codex as “the first model instrumental in creating itself.”

This is the reason Katti does not expect one kind of hardware to serve all AI workloads economically. The workload itself is becoming heterogeneous. Some parts require very fast inference. Some require long context and large memory state. Some require CPUs. Some require GPUs. Some may use other accelerators.

The class slide summarized the product goal as making “the human the bottleneck,” with the success condition that the user is waiting on their own decisions rather than on model inference, tool hops, or scheduler delays. Katti said the phrase was tongue-in-cheek but directionally accurate. Today, an agent may take minutes or hours to complete a task. The user context-switches, does something else, and then has to reload the original context when the system asks for a decision. The desired state is that the system runs so quickly that the user is waiting on their own judgment and steering.

We have succeeded from a compute perspective when we have built the systems and the infrastructure such that the human becomes the bottleneck.

Sachin Katti

That objective pushes infrastructure toward specialization. Katti gave Cerebras as an example of an accelerator useful for very fast inference. He also described possible accelerators built for long context — systems that hold a great deal of state in memory so they do not need to page it in and out. For coding, that could mean holding an entire GitHub project in context and retrieving relevant parts quickly.

The task for infrastructure builders, as Katti framed it, is to match each part of the agentic graph to the right compute substrate while optimizing both efficiency and performance. Pure GPU compute may be powerful, but he said it cannot economically deliver the desired agentic experience by itself.

Latency exposes the whole stack

The move toward agents makes latency a competitive dimension, but Sachin Katti resisted a simple “move inference to the edge” answer. Apoorv Agrawal asked whether inference-heavy workloads would require more distributed clusters, Cloudflare-like mini-clusters close to users, rather than giant facilities in places such as Texas or Virginia. Katti said that future may come, but not yet.

His first reason was economic. Building 50 megawatts of compute is far more expensive per megawatt than building a gigawatt in one location. Labor is a bottleneck, especially in the United States. If a company must assemble the human capacity to build these systems, Katti said, it is economically preferable to apply that effort to a larger concentrated build rather than to many smaller sites.

His second reason was technical. For agentic workloads, he said, time to first token is still on the order of 400 to 500 milliseconds because the model has to page in the relevant context before generating the first token. That latency is larger than the network latency savings from putting compute nearer to the user. Until the models are distilled into smaller, highly capable systems that can run closer to the user, the economics still favor concentrated inference clusters.

Katti gave Codex as the example. A Codex query must combine the prompt with the user’s code base, creating the full context for the model. He described the prefill phase: before producing the first output token, the system has to run the entire context through the attention mechanism. Codex models, he said, now have 400k-token contexts. Those 400k tokens must be computed before the first output token appears.

400k

context length Katti cited for current Codex models

Other latencies exist — converting a prompt into tokens, sending it to the cloud, load balancing it to the right GPU, and returning through the API and app layers — but Katti said those add tens of milliseconds. The prefill latency dominates.
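A simple latency budget makes the comparison concrete. In the sketch below only the rough magnitudes come from the talk (prefill around 400 to 500 milliseconds, other stages adding tens of milliseconds); the per-stage split, the stage names, and the 30-millisecond edge saving are assumptions chosen for illustration.

```python
# Hypothetical time-to-first-token budget, in milliseconds.
# Only the rough magnitudes come from the talk; the per-stage split is made up.
TTFT_BUDGET_MS = {
    "tokenize_prompt": 5,
    "network_round_trip": 40,
    "load_balancing": 10,
    "prefill": 450,
    "api_and_app_layers": 15,
}


def total_ms(budget: dict[str, float]) -> float:
    """Sum of all stage latencies."""
    return sum(budget.values())


def share(budget: dict[str, float], stage: str) -> float:
    """Fraction of the total budget spent in one stage."""
    return budget[stage] / total_ms(budget)


def with_edge_placement(budget: dict[str, float], network_saving_ms: float) -> dict[str, float]:
    """Same budget with compute moved nearer the user, trimming only the network stage."""
    moved = dict(budget)
    moved["network_round_trip"] = max(0.0, moved["network_round_trip"] - network_saving_ms)
    return moved


if __name__ == "__main__":
    edge = with_edge_placement(TTFT_BUDGET_MS, network_saving_ms=30)
    print(f"prefill share of TTFT: {share(TTFT_BUDGET_MS, 'prefill'):.0%}")
    print(f"central: {total_ms(TTFT_BUDGET_MS):.0f} ms, edge: {total_ms(edge):.0f} ms")
```

With these illustrative numbers, moving compute to the edge trims only a few percent of time to first token, because prefill accounts for most of the budget regardless of where the cluster sits.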

That balance changes when one layer gets faster. Katti said OpenAI’s deployment of Cerebras earlier in the year generated tokens so much faster that other inefficiencies in the stack suddenly became visible. The app, API infrastructure, and surrounding systems had to be reengineered to keep pace. In his words, improving one layer “showed up all the inefficiencies that we had in the rest of the stack.” He said OpenAI had published a blog post the previous day about changing its API infrastructure to keep pace with Cerebras, but did not name the post beyond referring to the OpenAI blog.

Katti compared the work to shaving latency across every layer. Agrawal likened it to the old internet problem of optimizing page load times. Katti agreed and said the familiar trope applies: every 30 or 50 milliseconds of latency removed can lead to higher engagement, higher revenue, and higher retention. He expects latency to become a major axis of competition.

For infrastructure teams, the consequence is that accelerator performance is not separable from the rest of the system. A faster token generator can turn the API, scheduler, app path, load balancer, or context pipeline into the new bottleneck. The stack has to be optimized as a whole, because the user experiences the slowest visible layer, not the fastest component.
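The effect Katti describes resembles Amdahl's law, which bounds the end-to-end speedup when only one part of a pipeline gets faster. That framing is an interpretation added here, not a term he used, and the 80% share and 20x speedup below are hypothetical values.

```python
def amdahl_speedup(accelerated_fraction: float, component_speedup: float) -> float:
    """End-to-end speedup when only a fraction of request time gets faster (Amdahl's law)."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / component_speedup)


def residual_share(accelerated_fraction: float, component_speedup: float) -> float:
    """Share of the new, shorter request time spent in the parts that did not speed up."""
    untouched = 1.0 - accelerated_fraction
    return untouched / (untouched + accelerated_fraction / component_speedup)


if __name__ == "__main__":
    # Hypothetical: token generation is 80% of request time and becomes 20x faster.
    print(f"end-to-end speedup: {amdahl_speedup(0.8, 20):.2f}x")
    print(f"rest of stack is now {residual_share(0.8, 20):.0%} of request time")
```

In this made-up case a 20x faster generator yields only about a 4x faster request, and the untouched layers grow from a fifth of the request time to most of it, which is why they suddenly become visible.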

The grid was not designed for synchronized AI loads

The infrastructure decisions Sachin Katti described have societal implications because a gigawatt-scale data center is not just a large customer; it can become a major dynamic load on a regional grid. He gave the example of putting a gigawatt data center in a state such as Georgia or Michigan. When a large training job runs, the workload is synchronized. Power intensity can move up and down in sync. Katti said the resulting energy fluctuations on the grid can be hundreds of megawatts very quickly.
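A rough calculation shows why synchronization matters. The sketch below compares a site where every GPU changes power draw at the same moment with a textbook model in which each GPU is independently in a high or low power state. The 600-watt per-GPU difference is a hypothetical value, and the independence model is a statistical simplification, not a description of how any real cluster or mitigation works.

```python
import math


def synchronized_swing_mw(gpu_count: int, per_gpu_delta_w: float) -> float:
    """Site-level swing when every GPU changes power draw in the same instant."""
    return gpu_count * per_gpu_delta_w / 1e6


def independent_swing_std_mw(gpu_count: int, per_gpu_delta_w: float, p_high: float = 0.5) -> float:
    """Standard deviation of site power if each GPU were independently in the
    high-power state with probability p_high (binomial model)."""
    return math.sqrt(gpu_count * p_high * (1 - p_high)) * per_gpu_delta_w / 1e6


if __name__ == "__main__":
    gpus = 500_000   # roughly one gigawatt, per Katti's estimate
    delta_w = 600.0  # hypothetical per-GPU gap between compute and idle phases
    print(f"synchronized swing: {synchronized_swing_mw(gpus, delta_w):.0f} MW")
    print(f"independent std dev: {independent_swing_std_mw(gpus, delta_w):.2f} MW")
```

Under these assumptions the synchronized case swings by about 300 megawatts, in line with the "hundreds of megawatts" Katti described, while uncorrelated loads of the same size would fluctuate by well under one megawatt.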

The existing grid, in his view, was not designed for that behavior. Poorly managed, a data center could cause collateral damage to surrounding infrastructure. He said that, depending on how these data centers behave, a grid could “fall apart” and an entire state could experience a blackout. Because of that, OpenAI spends time on how to design systems that do not impose those effects on the rest of the country’s infrastructure.

This is why Katti tied AI infrastructure to broader industrial redevelopment. He said the company thinks about de-risking supply chains, moving fabs and memory factories to other parts of the world, decoupling from grid energy, using natural gas, and increasingly using nuclear in the future. He argued that AI demand will create infrastructure investments and innovations that society benefits from beyond AI, because these investments otherwise lacked the same impetus.

The energy numbers are large enough to change markets, though Katti’s phrasing was explicitly approximate. Asked to put OpenAI’s 30-gigawatt aspiration in perspective, he said he did not have the U.S. energy consumption number off the top of his head. He then said that, if one adds up hyperscaler build plans beyond OpenAI’s own 30-gigawatt aspiration and the numbers from companies such as Google and Amazon, “100 gigawatt” was probably already “a fifth to higher of the grid.” He described that as a double-digit percentage of U.S. capacity. The point was directional rather than a settled forecast: the scale is large enough to alter energy markets.

That leads to a broader conceptual shift. Energy has historically been treated primarily as serving human consumption. Katti said that framing is no longer true. If AI systems become a major destination for electricity, the energy market is no longer only about homes, factories, and traditional digital services. It is about serving machine intelligence at industrial scale.

Katti extended the point with a more speculative comparison. People now take for granted that everyone should have a mobile phone and upgrade it every year or two. He said it is “not that crazy” to think everyone should have a GPU. A GPU, in his estimate, is one to two kilowatts. For seven billion people, that implies seven terawatts of compute at the low end of that range, more than two orders of magnitude above the 30-gigawatt number under discussion.

The point was not that OpenAI is building seven terawatts. It was that, if one believes in widely distributed personal access to large amounts of intelligence, even 30 gigawatts is an early number, not an end state.
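The arithmetic behind the seven-terawatt figure is short enough to check directly. The sketch below uses only the numbers from the talk (seven billion people, one to two kilowatts per GPU, a 30-gigawatt aspiration); the function and constant names are invented for this example.

```python
import math

WORLD_POPULATION = 7_000_000_000  # figure used in the talk
ASPIRATION_GW = 30                # OpenAI's stated aspiration


def total_terawatts(people: int, kw_per_gpu: float) -> float:
    """Total power if every person had one GPU drawing kw_per_gpu kilowatts."""
    watts = people * kw_per_gpu * 1_000
    return watts / 1e12


def ratio_to_aspiration(terawatts: float, aspiration_gw: float = ASPIRATION_GW) -> float:
    """How many times larger the total is than the aspiration."""
    return terawatts * 1_000 / aspiration_gw


if __name__ == "__main__":
    for kw in (1.0, 2.0):
        tw = total_terawatts(WORLD_POPULATION, kw)
        ratio = ratio_to_aspiration(tw)
        print(f"{kw:.0f} kW per GPU -> {tw:.0f} TW, about {ratio:.0f}x of 30 GW "
              f"(10^{math.log10(ratio):.1f})")
```

At one kilowatt per GPU the total is seven terawatts; at two kilowatts it doubles to fourteen. Either way the result is a few hundred times the 30-gigawatt figure.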

The supply chain cannot stay single-threaded

Sachin Katti argued that the AI infrastructure market needs a more resilient compute supply chain, because it is dangerous for the world to be single-threaded on any one component. That applied to GPUs, accelerators, memory, fabs, and ultimately the semiconductor equipment layer.

Asked about Nvidia, hyperscaler accelerators, and the likely market share path for chips, Katti declined to name favorites. His broader answer was that the workload itself will force diversity. Agents are more complex than pure training or pure inference jobs on a GPU. Different parts of the graph will benefit from different hardware.

But he added a supply-chain reason: the way TSMC allocates wafers. Katti said TSMC has been successful partly because it tries to make multiple customers successful. That is in TSMC’s business interest because it avoids dependence on a single large customer. Since TSMC is a choke point in the supply chain, the allocation of wafers will result in multiple companies receiving capacity and therefore multiple varieties of chips entering the market. At the scale OpenAI, Google, Amazon, and others are discussing, Katti said, major AI companies will have to learn how to use all of those chips because they will not have a choice.

The deeper choke point, in his view, is fab capacity across logic and memory. He named TSMC, Samsung, and Intel on the logic side, and Micron, SK Hynix, and Samsung on the memory side. The market is highly concentrated. If one digs even deeper, he said, the choke point is ASML, because all of these advanced fabs need ASML machines.

He gave a time-horizon answer to where the next race will be. In the short to medium term, he pointed to orchestration software, the harness around agentic systems, and models becoming more token-efficient. In the medium to long term, he pointed to new memory architectures. Unless the transformer is replaced, he said, the shape of the compute unit is reasonably well understood; what keeps changing is the memory architecture around that compute unit.

His criticism of the current systems was that they are too simple. Today’s AI compute often looks like large compute units attached to one layer of high-bandwidth memory. Katti compared that to early general-purpose computing, where CPUs later acquired multiple cache layers, flash, hard drives, and more complex storage hierarchies. AI systems infrastructure, in his view, is still early in the comparable evolution.

For suppliers, the implication is that the opportunity is not limited to the obvious accelerator market. Memory hierarchy, orchestration software, low-level runtime systems, networking, and the ability to integrate heterogeneous chips into usable clusters all become part of the competitive frontier. For buyers, the implication is more uncomfortable: no single vendor path is likely to be enough at the scale being discussed.

AI may compress the infrastructure design cycle

One of Sachin Katti’s more important claims was that AI is beginning to feed back into the design of the infrastructure that runs it. He said OpenAI is increasingly using its latest models to help design the next chip and the low-level software needed to run the next model. He called this “recursion”: AI figuring out the chip, system, and software it needs to run most efficiently.

The current process is more decoupled. A lab trains a model. A chip company designs a chip independently. The lab receives the hardware and figures out how to make the model run efficiently on it. Katti argued that this cycle is too slow for the pace of model change. A typical chip design cycle is three years from the initial idea of what a chip should be to production.

That lag now looks enormous. Apoorv Agrawal noted that three years is roughly the time since ChatGPT launched. Katti agreed: in AI, that is an eternity. If the next model can help specify the chip and system design it wants while it is being trained, the cycle from model discovery to infrastructure deployment could compress.

Katti did not present this as a fully realized industrial process with detailed operational disclosure. He said OpenAI is “not that far from that world,” and described it as probably the only feasible way to bend the curve on compute cycle time. Human teams interpreting model requirements, designing systems, and deploying them on the traditional semiconductor timeline may not be able to keep pace.

This recursive loop also explains why Katti is interested in the lowest levels of the stack. He urged students to look at foundational infrastructure — transformers, batteries, generation, distribution, cooling, materials, transistors, and components — because the ability to build those things at scale is both technically and operationally hard to replicate. In the United States, he said, “we have forgotten how to build very foundational infrastructure.” He connected that observation to his experience as a Stanford faculty member, where he saw enrollment dwindle in electrical engineering, especially in lower-layer work such as transistors and materials.

His advice was not simply to study AI models. It was to go where the physical bottlenecks are. If AI transforms the economy in the way its proponents expect, then the enabling layer underneath it must also be transformed.

Value starts in infrastructure, but may not stay there

The class used Jensen Huang’s “AI 5-layer cake” framing. The slide placed energy at the base, followed by chips, infrastructure, models, and applications, with companies mapped into each layer. Energy included companies such as Constellation, NextEra Energy, Vistra, and GE Vernova. Chips included Nvidia, AMD, Broadcom, Intel, and Groq. Infrastructure included AWS, Microsoft Azure, Google Cloud, CoreWeave, Crusoe, and Databricks. Models included OpenAI, Anthropic, Google DeepMind, Meta Llama, and xAI. Applications included Cursor, Glean, Harvey, Perplexity, and Waymo.

Asked which layer is most likely to accrue value over the long term, Sachin Katti said the answer changes over time.

He compared the AI cycle to the mobile revolution. Early in mobile, a large amount of money went to telcos and infrastructure builders. Later, value moved up into applications and then into cloud services. Katti said he does not see why AI would be different. At the current stage, the infrastructure layer is where the profits are. Over time, he expects value to move toward platforms and applications.

But his later comments complicated a simple “apps win” conclusion. Asked what he is bearish on, Katti said he is short anything that is merely a model wrapper. He called that an easy answer, but still true. The pace of model improvement, and the ability of models to introspect and figure out how to deliver outcomes, make it very hard to sustain a business that simply wraps a model.

He went further and questioned whether “apps” themselves are the right interface category for the future. If users interact with computing by specifying outcomes — “this is the outcome I want, go figure it out” — then today’s apps may be a crutch rather than the final interface. That was not framed as OpenAI or Anthropic wanting to build all applications. It was a more basic question about whether the app model remains the way users get work done when agents can execute across tools.

Katti’s long position was OpenAI, where he said he was “voting with my feet.” But beyond that, he said the underappreciated long opportunity is the lowest layer of the stack. Foundational infrastructure has sustainable differentiation because it combines technical difficulty with scale. If someone can build it, he argued, it is hard for others to replicate.

His answer to a rapid-fire question about the first company to reach a $10 trillion market cap was Nvidia. OpenAI, he said, would get there “for sure,” but if OpenAI does, Nvidia will as well. The logic was consistent with the rest of his argument: in the current phase, the ability to supply the underlying compute is itself an extraordinary value capture point.

Open weights may catch up, but the frontier lead still matters

A student asked how open-weight models affect the compute thesis, especially because they often use fewer parameters and require less compute. Sachin Katti said open-source models have a role in the ecosystem, but he did not see them changing the need to invest in frontier intelligence.

His argument was based on scaling laws. Frontier model intelligence, he said, will require orders of magnitude more compute, and OpenAI does not see that changing. Open-weight models may play catch-up and distill frontier intelligence into more compact form factors. Katti did not describe that as a problem for OpenAI. But he said a six-month lead in intelligence is an enormous lead, so the company sees no reason to back off from continued investment in frontier models.

That answer fits his broader infrastructure view. Open-weight models may lower deployment costs for certain use cases and broaden the ecosystem. But if the frontier keeps moving through compute-intensive training, post-training, synthetic data generation, and inference-heavy research, then the industrial race remains about time to usable compute.

That phrase, shortened to “time to compute,” was how Katti described OpenAI’s approach to Stargate. Asked about the project, he said Stargate is essentially his job: delivering all of this compute. He estimated the spend at about $70 billion per gigawatt. The build is also operationally hard: roughly half a million GPUs to build, manage, staff, and operate.

$70B

spend Katti associated with one gigawatt of compute
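Combining the two estimates from the talk, $70 billion per gigawatt and roughly half a million GPUs per gigawatt, gives an implied all-in spend per GPU. The sketch below is illustrative: the names are invented, and applying the per-gigawatt figure linearly to a 30-gigawatt build is an extrapolation that Katti did not make.

```python
COST_PER_GW_USD = 70_000_000_000  # Katti's estimate for one gigawatt
GPUS_PER_GW = 500_000             # Katti's rough GPU count per gigawatt


def all_in_cost_per_gpu(cost_per_gw: float = COST_PER_GW_USD,
                        gpus_per_gw: int = GPUS_PER_GW) -> float:
    """Total site spend divided by GPU count: chips plus power, cooling, buildings, land."""
    return cost_per_gw / gpus_per_gw


def build_cost_usd(gigawatts: float, cost_per_gw: float = COST_PER_GW_USD) -> float:
    """Spend for a build of the given size, assuming cost scales linearly (an assumption)."""
    return gigawatts * cost_per_gw


if __name__ == "__main__":
    print(f"implied all-in spend per GPU: ${all_in_cost_per_gpu():,.0f}")
    print(f"30 GW at the same rate: ${build_cost_usd(30) / 1e12:.1f} trillion")
```

The per-GPU figure covers the whole site, not the chip alone, which fits Katti's argument that compute means the full chain rather than a GPU allocation.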

So the question is not only how much capacity can be announced. It is how quickly that capacity can land and become operational. Katti said that is one reason OpenAI prefers bigger concentrated chunks of compute: many small distributed clusters are operationally harder to bring online quickly.
