Value Per Gigawatt Is Becoming AI Infrastructure’s Core Metric

Amin Vahdat · Stanford Online · Wednesday, May 27, 2026 · 19 min read

Amin Vahdat, Google’s chief technologist for AI infrastructure and leader of its internal compute and TPU programs, argues in a Stanford CS153 lecture that AI infrastructure should be judged by value delivered per dollar, not by gigawatts or flops alone. With a gigawatt-scale buildout costing roughly $40 billion to $50 billion, he says the scarce discipline is building systems that are reliable enough, balanced across compute, memory and networks, procurable on multi-year timelines, and useful to customers and communities rather than merely large.

The useful unit is not the gigawatt

Amin Vahdat treats the present AI infrastructure buildout as a measurement problem before it is a capacity problem. Gigawatts and flops are visible, headline-friendly quantities. They are also incomplete proxies for what matters.

Anjney Midha put one gigawatt of AI infrastructure at roughly a $40 billion buildout. Vahdat said he has heard estimates closer to $50 billion per gigawatt, with costs rising. But his main point was that the raw number does not say whether the system delivers value. A gigawatt of unreliable or unbalanced compute is not equivalent to a gigawatt that can be scheduled, kept healthy, fed with data, and translated into useful product output.

The measure isn't how much money you spent per gigawatt, it's actually how much value you deliver per dollar.

Amin Vahdat

For Vahdat, value is not an abstract infrastructure metric. It rolls up into business and user outcomes: happy daily active users for Gemini, paying enterprise customers, developers getting their work done, and growth in usage that indicates people are choosing a service because it is useful. Infrastructure measures still matter — flops, HBM bandwidth, accelerator interconnect bandwidth, network bandwidth, availability — but they matter because they determine whether the end system can deliver those outcomes efficiently.

He pushed back on treating “compute” as a single generalizable input. In the age of agents, he said, utility depends on orchestration across the whole stack. A TPU or GPU may be expensive, but it can still sit idle while an agent waits for a CPU simulation, which in turn waits on data from storage in another region. If accelerators, CPUs, storage, and networks are not balanced, capacity becomes stranded inside the system itself.
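The stranded-capacity point can be made concrete with a small sketch. It assumes, purely for illustration, that one agent step runs its accelerator, CPU, and storage phases strictly one after another; the function name and the timings are hypothetical and do not come from the lecture.

```python
def accelerator_utilization(accel_s: float, cpu_s: float, storage_s: float) -> float:
    """Fraction of one agent step during which the accelerator is busy.

    Assumes the three phases run strictly in sequence (no overlap).
    """
    total = accel_s + cpu_s + storage_s
    if total <= 0:
        raise ValueError("step must take positive time")
    return accel_s / total


# Hypothetical step: 2 s of model compute, 5 s of CPU simulation,
# 3 s waiting on storage in another region.
busy = accelerator_utilization(accel_s=2.0, cpu_s=5.0, storage_s=3.0)
print(f"accelerator busy {busy:.0%} of the step")
```

In this toy case the expensive device is busy only a fifth of the time, which is the sense in which unbalanced CPUs, storage, and networks strand accelerator capacity.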

That is why Vahdat described idle capacity as a bug. Google spends heavily to secure megawatts and gigawatts of capacity, but in his account, the discipline is extracting the most utility from every deployed machine. If a company can spend half the money, deploy half the capacity, and deliver the same capability, that is better. If it can deliver twice the value from the same gigawatt, it needs fewer gigawatts — or can do more with a scarce energy budget.

Midha pressed the measurement question from another angle: if the outputs are heterogeneous — code tokens, image tokens, agentic work — and the input is compute, how should a systems builder measure intelligence? Vahdat said Google is working on benchmarks for “intelligence per dollar” and has published some work externally, but he treated the general version of the question as difficult and close to impossible. In practice, he said, business outcomes are the most usable proxy. People “vote with their feet” by using services that provide value.

Reliability is being repriced by frontier workloads

Amin Vahdat began the reliability argument with a simple arithmetic point: 99 percent availability sounds high until it means 3.65 days of downtime per year. Moving from 99 percent to 99.9 percent is difficult, but the difference matters. At the same time, AI training is changing how some customers value reliability versus raw capacity.

Historically, enterprise services wanted five nines of availability. Vahdat described that as roughly 30 seconds of downtime a year; strictly, 99.999 percent availability allows about 5.3 minutes of downtime a year, and roughly 30 seconds corresponds to six nines. Achieving that level of power reliability often requires heavy overprovisioning: redundant feeds, 2N or 1+1 capacity, so that one power path can fail and the other can immediately take over. The consequence is that a large fraction of provisioned power capacity is not actively used by compute at any given moment.
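The availability arithmetic is easy to check. The sketch below converts an availability fraction into annual downtime, assuming a 365-day year; the function name is illustrative. By this arithmetic, 99 percent gives 3.65 days and 99.999 percent gives a little over five minutes.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a non-leap year


def annual_downtime_seconds(availability: float) -> float:
    """Seconds of downtime per year implied by an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * SECONDS_PER_YEAR


for label, availability in [("99%", 0.99), ("99.9%", 0.999), ("99.999%", 0.99999)]:
    seconds = annual_downtime_seconds(availability)
    print(f"{label:>8}: {seconds / 86400:.4f} days = {seconds / 60:.2f} minutes")
```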

Frontier training customers are now more willing to trade that reliability away. Vahdat said that if one asks whether they would prefer 99.9 percent reliability and double the capacity, or 99.999 percent reliability and half the capacity, many frontier labs would take the capacity. Training a frontier model is throughput-driven. Losing one or several days a year can be acceptable if the remaining days provide much more compute.
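A rough way to see why a throughput-driven customer takes the capacity: multiply capacity by availability. The sketch below is a deliberate simplification that ignores restart costs and lost work after an outage; the function name and the "capacity units" are illustrative.

```python
def delivered_capacity_days(capacity_units: float, availability: float,
                            days: float = 365.0) -> float:
    """Capacity-days actually delivered in a year (simplified model)."""
    return capacity_units * availability * days


# The two options from the text: double capacity at 99.9%,
# or half that capacity at 99.999%.
more_capacity = delivered_capacity_days(capacity_units=2.0, availability=0.999)
more_reliable = delivered_capacity_days(capacity_units=1.0, availability=0.99999)

print(f"2x capacity at 99.9%:   {more_capacity:.2f} capacity-days")
print(f"1x capacity at 99.999%: {more_reliable:.2f} capacity-days")
print(f"ratio: {more_capacity / more_reliable:.3f}")
```

Under these assumptions the lower-reliability option still delivers nearly twice the compute over a year, since the extra downtime is small next to the doubled capacity.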

3.65 days

annual downtime implied by 99% availability

This is not the old model of internet-scale services. Vahdat contrasted frontier training with web search. Google’s web infrastructure was designed so that any rack could disappear and users would not notice. Data is replicated; spare compute is fungible; loose coupling allows the service to continue. The system notices the failure and repairs it, but the user path survives.

AI training and serving are different. In a frontier training run, tens of thousands of accelerators or more may participate in a synchronous computation. In serving, hundreds or thousands of TPUs or GPUs may each hold a specific part of a model. If one node fails, the whole computation can halt, because the other nodes are waiting on its results.

If you think about TPU or GPU training or inference, every node is special.

Amin Vahdat

The old distributed-systems posture — assume individual failures, mask them with loose coupling, and keep going — no longer transfers cleanly. Vahdat said this change reaches into how value is delivered at scale. A gigawatt may represent 150,000 or 200,000 TPUs or GPUs, depending on the system. If a single failed node can stop a computation and the organization cannot quickly identify and repair it, much of the spend is wasted. The key metric becomes not just utilization but “goodput”: the portion of deployed capacity that is actually doing useful work.
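One hedged way to reason about goodput is a simple uptime model: if any single node failure stops the job, the job's failure rate is the node count divided by the per-node mean time between failures, and each failure costs some recovery time. The sketch assumes independent failures and a fixed recovery cost; the MTBF and recovery figures are hypothetical, and only the 150,000-node count echoes the text.

```python
def goodput_fraction(nodes: int, node_mtbf_hours: float, recovery_hours: float) -> float:
    """Fraction of wall-clock time a synchronous job spends doing useful work.

    Assumes independent node failures, that any one failure halts the job,
    and that each failure costs a fixed recovery time (detection, repair,
    restart, and recomputing lost work).
    """
    failures_per_hour = nodes / node_mtbf_hours
    mean_uptime_hours = 1.0 / failures_per_hour
    return mean_uptime_hours / (mean_uptime_hours + recovery_hours)


# Hypothetical: 150,000 accelerators, each failing once per 300,000 hours.
for recovery in (1.0, 0.1):
    fraction = goodput_fraction(150_000, 300_000.0, recovery)
    print(f"recovery {recovery:>4} h -> goodput {fraction:.1%}")
```

The node count sets how often the job stops; how fast a failure is found and repaired sets how much of the spend survives as useful work.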

That emphasis also changes the meaning of an outage. A scheduling system that cannot place jobs on deployed TPUs is, in Vahdat’s formulation, a failure of value delivery even if the hardware exists. Reliability therefore includes failure prevention, rapid fault isolation, repair, scheduling, and the availability of all adjacent resources required to make the accelerators useful.

System balance turns capacity into work

Amin Vahdat returned repeatedly to system balance: the ratios among compute, memory bandwidth, memory capacity, interconnect, storage, and data center networking. His warning was direct: overbuilding flops while underbuilding the rest of the system wastes money.

He invoked Amdahl’s 1967 law of system balance. As Vahdat stated it, for every million instructions per second in a parallel system, one needs a megabyte per second of I/O. (The rule is more commonly quoted as one bit of I/O per second for each instruction per second, which works out to a megabit rather than a megabyte per second per MIPS.) The exact technology has changed; the principle has not. Compute without data is useless. Today, the I/O is largely networked I/O, and the scale has moved from small systems in the late 1960s to systems of 10,000 to roughly 100,000 nodes, sometimes even spread across a wide area network.
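The balance rule is a fixed ratio, so it can be written as a one-line check. In this sketch the ratio is a parameter, because the rule is quoted in different units; the default follows the figure given in the text, and the example system is hypothetical.

```python
def balanced_io_mb_per_s(mips: float, mb_per_s_per_mips: float = 1.0) -> float:
    """I/O bandwidth a balanced system needs for a given instruction rate."""
    return mips * mb_per_s_per_mips


def io_balance(mips: float, supplied_mb_per_s: float,
               mb_per_s_per_mips: float = 1.0) -> float:
    """Supplied I/O as a fraction of what the balance rule asks for."""
    return supplied_mb_per_s / balanced_io_mb_per_s(mips, mb_per_s_per_mips)


# Hypothetical system: 1,000,000 MIPS of compute, 250,000 MB/s of I/O.
print(f"I/O is {io_balance(1_000_000, 250_000):.0%} of the balanced amount")
```

A result below 100 percent suggests the compute is likely to wait on data; above it, the I/O is overbuilt relative to the compute.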

System dimension             Why it matters in Vahdat's framing
Flops                        Necessary but easy to overemphasize; unused flops do not create value.
HBM bandwidth and capacity   Sparse and mixture-of-experts workloads can require more memory bandwidth relative to compute.
Accelerator interconnect     Synchronous training depends on fast collective communication among many nodes.
Data center network          Agents and storage-heavy workflows require movement across CPUs, storage, and clusters, not only accelerator fabrics.
Reliability                  A single failed node can halt a synchronous training computation or disrupt model serving.

Vahdat described useful AI capacity as a balanced system rather than a pile of accelerators.

The modern challenge is not merely building a powerful chip. It is building a coordinated supercomputer with the right balance point across tens of thousands and, in Vahdat’s formulation, roughly 100,000 TPUs. Scaling flops is easy compared with building the memory, network, storage, and reliability envelope that can make those flops productive.

He tied this to low model-flop utilization in the industry. With the move toward mixture-of-experts and sparse computation, he said, current hardware is often not at the right balance point; these workloads need more memory bandwidth relative to computation. The result is that hardware may have plenty of theoretical compute but cannot keep it fed.
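The roofline model, a standard performance model that the lecture summary does not itself name, is one way to illustrate "plenty of theoretical compute but cannot keep it fed". Attainable throughput is capped by either peak compute or memory bandwidth times the workload's arithmetic intensity. The chip numbers and intensities below are hypothetical.

```python
def attainable_tflops(peak_tflops: float, hbm_tb_per_s: float,
                      flops_per_byte: float) -> float:
    """Roofline bound: limited by compute or by memory traffic.

    TB/s multiplied by FLOP/byte gives TFLOP/s, so the units line up.
    """
    return min(peak_tflops, hbm_tb_per_s * flops_per_byte)


PEAK_TFLOPS = 1000.0   # hypothetical accelerator peak
HBM_TB_PER_S = 4.0     # hypothetical memory bandwidth

for name, intensity in [("dense, high reuse", 500.0), ("sparse / MoE-like", 50.0)]:
    achieved = attainable_tflops(PEAK_TFLOPS, HBM_TB_PER_S, intensity)
    print(f"{name:>18}: {achieved:.0f} TFLOP/s, "
          f"{achieved / PEAK_TFLOPS:.0%} of peak")
```

With these made-up figures, the low-intensity workload tops out at a fifth of peak regardless of how many flops the chip advertises; more memory bandwidth would move the balance point.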

The difficulty compounds at scale. Vahdat compared it to a CPU pipeline: even inside a processor, balancing instruction fetch, decode, memory access, and execution is hard, and pipeline bubbles reduce efficiency. Across 100,000 nodes, perfect model-flop utilization is impossible for real computations. A slight variation in cache hit rate on one accelerator can create a delay; because other nodes are waiting for data, that delay propagates into lower utilization.
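The propagation effect can be sketched with basic probability. Assume, for illustration only, that each node independently has a small chance of being slow on a given synchronous step, and that one slow node delays the whole step by a fixed amount. The probabilities and timings are hypothetical.

```python
def prob_any_straggler(nodes: int, p_slow: float) -> float:
    """Probability that at least one of `nodes` is slow on a given step."""
    return 1.0 - (1.0 - p_slow) ** nodes


def expected_step_time(base_s: float, delay_s: float,
                       nodes: int, p_slow: float) -> float:
    """Expected step time when any straggler adds a fixed delay."""
    return base_s + prob_any_straggler(nodes, p_slow) * delay_s


# A 1-in-100,000 chance per node is negligible on one machine,
# but not across 100,000 machines that must all finish each step.
for n in (1, 1_000, 100_000):
    print(f"{n:>7} nodes: P(slow step) = {prob_any_straggler(n, 1e-5):.4f}")
```

Under these assumptions a hiccup that almost never happens on one accelerator affects most steps at 100,000 nodes, which is why perfect utilization is out of reach for real computations at that scale.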

He used this to resist simplistic capacity accounting. Spending $40 billion or $50 billion per gigawatt is not inherently good or bad. If spending $55 billion produces a more reliable and better-balanced gigawatt, the extra cost may be justified because the system produces more useful output. Conversely, a cheaper gigawatt that cannot feed its accelerators or survive failures may be false economy.
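The comparison reduces to useful output per dollar. The sketch below uses the $50 billion and $55 billion price points from the text, while the goodput fractions are hypothetical and chosen only to show how the ranking can flip.

```python
def useful_gw_per_billion(cost_billion: float, goodput: float,
                          gigawatts: float = 1.0) -> float:
    """Productive gigawatts obtained per billion dollars spent."""
    return gigawatts * goodput / cost_billion


cheaper = useful_gw_per_billion(cost_billion=50.0, goodput=0.60)   # hypothetical goodput
balanced = useful_gw_per_billion(cost_billion=55.0, goodput=0.75)  # hypothetical goodput

print(f"$50B build: {cheaper:.4f} useful GW per $B")
print(f"$55B build: {balanced:.4f} useful GW per $B")
```

With these assumed goodput figures the more expensive gigawatt delivers more useful capacity per dollar; with different figures it might not, which is the point of measuring value rather than spend.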

Agents broaden the balance problem. It is no longer only the TPU or GPU fabric. Agentic systems may coordinate accelerators with CPUs, storage, simulation, and remote data. Vahdat emphasized the ordinary data center network — not just high-speed accelerator interconnects such as ICI or NVLink — because it connects the whole system. Value per gigawatt depends on whether all of those layers can move together.

The hard constraint is lead time under uncertainty

Amin Vahdat described procurement and supply chains as constraints that money alone cannot clear. For a net-new gigawatt of capacity, he said, the lead time is roughly two to three years. That remains true even if the buyer has $40 billion or $50 billion available. The buildout is physical.

The forecasting problem is therefore structurally punishing. If an organization must commit today to the amount of capacity it will need in two years, it will almost certainly be wrong. Underpredicting leaves opportunity on the floor. Overpredicting wastes capital. Predicting perfectly has, in Vahdat’s telling, infinitesimal probability. If that lead time could be compressed to a day, he said, the forecast would be much more accurate; even an overprediction might be off by only a small fraction.
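A compounding-growth sketch shows why lead time dominates forecast error. It assumes demand grows at some annual rate that is only known to lie within a range; the growth rates are hypothetical, and the 2.5-year lead time sits in the two-to-three-year window from the text.

```python
def demand_range(current: float, growth_low: float, growth_high: float,
                 lead_years: float) -> tuple[float, float]:
    """Low and high demand after `lead_years` of compounding growth."""
    low = current * (1.0 + growth_low) ** lead_years
    high = current * (1.0 + growth_high) ** lead_years
    return low, high


def forecast_spread(growth_low: float, growth_high: float, lead_years: float) -> float:
    """Ratio of the high forecast to the low forecast."""
    low, high = demand_range(1.0, growth_low, growth_high, lead_years)
    return high / low


# Hypothetical uncertainty: annual growth somewhere between 50% and 150%.
for label, years in [("2.5 years", 2.5), ("1 day", 1.0 / 365.0)]:
    print(f"lead time {label:>9}: high/low = {forecast_spread(0.5, 1.5, years):.3f}")
```

Under these assumptions the plausible outcomes differ by more than a factor of three at a multi-year lead time and by a fraction of a percent at one day.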

Shortening that lead time is a technical problem, not only a finance or procurement problem. The chain includes land, permitting, site preparation, buildings, power contracts, chips, memory, components, manufacturing, and deployment. Even land is not enough: it may need to be graded, prepared for construction, and connected to power. Utilities are no longer operating with the slack that once allowed a hyperscaler to ask for 10 megawatts and receive it without much friction.

Vahdat described a new utility posture: if a company asks for a gigawatt, the utility may require a 20-year commitment to pay for that power around the clock. The reason is that there is no spare grid capacity to back the request casually. The buyer is effectively asking the utility to build or secure new capacity, and the utility wants assurance that the demand will persist.

That changes infrastructure planning. Watts and data center space have some fungibility — they can host generation X, X+1, X+2, or even X-1 hardware — but chip lead times are also substantial. Orders must be placed early, and use cases change. A new internal invention, a product launch, or a cloud customer asking for a particular type of GPU in a particular location can force replanning. Vahdat described planning at Google as massive, complicated, fast-changing, and effectively continuous.

The same logic applied to older chips. Vahdat rejected the assumption that older chips are becoming obsolete. Demand is high enough that older TPUs and GPUs continue to be heavily used at Google and across the industry. He said Google depreciates compute hardware over six years, which he described as more or less standard, with some firms perhaps using five years. In practice, Google sees use for at least that period and often longer.

Inference demand is exposing the same limits of prediction. A classroom question referred to a SpaceX-Anthropic partnership announced that day, in which Anthropic would use compute from the former xAI Colossus cluster, and Vahdat also mentioned a similar announcement involving Cursor using SpaceX/xAI capacity. He said he did not know the inside story of what Elon Musk and Dario Amodei, or others involved, had discussed. His narrower interpretation was that the pattern reflects massive current demand for inference compute, especially from coding agents. Coding agents had existed for some time, he noted, but their recent demand surge had not been predicted at that level. As a result, no one had enough lead time to secure all the GPUs or TPUs needed for serving.

Serving also changes the geography of useful capacity. Training demand historically favored large contiguous clusters. Serving is more fungible and can often use smaller deployments. Vahdat said this shift should naturally help unstrand some smaller power sites, including sites below the scale that hyperscalers might prefer for large training clusters. A 100-megawatt site may not satisfy a gigawatt-scale training need, but it can serve tokens.

He cautioned, however, that unstranding smaller sites will not meet total demand. Scale will still have benefits, and the industry will still need to concentrate large amounts of power in some locations. Serving changes the shape of demand; it does not eliminate the need for large-scale energy and infrastructure buildout.

On supply chains, Vahdat said Google is deeply engaged across the stack, with teams in places including Taiwan, South Korea, and Thailand. He was not worried about Google securing its fair share of supply. The harder question, again, is efficient use of capacity.

He also resisted a zero-sum framing of component supply. If a component vendor has one customer that offers to buy all output for three years, that may be attractive in the short term but bad for the vendor’s long-term business. Concentration risk matters. Vendors generally want customer diversity, and investors do not like a business whose revenue depends overwhelmingly on one or two customers. Vahdat tied that point to a broader view of AI infrastructure as an ecosystem with many future winners, rather than a single-company contest.

Optical circuit switches make the topology programmable, not magical

Amin Vahdat separated where optical circuit switching helps from where it does not. Google does not use it everywhere. It is not appropriate for on-chip networks, and not for large portions of the wide-area network. Even inside the data center, it augments rather than replaces electrical packet switching.

Within a TPU rack, Vahdat said, the connections are copper point-to-point links. Between racks, Google uses optical circuit switches to create a three-dimensional torus. The main reason is reliability.

If a TPU rack participates in a torus and one TPU is lost, the lattice can break. Optical circuit switching lets Google virtually remove an entire failed rack and plug another rack into the same logical position. In current racks, he said, 64 TPUs can be swapped as a unit into the topology. The system maintains the torus by changing the optical paths under software control.

Vahdat described the mechanism physically. An optical circuit switch contains a square chip with, in his example, 136 mirrors, though the number could vary. Fibers from racks connect into the switch. Light exits a fiber, hits a mirror, and is reflected toward an output port. Each mirror can be rotated in three dimensions under MEMS control, allowing software to choose which output port the light reaches. Functionally, it is like unplugging and replugging fiber without humans.

That gives the data center a programmable topology. If one rack needs to be removed, another can be inserted into the same position in seconds, assuming spare racks are available. Those spare racks do not have to sit idle; they can run smaller computations until needed. Vahdat called this ability to recover from failures quickly a differentiator for TPU availability.
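The rack-swap idea can be modeled as a mapping from logical torus positions to physical racks, where repair changes the mapping rather than the torus. This is a toy model under stated assumptions, not Google's control software: the class, method, and rack names are hypothetical, and the mirror steering is reduced to a dictionary update.

```python
from dataclasses import dataclass, field
from itertools import product

Slot = tuple[int, int, int]


def torus_neighbors(slot: Slot, dims: Slot) -> set[Slot]:
    """Logical neighbors of a slot in a 3D torus (with wraparound)."""
    result = set()
    for axis in range(3):
        for delta in (-1, 1):
            coords = list(slot)
            coords[axis] = (coords[axis] + delta) % dims[axis]
            result.add(tuple(coords))
    return result


@dataclass
class TorusFabric:
    slot_to_rack: dict[Slot, str]
    spares: list[str] = field(default_factory=list)

    def replace_failed(self, failed_rack: str) -> Slot:
        """Point the failed rack's logical slot at a spare rack."""
        for slot, rack in self.slot_to_rack.items():
            if rack == failed_rack:
                if not self.spares:
                    raise RuntimeError("no spare rack available")
                # Stand-in for re-aiming the mirrors under software control.
                self.slot_to_rack[slot] = self.spares.pop()
                return slot
        raise KeyError(failed_rack)


dims = (2, 2, 2)
slots = list(product(range(2), repeat=3))
fabric = TorusFabric({s: f"rack-{i}" for i, s in enumerate(slots)}, spares=["spare-0"])
repaired = fabric.replace_failed("rack-3")
print(repaired, "->", fabric.slot_to_rack[repaired])
print("neighbors unchanged:", sorted(torus_neighbors(repaired, dims)))
```

Because neighbors are defined by slot coordinates rather than by rack identity, the logical lattice is the same before and after the swap; only the physical rack behind one position changes.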

The second use case is bandwidth steering between clusters. Vahdat described a higher-level optical circuit switching layer that can point bandwidth toward the cluster where a job’s storage is located. Instead of provisioning layer upon layer of general-purpose electrical packet switches and miles of fiber for full bandwidth everywhere, the system can create a high-bandwidth direct path for the duration of a job.

If Borg schedules a five-hour job that needs storage in a particular cluster, the scheduler can point the mirrors there for five hours. The next five-hour job may need storage elsewhere, so the topology can be reconfigured. This is not per-packet flexibility. If the workload suddenly needs full bandwidth somewhere else at second granularity, it will fall back on electrical packet switching, but not necessarily with the same bandwidth.
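Job-granularity steering can be pictured as a lookup from time to circuit target. The sketch below is a hypothetical illustration, not a description of Borg's interfaces: the `Job` record, the cluster names, and the function are made up, and a `None` result stands for falling back to the electrical packet-switched network.

```python
from typing import NamedTuple, Optional


class Job(NamedTuple):
    start_hour: float
    duration_hours: float
    storage_cluster: str


def circuit_target(jobs: list[Job], hour: float) -> Optional[str]:
    """Cluster the optical circuit points at during `hour`, if any."""
    for job in jobs:
        if job.start_hour <= hour < job.start_hour + job.duration_hours:
            return job.storage_cluster
    return None  # no circuit configured; traffic uses packet switching


# Two back-to-back five-hour jobs whose storage lives in different clusters.
schedule = [Job(0.0, 5.0, "cluster-a"), Job(5.0, 5.0, "cluster-b")]
for hour in (1.0, 6.0, 11.0):
    print(f"hour {hour:>4}: circuit -> {circuit_target(schedule, hour)}")
```

The topology changes a handful of times a day here, not per packet, which matches the bounded role described for the switches.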

Vahdat’s description was deliberately bounded: optical circuit switches have a role; they are not a magic bullet. Google still uses many electrical packet switches. The value is not universal fungibility but the ability to put high bandwidth where it is predictably needed, while also preserving a topology through failures.

The choice of torus topology came from the dominant communication pattern in early ML training. Vahdat said all-reduce was the number-one collective, and a torus is well suited to disseminating parameters with a small amount of computation at each step. If the workload is arbitrary all-to-all communication, a switched topology such as a fat-tree Clos would be better. Model designers, he added, can and do work around topology constraints in clever ways.
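To show why neighbor-only links suit all-reduce, here is a minimal simulation of the textbook ring all-reduce; a ring is a one-dimensional torus. It is a sketch of the general algorithm, not of Google's TPU collectives, and for brevity each node's data is split into exactly as many single-number chunks as there are nodes.

```python
def ring_all_reduce(data: list[list[float]]) -> list[list[float]]:
    """Sum-all-reduce over a ring; node i only ever sends to node i + 1."""
    n = len(data)
    if any(len(row) != n for row in data):
        raise ValueError("each node must hold exactly n chunks")
    bufs = [list(row) for row in data]

    # Reduce-scatter: after n - 1 steps, node i holds the full sum
    # for chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n) for i in range(n)]
        values = [bufs[i][chunk] for i, chunk in sends]  # snapshot before updating
        for (src, chunk), value in zip(sends, values):
            bufs[(src + 1) % n][chunk] += value

    # All-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n) for i in range(n)]
        values = [bufs[i][chunk] for i, chunk in sends]
        for (src, chunk), value in zip(sends, values):
            bufs[(src + 1) % n][chunk] = value

    return bufs


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)  # every node ends with [111, 222, 333]
```

Each step does one small addition and one transfer to a direct neighbor, so no switch is needed between nodes; arbitrary all-to-all traffic has no such structure, which is where a switched topology does better.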

Specialization is becoming worth the loss of fungibility

Asked whether Google’s goal is for TPUs to “beat” GPUs, Amin Vahdat rejected the premise. He said the market is expanding so dramatically that there is no simple winning and losing frame. Google buys, sells, and uses a large number of GPUs, and he described them as fantastic products. TPUs are aimed at different domains and customer use cases.

The more important shift, in his account, is specialization. Vahdat said Google had recently announced its eighth-generation TPUs: 8I for inference and 8T for training. For the first time, Google is launching two TPU chips in one year, specializing the line rather than using a single chip for both serving and training.

Previously, one fungible chip made sense. If one chip would be 5 percent better for training and another 5 percent better for serving, the operational value of a single general-purpose TPU could outweigh the specialized gain. But Vahdat said the needs of inference and training are now diverging enough that specialization produces major uplift.

The key difference is system balance. Inference and training need different ratios of memory, compute, and networking. Designing different chips lets Google tune those ratios for the workload rather than force one design to serve both.

Vahdat situated this in a broader hardware trend. General-purpose CPUs are not going away, because they can run anything, but their year-over-year performance-efficiency gains have slowed. When demand keeps rising, waiting for CPUs to get faster is not enough. Large workloads have to be identified and accelerated directly. TPUs are one example: for the domains where they run, Vahdat said they can be 100 times more efficient than a CPU.

That does not mean every future accelerator must be a TPU. Some other large workload may not require tensors or matrix algebra. It may need a different balance point and a different specialized device. The direction Vahdat expects is more specialization, workload by workload.

This logic also informed his answer on robotics. He described robotics as an exciting domain and pointed to Waymo as, in his view, the best example of advanced robotics operating in complex real-world scenarios. But robotics changes the infrastructure constraints. Latency matters, and safety becomes primary. If a safety-critical algorithm needs to run, the system cannot tolerate variability from remote scheduling or a context switch at the wrong moment. Similar scaling laws may apply, but the amount of remote data-center scale a robotics application can depend on is much smaller. Counting on 20,000 TPUs in a data center a thousand miles away may or may not work, depending on the application.

Google’s code red reorganized the company around AI infrastructure

Amin Vahdat described the period after ChatGPT’s launch as a time when Google changed as a company. Asked what the “ChatGPT code red” was like, he called it “a great time” and said he now speaks of November 2022 internally, and “frankly fondly.”

The institutional change he emphasized was not only product urgency. He credited Sundar Pichai with a major reorganization, including the higher-profile move of bringing Brain and DeepMind together, and the lower-profile move of bringing different infrastructure teams together under Vahdat’s leadership. Vahdat said the infrastructure consolidation allowed more speed and unification, while also stressing that he did not attribute the outcome to himself personally.

He described Google’s culture as different from three and a half years earlier and called the period a reinvention. A year before the lecture, he said, he probably would not have said Google was through that transition. By the time of the class, he said, he thought it was. He gave credit to Pichai, Demis Hassabis, and Jeff Dean.

The answer fits Vahdat’s broader account of infrastructure as an end-to-end discipline. The code red, in his telling, did not simply increase demand for accelerators. It changed organizational boundaries so model work, product pressure, and infrastructure execution could move with fewer seams.

Energy is the bottleneck with the least obvious answer

When asked for the true bottlenecks, Amin Vahdat refused to name one universal constraint. The bottleneck shifts: memory supply one day, cluster reliability another, a particular training run the next. But if forced to name the area where he has the least satisfying answer, he chose energy.

Scaling energy to the level AI infrastructure needs across the planet is possible in some ways, he said, but many are brute force and expensive — not only in dollars. The goal is energy abundance and affordability. That is the innovation bottleneck he understands least well.

In the United States, Vahdat said the community could look much more seriously at wind, solar, and batteries. Google is doing so, but he described the broader challenge as a manufacturing and scaling process with physics and time constraints. Solar, wind, and batteries are proven elsewhere in the world, relatively affordable, and comparatively fast to manufacture and deploy at significant capacity.

A classroom question asked about a company that had reportedly raised money to build data centers as distributed floating pods. Vahdat answered by broadening the category: Google and others are looking at data centers in space. He said space could offer energy that is 5 times more efficient, and that a sun-synchronous orbit could provide 24/7 power with little or no battery requirement. But he characterized directions such as these as risky and relatively far out — on the order of five to ten years. His preferred posture was a portfolio: explore promising long-term options while continuing to deploy proven approaches.

On whether hardware will eventually stop being a bottleneck, Vahdat was clear: he sees no point in the foreseeable future where it does. He cited Rich Sutton’s “Bitter Lesson” essay as the relevant principle: historically, more compute has tended to produce better AI results. Even a major algorithmic breakthrough would not remove the constraint. Transformers, he said, were roughly 5 times more efficient than the previously dominant LSTM approach for comparable results. If another “Transformers Prime” delivered another 5x efficiency gain, Vahdat believes the freed capacity would still be used quickly and usefully.

A data center has to be a grid and community asset

Amin Vahdat closed the infrastructure argument by returning to responsibility. His stated goal for Google data centers is that they should be an uplift for the local community and the grid. That includes noise, water, power, jobs, and access to technology. He did not deny concerns about data centers across the country and the world, but said Google is working proactively and that he takes the issue seriously.

Our data centers should be an uplift for the local community and an uplift for the grid.

Amin Vahdat

The concrete example was power efficiency versus water use. Historically, he said, Google considered two data center designs. One was 10 percent more power efficient but used more water. The other used essentially no water but was 10 percent less power efficient. At gigawatt scale, 10 percent is enormous: it can represent 100 megawatts of usable power.

From a company-wide or bottom-line perspective, using more water to save power might appear rational. But in a water-scarce community, Vahdat said, that could be a huge net negative. He described the decision as: unless there is abundant water in a particular community, and unless that community would prefer the more power-efficient design, Google would use the less power-efficient design that consumes almost no water.

The second example was demand response. Vahdat said Google has developed technologies for a gigawatt of demand response across the country. The idea is to give capacity back to utilities during the hottest or coldest periods of the year, when residential demand peaks and the grid must guarantee power to homes. A data center can power down some load for those periods, accepting downtime in exchange for reducing the amount of overprovisioning the utility needs.

This connects directly to the earlier reliability tradeoff. If some frontier workloads can tolerate lower availability in exchange for more capacity, then data centers can become flexible grid resources rather than fixed loads. Google can take power most of the year and return 100 megawatts, in Vahdat’s example, when the utility needs it for households.

The concluding standard was end-to-end responsibility. Building a gigawatt is not merely an abstract line in a spreadsheet. It is a massive deployment in a real place — Vahdat used Utah as the example — and it needs to be welcomed there. That requires efficient capacity delivery for users and customers, but also a check mark that the deployment is an asset to the grid and the community.
