Cost Per Token Is Replacing FLOPS as the AI Infrastructure Metric

Shruti Koparkar · NVIDIA · Thursday, May 21, 2026 · 12 min read

Shruti Koparkar of NVIDIA’s Accelerated Computing team argues that AI infrastructure should be evaluated by token economics rather than by GPU-hour pricing or FLOPS per dollar. On NVIDIA’s AI Podcast, she lays out a four-part framework — token utility, supply, demand and monetization — in which cost per token becomes the central measure of business value. Koparkar says NVIDIA Blackwell’s system-level design delivers 50 times more tokens per watt than Hopper and 35 times lower token cost, while lower token costs will expand GPU demand by making more AI workloads economically viable.

Tokenomics starts with the value of the token, not the cost of the GPU

Shruti Koparkar defines tokenomics as the economics of how tokens are “valued, supplied, consumed, and monetized.” In her framework, those map to four pillars: token utility, token supply, token demand, and token monetization. Utility is the value of the token itself. Supply is the infrastructure question: what systems maximize token output while minimizing cost. Demand is the operating forecast: users, use cases, volume, velocity, and variability. Monetization is the conversion of token output into business value.

Her first constraint is that tokens should not be treated as uniform units. A token’s value depends on two factors: the intelligence it carries and how fast it arrives.

Not all tokens are created equal.

Shruti Koparkar

The intelligence in a token depends partly on the model that produced it. More complex models generally produce more intelligent tokens. It also depends on context length: the more context a model can inspect, the more accurate and intelligent the output can be, though Koparkar notes a caveat that excessively long context can sometimes degrade output quality.

The second part of token value is interactivity, which she describes as tokens per second per user. A use case that needs fast, responsive generation values tokens differently from one that can tolerate slower output. Koparkar places token value on a spectrum: at one end are basic models with shorter context and slower generation; at the other are more complex models with larger context windows and faster tokens.

The important business judgment is not simply to buy the most intelligent token available. Koparkar distinguishes absolute value from relative value. A more capable model may produce higher-value tokens in the abstract, but that extra value can be useless if the task does not require it. In a narrow domain-specific application, she says, a fine-tuned small language model can provide exactly the needed value and, in some cases, better accuracy for that task than a larger model.

The same applies to speed. Agentic applications may require highly interactive tokens because multiple model and tool calls must complete before the user receives a final result. Chat interfaces or enterprise search may not need that level of interactivity. For Koparkar, the practical act is mapping each use case to the right point on the intelligence-and-interactivity spectrum, rather than assuming one model class or one infrastructure profile is optimal for every workload.
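
One way to make that mapping concrete is to pick the least expensive option that still meets a use case's requirements. The sketch below is illustrative only: the tier names, quality scores, speeds, and prices are hypothetical placeholders, not figures from Koparkar, and `cheapest_fit` is a name invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tier:
    name: str
    task_quality: float          # hypothetical 0-1 accuracy score on the target task
    tokens_per_sec_user: float   # interactivity: tokens per second per user
    cost_per_m_tokens: float     # hypothetical USD per million tokens


# Hypothetical catalog; every number here is made up for illustration.
TIERS = [
    Tier("small-finetuned", 0.80, 60.0, 0.20),
    Tier("mid-general", 0.85, 40.0, 1.00),
    Tier("large-reasoning", 0.95, 120.0, 6.00),
]


def cheapest_fit(min_quality: float, min_speed: float,
                 tiers: List[Tier] = TIERS) -> Optional[Tier]:
    """Return the lowest-cost tier that meets both requirements, or None."""
    fits = [t for t in tiers
            if t.task_quality >= min_quality and t.tokens_per_sec_user >= min_speed]
    return min(fits, key=lambda t: t.cost_per_m_tokens) if fits else None


# A search-style workload tolerates modest quality and speed requirements;
# an agentic workload demands both high quality and high interactivity.
enterprise_search = cheapest_fit(min_quality=0.75, min_speed=30.0)
agentic = cheapest_fit(min_quality=0.90, min_speed=100.0)
print(enterprise_search.name, agentic.name)
```

With these placeholder numbers, the search workload lands on the small fine-tuned tier even though a more capable tier exists, which is the relative-value point: extra capability the task does not need is not worth paying for.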

Demand forecasts have to include invisible tokens and agent loops

Koparkar describes token demand forecasting in layers. The simplest version is “back of napkin math”: number of users, multiplied by the number of requests or sessions per user over a given period, multiplied by tokens per request or session. That gives a base estimate for daily, monthly, or other time-period demand.

In plain terms: base token demand equals users times sessions per user times tokens per session. But she argues that this base view is too simplistic unless it includes several multipliers.
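
As a minimal sketch, the napkin math can be written as a single function. The function name and the sample figures are assumptions chosen for illustration, not numbers from the podcast.

```python
def base_token_demand(users: int, sessions_per_user: float,
                      tokens_per_session: float) -> float:
    """Base demand for one period: users x sessions per user x tokens per session."""
    return users * sessions_per_user * tokens_per_session


# Hypothetical product: 10,000 users, 3 sessions each per day, 2,000 tokens per session.
daily = base_token_demand(users=10_000, sessions_per_user=3, tokens_per_session=2_000)
print(f"{daily:,.0f} tokens per day")  # prints "60,000,000 tokens per day"
```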

The first is reasoning. Reasoning models use “thinking tokens” that are never shown to the end user. AI deployments can often set thresholds for how many thinking tokens are allowed per interaction, so demand forecasting must account for whether reasoning models are used, what thresholds are set, and what peak and average use may look like.

The second multiplier is agentic workflow design. In an agentic application, a user prompt may trigger multiple model turns, tool calls, sub-agent calls, code execution, or other intermediate steps before an answer or action is returned. That makes token demand significantly higher than in a simple conversational exchange.

The third factor is cache behavior. Koparkar explains the KV cache as “the short-term memory of a model.” When a request has been seen before, stored cache values can sometimes be reused rather than recomputed. Cache hit rate therefore changes the amount of fresh computation needed to serve demand.

Finally, demand is not static. Businesses need to model variability within a day, seasonal surges, and user growth. A product may be heavily used in the morning and lightly used in the evening, or vice versa. Retail and e-commerce businesses may see holiday surges. A company that begins with a known number of users still needs to account for the growth it is trying to create.

Forecast layer	What to estimate	Why it changes demand
Base usage	Users, sessions per user, tokens per session	Sets the initial daily, monthly, or other-period demand estimate
Reasoning	Use of reasoning models, thinking-token thresholds, peak and average thinking-token use	Adds tokens that are consumed but not shown to the end user
Agentic workflows	Number of turns, tool calls, sub-agent calls, and loops	Multiplies model calls before a final result is returned
Cache behavior	KV cache hit rate	Changes how often repeated inputs must be recomputed
Variability and growth	Intraday patterns, seasonality, and expected user growth	Determines peak capacity needs and future supply requirements

Koparkar’s demand forecast starts with base usage, then adjusts for reasoning, agents, cache behavior, variability, and growth

The common thread is that token demand is not simply a user-count problem. Reasoning tokens, agentic turns, cache reuse, seasonality, and growth can materially change the infrastructure requirement.
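
The layers can be combined in a rough model. The sketch below is one possible way to do it, not Koparkar's formula: how the multipliers interact (for example, treating cache hits as reducing only input tokens, and growth as scaling the user count) is an assumption, and every field name and number is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DemandForecast:
    users: int
    sessions_per_user: float
    input_tokens_per_call: float
    output_tokens_per_call: float
    thinking_tokens_per_call: float = 0.0  # reasoning tokens never shown to the user
    calls_per_session: float = 1.0         # agentic turns, tool calls, sub-agent calls
    cache_hit_rate: float = 0.0            # assumed fraction of input tokens reused from KV cache
    growth_factor: float = 1.0             # expected user growth over the planning horizon
    peak_to_average: float = 1.0           # intraday or seasonal peak relative to average

    def average_tokens(self) -> float:
        """Tokens needing fresh computation per period, at average load."""
        calls = (self.users * self.growth_factor
                 * self.sessions_per_user * self.calls_per_session)
        fresh_input = self.input_tokens_per_call * (1.0 - self.cache_hit_rate)
        generated = self.output_tokens_per_call + self.thinking_tokens_per_call
        return calls * (fresh_input + generated)

    def peak_tokens(self) -> float:
        """Same estimate scaled to the peak period, for capacity planning."""
        return self.average_tokens() * self.peak_to_average


forecast = DemandForecast(
    users=1_000, sessions_per_user=2,
    input_tokens_per_call=1_000, output_tokens_per_call=500,
    thinking_tokens_per_call=1_500, calls_per_session=5,
    cache_hit_rate=0.5, growth_factor=1.5, peak_to_average=2.0,
)
print(f"average: {forecast.average_tokens():,.0f}")
print(f"peak:    {forecast.peak_tokens():,.0f}")
```

In this made-up scenario the base estimate (1,000 users, 2 sessions, 1,500 visible tokens per session) would be 3 million tokens, while the layered estimate is 37.5 million on average and 75 million at peak, which illustrates how far the multipliers can move the infrastructure requirement.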

Cost per token is the infrastructure metric Koparkar wants leaders to use

On the supply side of tokenomics, AI infrastructure decisions become economic decisions. The objective, as Koparkar frames it, is maximum token availability and output at the lowest token cost.

She cautions that organizations often default to metrics that are easy to obtain but incomplete: cost per GPU hour or FLOPS per dollar. She calls these “input metrics” because they describe what is being bought, not what is being delivered. They do not measure the actual token output a system produces.

Cost per token is her preferred metric because it connects the input cost to the delivered output. In its simplest form, she describes it as the cost of the GPU divided by the number of tokens the GPU produces. It is meant to capture the real return on infrastructure because the business ultimately runs on token output, not on the abstract availability of compute.
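
In its simplest form, that calculation might look like the sketch below. The two systems and their prices and throughputs are hypothetical, chosen only to show how an input metric and an output metric can rank the same hardware differently; they are not Blackwell or Hopper figures.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """GPU cost divided by tokens produced, expressed per million tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical systems: B costs twice as much per hour but produces 10x the tokens.
system_a = cost_per_million_tokens(gpu_cost_per_hour=2.00, tokens_per_second=1_000)
system_b = cost_per_million_tokens(gpu_cost_per_hour=4.00, tokens_per_second=10_000)

print(f"A: ${system_a:.2f} per million tokens")
print(f"B: ${system_b:.2f} per million tokens")
print(f"B is {system_a / system_b:.1f}x cheaper per token despite the higher hourly price")
```

A buyer comparing only hourly price would prefer system A; a buyer comparing cost per token would prefer system B.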

It is kind of a fundamental mismatch if you are evaluating infrastructure based on the inputs but your business runs on the output.

Shruti Koparkar

Koparkar does not present cost per token as identical across use cases. It varies with model complexity, context, intelligence, and interactivity. Tokens generated by a more complex model or delivered at higher interactivity will be more expensive. She describes that as “just physics.” But she still treats cost per token as the base metric because it reflects both what the buyer pays and what the system actually produces.

Her example is the comparison between NVIDIA Blackwell and NVIDIA Hopper. Judged only on input metrics, she says, Blackwell may appear 2x more expensive on hourly GPU cost while delivering 2x more FLOPS per dollar. That is an advantage, but it understates the delivered-output difference. Koparkar says Blackwell delivers 50x more tokens per watt than Hopper, and that the Blackwell NVL72 system delivers 50x more tokens than Hopper within the same infrastructure footprint. That translates, she says, into a 35x lower token cost.

50x more tokens per watt for NVIDIA Blackwell compared with Hopper, according to Koparkar

35x lower token cost from Blackwell versus Hopper, according to Koparkar

The implication is not that FLOPS or GPU-hour pricing are meaningless. It is that they can miss the system-level result. If a buyer is trying to produce intelligence in the form of tokens, the relevant comparison is not only how much compute is purchased, but how much usable token output the full system produces at a given cost.

Extreme co-design is presented as the route from spec-sheet capacity to delivered tokens

Koparkar draws a sharp distinction between integration and co-design. Integration, in her description, means independent parts assembled after the fact. Co-design means designing multiple parts of the same system simultaneously, from the ground up, toward a shared outcome: lowest token cost.

NVIDIA calls its approach “extreme co-design” because, Koparkar says, it extends across compute, memory, storage, networking, software, and ecosystem partners. She points to the Vera Rubin platform as an example, saying it has seven chips, but adds that the co-design extends beyond hardware into CUDA kernels, runtimes, serving software, silicon partners, OEMs, cloud providers, operating systems, and frameworks.

Her concrete examples center on the kinds of workloads that make system-level optimization necessary. Mixture of Experts models benefit, she says, from Blackwell NVL72’s inter-GPU communication and from software including Dynamo, disaggregated serving, and runtimes such as TensorRT, vLLM, and SGLang. She names wide expert parallelism as a technique that optimizes inference performance and reduces cost per token for Mixture of Experts models.

Agentic AI is the broader workload case. Koparkar contrasts it with ordinary conversational use. In a conversational setting, a human and an LLM take turns: the user prompts, the model responds, the user follows up, and the model responds again. In an agentic setting, AI takes turns with AI and with software. A main agent may reason, call a tool, invoke a specialized sub-agent, receive a result, and continue through more steps before returning an outcome. The user may only provide the initial prompt, such as booking a ticket to Miami, while the system performs multiple intermediate actions.

That structure changes both token demand and latency requirements. The number of LLM calls is higher. Token demand is higher. A few extra milliseconds on each turn can compound into several seconds of end-to-end delay.
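
The compounding is simple arithmetic, sketched below under the assumption that the calls in an agent loop run one after another rather than in parallel. The per-call overhead and call count are hypothetical.

```python
def added_delay_seconds(extra_ms_per_call: float, sequential_calls: int) -> float:
    """End-to-end delay added when each sequential call is slower by extra_ms_per_call."""
    return extra_ms_per_call * sequential_calls / 1000.0


# Hypothetical agent run: 60 sequential model and tool calls, each 50 ms slower.
delay = added_delay_seconds(extra_ms_per_call=50, sequential_calls=60)
print(f"{delay:.1f} s of added end-to-end latency")  # prints "3.0 s of added end-to-end latency"
```

The same 50 ms would be barely noticeable in a single conversational turn, which is why per-call latency matters more for agentic workloads.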

Koparkar describes the Vera Rubin platform as built for agentic AI because the workload stresses multiple parts of the system at once. In her account, the Rubin GPU accelerates the LLM and reasoning work. She also refers to a “Groq 3 LPX solution” in connection with ultra-low latency, though the term is not further explained in the source. She says Vera CPU supports tool calling, sandboxing, code generation, and code testing, while a platform involving BlueField DPUs and Spectrum-X supports offloading KV cache so it can be retrieved when needed and matched with incoming requests.

The broader point is that agentic workloads require coordinated acceleration across several parts of the system. These are not separable components whose value is fully visible on a spec sheet. The delivered cost per token depends on how the stack behaves under the workload.

Software determines whether the hardware’s theoretical advantage becomes real output

Software, in Koparkar’s account, is the difference between spec-sheet capability and real-world token output. Hardware system design cannot be fully realized unless the software stack can exploit it.

She emphasizes that software optimization cannot be piecemeal. A system needs a software stack capable of enabling multiple optimizations together: NVFP4 quantization, MTP or speculative decoding, disaggregated serving, wide expert parallelism, KV cache offloading, KV-aware routing, and other techniques. The value comes from stacking those optimizations, not from enabling one isolated improvement.

That software stack is part of how she explains the 50x throughput and 35x cost-per-token improvement she attributes to Blackwell. But she also stresses that software keeps improving after the hardware ships. Open-source software, she says, “never stops,” and NVIDIA is not the only contributor. OSS frameworks, partners, customers, and developers all contribute incremental improvements that compound.

As an example, she says vLLM and SGLang, two inference runtimes, have shown 8x more performance in about six months. From the same infrastructure footprint, that means more token output and lower token cost.
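
A small worked example of that relationship, assuming the infrastructure cost stays fixed while throughput rises. The starting cost is hypothetical, and the monthly rate assumes the gain compounded smoothly, which the source does not claim.

```python
def cost_after_speedup(cost_per_m_tokens_before: float, speedup: float) -> float:
    """Cost per million tokens after a throughput gain on the same fixed-cost footprint."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return cost_per_m_tokens_before / speedup


before = 8.00                             # hypothetical USD per million tokens
after = cost_after_speedup(before, 8.0)   # the 8x software gain Koparkar cites
implied_monthly_gain = 8.0 ** (1 / 6)     # assumes smooth compounding over six months

print(f"${before:.2f} -> ${after:.2f} per million tokens")
print(f"implied gain of about {implied_monthly_gain:.2f}x per month")
```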

8x performance improvement in about six months for vLLM and SGLang, according to Koparkar

In Koparkar’s framing, software is not an accessory to token economics. It is one of the mechanisms that changes how many tokens the same infrastructure footprint can produce.

Monetization depends on token cost, token value, and demand distribution

In Koparkar’s framework, lower cost per token matters commercially because it is one side of the pricing exercise: a company has to know what it costs to produce a token before deciding how to price, package, or embed the intelligence it delivers. Once an organization understands what kind of token value it must provide and what it costs to produce that output, it can decide how to turn token production into business value.

With token utility, demand, and supply established, Koparkar turns to monetization. Her simplest proxy is direct: tokens are generated and sold. The pricing question then becomes how much they can be sold for.

She frames pricing through three considerations. The first is cost-based pricing: what it costs to produce the token. A company needs to understand the token’s utility and the cost of producing that level of value, then charge more than that cost.

The second is value-based pricing: what customers are willing to pay for the token’s utility. A token that enables a valuable outcome can command more than one that does not.

The third is demand distribution. Businesses have revenue and margin goals, so they need to understand where the bulk of demand will sit and how demand tapers at different utility levels. Tokens with less utility may have few takers. Highly valuable tokens may also have fewer buyers because fewer customers will pay the premium. Pricing has to account for the shape of demand, not only the average token cost.
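
The three considerations can be combined in a toy pricing model: cost sets a floor, customer value sets a ceiling, and the demand curve determines which price in between yields the most margin. This is a sketch under stated assumptions, not a method Koparkar describes; `best_price` and the demand table are hypothetical.

```python
from typing import Dict, Optional, Tuple


def best_price(cost_per_m: float, value_ceiling_per_m: float,
               demand_by_price: Dict[float, float]) -> Optional[Tuple[float, float]]:
    """Return (price, margin) maximizing margin among prices above cost and within the ceiling.

    demand_by_price maps a price per million tokens to the expected volume
    (in millions of tokens) that customers would buy at that price.
    """
    margins = {
        price: (price - cost_per_m) * volume
        for price, volume in demand_by_price.items()
        if cost_per_m < price <= value_ceiling_per_m
    }
    if not margins:
        return None
    price = max(margins, key=margins.get)
    return price, margins[price]


# Hypothetical demand curve: volume tapers as price rises.
demand = {0.5: 1000, 1.0: 800, 2.0: 500, 4.0: 100, 8.0: 20}
result = best_price(cost_per_m=0.8, value_ceiling_per_m=5.0, demand_by_price=demand)
print(result)
```

With these made-up numbers the best price is neither the lowest price above cost nor the highest price customers would tolerate, because the shape of demand matters as much as the cost floor.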

Koparkar adds that direct token pricing is only one proxy. Some companies build value-added services on top of tokens, including AI-native products. In those cases, the same process applies, but the company must also price the additional value it creates above raw token generation.

She names four broad business models for turning tokens into business value. The first is selling tokens directly; she cites Fireworks, Baseten, Together AI, and DeepInfra as examples of companies helping customers build services on top of tokens. The second is AI-native products built from the ground up around AI; she cites Perplexity and Cursor. The third is adding AI to existing products; she mentions Shopify, Airbnb, and Adobe, noting that Adobe has built the Firefly family of models and uses those models to add capabilities to Photoshop. The fourth is internal operational improvement: using AI to improve processes and employee productivity rather than selling an external AI product.

The framework is deliberately broad. Koparkar says there are likely more nuanced models, but these four are the main patterns she uses to describe how token production becomes business value.

Model	Description	Examples named by Koparkar
Sell tokens directly	Provide token output that customers use to build services.	Fireworks; Baseten; Together AI; DeepInfra
Build AI-native products	Create products with AI embedded from day one.	Perplexity; Cursor
Enhance existing products	Infuse AI capabilities into products already in market.	Shopify; Airbnb; Adobe (Photoshop with Firefly capabilities)
Improve internal operations	Use AI to improve processes and employee productivity.	Koparkar describes this as relevant to almost every organization

Koparkar’s four business models for converting tokens into business value

Lower token cost does not mean lower GPU demand

Podcast host Noah Kravitz asks whether lower cost per token eventually means fewer GPUs are needed to produce the required number of tokens. Koparkar’s answer is “absolutely no.” She points to the Jevons paradox: as efficiency improves, new use cases are unlocked, and demand expands to absorb the efficiency.

Her macro pattern is generative AI’s own progression. When generative AI became common, people used it for tasks such as summaries and image generation. Lower token cost did not reduce the need for GPUs. Instead, it enabled more token consumption through test-time scaling and reasoning, because researchers found that scaling computation at inference time could produce better, more accurate, more intelligent responses.

She says the pattern is repeating with agentic AI. As Mixture of Experts models and reasoning models become more efficient to deploy, and as their cost per token falls, another inflection point appears: more tokens become available, and developers find more ways to use them. Agentic workflows consume the efficiency through more turns, more model calls, and more automation.

Kravitz puts the demand-side intuition plainly: “People aren’t going to run away from intelligence, they want to use it.” Koparkar agrees, and says she has seen the same dynamic at both the macro level and with individual customers.

The result, in her account, is that more efficient GPUs do not simply reduce the amount of infrastructure needed for existing workloads. They change what workloads are economical enough to attempt, which creates new demand for token output and continued demand for GPUs.
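
The arithmetic behind that argument can be sketched as follows. All figures are hypothetical and the demand expansion is an assumed input, not a prediction: the point is only that GPU count depends on the ratio of demand growth to efficiency growth.

```python
import math


def gpus_needed(demand_tokens_per_sec: float, tokens_per_sec_per_gpu: float) -> int:
    """GPUs required to serve a given token demand at a given per-GPU throughput."""
    return math.ceil(demand_tokens_per_sec / tokens_per_sec_per_gpu)


before = gpus_needed(1_000_000, 1_000)             # baseline fleet
same_demand = gpus_needed(1_000_000, 10_000)       # 10x efficiency, demand unchanged
expanded_demand = gpus_needed(30_000_000, 10_000)  # 10x efficiency, demand grows 30x

print(before, same_demand, expanded_demand)  # prints "1000 100 3000"
```

If demand stayed flat, the fleet would shrink by 10x. If cheaper tokens unlock enough new workloads that demand grows faster than efficiency, the fleet grows instead.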

The practical starting point is the use case

For organizations trying to apply the framework, Koparkar recommends starting from the final outcome and working backward. That usually means starting with the customer need, whether the customer is external or an internal employee using AI to improve a business process.

The use case determines the rest of the chain. It shapes the model choice, the required context length, and the needed interactivity. Those factors determine the kind of intelligence and speed the tokens must provide. From there, the organization can estimate token demand, evaluate token supply, and use cost per token as the relevant infrastructure metric. Once utility, demand, and supply are understood, monetization strategy can be built around them.

That ordering matters. Koparkar’s framework does not begin with GPU procurement or abstract model ambition. It begins with what the user needs, then asks what kind of token value that need requires, how many such tokens the system must supply, what they cost to produce, and how the organization captures value from them.
