AI Application Companies Are Moving Beyond Frontier APIs to Protect Margins

Tuhin Srivastava | Stanford Online | Friday, June 5, 2026 | 18 min read

Baseten founder and chief executive Tuhin Srivastava used a Stanford MS&E435 seminar with instructor Apoorv Agrawal to argue that inference is becoming the cost of goods sold for AI applications. His case is that scaled AI companies will need to move beyond default frontier-model APIs toward custom or post-trained models, both to improve margins and to protect the workflows and user signals that make their products defensible. Baseten’s role, as Srivastava framed it, is to provide the production inference stack and compute access needed to run that custom intelligence at scale.

Inference is becoming the cost of goods sold for AI applications

Apoorv Agrawal opened with the premise that inference is about to rise “a billion X,” and Tuhin Srivastava framed Baseten around the same underlying claim: inference is the mechanism by which AI value is delivered, and therefore the recurring cost line item that grows with usage.

Srivastava’s shorthand was direct: “Inference is the cogs of AI value being delivered.” For AI-first applications, the model call is not a background cloud expense; it is often the product itself. Agrawal later relayed a founder’s view of the problem: inference is “the core service that our customers are consuming,” so disruption in inference is disruption in the product. Srivastava put it even more plainly: once a company is “in token path,” if inference is down, the product is down.

Baseten’s business is built around that dependency. The company runs production inference for AI application companies, especially those trying to operate custom or post-trained models at scale. Srivastava said the company works with “the fastest growing AI companies in the world.” A customer-logo slide shown during the seminar named Abridge, Clay, Cursor, Decagon, OpenEvidence, Gamma, HubSpot, Lovable, Mercor, Notion, Parallel, Poolside, Superhuman, World Labs, and Writer. A second slide, headed “The most obsessive AI teams build on Baseten,” showed a grid of founder and AI-team headshots.

Two examples made the function concrete. Whisperflow, described by Srivastava as a speech-to-keyboard or voice-typing product, uses several models in the path between spoken audio and text appearing on screen. Srivastava said there are “three or four language models” and “two audio models” involved, all running on Baseten. For Whisperflow, Baseten’s job is not merely hosting; it is running the optimizations and infrastructure so the latency between speech and text is as short as possible.

Abridge is a more complex example. Srivastava described it as a healthcare ambient scribe “used in almost every healthcare system in the US” and deeply integrated with electronic medical records. He said Abridge runs about 20 models, from speech-to-text models to systems that turn what happens in a patient room or operating room into EMR-integrated clinical notes. He first said “almost” every one of those models runs on Baseten, then corrected himself to “every single one.” The requirements, in his telling, include reliability, speed, and the operational demands of a healthcare workflow.

The examples illustrate Baseten’s view of the market. Srivastava said that today roughly 90% to 95% of inference spend goes to frontier models, while roughly 5% goes to custom models or post-trained open-source models. Baseten’s thesis is that profitable, defensible AI application companies will increasingly shift toward custom intelligence they can control.

90–95%: the share of inference spend that Srivastava said currently goes to frontier models

That shift is not only about model choice. It is about whether the AI application layer can become a real business. Srivastava said customers often arrive when they have moved beyond product-market fit and need to become scaled companies with viable gross margins. At small scale, using frontier APIs may be the easy path. At large scale, “just trading tokens” becomes expensive enough that a path to profitability requires a different architecture.

Why scaled AI companies leave the default frontier-model path

For Tuhin Srivastava, the case for custom and post-trained models has two parts: a viability argument and a defensibility argument.

The viability argument begins with a claimed performance-cost gap. Srivastava said open-source models are about 90 days behind closed-source frontier models and can be run “70 to 90% cheaper.” A Baseten slide, attributed on-screen to “© BASETEN 2026,” described open and closed models as converging in capability, cost, and customization, with the lag between frontier and open source “sub-90 days.” The same slide claimed post-trained open-source models increasingly deliver frontier-level performance at 30% of the cost, and gave the example that a post-trained Kimi model could provide “Sonnet level intelligence at Flash costs,” framed as a 70% cost reduction.

Baseten slide claim	Stated implication
Open and closed models are converging	Capability, cost, and customization are moving closer together
Open-source lag is sub-90 days	Application companies do not have to wait long for strong base models
Post-trained OSS can deliver frontier-level performance at 30% of the cost	Custom models can improve gross margin
Post-trained Kimi gives Sonnet-level intelligence at Flash costs	Baseten framed the example as a 70% cost reduction

Claims shown on Baseten’s post-trained open-source model slide

A company that has enough volume to care deeply about gross margin has a strong reason to pursue that gap. Srivastava said the larger the customer, the more “existential” it becomes to move toward post-trained models. Agrawal connected that to coding companies that are not frontier-model labs themselves, saying leading coding companies are still rumored to have negative gross margins. In that context, moving token volume toward open-source or post-trained models is not an optimization exercise; it is part of making the business economically viable.
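The margin argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the revenue and spend figures are hypothetical, inference is treated as the only cost of goods sold, and the 0.30 cost ratio is taken from the slide's "30% of the cost" claim rather than from any measured result.

```python
def gross_margin(revenue: float, inference_cost: float) -> float:
    """Gross margin as a fraction of revenue, treating inference as the only COGS."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - inference_cost) / revenue


def post_trained_cost(frontier_cost: float, cost_ratio: float = 0.30) -> float:
    """Inference cost after moving the same workload to a post-trained model.

    cost_ratio=0.30 mirrors the slide's claim; it is an assumption, not a benchmark.
    """
    return frontier_cost * cost_ratio


# Hypothetical company: $100 of revenue against $120 of frontier-API spend,
# i.e. a negative gross margin of the kind rumored for some coding companies.
revenue = 100.0
frontier_cost = 120.0

margin_before = gross_margin(revenue, frontier_cost)
margin_after = gross_margin(revenue, post_trained_cost(frontier_cost))

print(f"margin on frontier APIs:    {margin_before:.0%}")
print(f"margin after post-training: {margin_after:.0%}")
```

Under these made-up inputs the same revenue moves from a loss on every dollar to a positive margin, which is the sense in which Srivastava called the shift "existential" for large customers.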

The defensibility argument is more strategic. Srivastava said the things an application company has that may be defensible against frontier labs are likely “some workflow” or “some user signal” that only the application company has. If that company keeps sending all of its usage and workflow data through frontier labs, he argued, it risks giving those labs the patterns that make the application valuable.

He described the cynical version through an analogy he attributed to a prominent person he declined to name: frontier labs as the East India Company. They arrive as partners, learn the local operating knowledge, and eventually use that knowledge to rule the territory. The strategic concern for application companies is where their data, workflows, and user signals should go.

“To be defensible against this and keep the thing that makes you special, you need to own your intelligence to some extent.”

Tuhin Srivastava

Agrawal pressed the obvious objection: if a company post-trains an open-source model, it is branching away from the main line of intelligence improvement. The next GPT model or the next Opus model — Agrawal called them “GPT-6 or Opus 5” — might surpass the company’s specialized model without the company doing any work. Srivastava’s answer was that the trade-off changes when open-source models are close enough, cheap enough, and customizable enough. If a base model is within a short lag, and post-training on proprietary data produces a model that is faster, cheaper, and more aligned with the product’s specific utility function, then the “no-work option” is not automatically superior.

On user experience, Srivastava was cautious. Agrawal asked whether users notice a performance hit when a product like a coding assistant swaps from frontier models to a post-trained model, invoking public discussion of Cursor’s Composer. Srivastava said Baseten did not have internal data on that specific question. The hoped-for outcome, he said, is that performance improves: post-training on user signals should improve product experience while giving the company more control over latency and reliability.

Baseten sells an inference stack, not just GPU access

Raw compute is not the same thing as a production inference stack. That distinction was Srivastava’s answer to why companies such as Abridge or Whisperflow would use Baseten instead of going directly to AWS, Google Cloud, Azure, CoreWeave, Nebius, or another AI cloud.

The first reason is performance. Companies need models to run as quickly as possible, and Tuhin Srivastava argued that when a customer goes directly to a cloud provider, it is largely responsible for doing those optimizations itself. The second reason is reliability across compute sources. Baseten, he said, helps customers run inference in a fault-tolerant way across multiple clouds, unlocking multi-cloud capacity without forcing the application company to build that layer. The third reason is the developer platform: flexibility, security, observability, and operational tooling around deployment.
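The reliability point, serving through whichever cloud is currently healthy, can be sketched in a few lines. This is a generic failover pattern, not Baseten's implementation: the backend names and the callables standing in for per-cloud inference endpoints are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Backend = Tuple[str, Callable[[str], str]]


class AllBackendsFailed(RuntimeError):
    """Raised when every configured backend rejected the request."""


def infer_with_failover(prompt: str, backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try each backend in priority order and return (backend_name, output)."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


# Hypothetical stand-ins for two cloud deployments of the same model.
def cloud_a(prompt: str) -> str:
    raise ConnectionError("capacity unavailable")


def cloud_b(prompt: str) -> str:
    return f"echo: {prompt}"


served_by, output = infer_with_failover("hi", [("cloud-a", cloud_a), ("cloud-b", cloud_b)])
print(served_by, output)
```

A production layer would also need health checks, capacity-aware routing, and retries with timeouts; the sketch shows only why an application company might prefer not to build that layer itself.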

A Baseten platform slide, attributed on-screen to “© BASETEN 2026,” laid out the company’s stack as several connected layers: a model API, dedicated deployment, training, the Baseten inference stack, model runtime, inference-optimized infrastructure, developer experience, security, observability, forward-deployed engineering, and a multi-cloud capacity manager spanning AWS, Google Cloud, Azure, and more than 10 clouds. The slide’s own framing was that customers can “train and deploy models via API or in isolation,” run inference with optimized runtimes and infrastructure, and automatically shift GPU capacity across clouds.

Layer shown on Baseten platform slide	Function described on the slide
Model API and dedicated deployment	Train and deploy models via API or in isolation
Training	Part of the workflow connecting customer data to deployable models
Baseten inference stack	Run inference with optimized runtimes and infrastructure
Model runtime and inference-optimized infra	Serve models on optimized infrastructure
DevEx, security, observability, forward-deployed engineering	Manage deploys with monitoring and support
Multi-cloud capacity manager	Automatically shift GPU capacity across AWS, Google Cloud, Azure, and 10+ clouds

How Baseten’s slide described the production inference stack

Another Baseten slide framed inference as “COGS of AI-first economy.” It said AI spending as a whole is projected to grow 50% to 100% year over year, to $3.3 trillion in 2027, with Gartner cited on the slide for the broader AI-spending projection. The same slide then claimed this implies $1.3 trillion of inference spend, growing to more than $3 trillion by 2030, and said Nvidia projects $1 trillion of chip sales in 2027 driven primarily by inference demand.

Srivastava said many customers do try cloud providers first. They then discover the operational pain of building a complete inference stack on top of compute and decide they are better served by Baseten.

The company’s current pricing reflects that position in the stack. Srivastava said Baseten mostly “mark[s] up compute.” Customers bring models, load them onto Baseten’s inference stack, choose the compute they want to run on, and pay more for a Baseten H100 or B200 than for raw compute. The premium is meant to represent the software stack’s value: performance, reliability, orchestration, and support.

But he also said Baseten’s pricing is likely to evolve. Some other companies use token pricing, and Srivastava said Baseten is “relatively unsophisticated” about pricing today. As Baseten unlocks more post-training workflows itself, the pricing conversation may shift from compute to cost per token. Application companies coming from Anthropic or OpenAI are already accustomed to token-based economics, and token pricing gives them a more direct comparison of savings.

Post-training starts with the customer’s utility function

The post-training workflow begins with something Baseten cannot supply: the customer’s definition of what the model should optimize.

Tuhin Srivastava called this the customer’s “utility function.” A company must decide what it wants the model to be good at, because Baseten does not know the company’s product, users, business, or workflow as deeply as the company itself. Once the customer defines that target, it brings data, chooses an open-source base model, and Baseten provides the scaffolding to turn that into a post-trained model and deploy it back into inference.

Srivastava gave a medical speech-to-text example. Suppose a company is building a model for a medical use case and wants to minimize transcription errors. The company would define the utility function in terms of those transcription errors, choose a base model such as Qwen 2.5, and provide a dataset. Baseten would then help turn that into a specialized post-trained model and integrate it into the inference system.
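One plausible way to write down such a utility function is word error rate (WER), a standard transcription metric. The talk did not name a specific metric, so the choice of WER, the function names, and the sample transcripts below are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Classic Levenshtein dynamic program, kept to two rows of memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def utility(pairs: Iterable[Tuple[str, str]]) -> float:
    """Hypothetical utility: negative mean WER, so higher is better."""
    scores = [word_error_rate(ref, hyp) for ref, hyp in pairs]
    if not scores:
        raise ValueError("need at least one (reference, hypothesis) pair")
    return -sum(scores) / len(scores)


# Made-up example: one substituted word out of four.
print(word_error_rate("patient denies chest pain", "patient denies chess pain"))
```

In practice a medical team might weight errors on drug names or dosages more heavily than errors on filler words; that weighting is exactly the product knowledge Srivastava said only the customer can supply.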

The workflow is not framed as generic fine-tuning detached from production. Srivastava emphasized the loop: data and utility function come in; a post-trained specialized model comes out; the model is already connected to the inference stack that must run it reliably at scale.

That loop raises a trust question. Apoorv Agrawal called the data “the keys to the kingdom” and asked whether customers are comfortable handing that to Baseten rather than to the “East India Company” of frontier labs. Srivastava joked that Baseten is the “West India Company” or “the arms for the rebellion,” but his substantive answer was about security posture and market trust.

Baseten, he said, works with multiple competitors and has access to their models and data. It has internal boundaries intended to prevent leakage between customers. Srivastava argued that the conversation is “not as bad as you’d think,” partly because customers are moving quickly and care about solving user problems rather than building everything from scratch. He also said Baseten’s track record with other high-quality customers helps build confidence.

Customers own the product knowledge, user signals, workflow insight, and evaluation target. Baseten’s proposed role is to supply the infrastructure and tooling that turn that knowledge into custom intelligence running in production.

The open-source assumption is necessary and uncertain

Baseten’s strategy depends on strong open-source or open-weight models remaining close enough to the frontier to be useful for post-training. Apoorv Agrawal identified that as one of the company’s major underlying bets. Tuhin Srivastava did not treat the answer as settled.

When Agrawal asked whether open source will remain at the frontier or three to six months behind, Srivastava replied, “TBD.” He said American labs have, for the most part, not produced the best-in-class open-source models over the past two years; the best open-source models today, in his view, are coming from China. Agrawal called that “wild” and asked why.

Srivastava’s answer was partly about talent concentration and incentives. He said the best researchers in the world currently work at two companies, and despite the names of those companies, they are not necessarily motivated to produce open-source models. Non-American labs such as Moonshot, Alibaba, and MiniMax, by contrast, may see open models as a path to relevance and market opportunity by taking the opposite position from closed-model providers. Meta, he noted, had earlier pursued that path and then “famously” moved the other way.

Srivastava argued that open-source models need to exist and said that, for the United States, lacking strong open models is “somewhat” a matter of national security. He also said he believes there is enough investment happening for U.S. open-source progress to resume. Asked whether the velocity of American open-source work is increasing or decreasing, he said that up until about a year ago it was “definitely decreasing,” but he now sees an uptick and expects “a lot this year going the other way.”

A student later asked whether the U.S. government would step in to fund open-source models or create incentives. Srivastava said he thinks the government is already getting involved and thinking about it. He also argued that Nvidia, Microsoft, and Google are putting significant work behind open source, though it has not yet borne fruit. In his view, artificial incentives and a top-down alliance may be needed.

The issue is not just national policy. It is also the incentive structure of frontier labs. A student asked what prevents Anthropic and OpenAI from open-sourcing lagging but still useful models and offering services to fine-tune them for enterprises. Srivastava said OpenAI already does some of this. But he argued that for frontier labs, heavy investment in post-training would partially undercut their central thesis. If the thesis is that AGI is everything, he said, then every dollar should go to pre-training. Their business incentive is to push the capability gap and monetize the most expensive model.

That creates room for companies like Baseten only if the open model ecosystem remains healthy. Srivastava later named insufficient open-source progress as one of Baseten’s core risks.

Nvidia remains the practical default, even if inference becomes heterogeneous

At the hardware layer, Srivastava expects a more heterogeneous future, but Baseten’s present is still organized around Nvidia.

Apoorv Agrawal asked Tuhin Srivastava to assess a compute ecosystem that includes Nvidia GPUs, AWS Trainium, Google TPUs, Cerebras, Etched, MatX, Positron, D-Matrix, SambaNova, and others. Srivastava distinguished the future architecture he expects from the current operating reality.

He said heterogeneous architectures are likely, especially where inference workloads are separated into stages such as prefill and decode. In his framing, memory-bound work could run on GPUs while compute-bound work runs on a different chip. This kind of separation could allow different hardware to specialize in different parts of the model-serving workload.
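A rough roofline-style estimate shows why prefill and decode stress hardware differently. The sketch rests on simplifying assumptions: a dense model where one forward pass costs about 2 FLOPs per parameter per token, weights stored at 2 bytes each and read once per pass, and attention and KV-cache traffic ignored. The chip figures are hypothetical, not those of any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    peak_flops: float     # peak compute in FLOP/s (hypothetical)
    mem_bandwidth: float  # memory bandwidth in bytes/s (hypothetical)

    @property
    def ridge_point(self) -> float:
        """FLOPs per byte above which the chip is limited by compute, not memory."""
        return self.peak_flops / self.mem_bandwidth


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weights read for one forward pass.

    FLOPs ~ 2 * params * tokens and bytes ~ params * bytes_per_param,
    so the parameter count cancels out.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def classify(chip: Chip, tokens_per_pass: int) -> str:
    """Label a workload as compute-bound or memory-bound on the given chip."""
    if arithmetic_intensity(tokens_per_pass) >= chip.ridge_point:
        return "compute-bound"
    return "memory-bound"


# Hypothetical accelerator: 1 PFLOP/s of compute, 3 TB/s of memory bandwidth.
chip = Chip(peak_flops=1e15, mem_bandwidth=3e12)

print("prefill, 2048-token prompt:", classify(chip, 2048))
print("decode, 1 token at a time: ", classify(chip, 1))
```

Under this model, prefill over a long prompt lands on the compute-limited side and single-token decode on the memory-limited side, which is the asymmetry that makes splitting the two stages across different hardware plausible. Batching many decode requests together raises decode intensity, so the real boundary depends on the serving configuration.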

But for Baseten today, Srivastava said the company optimizes for speed of execution. Its customers are moving quickly, and Nvidia’s ecosystem makes that easiest. CUDA availability, Nvidia GPU availability, and runtimes such as TensorRT-LLM matter because they are built for Nvidia chips. He described TensorRT-LLM as an open-source runtime developed by Nvidia that runs on top of Nvidia chips. He also said vLLM and SGLang, two open-source frameworks used for model serving, are native to Nvidia.

So while the hardware future may be more diverse, Baseten’s current position is “fairly one-sided” toward Nvidia, as Agrawal summarized. Srivastava did not dispute that. The practical advantage is not only chip performance; it is the surrounding software ecosystem that lets application and infrastructure teams move quickly.

A student later asked where the real problem in the inference stack is: networking, kernels, or some other layer. Srivastava answered, “It’s everywhere.” Disaggregation is an obvious area because it separates concerns. Kernel work remains important, and he mentioned Nvidia, Stanford, and projects such as ThunderKittens as examples of ongoing opportunity. But his broader point was that the entire stack is unstable. A new model architecture could make many current optimizations obsolete.

Infrastructure companies must build for the workloads customers have now while anticipating that the model architecture, serving pattern, hardware mix, and economic bottleneck may change again.

Compute scarcity is not a temporary procurement problem

The most concrete numbers in the discussion came from Baseten’s own compute needs and procurement experience.

Tuhin Srivastava said that no matter how much people say there is a supply problem, “it is 10 times worse.” If a buyer goes into the market asking for 1,000 GPUs, Srivastava said they may be told availability is in the second quarter of the next year — roughly 12 to 15 months away, as Apoorv Agrawal clarified.

Baseten has mostly rented compute rather than owning data centers, but Srivastava said the company will also build an ownership footprint. The rent-versus-own decision reflects two competing views of the stack. Baseten’s view is that software is the sticky layer. Some cloud providers’ view is that GPU access is the sticky layer and software is easier. Srivastava said both views may be arrogant. Baseten rented because it needed speed and “had no business building data centers.” But the scale of projected demand is forcing a change.

He gave a specific pricing example. Baseten had a cluster of B200 Blackwell chips in one cloud. The unit price was $2.63 per hour. The renewal was due in October, but in May the provider came back with a new proposed price of $5.10 per hour. Agrawal noted that this was roughly double. Srivastava called it “egregious” and said Baseten would “absolutely not” take it, because it had other clouds it could work with.

$2.63 → $5.10 per hour: the price change Srivastava said one cloud provider proposed for a B200 renewal
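The size of that jump is easy to check. The two hourly prices come from the talk; the assumption that a GPU is rented around the clock for a full year is added here for illustration.

```python
OLD_PRICE_PER_HOUR = 2.63  # USD per B200-hour, as quoted in the talk
NEW_PRICE_PER_HOUR = 5.10  # USD per B200-hour, proposed renewal price
HOURS_PER_YEAR = 24 * 365  # assumes continuous rental, ignoring leap years


def pct_increase(old: float, new: float) -> float:
    """Percentage increase from old to new."""
    return (new - old) / old * 100.0


def annual_cost_per_gpu(price_per_hour: float) -> float:
    """Yearly rental cost of one GPU kept running all year."""
    return price_per_hour * HOURS_PER_YEAR


increase = pct_increase(OLD_PRICE_PER_HOUR, NEW_PRICE_PER_HOUR)
extra_per_gpu = annual_cost_per_gpu(NEW_PRICE_PER_HOUR) - annual_cost_per_gpu(OLD_PRICE_PER_HOUR)

print(f"increase: {increase:.1f}%")
print(f"extra cost per GPU per year: ${extra_per_gpu:,.0f}")
```

The increase works out to roughly 94%, consistent with Agrawal's "roughly double."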

The pricing story was part of a larger claim: access to compute is becoming a strategic advantage for inference. Srivastava cited a slide circulating from OpenAI’s Noam Brown making that point. Baseten’s own projected needs make the issue acute. Measured in tokens served, Srivastava said, Baseten’s inference service across customers is larger than OpenAI’s API based on “last reporting,” and larger than Gemini’s API product. He put Baseten’s volume at about 30 trillion tokens per day.

30T tokens/day: the inference volume Srivastava said Baseten handles

Projecting forward, Srivastava said Baseten will need roughly 150,000 B200 equivalents in two years. He called that “an insane amount of compute.” The strategic implication was explicit: Baseten must secure enough compute to control its own destiny. If it cannot, Srivastava said, it will be in trouble.
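To put those figures on a more familiar scale, the sketch below converts the daily token volume to a per-second rate and prices the projected fleet. Combining the fleet size with the $2.63 hourly rate from the renewal anecdote is an illustration added here, not a figure from the talk, and it assumes every GPU is rented continuously.

```python
SECONDS_PER_DAY = 24 * 60 * 60
HOURS_PER_YEAR = 24 * 365

TOKENS_PER_DAY = 30e12        # about 30 trillion tokens per day, per the talk
FUTURE_FLEET_B200 = 150_000   # projected B200 equivalents in two years, per the talk


def tokens_per_second(tokens_per_day: float) -> float:
    """Average token throughput implied by a daily volume."""
    return tokens_per_day / SECONDS_PER_DAY


def annual_rental_cost(gpus: int, price_per_gpu_hour: float) -> float:
    """Yearly bill for renting `gpus` GPUs around the clock at a flat hourly price."""
    return gpus * price_per_gpu_hour * HOURS_PER_YEAR


rate = tokens_per_second(TOKENS_PER_DAY)
bill = annual_rental_cost(FUTURE_FLEET_B200, price_per_gpu_hour=2.63)

print(f"average throughput: {rate:,.0f} tokens/s")
print(f"illustrative annual rental bill: ${bill:,.0f}")
```

At the quoted volume the average rate is on the order of hundreds of millions of tokens per second, and even at the lower renewal price a rented fleet of that size would cost billions of dollars a year, which helps explain why Baseten is reconsidering the rent-versus-own decision.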

When Agrawal asked when compute scarcity normalizes — in 12 to 15 months, multiple years, or decades — Srivastava said he does not think it will ever normalize. His reasoning was demand-side. Applications are becoming more agentic, and models are getting bigger. Both increase inference demand. Unlike an airport security line that clears when the airport closes overnight, global inference demand does not reset. If demand falls in one region, it rises somewhere else.

The metaphor was informal but important. Srivastava compared inference demand to the saying that it is “5 p.m. somewhere.” There is always another geography, customer base, or use case consuming tokens. The demand curve does not get a nightly reset.

Baseten’s bear case has three parts

A student asked for the antithesis of Baseten, the case against the company. Tuhin Srivastava named three risks.

The first is compute concentration: a world in which only two players can get compute, and they own everything. In that scenario, an inference infrastructure company outside the dominant frontier labs or hyperscalers could be squeezed out by lack of access to the core resource.

The second is open-source failure: Baseten depends on open-source models becoming and remaining good enough to post-train. If the ecosystem does not produce strong base models, Baseten’s custom-model thesis weakens. Its customers may remain dependent on frontier APIs because the alternative is too far behind.

The third is the extreme AGI thesis. If one believes fully in a trajectory where increasingly capable models eliminate most human work, then, Srivastava said, “we all have nothing to do.” But he added a twist: in that world, inference may be “the only market left.” Agrawal agreed.

Those risks sit alongside the company’s current growth story. Baseten is betting that inference volume explodes, that application companies need custom intelligence to become profitable and defensible, that open models remain close enough to frontier models, and that infrastructure software can be a durable control point even as compute stays scarce.

The next infrastructure opportunity may be below the cloud

When asked what he would build if he were not building Baseten, Tuhin Srivastava did not name an application company. He said he would go “more the Crusoe route,” investing in energy and power.

His reasoning follows directly from the compute-scarcity thesis. The world will need more physical places to put compute. Srivastava said one idea he is especially excited about is modular data centers: a standardized unit of compute analogous to shipping containers as standardized units of trade. Containers were, in his words, “one of the biggest economic drivers in history” because they normalized the unit of trade. He suggested the same kind of standardization could industrialize data-center buildout.

Today, he said, if one talks to the people putting up data centers and the people who own power space, “everything is different.” A modularized, consistent format could create something like an API for compute at the physical-infrastructure layer. If the unit were standardized, an industry could form around servicing, deploying, and financing those units.

A student asked how this differs from a virtual machine. Srivastava said the virtual machine is several layers higher in the stack. His modular data-center idea is about making it cheaper and easier to put up more compute in the first place. It is “two layers deeper” than the software abstraction.

That answer also explains Srivastava’s advice to students. He said students can study “whatever” and change careers, arguing that one can become an expert in almost anything in six months. But when he named a concrete area he is learning, it was project finance for data centers. The financing of AI infrastructure — energy, data centers, and compute buildout — is now closely tied to the economics of AI applications.

The compute market is too immature for clean financial abstractions

A student asked about compute futures. Tuhin Srivastava said the idea is “obviously interesting,” but his description of the current market was intentionally rough: compute deals are done like “a drug market.” There is “a guy.” Baseten, he joked, has “so many guys,” including someone in the office who sits in the corner calling people all day asking for compute.

The point was market structure, not procurement humor. Srivastava said he is bullish that markets for compute will eventually exist, but the current market is not mature. Compared with electricity markets, it has “a lot of slippage,” as Apoorv Agrawal summarized and Srivastava confirmed. The implication is that compute remains too relationship-driven, opaque, and operationally constrained for clean financial instruments to fully solve the problem today.

That immaturity reinforces the broader tension in Baseten’s model. Customers want cloud-like reliability and simple economics for model serving. Underneath, the supply chain for GPUs and capacity is still chaotic enough that access itself becomes strategic work.
