AI’s Frontier Shifts From Bigger Models To Deployment Constraints

Applied AIThursday, May 21, 20261h 59m to watch17 min read

Sara Hooker, Google DeepMind, Railway, Anthropic, Apoorv Agrawal, and Gavin Baker all point to an AI race increasingly measured by adaptation, latency, cost, supervision, infrastructure, and physical capacity. Bigger models still matter, but the harder question is whether agentic systems can be deployed safely and profitably at scale while chips, wafers, power, and data centers keep up.

1. The frontier is becoming a deployment problem

Frontier AI competition is being remeasured. The old question — who has the largest and most capable model — has not disappeared. But Sara Hooker’s argument and Google DeepMind’s Gemini posture both point to a market where the higher-return work is increasingly in the system around the model: post-training, data curation, routing, retrieval, test-time compute, memory, interfaces, agent harnesses, latency, cost, and evaluation.

Hooker’s claim is deliberately narrower than “scaling is over.” She argues that pre-training scale is losing its position as the default answer to every capability problem. Compute still matters. But in her telling, the marginal question is less “how much larger can the next model be?” and more “where should compute be spent, what should be stored in weights, and how cheaply can a system adapt after deployment?”

Her evidence cuts against parameter count as a sufficient strategic metric. She pointed to smaller models improving sharply over time, small models sometimes outperforming larger ones on the retired Hugging Face Open LLM Leaderboard, and neural-network redundancy suggested by pruning and weight-prediction research. Her conclusion is not that small models always win. It is that model size alone no longer gives a reliable recipe for progress, especially when serving cost, latency, and adaptation are included in the accounting.

That matters because monolithic AI pushes adaptation costs onto users. Hooker’s example was mundane: asking ChatGPT to create slides for a talk and getting a polished but wrong speaker slide, including the wrong woman’s photo. For her, the failure illustrated a system that gives users two weak options: thumbs-up/thumbs-down feedback that may eventually enter a future training process, or laborious prompt engineering. The model is general, static, and shared; the user absorbs the task-specific adaptation burden.

Her alternative is an adaptive stack. Data can be shaped, generated, pruned, deduplicated, and targeted. Inference can spend more compute on uncertain cases and less on routine ones. Interfaces can collect richer feedback than a chat box. Retrieval and memory can carry facts that do not belong in parameters. Post-training and automated training loops can make customization cheaper. The core strategic metric becomes what Hooker called the cost of adaptation.

Test-time compute is the clearest example. Hooker does not treat it as simply “more compute” under a new name. She argues that the value comes from using it adaptively: spend extra effort where uncertainty is high, not on every query equally. The same allocation problem applies to context: stable facts may be better retrieved than memorized, while skills and tool-use habits may need to live more generally in the model.

Google’s Gemini strategy is a live corporate version of that shift. Tulsee Doshi and Logan Kilpatrick described a Gemini posture built less around a single Ultra-branded frontier model and more around a deployable AI stack: Gemini 3.5 Flash, Flash Lite, Anti-Gravity, AI Studio, Search, YouTube, Gemini Live, and Gemini Spark. The model is important, but the product surfaces, harness, retrieval, multimodal interfaces, and feedback loops are being designed together.

3×

Gemini 3.5 Flash speed advantage Doshi claimed versus other large models

Google’s choice to lead with Flash rather than an Ultra model is strategically revealing. Doshi framed Flash as smart, fast, and cost-effective enough to run across Google-scale products. In Search, the Gemini app, and developer tools, latency and price are not implementation footnotes. They determine whether users wait, abandon, or use the model repeatedly. Flash Lite emerged because Google saw demand for an even lower-cost point on the quality-latency curve.

Kilpatrick’s “model eats the scaffolding” line captures one side of the strategy. As model capability improves, some wrapper logic, prompting patterns, tool orchestration, and agent behaviors move into the model itself. But the more important point is not that scaffolding disappears. Google is standardizing the scaffolding that remains. Anti-Gravity is meant to become a common agent harness across Gemini Spark, AI Studio, developer APIs, and eventually more Google products. The model is trained with the harness; the harness powers product experiences; the products generate failures and feedback that go back into the model organization.

The AI race is increasingly being measured around latency, cost, orchestration, context, evals, and distribution — not only frontier leaderboard scores.

Hooker and Google arrive at the same broader place from different starting points. Hooker’s frame is research and ecosystem structure: pre-training scale concentrates participation among labs with the largest colocated compute clusters, while adaptive systems could make more parts of AI progress accessible and efficient. Google’s frame is deployment at planetary product scale: a model family must serve consumer assistants, search, coding agents, enterprise customers, and multimodal products without collapsing under cost or latency.

They also leave different tensions unresolved. Hooker’s adaptive future depends on better methods for deciding what belongs in weights, context, retrieval, memory, tools, or search. Google’s full-stack approach creates an advantage in debugging and iteration, but raises the question Nathan Labenz pressed: whether co-training model and harness creates lock-in or reduces portability across agent frameworks. Kilpatrick and Doshi said they want “harness diversity,” but the economic incentive to own the best integrated stack is obvious.

The shared lesson is that “frontier” is no longer a single number. It is a system property.

Pre-Training Scale Is Losing Ground to Adaptive AI SystemsHugging Face

Gemini’s Strategy Shifts From Frontier Leaderboards to Deployable AI InfrastructureThe Cognitive Revolution

2. Agents need infrastructure that makes failure survivable

If Hooker and Google describe why AI systems are becoming adaptive, Railway founder Jake Cooper describes what the substrate must look like when those systems start acting on software and infrastructure. His argument is a practical counterweight to vague “agent-native cloud” language: agents do not need magical new primitives so much as faster, cheaper, safer, more observable versions of old ones.

The primitives are familiar: compute, storage, networking, version control, logs, metrics, traces, feature flags, rollouts, filesystems, databases, and production-like environments. The stress pattern is new. Human developers create changes slowly enough that many deployment systems can survive friction, manual review, and staging drift. Agents can generate thousands of changes, tests, branches, and infrastructure mutations. The bottleneck becomes whether those changes can be isolated, evaluated, deployed incrementally, observed, and rolled back.

Cooper’s line is blunt: agents need “basically the exact same thing” as humans — network, compute, storage — but “much, MUCH faster.” That is not a trivial speed upgrade. It means the deployment platform must assume many concurrent software-producing actors, not one developer pushing one pull request. It must let an agent fork an environment, clone services, copy or transform production data, test near the real system, expose limited blast radius, and collapse successful changes back into production.

That is why production forks matter more than trust in the agent. Cooper is skeptical of “AI SRE” if it means giving an agent direct authority over production without safe primitives. In his words, that kind of agent is liable to “nuke your production database.” The alternative is not to ban agents from operational work. It is to make failure survivable: read-only production copies, copy-on-write databases, PII transformation, progressive rollout, shadow traffic, feature flags, and observability that shows what changed and who is affected.

This connects directly to the deployment problem in the first section. Adaptive systems need closed loops. Agents generate work; infrastructure has to evaluate it. Agents alter code; the platform has to test it. Agents deploy; observability has to detect bad outcomes. Agents may act continuously; the platform has to decide when to ask a human, when to roll back, and how to keep context intact.

Agent pressure	Infrastructure response Cooper emphasized
Many concurrent changes	Fast forks, versioning, and cheap production-like environments
Unsafe tool use or bad fixes	Blast-radius control, feature flags, progressive rollout, rollback
Staging drift	Forks close to production, copied services, transformed production data
Opaque failures	Logs, traces, metrics, clustered feedback, incident visibility
High compute demand	Own-metal margins, cloud bursting, placement control

Railway’s agent-native thesis is about making familiar primitives fast and reversible enough for agent workloads.

Railway’s own business history reinforces the economics behind this. Cooper said the company once lost about $500,000 per month on about $50,000 of revenue during its free-tier era, attracting bots, phishing, trojans, and crypto miners along with genuine developers. It then rebuilt around a more disciplined operating model. Cooper described Railway as roughly 35 people serving about 3 million users and adding about 100,000 users per week.

The own-metal strategy is part of the same agent-era thesis. Railway moved most workloads onto its own bare-metal data centers because Cooper said the payback period versus renting equivalent cloud capacity is about three months, with margins around 70% on owned metal. Cloud bursting remains a release valve when demand outruns installed capacity, but own-metal gives the company more margin and control over placement, performance, and cost.

The important comparison is with Google’s standardized harness. Google is trying to standardize the agent execution layer across its products. Railway is trying to make the cloud substrate safe for agent-generated software change. Both are responses to the same operational fact: once AI moves from answering to acting, iteration speed is no longer only a product advantage. It is a safety and economics problem.

Cooper’s view also complicates simple claims that the pull request is dying. He and Shawn Wang both expect the pull request to lose centrality, but not because code review becomes irrelevant. The unit of change expands: prompt, spec, generated code, tests, observability, feature flags, rollout, production feedback. The human role shifts toward specifying, reviewing, reconciling, and deciding. That is the same pattern Doshi described inside Google research and product work: the model compresses mechanics, while humans retain judgment.

The practical implication is that agent infrastructure is not a dashboard category. It is the set of control loops that keep fast software generation from becoming fast production damage.

Agent-Native Clouds Need Faster Primitives, Not New OnesLatent Space

3. Coding agents are the first mass test of agent economics

Claude Code is where the abstract agent debate becomes measurable. Boris Cherny, who leads Claude Code at Anthropic, describes it as more than a coding assistant. It has become one of Anthropic’s main agent surfaces: a way users experience models that can use tools, edit files, run tasks, connect services, and coordinate other agents.

Cherny’s productivity claims are strong, but the more useful reading is not “Claude Code proves agents work.” Coding is the first mass market where agentic work is visible, instrumented, and economically consequential. It shows both real demand and the constraints other domains are likely to encounter later.

The headline internal figure is that code written per engineer at Anthropic rose about 250% after Claude Code was introduced, with Cherny saying code quality and reliability remained stable. He also said Claude Code and Co-work are now “100% written by Claude Code,” and that in a Y Combinator room of a few hundred people, about half raised their hands to say 100% of their code was written using Claude Code.

250%

reported increase in code written per engineer at Anthropic since Claude Code was introduced, according to Cherny

Those claims sit beside a set of unresolved operating constraints. Agents waste tokens. Long-running tasks can spiral. Users hit rate limits. Plugins and integrations can consume tokens inefficiently. Permission prompts create fatigue. Power users no longer run one Claude at a time; Cherny said he normally runs about five simultaneously, and on many nights runs hundreds or thousands. That usage pattern turns “AI coding” from a feature into a capacity-management problem.

The product controls reflect that reality. Anthropic exposes model choice — Opus, Sonnet, Haiku — and “effort” levels that trade token use against work done by the model. Cherny said Anthropic optimizes first for intelligence, then efficiency. That sequencing may be rational for frontier product quality, but it also means users encounter the economics directly. A more capable agent can be more useful and more expensive in the same session.

Claude Code’s auto mode is the clearest example of a new agent pattern: agents increasingly need other agents as monitors, reviewers, or supervisors. Earlier versions asked users to approve tool calls. That appeared safer, but repeated prompts taught users to click yes or “always allow.” In auto mode, when Claude wants to use a tool, another Claude evaluates whether the tool call is safe, with limited context and additional safety checks. Cherny said Anthropic spent months iterating on the system and used thousands of evaluations to judge safety.

That is not merely a Claude-specific feature. It is an early form of supervisory infrastructure for long-running action. If agents are going to work for an hour, a night, or across multiple parallel tasks, the safety loop cannot depend on a human reading every command. But if the human is fully removed, tool misuse and silent error become harder to catch. Claude-checking-Claude is one attempt to resolve that middle ground.

Bottleneck	How it appears in Claude Code	Why it matters beyond coding
Token cost	Long tasks, retries, plugins, and multiple agents consume large token budgets.	Every agentic workflow has marginal compute cost.
Rate limits	Power users hit plan ceilings as they run many Claudes in parallel.	Capacity becomes part of product quality.
Permission fatigue	Users stop carefully reviewing repeated tool approvals.	Safety loops must scale beyond human click-through.
Supervision	Auto mode uses Claude to evaluate Claude’s tool calls.	Agent work may require monitor agents, reviewer agents, and escalation rules.
Organizational change	Cherny says companies need workflow redesign, not token leaderboards.	Productivity gains depend on adoption patterns, not just model capability.

Claude Code exposes the economic and operational constraints likely to follow agents into other domains.

Cherny also pushed the agent frame beyond programming. His travel-planning example involved Co-work reading email and calendar context, correcting missing stops and wrong dates, then booking eight flights and five hotels before fixing a hotel location. He described a non-engineer friend using Co-work to diagnose and fix a laptop language-input problem through settings. These are not proofs that ordinary work is solved. They show why coding may be only the first domain where tool-using models become legible.

The labor question remains open. Cherny does not argue that people disappear. He argues that leverage per person rises. Someone still has to ask Claude to configure Salesforce or steer a business process. If there are many judgments, prompting and supervising may itself be a full-time job. Eventually Claude may prompt Claude to do more of that work, but Cherny’s account still leaves humans responsible for goals, judgment, and organizational change.

That is the same unresolved question running through Hooker, Google, Railway, and Anthropic: the systems can act more, but action requires allocation, context, supervision, reversibility, and cost control. The question is shifting from whether agents can act to whether the economics and safety loops can support long-running action.

Claude Code’s Growth Tests the Economics of Long-Running AI AgentsAlex Kantrowitz

4. The economic stack is still upside down

Apoorv Agrawal’s Stanford lecture disciplines the optimism around adaptive systems and agents. His framework is that generative AI’s revenue and gross profit are still concentrated in the layer least like software: semiconductors.

In Agrawal’s comparison, the cloud stack has roughly $600 billion of annual application revenue, $300 billion of infrastructure revenue, and $80 billion of semiconductor revenue. The AI stack is inverted: about $60 billion in applications, $75 billion in infrastructure, and $300 billion in semiconductors. The gross-profit picture is even more concentrated. Agrawal’s estimates show semiconductors accounting for 87% of AI gross profit in 2024 and 79% in 2026, while applications rise only from 3% to 7%.

Stack or period	Apps	Infra	Semis
Cloud estimated annual revenue	$600B	$300B	$80B
AI estimated annual revenue	$60B	$75B	$300B
AI gross profit share, 2026	7%	14%	79%
Cloud gross profit share	70%	24%	6%

Agrawal’s estimates show AI revenue and gross profit still weighted toward semiconductors rather than applications.

The core reason is marginal cost. Classic software could be built once and distributed at near-zero incremental cost, producing 80% or 90% gross margins in many categories. Generative AI does not automatically inherit that structure. Every user request consumes inference. Every tool call, retry, agent loop, long-running task, and parallel Claude burns compute. As Agrawal put it, “you’ve got to burn those GPUs.”

That point ties directly to Claude Code’s rate limits and Railway’s own-metal strategy. Claude Code users feel the cost as token ceilings and plan limits. Railway feels it as a need to own metal, manage placement, and avoid being compute constrained. Google feels it as a reason to emphasize Flash, Flash Lite, and latency-sensitive deployment rather than only maximum capability. Hooker’s adaptive-compute argument is also an economics argument: spend effort where it matters, not everywhere.

Agrawal is not saying AI demand is absent. His estimates show AI revenue growing from about $90 billion in Q1 2024 to about $435 billion in Q1 2026. The striking point is that the shape barely moved. Roughly 75% of incremental revenue over the period flowed to semiconductors in his estimate. Applications grew sharply, but the profit pool remains at the base of the stack.

5×

estimated AI revenue growth from Q1 2024 to Q1 2026 in Agrawal’s framework

The application layer has a second problem: usage is mainstream, but monetization is thin. Agrawal said ChatGPT has about 1 billion users and monetizes at about $10 per user per year, compared with Alphabet at about 4 billion users and roughly $100 per user per year, and Meta at about 3.5 billion users and roughly $70. He thinks consumer AI may need advertising to close the gap, despite obvious concerns about inserting ads into personal AI interactions.

The infrastructure layer is contested for a different reason. Agrawal asks whether infrastructure startups are platforms or features. Many middle-layer capabilities may be absorbed by hyperscalers. That does not mean every AI infrastructure startup fails, but it raises the burden of proof: a company must show that it controls something durable enough not to become an AWS, Azure, or Google Cloud feature.

This is where Railway is an interesting test case rather than a contradiction. Cooper’s argument is not merely that Railway has an agent feature. It is that the platform’s control over deployment loops, production forks, own-metal economics, observability, and developer experience can become a durable control point. Agrawal’s framework asks whether that is enough to resist hyperscaler absorption over time.

The semiconductor layer, meanwhile, remains the clearest profit center. Agrawal estimated Nvidia data-center gross margins around 75%, while application-layer gross margins may sit anywhere from 0% to 30%, depending on the company. Custom silicon could reprice that layer if Google’s TPU, Amazon’s Trainium, Meta’s MTIA, Microsoft’s efforts, OpenAI’s efforts, or other ASIC programs succeed. Hyperscaler capex guidance is therefore not just a quarterly finance detail. It is a signal about whether the buildout continues and whether the current stack shape persists.

The implication for agents is sobering. Agentic software may be useful enough to create real demand, but it is expensive enough that capacity, utilization, and gross margin matter immediately. The software model has not vanished. It has acquired a meter.

Generative AI’s Revenue Stack Is Still Inverted Toward ChipsStanford Online

5. Physical scarcity may decide how fast the cycle can run

Gavin Baker’s TSMC argument pushes the deployment story to its hard boundary. If Hooker says the technical frontier is moving toward adaptive systems, Google is building model-harness-product stacks, Railway is making agentic change survivable, Claude Code is testing agent economics, and Agrawal says profit remains concentrated in chips, Baker asks what happens when demand runs into wafers, power, and physical supply chains.

His frame is scarcity. Demand for frontier AI, in his telling, is outrunning the system’s ability to serve it. He used Anthropic as the reference point, saying the company added $11 billion of ARR in one month and arguing that reported ARR may understate unconstrained demand because compute is scarce. He also read DeepSeek as positive for compute demand: reasoning models made inference more compute-hungry, while GPU availability tightened and rental prices rose.

$11B

ARR Baker said Anthropic added in one month

Baker’s most useful distinction is “watts and wafers.” He is more confident capitalism can solve the power shortage over time than the wafer shortage. Power bottlenecks may ease as zoning, turbines, gas prices, and data-center approvals adjust, though he does not treat that as instantaneous. Wafers are different because leading-edge supply depends heavily on TSMC’s capacity decisions, process leadership, equipment supply, and institutional discipline.

His central claim is not simply that TSMC is important. It is that TSMC’s restraint may be preventing a classic overbuild. Foundational technologies often create bubbles: markets correctly identify the importance of a technology, capital floods in, supply outruns demand, and a crash follows. Baker sees differences from 2000 — the current buildout is more funded by operating cash flow than debt, and GPUs are highly utilized rather than sitting idle like unused fiber — but he still sees the historical risk.

In his view, if TSMC expanded as much as Nvidia wanted, Nvidia might sell vastly more GPUs in 2026 or 2027, potentially enough to create overbuild conditions. Wafer scarcity, in that reading, is not only a bottleneck. It is a governor on excess. If AI avoids a classic infrastructure bubble, Baker joked, investors should “throw a party” for TSMC.

That argument should be handled as a read on the cycle, not a settled forecast. It depends on TSMC maintaining enough leading-edge advantage to discipline supply, while Intel, Samsung, or another source does not break the constraint too quickly. Baker himself names that as a risk: if a meaningful second source emerges, industry behavior could change.

His discussion of orbital compute and the Terafab shows how scarcity expands the imagination of infrastructure. Baker wants orbital compute understood as racks in space, not floating Pentagon-sized data centers. He argues that if terrestrial permitting, power, and cooling become too binding, SpaceX-style engineering may make inference in orbit commercially interesting. He also described a Musk-linked Terafab ambition to build an enormous semiconductor fab in America, possibly compressing normal timelines through talent, supplier pressure, and operational urgency. Cooper, from Railway, expressed the same type of engineering skepticism in miniature when he questioned heat dissipation for space data centers: ambitious infrastructure claims need the physical bottleneck named and solved.

Baker’s model-layer economics also connect back to Agrawal and Cherny. He argues that frontier tokens still capture most of the economics, and that premium access is increasingly usage-based rather than all-you-can-eat. Consumer and prosumer plans are rate-limited; serious users need enterprise plans, Claude Code, Codex, or API access. That is bullish for model revenue if demand holds, but it also raises the distributional problem Baker calls “sad for the world”: the best AI may become available mainly to those who can afford usage-based frontier access.

The application layer remains squeezed in his account until the frontier premium changes. Cursor and Cognition are exceptions because coding reached scale first. Narrow vertical applications must either build a data moat before model companies enter, occupy niches frontier labs ignore, or benefit if frontier-token prices fall relative to other tokens. This is close to Agrawal’s inverted-stack problem from the investor side: applications may be where users experience value, but chips and frontier tokens capture much of the economic surplus today.

The unresolved question across all six arguments is not whether AI systems are getting more capable. It is what governs the next phase of deployment.

Layer	What the arguments emphasize	Main constraint
Model strategy	Hooker shifts attention from pre-training scale to adaptive systems.	Efficient adaptation and allocation of knowledge across weights, context, retrieval, and tools
Product stack	Google builds Gemini around Flash, Anti-Gravity, Search, AI Studio, YouTube, and Gemini Live.	Latency, cost, harness portability, and product feedback loops
Cloud substrate	Railway argues agents need faster, safer versions of old primitives.	Forking, rollout, observability, and compute economics
Agent product	Claude Code shows high usage and productivity claims.	Tokens, rate limits, supervision, and long-running safety
Economic stack	Agrawal says gross profit remains concentrated in semiconductors.	Inference marginal cost and hyperscaler control
Physical supply	Baker argues wafer and power scarcity shape the boom.	TSMC capacity, energy, data centers, memory, networking, and capital discipline

The deployment race spans technical, economic, and physical constraints.

The answer is probably all three: smarter adaptive systems, better deployed products, and harder physical limits. Scaling remains necessary, but it is no longer sufficient. Agents are useful, but they are not free-running or free to serve. Infrastructure demand is real, but its durability depends on whether applications can absorb the cost and whether physical capacity can expand without overshooting.

TSMC’s Wafer Scarcity May Be Preventing an AI OverbuildInvest Like The Best