Applied AI Moves From Model Choice To The Machinery Around It
Dan Fu, Tuhin Srivastava, Ahmad Awais, Wolfgang Lehrach, Vincent Koc, and Hock Tan each describe a different version of the same production shift: capability is increasingly shaped by inference systems, harnesses, tests, workflow data, and compute infrastructure around the model. Strong base models still matter, but the operational question is becoming how reliably, cheaply, and safely those models can be served, constrained, reviewed, and scaled.

Inference is no longer plumbing; it is shaping the product and the model
Applied AI is moving away from a simple question — which model is smartest? — toward a harder production question: what system makes the model fast, cheap, reliable, and useful enough to run as a business?
Dan Fu’s Stanford CS336 lecture gives the cleanest technical version of that shift. Fu framed inference as the machinery that “turns electricity into intelligence”: the schedulers, caches, kernels, routing rules, hardware choices, and debugging tools that sit between a trained model and a product returning tokens to users. His stronger point was that inference is no longer just deployment. It is now a research and product constraint that feeds back into which model architectures make sense.
The most important split in Fu’s account is between prefill and decode. Prefill processes the prompt — for example, a 10,000-token input — and is relatively compute-bound. Decode generates the response one token at a time and is often memory-bandwidth-bound because the system repeatedly loads model weights while doing comparatively little computation. That distinction changes everything around serving: what hardware is useful, how requests should be scheduled, how long context should be cached, and whether different phases should run on different machines.
| Inference layer | What it decides | Why it now matters commercially |
|---|---|---|
| Prefill and decode split | How prompt processing and token generation are placed across hardware | Affects latency, throughput, and whether specialized chips can help |
| KV cache management | Which conversation prefixes stay hot in GPU memory, move to CPU memory, or spill to disk | Determines whether multi-turn and long-context products stay economical |
| Routing | Whether cold and warm requests take different serving paths | Can improve serving efficiency without changing the model |
| Kernel work | How model operations are fused and scheduled on GPUs | Can move decode closer to hardware limits, but at high engineering cost |
| Architecture co-design | Whether recurrence, KV compression, quantization, or model size fit the serving target | Makes model design dependent on production constraints |
Fu’s examples make the point concrete. Continuous batching turns serving into a scheduling problem. KV caches turn inference into something like an operating-system memory-management problem, with hot blocks in GPU memory, warm blocks in CPU memory, and cold blocks on disk. Cache-aware prefill-decode disaggregation routes cold and warm requests differently; Fu said a routing change of that kind produced up to a 40% speedup in long-context serving. Megakernels attack decode by fusing many operations into one larger GPU program, with Fu reporting 72% H100 bandwidth utilization in one Llama-1B decode benchmark. Recurrence, in his Parcae work, becomes interesting partly because reusing parameters can reduce the weight footprint and leave more room for KV cache.
Baseten chief executive Tuhin Srivastava presented the commercial version of the same argument at Stanford. If inference is the cost of goods sold for AI applications, then model calls are not just a cloud bill; they are the recurring cost of delivering the product. Srivastava said roughly 90% to 95% of inference spend currently goes to frontier models, while only about 5% goes to custom or post-trained models. Baseten’s thesis is that scaled AI application companies will increasingly need to move beyond default frontier-model APIs to improve margins and protect the workflow data and user signals that make their products defensible.
Baseten’s examples are production workflows, not abstract benchmarks. Srivastava described Whisperflow as running several language and audio models in the path from speech to text, with latency as the product experience. He described Abridge as running about 20 models for healthcare ambient documentation and EMR integration, with reliability and speed bound to clinical workflow. In both cases, the inference stack includes not just a model endpoint but deployment, runtime optimization, observability, security, multi-cloud capacity, and sometimes post-training.
Srivastava’s custom-model argument remains a bet, not a settled outcome. He said open-source models may be within a sub-90-day lag of closed frontier systems and can run much cheaper after post-training, but he also named open-source weakness as one of Baseten’s core risks. His economic point is still useful: once a company reaches enough usage volume, “trading tokens” through frontier APIs can pressure gross margin, and the application company may want its own utility function, data loop, model variant, and serving stack.
Taken together, Fu and Srivastava describe production AI as a scheduler, cache hierarchy, runtime, hardware plan, cost model, and often a post-training loop tied to a company’s own data — not just a trained model behind an endpoint.
Fu and Srivastava approach the same constraint from different sides. Fu explains why serving shapes architecture. Srivastava explains why serving shapes business strategy. Together, they make inference look less like plumbing and more like the surface on which applied-AI products compete.
The harness is becoming a source of capability
Ahmad Awais’s Command Code account brings the inference argument down to a smaller but revealing failure mode: tool calls. His headline claim was that DeepSeek v4 Pro beat Opus 4.7 in six of ten Command Code internal comparisons after Command Code added deterministic repairs to malformed tool-call inputs. The important point is not the scoreboard. It is that the model did not change. The harness did.
Awais argued that many open models look worse at coding-agent work because they collide with tool schemas and error handling designed around more forgiving models. A coding agent may need to list directories, read files, write files, or run commands. Each operation has a schema. If a model sends null where an optional field should be omitted, wraps an array as a JSON string, or emits a Markdown autolink where a file path belongs, a validator rejects the call. Awais said DeepSeek v4 Pro would often repeat the same wrong call when shown a raw Zod error, producing dozens of hidden failures in a session.
Command Code’s repair pattern is narrow and instructive: validate first, repair only where the schema complains, return the tool result plus a model-readable correction, and log the repair rate. Awais said his first attempt — preprocessing all inputs before validation — risked silent corruption, such as rewriting file content that happened to look like JSON. The later approach made the schema the authority. Valid inputs pass untouched. Invalid inputs are repaired only at the failing paths identified by the validator.
That is a small systems lesson with large implications. Some apparent model failures are interface failures. The model may not need to become more intelligent to perform better in a product; the product may need to enforce clearer contracts between model output and software action.
| Pattern | What changes | What does not change |
|---|---|---|
| Tool-call repair | Malformed arguments are corrected after validation and explained back to the model | The underlying model weights |
| Design contract | The agent must choose a surface pattern before styling a UI | The model’s generic CSS capability |
| Taste file | Local preferences such as package manager, test framework, or PR habits are made explicit | The model’s broad training distribution |
| Telemetry | Repair rates expose which tools and models are failing | The fact that models still make schema mistakes |
Awais extended the same logic from tools to design and developer preferences. In his view, AI-generated interfaces often look bad because the prompt says “make it prettier” without specifying the design job. Command Code’s /design work asks the agent to classify the surface first — Monitor, Operate, Compare, Configure, Learn, Decide, or Explore — before changing visual properties. A dashboard and a landing page should not collapse into the same centered hero, feature-card grid, and purple-gradient treatment.
His “taste” system makes a similar move for software preferences. A Taste.md file can encode recurring choices such as pnpm for package management, tsup for bundling, vitest for tests, or a team’s preferred PR workflow. Awais contrasted this with broad hand-written rules: rules are what a developer remembers to write down; taste is meant to capture recurring micro-decisions that differ from the model’s default distribution.
The connection to Fu and Srivastava is direct. Fu’s serving stack mediates between model and hardware. Srivastava’s inference platform mediates between model and business workflow. Awais’s harness mediates between model and developer action. In all three cases, capability is partly in the machinery around the model: the validator, the cache, the router, the runtime, the preference layer, the telemetry.
Awais did not argue that a weak model can be rescued indefinitely by wrappers. His claim was narrower and more practical: a Claude-oriented harness may make an open model look worse than it is, because Claude may recover from ambiguous tool errors that other models repeat. The applied-AI lesson is not that models no longer matter. It is that model comparisons can be distorted by the interface contract.
Agents need executable structure, not just better prompting
Wolfgang Lehrach’s Stanford HAI seminar supplied the research analogue of the same systems philosophy. His central move was simple: do not ask a language model to play a game directly if it is slow, strategically weak, or prone to illegal actions. Ask it to write executable structure, then let ordinary algorithms operate over that structure.
In Code World Models, the LLM reads natural-language rules and game trajectories, then writes code that replicates the game’s dynamics. A planner such as Monte Carlo Tree Search can then search over the synthesized game model. In Autoharness, the task is narrower: synthesize code that checks whether a proposed action is legal. The model can still propose actions, but the harness filters illegal ones.
This is not classic “better prompting.” It is a translation step from fuzzy model knowledge into executable artifacts. A language model may know game rules, tactical concepts, and Python idioms, but direct action generation is a poor use of that knowledge in many settings. Code gives the rest of the system something testable. Unit tests can compare state transitions against trajectories. Planners can roll out futures. Validators can reject invalid moves.
Harness is the useful bridge between Lehrach and Awais. Command Code repairs malformed file and shell tool calls. Autoharness checks whether a game action is legal. Both systems recognize that a model’s raw output may not be the right final interface to the world. The model proposes; software validates, repairs, filters, or searches.
Lehrach’s results show why that can matter. For fully observed games such as backgammon, connect four, tic-tac-toe, generalized tic-tac-toe, and generalized chess, he reported more than 99% transition accuracy for synthesized world models. For TextArena, Autoharness achieved 100% legal move accuracy across 145 evaluated one-player and two-player games, according to the seminar. He also said 78% of Gemini 2.5 Flash’s losses in one TextArena chess competition were due to illegal moves — a failure mode that legality checking directly targets.
The most interesting part is the separation of responsibilities. A model does not have to be the world simulator, policy, planner, and legality checker all at once. It can write a simulator. It can propose a heuristic. It can emit an action. Other code can test the simulator, run MCTS, train a policy from synthetic rollouts, or reject illegal moves. That division resembles how production AI systems are increasingly built outside games: tool contracts, validators, tests, skills, evals, and workflow-specific checks surround the model.
Lehrach also marked the limits. Code World Models can miss dynamics not covered by rules or trajectories. Autoharness can learn a legal but restricted action subset, such as playing chess without castling or promotion, unless the missing cases appear through rules, opponents, or exploration. The harness improves reliability, but it does not by itself solve strategy.
That boundary reinforces the broader theme. The system around the model can make capability usable, but it introduces its own engineering questions: what must be checked, what must be simulated, what supervision is available, and where coverage is unknowable.
In software engineering, the bottleneck is shifting to review, tests, and management
Vincent Koc’s OpenClaw account shows what happens when agentic coding begins to work well enough to overwhelm normal software process. The story is not mainly that OpenClaw had a 3,000-commit day. It is that commit velocity made review, tests, triage, and human judgment the binding constraint.
Koc said his GitHub graph showed 2,886 contributions on March 15, 2026, and he rounded the moment to “close to 3,000 commits per day.” During OpenClaw’s “Great Refactor,” the project recorded 2,700 commits over nine days, with 638,615 insertions, 271,117 deletions, and 82% of core lines touched. The result was a plugin architecture, but the more important result was the exposure of a new operating model: humans could no longer rely on reading every diff in the old way.
Koc’s claim is that the bottleneck has moved from hands to taste. Agents can produce code. The harder question is what should exist, what should be rejected, and when a session has lost the plot. He warned that if cheap tokens make maintainers say yes to every feature request, the codebase becomes bloat. His shorthand was that 2025 was about “tokenmaxxing” and 2026 is about not wasting tokens.
| Operating phase | Koc’s description | Binding constraint |
|---|---|---|
| 2025: tokenmaxxing | Running lots of tokens through agents and learning what raw volume could do | Access to agents, compute, and human exposure to enough sessions |
| March 15, 2026 | Koc’s GitHub graph showed 2,886 contributions in a day | Review capacity and orientation |
| The nine-day Great Refactor | OpenClaw recorded 2,700 commits, touched 82% of core lines, and launched plugins | Tests, harness trust, work isolation, and merge discipline |
| 2026: not wasting tokens | Koc framed the next constraint as token efficiency, process, tests, and refusal | Taste, triage, and management |
OpenClaw’s working method was not magic prompting. Koc described many Codex sessions arranged into swim lanes: some for CI, some for tests, some for features, some for P0 and P1 issues, some for monitoring incoming GitHub or Discord activity. The human operator’s job is to segment the work by risk and supervision load, then decide which sessions can run, which need conversation, and which should be killed.
His “blind faith in the harness” is not blind trust in AI. It is dependence on tests and process when change volume exceeds human reading capacity. Koc said the AI-generated unit tests were “awful” in the sense that they overfit the prior code, but after the team tore the system apart, those tests became a useful signal: if they went green, the system was at least close to its previous behavior. He also described eval loops, a fake Slack environment, PR clustering through semantic graphs and embeddings, and reusable .skills files maintained like engineering assets.
That turns agentic coding into an industrial management problem. Koc compared managing 10 or more agents to managing 10 or more staff. The skill includes detecting when an agent is bluffing, confused, or waffling in its explanation; knowing when to stop; and preserving enough process to prevent a flood of changes from becoming noise.
OpenClaw’s experience talks back to Awais and Lehrach. Tool-call repairs, action validators, and world models are local harnesses. OpenClaw needed an organizational harness: tests, swim lanes, PR clustering, skills, work isolation, and taste. The question becomes less “can an agent write code?” and more “can an organization supervise parallel agents, trust its checks, and decide what not to merge?”
The infrastructure race is moving from GPUs to custom silicon, networking, and capital discipline
Hock Tan’s Broadcom interview provides the hardware and capital counterweight to the software and harness material. If inference is becoming strategic, major AI buyers have reason to own more of the stack: custom silicon, networking, compute capacity, and multi-generation engineering roadmaps.
Tan said Broadcom is working with six customers building custom AI accelerated silicon. He described the objective as replacing, or at least competing with, the de facto Nvidia GPU path for customers whose workloads justify custom chips. Google is furthest along with TPUs. OpenAI has been engaged with Broadcom for more than two years, and Tan said its AI accelerated silicon is working in labs and data centers, with production expected late this year. Anthropic sits in a different arrangement: Broadcom and Google provided TPU compute capacity, a bet Tan said was made partly on Anthropic’s enterprise and coding-tool demand.
Tan did not frame this as a simple Broadcom-versus-Nvidia story. He called Nvidia the real competitor facing Google’s TPU effort, because Nvidia keeps improving generation after generation. Broadcom’s role, in Tan’s telling, is to help customers build technology strong enough to match that pace. That is a multi-generation engineering contest, not a one-chip event.
This aligns with Fu’s prefill/decode and hardware discussion. If inference workloads split into different bottlenecks, and if decode, KV cache, memory, interconnect, and model architecture all matter, then hardware buyers may not want one generic answer forever. They may pursue TPUs, custom accelerators, networking fabrics, specialized decode chips, or heterogeneous systems — while still depending heavily on Nvidia’s ecosystem where it remains the fastest practical route.
Broadcom’s networking exposure is part of the same production shift. Tan said switching demand is “taking off like a rocket” because AI clusters require massive networking capacity. He also said Broadcom prioritizes strategic customers likely to need repeated generations rather than chasing every order. That selectivity mirrors Baseten’s view that compute access is a strategic advantage, not a routine procurement line. Srivastava described B200 renewal pricing nearly doubling in one case and said compute scarcity may not normalize because agentic applications and larger models keep increasing inference demand.
Tan’s comments on internal AI tool use echo the token-economics debate from a different vantage point. He said Broadcom has not throttled employees’ AI usage because people are still learning how to use the tools, and because the return can be compelling if a senior engineer with AI can produce in one week what otherwise might take 10 engineers three months. Koc’s framing was stricter — 2026 is about not wasting tokens — but both treat token use as an operating discipline rather than a novelty. The question is ROI per token, per engineer, per workflow.
| Layer | What the selected cases show | Production constraint |
|---|---|---|
| Model serving | Fu described scheduling, KV caches, routing, kernels, and architecture co-design | Latency, throughput, memory, reliability |
| Application economics | Srivastava argued inference is AI applications’ COGS | Gross margin, defensibility, compute access |
| Coding agents | Awais and Koc described tool repairs, taste, tests, swim lanes, and review bottlenecks | Reliability, supervision, merge quality |
| Executable wrappers | Lehrach described simulators and legality harnesses | Validity, search, testability |
| Hardware | Tan described custom silicon, TPUs, OpenAI production plans, and networking demand | Ownership of stack, capacity, multi-generation engineering |
The frontier model still matters. Several arguments depend on strong base models: Baseten needs open models to remain close enough to the frontier; Command Code’s repairs are useful only if the underlying model can reason once the interface stops blocking it; Lehrach’s systems depend on LLMs that can write plausible code; Broadcom’s customers are building hardware because model demand is large and persistent.
But model quality is only one layer. The applied-AI race described here is increasingly fought in the machinery around the model: inference architecture, tool contracts, executable checks, workflow data, tests, review processes, custom chips, networking, and compute access. Once AI becomes a production workload, the surrounding system determines how much intelligence can be delivered, trusted, and afforded.











