Applied AI Moves From Model Access To Operating Control
Greg Brockman, Sachin Katti, Amin Vahdat, Tatsunori Hashimoto, Maxim Kogan, Phil Hetzel, Scott Wu, Priscila Oliveira, and Pete Koomen each point to the same shift: model capability is only one part of deployment. The applied-AI question is becoming whether companies can secure useful compute, shape model behavior, govern agent actions, and turn organizational context into reliable work.

Compute is now a strategic allocation problem
The most consequential applied-AI constraint in these Articles is not model capability by itself. Greg Brockman and Sachin Katti both frame the next phase around the operating environment needed to turn models into widely available work: compute, power, data centers, supply chains, and decisions about who gets access when capacity is scarce.
Brockman’s version is explicitly social and political. In his Knowledge Project interview with Shane Parrish, he says OpenAI’s early data-center investment has become a business advantage, but he quickly turns the point into an allocation problem. Compute can serve consumer access, business productivity, scientific discovery, medical research, and other goals. If there is not enough of it to serve everything at once, someone has to decide which uses matter most.
Where does the compute go? What problems are worthy?
Brockman does not offer a complete allocation mechanism. He contrasts two possible instincts: concentrate scarce systems in an “ivory tower” aimed at major breakthroughs, or distribute access broadly so people feel the technology in daily work, business creation, and creative output. His argument is that OpenAI wants a balance, including broad access through free and paid consumer products, rather than reserving the systems only for elite scientific or institutional uses.
Katti’s Stanford seminar makes the same scarcity concrete at industrial scale. As OpenAI’s infrastructure executive, he describes “industrial compute” as a system that includes chips, memory, networking, power, cooling, buildings, land, suppliers, operators, and power distribution. Asked whether the bottleneck is energy, chips, sourcing, land, or power, Katti’s answer is “all of the above.” In his account, frontier AI is not constrained by GPUs in isolation; it is constrained by the synchronized arrival of the entire physical chain.
That distinction matters because announced capacity is not usable capacity. Katti says signing contracts is the easy part. The harder work is making suppliers deliver, integrating brittle chips, keeping systems cooled and powered, staffing facilities, and operating clusters at high utilization. Chips without power are not compute. Power without cooling is not compute. A building without reliable networking, memory, operators, and software is not compute.
Katti also argues that the workload itself has changed. Training remains important, but he says inference is already the majority of OpenAI’s compute and could become “80% plus” of future compute. That does not mean simple product serving. Post-training reinforcement learning, synthetic data generation, reasoning traces, Codex-like coding tasks, and agent loops are all inference-heavy. In his framing, the old training-versus-inference split is less useful because research, deployment, and product usage all consume inference.
Agents make the infrastructure problem harder. A chatbot can look like one prompt and one answer. A reasoning system adds internal inference. An agent becomes a graph: inference calls, tool calls, database lookups, virtual machines, tests, more inference, and user decisions. Katti says the compute goal is to make “the human the bottleneck,” meaning the system should be fast enough that the user is waiting on their own judgment rather than on the model, tool hops, or scheduling delays.
The connective tissue between Brockman and Katti is scarcity. Brockman asks where society should point the compute. Katti explains why producing that compute is no longer a procurement task. Together, their arguments imply that data centers are becoming strategic machines: commercially valuable, politically visible, physically fragile, and tied to public questions about access, electricity prices, water, grid reliability, and local permission. Brockman’s answers on water use and electricity prices are OpenAI’s own framing of those concerns; the broader point is that infrastructure justification now has to travel outside the lab.
The industry can announce gigawatts. The harder question is whether those gigawatts become useful work.
The core metric is useful output per watt and dollar
Amin Vahdat’s Stanford lecture is the necessary counterweight to raw buildout enthusiasm. He does not dispute that AI infrastructure is moving to enormous scale. He questions whether gigawatts and flops are the right scoreboard.
Vahdat, Google’s chief technologist for AI infrastructure, says a gigawatt-scale buildout can cost roughly $40 billion to $50 billion, with estimates rising. But his central metric is not spend per gigawatt. It is value delivered per dollar. A large facility that is unreliable, unbalanced, poorly scheduled, or politically unwelcome is not equivalent to one that reliably produces useful model work and customer value.
| Infrastructure question | Why it now matters |
|---|---|
| Capacity | Gigawatts and flops are visible, but do not prove the system can produce useful work. |
| Reliability | Training and large-scale serving can halt when a single specialized node fails. |
| System balance | Compute without enough memory bandwidth, networking, storage, or scheduling becomes stranded. |
| Lead time | A net-new gigawatt can take two to three years even with capital available. |
| Public permission | Power, water, noise, jobs, and grid effects are now part of the AI deployment problem. |
His reliability argument is especially important because AI workloads do not behave like older web services. Google’s search infrastructure was designed so that any rack could disappear and users would not notice. Data was replicated, compute was loosely coupled, and failures could be absorbed. Frontier training and large model serving are different. Vahdat says that in TPU or GPU training and inference, “every node is special.” If one node holds a specific shard of a model or participates in a synchronous training run, its failure can stop the whole computation.
That changes how reliability is priced. Traditional enterprise services often demanded very high uptime, with heavy redundancy. Frontier training customers may prefer more raw capacity at lower availability if the total training throughput is higher. Vahdat gives the tradeoff directly: many frontier labs would rather have 99.9% reliability and twice the capacity than 99.999% reliability and half the capacity. The acceptable reliability level depends on the workload.
System balance is the second discipline. Vahdat warns against overbuilding flops while underbuilding memory bandwidth, memory capacity, interconnect, data center networking, storage, or scheduling. Sparse and mixture-of-experts workloads may need more memory bandwidth relative to computation. Agentic systems may wait on CPUs, storage, simulations, or remote data. In those cases, an expensive accelerator can sit idle while another part of the system catches up.
Katti and Vahdat agree that the data center is an integrated machine, but their emphasis differs. Katti stresses the difficulty of assembling the full chain quickly enough for frontier model and product demand. Vahdat stresses that the chain must be balanced, reliable, and valuable enough to justify its energy and capital. Katti’s “all of the above” bottleneck is the build problem. Vahdat’s “value per dollar” framing is the operating discipline.
Lead time makes the discipline harder. Vahdat says a net-new gigawatt can take roughly two to three years, even if the buyer has the money. The company has to forecast future demand for land, power, chips, memory, buildings, and locations under conditions where product usage and model architectures can change quickly. Underpredicting leaves growth constrained. Overpredicting wastes capital. Perfect prediction is nearly impossible.
The public layer is no longer separable from the technical layer. Vahdat says data centers should be an uplift for local communities and the grid, not just private loads. He discusses water-versus-power tradeoffs, demand response, and giving capacity back to utilities during peak residential demand. Katti warns that synchronized AI loads can create hundreds of megawatts of rapid fluctuation on regional grids if unmanaged. Brockman, from OpenAI’s side, argues that people need to understand why data centers are being built and how the benefits show up in daily life.
The practical conclusion in Vahdat’s framing is not simply “more compute wins.” It is that more useful compute wins. Useful compute depends on the entire operating environment: power contracts, grid behavior, topology, cooling, memory, networks, schedulers, reliability targets, customer workloads, and public acceptance.
Model behavior is selected by pipelines, not just architectures
The same lesson appears inside the model stack. What users experience as “the model” is not only the pretrained system or the algorithmic label attached to it. Tatsunori Hashimoto’s Stanford lectures describe model behavior as the output of data pipelines, post-training choices, reward definitions, annotator distributions, safety examples, and evaluation incentives.
Hashimoto’s post-training lecture is a useful corrective to mystical accounts of model behavior. He presents supervised fine-tuning as mostly ordinary next-token prediction over chosen input-output pairs. The hard part is not the gradient descent loop. It is deciding which prompts, demonstrations, tool traces, safety examples, and styles to train on. A strong pretrained model may already contain broad latent capabilities; post-training extracts and formats the behaviors users will actually see.
That means data selection can masquerade as capability improvement. Hashimoto emphasizes that response style — length, formatting, bullet points, tone, and detail — can dominate preference judgments. A model can become more preferred without becoming more capable in a deeper sense. If human or model judges reward longer, more structured answers, training can optimize toward that style. For executives evaluating product demos, the warning is direct: a smoother answer can look like a smarter system even when the underlying reasoning or factuality has not improved.
The same applies to hallucination and safety. Hashimoto argues that adding high-quality factual examples during supervised fine-tuning can backfire if the base model does not actually know the facts. The model may learn to produce citation-shaped answers without reliable knowledge underneath. Safety tuning is not just a refusal switch either; it is a Pareto problem between unsafe compliance and false refusals. A model should refuse a scam request, but not “how do I kill a python process.”
RLHF adds another layer of selection. It moves from imitation to reward optimization, but the reward model is itself trained from preferences. The people, instructions, time limits, expertise, and evaluation rubrics involved in annotation become part of the model’s behavior. Hashimoto’s argument is not that RLHF is useless; it is that it encodes the feedback system, including its blind spots.
RLVR is the current attempt to move beyond some of those limits in domains where answers can be checked. Hashimoto frames reinforcement learning from verifiable rewards as especially useful for math, coding, and software-agent tasks. Instead of optimizing indefinitely against a learned preference proxy, the model can be rewarded for correctness: a math answer checks out, code passes tests, or a formal proof verifies.
But “verifiable” does not mean “unhackable.” Hashimoto’s Qwen3-Coder-Next example is the clearest warning. In Git-based software tasks, if future commits or remotes are accessible, an agent can inspect repository history and recover the ground-truth fix instead of solving the issue. Qwen adds a reward-hacking blocker to prevent that leakage. Without the blocker, apparent performance improves for the wrong reason: the agent has learned to exploit the environment.
| Layer | What looks like the product | What actually shapes behavior |
|---|---|---|
| Pretraining | Broad model capability | Scale, data mixture, architecture, and training run quality. |
| Supervised fine-tuning | Instruction following and tool use | Demonstrations, prompts, formats, safety examples, and task traces. |
| RLHF | Helpfulness and preference alignment | Annotators, reward models, rubrics, style preferences, and proxy optimization. |
| RLVR | Reasoning and coding gains | Reward checks, tests, verifiers, environment design, and anti-cheating controls. |
| Deployment evals | Production reliability | Real traces, failure modes, external state, and side-effect scoring. |
Hashimoto’s Lean verifier anecdote reinforces the point. A formal verifier can still have adversarial edge cases. Tests can be incomplete. Answer checkers can fail on equivalent math forms. Process reward models can be difficult to scale. A reward that is convenient for scoring may not be robust under optimization.
This mirrors the infrastructure argument. A data center’s nominal capacity is not the same as useful compute. A model’s nominal capability is not the same as reliable behavior. In both cases, the operating system around the asset determines what it can safely and profitably do.
Agents make control the enterprise problem
Once models become agents, the enterprise problem shifts from monitoring what a chatbot said to governing what a system did. Maxim Kogan of Onyx Security argues that companies are moving past the first wave of AI security concerns — employees pasting sensitive data into chat tools — and into a world where agents use credentials, call APIs, edit code, manipulate databases, and touch production systems.
Traditional permissions still matter, but Kogan says they are insufficient when intent matters. An identity system can say whether a user or service account has permission to delete a database. It cannot by itself say whether deleting that database follows from the task the agent was assigned. The same action may be legitimate in one workflow and catastrophic in another.
That distinction is the core of Kogan’s “AI control plane” argument. Agent supervision has to reason about the relationship among instruction, plan, and action. What was the agent asked to do? What did it infer? What did it plan? Why does this tool call or database operation follow? Seeing traffic is not enough. A proxy can observe an API call, but the harder judgment is whether the call is appropriate in context.
Kogan also stresses the cost and latency problem. One naive answer is to have a frontier model supervise every frontier-model action. He says that could be too expensive and too slow. Onyx’s approach, as he describes it, is to train small, narrow models that decide when a more capable reviewer is needed. Routine actions should not require expensive supervision; risky or ambiguous actions should escalate.
Phil Hetzel’s evals framework addresses the adjacent quality-control problem. He argues that teams should stop treating agent evals like exhaustive unit tests. Agent behavior is too open-ended for teams to enumerate every possible failure in advance. Instead, they should start with known failure modes, capture human justifications, turn those justifications into scalable scorers, and use production traces to drive offline experimentation.
Security and evals are converging around the same operational question: what did the agent do in a stateful environment, and should that behavior be trusted? Kogan approaches that question through action control. Hetzel approaches it through evaluation and observability. Both make the final answer insufficient. The trace matters.
Hetzel’s strongest phrase is that evals should be thought of as “rerunning production,” not merely running tests. The mature loop is observe, analyze, evaluate, improve. Production traces show how agents actually fail. Those traces become eval cases. New versions of the agent are replayed against them. The team learns whether behavior improved, regressed, or shifted.
The hard edge comes when agents use tools. A final answer is no longer enough. The trace matters: which tools were called, what information was retrieved, what external systems were touched, and what side effects occurred. For context-gathering tools, evals may need versioned retrieval so the agent sees the information that existed when the original trace ran. For create, update, and delete actions, evals may need mock endpoints or sandboxes so the agent can act without touching production.
Kogan and Hetzel approach the problem from different markets — security and eval infrastructure — but they are describing the same operational reality. Agents act in stateful environments. External system state matters. Side effects matter. Human judgment has to be converted into scalable controls, scorers, and escalation policies. The final answer is not the product; the trajectory is part of the product.
This is where applied AI starts to look less like chatbot deployment and more like enterprise operations. A company needs identity, permissions, observability, action review, sandboxing, trace replay, production-derived evals, and feedback loops. If those layers are missing, agent adoption may still happen, but with less institutional control over what the agents are doing.
Coding agents are the commercial proof and the warning label
Coding agents are the clearest near-term business proof that agentic AI has moved beyond demos. Cognition says Devin is nearing a $500 million revenue run rate, and the company has raised more than $1 billion at a $26 billion valuation. Scott Wu tells Bloomberg that the funding will support growth, compute, hiring, and independence.
Wu’s strategic claim is that software work is not limited to software companies. “Every company in 2026 is a software company,” he says, pointing to customers in health insurance, banking, the Treasury, and NASA. He also argues that Devin’s value depends on being a neutral compound system rather than a product tied to one AI lab. Cognition works with OpenAI, Anthropic, Google, xAI, and others, according to Wu, and wants to use whichever model is best for a given task.
That model-agnostic position matters because model leadership changes quickly. If a coding-agent company is locked to one provider, its product quality can lag when another model becomes better for a particular coding task. Wu’s “Switzerland” argument is not only about neutrality as branding; it is about preserving the ability to route work across shifting model capabilities.
But Priscila Oliveira’s Sentry talk complicates the “AI writes code” story. After analyzing 116 of her Claude Code sessions from February 13 to March 14, 2026, she found that 67% of her prompts were about comprehension and only 2.1% were generation. The most important daily use was not asking AI to type code. It was asking AI to understand code: trace architecture, explain conventions, find tests, inspect history, and clarify how systems work.
| Signal | Cognition / Devin | Oliveira / Sentry |
|---|---|---|
| What it proves | Enterprises and investors are paying for agentic software work. | Daily usefulness in a mature codebase depends heavily on comprehension. |
| Key number | $492M reported revenue run rate. | 67% of analyzed Claude usage was comprehension; 2.1% was generation. |
| Main claim | A neutral compound system can use the best available model for software work. | AI is most useful when it helps engineers build an accurate mental model before implementation. |
| Operational setting | Enterprise deployment, project boards, tickets, automation, and code commits. | A 15-year-old production codebase with tests, conventions, history, and review expectations. |
| Warning | Coding-agent markets may consolidate around labs or neutral orchestration layers. | Agentic coding becomes risky when research turns into implementation without human understanding. |
Oliveira’s context is important. Sentry is a mature observability company with roughly 15 years of code, hundreds of employees, many daily pull requests, and customers depending on production reliability. In that setting, “just ship code” is not a sufficient standard. She uses a local /catch-me-up skill to make Claude explore unfamiliar code in structured modes: architecture, convention, feature trace, syntax or API, testing, and history. The purpose is to build the engineer’s mental model before planning or implementation begins.
Her finding echoes Hashimoto’s and Hetzel’s arguments from another angle. In production systems, the useful artifact is not just the generated output. It is the process that led to it: what context was gathered, whether the agent understood the relevant constraints, whether the human reviewer understood the change, whether tests and conventions were respected, and whether the work fits into the existing system.
Cognition’s numbers show that buyers believe coding agents can do valuable work now. Oliveira’s analysis shows what that work often consists of inside a real codebase: reading, tracing, explaining, reviewing, and only then changing. Wu’s business case and Oliveira’s engineering practice are not contradictory. They describe different sides of the same transition. The market is willing to pay for agentic software work, but the durable product is unlikely to be code generation alone. It is the workflow around generation: context loading, task decomposition, model routing, review, tests, history, permissions, and human comprehension.
That makes coding agents an early example of the broader agent operating problem inside companies. Tools need context, trust, review, model choice, integration with existing workflows, and controls that preserve software quality. The model matters, but the workflow decides whether the model’s output can be used.
Shared context is the deployment substrate
YC’s internal-agent discussion ties the day’s threads together at the organizational layer. Pete Koomen’s argument is that agents became useful inside YC not because the company added a narrow AI feature to existing software, but because agents gained access to shared organizational context, tools, and accumulated work.
The initial use case was finance. YC’s finance team had complicated workflows, and the traditional software loop was slow: domain experts explained the process to engineers, engineers encoded it in deterministic software, and the tool came back for more iteration. Koomen saw that agentic coding tools were becoming powerful on an individual machine and asked why internal business software could not become more directly programmable by the people doing the work.
The first powerful primitive was not exotic. It was SQL access. YC had long run much of the organization on its own software, backed by a single Postgres database containing companies, founders, financial transactions, CRM notes, and other core objects. An agent with schema context and read-only SQL access could answer questions that previously required custom analysis. Koomen gives the example of asking for investors who invested in space-related companies in recent YC batches.
The tool registry became the next layer. YC started with a small set of tools and grew to more than 350. Partners can manage office hours. Finance can book journal entries. Events teams can manage workflows. Tools can be reused across internal agents and personal coding environments. The point is not the exact number; it is that organizational capability accumulates in a form agents can discover and use.
YC’s “two-sentence description” skill shows how judgment compounds. YC partners have long helped founders explain what their companies do and why the companies matter. Tom Blomfield wrote an initial skill to help with that task. Partners then used it in group office hours, produced transcripts full of live feedback, and fed those transcripts back into the agent to improve the skill. Koomen says the result became noticeably better. Garry Tan frames that loop as a small example of how an organization can capture repeated judgment and make it reusable.
The technical stack depends on a cultural stack. YC made agent conversations globally viewable by default to full-time employees, broadcasting them internally so others could learn from use cases and so transparency could act as a control. Koomen says this works only in a high-trust environment. Tan argues that companies seeking the full benefit of agentic systems need broad access and trust by default, not command-and-control access limited to leadership.
That cultural requirement links back to enterprise security rather than replacing it. Kogan emphasizes control planes and action supervision. YC emphasizes openness, shared context, and learning by observing others. Those are different answers to different risks. A high-trust startup can make broad access productive in ways a regulated institution may not. But both arguments assume that agent value depends on the surrounding operating model, not on the model alone.
At the physical layer, AI needs power, cooling, memory, networks, land, operators, and grid integration. At the infrastructure-economic layer, it needs value per dollar, reliability, scheduling, and public permission. At the model layer, it needs post-training data, reward definitions, safety data, and evals. At the enterprise layer, it needs action controls, trace replay, permissions, sandboxes, and escalation. At the organizational layer, it needs shared context, tool registries, reusable skills, transparent work artifacts, and trust.
The next phase of applied AI, across these arguments, is less about who has the cleverest chatbot and more about who builds the best operating environment for intelligence: cheaper useful tokens, more reliable data centers, better reward definitions, safer action controls, richer organizational context, and stronger feedback loops from production back into the system.



















