Applied AI’s Frontier Moves From Models To Control Loops

Applied AIThursday, June 4, 20262h 14m to watch18 min read

Satya Nadella framed the AI frontier as the company-specific system around models: private evals, traces, tools, proprietary context, and a harness that can keep improving as models change. Across enterprise agents, accounting, cybersecurity, software engineering, formal verification, governance, and interfaces, the same pattern appears: capability becomes useful when organizations can bound it, inspect it, verify it, and decide who remains accountable.

Capability is moving into the company-specific loop

Satya Nadella’s strongest claim at Microsoft Build was not that one model, product, or device now defines the AI frontier. He described the frontier as the system a company builds around models: private evals, traces, tools, proprietary context, reinforcement environments, and a harness that can swap models while continuing to improve.

That framing turns applied AI from a procurement question into an operating question. A company that merely consumes a frontier model is dependent on the model provider’s roadmap. A company that defines its own success measures, collects its own traces, routes work through its own tools, and trains or tunes against its own evals begins to compound something more specific. Nadella called that possibility “frontier intelligence” that every company can operate for itself.

The Land O’Lakes example made the claim concrete. Microsoft used GPT-4o, collected traces, and then used a 5B reasoning model to achieve higher performance on the task. In Nadella’s telling, the frontier was not located only in the original general-purpose model. It also emerged from the combination of the model, the organization’s traces, and a smaller specialist model trained to climb on the organization’s task.

Private evals become central in this account because public benchmarks are increasingly insufficient. Nadella said many public evals can now be maxed out. The more important test is whether a company can define what good looks like for its own work, run different models against that standard, and keep the resulting traces from leaking. If it can move from model A to model B while still improving on its private eval, it has leverage. If it cannot, it is locked into someone else’s implementation.

Every company having private evals may be the biggest IP.

That is why Nadella accepted Shawn Wang’s suggestion that Microsoft’s third act may be as a “harness or evals company.” The harness is where models, tools, data, and context meet. Nadella described Microsoft products such as GitHub Copilot, security work, M-Dash, and science discovery as multi-model harnesses, not single-model wrappers. The valuable layer is not just calling a model; it is preparing context, disclosing tools progressively, routing work efficiently, and using traces from actual work to improve the system.

Nadella also pushed the idea into organizational memory. He described each company as having human capital and “token capital,” with Teams-like traces between people and agents becoming material for what he called a “company veteran agent.” That agent is not simply a chatbot trained on documents. It is an accumulated working memory of how a company acts, decides, coordinates, and corrects itself over time.

15 months

period in which Microsoft built more Azure capacity than in its first 15 years, according to Nadella

The infrastructure example carried the same logic. Microsoft’s Azure networking team, Nadella said, built more capacity in the previous 15 months than in Azure’s first 15 years. The team reframed its task from doing Azure networking to building the agentic system that does Azure networking. In that story, the ambitious work is not merely using AI to make an existing task easier. It is building the machinery that can perform, monitor, and improve the task at scale.

Nadella’s argument is also a Microsoft platform argument. Microsoft benefits if Azure, Microsoft 365, GitHub, Foundry, and Copilot become the places where enterprise traces, tools, models, and evals converge. But the broader shift does not depend on Microsoft being uniquely right. Once models are capable enough to act, the scarce advantage moves toward the systems that evaluate, constrain, route, audit, and improve those actions.

Private Evals Are Becoming the Core IP of Enterprise AINo Priors

Bounded workflows are proving more plausible than loose autonomy

Across TBPN’s enterprise examples, the commercially coherent AI systems were not unconstrained agents. They were bounded systems embedded in infrastructure that can price, route, audit, secure, or constrain them.

John Coogan’s discussion of Microsoft’s Build announcements fit Nadella’s harness thesis from another angle. Coogan framed Microsoft’s agent strategy less as a consumer-device push than as a bid to make Azure and Microsoft 365 the controlled environment for enterprise agents. Project Solara, especially the badge-like portable device, looked questionable as hardware. The more serious argument was architectural: if a company already lives inside Microsoft’s cloud, a thin, secure access device can become one endpoint into agents running across Outlook, Teams, SharePoint, OneDrive, calendars, documents, and enterprise identity.

The hosts were skeptical that the device itself would replace a phone. Coogan’s stronger version of the case was that agents fit a cloud-hub model better than prior mobile interactions did. A user may briefly authenticate, delegate, and later review work done in the background. In that world, the hardware is less important than the boundary: secure identity, enterprise context, delegated access, and review.

OpenClaw sharpened the point. Microsoft’s Scout agent is expected to be powered by OpenClaw, with Microsoft contributing security guardrails back to the open-source project. Coogan described OpenClaw as powerful but rough-edged. Inside Microsoft’s environment, the appeal is not that a general agent can roam everywhere. It is that an open agent framework can be domesticated inside Copilot, Azure, and Microsoft 365, where permissions, documents, spreadsheets, databases, email, and security rules can be managed inside an enterprise boundary.

The same controlled-workflow logic appeared outside Microsoft. Ramp Stack’s accounting product was presented as an “AI operating system for accounting firms,” not as a generic chatbot. Its promised work includes booking transactions, amortizing prepaids, reconciling accounts, connecting to ERPs, learning firm-specific processes, requiring human approval before posting, and recording every action for auditability. Accounting is a useful example because accuracy, approval, and audit trails are not optional. The system is valuable only if it lives inside the workflow and leaves a record.

Cybersecurity reinforced the point even more sharply. Nikesh Arora argued that AI does not make cybersecurity infrastructure obsolete because unaided models still produce unacceptable false positives. Palo Alto’s position was that models need harnesses, context, explanations of what code is supposed to do, and enforcement points. AI may improve detection and response, but it does not remove the need for the network, endpoint, and policy systems where threats are actually blocked.

Domain	AI capability	Control layer that makes it usable
Enterprise agents	Delegate work across email, files, calendars, and business apps	Cloud identity, permissions, Microsoft 365 context, review surfaces
Accounting	Reconcile accounts and prepare journal entries	ERP connections, human approval, firm-specific process memory, audit trails
Cybersecurity	Find vulnerabilities and inspect traffic	Harnesses, ground truth, enforcement points, patch workflows
Creative production	Generate visual concepts or music sketches	Human direction, development craft, production workflow, distribution context

Across TBPN’s examples, AI becomes more plausible when it is inside a workflow with boundaries and records.

The creative examples followed the same structure, though with different constraints. Martin Scorsese’s use of Black Forest Labs’ FLUX 2 was framed as previsualization: describing a town, asking for narrower streets and a higher camera angle, and using generated images as a way to think through cinematic possibility. The point was not that AI makes the film. It was that AI enters an existing workflow as a tool for ideation, analogous in Scorsese’s comparison to production designers creating oil paintings.

Suno supplied a useful contrast with more verifiable domains. Mikey Shulman argued that music lacks “right answers” and therefore lacks the verified rewards that coding or chess can use. That forces a different kind of product discipline: retention, user delight, creative formats, and workflows where AI functions as an instrument or fan-participation layer rather than a formally judged solver.

Accounting needs auditability. Cybersecurity needs enforcement points and exploitability judgments. Enterprise agents need permissions and review. Creative systems need human taste, development, and distribution context. TBPN’s applied-AI examples pointed less toward autonomy released into the world than toward capability embedded in bounded operating environments.

Useful AI Systems Are Emerging Inside Controlled Enterprise WorkflowsTBPN

Human work is shifting toward verification

If AI systems generate more candidate work, human labor does not simply disappear. It moves toward supervision, verification, integration, judgment, and craft preservation. The Melbourne AI engineering talks made that shift explicit, and they complicate a simple productivity story.

Jeremy Howard framed the issue psychologically before he framed it technically. Drawing on self-determination theory, he warned that AI tools can either support autonomy and mastery or simulate them while eroding both. A tool that helps an engineer tackle harder work while understanding it can deepen craft. A tool that presents choices the engineer does not understand, or removes effortful practice from the work, can create an illusion of control.

Howard’s strongest concern was not that AI coding is useless. It was that some agentic coding loops feel productive while detaching the user from real feedback and understanding. He described “junk flow”: a compelling, dopamine-driven loop in which generated work accumulates, demos look plausible, and the hardest verification comes later. The danger is precisely that it does not feel like stagnation while it is happening.

Annie Vella’s longitudinal study gave empirical shape to the same transition. She studied professional software engineers using AI at work or home across two questionnaires six months apart, with participants from 28 countries. Her first finding was a statistically significant “creation to verification shift.” Engineers reported spending less time on most creation-focused activities, while code review increased slightly. Vella called the emerging work “supervisory engineering”: directing AI, evaluating output, debugging, testing, and integrating candidate work into a coherent system.

countries represented in Vella’s longitudinal study of professional software engineers using AI

That shift is central to applied AI because enterprise harnesses and private evals only work if humans retain enough craft to supervise the system. A company cannot benefit from traces, agents, and model routing if its workers lose the ability to know whether the output is coherent, maintainable, secure, or fit for purpose. Verification here is broader than formal proof. It includes code review, testing, product judgment, debugging, design taste, and the ability to notice when a system is producing plausible nonsense.

Vella’s second finding made the tradeoff sharper. At both study points, 84% of engineers said AI made them feel more productive. But reports of deterioration in developer experience nearly doubled, from 14% to 27%, across measures such as cognitive load, flow state, and feedback loops. Flow state was the most negatively affected dimension, while feedback loops improved.

Measure	First questionnaire	Second questionnaire
Engineers reporting higher productivity with AI	84%	84%
Engineers reporting decline in at least one developer-experience dimension	14%	27%
Most negatively affected dimension	Flow state	Flow state
Dimension that improved	Feedback loops	Feedback loops

Vella found stable perceived productivity gains alongside worsening developer-experience indicators.

That paradox matters for managers evaluating AI adoption. More feedback can be useful, but it can also fragment attention. More generated code can accelerate a demo, but make later integration harder. Higher perceived productivity can coexist with more cognitive load and less sustainable craft.

Howard’s alternative was not hand-coding nostalgia. He placed AI in the older tradition of intelligence amplification: Sketchpad, Engelbart, APL, Bret Victor, Swift Playgrounds, and other systems that made human thought more powerful rather than replacing it. His SolveIt examples were meant to show AI keeping the human in contact with the work: reading papers, asking questions, running code, checking results, experimenting with CSS, and using visual feedback to refine understanding.

Vella’s future-role framework similarly avoided a single outcome. She described artisanal developers, orchestrators or “code conductors,” and clerical coders. The clerical path is the danger: agents do work overnight and humans arrive to approve pull requests without meaningful judgment. The orchestrator path preserves more craft: humans understand the problem, write strong specifications, supervise agents, and build the harnesses that build the software.

That labor shift links back to the enterprise control layer. Private evals, harnesses, and workflow agents are only as good as the people and processes that define success, inspect failures, and improve the loop. AI can move work from creation to verification. It cannot make verification optional.

AI Engineering Must Preserve Craft as Work Shifts to VerificationAI Engineer

Formal verification is the hard edge of the same movement

Carina Hong’s Axiom Math argument is the most ambitious version of the verification theme. She does not frame formal verification mainly as a safety tax or a compliance layer. She frames it as a way to scale reasoning itself.

Axiom’s claim is that Lean-based verified generation gives AI systems a sharper training signal than informal reasoning alone. In a Lean setting, a proof either checks or it fails, assuming safe conditions and no improper escape hatches. That binary feedback can support reinforcement learning and proof search in a way that human expert grading or LLM-as-judge evaluation cannot easily scale.

Lean matters here because it turns proof into executable structure. Hong described Axiom Prover as an ensemble system that recursively decomposes proof goals into subgoals, learns to backtrack, and scales inference over proof search. The formal substrate gives the system reusable, checkable artifacts rather than persuasive natural-language arguments.

Axiom’s visible proof point is its reported 120 out of 120 score on the 2024 Putnam exam. Hong said Axiom competed in real time, while Math Arena found DeepSeek scored 103 and the best human score was 110. The figures were reported by Hong in the interview. The point of the result, in her telling, is sample efficiency: a formal math system with far less data could beat informal systems on a hard exam because verification supplies a cleaner signal.

120/120

Axiom’s reported score on the 2024 Putnam exam

The connection to enterprise AI is not that every workflow should become Lean. It is that different domains need different definitions of success. Private evals define success for company-specific work. Workflow harnesses define success through approvals, audit trails, and policy enforcement. Formal verification defines success in domains where correctness can be specified precisely enough for a checker.

Hong’s caution is as important as her ambition. Formal verification does not remove the specification problem. If a bank asks for a “safe financial audit,” nothing has been formally specified yet. “Safe” has to be translated into precise statements. Hong’s formulation was blunt: if it is not specified, it is not proven. A proof can be correct and still prove the wrong thing if the statement does not capture the user’s intent.

That boundary also appears in code. Hong described a future in which a high-level coding agent decomposes a complex task, hands a precise component to Axiom, and receives code with proof or a refusal if the task is too hard. Testing remains useful because tests, counterexamples, and input-output pairs help ground what the specification should be. Verification relocates judgment to specification, decomposition, and deciding what is worth proving.

Axiom’s Code Verina discussion showed the commercial bridge from math to software. Hong reported low pass rates for GPT-style and prover systems on a code-with-proof benchmark, then said Axiom’s Putnam system, without modification, solved 187 of 189 full code-with-proof problems. Those are Axiom’s reported results from the discussion, and the useful point is the thesis they support: verified generation could become a practical path from formal math into software and hardware verification if the reported capability generalizes.

The remaining bottlenecks are human and institutional as much as technical. Hong emphasized provenance, blueprints, taste, attention, and organizational focus. The Erdos-problem controversy showed that proving a statement is not the same as proving novelty; search and literature provenance remain difficult. Blueprint writing showed that large formal projects need human decomposition and coordination. Taste remains necessary because a system may produce more correct mathematics than humans can absorb, forcing people to decide which results matter.

Formal verification is therefore not an escape from human judgment. It is a more stringent version of the applied-AI stack’s broader control problem: specify what success means, check what can be checked, preserve provenance, and keep humans responsible for the parts that remain underspecified.

Axiom Math Says Verified Reasoning Can Outscale Informal AILatent Space

Governance and security are becoming operational bottlenecks

As AI systems become more capable, governance moves from a policy abstraction into release management, vulnerability triage, permissioning, audits, and real-time enforcement. Nathan Labenz and Prakash Narayanan treated that shift as the central governance problem: institutions want review and assurance, but the review process itself can slow defensive deployment or blur accountability.

The release-delay scenario around frontier models captured the tension. Narayanan imagined a frontier model previewed to government agencies before release. Agencies use it to find security holes across government systems, discover more than they can patch, panic, and ask the lab not to release. That might protect the government temporarily, but could harm companies if comparable open-source capability appears while the strongest defensive closed model remains delayed.

Labenz described Trump’s AI executive order as appearing to ask frontier model companies to give the government roughly 30 days before release to review models and render opinions. He was unsure how much force the order has and characterized it as a possible “gentleman’s agreement.” But the structure still matters: once a review process exists, officials have a venue to object, delay, or impose conditions.

Before release
Frontier labs may give government early access to a model for review.
During review
Agencies may discover vulnerabilities faster than they can patch them.
At release decision
Officials may seek delay even if the same capability could help defenders outside government.
After deployment
Auditors, security teams, and policy engines become responsible for ongoing controls rather than one-time approval.

Audits are becoming the political compromise. Labenz saw convergence around pre-release review and third-party audits, especially in state-level bills such as Illinois’s proposed AI safety law. Narayanan viewed audits partly as public assurance rather than a complete technical mechanism. An auditor will not know everything internal teams know or move at the same speed. But audits can reassure governments, the public, customers, and political actors that someone has looked under the hood.

Security operations show why that assurance is hard. Tal Hoffman and Yanir Tsarimi of Enclave AI argued that AI has made it easier to find possible bugs than to decide which vulnerabilities matter. Current cyber evals often measure exploitation of known vulnerabilities, not discovery from zero. Enclave’s claim is that a smaller model plus the right harness, context, workflow, and expert knowledge can reproduce results associated with stronger models. The bottleneck shifts from “can a model produce a finding?” to “is this exploitable, urgent, and worth patching now?”

That is an enterprise workflow problem, not just a model problem. Security teams send findings to engineering teams. Engineering teams have release dates and backlogs. Some issues are deep design flaws rather than simple code edits. AI can flood bug bounty programs with plausible reports, increasing the need for exploitability judgment and prioritization. In the same way private evals and enterprise harnesses define what “good” means inside a company, security teams now need harnesses that define which findings are real, which are reachable, and which demand action.

Agent security raises the same issue through permissions. Narayanan noted that useful agents need access to Gmail, Slack, databases, internal systems, and execution rights. Hoffman and Tsarimi emphasized familiar controls: zero trust, segmentation, blast-radius limits, narrow tool access, and careful treatment of write permissions. The novelty is not that authorization matters for the first time. It is that AI agents make authorization operationally urgent because they can act across systems at speed.

Moonbounce’s policy-engine discussion showed the content and product-safety version of the same problem. Brett Levenson argued that real-time enforcement requires decomposing ambiguous rules into smaller, auditable questions. A policy such as “hate speech” or “sycophancy” is too broad to enforce consistently unless it is broken into evidence questions that reviewers, classifiers, and customers can inspect. Moonbounce uses small models, prefix caching, binary classification heads, and deeper scans to keep enforcement fast enough to sit in the product path.

Latency, in Moonbounce’s account, becomes a product condition for governance. A safety system that adds too much delay will not be adopted, or will be pushed out of the critical path. Levenson said easy cases can be handled in under 200 milliseconds, while deeper scans are in the 300-to-500 millisecond range for text. The product requirement is not merely to have a policy. It is to apply the policy fast enough, consistently enough, and with enough auditability that payment providers, regulators, or internal teams accept it.

The unresolved question, as Labenz put it, is whether low-level controls add up to the high-level system properties people want. Formal methods face the same issue: a proof is only as good as the specification. Policy decomposition faces the same issue: subchecks may not fully capture the higher-level behavior. Security harnesses face the same issue: a finding is not the same as an exploitable, prioritized vulnerability.

Governance therefore becomes part of the applied-AI operating layer. Release review, audits, permissioning, security triage, and real-time policy enforcement are now mechanisms through which AI systems are allowed to act.

AI Governance Shifts From Model Review to Release BottlenecksThe Cognitive Revolution

Agents need interfaces and memory that make supervision possible

If applied AI is becoming a controlled operating system, the practical question is what humans actually see, edit, approve, and recover when agents work. Two interface-level arguments point to the same answer: chat and prompts are not enough. Agents need surfaces for supervision and durable project memory that can be enforced.

Ruben Casas argued that models crossed a frontend-generation threshold before product interfaces caught up. He described the current chat-heavy paradigm as analogous to using a terminal before the graphical interface for the “new computer” has been invented. The question is not only whether AI interfaces live inside SaaS products, super apps, or standalone tools. It is what the model is allowed to generate.

Casas distinguished three patterns. Static components let agents pass props and data into predefined UI. That preserves control but limits runtime flexibility. Declarative UI lets a model generate JSON, YAML, or similar descriptors that a renderer turns into interface elements using approved components. Fully generative UI lets a model write HTML, CSS, and JavaScript at runtime, which is more flexible but creates a containment problem.

Pattern	Model output	Control mechanism	Practical tradeoff
Static components	Props and data	Prebuilt component set	High control, limited flexibility
Declarative UI	JSON, YAML, or similar descriptors	Renderer and design system	Flexible while preserving consistency
Fully generative UI	HTML, CSS, and JavaScript	Sandboxed delivery boundary	Most dynamic, highest containment risk

Casas’s interface taxonomy turns agent UI into a question of what generation is allowed and where it is contained.

Casas’s practical preference today is declarative UI because it balances flexibility and control. The model can compose more dynamic interfaces, but the application still owns the renderer, component library, and design system. Fully generative UI may become important, but Casas argued it needs sandboxing. MCP apps matter in his account because they provide tool calling, authentication, message passing, and a double-iframe boundary that can contain runtime-generated interfaces.

That interface problem connects directly to Nadella’s comment that chat is inadequate when a developer has 100 agent sessions running. If agents work in parallel, over time, and across tools, the human needs a canvas, an artifact, or a workspace for inspection. Casas’s Excalidraw example pointed in that direction: not merely visual output, but a shared surface where a human and agent can both manipulate the artifact.

Michal Cichra’s argument brings the same principle inside the software repository. AI-assisted development, he said, does not fail mainly for lack of prompts. It fails for lack of enforceable memory. Teams forget why flows exist, why code is shaped a certain way, where a new feature belongs, or what rules govern a system. LLMs compact context and lose session memory. The repository has to carry the durable memory.

Cichra’s stack is old but newly useful: Architecture Decision Records, Product Requirements Documents, Behavior-Driven Development scenarios, design-system rules, Git hooks, CI, linters, type checks, architecture checks, and executable tests. His point is not that teams need more paperwork. It is that intent has to be written where humans can review it, agents can retrieve it, and automated checks can enforce it.

ADRs explain not just what is forbidden, but why. One example banned ORM queries in templates because .filter() and .count() calls made N+1 queries invisible until production. The enforcement mechanism was import linting. A useful system does not merely reject a change; it tells the human or agent which rule was violated, why it exists, and how to fix it.

PRDs preserve product purpose after the build. BDD scenarios connect human-readable behavior to executable tests. Cichra argued that Cucumber-style specifications have become useful again because AI may generate both code and tests, making it harder to trust either without a readable layer that states what behavior should exist and proves it still does.

Design systems play the same role for UI. Agents need visible rules and reusable components, not just instructions to “make it look consistent.” Git hooks and CI become the enforcement loop. The agent’s goal is to deliver a pull request, and to do that it must pass the same checks as a human. If it skips a local hook, CI catches it.

20–50

context compacts Cichra says long-running agent sessions can survive when project memory is recoverable

Cichra said he is no longer afraid of context compaction because the important facts are recoverable. ADRs, PRDs, BDD specs, and design-system rules let an agent look up what matters again. That is the repository-level version of Nadella’s company-specific traces and Casas’s agent interface surfaces: memory has to live outside the transient chat session, and supervision has to be supported by the system itself.

Declarative UI Is Emerging as the Practical Path for Agent InterfacesAI Engineer

BDD and ADRs Give AI Coding Agents Enforceable Project MemoryAI Engineer