Private Evals Are Becoming the Core IP of Enterprise AI
Microsoft chief executive Satya Nadella argues that the AI frontier is shifting from single models to company-specific systems built from private evals, traces, tools, data and multi-model harnesses. In a Microsoft Build conversation with Sarah Guo, Elad Gil and Shawn Wang, Nadella says those private evaluation loops may become a company’s most important intellectual property, allowing enterprises to build their own specialist intelligence rather than merely consume frontier models. He also frames the broader test for AI as legitimacy: whether customers, workers and communities see measurable gains from the technology and the infrastructure behind it.

The frontier is becoming a company-specific system, not a single model
Satya Nadella framed Microsoft’s AI strategy as an ecosystem problem rather than a race to make one model, one platform, or one first-party product the center of the market. A platform, in his definition, is measured by whether it creates more value above itself than it captures inside itself. That was the test he applied to the Build announcements: whether AI-native companies and traditional enterprises can become “first-class participants” in the new stack, able to point to AI they created rather than merely consuming AI built by someone else.
That distinction matters because Nadella repeatedly described the AI frontier as something an organization can operate at through its own data, tools, evals, traces, and harnesses. Microsoft may train its own MAI models, partner with frontier labs, and ship first-party products, but the broader aim he described was to give companies a path to create their own specialist intelligence on top of improving general-purpose systems.
For Microsoft’s MAI models, Nadella emphasized “clean lineage”: high-quality pre-training data, ablations, and careful exclusion of polluted or misleading data. He contrasted that with open-weight models that may look strong on one or two benchmarks but fail in practice. The point, for him, is not just benchmark performance; it is whether a small model can become useful when surrounded by the right scaffold. He cited Microsoft’s internal excitement that a small 5B model could “hill climb,” and linked that to a larger pursuit of what he called a “cognitive core.”
The model is only one part of the loop. Nadella described a scaffold around the model, reinforcement learning environments, traces collected from use, and private evals that measure what an individual company actually values. Public evals still have some interest, he said, but many can now be maxed out. The more important measurement is private: whether a company can define success for its own work, use models against that evaluation, and keep the traces from leaking.
He offered the Land O’Lakes demo as an example of how this frontier shifts over time. Microsoft used GPT-4o, collected traces, and then used a 5B reasoning model to achieve higher performance. In that sense, frontier intelligence is not only something bought from a lab at a moment in time. It can also be produced by combining a frontier model, organization-specific traces, and a specialist model trained to climb on the organization’s own task.
“Can everybody operate at the frontier with their frontier intelligence?”
That, Nadella said, was the tagline he would attach to the developer conference. Without that possibility, companies cannot see how they compound value on top of a platform that keeps improving. A developer conference, in his view, should not ask builders to “worship at the altar of one model.” It should give them a platform layer they can extend into their own intelligence layer.
Private evals are becoming the new corporate IP
Shawn Wang suggested that Microsoft’s third act, after operating systems and cloud, might be as a “harness or evals company.” Nadella’s answer supported much of that direction. The modern asset, as he described it, is not simply a trained model or a body of human experience. It is the private evaluation and the working system around it: models, tools, data, context, traces, and the harness that lets an organization improve without surrendering control.
The acid test, in Nadella’s account, is whether a company can take a private eval, use model A, switch to model B, and continue climbing. If it can, it is in control. If it cannot, it is dependent on a particular model provider or implementation. That is why the harness decision matters. An open harness that can admit multiple models, connect to proprietary tools, incorporate context, and train against private evals becomes, in his telling, part of how a company preserves control over its own intelligence layer.
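The acid test can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Microsoft's harness or any real product API: `run_private_eval`, `model_a`, `model_b`, and the two toy eval cases are all hypothetical, and exact-match scoring stands in for whatever a company actually measures. What it shows is that the eval belongs to the company and the model is just a parameter.

```python
from typing import Callable, Sequence

# Hypothetical types: a "model" is any callable from prompt to answer.
Model = Callable[[str], str]
EvalCase = tuple[str, str]  # (prompt, expected answer)


def run_private_eval(model: Model, cases: Sequence[EvalCase]) -> float:
    """Return the fraction of private eval cases answered exactly right."""
    if not cases:
        raise ValueError("eval needs at least one case")
    passed = sum(
        1 for prompt, expected in cases if model(prompt).strip() == expected
    )
    return passed / len(cases)


# Toy stand-ins for two providers' models (illustrative only).
def model_a(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "unknown")


def model_b(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


# The eval is owned by the company and never changes when the model does.
PRIVATE_EVAL = [("2+2", "4"), ("capital of France", "Paris")]

score_a = run_private_eval(model_a, PRIVATE_EVAL)
score_b = run_private_eval(model_b, PRIVATE_EVAL)

# In control: the same eval keeps climbing after the swap.
still_climbing = score_b >= score_a
```

Because the harness only depends on the callable signature, replacing `model_a` with `model_b` requires no change to the eval or the scoring code, which is the kind of independence the passage describes.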
Sarah Guo pressed on what role developers have in this world, especially if independent frontier labs expect large first-party products to drive revenue while also serving enterprises and startups through APIs. Nadella answered that first-party products and platforms have always coexisted: Windows had Microsoft products on top of it; cloud and SaaS had the same pattern. The essential issue is whether the platform owner’s success prevents other people from achieving comparable success. In the AI era, he argued, the network effects around intelligence differ because the learning can come from data, and not necessarily vast amounts of data. Sometimes a few samples are enough to show what is novel.
That places a premium on private evals and traces. Nadella described each company as having both human capital and “token capital.” The question is how the two compound. In Microsoft Teams, for example, he imagined humans and agents both doing work, with the traces between them becoming critical context for how the enterprise creates value. Those traces do not train only a generic assistant. They can train what he called a “company veteran agent.”
He pushed the idea far enough to compare it to an asset that might belong on a balance sheet. Human capital, he said, was historically hard to capture as an accounting asset because tacit knowledge was not measurable or preservable in that way. Agents that learn through time from traces may change that. The claim was not that human capital disappears. Nadella explicitly argued that humans remain valuable because there are always gaps for them to find. But the compounding asset becomes the interaction between human judgment, agent work, and the private measurements that define improvement.
“Every company having private evals may be the biggest IP.”
The enterprise harness is where models, tools, and context meet
Elad Gil drew a distinction between the agent that writes code and the “harness” around it: the environment, context, and setup that make the coding agent effective. Nadella generalized that idea to enterprise work. A harness, in his description, defines the models, the data, and the tools, and creates a loop across all three.
Microsoft’s own products, he said, are being built as multi-model harnesses. He listed GitHub Copilot, Microsoft’s security work, M-Dash, and science discovery as examples. The point is not only to call multiple models. It is also to provide tool access through progressive disclosure so the system remains token-efficient, and to supply the rich context needed for a plan to execute effectively.
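One common reading of progressive disclosure is that the agent first sees only a short index of tools and receives a full definition when it picks one. The sketch below assumes that reading; the conversation does not describe Microsoft's implementation, and the registry, tool names, and schemas here are invented for illustration.

```python
import json

# Hypothetical tool registry; names and schemas are invented for illustration.
TOOLS = {
    "search_tickets": {
        "summary": "Search support tickets by keyword.",
        "schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "limit": {"type": "integer"},
            },
            "required": ["query"],
        },
    },
    "create_invoice": {
        "summary": "Create an invoice for a customer.",
        "schema": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "amount_cents": {"type": "integer"},
            },
            "required": ["customer_id", "amount_cents"],
        },
    },
}


def tool_index() -> str:
    """Cheap first-pass context: one line per tool, no schemas."""
    return "\n".join(
        f"{name}: {spec['summary']}" for name, spec in sorted(TOOLS.items())
    )


def tool_detail(name: str) -> str:
    """Full definition, disclosed only when a specific tool is requested."""
    spec = TOOLS[name]
    return json.dumps({"name": name, **spec}, indent=2)


index_text = tool_index()
full_text = "\n".join(tool_detail(name) for name in TOOLS)
```

The index is much shorter than the concatenated full definitions, so the prompt carries the detailed schema only for the tool the plan actually needs. With two tools the saving is small; the gap widens as the registry grows.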
Context preparation was one of the hardest lessons of the last two years. The “magic,” as Nadella put it, is often in the amount of work required to prepare the context layer so that execution becomes efficient. The GitHub harness is being used across Microsoft products and made available in Foundry, but he also stressed openness: companies can use a Llama harness, an open harness, or their own harness, trained with their tools, models, and context.
This is where Nadella sees a practical answer to the view that models, tools, and harnesses must be trained together by a single provider to achieve good evals. He cited M-Dash as evidence against that. When it launched, he said, it found bugs or vulnerabilities that were not found by Mythos. He treated that as an existence proof that a multi-model harness can outperform in real-world conditions.
The same logic applies to coding. Coding has worked so well, Nadella said, that the IDE itself has to be rebuilt. If a developer has 100 agent sessions running, the cognitive load shifts back to the human in a new form: reviewing, coordinating, and understanding what those agents did. A chat-only artifact is inadequate; he said a canvas becomes necessary.
He extended the pattern beyond code to what he called “glue work.” Much of human capital in enterprises is spent connecting systems, completing workflows, and applying judgment across fragmented tasks. Long-running, durable agents with delegated authority can amplify that work. Nadella imagined waking up to find that autopilots had worked through the night on one’s behalf, followed by the need for a new interface (an “ADE,” by analogy to IDE) to inspect what they did and determine whether the human now owns the work.
SaaS does not disappear, but its packaging gets relitigated
Nadella did not accept a simple “end of software” thesis. He agreed that AI forces a re-litigation of how applications are vertically stacked, but he separated the layers of SaaS that may become more or less durable.
Traditional SaaS, as he described it, captured workflow by building a data model, schematizing a business process, adding business logic, and placing a UI on top. Sarah Guo added that there was also “a little configuration”; Shawn Wang noted that for roughly 20 years, that was the model. Nadella’s view is that some parts of that stack remain highly valuable. A general ledger should remain a general ledger. Stable entity relationships are not something he wants every company to reinvent. Likewise, business logic encoded in products can be valuable even if the old UI packaging changes.
Power BI was his example. Dashboards may look like the visible product, but underneath them is a semantic model: measures and business logic someone took the trouble to define. Nadella wants that logic available to agents. The challenge for SaaS companies, then, is that products were bundled in one way and now need to be unbundled and rebundled in new ways, with new business models.
Microsoft 365 is his most important example. Nadella argued that Copilot has exposed what may be the most important database in a company: the corpus of email, Teams conversations, Word documents, Excel spreadsheets, PowerPoint decks, and SharePoint content. Previously, that database was captive to individual apps. Copilot lets it be used differently. He described asking Copilot to inspect a GitHub repo, review design meetings he attended the previous week, and propose codebase changes based on the meeting transcripts. In the old packaging of M365, he said, that use would have been hard to imagine.
That expansion creates technical strain. Serving an inbox or mailbox is not the same as serving an agent. Nadella said Microsoft has to re-architect for agent usage that may exceed direct end-user usage.
On pricing, he rejected the idea that one pricing model will dominate. Per-user pricing, he said, exists because buyers need budget certainty; it is really a set of usage entitlements. Subscriptions and per-user bundles will remain, while consumption pricing will also grow. Outcome-based pricing will have a place, but Nadella was skeptical that customers will always like it once the outcome materializes. Customers may want vendors to be accountable for results until they realize the vendor is effectively asking for a share of the outcome; at that point, they may prefer to return to per-user and consumption pricing.
GitHub Copilot illustrated the adjustment. It was originally constructed around per-user pricing for interactive code completion and task use. That is different from a world in which a developer launches thousands of agents running all day. Nadella said there will still be per-user pricing, but there will also need to be a consumption meter.
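The combination of a per-user price and a consumption meter can be written as simple arithmetic. The sketch below is an assumption about how such a hybrid bill might be computed, not GitHub Copilot's actual pricing: the function name, the notion of included units per seat, and every number are hypothetical. Amounts are kept in integer cents to avoid floating-point rounding.

```python
def monthly_bill_cents(
    seats: int,
    seat_price_cents: int,
    included_units_per_seat: int,
    used_units: int,
    overage_cents_per_unit: int,
) -> int:
    """Hybrid bill: fixed per-seat fee plus metered usage above the entitlement.

    All parameters are hypothetical; a "unit" could be a request, a token
    bundle, or an agent-minute, depending on the product.
    """
    included_units = seats * included_units_per_seat
    overage_units = max(0, used_units - included_units)
    return seats * seat_price_cents + overage_units * overage_cents_per_unit


# Interactive use stays inside the entitlement: the bill is the seat fee only.
interactive_bill = monthly_bill_cents(10, 2000, 300, 2500, 4)

# Heavy agent use exceeds the entitlement: the meter adds to the seat fee.
agentic_bill = monthly_bill_cents(10, 2000, 300, 5000, 4)
```

With ten seats at 2,000 cents each and 300 included units per seat, 2,500 units of usage cost 20,000 cents, while 5,000 units cost 28,000 cents. The seat fee preserves the budget certainty Nadella described, and the meter covers usage that a flat per-user price was never designed for.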
Elad Gil asked about enterprises experiencing “agent euphoria,” with internal teams trying to rebuild applications or threaten SaaS vendors with replacement. Nadella said the market likely needs one full budget cycle before equilibrium becomes clear. The relevant calculation is whether the marginal cost of building and maintaining something internally is lower than buying it. Maintenance matters: AI can find security issues, but those issues still have to be fixed quickly, and fixing them burns tokens. The durable vendor, in his view, is the one flexible enough to compose with a customer’s agentic workflows while still delivering value.
The new builder is more generalist, but not less technical
Nadella’s own account of building software centered on the increased agency that tools like GitHub Copilot provide. He joked that AI made it possible even for “the incompetence of a CEO” to build, but his substantive point was that knowledge workers can now inspect, learn from, and manipulate artifacts they previously could not touch.
He described building long-running Foundry agents, including a chief-of-staff-style autopilot. He used Work IQ, asked it to build a long-running Foundry agent, stored memory in a backend service he named, and then told it to publish to Teams. The system, he said, built and published the agent. The important part for Nadella was the end-to-end completion of a project by someone who would previously have stayed at the level of documents, spreadsheets, or meetings.
Elad Gil asked whether this changes engineering roles. He described a possible future with fewer categories: people managing agents, forward-deployed engineers, security engineers, and people working on large-scale infrastructure for a small number of services. Sarah Guo said that sounded correct. Nadella was more cautious, saying companies will have to experiment their way through it, but he pointed to LinkedIn as an example of structural change already underway.
At LinkedIn, he said, the company created a new discipline called the “full-stack builder,” bringing together design, product management, and front-end engineering. The point was not to erase specialties. A designer still has a design edge; a front-end engineer still has a front-end edge. But the role gives people a broader scope rather than confining them to narrow functions.
At the same time, infrastructure becomes more important, not less. Even the Excel team, Nadella said, now faces hard infrastructure problems in building reinforcement learning environments where rewards can be learned. That requires distributed systems talent inside what was once thought of as an end-user application team. Science and infrastructure remain specialist domains. But the highest leverage, in his view, may accrue to generalists whose scope expands dramatically.
Knowledge work that once ended in a Word document, spreadsheet, or presentation can now produce an app. Nadella did not present that as a reason for everyone to do the same work. Rather, he presented it as a change in the leverage available to generalists: they can move up and down the stack, inspect more, learn more, and act on artifacts that used to be outside their reach.
Ambition means making previously impossible work operational
When Sarah Guo asked how Microsoft can be more ambitious in an environment where users and companies are adopting new technologies quickly, Nadella answered with a conceptual shift rather than a product roadmap. He cited Kevin Scott’s distinction: making hard things easier is one kind of leverage, but true ambition is making the impossible possible.
Nadella’s example was Azure networking. Microsoft built more Azure capacity in the previous 15 months, he said, than it had built in the first 15 years. The same team responsible for the Azure network concluded that its existing model of work would not scale. Their job, as they reframed it, was not to “do Azure networking” but to build the agentic system that does Azure networking.
The Azure network involves more than 500 fiber operators managing the WAN. Fiber operations remain physical: cables are cut, repairs are needed, emails arrive, people respond. The team built an agentic system called Miles to handle parts of that operational flow. Nadella said the team began asking not for more headcount but for more tokens to manage the operation.
He described this as making work “meta.” The new work is not merely performing the operational task; it is building and managing the system that performs the task. He compared it to a mistaken 1980s model of the future in which, if billions of people were going to type every morning, one might conclude the world needed billions of typists. Instead, typing became part of knowledge work. The analogous shift now is that organizations need permission to invent new forms of meta-work and meta-cognition that change the outputs that matter.
That thread also shaped Nadella’s view of data centers. The buildout is extraordinary, he said, and unlike anything that has happened before. But he was more concerned with the legitimacy of that buildout in the communities where it lands. The industry, he argued, has to show real benefits: energy prices not rising and ideally falling over time through a better grid and more supply; water systems that are closed loop or replenishing; training, jobs, and tax base; construction employment and post-construction employment.
If those benefits are real, he said, companies will have permission. If not, they will not. He described community skepticism as appropriate and said the industry has to earn trust through tangible evidence. His broader historical claim was that using a lot of energy can be a good story if it also creates a lot of social value. If a token economy drives productivity, economic growth, wider participation, and better health outcomes, he believes the outcome can justify the infrastructure. But that case has to be made in the real economy, not asserted abstractly.
The public will not accept “trust us” as an AI strategy
Nadella’s societal argument was blunt: the world will be skeptical of technology companies that ask to be trusted on the promise of a glorious future. The technology is too important, and too much of the economy is implicated, for benefits to remain theoretical. People need to see a path to participating as first-class actors in the new economy.
A mock Apex AI post shown on screen captured the posture Nadella rejected. The post announced “Apex AGI-1” as “the last model humanity will need to build” and included the line: “Trust us, we’ve got it. The future is going to be glorious.” Nadella used nearly the same phrasing to describe the posture he thinks the industry can no longer rely on.
“The world is going to be way skeptical of tech and tech companies that say, trust us, we’ve got it, the future is going to be glorious.”
That means tangible examples over the next 12 to 18 months. Nadella listed health outcomes, the ability to create a startup, and the ability for a local store to operate more efficiently. The test is whether individuals and communities can see benefits themselves. Political legitimacy also matters in his answer: politicians who advocate for these benefits have to be able to win elections. Otherwise, the industry should not assume permission will persist.
Education was the final domain where Nadella saw the possibility of reinvention. Sarah Guo described a broad-benefit framework of wealth creation, healthcare, and education, and noted that education has not yet shown as much impact as one might expect. Nadella pointed to Alpha School as an example that had caused him to think about what education could become, while also stressing that traditional learning remains important.
Students still need concepts. He referred to a Stanford AI class whose guidelines required students to learn how to apply softmax appropriately rather than simply asking AI to fix a training run. The issue is not whether learning disappears; it is how incentives, credentials, the value of credentials, and employment opportunities change when access to information and continuous self-education have changed so radically.
The startup opportunity, in Nadella’s view, may be to build a new university or a new pedagogy: a way to move someone through a curriculum toward valuable economic opportunity. That had long felt impossible to change at scale. In his view, education is now one of the institutions that AI may make it possible to rebuild.





