Orply.

Enterprise AI Advantage Comes From Internal Evals and Proprietary Context

Apoorv Agrawal, Yash Patil | Stanford Online | Friday, May 22, 2026 | 18 min read

Yash Patil, chief executive of Applied Compute and a guest speaker in Stanford’s MS&E435 seminar, argues that the enterprise opportunity in AI is shifting from access to general frontier models toward the ability to define and optimize company-specific tasks. General models provide a baseline, he says, but durable advantage comes from internal evals, verifiers, feedback loops, proprietary context and product constraints that teach systems what “correct” means inside a business.

Enterprise advantage starts with defining the hill

Yash Patil framed the current model frontier as a move from broad capability toward task-specific improvement systems. General models are getting stronger, but the most valuable enterprise work depends on proprietary context, internal standards, and domain-specific definitions of success.

Patil’s phrase for the gap was simple: frontier models can be “smart geniuses that know nothing about your business.” Applied Compute, the company he started after leaving OpenAI, was built around that gap. General models establish a floor. The ceiling comes from specializing models, harnesses, context, and evaluation systems around an enterprise’s own work.

That emphasis follows from how model progress has changed. Early large-language-model development was dominated by pre-training: train a transformer on internet-scale data to predict the next token and compress broad patterns of language into weights. The recent frontier has moved toward post-training, reinforcement learning, test-time compute, verifiable rewards, and evaluation design. Those methods do not simply ask whether a model has absorbed enough text. They ask whether a model can attempt a task, receive a useful signal about whether it succeeded, and improve.

For enterprises, the scarce asset is often not “data” in the generic sense. It is labeled behavior, evaluation criteria, verifiers, environments, and access to the real operational context in which the model’s answer will be judged.

Whatever hill you want to climb, you first define it with an Eval. Then RL is kind of this Eval-maxing machine.

Yash Patil

That is why evals set the roadmap. If a company wants a better coding model, the evaluation defines what “useful coding” means. If a company has its own internal standards, those standards become the enterprise-level eval. Applied Compute’s role, as Patil framed it, is to help companies train toward those specific hills.

Deep learning’s breakthrough also created the system no one fully understands

Apoorv Agrawal invited Patil to place recent model progress in the broader history of AI. Patil began with AlexNet, which he called a pivotal moment for deep learning and “the moment that we stopped understanding what any of these models actually do.”

Before AlexNet, many machine-learning systems depended on human-designed features. In vision, people would design feature extractors for edges or other low-level patterns, then train classifiers on those hand-crafted representations. AlexNet marked the shift to end-to-end learning: apply GPUs and a massive dataset such as ImageNet to neural networks, and the model learns the representations itself.

The accompanying slide compared “hand-crafted ML” with “deep learning.” Before AlexNet, humans designed features and the classifier used them. After AlexNet, one neural network learned features automatically and produced the classification. The point was not only historical. It established the pattern that still governs the field: more data, more compute, and architectures that can absorb both tend to produce better models, even when researchers cannot inspect the internal mechanism in a human-legible way.

The same pattern reappeared in language. Patil identified the transformer as the next major architectural break. Compared with recurrent neural networks and LSTMs, which process a sequence one step at a time, the transformer’s attention mechanism handles the positions of a sequence in parallel. That made language-model training more efficient on GPUs and better suited to long sequences, and it allowed the field to scale next-token prediction across massive text corpora.

From there, he traced the pre-training era of 2018 and 2019, when models learned by predicting the next token in a corpus, comparing the prediction against the actual next token, and updating weights through backpropagation. The OpenAI scaling laws then showed that larger models produced better performance. Patil pointed to GPT-3 as the first model that seemed to have “some level of general intelligence.” Chinchilla-style scaling laws then complicated the recipe: it was not enough to make the model larger; there was a compute-optimal balance between parameter count and training data.
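The compute-optimal balance can be made concrete with a rough sketch. The two rules of thumb below (training compute of about 6 FLOPs per parameter per token, and about 20 training tokens per parameter) are commonly cited approximations and are assumptions here; they were not stated in the seminar, and the function name is illustrative.

```python
import math

# Assumed rules of thumb (not from the talk):
#   training FLOPs C ~= 6 * N * D   (N = parameters, D = training tokens)
#   compute-optimal D ~= 20 * N
FLOPS_PER_PARAM_TOKEN = 6.0
TOKENS_PER_PARAM = 20.0


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend compute_flops under the assumptions above."""
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Hypothetical budget of 1e23 FLOPs
params, tokens = compute_optimal_split(1e23)
print(f"params ~ {params:.3g}, tokens ~ {tokens:.3g}")
```

Under these assumptions, doubling the parameter count without adding data spends the same budget less effectively than growing both together, which is the sense in which model size alone stopped being the recipe.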

Once base models became broadly useful, the problem changed. Raw next-token prediction did not produce assistants that reliably answered user questions, followed instructions, or complied with safety norms. Reinforcement learning from human feedback, preference tuning, and related post-training methods made models usable by ordinary users. GPT-4 represented, in Patil’s account, another step change in quality.

The recent break was reasoning. Patil described OpenAI’s o1 as opening a new axis of scaling: test-time compute. Instead of only making training larger, labs could give models more computation at inference time so they could spend time thinking, checking, and correcting. His narrower claim was that reasoning behavior emerged when models were placed in constrained reinforcement-learning environments and given substantial compute. Combined with tool use, that reasoning behavior helped produce agents such as coding assistants and deep-research systems that can work for longer periods and are now often described as AI coworkers.

The bottleneck has moved from architecture and scale to high-quality tasks

Yash Patil presented AI progress as a repeating sequence of bottlenecks. Before 2012, the bottleneck was hand-designed features. AlexNet showed that deep learning could learn features automatically. From 2012 to 2016, data and GPUs were central. In 2016 and 2017, sequence architecture was the obstacle, and transformers made language modeling easier to scale. From 2018 to 2021, compute and scale drove progress through GPT-style models. From 2022 to 2024, usability and alignment became the bottleneck; RLHF and ChatGPT made models useful to normal users.

Today, the bottleneck is high-quality tasks.

| Era | Main bottleneck | What changed |
| --- | --- | --- |
| Pre-2012 | Hand-designed features | Deep learning started learning features automatically |
| 2012-2016 | Data and GPUs | ImageNet and AlexNet proved neural nets could scale |
| 2016-2017 | Sequence architecture | Transformers made language modeling easier to scale |
| 2018-2021 | Compute and scale | GPT-style models showed bigger models could become broadly capable |
| 2022-2024 | Usability and alignment | RLHF and ChatGPT made models useful to normal users |
| 2024-today | High-quality tasks | Evals, verifiers, and RL environments help models improve on hard tasks |

Patil’s account of how the binding constraint in AI development has shifted over time

Those tasks matter because recent reasoning models are trained through reinforcement-learning environments. To improve, the model needs a reward signal. The strongest environments provide a way to verify whether the model did the task correctly. Code and math are unusually valuable because they provide deterministic checks: code can compile, unit tests can run, mathematical answers can be verified.
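A deterministic check of this kind can be sketched in a few lines. The task, the test cases, and the function names below are hypothetical; the sketch only illustrates how running unit tests turns an attempt into a pass/fail reward.

```python
from typing import Callable, List, Tuple

# Hypothetical held-out tests for a toy task: "return the sum of a list of ints".
TEST_CASES: List[Tuple[List[int], int]] = [([], 0), ([1, 2, 3], 6), ([-1, 1], 0)]


def verifiable_reward(candidate: Callable[[List[int]], int]) -> float:
    """Return 1.0 only if the candidate passes every test, otherwise 0.0."""
    for args, expected in TEST_CASES:
        try:
            if candidate(args) != expected:
                return 0.0
        except Exception:
            # A crash counts as a failed attempt.
            return 0.0
    return 1.0


def good_attempt(xs: List[int]) -> int:
    return sum(xs)


def buggy_attempt(xs: List[int]) -> int:
    # Crashes on the empty list and is wrong for [1, 2, 3].
    return xs[0] + xs[-1]


print(verifiable_reward(good_attempt), verifiable_reward(buggy_attempt))
```

No human judgment is needed to score either attempt, which is what makes code and math convenient training domains.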

That explains why software engineering became the first major frontier for agentic models. Agrawal asked why so many labs converged on code rather than immediately focusing on life sciences, cybersecurity, slides, or other domains. Patil answered that reinforcement learning with verifiable rewards needs deterministic success criteria. Code offers them. It also has abundant tokens on the internet and can be used to generate synthetic data.

He added a broader claim: many researchers, including himself, think coding models are “AGI complete” in the sense that many tasks can be reduced to code. A model can write code as a general interface to the world rather than relying only on narrow tool calls designed for one workflow.

Slide generation tested the boundary of that claim. Agrawal said the course slides had been generated with Claude Code for formatting. If models improve at code, why should they improve at slides?

Patil’s answer separated structure from preference. A model can use code to build a slide deck: make a table, set title formatting, place visual elements, and create a structurally valid output. But if the goal is not only functional slides but aesthetically pleasing slides, the training process needs additional reward signals. A reward model trained on human preferences for attractive versus unattractive slides could be combined with code-execution rewards. The model could then jointly optimize for structural correctness and aesthetic quality.
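One way to picture the joint objective is a reward that gates on structural validity and then blends in a preference score. This is a minimal sketch under stated assumptions: the deck format, the 50/50 weighting, and the idea that the aesthetic score arrives as a number between 0 and 1 from some separately trained reward model are all illustrative, not details from the talk.

```python
def structural_reward(deck: dict) -> float:
    """Hypothetical deterministic check: every slide has a non-empty title and body."""
    slides = deck.get("slides", [])
    if not slides:
        return 0.0
    valid = all(s.get("title") and s.get("body") for s in slides)
    return 1.0 if valid else 0.0


def combined_reward(deck: dict, aesthetic_score: float, weight: float = 0.5) -> float:
    """Blend the structural check with a preference score in [0, 1].

    aesthetic_score stands in for the output of a reward model trained on
    human preferences; here it is simply passed in as a number.
    """
    structure = structural_reward(deck)
    if structure == 0.0:
        # A structurally broken deck earns nothing, however pretty it is.
        return 0.0
    return (1.0 - weight) * structure + weight * aesthetic_score


deck = {"slides": [{"title": "Evals", "body": "Define the hill"}]}
print(combined_reward(deck, aesthetic_score=0.8))
```

Gating on structure first is a design choice in this sketch: it prevents the model from trading a valid file for a higher preference score.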

The example captures the broader point: code is valuable because it gives the model a way to act, but usefulness still depends on the reward model, verifier, or evaluator attached to the action.

Training economics are shifting from internet-scale compression to verifiable feedback

Yash Patil drew a sharp line between pre-training and post-training. Pre-training teaches broad world knowledge and pattern recognition. The model sees huge datasets of internet text, books, code, math, papers, and multimodal data. It predicts the next token, computes the loss against the true next token, and updates its weights. The process requires massive GPU clusters and weeks or months of compute. The output is a base model: knowledgeable, but not necessarily helpful, aligned, or safe.

| Dimension | Pre-training | Post-training |
| --- | --- | --- |
| Goal | Teach broad world knowledge and pattern recognition | Turn raw capability into useful behavior |
| How it works | Predict the next token and optimize loss over huge datasets | Use fine-tuning, RLHF, RLVR, safety training, and eval-driven iteration |
| Data | Internet text, books, code, math, papers, and multimodal data | Human demonstrations, preference rankings, expert labels, synthetic tasks, and RL environments |
| Compute | Extremely high: massive GPU clusters over weeks or months | Lower than pre-training, but growing with RL and test-time compute |
| Output | A base model: knowledgeable but not necessarily helpful or aligned | An assistant or agent: instruction-following, safer, and more product-ready |
| Scarce resource | Compute and internet-scale data | Human feedback, evals, verifiers, and environments |

The pre-training/post-training distinction as presented in the seminar slides and Patil’s explanation

Patil described pre-training as compression. The model takes internet-scale human knowledge and compresses patterns in language into weights. The result can look like intelligence, but the training objective is still next-token prediction.

One slide illustrated the loop with a simple sequence: given “The cat sat on the,” the target token is “mat.” The transformer produces probabilities over the vocabulary. If the probability assigned to “mat” is 0.62, the loss is computed as negative log probability. Backpropagation then updates model weights with gradient descent or AdamW. Repeat this over large-scale corpora, and training loss falls over many steps.
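The loss in the slide's example can be worked out directly. The sketch below assumes the natural logarithm, which is the usual convention for cross-entropy loss; the other probabilities are added only to show the trend.

```python
import math

# Probability the model assigns to the correct next token "mat" (from the slide).
p_mat = 0.62

# Next-token loss is the negative log probability of the true token.
loss = -math.log(p_mat)
print(f"loss = {loss:.3f}")  # -> loss = 0.478

# The loss shrinks toward zero as the model grows more confident in the right token.
for p in (0.10, 0.62, 0.99):
    print(f"p = {p:.2f}  loss = {-math.log(p):.3f}")
```

Training nudges the weights so that, averaged over the corpus, this number goes down.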

Post-training turns raw capability into useful behavior. It includes supervised fine-tuning, reinforcement learning from human feedback, reinforcement learning with verifiable rewards, safety training, and eval-driven iteration. The output is an assistant or agent that follows instructions, reasons better, behaves more safely, and is more product-ready.

Patil used a simple example: if a user asks, “Who should I invite to dinner?” a base model might complete the sentence with random names because it is continuing a plausible text sequence. A post-trained assistant should recognize that it does not know the user’s invite list and ask for more context. Similarly, if a user asks how to make a weapon or bomb, post-training can teach the model not to provide harmful instructions.

The distinction matters economically because frontier pre-training is limited to a small number of labs and hyperscalers. It requires huge capital expenditure, compute, and internet-scale data. Post-training is still compute-intensive and becoming more so, but it is more accessible to startups, labs, domain companies, and enterprises with the right feedback, evals, verifiers, and environments.

The data constraint makes the shift more important. A chart sourced on the slide to Epoch AI asked, “Will we run out of data?” It plotted the effective stock of human-generated public text against dataset sizes used to train notable LLMs, including GPT-3, PaLM, FLAN-137B, Falcon-180B, DBRX, and Llama 3. The chart showed notable model-training datasets approaching the estimated stock of public human-generated text, with median full-stock use projected around 2028 and median data use with 5x overtraining projected around 2027.

~2028: median full-stock use of human-generated public text shown on the Epoch AI chart

For pre-training, scale remains central. Labs and data providers are trying to find more tokens by scanning old libraries and ancient books. They are also investing in synthetic generation: take primary source documents and expand them into many more tokens in the hope that the model can learn more from them.

But Patil said an important pre-training direction is architectural improvement: make better use of the data that already exists. “On principle,” humans do not need internet-scale data to learn many things, so researchers are looking for more data-efficient methods.

RL environments represent a different kind of data. Instead of simply training on a codebase and learning token patterns, a lab can construct a world in which the model must implement a feature, try hundreds or thousands of times, and receive a verifiable reward based on whether the result passes tests. This exchanges compute for scarce high-quality data. A single task can yield much more learning because the model generates attempts, sees a distribution of outcomes, and updates from the reward signal.
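The idea of many attempts at one task can be sketched as a group of rollouts scored against the group average. The simulated success probability stands in for a real model and test suite, and comparing each attempt with the mean reward is one common baseline choice assumed here for illustration; the talk did not specify an algorithm.

```python
import random
from statistics import mean
from typing import List, Tuple


def attempt_task(success_prob: float, rng: random.Random) -> float:
    """Stand-in for one rollout: reward 1.0 if the hidden tests pass, else 0.0."""
    return 1.0 if rng.random() < success_prob else 0.0


def rollout_group(n_attempts: int, success_prob: float, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Sample many attempts at a single task and score each against the group mean."""
    rng = random.Random(seed)
    rewards = [attempt_task(success_prob, rng) for _ in range(n_attempts)]
    baseline = mean(rewards)
    # Positive advantage: better than the typical attempt; negative: worse.
    advantages = [r - baseline for r in rewards]
    return rewards, advantages


rewards, advantages = rollout_group(n_attempts=256, success_prob=0.3)
print(f"pass rate = {mean(rewards):.2f}")
```

A single task produces 256 graded examples in this sketch, which is the sense in which compute is exchanged for scarce data.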

Asked to compare pre-training and post-training budgets, Patil answered with an example he said he had looked up before arriving. DeepSeek-V3 was trained on roughly 2.4 million to 2.5 million H800 hours, while the RL training that led to DeepSeek-R1 used about 150,000. That puts the RL training at roughly 5% of the pre-training compute.

~5%: RL training compute compared with pre-training compute in Patil’s DeepSeek example
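The ratio can be checked with the figures as quoted in the talk. The sketch assumes the 150,000 figure is also in H800 hours; the variable names are illustrative.

```python
# Figures as quoted in the talk (GPU hours); not independently verified here.
pretrain_hours_low = 2.4e6
pretrain_hours_high = 2.5e6
rl_hours = 150_000  # assumed to be H800 hours as well

ratio_low = rl_hours / pretrain_hours_high   # smallest ratio the quoted range allows
ratio_high = rl_hours / pretrain_hours_low   # largest ratio the quoted range allows

print(f"RL share of pre-training compute: {ratio_low:.1%} to {ratio_high:.1%}")
```

With these inputs the share comes out at about 6%, the same order of magnitude as the rough 5% cited.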

He cautioned that the trend is changing. Labs are not only pre-training models and then adding a small post-training pass. They are increasingly running datacenter-wide or multi-datacenter-wide RL runs. Patil referred to three scaling laws: pre-training scaling, post-training scaling, and test-time scaling. Test-time scaling concerns inference. Post-training scaling increases the compute devoted to RL itself, including larger batch sizes and more reasoning during task attempts.

While he did not know the latest statistics for newer model families, Patil said his experience scaling systems such as o1, o3, Codex, and Deep Research showed that more compute in RL produced better performance. Pre-training remains enormous, but post-training is becoming a larger part of the frontier.

Evals are protected because they decide what gets optimized

Yash Patil began his career at OpenAI working on evals, which he described as the “hairiest thing that no one wants to work on,” but also as a useful place to start because it makes a person valuable inside a company. He later connected that experience to the enterprise opportunity.

Evals benchmark a model on a task and reveal how it behaves. But they do more than measure. They decide what the organization will optimize. Patil described SWE-bench as an eval that helped start the code-model race because it gave labs a concrete target for “useful coding,” even though he also called SWE-bench flawed and said better evals have since appeared.

The slide on evals defined them as a way to “codify a task into an environment,” produce “measurable outcomes that you can optimize towards,” and give the model developer “a hill to climb.” It also warned of a “slippery slope” and the need to avoid train-test mismatch.

For frontier labs, the protected asset is therefore not just the trained model. It is the internal map of what matters next. If RL is an eval-maxing machine, then the organization with better evals is better positioned to define better hills and train toward them.

The same logic applies inside companies, but the target changes. Patil’s example was financial institutions: JP Morgan and Goldman Sachs may have different standards and operating procedures. Each enterprise will therefore need its own evals. Frontier labs optimize toward broad evals; enterprises optimize toward internal ones.

DoorDash shows why the hard part is the company’s own definition of correctness

Yash Patil offered DoorDash merchant onboarding as the clearest example of enterprise-specific post-training. DoorDash onboards more than 100,000 merchants a year, he said. Those merchants provide unstructured information about their businesses, including menus. Turning images of physical menus into a DoorDash storefront is harder than generic OCR because DoorDash has a specific internal style guide: how modifiers attach to items, which options can be mixed and matched, what counts as an add-on versus a special ingredient, and related menu-structure rules.

100,000+: merchants DoorDash onboards each year, according to Patil

General models were not able to perform that task well enough, even after prompting attempts. Applied Compute’s approach was to use a vision-language model, collect model outputs, have humans correct the resulting menus, and measure the delta from the correct version. During training, the model output could be checked against ground truth, producing an error signal that could be optimized directly.
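Measuring the delta from a human-corrected menu can be sketched as a field-by-field comparison. The menu representation, the item names, and the scoring rule below are hypothetical; the talk did not describe DoorDash's actual schema or metric.

```python
def menu_errors(predicted: dict, corrected: dict) -> int:
    """Count items that are missing, extra, or differ from the human-corrected menu."""
    items = set(predicted) | set(corrected)
    return sum(1 for name in items if predicted.get(name) != corrected.get(name))


def menu_reward(predicted: dict, corrected: dict) -> float:
    """Map the error count to a reward in [0, 1]; 1.0 means an exact match."""
    items = set(predicted) | set(corrected)
    if not items:
        return 1.0
    return 1.0 - menu_errors(predicted, corrected) / len(items)


# Hypothetical ground truth produced by a human reviewer.
corrected = {
    "Burger": {"price": 9.5, "modifiers": ["cheese", "bacon"]},
    "Fries": {"price": 3.0, "modifiers": []},
    "Soda": {"price": 2.0, "modifiers": ["no ice"]},
}
# Hypothetical model output: one modifier dropped, one item missing.
predicted = {
    "Burger": {"price": 9.5, "modifiers": ["cheese"]},
    "Fries": {"price": 3.0, "modifiers": []},
}

print(menu_errors(predicted, corrected), round(menu_reward(predicted, corrected), 3))
```

Because the reviewer's corrections encode the style guide, a score like this rewards the company's notion of a correct menu rather than generic text extraction.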

The DoorDash slide described the project as “Automating Merchant Onboarding at DoorDash” and stated that DoorDash trained a proprietary agent using internal experts and delivered a “30% relative reduction in critical menu errors.”

30%: relative reduction in critical menu errors shown on the DoorDash slide

The DoorDash case illustrates the broader enterprise claim. The value was not simply that the model could read text from an image. A pre-transformer version of the problem might have been framed as OCR or computer vision. But the actual task was to transform unstructured menu information into DoorDash’s internal representation under DoorDash’s rules. The model had to learn the company’s notion of correctness.

The reason not to wait for a future general model, Patil argued, is that enterprises care about being at the frontier at any given time, not at some indefinite future point. He also expressed skepticism that a single future ASI model would control everything, because the world and its data are fragmented.

His economic claim was that the return on training models today can be attractive because post-training is far cheaper than pre-training and RL has become more data efficient. Enterprises can improve performance on their own tasks without incurring the massive compute burden of training a foundation model from scratch.

Specialized small models can beat general models on product constraints

Yash Patil gave Cognition/Devin as a second enterprise example. Applied Compute recently put a model into production that checks code shortly after a developer saves a file and reports whether there is a bug. The target behavior is sub-two-second latency.

This is not a good fit for a general model because products live on a Pareto frontier of performance, cost, and latency. A large general model may perform well but be too slow or expensive for a tight product loop. A smaller model, heavily post-trained on bug detection, can deliver the latency and cost profile while approaching the performance of larger models on the specific task.
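The trade-off can be illustrated by filtering candidate models to the Pareto frontier and then applying the latency budget. All model names and numbers below are invented for illustration; only the two-second target comes from the talk.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float      # higher is better
    latency_s: float     # lower is better
    cost_per_1k: float   # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy and a.latency_s <= b.latency_s
                and a.cost_per_1k <= b.cost_per_1k)
    strictly_better = (a.accuracy > b.accuracy or a.latency_s < b.latency_s
                       or a.cost_per_1k < b.cost_per_1k)
    return no_worse and strictly_better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical numbers for a bug-detection task.
CANDIDATES = [
    Candidate("large-general", accuracy=0.92, latency_s=8.0, cost_per_1k=4.00),
    Candidate("small-general", accuracy=0.70, latency_s=1.2, cost_per_1k=0.30),
    Candidate("small-specialized", accuracy=0.89, latency_s=1.1, cost_per_1k=0.25),
]

frontier = pareto_frontier(CANDIDATES)
within_budget = [c.name for c in frontier if c.latency_s <= 2.0]
print([c.name for c in frontier], within_budget)
```

In this made-up example the large model stays on the frontier because it is the most accurate, but only the specialized small model survives the latency constraint.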

Patil described the value to Cognition as extending the product from writing code into testing and bug detection. He then broadened the point into what he called model, harness, and context co-development. Companies cannot focus on only one layer. Application companies often innovate heavily on the harness: the surrounding system that routes tasks, manages context, invokes tools, and turns model outputs into product behavior. Context is equally important because without access to the right company data, a model may not know what action is correct.

The future system, in this framing, is an ensemble. General models remain powerful orchestrators. But fast sub-agents, or agents trained on proprietary data that is out of distribution for frontier models, can be orchestrated with those general models to create a stronger system.

Patil gave Ramp as another example from the market. Ramp Labs, he said, trained an RL model for fast search inside spreadsheets, improving the product experience. The point was not that every company should train a foundation model. It was that product advantage can come from specialized models embedded in a broader system.

Continual learning depends on sparse feedback from real use

Yash Patil described continual learning as the next bottleneck and the closest version of what many people mean when they imagine more general intelligence. The key question is whether a model can do something once, receive sparse feedback, and learn from it.

His example was human learning from a hot stove. A person does not need internet-scale examples to learn not to touch a burner after being burned once. Current models, he said, are not really like that. They typically require many examples, dense training signals, or replayable environments.

In production, continual learning means understanding how an AI system is being used, what downstream consequences its actions produce, and how that information can update the system over time. But Patil stressed that this will be gradual. Much of the problem is data access: Is the agent deployed in front of the right users? Can the company capture the relevant feedback? Does it understand the context needed to know good from bad?

Patil pointed to Cursor’s Composer model as one example. Composer is Cursor’s coding model, trained on its coding data on top of an open-source model. Cursor put the model into production, captured telemetry, inferred implicit rewards, and took online training steps. Those rewards could include whether a user accepted a code suggestion, reverted it, or otherwise signaled success. Agrawal called this “a heuristic for success,” and Patil agreed.
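Turning telemetry into an implicit reward might look like the sketch below. The event names and reward values are hypothetical and are not Cursor's actual scheme; the point is only that user actions can be mapped to a scalar signal.

```python
from typing import List

# Hypothetical mapping from telemetry events to scalar rewards.
EVENT_REWARD = {
    "accepted": 1.0,   # user kept the suggestion
    "edited": 0.3,     # user kept it but changed it
    "ignored": 0.0,    # no signal either way
    "reverted": -1.0,  # user undid the suggestion
}


def implicit_reward(event: str) -> float:
    """Look up the reward for one event; unknown events carry no signal."""
    return EVENT_REWARD.get(event, 0.0)


def batch_mean_reward(events: List[str]) -> float:
    """Average implicit reward over a batch of logged events."""
    if not events:
        return 0.0
    return sum(implicit_reward(e) for e in events) / len(events)


print(batch_mean_reward(["accepted", "accepted", "reverted", "ignored"]))  # -> 0.25
```

These signals are heuristics: an accepted suggestion is evidence of success, not proof of it.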

Patil estimated that the improvement graph represented days or weeks, with a couple of hours per step, though he said he did not know the exact terms. The production setting differs from offline RLVR because the environment is not replayable. In offline reinforcement learning with verifiable rewards, the same task can be rolled out hundreds or thousands of times in parallel. In production, users operate in dynamic environments. Cursor’s experiment, as he described it, was to take massive batches of many conversations, denoise the gradient that way, and take a step that would hopefully improve the model directionally.
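Why large batches help can be shown with a toy simulation: averaging more noisy signals gives a steadier estimate of the underlying direction. The Gaussian noise model, the batch sizes, and the function names are assumptions made for illustration, not a description of Cursor's training setup.

```python
import random
from statistics import pstdev


def noisy_signal(rng: random.Random, true_value: float = 0.1, noise: float = 1.0) -> float:
    """One noisy per-conversation learning signal around a small true effect."""
    return rng.gauss(true_value, noise)


def spread_of_batch_means(batch_size: int, n_batches: int = 200, seed: int = 0) -> float:
    """Standard deviation of the batch average across many simulated batches."""
    rng = random.Random(seed)
    means = [
        sum(noisy_signal(rng) for _ in range(batch_size)) / batch_size
        for _ in range(n_batches)
    ]
    return pstdev(means)


small = spread_of_batch_means(batch_size=16)
large = spread_of_batch_means(batch_size=1024)
print(f"spread with 16 per batch: {small:.3f}, with 1024 per batch: {large:.3f}")
```

The spread falls roughly with the square root of the batch size, so a step taken on a very large batch is more likely to point in the right direction even when each conversation is a weak signal.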

Applied Compute’s related work involves “context base”: using agents offline to analyze documents and past human-agent traces, extract learnings, and improve downstream performance. Patil summarized the direction as innovation across weight updates, context, and the harness itself.

Transformers may be inefficient, but the scaling recipe is still winning

The debate over non-transformer architectures turns on a real weakness: transformers consume significant power and may be inefficient. Other architectures, such as Mamba, are often discussed as possible alternatives. Apoorv Agrawal framed the question by asking whether transformers might persist anyway because the world’s infrastructure has already adapted to them, much as airplanes are not bird-like but dominate human flight.

Yash Patil gave a pragmatic answer. Scaling transformers is working. There is a simple recipe for making them smarter and better. He said that if researchers keep scaling, AI is probably more likely to point them to a better architecture than humans are to invent one from scratch now.

He acknowledged that if the field hit a wall with transformers, there would be more pressure to innovate. He also said there is “really cool research” happening on alternatives. But his own view is to concentrate on scaling transformers.

Serious people disagree, including Ilya Sutskever and Yann LeCun, as Agrawal noted. Patil described the opposing view as a first-principles argument about data efficiency. Humans do not need pre-training-scale data to learn language and world models, so perhaps the current architecture is wrong or incomplete. There should be a better solution that learns more like humans.

Patil did not dismiss the argument. But he emphasized the installed base: compute buildouts, chips, and lab investments are being optimized around the transformer architecture. Some people are optimizing chips directly for the architecture. That makes a change “a big ship to turn.” Labs may be researching alternatives, but the major deployment path remains transformer scaling.

Compute scarcity looks more durable than today’s data market

In the rapid-fire portion, Yash Patil said that if he were not building Applied Compute, he would look at hardware. Applied Compute and other AI companies, he said, are running into compute scarcity. Demand is far outpacing supply. That suggests opportunities in energy sources to power compute and in more efficient chips.

He said he does not have a background in hardware or chip design, but he sees room for better co-development between training methods and chip design.

Asked for a long position, Patil named compute and chip providers, especially Nvidia. In his view, Nvidia will continue supplying the labs and remain a leader. But he also identified a risk in the economics. Patil said Nvidia takes roughly a 75% margin on its chips, while labs are spending hundreds of billions of dollars. That creates an incentive, in his account, for the largest customers to invest in their own chip design, co-develop model training, architecture, and chips internally, and make more chips even if each one is less effective.

He did not present that as easy. He said chip design is very hard. But the possibility of in-house chips is, in his view, a risk to the supplier structure because the biggest customers may eventually want to internalize more of the stack.

Patil was less bullish on the data market as currently structured. The problem is that RL data providers can be squeezed by their own success. If they create tasks that improve a customer’s model, the model becomes smarter. The next round of tasks must be harder, more expensive, and slower to produce because the hill has moved.

He also argued that smarter models improve synthetic data pipelines. Many RL tasks exploit a generator-verifier gap. In code, for example, the unit tests can be held out while a model attempts the task and the system verifies the result. That does not necessarily require a human. As models improve, companies can build better synthetic data systems around them.

Patil did not say data businesses will disappear. He noted that people have predicted the death of data markets before and been wrong. But he expects the market to change. The best founders in data, he said, are good at pivoting to the next wave: robotics data, egocentric data, or other categories.

Asked for a favorite AI product or modality, Patil named Image 2.0. He said he found it especially useful for people who cannot do design or want to understand things visually: he can take a paper, drop it into the image model, and get a visual walkthrough of how something works. Agrawal added that one of the most beautiful slides in the presentation had come from feeding the course syllabus into Image 2.0.
