Enterprises Are Misassigning GenAI Work to Traditional ML Teams

Phil HetzelAI EngineerMonday, May 25, 20269 min read

Phil Hetzel of Braintrust argues that many enterprises misassigned generative AI work to data science and ML platform teams because it carried the AI label. His case is not that those teams are irrelevant, but that LLM application work starts after providers such as OpenAI and Anthropic have trained the base models. What remains, he says, is a broader product and systems problem: prompt and context engineering, domain annotation, functional evaluation, observability, and production feedback loops that require data scientists, engineers, and subject-matter experts working together.

Enterprises are mis-scoping agent work when they route it through old ML ownership models

Phil Hetzel’s central claim is not that data scientists and machine learning engineers have no place in agent development. It is that many traditional enterprises assigned generative AI to them for the wrong reason: because “AI” appeared in the name, and because an ML platform team already existed.

The practical mistake is scoping agent work as if it were another predictive-model program. Hetzel described a familiar enterprise pattern: a CEO or CIO reads that agents are the path to the “AI promised land,” delegates the mandate downward, and the work lands with the existing ML or data science platform team. The fit looks natural from an org-chart perspective. Those teams already govern models, have deployment practices, and know how to think about testing.

Hetzel contrasted that with AI-native companies, where the company may have been built around agents from the beginning. A founder and a small group of engineers build around a problem, and because the company is small, more people are close to what the agent is actually supposed to solve. Those teams are less constrained by preexisting platform boundaries and more likely to be cross-functional across product engineering and AI engineering.

Organization type	How agent work tends to start	What Hetzel emphasized
Traditional enterprise	A CEO or CIO delegates an agent mandate to an existing ML or data science platform team.	The assignment often follows the label “AI,” not the changed nature of the work.
AI native	A founder and small engineering group build the solution, partly using agents.	Small teams tend to be cross-functional and closer to the problem the agent is meant to solve.

Hetzel contrasted enterprise delegation to existing ML teams with AI-native teams organized around the agent’s use case.

That organizational difference matters because agent development is not simply another predictive-model workflow. Hetzel reduced the distinction to two changes: the model is already built, and functionality can be changed with natural language.

The model's already built.

Phil Hetzel · Source

In a conventional ML workflow, a team gathers data, labels it, trains a model, tests it, deploys it, implements it downstream, and observes it. Hetzel acknowledged that real workflows are more complicated, but used the abstraction to isolate the point: much of what data scientists and ML engineers traditionally do is the training and testing pipeline.

With LLM-based applications, that upstream pipeline has usually been run by model providers. Anthropic, OpenAI, and Mistral have gathered data, trained underlying LLMs, and exposed them through endpoints. Application teams still need to test what they build on top of those models, but the work shifts. They are not usually training the base model. They are implementing an API inside a product, changing prompts, supplying context, evaluating behavior, and watching what happens in production.

What changed	Traditional ML frame	Agentic application frame
The model	Teams often build, train, test, and deploy the model themselves.	The base model is usually already trained and exposed through an endpoint by providers such as Anthropic, OpenAI, or Mistral.
How behavior changes	Teams add data, retrain, perform feature engineering, and A/B test model changes.	Teams change prompts, supply context, and study agent interactions to better align the system with the use case.

The conceptual hinge of Hetzel’s argument is that agent teams inherit the base model and change behavior largely through inputs and surrounding systems.

The source slide on changing predictive applications made the same comparison explicitly: traditional ML improvement was framed around adding training data, feature engineering, and A/B testing; AI improvement was framed around changing prompts, performing context engineering, and understanding agent interactions. That shift changes who needs to be in the room.

Agent quality is broader than precision, recall, and F1

The strongest version of the case for data science ownership is straightforward. As Phil Hetzel framed it, agents use models, and in many organizations models are governed by data scientists. Data scientists understand neural networks and, by extension, how LLMs work. This gives them an appreciation of the risks involved in using a complex technology, along with production discipline: existing processes for deploying model assets, keeping the company safe, and testing whether users receive an acceptable experience.

The counterargument is not that this expertise is irrelevant. It is that agent development does not follow the same pipeline. Teams building on LLM APIs are not necessarily doing training and testing in the old sense. They are not doing the “cross-validation dance” around a custom model in the same way.

Hetzel’s sharper objection was evaluative. A data scientist or ML engineer may know how to test a model, but not necessarily know what an agent should be tested for in a product context. He said he has seen teams “lock onto” traditional ML metrics such as precision, recall, and F1 because those metrics are familiar and have been rewarded in their previous work. Agent quality has a broader surface area.

When you're analyzing agents, it is far broader of a surface area that you need to be evaluating.

Phil Hetzel · Source

The question is not only whether a model scored well on a technical classification-style metric. It is whether the agent performs the function the product requires. Precision, recall, and F1 do not disappear, but they are insufficient as the primary frame for a system that must satisfy a use case, interact with users, and operate inside an application.

Hetzel also separated evaluation from observability while treating them as connected disciplines. Evals happen during experimentation: they help a team become confident while it tweaks and builds an agent before production. Observability happens after deployment: it helps the team remain confident when real users and real usage expose the agent to live conditions. Model providers may test the model. The application team still has to evaluate the implemented agent.

Product engineers see APIs and distributed systems where ML teams may see models

The case for non-data-scientist involvement begins with a simple reframing: LLMs can be treated as APIs. Phil Hetzel argued that product engineers are already accustomed to sending payloads to external systems, receiving responses, and shaping those responses into useful product experiences.

That does not make agent development simple. Hetzel emphasized that complex agents can become distributed systems. A supervisor agent may call multiple child or sub-agents. Those sub-agents may run on different infrastructure and call different downstream systems. The resulting architecture is not just a statistical problem. It can involve execution across infrastructure, integrations with other systems, and the surrounding product experience.

For that reason, a complex agent would “not be served well with only data scientists.” Product, application, and systems engineers are needed to implement requirements into the product, build the surrounding system in a way that can support production use, and implement eval and observability pipelines.

The role of non-technical experts is different but central in Hetzel’s model. Product managers and subject matter experts know what to examine to determine whether the agent succeeded or failed. They understand how actual users will judge the agent’s behavior.

Hetzel connected that expertise to two practical workflows: prompt and context engineering, and human annotation. Because LLM behavior can be changed through natural language inputs rather than only through model retraining or feature engineering, people with domain knowledge can shape the system directly. They can adjust prompts, experiment with context, inspect traces, label outputs, and explain why an agent behaved well or badly.

That is where the narrow ML-platform model breaks down. In a generative AI application, improvement can come from changing prompts, supplying better context, and understanding agent interactions well enough to align the system with the use case. The person best equipped to do that may not be the person with the deepest knowledge of model internals. It may be the person who knows the work the agent is supposed to perform.

Data scientists provide the rigor that keeps agent work from becoming guesswork

Phil Hetzel explicitly rejected the idea that data scientists need to “completely refresh” their skill set or step away from agent development. His conclusion was that the best agent teams are diverse and do not consist only of data scientists.

He identified three specific ways data scientists add value beyond simply helping build the product.

The first is education. Many people implement LLMs aggressively without understanding how the underlying technology works. Data scientists can be “the adult in the room” by explaining how LLMs are trained, what their limitations are, and why the system should not be treated as if it “actually knows” things. In his phrasing, an LLM is “just predicting token after token” and is “a bunch of stats problems at the end of the day.”

That education role also includes keeping the broader team current on new research. In a cross-functional agent team, data scientists can prevent product enthusiasm from becoming ungrounded certainty.

The second role is reining in LLM-as-judge evaluation. Hetzel described LLM-as-judge as a major part of the eval process for agentic applications, but cautioned that people are often tempted to simply believe judges because they are convenient. An LLM judge is still a prompt and a model. It needs to be evaluated too.

Traditional data science metrics re-enter the picture when they fit the problem. If a team can create labeled data, it can compare the judge’s outputs against ground truth and apply precision, recall, F1, and related scoring. Hetzel’s slide framed a good LLM judge as one with discrete outputs and ground truth that is easy to create. In that setting, the traditional data science toolkit is directly useful.

The third role is fine-tuning. Hetzel described this as rare, but also as the most technically obvious place where data scientists and machine learning engineers provide value. If a team needs to fine-tune an open-source model for a specific use case, the data science team is best equipped to handle that pipeline.

Data science metrics are not obsolete. They can help validate evaluators and quantify agreement. They should not be mistaken for the whole of agent quality.

The problem should determine the team

The proposed team model assigns different responsibilities to three groups: data scientists, product or systems engineers, and non-technical experts. In Phil Hetzel’s version, data scientists educate the team on the underlying technology, perform traditional ML-style scoring when appropriate, and fine-tune models when necessary. Product, application, and systems engineers turn requirements into product behavior, build the execution environment around the agent, and implement eval and observability pipelines. Non-technical experts solicit feedback from real users, study traces, annotate behavior, provide subject matter expertise, and experiment with prompts and context in natural language.

Persona	Primary contribution in Hetzel’s model
Data scientist	Explains the underlying technology, scores evaluators with traditional ML metrics when appropriate, and handles fine-tuning when needed.
Product / application / systems engineer	Implements requirements, builds the production system around the agent, and wires eval and observability pipelines.
Non-technical expert	Brings domain knowledge, annotates traces, gathers user feedback, and adjusts prompts or context using natural language.

Hetzel’s ideal agent-development team is divided by contribution, not by exclusive ownership.

An audience member sharpened the ownership question by reframing agents as tools. On that view, anyone in an organization could build or own an agent, depending on the problem being solved. The right question is not whether the ML team or another computing team owns the tool, but what problem the agent is meant to solve and who has the relevant expertise.

Hetzel agreed. An agent is a product built by a diverse team. The mistake he sees in traditional companies is treating agents as “another predictive model,” isolating the work to ML engineers or data scientists, and telling them to “go build these agent things.”

The same clarification applied to tooling and feedback loops. Asked about making it easier for domain experts to update systems and keep evaluators current as the system changes, Hetzel said Braintrust leans into the domain-expert persona with human labeling and an agent and prompt playground where people can experiment with prompts and send them to underlying agents.

More broadly, the loop he described is to gather production data continually into the offline dataset used for evaluation. Over time, the team should use grounded data to check whether its evals are aligning more closely with human agreement or diverging from it. That is the practical bridge between domain judgment and repeatable evaluation: subject matter experts help define and annotate what good behavior looks like, while the evaluation system is updated against production examples and checked against human agreement.

Domain experts know what good behavior looks like. Data scientists know how to bring rigor to measurement. Engineers know how to build production systems. Agent development needs all three because the work is no longer only model creation, only product implementation, or only domain review.

Hetzel’s answer was not that GenAI belongs somewhere else. It was that exclusive ownership is the wrong abstraction.

AI Application Architecture Evals and Benchmarks Agents and Autonomy Enterprise AI Adoption