Fine-Tuning Becomes the Next Step for Mature AI Products

Benjamin CowenAI EngineerTuesday, June 2, 20266 min read

Benjamin Cowen, a forward-deployed machine-learning engineer at Modal, argues that fine-tuning is becoming a normal stage in the maturation of AI products rather than a specialist research exercise. His case is that frontier APIs and product teams optimize for different goals: labs need broadly capable models, while companies need models that fit their own economics, latency constraints and business-specific quality metrics. Cowen says the decision point shows up when API costs overwhelm revenue, evals stop improving through prompting, or shared endpoints cannot meet throughput requirements.

Fine-tuning becomes attractive when the product stops being generic

Benjamin Cowen’s central claim is not that every AI company should immediately train its own model. It is that, as AI products mature, more of them cross into a custom domain where frontier APIs no longer match the product’s economics, latency requirements, throughput needs, or business-specific quality metric.

Cowen framed this as a spectrum. At one end is the frontier API: fast to start, broadly capable, and responsible for much of the acceleration in AI application development. Its weakness is that customization mostly stops at prompt engineering. He joked that “caveman mode” prompting can reduce token usage by asking the model to speak tersely, but argued that this kind of optimization is not a durable answer if a startup grows by 100x or 1,000x.

At the other end is training and serving from scratch: maximum control, but also responsibility for cluster management, reservations, orchestration, production isolation, software maintenance, and infrastructure work that can pull AI engineers, scientists, or dedicated infrastructure engineers away from product work. Training workloads, he said, have different scaling and compute characteristics from most production workloads, so owning the whole stack traditionally meant becoming an infrastructure operator.

The middle ground in Cowen’s account is fine-tuning on serverless infrastructure: algorithmic control without the full burden of cluster ownership. Modal, where he is a forward-deployed machine learning engineer, is one such platform. His broader point was that cloud platforms and open-source libraries are making model customization feel less like an infrastructure migration and more like an extension of product iteration.

If you have a differentiated product, it is custom.

Benjamin Cowen

The frontier lab and the product team are optimizing for different things

Cowen treated domain-specific models as a pattern he is seeing among companies moving beyond generic API use. A Decagon-attributed slide described “fine-tuning on proprietary data” as “replacing API dependency,” citing Intercom’s Fin Apex as beating GPT-4 at one-fifth the cost with a 73.7% resolution rate, and Pinterest’s CEO as claiming an “orders of magnitude reduction in cost” from fine-tuned open-source models versus frontier APIs. Cowen also showed a post attributed to Clement Delangue saying companies including Pinterest, Airbnb, Stripe, Uber, and Intercom were publicly sharing that they were finding open models they train themselves “better, cheaper, faster” for many tasks.

1/5

the cost at which the Decagon slide said Intercom’s Fin Apex beat GPT-4

Cowen’s explanation for this pattern was strategic, not merely technical. Frontier labs want models that perform well across as many tasks as possible. A company building a product wants to be best at what it provides its customers. Those are different objectives.

The Decagon material put the point more sharply: “The model is the transistor. The fine-tuned, domain-specific system is the product.” It also included a quote attributed to Max Lu, Head of Decagon Labs: “As the major labs push toward broader reasoning capabilities, the gap between what they prioritize and what [Decagon] actually requires has only widened.”

If the metric that matters is tied to the company’s product and customer promise, the broad frontier model may be misaligned even when it remains impressive.

The decision point is visible before the migration is urgent

Cowen offered three practical categories of signals that a product may be nearing the point where fine-tuning is worth evaluating.

The first is cost. If the company has already optimized prompts and is still paying more for API usage than customers pay the company, that is a sign the economics are not scaling. At that point, the question is not whether frontier APIs are good, but whether a customized inference endpoint could change the margin structure.

The second is quality plateauing. If evals are no longer improving despite prompt work, fine-tuning may offer another route. Cowen did not present fine-tuning as a substitute for measurement; he treated evals as a precondition. Without mature evals, there is no reliable way to know whether training improved the product.

The third is latency and throughput. Cowen pointed especially to startups that win large enterprise contracts with specific latency or throughput requirements. Shared frontier endpoints offer limited ability to customize around those constraints, particularly when the product also depends on a custom metric that encodes the company’s own logic.

He also gave the negative test: if the company is still developing core functionality, has not collected data, or lacks mature evals, it is probably not time to train. He invoked the old training maxim “garbage in, garbage out.” Fine-tuning does not remove the need for data; it makes the quality of data more consequential.

A mature agent product may already contain the training system’s ingredients

Benjamin Cowen’s most practical claim was that many AI product teams have already built much of what they need to train, even if they do not think of themselves as training teams. If a team has built an agent harness, evaluates product behavior, and collects data on what works and what fails, then it has touched the core components needed for model training.

For reinforcement learning, an agent harness can become the environment in which a model learns to provide the service. If the product already records successful and unsuccessful behavior, that becomes training data. If the team already has an eval suite, that becomes the feedback mechanism for deciding whether a trained model is improving.

Cowen contrasted this with older training workflows. He said many machine-learning practitioners used to “tape the gradient by hand” and implement linear algebra themselves. For most practical fine-tuning, that is no longer necessary. Open-source libraries now expose the algorithms directly enough that teams can control training behavior without building the machinery from first principles.

Modal’s examples repository was used to illustrate the point: supervised fine-tuning in 300 lines of Python once the data is curated; GRPO reinforcement learning in 300 lines of code; vLLM inference in 200 lines of code. The slides showed the corresponding example paths: modal-labs/modal-examples/blob/main/06_gpu_and_ml/yolo/finetune_yolo.py, modal-labs/modal-examples/tree/main/06_gpu_and_ml/reinforcement-learning, and modal-labs/modal-examples/blob/main/06_gpu_and_ml/vllm_inference.py. Cowen’s claim was not that code length alone solves the problem, but that the training and serving loop no longer necessarily requires a giant monorepo or a team operating raw infrastructure.

Task	What Cowen said is now feasible	Illustrated example
Supervised fine-tuning	Train once data is curated, using open-source libraries and serverless infrastructure	SFT in 300 lines of Python
Reinforcement learning	Use an agent harness and eval data to let a model practice through rollouts	GRPO in 300 lines of code
Serving	Deploy the trained model behind an autoscaling inference endpoint	vLLM inference in 200 lines of code

Cowen’s examples for how much training and serving code is required once data and evals exist

Serverless matters because training is not only one big job

Cowen argued that serverless infrastructure is useful for more than inference. In training, it changes how teams can run experiments.

Hyperparameter tuning was his first example. Instead of treating every minute on a reserved cluster as scarce, a team can fan out many containers on demand, abandon unpromising runs quickly, and keep the faster-moving experiments. He compared this to a “meta evolutionary algorithm”: spin up many candidates, kill the weak ones, and iterate.

Reinforcement learning makes the infrastructure point more explicit. A model being trained with RL must practice repeatedly, producing many evaluations known as rollouts. Cowen described these rollouts as “massively, embarrassingly parallel.” Modal’s particular advantage, he said, is unified APIs for GPU containers and secure sandboxes, which allow RL systems to run large numbers of practice environments.

50,000–100,000

sandboxes Cowen said Modal customers have scaled to for RL rollouts

After training, the model still has to be served. Cowen said this is what the frontier API is doing under the hood. He did not say which serving stack frontier labs use; the relevant point was that comparable serving patterns are available through vLLM, SGLang, Triton Inference Server, or custom Python workflows, with autoscaling to match traffic.

The preparation is data and evals, not a premature platform bet

Benjamin Cowen closed by narrowing the prescription. He was not telling teams to train a model immediately. He was telling them not to treat training as something safely deferred for a decade. A product may need its own model in a year or six months, and the useful question is how the team will know when that moment has arrived.

The preparation he emphasized was straightforward: collect data, develop evals, and understand which metric the generic model is failing to optimize. Fine-tuning becomes meaningful only when the company can say what improvement means and has the data to train toward it.

AI Application Architecture Data and Training Evals and Benchmarks Inference and Deployment AI Infrastructure and Compute