
Fixed Evaluation Suites Go Stale as Agents Optimize Toward Intent

Vincent Koc · AI Engineer · Tuesday, May 12, 2026 · 11 min read

Vincent Koc of Comet ML argues that AI evaluation is being outpaced by the systems it is meant to measure. In a talk on adaptive evaluation for agents, Koc says static benchmarks and handcrafted test sets are poorly suited to applications that change with prompts, tools, production traces, user behavior and even their own harnesses. His proposed direction is to define the intended end state, use traces and telemetry to surface drift and edge cases, and treat evals as a continuously revised system rather than a one-time benchmark.

Static evals break when the system being measured keeps adapting

Vincent Koc’s central claim is that evaluation practice has lagged behind the systems it is now supposed to measure. AI applications are increasingly malleable: their behavior changes with prompts, context, tools, telemetry, users, and in some cases the harness around them. Yet much of the evaluation stack still assumes the thing under test is relatively fixed.

Koc framed the problem through the familiar software-testing ladder. In conventional engineering, teams begin with examples and unit tests. They add manual regression suites when a particular path produces an unwanted outcome. They put CI/CD pipelines around shipping. They use observability to understand what is happening in production. And, at the edge, they practice something like chaos engineering: deliberately stressing and breaking systems to see where they fail.

His point was not that AI teams have no tests. It was that their dominant testing pattern is narrower. In AI and data science, he said, teams lean heavily on static benchmarks, hand-curated evaluation sets, and pre-deployment offline evaluation. A bank might ask an AI system a crafted set of compliance questions to ensure it does not cross a line into selling financial services. A product team might tune a set of prompts or expected answers until the model behaves correctly on that set. But the equivalent of chaos engineering—the “what happens when users, data, and behavior drift?” layer—is often missing.

That gap matters because the applications themselves are no longer static. Koc described AI systems as being treated like static software even as software itself becomes more malleable. He cited OpenClaw, where he said he is one of the core contributors, as an example of a harness that changes itself: the harness can shift as users create skills or ask the system to do new things. If the application and its surrounding harness can adapt at that speed, the benchmark suite faces an obvious question: how does it keep up?

The answer, for Koc, is not simply to make a larger benchmark. He criticized the AI field’s fixation on benchmarks that may be technically interesting but weakly connected to production failure. In his telling, teams can accumulate “a huge humongous set” of datasets meant to explain agent behavior, only to find themselves back at the drawing board when something breaks in production. The failure mode is not a lack of measurement in the abstract. It is measurement that calcifies while the product keeps moving.

The missing layer is adaptive testing, not more handcrafted examples

Koc used an academic paper on adaptive testing for LLM evaluation to point at a broader mindset shift. The slide showed a preprint titled “Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks,” by Peiju Liu, Chen Yang, Cheng-Hao Hsieh, Cheng-Hao Tung, Chien-Kuo Chien, and Nitesh V. Chawla at the University of Notre Dame. Its abstract described ADAPT, an adaptive testing framework based on Item Response Theory, using Fisher-information-guided item selection to estimate model ability while reducing the number of benchmark items required.

The paper, as shown, described existing evaluation protocols as relying on constant test sets. ADAPT, by contrast, dynamically selects an optimal set of items per model, with the stated aim of maximizing measurement precision and minimizing resource consumption. The slide’s abstract also said the authors used evaluation data from 151 LLMs on HellaSwag and found that accuracies reconstructed from ability estimates were highly correlated with true accuracy across four benchmarks. It further stated that among more than 5,000 evaluated models, 25–35% showed equivalent accuracy under standard evaluation, ties that ADAPT could resolve more finely.
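
As a rough illustration of what Fisher-information-guided item selection can look like, the sketch below uses a two-parameter logistic IRT model. The item parameters, the ability-update step, and the `administer` callable are assumptions for illustration, not the paper’s implementation.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information one item contributes at the current ability estimate."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def adaptive_eval(items, administer, n_steps=30, lr=0.5):
    """Each round, pick the unused item with the most Fisher information,
    run the model on it, and nudge the ability estimate toward the outcome."""
    theta = 0.0
    remaining = list(range(len(items)))
    for _ in range(min(n_steps, len(items))):
        best = max(remaining, key=lambda i: fisher_information(theta, *items[i]))
        remaining.remove(best)
        a, b = items[best]
        correct = administer(best)  # returns 1 if the model got the item right, else 0
        # One gradient step on the Bernoulli log-likelihood, a stand-in for a full refit.
        theta += lr * a * (correct - p_correct(theta, a, b))
    return theta
```

The selective part is the `max` over Fisher information: items that are too easy or too hard for the current ability estimate contribute little and are skipped, which is how an adaptive test can use far fewer items than a fixed suite.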

Koc did not present the paper as his own work. He called it one of many papers in the area and used it to ask why benchmarks are static in the first place. If applications change, and if different test items are informative for different models or different conditions, then a fixed test suite may be both expensive and poorly targeted.

In Koc’s framing, adaptive testing is useful not only because it can be more selective about which items to run. It also helps shift the mindset away from benchmark suites as fixed artifacts. His production concern is broader: deployed agents generate traces, encounter new user behavior, incur costs, fail in specific ways, and operate under organizational intent. Evaluation, in his view, should be capable of changing with that operational reality.

The path from prompt engineering to intent engineering makes evals harder, not obsolete

Vincent Koc organized the recent development of AI application-building into three stages: prompt engineering, context engineering, and intent engineering.

Prompt engineering, in his account, was the trial-and-error phase. Builders “doom scroll” and wordsmith instructions, changing words in a prompt and hoping the output improves. He compared it, with caveats, to discovering a drug’s actual use by accident: trying to make medication for one disease and finding it cures pain. Prompt engineering was not, in his view, a disciplined control surface. It was bashing words into a system and observing what changed. He said this practice “died” around 2023, though people still do it.

Context engineering made evaluation more relevant because the systems became more structured. With RAG, tool calling, long context windows, MCP servers, reasoning models, and longer-running models, the agent could be decomposed. A team could test whether a specific tool did the sales-agent function it was supposed to do. It could inspect parts of the agentic system rather than only judging the final response. Koc described this as making evaluation more steerable and understandable, though still incomplete.
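
Tool-level checks of that kind can look like ordinary unit tests. The sketch below assumes a hypothetical sales-agent pricing tool; the function, its return shape, and the assertions are illustrative, not drawn from Koc’s talk or any specific framework.

```python
def lookup_product_pricing(product_id: str) -> dict:
    """Stand-in for the tool a sales agent calls before quoting a price."""
    if not product_id:
        raise ValueError("product_id must be non-empty")
    return {"product_id": product_id, "list_price": 49.0, "currency": "USD"}

def test_pricing_tool_returns_required_fields():
    # Test the tool directly rather than only judging the agent's final answer.
    result = lookup_product_pricing("sku-42")
    assert set(result) >= {"product_id", "list_price", "currency"}
    assert result["product_id"] == "sku-42"
    assert result["list_price"] > 0

def test_pricing_tool_rejects_empty_ids():
    # An edge case a full agent transcript would hide: malformed input.
    try:
        lookup_product_pricing("")
    except ValueError:
        return
    raise AssertionError("expected a ValueError for an empty product id")
```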

Intent engineering is where he sees the next complication. Code is cheap, he said, and tokens are abundant; whether or not one calls them cheap, availability increases consumption. More token consumption means more generated software and faster application velocity. Meanwhile, models are becoming more capable. Koc pointed to optimization problems and ARC-AGI-style puzzles as examples of areas where large language models can recognize patterns and perform tasks that are difficult for humans.

The result is a class of systems that can self-optimize toward a user’s or organization’s intent. Koc said this is visible in harnesses such as OpenClaw, and in other harness-like experiences inside Claude and Codex, where the system tries to understand the user, adapt to them, and improve the experience.

That is precisely why he rejects the idea that evals are “dead.” The joke that evals, A/B testing, or observability are dead may contain a real frustration with old practices, but Koc’s conclusion is the opposite. If agents adapt to different users, then organizations need more visibility, not less. They need to know how one user’s experience differs from another’s, how the agent is changing across its layers, and what “insecure” or “unknown” behavior actually means in operational terms.

When we have intentful machines, the evaluations become even more complicated because it’s like, how do I know my experience is different from your experience and different from someone else’s experience?

Vincent Koc

The difficulty is not only correctness. Koc asked how teams should define ambiguity inside an agent, or personality inside an agent, and how those qualities should map to organizational requirements. He suggested rubrics as one path, analogous to evaluating art or other subjective work in schools. In that model, the evaluation target is no longer “one plus one equals two” or “this exact question must receive this exact answer.” It is a property or intent-based outcome.
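
One way to express such a rubric in code is to grade each response against named properties rather than a single exact answer. In the sketch below, the rubric dimensions and the `judge` callable (for example, an LLM-as-judge prompt returning a score between 0 and 1) are assumptions for illustration.

```python
RUBRIC = {
    "stays_within_compliance_boundaries": "Does the answer avoid selling financial services?",
    "handles_ambiguity": "Does the answer acknowledge missing information instead of guessing?",
    "matches_intended_personality": "Is the tone consistent with the organization's intended voice?",
}

def score_response(response: str, judge) -> dict:
    """Return a 0-1 score per rubric dimension using a caller-supplied judge."""
    return {name: judge(question=question, response=response)
            for name, question in RUBRIC.items()}

def meets_intent(scores: dict, threshold: float = 0.7) -> bool:
    """An intent-based outcome: every rubric dimension clears the threshold."""
    return all(score >= threshold for score in scores.values())
```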

Production traces should make stale tests visible

Koc’s proposed direction begins with traces. His example was not an abstract observability program but a concrete shift in usage: perhaps 80% of the time an agent handles the same things, and then the customer base changes. Those new customers start looking for different things, asking questions differently, and changing the behavior the agent has to handle.

If the tests remain tied to the original handcrafted dataset, the evaluation system may miss that shift. Koc’s question was why those traces are not being fed into agents. If something has changed, he argued, an agent could surface that change to the owners of the system and help change the suites and tests. In this model, the agent is not only the thing being evaluated; agents may also help curate the evaluation artifacts.

That leads to what he called always-on evaluation and optimization. Instead of a static benchmark run before deployment, the evaluation layer would operate continuously. It would watch traces, use agents to evaluate behavior, and update the measurement process as the product and its users change. Koc presented this as the direction he believes evaluation should move, not as a settled industry practice.
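
A minimal sketch of what that trace-watching step could look like, under assumptions: traces are bucketed into coarse topics by a caller-supplied function, and anything common in production but absent from the eval set is surfaced as a candidate test case. The field names and threshold are illustrative.

```python
from collections import Counter

def surface_drift(recent_traces, eval_set_topics, topic_of, min_share=0.05):
    """Flag trace topics that are common in production but missing from the eval set.

    topic_of is any callable that buckets a trace into a coarse category
    (an intent label, a tool name, a customer segment, and so on)."""
    counts = Counter(topic_of(trace) for trace in recent_traces)
    total = sum(counts.values()) or 1
    return [(topic, n / total)
            for topic, n in counts.items()
            if topic not in eval_set_topics and n / total >= min_share]

# Usage sketch: traces are dicts with an "intent" field (an assumption); anything
# flagged here becomes a candidate test case for human or agent review.
traces = [{"intent": "refund"}] * 70 + [{"intent": "chargeback_dispute"}] * 30
for topic, share in surface_drift(traces, {"refund", "balance"}, lambda t: t["intent"]):
    print(f"new behavior in production: {topic} ({share:.0%} of recent traffic)")
```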

Telemetry is the other input. Koc described work on “telemetry in the loop,” including a paper he said has been written on the idea. The premise is that when teams are writing software applications, MCPs, or other agentic systems, a harness that is aware of what is breaking and how much it costs could use set conditions to self-correct to some degree. If a harness sees an error or issue, it may be able to fix itself and continue. Koc said teams are starting to see this kind of behavior with harnesses.

The underlying shift is from trying to predict every failure in advance to using operational data as part of the agent and evaluation loop. Instead of treating telemetry only as a dashboard humans inspect after the fact, Koc wants that data to become part of the correction mechanism.
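
A rough sketch of that pattern, with assumed thresholds, an assumed fallback configuration, and a caller-supplied `run_once` that returns a result, a cost, and an optional error string; none of this reflects a specific harness’s API.

```python
import time

def run_with_telemetry(task, run_once, max_attempts=3, cost_budget=1.00):
    """Run a task while watching cost and error telemetry, self-correcting
    against set conditions before retrying. run_once(task, config) is expected
    to return (result, cost_in_dollars, error_or_None)."""
    spent = 0.0
    config = {"model": "primary", "max_tokens": 2048}
    for attempt in range(1, max_attempts + 1):
        result, cost, error = run_once(task, config)
        spent += cost
        if error is None:
            return result
        # Set conditions for self-correction rather than waiting for a human:
        if spent > cost_budget:
            config["model"] = "cheaper-fallback"  # assumed fallback, not a real model id
        if "timeout" in error:
            config["max_tokens"] = config["max_tokens"] // 2
        time.sleep(attempt)  # simple backoff before the next attempt
    raise RuntimeError(f"task failed after {max_attempts} attempts, spent ${spent:.2f}")
```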

The calcification problem is that evals become stale exactly where the business is exposed

Vincent Koc called the failure mode “eval calcification”: the evaluation suite hardens while the product, users, and agent behavior keep changing. The phrase was tentative—he said he was still “stewing” on it and that it sounded like a paper title—but the mechanism was clear.

An evaluation set often begins as a dataset. It contains examples, expected outputs, adversarial cases, regressions, and compliance checks. That may be enough for the stable 80% of system behavior. But Koc argued that the dangerous part is the changing 20%: the weird question, the unexpected user path, the new customer segment, the use of the agent in a strange way. That 20% is what can “mess up your business.”

20%: the changing edge behavior that Koc argued can create disproportionate business risk.

The slide behind this point reduced the problem to an 80/20 split. The visual was intentionally simple: a large stable block and a smaller changing block. Its force was not statistical precision; it was the operating intuition that a product can appear mostly covered while the exposed margin keeps moving.

Koc’s alternative is to treat evals less like a fixed dataset and more like code, software, or even a living agent. The team defines the end state it wants. The system can then use traces, telemetry, reward signals, and adaptive testing approaches to move toward that end state and keep the evaluation suite from going stale.

He illustrated the idea with auto-optimization examples: set a goal, set a target, and let a Python loop tune and tweak toward it. The target could be practical or arbitrary—the cheapest barbecue mix, the tastiest barbecue mix, or an agent behavior objective. The important pattern is that users and organizations have an intent to optimize toward. Evaluation should express that end state rather than only enumerate starting examples.
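
A toy version of that loop, with an assumed `score` function standing in for whatever expresses the intent; the point is the shape of the pattern, not the optimizer.

```python
import random

def optimize(params, score, n_attempts=100, step=0.1):
    """Hill-climb a parameter dict toward whatever the score function rewards."""
    best, best_score = dict(params), score(params)
    for _ in range(n_attempts):
        candidate = {k: v + random.uniform(-step, step) for k, v in best.items()}
        candidate_score = score(candidate)
        if candidate_score > best_score:  # keep any change that moves toward the intent
            best, best_score = candidate, candidate_score
    return best, best_score

# Usage: the target can be practical or arbitrary; here a mock "tastiness" score
# for a spice blend stands in for an agent-behavior objective.
tastiness = lambda p: -(p["paprika"] - 0.6) ** 2 - (p["cumin"] - 0.3) ** 2
blend, rating = optimize({"paprika": 0.5, "cumin": 0.5}, tastiness)
print(blend, round(rating, 4))
```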

This reframes the role of the evaluator. Instead of asking humans to predict every case and encode it as a dataset, the evaluation system is given a desired outcome and can help discover, maintain, and update the tests. Koc summarized the shift as moving from “our evals become the dataset or the starting point” to “our evals become what is the end state that we want to get to.”

He also showed a chart titled “Accuracy per Attempt,” with several approaches plotted across 100 attempts and accuracy rising or varying over repeated tries. The chart supported the same idea: optimization is iterative. Once agents can work against a reward signal, evaluation can be framed as a loop—define the desired outcome, measure attempts, feed results back, and adapt.

Evals need the same adaptive mindset as the agents

The practical consequence of Koc’s argument is that evals and observability start to converge. If an AI system were a static model behind a static interface, pre-deployment offline tests could carry more of the burden. But his target is an agentic application with tools, memory, traces, user-specific adaptation, changing harness behavior, and cost constraints. In that setting, evaluation has to account for the layers where the agent is changing.

That does not make handcrafted tests useless. Koc’s 80/20 framing leaves room for stable, intent-defined tests around known behavior. Compliance checks, regression cases, and tool-specific tests still matter. But they are insufficient when the product is shifting under the test suite.

The harder problem is maintaining attention on the moving boundary: how users are changing, how the agent is adapting, what tools are invoked, which costs are rising, which errors recur, and whether the agent’s personality or ambiguity still matches the organization’s desired intent. Koc’s recommendation is to approach evaluation with the same agentic mindset as the product itself.

He said he had planned to show a more in-depth demonstration of how Comet has applied these ideas, but the end state was not yet finished and would be ready in the coming weeks. Instead, he presented the conceptual map: define intent, observe traces and telemetry, let agents help surface and revise hard cases, and keep stable tests for the known 80%. Static benchmarks made sense for more static systems. Context-rich agents made evaluation more modular but not complete. Intentful agents make fixed evaluation brittle because the system adapts to users and conditions.
