AI Evaluation Is Falling Behind Agent Deployment in High-Stakes Domains
Vincent Chen of Snorkel AI argues that agent evaluation has not kept pace with the systems now being pushed toward real deployment. Drawing on more than 120 applications to Snorkel’s Open Benchmarks Grants, he lays out a framework for benchmarks that are rigorous enough to measure capability and opinionated enough to direct research. In Chen’s account, the next useful benchmarks will need validated tasks, intentional distributions, unsaturated headroom, and evaluation methods that capture realistic constraints, while also betting on richer environments, longer autonomy, and more complex outputs.

The evaluation gap is now a deployment problem
Vincent Chen’s starting point is not that agents lack capability. It is that measurement has fallen behind the capability people are now trying to deploy.
At Snorkel AI, Chen said, the company works at the intersection of academic frontier research, frontier labs, open-source collaborations, and enterprise deployments in domains where mistakes have consequences: finance, insurance, healthcare, and other production settings. That exposure produces an asymmetry. Model cards show improvement, “the vibes are improving,” and coding agents in particular have shifted expectations. But when individuals and enterprises are asked whether they are ready to let agents operate freely in high-stakes environments, Chen said the answer is usually hesitation.
The issue, in his framing, is not simply whether the capabilities exist. It is whether organizations can measure them well enough in practice.
“Our ability to actually measure these agents in practice, that is falling behind of where the capabilities actually are.”
Closing that gap, Chen said, requires a broader evaluation toolkit. Field deployments matter because they force models to “contact reality” in actual production environments. Other evaluation tools matter too: red-teaming, private human evaluation, domain-expert review, crowd-sourced labeling, and related approaches. But he argued that open benchmarks remain a critical part of the measurement stack because the best of them do more than retrospectively score progress. They define new vectors of progress.
In that sense, benchmarks are not neutral scoreboards. They are bets about what the field should learn to do next. Chen cited Terminal-Bench, METR’s long-horizon benchmark, and ARC-AGI as examples of benchmarks that have become guideposts for frontier development. The path to safe and trustworthy agents, in his view, depends on having more benchmarks of this kind. The slide put it plainly: “We need more, diverse open benchmarks on the path to reliable agents.”
That view is also behind Snorkel’s Open Benchmarks Grants, a $3 million-plus commitment announced with partners including Hugging Face, AI2, Together.ai, Factory, Harbor, and PyTorch. Chen said Snorkel had reviewed more than 120 applications from academic and industry teams, and used that process to distill two sets of criteria: the “science” of benchmarks as empirical measuring sticks, and the “art” of benchmarks that shape the frontier.
Good benchmarks start with tasks that survive adversarial validation
The first scientific requirement in Vincent Chen’s framework is individual task quality. A task in a serious benchmark needs to represent real-world complexity, contain well-posed instructions, have a verifiable solution, and ideally be validated by domain experts. The closer a task sits to the frontier, the less it suffices for a single expert to assert that it is good.
GPQA matters in Chen’s framework not only because it has endured as a test of graduate-level and professional knowledge, but because of its quality-control process. He emphasized a contribution “tucked away in the appendix”: GPQA introduced adversarial quality control to ensure that tasks were not only well-posed, but tractable for other experts to solve.
The workflow shown for GPQA was deliberately multi-stage: question writing by an author, validation by a first expert, revision by the original writer, validation by a second expert, and non-expert validation. Chen said this protocol reflected the difficulty of frontier task design. When questions push the edge of expertise, it is non-trivial for any single expert to determine whether a task is valid, solvable, and useful.
He also highlighted GPQA’s incentive design. Payouts were tied to forms of agreement, borrowing some inspiration from academic peer review while trying to improve the process of multi-expert validation. Chen did not present that as a solved system; he noted peer review is flawed. But he treated the benchmark’s attempt to operationalize adversarial, multi-reviewer task validation as a model for the kind of rigor needed when benchmark items are themselves hard.
The underlying claim is that a benchmark’s aggregate number cannot compensate for weak individual tasks. If the tasks are ambiguous, underspecified, unverifiable, or not actually tractable for qualified humans, the downstream leaderboard inherits that weakness.
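A multi-reviewer gate of this kind can be sketched in a few lines of Python. This is an illustration only, not GPQA's actual acceptance rule: the `ValidationRecord` type, the field names, and both thresholds are hypothetical, and the reading of non-expert validation as a difficulty check (experts can solve the task, non-experts mostly cannot) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ValidationRecord:
    """Hypothetical record of one task's review outcomes."""
    task_id: str
    expert_correct: list[bool]      # one entry per expert validator
    non_expert_correct: list[bool]  # one entry per non-expert validator


def passes_validation(
    record: ValidationRecord,
    min_expert_rate: float = 1.0,      # assumed threshold, not GPQA's
    max_non_expert_rate: float = 0.5,  # assumed threshold, not GPQA's
) -> bool:
    """Keep a task only if experts can solve it and non-experts mostly cannot."""
    if not record.expert_correct or not record.non_expert_correct:
        return False  # a task with missing reviews has not been validated
    expert_rate = sum(record.expert_correct) / len(record.expert_correct)
    non_expert_rate = sum(record.non_expert_correct) / len(record.non_expert_correct)
    return expert_rate >= min_expert_rate and non_expert_rate <= max_non_expert_rate


# Illustrative data only.
records = [
    ValidationRecord("q1", [True, True], [False, False, True]),   # tractable and hard
    ValidationRecord("q2", [True, False], [False, False, False]), # experts disagree
    ValidationRecord("q3", [True, True], [True, True, True]),     # too easy
]
kept = [r.task_id for r in records if passes_validation(r)]
print(kept)
```

The point of the sketch is that the filter runs per task, before any aggregate score exists. A task that fails the gate never reaches the leaderboard.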
The distribution of tasks has to be designed, not accumulated
Task quality is not enough if the task set is accidental. In Chen’s account, a benchmark should define a clear taxonomy of the domain or capability it intends to measure, then distribute tasks intentionally across that taxonomy.
That can mean representing the actual traffic an agent sees in production. It can also mean intentionally characterizing failure modes that are rare but disproportionately important. Vincent Chen used self-driving as an analogy: yellow lights, pedestrians, or motorcyclists may appear less often than ordinary driving scenarios, but they matter enough that a benchmark cannot ignore them.
MMLU was the durable example of taxonomic design. Chen described it as an ambitious construction of 57 academic and professional domains across STEM, humanities, and other areas. Its longevity as a test of graduate and professional knowledge, he argued, is tied to the intentionality of that taxonomy.
The principle generalizes beyond academic Q&A. For agent benchmarks, the relevant distribution might be production traffic, failure modes, tool-use patterns, or domain-specific task categories. Chen’s point was not that every benchmark must mirror production traffic exactly. It was that benchmark designers need to know what distribution they are claiming to measure, and why.
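One way to make a distribution intentional is to fix per-category quotas up front and sample against them. The sketch below assumes a hypothetical `sample_by_taxonomy` helper and borrows illustrative category names from the self-driving analogy; none of it comes from a specific benchmark.

```python
import random
from collections import defaultdict


def sample_by_taxonomy(tasks, quotas, seed=0):
    """Pick a fixed number of tasks per category.

    tasks:  list of (task_id, category) pairs
    quotas: dict mapping category -> number of tasks wanted
    """
    by_category = defaultdict(list)
    for task_id, category in tasks:
        by_category[category].append(task_id)

    rng = random.Random(seed)  # seeded so the selected set is reproducible
    selected = {}
    for category, wanted in quotas.items():
        pool = by_category.get(category, [])
        if len(pool) < wanted:
            # Surface coverage gaps instead of silently under-sampling.
            raise ValueError(
                f"category {category!r} has {len(pool)} tasks, need {wanted}"
            )
        selected[category] = sorted(rng.sample(pool, wanted))
    return selected


# Illustrative pool: common scenarios dominate the raw data.
tasks = (
    [(f"ordinary-{i}", "ordinary_driving") for i in range(8)]
    + [(f"yellow-{i}", "yellow_light") for i in range(2)]
    + [(f"ped-{i}", "pedestrian") for i in range(2)]
)
# Rare but important categories get the same weight as the common one.
quotas = {"ordinary_driving": 2, "yellow_light": 2, "pedestrian": 2}
picked = sample_by_taxonomy(tasks, quotas)
```

Raising an error when a category is short is a deliberate choice here: it exposes a gap in the taxonomy's coverage rather than letting the benchmark drift back toward whatever data happened to accumulate.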
Headroom matters because saturated benchmarks stop directing research
The third scientific axis was difficulty and model headroom. Vincent Chen argued that a useful frontier benchmark should be unsaturated: it should expose real soft spots in current systems and separate models at the frontier.
ARC-AGI was his central example. Chen credited the ARC Prize Foundation team and described ARC-AGI-2 as a benchmark that remained unsaturated for a long period. Its design targeted a gap between human capabilities and model capabilities: tasks humans could solve, but models could not yet handle well. When the recent reasoning push produced a major leap in model performance, Chen said that leap corresponded to a real capability gain. In his account, ARC-AGI had successfully measured a kind of efficiency or reasoning capability that models were missing.
ARC-AGI-3 had launched shortly before the talk, and Chen emphasized the same design feature at a new frontier: at launch, frontier models were under 1%, while every task was human-solvable “up to some degree.”
The benchmark’s value, in Chen’s account, comes from the size and clarity of the human-AI gap. It gives the field something to climb toward without already being exhausted by current systems. That is different from a benchmark that mostly confirms incremental differences among models on tasks they can already solve.
Headroom also creates anticipation. Chen said people now wait to see how new models perform on ARC-style evaluations. That kind of attention indicates that a benchmark has located an open capability frontier.
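Headroom can be treated as a simple quantity: the gap between a human baseline and the best current model score. The sketch below is a minimal illustration; the function names, the saturation threshold, and the example numbers are hypothetical and are not reported results for any benchmark.

```python
def headroom(human_baseline: float, model_scores: dict[str, float]) -> float:
    """Gap between the human baseline and the best model, on a 0-1 scale."""
    if not model_scores:
        raise ValueError("need at least one model score")
    return human_baseline - max(model_scores.values())


def is_saturated(
    human_baseline: float,
    model_scores: dict[str, float],
    min_headroom: float = 0.05,  # assumed cutoff; choose per benchmark
) -> bool:
    """Flag a benchmark whose best model is within min_headroom of humans."""
    return headroom(human_baseline, model_scores) < min_headroom


# Illustrative numbers only.
open_frontier = {"model_a": 0.25, "model_b": 0.5}
nearly_solved = {"model_a": 0.93, "model_b": 0.91}

print(headroom(1.0, open_frontier))
print(is_saturated(0.95, nearly_solved))
```

A single cutoff is a crude proxy. It says nothing about whether the remaining gap reflects a real capability difference or label noise, which is why headroom works alongside task validation instead of replacing it.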
Accuracy is too thin for agents operating under constraints
The fourth scientific requirement was robust evaluation methodology. Vincent Chen argued that benchmarks should go beyond accuracy when the real-world capability being tested demands more. Relevant dimensions may include cost, latency, reasoning-trace quality, intermediate tool use, policy adherence, or whatever reward and supervision signals matter for the task.
TAU-bench matters here because it evaluates multi-turn agents using both task completion and policy adherence. The slide’s example made the point directly: an agent that books the right flight but violates fare-class rules still fails.
That distinction is central to agent evaluation. In many production settings, finishing the task is not enough. The system must also respect constraints: policy, permissions, budgets, latency requirements, user preferences, safety rules, and domain-specific standards. A benchmark that scores only final-answer correctness can miss the difference between a useful agent and an unacceptable one.
Chen framed methodology as a question of whether a benchmark measures what it claims to measure. For agents, that often means capturing the process as well as the endpoint: whether the agent used tools correctly, complied with policy, and operated within the practical constraints that define the work.
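The difference between scoring the endpoint and scoring the endpoint plus constraints is easy to show in code. The sketch below is not TAU-bench's API or scoring implementation; `EpisodeResult`, its fields, and the violation label are hypothetical names used only to illustrate the conjunction of task completion and policy adherence.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeResult:
    """Hypothetical summary of one agent episode."""
    task_completed: bool
    policy_violations: list[str] = field(default_factory=list)


def episode_passes(result: EpisodeResult) -> bool:
    """An episode passes only if the task is done AND no policy was violated."""
    return result.task_completed and not result.policy_violations


def pass_rate(results: list[EpisodeResult]) -> float:
    """Fraction of episodes that pass under the combined criterion."""
    if not results:
        raise ValueError("no episodes to score")
    return sum(episode_passes(r) for r in results) / len(results)


# Illustrative episodes only.
results = [
    EpisodeResult(task_completed=True),                       # clean success
    EpisodeResult(task_completed=True,
                  policy_violations=["fare_class_rule"]),     # right flight, broken rule
    EpisodeResult(task_completed=False),                      # task not finished
]
completion_only = sum(r.task_completed for r in results) / len(results)
combined = pass_rate(results)
```

On these three episodes, a completion-only metric counts two successes while the combined criterion counts one. That gap is the behavior a final-answer score would hide.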
The strongest benchmarks are bets about where the frontier is going
The “art” side of the framework begins where empirical hygiene ends. Good task validation, taxonomy, headroom, and methodology make a benchmark useful. But benchmarks that shape the field also carry a thesis.
Vincent Chen used Terminal-Bench as the example: a bet that the command-line interface would become a central interface for general-purpose agents, not only coding agents. The slide quoted Mike Merrill, co-creator of Terminal-Bench: “Our big bet is that there’s a future in which 95% of LLM-computer interaction is through a terminal-like interface.”
Chen argued that this bet has turned out to be largely correct, and consequential. He pointed to teams around Claude and Codex building capabilities on top of coding and CLI-based tools as evidence that the CLI is becoming an important abstraction for agents interacting with computers. In his view, Terminal-Bench anticipated an interface and affordance that now appears central to general-purpose computer use by agents.
“The most ambitious benchmarks are really a statement about where the world is going.”
That is why Chen sees thesis-driven benchmarks as more than measurement artifacts. They can accelerate a mode of development by making a capability visible, measurable, and worth optimizing. A benchmark can say: this interface, this workflow, this class of environment, or this level of autonomy is going to matter. If the bet is right, the benchmark becomes part of how the field organizes its effort.
A benchmark can become a research roadmap
A strong benchmark does not only produce a leaderboard; it inspires new methods, new benchmark variants, and new research agendas. That was Vincent Chen’s second artistic differentiator: the ability to create a roadmap for the field.
SWE-bench was the example. Chen described the core idea as simple in the way strong ideas often are, loosely characterizing it as a way to leverage existing coding capabilities and coding tasks. The slide showed a lineage from SWE-bench to Lite, Verified, Multilingual, and Multimodal variants, and Chen indicated that the family extends further still.
The significance, for Chen, is not only that SWE-bench measured coding agents. It changed how researchers thought about coding-agent evaluation and created a foundation on which others could extend. He argued that the SWE-bench team’s work continues to shape the space, and that there is room for further innovation as software development itself changes. He specifically raised “vibe coding” and higher layers of abstraction in software work as areas where future benchmark design may need to adapt.
The broader point is that a benchmark can function like infrastructure for a research program. It can make a problem concrete enough that many groups can attack it, compare methods, identify limitations, and extend the task family.
Researcher UX determines whether a benchmark gets used
Vincent Chen called researcher UX “severely underrated.” Benchmarks have users: model builders, agent builders, researchers, and teams that want to evaluate or improve systems. If those users cannot easily run models against the benchmark, contribute new tasks, extend the setup, or reuse signals for reinforcement learning or post-training, adoption suffers.
He treated this as a product principle applied to research infrastructure. A benchmark needs interfaces, harnesses, documentation, and workflows that make it easy for the community to engage. It should be simple to run agents against the benchmark, simple to extend the task set, and ideally simple to extract useful supervision or reward signals after evaluation.
Chen cited HELM from Stanford CRFM as an early example of a standardized, modular harness for reproducible evaluations across scenarios and models. He also cited Harbor, shipped with Terminal-Bench 2.0, as tooling that, in Chen’s words and on the slide, has become “de facto” evaluation infrastructure for teams building agents.
The practical point is straightforward. A benchmark may be conceptually excellent but fail to influence the field if it is hard to run, brittle, opaque, or expensive to adapt. Good researcher UX lowers the cost of participation and makes hill-climbing against the benchmark more likely.
The next benchmarks need richer environments, longer autonomy, and more complex outputs
Chen closed with Snorkel’s view of where new benchmark work is needed. He proposed three axes for the next generation: environment complexity, autonomy horizon, and output complexity. The slide rendered them as a three-dimensional map: environment complexity asks how dynamic the operating environment is; autonomy horizon asks how independently the agent can operate; output complexity asks how sophisticated the deliverable is.
| Axis | Question | What Chen emphasized |
|---|---|---|
| Environment complexity | How dynamic is the operating environment? | Domain specificity, noisy context, multi-modality, complex tools, and human or multi-agent coordination |
| Autonomy horizon | How independently can the agent operate? | Long-horizon scope, world modeling, and non-stationary goals |
| Output complexity | How sophisticated is the deliverable? | Multi-artifact work products, nuanced rubrics and rewards, and trustworthy outputs |
Environment complexity asks how dynamic and realistic the operating environment is. Vincent Chen listed several subdimensions: domain specificity, context complexity, multi-modality, tool complexity, and human or multi-agent coordination. In coding, for example, a real codebase includes organization-specific policies, Slack context, screenshots, flaky toolchains, distributed CI, human reviewers with preferences in their heads, and many contributors working in parallel. Current benchmarks, he said, capture only a fraction of that complexity.
The same pattern applies outside coding. Real work includes unwritten knowledge, specialized standards, noisy documents, multi-persona input, unstructured information, images, video, audio, spatial and temporal dynamics, tool permissions, rate limits, ambiguous documentation, chained tool use, handoffs, and multi-turn coordination. Chen’s view is that agent failures often occur in this gap between clean benchmark worlds and the messier environments where professionals actually work.
Autonomy horizon asks how independently an agent can operate before reliability breaks down. Chen described long-horizon scope as reliable operation over hundreds or thousands of steps, and highlighted world modeling and non-stationary goals as important pieces. A customer-experience agent, for instance, may need to preserve context from weeks earlier, adapt when integrations or product specs change, and respond when reorganizations shift priorities midstream. Real environments change state; requirements evolve; goals are not stationary. Chen argued that benchmarks should better represent those long-running, continually changing settings.
Output complexity asks how sophisticated the deliverable is. Chen argued that outputs are expanding beyond single answers into multi-artifact work products. A software deliverable or strategic recommendation is not easily scored as simply right or wrong. A good proposal may need correctness, clarity, depth, usability, organizational context, and good human judgment. Tomorrow’s benchmarks, in his view, need nuanced rubrics and rewards that can evaluate such work and potentially provide training signals.
He also tied output complexity to trustworthiness. Agents should be able to calibrate for risk, surface uncertainty honestly, and know when to stop or escalate. That matters especially when the output is not a plain-text answer but a decision-support artifact, roadmap, report, implementation, or other work product that humans may act on.
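A rubric for work products like these can be sketched as a weighted average over named criteria. The example below is an assumption-laden illustration: the criterion names, the weights, and the `rubric_score` helper are hypothetical, and ratings are assumed to come from a human or model grader on a 0-1 scale.

```python
def rubric_score(ratings: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion ratings, each expected in [0, 1]."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= ratings[name] <= 1.0:
            raise ValueError(f"rating for {name!r} must be in [0, 1]")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[name] * ratings[name] for name in weights) / total_weight


# Hypothetical rubric for a written proposal.
weights = {
    "correctness": 4,
    "clarity": 2,
    "usability": 2,
    "uncertainty_surfaced": 2,  # did the agent flag what it was unsure about?
}
# Illustrative grader output for one deliverable.
ratings = {
    "correctness": 1.0,
    "clarity": 0.5,
    "usability": 0.5,
    "uncertainty_surfaced": 0.0,
}
score = rubric_score(ratings, weights)
```

Keeping the per-criterion ratings, not just the aggregate, preserves the signal that could later serve as a reward or supervision target. Here the deliverable is correct but says nothing about its own uncertainty, and a single right-or-wrong label would not show that.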
Together, the three axes are a call for benchmarks that more closely resemble the work agents are being asked to do: in dynamic environments, over longer horizons, producing richer deliverables under constraints.


