Agent Benchmarks Are Measuring Harnesses as Much as Models

Nicholas Kang, Michael Aaron | AI Engineer | Monday, May 25, 2026 | 11 min read

Nicholas Kang and Michael Aaron of Google DeepMind’s Kaggle team argue that AI evaluation is failing less because of a shortage of benchmarks than because benchmark results are hard to reproduce, easy to distort through hidden harness choices, and shaped by too narrow a group of authors. Their case is that agentic evals need shared infrastructure: transparent execution, community-created tests, model-versus-model arenas, and low-friction exams for builders who are not research labs. The recurring example is a wastewater treatment engineer in Turkey whose field experience produced a safety benchmark no lab was likely to create on its own.

The hard part of agent evaluation is no longer just the model

Nicholas Kang framed Kaggle’s work on agentic evaluation around a blunt premise: today’s AI evals are “kinda broken.” The problem, as he described it, is not that benchmarks do not exist. It is that they are scattered, difficult to verify, quick to go stale, and too often built by a narrow set of people for a narrow set of capabilities.

Kang’s first complaint was operational. New AI benchmarks appear constantly across GitHub repositories, arXiv papers, and lab releases. Keeping up with them can become “a full-time job,” and even he said he cannot reliably do it despite working on benchmarks professionally. Once a benchmark is published, the leaderboard in the paper often stops being maintained. The authors move on, the models change, and the result becomes less useful as a current signal.

His second complaint was about transparency. Labs frequently publish comparison charts when releasing models, but the charts often show only the final numbers. Kang argued that the missing details matter: how the benchmark was set up, what configurations were used, how the model was called, how the run was orchestrated, and what the benchmark was actually testing.

He gave one concrete example from Kaggle’s own experience. Kaggle had published a benchmark with one AI lab. A competing lab objected to the results, reran the benchmark itself, and published much better numbers. According to Kang, the competing lab had optimized the setup for its own model by using compaction provided through its API, while Kaggle had not used that compaction across all models in its original run. His point was not that the second run was necessarily fabricated. It was that results can diverge meaningfully depending on choices that are often invisible to the reader.
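One way to make such choices visible is to publish a small run manifest next to each score. The sketch below is illustrative only: the class, its field names (`scaffold`, `context_compaction`, and so on), and the example values are assumptions made for this example, not a schema used by Kaggle or any lab.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of the settings behind one benchmark score."""
    benchmark: str
    model_id: str
    scaffold: str              # harness or agent loop wrapped around the model
    context_compaction: bool   # whether provider-side compaction was enabled
    temperature: float
    max_turns: int

    def fingerprint(self) -> str:
        # Hash a canonical JSON form so identical settings give identical IDs.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def config_diff(a: RunManifest, b: RunManifest) -> dict:
    """Return the fields on which two runs differ, as {field: (a, b)}."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


# Placeholder values: two runs that differ only in the compaction setting.
original = RunManifest("example-bench", "model-x", "baseline-loop", False, 0.0, 50)
rerun = RunManifest("example-bench", "model-x", "baseline-loop", True, 0.0, 50)
print(config_diff(original, rerun))  # {'context_compaction': (False, True)}
```

With a record like this attached to each result, a reader comparing two published numbers could at least see that the runs were not configured the same way.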

That ambiguity becomes sharper when benchmarks evaluate agents rather than simple model calls. Michael Aaron later pointed to a claim from a Morpheus LLM post or paper about SWE-bench Pro: six frontier models were reportedly within a couple of percentage points of one another, while the harness used around the model could produce a 22 percentage-point difference. Aaron said he had not specifically verified the blog post, but the claim “doesn’t seem unlikely” to him. The implication was central to the talk: in agentic benchmarks, the unit under test is often unclear. A result may be measuring the model, the harness, the scaffold, the agent, or some combination.

22 points

reported percentage-point difference on a coding benchmark depending on scaffold or harness

The final slide visible in the source showed the same issue in benchmark-process terms. It contrasted a SWE-lite style flow — input dataset, issue text and repository, test environment, agent predictions, patch application, pass/fail evaluation — with a “Native QA” benchmark process. The slide also displayed an example comparison: “Echo Model (Different Scaffold) 43%” versus “Base SWE-agent 21%,” alongside the source URL shown on the slide, tengyu.ai/best-ai-model-for-coding, dated March 18, 2024. The talk did not independently validate that external result; it used it to illustrate the ambiguity Aaron was describing.

Benchmarks reflect who is allowed to create them

Kang’s broadest claim was that evaluation infrastructure shapes what AI systems improve at. If a capability is not evaluated, teams cannot “hill climb” on it; they cannot know whether models are getting better. The current benchmark ecosystem, in his view, overrepresents the knowledge of AI researchers and technical professionals and underrepresents the long tail of specialized human work.

A slide placed “AI researchers” as a small circle inside “technical professionals,” itself inside “all of the world and its knowledge.” Kang said Google AI search had told him there are “something like 30,000 AI researchers,” and he contrasted that with roughly 30 million software engineers, data scientists, and other technical workers. The exact numbers were not the substance of the argument. His point was that AI is expected to affect most of humanity, while the people defining the tests remain a tiny subset.

That imbalance, he argued, will worsen the “cognitive edges” or jaggedness already visible in models: superhuman performance in some areas, mediocre performance in others. For Kang, that is not an equitable version of AI development.

The example he returned to was a Kaggle benchmark created by a wastewater treatment plant engineer in Turkey. The benchmark, shown on a Kaggle page as “WWTP Engineering Benchmark,” evaluated LLMs on real-world wastewater treatment plant engineering tasks: material selection, root-cause analysis, safety protocols, and process-chart thinking. The page said it was based on “20+ years of field experience” and knowledge “NOT found in standard training data.”

Kang said the benchmark creator had worked as a wastewater plant engineer for 20 years and had described severe incidents in his country where safety protocols were not followed and people died. The engineer built the benchmark because he wanted to evaluate how AI might help in his job and prevent similar incidents. Kang emphasized that the resulting dataset was proprietary and novel in the practical sense: it came from the creator’s field experience, did not exist elsewhere on the web, and was not the kind of economically obvious benchmark an AI lab would necessarily prioritize.

This was the clearest example of Kaggle’s community-level thesis. If evaluation is limited to what AI researchers choose to test, whole domains of consequential human expertise remain invisible to model progress. Kaggle’s proposed answer is not only to host leaderboards, but to make benchmark creation available to people with domain knowledge who would not normally be part of AI research.

Kaggle is trying to turn evals into shared infrastructure

Kaggle’s response spans four products or experiments: hackathons, standardized agent exams, Game Arena, and the Benchmarks platform. Kang was careful not to present Kaggle as having solved the problem. He described the work as an attempt to build infrastructure that makes evaluation more transparent, accessible, and participatory.

Kaggle’s institutional claim rests partly on scale. The source showed Kaggle describing itself as the world’s largest online community of AI and machine-learning practitioners, researchers, and enthusiasts, with more than 30 million registered members. The same slide listed more than 500 featured competitions and hackathons, more than 5,000 AI/ML models, 1.4 million public notebooks, and 470,000 public datasets.

Kaggle asset	Scale shown in the source
Registered members	30M+
Featured competitions and hackathons	500+
AI/ML models	5K+
Public notebooks	1.4M+
Public datasets	470K+

Kaggle’s slide positioned its existing community and artifacts as the base for broader evaluation infrastructure.

Hackathons are the most familiar mechanism. Kang described them as a way to channel community energy and expertise toward a defined problem, with guardrails tight enough to keep participants aligned and open enough to let creativity surface. He pointed to a then-running hackathon with the Google DeepMind AGI team, following a DeepMind paper on measuring cognitive faculties of AGI. The hackathon focused on five of ten faculties and asked participants to build benchmarks in those areas.

The practical obstacles are not trivial. Kang said producing a clear problem statement and evaluation rubric is hard because thousands of participants may interpret instructions differently. The platform also has to provide tools: hosting datasets, giving access to AI models, and enabling participants to explain and share their work in a way others can understand and build on. He noted that some participants may not be able to pay for API access to multiple state-of-the-art models, so access itself becomes part of the evaluation-infrastructure problem.

Judging remains human-heavy. Kaggle’s slide described a three-round judging process: initial QA and eligibility screening, first human review, and final human review. Kang said AI agents may be good at many things, but they are not yet good at judging innovation and creativity. Even human expert alignment is difficult and has to be coordinated.

Standardized Agent Exams target the builders who are not running evals

Kang’s second product example was Standardized Agent Exams, or SAE. He said he had originally called them “standardized agent tests,” but changed the name because of a trademark issue. The mechanism is deliberately simple: a builder pastes a one-line prompt into their agent, the agent takes an exam on Kaggle, and the platform returns a score on a leaderboard.

The motivation is a gap between professional evaluation tooling and consumer-agent deployment. Kang described one end of the spectrum as research labs and enterprises using sophisticated evaluation systems. On the other end are open-source and consumer agents being deployed without much testing. He called that a “huge, huge problem,” especially as agents are sent into real-world contexts such as inboxes, shopping accounts, and other personal workflows.

Safety-focused exams were one possible extension raised during the week of the talk. Kang suggested a builder might run a quick baseline before letting an agent operate in the world. The point was not to replace deep production evaluation, but to give more people a low-friction test they might actually use.

The early response suggested demand. Kaggle had launched the experimental MVP the previous week, and Kang said more than 500 agents had already been evaluated without much promotion. A slide showed a spike in daily traffic tied to profile views, leaderboard views, and SAE fetches. Kang also described spin-off posts on “MoaBook,” including people sharing agents’ exam results and even an SAE prep course.

500+

agents evaluated on Kaggle’s Standardized Agent Exams in the first week, according to Kang

The product design problem is calibration. If the exam is too difficult, agents may not finish or runs may take too long. If it is too easy, the score will not provide useful signal. Kang framed this as especially hard because consumer agents have a wide range of capabilities, yet the exam needs to remain accessible enough for broad participation.

Game Arena tries to avoid benchmark saturation by making models play each other

Aaron described Kaggle Game Arena as a benchmarking platform built around model-versus-model games. Its core advantage, in his telling, is that PvP competition resists saturation. A static benchmark can quickly become maxed out; in a game setting, one model can always compete against another, and the leaderboard can continue to move. Saturation, at most, means one model remains the best for a while.

Game selection is meant to isolate different model capabilities. Aaron mentioned Werewolf for deception, poker for randomness and deception, and chess because, as he put it, when analyzing ML systems “you have to be analyzing Chess.” Poker had already surfaced model-specific behaviors. Grok, Aaron said, “loves to go all in,” while other models are more conservative. Some newer-generation models were worse at poker because they were more risk-averse. He described these as “personalities” beginning to emerge over time.

The Game Arena pipeline starts with game design and iteration. Kaggle chooses games, checks whether models can play them better than random, iterates on prompts to keep them fair, and builds a harness. Aaron said much of this work is open source and pointed to the GitHub link shown on the slide: github.com/kaggle/kaggle-environments. Many of the games so far used OpenSpiel, which he described as an RL framework.

Simulations run on Kaggle’s simulation platform, originally built for reinforcement-learning competitions before LLMs became central. Kaggle uses an LLM proxy, which Aaron said is available on Colab, to communicate consistently with the models being tested. Game scheduling uses a Bradley-Terry pairwise algorithm to reduce the number of matches needed. After runs complete, Kaggle publishes the results as datasets containing the LLM conversations, benchmark leaderboards with Elo scores and confidence intervals, and visualizers that let users inspect gameplay and reasoning logs.
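For readers unfamiliar with the Bradley-Terry model, a minimal sketch follows. It fits strengths from a table of pairwise win counts using the textbook iterative update; it is not Kaggle's implementation, it does not cover the scheduling step, and the win counts are invented for illustration. It also assumes every model has won at least one game, since the fit is not well defined otherwise.

```python
def fit_bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths.

    wins[i][j] is the number of games model i won against model j.
    Returns a list of strengths that sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iterations):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Games played against each opponent, weighted by current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [value / scale for value in new_p]
    return p


def win_probability(p, i, j):
    """Modelled probability that model i beats model j."""
    return p[i] / (p[i] + p[j])


# Invented results for three models, ten games per pairing.
example_wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(example_wins)
print([round(s, 3) for s in strengths])
```

Because the fitted strengths predict the outcome of pairings that have not been played much, a scheduler can in principle spend its budget on the matchups whose result is least certain instead of running every pair equally often.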

The engineering constraint is cost. Full pairwise competition scales badly, and statistical significance can require a large number of games. Aaron said poker required about 400,000 hands, each with multiple turns. He asked for ideas on how to achieve statistical confidence without running millions of games.
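A rough back-of-the-envelope calculation shows why the totals climb so quickly. The sketch below assumes independent games with a simple win-or-lose outcome and uses the normal approximation for a proportion; the six-model field is a hypothetical example. Poker, where chip swings vary widely from hand to hand, would typically need more, so this is not an account of how the 400,000 figure was reached.

```python
import math


def games_for_margin(margin, p=0.5, z=1.96):
    """Games needed so a win rate near p is known to +/- margin (95% by default)."""
    raw = z * z * p * (1 - p) / (margin * margin)
    # Round first so floating-point noise cannot push an exact value up by one.
    return math.ceil(round(raw, 6))


def pair_count(num_models):
    """Distinct pairings in a full round robin."""
    return num_models * (num_models - 1) // 2


per_pair = games_for_margin(0.01)   # 9604 games for +/- 1 percentage point
total = pair_count(6) * per_pair    # 15 pairings -> 144060 games
print(per_pair, total)
```

Halving the margin quadruples the games per pairing, and the number of pairings grows roughly with the square of the number of models, which is the scaling problem Aaron described.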

400,000

poker hands Aaron said Kaggle ran to get statistical significance in Game Arena

Community participation is another unresolved question. Watching LLMs play each other can be entertaining for a while, but Aaron said it may become repetitive. One possibility he raised was to let community members compete by writing prompts for models in a hackathon-style format, then measure whose prompts help a model climb a game leaderboard.

The third difficulty is longitudinal comparison. Old models disappear, new ones appear, and endpoints may not always make it clear what model is actually running behind the scenes. That creates the same reproducibility problem Kaggle is trying to address elsewhere: a leaderboard is only as meaningful as the stability and transparency of what it is measuring.

The Benchmarks platform is for public evals, not production monitoring

Aaron distinguished Kaggle Benchmarks from production evaluation platforms. It is not intended as the place to run evals for production code; he said other tools exist for that. Kaggle’s Benchmarks platform is aimed at community involvement: anyone can build, run, and share evaluations in an open and verifiable way.

The platform structure resembles production eval systems. A user writes assertions, such as whether an output contains a required element or whether a task succeeded. Kaggle also supports LLM judging. Assertions are grouped into tasks, tasks are evaluated against a collection of models selected by the user, and tasks are then aggregated into a benchmark. The wastewater treatment benchmark was one example.
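The hierarchy described here (assertions inside tasks, tasks inside a benchmark, all run against a chosen set of models) can be sketched in a few lines. This is a hypothetical illustration, not the Kaggle Benchmarks API: the `Task` and `Benchmark` classes, the scoring rule (fraction of assertions passed, averaged over tasks), and the stub models are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Assertion = Callable[[str], bool]   # takes model output, returns pass or fail
Model = Callable[[str], str]        # takes a prompt, returns output text


@dataclass
class Task:
    name: str
    prompt: str
    assertions: List[Assertion]

    def score(self, model: Model) -> float:
        """Fraction of this task's assertions that the model output passes."""
        output = model(self.prompt)
        passed = sum(1 for check in self.assertions if check(output))
        return passed / len(self.assertions)


@dataclass
class Benchmark:
    name: str
    tasks: List[Task]

    def run(self, models: Dict[str, Model]) -> Dict[str, float]:
        """Average task score for each model, keyed by model label."""
        return {
            label: sum(task.score(model) for task in self.tasks) / len(self.tasks)
            for label, model in models.items()
        }


# Stub "models" standing in for real API calls.
stub_models = {
    "stub-a": lambda prompt: "<svg><text>hello</text></svg>",
    "stub-b": lambda prompt: "There is no text in the image.",
}
svg_task = Task(
    name="comic-to-svg",
    prompt="Convert the comic to raw, unstyled SVG.",
    assertions=[lambda out: "<svg" in out, lambda out: "</svg>" in out],
)
results = Benchmark("demo", [svg_task]).run(stub_models)
print(results)  # {'stub-a': 1.0, 'stub-b': 0.0}
```

An LLM judge would fit the same shape: it is an assertion whose pass or fail decision comes from another model call instead of a string check.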

Aaron showed a task created by Paige, who had presented earlier in the same room. The task evaluated whether an LLM could convert an image of an xkcd comic into raw, unstyled SVG. The interface allowed side-by-side comparison of model outputs. Paige had created assertions such as whether the model generated SVG at all and whether it included the correct text. One visible response from Gemini 1.5 Pro said: “Based on the image provided, there is no text. The image contains a stick figure sitting at a desk with a laptop.” Aaron contrasted that with an older Claude model that produced a more complete reproduction.

The benchmark-building challenge starts with motivation. Aaron said most people do not set out to build benchmarks and may not know their value. Kaggle has hackathons, points, and medals to help with inspiration and incentives, but writing a good evaluation still takes work. That makes the community platform dependent not only on tooling, but on persuading people that their domain expertise is worth turning into a test.

Agentic benchmarks add further complexity. Execution, grading, result validation, and traces all become harder once an agent can take actions through a scaffold. Aaron’s SWE-bench Pro example returned here: when the harness moves performance by more than the model spread, the benchmark has to say what it is evaluating. “Are you testing the harness, are you checking the model?” he asked. Kaggle’s platform has to handle that ambiguity rather than hide it.
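One way to surface that ambiguity is to report scores as a model-by-harness grid and compare the two spreads directly. The sketch below is a hypothetical illustration: the function, the labels, and the numbers are invented for this example and are not results from SWE-bench Pro or any other benchmark.

```python
def spreads(scores):
    """Compare how much scores move across models versus across harnesses.

    scores[model][harness] is a pass rate in percent.
    Returns (largest gap between models under one harness,
             largest gap between harnesses for one model).
    """
    models = list(scores)
    harnesses = list(next(iter(scores.values())))
    model_spread = max(
        max(scores[m][h] for m in models) - min(scores[m][h] for m in models)
        for h in harnesses
    )
    harness_spread = max(
        max(scores[m].values()) - min(scores[m].values()) for m in models
    )
    return model_spread, harness_spread


# Invented pass rates, chosen only to show the shape of the comparison.
made_up_scores = {
    "model-a": {"harness-1": 40, "harness-2": 58},
    "model-b": {"harness-1": 42, "harness-2": 61},
    "model-c": {"harness-1": 41, "harness-2": 60},
}
print(spreads(made_up_scores))  # (3, 19)
```

When the second number dwarfs the first, as in this made-up grid, a single leaderboard figure says more about the harness than about the model.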

Fast model release and deprecation cycles compound the issue. Even if a benchmark is well designed, reproducing results over a long time horizon is hard when model endpoints change, older models are removed, and routing behavior is uncertain. This is not a side concern; for Aaron, it is one of the main reasons public benchmark claims can decay.
