Agent Benchmarks Are Measuring Harnesses as Much as Models
Nicholas Kang and Michael Aaron of Google DeepMind’s Kaggle team argue that AI evaluation is failing less because of a shortage of benchmarks than because benchmark results are hard to reproduce, easy to distort through hidden harness choices, and shaped by too narrow a group of authors. Their case is that agentic evals need shared infrastructure: transparent execution, community-created tests, model-versus-model arenas, and low-friction exams for builders who are not research labs. The recurring example is a wastewater treatment engineer in Turkey whose field experience produced a safety benchmark no lab was likely to create on its own.
AI Engineer · May 25, 2026 · 11 min read