Nicholas Kang

Nicholas Kang is a product manager at Google DeepMind working on Kaggle Benchmarks, a platform for AI model evaluation and community-built benchmarks. He has publicly written and spoken about scalable, reproducible evaluation for AI systems and agentic workflows.

Agent Benchmarks Are Measuring Harnesses as Much as Models

Nicholas Kang and Michael Aaron of Google DeepMind’s Kaggle team argue that AI evaluation is failing less because of a shortage of benchmarks than because benchmark results are hard to reproduce, easy to distort through hidden harness choices, and shaped by too narrow a group of authors. Their case is that agentic evals need shared infrastructure: transparent execution, community-created tests, model-versus-model arenas, and low-friction exams for builders who are not research labs. The recurring example is a wastewater treatment engineer in Turkey whose field experience produced a safety benchmark no lab was likely to create on its own.

AI Engineer · May 25, 2026 · 11 min read