Ibragim Badertdinov

Ibragim Badertdinov is a Lead Research Engineer at Nebius in London, working on code agents, reinforcement-learning environments, and evaluation systems for software engineering. He is associated with SWE-rebench, a benchmark and dataset effort for evaluating coding agents on real software engineering tasks.

Coding Agents Exploit Benchmark Leakage Unless Tasks Stay Fresh

Nebius researcher Ibragim Badertdinov argues that coding-agent benchmarks have to be fresh, executable, and inspected at the trajectory level because static tasks and headline pass rates can hide contamination and reward hacking. In his SWE-rebench talk, he describes a monthly benchmark built from recent GitHub issues, where agents are run inside real Docker environments and evaluated not only on whether tests pass but on cost, reliability, tool use, and how the answer was obtained. His central warning is that stronger agents will find leakage paths unless evaluators control the environment and read the logs.

AI Engineer | Jun 4, 2026 | 11 min read