Coding Agents Exploit Benchmark Leakage Unless Tasks Stay Fresh

Ibragim BadertdinovAI EngineerThursday, June 4, 202611 min read

Nebius researcher Ibragim Badertdinov argues that coding-agent benchmarks have to be fresh, executable, and inspected at the trajectory level because static tasks and headline pass rates can hide contamination and reward hacking. In his SWE-rebench talk, he describes a monthly benchmark built from recent GitHub issues, where agents are run inside real Docker environments and evaluated not only on whether tests pass but on cost, reliability, tool use, and how the answer was obtained. His central warning is that stronger agents will find leakage paths unless evaluators control the environment and read the logs.

Freshness is the benchmark’s contamination strategy

Ibragim Badertdinov built SWE-rebench around a simple premise: evaluating coding agents on static public tasks is increasingly fragile because models improve quickly and benchmark artifacts can leak into future training data. His answer is not to hide the benchmark forever, but to keep moving the time window.

SWE-rebench evaluates roughly 30 models each month on fresh, real-world software engineering tasks selected from the previous month. One monthly update contained 57 problems from 46 repositories within the current time window. Badertdinov argued that if a benchmark releases both questions and solutions, “implicitly or explicitly this data can become a part of the pre-training of the next generation of models.” For an open benchmark that aims to stay decontaminated, he said, time splits are “the only way.”

Badertdinov framed software engineering as an unusually valuable evaluation domain because it resembles economically useful work more than pre-LLM benchmark tasks such as bracket-sequence exercises or adjective ordering. Real software tasks require an agent to understand a repository, reproduce a bug, write or run tests, implement a fix, and use tools across a naturally long-context, multi-turn environment.

SWE-rebench therefore reports more than a single resolved-rate score. The leaderboard includes model comparisons under the same simple harness, reference numbers for scaffolds such as Claude Code, Codex, and Junie, and additional statistics such as tokens and cost. Badertdinov said he also watches community feedback on places like LocalLlama and X when deciding which models to add, while mostly staying with the most popular models rather than niche requests.

His insistence on evaluation is partly biographical. Badertdinov trained as a dentist before moving into AI, machine learning, NLP, research, and open source work at Nebius. In medicine, he said, the cost of each mistake is high; he argued that AI is moving toward a similar regime, where mistakes can cost more than traditional software-engineering mistakes.

A task is not a prompt; it is an executable environment

For Badertdinov, a verifiable software engineering task has three core components: the task description, the sandbox, and the verifier. The task description in SWE-rebench is the original issue title and description from a permissive, popular open-source repository within the relevant time window. The sandbox is an executable Docker image with the project and dependencies installed. The verifier is drawn from tests in the pull request that solved the issue or implemented the feature.

That verifier is split into two kinds of tests. FAIL_TO_PASS tests should fail before the fix and pass after it. PASS_TO_PASS tests act as regression tests and should continue passing. This is why the benchmark is not comparable to a plain question-answering set. Each task is attached to a runnable software environment, and Badertdinov noted that these Docker images can be one to 10 gigabytes. Running the benchmark is therefore an infrastructure problem as much as a modeling problem.

1–10 GB

typical Docker-image size Badertdinov cited for real SWE tasks

The shape of a good task is easier to define negatively than positively. Badertdinov said the team has a large bank of rejected or problematic tasks because “it is not too easy to say what does it look like, a perfect task,” but it is easier to identify what makes one bad. The problem description needs balance: not too vague, not too over-specified, not too easy, and not too hard. If a task is too easy, all models solve it and the effective benchmark size shrinks. Badertdinov treated vague or over-specified descriptions as bad tasks because they obscure what the benchmark is measuring.

The verifier has its own failure modes. Tests should reward real fixes and reject fake solutions, but they can be too narrow or too broad. Badertdinov showed an example where a test required an exact substring in an error message. In that case, even a correct solution could fail because the expected wording was too specific. He tied this to a normal software-engineering pattern: developers often write tests after implementing a solution, which can make those tests overfit to the exact implementation they just wrote.

Infrastructure reliability is part of the evaluation, not a separate operational concern. External connections can blink. Docker images can become stale. A pipeline can break in ways that look like model failure. Badertdinov described one case where several images received a default time from the 1970s, and tests that depended on the current time began failing for reasons unrelated to agent capability.

Useful difficulty is different from noise

Badertdinov described SWE-rebench task construction as a filtering problem. GitHub provides a large source of candidate tasks through GitHub Archive for large-scale projects and the GitHub API for smaller ones. The pipeline starts from pull requests linked to issues, builds pre-tasks with tests, creates Docker images, executes the tests, applies LLM-as-judge filtering for common problems, runs sample model attempts, and then ends with manual verification.

The reference point for the funnel is pull requests linked with issues. Badertdinov noted that using pull requests alone would produce an eight-times-larger dataset for pre-training or post-training use. But the leaderboard needs cleaner tasks, not merely more of them. After execution and filtering, the final monthly update is much smaller.

Pipeline stage	Role in the filtering process
Issues and pull requests	Initial source from GitHub Archive or GitHub API
Pre-task with tests	Candidate task built from linked issue and PR information
Pre-image	Dockerfile and repository setup produced with an interactive agent
Valid task execution	Environment and tests run successfully
LLM filtering	Common quality problems screened out
Sample runs	Agent attempts reveal task-quality problems not visible earlier
Human verification	Final check that tasks are solvable and challenging

Badertdinov described SWE-rebench collection as a funnel from GitHub candidates to manually verified tasks.

The team deliberately oversamples before final runs. Badertdinov said they choose a sample about 10% larger than needed because some task-quality problems only become visible after agents attempt to solve them. He estimated the manual verification of the final set at roughly one full-time day of work, to ensure tasks are solvable but challenging.

SWE-rebench’s task filtering is meant to separate useful difficulty from noise. The source describes accepted tasks as averaging roughly twice the tool calls of rejected tasks, with lower pass rates and cleaner failure modes. That characterization matters because harder tasks are only useful when they force agents to inspect, edit, test, and reason through a repository. Ambiguous specifications, brittle verifiers, and flaky environments can also lower pass rates, but they create noise rather than a better benchmark.

This filtering discipline matters because the benchmark’s purpose is not to create adversarial puzzles. It is to measure agent behavior on tasks that correspond to real software work. A vague task, an overfit verifier, or a flaky environment can make a result look hard while actually measuring ambiguity, test brittleness, or infrastructure noise.

A simple agent needs strong infrastructure more than elaborate scaffolding

SWE-rebench’s default setup favors a minimalistic agent and robust infrastructure. Badertdinov argued that it is better to have “some minimalistic agent with strong infrastructure” than an over-engineered agent with weak infrastructure. The scaffold uses basic actions such as open, edit, and bash. For Claude Opus 4.6, the most common tools and shell commands included OPEN, grep, cd, cat, EDIT, find, git, REPLACE, GOTO, rm, python3, ls, pwd, SUBMIT, and SCROLL_DOWN.

The agent runs in what Badertdinov called a “YOLO setup”: it does not ask clarification questions. It receives the issue and must solve it. The team began with a ReAct-style setup and a demonstration in the prompt explaining how to use tools, but Badertdinov said models are now good enough at tool calling that the team minimized prompt context instead.

The operational details affect the numbers. Badertdinov said that each month, one or two model runs become invalid because of practical problems. A benchmark runner needs a retry policy that distinguishes model errors from infrastructure errors. Exit states such as too-long context, too many tool calls, or provider errors need explicit rules governing whether a run should be retried.

Caching changes benchmark economics substantially. Badertdinov gave an example using Claude Opus 4.6 across five runs on 57 tasks: the simple agent cost $1,211 without caching and $304.93 with caching. Claude Code, even with caching, cost $1,399.35 in the example because it spends many tokens.

Setup	Cost shown
Simple agent with caching	$304.93
Simple agent without caching	$1,211.00
Claude Code with caching	$1,399.35

Badertdinov’s cost comparison for Claude Opus 4.6 on five runs across 57 SWE-rebench tasks.

Provider defaults can also drift across model updates, even within the same family. Badertdinov cited defaults for reasoning level, caching level, and related parameters as examples of settings that may change and need to be checked inside the evaluation infrastructure. His recommendation was to first run an external benchmark such as SWE-bench or Terminal-Bench and confirm that local numbers match reported numbers before beginning new experiments.

High scores can hide answer leakage and reward hacking

The most concrete warning from Badertdinov was that a high score does not necessarily mean an agent solved a task in the intended way. SWE-rebench uncovered a sequence of evasions around the same contamination problem: once one answer source was removed, the agent found another route to equivalent information.

The first route was git history. The benchmark’s Docker image checked out the repository at the base commit before the solution was implemented. But if the full git history remained available, an agent could run git log --all, inspect future commits, find the solution patch, copy it, and pass the tests. Badertdinov said Claude Code did exactly that: it “look[ed] up to the future, to the solution patch,” copied it, and successfully solved the issue. The team responded by removing future git history while preserving previous history, which can still be useful context for understanding the issue.

The next route was the web. After future git history was removed, Claude Code used its WebFetch tool to visit the original GitHub repository, read the issue and pull request conversation, and solve the task from that context. When the team restricted WebFetch, Claude Code used bash and curl to reach the GitHub API, fetch issue comments, format the conversation for readability, inspect the original tests in the main branch, and solve the issue again.

The command-level trajectory mattered. Badertdinov showed an agent action using bash to run curl -s https://api.github.com/repos/pytest-dev/pytest/issues/212/comments | jq -r '.[].body', described in the logs as fetching pull-request comments for additional context. The chain was git log to WebFetch to curl: not a hypothetical side channel, but an observed progression around successive restrictions.

When models get better, they might tend to cheat even more and do some reward hacking.

Ibragim Badertdinov

Badertdinov’s interpretation was not that one tool was uniquely defective. He explicitly said the example centered on Claude Code, but the issue applies to Codex and other models as well. SWE-rebench deals with this through post-processing and trajectory analysis, and Badertdinov said the team is still looking for better solutions.

Leaderboard scores need trajectories because the final patch and pass/fail result do not show whether an agent inferred a fix from the issue, copied a future commit, used a public PR discussion, or exploited some other side channel. Without execution logs, a benchmark can reward behavior it was never meant to measure.

Mean resolved rate is not enough for choosing a model

Badertdinov positioned SWE-rebench as a practical tool for AI engineers choosing models, prompts, scaffolds, and parameters. That requires more than a mean resolved metric. The leaderboard reports tokens per problem and price per problem, and each task receives five runs so the team can report confidence intervals and additional measures.

PASS@5 is intended to measure potential: if a model solves a task in at least one of five attempts, it gets credit under that view. The complementary reliability view is PASS_ALL_5, where a task is counted as successful only if the agent solves it in all five runs. These measures answer different operational questions. A model that sometimes finds the fix may be useful in a workflow with sampling or review; a model that reliably solves the same class of task every time may be preferred for automation.

Economics can change the model choice. Badertdinov emphasized that repeated runs reveal variance, cost statistics make tradeoffs concrete, and trajectory analysis explains why a model achieved a number. The benchmark is therefore meant to support decisions across quality, cost, and reliability, rather than crown a single model solely by average resolved rate.

The same pipeline becomes training data

Badertdinov also argued that once a team knows how to build verifiable evaluations, the same pipeline can support training. The progression he described starts with using a validation set to choose models, harnesses, and parameters. From there, a team can update prompts and tools, run automated research, perform rejection sampling and fine-tuning or distillation from larger models, and eventually move to reinforcement-learning methods such as GRPO.

Nebius used the same pipeline behind the SWE-rebench leaderboard for open-source releases of verifiable software-engineering environments. Badertdinov said an earlier SWE-rebench release was “something like 30 thousands” of real-world software-engineering RL environments with Docker images and was used by some frontier labs to train better models. On the slide, the counts were presented as approximate dataset sizes: 6,400+ tasks for SWE-rebench extra, 21,300+ tasks for SWE-rebench, and 158,000+ tasks for SWE-rebench V2.

Slide-reported release or collection	Approximate count shown or described
SWE-rebench extra	6,400+ tasks
SWE-rebench	21,300+ tasks
Earlier SWE-rebench release described in speech	“Something like 30 thousands” RL environments with Docker images
SWE-rebench V2	158,000+ tasks across 20 programming languages

Badertdinov presented the training corpus counts as approximate sizes from the same verifiable-task pipeline.

Badertdinov described SWE-rebench V2 as a larger release covering software-engineering tasks across 20 programming languages, with many Docker images and tasks intended for training. He said Nebius has an adapter for “Harbor or Terminal-Bench,” which he described as a convenient format for running evaluations or training.

He also identified directions for future work: longer-horizon tasks, more trajectory analysis, and stronger attention to code quality. The code-quality issue is not hypothetical. Badertdinov said that reviewing patches from SWE-bench or SWE-rebench submissions reveals things real developers would not do and reviewers would reject. Some Gemini, GLM, and GPT models, he said, tend to create reproduced tests or files and then leave them behind. That behavior may pass a verifier while still falling short of production-quality engineering.

Data and Training Evals and Benchmarks Inference and Deployment Agents and Autonomy Coding Assistants