Why AI Benchmarks Lie (and How to Read Them Honestly)

The same model scored 80.9 and 45.9 percent on the same task. Only the harness changed. A benchmark number with a model name is not a fact about the model.

Read any model announcement and you reach the same slide: a bar chart, the announcing lab's bar tallest, a row of percentages that look like measurements from a lab bench. The number feels solid. It is the kind of thing a careful decision-maker is supposed to trust over a vendor's adjectives.

It is not solid. In Anthropic's Opus 4.6 system card, GPT-5.2-Codex scored 57.5 percent on one terminal-agent harness and 64.7 percent on OpenAI's own Codex CLI harness on the same Terminal-Bench task set. Same model, same task set, about seven points apart, and the only variable was the code wrapped around the model. Claude Opus 4.5 scored 80.9 percent on SWE-bench Verified and 45.9 percent on the standardized-scaffolding SEAL variant of the same coding task set. A benchmark score with a model name attached is not a fact about the model. It is a fact about the model, the prompt, the tools, the retry budget, and how many times someone ran it. The headline drops everything except the model name, which is exactly the part that travels worst.

This is not an argument to ignore benchmarks. It is an argument to read them like a procurement contract instead of a press release. Here is what goes wrong, and how to tell.

Origin: a measuring stick that the thing being measured can grab

Benchmarks built modern AI. ImageNet gave computer vision a shared target through the 2010s. GLUE and SuperGLUE did the same for language understanding, each with a published human baseline (87.1 for GLUE, 89.8 for SuperGLUE), so progress had a finish line. The pattern worked because the test was external to the system. A model could not study for ImageNet any more than a thermometer can study for a fever.

Two things broke that. The first is speed. It took the research community roughly 18 years to reach human performance on the MNIST digit task and about six years on ImageNet. GLUE fell in about a year. By late 2019, RoBERTa and ALBERT had passed the human baseline, and SuperGLUE existed because GLUE had stopped discriminating. A benchmark that saturates in a year spends most of its short life as a test nobody can fail informatively.

The second break is structural. A large language model trains on a large slice of the public internet. Benchmarks live on the public internet. The test is no longer external to the system. It is, with high probability, inside the training data. The thing being measured can now grab the measuring stick, and once that is true, the number stops meaning what the slide implies.

Present: six reasons the number is not the measurement

Contamination: the model has seen the answer key

When test questions leak into training data, a model can score well by recall instead of reasoning. The cleanest demonstration is a time cliff. When GPT-4 launched, researcher Horace He found it solved 10 of 10 Codeforces problems from before its training cutoff and 0 of 10 from after. A NeurIPS 2024 study by Roberts and colleagues formalized it: for Project Euler problems, each unit of log GitHub presence raised the odds of a model passing, but only for problems published before the cutoff.
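The time cliff is easy to check for yourself if you have per-problem results and publication dates. Here is a minimal sketch, assuming a hypothetical `Result` record and an illustrative cutoff date; none of these names come from the studies cited above, and the data is made up to mirror the 10-of-10 versus 0-of-10 shape.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    problem_id: str
    published: date  # when the problem first appeared publicly
    passed: bool     # whether the model solved it


def time_split_pass_rates(results, cutoff):
    """Return (pass rate before cutoff, pass rate on or after cutoff).

    A side is None when it contains no problems.
    """
    before = [r.passed for r in results if r.published < cutoff]
    after = [r.passed for r in results if r.published >= cutoff]

    def rate(flags):
        return sum(flags) / len(flags) if flags else None

    return rate(before), rate(after)


# Hypothetical data: ten old problems solved, ten new problems failed.
cutoff = date(2021, 9, 1)  # assumed training cutoff, for illustration only
results = (
    [Result(f"old-{i}", date(2021, 1, 1), True) for i in range(10)]
    + [Result(f"new-{i}", date(2022, 1, 1), False) for i in range(10)]
)
pre, post = time_split_pass_rates(results, cutoff)
print(pre, post)  # 1.0 0.0
```

A large gap between the two rates does not prove contamination on its own, since newer problems may simply be harder, but it is the first thing worth ruling out.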

How much does contamination inflate a headline? It depends on the model. In May 2024, Scale AI built GSM1k, 1,205 fresh grade-school math problems matched to the difficulty of the popular GSM8K set. Some model families dropped up to 13 points on the new problems. Frontier models from OpenAI, Google, and Anthropic barely moved. Contamination is real, it is uneven, and you cannot tell from the score alone which kind of model you are looking at.

SWE-bench Verified, the coding benchmark most often quoted in 2025, became a case study. By October 2024 researchers had flagged suspiciously high scores on specific problems that also appeared in popular training corpora, and in February 2026 OpenAI said it would stop reporting Verified scores, citing contamination and scaffolding concerns.

Scaffolding: the harness scores, not the model

A modern model is not run bare. It runs inside a harness: a system prompt, a set of tools, a file browser, retry logic, a budget for how many attempts it gets. Change the harness and the score moves a lot. Analyses of the SWE-bench Pro leaderboard in 2026 found the scaffold accounting for a swing of more than 20 points, while swapping between the top frontier models moved the score by roughly a single point. Put plainly: at the frontier, the harness matters more than the model. The same write-ups documented individual models gaining tens of points from scaffolding changes with no model upgrade at all.

This makes cross-lab comparison close to meaningless when each lab reports its own harness. The 64.7 versus 57.5 split on the same model is the whole problem in one line. When a chart compares your model's number against a competitor's, ask whether both ran the same harness. They almost never did.
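One way to make the "same harness?" question routine is to refuse to compare two numbers until their metadata lines up. The sketch below is an assumption about how you might record a claim, not a standard format; the `Claim` fields, the model and harness names, and the scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    model: str
    benchmark: str
    harness: str  # the scaffold the model ran inside
    metric: str   # e.g. "pass@1", "pass@10", "pass^8"
    runs: int     # independent runs behind the reported number
    score: float


def comparability_issues(a, b):
    """List the reasons two reported scores should not be compared directly."""
    issues = []
    if a.benchmark != b.benchmark:
        issues.append("different benchmark")
    if a.harness != b.harness:
        issues.append("different harness")
    if a.metric != b.metric:
        issues.append("different metric")
    if min(a.runs, b.runs) < 2:
        issues.append("single-run score, no variance estimate")
    return issues


# Hypothetical launch-slide comparison.
ours = Claim("model-a", "Terminal-Bench", "harness-x", "pass@1", 1, 62.0)
theirs = Claim("model-b", "Terminal-Bench", "harness-y", "pass@1", 5, 60.5)
print(comparability_issues(ours, theirs))
# ['different harness', 'single-run score, no variance estimate']
```

An empty list does not make the comparison fair, it only means none of these particular mismatches is present.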

Single-run and best-of-N: a lucky number, dressed as typical

A benchmark score is a sample. Run the same model stack again and you get a different number, because the model is non-deterministic. Two reporting habits exploit this gap. The first is single-run reporting on small, high-variance test sets, like the AIME math problems where the set is only tens of questions. The second is best-of-N: run many times, report the best, and let the reader assume it is typical.
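To get a feel for how noisy a small test set is, a rough back-of-envelope helps. The sketch below assumes a 30-question set and a 70 percent score, both illustrative, and treats each question as an independent pass/fail draw, which is a simplification rather than a claim about how any lab computes uncertainty.

```python
from math import sqrt


def binomial_standard_error(accuracy, n_questions):
    """Standard error of an accuracy estimate under a simple binomial model."""
    return sqrt(accuracy * (1 - accuracy) / n_questions)


n_questions = 30  # assumed set size, for illustration
accuracy = 0.70   # assumed score, for illustration

se = binomial_standard_error(accuracy, n_questions)
print(f"one question is worth {100 / n_questions:.1f} points")  # 3.3 points
print(f"standard error is about {100 * se:.1f} points")         # 8.4 points
```

Under those assumptions a single question moves the score by more than three points, and a gap of five points between two models sits well inside one standard error.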

Watch the notation. Pass@k is the share of tasks solved in at least one of k attempts; pass^k is the harder share solved in every one of k attempts. Pass@1, pass@10 and pass^8 are three different questions, and a release that quotes pass@10 next to a rival's pass@1 is not comparing like with like. The fix is cheap and rarely done: run several times and publish a confidence interval. Epoch AI runs models up to 16 times on a benchmark and reports a standard error around the mean. Most launch posts report a single bar with no error band at all, which means you cannot tell a real lead from sampling noise.
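The difference between the two metrics is easiest to see on a single task. The sketch below uses a commonly used combinatorial estimator: given n recorded attempts on a task with c successes, it estimates the chance that a random subset of k attempts contains at least one success (pass@k) or only successes (pass^k). The function names are mine, and a benchmark-level score would average these per-task values.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k attempts succeeds) from c successes in n attempts."""
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    # 1 minus the chance that all k sampled attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Estimate P(all k attempts succeed) from c successes in n attempts."""
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    # Chance that all k sampled attempts are successes.
    return comb(c, k) / comb(n, k)


# Hypothetical task: solved in 5 of 10 recorded attempts.
n, c = 10, 5
print(pass_at_k(n, c, 1))   # 0.5
print(pass_at_k(n, c, 10))  # 1.0
print(pass_hat_k(n, c, 8))  # 0.0
```

Same model, same task, same ten attempts: 50 percent, 100 percent, or zero, depending only on which question the metric asks.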

Saturation and overfitting: the test becomes the target

Goodhart's law: when a measure becomes a target, it stops being a good measure. The moment a benchmark becomes prestigious, labs optimize toward it. Some of that is genuine progress. Some is benchmark-specific tricks that do not generalize. From the outside, both look like a higher bar.

Saturation is the visible symptom. A 2026 systematic study of 60 widely used text benchmarks documented how scores compress at the top until the test no longer separates good from great. MMLU and MMLU-Pro are functionally saturated above 88 percent for frontier models. Stanford's 2026 AI Index found the top 15 models separated by as little as 3 points per benchmark category, a margin too thin to base a decision on. When everyone scores in the high 80s, the leaderboard has stopped measuring and started ranking noise.

Frozen environments: a 2024 test of a 2026 world

Agent benchmarks ship a fixed environment: a snapshot of a website, a recorded API, a sandbox. The real world does not hold still. Web pages change, APIs deprecate, applications update. WebArena's sandbox is a 2023 artifact; an agent deployed in 2026 meets a different web. A benchmark also cannot test error recovery against a service that is genuinely down, because nothing is genuinely down inside a sandbox. The benchmark measures a world that no longer exists.

Frozen environments also leak. A Berkeley study showed how brittle these harnesses are: an automated agent hit near-perfect scores on eight major agent benchmarks by exploiting the plumbing, not the task. It read gold answers from config files, trojanized install commands, and inserted a test hook that forced every test to report as passing. The point is not that every leaderboard is gamed. It is that an environment full of exploitable seams is not measuring what its name claims.

The production gap: the benchmark is not your workload

Even a clean, uncontaminated, well-run score is a score on someone else's task, and it is almost always a single-run score. The gap between that and production is best seen in the difference between pass@1 and pass^k. On the tau-bench retail task, a GPT-4 class agent that solves about 60 percent of tasks on one attempt solves only about 25 percent when it has to succeed on all eight of eight tries. Production does not want peak capability on a good day. It wants the same task done right every time, and that is the number a single bar never shows. A SWE-bench number does not predict how an agent handles your codebase, your tools, and your users in one messy session. That is the same demo-to-production gap covered in the post on why agents fail in production, arriving one layer earlier, in the number you trusted before you built anything.
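The arithmetic behind that drop is worth a moment. If each task has its own per-attempt success probability and attempts are independent, expected pass@1 is the mean of those probabilities and expected pass^k is the mean of their k-th powers. The sketch below uses a hypothetical mix of tasks chosen only to land near the 60 and 25 percent shape above; it is not a reconstruction of the actual tau-bench data.

```python
def expected_pass_at_1(task_probs):
    """Mean per-attempt success probability across tasks."""
    return sum(task_probs) / len(task_probs)


def expected_pass_hat_k(task_probs, k):
    """Mean probability of succeeding on all k independent attempts."""
    return sum(p ** k for p in task_probs) / len(task_probs)


# Hypothetical mix: 25 tasks the agent always solves, 75 it solves under half the time.
mixed = [1.0] * 25 + [0.467] * 75
print(round(expected_pass_at_1(mixed), 2))      # 0.6
print(round(expected_pass_hat_k(mixed, 8), 2))  # 0.25

# Contrast: every task equally flaky at 0.6 per attempt.
uniform = [0.6] * 100
print(round(expected_pass_hat_k(uniform, 8), 3))  # 0.017
```

Both hypothetical agents show the same 60 percent on a single-attempt chart. Only the repeated-run number reveals which tasks can actually be relied on, which is why the single bar is the wrong thing to plan around.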

The leaderboard problem, specifically

Arena-style leaderboards, where humans vote on blind pairwise outputs, were supposed to dodge contamination because the prompts are live and user-generated. In April 2025 a paper from researchers at Cohere, Stanford, Princeton, MIT, Ai2, and others, "The Leaderboard Illusion," argued the format had its own holes. Large labs could privately test many model variants and publish only the best score, a best-of-N move at the leaderboard level; the paper reported one provider running 27 private variants on the arena in the run-up to a single release. Proprietary models also received a disproportionate share of battle data. Cohere fine-tuned a small model on arena-style data and roughly doubled its arena win rate while its MMLU score, a test of actual out-of-distribution knowledge, fell. The arena's operators disputed the framing and pointed to what they called factual errors in the paper. Treat that as an open argument, not a verdict. The lesson stands either way: any leaderboard with prizes attached will be optimized, and a leaderboard you can optimize for is no longer a clean measurement.

Future and impact: how to read a benchmark claim

The honest fix the field is moving toward is contamination-resistant and live evaluation: LiveBench and LiveCodeBench draw fresh questions published after model cutoffs, FrontierMath and SWE-bench Pro raise the difficulty ceiling, and HAIC-style proposals push evaluation toward whole workflows over time rather than one-shot tasks. None of that helps you this quarter, reading a launch slide. For that, a checklist.

Check the date and the cutoff. If the benchmark predates the model's training cutoff, assume contamination until the lab shows a clean-set result. A time-split score, old questions versus new, tells you more than the headline.

Check the harness. One number, no harness described, is not comparable to anything. Trust independent runs on a shared harness, like Epoch AI or the system-card cross-harness tables, over a lab grading its own work.

Check the run count. Look for a confidence interval or an error bar. No error bar means you cannot separate a real lead from a lucky sample. A best-of-N or a high pass@k quoted against a rival's pass@1 is not a comparison.

Check saturation. If the top models sit within a few points near a ceiling, the leaderboard is not telling you which model is better. It is telling you the test is finished.

Check the gap to your work. The benchmark is somebody else's task in a frozen world. The number that matters is the one your own evaluation produces on your own tasks, run many times, measured for consistency. Building that harness is the subject of the post on evaluating non-deterministic agents, and it is the only score that should move a budget.
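The run-count check in the list above can be sketched in a few lines once you have repeated runs of your own. This assumes each run yields one overall score, uses a rough two-standard-error rule of thumb rather than a formal test, and the scores shown are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, standard error of the mean) for repeated run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(scores), stdev(scores) / sqrt(len(scores))


def leads_clearly(a_scores, b_scores, z=2.0):
    """True if the gap in means exceeds z combined standard errors.

    A rough rule of thumb, not a substitute for a proper significance test.
    """
    mean_a, se_a = summarize_runs(a_scores)
    mean_b, se_b = summarize_runs(b_scores)
    return abs(mean_a - mean_b) > z * sqrt(se_a ** 2 + se_b ** 2)


# Hypothetical scores from four runs each on your own task set.
model_a = [71.0, 73.5, 69.5, 72.0]
model_b = [70.0, 72.5, 71.5, 69.0]

print(summarize_runs(model_a)[0])       # 71.5
print(leads_clearly(model_a, model_b))  # False
```

Model A's mean is higher, but with this much run-to-run spread the lead is not distinguishable from luck. Four runs is a small sample, so treat the answer as a prompt to run more, not as a verdict.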

A benchmark is a proxy, not a verdict. Read at this resolution it is still useful: it narrows the field, flags a real regression, sets a rough ceiling. The failure is not using benchmarks. It is reading a single bar on a launch slide as the measurement, when it is a sample from one run of one harness on a test the model may have already read. Procurement-grade skepticism, not dismissal. The number is evidence. It was never the proof.

Council summary

This post argues that a leaderboard percentage with a model name attached is not a measurement of the model. It is a measurement of the model plus its harness, its run count, and whether the test leaked into training, and the headline strips away everything except the one part that travels worst. The six failure modes are concrete and individually verifiable: contamination, scaffolding swings, single-run and best-of-N reporting, saturation, frozen environments, and the gap to your own workload. The takeaway for a technical decision-maker is a habit, not a dismissal: read a benchmark like a procurement document, run the five-point checklist on date, harness, run count, saturation and fit, and treat the only score that should move a budget as the one your own evaluation produces on your own tasks, run many times. The number is evidence; it was never the proof.