What if the benchmark is the problem?
In this post, I explain why we turned our attention away from new models and toward the benchmarks on which many of those models are evaluated.
The problem
Machine learning fields love benchmarks. We use them to compare methods quickly, rank models neatly, and tell a story of progress. But all of that depends on one quiet assumption: the benchmark itself has to be sound. If the data are messy in the wrong way, a model can look brilliant simply because it learns shortcuts.
The idea
In this project, we turned that concern into the main subject. Instead of proposing a new predictive model, we asked how trustworthy widely used biomolecular benchmarks really are. Our goal was to audit the datasets themselves, not just the algorithms evaluated on them.
How it works
In the paper, we introduce a reproducible, configuration-driven audit framework and apply it to more than fifty dataset configurations across several prominent benchmark suites. We look for exactly the kinds of problems that can quietly distort results: cross-split contamination, unresolved label conflicts, severe structural redundancy, and related forms of leakage. We also use controlled noise-injection experiments to test whether those artifacts can inflate baseline performance.
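To make the first two checks concrete, here is a minimal sketch of what a contamination and label-conflict audit can look like. The function names and data shapes are illustrative assumptions, not the paper's actual framework or API: we simply intersect identifiers across splits and flag identifiers that carry more than one distinct label.

```python
def find_contamination(train_ids, test_ids):
    """Return identifiers that appear in both splits (cross-split contamination)."""
    return set(train_ids) & set(test_ids)

def find_label_conflicts(records):
    """Given (identifier, label) pairs, return identifiers with conflicting labels."""
    labels = {}
    for identifier, label in records:
        labels.setdefault(identifier, set()).add(label)
    return {i for i, ls in labels.items() if len(ls) > 1}

# Toy example: protein P3 leaks into the test split,
# and P1 is labeled both active (1) and inactive (0).
train = ["P1", "P2", "P3"]
test = ["P3", "P4"]
print(find_contamination(train, test))  # {'P3'}

records = [("P1", 1), ("P1", 0), ("P2", 1)]
print(find_label_conflicts(records))  # {'P1'}
```

Real audits are of course harder than exact identifier matching (near-duplicate sequences and structural redundancy require similarity thresholds rather than set intersection), but even this trivial version catches a surprising amount in practice.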
What the paper showed
In short, we found that these problems are not rare. Our audit uncovered pervasive contamination, conflicts, and redundancy across several widely used datasets, with especially serious issues in some drug-target interaction benchmarks where protein overlap is extensive. The noise-injection experiments strengthened that case by showing that such artifacts can artificially improve model scores. That led us to the paper’s sharpest conclusion: some state-of-the-art claims may reflect benchmark exploitation more than genuine methodological progress.
Why it matters
I think that matters far beyond leaderboard culture. If we measure ourselves against a distorted yardstick, we can end up optimizing for the wrong thing. A model that looks strong on paper may then disappoint as soon as it meets cleaner data or a real-world task. In that sense, benchmark auditing is not a side issue. We need it if we want to build reliable scientific tools.
Limits
I do not mean that benchmarks are useless, or that every published comparison is invalid. I mean that datasets deserve the same level of scrutiny as models. I very much agree with Pat Walters here: we need to move away from bold numbers in tables.
Read the paper
Schuh, M. G.; Daniluk, A.; Sieber, S. A. ChemRxiv 2026. DOI: 10.26434/chemrxiv.15000559/v1