Measuring what matters: Construct validity in large language model benchmarks
Is there a scientific crisis in AI evaluations? We reviewed all 445 recent AI benchmarks in leading AI conferences. Very few offer rigorous evaluations of AI capabilities, making it difficult to actually track advances in the field and compare models.
Authors
Andrew Bean , Franziska S. Hafner , Lujain Ibrahim , Luc Rocher , and 39 colleagues outside SSL
Published
2025
Benchmarks are essential for tracking progress and ensuring safety in AI, but most benchmarks don’t measure what matters.
Benchmarks play a central role in how AI systems are designed, deployed, and regulated. They guide research priorities, shape competition between models, and are increasingly referenced in policy and regulatory frameworks, including the EU AI Act, which calls for risk assessments based on “appropriate technical tools and benchmarks.”
We reviewed 445 LLM benchmarks from the proceedings of top AI conferences. We found that many of these benchmarks are built on unclear definitions or weak analytical methods, making it difficult to draw reliable conclusions about AI progress, capabilities or safety. We identified:
- A lack of statistical rigour. Only 16% of reviewed benchmarks used statistical tests in their comparisons. Statistical testing is essential to reliable science: without them, many AI system “wins” could be simply due to random chance.
- Vague or contested definitions. Roughly half of benchmarks test abstract phenomena, such as “reasoning” or “harmlessness,” without providing clear, uncontested definitions of what is being measured. And only half justify if they really measure what they should measure.
- Poorly constructed datasets. 38% of benchmarks reuse existing benchmarks/exams that might not be adequate, increasing risks of contamination and memorisation.

In our work, we built a taxonomy of these failures and translated them into an operational checklist to help future benchmark authors demonstrate construct validity.
Try our interactive checklist to evaluate your own benchmarks.