Science of Evaluations

Improving how AI systems are measured and tested in academia, industry, and the public sector.

2026

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

Published in Nature Medicine

2025

Meaningful Data Access for Quantitative Algorithm Audits

Presented at ACM CHI

Gender trouble in language models: an empirical audit guided by gender performativity theory

Presented at ACM FAccT

Measuring what matters: Construct validity in large language model benchmarks

Presented at NeurIPS

A scaling law to model the effectiveness of identification techniques

Published in Nature Communications

2024

Anonymization: The imperfect science of using data while preserving privacy

Published in Science Advances