Science of Evaluations
Improving how AI systems are measured and tested in academia, industry, and the public sector.
Reliability of LLMs as medical assistants for the general public: a randomized preregistered study
Published in Nature Medicine
Gender trouble in language models: an empirical audit guided by gender performativity theory
Presented at ACM FAccT
A scaling law to model the effectiveness of identification techniques
Published in Nature Communications
2024
Anonymization: The imperfect science of using data while preserving privacy
Published in Science Advances