Andrew Bean (he/him)

Research

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

Published in Nature Medicine

Measuring what matters: Construct validity in large language model benchmarks

Presented at NeurIPS

Press

Ars Technica

Americans ask AI for health care. Hospitals think the answer is more chatbots.

The article cites our research, which warns about the risks of AI chatbots giving medical advice.

14 Apr 2026

De Standaard

Dr ChatGPT doesn’t help you any better than Dr Google, and that’s not because of the AI models’ ‘knowledge.’

New study led by Andrew warns of the risks in AI chatbots giving medical advice.

09 Feb 2026

New York Times

Health advice from AI chatbots is frequently wrong, study shows

New study led by Andrew warns of the risks in AI chatbots giving medical advice.

09 Feb 2026

Reuters

AI no better than other methods for patients seeking medical advice, study shows

New study led by Andrew warns of the risks in AI chatbots giving medical advice.

09 Feb 2026

The Register

AI chatbots are no better at medical advice than a search engine

A new study led by OII researchers warns of the risks in AI chatbots giving medical advice.

09 Feb 2026

New York Times

Frustrated by the Medical System, Patients Turn to A.I.

Chatbots are cheap, always available, superficially empathetic — and sometimes wrong. Some have concluded they’re a risk worth taking. Article references upcoming study led by Andrew.

16 Nov 2025

De Correspondent

The claims about increasingly smart AI models? More vibe than science.

Luc comments.

13 Nov 2025

NBC News

AI Revolution – NBC News discuss latest OII study exploring AI evaluation

The NBC Morning News programme discusses the findings from Andrew's latest study, which finds weaknesses in how AI systems are evaluated.

09 Nov 2025

The Register

AI benchmarks are a bad joke - and LLM makers are the ones laughing

Covers our research finding that many AI benchmarks do not measure the right things.

07 Nov 2025

Gizmodo

AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds

Covers our Measuring What Matters study on the construct validity of AI benchmarks.

06 Nov 2025

NBC News

AI’s capabilities may be exaggerated by flawed tests, according to new study

Researchers behind a new study say that the methods used to evaluate AI systems’ capabilities routinely oversell AI performance and lack scientific rigour.

06 Nov 2025

The Guardian

Experts find flaws in hundreds of tests that check AI safety and effectiveness

Scientists say almost all have weaknesses in at least one area that can ‘undermine validity of resulting claims’. Includes commentary and the latest research findings from Andrew.

04 Nov 2025

BMA The Doctor

Bot-ched advice – ‘disturbing’ results in AI study

Rebecca and Andrew comment on our study showing that LLM chatbots can perform worse when interacting with humans than when assessed using benchmarks.

10 Jul 2025

TechCrunch

People struggle to get useful health advice from chatbots, study finds

Coverage of Andrew's study showing that people using AI chatbots for medical self-diagnosis did not make better decisions than people using traditional sources.

05 May 2025
