Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

LLMs perform well on medical benchmarks but provide inaccurate and inconsistent answers for self-diagnosis. Our study with 1,298 UK participants shows that we are better off using online searches or our own judgment.

Authors

Andrew M. Bean, Rebecca Elizabeth Payne, Guy Parsons, Hannah Rose Kirk, Juan Ciro, Rafael Mosquera-Gómez, Sara Hincapié M, Aruna S. Ekanayaka, Lionel Tarassenko, Luc Rocher, Adam Mahdi

Published

2026

DOI

10.1038/s41591-025-04074-y

Despite all the hype, chatbots still make terrible doctors. We ran the largest user study of language models for medical self-diagnosis. We found that chatbots provide inaccurate and inconsistent answers, and that people are better off using online searches or their own judgment.

Working alongside a group of doctors, our team created ten medical scenarios, ranging from a common cold to a life-threatening haemorrhage (bleeding on the brain), and recruited 1,298 UK-based participants. Participants were randomly assigned either to use a chatbot (OpenAI’s GPT, Meta’s Llama, or Cohere’s Command R+) or to use a source of their choice, and were asked to make decisions about a medical scenario as though they had encountered it at home.

We found that:

Yes, models encode medical knowledge. Tested alone, models correctly identified relevant conditions in 94.9% of cases, and the right course of action (going to the hospital, calling the doctor, etc.) in 56.3%.
Still, people are better off not using chatbots. Participants who used a chatbot identified conditions in less than 34.5% of cases, and the right course of action in less than 44.2%. They were no better than the control group using more traditional tools.
Failure in communication. Chatbots often combined good and poor recommendations, which people struggled to tell apart. Model answers changed dramatically depending on the words participants used, e.g. when told someone had returned from ‘Saudi Arabia’ or that a headache developed ‘suddenly’.
Tests and benchmarks fall short. We tried hard to see if multiple-choice question-answering benchmarks, as well as increasingly used simulations with fake users, could forecast when a model fails with real human users. None of them could predict human-LLM interaction failures.

Despite strong performance on benchmarks, providing people with language models does not necessarily improve their understanding of medical information. We hope this will be an opportunity to rethink safety evaluations and regulations of AI models and chatbots.