Gender trouble in language models: an empirical audit guided by gender performativity theory

AI language models encode a flawed and binary understanding of gender, posing significant risks for transgender, nonbinary, and even cisgender individuals.

Authors

Franziska S. Hafner, Ana Valdivia, Luc Rocher

Published

2025

DOI

10.1145/3715275.3732112

AI language models are developing a flawed understanding of gender, leading to stereotypical associations that could result in harmful discrimination. In healthcare, where AI is increasingly integrated into health technologies, these flawed assumptions, which are often based on a model’s conflation of gender and biological sex characteristics, could lead to inaccurate advice and misdiagnoses.

For example, an AI model that learns a rigid association between ‘woman’ and biological markers like ‘uterus’ or ‘estrogen’ could provide irrelevant or even harmful advice to a transgender woman. This narrow view could also misinterpret the needs of cisgender women whose health profiles differ from typical reproductive assumptions, such as those who are postmenopausal or have undergone a hysterectomy.

We developed a robust framework to examine how gender is constructed in 16 AI language models. It reveals their fundamental limitations in understanding gender, often defaulting to a restrictive, biologically tied, and binary view. These limitations have broad implications for both cisgender heterosexual people and the LGBTQIA+ community.

We identified four key issues:

Language models make problematic gender–illness connections: Across the 110 illnesses we evaluated, models tend to create problematic associations when given different gender identity labels. For example, many models systematically associate physical illnesses with men and mental illnesses with trans and gender-diverse identities, and to a lesser degree with women too. Some models treat physical illnesses, such as ‘coronavirus’ or ‘parasitic worm infections,’ as unlikely for trans and gender-diverse identities. This raises concerns about ‘diagnostic overshadowing,’ where models might incorrectly flag physical health issues as mental health concerns for these individuals.
Language models encode a binary, biologically tied view of gender: Language models predominantly define gender in rigid male/female terms and directly link it to biological sex characteristics. This reflects stereotypes prevalent in internet training data, rather than the diversity of lived human experiences.
Trans and nonbinary identities are often erased or misrecognised: Language models rarely choose terms like ‘nonbinary’ or ‘transgender’ when predicting gender – they mostly choose ‘man’ or ‘woman’. Some models treat terms like ‘nonbinary’ or ‘genderqueer’ as less likely than non-human objects like ‘windscreen’, suggesting a fundamental failure to recognise these as valid human identities.
Model size amplifies bias: Contrary to some expectations, we found that larger, more powerful models often learn stronger and more rigid associations between gender and sex characteristics. This challenges the notion that simply scaling up AI will lead to more nuanced or fair outcomes. Instead, these fundamental biases risk becoming more deeply ingrained.

In our study, we evaluated associations between gendered and sexed words, as well as associations between gendered words and physical or mental illnesses. We tested 16 language models based on GPT, RoBERTa, T5, Llama, and Mistral.
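As a rough illustration of what measuring such an association can look like, the Python sketch below compares how likely each gender label is in a neutral sentence and in a sentence that mentions a condition. It is a minimal sketch under stated assumptions, not the procedure used in the study: the function names, the sentence templates, and the `toy_score` scorer are all hypothetical, and the numbers in `_TOY_SCORES` are made-up placeholders rather than outputs of any model.

```python
import math
from typing import Callable, Dict, Sequence

# Assumption: a scorer maps a full sentence to a log-probability.
# In practice this would wrap a real language model; here it is a stand-in.
SentenceScorer = Callable[[str], float]


def label_distribution(
    score: SentenceScorer, template: str, labels: Sequence[str]
) -> Dict[str, float]:
    """Turn sentence log-scores into a probability distribution over labels."""
    log_scores = [score(template.format(label=label)) for label in labels]
    top = max(log_scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return {label: w / total for label, w in zip(labels, weights)}


def association_shift(
    score: SentenceScorer,
    neutral_template: str,
    context_template: str,
    labels: Sequence[str],
) -> Dict[str, float]:
    """Log-ratio of each label's probability with vs. without the context.

    Positive values mean the context makes the label more likely.
    """
    base = label_distribution(score, neutral_template, labels)
    ctx = label_distribution(score, context_template, labels)
    return {label: math.log(ctx[label] / base[label]) for label in labels}


# Placeholder log-scores so the sketch runs end to end.
# These are NOT results from the study or from any real model.
_TOY_SCORES = {
    "The person is a woman.": -2.0,
    "The person is a man.": -2.0,
    "The person is nonbinary.": -4.0,
    "The person with condition X is a woman.": -3.0,
    "The person with condition X is a man.": -2.0,
    "The person with condition X is nonbinary.": -4.0,
}


def toy_score(sentence: str) -> float:
    """Hypothetical scorer backed by the placeholder table above."""
    return _TOY_SCORES[sentence]


LABELS = ["a woman", "a man", "nonbinary"]

shift = association_shift(
    toy_score,
    "The person is {label}.",
    "The person with condition X is {label}.",
    LABELS,
)
for label, value in shift.items():
    print(f"{label}: {value:+.3f}")
```

Normalising over a fixed set of labels keeps the comparison relative: a label's shift reflects how the context redistributes probability among the candidates, not how fluent the sentence is overall. Swapping `toy_score` for a scorer backed by an actual model is the part this sketch leaves open.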

Language models are known to perpetuate stereotypes present in their training data, and developers typically respond by auditing for bias and applying filters. Our study highlights deeper issues in how models internalise and reproduce social norms and stereotypes based on language.

If language models are going to be used in healthcare, either built into diagnostics to help doctors make decisions or as self-help tools for individuals, their limited and biased understanding of gender could introduce significant discriminatory harm.

Current AI models are largely learning gender from the internet, and the results are predictably problematic. Fixing AI’s gender problem is not just about tweaking algorithms. We need a concerted approach, from curating better training datasets to building standards and robust public oversight, to ensure these new tools stop amplifying old prejudices.

Our academic community is aware of the social biases reproduced by algorithmic models. With the emergence of a new generation of AI systems, such as language models, these biases have not been mitigated; rather, they continue to amplify stereotypical representations. In our work, we advocate for stronger accountability mechanisms.