TLDR: A research paper by Zoya Hammad and Nii Longdon Sowah evaluates gender bias across four text-to-image AI models: DALL-E 3, Emu, Stable Diffusion XL, and Stable Cascade. The study found that Stable Diffusion models exhibited significant male bias in high-status professions and female bias in traditionally female roles. Emu showed more balanced results. DALL-E 3, surprisingly, displayed a female-favoring bias, likely due to backend prompt modifications aimed at increasing diversity, potentially leading to ‘over-correction.’ The research emphasizes that biases stem from training data and lack of diversity in AI development, posing a critical question about whether AI should reflect real-world demographics or aim for a 50:50 gender ratio.
Artificial Intelligence (AI) is increasingly integrated into various aspects of our daily lives, from healthcare to entertainment. As this technology advances, it becomes crucial to examine its ethical implications, particularly concerning inclusivity and fairness. A recent research paper, titled “Evaluating and comparing gender bias across four text-to-image models,” delves into this very issue, analyzing how different AI models represent gender in generated images.
Authored by Zoya Hammad and Nii Longdon Sowah, the study evaluates and compares the degree of gender bias in four prominent text-to-image AI models: Stable Diffusion XL (SDXL), Stable Cascade (SC), DALL-E 3, and Emu. Previous research had typically focused on one or two models and lacked quantifiable comparisons. This paper addresses that gap by examining 30 different professions and generating 50 images per profession with each of the four models, yielding a comprehensive, comparative analysis of gender representation.
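To make the scale of the evaluation concrete, a minimal sketch of the tallying procedure might look like the Python below. Note that `generate_image` and `classify_gender` are hypothetical placeholders; the paper's actual generation and labeling pipeline is not detailed here.

```python
# Minimal sketch of the study's counting methodology, under assumptions:
# generate_image() and classify_gender() are hypothetical stand-ins for
# each model's API and for whatever labeling method the authors used.
from collections import Counter

MODELS = ["SDXL", "Stable Cascade", "DALL-E 3", "Emu"]
PROFESSIONS = ["CEO", "doctor", "nurse", "engineer"]  # the study used 30
IMAGES_PER_PROMPT = 50

def generate_image(model: str, prompt: str) -> bytes:
    """Hypothetical wrapper around a given model's image-generation API."""
    raise NotImplementedError

def classify_gender(image: bytes) -> str:
    """Hypothetical labeler returning 'male' or 'female'."""
    raise NotImplementedError

def tally(model: str) -> dict[str, Counter]:
    """Count perceived gender per profession for one model."""
    counts: dict[str, Counter] = {}
    for profession in PROFESSIONS:
        gender_counts = Counter()
        for _ in range(IMAGES_PER_PROMPT):
            image = generate_image(model, f"a {profession}")
            gender_counts[classify_gender(image)] += 1
        counts[profession] = gender_counts
    return counts

# e.g. percent female CEOs for one model:
# tally("SDXL")["CEO"]["female"] / IMAGES_PER_PROMPT * 100
```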
The researchers hypothesized that older models like DALL-E and Stable Diffusion would show a noticeable bias towards men, while Emu, a newer model from Meta AI, would offer more balanced results. Their findings largely supported this, with some intriguing exceptions.
Stable Diffusion and Emu: Reflecting and Moderating Stereotypes
The study found that Stable Diffusion XL and Stable Cascade consistently exhibited a significant degree of gender bias. For high-paying or high-education professions such as CEO, pilot, scientist, doctor, and engineer, these models predominantly generated images of men, often reaching 100% male representation for roles like CEO and doctor in SDXL and SC. Conversely, for professions traditionally associated with women, such as nurse, housekeeper, and administrative assistant, the models were much more likely to generate female images, sometimes also reaching 100% female representation.
An interesting pattern emerged when comparing related professions. For instance, while “a doctor” yielded almost exclusively male images from Stable Diffusion models, “a nurse” resulted in overwhelmingly female images. Similarly, “a person cooking in the kitchen” often produced female images, but “a chef” predominantly showed men. The same trend appeared between “a teacher” (mostly women) and “a professor” (mostly men), highlighting how these models reinforce societal stereotypes linked to the perceived status or formality of a role.
Emu, Meta AI’s recently released model, demonstrated comparatively more balanced results, showing at least some diversity even in professions where Stable Diffusion models showed none. This suggests that developers might be actively incorporating ethical guidelines and diverse training data in newer models, possibly in response to past criticisms of AI bias.
DALL-E 3: The Case of “Over-Correction”
Perhaps the most striking finding concerned OpenAI’s DALL-E 3. Contrary to the hypothesis and previous studies, DALL-E 3 exhibited a significant bias favoring women. For 28 out of 30 professions, it generated more female images. For example, where other models produced mostly male surgeons or CEOs, DALL-E 3 generated 82% female surgeons and 78% female CEOs. This is a stark contrast to earlier reports where DALL-E showed male bias in medical professions.
The researchers observed that DALL-E 3 achieves these results by automatically rewriting user prompts on the backend, appending keywords intended to increase diversity. For instance, a simple prompt like “A Doctor” might be revised to include descriptors such as “South Asian in descent and a woman by gender.” While this is an attempt to correct for bias, the study asks whether DALL-E 3 is “over-correcting,” producing a reverse gender imbalance.
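As an illustration only (not OpenAI’s actual implementation), a backend rewriter of this kind could be as simple as appending sampled descriptors to the user’s prompt. The descriptor lists and the uniform sampling policy below are assumptions for demonstration:

```python
import random

# Illustrative sketch only; not OpenAI's actual code. The descriptor
# lists and uniform sampling policy here are assumptions.
GENDER_DESCRIPTORS = ["a woman by gender", "a man by gender"]
DESCENT_DESCRIPTORS = [
    "South Asian in descent",
    "East Asian in descent",
    "Black in descent",
    "White in descent",
]

def rewrite_prompt(prompt: str) -> str:
    """Append sampled diversity descriptors to the user's prompt."""
    descent = random.choice(DESCENT_DESCRIPTORS)
    gender = random.choice(GENDER_DESCRIPTORS)
    return f"{prompt}, {descent} and {gender}"

print(rewrite_prompt("A Doctor"))
# e.g. "A Doctor, South Asian in descent and a woman by gender"
```

A uniform sampler like this would push results toward 50:50 only if the model reliably follows the descriptor; the roughly 80% female skew the study measured suggests that whatever policy DALL-E 3 actually uses, which is not public, behaves differently.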
The Root of the Bias and the Path Forward
The paper argues that the bias in these text-to-image models likely stems not from the model architectures themselves but from the vast datasets they are trained on. Publicly available images, often scraped from the internet, frequently underrepresent women in certain professional roles, thereby perpetuating existing societal stereotypes. For example, studies show that only 38% of images for social categories found via Google Image search depict women.
Another contributing factor is the lack of gender diversity within the AI research community itself. Studies indicate that women comprise only about 26% of data and AI roles and around 13.8% of AI research paper authors. Biases from developers’ worldviews can inadvertently be instilled into algorithms, leading to unfair outcomes. The researchers suggest that ensuring diversity in AI research teams and curating comprehensive, diverse datasets are crucial steps to mitigate these biases.
This research project highlights a fundamental question for the future of AI: should AI image generation tools aim to reflect real-world demographic statistics, or should they strive for an idealized 50:50 gender ratio? The study concludes that current generative AI models are not adequately prepared to estimate appropriate gender representation. By uncovering these biases, the paper initiates an important discussion on who decides what constitutes appropriate representation and how to build fairer, more inclusive AI tools that truly reflect the diversity of the real world. You can read the full research paper here.