Auditing Stereotypes in AI: How Vision-Language Models Portray Healthcare Professionals

TLDR: A study audited Vision-Language Models (VLMs) like CLIP and OpenCLIP for biases in representing healthcare professionals. It found consistent demographic stereotypes across age, gender, and race, with “adult” being overrepresented, specific gender roles reinforced (e.g., male surgeons, female nurses), and varied racial associations. The research highlights that these biases are not uniform across different VLM architectures and have significant implications for AI applications in healthcare, such as hiring and workforce analysis, underscoring the need for careful bias identification and mitigation.

Vision-Language Models (VLMs) are AI systems that combine visual perception with natural language understanding. Models such as OpenAI’s CLIP and the open-source OpenCLIP are trained on vast amounts of internet data, which lets them learn joint representations of images and text. They are versatile enough to support tasks like image captioning and visual question answering, but a recent study raises a critical concern: these models often absorb and reproduce societal biases, particularly when representing healthcare professionals.

The research paper, titled “Surgeons Are Indian Males and Speech Therapists Are White Females: Auditing Biases in Vision-Language Models for Healthcare Professionals,” examines how VLMs encode stereotypical associations between medical professions and demographic attributes. This matters because AI-enabled hiring and workforce analytics in healthcare could inadvertently perpetuate inequities, create compliance problems, and erode patient trust.

Unpacking the Methodology

To rigorously evaluate these biases, the researchers developed a comprehensive protocol. First, they defined a detailed taxonomy of 33 healthcare roles, spanning clinicians (like surgeons, cardiologists, dentists), allied health roles (nurses, pharmacists, speech therapists), and hospital administration staff. This taxonomy allowed for a structured examination of various professional categories.

Next, a profession-aware prompt suite was curated. These prompts were designed to be demographically neutral, such as “Photo of a dentist” or “Photo of a hospital receptionist,” ensuring that any observed biases reflected the VLM’s intrinsic associations rather than explicit gender, race, or age mentions in the prompt itself.
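
A prompt suite of this kind could be generated from the taxonomy roughly as follows. This is a minimal sketch, not the researchers' code: the `ROLES` dictionary lists only roles named in this article (the full set of 33 is not reproduced here, and the grouping of each role is assumed), and `DEMOGRAPHIC_TERMS`, `build_prompts`, and `is_neutral` are hypothetical names introduced for illustration.

```python
# Hypothetical subset of the taxonomy: only roles mentioned in this article.
ROLES = {
    "clinician": ["surgeon", "cardiologist", "dentist"],
    "allied health": ["nurse", "pharmacist", "speech therapist"],
    "administration": ["hospital receptionist"],
}

# Illustrative (not exhaustive) terms that a neutral prompt must avoid.
DEMOGRAPHIC_TERMS = {
    "male", "female", "man", "woman",
    "young", "old", "white", "black", "indian",
}


def article_for(word):
    """Return 'an' before a vowel letter, otherwise 'a'."""
    return "an" if word[0].lower() in "aeiou" else "a"


def build_prompts(roles):
    """Map each role name to a prompt of the form 'Photo of a <role>'."""
    prompts = {}
    for names in roles.values():
        for name in names:
            prompts[name] = f"Photo of {article_for(name)} {name}"
    return prompts


def is_neutral(prompt):
    """True if no whitespace-separated word is a listed demographic term."""
    return set(prompt.lower().split()).isdisjoint(DEMOGRAPHIC_TERMS)


prompts = build_prompts(ROLES)
for role, prompt in prompts.items():
    print(role, "->", prompt, "| neutral:", is_neutral(prompt))
```

The neutrality check is deliberately simple. A real audit would likely need a broader term list and handling for punctuation.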

The study used the FairFace dataset, a large and balanced collection of over 108,000 face images annotated for race (seven groups), gender, and age. This dataset served as a diverse and representative baseline against which demographic skew could be measured. The evaluation covered OpenAI’s CLIP (ViT-B/16, ViT-B/32) and OpenCLIP (ViT-L/14, ViT-H/14). These models share similar architectures but were trained on different datasets, which helps show whether biases are model-specific or consistent across models.

Bias identification was performed using two key methods: top-k retrieval analysis and Jensen-Shannon (JS) divergence. Top-k retrieval quantified how often images from stereotyped demographic classes were retrieved for a given profession. JS divergence provided a measurable bias score, indicating the deviation from an ideal, perfectly balanced demographic distribution.
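
The two measurements can be sketched with NumPy as below. Several things here are assumptions rather than details from the study: the embeddings are random stand-ins for what a CLIP or OpenCLIP encoder would produce, the labels are random stand-ins for FairFace annotations, the reference distribution is uniform over groups, and the logarithm is base 2 (which bounds the score between 0 and 1). The function names are hypothetical.

```python
import numpy as np


def top_k_indices(text_emb, image_embs, k):
    """Indices of the k images most similar to the prompt (cosine similarity)."""
    text_unit = text_emb / np.linalg.norm(text_emb)
    image_units = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_units @ text_unit
    return np.argsort(-sims)[:k]


def group_distribution(labels, n_groups):
    """Share of each demographic group among the given integer labels."""
    counts = np.bincount(labels, minlength=n_groups)
    return counts / counts.sum()


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs; 0 means identical."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Stand-in data: in a real audit these would be encoder outputs and
# dataset annotations, not random numbers.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 64))
gender_labels = rng.integers(0, 2, size=1000)  # two groups, coded 0 and 1
text_emb = rng.normal(size=64)                 # e.g. "Photo of a dentist"

top = top_k_indices(text_emb, image_embs, k=100)
retrieved = group_distribution(gender_labels[top], n_groups=2)
balanced = np.full(2, 0.5)                     # assumed ideal: uniform
bias_score = js_divergence(retrieved, balanced)
print("retrieved shares:", retrieved, "JS divergence:", bias_score)
```

Under these assumptions, a score of 0 means the retrieved set matches the balanced reference exactly, and a score of 1 means the two distributions share no group at all.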

Key Findings: A Landscape of Bias

The empirical results revealed consistent and concerning demographic biases across multiple roles and vision models:

Age Bias: The “adult” age group (20-49 years) dominated retrievals for nearly every profession. “Young” individuals were rarely the leading category. “Old” individuals appeared more often in some roles, such as Speech Therapist and Psychiatrist, but remained significantly underrepresented overall. This suggests a lack of age diversity in the models’ learned representations.
Gender Bias: Strong gender stereotypes were evident. Roles like Ambulance Driver, Paramedic, and Hospital Guard were consistently male-dominant (88–95% male for Ambulance Driver). Conversely, Nurse (1–10% male) and Hospital Receptionist were strongly female-biased. Physician and specialty roles showed more variation, with some like Cardiologist being highly male-dominated (93% male in some models), while others like Dermatologist and Gynecologist/Obstetrician exhibited high volatility across models.
Race Bias: Racial biases were also pronounced and model-dependent. OpenCLIP L/14 frequently retrieved Indian faces, associating them with 28 of the 33 professions. Other models showed more diversity, with OpenCLIP H/14 often retrieving Latino or Black faces as dominant groups alongside Indian faces. The “top-race” identity for many roles was unstable across models, indicating that racial associations are intrinsic to VLMs but vary with the specific model architecture and training data.
Intersectional Biases: The study also highlighted how gender, race, and age biases intertwine. Consistent intersectional stereotypes emerged, such as Black female midwives, White female speech therapists, older male sanitation workers, and older male hospital guards. These compounded biases are particularly concerning as they risk marginalizing already underrepresented groups.
Cross-Model Volatility: A crucial finding was that bias patterns were not uniform across different VLM architectures. For instance, OpenCLIP H/14 tended to be female-skewed overall, while CLIP B/16 leaned male. This volatility underscores that fairness audits for one model cannot be assumed to generalize to others, complicating deployment in sensitive contexts.
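
One simple way to check the kind of cross-model instability described above is to compare the dominant group each model retrieves for the same profession. The sketch below is illustrative only: the model and group names are placeholders, the shares are invented values rather than results from the study, and `dominant_group` and `compare_models` are hypothetical helpers.

```python
def dominant_group(shares):
    """Return the group with the largest share of retrieved images."""
    return max(shares, key=shares.get)


def compare_models(per_model_shares):
    """Return each model's dominant group and whether all models agree."""
    tops = {
        model: dominant_group(shares)
        for model, shares in per_model_shares.items()
    }
    stable = len(set(tops.values())) == 1
    return tops, stable


# Placeholder shares for one profession. These are NOT study results.
placeholder = {
    "model_a": {"group_1": 0.6, "group_2": 0.4},
    "model_b": {"group_1": 0.3, "group_2": 0.7},
}

tops, stable = compare_models(placeholder)
print(tops, "stable across models:", stable)
```

When `stable` is false for a role, an audit of one model says little about the others for that role, which is the practical consequence the volatility finding points to.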

Implications and Recommendations

The findings extend previous research on VLM biases by showing how those biases manifest in healthcare professions. The perpetuation of stereotypes like male physicians and female nurses, along with specific intersectional biases, has direct consequences for workforce analytics, recruitment, and medical education. AI systems built on these biased models could inadvertently favor certain demographics, leading to unfair hiring practices or misrepresentations of healthcare roles.

The researchers recommend several actions to address these issues:

Model-Specific Fairness Audits: Given the volatility of biases across different VLM architectures, thorough, model-specific audits are essential before deployment in healthcare settings.
Improved Benchmark Datasets: There is a need for benchmark datasets that explicitly capture intersectional and age-related biases within professional contexts, moving beyond general demographic categories.
Reporting Standards: Establishing clear reporting standards for demographic distributions in model outputs can enhance transparency and accountability.

In conclusion, the study emphasizes that VLMs construct knowledge of healthcare professions in ways that are selective, fragile, and socially patterned. Addressing these biases requires a holistic approach, encompassing critical reflection on data curation, model training, evaluation standards, and governance, to ensure that AI advances equity and trust in healthcare rather than undermining it.
