TLDR: A study audited Vision-Language Models (VLMs) like CLIP and OpenCLIP for biases in representing healthcare professionals. It found consistent demographic stereotypes across age, gender, and race, with “adult” being overrepresented, specific gender roles reinforced (e.g., male surgeons, female nurses), and varied racial associations. The research highlights that these biases are not uniform across different VLM architectures and have significant implications for AI applications in healthcare, such as hiring and workforce analysis, underscoring the need for careful bias identification and mitigation.
Vision-Language Models (VLMs) are powerful AI systems that combine visual perception with natural language understanding. These models, such as OpenAI’s CLIP and OpenCLIP, are trained on vast amounts of internet data, allowing them to learn joint representations of images and text. While incredibly versatile for tasks like image captioning and visual question answering, a recent study reveals a critical concern: these models often absorb and reproduce societal biases, particularly when representing healthcare professionals.
The research paper, titled “Surgeons Are Indian Males and Speech Therapists Are White Females: Auditing Biases in Vision-Language Models for Healthcare Professionals,” examines how VLMs encode stereotypical associations between medical professions and demographic attributes. This matters because AI-enabled hiring and workforce analytics in healthcare could inadvertently perpetuate inequities, create compliance problems, and erode patient trust.
Unpacking the Methodology
To rigorously evaluate these biases, the researchers developed a comprehensive protocol. First, they defined a detailed taxonomy of 33 healthcare roles, spanning clinicians (like surgeons, cardiologists, dentists), allied health roles (nurses, pharmacists, speech therapists), and hospital administration staff. This taxonomy allowed for a structured examination of various professional categories.
Next, a profession-aware prompt suite was curated. These prompts were designed to be demographically neutral, such as “Photo of a dentist” or “Photo of a hospital receptionist,” ensuring that any observed biases reflected the VLM’s intrinsic associations rather than explicit gender, race, or age mentions in the prompt itself.
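The prompt construction described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the role list is a small subset of roles named in this article, and the helper names (`neutral_prompt`, `is_neutral`) and the list of demographic terms are assumptions made for the example.

```python
# Illustrative subset of the 33-role taxonomy (only roles named in the article).
ROLES = ["dentist", "surgeon", "nurse", "hospital receptionist", "ambulance driver"]

# Hypothetical list of words that would make a prompt demographically loaded.
DEMOGRAPHIC_TERMS = {
    "male", "female", "man", "woman",
    "young", "old", "adult",
    "white", "black", "indian", "latino", "asian",
}


def neutral_prompt(role: str) -> str:
    """Build a 'Photo of a <role>' prompt, choosing 'a' or 'an' by first letter."""
    article = "an" if role[0].lower() in "aeiou" else "a"
    return f"Photo of {article} {role}"


def is_neutral(prompt: str) -> bool:
    """Return True if the prompt contains none of the listed demographic terms."""
    words = set(prompt.lower().split())
    return words.isdisjoint(DEMOGRAPHIC_TERMS)


PROMPTS = [neutral_prompt(role) for role in ROLES]

for prompt in PROMPTS:
    print(prompt, is_neutral(prompt))
```

A check like `is_neutral` is a cheap guard: if a prompt names a gender, race, or age group, any skew in the retrieved images can no longer be attributed to the model alone.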
The study used the FairFace dataset, a large, balanced collection of over 108,000 face images annotated for race (seven groups), gender, and age. This dataset served as a diverse and representative baseline against which demographic skew could be measured. The evaluation covered OpenAI’s CLIP (ViT-B/16, ViT-B/32) and OpenCLIP (ViT-L/14, ViT-H/14) models, chosen because they share similar architectures but were trained on different datasets, which helps show whether a bias is specific to one model or consistent across models.
Bias identification was performed using two key methods: top-k retrieval analysis and Jensen-Shannon (JS) divergence. Top-k retrieval quantified how often images from stereotyped demographic classes were retrieved for a given profession. JS divergence provided a measurable bias score, indicating the deviation from an ideal, perfectly balanced demographic distribution.
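As a rough sketch of how these two measurements fit together, the Python below ranks images by cosine similarity to a prompt embedding, computes the demographic shares among the top k, and scores them against a uniform distribution with JS divergence. Several things here are assumptions rather than details from the paper: the function names, the tiny two-dimensional toy embeddings and labels, the use of a base-2 logarithm (which bounds the score between 0 and 1), and the choice of a uniform reference. In practice the embeddings would come from a VLM's text and image encoders.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k_distribution(text_emb, image_embs, labels, groups, k):
    """Share of each demographic group among the k images most similar to the prompt."""
    k = min(k, len(image_embs))
    ranked = sorted(
        range(len(image_embs)),
        key=lambda i: cosine(text_emb, image_embs[i]),
        reverse=True,
    )
    counts = Counter(labels[i] for i in ranked[:k])
    return [counts[g] / k for g in groups]


def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2 (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy data for illustration only; these are not values from the study.
groups = ["Male", "Female"]
text_emb = [1.0, 0.0]
image_embs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
labels = ["Male", "Male", "Female", "Female", "Female", "Male"]

observed = top_k_distribution(text_emb, image_embs, labels, groups, k=3)
uniform = [1 / len(groups)] * len(groups)
print("top-k shares:", observed)
print("JS divergence from uniform:", js_divergence(observed, uniform))
```

A score of 0 means the retrieved set matches the balanced reference exactly; the score grows as retrievals concentrate on fewer groups.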
Key Findings: A Landscape of Bias
The empirical results revealed consistent and concerning demographic biases across multiple roles and vision models:
- Age Bias: The “adult” age group (20-49 years) overwhelmingly dominated retrievals for nearly every profession. “Young” individuals were rarely the leading category, and although “old” faces appeared more often for some roles, such as Speech Therapist and Psychiatrist, that group remained significantly underrepresented overall. This suggests a lack of age diversity in the models’ learned representations.
- Gender Bias: Strong gender stereotypes were evident. Roles like Ambulance Driver, Paramedic, and Hospital Guard were consistently male-dominant (88–95% male for Ambulance Driver). Conversely, Nurse (1–10% male) and Hospital Receptionist were strongly female-biased. Physician and specialty roles varied more: some, like Cardiologist, were highly male-dominated (93% male in some models), while others, like Dermatologist and Gynecologist/Obstetrician, swung widely from model to model.
- Race Bias: Racial biases were also pronounced and model-dependent. OpenCLIP L/14 frequently associated Indian faces with 28 professions. Other models showed more diversity, with OpenCLIP H/14 often retrieving Latino or Black faces as dominant groups alongside Indian faces. The “top-race” identity for many roles was unstable across models, indicating that racial associations are intrinsic to VLMs but vary with the specific model architecture and training data.
- Intersectional Biases: The study also highlighted how gender, race, and age biases intertwine. Consistent intersectional stereotypes emerged, such as Black female midwives, White female speech therapists, older male sanitation workers, and older male hospital guards. These compounded biases are particularly concerning because they risk marginalizing already underrepresented groups.
- Cross-Model Volatility: A crucial finding was that bias patterns were not uniform across VLM architectures. For instance, OpenCLIP H/14 tended to be female-skewed overall, while CLIP B/16 leaned male. This volatility means that a fairness audit of one model cannot be assumed to generalize to others, which complicates deployment in sensitive contexts.
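One simple way to surface the cross-model volatility described above is to compare the dominant group each model returns for a role and flag the roles where models disagree. The sketch below is a hypothetical helper, not the study's procedure, and the model, role, and group names are deliberately generic placeholders rather than reported results.

```python
def unstable_roles(top_group_by_model):
    """Return roles whose dominant demographic group differs between models.

    top_group_by_model maps a model name to a {role: dominant_group} dict.
    """
    roles = set()
    for tops in top_group_by_model.values():
        roles.update(tops)

    unstable = {}
    for role in sorted(roles):
        # Collect the dominant group per model, skipping models missing the role.
        per_model = {
            model: tops[role]
            for model, tops in top_group_by_model.items()
            if role in tops
        }
        if len(set(per_model.values())) > 1:
            unstable[role] = per_model
    return unstable


# Placeholder inputs for illustration only; not values from the study.
toy = {
    "model_a": {"role_1": "group_x", "role_2": "group_y"},
    "model_b": {"role_1": "group_x", "role_2": "group_z"},
}

for role, per_model in unstable_roles(toy).items():
    print(role, per_model)
```

Any role that appears in the output would need its own audit for each model, since a conclusion drawn from one model does not carry over.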
Implications and Recommendations
The findings extend previous research on VLM biases by showing that those biases also appear in representations of healthcare professions. The perpetuation of stereotypes like male physicians and female nurses, along with specific intersectional biases, has direct consequences for workforce analytics, recruitment, and medical education. AI systems built on these biased models could inadvertently favor certain demographics, leading to unfair hiring practices or misrepresentations of healthcare roles.
The researchers recommend several actions to address these issues:
- Model-Specific Fairness Audits: Given the volatility of biases across different VLM architectures, thorough, model-specific audits are essential before deployment in healthcare settings.
- Improved Benchmark Datasets: There is a need for benchmark datasets that explicitly capture intersectional and age-related biases within professional contexts, moving beyond general demographic categories.
- Reporting Standards: Establishing clear reporting standards for demographic distributions in model outputs can enhance transparency and accountability.
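To make the reporting recommendation concrete, here is one possible shape for a machine-readable record of the demographic distribution in a model's retrievals. The schema, field names, and `build_report` helper are assumptions made for this sketch; the paper does not prescribe a format, and the values shown are placeholders.

```python
import json
import math


def build_report(model, prompt, k, shares):
    """Serialize one retrieval audit as JSON after basic sanity checks."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not math.isclose(sum(shares.values()), 1.0, abs_tol=1e-6):
        raise ValueError("group shares must sum to 1")
    record = {
        "model": model,              # which VLM was audited
        "prompt": prompt,            # the neutral prompt that was issued
        "top_k": k,                  # how many retrieved images were counted
        "attribute_shares": shares,  # share of each demographic group
    }
    return json.dumps(record, sort_keys=True)


# Placeholder values for illustration only; not results from the study.
report = build_report(
    model="example-model",
    prompt="Photo of a dentist",
    k=100,
    shares={"group_a": 0.5, "group_b": 0.5},
)
print(report)
```

Publishing a record like this for each model and prompt would let readers compare audits directly instead of relying on summary claims.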
In conclusion, the study emphasizes that VLMs construct knowledge of healthcare professions in ways that are selective, fragile, and socially patterned. Addressing these biases requires a holistic approach, encompassing critical reflection on data curation, model training, evaluation standards, and governance, to ensure that AI advances equity and trust in healthcare rather than undermining it.


