TLDR: This research explores using Generative AI (GenAI) to create synthetic skin lesion images for assessing the fairness of AI-based melanoma classifiers. By generating balanced datasets across various demographic attributes like sex, age, and skin type, the study demonstrates that synthetic data can effectively evaluate and highlight biases in existing models. The findings suggest that while synthetic data is a promising tool for fairness assessment, its reliability is highest when the AI model being evaluated was trained on a dataset similar to that used for generating the synthetic images.
Melanoma, the most dangerous form of skin cancer, is projected to see a significant increase in cases and deaths by 2040. Early detection is crucial for improving survival rates, and recent advancements in Artificial Intelligence (AI) offer promising tools for automated medical diagnostics, including smartphone-based applications for pre-screenings. These innovations can reduce physician workload, diagnostic errors, and healthcare costs, while promoting early detection.
However, as AI systems become more integrated into critical areas like medicine, their trustworthiness becomes paramount. A key aspect of this trustworthiness is fairness, especially given the emergence of regulations like the European Union’s AI Act. Ensuring fairness in AI systems requires robust evaluation datasets that are representative of diverse populations across attributes such as sex, age, and skin type (measured on the Fitzpatrick scale).
A significant challenge in fairness assessment is the imbalance often found in real-world datasets. For instance, while the International Skin Imaging Collaboration (ISIC) dataset is a valuable resource for skin lesion images with patient metadata, it still presents imbalances across various demographic attributes. This imbalance makes it difficult to ensure that AI models perform equally well for all groups, potentially leading to biased outcomes.
This research addresses this challenge by leveraging state-of-the-art Generative AI (GenAI), specifically the LightningDiT model, to synthesize highly realistic skin lesion images. The goal is to create balanced datasets that can be used to thoroughly assess the fairness of publicly available melanoma classifiers. The study posed two key research questions:
Can we use state-of-the-art generative image synthesis methods to obtain a balanced fairness assessment dataset?
The researchers developed a protocol that uses diffusion-based image synthesis to generate balanced cohorts of dermoscopic images, encompassing various sexes, ages, and Fitzpatrick skin types. This involves training the LightningDiT model on a large corpus of real ISIC images, extracting latent representations, and then generating new synthetic images conditioned on specific demographic attributes.
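To make the protocol concrete, here is a minimal sketch of attribute-conditioned sampling in the spirit of the paper. The attribute bins, the class-index encoding of demographic combinations, and the `sample`/`decode` interfaces are all illustrative assumptions for this sketch, not LightningDiT's actual API:

```python
import itertools
import torch

# Attribute vocabularies. The bins below are assumptions for illustration;
# the paper reports 2 sexes, 8 age groups, and 7 skin-type categories.
SEXES = ["male", "female"]
AGE_BINS = [f"{10 * i}-{10 * i + 9}" for i in range(8)]   # assumed decade bins
SKIN_TYPES = [f"type_{k}" for k in range(1, 8)]           # 7 categories

# Enumerate every (sex, age, skin type) combination and give it a class
# index, mirroring class-conditional DiT-style training.
COMBOS = list(itertools.product(SEXES, AGE_BINS, SKIN_TYPES))   # 112 combos
COMBO_TO_CLASS = {combo: idx for idx, combo in enumerate(COMBOS)}

@torch.no_grad()
def sample_cohort(model, vae, n_per_combo=100, device="cuda"):
    """Generate n_per_combo synthetic lesion images per demographic combo.

    `model` is assumed to expose a `sample(class_ids)` method that denoises
    latents conditioned on class indices, and `vae` a `decode` method back
    to pixel space. Both are hypothetical interfaces for this sketch.
    """
    cohort = {}
    for combo, cls in COMBO_TO_CLASS.items():
        class_ids = torch.full((n_per_combo,), cls, dtype=torch.long, device=device)
        latents = model.sample(class_ids)       # conditional latent sampling
        cohort[combo] = vae.decode(latents)     # decode to dermoscopic images
    return cohort                               # 112 combos x 100 = 11,200 images
```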
Can this synthetic dataset be used to reliably assess the fairness of skin lesion classifiers?
To answer this, the synthetic images were applied to three peer-reviewed, pre-trained skin lesion classification models: DeepGuide, MelaNet, and SkinLesionDensenet. Fairness was quantified as Demographic Parity (DP), operationalized here as Accuracy Parity (AP): the difference in classification accuracy between demographic subgroups.
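A minimal sketch of the Accuracy Parity computation, assuming we already have ground-truth labels, model predictions, and a subgroup assignment for each synthetic image:

```python
import numpy as np

def accuracy_parity_gap(y_true, y_pred, groups):
    """Largest pairwise difference in per-subgroup accuracy (0 = perfectly fair)."""
    accs = {}
    for g in np.unique(groups):
        mask = groups == g
        accs[g] = float((y_pred[mask] == y_true[mask]).mean())
    return max(accs.values()) - min(accs.values()), accs

# Toy example: accuracy gap across three Fitzpatrick skin-type groups.
y_true = np.array([1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])
groups = np.array(["I", "I", "II", "II", "III", "III"])
gap, per_group = accuracy_parity_gap(y_true, y_pred, groups)
print(gap, per_group)   # gap of 0.5: group I at 0.5, II at 1.0, III at 0.5
```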
The methodology involved generating a substantial number of synthetic images: 100 images for each combination of sex (2 values), age group (8 bins), skin type (7 categories), and disease case (1 type), i.e. 2 × 8 × 7 × 1 = 112 combinations and 11,200 images in total. These images were then fed into the pre-trained melanoma detection models to evaluate their fairness across these PII attributes, as sketched below.
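The evaluation step then reduces to scoring each classifier on each demographic cohort. A hedged sketch, reusing the `cohort` dictionary from the sampling sketch above and omitting the model-specific loading code for DeepGuide, MelaNet, and SkinLesionDensenet:

```python
import torch

@torch.no_grad()
def evaluate_models(classifiers, cohort, labels_by_combo, threshold=0.5):
    """Per-(model, demographic-combo) accuracy on the synthetic cohort.

    `classifiers` maps model names to callables returning melanoma
    probabilities; how each of the three pre-trained models is loaded and
    preprocessed is model-specific and not shown here.
    """
    results = {}
    for name, clf in classifiers.items():
        for combo, batch in cohort.items():
            probs = clf(batch)                           # (N,) melanoma scores
            preds = (probs >= threshold).long()
            labels = labels_by_combo[combo]
            results[(name, combo)] = (preds == labels).float().mean().item()
    return results   # per-group accuracies feed the parity metric above
```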
The results indicated that fairness assessment using highly realistic synthetic data is a promising direction: the synthetic images made it possible to evaluate fairness across different demographic groups. A notable finding was that performance depended on the training data of the model under test. For example, DeepGuide, which was trained on the HAM10000 dataset, showed slightly lower performance when evaluated with synthetic data generated from the ISIC dataset. This illustrates the impact of 'dataset shift', where the data used for evaluation differs from the data the model was originally trained on.
Despite this, the study proposes that this approach offers a valuable new avenue for employing synthetic data to gauge and enhance fairness in medical-imaging AI systems. It suggests that synthetic test data can be used first to verify the robustness of pre-trained models and then to evaluate their fairness across different PII groups. The research paper can be found here: Towards Facilitated Fairness Assessment of AI-based Skin Lesion Classifiers Through GenAI-based Image Synthesis.
In conclusion, although some generated images appeared unrealistic, LightningDiT performed well overall at generating synthetic test data. The study verified that synthetic images can serve as a powerful tool for evaluating the fairness and robustness of pre-trained AI models, particularly when the generator and the classifier under test originate from the same data distribution. This approach holds significant potential for privacy-preserving, PII-free fairness audits in medical imaging AI.