TL;DR: MammoClean is a public framework that addresses the data heterogeneity and biases in mammography datasets that hinder the development of reliable AI for breast cancer detection. It standardizes case selection and image processing (including laterality and intensity correction), and unifies metadata into a consistent multi-view structure. Applying MammoClean to diverse datasets, the researchers quantified substantial distributional shifts in breast density and abnormality prevalence and showed that AI models trained on corrupted data perform markedly worse. The framework enables the construction of unified multi-dataset training corpora, yielding more robust AI models with better cross-domain generalization and more equitable performance across diverse patient populations, ultimately facilitating reproducible and bias-aware AI development in mammography.
The advancement of artificial intelligence (AI) in mammography, a crucial tool for early breast cancer detection, faces a significant hurdle: the vast inconsistencies across public datasets. These datasets often vary widely in image quality, metadata structure, and the demographics of the populations they represent. This variability introduces dataset-specific biases that severely limit how well AI models transfer across clinical settings, a fundamental barrier to their widespread use.
To tackle this challenge, researchers have introduced a new public framework called MammoClean. This innovative system is designed to standardize and quantify biases within mammography datasets. MammoClean streamlines several critical aspects of data preparation, including how cases are selected, how images are processed (such as correcting for laterality and intensity), and how metadata is unified into a consistent, multi-view structure.
The accompanying paper also provides a thorough review of breast anatomy, imaging characteristics, and existing public mammography datasets. This comprehensive approach helps systematically pinpoint the main sources of bias. When MammoClean was applied to three diverse datasets—CBIS-DDSM, TOMPEI-CMMD, and VinDr-Mammo—it revealed substantial differences in breast density distributions and the prevalence of abnormalities.
A critical finding from this research is the direct impact of data corruption: AI models trained on inconsistent or poorly prepared datasets showed a significant drop in performance compared to those trained on data curated by MammoClean. By using MammoClean to identify and reduce these bias sources, researchers can build unified training collections from multiple datasets. This, in turn, allows for the development of more robust AI models that can generalize better across different clinical environments.
MammoClean offers an essential and reproducible pipeline for creating bias-aware AI in mammography. It facilitates fairer comparisons between different AI methods and helps in the creation of safe, effective systems that perform equitably across diverse patient populations and various clinical settings. The open-source code for MammoClean is publicly available, encouraging broader adoption and collaboration.
Understanding Mammography and Data Challenges
Breast cancer remains a leading cause of cancer deaths among women, and regular mammography screening is vital for early detection. Mammography uses low-dose X-rays to visualize breast tissue, with Digital Mammography (DM) or Full-Field Digital Mammography (FFDM) being the most common methods. While effective, FFDM projects a 3D breast into a 2D image, leading to potential information loss and tissue overlap. Newer technologies like Digital Breast Tomosynthesis (DBT) and Contrast-Enhanced Mammography (CEM) aim to mitigate these issues.
A standard mammogram typically involves two views for each breast: the Cranio-Caudal (CC) view (from above) and the Medio-Lateral Oblique (MLO) view (from the side). Radiologists use multi-view strategies to compare findings across views and between breasts, ensuring a thorough interpretation.
The paper highlights that a major obstacle for AI in mammography is the lack of standardized and harmonized data. Datasets are often collected, annotated, and formatted under different protocols, leading to inconsistencies in image quality, labeling, and metadata. This variability makes it hard to reproduce results, ensure interoperability, and generalize AI models across different clinical environments. Harmonized datasets are crucial for fair comparisons and for building robust models that can handle real-world diversity.
Existing harmonization efforts often focus on contrast enhancement, which may require parameter adjustments that vary across settings. Other critical technical aspects are frequently overlooked. MammoClean takes a more comprehensive approach, directly addressing the underlying issues of dataset heterogeneity and bias.
MammoClean’s Approach to Standardization
MammoClean’s process for standardizing and harmonizing data for deep learning models involves three main stages:
First, **case selection and metadata standardization** ensure that only complete and consistent data are used. For instance, examinations with missing views or inconsistent labels are excluded. Common information is mapped to the BI-RADS standard, and relevant features are extracted and rearranged to unify the metadata structure.
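As a sketch, this case-selection step can be expressed as a pair of group-level filters over a per-image metadata table. The column names and filtering criteria below are illustrative assumptions, not MammoClean's actual schema:

```python
import pandas as pd

# Hypothetical per-image metadata table; column names are illustrative.
meta = pd.DataFrame({
    "patient_id": ["p1", "p1", "p2", "p2", "p2", "p3"],
    "laterality": ["L", "L", "L", "L", "R", "R"],
    "view":       ["CC", "MLO", "CC", "MLO", "CC", "MLO"],
    "birads":     ["4", "4", "2", "2", "2", "0"],
})

# 1) Keep only breasts with a complete CC + MLO view pair.
views = meta.groupby(["patient_id", "laterality"])["view"].agg(set)
complete = views[views.apply(lambda s: {"CC", "MLO"} <= s)].index
meta = meta.set_index(["patient_id", "laterality"]).loc[complete].reset_index()

# 2) Drop breasts whose views disagree on the BI-RADS label.
n_labels = meta.groupby(["patient_id", "laterality"])["birads"].nunique()
consistent = n_labels[n_labels == 1].index
meta = meta.set_index(["patient_id", "laterality"]).loc[consistent].reset_index()
# Only p1/L and p2/L survive: p2/R and p3/R each lack a complete view pair.
```

Filtering at the breast level (patient plus laterality) rather than the image level reflects the multi-view structure the framework enforces.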
Second, **image data processing** begins with normalizing images to a consistent bit depth and dynamic range. Each image is then checked for laterality (left/right breast), and right-laterality images are horizontally flipped for uniform orientation. An additional procedure identifies and corrects images with ‘flipped intensity,’ where the background appears brighter than the breast tissue. These issues are common: approximately 28% of CBIS-DDSM images show flipped laterality, and about 23% of VinDr-Mammo images exhibit flipped-intensity artifacts.
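A minimal sketch of this stage, assuming simple min-max normalization and a border-versus-center brightness heuristic for the flipped-intensity check (the paper's exact algorithms may differ):

```python
import numpy as np

def standardize(img: np.ndarray, laterality: str) -> np.ndarray:
    """Illustrative per-image standardization, not MammoClean's exact method."""
    # Min-max normalize to a consistent 8-bit dynamic range.
    img = img.astype(np.float32)
    img = (img - img.min()) / max(img.max() - img.min(), 1e-8)
    img = (img * 255).astype(np.uint8)

    # Flipped-intensity heuristic: in a correctly rendered mammogram the
    # background (image border) is darker than the tissue (image center).
    h, w = img.shape
    border = np.concatenate([img[0], img[-1], img[:, 0], img[:, -1]]).mean()
    center = img[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4].mean()
    if border > center:
        img = 255 - img  # invert so the tissue is bright again

    # Mirror right-laterality images for a uniform orientation.
    if laterality == "R":
        img = np.fliplr(img)
    return img
```

In practice a pipeline like this would run on decoded DICOM pixel arrays, but the same logic applies to any 2D grayscale array.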
Third, **unified data storing** ensures that both metadata and imaging data are stored in a standardized format to enhance reproducibility. The unified metadata files include detailed information such as patient ID, image ID, laterality, view, age, breast density, diagnosis, BI-RADS assessment, and characteristics of abnormalities like mass shape, margin, and density, or calcification morphology and distribution.
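For illustration, a unified metadata row covering the attributes listed above might be written like this; the exact column names are assumptions, as the paper's schema is not reproduced here:

```python
import csv

# Illustrative unified schema following the attributes the paper lists.
FIELDS = [
    "patient_id", "image_id", "laterality", "view", "age",
    "breast_density", "diagnosis", "birads",
    "mass_shape", "mass_margin", "mass_density",
    "calc_morphology", "calc_distribution",
]

row = {
    "patient_id": "p001", "image_id": "p001_L_CC", "laterality": "L",
    "view": "CC", "age": 52, "breast_density": "B",
    "diagnosis": "benign", "birads": "3",
    "mass_shape": "oval", "mass_margin": "circumscribed",
    "mass_density": "equal", "calc_morphology": "", "calc_distribution": "",
}

# Write one standardized CSV that downstream training code can rely on.
with open("unified_metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(row)
```

A fixed, documented schema like this is what makes multi-dataset corpora possible: every source dataset is mapped onto the same columns before training.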
Insights from Bias Analysis
The researchers conducted a detailed bias analysis on the harmonized datasets, revealing several key differences:
- Each dataset showed distinct biases in diagnostic labels and BI-RADS scores. For example, CBIS-DDSM had a balance of benign and malignant cases but was skewed towards BI-RADS 4, while VinDr-Mammo had a significant imbalance with over 90% of studies in BI-RADS 1 and 2.
- Breast density distributions varied significantly across datasets, reflecting known ethnic differences. For instance, category B was dominant in CBIS-DDSM (US), while category C was dominant in TOMPEI-CMMD (China) and VinDr-Mammo (Vietnam). Higher breast density makes abnormality detection more challenging.
- The distribution of abnormalities also varied. In CBIS-DDSM, mass and calcification cases were balanced, but in TOMPEI-CMMD, mass cases were about twice as common as calcification cases. Discrepancies were also noted between BI-RADS assessments and biopsy-confirmed diagnoses, highlighting that BI-RADS categories alone cannot reliably stand in for diagnostic labels.
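One simple way to quantify distributional shifts like those above is the total variation distance between per-dataset label frequencies. The counts below are invented for illustration and are not the paper's figures:

```python
from collections import Counter

def distribution(labels):
    """Normalized label frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

# Toy BI-RADS density labels for two datasets (invented counts):
# one B-dominant, one C-dominant, mimicking the kind of shift reported.
ds_a = ["A"] * 10 + ["B"] * 50 + ["C"] * 30 + ["D"] * 10
ds_b = ["A"] * 5 + ["B"] * 25 + ["C"] * 55 + ["D"] * 15

shift = total_variation(distribution(ds_a), distribution(ds_b))
```

A shift near 0 means the datasets agree on label frequencies; values approaching 1 signal the kind of imbalance that makes single-dataset models overfit to dominant groups.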
These findings underscore the importance of using multiple datasets and employing bias-aware training and evaluation strategies to prevent AI models from overfitting to dominant groups or producing misleading results.
Future Directions for AI in Mammography
While MammoClean is a significant step, the paper also outlines future directions. It emphasizes the need for subgroup-specific performance assessments, moving beyond overall model evaluations to understand how AI performs across different breast densities, patient ages, populations, and disease types. This will help identify and mitigate model weaknesses.
Another crucial area is developing clinically aligned AI decision-making. Current AI models often rely solely on imaging data, unlike radiologists who integrate patient symptoms, age, and family history. Future AI systems should emulate this reasoning, providing transparent explanations for their decisions rather than just confident predictions. This transparency, combined with human-in-the-loop strategies, could greatly improve trust and adoption in clinical practice.
The paper also calls for future standard datasets to include more longitudinal data (time-ordered studies), detailed lesion descriptors, and richer metadata. Broader demographic representation is also essential for generalizable and clinically applicable AI systems. The integration of multimodal data (ultrasound, MRI, DBT, clinical history) and the emergence of large-scale foundation models are promising avenues for more robust risk assessment and early detection. However, these advances also raise questions about computational cost, fairness, and interpretability, requiring rigorous evaluation and standardized protocols.
In conclusion, MammoClean provides a foundational framework for harmonizing mammography data, addressing critical inconsistencies, and enabling the development of more reliable and equitable AI tools for breast cancer screening and diagnosis. You can read the full research paper here: MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization.