TLDR: T-SYNTH is a new open-source dataset of synthetic 2D digital mammography (DM) and 3D digital breast tomosynthesis (DBT) images. It uses physics simulations to generate realistic breast images with detailed pixel-level annotations, addressing the scarcity of large, annotated real medical imaging datasets. The dataset shows promise for augmenting limited real patient data, improving lesion detection, and enabling subgroup analysis in breast cancer screening AI development.
Developing robust artificial intelligence algorithms for medical imaging, especially in areas like breast cancer detection, faces a significant hurdle: the scarcity of large, well-annotated datasets. Patient privacy concerns, high costs, and the labor-intensive work of obtaining detailed annotations from medical specialists make it difficult to gather enough real-world data. This limitation hinders both the development and the assessment of AI models.
To address this need, researchers from the U.S. Food and Drug Administration have introduced T-SYNTH, an open-source dataset of synthetic breast images. The dataset uses physics simulations to generate realistic 2D digital mammography (DM) and 3D digital breast tomosynthesis (DBT) images, complete with pixel-level segmentation annotations that are difficult to acquire from real patient data.
What is T-SYNTH?
T-SYNTH is built upon a knowledge-based (KB) model, which inherently incorporates domain expertise from physics and biology to procedurally create images and their corresponding annotations. Unlike generative AI models that learn from existing data and can perpetuate biases or errors, KB models offer precise control over the characteristics of the generated samples. This allows researchers to create balanced datasets and analyze specific subgroups, such as different lesion sizes or breast densities, which is often not possible with real patient data due to its inherent imbalances.
The dataset includes paired DM and DBT images, offering a comprehensive resource for various breast imaging analysis tasks. It provides pixel-level segmentation and bounding boxes for a variety of breast tissues, including lesions, glandular and adipose tissue, skin, ligaments, ducts, and veins. This rich annotation is crucial for training AI models for tasks like mass detection and segmentation, which are vital for improving patient outcomes and clinical workflow efficiency.
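To make the relationship between the two annotation types concrete, here is a minimal sketch of how a bounding box can be derived from a pixel-level mask. It is an illustration only: the mask layout (rows of 0/1 values) and the function name `mask_to_bbox` are assumptions, not T-SYNTH's actual file format or tooling.

```python
from typing import List, Optional, Tuple

# Assumed representation: a 2D mask as rows of 0/1 values, 1 = lesion pixel.
Mask = List[List[int]]


def mask_to_bbox(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return (row_min, col_min, row_max, col_max), inclusive.

    Returns None when the mask contains no foreground pixels.
    """
    # Rows that contain at least one foreground pixel, in ascending order.
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column indices of every foreground pixel.
    cols = [c for row in mask for c, value in enumerate(row) if value]
    return rows[0], min(cols), rows[-1], max(cols)


# Toy mask for illustration; not taken from the dataset.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_bbox(toy_mask))
```

Because the box follows directly from the mask, a dataset that ships pixel-level masks can serve both segmentation and detection tasks.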
Key Contributions and Potential Uses
The T-SYNTH dataset offers several significant contributions:
- It provides a large-scale public dataset of paired DM and DBT images with detailed pixel-level annotations, derived from a knowledge-based model.
- It enables robust subgroup analysis, demonstrating how AI models perform across different breast densities, lesion sizes, and lesion densities. For instance, experiments show that less dense lesions are harder to detect, aligning with clinical expectations.
- It proves valuable for data augmentation. When combined with limited real patient data, T-SYNTH can significantly improve the performance of detection models, especially for underrepresented subgroups in patient datasets.
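The subgroup analysis described above amounts to computing a detection metric separately for each subgroup. The sketch below shows one way to do that for per-lesion sensitivity. The record format, the subgroup labels, and the toy values are all hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sensitivity_by_subgroup(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the fraction of lesions detected within each subgroup.

    Each record is (subgroup_label, was_detected) for one ground-truth lesion.
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for subgroup, detected in records:
        totals[subgroup] += 1
        hits[subgroup] += int(detected)
    return {group: hits[group] / totals[group] for group in totals}


# Made-up records for illustration only.
toy_records = [
    ("low_density_lesion", True),
    ("low_density_lesion", False),
    ("high_density_lesion", True),
    ("high_density_lesion", True),
]
print(sensitivity_by_subgroup(toy_records))
```

With synthetic data, the subgroup label is known exactly for every lesion because it is an input to the simulation, which is what makes this kind of breakdown straightforward.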
The data and code for T-SYNTH are publicly available, encouraging widespread use and collaboration within the medical imaging AI community. This promotes transparency, reproducibility, and fairness in research, as it provides a standardized, controlled environment for testing and developing algorithms without patient-identifiable information.
How T-SYNTH is Generated
The synthetic images in T-SYNTH are generated using the open-source Virtual Imaging Clinical Trial for Regulatory Evaluation (VICTRE) pipeline. This pipeline simulates DM and DBT images while varying breast density and mass properties. It incorporates a computational model for lesion growth, accounting for factors like tissue stiffness and tumor development phases, to create realistic and biologically relevant lesion morphologies.
The acquisition model replicates real-world mammography and tomosynthesis systems so that the synthetic images resemble those obtained in clinical settings. As a result, T-SYNTH images exhibit visual trends consistent with established medical literature, such as lesions being less distinct in higher breast density categories.
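Varying simulation parameters in a controlled way is what allows a balanced dataset. As a rough sketch of the idea, the code below enumerates every combination of a few parameters and assigns the same number of cases to each. This is not VICTRE's interface: the parameter names, the category labels, and the numeric values are hypothetical placeholders.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Union

# Hypothetical parameter values, chosen only for illustration.
BREAST_DENSITIES = ["fatty", "scattered", "heterogeneous", "dense"]
LESION_RADII_MM = [5.0, 7.0, 9.0]
LESION_DENSITY_FACTORS = [1.0, 1.06, 1.1]

Config = Dict[str, Union[str, float, int]]


def balanced_configs(cases_per_cell: int) -> List[Config]:
    """Build one simulation config per case, with equal counts in every cell.

    A "cell" is one combination of breast density, lesion radius,
    and lesion density factor.
    """
    configs: List[Config] = []
    for density, radius, factor in product(
        BREAST_DENSITIES, LESION_RADII_MM, LESION_DENSITY_FACTORS
    ):
        for _ in range(cases_per_cell):
            configs.append(
                {
                    "breast_density": density,
                    "lesion_radius_mm": radius,
                    "lesion_density_factor": factor,
                    "seed": len(configs),  # unique seed per simulated case
                }
            )
    return configs


configs = balanced_configs(cases_per_cell=2)
print(len(configs))
print(Counter(c["breast_density"] for c in configs))
```

Each config would then be handed to the simulation, so every subgroup is equally represented by construction rather than by post-hoc resampling.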
Looking Forward
While T-SYNTH represents a major step forward, the researchers acknowledge areas for future improvement. These include bridging the “domain gap” between synthetic and real patient images, which often stems from proprietary image reconstruction algorithms in clinical systems. Future work also aims to expand the lesion model to cover abnormalities other than masses, such as calcifications, and to incorporate classifications based on severity scales like BI-RADS. Further analysis of the 3D DBT images within T-SYNTH is also planned, as robust public 3D DBT datasets with rich annotations are still scarce.
T-SYNTH is a promising resource for accelerating AI development in medical imaging, offering a controlled and richly annotated environment to train and evaluate algorithms. For more technical details, you can refer to the full research paper: T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images.


