TLDR: SynDiff is a new framework that uses text-guided synthetic data generation to augment limited medical datasets and a single-step diffusion model for efficient, real-time polyp segmentation. It achieves high accuracy (96.0% Dice) and significant speedup (0.08s inference) on the CVC-ClinicDB dataset, addressing both data scarcity and computational challenges in clinical settings.
Medical image analysis plays a vital role in modern healthcare, aiding precise diagnosis and treatment planning. A particularly important area is the automated detection of polyps in gastrointestinal endoscopy, which can significantly improve colorectal cancer screening. However, a major hurdle in developing robust medical image segmentation systems is the scarcity of high-quality annotated data: medical datasets are often small due to privacy concerns, the high cost of expert annotation, and the time-intensive process of outlining lesion boundaries.
Traditional augmentation techniques, such as geometric transformations, only reshuffle existing images; they cannot create genuinely new disease presentations, which models need in order to generalize well. While newer generative models like GANs have shown promise, they often struggle with controllability and consistency. Diffusion models have emerged as powerful tools for image generation, and text-guided variants allow new data to be created from clinical descriptions. However, these models typically require many iterative denoising steps, making them too slow for real-time clinical use.
Addressing these challenges, researchers Muhammad Aqeel, Maham Nazir, Zanxi Ruan, and Francesco Setti have introduced SynDiff, a novel framework designed to overcome both data scarcity and computational inefficiency in biomedical image segmentation. SynDiff combines text-guided synthetic data generation with an efficient, single-step diffusion-based segmentation approach. This innovative method leverages latent diffusion models to create realistic synthetic polyps, guided by text descriptions, effectively expanding limited training datasets with diverse and clinically relevant samples.
How SynDiff Works
SynDiff operates in two main phases. First, it generates synthetic data offline using Stable Diffusion XL (SDXL) inpainting. This process takes a normal endoscopic image, a specific text description (e.g., “small sessile polyp with irregular surface texture”), and a binary mask indicating where the polyp should appear. The text prompt guides the generation, ensuring the synthetic polyps are clinically realistic and varied. The binary mask simultaneously serves as the ground truth label for the newly generated image, providing valuable training data.
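This generation step builds on standard SDXL inpainting, so it can be sketched with the Hugging Face diffusers library. The checkpoint name, prompt, and parameter values below are illustrative assumptions, not the authors' exact configuration:

```python
import torch
from diffusers import StableDiffusionXLInpaintPipeline
from PIL import Image

# Load an SDXL inpainting pipeline (checkpoint is an assumption;
# the paper's exact model and settings may differ).
pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

# Inputs: a normal endoscopic frame and a binary mask marking
# where the synthetic polyp should appear.
image = Image.open("normal_endoscopy.png").convert("RGB").resize((1024, 1024))
mask = Image.open("polyp_location_mask.png").convert("L").resize((1024, 1024))

# Clinically phrased prompt steering the lesion's appearance.
prompt = "small sessile polyp with irregular surface texture"

synthetic = pipe(
    prompt=prompt,
    image=image,
    mask_image=mask,
    strength=0.99,             # repaint the masked region almost entirely
    num_inference_steps=40,
).images[0]

# The inpainting mask doubles as the segmentation ground truth
# for the newly generated image.
synthetic.save("synthetic_polyp.png")
```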
The second phase uses a direct latent estimation technique for segmentation. Unlike traditional diffusion methods, which require many iterative denoising steps, SynDiff infers the segmentation mask in a single forward pass. This single-step inference yields a theoretical speedup proportional to the number of denoising steps it eliminates, making SynDiff suitable for real-time clinical deployment without sacrificing performance.
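The paper's exact estimator is not spelled out in this summary, so the PyTorch sketch below only illustrates the general idea of direct latent estimation; `encoder`, `denoiser`, and `decoder` are hypothetical placeholder modules, not SynDiff's actual components:

```python
import torch

@torch.no_grad()
def single_step_segment(image, encoder, denoiser, decoder, t_fixed=999):
    """Hypothetical single-step inference: one denoiser call instead of
    an iterative reverse-diffusion loop."""
    # Encode the endoscopic image into the latent space.
    cond = encoder(image)                      # conditioning latent

    # Start from pure noise at a fixed (maximal) timestep...
    z_t = torch.randn_like(cond)
    t = torch.full((image.size(0),), t_fixed, device=image.device)

    # ...and estimate the clean mask latent directly, in one pass,
    # instead of looping t = T, T-1, ..., 1.
    z_0 = denoiser(z_t, t, cond)

    # Decode the latent into a binary segmentation mask.
    return decoder(z_0).sigmoid() > 0.5
```

An iterative baseline would wrap the `denoiser` call in a loop over the full timestep schedule; collapsing that loop into a single call is what makes the sub-0.1 s latency reported below possible.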
Performance and Impact
SynDiff was rigorously evaluated on the CVC-ClinicDB dataset, a collection of colonoscopy images with precise polyp annotations. The framework achieved impressive results, with a Dice coefficient of 96.0% and an Intersection over Union (IoU) of 92.9%. These metrics indicate high accuracy in segmenting polyps. Furthermore, SynDiff demonstrated superior boundary quality, with a Hausdorff Distance at 95th percentile (HD95) of 7.2 mm, which is critical for accurate surgical planning.
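For context, Dice and IoU both measure the overlap between the predicted and ground-truth masks; a minimal NumPy implementation (ours, not the paper's) is:

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, gt: np.ndarray):
    """Overlap metrics for binary masks (1 = polyp, 0 = background).
    Assumes at least one of the masks is non-empty."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    dice = 2 * intersection / (pred.sum() + gt.sum())
    iou = intersection / np.logical_or(pred, gt).sum()
    return dice, iou
```

For a single mask pair the two are related by Dice = 2·IoU / (1 + IoU); averaged over a dataset the relation holds only approximately, which is why the reported 96.0% and 92.9% need not match it exactly.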
A key finding from the research is the significant computational efficiency. SynDiff completes inference in just 0.08 seconds, a remarkable 22-28 times faster than existing diffusion-based methods that typically take 1.8-2.3 seconds. This speed makes it a viable solution for real-time applications in resource-limited medical settings.
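As a quick sanity check, the quoted speedup range follows directly from the reported latencies:

```python
# Reported baseline latencies (s) vs. SynDiff's 0.08 s inference time.
for baseline in (1.8, 2.3):
    print(f"{baseline} s / 0.08 s = {baseline / 0.08:.1f}x")
# -> 22.5x and 28.8x, i.e. roughly the quoted 22-28x range.
```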
The study also highlighted the effectiveness of text-guided data augmentation. Adding just 100 synthetic samples (roughly 20% of the real training data) gave the best performance, showing that controlled synthetic augmentation improves segmentation robustness without introducing distribution shift. This approach significantly outperformed traditional geometric augmentation and GAN-based synthesis.
In conclusion, SynDiff represents a significant step forward in medical image segmentation. By bridging the gap between data-hungry deep learning models and clinical constraints, it offers an efficient and robust solution for deployment in healthcare. For more technical details, you can refer to the full research paper available at arXiv:2507.15361.