TLDR: A study evaluated vision foundation models (like SAM and DINO-v2) for zero-shot breast MRI registration across various challenging tasks. While these models excel at aligning large structures and handling different image types, they struggle with fine anatomical details. Surprisingly, models pre-trained on medical data didn’t consistently outperform those trained on natural images, suggesting more research is needed to optimize their performance for specific medical applications.
Medical image registration is a crucial process in healthcare, enabling doctors to accurately track changes in tumors, plan surgeries, and compare images taken at different times or with different equipment. However, aligning breast MRI scans is particularly challenging because of natural variations in breast anatomy, deformations caused by patient positioning, and the intricate, delicate structure of the fibroglandular tissue within the breast. Traditionally, registration has relied on complex optimization-based algorithms or on deep learning methods that require extensive, task-specific training data.
Recently, a new class of artificial intelligence models, known as foundation models, has emerged. These models are pre-trained on vast datasets and produce rich feature representations of images. They have shown great promise in various tasks, including zero-shot image registration, where a model is used to align images without ever being trained on registration examples. However, most evaluations of these models have focused on more rigid or less complex anatomy such as the brain or abdominal organs, leaving their effectiveness for highly deformable anatomy like the breast largely unexplored.
A recent study, titled “Are Vision Foundation Models Ready for Out-of-the-Box Medical Image Registration?”, addresses this gap. Conducted by researchers Hanxue Gu, Yaqian Chen, Nicholas Konz, Qihang Li, and Maciej A. Mazurowski from Duke University, the study provides a comprehensive evaluation of how well foundation models perform in breast MRI registration.
The researchers assessed five different pre-trained foundation models: DINO-v2, SAM, MedSAM, SSLSAM, and MedCLIP-SAM. These models vary in their pre-training strategies and in whether they were trained on natural images or medical images. The study implemented a flexible, training-free pipeline in which each model extracts semantic features from the MRI volumes, the features are reduced in dimensionality, and a deformable registration is then performed on the reduced features without any additional training or fine-tuning.
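The pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: random arrays stand in for the output of a frozen encoder such as SAM or DINO-v2, PCA is assumed as the reduction step because the article does not name one, and a brute-force circular shift search stands in for the deformable optimizer. The function names `fit_pca`, `project`, and `best_shift` are hypothetical.

```python
import numpy as np

def fit_pca(flat_feats, k):
    """Return the mean and top-k principal directions of (N, C) features."""
    mean = flat_feats.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(flat_feats - mean, full_matrices=False)
    return mean, vt[:k].T  # basis has shape (C, k)

def project(feats, mean, basis):
    """Reduce (..., C) features to (..., k) with a fitted PCA basis."""
    return (feats - mean) @ basis

def best_shift(fixed, moving, max_shift):
    """Toy registration: the circular row shift that minimizes feature MSE."""
    best, best_cost = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        cost = np.mean((fixed - np.roll(moving, s, axis=0)) ** 2)
        if cost < best_cost:
            best, best_cost = s, cost
    return best

# Assumption: stand-in for encoder output, a 32x32 grid of 16-dim features.
rng = np.random.default_rng(0)
fixed_feats = rng.normal(size=(32, 32, 16))
moving_feats = np.roll(fixed_feats, -3, axis=0)  # same content, moved 3 rows

# Fit one PCA on both images so their reduced features share a basis.
both = np.concatenate([fixed_feats.reshape(-1, 16),
                       moving_feats.reshape(-1, 16)])
mean, basis = fit_pca(both, k=4)
fixed_red = project(fixed_feats, mean, basis)
moving_red = project(moving_feats, mean, basis)

print(best_shift(fixed_red, moving_red, max_shift=5))  # 3
```

The point of the sketch is the division of labor: the encoder is never updated, and all alignment happens by optimizing a transform over the reduced feature maps. A real pipeline would replace the shift search with a dense deformation model.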
To thoroughly test the models, four challenging breast registration tasks were designed:
Key Registration Tasks
- Registering breast MRI scans taken at different dates or years but with the same image sequence.
- Aligning longitudinal breast MRI exams with different image sequences.
- Tracking lesions by registering an image with a lesion to one without a lesion, evaluating if the model preserves the lesion’s characteristics.
- Registering PET-CT scans to MRI scans, a particularly difficult task due to different imaging modalities and significant breast deformation from patient positioning.
The results revealed several interesting findings. Foundation models, especially SAM (Segment Anything Model), showed superior performance in aligning large structures, such as the overall breast contour. For cross-sequence registration, SAM significantly outperformed traditional optimization-based methods, indicating that the features extracted by these models are robust to changes in image appearance.
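The article does not say which metric was used to score large-structure alignment. One common choice for comparing a warped segmentation mask with its target is the Dice coefficient, so the following is a minimal sketch under that assumption; the toy masks and the `dice` helper are illustrative only.

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / total

# Toy example: a 4x4 "breast" mask and a warped copy that is off by one row.
target = np.zeros((8, 8), dtype=bool)
target[2:6, 2:6] = True
warped = np.zeros((8, 8), dtype=bool)
warped[3:7, 2:6] = True

print(dice(target, warped))  # 0.75
```

A score like this rewards agreement on large regions, which is why a method can do well on the breast contour while still misaligning thin internal structures.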
However, the study also highlighted limitations. Foundation models struggled to capture the fine details of fibroglandular tissue (FGT), which is crucial for accurate internal structure alignment. This suggests that while they excel at global alignment, preserving fine-grained anatomical details remains a challenge. Surprisingly, models that underwent additional pre-training or fine-tuning on medical or breast-specific images, such as MedSAM and SSLSAM, did not consistently improve registration performance and, in some cases, even decreased it. This could be due to the relatively smaller datasets used for medical pre-training compared to the massive datasets used for natural image pre-training, leading to less generalizable features.
For lesion tracking, DINO-v2 performed best at preserving lesion size, while MedSAM performed poorly. In the most challenging task, PET-CT to MRI registration, foundation models demonstrated a clear advantage. Traditional methods often failed to align organs between CT and MRI, whereas SAM successfully registered the images despite significant shape differences, confirming the strength of foundation model features in handling large domain gaps.
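How lesion size preservation was measured is not detailed here. One simple way to express it, assumed for illustration, is the ratio of lesion volume after registration to lesion volume before it, computed from binary lesion masks and voxel spacing. The helper names and toy masks below are hypothetical.

```python
import numpy as np

def lesion_volume_mm3(mask, spacing_mm):
    """Lesion volume: number of lesion voxels times the volume of one voxel."""
    return float(np.count_nonzero(mask) * np.prod(spacing_mm))

def size_preservation_ratio(mask_before, mask_after, spacing_before, spacing_after):
    """Volume after warping divided by volume before; 1.0 means size preserved."""
    v_before = lesion_volume_mm3(mask_before, spacing_before)
    if v_before == 0:
        raise ValueError("lesion mask is empty before registration")
    return lesion_volume_mm3(mask_after, spacing_after) / v_before

# Toy example: a 4x4x4 lesion that the warp compresses to 4x4x3 voxels.
before = np.zeros((10, 10, 10), dtype=bool)
before[3:7, 3:7, 3:7] = True
after = np.zeros((10, 10, 10), dtype=bool)
after[3:7, 3:7, 3:6] = True

ratio = size_preservation_ratio(before, after, (1.0, 1.0, 1.0), (1.0, 1.0, 1.0))
print(ratio)  # 0.75
```

A ratio well below 1.0 would indicate that the registration shrank the lesion to make the image look more like the lesion-free target, which is the failure this task is designed to expose.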
In conclusion, this research indicates that vision foundation models, particularly those pre-trained on natural images such as SAM and DINO-v2, perform strongly on large-structure alignment in breast MRI. Their current limitation lies in accurately preserving fine anatomical details. The study points to an important direction for future research: developing strategies that preserve fine-grained information within the feature representations of foundation models, making them more versatile and precise for complex medical imaging applications.


