TL;DR: This research explores how data scale affects the performance of medical imaging foundation models, specifically MedImageInsight (MI2) and RAD-DINO, when they are continually pretrained on large chest X-ray datasets. It finds that MI2 excels at finding-related tasks, while RAD-DINO is better suited to tasks involving lines and tubes. Crucially, adding structured labels to MI2's pretraining significantly boosts its performance. The study also shows that even a moderate amount of in-domain data is enough to outperform general-purpose open-weight models, underscoring the benefits of tailoring AI to specific medical institutions.
In the rapidly evolving field of artificial intelligence, foundation models have demonstrated remarkable capabilities across various domains. However, their application in medical imaging, particularly radiology, presents unique challenges. Unlike the vast, web-scale datasets used for general vision models, medical imaging datasets are typically smaller, raising questions about how data quantity and pretraining methods influence performance in this specialized context.
A recent study delves into this critical area, systematically investigating the continual pretraining of two prominent vision encoders, MedImageInsight (MI2) and RAD-DINO, on an extensive collection of chest X-rays. The research aims to understand how these models scale with data and how different pretraining approaches affect their ability to interpret complex medical images.
Understanding the Models
The study focuses on two distinct paradigms for vision encoders:
- MedImageInsight (MI2): This model takes a CLIP-style approach, learning contrastively from images and their associated text reports. It is designed to align visual features with textual descriptions, making it adept at tasks that require an understanding of radiology findings.
- RAD-DINO: Built in the DINOv2 style, this model learns purely from images via self-supervision. It excels at extracting dense visual features, which are particularly useful for tasks like segmentation and detecting continuous structures. (A minimal sketch of both objectives follows this list.)
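To make the two paradigms concrete, here is a minimal PyTorch sketch of each training objective. It is illustrative only: the actual models' temperatures, projection heads, multi-crop augmentation, and other details are simplified away.

```python
# Minimal sketch of the two pretraining objectives (illustrative, not the
# papers' exact implementations). Assumes PyTorch and batch-aligned embeddings.
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, report) embeddings.

    image_emb, text_emb: (batch, dim) tensors from the two encoders.
    Matching image/report pairs sit on the diagonal of the logit matrix.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Contrast images against reports and reports against images.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def dino_style_loss(student_logits, teacher_logits, center,
                    student_temp=0.1, teacher_temp=0.04):
    """Self-distillation between two augmented views of the same image.

    The teacher is an EMA copy of the student; `center` is a running mean
    of teacher outputs used to avoid collapse, as in DINO/DINOv2.
    """
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1)
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```

Note the structural difference: the CLIP-style loss needs paired text and so learns report-aligned global features, while the DINO-style loss needs only images and tends to produce the dense features useful for segmentation.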
Both models were continually pretrained on INST-CXR-BENCH, a large internal dataset comprising up to 3.5 million chest X-ray images paired with their corresponding radiology reports from a single institution. This controlled environment allowed researchers to precisely study the impact of increasing data scale while keeping other factors constant.
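As an aside on the setup, a loader for such image-report pairs might look like the sketch below. INST-CXR-BENCH is an internal dataset, so the manifest columns and file layout here are assumptions made purely for illustration.

```python
# Hypothetical loader for (chest X-ray, report) pairs; the CSV schema and
# file layout are assumptions, not the study's actual data format.
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class CxrReportDataset(Dataset):
    def __init__(self, manifest_csv, transform=None):
        # Assumed manifest: one row per study, with an image path
        # and the free-text radiology report.
        self.rows = pd.read_csv(manifest_csv)
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows.iloc[idx]
        image = Image.open(row["image_path"]).convert("L")  # grayscale CXR
        if self.transform is not None:
            image = self.transform(image)
        return image, row["report_text"]
```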
Key Findings and Insights
The evaluation covered a diverse range of tasks, including classifying radiology findings, identifying lines and tubes (such as catheters and drains), segmenting these lines and tubes, and generating radiology reports. The results revealed several important insights:
- Complementary Strengths: MI2 demonstrated superior performance on tasks related to identifying general radiology findings, such as pneumothorax or cardiomegaly. In contrast, RAD-DINO proved more effective for tasks involving lines and tubes, which require the model to extract features that preserve continuity along elongated structures.
- Value of Structured Supervision: A surprising finding was that continually pretraining MI2 with both radiology reports and structured labels (such as the presence of specific tubes) significantly improved its performance. This highlights the importance of incorporating structured supervision, even when millions of image-report pairs are available.
- Efficiency of In-Domain Data: For some tasks, as few as 30,000 in-domain samples for continual pretraining were sufficient to surpass open-weight foundation models. This underscores the value for medical institutions of leveraging their own patient data to tailor AI models to their specific needs and populations.
- Scaling Laws and Limitations: Clear scaling laws were observed, indicating predictable performance gains with more data, but the study also noted deviations: performance could be noisy with small datasets and sometimes plateaued with very large ones. Domain shift, such as applying models trained on one hospital's data to another's, further complicated these trends, emphasizing the need for larger, multi-center benchmark datasets. (A sketch of fitting such a scaling curve follows this list.)
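Scaling behavior of this kind is commonly summarized with a saturating power law. The sketch below fits one with SciPy; the functional form and the data points are illustrative assumptions, not the study's actual numbers.

```python
# Illustrative fit of a saturating power law, err(N) ~ a * N^(-b) + c,
# to performance-vs-data points. The sample values below are made up;
# the study's actual measurements and functional form may differ.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * np.power(n, -b) + c

# Hypothetical (pretraining examples, error rate) observations.
n_samples = np.array([3e4, 1e5, 3e5, 1e6, 3.5e6])
error = np.array([0.21, 0.18, 0.16, 0.15, 0.148])

params, _ = curve_fit(power_law, n_samples, error,
                      p0=(1.0, 0.3, 0.1), maxfev=10000)
a, b, c = params
print(f"fit: err(N) = {a:.3f} * N^(-{b:.3f}) + {c:.3f}")
# The plateau term c captures the diminishing returns seen at very large N.
```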
Implications for Medical AI
The research concludes that continual pretraining of open-weight models on large-scale, institution-specific chest X-ray datasets can lead to significantly improved vision encoders. This approach empowers medical centers to develop specialized foundation models that are finely tuned to their unique patient demographics and imaging protocols.
The findings suggest that MI2, trained with the UniCL framework and automated label extraction, offers a highly effective strategy for medical centers looking to train foundation vision encoders on their proprietary data (a simplified sketch of the UniCL objective appears below). This work paves the way for more accurate and reliable AI tools in radiology, ultimately benefiting patient care.
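For readers curious what UniCL adds over plain CLIP-style training: samples that share a structured label are treated as additional positives in the contrastive target matrix, rather than only the diagonal image-report pairs. The single-integer labels in this sketch are a simplification for illustration, not the paper's implementation.

```python
# Sketch of a UniCL-style loss: any two samples sharing a structured label
# (e.g. "chest tube present") count as positives, not just paired
# image-report entries on the diagonal. Simplified to one label per sample.
import torch
import torch.nn.functional as F

def unicl_loss(image_emb, text_emb, labels, temperature=0.07):
    """labels: (batch,) integer codes extracted from reports or structured fields."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch)

    # Many-to-many positives: same label => positive pair (diagonal included).
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    targets = positives / positives.sum(dim=1, keepdim=True)

    # Label equality is symmetric, so one target matrix serves both directions.
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```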
For more detailed information, you can read the full research paper here.