TL;DR: A study used Representational Similarity Analysis to compare six computational pathology foundation models, finding that models with the same training paradigm don’t always have similar internal representations. All models showed high slide-dependence but low disease-dependence, with stain normalization reducing slide-dependence. Vision-language models had more compact representations, while vision-only models were more distributed. These insights can improve model robustness and inform ensembling strategies.
The field of computational pathology (CPath) is rapidly advancing with the development of “foundation models.” These powerful AI models are designed to learn from vast datasets and then apply that knowledge to various tasks, such as identifying tumor types or predicting disease progression. While many studies have focused on how well these models perform on specific tasks, less is known about the underlying structure of the information they learn and how similar or different these structures are across various models.
A recent study delves into this very question, systematically analyzing the “representational spaces” of six prominent CPath foundation models. Think of representational space as the internal map a model creates to understand and categorize the complex visual information from tissue slides. The researchers used a technique called Representational Similarity Analysis (RSA), which is commonly used in computational neuroscience to compare how different parts of the brain process information.
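To make the RSA idea concrete, here is a minimal sketch of how two models' representations can be compared. The data is synthetic and the specific distance and correlation choices (correlation distance for the dissimilarity matrices, Spearman rank correlation between them) are common RSA defaults, not necessarily the exact settings used in the paper:

```python
import numpy as np

def rdm(embeddings):
    """Representational dissimilarity matrix: 1 - Pearson correlation
    between every pair of patch embeddings (upper triangle, flattened)."""
    corr = np.corrcoef(embeddings)
    iu = np.triu_indices_from(corr, k=1)
    return 1.0 - corr[iu]

def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation
    of the two vectors' ranks (assumes no ties, fine for toy data)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def rsa_similarity(emb_a, emb_b):
    """Second-order similarity: rank-correlate the two models' RDMs
    computed over the same set of image patches."""
    return spearman(rdm(emb_a), rdm(emb_b))

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(20, 64))                     # model A: 20 patches, 64-dim
emb_b = emb_a + rng.normal(scale=0.1, size=(20, 64))  # model B: a noisy copy
print(rsa_similarity(emb_a, emb_a))  # 1.0 by construction
print(rsa_similarity(emb_a, emb_b))  # high, but below 1.0
```

The key point is that RSA compares models at the level of *pairwise relationships* between stimuli, so it works even when the two models have different embedding dimensions.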
The study examined models that use two main learning strategies: vision-language contrastive learning (like CONCH, PLIP, and KEEP), which learns by associating images with text descriptions, and self-distillation (like UNI v2, Virchow v2, and Prov-GigaPath), which learns by refining its own understanding of visual data. They used H&E stained image patches from The Cancer Genome Atlas (TCGA) to conduct their analysis.
One of the key findings was that UNI v2 and Virchow v2, both vision-only models, had the most distinct internal representations. Surprisingly, sharing a training approach (e.g., both being vision-only or both vision-language) didn’t guarantee that two models would have similar internal structures. For instance, Prov-GigaPath, a vision-only model, showed the highest average similarity across all models, including the vision-language ones.
The research also highlighted a significant “slide-dependence” in all models’ representations. This means that the models’ internal maps were heavily influenced by individual tissue slides, rather than just the disease type. While this might be useful for some tasks, it also suggests a potential lack of robustness to variations between slides, such as those caused by different hospitals or staining protocols. Interestingly, applying a technique called “stain normalization” (which standardizes the appearance of tissue stains) significantly reduced this slide-dependence, improving robustness.
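As a rough illustration of what stain normalization does, the toy function below matches each color channel's mean and standard deviation to a reference patch. This is a deliberate simplification: production methods such as Reinhard or Macenko normalization work in LAB or stain-vector space rather than raw RGB, and the images here are random arrays, not real tissue:

```python
import numpy as np

def match_channel_stats(image, target, eps=1e-8):
    """Toy stain-normalization stand-in: shift/scale each RGB channel of
    `image` so its mean and std match those of the `target` reference."""
    img = image.astype(np.float64)
    tgt = target.astype(np.float64)
    out = np.empty_like(img)
    for c in range(img.shape[-1]):
        mu_i, sd_i = img[..., c].mean(), img[..., c].std()
        mu_t, sd_t = tgt[..., c].mean(), tgt[..., c].std()
        out[..., c] = (img[..., c] - mu_i) / (sd_i + eps) * sd_t + mu_t
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(1)
target = rng.integers(100, 200, size=(32, 32, 3))  # "reference slide" patch
source = rng.integers(0, 120, size=(32, 32, 3))    # differently "stained" patch
normalized = match_channel_stats(source, target)
# per-channel statistics of `normalized` now approximate those of `target`
```

By pulling every slide toward a common color distribution before feature extraction, this kind of preprocessing removes exactly the slide-level appearance variation that the study found dominating the models' representations.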
Conversely, the models showed relatively low “disease-dependence.” This might seem counterintuitive given their strong performance in classifying tumor types. However, the researchers suggest that while the overall representations might vary, specific combinations of features crucial for disease classification could remain stable.
When looking at the “intrinsic dimensionality” of the representations, vision-language models tended to have more compact, lower-dimensional representations. This could be because the language component acts as a “bottleneck,” encouraging the model to compress visual information into a more concise form. Vision-only models, on the other hand, had more distributed, higher-dimensional representations, potentially preserving richer visual details. This difference in dimensionality might also contribute to the generally higher performance observed in vision-only models, though their larger training datasets could also play a role.
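One simple proxy for intrinsic dimensionality, sketched below on synthetic embeddings, is the number of principal components needed to explain most of the variance. This PCA-style estimate is only one of several estimators and is not necessarily the one used in the paper:

```python
import numpy as np

def effective_dimensionality(embeddings, variance_threshold=0.95):
    """Crude intrinsic-dimensionality proxy: the number of principal
    components needed to explain `variance_threshold` of the variance."""
    centered = embeddings - embeddings.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values
    explained = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(explained), variance_threshold) + 1)

rng = np.random.default_rng(2)
# "compact" embedding: 512-dim vectors that really span only ~10 directions,
# loosely analogous to a vision-language model's bottlenecked features
basis = rng.normal(size=(10, 512))
compact = rng.normal(size=(200, 10)) @ basis
# "distributed" embedding: full-rank noise, analogous to a vision-only model
distributed = rng.normal(size=(200, 512))
print(effective_dimensionality(compact))      # close to 10
print(effective_dimensionality(distributed))  # much larger
```

The contrast between the two toy embeddings mirrors the study's finding: the same ambient dimension can hide very different effective dimensionalities.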
The implications of these findings are significant for the future of computational pathology. The high slide-specificity points to a need for models that are more robust to variations in data. Techniques like data augmentation or adversarial learning during training, and stain normalization during inference, could help address this. Understanding the similarities and differences between models can also guide “ensembling strategies,” where combining different models can improve performance. Instead of combining many similar models, focusing on more dissimilar, complementary ones could be more effective.
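One way such RSA scores could feed into ensemble design is a greedy diversity selection: repeatedly add the model least similar, on average, to those already chosen. The similarity matrix below is made up for illustration and does not reproduce the paper's actual numbers:

```python
import numpy as np

def pick_diverse_models(similarity, k):
    """Greedy ensemble selection: start from model 0, then repeatedly add
    the model with the lowest average similarity to the models chosen so
    far. `similarity` is a symmetric model-by-model RSA matrix."""
    chosen = [0]
    while len(chosen) < k:
        remaining = [m for m in range(len(similarity)) if m not in chosen]
        avg_sim = {m: np.mean([similarity[m][c] for c in chosen])
                   for m in remaining}
        chosen.append(min(avg_sim, key=avg_sim.get))
    return chosen

# Hypothetical RSA similarities between 4 models: models 0 and 1 are
# near-duplicates, model 3 is the outlier
sim = np.array([
    [1.0, 0.9, 0.6, 0.3],
    [0.9, 1.0, 0.6, 0.3],
    [0.6, 0.6, 1.0, 0.4],
    [0.3, 0.3, 0.4, 1.0],
])
print(pick_diverse_models(sim, 3))  # picks the dissimilar models 3 and 2,
                                    # skipping model 0's near-duplicate
```

The sketch captures the article's point: an ensemble of complementary models should beat an ensemble of redundant ones of the same size.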
This study provides a valuable framework for understanding the internal workings of CPath foundation models, moving beyond just performance metrics. By probing these internal representations, researchers can develop more effective and reliable AI tools for clinical settings. You can read the full paper here: Comparing Computational Pathology Foundation Models using Representational Similarity Analysis.


