TL;DR: MVHybrid is a new AI model architecture combining State Space Models and Vision Transformers, designed to improve the prediction of spatial gene expression from routine pathology images. It achieves superior performance and robustness in biomarker prediction by better capturing subtle, low-frequency morphological patterns, outperforming existing Vision Transformer-based models and showing promise for future pathology Vision Foundation Models.
Spatial transcriptomics is a powerful technology that allows scientists to understand how genes are expressed within the context of actual tissue, rather than just in isolated cells. This capability is crucial for advancing precision oncology, such as predicting how a patient might respond to cancer treatment. However, the widespread use of spatial transcriptomics in clinical settings is currently limited by its high cost and technical complexity.
A practical alternative is to predict spatial gene expression, essentially a set of biological markers, directly from routine histopathology images. These are the standard tissue slides stained with hematoxylin and eosin (H&E) that pathologists already use for diagnosis. While Vision Foundation Models (VFMs) in pathology, often built on Vision Transformer (ViT) architectures, have shown promise, they often fall short of the accuracy needed for clinical applications in this specific area.
Researchers hypothesize that the limitations of current VFMs might stem from their architectural design. Existing ViT-based models, even after being trained on millions of diverse whole slide images, tend to prioritize high-frequency features—the sharp, detailed patterns in an image. However, the subtle morphological patterns that correlate with molecular phenotypes, like gene expression, are often low-frequency features, meaning they are broader and less distinct to the human eye.
Introducing MVHybrid: A Novel Approach
A new study introduces MVHybrid, a hybrid backbone architecture designed to overcome these limitations. MVHybrid combines State Space Models (SSMs) with Vision Transformers (ViTs). State Space Models are particularly adept at capturing low-frequency information, a characteristic the researchers enhanced in MVHybrid by initializing the SSMs with negative real eigenvalues, which promotes a strong low-frequency bias.
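The intuition behind the negative-real-eigenvalue initialization can be seen in a toy one-dimensional state-space recurrence. This is a minimal illustrative sketch, not the paper's actual parameterization: a continuous-time pole at a negative real value discretizes to an exponentially decaying convolution kernel, which attenuates high frequencies, i.e. acts as a low-pass filter.

```python
import numpy as np

# Toy scalar SSM: x[k+1] = a * x[k] + u[k], y[k] = x[k].
# A negative real eigenvalue `lam` (illustrative value) discretizes
# to a = exp(lam * dt) in (0, 1), so the impulse response is an
# exponential decay -- a low-pass filter.
lam, dt, n = -2.0, 0.05, 256
a = np.exp(lam * dt)          # discrete-time eigenvalue, ~0.905 here
kernel = a ** np.arange(n)    # impulse response: exponential decay

# Frequency response: low frequencies pass, high frequencies attenuate.
spectrum = np.abs(np.fft.rfft(kernel))
print(spectrum[1] > spectrum[-1])  # low-frequency magnitude dominates
```

Making the eigenvalues more negative steepens the decay, shifting the model's inductive bias further toward the broad, low-frequency morphological patterns described above.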
The MVHybrid architecture is structured with MambaVision (MV), a type of SSM, in the first half of its layers, followed by ViT layers in the second half. This unique combination allows the model to learn more useful low-frequency biological features crucial for accurate biomarker prediction. The models were all pretrained on identical colorectal cancer datasets using the DINOv2 self-supervised learning method, ensuring a fair comparison.
Superior Performance and Robustness
The evaluation of MVHybrid against five other backbone architectures, including various ViT and SSM models, demonstrated significant improvements. In a rigorous evaluation setting called Leave-One-Study-Out (LOSO), where data from an entire study source is held out for testing to assess robustness against batch effects, MVHybrid achieved a 57% higher correlation in gene expression prediction compared to the best-performing ViT model. Furthermore, it showed 43% less performance degradation when moving from random data splits to the more challenging LOSO setting, highlighting its superior robustness.
Beyond biomarker prediction, MVHybrid matched or exceeded the other backbones in other critical downstream tasks, including classification, patch retrieval, and survival prediction. This broad applicability underscores its potential as a next-generation backbone for pathology Vision Foundation Models.
The researchers attribute MVHybrid’s success to its unique design, which includes regular convolution layers in its SSM components, its hybrid nature allowing MV and ViT layers to capture different types of features, and its inherent low-frequency bias. This work represents a significant step forward in computational pathology, demonstrating that tailoring the backbone architecture of VFMs can lead to more robust and accurate predictions, especially for complex molecular tasks.
For more detailed information, refer to the full research paper.