
SToFM: A New Model for Understanding Spatial Transcriptomics Data Across Multiple Biological Scales

TLDR: SToFM is a novel foundation model designed for Spatial Transcriptomics (ST) data analysis. It addresses the challenge of integrating multi-scale biological information (macro-scale tissue morphology, micro-scale cellular interactions, and gene-scale expression profiles) by constructing multi-scale sub-slices and using an SE(2) Transformer. Trained on SToCorpus-88M, the largest ST dataset to date, SToFM demonstrates superior performance across various downstream tasks including tissue segmentation, cell type annotation, clustering, deconvolution, and imputation, showcasing its comprehensive understanding and transferability of ST data.

Spatial Transcriptomics (ST) technologies are revolutionizing how biologists understand single-cell biology by allowing them to study gene expression while keeping the cells in their original spatial context within tissues. This provides a much richer picture than traditional methods that dissociate cells, losing crucial information about their environment and interactions.

However, analyzing ST data presents a significant challenge: it requires extracting information at multiple scales simultaneously. Imagine trying to understand a city by looking only at individual houses, or only at the map of the whole city, but never both. ST data similarly contains macro-scale tissue morphology (the overall shape and structure of an organ), micro-scale cellular microenvironments (how cells interact with their immediate neighbors), and gene-scale expression profiles (the specific genes active within each cell). Integrating these levels of information in a single model is difficult.

Introducing SToFM: A Multi-scale Foundation Model

To address this challenge, researchers have developed SToFM, a multi-scale Spatial Transcriptomics Foundation Model. SToFM is designed to comprehensively understand ST data by capturing and integrating information from all three crucial scales: macro, micro, and gene.

The model first performs a multi-scale information extraction step on each ST tissue slice. This step creates ‘ST sub-slices’ that combine information from all three scales. For the gene scale, SToFM uses a pre-trained cell encoder, which is further adapted to ST data to produce high-quality gene expression representations. For the micro scale, the ST slice is divided into smaller sub-slices, each focused on localized cell-cell interactions. To retain macro-scale information, SToFM derives ‘virtual cells’ by clustering all cells in the slice. These virtual cells act as a compressed representation of the tissue’s overall structure and are added to each sub-slice, so the model can perceive the larger organizational patterns while still focusing on local details.
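The sub-slice construction described above can be sketched roughly as follows. This is a minimal illustration, not SToFM's implementation: the article does not say which clustering algorithm produces the virtual cells or how sub-slices are cut, so k-means on spatial coordinates and a regular grid are assumptions, as are the names `kmeans`, `build_sub_slices`, `n_virtual`, and `grid`. In the real model each cell and virtual cell would also carry an embedding from the cell encoder, which is omitted here.

```python
import random
from math import dist


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on 2D points; returns k centroids.

    Assumption: k-means stands in for whatever clustering SToFM uses.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            groups[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def build_sub_slices(coords, n_virtual=4, grid=2):
    """Cut a slice into grid x grid tiles and attach the same virtual cells to each.

    Assumption: a regular grid is used only for illustration.
    """
    virtual = kmeans(coords, n_virtual)  # macro scale: compressed tissue layout
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    min_x, min_y = min(xs), min(ys)
    width = (max(xs) - min_x) or 1.0   # avoid division by zero on degenerate input
    height = (max(ys) - min_y) or 1.0
    tiles = {}
    for idx, (x, y) in enumerate(coords):
        col = min(int((x - min_x) / width * grid), grid - 1)
        row = min(int((y - min_y) / height * grid), grid - 1)
        tiles.setdefault((row, col), []).append(idx)  # micro scale: local cells
    return [{"cell_ids": ids, "virtual_cells": virtual} for ids in tiles.values()]


# Toy slice with 200 randomly placed cells.
rng = random.Random(42)
coords = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200)]
sub_slices = build_sub_slices(coords, n_virtual=4, grid=2)
print(len(sub_slices), "sub-slices, each with",
      len(sub_slices[0]["virtual_cells"]), "virtual cells")
```

Every sub-slice sees only its own local cells but shares the same set of virtual cells, which is how the macro-scale layout stays visible at the micro scale in this sketch.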

Once these multi-scale sub-slices are constructed, SToFM employs an SE(2) Transformer. This specialized neural network is designed to process both the gene expression information and the spatial coordinates of the cells, producing high-quality cell representations that are robust to common spatial transformations like rotations and translations. The model is trained using two main objectives: Masked Cell Modeling (MCM), where it predicts masked gene expression embeddings, and Pairwise Distance Recovery (PDR), where it reconstructs original spatial distances after some coordinates are perturbed. These tasks help SToFM learn both the genetic and spatial characteristics of the data.
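To make the spatial side of this concrete, the sketch below checks that pairwise distances are unchanged by a rotation plus translation (the property an SE(2)-invariant model can rely on) and builds a toy version of the Pairwise Distance Recovery target. It is an illustration under stated assumptions, not the paper's code: the Gaussian noise, the fraction of perturbed cells, the mean-squared-error form of the loss, and all function names are hypothetical choices. Masked Cell Modeling is not shown.

```python
import math
import random


def pairwise_distances(coords):
    """Matrix of Euclidean distances between all pairs of 2D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def rigid_transform(coords, angle, shift):
    """Rotate by `angle` (radians) about the origin, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    dx, dy = shift
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in coords]


def perturb(coords, noise_std, fraction, rng):
    """Add Gaussian noise to a random subset of coordinates (assumed noise model)."""
    n_noisy = max(1, int(len(coords) * fraction))
    noisy_ids = set(rng.sample(range(len(coords)), n_noisy))
    return [
        (x + rng.gauss(0, noise_std), y + rng.gauss(0, noise_std)) if i in noisy_ids else (x, y)
        for i, (x, y) in enumerate(coords)
    ]


def pdr_loss(predicted, target):
    """Mean squared error between predicted and original distances (assumed loss form)."""
    n = len(target)
    return sum(
        (predicted[i][j] - target[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)


rng = random.Random(0)
coords = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(8)]
target = pairwise_distances(coords)

# Distances survive a rigid motion up to floating-point error.
moved = rigid_transform(coords, angle=math.pi / 3, shift=(5.0, -2.0))
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(target, pairwise_distances(moved))
    for a, b in zip(row_a, row_b)
)

# A model given `noisy` would be trained to output distances close to `target`;
# the loss of the untouched noisy distances is a "do nothing" baseline.
noisy = perturb(coords, noise_std=0.5, fraction=0.5, rng=rng)
baseline = pdr_loss(pairwise_distances(noisy), target)
print("max distance change after rigid motion:", max_gap)
print("baseline PDR loss on perturbed input:", baseline)
```

Because the training target is a distance matrix rather than absolute coordinates, it does not depend on how the slice happens to be oriented or positioned.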

A Massive Training Corpus: SToCorpus-88M

A key component of SToFM’s development is SToCorpus-88M, described as the largest high-resolution spatial transcriptomics corpus constructed for pretraining to date. The dataset comprises approximately 2,000 high-resolution ST slices totaling about 88 million cells. It includes data from six different ST technologies and covers both human and mouse samples, surpassing previous datasets in both scale and diversity. A corpus of this size and variety is what allows a foundation model like SToFM to learn patterns that generalize across biological contexts.

Demonstrated Performance Across Diverse Tasks

SToFM has shown exceptional performance across a variety of important downstream biological tasks, highlighting its comprehensive understanding of ST data:

  • Tissue Region Semantic Segmentation: SToFM significantly outperforms existing methods in identifying structural and functional regions within tissues, such as human embryonic structures and layers of the dorsolateral prefrontal cortex (DLPFC). Notably, its performance is particularly strong in cross-slice settings, demonstrating excellent robustness and transferability.
  • Cell Type Annotation: The model achieves superior accuracy in identifying different cell types within spatial transcriptomics data, even with lower-quality gene expression profiles often found in ST data. This suggests that incorporating spatial information helps in inferring cell types.
  • Zero-shot Clustering and Visualization: SToFM produces high-quality cell embeddings that allow for clear and distinct clustering of cell types, even without prior training on specific labels. Visualizations show that cells of the same type form tight clusters, and the representations also reflect biological relationships between different cell types.
  • Spatial Deconvolution: The model effectively predicts the proportion of various cell types within a given spot in ST data, showcasing its ability to transfer deconvolution results between labeled and unlabeled slices.
  • Spatial Transcriptomics Imputation: SToFM demonstrates strong capabilities in inferring uncaptured gene expression levels, a critical task for analyzing ST data where gene coverage can be limited.
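As a small illustration of what the spatial deconvolution task asks for, the sketch below computes per-spot cell type proportions from cells whose types are already known. This is only the shape of the prediction target, not SToFM's method; the function name `spot_proportions`, the spot identifiers, and the toy labels are hypothetical.

```python
from collections import Counter


def spot_proportions(cell_types, spot_ids):
    """Return {spot: {cell_type: fraction}} for cells assigned to spots."""
    counts = {}
    for cell_type, spot in zip(cell_types, spot_ids):
        counts.setdefault(spot, Counter())[cell_type] += 1
    return {
        spot: {ct: n / sum(c.values()) for ct, n in c.items()}
        for spot, c in counts.items()
    }


# Toy example: five cells falling into two spots.
cell_types = ["neuron", "neuron", "astrocyte", "microglia", "astrocyte"]
spot_ids = ["s1", "s1", "s1", "s2", "s2"]
proportions = spot_proportions(cell_types, spot_ids)
print(proportions)
```

A deconvolution model receives only the aggregated expression of each spot and is evaluated on how closely its predicted proportions match values like these.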

An ablation study further confirmed that each multi-scale component (gene, micro, and macro) contributes to SToFM’s improved performance and transferability, validating the design choices.

Future Directions

While SToFM represents a significant step forward, the researchers acknowledge its current limitations. The model focuses on three main scales, and future work could explore integrating more scales, perhaps using techniques like image pyramids. Incorporating causal machine learning methods to model gene regulatory relationships, or integrating other biological knowledge and modalities such as pathological images, could further enhance the model’s capabilities.

SToFM is a powerful new tool that promises to accelerate discoveries in single-cell biology and tissue research by providing a more complete and integrated understanding of spatial transcriptomics data. You can find more details about this research in the original research paper.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
