TLDR: MuMo is a multimodal molecular representation learning framework that targets two problems: unreliable 3D conformers and modality collapse. It introduces a Structured Fusion Pipeline (SFP) that builds a stable structural prior from 2D and 3D molecular data, and a Progressive Injection (PI) mechanism that asymmetrically integrates this prior into the sequence stream. This preserves modality-specific modeling while enabling cross-modal enrichment. Across 29 benchmark tasks, MuMo improves on the best-performing baseline for each task by 2.7% on average and ranks first on 22 of them.
In the complex world of drug discovery and computational chemistry, predicting how molecules behave is a crucial step. Traditional methods are often expensive and time-consuming, leading researchers to explore advanced computational models. Recent efforts have focused on multimodal molecular models, which combine different types of information about a molecule, such as its chemical sequence (SMILES), 2D graph structure, and 3D shape (geometry). However, these models face significant hurdles: the unreliability of 3D conformers (different spatial arrangements of the same molecule) and a phenomenon called ‘modality collapse,’ where one type of data overwhelms or distorts information from others.
A new research paper introduces MuMo, a framework designed to tackle both challenges. MuMo is a structured multimodal fusion framework that aims to produce more robust and generalizable molecular representations by controlling how and when the different kinds of molecular data are combined.
Addressing 3D Conformer Unreliability with a Structured Fusion Pipeline
One of the primary issues in molecular modeling is that 3D conformers, which are generated by tools like RDKit, are not unique: the same molecule can yield different conformers from one generation run to the next. Even small differences in local arrangement can change the properties a model predicts. To counter this instability, MuMo proposes a Structured Fusion Pipeline (SFP).
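To make the conformer problem concrete, the sketch below builds two conformers of a four-atom chain by hand instead of calling a conformer generator. The bond length, bond angle, dihedral angles, and function names are illustrative assumptions, not values from the paper. Both conformers share the same bond list, yet a geometry-dependent feature such as the end-to-end distance differs between them.

```python
import math

BONDS = [(0, 1), (1, 2), (2, 3)]  # 2D topology: identical for every conformer


def chain_conformer(dihedral_deg, bond=1.54, angle_deg=109.5):
    """Return 3D coordinates of a 4-atom chain A-B-C-D for a given dihedral angle.

    Bond length and bond angle are fixed (assumed values); only the rotation
    around the central B-C bond changes between conformers.
    """
    theta = math.radians(angle_deg)
    phi = math.radians(dihedral_deg)
    a = (bond * math.cos(theta), bond * math.sin(theta), 0.0)
    b = (0.0, 0.0, 0.0)
    c = (bond, 0.0, 0.0)
    d = (
        bond - bond * math.cos(theta),
        bond * math.sin(theta) * math.cos(phi),
        bond * math.sin(theta) * math.sin(phi),
    )
    return [a, b, c, d]


def end_to_end(coords):
    """Distance between the first and last atom: depends on the conformer."""
    return math.dist(coords[0], coords[-1])


anti = chain_conformer(180.0)   # extended arrangement
gauche = chain_conformer(60.0)  # folded arrangement

for name, conf in (("anti", anti), ("gauche", gauche)):
    bond_lengths = [round(math.dist(conf[i], conf[j]), 3) for i, j in BONDS]
    print(name, "bond lengths:", bond_lengths,
          "end-to-end:", round(end_to_end(conf), 3))
```

A model that consumes raw pairwise distances would see these two inputs as different, even though they describe the same molecule.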
The SFP works by combining the 2D topological information (how atoms are connected) and 3D geometric information (the spatial arrangement of atoms) into a single, stable ‘structural prior.’ This unified representation acts as a reliable foundation, reducing the model’s sensitivity to the noise and inconsistencies often found in 3D conformer data. By aligning and encoding these two structural inputs together, SFP is meant to give the model a view of the molecule’s physical structure that stays consistent across conformers.
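The article describes SFP only at a high level, so the following is not the paper's pipeline. It is a minimal sketch of the underlying intuition, assuming a hypothetical `structural_prior` function that reads 3D distances only along 2D bonds. Because bond lengths barely change when a molecule rotates around its bonds, such a topology-anchored feature stays the same across the two hand-written conformers below, while the full set of pairwise distances does not.

```python
import math

BONDS = [(0, 1), (1, 2), (2, 3)]

# Two hand-written conformers of the same 4-atom chain
# (unit bond lengths, 90-degree bond angles; only the last atom moves).
CONF_A = [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, -1.0, 0.0)]
CONF_B = [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0)]


def structural_prior(bonds, coords):
    """Per-atom (degree, mean bonded distance): 3D values read only along 2D edges."""
    neighbors = {i: [] for i in range(len(coords))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    prior = []
    for i, nbrs in neighbors.items():
        if nbrs:
            mean_len = sum(math.dist(coords[i], coords[j]) for j in nbrs) / len(nbrs)
        else:
            mean_len = 0.0  # isolated atom: no bonded geometry to read
        prior.append((len(nbrs), round(mean_len, 6)))
    return prior


def global_distances(coords):
    """All pairwise distances: a purely geometric, conformer-sensitive descriptor."""
    n = len(coords)
    return [round(math.dist(coords[i], coords[j]), 6)
            for i in range(n) for j in range(i + 1, n)]


print("prior A:", structural_prior(BONDS, CONF_A))
print("prior B:", structural_prior(BONDS, CONF_B))
print("global A:", global_distances(CONF_A))
print("global B:", global_distances(CONF_B))
```

A learned pipeline would replace these hand-picked features with encoders, but the design choice is the same: let the connectivity decide which parts of the geometry are trusted.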
Mitigating Modality Collapse with Progressive Injection
Another common problem in multimodal models is modality collapse, which occurs when different data types are fused too simply or symmetrically. For instance, noisy 3D signals might dominate or distort the information from a more stable SMILES sequence. MuMo addresses this with its Progressive Injection (PI) mechanism.
Instead of a naive, symmetric fusion, PI asymmetrically integrates the stable structural prior (created by SFP) into the main sequence stream. This means the sequence data, typically derived from SMILES, first establishes its own contextual understanding. Only then is the structural information progressively injected into the sequence stream. This staged approach allows each modality to develop its unique features independently before cross-modal enrichment occurs, preserving the integrity of modality-specific modeling while still benefiting from comprehensive structural guidance.
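The staged, one-directional flow can be sketched in a few lines. Everything here is an assumption made for illustration: tokens and prior values are plain floats, `sequence_layer` is a neighbor-averaging stand-in for a real sequence layer, the prior is taken to be already aligned one value per token, and the linear gate schedule is invented. The sketch only shows the two properties the text describes: early layers see the sequence alone, and the prior flows into the sequence stream without ever being updated by it.

```python
def sequence_layer(hidden):
    """Stand-in for one sequence-mixing layer: average each token with its neighbors."""
    out = []
    for i in range(len(hidden)):
        window = hidden[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def progressive_injection(tokens, prior, num_layers=4, start_layer=2):
    """Run the sequence stream; inject the fixed prior only from start_layer onward."""
    hidden = list(tokens)
    for layer in range(num_layers):
        hidden = sequence_layer(hidden)
        if layer >= start_layer:
            # Gate grows linearly and reaches 1.0 at the last layer (assumed schedule).
            gate = (layer - start_layer + 1) / (num_layers - start_layer)
            # Asymmetric: the prior modifies the sequence stream, never the reverse.
            hidden = [h + gate * p for h, p in zip(hidden, prior)]
    return hidden


tokens = [0.2, 0.9, 0.4, 0.7]   # toy sequence-stream activations
prior = [1.0, 2.0, 2.0, 1.0]    # toy structural prior, one value per token

seq_only = progressive_injection(tokens, prior, start_layer=4)  # no injection
fused = progressive_injection(tokens, prior)                    # inject in layers 2-3

print("sequence only:", seq_only)
print("with injection:", fused)
```

Setting `start_layer` equal to `num_layers` disables injection entirely, which gives a sequence-only baseline to compare against.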
Built on a state space backbone, MuMo is also adept at modeling long-range dependencies and propagating information effectively throughout the molecule.
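The article does not say which state space model MuMo uses, so the block below is only the textbook scalar form of a linear state space recurrence, with fixed parameters `a`, `b`, and `c` chosen for illustration. It shows why such a backbone can carry information over long distances: a single input at the first position still contributes to the output many steps later, scaled by a power of `a`.

```python
def ssm_scan(inputs, a=0.9, b=1.0, c=1.0):
    """Scalar linear state space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


# A single impulse at position 0, followed by zeros.
impulse = [1.0] + [0.0] * 20
response = ssm_scan(impulse)

# The impulse is still visible 10 and 20 steps later, decaying as a**t.
print(response[0], response[10], response[20])
```

Practical state space backbones learn these parameters and use vector-valued states, but the recurrence has the same shape.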
Impressive Performance Across Diverse Tasks
The effectiveness of MuMo has been rigorously tested across 29 benchmark tasks from Therapeutics Data Commons (TDC) and MoleculeNet. The results are compelling: MuMo achieved an average improvement of 2.7% over the best-performing baseline on each task, securing the top rank on 22 of them. Notably, it showed a remarkable 27% improvement on the LD50 task, which predicts the lethal dose of a substance.
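The article does not spell out how the average is computed. One plausible reading is a relative gain over the best baseline, taken per task and then averaged, with the sign flipped for error metrics where lower is better. The sketch below works through that reading. The task names and scores are placeholders invented for the example, not results from the paper.

```python
def relative_improvement(model, baseline, higher_is_better):
    """Relative gain over the baseline, positive when the model is better."""
    if higher_is_better:
        return (model - baseline) / baseline
    return (baseline - model) / baseline


# Placeholder numbers for illustration only; NOT results from the paper.
tasks = [
    {"name": "task_a", "model": 0.84, "best_baseline": 0.80, "higher_is_better": True},
    {"name": "task_b", "model": 0.40, "best_baseline": 0.50, "higher_is_better": False},
]

gains = [
    relative_improvement(t["model"], t["best_baseline"], t["higher_is_better"])
    for t in tasks
]
average_gain = sum(gains) / len(gains)
wins = sum(g > 0 for g in gains)  # tasks where the model beats the best baseline

print(f"average relative improvement: {average_gain:.1%}")
print(f"tasks ranked first: {wins} of {len(tasks)}")
```

Note that the comparison is against the best baseline on each task separately, which is a stricter bar than comparing against any single competing model.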
These findings underscore MuMo’s robustness to 3D conformer noise and the benefits of its multimodal fusion strategy for molecular representation learning. Further details are in the research paper, titled “Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning.”
MuMo represents a significant step forward in developing more reliable and accurate computational tools for molecular property prediction, with potential applications spanning computational chemistry and drug discovery.


