TLDR: MDiCo is a novel multi-modal co-learning framework designed for Earth Observation (EO) that improves single-modality model performance when some sensor data is unavailable during inference. It achieves this by disentangling modality-shared, modality-specific, and unused features using a combination of loss functions, including contrastive and modality discriminant losses. The framework consistently outperforms state-of-the-art methods across various EO classification and regression tasks, demonstrating its robustness and general applicability in scenarios with missing sensor modalities.
In the rapidly evolving field of Earth Observation (EO), scientists and researchers rely on vast amounts of data collected from diverse remote sensors. This multi-modal data, combining information from sources like optical images and radar signals, offers a comprehensive view of our planet. However, a significant challenge arises when some of these sensor modalities are unavailable during the inference stage, that is, when the model is actually deployed to make predictions in the real world. This ‘all-but-one missing modality’ scenario is common due to operational constraints, weather conditions, or sensor failures.
To tackle this critical issue, a new framework called Multi-modal Disentanglement for Co-learning (MDiCo) has been developed. This innovative approach focuses on enhancing the performance of single-modality models by leveraging the rich, multi-modal data available during the training phase, even if only one type of sensor data is present during actual deployment.
The Core Idea: Collaborative Learning and Feature Disentanglement
MDiCo operates on the principle of multi-modal co-learning, where different models (or parts of a model) learn collaboratively from various data types. The framework is designed to be task-agnostic, meaning it can be applied to a wide range of EO problems, from classifying crops to identifying tree species or estimating vegetation moisture, without being tailored to a specific task or modality for inference.
A key aspect of MDiCo is its ability to disentangle different types of information from each sensor modality. For each modality, the framework extracts three distinct feature spaces (see the code sketch after this list):
- Shared Features: Information common across all modalities, making the model robust to missing data.
- Specific Features: Unique information inherent to a particular modality that is crucial for the downstream task.
- Unused Features: Information that is unique to a modality but not relevant to the task, potentially representing noise. This information is intentionally discarded.
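To make this concrete, here is a minimal PyTorch-style sketch of a per-modality encoder with three projection heads, one per feature space. The class name, head design, and dimensions are illustrative assumptions, not the paper's actual implementation:

```python
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    """Per-modality encoder splitting features into shared, specific,
    and unused subspaces (illustrative sketch; backbone and layer
    sizes are assumptions, not taken from the paper)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, out_dim: int):
        super().__init__()
        self.backbone = backbone                            # e.g. a CNN or transformer encoder
        self.shared_head = nn.Linear(feat_dim, out_dim)     # modality-invariant, task-relevant
        self.specific_head = nn.Linear(feat_dim, out_dim)   # modality-unique, task-relevant
        self.unused_head = nn.Linear(feat_dim, out_dim)     # modality-unique, task-irrelevant (discarded)

    def forward(self, x):
        h = self.backbone(x)
        return self.shared_head(h), self.specific_head(h), self.unused_head(h)
```

At inference, only the shared and specific heads of the available modality would be used for prediction; the unused head exists solely to siphon off task-irrelevant information during training.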
This disentanglement is guided by a combination of four loss functions during training: a main predictive loss, an auxiliary predictive loss, a contrastive loss, and a modality discriminant loss. The contrastive loss, in particular, plays a vital role in ensuring that the shared features from different modalities are aligned and similar, making them modality-invariant.
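The paper defines these objectives precisely; as a rough illustration, the sketch below assumes an InfoNCE-style contrastive term that pulls the shared features of the same sample across two modalities together, combined with the other losses in a weighted sum. The formulation and the weight values are placeholders, not the paper's:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment(z_a, z_b, temperature: float = 0.1):
    """InfoNCE-style loss aligning shared features of the same sample
    across two modalities (an assumed formulation; the paper may use
    a different contrastive objective)."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                    # pairwise cross-modal similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)  # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

def total_loss(main, aux, contrastive, discriminant,
               w_aux=1.0, w_con=1.0, w_dis=1.0):
    """Weighted sum of the four training objectives; the weights here
    are placeholders, not the paper's values."""
    return main + w_aux * aux + w_con * contrastive + w_dis * discriminant
```

Minimizing the contrastive term drives the shared features of the same scene, observed by different sensors, toward the same point in feature space, which is what makes them usable when only one sensor is present at inference.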
Robust Performance Across Diverse Earth Observation Tasks
The MDiCo framework was rigorously evaluated on four different EO benchmarks, covering binary classification (cropland detection), multi-class classification (crop-type identification), multi-label classification (tree species identification), and regression (live fuel moisture content estimation). These benchmarks involved various combinations of sensor modalities, such as Sentinel-1 radar, Sentinel-2 optical, Landsat 8 optical, and aerial images.
The results were highly promising. MDiCo consistently outperformed both individual models trained on single modalities and several state-of-the-art methods from general machine learning and computer vision, as well as EO-specific strategies. This superior performance was observed across nearly all scenarios, regardless of which single modality was available during inference. For instance, in some crop classification tasks, MDiCo even improved upon multi-modal fusion models that use all data types.
Ablation studies, where individual components of the framework were removed, highlighted the critical role of the contrastive loss in enhancing cross-modal interaction. The analysis also showed that combining both shared and specific features yielded the best results, demonstrating the benefit of leveraging complementary information.
Furthermore, MDiCo proved to be adaptable to different encoder architectures, which are the neural network components responsible for extracting features from the raw sensor data. This flexibility underscores the framework’s general applicability in various EO contexts.
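Reusing the hypothetical DisentangledEncoder from the earlier sketch, wrapping different backbones behind the same interface might look like this (the backbone choices and dimensions are assumptions for illustration only):

```python
import torch.nn as nn
import torchvision.models as models

# Wrap a ResNet for an image modality (e.g. aerial or Sentinel-2 patches).
resnet = models.resnet18(weights=None)
resnet.fc = nn.Identity()  # expose the 512-d pooled features
optical_enc = DisentangledEncoder(resnet, feat_dim=512, out_dim=128)

# A lightweight MLP for a second modality, assuming inputs flatten to 64 values.
mlp = nn.Sequential(nn.Flatten(), nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 512))
radar_enc = DisentangledEncoder(mlp, feat_dim=512, out_dim=128)
```

Because the disentangling heads only require a fixed-size feature vector, any encoder that produces one can slot in, which is the flexibility the authors report.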
Advancing Earth Observation with Smarter AI
The MDiCo framework represents a significant step forward in multi-modal co-learning for Earth Observation. By effectively handling scenarios where sensor modalities are missing at inference time, it makes EO models more robust and practical for real-world deployment. This research contributes to developing more intelligent AI systems that can make the most of the vast and diverse satellite data available, leading to better monitoring and understanding of our planet.
For those interested in the technical details and implementation, the code and related datasets are available at the project’s GitHub repository, and the full research paper provides more in-depth information.


