TLDR: MSCloudCAM is a novel deep learning model designed for accurate and robust cloud segmentation in multispectral satellite imagery from Sentinel-2 and Landsat-8. It leverages a Swin Transformer backbone for hierarchical feature extraction, multi-scale context modules (ASPP and PSP) for enhanced scale-aware learning, and a Cross-Attention block for effective multi-sensor and multispectral feature fusion. The model classifies clear sky, thin cloud, thick cloud, and cloud shadow, achieving state-of-the-art segmentation accuracy while maintaining computational efficiency, making it practical for large-scale Earth observation.
Clouds are a persistent challenge in optical satellite imagery, often obscuring the Earth’s surface and making it difficult to analyze data for environmental monitoring, land cover mapping, and climate research. Accurate detection and classification of different cloud types are crucial for various remote sensing applications, including atmospheric correction and land surface monitoring.
Traditional methods for cloud detection, such as rule-based or spectral index-based approaches, often struggle with mixed pixels, thin clouds, or bright surfaces like snow. While machine learning classifiers improved accuracy, they were limited by handcrafted features. The advent of deep learning, particularly Convolutional Neural Networks (CNNs) and Transformer-based architectures, has significantly advanced cloud segmentation by learning complex features directly from multispectral data.
However, many existing deep learning models are trained on data from a single sensor, which limits their ability to generalize across different satellite sensors and spectral configurations. Furthermore, few models effectively integrate multi-scale spectral-spatial features with cross-attention mechanisms specifically designed for cloud segmentation, especially in multi-class scenarios where distinguishing between thin clouds, thick clouds, and cloud shadows is vital.
Introducing MSCloudCAM
To address these limitations, researchers have proposed MSCloudCAM, a Cross-Attention with Multi-Scale Context Network designed for robust cloud segmentation in multispectral and multi-sensor imagery. It is tailored to exploit the rich spectral information in Sentinel-2 imagery (the CloudSEN12 dataset) and Landsat-8 imagery (the L8Biome dataset). The model classifies four semantic categories: clear sky, thin cloud, thick cloud, and cloud shadow.
How MSCloudCAM Works
MSCloudCAM combines several advanced deep learning techniques to achieve its high performance:
- Swin Transformer Backbone: This component is responsible for extracting hierarchical features from the input multispectral images. It efficiently captures both local and global dependencies within the image.
- Multi-Scale Context Modules (ASPP and PSP): To enhance the model’s ability to understand objects at different sizes, MSCloudCAM integrates Atrous Spatial Pyramid Pooling (ASPP) and Pyramid Scene Parsing (PSP) modules. ASPP captures large-scale semantic context using dilated convolutions, while PSP aggregates multi-scale contextual cues by adaptive pooling, which is particularly useful for delineating fine structures like thin clouds.
- Cross-Attention Block: This is a key innovation that enables effective fusion of features from different sensors and spectral domains. It refines the combined outputs of the ASPP and PSP modules, aligning global semantic information with fine-grained spatial details.
- Efficient Channel Attention Block (ECAB) and Spatial Attention Module: These modules adaptively refine feature representations, allowing the model to focus on the most discriminative regions within the image.
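To make the two multi-scale context ideas above concrete, here is a minimal, dependency-free sketch of what they compute on a single-channel feature map: a 3x3 dilated convolution (the building block of ASPP) and adaptive average pooling at several grid sizes (the building block of PSP). The function names, the bin sizes, and the dilation rate are illustrative assumptions rather than values from the paper, and a real model would use a tensor library with learned weights.

```python
def dilated_conv3x3(grid, kernel, rate):
    """3x3 convolution whose taps are `rate` pixels apart (zero padding).

    A larger rate widens the receptive field without adding weights,
    which is the idea behind ASPP-style branches.
    """
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for kr in range(3):
                for kc in range(3):
                    rr = r + (kr - 1) * rate
                    cc = c + (kc - 1) * rate
                    if 0 <= rr < h and 0 <= cc < w:  # taps outside the map count as zero
                        acc += kernel[kr][kc] * grid[rr][cc]
            out[r][c] = acc
    return out


def adaptive_avg_pool(grid, bins):
    """Average-pool a 2D grid down to bins x bins cells.

    Window i spans floor(i*h/bins) .. ceil((i+1)*h/bins), so windows
    cover the whole map even when h is not divisible by bins.
    """
    h, w = len(grid), len(grid[0])
    pooled = []
    for i in range(bins):
        r0, r1 = (i * h) // bins, -((-(i + 1) * h) // bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, -((-(j + 1) * w) // bins)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


# Toy 4x4 feature map with values 0..15.
feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]

# PSP-style pyramid: the same map summarized at coarse-to-fine grids.
# The bin sizes (1, 2, 4) are assumed for illustration.
pyramid = {bins: adaptive_avg_pool(feature_map, bins) for bins in (1, 2, 4)}
print(pyramid[2])  # [[2.5, 4.5], [10.5, 12.5]]

# ASPP-style branch: an all-ones kernel with an assumed dilation rate of 2.
ones = [[1.0] * 3 for _ in range(3)]
wide_context = dilated_conv3x3(feature_map, ones, rate=2)
```

In a full network the pooled maps would be upsampled back to the input resolution and concatenated with the dilated-convolution branches before fusion; that step is omitted here to keep the sketch short.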
The model processes input multispectral images through the Swin Transformer to get a hierarchy of features. These features are then enriched by the ASPP and PSP modules. A convolutional multi-head cross-attention module fuses these enriched features, which are then further refined by combined channel and spatial attention. Finally, a multi-stage decoder with auxiliary supervision produces the pixel-wise classification of cloud types.
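The fusion step in this pipeline can be illustrated with plain scaled dot-product cross-attention, in which tokens from one branch act as queries and tokens from the other branch supply keys and values. The sketch below is a simplification under stated assumptions: it uses a single head, no learned or convolutional projections, and an arbitrary choice of which branch provides the queries, whereas the paper describes a convolutional multi-head module. The names `cross_attention`, `aspp_tokens`, and `psp_tokens` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: list of d-dimensional tokens from one branch.
    keys, values: lists of tokens from the other branch (same length).
    Returns one fused token per query: a weighted average of the values,
    weighted by how well each key matches the query.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(wt * v[j] for wt, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical flattened tokens from two context branches (2 channels each).
aspp_tokens = [[1.0, 0.0], [0.0, 1.0]]              # queries
psp_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # keys and values

fused_tokens = cross_attention(aspp_tokens, psp_tokens, psp_tokens)
# Each fused token is a convex combination of the PSP tokens,
# tilted toward the PSP token most similar to its query.
```

Multi-head attention repeats this computation on several lower-dimensional projections of the tokens and concatenates the results, which lets different heads attend to different relationships between the two branches.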
Performance and Efficiency
Comprehensive experiments conducted on the CloudSEN12 and L8Biome datasets demonstrate that MSCloudCAM delivers state-of-the-art segmentation accuracy. It consistently outperforms leading baseline architectures across various metrics, including IoU (Intersection over Union), F1 Score, and Accuracy, for all four semantic categories. Importantly, MSCloudCAM achieves this superior performance while maintaining competitive parameter efficiency and computational cost (FLOPs) compared to other advanced models.
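For readers unfamiliar with these metrics, the following sketch shows how per-class IoU, per-class F1, and overall accuracy are typically computed from predicted and reference labels. It is a generic illustration, not the paper's evaluation code: the class-to-index mapping and the toy labels are assumptions, and no numbers here come from the reported experiments.

```python
# Assumed index order for the four categories (not specified in the source).
CLASSES = ["clear sky", "thin cloud", "thick cloud", "cloud shadow"]


def segmentation_metrics(y_true, y_pred, num_classes):
    """Per-class IoU and F1 plus overall accuracy from flat label lists."""
    tp = [0] * num_classes  # pixels correctly assigned to the class
    fp = [0] * num_classes  # pixels wrongly assigned to the class
    fn = [0] * num_classes  # pixels of the class that were missed
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    iou, f1 = [], []
    for c in range(num_classes):
        union = tp[c] + fp[c] + fn[c]
        # A class absent from both prediction and reference has no defined score.
        iou.append(tp[c] / union if union else float("nan"))
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1.append(2 * tp[c] / denom if denom else float("nan"))

    accuracy = sum(tp) / len(y_true)
    return {"iou": iou, "f1": f1, "accuracy": accuracy}


# Toy example: six pixels, flattened.
reference = [0, 0, 1, 1, 2, 3]
predicted = [0, 1, 1, 1, 2, 2]
scores = segmentation_metrics(reference, predicted, num_classes=len(CLASSES))
for name, class_iou, class_f1 in zip(CLASSES, scores["iou"], scores["f1"]):
    print(f"{name}: IoU={class_iou:.3f}, F1={class_f1:.3f}")
print(f"accuracy={scores['accuracy']:.3f}")
```

IoU penalizes errors more heavily than F1 for the same confusion counts, which is why the two are usually reported side by side for small or thin classes such as thin cloud and cloud shadow.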
Qualitative results also show that MSCloudCAM delineates thin clouds and cloud shadows more sharply and produces fewer false detections than the compared approaches. Combined with its competitive computational cost, this makes the model well suited to large-scale Earth observation tasks and real-world applications.
Future Directions
The researchers plan to explore lightweight variants of MSCloudCAM for onboard satellite processing, which would allow for real-time cloud segmentation directly on satellites. Additionally, future work will extend the model to spatiotemporal cloud tracking, enabling the monitoring of cloud movement and evolution over time.
For more technical details, you can refer to the full research paper: MSCloudCAM: Cross-Attention with Multi-Scale Context for Multispectral Cloud Segmentation.


