TLDR: AQFusionNet is a deep learning framework that improves Air Quality Index (AQI) prediction by jointly modeling atmospheric imagery and environmental sensor data. Designed for robustness, it maintains high accuracy even when some sensors are unavailable, and it is efficient enough for edge deployment. Evaluated on data from India and Nepal, the EfficientNet-B0 variant achieved 92.02% accuracy, an 18.5% improvement over unimodal baselines, offering a scalable solution for air quality monitoring in resource-constrained regions.
Air pollution is a critical global health issue, responsible for millions of premature deaths annually, particularly severe in rapidly industrializing regions like South Asia. Accurate, real-time monitoring of the Air Quality Index (AQI) is essential for public health, but traditional methods face significant challenges. Ground-based sensors offer high temporal resolution but are costly and sparsely distributed, especially in developing areas. Satellite observations provide broad coverage but suffer from limitations like cloud interference and lower sensitivity to ground-level pollutants.
Addressing these challenges, researchers Koushik Ahmed Kushal and Abdullah Al Mamun from Clarkson University have introduced AQFusionNet, a multimodal deep learning framework designed to predict AQI robustly by combining atmospheric imagery with environmental sensor data. Unlike many existing approaches that rely on a single data source, AQFusionNet leverages the strengths of both visual and sensor information to provide a more comprehensive and accurate picture of air quality.
The core of AQFusionNet is its dual-objective learning architecture. Lightweight Convolutional Neural Network (CNN) backbones, such as MobileNetV2, ResNet18, and EfficientNet-B0, extract visual features from ground-level atmospheric images. These visual features are then integrated with pollutant concentration measurements (PM2.5, PM10, NO2, SO2, CO, O3) through semantically aligned embedding spaces. A key innovation is the ability to estimate sensor values directly from visual features, which keeps the system robust even when some sensor data is unavailable, a common scenario in resource-constrained environments.
The framework’s architecture consists of an image encoder, a sensor encoder, a multimodal fusion module, and dual prediction heads. The image encoder processes atmospheric images, while the sensor encoder handles environmental measurements. The fusion module combines these two data streams, and the dual prediction heads simultaneously predict the AQI and estimate sensor values. This design ensures that the model can maintain predictive capability across varying data availability scenarios.
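The data flow described above can be sketched in a few lines. The snippet below is a minimal, illustrative mock-up, not the authors' implementation: the layer dimensions, the affine stand-ins for the learned encoders, and the function name `aqfusion_forward` are all assumptions, and the real model uses trained CNN backbones rather than random weights. It only shows how an image encoder, a sensor encoder, a fusion step, and dual heads fit together, and how the sensor-estimation head lets prediction proceed when sensor readings are missing.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    # Simple affine layer standing in for a learned encoder (assumption).
    return x @ w + b

# Hypothetical dimensions: 128-d image features, 6 pollutant readings,
# 64-d embeddings, 6 AQI classes.
IMG_FEAT, N_SENSORS, EMB, N_CLASSES = 128, 6, 64, 6

# Randomly initialized weights stand in for trained parameters.
W_img, b_img = rng.normal(size=(IMG_FEAT, EMB)), np.zeros(EMB)
W_sen, b_sen = rng.normal(size=(N_SENSORS, EMB)), np.zeros(EMB)
W_aqi, b_aqi = rng.normal(size=(2 * EMB, N_CLASSES)), np.zeros(N_CLASSES)
W_est, b_est = rng.normal(size=(EMB, N_SENSORS)), np.zeros(N_SENSORS)

def aqfusion_forward(img_feat, sensor_vals=None):
    z_img = np.tanh(linear(img_feat, W_img, b_img))     # image encoder
    # Second head: estimate sensor values from visual features alone --
    # this is what keeps the model usable when sensors drop out.
    est_sensors = linear(z_img, W_est, b_est)
    if sensor_vals is None:                             # sensors unavailable
        sensor_vals = est_sensors
    z_sen = np.tanh(linear(sensor_vals, W_sen, b_sen))  # sensor encoder
    fused = np.concatenate([z_img, z_sen], axis=-1)     # fusion module
    aqi_logits = linear(fused, W_aqi, b_aqi)            # AQI prediction head
    return aqi_logits, est_sensors

img = rng.normal(size=IMG_FEAT)
logits_full, est = aqfusion_forward(img, sensor_vals=rng.normal(size=N_SENSORS))
logits_img_only, _ = aqfusion_forward(img)              # image-only fallback
print(logits_full.shape, logits_img_only.shape, est.shape)
```

Note how the image-only call returns AQI logits of the same shape as the full multimodal call: the estimated sensor values are substituted for the missing readings, so the downstream fusion and prediction stages are unchanged.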
Extensive evaluation was conducted on over 8,000 samples from 15 cities across India and Nepal, collected between 2019 and 2022. The results demonstrated AQFusionNet’s superior performance across all backbone configurations. The EfficientNet-B0 variant achieved the best results, with a Root Mean Square Error (RMSE) of 7.70 and 92.02% classification accuracy on test data. This represents an 18.5% improvement over unimodal baselines and significant gains over other multimodal approaches, including a 23.7% RMSE reduction compared to a framework using CCTV traffic imagery and environmental sensors.
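For readers unfamiliar with the two metrics reported above, here is how they are conventionally computed: RMSE measures the average regression error in AQI units, while classification accuracy is the fraction of samples assigned the correct AQI category. The values below are toy numbers for illustration, not the paper's data.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Square Error: penalizes large AQI prediction errors.
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def accuracy(labels_true, labels_pred):
    # Fraction of samples placed in the correct AQI category.
    t, p = np.asarray(labels_true), np.asarray(labels_pred)
    return float(np.mean(t == p))

# Toy example: three AQI regression targets, four category labels.
print(rmse([100, 150, 200], [92, 158, 205]))   # AQI-scale error
print(accuracy([0, 1, 2, 1], [0, 1, 2, 2]))    # 3 of 4 correct -> 0.75
```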
Beyond its accuracy, AQFusionNet is also computationally efficient, with the EfficientNet-B0 variant having only 6.2 million parameters and the MobileNetV2 configuration even less at 2.41 million. This lightweight design makes it highly suitable for deployment on edge devices and mobile platforms, which is crucial for real-time AQI monitoring in areas with limited computational infrastructure. The model’s ability to maintain performance under partial sensor unavailability further enhances its practical deployability in real-world settings.
To understand how the model makes its decisions, Grad-CAM visualization was used. This technique showed that for good air quality, the model focused on clear sky regions, while for high pollution levels, it prioritized hazy or smoggy areas, directly correlating visual cues with pollutant concentrations. This interpretability builds trust and can guide air quality interventions.
The researchers acknowledge that future work will focus on integrating temporal attention mechanisms for long-term forecasting, incorporating satellite imagery for broader spatial coverage and all-weather performance, developing unsupervised domain adaptation for seamless cross-regional deployment, and extending the framework to real-time streaming architectures. These advancements aim to further strengthen AQFusionNet’s robustness and scalability, ultimately helping to democratize air quality monitoring in developing nations facing severe pollution challenges.
For more detailed information, you can read the full research paper: AQFusionNet: Robust Multimodal Deep Learning for Air Quality Index Prediction through Atmospheric Imagery and Environmental Sensor Integration.