TLDR: The MoWE (Mixture of Weather Experts) framework combines outputs from multiple existing AI weather models using a Vision Transformer-based gating network. This approach dynamically weights expert contributions based on lead time and location, achieving up to 10% lower RMSE than the best individual AI model for 2-day forecasts. It’s computationally efficient and offers a scalable way to improve weather prediction by leveraging the strengths of diverse models.
In the evolving landscape of weather prediction, data-driven models have made significant strides, yet recent progress has shown signs of plateauing. To overcome this plateau, researchers have introduced a novel approach called the Mixture of Weather Experts (MoWE).
Instead of developing entirely new forecasting models, MoWE focuses on optimally combining the outputs of existing, high-performing models. This strategy improves accuracy while requiring significantly fewer computational resources than training an individual expert model from scratch.
At the heart of the MoWE system is a Vision Transformer-based gating network. The network learns to weight the contributions of multiple “expert” models at each grid point and adjusts those weights according to the forecast lead time. The result is a single deterministic forecast that achieves a lower Root Mean Squared Error (RMSE) than any individual component model.
The effectiveness of MoWE is striking: it has achieved up to a 10% lower RMSE than the best-performing AI weather model for a 2-day forecast horizon. This represents a substantial improvement over individual experts and even a simple average of their predictions. The framework offers a computationally efficient and scalable method to advance the state of the art in data-driven weather prediction by maximizing the utility of leading forecast models.
The paper details the methodology, explaining that the MoWE model produces a superior forecast by dynamically weighting the contributions of pre-existing expert models. The core gating network, a deep neural network, determines these optimal weights by considering all expert forecasts, forecast lead time, and an optional noise vector for probabilistic variants. The architecture leverages Vision Transformer blocks, processing a composite image of stacked forecast maps from experts, then outputting pixel-by-pixel weight maps for each expert and a final bias map.
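The weighting step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `combine_experts`, the array shapes, and the toy values are all assumptions, and the weights are supplied by hand rather than produced by a Vision Transformer.

```python
import numpy as np

def combine_experts(expert_forecasts, weights, bias):
    """Blend expert forecasts pixel by pixel.

    expert_forecasts: (n_experts, channels, height, width)
    weights:          same shape, one weight map per expert (assumed layout)
    bias:             (channels, height, width)
    """
    if expert_forecasts.shape != weights.shape:
        raise ValueError("weights must match expert_forecasts in shape")
    # Weighted sum over the expert axis, plus the additive bias map.
    return (weights * expert_forecasts).sum(axis=0) + bias

# Toy example: 3 experts, 2 channels, 4x5 grid (hypothetical sizes).
rng = np.random.default_rng(0)
experts = rng.normal(size=(3, 2, 4, 5))
uniform_weights = np.full_like(experts, 1.0 / 3.0)
zero_bias = np.zeros((2, 4, 5))

blended = combine_experts(experts, uniform_weights, zero_bias)
```

With uniform weights and a zero bias, the blend reduces to the simple mean of the experts. The gating network's role is to learn weight and bias maps that depart from that baseline depending on location and lead time.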
For this preliminary study, three expert models were chosen: Pangu, Aurora, and FCN3. Pangu utilizes a 3D data cube approach to capture complex weather patterns. Aurora, built on a Swin Transformer, processes diverse atmospheric data through pretraining and fine-tuning. FCN3 is a probabilistic model that uses a spherical neural operator and is designed to minimize the Continuous Ranked Probability Score (CRPS). Its single-member deterministic scores might lag, but its ensemble performance is competitive.
The MoWE model was trained using 2-day forecast trajectories generated by each expert model, initialized at various timesteps of ERA5 data from 1980 to 2014. The training objective was to minimize the Mean Squared Error (MSE) between its prediction and the ground truth. Testing was conducted using data from 2015.
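One way to write this objective down, as a sketch under assumed notation rather than the paper's own formulation: let $x_{e,c,p}(\tau)$ be the forecast of expert $e$ for channel $c$ at grid point $p$ and lead time $\tau$, let $w$ and $b$ be the weight and bias maps produced by the gating network, and let $y$ be the ERA5 ground truth. All symbols here are illustrative. Whether the loss is latitude-weighted or how it is aggregated over lead times is not stated in this summary, so the plain average below is an assumption.

```latex
\hat{y}_{c,p}(\tau) = \sum_{e=1}^{E} w_{e,c,p}(\tau)\, x_{e,c,p}(\tau) + b_{c,p}(\tau),
\qquad
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{C P} \sum_{c=1}^{C} \sum_{p=1}^{P}
  \bigl( \hat{y}_{c,p}(\tau) - y_{c,p}(\tau) \bigr)^{2}
```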
Results show that MoWE achieves the lowest RMSE across all evaluated atmospheric variables and lead times, from 6 hours up to 2 days. At shorter lead times the best individual experts outperform the simple mean of experts, whereas at longer lead times (1-2 days) the simple mean can become superior because averaging cancels part of the individual errors. MoWE, however, surpasses both the best individual expert and the simple mean across all scenarios.
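The error reduction through averaging mentioned above can be illustrated with purely synthetic data. In this sketch the three "experts" are the truth plus independent, unit-variance noise. That is an assumption made for illustration only: errors of real weather models are usually partly correlated, so the gain in practice is smaller. The `rmse` helper and all array sizes are hypothetical.

```python
import numpy as np

def rmse(prediction, truth):
    """Root mean squared error over all grid points."""
    return float(np.sqrt(np.mean((prediction - truth) ** 2)))

rng = np.random.default_rng(42)
truth = np.zeros((64, 64))  # synthetic "ground truth" field

# Three synthetic experts with independent unit-variance errors
# (an assumption; not data from the paper).
experts = truth + rng.normal(size=(3, 64, 64))

expert_rmse = [rmse(e, truth) for e in experts]
mean_rmse = rmse(experts.mean(axis=0), truth)

print("individual:", [round(r, 3) for r in expert_rmse])
print("simple mean:", round(mean_rmse, 3))
```

With independent errors of equal variance, averaging N experts lowers the RMSE by roughly a factor of the square root of N. A learned, location-dependent weighting such as MoWE's can go further by favoring whichever expert is more reliable in a given region or at a given lead time.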
An ablation study on model capacity showed that a Base model (25 million parameters) performed marginally better than a Small model (9 million parameters), highlighting the efficiency of the MoWE framework even with lightweight designs. Qualitative analysis of forecasts showed consistency with baseline models, and the learned weights dynamically adjusted based on lead time, channels, and spatial locations. For instance, at a 6-hour forecast, MoWE heavily favored the Aurora model, but as the forecast extended to 24 and 48 hours, weights were distributed more evenly among FCN3, Aurora, and Pangu, often influenced by geographical features.
In conclusion, the MoWE framework offers a strategic and effective alternative to developing new standalone models, leveraging the collective strengths of existing expert models to significantly improve forecast skill. This approach demonstrates that valuable, complementary information is distributed across different models and can be harnessed effectively. The superiority of MoWE over simpler ensembling strategies also indicates its ability to isolate advantages of different experts to specific locations and lead times.
While the current approach has limitations, such as fixed rollout times and the increasing infeasibility of simple channel concatenation with more experts, future work aims to address these through online training setups and dimensionality reduction strategies. This research paves the way for a shift from competing models to collaborative models, fostering community effort in the next generation of weather forecasting systems. The full research paper is titled “MoWE: A Mixture of Weather Experts.”


