Unmasking Hidden Biases in AI Weather Forecasts: The SAFE Approach

TLDR: The research paper introduces SAFE (Stratified Assessments of Forecasts over Earth), an open-source Python package that evaluates AI weather prediction models by stratifying performance across attributes like territory, global subregion, income, and landcover, instead of relying on globally-averaged metrics. Benchmarking state-of-the-art models with SAFE revealed systemic biases and disparities in forecasting skill across all attributes, with fairness generally declining with longer lead times. The paper also introduces new fairness metrics and improved latitude weighting, advocating for a more fine-grained assessment to ensure equitable and reliable weather forecasts.

Artificial intelligence is rapidly transforming various fields, and weather prediction is no exception. However, a new research paper highlights a critical flaw in how we currently evaluate these advanced AI weather models: they often rely on “globally-averaged” performance metrics. This means that the accuracy of a forecast is averaged across the entire Earth, which can hide significant disparities in how well models perform in different regions, for different populations, or over various types of terrain.

Imagine a weather model that performs exceptionally well over oceans but struggles over densely populated landmasses. A global average might still show good overall performance, masking the fact that it’s less reliable where people actually live and make critical decisions based on forecasts. This is the core problem that Nick Masi and Randall Balestriero from Brown University address in their paper, introducing a groundbreaking new framework called Stratified Assessments of Forecasts over Earth, or SAFE.

Introducing SAFE: A Finer Look at Forecast Accuracy

SAFE is an open-source Python package designed to provide a much more detailed and nuanced evaluation of AI weather predictions. Instead of just looking at an overall average, SAFE breaks down the Earth into various “strata” based on different attributes. These attributes include:

Territory: Typically, this means individual countries, allowing for accuracy assessment on a nation-by-nation basis.
Global Subregion: Broader geographical areas, like continents or major sub-continental divisions.
Income: Stratifying by the income level of a territory (high, upper-middle, lower-middle, or low-income), as defined by the World Bank.
Landcover: Distinguishing between predictions made over land versus water (oceans, seas, and large lakes).

By stratifying performance in this way, SAFE allows researchers and decision-makers to see precisely where models perform best or worst. This is crucial because, as the authors point out, failing to predict high-frequency, localized events can have severe real-world consequences; the accuracy of extreme heat predictions, for example, affects mortality.
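To make the idea of stratification concrete, here is a minimal sketch of a per-stratum RMSE calculation. It is not SAFE's actual API: the function name `stratified_rmse`, the toy errors, and the equal weights are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def stratified_rmse(errors, strata, weights):
    """Weighted RMSE per stratum (hypothetical helper, not part of SAFE).

    errors  -- forecast minus ground truth at each grid point
    strata  -- stratum label of each grid point (e.g. "land" or "water")
    weights -- area weight of each grid point
    """
    sq_sum = defaultdict(float)
    w_sum = defaultdict(float)
    for err, label, w in zip(errors, strata, weights):
        sq_sum[label] += w * err * err
        w_sum[label] += w
    return {label: math.sqrt(sq_sum[label] / w_sum[label]) for label in sq_sum}

# Toy data (invented): small errors over water, large errors over land.
errors = [0.5, -0.5, 2.0, -2.0]
strata = ["water", "water", "land", "land"]
weights = [1.0, 1.0, 1.0, 1.0]

per_stratum = stratified_rmse(errors, strata, weights)

# The globally-averaged RMSE pools every grid point into one number.
global_rmse = math.sqrt(
    sum(w * e * e for e, w in zip(errors, weights)) / sum(weights)
)

print(per_stratum)
print(global_rmse)
```

In this toy case the single global figure falls between the two per-stratum values, so it says nothing about the forecast being four times worse over land than over water.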

Uncovering Hidden Biases in Leading AI Weather Models

To demonstrate SAFE’s importance, the researchers used it to benchmark a selection of state-of-the-art AI-based weather prediction models, including GraphCast, Keisler, Pangu-Weather, Spherical CNN, FuXi, and NeuralGCM. Their findings were striking: all these models exhibited disparities in forecasting skill across every attribute examined. This means that biases are systemic, not just isolated incidents.

One particularly interesting finding emerged from the income attribute analysis. While some models initially performed worse in low-income territories at very short lead times (e.g., 12 hours), a clear trend emerged by 48 hours and continued to grow: prediction skill actually decreased as income increased. This suggests a bias against high-income countries at longer lead times, a counter-intuitive result that highlights the complex nature of these disparities.

For the landcover attribute, models generally performed better over land than water. However, at very long lead times (around 9 days), most models became worse at predicting temperature over land than water, with Pangu-Weather being a notable exception. This kind of detailed insight is invaluable for understanding model reliability in specific contexts.

Advancements in Evaluation and Fairness Metrics

Beyond stratification, SAFE also introduces new advancements in how model performance is measured. It incorporates a more accurate method for “latitude weighting,” which accounts for the Earth’s oblate spheroid shape, ensuring that grid points near the poles aren’t over-represented in calculations. This is a significant improvement over the common assumption of a perfectly spherical Earth in most AI weather and climate work.
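The difference between spherical and spheroidal weighting can be sketched as follows. This is an illustration, not the method SAFE implements: it assumes the WGS84 ellipsoid, geodetic latitudes, and coarse 30-degree latitude bands, and it uses the standard closed-form expression for the surface area of an oblate spheroid between the equator and a given latitude. All function names are hypothetical.

```python
import math

# Assumption: WGS84 flattening. The semi-axes cancel out of the area
# fractions below, so only the eccentricity is needed.
WGS84_F = 1 / 298.257223563
WGS84_E = math.sqrt(WGS84_F * (2 - WGS84_F))  # first eccentricity

def sphere_term(lat_deg):
    """Proportional to the area between the equator and lat_deg on a sphere."""
    return math.sin(math.radians(lat_deg))

def ellipsoid_term(lat_deg, e=WGS84_E):
    """Proportional to the same area on an oblate spheroid (geodetic latitude)."""
    s = math.sin(math.radians(lat_deg))
    return s / (1 - (e * s) ** 2) + math.atanh(e * s) / e

def band_weights(edges_deg, term):
    """Fraction of total surface area in each latitude band, summing to 1."""
    raw = [term(hi) - term(lo) for lo, hi in zip(edges_deg, edges_deg[1:])]
    total = sum(raw)
    return [r / total for r in raw]

edges = [-90, -60, -30, 0, 30, 60, 90]
sphere_w = band_weights(edges, sphere_term)
ellipsoid_w = band_weights(edges, ellipsoid_term)

for lo, hi, ws, we in zip(edges, edges[1:], sphere_w, ellipsoid_w):
    print(f"{lo:+4d} to {hi:+4d}: sphere={ws:.6f} spheroid={we:.6f}")
```

Under these assumptions the two sets of weights differ by less than one percent in every band, but the difference is systematic in latitude, which is why it matters when errors are compared between regions.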

Furthermore, the paper introduces novel fairness metrics grounded in the machine learning fairness field. These metrics quantify fairness by measuring the greatest absolute difference in RMSE (Root Mean Square Error) between the strata of an attribute, and the variance of the per-stratum RMSE. An optimally “fair” model would score zero on both, indicating consistent performance across all strata. This allows for direct comparison of model fairness and helps identify which models are most equitable in their predictions.
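The two fairness metrics are simple to compute once the per-stratum RMSE values are known. The sketch below is an illustration rather than SAFE's own code: the function name `fairness_metrics` and the RMSE values are invented, and the use of population variance (dividing by the number of strata) is an assumption, since the article does not say which variance estimator the paper uses.

```python
def fairness_metrics(per_stratum_rmse):
    """Return (greatest absolute difference, variance) of per-stratum RMSE.

    Hypothetical helper. Variance here is the population variance,
    which is an assumption made for this sketch.
    """
    values = list(per_stratum_rmse.values())
    # The largest absolute difference between any two strata is max minus min.
    greatest_abs_diff = max(values) - min(values)
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return greatest_abs_diff, variance

# Invented RMSE values for the four income strata.
uneven = {"high": 3.0, "upper-middle": 2.0, "lower-middle": 2.0, "low": 1.0}
even = {"high": 2.0, "upper-middle": 2.0, "lower-middle": 2.0, "low": 2.0}

uneven_diff, uneven_var = fairness_metrics(uneven)
even_diff, even_var = fairness_metrics(even)

print(uneven_diff, uneven_var)
print(even_diff, even_var)
```

Both toy models have the same mean RMSE across strata, yet only the second scores zero on both metrics, which is the sense in which the article calls a model optimally fair.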

The Road Ahead: Towards More Equitable Weather Forecasts

The introduction of SAFE marks a significant step forward in ensuring that AI weather prediction models are not only accurate but also fair and reliable across all parts of the Earth and for all populations. The authors plan to expand SAFE by incorporating additional attributes like population density, coastlines, and islands as their own strata. They also suggest integrating fairness metrics directly into model training regimes to actively ameliorate bias.

As organizations like NOAA increasingly adopt machine learning systems for weather forecasting, the insights provided by SAFE become ever more critical. It empowers developers and decision-makers to choose models that are appropriate for local applications, understand existing biases, and ultimately work towards more trustworthy and equitable weather predictions globally. You can explore the SAFE package, which is open source, at its GitHub repository.
