TL;DR: WEATHERARCHIVE-BENCH is a new benchmark for evaluating Retrieval-Augmented Generation (RAG) systems on historical weather archives. It aims to help climate scientists understand past societal responses to extreme weather by testing AI’s ability to retrieve relevant information from over a million noisy, archaic news segments and to classify indicators of societal vulnerability and resilience. The research found that traditional sparse retrieval methods often outperform modern dense retrievers on historical terminology, and that while large language models can extract explicit facts, they struggle with complex reasoning about implicit socio-environmental impacts and system interdependencies.
Climate change is bringing more frequent and severe extreme weather events, making it crucial for policymakers to develop effective adaptation strategies. To do this, we need more than just current meteorological data; a deeper understanding of how communities, infrastructure, and economies have historically responded to climate hazards is essential. Historical archives, filled with primary source records, offer a rich, untapped resource for these narratives. They provide unique insights into societal vulnerability and resilience that are often missing from purely meteorological records.
However, transforming these vast, often noisy, and archaic historical documents into structured knowledge for climate research presents significant challenges. This is where a new benchmark called WEATHERARCHIVE-BENCH comes in. It’s the first benchmark specifically designed to evaluate Retrieval-Augmented Generation (RAG) systems on historical weather archives. RAG systems combine information retrieval with generative language models to improve performance on knowledge-intensive tasks.
WEATHERARCHIVE-BENCH includes two main tasks. The first, WeatherArchive-Retrieval, measures how well a system can find historically relevant passages from a collection of over one million archival news segments. The second, WeatherArchive-Assessment, evaluates whether Large Language Models (LLMs) can accurately classify indicators of societal vulnerability and resilience from these extreme weather narratives.
The dataset for this benchmark is extensive, comprising over one million OCR-parsed archival documents. These documents were collected from Southern Quebec, covering both a historical period (1880–1899) and a contemporary period (1995–2014). A crucial step in creating this dataset involved cleaning the digitized articles to correct Optical Character Recognition (OCR) errors using advanced AI models like GPT-4o, ensuring the text quality is suitable for analysis.
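To make the OCR-cleaning step concrete, here is a minimal sketch of how a correction request to a chat model such as GPT-4o might be framed. The prompt wording, the helper name, and the example garbled text are illustrative assumptions, not the paper's actual pipeline:

```python
# Hypothetical sketch of an OCR-correction prompt; the instructions below are
# an assumption about the cleaning step, not the paper's exact setup.
def build_ocr_cleaning_messages(raw_segment: str) -> list[dict]:
    return [
        {
            "role": "system",
            "content": (
                "Correct OCR errors in this historical newspaper text. "
                "Fix garbled characters and broken words, but do not "
                "modernize spelling or alter the content."
            ),
        },
        {"role": "user", "content": raw_segment},
    ]

# Example garbled segment (invented for illustration).
messages = build_ocr_cleaning_messages("Tlie sto7m destroved the wliarf last niglit.")
# These messages could then be sent to a chat-completion endpoint, e.g. with the
# openai client: client.chat.completions.create(model="gpt-4o", messages=messages)
```

Keeping the correction instruction conservative ("do not modernize spelling") matters here, because over-aggressive rewriting would erase exactly the archaic vocabulary the retrieval benchmark later tests.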
The research paper, titled WEATHERARCHIVE-BENCH: BENCHMARKING RETRIEVAL-AUGMENTED REASONING FOR HISTORICAL WEATHER ARCHIVES, highlights several key findings. Experiments with various retrieval models (sparse, dense, and re-ranking) and a diverse set of LLMs revealed that dense retrievers often struggle with historical terminology. In contrast, sparse lexical models, like BM25, performed surprisingly well, often matching or exceeding dense alternatives in ranking quality for top results. This is likely because climate-related queries often contain specific technical and domain-specific terms (e.g., “flood damage,” “hurricane casualties”), which sparse methods are better at capturing directly. The study also found that combining sparse methods with a re-ranking procedure can lead to even better performance.
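The sparse-retrieval advantage the authors observe comes from exact lexical matching. A small, self-contained BM25 scoring sketch (not the benchmark's actual implementation; the toy documents are invented) illustrates why a query like "flood damage" matches archival text directly:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25.

    docs: list of token lists; query_terms: list of tokens.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the corpus.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores

# Toy archival snippets (hypothetical, not from the benchmark corpus).
docs = [
    "the flood damage along the river was severe".split(),
    "a concert was held in the town hall".split(),
    "hurricane casualties reported after the storm surge".split(),
]
scores = bm25_scores("flood damage".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of top-ranked doc
```

Because BM25 rewards literal term overlap, the flood report ranks first even though no embedding model is involved; a dense retriever, by contrast, must have learned good representations for archaic weather vocabulary to do the same.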
When it comes to assessing societal vulnerability and resilience, the LLMs showed varying capabilities. Larger, proprietary models like Claude-Opus-4-1 generally performed best. Models were effective at identifying explicit indicators of exposure (the type of hazard) and adaptability (the capacity to respond and recover), where factual extraction is sufficient. However, they frequently misinterpreted or struggled with classifying sensitivity (how strongly a system is affected) and complex socio-environmental system effects, particularly on functional (e.g., health, energy, transportation) and spatial (e.g., local, regional, national) scales. This suggests that while LLMs can extract facts, reasoning about implicit relationships and complex interdependencies in historical contexts remains a significant challenge.
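As a rough illustration of the assessment task, the sketch below builds a classification prompt covering the indicator dimensions named above. The wording and the label sets are assumptions based on the categories mentioned in this summary, not the benchmark's actual annotation schema:

```python
# Illustrative label sets drawn from the categories discussed above;
# the real benchmark schema may differ.
FUNCTIONAL_SCALES = ["health", "energy", "transportation"]
SPATIAL_SCALES = ["local", "regional", "national"]

def build_assessment_prompt(passage: str) -> str:
    """Frame a vulnerability/resilience classification request for an LLM."""
    return (
        "You are analyzing a historical weather news segment.\n"
        "Classify the following indicators:\n"
        "1. Exposure: what type of hazard is described?\n"
        "2. Sensitivity: how strongly is the affected system impacted?\n"
        "3. Adaptability: what capacity to respond and recover is shown?\n"
        f"4. Functional scale, one of: {', '.join(FUNCTIONAL_SCALES)}.\n"
        f"5. Spatial scale, one of: {', '.join(SPATIAL_SCALES)}.\n\n"
        f"Passage:\n{passage}"
    )

# Invented passage for illustration only.
prompt = build_assessment_prompt(
    "The great flood of 1886 swept away the rail bridge, "
    "halting all traffic in the county for weeks."
)
```

Note that items 1 and 3 can often be answered by quoting the passage, while items 2, 4, and 5 require inferring how hard a system was hit and at what scale, which matches the paper's finding that explicit extraction is easier for LLMs than implicit reasoning.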
In conclusion, WEATHERARCHIVE-BENCH provides a vital resource for advancing climate-focused AI. It offers a standardized way to evaluate AI systems on historical climate data, transforming previously underutilized archival narratives into actionable intelligence. The findings underscore the need for future research to improve retrieval methods for archaic language and narrative structures, and to enhance LLMs’ ability to reason about complex socio-environmental systems beyond simple factual extraction.


