TLDR: CrisiText is the first large-scale dataset designed to train Large Language Models (LLMs) for generating expert-based warning messages in 13 types of crisis scenarios. It contains over 400,000 messages, including both ‘good’ messages following expert guidelines (Tone and Instructions) and ‘bad’ suboptimal messages for preference alignment. Experiments show that fine-tuning LLMs with CrisiText significantly improves their ability to generate accurate and contextually relevant warning messages, especially when provided with instruction guidelines and previous message history. The dataset also supports the development of effective post-editing tools for crisis communication.
In our increasingly complex world, shaped by rapidly evolving social and environmental phenomena, the ability to communicate effectively during crises is more critical than ever. Natural disasters, violent attacks, and other emergencies can impact thousands or even millions of people, making timely and accurate warning messages paramount for safeguarding those in danger. While Artificial Intelligence (AI) has increasingly assisted in crisis management, the use of Natural Language Processing (NLP) techniques has largely focused on classification tasks, overlooking the significant potential of generating timely warning messages.
Addressing this crucial gap, researchers have introduced CrisiText, the first large-scale dataset specifically designed for the generation of warning messages across 13 different types of crisis scenarios. This innovative dataset contains over 400,000 warning messages, spanning almost 18,000 crisis situations, all aimed at assisting civilians during and after such events. The creation of CrisiText marks a significant step towards specializing Large Language Models (LLMs) in expert-based crisis communication.
The development of CrisiText involved a meticulous pipeline. Scenario descriptions were extracted from two primary sources: the FEMA IPAWS Archived Alerts, which cover natural disasters, and the Global Terrorism Database (GTD), which focuses on violent attacks. Using an advanced LLM (GPT-4o-mini), these descriptions were transformed into sequences of chronological events, simulating the unfolding of each crisis. For each event, warning messages were then generated following expert-written guidelines.
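To make the event-expansion step concrete, here is a minimal sketch of how a scenario description might be turned into a prompt asking an LLM to unroll the crisis into a timeline. The function name, wording, and parameters are illustrative assumptions, not the authors' actual pipeline code.

```python
# Hypothetical sketch of the event-expansion step: a scenario description
# becomes a prompt asking an LLM (the paper uses GPT-4o-mini) to list the
# chronological events of the unfolding crisis.

def build_event_prompt(scenario: str, n_events: int = 5) -> str:
    """Format a prompt requesting a numbered timeline of crisis events."""
    return (
        "You are simulating the unfolding of a crisis.\n"
        f"Scenario: {scenario}\n"
        f"List {n_events} chronological events, one per line, numbered, "
        "describing how the situation evolves."
    )

prompt = build_event_prompt("Flash flood warning issued for the river valley.", 3)
```

The resulting string would then be sent to the LLM, and each returned event would seed its own warning-message generation step.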
These guidelines were structured around two key dimensions: Tone and Instructions. Tone guidelines, derived from a systematic review and expert panel, focused on increasing attention, comprehension, believability, clarity, and triggering protective action. This meant ensuring proper terminology, providing accurate information, avoiding panic, and clearly stating behaviors. Instruction guidelines, sourced from the official FEMA website, provided grounded suggestions on how to behave depending on the crisis type.
Beyond generating “Good Messages” that adhere to these expert guidelines, the dataset also includes three types of “Bad Messages.” These suboptimal versions were deliberately created to ignore or worsen essential aspects of a good warning message, such as poor tone, incorrect instructions, or flaws in both. This unique feature allows for the study of different Natural Language Generation (NLG) approaches, including preference alignment techniques where models learn by comparing chosen (good) and rejected (bad) outputs.
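A common way to package such data for preference alignment is as prompt/chosen/rejected triples. The sketch below assumes that layout (widely used by preference-tuning libraries) rather than CrisiText's actual schema; all names and strings are illustrative.

```python
# Illustrative record pairing a guideline-following "good" message with a
# deliberately degraded "bad" one, in the chosen/rejected format that
# preference-alignment methods (e.g. ORPO) typically consume.

def make_preference_pair(event: str, good_msg: str, bad_msg: str) -> dict:
    """Build one chosen/rejected pair for preference alignment."""
    return {
        "prompt": f"Write a warning message for this event: {event}",
        "chosen": good_msg,    # follows Tone + Instruction guidelines
        "rejected": bad_msg,   # suboptimal: poor tone and/or wrong instructions
    }

pair = make_preference_pair(
    "Wildfire approaching the northern suburbs.",
    "Wildfire nearing northern suburbs. Evacuate now via Route 9; avoid Hill Rd.",
    "Huge fire!!! Everyone panic and run!",
)
```

During training, the model learns to prefer the chosen completion over the rejected one for the same prompt, which is exactly what the paired "Good" and "Bad" messages in CrisiText enable.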
To assess the effectiveness of CrisiText, a series of experiments was conducted using Llama 3 models. These experiments explored various methodologies for warning message generation, including Supervised Fine-Tuning (SFT) and ORPO (a preference alignment technique), alongside zero-shot and few-shot baselines. Researchers also investigated the impact of providing additional context, such as previous messages from the same scenario or specific FEMA instruction guidelines, during the generation process.
A crucial aspect of the research involved Leave One Scenario Out (LOSO) experiments, which tested the models’ ability to generalize to crisis types not seen during training. This demonstrated the importance of explicitly including instruction guidelines for adapting to new emergency protocols. Furthermore, an automatic post-editor model was fine-tuned using the “Bad Messages,” showing promising results in improving the quality of poorly written warning messages.
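The LOSO protocol itself is simple to express in code: every example of one crisis type is held out for testing, and the model trains on the rest. The data layout below is an assumption for illustration, not the paper's actual code.

```python
# Minimal sketch of a Leave One Scenario Out (LOSO) split: hold out all
# examples of one crisis type to test generalization to unseen emergencies.

def loso_split(examples, held_out_type):
    """Split (crisis_type, example) pairs, holding out one crisis type."""
    train = [ex for t, ex in examples if t != held_out_type]
    test = [ex for t, ex in examples if t == held_out_type]
    return train, test

data = [("flood", "msg1"), ("wildfire", "msg2"), ("flood", "msg3")]
train, test = loso_split(data, "flood")
# train == ["msg2"], test == ["msg1", "msg3"]
```

Repeating this split once per crisis type gives a measure of how well a model handles emergency protocols it never saw during training.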
The evaluation of these experiments utilized both traditional overlap metrics like ROUGE and BLEU, and a sophisticated LLM-as-a-judge technique to approximate human evaluation. Results indicated that SFT generally achieved better performance in automatic metrics compared to ORPO, while LLM-as-a-judge evaluations showed comparable performance. The inclusion of previous messages significantly improved message consistency, and instruction guidelines proved fundamental for out-of-distribution scenarios.
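To give a feel for what an overlap metric measures, here is a toy unigram ROUGE-1 F1 between a generated message and a reference. This is a didactic sketch; real evaluations would rely on an established implementation such as the rouge-score library.

```python
# Toy ROUGE-1 F1: unigram overlap between candidate and reference,
# balancing precision (candidate side) and recall (reference side).
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared unigrams, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("evacuate now via route 9", "evacuate immediately via route 9")
# score == 0.8 (4 of 5 unigrams shared on each side)
```

Such surface-overlap scores reward wording similarity, which is why the study complements them with an LLM-as-a-judge evaluation that can assess tone and instructional correctness.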
In conclusion, CrisiText represents a valuable resource for advancing AI-driven crisis communication. It enables the specialization of LLMs for generating expert-based warning messages, offering a robust foundation for future research and practical applications. While the dataset is synthetic and LLM-generated, the researchers emphasize that any products based on CrisiText should serve as tools to assist human experts, not replace them, especially in sensitive real-world situations. This work paves the way for more effective and timely communication during emergencies, ultimately contributing to public safety and crisis mitigation. You can find the full research paper here: CrisiText: A dataset of warning messages for LLM training in emergency communication.


