TLDR: This research introduces the Multimodal Retrieval-Augmented Generation (MM-RAG) framework, an AI system designed for post-disaster housing damage assessment. It combines a visual encoder (ResNet and Transformer) to analyze images of damaged buildings with a BERT-based text retriever for insurance policies. A cross-modal interaction module and a dynamic modal attention gating mechanism bridge the gap between visual and textual information, allowing the system to generate accurate damage assessments. Trained end-to-end with multi-task optimization, MM-RAG demonstrates superior performance in retrieval accuracy and damage severity classification compared to existing methods, highlighting the effectiveness of integrating diverse data modalities for complex real-world problems.
Natural disasters like earthquakes, hurricanes, and floods can devastate homes, making quick and accurate damage assessment crucial for insurance claims, resource allocation, and rehabilitation efforts. Traditionally, this process relies on manual, on-site inspections, which are often slow, costly, and prone to subjective inconsistencies. However, with the rise of drone imagery and digitized insurance documents, data-driven automated tools are emerging to improve efficiency and objectivity.
While computer vision technologies have made strides in analyzing visual data to classify building damage, they often fall short in integrating the complex details of insurance policies, such as liability scope or exemption clauses. This highlights a critical need for systems that can combine visual perception with text understanding to enable comprehensive, cross-modal reasoning.
Introducing the MM-RAG Framework
Researchers have developed a Multimodal Retrieval-Augmented Generation (MM-RAG) framework to address these challenges. It extends text-only RAG architectures by encoding both images and text, so that it can assess housing damage and match that damage to the relevant insurance policies.
How MM-RAG Works
The MM-RAG framework operates with a two-branch multimodal encoder structure:
- Image Branch: This branch combines ResNet and Transformer models to analyze post-disaster images. It extracts detailed characteristics of building damage, capturing both local features and global structural dependencies.
- Text Branch: A BERT retriever processes textual data, including insurance policy documents and post-disaster descriptions. It vectorizes this text, creating a searchable index of restoration information.
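To make the idea of a "searchable index" concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It assumes the query and the policy clauses have already been embedded into the same vector space. The clause names, the three-dimensional vectors, and the helper functions `cosine` and `top_k` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k):
    """Return the k (clause_id, score) pairs most similar to the query."""
    scored = [(clause_id, cosine(query, vec)) for clause_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings (assumption): a real system would use encoder outputs.
policy_index = {
    "flood_exclusion": [0.9, 0.1, 0.0],
    "roof_coverage": [0.1, 0.9, 0.2],
    "wind_deductible": [0.2, 0.8, 0.5],
}
query_embedding = [0.0, 1.0, 0.3]  # hypothetical embedding of a damaged-roof image

results = top_k(query_embedding, policy_index, k=2)
for clause_id, score in results:
    print(f"{clause_id}: {score:.3f}")
```

With these toy vectors the two roof- and wind-related clauses rank above the flood exclusion, which is the behavior a retriever is meant to show when the query describes roof damage.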
To ensure that the visual and textual information are semantically aligned, the model includes a cross-modal interaction module. This module uses multi-head attention to bridge the semantic representations between images and text, allowing the system to understand how visual damage relates to policy terms.
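The attention step can be sketched as follows. This is a simplified, single-head version of scaled dot-product attention in which image features act as queries and text features act as keys and values. The article describes a multi-head module; the learned projection matrices and the multiple heads are omitted here, and the function names and toy inputs are assumptions for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: image-side vectors; keys/values: text-side vectors.
    Returns (outputs, weights), one row per query.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        w = softmax(scores)
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy inputs (assumption): two image patches, two policy-text tokens.
image_patches = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
attended, attn_weights = cross_attention(image_patches, text_tokens, text_tokens)
```

Each image patch ends up as a weighted mix of text vectors, with more weight on the text token it is most similar to.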
During the generation phase, a unique modal attention gating mechanism dynamically adjusts the influence of visual evidence and prior text information. This means the system can intelligently decide how much to rely on what it sees versus what it reads when generating a damage description or assessment.
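One common way to build such a gate is a sigmoid over the concatenated visual and textual features, which yields a mixing weight between 0 and 1. The sketch below assumes that form. The article does not give the paper's exact formulation, so the scalar gate, the parameter names `gate_weights` and `gate_bias`, and the toy values are all illustrative.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(visual, textual, gate_weights, gate_bias):
    """Blend two equal-length feature vectors with a scalar gate.

    gate near 1 -> rely on visual evidence;
    gate near 0 -> rely on retrieved text.
    """
    concat = visual + textual  # list concatenation: [v_1..v_n, t_1..t_n]
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [gate * v + (1.0 - gate) * t for v, t in zip(visual, textual)]
    return fused, gate

visual_feat = [1.0, 0.0]
text_feat = [0.0, 1.0]

# With zero weights and zero bias the gate is exactly 0.5 (equal trust).
fused_even, gate_even = gated_fusion(visual_feat, text_feat, [0.0] * 4, 0.0)

# A large positive bias pushes the gate toward the visual branch.
fused_visual, gate_visual = gated_fusion(visual_feat, text_feat, [0.0] * 4, 5.0)
```

In a trained model the gate parameters would be learned, so the balance shifts per input rather than being fixed by hand as it is here.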
Training and Optimization
The entire MM-RAG framework is trained end-to-end, optimizing multiple objectives simultaneously. It combines three types of losses:
- Comparison Loss: Enhances the consistency between image and text representations.
- Retrieval Loss: Evaluates the effectiveness of policy similarity ranking.
- Generation Loss: Supervises the quality of the generated text output.
This multi-task optimization allows the model to achieve both image understanding and policy matching through collaborative learning.
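A multi-task objective of this kind is typically a weighted sum of the individual losses. The sketch below assumes that form and reads the comparison loss as an InfoNCE-style contrastive loss, in which each image should score highest against its own paired text. Both choices are assumptions: the article does not state the loss formulas or weights, and the names `w_con`, `w_ret`, `w_gen`, and `temperature` are hypothetical.

```python
import math

def contrastive_loss(sim, temperature=0.1):
    """Image-to-text InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity of image i and text j; the matching
    pair for image i is assumed to sit on the diagonal (j == i).
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target i
    return total / len(sim)

def total_loss(l_con, l_ret, l_gen, w_con=1.0, w_ret=1.0, w_gen=1.0):
    """Weighted sum of the three task losses (weights are assumptions)."""
    return w_con * l_con + w_ret * l_ret + w_gen * l_gen

aligned = [[1.0, 0.0], [0.0, 1.0]]     # each image matches its own text
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # each image matches the wrong text

loss_aligned = contrastive_loss(aligned)
loss_misaligned = contrastive_loss(misaligned)
combined = total_loss(loss_aligned, l_ret=0.4, l_gen=1.2)
```

The aligned batch yields a much smaller contrastive loss than the misaligned one, which is the signal that pulls paired image and text representations together during training.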
Experimental Validation
The MM-RAG framework was tested on a multimodal dataset called xBD+Policy, which combines remote sensing disaster images with real insurance contract templates. The dataset includes images from various natural disaster types and provides pre-disaster, post-disaster images, insurance policy text, and damage level labels.
The experiments demonstrated that MM-RAG consistently outperformed several baseline methods, including visual-only, text-only, late fusion (ResNet + BERT), and text-based RAG models. Key findings included:
- MM-RAG showed superior performance in retrieval accuracy and damage severity classification.
- Its accuracy remained high even with varying amounts of training data, achieving smooth convergence.
- The model’s Macro-F1 score, which objectively reflects classification performance across different damage levels, increased with a wider retrieval scope (Top-k documents).
- Higher embedding dimensions for data representation improved retrieval accuracy, with MM-RAG maintaining a significant lead.
- The modal attention gating mechanism proved critical, significantly enhancing the accuracy and stability of damage assessment by dynamically balancing visual and textual inputs.
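For readers unfamiliar with the Macro-F1 metric mentioned above, it is the unweighted mean of the per-class F1 scores, so rare damage levels count as much as common ones. The following sketch computes it from scratch; the label names and predictions are toy values, not results from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that appear."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy labels (assumption): four buildings, three damage levels.
truth = ["none", "minor", "major", "major"]
preds = ["none", "major", "major", "major"]

score = macro_f1(truth, preds)
print(f"Macro-F1: {score:.2f}")  # per-class F1: major 0.8, minor 0.0, none 1.0
```

Here plain accuracy would be 0.75, but Macro-F1 is 0.60 because the missed "minor" class contributes a zero, which illustrates why the metric is useful when damage levels are imbalanced.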
Conclusion and Future Directions
In summary, the MM-RAG framework represents a significant advance in post-disaster housing damage assessment. By tightly fusing image encoding (ResNet-Transformer), text retrieval (BERT), cross-modal attention, and a modal-aware gating generator, it outperforms the single-modal and multimodal baselines it was compared against.
Future work aims to further enhance the model by incorporating time-series disaster evolution characteristics, improving semantic reasoning for complex policy clauses, and exploring online incremental learning and small-sample adaptation to broaden its generalization capabilities.


