
FATHOMS-RAG: A New Benchmark for Evaluating Multimodal RAG Systems

TL;DR: The FATHOMS-RAG research introduces a new benchmark for evaluating Retrieval-Augmented Generation (RAG) pipelines, focusing on their ability to process and reason across data modalities such as text, tables, and images. It features a human-curated dataset of 93 questions, a phrase-level recall metric for correctness, and a nearest-neighbor classifier to detect hallucinations. Evaluations showed that closed-source models significantly outperformed open-source pipelines, especially in multimodal and cross-document reasoning, though all systems still struggled with complex cross-document multimodal queries. The framework aims to provide a reproducible tool for assessing RAG system reliability.

A new research paper introduces FATHOMS-RAG, a comprehensive framework designed to evaluate the performance of Retrieval-Augmented Generation (RAG) pipelines, especially those dealing with various types of information like text, tables, and images. This benchmark aims to provide a holistic assessment of how well these systems can ingest, retrieve, and reason about multimodal data, addressing a gap in existing evaluation methods that often focus on only one aspect, such as retrieval accuracy.

The FATHOMS-RAG framework is built upon several key contributions. Firstly, it features a small, human-curated dataset of 93 questions. These questions are specifically designed to test a RAG pipeline’s ability to handle different data modalities: text-only, tables, images, and information spread across multiple modalities within one or more documents. This diversity allows for a thorough examination of a system’s capabilities.
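To make that structure concrete, a benchmark entry of this kind might look like the following sketch. The field names here are hypothetical illustrations; the paper’s exact schema is not reproduced.

```python
# Hypothetical illustration of a benchmark record; the actual field names
# and schema used by FATHOMS-RAG may differ.
question_record = {
    "question": "Which product line had the highest Q3 revenue, "
                "and what trend does the accompanying chart show?",
    "modalities": ["table", "image"],                      # modalities the answer depends on
    "source_documents": ["report_a.pdf", "report_b.pdf"],  # cross-document question
    "required_phrases": ["Product X", "upward trend"],     # ground-truth phrases to recall
}
```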

Secondly, the researchers developed a phrase-level recall metric to measure correctness. This metric ensures that partial correctness is recognized, rewarding answers proportionally to the number of required phrases present. For example, if a question asks for three specific items, and the system provides two, it receives a partial score rather than a complete failure.
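A minimal sketch of such a metric is shown below, assuming simple case-insensitive substring matching; the paper’s exact matching rules may be more sophisticated.

```python
def phrase_level_recall(answer: str, required_phrases: list[str]) -> float:
    """Fraction of required ground-truth phrases found in the answer."""
    if not required_phrases:
        return 0.0
    answer_lower = answer.lower()
    matched = sum(phrase.lower() in answer_lower for phrase in required_phrases)
    return matched / len(required_phrases)

# A question requiring three items, with two present, scores 2/3 rather than 0.
score = phrase_level_recall(
    "The report lists Product X and Product Y.",
    ["Product X", "Product Y", "Product Z"],
)
print(score)  # 0.666...
```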

Thirdly, the framework includes a novel nearest-neighbor embedding classifier to identify potential hallucinations. Hallucinations occur when a model presents incorrect information as fact, while abstentions are when it explicitly states it cannot answer. This classifier helps distinguish between these two scenarios, providing a more nuanced understanding of a system’s reliability. A response is flagged as a hallucination if it’s presented as a factual statement but fails to achieve full phrase-level recall against the ground truth.
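The idea can be sketched as follows, assuming an off-the-shelf sentence-embedding model and a small set of labeled seed responses; the actual classifier, embedding model, and seed set used in the paper may differ.

```python
# Sketch: classify a response as an "assertion" vs. an "abstention" by
# nearest-neighbor lookup against labeled seed examples in embedding space.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

seed_texts = [
    "The answer is 42.",                                   # assertion
    "Total revenue was $3.1M in 2021.",                    # assertion
    "I cannot find this information in the documents.",    # abstention
    "The provided context does not contain the answer.",   # abstention
]
seed_labels = ["assertion", "assertion", "abstention", "abstention"]
seed_vecs = model.encode(seed_texts, normalize_embeddings=True)

def classify(response: str) -> str:
    vec = model.encode([response], normalize_embeddings=True)[0]
    sims = seed_vecs @ vec  # cosine similarity (vectors are normalized)
    return seed_labels[int(np.argmax(sims))]

def is_hallucination(response: str, recall: float) -> bool:
    # Flagged as a hallucination: stated as fact, yet failing full recall.
    return classify(response) == "assertion" and recall < 1.0
```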

The paper also presents a comparative evaluation of different RAG pipelines. This includes two open-source pipelines—one built with LlamaIndex for text-only ingestion and another using Docling with EasyOCR for optical character recognition (OCR) and table recovery—along with four closed-source foundation models: Claude Sonnet-4, Gemini-2.5 Flash, GPT-4.1, and GPT-4o.
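As a rough illustration of what the text-only baseline involves, a minimal LlamaIndex setup looks like the sketch below, using library defaults; the paper’s exact configuration, chunking strategy, and underlying models are not specified here. The Docling-based pipeline would differ mainly in the ingestion step, adding OCR and table structure recovery before indexing.

```python
# Minimal text-only RAG sketch with LlamaIndex defaults (requires an LLM /
# embedding API key to be configured); tables and images in the source
# documents are ingested only as whatever plain text can be extracted.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("corpus/").load_data()   # ingest the corpus
index = VectorStoreIndex.from_documents(documents)         # build a vector index
query_engine = index.as_query_engine()                     # retrieval + generation

answer = query_engine.query("Which quarter had the highest revenue?")
print(answer)
```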

The findings from these evaluations are quite insightful. Closed-source pipelines consistently outperformed open-source pipelines in both correctness and hallucination metrics. The performance gap was particularly noticeable in questions that required reasoning over multimodal and cross-document information. For instance, text-only pipelines struggled dramatically with questions relying on tables or images, often leading to high hallucination rates. While incorporating OCR and layout-aware preprocessing with Docling and EasyOCR improved performance on image and cross-document queries, it still couldn’t fully close the gap with the advanced closed-source systems.

Even the state-of-the-art closed-source models, despite their superior performance, faced challenges with questions requiring information from text and tables or images spread across multiple documents. This highlights a persistent bottleneck in cross-document multimodal reasoning across all evaluated systems.

To validate their automatic scoring system, the researchers conducted a human evaluation. A third-party reviewer rated agreement with the system’s correctness and hallucination assignments on a Likert scale. The results showed a high average agreement, with 4.62 for correctness and 4.53 for hallucination detection, indicating that the automated metrics align well with human judgment.

By making their dataset and evaluation framework publicly available, the authors aim to provide a reproducible tool for benchmarking multimodal RAG pipelines. This work is a significant step towards developing more trustworthy retrieval-augmented systems by enabling systematic comparison across different models, modalities, and ingestion strategies. For more details, you can read the full research paper here.

