TLDR: The FATHOMS-RAG paper introduces a new benchmark for evaluating Retrieval-Augmented Generation (RAG) pipelines, focusing on their ability to ingest and reason across data modalities such as text, tables, and images. It features a human-curated dataset of 93 questions, a phrase-level recall metric for correctness, and a nearest-neighbor embedding classifier to detect hallucinations. In evaluations, closed-source models significantly outperformed open-source pipelines, especially on multimodal and cross-document reasoning, though all systems still struggled with complex cross-document multimodal queries. The framework aims to provide a reproducible tool for assessing RAG system reliability.
A new research paper introduces FATHOMS-RAG, a comprehensive framework designed to evaluate the performance of Retrieval-Augmented Generation (RAG) pipelines, especially those dealing with various types of information like text, tables, and images. This benchmark aims to provide a holistic assessment of how well these systems can ingest, retrieve, and reason about multimodal data, addressing a gap in existing evaluation methods that often focus on only one aspect, such as retrieval accuracy.
The FATHOMS-RAG framework is built upon several key contributions. Firstly, it features a small, human-created dataset of 93 questions. These questions are specifically designed to test a RAG pipeline’s ability to handle different data modalities, including text-only, tables, images, and even information spread across multiple modalities within one or more documents. This diverse dataset allows for a thorough examination of a system’s capabilities.
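To make the dataset's shape concrete, here is a hypothetical sketch of what a single benchmark entry might look like. The field names and values below are illustrative assumptions for this summary, not the paper's published schema.

```python
# Hypothetical benchmark entry; field names and values are illustrative
# assumptions, not the paper's published schema.
example_question = {
    "id": "q042",
    "question": "Which three products exceeded their Q2 revenue targets?",
    "modalities": ["text", "table"],                # evidence types the question requires
    "documents": ["report_a.pdf", "report_b.pdf"],  # more than one document => cross-document
    "required_phrases": ["Widget X", "Widget Y", "Widget Z"],  # used for recall scoring
}
```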
Secondly, the researchers developed a phrase-level recall metric to measure correctness. This metric ensures that partial correctness is recognized, rewarding answers proportionally to the number of required phrases present. For example, if a question asks for three specific items, and the system provides two, it receives a partial score rather than a complete failure.
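A minimal Python sketch of that scoring rule, assuming simple case-insensitive substring matching; the authors' actual normalization and matching details may differ:

```python
def phrase_level_recall(answer: str, required_phrases: list[str]) -> float:
    """Fraction of required ground-truth phrases found in the answer.

    A minimal sketch of the idea described in the paper; the authors'
    exact matching (normalization, fuzzy matching, etc.) may differ.
    """
    if not required_phrases:
        return 0.0
    answer_lower = answer.lower()
    hits = sum(1 for phrase in required_phrases if phrase.lower() in answer_lower)
    return hits / len(required_phrases)

# Example: two of three required items present -> partial credit of 2/3.
score = phrase_level_recall(
    "The report lists Widget X and Widget Y.",
    ["Widget X", "Widget Y", "Widget Z"],
)
print(score)  # 0.666...
```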
Thirdly, the framework includes a novel nearest-neighbor embedding classifier to identify potential hallucinations. Hallucinations occur when a model presents incorrect information as fact, whereas abstentions occur when it explicitly states that it cannot answer. The classifier distinguishes between these two scenarios, providing a more nuanced view of a system's reliability: a response is flagged as a hallucination if it is presented as a factual statement but fails to achieve full phrase-level recall against the ground truth.
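Here is a minimal sketch of how such a nearest-neighbor check could be combined with the recall metric. The embedding model (sentence-transformers) and the hand-written exemplar responses are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Sketch: classify a response as "abstention" vs "assertive" by nearest
# neighbor over embeddings of labeled exemplars, then flag assertive but
# incorrect answers as hallucinations. Exemplars and model are assumptions.
model = SentenceTransformer("all-MiniLM-L6-v2")

exemplars = [
    ("I cannot find that information in the provided documents.", "abstention"),
    ("The context does not contain an answer to this question.", "abstention"),
    ("Widget X exceeded its Q2 revenue target.", "assertive"),
    ("Three products met the criteria: A, B, and C.", "assertive"),
]
exemplar_vecs = model.encode([text for text, _ in exemplars], normalize_embeddings=True)

def classify_response(response: str) -> str:
    vec = model.encode([response], normalize_embeddings=True)[0]
    sims = exemplar_vecs @ vec  # cosine similarity on normalized vectors
    return exemplars[int(np.argmax(sims))][1]

def is_hallucination(response: str, recall: float) -> bool:
    # Flagged only when stated as fact yet missing required phrases.
    return classify_response(response) == "assertive" and recall < 1.0
```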
The paper also presents a comparative evaluation of different RAG pipelines. This includes two open-source pipelines—one built with LlamaIndex for text-only ingestion and another using Docling with EasyOCR for optical character recognition (OCR) and table recovery—along with four closed-source foundation models: Claude Sonnet-4, Gemini-2.5 Flash, GPT-4.1, and GPT-4o.
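As a rough illustration of the text-only open-source baseline, here is a minimal ingestion-and-query sketch using LlamaIndex's standard high-level API. The corpus path and question are placeholders, and this generic pattern is not the authors' exact configuration.

```python
# Minimal text-only RAG sketch with LlamaIndex (generic pattern, not the
# paper's exact pipeline). Assumes documents live in ./corpus and an
# API key is configured for the default LLM and embedding backends.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./corpus").load_data()  # text-only ingestion
index = VectorStoreIndex.from_documents(documents)         # chunk, embed, and index
query_engine = index.as_query_engine()                     # retrieval + generation

response = query_engine.query("Which products exceeded their Q2 targets?")
print(response)
```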
The findings from these evaluations are quite insightful. Closed-source pipelines consistently outperformed open-source pipelines in both correctness and hallucination metrics. The performance gap was particularly noticeable in questions that required reasoning over multimodal and cross-document information. For instance, text-only pipelines struggled dramatically with questions relying on tables or images, often leading to high hallucination rates. While incorporating OCR and layout-aware preprocessing with Docling and EasyOCR improved performance on image and cross-document queries, it still couldn’t fully close the gap with the advanced closed-source systems.
Even the state-of-the-art closed-source models, despite their superior performance, faced challenges with questions requiring information from text and tables or images spread across multiple documents. This highlights a persistent bottleneck in cross-document multimodal reasoning across all evaluated systems.
To validate their automatic scoring system, the researchers conducted a human evaluation. A third-party reviewer rated agreement with the system’s correctness and hallucination assignments on a Likert scale. The results showed a high average agreement, with 4.62 for correctness and 4.53 for hallucination detection, indicating that the automated metrics align well with human judgment.
By making their dataset and evaluation framework publicly available, the authors aim to provide a reproducible tool for benchmarking multimodal RAG pipelines. This work is a significant step towards developing more trustworthy retrieval-augmented systems by enabling systematic comparison across different models, modalities, and ingestion strategies. For more details, see the full research paper.


