TLDR: The FATHOMS-RAG paper introduces a new benchmark for evaluating Retrieval-Augmented Generation (RAG) pipelines, focusing on their ability to ingest and reason across data modalities such as text, tables, and images. It features a human-curated dataset of 93 questions, a phrase-level recall metric for correctness, and a nearest-neighbor embedding classifier to detect hallucinations. In evaluations, closed-source models significantly outperformed open-source pipelines, especially on multimodal and cross-document reasoning, though all systems still struggled with complex cross-document multimodal queries. The framework aims to provide a reproducible tool for assessing RAG system reliability.
A new research paper introduces FATHOMS-RAG, a comprehensive framework designed to evaluate the performance of Retrieval-Augmented Generation (RAG) pipelines, especially those dealing with various types of information like text, tables, and images. This benchmark aims to provide a holistic assessment of how well these systems can ingest, retrieve, and reason about multimodal data, addressing a gap in existing evaluation methods that often focus on only one aspect, such as retrieval accuracy.
The FATHOMS-RAG framework is built upon several key contributions. Firstly, it features a small, human-created dataset of 93 questions. These questions are specifically designed to test a RAG pipeline’s ability to handle different data modalities, including text-only, tables, images, and even information spread across multiple modalities within one or more documents. This diverse dataset allows for a thorough examination of a system’s capabilities.
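To make the dataset's shape concrete, here is a hypothetical sketch of what a single benchmark entry might look like. The field names and values below are illustrative assumptions for this summary, not the paper's published schema.

```python
# Hypothetical benchmark entry; field names and values are illustrative
# assumptions, not the paper's published schema.
example_question = {
    "id": "q042",
    "question": "Which three products exceeded their Q2 revenue targets?",
    "modalities": ["text", "table"],                # evidence types the question requires
    "documents": ["report_a.pdf", "report_b.pdf"],  # more than one document => cross-document
    "required_phrases": ["Widget X", "Widget Y", "Widget Z"],  # used for recall scoring
}
```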
Secondly, the researchers developed a phrase-level recall metric to measure correctness. This metric ensures that partial correctness is recognized, rewarding answers proportionally to the number of required phrases present. For example, if a question asks for three specific items, and the system provides two, it receives a partial score rather than a complete failure.
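A minimal Python sketch of that scoring rule, assuming simple case-insensitive substring matching; the authors' actual normalization and matching details may differ:

```python
def phrase_level_recall(answer: str, required_phrases: list[str]) -> float:
    """Fraction of required ground-truth phrases found in the answer.

    A minimal sketch of the idea described in the paper; the authors'
    exact matching (normalization, fuzzy matching, etc.) may differ.
    """
    if not required_phrases:
        return 0.0
    answer_lower = answer.lower()
    hits = sum(1 for phrase in required_phrases if phrase.lower() in answer_lower)
    return hits / len(required_phrases)

# Example: two of three required items present -> partial credit of 2/3.
score = phrase_level_recall(
    "The report lists Widget X and Widget Y.",
    ["Widget X", "Widget Y", "Widget Z"],
)
print(score)  # 0.666...
```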
Thirdly, the framework includes a novel nearest-neighbor embedding classifier to identify potential hallucinations. Hallucinations occur when a model presents incorrect information as fact, whereas abstentions occur when it explicitly states that it cannot answer. The classifier distinguishes between these two scenarios, providing a more nuanced view of a system's reliability: a response is flagged as a hallucination if it is presented as a factual statement but fails to achieve full phrase-level recall against the ground truth.
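Here is a minimal sketch of how such a nearest-neighbor check could be combined with the recall metric. The embedding model (sentence-transformers) and the hand-written exemplar responses are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Sketch: classify a response as "abstention" vs "assertive" by nearest
# neighbor over embeddings of labeled exemplars, then flag assertive but
# incorrect answers as hallucinations. Exemplars and model are assumptions.
model = SentenceTransformer("all-MiniLM-L6-v2")

exemplars = [
    ("I cannot find that information in the provided documents.", "abstention"),
    ("The context does not contain an answer to this question.", "abstention"),
    ("Widget X exceeded its Q2 revenue target.", "assertive"),
    ("Three products met the criteria: A, B, and C.", "assertive"),
]
exemplar_vecs = model.encode([text for text, _ in exemplars], normalize_embeddings=True)

def classify_response(response: str) -> str:
    vec = model.encode([response], normalize_embeddings=True)[0]
    sims = exemplar_vecs @ vec  # cosine similarity on normalized vectors
    return exemplars[int(np.argmax(sims))][1]

def is_hallucination(response: str, recall: float) -> bool:
    # Flagged only when stated as fact yet missing required phrases.
    return classify_response(response) == "assertive" and recall < 1.0
```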
The paper also presents a comparative evaluation of different RAG pipelines. This includes two open-source pipelines—one built with LlamaIndex for text-only ingestion and another using Docling with EasyOCR for optical character recognition (OCR) and table recovery—along with four closed-source foundation models: Claude Sonnet-4, Gemini-2.5 Flash, GPT-4.1, and GPT-4o.
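As a rough illustration of the text-only open-source baseline, here is a minimal ingestion-and-query sketch using LlamaIndex's standard high-level API. The corpus path and question are placeholders, and this generic pattern is not the authors' exact configuration.

```python
# Minimal text-only RAG sketch with LlamaIndex (generic pattern, not the
# paper's exact pipeline). Assumes documents live in ./corpus and an
# API key is configured for the default LLM and embedding backends.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./corpus").load_data()  # text-only ingestion
index = VectorStoreIndex.from_documents(documents)         # chunk, embed, and index
query_engine = index.as_query_engine()                     # retrieval + generation

response = query_engine.query("Which products exceeded their Q2 targets?")
print(response)
```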
The findings from these evaluations are quite insightful. Closed-source pipelines consistently outperformed open-source pipelines in both correctness and hallucination metrics. The performance gap was particularly noticeable in questions that required reasoning over multimodal and cross-document information. For instance, text-only pipelines struggled dramatically with questions relying on tables or images, often leading to high hallucination rates. While incorporating OCR and layout-aware preprocessing with Docling and EasyOCR improved performance on image and cross-document queries, it still couldn’t fully close the gap with the advanced closed-source systems.
Even the state-of-the-art closed-source models, despite their superior performance, faced challenges with questions requiring information from text and tables or images spread across multiple documents. This highlights a persistent bottleneck in cross-document multimodal reasoning across all evaluated systems.
To validate their automatic scoring system, the researchers conducted a human evaluation. A third-party reviewer rated agreement with the system’s correctness and hallucination assignments on a Likert scale. The results showed a high average agreement, with 4.62 for correctness and 4.53 for hallucination detection, indicating that the automated metrics align well with human judgment.
By making their dataset and evaluation framework publicly available, the authors aim to provide a reproducible tool for benchmarking multimodal RAG pipelines. This work is a significant step towards developing more trustworthy retrieval-augmented systems by enabling systematic comparison across different models, modalities, and ingestion strategies. For more details, see the full research paper.


