TLDR: Document Haystack is a novel benchmark designed to evaluate Vision Language Models (VLMs) on their ability to understand and retrieve information from long, visually complex documents, ranging from 5 to 200 pages. It features strategically placed ‘needles’ (text or text+image key-value pairs) to test retrieval capabilities. The research reveals that VLMs struggle more with image-based and multimodal information in longer documents compared to pure text, highlighting significant areas for improvement in processing extensive visual documents.
Large Language Models (LLMs) have advanced rapidly in understanding and generating human-like text. Multimodal LLMs extend this capability further: they can process and analyze complex inputs from several sources, including images and documents.
A significant area where these advanced models are making an impact is document understanding. This is crucial for many industries, such as legal, medical, and financial sectors, where accurate interpretation of documents directly influences important decisions. While LLMs have greatly improved tasks like information extraction and question-answering from documents, there’s a unique challenge: documents often contain complex elements like tables, charts, and other visual components, requiring sophisticated systems to handle large volumes of unstructured, diverse information.
Despite promising progress, how well Vision Language Models (VLMs) perform on multimodal documents, especially long ones, hasn’t been fully established. Current evaluation methods tend to focus on shorter documents, which limits our understanding of VLM performance on more extensive and complex document analysis tasks.
Introducing Document Haystack
To address this gap, researchers have introduced Document Haystack, a comprehensive benchmark designed to evaluate how well VLMs retrieve key multimodal information from long, visually complex documents. The benchmark includes 400 document variations, ranging from 5 to 200 pages in length. Either pure text or multimodal text-and-image “needles” are placed at various depths within these documents to test the VLMs’ retrieval abilities.
The concept is inspired by the classic “Needle in a Haystack” problem, where a model must find a specific piece of information (the “needle”) hidden within a large amount of context (the “haystack”). In Document Haystack, these needles are formatted as key-value pairs, like “The secret sport is ‘basketball’,” where “basketball” could be text or an image. These needles are inserted at different pages and positions, with varying colors, sizes, and fonts to increase complexity.
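The key-value needle format described above can be sketched in a few lines of Python. This is only an illustration: the `Needle` class, the question wording, and the whole-word matching rule in `is_correct` are assumptions made for this example, not the benchmark's actual code or scoring procedure.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Needle:
    """Hypothetical representation of one key-value needle."""
    key: str    # e.g. "sport"
    value: str  # e.g. "basketball" (for image needles, the label of the image)
    page: int   # 1-based page on which the needle is placed

    def sentence(self) -> str:
        # The statement hidden in the document.
        return f'The secret {self.key} is "{self.value}".'

    def question(self) -> str:
        # Assumed question wording; the benchmark's prompt may differ.
        return f"What is the secret {self.key} in the document?"


def normalize(text: str) -> str:
    # Lowercase, turn punctuation into spaces, and collapse whitespace.
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def is_correct(needle: Needle, answer: str) -> bool:
    # Assumed scoring rule: the value must appear as whole word(s) in the answer.
    return f" {normalize(needle.value)} " in f" {normalize(answer)} "


needle = Needle(key="sport", value="basketball", page=3)
print(needle.sentence())
print(needle.question())
print(is_correct(needle, "The secret sport is Basketball!"))
```

Matching on whole words rather than raw substrings avoids counting a value such as "ball" as found inside "basketball".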
The benchmark offers two main sets: one with text needles and another with text-plus-image needles. This allows for direct comparison of VLM performance across different information modalities. Documents are available in PDF, image (each page converted to an image), and text-only formats, accommodating various VLM input requirements.
Key Findings and Challenges
The evaluation of prominent VLMs like Nova Lite, Gemini Flash-2.0, and GPT-4o-mini on Document Haystack revealed several important insights:
- When retrieving text needles from document images, accuracy consistently dropped as document length increased across all models. This highlights a fundamental challenge for VLMs: the longer the document, the harder it is to maintain performance.
- There’s a significant performance gap between retrieving text information from pure text documents versus extracting the same information from document images. This indicates that visual processing adds considerable complexity.
- The most challenging task was retrieving text-plus-image needles from document images. Models struggled significantly more with this multimodal retrieval, especially as document length grew. For shorter documents (5-10 pages), Nova Lite showed strong performance, but for longer documents, Gemini Flash-2.0 emerged as slightly superior.
- Models also showed varying token consumption per image, which can impact performance. Higher token counts might allow for more detailed image extraction but could complicate retrieval in longer contexts.
These results underscore that while VLMs are robust in handling long textual information, their ability to identify and extract image-based and multimodal information deteriorates with increasing document length. This points to a critical area for future improvement in next-generation VLMs, focusing on more efficient architectures and training methods to maintain visual context over extended sequences.
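Trends like these come from grouping per-question outcomes by document length. The sketch below shows one way to do that aggregation; the function name and the input shape are assumptions, and the sample data is made up for illustration and does not reflect any reported benchmark numbers.

```python
from collections import defaultdict


def accuracy_by_length(results):
    """Compute accuracy per document length.

    `results` is an iterable of (num_pages, correct) pairs, one per question.
    Returns a dict mapping page count to the fraction of correct answers.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for num_pages, correct in results:
        totals[num_pages] += 1
        hits[num_pages] += int(correct)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Toy outcomes, invented for this example only.
toy_results = [(5, True), (5, True), (200, True), (200, False)]
print(accuracy_by_length(toy_results))
```

Running the same aggregation separately for text needles and text-plus-image needles would make the modality gap described above directly comparable across lengths.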
Looking Ahead
Document Haystack provides an objective and automated evaluation framework with a total of 8,250 questions. Beyond just accuracy, the dataset can also be used to evaluate the latency of VLMs, offering insights into their computational efficiency. The comprehensive metadata included with each needle (page location, coordinates, color, size) will also support more detailed research, such as location-aware information extraction and spatial relationship analysis.
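As a rough sketch of how the metadata and latency measurements might be used together, the example below defines a hypothetical metadata record and a timing wrapper. The field names, the `relative_depth` property, and the `timed_call` helper are assumptions for illustration; the dataset's actual schema may differ, and `model_fn` stands in for whatever VLM client is being evaluated.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class NeedleMetadata:
    """Hypothetical per-needle metadata record (field names are assumed)."""
    page: int          # 1-based page where the needle sits
    total_pages: int   # length of the document
    x: float           # horizontal position on the page
    y: float           # vertical position on the page
    color: str
    font_size: float

    @property
    def relative_depth(self) -> float:
        # Fraction of the document that precedes or contains the needle.
        return self.page / self.total_pages


def timed_call(model_fn, *args):
    """Call model_fn(*args) and return (output, elapsed_seconds)."""
    start = time.perf_counter()
    output = model_fn(*args)
    return output, time.perf_counter() - start


def fake_model(document, question):
    # Placeholder standing in for a real VLM call.
    return "basketball"


meta = NeedleMetadata(page=50, total_pages=200, x=0.1, y=0.8,
                      color="red", font_size=14.0)
answer, seconds = timed_call(fake_model, "document.pdf", "What is the secret sport?")
print(meta.relative_depth, answer, f"{seconds:.6f}s")
```

Recording depth alongside latency and correctness would allow position-dependent effects to be separated from length-dependent ones.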
In conclusion, Document Haystack offers a rigorous test of how well VLMs process long, visually complex documents. By exposing current limitations, it gives researchers a concrete target for improving multimodal document understanding and, ultimately, for building more effective VLMs for real-world applications.


