TLDR: This paper benchmarks eight multi-modal large language models (GPT-5, Gemini 2.5, Gemma 3) on invoice processing tasks using zero-shot prompting. It compares two strategies: direct image processing and a structured parsing approach (converting documents to markdown first). The study found that native image processing consistently outperforms structured parsing, with Gemini 2.5 models showing the highest accuracy. The research highlights the importance of visual context for document understanding and identifies challenges in extracting unstructured fields like IBANs.
Automating invoice and order processing has long been a critical, yet often tedious, task for businesses across all industries. Traditionally, companies relied on manual labor or specialized Optical Character Recognition (OCR) systems that required extensive customization and struggled with the diverse formats of documents. However, the advent of multi-modal large language models (LLMs) is ushering in a new era for document understanding, promising more adaptable and generalizable solutions.
A recent research paper, titled “Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing,” delves into this technological shift. Authored by David Berghaus, Armin Berger, Lars Hillebrand, Kostadin Cvejoski, and Rafet Sifa from Fraunhofer IAIS and the Lamarr Institute, this study provides a comprehensive benchmark for evaluating how different LLM approaches handle invoice processing. You can read the full paper here: Research Paper.
Understanding the Approach
The researchers set out to compare two primary strategies for LLMs to process invoices: direct image processing and a structured parsing approach. In the direct image processing method, multi-modal LLMs analyze the document image directly, leveraging their ability to understand visual content, text layout, and spatial relationships. This preserves all visual information and context from the original document.
The second strategy, called Docling Processing, is a two-step approach. First, an open-source tool called Docling converts the document image into a markdown format. This text-only representation maintains structural information like tables and sections using markdown syntax. The LLM then processes this structured text. While this might simplify the visual complexity for the LLM, it could potentially lose some crucial visual context.
Models and Data in Focus
The benchmark evaluated eight state-of-the-art multi-modal models from three major families: OpenAI’s GPT-5 (gpt-5-chat, gpt-5-mini, gpt-5-nano), Google’s Gemini 2.5 (gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite), and open-source Google Gemma 3 (gemma-3-12b-it, gemma-3-4b-it). These models were tested on three diverse, openly available invoice document datasets: Clean Invoices (synthetic), Scanned Receipts (real-world variations), and Scanned Invoices (with scanning artifacts like stamps and handwritten notes).
Key Findings: Native Image Processing Takes the Lead
The most significant discovery from the study is the consistent and substantial superiority of native image processing over the structured parsing (Docling) approach. Across all datasets and models, direct image analysis yielded significantly higher accuracy. For example, on the Scanned Receipts dataset, the best-performing model achieved 87.46% accuracy with native processing, compared to only 47.00% using the Docling method.
This suggests that the visual context and layout understanding are crucial for effective document processing, and current multi-modal LLMs are adept at leveraging this information when processing images directly. The Docling conversion process, while providing structured text, often created a performance bottleneck, especially on cleaner datasets, indicating that the initial OCR and markdown conversion became the limiting factor rather than the LLM’s reasoning abilities.
Model Performance Highlights
Among the models, the Gemini 2.5 family demonstrated the strongest overall performance. Gemini 2.5 Pro consistently achieved the highest accuracy across all three datasets. The GPT-5 models were also highly competitive, particularly on the less noisy Clean Invoices dataset, where GPT-5 Chat and GPT-5 Mini surpassed 96% accuracy.
The open-source Gemma 3 models showed promising results, with the larger ‘gemma-3-12b-it’ model delivering solid performance. However, the smaller ‘gemma-3-4b-it’ model struggled significantly with direct image analysis, highlighting a capability threshold where smaller models might be less effective for complex visual extraction tasks.
Also Read:
- Optimizing Large Language Models for Clinical Data Extraction
- Integrating New Data Types into Large Language Models with Minimal Samples
Challenges and Future Directions
The research also pointed out persistent difficulties in extracting highly unstructured alphanumeric fields, such as IBAN numbers, where common OCR-related mistakes like confusing ‘0’ with ‘O’ or ‘U’ were observed. Performance on noisy scanned documents also remains a challenge compared to clean digital invoices.
The study concludes that direct image processing with multi-modal LLMs offers a powerful approach for document automation. Future research could explore specialized models and fine-tuning for document understanding tasks, potentially incorporating models like LayoutLM and LiLT, which are designed for layout understanding but require fine-tuning, unlike the zero-shot prompting approach used in this benchmark.


