TLDR: VDInstruct is a new AI model for extracting key information from visual documents like receipts and contracts. It uses a unique “content-aware” vision tokenization method that processes documents more efficiently by focusing on important areas, rather than the entire image. This allows it to achieve state-of-the-art accuracy, especially on new, unseen document types, while using significantly fewer computational resources compared to previous models.
Understanding visual documents like receipts, invoices, and contracts is a crucial task for many businesses and applications. This process, known as Visual Document Understanding (VDU), involves extracting text, identifying structural components, and interpreting how these elements convey information. At its core lies Key Information Extraction (KIE), which focuses on pinpointing and extracting critical fields such as dates, amounts, or vendor names.
However, existing Multimodal Large Language Models (MLLMs) often struggle with dense documents. Their vision encoders typically produce a number of image tokens that grows with image size rather than with the amount of content on the page, which wastes computation and memory. The result can be a ‘token explosion,’ where many redundant image tokens are generated, hindering performance and scalability.
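To make the scaling problem concrete, here is a minimal sketch of how the token count grows when an image is cut into a fixed grid of patches. The 14-pixel patch size and the helper name `uniform_patch_tokens` are illustrative assumptions, not details from the paper, and real MLLMs often add cropping or token-merging steps on top of this.

```python
import math

def uniform_patch_tokens(height_px: int, width_px: int, patch_px: int = 14) -> int:
    """Tokens produced when an image is split into a fixed grid of square patches.

    patch_px=14 is an assumed, illustrative value.
    """
    return math.ceil(height_px / patch_px) * math.ceil(width_px / patch_px)

# The count depends only on resolution, not on how much content the page holds:
# doubling the side length quadruples the number of tokens.
for side in (448, 896, 1792):
    print(side, uniform_patch_tokens(side, side))
```

A mostly blank page and a densely printed one of the same resolution receive the same number of tokens under this scheme, which is the inefficiency described above.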
Introducing VDInstruct: A Smarter Approach
To tackle these challenges, researchers Son Nguyen, Giang Nguyen, Hung Dao, Thao Do, and Daeyoung Kim from KAIST and Auburn University have introduced VDInstruct. This innovative MLLM rethinks how visual documents are processed by separating spatial region detection from semantic feature extraction. At the heart of VDInstruct is a ‘content-aware tokenization’ strategy. Instead of uniformly breaking down the entire image, it intelligently generates tokens in proportion to the document’s complexity. This approach preserves critical structural information while eliminating unnecessary tokens, leading to much more efficient processing.
Dual Vision Encoder: The Core Innovation
VDInstruct’s architecture features a unique Dual Vision Encoder, which is key to its efficiency and accuracy. This encoder comprises two main components:
- Spatial Encoder: This part is responsible for precisely detecting and embedding ‘Multimodal Regions of Interest’ (ROIs), which include both textual elements (like words and headers) and visual elements (such as figures and charts). It converts each detected ROI into a ‘spatial token’ that encodes its geometric information.
- Semantic Encoder: Operating in parallel, this encoder extracts fine-grained visual-textual features from the ROIs identified by the spatial encoder, producing ‘semantic tokens’ that capture the local meaning of the content.
By decoupling region localization from feature extraction, each encoder can be trained with objectives tailored to its specific role. This allows for content-aware tokenization, where tokens are allocated only to informative regions, filtering out redundant areas and minimizing waste while preserving crucial layout cues.
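The idea of allocating tokens per region can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the `ROI` class, the function names, and the choice of exactly one semantic token per ROI are all hypothetical, and the "semantic" features here are placeholders where a real model would pool features from a vision backbone.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

@dataclass
class ROI:
    box: Box
    kind: str  # "text" or "visual" (hypothetical labels)

def spatial_token(roi: ROI, page_w: float, page_h: float) -> List[float]:
    """Encode geometry only: the box normalised to the [0, 1] range."""
    x0, y0, x1, y1 = roi.box
    return [x0 / page_w, y0 / page_h, x1 / page_w, y1 / page_h]

def semantic_token(roi: ROI) -> List[float]:
    """Placeholder content features (box area and a text/visual flag)."""
    x0, y0, x1, y1 = roi.box
    area = (x1 - x0) * (y1 - y0)
    return [area, 1.0 if roi.kind == "text" else 0.0]

def tokenize_page(rois: List[ROI], page_w: float, page_h: float):
    """One (spatial, semantic) pair per detected ROI; empty areas get no tokens."""
    return [(spatial_token(r, page_w, page_h), semantic_token(r)) for r in rois]

sparse_page = [ROI((0, 0, 100, 20), "text")]
dense_page = [ROI((0, 0, 100, 20), "text"),
              ROI((0, 30, 400, 50), "text"),
              ROI((500, 500, 900, 900), "visual")]

# Token count follows the number of detected regions, not the page resolution.
print(len(tokenize_page(sparse_page, 1000, 1000)))
print(len(tokenize_page(dense_page, 1000, 1000)))
```

Because the two functions are independent, the geometry encoding and the content features can be changed or trained separately, which mirrors the decoupling described above.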
A Three-Stage Training Journey
VDInstruct undergoes a meticulous three-stage training process to build its capabilities incrementally:
- Layout Pretraining: In the first stage, the spatial encoder learns to detect multimodal ROIs in document images. This is distinct from traditional OCR, focusing solely on region-level detection rather than text recognition.
- Feature Learning: The second stage trains the semantic encoder and the spatial embedding layer to extract meaningful visual representations from the detected ROIs. This aligns visual features with text-space embeddings without affecting the language decoder’s linguistic capacity.
- Instruction Tuning: The final stage fine-tunes the language decoder to understand and execute complex tasks, particularly within the VDU domain, using a comprehensive instruction-based dataset.
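The three stages can be summarised as a schedule of which modules are updated at each step. The sketch below records only the modules that the descriptions above name as being trained; treating every other module as frozen during a stage is an assumption made for illustration, and the module and stage names are hypothetical labels rather than identifiers from the paper's code.

```python
# Hypothetical module names for the components described in this article.
MODULES = ("spatial_encoder", "spatial_embedding", "semantic_encoder", "language_decoder")

# Modules named as trained in each stage; everything else is ASSUMED frozen.
STAGES = {
    "layout_pretraining": {"spatial_encoder"},
    "feature_learning": {"semantic_encoder", "spatial_embedding"},
    "instruction_tuning": {"language_decoder"},
}

def trainable_flags(stage: str) -> dict:
    """Map each module to True if it is updated during the given stage."""
    trainable = STAGES[stage]
    return {module: module in trainable for module in MODULES}

for stage in STAGES:
    print(stage, trainable_flags(stage))
```

In a deep learning framework, flags like these would typically be applied by enabling or disabling gradient updates per module before each stage begins.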
Impressive Results and Efficiency
Experiments show that VDInstruct is considerably more token-efficient than prior state-of-the-art models: it generates approximately 500 image tokens per page, about 3.6 times fewer than DocOwl 1.5, a leading previous model. Despite this reduction, VDInstruct achieves state-of-the-art overall performance on KIE benchmarks. In zero-shot evaluations (on unseen document types), it outperforms strong baselines such as DocOwl 1.5 by 5.5 F1 points, which indicates that it generalizes to new document formats without additional fine-tuning.
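As a quick sanity check on the figures quoted above, the following snippet works out what they imply. The baseline token count and the percentage saved are derived here from the article's two approximate numbers; they are not values reported directly in this article.

```python
vdinstruct_tokens_per_page = 500  # approximate figure quoted above
reduction_factor = 3.6            # "about 3.6 times fewer" than DocOwl 1.5

def implied_baseline_tokens(tokens: float, factor: float) -> float:
    """Token count of the baseline implied by an 'N times fewer' claim."""
    return tokens * factor

def fraction_saved(factor: float) -> float:
    """Share of the baseline's tokens that are avoided."""
    return 1 - 1 / factor

print(round(implied_baseline_tokens(vdinstruct_tokens_per_page, reduction_factor)))
print(f"{fraction_saved(reduction_factor):.0%}")
```

Under these rounded inputs, the baseline works out to roughly 1,800 tokens per page, so VDInstruct avoids about 72% of them.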
The research also confirmed that all components contribute meaningfully: spatial tokens provide layout-specific cues, and combining different semantic token modalities (cross-modality, textual, and visual) yields the strongest overall performance. Furthermore, using a stronger vision backbone like SwinB-v2 significantly improves both ROI detection and KIE performance.
VDInstruct represents a promising step forward in document understanding, demonstrating that content-aware tokenization combined with explicit layout modeling offers a powerful direction for more efficient and accurate Key Information Extraction. For more detail, see the full research paper.


