TLDR: A new fine-tuning method called Synchronously Self-Reviewing (SSR) helps Multimodal Large Language Models (MLLMs) translate text from document images more effectively. Unlike traditional methods that cause MLLMs to “forget” their text recognition (OCR) abilities during translation training, SSR prompts the model to first perform OCR and then translate. This approach significantly improves translation quality, prevents the loss of OCR skills, and enhances the model’s ability to generalize to new types of documents and languages, even with less labeled data.
Multimodal Large Language Models (MLLMs) have made significant strides in understanding document images, particularly in tasks like Optical Character Recognition (OCR), which involves extracting text from scanned documents or images. However, these powerful models often face considerable challenges when it comes to Document Image Machine Translation (DIMT), a task that requires handling both visual and linguistic complexities across different languages.
A common approach to improve MLLMs for DIMT is Supervised Fine-Tuning (SFT) on specific translation datasets. Unfortunately, this method frequently leads to a problem known as “catastrophic forgetting.” This means that while the model gets better at translation, it tends to lose its original, strong monolingual abilities, especially its proficiency in OCR. For instance, a model fine-tuned for translation might achieve high translation scores but perform very poorly on basic text extraction from images.
To tackle this critical issue, researchers have introduced a new fine-tuning method called Synchronously Self-Reviewing (SSR), in which the model reviews its own OCR proficiency while it learns to translate. This paradigm is inspired by the concept of “Bilingual Cognitive Advantage,” which suggests that bilingual individuals often exhibit greater linguistic proficiency by leveraging their existing language skills while learning new ones.
How Synchronously Self-Reviewing (SSR) Works
The core idea behind SSR is to prompt the MLLM to first generate the OCR text from the document image before it produces the translated text. This process involves two main steps:
- Monolingual Proficiency: In this initial step, the MLLM is instructed to perform OCR on the document image, generating the source text. Since this text comes from the model’s original output distribution, it helps maintain the model’s inherent OCR capabilities, preventing the catastrophic forgetting observed with traditional fine-tuning.
- Cross-lingual Enhancement: Following the OCR generation, the self-generated source text is concatenated with the correct target translation text, and the MLLM is fine-tuned on this combined sequence. This allows the model to learn the relationships between the image, the source text, and the target text, contributes to a smoother training process, and enhances cross-lingual translation while continuously reinforcing its OCR skills, as illustrated in the sketch after this list.
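To make the two steps concrete, here is a minimal Python sketch of how an SSR-style training example might be assembled. The helper names (`mllm_ocr`, `build_ssr_sample`), the prompt wording, and the `OCR:` / `Translation:` formatting are illustrative assumptions, not the paper’s actual implementation.

```python
from dataclasses import dataclass


@dataclass
class SSRSample:
    """One SSR fine-tuning example: image + prompt, with a target made of the
    model's own OCR output followed by the reference translation."""
    image_path: str
    prompt: str
    target: str


def mllm_ocr(image_path: str) -> str:
    """Hypothetical stand-in for the base MLLM's OCR pass (e.g. a frozen,
    pre-fine-tuning Qwen2-VL transcribing the document image)."""
    return "Example source text extracted from the page."


def build_ssr_sample(image_path: str, reference_translation: str) -> SSRSample:
    # Step 1 (Monolingual Proficiency): the model reviews its own OCR skill
    # by generating the source text directly from the image.
    source_text = mllm_ocr(image_path)

    # Step 2 (Cross-lingual Enhancement): the self-generated OCR text is
    # concatenated with the gold translation, so every translation update
    # also rehearses the OCR ability.
    prompt = "Read the text in the image, then translate it."
    target = f"OCR: {source_text}\nTranslation: {reference_translation}"
    return SSRSample(image_path=image_path, prompt=prompt, target=target)


if __name__ == "__main__":
    sample = build_ssr_sample("doc_page_001.png", "Beispieltext von der Seite.")
    print(sample.prompt)
    print(sample.target)
```

In an actual fine-tuning run, the loss would be computed over the whole concatenated target, which is what keeps the OCR behaviour inside the training signal rather than letting it drift away.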
Key Advantages and Experimental Validation
Extensive experiments have demonstrated that the SSR learning paradigm offers several significant advantages:
- Mitigation of Catastrophic Forgetting: SSR effectively helps MLLMs retain their strong OCR proficiency, a common casualty of traditional DIMT fine-tuning.
- Improved Generalization: The method enhances the MLLM’s ability to generalize across domains and tasks, performing better on both OCR and DIMT, even in unseen scenarios.
- Leveraging Unsupervised Data: A notable benefit of SSR is its capacity to utilize large amounts of unsupervised data (document images without corresponding translations). By using the MLLM’s own OCR capability to generate synthetic source text, the method reduces the reliance on extensive parallel datasets, which are often scarce for DIMT; a short sketch of this self-labeling step follows this list.
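To give a feel for how such unlabeled data might be prepared, here is a minimal Python sketch. The folder name, file pattern, and `mllm_ocr` helper are assumptions for illustration, not the paper’s code: each document image is simply paired with the model’s own OCR output, yielding source-side examples that require no human-annotated translations.

```python
from pathlib import Path
from typing import List, Tuple


def mllm_ocr(image_path: str) -> str:
    """Hypothetical stand-in for the frozen MLLM transcribing one page."""
    return "Example source text extracted from the page."


def build_synthetic_ocr_corpus(image_dir: str) -> List[Tuple[str, str]]:
    """Pair each unlabeled document image with the model's own OCR output,
    producing (image, source_text) examples without any parallel data."""
    corpus = []
    for image_path in sorted(Path(image_dir).glob("*.png")):
        corpus.append((str(image_path), mllm_ocr(str(image_path))))
    return corpus


if __name__ == "__main__":
    for image, source_text in build_synthetic_ocr_corpus("unlabeled_docs"):
        print(image, "->", source_text)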
The research shows that SSR consistently outperforms other baseline methods in translation quality across different MLLMs, including Vary-base, Textmonkey, and Qwen2-VL. It also significantly preserves monolingual abilities like OCR and Visual Question Answering (VQA). Furthermore, the method proves effective in low-resource scenarios and can be extended to other language pairs, such as English-French and English-German. For more technical details, you can refer to the full research paper here.
In conclusion, Synchronously Self-Reviewing presents a robust and effective fine-tuning paradigm that not only boosts MLLMs’ performance in Document Image Machine Translation but also ensures the preservation of their crucial monolingual capabilities, paving the way for more versatile and reliable document understanding systems.


