Advancing Image Translation: A New Benchmark for Position-Aware Text in Visual Models

TLDR: Researchers introduce PATIMT-Bench, a new benchmark and dataset for Position-Aware Text Image Machine Translation (PATIMT). PATIMT extends traditional image translation to include region-specific translation and full-image translation with precise text grounding (bounding boxes). Their Adaptive Image OCR Refinement Pipeline creates high-quality, multi-scenario training data. Fine-tuning compact Large Vision-Language Models on this data significantly improves their performance, even surpassing larger models, demonstrating the dataset’s effectiveness and scalability for layout-preserving image translation.

Text Image Machine Translation (TIMT) is a field focused on translating text found within images into another language. Traditionally, TIMT systems would simply provide a translation of all text in an image, often losing the original layout and making it hard for users to know which translation corresponds to which part of the image. This limitation has made these systems less practical for real-world use.

Introducing Position-Aware TIMT (PATIMT)

A new research paper introduces an extended approach called Position-Aware Text Image Machine Translation (PATIMT). This advanced method aims to provide more precise and layout-preserving translations, which is highly valuable for practical applications but has not been widely explored until now. PATIMT addresses two main challenges: region-specific translation and full-image translation with grounding.

Region-specific translation allows users to select a particular area of an image and get a translation only for the text within that selected region. This offers a fine-grained, user-controlled translation experience. Full-image translation with grounding, on the other hand, translates all text in an image while also providing precise positional alignment between the translated text and its original source in the image. This enables seamless rendering of a translated version of the image, where the translated text appears in the correct locations.
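The input and output shapes of the two tasks can be pictured with a minimal Python sketch. It illustrates the data only, not the model: the translations are assumed to exist already, and the names `Box`, `GroundedTranslation`, `translate_region`, and `translate_full_image`, as well as the sample strings and coordinates, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels, from (x1, y1) to (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def contains(self, other: "Box") -> bool:
        """True when `other` lies entirely inside this box."""
        return (self.x1 <= other.x1 and self.y1 <= other.y1
                and self.x2 >= other.x2 and self.y2 >= other.y2)


@dataclass(frozen=True)
class GroundedTranslation:
    """One text block: where it sits, what it says, and its translation."""
    box: Box
    source: str
    target: str


def translate_region(items: List[GroundedTranslation],
                     region: Box) -> List[GroundedTranslation]:
    """Region-specific task: keep only blocks inside the selected region."""
    return [it for it in items if region.contains(it.box)]


def translate_full_image(items: List[GroundedTranslation]) -> List[GroundedTranslation]:
    """Full-image task: every block with its box, top-to-bottom then left-to-right."""
    return sorted(items, key=lambda it: (it.box.y1, it.box.x1))


# Made-up sample: two signs in one image (hypothetical coordinates).
items = [
    GroundedTranslation(Box(10, 300, 200, 340), "入口", "Entrance"),
    GroundedTranslation(Box(10, 10, 200, 40), "出口", "Exit"),
]

selected = translate_region(items, Box(0, 0, 250, 100))
print([it.target for it in selected])
print([(it.target, it.box) for it in translate_full_image(items)])
```

Because each translation keeps its box, a renderer can paint the target text back over the source location, which is what makes a layout-preserving translated image possible.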

The Need for a New Benchmark: PATIMT-Bench

To support the development and fair evaluation of models for PATIMT, the researchers constructed a new benchmark called PATIMT-Bench. This benchmark is comprehensive, featuring 10 diverse real-world scenarios. Existing Large Vision-Language Models (LVLMs) show great potential for these tasks, but they often struggle to follow PATIMT instructions due to a lack of suitable training data. Current TIMT datasets typically lack bounding box annotations (which define the precise location of text) or cover only a limited range of scenarios, making them inadequate for position-aware translation.

Building a multi-scenario PATIMT dataset is challenging. General Optical Character Recognition (OCR) tools often provide line-by-line results that lack semantic coherence, while document-specific OCR tools might miss text in other types of images. Manual annotation is also very labor-intensive and expensive.

Adaptive Image OCR Refinement Pipeline

To overcome these data construction challenges, the researchers developed an “Adaptive Image OCR Refinement Pipeline.” This automated pipeline is designed to create high-quality, multi-scenario PATIMT data. It adaptively selects appropriate OCR tools based on the image scenario (e.g., documents, infographics, natural scenes) and refines the OCR results, especially for text-rich images. For instance, it combines a general OCR tool like EasyOCR with a PDF-optimized tool like MinerU to handle different image types effectively. This pipeline ensures that the training data includes fine-grained bounding boxes in the correct layouts for text within images.
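The select-then-refine idea can be sketched roughly as follows. This is not the authors' implementation: `general_ocr` and `document_ocr` are hypothetical stubs that return canned results instead of calling EasyOCR or MinerU, the scenario-to-tool mapping is an assumption, and the line-merging heuristic with its `max_gap_ratio` parameter is only one simple way to turn line-level OCR output into coherent blocks.

```python
from typing import Callable, Dict, List, Tuple

# One OCR result: (x1, y1, x2, y2, text), with the box in pixels.
OcrLine = Tuple[float, float, float, float, str]


def general_ocr(image_path: str) -> List[OcrLine]:
    """Hypothetical stand-in for a general, line-level OCR tool."""
    return [
        (10, 10, 300, 30, "Position-aware translation keeps"),
        (10, 34, 280, 54, "each text block in place."),
        (10, 200, 150, 220, "Figure 1"),
    ]


def document_ocr(image_path: str) -> List[OcrLine]:
    """Hypothetical stand-in for a document-oriented OCR tool."""
    return [(10, 10, 300, 54,
             "Position-aware translation keeps each text block in place.")]


# Assumed mapping from scenario to tool; the real pipeline's rules may differ.
OCR_BY_SCENARIO: Dict[str, Callable[[str], List[OcrLine]]] = {
    "document": document_ocr,
    "infographic": general_ocr,
    "natural_scene": general_ocr,
}


def merge_lines(lines: List[OcrLine], max_gap_ratio: float = 0.6) -> List[OcrLine]:
    """Merge vertically adjacent, horizontally overlapping lines into blocks."""
    blocks: List[OcrLine] = []
    for x1, y1, x2, y2, text in sorted(lines, key=lambda l: (l[1], l[0])):
        if blocks:
            bx1, by1, bx2, by2, btext = blocks[-1]
            gap = y1 - by2                      # vertical space below the block
            overlaps = x1 < bx2 and bx1 < x2    # share some horizontal extent
            if overlaps and gap <= max_gap_ratio * (y2 - y1):
                # Grow the block to the union of both boxes and join the text.
                blocks[-1] = (min(bx1, x1), min(by1, y1),
                              max(bx2, x2), max(by2, y2), btext + " " + text)
                continue
        blocks.append((x1, y1, x2, y2, text))
    return blocks


def build_annotations(image_path: str, scenario: str) -> List[OcrLine]:
    """Pick an OCR tool for the scenario, then refine its output into blocks."""
    ocr = OCR_BY_SCENARIO.get(scenario, general_ocr)
    return merge_lines(ocr(image_path))


for box in build_annotations("sample.png", "natural_scene"):
    print(box)
```

In this toy run the first two lines sit 4 pixels apart, so they merge into one block whose box is the union of both, while the distant caption stays a separate block.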

PATIMT-Bench also includes a test set of 1,200 instances that were manually annotated and reviewed by human experts to ensure reliable evaluation. The training data consists of 48,884 images with over 400,000 processed bounding boxes, providing a rich resource for model training.

Performance and Scalability

After fine-tuning on this new dataset, compact Large Vision-Language Models (LVLMs) achieved state-of-the-art performance on both region-specific translation and full-image translation with grounding. Remarkably, some of these smaller, fine-tuned models even outperformed much larger models such as Qwen2.5-VL-72B and GPT-4o. This highlights the effectiveness of the PATIMT-Bench training data in improving both translation quality and the ability of models to ground text spatially within images.
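This article does not say how grounding accuracy is scored. One common way to compare a predicted box with a reference box is intersection over union (IoU), sketched below as an assumption rather than as the benchmark's actual metric; the function name `iou` and the (x1, y1, x2, y2) box layout are illustrative choices.

```python
from typing import Tuple

BoxXYXY = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: BoxXYXY, b: BoxXYXY) -> float:
    """Intersection over union of two axis-aligned boxes, in [0, 1]."""
    # Corners of the overlapping rectangle, if there is one.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (0, 0, 10, 10)))   # identical boxes
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # half-overlapping boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))       # disjoint boxes
```

A score near 1 means the predicted box almost coincides with the reference, and 0 means the two boxes do not overlap at all.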

The research also demonstrated the scalability and generalizability of the training data. As more training data was used, model performance steadily improved. Furthermore, models fine-tuned on PATIMT-Bench showed substantial improvements when evaluated on other relevant benchmarks, which suggests the dataset is useful beyond its own test set.

This work marks a significant step forward in Text Image Machine Translation, moving beyond plain text output to layout-preserving, position-aware translations that are much more useful in real-world applications.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes.
