Advancing Image Translation: A New Benchmark for Position-Aware Text in Visual Models

TLDR: Researchers introduce PATIMT-Bench, a new benchmark and dataset for Position-Aware Text Image Machine Translation (PATIMT). PATIMT extends traditional image translation to include region-specific translation and full-image translation with precise text grounding (bounding boxes). Their Adaptive Image OCR Refinement Pipeline creates high-quality, multi-scenario training data. Fine-tuning compact Large Vision-Language Models on this data significantly improves their performance, even surpassing larger models, demonstrating the dataset’s effectiveness and scalability for layout-preserving image translation.

Text Image Machine Translation (TIMT) is a field focused on translating text found within images into another language. Traditionally, TIMT systems would simply provide a translation of all text in an image, often losing the original layout and making it hard for users to know which translation corresponds to which part of the image. This limitation has made these systems less practical for real-world use.

Introducing Position-Aware TIMT (PATIMT)

A new research paper introduces an extended approach called Position-Aware Text Image Machine Translation (PATIMT). This advanced method aims to provide more precise and layout-preserving translations, which is highly valuable for practical applications but has not been widely explored until now. PATIMT addresses two main challenges: region-specific translation and full-image translation with grounding.

Region-specific translation allows users to select a particular area of an image and get a translation only for the text within that selected region. This offers a fine-grained, user-controlled translation experience. Full-image translation with grounding, on the other hand, translates all text in an image while also providing precise positional alignment between the translated text and its original source in the image. This enables seamless rendering of a translated version of the image, where the translated text appears in the correct locations.
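To make the two tasks concrete, here is a minimal Python sketch of what position-aware output could look like. The `Box` and `GroundedTranslation` types and the `translate_region` helper are hypothetical names invented for illustration; they are not taken from the paper. The sketch also assumes that translations already exist and only shows how a user-selected region narrows the result.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (assumed format)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def contains(self, other: "Box") -> bool:
        # True when `other` lies entirely inside this box.
        return (self.x1 <= other.x1 and self.y1 <= other.y1
                and self.x2 >= other.x2 and self.y2 >= other.y2)


@dataclass(frozen=True)
class GroundedTranslation:
    """One text block: where it is, what it says, and its translation."""
    box: Box
    source: str
    target: str


def translate_region(items: List[GroundedTranslation],
                     region: Box) -> List[GroundedTranslation]:
    # Region-specific translation: keep only blocks inside the selection.
    return [item for item in items if region.contains(item.box)]


# Full-image translation with grounding: every block carries its own box.
full_image = [
    GroundedTranslation(Box(10, 10, 200, 40), "Sortie", "Exit"),
    GroundedTranslation(Box(10, 300, 200, 330), "Entrée", "Entrance"),
]

# The user selects the top half of the image.
selected = translate_region(full_image, Box(0, 0, 400, 150))
print([item.target for item in selected])
```

Because each translation keeps its bounding box, a renderer could draw the translated text back onto the image at the original location.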

The Need for a New Benchmark: PATIMT-Bench

To support the development and fair evaluation of models for PATIMT, the researchers constructed a new benchmark called PATIMT-Bench. This benchmark is comprehensive, featuring 10 diverse real-world scenarios. Existing Large Vision-Language Models (LVLMs) show great potential for these tasks, but they often struggle to follow PATIMT instructions due to a lack of suitable training data. Current TIMT datasets typically lack bounding box annotations (which define the precise location of text) or cover only a limited range of scenarios, making them inadequate for position-aware translation.

Building a multi-scenario PATIMT dataset is challenging. General Optical Character Recognition (OCR) tools often provide line-by-line results that lack semantic coherence, while document-specific OCR tools might miss text in other types of images. Manual annotation is also very labor-intensive and expensive.
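The line-by-line problem is easy to picture with a small example. The sketch below merges vertically adjacent, horizontally overlapping OCR lines into larger blocks. It is a simple illustrative heuristic, not the refinement procedure from the paper, and the `merge_lines` name, the tuple layout, and the `max_gap` threshold are all assumptions.

```python
from typing import List, Tuple

# (x1, y1, x2, y2, text) with y growing downward; an assumed layout.
Line = Tuple[float, float, float, float, str]


def merge_lines(lines: List[Line], max_gap: float = 10.0) -> List[Line]:
    """Group OCR lines into blocks when they sit close together vertically
    and overlap horizontally. Each line is compared only with the most
    recently built block, which is enough for single-column layouts."""
    blocks: List[Line] = []
    for x1, y1, x2, y2, text in sorted(lines, key=lambda line: line[1]):
        if blocks:
            bx1, by1, bx2, by2, btext = blocks[-1]
            close = (y1 - by2) <= max_gap
            overlap = min(x2, bx2) > max(x1, bx1)
            if close and overlap:
                # Grow the block's box and append the text.
                blocks[-1] = (min(bx1, x1), min(by1, y1),
                              max(bx2, x2), max(by2, y2),
                              btext + " " + text)
                continue
        blocks.append((x1, y1, x2, y2, text))
    return blocks


ocr_lines = [
    (10, 10, 200, 30, "Grand"),
    (10, 34, 180, 54, "opening"),
    (10, 120, 150, 140, "Free entry"),
]
merged = merge_lines(ocr_lines)
print(merged)
```

Translating "Grand opening" as one unit gives a translation model more context than translating "Grand" and "opening" separately, which is the kind of semantic coherence raw line-level OCR lacks.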

Adaptive Image OCR Refinement Pipeline

To overcome these data construction challenges, the researchers developed an “Adaptive Image OCR Refinement Pipeline.” This automated pipeline is designed to create high-quality, multi-scenario PATIMT data. It adaptively selects appropriate OCR tools based on the image scenario (e.g., documents, infographics, natural scenes) and refines the OCR results, especially for text-rich images. For instance, it combines a general OCR tool like EasyOCR with a PDF-optimized tool like MinerU to handle different image types effectively. This pipeline ensures that the training data includes fine-grained bounding boxes in the correct layouts for text within images.
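The adaptive selection step can be sketched as a small router that picks an OCR backend per scenario. This is a hedged illustration only: `make_ocr_router`, the scenario labels, and the two `fake_*` backends are hypothetical stand-ins, and the sketch deliberately does not call the real EasyOCR or MinerU APIs. Which scenarios the paper routes to which tool is also an assumption here.

```python
from typing import Callable, FrozenSet, List, Tuple

Box = Tuple[float, float, float, float]
OcrResult = List[Tuple[Box, str]]
OcrBackend = Callable[[str], OcrResult]


def make_ocr_router(
    general: OcrBackend,
    document: OcrBackend,
    document_scenarios: FrozenSet[str] = frozenset({"document"}),
) -> Callable[[str, str], OcrResult]:
    """Return a function that sends document-like images to the
    document-oriented backend and everything else to the general one."""
    def route(image_path: str, scenario: str) -> OcrResult:
        backend = document if scenario in document_scenarios else general
        return backend(image_path)
    return route


# Placeholder backends so the sketch runs without any OCR library.
def fake_general_ocr(image_path: str) -> OcrResult:
    return [((0.0, 0.0, 50.0, 10.0), "general:" + image_path)]


def fake_document_ocr(image_path: str) -> OcrResult:
    return [((0.0, 0.0, 500.0, 700.0), "document:" + image_path)]


route = make_ocr_router(fake_general_ocr, fake_document_ocr)
print(route("paper.png", "document"))
print(route("street.jpg", "natural_scene"))
```

Passing the backends in as callables keeps the routing logic independent of any particular OCR tool, so real wrappers could be swapped in later.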

PATIMT-Bench also includes a carefully constructed test set of 1,200 high-quality instances, which were manually annotated and reviewed by human experts to ensure reliable evaluation. The training data consists of 48,884 images with over 400,000 processed bounding boxes, providing a rich resource for model training.

Performance and Scalability

After fine-tuning on this new dataset, compact Large Vision-Language Models (LVLMs) achieved state-of-the-art performance on both region-specific translation and full-image translation with grounding. Remarkably, some of these smaller, fine-tuned models even outperformed much larger models such as Qwen2.5-VL-72B and the proprietary GPT-4o. This highlights how effectively the PATIMT-Bench training data improves both translation quality and the models’ ability to ground text spatially within images.
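How might spatial grounding be scored? One widely used measure for comparing a predicted bounding box with a reference box is intersection over union (IoU). The sketch below is a generic illustration; this article does not state which grounding metric the paper uses, so treating IoU as the metric is an assumption.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes, in [0, 1]."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


predicted = (5.0, 0.0, 15.0, 10.0)
reference = (0.0, 0.0, 10.0, 10.0)
print(iou(predicted, reference))
```

A score of 1.0 means the predicted box matches the reference exactly, while 0.0 means the two boxes do not overlap at all.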

The research also demonstrated the scalability and generalizability of the training data. As more training data was used, model performance steadily improved. Furthermore, models fine-tuned on PATIMT-Bench showed substantial improvements when evaluated on other relevant benchmarks, which suggests the dataset is broadly applicable beyond its own test set.

This work marks a significant step forward in Text Image Machine Translation, moving beyond simple text output to provide intelligent, layout-preserving, and position-aware translations that are much more useful in real-world applications. For more details, you can read the full paper here.
