TL;DR: This paper investigates using machine-translated (MT) text to pretrain language models for lower-resource languages such as Indonesian and Tamil. Key findings: scaling up MT-pretrained models improves generalization to native text; simplifying the source text before translation is detrimental; and continually pretraining MT-initialized models on limited native text is highly effective, often outperforming models trained on native text alone. While MT data is beneficial for tasks like sentiment analysis, culturally nuanced tasks such as toxicity detection still require more native data.
The development of advanced language technologies, particularly Large Language Models (LLMs), has seen remarkable progress, but this advancement largely benefits languages with abundant digital text, such as English. For the majority of the world’s languages, however, a significant hurdle exists: a scarcity of native text data, often referred to as a “data wall.” This limitation prevents these languages from fully realizing the benefits of large-scale pretraining, where models learn from vast amounts of text.
While multilingual pretraining attempts to transfer knowledge from high-resource to low-resource languages, it faces its own challenges, including language imbalance and the “curse of multilinguality.” An alternative approach, explored in this research, involves using machine translation (MT) to convert text from a high-resource language into a target low-resource language, thereby creating a large synthetic corpus for pretraining.
This study, titled “Scaling, Simplification, and Adaptation: Lessons from Pretraining on Machine-Translated Text” by Dan John Velasco and Matthew Theodore Roque, examines three critical questions about the use of MT-derived data. The researchers focused on Indonesian and Tamil, two typologically distinct, lower-resource languages, translating English text into each target language. They then pretrained GPT-2 models ranging from 124 million to 774 million parameters on three kinds of corpora: native text, MT-derived text from natural English, and MT-derived text from LLM-simplified English. The models were evaluated on generalization to native text, grammatical proficiency, and performance on a range of natural language understanding (NLU) tasks.
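To make the data-generation step concrete, here is a minimal sketch of how an MT-derived pretraining corpus could be produced by translating English documents into the target language. The choice of NLLB-200 via Hugging Face transformers, the language codes, and the batching are illustrative assumptions; the paper’s actual MT system and settings may differ.

```python
# Illustrative sketch (not the paper's exact toolchain): translate English
# documents into a lower-resource target language to build a synthetic
# pretraining corpus. NLLB-200 and the FLORES-200 language codes below
# ("ind_Latn" for Indonesian, "tam_Taml" for Tamil) are assumptions.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="ind_Latn",   # swap to "tam_Taml" for Tamil
    max_length=512,
)

def translate_corpus(english_docs, batch_size=32):
    """Translate a list of English documents into the target language."""
    translated = []
    for i in range(0, len(english_docs), batch_size):
        outputs = translator(english_docs[i : i + batch_size])
        translated.extend(out["translation_text"] for out in outputs)
    return translated

# Each translated document becomes one entry of the synthetic pretraining corpus.
mt_corpus = translate_corpus(["Large language models learn from vast amounts of text."])
```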
Scaling MT-Pretrained Models
A key finding was that increasing the size of models pretrained on MT-derived data generally improved their ability to generalize to native text. This suggests that larger models are not merely overfitting to the specific characteristics of machine-translated text but are learning robust, transferable linguistic structures. This is a crucial insight, indicating that investing in larger model capacities can be beneficial even when pretraining data is predominantly machine-translated.
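For reference, the 124M to 774M range corresponds to the standard GPT-2 small, medium, and large configurations. The sketch below instantiates them with randomly initialized weights, as one would for pretraining from scratch; the paper’s exact architectural settings are not reproduced here and these defaults are an assumption.

```python
# Standard GPT-2 family configurations spanning the 124M-774M parameter range;
# other hyperparameters (vocabulary size, context length) use GPT2Config defaults.
from transformers import GPT2Config, GPT2LMHeadModel

SIZES = {
    "gpt2-small (124M)":  dict(n_layer=12, n_embd=768,  n_head=12),
    "gpt2-medium (355M)": dict(n_layer=24, n_embd=1024, n_head=16),
    "gpt2-large (774M)":  dict(n_layer=36, n_embd=1280, n_head=20),
}

for name, dims in SIZES.items():
    model = GPT2LMHeadModel(GPT2Config(**dims))  # random init, ready for from-scratch pretraining
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```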
Impact of Source-Side Simplification
Counterintuitively, simplifying the source English text before translation proved detrimental to the models’ generalization to native text. The researchers hypothesize that simplification reduces the lexical and syntactic diversity of the source material, and that this diminished variety in the translated data hinders the model from learning a rich, nuanced representation of the target language. Using natural, unsimplified source text therefore appears more effective for creating high-quality pretraining data.
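For context, source-side simplification here means rewriting the English text with an instruction-following LLM before it is machine-translated. The sketch below shows one way that step could look; the specific model, prompt, and API are assumptions rather than the paper’s setup, and the finding above suggests this step is better skipped.

```python
# Hypothetical sketch of source-side simplification: an instruction-following LLM
# rewrites English text in simpler language before it is machine-translated.
# The model name, prompt wording, and API choice are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SIMPLIFY_PROMPT = (
    "Rewrite the following text using simpler vocabulary and shorter sentences, "
    "keeping the original meaning:\n\n{text}"
)

def simplify(text: str, model: str = "gpt-4o-mini") -> str:
    """Return a simplified version of an English source document."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SIMPLIFY_PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content

# Per the finding above, translating the natural, unsimplified source text
# generalized better than translating simplify(text).
```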
Continual Pretraining with Native Data
Perhaps one of the most promising results is the effectiveness of continual pretraining (CPT) on limited native text after an initial MT pretraining phase. The study found that models first pretrained on MT-derived data and then continually trained on a smaller budget of native text often outperformed models trained solely on native text, even when the native-only models had access to more native data. This indicates that MT pretraining provides an excellent starting point, allowing models to efficiently adapt and refine their understanding when exposed to authentic native language. This approach offers a data-efficient strategy for bootstrapping performance in low-resource settings.
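A rough sketch of this two-stage recipe is shown below: load the MT-pretrained checkpoint and continue causal language modeling on a small native-text corpus. The checkpoint path, dataset file, and hyperparameters are placeholders, not the paper’s configuration.

```python
# Minimal sketch of continual pretraining (CPT): resume causal-LM training of an
# MT-pretrained GPT-2 checkpoint on a limited budget of native text.
# Paths and hyperparameters below are placeholders, not the paper's settings.
from datasets import load_dataset
from transformers import (
    AutoTokenizer, DataCollatorForLanguageModeling, GPT2LMHeadModel,
    Trainer, TrainingArguments,
)

checkpoint = "./gpt2-774m-mt-indonesian"           # hypothetical MT-pretrained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token      # GPT-2 tokenizers lack a pad token
model = GPT2LMHeadModel.from_pretrained(checkpoint)

# A small corpus of authentic native text, one document per line.
native = load_dataset("text", data_files={"train": "native_indonesian.txt"})["train"]
native = native.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./gpt2-774m-cpt-indonesian",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=native,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```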
Performance on Downstream Tasks
When evaluating performance on NLU tasks, the study observed varied results. For tasks like sentiment analysis (SA) and natural language inference (NLI), MT-pretrained models, especially those that underwent CPT, performed comparably to, and sometimes even surpassed, native-only models. However, for tasks requiring a deep understanding of cultural nuances, such as toxicity detection, native-pretrained models maintained a significant advantage. This highlights that while MT-derived data is highly valuable for many linguistic tasks, culturally sensitive applications still necessitate substantial exposure to genuine native data.
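As a rough illustration of the downstream setup, the sketch below fine-tunes an adapted checkpoint with a classification head on a target-language sentiment dataset. The checkpoint, dataset files, and label count are placeholders; the paper’s actual task datasets and fine-tuning recipe are not reproduced here.

```python
# Hedged sketch of downstream fine-tuning: attach a classification head to the
# MT-pretrained (or CPT-adapted) checkpoint and train on a labeled
# target-language sentiment dataset with "text" and "label" columns.
from datasets import load_dataset
from transformers import (
    AutoTokenizer, DataCollatorWithPadding, GPT2ForSequenceClassification,
    Trainer, TrainingArguments,
)

checkpoint = "./gpt2-774m-cpt-indonesian"          # hypothetical adapted model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = GPT2ForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Placeholder CSV files holding a labeled sentiment analysis dataset.
data = load_dataset("csv", data_files={"train": "sa_train.csv", "test": "sa_test.csv"})
data = data.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./gpt2-sentiment",
        per_device_train_batch_size=16,
        num_train_epochs=3,
    ),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
print(trainer.evaluate())  # reports loss on the held-out split
```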
In summary, this research offers a practical roadmap for enhancing monolingual language models in data-scarce environments. It advocates for generating target-language data via machine translation, pretraining with the largest feasible model size, and then refining these models through continual pretraining on available native text. For fine-tuning, translating task-specific data is effective for many applications, but native annotation remains essential for tasks deeply rooted in cultural context. You can explore the full details of this study at arXiv:2509.17317.


