TL;DR: This paper investigates using machine-translated (MT) text to pretrain language models for lower-resource languages such as Indonesian and Tamil. Key findings: scaling up MT-pretrained models improves generalization to native text; simplifying the source text before translation is detrimental; and continually pretraining MT-initialized models on limited native text is highly effective, often outperforming models trained on native text alone. While MT data is beneficial for tasks like sentiment analysis, culturally nuanced tasks such as toxicity detection still require more native data.
The development of advanced language technologies, particularly Large Language Models (LLMs), has seen remarkable progress, but this advancement largely benefits languages with abundant digital text, such as English. For the majority of the world’s languages, however, a significant hurdle exists: a scarcity of native text data, often referred to as a “data wall.” This limitation prevents these languages from fully realizing the benefits of large-scale pretraining, where models learn from vast amounts of text.
While multilingual pretraining attempts to transfer knowledge from high-resource to low-resource languages, it faces its own challenges, including language imbalance and the “curse of multilinguality.” An alternative approach, explored in this research, involves using machine translation (MT) to convert text from a high-resource language into a target low-resource language, thereby creating a large synthetic corpus for pretraining.
This study, titled “Scaling, Simplification, and Adaptation: Lessons from Pretraining on Machine-Translated Text” by Dan John Velasco and Matthew Theodore Roque, examines three critical questions about the use of MT-derived data. The researchers focused on Indonesian and Tamil, two typologically distinct, lower-resource languages, translating English text into each target language. They then pretrained GPT-2 models ranging from 124 million to 774 million parameters on three kinds of corpora: native text, MT-derived text from natural English, and MT-derived text from LLM-simplified English. The models were evaluated on generalization to native text, grammatical proficiency, and performance on a range of natural language understanding (NLU) tasks.
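To make the data-generation step concrete, here is a minimal sketch of how an MT-derived pretraining corpus could be produced by translating English documents into the target language. The choice of NLLB-200 via Hugging Face transformers, the language codes, and the batching are illustrative assumptions; the paper’s actual MT system and settings may differ.

```python
# Illustrative sketch (not the paper's exact toolchain): translate English
# documents into a lower-resource target language to build a synthetic
# pretraining corpus. NLLB-200 and the FLORES-200 language codes below
# ("ind_Latn" for Indonesian, "tam_Taml" for Tamil) are assumptions.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="ind_Latn",   # swap to "tam_Taml" for Tamil
    max_length=512,
)

def translate_corpus(english_docs, batch_size=32):
    """Translate a list of English documents into the target language."""
    translated = []
    for i in range(0, len(english_docs), batch_size):
        outputs = translator(english_docs[i : i + batch_size])
        translated.extend(out["translation_text"] for out in outputs)
    return translated

# Each translated document becomes one entry of the synthetic pretraining corpus.
mt_corpus = translate_corpus(["Large language models learn from vast amounts of text."])
```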
Scaling MT-Pretrained Models
A key finding was that increasing the size of models pretrained on MT-derived data generally improved their ability to generalize to native text. This suggests that larger models are not merely overfitting to the specific characteristics of machine-translated text but are learning robust, transferable linguistic structures. This is a crucial insight, indicating that investing in larger model capacities can be beneficial even when pretraining data is predominantly machine-translated.
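For reference, the 124M to 774M range corresponds to the standard GPT-2 small, medium, and large configurations. The sketch below instantiates them with randomly initialized weights, as one would for pretraining from scratch; the paper’s exact architectural settings are not reproduced here and these defaults are an assumption.

```python
# Standard GPT-2 family configurations spanning the 124M-774M parameter range;
# other hyperparameters (vocabulary size, context length) use GPT2Config defaults.
from transformers import GPT2Config, GPT2LMHeadModel

SIZES = {
    "gpt2-small (124M)":  dict(n_layer=12, n_embd=768,  n_head=12),
    "gpt2-medium (355M)": dict(n_layer=24, n_embd=1024, n_head=16),
    "gpt2-large (774M)":  dict(n_layer=36, n_embd=1280, n_head=20),
}

for name, dims in SIZES.items():
    model = GPT2LMHeadModel(GPT2Config(**dims))  # random init, ready for from-scratch pretraining
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```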
Impact of Source-Side Simplification
Counterintuitively, simplifying the source English text before translation proved detrimental to the models’ generalization to native text. The researchers hypothesize that simplification reduces the lexical and syntactic diversity of the source material, and that this diminished variety in the translated data hinders the model from learning a rich, nuanced representation of the target language. Using natural, unsimplified source text therefore appears more effective for creating high-quality pretraining data.
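For context, source-side simplification here means rewriting the English text with an instruction-following LLM before it is machine-translated. The sketch below shows one way that step could look; the specific model, prompt, and API are assumptions rather than the paper’s setup, and the finding above suggests this step is better skipped.

```python
# Hypothetical sketch of source-side simplification: an instruction-following LLM
# rewrites English text in simpler language before it is machine-translated.
# The model name, prompt wording, and API choice are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SIMPLIFY_PROMPT = (
    "Rewrite the following text using simpler vocabulary and shorter sentences, "
    "keeping the original meaning:\n\n{text}"
)

def simplify(text: str, model: str = "gpt-4o-mini") -> str:
    """Return a simplified version of an English source document."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SIMPLIFY_PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content

# Per the finding above, translating the natural, unsimplified source text
# generalized better than translating simplify(text).
```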
Continual Pretraining with Native Data
Perhaps one of the most promising results is the effectiveness of continual pretraining (CPT) on limited native text after an initial MT pretraining phase. The study found that models first pretrained on MT-derived data and then continually trained on a smaller budget of native text often outperformed models trained solely on native text, even when the native-only models had access to more native data. This indicates that MT pretraining provides an excellent starting point, allowing models to efficiently adapt and refine their understanding when exposed to authentic native language. This approach offers a data-efficient strategy for bootstrapping performance in low-resource settings.
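A rough sketch of this two-stage recipe is shown below: load the MT-pretrained checkpoint and continue causal language modeling on a small native-text corpus. The checkpoint path, dataset file, and hyperparameters are placeholders, not the paper’s configuration.

```python
# Minimal sketch of continual pretraining (CPT): resume causal-LM training of an
# MT-pretrained GPT-2 checkpoint on a limited budget of native text.
# Paths and hyperparameters below are placeholders, not the paper's settings.
from datasets import load_dataset
from transformers import (
    AutoTokenizer, DataCollatorForLanguageModeling, GPT2LMHeadModel,
    Trainer, TrainingArguments,
)

checkpoint = "./gpt2-774m-mt-indonesian"           # hypothetical MT-pretrained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token      # GPT-2 tokenizers lack a pad token
model = GPT2LMHeadModel.from_pretrained(checkpoint)

# A small corpus of authentic native text, one document per line.
native = load_dataset("text", data_files={"train": "native_indonesian.txt"})["train"]
native = native.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./gpt2-774m-cpt-indonesian",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=native,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```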
Performance on Downstream Tasks
When evaluating performance on NLU tasks, the study observed varied results. For tasks like sentiment analysis (SA) and natural language inference (NLI), MT-pretrained models, especially those that underwent CPT, performed comparably to, and sometimes even surpassed, native-only models. However, for tasks requiring a deep understanding of cultural nuances, such as toxicity detection, native-pretrained models maintained a significant advantage. This highlights that while MT-derived data is highly valuable for many linguistic tasks, culturally sensitive applications still necessitate substantial exposure to genuine native data.
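As a rough illustration of the downstream setup, the sketch below fine-tunes an adapted checkpoint with a classification head on a target-language sentiment dataset. The checkpoint, dataset files, and label count are placeholders; the paper’s actual task datasets and fine-tuning recipe are not reproduced here.

```python
# Hedged sketch of downstream fine-tuning: attach a classification head to the
# MT-pretrained (or CPT-adapted) checkpoint and train on a labeled
# target-language sentiment dataset with "text" and "label" columns.
from datasets import load_dataset
from transformers import (
    AutoTokenizer, DataCollatorWithPadding, GPT2ForSequenceClassification,
    Trainer, TrainingArguments,
)

checkpoint = "./gpt2-774m-cpt-indonesian"          # hypothetical adapted model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = GPT2ForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Placeholder CSV files holding a labeled sentiment analysis dataset.
data = load_dataset("csv", data_files={"train": "sa_train.csv", "test": "sa_test.csv"})
data = data.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./gpt2-sentiment",
        per_device_train_batch_size=16,
        num_train_epochs=3,
    ),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
print(trainer.evaluate())  # reports loss on the held-out split
```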
In summary, this research offers a practical roadmap for enhancing monolingual language models in data-scarce environments. It advocates for generating target-language data via machine translation, pretraining with the largest feasible model size, and then refining these models through continual pretraining on available native text. For fine-tuning, translating task-specific data is effective for many applications, but native annotation remains essential for tasks deeply rooted in cultural context. You can explore the full details of this study at arXiv:2509.17317.


