TL;DR: The In2x research team developed a large language model tailored to Japanese machine translation, focusing on idiomaticity, stylistic naturalness, and cultural appropriateness. By transferring strengths from English through a multi-stage training pipeline (continued pretraining, supervised fine-tuning, and reinforcement learning), In2x placed second overall in the Japanese-related tracks of the WMT25 General Machine Translation Shared Task and first in the unrestricted category, all without task-specific fine-tuning. The result points to a generalizable recipe for strong performance in low-resource or less commonly used languages.
The field of machine translation has seen significant advances with the rise of large language models (LLMs). Two key challenges persist, however: LLMs often prioritize mathematical and code reasoning over creative language ability, yielding translations that are literal but unnatural, and translation quality is unevenly distributed across languages, with English receiving disproportionately better coverage than others.
Addressing these gaps, the In2x research team, comprising Lei Pang, Hanyi Mao, Quanjia Xiao, HaiXiao Liu, and Xiangyi Li, has introduced a novel approach in their paper, In2x at WMT25 Translation Task. Their work focuses specifically on Japanese-related translation, aiming to develop a generalizable method for extending LLMs to other languages, particularly those with fewer resources or less common usage.
The In2x Philosophy: Bridging English Strength to Japanese Expressiveness
The In2x model is built on three core principles designed to transfer the strengths of English into non-English targets, with a special emphasis on Japanese naturalness and cultural faithfulness:
- English-as-hub transfer: Leveraging extensive English data and robust English modeling to establish strong lexical and semantic foundations, then transferring these to Japanese through bilingual and style-augmented objectives.
- Expressiveness-first supervision: Prioritizing prompts and signals that explicitly reward idiomaticity and cultural appropriateness, moving beyond mere literal accuracy.
- Evaluation beyond metrics: Supplementing automatic evaluation metrics with human judgments specifically targeting idioms, slang, and stylistic naturalness.
A Multi-Stage Training Journey
The development of In2x involved a meticulous multi-stage training process to balance scientific and humanities-oriented capabilities and enhance multilingual proficiency:
Continued Pretraining Stage
This stage was divided into three phases:
- Phase 1: Fundamental Knowledge Enhancement: Jointly training on creative-writing and knowledge-focused corpora to boost STEM reasoning while preserving nuanced expression for humanities tasks.
- Phase 2: Long-Text Capability Refinement: Filtering data by length to extend the context window from 8,192 to approximately 32,000 tokens.
- Phase 3: Fast Annealing Stage: Annealing on a high-quality corpus selected via perplexity and quality metrics, with a linearly decayed learning rate, to preserve a vivid expressive style (a schedule sketch follows below).
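For intuition, here is a minimal sketch of the linearly decayed learning-rate schedule the annealing phase describes. The peak and final rates are illustrative assumptions; the paper does not report them.

```python
def annealing_lr(step: int, total_steps: int,
                 peak_lr: float = 3e-5, final_lr: float = 3e-6) -> float:
    """Learning rate for the fast-annealing phase: linear decay from
    peak_lr to final_lr. The endpoint values are illustrative
    assumptions, not figures reported in the paper."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr + (final_lr - peak_lr) * frac
```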
Training drew on a 2-trillion-token dataset covering encyclopedic knowledge, webpages, news, academic papers, and STEM data, alongside a dedicated 500-billion-token corpus of creative writing and conversational data. Crucially, substantial Japanese-specific corpora were introduced, with a balanced distribution across Chinese, English, and Japanese to facilitate transfer learning.
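As a rough illustration of how such a balanced mixture can be realized, the sketch below samples the language of each pretraining document from fixed weights. The weights are hypothetical: the paper reports a balanced distribution but not exact proportions.

```python
import random

# Hypothetical mixture weights; the paper reports a balanced
# Chinese/English/Japanese distribution but not exact proportions.
MIXTURE = {"zh": 1.0, "en": 1.0, "ja": 1.0}

def next_document_language(rng: random.Random) -> str:
    """Pick the language of the next pretraining document in
    proportion to the mixture weights above."""
    languages, weights = zip(*MIXTURE.items())
    return rng.choices(languages, weights=weights, k=1)[0]
```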
Post-Training Data and Supervised Fine-Tuning (SFT)
The post-training dataset consisted of 2 million samples. To bring Japanese proficiency in line with major languages such as Chinese and English, target-language instructions were distributed in a 1:1:1 ratio across the three languages. The team built a detailed pipeline for constructing these instructions: collecting open-source datasets, rewriting instructions for creative and localized tasks (incorporating cultural style transformations), and synthesizing new instructions with methods such as Magpie and Self-Instruct. A strict quality-control pipeline, combining prompt engineering, validation by critic LLMs, and a "ReReading" mechanism, filtered out issues such as overly simple questions and internal contradictions.
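A minimal sketch of what such a filtering pass could look like is shown below. The `critic_llm` stand-in, its PASS/FAIL verdict format, and the modeling of "ReReading" as repeated critic queries with unanimous approval are all assumptions; the paper does not publish the implementation.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    instruction: str
    response: str

def critic_llm(prompt: str) -> str:
    """Hypothetical stand-in for a critic-LLM call; assumed to return
    'PASS' or 'FAIL'."""
    raise NotImplementedError

def passes_quality_control(sample: Sample, rereads: int = 2) -> bool:
    """Keep a sample only if the critic approves it on every
    re-reading pass (an assumed rendering of the 'ReReading'
    mechanism, whose details the paper does not spell out)."""
    prompt = (
        "Check this instruction/response pair for overly simple "
        "questions or internal contradictions.\n"
        f"Instruction: {sample.instruction}\n"
        f"Response: {sample.response}\n"
        "Answer PASS or FAIL."
    )
    return all(critic_llm(prompt).strip() == "PASS" for _ in range(rereads))
```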
During the SFT stage, linguistic diversity was balanced by clustering the instruction data, categorizing it with LLMs, and grading instructions by difficulty. To align the instruction space, a temperature parameter flattened the sampling distribution to prevent semantic overfitting, combined with a specialized two-step sampling strategy driven by difficulty level and linguistic diversity.
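One common way to apply such a temperature is to flatten raw cluster frequencies before sampling, so dominant semantic clusters are down-weighted. The formula p_i proportional to n_i^(1/tau) below is an assumption, a standard choice rather than the exact scheme In2x describes.

```python
import random

def temperature_sample_weights(cluster_sizes: dict[str, int],
                               tau: float = 2.0) -> dict[str, float]:
    """Flatten cluster frequencies with a temperature: tau > 1
    down-weights dominant clusters, tau = 1 keeps raw counts.
    The p_i ~ n_i**(1/tau) form is assumed, not taken from the paper."""
    scaled = {c: n ** (1.0 / tau) for c, n in cluster_sizes.items()}
    total = sum(scaled.values())
    return {c: s / total for c, s in scaled.items()}

def draw_cluster(weights: dict[str, float], rng: random.Random) -> str:
    """Sample one instruction cluster from the flattened distribution."""
    clusters, probs = zip(*weights.items())
    return rng.choices(clusters, weights=probs, k=1)[0]
```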
Reinforcement Learning (RL) for Cultural and Creative Industries
The RL stage used an additional 500,000 non-overlapping samples. The reward system paired a rule-based model for STEM and logic tasks with a generative reward model for creative tasks, which judged compliance with task principles. The RL algorithm itself received several strategic adjustments: Trajectory-Corrected GRPO for stability, a dual-clip mechanism, a soft length penalty, high-level clipping, temperature decay, and entropy regularization.
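To make two of these adjustments concrete, here is a sketch of the standard dual-clip policy-gradient loss and one plausible soft length penalty. The paper names both mechanisms but not their constants or exact shaping, so `eps`, `c`, `target`, and `slope` are assumed values, and the linear-ramp penalty is illustrative.

```python
import torch

def dual_clip_pg_loss(ratio: torch.Tensor, adv: torch.Tensor,
                      eps: float = 0.2, c: float = 3.0) -> torch.Tensor:
    """Dual-clip policy-gradient loss in its standard form; eps and c
    are assumed defaults, not values reported by In2x."""
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    surrogate = torch.min(ratio * adv, clipped * adv)
    # The second clip only bites when the advantage is negative,
    # bounding how much one bad sample can dominate the update.
    dual = torch.where(adv < 0, torch.max(surrogate, c * adv), surrogate)
    return -dual.mean()

def soft_length_penalty(length: int, target: int = 1024,
                        slope: float = 1e-3) -> float:
    """Illustrative soft length penalty: zero up to a target length,
    then a linear ramp; the actual shaping is not published."""
    return -slope * max(0, length - target)
```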
Impressive Results at WMT25
The In2x model demonstrated strong performance on prominent Japanese benchmarks, including ja-mtbench. Remarkably, without any task-specific fine-tuning, it took second place overall in the Japanese-related tracks of the WMT 2025 competition and first place in the unrestricted category, surpassing large-scale frontier models such as Gemini-2.5-Pro, GPT-4.1, Claude-4, and DeepSeek-V3.
Conclusion
The In2x team has validated their methodology for transferring language-model capabilities, demonstrating consistent gains in Japanese proficiency across the entire training pipeline. Their work charts a promising path toward strong performance in low-resource or less commonly spoken languages, bringing their capabilities in line with mainstream languages without additional language-specific fine-tuning.