TLDR: This research investigates how training large language models (LLMs) with data from multiple reasoning domains (math, code, puzzles) using reinforcement learning (RL) affects their performance. It finds that while multi-domain training generally improves overall reasoning and task balance, specific domain combinations can lead to both mutual enhancements and conflicts. The study also highlights the critical roles of supervised fine-tuning, consistent training templates, curriculum learning, tailored reward designs, and language in optimizing LLM reasoning capabilities.
Large language models (LLMs) have made rapid progress on reasoning tasks, from solving complex math problems to generating code and tackling logical puzzles. A key method behind these advances is Reinforcement Learning with Verifiable Rewards (RLVR), which improves a model’s reasoning by rewarding outputs whose correctness can be checked automatically.
However, most previous research has focused on training LLMs on these reasoning tasks in isolation. In the real world, complex problems often require a combination of different cognitive skills. This paper, titled “Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning,” delves into how these different reasoning skills interact when LLMs are trained using reinforcement learning.
The researchers conducted a comprehensive study focusing on three core reasoning domains: mathematical reasoning, code generation, and logical puzzle solving. They used the GRPO (Group Relative Policy Optimization) algorithm and the Qwen-2.5-7B model family for their experiments. Their investigation covered several key areas:
Understanding Single-Domain Training
First, the study looked at how training models on a single domain (like just math or just code) affects their performance within that domain and their ability to generalize to other domains. For instance, the researchers found that training on mathematical data significantly improved the model’s math skills and, surprisingly, also boosted its puzzle-solving abilities. However, this same math training often led to a decline in coding performance, suggesting that different reasoning requirements can sometimes conflict.
Similarly, training on code data greatly enhanced the model’s coding proficiency. Interestingly, the impact on other domains varied depending on whether the model had prior supervised fine-tuning (SFT). For models that had SFT, code training often helped with cross-domain reasoning, but for base models without SFT, it could actually limit their flexibility in non-code tasks.
Puzzle training, on the other hand, improved logical reasoning, which transferred well to mathematical tasks. However, its effect on coding performance was inconsistent, sometimes leading to a reduction in scores, likely due to the fixed format of puzzle data not aligning with coding requirements.
Combining Multiple Domains
The study then explored what happens when models are trained on combinations of these domains. The researchers found that combining data from specific domains could lead to synergistic benefits. For example, training with both math and puzzle data improved math performance even more than math-only training. Combining puzzle and code data also showed strong overall improvements.
However, adding more domains doesn’t always guarantee better performance. Sometimes, increased data diversity can hinder the model’s ability to specialize in a particular task, especially for highly specialized tasks like puzzles. The researchers observed that while combining all three domains (math, code, and puzzle) led to the highest overall performance and better task balance, there could still be some negative transfer on specific tasks, such as a slight drop in puzzle performance compared to a puzzle-only setup.
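To make the idea of domain combinations concrete, here is a minimal sketch of how a mixed-domain training set could be assembled. It is an illustration only: the function name `mix_domains`, the sampling weights, and the toy example pools are all assumptions and are not taken from the paper, which may build its mixtures differently.

```python
import random

def mix_domains(pools, weights, n, seed=0):
    """Draw n training examples from several domain pools.

    pools:   dict mapping a domain name to a list of examples
    weights: dict mapping the same domain names to sampling weights
             (hypothetical knob; a weight of 0 excludes a domain)
    """
    rng = random.Random(seed)
    domains = sorted(pools)
    probs = [weights[d] for d in domains]
    batch = []
    for _ in range(n):
        # Pick a domain in proportion to its weight, then an example from it.
        domain = rng.choices(domains, weights=probs, k=1)[0]
        batch.append((domain, rng.choice(pools[domain])))
    return batch

# Toy placeholder pools; real pools would hold prompts with verifiable answers.
pools = {
    "math": ["m1", "m2", "m3"],
    "code": ["c1", "c2"],
    "puzzle": ["p1", "p2", "p3", "p4"],
}

# Math + puzzle only: a weight of 0 drops the code pool.
math_puzzle = mix_domains(pools, {"math": 1, "code": 0, "puzzle": 1}, n=8)

# All three domains, equally weighted.
triple = mix_domains(pools, {"math": 1, "code": 1, "puzzle": 1}, n=8)
```

Changing the weights is one simple way to compare a dual-domain mixture against a triple-domain one while holding the total number of examples fixed.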
Crucial Training Factors
Beyond data combinations, the paper also investigated other critical aspects of RL training:
- Template Consistency: A significant finding was the importance of using consistent templates during both training and evaluation. Mismatched templates, where the format of questions or answers differs between training and testing, severely degraded model performance. This highlights a current lack of robustness in RLVR models to such variations.
- Curriculum Learning: The researchers explored curriculum learning, a strategy where models are trained on easier tasks before moving to harder ones. They found that this approach improved the model’s performance ceiling. A novel “policy refresh” strategy, which periodically updates the reference model and resets the optimizer state, further accelerated learning and enhanced final results.
- Reward Design: The way rewards are given to the model also proved crucial. Binary rewards (all or nothing) worked well for simpler tasks, while partial rewards (based on how much of the answer is correct) were more suitable for complex tasks where models might not get everything right initially. The study suggests that more fine-grained partial reward signals are needed for further improvements.
- Training Language: The language of the training data also played a role. Models trained to reason in Chinese consistently underperformed compared to those trained in English, indicating that RLVR is language-sensitive and more advanced algorithms are needed for better cross-lingual generalization.
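The difference between binary and partial rewards can be sketched in a few lines. This is a simplified illustration, not the paper's reward implementation: the function names, the position-by-position scoring rule, and the made-up knight/knave puzzle answer are all assumptions chosen to show the contrast.

```python
def binary_reward(predicted, reference):
    """All-or-nothing: 1.0 only if the whole answer matches."""
    return 1.0 if predicted == reference else 0.0

def partial_reward(predicted, reference):
    """Fraction of reference positions answered correctly (assumed rule)."""
    if not reference:
        return 0.0
    # Missing positions count as wrong because we divide by len(reference).
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

# Hypothetical four-part puzzle answer and a model attempt with one mistake.
reference = ["knight", "knave", "knight", "knave"]
attempt = ["knight", "knave", "knave", "knave"]

print(binary_reward(attempt, reference))   # 0.0
print(partial_reward(attempt, reference))  # 0.75
```

With the binary rule, the nearly correct attempt earns nothing, so a model that rarely solves a hard task end to end receives little learning signal. The partial rule still credits the three correct parts, which is the intuition behind preferring it for complex tasks.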
In conclusion, this data-centric study provides valuable insights into how different reasoning domains interact within the RLVR framework. It reveals that while multi-domain training can significantly enhance overall LLM reasoning capabilities and promote balanced performance, careful design choices are essential to leverage synergies and mitigate potential conflicts. The findings also underscore the importance of factors like template consistency, curriculum learning, and tailored reward mechanisms for optimizing RL methodologies to foster comprehensive, multi-domain reasoning in LLMs.


