Boosting Mathematical Reasoning in LLMs: A Two-Stage Training Strategy for Accuracy and Efficiency

TLDR: This research paper introduces a two-stage training recipe for Large Language Models (LLMs) to enhance their mathematical reasoning. The first stage applies extended Supervised Fine-Tuning (SFT) for up to 10 epochs to maximize accuracy. The second stage applies Reinforcement Learning with Group Relative Policy Optimization (GRPO) to dramatically improve token efficiency by shortening solutions while maintaining high accuracy. The method was validated on challenging benchmarks such as AIME and MATH-500 and achieved a high rank in the AI Mathematical Olympiad (AIMO), demonstrating its effectiveness in developing accurate and efficient mathematical LLMs.

Large Language Models (LLMs) are becoming increasingly powerful, but enhancing their ability to solve complex mathematical problems remains a significant challenge. Researchers are constantly looking for ways to make these models not only more accurate but also more efficient in how they generate solutions. A new study introduces a practical, two-stage training approach that aims to achieve both: maximizing accuracy through extensive Supervised Fine-Tuning (SFT) and then dramatically improving efficiency using Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO).

The paper, titled “A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning,” was authored by Hiroshi Yoshihara, Taiki Yamaguchi, and Yuichi Inoue. Their work suggests that SFT and RL are not competing methods but rather complementary tools that, when used in sequence, can lead to superior performance in mathematical reasoning tasks.

The Two-Stage Training Recipe

The core of this new methodology lies in its two distinct, yet interconnected, stages:

The first stage is intensive Supervised Fine-Tuning (SFT), which pushes the LLM’s problem-solving accuracy as high as possible. The researchers built a high-difficulty dataset by combining examples from OpenR1 Math and Light-R1-SFT Data. A key insight from their experiments is that extending SFT to as many as 10 epochs is vital for significant performance improvements: accuracy may dip temporarily in the initial epochs, but prolonged SFT consistently and substantially raises it. This stage uses full-parameter SFT, meaning all of the model’s weights are updated, and the model is trained with a system prompt that instructs it to reason step by step and give its answer in a specific format.

Following the SFT stage, the second stage applies Group Relative Policy Optimization (GRPO). While SFT excels at accuracy, it can leave models generating long, verbose solutions. The GRPO phase addresses this by improving token efficiency without compromising the accuracy achieved in the first stage. Its reward function has three components: a Format Reward that checks the output structure, a Cosine Reward that scales the correctness signal with response length, so that longer correct answers earn slightly less and shorter incorrect answers are penalized more heavily, and a Length Penalty that explicitly discourages overly verbose solutions. Applied this way, GRPO makes the model significantly more concise and practical for real-world use.
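To make the three-part reward more concrete, here is a minimal Python sketch of how such a reward could be assembled. It is an illustration under stated assumptions, not the authors' implementation: the `\boxed{...}` answer format, the token budget `MAX_LEN`, the reward ranges, and the penalty weight are all hypothetical values chosen for this example.

```python
import math
import re

# Every constant below is a hypothetical value chosen for illustration.
MAX_LEN = 8192               # assumed generation budget in tokens
CORRECT_RANGE = (1.0, 0.5)   # correct answer: reward at length 0, at MAX_LEN
WRONG_RANGE = (-1.0, -0.5)   # wrong answer: reward at length 0, at MAX_LEN
LENGTH_WEIGHT = 0.2          # assumed weight of the explicit length penalty


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion contains a \\boxed{...} answer (assumed format)."""
    return 1.0 if re.search(r"\\boxed\{[^{}]+\}", completion) else 0.0


def cosine_interp(t: float, total: float, start: float, end: float) -> float:
    """Move from `start` (t = 0) to `end` (t = total) along a half cosine."""
    t = min(max(t, 0.0), total)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * t / total))


def cosine_reward(is_correct: bool, n_tokens: int) -> float:
    """Length-aware correctness reward.

    Correct answers earn slightly less as they get longer;
    wrong answers are penalized more heavily when they are short.
    """
    start, end = CORRECT_RANGE if is_correct else WRONG_RANGE
    return cosine_interp(n_tokens, MAX_LEN, start, end)


def length_penalty(n_tokens: int) -> float:
    """Linear penalty that grows with the number of generated tokens."""
    return -LENGTH_WEIGHT * min(n_tokens, MAX_LEN) / MAX_LEN


def total_reward(completion: str, is_correct: bool, n_tokens: int) -> float:
    """Sum of the three components described in the text."""
    return (
        format_reward(completion)
        + cosine_reward(is_correct, n_tokens)
        + length_penalty(n_tokens)
    )
```

With these assumed settings, a correct 500-token solution scores higher than a correct 6,000-token one, and a wrong answer given after 500 tokens scores lower than a wrong answer given after 6,000 tokens, which matches the behavior the article describes.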

Validation and Key Findings

The efficacy of this two-stage recipe was rigorously validated on several challenging benchmarks, including AIME 2024, AIME 2025, and MATH-500. Most notably, the model achieved a high rank among over 2,200 teams in the strictly leak-free AI Mathematical Olympiad (AIMO) competition, demonstrating its robustness and practical effectiveness in a highly competitive environment.

The experiments revealed several key insights. Firstly, the extended SFT phase (10 epochs) was indeed critical for achieving performance breakthroughs, especially for larger models (7B and 14B parameters). Secondly, GRPO’s primary role in this combined framework was found to be optimizing solution length, dramatically improving token efficiency while preserving or slightly improving the peak accuracy established by SFT. This confirms the complementary nature of the two methods: SFT sets the performance ceiling, and GRPO optimizes the solution generation process for efficiency.
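One way to see why a group-relative objective would push toward shorter solutions once accuracy is already high: GRPO scores each sampled solution relative to the other samples drawn for the same problem. The sketch below computes the standard group-normalized advantage (reward minus group mean, divided by group standard deviation). The reward values are hypothetical, and the use of the population standard deviation is an assumption of this example rather than a detail taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled solutions to one problem.
# All four are correct; they are ordered from shortest to longest,
# so a length-aware reward gives the shorter ones slightly more.
rewards = [1.9, 1.6, 1.3, 1.2]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f}  advantage={advantage:+.2f}")
```

When every sample in the group is correct, the only signal left is the length-dependent part of the reward: the shortest solutions receive positive advantages and the longest receive negative ones, so the policy update favors conciseness without trading away accuracy.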

The researchers also conducted an ablation study on the reward functions used in GRPO, confirming that incorporating a length penalty effectively reduces the average number of tokens. The cosine reward also yielded slightly higher accuracy compared to a simple binary accuracy reward.

To ensure full reproducibility and empower future research, the entire framework, including code, model checkpoints, and training configurations, will be open-sourced. You can find more details about this research in the full paper available at arXiv.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
