Boosting Mathematical Reasoning in LLMs: A Two-Stage Training Strategy for Accuracy and Efficiency

TLDR: This research paper introduces a two-stage training recipe for Large Language Models (LLMs) to enhance their mathematical reasoning. The first stage involves extended Supervised Fine-Tuning (SFT) for up to 10 epochs to maximize accuracy. The second stage applies Reinforcement Learning with Group Relative Policy Optimization (GRPO) to dramatically improve token efficiency and solution length while maintaining high accuracy. The method was validated on challenging benchmarks like AIME and MATH-500, and achieved a high rank in the AI Mathematical Olympiad (AIMO), demonstrating its effectiveness in developing accurate and efficient mathematical LLMs.

Large Language Models (LLMs) are becoming increasingly powerful, but enhancing their ability to solve complex mathematical problems remains a significant challenge. Researchers are constantly looking for ways to make these models not only more accurate but also more efficient in how they generate solutions. A new study introduces a practical, two-stage training approach that aims to achieve both: maximizing accuracy through extensive Supervised Fine-Tuning (SFT) and then dramatically improving efficiency using Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO).

The paper, titled “A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning,” was authored by Hiroshi Yoshihara, Taiki Yamaguchi, and Yuichi Inoue. Their work suggests that SFT and RL are not competing methods but rather complementary tools that, when used in sequence, can lead to superior performance in mathematical reasoning tasks.

The Two-Stage Training Recipe

The core of this new methodology lies in its two distinct, yet interconnected, stages:

The first stage involves intensive Supervised Fine-Tuning (SFT). This phase is crucial for pushing the LLM’s problem-solving accuracy to its highest potential. The researchers meticulously built a high-difficulty dataset by combining examples from OpenR1 Math and Light-R1-SFT Data. A key insight from their experiments is that extending the SFT process for as many as 10 epochs is vital for significant performance improvements. While initial epochs might show a temporary dip, prolonged SFT consistently and substantially boosts the model’s accuracy. This stage uses full-parameter SFT, meaning all parts of the model are fine-tuned, and it’s trained with a system prompt guiding the model to reason step-by-step and provide answers in a specific format.

Following the SFT stage, the second stage applies Group Relative Policy Optimization (GRPO). While SFT excels at accuracy, it can sometimes lead to models generating longer, more verbose solutions. The GRPO phase addresses this by focusing on enhancing token efficiency without compromising the high accuracy achieved in the first stage. The GRPO training uses a sophisticated reward function with three components: a Format Reward to ensure correct output structure, a Cosine Similarity Reward that subtly penalizes longer correct answers and more severely penalizes shorter incorrect ones, and a Length Penalty to explicitly discourage overly verbose solutions. This strategic application of GRPO refines the model to be significantly more concise and practical for real-world applications.

Also Read:

Validation and Key Findings

The efficacy of this two-stage recipe was rigorously validated on several challenging benchmarks, including AIME 2024, AIME 2025, and MATH-500. Most notably, the model achieved a high rank among over 2,200 teams in the strictly leak-free AI Mathematical Olympiad (AIMO) competition, demonstrating its robustness and practical effectiveness in a highly competitive environment.

The experiments revealed several key insights. Firstly, the extended SFT phase (10 epochs) was indeed critical for achieving performance breakthroughs, especially for larger models (7B and 14B parameters). Secondly, GRPO’s primary role in this combined framework was found to be optimizing solution length, dramatically improving token efficiency while preserving or slightly improving the peak accuracy established by SFT. This confirms the complementary nature of the two methods: SFT sets the performance ceiling, and GRPO optimizes the solution generation process for efficiency.

The researchers also conducted an ablation study on the reward functions used in GRPO, confirming that incorporating a length penalty effectively reduces the average number of tokens. The cosine reward also yielded slightly higher accuracy compared to a simple binary accuracy reward.

To ensure full reproducibility and empower future research, the entire framework, including code, model checkpoints, and training configurations, will be open-sourced. You can find more details about this research in the full paper available at arXiv.

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

Boosting Mathematical Reasoning in LLMs: A Two-Stage Training Strategy for Accuracy and Efficiency

The Two-Stage Training Recipe

Validation and Key Findings

Gen AI News and Updates

Vesl AI Recognized for AI Infrastructure Innovation with ASOCIO Digital Summit Award

AT&T Unleashes Agentic AI Across Business Operations for Enhanced Efficiency and Innovation

AlphaCast: A New Approach to Time Series Prediction Through Human-AI Collaboration

Boosting Business Efficiency: A New AI and Big Data Model for Process Optimization

AlphaCast: A New Approach to Time Series Prediction Through Human-AI Collaboration

New Graph Neural Networks Improve Reasoning in Assumption-Based Argumentation

Enhancing AI Reasoning: How Recursive Refinement and Multi-Agent Systems Improve Language Model Performance

ARGUS: A Proactive Framework for Enhancing Autonomous Driving Safety

Generative AI Powers Next-Gen Autonomous Emergency Response

OR-R1: Advancing Automated Optimization with Smart, Data-Efficient AI

Enhancing GUI Agents with Memory: A New Framework for History-Aware Reasoning

ProBench: A Deeper Look into How We Evaluate AI Agents for Mobile Apps

Enhancing Large Language Model Reasoning with Concise Outputs

Ensuring Trust in Autonomous AI: A Two-Layered Monitoring Approach for Agentic Systems

MedFuse: A Multiplicative Approach to Understanding Irregular Clinical Time Series Data

HyperD: A New Framework for More Accurate and Robust Traffic Predictions

Beyond Training: Researchers Propose ‘Model Raising’ for AI with Intrinsic Values

Bridging the Divide: Why AI Needs a Qualitative Revolution

Language Models Enhance Safety Certificate Synthesis for Dynamic Systems

Subscribe to get the latest news and updates