TLDR: A new method called Re-Schedule improves Large Language Model (LLM) performance in complex reasoning tasks, especially math. It introduces a “Reasoning Score” (r-score) that measures a query’s learning difficulty based on the structure of its internal “Reasoning Tree,” rather than just its initial accuracy. By scheduling training data from structurally simple (high r-score) to complex (low r-score) queries, Re-Schedule significantly boosts accuracy, demonstrating that understanding the reasoning tree’s structure is key to more efficient LLM reinforcement learning.
Large Language Models (LLMs) have shown incredible capabilities, but enhancing their ability to tackle complex reasoning tasks, particularly in areas like mathematical problem-solving, remains a significant challenge. A promising approach involves Reinforcement Learning with Verifiable Rewards (RLVR), which essentially refines an LLM’s decision-making process by progressively ‘editing’ what researchers call a ‘Reasoning Tree’.
Understanding the Reasoning Tree
Imagine an LLM trying to solve a math problem. The various steps it takes, the intermediate thoughts, and the potential solution paths can be visualized as a tree. Each point (node) in this tree represents a partial reasoning step or a token generated by the LLM, and each complete path from the start to an end point (leaf node) represents a full solution trajectory. RLVR works by rewarding correct paths and penalizing incorrect ones, iteratively adjusting the model’s policy at each node to prune away branches leading to errors and strengthen those leading to correct answers.
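To make the tree picture concrete, here is a minimal sketch of how sampled solution paths could be merged into a prefix tree, with each node tracking how many of the paths through it ended in a correct answer. The `Node` class, the `build_tree` and `accuracy` helpers, and the step labels are all hypothetical illustrations rather than anything taken from the paper, and the sketch uses whole reasoning steps as nodes instead of individual tokens.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # step label -> child Node
    n_paths: int = 0     # sampled paths passing through this node
    n_correct: int = 0   # of those, paths that ended in a correct answer


def build_tree(paths):
    """Merge sampled (steps, is_correct) pairs into one prefix tree."""
    root = Node()
    for steps, is_correct in paths:
        node = root
        node.n_paths += 1
        node.n_correct += int(is_correct)
        for step in steps:
            # Paths that share a prefix share the same nodes.
            node = node.children.setdefault(step, Node())
            node.n_paths += 1
            node.n_correct += int(is_correct)
    return root


def accuracy(node):
    """Fraction of sampled paths through this node that were correct."""
    return node.n_correct / node.n_paths if node.n_paths else 0.0


# Four hypothetical sampled solutions to one math query.
paths = [
    (("set up equation", "solve for x", "x = 4"), True),
    (("set up equation", "solve for x", "x = 5"), False),
    (("set up equation", "sign error", "x = -4"), False),
    (("guess", "x = 3"), False),
]
tree = build_tree(paths)
print(accuracy(tree))                               # 0.25
print(accuracy(tree.children["guess"]))             # 0.0
```

In this toy tree the "guess" branch never succeeds, so it is the kind of branch that RLVR would be expected to prune, while the "set up equation" branch contains the only correct path.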
The Challenge with Existing Training Methods
A crucial aspect of training LLMs effectively is data scheduling, which is akin to curriculum learning – organizing training examples from easy to hard. However, current data scheduling methods for RLVR often rely on ‘path-based’ metrics, primarily focusing on the final solution accuracy of a query to determine its difficulty. The authors of the paper, “Scheduling Your LLM Reinforcement Learning with Reasoning Trees,” argue that this approach has a critical limitation: accuracy alone doesn’t truly reflect a query’s learning difficulty. A query might have low initial accuracy but be structurally simple to fix, while another with higher initial accuracy could be much harder to optimize due to a fragmented, complex reasoning structure.
Introducing the Reasoning Score (r-score)
To address this, Hong Wang and his co-authors introduce a novel metric called the ‘Reasoning Score’ (r-score). This score quantifies a query’s learning potential by analyzing the actual structure of its reasoning tree. Instead of just looking at whether the final answer is right or wrong, the r-score measures the maximum potential accuracy gain achievable within a limited ‘node editing budget’ – essentially, how much improvement can be made by correcting a few key decision points in the tree. A higher r-score indicates a more tractable reasoning structure and greater learning efficiency, meaning substantial improvements can be made with minimal effort.
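The idea of "accuracy gain under an editing budget" can be illustrated with a toy calculation. The sketch below is not the paper's formula. It assumes a simplified model in which one "edit" commits a node to a single child branch, every sampled path ends at a leaf, and the score is the best accuracy reachable with at most `budget` edits minus the current accuracy. The names `Node`, `best_accuracy`, and `toy_r_score` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    n_paths: int      # sampled paths passing through this node
    n_correct: int    # of those, paths ending in a correct answer
    children: list = field(default_factory=list)


def best_accuracy(node, budget):
    """Return v where v[b] is the best accuracy below `node` with at most b edits.

    Toy assumption: an edit makes a node always follow one chosen child.
    """
    if not node.children:
        return [node.n_correct / node.n_paths] * (budget + 1)
    kids = [(c.n_paths / node.n_paths, best_accuracy(c, budget))
            for c in node.children]
    # Option A: leave this node alone and split the budget among its children.
    keep = [0.0] * (budget + 1)
    for weight, vals in kids:
        keep = [max(keep[b - j] + weight * vals[j] for j in range(b + 1))
                for b in range(budget + 1)]
    # Option B: spend one edit here and commit to the single best child.
    out = keep[:]
    for b in range(1, budget + 1):
        out[b] = max(out[b], max(vals[b - 1] for _, vals in kids))
    return out


def toy_r_score(root, budget):
    """Accuracy gain reachable within the edit budget (toy definition)."""
    vals = best_accuracy(root, budget)
    return vals[budget] - vals[0]


def mixed():   # a branch whose two continuations split 1 correct / 1 wrong
    return Node(2, 1, [Node(1, 1), Node(1, 0)])


def wrong():   # a branch whose two continuations are both wrong
    return Node(2, 0, [Node(1, 0), Node(1, 0)])


# Two hypothetical queries with the same initial accuracy (2 of 8 paths correct).
easy = Node(8, 2, [Node(6, 0), Node(2, 2)])            # one bad early decision
hard = Node(8, 2, [mixed(), mixed(), wrong(), wrong()])  # errors spread out

print(toy_r_score(easy, budget=1))  # 0.75
print(toy_r_score(hard, budget=1))  # 0.25
```

Both queries start at 25% accuracy, yet under this toy definition a single edit fixes the first one entirely while the second needs more edits, which is the distinction an accuracy-only difficulty metric cannot see.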
The Re-Schedule Algorithm
Building on the r-score, the researchers propose a new data scheduling algorithm called ‘Re-Schedule’. This algorithm constructs a curriculum that progresses from structurally simple (high r-score) to complex (low r-score) queries. The process involves three main stages:
- An approximate reasoning tree is built for each query by sampling multiple solution paths from a base LLM.
- The r-score is calculated for each query based on this approximated tree’s structure.
- The r-score is then used to dynamically weight each query in the RLVR loss function. Initially, high-scoring (simple) queries are prioritized to accelerate early convergence. As training advances, the weighting gradually shifts towards lower-scoring (difficult) queries, enabling the model to master more challenging problems and improve generalization.
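The third stage above can be sketched as a weighting schedule. The article does not give the exact weighting function, so the linear interpolation below is only an assumption chosen for clarity, and `schedule_weights` and `weighted_loss` are hypothetical helper names. Per-query losses are plain floats standing in for whatever the RLVR objective produces.

```python
def schedule_weights(r_scores, step, total_steps):
    """Per-query weights that shift from high r-score to low r-score queries.

    Assumed schedule: linear interpolation over training progress.
    """
    lo, hi = min(r_scores), max(r_scores)
    span = hi - lo
    progress = min(max(step / total_steps, 0.0), 1.0)
    weights = []
    for r in r_scores:
        # Normalize to [0, 1]; if all scores are equal, treat them as neutral.
        r_norm = (r - lo) / span if span > 0 else 0.5
        # Early on, weight follows r_norm; late in training, 1 - r_norm.
        weights.append((1 - progress) * r_norm + progress * (1 - r_norm))
    return weights


def weighted_loss(losses, weights):
    """Weighted average of per-query losses."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


r_scores = [0.75, 0.25]   # hypothetical: a structurally simple and a complex query
losses = [2.0, 4.0]       # hypothetical per-query losses

for step in (0, 50, 100):
    w = schedule_weights(r_scores, step, total_steps=100)
    print(step, w, weighted_loss(losses, w))
```

At step 0 all of the weight sits on the high r-score query, at the midpoint both queries count equally, and by the final step the weight has moved entirely to the low r-score query. A real schedule would likely be smoother and would never zero out a query completely.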
Significant Performance Gains
Re-Schedule was evaluated on six math-reasoning benchmarks using two base models (Qwen2.5-Math-7B and Qwen2.5-7B). It consistently achieved state-of-the-art performance, outperforming both traditional RLVR methods and existing scheduling algorithms, with average accuracy gains of up to 3.2% over accuracy-based scheduling and up to 3.8% over classical RLVR methods. These results support the core idea that a structural understanding of the reasoning tree provides a more powerful and principled foundation for efficient RLVR data scheduling.
The research highlights that by looking beyond simple accuracy and examining the underlying structure of how LLMs reason, we can design more effective training strategies that substantially improve their complex problem-solving abilities. For more details, see the full paper, “Scheduling Your LLM Reinforcement Learning with Reasoning Trees.”


