TLDR: CurES is a new curriculum learning algorithm for Large Language Models (LLMs) that significantly improves training efficiency for reasoning tasks. It achieves this by theoretically analyzing gradient optimization and dynamically adjusting both the selection of training prompts and the allocation of computational resources (rollout quantities) based on prompt difficulty. Using a Bayesian framework, CurES continuously refines its understanding of prompt difficulty, focusing resources on moderately challenging examples. Experiments show CurES outperforms existing methods in accuracy and converges much faster, demonstrating superior sample efficiency with minimal computational overhead.
Large Language Models (LLMs) are becoming increasingly powerful, especially in complex reasoning tasks. However, training these models efficiently remains a significant challenge. A new research paper introduces CurES, an innovative method designed to make this training process much more effective and less computationally wasteful.
Traditional training approaches for LLMs often treat all training examples, or ‘prompts,’ equally. This uniform sampling can lead to inefficiencies, as some prompts might be too easy (offering diminishing returns) or too hard (where the model makes little progress). This is where curriculum learning comes in, aiming to present prompts in a more structured, progressive way. However, existing curriculum learning methods often fall short by not accurately gauging prompt difficulty or by using overly simplistic filtering, leading to wasted computational resources.
The researchers behind CurES, Yongcheng Zeng, Zexu Sun, Bokai Ji, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Haifeng Zhang, Xu Chen, and Jun Wang, approached this problem from the perspective of reinforcement learning gradient optimization. They conducted a systematic and theoretical investigation into how to boost LLM training efficiency. Their work identified two critical factors: how training prompts are selected and how ‘rollout quantities’ (the number of times a model attempts a prompt) are distributed across these prompts.
Their theoretical analysis revealed that the way prompts are sampled directly influences how quickly the model’s learning process (gradient descent) converges. Furthermore, the allocation of rollout quantities impacts the consistency and stability of the overall gradient updates. Building on these insights, they developed CurES, an efficient training method that not only accelerates convergence but also uses a clever technique called Bayesian posterior estimation to keep computational overhead to a minimum.
How CurES Works
CurES operates by first estimating the difficulty of each prompt, which it defines as the model’s accuracy in answering that particular question. This difficulty assessment then guides two key processes: the optimal sampling strategy for prompts and the allocation of rollout quantities. Essentially, CurES learns which prompts are ‘just right’ – not too easy, not too hard – and focuses more resources on them.
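The paper derives its optimal sampling distribution from its gradient analysis; as an illustrative stand-in (not the paper's exact formula), the sketch below weights each prompt by p·(1 − p), where p is the estimated success rate. This weight vanishes for prompts the model always or never solves and peaks at moderate difficulty, capturing the 'just right' intuition:

```python
import random

def selection_weights(success_rates):
    """Weight each prompt by p * (1 - p): near zero for prompts the
    model always or never solves, maximal at p = 0.5. Illustrative
    proxy, not CurES's derived sampling distribution."""
    return [p * (1 - p) for p in success_rates]

def sample_prompt(prompts, success_rates, rng=random):
    """Draw one prompt with probability proportional to its weight."""
    weights = selection_weights(success_rates)
    return rng.choices(prompts, weights=weights, k=1)[0]

# Three prompts: nearly unsolvable, moderate, nearly trivial.
rates = [0.05, 0.5, 0.95]
weights = selection_weights(rates)  # the middle prompt dominates
```

Under this proxy the moderate prompt (p = 0.5) receives weight 0.25, more than five times the weight of the near-trivial or near-impossible prompts.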
As the model trains and its capabilities evolve, the difficulty of prompts can change. To adapt to this, CurES employs a Bayesian inference framework. It models the success rate of each prompt using a Beta distribution, which is a statistical tool that can be continuously updated with new information. This means that as the model attempts more prompts, CurES refines its understanding of their difficulty, dynamically adjusting its sampling and resource allocation strategies. To prevent issues from the model’s performance shifting over time, the dataset is divided into subsets, and training is performed iteratively, with difficulty estimations reset at the start of each iteration.

Impressive Results
The effectiveness of CurES was rigorously tested against several strong baseline methods, including Group Relative Policy Optimization (GRPO) and REINFORCE++ (RPP), using Qwen2.5-Math models (1.5B and 7B parameters) on a wide array of challenging mathematical reasoning benchmarks like MATH500, GSM8K, and AIME. The results were compelling.
CurES consistently outperformed GRPO by a significant margin, gaining +3.30 points with the 1.5B models and +4.82 points with the 7B models. Beyond higher accuracy, CurES also converged much faster: CurES-GRPO reached GRPO’s peak performance in 5.5 times fewer training steps, and CurES-RPP was 1.75 times faster than RPP. This sample efficiency reflects CurES’s ability to consistently feed the model the most informative, optimally challenging samples.
The research also showed that CurES adaptively concentrates more rollouts on moderately difficult prompts, which are the most beneficial for learning. As training progresses, the distribution of these ‘moderately difficult’ prompts becomes sharper and narrower, indicating that CurES continuously refines its focus on the most impactful learning opportunities. This adaptive strategy ensures that computational effort is always directed where it yields the greatest improvement.
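The adaptive concentration of rollouts can be pictured with a toy budget-splitting sketch. This is not the paper's allocation rule, which follows from its gradient-stability analysis; here the hypothetical p·(1 − p) score again serves as a simple informativeness proxy:

```python
def allocate_rollouts(success_rates, budget):
    """Split a fixed rollout budget across prompts in proportion to
    p * (1 - p). Illustrative only -- CurES derives its allocation
    from a gradient-stability analysis, not this heuristic."""
    scores = [p * (1 - p) for p in success_rates]
    total = sum(scores) or 1.0
    alloc = [int(budget * s / total) for s in scores]
    # Hand any leftover rollouts to the highest-scoring prompts.
    leftover = budget - sum(alloc)
    by_score = sorted(range(len(scores)), key=lambda i: scores[i],
                      reverse=True)
    for i in by_score[:leftover]:
        alloc[i] += 1
    return alloc

# Nearly unsolvable, moderate, and nearly trivial prompts.
alloc = allocate_rollouts([0.05, 0.5, 0.95], budget=16)
```

With this proxy, the moderately difficult prompt absorbs most of the 16 rollouts while the extremes get only a few each, echoing the concentration behavior the paper observes.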
In conclusion, CurES represents a significant step forward in making LLM training for reasoning tasks more efficient and stable. By intelligently selecting prompts and allocating computational resources based on a deep understanding of gradient dynamics and prompt difficulty, it enables models to learn faster and achieve higher accuracy. For more details, you can refer to the full preprint: CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs.