TL;DR: QuestA is a novel strategy that enhances the multi-step reasoning capabilities of large language models (LLMs), particularly on challenging problems. By augmenting training data with partial solutions, QuestA provides more informative learning signals during reinforcement learning (RL) training. This simple yet effective method improves performance on math reasoning tasks, achieving state-of-the-art results for 1.5B-parameter models and demonstrating significant gains in sample efficiency without causing entropy collapse.
Large Language Models (LLMs) have made incredible strides in various complex tasks, from writing creative text to solving intricate problems. A key method behind their advanced reasoning abilities is Reinforcement Learning (RL). However, recent observations have highlighted a challenge: standard RL often struggles to significantly improve multi-step reasoning, especially when faced with very difficult problems.
A new research paper titled “QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation” introduces an innovative and straightforward approach to tackle this limitation. Authored by Jiazheng Li, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Hongzhou Lin, Yi Wu, and Jingzhao Zhang, the paper proposes a method called QuestA, which stands for Question Augmentation.
What is QuestA and How Does It Work?
QuestA’s core idea is surprisingly simple yet highly effective: it introduces ‘partial solutions’ into the training process of LLMs. Imagine a student struggling with a complex math problem. Instead of just telling them the final answer, you give them a hint—the first few steps of the solution. This makes the problem less daunting and provides a clearer path forward. QuestA applies this same principle to LLMs.
Unlike other methods that might tweak the RL algorithm itself or change how rewards are given, QuestA operates purely at the input level. When training an LLM, especially on problems where the model initially fails completely, QuestA takes the original question and prepends a segment of its correct solution. For instance, it might add the first 50% of the solution sketch as a hint to the prompt. This ‘scaffolding’ helps the model explore the problem space more effectively and find correct solutions, even when it would otherwise get stuck due to a lack of positive feedback.
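The input-level augmentation can be sketched in a few lines. This is a minimal illustration, not the paper's exact prompt template: the wording of the hint, the step-based split, and the 50% default ratio are assumptions made here for clarity.

```python
def augment_question(question: str, solution_steps: list[str], hint_ratio: float = 0.5) -> str:
    """Prepend the first `hint_ratio` fraction of a reference solution
    to the question as scaffolding (QuestA-style question augmentation).
    The prompt wording here is illustrative, not the paper's template."""
    n_hint = int(len(solution_steps) * hint_ratio)
    hint = "\n".join(solution_steps[:n_hint])
    if not hint:
        return question  # nothing to prepend for very short solutions
    return (
        f"{question}\n\n"
        f"Here is a partial solution to get you started:\n{hint}\n\n"
        f"Continue from this point and complete the solution."
    )

# Toy example: only the first of two steps is revealed as the hint.
prompt = augment_question(
    "Find the sum of all positive divisors of 28.",
    ["The divisors of 28 are 1, 2, 4, 7, 14, 28.",
     "Sum them: 1 + 2 + 4 + 7 + 14 + 28 = 56."],
)
```

The key design point is that nothing about the RL loop changes: the augmented prompt simply replaces the original one, so the method composes with any existing training pipeline.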
The researchers specifically focused on challenging math reasoning tasks. They used a dataset of 26,000 difficult problems from OpenR1-Math-220K. By injecting these partial solutions, QuestA provides a denser and more informative learning signal, allowing the RL process to make progress where it previously stalled. This approach also helps prevent ‘entropy collapse,’ a phenomenon where the model’s output becomes too narrow, limiting its ability to explore diverse solutions. QuestA, in contrast, encourages more varied and exploratory behavior.
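Augmentation is most useful on exactly the problems where the base model earns no reward at all. A minimal sketch of selecting such a hard subset follows; the pass-rate criterion and threshold here are illustrative assumptions, not the paper's exact filtering procedure.

```python
def select_hard_problems(problems: list[str],
                         pass_rates: list[float],
                         max_pass_rate: float = 0.0) -> list[str]:
    """Keep only problems the base model rarely or never solves,
    as measured by its empirical pass rate over sampled rollouts.
    These are the problems where a partial-solution hint adds the
    most signal (threshold is an illustrative assumption)."""
    return [p for p, r in zip(problems, pass_rates) if r <= max_pass_rate]

# Toy example: pass rates from, say, 8 rollouts per problem.
hard = select_hard_problems(
    ["problem A", "problem B", "problem C"],
    [0.0, 0.625, 0.0],
)
```

With `max_pass_rate=0.0`, only problems the model never solves survive, which matches the intuition that hints rescue training signal precisely where standard RL receives none.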
Why is This Important?
The paper provides theoretical backing for QuestA's effectiveness, showing that it significantly improves sample efficiency: because the hints guide the search more directly, the model needs far fewer rollouts to find a correct solution. This matters for training large models, where it can save substantial compute and wall-clock time.
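The sample-efficiency intuition admits a back-of-the-envelope calculation (the probabilities below are illustrative assumptions, not figures from the paper): if each rollout independently succeeds with probability p, the expected number of rollouts until the first correct solution is 1/p.

```python
def expected_attempts(success_prob: float) -> float:
    """Mean number of independent rollouts until the first correct
    solution (mean of a geometric distribution: 1 / p)."""
    if not 0 < success_prob <= 1:
        raise ValueError("success_prob must be in (0, 1]")
    return 1.0 / success_prob

# Illustrative numbers only: if a partial-solution hint lifts the
# per-rollout success rate from 1% to 20%, the expected number of
# attempts drops from about 100 to 5.
without_hint = expected_attempts(0.01)
with_hint = expected_attempts(0.20)
```

Since positive-reward samples are what drive RL updates, this multiplicative reduction in wasted rollouts translates directly into the denser learning signal the paper describes.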
Impressive Results
QuestA was applied to strong open-source 1.5B-parameter models, DeepScaleR and Nemotron, and the results were remarkable. The method achieved new state-of-the-art performance on several challenging math benchmarks:
- AIME24: 67.1% accuracy (a 5.3-point improvement)
- AIME25: 59.5% accuracy (a 10.0-point improvement)
- HMMT25: 35.5% accuracy (a 4.0-point improvement)
What’s particularly impressive is that QuestA-enhanced 1.5B-parameter models not only outperformed other models of similar size but also matched or even exceeded the performance of much larger models, such as DeepSeek-R1-Distill-32B, on several benchmarks. This demonstrates QuestA’s ability to unlock deeper reasoning capabilities in smaller models through targeted training.
Even though QuestA training used exclusively mathematical problems, the resulting models showed modest improvements in other domains such as general knowledge, logic, and coding tasks, suggesting potential for broader application. An ablation study using a different dataset, OpenMathReasoning, yielded similar positive results, further supporting the method's generalizability.
Looking Ahead
QuestA offers a practical and broadly applicable pathway for expanding the reasoning capacity of LLMs through RL. By focusing on data augmentation rather than complex algorithmic changes, it provides a flexible tool for improving model performance on difficult tasks. The researchers believe this method could be extended to other challenging domains like competitive coding and software engineering, paving the way for even more capable AI systems. You can read the full research paper for more details at https://arxiv.org/pdf/2507.13266.


