TLDR: A new method called “h1” improves large language models’ long-horizon reasoning by synthetically composing simple problems into complex, multi-step chains. Using reinforcement learning with outcome-only rewards and a curriculum that gradually increases problem complexity, the approach significantly boosts accuracy on challenging benchmarks (e.g., AIME, MATH-500) and teaches new reasoning capabilities. A theoretical analysis shows an exponential improvement in sample complexity over training directly on full-horizon problems. The work also demonstrates that performance can be maintained even with data skewed towards cheaper, short-horizon examples by increasing the computational budget.
Large Language Models (LLMs) have shown impressive abilities in many areas, but they often struggle when tasks require a long sequence of reasoning steps. This challenge, known as long-horizon reasoning (LHR), involves breaking down complex goals into many intermediate steps, managing a large context, and ensuring errors don’t accumulate along the way. Current methods to address this often involve complex inference-time adjustments or expensive step-by-step supervision, which are not easily scalable.
A new research paper introduces a scalable approach to enhance LLMs’ long-horizon reasoning capabilities using only existing, readily available short-horizon data. The paper, titled “Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning,” introduces a method called “h1” that synthetically creates complex, multi-step problems from simpler ones. You can read the full paper here: https://arxiv.org/pdf/2510.07312.
The Core Idea: Composing Simple Problems into Complex Chains
The researchers’ key innovation is a “chained problem construction” method. They take short, self-contained problems, like those found in 6th-grade math datasets (e.g., GSM8K), and link them together into longer sequences. Each subsequent sub-problem in the chain depends on the result of the previous one. This creates arbitrarily long and complex reasoning paths without needing new human annotations or expensive teacher models.
For example, a simple math problem might ask for a single calculation. In a chained problem, the answer to that first problem becomes a crucial input for a second, related problem, and so on. This forces the LLM to not only solve individual steps correctly but also to manage intermediate values, transform them, and carry them forward accurately across the entire sequence.
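The chaining idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' construction code: the `AtomicProblem` class, the `compose_chain` helper, and the pencil problems are all hypothetical names and data invented here, and the sketch assumes every sub-problem takes a single integer input.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AtomicProblem:
    """A short problem with one numeric input (hypothetical structure)."""
    template: str                # problem text with an {x} placeholder
    solve: Callable[[int], int]  # ground-truth answer for a given input


def compose_chain(problems: List[AtomicProblem], start: int) -> Tuple[str, int]:
    """Link problems so that each one consumes the previous answer."""
    lines, value = [], start
    for i, problem in enumerate(problems, 1):
        # Only the first step sees a concrete number; later steps refer
        # back to the previous answer, so it must be carried forward.
        shown = str(value) if i == 1 else f"[answer to step {i - 1}]"
        lines.append(f"Step {i}: {problem.template.format(x=shown)}")
        value = problem.solve(value)
    return "\n".join(lines), value


problems = [
    AtomicProblem("A box holds {x} pencils. How many pencils are in 3 boxes?",
                  lambda x: 3 * x),
    AtomicProblem("A class has {x} pencils and gives away 6. How many remain?",
                  lambda x: x - 6),
    AtomicProblem("{x} pencils are split equally between 2 students. "
                  "How many does each get?",
                  lambda x: x // 2),
]

prompt, final_answer = compose_chain(problems, start=12)
print(prompt)
print("Final answer:", final_answer)
```

Only the final answer is kept as the label for the whole chain, so no per-step annotation is needed; adding more problems to the list lengthens the horizon.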
Reinforcement Learning with a Curriculum
To train models on this synthetically generated data, the team uses reinforcement learning (RL) with “outcome-only rewards.” The model receives a reward only if its final answer to the entire multi-step problem is correct, with no feedback on intermediate steps. This is a more realistic and scalable setup, because dense, step-level supervision is often unavailable.
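An outcome-only reward can be as simple as comparing the model's final answer with the reference answer. The sketch below is an assumption about how such a check might look, not the paper's implementation: the function names are hypothetical, and it assumes the final answer is the last number in the completion.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number in the completion, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


def outcome_reward(completion: str, gold: str) -> float:
    """1.0 if the final answer matches the reference, else 0.0.

    Intermediate steps are never inspected: a chain with a correct
    first step but a wrong final answer still scores 0.0.
    """
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if float(predicted) == float(gold) else 0.0


print(outcome_reward("First 36, then 30, so the answer is 15.", "15"))
print(outcome_reward("First 36, then 30, so the answer is 14.", "15"))
```

The second call scores zero even though the first two intermediate values are correct, which is exactly what makes long chains hard to learn from scratch.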
Crucially, the training follows a “curriculum” that automatically increases in complexity. The model first learns to solve short chains of problems, then gradually progresses to longer and more complex ones. This staged approach is vital because directly training on very long, hard problems from the start would result in extremely sparse rewards, making learning inefficient. By mastering shorter horizons first, the model builds foundational skills that make learning longer horizons much more feasible.
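One way to picture such a curriculum is a scheduler that lengthens the chains once the model is reliably solving the current length. The article does not specify the paper's exact schedule, so everything here is an assumption made for illustration: the `HorizonCurriculum` class, the success-rate threshold, and the rolling window are hypothetical.

```python
from collections import deque


class HorizonCurriculum:
    """Raise the chain length once recent success is high enough (a sketch)."""

    def __init__(self, max_horizon: int, threshold: float = 0.8, window: int = 4):
        self.horizon = 1                    # start with single-step problems
        self.max_horizon = max_horizon
        self.threshold = threshold          # hypothetical promotion criterion
        self.recent = deque(maxlen=window)  # rolling window of 0/1 rewards

    def record(self, reward: float) -> int:
        """Log one outcome reward and return the horizon to train on next."""
        self.recent.append(reward)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and self.horizon < self.max_horizon:
            if sum(self.recent) / len(self.recent) >= self.threshold:
                self.horizon += 1
                self.recent.clear()         # restart statistics for the new stage
        return self.horizon


curriculum = HorizonCurriculum(max_horizon=3, threshold=0.75, window=4)
for reward in [1, 1, 0, 1]:
    curriculum.record(reward)
print("Horizon after a successful window:", curriculum.horizon)
```

In a real training loop the rewards would come from rollouts on chains of the current length; here they are hard-coded to show the promotion logic.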
Remarkable Generalization and New Capabilities
Training LLMs on these composed 6th-grade math problems significantly boosted their accuracy on much harder benchmarks, including competition-level ones: accuracy on GSM-Symbolic, MATH-500, and AIME improved by up to 2.06 times. These tasks implicitly require long-horizon reasoning, even if they don’t explicitly state the number of steps.
Beyond just improving existing skills, the research provides evidence that this curriculum-based RL training actually teaches LLMs genuinely new reasoning capabilities. Unlike previous findings that suggested RL mainly refines existing abilities, this work demonstrates that models can learn novel reasoning paths, especially when evaluated at high sampling budgets (pass@k).
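For readers unfamiliar with pass@k: it is the probability that at least one of k sampled solutions is correct. A widely used unbiased estimator, popularized by code-generation evaluations, computes it from n samples of which c are correct. Whether the paper uses exactly this estimator is an assumption; the sketch is only meant to make the metric concrete.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a
    random subset of k samples contains no correct solution.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# One correct sample out of four, evaluated at k = 2.
print(pass_at_k(n=4, c=1, k=2))
```

If RL only sharpened abilities the base model already had, the base model would tend to catch up at large k. Gains that persist at high k are what support the claim of new capabilities.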
The method also showed transferability to long-context benchmarks like LongBench-v2 and Hash-hop, which involve understanding and reasoning over very large inputs or tracing variables across shuffled chains. This suggests that the state-tracking and context management skills learned during long-horizon math training are broadly applicable.
Theoretical Foundations and Cost Efficiency
The paper also includes a theoretical analysis, showing that curriculum RL with outcome-only rewards offers an exponential improvement in sample complexity compared to training directly on full-horizon problems. This means it requires far fewer training examples to achieve the same results, making the approach highly efficient.
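A back-of-the-envelope calculation gives some intuition for where an exponential gap can come from. This is a toy model and not the paper's theorem: it assumes each step succeeds independently with probability p for an untrained model, and that a curriculum has already mastered all earlier steps before a new one is added. Both function names are hypothetical.

```python
def expected_rollouts_direct(p: float, horizon: int) -> float:
    """Expected rollouts before the first reward when training on the full chain.

    All `horizon` steps must succeed at once, which happens with
    probability p ** horizon.
    """
    return 1.0 / (p ** horizon)


def expected_rollouts_curriculum(p: float, horizon: int) -> float:
    """Expected rollouts to see a first reward at every curriculum stage.

    Toy assumption: earlier steps are already mastered, so each stage
    only waits on the newly added step (about 1 / p rollouts per stage).
    """
    return horizon / p


p, horizon = 0.5, 10
print("Direct training:", expected_rollouts_direct(p, horizon))
print("Curriculum:", expected_rollouts_curriculum(p, horizon))
```

Under these assumptions the direct cost grows exponentially with the horizon while the curriculum cost grows linearly, which matches the flavor of the result described above.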
Furthermore, the researchers explored how to design cost-efficient curricula. They found that even with training data skewed towards more abundant short-horizon examples (which are cheaper to obtain), high long-horizon performance can still be achieved by allocating more computational resources to training. This trade-off is crucial for practical, real-world applications where long-horizon data is scarce and expensive.
Future Directions
This work introduces an efficient and scalable path towards improving LLMs’ long-horizon reasoning. Future extensions could involve incorporating diverse “atomic skills” beyond math problems and developing more general chaining methods that go beyond simple serial dependencies, potentially using more complex computational graphs.


