Enhancing LLMs' Multi-Step Reasoning with Composed Data and Curriculum Learning

TLDR: A new method called “h1” improves large language models’ long-horizon reasoning by synthetically composing simple problems into complex, multi-step chains. It trains with reinforcement learning on outcome-only rewards and a curriculum that gradually increases problem complexity. The approach significantly boosts accuracy on challenging benchmarks (e.g., AIME, MATH-500), shows evidence of teaching new reasoning capabilities, and comes with a theoretical analysis showing an exponential improvement in sample complexity over training directly on full-horizon problems. It also demonstrates that performance can be maintained with data skewed towards cheaper, short-horizon examples if more training compute is spent.

Large Language Models (LLMs) have shown impressive abilities in many areas, but they often struggle when tasks require a long sequence of reasoning steps. This challenge, known as long-horizon reasoning (LHR), involves breaking down complex goals into many intermediate steps, managing a large context, and ensuring errors don’t accumulate along the way. Current methods to address this often involve complex inference-time adjustments or expensive step-by-step supervision, which are not easily scalable.

A new research paper, “Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning,” introduces a scalable approach to enhance LLMs’ long-horizon reasoning capabilities using only existing, readily available short-horizon data. It proposes a way to synthetically create complex, multi-step problems from simpler ones. You can read the full paper here: https://arxiv.org/pdf/2510.07312.

The Core Idea: Composing Simple Problems into Complex Chains

The researchers’ key innovation is a “chained problem construction” method. They take short, self-contained problems, like those found in 6th-grade math datasets (e.g., GSM8K), and link them together into longer sequences. Each subsequent sub-problem in the chain depends on the result of the previous one. This creates arbitrarily long and complex reasoning paths without needing new human annotations or expensive teacher models.

For example, a simple math problem might ask for a single calculation. In a chained problem, the answer to that first problem becomes a crucial input for a second, related problem, and so on. This forces the LLM to not only solve individual steps correctly but also to manage intermediate values, transform them, and carry them forward accurately across the entire sequence.
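
The chaining idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's construction: the `AtomicProblem` class, the `compose_chain` function, and the pencil problems are all hypothetical, and the real pipeline draws its atomic problems from datasets such as GSM8K.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AtomicProblem:
    """A short, self-contained problem with one numeric input slot (hypothetical)."""
    template: str                 # text with an {x} placeholder for the input
    solve: Callable[[int], int]   # ground-truth solver, used only to label the chain


def compose_chain(problems: List[AtomicProblem], seed: int) -> Tuple[str, int]:
    """Link problems so that each one consumes the previous answer.

    Returns the chained prompt and the final answer (the only label kept).
    """
    lines = []
    value = seed
    for i, problem in enumerate(problems, start=1):
        # The first step sees a concrete number; later steps only see a
        # reference to the previous answer, so the solver must carry it forward.
        shown = str(value) if i == 1 else f"the answer to step {i - 1}"
        lines.append(f"Step {i}: " + problem.template.format(x=shown))
        value = problem.solve(value)
    lines.append(f"Give only the answer to step {len(problems)}.")
    return "\n".join(lines), value


chain = [
    AtomicProblem("Let a = {x}. A box holds a pencils. How many pencils are in 3 boxes?",
                  lambda a: 3 * a),
    AtomicProblem("Let b = {x}. Sam has b pencils and buys 5 more. How many does he have?",
                  lambda b: b + 5),
    AtomicProblem("Let c = {x}. Mia has c pencils and gives away 7. How many are left?",
                  lambda c: c - 7),
]
prompt, final_answer = compose_chain(chain, seed=4)
print(prompt)
print(final_answer)  # 4 -> 12 -> 17 -> 10
```

Because the final answer is computed by running the known solvers in order, a chain of any length gets a correct label without new human annotation, which is the property the article highlights.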

Reinforcement Learning with a Curriculum

To train models on this synthetically generated data, the team uses reinforcement learning (RL) with “outcome-only rewards.” This means the model receives a reward only if it solves the entire multi-step problem correctly, rather than getting feedback at each intermediate step. This is a more realistic and scalable setup, because dense, step-level supervision is often unavailable.
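
An outcome-only reward can be as simple as checking the final answer. The sketch below is an assumption-laden illustration: the function names are hypothetical, and taking the last number in the completion as the model's answer is a common convention, not something the article specifies.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number that appears in the completion, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None


def outcome_reward(completion: str, gold_answer: float) -> float:
    """Binary reward: 1.0 only if the final answer matches, else 0.0.

    Intermediate steps are never scored, even if they are all correct.
    """
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if abs(float(predicted) - gold_answer) < 1e-6 else 0.0


print(outcome_reward("Step 1 gives 12, step 2 gives 17, so the answer is 10.", 10))
print(outcome_reward("Step 1 gives 12, step 2 gives 17, so the answer is 11.", 10))
```

The second completion gets the first two steps right but still earns nothing, which is exactly why long chains produce sparse rewards.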

Crucially, the training follows a “curriculum” that automatically increases in complexity. The model first learns to solve short chains of problems, then gradually progresses to longer and more complex ones. This staged approach is vital because directly training on very long, hard problems from the start would result in extremely sparse rewards, making learning inefficient. By mastering shorter horizons first, the model builds foundational skills that make learning longer horizons much more feasible.
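
The staged schedule can be sketched as a loop over horizon lengths. This is a hypothetical outline, not the paper's training code: `run_curriculum`, the advancement threshold, and the toy `toy_train` / `toy_eval` callbacks are all assumptions made so the example runs on its own. In practice the callbacks would wrap an RL update and a held-out evaluation.

```python
from typing import Callable, List


def run_curriculum(
    train_one_round: Callable[[int], None],
    evaluate: Callable[[int], float],
    max_horizon: int,
    advance_threshold: float = 0.8,
    max_rounds_per_stage: int = 50,
) -> List[int]:
    """Train on horizons 1, 2, ..., max_horizon in order.

    A stage ends when accuracy at that horizon reaches the threshold or the
    round budget runs out. Returns the number of rounds spent in each stage.
    """
    rounds_per_stage = []
    for horizon in range(1, max_horizon + 1):
        rounds = 0
        while rounds < max_rounds_per_stage and evaluate(horizon) < advance_threshold:
            train_one_round(horizon)
            rounds += 1
        rounds_per_stage.append(rounds)
    return rounds_per_stage


# Toy stand-ins: a single "skill level" that grows with every training round,
# and an accuracy that is harder to reach at longer horizons.
skill = {"level": 0}


def toy_train(horizon: int) -> None:
    skill["level"] += 1


def toy_eval(horizon: int) -> float:
    return min(1.0, skill["level"] / (10 * horizon))


print(run_curriculum(toy_train, toy_eval, max_horizon=3))  # [8, 8, 8]
```

In the toy run, skill acquired at shorter horizons carries over, so each later stage starts partway to its threshold instead of from zero.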

Remarkable Generalization and New Capabilities

Training LLMs on these composed 6th-grade math problems significantly boosted their accuracy on much harder, competition-level benchmarks. For instance, accuracy on benchmarks like GSM-Symbolic, MATH-500, and AIME improved by up to 2.06 times. These are tasks that implicitly require long-horizon reasoning, even if they don’t explicitly state the number of steps.

Beyond improving existing skills, the research provides evidence that this curriculum-based RL training teaches LLMs genuinely new reasoning capabilities. Previous findings suggested that RL mainly refines abilities the base model already has. In this work, the improvements remain even at high sampling budgets (large k in pass@k), which the authors take as evidence that the models learn novel reasoning paths.
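
For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled answers is correct. A widely used way to estimate it from n samples with c correct is the unbiased estimator introduced with the HumanEval benchmark. Whether the paper uses this exact estimator is an assumption here, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no correct answer.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=4, c=1, k=2))  # 1 - 3/6 = 0.5
```

At large k, a base model gets many chances to stumble on a skill it already has, so a gap that persists there is harder to explain as mere refinement.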

The method also showed transferability to long-context benchmarks like LongBench-v2 and Hash-hop, which involve understanding and reasoning over very large inputs or tracing variables across shuffled chains. This suggests that the state-tracking and context management skills learned during long-horizon math training are broadly applicable.

Theoretical Foundations and Cost Efficiency

The paper also includes a theoretical analysis, showing that curriculum RL with outcome-only rewards offers an exponential improvement in sample complexity compared to training directly on full-horizon problems. This means it requires far fewer training examples to achieve the same results, making the approach highly efficient.
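
A back-of-the-envelope calculation gives some intuition for where an exponential gap can come from. This is an illustration under simplifying assumptions, not the paper's theorem: suppose each step of a chain succeeds independently with probability p, and the reward is given only when all H steps are correct.

```latex
\[
\Pr[\text{reward on a length-}H\text{ chain}] = p^{H},
\qquad
\mathbb{E}[\text{rollouts until the first reward}] = \frac{1}{p^{H}} .
\]
% Idealized curriculum (an assumption): at stage h the first h-1 steps are
% already mastered and only the newest step succeeds with probability p,
% so each stage waits about 1/p rollouts for its first reward.
\[
\sum_{h=1}^{H} \frac{1}{p} = \frac{H}{p} .
\]
% Example: p = 0.5, H = 10 gives 1/p^{H} = 1024 versus H/p = 20.
```

Under these assumptions the wait for a first reward grows exponentially in H when training directly on full-length chains, but only linearly when the chain is learned one stage at a time.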

Furthermore, the researchers explored how to design cost-efficient curricula. They found that even with training data skewed towards more abundant short-horizon examples (which are cheaper to obtain), high long-horizon performance can still be achieved by allocating more computational resources to training. This trade-off is crucial for practical, real-world applications where long-horizon data is scarce and expensive.

Future Directions

This work introduces an efficient and scalable path towards improving LLMs’ long-horizon reasoning. Future extensions could involve incorporating diverse “atomic skills” beyond math problems and developing more general chaining methods that go beyond simple serial dependencies, potentially using more complex computational graphs.
