TLDR: ReCOR is a new reinforcement learning framework that teaches language models to generate text in an adaptive, data-dependent order, rather than a fixed or random one. This allows models to solve complex reasoning and planning problems like Sudoku and arithmetic more effectively by tackling easier parts first, similar to how humans approach such tasks. It achieves superior performance without needing manual annotations for the correct order, by learning to estimate the ‘hardness’ of predicting each token and optimizing its generation sequence during both training and inference.
Modern language models, including the widely used causal language models and newer discrete diffusion models, have made incredible strides in generating diverse and useful content. From writing code to acting as intelligent agents, their capabilities are vast. However, these models typically operate by generating text in a fixed, left-to-right sequence, or sometimes in a random order. This approach, while effective for many tasks, hits a wall when faced with complex reasoning and planning problems.
Imagine solving a Sudoku puzzle. Do you fill in the cells strictly from left to right, even if the first few cells are incredibly difficult to deduce? Humans rarely do. Instead, we instinctively look for the easiest cells to fill first, using those initial insights to progressively tackle the more challenging parts. This adaptive, flexible approach is precisely what current language models struggle with, as their rigid generation order can lead them into computationally intractable situations.
A new research paper introduces a novel framework called Reinforced Context Order Recovery (ReCOR) that aims to bridge this gap. ReCOR is a reinforcement-learning-based system designed to teach language models to determine the optimal token generation order adaptively, without needing any explicit annotations or human-provided guidance on the correct sequence.
How ReCOR Works
At its core, ReCOR addresses the problem of ‘token hardness.’ Some tokens (or parts of a solution) are much easier to predict from the current context than others. ReCOR formalizes this intuition with ‘predictive V-information,’ a measure of how much more predictable a token becomes once additional context is available. The goal is to choose a generation order that maximizes this cumulative ‘easy-to-predict’ information across the entire sequence.
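To make this concrete, here is a minimal, self-contained sketch (our illustration, not the paper's code) in which a token's hardness is approximated by its negative log-probability under the current context. The toy confidence function and all of its numbers are invented for exposition:

```python
import math

# Hypothetical stand-in for a token prediction model's confidence:
# p(true token at pos | currently revealed context). The numbers are
# invented; position 2 plays the role of an "obvious" cell, and every
# position gets easier as more context is revealed.
def toy_confidence(revealed, pos):
    n_visible = sum(t is not None for t in revealed)
    base = 0.9 if pos == 2 else 0.3
    return min(0.99, base + 0.1 * n_visible)

def hardness(revealed, pos):
    # Hardness = negative log-likelihood under the current context,
    # a practical proxy for (low) predictive V-information.
    return -math.log(toy_confidence(revealed, pos))

revealed = [None] * 4  # four positions, all still masked
order = []
while None in revealed:
    open_positions = [i for i, t in enumerate(revealed) if t is None]
    # Easiest-first: generate the position the model is most sure about.
    pos = min(open_positions, key=lambda i: hardness(revealed, i))
    revealed[pos] = f"tok{pos}"  # decode that token
    order.append(pos)

print("recovered order:", order)  # -> [2, 0, 1, 3]: easiest cell first
```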
To achieve this, ReCOR frames the task of finding the best generation order as a ‘decision-making problem,’ similar to how an agent learns in a game. It uses reinforcement learning (RL) techniques to train a ‘policy’ that adaptively selects which token to generate next. Crucially, ReCOR doesn’t just adapt during the final generation phase; it learns and follows this adaptive order during its training process as well. This ensures that the model not only becomes flexible during inference but also benefits from learning on more tractable and informative token prediction tasks during training.
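Framed as a decision process, the state is the partially revealed sequence, an action picks which open position to fill next, and the reward is how predictable the chosen token turned out to be. The sketch below (reusing the same invented stand-in for the token model's confidence) shows how different orders earn different episode returns, which is the quantity the RL policy is trained to maximize:

```python
import math

# State = partially revealed sequence; action = which open position to
# fill; reward = log p(true token | context). Confidences are invented.
class OrderEnv:
    def __init__(self, target):
        self.target = target                     # ground-truth tokens
        self.revealed = [None] * len(target)

    def step(self, pos):
        n_visible = sum(t is not None for t in self.revealed)
        base = 0.9 if pos == 2 else 0.3          # position 2 is "obvious"
        p = min(0.99, base + 0.1 * n_visible)
        self.revealed[pos] = self.target[pos]
        return math.log(p)                       # self-supervised reward

def episode_return(order):
    env = OrderEnv(["3", "1", "4", "1"])
    return sum(env.step(pos) for pos in order)

print("easiest-first [2,0,1,3]:", round(episode_return([2, 0, 1, 3]), 3))
print("left-to-right [0,1,2,3]:", round(episode_return([0, 1, 2, 3]), 3))
# The easiest-first order earns a strictly higher episode return.
```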
The system works by jointly optimizing two components: a ‘token prediction model’ that actually generates the text, and an ‘order prediction policy’ that decides the sequence. The token prediction model provides ‘self-supervision’ (like a reward signal) to the order prediction policy, guiding it to choose sequences that lead to easier and more accurate token predictions.
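Putting the two together, here is a runnable toy version of such a joint loop (our construction under stated assumptions, not the paper's algorithm). The data are sequences of the form (a, b, (a + b) mod 10), so the third token is hard with no context but trivial once the first two are known; a REINFORCE-trained order policy should learn to save it for last. All module shapes and hyperparameters are invented:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, L = 10, 3                                   # vocab size, sequence length
MASK = V                                       # index of the "masked" symbol

token_model = nn.Sequential(nn.Linear(L * (V + 1), 64), nn.ReLU(),
                            nn.Linear(64, L * V))
order_policy = nn.Sequential(nn.Linear(L, 32), nn.ReLU(), nn.Linear(32, L))
opt_tok = torch.optim.Adam(token_model.parameters(), lr=1e-3)
opt_ord = torch.optim.Adam(order_policy.parameters(), lr=1e-2)

def encode(visible):
    # One-hot encode the partially revealed sequence (MASK = hidden slot).
    return F.one_hot(visible, V + 1).float().flatten()

for episode in range(3000):
    a, b = torch.randint(0, V, (2,))
    true = torch.stack([a, b, (a + b) % V])
    visible = torch.full((L,), MASK)           # everything starts masked
    filled = torch.zeros(L)
    logps, rewards = [], []
    for _ in range(L):
        state = filled.clone()                 # snapshot of the mask state
        logits = order_policy(state).masked_fill(state.bool(), -1e9)
        dist = torch.distributions.Categorical(logits=logits)
        pos = dist.sample()                    # action: which position next
        tok_logits = token_model(encode(visible)).view(L, V)[pos]
        ce = F.cross_entropy(tok_logits.unsqueeze(0), true[pos].unsqueeze(0))
        opt_tok.zero_grad(); ce.backward(); opt_tok.step()
        logps.append(dist.log_prob(pos))
        rewards.append(-ce.detach())           # easy token => high reward
        visible[pos], filled[pos] = true[pos], 1.0
    # REINFORCE with return-to-go: credit an order choice with all the
    # prediction ease it unlocks later in the episode.
    returns = torch.flip(torch.cumsum(torch.flip(torch.stack(rewards), [0]), 0), [0])
    loss = -(torch.stack(logps) * returns).sum()
    opt_ord.zero_grad(); loss.backward(); opt_ord.step()

with torch.no_grad():
    print(torch.softmax(order_policy(torch.zeros(L)), 0))
# After training, picking position 2 first should have become unlikely.
```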
Impressive Results Across Challenging Tasks
The researchers put ReCOR to the test on several challenging reasoning and planning datasets, including arithmetic problems and classic logic puzzles like Sudoku and Zebra. The results were highly encouraging.
For arithmetic tasks, where traditional models often struggle due to reverse dependencies (like carry digits in multiplication), ReCOR demonstrated its ability to automatically recover the correct generation order without any manual data preprocessing. It significantly outperformed standard causal language models and even adaptive masked diffusion models, which are state-of-the-art in adaptive inference.
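To see why a fixed left-to-right order struggles here: in multi-digit addition (and the partial sums inside multiplication), every output digit depends on a carry coming from the digit to its right, so the tractable generation order is the reverse of reading order. A generic illustration of this dependency, not the paper's experiment:

```python
# Each output digit of a sum depends on a carry from the right, so the
# natural "easiest-first" order is right-to-left, i.e. the reverse of
# how the answer is written. Assumes equal-length operands for brevity.
def add_digits(a_digits, b_digits):
    result, carry = [], 0
    for a, b in zip(reversed(a_digits), reversed(b_digits)):
        carry, digit = divmod(a + b + carry, 10)
        result.append(digit)          # least-significant digit first
    if carry:
        result.append(carry)
    return result[::-1]               # flip back into reading order

print(add_digits([4, 5, 7], [3, 6, 8]))  # 457 + 368 -> [8, 2, 5]
```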
On the Sudoku and Zebra puzzles, which demand highly adaptive, data-dependent reasoning, ReCOR truly shone. It not only outperformed all baseline approaches but also surpassed ‘oracle’ models that were supervised with the ground-truth (perfect) generation order. This suggests that ReCOR’s self-supervised approach to estimating token hardness provides a richer and more effective training signal than simply knowing the ‘correct’ next step.
The paper also highlights a key difference between ReCOR and other adaptive methods: the necessity of adaptive orders during *both* training and inference. Many existing adaptive methods only apply their strategies during inference, but ReCOR’s experiments show that training with a flexible order is vital for handling complex dependencies and avoiding ‘intractable sub-problems’ that arise from random masking during training.
Furthermore, ReCOR’s design scales: its performance improves as more computational resources are devoted to both training and inference.
In conclusion, ReCOR represents a significant step forward in enabling language models to tackle complex reasoning and planning problems with human-like adaptability. By learning to determine the optimal generation order from raw text data, it opens new avenues for more intelligent and efficient AI systems. For more technical details, you can refer to the full research paper: Reinforced Context Order Recovery for Adaptive Reasoning and Planning.


