TLDR: A new research paper reports a counterintuitive finding: Large Language Models (LLMs) generalize better on reasoning tasks when trained on systematically inefficient, longer ‘chain-of-thought’ traces than on globally optimal, shorter ones. The authors’ explanation is that long, coherent, and locally incremental steps make the training signal easier to optimize, which shows up as higher model confidence in next-token prediction.
Large Language Models (LLMs) have become notably capable at complex reasoning and multi-step problem solving. A key insight is that letting these models reason step by step, much as humans work through a problem, significantly improves their performance. This process is often referred to as Chain-of-Thought (CoT) reasoning.
A new research paper, titled “On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study,” examines how LLMs learn to reason. Authored by Riccardo Alberghi, Elizaveta Demyanenko, Luca Saglietti, and Luca Biggio, the study introduces a controlled environment, built on shortest-path tasks in layered graphs, to isolate and examine the factors that influence LLM reasoning.
The researchers trained decoder-only transformers on question-trace-answer triples. They compared models trained on optimal, bottom-up dynamic programming traces with those trained on longer, yet valid, traces that involved backtracking. The surprising discovery was that, even with the same training-token budget, models exposed to these “inefficient” traces generalized better to new, unseen graphs. This benefit wasn’t simply due to the length of the traces; injecting arbitrary redundancy without a coherent structure actually hindered performance.
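To make the contrast between the two kinds of trace concrete, here is a minimal sketch in Python. It is not the paper's code or trace format: the graph encoding (`weights[i][u][v]`), the function names, and the text of each trace step are assumptions made for illustration. The first function solves a small layered shortest-path problem with layer-by-layer dynamic programming, and the second solves the same problem with an exhaustive depth-first search that records every backtracking step.

```python
def dp_trace(weights, source=0):
    """Layer-by-layer dynamic programming; one trace step per node update.

    weights[i][u][v] is the cost of the edge from node u in layer i
    to node v in layer i + 1 (a hypothetical encoding, not the paper's).
    """
    width = len(weights[0])
    # cost[v] = cheapest known cost from the source to node v in the current layer
    cost = [0 if v == source else float("inf") for v in range(width)]
    trace = []
    for layer, w in enumerate(weights):
        new_cost = []
        for v in range(width):
            best_u = min(range(width), key=lambda u: cost[u] + w[u][v])
            best = cost[best_u] + w[best_u][v]
            new_cost.append(best)
            trace.append(f"L{layer + 1} n{v} <- n{best_u} cost {best}")
        cost = new_cost
    return trace, min(cost)


def dfs_trace(weights, source=0):
    """Exhaustive depth-first search that logs every move and every backtrack.

    The answer is the same as dp_trace, but the trace is much longer.
    """
    width = len(weights[0])
    last_layer = len(weights)
    trace = []
    best = [float("inf")]  # boxed so the nested function can update it

    def visit(layer, node, cost_so_far):
        if layer == last_layer:
            best[0] = min(best[0], cost_so_far)
            trace.append(f"reach L{layer} n{node} total {cost_so_far}")
            return
        for v in range(width):
            step = cost_so_far + weights[layer][node][v]
            trace.append(f"L{layer} n{node} -> n{v} cost {step}")
            visit(layer + 1, v, step)
            trace.append(f"back to L{layer} n{node}")

    visit(0, source, 0)
    return trace, best[0]


# A tiny 3-layer graph with 2 nodes per layer (made-up weights).
example = [
    [[1, 5], [2, 1]],
    [[3, 1], [1, 4]],
]
short_trace, short_answer = dp_trace(example)
long_trace, long_answer = dfs_trace(example)
print(short_answer, long_answer, len(short_trace), len(long_trace))
```

Both functions return the same shortest-path cost, but the search-based trace is several times longer and each of its steps depends only on the step just before it. That is the flavor of "longer, yet valid" trace the article describes, under the assumptions stated above.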
Instead, the study found a strong correlation between generalization and the model’s confidence in next-token prediction. This suggests that long, coherent, and locally incremental traces make the training signal easier for the model to optimize. In essence, while a globally optimal strategy might seem ideal for teaching, less efficient but more systematic and predictable reasoning paths align better with the inductive bias of next-token predictive architectures.
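One simple way to picture "confidence in next-token prediction" is the average probability a model assigns to the correct next token across a trace. The sketch below assumes that definition; the article does not state the paper's exact metric, and the function names and toy logits here are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_confidence(logits_per_position, target_ids):
    """Average probability assigned to the correct next token.

    This is an assumed proxy for "confidence", not the paper's stated metric.
    """
    probs = [
        softmax(logits)[target]
        for logits, target in zip(logits_per_position, target_ids)
    ]
    return sum(probs) / len(probs)


# Toy logits over a 3-token vocabulary at two positions (made-up numbers).
uncertain = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
peaked = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
targets = [0, 1]
print(mean_confidence(uncertain, targets), mean_confidence(peaked, targets))
```

Under this proxy, a model whose predictions are sharply peaked on the correct token scores close to 1, while a model that spreads probability evenly over the vocabulary scores close to one divided by the vocabulary size.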
The paper highlights several key contributions. It introduces a controlled reasoning benchmark for studying how LLMs learn algorithms with different intermediate solution traces. It confirms that training transformers to produce intermediate steps significantly improves performance. Crucially, it provides direct evidence that training on inefficient reasoning traces can outperform training on optimal ones, emphasizing that the structure of the reasoning trace, not just its length, is paramount. Finally, the study motivates these findings by showing that next-token prediction confidence is higher for models trained on these longer, systematic, and locally incremental traces.
The findings suggest a paradox: what appears to be the most logical and efficient way to teach an AI (the shortest, globally optimal trace) is not what next-token predictors learn most readily. Instead, they favor systematic, locally incremental, and often longer reasoning paths. This research opens new avenues for understanding and potentially steering the behavior of contemporary AI systems. For more detail, see the full research paper.


