TLDR: A new research paper introduces the “Accuracy Cliff,” a theoretical limit showing that large language models (LLMs) performing single-pass reasoning on multi-hop questions will inevitably see their accuracy collapse when the task’s information demand exceeds their output capacity. To address this, the paper proposes InfoQA, a multi-call framework that decomposes complex questions, manages reasoning steps explicitly, and prunes information, demonstrating significant performance improvements over single-pass methods, especially in complex and long-context scenarios.
Large Language Models (LLMs) have made incredible strides in understanding and generating human-like text. However, when faced with complex questions that require integrating multiple pieces of information from different parts of a long document – a task known as Multi-Hop Question Answering (MHQA) – they often hit a wall. A recent research paper sheds light on this fundamental limitation, proposing a theoretical explanation and a novel framework to overcome it.
The Inherent Limits of Single-Pass Reasoning
The paper, titled “A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA,” highlights a critical bottleneck in how LLMs typically process information. When an LLM attempts to answer a complex multi-hop question in a single go (a ‘single-pass’ reasoning paradigm), it’s constrained by its finite output capacity. This means there’s a limit to how much information it can reliably carry forward and integrate at once. Once the complexity of the task, or its ‘information demand,’ exceeds this capacity, the model’s accuracy doesn’t just gradually decline; it collapses sharply. The researchers term this phenomenon the “Accuracy Cliff.”
This ‘Accuracy Cliff’ is formalized through a Fano-style accuracy upper bound, an information-theoretic principle that defines a theoretical performance ceiling. It reveals that achieving perfect accuracy becomes mathematically impossible when the task’s information demand surpasses the model’s output capacity.
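To make the shape of this bound concrete, here is the classical Fano inequality it builds on, rearranged into an accuracy ceiling. The paper’s exact formulation may differ; the symbols D (information demand in bits) and C (single-pass output capacity in bits) are used here purely for illustration.

```latex
% Fano's inequality (binary-entropy term relaxed to 1 bit):
%   H(X | Y) <= 1 + P_e \log|\mathcal{X}|
% With the answer X uniform over |X| possibilities (so H(X) = log|X| =: D, the
% information demand) and the mutual information I(X;Y) capped by the model's
% single-pass output capacity C, accuracy is bounded by:
\[
  1 - P_e \;\le\; \frac{I(X;Y) + 1}{\log|\mathcal{X}|} \;\le\; \frac{C + 1}{D}.
\]
% Once D exceeds C + 1, perfect accuracy is impossible, and the ceiling
% decays roughly as 1/D: the "Accuracy Cliff."
```

In this reading, the cliff is not a gradual degradation but a hard ceiling that drops below 1 as soon as demand outstrips capacity.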
Why Multi-Hop QA is Particularly Challenging
The paper identifies two main reasons why MHQA tasks are especially vulnerable to this capacity issue:
- Stepwise Capacity Overflow: The information needed to solve a multi-hop question grows super-linearly with the number of reasoning steps (hops) and the length of the context provided, quickly pushing the task beyond the LLM’s single-pass capacity.
- Cross-Step Error Accumulation: In a multi-hop reasoning chain, where each step depends on the correctness of the previous one, even small errors at intermediate stages can amplify catastrophically, leading to an incorrect final answer.
These dual challenges mean that the conventional single-pass approach is fundamentally inadequate for robust multi-hop reasoning.
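A small numerical sketch makes both failure modes tangible. The capacity value, the bits-per-hop figure, and the 95% per-hop accuracy below are illustrative assumptions, not numbers from the paper.

```python
# Illustrative sketch (not the paper's exact model): how a fixed single-pass
# output capacity and per-step error compounding jointly cap multi-hop accuracy.

def accuracy_ceiling(demand_bits: float, capacity_bits: float) -> float:
    """Fano-style ceiling: accuracy <= (C + 1) / D, clipped to [0, 1]."""
    return min(1.0, (capacity_bits + 1) / demand_bits)

def compounded_accuracy(per_step_accuracy: float, hops: int) -> float:
    """Cross-step error accumulation: every hop must succeed for the chain to hold."""
    return per_step_accuracy ** hops

if __name__ == "__main__":
    CAPACITY_BITS = 64.0           # assumed single-pass output capacity (illustrative)
    BITS_PER_HOP_PER_KTOKEN = 8.0  # assumed info demand per hop per 1k context tokens

    for hops in (2, 4, 8, 16):
        context_ktokens = 4 * hops                                  # context grows with hops
        demand = BITS_PER_HOP_PER_KTOKEN * hops * context_ktokens   # super-linear in hops
        ceiling = accuracy_ceiling(demand, CAPACITY_BITS)
        chained = compounded_accuracy(0.95, hops)                   # 95% per-hop accuracy
        print(f"hops={hops:2d}  demand={demand:7.1f} bits  "
              f"ceiling={ceiling:.2f}  chained={chained:.2f}")
```

Even before the chain-of-errors term bites, the Fano-style ceiling alone collapses as hops and context grow, which is the paper’s core claim about single-pass reasoning.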
Introducing InfoQA: A Multi-Call Solution
To address these limitations, the researchers introduce InfoQA, a proof-of-concept multi-call reasoning framework. InfoQA is designed to manage the information load and maintain reasoning integrity by breaking down complex tasks into smaller, more manageable steps. It achieves this through three key components (a conceptual sketch follows the list):
- Capacity-Aware Task Decomposition: Instead of tackling the entire multi-hop question at once, InfoQA decomposes it into a sequence of simpler, single-hop sub-questions. This ensures that each individual step remains within the LLM’s single-pass capacity.
- Dependency-Explicit Workflow: To prevent errors from accumulating and to ensure a coherent reasoning chain, InfoQA explicitly maintains the reasoning state. The findings from one step are embedded directly into the next query, making the reasoning path transparent and controllable.
- Iterative Query Contraction: After each step, InfoQA prunes unnecessary reasoning traces and condenses the query with the latest findings. This prevents the prompt from becoming too long and noisy, keeping the information load manageable throughout the entire process.
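Below is a conceptual sketch of what such a multi-call loop could look like. It is not the authors’ implementation: the prompts, the `call_llm` helper, and the decomposition strategy are hypothetical placeholders meant only to show how the three components fit together.

```python
# Conceptual multi-call loop in the spirit of InfoQA (not the authors' code).
# `call_llm` is a hypothetical stand-in for whatever model API is used.

def call_llm(prompt: str) -> str:
    """Hypothetical single-pass model call; replace with a real API client."""
    raise NotImplementedError

def decompose(question: str) -> list[str]:
    """Capacity-aware task decomposition: ask for single-hop sub-questions."""
    plan = call_llm(
        f"Break this question into single-hop sub-questions, one per line:\n{question}"
    )
    return [line.strip() for line in plan.splitlines() if line.strip()]

def answer_multi_hop(question: str, context: str) -> str:
    findings: list[str] = []
    for sub_q in decompose(question):
        # Dependency-explicit workflow: embed prior findings in the next query.
        state = "\n".join(findings)
        answer = call_llm(
            f"Context:\n{context}\n\nKnown so far:\n{state}\n\nAnswer only: {sub_q}"
        )
        # Iterative query contraction: keep only the condensed finding, not the trace.
        findings.append(f"{sub_q} -> {answer.strip()}")
    # Final step integrates the condensed findings into one answer.
    return call_llm(f"Question: {question}\nFindings:\n" + "\n".join(findings))
```

The design point is that no single call ever has to hold the full reasoning chain: each call sees the original context plus a compact summary of earlier findings, keeping per-call information demand below the capacity threshold.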
Validation and Impact
To rigorously test their theory and framework, the researchers created a new, stringent, and noise-rich synthetic benchmark. Experiments using Qwen3-8B and Qwen3-14B models confirmed that single-pass methods indeed exhibit the predicted “Accuracy Cliff,” with their performance closely matching the theoretical curves. In contrast, InfoQA consistently outperformed all single-pass baselines, demonstrating significant improvements, especially in scenarios with more reasoning hops and longer contexts.
This work provides a crucial theoretical foundation for understanding the limitations of LLMs in complex reasoning tasks and offers a practical path forward. By moving beyond the single-pass paradigm and adopting capacity-aware, multi-call approaches like InfoQA, we can unlock more robust and reliable reasoning capabilities in large language models. For more details, you can read the full paper here.


