TLDR: Researchers found that large language models (LLMs) can internally predict whether their answer to a question will be correct *before* they even start generating it. By analyzing internal “activations” after a question is read, they trained simple tools called linear probes that accurately forecast correctness across various knowledge tasks, outperforming other methods. This internal “correctness signal” also correlates with when models say “I don’t know.” However, this self-assessment struggles with complex mathematical reasoning.
Large language models (LLMs) have become incredibly powerful, but a crucial question remains: do they truly understand when they are right or wrong? A new research paper, titled *No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes*, delves into this fascinating area, exploring whether LLMs can anticipate their own answer accuracy even before generating a single word.
The study, conducted by Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, and Lorenzo Pacchiardi, introduces a novel approach to uncover this internal self-assessment capability. Instead of relying on the model’s output or its stated confidence, the researchers looked directly into the LLM’s ‘mind’ – specifically, its internal activations – immediately after it processes a question but before it begins to formulate an answer.
The ‘No Answer Needed’ Approach
The core idea is to extract these hidden internal states, known as ‘residual stream activations,’ from various layers of the LLM. Once these activations are captured, simple tools called ‘linear probes’ are trained. These probes learn to distinguish between the internal patterns that precede a correct answer and those that precede an incorrect one. Essentially, they identify an ‘in-advance correctness direction’ within the model’s internal representation space. This method is remarkably efficient, requiring only a single pass through the model to extract activations, unlike other techniques that might need the model to generate multiple answers.
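To make the setup concrete, here is a minimal sketch of that probing pipeline. It is an illustration, not the authors’ code: the model name, the layer index, and the `questions`/`labels` variables (questions paired with whether the model later answered them correctly) are assumptions made for the example.

```python
# Minimal sketch: grab the residual-stream activation at the last question token,
# then fit a linear probe on correctness labels. Model, layer, and data are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model, not from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 16  # an intermediate layer, chosen for illustration

def question_activation(question: str) -> torch.Tensor:
    """Return the residual-stream activation at the final token of the question."""
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[LAYER] has shape (batch, seq_len, hidden_dim); take the last token
    return out.hidden_states[LAYER][0, -1, :]

# Assumed to exist: questions (list of str) and labels (1 if the model's
# eventual answer was correct, else 0), collected beforehand.
X = torch.stack([question_activation(q) for q in questions]).float().numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# The probe's decision function scores each question along an
# 'in-advance correctness direction' before any answer is generated.
scores = probe.decision_function(X)
```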
Key Discoveries from Within
The researchers tested their approach on a range of open-source LLMs, from 7 billion to 70 billion parameters, across diverse datasets including general trivia, geographical facts, historical birth years, Olympic medal winners, and mathematical problems. Their findings offer significant insights:
- Strong Predictive Power: The linear probes proved highly effective at predicting answer correctness. They consistently outperformed traditional ‘black-box’ methods that only look at the input question, as well as the model’s own verbalized confidence scores.
- Self-Assessment Emerges Mid-Computation: The ability of an LLM to assess its own correctness isn’t present from the very first layers. Instead, this predictive power gradually builds up and ‘saturates’ in the intermediate layers of the model (see the layer-sweep sketch after this list), suggesting that the understanding of its own capabilities develops as the model processes the question.
- Generalization Across Knowledge Domains: A probe trained on generic trivia questions demonstrated impressive generalization. It could accurately predict correctness on entirely different knowledge-based datasets, indicating that the internal correctness signal is robust and not just specific to the training data.
- The ‘I Don’t Know’ Connection: For models that sometimes respond with ‘I don’t know,’ this behavior strongly correlated with a very low score along the ‘in-advance correctness direction.’ This suggests that the same internal signal that predicts correctness also acts as a measure of the model’s confidence.
- Larger Models, Stronger Signals: The largest model tested, Llama 3.3 70B, exhibited the strongest and most consistent correctness signal, and required fewer training examples to learn a high-quality probe. This hints that more capable models might have a more refined internal sense of their own competence.
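A rough sketch of how that layer-by-layer picture could be measured, assuming activations have already been cached for every layer. The `activations_by_layer` dictionary, the train/test split, and the use of AUROC are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch: train one probe per layer and track held-out AUROC
# to see where the in-advance correctness signal saturates.
# Assumes activations_by_layer[l] is an (N, hidden_dim) array of question
# activations at layer l, and labels marks whether each answer was correct.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

auroc_per_layer = {}
for layer, X in activations_by_layer.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auroc_per_layer[layer] = roc_auc_score(y_te, probe.decision_function(X_te))

# If the paper's finding holds, AUROC should rise through the early layers
# and plateau around intermediate depth.
for layer in sorted(auroc_per_layer):
    print(f"layer {layer:2d}: AUROC = {auroc_per_layer[layer]:.3f}")
```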
Where the Signal Falters
Despite these promising results, the approach revealed a notable limitation: generalization faltered significantly when applied to questions requiring mathematical reasoning, such as those in the GSM8K dataset. This indicates that while LLMs can internally gauge their knowledge-based accuracy, predicting success on tasks requiring deeper, step-by-step reasoning remains a challenge for this method.
Implications for Safer AI
This research significantly advances our understanding of how LLMs internally represent their own capabilities. By providing an early, low-cost indicator of potential failure, this ‘in-advance correctness direction’ could be invaluable for developing safer and more reliable AI systems. Imagine LLMs that could internally flag when they are likely to be wrong, allowing for early stopping, activating fallback mechanisms, or prompting human intervention in high-stakes applications. This work lays a foundation for building AI that not only answers questions but also understands its own competence.
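As a purely hypothetical illustration of that kind of gating, the snippet below reuses the probe and `question_activation` helper from the first sketch; the threshold value and the `generate_answer` function are made up for the example.

```python
# Hypothetical usage sketch: gate a high-stakes answer on the probe's in-advance score.
THRESHOLD = 0.0  # decision_function > 0 means the probe predicts a correct answer

def answer_or_defer(question: str) -> str:
    features = question_activation(question).float().numpy().reshape(1, -1)
    score = probe.decision_function(features)[0]
    if score < THRESHOLD:
        # Fallback path: defer instead of risking a likely-wrong answer.
        return "I'm not confident in my answer; routing to a human reviewer."
    return generate_answer(question)  # hypothetical generation helper
```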


