TL;DR: This research introduces SLAQ, a framework for evaluating whether Large Language Models (LLMs) stay factually consistent when answering the same questions posed as simple (short-form) versus complex (long-form) queries. It finds that LLMs often fail to answer consistently and correctly: accuracy is higher for short queries, and much of the apparent consistency comes from answers that are wrong in both formats. The study also shows that accuracy degrades with a fact's position in a long answer and that errors can cascade. Mechanistic analysis reveals that consistent answers activate similar internal model components, and that these internal similarities can predict factual alignment.
Large Language Models (LLMs) have become powerful tools, used in everything from education to healthcare and general knowledge search. However, their reliability is often questioned due to their tendency to ‘hallucinate’, or generate incorrect information. A recent study examines a particularly curious aspect of this problem: why an LLM can correctly answer a simple factual question, yet fail to reproduce the same correct information when that fact is embedded in a more complex, longer query.
This inconsistency, where models struggle to access factual knowledge reliably across different levels of task complexity, erodes trust in LLMs. While previous research has looked at factual accuracy in both short and long answers separately, it hasn’t directly compared how an LLM performs on the *same* factual question when asked in isolation versus when it’s embedded in a more elaborate request.
Introducing SLAQ: A New Evaluation Framework
To address this gap, researchers introduced the Short-Long Form Alignment for Factual Question Answering (SLAQ) framework. SLAQ is designed to systematically evaluate whether LLMs maintain consistent answers to identical factual questions, regardless of the query’s complexity. The framework works by presenting LLMs with the same fact-seeking questions in two formats:
- Short Queries: These are simple, isolated factual questions.
- Long Queries: These combine five topically related factual questions into a single, more complex information-seeking prompt.
By comparing the LLM’s answers to both types of queries, the researchers could distinguish between a genuine ‘knowledge gap’ (where the model doesn’t know the fact at all) and an ‘answer retrieval failure’ (where the model knows the fact but fails to provide it consistently in a complex context).
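To make the comparison concrete, below is a minimal sketch of the idea. The prompt wording and the category labels are illustrative assumptions based on the description above, not the authors' code, and the actual prompting and answer-grading steps are left out.

```python
# Illustrative sketch of the short-vs-long comparison (not the SLAQ authors' code).
# The prompting and answer-grading steps are assumed to happen elsewhere.

def build_long_query(questions: list[str]) -> str:
    """Combine topically related factual questions into one long-form prompt."""
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return "Please answer all of the following questions:\n" + numbered

def classify_fact(short_correct: bool, long_correct: bool) -> str:
    """Label a single fact by comparing its short-form and long-form outcomes."""
    if short_correct and long_correct:
        return "aligned_correct"        # consistent and right
    if not short_correct and not long_correct:
        return "aligned_incorrect"      # consistent but wrong; likely a knowledge gap
    if short_correct and not long_correct:
        return "retrieval_failure"      # knows the fact, but loses it in the long query
    return "long_only_correct"          # correct only inside the complex query
```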
Key Findings on Factual (Mis)Alignment
The study evaluated 16 different LLMs using 600 queries and uncovered several significant patterns:
- Modest Accuracy: Most LLMs achieved only 30-50% factual accuracy for both short and long queries. Importantly, almost all models showed higher accuracy for short-form questions. This suggests that simply making models larger doesn’t dramatically improve their ability to recall facts.
- High Raw Alignment, But Negative Signed Alignment: The models showed a remarkable 73-78% consistency in whether their answers were correct or incorrect across both query types. However, a deeper look revealed a critical finding: this high alignment mostly stemmed from *systematic failures*. In other words, models were more often consistently *wrong* about the same fact in both short and long queries than they were consistently *correct*. This indicates that LLMs have stable internal ways of processing facts, but these strategies often lead to incorrect information (one plausible way to compute such alignment metrics is sketched after this list).
- Position-Dependent Degradation: When answering long queries, the accuracy of facts declined steadily based on their order in the prompt. Accuracy dropped from 51.3% for the first requested fact to 30.1% for the fifth, a significant 21.2 percentage point decrease. This suggests that managing multiple factual requirements in a long query progressively impairs the model’s ability to retrieve accurate information.
- Momentum Effects: The study also found ‘momentum’ in responses. Following a series of correct answers, the likelihood of subsequent answers being correct increased. Conversely, consecutive errors tended to cascade, reducing the accuracy of following answers. This ‘snowballing’ effect further explains why long-form responses often underperform short-form ones.
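The findings above can be made more concrete with a small sketch of how such metrics might be computed from per-fact results. This is one plausible formulation, assuming each fact is recorded with a short-form correctness flag, a long-form correctness flag, and its position in the long query; the paper's exact metric definitions may differ.

```python
# One plausible formulation of the alignment and position metrics described above
# (an interpretation, not necessarily the paper's exact definitions).
from collections import defaultdict

def alignment_metrics(results: list[dict]) -> dict:
    """results: one dict per fact, e.g. {"short": True, "long": False, "position": 3}."""
    n = len(results)
    # Raw alignment: short- and long-form answers are both correct or both wrong.
    raw = sum(r["short"] == r["long"] for r in results) / n
    # Signed alignment: +1 for aligned-correct, -1 for aligned-incorrect, 0 otherwise.
    signed = sum(1 if (r["short"] and r["long"])
                 else -1 if (not r["short"] and not r["long"])
                 else 0
                 for r in results) / n
    # Long-form accuracy broken down by the fact's position (1..5) in the long query.
    by_pos = defaultdict(list)
    for r in results:
        by_pos[r["position"]].append(r["long"])
    pos_acc = {p: sum(v) / len(v) for p, v in sorted(by_pos.items())}
    return {"raw_alignment": raw, "signed_alignment": signed, "accuracy_by_position": pos_acc}
```

Under this formulation, a negative signed alignment means that consistently wrong facts outnumber consistently correct ones, matching the pattern described above; a similar conditional breakdown (long-form accuracy given whether the preceding fact was answered correctly) would capture the momentum effect.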
The Internal Mechanisms of Misalignment
To understand *why* these inconsistencies occur, the researchers delved into the LLMs’ internal computational mechanisms. They hypothesized that factual alignment (consistent correct answers) would correspond to similar internal processing pathways within the model.
Through a technique called zero-ablation, which zeroes out the output of individual model components to identify the ones critical for generating an answer, they found that facts answered correctly in both short and long formats exhibited significantly higher ‘mechanistic similarity’ than facts answered correctly in only one format. This provides direct evidence that behavioral consistency is mirrored internally: aligned facts are processed by similar mechanisms, while misaligned facts engage more distinct ones.
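The snippet below sketches the core idea of zero-ablation using PyTorch forward hooks. The `components` mapping and the `score_answer` helper are hypothetical stand-ins, and each component's output is assumed to be a plain tensor; the authors' actual ablation and similarity procedure may differ in detail.

```python
# Sketch of zero-ablation: zero out one component's output at a time and measure
# how much the model's score for the correct answer drops. Hypothetical helpers;
# not the authors' implementation.
import torch

def zero_ablation_importance(components, prompt, answer, score_answer):
    """components: {name: nn.Module}; returns an importance score per component."""
    baseline = score_answer(prompt, answer)        # e.g. log-prob of the gold answer
    importance = {}
    for name, module in components.items():
        # Replace this component's output with zeros for the next forward pass.
        handle = module.register_forward_hook(lambda mod, inp, out: torch.zeros_like(out))
        try:
            importance[name] = baseline - score_answer(prompt, answer)
        finally:
            handle.remove()
    return importance
```

Running this separately on the short-form and long-form query for the same fact gives two importance profiles; comparing them, for example by rank correlation, yields the kind of mechanistic similarity the authors relate to behavioral alignment.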
Furthermore, these mechanistic similarity metrics proved to be powerful predictors of factual alignment. A logistic regression classifier built on them could predict factual alignment with up to 78% accuracy, with the Spearman correlation over attention components emerging as the strongest individual predictor, highlighting the role attention mechanisms play in consistent factual recall.
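As a rough illustration of that prediction setup, the sketch below turns two importance profiles into similarity features and feeds them to a logistic regression classifier. The feature choices and the commented-out data-collection step are assumptions; the paper's feature set and evaluation protocol are likely richer.

```python
# Illustrative: predicting factual alignment from mechanistic-similarity features.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def similarity_features(short_imp: dict, long_imp: dict) -> list[float]:
    """Compare per-component importance scores from the short- and long-form runs."""
    keys = sorted(short_imp)
    s = np.array([short_imp[k] for k in keys])
    l = np.array([long_imp[k] for k in keys])
    rho, _ = spearmanr(s, l)                        # rank agreement (Spearman correlation)
    cosine = float(s @ l / (np.linalg.norm(s) * np.linalg.norm(l) + 1e-9))
    return [rho, cosine]

# Hypothetical usage: X holds one feature row per fact, y marks aligned-correct facts.
# X, y = build_dataset(...)                        # data collection not shown
# clf = LogisticRegression(max_iter=1000)
# print(cross_val_score(clf, X, y, cv=5).mean())   # the paper reports up to ~78% accuracy
```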
Conclusion: A Call for More Robust Evaluation
This research highlights that factual consistency across different query complexities is a crucial, yet often overlooked, aspect of LLM reliability. The SLAQ framework and its findings challenge the implicit assumption that good performance on simple factual queries guarantees reliability in more complex knowledge-seeking tasks. The study’s insights into position-dependent degradation, momentum effects, and the underlying mechanistic differences offer valuable directions for future work aimed at improving LLMs’ trustworthiness and consistency. You can read the full research paper here.