TLDR: A new research paper introduces CoPE, a framework to quantify how Large Language Models (LLMs) use contextual versus parametric knowledge. The study uncovers a ‘lost-in-the-later’ phenomenon, where LLMs tend to overlook information appearing later in a given context, revealing a strong positional bias. It also finds that reasoning models and Chain-of-Thought prompting can worsen this effect, leading to lower contextual grounding. However, prompt-based methods, specifically a ‘CK Prompt’, can effectively increase contextual knowledge usage and reduce hallucination, as demonstrated in a summarization case study.
Large Language Models (LLMs) can generate coherent, relevant text by drawing on vast amounts of information. However, a new study sheds light on a critical challenge: how these models prioritize and integrate two different sources of knowledge, the contextual information provided in the input and their own pre-trained, or ‘parametric,’ knowledge.
Researchers have introduced a novel evaluation framework called CoPE (Context and Parametric Evaluation) to systematically measure how LLMs use these two types of knowledge across various models and languages. Their findings reveal a significant positional bias, termed “lost-in-the-later,” where LLMs tend to overlook or deprioritize information that appears later in a given context.
Understanding Contextual and Parametric Knowledge
Contextual Knowledge (CK) refers to information directly entailed by the input text provided to the model. Parametric Knowledge (PK), on the other hand, includes any output that is not directly derived from the context, such as memorized facts or general inferences from the model’s training data. Understanding the balance between CK and PK is crucial, especially in sensitive applications like medicine or law, where relying too heavily on PK can lead to factual inaccuracies or ‘hallucinations’.
The CoPE Framework: A New Lens for Evaluation
The CoPE framework offers a flexible, model- and task-agnostic approach to assess this balance. It works by breaking down both the input context and the model’s response into ‘atomic sentences’ – minimal, standalone factual propositions. Using a natural language inference (NLI) approach, CoPE then classifies each atomic sentence in the response as either CK or PK. It also measures ‘Context Recall distribution’ to see how well models recall information from different segments (early, middle, late) of the input.
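To make the CK/PK labelling step concrete, here is a minimal Python sketch. It is not the paper's implementation: the names `toy_entails`, `label_response`, and `ck_score` are hypothetical, and the word-overlap check is only a stand-in for a real NLI model, which is assumed to answer the question "is this response sentence entailed by the context?".

```python
from typing import Callable, List


def toy_entails(context_sentences: List[str], hypothesis: str) -> bool:
    """Stand-in for an NLI model (assumption, not the paper's method).

    Treats a response sentence as entailed if all of its words appear
    in at least one single context sentence.
    """
    hyp_words = set(hypothesis.lower().rstrip(".").split())
    return any(
        hyp_words <= set(premise.lower().rstrip(".").split())
        for premise in context_sentences
    )


def label_response(
    context_sentences: List[str],
    response_sentences: List[str],
    entails: Callable[[List[str], str], bool] = toy_entails,
) -> List[str]:
    """Label each response atomic sentence as CK (entailed) or PK (not entailed)."""
    return [
        "CK" if entails(context_sentences, sentence) else "PK"
        for sentence in response_sentences
    ]


def ck_score(
    context_sentences: List[str],
    response_sentences: List[str],
    entails: Callable[[List[str], str], bool] = toy_entails,
) -> float:
    """Fraction of response atomic sentences labelled CK (0.0 for an empty response)."""
    labels = label_response(context_sentences, response_sentences, entails)
    return labels.count("CK") / len(labels) if labels else 0.0


if __name__ == "__main__":
    context = ["Marie Curie was born in Warsaw.", "She won two Nobel Prizes."]
    response = ["Marie Curie was born in Warsaw.", "She discovered polonium."]
    print(label_response(context, response))
    print(ck_score(context, response))
```

In this toy example the first response sentence is grounded in the context and the second is not, so the labels are CK and PK and the CK score is 0.5. Swapping `toy_entails` for a real entailment classifier would leave the rest of the sketch unchanged.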
To facilitate their research, the team created the MultiWikiAtomic dataset, an extension of an existing dataset, now including 15,000 atomic sentences in English, Spanish, and Danish, derived from Wikipedia articles. This multilingual aspect allowed for a broader analysis of LLM behavior.
Key Discoveries: The “Lost-in-the-Later” Effect and More
The experiments, involving six diverse LLMs (including GPT-4o, Gemini 1.5 Pro, and Llama models), yielded several significant insights:
- LLMs do not fully utilize the provided context, with CK usage typically peaking around 70-75% across all models and languages.
- Reasoning models (such as OpenAI’s o3 and Qwen 3 235B) showed persistently lower CK scores, suggesting a potential trade-off between complex reasoning and grounding in context.
- A consistent “lost-in-the-later” effect was observed: models strongly favor information at the beginning of the input, progressively incorporating less from later sections. This bias persists even with relatively short contexts and when the order of sentences is randomized, suggesting an inherent structural bias.
- Parametric knowledge (PK) is more likely to appear towards the end of model responses, especially when less context is available.
- While models generally show reduced contextual grounding when the input contains contradictions, they still ground to the provided context even if it’s factually incorrect, highlighting that CK reflects grounding to input, not real-world truth.
- Crucially, a higher percentage of CK in responses correlates with a reduced likelihood of hallucination, indicating that grounding responses in context improves factual accuracy.
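One way to picture how a “lost-in-the-later” pattern could be measured is to compute recall separately for the early, middle, and late thirds of the context. The Python sketch below is illustrative only: the function names are hypothetical, the equal-thirds split is an assumption, and the exact-match check stands in for the NLI-based comparison a real evaluation would use.

```python
from typing import Callable, Dict, List


def _normalize(sentence: str) -> str:
    """Lowercase and strip whitespace and trailing periods for a crude comparison."""
    return sentence.lower().strip().rstrip(".")


def toy_recalled(context_sentence: str, response_sentences: List[str]) -> bool:
    """Stand-in for an NLI check (assumption): exact match after normalization."""
    return _normalize(context_sentence) in {_normalize(r) for r in response_sentences}


def recall_by_segment(
    context_sentences: List[str],
    response_sentences: List[str],
    recalled: Callable[[str, List[str]], bool] = toy_recalled,
) -> Dict[str, float]:
    """Share of context sentences recalled in each third of the context."""
    n = len(context_sentences)
    # Assumed segmentation: three roughly equal, contiguous slices.
    bounds = {
        "early": (0, n // 3),
        "middle": (n // 3, 2 * n // 3),
        "late": (2 * n // 3, n),
    }
    result: Dict[str, float] = {}
    for name, (start, end) in bounds.items():
        segment = context_sentences[start:end]
        hits = sum(recalled(s, response_sentences) for s in segment)
        result[name] = hits / len(segment) if segment else 0.0
    return result


if __name__ == "__main__":
    context = ["Fact A1.", "Fact A2.", "Fact B1.", "Fact B2.", "Fact C1.", "Fact C2."]
    response = ["Fact A1.", "Fact A2.", "Fact B1."]
    print(recall_by_segment(context, response))
```

For this made-up input, recall is 1.0 for the early third, 0.5 for the middle, and 0.0 for the late third, which is the declining shape the effect describes. The numbers are illustrative and are not results from the study.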
Prompting for Better Grounding
Based on these findings, the researchers developed simple prompting strategies to encourage LLMs to better leverage input context. A ‘CK Prompt’, which combines strict adherence to context with instructions for balanced utilization, proved most effective. This prompt significantly increased CK scores and helped mitigate the “lost-in-the-later” effect, leading to a more even recall distribution across the context.
Interestingly, Chain-of-Thought (CoT) prompting, often assumed to improve contextual alignment, actually led to lower context recall and shorter responses, exacerbating the “lost-in-the-later” effect. However, a combination of CoT and the CK Prompt showed improved CK usage compared to CoT alone.
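The exact wording of the paper's CK Prompt is not reproduced in this article, so the Python sketch below uses hypothetical instruction text. It only illustrates the two ingredients described above (strict adherence to the context and balanced use of all its parts), plus an optional step-by-step instruction for the combined CoT variant. The function name `build_ck_prompt` and its parameters are assumptions.

```python
def build_ck_prompt(context: str, task: str, use_cot: bool = False) -> str:
    """Assemble a context-grounding prompt (hypothetical wording, not the paper's)."""
    instructions = [
        # Strict adherence to the provided context.
        "Use only the information in the context below.",
        "Do not add facts from memory.",
        # Balanced utilization across the whole context.
        "Draw on the beginning, middle, and end of the context evenly.",
    ]
    if use_cot:
        # Optional Chain-of-Thought instruction for the combined variant.
        instructions.append("Think step by step before giving the final answer.")
    numbered = "\n".join(f"{i}. {line}" for i, line in enumerate(instructions, 1))
    return f"{task}\n\nInstructions:\n{numbered}\n\nContext:\n{context}"


if __name__ == "__main__":
    print(build_ck_prompt("Doc text goes here.", "Summarize the documents."))
```

Keeping the instructions in a list makes it easy to compare variants (grounding only, grounding plus CoT) against the same context when measuring CK scores.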
Real-World Impact: Summarization Case Study
Applying the best-performing CK prompt to multi-document summarization tasks demonstrated its practical value. Summaries generated with the CK prompt showed improved factual grounding and reduced hallucination risks, while maintaining overall quality. This suggests that insights from CoPE can directly inform strategies for building more reliable AI applications.
This research provides a comprehensive framework for understanding how LLMs balance contextual and parametric knowledge, revealing a critical positional bias. The findings underscore the importance of designing prompts that encourage models to fully and evenly utilize provided context, ultimately leading to more accurate and less hallucinatory AI outputs.


