The ‘Lost-in-the-Later’ Effect: A New Look at How LLMs Use Context

TLDR: A new research paper introduces CoPE, a framework to quantify how Large Language Models (LLMs) use contextual versus parametric knowledge. The study uncovers a ‘lost-in-the-later’ phenomenon, where LLMs tend to overlook information appearing later in a given context, revealing a strong positional bias. It also finds that reasoning models and Chain-of-Thought prompting can worsen this effect, leading to lower contextual grounding. However, prompt-based methods, specifically a ‘CK Prompt’, can effectively increase contextual knowledge usage and reduce hallucination, as demonstrated in a summarization case study.

Large Language Models (LLMs) can generate coherent, relevant text by drawing on vast amounts of information. However, a new study sheds light on a critical challenge: how these models prioritize and integrate two sources of knowledge, namely the contextual information provided in the input and their own pre-trained, or ‘parametric,’ knowledge.

Researchers have introduced a novel evaluation framework called CoPE (Context and Parametric Evaluation) to systematically measure how LLMs use these two types of knowledge across various models and languages. Their findings reveal a significant positional bias, termed “lost-in-the-later,” where LLMs tend to overlook or deprioritize information that appears later in a given context.

Understanding Contextual and Parametric Knowledge

Contextual Knowledge (CK) refers to information directly entailed by the input text provided to the model. Parametric Knowledge (PK), on the other hand, includes any output that is not directly derived from the context, such as memorized facts or general inferences from the model’s training data. Understanding the balance between CK and PK is crucial, especially in sensitive applications like medicine or law, where relying too heavily on PK can lead to factual inaccuracies or ‘hallucinations’.

The CoPE Framework: A New Lens for Evaluation

The CoPE framework offers a flexible, model- and task-agnostic approach to assess this balance. It works by breaking down both the input context and the model’s response into ‘atomic sentences’ – minimal, standalone factual propositions. Using a natural language inference (NLI) approach, CoPE then classifies each atomic sentence in the response as either CK or PK. It also measures ‘Context Recall distribution’ to see how well models recall information from different segments (early, middle, late) of the input.
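The labeling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `toy_entails` is a hypothetical token-overlap heuristic standing in for a real NLI model, the function names are invented for this example, and checking each response sentence against individual context sentences (rather than the whole context) is also an assumption.

```python
from typing import Callable, List


def toy_entails(premise: str, hypothesis: str, threshold: float = 0.8) -> bool:
    """Hypothetical stand-in for an NLI model.

    Treats the hypothesis as entailed when most of its tokens also occur in
    the premise. A real setup would call an NLI classifier here instead.
    """
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = set(hypothesis.lower().split())
    if not hypothesis_tokens:
        return False
    overlap = len(hypothesis_tokens & premise_tokens) / len(hypothesis_tokens)
    return overlap >= threshold


def label_response(
    context_atoms: List[str],
    response_atoms: List[str],
    entails: Callable[[str, str], bool] = toy_entails,
) -> List[str]:
    """Label each response atom CK if some context atom entails it, else PK."""
    labels = []
    for atom in response_atoms:
        grounded = any(entails(ctx, atom) for ctx in context_atoms)
        labels.append("CK" if grounded else "PK")
    return labels


def ck_share(labels: List[str]) -> float:
    """Fraction of response atoms labeled as contextual knowledge."""
    return labels.count("CK") / len(labels) if labels else 0.0


context_atoms = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie won the Nobel Prize in Physics in 1903.",
]
response_atoms = [
    "Marie Curie was born in Warsaw.",  # restates the context
    "Marie Curie died in 1934.",        # not supported by the context
]

labels = label_response(context_atoms, response_atoms)
print(labels, ck_share(labels))
```

Passing the entailment check in as a parameter keeps the scoring logic independent of whichever NLI model is actually used.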

To facilitate their research, the team created the MultiWikiAtomic dataset, an extension of an existing dataset, now including 15,000 atomic sentences in English, Spanish, and Danish, derived from Wikipedia articles. This multilingual aspect allowed for a broader analysis of LLM behavior.

Key Discoveries: The “Lost-in-the-Later” Effect and More

The experiments, involving six diverse LLMs (including GPT-4o, Gemini 1.5 Pro, and Llama models), yielded several significant insights:

LLMs do not fully utilize the provided context, with CK usage typically peaking around 70-75% across all models and languages.
Reasoning models (like GPT-o3 and Qwen 3 235B) showed persistently lower CK scores, suggesting a potential trade-off between complex reasoning and grounding in context.
A consistent “lost-in-the-later” effect was observed: models strongly favor information at the beginning of the input, progressively incorporating less from later sections. This bias persists even with relatively short contexts and when the order of sentences is randomized, suggesting an inherent structural bias.
Parametric knowledge (PK) is more likely to appear towards the end of model responses, especially when less context is available.
While models generally show reduced contextual grounding when the input contains contradictions, they still ground to the provided context even if it’s factually incorrect, highlighting that CK reflects grounding to input, not real-world truth.
Crucially, a higher percentage of CK in responses correlates with a reduced likelihood of hallucination, indicating that grounding responses in context improves factual accuracy.
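To make the positional finding concrete, here is a minimal sketch of how recall could be measured separately for the early, middle, and late parts of a context. It rests on assumptions: the context is split into three equal positional segments, and `same_fact` is a hypothetical exact-match check standing in for the NLI comparison; the paper's exact procedure may differ.

```python
from typing import Callable, List


def same_fact(a: str, b: str) -> bool:
    """Hypothetical stand-in for NLI: case-insensitive exact match."""
    return a.strip().lower() == b.strip().lower()


def recall_by_segment(
    context_atoms: List[str],
    response_atoms: List[str],
    matches: Callable[[str, str], bool] = same_fact,
    n_segments: int = 3,
) -> List[float]:
    """Share of context atoms recalled by the response, per positional segment."""
    n = len(context_atoms)
    recalls = []
    for s in range(n_segments):
        # Integer arithmetic splits the context into near-equal slices.
        start = s * n // n_segments
        end = (s + 1) * n // n_segments
        segment = context_atoms[start:end]
        if not segment:
            recalls.append(0.0)
            continue
        hits = sum(
            any(matches(ctx, resp) for resp in response_atoms) for ctx in segment
        )
        recalls.append(hits / len(segment))
    return recalls


context_atoms = [f"Fact number {i}." for i in range(6)]
# A response that covers only the first half of the context.
response_atoms = context_atoms[:3]

early, middle, late = recall_by_segment(context_atoms, response_atoms)
print(early, middle, late)
```

A recall profile that falls from the early segment to the late one, as in this toy input, is the shape the “lost-in-the-later” effect describes.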

Prompting for Better Grounding

Based on these findings, the researchers developed simple prompting strategies to encourage LLMs to better leverage input context. A ‘CK Prompt’, which combines strict adherence to context with instructions for balanced utilization, proved most effective. This prompt significantly increased CK scores and helped mitigate the “lost-in-the-later” effect, leading to a more even recall distribution across the context.
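The article does not quote the CK Prompt itself, so the wording below is purely an illustrative assumption of what combining strict adherence with balanced utilization might look like. `CK_PROMPT_TEMPLATE` and `build_ck_prompt` are hypothetical names, not taken from the paper.

```python
# Illustrative wording only; the paper's actual CK Prompt may differ.
CK_PROMPT_TEMPLATE = (
    "Answer using only the information in the context below.\n"
    "Do not add facts from memory.\n"
    "Draw evenly on the beginning, middle, and end of the context.\n\n"
    "Context:\n{context}\n\n"
    "Task: {task}"
)


def build_ck_prompt(context: str, task: str) -> str:
    """Fill the template with the given context and task description."""
    return CK_PROMPT_TEMPLATE.format(context=context.strip(), task=task.strip())


prompt = build_ck_prompt(
    "Paris is the capital of France.",
    "Summarize the context.",
)
print(prompt)
```

The first two instructions target strict adherence to the context, while the third asks for balanced use of all its parts.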

Interestingly, Chain-of-Thought (CoT) prompting, often assumed to improve contextual alignment, actually led to lower context recall and shorter responses, exacerbating the “lost-in-the-later” effect. However, a combination of CoT and the CK Prompt showed improved CK usage compared to CoT alone.

Real-World Impact: Summarization Case Study

Applying the best-performing CK prompt to multi-document summarization tasks demonstrated its practical value. Summaries generated with the CK prompt showed improved factual grounding and reduced hallucination risks, while maintaining overall quality. This suggests that insights from CoPE can directly inform strategies for building more reliable AI applications.

This research provides a comprehensive framework for understanding how LLMs balance contextual and parametric knowledge, revealing a critical positional bias. The findings underscore the importance of designing prompts that encourage models to use the provided context fully and evenly, ultimately leading to more accurate AI outputs that are less prone to hallucination. Further details are available in the full research paper.
