TLDR: This paper investigates how large language models (LLMs) manage discrepancies between their pre-trained knowledge and contradictory information in user prompts, particularly in code generation. It introduces a framework for constructing and interpreting these ‘knowledge conflicts’ and a novel evaluation method. Experiments with Llama3 models show that larger LLMs encode the concept of knowledge conflicts, which can be detected with up to 80.65% accuracy using probing techniques. The study also demonstrates that activation-level steering can influence LLM responses, achieving up to a 12.6% improvement in steering success, though effectiveness varies with model size, task domain, and steering direction.
Large Language Models (LLMs) are now used for everything from understanding natural language to generating complex code. However, they face a distinctive challenge: what happens when the information learned during training (their ‘parametric knowledge’) clashes with contradictory information supplied in a user’s prompt (the ‘conflicting knowledge’)?
A recent research paper, titled “That’s Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation,” examines this issue. Building on previous work in question answering, the study extends the investigation of these ‘knowledge conflicts’ to code generation. The authors, Jaesung Bae, Cameron Churchwell, Mitchell Hermon, Tsun-An Hsieh, Jocelyn Xu, Yekaterina Yegorova, Mark Hasegawa-Johnson, and Heng Ji from the University of Illinois Urbana-Champaign, propose a new framework and evaluation method designed specifically for code conflict scenarios.
Understanding the Conflict
The core idea is simple: an LLM has a vast amount of information encoded in its parameters from its training. When a user provides a prompt that contains information contradicting this pre-existing knowledge, a conflict arises. For example, if an LLM was trained on an older version of a Python library, and a user’s prompt describes a function that has since been updated or deprecated, the model must decide which information to prioritize.
The researchers developed a domain-agnostic framework to systematically study these conflicts. It involves defining the model’s parametric knowledge (what it would say without any conflicting context), constructing prompts with conflicting information, and then categorizing the model’s response as either aligning with its parametric knowledge, the conflicting knowledge, or something else entirely.
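The three-way categorization described above can be illustrated with a minimal sketch. It assumes simple case-insensitive substring matching for short QA answers; the function name `categorize_response` and the example answers are hypothetical, and the paper's actual evaluation method may differ.

```python
def categorize_response(response, parametric_answer, conflicting_answer):
    """Label a model response as 'parametric', 'conflicting', or 'other'.

    Assumption: an answer "appears" in the response if it occurs as a
    case-insensitive substring. A response that mentions both answers,
    or neither, is labeled 'other'.
    """
    text = response.lower()
    has_parametric = parametric_answer.lower() in text
    has_conflicting = conflicting_answer.lower() in text
    if has_parametric and not has_conflicting:
        return "parametric"
    if has_conflicting and not has_parametric:
        return "conflicting"
    return "other"


# Hypothetical example: the prompt claims the capital of France is Lyon.
print(categorize_response("The capital of France is Paris.", "Paris", "Lyon"))
print(categorize_response("According to the context, it is Lyon.", "Paris", "Lyon"))
print(categorize_response("I cannot answer that.", "Paris", "Lyon"))
```

Substring matching is a deliberately crude choice here; it keeps the sketch short but would need a stricter matcher for free-form answers.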
Experiments Across Domains
To test their framework, the team used two types of tasks: Question Answering (QA) and Code Generation. For QA, they used datasets like “World Capitals” (common knowledge) and “Olympics Winners” (more specific, less common knowledge). For code generation, they utilized the EvalPlus dataset, creating conflicts by simulating function deprecation, operator deprecation, and function replacement scenarios.
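To make the code-generation setup more concrete, here is a rough sketch of how a function-deprecation conflict might be constructed and how a generated solution might be labeled. The prompt wording, the helper names (`build_deprecation_prompt`, `called_names`, `categorize_code`), and the labeling rule are all assumptions for illustration, not the paper's implementation. The sketch also assumes the model's unprompted solution used the "deprecated" function in the first place.

```python
import ast
from typing import Optional, Set


def build_deprecation_prompt(task: str, deprecated: str,
                             replacement: Optional[str] = None) -> str:
    """Prepend a (hypothetical) deprecation notice to a coding task."""
    notice = f"Note: the function `{deprecated}` is deprecated and must not be used."
    if replacement is not None:
        notice += f" Use `{replacement}` instead."
    return f"{notice}\n\n{task}"


def called_names(source: str) -> Set[str]:
    """Collect the names of all functions and methods called in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                names.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)
    return names


def categorize_code(source: str, deprecated: str,
                    replacement: Optional[str] = None) -> str:
    """Label generated code relative to the deprecation notice.

    'parametric'  : still calls the deprecated function (notice ignored)
    'conflicting' : avoids the deprecated function (notice followed)
    'other'       : unparsable code, or both old and new functions used
    """
    try:
        names = called_names(source)
    except SyntaxError:
        return "other"
    uses_old = deprecated in names
    uses_new = replacement is not None and replacement in names
    if uses_old and uses_new:
        return "other"
    if uses_old:
        return "parametric"
    return "conflicting"


prompt = build_deprecation_prompt("Write total(xs) that sums a list.", "sum")
print(prompt)
print(categorize_code("def total(xs):\n    return sum(xs)\n", "sum"))
```

Parsing with `ast` rather than searching the raw text avoids false positives from comments or variable names that merely contain the function name.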
Their experiments involved three Llama3 models of different sizes (1B, 3B, and 8B parameters) to observe how model scale influences conflict resolution.
Key Findings on How LLMs Handle Conflicts
The study revealed several interesting patterns:
- Model Size Matters: Larger LLMs (like the 8B model) tend to rely more on their parametric knowledge, especially when the task information is widely known (e.g., world capitals). Smaller models were more likely to adopt the conflicting information.
- Knowledge Strength: Models showed high resistance to conflicts in well-known domains (like world capitals) but were more flexible with less certain knowledge (like specific Olympic winners). This suggests that the confidence in their stored information plays a significant role.
- Code Generation Nuances: In code generation, all models primarily relied on their parametric knowledge. However, larger models were more likely to generate responses that incorporated the conflicting information. Interestingly, providing a replacement function for a deprecated one sometimes led to worse outcomes, with the 8B model occasionally including both the old and new functions.
Detecting and Steering Conflicts
A significant part of the research focused on whether knowledge conflicts could be detected within the LLM’s internal workings. By using a technique called “probing” – training a simple classifier to analyze the model’s internal representations (specifically, its residual streams) – the researchers found that LLMs do encode the notion of a knowledge conflict in their parameters.
- Detection Accuracy: The ability to distinguish between parametric and conflicting knowledge improved in deeper layers of the models, suggesting that semantic information crucial for this distinction is encoded there.
- Cross-Domain Transfer: Remarkably, the ability to detect conflicts transferred across domains. A probe trained on QA data could detect conflicts in code generation tasks, with the 8B model achieving up to 80.65% accuracy in certain layers. This indicates that a general concept of knowledge conflict, though subtly embedded, exists in larger models.
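The probing idea can be sketched as a small logistic-regression classifier trained on activation vectors. Everything below is a toy under stated assumptions: the "activations" are synthetic Gaussian vectors rather than real residual streams, the labels (1 for conflict, 0 for no conflict) are hypothetical, and the paper's probe architecture and training setup may differ.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_activations(n_per_class, dim, seed):
    """Synthetic stand-in for residual-stream vectors (an assumption).

    Class 0 is centered at -1 and class 1 at +1 in every dimension.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            xs.append([rng.gauss(center, 0.5) for _ in range(dim)])
            ys.append(label)
    return xs, ys


def train_probe(xs, ys, lr=0.1, epochs=200):
    """Fit a linear probe with full-batch gradient descent on log loss."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of vectors whose predicted label matches the true label."""
    correct = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(xs)


# Train on one synthetic set and evaluate on a separately seeded one.
train_x, train_y = make_toy_activations(50, 4, seed=0)
test_x, test_y = make_toy_activations(50, 4, seed=1)
w, b = train_probe(train_x, train_y)
print("held-out probe accuracy:", accuracy(w, b, test_x, test_y))
```

In a real experiment one probe would be trained per layer, which is how layer-wise accuracy trends like those described above can be measured. Evaluating a probe on data from a different domain than its training data corresponds to the cross-domain transfer setting.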
Building on this detectability, the paper explored “activation-level steering” – a method to influence the model’s output by subtly modifying its internal activations. By creating a ‘steering vector’ based on differences in activations when conflicts are present, they could bias the model to favor either its parametric knowledge or the conflicting knowledge from the prompt.
- Steering Success: Steering success was uneven rather than uniformly high. The headline figure is an improvement of up to 12.6% in steering success, reported for the 8B model when transferring from QA to code tasks.
- Task and Knowledge Influence: Steering towards parametric knowledge was often more successful for tasks with prevalent information (like world capitals or Python code). Conversely, for less common knowledge (like Olympic winners), steering towards conflicting knowledge was easier, suggesting that weaker parametric priors make models more amenable to contextual bias.
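A minimal sketch of activation-level steering, assuming the common difference-of-means construction: the steering vector is the mean activation on one set of examples minus the mean on another, and it is added to a hidden state with a scaling factor. The function names, the scaling parameter `alpha`, and the toy vectors are assumptions; the paper's exact recipe, layer choice, and scaling may differ.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(conflicting_acts, parametric_acts):
    """Difference of mean activations between the two response types.

    Assumption: each input is a list of activation vectors captured at
    the same layer, grouped by which knowledge source the model followed.
    """
    mean_conf = mean_vector(conflicting_acts)
    mean_param = mean_vector(parametric_acts)
    return [c - p for c, p in zip(mean_conf, mean_param)]


def steer(hidden, vector, alpha):
    """Shift a hidden state along the steering vector.

    A positive alpha pushes toward the 'conflicting' side; a negative
    alpha pushes toward the 'parametric' side.
    """
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy two-dimensional activations (hypothetical values).
conflicting_acts = [[1.0, 2.0], [3.0, 4.0]]
parametric_acts = [[0.0, 0.0], [2.0, 2.0]]
v = steering_vector(conflicting_acts, parametric_acts)

hidden = [0.0, 0.0]
print(steer(hidden, v, alpha=2.0))   # push toward the conflicting knowledge
print(steer(hidden, v, alpha=-1.0))  # push toward the parametric knowledge
```

The sign of `alpha` selects the steering direction, which matches the observation above that success depends on whether the model is pushed toward its parametric knowledge or toward the prompt.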
Implications for Reliable AI
This research sheds light on how LLMs process and resolve contradictory information. Understanding these mechanisms matters for building more reliable AI systems that can identify, isolate, and navigate knowledge conflicts. The findings suggest that while LLMs often default to their pre-trained knowledge, especially when it is strong, the concept of a conflict is detectable and, to some extent, steerable. Future work will explore more domains, refine predictive methods for conflict resolution, and investigate architectural impacts on these strategies.


