TLDR: A new study reveals that specific characteristics of training data, such as intra-document repetition of facts, small amounts of factual inconsistency, and skewed knowledge distributions, are crucial for large language models to develop robust strategies for arbitrating between their internal (parametric) knowledge and external (in-context) information. These findings suggest that traditional data cleaning methods might inadvertently hinder a model’s ability to intelligently integrate different knowledge sources, highlighting the benefits of ‘imperfect’ data for learning effective knowledge resolution.
Large language models (LLMs) are incredibly powerful, capable of generating human-like text and performing complex tasks. A key part of their intelligence comes from two main types of knowledge: parametric knowledge and in-context knowledge.
Parametric knowledge is what an LLM learns and stores within its internal structure during its extensive pretraining phase. Think of it as the model’s long-term memory, built from the vast amounts of text it has processed. In-context knowledge, on the other hand, is information provided to the model at the time of inference, often through a prompt or retrieved documents. This is like the model’s short-term memory, allowing it to incorporate up-to-date or specific details not present in its core training.
However, a significant challenge arises when these two knowledge sources conflict. For example, if an LLM’s parametric knowledge states one fact, but the in-context information in a prompt presents a contradictory fact, how should the model decide which one to trust? If it blindly accepts external information, it becomes vulnerable to misinformation. If it rigidly sticks to its internal knowledge, it misses out on valuable new data.
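To make this conflict concrete, here is a minimal, hypothetical illustration of the kind of prompt involved. The entity and wording are purely illustrative and are not taken from the study:

```python
# Hypothetical knowledge conflict: the retrieved passage contradicts a fact
# the model almost certainly memorized during pretraining.
parametric_fact = "The Eiffel Tower is located in Paris."   # what the model learned
retrieved_passage = "The Eiffel Tower is located in Rome."  # conflicting in-context evidence

prompt = (
    f"Context: {retrieved_passage}\n"
    "Question: In which city is the Eiffel Tower located?\n"
    "Answer:"
)

# The model must arbitrate: answer "Paris" (trust its parametric knowledge)
# or "Rome" (trust the provided context). Neither choice is always right;
# the context could be a genuine correction or it could be misinformation.
print(prompt)
```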
While much research has focused on how already-trained LLMs handle these conflicts, a recent study delves into a more fundamental question: how do the conditions during a model’s initial training shape its ability to arbitrate between parametric and in-context knowledge? Understanding this is crucial because it can help us design better pretraining strategies, avoiding the waste of significant computational resources on models that develop undesirable knowledge arbitration behaviors.
Investigating Training Conditions
Researchers conducted controlled experiments, training transformer-based language models from scratch on a specially designed dataset of synthetic biographies. This allowed them to precisely manipulate training conditions and observe how models learn to use and reconcile different knowledge sources. They evaluated the models across three scenarios, illustrated with a rough code sketch after the list:
- Parametric Knowledge Utilization: How well the model recalls facts learned during training.
- In-Context Knowledge Utilization: How well the model extracts and uses new information provided in the prompt for entities it hasn’t seen before.
- Knowledge Conflict Resolution: How the model decides between its learned knowledge and conflicting information presented in the prompt.
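As a rough sketch of what these three evaluations might look like in practice, the snippet below builds one prompt per scenario. The biography format, field names, and question template are assumptions made for illustration, not the paper's actual prompts:

```python
# Sketch of the three evaluation settings, assuming a simple biography format.
trained_bio = {"name": "Alice Moreau", "birth_city": "Lyon"}   # entity seen during training
unseen_bio  = {"name": "Rex Alden",    "birth_city": "Tulsa"}  # entity never seen in training

question = "Where was {name} born?"

# 1) Parametric knowledge utilization: no context, entity seen during training.
parametric_prompt = question.format(name=trained_bio["name"])

# 2) In-context knowledge utilization: context about a never-seen entity.
in_context_prompt = (
    f"{unseen_bio['name']} was born in {unseen_bio['birth_city']}. "
    + question.format(name=unseen_bio["name"])
)

# 3) Knowledge conflict resolution: context contradicts the training data (Lyon -> Berlin).
conflict_prompt = (
    f"{trained_bio['name']} was born in Berlin. "
    + question.format(name=trained_bio["name"])
)

for p in (parametric_prompt, in_context_prompt, conflict_prompt):
    print(p)
```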
Key Discoveries from Training Dynamics
The study revealed several fascinating insights into what makes an LLM develop a robust knowledge arbitration strategy:
1. The Power of Repetition: Intra-document repetition, where facts about an entity are mentioned multiple times within the same document, proved critical. This repetition fostered the simultaneous development of both parametric and in-context knowledge capabilities. Interestingly, the ability to use in-context knowledge emerged much earlier in training than the ability to recall parametric knowledge.
2. The Benefit of Small Inconsistencies: Counter-intuitively, a small amount of factual inconsistency or “noise” within a document during training was found to be beneficial. Without any noise, models tended to over-rely on in-context knowledge, even when their parametric knowledge was highly confident. However, introducing even a tiny degree of inconsistency (as little as 1%) encouraged the model to favor its more confident parametric knowledge when conflicts arose. This suggests that a perfectly clean dataset might not always be ideal for learning robust arbitration.
3. The Role of Skewed Knowledge Distribution: Training data often has a skewed distribution, meaning some facts or entities appear much more frequently than others (a “long-tail” of less common knowledge). The research found that this skewed distribution helps preserve the model’s ability to use in-context knowledge for unfamiliar or rare entities. It prevents the model from becoming overly reliant on parametric knowledge for everything, ensuring it can still learn from new, less frequent information.
When these three conditions—intra-document repetition, a small degree of factual inconsistency, and a skewed knowledge distribution—were present together, the models developed the desired arbitration pattern: they confidently relied on their parametric knowledge for well-learned facts but readily adapted to and used in-context knowledge for rare or novel information.
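To see how these three conditions could be realized in a synthetic corpus, here is a minimal sketch of a biography generator. The paraphrase templates, the 1% noise rate, and the Zipf-like entity sampling are illustrative assumptions, not the paper's exact data-generation procedure:

```python
import random

TEMPLATES = [
    "{name} was born in {city}.",
    "The birthplace of {name} is {city}.",
    "{name}'s hometown is {city}.",
]
CITIES = ["Lyon", "Tulsa", "Berlin", "Osaka", "Quito"]

def make_document(name, city, repetitions=3, noise_rate=0.01):
    """State the same fact several times per document (intra-document
    repetition), occasionally corrupting one mention (small inconsistency)."""
    sentences = []
    for template in random.sample(TEMPLATES, k=repetitions):
        stated_city = city
        if random.random() < noise_rate:               # ~1% factual noise
            stated_city = random.choice([c for c in CITIES if c != city])
        sentences.append(template.format(name=name, city=stated_city))
    return " ".join(sentences)

def build_corpus(num_docs=10_000, num_entities=1_000):
    """Skewed (long-tail) entity distribution: entity i is sampled with
    probability proportional to 1/(i+1), so a few entities dominate and
    most appear rarely."""
    entities = [(f"Person_{i}", random.choice(CITIES)) for i in range(num_entities)]
    weights = [1.0 / (i + 1) for i in range(num_entities)]
    corpus = []
    for _ in range(num_docs):
        name, city = random.choices(entities, weights=weights, k=1)[0]
        corpus.append(make_document(name, city))
    return corpus

if __name__ == "__main__":
    print(build_corpus(num_docs=3)[0])
```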
Real-World Validation and Implications
To ensure these findings weren’t limited to the synthetic environment, the researchers validated their results on a real-world open-source LLM, PYTHIA-6.9B. The real-world model exhibited similar training dynamics, confirming that the natural presence of repetition, minor inconsistencies, and skewed distributions in web-scale training data contributes to these robust knowledge arbitration strategies.
These insights have significant implications for how we prepare pretraining data for large language models. Standard data-cleaning practices often involve aggressive deduplication and rebalancing to remove inconsistencies and normalize distributions. However, this research suggests that such practices might inadvertently impair a model’s ability to intelligently integrate and arbitrate between its internal knowledge and new, external information. Instead, embracing the modest inconsistencies and skewed distributions inherent in real-world data could be key to developing more robust and adaptable LLMs for applications like retrieval-augmented generation.
For more details, you can read the full research paper here: Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models.