TLDR: A new study investigates whether Large Language Models (LLMs) truly incorporate external label definitions or primarily rely on their pre-trained knowledge. Through controlled experiments with various LLMs and definition types across general and domain-specific tasks, the research reveals that while explicit definitions can enhance accuracy and explainability, their integration is not always guaranteed or consistent. Models often default to internal representations, especially in general tasks, but benefit more from explicit definitions in specialized domains. The study also highlights a disconnect between improved explanation quality and classification accuracy, suggesting distinct internal processes.
Large Language Models (LLMs) have become incredibly powerful, but a fundamental question remains: do they truly understand and incorporate external instructions, like label definitions, or do they mostly rely on their vast pre-existing knowledge? A recent research paper, “Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions”, dives deep into this question, revealing fascinating insights into how these AI models process information.
The researchers, Seyedali Mohammadi, Bhaskara Hanuma Vedula, Hemank Lamba, Edward Raff, Ponnurangam Kumaraguru, Francis Ferraro, and Manas Gaur, conducted a series of controlled experiments to understand this interplay. They tested various LLMs, including GPT-4, LLaMA-3, Phi-3, and Mistral, across different types of tasks and definition conditions. These conditions ranged from expert-curated definitions to those generated by LLMs themselves, and even intentionally perturbed or swapped definitions to see how models would react.
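To make that setup concrete, here is a minimal Python sketch (not the authors' code) of how such definition conditions might be constructed for a natural language inference prompt. The label names, definitions, prompt template, and the specific perturbation are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of building prompts under different definition conditions.
# Labels, definitions, and the template below are illustrative assumptions.

LABELS = {
    "entailment": "The hypothesis must be true given the premise.",
    "contradiction": "The hypothesis cannot be true given the premise.",
    "neutral": "The hypothesis may or may not be true given the premise.",
}

def build_definition_block(labels: dict[str, str], condition: str) -> str:
    """Return the label-definition block for one experimental condition."""
    defs = dict(labels)
    if condition == "swapped":
        # Reassign each definition to a different label.
        names = list(defs)
        rotated = names[1:] + names[:1]
        defs = {name: labels[other] for name, other in zip(names, rotated)}
    elif condition == "perturbed":
        # Crude stand-in for a corrupted definition: negate each one.
        defs = {name: "It is NOT the case that " + d[0].lower() + d[1:]
                for name, d in defs.items()}
    # "expert" or "llm_generated" conditions use the definitions as given.
    return "\n".join(f"- {name}: {d}" for name, d in defs.items())

def build_prompt(premise: str, hypothesis: str, condition: str) -> str:
    return (
        "Classify the relationship between the premise and hypothesis.\n"
        f"Label definitions:\n{build_definition_block(LABELS, condition)}\n\n"
        f"Premise: {premise}\nHypothesis: {hypothesis}\nAnswer with one label."
    )

print(build_prompt("A dog runs in the park.", "An animal is outside.", "swapped"))
```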
How LLMs Handle Conflicting Definitions
One key area of investigation was how LLMs respond when definitions are intentionally misaligned or incorrect. The study found that models generally perform much better when label definitions are correctly aligned with their intended meaning. When definitions were swapped or corrupted, performance dropped significantly. This suggests that while LLMs have internal knowledge, they are indeed receptive to the explicit instructions provided in the prompt.
Interestingly, the sensitivity to definition quality varied greatly. For instance, LLaMA-3 showed a remarkable increase in performance when moving from incorrect to correct definitions in general language tasks. However, a surprising finding involved GPT-4, which, when faced with highly inconsistent definitions, sometimes chose to abstain from providing a prediction altogether. This “meta-response” suggests a sophisticated ability to detect contradictions, a capability not observed in the other models.
The research also highlighted a difference between general and domain-specific tasks. In general tasks like natural language inference, models sometimes defaulted to their internal representations, whereas domain-specific tasks, such as mental health categorization or hate speech detection, often benefited more significantly from precise, explicit definitions. This implies that for specialized areas where LLMs might have less pre-training exposure, external definitions become even more crucial.
Strategies for Integrating Definitions
Beyond just the quality of definitions, the way they are presented to the model also matters. The study explored four integration strategies: a “vanilla” setting with no explicit definitions, “fixed definitions” (expert-written), “adjusted definitions” (dynamically generated by an LLM for each input), and a combination of “fixed definitions + few-shot examples.”
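As a rough illustration of how these strategies differ, the sketch below shows what each one might add to the prompt for a hate speech classifier. The `call_llm` stub, label names, definitions, and templates are hypothetical placeholders under assumed prompt formats, not the paper's actual setup.

```python
# Sketch of the four integration strategies: vanilla, fixed, adjusted,
# and fixed + few-shot. All strings below are illustrative placeholders.

def call_llm(prompt: str) -> str:
    """Stub for whichever chat-completion API is in use; replace with a real call."""
    raise NotImplementedError

FIXED_DEFINITIONS = (  # illustrative stand-ins for expert-written definitions
    "hateful: the text attacks or demeans a group based on a protected attribute.\n"
    "not_hateful: the text does not attack or demean such a group."
)

FEW_SHOT_EXAMPLES = (  # illustrative few-shot examples
    "Text: <example 1>\nLabel: hateful\n"
    "Text: <example 2>\nLabel: not_hateful"
)

def generate_adjusted_definitions(text: str) -> str:
    """'Adjusted' condition: ask an LLM to tailor the definitions to this input."""
    return call_llm(
        "Write one-sentence definitions of 'hateful' and 'not_hateful' "
        f"suited to classifying this text: {text}"
    )

def assemble_prompt(text: str, strategy: str) -> str:
    parts = ["Classify the following text as hateful or not_hateful."]
    if strategy in ("fixed", "fixed_few_shot"):
        parts.append("Definitions:\n" + FIXED_DEFINITIONS)
    elif strategy == "adjusted":
        parts.append("Definitions:\n" + generate_adjusted_definitions(text))
    if strategy == "fixed_few_shot":
        parts.append("Examples:\n" + FEW_SHOT_EXAMPLES)
    # "vanilla": no definitions or examples are added.
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)

print(assemble_prompt("Some input sentence.", "fixed"))
```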
Counterintuitively, for general tasks like e-SNLI, models sometimes performed best in the definition-free “vanilla” setting. This suggests that for tasks where LLMs have very robust internal representations, explicit definitions can sometimes interfere. However, for domain-specific tasks, definitions generally improved performance, with some models showing dramatic gains. Mistral, for example, showed a tenfold increase in performance for hate speech detection when provided with definitions.
A particularly intriguing discovery was the “explanation-classification disconnect.” The researchers found that while definitions often dramatically improved the quality of the explanations generated by LLMs, this improvement didn’t always translate into higher classification accuracy. This suggests that LLMs might have partially distinct systems for reasoning about concepts (which definitions help) and for making categorical predictions.
Practical Takeaways
The findings offer valuable guidance for anyone working with LLMs. For specialized applications, especially with smaller models like Phi-3 or Mistral, carefully crafted, context-specific definitions can significantly boost performance. Larger, cloud-hosted models like GPT-4, while relying more on internal knowledge, can offer a safety net by refusing to answer when definitions conflict, which is critical for high-stakes scenarios. Even when definitions don't directly improve classification accuracy, they reliably enhance explanation quality, fostering user trust and transparency.
This research underscores that LLMs’ receptivity to external knowledge is not uniform; it varies based on the model’s architecture, the task domain, and the quality and integration strategy of the definitions. Understanding these nuances is key to unlocking the full potential of LLMs in diverse real-world applications.


