TLDR: A comprehensive study evaluated Large Language Model (LLM) performance in detecting hyperpartisan news, fake news, harmful tweets, and political bias by comparing Fine-Tuning (FT) and In-Context Learning (ICL). Across 10 datasets and 5 languages, Fine-Tuning consistently proved more effective, with decoder models excelling in factual tasks and encoder models in linguistic tasks. While ICL generally underperformed, the codebook approach showed modest improvements for certain tasks. The findings emphasize that dedicated task-specific training (FT) is crucial for robust misinformation detection, though ICL offers a convenient way to probe model knowledge without retraining.
The proliferation of fake news, politically biased content, and harmful information across online platforms poses a significant challenge to public discourse and democratic integrity. As large language models (LLMs) become increasingly sophisticated, researchers are exploring their potential to combat this wave of misinformation. A recent comprehensive study delves into how different LLM adaptation methods—specifically In-Context Learning (ICL) versus Fine-Tuning (FT)—perform in detecting various forms of problematic content.
The study, conducted by Michele Joshua Maggini, Dhia Merzougui, Rabiraj Bandyopadhyay, Gaël Dias, Fabrice Maurel, and Pablo Gamallo, aimed to provide a detailed benchmark of LLM performance across different models, usage methods, and languages. Their work spanned 10 datasets and 5 languages (English, Spanish, Portuguese, Arabic, and Bulgarian), covering both binary and multiclass classification tasks. The goal was to understand which strategies are most effective for identifying hyperpartisan news, fake news, harmful tweets, and political bias.
Comparing Fine-Tuning and In-Context Learning
The researchers investigated two primary approaches for adapting LLMs to these detection tasks. Fine-Tuning (FT) involves further training a pre-trained model on specific labeled data for a particular task. This method adjusts the model’s internal parameters to better suit the new task. In contrast, In-Context Learning (ICL) allows models to perform tasks without additional training by providing instructions and examples directly within the prompt. This approach leverages the model’s existing knowledge and reasoning capabilities.
The study tested a variety of ICL strategies: zero-shot prompts (with general or specific instructions), few-shot prompts (using both randomly selected and diversity-optimized examples via Determinantal Point Processes, or DPP), Chain-of-Thought (CoT) prompting, which encourages step-by-step reasoning, and a codebook approach that embeds explicit definitions, rules, and examples in the prompt. For Fine-Tuning, both encoder-only models (like RoBERTa variants) and decoder-only LLMs (such as LlaMA3.1-8B-Instruct, Mistral-Nemo-Instruct-2407, and Qwen2.5-7B-Instruct) were evaluated.
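To make these prompting styles concrete, the sketch below shows what minimal zero-shot, few-shot, and chain-of-thought templates for a hyperpartisan-news task could look like. The wording, labels, and example articles are illustrative assumptions, not the prompts used in the study.

```python
# Minimal sketch of three ICL prompt styles for a binary hyperpartisan task.
# The wording and labels are illustrative, not the paper's exact prompts.

ZERO_SHOT = (
    "Classify the following news article as 'hyperpartisan' or 'neutral'.\n"
    "Article: {article}\n"
    "Label:"
)

FEW_SHOT = (
    "Classify each news article as 'hyperpartisan' or 'neutral'.\n"
    "{examples}\n"  # filled with k labeled demonstrations
    "Article: {article}\n"
    "Label:"
)

CHAIN_OF_THOUGHT = (
    "Classify the following news article as 'hyperpartisan' or 'neutral'.\n"
    "Think step by step: note loaded language, one-sided framing, and appeals "
    "to emotion before giving the final label.\n"
    "Article: {article}\n"
    "Reasoning:"
)

def build_few_shot_examples(demos):
    """Format (text, label) pairs as in-context demonstrations."""
    return "\n".join(f"Article: {t}\nLabel: {l}" for t, l in demos)

demos = [
    ("The radical opposition is destroying our country...", "hyperpartisan"),
    ("The committee approved the budget by a 7-2 vote.", "neutral"),
]
prompt = FEW_SHOT.format(examples=build_few_shot_examples(demos),
                         article="City council debates new zoning rules.")
print(prompt)
```

The same skeleton extends to the other tasks (fake news, harmful tweets, political bias) by swapping the label set and task definition.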
Key Findings: Fine-Tuning Takes the Lead
A central discovery of the research is that Fine-Tuning consistently outperformed In-Context Learning strategies across most settings. This underscores the value of dedicated, task-specific training: even relatively small fine-tuned models beat the largest models used in an ICL setup.
More specifically, the study found that fine-tuned decoder-based models excelled in tasks requiring factual world knowledge, such as fake news detection and political bias identification. For instance, LlaMA3.1-8b achieved a high F1 score for fake news detection, significantly outperforming encoder models. Conversely, encoder-based models, with their bidirectional attention mechanisms, proved more effective for linguistically oriented tasks like harmful tweet detection and hyperpartisan language identification. RoBERTa-large, for example, showed strong performance in these areas.
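As a rough illustration of the fine-tuning route, the sketch below trains an encoder model (roberta-large) as a binary classifier with Hugging Face Transformers. The toy data and hyperparameters are placeholders and do not reflect the study's actual configuration.

```python
# Minimal fine-tuning sketch with Hugging Face Transformers; model choice,
# data, and hyperparameters are illustrative, not the study's exact setup.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

model_name = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy labeled data standing in for a hyperpartisan-news training set.
train = Dataset.from_dict({
    "text": ["The corrupt elites are lying to you again.",
             "The senate passed the bill after a short debate."],
    "label": [1, 0],
})

def tokenize(batch):
    # Tokenize and pad each article to a fixed length for batching.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

train = train.map(tokenize, batched=True)

args = TrainingArguments(output_dir="ft-hyperpartisan",
                         per_device_train_batch_size=8,
                         num_train_epochs=3,
                         learning_rate=2e-5)

Trainer(model=model, args=args, train_dataset=train).train()
```

Fine-tuning a decoder model such as LlaMA3.1-8B-Instruct follows the same pattern but typically adds parameter-efficient methods (e.g., LoRA) to keep memory requirements manageable.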
In-Context Learning: Strengths and Limitations
While ICL generally underperformed FT, the study provided valuable insights into its effectiveness. Among the ICL methods, the codebook approach, which provides explicit definitions, rules, and examples for classification, showed modest improvements, particularly for harmful content and hyperpartisan news detection. This suggests that structured guidance can help LLMs in complex judgment tasks by bridging the gap between abstract concepts and concrete textual indicators.
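A codebook-style prompt might look like the following sketch: a definition, a short list of decision rules, and worked examples placed ahead of the item to classify. The specific rules and examples here are invented for illustration, not taken from the paper's codebooks.

```python
# Illustrative codebook-style prompt: an explicit definition, decision rules,
# and worked examples are embedded in the instruction. The rules below are
# made up for this sketch, not drawn from the paper.
CODEBOOK_PROMPT = """You are annotating tweets for harmfulness.

Definition: a tweet is HARMFUL if it attacks, demeans, or threatens a person
or group, or encourages others to do so.

Rules:
1. Insults aimed at a person or group -> HARMFUL.
2. Strong criticism of ideas or policies without personal attacks -> NOT HARMFUL.
3. Sarcasm counts as harmful only if rule 1 or an explicit threat applies.

Examples:
Tweet: "People like them don't deserve to live here." -> HARMFUL
Tweet: "This tax plan is a disaster for small businesses." -> NOT HARMFUL

Tweet: "{tweet}"
Label:"""

print(CODEBOOK_PROMPT.format(tweet="Anyone who votes for them is subhuman."))
```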
However, zero-shot prompting, even with specific instructions, yielded only marginal gains, indicating that merely elaborating on task definitions is often insufficient for significant performance shifts. Few-shot learning, where models learn from a limited number of examples, showed inconsistent improvements, and simply increasing the number of examples did not always lead to better results. The use of Determinantal Point Processes (DPP) for selecting diverse examples sometimes reduced classification variance but did not consistently boost performance over random selection.
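For the diversity-optimized selection, a simplified greedy approximation of DPP-based sampling over sentence embeddings could look like the sketch below. The embedding model and kernel are assumptions, and the authors' exact DPP implementation may differ.

```python
# Simplified, greedy MAP-style DPP selection of diverse few-shot demonstrations.
# The embedding model and kernel are assumptions, not the paper's exact setup.
import numpy as np
from sentence_transformers import SentenceTransformer

def greedy_dpp(kernel: np.ndarray, k: int) -> list[int]:
    """Greedily pick k indices maximizing the log-determinant of the kernel submatrix."""
    selected: list[int] = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(kernel.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            gain = np.linalg.slogdet(kernel[np.ix_(idx, idx)])[1]
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

pool = ["The election was stolen by shadowy elites.",
        "Officials certified the vote count on Tuesday.",
        "Vaccines contain microchips, insiders claim.",
        "The new bridge opened ahead of schedule."]

emb = SentenceTransformer("all-MiniLM-L6-v2").encode(pool, normalize_embeddings=True)
kernel = emb @ emb.T + 1e-6 * np.eye(len(pool))  # similarity kernel, jittered for stability
print([pool[i] for i in greedy_dpp(kernel, k=2)])
```

Because the determinant rewards sets whose embeddings point in different directions, the selected demonstrations tend to cover distinct topics and styles rather than near-duplicates, which is the intuition behind using DPPs for example selection.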
Chain-of-Thought (CoT) prompting, designed to elicit step-by-step reasoning, was largely suboptimal in these experiments. The researchers suggest this might be due to language representation issues in the models’ training data, especially for underrepresented languages, where novel tokens could confuse the models rather than aid reasoning. Interestingly, models sometimes generated explanations even without explicit CoT prompts, suggesting an inherent explanatory behavior from their pre-training.
Implications for Misinformation Detection
The findings underscore that while LLMs are powerful tools, their optimal application in critical domains like misinformation detection often requires the resource-intensive process of fine-tuning. This approach allows models to develop a deeper, task-specific understanding that general-purpose in-context learning cannot fully replicate. However, ICL remains a valuable and convenient strategy for quickly assessing a model’s capabilities without the need for extensive retraining.
The research also highlights the complexity of political NLP tasks, where nuanced understanding of cultural contexts, linguistic factors, and evolving misinformation tactics is crucial. Future work could explore integrating Retrieval-Augmented Generation (RAG) systems to provide LLMs with up-to-date information, thereby improving factual verification capabilities. For more details, see the full research paper.


