
Unpacking Opinions: How Large Language Models Excel at Detecting Subjectivity in Text

TL;DR: A research paper from CEA-List at CheckThat! 2025 demonstrates that Large Language Models (LLMs) with few-shot prompting can effectively detect subjectivity in text, matching or outperforming fine-tuned smaller models. The study highlights the importance of detailed prompts and shows that LLMs are particularly robust against noisy or inconsistent data, leading to top rankings in multiple languages, including first place in Arabic and Polish. While advanced prompting techniques like multi-agent debates showed promise, the core effectiveness relied on well-crafted standard few-shot prompts and ensemble methods.

In the world of natural language processing, understanding whether a piece of text expresses a personal opinion or a verifiable fact is a fundamental challenge. This task, known as subjectivity detection, is crucial for many applications, from fact-checking and media analysis to content moderation and even legal interpretation. Imagine trying to sort through news articles to separate factual reporting from editorial commentary – that’s where subjectivity detection comes in.

Traditionally, this has been tackled using smaller language models (SLMs) that require extensive training on large, specifically labeled datasets. However, the rise of Large Language Models (LLMs) has opened new avenues. These powerful models, like GPT-4o-mini, are pre-trained on vast amounts of text and can perform many tasks with minimal additional training, often just by being given a few examples or carefully crafted instructions.

A recent study by researchers from Université Paris-Saclay, CEA-List, presented at the CheckThat! 2025 evaluation campaign, explored the effectiveness of LLMs in detecting bias and opinion across multiple languages. Their goal was to see if LLMs, with smart prompting techniques, could truly compete with or even surpass the performance of fine-tuned SLMs, especially when dealing with messy or inconsistent data.

How They Approached the Challenge

The team experimented with several strategies to optimize LLM performance:

  • Prompt Engineering: This involves carefully designing the instructions given to the LLM. They tested simple prompts versus highly detailed ones based on annotation guidelines, and even different ways of framing the output (e.g., asking a yes/no question or using neutral categories like “Category 0” vs. “Category 1”).
  • Few-Shot Learning: Instead of extensive training, LLMs can learn from a small number of examples provided directly within the prompt. The researchers tested different numbers of examples (0, 6, or 12) and various ways of selecting these examples – randomly, based on semantic similarity to the test sentence, or based on semantic dissimilarity. A sketch of how a detailed prompt and these in-context examples combine into a single request follows this list.
  • Multi-Agent LLM Systems: This advanced approach involved setting up multiple LLMs to work together. For instance, one LLM might argue why a sentence is subjective, another why it’s objective, and a third “judge” LLM makes the final decision. This “debate” setup aimed to improve reasoning and robustness; a sketch of it also appears after this list.
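
To make the first two strategies concrete, here is a minimal sketch of how a detailed instruction and a handful of in-context examples might be assembled into one chat request, assuming an OpenAI-style API for GPT-4o-mini. The instruction wording, label names (SUBJ/OBJ), and example sentences are illustrative assumptions, not the paper's exact annotation guidelines.

```python
# A minimal sketch of detailed-prompt, few-shot subjectivity classification
# with an OpenAI-style chat API. The instruction text, label names (SUBJ/OBJ),
# and example sentences are illustrative assumptions, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DETAILED_INSTRUCTIONS = (
    "You are an annotator for subjectivity detection. Label a sentence SUBJ "
    "if it expresses a personal opinion, evaluation, or speculation, and OBJ "
    "if it reports verifiable factual information. Answer with exactly one "
    "label: SUBJ or OBJ."
)

def classify(sentence: str, examples: list[tuple[str, str]]) -> str:
    """Classify one sentence, conditioning on a few labeled examples."""
    messages = [{"role": "system", "content": DETAILED_INSTRUCTIONS}]
    for text, label in examples:  # 6 or 12 examples in the study's settings
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": sentence})
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, temperature=0
    )
    return response.choices[0].message.content.strip()

few_shot = [
    ("The company reported revenue of $3.2 billion.", "OBJ"),
    ("Frankly, the new policy is a disaster.", "SUBJ"),
]
print(classify("The film is a triumph of style over substance.", few_shot))
```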
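
And a rough sketch of the debate setup under the same assumed API: two advocate calls argue opposite labels, and a third call acts as judge. The three role prompts are illustrative simplifications, not the authors' actual prompts.

```python
# A rough sketch of the multi-agent "debate": two advocate calls argue
# opposite labels and a third call acts as judge. The role prompts are
# illustrative simplifications, not the authors' actual prompts.
from openai import OpenAI

client = OpenAI()

def ask(system_prompt: str, user_content: str) -> str:
    """One single-turn call to the assumed chat API."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def debate(sentence: str) -> str:
    pro = ask("Argue briefly why the given sentence is SUBJECTIVE.", sentence)
    con = ask("Argue briefly why the given sentence is OBJECTIVE.", sentence)
    return ask(
        "You are the judge in a debate about subjectivity. Weigh both "
        "arguments and answer with exactly one label: SUBJ or OBJ.",
        f"Sentence: {sentence}\n\nCase for SUBJ: {pro}\n\nCase for OBJ: {con}",
    )

print(debate("The minister's speech was an embarrassing spectacle."))
```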

Key Findings and Surprises

The results were quite insightful. While fine-tuned traditional models performed reasonably well, especially in English, LLMs showed significant promise. A detailed prompt, combined with a few randomly selected examples (6-shot or 12-shot), consistently boosted performance. Surprisingly, random selection of examples often outperformed strategies that tried to pick examples based on semantic similarity or dissimilarity. This suggests that a diverse set of examples, even if randomly chosen, can help LLMs generalize better.
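
For reference, the selection strategies being compared might look something like the sketch below, using sentence embeddings for the similarity-based variants. The embedding model and function names are assumptions for illustration, not taken from the paper.

```python
# A sketch of the three example-selection strategies compared in the study:
# random, most-similar, and most-dissimilar to the test sentence. The
# embedding model and function names are assumptions, not from the paper.
import random
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def select_examples(pool, test_sentence, k=6, strategy="random"):
    """pool: list of (sentence, label) pairs; returns k few-shot examples."""
    if strategy == "random":
        return random.sample(pool, k)
    test_vec = embedder.encode(test_sentence, convert_to_tensor=True)
    pool_vecs = embedder.encode([s for s, _ in pool], convert_to_tensor=True)
    scores = util.cos_sim(test_vec, pool_vecs)[0]
    # "similar" takes the highest-scoring neighbors, "dissimilar" the lowest.
    order = scores.argsort(descending=(strategy == "similar"))
    return [pool[i] for i in order[:k].tolist()]

pool = [
    ("Inflation rose 2.1% in March.", "OBJ"),
    ("Honestly, the sequel was dreadful.", "SUBJ"),
    ("The vote passed 63 to 37.", "OBJ"),
]
print(select_examples(pool, "What a shameless cash grab.", k=2, strategy="similar"))
```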

The “debate” setup, where LLMs argued for different classifications, also yielded strong results, particularly enhancing the detection of subjective content. Ultimately, the best performance was achieved by an ensemble approach, combining predictions from several different LLMs and a traditional model. This highlights the power of combining diverse AI perspectives.
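
In its simplest form, such an ensemble can be a majority vote over the labels that each system predicts; the sketch below shows that pattern, with the tie-breaking behavior (first-counted label wins) as an assumption rather than the paper's method.

```python
# A minimal ensemble: majority vote over labels predicted by several
# systems (e.g., differently prompted LLMs plus a fine-tuned model).
# The tie-breaking behavior (first-counted label wins) is an assumption.
from collections import Counter

def majority_vote(predictions: list[str]) -> str:
    """predictions holds one label per system, e.g. ["SUBJ", "OBJ", "SUBJ"]."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["SUBJ", "OBJ", "SUBJ"]))  # -> SUBJ
```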

A Breakthrough in Arabic

One of the most striking outcomes was the team’s performance in Arabic. They secured first place, outperforming the second-ranked team by a significant margin. This success was particularly notable because the Arabic dataset was found to have annotation inconsistencies – meaning some labels didn’t perfectly align with the guidelines. Unlike traditional models that struggle with such “noisy” data, the LLM-based few-shot approach proved more resilient. This suggests a significant practical benefit of LLMs: their ability to handle imperfect datasets, which is common in real-world scenarios where high-quality, perfectly consistent annotations are hard to come by.

Looking Ahead

The study concludes that LLMs, when used with carefully designed few-shot prompts, offer a powerful and flexible alternative to traditional fine-tuning methods for multilingual subjectivity detection. Their robustness against varying data quality makes them especially valuable for complex NLP tasks where perfect data is scarce. This research paves the way for more adaptable and effective AI systems in understanding the nuances of human language.

For more technical details, you can read the full research paper here.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
