Unpacking LLMs’ Role in Chinese Misinformation: A Deep Dive with the CANDY Benchmark

TLDR: The CANDY benchmark evaluates large language models (LLMs) on fact-checking Chinese misinformation. It reveals their limitations in producing accurate conclusions and explanations, driven largely by factual hallucinations and poor temporal awareness. LLMs alone are unreliable, but the study shows they can significantly improve human fact-checking performance when used as assistive tools.

In an era where misinformation spreads rapidly, especially across vast online communities, understanding the capabilities and limitations of large language models (LLMs) in fact-checking is crucial. A recent research paper, titled “CANDY: Benchmarking LLMs’ Limitations and Assistive Potential in Chinese Misinformation Fact-Checking,” sheds light on this complex issue, particularly within the context of Chinese misinformation.

Authored by Ruiling Guo, Xinwei Yang, Chen Huang, Tong Zhang, and Yong Hu, this study introduces a novel benchmark named CANDY. This benchmark is specifically designed to systematically evaluate how well LLMs can identify and address Chinese misinformation, and to explore their potential as tools to assist human fact-checkers.

The Challenge of Misinformation in China

Misinformation, defined as false or misleading information, is a significant problem globally. China, with its immense internet user base, faces a daily deluge of such content, making manual verification an increasingly daunting and impractical task. LLMs, with their extensive knowledge and ability to generate explanations, have been seen as promising candidates for misinformation detection. However, concerns about their effectiveness, particularly regarding their tendency to “hallucinate” or generate false information, have persisted.

Introducing the CANDY Benchmark and CANDYSET Dataset

To address the need for a comprehensive evaluation, the researchers developed CANDY. Central to this benchmark is CANDYSET, a large-scale, multi-domain dataset comprising approximately 20,000 news instances, including both misinformation and authentic news. What makes CANDYSET unique is its detailed annotations, including around 5,000 manually annotated flawed LLM-generated explanations and 7,000 human study samples. This dataset allows for in-depth evaluation, especially in “contamination-free” scenarios, where models are tested on information they haven’t been trained on, mimicking real-world, rapidly evolving events.
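One simple way to picture a “contamination-free” split is a date filter that keeps only items published after a model’s knowledge cutoff. The sketch below is illustrative only: the `NewsItem` fields, the `contamination_free` helper, and the sample dates are assumptions made for this example, not the schema or procedure used to build CANDYSET.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class NewsItem:
    """Hypothetical record layout; not the actual CANDYSET schema."""
    text: str
    published: date
    is_misinformation: bool


def contamination_free(items, knowledge_cutoff):
    """Keep only items published strictly after the model's knowledge cutoff."""
    return [item for item in items if item.published > knowledge_cutoff]


# Placeholder items for illustration, not real dataset entries.
items = [
    NewsItem("claim A", date(2023, 5, 1), True),
    NewsItem("claim B", date(2024, 9, 15), False),
]

held_out = contamination_free(items, knowledge_cutoff=date(2024, 1, 1))
print([item.text for item in held_out])  # ['claim B']
```

Because each model has its own cutoff, a filter like this would produce a different held-out set per model.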

Three Key Evaluation Tasks

The CANDY benchmark evaluates LLMs across three progressive tasks:

Fact-Checking Conclusion: Assessing an LLM’s ability to accurately classify a statement as factual or misinformation.
Fact-Checking Explanation: Evaluating the reliability and accuracy of the explanations LLMs provide for their fact-checking conclusions.
LLM-Assisted Fact-Checking: Investigating how LLMs can assist humans in fact-checking tasks.
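The first task reduces to binary classification, so it can be scored with plain accuracy. The following is a minimal sketch under that assumption; the function name and the toy labels are hypothetical, and the paper may report additional metrics.

```python
def conclusion_accuracy(predictions, labels):
    """Fraction of claims whose predicted verdict matches the gold label.

    Both arguments are sequences of booleans, where True means
    "this claim is misinformation".
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels)


# Toy values for illustration only, not results from the study.
gold = [True, False, True, False]
predicted = [True, True, True, False]
print(conclusion_accuracy(predicted, gold))  # 0.75
```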

Understanding LLM Failures: A New Taxonomy

To better understand why LLMs fail, the researchers developed a fine-grained taxonomy for categorizing flawed LLM-generated explanations. This taxonomy divides errors into three main dimensions:

Faithfulness Hallucination: When an LLM’s explanation is unfaithful to the input or contains internal logical conflicts. This includes instruction inconsistency, logical inconsistency, and context inconsistency.
Factuality Hallucination: When an LLM provides reasons that contradict real-world facts or are entirely fabricated. This is further broken down into factual fabrication and factual inconsistency.
Reasoning Inadequacy: When an LLM fails to provide high-quality or helpful reasoning, often due to overgeneralized reasoning or under-informativeness.

Applying this taxonomy showed that factual fabrication, in which an LLM invents supporting details, is the most common failure mode.
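The three dimensions and their seven error types map naturally onto a small data structure. The sketch below encodes the taxonomy as Python enums and tallies annotated errors by dimension. The class names, the `tally_by_dimension` helper, and the sample annotations are assumptions for illustration; only the category names come from the taxonomy described above.

```python
from collections import Counter
from enum import Enum


class Dimension(Enum):
    FAITHFULNESS_HALLUCINATION = "faithfulness hallucination"
    FACTUALITY_HALLUCINATION = "factuality hallucination"
    REASONING_INADEQUACY = "reasoning inadequacy"


class ErrorType(Enum):
    # Each member carries a readable label and its parent dimension.
    INSTRUCTION_INCONSISTENCY = ("instruction inconsistency", Dimension.FAITHFULNESS_HALLUCINATION)
    LOGICAL_INCONSISTENCY = ("logical inconsistency", Dimension.FAITHFULNESS_HALLUCINATION)
    CONTEXT_INCONSISTENCY = ("context inconsistency", Dimension.FAITHFULNESS_HALLUCINATION)
    FACTUAL_FABRICATION = ("factual fabrication", Dimension.FACTUALITY_HALLUCINATION)
    FACTUAL_INCONSISTENCY = ("factual inconsistency", Dimension.FACTUALITY_HALLUCINATION)
    OVERGENERALIZED_REASONING = ("overgeneralized reasoning", Dimension.REASONING_INADEQUACY)
    UNDER_INFORMATIVENESS = ("under-informativeness", Dimension.REASONING_INADEQUACY)

    def __init__(self, label, dimension):
        self.label = label
        self.dimension = dimension


def tally_by_dimension(annotations):
    """Count annotated errors per top-level dimension."""
    return Counter(error.dimension for error in annotations)


# Hypothetical annotations for illustration, not data from CANDYSET.
annotations = [
    ErrorType.FACTUAL_FABRICATION,
    ErrorType.FACTUAL_FABRICATION,
    ErrorType.LOGICAL_INCONSISTENCY,
    ErrorType.UNDER_INFORMATIVENESS,
]
counts = tally_by_dimension(annotations)
print(counts.most_common(1)[0][0].value)  # factuality hallucination
```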

Key Findings from the Benchmark

The study benchmarked sixteen LLMs and three large reasoning models (LRMs), revealing several critical insights:

Struggles with Fact-Checking Conclusions: Even with advanced techniques like Chain-of-Thought (CoT) reasoning and few-shot prompting, LLMs struggle to accurately fact-check, especially with new, time-sensitive information (contamination-free scenarios). While larger models like GPT-4o performed better, their accuracy remained moderate. Smaller open-source models often misclassified truthful information as misinformation.
Flawed Explanations are Common: A significant majority (91.2%) of incorrect fact-checking conclusions were linked to flawed LLM-generated explanations. Factuality hallucination was the dominant error, particularly when LLMs lacked direct factual evidence and filled the gap with plausible but unverified details. Interestingly, rephrasing claims as questions significantly reduced factual fabrication errors.
Limited Temporal Awareness: LLMs often struggle with time-sensitive information, failing to recognize outdated knowledge or acknowledge their own knowledge cutoff dates. This leads to factual inconsistencies in their explanations.
Inflexibility with Risk Levels: LLMs showed an imbalance in handling misinformation, often failing to detect high-risk content (like financial scams) while being overly cautious with low-risk topics (like life advice).
Challenges with Linguistic Nuances and Cultural Expertise: LLMs frequently misinterpreted subtle linguistic cues like qualifiers and negations, leading to context inconsistencies. Furthermore, even Chinese-focused LLMs struggled with culturally specific tasks, such as lunar calendar calculations.
LLMs as Assistants, Not Autonomous Decision-Makers: Perhaps the most promising finding is the potential of LLMs to augment human performance. A human study showed that individuals across all educational levels achieved higher fact-checking accuracy and efficiency when assisted by LLMs, especially when combined with web retrieval. This suggests that LLMs are best positioned as intelligent assistants or advisors, rather than independent fact-checkers.
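Two of the findings above describe errors that a single accuracy number hides: authentic news flagged as misinformation, and high-risk misinformation that goes undetected. One way to surface both is to compute a miss rate and a false-alarm rate per topic category. The sketch below assumes a hypothetical record format of `(category, is_misinformation, predicted_misinformation)`; the category names and toy records are illustrative and are not results from the study.

```python
from collections import defaultdict


def error_rates_by_category(records):
    """Return per-category miss and false-alarm rates.

    records: iterable of (category, is_misinformation, predicted_misinformation).
    miss_rate        = misinformation judged authentic / all misinformation
    false_alarm_rate = authentic news judged misinformation / all authentic news
    A rate is None when the category has no items of that class.
    """
    counts = defaultdict(
        lambda: {"misinfo": 0, "missed": 0, "authentic": 0, "false_alarm": 0}
    )
    for category, is_misinfo, predicted_misinfo in records:
        c = counts[category]
        if is_misinfo:
            c["misinfo"] += 1
            c["missed"] += int(not predicted_misinfo)
        else:
            c["authentic"] += 1
            c["false_alarm"] += int(predicted_misinfo)

    rates = {}
    for category, c in counts.items():
        rates[category] = {
            "miss_rate": c["missed"] / c["misinfo"] if c["misinfo"] else None,
            "false_alarm_rate": (
                c["false_alarm"] / c["authentic"] if c["authentic"] else None
            ),
        }
    return rates


# Toy records for illustration only.
records = [
    ("finance", True, False),       # scam judged authentic: a miss
    ("finance", True, True),
    ("finance", False, False),
    ("life advice", False, True),   # sound advice flagged: a false alarm
    ("life advice", False, False),
    ("life advice", True, True),
]
rates = error_rates_by_category(records)
print(rates["finance"]["miss_rate"])             # 0.5
print(rates["life advice"]["false_alarm_rate"])  # 0.5
```

Reporting the two rates separately makes it visible when a model is too lenient in one category and too strict in another.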

Looking Ahead

The CANDY benchmark provides valuable guidance for future research in LLM fact-checking. While current LLMs have significant limitations, particularly in generating accurate and reliable explanations for Chinese misinformation, their potential to enhance human fact-checking capabilities is considerable. Full details are available in the research paper, “CANDY: Benchmarking LLMs’ Limitations and Assistive Potential in Chinese Misinformation Fact-Checking.”
