TLDR: A study evaluated ChatGPT, Gemini, and NotebookLM on journalism tasks using a 300-document corpus. It found 30% of LLM outputs contained hallucinations, with Gemini and ChatGPT having significantly higher rates (40%) than NotebookLM (13%). The errors were primarily “interpretive overconfidence”—models added unsupported analysis, characterized sources without evidence, and transformed opinions into facts, rather than inventing entities or numbers. This reveals a fundamental mismatch with journalistic epistemology, which demands explicit sourcing. The research suggests newsrooms need tools that enforce accurate attribution and verification workflows that scrutinize interpretive claims, not just factual ones.
Large language models (LLMs) are increasingly being adopted in newsrooms for various tasks, but their tendency to generate plausible-sounding yet unsupported information, known as hallucination, poses significant risks to core journalistic principles like sourcing, attribution, and accuracy. A recent study, titled “Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries,” delves into this issue, evaluating how three popular LLM tools—ChatGPT, Gemini, and NotebookLM—perform on reporting-style tasks.
The researchers, Nick Hagar, Wilma Agustianto, and Nicholas Diakopoulos, conducted their evaluation using a comprehensive corpus of 300 documents related to TikTok litigation and policy in the U.S. This diverse collection included news coverage, legal and government materials, and scholarly sources, designed to mimic a realistic journalistic research scenario. They varied prompt specificity, from broad overviews to precise, document-bound questions, and also adjusted the context size provided to the models (10, 100, or 300 documents) to reflect real-world journalistic choices.
Hallucination Rates and Types
The study’s findings reveal a concerning prevalence of hallucinations. Across all model outputs, 30% contained at least one hallucination. However, there was a substantial difference between the tools: Gemini and ChatGPT exhibited hallucination rates of approximately 40%, roughly three times NotebookLM’s 13%. When hallucinations occurred, they tended to cluster, indicating systematic issues rather than isolated errors. A significant portion of these errors (64%) was classified as moderate or alarming, meaning they introduced factual inaccuracies or misleading tangents.
Crucially, the nature of these hallucinations was not what one might typically expect. Rather than inventing entities or numbers, the models displayed what the researchers termed “interpretive overconfidence”: they added unsupported analysis, offered confident characterizations of sources, and turned attributed opinions into general statements, none of it grounded in the original documents. Two main forms of this overconfidence were identified:
- **Editorializing about source type or audience:** Models frequently added confident but baseless metadata about documents, such as claiming an article was “written for a general audience” or describing a filing as “intended for legal practitioners.” These characterizations sounded plausible but lacked textual support.
- **Attribution drift from opinion to statement:** Attributed claims, like a senator’s opinion from a hearing, were transformed into uncontested universal facts. A source’s concerns quoted in a news article became the model’s own assertions, obscuring the original source and making it difficult for journalists to assess credibility.
This pattern highlights a fundamental epistemological mismatch: journalism demands explicit sourcing for every claim, while LLMs generate authoritative-sounding text regardless of evidentiary support.
NotebookLM’s Advantage and Practical Implications
NotebookLM’s significantly lower hallucination rate (13%) suggests that its Retrieval-Augmented Generation (RAG) implementation, which provides explicit citations, grounds outputs better than ChatGPT’s Projects feature or Gemini’s in-context processing. The advantage was most pronounced on narrowly scoped queries that required information from particular documents.
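To make the grounding contrast concrete, here is a minimal sketch of what a citation-enforcing retrieval prompt can look like. This is not the study’s setup or NotebookLM’s implementation; the document IDs, the toy keyword retriever, and the prompt wording are all illustrative assumptions.

```python
# Minimal sketch of citation-enforcing retrieval prompting (illustrative only).
# Document IDs, retriever, and prompt wording are assumptions, not any vendor's code.

corpus = {
    "doc_014_court_filing": "Plaintiffs argue the statute burdens protected speech under the First Amendment.",
    "doc_087_news_article": "Senator Ortega said she was 'deeply concerned' about TikTok's data access practices.",
    "doc_203_policy_brief": "The brief summarizes state-level TikTok restrictions enacted during 2023.",
}

def naive_retrieve(query: str, docs: dict, k: int = 2) -> dict:
    """Toy keyword-overlap retriever; a production system would use embeddings."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        docs.items(),
        key=lambda item: len(q_terms & set(item[1].lower().split())),
        reverse=True,
    )
    return dict(ranked[:k])

def build_grounded_prompt(query: str, passages: dict) -> str:
    """Inject passages with their IDs and demand a citation for every sentence."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages.items())
    return (
        "Answer using ONLY the passages below. Cite the supporting passage after "
        "every sentence as [doc_id]. If the passages do not support a claim, say "
        "so rather than inferring it, and keep opinions attributed to their speakers.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}"
    )

query = "What concerns have lawmakers raised about TikTok's data access?"
prompt = build_grounded_prompt(query, naive_retrieve(query, corpus))
print(prompt)  # pass to whichever model the newsroom uses
```

The design choice that matters here is that every passage carries an identifier the model is instructed to echo, so each claim in the output can in principle be traced back to a source rather than to the model’s own interpretation.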
The study concludes that even NotebookLM’s 13% hallucination rate is concerning for journalists whose credibility hinges on accuracy. ChatGPT and Gemini’s 40% rates make them effectively unusable for unsupervised reporting tasks. The findings underscore that verification must extend beyond mere facts to include interpretation, characterization, and attribution. Existing hallucination taxonomies may need expansion to capture these journalism-specific error types.
For newsrooms, the research offers several practical implications:
- **Tool Selection and Configuration:** Prioritize tools that enforce source attribution, like NotebookLM, over those optimized for fluency or speed. If citations aren’t built in, workflows should require models to include explicit passage markers for every claim.
- **Verification Workflows:** Journalists must rigorously verify interpretive claims. This means checking whether a document actually makes an argument, who specifically made a claim, and whether characterizations of sources are supported by the text (one possible check along these lines is sketched after this list).
- **Training and Awareness:** Newsroom staff need to be trained on common LLM error patterns. Any model output that characterizes documents, summarizes positions, or makes comparative claims should be flagged for heightened scrutiny.
- **System Design Priorities:** Vendors developing journalism tools should prioritize citation infrastructure and mechanisms that reject outputs untraceable to specific sources, maintain speaker attribution, and explicitly distinguish reporting from analysis.
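One way to operationalize these practices is a pre-publication check that refuses to pass any sentence it cannot trace to a known source and flags characterization language for human review. The sketch below is illustrative only; the document IDs, cue phrases, and sentence splitting are assumptions, not the researchers’ tooling or any vendor’s feature.

```python
import re

# Minimal pre-publication check (illustrative assumptions throughout).
known_docs = {"doc_014_court_filing", "doc_087_news_article", "doc_203_policy_brief"}

# Phrases that often signal unsupported characterization of a source or audience.
characterization_cues = ("written for", "intended for", "aimed at", "widely regarded")

def review_output(answer: str) -> list[str]:
    """Flag sentences with missing or unknown citations, or characterization language."""
    issues = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = re.findall(r"\[([^\]]+)\]", sentence)
        if not cited:
            issues.append(f"NO CITATION: {sentence}")
        elif any(doc not in known_docs for doc in cited):
            issues.append(f"UNKNOWN SOURCE: {sentence}")
        if any(cue in sentence.lower() for cue in characterization_cues):
            issues.append(f"CHECK CHARACTERIZATION: {sentence}")
    return issues

model_answer = (
    "Senator Ortega said she was concerned about data access [doc_087_news_article]. "
    "The filing is intended for legal practitioners [doc_014_court_filing]. "
    "TikTok poses a national security threat."
)

for issue in review_output(model_answer):
    print(issue)  # the second and third sentences get flagged for human review
```

A check like this does not verify truth; it only surfaces sentences a human editor must trace back to the documents, which is exactly where the study found interpretive overconfidence creeping in.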
Ultimately, the core challenge is to make LLMs more epistemologically aligned with journalistic practices. Until models can maintain clear provenance chains and distinguish reporting from analysis, they remain tools requiring intensive human supervision rather than trusted partners in the newsroom.