TLDR: A study evaluated ChatGPT, Gemini, and NotebookLM on journalism tasks using a 300-document corpus. It found 30% of LLM outputs contained hallucinations, with Gemini and ChatGPT having significantly higher rates (40%) than NotebookLM (13%). The errors were primarily “interpretive overconfidence”—models added unsupported analysis, characterized sources without evidence, and transformed opinions into facts, rather than inventing entities or numbers. This reveals a fundamental mismatch with journalistic epistemology, which demands explicit sourcing. The research suggests newsrooms need tools that enforce accurate attribution and verification workflows that scrutinize interpretive claims, not just factual ones.
Large language models (LLMs) are increasingly being adopted in newsrooms for various tasks, but their tendency to generate plausible-sounding yet unsupported information, known as hallucination, poses significant risks to core journalistic principles like sourcing, attribution, and accuracy. A recent study, titled “Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries,” delves into this issue, evaluating how three popular LLM tools—ChatGPT, Gemini, and NotebookLM—perform on reporting-style tasks.
The researchers, Nick Hagar, Wilma Agustianto, and Nicholas Diakopoulos, conducted their evaluation using a comprehensive corpus of 300 documents related to TikTok litigation and policy in the U.S. This diverse collection included news coverage, legal and government materials, and scholarly sources, designed to mimic a realistic journalistic research scenario. They varied prompt specificity, from broad overviews to precise, document-bound questions, and also adjusted the context size provided to the models (10, 100, or 300 documents) to reflect real-world journalistic choices.
Hallucination Rates and Types
The study’s findings reveal a concerning prevalence of hallucinations. Across all model outputs, 30% contained at least one hallucination. However, there was a substantial difference between the tools: Gemini and ChatGPT exhibited hallucination rates of approximately 40%, roughly three times NotebookLM’s 13%. When hallucinations occurred, they tended to cluster, indicating systematic issues rather than isolated errors. A significant portion of these errors (64%) was classified as moderate or alarming, meaning they introduced factual inaccuracies or misleading tangents.
Crucially, the nature of these hallucinations was not what one might typically expect. Rather than inventing entities or numbers, the models displayed what the researchers termed “interpretive overconfidence”: they added unsupported analysis, offered confident characterizations of sources, and turned attributed opinions into general statements, none of it grounded in the original documents. Two main forms of this overconfidence were identified:
- **Editorializing about source type or audience:** Models frequently added confident but baseless metadata about documents, such as claiming an article was “written for a general audience” or describing a filing as “intended for legal practitioners.” These characterizations sounded plausible but lacked textual support.
- **Attribution drift from opinion to statement:** Attributed claims, like a senator’s opinion from a hearing, were transformed into uncontested universal facts. A source’s concerns quoted in a news article became the model’s own assertions, obscuring the original source and making it difficult for journalists to assess credibility.
This pattern highlights a fundamental epistemological mismatch: journalism demands explicit sourcing for every claim, while LLMs generate authoritative-sounding text regardless of evidentiary support.
NotebookLM’s Advantage and Practical Implications
NotebookLM’s significantly lower hallucination rate (13%) suggests that its Retrieval-Augmented Generation (RAG) implementation, which provides explicit citations, grounds outputs better than ChatGPT’s Projects feature or Gemini’s in-context processing. The advantage was most pronounced on narrowly scoped queries that required information from particular documents.
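To make the grounding contrast concrete, here is a minimal sketch of what a citation-enforcing retrieval prompt can look like. This is not the study’s setup or NotebookLM’s implementation; the document IDs, the toy keyword retriever, and the prompt wording are all illustrative assumptions.

```python
# Minimal sketch of citation-enforcing retrieval prompting (illustrative only).
# Document IDs, retriever, and prompt wording are assumptions, not any vendor's code.

corpus = {
    "doc_014_court_filing": "Plaintiffs argue the statute burdens protected speech under the First Amendment.",
    "doc_087_news_article": "Senator Ortega said she was 'deeply concerned' about TikTok's data access practices.",
    "doc_203_policy_brief": "The brief summarizes state-level TikTok restrictions enacted during 2023.",
}

def naive_retrieve(query: str, docs: dict, k: int = 2) -> dict:
    """Toy keyword-overlap retriever; a production system would use embeddings."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        docs.items(),
        key=lambda item: len(q_terms & set(item[1].lower().split())),
        reverse=True,
    )
    return dict(ranked[:k])

def build_grounded_prompt(query: str, passages: dict) -> str:
    """Inject passages with their IDs and demand a citation for every sentence."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages.items())
    return (
        "Answer using ONLY the passages below. Cite the supporting passage after "
        "every sentence as [doc_id]. If the passages do not support a claim, say "
        "so rather than inferring it, and keep opinions attributed to their speakers.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}"
    )

query = "What concerns have lawmakers raised about TikTok's data access?"
prompt = build_grounded_prompt(query, naive_retrieve(query, corpus))
print(prompt)  # pass to whichever model the newsroom uses
```

The design choice that matters here is that every passage carries an identifier the model is instructed to echo, so each claim in the output can in principle be traced back to a source rather than to the model’s own interpretation.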
The study concludes that even NotebookLM’s 13% hallucination rate is concerning for journalists whose credibility hinges on accuracy. ChatGPT and Gemini’s 40% rates make them effectively unusable for unsupervised reporting tasks. The findings underscore that verification must extend beyond mere facts to include interpretation, characterization, and attribution. Existing hallucination taxonomies may need expansion to capture these journalism-specific error types.
For newsrooms, the research offers several practical implications:
- **Tool Selection and Configuration:** Prioritize tools that enforce source attribution, like NotebookLM, over those optimized for fluency or speed. If citations aren’t built in, workflows should require models to include explicit passage markers for every claim.
- **Verification Workflows:** Journalists must rigorously verify interpretive claims. This means checking whether a document actually makes an argument, who specifically made a claim, and whether characterizations of sources are supported by the text (one possible check along these lines is sketched after this list).
- **Training and Awareness:** Newsroom staff need to be trained on common LLM error patterns. Any model output that characterizes documents, summarizes positions, or makes comparative claims should be flagged for heightened scrutiny.
- **System Design Priorities:** Vendors developing journalism tools should prioritize citation infrastructure and mechanisms that reject outputs untraceable to specific sources, maintain speaker attribution, and explicitly distinguish reporting from analysis.
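One way to operationalize these practices is a pre-publication check that refuses to pass any sentence it cannot trace to a known source and flags characterization language for human review. The sketch below is illustrative only; the document IDs, cue phrases, and sentence splitting are assumptions, not the researchers’ tooling or any vendor’s feature.

```python
import re

# Minimal pre-publication check (illustrative assumptions throughout).
known_docs = {"doc_014_court_filing", "doc_087_news_article", "doc_203_policy_brief"}

# Phrases that often signal unsupported characterization of a source or audience.
characterization_cues = ("written for", "intended for", "aimed at", "widely regarded")

def review_output(answer: str) -> list[str]:
    """Flag sentences with missing or unknown citations, or characterization language."""
    issues = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = re.findall(r"\[([^\]]+)\]", sentence)
        if not cited:
            issues.append(f"NO CITATION: {sentence}")
        elif any(doc not in known_docs for doc in cited):
            issues.append(f"UNKNOWN SOURCE: {sentence}")
        if any(cue in sentence.lower() for cue in characterization_cues):
            issues.append(f"CHECK CHARACTERIZATION: {sentence}")
    return issues

model_answer = (
    "Senator Ortega said she was concerned about data access [doc_087_news_article]. "
    "The filing is intended for legal practitioners [doc_014_court_filing]. "
    "TikTok poses a national security threat."
)

for issue in review_output(model_answer):
    print(issue)  # the second and third sentences get flagged for human review
```

A check like this does not verify truth; it only surfaces sentences a human editor must trace back to the documents, which is exactly where the study found interpretive overconfidence creeping in.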
Ultimately, the core challenge is to make LLMs more epistemologically aligned with journalistic practices. Until models can maintain clear provenance chains and distinguish reporting from analysis, they remain tools requiring intensive human supervision rather than trusted partners in the newsroom.