Evaluating AI in Journalism: A Practitioner-Centered Approach to Better Benchmarks

TLDR: This research explores how to design effective benchmarks for Large Language Models (LLMs) in journalism, responding to criticism that existing benchmarks lack real-world relevance. Through a workshop with 23 journalism professionals, the study identified four key challenges: integrating journalistic values into metrics, accounting for diverse task contexts, drawing on professional judgment in data creation, and accommodating varied organizational needs. The authors propose an “evaluation cookbook” of adaptable computational notebooks that lets practitioners customize LLM evaluations, fostering more ecologically valid and domain-specific assessments.

Large Language Models (LLMs) are increasingly prevalent, with new versions and updates appearing almost monthly. These models often come with impressive “performance benchmarks” that highlight their advanced capabilities. However, a recent research paper delves into a critical question: are these benchmarks truly effective in evaluating how LLMs perform in real-world scenarios, especially within specialized fields like journalism?

The paper, titled “Towards Ecologically Valid LLM Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners”, argues that many existing benchmarks suffer from problems of “construct validity” (they may not test what they claim to test) and “ecological validity” (they may not accurately represent how models are used in practice). To address these concerns, the researchers took a human-centered approach, focusing on creating a benchmark specifically for journalism professionals.

The core of their research involved a day-long workshop with 23 journalism practitioners from various U.S. news organizations and academic institutions. The goal was to understand their experiences with generative AI, their evaluation methods, and their needs for a journalism-oriented benchmark. The workshop explored six common news production use cases, such as information extraction, summarization, and fact-checking, and six key journalistic values, including accuracy, transparency, and accountability.

Key Challenges and Design Insights from Journalism Professionals

The workshop revealed several significant challenges and insights for designing effective LLM benchmarks in journalism:

Values as Metrics: Journalists often define success by how well an AI system adheres to core journalistic values like accuracy and uncertainty. However, the definition and importance of these values can vary greatly depending on the specific task and context. This suggests that evaluation metrics need to be “values-driven” but also operationalized in use-case specific ways.

Context is King: Participants emphasized that evaluation metrics diverge significantly across different use contexts. For example, summarizing a news story for internal reporting has different criteria than summarizing it for public consumption. Factors like input document type, size, and the stage in the editorial process all influence how an AI’s performance should be judged. This highlights the need for benchmarks to “map context” systematically, acknowledging and detailing these variations.

High-Quality Data and Professional Judgment: There’s a clear need for high-quality datasets that accurately represent journalism tasks. However, creating these datasets is challenging due to confidentiality concerns, the need for expert human annotation to establish “ground truth,” and a general lack of time and resources in newsrooms. The paper suggests “leveraging professional judgment” by making it easier for practitioners to provide editorial feedback and share evaluation data.

Individual and Organizational Differences: News organizations and individual journalists have diverse preferences, technical proficiencies, and business objectives. Some participants desired a standardized, industry-wide benchmark, while others preferred a flexible framework or “cookbook” that could be adapted to their specific needs. This points to the importance of “modularity and adaptability” in benchmark design.
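The ideas of values-driven metrics and context mapping described above can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the paper: the `EvalContext` class, the `VALUE_WEIGHTS` table, and every weight in it are hypothetical. It shows one possible way for the same per-value scores to produce different overall judgments depending on the use context.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EvalContext:
    """Hypothetical description of where a model output will be used."""
    use_case: str   # e.g. "summarization"
    audience: str   # e.g. "internal" or "public"


# Assumed, illustrative weights: each newsroom would set its own.
VALUE_WEIGHTS: Dict[EvalContext, Dict[str, float]] = {
    EvalContext("summarization", "internal"): {
        "accuracy": 0.7, "transparency": 0.3,
    },
    EvalContext("summarization", "public"): {
        "accuracy": 0.5, "transparency": 0.2, "accountability": 0.3,
    },
}


def weighted_score(context: EvalContext, value_scores: Dict[str, float]) -> float:
    """Combine per-value scores (0 to 1) using the weights for this context."""
    weights = VALUE_WEIGHTS[context]
    missing = sorted(set(weights) - set(value_scores))
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return sum(w * value_scores[name] for name, w in weights.items())


# The same model output, scored per value by an editor or an automated check.
scores = {"accuracy": 1.0, "transparency": 0.5, "accountability": 0.0}
internal = weighted_score(EvalContext("summarization", "internal"), scores)
public = weighted_score(EvalContext("summarization", "public"), scores)
print(f"internal: {internal:.2f}, public: {public:.2f}")
```

Keeping the weights in a plain table rather than in code paths is one way to support the modularity participants asked for: an organization could edit the table without touching the scoring logic.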

The “Evaluation Cookbook” Approach

To address these findings, the researchers proposed an “evaluation cookbook” metaphor. This involves a series of computational notebooks, like those found in Google Colaboratory, that provide a structured yet flexible way to evaluate LLMs. Each notebook outlines a specific use case scenario, includes relevant data and ground truth, and defines metrics based on journalistic values. Crucially, these notebooks are designed to be copied and edited by practitioners, allowing them to customize evaluations to their unique contexts.

A case study was developed focusing on information extraction, using publicly available data journalism datasets. This case study demonstrated how the cookbook could incorporate different contextual variabilities, such as various input file formats and data structures. Each notebook includes an evaluation overview, detailed data information, task specifics with prompt construction examples, value-driven metrics, and model performance results.
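As a rough illustration of what one cell in such an information extraction notebook might compute, the sketch below compares model-extracted records against ground truth field by field. It is an assumption-laden example rather than the authors' code: the `field_accuracy` function, the exact-match rule after light normalization, and the sample records are all hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]


def normalize(value: object) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(value).lower().split())


def field_accuracy(predicted: List[Record], ground_truth: List[Record]) -> Dict[str, float]:
    """Return, for each ground-truth field, the fraction of records extracted correctly."""
    if not ground_truth:
        raise ValueError("ground_truth is empty")
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must align record by record")
    result: Dict[str, float] = {}
    for field in ground_truth[0]:
        correct = sum(
            normalize(p.get(field, "")) == normalize(g.get(field, ""))
            for p, g in zip(predicted, ground_truth)
        )
        result[field] = correct / len(ground_truth)
    return result


# Made-up records standing in for a data journalism dataset.
ground_truth = [
    {"agency": "City Parks Department", "amount": "12000"},
    {"agency": "Water Authority", "amount": "8500"},
]
model_output = [
    {"agency": "city parks department", "amount": "12000"},
    {"agency": "Water Authority", "amount": "85000"},  # extra zero: an accuracy error
]

field_scores = field_accuracy(model_output, ground_truth)
print(field_scores)
```

Reporting accuracy per field, rather than as a single number, lets an editor see that names were extracted reliably while amounts were not, which matters when a wrong figure is more damaging than a formatting difference.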

Future Directions and Challenges

Initial feedback on the cookbook concept was positive, reinforcing the need for contextual variations and the importance of editorial attention. New considerations included the desire for manual spot-checking within the notebooks and the challenge of evaluating AI in real-time during news production, rather than just on post-production data. The paper also acknowledges the pragmatic challenges of implementing such a system, including the cost of practitioner time, the diverse technical backgrounds in newsrooms, and the broader economic pressures on the journalism industry that can hinder data sharing and collaboration.
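The request for manual spot-checking could be supported by a small helper that draws a reproducible random sample of model outputs for an editor to review. The sketch below is hypothetical: the `sample_for_review` name, the default sample size, and the fixed seed are assumptions, not features described in the paper.

```python
import random
from typing import List, Tuple


def sample_for_review(outputs: List[str], k: int = 5, seed: int = 0) -> List[Tuple[int, str]]:
    """Pick up to k (index, output) pairs at random; a fixed seed keeps the sample reproducible."""
    rng = random.Random(seed)
    indexed = list(enumerate(outputs))
    return rng.sample(indexed, min(k, len(indexed)))


# Placeholder outputs standing in for real model responses.
outputs = [f"model output {i}" for i in range(20)]
review_batch = sample_for_review(outputs, k=3, seed=42)
for index, text in review_batch:
    print(index, text)
```

Returning the original index alongside each output lets a reviewer trace a flagged item back to its input document and ground truth.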

Ultimately, this research offers a valuable framework for creating LLM benchmarks that are more relevant and useful for specific professional domains. By deeply engaging with practitioners and focusing on ecological validity, it paves the way for AI evaluation systems that truly reflect the complexities and values of real-world applications.
