Evaluating AI in Journalism: A Practitioner-Centered Approach to Better Benchmarks

TLDR: This research explores how to design effective benchmarks for Large Language Models (LLMs) in journalism, addressing criticisms of existing benchmarks lacking real-world relevance. Through a workshop with 23 journalism professionals, the study identified key challenges: integrating journalistic values into metrics, accounting for diverse task contexts, the need for professional judgment in data creation, and accommodating varied organizational needs. The authors propose an “evaluation cookbook” – adaptable computational notebooks – to enable practitioners to customize LLM evaluations, thereby fostering more ecologically valid and domain-specific assessments.

Large Language Models (LLMs) are increasingly prevalent, with new versions and updates appearing almost monthly. These models often ship with impressive “performance benchmarks” that highlight their advanced capabilities. However, a recent research paper asks a critical question: do these benchmarks actually measure how LLMs perform in real-world scenarios, especially within specialized fields like journalism?

The paper, titled “Towards Ecologically Valid LLM Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners”, argues that many existing benchmarks suffer from issues of “construct validity” – meaning they might not test what they claim to test – and “ecological validity” – they may not accurately represent how models are used in practice. To address these concerns, the researchers took a human-centered approach, focusing on creating a benchmark specifically for journalism professionals.

The core of their research involved a day-long workshop with 23 journalism practitioners from various U.S. news organizations and academic institutions. The goal was to understand their experiences with generative AI, their evaluation methods, and their needs for a journalism-oriented benchmark. The workshop explored six common news production use cases, such as information extraction, summarization, and fact-checking, and six key journalistic values, including accuracy, transparency, and accountability.

Key Challenges and Design Insights from Journalism Professionals

The workshop revealed several significant challenges and insights for designing effective LLM benchmarks in journalism:

Values as Metrics: Journalists often define success by how well an AI system upholds core journalistic values, such as accuracy and how it handles uncertainty. However, the definition and importance of these values can vary greatly depending on the specific task and context. This suggests that evaluation metrics need to be “values-driven” but also operationalized in use-case specific ways.

Context is King: Participants emphasized that evaluation metrics diverge significantly across different use contexts. For example, summarizing a news story for internal reporting has different criteria than summarizing it for public consumption. Factors like input document type, size, and the stage in the editorial process all influence how an AI’s performance should be judged. This highlights the need for benchmarks to “map context” systematically, acknowledging and detailing these variations.

High-Quality Data and Professional Judgment: There’s a clear need for high-quality datasets that accurately represent journalism tasks. However, creating these datasets is challenging due to confidentiality concerns, the need for expert human annotation to establish “ground truth,” and a general lack of time and resources in newsrooms. The paper suggests “leveraging professional judgment” by making it easier for practitioners to provide editorial feedback and share evaluation data.

Individual and Organizational Differences: News organizations and individual journalists have diverse preferences, technical proficiencies, and business objectives. Some participants desired a standardized, industry-wide benchmark, while others preferred a flexible framework or “cookbook” that could be adapted to their specific needs. This points to the importance of “modularity and adaptability” in benchmark design.

The “Evaluation Cookbook” Approach

To address these findings, the researchers proposed an “evaluation cookbook”: a series of computational notebooks, like those hosted on Google Colaboratory, that provide a structured yet flexible way to evaluate LLMs. Each notebook outlines a specific use case scenario, includes relevant data and ground truth, and defines metrics based on journalistic values. Crucially, these notebooks are designed to be copied and edited by practitioners, allowing them to customize evaluations to their unique contexts.
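To make the copy-and-edit idea concrete, here is a minimal Python sketch of what one cookbook entry might look like. It is not the authors' code: the `Recipe` class, the two metric functions, the toy model, and the sample data are all hypothetical, chosen only to show how a practitioner could copy a base evaluation and swap in a metric that suits a different context.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple

# A metric maps (model_output, ground_truth) to a score in [0, 1].
Metric = Callable[[str, str], float]


def exact_match(output: str, truth: str) -> float:
    """Strict accuracy: 1.0 only if the output equals the ground truth."""
    return 1.0 if output.strip() == truth.strip() else 0.0


def lenient_match(output: str, truth: str) -> float:
    """Looser accuracy: ignores letter case and all whitespace."""
    def squash(s: str) -> str:
        return "".join(s.split()).lower()
    return 1.0 if squash(output) == squash(truth) else 0.0


@dataclass(frozen=True)
class Recipe:
    """Hypothetical cookbook entry: use case, context, data, and metrics."""
    use_case: str
    context: str
    examples: List[Tuple[str, str]]   # (input text, ground truth)
    metrics: Dict[str, Metric]        # metric name -> scoring function

    def evaluate(self, model: Callable[[str], str]) -> Dict[str, float]:
        """Average each metric over all examples for the given model."""
        scores: Dict[str, float] = {}
        for name, metric in self.metrics.items():
            values = [metric(model(text), truth) for text, truth in self.examples]
            scores[name] = sum(values) / len(values) if values else 0.0
        return scores


def toy_model(text: str) -> str:
    """Stand-in for an LLM call: returns whatever follows the last colon."""
    return text.split(":")[-1].strip()


# Base recipe, as it might be published in the cookbook (made-up data).
base = Recipe(
    use_case="information extraction",
    context="figures quoted in a published story",
    examples=[
        ("Mayor: Jane Doe", "Jane Doe"),
        ("Budget: $4.2 million", "$4.2 million"),
        ("Turnout: 61 %", "61%"),
    ],
    metrics={"accuracy": exact_match},
)

# A newsroom copies the recipe and relaxes the metric for internal notes.
internal = replace(
    base,
    context="internal research notes",
    metrics={"accuracy": lenient_match},
)

print(base.context, base.evaluate(toy_model))
print(internal.context, internal.evaluate(toy_model))
```

The same model output scores differently under the two recipes because the third example differs from its ground truth only by a space. Whether that difference matters is exactly the kind of context-dependent editorial judgment the workshop participants described.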

The researchers developed a case study on information extraction, using publicly available data journalism datasets. The case study demonstrated how the cookbook can accommodate contextual variation, such as different input file formats and data structures. Each notebook includes an evaluation overview, detailed data information, task specifics with prompt construction examples, value-driven metrics, and model performance results.
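As a rough illustration of handling different input formats within one evaluation, the sketch below loads the same small dataset from CSV and from JSON, then scores a set of extracted records field by field against the ground truth. The function names (`load_records`, `field_accuracy`), the sample data, and the "predicted" records are assumptions made for this example; they do not come from the paper's notebooks.

```python
import csv
import io
import json
from typing import Dict, List

Record = Dict[str, str]


def load_records(raw: str, fmt: str) -> List[Record]:
    """Parse raw text in the given format into a list of string-valued records."""
    if fmt == "csv":
        return [dict(row) for row in csv.DictReader(io.StringIO(raw))]
    if fmt == "json":
        return [{key: str(value) for key, value in row.items()}
                for row in json.loads(raw)]
    raise ValueError(f"unsupported format: {fmt}")


def field_accuracy(predicted: List[Record], truth: List[Record]) -> Dict[str, float]:
    """Per-field share of ground-truth records whose value was extracted exactly.

    Records are compared by position; missing predictions count as misses.
    """
    if not truth:
        return {}
    result: Dict[str, float] = {}
    for name in truth[0]:
        hits = sum(
            1
            for pred, gold in zip(predicted, truth)
            if pred.get(name, "").strip() == gold[name].strip()
        )
        result[name] = hits / len(truth)
    return result


# The same made-up ground truth, delivered in two file formats.
csv_text = "county,cases\nAdams,12\nBrown,7\n"
json_text = '[{"county": "Adams", "cases": 12}, {"county": "Brown", "cases": 7}]'

truth_from_csv = load_records(csv_text, "csv")
truth_from_json = load_records(json_text, "json")

# Hypothetical model output with one wrong value.
predicted = [
    {"county": "Adams", "cases": "12"},
    {"county": "Brown", "cases": "9"},
]

print(field_accuracy(predicted, truth_from_csv))
```

Normalizing every format to the same record structure before scoring means the metric code stays unchanged when a newsroom swaps in a different file type. Reporting accuracy per field, rather than one overall number, shows an editor which values need a manual check.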

Future Directions and Challenges

Initial feedback on the cookbook concept was positive, reinforcing the need for contextual variations and the importance of editorial attention. New considerations included the desire for manual spot-checking within the notebooks and the challenge of evaluating AI in real-time during news production, rather than just on post-production data. The paper also acknowledges the pragmatic challenges of implementing such a system, including the cost of practitioner time, the diverse technical backgrounds in newsrooms, and the broader economic pressures on the journalism industry that can hinder data sharing and collaboration.

Ultimately, this research offers a valuable framework for creating LLM benchmarks that are more relevant and useful for specific professional domains. By deeply engaging with practitioners and focusing on ecological validity, it paves the way for AI evaluation systems that truly reflect the complexities and values of real-world applications.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
