AI Agents Step into the Newsroom: A New Benchmark for Journalistic Tasks

TLDR: NEWSAGENT is a new benchmark that evaluates AI agents on realistic newswriting tasks, requiring them to iteratively search, select, and edit multimodal information to create news articles. It reveals that while agents can retrieve facts, they struggle with planning and narrative integration, and their information selection often differs from human journalists. The benchmark also shows that open-source models can compete with closed-source ones in this domain, with specific strengths observed in different models.

Journalism demands iterative planning, interpretation, and contextual reasoning over diverse raw content, which makes it a distinctive challenge for autonomous digital agents. While AI has shown promise in structured tasks, whether it can work productively with multimodal web data in complex fields such as news writing has remained an open question.

A new research paper introduces NEWSAGENT, a comprehensive benchmark designed to evaluate how effectively AI agents can function as journalists. The benchmark assesses agents’ ability to search for available raw content, select the information they need, and then edit and rephrase it into a well-structured news article. Unlike typical summarization or retrieval tasks, where the essential context is readily available, NEWSAGENT requires agents to actively discover information, mirroring the real-world information gaps faced by human journalists.

The NEWSAGENT benchmark comprises over 6,000 human-verified examples derived from actual news events. To ensure broad model compatibility, all multimodal content, such as images and transcripts, is converted into text. Agents are given a writing instruction and initial firsthand data, much like a journalist starting a draft. Their tasks include identifying narrative perspectives, issuing keyword-based queries, retrieving historical background, and ultimately generating complete articles.

The framework for NEWSAGENT reflects a realistic journalistic workflow. Agents operate in an iterative perception-action loop, observing the current draft, task inputs (title, release date, firsthand information), and retrieved content. They can perform three core actions: a time-aware search function to retrieve historical context published strictly before the simulated release date, an insert function to add selected retrieved objects to the draft, and a remove function to delete existing content. This process continues until the agent decides to terminate, after which the draft is rephrased into a final news article.
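The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the benchmark's actual code: the names `Item`, `Draft`, `run_agent`, and `scripted_policy` are hypothetical, the search is a plain keyword match, and the scripted policy stands in for the LLM that would choose actions in a real run.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Item:
    """A retrievable piece of content (hypothetical structure)."""
    item_id: str
    text: str
    published: date


@dataclass
class Draft:
    """The working draft: an ordered list of selected items."""
    items: list[Item] = field(default_factory=list)


def search(corpus, query, release_date):
    """Time-aware search: keyword match, restricted to items
    published strictly before the simulated release date."""
    q = query.lower()
    return [it for it in corpus
            if it.published < release_date and q in it.text.lower()]


def insert(draft, item):
    """Add a retrieved item to the draft (no duplicates)."""
    if item not in draft.items:
        draft.items.append(item)


def remove(draft, item_id):
    """Delete an item from the draft by its id."""
    draft.items = [it for it in draft.items if it.item_id != item_id]


def run_agent(policy, corpus, release_date, max_steps=10):
    """Perception-action loop: the policy observes the draft and the last
    retrieved results, then picks search, insert, remove, or stop."""
    draft, retrieved = Draft(), []
    for _ in range(max_steps):
        action, arg = policy(draft, retrieved)
        if action == "search":
            retrieved = search(corpus, arg, release_date)
        elif action == "insert":
            insert(draft, arg)
        elif action == "remove":
            remove(draft, arg)
        else:  # any other action, e.g. "stop", terminates the loop
            break
    return draft


# Made-up example data; the second item postdates the release date.
corpus = [
    Item("a", "Flood warning issued for the river valley", date(2024, 5, 1)),
    Item("b", "Flood damage assessed after the storm", date(2024, 6, 10)),
]


def scripted_policy(draft, retrieved):
    """Stand-in for an LLM: search once, insert the top hit, then stop."""
    if not retrieved and not draft.items:
        return ("search", "flood")
    if retrieved and not draft.items:
        return ("insert", retrieved[0])
    return ("stop", None)


draft = run_agent(scripted_policy, corpus, release_date=date(2024, 6, 1))
print([it.item_id for it in draft.items])
```

In this toy run only item `a` can reach the draft, because item `b` was published after the simulated release date and is filtered out by the search. The final rephrasing step that turns the draft into an article is left out of the sketch.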

Evaluation in NEWSAGENT occurs at two levels: function-wise metrics and end-to-end metrics. Function-wise metrics report precision, recall, and F1 for the search and edit operations, measured against human-written articles. For end-to-end newswriting, a dimension-wise GPT-4 comparative evaluation is used, assessing generated articles across six dimensions: Factuality, Logical Consistency, Importance, Readability, Objectivity, and Journalistic Style. Together, the two levels separate how well an agent performs individual operations from how good its finished article is.
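As a rough illustration of the function-wise metrics, the sketch below computes precision, recall, and F1 for one example by comparing the set of items an agent selected with the set used in the human-written article. It assumes items are matched by exact identifier, which may differ from the paper's matching procedure; the function name and the sample identifiers are made up.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 for one example.

    predicted: item ids the agent retrieved or inserted
    reference: item ids used in the human-written article
    (exact-id matching is an assumption of this sketch)
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# The agent picked three items; the human article used four; two overlap.
p, r, f1 = precision_recall_f1({"a", "b", "c"}, {"a", "b", "d", "e"})
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here precision is 2/3, recall is 1/2, and F1 is 4/7. An agent that inserts only a few highly relevant items raises precision while lowering recall, which is the kind of trade-off these metrics make visible.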

Key Findings from the Research

  • Limited Self-Correction: A significant finding was the agents’ limited capacity for self-correction. Across all models, the ‘Remove’ operation was rarely, if ever, invoked. This suggests that current AI agents tend to assume their drafts are satisfactory and do not iteratively refine or prune content in the way human journalists do.
  • Divergent Information Needs: The information selected by LLM agents often diverged from the choices made by human journalists. While agents could retrieve relevant facts, their planning and narrative integration capabilities showed weaknesses. The interaction design, specifically a 2-step execution mode, improved precision by focusing on highly relevant items but sometimes reduced overall coverage.
  • Open-Source Competitiveness: Contrary to common assumptions, closed-source models like GPT-4o did not consistently outperform high-performing open-source models such as Qwen3-32B and Gemma-3-27b-it in end-to-end newswriting. This indicates that general-purpose reasoning capability doesn’t always translate directly to superior performance in targeted editorial workflows.
  • Strengths and Weaknesses: Qwen3-32B demonstrated strong performance in ‘Journalistic Style’ and ‘Importance’, often incorporating a broader range of historical information to enhance narrative continuity. GPT-4o, on the other hand, excelled in ‘Readability’, producing fluent and easy-to-follow narratives. Human-written articles, while not always achieving the highest overall win rates, remained competitive in ‘Factuality’ and ‘Objectivity’, emphasizing concise factual delivery.

The researchers believe that NEWSAGENT serves as a realistic testbed for iterating and evaluating agent capabilities in multimodal web data manipulation for real-world productivity. Future research directions include incorporating native multimodal capabilities for direct processing of images, videos, and audio, and exploring more sophisticated multi-agent frameworks where specialized agents can collaborate, mimicking professional newsroom practices.

For more in-depth information, you can read the full research paper available here.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
