AI Agents Step into the Newsroom: A New Benchmark for Journalistic Tasks

TLDR: NEWSAGENT is a new benchmark that evaluates AI agents on realistic newswriting tasks, requiring them to iteratively search, select, and edit multimodal information to create news articles. It reveals that while agents can retrieve facts, they struggle with planning and narrative integration, and their information selection often differs from human journalists. The benchmark also shows that open-source models can compete with closed-source ones in this domain, with specific strengths observed in different models.

Journalism demands iterative planning, interpretation, and contextual reasoning over diverse raw content, which makes it a distinctive challenge for autonomous digital agents. AI has shown promise in structured tasks, but whether agents can work productively with multimodal web data in complex fields like news writing has remained an open question.

A new research paper introduces NEWSAGENT, a comprehensive benchmark designed to evaluate how effectively AI agents can function as journalists. The benchmark assesses agents' ability to search for available raw content, select the information they need, and then edit and rephrase it into a well-structured news article. Unlike typical summarization or retrieval tasks, where the essential context is readily available, NEWSAGENT requires agents to actively discover information, mirroring the real-world information gaps faced by human journalists.

The NEWSAGENT benchmark comprises over 6,000 human-verified examples derived from actual news events. To ensure broad model compatibility, all multimodal content, such as images and transcripts, is converted into text. Agents are given a writing instruction and initial firsthand data, much like a journalist starting a draft. Their tasks include identifying narrative perspectives, issuing keyword-based queries, retrieving historical background, and ultimately generating complete articles.

The framework for NEWSAGENT reflects a realistic journalistic workflow. Agents operate in an iterative perception-action loop, observing the current draft, task inputs (title, release date, firsthand information), and retrieved content. They can perform three core actions: a time-aware search function to retrieve historical context published strictly before the simulated release date, an insert function to add selected retrieved objects to the draft, and a remove function to delete existing content. This process continues until the agent decides to terminate, after which the draft is rephrased into a final news article.
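The loop described above can be sketched in a few lines of Python. This is an illustrative toy, not the benchmark's actual implementation: the names `Doc`, `Draft`, `time_aware_search`, `run_episode`, and `scripted_policy` are hypothetical, the corpus is made up, and a hard-coded policy stands in for the LLM agent. The one detail taken from the description is that search only returns items published strictly before the simulated release date.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Doc:
    doc_id: str
    published: date
    text: str


@dataclass
class Draft:
    items: list = field(default_factory=list)

    def insert(self, doc):
        # Add a retrieved object to the draft, skipping duplicates.
        if doc not in self.items:
            self.items.append(doc)

    def remove(self, doc_id):
        # Delete existing content from the draft by id.
        self.items = [d for d in self.items if d.doc_id != doc_id]


def time_aware_search(corpus, query, release_date):
    """Keyword match limited to documents published strictly before release_date."""
    words = query.lower().split()
    return [
        d for d in corpus
        if d.published < release_date
        and any(w in d.text.lower() for w in words)
    ]


def run_episode(corpus, release_date, policy, max_steps=10):
    """Perception-action loop: observe draft and retrievals, act, repeat until stop."""
    draft, retrieved = Draft(), []
    for _ in range(max_steps):
        action, arg = policy(draft, retrieved)
        if action == "search":
            retrieved = time_aware_search(corpus, arg, release_date)
        elif action == "insert":
            draft.insert(arg)
        elif action == "remove":
            draft.remove(arg)
        elif action == "stop":
            break
    return draft


def scripted_policy(draft, retrieved):
    # Hypothetical stand-in for the agent: search once, insert every hit, then stop.
    if not retrieved and not draft.items:
        return "search", "flood"
    for d in retrieved:
        if d not in draft.items:
            return "insert", d
    return "stop", None


# Made-up corpus for illustration only.
CORPUS = [
    Doc("a", date(2024, 1, 5), "Flood warning issued for the river valley"),
    Doc("b", date(2024, 3, 1), "Flood damage report published"),
    Doc("c", date(2024, 1, 10), "Local election results"),
]

final = run_episode(CORPUS, date(2024, 2, 1), scripted_policy)
print([d.doc_id for d in final.items])  # ['a']
```

Document "b" matches the query but is dated after the simulated release date, so the search never surfaces it. The final rephrasing step that turns the draft into an article is left out of this sketch.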

Evaluation in NEWSAGENT occurs at two levels: function-wise metrics and end-to-end metrics. Function-wise metrics assess the precision, recall, and F1 scores for search and edit operations against human-written articles. For end-to-end newswriting, a dimension-wise GPT-4 comparative evaluation is used, assessing generated articles across six critical dimensions: Factuality, Logical Consistency, Importance, Readability, Objectivity, and Journalistic Style. This detailed evaluation provides a nuanced understanding of agent performance.
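As a rough illustration of the function-wise metrics, the sketch below computes precision, recall, and F1 by comparing the set of items an agent selected with the set used in the human-written article. Matching items by exact identifier is an assumption made here for simplicity; the paper may match items differently. The function name `prf1` and the example identifiers are hypothetical.

```python
def prf1(predicted, reference):
    """Set-based precision, recall, and F1 of predicted items against a reference."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: the agent inserted a, b, c; the human article used a, b, d, e.
p, r, f = prf1({"a", "b", "c"}, {"a", "b", "d", "e"})
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.5 0.571
```

Two of the three selected items are correct (precision 2/3), and two of the four reference items are covered (recall 1/2), giving an F1 of 4/7.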

Key Findings from the Research

Limited Self-Correction: A significant finding was the agents’ limited capacity for self-correction. Across all models, the ‘Remove’ operation was rarely, if ever, invoked. This suggests that current AI agents tend to assume their drafts are satisfactory and do not iteratively refine or prune content in the way human journalists do.
Divergent Information Needs: The information selected by LLM agents often diverged from the choices made by human journalists. While agents could retrieve relevant facts, their planning and narrative integration capabilities showed weaknesses. The interaction design, specifically a 2-step execution mode, improved precision by focusing on highly relevant items but sometimes reduced overall coverage.
Open-Source Competitiveness: Contrary to common assumptions, closed-source models like GPT-4o did not consistently outperform high-performing open-source models such as Qwen3-32B and Gemma-3-27b-it in end-to-end newswriting. This indicates that general-purpose reasoning capability doesn’t always translate directly to superior performance in targeted editorial workflows.
Strengths and Weaknesses: Qwen3-32B demonstrated strong performance in ‘Journalistic Style’ and ‘Importance’, often incorporating a broader range of historical information to enhance narrative continuity. GPT-4o, on the other hand, excelled in ‘Readability’, producing fluent and easy-to-follow narratives. Human-written articles, while not always achieving the highest overall win rates, remained competitive in ‘Factuality’ and ‘Objectivity’, emphasizing concise factual delivery.
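The win rates mentioned above come from pairwise, dimension-wise comparisons. A minimal sketch of how such judgments could be tallied is shown below. It assumes each judgment is a `(dimension, winner)` pair with the winner recorded as "A", "B", or "tie", and that ties count toward the total but not as wins; both are assumptions, not details from the paper. The function name `win_rates` and the sample judgments are made up.

```python
from collections import defaultdict


def win_rates(judgments):
    """Per-dimension share of comparisons won by system A."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for dimension, winner in judgments:
        totals[dimension] += 1
        if winner == "A":
            wins[dimension] += 1
    return {dim: wins[dim] / totals[dim] for dim in totals}


# Made-up judgments for illustration only.
sample = [
    ("Readability", "A"),
    ("Readability", "B"),
    ("Readability", "A"),
    ("Readability", "tie"),
    ("Objectivity", "A"),
]
print(win_rates(sample))  # {'Readability': 0.5, 'Objectivity': 1.0}
```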

The researchers believe that NEWSAGENT serves as a realistic testbed for developing and evaluating agent capabilities in multimodal web data manipulation for real-world productivity. Future research directions include adding native multimodal capabilities so agents can process images, videos, and audio directly, and exploring more sophisticated multi-agent frameworks in which specialized agents collaborate, mimicking professional newsroom practices.

For more in-depth information, see the full research paper.
