TL;DR: WebDS is a new, comprehensive benchmark featuring 870 tasks across 29 diverse websites, designed to evaluate AI agents on complex, end-to-end web-based data science workflows. Unlike previous benchmarks, WebDS simulates real-world scenarios from data acquisition to analysis and reporting, requiring multi-step interactions, diverse data formats, and tool usage. Initial evaluations show that even state-of-the-art LLM agents perform poorly, achieving success rates as low as 13%, indicating significant challenges in areas like information grounding and environment interaction.
Many real-world data science tasks are far from simple. They span multiple steps: interacting with various websites, gathering real-time data from different sources and formats, and then producing summarized analyses. Existing benchmarks for web interaction typically focus on straightforward actions like filling out forms or completing e-commerce transactions, so they don’t capture the diverse tool-using capabilities that complex data science on the web demands.
Similarly, traditional data science benchmarks usually deal with static datasets, often text-based, and don’t evaluate the entire workflow, from acquiring data to cleaning, analyzing, and generating insights. To address these limitations, researchers have introduced WebDS, the first end-to-end benchmark specifically designed for web-based data science.
WebDS includes 870 web-based data science tasks spread across 29 different websites. These sites range from structured government data portals to unstructured news media, challenging AI agents to perform complex, multi-step operations. These tasks demand the use of various tools and the handling of heterogeneous data formats, better reflecting the realities of modern data analytics.
Initial evaluations of current state-of-the-art Large Language Model (LLM) agents on WebDS reveal substantial performance gaps. For instance, an agent like Browser Use, which successfully completes 80% of tasks on the WebVoyager benchmark, only manages to complete 15% of tasks in WebDS. Analysis suggests this performance drop is due to new types of failures, such as poor information grounding, repetitive behavior, and agents taking shortcuts.
What Makes WebDS Unique?
The creators of WebDS highlight three main contributions:
1. Comprehensive Task Suite: The benchmark features 870 tasks covering a wide range of data types, modalities, and domains across 29 websites. These tasks require not only analytical reasoning but also interaction with diverse tools and interfaces. An example task might involve identifying relevant healthcare data, applying analytical techniques like nonlinear optimization, and then translating technical findings into an accessible policy brief.
2. Realistic End-to-End Evaluation: WebDS is the first benchmark to assess the complete data science pipeline. Tasks begin with agents autonomously browsing the web for relevant data, followed by analysis or visualization, and concluding with the generation of well-reasoned, context-aware outputs.
3. Reproducible and Fine-Grained Evaluation Framework: To ensure consistent and reliable evaluations, WebDS provides a fully containerized environment using Docker. Each of the 29 websites is hosted locally, eliminating variability from dynamic web content. The benchmark also introduces detailed metrics that capture subtask completion, tool usage accuracy, data validity, reasoning quality, and report fidelity, allowing for a nuanced assessment of agent capabilities.
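As a rough illustration of the fine-grained metrics listed above, here is a minimal sketch of how per-task scores might be recorded and aggregated. The class name, field names, and the unweighted mean are illustrative assumptions for this post, not the benchmark’s actual API or weighting.

```python
# Hypothetical sketch of a fine-grained scoring record covering the five
# metric families WebDS describes; names and aggregation are assumptions.
from dataclasses import dataclass


@dataclass
class TaskScore:
    subtask_completion: float  # fraction of required subtasks finished
    tool_usage: float          # correct tool invocations / total invocations
    data_validity: float       # acquired data matches the reference data
    reasoning_quality: float   # rubric-graded quality of the analysis steps
    report_fidelity: float     # final report matches the reference findings

    def aggregate(self) -> float:
        # Simple unweighted mean; the real benchmark may weight differently.
        parts = (self.subtask_completion, self.tool_usage, self.data_validity,
                 self.reasoning_quality, self.report_fidelity)
        return sum(parts) / len(parts)


score = TaskScore(0.75, 0.9, 1.0, 0.6, 0.5)
print(f"aggregate: {score.aggregate():.2f}")  # -> aggregate: 0.75
```

Breaking the score into subtask-level components like this is what lets a benchmark distinguish an agent that found the right data but wrote a poor report from one that failed at navigation outright.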
Task Attributes and Difficulty
Each task in WebDS is labeled with one or more attributes, including Question-Answering (QA), Action-Based, Single-hop vs. Multi-hop (requiring one or multiple data sources), Structured vs. Unstructured data, Tool Usage, Web Navigation, and Multi-website tasks. Tasks are also categorized into three difficulty levels – easy, medium, and hard – based on their structural and content properties. Hard tasks, for example, involve at least two complex properties or require interaction across multiple websites.
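To make the attribute labeling concrete, here is a toy task record with a difficulty check mirroring the rule above. The field names, label values, and the exact complexity test are illustrative assumptions, not the dataset’s actual schema.

```python
# Illustrative task record showing the kinds of attribute labels WebDS
# assigns; all identifiers here are hypothetical, not the real schema.
task = {
    "id": "webds-0421",                      # hypothetical task identifier
    "type": "QA",                            # QA or action-based
    "hops": "multi-hop",                     # single-hop vs. multi-hop sources
    "data": ["structured", "unstructured"],  # data formats involved
    "requires": ["tool_usage", "web_navigation", "multi_website"],
    "difficulty": "hard",                    # easy / medium / hard
}

# Properties treated as "complex" for this sketch.
COMPLEX = {"multi-hop", "unstructured", "tool_usage", "multi_website"}
labels = [task["hops"], *task["data"], *task["requires"]]
n_complex = sum(label in COMPLEX for label in labels)

# Per the paper, hard tasks involve at least two complex properties or
# require interaction across multiple websites.
is_hard = n_complex >= 2 or "multi_website" in task["requires"]
print("hard" if is_hard else "easy or medium")  # -> hard
```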
Performance Insights
Despite their strong performance on other agent benchmarks, all tested agents performed poorly on WebDS. Even Browser Use paired with GPT-4o, a leading web agent, achieved only a 12.9% success rate. Interestingly, Browser Use paired with Qwen2.5-72b slightly outperformed the GPT-4o setup, suggesting that the primary bottleneck isn’t just the model’s raw capability but rather the translation layer between its reasoning and its interaction with the environment.
Why Agents Fail
An in-depth failure analysis revealed several recurring themes: weak grounding (models failing to extract key details from pages), poor feedback handling (models not confirming whether UI manipulations actually succeeded), and misinterpretation of user intent. Other issues included navigation errors, repetitive behavior, and inefficient effort allocation.
Looking Ahead
WebDS aims to be a robust and long-lasting benchmark. Its diverse domains, containerized environment, extensible task format, and high complexity ensure it will remain challenging for future AI advancements. The research paper, titled “WebDS: An End-to-End Benchmark for Web-based Data Science,” can be found at https://arxiv.org/pdf/2508.01222. By bridging the gap between web interaction and data science capabilities, WebDS sets the stage for significant progress in developing practically useful LLM-based data science agents.