TLDR: GRAPH2EVAL is a framework that uses knowledge graphs to automatically generate diverse and challenging evaluation tasks for AI agents. It creates both multimodal document comprehension tasks and multi-step web interaction tasks, moving beyond static datasets toward a more comprehensive assessment of agents’ reasoning, collaboration, and interactive capabilities in dynamic environments. The generated tasks reliably differentiate agent-model combinations, revealing performance gaps and offering a new approach to agent evaluation.
As AI agents, particularly those driven by multimodal large language models (LLMs), become more sophisticated and autonomous, traditional methods of evaluating their capabilities are falling short. Static datasets, which often lead to agents memorizing answers rather than demonstrating genuine problem-solving skill, cannot adequately assess how agents perform in dynamic, real-world environments or across diverse tasks. Current LLM-based synthetic data generation methods are designed primarily for LLM training, not for agent tasks that require tool use and interaction. Furthermore, existing automatic agent task generation efforts are often limited to text or image analysis, failing to model the complex, multi-step interactions common in web environments.
To address these critical challenges, a new framework called GRAPH2EVAL has been proposed. This innovative system leverages knowledge graphs to automatically generate a wide array of evaluation tasks. These tasks span both multimodal document comprehension and intricate web interaction scenarios, providing a comprehensive way to evaluate an agent’s reasoning, collaboration, and interactive abilities.
How GRAPH2EVAL Works
At its core, GRAPH2EVAL constructs knowledge graphs from various external data sources, effectively creating a rich ‘task space’. Within this space, semantic relations are translated into structured multimodal tasks by combining subgraph sampling, predefined task templates, and meta-paths. To ensure the quality and executability of the generated tasks, a multi-stage filtering process is applied, which includes checks for node reachability, LLM scoring, and similarity analysis.
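The post describes these filter stages only at a high level. As a rough sketch of how such a pipeline could be wired together (the function names, data layout, and thresholds below are hypothetical illustrations, not the GRAPH2EVAL codebase), each candidate task might pass through the three checks in sequence:

```python
from collections import deque
from difflib import SequenceMatcher

def nodes_reachable(graph, nodes):
    """Stage 1: BFS from the first task node; every other node must be reachable."""
    start = nodes[0]
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return all(n in seen for n in nodes)

def filter_tasks(tasks, graph, llm_score, min_score=0.7, max_sim=0.9):
    """Multi-stage filter: reachability -> LLM quality score -> near-duplicate removal.

    `llm_score` stands in for a call to an LLM judge returning a 0-1 quality score;
    the thresholds are illustrative defaults, not values from the paper.
    """
    kept = []
    for task in tasks:
        if not nodes_reachable(graph, task["nodes"]):
            continue  # stage 1: task references nodes the agent cannot reach
        if llm_score(task["prompt"]) < min_score:
            continue  # stage 2: LLM judge rates the task too low
        if any(SequenceMatcher(None, task["prompt"], k["prompt"]).ratio() > max_sim
               for k in kept):
            continue  # stage 3: near-duplicate of an already-kept task
        kept.append(task)
    return kept
```

The ordering matters for cost: the cheap graph check runs first, so the (expensive) LLM judge is only invoked on tasks that are at least executable.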
A significant advantage of GRAPH2EVAL is its versatility. It supports end-to-end evaluation for different types of agents, including Single Agents, Multi-Agent systems, and Web Agents. This allows for a thorough assessment of reasoning, collaboration, and interaction capabilities across various settings.
The framework’s workflow involves several key stages:
- Data Parsing: Documents are structured beyond plain text, preserving hierarchical semantics, while web pages are collected via automated URL crawling, extracting DOM structures and screenshots.
- Knowledge Graph Construction: Unstructured and semi-structured content is transformed into a computable semantic space. Nodes represent elements like paragraphs, headings, buttons, and forms, while edges capture relationships such as structural connections, semantic associations, and web interactions (e.g., navigation, clicks).
- Subgraph Sampling: Relevant nodes and their interconnections are extracted from the knowledge graph based on the task objective. Different strategies are used for document comprehension (prioritizing semantic relevance and structural coherence) and web interaction (seed-driven, focusing on operational nodes like buttons and forms).
- Task Generation: Sampled subgraphs are transformed into executable tasks. For document comprehension, task templates, subgraph sampling, and variable extraction are combined with LLMs to generate concrete task instances. For web interaction, a seed-driven subgraph sampling strategy, meta-path matching, and dynamic task generation are employed to create multi-step interactive tasks.
- Coverage Optimization: A multi-stage optimization framework ensures the quality, diversity, and representativeness of the generated tasks through filtering, coverage quantification, and novelty assessment.
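The stages above can be sketched end to end on a toy example. None of the framework’s internal code is shown in the post, so everything below — the node/edge schema, the seed-driven sampler, the meta-path, and the template text — is an illustrative assumption about how such a pipeline could look, not the authors’ implementation:

```python
# Toy knowledge graph (assumed schema): typed nodes, typed directed edges.
nodes = {
    "login_page": "page", "user_field": "form", "submit_btn": "button",
    "dashboard": "page", "intro_para": "paragraph",
}
edges = [  # (source, relation, target)
    ("login_page", "contains", "user_field"),
    ("login_page", "contains", "submit_btn"),
    ("submit_btn", "navigates_to", "dashboard"),
    ("dashboard", "contains", "intro_para"),
]

def sample_subgraph(seed, hops=2):
    """Seed-driven sampling: grow outward from an operational node (e.g. a button)."""
    frontier, sub = {seed}, set()
    for _ in range(hops):
        grown = {(s, r, t) for s, r, t in edges if s in frontier or t in frontier}
        sub |= grown
        frontier |= {s for s, _, _ in grown} | {t for _, _, t in grown}
    return sub

def match_meta_path(sub, meta_path):
    """Return node sequences whose relation sequence matches the meta-path."""
    paths = [[s, t] for s, r, t in sub if r == meta_path[0]]
    for rel in meta_path[1:]:
        paths = [p + [t] for p in paths for s, r, t in sub if r == rel and s == p[-1]]
    return paths

# A fill-then-navigate meta-path and a task template (both assumed for illustration).
sub = sample_subgraph("submit_btn")
web_paths = match_meta_path(sub, ["contains", "navigates_to"])
template = "Fill in '{field}' on {page}, press '{button}', and report where you land."
web_tasks = [template.format(field="user_field", page=p, button=b)
             for p, b, _ in web_paths]
```

Only node sequences that realize the full meta-path become tasks, which is what keeps the generated instructions executable: every step in the prompt corresponds to an edge the agent can actually traverse.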
GRAPH2EVAL-BENCH: A New Dataset
To demonstrate its effectiveness, the researchers instantiated the framework with GRAPH2EVAL-BENCH, a curated dataset comprising 1,319 tasks. This dataset includes 1,002 document comprehension tasks and 317 web interaction tasks, offering a diverse range of scenarios for agent evaluation. Experiments with GRAPH2EVAL-BENCH have shown that the framework efficiently generates tasks that effectively differentiate the performance of various agent and model combinations, highlighting existing gaps in reasoning, collaboration, and web interaction across different settings.
Performance Insights
Evaluations on document comprehension tasks revealed that models like GPT-4o and Deepseek-V3 consistently achieved top performance. Interestingly, multi-agent collaboration did not always lead to significant improvements in document comprehension, sometimes even increasing token usage without a proportional gain in performance. For web interaction tasks, Agent S 2.5 generally outperformed the SoM Agent across most task types. The findings suggest that features like task-aligned reflection and multidimensional memory management can significantly enhance the reasoning capabilities of LLM-based agents in web environments.
GRAPH2EVAL offers a new perspective for agent evaluation, moving beyond the limitations of static datasets to provide a scalable and reproducible method for assessing agent capabilities in dynamic, multimodal environments. The code for GRAPH2EVAL is publicly available, fostering further research and development in the field. You can find the research paper here: Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs.


