TLDR: GRAPH2EVAL is a framework that uses knowledge graphs to automatically generate diverse and challenging evaluation tasks for AI agents. It creates both multimodal document comprehension tasks and multi-step web interaction tasks, moving beyond static datasets toward a more comprehensive assessment of agents’ reasoning, collaboration, and interactive capabilities in dynamic environments. The generated tasks reliably differentiate agent-model combinations, revealing performance gaps and offering a new approach to agent evaluation.
As AI agents, particularly those driven by multimodal large language models (LLMs), become more sophisticated and autonomous, traditional methods of evaluating their capabilities are falling short. Static datasets, which often lead to agents memorizing answers rather than demonstrating genuine problem-solving skill, cannot adequately assess how agents perform in dynamic, real-world environments or across diverse tasks. Current LLM-based synthetic data generation methods are designed primarily for LLM training, not for agent tasks that require tool use and interaction. Furthermore, existing automatic agent task generation efforts are often limited to text or image analysis, failing to model the complex, multi-step interactions common in web environments.
To address these critical challenges, a new framework called GRAPH2EVAL has been proposed. This innovative system leverages knowledge graphs to automatically generate a wide array of evaluation tasks. These tasks span both multimodal document comprehension and intricate web interaction scenarios, providing a comprehensive way to evaluate an agent’s reasoning, collaboration, and interactive abilities.
How GRAPH2EVAL Works
At its core, GRAPH2EVAL constructs knowledge graphs from various external data sources, effectively creating a rich ‘task space’. Within this space, semantic relations are translated into structured multimodal tasks by combining subgraph sampling, predefined task templates, and meta-paths. To ensure the quality and executability of the generated tasks, a multi-stage filtering process is applied, which includes checks for node reachability, LLM scoring, and similarity analysis.
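The post describes these filter stages only at a high level. As a rough sketch of how such a pipeline could be wired together (the function names, data layout, and thresholds below are hypothetical illustrations, not the GRAPH2EVAL codebase), each candidate task might pass through the three checks in sequence:

```python
from collections import deque
from difflib import SequenceMatcher

def nodes_reachable(graph, nodes):
    """Stage 1: BFS from the first task node; every other node must be reachable."""
    start = nodes[0]
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return all(n in seen for n in nodes)

def filter_tasks(tasks, graph, llm_score, min_score=0.7, max_sim=0.9):
    """Multi-stage filter: reachability -> LLM quality score -> near-duplicate removal.

    `llm_score` stands in for a call to an LLM judge returning a 0-1 quality score;
    the thresholds are illustrative defaults, not values from the paper.
    """
    kept = []
    for task in tasks:
        if not nodes_reachable(graph, task["nodes"]):
            continue  # stage 1: task references nodes the agent cannot reach
        if llm_score(task["prompt"]) < min_score:
            continue  # stage 2: LLM judge rates the task too low
        if any(SequenceMatcher(None, task["prompt"], k["prompt"]).ratio() > max_sim
               for k in kept):
            continue  # stage 3: near-duplicate of an already-kept task
        kept.append(task)
    return kept
```

The ordering matters for cost: the cheap graph check runs first, so the (expensive) LLM judge is only invoked on tasks that are at least executable.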
A significant advantage of GRAPH2EVAL is its versatility. It supports end-to-end evaluation for different types of agents, including Single Agents, Multi-Agent systems, and Web Agents. This allows for a thorough assessment of reasoning, collaboration, and interaction capabilities across various settings.
The framework’s workflow involves several key stages:
- Data Parsing: Documents are structured beyond plain text, preserving hierarchical semantics, while web pages are collected via automated URL crawling, extracting DOM structures and screenshots.
- Knowledge Graph Construction: Unstructured and semi-structured content is transformed into a computable semantic space. Nodes represent elements like paragraphs, headings, buttons, and forms, while edges capture relationships such as structural connections, semantic associations, and web interactions (e.g., navigation, clicks).
- Subgraph Sampling: Relevant nodes and their interconnections are extracted from the knowledge graph based on the task objective. Different strategies are used for document comprehension (prioritizing semantic relevance and structural coherence) and web interaction (seed-driven, focusing on operational nodes like buttons and forms).
- Task Generation: Sampled subgraphs are transformed into executable tasks. For document comprehension, task templates, subgraph sampling, and variable extraction are combined with LLMs to generate concrete task instances. For web interaction, a seed-driven subgraph sampling strategy, meta-path matching, and dynamic task generation are employed to create multi-step interactive tasks.
- Coverage Optimization: A multi-stage optimization framework ensures the quality, diversity, and representativeness of the generated tasks through filtering, coverage quantification, and novelty assessment.
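The stages above can be sketched end to end on a toy example. None of the framework’s internal code is shown in the post, so everything below — the node/edge schema, the seed-driven sampler, the meta-path, and the template text — is an illustrative assumption about how such a pipeline could look, not the authors’ implementation:

```python
# Toy knowledge graph (assumed schema): typed nodes, typed directed edges.
nodes = {
    "login_page": "page", "user_field": "form", "submit_btn": "button",
    "dashboard": "page", "intro_para": "paragraph",
}
edges = [  # (source, relation, target)
    ("login_page", "contains", "user_field"),
    ("login_page", "contains", "submit_btn"),
    ("submit_btn", "navigates_to", "dashboard"),
    ("dashboard", "contains", "intro_para"),
]

def sample_subgraph(seed, hops=2):
    """Seed-driven sampling: grow outward from an operational node (e.g. a button)."""
    frontier, sub = {seed}, set()
    for _ in range(hops):
        grown = {(s, r, t) for s, r, t in edges if s in frontier or t in frontier}
        sub |= grown
        frontier |= {s for s, _, _ in grown} | {t for _, _, t in grown}
    return sub

def match_meta_path(sub, meta_path):
    """Return node sequences whose relation sequence matches the meta-path."""
    paths = [[s, t] for s, r, t in sub if r == meta_path[0]]
    for rel in meta_path[1:]:
        paths = [p + [t] for p in paths for s, r, t in sub if r == rel and s == p[-1]]
    return paths

# A fill-then-navigate meta-path and a task template (both assumed for illustration).
sub = sample_subgraph("submit_btn")
web_paths = match_meta_path(sub, ["contains", "navigates_to"])
template = "Fill in '{field}' on {page}, press '{button}', and report where you land."
web_tasks = [template.format(field="user_field", page=p, button=b)
             for p, b, _ in web_paths]
```

Only node sequences that realize the full meta-path become tasks, which is what keeps the generated instructions executable: every step in the prompt corresponds to an edge the agent can actually traverse.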
GRAPH2EVAL-BENCH: A New Dataset
To demonstrate its effectiveness, the researchers instantiated the framework with GRAPH2EVAL-BENCH, a curated dataset comprising 1,319 tasks. This dataset includes 1,002 document comprehension tasks and 317 web interaction tasks, offering a diverse range of scenarios for agent evaluation. Experiments with GRAPH2EVAL-BENCH have shown that the framework efficiently generates tasks that effectively differentiate the performance of various agent and model combinations, highlighting existing gaps in reasoning, collaboration, and web interaction across different settings.
Performance Insights
Evaluations on document comprehension tasks revealed that models like GPT-4o and Deepseek-V3 consistently achieved top performance. Interestingly, multi-agent collaboration did not always lead to significant improvements in document comprehension, sometimes even increasing token usage without a proportional gain in performance. For web interaction tasks, Agent S 2.5 generally outperformed the SoM Agent across most task types. The findings suggest that features like task-aligned reflection and multidimensional memory management can significantly enhance the reasoning capabilities of LLM-based agents in web environments.
GRAPH2EVAL offers a new perspective for agent evaluation, moving beyond the limitations of static datasets to provide a scalable and reproducible method for assessing agent capabilities in dynamic, multimodal environments. The code for GRAPH2EVAL is publicly available, fostering further research and development in the field. You can find the research paper here: Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs.


