Advancing AI Agents: Introducing GUI-360 for Desktop Tasks

TLDR: GUI-360 is a new, large-scale dataset and benchmark designed to improve computer-using agents (CUAs) that automate tasks on desktop environments like Windows Office applications. It addresses key challenges such as the scarcity of real-world tasks, lack of automated data collection, and absence of a unified benchmark. The dataset, created using an LLM-augmented automated pipeline, contains over 1.2 million action steps with multimodal data. It supports three core tasks: GUI grounding, screen parsing, and action prediction, and reveals that current AI models struggle with these tasks out-of-the-box but show significant improvement with fine-tuning on GUI-360.

Agents that can operate our digital environments directly are moving from aspiration to practice. These ‘computer-using agents’ (CUAs) promise to automate routine tasks and make everyday digital work more efficient. However, building robust CUAs for desktop environments, such as Windows Office applications, remains a significant challenge.

A new research paper introduces GUI-360, a groundbreaking dataset and benchmark suite designed to accelerate progress in this critical area. The paper, titled “GUI-360°: A Comprehensive Dataset and Benchmark for Computer-Using Agents,” was authored by Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang.

The researchers highlight three persistent gaps hindering CUA development: a scarcity of real-world tasks, the absence of automated data collection and annotation pipelines, and the lack of a unified benchmark for evaluating key capabilities like GUI grounding, screen parsing, and action prediction. GUI-360 directly addresses these issues.

What Makes GUI-360 Unique?

GUI-360 is a large-scale, comprehensive resource built with realism, scalability, and task breadth in mind. It features over 1.2 million executed action steps across thousands of trajectories in popular Windows Office applications: Word, Excel, and PowerPoint. The dataset includes full-resolution screenshots, accessibility metadata, natural-language goals, intermediate reasoning traces, and both successful and failed action trajectories.

One of its most innovative aspects is the LLM-augmented, largely automated pipeline used for its creation. This pipeline handles everything from sourcing user queries to constructing environment templates, instantiating tasks, executing them in batches, and filtering for quality. This automation minimizes human intervention while ensuring the data reflects real-world usage patterns.

The dataset supports three core tasks crucial for CUAs:

  • GUI Grounding: Identifying the precise screen location or UI element to interact with based on a given instruction.
  • Screen Parsing: Enumerating all interactable UI elements on a screen and their properties.
  • Action Prediction: Predicting the next action (e.g., click, type, API call) given the current screen state and user intent.
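To make the grounding task concrete, here is a minimal Python sketch of one common way to score it: a prediction counts as correct when the predicted click point falls inside the target element's bounding box. The `BoundingBox` class, its field names, and the `grounding_accuracy` function are illustrative assumptions, not the data schema or the metric defined by GUI-360.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned rectangle in screen pixels (hypothetical layout)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: float, y: float) -> bool:
        # Edges are treated as inside the box.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: List[Tuple[float, float]],
    targets: List[BoundingBox],
) -> float:
    """Fraction of predicted (x, y) points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(predictions)


if __name__ == "__main__":
    boxes = [BoundingBox(0, 0, 10, 10), BoundingBox(100, 100, 200, 150)]
    # The first point is inside its box, the second is not.
    print(grounding_accuracy([(5, 5), (50, 50)], boxes))
```

A point-in-box check like this is strict by design: a click a few pixels outside a small toolbar button counts as a miss, which is one reason desktop grounding demands high precision.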

Furthermore, GUI-360 incorporates a hybrid GUI+API action space, reflecting modern agent designs that combine direct graphical interface operations with higher-level application programming interface calls for efficiency.
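One way to picture a hybrid action space is as a union of two action types that an agent can emit interchangeably. The Python sketch below is only an illustration under that assumption: the class names, fields, and the `set_font_size` call are hypothetical and are not taken from the GUI-360 schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union


@dataclass(frozen=True)
class GuiAction:
    """A direct interface operation at a screen coordinate (hypothetical)."""
    kind: str  # e.g. "click" or "type"
    x: int
    y: int
    text: str = ""


@dataclass(frozen=True)
class ApiAction:
    """A higher-level application call with named arguments (hypothetical)."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


# An agent step may be either kind of action.
Action = Union[GuiAction, ApiAction]


def describe(action: Action) -> str:
    """Render an action as a compact, log-friendly string."""
    if isinstance(action, GuiAction):
        suffix = f" text={action.text!r}" if action.text else ""
        return f"gui:{action.kind}@({action.x},{action.y}){suffix}"
    if isinstance(action, ApiAction):
        args = ", ".join(
            f"{key}={value!r}" for key, value in sorted(action.arguments.items())
        )
        return f"api:{action.name}({args})"
    raise TypeError(f"unsupported action type: {type(action).__name__}")


if __name__ == "__main__":
    print(describe(GuiAction("click", 120, 48)))
    print(describe(ApiAction("set_font_size", {"size": 14})))
```

The appeal of the API branch is that a single call can replace a chain of menu clicks, while the GUI branch remains available for anything the application does not expose programmatically.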

How Was the Data Collected?

The collection process for GUI-360 involved three main stages:

  1. Query Acquisition: Real-world user queries were gathered from sources like search logs, community forums, and in-app help content, then augmented with synthetic variants. These queries were then instantiated into concrete, executable tasks within specific environment templates.
  2. Automatic Trajectory Collection: A specialized CUA called TrajAgent was developed to execute tasks automatically and consistently. It records detailed execution data, including screenshots, accessibility information, and agent actions. A two-stage execution strategy, using GPT-4o and then GPT-4.1 for failed tasks, significantly improved success rates.
  3. Evaluation and Post-processing: An evaluation agent (EvaAgent) validated trajectories, ensuring only successful and executable tasks were retained. Data was then sanitized and structured into a standardized JSON format for model consumption.
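The control flow of stages 2 and 3 can be sketched as a retry-then-filter loop: run each task with a primary executor, retry failures with a fallback executor, keep only trajectories that pass validation, and serialize the result to JSON. The Python below is a simplified sketch under stated assumptions. The function names, the executor and validator callables, and the record fields are all hypothetical stand-ins; the real TrajAgent and EvaAgent are not reproduced here, and how the actual pipeline decides that a task has failed is not specified in this article.

```python
import json
from typing import Any, Callable, Dict, List, Optional

Trajectory = Dict[str, Any]
# An executor returns a trajectory, or None if it could not finish the task.
Executor = Callable[[str], Optional[Trajectory]]


def collect_trajectories(
    tasks: List[str],
    primary: Executor,
    fallback: Executor,
    is_valid: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Run each task with `primary`, retry failures with `fallback`,
    and keep only trajectories accepted by `is_valid`."""
    kept: List[Trajectory] = []
    for task in tasks:
        trajectory = primary(task)
        if trajectory is None or not is_valid(trajectory):
            # Second stage: only tasks that failed the first pass are retried.
            trajectory = fallback(task)
        if trajectory is not None and is_valid(trajectory):
            kept.append(trajectory)
    return kept


def to_json(trajectories: List[Trajectory]) -> str:
    """Serialize retained trajectories to a JSON string."""
    return json.dumps(trajectories, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Stub executors standing in for real model-driven agents.
    def primary(task: str) -> Optional[Trajectory]:
        return {"task": task, "ok": True, "stage": 1} if task == "a" else None

    def fallback(task: str) -> Optional[Trajectory]:
        return {"task": task, "ok": task != "c", "stage": 2}

    kept = collect_trajectories(["a", "b", "c"], primary, fallback, lambda t: t["ok"])
    print(to_json(kept))
```

Retrying only the failed tasks keeps the second, presumably costlier, pass small while still recovering trajectories the first pass missed.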

Benchmarking State-of-the-Art Models

The researchers benchmarked various state-of-the-art vision-language models on GUI-360. The results revealed significant shortcomings in existing models when applied out-of-the-box, particularly in GUI grounding and action prediction. General-purpose models often struggled with the precision required for desktop environments.

However, supervised fine-tuning and reinforcement learning on the GUI-360 dataset yielded substantial performance gains. This highlights the dataset’s value as a training resource: models trained on it become better at understanding and interacting with complex desktop applications. Even with these improvements, the models remain far from human-level reliability, which makes GUI-360 a challenging and useful benchmark for future research.

Conclusion and Future Impact

GUI-360 represents a significant step forward for computer-using agents. By providing a large-scale, realistic, and comprehensive dataset and benchmark, it offers the research community a powerful tool to develop more robust and generalized desktop CUAs. The dataset and accompanying code are publicly available on Hugging Face, fostering reproducible research and accelerating progress towards intelligent agents that can truly master our digital workspaces.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
