Advancing AI Agents: Introducing GUI-360 for Desktop Tasks

TLDR: GUI-360 is a new, large-scale dataset and benchmark designed to improve computer-using agents (CUAs) that automate tasks in desktop environments such as Windows Office applications. It addresses three key challenges: the scarcity of real-world tasks, the lack of automated data collection, and the absence of a unified benchmark. The dataset, created using an LLM-augmented automated pipeline, contains over 1.2 million action steps with multimodal data. It supports three core tasks (GUI grounding, screen parsing, and action prediction) and shows that current AI models struggle with these tasks out of the box but improve significantly after fine-tuning on GUI-360.

In the rapidly evolving world of artificial intelligence, the dream of agents that can seamlessly interact with our digital environments is becoming a reality. These ‘computer-using agents’ (CUAs) promise to automate routine tasks, making our digital lives more efficient. However, developing truly robust CUAs for desktop environments, such as Windows Office applications, presents unique and significant challenges.

A new research paper introduces GUI-360, a groundbreaking dataset and benchmark suite designed to accelerate progress in this critical area. The paper, titled “GUI-360°: A Comprehensive Dataset and Benchmark for Computer-Using Agents,” was authored by Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang.

The researchers highlight three persistent gaps hindering CUA development: a scarcity of real-world tasks, the absence of automated data collection and annotation pipelines, and the lack of a unified benchmark for evaluating key capabilities like GUI grounding, screen parsing, and action prediction. GUI-360 directly addresses these issues.

What Makes GUI-360 Unique?

GUI-360 is a large-scale, comprehensive resource built with realism, scalability, and task breadth in mind. It features over 1.2 million executed action steps across thousands of trajectories in popular Windows Office applications: Word, Excel, and PowerPoint. The dataset includes full-resolution screenshots, accessibility metadata, natural-language goals, intermediate reasoning traces, and both successful and failed action trajectories.

One of its most innovative aspects is the LLM-augmented, largely automated pipeline used for its creation. This pipeline handles everything from sourcing user queries to constructing environment templates, instantiating tasks, executing them in batches, and filtering for quality. This automation minimizes human intervention while ensuring the data reflects real-world usage patterns.

The dataset supports three core tasks crucial for CUAs:

GUI Grounding: Identifying the precise screen location or UI element to interact with based on a given instruction.
Screen Parsing: Enumerating all interactable UI elements on a screen and their properties.
Action Prediction: Predicting the next action (e.g., click, type, API call) given the current screen state and user intent.
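
To make the three tasks easier to compare, here is a minimal Python sketch of what one sample for each task might look like. The class names, field names, file name, and pixel coordinates are all hypothetical and chosen for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    """One interactable control, as a screen parser might report it (hypothetical fields)."""
    name: str
    control_type: str
    bbox: tuple  # (left, top, right, bottom) in screenshot pixels


@dataclass
class GroundingSample:
    """GUI grounding: instruction + screenshot in, target element or location out."""
    instruction: str
    screenshot_path: str
    target: UIElement


@dataclass
class ScreenParsingSample:
    """Screen parsing: screenshot in, every interactable element out."""
    screenshot_path: str
    elements: list


@dataclass
class ActionPredictionSample:
    """Action prediction: goal + current screen in, next action out."""
    goal: str
    screenshot_path: str
    next_action: dict


# Illustrative values only; none of these come from GUI-360 itself.
bold = UIElement(name="Bold", control_type="Button", bbox=(120, 80, 144, 104))
grounding = GroundingSample("Click the Bold button", "step_003.png", bold)
parsing = ScreenParsingSample("step_003.png", [bold])
action = ActionPredictionSample(
    "Make the title bold", "step_003.png", {"type": "click", "target": "Bold"}
)
```

The point of the sketch is that all three tasks can share the same screenshot and element metadata while differing in what the model is asked to produce.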

Furthermore, GUI-360 incorporates a hybrid GUI+API action space, reflecting modern agent designs that combine direct graphical interface operations with higher-level application programming interface calls for efficiency.
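
As a rough illustration of a hybrid action space, the sketch below models GUI operations and API calls as two variants of one action type. The class names and the `insert_table` function are assumptions made for this example, not identifiers taken from GUI-360.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class GuiAction:
    """A direct interface operation such as a click or keystrokes (hypothetical shape)."""
    kind: str          # "click" or "type"
    target: str        # name of the UI element to act on
    text: str = ""     # only used when kind == "type"


@dataclass
class ApiAction:
    """A higher-level application call that replaces several GUI steps (hypothetical shape)."""
    function: str
    arguments: dict = field(default_factory=dict)


Action = Union[GuiAction, ApiAction]


def describe(action: Action) -> str:
    """Render either kind of action as a single readable line."""
    if isinstance(action, GuiAction):
        if action.kind == "type":
            return f"GUI: type {action.text!r} into {action.target}"
        return f"GUI: {action.kind} {action.target}"
    args = ", ".join(f"{key}={value!r}" for key, value in action.arguments.items())
    return f"API: {action.function}({args})"


# A plan that mixes both kinds of action; the names are made up for the example.
plan = [
    GuiAction("click", "Insert tab"),
    ApiAction("insert_table", {"rows": 3, "columns": 4}),
]
lines = [describe(step) for step in plan]
```

An agent working over such a type can pick a single API call where one exists and fall back to clicks and keystrokes where it does not.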

How Was the Data Collected?

The collection process for GUI-360 involved three main stages:

Query Acquisition: Real-world user queries were gathered from sources like search logs, community forums, and in-app help content, then augmented with synthetic variants. These queries were then instantiated into concrete, executable tasks within specific environment templates.
Automatic Trajectory Collection: A specialized CUA called TrajAgent was developed to execute tasks automatically and consistently. It records detailed execution data, including screenshots, accessibility information, and agent actions. A two-stage execution strategy, using GPT-4o and then GPT-4.1 for failed tasks, significantly improved success rates.
Evaluation and Post-processing: An evaluation agent (EvaAgent) validated trajectories, ensuring only successful and executable tasks were retained. Data was then sanitized and structured into a standardized JSON format for model consumption.
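
The control flow of the last two stages can be sketched as a retry-then-filter loop. In this minimal Python sketch, the executor and validator functions are stand-in stubs, the record fields are invented, and the exact ordering of retry and validation is an assumption; the real TrajAgent and EvaAgent are LLM-driven and are not reproduced here.

```python
import json
from typing import Callable, Optional

# Hypothetical record shape: {"task": str, "model": str, "steps": list}
Trajectory = dict
Executor = Callable[[str], Optional[Trajectory]]


def collect(tasks, primary: Executor, fallback: Executor, is_valid) -> list:
    """Run each task with the primary executor, retry failures with the
    fallback executor, and keep only trajectories the validator accepts."""
    kept = []
    for task in tasks:
        trajectory = primary(task)
        if trajectory is None:          # first stage failed, so try the second
            trajectory = fallback(task)
        if trajectory is not None and is_valid(trajectory):
            kept.append(trajectory)
    return kept


# Stubs standing in for the two model-backed executors and the evaluator.
def stub_primary(task: str) -> Optional[Trajectory]:
    if "pivot" in task:                 # pretend the first model fails here
        return None
    return {"task": task, "model": "primary", "steps": ["click", "type"]}


def stub_fallback(task: str) -> Optional[Trajectory]:
    return {"task": task, "model": "fallback", "steps": ["api_call"]}


def stub_validator(trajectory: Trajectory) -> bool:
    return len(trajectory["steps"]) > 0


records = collect(
    ["bold the title", "create a pivot table"],
    stub_primary,
    stub_fallback,
    stub_validator,
)
print(json.dumps(records, indent=2))    # structured JSON ready for training code
```

Retrying only the failed tasks keeps the second, presumably more expensive, model call off the tasks the first stage already completes.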

Benchmarking State-of-the-Art Models

The researchers benchmarked various state-of-the-art vision-language models on GUI-360. The results revealed significant shortcomings in existing models when applied out-of-the-box, particularly in GUI grounding and action prediction. General-purpose models often struggled with the precision required for desktop environments.

However, supervised fine-tuning and reinforcement learning on the GUI-360 dataset yielded substantial performance gains. This highlights the dataset’s immense value as a training resource, enabling models to adapt and improve their understanding and interaction capabilities for complex desktop tasks. Even with these improvements, the models are still far from human-level reliability, indicating that GUI-360 serves as a challenging yet essential benchmark for future research.
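
For readers who want a concrete sense of how grounding can be scored, the sketch below uses a common convention: a predicted click counts as correct when it lands inside the target element's bounding box. The article does not state the exact metric used in GUI-360, so this scoring rule is an assumption, and the coordinates are made up.

```python
def point_in_box(point, box) -> bool:
    """True if (x, y) lies inside (left, top, right, bottom), edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions, targets) -> float:
    """Fraction of predicted click points that fall inside their target boxes."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(targets)


# Illustrative values only: the first and third clicks hit, the second misses.
predicted_clicks = [(130, 90), (400, 300), (15, 15)]
target_boxes = [(120, 80, 144, 104), (10, 10, 50, 50), (0, 0, 30, 30)]
accuracy = grounding_accuracy(predicted_clicks, target_boxes)
```

Under a rule like this, small toolbar buttons leave little margin for error, which is one plausible reason general-purpose models struggle with the precision that desktop interfaces demand.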


Conclusion and Future Impact

GUI-360 represents a significant step forward for computer-using agents. By providing a large-scale, realistic, and comprehensive dataset and benchmark, it offers the research community a powerful tool to develop more robust and generalized desktop CUAs. The dataset and accompanying code are publicly available on Hugging Face, fostering reproducible research and accelerating progress towards intelligent agents that can truly master our digital workspaces.
