TLDR: PG-Agent is a new AI framework that improves how agents interact with graphical user interfaces (GUIs). Instead of just using sequential action records, it converts them into “page graphs” that explicitly map how different GUI pages are connected by actions. This graph structure, combined with Retrieval-Augmented Generation (RAG) and a multi-agent system, allows PG-Agent to better understand GUI environments, plan complex tasks, and adapt to new scenarios more effectively, even with limited training data.
Graphical User Interface (GUI) agents are becoming increasingly important for automating tasks on mobile devices and websites. These agents, often powered by advanced multimodal large language models (MLLMs), show great promise in interacting with user interfaces. However, a common challenge for existing GUI agents is their reliance on sequential records of operations, which don’t fully capture the complex ways different pages connect and transition.
This limitation makes it difficult for agents to truly understand the GUI environment and adapt to new situations. To address this, researchers have developed PG-Agent, a novel framework that transforms these sequential records into “page graphs.” These graphs explicitly model how pages are structured and naturally linked by user actions.
The core idea behind PG-Agent is to convert linear sequences of actions into a rich, interconnected graph. Imagine a map where each city is a page and the roads are the actions that take you from one page to another. This graph provides a much more comprehensive understanding of page transitions than simply remembering a single path.
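To make the "cities and roads" analogy concrete, here is a minimal sketch of such a page graph in Python. The class and method names (`PageGraph`, `add_transition`, and so on) are illustrative, not taken from the paper; the point is simply that pages become nodes and actions become labeled edges.

```python
class PageGraph:
    """Pages are nodes; actions that move between pages are labeled edges."""

    def __init__(self):
        # node id -> page description (e.g., a summary of the screen)
        self.nodes = {}
        # (src, dst) -> list of edge records: the action plus its task context
        self.edges = {}

    def add_page(self, page_id, description):
        self.nodes[page_id] = description

    def add_transition(self, src, dst, action, task):
        self.edges.setdefault((src, dst), []).append(
            {"action": action, "task": task}
        )

    def outgoing(self, page_id):
        """All transitions leaving a page -- the 'roads' out of a 'city'."""
        return {dst: recs for (s, dst), recs in self.edges.items()
                if s == page_id}


g = PageGraph()
g.add_page("home", "App home screen")
g.add_page("settings", "Settings page")
g.add_transition("home", "settings", "tap gear icon",
                 "change notification sound")
print(g.outgoing("home"))
# → {'settings': [{'action': 'tap gear icon', 'task': 'change notification sound'}]}
```

Because edges carry both the action and the task it served, the agent can later ask not only "where can I go from here?" but also "which action got a past user there, and why?"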
To effectively use these page graphs, PG-Agent incorporates Retrieval-Augmented Generation (RAG) technology. RAG helps the agent retrieve reliable “perception guidelines” from the page graphs. These guidelines inform a specially designed multi-agent framework within PG-Agent, which uses a task decomposition strategy to break down complex tasks into smaller, manageable sub-tasks. This allows the agent to generalize its knowledge and perform well even in scenarios it hasn’t encountered before.
How PG-Agent Works
The process involves two main parts: constructing the page graph and then using a multi-agent workflow.
Page Graph Construction
The construction of a page graph from sequential episodes is automated and proceeds in three stages:

- Page Jump Determination: the system first checks whether an action leads to a new page or is an in-page operation. Only actions that result in new pages are candidates for new nodes in the graph.
- Node Similarity Check: when a new page appears, it is compared against the pages already in the graph using both semantic (meaning-based) and pixel-level (visual) comparisons to avoid redundancy.
- Page Graph Update: if the page is unique, a new node is created; if it matches an existing node, that node is reused. The connecting action is then added as an edge, annotated with a summary of the action and the task description.

In this way, the graph stores not just pages, but also the actions and contexts that link them.
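The three stages above can be sketched as a single pass over the recorded episodes. This is a simplified reconstruction, not the paper's implementation: `semantic_sim` and `pixel_sim` stand in for whatever comparison models are actually used, and the thresholds are illustrative.

```python
def build_page_graph(episodes, semantic_sim, pixel_sim,
                     sem_thresh=0.9, pix_thresh=0.95):
    """Fold sequential episodes into (nodes, edges).

    episodes: lists of steps, each step a dict with "page", "action", "task",
    and an "is_new_page" flag. All field names are assumptions for this sketch.
    """
    nodes = {}   # node_id -> page description
    edges = []   # (src_id, dst_id, action, task)
    for episode in episodes:
        prev_id = None
        for step in episode:
            # Stage 1: page jump determination -- skip in-page operations
            if not step.get("is_new_page", True):
                continue
            page = step["page"]
            # Stage 2: node similarity check against existing nodes
            match = next((nid for nid, p in nodes.items()
                          if semantic_sim(page, p) > sem_thresh
                          and pixel_sim(page, p) > pix_thresh), None)
            # Stage 3: update -- reuse the matched node or create a new one,
            # then record the connecting action as an edge
            node_id = match if match is not None else f"page_{len(nodes)}"
            if match is None:
                nodes[node_id] = page
            if prev_id is not None:
                edges.append((prev_id, node_id, step["action"], step["task"]))
            prev_id = node_id
    return nodes, edges
```

Note how the similarity check is what turns many linear episodes into one shared graph: revisiting a known page adds an edge back to the existing node instead of duplicating it.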
Multi-Agent Workflow
PG-Agent uses a multi-agent system to plan and execute tasks, with the page graph providing guidance throughout:

- Guidelines Retrieval: the system retrieves relevant guidelines from the page graph based on the current screen state. These guidelines suggest possible actions and the tasks they can accomplish.
- Global Planning Agent: breaks the user's main task into a high-level sequence of sub-tasks.
- Observation Agent: analyzes the current screen, providing detailed visual and functional descriptions, and reviews past interactions to gauge task progress.
- Sub-Task Planning Agent: selects the most appropriate sub-task from the global plan, generates a detailed plan for it, and proposes a list of candidate actions, drawing heavily on the retrieved guidelines.
- Decision Agent: chooses the specific action to perform on the current screen, combining the information gathered from the other agents and the guidelines to advance the task.
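The workflow can be sketched as one retrieval step feeding a chain of model calls. Everything here is a stand-in: `llm` represents an MLLM invocation, `similarity` represents the screen-matching model, and the prompt wording is hypothetical; the real system passes far richer context between agents.

```python
def retrieve_guidelines(screen, nodes, edges, similarity, top_k=3):
    """RAG step: find stored pages similar to the current screen and surface
    the actions (with task context) recorded on their outgoing edges."""
    ranked = sorted(nodes, key=lambda nid: similarity(screen, nodes[nid]),
                    reverse=True)[:top_k]
    return [f"On a page like '{nodes[nid]}', '{action}' served the task: {task}"
            for nid in ranked
            for src, dst, action, task in edges if src == nid]


def run_step(task, screen, history, nodes, edges, similarity, llm):
    """One pass through the four agents, each modeled as a prompted LLM call."""
    guidelines = retrieve_guidelines(screen, nodes, edges, similarity)
    plan = llm(f"Split into sub-tasks: {task}")                      # Global Planning Agent
    obs = llm(f"Describe screen and progress: {screen}; {history}")  # Observation Agent
    sub = llm(f"Pick the next sub-task and candidate actions given "
              f"plan {plan}, observation {obs}, "
              f"guidelines {guidelines}")                            # Sub-Task Planning Agent
    return llm(f"Choose one action: {sub}")                          # Decision Agent
```

The key design point this sketch illustrates is that the retrieved guidelines are injected at the sub-task planning stage, so candidate actions are grounded in transitions that actually worked in past episodes rather than guessed from the screen alone.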
Extensive experiments were conducted on various benchmark datasets, including Android in the Wild (AITW), Mind2Web, and GUI Odyssey. The results consistently showed that PG-Agent is highly effective, even when only a limited number of episodes are used to build the initial page graph. This demonstrates the framework’s practicality and ability to generalize to unseen scenarios.
The research highlights that explicitly modeling the relationships between GUI pages as a graph, combined with intelligent retrieval of knowledge, significantly enhances the ability of AI agents to navigate and interact with complex user interfaces. For more technical details, you can refer to the original research paper: PG-Agent: An Agent Powered by Page Graph.