TLDR: PG-Agent is a new AI framework that improves how agents interact with graphical user interfaces (GUIs). Instead of just using sequential action records, it converts them into “page graphs” that explicitly map how different GUI pages are connected by actions. This graph structure, combined with Retrieval-Augmented Generation (RAG) and a multi-agent system, allows PG-Agent to better understand GUI environments, plan complex tasks, and adapt to new scenarios more effectively, even with limited training data.
Graphical User Interface (GUI) agents are becoming increasingly important for automating tasks on mobile devices and websites. These agents, often powered by advanced multimodal large language models (MLLMs), show great promise in interacting with user interfaces. However, a common challenge for existing GUI agents is their reliance on sequential records of operations, which don’t fully capture the complex ways different pages connect and transition.
This limitation makes it difficult for agents to truly understand the GUI environment and adapt to new situations. To address this, researchers have developed PG-Agent, a novel framework that transforms these sequential records into “page graphs.” These graphs explicitly model how pages are structured and naturally linked by user actions.
The core idea behind PG-Agent is to convert linear sequences of actions into a rich, interconnected graph. Imagine a map where each city is a page and the roads are the actions that take you from one page to another. This graph provides a much more comprehensive understanding of page transitions than simply remembering a single path.
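To make the "cities and roads" analogy concrete, here is a minimal sketch of such a page graph in Python. The class and method names (`PageGraph`, `add_transition`, and so on) are illustrative, not taken from the paper; the point is simply that pages become nodes and actions become labeled edges.

```python
class PageGraph:
    """Pages are nodes; actions that move between pages are labeled edges."""

    def __init__(self):
        # node id -> page description (e.g., a summary of the screen)
        self.nodes = {}
        # (src, dst) -> list of edge records: the action plus its task context
        self.edges = {}

    def add_page(self, page_id, description):
        self.nodes[page_id] = description

    def add_transition(self, src, dst, action, task):
        self.edges.setdefault((src, dst), []).append(
            {"action": action, "task": task}
        )

    def outgoing(self, page_id):
        """All transitions leaving a page -- the 'roads' out of a 'city'."""
        return {dst: recs for (s, dst), recs in self.edges.items()
                if s == page_id}


g = PageGraph()
g.add_page("home", "App home screen")
g.add_page("settings", "Settings page")
g.add_transition("home", "settings", "tap gear icon",
                 "change notification sound")
print(g.outgoing("home"))
# → {'settings': [{'action': 'tap gear icon', 'task': 'change notification sound'}]}
```

Because edges carry both the action and the task it served, the agent can later ask not only "where can I go from here?" but also "which action got a past user there, and why?"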
To effectively use these page graphs, PG-Agent incorporates Retrieval-Augmented Generation (RAG) technology. RAG helps the agent retrieve reliable “perception guidelines” from the page graphs. These guidelines inform a specially designed multi-agent framework within PG-Agent, which uses a task decomposition strategy to break down complex tasks into smaller, manageable sub-tasks. This allows the agent to generalize its knowledge and perform well even in scenarios it hasn’t encountered before.
How PG-Agent Works
The process involves two main parts: constructing the page graph and then using a multi-agent workflow.
Page Graph Construction
The construction of a page graph from sequential episodes is automated and proceeds in three stages:

- Page Jump Determination: the system first checks whether an action leads to a new page or is an in-page operation. Only actions that result in new pages are candidates for new nodes in the graph.
- Node Similarity Check: when a new page appears, it is compared against the pages already in the graph using both semantic (meaning-based) and pixel-level (visual) comparisons to avoid redundancy.
- Page Graph Update: if the page is unique, a new node is created; if it matches an existing node, that node is reused. The connecting action is then added as an edge, annotated with a summary of the action and the task description.

In this way, the graph stores not just pages, but also the actions and contexts that link them.
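The three stages above can be sketched as a single pass over the recorded episodes. This is a simplified reconstruction, not the paper's implementation: `semantic_sim` and `pixel_sim` stand in for whatever comparison models are actually used, and the thresholds are illustrative.

```python
def build_page_graph(episodes, semantic_sim, pixel_sim,
                     sem_thresh=0.9, pix_thresh=0.95):
    """Fold sequential episodes into (nodes, edges).

    episodes: lists of steps, each step a dict with "page", "action", "task",
    and an "is_new_page" flag. All field names are assumptions for this sketch.
    """
    nodes = {}   # node_id -> page description
    edges = []   # (src_id, dst_id, action, task)
    for episode in episodes:
        prev_id = None
        for step in episode:
            # Stage 1: page jump determination -- skip in-page operations
            if not step.get("is_new_page", True):
                continue
            page = step["page"]
            # Stage 2: node similarity check against existing nodes
            match = next((nid for nid, p in nodes.items()
                          if semantic_sim(page, p) > sem_thresh
                          and pixel_sim(page, p) > pix_thresh), None)
            # Stage 3: update -- reuse the matched node or create a new one,
            # then record the connecting action as an edge
            node_id = match if match is not None else f"page_{len(nodes)}"
            if match is None:
                nodes[node_id] = page
            if prev_id is not None:
                edges.append((prev_id, node_id, step["action"], step["task"]))
            prev_id = node_id
    return nodes, edges
```

Note how the similarity check is what turns many linear episodes into one shared graph: revisiting a known page adds an edge back to the existing node instead of duplicating it.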
Multi-Agent Workflow
PG-Agent uses a multi-agent system to plan and execute tasks, with the page graph providing guidance throughout:

- Guidelines Retrieval: the system retrieves relevant guidelines from the page graph based on the current screen state. These guidelines suggest possible actions and the tasks they can accomplish.
- Global Planning Agent: breaks the user's main task into a high-level sequence of sub-tasks.
- Observation Agent: analyzes the current screen, providing detailed visual and functional descriptions, and reviews past interactions to gauge task progress.
- Sub-Task Planning Agent: selects the most appropriate sub-task from the global plan, generates a detailed plan for it, and proposes a list of candidate actions, drawing heavily on the retrieved guidelines.
- Decision Agent: chooses the specific action to perform on the current screen, combining the information gathered from the other agents and the guidelines to advance the task.
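The workflow can be sketched as one retrieval step feeding a chain of model calls. Everything here is a stand-in: `llm` represents an MLLM invocation, `similarity` represents the screen-matching model, and the prompt wording is hypothetical; the real system passes far richer context between agents.

```python
def retrieve_guidelines(screen, nodes, edges, similarity, top_k=3):
    """RAG step: find stored pages similar to the current screen and surface
    the actions (with task context) recorded on their outgoing edges."""
    ranked = sorted(nodes, key=lambda nid: similarity(screen, nodes[nid]),
                    reverse=True)[:top_k]
    return [f"On a page like '{nodes[nid]}', '{action}' served the task: {task}"
            for nid in ranked
            for src, dst, action, task in edges if src == nid]


def run_step(task, screen, history, nodes, edges, similarity, llm):
    """One pass through the four agents, each modeled as a prompted LLM call."""
    guidelines = retrieve_guidelines(screen, nodes, edges, similarity)
    plan = llm(f"Split into sub-tasks: {task}")                      # Global Planning Agent
    obs = llm(f"Describe screen and progress: {screen}; {history}")  # Observation Agent
    sub = llm(f"Pick the next sub-task and candidate actions given "
              f"plan {plan}, observation {obs}, "
              f"guidelines {guidelines}")                            # Sub-Task Planning Agent
    return llm(f"Choose one action: {sub}")                          # Decision Agent
```

The key design point this sketch illustrates is that the retrieved guidelines are injected at the sub-task planning stage, so candidate actions are grounded in transitions that actually worked in past episodes rather than guessed from the screen alone.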
Extensive experiments were conducted on various benchmark datasets, including Android in the Wild (AITW), Mind2Web, and GUI Odyssey. The results consistently showed that PG-Agent is highly effective, even when only a limited number of episodes are used to build the initial page graph. This demonstrates the framework’s practicality and ability to generalize to unseen scenarios.
The research highlights that explicitly modeling the relationships between GUI pages as a graph, combined with intelligent retrieval of knowledge, significantly enhances the ability of AI agents to navigate and interact with complex user interfaces. For more technical details, you can refer to the original research paper: PG-Agent: An Agent Powered by Page Graph.