TLDR: MGA (Memory-Driven GUI Agent) is a new framework for AI agents interacting with graphical user interfaces. It addresses common problems like error propagation and local exploration bias by adopting an “observe first, then decide” principle. MGA uses a structured memory and task-agnostic observation to treat each interaction step as an independent, context-rich state, leading to improved robustness, generalization, and efficiency in complex tasks compared to existing methods.
The rapid advancement of Large Language Models (LLMs) and their multimodal counterparts (MLLMs) has paved the way for sophisticated AI agents capable of interacting with diverse environments. A particularly challenging yet impactful area is the development of GUI (Graphical User Interface) agents, which must navigate complex desktop and web interfaces with robustness and generality. Traditional approaches often struggle with error propagation across long chains of actions and a tendency to make decisions before fully observing the interface, leading to overlooked critical cues.
A new research paper introduces the Memory-Driven GUI Agent (MGA) framework, which redefines how GUI agents interact by prioritizing observation before decision-making. This innovative approach models each interaction step as an independent, yet context-rich, environment state. This state is represented by three key components: the current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. This design aims to overcome the limitations of relying heavily on historical action trajectories and local exploration biases.
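As a rough sketch, the per-step environment state described above can be pictured as a simple record holding the three components. The field names and types here are illustrative assumptions for clarity; the paper does not publish a concrete schema:

```python
from dataclasses import dataclass, field

@dataclass
class EnvState:
    """One interaction step's environment state (illustrative sketch).

    Mirrors the three components named in the MGA paper: the current
    screenshot, task-agnostic spatial information, and structured memory.
    Field names are assumptions, not the paper's actual interface.
    """
    screenshot: bytes                 # raw pixels of the current GUI frame
    spatial_info: dict                # task-agnostic layout / element data
    memory: list = field(default_factory=list)  # structured memory units

# Each step is modeled as an independent state, so nothing here encodes a
# raw action trajectory; history enters only through abstracted memory units.
state = EnvState(screenshot=b"\x89PNG...", spatial_info={"elements": []})
state.memory.append({"step": 1, "effect": "dialog opened"})
```

Because each step carries its own complete state, the planner never has to replay the full action history to decide what to do next.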
How MGA Works: A Modular Approach
The MGA framework is built upon four core modules that work in concert to enable intelligent GUI interaction:
- Observer: This module is responsible for systematically transforming the current GUI into a structured, task-agnostic observation. Instead of just looking at pixels, it extracts crucial information like spatial layout, semantic roles of elements, an inventory of all interactive elements (buttons, input fields, menus), and contextual state information (like pop-up warnings or loading bars). This ensures the agent has a complete and unbiased understanding of the interface before making any decisions.
- Memory Agent: Unlike systems that simply record raw action sequences, the Memory Agent abstracts historical interactions into higher-level, structured memory units. This memory captures interface state evolution, analyzes the effects of past operations, recognizes behavioral patterns (like inefficient loops), identifies and classifies issues (such as redundant actions), and verifies state consistency. Crucially, this memory doesn’t dictate the next action but provides a de-biased, de-redundant, and evolution-aware foundation for the planning agent, helping it avoid blind dependence on past behaviors.
- Planner: The Planner takes the current screenshot, the structured spatial information from the Observer, the distilled memory from the Memory Agent, and the user’s high-level instruction to reason about the next logical step. It generates intermediate reasoning traces, or “Thoughts,” and then predicts a concrete next action in natural language. This step-wise reasoning, informed by memory, allows for flexible decision-making without replaying the entire history.
- Grounding Agent: This module translates the Planner’s natural language action specifications into executable low-level GUI interactions, such as clicks, typing, or scrolling. It uses the structured spatial information to precisely locate the intended target on the screen. After execution, the Grounding Agent updates the environment state, closing the loop and allowing the agent to continuously refine its trajectory.
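Putting the four modules together, the observe-remember-plan-ground cycle can be sketched as below. The module interfaces, names, and return shapes are hypothetical; the paper describes the modules in prose, not code:

```python
class MGALoop:
    """Illustrative sketch of MGA's per-step interaction loop.

    Interfaces are assumptions: each module is any callable with the
    signature used in step(). The real system wires MLLM-backed agents.
    """

    def __init__(self, observer, memory_agent, planner, grounder):
        self.observer = observer
        self.memory_agent = memory_agent
        self.planner = planner
        self.grounder = grounder
        self.memory = []  # structured memory units, not raw action logs

    def step(self, screenshot, instruction):
        # 1. Observer: build a task-agnostic structured observation first
        observation = self.observer(screenshot)
        # 2. Memory Agent: abstract history into de-biased memory units
        self.memory = self.memory_agent(self.memory, observation)
        # 3. Planner: produce a "Thought" and a natural-language action
        thought, action = self.planner(
            screenshot, observation, self.memory, instruction
        )
        # 4. Grounding Agent: translate the action into a concrete GUI event
        return self.grounder(action, observation)
```

A usage sketch with stub modules shows the flow: `loop.step(screenshot, "close the dialog")` observes the screen, updates memory, plans, and returns a grounded event such as a click on a located element. Note that observation happens before any decision, reflecting the "observe first, then decide" principle.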
Key Advantages and Performance
The MGA framework offers substantial gains in robustness, generalization, and efficiency. By decoupling current decisions from historical trajectory inertia and grounding agents in comprehensive observations rather than local task priors, MGA ensures more stable and reliable long-horizon execution.
Experiments conducted on the OSWorld benchmark, which includes real desktop applications like Chrome, VSCode, and VLC, demonstrated MGA’s superior performance. It consistently outperformed state-of-the-art baselines, including GTA1, particularly on complex, long-horizon tasks. For instance, on professional applications such as GIMP and VS Code, MGA achieved 83.0% accuracy compared to GTA1’s 77.6%. In daily applications, MGA showed an even larger lead, reaching 62.8% against GTA1’s 44.9%.
Ablation studies further confirmed the complementary importance of both the structured observation (spatial-semantic grounding) and the abstract memory. Removing either component led to a noticeable drop in performance, highlighting their crucial roles in maintaining temporal rationality and spatial accuracy.
Looking Ahead
The MGA framework represents a significant step toward more intelligent and human-like GUI agents. By emphasizing an “observe first, then decide” paradigm and leveraging a dynamic, structured memory, it addresses fundamental challenges in autonomous GUI interaction. While the current design focuses on mouse-and-keyboard simulation to emulate human behavior, future work could integrate code-level execution, as in approaches such as CoACT-1, to achieve even greater efficiency for system-level operations. For more details, refer to the original research paper.