Enhancing GUI Agents with Memory: A New Framework for History-Aware Reasoning

TLDR: A new research paper introduces the History-Aware Reasoning (HAR) framework, designed to overcome the short-term memory limitations of current GUI agents. By implementing a reflective learning scenario, tailored error correction guidelines, and a hybrid reinforcement learning reward function, HAR enables agents to learn from past mistakes and integrate historical interaction context into their decision-making. This results in the HAR-GUI-3B model, which demonstrates superior performance and generalization in complex, multi-step GUI automation tasks by fostering stable short-term memory and reliable screen perception.

Graphical User Interface (GUI) agents are becoming increasingly sophisticated and can now operate devices autonomously. These agents, powered by Multimodal Large Language Models (MLLMs), hold considerable potential for applications ranging from accessibility to automated testing. However, a significant challenge persists: equipping these agents with reliable ‘episodic reasoning’, the ability to remember past interactions and use that history to make better decisions in long, multi-step tasks.

Current GUI agents often suffer from a ‘short-term memory’ weakness. They tend to treat each screen interaction as a standalone event, ignoring the chain of previous actions that led to the current state. This ‘history-agnostic’ reasoning can severely hinder their performance, especially in complex, long-horizon tasks where context is crucial.

To address this limitation, researchers have introduced a novel framework called History-Aware Reasoning (HAR). This framework aims to transform GUI agents from being history-agnostic to ‘history-aware’, providing them with stable short-term memory and a more reliable understanding of screen details. The core idea behind HAR is to encourage agents to learn from their own errors and acquire episodic reasoning knowledge through tailored strategies.

The HAR framework is built upon three main components:

Constructing a Reflective Learning Scenario

Instead of simply training agents on correct actions, HAR creates a special environment where the agent can reflect on its mistakes. When an agent makes an incorrect prediction, this error is flagged as a ‘historically incorrect sample’.
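As a rough illustration of how such error samples might be gathered, the sketch below runs an agent over recorded steps and keeps the ones it gets wrong. The `Step` schema, the `collect_error_samples` helper, and the string encoding of actions are all assumptions made for this example; the article does not describe the paper's actual data format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One step of a recorded GUI episode (hypothetical schema)."""
    instruction: str      # the user's overall goal
    history: list[str]    # earlier actions in the same episode
    gold_action: str      # reference action for this step


@dataclass
class ErrorSample:
    """A step the agent got wrong, kept for later reflection."""
    step: Step
    wrong_action: str


def collect_error_samples(
    steps: list[Step], predict: Callable[[Step], str]
) -> list[ErrorSample]:
    """Run the agent on each step and flag every incorrect prediction."""
    errors = []
    for step in steps:
        predicted = predict(step)
        if predicted != step.gold_action:
            errors.append(ErrorSample(step=step, wrong_action=predicted))
    return errors


# Toy usage: a stand-in "agent" that ignores history and always clicks Search.
steps = [
    Step("Search for shoes, then open the cart", [], "click(search)"),
    Step("Search for shoes, then open the cart", ["click(search)"], "click(cart)"),
]
errors = collect_error_samples(steps, lambda step: "click(search)")
print(len(errors), errors[0].wrong_action)
```

In this toy run the history-agnostic stand-in repeats its first action, so only the second step is flagged. That is the kind of mistake the reflective scenario is meant to surface.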

Synthesizing Tailored Correction Guidelines

For each error, a more advanced ‘teacher model’ generates specific guidelines. These guidelines act as external knowledge, helping the agent understand *why* it made a mistake and providing clues for correct future predictions. This is crucial because simply re-training with correct answers doesn’t always fix the underlying reasoning flaw.
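One way to picture the guideline step is as a prompt assembled from the failed sample and sent to the teacher model. The sketch below only builds that prompt text. The function name, the prompt wording, and the choice to show the teacher the correct action are assumptions for illustration, not the paper's actual template, and no real model API is called.

```python
def build_guideline_prompt(
    instruction: str, history: list[str], wrong_action: str, gold_action: str
) -> str:
    """Assemble a request asking a teacher model to explain an error.

    The wording is hypothetical; only the idea of generating a tailored
    guideline for each error comes from the article.
    """
    history_text = "\n".join(
        f"{i}. {action}" for i, action in enumerate(history, start=1)
    ) or "(no previous actions)"
    return (
        f"Task: {instruction}\n"
        f"Previous actions:\n{history_text}\n"
        f"Agent's incorrect action: {wrong_action}\n"
        f"Correct action: {gold_action}\n"
        "Explain why the agent's action was wrong given the previous actions, "
        "and give a short guideline that would lead to the correct action."
    )


prompt = build_guideline_prompt(
    instruction="Search for shoes, then open the cart",
    history=["click(search)"],
    wrong_action="click(search)",
    gold_action="click(cart)",
)
print(prompt)
```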

Designing a Hybrid Reinforcement Learning (RL) Reward Function

The HAR framework uses a sophisticated reward system during training. This system doesn’t just reward correct actions; it also considers whether the agent’s thought process (its ‘Chain-of-Thought’) explicitly includes logical analysis of past interactions. This ‘Memory-Augmented Reward’ (MAR) specifically incentivizes the agent to leverage historical context. Additionally, the reward function is designed to encourage precise actions, especially for critical interactions like clicking on specific screen coordinates, by assigning higher rewards for accuracy and penalizing deviations.
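To make the shape of such a reward concrete, here is a minimal sketch that combines an action score with a memory bonus. Everything specific in it is an assumption: the weights, the linear distance decay for clicks, the `tolerance` value, the dictionary action format, and the use of a precomputed boolean `cot_uses_history` (standing in for whatever check decides that the Chain-of-Thought analyzes past interactions). The article does not give the paper's actual formula.

```python
import math


def click_reward(pred_xy, gold_xy, tolerance=0.1):
    """Score a click in [0, 1]: 1.0 on target, falling linearly to 0.0
    once the normalized screen distance reaches `tolerance` (assumed value)."""
    distance = math.dist(pred_xy, gold_xy)
    return max(0.0, 1.0 - distance / tolerance)


def hybrid_reward(pred, gold, cot_uses_history, w_action=1.0, w_memory=0.5):
    """Combine action correctness with a memory-augmented bonus.

    `pred` and `gold` are dicts with a "type" key, plus "xy" for clicks or
    "arg" for other actions (hypothetical format).
    """
    if pred["type"] != gold["type"]:
        action_score = 0.0
    elif gold["type"] == "click":
        # Precise clicks earn more; deviations are penalized.
        action_score = click_reward(pred["xy"], gold["xy"])
    else:
        action_score = 1.0 if pred.get("arg") == gold.get("arg") else 0.0

    # Bonus when the reasoning explicitly draws on earlier interactions.
    memory_score = 1.0 if cot_uses_history else 0.0
    return w_action * action_score + w_memory * memory_score


gold = {"type": "click", "xy": (0.5, 0.5)}
exact = hybrid_reward({"type": "click", "xy": (0.5, 0.5)}, gold, cot_uses_history=True)
near = hybrid_reward({"type": "click", "xy": (0.55, 0.5)}, gold, cot_uses_history=False)
wrong = hybrid_reward({"type": "type", "arg": "shoes"}, gold, cot_uses_history=False)
print(exact, round(near, 3), wrong)
```

In this sketch the memory bonus is granted independently of whether the action is correct. A real design might gate it on correctness so that an agent is not rewarded for history-flavored reasoning that leads to the wrong action.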

The training process involves two key stages. First, a ‘GUI Scenario Warm-up’ injects foundational domain-specific knowledge into the agent through supervised fine-tuning using a wide range of GUI-related data, including screen analysis, question-answering, and action summarization. This stage enhances the agent’s basic screen perception and action understanding.

Following the warm-up, the ‘Learning From Failure’ stage begins. Here, the agent undergoes reinforcement learning within the reflective scenario, using the tailored guidelines and the hybrid reward function to perform ‘error-aware cognitive corrections’. This process helps the agent develop its short-term memory and refine its reasoning mode. A second round of RL, employing a ‘task mixing training strategy’, further refines the agent’s ability to perceive screen visual details while maintaining its enhanced episodic reasoning.
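The schedule described above can be summarized as a small data structure. This is only a restatement of the article's description in code form. The `Stage` class and the "round 1" and "round 2" labels are naming choices made for this sketch, and no hyperparameters are shown because the article gives none.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One phase of training (hypothetical container for illustration)."""
    name: str
    method: str
    purpose: str


SCHEDULE = [
    Stage(
        "GUI Scenario Warm-up",
        "supervised fine-tuning",
        "basic screen perception and action understanding",
    ),
    Stage(
        "Learning From Failure, round 1",
        "reinforcement learning",
        "error-aware corrections using guidelines and the hybrid reward",
    ),
    Stage(
        "Learning From Failure, round 2",
        "reinforcement learning with task mixing",
        "sharper perception of screen details while keeping episodic reasoning",
    ),
]

for number, stage in enumerate(SCHEDULE, start=1):
    print(f"{number}. {stage.name} [{stage.method}]: {stage.purpose}")
```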

Using this innovative HAR framework, the researchers developed a native end-to-end model called HAR-GUI-3B. This model demonstrates significant improvements in handling GUI-oriented tasks, exhibiting stable short-term memory and reliable screen perception. Comprehensive evaluations across various GUI benchmarks show that HAR-GUI-3B consistently outperforms existing advanced methods, even those with more parameters. It also shows strong generalization capabilities in ‘out-of-distribution’ scenarios, such as challenging Chinese mini-program benchmarks.

The HAR framework represents a significant step forward in creating more intelligent and reliable GUI agents. By enabling agents to reflect on their errors and integrate historical context into their decision-making, it paves the way for more robust and adaptable automation of end-user devices. For more details, you can refer to the original research paper.
