TLDR: A new research paper introduces the History-Aware Reasoning (HAR) framework, designed to overcome the short-term memory limitations of current GUI agents. By implementing a reflective learning scenario, tailored error correction guidelines, and a hybrid reinforcement learning reward function, HAR enables agents to learn from past mistakes and integrate historical interaction context into their decision-making. This results in the HAR-GUI-3B model, which demonstrates superior performance and generalization in complex, multi-step GUI automation tasks by fostering stable short-term memory and reliable screen perception.
In the rapidly evolving world of artificial intelligence, Graphical User Interface (GUI) agents are becoming increasingly sophisticated and can now operate devices autonomously. These agents, powered by Multimodal Large Language Models (MLLMs), hold immense potential for applications ranging from accessibility to automated testing. However, a significant challenge persists: equipping these agents with reliable ‘episodic reasoning’, the ability to remember past interactions and use that history to make better decisions in long, multi-step tasks.
Current GUI agents often suffer from a ‘short-term memory’ weakness. They tend to treat each screen interaction as a standalone event, ignoring the chain of previous actions that led to the current state. This ‘history-agnostic’ reasoning can severely hinder their performance, especially in complex, long-horizon tasks where context is crucial.
To address this limitation, researchers have introduced a novel framework called History-Aware Reasoning (HAR). This framework aims to transform GUI agents from being history-agnostic to ‘history-aware’, providing them with stable short-term memory and a more reliable understanding of screen details. The core idea behind HAR is to encourage agents to learn from their own errors and acquire episodic reasoning knowledge through tailored strategies.
The HAR framework is built upon three main components:
Constructing a Reflective Learning Scenario
Instead of simply training agents on correct actions, HAR creates a special environment where the agent can reflect on its mistakes. When the agent makes an incorrect prediction, that prediction is flagged as a ‘historically incorrect sample’.
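To make the flagging step concrete, here is a minimal sketch of how such samples could be collected. It is an illustration only, not the paper's implementation: the `Step` record, the `collect_incorrect_samples` helper, and the toy agent are all hypothetical names, and real actions would be richer than plain strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Step:
    """One decision point in an episode (hypothetical structure)."""
    instruction: str            # the task goal in natural language
    history: Tuple[str, ...]    # actions already taken in this episode
    gold_action: str            # reference action for the current screen


def collect_incorrect_samples(
    steps: List[Step], predict: Callable[[Step], str]
) -> List[Tuple[Step, str]]:
    """Return (step, wrong_prediction) pairs where the agent missed the reference."""
    flagged = []
    for step in steps:
        predicted = predict(step)
        if predicted != step.gold_action:
            flagged.append((step, predicted))
    return flagged


def history_agnostic_agent(step: Step) -> str:
    # Toy stand-in for a model: it ignores step.history entirely
    # and always taps the search box.
    return "click(search_box)"


steps = [
    Step("Search for shoes", (), "click(search_box)"),
    Step("Search for shoes", ("click(search_box)",), "type('shoes')"),
]
flagged = collect_incorrect_samples(steps, history_agnostic_agent)
```

In this toy episode only the second step is flagged: the agent repeats the tap it has already performed, which is exactly the kind of history-agnostic error the reflective scenario is meant to surface.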
Synthesizing Tailored Correction Guidelines
For each error, a more advanced ‘teacher model’ generates specific guidelines. These guidelines act as external knowledge, helping the agent understand *why* it made a mistake and providing clues for correct future predictions. This is crucial because simply re-training with correct answers doesn’t always fix the underlying reasoning flaw.
Designing a Hybrid Reinforcement Learning (RL) Reward Function
The HAR framework uses a sophisticated reward system during training. This system doesn’t just reward correct actions; it also considers whether the agent’s thought process (its ‘Chain-of-Thought’) explicitly includes logical analysis of past interactions. This ‘Memory-Augmented Reward’ (MAR) specifically incentivizes the agent to leverage historical context. Additionally, the reward function is designed to encourage precise actions, especially for critical interactions like clicking on specific screen coordinates, by assigning higher rewards for accuracy and penalizing deviations.
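A rough sketch of how such a reward might be combined is shown below. Everything specific here is an assumption made for illustration: the weights, the linear distance decay, the `tolerance` value, and the substring check used to detect references to history are not taken from the paper, which may score the reasoning text in a very different way.

```python
import math


def click_reward(pred_xy, gold_xy, tolerance=0.14):
    """Reward in [0, 1] that falls linearly with distance between
    normalised screen coordinates and reaches 0 at `tolerance`."""
    dist = math.dist(pred_xy, gold_xy)
    return max(0.0, 1.0 - dist / tolerance)


def memory_reward(thought, history):
    """1.0 if the reasoning text mentions at least one earlier action,
    else 0.0. A crude stand-in for checking history-aware reasoning."""
    text = thought.lower()
    return 1.0 if any(past.lower() in text for past in history) else 0.0


def hybrid_reward(pred, gold, thought, history, w_action=0.8, w_memory=0.2):
    """Weighted sum of an action-correctness term and a memory term."""
    if pred["type"] != gold["type"]:
        action = 0.0
    elif gold["type"] == "click":
        action = click_reward(pred["xy"], gold["xy"])
    else:
        action = 1.0 if pred.get("arg") == gold.get("arg") else 0.0
    return w_action * action + w_memory * memory_reward(thought, history)


gold = {"type": "click", "xy": (0.50, 0.50)}
pred = {"type": "click", "xy": (0.50, 0.50)}
history = ["opened settings"]

r_aware = hybrid_reward(
    pred, gold, "I already opened settings, so now I tap Wi-Fi.", history
)
r_agnostic = hybrid_reward(pred, gold, "I tap Wi-Fi.", history)
```

With identical actions, the response whose reasoning refers back to the earlier step scores higher than the one that does not, which is the incentive the memory-augmented term is meant to provide. The distance-based click term likewise gives partial credit to near misses and none to taps far from the target.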
The training process involves two key stages. First, a ‘GUI Scenario Warm-up’ injects foundational domain-specific knowledge into the agent through supervised fine-tuning using a wide range of GUI-related data, including screen analysis, question-answering, and action summarization. This stage enhances the agent’s basic screen perception and action understanding.
Following the warm-up, the ‘Learning From Failure’ stage begins. Here, the agent undergoes reinforcement learning within the reflective scenario, using the tailored guidelines and the hybrid reward function to perform ‘error-aware cognitive corrections’. This process helps the agent develop its short-term memory and refine its reasoning mode. A second round of RL, employing a ‘task mixing training strategy’, further refines the agent’s ability to perceive screen visual details while maintaining its enhanced episodic reasoning.
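The article does not spell out how the task mixing works, so the following is only one plausible reading: each training batch draws from both an episodic-reasoning pool and a screen-perception pool at a fixed ratio. The function name, the pool contents, and the `perception_ratio` parameter are hypothetical.

```python
import random


def mixed_batches(reasoning_pool, perception_pool, batch_size,
                  perception_ratio, seed=0):
    """Yield batches that combine samples from two task pools.

    Each batch holds round(batch_size * perception_ratio) perception
    samples and fills the rest with episodic-reasoning samples.
    Both pools must be at least as large as their share of a batch.
    """
    rng = random.Random(seed)
    n_perception = round(batch_size * perception_ratio)
    n_reasoning = batch_size - n_perception
    while True:
        batch = (rng.sample(reasoning_pool, n_reasoning)
                 + rng.sample(perception_pool, n_perception))
        rng.shuffle(batch)  # avoid a fixed task order within the batch
        yield batch


# Toy pools: each sample is tagged with the task it belongs to.
reasoning_pool = [("reasoning", i) for i in range(10)]
perception_pool = [("perception", i) for i in range(10)]

batch = next(mixed_batches(reasoning_pool, perception_pool,
                           batch_size=8, perception_ratio=0.25))
```

Keeping both task types in every batch is a common way to add a new skill without eroding one learned earlier, which matches the stated goal of sharpening visual perception while preserving episodic reasoning.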
Using this innovative HAR framework, the researchers developed a native end-to-end model called HAR-GUI-3B. This model demonstrates significant improvements in handling GUI-oriented tasks, exhibiting stable short-term memory and reliable screen perception. Comprehensive evaluations across various GUI benchmarks show that HAR-GUI-3B consistently outperforms existing advanced methods, even those with more parameters. It also shows strong generalization capabilities in ‘out-of-distribution’ scenarios, such as challenging Chinese mini-program benchmarks.
The HAR framework represents a significant step forward in creating more intelligent and reliable GUI agents. By enabling agents to reflect on their errors and integrate historical context into their decision-making, it paves the way for more robust and adaptable automation of end-user devices. For more details, you can refer to the original research paper.


