TLDR: GUI-PRA is a new framework that enhances Multimodal Large Language Model (MLLM)-powered GUI agents by providing dynamic, context-aware supervision. It tackles common issues like “lost in the middle” with a dynamic memory mechanism and “UI state-change blindness” with adaptive UI perception, leading to significantly improved success rates on complex GUI tasks compared to standard process reward models.
Graphical User Interface (GUI) agents, powered by advanced Multimodal Large Language Models (MLLMs), hold immense promise for automating digital tasks. However, these agents often face significant hurdles, particularly with tasks that require many steps or involve long interactions. They can get ‘lost in the middle’ when dealing with too much historical data, making it hard to evaluate the current step effectively. Furthermore, standard Process Reward Models (PRMs), which are designed to guide these agents, often lack awareness of how the UI changes after an action, leading to static evaluations that don’t match the dynamic nature of GUI tasks.
To address these critical challenges, researchers have introduced GUI-PRA, which stands for Process Reward Agent for GUI Tasks. This framework acts as a 'judge agent' that delivers more accurate process rewards than traditional PRMs. It achieves this by intelligently processing historical context and actively perceiving changes in the user interface.
Dynamic Memory for Better Context
One of GUI-PRA’s core innovations is its Dynamic Memory mechanism. This mechanism directly combats the ‘lost in the middle’ phenomenon. It consists of two main parts: a Relevance-based Retrieval Module, which actively fetches only the most pertinent information from long interaction histories, and a Progressive Summarization Module, which condenses growing interaction data into a concise narrative. This ensures that the model always focuses on the most relevant context, preventing it from being overwhelmed by unnecessary historical details.
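To make the two modules concrete, here is a minimal sketch of how a dynamic memory could work. The names (`Step`, `retrieve_relevant`, `progressive_summary`) and the token-overlap relevance score are illustrative assumptions, not the paper's implementation — a real system would use an embedding-based retriever and ask the MLLM itself to compress the summary.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    observation: str

def relevance(step: Step, query: str) -> float:
    # Toy relevance: token overlap between the step text and the query.
    # An embedding-based retriever would replace this in practice.
    step_tokens = set((step.action + " " + step.observation).lower().split())
    query_tokens = set(query.lower().split())
    return len(step_tokens & query_tokens) / max(len(query_tokens), 1)

def retrieve_relevant(history: list[Step], query: str, k: int = 3) -> list[Step]:
    # Relevance-based Retrieval: keep only the top-k most pertinent steps,
    # so the judge is not overwhelmed by a long interaction history.
    ranked = sorted(history, key=lambda s: relevance(s, query), reverse=True)
    return ranked[:k]

def progressive_summary(summary: str, new_step: Step, max_chars: int = 200) -> str:
    # Progressive Summarization: fold each new step into a running narrative,
    # bounding its length so it cannot grow without limit. A real system
    # would have the MLLM rewrite the summary rather than truncate it.
    updated = f"{summary} -> {new_step.action}".strip(" ->")
    return updated[-max_chars:]
```

The key design idea is that retrieval and summarization are complementary: retrieval pulls back exact details of the few steps that matter now, while the summary preserves a cheap global view of everything else.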
Adaptive UI Perception for Dynamic Environments
Another key feature is the Adaptive UI Perception mechanism. Standard PRMs often provide evaluations based solely on text, failing to recognize the visual consequences of actions. GUI-PRA overcomes this ‘state-change blindness’ by actively reasoning about UI state changes. It dynamically selects the most appropriate tools, such as OmniParser for a global UI analysis or Point for fine-grained, localized element grounding, to gather visual evidence. This ensures that its evaluations are always informed by the current visual reality of the task.
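A rough sketch of that tool dispatch might look like the following. The stub functions and the coordinate-based heuristic are assumptions for illustration; in GUI-PRA the choice between a global parse (OmniParser) and local grounding (Point) is made by the judge agent's own reasoning, not a fixed rule.

```python
def omniparser_stub(screenshot: bytes) -> dict:
    # Placeholder for a global UI parse: a structured list of every
    # element visible on the screen.
    return {"tool": "OmniParser", "elements": ["button:Submit", "field:Search"]}

def point_stub(screenshot: bytes, xy: tuple[int, int]) -> dict:
    # Placeholder for fine-grained grounding of the single element
    # located at coordinate (x, y).
    return {"tool": "Point", "element_at": xy}

def perceive(screenshot: bytes, last_action: dict) -> dict:
    # Simplified adaptive choice: if the previous action targeted a
    # specific coordinate, verify that local element; otherwise take a
    # global view of the (possibly changed) UI state.
    if "xy" in last_action:
        return point_stub(screenshot, last_action["xy"])
    return omniparser_stub(screenshot)
```

Either way, the evaluator always grounds its verdict in fresh visual evidence rather than assuming the screen looks the way the text history implies.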
How GUI-PRA Works
The GUI-PRA framework operates through a three-stage process. First, the Dynamic Memory module processes the raw interaction history into a condensed summary. Concurrently, the Adaptive UI Perception Mechanism actively reasons about the UI state to select the best tool for gathering visual evidence. Finally, in the Best-of-N Selection process, GUI-PRA integrates these two streams of information, along with the previous action and its score, to evaluate and select the optimal candidate action for the agent to take.
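The third stage can be sketched as a scoring loop over candidate actions. The reward heuristics below (does the action target a visible element? is it consistent with the summary?) are stand-ins invented for this sketch — in the actual framework an MLLM judge is prompted with the summary, the visual evidence, and the previous action and score.

```python
def score_candidate(candidate: str, summary: str, ui_evidence: dict,
                    prev_score: float) -> float:
    # Stand-in for the judge agent's reward. Real GUI-PRA would query an
    # MLLM; here we use two toy signals for illustration.
    reward = prev_score
    if any(e in candidate for e in ui_evidence.get("elements", [])):
        reward += 1.0  # action targets an element actually on screen
    if candidate.split()[-1] in summary:
        reward += 0.5  # action is consistent with progress so far
    return reward

def best_of_n(candidates: list[str], summary: str, ui_evidence: dict,
              prev_score: float = 0.0) -> str:
    # Best-of-N Selection: score every candidate action and return the
    # highest-scoring one for the agent to execute.
    scored = [(score_candidate(c, summary, ui_evidence, prev_score), c)
              for c in candidates]
    return max(scored)[1]
```

The important structural point is that both information streams — the condensed memory and the perceived UI state — feed into a single scoring function, so the selected action is judged against context and current reality at once.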
Significant Performance Improvements
Experiments were conducted on two online GUI benchmarks, AndroidWorld and Mobile-MiniWoB++. The results demonstrated GUI-PRA’s clear superiority. For instance, it boosted the average success rate of the Qwen2.5-VL model by 14.53% across both benchmarks, significantly outperforming the 8.56% gain achieved by a standard PRM baseline. The framework showed particular strength in handling ‘medium’ difficulty tasks, where it enabled a non-zero success rate for models that previously failed completely, and substantially enhanced performance for stronger models.
In conclusion, GUI-PRA offers a novel, training-free approach to supervising GUI agents, making them more reliable and efficient in dynamic digital environments. By intelligently managing historical context and actively perceiving UI changes, it addresses critical limitations of existing methods, paving the way for more capable automated assistants. You can read the full research paper here: GUI-PRA: Process Reward Agent for GUI Tasks.


