TL;DR: RecAgent is a new AI agent for mobile applications that improves automation by addressing two key challenges: perceptual uncertainty (too much on-screen information) and decision uncertainty (ambiguous tasks). It uses a component recommendation system to focus on relevant UI elements and an interactive module to ask users for feedback when unsure, leading to more accurate and reliable task execution. A new dataset, ComplexAction, was also introduced to evaluate single-step action accuracy in complex mobile interfaces.
Graphical User Interface (GUI) agents are designed to automate tasks on mobile applications, from ordering food to booking tickets. While these AI systems have made significant strides, they often face two major hurdles: dealing with too much information on the screen (input redundancy) and making choices when the task is unclear (decision ambiguity).
Imagine an AI trying to find a search bar on a cluttered music app screen, or deciding what level of sweetness to choose when ordering coffee for a user who just said “help me order a coffee.” These are examples of the challenges that can lead to inefficiency and unsatisfactory results.
A new research paper introduces RecAgent, an innovative uncertainty-aware GUI agent designed to tackle these very problems through adaptive perception and human collaboration. RecAgent distinguishes between two types of uncertainty: perceptual uncertainty, which comes from overwhelming screen information, and decision uncertainty, which arises from ambiguous tasks.
How RecAgent Handles Perceptual Uncertainty
To reduce the clutter and help the agent focus, RecAgent employs a clever Component Recommendation Module (CRM). Instead of processing every single UI element on the screen, which can number in the hundreds, the CRM acts like a smart filter. It identifies and prioritizes only the most relevant UI elements based on the current task. This is achieved through multiple pathways:
- Keyword Matching: Directly matching keywords from the task (like “search” or “submit”) with text on UI elements.
- Semantic Matching: Using advanced language models to understand the meaning and relevance between the task and UI elements.
- LLM-based Intent Recommendation: A large language model analyzes the context of both the task and UI elements to recommend highly confident matches.
By combining these pathways, RecAgent significantly reduces the amount of input information, making its perception more accurate and efficient. For instance, if the goal is to open a shopping app, it will highlight only the shopping app icons, ignoring dozens of other irrelevant elements.
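The three pathways can be sketched as independent filters whose results are merged. This is a minimal illustration, not the paper's implementation: the element fields, the `score_fn` similarity callback (standing in for a sentence-embedding model), and the `llm_pick` callback (standing in for the LLM-based intent pathway) are all hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    elem_id: int
    text: str

def keyword_match(task: str, elements: list[UIElement]) -> set[int]:
    """Pathway 1: keep elements whose text shares a word with the task."""
    task_words = set(task.lower().split())
    return {e.elem_id for e in elements
            if task_words & set(e.text.lower().split())}

def semantic_match(task: str, elements: list[UIElement],
                   score_fn, threshold: float = 0.7) -> set[int]:
    """Pathway 2: keep elements whose similarity to the task, as judged
    by an embedding model (here an injected callback), clears a threshold."""
    return {e.elem_id for e in elements
            if score_fn(task, e.text) >= threshold}

def recommend(task: str, elements: list[UIElement], score_fn,
              llm_pick=lambda task, elements: set()) -> set[int]:
    """Union of all three pathways; only these elements reach the agent."""
    return (keyword_match(task, elements)
            | semantic_match(task, elements, score_fn)
            | llm_pick(task, elements))
```

The key point is the reduction: downstream agents see only the merged candidate set rather than the full element tree, which may contain hundreds of entries.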
Addressing Decision Uncertainty with Human-in-the-Loop
When RecAgent encounters a situation where it’s unsure how to proceed—for example, when multiple valid options exist or user preferences are missing—it doesn’t guess. Instead, an Interaction Agent proactively asks the user for feedback. This “human-in-the-loop” refinement allows the agent to make intent-aware decisions. For instance, in the coffee ordering scenario, it would ask, “What level of sweetness do you prefer?” and then proceed based on the user’s response.
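One simple way to trigger such a clarifying question is a confidence-margin check: if the top two candidate actions score nearly the same, the agent defers to the user. The margin heuristic and the `ask_user` callback below are illustrative assumptions, not the paper's actual ambiguity criterion.

```python
def choose_action(candidates, scores, ask_user, margin=0.1):
    """Pick the highest-scoring action, but if the top two scores are
    within `margin` of each other the choice is ambiguous, so defer to
    the user. `ask_user` stands in for the Interaction Agent's prompt."""
    ranked = sorted(zip(scores, candidates), reverse=True)
    if len(ranked) > 1 and ranked[0][0] - ranked[1][0] < margin:
        options = [c for _, c in ranked[:3]]
        return ask_user(f"Which option do you mean? {options}")
    return ranked[0][1]
```

In the coffee example, "half sugar" and "no sugar" might score almost equally, so the agent asks rather than guesses.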
The RecAgent Architecture
RecAgent integrates several functional agents: a Planning Agent to break down tasks into subgoals, a Decision Agent to select actions based on filtered UI elements, and a Reflection Agent that evaluates if an action was successful. If an action fails, the Reflection Agent uses a retrospection mechanism to learn from the mistake, remove the failed option, and try an alternative, enhancing robustness.
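The retrospection loop described above can be sketched as: try a candidate action, verify the outcome, and on failure record the failed option, drop it, and move to an alternative. The `try_action` and `verify` callbacks below stand in for the Decision and Reflection agents; the function names are hypothetical.

```python
def execute_with_retrospection(subgoal, candidates, try_action, verify,
                               max_attempts=3):
    """Try candidate actions in order until one satisfies the subgoal.
    Failed options are remembered and excluded from later attempts,
    mirroring the Reflection Agent's retrospection mechanism."""
    failed = []
    remaining = list(candidates)
    for _ in range(max_attempts):
        if not remaining:
            break
        action = remaining.pop(0)      # Decision Agent: pick next option
        try_action(action)             # execute it on the device
        if verify(subgoal):            # Reflection Agent: did it work?
            return action, failed
        failed.append(action)          # retrospection: remove failed option
    return None, failed
```

Removing failed options guarantees the agent never retries the same dead end, which is what makes the loop more robust than blind retries.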
Introducing the ComplexAction Dataset
To rigorously test GUI agents in challenging scenarios, the researchers also introduced a new dataset called ComplexAction. Unlike previous benchmarks that focus on completing entire tasks, ComplexAction specifically evaluates an agent’s ability to perform fine-grained, single-step actions (like clicking a specific button) within visually and semantically complex environments. This helps validate how well an agent can locate relevant UI elements amidst significant input redundancy.
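Because ComplexAction scores individual actions rather than whole task trajectories, its metric reduces to exact-match accuracy over single steps. The sketch below assumes a simple schema (`screen`, `instruction`, `gold_action`); the actual dataset fields and action format may differ.

```python
def single_step_accuracy(dataset, agent):
    """Each sample pairs a screen and an instruction with one
    ground-truth action; accuracy is the fraction of samples where
    the agent's predicted action matches exactly."""
    correct = 0
    for sample in dataset:
        predicted = agent(sample["screen"], sample["instruction"])
        correct += predicted == sample["gold_action"]
    return correct / len(dataset)
```

Evaluating one step at a time isolates the perception problem: a wrong click on a cluttered screen counts as a miss even if a full-task benchmark might have let the agent recover later.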
Performance and Impact
Extensive experiments show that RecAgent outperforms existing state-of-the-art methods on various benchmarks, including AndroidWorld, MobileMiniWoB++, and the new ComplexAction dataset. Its ability to adaptively perceive and interactively resolve ambiguities makes it more reliable and generalizable in real-world mobile applications.
The research highlights that by tackling perceptual and decision uncertainties, RecAgent paves the way for more robust and user-friendly GUI automation. The authors state that the ComplexAction dataset and the RecAgent code will be made publicly available, fostering further advancements in the field. More details can be found in the original research paper.