TLDR: MapAgent is a novel LLM-based agent framework designed to automate complex tasks on mobile devices. It addresses the limitations of current LLM agents, such as lack of real-world app knowledge and hallucinations, by leveraging a memory system constructed from historical task execution trajectories. The framework uses a trajectory-based memory mechanism to store structured page information, a coarse-to-fine planning approach augmented by retrieving relevant memory pages, and a dual-LLM architecture task executor for robust action generation and progress monitoring. Experiments show MapAgent achieves superior performance and efficiency in real-world mobile scenarios.
In the rapidly evolving landscape of artificial intelligence, Large Language Models (LLMs) have shown immense promise in automating tasks on mobile devices through their ability to interact with graphical user interfaces (GUIs). However, these advanced agents often hit roadblocks when faced with the complexities of real-world mobile applications. A significant challenge arises from LLMs’ inherent lack of practical knowledge about these apps, leading to inefficient task planning and sometimes even generating incorrect or ‘hallucinated’ actions.
Addressing these critical limitations, a new framework called MapAgent has been introduced. This innovative LLM-based agent leverages a unique memory system built from past task execution experiences, known as historical trajectories, to significantly enhance its current task planning capabilities. Imagine an agent that learns from its past interactions, much like a human user would, and applies that learned knowledge to new, similar tasks.
MapAgent operates through three core components that work in harmony. First, it features a Trajectory-based Memory Mechanism. This mechanism is inspired by how humans remember information during device operation. It takes the agent’s past task execution paths (sequences of actions and the pages encountered) and condenses them into a structured ‘page-memory database’. Each ‘page’ within a trajectory is captured as a concise yet comprehensive snapshot, detailing both its visual layout (UI) and its functional purpose. This ensures that crucial information from previous interactions is retained and organized for future use.
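To make the idea of a page-memory database concrete, here is a minimal Python sketch. It is not MapAgent's actual implementation: the names `PageRecord`, `Step`, and `PageMemory` are hypothetical, the summaries are hand-written strings rather than LLM output, and deduplicating pages by an (app, page) key is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical snapshot of one page: what it looks like and what it is for."""
    app: str
    page_id: str
    ui_summary: str        # visual layout (UI), e.g. the main controls on screen
    function_summary: str  # functional purpose of the page


@dataclass
class Step:
    """One step of a past trajectory: the page seen and the action taken there."""
    page: PageRecord
    action: str


class PageMemory:
    """Hypothetical page-memory database built from historical trajectories."""

    def __init__(self) -> None:
        self._pages: Dict[Tuple[str, str], PageRecord] = {}

    def add_trajectory(self, steps: List[Step]) -> None:
        # Assumption: a page visited several times is stored only once.
        for step in steps:
            key = (step.page.app, step.page.page_id)
            self._pages.setdefault(key, step.page)

    def pages(self) -> List[PageRecord]:
        # Pages are returned in the order they were first encountered.
        return list(self._pages.values())
```

Feeding one trajectory that visits the same page twice would leave a single record for that page, so the database grows with the number of distinct pages rather than the number of steps.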
Secondly, MapAgent employs a sophisticated Memory-Augmented Task Planning approach, which operates in a coarse-to-fine manner. When given a new task, the system first generates a broad plan, breaking the task down into general subtasks. Then, it intelligently retrieves relevant ‘pages’ from its memory database based on how similar they are to the current subtasks. This retrieved information is then fed into the LLM planner, providing it with valuable context and compensating for any gaps in its understanding of real-world app scenarios. This process leads to more informed and context-aware task planning, preventing the agent from making common mistakes or getting stuck.
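The retrieval step can be illustrated with a small Python sketch. The article does not say how similarity is computed, so this example substitutes a plain bag-of-words cosine similarity as a stand-in. The function names `retrieve_pages` and `build_planner_context` are hypothetical, and page descriptions are simplified to strings.

```python
import math
import re
from collections import Counter
from typing import List


def _bag_of_words(text: str) -> Counter:
    # Lowercase word counts; a stand-in for whatever representation the real system uses.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_pages(subtask: str, pages: List[str], top_k: int = 2) -> List[str]:
    """Return up to top_k page descriptions most similar to the subtask."""
    query = _bag_of_words(subtask)
    scored = [(cosine(query, _bag_of_words(page)), page) for page in pages]
    scored.sort(key=lambda item: item[0], reverse=True)
    # Pages with no overlap at all are dropped rather than padded in.
    return [page for score, page in scored[:top_k] if score > 0]


def build_planner_context(subtask: str, pages: List[str]) -> str:
    """Assemble the text a fine-grained planner could receive (hypothetical format)."""
    retrieved = retrieve_pages(subtask, pages)
    lines = [f"Subtask: {subtask}", "Relevant pages from memory:"]
    lines += [f"- {page}" for page in retrieved] or ["- (none found)"]
    return "\n".join(lines)
```

In this sketch the coarse plan is assumed to exist already as a list of subtask strings; each subtask would be passed through `build_planner_context` before the planner refines it.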
Finally, the planned tasks are brought to life by the Task Executor, which is powered by a dual-LLM architecture. This executor is designed to translate the refined plans into concrete, executable actions on the mobile device. It consists of two collaborating LLM roles: a ‘Decision-maker’ that proposes actions and a ‘Judge’ that evaluates the progress and success of each action. This collaborative approach, combined with a short-term memory unit that tracks the current task’s historical responses, ensures effective tracking of task progress and allows the agent to adapt to the dynamic nature of mobile environments. This dual-LLM setup helps the agent detect and correct errors, making it more robust in unpredictable situations.
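The interplay between the two roles can be sketched as a simple loop. This is an illustration under stated assumptions, not the paper's executor: the two LLMs and the device are replaced by plain callables, the name `run_subtask` and the `max_steps` limit are hypothetical, and the Judge is reduced to a yes/no verdict on whether the subtask is complete.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the two LLM roles and the device interface.
Decide = Callable[[str, List[str]], str]  # (subtask, history) -> proposed action
Judge = Callable[[str, str, str], bool]   # (subtask, action, observation) -> subtask done?
Act = Callable[[str], str]                # action -> resulting screen description


def run_subtask(
    subtask: str,
    decide: Decide,
    judge: Judge,
    act: Act,
    max_steps: int = 5,
) -> Tuple[bool, List[str]]:
    """Run one subtask; return (completed, short-term memory of this attempt)."""
    history: List[str] = []  # short-term memory: earlier actions and verdicts
    for _ in range(max_steps):
        action = decide(subtask, history)            # Decision-maker proposes an action
        observation = act(action)                    # the action is executed on the device
        done = judge(subtask, action, observation)   # Judge evaluates the outcome
        verdict = "done" if done else "not done"
        history.append(f"{action} -> {observation} ({verdict})")
        if done:
            return True, history
    return False, history
```

Because each verdict is appended to `history`, the Decision-maker sees its earlier failed attempts on the next turn, which is one way a loop like this could recover from a wrong action instead of repeating it.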
Extensive experiments conducted in real-world scenarios, across both English and Chinese mobile applications, have demonstrated MapAgent’s superior performance compared to existing methods. It has shown notable improvements in handling complex cross-app tasks, where it efficiently breaks down large tasks into manageable subtasks while maintaining context. Furthermore, the framework exhibits a reasonable balance between computational efficiency and high success rates, proving its practicality for real-world deployment. The research paper detailing this framework can be found at arXiv:2507.21953.
MapAgent represents a significant step forward in mobile task automation, offering a robust and intelligent solution that learns from experience, plans with enhanced context, and executes tasks with greater reliability. Its ability to bridge the knowledge gap between LLMs and real-world mobile applications opens up new possibilities for AI-assisted smartphone development and user interaction.


