TLDR: E-Agent is a new AI framework that optimizes how multimodal AI systems plan and execute information retrieval. It uses a dynamic, one-time planning strategy to efficiently combine visual and text searches, significantly improving accuracy (13% gain) and reducing redundant searches (37% reduction) compared to existing methods. A new benchmark, RemPlan, was introduced to evaluate these planning capabilities in real-world scenarios.
In the rapidly evolving field of Artificial Intelligence, Multimodal Retrieval-Augmented Generation (mRAG) systems are becoming increasingly important. These systems aim to enhance the capabilities of Large Language Models (LLMs) by allowing them to access and integrate external knowledge, particularly from the internet, to answer complex questions that require up-to-date or specialized information. This is especially crucial for real-world applications like news analysis or understanding trending topics, where information changes rapidly.
However, existing mRAG approaches often face significant challenges. Many rely on rigid, pre-set ways of retrieving information, which means they don’t adapt well to different types of questions. They also frequently underutilize visual information, focusing too much on text. This can lead to incomplete information retrieval, especially when dealing with image-based queries, and often results in redundant searches, wasting computational resources and potentially introducing irrelevant data.
A new research paper, titled “Efficient Agent: Optimizing Planning Capability for Multimodal Retrieval Augmented Generation,” introduces an agent framework called E-Agent. The framework is designed to overcome the limitations of current mRAG systems by optimizing their planning capabilities. The core idea behind E-Agent is to enable more efficient and accurate information retrieval while significantly reducing unnecessary operations.
Introducing E-Agent: A Smarter Approach to Multimodal AI
E-Agent stands out with two main innovations. First, it features an mRAG planner that is specifically trained to dynamically organize multimodal tools. This planner uses contextual reasoning to decide the best way to retrieve information based on the specific question and visual input. Unlike older systems that follow a fixed path, E-Agent’s planner can adapt its strategy on the fly.
Second, E-Agent includes a task executor that implements optimized mRAG workflows. This executor carries out the plan generated by the planner, invoking the right search tools and Multimodal Large Language Models (MLLMs) as needed. A key aspect of E-Agent’s design is its “one-time mRAG planning strategy.” This means it plans the entire retrieval process in a single pass, which drastically minimizes redundant tool invocations and improves efficiency.
The framework operates through two interconnected modules: the mRAG planner and the task executor. The planner analyzes both text and visual inputs to create a comprehensive plan, determining which multimodal search tools to use, how to configure auxiliary MLLM functions, and what specific instructions and parameters are needed for each tool. The task executor then translates this plan into action, using tools like a Requery tool (to formulate optimized search strings), a Response tool (to synthesize information into coherent answers), an Image search tool (for visual matching), and a Text search tool (for keyword-based web queries).
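To make the planner/executor split concrete, here is a minimal sketch of the one-time planning idea. It is not the paper's implementation. The real planner is a trained model, whereas `make_plan` below is a hypothetical rule-based stand-in driven by two flags. The names `ToolCall`, `make_plan`, `execute`, and `STUB_TOOLS`, the tool identifiers, and the `top_k` parameters are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A tool takes its parameters plus the evidence gathered so far and returns text.
Tool = Callable[[Dict[str, str], List[str]], str]


@dataclass
class ToolCall:
    """One step of a plan: which tool to run and with what parameters."""
    tool: str
    params: Dict[str, str] = field(default_factory=dict)


def make_plan(question: str, needs_image_search: bool,
              needs_text_search: bool) -> List[ToolCall]:
    """Hypothetical stand-in for the trained planner.

    The whole plan is produced in a single pass, before any tool runs.
    """
    plan: List[ToolCall] = []
    if needs_image_search:
        plan.append(ToolCall("image_search", {"top_k": "3"}))
    if needs_text_search:
        # Rewrite the question into a search string, then search with it.
        plan.append(ToolCall("requery", {"question": question}))
        plan.append(ToolCall("text_search", {"top_k": "5"}))
    plan.append(ToolCall("response", {"question": question}))
    return plan


def execute(plan: List[ToolCall], tools: Dict[str, Tool]) -> Tuple[str, List[str]]:
    """Run each planned step exactly once, in order; no re-planning happens."""
    evidence: List[str] = []
    output = ""
    for step in plan:
        output = tools[step.tool](step.params, evidence)
        evidence.append(output)
    return output, evidence


# Stub tools so the sketch runs without any network access (assumed behavior).
STUB_TOOLS: Dict[str, Tool] = {
    "image_search": lambda params, ev: "image_search: matched entity",
    "requery": lambda params, ev: f"requery: {params['question']}",
    "text_search": lambda params, ev: "text_search: retrieved passages",
    "response": lambda params, ev: f"answer built from {len(ev)} evidence items",
}
```

Because the plan is fixed before execution, the number of tool invocations is known up front: `execute(make_plan("Who won?", True, True), STUB_TOOLS)` makes exactly four calls, while a question that needs no search makes only the final response call.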
RemPlan: A New Benchmark for Real-World mRAG Planning
To thoroughly evaluate the planning capabilities of mRAG systems, the researchers also introduced a new benchmark called Real-World mRAG Planning (RemPlan). This benchmark is unique because it includes both questions that require external retrieval and those that can be answered using the model’s existing knowledge. It’s meticulously annotated with the essential retrieval tools needed for each question, making it highly relevant to real-world scenarios that demand dynamic mRAG decisions.
RemPlan categorizes questions into four types: Fundamental (no search needed), Visual-Recognition (image search needed), Information-Seeking (text search needed), and Multi-Faceted (both image and text search needed). This diversity allows for a detailed assessment of an agent’s ability to discern when and what type of search is necessary. The benchmark also introduces a hierarchical plan evaluation metric, which goes beyond just answer accuracy to measure mRAG planning accuracy, search tool precision and recall, and parameter semantic scores.
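The four categories and the search tool precision and recall scores can be illustrated with a small sketch. It assumes a plain set-based definition of precision and recall over the predicted search tools, which may differ from the paper's exact formulation. The identifiers `REQUIRED_TOOLS` and `tool_precision_recall`, the tool names, and the convention of scoring an empty set as 1.0 are assumptions for illustration.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Search tools each RemPlan category requires, as described in the text above.
REQUIRED_TOOLS: Dict[str, FrozenSet[str]] = {
    "Fundamental": frozenset(),
    "Visual-Recognition": frozenset({"image_search"}),
    "Information-Seeking": frozenset({"text_search"}),
    "Multi-Faceted": frozenset({"image_search", "text_search"}),
}


def tool_precision_recall(predicted: Set[str], category: str) -> Tuple[float, float]:
    """Set-based precision and recall of the planned search tools.

    Assumed convention: an empty prediction has precision 1.0 (no wrong
    calls) and an empty requirement has recall 1.0 (nothing to miss).
    """
    gold = REQUIRED_TOOLS[category]
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 1.0
    recall = hits / len(gold) if gold else 1.0
    return precision, recall
```

Under these assumptions, a plan that calls both search tools for a Visual-Recognition question scores precision 0.5 and recall 1.0: the unnecessary text search lowers precision, which is the kind of redundant call the benchmark is meant to expose.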
Impressive Results and Future Implications
Experiments conducted on RemPlan and three other established benchmarks demonstrated E-Agent’s superior performance. It achieved a 13% accuracy gain over state-of-the-art mRAG methods while significantly reducing redundant searches by 37%. This highlights E-Agent’s effectiveness in both improving answer quality and enhancing computational efficiency.
The study also validated the reliability of using GPT-4o for evaluating answer quality, showing a high correlation with human evaluations. While E-Agent shows robust performance, the researchers acknowledge limitations, particularly in handling complex multi-hop reasoning tasks that might require iterative plan refinement. The framework’s reliance on predefined toolkits also suggests a need for future updates to maintain compatibility with evolving data sources.
This research marks a significant step forward in developing intelligent multimodal Question Answering systems. By optimizing the planning process for multimodal retrieval, E-Agent paves the way for more accurate, efficient, and adaptable AI agents in real-world applications. For more in-depth technical details, see the full research paper, “Efficient Agent: Optimizing Planning Capability for Multimodal Retrieval Augmented Generation.”


