TLDR: A new adaptive and data-driven memory framework optimizes LLM-based agents by modeling memory cycles. It features an MoE gate for retrieval, a learnable aggregation for utilization, and task-specific reflection for storage, all optimized through off-policy and on-policy strategies. This framework enables agents to learn how to memorize effectively, leading to improved performance and efficiency in interactive environments.
Large Language Model (LLM)-based agents are becoming increasingly common in various fields, from finance to personal assistants. A crucial aspect of their effectiveness is how they manage and utilize memory. Traditionally, memory mechanisms for these agents have been designed manually by human experts, a process that can be costly and often leads to less-than-optimal performance. Furthermore, these conventional methods frequently overlook the “memory cycle effect,” which is vital for fine-tuning LLM-based agents for specific environments.
Addressing these challenges, a new research paper introduces an innovative adaptive and data-driven memory framework. This framework aims to optimize LLM-based agents by explicitly modeling memory cycles, allowing agents to learn how to memorize information more effectively within their specific environments. The paper, titled “Learn to Memorize: Optimizing LLM-based Agents with Adaptive Memory Framework,” was authored by Zeyu Zhang, Quanyu Dai, Rui Li, Xiaohe Bo, Xu Chen, and Zhenhua Dong.
Understanding the Memory Cycle
The core idea behind this new framework is the “memory cycle,” which describes the continuous interaction between an agent and its environment. In this cycle, an agent perceives observations, stores them as memories, retrieves relevant information to make decisions, and then takes actions that influence the environment, leading to new observations. This creates a continuous loop where memory storage, retrieval, and utilization are interconnected and mutually influential. Previous approaches often treated these procedures in isolation, leading to suboptimal outcomes.
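To make the loop concrete, the following minimal Python sketch walks through one episode of such a cycle. The `Memory` class and `run_episode` function are illustrative placeholders rather than the paper's actual interfaces, and the retrieval step here is a trivial recency lookup standing in for the learned mechanisms described below.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Memory:
    """Toy memory store: keeps raw observations and returns the most recent ones."""
    items: List[str] = field(default_factory=list)

    def store(self, observation: str) -> None:
        self.items.append(observation)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        # Placeholder retrieval: the k most recent items. The paper replaces this
        # with an adaptive, learned scoring function.
        return self.items[-k:]

def run_episode(observations: List[str]) -> None:
    memory = Memory()
    for obs in observations:                           # 1. perceive an observation
        memory.store(obs)                              # 2. store critical information
        context = memory.retrieve(obs)                 # 3. retrieve relevant memories
        action = f"act using {len(context)} memories"  # 4. decide and act
        print(action)                                  # acting shapes the next observation

run_episode(["saw a door", "door is locked", "found a key"])
```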
Key Innovations of the Framework
The proposed framework breaks down the memory cycle into three key procedures: retrieval, utilization, and storage, each enhanced with novel mechanisms:
- Memory Retrieval: Instead of fixed, manually assigned weights for different memory aspects (such as relevance or recency), the researchers designed a Mixture-of-Experts (MoE) gate function. This function adaptively adjusts the importance of various metrics for different states and memories, learning these adjustments from training data. It also expands beyond semantic relevance to include emotional relevance and importance scoring, using pre-trained scoring functions to make these assessments more dynamic and accurate (a minimal sketch of the gating idea appears after this list).
- Memory Utilization: Traditional methods often simply concatenate retrieved memories, which can lead to redundant information. This framework introduces a learnable aggregation process that iteratively integrates memories into a coherent context (see the second sketch after this list). The process is optimized with techniques such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), allowing the LLM to better align its memory utilization with desired outcomes.
- Memory Storage: When an agent observes something new, it needs to extract the critical information. The framework uses a task-specific reflection mechanism to adjust this extraction process: the agent learns what information is most important to store for its task, rather than relying on generic, fixed prompts. The task-specific instruction is refined based on successful and unsuccessful interactions (see the third sketch after this list).
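As a rough illustration of the retrieval idea, the sketch below scores candidate memories by mixing four metric scores (semantic relevance, recency, emotional relevance, importance) with state-conditioned softmax weights. The `moe_gate_score` function, the array shapes, and the random toy data are assumptions for illustration only; the paper's actual gate architecture and pre-trained scoring functions may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_gate_score(state_vec: np.ndarray, metric_scores: np.ndarray,
                   gate_weights: np.ndarray) -> float:
    """Score one memory: a state-conditioned gate mixes per-metric scores.

    metric_scores: [semantic_relevance, recency, emotional_relevance, importance]
    gate_weights:  learnable matrix mapping the state to one logit per metric.
    """
    logits = gate_weights @ state_vec              # one logit per metric "expert"
    mix = np.exp(logits) / np.exp(logits).sum()    # softmax -> adaptive metric weights
    return float(mix @ metric_scores)              # weighted sum = retrieval score

# Toy usage: score three candidate memories for one state and pick the best.
state = rng.normal(size=8)
gate = rng.normal(size=(4, 8)) * 0.1               # would be trained in practice
memories = rng.uniform(size=(3, 4))                # precomputed metric scores per memory
scores = [moe_gate_score(state, m, gate) for m in memories]
print("retrieved memory index:", int(np.argmax(scores)))
```

Because the gate is conditioned on the state, the same memory can be scored mostly by recency in one situation and mostly by importance in another, which is what fixed hand-tuned weights cannot do.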
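For memory utilization, the pattern is to fold retrieved memories into one coherent context one step at a time rather than concatenating them all. In the sketch below, `integrate` is a stand-in for the LLM call that performs each merge; the SFT/DPO optimization of that step is not shown.

```python
from typing import List

def integrate(context: str, memory: str) -> str:
    """Stand-in for an LLM call that merges a new memory into the running context.
    In the paper this step is performed by the agent's LLM and is optimized with
    SFT and DPO; here we just append non-redundant content to keep the sketch runnable."""
    if memory in context:
        return context                 # skip redundant information
    return (context + " " + memory).strip()

def aggregate(memories: List[str]) -> str:
    context = ""
    for m in memories:                 # iteratively fold each memory into one coherent context
        context = integrate(context, m)
    return context

print(aggregate(["Alice lives in Paris.", "Alice lives in Paris.", "She owns a cat."]))
```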
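For memory storage, the key point is that the extraction instruction is itself revised from experience. The sketch below uses a hypothetical `reflect` helper in place of the LLM reflection step, which in practice would be prompted with successful and failed trajectories and asked to rewrite the instruction.

```python
from typing import List

def reflect(instruction: str, successes: List[str], failures: List[str]) -> str:
    """Stand-in for an LLM reflection step: revise the storage instruction based on
    which interactions succeeded or failed. A real implementation would prompt the
    LLM with both sets of trajectories and ask it to rewrite the instruction."""
    hint = "focus on facts that separated successful attempts from failed ones"
    return f"{instruction} ({hint}; seen {len(successes)} successes, {len(failures)} failures)"

storage_instruction = "Extract the information most useful for completing the task."
storage_instruction = reflect(storage_instruction,
                              successes=["trajectory_1", "trajectory_3"],
                              failures=["trajectory_2"])
print(storage_instruction)
```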
Optimization Strategies: Off-policy and On-policy
To train this adaptive memory framework, the researchers developed two optimization strategies:
- Off-policy Optimization: This strategy involves training the agent using pre-recorded interaction data (trajectories) from a reference policy. It’s flexible and efficient for offline training, allowing for data reuse. However, it can face challenges with “distribution shift” if the optimized policy deviates too much from the data-sampling policy.
- On-policy Optimization: This approach involves continuous online learning, where the agent uses its currently optimized policy to generate new interaction data for further training. This alleviates the distribution shift problem and keeps the training data aligned with the policy being learned. The research shows that on-policy optimization is particularly effective at improving the framework's performance (a schematic comparison of the two strategies follows this list).
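The contrast between the two regimes can be summarized schematically. In the sketch below, `train` and the rollout function are placeholders for the actual optimization of the gate, aggregation, and reflection components: off-policy training reuses a fixed set of logged trajectories, while on-policy training re-collects trajectories with the current policy each round.

```python
from typing import Callable, List

def train(policy_params: dict, trajectories: List[dict]) -> dict:
    """Placeholder update step; a real implementation would optimize the memory
    components from the collected trajectories."""
    return dict(policy_params, steps=policy_params.get("steps", 0) + len(trajectories))

def off_policy(policy: dict, logged: List[dict], epochs: int = 3) -> dict:
    # Reuses the same pre-recorded trajectories every epoch: data-efficient, but the
    # data can drift away from what the updated policy would actually do.
    for _ in range(epochs):
        policy = train(policy, logged)
    return policy

def on_policy(policy: dict, rollout: Callable[[dict], List[dict]], rounds: int = 3) -> dict:
    # Re-collects fresh trajectories with the current policy each round, keeping the
    # training data aligned with the policy being optimized.
    for _ in range(rounds):
        policy = train(policy, rollout(policy))
    return policy

fresh = lambda p: [{"obs": "o", "action": "a", "reward": 1.0}]  # dummy rollout
print(on_policy({"steps": 0}, fresh))
```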
Experimental Validation and Efficiency
The framework was rigorously tested across various datasets, including HotpotQA (with hard, medium, and easy difficulty levels) and MemDaily. The results consistently demonstrated that the on-policy optimized model outperformed other baseline memory models. Notably, the adaptive memory framework significantly reduced the average reasoning steps required for agents to complete tasks, indicating that agents could make more informed decisions and find answers more quickly.
While the method introduces a slight increase in computational time per step due to additional operations, the overall time per trajectory is significantly reduced because the agent requires fewer reasoning steps to achieve its goals. This highlights an improvement in efficiency alongside effectiveness.
The researchers have made their project publicly available on GitHub, inviting the community to explore and build upon their work. You can find more details about this innovative framework in the full research paper: Learn to Memorize: Optimizing LLM-based Agents with Adaptive Memory Framework.


