Self-Managing Memory: A New Approach for Language Models in Long-Horizon Tasks

TLDR: Large Language Models struggle with long, complex tasks due to limited working memory. The “Memory-as-Action” (MemAct) framework proposes making memory management an intrinsic, learnable part of the agent’s policy, allowing it to actively edit its memory. A new algorithm, Dynamic Context Policy Optimization (DCPO), addresses the challenges of training with these dynamic memory changes. This approach improves task performance and reduces computational costs by enabling adaptive context curation, making LLMs more effective and efficient for long-horizon tasks.

Large Language Models (LLMs) are highly capable, but they often struggle with complex, long-running tasks like deep research or software engineering. The main culprit? Their working memory. This memory, which holds the history of observations and decisions, can quickly become cluttered with irrelevant information, hindering the model’s ability to reason effectively and stay on track.

Traditionally, managing this working memory has been an external affair. Systems rely on predefined rules or separate controllers to select, compress, or summarize information. While these methods help, they are detached from the LLM’s core decision-making process, making it difficult for the agent to learn a truly coherent strategy that balances task performance with resource costs.

Introducing Memory-as-Action (MemAct)

A new framework called Memory-as-Action (MemAct) proposes a fundamental shift. Instead of external management, MemAct treats working memory management as an intrinsic, learnable capability of the agent itself. Imagine an LLM that can actively decide when to keep, compress, or discard parts of its history, or even insert summaries, all as part of its unified decision-making process. This is what MemAct enables.

By framing memory operations as explicit actions, an agent trained through reinforcement learning can learn to curate its context dynamically. This helps maintain a focused, goal-relevant reasoning trace, and it also reduces computational cost and improves task performance.

Overcoming Trajectory Fractures with DCPO

However, this dynamic memory editing introduces a unique challenge: “trajectory fractures.” Standard reinforcement learning methods for LLMs assume a continuously growing, append-only context. When an agent can overwrite or remove past content, this causal continuity is broken, making traditional policy optimization methods inapplicable.
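
To make the append-only assumption concrete, here is a minimal sketch of a check for it. The function name `is_append_only` and the representation of a context as a list of strings are illustrative assumptions, not something taken from the paper.

```python
from typing import List


def is_append_only(contexts: List[List[str]]) -> bool:
    """Return True if every context snapshot extends the previous one.

    `contexts` holds the working memory as seen at successive steps
    (hypothetical representation: one list of strings per step).
    """
    for prev, curr in zip(contexts, contexts[1:]):
        # The earlier snapshot must be an exact prefix of the later one.
        if curr[:len(prev)] != prev:
            return False
    return True


# A conventional rollout: each step only appends.
growing = [["obs1"], ["obs1", "act1"], ["obs1", "act1", "obs2"]]

# A rollout with a memory edit: earlier records are replaced by a summary.
edited = [["obs1", "act1"], ["summary"], ["summary", "obs2"]]

print(is_append_only(growing))
print(is_append_only(edited))
```

The second rollout fails the check at the step where the summary replaces earlier records, which is the kind of break the article calls a trajectory fracture.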

To tackle this, the researchers developed Dynamic Context Policy Optimization (DCPO). This novel algorithm enables stable end-to-end reinforcement learning by segmenting the execution trajectory whenever a memory action occurs. By partitioning histories into causally consistent segments, DCPO ensures that gradients are computed correctly, allowing the agent to learn effectively despite the non-linear changes to its memory.
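
The segmentation idea can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' implementation: the `Step` type, the `"task"` and `"memory"` labels, and the choice to end a segment with the memory action that triggers the cut are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str  # "task" or "memory" (hypothetical labels)
    text: str


def split_at_memory_actions(steps: List[Step]) -> List[List[Step]]:
    """Cut a trajectory after each memory action.

    Assumption: the memory action is generated against the pre-edit
    context, so it closes the current segment. Everything after it is
    generated against the edited context and starts a new segment.
    """
    segments: List[List[Step]] = []
    current: List[Step] = []
    for step in steps:
        current.append(step)
        if step.kind == "memory":
            segments.append(current)
            current = []
    if current:
        # Trailing steps after the last memory action form a final segment.
        segments.append(current)
    return segments


trajectory = [
    Step("task", "search A"),
    Step("task", "search B"),
    Step("memory", "prune: summarize A and B"),
    Step("task", "search C"),
    Step("memory", "prune: summarize C"),
    Step("task", "answer"),
]
segments = split_at_memory_actions(trajectory)
print([len(s) for s in segments])
```

Within each segment the context only grows, so a per-segment loss can treat it like an ordinary append-only rollout. How the segments are weighted and combined during the policy update is not specified here and is left out of the sketch.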

How MemAct Works

In the MemAct framework, the agent’s interaction is modeled as a Markov Decision Process. The agent’s actions include both task-oriented actions (interacting with the environment) and memory actions (directly modifying its working memory). A dedicated “prune context” tool allows the agent to synthesize key information into a summary and delete specified historical records, effectively replacing detailed, pruned content with a concise piece of memory.
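
A rough sketch of what such a pruning operation might look like is shown below. The `Record` type, the `prune_context` signature, and the decision to append the summary at the end of memory are assumptions made for illustration; the article does not describe the tool's actual interface.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Record:
    record_id: int
    content: str


def prune_context(
    memory: List[Record],
    summary: str,
    delete_ids: Set[int],
    next_id: int,
) -> List[Record]:
    """Drop the records named in `delete_ids` and add one summary record.

    Assumption: the summary is appended after the surviving records.
    The input list is not modified; a new list is returned.
    """
    kept = [r for r in memory if r.record_id not in delete_ids]
    kept.append(Record(next_id, summary))
    return kept


memory = [
    Record(0, "question: who founded X?"),
    Record(1, "search result page 1 (long)"),
    Record(2, "search result page 2 (long)"),
    Record(3, "tool call: open page 3"),
]
pruned = prune_context(
    memory,
    summary="Pages 1 and 2 name Y as the founder of X.",
    delete_ids={1, 2},
    next_id=4,
)
print([r.record_id for r in pruned])
```

Here two long search results are replaced by one short summary record, so later steps see a smaller context while still having the information that mattered.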

The training process involves a segmented supervised fine-tuning phase for initial policy setup, followed by the DCPO training phase. A sparse, terminal reward function guides the agent, rewarding successful task completion and penalizing resource constraint violations, such as exceeding context length limits.
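
As a simple illustration of a sparse, terminal reward of this kind, consider the sketch below. The specific values (1.0, 0.0, -1.0), the parameter names, and the rule that a constraint violation overrides task success are assumptions, not details reported in the article.

```python
def terminal_reward(
    task_succeeded: bool,
    peak_context_tokens: int,
    context_limit: int,
    violation_penalty: float = -1.0,
) -> float:
    """Compute a single end-of-episode reward.

    Assumption: exceeding the context limit is penalized regardless of
    whether the task was solved. Otherwise the reward is 1.0 for
    success and 0.0 for failure. No intermediate rewards are given.
    """
    if peak_context_tokens > context_limit:
        return violation_penalty
    return 1.0 if task_succeeded else 0.0


print(terminal_reward(True, peak_context_tokens=6000, context_limit=8000))
print(terminal_reward(False, peak_context_tokens=6000, context_limit=8000))
print(terminal_reward(True, peak_context_tokens=9000, context_limit=8000))
```

Because the only signal arrives at the end of the episode, the agent has to discover for itself which memory actions keep the context within budget without losing information needed for the answer.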

Key Findings and Benefits

Experiments on multi-objective and multi-hop question-answering datasets demonstrated the effectiveness of MemAct. The MemAct-14B-RL model achieved leading accuracy, even outperforming much larger models, while operating at a substantially lower token cost. This indicates that jointly optimizing for task reasoning and memory management is highly efficient.

Interestingly, the framework allows for adaptive strategies. A more capable 14B model learned an efficiency-oriented approach, using fewer external tools. In contrast, a smaller 7B model developed a strategy of extending its reasoning process with more external tool calls, compensating for its limited internal knowledge, while still managing memory intensively to remain token-efficient. This highlights MemAct’s ability to help models discover policies tailored to their intrinsic capabilities.

Furthermore, the memory pruning enabled by MemAct directly translates to improved training efficiency, reducing the duration of both rollout and policy update phases. This suggests that learned memory policies can be a viable and efficient alternative to simply expanding context windows.

In conclusion, the Memory-as-Action framework, coupled with Dynamic Context Policy Optimization, lets LLMs manage their own working memory. The resulting agents are more effective on complex, long-horizon tasks and more computationally efficient. For more details, see the full research paper.
