TLDR: MemoryBench is a new benchmark addressing the limitations of current LLM memory evaluations by simulating diverse user feedback across various tasks and domains. It reveals that existing state-of-the-art memory-based LLM systems are often ineffective and inefficient, sometimes underperforming simpler RAG methods, highlighting the need for better continual learning algorithms.
Large Language Models, or LLMs, have shown incredible potential and are often seen as a stepping stone towards Artificial General Intelligence. However, researchers are finding that simply scaling these models up with more data and parameters is reaching its limits. High-quality data is becoming scarce, and the performance gains from additional compute are diminishing. This has led to growing interest in how LLM systems can learn continuously, much like humans or traditional AI systems such as search engines.
Inspired by how humans and existing AI systems learn from experience, the concept of building memory and continual learning frameworks for LLMs has become a crucial area of research. But there’s a catch: current benchmarks for evaluating LLM memory often focus on specific, homogeneous tasks, like reading comprehension with very long texts. They don’t adequately test an LLM system’s ability to learn from the ongoing feedback it receives from users during its operational life.
To address this gap, a new research paper introduces MemoryBench, a comprehensive benchmark designed to evaluate the continual learning capabilities of LLM systems. This benchmark proposes a unique user feedback simulation framework that spans multiple domains, languages, and types of tasks. The goal is to mimic how systems continuously learn and improve by interacting with users in real-world online services.
How MemoryBench Works
MemoryBench is built around a user feedback simulation framework. It categorizes memory into two main types: Declarative Memory (factual knowledge, like semantic memory from textbooks or episodic memory from user conversations) and Procedural Memory (non-factual knowledge related to task execution, such as workflows or rewards from solutions). User feedback and behavior logs are considered vital for building this procedural memory.
The benchmark also defines two types of user feedback: Explicit Feedback, where users directly indicate quality (e.g., verbose text critiques or “like/dislike” buttons), and Implicit Feedback, which consists of user actions that are not meant as judgments but are still informative (e.g., clicking a “copy” button or starting a new session with a refined prompt). MemoryBench uses an “LLM-as-user” paradigm to simulate these diverse feedback types, and an “LLM-as-judge” approach to evaluate performance.
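To make the two taxonomies concrete, here is a minimal Python sketch of how a feedback log entry might be represented. The class names, field names, and the signal-to-type mapping are illustrative assumptions rather than the benchmark's actual schema; only the categories themselves (declarative vs. procedural memory, explicit vs. implicit feedback) come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryType(Enum):
    DECLARATIVE = "declarative"  # factual knowledge (semantic or episodic)
    PROCEDURAL = "procedural"    # task-execution knowledge (workflows, rewards)


class FeedbackType(Enum):
    EXPLICIT = "explicit"  # direct quality signal: text critique, like/dislike
    IMPLICIT = "implicit"  # incidental action: copy click, refined re-prompt


# Hypothetical signal names, chosen only for this example.
IMPLICIT_SIGNALS = {"copy", "new_session"}


def classify_signal(signal: str) -> FeedbackType:
    """Map a raw signal name to a feedback type (assumed mapping)."""
    if signal in IMPLICIT_SIGNALS:
        return FeedbackType.IMPLICIT
    return FeedbackType.EXPLICIT


@dataclass
class FeedbackRecord:
    task_id: str
    response: str
    signal: str  # e.g. "like", "copy", or a free-text critique

    @property
    def feedback_type(self) -> FeedbackType:
        return classify_signal(self.signal)

    @property
    def memory_type(self) -> MemoryType:
        # Feedback and behavior logs are described as inputs to procedural memory.
        return MemoryType.PROCEDURAL


record = FeedbackRecord(task_id="t1", response="draft summary", signal="copy")
print(record.feedback_type.value, record.memory_type.value)
```

A record like this keeps the raw signal and derives its category on demand, so new signal kinds can be added without changing stored logs.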
The datasets within MemoryBench are diverse, covering open-domain, legal, and academic data, with tasks varying in input and output lengths. This ensures a broad evaluation of how LLM systems handle different scenarios and learn from accumulated feedback.
Key Findings and Limitations
The experiments conducted using MemoryBench revealed some surprising insights. State-of-the-art memory-based LLM systems, such as A-Mem, Mem0, and MemoryOS, were found to be far from satisfactory in both effectiveness and efficiency. In many cases, these advanced systems were even outperformed by simpler Retrieval-Augmented Generation (RAG) baselines, which simply treat all task context and feedback logs as a retrieval corpus.
This suggests that existing memory-based LLM systems struggle with generalizability across different task formats and types of knowledge. They don’t effectively differentiate between historical task logs and current task context, sometimes leading to insufficient use of feedback or even introducing noise. Furthermore, the efficiency of these systems is a major concern, with some exhibiting extremely long and inconsistent memory processing times, making on-policy continual learning impractical.
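For intuition, the simple RAG baseline mentioned above can be sketched in a few lines of Python: every past task and its feedback is stored as plain text, and the most similar entries are prepended to the new prompt. This is an illustrative sketch only. The bag-of-words cosine scoring, the function names, and the prompt layout are assumptions made for this example, not the retriever or prompt format used in the paper.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return up to k corpus entries most similar to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(doc))), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query: str, corpus: list[str], k: int = 2) -> str:
    """Prepend retrieved task and feedback logs to the new task."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"Past interactions:\n{context}\n\nCurrent task:\n{query}"


# Hypothetical feedback logs, invented for this example.
logs = [
    "Task: summarize contract clause. Feedback: too long, keep it short.",
    "Task: translate abstract. Feedback: liked.",
    "Task: write poem about the sea. Feedback: copied.",
]
print(build_prompt("summarize this contract clause", logs, k=1))
```

Note that this baseline does nothing to distinguish historical logs from the current task context beyond the prompt layout, which is part of why its competitive results are notable.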
The researchers hope that MemoryBench will serve as a crucial tool for future studies, paving the way for more effective and efficient memory architectures and optimization algorithms for LLM systems. For more details, refer to the full research paper.


