TLDR: The research introduces Memory-QA, a new task for AI assistants to answer personal recall questions based on multimodal memories (images, text, time, location). They propose PENSIEVE, a system that augments memories with detailed text, uses a multi-signal retriever considering time and location, and fine-tunes an answer generator. PENSIEVE significantly outperforms existing methods, enabling cost-effective personal memory recall for AI.
Imagine a personal assistant that remembers details from your life, like where you parked your car or the name of that great restaurant you visited. This vision, inspired by concepts like Vannevar Bush’s MEMEX and the modern “Second Brain,” is a step closer to reality with a new research paper introducing “Memory-QA.”
Memory-QA is a novel task focused on answering recall questions based on previously stored multimodal memories. These memories aren’t just images; each entry pairs visual content with associated text, a timestamp, and location information. The challenge lies in creating these task-oriented memories, effectively using temporal and location data, and drawing upon multiple memories to answer a single question.
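To make the setup concrete, here is a minimal Python sketch of what one stored memory entry might look like. The field names are illustrative, not taken from the paper; the augmentation fields are filled in later by the offline step described below:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    image_path: str    # the captured photo
    invocation: str    # the user's command, e.g. "remember this restaurant"
    timestamp: float   # Unix time of capture
    location: str      # e.g. "Palo Alto, CA"
    # Filled in later by offline augmentation (see the sketch below):
    ocr_text: str = ""
    description: str = ""
    completed_invocation: str = ""
    embedding: list = field(default_factory=list)  # vector over the augmented text
```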
Current AI systems, particularly those using Multi-Modal Retrieval-Augmented Generation (MM-RAG), face several hurdles in this domain. Personal recall questions often involve vague references like “yesterday” or “at Macy’s,” making precise retrieval difficult. Furthermore, many questions require combining information from several past memories, and existing Vision-Language Models (VLMs) have limited capacity for large visual contexts.
To address these challenges, researchers Hongda Jiang, Xinyuan Zhang, Siddhant Garg, and their colleagues at Meta Reality Labs propose a comprehensive pipeline called PENSIEVE. This system integrates several key innovations:
Memory Augmentation for Better Recall
When a user asks the system to “remember this,” PENSIEVE doesn’t just store a raw image. In an “offline augmentation” phase, it enriches each memory entry. This involves extracting text from the image using Optical Character Recognition (OCR), generating a detailed image description with a Large Language Model (LLM), and completing the user’s invocation command (e.g., turning “remember this restaurant” into “remember this Korean restaurant named Kochi”). These textual clues make memories richer and easier to retrieve later.
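A rough sketch of that offline step, building on the MemoryEntry above. The `ocr`, `caption_model`, and `llm` callables and the prompt wording are assumptions standing in for whatever OCR engine and models the authors actually use:

```python
def augment_memory(entry: MemoryEntry, ocr, caption_model, llm) -> MemoryEntry:
    """Offline augmentation: enrich a raw memory with textual clues.

    A minimal sketch; `ocr`, `caption_model`, and `llm` are hypothetical
    callables, not the paper's actual components.
    """
    # 1) Pull any text visible in the photo (signs, menus, receipts).
    entry.ocr_text = ocr(entry.image_path)
    # 2) Generate a detailed natural-language description of the scene.
    entry.description = caption_model(entry.image_path)
    # 3) Complete the vague invocation using the extracted evidence, e.g.
    #    "remember this restaurant" -> "remember this Korean restaurant named Kochi".
    entry.completed_invocation = llm(
        "Rewrite the user's command so it names what is in the image.\n"
        f"Command: {entry.invocation}\n"
        f"OCR text: {entry.ocr_text}\n"
        f"Image description: {entry.description}"
    )
    return entry
```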
Time- and Location-Aware Retrieval
During the “runtime QA” phase, when a user asks a recall question, PENSIEVE employs a sophisticated “multi-signal retriever.” This retriever doesn’t just look for visual similarity. It also incorporates temporal (time) and location matching signals inferred from the user’s question. For instance, if you ask “Where did I park last time?”, the system prioritizes recent parking memories. This dual-modality and context-aware retrieval mechanism ensures more accurate and relevant memory selection.
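A simplified illustration of how such signal fusion might work. The weights, the cosine scoring, and the naive time-window and substring location checks are illustrative choices, not PENSIEVE's exact formulation:

```python
import math

def retrieve_memories(question_emb, q_time_range, q_location, memories,
                      top_k=5, w_sim=1.0, w_time=0.5, w_loc=0.5):
    """Rank memories by a fused similarity + time + location score (sketch)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / (norm + 1e-9)

    scored = []
    for m in memories:
        sim = cosine(question_emb, m.embedding)  # similarity over augmented text
        # Time signal: 1 if the memory falls inside the window inferred from
        # the question (e.g. "yesterday" -> [start, end] Unix times), else 0.
        in_window = (q_time_range is not None
                     and q_time_range[0] <= m.timestamp <= q_time_range[1])
        # Location signal: naive substring match on the inferred place
        # (e.g. "at Macy's" -> "macy's").
        at_place = q_location is not None and q_location.lower() in m.location.lower()
        score = w_sim * sim + w_time * float(in_window) + w_loc * float(at_place)
        scored.append((score, m))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]
```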
Multi-Memory Question Answering
The final step is answer generation. PENSIEVE’s answer generator is fine-tuned to identify the relevant memories within the retrieved set and to aggregate information across multiple memories when a question requires it. A surprising finding is that, by relying on the high-quality textual augmentations, even text-based LLMs can match the performance of more complex VLMs for answer generation, offering a potentially lower-cost solution.
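A sketch of why a text-only LLM suffices here: the retrieved memories are serialized into a plain-text prompt, so no image has to be re-read at answer time. The prompt wording is invented for illustration, and `llm` stands in for the fine-tuned generator:

```python
def answer_question(question, retrieved, llm):
    """Serialize multiple retrieved memories into one text prompt (sketch)."""
    blocks = []
    for i, m in enumerate(retrieved, 1):
        # Each memory contributes only its textual augmentations and metadata.
        blocks.append(
            f"[Memory {i}] time={m.timestamp}, place={m.location}\n"
            f"command: {m.completed_invocation}\n"
            f"caption: {m.description}\n"
            f"ocr: {m.ocr_text}"
        )
    prompt = (
        "Answer the question using only the memories below. "
        "Ignore irrelevant memories and combine several if needed.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```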
The researchers also created a new multimodal benchmark, MemoryQA, comprising 9,357 recall questions that reflect the real-world challenges of this task. Extensive experiments show that PENSIEVE significantly outperforms state-of-the-art MM-RAG solutions, improving QA accuracy by up to 14% on this benchmark. Its individual components, including memory augmentation and the multi-signal retriever, were each shown to contribute substantially to this result.
This work represents a significant step towards building intelligent personal assistants that can genuinely remember and reason about an individual’s past experiences, moving us closer to the long-held vision of a digital “second brain.” You can read the full research paper here.


