TLDR: This research introduces a system that improves Multimodal Large Language Models (MLLMs) for Blind and Low Vision (BLV) users by making visual descriptions more relevant. Instead of generic, lengthy outputs, the system uses historical BLV user questions from similar visual contexts to guide the MLLM, anticipating what information users are most likely to seek. Evaluations showed that these “context-aware” descriptions were more accurate, anticipated user questions more often, and were preferred by human labelers for their focused delivery of critical information.
Multimodal Large Language Models (MLLMs) have become invaluable tools for visual interpretation, especially for Blind and Low Vision (BLV) individuals. AI-powered applications such as Be My AI and Seeing AI help users understand their surroundings by describing the images they capture. However, these systems tend to generate exhaustive, lengthy descriptions regardless of what the user actually needs to know. This leads to inefficient interactions, forcing BLV users to sift through irrelevant details to find the specific information they are looking for.
To address this, researchers Ricardo E. Gonzalez Penuela, Felipe Arias-Russi, and Victor Capriles have developed a system that guides MLLMs to provide more contextually relevant information. Their work, detailed in the paper “Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations,” leverages historical questions asked by BLV users to anticipate future informational needs.
How the System Works
The core idea behind this system is to learn from past interactions. When a BLV user provides an image, the system doesn’t just generate a generic description. Instead, it identifies similar visual contexts from a specialized dataset called VizWiz-LF, which contains real visual questions from BLV users paired with their images. By retrieving these semantically similar past visual contexts and their associated questions, the system can then guide the MLLM to generate descriptions that are more aligned with what BLV users typically seek in such situations.
For example, if past users frequently asked about nutritional information or expiration dates when viewing food products, the system would prioritize including such details in its description of a new food image. This proactive approach aims to deliver the most critical information upfront, reducing the need for follow-up questions and making the interaction more efficient.
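The paper does not ship an implementation, but a minimal sketch of this retrieval step might look like the following, assuming a ChromaDB collection pre-populated with Cohere Embed v4 image embeddings of the VizWiz-LF pairs (the stack reported in the evaluation below), where each stored document holds the question asked about that image. The function names and collection layout here are illustrative, not taken from the paper:

```python
import base64

import chromadb
import cohere

co = cohere.ClientV2()  # falls back to the CO_API_KEY environment variable
chroma = chromadb.PersistentClient(path="./vizwiz_lf_index")
# Assumed layout: one entry per VizWiz-LF pair, with the image embedding as
# the vector and the BLV user's question as the stored document.
contexts = chroma.get_or_create_collection("vizwiz_lf_contexts")

def embed_image(image_path: str) -> list[float]:
    """Embed one image with Cohere Embed v4 (the API takes base64 data URLs)."""
    with open(image_path, "rb") as f:
        data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()
    resp = co.embed(
        model="embed-v4.0",
        input_type="image",
        images=[data_url],
        embedding_types=["float"],
    )
    return resp.embeddings.float[0]

def retrieve_past_questions(image_path: str, k: int = 5) -> list[str]:
    """Return the questions asked in the k most similar past visual contexts."""
    results = contexts.query(
        query_embeddings=[embed_image(image_path)],
        n_results=k,
    )
    return results["documents"][0]
```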
The Evaluation and Key Findings
The researchers conducted a human evaluation comparing their “context-aware” descriptions with “context-free” (baseline) descriptions. Three human labelers reviewed 92 descriptions, judging whether each one anticipated and answered the user’s real question and which of the two versions they preferred. The evaluated system used Gemini 2.5 Pro as the MLLM and Cohere Embed v4 for multimodal embeddings, backed by a ChromaDB vector database for context retrieval.
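As a rough illustration of how these components fit together at description time, the sketch below folds the retrieved historical questions into the prompt for Gemini 2.5 Pro via the google-genai SDK. The prompt wording is our own invention, not the paper’s template:

```python
from google import genai
from google.genai import types

client = genai.Client()  # falls back to the GEMINI_API_KEY environment variable

def describe_image(image_path: str, past_questions: list[str]) -> str:
    """Generate a context-aware description guided by historical BLV questions."""
    question_block = "\n".join(f"- {q}" for q in past_questions)
    prompt = (
        "Describe this image for a blind or low vision user. In visually "
        "similar situations, users have asked:\n"
        f"{question_block}\n"
        "Answer these likely questions early in the description, then briefly "
        "add any other important details."
    )
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            prompt,
        ],
    )
    return response.text
```

Chaining this with retrieve_past_questions from the earlier sketch gives the full hypothetical pipeline: embed the new image, retrieve similar past contexts, and prompt the MLLM to front-load the answers users are most likely to need.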
The results were promising:
- Context-aware descriptions were found to be more accurate, anticipating and answering users’ questions in 76.1% of cases, compared to 63.0% for context-free descriptions.
- Crucially, context-aware descriptions successfully anticipated and answered the user’s question in 15.2% of cases where the context-free description failed entirely.
- Human labelers preferred context-aware descriptions in 54.3% of comparisons, primarily because they focused on critical information, especially for food-related products (e.g., expiration dates, cooking instructions, nutritional information).
- While context-free descriptions were sometimes preferred for providing broader context (20.7% of cases), context-aware descriptions delivered a clear overall gain in relevance and accuracy.
Impact and Future Directions
These findings highlight that historical user questions are a powerful signal for guiding MLLMs to provide proactive and more contextually relevant visual interpretations for BLV users. The system improved overall performance without degrading the MLLM’s core ability to answer questions.
Looking ahead, the researchers plan to expand the context dataset beyond the current 600 VizWiz-LF question-image pairs by incorporating larger and more varied datasets. They also aim to explore personalized context retrieval based on individual usage patterns and to integrate alternative sources of contextual information, further improving accuracy and relevance for personal use cases.


