TLDR: This research introduces a system that improves Multimodal Large Language Models (MLLMs) for Blind and Low Vision (BLV) users by making visual descriptions more relevant. Instead of generic, lengthy outputs, the system uses historical BLV user questions from similar visual contexts to guide the MLLM, anticipating what information users are most likely to seek. Evaluations showed that these “context-aware” descriptions were more accurate, anticipated user questions more often, and were preferred by human labelers for their focused delivery of critical information.
Multimodal Large Language Models (MLLMs) have become invaluable tools for visual interpretation, especially for Blind and Low Vision (BLV) individuals. AI-powered applications such as Be My AI and Seeing AI help users understand their surroundings by describing the images they capture. However, these systems tend to generate exhaustive, lengthy descriptions regardless of what the user actually needs to know. This leads to inefficient interactions, forcing BLV users to sift through irrelevant details to find the specific information they are looking for.
To address this, researchers Ricardo E. Gonzalez Penuela, Felipe Arias-Russi, and Victor Capriles have developed a system that guides MLLMs to provide more contextually relevant information. Their work, detailed in the paper “Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations,” leverages historical questions asked by BLV users to anticipate future informational needs.
How the System Works
The core idea behind this system is to learn from past interactions. When a BLV user provides an image, the system doesn’t just generate a generic description. Instead, it identifies similar visual contexts from a specialized dataset called VizWiz-LF, which contains real visual questions from BLV users paired with their images. By retrieving these semantically similar past visual contexts and their associated questions, the system can then guide the MLLM to generate descriptions that are more aligned with what BLV users typically seek in such situations.
For example, if past users frequently asked about nutritional information or expiration dates when viewing food products, the system would prioritize including such details in its description of a new food image. This proactive approach aims to deliver the most critical information upfront, reducing the need for follow-up questions and making the interaction more efficient.
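The paper does not ship an implementation, but a minimal sketch of this retrieval step might look like the following, assuming a ChromaDB collection pre-populated with Cohere Embed v4 image embeddings of the VizWiz-LF pairs (the stack reported in the evaluation below), where each stored document holds the question asked about that image. The function names and collection layout here are illustrative, not taken from the paper:

```python
import base64

import chromadb
import cohere

co = cohere.ClientV2()  # falls back to the CO_API_KEY environment variable
chroma = chromadb.PersistentClient(path="./vizwiz_lf_index")
# Assumed layout: one entry per VizWiz-LF pair, with the image embedding as
# the vector and the BLV user's question as the stored document.
contexts = chroma.get_or_create_collection("vizwiz_lf_contexts")

def embed_image(image_path: str) -> list[float]:
    """Embed one image with Cohere Embed v4 (the API takes base64 data URLs)."""
    with open(image_path, "rb") as f:
        data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()
    resp = co.embed(
        model="embed-v4.0",
        input_type="image",
        images=[data_url],
        embedding_types=["float"],
    )
    return resp.embeddings.float[0]

def retrieve_past_questions(image_path: str, k: int = 5) -> list[str]:
    """Return the questions asked in the k most similar past visual contexts."""
    results = contexts.query(
        query_embeddings=[embed_image(image_path)],
        n_results=k,
    )
    return results["documents"][0]
```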
The Evaluation and Key Findings
The researchers conducted a human evaluation comparing their “context-aware” descriptions with “context-free” (baseline) descriptions. Three human labelers reviewed 92 descriptions, judging whether each one anticipated and answered the user’s real question and which of the two versions they preferred. The evaluated system used Gemini 2.5 Pro as the MLLM and Cohere Embed v4 for multimodal embeddings, backed by a ChromaDB vector database for context retrieval.
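As a rough illustration of how these components fit together at description time, the sketch below folds the retrieved historical questions into the prompt for Gemini 2.5 Pro via the google-genai SDK. The prompt wording is our own invention, not the paper’s template:

```python
from google import genai
from google.genai import types

client = genai.Client()  # falls back to the GEMINI_API_KEY environment variable

def describe_image(image_path: str, past_questions: list[str]) -> str:
    """Generate a context-aware description guided by historical BLV questions."""
    question_block = "\n".join(f"- {q}" for q in past_questions)
    prompt = (
        "Describe this image for a blind or low vision user. In visually "
        "similar situations, users have asked:\n"
        f"{question_block}\n"
        "Answer these likely questions early in the description, then briefly "
        "add any other important details."
    )
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            prompt,
        ],
    )
    return response.text
```

Chaining this with retrieve_past_questions from the earlier sketch gives the full hypothetical pipeline: embed the new image, retrieve similar past contexts, and prompt the MLLM to front-load the answers users are most likely to need.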
The results were promising:
- Context-aware descriptions were found to be more accurate, anticipating and answering users’ questions in 76.1% of cases, compared to 63.0% for context-free descriptions.
- Crucially, context-aware descriptions successfully anticipated and answered the user’s question in 15.2% of cases where the context-free description failed entirely.
- Human labelers preferred context-aware descriptions in 54.3% of comparisons, primarily because they focused on critical information, especially for food-related products (e.g., expiration dates, cooking instructions, nutritional information).
- While context-free descriptions were sometimes preferred for providing broader context (20.7% of cases), context-aware descriptions delivered a clear overall gain in relevance and accuracy.
Impact and Future Directions
These findings highlight that historical user questions are a powerful signal for guiding MLLMs to provide proactive and more contextually relevant visual interpretations for BLV users. The system improved overall performance without degrading the MLLM’s core ability to answer questions.
Looking ahead, the researchers plan to expand the context dataset beyond the current 600 VizWiz-LF question-image pairs by incorporating larger and more varied datasets. They also aim to explore personalized context retrieval based on individual usage patterns and to integrate alternative sources of contextual information, further improving accuracy and relevance for personal use cases.


