MoHoBench: A New Benchmark for Evaluating Honesty in Multimodal AI

TLDR: A new research paper introduces MoHoBench, the first systematic benchmark to assess the honesty of Multimodal Large Language Models (MLLMs) when faced with unanswerable visual questions. The benchmark categorizes unanswerable questions into four types: Context Dependent, False Premises, Subjective or Philosophical, and Vague Description. Evaluations of 28 MLLMs using MoHoBench reveal that most models fail to appropriately refuse to answer such questions, and that visual information significantly impacts their honesty. The study also shows that model size doesn’t guarantee honesty and proposes initial alignment methods to improve MLLM honesty.

Multimodal Large Language Models, or MLLMs, have made significant strides in tasks that combine vision and language, showing impressive abilities in understanding and generating content from both images and text. These models are becoming increasingly common in various real-world applications, but their trustworthiness, especially their ability to be honest, is a growing concern. While a lot of research has focused on the trustworthiness of traditional language models, how MLLMs behave honestly, particularly when faced with visual questions that cannot be answered, has not been thoroughly explored.

A new study introduces the first comprehensive evaluation of honesty in various MLLMs. The researchers define honesty based on how models respond to visual questions that are unanswerable. They identified four main types of such questions and used them to build MoHoBench, a large-scale benchmark for MLLM honesty. This benchmark contains over 12,000 visual question samples, with their quality ensured through multiple stages of filtering and human verification.

The four types of unanswerable visual questions defined in MoHoBench are:

Context Dependent Questions

These questions require background knowledge or external context beyond what is visible in the image. The visual information alone is not enough to answer them, often involving reasoning about events, cause-and-effect relationships, or future predictions. For example, asking why elephants are gathering by water when the image doesn’t provide that specific context.

False Premises Questions

These questions are based on assumptions that directly contradict the visual information in the image. An example would be asking how elephants stay warm in a snowy tundra when the image clearly shows a different environment without snow or blizzards.

Subjective or Philosophical Questions

These questions involve personal opinions, ethical judgments, or philosophical reasoning that cannot be objectively determined from the image. For instance, asking if a photo of elephants evokes a sense of interconnectedness among living beings is inherently subjective.

Also Read:

Vague Description Questions

These questions are phrased imprecisely or use ambiguous terms, making it difficult for the model to identify relevant visual cues. An example is asking about the color of “the thing” behind subjects without clearly specifying which object is being referred to.

Using MoHoBench, the researchers tested the honesty of 28 popular MLLMs and conducted an in-depth analysis. Their findings revealed several key points. Firstly, most models struggle to appropriately refuse to answer when they should. This means they often try to answer questions even when the visual information is insufficient, rather than admitting uncertainty.

Secondly, the study found that MLLMs’ honesty is not just a language modeling issue; it is significantly influenced by visual information. This highlights the need for specialized methods to align MLLMs for honesty in a multimodal context. The researchers also conducted experiments on how visual quality affects a model’s honesty. They found that additive noise (like Gaussian noise) can make models more overconfident, leading to a decrease in refusal rates, while contrast adjustments can sometimes increase refusal rates by making details less visible.

Interestingly, the study observed that larger model size does not necessarily guarantee better honesty. While some larger models performed better, there wasn’t a strong correlation, suggesting that architectural design and specific alignment strategies play a more crucial role than just scale.

To address these limitations, the researchers implemented initial alignment methods, including supervised fine-tuning (SFT) and direct preference optimization (DPO), to improve the honesty behavior of MLLMs. These methods provide a foundation for future work aimed at creating more trustworthy MLLMs. The data and code for MoHoBench are publicly available, encouraging further research in this important area. You can find the full research paper here.

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

MoHoBench: A New Benchmark for Evaluating Honesty in Multimodal AI

Context Dependent Questions

False Premises Questions

Subjective or Philosophical Questions

Vague Description Questions

Gen AI News and Updates

Amazon Bedrock’s A2A Protocol: The Catalyst for Next-Gen Cross-Framework Multi-Agent AI Systems

AZTECH Introduces Comprehensive AI Training Series to Propel Regional Digital Transformation

Financial Sector Fortifies Against Surging AI-Powered Scams

Boosting Business Efficiency: A New AI and Big Data Model for Process Optimization

AlphaCast: A New Approach to Time Series Prediction Through Human-AI Collaboration

New Graph Neural Networks Improve Reasoning in Assumption-Based Argumentation

Enhancing AI Reasoning: How Recursive Refinement and Multi-Agent Systems Improve Language Model Performance

ARGUS: A Proactive Framework for Enhancing Autonomous Driving Safety

Generative AI Powers Next-Gen Autonomous Emergency Response

OR-R1: Advancing Automated Optimization with Smart, Data-Efficient AI

Enhancing GUI Agents with Memory: A New Framework for History-Aware Reasoning

ProBench: A Deeper Look into How We Evaluate AI Agents for Mobile Apps

Enhancing Large Language Model Reasoning with Concise Outputs

Ensuring Trust in Autonomous AI: A Two-Layered Monitoring Approach for Agentic Systems

MedFuse: A Multiplicative Approach to Understanding Irregular Clinical Time Series Data

HyperD: A New Framework for More Accurate and Robust Traffic Predictions

Beyond Training: Researchers Propose ‘Model Raising’ for AI with Intrinsic Values

Bridging the Divide: Why AI Needs a Qualitative Revolution

Language Models Enhance Safety Certificate Synthesis for Dynamic Systems

Subscribe to get the latest news and updates