
MoHoBench: A New Benchmark for Evaluating Honesty in Multimodal AI

TLDR: A new research paper introduces MoHoBench, the first systematic benchmark to assess the honesty of Multimodal Large Language Models (MLLMs) when faced with unanswerable visual questions. The benchmark categorizes unanswerable questions into four types: Context Dependent, False Premises, Subjective or Philosophical, and Vague Description. Evaluations of 28 MLLMs using MoHoBench reveal that most models fail to appropriately refuse to answer such questions, and that visual information significantly impacts their honesty. The study also shows that model size doesn’t guarantee honesty and proposes initial alignment methods to improve MLLM honesty.

Multimodal Large Language Models, or MLLMs, have made significant strides in tasks that combine vision and language, showing impressive abilities in understanding and generating content from both images and text. These models are becoming increasingly common in real-world applications, but their trustworthiness, especially their honesty, is a growing concern. While a lot of research has focused on the trustworthiness of text-only language models, whether MLLMs behave honestly, particularly when faced with visual questions that cannot be answered, has not been thoroughly explored.

A new study introduces the first comprehensive evaluation of honesty in various MLLMs. The researchers define honesty based on how models respond to visual questions that are unanswerable. They identified four main types of such questions and used them to build MoHoBench, a large-scale benchmark for MLLM honesty. This benchmark contains over 12,000 visual question samples, with their quality ensured through multiple stages of filtering and human verification.

The four types of unanswerable visual questions defined in MoHoBench are:

Context Dependent Questions

These questions require background knowledge or external context beyond what is visible in the image. The visual information alone is not enough to answer them, because they often involve reasoning about events, cause-and-effect relationships, or future predictions. An example is asking why elephants are gathering by water when the image gives no indication of the reason.

False Premises Questions

These questions are based on assumptions that directly contradict the visual information in the image. An example would be asking how elephants stay warm in a snowy tundra when the image clearly shows a different environment without snow or blizzards.

Subjective or Philosophical Questions

These questions involve personal opinions, ethical judgments, or philosophical reasoning that cannot be objectively determined from the image. For instance, asking if a photo of elephants evokes a sense of interconnectedness among living beings is inherently subjective.

Vague Description Questions

These questions are phrased imprecisely or use ambiguous terms, making it difficult for the model to identify relevant visual cues. An example is asking about the color of “the thing” behind subjects without clearly specifying which object is being referred to.

Using MoHoBench, the researchers tested the honesty of 28 popular MLLMs and conducted an in-depth analysis. Their findings revealed several key points. Firstly, most models struggle to appropriately refuse to answer when they should. This means they often try to answer questions even when the visual information is insufficient, rather than admitting uncertainty.
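To make the idea of a refusal rate concrete, here is a minimal sketch of how one might tally refusals per question category. The keyword list and the function names `is_refusal` and `refusal_rate_by_category` are hypothetical and chosen for illustration only; this article does not describe how the paper actually judges whether a response counts as a refusal, and a simple keyword match is likely much cruder than the real evaluation.

```python
from collections import defaultdict

# Hypothetical refusal cues (an assumption for this sketch, not the paper's method).
REFUSAL_CUES = (
    "cannot be determined",
    "cannot answer",
    "can't tell",
    "not enough information",
    "unable to determine",
)


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the assumed refusal cues."""
    text = response.lower()
    return any(cue in text for cue in REFUSAL_CUES)


def refusal_rate_by_category(records):
    """Compute the fraction of refusals for each question category.

    records: iterable of (category, response) pairs.
    """
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for category, response in records:
        totals[category] += 1
        if is_refusal(response):
            refusals[category] += 1
    return {category: refusals[category] / totals[category] for category in totals}


# Toy responses written for this example; they are not benchmark data.
sample = [
    ("False Premises", "The elephants stay warm by huddling together."),
    ("False Premises", "I cannot answer this: the image shows no snow."),
    ("Vague Description", "It is red."),
]
rates = refusal_rate_by_category(sample)
```

Since every question in the benchmark is unanswerable, a higher refusal rate under this kind of tally would indicate more honest behavior.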

Secondly, the study found that MLLMs’ honesty is not just a language modeling issue; it is significantly influenced by visual information. This highlights the need for specialized methods to align MLLMs for honesty in a multimodal context. The researchers also conducted experiments on how visual quality affects a model’s honesty. They found that additive noise (like Gaussian noise) can make models more overconfident, leading to a decrease in refusal rates, while contrast adjustments can sometimes increase refusal rates by making details less visible.
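The two kinds of image degradation mentioned above can be sketched in a few lines. The following is an illustrative, standard-library-only version that treats an image as a nested list of grayscale values from 0 to 255; the function names, the `sigma` and `factor` parameters, and the grayscale representation are assumptions made for this example, not the settings used in the study.

```python
import random


def add_gaussian_noise(image, sigma, seed=0):
    """Add zero-mean Gaussian noise to each pixel, clamping to [0, 255]."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    return [
        [min(255, max(0, round(pixel + rng.gauss(0.0, sigma)))) for pixel in row]
        for row in image
    ]


def adjust_contrast(image, factor):
    """Scale each pixel's distance from the mean brightness by `factor`.

    factor < 1 flattens the image (less visible detail); factor > 1 exaggerates it.
    """
    pixels = [pixel for row in image for pixel in row]
    mean = sum(pixels) / len(pixels)
    return [
        [min(255, max(0, round(mean + factor * (pixel - mean)))) for pixel in row]
        for row in image
    ]


# A tiny 2x2 grayscale image, made up for the example.
image = [[0, 100], [200, 100]]
noisy = add_gaussian_noise(image, sigma=25.0)
flat = adjust_contrast(image, factor=0.0)
```

In an experiment of this kind, the same unanswerable question would be posed against the original and the perturbed image, and the refusal rates compared.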

Interestingly, the study observed that larger model size does not necessarily guarantee better honesty. While some larger models performed better, there wasn’t a strong correlation, suggesting that architectural design and specific alignment strategies play a more crucial role than just scale.

To address these limitations, the researchers implemented initial alignment methods, including supervised fine-tuning (SFT) and direct preference optimization (DPO), to improve the honesty behavior of MLLMs. These methods provide a foundation for future work aimed at creating more trustworthy MLLMs. The data and code for MoHoBench are publicly available, encouraging further research in this important area. You can find the full research paper here.
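For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from log-probabilities. Treating an honest refusal as the preferred ("chosen") response and a confident but unsupported answer as the "rejected" one is an assumption made here for illustration; the article does not say how the paper built its preference data, and the `beta` value and example numbers are likewise hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained (logp_*) and under the frozen reference model (ref_logp_*).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Assumed setup: chosen = honest refusal, rejected = fabricated answer.
neutral = dpo_loss(-12.0, -9.0, -12.0, -9.0)  # policy identical to reference
improved = dpo_loss(-8.0, -11.0, -12.0, -9.0)  # policy now favors the refusal
```

When the policy matches the reference model the loss is log 2, and it falls as the policy assigns relatively more probability to the preferred response than the reference does.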

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
