TLDR: ChainMPQ is a new training-free method designed to reduce ‘relation hallucinations’ in Large Vision-Language Models (LVLMs). These hallucinations occur when models correctly identify objects but misinterpret their relationships. ChainMPQ addresses this by enhancing visual attention, breaking down questions into multi-perspective sub-questions, and using an interleaved chain of text and visual memories to guide a progressive, step-by-step reasoning process. Experiments show it significantly improves accuracy and reduces relational errors across various LVLMs and benchmarks.
Large Vision-Language Models (LVLMs) have made incredible strides in understanding and generating content from both images and text. They power applications like image captioning and visual question answering. However, these advanced models sometimes produce outputs that don’t quite match the visual information they’re given. This phenomenon is known as ‘hallucination’.
Hallucinations in LVLMs can be categorized into three types: object, attribute, and relation. Object hallucinations occur when a model describes objects that aren’t actually present in the image, while attribute hallucinations involve misstating properties like color or shape. Relation hallucinations, which account for a significant portion of these errors (nearly 40%), happen when models correctly identify objects but infer the wrong relationship between them. For example, an LVLM might see a man riding a surfboard but incorrectly state that he is ‘standing’ on it.
While previous research has made progress in reducing object and attribute hallucinations, relation hallucinations have received less attention despite their prevalence. Existing methods often treat relational reasoning as a single-step process, expecting models to identify entities and their relationships simultaneously. This approach can lead to errors because it relies heavily on pre-existing language patterns rather than a thorough visual analysis.
Introducing ChainMPQ: A New Approach to Relational Reasoning
Inspired by how humans reason—first locating objects, then examining their interactions, and finally synthesizing visual evidence—researchers Yike Wu, Yiwei Wang, and Yujun Cai have proposed a novel method called ChainMPQ (Multi-Perspective Questions guided Interleaved Chain of Image and Text). This training-free framework aims to improve relational inference in LVLMs by breaking down complex reasoning into manageable steps and utilizing accumulated textual and visual memories.
ChainMPQ works in three main stages:
1. Text-guided Attention Enhancement: First, it extracts subject and object keywords from the user’s question. These keywords are then used to enhance the corresponding regions in the image, helping the model focus precisely on the relevant entities.
2. Multi-Perspective Aware Text Prompt Construction: The original question is then decomposed into five complementary sub-questions designed to probe different aspects of the relationship. For instance, if the original question is “Does the dog chase a disc?”, ChainMPQ generates questions like “Where is the dog?”, “Where is the disc?”, “What is the dog chasing?” (masking the object), “What is the disc being chased by?” (masking the subject), and “What is the relationship between the dog and the disc?” (masking the relation). This encourages the model to analyze individual components before making a final judgment. A minimal sketch of this construction appears after the list.
3. Interleaved Text-Image Reasoning Chain: The constructed sub-questions are then fed to the model sequentially. Crucially, ChainMPQ doesn’t just use textual answers from previous steps as context; it also transfers visual memories by adjusting attention maps based on what the model focused on earlier. This creates an “interleaved chain” of images and text, guiding the model through a progressive reasoning process. This accumulated multimodal evidence helps the model systematically analyze relationships rather than relying on superficial patterns. A rough sketch of this loop, combined with the attention enhancement from step 1, follows the question-construction example below.
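To make step 2 concrete, here is a minimal sketch of how the five complementary sub-questions could be assembled from a (subject, relation, object) triple. The function name, signature, and exact templates are assumptions based on the dog/disc example above, not the paper’s actual prompts:

```python
def build_subquestions(subject: str, obj: str,
                       relation_progressive: str, relation_passive: str) -> list[str]:
    """Decompose a relation question into five complementary sub-questions,
    each probing the relationship from a different perspective."""
    return [
        f"Where is the {subject}?",                                         # locate the subject
        f"Where is the {obj}?",                                             # locate the object
        f"What is the {subject} {relation_progressive}?",                   # mask the object
        f"What is the {obj} being {relation_passive} by?",                  # mask the subject
        f"What is the relationship between the {subject} and the {obj}?",   # mask the relation
    ]

# Example from the article: "Does the dog chase a disc?"
for q in build_subquestions("dog", "disc", "chasing", "chased"):
    print(q)
```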
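And here is a rough, training-free sketch of how steps 1 and 3 might fit together: each sub-question is answered in turn, with prior answers carried forward as textual context and prior attention carried forward as a visual memory that re-weights the image features. The `lvlm.answer(...)` wrapper, the soft attention scaling, and the max-pooled visual memory are all hypothetical simplifications; the paper’s actual attention-transfer mechanism may differ:

```python
import numpy as np

def enhance_attention(image_features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Up-weight image tokens the chain has focused on so far (soft scaling, not masking)."""
    return image_features * (1.0 + weights[:, None])

def interleaved_chain(lvlm, image_features, subquestions, final_question):
    text_memory = []                                    # accumulated textual answers
    visual_memory = np.zeros(image_features.shape[0])   # per-token attention; could instead be
                                                        # seeded from keyword-grounded regions (step 1)

    for q in subquestions:
        feats = enhance_attention(image_features, visual_memory)
        prompt = " ".join(text_memory + [q])            # prior answers serve as context
        answer, attn = lvlm.answer(feats, prompt)       # assumed: returns attention over image tokens
        text_memory.append(f"Q: {q} A: {answer}")
        visual_memory = np.maximum(visual_memory, attn) # carry visual focus to the next step

    # Final judgment uses the full accumulated multimodal memory.
    feats = enhance_attention(image_features, visual_memory)
    final_answer, _ = lvlm.answer(feats, " ".join(text_memory + [final_question]))
    return final_answer
```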
Demonstrated Effectiveness
The researchers evaluated ChainMPQ on two state-of-the-art LVLMs, LLaVA-1.5-7B and InstructBLIP-7B, using relation-focused benchmarks like MMRel and R-Bench. The results were promising: ChainMPQ consistently outperformed existing baselines, showing significant reductions in relation hallucinations. For example, on the MMRel benchmark, ChainMPQ achieved a 1.7% accuracy improvement over the best baseline with LLaVA-1.5. It also demonstrated strong gains in precision, indicating fewer incorrect relation predictions.
Ablation studies confirmed the importance of each core component of ChainMPQ. Removing any one part led to a decrease in performance, highlighting the synergistic effect of the text-guided attention, multi-perspective questions, and the interleaved reasoning chain.
Real-World Impact
Case studies vividly illustrate ChainMPQ’s ability to correct errors. In an “action case” where a baseline model incorrectly identified a man “standing” on a surfboard instead of “riding” it, ChainMPQ’s step-by-step process, guided by sub-questions, led the model to the correct answer (“no, he is riding”). Similarly, in a “spatial case” involving a chair and a trash bin, ChainMPQ accurately determined the spatial relationship, correcting a baseline error.
By providing a structured, step-by-step approach to relational inference, ChainMPQ offers a robust framework for improving the reliability and factuality of LVLMs. This work is a significant step towards building more trustworthy and accurate AI systems that can truly understand the world through both language and vision. You can read the full research paper here: CHAINMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations.


