TLDR: This paper identifies a problem: Large Vision-Language Models (LVLMs) often generate unfaithful visual reasoning steps, in which the visual information they present is inaccurate or simply ignored, even when the final answer is correct. The authors propose a new learning strategy called Sufficient-Component Cause Model (SCCM) learning that encourages LVLMs to use visual information that is both sufficient (it can independently lead to the correct answer) and minimal (it contains no irrelevant details). Experiments show that SCCM significantly improves both the faithfulness and the accuracy of visual reasoning in these models.
Large Vision-Language Models (LVLMs) have made significant strides, particularly with the introduction of Multimodal Chain-of-Thought (MCoT) reasoning. This approach allows AI models to integrate visual information directly into their reasoning process, much like humans do. However, recent research has uncovered a critical issue: the visual information incorporated into MCoT traces is often inaccurate or largely ignored, even when the model ultimately arrives at the correct answer. This phenomenon points to a lack of ‘faithfulness’ in the visual component of the AI’s reasoning.
The core problem stems from how these models are trained, specifically the reward design in reinforcement fine-tuning (RFT). Current RFT methods primarily incentivize the mere presence of interleaved vision-text cues, rather than ensuring the correctness or sufficiency of that visual information. This can lead models to include arbitrary or ineffective visual cues, relying instead on textual reasoning to reach a conclusion.
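To make the failure mode concrete, here is a minimal Python sketch of the kind of presence-only reward described above; the `<box>` tag format and the `presence_reward` name are hypothetical, chosen purely for illustration. Because the reward pays out whenever a trace interleaves any visual cue, an arbitrary or irrelevant cue earns as much as a correct one.

```python
import re

# Hedged sketch (hypothetical tag format) of a presence-only reward: it pays out
# as soon as the MCoT trace contains any interleaved visual cue, regardless of
# whether that cue is correct or actually used to reach the answer.
def presence_reward(trace: str) -> float:
    # Reward 1.0 if the trace cites at least one image region,
    # e.g. a crop referenced as <box>x1, y1, x2, y2</box>.
    has_visual_cue = re.search(r"<box>.*?</box>", trace) is not None
    return 1.0 if has_visual_cue else 0.0

# A trace citing an arbitrary, irrelevant box still earns full reward, so the
# model can satisfy the objective while reasoning purely in text.
print(presence_reward("The sign <box>0, 0, 5, 5</box> says stop, so the answer is B."))
```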
Uncovering the Unfaithfulness
To understand this unfaithfulness, researchers conducted intervention experiments. They measured how much a model’s prediction changed when either its visual or textual ‘thoughts’ were intentionally altered. Surprisingly, predictions remained largely unchanged when visual information was intervened upon, but shifted significantly when textual information was altered. This suggests that visual evidence often plays a minimal role in the model’s actual decision-making process.
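A hedged sketch of how such an intervention analysis could be set up is shown below; the `answer_fn` callable and the `replace_visual_cues` / `perturb_textual_steps` helpers are stand-ins for querying the LVLM with edited traces, not the paper's exact procedure.

```python
from typing import Callable, Sequence

# Sketch of the intervention analysis described above. `answer_fn` is a stand-in
# for querying the LVLM with a (possibly edited) reasoning trace and returning
# its final answer; `intervene` edits either the visual or the textual thoughts.
def flip_rate(
    answer_fn: Callable[[str], str],
    traces: Sequence[str],
    intervene: Callable[[str], str],
) -> float:
    """Fraction of examples whose final answer changes after intervening on the trace."""
    flips = 0
    for trace in traces:
        original = answer_fn(trace)
        perturbed = answer_fn(intervene(trace))
        flips += int(original != perturbed)
    return flips / max(len(traces), 1)

# Usage (hypothetical helpers): compare how often answers flip when visual cues
# are swapped for random crops versus when textual steps are perturbed.
# visual_flip = flip_rate(answer_fn, traces, replace_visual_cues)
# textual_flip = flip_rate(answer_fn, traces, perturb_textual_steps)
# A much larger textual_flip than visual_flip indicates the visual evidence
# contributes little to the prediction, i.e. the unfaithfulness reported here.
```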
Further analysis involved a novel, automated LVLM-based evaluation metric designed to quantify visual faithfulness from two angles: reliability and sufficiency. Reliability assesses whether visual components genuinely support the predicted answer, while sufficiency determines if the visual information alone is enough to correctly answer the query. This evaluation revealed that visual information in current MCoT traces can be both unreliable and insufficient, sometimes even unrelated to the model’s final predictions.
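One plausible way to implement such an LVLM-as-judge check is sketched below; the prompts, the `judge` callable, and the yes/no parsing are illustrative assumptions rather than the paper's exact protocol.

```python
# Hedged sketch of an automated LVLM-as-judge evaluation along the two axes
# named above: reliability (do the cited regions support the answer?) and
# sufficiency (do the cited regions alone suffice to answer the question?).
RELIABILITY_PROMPT = (
    "Given the image regions cited in this reasoning trace, do they genuinely "
    "support the predicted answer '{answer}'? Reply yes or no.\n\nTrace:\n{trace}"
)
SUFFICIENCY_PROMPT = (
    "Using ONLY the image regions cited below (ignore all textual reasoning), "
    "can the question '{question}' be answered correctly as '{answer}'? "
    "Reply yes or no.\n\nRegions:\n{trace}"
)

def judge_score(judge, prompt_template: str, **fields) -> float:
    """Return 1.0 if the judge LVLM answers yes, else 0.0."""
    reply = judge(prompt_template.format(**fields))
    return 1.0 if reply.strip().lower().startswith("yes") else 0.0
```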
Introducing Sufficient-Component Cause Model (SCCM) Learning
To tackle this issue, a new MCoT learning strategy called Sufficient-Component Cause Model (SCCM) learning has been proposed. This innovative approach aims to make visual components truly ‘sufficient-and-minimal’ causes for correct answers. This means two things:
- The correct answer must be derivable *solely* from the visual components of the MCoT.
- The visual components should contain *no extra information* unrelated to the correct answer, encouraging the tightest possible bounding boxes for visual cues.
A key advantage of SCCM is that it is annotation-free and can be easily integrated into various RFT frameworks. By enforcing both sufficiency and minimality, SCCM encourages robust visual reasoning, reduces over-reliance on textual reasoning, and enhances the overall faithfulness of MCoT. This leads to a more traceable and intuitive understanding of how the model arrives at its predictions.
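As a rough illustration of how sufficiency and minimality could be combined into a single annotation-free reward, the sketch below scores a trace by whether its visual components alone yield the correct answer and by how little image area those components cover. The function name, the area-based penalty, and its weighting are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a sufficiency-plus-minimality reward in the spirit of SCCM.
# `visual_only_answer` stands in for re-querying the model with only the cropped
# visual components of its own trace; the area penalty is one simple way to
# encourage tight bounding boxes.
def sccm_reward(
    boxes: list[tuple[float, float, float, float]],  # (x1, y1, x2, y2), normalized to [0, 1]
    visual_only_answer: str,
    gold_answer: str,
    area_weight: float = 0.5,
) -> float:
    # Sufficiency: the visual components alone must yield the correct answer.
    sufficient = float(visual_only_answer.strip() == gold_answer.strip())
    # Minimality: penalize the total fraction of the image covered by the cited regions.
    covered = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes)
    minimal = max(0.0, 1.0 - area_weight * covered)
    # Only sufficient traces are rewarded, and tighter regions score higher.
    return sufficient * minimal

# Example: a correct visual-only answer backed by a small cited region scores
# near 1.0; an insufficient trace scores 0, and oversized crops score lower.
print(sccm_reward([(0.40, 0.40, 0.55, 0.50)], "B", "B"))
```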
Empirical Success and Future Directions
Empirical results demonstrate that SCCM consistently improves visual faithfulness across a range of fine-grained perception and reasoning benchmarks. Ablation studies further highlight the crucial role of both the sufficiency and minimality constraints: without minimality, models tended to include excessively large, inefficient visual regions. The authors report that the code for this research is publicly available.
This work marks a significant step towards ensuring that Large Vision-Language Models genuinely ‘think with images,’ mirroring human cognitive processes more closely and providing more reliable and interpretable reasoning.


