Vision-Language Models: The Peril of Prolonged Reasoning and a Solution for Visual Grounding

TLDR: A new study reveals a ‘dual nature’ in Vision-Language Models (VLMs): while reasoning enhances logical inference, it can also impair perceptual grounding, leading to ‘visual forgetting’ where models disregard visual input during prolonged thought. To address this, researchers propose Vision-Anchored Policy Optimization (VAPO), a method that inserts ‘visual anchors’ and uses a ‘perception reward’ to explicitly steer reasoning towards visually grounded trajectories. VAPO-Thinker-7B, the resulting model, significantly improves reliance on visual information and achieves state-of-the-art performance on various benchmarks.

Vision-Language Models (VLMs) can understand and generate content based on both images and text, and they are increasingly being trained to perform complex reasoning tasks, from solving intricate math problems to generating code. However, new research from Xinyu Tian and colleagues uncovers a surprising challenge: while advanced reasoning can boost a VLM’s ability to tackle tough logical problems, it can also make the model ‘forget’ what it’s seeing, leading to basic visual recognition errors.

The paper, titled “More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models,” highlights what the authors call the ‘dual nature’ of multimodal reasoning. Imagine an AI model trying to solve a visual puzzle. The more it ‘thinks’ or reasons through complex steps, the more it might start to disregard the actual image, relying instead on its internal thought process. This phenomenon, termed ‘visual forgetting,’ means that prolonged reasoning can inadvertently reduce the model’s reliance on crucial visual input.

The Problem: When More Thinking Leads to Less Seeing

The researchers conducted a detailed analysis, evaluating how existing VLMs perform as their reasoning processes become longer. They found that while initial reasoning steps often improved accuracy, these gains would eventually plateau and even reverse. On tasks requiring precise visual understanding, such as counting objects in an image or interpreting charts, accuracy could drop significantly after extended reasoning. A key finding was that over 50% of errors made by these models were ‘perception errors’ – mistakes in correctly interpreting visual details, rather than logical missteps. Surprisingly, many of these errors could have been avoided if the model had stopped reasoning earlier, suggesting it initially had the correct visual understanding but was led astray by its own prolonged thoughts.

To further investigate, the team tracked how much attention the models paid to visual information during their reasoning process. They observed a clear decline in ‘visual attention’ as reasoning progressed, with models increasingly relying on their generated text rather than the original image. This confirmed the hypothesis of visual forgetting.
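One way to picture this measurement is as the share of attention mass that each newly generated token places on image tokens. The sketch below is only an illustration under stated assumptions, not the authors’ analysis code: the function name `visual_attention_share`, the list-of-rows input format, and the idea of averaging attention into one row per generated token are all hypothetical.

```python
def visual_attention_share(attn_rows, is_image_token):
    """Return, for each generated token, the fraction of its attention
    mass that lands on image tokens.

    attn_rows: one list of attention weights per generated token. Rows may
        grow over time because later tokens can also attend to earlier
        generated text.
    is_image_token: booleans for the prompt positions only. Positions
        beyond this list are treated as generated text (assumption).
    """
    shares = []
    for row in attn_rows:
        total = sum(row)
        # zip stops at the end of the prompt mask, so generated-text
        # positions count toward the total but never toward the visual sum.
        visual = sum(w for w, is_img in zip(row, is_image_token) if is_img)
        shares.append(visual / total if total > 0 else 0.0)
    return shares


# Toy example: two image tokens and one text token in the prompt.
prompt_mask = [True, True, False]
rows = [
    [0.5, 0.3, 0.2],        # early reasoning token
    [0.2, 0.2, 0.2, 0.4],   # later token, also attends to generated text
]
shares = visual_attention_share(rows, prompt_mask)
print(shares)
```

A falling sequence of shares over the course of a response would correspond to the decline in visual attention described above.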

The Solution: Anchoring Reasoning in Visual Evidence

To combat this issue, the researchers propose a novel method called Vision-Anchored Policy Optimization (VAPO). VAPO is designed to explicitly guide the reasoning process to stay grounded in visual evidence. Here’s how it works: during training, VAPO generates a series of ‘visual claims’ about an image, some correct and some incorrect. These claims are then strategically inserted as ‘visual anchors’ at various points within the model’s reasoning path. At each anchor, the model is prompted to judge the truthfulness of the claim, forcing it to re-engage with the visual input.
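The anchor-insertion step can be sketched roughly as follows. This is a minimal illustration, not the paper’s implementation: the function name `insert_visual_anchors`, the `[ANCHOR]` prompt wording, the random choice of insertion points, and the representation of reasoning as a list of step strings are all assumptions made for the example.

```python
import random


def insert_visual_anchors(steps, claims, seed=0):
    """Interleave true/false visual claims into a list of reasoning steps.

    steps: list of reasoning-step strings.
    claims: list of (claim_text, is_true) pairs about the image.
    Returns (anchored_steps, anchors), where each anchor records the step
    it follows, the claim text, and the ground-truth label.
    """
    rng = random.Random(seed)
    k = min(len(claims), len(steps))
    # Hypothetical policy: pick k distinct insertion points at random.
    cut_points = set(rng.sample(range(1, len(steps) + 1), k))

    anchored, anchors = [], []
    claim_iter = iter(claims)
    for i, step in enumerate(steps, start=1):
        anchored.append(step)
        if i in cut_points:
            text, label = next(claim_iter)
            anchored.append(f"[ANCHOR] Is this claim true of the image? {text}")
            anchors.append({"after_step": i, "claim": text, "label": label})
    return anchored, anchors


steps = ["Read the chart title.", "Find the tallest bar.",
         "Compare it with the others.", "State the answer."]
claims = [("The chart has four bars.", True),
          ("The tallest bar is red.", False)]
anchored, anchors = insert_visual_anchors(steps, claims)
for line in anchored:
    print(line)
```

During training, the model’s true/false judgement at each `[ANCHOR]` line would be compared against the stored label.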

A special ‘perception reward’ is introduced, which encourages the model to accurately evaluate these visual claims. This reward is weighted to give more importance to anchors later in the reasoning process, precisely where visual forgetting is most likely to occur. By integrating this perception reward with standard training techniques, VAPO effectively teaches the model to maintain its visual grounding throughout complex reasoning.
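The article does not give the exact formula for the perception reward, so the following is only a sketch of one plausible shape. It assumes an exponential weighting over each anchor’s relative position and a simple additive combination with the usual answer-correctness reward; the names `perception_reward` and `total_reward` and the parameters `beta` and `gamma` are hypothetical.

```python
import math


def perception_reward(anchors, beta=1.0):
    """Weighted accuracy on visual anchors, favouring later anchors.

    anchors: list of (relative_position, judged_correctly) pairs, where
        relative_position is in [0, 1] (0 = start of reasoning, 1 = end).
    beta: assumed sharpness of the late-anchor weighting; beta = 0 gives
        every anchor equal weight.
    """
    if not anchors:
        return 0.0
    weights = [math.exp(beta * pos) for pos, _ in anchors]
    earned = sum(w for w, (_, ok) in zip(weights, anchors) if ok)
    return earned / sum(weights)


def total_reward(answer_correct, anchors, gamma=0.5, beta=1.0):
    """Assumed combination: outcome reward plus scaled perception reward."""
    return float(answer_correct) + gamma * perception_reward(anchors, beta)


# Same number of correct judgements, but at different points in the trace.
late_hit = [(0.1, False), (0.9, True)]
early_hit = [(0.1, True), (0.9, False)]
print(perception_reward(late_hit), perception_reward(early_hit))
```

Under this weighting, judging a late anchor correctly earns more than judging an early one correctly, which matches the stated goal of rewarding grounding where visual forgetting is most likely.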

Impressive Results and Future Directions

The model trained with this new approach, named VAPO-Thinker-7B, achieved new state-of-the-art results across a wide range of benchmarks, including mathematical and general-purpose visual tasks. It showed particular strength in vision-intensive problems, demonstrating a significant improvement in visually grounded reasoning. Unlike simple ‘test-time’ fixes that might re-show the image or prompt the model to look again, VAPO offers a fundamental solution by training the model to inherently rely more on visual information.

The research highlights the critical importance of ensuring that AI models don’t just ‘think’ but also ‘see’ effectively, especially as they tackle increasingly complex multimodal challenges. While VAPO shows great promise, the authors acknowledge areas for future improvement, such as enhancing the quality of generated visual claims and designing adaptive policies for different task types. This work paves the way for more reliable and accurate Vision-Language Models that can truly integrate logical inference with robust visual perception. The full research paper is titled “More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models.”
