TLDR: A new study reveals a ‘dual nature’ in Vision-Language Models (VLMs): while reasoning enhances logical inference, it can also impair perceptual grounding, leading to ‘visual forgetting’ where models disregard visual input during prolonged thought. To address this, researchers propose Vision-Anchored Policy Optimization (VAPO), a method that inserts ‘visual anchors’ and uses a ‘perception reward’ to explicitly steer reasoning towards visually grounded trajectories. VAPO-Thinker-7B, the resulting model, significantly improves reliance on visual information and achieves state-of-the-art performance on various benchmarks.
In the rapidly evolving world of artificial intelligence, Vision-Language Models (VLMs) have emerged as powerful tools, capable of understanding and generating content based on both images and text. These models are increasingly being trained to perform complex reasoning tasks, from solving intricate math problems to generating code. However, new research from Xinyu Tian and colleagues uncovers a surprising challenge: while advanced reasoning can boost a VLM’s ability to tackle tough logical problems, it can also make the model ‘forget’ what it’s seeing, leading to basic visual recognition errors.
The paper, titled “More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models,” highlights what the authors call the ‘dual nature’ of multimodal reasoning. Imagine an AI model trying to solve a visual puzzle. The more it ‘thinks’ or reasons through complex steps, the more it might start to disregard the actual image, relying instead on its internal thought process. This phenomenon, termed ‘visual forgetting,’ means that prolonged reasoning can inadvertently reduce the model’s reliance on crucial visual input.
The Problem: When More Thinking Leads to Less Seeing
The researchers conducted a detailed analysis, evaluating how existing VLMs perform as their reasoning processes become longer. They found that while initial reasoning steps often improved accuracy, these gains would eventually plateau and even reverse. On tasks requiring precise visual understanding, such as counting objects in an image or interpreting charts, accuracy could drop significantly after extended reasoning. A key finding was that over 50% of errors made by these models were ‘perception errors’ – mistakes in correctly interpreting visual details, rather than logical missteps. Surprisingly, many of these errors could have been avoided if the model had stopped reasoning earlier, suggesting it initially had the correct visual understanding but was led astray by its own prolonged thoughts.
To further investigate, the team tracked how much attention the models paid to visual information during their reasoning process. They observed a clear decline in ‘visual attention’ as reasoning progressed, with models increasingly relying on their generated text rather than the original image. This confirmed the hypothesis of visual forgetting.
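One way to picture this measurement is as the share of attention mass that each newly generated token places on image tokens. The sketch below is only an illustration under simplifying assumptions, not the authors' analysis code: the function name `visual_attention_share`, the flat per-token attention rows, and the toy numbers are all hypothetical.

```python
from typing import List, Sequence


def visual_attention_share(
    attn_rows: Sequence[Sequence[float]],
    is_image_token: Sequence[bool],
) -> List[float]:
    """Return, for each generated token, the fraction of its attention
    mass that falls on image tokens.

    attn_rows[t] holds the attention weights of generated token t over the
    context positions visible to it (rows may grow as generation proceeds).
    is_image_token[i] marks whether context position i is an image token.
    """
    shares = []
    for row in attn_rows:
        total = sum(row)
        # zip stops at the end of the row, so longer masks are fine.
        visual = sum(w for w, is_img in zip(row, is_image_token) if is_img)
        shares.append(visual / total if total > 0 else 0.0)
    return shares


# Toy data (made up): two image tokens followed by text tokens.
mask = [True, True, False, False, False]
rows = [
    [0.4, 0.3, 0.3],            # early reasoning token
    [0.2, 0.2, 0.3, 0.3],       # later token
    [0.1, 0.1, 0.2, 0.3, 0.3],  # even later token
]
shares = visual_attention_share(rows, mask)
```

In this toy example the share falls from roughly 0.7 to 0.4 to 0.2, which is the kind of downward trend the article describes as visual forgetting.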
The Solution: Anchoring Reasoning in Visual Evidence
To combat this issue, the researchers propose a novel method called Vision-Anchored Policy Optimization (VAPO). VAPO is designed to explicitly guide the reasoning process to stay grounded in visual evidence. Here’s how it works: During training, VAPO generates a series of ‘visual claims’ about an image, some correct and some incorrect. These claims are then strategically inserted as ‘visual anchors’ at various points within the model’s reasoning path. At each anchor, the model is prompted to judge the truthfulness of the claim, forcing it to re-engage with the visual input.
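The anchoring step can be sketched roughly as follows. This is a minimal illustration of the idea as described here, not the paper's implementation: the `Anchor` record, the `build_anchored_prompts` helper, the yes/no prompt wording, and the choice of anchor positions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Anchor:
    step_index: int  # number of reasoning steps produced before the anchor
    claim: str       # a statement about the image
    is_true: bool    # ground-truth label of the claim


def build_anchored_prompts(
    steps: List[str],
    claims: List[Tuple[str, bool]],
    anchor_steps: List[int],
) -> List[Tuple[Anchor, str]]:
    """Pair each claim with a position in the reasoning path and build the
    text the model would see when asked to judge that claim there."""
    anchored = []
    for (claim, is_true), k in zip(claims, anchor_steps):
        if not 0 <= k <= len(steps):
            raise ValueError(f"anchor position {k} is outside the reasoning path")
        # The prompt contains only the reasoning produced so far (first k steps).
        parts = steps[:k] + [
            f"Claim about the image: {claim}",
            "Is this claim true? Answer yes or no.",
        ]
        anchored.append((Anchor(k, claim, is_true), "\n".join(parts)))
    return anchored


# Hypothetical reasoning path and claims for a bar-chart question.
steps = [
    "The chart has three bars.",
    "The tallest bar is blue.",
    "So the answer is blue.",
]
claims = [("There are three bars.", True), ("The tallest bar is red.", False)]
anchored = build_anchored_prompts(steps, claims, anchor_steps=[1, 3])
```

Each resulting prompt ends with a claim the model cannot verify from its own text alone, so answering correctly requires looking at the image again.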
A special ‘perception reward’ is introduced, which encourages the model to accurately evaluate these visual claims. This reward is weighted to give more importance to anchors later in the reasoning process, precisely where visual forgetting is most likely to occur. By integrating this perception reward with standard training techniques, VAPO effectively teaches the model to maintain its visual grounding throughout complex reasoning.
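A late-weighted reward of this kind could look like the sketch below. The article does not give the exact weighting scheme or how the rewards are combined, so the linear weights (`step_index + 1`), the `lam` mixing coefficient, and both function names are assumptions chosen only to make the idea concrete.

```python
from typing import List, Tuple


def perception_reward(judgments: List[Tuple[int, bool]]) -> float:
    """Weighted fraction of anchors judged correctly.

    judgments holds (step_index, judged_correctly) pairs. Later anchors get
    larger weights; the linear weight step_index + 1 is an assumed choice.
    """
    if not judgments:
        return 0.0
    weights = [step_index + 1 for step_index, _ in judgments]
    earned = sum(w for w, (_, ok) in zip(weights, judgments) if ok)
    return earned / sum(weights)


def total_reward(
    answer_correct: bool,
    judgments: List[Tuple[int, bool]],
    lam: float = 0.5,
) -> float:
    """Combine the usual answer-accuracy reward with the perception reward.
    The additive form and the coefficient lam are hypothetical."""
    return float(answer_correct) + lam * perception_reward(judgments)


# Getting only the late anchor right earns more than only the early one.
early_only = perception_reward([(1, True), (3, False)])  # weights 2 and 4
late_only = perception_reward([(1, False), (3, True)])
```

With these assumed weights, `early_only` is 2/6 and `late_only` is 4/6, so a trajectory that stays visually grounded late in its reasoning is rewarded more than one that is only grounded at the start.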
Impressive Results and Future Directions
The model trained with this new approach, named VAPO-Thinker-7B, achieved new state-of-the-art results across a wide range of benchmarks, including mathematical and general-purpose visual tasks. It showed particular strength in vision-intensive problems, demonstrating a significant improvement in visually grounded reasoning. Unlike simple ‘test-time’ fixes that might re-show the image or prompt the model to look again, VAPO offers a fundamental solution by training the model to inherently rely more on visual information.
The research highlights the critical importance of ensuring that AI models don’t just ‘think’ but also ‘see’ effectively, especially as they tackle increasingly complex multimodal challenges. While VAPO shows great promise, the authors acknowledge areas for future improvement, such as enhancing the quality of generated visual claims and designing adaptive policies for different task types. This work paves the way for more reliable and accurate Vision-Language Models that can truly integrate logical inference with robust visual perception. The full research paper is titled “More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models.”


