TLDR: A new research paper reveals that Vision-Language Models (VLMs) often capture detailed visual information internally but struggle to translate it into accurate textual responses, especially for fine-grained recognition and object counting. This ‘information loss’ primarily occurs in the language decoder. Conversely, for spatial understanding, the initial visual encoder is often the limiting factor. The study also highlights VLMs’ reliance on texture over shape and varying robustness across internal processing stages, offering insights for future model improvements.
Vision-Language Models (VLMs) have become powerful tools for tackling complex tasks that combine computer vision and natural language understanding. From interpreting charts to understanding humor in images, these models showcase impressive capabilities. However, recent research suggests that despite their advanced performance, VLMs sometimes struggle with fundamental visual understanding skills, such as recognizing simple negations or accurately counting objects.
A new research paper titled “Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities” delves into these limitations. Unlike previous studies that only evaluate the final output of VLMs, this paper introduces a novel approach: it examines the performance of VLMs at different internal stages of information processing. The researchers, Shivam Chandhok, Wan-Cyuan Fan, Vered Shwartz, Vineeth N Balasubramanian, and Leonid Sigal, aim to pinpoint exactly where these models might be falling short.
Understanding the VLM’s Internal Journey
The study breaks down the VLM’s architecture into three key ‘spaces’ or stages: the visual latent space (output of the visual encoder), the vision-language shared latent space (output of the vision-language projection module), and the language response space (output of the language decoder). By analyzing how well each of these stages performs on basic visual tasks, the researchers can identify where visual knowledge might be lost or misinterpreted.
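To make the three stages concrete, here is a minimal, self-contained PyTorch sketch of where each ‘space’ sits in a LLaVA-style pipeline. The modules are toy stand-ins (a real VLM would use a ViT encoder and an LLM decoder), not the paper’s actual implementation.

```python
# Toy sketch of the three representation "spaces" probed in the paper.
# Module choices and dimensions are illustrative placeholders only.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vis_dim=1024, lm_dim=4096, vocab=32000):
        super().__init__()
        self.visual_encoder = nn.Linear(3 * 224 * 224, vis_dim)     # stand-in for a ViT encoder
        self.vl_projector = nn.Linear(vis_dim, lm_dim)              # vision-language projection module
        self.language_decoder = nn.Linear(lm_dim, vocab)            # stand-in for the LLM decoder

    def forward(self, image):
        z_visual = self.visual_encoder(image.flatten(1))   # 1) visual latent space
        z_shared = self.vl_projector(z_visual)             # 2) vision-language shared latent space
        logits = self.language_decoder(z_shared)           # 3) language response space
        return z_visual, z_shared, logits

model = ToyVLM()
z_v, z_s, out = model(torch.randn(2, 3, 224, 224))
print(z_v.shape, z_s.shape, out.shape)
```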
Key Findings Across Visual Tasks
For tasks like **coarse-grained object recognition** (e.g., identifying a ‘dog’ vs. a ‘cat’), the visual and vision-language projection spaces showed excellent accuracy, often above 95%. However, there was a noticeable dip in performance in the final language response space, suggesting that while the visual information was present internally, it wasn’t always accurately verbalized by the language decoder. This gap was particularly evident in models not specifically tuned for instruction following.
The observations became even more striking for **fine-grained object recognition** (e.g., distinguishing between different dog breeds). While the internal visual and projection spaces still performed very well (above 90% accuracy), the final response space saw a drastic drop in performance, sometimes by as much as 45%. This indicates that the knowledge for fine-grained distinctions is captured early in the VLM, but the language decoder struggles to translate this detailed information into an accurate textual response. The researchers hypothesize this is due to insufficient fine-tuning data for these specific, detailed categories.
Similarly, in **object counting tasks**, the visual and projection layers showed high proficiency, but the final response layer often performed significantly worse. This further supports the idea that the language decoder is a bottleneck for certain visual understanding capabilities.
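One common way to quantify such a gap is to fit a lightweight probe (for example, a linear classifier) on the internal embeddings and compare its accuracy with the accuracy obtained by parsing the generated answer. The sketch below uses synthetic features purely to illustrate the mechanics; it is not the paper’s evaluation code.

```python
# Hedged sketch: linear-probe accuracy on an internal space vs. accuracy
# read off generated text. Features and labels here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 256))   # stand-in for projector-space features
labels = rng.integers(0, 10, size=500)     # stand-in for fine-grained class labels

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy on internal space:", probe.score(X_te, y_te))

# Response-space accuracy would instead be computed by matching class names
# against the decoder's generated answers, e.g.:
#   correct = sum(lbl.lower() in ans.lower() for lbl, ans in zip(class_names, answers))
```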
Interestingly, the trend reversed for **spatial understanding tasks** (e.g., determining if an object is ‘above’ or ‘below’ another). Here, the initial visual encoder performed the worst, with performance gradually improving in the vision-language projection and then in the final response space. This suggests that for spatial reasoning, the initial visual encoding might be the primary limitation, and the language decoder, potentially aided by its training data, can sometimes compensate for these initial shortcomings.
The Role of Language Model Size and Priors
The study also explored whether simply increasing the size of the language decoder would resolve these issues. While a larger language model did offer some improvements, especially in the vision-language projection space, it did not fully bridge the performance gap for fine-grained recognition, indicating that size alone isn’t the solution.
Furthermore, the researchers carefully designed their experiments to minimize the influence of language priors (i.e., the model guessing based on common sense or textual patterns rather than visual input). By using synthetic datasets and comparing VLM performance to a ‘blind’ language-only model, they confirmed that the observed visual understanding capabilities were indeed driven by visual processing, not just language biases.
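Conceptually, this control amounts to asking the same question with and without the image and comparing the answers, so that any gap can be credited to visual processing rather than language priors. The snippet below is an illustrative stand-in; `ask` is a hypothetical inference call, not an API from the paper.

```python
# Illustrative "blind baseline" comparison. `ask` is a hypothetical callable
# wrapping whatever VLM/LLM inference is used; it is not from the paper.
def blind_gap(ask, question, image):
    with_image = ask(question, image)     # full VLM, sees the image
    without_image = ask(question, None)   # language-only control, no visual input
    return with_image, without_image

# Example with a dummy model standing in for real inference:
answers = blind_gap(lambda q, img: f"answer (saw image: {img is not None})",
                    "How many dogs are in the image?", image="img.png")
print(answers)
```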
Robustness and Visual Processing
The paper also investigated how VLMs handle visual corruptions (like noise or blur) and background transformations. Surprisingly, the final response space appeared to be the most robust to visual corruptions, but the authors attribute this to the very information loss described earlier: if fine-grained detail never fully reaches the decoder, corrupting it has limited additional effect. The intermediate vision-language projection space, however, was found to be less robust, highlighting a potential vulnerability.
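Corruptions of the kind discussed here (Gaussian noise, blur) can be generated in a few lines of PIL/NumPy, as sketched below; the paper’s exact corruption suite and severity levels may differ.

```python
# Simple image corruptions: additive Gaussian noise and Gaussian blur.
# Parameter values (sigma, radius) are assumptions, not the paper's settings.
import numpy as np
from PIL import Image, ImageFilter

def gaussian_noise(img: Image.Image, sigma: float = 25.0) -> Image.Image:
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def blur(img: Image.Image, radius: int = 3) -> Image.Image:
    return img.filter(ImageFilter.GaussianBlur(radius))

img = Image.new("RGB", (224, 224), "gray")   # placeholder image
corrupted = [gaussian_noise(img), blur(img)]
```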
When it came to background transformations, removing distracting backgrounds generally improved performance in the visual and projection spaces. Conversely, when the main object was masked, background context became important for recognition. Visual prompting techniques, such as blurring the background to highlight the foreground object, significantly improved performance, especially in the response space, suggesting these techniques can help overcome some of the information loss.
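A blur-background visual prompt of the sort described can be sketched as follows, assuming a foreground mask is available (for example, from a segmentation model); the rectangular mask here is just a placeholder.

```python
# Blur-background visual prompting: keep the masked foreground sharp,
# blur everything else. The mask source is outside this snippet.
from PIL import Image, ImageDraw, ImageFilter

def blur_background(img: Image.Image, mask: Image.Image, radius: int = 8) -> Image.Image:
    blurred = img.filter(ImageFilter.GaussianBlur(radius))
    # Where the mask is 255, take the sharp original; elsewhere, the blurred copy.
    return Image.composite(img, blurred, mask)

img = Image.new("RGB", (224, 224), "gray")       # placeholder image
mask = Image.new("L", img.size, 0)
ImageDraw.Draw(mask).rectangle([60, 60, 160, 160], fill=255)  # placeholder object region
prompted = blur_background(img, mask)
```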
Finally, the study revealed a critical difference between VLMs and human perception: VLMs tend to rely more on **texture than shape** for object recognition. This was demonstrated by a drastic performance drop when only object shapes were retained (edge maps), compared to when textures were preserved but shapes were distorted (patch shuffle). This reliance on texture could impact their ability to generalize in real-world scenarios.
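The two stimuli being contrasted can be approximated as follows: an edge map keeps shape but removes texture, while a patch shuffle keeps local texture but destroys global shape. The patch size below is an assumption, not a value taken from the paper.

```python
# Shape-only vs. texture-only stimuli: edge map and patch shuffle.
import random
import numpy as np
from PIL import Image, ImageFilter

def edge_map(img: Image.Image) -> Image.Image:
    # Keeps object contours (shape), discards texture.
    return img.convert("L").filter(ImageFilter.FIND_EDGES)

def patch_shuffle(img: Image.Image, patch: int = 32) -> Image.Image:
    # Keeps local texture, destroys global shape by shuffling patches.
    arr = np.asarray(img)
    h, w = arr.shape[0] // patch, arr.shape[1] // patch
    patches = [arr[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
               for i in range(h) for j in range(w)]
    random.shuffle(patches)
    rows = [np.concatenate(patches[r * w:(r + 1) * w], axis=1) for r in range(h)]
    return Image.fromarray(np.concatenate(rows, axis=0))

img = Image.new("RGB", (224, 224), "gray")   # placeholder image
shape_only, texture_only = edge_map(img), patch_shuffle(img)
```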
Guiding Future VLM Development
In conclusion, this research provides crucial insights into the internal workings and limitations of current VLMs. It highlights that while visual information is often well-preserved in the early stages of a VLM, it frequently gets lost or misinterpreted by the language decoder, especially for detailed tasks like fine-grained recognition and counting. For spatial understanding, the visual encoder itself is often the bottleneck. The findings suggest that future efforts to improve VLMs should focus on enhancing the joint fine-tuning process between the vision-language projection and the language decoder, ensuring that the rich visual knowledge captured early on is effectively translated into accurate and robust responses. You can read the full research paper here: Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities.


