TLDR: Leading Vision-Language Models (VLMs) like Gemini 2.5 Pro and Claude Vision 3.7, despite excelling at complex visual tasks, struggle significantly with “nonlocal visual reasoning”: integrating information from multiple, often distant, parts of an image. A new evaluation suite probing three forms of this skill (comparative perception, saccadic search, and smooth visual search) reveals that these models often barely exceed random chance, indicating a lack of core visual reasoning capabilities and a reliance on learned priors over direct visual evidence.
Vision-Language Models (VLMs) have made remarkable strides in understanding and interpreting images, excelling at complex tasks like visual question answering and chart analysis. Their ability to achieve high accuracy on various benchmarks often suggests a deep understanding of visual information. However, recent research indicates that these advanced AI models might have a significant blind spot: they struggle with what researchers call ‘nonlocal visual reasoning’.
Understanding Nonlocal Visual Reasoning
Nonlocal visual reasoning refers to the ability to connect and interpret evidence from multiple, potentially distant, regions within an image. Unlike tasks that can be solved by simply extracting information from isolated parts of an image, nonlocal reasoning requires a more integrated and sequential approach, similar to how humans process visual information. The new research paper, titled “VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs” by Shmuel Berman and Jia Deng from Princeton University, identifies three distinct forms of this crucial visual skill:
- Comparative Perception: This involves holding two images or visual entities in working memory and comparing them, even when precise differences are hard to articulate. Think of recognizing whether two complex shapes are identical without listing every single feature.
- Saccadic Search: Named after the rapid eye movements humans make, this is the process of gathering and integrating information by making discrete, evidence-driven jumps across an image. An example is consulting a chart’s legend, then locating the corresponding line, and then referencing an axis. Each step informs the next.
- Smooth Visual Search: This describes the continuous tracing of visual elements, such as following the outline of an object or tracing a curve to its conclusion. This is a purely visual operation that isn’t easily broken down into language-based steps.
The Evaluation Tasks
To systematically evaluate these capabilities, the researchers introduced a new evaluation suite comprising three procedurally generated task categories, designed to be trivial for humans but challenging for VLMs:
- Object Re-Identification: This task tests comparative perception. Models are shown two images and must determine if the same object appears in both, even if it has undergone a rigid transformation (like rotation or translation). The task includes variants to test how connectedness of object parts or pixel-perfect matching affects performance.
- Visual Scavenger Hunt: Designed to assess saccadic search, this task presents a grid of colored, labeled shapes. The model is given a starting shape and must follow a chain of instructions (e.g., “go to the red square, then the blue circle”) for a specified number of steps, finally identifying the color of the last shape. This requires iterative visual search and evidence accumulation; a sketch of how such an instance can be generated follows this list.
- Circuit Connections: This task evaluates smooth visual search. Models are shown a circuit diagram with a breadboard and components, and must trace a wire from a specific port to its connected component. Variants with single-colored wires or unique wire colors help determine if models are truly tracing or using shortcuts like color matching.
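To make the scavenger-hunt setup concrete, here is a minimal Python sketch of how such an instance might be procedurally generated and solved for its ground-truth answer. It is an illustrative reconstruction, not the authors’ released code: the color and shape vocabularies, grid size, and instruction encoding are all assumptions.

```python
import random

# Assumed vocabularies and sizes -- the paper's exact values may differ.
COLORS = ["red", "blue", "green", "yellow", "purple"]
SHAPES = ["square", "circle", "triangle", "star"]

def make_scavenger_hunt(grid_cells=16, chain_length=3, seed=0):
    """Generate unique (color, shape) cells plus a chain of hops.

    Spatial layout is omitted: only the hop logic matters here. Each
    cell on the chain carries an instruction naming the next target;
    following chain_length hops from the start yields the answer.
    """
    rng = random.Random(seed)
    # Unique pairs so every instruction has exactly one target.
    cells = rng.sample([(c, s) for c in COLORS for s in SHAPES], grid_cells)
    chain = rng.sample(range(grid_cells), chain_length + 1)
    instructions = {}  # cell index -> (color, shape) of its next hop
    for here, there in zip(chain, chain[1:]):
        instructions[here] = cells[there]
    answer = cells[chain[-1]][0]  # color of the final shape
    return cells, instructions, chain[0], answer

def solve(cells, instructions, start):
    """Ground-truth solver: follow hops until no instruction remains."""
    idx = start
    while idx in instructions:
        idx = cells.index(instructions[idx])  # the "saccade": find the target
    return cells[idx][0]

cells, instructions, start, answer = make_scavenger_hunt()
assert solve(cells, instructions, start) == answer
```

Each pass through the `while` loop mirrors one saccade: the solver must locate the named shape before it can read the next instruction, which is also why longer chains compound any single localization error.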
Key Findings: VLMs Lag Behind
The results were striking. Flagship models, including Gemini 2.5 Pro, Claude Vision 3.7, and GPT-o4-mini, performed poorly across all tasks, often barely exceeding random guessing accuracy. Even models that previously scored well on other primitive-vision benchmarks failed these tests. This suggests that despite advancements in raw visual acuity, current VLMs lack fundamental visual reasoning capabilities.
For instance, in Object Re-Identification, no model significantly outperformed random chance on the standard variant, indicating a struggle with comparative perception. While some models showed slight improvement on variants designed to be easier (e.g., pixel-perfect matches), they remained far below human performance.
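As a concrete reading of what “the same object under a rigid transformation” means, the sketch below checks whether two 2-D point sets differ only by rotation plus translation, using the standard Kabsch algorithm. This illustrates the geometric criterion the task implies, not the paper’s evaluation code, and the point-set representation is an assumption made for illustration.

```python
import numpy as np

def same_object_under_rigid_transform(P, Q, tol=1e-6):
    """True if point sets P and Q (n x 2, corresponding rows) differ
    only by a rotation plus a translation (Kabsch algorithm)."""
    # Centering at the centroids removes the translation component.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # The optimal rotation comes from the SVD of the cross-covariance.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    # Identical objects iff the rotated source matches the target.
    return np.allclose(Pc @ R.T, Qc, atol=tol)

# A square rotated 30 degrees and shifted is still the "same object".
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
theta = np.pi / 6
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
print(same_object_under_rigid_transform(square, square @ rot.T + [2.0, 3.0]))  # True
```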
In the Visual Scavenger Hunt, only a few top models performed above random accuracy, and their performance often degraded as the chain length (number of steps) increased. Weaker models exhibited guessing behavior, often hallucinating paths or citing non-existent shapes.
The Circuit Connections task also revealed significant struggles with smooth visual search. While models performed slightly better when wires had unique colors (suggesting they might use color-matching heuristics), their accuracy plummeted when all wires were the same color, forcing them to actually trace the path. This indicates a reliance on coarse spatial and color cues rather than true contour tracking.
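The gap between the color-matching shortcut and genuine tracing is easy to see in code. The toy sketch below encodes a wire image as a pixel grid: the heuristic compares endpoint colors and becomes ambiguous when wires share a color, while flood-filling along connected pixels (a simple stand-in for contour tracking) recovers the true connection. The grid, port location, and component names are invented for illustration.

```python
from collections import deque

# Toy wire image: each cell holds a color code, 0 = background.
# Both wires share color 1, so endpoint color matching is ambiguous.
GRID = [
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
PORT = (0, 0)
COMPONENTS = {"comp_A": (2, 2), "comp_B": (4, 4)}  # hypothetical names

def color_matching_heuristic(port):
    """The shortcut: pick any component whose pixel color matches the
    port's color. Reliable only when every wire has a unique color."""
    color = GRID[port[0]][port[1]]
    return [n for n, (r, c) in COMPONENTS.items() if GRID[r][c] == color]

def trace_wire(port):
    """Genuine smooth search: flood-fill along 4-connected pixels of
    the same color and report which component the wire reaches."""
    color = GRID[port[0]][port[1]]
    seen, frontier = {port}, deque([port])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])
                    and (nr, nc) not in seen and GRID[nr][nc] == color):
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return [n for n, px in COMPONENTS.items() if px in seen]

print(color_matching_heuristic(PORT))  # ['comp_A', 'comp_B'] -- ambiguous
print(trace_wire(PORT))                # ['comp_A'] -- the true connection
```

When every wire has a unique color, the heuristic alone suffices, which matches the paper’s observation that models fared better on the unique-color variant.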
Why VLMs Struggle
The research points to several reasons for these failures. VLMs often privilege prior natural-language judgments over direct visual evidence: they tend to convert visual problems into text-based reasoning, the domain where their underlying Large Language Models (LLMs) excel. This strategy works well for tasks with strong priors, like standardized chart layouts, but fails when models need to reason over nonlocal regions or defy learned conventions.
Furthermore, the study found that VLMs struggle to self-correct. Even when a mistake in a sequential task (like the scavenger hunt) would provide a clear signal of error to a human, the models did not adjust their approach. Their “fuzzy vision” interferes with visual reasoning, and they appear unable to refine their understanding based on new visual evidence.
Looking Ahead
The findings strongly suggest that current VLMs, despite their impressive high-level performance, lack the structured and systematic visual reasoning skills that are independent of natural language. The paper argues for a shift in focus towards developing models that can genuinely “think over the pixels they process,” rather than just describing visual scenes. By exposing these fundamental failure modes, researchers hope to guide the development of future VLMs towards more robust and human-like visual intelligence.
For more in-depth information, you can read the full research paper here.