TLDR: Leading Vision-Language Models (VLMs) like Gemini 2.5 Pro and Claude Vision 3.7, despite excelling at complex visual tasks, struggle significantly with “nonlocal visual reasoning”: integrating information from multiple, often distant, parts of an image. A new evaluation suite probing three forms of this skill (comparative perception, saccadic search, and smooth visual search) reveals that these models often barely exceed random chance, indicating a lack of core visual reasoning capabilities and a reliance on learned priors over direct visual evidence.
Vision-Language Models (VLMs) have made remarkable strides in understanding and interpreting images, excelling at complex tasks like visual question answering and chart analysis. Their ability to achieve high accuracy on various benchmarks often suggests a deep understanding of visual information. However, recent research indicates that these advanced AI models might have a significant blind spot: they struggle with what researchers call ‘nonlocal visual reasoning’.
Understanding Nonlocal Visual Reasoning
Nonlocal visual reasoning refers to the ability to connect and interpret evidence from multiple, potentially distant, regions within an image. Unlike tasks that can be solved by simply extracting information from isolated parts of an image, nonlocal reasoning requires a more integrated and sequential approach, similar to how humans process visual information. The new research paper, titled “VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs” by Shmuel Berman and Jia Deng from Princeton University, identifies three distinct forms of this crucial visual skill:
- Comparative Perception: This involves holding two images or visual entities in working memory and comparing them, even when precise differences are hard to articulate. Think of recognizing whether two complex shapes are identical without listing every single feature.
- Saccadic Search: Named after the rapid eye movements humans make, this is the process of gathering and integrating information by making discrete, evidence-driven jumps across an image. An example is consulting a chart’s legend, then locating the corresponding line, and then referencing an axis. Each step informs the next.
- Smooth Visual Search: This describes the continuous tracing of visual elements, such as following the outline of an object or tracing a curve to its conclusion. This is a purely visual operation that isn’t easily broken down into language-based steps.
The Evaluation Tasks
To systematically evaluate these capabilities, the researchers introduced a new evaluation suite comprising three procedurally generated task categories, designed to be trivial for humans but challenging for VLMs:
- Object Re-Identification: This task tests comparative perception. Models are shown two images and must determine if the same object appears in both, even if it has undergone a rigid transformation (like rotation or translation). The task includes variants to test how connectedness of object parts or pixel-perfect matching affects performance.
- Visual Scavenger Hunt: Designed to assess saccadic search, this task presents a grid of colored, labeled shapes. The model is given a starting shape and must follow a chain of instructions (e.g., “go to the red square, then the blue circle”) for a specified number of steps, finally identifying the color of the last shape. This requires iterative visual search and evidence accumulation; a sketch of how such an instance can be generated follows this list.
- Circuit Connections: This task evaluates smooth visual search. Models are shown a circuit diagram with a breadboard and components, and must trace a wire from a specific port to its connected component. Variants with single-colored wires or unique wire colors help determine if models are truly tracing or using shortcuts like color matching.
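To make the scavenger-hunt setup concrete, here is a minimal Python sketch of how such an instance might be procedurally generated and solved for its ground-truth answer. It is an illustrative reconstruction, not the authors’ released code: the color and shape vocabularies, grid size, and instruction encoding are all assumptions.

```python
import random

# Assumed vocabularies and sizes -- the paper's exact values may differ.
COLORS = ["red", "blue", "green", "yellow", "purple"]
SHAPES = ["square", "circle", "triangle", "star"]

def make_scavenger_hunt(grid_cells=16, chain_length=3, seed=0):
    """Generate unique (color, shape) cells plus a chain of hops.

    Spatial layout is omitted: only the hop logic matters here. Each
    cell on the chain carries an instruction naming the next target;
    following chain_length hops from the start yields the answer.
    """
    rng = random.Random(seed)
    # Unique pairs so every instruction has exactly one target.
    cells = rng.sample([(c, s) for c in COLORS for s in SHAPES], grid_cells)
    chain = rng.sample(range(grid_cells), chain_length + 1)
    instructions = {}  # cell index -> (color, shape) of its next hop
    for here, there in zip(chain, chain[1:]):
        instructions[here] = cells[there]
    answer = cells[chain[-1]][0]  # color of the final shape
    return cells, instructions, chain[0], answer

def solve(cells, instructions, start):
    """Ground-truth solver: follow hops until no instruction remains."""
    idx = start
    while idx in instructions:
        idx = cells.index(instructions[idx])  # the "saccade": find the target
    return cells[idx][0]

cells, instructions, start, answer = make_scavenger_hunt()
assert solve(cells, instructions, start) == answer
```

Each pass through the `while` loop mirrors one saccade: the solver must locate the named shape before it can read the next instruction, which is also why longer chains compound any single localization error.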
Key Findings: VLMs Lag Behind
The results were striking. Flagship models, including Gemini 2.5 Pro, Claude Vision 3.7, and GPT-o4-mini, performed poorly across all tasks, often barely exceeding random guessing accuracy. Even models that previously scored well on other primitive-vision benchmarks failed these tests. This suggests that despite advancements in raw visual acuity, current VLMs lack fundamental visual reasoning capabilities.
For instance, in Object Re-Identification, no model significantly outperformed random chance on the standard variant, indicating a struggle with comparative perception. While some models showed slight improvement on variants designed to be easier (e.g., pixel-perfect matches), they remained far below human performance.
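As a concrete reading of what “the same object under a rigid transformation” means, the sketch below checks whether two 2-D point sets differ only by rotation plus translation, using the standard Kabsch algorithm. This illustrates the geometric criterion the task implies, not the paper’s evaluation code, and the point-set representation is an assumption made for illustration.

```python
import numpy as np

def same_object_under_rigid_transform(P, Q, tol=1e-6):
    """True if point sets P and Q (n x 2, corresponding rows) differ
    only by a rotation plus a translation (Kabsch algorithm)."""
    # Centering at the centroids removes the translation component.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # The optimal rotation comes from the SVD of the cross-covariance.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    # Identical objects iff the rotated source matches the target.
    return np.allclose(Pc @ R.T, Qc, atol=tol)

# A square rotated 30 degrees and shifted is still the "same object".
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
theta = np.pi / 6
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
print(same_object_under_rigid_transform(square, square @ rot.T + [2.0, 3.0]))  # True
```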
In the Visual Scavenger Hunt, only a few top models performed above random accuracy, and their performance often degraded as the chain length (number of steps) increased. Weaker models exhibited guessing behavior, often hallucinating paths or citing non-existent shapes.
The Circuit Connections task also revealed significant struggles with smooth visual search. While models performed slightly better when wires had unique colors (suggesting they might use color-matching heuristics), their accuracy plummeted when all wires were the same color, forcing them to actually trace the path. This indicates a reliance on coarse spatial and color cues rather than true contour tracking.
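The gap between the color-matching shortcut and genuine tracing is easy to see in code. The toy sketch below encodes a wire image as a pixel grid: the heuristic compares endpoint colors and becomes ambiguous when wires share a color, while flood-filling along connected pixels (a simple stand-in for contour tracking) recovers the true connection. The grid, port location, and component names are invented for illustration.

```python
from collections import deque

# Toy wire image: each cell holds a color code, 0 = background.
# Both wires share color 1, so endpoint color matching is ambiguous.
GRID = [
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
PORT = (0, 0)
COMPONENTS = {"comp_A": (2, 2), "comp_B": (4, 4)}  # hypothetical names

def color_matching_heuristic(port):
    """The shortcut: pick any component whose pixel color matches the
    port's color. Reliable only when every wire has a unique color."""
    color = GRID[port[0]][port[1]]
    return [n for n, (r, c) in COMPONENTS.items() if GRID[r][c] == color]

def trace_wire(port):
    """Genuine smooth search: flood-fill along 4-connected pixels of
    the same color and report which component the wire reaches."""
    color = GRID[port[0]][port[1]]
    seen, frontier = {port}, deque([port])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])
                    and (nr, nc) not in seen and GRID[nr][nc] == color):
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return [n for n, px in COMPONENTS.items() if px in seen]

print(color_matching_heuristic(PORT))  # ['comp_A', 'comp_B'] -- ambiguous
print(trace_wire(PORT))                # ['comp_A'] -- the true connection
```

When every wire has a unique color, the heuristic alone suffices, which matches the paper’s observation that models fared better on the unique-color variant.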
Why VLMs Struggle
The research points to several reasons for these failures. VLMs often privilege prior natural-language judgments over direct visual evidence: they tend to convert visual problems into text-based reasoning, the domain where their underlying Large Language Models (LLMs) excel. This strategy works well for tasks with strong priors, like standardized chart layouts, but fails when models need to reason over nonlocal regions or defy learned conventions.
Furthermore, the study found that VLMs struggle to self-correct. Even when a mistake in a sequential task (like the scavenger hunt) would provide a clear signal of error to a human, the models did not adjust their approach. Their “fuzzy vision” interferes with visual reasoning, and they appear unable to refine their understanding based on new visual evidence.
Looking Ahead
The findings strongly suggest that current VLMs, despite their impressive high-level performance, lack the structured and systematic visual reasoning skills that are independent of natural language. The paper argues for a shift in focus towards developing models that can genuinely “think over the pixels they process,” rather than just describing visual scenes. By exposing these fundamental failure modes, researchers hope to guide the development of future VLMs towards more robust and human-like visual intelligence.
For more in-depth information, you can read the full research paper here.