TLDR: A new study uses classic visual search tasks to evaluate Multimodal Large Language Models (MLLMs), revealing that advanced models like GPT-4o exhibit human-like perceptual behaviors. These include rapid “pop-out” detection for single features, capacity limits for combining multiple features, and the integration of natural scene priors like lighting direction. The research also shows that fine-tuning can improve complex visual search performance and that different network layers handle varying levels of visual complexity, mirroring human cognitive processes.
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in understanding and generating content across both vision and language. However, how these models internally process and represent visual information has largely remained a mystery. Traditional evaluation methods often focus on overall task accuracy, which doesn’t shed much light on the underlying mechanisms or cognitive-like processes at play.
A recent research paper, titled "I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs," addresses this opacity by adapting classic visual search paradigms from cognitive psychology. These paradigms, originally developed to study human perception, are used here as a diagnostic tool to evaluate the perceptual capabilities of MLLMs. The study aims to uncover whether MLLMs exhibit phenomena similar to human visual attention, such as the "pop-out" effect and capacity limits in complex searches.
Exploring Visual Search in MLLMs
The researchers, John Burden, Jonathan Prunty, Ben Slater, Matthieu Tehenan, Greg Davis, and Lucy Cheke, conducted three main experiments to probe MLLM visual processing:
The first experiment, Circle Sizes, investigated whether MLLMs show “pop-out” effects in disjunctive search tasks, where a target is easily distinguishable by a single visual feature like size. In humans, a large circle among smaller ones “pops out” regardless of how many distractors are present. The study found that advanced MLLMs, particularly GPT-4o, exhibited a clear pop-out effect for large targets, with high accuracy that remained stable even with increasing numbers of distractors. This behavior closely mirrored human participants, suggesting human-like size-driven salience effects.
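To make the paradigm concrete, here is a minimal Python sketch of how a pop-out display of this kind might be generated with Pillow: one large target circle placed among a variable number of smaller distractors. The sizes, image dimensions, and placement logic are illustrative assumptions, not the authors' actual stimulus parameters.

```python
# Sketch of a disjunctive ("pop-out") search stimulus: one large circle among
# smaller distractors. Illustrative only; the paper's exact stimuli may differ.
import random
from PIL import Image, ImageDraw

def make_popout_display(n_distractors: int, img_size: int = 512,
                        small_r: int = 15, large_r: int = 35) -> Image.Image:
    img = Image.new("RGB", (img_size, img_size), "white")
    draw = ImageDraw.Draw(img)
    placed = []

    def sample_position(radius: int):
        # Rejection-sample a centre that does not overlap previously placed circles.
        for _ in range(1000):
            x = random.randint(radius, img_size - radius)
            y = random.randint(radius, img_size - radius)
            if all((x - px) ** 2 + (y - py) ** 2 > (radius + pr + 5) ** 2
                   for px, py, pr in placed):
                placed.append((x, y, radius))
                return x, y
        raise RuntimeError("Could not place circle without overlap")

    # Target: one large circle that should "pop out" by size alone.
    tx, ty = sample_position(large_r)
    draw.ellipse((tx - large_r, ty - large_r, tx + large_r, ty + large_r), fill="black")

    # Distractors: uniformly smaller circles; set size is varied across trials.
    for _ in range(n_distractors):
        x, y = sample_position(small_r)
        draw.ellipse((x - small_r, y - small_r, x + small_r, y + small_r), fill="black")
    return img

# Example: generate displays at several set sizes and ask the model
# "Is there a larger circle present?"; accuracy vs. set size is the measure.
for n in (4, 8, 16, 32):
    make_popout_display(n).save(f"popout_{n}.png")
```

Flat accuracy across set sizes on displays like these is what the pop-out result amounts to: adding distractors does not make the size singleton harder to find.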
The second experiment, 2 Among 5, explored whether MLLMs demonstrate human-like attentional limitations in conjunctive search. Unlike disjunctive search, conjunctive search requires identifying a target based on a unique combination of features (e.g., a red ‘2’ among red ‘5’s and blue ‘2’s). In humans, this type of search is slower and becomes harder with more distractors, indicating a need for more attentional resources to bind features. GPT-4o showed high performance and set-size independence in disjunctive conditions, but its accuracy declined significantly as distractor numbers increased in conjunctive search tasks, reflecting human-like capacity limits.
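A conjunctive display can be sketched in the same way. In the snippet below, the target (a red "2") shares its colour with one distractor type and its identity with the other, so no single feature gives it away; the grid layout, font, and colours are assumptions for illustration rather than the authors' materials.

```python
# Sketch of a conjunctive search display in the spirit of the "2 Among 5" task:
# the target is defined by a colour x identity conjunction (a red "2" among
# red "5"s and blue "2"s). Illustrative only, not the authors' exact stimuli.
import random
from PIL import Image, ImageDraw, ImageFont

def make_conjunction_display(n_distractors: int, target_present: bool = True,
                             img_size: int = 512) -> Image.Image:
    img = Image.new("RGB", (img_size, img_size), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # swap in a larger TTF font if available
    cells = [(x, y) for x in range(40, img_size - 40, 60)
                    for y in range(40, img_size - 40, 60)]
    random.shuffle(cells)

    items = []
    if target_present:
        items.append(("2", "red"))  # target: red digit 2
    for _ in range(n_distractors):
        # Each distractor shares exactly one feature with the target.
        items.append(random.choice([("5", "red"), ("2", "blue")]))

    for (char, colour), (x, y) in zip(items, cells):
        draw.text((x, y), char, fill=colour, font=font)
    return img

# Vary set size and compare target-present vs. target-absent accuracy; a drop
# at larger set sizes for conjunctions (but not for single features) is the
# capacity-limit signature described above.
make_conjunction_display(12).save("conjunction_12.png")
```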
The third experiment, Light Priors, tested whether MLLMs incorporate sophisticated assumptions about how objects appear in the real world, such as the "light-comes-from-above" prior observed in humans. Because shading is interpreted under this assumption, objects lit from unusual directions (e.g., from below) look anomalous and are detected faster. The results showed that GPT-4o's performance pattern closely resembled the human baseline, with advantages for vertical gradients (top- and bottom-lit) over horizontal ones and, notably, the highest accuracy for bottom-lit spheres. This suggests that MLLMs, like humans, integrate natural scene regularities into their object representations, likely learned from the vast amount of real-world imagery in their training data.
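Lighting-direction stimuli of this kind can be approximated with simple luminance gradients. The sketch below renders a shaded disc that implies light from above, below, left, or right; the paper's actual rendering pipeline is not described in this summary, so treat this purely as an illustration of the manipulation.

```python
# Sketch of a "lighting direction" stimulus: shaded discs whose luminance
# gradient implies light from above, below, left, or right. A disc lit from
# below is the odd one out among discs lit from above (and vice versa).
import numpy as np
from PIL import Image

def shaded_disc(size: int = 128, direction: str = "top") -> Image.Image:
    yy, xx = np.mgrid[0:size, 0:size]
    cx = cy = (size - 1) / 2
    r = size / 2 - 2
    inside = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2

    # Linear luminance ramp along the chosen axis (0 = dark, 1 = bright).
    ramps = {
        "top":    1 - yy / (size - 1),   # bright at top -> "lit from above"
        "bottom": yy / (size - 1),       # bright at bottom -> "lit from below"
        "left":   1 - xx / (size - 1),
        "right":  xx / (size - 1),
    }
    shade = (ramps[direction] * 255).astype(np.uint8)

    canvas = np.full((size, size), 128, dtype=np.uint8)  # grey background
    canvas[inside] = shade[inside]
    return Image.fromarray(canvas)

# Example: tile several "top-lit" discs with one "bottom-lit" target into a
# single image and ask the model whether an odd-one-out is present.
shaded_disc(direction="bottom").save("disc_bottom_lit.png")
```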
Beyond Basic Behavior
The research also delved into fine-tuning and mechanistic interpretability. Fine-tuning GPT-4o on conjunctive search tasks, even with minimal data, led to substantial performance improvements, which generalized to unseen distractor counts and even related tasks. This indicates that MLLMs can learn more efficient visual search strategies through targeted training.
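The summary does not spell out the fine-tuning recipe, but as a rough illustration, training examples for a conjunctive search task could be packaged as JSONL chat records of the kind accepted by OpenAI's fine-tuning API for vision-capable models. The prompt wording, file names, and label format below are assumptions, not details taken from the paper.

```python
# A minimal sketch of assembling fine-tuning examples for a conjunctive search
# task as JSONL chat records. Prompts, labels, and file names are hypothetical.
import base64
import json

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def make_example(image_path: str, target_present: bool) -> dict:
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "text",
                 "text": "Is there a red digit 2 in this image? Answer yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}},
            ]},
            {"role": "assistant", "content": "yes" if target_present else "no"},
        ]
    }

# Write a small training file, assuming display images like the ones generated
# above already exist on disk; the paper reports gains even from minimal data.
with open("conjunction_finetune.jsonl", "w") as f:
    for i, present in enumerate([True, False, True, False]):
        f.write(json.dumps(make_example(f"conjunction_{i}.png", present)) + "\n")
```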
Mechanistic interpretability analyses on Llama 90B revealed that disjunctive search tasks, relying on primitive visual features, engaged earlier layers of the network. In contrast, more complex conjunctive search tasks, requiring feature binding, recruited deeper network layers. This parallel between network layer activation and cognitive processing stages in humans (early visual cortex for saliency, higher cortical regions for conjunctive search) offers intriguing insights into MLLM internal structures.
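A common way to localise where such information is represented is layer-wise linear probing: collect hidden states from every layer across many trials, then fit a simple classifier per layer to predict target presence and see where accuracy first rises. The sketch below shows the general technique on placeholder data; it is an illustration of layer-wise probing in general, not the authors' exact analysis of Llama 90B.

```python
# Generic layer-wise probing sketch: fit a linear probe on each layer's hidden
# states to predict whether a search target was present. The layer at which
# probe accuracy rises indicates where that information becomes decodable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layers(hidden_states: np.ndarray, labels: np.ndarray) -> list[float]:
    """hidden_states: (n_layers, n_trials, d_model); labels: (n_trials,) 0/1."""
    scores = []
    for layer_acts in hidden_states:
        probe = LogisticRegression(max_iter=1000)
        acc = cross_val_score(probe, layer_acts, labels, cv=5).mean()
        scores.append(acc)
    return scores

# Toy usage with random data; in practice the activations would come from the
# model's layers (e.g. at the final token) while it answers a search question.
rng = np.random.default_rng(0)
fake_states = rng.normal(size=(40, 200, 64))   # 40 layers, 200 trials, 64 dims
fake_labels = rng.integers(0, 2, size=200)
layer_accs = probe_layers(fake_states, fake_labels)
print("most decodable layer:", int(np.argmax(layer_accs)))
```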
This study highlights that advanced MLLMs, such as GPT-4o and Claude Sonnet, exhibit visual search behaviors strikingly similar to humans. They demonstrate parallel processing for simple features, capacity limits for complex feature binding, and even incorporate natural scene priors like lighting direction. This work establishes visual search as a powerful, cognitively grounded diagnostic tool for understanding the perceptual capabilities and internal representations of MLLMs. For more detailed information, you can read the full research paper here.