TLDR: ScanDiff is a new AI model that uses diffusion models to predict human eye movements (gaze scanpaths). Unlike older models that only predict average behavior, ScanDiff generates diverse and realistic gaze patterns, capturing the natural variability in how people look at things. It works for both general viewing and specific search tasks, setting a new standard in gaze prediction by better reflecting the complexity of human visual exploration.
Understanding how humans look at the world is a fascinating and complex challenge. Our eyes don’t just randomly dart around; they follow intricate paths, known as scanpaths, that reveal where our attention is focused. This understanding is crucial for fields ranging from human-computer interaction to autonomous systems and even cognitive robotics.
For years, researchers have developed deep learning models to predict these human gaze scanpaths. While these models have made significant strides, many of them share a common limitation: they tend to predict an ‘averaged’ behavior. This means they might show you the most common way people look at something, but they often fail to capture the rich, natural variability seen in individual human visual exploration. After all, if ten people look at the same picture, their eye movements won’t be identical, and capturing that diversity is key to truly understanding human vision.
Introducing ScanDiff: A New Era in Gaze Prediction
A groundbreaking new research paper introduces ScanDiff, a novel architecture designed to overcome this limitation. ScanDiff combines the power of diffusion models with Vision Transformers to generate not just accurate, but also diverse and realistic scanpaths. The core innovation lies in leveraging the ‘stochastic nature’ of diffusion models, which allows ScanDiff to produce a wide range of plausible gaze trajectories, mirroring the inherent variability in human eye movements.
What makes ScanDiff even more versatile is its ability to adapt to different viewing tasks. Whether someone is simply looking at an image without a specific goal (free-viewing) or actively searching for a particular object (task-driven), ScanDiff can adjust its predictions. This is achieved through ‘textual conditioning,’ where the model can be given a text prompt, like ‘search for a laptop,’ to guide its gaze generation.
How ScanDiff Works (Simplified)
At its heart, ScanDiff uses a process inspired by how diffusion works in physics. Imagine starting with a noisy, chaotic image and gradually ‘denoising’ it to reveal a clear picture. Diffusion models do something similar: they learn to reverse a process that gradually adds noise to data. ScanDiff applies this concept to gaze paths. It starts with a noisy representation of a scanpath and iteratively refines it, guided by the visual information from the image and the specific task (if any), until it generates a realistic eye movement sequence.
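The denoising loop described above can be sketched in a few lines. This is an illustrative, self-contained toy (numpy only): the noise schedule, step count, and the `toy_denoiser` stand-in are all assumptions, not ScanDiff's actual network, which would be a learned model conditioned on the image and task. A scanpath is represented as a sequence of (x, y) fixation coordinates.

```python
import numpy as np

# Illustrative sketch of DDPM-style reverse diffusion over a scanpath.
# The schedule values and the toy denoiser are assumptions for demonstration;
# ScanDiff's real denoiser is a learned, conditioned network.

rng = np.random.default_rng(0)
T = 50                                  # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.05, T)      # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_denoiser(x_t, t):
    """Stand-in for the learned network: predicts the noise in x_t.
    Here we pretend the clean scanpath is all zeros, so the noise
    component is exactly x_t / sqrt(1 - alpha_bar_t)."""
    return x_t / np.sqrt(1.0 - alpha_bars[t])

def sample_scanpath(num_fixations=8):
    # Start from pure Gaussian noise and iteratively denoise it
    # into a sequence of (x, y) fixations.
    x = rng.standard_normal((num_fixations, 2))
    for t in reversed(range(T)):
        eps_hat = toy_denoiser(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
               / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x

path = sample_scanpath()
print(path.shape)  # (8, 2): eight (x, y) fixations
```

Because sampling starts from fresh noise each time, running `sample_scanpath()` repeatedly yields different trajectories; that stochasticity is what lets a diffusion model produce a diverse set of plausible scanpaths rather than one average path.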
The model uses advanced AI components like DINOv2 (a Vision Transformer) to understand the visual content of an image and CLIP (a text encoder) to interpret the viewing task. These two sources of information are then cleverly combined to inform the gaze prediction process. Crucially, ScanDiff also includes a module that predicts the length of the scanpath, allowing for more flexible and realistic gaze behaviors, as human scanpaths naturally vary in length.
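The fusion of visual and textual conditioning, plus a length-prediction head, can be sketched as follows. This is a hypothetical illustration: the embedding shapes, the random projection weights, and the `predict_length` head are stand-ins, not DINOv2's or CLIP's actual APIs or ScanDiff's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the real encoders (shapes are assumptions):
# a DINOv2-style ViT yields a grid of patch embeddings,
# a CLIP-style text encoder yields one embedding for the task prompt.
image_patches = rng.standard_normal((196, 768))   # e.g. 14x14 patches
text_embedding = rng.standard_normal(512)         # e.g. 'search for a laptop'

def build_condition(patches, text, dim=256):
    """Project both modalities to a shared width and stack them into
    one conditioning sequence (illustrative fusion, not ScanDiff's)."""
    W_img = rng.standard_normal((patches.shape[1], dim)) / np.sqrt(patches.shape[1])
    W_txt = rng.standard_normal((text.shape[0], dim)) / np.sqrt(text.shape[0])
    img_tokens = patches @ W_img                 # per-patch visual tokens
    txt_token = text @ W_txt                     # single task token
    return np.vstack([img_tokens, txt_token])    # fused conditioning

def predict_length(condition, max_len=16):
    """Toy scanpath-length head: pool the conditioning sequence and
    treat part of it as length logits (purely for illustration)."""
    pooled = condition.mean(axis=0)
    logits = pooled[:max_len]
    return int(np.argmax(logits)) + 1            # a length in 1..max_len

cond = build_condition(image_patches, text_embedding)
n_fix = predict_length(cond)
print(cond.shape)   # (197, 256): 196 visual tokens + 1 task token
```

The predicted length would then set how many fixations the denoiser generates, which is how variable-length scanpaths fall out of the design.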
Setting New Standards
Experiments on widely recognized datasets demonstrate that ScanDiff surpasses existing state-of-the-art methods in both free-viewing and task-driven scenarios. It not only produces more accurate scanpaths but also excels at generating diverse ones. The researchers even introduced new metrics, like the Diversity-aware Sequence Score (DSS) and Recall Sequence Score (RSS), specifically to measure how well models capture this crucial variability, and ScanDiff consistently came out on top.
This ability to better capture the complexity of human visual behavior is a significant step forward in gaze prediction research. By modeling the inherent stochasticity of human gaze, ScanDiff opens new avenues for applications that require a more realistic simulation of how humans interact with visual information.
For those interested in the technical details, the full research paper can be found here.