TLDR: ScanDiff is a new AI model that uses diffusion models to predict human eye movements (gaze scanpaths). Unlike older models that only predict average behavior, ScanDiff generates diverse and realistic gaze patterns, capturing the natural variability in how people look at things. It works for both general viewing and specific search tasks, setting a new standard in gaze prediction by better reflecting the complexity of human visual exploration.
Understanding how humans look at the world is a fascinating and complex challenge. Our eyes don’t just randomly dart around; they follow intricate paths, known as scanpaths, that reveal where our attention is focused. This understanding is crucial for fields ranging from human-computer interaction to autonomous systems and even cognitive robotics.
For years, researchers have developed deep learning models to predict these human gaze scanpaths. While these models have made significant strides, many of them share a common limitation: they tend to predict an ‘averaged’ behavior. This means they might show you the most common way people look at something, but they often fail to capture the rich, natural variability seen in individual human visual exploration. After all, if ten people look at the same picture, their eye movements won’t be identical, and capturing that diversity is key to truly understanding human vision.
Introducing ScanDiff: A New Era in Gaze Prediction
A groundbreaking new research paper introduces ScanDiff, a novel architecture designed to overcome this limitation. ScanDiff combines the power of diffusion models with Vision Transformers to generate not just accurate, but also diverse and realistic scanpaths. The core innovation lies in leveraging the ‘stochastic nature’ of diffusion models, which allows ScanDiff to produce a wide range of plausible gaze trajectories, mirroring the inherent variability in human eye movements.
What makes ScanDiff even more versatile is its ability to adapt to different viewing tasks. Whether someone is simply looking at an image without a specific goal (free-viewing) or actively searching for a particular object (task-driven), ScanDiff can adjust its predictions. This is achieved through ‘textual conditioning,’ where the model can be given a text prompt, like ‘search for a laptop,’ to guide its gaze generation.
How ScanDiff Works (Simplified)
At its heart, ScanDiff uses a process inspired by how diffusion works in physics. Imagine starting with a noisy, chaotic image and gradually ‘denoising’ it to reveal a clear picture. Diffusion models do something similar: they learn to reverse a process that gradually adds noise to data. ScanDiff applies this concept to gaze paths. It starts with a noisy representation of a scanpath and iteratively refines it, guided by the visual information from the image and the specific task (if any), until it generates a realistic eye movement sequence.
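The denoising loop described above can be sketched in a few lines. This is an illustrative, self-contained toy (numpy only): the noise schedule, step count, and the `toy_denoiser` stand-in are all assumptions, not ScanDiff's actual network, which would be a learned model conditioned on the image and task. A scanpath is represented as a sequence of (x, y) fixation coordinates.

```python
import numpy as np

# Illustrative sketch of DDPM-style reverse diffusion over a scanpath.
# The schedule values and the toy denoiser are assumptions for demonstration;
# ScanDiff's real denoiser is a learned, conditioned network.

rng = np.random.default_rng(0)
T = 50                                  # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.05, T)      # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_denoiser(x_t, t):
    """Stand-in for the learned network: predicts the noise in x_t.
    Here we pretend the clean scanpath is all zeros, so the noise
    component is exactly x_t / sqrt(1 - alpha_bar_t)."""
    return x_t / np.sqrt(1.0 - alpha_bars[t])

def sample_scanpath(num_fixations=8):
    # Start from pure Gaussian noise and iteratively denoise it
    # into a sequence of (x, y) fixations.
    x = rng.standard_normal((num_fixations, 2))
    for t in reversed(range(T)):
        eps_hat = toy_denoiser(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
               / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x

path = sample_scanpath()
print(path.shape)  # (8, 2): eight (x, y) fixations
```

Because sampling starts from fresh noise each time, running `sample_scanpath()` repeatedly yields different trajectories; that stochasticity is what lets a diffusion model produce a diverse set of plausible scanpaths rather than one average path.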
The model uses advanced AI components like DINOv2 (a Vision Transformer) to understand the visual content of an image and CLIP (a text encoder) to interpret the viewing task. These two sources of information are then cleverly combined to inform the gaze prediction process. Crucially, ScanDiff also includes a module that predicts the length of the scanpath, allowing for more flexible and realistic gaze behaviors, as human scanpaths naturally vary in length.
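The fusion of visual and textual conditioning, plus a length-prediction head, can be sketched as follows. This is a hypothetical illustration: the embedding shapes, the random projection weights, and the `predict_length` head are stand-ins, not DINOv2's or CLIP's actual APIs or ScanDiff's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the real encoders (shapes are assumptions):
# a DINOv2-style ViT yields a grid of patch embeddings,
# a CLIP-style text encoder yields one embedding for the task prompt.
image_patches = rng.standard_normal((196, 768))   # e.g. 14x14 patches
text_embedding = rng.standard_normal(512)         # e.g. 'search for a laptop'

def build_condition(patches, text, dim=256):
    """Project both modalities to a shared width and stack them into
    one conditioning sequence (illustrative fusion, not ScanDiff's)."""
    W_img = rng.standard_normal((patches.shape[1], dim)) / np.sqrt(patches.shape[1])
    W_txt = rng.standard_normal((text.shape[0], dim)) / np.sqrt(text.shape[0])
    img_tokens = patches @ W_img                 # per-patch visual tokens
    txt_token = text @ W_txt                     # single task token
    return np.vstack([img_tokens, txt_token])    # fused conditioning

def predict_length(condition, max_len=16):
    """Toy scanpath-length head: pool the conditioning sequence and
    treat part of it as length logits (purely for illustration)."""
    pooled = condition.mean(axis=0)
    logits = pooled[:max_len]
    return int(np.argmax(logits)) + 1            # a length in 1..max_len

cond = build_condition(image_patches, text_embedding)
n_fix = predict_length(cond)
print(cond.shape)   # (197, 256): 196 visual tokens + 1 task token
```

The predicted length would then set how many fixations the denoiser generates, which is how variable-length scanpaths fall out of the design.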
Setting New Standards
Experiments on widely recognized datasets demonstrate that ScanDiff surpasses existing state-of-the-art methods in both free-viewing and task-driven scenarios. It not only produces more accurate scanpaths but also excels at generating diverse ones. The researchers even introduced new metrics, like the Diversity-aware Sequence Score (DSS) and Recall Sequence Score (RSS), specifically to measure how well models capture this crucial variability, and ScanDiff consistently came out on top.
This ability to better capture the complexity of human visual behavior is a significant step forward in gaze prediction research. By modeling the inherent stochasticity of human gaze, ScanDiff opens new avenues for applications that require a more realistic simulation of how humans interact with visual information.
For those interested in the technical details, the full research paper can be found here.