
Enhanced Gaze Tracking: Combining RGBD Images with Deep Learning for Precise Eye Movement Analysis

TLDR: A master’s thesis by Tobias J. Bauer explores a new AI-based gaze tracking system using RGBD (color and depth) images and a Transformer architecture for combining visual features. The research developed a new dataset, OTH-Gaze-Estimation, and found that while depth information is valuable, a direct Transformer-based feature fusion module did not outperform a simpler Multilayer Perceptron (MLP) for gaze estimation. The system also includes a real-time pipeline for practical applications.

Gaze tracking, the technology that measures and records eye movements and gaze direction, is becoming increasingly important across various fields, from human-computer interaction to medical diagnostics. As artificial intelligence and deep learning continue to advance, new methods are emerging to make gaze tracking more precise and accessible. A recent master’s thesis by Tobias J. Bauer explores an innovative approach to this challenge by combining color (RGB) and depth (D) information from images with advanced neural network architectures.

The Challenge of Accurate Gaze Tracking

The primary goal of gaze tracking is to accurately estimate where a person is looking in a 3D space, which can then be used to determine their gaze point on a screen or in a real-world environment. Traditional methods often struggle with variations in head pose, lighting conditions, and individual differences in eye appearance. This research aimed to tackle these complexities by focusing on three key areas: creating a specialized dataset, designing a robust AI model, and developing a real-time application pipeline.

Existing gaze tracking datasets often lack crucial depth information or are unsuitable for estimating 3D gaze angles, which are essential for versatile applications. This gap highlighted the need for a new, comprehensive dataset.

Building a New Dataset: OTH-Gaze-Estimation

To address the limitations of existing resources, the thesis introduced a new dataset called OTH-Gaze-Estimation. This extensive collection comprises over 130,000 samples from 12 subjects; each sample includes a normalized RGB face patch, a corresponding depth map, eye patches, facial landmarks, and precise gaze angles as labels. The data was collected with an Intel RealSense D435 RGBD camera in a semi-controlled environment, with subjects free to move their heads so that a wide range of natural viewing conditions could be captured. This rich dataset provides a solid foundation for training and evaluating advanced gaze tracking models.
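As a rough illustration, one sample in such a dataset might be represented as follows. The field names, shapes, and types below are assumptions for illustration, not the thesis's actual schema:

```python
from dataclasses import dataclass
import numpy as np

# Illustrative sketch of one dataset sample, based on the fields described
# above. Names, shapes, and dtypes are assumptions, not the thesis's schema.
@dataclass
class GazeSample:
    face_rgb: np.ndarray     # normalized face patch, e.g. (224, 224, 3) uint8
    face_depth: np.ndarray   # aligned depth map for the same patch
    left_eye: np.ndarray     # cropped left-eye patch
    right_eye: np.ndarray    # cropped right-eye patch
    landmarks: np.ndarray    # 2D facial landmarks, shape (N, 2)
    gaze_angles: np.ndarray  # label: (pitch, yaw) gaze angles
    subject_id: int          # one of the 12 recorded subjects
```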

The RGBDTr Model: Fusing Vision and Depth

At the heart of this research is the proposed RGBDTr model, an AI architecture designed to process both color and depth information. Inspired by earlier work, this model incorporates several key components. It begins with face landmark detection to pinpoint crucial facial features, followed by a face normalization step that standardizes head pose and distance, allowing the model to focus purely on eye movements. The model then uses specialized feature extractors to derive information about head pose, eye pose, and depth from the images. A central element of the model is the Feature Fusion Transformer, intended to intelligently combine these diverse visual features to make a final gaze prediction. The model also includes a subject-specific calibration mechanism, which allows it to adapt to individual users for improved accuracy.
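To make the architecture concrete, here is a minimal PyTorch-style sketch of an RGBD gaze model with Transformer-based feature fusion. The module names, layer sizes, and design details are illustrative assumptions, not the thesis's actual implementation:

```python
import torch
import torch.nn as nn

class RGBDGazeModel(nn.Module):
    """Hypothetical skeleton of an RGBD gaze model following the components
    described above; all layer choices are illustrative assumptions."""
    def __init__(self, feat_dim=128):
        super().__init__()
        # Separate extractors for appearance (RGB) and geometry (depth)
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        # Fusion module that combines the two feature streams
        fusion_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.head = nn.Linear(feat_dim, 2)  # predict (pitch, yaw) gaze angles

    def forward(self, rgb, depth):
        f_rgb = self.rgb_encoder(rgb)
        f_depth = self.depth_encoder(depth)
        tokens = torch.stack([f_rgb, f_depth], dim=1)  # (B, 2 tokens, feat_dim)
        fused = self.fusion(tokens).mean(dim=1)        # pool the fused tokens
        return self.head(fused)
```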

Key Findings from Extensive Evaluation

The research conducted extensive experiments on three datasets: ETH-XGaze, ShanghaiTechGaze+, and the newly created OTH-Gaze-Estimation dataset. These evaluations revealed several significant insights into the performance of the RGBDTr model.

One crucial finding was that while depth information from RGBD images consistently improved gaze tracking accuracy over color (RGB) input alone, the Transformer module designed for feature fusion did not perform as well as a simpler Multilayer Perceptron (MLP) alternative. For this particular application, the less complex fusion mechanism proved more effective: the Transformer's added complexity did not translate into superior performance.
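For contrast with the Transformer fusion above, the simpler MLP alternative can be sketched as follows; again, the dimensions and layer choices are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MLPFusion(nn.Module):
    """Sketch of the simpler fusion alternative: concatenate the feature
    streams and mix them with fully connected layers. Sizes are assumptions."""
    def __init__(self, feat_dim=128, n_streams=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * n_streams, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))

    def forward(self, features):  # features: list of (B, feat_dim) tensors
        return self.mlp(torch.cat(features, dim=1))
```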

Another important discovery was that a pre-trained Generative Adversarial Network (GAN) backbone, initially intended to refine depth images and remove artifacts, actually hindered the overall gaze estimation performance. Models trained without this GAN backbone achieved significantly better results, indicating that the multi-task approach of depth reconstruction and gaze estimation might have conflicting objectives that negatively impact the primary task of gaze estimation.
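To make the "conflicting objectives" point concrete, a multi-task setup of this kind is typically trained on a weighted sum of the two losses, where the auxiliary term can pull gradients away from the primary task. The loss choices and weighting below are illustrative assumptions, not the thesis's actual formulation:

```python
import torch.nn.functional as F

def multitask_loss(pred_gaze, true_gaze, recon_depth, true_depth, w_recon=0.5):
    """Illustrative multi-task objective: gaze regression plus depth
    reconstruction. Loss terms and the weight w_recon are assumptions."""
    gaze_term = F.l1_loss(pred_gaze, true_gaze)      # primary task
    recon_term = F.l1_loss(recon_depth, true_depth)  # auxiliary task
    return gaze_term + w_recon * recon_term
```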

The study also highlighted the critical role of subject-specific calibration, which consistently and significantly reduced errors across all datasets. This calibration step helps the model adapt to individual differences in eye and face appearance, making the system more robust and personalized.
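One common form of subject-specific calibration, shown here purely as an assumption about how such a step could work rather than the thesis's actual method, is to fit a small affine correction that maps the model's raw predictions to a user's ground-truth angles collected during a short calibration session:

```python
import numpy as np

def fit_subject_calibration(pred_angles, true_angles):
    """Fit an affine correction from predicted to true gaze angles using a
    few calibration samples. A generic sketch; arrays have shape (n, 2)."""
    X = np.hstack([pred_angles, np.ones((len(pred_angles), 1))])  # add bias
    W, *_ = np.linalg.lstsq(X, true_angles, rcond=None)           # (3, 2)
    return W

def apply_calibration(pred_angles, W):
    X = np.hstack([pred_angles, np.ones((len(pred_angles), 1))])
    return X @ W
```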

The best model configuration set a new benchmark on the ShanghaiTechGaze+ dataset with a mean Euclidean error of 29.7 mm, and achieved a mean angular error of 4.7 degrees on the OTH-Gaze-Estimation dataset, demonstrating that RGBD input can support highly accurate, personalized gaze tracking.


Real-Time Application and Future Directions

The thesis culminates in a practical, real-time gaze point estimation pipeline that can process RGBD images and predict gaze direction instantly. This system incorporates advanced filtering techniques, such as Kalman filters, to ensure smooth and accurate gaze tracking, and offers interactive 3D visualization capabilities for monitoring the gaze estimation process. The pipeline is designed for extensibility, allowing for future integration with other applications through an Application Programming Interface (API).
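As an illustration of the smoothing step, here is a minimal constant-velocity Kalman filter for a single gaze coordinate. It is a generic textbook sketch, not the thesis's exact filter design; noise parameters and the frame rate are assumptions:

```python
import numpy as np

class GazeKalmanFilter:
    """Minimal constant-velocity Kalman filter for smoothing one gaze
    coordinate. A generic sketch; parameters are illustrative assumptions."""
    def __init__(self, process_var=1e-3, meas_var=1e-2):
        self.x = np.zeros(2)              # state: [position, velocity]
        self.P = np.eye(2)                # state covariance
        self.Q = process_var * np.eye(2)  # process noise
        self.R = meas_var                 # measurement noise
        self.H = np.array([[1.0, 0.0]])   # we observe position only

    def update(self, z, dt=1 / 30):
        F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity transition
        # Predict the next state from the motion model
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.Q
        # Correct the prediction with the new measurement z
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T / S              # Kalman gain, shape (2, 1)
        self.x = self.x + (K * (z - self.H @ self.x)).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[0]                       # smoothed position
```

In a pipeline like the one described above, one such filter per screen coordinate (or a joint 2D state) would be fed each new gaze estimate as it arrives, trading a small amount of latency for a much steadier gaze point.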

This work provides valuable insights into the use of RGBD images and deep learning for gaze tracking. While the Transformer’s direct application for feature fusion in this specific setup showed unexpected results, the overall approach demonstrates the potential for highly accurate and adaptable gaze tracking systems. Future research could explore alternative Transformer architectures, larger and more diverse datasets, and more sophisticated calibration methods to further enhance performance. For more details, you can read the full research paper here.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
