
Enhanced Gaze Tracking: Combining RGBD Images with Deep Learning for Precise Eye Movement Analysis

TLDR: A master’s thesis by Tobias J. Bauer explores a new AI-based gaze tracking system using RGBD (color and depth) images and a Transformer architecture for combining visual features. The research developed a new dataset, OTH-Gaze-Estimation, and found that while depth information is valuable, a direct Transformer-based feature fusion module did not outperform a simpler Multilayer Perceptron (MLP) for gaze estimation. The system also includes a real-time pipeline for practical applications.

Gaze tracking, the technology that measures and records eye movements and gaze direction, is becoming increasingly important across various fields, from human-computer interaction to medical diagnostics. As artificial intelligence and deep learning continue to advance, new methods are emerging to make gaze tracking more precise and accessible. A recent master’s thesis by Tobias J. Bauer explores an innovative approach to this challenge by combining color (RGB) and depth (D) information from images with advanced neural network architectures.

The Challenge of Accurate Gaze Tracking

The primary goal of gaze tracking is to accurately estimate where a person is looking in a 3D space, which can then be used to determine their gaze point on a screen or in a real-world environment. Traditional methods often struggle with variations in head pose, lighting conditions, and individual differences in eye appearance. This research aimed to tackle these complexities by focusing on three key areas: creating a specialized dataset, designing a robust AI model, and developing a real-time application pipeline.

Existing gaze tracking datasets often lack crucial depth information or are unsuitable for estimating 3D gaze angles, which are essential for versatile applications. This gap highlighted the need for a new, comprehensive dataset.

Building a New Dataset: OTH-Gaze-Estimation

To address the limitations of existing resources, the thesis introduced a new dataset called OTH-Gaze-Estimation. This extensive collection comprises over 130,000 samples from 12 subjects; each sample includes a normalized RGB face patch, a corresponding depth map, eye patches, facial landmarks, and precise gaze angles as labels. The data was collected with an Intel RealSense D435 RGBD camera in a semi-controlled environment, with subjects free to move their heads so that a wide range of natural viewing conditions could be captured. This rich dataset provides a solid foundation for training and evaluating advanced gaze tracking models.
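As a rough illustration, one sample in such a dataset might be represented as follows. The field names, shapes, and types below are assumptions for illustration, not the thesis's actual schema:

```python
from dataclasses import dataclass
import numpy as np

# Illustrative sketch of one dataset sample, based on the fields described
# above. Names, shapes, and dtypes are assumptions, not the thesis's schema.
@dataclass
class GazeSample:
    face_rgb: np.ndarray     # normalized face patch, e.g. (224, 224, 3) uint8
    face_depth: np.ndarray   # aligned depth map for the same patch
    left_eye: np.ndarray     # cropped left-eye patch
    right_eye: np.ndarray    # cropped right-eye patch
    landmarks: np.ndarray    # 2D facial landmarks, shape (N, 2)
    gaze_angles: np.ndarray  # label: (pitch, yaw) gaze angles
    subject_id: int          # one of the 12 recorded subjects
```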

The RGBDTr Model: Fusing Vision and Depth

At the heart of this research is the proposed RGBDTr model, an AI architecture designed to process both color and depth information. Inspired by earlier work, this model incorporates several key components. It begins with face landmark detection to pinpoint crucial facial features, followed by a face normalization step that standardizes head pose and distance, allowing the model to focus purely on eye movements. The model then uses specialized feature extractors to derive information about head pose, eye pose, and depth from the images. A central element of the model is the Feature Fusion Transformer, intended to intelligently combine these diverse visual features to make a final gaze prediction. The model also includes a subject-specific calibration mechanism, which allows it to adapt to individual users for improved accuracy.
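To make the architecture concrete, here is a minimal PyTorch-style sketch of an RGBD gaze model with Transformer-based feature fusion. The module names, layer sizes, and design details are illustrative assumptions, not the thesis's actual implementation:

```python
import torch
import torch.nn as nn

class RGBDGazeModel(nn.Module):
    """Hypothetical skeleton of an RGBD gaze model following the components
    described above; all layer choices are illustrative assumptions."""
    def __init__(self, feat_dim=128):
        super().__init__()
        # Separate extractors for appearance (RGB) and geometry (depth)
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        # Fusion module that combines the two feature streams
        fusion_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.head = nn.Linear(feat_dim, 2)  # predict (pitch, yaw) gaze angles

    def forward(self, rgb, depth):
        f_rgb = self.rgb_encoder(rgb)
        f_depth = self.depth_encoder(depth)
        tokens = torch.stack([f_rgb, f_depth], dim=1)  # (B, 2 tokens, feat_dim)
        fused = self.fusion(tokens).mean(dim=1)        # pool the fused tokens
        return self.head(fused)
```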

Key Findings from Extensive Evaluation

The research conducted extensive experiments on three datasets: ETH-XGaze, ShanghaiTechGaze+, and the newly created OTH-Gaze-Estimation dataset. These evaluations revealed several significant insights into the performance of the RGBDTr model.

One crucial finding was that while depth information from RGBD images consistently improved gaze tracking accuracy over color (RGB) input alone, the Transformer module designed for feature fusion did not perform as well as a simpler Multilayer Perceptron (MLP) alternative. For this particular application, the less complex fusion mechanism proved more effective: the Transformer's added complexity did not translate into superior performance.
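For contrast with the Transformer fusion above, the simpler MLP alternative can be sketched as follows; again, the dimensions and layer choices are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MLPFusion(nn.Module):
    """Sketch of the simpler fusion alternative: concatenate the feature
    streams and mix them with fully connected layers. Sizes are assumptions."""
    def __init__(self, feat_dim=128, n_streams=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * n_streams, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))

    def forward(self, features):  # features: list of (B, feat_dim) tensors
        return self.mlp(torch.cat(features, dim=1))
```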

Another important discovery was that a pre-trained Generative Adversarial Network (GAN) backbone, initially intended to refine depth images and remove artifacts, actually hindered the overall gaze estimation performance. Models trained without this GAN backbone achieved significantly better results, indicating that the multi-task approach of depth reconstruction and gaze estimation might have conflicting objectives that negatively impact the primary task of gaze estimation.
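To make the "conflicting objectives" point concrete, a multi-task setup of this kind is typically trained on a weighted sum of the two losses, where the auxiliary term can pull gradients away from the primary task. The loss choices and weighting below are illustrative assumptions, not the thesis's actual formulation:

```python
import torch.nn.functional as F

def multitask_loss(pred_gaze, true_gaze, recon_depth, true_depth, w_recon=0.5):
    """Illustrative multi-task objective: gaze regression plus depth
    reconstruction. Loss terms and the weight w_recon are assumptions."""
    gaze_term = F.l1_loss(pred_gaze, true_gaze)      # primary task
    recon_term = F.l1_loss(recon_depth, true_depth)  # auxiliary task
    return gaze_term + w_recon * recon_term
```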

The study also highlighted the critical role of subject-specific calibration, which consistently and significantly reduced errors across all datasets. This calibration step helps the model adapt to individual differences in eye and face appearance, making the system more robust and personalized.
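One common form of subject-specific calibration, shown here purely as an assumption about how such a step could work rather than the thesis's actual method, is to fit a small affine correction that maps the model's raw predictions to a user's ground-truth angles collected during a short calibration session:

```python
import numpy as np

def fit_subject_calibration(pred_angles, true_angles):
    """Fit an affine correction from predicted to true gaze angles using a
    few calibration samples. A generic sketch; arrays have shape (n, 2)."""
    X = np.hstack([pred_angles, np.ones((len(pred_angles), 1))])  # add bias
    W, *_ = np.linalg.lstsq(X, true_angles, rcond=None)           # (3, 2)
    return W

def apply_calibration(pred_angles, W):
    X = np.hstack([pred_angles, np.ones((len(pred_angles), 1))])
    return X @ W
```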

The best model configuration set a new benchmark on the ShanghaiTechGaze+ dataset with a mean Euclidean error of 29.7 mm, and achieved a mean angular error of 4.7 degrees on the OTH-Gaze-Estimation dataset, demonstrating that RGBD input can support highly accurate, personalized gaze tracking.


Real-Time Application and Future Directions

The thesis culminates in a practical, real-time gaze point estimation pipeline that can process RGBD images and predict gaze direction instantly. This system incorporates advanced filtering techniques, such as Kalman filters, to ensure smooth and accurate gaze tracking, and offers interactive 3D visualization capabilities for monitoring the gaze estimation process. The pipeline is designed for extensibility, allowing for future integration with other applications through an Application Programming Interface (API).
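As an illustration of the smoothing step, here is a minimal constant-velocity Kalman filter for a single gaze coordinate. It is a generic textbook sketch, not the thesis's exact filter design; noise parameters and the frame rate are assumptions:

```python
import numpy as np

class GazeKalmanFilter:
    """Minimal constant-velocity Kalman filter for smoothing one gaze
    coordinate. A generic sketch; parameters are illustrative assumptions."""
    def __init__(self, process_var=1e-3, meas_var=1e-2):
        self.x = np.zeros(2)              # state: [position, velocity]
        self.P = np.eye(2)                # state covariance
        self.Q = process_var * np.eye(2)  # process noise
        self.R = meas_var                 # measurement noise
        self.H = np.array([[1.0, 0.0]])   # we observe position only

    def update(self, z, dt=1 / 30):
        F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity transition
        # Predict the next state from the motion model
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.Q
        # Correct the prediction with the new measurement z
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T / S              # Kalman gain, shape (2, 1)
        self.x = self.x + (K * (z - self.H @ self.x)).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[0]                       # smoothed position
```

In a pipeline like the one described above, one such filter per screen coordinate (or a joint 2D state) would be fed each new gaze estimate as it arrives, trading a small amount of latency for a much steadier gaze point.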

This work provides valuable insights into the use of RGBD images and deep learning for gaze tracking. While the Transformer’s direct application for feature fusion in this specific setup showed unexpected results, the overall approach demonstrates the potential for highly accurate and adaptable gaze tracking systems. Future research could explore alternative Transformer architectures, larger and more diverse datasets, and more sophisticated calibration methods to further enhance performance. For more details, you can read the full research paper here.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
