TL;DR: This research introduces a new dataset (DAAD-X) and a model (VCBM) to make autonomous driving systems more understandable and safer. DAAD-X provides detailed textual explanations for driver actions, while VCBM is a novel framework that inherently generates human-understandable explanations for predicted maneuvers by linking spatio-temporal features to concepts. The study shows that transformer-based video models lend themselves to interpretability better than CNNs, and it highlights the importance of driver gaze and temporal context in explaining AI decisions.
Autonomous driving systems are rapidly advancing, but their increasing complexity, driven by deep learning and AI, brings a critical challenge: understanding why these systems make certain decisions. This lack of transparency, often referred to as the “black-box” nature of AI, raises significant safety and trust concerns, especially in critical applications like autonomous vehicles.
The Need for Understandable Driver Intention Prediction
Imagine an autonomous car attempting a left turn, but a parked vehicle is in its blind spot. Existing driver intention prediction (DIP) models might fail to anticipate this obstacle, leading to a potential collision. To prevent such scenarios and build trust, autonomous systems need to not only predict driving actions but also provide human-understandable explanations for their decisions. This interpretability allows for diagnosing failures, improving model learning, and ultimately ensuring safer deployment.
Introducing DAAD-X: A Dataset for Explainable Driving Actions
Traditional DIP datasets focus primarily on predicting maneuvers or trajectories, lacking the crucial “why” aspect. To bridge this gap, researchers have introduced the eXplainable Driving Action Anticipation Dataset (DAAD-X). This new multimodal, ego-centric video dataset provides hierarchical, high-level textual explanations as causal reasoning for a driver’s decisions. These explanations are derived from both the driver’s eye-gaze and the ego-vehicle’s perspective, offering a richer context for understanding driving actions.
VCBM: A Model for Inherently Interpretable Predictions
To effectively leverage the detailed explanations in DAAD-X, the researchers propose the Video Concept Bottleneck Model (VCBM). This innovative framework generates spatio-temporally coherent explanations inherently, meaning it doesn’t rely on post-hoc techniques (methods applied after a model has made a prediction to try and explain it). VCBM uses a dual video encoder to process both gaze and front-view video data. A key component is the Learnable Token Merging (LTM) block, which groups semantically similar features across video frames into representative tokens. These tokens are then fed into a Localised Concept Bottleneck Model (LCBM), which maps high-dimensional features to a low-dimensional space of human-understandable explanations. This design ensures that the model not only predicts a maneuver but also provides clear justifications for it.
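To make the pipeline more concrete, here is a minimal PyTorch-style sketch of the architecture as described above. The module names, dimensions, attention-based merging rule, and prediction heads are illustrative assumptions rather than the authors' implementation; in practice, the two encoders would be video backbones such as MViTv2 that emit sequences of spatio-temporal tokens.

```python
import torch
import torch.nn as nn

class VCBMSketch(nn.Module):
    """Illustrative sketch of the described pipeline (names and sizes are assumptions)."""

    def __init__(self, feat_dim=768, num_tokens=16, num_concepts=32, num_maneuvers=5):
        super().__init__()
        # Dual video encoders: one for the gaze stream, one for the front view.
        # Placeholders here; in practice these would be video backbones (e.g. MViTv2)
        # returning spatio-temporal tokens of shape (B, N, feat_dim).
        self.gaze_encoder = nn.Identity()
        self.front_encoder = nn.Identity()
        # Learnable Token Merging (LTM): learnable queries attend over all tokens
        # and pool semantically similar ones into a few representative tokens.
        self.merge_queries = nn.Parameter(torch.randn(num_tokens, feat_dim))
        self.merge_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        # Localised Concept Bottleneck (LCBM): map each merged token onto a
        # low-dimensional space of human-readable explanation concepts.
        self.concept_head = nn.Linear(feat_dim, num_concepts)
        # Maneuvers are predicted from the concept activations, so every prediction
        # is traceable back to explanation concepts.
        self.maneuver_head = nn.Linear(num_concepts, num_maneuvers)

    def forward(self, gaze_tokens, front_tokens):
        # Concatenate token sequences from both streams: (B, N_total, feat_dim).
        tokens = torch.cat([self.gaze_encoder(gaze_tokens),
                            self.front_encoder(front_tokens)], dim=1)
        queries = self.merge_queries.expand(tokens.size(0), -1, -1)
        merged, _ = self.merge_attn(queries, tokens, tokens)      # (B, num_tokens, feat_dim)
        concept_logits = self.concept_head(merged).amax(dim=1)    # (B, num_concepts)
        maneuver_logits = self.maneuver_head(concept_logits.sigmoid())
        return concept_logits, maneuver_logits
```

The design point this sketch tries to capture is that the maneuver head only sees the concept activations, so any predicted action can be read back as a set of active, human-understandable explanations.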
Key Findings and Insights
Extensive evaluations of VCBM on the DAAD-X dataset revealed several important insights:
- Transformer-based models, such as MViTv2, demonstrated greater interpretability than conventional CNN-based models for video-based explanation tasks, highlighting their strength in understanding temporal dependencies across frames.
- The Learnable Token Merging (LTM) and Localised Concept Bottleneck Model (LCBM) modules significantly improve explanation performance by preserving fine-grained spatial and temporal details.
- The gaze modality plays a crucial role. Cropping a circular region around the driver’s gaze point from the driver-view video (rather than simply overlaying the gaze on the frame) yielded the best explanation performance, because it focuses the model on gaze-relevant regions without introducing noise (see the sketch after this list).
- There is a delicate balance between explanation classification and action prediction: adding an auxiliary explanation loss improves both tasks, but weighting it too heavily slightly degrades action-prediction accuracy.
- Temporal cues are vital for generating meaningful explanations. Disrupting the temporal order of video frames significantly impacts the explanation accuracy of transformer models, underscoring their reliance on temporal information.
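For the gaze finding above, here is a minimal sketch of cropping a circular region around a gaze point, assuming the gaze is available as per-frame pixel coordinates; the radius, the zero-fill outside the circle, and the bounding-box crop are illustrative choices, not necessarily the paper's exact preprocessing.

```python
import numpy as np

def crop_gaze_region(frame: np.ndarray, gaze_xy: tuple, radius: int = 112) -> np.ndarray:
    """Keep only a circular region of `radius` pixels around the gaze point.

    `frame` is an (H, W, 3) image; `gaze_xy` is the (x, y) gaze location in pixels.
    Pixels outside the circle are zeroed, then the bounding square of the circle
    (clamped to the image borders) is cropped out.
    """
    h, w = frame.shape[:2]
    gx, gy = gaze_xy
    ys, xs = np.ogrid[:h, :w]
    mask = (xs - gx) ** 2 + (ys - gy) ** 2 <= radius ** 2
    masked = np.where(mask[..., None], frame, 0)
    # Crop a tight view centred on where the driver is looking.
    x0, x1 = max(gx - radius, 0), min(gx + radius, w)
    y0, y1 = max(gy - radius, 0), min(gy + radius, h)
    return masked[y0:y1, x0:x1]
```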
Visualizing Interpretability
The research also introduces a multi-label t-SNE visualization technique. This method helps illustrate the disentanglement and causal correlation among multiple explanations in the model’s learned feature space. Semantically related explanations tend to cluster together, and individual video features are positioned near their corresponding explanation anchors, providing a deeper understanding of the model’s reasoning.
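As a rough illustration, such a plot could be produced with scikit-learn's t-SNE, assuming `features` holds pooled per-video embeddings and `labels` is a multi-hot matrix of explanation annotations; treating per-explanation centroids as "anchors" is an assumption of this sketch, not necessarily the paper's exact procedure.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_multilabel_tsne(features: np.ndarray, labels: np.ndarray, names: list):
    """features: (N, D) video embeddings; labels: (N, C) multi-hot explanation matrix."""
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(features)
    for c, name in enumerate(names):
        idx = labels[:, c] == 1
        # Plot every sample once per active explanation, so clips sharing related
        # explanations visibly co-locate in the 2-D embedding.
        plt.scatter(coords[idx, 0], coords[idx, 1], s=8, alpha=0.5, label=name)
        # Mark each explanation's centroid as its "anchor".
        if idx.any():
            plt.scatter(*coords[idx].mean(axis=0), marker="*", s=200, edgecolors="k")
    plt.legend(fontsize=7)
    plt.show()
```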
Towards a Safer Autonomous Future
This work marks a significant step towards developing safer and more trustworthy autonomous driving systems. By providing models that can explain their decisions in human-understandable terms, the research enhances transparency, fosters greater trust, and paves the way for more reliable deployment of AI in safety-critical applications. The dataset, code, and models are publicly available, encouraging further research in this crucial area. You can find the full research paper here: Towards Safer and Understandable Driver Intention Prediction.


