Perception Challenges for Compact AI in Autonomous Vehicles

TLDR: A new benchmark, DTPQA, evaluates small Vision-Language Models (VLMs) on perception-only tasks in traffic scenes, considering object distance. Findings show these compact VLMs significantly underperform humans, especially in spatial reasoning and at longer distances, and are sensitive to question phrasing, indicating they are not yet reliable for safety-critical automated driving applications despite strong performance in tasks like traffic sign recognition.

The rapid advancements in Vision-Language Models (VLMs) hold immense promise for automated driving systems, offering powerful capabilities in understanding both visual and textual information. However, for these systems to be trusted in safety-critical applications like self-driving cars, their perception systems must be exceptionally reliable. A recent research paper, “Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception”, delves into this crucial area, specifically focusing on the performance of smaller VLMs, which are more suitable for the limited processing power of in-vehicle hardware.

The Challenge of Perception in Automated Driving

Automated driving systems require a robust perception layer to accurately interpret sensory input from cameras, LiDARs, and radars, creating a 3D map of the environment. While large VLMs have shown impressive reasoning and generalization abilities, their substantial memory requirements make them impractical for deployment in self-driving vehicles. For instance, a model like InternVL3-78B would need approximately 156 GB of VRAM, far exceeding the capacity of common platforms like the NVIDIA Jetson Orin. This limitation shifts the focus to smaller VLMs, typically with fewer than 4 billion parameters.
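The 156 GB figure is consistent with a simple back-of-the-envelope estimate. The sketch below assumes 2 bytes per parameter (16-bit weights) and counts only the weights, ignoring activations and any cache; the article does not state how the number was derived, so treat this as one plausible reading. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the model weights, in GB (1 GB = 1e9 bytes).

    Assumption: 2 bytes per parameter (FP16/BF16). Activations and caches
    are not included, so real usage would be higher.
    """
    return num_params * bytes_per_param / 1e9


large = weight_memory_gb(78e9)  # a 78-billion-parameter model
small = weight_memory_gb(4e9)   # upper end of the "small" range (< 4B)

print(f"78B model: {large:.0f} GB, 4B model: {small:.0f} GB")
```

Under the same assumption, a model at the 4-billion-parameter ceiling needs about 8 GB for its weights, which is why this size class is the one considered for in-vehicle hardware.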

However, the paper highlights a critical gap: current VLMs, especially smaller ones, often exhibit limited perception capabilities. This is particularly problematic in traffic scenes, where critical objects and agents can be at varying distances. A system that is “shortsighted”, perceiving well at close ranges (up to 20 meters) but struggling at long ranges (30+ meters), cannot be trusted to make appropriate driving decisions, such as reacting to a distant pedestrian.

Introducing DTPQA: A New Benchmark for Traffic Perception

To address this, the researchers introduced Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark specifically designed to evaluate perception-based questions in traffic scenes, enriched with distance annotations. DTPQA deliberately excludes questions requiring complex reasoning, ensuring that model performance directly reflects their perception abilities.

DTPQA is composed of two main parts: DTP-Synthetic, created using the CARLA simulator, and DTP-Real, built upon real-world images from the nuScenes dataset. Both parts feature simple, yet crucial, visual questions relevant to driving decisions. A key aspect of DTPQA is its inclusion of variations of the same question across different object distances, ranging from 5 to 50 meters, along with “negative samples” where the object is completely absent. The benchmark also maintains a balanced number of samples for each possible answer, preventing models from exploiting language biases.
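To make the structure concrete, here is a minimal sketch of how a distance-annotated sample and answer balancing could look. The `Sample` fields and the `balance_by_answer` helper are hypothetical, not the benchmark's actual schema, and downsampling to the rarest answer is just one way to achieve balance; the article does not say how the authors did it.

```python
import random
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    image_id: str
    question: str
    answer: str
    distance_m: Optional[float]  # None marks a negative sample (object absent)


def balance_by_answer(samples, seed=0):
    """Downsample so that every answer appears equally often."""
    by_answer = defaultdict(list)
    for s in samples:
        by_answer[s.answer].append(s)
    n = min(len(group) for group in by_answer.values())
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced = []
    for answer in sorted(by_answer):
        balanced.extend(rng.sample(by_answer[answer], n))
    return balanced


# Toy data, invented for illustration only.
question = "Is the pedestrian walking left or right?"
samples = [Sample(f"img{i}", question, "left", 5.0 + 5 * i) for i in range(6)]
samples += [Sample(f"img{i + 6}", question, "right", 10.0 * (i + 1)) for i in range(3)]

balanced = balance_by_answer(samples)
print(Counter(s.answer for s in balanced))
```

With equal counts per answer, a model that always replies "left" scores no better than chance, which is the language-bias shortcut that balancing is meant to close off.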

Evaluating Small VLMs: Key Findings

The study evaluated several state-of-the-art small VLMs (under 4 billion parameters) and compared their performance against human perception and a large VLM (InternVL3-78B). The results revealed a significant performance gap:

  • Human Superiority: Humans consistently outperformed all VLMs, including the large InternVL3-78B, across most tasks. The best-performing small VLM achieved an average accuracy of approximately 60%, compared to around 85% for humans.

  • Spatial Reasoning Weakness: A major weakness identified was the models’ struggle with spatial reasoning, particularly tasks requiring them to distinguish left from right. For example, identifying the direction a pedestrian is walking or which turn indicator is active on a vehicle proved extremely challenging for models, even at close distances, where human accuracy remained high.

  • Distance-Dependent Degradation: For many question types, the performance of small VLMs degraded almost linearly with increasing distance. This “shortsightedness” is a critical concern for automated driving, as many crucial objects appear at a distance.

  • Traffic Sign and Light Recognition Strength: Conversely, models performed remarkably well in tasks involving distinguishing traffic light colors and reading traffic signs, sometimes matching human performance. This suggests strong Optical Character Recognition (OCR) capabilities within these models.

  • Inconsistent Behavior and Prompt Sensitivity: The study also uncovered erratic and unpredictable behavior in some small VLMs. For instance, one model could count multiple pedestrians but failed to detect a pedestrian when only one was present. Furthermore, minor, semantically equivalent rephrasing of questions could lead to significant changes in model performance, raising serious concerns about their robustness in safety-critical scenarios.

  • Low Hallucination Rate: A positive finding was the high accuracy of small VLMs on negative samples (where no object was present), suggesting a low rate of visual hallucinations.
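The distance and negative-sample findings above come down to grouping answers by annotated distance. The sketch below shows one way to compute per-distance accuracy, a separate accuracy for negative samples, and a least-squares slope as a rough measure of degradation per meter. The function names and the toy records are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_distance(records):
    """records: iterable of (distance_m or None, is_correct).

    Returns {distance: accuracy}; negative samples (distance None)
    are grouped under the key "negative".
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for distance, correct in records:
        key = "negative" if distance is None else distance
        totals[key] += 1
        hits[key] += int(correct)
    return {k: hits[k] / totals[k] for k in totals}


def slope_per_meter(acc):
    """Least-squares slope of accuracy vs. distance, skipping negatives.

    Needs at least two distinct distances, otherwise the denominator is zero.
    """
    points = [(d, a) for d, a in acc.items() if d != "negative"]
    n = len(points)
    mean_x = sum(d for d, _ in points) / n
    mean_y = sum(a for _, a in points) / n
    sxx = sum((d - mean_x) ** 2 for d, _ in points)
    sxy = sum((d - mean_x) * (a - mean_y) for d, a in points)
    return sxy / sxx


# Toy records, invented: (distance in meters or None, answered correctly?)
records = [
    (10, True), (10, True),
    (30, True), (30, False),
    (50, False), (50, False),
    (None, True),
]
acc = accuracy_by_distance(records)
slope = slope_per_meter(acc)
print(acc)
print(f"accuracy change per meter: {slope:.3f}")
```

A clearly negative slope alongside high accuracy on the "negative" group would match the pattern described: models rarely invent objects that are not there, but they increasingly miss real ones as distance grows.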

Implications for Autonomous Driving

The findings underscore that despite their potential, current small VLMs are not yet ready for deployment in automated driving systems. Their significant weaknesses in spatial reasoning, performance degradation with distance, and sensitivity to question phrasing make them unreliable for tasks where consistent and accurate perception is paramount. While they excel in specific areas like traffic sign recognition, the overall perception system requires substantial improvement to meet the demands of safety-critical applications.

The research calls for further investigation into the underlying causes of these failures and how to enhance the perception capabilities of small VLMs without compromising their performance on other tasks. Future work could involve analyzing how visual information is processed within each model component and exploring more robust prompt engineering strategies.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society.
