Perception Challenges for Compact AI in Autonomous Vehicles

TLDR: A new benchmark, DTPQA, evaluates small Vision-Language Models (VLMs) on perception-only questions about traffic scenes, with each sample annotated by object distance. Small VLMs fall well short of human accuracy, especially on spatial reasoning and at longer distances, and they are sensitive to question phrasing. Despite strong results on tasks such as traffic sign recognition, they are not yet reliable enough for safety-critical automated driving applications.

Vision-Language Models (VLMs) are attracting interest for automated driving because they can interpret visual and textual information together. For safety-critical applications like self-driving cars, however, their perception must be exceptionally reliable. A recent research paper, “Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception”, examines this question for smaller VLMs, which are better suited to the limited processing power of in-vehicle hardware.

The Challenge of Perception in Automated Driving

Automated driving systems require a robust perception layer to accurately interpret sensory input from cameras, LiDARs, and radars, creating a 3D map of the environment. While large VLMs have shown impressive reasoning and generalization abilities, their substantial memory requirements make them impractical for deployment in self-driving vehicles. For instance, a model like InternVL3-78B would need approximately 156 GB of VRAM, far exceeding the capacity of common platforms like the NVIDIA Jetson Orin. This limitation shifts the focus to smaller VLMs, typically with fewer than 4 billion parameters.
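The 156 GB figure is consistent with a back-of-the-envelope estimate: 78 billion parameters stored at 16-bit precision (2 bytes each) occupy about 156 GB for the weights alone. The sketch below reproduces that arithmetic. The 16-bit assumption and the function name are illustrative and not stated in the article, and the estimate ignores activations and other runtime memory.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough memory needed for the model weights alone, in decimal gigabytes.

    bytes_per_param=2 assumes 16-bit weights (an assumption, not from the article).
    """
    return num_params * bytes_per_param / 1e9


large = weight_memory_gb(78e9)  # a 78-billion-parameter model such as InternVL3-78B
small = weight_memory_gb(4e9)   # the 4-billion-parameter ceiling used for "small" VLMs

print(f"78B model: ~{large:.0f} GB of weights")
print(f"4B model:  ~{small:.0f} GB of weights")
```

Under the same assumption, a model at the 4-billion-parameter ceiling needs roughly 8 GB for its weights, which is why the study concentrates on that size class.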

However, the paper highlights a critical gap: current VLMs, especially smaller ones, often exhibit limited perception capabilities. This is particularly problematic in traffic scenes, where critical objects and agents can be at varying distances. A system that is “shortsighted”, perceiving reliably at close range (up to 20 meters) but struggling at long range (30+ meters), cannot be trusted to make appropriate driving decisions, because it may fail at tasks such as detecting a distant pedestrian.

Introducing DTPQA: A New Benchmark for Traffic Perception

To address this, the researchers introduced Distance-Annotated Traffic Perception Question Answering (DTPQA), which they describe as the first Visual Question Answering (VQA) benchmark designed specifically for perception-based questions in traffic scenes and enriched with distance annotations. DTPQA deliberately excludes questions that require complex reasoning, so a model's score directly reflects its perception ability.

DTPQA is composed of two main parts: DTP-Synthetic, created using the CARLA simulator, and DTP-Real, built upon real-world images from the nuScenes dataset. Both parts feature simple, yet crucial, visual questions relevant to driving decisions. A key aspect of DTPQA is its inclusion of variations of the same question across different object distances, ranging from 5 to 50 meters, along with “negative samples” where the object is completely absent. The benchmark also maintains a balanced number of samples for each possible answer, preventing models from exploiting language biases.
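To make the benchmark design concrete, here is a minimal sketch of how a distance-annotated sample and an answer-balance check might be represented. The `Sample` class, its field names, and the example question are hypothetical illustrations, not the actual DTPQA schema or data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """Hypothetical record for one question; not the real DTPQA format."""
    question: str
    answer: str
    distance_m: Optional[float]  # None marks a negative sample (object absent)


def answer_counts(samples):
    """Count how many samples carry each ground-truth answer."""
    return Counter(s.answer for s in samples)


def is_balanced(samples):
    """True when every possible answer appears equally often."""
    return len(set(answer_counts(samples).values())) <= 1


# Invented toy data: the same question at two distances plus two negatives.
question = "Is there a pedestrian crossing the road?"
samples = [
    Sample(question, "Yes", 5.0),
    Sample(question, "Yes", 50.0),
    Sample(question, "No", None),
    Sample(question, "No", None),
]

print(answer_counts(samples))
print("balanced:", is_balanced(samples))
```

With balanced answers, a model that always replies "Yes" scores only 50% on this toy set, so it cannot gain accuracy from a language prior alone.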

Evaluating Small VLMs: Key Findings

The study evaluated several state-of-the-art small VLMs (under 4 billion parameters) and compared their performance against human perception and a large VLM (InternVL3-78B). The results revealed a significant performance gap:

Human Superiority: Humans outperformed all VLMs, including the large InternVL3-78B, on most tasks. The best-performing small VLM achieved an average accuracy of approximately 60%, compared to around 85% for humans.
Spatial Reasoning Weakness: A major weakness identified was the models’ struggle with spatial reasoning, particularly tasks requiring them to distinguish left from right. For example, identifying the direction a pedestrian is walking or which turn indicator is active on a vehicle proved extremely challenging for models, even at close distances, where human accuracy remained high.
Distance-Dependent Degradation: For many question types, the performance of small VLMs degraded almost linearly with increasing distance. This “shortsightedness” is a critical concern for automated driving, as many crucial objects appear at a distance.
Traffic Sign and Light Recognition Strength: Conversely, models performed remarkably well at distinguishing traffic light colors and reading traffic signs, sometimes matching human performance. The sign-reading results suggest strong Optical Character Recognition (OCR) capabilities in these models.
Inconsistent Behavior and Prompt Sensitivity: The study also uncovered erratic and unpredictable behavior in some small VLMs. For instance, one model could count multiple pedestrians but failed to detect a single one. Furthermore, minor, semantically equivalent rephrasing of questions could lead to significant changes in model performance, raising serious concerns about their robustness in safety-critical scenarios.
Low Hallucination Rate: A positive finding was the high accuracy of small VLMs on negative samples (where no object was present), suggesting a low rate of visual hallucinations.
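Two of the findings above, distance-dependent degradation and prompt sensitivity, can be quantified with simple statistics. The sketch below groups correctness by distance, fits a least-squares slope to the per-distance accuracy, and measures the accuracy gap between question phrasings. All function names and numbers are invented for illustration and are not results or code from the paper.

```python
from collections import defaultdict


def accuracy_by_distance(records):
    """records: iterable of (distance_m, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # distance -> [correct, total]
    for distance, correct in records:
        totals[distance][0] += int(correct)
        totals[distance][1] += 1
    return {d: c / n for d, (c, n) in sorted(totals.items())}


def linear_slope(points):
    """Least-squares slope of accuracy over distance for a dict {x: y}."""
    xs, ys = list(points), list(points.values())
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def phrasing_gap(accuracy_by_prompt):
    """Spread between the best and worst phrasing of the same question."""
    values = accuracy_by_prompt.values()
    return max(values) - min(values)


# Invented toy results: four answers at each of three distances.
records = (
    [(10, True)] * 4
    + [(30, True)] * 3 + [(30, False)]
    + [(50, True)] * 2 + [(50, False)] * 2
)

acc = accuracy_by_distance(records)
slope = linear_slope(acc)  # change in accuracy per meter
gap = phrasing_gap({"original": 0.75, "rephrased": 0.5})

print(acc)
print(f"slope per meter: {slope:.4f}")
print(f"phrasing gap: {gap:.2f}")
```

A clearly negative slope corresponds to the "shortsighted" behavior described above, and a large phrasing gap between semantically equivalent questions signals the robustness problem.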

Implications for Autonomous Driving

The findings underscore that despite their potential, current small VLMs are not yet ready for deployment in automated driving systems. Their significant weaknesses in spatial reasoning, performance degradation with distance, and sensitivity to question phrasing make them unreliable for tasks where consistent and accurate perception is paramount. While they excel in specific areas like traffic sign recognition, the overall perception system requires substantial improvement to meet the demands of safety-critical applications.

The research calls for further investigation into the underlying causes of these failures and how to enhance the perception capabilities of small VLMs without compromising their performance on other tasks. Future work could involve analyzing how visual information is processed within each model component and exploring more robust prompt engineering strategies.
