CPS Team Achieves Top Rank in CVPR 2024 Autonomous Driving Challenge with Enhanced Vision-Language Models

TLDR: The CPS Team secured the 1st rank in the CVPR 2024 Autonomous Grand Challenge’s Driving with Language track. Their winning approach involved fine-tuning LLaVA vision-language models with LoRA and DoRA, integrating depth information from open-source estimation models, and employing Chain-of-Thought reasoning. Training on the DriveLM-nuScenes dataset, their system achieved a top score of 0.7799 on the validation leaderboard through a comprehensive inference pipeline and multi-system fusion.

Autonomous driving is evolving rapidly, and a significant challenge lies in enabling vehicles to understand and respond to complex driving scenarios using both visual and linguistic information. This was the focus of the Driving with Language track at the CVPR 2024 Autonomous Grand Challenge, where the CPS Team presented its first-ranked solution.

The team’s approach centered on advanced vision-language model (VLM) systems. These systems are designed to process visual data from cameras alongside natural language instructions and questions, allowing for more nuanced decision-making in autonomous vehicles. Unlike traditional systems that might only react to visual cues, VLMs can interpret complex queries like “What is the object at these coordinates?” or “Predict the behavior of the ego vehicle,” integrating context from both modalities.

At the heart of their system were the LLaVA models (LLaVA-1.5-7B and LLaVA-NeXT-7B), which combine vision and language processing. To tailor these models to the autonomous driving challenge, the CPS Team employed parameter-efficient fine-tuning methods: LoRA (Low-Rank Adaptation) and DoRA (Weight-Decomposed Low-Rank Adaptation). These techniques adapt a model by training a small number of added parameters while the pretrained weights stay frozen, which avoids the computational cost of full fine-tuning.
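To give a sense of the savings, here is a rough parameter count for a single weight matrix. It follows the formulations in the LoRA and DoRA papers: two low-rank factors for LoRA, plus one magnitude value per weight column for DoRA. The layer size and rank below are illustrative assumptions, not the team's reported settings, and `adapter_param_counts` is a hypothetical helper name.

```python
def adapter_param_counts(d_out, d_in, rank):
    """Trainable parameters for one d_out x d_in weight matrix.

    full: every weight is trained.
    lora: two low-rank factors, B (d_out x rank) and A (rank x d_in).
    dora: the LoRA factors plus one magnitude value per weight column.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    dora = lora + d_in
    return full, lora, dora


# Illustrative values only (assumed, not taken from the paper).
full, lora, dora = adapter_param_counts(d_out=4096, d_in=4096, rank=16)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA:             {lora:,} parameters ({lora / full:.2%} of full)")
print(f"DoRA:             {dora:,} parameters ({dora / full:.2%} of full)")
```

With these assumed sizes, the adapters train under 1% of the parameters that full fine-tuning of the same matrix would.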

A key part of their methodology was the integration of depth information. Using open-source depth estimation models such as Depth Anything, the team estimated the depth of objects in images. These estimates were then converted into textual descriptions (e.g., ‘close’ or ‘far’) and added to the model’s input. The added context helped the VLM better understand the spatial relationships between objects in the driving environment, which supports more accurate perception and prediction.
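The depth-to-text step can be sketched as follows. This is a minimal illustration, not the team's code: it assumes the depth map is a 2D grid in which larger values mean farther away, takes the median depth inside an object's bounding box, and applies a single made-up threshold. The function name, box format, and threshold are all hypothetical.

```python
from statistics import median


def describe_depth(depth_map, box, near_threshold=10.0):
    """Map the median depth inside a bounding box to 'close' or 'far'.

    depth_map: 2D list of depth values, indexed as depth_map[row][col];
               assumed here to grow with distance from the camera.
    box: (x1, y1, x2, y2) in pixel coordinates, with x2 and y2 exclusive.
    near_threshold: hypothetical cut-off between 'close' and 'far'.
    """
    x1, y1, x2, y2 = box
    values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    if not values:
        raise ValueError("bounding box covers no pixels")
    return "close" if median(values) < near_threshold else "far"


# Toy 4x4 depth map: left half near the camera, right half far away.
toy_depth = [[5.0, 5.0, 30.0, 30.0] for _ in range(4)]
print(describe_depth(toy_depth, (0, 0, 2, 4)))
print(describe_depth(toy_depth, (2, 0, 4, 4)))
```

The median is used here so that a few background pixels inside the box do not dominate the label. A real depth model's output scale would need to be checked before choosing any threshold.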

The models were trained only on the DriveLM-nuScenes dataset, a collection of driving scenes with associated images and question-and-answer pairs covering perception, prediction, planning, and behavior tasks. For inference, the team built a multi-stage pipeline. A prompt design module combined the depth estimates and descriptions of key objects with the original question to form a detailed prompt for the VLM. For critical question types, such as multiple-choice and yes/no questions, they adopted a Chain-of-Thought reasoning approach to guide the VLM towards more precise answers.
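A prompt design module of this kind might look like the sketch below. The prompt wording, the question-type labels, and the `build_prompt` helper are assumptions made for illustration; the article does not give the team's actual templates.

```python
# Question types that receive a Chain-of-Thought instruction (assumed labels).
COT_QUESTION_TYPES = {"multiple_choice", "yes_no"}


def build_prompt(question, objects, question_type):
    """Combine key-object descriptions, depth labels, and the question.

    objects: list of dicts with hypothetical keys 'name' and 'depth',
             e.g. {"name": "pedestrian", "depth": "close"}.
    """
    lines = []
    if objects:
        lines.append("Key objects in the scene:")
        for obj in objects:
            lines.append(f"- {obj['name']} (depth: {obj['depth']})")
    lines.append(f"Question: {question}")
    if question_type in COT_QUESTION_TYPES:
        # Chain-of-Thought: ask for reasoning before the final answer.
        lines.append("Reason step by step, then state the final answer.")
    return "\n".join(lines)


prompt = build_prompt(
    question="Is it safe for the ego vehicle to turn left?",
    objects=[{"name": "pedestrian", "depth": "close"},
             {"name": "truck", "depth": "far"}],
    question_type="yes_no",
)
print(prompt)
```

Keeping the Chain-of-Thought instruction conditional on question type mirrors the article's description that it was applied to specific question types, not to every question.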

The CPS Team achieved a top score of 0.7799 on the validation set leaderboard, securing the 1st rank. A multi-system fusion approach contributed to this result: the best-performing model for each question type was used to compile the final inference results. The methodology shows the potential of combining vision-language models with depth information to build more capable and reliable autonomous driving systems. The full research paper has more technical details.
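The per-question-type fusion described above can be sketched in a few lines. This is an illustration under stated assumptions: the system names, question types, answers, and scores are invented placeholders (not results from the paper), and the two helper functions are hypothetical.

```python
def select_best_system(scores):
    """Pick the highest-scoring system for each question type.

    scores: {question_type: {system_name: validation_score}}
    """
    return {
        qtype: max(by_system, key=by_system.get)
        for qtype, by_system in scores.items()
    }


def fuse_predictions(predictions, question_types, best_system):
    """Take each answer from the system chosen for that question's type.

    predictions: {system_name: {question_id: answer}}
    question_types: {question_id: question_type}
    """
    return {
        qid: predictions[best_system[qtype]][qid]
        for qid, qtype in question_types.items()
    }


# Placeholder numbers for illustration only.
scores = {
    "yes_no": {"system_a": 0.71, "system_b": 0.75},
    "multiple_choice": {"system_a": 0.80, "system_b": 0.62},
}
predictions = {
    "system_a": {"q1": "Yes", "q2": "B"},
    "system_b": {"q1": "No", "q2": "C"},
}
question_types = {"q1": "yes_no", "q2": "multiple_choice"}

best = select_best_system(scores)
print(fuse_predictions(predictions, question_types, best))
```

In this toy setup, the yes/no answer comes from `system_b` and the multiple-choice answer from `system_a`, because each has the higher placeholder score for that question type.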
