
CPS Team Achieves Top Rank in CVPR 2024 Autonomous Driving Challenge with Enhanced Vision-Language Models

TLDR: The CPS Team ranked 1st in the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. Their winning approach fine-tuned LLaVA vision-language models with LoRA and DoRA, integrated depth information from open-source estimation models, and employed Chain-of-Thought reasoning. Trained on the DriveLM-nuScenes dataset and combined through a comprehensive inference pipeline and multi-system fusion, their system achieved a top score of 0.7799 on the validation leaderboard.

The field of autonomous driving is rapidly evolving, and a significant challenge lies in enabling vehicles to understand and respond to complex driving scenarios using both visual and linguistic information. This was the core focus of the Driving with Language track at the CVPR 2024 Autonomous Grand Challenge, where the CPS Team presented a highly effective solution.

The team’s approach centered on advanced vision-language model (VLM) systems. These systems are designed to process visual data from cameras alongside natural language instructions and questions, allowing for more nuanced decision-making in autonomous vehicles. Unlike traditional systems that might only react to visual cues, VLMs can interpret complex queries like “What is the object at these coordinates?” or “Predict the behavior of the ego vehicle,” integrating context from both modalities.

At the heart of their system were the LLaVA models (LLaVA-1.5-7B and LLaVA-NeXT-7B), which are known for their ability to combine vision and language processing. To tailor these powerful models specifically for the autonomous driving challenge, the CPS Team employed parameter-efficient fine-tuning methods: LoRA (Low-Rank Adaptation) and DoRA (Weight-Decomposed Low-Rank Adaptation). These techniques allowed them to enhance the models’ performance without requiring extensive computational resources for full fine-tuning.
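To make the two fine-tuning methods concrete, here is a minimal plain-Python sketch of how LoRA and DoRA modify a single weight matrix. It is illustrative only and not the team's code. LoRA keeps the pretrained weight W frozen and learns a low-rank update B·A scaled by alpha/r. DoRA additionally splits the updated weight into a direction, where each output row is normalized to unit length (the convention used by Hugging Face PEFT), and a trainable magnitude vector m. The function names and the toy matrices are hypothetical.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of an (n x k) matrix and a (k x m) matrix."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """LoRA: W' = W + (alpha / r) * B @ A, with W frozen and only A, B trained.

    w: pretrained weight, shape (d_out, d_in)
    a: down-projection, shape (r, d_in)
    b: up-projection, shape (d_out, r)
    """
    r = len(a)
    delta = matmul(b, a)
    return [
        [w[i][j] + (alpha / r) * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def row_norms(w: Matrix) -> list[float]:
    """L2 norm of each output row."""
    return [math.sqrt(sum(x * x for x in row)) for row in w]


def dora_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float, m: list[float]) -> Matrix:
    """DoRA: trainable magnitude m times the unit-norm direction of the LoRA-updated weight."""
    merged = lora_weight(w, a, b, alpha)
    norms = row_norms(merged)
    return [[m[i] * x / norms[i] for x in row] for i, row in enumerate(merged)]


# Toy example: a 2x2 layer with a rank-1 adapter (values are arbitrary).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]    # shape (r=1, d_in=2)
B = [[1.0], [0.0]]  # shape (d_out=2, r=1); in real training B starts at zero
print(lora_weight(W, A, B, alpha=1.0))  # [[2.0, 1.0], [0.0, 1.0]]
print(dora_weight(W, A, B, alpha=1.0, m=row_norms(W)))
```

Because B is initialized to zero and m to the row norms of W, both adapters start as exact no-ops, so fine-tuning begins from the pretrained model. In practice, libraries such as Hugging Face PEFT expose both methods through `LoraConfig` (DoRA via `use_dora=True`). The article does not say which tooling, ranks, or target layers the team used.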

A crucial innovation in their methodology was the integration of depth information. Using open-source depth estimation models such as Depth Anything, the team estimated the depth of objects in each image. This depth data was then converted into textual descriptions (e.g., "close" or "far") and incorporated into the model's input. This enriched context helped the VLM better understand the spatial relationships of objects in the driving environment, leading to more accurate perception and prediction.
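The depth-to-text step can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's implementation. It assumes the depth estimator has already produced a per-pixel depth map in meters, takes the median depth in a small window around an object's pixel coordinates, and buckets it into a word. The base Depth Anything models predict relative rather than metric depth by default, and the article does not specify which variant was used, so the meter thresholds (10 m and 30 m) and all function names here are hypothetical.

```python
from statistics import median

DepthMap = list[list[float]]  # depth_map[row][col], assumed to be in meters


def depth_at(depth_map: DepthMap, x: int, y: int, radius: int = 2) -> float:
    """Median depth in a (2*radius+1)^2 window around pixel (x, y), clipped to the image."""
    h, w = len(depth_map), len(depth_map[0])
    values = [
        depth_map[j][i]
        for j in range(max(0, y - radius), min(h, y + radius + 1))
        for i in range(max(0, x - radius), min(w, x + radius + 1))
    ]
    return median(values)  # median is robust to edge pixels from the background


def depth_to_text(depth_m: float, near: float = 10.0, far: float = 30.0) -> str:
    """Bucket a depth value into a coarse word (illustrative thresholds)."""
    if depth_m < near:
        return "close"
    if depth_m < far:
        return "at a medium distance"
    return "far"


def describe_depth(label: str, depth_map: DepthMap, x: int, y: int) -> str:
    """Turn one object's depth into a sentence that can be added to the prompt."""
    d = depth_at(depth_map, x, y)
    return f"{label} is {depth_to_text(d)} (about {d:.0f} m away)."


# Toy 5x5 depth map: three near columns (6 m) and two distant columns (42 m).
toy_map = [[6.0, 6.0, 6.0, 42.0, 42.0] for _ in range(5)]
print(describe_depth("The car", toy_map, 0, 0))    # The car is close (about 6 m away).
print(describe_depth("The truck", toy_map, 4, 4))  # The truck is far (about 42 m away).
```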

The models were trained exclusively on the DriveLM-nuScenes dataset, a comprehensive collection of driving scenes with associated images and question-and-answer pairs covering perception, prediction, planning, and behavior tasks. For inference, the team built a dedicated pipeline. Its prompt design module combined the depth estimates and descriptions of key objects with the original question, producing a detailed, context-rich prompt for the VLM. For critical question types, such as multiple-choice and yes/no questions, they adopted a Chain-of-Thought reasoning approach to guide the VLM toward more precise answers.
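A prompt design module of this kind might look roughly like the sketch below. It is hypothetical: the article does not publish the team's templates, so the question-type heuristics, the Chain-of-Thought wording, and all names are assumptions. The sketch places object and depth notes ahead of the original question, and appends a step-by-step reasoning instruction only for multiple-choice and yes/no questions.

```python
import re

# Hypothetical heuristics for spotting question types (not from the paper).
CHOICE_PATTERN = re.compile(r"\b[A-D]\.\s")
YES_NO_STARTS = ("is ", "are ", "does ", "do ", "will ", "would ", "can ", "should ")

# Chain-of-Thought instructions, appended only for the critical question types.
COT_SUFFIX = {
    "multiple_choice": "Think step by step about the relevant objects, "
                       "then give the final answer as a single option letter.",
    "yes_no": "Think step by step about the relevant objects, "
              "then give the final answer as Yes or No.",
}


def question_type(question: str) -> str:
    """Classify a question as multiple_choice, yes_no, or open."""
    if CHOICE_PATTERN.search(question):
        return "multiple_choice"
    if question.strip().lower().startswith(YES_NO_STARTS):
        return "yes_no"
    return "open"


def build_prompt(question: str, object_notes: list[str], depth_notes: list[str]) -> str:
    """Combine key-object and depth descriptions with the question."""
    lines = ["Scene context:"]
    lines += [f"- {note}" for note in object_notes + depth_notes]
    lines.append(f"Question: {question}")
    cot = COT_SUFFIX.get(question_type(question))
    if cot is not None:
        lines.append(cot)
    return "\n".join(lines)


print(build_prompt(
    "Is the pedestrian crossing in front of the ego vehicle?",
    ["A pedestrian is near the crosswalk in the front camera view."],
    ["The pedestrian is close (about 8 m away)."],
))
```

Open-ended questions are passed through without the reasoning instruction, which mirrors the article's point that Chain-of-Thought was reserved for the question types where a short, exact answer is scored.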


The results were impressive. The CPS Team achieved a top score of 0.7799 on the validation set leaderboard, ranking 1st. This success was further bolstered by a multi-system fusion approach, in which the best-performing model for each question type was used to compile the final inference results. This comprehensive methodology demonstrates the significant potential of integrating advanced vision-language models and depth information for building more intelligent and reliable autonomous driving systems. The full research paper provides further technical details.
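The multi-system fusion step amounts to routing each question type to whichever system scored best on it. Below is a minimal sketch of that idea, assuming per-type validation scores are available for each candidate system. The system names, question types, and scores are made up purely for illustration and are not the team's results.

```python
Scores = dict[str, dict[str, float]]       # scores[system][question_type]
Predictions = dict[str, dict[str, str]]    # predictions[system][question_id]


def best_system_per_type(val_scores: Scores) -> dict[str, str]:
    """For each question type, pick the system with the highest validation score."""
    best: dict[str, str] = {}
    best_score: dict[str, float] = {}
    for system, per_type in val_scores.items():
        for qtype, score in per_type.items():
            if qtype not in best or score > best_score[qtype]:
                best[qtype] = system
                best_score[qtype] = score
    return best


def fuse_predictions(predictions: Predictions, question_types: dict[str, str],
                     routing: dict[str, str]) -> dict[str, str]:
    """Take each question's answer from the system routed to its question type."""
    return {qid: predictions[routing[qtype]][qid] for qid, qtype in question_types.items()}


# Made-up numbers for illustration only.
val_scores = {
    "llava15_lora": {"perception": 0.61, "planning": 0.72},
    "llavanext_dora": {"perception": 0.66, "planning": 0.70},
}
routing = best_system_per_type(val_scores)
print(routing)  # {'perception': 'llavanext_dora', 'planning': 'llava15_lora'}

predictions = {
    "llava15_lora": {"q1": "answer from system 1", "q2": "Yes"},
    "llavanext_dora": {"q1": "answer from system 2", "q2": "No"},
}
question_types = {"q1": "perception", "q2": "planning"}
print(fuse_predictions(predictions, question_types, routing))
```

Selecting by validation score keeps the fusion simple: no extra model is trained, and each question type inherits the strengths of whichever fine-tuned variant handled it best.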

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
