TLDR: The CoVeRaP research introduces a novel 21,000-frame cooperative dataset that aligns radar, camera, and GPS data from multiple vehicles to improve 3D object detection. It proposes a unified cooperative perception framework with middle and late fusion options, demonstrating that middle fusion with intensity encoding significantly boosts detection accuracy (up to 9x mAP at IoU 0.9) and consistently outperforms single-vehicle baselines. The work establishes a reproducible benchmark for multi-vehicle FMCW-radar perception, highlighting the benefits of affordable radar sharing for robust autonomous driving.
Autonomous driving systems are constantly evolving, but a major hurdle remains: reliable perception in all conditions. While cameras and LiDAR offer high-resolution data, they struggle in adverse weather like rain or glare. Traditional Frequency-Modulated Continuous Wave (FMCW) radars are more robust in these conditions, but single-radar systems often produce sparse, noisy data, limiting their effectiveness for 3D object detection.
A promising solution to these challenges is cooperative perception, particularly in a Vehicle-to-Vehicle (V2V) context. By sharing radar data from multiple vehicles, the system can create denser point clouds, reduce blind spots, and achieve more accurate 3D object detections. This significantly improves object tracking and overall situational awareness, especially in complex traffic scenarios where single sensors might fail due to occlusions.
Addressing the need for robust cooperative perception, researchers Jinyue Song, Hansol Ku, Jayneel Vora, Nelson Lee, Ahmad Kamari, Prasant Mohapatra, and Parth Pathak have introduced CoVeRaP: Cooperative Vehicular Perception through mmWave FMCW Radars. The work contributes a new, large-scale cooperative radar perception dataset and a unified framework for processing it. The dataset comprises 21,000 frames and time-aligns radar, camera, and GPS streams from multiple vehicles performing diverse maneuvers. It provides a reproducible benchmark for multi-vehicle FMCW-radar perception, complete with high-quality ground truth annotations.
Understanding CoVeRaP’s Approach
The CoVeRaP framework explores two primary data fusion strategies: middle fusion and late fusion. Both aim to enhance 3D object detection by combining complementary sensor data from different vehicles, but they differ in how and when this integration occurs.
Middle Fusion: In this approach, each vehicle first processes its own raw sensor data to extract features like point clouds, range, velocity, and signal intensity. These extracted features are then synchronized and merged into a common spatial frame before any final 3D bounding box prediction is made. This feature-level integration allows the system to leverage richer information from multiple viewpoints, leading to more robust and accurate detections. The process involves multi-vehicle sensing, precise time and spatial alignment using GPS offsets and an event-trigger mechanism, and then feeding the fused data into a deep learning model for enhanced perception.
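The spatial-alignment and merge steps can be illustrated with a minimal sketch. This is not the authors' code: it assumes each radar point is a small record in its own vehicle's frame, that the GPS offset reduces to a planar translation (dx, dy) plus an optional yaw difference, and that radial velocity is carried over unchanged. The names `RadarPoint`, `to_ego_frame`, and `middle_fuse` are hypothetical.

```python
import math
from typing import List, NamedTuple


class RadarPoint(NamedTuple):
    x: float          # metres, forward in the sensing vehicle's frame (assumed)
    y: float          # metres, to the left
    z: float          # metres, up
    velocity: float   # radial velocity in m/s, left untouched by the transform
    intensity: float  # return strength, arbitrary units


def to_ego_frame(p: RadarPoint, dx: float, dy: float, yaw: float) -> RadarPoint:
    """Rotate a point by yaw (radians), then translate it by the GPS offset."""
    c, s = math.cos(yaw), math.sin(yaw)
    return p._replace(x=c * p.x - s * p.y + dx, y=s * p.x + c * p.y + dy)


def middle_fuse(ego: List[RadarPoint], assistant: List[RadarPoint],
                dx: float, dy: float, yaw: float = 0.0) -> List[RadarPoint]:
    """Merge both point sets in the ego frame before any box is predicted."""
    return list(ego) + [to_ego_frame(p, dx, dy, yaw) for p in assistant]


ego_points = [RadarPoint(10.0, 0.5, 0.0, -1.2, 0.8)]
assistant_points = [RadarPoint(4.0, -3.0, 0.1, -1.1, 0.6)]

# Hypothetical offset: the assistant vehicle is 6 m ahead and 3.5 m to the left.
fused = middle_fuse(ego_points, assistant_points, dx=6.0, dy=3.5)
print(len(fused), fused[1].x, fused[1].y)  # 2 10.0 0.5
```

In this toy case both vehicles see the same target, so after alignment the assistant's point lands on the ego's point and the fused cloud is denser around the object. A detector would then consume `fused` as a single input.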
Late Fusion: Conversely, in late fusion, each vehicle operates independently, performing its own feature extraction and prediction (e.g., generating 3D bounding boxes and confidence scores). These individual predictions are then transformed into a common coordinate system using GPS offsets. Finally, a decision layer combines these bounding boxes based on their confidence scores. While more robust to sensor noise and discrepancies, late fusion typically retains less information than middle fusion because the integration happens at the prediction level rather than the feature level.
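A minimal sketch of prediction-level merging follows. The article does not spell out the decision rule, so the one used here is an assumption: detections are shifted into the ego frame by a translation-only GPS offset, sorted by confidence, and any detection whose center lies within a fixed radius of an already kept one is dropped. `Detection`, `shift_to_ego`, `late_fuse`, and the 1 m radius are all hypothetical.

```python
import math
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Detection:
    cx: float     # box center, metres
    cy: float
    cz: float
    score: float  # confidence in [0, 1]


def shift_to_ego(det: Detection, dx: float, dy: float) -> Detection:
    """Apply a planar GPS offset (translation only, an assumed simplification)."""
    return replace(det, cx=det.cx + dx, cy=det.cy + dy)


def late_fuse(ego: List[Detection], assistant: List[Detection],
              dx: float, dy: float, radius: float = 1.0) -> List[Detection]:
    """Greedy merge: among detections closer than `radius`, keep the best score."""
    pool = list(ego) + [shift_to_ego(d, dx, dy) for d in assistant]
    pool.sort(key=lambda d: d.score, reverse=True)
    kept: List[Detection] = []
    for det in pool:
        center = (det.cx, det.cy, det.cz)
        if all(math.dist(center, (k.cx, k.cy, k.cz)) >= radius for k in kept):
            kept.append(det)
    return kept


ego_dets = [Detection(10.0, 0.5, 0.0, 0.55)]
assistant_dets = [Detection(4.1, -3.0, 0.0, 0.90), Detection(20.0, 0.0, 0.0, 0.40)]
merged = late_fuse(ego_dets, assistant_dets, dx=6.0, dy=3.5)
print([d.score for d in merged])  # [0.9, 0.4]
```

The ego's lower-confidence box is replaced by the assistant's box for the same object, while the assistant's second detection, which the ego did not see, is kept. Only boxes and scores cross the vehicle boundary here, which is why this style of fusion discards the point-level detail that middle fusion retains.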
The Baseline Model Architecture
To effectively process the sparse and noisy radar data, the researchers developed a baseline model inspired by PointNet and self-attention mechanisms. This multi-branch architecture is designed to integrate various radar cues. It includes:
- Multimodal Signal Encoding: This stage extracts distinct features: spatial coordinates (position branch), motion-related features like velocity and bearing (dynamics branch), and signal intensity values (intensity branch). The intensity branch is particularly important as stronger radar returns often indicate valid detections, helping to filter out noise.
- Contextual Feature Synthesis: The extracted features are then integrated into a unified representation. A self-attention module refines point-wise features, and pooling mechanisms create a global context vector. This stage ensures that complementary information across spatial, motion, and intensity cues is preserved.
- Multimodal Output Decoding: The fused representation is translated into actionable predictions. A depth estimation subnet predicts confidence scores for depth, and a 3D bounding box decoder outputs seven parameters for the detected object: width, height, length, the three center coordinates, and orientation.
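The encode, attend, and pool flow of the three stages above can be sketched in plain Python. This is a toy illustration, not the authors' network: it has no learned weights (the query, key, and value projections are the identity), it simply concatenates the three branch inputs instead of encoding each branch separately, and it omits the decoding stage. The function names and sample values are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def encode(position: Vector, dynamics: Vector, intensity: float) -> Vector:
    """Concatenate position, dynamics, and intensity into one per-point vector."""
    return position + dynamics + [intensity]


def self_attention(points: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(points[0])
    refined = []
    for q in points:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in points]
        weights = softmax(scores)
        refined.append([sum(w * v[i] for w, v in zip(weights, points))
                        for i in range(dim)])
    return refined


def global_max_pool(points: List[Vector]) -> Vector:
    """PointNet-style symmetric pooling: per-channel maximum over all points."""
    return [max(channel) for channel in zip(*points)]


cloud = [
    encode([10.0, 0.5, 0.0], [-1.2, 0.05], 0.8),
    encode([10.4, 0.7, 0.1], [-1.1, 0.06], 0.6),
    encode([30.0, -4.0, 0.0], [0.0, -0.10], 0.1),
]
refined = self_attention(cloud)
context = global_max_pool(refined)
print(len(refined), len(context))  # 3 6
```

Max pooling is order-invariant, so the global context vector does not depend on how the radar points are listed. In a real model, a decoder would map that vector to the seven box parameters.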
Key Findings and Impact
The experiments conducted on the CoVeRaP dataset yielded significant insights into cooperative perception:
The inclusion of radar-return intensity encoding in the middle fusion strategy dramatically improved performance. For instance, configurations without intensity encoding often failed to produce correct predictions, while those with intensity encoding achieved substantial mean Average Precision (mAP) boosts, especially at stricter Intersection over Union (IoU) thresholds. This highlights the critical role of intensity in distinguishing valid radar returns from background noise.
Cooperative fusion consistently outperformed single-vehicle baselines. Fused-ego views (where data from an assistant vehicle is merged with the ego vehicle’s data) showed superior detection performance compared to using data from a single viewpoint alone. This demonstrates the clear advantage of leveraging complementary information from multiple perspectives.
Middle fusion generally outperformed late fusion, particularly at higher IoU thresholds (e.g., 0.7–0.9). This is because feature-level integration in middle fusion retains more detailed information, leading to more accurate and robust detections compared to simply merging bounding box predictions.
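To make the IoU thresholds concrete, here is a small worked example. It assumes axis-aligned boxes, which ignores the orientation parameter the detector predicts, so it is a simplification rather than the benchmark's evaluation code. The `Box` layout, the function names, and the box dimensions are hypothetical.

```python
from typing import Tuple

# (cx, cy, cz, length, width, height) in metres, axis-aligned by assumption
Box = Tuple[float, float, float, float, float, float]


def overlap_1d(c1: float, s1: float, c2: float, s2: float) -> float:
    """Length of the overlap between two intervals given by center and size."""
    lo = max(c1 - s1 / 2, c2 - s2 / 2)
    hi = min(c1 + s1 / 2, c2 + s2 / 2)
    return max(0.0, hi - lo)


def iou_3d(a: Box, b: Box) -> float:
    """Intersection volume divided by union volume."""
    inter = 1.0
    for axis in range(3):
        inter *= overlap_1d(a[axis], a[axis + 3], b[axis], b[axis + 3])
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)


truth = (10.0, 0.0, 0.0, 4.0, 2.0, 1.5)
predicted = (10.3, 0.0, 0.0, 4.0, 2.0, 1.5)  # same size, 30 cm off lengthwise
print(f"{iou_3d(truth, predicted):.2f}")  # 0.86
```

A 30 cm error on a 4 m box already drops the IoU to about 0.86, which counts as a hit at a threshold of 0.7 but a miss at 0.9. This is why gains at IoU 0.9 indicate much tighter localization, not just more detections.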
The study found that combining ego and assistant views through middle fusion with intensity encoding raised mean Average Precision at an IoU of 0.9 by up to 9 times relative to a single rear view. This result suggests that sharing data between affordable radars can markedly improve detection robustness.
CoVeRaP establishes the first reproducible benchmark for multi-vehicle FMCW-radar perception, demonstrating that cooperative sensing can significantly improve 3D object detection. The dataset and code are publicly available to encourage further research and development in this critical area of autonomous driving. While the current public release focuses on parallel-lane scenes and vehicle targets, future work will expand to more complex trajectories and object classes like bikes, pedestrians, and road signs, pushing the capabilities of low-cost radar-only perception systems.


