TLDR: The paper introduces TrafficScene, the first multimodal dataset combining light field images and LiDAR point clouds with full semantic annotations. It also proposes Mlpfseg, a novel network that fuses these modalities for simultaneous semantic segmentation of both images and point clouds. Mlpfseg uses a Point-Pixel Feature Fusion Module to handle density differences and a Depth Difference Perception Module to improve detection of occluded objects, significantly enhancing segmentation accuracy over single-modality and previous fusion methods, especially for small and occluded objects in autonomous driving scenarios.
Semantic segmentation is a fundamental technology for autonomous driving, allowing vehicles to understand their surroundings by assigning a specific label to every pixel in an image or point in a point cloud. However, complex conditions like occlusions—where objects are partially hidden—pose significant challenges to current systems.
Traditional methods often rely on either camera images, which provide rich color and texture but lack precise 3D spatial information and are sensitive to lighting, or LiDAR point clouds, which offer accurate 3D geometry but are sparse and colorless. While fusing these two modalities has shown promise, existing fusion approaches typically produce a segmentation for only one modality, failing to fully exploit the complementary strengths of both, especially when dealing with occluded or small objects.
To overcome these limitations, a team of researchers including Jie Luo, Yuxuan Jiang, Xin Jin, Mingyu Liu, and Yihui Fan has introduced a groundbreaking approach. Their work, detailed in the paper Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion, proposes a novel multimodal dataset and a sophisticated network architecture to enhance scene understanding.
Introducing TrafficScene: A New Multimodal Dataset
The first major contribution is TrafficScene, the inaugural dataset for semantic segmentation that integrates both light field images and LiDAR point cloud data. Unlike previous datasets, TrafficScene was captured using a unique 3×3 camera array with a 30 cm baseline, providing multiple viewpoints with significant overlap. This setup is crucial for capturing more angular information, which greatly aids in perceiving occluded objects.
Crucially, all viewpoints of the light field images in TrafficScene are semantically annotated, a significant improvement over datasets that only annotate the central view. This comprehensive annotation, combined with aligned LiDAR point cloud data, enables more effective information supplementation for occluded and small objects through multi-view consistency. The dataset includes 5607 light field images and 623 frames of point clouds from diverse traffic scenarios, enhancing its utility for real-world autonomous driving applications.
Mlpfseg: A Fusion Network for Simultaneous Segmentation
Building upon the TrafficScene dataset, the researchers developed the Multimodal Light Field Point Cloud Fusion Segmentation Method (Mlpfseg). This network is designed to simultaneously segment both light field images and LiDAR point clouds, fully exploiting the complementary nature of these modalities.
Mlpfseg incorporates two key modules:
- Point-Pixel Feature Fusion Module (PFFM): This module addresses the challenge of density mismatch between sparse point clouds and dense image pixels. It projects point cloud features onto the image plane and then interpolates these sparse projections to create a dense feature map. A self-attention mechanism then refines the fusion, allowing both image and point cloud features to gather useful information from each other, leading to a more integrated representation (a minimal sketch follows this list).
- Depth Difference Perception Module (DDPM): Occluded objects often present conflicting features when viewed from a single perspective. DDPM tackles this by leveraging depth information: it compares depth maps predicted from images with sparse depth maps derived from LiDAR. Regions with significant depth discrepancies are flagged as potential occlusions, and the module reinforces attention scores in these areas, guiding the network to focus on and accurately segment hidden parts of objects (also sketched below).
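To make the PFFM idea concrete, here is a minimal PyTorch sketch of the scatter → densify → attend pipeline described above. The class name `PointPixelFusion`, the mask-normalized box filter used for densification, and the single attention layer are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointPixelFusion(nn.Module):
    """Toy version of point-pixel fusion: scatter, densify, attend."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, img_feat, pt_feat, uv):
        # img_feat: (B, C, H, W) dense image features
        # pt_feat:  (B, N, C)    per-point features
        # uv:       (B, N, 2)    pixel coords of each point after projection
        B, C, H, W = img_feat.shape
        sparse = img_feat.new_zeros(B, C, H, W)
        mask = img_feat.new_zeros(B, 1, H, W)
        uv = uv.long()
        for b in range(B):  # loop kept for readability; vectorize in practice
            u, v = uv[b, :, 0], uv[b, :, 1]
            sparse[b, :, v, u] = pt_feat[b].t()
            mask[b, 0, v, u] = 1.0
        # Densify the scattered features with a mask-normalized box filter --
        # a cheap stand-in for the paper's interpolation of sparse projections.
        k = 7
        dense = F.avg_pool2d(sparse, k, stride=1, padding=k // 2)
        hits = F.avg_pool2d(mask, k, stride=1, padding=k // 2).clamp(min=1e-6)
        dense = dense / hits
        # Self-attention over the concatenated token sets lets image and point
        # features gather information from each other (quadratic in H*W; a
        # real system would attend over a downsampled feature map).
        img_tok = img_feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        pt_tok = dense.flatten(2).transpose(1, 2)       # (B, H*W, C)
        tokens = torch.cat([img_tok, pt_tok], dim=1)    # (B, 2*H*W, C)
        fused, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(fused + tokens)
        img_out, pt_out = fused.split(H * W, dim=1)

        def to_map(t):
            return t.transpose(1, 2).reshape(B, C, H, W)

        return to_map(img_out), to_map(pt_out)
```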
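The DDPM cue can be sketched just as simply. In the hypothetical helper below, the fixed threshold `tau` and multiplicative `boost` are illustrative stand-ins; how the paper actually computes and applies the attention reinforcement may differ:

```python
import torch

def depth_difference_attention(pred_depth, lidar_depth, attn,
                               tau=1.0, boost=2.0):
    # pred_depth:  (B, 1, H, W) dense depth predicted from the image branch
    # lidar_depth: (B, 1, H, W) sparse depth from projected LiDAR (0 = no hit)
    # attn:        (B, 1, H, W) attention scores to be re-weighted
    valid = lidar_depth > 0                    # pixels with a LiDAR return
    diff = (pred_depth - lidar_depth).abs()    # camera/LiDAR disagreement
    # Large disagreement suggests the two sensors see different surfaces at
    # that pixel, i.e. a likely occlusion; reinforce attention there.
    occluded = valid & (diff > tau)
    return torch.where(occluded, attn * boost, attn)
```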
Superior Performance in Complex Scenarios
Experiments on the TrafficScene dataset demonstrate Mlpfseg's superior performance. The method outperforms image-only segmentation by 1.71 Mean Intersection over Union (mIoU) and point-cloud-only segmentation by 2.38 mIoU, and it surpasses state-of-the-art multimodal 3D semantic segmentation methods by 2.38 mIoU.
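For context, mIoU averages the per-class intersection-over-union between predicted and ground-truth labels, so a gain of 1.71 mIoU is 1.71 percentage points. The standard computation looks like this (the function name is mine):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    # pred, target: integer label arrays of identical shape
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                       # ignore classes absent from both
            ious.append(inter / union)
    return 100.0 * float(np.mean(ious))     # mIoU in percentage points
```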
Notably, Mlpfseg shows substantial improvements in segmenting small objects like bicyclists, pedestrians, and traffic cones, and excels in correctly identifying partially occluded objects. This enhanced accuracy is attributed to the comprehensive fusion of light field and point cloud data, along with the intelligent design of the DDPM, which specifically targets occlusion awareness.
This research marks a significant step forward in semantic segmentation for autonomous driving, offering a robust solution for complex and challenging real-world traffic environments by effectively combining the strengths of light field imaging and LiDAR technology.


