TLDR: Humanoid Occupancy is a novel perception system for humanoid robots that integrates hardware, software, and a unique dataset. It uses multi-modal fusion of camera and LiDAR data to create detailed 3D occupancy maps with semantic labels, enabling better navigation and task planning in complex environments. The system addresses challenges like kinematic interference and introduces the first panoramic occupancy dataset for humanoids, demonstrating superior performance and a lightweight architecture.
Humanoid robots are rapidly advancing, with manufacturers developing diverse visual perception modules for various scenarios. A key challenge for these robots is understanding their environment comprehensively, which is crucial for tasks like manipulation, movement, and navigation. Traditional perception methods often fall short, especially in capturing detailed 3D geometric and semantic information.
A new system called Humanoid Occupancy has been introduced to address these challenges. This generalized multimodal occupancy perception system integrates both hardware and software components, along with specialized data acquisition devices and an annotation process. The core of this system is its use of advanced multi-modal fusion techniques to generate grid-based occupancy outputs. These outputs encode not only whether a space is occupied but also what kind of object or area it is (semantic labels), providing a holistic understanding of the environment.
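To make the idea of a grid-based occupancy output concrete, here is a minimal sketch of how labeled 3D points could be turned into a semantic occupancy grid. It is not the system's actual pipeline: the function name `voxelize`, the majority-vote rule, the grid extent, and the convention that label 0 means "free" are all assumptions made for illustration.

```python
import numpy as np

FREE = 0  # assumed convention for this sketch: class id 0 marks an unoccupied voxel


def voxelize(points, labels, grid_min, voxel_size, grid_shape):
    """Turn labeled 3D points into a dense semantic occupancy grid.

    points: (N, 3) array of x, y, z coordinates in metres.
    labels: (N,) array of semantic class ids, all >= 1.
    Returns an int array of shape grid_shape; FREE marks empty voxels.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Map each point to integer voxel indices and drop points outside the grid.
    idx = np.floor((points - np.asarray(grid_min, dtype=float)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx, labels = idx[inside], labels[inside]

    # Count votes per (voxel, class); column 0 stays empty because labels >= 1.
    n_classes = int(labels.max()) + 1 if labels.size else 1
    flat = np.ravel_multi_index(tuple(idx.T), grid_shape)
    votes = np.zeros((int(np.prod(grid_shape)), n_classes), dtype=int)
    np.add.at(votes, (flat, labels), 1)

    # Majority class per voxel; voxels with no votes resolve to FREE (index 0).
    return votes.argmax(axis=1).reshape(grid_shape)


# Toy example: a 2 x 2 x 2 grid of 0.5 m voxels; the last point lies outside it.
pts = [[0.1, 0.1, 0.1], [0.15, 0.12, 0.1], [0.9, 0.9, 0.9], [5.0, 5.0, 5.0]]
lbl = [2, 2, 1, 3]
grid = voxelize(pts, lbl, grid_min=(0.0, 0.0, 0.0), voxel_size=0.5, grid_shape=(2, 2, 2))
```

Each cell of `grid` then answers both questions the article describes: whether the space is occupied (a nonzero value) and what occupies it (the class id).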
One of the significant hurdles in developing perception systems for humanoid robots is dealing with issues like kinematic interference (when the robot’s own body blocks sensors) and occlusions. Humanoid Occupancy tackles these by establishing an effective sensor layout strategy. Furthermore, the researchers have developed the first panoramic occupancy dataset specifically for humanoid robots, which serves as a valuable resource for future research and development in this field.
The system’s network architecture is designed for robust perception, incorporating multi-modal feature fusion and integrating information over time. This allows the robot to understand dynamic scenes better. Overall, Humanoid Occupancy aims to provide effective environmental perception for humanoid robots, laying the groundwork for standardized visual modules that can lead to widespread deployment of these robots in complex real-world settings.
System Design and Data Collection
The Humanoid Occupancy system was implemented on the Tienkung humanoid robot. Its sensor setup includes a modular, selectable RGB-D camera with two degrees of freedom (pitch and yaw) for manipulation and terrain perception. Additionally, six standard RGB cameras and one 40-line 360-degree omnidirectional LiDAR sensor are strategically deployed on the robot’s head. This configuration maximizes the robot’s perceptual capabilities across diverse operational scenarios while minimizing interference.
Collecting data directly on humanoid robots is costly and operationally difficult. To overcome this, the researchers used a wearable rig: human data collectors wear, on their heads, a device with the same sensor configuration as the robot. This allows efficient data collection in varied environments while keeping the collected data close to what a real robot would encounter. Collectors are required to be around 160 cm tall, and a neck stabilizer limits head shaking, which keeps the sensors stable.
Annotation and Network Architecture
The collected data is categorized into home, industrial, and outdoor scenes, with specific point-wise semantic categories defined for each. The annotation process involves marking bounding boxes for dynamic objects like pedestrians, cyclists, and vehicles. For non-rigid targets like pedestrians, special attention is given to distinguishing them from other objects through detailed point-by-point annotation. Static background points are superimposed and aligned, then combined with dynamic foreground points to create the final occupancy ground truth, which includes both occupancy status and semantic labels.
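The ground-truth assembly described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' annotation tooling: the helper names `transform` and `build_scene_points`, the use of 4x4 sensor-to-world poses, and the `(x, y, z, label)` point layout are all hypothetical.

```python
import numpy as np


def transform(points, pose):
    """Apply a 4x4 rigid transform to an (N, 3) array of points."""
    points = np.asarray(points, dtype=float)
    return points @ pose[:3, :3].T + pose[:3, 3]


def build_scene_points(static_frames, poses, dynamic_points, target_pose):
    """Merge multi-frame static points with one frame's dynamic points.

    static_frames: list of (N_i, 4) arrays with columns x, y, z, label.
    poses: list of 4x4 sensor-to-world matrices, one per static frame.
    dynamic_points: (M, 4) array of foreground points in the target frame.
    target_pose: 4x4 sensor-to-world matrix of the target frame.
    Returns a (K, 4) array in the target frame's sensor coordinates.
    """
    # Superimpose the static background from every frame in world coordinates.
    merged = []
    for frame, pose in zip(static_frames, poses):
        frame = np.asarray(frame, dtype=float)
        merged.append(np.column_stack([transform(frame[:, :3], pose), frame[:, 3]]))
    world = np.vstack(merged)

    # Express the accumulated background in the target frame.
    world_to_target = np.linalg.inv(target_pose)
    static = np.column_stack([transform(world[:, :3], world_to_target), world[:, 3]])

    # Add the dynamic foreground, which is already in the target frame.
    return np.vstack([static, np.asarray(dynamic_points, dtype=float)])


# Toy example: two frames observe the same wall point from different positions.
identity = np.eye(4)
shifted = np.eye(4)
shifted[0, 3] = 1.0  # sensor moved 1 m along x
frame_a = [[1.0, 0.0, 0.0, 1.0]]  # label 1 = static class (hypothetical id)
frame_b = [[0.0, 0.0, 0.0, 1.0]]
pedestrian = [[0.5, 0.0, 0.0, 2.0]]  # label 2 = dynamic class (hypothetical id)
scene = build_scene_points([frame_a, frame_b], [identity, shifted], pedestrian, shifted)
```

The merged points would then be voxelized to obtain the final occupancy ground truth with both occupancy status and semantic labels.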
The perception model uses a Bird’s Eye View (BEV) paradigm, widely used in autonomous driving, for feature extraction and fusion. It processes LiDAR point clouds and camera images through separate branches. LiDAR features provide strong geometric information, while camera features offer rich semantic details. These are fused using a cross-attention mechanism. The system also incorporates historical BEV features to leverage temporal information, enhancing motion awareness and occlusion reasoning. The final 3D occupancy grid is predicted from these fused features.
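As a rough illustration of the fusion step, the sketch below implements single-head cross-attention in NumPy, followed by a simple temporal mix with the previous frame's BEV features. Several choices here are assumptions rather than details from the article: LiDAR BEV cells act as queries and camera tokens as keys and values, the weights are random, and the temporal step (`temporal_fuse`, channel concatenation plus a linear projection) is a stand-in for whatever the actual network uses.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention.

    queries: (Nq, C) features, e.g. flattened LiDAR BEV cells.
    context: (Nc, C) features, e.g. camera feature tokens.
    Returns (fused, weights) with shapes (Nq, D) and (Nq, Nc).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    # Scaled dot-product similarity between every query and every context token.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights


def temporal_fuse(current, previous, w_t):
    """Concatenate current and previous BEV features, then project back to C channels."""
    return np.concatenate([current, previous], axis=-1) @ w_t


# Toy dimensions: a 4 x 4 BEV grid, 30 camera tokens, 8 channels.
rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
lidar_bev = rng.normal(size=(H * W, C))
cam_tokens = rng.normal(size=(30, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))

fused, weights = cross_attention(lidar_bev, cam_tokens, w_q, w_k, w_v)

prev_bev = rng.normal(size=(H * W, C))  # stand-in for the previous frame's BEV features
w_t = rng.normal(size=(2 * C, C)) * 0.1
bev_t = temporal_fuse(fused, prev_bev, w_t)
```

Each row of `weights` sums to one, so every BEV cell receives a weighted average of camera features. In a trained network the projection matrices would be learned, and an occupancy head would predict the 3D grid from `bev_t`.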
Performance and Future Directions
Experiments show that Humanoid Occupancy achieves superior performance in 3D semantic occupancy prediction compared to other methods, all while maintaining a lightweight architecture with fewer parameters. Ablation studies confirmed the effectiveness of its components, including a novel distortion-aware projection method for camera images, the temporal fusion module (using two frames for optimal performance), and the benefits of multi-modal fusion over single-modality approaches.
This work establishes a robust foundation for environmental perception in humanoid robots. Looking ahead, the researchers plan to expand into omnidirectional perception and mapping, utilizing state-of-the-art computer vision reconstruction techniques. Future work will also involve further expanding the dataset, refining temporal fusion strategies, and deploying the system across various humanoid platforms to advance robust and standardized visual perception in robotics. For more details, you can refer to the research paper.


