TLDR: XGrasp is a new real-time robotic grasping framework that lets robots use many types of grippers rather than just one. It overcomes data limitations by deriving multi-gripper training data from existing datasets. Its two-stage design, with a Grasp Point Predictor and an Angle-Width Predictor, delivers both speed and accuracy, and a contrastive learning technique lets the system adapt to new, unseen grippers. Experiments show XGrasp achieves high success rates across different grippers and environments, runs significantly faster than previous methods, and can integrate with advanced vision models.
Robots are becoming increasingly common in various industries, performing tasks from assembly to handling delicate objects. A fundamental capability for any robot is grasping, but most existing robotic grasping systems are designed for a single type of gripper. This limitation restricts their flexibility in real-world situations where different tasks and objects require diverse end-effectors, such as parallel-jaw grippers for strong, fast handling or multi-finger hands for more complex shapes.
Addressing this challenge, researchers Yeonseo Lee, Jungwook Mun, Hyosup Shin, Guebin Hwang, Junhee Nam, Taeyeop Lee, and Sungho Jo have introduced XGrasp, a novel framework for gripper-aware grasp detection. XGrasp is designed to efficiently handle multiple gripper configurations in real-time, making robots more versatile and adaptable.
Overcoming Data Scarcity
One of the biggest hurdles in developing unified models for diverse grippers is the lack of comprehensive datasets. Existing datasets often focus on single gripper types, primarily two-finger parallel-jaw grippers. XGrasp tackles this by proposing a systematic method to augment existing datasets with multi-gripper annotations. This involves reinterpreting and extending current labels by considering the unique physical constraints and grasping characteristics of various grippers, such as finger span and jaw configuration. This process generates rich datasets suitable for training models that understand different gripper types.
The data augmentation process also introduces a compact way to describe grippers to the model. Instead of complex 3D models, XGrasp uses a two-channel input: a Gripper Mask (the gripper's static shape) and a Gripper Path (the region swept as the fingers close). This balances efficiency and expressiveness. Grasp feasibility for each gripper is then evaluated with a "Graspability Decision Rule," which checks for collisions, path intersections with the object, and grasp stability.
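To make this concrete, here is a minimal sketch (not the authors' code) of the two-channel gripper representation and a simplified feasibility check, assuming everything operates on 2D binary grids; the function names and the contact-size threshold are illustrative assumptions.

```python
import numpy as np

def make_gripper_input(finger_mask: np.ndarray, closing_path: np.ndarray) -> np.ndarray:
    """Stack the static Gripper Mask and the dynamic Gripper Path into a 2-channel map.

    finger_mask:  HxW binary map of the gripper footprint at its open pose.
    closing_path: HxW binary map of the region swept by the fingers while closing.
    """
    return np.stack([finger_mask, closing_path], axis=0)  # shape (2, H, W)

def graspable(object_mask: np.ndarray,
              finger_mask: np.ndarray,
              closing_path: np.ndarray,
              min_contact_px: int = 20) -> bool:
    """Simplified stand-in for the Graspability Decision Rule: keep a grasp only if
    (1) the open gripper does not collide with the object,
    (2) the closing path actually intersects the object, and
    (3) the contact region is large enough to suggest a stable grasp (a crude proxy)."""
    no_collision = not np.any(finger_mask & object_mask)                   # rule 1
    path_hits_object = np.any(closing_path & object_mask)                  # rule 2
    stable_contact = np.sum(closing_path & object_mask) >= min_contact_px  # rule 3
    return bool(no_collision and path_hits_object and stable_contact)
```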
A Two-Stage, Real-Time Architecture
XGrasp employs a hierarchical two-stage architecture to achieve both speed and accuracy. The first stage is the Grasp Point Predictor (GPP), which uses global scene information and gripper specifications to identify optimal grasp locations. The GPP takes an RGB-D image of the scene along with the gripper mask and path as input, outputting a heatmap indicating suitable grasping positions. It’s built on a U-Net architecture, effectively combining scene and gripper features.
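The sketch below shows one plausible interface for this first stage, assuming (this is not taken from the paper's code) that the RGB-D image and the two gripper channels are simply concatenated and passed to a U-Net-style network that outputs a per-pixel grasp-quality heatmap.

```python
import torch
import torch.nn as nn

class GraspPointPredictor(nn.Module):
    def __init__(self, unet: nn.Module):
        super().__init__()
        self.unet = unet  # any U-Net with 6 input channels and 1 output channel

    def forward(self, rgbd: torch.Tensor, gripper: torch.Tensor) -> torch.Tensor:
        # rgbd:    (B, 4, H, W)  RGB + depth
        # gripper: (B, 2, H, W)  Gripper Mask + Gripper Path
        x = torch.cat([rgbd, gripper], dim=1)   # (B, 6, H, W) combined scene + gripper input
        return torch.sigmoid(self.unet(x))      # (B, 1, H, W) heatmap of grasp quality in [0, 1]

def top_k_grasp_points(heatmap: torch.Tensor, k: int = 5):
    """Pick the k highest-scoring pixels as candidate grasp locations."""
    flat = heatmap.flatten(start_dim=1)         # (B, H*W)
    scores, idx = flat.topk(k, dim=1)
    h, w = heatmap.shape[-2:]
    coords = torch.stack([idx // w, idx % w], dim=-1)  # (B, k, 2) as (row, col)
    return coords, scores
```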
The second stage is the Angle-Width Predictor (AWP). This module refines the grasp angle and width using local features around the grasp points identified by the GPP. A key innovation here is the use of contrastive learning, which allows the AWP to learn fundamental grasping characteristics. This enables XGrasp to generalize to unseen grippers without needing specific prior training for them—a capability known as zero-shot generalization. The AWP uses a Siamese network architecture, learning to distinguish between successful and failed grasps in an embedding space.
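As a hedged illustration of the contrastive idea, the snippet below shows a classic margin-based contrastive loss over pairs of grasp embeddings produced by a shared (Siamese) encoder; the names and the margin value are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor,
                     emb_b: torch.Tensor,
                     same_outcome: torch.Tensor,
                     margin: float = 1.0) -> torch.Tensor:
    """Margin-based contrastive loss over pairs of grasp embeddings.

    emb_a, emb_b:  (B, D) embeddings from the shared Siamese encoder.
    same_outcome:  (B,) 1.0 if both grasps in a pair have the same outcome
                   (e.g. both succeed), 0.0 otherwise.
    """
    dist = F.pairwise_distance(emb_a, emb_b)                # (B,) Euclidean distances
    pull = same_outcome * dist.pow(2)                       # similar pairs: shrink distance
    push = (1 - same_outcome) * F.relu(margin - dist).pow(2)  # dissimilar pairs: enforce margin
    return 0.5 * (pull + push).mean()
```

Because the loss shapes the embedding space around grasp outcomes rather than around any particular gripper, a new gripper's grasps can be scored in that space without retraining, which is the intuition behind the zero-shot behavior described above.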
Performance and Integration
The experimental results for XGrasp are impressive. On the Jacquard dataset, it achieved a superior average success rate of 90.3%, significantly outperforming existing methods. Crucially, XGrasp also delivered substantial improvements in inference speed, running over 10 times faster than some other gripper-aware methods, which makes it suitable for real-time applications.
Simulation and real-world experiments further validated XGrasp’s capabilities. In simulations using various grippers and objects from the YCB Object dataset, XGrasp achieved the highest average success rate of 81.8%. Real-world tests with an ABB IRB 14000 Yumi robot and different gripper types also showed leading performance, with an average success rate of 88.0%.
The modular design of XGrasp also allows for seamless integration with vision foundation models like FastSAM, SAM, and Grounded SAM, opening pathways for future vision-language capabilities in robotic grasping. This means robots could potentially understand natural language instructions for grasping tasks.
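One plausible integration pattern, offered here as an assumption rather than the paper's implementation, is for the foundation model (for example Grounded SAM queried with a text prompt) to return a binary mask of the requested object, which then simply gates the Grasp Point Predictor's heatmap so only grasps on that object are considered.

```python
import torch

def gate_heatmap_with_mask(heatmap: torch.Tensor, object_mask: torch.Tensor) -> torch.Tensor:
    # heatmap:     (B, 1, H, W) grasp-quality scores from the Grasp Point Predictor
    # object_mask: (B, 1, H, W) binary mask from a segmentation model (1 = target object)
    return heatmap * object_mask  # zero out grasp candidates outside the requested object
```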
While currently focused on planar grasping due to dataset constraints, the researchers plan to extend XGrasp to 6-DOF multi-gripper datasets and develop a comprehensive gripper-aware grasp detection model in 3D space in the future. This work marks a significant step towards more adaptable and intelligent robotic manipulation systems. You can read the full research paper here.


