TLDR: The paper introduces Grasp-HGN, a novel approach that improves robotic prosthetic hand control by enabling prostheses to grasp previously unseen objects. It defines “semantic projection” to describe this generalization capability and proposes Grasp-LLaVA, a vision-language model that applies human-like reasoning to grasp estimation, achieving 50.2% accuracy on unseen objects. To overcome latency issues, Grasp-HGN employs a hybrid edge-cloud infrastructure that combines a fast edge model with an accurate cloud model and dynamically switches between them based on confidence. This system significantly boosts accuracy and speed, and a new “User Upsetness Index” shows improved user experience in real-world scenarios.
For individuals with transradial amputations, robotic prosthetic hands hold immense promise for regaining the ability to perform daily activities. However, a significant challenge remains: current grasp models struggle to adapt to the vast variety of objects encountered in the real world, especially those not included in their training datasets. This limitation severely impacts users’ independence and quality of life.
A recent research paper, Grasp-HGN: Grasping the Unexpected, addresses this critical issue by introducing innovative solutions to enhance the robustness and generalizability of prosthetic hand control.
Understanding the Challenge: Semantic Projection
The researchers define a crucial concept called ‘semantic projection.’ This refers to a model’s ability to generalize to entirely new, unseen object types. They found that conventional models, despite achieving high accuracy (around 80%) on familiar objects during training, perform poorly—dropping to as low as 15% accuracy—when faced with objects they haven’t encountered before. This highlights a fundamental gap in how these models understand and apply grasping logic beyond their predefined datasets.
Introducing Grasp-LLaVA: Human-like Reasoning for Grasping
To overcome this limitation, the paper proposes Grasp-LLaVA, a Grasp Vision Language Model. Inspired by how humans reason, Grasp-LLaVA infers the most suitable grasp type based on an object’s physical characteristics, such as its shape and size. By leveraging a Vision Language Model (VLM) and incorporating text-based reasoning during its training, Grasp-LLaVA significantly improves accuracy on unseen object types, achieving an impressive 50.2% accuracy compared to 36.7% for state-of-the-art grasp estimation models. This marks a substantial step towards real-world applicability.
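Conceptually, Grasp-LLaVA maps an image plus a reasoning-style text prompt onto a discrete grasp type. The snippet below is only a minimal sketch of that interface, assuming a hypothetical VLM wrapper with a `generate` method and an illustrative grasp taxonomy; the paper's actual architecture, prompts, and grasp categories differ in detail.

```python
# Illustrative sketch of a VLM-based grasp estimator.
# The grasp taxonomy and the `vlm.generate` call are assumptions, not the paper's API.

GRASP_TYPES = ["power", "precision", "lateral", "tripod"]  # example taxonomy only

PROMPT = (
    "Describe the object's shape and size, then choose the most suitable "
    f"grasp type from: {', '.join(GRASP_TYPES)}."
)

def estimate_grasp(vlm, image):
    """Query a vision-language model with a reasoning-style prompt and
    map its free-text answer back to a discrete grasp label."""
    answer = vlm.generate(image=image, prompt=PROMPT)  # hypothetical VLM call
    for grasp in GRASP_TYPES:
        if grasp in answer.lower():
            return grasp
    return None  # no known grasp type named in the answer
```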
Bridging the Performance-Latency Gap with Hybrid Grasp Network (HGN)
While Grasp-LLaVA offers superior accuracy, its large size and computational demands pose a challenge for deployment on compact, power-limited edge devices typically found in prosthetics. To address this ‘performance-latency gap,’ the researchers introduce the Hybrid Grasp Network (HGN).
HGN is an intelligent edge-cloud deployment infrastructure. It combines a fast, specialized model running on the edge device (like a small computer within the prosthetic hand) with the highly accurate Grasp-LLaVA deployed in the cloud. An HGN controller dynamically decides whether to use the quick edge model’s prediction or offload the task to the more powerful cloud model as a fail-safe, based on the edge model’s confidence in its prediction. This dynamic switching mechanism is enhanced by ‘confidence calibration,’ ensuring the edge model’s confidence scores are reliable.
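As a rough illustration of this controller logic, the sketch below checks the edge model's calibrated confidence against a threshold and offloads to the cloud model otherwise. The `edge_model` and `cloud_model` callables, the threshold value, and the use of temperature scaling for calibration are assumptions made for illustration, not the paper's exact implementation.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff; tuned per deployment in practice

def calibrate(logits, temperature=1.5):
    """Temperature scaling, a common confidence-calibration technique.
    The paper's exact calibration method may differ."""
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

def hgn_predict(image, edge_model, cloud_model):
    """Hybrid Grasp Network controller sketch: trust the fast on-device model
    when its calibrated confidence is high, otherwise fall back to the cloud."""
    logits = edge_model(image)              # fast, on-device inference
    probs = calibrate(logits)
    grasp, confidence = int(probs.argmax()), float(probs.max())
    if confidence >= CONFIDENCE_THRESHOLD:
        return grasp                        # low-latency edge decision
    return cloud_model(image)               # fail-safe: accurate but slower Grasp-LLaVA
```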
Real-World Performance and User Experience
The results demonstrate HGN’s effectiveness. With confidence calibration, HGN improves semantic projection accuracy by 5.6 percentage points (to 42.3%) while running 3.5 times faster on unseen object types. In a real-world scenario mixing seen and unseen objects, HGN reaches an average accuracy of 86% (a 12.2-point gain over an edge-only model) and is 2.2 times faster than Grasp-LLaVA alone.
To evaluate the system from a user’s perspective, the researchers introduced the ‘User Upsetness Index’ (UUI). This metric quantifies user dissatisfaction by penalizing incorrect or delayed grasp decisions. HGN, particularly when calibrated, significantly reduces the UUI, indicating a more satisfactory and reliable user experience. For instance, with an optimal configuration, HGN (DC) achieves an overall accuracy of 86%, an average latency of 117.8 milliseconds, and a UUI of 1.12, showcasing a balanced improvement across accuracy, speed, and user satisfaction.
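The paper's exact formula for the UUI is not reproduced here, but the general idea of penalizing wrong and slow grasp decisions can be sketched as a simple per-attempt score; the weights and latency budget below are purely illustrative assumptions.

```python
def user_upsetness(correct, latency_ms, latency_budget_ms=200.0,
                   wrong_penalty=1.0, delay_weight=0.5):
    """Illustrative per-grasp upsetness score: penalize incorrect predictions
    and any latency beyond a tolerable budget. All weights here are
    assumptions, not the paper's definition."""
    error_term = 0.0 if correct else wrong_penalty
    delay_term = delay_weight * max(0.0, latency_ms - latency_budget_ms) / latency_budget_ms
    return error_term + delay_term

# A session-level index could then average the score over all grasp attempts:
# uui = sum(user_upsetness(c, t) for c, t in attempts) / len(attempts)
```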
This research lays a strong foundation for developing prosthetic hands that can truly ‘grasp the unexpected,’ bringing us closer to highly functional and adaptable robotic prosthetics for daily living.


