TLDR: DSLNet is a new AI model for Isolated Sign Language Recognition (ISLR) that significantly improves accuracy by analyzing hand shape and motion trajectory separately using a dual-stream architecture. It employs wrist-centric and facial-centric reference frames, specialized networks for each, and a geometry-driven optimal transport fusion method. DSLNet achieves state-of-the-art results on WLASL and LSA64 datasets with high efficiency and robustness, making it a practical solution for bridging communication gaps.
Understanding sign language is crucial for bridging communication gaps for hearing-impaired individuals. However, a significant challenge in Isolated Sign Language Recognition (ISLR) has been distinguishing between gestures that look similar but have different meanings, often due to the complex interplay of hand shape and movement.
A new research paper introduces Dual-SignLanguageNet (DSLNet), a novel AI architecture designed to overcome these ambiguities. DSLNet takes a unique approach by separating and modeling hand morphology (shape) and motion trajectory in distinct, yet complementary, ways.
The core innovation of DSLNet lies in its dual-reference, dual-stream architecture. Instead of relying on a single viewpoint, it processes information through two specialized streams:
Wrist-Centric Frame for Shape Analysis
To understand the intrinsic shape of the hand, DSLNet uses a wrist-centric frame. This means the hand joints are normalized relative to the wrist, creating a representation of the hand’s morphology that remains consistent regardless of the viewing angle. This stream is processed by a Topology-aware Spatiotemporal Network (TSSN), which uses dynamic graph convolutions to extract multi-scale shape features.
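The wrist-centric normalization described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the joint layout, the choice of the wrist as index 0, and the mean-distance scale factor are assumptions made here for clarity.

```python
import numpy as np

def wrist_centric_normalize(hand_joints, wrist_index=0):
    """Normalize hand joints relative to the wrist.

    hand_joints: (T, J, 3) array of joint coordinates over T frames.
    Translation: subtract the wrist position in every frame.
    Scale: divide by the mean joint-to-wrist distance, so the result
    is invariant to hand size and camera distance.
    """
    wrist = hand_joints[:, wrist_index:wrist_index + 1, :]   # (T, 1, 3)
    centered = hand_joints - wrist
    # Per-frame scale: average distance of all joints from the wrist
    scale = np.linalg.norm(centered, axis=-1).mean(axis=-1, keepdims=True)
    scale = np.maximum(scale, 1e-8)  # guard against degenerate frames
    return centered / scale[..., None]
```

Because the frame is anchored at the wrist and scaled per frame, the same hand shape seen from a different position or distance maps to the same representation, which is what lets the shape stream ignore viewpoint.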
Facial-Centric Frame for Trajectory Modeling
For capturing the hand’s movement, especially its spatial relationship to the body, a facial-centric frame is employed. The wrist’s position is normalized with respect to key facial landmarks, providing crucial context for the gesture’s trajectory. This stream utilizes a Finsler Trajectory Dynamics Encoder (FTDE), which models direction-sensitive dynamics and emphasizes key moments in the gesture’s execution, like changes in direction or speed.
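A facial-centric trajectory can be sketched similarly. The specific choices below (nose tip as origin, inter-ocular distance as the scale, frame-to-frame velocity as the dynamics signal) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def facial_centric_trajectory(wrist_pos, nose, left_eye, right_eye):
    """Express the wrist trajectory in a face-anchored reference frame.

    wrist_pos, nose, left_eye, right_eye: (T, 3) landmark tracks.
    Origin: nose tip. Scale: inter-ocular distance, which is stable
    across signers and camera distances.
    Returns the normalized trajectory and its per-frame velocity,
    which highlights changes in direction and speed.
    """
    scale = np.linalg.norm(left_eye - right_eye, axis=-1, keepdims=True)
    scale = np.maximum(scale, 1e-8)
    traj = (wrist_pos - nose) / scale
    # Velocity: difference between consecutive frames (first frame = 0)
    velocity = np.diff(traj, axis=0, prepend=traj[:1])
    return traj, velocity
```

Anchoring to the face encodes where the hand moves relative to the body, which is exactly the contextual cue that distinguishes signs with identical hand shapes performed at different locations.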
These two specialized streams are then integrated using a geometry-driven optimal transport fusion mechanism. This advanced fusion method ensures that the shape and motion features are semantically aligned, leading to a more comprehensive understanding of the sign.
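To make the fusion idea concrete, here is a minimal sketch of entropy-regularized optimal transport solved with Sinkhorn iterations, a standard way to compute such an alignment. The feature shapes, the cosine cost, and the final mixing step are assumptions for illustration, not DSLNet's actual fusion module.

```python
import numpy as np

def sinkhorn_alignment(shape_feats, motion_feats, eps=0.1, n_iters=50):
    """Soft-align shape tokens to motion tokens via entropic OT.

    shape_feats: (N, D), motion_feats: (M, D).
    Cost: cosine distance between tokens. Sinkhorn iterations produce
    a transport plan P (N, M) whose rows give soft correspondences,
    used here to mix aligned motion features into the shape features.
    """
    s = shape_feats / np.linalg.norm(shape_feats, axis=1, keepdims=True)
    m = motion_feats / np.linalg.norm(motion_feats, axis=1, keepdims=True)
    cost = 1.0 - s @ m.T                       # (N, M) cosine distance
    K = np.exp(-cost / eps)                    # Gibbs kernel
    a = np.full(shape_feats.shape[0], 1.0 / shape_feats.shape[0])
    b = np.full(motion_feats.shape[0], 1.0 / motion_feats.shape[0])
    u = np.ones_like(a)
    for _ in range(n_iters):                   # alternate marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]            # transport plan
    fused = shape_feats + (P / P.sum(axis=1, keepdims=True)) @ motion_feats
    return fused, P
```

The transport plan's marginals match the two streams' token distributions, so every shape token is explained by motion tokens and vice versa; this is the sense in which the fusion keeps the two feature sets semantically aligned.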
DSLNet has demonstrated impressive results, setting new state-of-the-art performance on challenging datasets: 93.70% accuracy on WLASL-100, 89.97% on WLASL-300, and 99.79% on LSA64. Remarkably, it achieves this accuracy with significantly fewer parameters than competing models; for instance, it uses 12.8 times fewer parameters than Uni-Sign.
The model is also designed for real-world deployment, boasting high computational efficiency with low FLOPs and an average inference time of 17.98ms per sample on an RTX 4090 GPU, well within real-time processing requirements. Furthermore, DSLNet exhibits superior robustness to frame dropout, a common issue in real-world data, maintaining high accuracy even with significant data loss.
This work highlights the importance of multi-reference geometric modeling in sign language recognition, offering a robust and practical solution for real-world ISLR applications. For more technical details, you can refer to the full research paper here.


