TLDR: This research introduces the first benchmark dataset for Nepali Sign Language (NSL), comprising 36 gesture classes with 1,500 samples each, collected with plain and random backgrounds. It evaluates deep learning models (MobileNetV2 and ResNet50) using transfer learning and fine-tuning for NSL character recognition. MobileNetV2 achieved a higher classification accuracy of 90.45% compared to ResNet50’s 88.78%, demonstrating its effectiveness in low-resource settings. The study also proposes a real-time video-based recognition system, laying a foundation for assistive technologies for NSL users.
Communication is a fundamental human right, yet for individuals with hearing and speech impairments, especially in regions with under-resourced sign languages, it presents significant challenges. In Nepal, where tens of thousands rely on Nepali Sign Language (NSL) for daily communication, there has been a notable absence of digital linguistic datasets and computational tools to support its recognition and use.
A recent study titled “Nepali Sign Language Characters Recognition: Dataset Development and Deep Learning Approaches” addresses this gap. Authored by Birat Poudel, Satyam Ghimire, Sijan Bhattarai, Saurav Bhandari, and Suramya Sharma Dahal, it introduces the first benchmark dataset for NSL, paving the way for assistive technologies and further research in this area.
Building the Foundation: The NSL Dataset
The cornerstone of this research is the creation of a custom dataset specifically designed for Nepali Sign Language character recognition. This comprehensive dataset features 36 distinct NSL gesture classes, with each class containing 1,500 samples. To ensure the models could perform well in various real-world scenarios, the images were collected under two different background conditions:
- Plain Background: 1,000 images per character against uniform, clean backgrounds for controlled learning.
- Random Background: 500 images per character against varied, realistic backgrounds to enhance model robustness.
This dual-background approach resulted in a dataset of 54,000 images, giving the deep learning models both controlled and varied examples of every character. The data was preprocessed into TensorFlow’s TFRecord format so it could be read efficiently during training.
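The dataset totals follow directly from the per-class counts given above. The short check below only restates the article's figures; the variable names are illustrative.

```python
# Per-class image counts as reported in the article.
NUM_CLASSES = 36
PLAIN_PER_CLASS = 1_000
RANDOM_PER_CLASS = 500

samples_per_class = PLAIN_PER_CLASS + RANDOM_PER_CLASS
plain_total = NUM_CLASSES * PLAIN_PER_CLASS
random_total = NUM_CLASSES * RANDOM_PER_CLASS
total_images = plain_total + random_total

print(f"samples per class: {samples_per_class}")   # 1500
print(f"plain-background images: {plain_total}")   # 36000
print(f"random-background images: {random_total}") # 18000
print(f"total images: {total_images}")             # 54000
```

One third of the images therefore come from the random-background condition.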
Leveraging Deep Learning for Recognition
To evaluate recognition performance on their new dataset, the researchers employed two popular pre-trained Convolutional Neural Network (CNN) architectures: MobileNetV2 and ResNet50. These models, initially trained on the ImageNet dataset, were adapted to the 36-class NSL classification task through transfer learning followed by fine-tuning.
The training process involved a progressive two-phase strategy:
- Phase 1 (Frozen Base Model Training): The core convolutional layers of the pre-trained models were kept frozen, and only the newly added classification layers were trained. This let the new layers learn to map the existing ImageNet features to the 36 NSL classes without disturbing the pre-trained weights.
- Phase 2 (Partial Fine-Tuning): Selected deeper layers of the base models were unfrozen and trained with a reduced learning rate. This fine-tuning step enabled the models to adapt more precisely to the nuances of NSL gestures while retaining the powerful representations learned from ImageNet.
Both phases used the Adam optimizer with Sparse Categorical Cross-Entropy as the loss function; the paper specifies the learning rate and batch size used in each phase.
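The two-phase strategy can be sketched in Keras roughly as follows. This is a minimal illustration, not the authors' code: the learning rates, the number of unfrozen layers, the epoch counts, the single dense head, and all function names are assumptions, since the article does not state them. It also assumes the datasets yield images already scaled for MobileNetV2, paired with integer class labels.

```python
import tensorflow as tf

NUM_CLASSES = 36  # from the article

# Hypothetical values: the article does not state them.
PHASE1_LR = 1e-3
PHASE2_LR = 1e-5
UNFREEZE_LAST_N = 30


def build_model(weights="imagenet", input_shape=(224, 224, 3)):
    """Return (model, base): a frozen MobileNetV2 backbone plus a new 36-way head."""
    base = tf.keras.applications.MobileNetV2(
        include_top=False, weights=weights, input_shape=input_shape, pooling="avg"
    )
    base.trainable = False  # Phase 1: backbone stays frozen
    inputs = tf.keras.Input(shape=input_shape)
    # training=False keeps BatchNorm in inference mode, even after unfreezing.
    features = base(inputs, training=False)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(features)
    return tf.keras.Model(inputs, outputs), base


def compile_for_phase(model, learning_rate):
    """Adam + sparse categorical cross-entropy, as described for both phases."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )


def unfreeze_last_layers(base, n):
    """Phase 2: make only the last n backbone layers trainable."""
    base.trainable = True
    for layer in base.layers[:-n]:
        layer.trainable = False


def train_two_phase(model, base, train_ds, val_ds, epochs=(10, 10)):
    # Phase 1: train only the new classification head.
    compile_for_phase(model, PHASE1_LR)
    model.fit(train_ds, validation_data=val_ds, epochs=epochs[0])
    # Phase 2: partial fine-tuning at a reduced learning rate.
    unfreeze_last_layers(base, UNFREEZE_LAST_N)
    compile_for_phase(model, PHASE2_LR)  # recompile after changing trainable flags
    model.fit(train_ds, validation_data=val_ds, epochs=epochs[1])
```

Swapping `tf.keras.applications.MobileNetV2` for `tf.keras.applications.ResNet50` would give the corresponding ResNet50 variant; each architecture expects its own input preprocessing.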
Key Findings and Performance
MobileNetV2 outperformed ResNet50 in recognizing Nepali Sign Language characters, achieving a classification accuracy of 90.45% against ResNet50’s 88.78%, a gap of 1.67 percentage points. This outcome is notable because MobileNetV2 is a lightweight architecture with far fewer parameters than the deeper ResNet50.
The researchers suggest that MobileNetV2’s efficiency in capturing localized spatial and structural features, crucial for distinguishing hand gestures, made it more effective in this low-resource setting. Its design helps reduce the risk of overfitting on medium-scale datasets like the NSL dataset. In contrast, ResNet50’s deeper architecture, while powerful for highly complex datasets, might have extracted redundant features that didn’t contribute as effectively to classifying the relatively simpler gesture images, potentially leading to reduced generalization.
The study also describes a real-time recognition pipeline. It takes a continuous video stream of hand gestures, samples frames, preprocesses them, and classifies each one. A sliding window with majority voting over recent predictions stabilizes the output, including during transitions between gestures.
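The sliding-window majority vote can be illustrated with a small, framework-free sketch. The class name, window size, and minimum-vote threshold below are assumptions made for illustration; the article only states that a sliding window with majority voting is used.

```python
from collections import Counter, deque


class MajorityVoteSmoother:
    """Smooth per-frame predictions with a sliding-window majority vote."""

    def __init__(self, window_size=15, min_votes=8):
        # Hypothetical defaults: the article does not give these values.
        self.window = deque(maxlen=window_size)
        self.min_votes = min_votes

    def update(self, label):
        """Add one frame's predicted label and return the stable label, or None."""
        self.window.append(label)
        winner, votes = Counter(self.window).most_common(1)[0]
        # Report a label only when enough frames in the window agree.
        return winner if votes >= self.min_votes else None


# Per-frame labels with one noisy frame, followed by a change of gesture.
frames = ["ka", "ka", "kha", "ka", "ka", "ga", "ga", "ga"]
smoother = MajorityVoteSmoother(window_size=5, min_votes=3)
stable = [smoother.update(label) for label in frames]
print(stable)
# [None, None, None, 'ka', 'ka', 'ka', None, 'ga']
```

The isolated "kha" frame never reaches the output, and during the switch from "ka" to "ga" the smoother reports nothing until the new gesture holds a majority of the window.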
Looking Ahead
This study marks a significant milestone in the field of Nepali Sign Language recognition. By providing the first benchmark dataset and demonstrating the effectiveness of deep learning models, particularly MobileNetV2, it lays a strong foundation for future advancements. The researchers propose several avenues for future work, including expanding the dataset with more gesture classes and samples, exploring more advanced neural network architectures, optimizing the system for mobile and edge devices, and incorporating additional modalities like facial expressions or body posture to enhance recognition accuracy.
This pioneering effort not only contributes valuable resources for NSL but also highlights the potential of transfer learning and fine-tuning to advance research in other under-explored sign languages worldwide.


