
Bridging Communication Gaps: A Hybrid AI Model for Continuous Sign Language Translation

TLDR: This research introduces a new method for continuous sign language translation that combines transformer and STGCN-LSTM architectures. The goal is to translate sign language videos directly into spoken language text without needing expensive “gloss” annotations. The approach achieved state-of-the-art performance on multiple datasets, including a new benchmark for Bangla Sign Language, making communication more accessible for deaf and hard-of-hearing individuals.

Millions of people worldwide are affected by deafness and hearing impairment, making sign language a vital means of communication. However, in societies that often prioritize spoken languages, sign language can be overlooked, leading to significant communication barriers and social exclusion for the deaf and hard-of-hearing community. Addressing this critical gap, a recent research project titled “Continuous Bangla Sign Language Translation: Mitigating the Expense of Gloss Annotation with the Assistance of Graph” introduces an innovative approach to enhance sign language translation methods.

The project, led by Rabeya Akter and Safaeid Hossain Arib under the supervision of Dr. Sejuti Rahman from the University of Dhaka, focuses on improving continuous sign language translation. While previous state-of-the-art methods have relied largely on transformer architectures alone, this research pairs the transformer with a graph-based STGCN-LSTM (Spatio-Temporal Graph Convolutional Network – Long Short-Term Memory) network. The combination has proven more effective, especially for “gloss-free” translation, where the model learns without glosses: written, sign-by-sign transcriptions that normally serve as an intermediate annotation layer.

The Challenge of Sign Language Translation

Translating sign language is complex. Sign gestures vary in length depending on the signer and their pace, and there is no direct one-to-one mapping between video frames and translated words. Because sign languages follow their own grammatical rules and word order, meaning often hinges on subtle, localized details of a gesture. Traditional transformer networks, while excellent at capturing contextual relationships, struggle to model the topology of human body joints, which is crucial in sign language.

A Hybrid Solution: Fusing Transformer and Graph Networks

To overcome these limitations, the researchers leveraged the power of STGCN to extract relationships between spatial and temporal features from the skeletal structure of the human body. By fusing both transformer and STGCN-LSTM architectures, the method aims to learn a richer, more meaningful representation that combines both contextual and spatio-temporal information. This architectural fusion is a key contribution, allowing the system to understand sign gestures at both broad and fine-grained levels.
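To make the graph idea concrete, here is a minimal sketch of how a skeletal graph over body keypoints might be built. The joint subset, edge list, and normalization below are illustrative assumptions, not the exact Mediapipe topology used in the paper.

```python
# Minimal sketch of a skeletal graph over body keypoints (illustrative joints/edges).
import numpy as np

JOINTS = ["nose", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow", "l_wrist", "r_wrist"]
EDGES = [("nose", "l_shoulder"), ("nose", "r_shoulder"),
         ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
         ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist")]

def adjacency(joints, edges):
    """Symmetric, self-loop-augmented adjacency matrix over body joints."""
    idx = {name: i for i, name in enumerate(joints)}
    A = np.eye(len(joints))              # self-loops keep each joint's own feature
    for a, b in edges:
        A[idx[a], idx[b]] = A[idx[b], idx[a]] = 1.0
    # Row-normalize so a graph convolution averages over neighbors.
    return A / A.sum(axis=1, keepdims=True)

A = adjacency(JOINTS, EDGES)  # (7, 7); a spatial graph conv multiplies features by A
```

Multiplying per-joint features by A lets each joint aggregate information from its anatomical neighbors, which is precisely the structural prior a plain transformer lacks.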

The methodology involves a two-stream encoding process. One stream uses an I3D network and a transformer encoder to process video frames, capturing contextual information. The second stream extracts keypoint features using the Mediapipe algorithm, creating a spatio-temporal graph structure. This keypoint data is then processed by an STGCN-LSTM encoder to capture the spatial and temporal dynamics of body movements. These two streams are then fused, and the combined encoding is fed into a transformer decoder to generate the translated spoken language text. This “gloss-free” approach is particularly significant because it eliminates the need for expensive gloss annotations, which typically require the expertise of sign language professionals.
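To illustrate the pipeline, here is a minimal PyTorch sketch of the two-stream encoder-decoder idea. All layer counts, feature dimensions, the placeholder adjacency, and module names are assumptions for illustration; the paper's actual configuration and the I3D/Mediapipe preprocessing steps are not reproduced here.

```python
# Minimal two-stream sketch. Dimensions, layer counts, the identity adjacency,
# and all module names are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """One spatio-temporal graph convolution: a spatial graph conv over body
    joints followed by a temporal convolution along the frame axis."""
    def __init__(self, in_ch, out_ch, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)      # (J, J) joint adjacency
        self.spatial = nn.Linear(in_ch, out_ch)   # per-joint feature transform
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1), padding=(4, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                              # x: (B, T, J, C)
        x = torch.einsum("jk,btkc->btjc", self.A, x)   # aggregate joint neighbors
        x = self.spatial(x)                            # (B, T, J, out_ch)
        x = x.permute(0, 3, 1, 2)                      # (B, out_ch, T, J)
        x = self.relu(self.temporal(x))                # convolve over time
        return x.permute(0, 2, 3, 1)                   # (B, T, J, out_ch)

class TwoStreamSLT(nn.Module):
    def __init__(self, num_joints=54, d_model=512, vocab_size=3000):
        super().__init__()
        # Stream 1: transformer encoder over precomputed I3D clip features.
        self.video_proj = nn.Linear(1024, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.video_enc = nn.TransformerEncoder(enc, num_layers=2)
        # Stream 2: STGCN over Mediapipe keypoints, then an LSTM over time.
        # A real adjacency would encode the skeleton; identity is a placeholder.
        self.stgcn = STGCNBlock(3, 64, torch.eye(num_joints))
        self.lstm = nn.LSTM(64 * num_joints, d_model, batch_first=True)
        # Transformer decoder generates spoken-language tokens.
        self.embed = nn.Embedding(vocab_size, d_model)
        dec = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, i3d_feats, keypoints, tgt_tokens):
        # i3d_feats: (B, T, 1024); keypoints: (B, T, J, 3); tgt_tokens: (B, L).
        # Both streams are assumed to be sampled to the same length T.
        ctx = self.video_enc(self.video_proj(i3d_feats))   # contextual stream
        g = self.stgcn(keypoints)                          # (B, T, J, 64)
        g, _ = self.lstm(g.flatten(2))                     # (B, T, d_model)
        memory = ctx + g                                   # summation fusion
        causal = nn.Transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1)).to(tgt_tokens.device)
        dec = self.decoder(self.embed(tgt_tokens), memory, tgt_mask=causal)
        return self.out(dec)                               # (B, L, vocab_size)
```

During training the decoder is teacher-forced on the reference translation, as sketched; at inference it would run autoregressively, feeding back its own predicted tokens.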

Impressive Performance Across Diverse Datasets

The effectiveness of the new method was tested on several diverse sign language datasets: RWTH-PHOENIX-2014T (German Sign Language), CSL-Daily (Chinese Sign Language), How2Sign (American Sign Language), and BornilDB v1.0 (Bangla Sign Language). The method outperformed existing translation results across all of them, achieving notable improvements in BLEU-4 scores over previous models on RWTH-PHOENIX-2014T, CSL-Daily, and How2Sign. Crucially, the research also establishes the first benchmark on the BornilDB v1.0 dataset, setting a standard for future work in Bangla Sign Language translation.

An ablation study guided the model's design, exploring the optimal number of STGCN layers, transformer encoder and decoder layers, and LSTM layers, as well as different strategies for fusing the two streams. Summation-based fusion proved the most effective way to combine their strengths, as illustrated below.
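For a concrete sense of what such fusion strategies look like, here is a small, hypothetical comparison on two stream encodings of matching shape; the batch size, sequence length, and dimensions are assumed.

```python
# Hypothetical fusion comparison; batch size, length, and dimensions are assumed.
import torch
import torch.nn as nn

ctx = torch.randn(2, 100, 512)  # transformer-stream encoding (B, T, D)
g = torch.randn(2, 100, 512)    # STGCN-LSTM-stream encoding (B, T, D)

fused_sum = ctx + g                          # summation: parameter-free, keeps D fixed
fused_cat = torch.cat([ctx, g], dim=-1)      # concatenation: doubles D ...
fused_cat = nn.Linear(1024, 512)(fused_cat)  # ... so it needs a projection back to D
```

One plausible reason summation wins is that it adds no parameters and keeps the decoder's input dimension unchanged, forcing both streams into a shared representation space.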


Towards a More Inclusive Future

The developed system includes a practical demonstration across German, Chinese, American, and Bangla Sign Languages: users select a language, choose a recorded video, and the system plays the video while displaying both the reference and the predicted translation. For German and Chinese, an English translation is also shown for easier comparison.

This research marks a significant step towards creating automated sign language interpretation systems. By enhancing inclusivity and accessibility for the deaf and hard-of-hearing community, especially in regions like Bangladesh where Bangla Sign Language is vital, this work holds immense potential. While the task of sign language translation is still evolving, this project sets a benchmark for future research, emphasizing the importance of gloss-free translation to improve communication accessibility. The researchers also plan to develop a more lightweight version of the system for mobile devices, further expanding its real-world utility. For more details, you can read the full research paper here.

Rhea Bhattacharya
