TLDR: AutoSign is a novel decoder-only transformer model that directly translates continuous sign language pose sequences into natural language text, bypassing traditional multi-stage alignment methods. Developed by researchers at Carnegie Mellon University Africa, it uses 1D CNNs for pose compression and a pre-trained Arabic GPT-2 model for text generation. Evaluated on the Isharah-1000 dataset, AutoSign achieves state-of-the-art performance by focusing on body and hand gestures, offering a more robust and efficient solution for sign language recognition.
Communication is a fundamental human right, yet for the deaf and hard-of-hearing community, who comprise about 5.5% of the global population, effective interaction can be challenging due to the complexities of sign languages. Unlike spoken languages, sign languages convey meaning through a rich combination of hand gestures, facial expressions, and body movements, forming intricate visual-spatial languages with unique grammar and vocabulary. This inherent complexity often limits effective communication between signers and those unfamiliar with sign language, highlighting the critical need for advanced automated translation systems.
Traditional Continuous Sign Language Recognition (CSLR) systems have typically relied on multi-stage pipelines. These methods first extract visual features, then align variable-length sign sequences with intermediate representations called ‘glosses’ using techniques like Connectionist Temporal Classification (CTC) or Hidden Markov Models (HMMs). While functional, these alignment-based approaches often suffer from several drawbacks: error propagation across stages, a tendency to overfit, and difficulty scaling to larger vocabularies, because the intermediate gloss representation becomes a bottleneck.
Introducing AutoSign: A Direct Approach to Sign Language Translation
Addressing these limitations, researchers Samuel Ebimobowei Johnny, Blessed Guda, Andrew Blayama Stephen, and Assane Gueye from Carnegie Mellon University Africa have proposed a novel system called AutoSign. This innovative approach bypasses the traditional multi-stage pipeline entirely, offering a direct translation of sign language pose sequences into natural language text. AutoSign is an autoregressive decoder-only transformer, meaning it directly maps visual features to text without the need for intermediate gloss supervision or complex alignment mechanisms.
The core innovation of AutoSign lies in its end-to-end autoregressive generation. Inspired by the success of decoder-only models in natural language processing, AutoSign leverages a pre-trained Arabic decoder, AraGPT2, to generate text (glosses) directly from pose inputs. This allows the model to learn textual dependencies within the glosses directly, simplifying the overall CSLR pipeline.
How AutoSign Works
AutoSign’s architecture is designed for efficiency and accuracy. It begins with a temporal compression module that uses 1D Convolutional Neural Networks (CNNs) to efficiently process long pose sequences, preserving crucial temporal dynamics while reducing computational overhead. These compressed pose embeddings are then fed into the AraGPT2 decoder backbone, which, combined with gloss token embeddings, generates the target Arabic text autoregressively.
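To make the data flow concrete, here is a minimal PyTorch sketch of this pipeline. The two-layer 1D CNN and the AraGPT2 decoder follow the description above, but the keypoint count, kernel sizes, stride, and the choice of the publicly released `aubmindlab/aragpt2-base` checkpoint are illustrative assumptions rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class AutoSignSketch(nn.Module):
    """Two-layer 1D-CNN temporal compressor feeding a pre-trained AraGPT2 decoder."""

    def __init__(self, pose_feats: int = 137 * 2, stride: int = 2):
        super().__init__()
        # Hypothetical checkpoint choice; the paper uses AraGPT2 as the decoder backbone.
        self.decoder = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-base")
        d_model = self.decoder.config.n_embd  # 768 for the base checkpoint
        # Temporal compression: downsample the frame axis while keeping local motion cues.
        self.compress = nn.Sequential(
            nn.Conv1d(pose_feats, d_model, kernel_size=5, stride=stride, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=stride, padding=2),
            nn.GELU(),
        )

    def forward(self, poses: torch.Tensor, gloss_ids: torch.Tensor) -> torch.Tensor:
        # poses: (batch, frames, pose_feats); Conv1d expects (batch, channels, frames).
        pose_emb = self.compress(poses.transpose(1, 2)).transpose(1, 2)
        # Prefix the compressed pose embeddings to the gloss token embeddings so the
        # decoder conditions on the visual context while generating text autoregressively.
        tok_emb = self.decoder.transformer.wte(gloss_ids)
        inputs = torch.cat([pose_emb, tok_emb], dim=1)
        # Supervise only the text positions; label -100 tells the loss to skip pose positions.
        ignore = torch.full(pose_emb.shape[:2], -100, dtype=torch.long, device=gloss_ids.device)
        labels = torch.cat([ignore, gloss_ids], dim=1)
        return self.decoder(inputs_embeds=inputs, labels=labels).loss
```

At inference time, the same decoder can generate tokens greedily or with beam search from the pose prefix alone, which is what "autoregressive generation without intermediate alignment" amounts to in practice.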
A key aspect of AutoSign is its focus on pose-based representations. Instead of processing raw RGB video data, which can be affected by background noise, lighting, and clothing variations, AutoSign utilizes 2D pose keypoints representing body joints, hand landmarks, and facial features. This approach not only offers robustness against environmental interference but also addresses privacy concerns by focusing solely on the signer’s movements. The researchers also incorporated part-aware augmentations during training, applying random rotations and scaling to hands, affine jitter to the face, and global pose jitter to the body to enhance robustness.
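The paper describes these part-aware augmentations at a high level; the NumPy sketch below shows one plausible implementation. The keypoint index layout (25 body, 70 face, 21 per hand) and all jitter magnitudes are assumptions for illustration:

```python
import numpy as np

# Assumed keypoint layout: body, face, left hand, right hand (indices are illustrative).
BODY, FACE, LHAND, RHAND = slice(0, 25), slice(25, 95), slice(95, 116), slice(116, 137)

def rotate_scale(pts: np.ndarray, max_deg: float = 15.0, max_scale: float = 0.1) -> np.ndarray:
    """Random rotation and scaling of one part about its centroid."""
    theta = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    s = 1.0 + np.random.uniform(-max_scale, max_scale)
    R = s * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
    center = pts.mean(axis=(0, 1), keepdims=True)
    return (pts - center) @ R.T + center

def augment(pose: np.ndarray) -> np.ndarray:
    """pose: (frames, keypoints, 2) array of 2D keypoints."""
    out = pose.copy()
    for part in (LHAND, RHAND):                      # hands: random rotation + scaling
        out[:, part] = rotate_scale(out[:, part])
    A = np.eye(2) + np.random.uniform(-0.05, 0.05, (2, 2))
    out[:, FACE] = out[:, FACE] @ A.T + np.random.uniform(-0.02, 0.02, 2)  # face: affine jitter
    out += np.random.normal(0.0, 0.01, out.shape)    # global pose jitter
    return out
```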
Performance and Key Insights
The AutoSign model was rigorously evaluated on the Isharah-1000 dataset, a large-scale Saudi Sign Language (SSL) dataset designed for signer-independent recognition. The results were compelling. AutoSign achieved state-of-the-art performance, significantly outperforming existing video-based methods and traditional Transformer + CTC baselines. For instance, it achieved a 20.5% Word Error Rate (WER) on the test set, demonstrating a substantial improvement over previous methods.
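For context, WER counts the minimum number of word-level substitutions (S), deletions (D), and insertions (I) needed to turn a hypothesis into the reference transcript, normalized by the number of reference words N:

WER = (S + D + I) / N

A 20.5% WER therefore corresponds to roughly one word-level error for every five reference words.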
Through comprehensive ablation studies, the researchers gained valuable insights into optimal feature extraction and input modalities:
- A 1D CNN with two layers proved most effective for temporal compression, capturing relevant motion patterns while reducing computational load.
- Surprisingly, the combination of body and hand gestures provided the most discriminative features for signer-independent CSLR, outperforming configurations that included facial features. This suggests that facial expressions might introduce variability that hinders generalization to unseen signers, while hand movements carry the primary semantic information, and body posture provides crucial contextual information.
- The use of a learning rate scheduler significantly improved training efficiency and stability (a minimal sketch of one common scheduler choice follows this list).
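The article does not name the exact scheduler, so the snippet below sketches one common choice for fine-tuning GPT-2-style decoders, linear warmup followed by cosine decay; the learning rate and step counts are placeholder assumptions:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 10)  # stand-in for the AutoSign model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=20_000
)

for step in range(20_000):
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()       # decay the learning rate once per optimizer step
    optimizer.zero_grad()
```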
Qualitative analysis further highlighted AutoSign’s superiority, demonstrating its ability to generate more accurate and complete translations by reducing insertion, deletion, and substitution errors compared to traditional CTC-based alignment methods.
The Future of Sign Language Recognition
AutoSign represents a significant step forward in continuous sign language recognition, particularly for Arabic Sign Language. By directly translating pose sequences to natural language text, it overcomes many limitations of previous multi-stage approaches, offering a more robust, efficient, and accurate solution. The focus on pose keypoints also addresses practical concerns like privacy and environmental variability, making it suitable for real-world applications.
The research team plans to further evaluate AutoSign on the larger Isharah-2000 dataset and to explore model compression techniques for deployment on mobile devices, in keeping with the dataset's original mobile-based collection setup. This work paves the way for more accessible and inclusive communication technologies for the deaf and hard-of-hearing community. You can read the full research paper here.


