TLDR: AutoSign is a novel decoder-only transformer model that directly translates continuous sign language pose sequences into natural language text, bypassing traditional multi-stage alignment methods. Developed by researchers at Carnegie Mellon University Africa, it uses 1D CNNs for pose compression and a pre-trained Arabic GPT-2 model for text generation. Evaluated on the Isharah-1000 dataset, AutoSign achieves state-of-the-art performance by focusing on body and hand gestures, offering a more robust and efficient solution for sign language recognition.
Communication is a fundamental human right, yet for the deaf and hard-of-hearing community, who comprise about 5.5% of the global population, effective interaction can be challenging due to the complexities of sign languages. Unlike spoken languages, sign languages convey meaning through a rich combination of hand gestures, facial expressions, and body movements, forming intricate visual-spatial languages with unique grammar and vocabulary. This inherent complexity often limits effective communication between signers and those unfamiliar with sign language, highlighting the critical need for advanced automated translation systems.
Traditional Continuous Sign Language Recognition (CSLR) systems have typically relied on multi-stage pipelines. These methods first extract visual features, then align variable-length sign sequences with intermediate representations called ‘glosses’ using techniques like Connectionist Temporal Classification (CTC) or Hidden Markov Models (HMMs). While functional, these alignment-based approaches often suffer from several drawbacks: error propagation across stages, a tendency to overfit, and difficulty scaling to larger vocabularies, because the intermediate gloss representation becomes a bottleneck.
Introducing AutoSign: A Direct Approach to Sign Language Translation
Addressing these limitations, researchers Samuel Ebimobowei Johnny, Blessed Guda, Andrew Blayama Stephen, and Assane Gueye from Carnegie Mellon University Africa have proposed a novel system called AutoSign. This innovative approach bypasses the traditional multi-stage pipeline entirely, offering a direct translation of sign language pose sequences into natural language text. AutoSign is an autoregressive decoder-only transformer, meaning it directly maps visual features to text without the need for intermediate gloss supervision or complex alignment mechanisms.
The core innovation of AutoSign lies in its end-to-end autoregressive generation. Inspired by the success of decoder-only models in natural language processing, AutoSign leverages a pre-trained Arabic decoder, AraGPT2, to generate text (glosses) directly from pose inputs. This allows the model to learn textual dependencies within the glosses directly, simplifying the overall CSLR pipeline.
How AutoSign Works
AutoSign’s architecture is designed for efficiency and accuracy. It begins with a temporal compression module that uses 1D Convolutional Neural Networks (CNNs) to efficiently process long pose sequences, preserving crucial temporal dynamics while reducing computational overhead. These compressed pose embeddings are then fed into the AraGPT2 decoder backbone, which, combined with gloss token embeddings, generates the target Arabic text autoregressively.
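To make the data flow concrete, here is a minimal PyTorch sketch of this pipeline. The two-layer 1D CNN and the AraGPT2 decoder follow the description above, but the keypoint count, kernel sizes, stride, and the choice of the publicly released `aubmindlab/aragpt2-base` checkpoint are illustrative assumptions rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class AutoSignSketch(nn.Module):
    """Two-layer 1D-CNN temporal compressor feeding a pre-trained AraGPT2 decoder."""

    def __init__(self, pose_feats: int = 137 * 2, stride: int = 2):
        super().__init__()
        # Hypothetical checkpoint choice; the paper uses AraGPT2 as the decoder backbone.
        self.decoder = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-base")
        d_model = self.decoder.config.n_embd  # 768 for the base checkpoint
        # Temporal compression: downsample the frame axis while keeping local motion cues.
        self.compress = nn.Sequential(
            nn.Conv1d(pose_feats, d_model, kernel_size=5, stride=stride, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=stride, padding=2),
            nn.GELU(),
        )

    def forward(self, poses: torch.Tensor, gloss_ids: torch.Tensor) -> torch.Tensor:
        # poses: (batch, frames, pose_feats); Conv1d expects (batch, channels, frames).
        pose_emb = self.compress(poses.transpose(1, 2)).transpose(1, 2)
        # Prefix the compressed pose embeddings to the gloss token embeddings so the
        # decoder conditions on the visual context while generating text autoregressively.
        tok_emb = self.decoder.transformer.wte(gloss_ids)
        inputs = torch.cat([pose_emb, tok_emb], dim=1)
        # Supervise only the text positions; label -100 tells the loss to skip pose positions.
        ignore = torch.full(pose_emb.shape[:2], -100, dtype=torch.long, device=gloss_ids.device)
        labels = torch.cat([ignore, gloss_ids], dim=1)
        return self.decoder(inputs_embeds=inputs, labels=labels).loss
```

At inference time, the same decoder can generate tokens greedily or with beam search from the pose prefix alone, which is what "autoregressive generation without intermediate alignment" amounts to in practice.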
A key aspect of AutoSign is its focus on pose-based representations. Instead of processing raw RGB video data, which can be affected by background noise, lighting, and clothing variations, AutoSign utilizes 2D pose keypoints representing body joints, hand landmarks, and facial features. This approach not only offers robustness against environmental interference but also addresses privacy concerns by focusing solely on the signer’s movements. The researchers also incorporated part-aware augmentations during training, applying random rotations and scaling to hands, affine jitter to the face, and global pose jitter to the body to enhance robustness.
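The paper describes these part-aware augmentations at a high level; the NumPy sketch below shows one plausible implementation. The keypoint index layout (25 body, 70 face, 21 per hand) and all jitter magnitudes are assumptions for illustration:

```python
import numpy as np

# Assumed keypoint layout: body, face, left hand, right hand (indices are illustrative).
BODY, FACE, LHAND, RHAND = slice(0, 25), slice(25, 95), slice(95, 116), slice(116, 137)

def rotate_scale(pts: np.ndarray, max_deg: float = 15.0, max_scale: float = 0.1) -> np.ndarray:
    """Random rotation and scaling of one part about its centroid."""
    theta = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    s = 1.0 + np.random.uniform(-max_scale, max_scale)
    R = s * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
    center = pts.mean(axis=(0, 1), keepdims=True)
    return (pts - center) @ R.T + center

def augment(pose: np.ndarray) -> np.ndarray:
    """pose: (frames, keypoints, 2) array of 2D keypoints."""
    out = pose.copy()
    for part in (LHAND, RHAND):                      # hands: random rotation + scaling
        out[:, part] = rotate_scale(out[:, part])
    A = np.eye(2) + np.random.uniform(-0.05, 0.05, (2, 2))
    out[:, FACE] = out[:, FACE] @ A.T + np.random.uniform(-0.02, 0.02, 2)  # face: affine jitter
    out += np.random.normal(0.0, 0.01, out.shape)    # global pose jitter
    return out
```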
Performance and Key Insights
The AutoSign model was rigorously evaluated on the Isharah-1000 dataset, a large-scale Saudi Sign Language (SSL) dataset designed for signer-independent recognition. The results were compelling. AutoSign achieved state-of-the-art performance, significantly outperforming existing video-based methods and traditional Transformer + CTC baselines. For instance, it achieved a 20.5% Word Error Rate (WER) on the test set, demonstrating a substantial improvement over previous methods.
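For context, WER counts the minimum number of word-level substitutions (S), deletions (D), and insertions (I) needed to turn a hypothesis into the reference transcript, normalized by the number of reference words N:

WER = (S + D + I) / N

A 20.5% WER therefore corresponds to roughly one word-level error for every five reference words.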
Through comprehensive ablation studies, the researchers gained valuable insights into optimal feature extraction and input modalities:
- A 1D CNN with two layers proved most effective for temporal compression, capturing relevant motion patterns while reducing computational load.
- Surprisingly, the combination of body and hand gestures provided the most discriminative features for signer-independent CSLR, outperforming configurations that included facial features. This suggests that facial expressions might introduce variability that hinders generalization to unseen signers, while hand movements carry the primary semantic information, and body posture provides crucial contextual information.
- The use of a learning rate scheduler significantly improved training efficiency and stability (a minimal sketch of one common scheduler choice follows this list).
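The article does not name the exact scheduler, so the snippet below sketches one common choice for fine-tuning GPT-2-style decoders, linear warmup followed by cosine decay; the learning rate and step counts are placeholder assumptions:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 10)  # stand-in for the AutoSign model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=20_000
)

for step in range(20_000):
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()       # decay the learning rate once per optimizer step
    optimizer.zero_grad()
```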
Qualitative analysis further highlighted AutoSign’s superiority, demonstrating its ability to generate more accurate and complete translations by reducing insertion, deletion, and substitution errors compared to traditional CTC-based alignment methods.
The Future of Sign Language Recognition
AutoSign represents a significant step forward in continuous sign language recognition, particularly for Arabic Sign Language. By directly translating pose sequences to natural language text, it overcomes many limitations of previous multi-stage approaches, offering a more robust, efficient, and accurate solution. The focus on pose keypoints also addresses practical concerns like privacy and environmental variability, making it suitable for real-world applications.
The research team plans to further evaluate AutoSign on the larger Isharah-2000 dataset and to explore model compression techniques for deployment on mobile devices, in keeping with the dataset's original mobile-based collection setup. This work paves the way for more accessible and inclusive communication technologies for the deaf and hard-of-hearing community. You can read the full research paper here.


