TLDR: SignClip is a framework for Sign Language Translation (SLT) that improves translation quality by fusing manual cues (hand gestures) with non-manual cues (mouthing). It employs a dual-stream architecture and hierarchical contrastive learning to keep visual and textual representations semantically consistent. On the PHOENIX14T and How2Sign datasets, SignClip outperforms previous state-of-the-art models, especially in gloss-free translation, by disambiguating visually similar signs and improving translation fluency.
Sign language serves as a vital communication method for millions, yet translating it into natural language, known as Sign Language Translation (SLT), remains a complex challenge. While significant progress has been made, many existing approaches primarily focus on manual signals, such as hand gestures, often overlooking crucial non-manual cues like mouthing. These subtle facial movements convey essential linguistic information and play a critical role in distinguishing between visually similar signs.
A new framework called SignClip aims to bridge this gap by integrating both manual and non-manual cues, specifically spatial gesture and mouthing features, to enhance the accuracy of sign language translation. The researchers behind SignClip recognized that signs like “chair” and “sit” might share similar hand configurations but differ significantly in accompanying mouth shapes, highlighting the importance of mouthing for disambiguation.
Integrating mouthing information, however, presents unique challenges. Hand gestures involve broad body movements, while mouthing is confined to subtle changes in the lower face. Without proper isolation, raw video inputs can introduce noise, making it difficult to extract useful mouthing features. Furthermore, effectively aligning multiple visual streams (hand features and lip movements) is crucial, especially in gloss-free SLT settings where no intermediate annotations guide the fusion process.
SignClip addresses these challenges with a novel multimodal contrastive fusion framework. It employs a dual-stream architecture: one stream independently encodes the visual input of the full frame to capture hand gestures, and another isolates the mouth region using facial landmark detection to derive non-manual mouthing features. These two streams are then combined using a flexible fusion module with gated mechanisms.
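As a rough illustration of the two ideas in this design, the sketch below computes a padded crop box around mouth landmarks and blends two feature vectors with a sigmoid gate. It is a minimal sketch in plain Python, not the authors' implementation: the function names (`mouth_box`, `gated_fusion`), the `margin` parameter, and the specific gate form g * spatial + (1 - g) * mouth are assumptions made for illustration, and the landmark coordinates are assumed to come from whatever facial landmark detector is used.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def mouth_box(landmarks: Sequence[Point], margin: float,
              width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) around the mouth landmarks.

    The box is padded by `margin` times its own size on each side and
    clipped to the frame. `margin` is a hypothetical tuning parameter.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    pad_x = margin * (max(xs) - min(xs))
    pad_y = margin * (max(ys) - min(ys))
    left = max(0, math.floor(min(xs) - pad_x))
    top = max(0, math.floor(min(ys) - pad_y))
    right = min(width, math.ceil(max(xs) + pad_x))
    bottom = min(height, math.ceil(max(ys) + pad_y))
    return left, top, right, bottom


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_fusion(spatial: Sequence[float], mouth: Sequence[float],
                 gate_weights: Sequence[Sequence[float]],
                 gate_bias: Sequence[float]) -> List[float]:
    """Blend two equal-length feature vectors with a per-dimension gate.

    Assumed gate form (illustrative only):
        g = sigmoid(W @ [spatial; mouth] + b)
        fused = g * spatial + (1 - g) * mouth
    """
    concat = list(spatial) + list(mouth)
    fused = []
    for i, (s, m) in enumerate(zip(spatial, mouth)):
        z = sum(w * c for w, c in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(z)
        fused.append(g * s + (1.0 - g) * m)
    return fused
```

With all-zero gate weights and bias, every gate equals 0.5 and the output is the plain average of the two streams. In a trained model the gate would shift, per dimension, toward whichever stream is more informative.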
To further strengthen the interaction between these modalities and improve generalization, SignClip incorporates multi-level contrastive learning objectives. One objective encourages alignment between visual and mouthing features to help disambiguate visually similar gestures. The other aligns sign features with textual embeddings from a Large Language Model (LLM) to ensure semantic consistency. This comprehensive training scheme promotes robust multimodal integration, leading to more accurate and fluent translations.
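The two alignment objectives can be sketched with a standard InfoNCE-style contrastive loss, in which each item in a batch should match its own partner more closely than any other item's partner. This is a generic formulation assumed for illustration; the article does not give the exact loss, and the temperature of 0.07, the loss weights, and all function names below are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors: Sequence[Vector], positives: Sequence[Vector],
             temperature: float = 0.07) -> float:
    """Mean cross-entropy where anchors[i] should pick positives[i]."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_loss(x: Sequence[Vector], y: Sequence[Vector],
                   temperature: float = 0.07) -> float:
    """Average of the x-to-y and y-to-x InfoNCE terms."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def combined_loss(visual: Sequence[Vector], mouth: Sequence[Vector],
                  sign: Sequence[Vector], text: Sequence[Vector],
                  w_mouth: float = 1.0, w_text: float = 1.0) -> float:
    """Weighted sum of the two alignment terms described in the text:
    visual vs. mouthing features, and sign vs. text embeddings."""
    return (w_mouth * symmetric_loss(visual, mouth)
            + w_text * symmetric_loss(sign, text))
```

The loss is near zero when matched pairs point in the same direction and grows when a sample is closer to another sample's partner than to its own, which is the behavior that pushes visually similar signs apart when their mouthing differs.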
Extensive experiments were conducted on two benchmark datasets, PHOENIX14T and How2Sign. SignClip consistently outperformed existing state-of-the-art models, particularly in gloss-free settings where intermediate annotations are not provided. On PHOENIX14T, SignClip raised the BLEU-4 score from 24.32 to 24.71, surpassing the previous best model, SpaMo. On How2Sign, it achieved a new state-of-the-art BLEU-4 score of 10.75, which is 0.64 points above SpaMo.
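To put the two reported gains on the same footing, the short calculation below derives the absolute and relative BLEU-4 improvements from the figures quoted above. The How2Sign baseline is not stated directly in this article; it is inferred here as 10.75 minus 0.64, so treat it as a derived value rather than a number quoted from the paper. The helper name `bleu_gain` is made up for this example.

```python
def bleu_gain(new: float, old: float):
    """Return (absolute gain in BLEU points, relative gain in percent)."""
    absolute = new - old
    return round(absolute, 2), round(100 * absolute / old, 1)


# PHOENIX14T: both scores are quoted in the text.
phoenix = bleu_gain(24.71, 24.32)

# How2Sign: baseline inferred from the quoted score and margin.
how2sign_baseline = round(10.75 - 0.64, 2)
how2sign = bleu_gain(10.75, how2sign_baseline)

print("PHOENIX14T gain (points, %):", phoenix)
print("How2Sign gain (points, %):", how2sign)
```

Under these figures the absolute gain on PHOENIX14T is the smaller of the two, while the relative gain on How2Sign is larger because the baseline score there is much lower.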
Ablation studies confirmed the effectiveness of each component, demonstrating that while spatial features provide a strong baseline, the integration of mouthing features, combined with visual-text and sign-mouthing alignment, significantly boosts translation performance. The research also highlighted the impact of using powerful LLMs like Flan-T5-XL, which are fine-tuned with Low-Rank Adaptation (LoRA) for efficient adaptation to the SLT task.
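For readers unfamiliar with LoRA, the idea is to freeze a pretrained weight matrix W and learn only a low-rank update, so the effective weight becomes W + (alpha / r) * B * A, where B and A are small matrices of rank r. The sketch below shows that update in plain Python. It illustrates the general technique only: the article does not report the rank, scaling factor, or which layers were adapted, so the values used here are placeholders.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def matmul(a: Matrix, b: Matrix) -> List[List[float]]:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w: Matrix, lora_b: Matrix, lora_a: Matrix,
                          alpha: float) -> List[List[float]]:
    """Return W + (alpha / r) * B @ A, with r = number of rows in A.

    w:      frozen weight, shape (d_out, d_in)
    lora_b: trainable, shape (d_out, r)
    lora_a: trainable, shape (r, d_in)
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


# Placeholder sizes: a 2048 x 2048 layer adapted with rank 8.
full = 2048 * 2048
adapted = lora_trainable_params(2048, 2048, 8)
print(f"full: {full}, LoRA: {adapted}, ratio: {adapted / full:.4f}")
```

Because B is conventionally initialized to zero, the adapted model starts out identical to the pretrained one, and only the small A and B matrices receive gradient updates during fine-tuning.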
The qualitative analysis further illustrated that mouthing features provide valuable complementary cues, especially for functional words and temporal expressions, which are crucial for sentence fluency and correctness. By capturing these subtle yet important elements, SignClip produces more complete and accurate translations, validating the necessity of incorporating non-manual cues like mouthing into SLT systems.
This innovative approach marks a significant step forward in making sign language translation more accurate and accessible, fostering more inclusive communication. You can read the full research paper here.


