TLDR: POSESTITCH-SLT is a novel pre-training method for sign language translation that creates large-scale synthetic datasets by stitching together word-level sign poses based on linguistic templates. This approach significantly improves translation accuracy on American and Indian Sign Languages, addressing data scarcity and privacy concerns without relying on gloss annotations or complex architectures.
Sign language serves as the primary communication method for over 70 million deaf and hard-of-hearing individuals globally. Despite advancements in natural language processing, sign language processing remains a significant challenge, largely due to the scarcity of extensive, sentence-aligned datasets and concerns regarding signer privacy.
Researchers from the Indian Institute of Technology Kanpur have introduced a novel pre-training scheme called POSESTITCH-SLT, designed to overcome these hurdles in end-to-end sign language translation. This innovative approach is inspired by linguistic templates used for sentence generation and focuses on pose-based, gloss-free translation, which means it translates directly from sign poses to spoken language without relying on intermediate annotations or raw video footage that could identify signers.
Addressing Data Scarcity and Privacy
Traditional sign language translation methods often depend on gloss annotations (textual labels for individual signs), which are labor-intensive and can strip away the linguistic richness of sign languages. Furthermore, using raw sign videos raises privacy concerns due to identifiable features of the signer. POSESTITCH-SLT tackles these issues by utilizing 2D/3D keypoints extracted from the face, hands, and body, offering a privacy-preserving alternative while retaining crucial communication information.
The core of POSESTITCH-SLT lies in its ability to construct synthetic pose-based sentence data. It leverages publicly available word-level sign language datasets, such as WLASL for American Sign Language (ASL) and CISLR for Indian Sign Language (ISL), which cover thousands of words. To generate grammatically diverse sentences, the system employs linguistic templates from benchmarks like BLiMP (Benchmark of Linguistic Minimal Pairs) and complements this with data from large text corpora like BPCC.
How POSESTITCH-SLT Works
The methodology involves two main steps: data generation and pose stitching.
Data Generation via Linguistic Templates and Large Text Corpora
To create a vast amount of training data, POSESTITCH-SLT aligns linguistic templates with word-level sign pose data. By filling these templates with words from the shared vocabulary of BLiMP and datasets like WLASL or CISLR, millions of grammatically varied English sentences are generated. For instance, the system can create sentences like “What did John read before filing the book?” by drawing words from the overlapping vocabulary and adhering to BLiMP’s syntactic structures.
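The template-filling step described above can be sketched as follows. The template format and slot vocabulary here are illustrative toy values, not the paper's actual BLiMP templates or the WLASL/CISLR word lists:

```python
import random

# Hypothetical BLiMP-style templates with typed slots (illustrative only).
TEMPLATES = [
    "What did {name} {verb_past} before {verb_ing} the {noun}?",
    "The {noun} that {name} {verb_past} was {adjective}.",
]

# Toy words drawn from the overlap of the template vocabulary and a
# word-level sign dataset (e.g. WLASL or CISLR).
SLOT_WORDS = {
    "name": ["John", "Mary"],
    "verb_past": ["read", "found"],
    "verb_ing": ["filing", "closing"],
    "noun": ["book", "door"],
    "adjective": ["old", "heavy"],
}

def fill_template(template: str, rng: random.Random) -> str:
    """Fill each {slot} with a random word from the shared vocabulary."""
    return template.format(
        **{slot: rng.choice(words) for slot, words in SLOT_WORDS.items()}
    )

rng = random.Random(0)
sentences = [fill_template(t, rng) for t in TEMPLATES for _ in range(3)]
for s in sentences:
    print(s)
```

Because every word in a filled template has a corresponding word-level pose clip, each generated sentence is guaranteed to be synthesizable.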
To expand linguistic and lexical coverage beyond BLiMP’s limited vocabulary, the researchers also incorporate sentences from the BPCC corpus, a collection of 230 million English bitext pairs. Sentences from BPCC are selected if they have a high word match (over 90%) with the WLASL or CISLR vocabulary, ensuring they can be fully synthesized using available sign poses. Post-processing steps, such as sentence length matching and anonymization, further refine these datasets.
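The coverage filter described above can be sketched as keeping a corpus sentence only if enough of its tokens have word-level sign entries. The 0.9 threshold mirrors the reported 90% criterion; the tokenizer and the tiny vocabulary are simplified assumptions:

```python
import re

# Toy stand-in for the WLASL/CISLR word-level sign vocabulary.
SIGN_VOCAB = {"what", "did", "john", "read", "before", "the", "book", "dog", "ran"}

def coverage(sentence: str, vocab: set) -> float:
    """Fraction of tokens in the sentence that have a word-level sign entry."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        return 0.0
    return sum(t in vocab for t in tokens) / len(tokens)

def select_sentences(corpus, vocab, threshold=0.9):
    """Keep sentences whose sign-vocabulary coverage exceeds the threshold."""
    return [s for s in corpus if coverage(s, vocab) > threshold]

corpus = [
    "What did John read?",
    "The dog ran before the book.",
    "Quantum entanglement is strange.",
]
selected = select_sentences(corpus, SIGN_VOCAB)
print(selected)  # the third sentence is dropped: no sign-vocabulary overlap
```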
Pose Stitching
Once sentences are generated, the system synthesizes sentence-level sign language sequences by stitching together individual word-level pose sequences. This involves extracting 2D keypoints (76 keypoints forming 152-dimensional vectors) from word videos using the MediaPipe library, covering facial expressions, hand configurations, and upper body motion. Low-confidence keypoints are interpolated, and sequences are normalized to reduce signer variance.
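The interpolation and normalization steps could look roughly like this in NumPy. The confidence threshold and the shoulder-based centering/scaling are illustrative assumptions about how signer variance is reduced, not the paper's exact procedure:

```python
import numpy as np

def interpolate_low_confidence(keypoints, confidence, thresh=0.5):
    """Linearly interpolate keypoint coordinates over frames where
    detection confidence falls below the threshold.

    keypoints:  (T, K, 2) array of 2D keypoints over T frames
    confidence: (T, K) per-keypoint detection confidence
    """
    out = keypoints.copy()
    frames = np.arange(keypoints.shape[0])
    for k in range(keypoints.shape[1]):
        good = confidence[:, k] >= thresh
        if good.any() and not good.all():
            for d in range(2):  # interpolate x and y independently
                out[:, k, d] = np.interp(frames, frames[good], keypoints[good, k, d])
    return out

def normalize_pose(keypoints, left_sh=0, right_sh=1):
    """Center on the shoulder midpoint and scale by shoulder width to
    reduce variance across signers (keypoint indices are illustrative)."""
    mid = (keypoints[:, left_sh] + keypoints[:, right_sh]) / 2.0
    width = np.linalg.norm(keypoints[:, left_sh] - keypoints[:, right_sh], axis=-1)
    width = np.maximum(width, 1e-6)  # guard against degenerate frames
    return (keypoints - mid[:, None, :]) / width[:, None, None]
```

With 76 keypoints, flattening the normalized (x, y) coordinates per frame yields the 152-dimensional vectors mentioned above.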
Temporal alignment and concatenation of these word-level segments create a fluent stream. Boundary-aware temporal smoothing is applied to ensure smooth transitions between signs. A key design choice explored was the word order during stitching: “Same Word Order (SWO)” preserves the English syntactic structure, while “Random Word Order (RWO)” permutes word order to encourage robust representations. The researchers found that SWO generally yielded better performance.
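Boundary-aware smoothing might be sketched as a short crossfade between the end of one word clip and the start of the next. The blend length and the linear crossfade are assumptions, not the paper's exact smoothing; the SWO/RWO switch corresponds to the word-order variants above:

```python
import numpy as np

def stitch_words(word_clips, blend=4, random_order=False, rng=None):
    """Concatenate per-word pose sequences (each of shape (T_i, K, 2)) into
    one sentence-level sequence, crossfading `blend` frames at each boundary.

    random_order=True mimics the RWO variant; the default preserves the
    template's word order (SWO).
    """
    clips = list(word_clips)
    if random_order:
        (rng or np.random.default_rng()).shuffle(clips)
    out = clips[0]
    for nxt in clips[1:]:
        b = min(blend, out.shape[0], nxt.shape[0])
        w = np.linspace(0.0, 1.0, b)[:, None, None]  # linear crossfade weights
        overlap = (1 - w) * out[-b:] + w * nxt[:b]
        out = np.concatenate([out[:-b], overlap, nxt[b:]], axis=0)
    return out
```

Each boundary consumes `blend` frames, so two clips of lengths T1 and T2 stitch into a sequence of length T1 + T2 - blend.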
Training Strategy
POSESTITCH-SLT employs a linear annealing strategy during training. Initially, the model is trained exclusively on synthetic pose-sentence pairs. As training progresses, the probability of sampling from real sentence-aligned datasets (iSign for ISL and How2Sign for ASL) gradually increases, reaching up to 85% after 60,000 training steps. This blended approach allows the model to benefit from both the diversity of synthetic data and the realism of target domain data, preventing catastrophic forgetting often seen in traditional pretraining-fine-tuning pipelines.
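The annealing schedule can be expressed as a per-step probability of sampling a real sentence-aligned pair. The linear ramp is an assumption about the schedule's shape; the 85% ceiling and 60,000-step horizon come from the description above:

```python
import random

def real_data_prob(step, max_prob=0.85, total_steps=60_000):
    """Probability of drawing a real (sentence-aligned) pair at this step;
    ramps linearly from 0 to max_prob, then stays flat."""
    return min(step / total_steps, 1.0) * max_prob

def sample_batch_source(step, rng):
    """Choose the data source for one batch under the annealing schedule."""
    return "real" if rng.random() < real_data_prob(step) else "synthetic"

print(real_data_prob(0))       # 0.0  -> purely synthetic at the start
print(real_data_prob(30_000))  # 0.425
print(real_data_prob(60_000))  # 0.85 -> mostly real data thereafter
```

Keeping a residual 15% of synthetic batches even late in training is what blends the two sources, rather than switching over entirely as a pretrain-then-finetune pipeline would.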
Remarkable Performance Gains
The effectiveness of POSESTITCH-SLT was evaluated using a standard Transformer encoder-decoder model on the How2Sign (ASL) and iSign (ISL) benchmark datasets. The results are significant: on How2Sign, the BLEU-4 score improved from 1.97 to 4.56 on the test set, and on iSign, it increased from 0.55 to 3.43. These gains surpass prior state-of-the-art methods for pose-based gloss-free translation, demonstrating the power of template-driven synthetic supervision in low-resource sign language settings.
Ablation studies confirmed the critical role of synthetic pose-stitched pretraining: models trained without it performed markedly worse. The system also generalized well to unseen synthetic sentences, achieving high BLEU-4 scores within the restricted vocabulary domain.
Future Outlook
While POSESTITCH-SLT marks a significant step forward, the researchers acknowledge limitations, including vocabulary coverage restricted by existing word-level datasets and the reliance on English word order due to the lack of standardized sign language grammar resources. Future work aims to expand vocabulary, integrate grammatical features specific to sign languages, and apply this strategy to more diverse sign language variants.
This research paves the way for scaling sign language translation using linguistic structure-based data synthesis, offering a promising path toward more inclusive and generalizable SLT systems. For more details, you can refer to the full research paper: POSESTITCH-SLT: Linguistically Inspired Pose-Stitching for End-to-End Sign Language Translation.


