TLDR: SiLVERScore is a novel evaluation metric for sign language generation that uses semantically-aware embeddings to directly compare generated signs with references in a joint embedding space. It overcomes the limitations of traditional back-translation methods by capturing multimodal features like facial expressions and prosody, demonstrating superior performance in distinguishing correct from random pairs, robustness to semantic variations, and stability across prosodic intensities. This advancement offers a more accurate and holistic assessment of generated sign language.
Evaluating how well artificial intelligence models generate sign language has long been a complex challenge. Traditionally, this evaluation relies on a two-step process called back-translation. This involves converting the generated signs back into text and then comparing that text to a reference using standard text-based metrics like BLEU or ROUGE. However, this method has significant drawbacks. It often fails to capture the rich, multimodal nature of sign language, which includes crucial elements like facial expressions, spatial grammar, and prosody (the rhythm and intonation of language). Moreover, it makes it difficult to determine whether an error in evaluation stems from the sign generation model itself or from the translation system used to convert signs to text.
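To make the two-step process concrete, here is a minimal sketch in Python. The names `generation_model` and `back_translator` are hypothetical stand-ins for the sign generation model under test and the sign-to-text translation system; BLEU comes from NLTK.

```python
from nltk.translate.bleu_score import sentence_bleu

def back_translation_score(source_text, reference_text,
                           generation_model, back_translator):
    """Traditional two-step evaluation: generate signs, translate them
    back to text, and compare that text to the reference with BLEU."""
    sign_video = generation_model(source_text)   # step 1: text -> sign video
    hypothesis = back_translator(sign_video)     # step 2: sign video -> text
    # Any error introduced in step 2 is indistinguishable from a
    # generation error, which is one of the method's core weaknesses.
    return sentence_bleu([reference_text.split()], hypothesis.split())
```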
Imagine a scenario where a sign language generation model accidentally swaps the referents in a sentence, for example, generating “John gave Mary a book” instead of “Mary gave John a book.” Under back-translation, word-overlap metrics can still score this highly: both sentences contain exactly the same words, so unigram-based measures treat them as a perfect match even though the visual meaning is entirely reversed. This highlights a critical need for an evaluation method that directly assesses the generated sign language video rather than its textual translation.
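A quick, self-contained illustration of this failure mode: a ROUGE-1-style unigram F1 scores the swapped sentence as a perfect 1.0, because the two sentences are identical as bags of words.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram overlap (a ROUGE-1-style F1) between two sentences."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared words, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Meaning is reversed, yet the word-level overlap is perfect.
print(unigram_f1("John gave Mary a book", "Mary gave John a book"))  # 1.0
```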
To address these limitations, researchers have introduced SiLVERScore (Sign Language Video Embedding Representation Score). This innovative metric offers a semantically-aware, embedding-based approach to evaluate sign language generation. Instead of relying on back-translation, SiLVERScore directly compares generated and reference signs within a joint embedding space. This space is designed to capture both semantic (meaning) and prosodic (expressive) features of sign language.
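The article does not spell out the scoring function, but in a contrastively trained joint space the conventional choice is cosine similarity between the two embeddings. The following is a minimal sketch under that assumption (the function name is illustrative, not the paper's API):

```python
import numpy as np

def silver_style_score(gen_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Cosine similarity between a generated-sign embedding and a
    reference embedding in the shared (joint) space."""
    g = gen_emb / np.linalg.norm(gen_emb)
    r = ref_emb / np.linalg.norm(ref_emb)
    return float(g @ r)  # near 1.0 = same meaning, near 0 = unrelated
```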
SiLVERScore leverages a model called CiCo, which uses contrastive learning to align video and text representations. This means it learns to understand the relationships between sign language videos and their corresponding text descriptions. A key advantage of this approach is its ability to handle continuous video streams without needing explicit segmentation, and it avoids reliance on potentially error-prone pose estimation tools. The model processes sign videos using a sliding window mechanism and combines both general and domain-specific features. Text is translated into English and then aligned with video embeddings using a contrastive learning objective, ensuring that matched video-text pairs are highly similar and unmatched pairs are dissimilar.
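The contrastive objective described above is, in its standard CLIP-style form, a symmetric InfoNCE loss over a batch of matched pairs. This PyTorch sketch shows that standard form; the temperature value is an assumption, not a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched video-text pairs.

    video_emb, text_emb: (batch, dim) outputs of the two encoders.
    Matched pairs (row i with row i) are pulled together; every other
    pairing in the batch is pushed apart.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```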
Experiments conducted on datasets like PHOENIX-14T (German Sign Language) and CSL-Daily (Chinese Sign Language) demonstrate SiLVERScore’s effectiveness. When distinguishing between correctly matched and randomly paired video-text samples, SiLVERScore achieved near-perfect discrimination, significantly outperforming traditional metrics. It showed minimal overlap between the distributions of scores for correct and random pairs, indicating its strong ability to identify accurate semantic alignment.
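One standard way to quantify how well a score separates correct from random pairings is the pairwise ROC-AUC: the probability that a matched pair outscores a random one. Near-perfect discrimination corresponds to an AUC close to 1.0. A small sketch, assuming higher scores indicate a match:

```python
import numpy as np

def discrimination_auc(matched_scores, random_scores) -> float:
    """Probability that a correctly matched pair scores higher than a
    randomly shuffled pair (ties count half); equivalent to ROC-AUC."""
    m = np.asarray(matched_scores)[:, None]
    r = np.asarray(random_scores)[None, :]
    return float((m > r).mean() + 0.5 * (m == r).mean())

# Usage: score every (video, correct text) pair and every (video,
# shuffled text) pair with the metric, then compare the distributions.
```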
Furthermore, SiLVERScore proved robust to semantic variations, such as word reordering. When sentences were reordered while preserving their meaning, traditional metrics like BLEU and ROUGE showed a significant drop in scores, indicating their sensitivity to exact word order. SiLVERScore, however, maintained high scores, demonstrating its capacity to capture the underlying semantic content rather than just surface-level text matches.
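As a toy probe of this sensitivity (the sentence is invented for illustration, not drawn from the paper's test set), BLEU drops sharply on a meaning-preserving reordering, while an embedding score of the kind sketched earlier would remain high:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "tomorrow it will rain in the south".split()
reordered = "in the south it will rain tomorrow".split()  # same meaning

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], reordered, smoothing_function=smooth))
# Low score despite identical meaning: few higher-order n-grams survive.
```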
The metric also showed stability across different levels of prosodic intensity, such as exaggerated facial expressions and pauses. Traditional metrics often saw their scores decline as prosody increased, suggesting they struggle with expressive signing. SiLVERScore, in contrast, remained consistent, indicating that it evaluates semantic alignment without being unduly influenced by prosodic variation.
While SiLVERScore represents a significant step forward, the research also acknowledges the “generalization problem” in sign language processing. Due to the scarcity and limited diversity of sign language datasets, models often struggle to generalize across different datasets without fine-tuning. SiLVERScore addresses this by being a dataset-specific evaluation metric, optimized to leverage the strengths of embedding-based methods within the constraints of current data availability. This approach aims for more reliable evaluations and better alignment with the linguistic and multimodal nature of sign language.
In conclusion, SiLVERScore offers a promising new standard for evaluating sign language generation. By moving beyond the limitations of back-translation and embracing a semantically-aware, embedding-based approach, it provides a more holistic and accurate assessment of generated sign language. This advancement is crucial for improving accessibility and inclusion for the Deaf and Hard-of-Hearing community in language technologies. For more in-depth information, you can read the full research paper here.


