
Shedding Light on Speech-to-Text Decisions with Contrastive Explanations

TLDR: The paper introduces the first method for generating contrastive explanations for Speech-to-Text (S2T) models. Unlike standard explanations that show why an output occurred, contrastive explanations reveal why one output was chosen *instead of another*. The method adapts a perturbation-based technique, focusing on word-level probabilities and a novel “relative scorer.” A case study on gender assignment in speech translation demonstrates its effectiveness in identifying specific audio features that drive gender choices, offering deeper insights into S2T model behavior and biases.

Artificial intelligence (AI) systems are becoming increasingly complex, and understanding why they make certain decisions is crucial. This is especially true for advanced models like Speech-to-Text (S2T) systems, which convert spoken language into written text. A new research paper introduces a groundbreaking method to provide more insightful explanations for these S2T models.

Understanding AI Decisions: Why P Instead of Q?

Traditionally, explainable AI (XAI) methods try to answer ‘Why did P happen?’ – for example, why did an S2T model transcribe ‘curious’ as ‘curiosa’ (feminine in Italian)? While helpful, these explanations often don’t tell the whole story. Humans often think in terms of alternatives: ‘Why did P happen rather than Q?’ This is where ‘contrastive explanations’ come in. They aim to explain why an AI system produced one output (the ‘target’) instead of another specific alternative (the ‘foil’). This approach mirrors human reasoning and offers more targeted insights.

Despite their benefits, contrastive explanations have been challenging to apply to S2T models. Speech signals are complex, spanning both time and frequency, and the output text can vary greatly in length. Previous attempts to explain S2T models often resulted in ‘saliency maps’ – visual representations of the audio input (spectrograms) that highlight regions most influential for a prediction. However, these maps were often holistic, showing features relevant for the entire word, not specific contrasting aspects like gender.
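The perturbation-based idea behind such saliency maps can be sketched in a few lines. This is a minimal occlusion-style illustration, not the paper's actual technique: `score_fn` is a hypothetical stand-in for the model's probability of the predicted word given a spectrogram, and zeroing out fixed rectangular time-frequency patches is a simplification of the perturbations real attribution methods use.

```python
import numpy as np

def perturbation_saliency(spectrogram, score_fn, patch=(8, 8)):
    """Occlusion-style saliency sketch: zero out one time-frequency patch
    at a time and record how much the model's score for the predicted
    word drops. Larger drops mean the patch influenced the prediction more."""
    n_freq, n_time = spectrogram.shape
    base = score_fn(spectrogram)
    saliency = np.zeros_like(spectrogram)
    for f0 in range(0, n_freq, patch[0]):
        for t0 in range(0, n_time, patch[1]):
            masked = spectrogram.copy()
            masked[f0:f0 + patch[0], t0:t0 + patch[1]] = 0.0
            # Attribute the score drop to every cell in the occluded patch.
            saliency[f0:f0 + patch[0], t0:t0 + patch[1]] = base - score_fn(masked)
    return saliency
```

The resulting map can be overlaid on the spectrogram to highlight which time-frequency regions drove the output, which is exactly the kind of holistic (non-contrastive) map the paper sets out to improve on.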

A Novel Approach for Speech-to-Text

The paper, titled “The Unheard Alternative: Contrastive Explanations for Speech-to-Text Models”, proposes the first method to generate contrastive explanations for S2T models. Authored by Lina Conti, Dennis Fucci, Marco Gaido, Matteo Negri, Guillaume Wisniewski, and Luisa Bentivogli, this work builds upon existing ‘feature attribution’ techniques, which analyze how parts of the input influence the output.

The core of their solution involves two main innovations:

  • Word-Level Probabilities: S2T models often generate ‘subword tokens’ rather than complete words. To explain why one word was chosen over another, the researchers developed a sophisticated way to combine these subword probabilities into meaningful word-level probabilities. This method accounts for nuances like whether a subword sequence forms a complete word or is just a prefix of a longer one.
  • A New Scoring Function: Traditional contrastive scoring functions in text-based AI, like the ‘contrastive difference scorer,’ can sometimes produce explanations that are too similar to non-contrastive ones, especially when the model strongly prefers one output over another. To overcome this, the researchers repurposed a ‘relative scorer’ that normalizes the contribution of each term. This ensures that both the target and foil remain influential in the final score, leading to truly contrastive explanations that pinpoint features specific to the choice between alternatives.
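The two ingredients above can be sketched as follows. The chain-rule aggregation is a simplification (the paper's full method additionally distinguishes subword sequences that complete a word from those that are only prefixes), and the exact form of the relative scorer is an assumption here: a pairwise renormalization of target against foil, which keeps both terms influential regardless of how strongly the model prefers one.

```python
import math

def word_logprob(subword_logprobs):
    """Chain-rule aggregation: a word's log-probability is the sum of the
    conditional log-probabilities of its subword tokens. (Simplified; the
    paper also handles the prefix-vs-complete-word distinction.)"""
    return sum(subword_logprobs)

def difference_scorer(logp_target, logp_foil):
    """Standard contrastive difference scorer: when the model strongly
    prefers the target, this is dominated by the target term alone."""
    return logp_target - logp_foil

def relative_scorer(logp_target, logp_foil):
    """An assumed form of the relative scorer: the target's probability
    mass renormalized over the {target, foil} pair, so both alternatives
    always contribute to the score."""
    p_t, p_f = math.exp(logp_target), math.exp(logp_foil)
    return p_t / (p_t + p_f)
```

For example, if 'curiosa' is tokenized as ['curi', 'osa'], its word-level probability is the product of the two subword probabilities; the relative scorer then compares that mass directly against the foil 'curioso'.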

Case Study: Gender Assignment in Speech Translation

To evaluate their method, the researchers conducted a case study on gender assignment in speech translation (ST). They focused on translating gender-neutral terms referring to the speaker (e.g., “I am curious”) into languages like Italian, French, or Spanish, which require a grammatical gender choice (masculine or feminine). This scenario is ideal for contrastive explanations because it provides natural pairs to compare (e.g., ‘curioso’ vs. ‘curiosa’).

The evaluation used two key metrics:

  • Coverage: This measures the percentage of cases where the model still generates either the target or the foil term after progressively removing the most salient (important) input features. High coverage indicates that the identified features are specific to the contrast rather than necessary for producing the word at all.
  • Flip Rate: This tracks how often the model switches from predicting the target to the foil when the most salient features are removed. A high flip rate demonstrates that the explanation accurately identifies features responsible for the specific choice.

The results showed that the new ‘relative scorer’ produced significantly different saliency maps compared to non-contrastive methods, indicating its precision in isolating gender-specific features. For feminine predictions, removing just 5% of the most relevant features caused the model to switch to masculine translations in over 70% of cases, while maintaining good coverage. This suggests the method effectively isolates the exact audio regions driving the model’s gender choice.

Interestingly, the flip rate for masculine predictions was lower, which the researchers hypothesize might be due to a ‘masculine default bias’ in S2T models, where masculine forms are generated unless strong feminine signals are present. This bias is often linked to imbalances in training data.

Broader Implications and Ethical Considerations

This new methodology provides a foundation for better understanding S2T models. It can help researchers investigate which phonetic cues models use for gender disambiguation and could be applied to other complex S2T tasks, such as disambiguating homophones (e.g., “plain” vs. “plane”) or resolving coreference (e.g., identifying which person a pronoun refers to).

The authors also address important ethical concerns. They acknowledge that S2T systems relying on vocal traits for gender prediction can disadvantage transgender speakers or individuals with vocal impairments. Furthermore, the study’s reliance on binary (masculine/feminine) gender annotations reflects grammatical conventions but doesn’t account for non-binary gender identities. The researchers emphasize that their work aims to explain current model behaviors, not to prescribe how systems should handle gender, and hope their methodology can be used to analyze how systems choose between binary and neutral alternatives once more inclusive datasets become available.

In conclusion, this research marks a significant step forward in making S2T models more transparent, offering a powerful tool to understand the nuanced decisions these complex AI systems make.

Rhea Bhattacharya
Rhea Bhattacharya is an AI correspondent with a keen eye for cultural, social, and ethical trends in Generative AI. With a background in sociology and digital ethics, she delivers high-context stories that explore the intersection of AI with everyday lives, governance, and global equity. Her news coverage is analytical, human-centric, and always ahead of the curve. You can reach her at: [email protected]
