Bridging the Speech-Text Gap in SLMs Using Optimal Transport Regularization

TLDR: A new method called Optimal Transport Regularization (OTReg) is introduced to improve Spoken Language Models (SLMs) by addressing the “modality gap” between speech and text. OTReg formulates speech-text alignment as an optimal transport problem, deriving a regularization loss that helps SLMs generate speech embeddings that better align with text. This lightweight, parameter-free approach enhances generalization across diverse datasets, as demonstrated in multilingual Automatic Speech Recognition (ASR) experiments.

Spoken Language Models (SLMs) are an exciting advancement, extending the capabilities of Large Language Models (LLMs) to understand speech. These models hold great promise for various speech understanding tasks, but they often face a significant challenge: generalizing their performance across different datasets, even for languages and tasks they were trained on. This issue raises questions about whether SLMs truly process speech in a text-like manner, as intended.

The core of this problem lies in what researchers call the “modality gap” between speech and text representations. Speech embeddings, which are the numerical representations of speech, can be much longer and contain more variability than text embeddings. They capture not just linguistic content but also paralinguistic features like pauses and speech rate variations. This complexity can lead SLMs to rely on these unintended speech variations for strong performance on familiar data, hindering their ability to generalize to new, unseen data.

To tackle this challenge, researchers have introduced a novel method called Optimal Transport Regularization (OTReg). This approach aims to bridge the modality gap by formulating speech-text alignment as an optimal transport problem. In simple terms, it finds the most efficient way to “transport” information from speech embeddings to transcript embeddings, establishing a structured correspondence between them.

OTReg works by first determining an “optimal transport plan” that maps speech embeddings to transcript embeddings. The plan minimizes the total transport cost, where the cost assigned to a speech-text pair is lower the more similar their embeddings are. Once this plan is established, a regularization loss is derived from it and incorporated into the SLM’s training process. This loss encourages the SLM to generate speech embeddings that align more effectively with their corresponding text representations.
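To make the idea of a transport plan and a loss derived from it more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the cosine-based cost, the uniform marginals, the entropy-regularized Sinkhorn solver, and the names `cosine_cost`, `sinkhorn_plan`, `ot_regularizer`, `eps`, and `n_iters` are all choices made here for the example. In actual training the same computation would be written in an autodiff framework so that gradients reach the speech encoder.

```python
import numpy as np

def cosine_cost(speech, text):
    """Cost matrix of shape (n_speech, n_text): 1 - cosine similarity.

    Assumption for this sketch: similar speech/text pairs get a low cost.
    """
    s = speech / np.linalg.norm(speech, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    return 1.0 - s @ t.T

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals.

    Uses plain Sinkhorn iterations; eps controls how soft the plan is.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass on each speech frame (assumed uniform)
    b = np.full(m, 1.0 / m)   # mass on each text token (assumed uniform)
    K = np.exp(-cost / eps)   # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # rescale rows toward marginal a
        v = b / (K.T @ u)     # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def ot_regularizer(speech, text, eps=0.1):
    """Return (loss, plan): the transport cost under the Sinkhorn plan."""
    cost = cosine_cost(speech, text)
    plan = sinkhorn_plan(cost, eps)
    return float(np.sum(plan * cost)), plan

# Toy data: 4 text tokens, 12 speech frames (3 noisy frames per token).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))
aligned_speech = np.repeat(text, 3, axis=0) + 0.01 * rng.normal(size=(12, 16))
random_speech = rng.normal(size=(12, 16))

loss_aligned, plan_aligned = ot_regularizer(aligned_speech, text)
loss_random, _ = ot_regularizer(random_speech, text)
print(loss_aligned < loss_random)
```

In this toy setup, speech frames that already sit close to their text tokens incur a much smaller transport cost than unrelated frames, which is the behavior a regularizer of this kind is meant to reward. A training objective would then add a weighted version of this term to the usual task loss; the weight is a hyperparameter not specified in this article.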

One of the key advantages of OTReg is that it is lightweight: it requires no additional labels and no new learnable parameters, so it is easy to integrate into existing SLM training procedures. The paper, “Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models,” details the method.

The training of SLMs with OTReg typically involves a two-stage process. The first stage uses standard supervised fine-tuning to get the model approximately compatible with the LLM’s text space. The second stage then integrates OTReg, including an OT-based compression method that further refines speech embeddings by merging repetitive segments and removing non-informative parts, similar to how text is condensed.
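As a rough illustration of how a transport plan could drive the merging of repetitive segments, the sketch below averages consecutive speech frames that the plan assigns to the same text token. This is only an assumed, simplified reading of the compression idea, not the paper's algorithm: the hard `argmax` assignment and the function name `merge_by_plan` are hypothetical, and the removal of non-informative parts is not modeled here.

```python
import numpy as np

def merge_by_plan(speech, plan):
    """Merge runs of consecutive frames that share the same best text token.

    speech: (n_frames, dim) speech embeddings.
    plan:   (n_frames, n_tokens) transport plan (any non-negative matrix).
    Returns the merged embeddings and the per-frame token assignment.
    """
    assign = plan.argmax(axis=1)  # assumed rule: token receiving the most mass
    merged, start = [], 0
    for i in range(1, len(assign) + 1):
        # Close the current run at the end or when the assigned token changes.
        if i == len(assign) or assign[i] != assign[start]:
            merged.append(speech[start:i].mean(axis=0))
            start = i
    return np.stack(merged), assign

# Toy example: 6 frames whose plan rows favor tokens 0, 0, 1, 1, 1, 2.
toy_speech = np.arange(12, dtype=float).reshape(6, 2)
toy_plan = np.eye(3)[[0, 0, 1, 1, 1, 2]] / 6.0
compressed, assignment = merge_by_plan(toy_speech, toy_plan)
print(compressed.shape)
```

Averaging is one simple way to collapse a run of frames; the six toy frames shrink to three embeddings, one per text token, which mirrors the goal of bringing the speech sequence length closer to that of the transcript.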

Extensive experiments, particularly in multilingual Automatic Speech Recognition (ASR) tasks, have demonstrated the effectiveness of OTReg. The results show that it significantly enhances speech-text alignment, successfully mitigates the modality gap, and consequently improves the generalization capabilities of SLMs across diverse datasets. This means SLMs trained with OTReg are better at understanding speech in a text-like manner, leading to more robust performance in real-world applications.

In conclusion, Optimal Transport Regularization offers a promising solution to a critical challenge in Spoken Language Models, paving the way for more reliable and versatile speech understanding systems.
