Transsion's Multilingual ASR System Ranks Third in MLC-SLM 2025 Challenge

TLDR: Transsion’s Speech Team developed a novel Multilingual Automatic Speech Recognition (ASR) system for the MLC-SLM 2025 Challenge, ranking third globally. The system uses a frozen Whisper-large-v3 encoder, a trainable adaptor, and a frozen Qwen2.5-7B-Instruct LLM with LoRA. By leveraging a diverse dataset including the challenge’s 1,500 hours of conversational speech across 11 languages and an MSR-86K subset, and through meticulous training and ablation studies, the system achieved a 9.83% Word/Character Error Rate (WER/CER) on the evaluation set.

In the rapidly evolving landscape of artificial intelligence, Automatic Speech Recognition (ASR) systems are becoming increasingly vital, especially those capable of understanding and transcribing multiple languages. The scarcity of real-world conversational speech data, particularly in diverse multilingual contexts, has long been a significant hurdle for advancing this field. To address this, the MLC-SLM 2025 Challenge introduced a groundbreaking dataset featuring 1,500 hours of real-world dialogue recordings across 11 different languages, providing a crucial resource for developing advanced multilingual LLM-based ASR models.

Against this backdrop, the Transsion Speech Team developed a Multilingual ASR system for Track 1 of the MLC-SLM 2025 Challenge and placed third among global participants. The team's paper, available as arXiv:2508.14916, describes the architecture and performance of the system.

System Architecture: A Three-Part Harmony

The Transsion system is built upon a sophisticated Encoder-Adaptor-LLM architecture, meticulously designed to handle the complexities of multilingual speech. It comprises three main components:

First, a Frozen Whisper-large-v3 based speech encoder forms the foundation. This component is responsible for robust acoustic feature extraction, leveraging its extensive pretraining to convert spoken words into a digital representation that the system can understand. By keeping this encoder “frozen,” its powerful, pre-trained capabilities are preserved throughout the training process.

Second, a Trainable Adaptor module acts as a crucial bridge. This module uses a Linear-ReLU-Linear transformation mechanism to effectively align the representations generated by the speech encoder with the text-based representations expected by the large language model. It also performs a frame-splicing operation to reduce the temporal resolution, enhancing computational efficiency without sacrificing accuracy.
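
To make the adaptor concrete, here is a minimal sketch in plain Python. It assumes that frame splicing concatenates every k consecutive encoder frames into one longer frame and that the Linear-ReLU-Linear projection is applied afterwards. The article does not give the splice factor, the layer sizes, or the order of the two steps, so the function names and all dimensions below are hypothetical.

```python
import random

def splice_frames(frames, k):
    """Concatenate every k consecutive frames into one longer frame."""
    usable = len(frames) - len(frames) % k  # drop a ragged tail, if any
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]

def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_linear(out_dim, in_dim, rng):
    """Small random weights and zero bias (illustrative initialisation)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def adaptor(frames, k, layer1, layer2):
    """Splice frames, then project each one with Linear-ReLU-Linear."""
    return [linear(relu(linear(f, *layer1)), *layer2)
            for f in splice_frames(frames, k)]

rng = random.Random(0)
encoder_dim, hidden_dim, llm_dim, k = 4, 8, 6, 4  # toy sizes, not the real ones
frames = [[rng.random() for _ in range(encoder_dim)] for _ in range(10)]
layer1 = init_linear(hidden_dim, encoder_dim * k, rng)
layer2 = init_linear(llm_dim, hidden_dim, rng)
outputs = adaptor(frames, k, layer1, layer2)
print(len(frames), "frames in,", len(outputs), "frames out")
```

With these toy sizes the sequence handed to the LLM is a quarter of the encoder output length, which is where the efficiency gain comes from: the LLM attends over far fewer positions.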

Third, a Frozen Qwen2.5-7B-Instruct large language model (LLM) is integrated, enhanced with trainable Low-Rank Adaptation (LoRA). The LLM is responsible for contextual linguistic decoding, essentially turning the processed speech features into coherent text. LoRA is a parameter-efficient fine-tuning method that allows the system to adapt the LLM for the ASR task by training only a small subset of additional parameters, keeping the majority of the LLM’s vast knowledge base intact.
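
The standard LoRA formulation adds a low-rank update to a frozen weight matrix, h = W x + (alpha / r) * B (A x), and trains only A and B. The sketch below shows this in plain Python on toy matrices. The rank, scaling factor, matrix sizes, and choice of adapted layers in the Transsion system are not stated in the article, so every number here is an illustrative assumption.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def lora_forward(x, W, A, B, alpha, rank):
    """Frozen path W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)                 # frozen pretrained weights
    update = matvec(B, matvec(A, x))    # trainable low-rank path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight (toy example)
A = [[1.0, 1.0]]               # rank-1 down-projection
B_zero = [[0.0], [0.0]]        # B is typically initialised to zero
x = [1.0, 2.0]

# With B at zero the adapted layer reproduces the frozen layer exactly.
print(lora_forward(x, W, A, B_zero, alpha=2.0, rank=1))

# Hypothetical 4096 x 4096 projection with rank 8.
print(lora_param_count(4096, 4096, 8), "trainable vs", 4096 * 4096, "frozen")
```

For that hypothetical projection, the adapter trains 65,536 parameters against 16,777,216 frozen ones, which illustrates why the approach is described as parameter-efficient.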

Data and Training: Fueling Performance

The system’s development relied on a rich dataset. Beyond the 1,500 hours of conversational speech provided by the MLC-SLM 2025 Challenge across languages like English, French, German, Italian, Portuguese, Spanish, Japanese, Korean, Russian, Thai, and Vietnamese, Transsion also incorporated a subset of the open-source MSR-86K dataset. This external data resource further enhanced the model’s generalization capabilities.

During training, text normalization was applied, removing punctuation and converting all text to lowercase, consistent with official baselines. The transcriptions were then structured using the Qwen chat template. The training itself was a rigorous process, utilizing eight NVIDIA A100 GPUs and an effective total batch size of 64. An Adam optimizer with a linearly decaying learning rate and a warm-up phase was employed over 4 epochs. Post-training, the best-performing model checkpoints were averaged to achieve optimal results.
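
Three of the steps described above can be sketched in a few lines of plain Python: text normalization, a learning rate with linear warm-up and linear decay, and checkpoint averaging. This is an illustration under assumptions rather than the team's code. The exact normalization rules, peak learning rate, warm-up length, and number of averaged checkpoints are not given in the article, so the function names and values are hypothetical.

```python
import unicodedata

def normalize_text(text):
    """Lowercase the text and drop every Unicode punctuation character."""
    kept = (ch for ch in text.lower()
            if not unicodedata.category(ch).startswith("P"))
    # Collapse whitespace runs left behind by removed punctuation.
    return " ".join("".join(kept).split())

def lr_at_step(step, peak_lr, warmup_steps, total_steps):
    """Linear warm-up to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list
    of floats; all checkpoints must share the same names and shapes.
    """
    n = len(checkpoints)
    return {name: [sum(vals) / n
                   for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
            for name in checkpoints[0]}

print(normalize_text("Hello, World!"))
print([lr_at_step(s, 1e-4, 100, 1000) for s in (0, 50, 100, 550, 1000)])
print(average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]))
```

Note that stripping all punctuation this way also removes apostrophes, so "don't" becomes "dont". The official baseline may treat such cases differently.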

Ablation Studies: Proving the Impact

Ablation studies measured the contribution of individual components and strategies. They showed progressive improvements from expanding the training data and from scaling the model. For instance, adding the MSR-86K subset to the training set reduced the Word/Character Error Rate (WER/CER). The best performance came from combining increased model capacity (using Qwen2.5-7B-Instruct) with the more diverse training data, which suggests the two approaches are complementary.

The final submission system, which further incorporated the development set into its training, achieved a remarkable WER/CER of 9.83% across all 11 languages in the evaluation set. This outstanding performance secured its third-place position in the challenge.
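
For readers unfamiliar with the metric, WER and CER are both edit distances normalized by reference length: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words (WER) or characters (CER). Character-level scoring is commonly used for languages written without spaces between words. The sketch below is a generic illustration in plain Python. How the challenge assigns languages to WER or CER and aggregates across them is not described in the article, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate, ignoring spaces; assumes a non-empty reference."""
    ref_chars = reference.replace(" ", "")
    return edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)

print(wer("the cat sat", "the cat sit"))  # one substitution in three words
print(cer("abcd", "abed"))                # one substitution in four characters
```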

Conclusion: A Step Forward in Multilingual ASR

The Transsion Speech Team’s work represents a significant advancement in Multilingual Automatic Speech Recognition. By skillfully integrating a frozen Whisper-large-v3 encoder, a trainable adaptor, and a frozen Qwen2.5-7B-Instruct LLM fine-tuned with LoRA, they have created a robust and efficient system. Their success underscores the critical importance of combining advanced architectural designs with diverse and high-quality datasets to push the boundaries of multilingual ASR capabilities.
