TL;DR: Transsion’s Speech Team built a multilingual Automatic Speech Recognition (ASR) system for the MLC-SLM 2025 Challenge and ranked third globally. The system combines a frozen Whisper-large-v3 encoder, a trainable adaptor, and a frozen Qwen2.5-7B-Instruct LLM adapted with trainable LoRA weights. Trained on the challenge’s 1,500 hours of conversational speech across 11 languages plus a subset of MSR-86K, and refined through ablation studies, it achieved a 9.83% Word/Character Error Rate (WER/CER) on the evaluation set.
Automatic Speech Recognition (ASR) systems that can understand and transcribe multiple languages are becoming increasingly important. Progress has long been held back by the scarcity of real-world conversational speech data, particularly in multilingual settings. To address this, the MLC-SLM 2025 Challenge introduced a dataset of 1,500 hours of real-world dialogue recordings across 11 languages, a resource aimed at developing multilingual LLM-based ASR models.
Against this backdrop, the Transsion Speech Team developed a multilingual ASR system for Track 1 of the MLC-SLM 2025 Challenge and placed third among global participants. Their paper, available at arXiv:2508.14916, describes the system’s architecture and performance.
System Architecture: A Three-Part Harmony
The Transsion system follows an Encoder-Adaptor-LLM architecture designed for multilingual speech. It comprises three main components:
First, a frozen Whisper-large-v3 speech encoder forms the foundation. It performs acoustic feature extraction, drawing on its extensive pretraining to convert audio into a sequence of feature vectors for the rest of the system. Keeping the encoder “frozen” means its weights are not updated, so its pre-trained capabilities are preserved throughout training.
Second, a trainable adaptor module acts as the bridge between speech and text. It uses a Linear-ReLU-Linear transformation to align the representations produced by the speech encoder with the text-based representations expected by the large language model. It also performs a frame-splicing operation that reduces the temporal resolution, so the LLM receives a shorter sequence; the article describes this as improving computational efficiency without sacrificing accuracy.
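To make the adaptor concrete, here is a minimal pure-Python sketch of frame splicing followed by a Linear-ReLU-Linear projection. It is an illustration under stated assumptions, not the team's implementation: the splice factor, the layer widths, the handling of leftover frames, and all function names are hypothetical, and a real system would use a tensor library rather than nested lists.

```python
import random


def splice_frames(frames, k):
    """Concatenate every k consecutive frames into one wider frame.

    Trailing frames that do not fill a complete group are dropped
    (an assumption of this sketch; padding is another option).
    """
    usable = len(frames) - len(frames) % k
    return [
        [value for frame in frames[i:i + k] for value in frame]
        for i in range(0, usable, k)
    ]


def linear(x, weight, bias):
    """Affine map y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def adaptor(frames, k, w1, b1, w2, b2):
    """Frame splicing, then Linear-ReLU-Linear on each spliced frame."""
    return [linear(relu(linear(f, w1, b1)), w2, b2) for f in splice_frames(frames, k)]


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, all hypothetical: 10 encoder frames of width 4, splice factor 5,
# hidden width 8, LLM embedding width 6.
rng = random.Random(0)
enc_dim, k, hidden_dim, llm_dim = 4, 5, 8, 6
frames = random_matrix(10, enc_dim, rng)
w1, b1 = random_matrix(hidden_dim, enc_dim * k, rng), [0.0] * hidden_dim
w2, b2 = random_matrix(llm_dim, hidden_dim, rng), [0.0] * llm_dim

projected = adaptor(frames, k, w1, b1, w2, b2)
print(len(frames), "->", len(projected), "frames of width", len(projected[0]))
```

With these toy sizes, 10 input frames become 2 output frames, which shows how splicing shortens the sequence the LLM has to process while the first linear layer absorbs the wider spliced input.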
Third, a frozen Qwen2.5-7B-Instruct large language model (LLM) is integrated and extended with trainable Low-Rank Adaptation (LoRA) weights. The LLM handles contextual linguistic decoding, turning the processed speech features into coherent text. LoRA is a parameter-efficient fine-tuning method: it adapts the LLM to the ASR task by training only a small set of additional low-rank parameters, while the original LLM weights stay frozen and the model’s existing knowledge is left intact.
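The idea behind LoRA can be sketched in a few lines. In the standard formulation, a frozen weight matrix W is left untouched and a trainable low-rank product B·A, scaled by alpha / rank, is added to its output. The sketch below assumes that formulation; the rank, scaling factor, and layer size are illustrative values and are not taken from the paper.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def lora_forward(x, w_frozen, a, b, alpha, rank):
    """y = W x + (alpha / rank) * B (A x).

    w_frozen: d_out x d_in, never updated during fine-tuning.
    a:        rank x d_in,  trainable.
    b:        d_out x rank, trainable (commonly initialized to zero).
    """
    base = matvec(w_frozen, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in A and B for one adapted weight matrix."""
    return rank * (d_in + d_out)


# Illustrative sizes only: one 4096 x 4096 projection adapted at rank 8.
d_in = d_out = 4096
rank = 8
full = d_in * d_out
extra = lora_trainable_params(d_in, d_out, rank)
print(f"trainable fraction for this layer: {extra / full:.4%}")
```

Because B is commonly initialized to zero, the adapted layer starts out identical to the frozen one, and training moves it only through the small A and B matrices.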
Data and Training: Fueling Performance
The system’s development relied on a rich dataset. Beyond the 1,500 hours of conversational speech provided by the MLC-SLM 2025 Challenge across languages like English, French, German, Italian, Portuguese, Spanish, Japanese, Korean, Russian, Thai, and Vietnamese, Transsion also incorporated a subset of the open-source MSR-86K dataset. This external data resource further enhanced the model’s generalization capabilities.
During training, text normalization was applied, removing punctuation and converting all text to lowercase, consistent with the official baselines. The transcriptions were then structured using the Qwen chat template. Training ran on eight NVIDIA A100 GPUs with an effective total batch size of 64. An Adam optimizer with a warm-up phase followed by a linearly decaying learning rate was used over 4 epochs. After training, the best-performing model checkpoints were averaged to produce the final model.
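Three of the training steps mentioned here are easy to illustrate: text normalization, a warm-up plus linear-decay learning-rate schedule, and checkpoint averaging. The following is a rough sketch under assumptions. The exact normalization rules of the official baseline, the peak learning rate, the step counts, and the number of averaged checkpoints are not given in this article, so every value and function name below is hypothetical.

```python
import unicodedata


def normalize_text(text):
    """Drop Unicode punctuation, lowercase, and collapse whitespace.

    Assumption: "punctuation" means any character whose Unicode category
    starts with "P"; the official baseline may use a different rule.
    """
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(kept.lower().split())


def lr_at(step, peak_lr, warmup_steps, total_steps):
    """Linear warm-up from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


def average_checkpoints(checkpoints):
    """Element-wise mean of parameters across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats; all checkpoints must share the same names and shapes.
    """
    n = len(checkpoints)
    return {
        name: [sum(values) / n for values in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }


# Hypothetical values for illustration.
print(normalize_text("Hello, World!"))
print(lr_at(step=550, peak_lr=1e-4, warmup_steps=100, total_steps=1000))
print(average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]))
```

Averaging the parameters of several strong checkpoints is a common way to smooth out step-to-step noise at the end of training, which is presumably why it is applied here.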
Ablation Studies: Proving the Impact
Ablation studies were conducted to measure the impact of individual components and strategies. They showed progressive improvements from additional training data and from model scaling. For instance, expanding the training set with the MSR-86K subset reduced the Word/Character Error Rate (WER/CER). The best performance came from combining increased model capacity (using Qwen2.5-7B-Instruct) with diversified training data, indicating that the two approaches complement each other.
The final submission system, which additionally incorporated the development set into its training, achieved a WER/CER of 9.83% across all 11 languages in the evaluation set. This result secured its third-place position in the challenge.
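For readers unfamiliar with the metric, WER and CER are both edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The sketch below assumes whitespace tokenization for WER and space-stripped characters for CER; the challenge's official scoring tool, and which languages it scores with CER, are not described in this article.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces."""
    return error_rate(list(reference.replace(" ", "")), list(hypothesis.replace(" ", "")))


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Character-level scoring is typically used for languages whose writing systems do not separate words with spaces, which is why challenge results are reported as a combined WER/CER figure.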
Conclusion: A Step Forward in Multilingual ASR
The Transsion Speech Team’s work represents a significant advancement in Multilingual Automatic Speech Recognition. By skillfully integrating a frozen Whisper-large-v3 encoder, a trainable adaptor, and a frozen Qwen2.5-7B-Instruct LLM fine-tuned with LoRA, they have created a robust and efficient system. Their success underscores the critical importance of combining advanced architectural designs with diverse and high-quality datasets to push the boundaries of multilingual ASR capabilities.


