TLDR: Researchers from Samsung Research introduce Supervised Mixture of Experts (S-MoE), a novel architecture for multi-task Speech-to-Text (STT) models. S-MoE uses special guiding tokens to route tasks (like ASR and ST) and input types (narrowband/wideband audio) to dedicated expert networks, eliminating the need for complex gating functions. This approach mitigates task interference, improves performance (e.g., a 6.35% relative WER reduction), and maintains computational efficiency, making it well suited to resource-constrained environments.
Speech-to-Text (STT) models are crucial for converting spoken language into text, enabling applications like voice assistants and transcription services. Traditionally, these models are trained on high-quality wideband audio, but real-world scenarios often involve narrowband audio, especially from phone calls. Furthermore, many applications require a single model to perform multiple tasks, such as Automatic Speech Recognition (ASR) and Speech Translation (ST).
A common approach to multi-task learning is ‘hard-parameter sharing,’ where different tasks share the same model components. However, this often leads to a problem called ‘task interference,’ where optimizing for one task can negatively impact the performance of others. This makes it challenging to build a single, efficient model that can handle diverse audio types and perform multiple functions simultaneously, especially given the resource constraints of mobile and embedded devices.
To address these challenges, researchers from Samsung Research have proposed a novel architecture called Supervised Mixture of Experts (S-MoE). This approach builds upon the concept of Mixture of Experts (MoE) models, which allocate specialized parameters (experts) for different tasks. Unlike traditional MoE models that rely on complex ‘gating functions’ to dynamically route inputs to experts, S-MoE simplifies this process significantly.
The key innovation of S-MoE is the elimination of the need to train these gating functions. Instead, it uses special ‘guiding tokens’ that explicitly direct each task or input type to its designated expert network. This supervised routing mechanism makes the model more straightforward to train and ensures efficiency during both training and inference, as it avoids the computational overhead associated with dynamic gating.
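To make the routing idea concrete, here is a minimal PyTorch-style sketch (not the authors' code; class and argument names are illustrative): an integer expert id derived from the guiding token selects the feedforward expert directly, so no gating network is ever computed or trained.

```python
import torch
import torch.nn as nn

class SupervisedMoEFFN(nn.Module):
    """Feedforward block with several experts; the expert is chosen by supervision,
    not by a learned gate."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, expert_id: int) -> torch.Tensor:
        # x: (batch, seq_len, d_model). expert_id is known from the guiding token,
        # so no gating scores are computed during training or inference.
        return self.experts[expert_id](x)
```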
The S-MoE architecture is integrated into a Transformer-based STT model. In the encoder part of the model, S-MoE is used to handle different audio bandwidths. It assigns a separate ‘feedforward network’ (FFN) as a specialized expert for either narrowband (NB) or wideband (WB) audio signals. This allows the model to effectively process mixed-bandwidth inputs within a single framework, overcoming the performance degradation often seen when models trained on one bandwidth process another.
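Continuing the hypothetical layer above, the encoder side could use it roughly as follows, with the known bandwidth of each input selecting the narrowband or wideband expert (dimensions and names are illustrative, not taken from the paper):

```python
# Hypothetical encoder-side usage: the input's bandwidth picks the FFN expert.
NB_EXPERT, WB_EXPERT = 0, 1

encoder_ffn = SupervisedMoEFFN(d_model=512, d_ff=2048, num_experts=2)

def encoder_ffn_block(hidden_states: torch.Tensor, is_narrowband: bool) -> torch.Tensor:
    expert_id = NB_EXPERT if is_narrowband else WB_EXPERT
    return encoder_ffn(hidden_states, expert_id)
```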
Similarly, in the decoder part, S-MoE is applied to manage different tasks. By prepending specific ‘task tags’ (e.g., for ASR or ST) to the text input, the S-MoE directs the decoding process to the appropriate expert network. This enables the model to jointly perform ASR and ST, meaning it can transcribe speech and translate it into another language simultaneously.
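On the decoder side, the same idea might look like the sketch below, where a task tag prepended to the target sequence also determines the decoder expert. The tag strings and ids here are purely illustrative; the paper's exact tokens may differ.

```python
# Hypothetical decoder-side routing: the prepended task tag selects the expert.
TASK_TO_EXPERT = {"<asr>": 0, "<st>": 1}  # illustrative tag strings and expert ids

def prepare_decoder_input(task_tag: str, target_token_ids: list[int], vocab: dict[str, int]):
    """Prepend the task tag token and return the expert id it implies."""
    expert_id = TASK_TO_EXPERT[task_tag]
    decoder_input_ids = [vocab[task_tag]] + target_token_ids
    return decoder_input_ids, expert_id
```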
The researchers conducted extensive experiments on a large Korean speech corpus containing both narrowband and wideband variants, and evaluated the model using standard metrics: Word Error Rate (WER) for ASR and BLEU for ST. The results demonstrated the effectiveness of the proposed approach. When S-MoE was applied to the decoder, the resulting DecS-MoE model consistently outperformed both the baseline and baselines with increased parameter counts, improving ASR and ST performance without the task interference seen under hard-parameter sharing.
Further gains were observed when S-MoE was applied to both the encoder and decoder (EncDecS-MoE). This combined configuration achieved a 6.35% relative reduction in Word Error Rate and a 1.63% gain in BLEU score in narrowband conditions. Crucially, although the S-MoE models slightly increase the total number of trainable parameters, they keep the same number of ‘active’ parameters as the baseline during inference, so they deliver better accuracy without slowing down processing, making them well suited to resource-constrained environments.
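A rough back-of-the-envelope calculation (with assumed layer sizes, not the paper's) shows why the active-parameter count stays flat: adding a second FFN expert roughly doubles what is stored for that layer, but only one expert runs per token.

```python
# Illustrative parameter bookkeeping for a single layer with two FFN experts.
d_model, d_ff = 512, 2048
ffn_params = 2 * d_model * d_ff     # two linear projections, biases ignored
total_params = 2 * ffn_params       # both experts live in the checkpoint
active_params = ffn_params          # only the selected expert is evaluated
print(total_params, active_params)  # 4194304 2097152
```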
In conclusion, the Supervised Mixture of Experts (S-MoE) architecture offers a simple yet effective solution for multi-task speech-to-text modeling. By using guiding tokens for task routing, it mitigates task interference and efficiently handles mixed-bandwidth inputs while jointly performing ASR and ST. This innovation paves the way for more versatile and efficient STT systems, particularly beneficial for mobile and embedded applications. For more details, you can refer to the full research paper: Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts.


