
UniSE: A Unified Language Model Framework for Comprehensive Speech Enhancement

TLDR: UniSE is a novel framework developed by Alibaba Group that unifies multiple speech enhancement tasks, including speech restoration, target speaker extraction, and speech separation, using a single decoder-only autoregressive language model. It achieves competitive performance across various benchmarks with high parameter efficiency and demonstrates strong generalization abilities by defining distinct operational modes through task-specific tokens.

The field of speech enhancement (SE) has seen significant advancements, moving beyond simple noise reduction to encompass a broader range of tasks like restoring degraded speech, extracting a target speaker from a mixture, and separating multiple speakers. While deep neural networks have become mainstream in this area, and language models (LMs) have shown promise, existing LM-based approaches often focus on a single distortion or task, limiting their versatility.

Addressing this limitation, researchers from Alibaba Group have introduced UniSE, a unified framework designed to handle multiple speech enhancement sub-tasks within a single decoder-only autoregressive language model. UniSE aims to bring together speech restoration (SR), target speaker extraction (TSE), and speech separation (SS) under one cohesive system.

How UniSE Works

At its core, UniSE leverages a decoder-only LM, similar to the LLaMA architecture, to model the conditional probability distribution of target speech. It takes features from degraded and, optionally, reference speech as conditions to generate discrete tokens of the clean target speech. These tokens are then used to reconstruct the waveform using a neural audio codec (NAC), specifically BiCodec, which is known for high reconstruction quality.
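The conditional modeling described above can be written as a standard autoregressive factorization. This is a sketch in our own notation rather than the paper's: $y_1, \dots, y_T$ stand for the discrete codec tokens of the clean target speech, $\mathbf{c}_{\text{deg}}$ for the features of the degraded speech, and $\mathbf{c}_{\text{ref}}$ for the optional reference-speech features.

```latex
p_\theta\left(y_{1:T} \mid \mathbf{c}_{\text{deg}}, \mathbf{c}_{\text{ref}}\right)
  = \prod_{t=1}^{T} p_\theta\left(y_t \mid y_{<t},\, \mathbf{c}_{\text{deg}},\, \mathbf{c}_{\text{ref}}\right)
```

Each factor is one next-token prediction by the decoder-only LM. When no reference speech is supplied, $\mathbf{c}_{\text{ref}}$ is simply dropped from the conditioning.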

A crucial component of UniSE is its conditional feature extractor, which uses a pre-trained WavLM model with a learnable adapter. WavLM, a self-supervised learning model, extracts rich acoustic and semantic information from speech. BiCodec, for its part, converts continuous speech into discrete global and semantic tokens, which are essential for the LM’s autoregressive modeling.

To unify different tasks, UniSE defines three operational modes: SR mode, TSE mode, and reverse TSE (rTSE) mode. Each mode is distinguished by a unique, learnable task-specific token. By switching and combining these modes, UniSE can adapt to various SE scenarios. For instance, in SR mode, it restores clean speech from a degraded recording. In TSE mode, it extracts speech matching a reference speaker’s timbre. For speech separation, UniSE employs a multi-inference strategy, using SR to identify the louder speaker, then TSE to extract that speaker, and finally rTSE to isolate the remaining speaker.
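The multi-inference strategy for speech separation can be sketched as three calls to the same model in different modes. This is a minimal illustration, not the authors' implementation: the `enhance` callable, the `Mode` enum, and the token strings are hypothetical names, and we assume that the SR output serves as the reference for both the TSE and rTSE passes, which the description above does not spell out.

```python
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple


class Mode(Enum):
    """Operational modes; the token strings are hypothetical placeholders."""
    SR = "<sr>"      # speech restoration
    TSE = "<tse>"    # target speaker extraction
    RTSE = "<rtse>"  # reverse TSE: extract the speaker NOT matching the reference


Audio = Sequence[float]
# Hypothetical model interface: (mode, degraded speech, optional reference) -> output speech
EnhanceFn = Callable[[Mode, Audio, Optional[Audio]], Audio]


def separate_two_speakers(enhance: EnhanceFn, mixture: Audio) -> Tuple[Audio, Audio]:
    """Split a two-speaker mixture using three passes over one model."""
    # Pass 1: SR mode with no reference, used to identify the louder speaker.
    dominant_estimate = enhance(Mode.SR, mixture, None)
    # Pass 2: TSE mode, extracting the speaker that matches that estimate.
    # (Assumption: the SR output is reused as the reference signal.)
    first = enhance(Mode.TSE, mixture, dominant_estimate)
    # Pass 3: rTSE mode, isolating the remaining speaker.
    # (Assumption: the same reference is reused here.)
    second = enhance(Mode.RTSE, mixture, dominant_estimate)
    return first, second
```

Because the mode is just an argument, the same weights serve all three passes; only the task token and the presence of a reference change between calls.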

Performance and Generalization

The experimental results demonstrate UniSE’s competitive performance across several benchmarks. On the 2020 DNS Challenge test sets, UniSE achieved state-of-the-art speech restoration performance. Notably, it did so with only 63 million parameters, roughly one-sixteenth the size of models such as LLaSE-G1, which uses approximately 1 billion parameters, highlighting UniSE’s parameter efficiency.

UniSE also showed strong results on the 2025 URGENT Challenge test set for SR, even with unseen distortions like codec artifacts and wind noise, indicating excellent generalization capabilities. For target speaker extraction, UniSE achieved comparable performance to other advanced baselines on the Libri2Mix clean test set. In speech separation, UniSE outperformed both discriminative and generative models on Libri2Mix and WSJ0-2mix test sets, validating its multi-mode inference strategy.

An ablation study confirmed the framework’s adaptability to different LM architectures (Qwen2, GLM) and underscored the importance of the chosen neural audio codec for overall performance.

Conclusion

UniSE represents a significant step towards a more unified and versatile speech enhancement system. By employing a decoder-only autoregressive language model and a task-token mechanism, it integrates speech restoration, target speaker extraction, and speech separation into a single framework. The promising results suggest the potential of LMs to unify a broader range of audio tasks in the future. For more details, refer to the full research paper.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
