UniSE: A Unified Language Model Framework for Comprehensive Speech Enhancement

TLDR: UniSE is a novel framework developed by Alibaba Group that unifies multiple speech enhancement tasks, including speech restoration, target speaker extraction, and speech separation, using a single decoder-only autoregressive language model. It achieves competitive performance across various benchmarks with high parameter efficiency and demonstrates strong generalization abilities by defining distinct operational modes through task-specific tokens.

The field of speech enhancement (SE) has seen significant advancements, moving beyond simple noise reduction to encompass a broader range of tasks like restoring degraded speech, extracting a target speaker from a mixture, and separating multiple speakers. While deep neural networks have become mainstream in this area, and language models (LMs) have shown promise, existing LM-based approaches often focus on a single distortion or task, limiting their versatility.

Addressing this limitation, researchers from Alibaba Group have introduced UniSE, a unified framework designed to handle multiple speech enhancement sub-tasks within a single decoder-only autoregressive language model. UniSE aims to bring together speech restoration (SR), target speaker extraction (TSE), and speech separation (SS) under one cohesive system.

How UniSE Works

At its core, UniSE leverages a decoder-only LM, similar to the LLaMA architecture, to model the conditional probability distribution of target speech. It takes features from degraded and, optionally, reference speech as conditions to generate discrete tokens of the clean target speech. These tokens are then used to reconstruct the waveform using a neural audio codec (NAC), specifically BiCodec, which is known for high reconstruction quality.
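The conditional modeling described above can be written as a standard autoregressive factorization. The notation is an assumption made here for illustration and is not taken from the paper: $y_{1:T}$ denotes the discrete target-speech tokens, $c_d$ the features of the degraded speech, and $c_r$ the features of the optional reference speech (absent when no reference is supplied).

```latex
p_\theta\left(y_{1:T} \mid c_d, c_r\right)
  = \prod_{t=1}^{T} p_\theta\left(y_t \mid y_{<t},\, c_d,\, c_r\right)
```

Each factor is one next-token prediction by the decoder-only LM, and the sampled tokens are then passed to the codec decoder to reconstruct the waveform.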

A crucial component of UniSE is its conditional feature extractor, which uses a pre-trained WavLM model with a learnable adapter. WavLM, a self-supervised learning model, extracts rich acoustic and semantic information from speech. The BiCodec then converts continuous speech into discrete global and semantic features, which are essential for the LM’s autoregressive modeling.
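Taken together with the preceding paragraph, a single enhancement pass can be sketched as three stages: feature extraction, autoregressive token generation, and codec decoding. The code below is a minimal sketch under stated assumptions, not the authors' implementation: the three callables are hypothetical stand-ins for WavLM plus adapter, the LM, and the codec decoder, and the end-of-sequence token, the token limit, and the order in which reference and degraded features are concatenated are all illustrative choices.

```python
from typing import Callable, List, Optional, Sequence

Waveform = Sequence[float]
Features = List[List[float]]  # one feature vector per frame
Tokens = List[int]


def enhance_once(
    extract_features: Callable[[Waveform], Features],  # stand-in for WavLM + adapter
    next_token: Callable[[Features, Tokens], int],      # stand-in for the decoder-only LM
    decode_tokens: Callable[[Tokens], Waveform],        # stand-in for the codec decoder
    degraded: Waveform,
    reference: Optional[Waveform] = None,
    eos_token: int = 0,       # assumed end-of-sequence id
    max_tokens: int = 1000,   # assumed safety limit on generated length
) -> Waveform:
    """Run one hypothetical conditioning -> generation -> reconstruction pass."""
    # Conditions: degraded-speech features, plus reference features when given.
    # Placing the reference first is an assumption, not a documented layout.
    conditions = extract_features(degraded)
    if reference is not None:
        conditions = extract_features(reference) + conditions

    # Autoregressive loop: each token depends on the conditions and on
    # every token generated so far.
    tokens: Tokens = []
    while len(tokens) < max_tokens:
        token = next_token(conditions, tokens)
        if token == eos_token:
            break
        tokens.append(token)

    # The codec decoder turns the discrete tokens back into a waveform.
    return decode_tokens(tokens)
```

Passing the components as callables keeps the sketch independent of any particular model library, so it can be exercised with simple stubs.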

To unify different tasks, UniSE defines three operational modes: SR mode, TSE mode, and reverse TSE (rTSE) mode. Each mode is distinguished by a unique, learnable task-specific token. By switching and combining these modes, UniSE can adapt to various SE scenarios. For instance, in SR mode, it restores clean speech from a degraded recording. In TSE mode, it extracts speech matching a reference speaker’s timbre. For speech separation, UniSE employs a multi-inference strategy, using SR to identify the louder speaker, then TSE to extract that speaker, and finally rTSE to isolate the remaining speaker.
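The multi-inference strategy for speech separation can be sketched as three chained calls to the same model, one per mode. This is an illustrative sketch only: `Mode`, `EnhanceFn`, and `separate_two_speakers` are hypothetical names, and reusing the SR output as the reference for both the TSE and rTSE passes is an assumption about how the passes are wired together.

```python
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple

Waveform = Sequence[float]


class Mode(Enum):
    """The three operational modes named in the article."""
    SR = "sr"      # speech restoration
    TSE = "tse"    # target speaker extraction
    RTSE = "rtse"  # reverse TSE: the speaker that does not match the reference


# Hypothetical model interface: (mode, degraded input, optional reference) -> output
EnhanceFn = Callable[[Mode, Waveform, Optional[Waveform]], Waveform]


def separate_two_speakers(
    enhance: EnhanceFn, mixture: Waveform
) -> Tuple[Waveform, Waveform]:
    """Chain SR -> TSE -> rTSE to split a two-speaker mixture."""
    # Pass 1: SR mode without a reference, which the article says
    # identifies the louder speaker.
    dominant_estimate = enhance(Mode.SR, mixture, None)
    # Pass 2: TSE mode, using the pass-1 output as the reference (assumption).
    speaker_a = enhance(Mode.TSE, mixture, dominant_estimate)
    # Pass 3: rTSE mode with the same reference isolates the remaining speaker.
    speaker_b = enhance(Mode.RTSE, mixture, dominant_estimate)
    return speaker_a, speaker_b
```

Because only the mode argument changes between passes, a single set of model weights serves all three steps, which is the point of the task-token design.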

Performance and Generalization

The experimental results demonstrate UniSE’s competitive performance across several benchmarks. On the 2020 DNS Challenge test sets, UniSE achieved state-of-the-art speech restoration performance. Notably, it did so with only 63 million parameters, roughly one-sixteenth the size of models like LLaSE-G1, which uses approximately 1 billion parameters, highlighting UniSE’s parameter efficiency.

UniSE also showed strong SR results on the 2025 URGENT Challenge test set, even with unseen distortions such as codec artifacts and wind noise, indicating that it generalizes well. For target speaker extraction, UniSE performed comparably to other advanced baselines on the Libri2Mix clean test set. In speech separation, UniSE outperformed both discriminative and generative models on the Libri2Mix and WSJ0-2mix test sets, validating its multi-mode inference strategy.

An ablation study confirmed the framework’s adaptability to different LM architectures (Qwen2, GLM) and underscored the importance of the chosen neural audio codec for overall performance.

Conclusion

UniSE represents a significant step towards a more unified and versatile speech enhancement system. By employing a decoder-only autoregressive language model and a task-token mechanism, it integrates speech restoration, target speaker extraction, and speech separation into a single framework. The promising results suggest the potential of LMs to unify a broader range of audio tasks in the future. For more details, refer to the full research paper.
