TLDR: This research applies Direct Preference Optimization (DPO) to speech enhancement, aligning a generative language model with human perceptual preferences. UTMOS, a neural MOS prediction model, serves as a proxy for human ratings, so the system learns to produce enhanced speech that listeners are likely to rate higher. The approach improves speech quality metrics on both noisy and reverberant audio. According to the authors, it is the first application of DPO to speech enhancement and the first to incorporate proxy perceptual feedback into LM-based SE training.
Speech enhancement aims to remove unwanted noise and distortion from speech, making it clearer and easier to understand. This matters for applications ranging from hearing aids and telecommunications to voice recognition systems. Traditionally, speech enhancement methods have been trained to minimize the difference between the enhanced output and the clean reference speech. While effective in many cases, these methods can struggle in unseen acoustic environments and can introduce audible artifacts.
More recently, generative methods, particularly those built on Language Models (LMs), have shown promise in speech enhancement. These LM-based systems learn the underlying patterns of clean speech and use them to generate enhanced audio. However, a key challenge remains: LMs are usually trained on token-level objectives, such as predicting speech tokens accurately. This doesn’t always translate to what human listeners care about: naturalness, comfort, and overall perceptual quality. A technically “accurate” enhancement might still sound unnatural or unpleasant to a human ear.
To bridge this gap between technical accuracy and human perception, researchers have explored various alignment techniques. One such powerful and elegant method is Direct Preference Optimization (DPO). Originating in the field of natural language processing, DPO offers a simpler and more stable way to align a model’s outputs with human preferences, without the complexities often associated with traditional reinforcement learning approaches.
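To make the DPO idea concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It assumes the summed log-probabilities of the preferred and rejected outputs are already available from both the model being trained and a frozen reference copy; the function names and the default `beta` value are illustrative choices, not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (preferred, rejected) pair.

    Each argument is the summed log-probability of a full output sequence
    under either the trainable policy or the frozen reference model.
    beta is an assumed illustrative value controlling how far the policy
    may drift from the reference.
    """
    # How much more likely each output is under the policy than the reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy assigns relatively more probability to the preferred output and less to the rejected one, which is why no separate reward model or reinforcement learning loop is needed.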
A recent research paper, titled “Aligning Generative Speech Enhancement with Human Preferences via Direct Preference Optimization,” explores the application of DPO to speech enhancement. Authored by Haoyang Li, Nana Hou, Yuchen Hu, Jixun Yao, Sabato Marco Siniscalchi, and Eng Siong Chng, this work introduces an approach to improve the perceptual quality of enhanced speech.
The core idea behind their method is to use a neural MOS (Mean Opinion Score) prediction model, UTMOS, as a stand-in for human ratings. Collecting ratings from human listeners at every training step would be prohibitively costly and slow, so UTMOS instead provides an automated estimate of how listeners would rate the quality of enhanced speech. This allows the system to receive “perceptual feedback” during its training.
The process involves taking a pre-trained LM-based speech enhancement model, specifically GenSE, and fine-tuning it using DPO. The system generates multiple versions of enhanced speech from the same noisy input. These versions are then evaluated by UTMOS, which assigns a quality score to each. Based on these scores, “preferred” (higher quality) and “rejected” (lower quality) speech samples are identified. DPO then uses these pairs to teach the model to produce more of what UTMOS (and by proxy, humans) would prefer, and less of what it would reject.
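The pair-selection step described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: `build_preference_pair`, `score_fn`, and `min_gap` are hypothetical names, and picking the highest- and lowest-scored candidates is one assumed strategy among several possible ones. In practice `score_fn` would wrap a MOS predictor such as UTMOS; here it is any callable that returns a number.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_preference_pair(
    candidates: Sequence[T],
    score_fn: Callable[[T], float],
    min_gap: float = 0.0,
) -> Optional[Tuple[T, T]]:
    """Return (preferred, rejected) from several enhanced versions of one input.

    candidates: enhanced outputs generated from the same noisy input.
    score_fn:   stand-in for a MOS predictor; higher means better quality.
    min_gap:    assumed filter; pairs whose scores differ by no more than
                this are skipped because they carry little preference signal.
    """
    if len(candidates) < 2:
        return None
    scores = [score_fn(c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] - scores[worst] <= min_gap:
        return None
    return candidates[best], candidates[worst]
```

For example, with a toy scorer, `build_preference_pair(["a", "bbb", "bb"], len)` returns the longest string as "preferred" and the shortest as "rejected". The resulting pairs are what a DPO loss consumes during fine-tuning.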
Experiments conducted on the 2020 Deep Noise Suppression Challenge test sets demonstrated significant improvements. Applying DPO to the pre-trained GenSE model led to consistent gains across various speech quality metrics, including DNSMOS, NISQA, and UTMOS itself. Notably, the model showed relative gains of up to 56% in UTMOS and 19% in NISQA on speech with reverberation, indicating its effectiveness in challenging acoustic environments. While the primary focus was on speech quality, the study also observed that combining DPO with the original cross-entropy loss helped maintain speaker similarity, suggesting a balanced approach to enhancement.
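Two pieces of arithmetic sit behind this paragraph: how a relative gain is computed, and how a preference loss can be combined with a cross-entropy term. The sketch below illustrates both. The weighted-sum form and the `ce_weight` parameter are assumptions made for illustration, since the summary does not state how the paper combines the two losses, and the scores in the usage note are made-up numbers rather than results from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline` as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


def combined_loss(dpo: float, cross_entropy: float, ce_weight: float = 1.0) -> float:
    """Assumed weighted sum of a DPO loss and a cross-entropy loss.

    The cross-entropy term keeps the model anchored to its original
    token-prediction objective while DPO pushes toward preferred outputs.
    """
    return dpo + ce_weight * cross_entropy
```

With hypothetical scores, a metric rising from 2.0 to 3.12 gives `relative_gain(2.0, 3.12)` of about 0.56, which is what a "56% relative gain" means. It is a ratio to the baseline, not an absolute increase of 0.56 points.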
This research marks a significant step forward, being the first known application of DPO to speech enhancement and the first to integrate proxy perceptual feedback into the training of LM-based speech enhancement systems. It opens up a promising new direction for creating speech enhancement technologies that are truly aligned with human listening preferences, leading to more natural and enjoyable audio experiences.


