TL;DR: The paper introduces MOS-RMBench, a new benchmark for evaluating speech quality assessment models. It converts traditional Mean Opinion Score (MOS) datasets into a preference-comparison format to overcome rating-scale inconsistencies. The study evaluates scalar, semi-scalar, and generative reward models, finding that scalar models perform best but that all paradigms struggle with subtle quality differences. To address this, a MOS-aware Generative Reward Model is proposed, which significantly improves fine-grained quality discrimination.
Evaluating the quality of synthetic speech is a critical task, especially with the rapid advancements in text-to-speech (TTS) and generative audio technologies. Traditionally, this assessment has relied on human subjective ratings, known as Mean Opinion Scores (MOS). However, these human-centric methods come with significant drawbacks: they are expensive, prone to inconsistencies in rating standards, and difficult to scale for the vast amounts of data modern systems produce.
To tackle these challenges, researchers have introduced a new benchmark called MOS-RMBench. This innovative framework redefines how speech quality is evaluated by transforming diverse MOS datasets into a preference-comparison setting. Instead of relying on absolute scores, which can vary wildly between datasets, MOS-RMBench focuses on relative preferences between speech samples. This approach eliminates scale inconsistencies and provides a unified, rigorous way to compare different models.
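As an illustration, the conversion from absolute ratings to preference pairs might look like the minimal sketch below. The pairing policy shown here (pairing only within a single dataset, with an optional `min_gap` threshold for dropping near-ties) is an assumption for illustration, not necessarily the paper's exact construction:

```python
from itertools import combinations

def mos_to_preference_pairs(utterances, min_gap=0.0):
    """Convert absolute MOS ratings into relative preference pairs.

    utterances: list of (audio_id, mos) tuples drawn from ONE dataset,
    so that pairs are never formed across different rating scales.
    """
    pairs = []
    for (a_id, a_mos), (b_id, b_mos) in combinations(utterances, 2):
        gap = abs(a_mos - b_mos)
        if gap <= min_gap:
            continue  # drop exact ties (and near-ties, if min_gap > 0)
        chosen, rejected = (a_id, b_id) if a_mos > b_mos else (b_id, a_id)
        pairs.append({"chosen": chosen, "rejected": rejected, "mos_gap": gap})
    return pairs

# Example: [("u1", 4.2), ("u2", 3.1), ("u3", 3.15)] yields three pairs,
# including the hard (u3 over u2) pair with mos_gap = 0.05.
```

Because only the relative ordering within a dataset is kept, differences in how strictly each annotator pool used the 1-to-5 scale no longer matter.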
Exploring Reward Modeling Paradigms
Building on MOS-RMBench, the study systematically constructs and evaluates three distinct paradigms for reward modeling, which are essentially systems designed to learn and predict human preferences for speech quality (a schematic sketch of their interfaces follows this list):

- Scalar Reward Models: These are the simplest, outputting a single numerical score for the quality of each audio sample.
- Semi-Scalar Reward Models: These models go a step further by first generating natural language descriptions of audio quality (e.g., detailing noise, distortion, naturalness, and continuity) before producing a scalar score.
- Generative Reward Models (GRMs): These are the most complex, taking two audio samples as input and learning to produce descriptive quality assessments and comparative scores.
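The paradigms differ mainly in their input/output contracts. The sketch below captures those contracts as interfaces; the names and signatures are hypothetical placeholders for exposition, not the paper's actual architectures (which build on an audio-language backbone):

```python
from typing import Protocol, Tuple

class ScalarRM(Protocol):
    # One audio sample in, one quality score out.
    def score(self, audio: bytes) -> float: ...

class SemiScalarRM(Protocol):
    # A natural-language quality description first, then a scalar score.
    def assess(self, audio: bytes) -> Tuple[str, float]: ...

class GenerativeRM(Protocol):
    # Two samples in; a descriptive assessment plus comparative scores out.
    def compare(self, audio_a: bytes, audio_b: bytes) -> Tuple[str, float, float]: ...
```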
The research utilized Qwen2-Audio as the base for these models and also included existing MOS prediction models like UTMOS and large language models (LLMs) such as Gemini-2.5-Pro and Qwen2.5-Omni-7B for comparison.
Key Findings and Challenges
The experiments conducted using MOS-RMBench revealed several important insights:
- Scalar Models Lead: Surprisingly, the simpler scalar reward models achieved the strongest overall performance, consistently exceeding 74% accuracy and averaging around 80% across various datasets. This suggests that, for now, direct scoring remains highly effective.
- Synthetic Speech Gap: Most models performed noticeably worse when evaluating synthetic speech compared to human speech. This highlights a persistent domain gap that needs to be addressed.
- Fine-Grained Discrimination: A significant challenge for all models was discriminating between speech pairs with very small differences in MOS scores. While models performed well when the quality gap was large, subtle distinctions proved difficult (a bucketed-accuracy sketch follows this list).
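One way to surface this failure mode is to bucket the test pairs by their absolute MOS gap and report accuracy per bucket, so that near-tie pairs are scored separately from easy ones. A minimal sketch, with hypothetical bucket edges (the paper's exact binning may differ):

```python
def accuracy_by_mos_gap(results, edges=(0.0, 0.25, 0.5, 1.0, 4.0)):
    """Bucket preference pairs by |MOS gap| and report accuracy per bucket.

    results: iterable of {"mos_gap": float, "correct": bool} records,
    one per evaluated preference pair.
    """
    buckets = {(lo, hi): [] for lo, hi in zip(edges, edges[1:])}
    for r in results:
        for lo, hi in zip(edges, edges[1:]):
            if lo <= r["mos_gap"] < hi:
                buckets[(lo, hi)].append(r["correct"])
                break
    return {f"[{lo:.2f}, {hi:.2f})": sum(v) / len(v)
            for (lo, hi), v in buckets.items() if v}
```

Under this kind of analysis, the reported pattern is that accuracy in the smallest-gap bucket drops sharply for every paradigm, even when overall accuracy looks strong.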
Introducing the MOS-aware Generative Reward Model
To overcome the difficulty in fine-grained quality discrimination, the researchers proposed a novel solution: the MOS-aware Generative Reward Model (MOS-aware GRM). This model enhances the standard reward function by incorporating an additional MOS-difference-based reward. This means the model adaptively scales rewards based on how difficult each sample pair is to distinguish. For pairs with very similar quality, correct predictions receive a larger reward, and incorrect ones incur a smaller penalty, encouraging the model to learn subtle differences.
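In spirit, the MOS-difference-based reward can be sketched as a difficulty-weighted term: the smaller the MOS gap, the larger the reward for a correct preference and the smaller the penalty for an incorrect one. The exact functional form below (linear scaling with the gap, clipped at `max_gap`) is an assumption for illustration; the paper may use a different shape:

```python
def mos_aware_reward(correct: bool, mos_gap: float,
                     max_gap: float = 4.0, base: float = 1.0) -> float:
    """Difficulty-weighted reward for a pairwise preference prediction.

    difficulty -> 1.0 as the MOS gap shrinks (hard, near-tie pairs),
    difficulty -> 0.0 as the gap approaches max_gap (easy pairs).
    """
    difficulty = 1.0 - min(abs(mos_gap), max_gap) / max_gap
    if correct:
        return base * (1.0 + difficulty)   # hard pairs earn a larger reward
    return -base * (1.0 - difficulty)      # hard pairs incur a smaller penalty

# Example: on a near-tie pair (gap 0.1) a correct call earns roughly twice
# the reward of an easy pair (gap 3.5), while a wrong call costs far less.
```

The effect is that training signal concentrates on exactly the pairs the benchmark showed to be hardest, rather than being dominated by easy, large-gap comparisons.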
Experimental results demonstrated that the MOS-aware GRM significantly improved fine-grained quality discrimination, narrowing the performance gap with scalar models on the most challenging cases. It showed consistent accuracy gains, particularly on samples with highly similar speech quality.
Looking Ahead
This work marks a significant step forward in automatic speech quality assessment. MOS-RMBench is the first unified benchmark to reformulate heterogeneous MOS datasets into a consistent preference-comparison framework; alongside it, the study offers a systematic evaluation of reward modeling paradigms and introduces the MOS-aware GRM for improved fine-grained discrimination. The researchers hope this benchmark and methodological framework will foster more rigorous and scalable research in this crucial field. For more details, refer to the full research paper.


