TLDR: SingMOS-Pro is a new, comprehensive dataset designed for automatic singing quality assessment. It expands on previous work by offering detailed annotations for lyrics, melody, and overall singing quality across 7,981 clips generated by various models. The dataset aims to address the challenges of evaluating singing quality, which traditionally relies on costly human subjective assessments or limited objective metrics. It provides a robust benchmark for developing and testing new models, highlighting the need for specialized approaches beyond speech quality assessment and suggesting future directions for integrating melodic and lyrical information.
Evaluating the quality of generated singing voices has long been a complex challenge in the rapidly advancing field of singing voice generation. While human listening tests are considered the ‘gold standard,’ they are often expensive and time-consuming. Existing objective metrics, on the other hand, frequently fail to capture the nuanced aspects of perceived singing quality. This gap has highlighted a critical need for more efficient, reliable, and universal methods for assessing singing quality.
Addressing this challenge, a new research paper introduces SingMOS-Pro, a dataset designed to facilitate automatic singing quality assessment. Its predecessor, SingMOS, offered only overall quality ratings. SingMOS-Pro adds detailed annotations for lyrics and melody alongside overall quality, providing a broader and more diverse evaluation framework.
SingMOS-Pro comprises 7,981 singing clips generated by 41 different models across 12 datasets, covering a wide spectrum of singing voice generation technologies, from earlier systems to the latest advancements. For reliability and consistency, each clip has received at least five ratings from professional annotators.
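To make the rating scheme concrete, here is a minimal sketch of how a per-clip Mean Opinion Score could be derived from individual ratings. The clip IDs, the 1-5 scale, and the function name `clip_mos` are illustrative assumptions; only the five-rating minimum comes from the description above.

```python
from statistics import mean

MIN_RATINGS = 5  # the article states each clip has at least five ratings


def clip_mos(ratings_by_clip, min_ratings=MIN_RATINGS):
    """Return {clip_id: mean rating}, rejecting clips with too few ratings."""
    mos = {}
    for clip_id, ratings in ratings_by_clip.items():
        if len(ratings) < min_ratings:
            raise ValueError(f"{clip_id} has only {len(ratings)} ratings")
        mos[clip_id] = mean(ratings)
    return mos


# Hypothetical clip IDs and scores on an assumed 1-5 scale.
example_ratings = {
    "clip_a": [4, 5, 4, 3, 4],
    "clip_b": [2, 3, 2, 2, 3, 3],
}
example_mos = clip_mos(example_ratings)
print(example_mos)
```

Averaging is the simplest aggregation; the same structure would apply separately to the overall, lyrics, and melody scores.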
The researchers behind SingMOS-Pro have also explored effective strategies for utilizing Mean Opinion Score (MOS) data annotated under varying standards. They benchmarked several widely used evaluation methods from related tasks on SingMOS-Pro, establishing robust baselines and practical references for future research in this domain. The dataset itself is publicly accessible, providing a valuable tool for the community. You can find more details about this work in the research paper.
According to the authors, SingMOS-Pro is the first multilingual, multi-task MOS dataset for singing quality assessment. It includes samples from singing voice synthesis (SVS), singing voice conversion (SVC), singing voice resynthesis (SVR), and ground-truth recordings. The clips are annotated along three dimensions: overall quality, lyrics clarity, and melody naturalness. This fine-grained annotation allows for a more comprehensive understanding of singing performance.
The annotation process involved 78 experienced annotators who conducted evaluations online in quiet environments. To maintain quality, each batch of evaluations included ‘trap clips’ (noise or silence) and ‘golden clips’ (carefully selected high-quality samples). If an annotator’s ratings on these control clips fell outside acceptable parameters, their entire batch of annotations was re-evaluated.
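The control-clip check described above can be sketched roughly as follows. The article does not give the acceptance thresholds, so `trap_max` and `golden_min`, along with all clip and batch IDs, are hypothetical values chosen only for illustration.

```python
def batch_passes_control(ratings, trap_ids, golden_ids,
                         trap_max=2.0, golden_min=4.0):
    """Check one annotator batch against its control clips.

    ratings: {clip_id: score} for a single batch.
    trap_max and golden_min are assumed thresholds, not taken from the paper.
    """
    traps_ok = all(ratings[c] <= trap_max for c in trap_ids)
    goldens_ok = all(ratings[c] >= golden_min for c in golden_ids)
    return traps_ok and goldens_ok


def batches_to_recheck(batches, trap_ids, golden_ids):
    """Return the IDs of batches whose control-clip ratings look unreliable."""
    return [batch_id for batch_id, ratings in batches.items()
            if not batch_passes_control(ratings, trap_ids, golden_ids)]


# Hypothetical batches: "trap_1" is a noise clip, "gold_1" a high-quality clip.
batches = {
    "batch_1": {"trap_1": 1, "gold_1": 5, "clip_x": 3},
    "batch_2": {"trap_1": 4, "gold_1": 5, "clip_x": 3},  # rated noise too highly
}
flagged = batches_to_recheck(batches, trap_ids=["trap_1"], golden_ids=["gold_1"])
print(flagged)
```

Flagging the whole batch, rather than dropping single ratings, mirrors the batch-level re-evaluation the article describes.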
Experiments conducted using SingMOS-Pro revealed interesting insights into model performance. Speech MOS models, such as UTMOS and DNSMOS, performed poorly on singing tasks, underscoring the significant domain gap between speech and singing. While the original SingMOS model showed strong performance on in-domain data, it struggled with out-of-domain samples, indicating a need for broader data coverage to prevent overfitting. Models like SHEET-ssqa, which integrate additional speech MOS data, demonstrated an ability to mitigate this overfitting, suggesting that combining speech and singing data could be a promising direction.
The research also explored the integration of pitch information, using methods like MIDI pitch and pitch histograms. While these approaches showed marginal improvements over a plain self-supervised learning baseline, the findings highlight the ongoing need for more effective ways to incorporate melodic cues into singing quality assessment (SQA) models. Future work will also focus on leveraging the detailed melody and lyric scores provided by SingMOS-Pro to further enhance automatic SQA.
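As an illustration of what a pitch-histogram feature might look like, the sketch below converts a fundamental-frequency (F0) track to MIDI note numbers and builds a normalized 12-bin pitch-class histogram. The article does not specify how the paper defines its histogram, so the binning, the unvoiced-frame convention, and the function names are assumptions.

```python
import math


def hz_to_midi(f0_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def pitch_class_histogram(f0_track):
    """Normalized 12-bin histogram of pitch classes over voiced frames.

    f0_track: per-frame F0 values in Hz; values <= 0 mark unvoiced frames
    (an assumed convention, common in pitch trackers).
    """
    counts = [0] * 12
    for f0 in f0_track:
        if f0 <= 0:
            continue  # skip unvoiced frames
        counts[round(hz_to_midi(f0)) % 12] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 12


# Hypothetical F0 track: A4, A4, an unvoiced frame, C4, A5.
histogram = pitch_class_histogram([440.0, 440.0, 0.0, 261.63, 880.0])
print(histogram)
```

A fixed-length vector like this could be concatenated with self-supervised speech features before the MOS prediction head, which is one plausible way to feed melodic cues to a model.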
In conclusion, SingMOS-Pro represents a significant step forward in automatic singing quality assessment. By offering a reliable, multilingual, multi-task, and fine-grained dataset, it provides resources and benchmarks that should help accelerate the development of more effective and robust SQA models, and in turn support higher-quality singing voice generation.


