TLDR: QAMRO (Quality-aware Adaptive Margin Ranking Optimization) is a new framework for evaluating AI-generated audio (music, speech, general audio) that better aligns with human perception. Unlike traditional methods, QAMRO uses a novel ranking optimization strategy with an adaptive margin and quality-aware weighting. This allows it to prioritize accurate ratings and highlight subtle perceptual differences, especially for high-quality audio, leading to significantly improved alignment with human judgments and outperforming existing baselines.
Evaluating the quality of audio generated by artificial intelligence, whether it’s music, speech, or general sounds, has always been a complex challenge. Human perception is subjective and multi-dimensional, making it difficult for machines to accurately assess what sounds good to us. Traditional methods often treat this assessment as a simple regression problem, trying to predict a Mean Opinion Score (MOS), but they frequently miss the nuances of how humans compare and rank different audio samples.
A new research paper introduces a novel framework called QAMRO, which stands for Quality-aware Adaptive Margin Ranking Optimization. This framework aims to bridge the gap between machine evaluation and human judgment by integrating different regression objectives. Its core idea is to emphasize perceptual differences and prioritize accurate ratings, especially for high-quality audio content.
Addressing the Limitations of Current Methods
Existing evaluation approaches, while useful, often fall short because they don’t account for the relative rankings among audio samples. For instance, if two audio clips are very similar in quality, a human might still have a slight preference for one over the other. Standard regression losses, like Mean Absolute Error (MAE) or Mean Squared Error (MSE), don’t effectively capture these relative preferences. While ranking loss functions have gained traction in other fields, they typically use a fixed margin and treat all sample pairs equally, overlooking the varying importance of different quality levels.
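To make the limitation concrete, here is a minimal sketch with made-up MOS values (the numbers and both helper functions are illustrative assumptions, not taken from the paper). Two sets of predictions have the same MAE, yet only one of them preserves the human ordering of the clips.

```python
def mae(pred, target):
    """Mean Absolute Error between predicted and ground-truth scores."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def pairwise_rank_accuracy(pred, target):
    """Fraction of non-tied ground-truth pairs whose order the predictions keep."""
    correct = total = 0
    n = len(target)
    for i in range(n):
        for j in range(i + 1, n):
            if target[i] == target[j]:
                continue  # ties carry no ranking information
            total += 1
            if (pred[i] - pred[j]) * (target[i] - target[j]) > 0:
                correct += 1
    return correct / total


# Hypothetical ground-truth MOS for three clips; the first two are close in quality.
mos = [3.0, 3.2, 4.0]
pred_a = [2.9, 3.3, 4.1]    # small errors, order preserved
pred_b = [3.15, 3.05, 4.0]  # same average error, first two clips swapped

print(mae(pred_a, mos), mae(pred_b, mos))  # both about 0.1
print(pairwise_rank_accuracy(pred_a, mos), pairwise_rank_accuracy(pred_b, mos))
```

A pure regression loss cannot tell these two predictors apart, which is the gap a ranking term is meant to close.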
How QAMRO Works
QAMRO introduces a ranking-based perspective to the MOS prediction task. It enhances the training of MOS prediction models by encouraging correct pairwise rankings. Unlike conventional ranking losses, QAMRO makes two significant improvements:
- Adaptive Margin: It uses a data-dependent margin that adjusts based on the actual difference between the ground-truth MOS scores of two samples. This allows the model to better capture subtle perceptual discrepancies.
- Quality-aware Weighting: A unique weighting mechanism is incorporated, giving more importance to sample pairs that include at least one high-quality audio utterance. This encourages the model to rank such cases more reliably, reflecting the intuition that errors in high-quality regions are often more critical.
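The two ideas above can be sketched as a pairwise hinge loss. This is only an illustration under stated assumptions, not the paper's exact formulation: the function name, the margin being proportional to the MOS gap (`margin_scale`), the high-quality cutoff (`hq_threshold`), and the extra weight (`hq_weight`) are all hypothetical choices.

```python
def qamro_style_loss(pred, mos, margin_scale=1.0, hq_threshold=4.0, hq_weight=2.0):
    """Illustrative pairwise ranking loss with an adaptive margin and
    quality-aware weights. All parameter names and defaults are assumptions."""
    total, n_pairs = 0.0, 0
    n = len(mos)
    for i in range(n):
        for j in range(i + 1, n):
            gap = mos[i] - mos[j]
            if gap == 0:
                continue  # tied pairs give no ranking signal
            sign = 1.0 if gap > 0 else -1.0
            # Adaptive margin: larger ground-truth gaps demand larger predicted gaps.
            margin = margin_scale * abs(gap)
            # Quality-aware weight: pairs containing a high-quality sample count more.
            weight = hq_weight if max(mos[i], mos[j]) >= hq_threshold else 1.0
            # Hinge penalty when the predicted gap falls short of the margin.
            violation = max(0.0, margin - sign * (pred[i] - pred[j]))
            total += weight * violation
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


# The same ranking mistake costs more in the high-quality region.
loss_low = qamro_style_loss([2.5, 2.0], [2.0, 2.5])
loss_high = qamro_style_loss([4.5, 4.0], [4.0, 4.5])
print(loss_low, loss_high)
```

In a real training setup this would be written with tensor operations so that gradients flow through `pred`; the loop form is used here only for readability.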
The framework leverages pre-trained audio-text models like CLAP and Audiobox-Aesthetics, which are adept at understanding the joint semantics of audio and text. QAMRO combines its novel ranking loss with traditional regression objectives, ensuring a balance between perceptual alignment and score accuracy across various content types.
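As a rough sketch of how a ranking term might be balanced against a regression term, the snippet below adds a weighted ranking loss to MAE. The weighting factor `rank_weight`, the choice of MAE, and the stand-in `misordered_fraction` ranking loss are assumptions made for illustration; the paper's actual combination may differ.

```python
def misordered_fraction(pred, mos):
    """Stand-in ranking loss (hypothetical): share of non-tied pairs ranked wrongly."""
    wrong = total = 0
    for i in range(len(mos)):
        for j in range(i + 1, len(mos)):
            if mos[i] == mos[j]:
                continue
            total += 1
            if (pred[i] - pred[j]) * (mos[i] - mos[j]) <= 0:
                wrong += 1
    return wrong / total if total else 0.0


def combined_objective(pred, mos, ranking_loss, rank_weight=0.5):
    """Regression term (MAE here) plus a weighted ranking term."""
    mae = sum(abs(p - m) for p, m in zip(pred, mos)) / len(mos)
    return mae + rank_weight * ranking_loss(pred, mos)


print(combined_objective([3.1, 3.9], [3.0, 4.0], misordered_fraction))
```

Passing the ranking loss in as a function keeps the sketch independent of any particular ranking formulation.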
Demonstrated Effectiveness
The researchers rigorously tested QAMRO on the official AudioMOS Challenge 2025 datasets, including MusicEval for text-to-music (TTM) evaluation and AES-Natural for a broader range of audio content (speech, music, general audio). The results were compelling: QAMRO consistently achieved superior alignment with human evaluations across all dimensions, significantly outperforming robust baseline models.
For example, on the MusicEval dataset, QAMRO showed marked improvements in metrics like the Spearman rank correlation coefficient (SRCC) for both musical impression and textual alignment. Ablation studies further confirmed that both the quality-aware weighting and the adaptive margin components are crucial for optimal performance.
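For readers unfamiliar with SRCC, the following sketch computes it from scratch as the Pearson correlation of the two rank vectors, with tied values sharing an average rank. The helper names and the example scores are hypothetical; in practice a library routine such as SciPy's `scipy.stats.spearmanr` would normally be used instead.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def srcc(pred, target):
    """Spearman rank correlation between predicted and human scores."""
    return pearson(average_ranks(pred), average_ranks(target))


# Hypothetical predicted scores versus human MOS for four clips.
print(srcc([2.1, 3.4, 3.0, 4.6], [2.0, 3.5, 3.1, 4.8]))
```

Because SRCC depends only on ordering, a model can score well on it even when its absolute scores are offset, which is why it is reported alongside error-based metrics.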
The framework’s generalizability was also demonstrated on the AES-Natural dataset, where it improved performance across production quality, production complexity, content enjoyment, and content usefulness. This indicates QAMRO’s robustness across different model architectures and various aspects of MOS scoring.
Looking Ahead
The introduction of QAMRO marks a significant step forward in human-aligned audio quality assessment, demonstrating the first effective use of a quality-aware ranking loss in this context. The researchers plan to explore listwise ranking methods in the future, which consider global ranking structures, to further enhance alignment with human judgments across diverse generated audio. For more technical details, see the full research paper.


