TLDR: JSQA is a new two-stage AI framework for speech quality assessment. It pretrains an audio encoder using perceptually-guided contrastive learning on “just noticeable difference” (JND) audio pairs, which are subtly different but perceived as the same quality by humans. This pretraining helps the model align with human perception. The encoder is then fine-tuned for Mean Opinion Score (MOS) prediction. The method significantly improves performance over models trained from scratch, demonstrating that incorporating perceptual factors is highly beneficial, even for smaller models.
Assessing speech quality is a critical task that impacts many fields, from audio enhancement to speech recognition. Traditionally, human listening tests, which yield Mean Opinion Scores (MOS), have been the gold standard for Speech Quality Assessment (SQA). However, these tests are notoriously time-consuming and expensive. While objective models have emerged, many struggle to capture the subtle nuances of human perception and often rely on large, costly labeled datasets.
Addressing these challenges, researchers Junyi Fan and Donald Williamson from The Ohio State University have introduced JSQA, a novel two-stage framework designed to improve speech quality assessment by deeply integrating perceptual factors into its learning process. Their work, detailed in their research paper, JSQA: Speech Quality Assessment with Perceptually-Inspired Contrastive Pretraining Based on JND Audio Pairs, offers a promising direction for more accurate and efficient SQA.
Understanding JSQA’s Approach
JSQA stands out by employing a two-stage training process. The first stage is “perceptually-guided contrastive pretraining” on Just Noticeable Difference (JND) audio pairs. The JND is the smallest change in a signal that humans can reliably detect. By constructing pairs of audio that differ subtly yet are likely to be judged as having the same quality by a human listener, JSQA trains an audio encoder to map perceptually similar signals to nearby points in an embedding space.
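To make the idea concrete, here is a minimal sketch of what such pretraining could look like in PyTorch, assuming an NT-Xent-style contrastive objective in which each JND pair is a positive and all other items in the batch serve as negatives; the function name, temperature, and loss formulation are illustrative, not the paper’s exact implementation.

```python
import torch
import torch.nn.functional as F

def jnd_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """NT-Xent-style loss over a batch of JND pairs (illustrative sketch).

    emb_a, emb_b: (batch, dim) encoder embeddings of the two members of
    each JND pair. Each pair is a positive; all other items in the batch
    act as negatives.
    """
    z = F.normalize(torch.cat([emb_a, emb_b], dim=0), dim=1)  # (2B, D)
    sim = z @ z.t() / temperature                             # cosine similarities
    sim.fill_diagonal_(float("-inf"))                         # ignore self-similarity

    batch = emb_a.size(0)
    # Each item's positive partner sits batch positions away: i <-> i + batch.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets.to(sim.device))
```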
These JND pairs are ingeniously created by taking clean speech utterances from the LibriSpeech dataset and mixing them with background noise from the CHiME-3 dataset at slightly different signal-to-noise ratios (SNRs) that fall within the JND range. During pretraining, the model learns to produce similar embeddings for these JND pairs, effectively mimicking human perception of quality similarity. This process helps the encoder become invariant to irrelevant factors like speaker, content, or specific noise types, focusing instead on perceptual quality.
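A simplified sketch of how such a JND pair could be generated is shown below; the helper names, the fixed 1 dB offset range, and the assumption that the clean and noise clips are the same length are illustrative rather than taken from the paper, which draws its SNR offsets from measured JND ranges.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the clean-plus-noise mixture has the requested SNR (dB)."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def make_jnd_pair(clean, noise, base_snr_db, jnd_db=1.0):
    """Create two mixtures of the same clean speech and noise whose SNRs
    differ by less than an assumed JND (here a hypothetical +/- 1 dB)."""
    delta = np.random.uniform(-jnd_db, jnd_db)
    mix_a = mix_at_snr(clean, noise, base_snr_db)
    mix_b = mix_at_snr(clean, noise, base_snr_db + delta)
    return mix_a, mix_b
```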
From Pretraining to MOS Prediction
After this crucial pretraining phase, the encoder is then fine-tuned for MOS prediction. This second stage uses a smaller, labeled dataset called NISQA, which contains audio samples with human-evaluated MOS scores. A lightweight regression network is connected to the pretrained encoder, allowing it to predict a scalar MOS score, typically ranging from 1 to 5.
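A minimal sketch of this second stage is given below, assuming a simple two-layer regression head and a scaled sigmoid to keep predictions in the 1–5 MOS range; the layer sizes and the exact head used in JSQA may differ. In practice the encoder weights can either be frozen or updated jointly with the head during fine-tuning.

```python
import torch
import torch.nn as nn

class MOSHead(nn.Module):
    """Lightweight regression head on top of a pretrained audio encoder (illustrative)."""

    def __init__(self, encoder, embed_dim=256):
        super().__init__()
        self.encoder = encoder                 # pretrained in stage one
        self.regressor = nn.Sequential(
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, waveform):
        emb = self.encoder(waveform)           # (batch, embed_dim)
        raw = self.regressor(emb).squeeze(-1)  # (batch,)
        return 1.0 + 4.0 * torch.sigmoid(raw)  # map output to the MOS range [1, 5]
```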
Key Findings and Implications
The experimental results from JSQA are compelling. The perceptually-inspired contrastive pretraining significantly boosts the model’s performance across evaluation metrics compared to a network trained from scratch without it. The root mean square error (RMSE) decreased by 18% and the mean absolute error (MAE) by 19%, while the Pearson (PCC) and Spearman rank (SRCC) correlation coefficients increased by 15% and 17%, respectively. This strongly suggests that incorporating human perceptual factors into the pretraining stage is highly beneficial for SQA.
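For reference, these four metrics can be computed from predicted and ground-truth MOS values as follows; this is a generic implementation using NumPy and SciPy, not code from the paper.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def sqa_metrics(pred, target):
    """Standard SQA evaluation metrics comparing predicted and true MOS values."""
    pred, target = np.asarray(pred), np.asarray(target)
    return {
        "RMSE": float(np.sqrt(np.mean((pred - target) ** 2))),
        "MAE": float(np.mean(np.abs(pred - target))),
        "PCC": float(pearsonr(pred, target)[0]),
        "SRCC": float(spearmanr(pred, target)[0]),
    }
```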
Another significant finding is that even relatively small models can achieve excellent performance when trained effectively with this method. JSQA, with approximately 26 million parameters and using about 33 GB of pretraining audio data, performs comparably to much larger models like wav2vec 2.0, which can have over 95 million parameters. This highlights the efficiency and power of the JSQA framework.
Interestingly, the study also explored the role of a “projection head” during pretraining. While often used in contrastive learning, the researchers found that in this specific case, including the projection head could sometimes negatively impact performance. This suggests that if the encoder’s initial embedding is already compact and well-conditioned, further processing might not be necessary and could even lead to information loss.
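For illustration, a projection head in this setting is typically a small MLP applied to the encoder output only during contrastive pretraining and discarded before fine-tuning; the sketch below uses assumed dimensions and is not the paper’s exact configuration.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Optional MLP projection head used only during contrastive pretraining
    (illustrative dimensions; skipped entirely in the variant that worked best)."""

    def __init__(self, embed_dim=256, proj_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )

    def forward(self, emb):
        return self.net(emb)
```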
In conclusion, JSQA represents a significant step forward in non-intrusive speech quality assessment. By leveraging the subtle yet powerful concept of just noticeable differences in human perception, it offers a more accurate, efficient, and perceptually aligned approach to understanding and predicting speech quality, paving the way for future advancements in audio technologies.


