JSQA: Enhancing Speech Quality Assessment with Perceptually-Guided AI

TLDR: JSQA is a new two-stage AI framework for speech quality assessment. It pretrains an audio encoder using perceptually-guided contrastive learning on “just noticeable difference” (JND) audio pairs, which are subtly different but perceived as the same quality by humans. This pretraining helps the model align with human perception. The encoder is then fine-tuned for Mean Opinion Score (MOS) prediction. The method significantly improves performance over models trained from scratch, demonstrating that incorporating perceptual factors is highly beneficial, even for smaller models.

Assessing the quality of speech is a critical task that impacts various fields, from enhancing audio to developing speech recognition systems. Traditionally, human listening tests, which yield Mean Opinion Scores (MOS), have been the gold standard for Speech Quality Assessment (SQA). However, these tests are notoriously time-consuming and expensive. While objective models have emerged, many struggle to accurately incorporate the subtle nuances of human perception, often relying on large, expensively labeled datasets.

Addressing these challenges, researchers Junyi Fan and Donald Williamson from The Ohio State University have introduced JSQA, a novel two-stage framework designed to improve speech quality assessment by deeply integrating perceptual factors into its learning process. Their work, detailed in their research paper, JSQA: Speech Quality Assessment with Perceptually-Inspired Contrastive Pretraining Based on JND Audio Pairs, offers a promising direction for more accurate and efficient SQA.

Understanding JSQA’s Approach

JSQA stands out by employing a two-stage training process. The first stage involves “perceptually-guided contrastive pretraining” using what are called Just Noticeable Difference (JND) audio pairs. JND refers to the smallest change in a signal that humans can reliably detect. By creating pairs of audio that are perceptually very similar—meaning a human listener would likely perceive them as having the same quality despite subtle differences—JSQA trains an audio encoder to understand and map these similarities into an embedding space.
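To make this concrete, here is a minimal NumPy sketch of an NT-Xent-style contrastive loss in which each JND pair is the positive and the other utterances in the batch serve as negatives. The exact loss formulation and temperature used by JSQA are assumptions here; this only illustrates the general idea of pulling JND pairs together in the embedding space.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, temperature=0.1):
    """NT-Xent-style loss over a batch of JND pairs.

    emb_a[i] and emb_b[i] are embeddings of two perceptually
    equivalent (JND) versions of the same utterance; every other
    row in the batch acts as a negative.
    """
    # L2-normalise so dot products are cosine similarities
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # (N, N) similarity matrix
    # Row i's positive is column i (its JND partner)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Minimising this loss drives the encoder to give near-identical embeddings to JND pairs while pushing apart unrelated utterances, which is the behaviour the pretraining stage is after.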

These JND pairs are ingeniously created by taking clean speech utterances from the LibriSpeech dataset and mixing them with background noise from the CHiME-3 dataset at slightly different signal-to-noise ratios (SNRs) that fall within the JND range. During pretraining, the model learns to produce similar embeddings for these JND pairs, effectively mimicking human perception of quality similarity. This process helps the encoder become invariant to irrelevant factors like speaker, content, or specific noise types, focusing instead on perceptual quality.
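A rough sketch of that pair-construction step, assuming the JND range is expressed as a width in dB (the `jnd_db` value below is illustrative; the paper derives the actual range from listening-test data):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then mix."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def make_jnd_pair(speech, noise, snr_db, jnd_db=1.0, rng=None):
    """Create two mixtures whose SNRs differ by less than one JND,
    so a listener would perceive them as having the same quality."""
    if rng is None:
        rng = np.random.default_rng()
    delta = rng.uniform(-jnd_db, jnd_db)
    return (mix_at_snr(speech, noise, snr_db),
            mix_at_snr(speech, noise, snr_db + delta))
```

In JSQA the clean signal would come from LibriSpeech and the noise from CHiME-3; here both are left abstract.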

From Pretraining to MOS Prediction

After this crucial pretraining phase, the encoder is then fine-tuned for MOS prediction. This second stage uses a smaller, labeled dataset called NISQA, which contains audio samples with human-evaluated MOS scores. A lightweight regression network is connected to the pretrained encoder, allowing it to predict a scalar MOS score, typically ranging from 1 to 5.
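A lightweight head of this kind might look like the following sketch; the single hidden layer and sigmoid squashing into the MOS range are illustrative assumptions, not the exact architecture from the paper:

```python
import numpy as np

def mos_head(embedding, w1, b1, w2, b2):
    """Lightweight regression head on top of a frozen or fine-tuned encoder.

    Takes a batch of encoder embeddings and returns one scalar MOS
    prediction per utterance, squashed into the standard [1, 5] range.
    """
    h = np.maximum(0.0, embedding @ w1 + b1)     # ReLU hidden layer
    raw = h @ w2 + b2                            # unbounded scalar per utterance
    return 1.0 + 4.0 / (1.0 + np.exp(-raw))     # map to the MOS range [1, 5]
```

During fine-tuning, the head (and optionally the encoder) would be trained against the human MOS labels in NISQA with an ordinary regression loss.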

Key Findings and Implications

The experimental results from JSQA are compelling. The perceptually-inspired contrastive pretraining significantly boosts the model’s performance across various evaluation metrics compared to a network trained from scratch without it. The root-mean-square error (RMSE) decreased by 18% and the mean absolute error (MAE) by 19%, while the correlation metrics (PCC and SRCC) increased by 15% and 17%, respectively. This strongly suggests that incorporating human perceptual factors into the pretraining stage is highly beneficial for SQA.
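For readers unfamiliar with these four metrics, they can all be computed directly from predicted and ground-truth MOS values. The sketch below uses plain NumPy; the Spearman rank correlation here assumes no tied scores (a full implementation would use average ranks for ties):

```python
import numpy as np

def sqa_metrics(pred, true):
    """RMSE, MAE, Pearson (PCC) and Spearman (SRCC) for MOS prediction."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    rmse = np.sqrt(np.mean((pred - true) ** 2))
    mae = np.mean(np.abs(pred - true))
    pcc = np.corrcoef(pred, true)[0, 1]
    # Spearman correlation = Pearson correlation of the ranks
    # (simple argsort ranking; assumes no ties)
    ranks = lambda x: np.argsort(np.argsort(x))
    srcc = np.corrcoef(ranks(pred), ranks(true))[0, 1]
    return {"RMSE": rmse, "MAE": mae, "PCC": pcc, "SRCC": srcc}
```

Lower RMSE/MAE and higher PCC/SRCC indicate predictions that track human judgments more closely, which is exactly the pattern the pretrained JSQA model shows over the from-scratch baseline.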

Another significant finding is that even relatively small models can achieve excellent performance when trained effectively with this method. JSQA, with approximately 26 million parameters and using about 33 GB of pretraining audio data, performs comparably to much larger models like wav2vec 2.0, which can have over 95 million parameters. This highlights the efficiency and power of the JSQA framework.

Interestingly, the study also explored the role of a “projection head” during pretraining. While often used in contrastive learning, the researchers found that in this specific case, including the projection head could sometimes negatively impact performance. This suggests that if the encoder’s initial embedding is already compact and well-conditioned, further processing might not be necessary and could even lead to information loss.
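The two configurations can be sketched in a few lines; the dimensions below are illustrative assumptions, not the sizes used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(8, 128))             # batch of encoder embeddings
w_proj = rng.normal(size=(128, 32)) / np.sqrt(128)

with_head = embedding @ w_proj    # contrastive loss computed on a projection
without_head = embedding          # contrastive loss computed on the embedding

# A 128 -> 32 linear map is non-invertible, so the projected view can
# discard information that the downstream MOS regressor would need --
# one plausible reading of why the head hurt performance here.
```

When the encoder output is already compact and well-conditioned, the second path gives the loss (and later the regression head) access to the full embedding.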

In conclusion, JSQA represents a significant step forward in non-intrusive speech quality assessment. By leveraging the subtle yet powerful concept of just noticeable differences in human perception, it offers a more accurate, efficient, and perceptually aligned approach to understanding and predicting speech quality, paving the way for future advancements in audio technologies.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
