TLDR: HuBERT-VIC is a novel method that significantly improves the noise robustness of Automatic Speech Recognition (ASR) models. It integrates Variance-Invariance-Covariance Regularization (VICReg) into the pre-training of Speech Foundation Models (SFMs) like HuBERT. This approach helps the model maintain consistent representations in noisy conditions, capture diverse acoustic features, and reduce redundancy between feature dimensions. Experimental results show substantial performance gains on noisy speech datasets without compromising performance on clean speech, addressing key limitations of previous noise-robust ASR techniques.
Speech Foundation Models (SFMs) have brought about significant advancements in speech processing, particularly in Automatic Speech Recognition (ASR). Models like wav2vec 2.0 and HuBERT have shown remarkable performance by learning from vast amounts of unlabeled speech data. However, a major hurdle for these models is their performance degradation when exposed to noisy environments, as they are primarily trained on clean speech.
Previous attempts to enhance noise robustness in SFMs have explored methods such as contrastive learning, reconstruction loss, or knowledge distillation. While these approaches have shown some success, they often face challenges like representation collapse, where the model’s learned features become too similar, or require extensive computational resources due to large batch sizes and complex hyperparameter tuning. Some methods also necessitate additional fine-tuning stages to effectively transfer knowledge from clean speech models.
Addressing these limitations, researchers have proposed a novel approach called HuBERT-VIC. This method introduces Variance-Invariance-Covariance Regularization (VICReg) objectives during the pre-training phase of SFMs. The core idea is to adjust the statistical properties of noisy speech representations, allowing the model to better capture diverse acoustic characteristics and improve its ability to generalize across different types of noise.
HuBERT-VIC operates within a clean-noise knowledge distillation framework. It uses a teacher model, pre-trained on clean speech and kept frozen, to guide a student model that is trained on noise-augmented speech inputs. The VICReg loss is applied to the representations from both models, ensuring that the student learns to handle noise effectively.
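The teacher-student arrangement can be illustrated with a deliberately tiny sketch. Everything here is a stand-in chosen for illustration: the encoders are single linear maps rather than HuBERT Transformers, the student is assumed to start from the teacher's weights, and only a plain mean-squared distillation loss is used, not the full HuBERT-VIC objective. The point is the data flow: the frozen teacher sees clean input, the student sees the noisy version of the same input, and only the student is updated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, dim_in, dim_out = 256, 16, 8  # illustrative sizes, not from the paper

teacher_w = rng.normal(size=(dim_in, dim_out))  # stand-in for the frozen clean-speech teacher
teacher_before = teacher_w.copy()               # snapshot to confirm the teacher never changes
student_w = teacher_w.copy()                    # assumption: student starts from the teacher

clean = rng.normal(size=(n_frames, dim_in))          # toy "clean speech" features
noisy = clean + 0.5 * rng.normal(size=clean.shape)   # toy noise-augmented version


def distill_step(student_w, lr=0.5):
    """One gradient step that pulls student(noisy) toward teacher(clean)."""
    target = clean @ teacher_w               # teacher sees clean input and is never updated
    err = noisy @ student_w - target         # student sees the noisy version of the same input
    loss = float(np.mean(err ** 2))
    grad = 2.0 * noisy.T @ err / err.size    # gradient of the mean squared error w.r.t. student_w
    return student_w - lr * grad, loss


losses = []
for _ in range(50):
    student_w, loss = distill_step(student_w)
    losses.append(loss)

print("first loss:", losses[0], "last loss:", losses[-1])
```

In this toy setting the loss shrinks as the student learns to discount the added noise, while `teacher_w` stays identical to its snapshot.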
Understanding the VICReg Components:
The VICReg loss is composed of three distinct terms:
- Invariance Term: This term minimizes the difference between the clean speech representations from the teacher model and the noisy speech representations from the student model. It ensures that the model learns to maintain consistent representations even when noise is present, which is crucial for noise robustness.
- Variance Term: This component ensures that there is sufficient dispersion across the feature dimensions of the noisy speech representations. It prevents ‘representation collapse,’ where all learned features become too concentrated, and encourages the model to capture a broader range of acoustic characteristics, including those related to noise. Higher variance in channel dimensions has been observed to correlate with better handling of noisy speech.
- Covariance Term: The covariance term aims to reduce redundancy between different pairs of feature dimensions. By decorrelating these dimensions, it allows each dimension to capture distinct and independent information, thereby improving the overall quality of the noisy speech representations and enhancing the model’s generalization ability to various environments.
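The three terms above can be sketched in a few lines of NumPy. The formulas follow the general VICReg recipe (mean squared distance, a hinge on per-dimension standard deviation, and squared off-diagonal covariance); the function names, the `gamma` and `eps` defaults, and the toy feature shapes are assumptions made for illustration, and the exact formulation used in HuBERT-VIC may differ.

```python
import numpy as np


def invariance_term(student_feats, teacher_feats):
    """Mean squared distance between noisy (student) and clean (teacher) features."""
    return float(np.mean((student_feats - teacher_feats) ** 2))


def variance_term(feats, gamma=1.0, eps=1e-4):
    """Hinge penalty on each dimension whose standard deviation falls below gamma."""
    std = np.sqrt(feats.var(axis=0, ddof=1) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))


def covariance_term(feats):
    """Sum of squared off-diagonal covariance entries, scaled by the feature dimension."""
    n, d = feats.shape
    centered = feats - feats.mean(axis=0)
    cov = centered.T @ centered / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)


# Toy data: rows are frames, columns are feature dimensions (shapes are illustrative).
rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(200, 8))                         # clean-speech features
student_feats = teacher_feats + 0.3 * rng.normal(size=(200, 8))   # noisy-speech features

print("invariance:", invariance_term(student_feats, teacher_feats))
print("variance:  ", variance_term(student_feats))
print("covariance:", covariance_term(student_feats))
```

A fully collapsed representation (every frame identical) drives the variance penalty to its maximum, while two duplicated feature dimensions produce a non-zero covariance penalty, which is the behaviour the two regularizers are meant to discourage.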
By combining these three regularization terms with the standard masked prediction loss of HuBERT, the model is jointly optimized to become more robust to noise.
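One plausible way to write this joint objective is as a weighted sum. The symbols below are notation introduced here for illustration: the weights λ, μ, and ν, and the assumption that the terms are simply added, are not taken from the paper.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{MP}}
  + \lambda \, \mathcal{L}_{\text{inv}}
  + \mu \, \mathcal{L}_{\text{var}}
  + \nu \, \mathcal{L}_{\text{cov}}
```

Here the first term is HuBERT's masked prediction loss and the remaining three are the invariance, variance, and covariance terms described above.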
Experimental results, primarily conducted on the LibriSpeech dataset augmented with MUSAN noise (including babble, music, and natural noise at various Signal-to-Noise Ratio (SNR) levels), demonstrate the effectiveness of HuBERT-VIC. The model showed significant relative performance improvements: 23.3% on LibriSpeech test-clean and 13.2% on test-other, compared to a baseline model pre-trained on noisy speech. Notably, unlike some previous methods, HuBERT-VIC effectively prevents performance degradation on clean speech, maintaining strong generalization ability across both noisy and clean conditions.
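Mixing noise into speech at a chosen SNR, as mentioned above, comes down to scaling the noise so that the ratio of speech power to noise power matches the target. The following is a minimal sketch of that general technique, not the paper's actual augmentation pipeline; the function name `mix_at_snr`, the synthetic signals, and the 10 dB value are all illustrative assumptions.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it to reach the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)   # loop or trim the noise to the speech length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


# Synthetic stand-ins for a speech clip and a noise clip (one second at 16 kHz).
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
speech = np.sin(2 * np.pi * 220.0 * t)
noise = np.random.default_rng(0).normal(size=4000)

mixed = mix_at_snr(speech, noise, snr_db=10.0)
```

Lower `snr_db` values produce a louder noise component relative to the speech, which is how a range of difficulty levels can be generated from the same clean recording.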
An ablation study further confirmed the individual contributions of each VICReg term, highlighting that while the invariance term is fundamental, the variance and covariance terms play crucial complementary roles in achieving superior performance. The analysis also revealed that higher SNR in input speech leads to higher variance in the channel dimensions of the model’s representations, indicating an enhanced ability to distinguish important speech characteristics as noise diminishes.
In conclusion, HuBERT-VIC offers a powerful and efficient method for improving the noise robustness of speech foundation models for ASR. By leveraging variance, invariance, and covariance regularization, it enables models to learn more robust and generalized speech representations, paving the way for more reliable speech recognition in real-world noisy environments. For more details, refer to the full research paper.


