Unmasking AI Music: How Audio Augmentations Challenge Deepfake Detection Models

TL;DR: This research evaluates the robustness of the SONICS fake music detection model against various audio augmentations. It finds that the model struggles to generalize to music from unseen AI generators and that its performance degrades significantly even under light audio modifications such as pitch shifting or added silence. The study emphasizes the need for diverse augmentation strategies during model training to improve resilience.

The rapid advancement of generative artificial intelligence models has made it increasingly difficult to distinguish between music created by humans and music generated by AI. In response to this challenge, new models designed to detect fake music have emerged. This research explores the resilience of such detection systems when subjected to various audio modifications, known as augmentations.

The study focused on evaluating how well a state-of-the-art musical deepfake detection model, named SONICS, performs under different audio transformations. To do this, a comprehensive dataset was created, comprising both authentic music and synthetic music generated by several systems, including Suno, Udio, YuE, and MusicGen. The researchers then applied a range of audio augmentations to this dataset to observe their impact on the model’s classification accuracy.

The findings revealed that the SONICS model’s performance significantly decreased, even with the introduction of relatively minor audio augmentations. This suggests a notable vulnerability in its ability to maintain accuracy when the audio input is slightly altered from its original form.

Generalization Challenges

Initially, the SONICS model was tested on the dataset without any augmentations to assess its ability to generalize to music from different generative models, including those it had not been trained on. The results indicated that the model did not generalize well, particularly struggling with music from MusicGen, which is often perceived as sounding more synthetic. Interestingly, even music from Udio, which was part of the SONICS model’s training dataset, was not consistently classified correctly, a point also noted by the original SONICS authors.
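
A per-generator breakdown of this kind can be sketched in a few lines of Python. This is only an illustration: the `accuracy_by_source` helper, the 0.5 threshold, the source names, and the scores are all hypothetical placeholders, not the study's code or results.

```python
from collections import defaultdict

def accuracy_by_source(records, threshold=0.5):
    """Compute accuracy per source.

    records: iterable of (source, is_fake, fake_probability) tuples.
    A clip is predicted fake when fake_probability >= threshold.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for source, is_fake, fake_probability in records:
        predicted_fake = fake_probability >= threshold
        correct[source] += int(predicted_fake == is_fake)
        total[source] += 1
    return {source: correct[source] / total[source] for source in total}

# Placeholder values for illustration only; NOT results from the study.
toy_records = [
    ("real", False, 0.10),
    ("real", False, 0.70),
    ("generator_a", True, 0.95),
    ("generator_a", True, 0.80),
    ("generator_b", True, 0.30),
    ("generator_b", True, 0.60),
]

per_source = accuracy_by_source(toy_records)
print(per_source)
```

Reporting accuracy separately for each source, rather than one pooled number, is what makes a weakness on a single generator visible.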

Impact of Audio Augmentations

A variety of audio augmentations were applied, including aliasing, bit crush, equalization, high/low pass filtering, frequency masking, MP3 and OGG compression, pitch shifting, speed manipulation, silencing fragments, reverb, vibrato, and white noise. These augmentations were tested across a range of parameters, ensuring they remained within reasonable limits so as not to render the audio incomprehensible.
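
To make a few of these transformations concrete, here is a minimal NumPy sketch of white noise, fragment silencing, and bit crush. The function names and parameter choices are assumptions made for illustration; the study's own implementation and parameter ranges may differ.

```python
import numpy as np

def add_white_noise(audio, snr_db, rng=None):
    """Add Gaussian white noise at a target signal-to-noise ratio (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

def silence_fragment(audio, start, length):
    """Return a copy with samples [start, start + length) set to zero."""
    out = audio.copy()
    out[start:start + length] = 0.0
    return out

def bit_crush(audio, bits):
    """Quantize audio in [-1, 1] to a grid of step 1 / 2**(bits - 1)."""
    levels = 2 ** (bits - 1)
    return np.round(audio * levels) / levels

# One second of a 440 Hz tone as a stand-in for real music.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = 0.5 * np.sin(2 * np.pi * 440.0 * t)

noisy = add_white_noise(audio, snr_db=20, rng=np.random.default_rng(0))
silenced = silence_fragment(audio, start=100, length=100)
crushed = bit_crush(audio, bits=4)
```

Each function returns a new array, so several augmentations can be chained on the same clip without modifying the original.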

Pitch shifting was among the most impactful augmentations. Depending on the direction of the shift, the model could be fooled into classifying audio as either confidently real or confidently fake. For instance, a downward shift of just two semitones caused the model to classify Suno-generated music as highly real, even though this change was barely noticeable to human listeners. Another significant observation was that adding more silence to a song increased the probability of it being classified as fake, to the point that an empty file was labeled as fake. This behavior could hinder classification in ordinary scenarios and suggests that the model should treat silence as ambiguous.
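
For a sense of scale, a shift of n semitones multiplies every frequency by 2^(n/12), so two semitones down is a factor of roughly 0.89. The sketch below illustrates this with a naive resampling shift. It is an assumption-laden illustration, not the study's method: resampling also changes the clip's duration, whereas a proper pitch shifter keeps duration fixed, and the helper names are hypothetical.

```python
import numpy as np

def semitone_ratio(n_steps):
    """Frequency ratio for a shift of n_steps semitones (negative = down)."""
    return 2.0 ** (n_steps / 12.0)

def resample_shift(audio, n_steps):
    """Naive shift by linear-interpolation resampling.

    This changes pitch AND duration together; it is only a stand-in
    for a duration-preserving pitch shifter.
    """
    ratio = semitone_ratio(n_steps)
    positions = np.arange(0, len(audio) - 1, ratio)
    return np.interp(positions, np.arange(len(audio)), audio)

def dominant_frequency(audio, sample_rate):
    """Frequency (Hz) of the strongest bin in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(audio))
    return np.argmax(spectrum) * sample_rate / len(audio)

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)   # A4

lowered = resample_shift(tone, -2)      # two semitones down, near G4
print(semitone_ratio(-2))
print(dominant_frequency(lowered, sample_rate))
```

A 440 Hz tone lands close to 392 Hz after the shift, which is a small musical interval and consistent with the change being hard for listeners to notice.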

Furthermore, the study found a correlation between corrupting high frequencies in an audio file and the model’s increased probability of classifying it as fake. This might imply that the model inadvertently learned to rely on specific artifacts present in the audio spectrum rather than analyzing the music holistically. While some augmentations, like low bit crush levels and strong white noise, were highly perceptible to humans, the question of whether the model should handle such severely degraded samples remains open.
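
One simple way to probe this kind of high-frequency dependence is to remove everything above a cutoff and compare the detector's output before and after. The sketch below uses a brick-wall FFT low-pass filter; the function name and cutoff are assumptions for illustration, and the filters used in the study may be designed differently (a brick-wall filter can cause audible ringing on real music).

```python
import numpy as np

def low_pass_fft(audio, sample_rate, cutoff_hz):
    """Zero all frequency components above cutoff_hz (brick-wall filter)."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(audio))

# Synthetic clip: a 200 Hz tone plus a quieter 6 kHz tone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
low = np.sin(2 * np.pi * 200.0 * t)
high = 0.5 * np.sin(2 * np.pi * 6000.0 * t)

filtered = low_pass_fft(low + high, sample_rate, cutoff_hz=1000.0)
```

Sweeping `cutoff_hz` over a range of values and recording the detector's score at each step would show which frequency bands the prediction depends on.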

Conclusion and Future Directions

The research highlights two key conclusions: firstly, the SONICS model struggles to perform well on music from generative models it has not encountered during training, which is a common challenge for machine learning models facing distribution shifts. Secondly, the model’s results can be easily skewed by certain audio augmentations. The consistent reaction of the model to modifications across different music sources underscores the critical importance of incorporating a diverse set of augmentations during the training phase of such detection models.

This work serves as a foundational step for analyzing other fake music detection models, as the augmentation process is largely independent of the model’s architecture, aside from sampling rate requirements. Future research could explore the explainability of these predictions and examine more closely the model’s reliance on specific frequency bands. The dataset and code used in this study are publicly available in a GitHub repository for further exploration and development.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
