Unmasking AI Music: How Audio Augmentations Challenge Deepfake Detection Models

TLDR: This research evaluates the robustness of the SONICS fake music detection model against various audio augmentations. It reveals that the model struggles to generalize to music from unseen AI generators, and that its performance degrades significantly even under light audio modifications such as pitch shifting or adding silence. The study emphasizes the need for diverse augmentation strategies during model training to improve resilience.

The rapid advancement of generative artificial intelligence models has made it increasingly difficult to distinguish between music created by humans and music generated by AI. In response to this challenge, new models designed to detect fake music have emerged. This research explores the resilience of such detection systems when subjected to various audio modifications, known as augmentations.

The study focused on evaluating how well a state-of-the-art musical deepfake detection model, named SONICS, performs under different audio transformations. To do this, a comprehensive dataset was created, comprising both authentic music and synthetic music generated by several systems, including Suno, Udio, YuE, and MusicGen. The researchers then applied a range of audio augmentations to this dataset to observe their impact on the model’s classification accuracy.

The findings revealed that the SONICS model’s performance significantly decreased, even with the introduction of relatively minor audio augmentations. This suggests a notable vulnerability in its ability to maintain accuracy when the audio input is slightly altered from its original form.

Generalization Challenges

Initially, the SONICS model was tested on the dataset without any augmentations to assess its ability to generalize to music from different generative models, including those it had not been trained on. The results indicated that the model did not generalize well, particularly struggling with music from MusicGen, which is often perceived as sounding more synthetic. Interestingly, even music from Udio, which was part of the SONICS model’s training dataset, was not consistently classified correctly, a point also noted by the original SONICS authors.
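The per-generator comparison described above can be sketched as a simple accuracy breakdown by source. This is a minimal illustration, not the study's evaluation code: the function name `accuracy_by_source`, the 0.5 decision threshold, the source names, and the toy records are all assumptions, and the numbers are placeholders rather than results from the research.

```python
from collections import defaultdict


def accuracy_by_source(records, threshold=0.5):
    """Compute classification accuracy separately for each music source.

    records: iterable of (source, is_fake, p_fake) tuples, where p_fake is
    the detector's predicted probability that the clip is AI-generated.
    The 0.5 threshold is an assumption made for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for source, is_fake, p_fake in records:
        predicted_fake = p_fake >= threshold
        correct[source] += int(predicted_fake == is_fake)
        total[source] += 1
    return {source: correct[source] / total[source] for source in total}


# Toy placeholder values, NOT results from the study.
toy_records = [
    ("real", False, 0.10),
    ("real", False, 0.60),
    ("generator_a", True, 0.90),
    ("generator_a", True, 0.80),
    ("generator_b", True, 0.30),
    ("generator_b", True, 0.70),
]
per_source = accuracy_by_source(toy_records)
```

Reporting accuracy per source rather than as one pooled number is what makes a weakness on a single generator visible.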

Impact of Audio Augmentations

A variety of audio augmentations were applied, including aliasing, bit crush, equalization, high/low pass filtering, frequency masking, MP3 and OGG compression, pitch shifting, speed manipulation, silencing fragments, reverb, vibrato, and white noise. These augmentations were tested across a range of parameters, ensuring they remained within reasonable limits so as not to render the audio incomprehensible.
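To make two of the listed augmentations concrete, here is a minimal sketch of bit crushing and additive white noise on a mono signal stored as a list of floats in [-1, 1]. The function names, the parameter choices (4 bits, 20 dB signal-to-noise ratio), and the 440 Hz test tone are assumptions for illustration; the study's own implementation and parameter ranges may differ.

```python
import math
import random


def bit_crush(samples, bits):
    """Quantize samples in [-1, 1] onto a coarse grid of about 2**bits levels."""
    steps = 2 ** (bits - 1)
    return [max(-1.0, min(1.0, round(s * steps) / steps)) for s in samples]


def add_white_noise(samples, snr_db, rng=None):
    """Add Gaussian white noise at the requested signal-to-noise ratio (dB)."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# One second of a 440 Hz test tone (an arbitrary stand-in for real music).
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate)]

crushed = bit_crush(tone, bits=4)
noisy = add_white_noise(tone, snr_db=20)
```

Lower `bits` and lower `snr_db` both mean stronger degradation, which is the kind of parameter sweep the paragraph above describes.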

Among the most impactful augmentations was pitch shifting. Depending on the direction of the shift, the model could be fooled into classifying audio as either highly genuine or highly fake. For instance, a downward shift of just two semitones caused the model to classify Suno-generated music as highly real, even though this change was barely noticeable to human listeners. Another significant observation was that adding more silence to a song increased the probability of it being classified as fake, to the point that an empty file was labeled as fake. This behavior could hinder classification in normal scenarios and suggests that the model should treat silence as ambiguous.
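For a sense of scale, the size of a two-semitone shift and the effect of silencing a fragment can be sketched as follows. In equal temperament a shift of n semitones multiplies every frequency by 2^(n/12). The helper names and the toy clip are assumptions for illustration, and the sketch does not perform an actual pitch shift of audio, which would also need to preserve duration.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a pitch shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def silence_fragment(samples, start, length):
    """Return a copy of samples with `length` samples from `start` zeroed."""
    end = min(len(samples), start + length)
    return samples[:start] + [0.0] * (end - start) + samples[end:]


def silent_fraction(samples, eps=1e-8):
    """Share of samples whose magnitude is below eps."""
    return sum(1 for s in samples if abs(s) < eps) / len(samples)


# Two semitones down scales frequencies by roughly 0.891,
# so a 440 Hz tone (A4) lands near 392 Hz (G4).
ratio_down_two = semitone_ratio(-2)
shifted_a4 = 440.0 * ratio_down_two

# A toy constant-valued clip with a quarter of its samples silenced.
clip = [0.25] * 1000
quarter_silenced = silence_fragment(clip, start=100, length=250)
```

Sweeping `length` from zero up to the full clip would reproduce the kind of experiment described above, ending with an entirely silent input.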

Furthermore, the study found a correlation between corrupting high frequencies in an audio file and the model’s increased probability of classifying it as fake. This might imply that the model inadvertently learned to rely on specific artifacts present in the audio spectrum rather than analyzing the music holistically. Some augmentations, such as bit crushing at low bit depths and strong white noise, were highly perceptible to humans; whether the model should be expected to handle such severely degraded samples remains an open question.
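One simple way to corrupt high frequencies while leaving low ones mostly intact is a low-pass filter. The sketch below uses a basic one-pole filter, chosen here only for brevity; the filter type, the 1 kHz cutoff, and the two test tones are assumptions and not the study's settings.

```python
import math


def one_pole_low_pass(samples, cutoff_hz, sample_rate):
    """Apply a one-pole (RC-style) low-pass filter to a list of samples."""
    dt = 1.0 / sample_rate
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out, prev = [], 0.0
    for s in samples:
        prev = prev + alpha * (s - prev)
        out.append(prev)
    return out


def rms(samples):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def sine(freq_hz, sample_rate, n_samples):
    """Generate a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


sample_rate = 44100
low_tone = sine(200, sample_rate, sample_rate)
high_tone = sine(8000, sample_rate, sample_rate)

# Fraction of each tone's level that survives a 1 kHz low-pass filter.
low_kept = rms(one_pole_low_pass(low_tone, 1000, sample_rate)) / rms(low_tone)
high_kept = rms(one_pole_low_pass(high_tone, 1000, sample_rate)) / rms(high_tone)
```

The 200 Hz tone passes almost unchanged while the 8 kHz tone is strongly attenuated, so comparing a detector's output before and after such a filter isolates how much it depends on the upper part of the spectrum.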

Conclusion and Future Directions

The research highlights two key conclusions. First, the SONICS model struggles to perform well on music from generative models it has not encountered during training, a common challenge for machine learning models facing distribution shifts. Second, the model’s results can be easily skewed by certain audio augmentations. The model reacted consistently to these modifications across different music sources, which underscores the importance of incorporating a diverse set of augmentations during the training of such detection models.

This work serves as a foundational step for analyzing other fake music detection models, as the augmentation process is largely independent of the model’s architecture, aside from sampling rate requirements. Future research could explore the explainability of these predictions and examine more closely the model’s reliance on specific frequency bands. The dataset and code used in this study are publicly available in a GitHub repository for further exploration and development.
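Because the augmentations operate on audio rather than on the model, an evaluation loop of this kind can treat the detector as an opaque scoring function. The following is a minimal sketch under that assumption: `evaluate_under_augmentations` and `toy_score_fn` are hypothetical names, the toy scorer is a stand-in and not SONICS, and resampling to a model's expected sampling rate is left out.

```python
def evaluate_under_augmentations(score_fn, clips, augmentations):
    """Return the mean detector score for each named augmentation.

    score_fn: any callable mapping a list of samples to a score,
              so different detectors can be swapped in unchanged.
    augmentations: dict mapping a name to a callable that transforms a clip.
    """
    results = {}
    for name, augment in augmentations.items():
        scores = [score_fn(augment(clip)) for clip in clips]
        results[name] = sum(scores) / len(scores)
    return results


def toy_score_fn(samples):
    """Hypothetical stand-in for a detector (NOT a real model).

    It simply returns the fraction of exactly-zero samples.
    """
    return sum(1 for s in samples if s == 0.0) / len(samples)


# Toy clips and augmentations, used only to exercise the loop.
clips = [[0.1] * 100, [0.2] * 200]
augmentations = {
    "identity": lambda clip: list(clip),
    "silence_first_half": lambda clip: [0.0] * (len(clip) // 2) + clip[len(clip) // 2:],
}
mean_scores = evaluate_under_augmentations(toy_score_fn, clips, augmentations)
```

Swapping in another detector only requires wrapping it in a function with the same input and output as `toy_score_fn`.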
