TL;DR: This research investigates how well various spectral and rhythm audio features (mel-scaled spectrograms, MFCCs, tempograms, and chromagrams) perform for classifying environmental sounds with deep convolutional neural networks (CNNs) on the ESC-50 dataset. The study found that spectral features, particularly mel-scaled spectrograms and MFCCs, consistently deliver significantly better accuracy, precision, recall, and F1 scores than rhythm and chromagram features, for both broad audio categories and specific sound classes.
In the rapidly evolving field of machine learning, classifying audio data is a crucial task with applications ranging from speech recognition to environmental sound monitoring. A recent research paper, “Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks” by Friedrich Wolf-Monheim, delves into how different audio features impact the effectiveness of deep convolutional neural networks (CNNs) in classifying environmental sounds.
The study systematically evaluates several spectral and rhythm features, numerical representations of sound that machine learning models can consume, as inputs to deep CNNs. The goal was to determine which features yield the best classification performance for both broad audio categories and specific sound classes.
Understanding the Building Blocks: Audio Features
Audio signals are complex, and to make them understandable for machine learning models, they are transformed into “features.” This research focused on six key features (a short code sketch for computing all six follows the list):
- Mel-scaled spectrograms: Visual representations of sound that capture how the energy of different frequencies changes over time, with the frequency axis warped to mimic human hearing. They are excellent for speech and music analysis and are robust to noise.
- Mel-frequency cepstral coefficients (MFCCs): Derived from mel-scaled spectrograms, MFCCs offer a compact representation of sound that highlights timbral characteristics, making them particularly effective for speech and music genre classification.
- Cyclic tempograms: Rhythm features that show how tempo changes over time, with tempo octaves folded together. They are useful for analyzing tempo-related musical structure and for beat tracking.
- Short-time Fourier transform (STFT) chromagrams: Chromagrams visualize the tonal content of audio over time by folding spectral energy into the twelve musical pitch classes. STFT chromagrams are well suited to analyzing harmonic content and recognizing chords.
- Constant-Q transform (CQT) chromagrams: Similar to STFT chromagrams, but with better frequency resolution at lower pitches, making them more suitable for detailed harmonic analysis in music.
- Chroma energy normalized statistics (CENS) chromagrams: CQT-based chromagrams that are additionally normalized and smoothed to be robust against variations in loudness and timbre, useful for large-scale music structure analysis.
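As a concrete illustration, all six feature types can be computed with the librosa library. This is a minimal sketch, not the paper’s pipeline: the file name, sample rate, and parameter values are assumptions.

```python
# Sketch: extracting the six feature types with librosa.
# "example.wav", sr=22050, and n_mfcc=20 are illustrative assumptions.
import librosa

y, sr = librosa.load("example.wav", sr=22050)

mel = librosa.feature.melspectrogram(y=y, sr=sr)        # mel-scaled spectrogram
log_mel = librosa.power_to_db(mel)                      # usually log-scaled before CNN input
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)      # MFCCs
tempogram = librosa.feature.tempogram(y=y, sr=sr)       # tempogram (cyclic folding is a further step)
chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr)   # STFT chromagram
chroma_cqt = librosa.feature.chroma_cqt(y=y, sr=sr)     # CQT chromagram
chroma_cens = librosa.feature.chroma_cens(y=y, sr=sr)   # CENS chromagram
```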
The Experimental Approach
The research utilized the ESC-50 dataset, a benchmark collection of 2,000 labeled environmental audio recordings spanning 50 diverse sound classes (e.g., rain, dog, clock alarm), with 40 recordings per class. Each recording is 5 seconds long and stored in .wav format. The dataset was split into 80% for training the model and 20% for validation, ensuring the model’s performance was evaluated on unseen data.
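For illustration, here is one way such an 80/20 split could be performed on the ESC-50 metadata with scikit-learn. The stratification, random seed, and local path are assumptions; the paper’s exact split procedure may differ (ESC-50 also ships with predefined folds).

```python
# Illustrative stratified 80/20 split of the ESC-50 metadata.
import pandas as pd
from sklearn.model_selection import train_test_split

meta = pd.read_csv("ESC-50-master/meta/esc50.csv")  # assumed local path to the metadata file

train_df, val_df = train_test_split(
    meta,
    test_size=0.2,            # hold out 20% for validation
    stratify=meta["target"],  # keep all 50 classes represented proportionally
    random_state=42,          # assumed seed for reproducibility
)
```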
A deep convolutional neural network (CNN) was employed for classification. CNNs are powerful deep learning models known for their ability to automatically learn hierarchical patterns in data, much like how they excel in image recognition. The network architecture included several layers designed to extract increasingly complex features from the audio inputs, with techniques like batch normalization and dropout to ensure stable training and prevent overfitting.
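A minimal Keras sketch in the spirit of the described architecture is shown below: stacked convolutional blocks with batch normalization, followed by dropout before the classifier. The layer counts, filter sizes, and input shape are assumptions, not the paper’s exact configuration.

```python
# Minimal CNN sketch for 50-class audio classification from 2-D feature maps.
from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 216, 1), n_classes=50):
    # input_shape is an assumption: e.g. a mel spectrogram treated as a 1-channel image
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),   # stabilizes training
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.GlobalAveragePooling2D(),
        layers.Dropout(0.5),           # reduces overfitting
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```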
The model’s performance was measured using standard metrics: accuracy (overall correct predictions), precision (how many predicted positives were actually correct), recall (how many actual positives were correctly identified), and F1 score (a balance between precision and recall).
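All four metrics are straightforward to compute with scikit-learn; the macro averaging below (weighting every class equally) is an assumption about how per-class scores were aggregated, and the label arrays are placeholders.

```python
# Computing the four reported metrics from validation labels and predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 3, 3, 7, 1]  # placeholder validation labels
y_pred = [0, 3, 1, 7, 1]  # placeholder model predictions

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
```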
Key Findings: Spectral Features Lead the Way
The results clearly demonstrated that the spectral features, specifically mel-scaled spectrograms and mel-frequency cepstral coefficients (MFCCs), significantly outperformed the rhythm and chromagram features across all audio categories and classes. On average, mel-scaled spectrograms and MFCCs scored approximately 35% higher across all evaluation metrics than cyclic tempograms and chromagrams.
For audio category level classification, mel-scaled spectrograms achieved an average accuracy of 76.5%, closely followed by MFCCs at 76.3%. Similar trends were observed for precision, recall, and F1 scores, with mel-scaled spectrograms consistently showing a slight edge over MFCCs.
Even at the more challenging audio class level (classifying specific sounds like “frog” or “chainsaw”), mel-scaled spectrograms and MFCCs maintained their superior performance, although the absolute scores were lower. Mel-scaled spectrograms achieved an average precision of 69.3% across all classes, while MFCCs reached 61.3%. In contrast, cyclic tempograms and chromagrams had much lower average precisions, indicating their limited effectiveness for this type of audio classification.
The study also highlighted specific challenges, such as classifying low-intensity sounds like “breathing” and “drinking, sipping,” where even MFCCs struggled due to low signal-to-noise ratios and high variability within the sound class.
Why the Difference?
The superior performance of mel-scaled spectrograms and MFCCs is attributed to their ability to effectively capture frequency-based structures, which are crucial for how humans perceive and differentiate sounds. Rhythm features like cyclic tempograms, while useful for music, are less relevant for general environmental sound recognition. Similarly, chromagrams, which emphasize harmonic structures, may not be as effective for a dataset containing a wide range of non-harmonic environmental sounds.
Looking Ahead
While this research provides valuable insights, it also points to future directions. Expanding the analysis to larger and more diverse datasets like UrbanSound8K or AudioSet could further validate these findings. Additionally, exploring hybrid feature extraction methods, integrating other audio transforms, and investigating noise reduction techniques for challenging low-signal-to-noise ratio classes are promising avenues for future work. Ultimately, this research contributes to optimizing feature selection for machine learning-based audio classification, paving the way for more efficient and accurate real-world applications in areas like healthcare, security, and smart environments.