Advancing Audio Fingerprinting with Pretrained Conformer Encoders

TLDR: A new research paper introduces Conformer-based encoders for audio fingerprinting and retrieval, achieving state-of-the-art results. These models, trained using a self-supervised contrastive learning framework with advanced data augmentation, generate unique embeddings from just 3 seconds of audio. They demonstrate exceptional robustness to temporal misalignments and various audio distortions like noise and reverb, making them highly effective for identifying audio content even under challenging conditions.

In the evolving landscape of audio technology, the ability to quickly and accurately identify audio content from even a small snippet is crucial. This process, known as audio fingerprinting, underpins popular services like music identification apps (e.g., Shazam, Google Now Playing) and is vital for tasks such as detecting copyright infringement, tracking advertisements, and even identifying unsolicited phone calls in telecommunications.

Traditionally, audio fingerprinting has relied on techniques that extract unique features from audio, often converting them into low-dimensional representations or ‘fingerprints’. These fingerprints are then stored in a database, allowing for rapid matching when a new audio excerpt is presented. While various methods exist, including local descriptors, peak-based approaches, and neural networks, recent advancements have focused on leveraging deep learning for more robust and generalized solutions.
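The store-and-match loop described above can be sketched as a brute-force nearest-neighbour lookup. This is only an illustration under stated assumptions, not any particular system's method: the toy 4-dimensional vectors, the track names, and the `best_match` helper are hypothetical, and production systems would typically replace the linear scan with an approximate index.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(query, database):
    """Return (track_id, similarity) for the stored fingerprint closest to the query."""
    return max(
        ((track_id, cosine_similarity(query, fp)) for track_id, fp in database.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical toy database; real fingerprints would come from a feature extractor.
database = {
    "track_a": [0.9, 0.1, 0.0, 0.2],
    "track_b": [0.1, 0.8, 0.3, 0.0],
    "track_c": [0.0, 0.2, 0.9, 0.4],
}

# A slightly perturbed copy of track_b's fingerprint stands in for a noisy query.
query = [0.15, 0.75, 0.35, 0.05]
match_id, score = best_match(query, database)
```

Here the perturbed query still lands closest to `track_b`, which is the behaviour a fingerprinting system relies on: small distortions should move the fingerprint less than the distance to any other track.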

Introducing Conformer-Based Audio Fingerprinting

A new research paper, “Pretrained Conformers for Audio Fingerprinting and Retrieval”, proposes using Conformer-based encoders for this task. Authored by Kemal Altwlkany, Elmedin Selmanovic, and Sead Delalic from Infobip and the University of Sarajevo, the work addresses a key challenge in audio retrieval: generating unique and robust audio embeddings from very short segments, even when the audio is significantly distorted.

Conformers are a neural network architecture that combines the strengths of Convolutional Neural Networks (CNNs) and Transformers. CNNs are good at capturing local features within data, while Transformers excel at modeling global interactions and long-range dependencies. This hybrid design suits audio fingerprinting, where both the precise spectral content and its position in time matter for accurate identification.

How the Models Learn and Perform

The researchers utilized a self-supervised contrastive learning framework, specifically the SimCLR framework, to train their Conformer-based encoders. In essence, the models learn by comparing different versions of the same audio segment (positive pairs) against other, unrelated audio segments (negative examples). The goal is to make the embeddings of similar audio segments close together in a high-dimensional space, while pushing dissimilar ones further apart.
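The contrastive objective can be sketched with the NT-Xent loss that SimCLR uses. This is a minimal pure-Python illustration, not the authors' implementation: the temperature of 0.1, the 2-dimensional toy embeddings, and the function names are assumptions made for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def nt_xent_loss(originals, augmented, temperature=0.1):
    """NT-Xent over a batch where originals[i] and augmented[i] form a positive pair."""
    z = [l2_normalize(v) for v in originals + augmented]
    n = len(originals)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of anchor i's positive partner
        # Scaled similarity of anchor i to every other embedding (self excluded).
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
            if j != i
        }
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[pos]
    return total / (2 * n)

originals = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # augmented views stay near their originals
misaligned = [[0.1, 0.9], [0.9, 0.1]]   # views land near the wrong original

loss_aligned = nt_xent_loss(originals, aligned)
loss_misaligned = nt_xent_loss(originals, misaligned)
```

The loss is lower when each augmented view sits near its own original than when it sits near a different one, which is the pressure that pulls positive pairs together and pushes unrelated segments apart.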

A critical aspect of their training methodology involved extensive data augmentation. This included adding various types of noise (background, colored), applying reverb, pitch-shifting, time-stretching, and time-shifting. A notable innovation was the use of beta-distributed temporal shifting, which encouraged the models to learn from ‘hard examples’ – audio segments with larger temporal misalignments. This technique proved crucial in making the models highly robust to temporal shifts, allowing for accurate retrieval even when the query audio is not perfectly aligned with the database entry.
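One way to picture beta-distributed temporal shifting is to draw the shift from a Beta distribution skewed toward the upper end of the allowed range, so that large misalignments appear more often than under uniform sampling. The sketch below is an assumption-laden illustration rather than the paper's recipe: the shape parameters (`alpha=5.0`, `beta=1.5`), the 150 ms maximum, the 8 kHz sample rate, and both helper functions are hypothetical.

```python
import random

def sample_shift_ms(max_shift_ms, alpha=5.0, beta=1.5, rng=random):
    """Draw a shift in [0, max_shift_ms]; alpha > beta skews draws toward large shifts."""
    return rng.betavariate(alpha, beta) * max_shift_ms

def shifted_crop(samples, start, length, shift_ms, sample_rate):
    """Crop `length` samples starting `shift_ms` after `start`, clamped to the signal."""
    offset = int(round(shift_ms * sample_rate / 1000.0))
    begin = max(0, min(start + offset, len(samples) - length))
    return samples[begin:begin + length]

rng = random.Random(0)  # seeded only to make the illustration repeatable
shifts = [sample_shift_ms(150.0, rng=rng) for _ in range(10_000)]
mean_shift = sum(shifts) / len(shifts)

# Build a positive pair: the same 1-second region, with one view offset by 100 ms.
sample_rate = 8000                      # assumed value for the example
signal = list(range(2 * sample_rate))   # stand-in for real audio samples
anchor = shifted_crop(signal, 0, sample_rate, 0.0, sample_rate)
positive = shifted_crop(signal, 0, sample_rate, 100.0, sample_rate)
```

With these shape parameters the average drawn shift sits well above the midpoint of the range, so the encoder sees more hard, strongly misaligned pairs than it would with uniformly sampled offsets.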

The paper presents three models: small (1.5M parameters), medium (8.8M parameters), and large (26.2M parameters), trained on different subsets of the Free Music Archive (FMA) dataset. All models generate 128-dimensional embeddings from just 3 seconds of audio input.
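To make the "one embedding per 3 seconds" idea concrete, the sketch below computes where consecutive 3-second windows would start in a longer track. Only the 3-second window length comes from the article; the 0.5-second hop, the 8 kHz sample rate, and the `segment_starts` helper are assumptions chosen for illustration.

```python
def segment_starts(num_samples, sample_rate, window_s=3.0, hop_s=0.5):
    """Start indices of every full window that fits inside the signal."""
    window = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    if num_samples < window:
        return []  # signal is too short to yield a single full window
    return list(range(0, num_samples - window + 1, hop))

sample_rate = 8000                                        # assumed, not from the article
starts = segment_starts(10 * sample_rate, sample_rate)    # a 10-second track
window_len = int(3.0 * sample_rate)

# Each (start, start + window_len) slice would be encoded into one embedding.
slices = [(s, s + window_len) for s in starts]
```

Under these assumed settings a 10-second track yields 15 overlapping windows, each of which would be mapped to its own 128-dimensional embedding and stored in the database.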

State-of-the-Art Results and Robustness

The experimental results show that these pretrained Conformer encoders achieve state-of-the-art performance in audio retrieval tasks. They reach near-perfect hit rates even with significant temporal shifts (up to 150 ms), indicating strong robustness to such misalignments. Furthermore, the large Conformer model performs comparably to existing state-of-the-art methods under audio distortions such as background and colored noise, especially at higher Signal-to-Noise Ratios (SNR).
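A hit rate of this kind can be illustrated as the fraction of queries whose nearest database entry is the correct track. The sketch below shows a generic top-1 hit rate, not the paper's exact evaluation protocol: the toy 2-dimensional embeddings, the track IDs, and the `top1_hit_rate` helper are hypothetical, and embeddings are assumed to be roughly unit length so that a dot product serves as the similarity score.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))

def top1_hit_rate(queries, database):
    """queries: list of (true_id, embedding); database: dict of id -> embedding."""
    hits = 0
    for true_id, embedding in queries:
        predicted = max(database, key=lambda tid: dot(embedding, database[tid]))
        hits += predicted == true_id
    return hits / len(queries)

database = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
queries = [
    ("a", [0.9, 0.436]),   # mildly distorted, still closest to "a"
    ("b", [0.2, 0.98]),    # mildly distorted, still closest to "b"
    ("a", [0.3, 0.95]),    # heavily distorted, drifts toward "b" and misses
]
rate = top1_hit_rate(queries, database)
```

Two of the three toy queries retrieve the right track, so the hit rate is 2/3; a "near-perfect" hit rate means almost no queries drift far enough to cross into another track's neighbourhood.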

While the smaller models are highly effective for applications where severe audio distortions are less common, the large model stands out by matching or exceeding the performance of leading approaches like Neural Audio Fingerprinter (NAFP) and PeakNetFP for extreme temporal distortions, and GraFPrint for noisy conditions, particularly when evaluated on large datasets.

Future Implications

This research marks a significant step forward in content-based audio retrieval. By leveraging the unique capabilities of Conformers and employing sophisticated self-supervised learning techniques, the authors have developed models that are not only highly accurate but also exceptionally robust to common audio distortions. The public availability of their code and models further facilitates reproducibility and future research in this vital field, paving the way for more advanced and reliable audio identification systems.

Nikhil Patel
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
