
Ensuring Data Quality: De-duplicating the Lakh MIDI Dataset for Robust AI Music Models

TLDR: This research addresses the critical issue of data duplication in large symbolic music datasets, specifically the Lakh MIDI Dataset (LMD). Duplicates, arising from minor edits or different arrangements, can lead to unreliable AI model training and evaluation. The study evaluates rule-based and neural network approaches, including a novel contrastive learning BERT model (CAugBERT). It proposes a combined method (CLaMP-1024 and CAugBERT) that effectively identifies and filters out a substantial number of duplicates (38,134 to 68,075 files) from LMD, providing cleaned lists to enhance the validity of future music AI research.

In the rapidly evolving world of artificial intelligence, especially in fields like music generation, large and diverse datasets are the backbone of powerful models. However, the way these datasets are often collected—through automated web scraping—can lead to a significant, yet often overlooked, problem: data duplication. This issue is particularly prevalent in symbolic music datasets, where minor edits, different arrangements, or metadata changes can create multiple versions of essentially the same musical piece.

A recent study by researchers from KAIST and Sogang University in South Korea sheds light on this critical problem, specifically focusing on the Lakh MIDI Dataset (LMD). LMD is one of the largest publicly available resources in the symbolic music domain, widely used for training AI models that generate music. The paper, titled “On the De-duplication of the Lakh MIDI Dataset,” highlights how duplicates can severely compromise the reliability of AI model training and evaluation. When identical or near-identical data appears in both training and testing sets, it can lead to inflated performance metrics, making models appear more capable than they truly are—a phenomenon known as data leakage.

Understanding Duplication in Music Data

The researchers categorized duplicates into two main types: hard duplication and soft duplication. Hard duplication refers to files that are almost identical, with only minor differences such as instrument mapping, tempo, or small note-level alterations. These often arise when users modify and re-upload existing MIDI files. Soft duplication, on the other hand, involves files that preserve core musical elements like melody and harmony but differ significantly in arrangement style, reflecting diverse interpretations by different arrangers. This study primarily focused on detecting hard duplication.

The Challenge of De-duplication

Manually sifting through a massive dataset like LMD to find duplicates is practically impossible. To address this, the research team explored various automated methods. They evaluated both traditional rule-based approaches and advanced neural network models. Rule-based methods included techniques like MIDI Encoding Hash (checking for identical digital fingerprints), Beat Position Entropy (analyzing note distribution within a bar), and Chroma-DTW (measuring pitch content similarity using Dynamic Time Warping).
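The simplest of these, the MIDI Encoding Hash, boils down to fingerprinting each file's bytes and grouping files whose fingerprints collide. The sketch below illustrates that idea; the function names and toy byte strings are illustrative, not taken from the paper:

```python
import hashlib
from collections import defaultdict

def file_fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest of the raw MIDI bytes."""
    return hashlib.sha256(data).hexdigest()

def group_exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents hash to the same fingerprint."""
    buckets = defaultdict(list)
    for name, data in files.items():
        buckets[file_fingerprint(data)].append(name)
    # Only groups with more than one member are duplicates.
    return [names for names in buckets.values() if len(names) > 1]

# Toy example: two byte-identical "files" and one distinct one.
files = {
    "song_a.mid": b"MThd...same content",
    "song_a_copy.mid": b"MThd...same content",
    "song_b.mid": b"MThd...different content",
}
print(group_exact_duplicates(files))  # → [['song_a.mid', 'song_a_copy.mid']]
```

Hashing only catches byte-for-byte copies, which is exactly why the study also needed content-aware methods like Chroma-DTW and neural embeddings.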

For neural approaches, they leveraged existing pre-trained models designed for symbolic music understanding and retrieval, such as MusicBERT and the CLaMP series (CLaMP-512, CLaMP-1024, CLaMP2, CLaMP3). These models learn to create numerical representations (embeddings) of music, allowing for similarity comparisons. Additionally, the team developed their own contrastive learning-based BERT model, named CAugBERT. This model was specifically trained to identify duplicates by learning from various “augmentations” or minor variations of the same musical piece, making it robust to common differences found in duplicates.
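Once each file is mapped to an embedding, duplicate detection reduces to comparing vectors, typically with cosine similarity against a threshold. The sketch below assumes precomputed embeddings; the 0.95 threshold and function names are illustrative placeholders, not values from the paper:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_duplicate(emb_a: list[float], emb_b: list[float],
                 threshold: float = 0.95) -> bool:
    """Flag a pair as duplicates when their embeddings are close enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold
```

In practice the embeddings would come from a model such as CLaMP or CAugBERT, and the threshold would be tuned on a labeled benchmark like LMD-clean.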

Evaluating the Methods

To rigorously test these methods, the researchers used a subset of LMD called LMD-clean as a benchmark. LMD-clean is meticulously organized, with different versions of the same songs grouped, providing a clear ground truth for duplicate identification. They assessed each method using standard retrieval metrics like nDCG (Normalized Discounted Cumulative Gain) and MRR (Mean Reciprocal Rank), which measure how well duplicates are ranked as similar. They also used classification metrics like precision, recall, and F1-score to evaluate the accuracy of identifying true duplicates.
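MRR, one of the retrieval metrics used, rewards a method for ranking a true duplicate near the top: each query scores the reciprocal of the rank of its first relevant hit, averaged over all queries. A minimal sketch, with hypothetical query data:

```python
def mean_reciprocal_rank(ranked_results: list[list[str]],
                         relevant: list[set[str]]) -> float:
    """Average 1/rank of the first relevant item per query (0 if none found)."""
    reciprocal_ranks = []
    for ranking, rel in zip(ranked_results, relevant):
        score = 0.0
        for rank, item in enumerate(ranking, start=1):
            if item in rel:
                score = 1.0 / rank
                break
        reciprocal_ranks.append(score)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Query 1 finds its duplicate at rank 1, query 2 at rank 2:
# MRR = (1.0 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["dup", "other"], ["other", "dup"]],
                           [{"dup"}, {"dup"}]))  # → 0.75
```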

The findings revealed that neural network-based approaches generally outperformed rule-based methods. Among them, the CLaMP model series showed strong performance in retrieval tasks, while CAugBERT, their custom contrastive learning model, achieved the best classification performance, demonstrating the effectiveness of its specialized training. Interestingly, different models excelled in different aspects, suggesting they capture distinct facets of musical similarity.


A Proposed Solution for a Cleaner Dataset

Recognizing the complementary strengths of different models, the researchers investigated combining methods to enhance performance. Their most effective approach was a union of CLaMP-1024 and CAugBERT, which covered a broad range of duplication types. Applied to the LMD-full dataset of 178,561 files, even a conservative rejection threshold flagged 38,134 files as duplicates to be filtered, while their proposed configuration raised that number to 68,075.
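The union strategy described above can be sketched as a simple set operation: a file pair is flagged if either model flags it. The helper name and toy pair data below are illustrative, not from the paper:

```python
def union_duplicates(pairs_model_a: list[tuple[str, str]],
                     pairs_model_b: list[tuple[str, str]]) -> set[tuple[str, str]]:
    """Flag a pair as duplicate if either model flags it (order-insensitive)."""
    normalize = lambda pair: tuple(sorted(pair))
    return ({normalize(p) for p in pairs_model_a}
            | {normalize(p) for p in pairs_model_b})

# Model A flags (x, y); model B flags the same pair (reversed) plus (x, z).
flagged = union_duplicates([("x", "y")], [("y", "x"), ("x", "z")])
print(flagged)  # → {('x', 'y'), ('x', 'z')}
```

A union trades some precision for recall: it catches duplicates either model sees, which matches the paper's goal of broad coverage across duplication types.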

The study concludes by offering three different de-duplication lists for LMD, catering to various research needs—from filtering popular tracks based on LMD-clean to comprehensive de-duplication of the entire LMD-full dataset using both proposed and conservative thresholds. These lists are expected to significantly improve the validity and reliability of future symbolic music research and AI model training. The methods explored in this paper also hold promise for application to other large-scale symbolic music datasets, paving the way for cleaner, more robust AI in music. You can find more details about this research in the full paper available here.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
