
Ensuring Data Quality: De-duplicating the Lakh MIDI Dataset for Robust AI Music Models

TLDR: This research addresses the critical issue of data duplication in large symbolic music datasets, specifically the Lakh MIDI Dataset (LMD). Duplicates, arising from minor edits or different arrangements, can lead to unreliable AI model training and evaluation. The study evaluates rule-based and neural network approaches, including a novel contrastive learning BERT model (CAugBERT). It proposes a combined method (CLaMP-1024 and CAugBERT) that effectively identifies and filters out a substantial number of duplicates (38,134 to 68,075 files) from LMD, providing cleaned lists to enhance the validity of future music AI research.

In the rapidly evolving world of artificial intelligence, especially in fields like music generation, large and diverse datasets are the backbone of powerful models. However, the way these datasets are often collected—through automated web scraping—can lead to a significant, yet often overlooked, problem: data duplication. This issue is particularly prevalent in symbolic music datasets, where minor edits, different arrangements, or metadata changes can create multiple versions of essentially the same musical piece.

A recent study by researchers from KAIST and Sogang University in South Korea sheds light on this critical problem, specifically focusing on the Lakh MIDI Dataset (LMD). LMD is one of the largest publicly available resources in the symbolic music domain, widely used for training AI models that generate music. The paper, titled “On the De-duplication of the Lakh MIDI Dataset,” highlights how duplicates can severely compromise the reliability of AI model training and evaluation. When identical or near-identical data appears in both training and testing sets, it can lead to inflated performance metrics, making models appear more capable than they truly are—a phenomenon known as data leakage.

Understanding Duplication in Music Data

The researchers categorized duplicates into two main types: hard duplication and soft duplication. Hard duplication refers to files that are almost identical, with only minor differences such as instrument mapping, tempo, or small note-level alterations. These often arise when users modify and re-upload existing MIDI files. Soft duplication, on the other hand, involves files that preserve core musical elements like melody and harmony but differ significantly in arrangement style, reflecting diverse interpretations by different arrangers. This study primarily focused on detecting hard duplication.

The Challenge of De-duplication

Manually sifting through a massive dataset like LMD to find duplicates is practically impossible. To address this, the research team explored various automated methods. They evaluated both traditional rule-based approaches and advanced neural network models. Rule-based methods included techniques like MIDI Encoding Hash (checking for identical digital fingerprints), Beat Position Entropy (analyzing note distribution within a bar), and Chroma-DTW (measuring pitch content similarity using Dynamic Time Warping).
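The simplest of these, the MIDI Encoding Hash, boils down to fingerprinting each file's bytes and grouping files whose fingerprints collide. The sketch below illustrates that idea; the function names and toy byte strings are illustrative, not taken from the paper:

```python
import hashlib
from collections import defaultdict

def file_fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest of the raw MIDI bytes."""
    return hashlib.sha256(data).hexdigest()

def group_exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents hash to the same fingerprint."""
    buckets = defaultdict(list)
    for name, data in files.items():
        buckets[file_fingerprint(data)].append(name)
    # Only groups with more than one member are duplicates.
    return [names for names in buckets.values() if len(names) > 1]

# Toy example: two byte-identical "files" and one distinct one.
files = {
    "song_a.mid": b"MThd...same content",
    "song_a_copy.mid": b"MThd...same content",
    "song_b.mid": b"MThd...different content",
}
print(group_exact_duplicates(files))  # → [['song_a.mid', 'song_a_copy.mid']]
```

Hashing only catches byte-for-byte copies, which is exactly why the study also needed content-aware methods like Chroma-DTW and neural embeddings.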

For neural approaches, they leveraged existing pre-trained models designed for symbolic music understanding and retrieval, such as MusicBERT and the CLaMP series (CLaMP-512, CLaMP-1024, CLaMP2, CLaMP3). These models learn to create numerical representations (embeddings) of music, allowing for similarity comparisons. Additionally, the team developed their own contrastive learning-based BERT model, named CAugBERT. This model was specifically trained to identify duplicates by learning from various “augmentations” or minor variations of the same musical piece, making it robust to common differences found in duplicates.
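Once each file is mapped to an embedding, duplicate detection reduces to comparing vectors, typically with cosine similarity against a threshold. The sketch below assumes precomputed embeddings; the 0.95 threshold and function names are illustrative placeholders, not values from the paper:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_duplicate(emb_a: list[float], emb_b: list[float],
                 threshold: float = 0.95) -> bool:
    """Flag a pair as duplicates when their embeddings are close enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold
```

In practice the embeddings would come from a model such as CLaMP or CAugBERT, and the threshold would be tuned on a labeled benchmark like LMD-clean.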

Evaluating the Methods

To rigorously test these methods, the researchers used a subset of LMD called LMD-clean as a benchmark. LMD-clean is meticulously organized, with different versions of the same songs grouped, providing a clear ground truth for duplicate identification. They assessed each method using standard retrieval metrics like nDCG (Normalized Discounted Cumulative Gain) and MRR (Mean Reciprocal Rank), which measure how well duplicates are ranked as similar. They also used classification metrics like precision, recall, and F1-score to evaluate the accuracy of identifying true duplicates.
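MRR, one of the retrieval metrics used, rewards a method for ranking a true duplicate near the top: each query scores the reciprocal of the rank of its first relevant hit, averaged over all queries. A minimal sketch, with hypothetical query data:

```python
def mean_reciprocal_rank(ranked_results: list[list[str]],
                         relevant: list[set[str]]) -> float:
    """Average 1/rank of the first relevant item per query (0 if none found)."""
    reciprocal_ranks = []
    for ranking, rel in zip(ranked_results, relevant):
        score = 0.0
        for rank, item in enumerate(ranking, start=1):
            if item in rel:
                score = 1.0 / rank
                break
        reciprocal_ranks.append(score)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Query 1 finds its duplicate at rank 1, query 2 at rank 2:
# MRR = (1.0 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["dup", "other"], ["other", "dup"]],
                           [{"dup"}, {"dup"}]))  # → 0.75
```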

The findings revealed that neural network-based approaches generally outperformed rule-based methods. Among them, the CLaMP model series showed strong performance in retrieval tasks, while CAugBERT, their custom contrastive learning model, achieved the best classification performance, demonstrating the effectiveness of its specialized training. Interestingly, different models excelled in different aspects, suggesting they capture distinct facets of musical similarity.


A Proposed Solution for a Cleaner Dataset

Recognizing the complementary strengths of different models, the researchers investigated combining methods to enhance performance. Their most effective approach was a union of CLaMP-1024 and CAugBERT, which covered a broad range of duplication types. Applied to the LMD-full dataset of 178,561 files, even a conservative rejection threshold flagged 38,134 files as duplicates to be filtered, while their proposed configuration raised that number to 68,075.
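The union strategy described above can be sketched as a simple set operation: a file pair is flagged if either model flags it. The helper name and toy pair data below are illustrative, not from the paper:

```python
def union_duplicates(pairs_model_a: list[tuple[str, str]],
                     pairs_model_b: list[tuple[str, str]]) -> set[tuple[str, str]]:
    """Flag a pair as duplicate if either model flags it (order-insensitive)."""
    normalize = lambda pair: tuple(sorted(pair))
    return ({normalize(p) for p in pairs_model_a}
            | {normalize(p) for p in pairs_model_b})

# Model A flags (x, y); model B flags the same pair (reversed) plus (x, z).
flagged = union_duplicates([("x", "y")], [("y", "x"), ("x", "z")])
print(flagged)  # → {('x', 'y'), ('x', 'z')}
```

A union trades some precision for recall: it catches duplicates either model sees, which matches the paper's goal of broad coverage across duplication types.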

The study concludes by offering three different de-duplication lists for LMD, catering to various research needs—from filtering popular tracks based on LMD-clean to comprehensive de-duplication of the entire LMD-full dataset using both proposed and conservative thresholds. These lists are expected to significantly improve the validity and reliability of future symbolic music research and AI model training. The methods explored in this paper also hold promise for application to other large-scale symbolic music datasets, paving the way for cleaner, more robust AI in music. You can find more details about this research in the full paper available here.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
