MH-1M: A Vast New Dataset for Android Malware Research

TLDR: MH-1M is a new, extensive dataset for Android malware research, featuring over 1.34 million applications and 22,810 attributes collected from 2010-2024. It addresses limitations of older datasets by providing comprehensive, up-to-date information including API calls, intents, permissions, opcodes, and VirusTotal reports. Developed using advanced tools like AMGenerator and AMExplorer, MH-1M supports diverse research, from longitudinal studies to improving machine learning models for malware detection. Its validation shows superior performance in classifying Android applications, making it a critical resource for cybersecurity advancements.

A new research paper introduces MH-1M, a groundbreaking dataset designed to significantly advance the study of Android malware. This comprehensive dataset, comprising over 1.34 million Android applications, aims to address critical limitations found in existing malware datasets, such as outdated information, insufficient sample sizes, and narrow feature sets. The creation of MH-1M is a collaborative effort by Hendrio Bragança, Diego Kreutz, Vanderson Rocha, Joner Assolin, and Eduardo Feitosa, who recognize the urgent need for robust data to combat the evolving landscape of Android cyber threats.

The Challenge of Android Malware

Android malware continues to pose a substantial threat due to the platform’s open-source nature and widespread use. Malware constantly adapts, employing sophisticated tactics that lead to data theft, destruction, and various other cybercrimes. While machine learning (ML) has proven effective in detecting malicious applications, its success heavily relies on the quality and relevance of the training data. Many current datasets fall short, offering limited samples, focusing on single features like permissions or API calls, and often containing outdated information, which can lead to misleading research conclusions.

Introducing MH-1M: A New Standard for Malware Research

MH-1M stands out as one of the most extensive and up-to-date datasets for Android malware research. It includes 1,340,515 applications and an impressive 22,810 extracted attributes, covering a 14-year period from 2010 to 2024. This vast collection, totaling over 400 GB of data, includes detailed Android features such as 22,394 API calls, 407 intents, 232 opcodes, and 214 permissions. Beyond these features, MH-1M provides extensive metadata, including SHA-256 hashes, file names, package names, compilation APIs, and crucial VirusTotal reports, offering a rich foundation for advanced research.

Compared with widely used earlier datasets, MH-1M is both larger and more current, which matters for modern malware detection. For instance, the Drebin dataset, released in 2014, contains roughly 129,000 samples, of which about 5,560 are malware, whereas MH-1M offers over 1.34 million samples with significantly richer and more recent metadata. This metadata improves transparency and reproducibility in research and opens new avenues for applying deep learning and large language models to develop more sophisticated, context-aware malware detection systems.

How MH-1M Was Built

The development of MH-1M involved a sophisticated pipeline utilizing three primary tools: ADBuilder, AMGenerator, and AMExplorer. AMGenerator, an evolution of ADBuilder, is responsible for acquiring Android application packages (APKs), extracting static features using tools like AndroGuard, and crucially, labeling samples. The labeling process leverages the VirusTotal API, which aggregates results from over 65 malware detection engines. To ensure accuracy, the researchers established a robust labeling threshold: an application is classified as malicious if at least four VirusTotal scanners flag it as such. AMExplorer then processes and integrates these outputs, allowing researchers to generate various dataset types tailored to specific research objectives.
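The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not AMGenerator's actual code: the function names and the simplified engine-to-verdict mapping are assumptions made for the example, and only the threshold of four comes from the article.

```python
from typing import Mapping

# Threshold reported in the article: at least four scanners must flag the sample.
MALICIOUS_THRESHOLD = 4


def count_malicious(engine_verdicts: Mapping[str, str]) -> int:
    """Count engines whose verdict is the string "malicious".

    The engine-name -> verdict mapping is an assumed, simplified stand-in
    for a VirusTotal report, not the real response schema.
    """
    return sum(1 for verdict in engine_verdicts.values() if verdict == "malicious")


def label_sample(engine_verdicts: Mapping[str, str],
                 threshold: int = MALICIOUS_THRESHOLD) -> int:
    """Return 1 (malware) if at least `threshold` engines flag the sample, else 0."""
    return int(count_malicious(engine_verdicts) >= threshold)
```

Keeping the threshold as a parameter makes it easy to relabel the same samples under a stricter or looser rule and compare the outcomes.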

Key Research Applications

The MH-1M dataset supports a wide array of contemporary research questions in Android malware detection:

Longitudinal Studies: Its data, spanning 2010 to 2024, enables analysis of malware evolution, including changes in API call, opcode, and permission usage over time.
Temporal Generalization: Researchers can evaluate how well models trained on older APKs perform against newer threats, addressing temporal bias.
Feature Selection: With tens of thousands of static features, MH-1M facilitates studies on the most predictive feature subsets and dimensionality reduction techniques.
Imbalance Handling: The dataset’s realistic malware-to-benign ratio (approximately 1:10) provides a testbed for developing classification methods that effectively handle imbalanced data.
Labeling Robustness: Detailed VirusTotal metadata allows for exploring different labeling strategies and improving accuracy over time.
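
Two of the items above, temporal generalization and imbalance handling, lend themselves to a short sketch. The record layout (a dict with "year" and "label" keys) and both function names are hypothetical and chosen only for illustration; the dataset's real column names may differ.

```python
from typing import Dict, List, Tuple

Sample = Dict[str, int]  # hypothetical record, e.g. {"year": 2016, "label": 1}


def temporal_split(samples: List[Sample],
                   cutoff_year: int) -> Tuple[List[Sample], List[Sample]]:
    """Train on samples up to and including `cutoff_year`, test on later ones.

    Splitting by time rather than at random keeps future samples out of
    the training set, which is what a temporal-bias experiment needs.
    """
    train = [s for s in samples if s["year"] <= cutoff_year]
    test = [s for s in samples if s["year"] > cutoff_year]
    return train, test


def positive_class_weight(samples: List[Sample]) -> float:
    """Return the benign-to-malware count ratio (label 1 = malware).

    A ratio like this is a common starting value for class weighting on
    imbalanced data; treat it as a baseline, not a tuned setting.
    """
    n_malware = sum(1 for s in samples if s["label"] == 1)
    if n_malware == 0:
        raise ValueError("no malware samples to weight")
    return (len(samples) - n_malware) / n_malware
```

With a malware-to-benign ratio of roughly 1:10, the weight computed this way would be close to 10.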

Performance and Validation

The researchers validated MH-1M using the XGBoost classifier, a gradient-boosted decision tree model. The classifier achieved an overall accuracy of 98.51% in distinguishing benign from malicious applications, surpassing the accuracy obtained with the MH-100K dataset and highlighting the benefits of MH-1M’s larger scale and greater variability. Errors were uneven across the two classes: only 0.49% of benign applications were misclassified as malware, while 11.69% of malware applications were misclassified as benign.
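
The reported figures can be related to one another with a small calculation. The sketch below assumes that 0.49% and 11.69% are per-class error rates (false positive and false negative rates) and that the test set has exactly ten benign samples per malware sample. Both are assumptions, since the article gives the ratio only approximately, and the function names are illustrative.

```python
from typing import Tuple


def rates_from_counts(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float, float]:
    """Return (accuracy, false_positive_rate, false_negative_rate).

    Malware is the positive class: fp counts benign apps flagged as
    malware, fn counts malware apps classified as benign.
    """
    if fp + tn == 0 or fn + tp == 0:
        raise ValueError("both classes must be present")
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return accuracy, fp / (fp + tn), fn / (fn + tp)


def accuracy_from_rates(fpr: float, fnr: float, benign_per_malware: float) -> float:
    """Overall accuracy implied by per-class error rates and a class ratio."""
    benign_correct = benign_per_malware * (1.0 - fpr)
    malware_correct = 1.0 - fnr
    return (benign_correct + malware_correct) / (benign_per_malware + 1.0)


# Assumed inputs: per-class rates from the article, exact 1:10 class ratio.
estimate = accuracy_from_rates(fpr=0.0049, fnr=0.1169, benign_per_malware=10.0)
print(f"{estimate:.4f}")  # 0.9849
```

Under these assumptions the implied accuracy is about 98.49%, close to the reported 98.51%. It also shows why overall accuracy stays high despite the much larger error rate on malware: benign samples dominate the total.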

Cross-classification experiments further emphasized the importance of comprehensive datasets. Models trained on MH-1M generalized well to the MH-100K dataset, whereas models trained on the smaller MH-100K struggled to accurately classify samples from the more diverse MH-1M. This underscores that larger, more diverse datasets like MH-1M are crucial for developing robust and generalizable malware detection models.
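
A cross-classification experiment of this kind can be organized as a small train-on-one, test-on-all loop. The harness below is a generic sketch, not the authors' evaluation code: the `fit` and `predict` callables, the dataset layout, and the function name are placeholders, and it assumes every dataset shares the same feature columns.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

# Hypothetical layout: (feature rows, labels), same columns in every dataset.
Dataset = Tuple[List[Sequence[int]], List[int]]


def cross_evaluate(
    datasets: Dict[str, Dataset],
    fit: Callable[[List[Sequence[int]], List[int]], Any],
    predict: Callable[[Any, List[Sequence[int]]], List[int]],
) -> Dict[Tuple[str, str], float]:
    """Train one model per dataset and score it on every dataset.

    Returns accuracy keyed by (train_name, test_name). Entries where the
    two names match are in-dataset scores; the rest measure transfer.
    """
    results: Dict[Tuple[str, str], float] = {}
    for train_name, (x_train, y_train) in datasets.items():
        model = fit(x_train, y_train)
        for test_name, (x_test, y_test) in datasets.items():
            predictions = predict(model, x_test)
            correct = sum(p == y for p, y in zip(predictions, y_test))
            results[(train_name, test_name)] = correct / len(y_test)
    return results
```

Reading the off-diagonal entries in both directions, for example train on the large set and test on the small one, then the reverse, gives the asymmetry the paragraph above describes.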

Understanding Malware Families

MH-1M also provides a solid foundation for studying and classifying malware families. Malware samples are grouped into superclasses such as Adware, Trojan, Riskware, and Others, based on their behavioral characteristics. Visualizations show that Trojan and Adware samples often share features, while Riskware exhibits a more dispersed distribution, reflecting its heterogeneous nature. The ‘Others’ category includes outliers and mixed-behavior samples, further illustrating the complex and evolving nature of Android malware.

Accessing the Dataset

The MH-1M dataset is openly accessible to the research community. The processed data is available through Figshare and GitHub repositories, while the raw version, including its extensive supplementary metadata, is archived in the Harvard Dataverse repository. This open access ensures that researchers worldwide can leverage MH-1M to replicate findings, conduct further investigations, and contribute to the ongoing fight against Android malware.