TLDR: HiGraph is the largest public hierarchical graph dataset for malware analysis, containing over 200 million Control Flow Graphs (CFGs) nested within 595,000 Function Call Graphs (FCGs) of Android applications. It addresses the limitation of single-level datasets by capturing both low-level instruction logic and high-level functional interactions, which makes malware detection more robust to obfuscation. In the authors' experiments, a hierarchical graph neural network (Hi-GNN) trained on the dataset outperforms single-level models at detecting and classifying malware and holds up better over time as threats evolve.
The fight against ever-evolving malware is a constant challenge for cybersecurity experts. A significant hurdle in this battle has been the lack of comprehensive datasets that truly capture the intricate, multi-layered nature of software. Traditional approaches often simplify programs into single, flat graphs, missing crucial details about how different parts of a program interact and function at various levels.
Addressing this critical gap, a new research paper introduces HiGraph, the largest public hierarchical graph dataset specifically designed for malware analysis. This groundbreaking dataset offers a two-level representation of software, providing a much deeper insight into program behavior than previously possible.
Understanding the Layers of Software
Imagine a complex building. A traditional, ‘flat’ graph might show you the overall layout of the building – which rooms connect to which. But it wouldn’t tell you about the intricate wiring within each room, the plumbing systems, or the specific functions of individual appliances. HiGraph, on the other hand, provides both views.
At its core, HiGraph models software using two types of graphs:
- Control Flow Graphs (CFGs): These represent the low-level instruction logic within individual functions. Think of this as the detailed wiring diagram inside a single room, showing how electricity flows from one point to another.
- Function Call Graphs (FCGs): These depict the high-level interactions between different functions within an application. This is like the building’s blueprint, showing how different rooms (functions) are connected and communicate with each other.
By combining these two levels, HiGraph captures the ‘structural semantics’ of software – how high-level functional interactions relate to low-level instruction logic. This is vital because malware often evolves by changing superficial code details while maintaining its core malicious behavior. A hierarchical view can spot these persistent patterns, making detection more robust against obfuscation and malware evolution.
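The two-level structure described above can be pictured as a graph whose nodes each hold another graph. The following is a minimal sketch of that idea, not HiGraph's actual storage format: the class names `CFG` and `HierarchicalApp`, their fields, and the example function names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CFG:
    """Control flow graph of one function: basic blocks and jumps between them."""
    num_blocks: int
    edges: list[tuple[int, int]] = field(default_factory=list)  # (src_block, dst_block)


@dataclass
class HierarchicalApp:
    """One application: an FCG whose nodes each carry their own CFG."""
    cfgs: dict[str, CFG] = field(default_factory=dict)        # function name -> CFG
    calls: set[tuple[str, str]] = field(default_factory=set)  # (caller, callee)

    def add_function(self, name: str, cfg: CFG) -> None:
        self.cfgs[name] = cfg

    def add_call(self, caller: str, callee: str) -> None:
        # Only link functions that already exist as FCG nodes.
        if caller not in self.cfgs or callee not in self.cfgs:
            raise KeyError("both functions must be added before linking them")
        self.calls.add((caller, callee))

    def callees(self, name: str) -> list[str]:
        return sorted(dst for src, dst in self.calls if src == name)


# Hypothetical two-function app: onCreate calls sendSms.
app = HierarchicalApp()
app.add_function("onCreate", CFG(num_blocks=3, edges=[(0, 1), (0, 2)]))
app.add_function("sendSms", CFG(num_blocks=4, edges=[(0, 1), (1, 2), (2, 1), (2, 3)]))
app.add_call("onCreate", "sendSms")

print(app.callees("onCreate"))
```

Keeping each CFG attached to its FCG node is what lets a single query move between the two levels, for example from "which functions does this one call?" to "what does the control flow inside the callee look like?".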
A Dataset of Unprecedented Scale
HiGraph comprises over 200 million CFGs nested within roughly 595,000 FCGs. This scale allows for large-scale analysis that reveals distinct structural properties between benign (harmless) and malicious software. The dataset was constructed from 595,211 Android applications collected from AndroZoo, with labels established using VirusTotal reports to ensure accuracy and temporal consistency.
Key Insights from HiGraph
The researchers conducted extensive analysis on HiGraph, uncovering fascinating differences:
- Structural Complexity: Malicious applications tend to have more influential functions and a more centralized architecture at the FCG level. At the CFG level, malware shows higher node degrees and cyclomatic complexity, indicating more intricate conditional logic within individual functions, often designed for obfuscation.
- Temporal Evolution: A decade of data (2012-2022) reveals divergent evolutionary patterns. Benign applications show increasing complexity and modularity over time, reflecting good software engineering practices. Malware FCGs, in contrast, tend to shrink after 2015, suggesting a shift towards smaller, more targeted functional units. At the same time their density increases: functions become more tightly interconnected, likely for operational efficiency and evasion.
- API Usage: Analysis of API call frequencies provides a ‘functional fingerprint,’ distinguishing between common utility APIs and security-sensitive platform APIs often exploited by malware.
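The structural measures named in the list above have standard textbook definitions, sketched below in plain Python. This is an illustration under assumptions: the article does not say exactly how the researchers computed each metric, so the function names and the choice of directed-graph density without self-loops are this sketch's own.

```python
def cyclomatic_complexity(num_nodes, edges, num_components=1):
    """McCabe's metric M = E - N + 2P for a control flow graph."""
    return len(edges) - num_nodes + 2 * num_components


def average_degree(num_nodes, edges):
    """Mean number of edge endpoints per node (in-degree plus out-degree)."""
    return 2 * len(edges) / num_nodes if num_nodes else 0.0


def density(num_nodes, edges):
    """Fraction of possible directed edges (no self-loops) that are present."""
    possible = num_nodes * (num_nodes - 1)
    return len(set(edges)) / possible if possible else 0.0


# Toy CFG with four basic blocks and one loop: 0 -> 1 -> 2 -> 1, and 2 -> 3.
cfg_edges = [(0, 1), (1, 2), (2, 1), (2, 3)]

print(cyclomatic_complexity(4, cfg_edges))  # 4 edges - 4 nodes + 2 = 2
print(average_degree(4, cfg_edges))         # 8 endpoints / 4 nodes = 2.0
print(density(4, cfg_edges))                # 4 of 12 possible edges
```

The same `density` function applies at the FCG level, where a smaller graph with the same number of call edges scores higher. That is the pattern the temporal analysis reports for malware after 2015.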
Enhanced Malware Detection with Hi-GNN
To demonstrate HiGraph’s utility, the researchers developed Hi-GNN, a hierarchical graph neural network. This model processes both the local CFGs and the global FCGs simultaneously. Experiments showed that Hi-GNN significantly outperforms traditional single-level GNNs in malware detection and classification tasks. Crucially, Hi-GNN also exhibited superior temporal robustness, meaning it’s better at detecting new, evolving malware variants – a critical advantage in the fast-paced cybersecurity landscape.
This resilience to ‘concept drift’ or ‘model aging’ is attributed to Hi-GNN’s ability to learn stable semantic patterns from CFGs and adaptive architectural patterns from FCGs, allowing it to identify persistent malicious blueprints even as malware changes its superficial appearance.
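The two-level flow of information can be sketched in a few lines of plain Python. This is not the Hi-GNN architecture from the paper, whose layers and training procedure are not described here. It is a weight-free stand-in that assumes mean aggregation, treats edges as undirected, and uses hypothetical function names, purely to show how CFG embeddings can become the node features of the FCG.

```python
def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def propagate(features, edges):
    """One message-passing round: each node averages itself with its neighbours."""
    neighbours = {node: [node] for node in features}
    for src, dst in edges:
        # Edges are treated as undirected in this sketch.
        neighbours[src].append(dst)
        neighbours[dst].append(src)
    return {node: mean_vectors([features[n] for n in ns])
            for node, ns in neighbours.items()}


def embed_graph(features, edges):
    """Propagate once, then mean-pool all node vectors into one graph vector."""
    return mean_vectors(list(propagate(features, edges).values()))


def embed_app(cfgs, call_edges):
    """Level 1: embed every CFG. Level 2: use those vectors as FCG node features."""
    function_vectors = {name: embed_graph(feats, edges)
                        for name, (feats, edges) in cfgs.items()}
    return embed_graph(function_vectors, call_edges)


# Hypothetical app: function "f" has two basic blocks, "g" has one, and f calls g.
cfgs = {
    "f": ({0: [1.0, 0.0], 1: [0.0, 1.0]}, [(0, 1)]),
    "g": ({0: [1.0, 1.0]}, []),
}
app_vector = embed_app(cfgs, [("f", "g")])
print(app_vector)
```

A trained model would replace the plain averages with learned transformations and feed the final vector to a classifier, but the nesting stays the same: instruction-level structure is summarised per function first, and only then combined along call edges.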
HiGraph and its associated tools are publicly available, aiming to standardize the evaluation of hierarchical malware analysis methods and foster reproducible research in AI for cybersecurity. This work represents a significant step towards building next-generation defense systems capable of combating sophisticated and evolving cyber threats. The full research paper is titled ‘HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis.’


