A New Dataset for Understanding Malware's Evolving Structures

TLDR: HiGraph is the largest public hierarchical graph dataset for malware analysis, containing over 200 million Control Flow Graphs (CFGs) nested within 595,000 Function Call Graphs (FCGs) of Android applications. It addresses the limitation of single-level datasets by capturing both low-level instruction logic and high-level functional interactions, which enables more robust malware detection, especially against evolving threats and obfuscation techniques. A hierarchical graph neural network (Hi-GNN) evaluated on the dataset outperforms single-level GNNs at detecting and classifying malware and holds up better as malware changes over time.

The fight against ever-evolving malware is a constant challenge for cybersecurity experts. A significant hurdle in this battle has been the lack of comprehensive datasets that truly capture the intricate, multi-layered nature of software. Traditional approaches often simplify programs into single, flat graphs, missing crucial details about how different parts of a program interact and function at various levels.

Addressing this gap, a new research paper introduces HiGraph, the largest public hierarchical graph dataset specifically designed for malware analysis. The dataset offers a two-level representation of software, giving deeper insight into program behavior than a single flat graph can.

Understanding the Layers of Software

Imagine a complex building. A traditional, ‘flat’ graph might show you the overall layout of the building – which rooms connect to which. But it wouldn’t tell you about the intricate wiring within each room, the plumbing systems, or the specific functions of individual appliances. HiGraph, on the other hand, provides both views.

At its core, HiGraph models software using two types of graphs:

Control Flow Graphs (CFGs): These represent the low-level instruction logic within individual functions. Think of this as the detailed wiring diagram inside a single room, showing how electricity flows from one point to another.
Function Call Graphs (FCGs): These depict the high-level interactions between different functions within an application. This is like the building’s blueprint, showing how different rooms (functions) are connected and communicate with each other.

By combining these two levels, HiGraph captures the ‘structural semantics’ of software – how high-level functional interactions relate to low-level instruction logic. This is vital because malware often evolves by changing superficial code details while maintaining its core malicious behavior. A hierarchical view can spot these persistent patterns, making detection more robust against obfuscation and malware evolution.
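The two-level idea can be made concrete with a small data structure. The following is a minimal sketch, not HiGraph's actual storage format: the class names `CFG` and `HierarchicalApp`, the function names, and the block labels are all hypothetical and chosen only for illustration. Each FCG node is a function, and each function carries its own CFG.

```python
from dataclasses import dataclass, field


@dataclass
class CFG:
    """Control flow graph of one function (hypothetical layout)."""
    blocks: list[str]             # basic-block labels
    edges: list[tuple[int, int]]  # (source block index, target block index)


@dataclass
class HierarchicalApp:
    """One application: an FCG whose nodes each carry a CFG."""
    cfgs: dict[str, CFG] = field(default_factory=dict)          # function -> CFG
    calls: list[tuple[str, str]] = field(default_factory=list)  # (caller, callee)

    def add_function(self, name: str, cfg: CFG) -> None:
        self.cfgs[name] = cfg

    def add_call(self, caller: str, callee: str) -> None:
        # Both endpoints must already exist as FCG nodes.
        if caller not in self.cfgs or callee not in self.cfgs:
            raise KeyError("unknown function in call edge")
        self.calls.append((caller, callee))

    def callees(self, name: str) -> list[str]:
        return [dst for src, dst in self.calls if src == name]


# Toy application with two functions and one call edge.
app = HierarchicalApp()
app.add_function("onCreate", CFG(["entry", "check", "exit"], [(0, 1), (1, 2)]))
app.add_function("sendSms", CFG(["entry", "loop", "exit"], [(0, 1), (1, 1), (1, 2)]))
app.add_call("onCreate", "sendSms")

print(app.callees("onCreate"))           # ['sendSms']
print(len(app.cfgs["sendSms"].edges))    # 3
```

Keeping the CFGs keyed by function name means an analysis can move from a call edge in the FCG straight into the instruction-level graph of either endpoint.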

A Dataset of Unprecedented Scale

HiGraph comprises over 200 million CFGs nested within 595,000 FCGs. At this scale, analysis reveals distinct structural properties that separate benign (harmless) software from malicious software. The dataset was constructed from 595,211 Android applications collected from AndroZoo, with labels established using VirusTotal reports to ensure accuracy and temporal consistency.

Key Insights from HiGraph

The researchers analyzed HiGraph extensively and found differences between benign and malicious applications in three areas:

Structural Complexity: Malicious applications tend to have more influential functions and a more centralized architecture at the FCG level. At the CFG level, malware shows higher node degrees and cyclomatic complexity, indicating more intricate conditional logic within individual functions, often designed for obfuscation.
Temporal Evolution: Over a decade of data (2012-2022) revealed divergent evolutionary patterns. Benign applications show increasing complexity and modularity over time, reflecting good software engineering practices. In contrast, malware FCGs tend to shrink after 2015, suggesting a shift towards smaller, more targeted functional units, but with increasing density – meaning functions become more tightly interconnected, likely for operational efficiency and evasion.
API Usage: Analysis of API call frequencies provides a ‘functional fingerprint,’ distinguishing between common utility APIs and security-sensitive platform APIs often exploited by malware.
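The structural measures named above can be computed from plain edge lists. This sketch uses the standard textbook definitions (McCabe's cyclomatic complexity, directed-graph density, average degree). The article does not state the exact definitions the paper uses, so treat these as assumptions, and the toy graphs as illustrative only.

```python
def cyclomatic_complexity(num_nodes, edges, components=1):
    """McCabe's metric M = E - N + 2P for a control flow graph."""
    return len(edges) - num_nodes + 2 * components


def density(num_nodes, edges):
    """Directed density: distinct non-loop edges / possible non-loop edges."""
    if num_nodes < 2:
        return 0.0
    distinct = {(u, v) for u, v in edges if u != v}
    return len(distinct) / (num_nodes * (num_nodes - 1))


def mean_degree(num_nodes, edges):
    """Average total degree (in-degree plus out-degree) per node."""
    return 2 * len(edges) / num_nodes if num_nodes else 0.0


# Toy CFG: a single if/else that rejoins (4 blocks, 4 edges).
cfg_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(cyclomatic_complexity(4, cfg_edges))  # 2

# Toy FCG: one function calling two others (3 functions, 2 call edges).
fcg_edges = [(0, 1), (0, 2)]
print(round(density(3, fcg_edges), 3))      # 0.333
```

Under these definitions, a smaller FCG with the same number of call edges has a higher density, which is the combination the temporal analysis describes for malware after 2015.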

Enhanced Malware Detection with Hi-GNN

To demonstrate HiGraph’s utility, the researchers developed Hi-GNN, a hierarchical graph neural network. This model processes both the local CFGs and the global FCGs simultaneously. Experiments showed that Hi-GNN significantly outperforms traditional single-level GNNs in malware detection and classification tasks. Crucially, Hi-GNN also exhibited superior temporal robustness, meaning it’s better at detecting new, evolving malware variants – a critical advantage in the fast-paced cybersecurity landscape.
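To illustrate what processing both levels means, here is a deliberately simplified sketch. It is not Hi-GNN: there are no learned weights, and the helper names `message_pass` and `embed_app` are hypothetical. It only shows the data flow assumed here, in which each CFG is reduced to one vector and those vectors then serve as node features on the FCG.

```python
def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def message_pass(features, edges):
    """One round: each node averages its own vector with its in-neighbours'."""
    incoming = {node: [vec] for node, vec in features.items()}
    for src, dst in edges:
        incoming[dst].append(features[src])
    return {node: mean_vectors(vecs) for node, vecs in incoming.items()}


def embed_app(cfgs, call_edges):
    """cfgs maps function name -> (block_features, block_edges)."""
    # Level 1: embed each CFG by one message-passing round plus mean pooling.
    func_vecs = {}
    for name, (block_feats, block_edges) in cfgs.items():
        updated = message_pass(block_feats, block_edges)
        func_vecs[name] = mean_vectors(list(updated.values()))
    # Level 2: CFG embeddings become node features of the FCG.
    updated = message_pass(func_vecs, call_edges)
    return mean_vectors(list(updated.values()))


# Toy input: function "a" has two blocks, "b" has one, and "a" calls "b".
cfgs = {
    "a": ({0: [1.0, 0.0], 1: [0.0, 1.0]}, [(0, 1)]),
    "b": ({0: [2.0, 2.0]}, []),
}
print(embed_app(cfgs, [("a", "b")]))  # [1.0625, 0.6875]
```

A trained model would replace the plain averages with learned transformations, but the nesting stays the same: information flows from basic blocks to functions, then from functions to the whole application.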

This resilience to ‘concept drift’ or ‘model aging’ is attributed to Hi-GNN’s ability to learn stable semantic patterns from CFGs and adaptive architectural patterns from FCGs, allowing it to identify persistent malicious blueprints even as malware changes its superficial appearance.
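Model aging is usually measured with a time-based split: train on older samples, then score the frozen model on each later year. The sketch below shows that evaluation pattern under assumptions. The `year` and `label` fields, the cutoff, and the toy samples are hypothetical, and the article does not describe the paper's exact protocol.

```python
def temporal_split(samples, cutoff_year):
    """Train on samples dated up to cutoff_year, test on later ones."""
    train = [s for s in samples if s["year"] <= cutoff_year]
    test = [s for s in samples if s["year"] > cutoff_year]
    return train, test


def accuracy_by_year(samples, predict):
    """Accuracy per year, to show how a fixed model holds up over time."""
    totals, correct = {}, {}
    for s in samples:
        year = s["year"]
        totals[year] = totals.get(year, 0) + 1
        correct[year] = correct.get(year, 0) + (predict(s) == s["label"])
    return {year: correct[year] / totals[year] for year in sorted(totals)}


# Toy samples (label 1 = malicious, 0 = benign); not real data.
samples = [
    {"year": 2014, "label": 1},
    {"year": 2015, "label": 0},
    {"year": 2019, "label": 1},
    {"year": 2020, "label": 1},
    {"year": 2020, "label": 0},
]
train, test = temporal_split(samples, cutoff_year=2015)


def always_malicious(sample):
    """Stand-in for a trained model; it flags every sample."""
    return 1


print(accuracy_by_year(test, always_malicious))  # {2019: 1.0, 2020: 0.5}
```

Splitting by date rather than at random keeps future samples out of training, so a drop in per-year accuracy reflects drift rather than leakage.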

HiGraph and its associated tools are publicly available, with the aim of standardizing the evaluation of hierarchical malware analysis methods and fostering reproducible research in AI for cybersecurity. This work represents a significant step towards building next-generation defense systems capable of combating sophisticated and evolving cyber threats. The full research paper is titled HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis.
