Advanced Code Clone Detection Using Multiple Code Representations

TLDR: MAGNET is a novel multi-graph attentional framework for code clone detection that leverages Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs), and Data Flow Graphs (DFGs) to capture comprehensive syntactic and semantic features of source code. It integrates residual graph neural networks with node-level self-attention for local and long-range dependencies, introduces a gated cross-attention mechanism for fine-grained inter-graph interactions, and employs Set2Set pooling to fuse multi-graph embeddings. Experiments show MAGNET achieves state-of-the-art performance on BigCloneBench and Google Code Jam datasets, demonstrating the critical contributions of multi-graph fusion and its attentional components.

Code clone detection is a crucial task in software engineering, supporting everything from bug finding and refactoring to plagiarism detection and vulnerability analysis. The goal is to identify duplicated or very similar pieces of code within a software project. Traditionally, methods for this task have relied on a single code representation, such as an Abstract Syntax Tree (AST), a Control Flow Graph (CFG), or a Data Flow Graph (DFG). Each representation captures only some aspects of the code, which limits these methods, especially on complex clones or on clones that are semantically similar but structurally different.

Hybrid approaches have tried to combine these different representations, but their methods for merging this information have often been basic or manually designed, leading to inconsistent results and sometimes even slowing down the process without significant performance gains.

Introducing MAGNET: A Multi-Graph Attentional Network

To address these challenges, researchers Zixian Zhang and Takfarinas Saber have proposed a novel framework called MAGNET (Multi-Graph Attentional Network). This new approach jointly uses AST, CFG, and DFG representations to capture both the syntactic (structure) and semantic (meaning) features of source code in a more comprehensive way. MAGNET is designed to overcome the limitations of previous methods by intelligently fusing information from these multiple graph types.

The core of MAGNET lies in its three main components:

Intra-graph Embedding Learning: This stage focuses on understanding each individual code graph (AST, CFG, DFG). It uses a combination of residual graph neural networks (GNNs) to capture local connections within the code and a node-level self-attention mechanism to identify longer-range dependencies. This means it can see both the immediate relationships between code elements and how distant parts of the code might be semantically linked.
Cross-graph Embedding Learning: For accurate clone detection, it’s vital to understand how two different code fragments relate to each other across their various graph representations. MAGNET introduces a gated cross-attention mechanism that allows for fine-grained interactions between the nodes of paired code fragments. This helps the model identify subtle correspondences and similarities between two pieces of code.
Multi-graph Fusion and Pooling: After processing individual and paired graphs, MAGNET uses a Set2Set pooling layer. This advanced pooling technique aggregates the embeddings from the AST, CFG, and DFG into a single, unified representation for each program. This dynamic fusion allows the model to adaptively integrate the complementary information from all three graph types, creating a holistic understanding of the code.
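
The first two components can be sketched in a few lines of plain Python. This is only one plausible reading of the description above, not the authors' implementation: the mean-neighbor aggregation, the ReLU, the parameter-free sigmoid gate, and all function names (`residual_gnn_layer`, `attend`, `self_attention_block`, `gated_cross_attention`) are assumptions made for illustration. A trained model would use learned weight matrices throughout, and the Set2Set fusion step is left out.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def residual_gnn_layer(H, neighbors, W):
    """One message-passing step with a skip connection (assumed form).

    H: list of node vectors, neighbors[i]: indices adjacent to node i,
    W: d x d weight matrix given as a list of rows.
    """
    out = []
    for i, h in enumerate(H):
        nbrs = neighbors[i] or [i]  # an isolated node falls back to itself
        mean = [sum(H[j][k] for j in nbrs) / len(nbrs) for k in range(len(h))]
        msg = [max(0.0, dot(mean, col)) for col in zip(*W)]  # ReLU(mean @ W)
        out.append([a + b for a, b in zip(h, msg)])  # residual: h + message
    return out

def attend(Q, K):
    """Scaled dot-product attention: each row of Q gets a weighted average of K's rows."""
    d = len(Q[0])
    out = []
    for q in Q:
        w = softmax([dot(q, k) / math.sqrt(d) for k in K])
        out.append([sum(wj * k[c] for wj, k in zip(w, K)) for c in range(d)])
    return out

def self_attention_block(H):
    """Node-level self-attention inside one graph, added back residually."""
    return [[a + b for a, b in zip(h, ctx)] for h, ctx in zip(H, attend(H, H))]

def gated_cross_attention(Ha, Hb):
    """Nodes of fragment A attend over nodes of fragment B.

    The gate here is a parameter-free sigmoid of dot(h, ctx); this is a
    stand-in for whatever learned gate the real model uses.
    """
    out = []
    for h, ctx in zip(Ha, attend(Ha, Hb)):
        gate = 1.0 / (1.0 + math.exp(-dot(h, ctx)))
        out.append([gate * c + (1.0 - gate) * a for a, c in zip(h, ctx)])
    return out

# Toy data: two tiny graphs with 2-dimensional node features.
identity = [[1.0, 0.0], [0.0, 1.0]]
H_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nbrs_a = {0: [1], 1: [0, 2], 2: [1]}
H_b = [[0.5, 0.5], [1.0, 0.0]]
nbrs_b = {0: [1], 1: [0]}

H_a = self_attention_block(residual_gnn_layer(H_a, nbrs_a, identity))
H_b = self_attention_block(residual_gnn_layer(H_b, nbrs_b, identity))
A_given_B = gated_cross_attention(H_a, H_b)
print(len(A_given_B), len(A_given_B[0]))  # 3 2
```

The same `attend` helper serves both stages: called with one graph's nodes as queries and keys it gives self-attention, and called across the two fragments it gives cross-attention. The gate decides, per node, how much of the other fragment's context replaces the node's own embedding.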

The beauty of using ASTs, CFGs, and DFGs together is that they each offer a unique perspective on the code. ASTs show the hierarchical structure, CFGs illustrate the execution order and logical flow, and DFGs reveal how data moves and depends on different operations. By combining these, MAGNET gets a much richer and more complete understanding of code semantics.
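
To make the three views concrete, here is a minimal sketch that derives all three kinds of edges from one small function using Python's standard `ast` module. Python source is used purely for convenience and says nothing about the tooling behind MAGNET. The helper names (`ast_edges`, `cfg_edges`, `def_use_edges`) are hypothetical, the control-flow builder handles only straight-line code and simple loops, and the data-flow pass follows source order, so it misses loop-carried dependencies.

```python
import ast

SOURCE = (
    "def total(xs):\n"       # line 1
    "    s = 0\n"            # line 2
    "    for x in xs:\n"     # line 3
    "        s = s + x\n"    # line 4
    "    return s\n"         # line 5
)

def ast_edges(tree):
    """Syntax view: (parent node type, child node type) pairs."""
    return [(type(p).__name__, type(c).__name__)
            for p in ast.walk(tree) for c in ast.iter_child_nodes(p)]

def cfg_edges(stmts, follow):
    """Control-flow view: (from_line, to_line) edges between statements.

    Handles only sequential statements, for/while loops, and return;
    break, continue, if, and loop else-branches are ignored.
    `follow` is the line that runs after this block (None at function exit).
    """
    edges = []
    for i, stmt in enumerate(stmts):
        nxt = stmts[i + 1].lineno if i + 1 < len(stmts) else follow
        if isinstance(stmt, (ast.For, ast.While)):
            edges.append((stmt.lineno, stmt.body[0].lineno))  # enter the body
            edges += cfg_edges(stmt.body, stmt.lineno)        # body loops back to the header
            if nxt is not None:
                edges.append((stmt.lineno, nxt))              # loop exit
        elif isinstance(stmt, ast.Return):
            continue                                          # no successor inside the function
        elif nxt is not None:
            edges.append((stmt.lineno, nxt))
    return edges

def def_use_edges(tree):
    """Data-flow view: (name, line of latest definition, line of use).

    On any one line, reads are processed before writes, so `s = s + x`
    reads the previous value of `s`.
    """
    events = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            events.append((node.lineno, 1, node.arg))  # a parameter counts as a definition
        elif isinstance(node, ast.Name):
            events.append((node.lineno, int(isinstance(node.ctx, ast.Store)), node.id))
    last_def, edges = {}, []
    for line, is_store, name in sorted(events):
        if is_store:
            last_def[name] = line
        elif name in last_def:
            edges.append((name, last_def[name], line))
    return edges

tree = ast.parse(SOURCE)
syntax = ast_edges(tree)
control = cfg_edges(tree.body[0].body, None)
data = def_use_edges(tree)

print(("FunctionDef", "For") in syntax)  # True
print(control)  # [(2, 3), (3, 4), (4, 3), (3, 5)]
print(data)     # [('xs', 1, 3), ('s', 2, 4), ('x', 3, 4), ('s', 4, 5)]
```

Each list answers a different question about the same five lines: what is nested inside what, which statement can run next, and where each value comes from. A clone detector that sees only one of these lists has strictly less to work with.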

Performance and Efficiency

Extensive experiments were conducted on two widely recognized datasets: BigCloneBench and Google Code Jam. The results showed that MAGNET achieves state-of-the-art performance, with impressive F1 scores of 96.5% and 99.2% on these datasets, respectively. This significantly outperforms existing methods, especially in detecting challenging semantic (Type-3 and Type-4) clones, which are functionally similar but may have very different structures.
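
Two ideas in this paragraph are easy to show in code: what a Type-4 clone looks like, and how an F1 score is computed. The sketch below is illustrative only. The two functions are a made-up clone pair, and the confusion counts passed to `f1_score` are hypothetical numbers, not results from the paper.

```python
def sum_to_n_loop(n):
    """Adds 1..n with an explicit loop."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_to_n_formula(n):
    """Same result via the closed form n(n+1)/2; shares almost no syntax with the loop."""
    return n * (n + 1) // 2

# A Type-4 (semantic) clone pair: identical behavior, very different structure.
print(all(sum_to_n_loop(n) == sum_to_n_formula(n) for n in range(100)))  # True

def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall.

    tp: clone pairs correctly flagged, fp: non-clones flagged as clones,
    fn: clone pairs that were missed.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, chosen only to show the arithmetic.
print(round(f1_score(tp=90, fp=10, fn=10), 3))  # 0.9
```

Because F1 penalizes both false alarms and missed clones, a detector cannot score well by simply flagging every pair or by flagging almost none.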

Ablation studies, where components of MAGNET were selectively removed, confirmed that each part of the framework—the multi-graph fusion, residual GNNs, node-level self-attention, and gated cross-attention—makes a critical contribution to its high performance. The integration of all three graph types (AST, CFG, DFG) proved to be the most effective strategy, demonstrating that a comprehensive view of code is key to superior clone detection.

While MAGNET excels in accuracy, the researchers acknowledge a trade-off in computational efficiency. Processing multiple graph modalities and applying several attention mechanisms adds overhead, making MAGNET slower than some other tools, though still more efficient than certain pre-trained language models such as CodeBERT. Future work aims to address this by exploring lightweight attention mechanisms, graph sampling, and adaptive modality fusion to improve scalability and speed.

This research marks a significant step forward in code clone detection, offering a framework for understanding and comparing code fragments that reaches state-of-the-art accuracy on both benchmarks. For more technical details, you can refer to the full research paper.
