TLDR: This research paper introduces a novel approach to binary code similarity detection (BCSD) that uses large language models (LLMs) to extract human-interpretable features from assembly code. Unlike traditional embedding-based methods that produce opaque numerical vectors, the method generates structured, human-readable features such as input/output types, side effects, and algorithmic intent. It addresses key limitations of previous methods by offering improved interpretability, better scalability through standard text indexing, and better generalization across architectures and optimization levels without task-specific training. The method performs strongly, often outperforming state-of-the-art embedding models, and can be combined with embeddings for even better results.
Binary code similarity detection (BCSD) is a crucial task in cybersecurity and reverse engineering. It helps identify similar pieces of code in different software, which is vital for tasks like analyzing malware, finding vulnerabilities, and auditing software supply chains. Imagine a scenario where a vulnerability is found in a widely used software library; BCSD can quickly pinpoint all other programs or firmware that use that same vulnerable code, allowing for rapid defensive action.
The Evolution of Code Similarity Detection
Historically, BCSD methods have evolved through several stages. Early approaches relied on ‘hand-crafted’ features, where human experts would define specific statistics or patterns to look for in the code, such as the number of basic blocks or the ratio of different instruction types. These features were easy to understand, or ‘interpretable’, but they were often too simplistic to capture deep semantic similarities and struggled to generalize across different compilers or optimization settings.
More recently, machine learning, particularly deep learning, introduced ‘embedding-based’ methods. These approaches transform code fragments into high-dimensional numerical vectors, or ‘embeddings’, that capture complex structural and semantic patterns. While these embeddings are powerful and can generalize well across various compilation settings, they come with significant drawbacks. They are ‘opaque’, meaning an analyst cannot easily understand why two code fragments are deemed similar, making verification difficult. They also face a ‘scalability-accuracy trade-off’ because searching through millions of high-dimensional vectors for exact matches is too slow, forcing the use of approximations that can reduce precision. Furthermore, these models require extensive training data and often need retraining for new architectures or optimization levels, limiting their ‘generalization’ to unseen scenarios.
A New Approach: LLM-Based Interpretable Feature Extraction
A new research paper, titled “Beyond Embeddings: Interpretable Feature Extraction for Binary Code Similarity,” proposes a method that bridges these gaps by using large language models (LLMs) to perform a structured analysis of assembly code. Authored by Charles E. Gagnon, Steven H. H. Ding, Philippe Charland, and Benjamin C. M. Fung from Defence Research and Development Canada and McGill University, this work introduces a way to generate human-readable, interpretable features directly from assembly code.
Instead of producing opaque vectors, their method uses an LLM to act as an expert reverse engineering assistant. It analyzes raw assembly code and generates a structured set of semantic features in a JSON format. These features include details like input/output types, side effects, notable constants, loop structures, and even the inferred algorithmic intent of a function. For example, it might identify a function as performing “Initialization” or “Data Processing.”
Key Advantages of the LLM-Powered Solution
This LLM-based approach offers several significant advantages:
- Interpretability: The generated features are human-readable, allowing analysts to directly understand and verify why two code fragments are considered similar. This transparency is crucial for high-stakes applications like vulnerability detection.
- Scalability: Because the features are textual and structured, they can be stored and queried using standard inverted indexes or relational databases, similar to how search engines work. This eliminates the need for complex and often approximate nearest-neighbor searches in high-dimensional spaces, allowing for efficient and exact similarity searches at scale without sacrificing accuracy.
- Generalization: By relying on the broad, pre-trained knowledge of modern LLMs, the method naturally generalizes to different compilers, optimization levels, and CPU architectures without requiring specific training or fine-tuning for each new setting. This is a major improvement over embedding-based models that often struggle with out-of-distribution inputs.
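The scalability point can be illustrated with a small inverted index. This is a minimal sketch, not the paper's implementation: it assumes features are flattened into "key=value" tokens and compared with Jaccard similarity, and the names `feature_tokens` and `InvertedIndex` are hypothetical.

```python
from collections import defaultdict


def feature_tokens(features: dict) -> set:
    """Flatten a feature dict into 'key=value' tokens (assumed representation)."""
    tokens = set()
    for key, value in features.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            tokens.add(f"{key}={item}")
    return tokens


class InvertedIndex:
    """Maps each token to the functions that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of function names
        self.tokens = {}                  # function name -> its token set

    def add(self, name: str, features: dict) -> None:
        toks = feature_tokens(features)
        self.tokens[name] = toks
        for tok in toks:
            self.postings[tok].add(name)

    def query(self, features: dict, top_k: int = 3) -> list:
        """Return (name, Jaccard score) pairs, scored exactly over candidates."""
        q = feature_tokens(features)
        # Only functions sharing at least one token are ever scored.
        candidates = set()
        for tok in q:
            candidates |= self.postings.get(tok, set())
        scored = [
            (name, len(q & self.tokens[name]) / len(q | self.tokens[name]))
            for name in candidates
        ]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:top_k]
```

Because only functions that share at least one token with the query are scored, the search stays exact while skipping most of the corpus. A production system would more likely rely on an existing search engine or database index than on an in-memory dictionary.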
How It Works
The core of the method involves a carefully designed prompt that instructs the LLM to analyze assembly code and output its findings as a JSON object. The prompt defines the task, the expected format, and the specific types of features to extract, such as:
- Type Signature: Number and types of input parameters (e.g., Integer, Pointer), and the return value type.
- Logic and Operations: Presence of loops, jump tables, SIMD instructions, and dominant operation categories (e.g., arithmetic, conditional branching).
- Notable Constants: A list of unique, informative integer literals (excluding trivial ones) and string literals.
- Side Effects: Whether the function modifies input parameters, global state, performs memory allocation, or I/O operations.
- Categorization: A high-level label summarizing the function’s purpose (e.g., system interaction, memory management, cryptographic).
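To make the feature list above concrete, here is a sketch of what one such JSON object might look like and how it could be checked after parsing. The field names and values are illustrative assumptions, not the paper's exact schema.

```python
import json

# Hypothetical field names; the paper's actual schema may differ.
REQUIRED_KEYS = {
    "input_types", "return_type", "has_loops", "has_jump_table",
    "uses_simd", "dominant_operations", "integer_constants",
    "string_constants", "side_effects", "category",
}

# Illustrative LLM output for a small copy routine (made-up values).
EXAMPLE_OUTPUT = """
{
  "input_types": ["Pointer", "Pointer", "Integer"],
  "return_type": "Pointer",
  "has_loops": true,
  "has_jump_table": false,
  "uses_simd": false,
  "dominant_operations": ["memory access", "conditional branching"],
  "integer_constants": [4096],
  "string_constants": [],
  "side_effects": {
    "modifies_inputs": true,
    "modifies_globals": false,
    "allocates_memory": false,
    "performs_io": false
  },
  "category": "memory management"
}
"""


def parse_features(raw: str) -> dict:
    """Parse the model's reply and check that every expected key is present."""
    features = json.loads(raw)
    if not isinstance(features, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - features.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return features


features = parse_features(EXAMPLE_OUTPUT)
print(features["category"], features["input_types"])
```

Validating the keys up front means a malformed reply is rejected before it reaches the index, rather than silently producing a function with missing features.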
The researchers also implemented robustness techniques, such as retrying queries with slightly higher sampling temperatures if invalid JSON is generated, and using ‘few-shot prompting’ (providing a few examples of assembly functions with their expected feature analyses) to guide the LLM’s behavior.
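Both robustness techniques can be sketched in a few lines. This is an illustration under stated assumptions: `generate` is a hypothetical callable wrapping whatever LLM client is in use, and the attempt count, starting temperature, and step size are placeholder values rather than the paper's settings.

```python
import json
from typing import Callable


def build_prompt(instructions: str, examples: list, assembly: str) -> str:
    """Few-shot prompt: instructions, worked examples, then the target function."""
    parts = [instructions]
    for example_asm, example_features in examples:
        parts.append(
            f"Assembly:\n{example_asm}\nFeatures:\n{json.dumps(example_features)}"
        )
    parts.append(f"Assembly:\n{assembly}\nFeatures:")
    return "\n\n".join(parts)


def extract_with_retry(
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> reply
    prompt: str,
    max_attempts: int = 3,           # assumed value
    base_temperature: float = 0.0,   # assumed value
    step: float = 0.2,               # assumed value
) -> dict:
    """Query the model, raising the temperature a little after each invalid reply."""
    last_error = None
    for attempt in range(max_attempts):
        temperature = base_temperature + attempt * step
        raw = generate(prompt, temperature)
        try:
            result = json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
            continue
        if isinstance(result, dict):
            return result
        last_error = ValueError("reply was valid JSON but not an object")
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error
```

Raising the temperature slightly on each retry changes the sampling enough that the model is less likely to reproduce the same malformed output.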
Impressive Results and Future Potential
The experiments demonstrated that this LLM-based method performs comparably to, and often surpasses, state-of-the-art embedding methods, especially in challenging cross-architecture and cross-optimization tasks. For instance, it achieved a recall@1 of 42% in cross-architecture tasks and 62% in cross-optimization tasks, which is competitive with or better than embedding methods that require extensive training.
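For readers unfamiliar with the metric, recall@1 is the fraction of queries whose correct match is ranked first. The sketch below assumes each query has exactly one correct match; the function name and data layout are illustrative, not taken from the paper.

```python
def recall_at_k(rankings: dict, ground_truth: dict, k: int = 1) -> float:
    """Fraction of queries whose true match appears in the top k results.

    rankings:     query id -> list of candidate ids, best first
    ground_truth: query id -> id of the single correct match (assumption)
    """
    if not rankings:
        raise ValueError("no queries to evaluate")
    hits = sum(
        1 for query, ranked in rankings.items()
        if ground_truth[query] in ranked[:k]
    )
    return hits / len(rankings)


# Toy data: four queries, two of which rank the correct function first.
rankings = {
    "q1": ["a", "b"],
    "q2": ["c", "d"],
    "q3": ["e", "f"],
    "q4": ["g", "h"],
}
truth = {"q1": "a", "q2": "d", "q3": "e", "q4": "x"}
print(recall_at_k(rankings, truth, k=1))  # 0.5
```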
Furthermore, the research shows that combining these interpretable features with generic embedding models in a hybrid framework can significantly outperform existing methods, demonstrating that accuracy, scalability, and interpretability can indeed coexist. This hybrid approach could use the textual features for initial filtering, then refine results with embeddings for top candidates.
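A two-stage version of this hybrid idea might look like the following sketch. It assumes Jaccard similarity over feature tokens for the filtering stage and cosine similarity over embeddings for the reranking stage; the function names, data layout, and shortlist size are hypothetical, and the paper's actual combination scheme may differ.

```python
import math


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u: list, v: list) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def hybrid_search(query: dict, corpus: dict, shortlist_size: int = 2) -> list:
    """Stage 1: shortlist by textual features. Stage 2: rerank by embedding."""
    shortlist = sorted(
        corpus,
        key=lambda name: jaccard(query["tokens"], corpus[name]["tokens"]),
        reverse=True,
    )[:shortlist_size]
    return sorted(
        shortlist,
        key=lambda name: cosine(query["embedding"], corpus[name]["embedding"]),
        reverse=True,
    )


# Toy corpus with made-up tokens and 2-dimensional embeddings.
corpus = {
    "a": {"tokens": {"loop", "ptr", "mem"}, "embedding": [1.0, 0.0]},
    "b": {"tokens": {"loop", "ptr"}, "embedding": [0.6, 0.8]},
    "c": {"tokens": {"io"}, "embedding": [0.0, 1.0]},
}
query = {"tokens": {"loop", "ptr", "mem"}, "embedding": [0.0, 1.0]}
print(hybrid_search(query, corpus))  # ['b', 'a']
```

In the toy data, function "c" has the closest embedding but shares no features with the query, so the first stage filters it out before the more expensive embedding comparison runs.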
While the method currently relies on powerful LLMs, which can be computationally intensive, the benefits in interpretability, scalability, and generalization offer a compelling new direction for binary code similarity detection. This work opens exciting avenues for future research, including fine-tuning smaller LLMs using larger models as a guide, and exploring different output formats for even greater efficiency and interpretability.


