Unlocking Binary Code Secrets: How Language Models Make Code Similarity Understandable

TLDR: This research paper introduces a novel approach to binary code similarity detection (BCSD) that uses large language models (LLMs) to extract human-interpretable features from assembly code. Unlike traditional embedding-based methods that produce opaque numerical vectors, this new method generates structured, human-readable features like input/output types, side effects, and algorithmic intent. This approach addresses key limitations of previous methods by offering improved interpretability, enhanced scalability through standard text indexing, and better generalization across different architectures and optimization levels without specific training. The method demonstrates strong performance, often outperforming state-of-the-art embedding models, and can be combined with embeddings for even better results.

Binary code similarity detection (BCSD) is a crucial task in cybersecurity and reverse engineering. It helps identify similar pieces of code in different software, which is vital for tasks like analyzing malware, finding vulnerabilities, and auditing software supply chains. Imagine a scenario where a vulnerability is found in a widely used software library; BCSD can quickly pinpoint all other programs or firmware that use that same vulnerable code, allowing for rapid defensive action.

The Evolution of Code Similarity Detection

Historically, BCSD methods have evolved through several stages. Early approaches relied on ‘hand-crafted’ features, where human experts would define specific statistics or patterns to look for in the code, such as the number of basic blocks or the ratio of different instruction types. These features were easy to understand, or ‘interpretable’, but they were often too simplistic to capture deep semantic similarities and struggled to generalize across different compilers or optimization settings.

More recently, machine learning, particularly deep learning, introduced ‘embedding-based’ methods. These approaches transform code fragments into high-dimensional numerical vectors, or ‘embeddings’, that capture complex structural and semantic patterns. While these embeddings are powerful and can generalize well across various compilation settings, they come with significant drawbacks. They are ‘opaque’, meaning an analyst cannot easily understand why two code fragments are deemed similar, which makes verification difficult. They also face a ‘scalability-accuracy trade-off’: an exact nearest-neighbor search over millions of high-dimensional vectors is too slow, forcing the use of approximate search that can reduce precision. Furthermore, these models require extensive training data and often need retraining for new architectures or optimization levels, limiting their ‘generalization’ to unseen scenarios.

A New Approach: LLM-Based Interpretable Feature Extraction

A new research paper, titled “Beyond Embeddings: Interpretable Feature Extraction for Binary Code Similarity,” proposes a novel method that bridges these gaps by leveraging large language models (LLMs) to perform a structured analysis of assembly code. Authored by Charles E. Gagnon, Steven H. H. Ding, Philippe Charland, and Benjamin C. M. Fung from Defence Research and Development Canada and McGill University, this work introduces a way to generate human-readable, interpretable features directly from assembly code.

Instead of producing opaque vectors, their method uses an LLM to act as an expert reverse engineering assistant. It analyzes raw assembly code and generates a structured set of semantic features in a JSON format. These features include details like input/output types, side effects, notable constants, loop structures, and even the inferred algorithmic intent of a function. For example, it might identify a function as performing “Initialization” or “Data Processing.”

Key Advantages of the LLM-Powered Solution

This LLM-based approach offers several significant advantages:

Interpretability: The generated features are human-readable, allowing analysts to directly understand and verify why two code fragments are considered similar. This transparency is crucial for high-stakes applications like vulnerability detection.
Scalability: Because the features are textual and structured, they can be stored and queried using standard inverted indexes or relational databases, similar to how search engines work. This eliminates the need for complex and often approximate nearest-neighbor searches in high-dimensional spaces, allowing for efficient and exact similarity searches at scale without sacrificing accuracy.
Generalization: By relying on the broad, pre-trained knowledge of modern LLMs, the method naturally generalizes to different compilers, optimization levels, and CPU architectures without requiring specific training or fine-tuning for each new setting. This is a major improvement over embedding-based models that often struggle with out-of-distribution inputs.
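
To make the scalability point concrete, here is a minimal sketch of how textual features might be stored and queried with an inverted index. The `key=value` token format, the class and function names, and the Jaccard scoring are assumptions chosen for illustration; the article does not describe the paper's actual indexing scheme at this level of detail.

```python
from collections import defaultdict


def tokenize(features: dict) -> set:
    """Flatten a feature dict into a set of 'key=value' tokens (hypothetical format)."""
    tokens = set()
    for key, value in features.items():
        values = value if isinstance(value, (list, tuple, set)) else [value]
        for v in values:
            tokens.add(f"{key}={v}")
    return tokens


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of functions containing it
        self.tokens_by_id = {}            # function id -> its full token set

    def add(self, func_id: str, features: dict) -> None:
        tokens = tokenize(features)
        self.tokens_by_id[func_id] = tokens
        for t in tokens:
            self.postings[t].add(func_id)

    def query(self, features: dict, top_k: int = 5) -> list:
        """Return (func_id, Jaccard score) pairs, best first."""
        q = tokenize(features)
        # Only functions sharing at least one token with the query are scored.
        candidates = set()
        for t in q:
            candidates |= self.postings.get(t, set())
        scored = []
        for fid in candidates:
            toks = self.tokens_by_id[fid]
            scored.append((fid, len(q & toks) / len(q | toks)))
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:top_k]
```

Because every score here is computed from exact token overlap, an analyst can inspect which tokens two functions share, which is the interpretability benefit described above.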

How It Works

The core of the method involves a carefully designed prompt that instructs the LLM to analyze assembly code and output its findings as a JSON object. The prompt defines the task, the expected format, and the specific types of features to extract, such as:

Type Signature: Number and types of input parameters (e.g., Integer, Pointer), and the return value type.
Logic and Operations: Presence of loops, jump tables, SIMD instructions, and dominant operation categories (e.g., arithmetic, conditional branching).
Notable Constants: A list of unique, informative integer literals (excluding trivial ones) and string literals.
Side Effects: Whether the function modifies input parameters, global state, performs memory allocation, or I/O operations.
Categorization: A high-level label summarizing the function’s purpose (e.g., system interaction, memory management, cryptographic).
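
The feature groups above could be modeled roughly as follows. This is a sketch under stated assumptions: the field names, default values, and prompt wording are hypothetical and are not the paper's actual JSON schema or prompt.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class FunctionFeatures:
    # Field names are illustrative; the paper's exact JSON keys are not given here.
    num_params: int = 0
    param_types: list = field(default_factory=list)    # e.g. ["Pointer", "Integer"]
    return_type: str = "Unknown"
    has_loops: bool = False
    has_jump_table: bool = False
    uses_simd: bool = False
    dominant_ops: list = field(default_factory=list)   # e.g. ["arithmetic"]
    int_constants: list = field(default_factory=list)
    string_constants: list = field(default_factory=list)
    modifies_params: bool = False
    modifies_globals: bool = False
    allocates_memory: bool = False
    performs_io: bool = False
    category: str = "Unknown"


def build_prompt(assembly: str) -> str:
    """Assemble a prompt stating the task, the expected format, and the code."""
    template = json.dumps(asdict(FunctionFeatures()), indent=2)
    return (
        "You are an expert reverse engineering assistant.\n"
        "Analyze the assembly function below and reply with a single JSON "
        "object using exactly these keys:\n"
        f"{template}\n\n"
        f"Assembly:\n{assembly}\n"
    )


def parse_features(raw_json: str) -> FunctionFeatures:
    """Parse a model reply; raises on invalid JSON or unexpected keys."""
    return FunctionFeatures(**json.loads(raw_json))
```

Keeping the schema in one dataclass means the prompt template and the parser cannot drift apart, since both are derived from the same field list.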

The researchers also implemented robustness techniques, such as retrying queries with slightly higher sampling temperatures if invalid JSON is generated, and using ‘few-shot prompting’ (providing a few examples of assembly functions with their expected feature analyses) to guide the LLM’s behavior.
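
The retry technique might look something like the sketch below. The `call_llm` callable, the temperature step, and the attempt limit are all assumptions for illustration; the article only says that the temperature is raised slightly when invalid JSON comes back.

```python
import json
from typing import Callable


def extract_with_retry(
    call_llm: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> reply text
    prompt: str,
    base_temperature: float = 0.0,
    step: float = 0.1,
    max_attempts: int = 4,
) -> dict:
    """Query the model, raising the temperature a little after each invalid reply."""
    last_error = None
    for attempt in range(max_attempts):
        temperature = base_temperature + attempt * step
        reply = call_llm(prompt, temperature)
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError as err:
            last_error = err
            continue
        if isinstance(parsed, dict):
            return parsed
        last_error = ValueError("reply was valid JSON but not an object")
    raise RuntimeError(
        f"no valid JSON object after {max_attempts} attempts"
    ) from last_error
```

Passing the model call in as a function keeps the sketch independent of any particular LLM client and makes it easy to test with a stub.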

Impressive Results and Future Potential

The experiments demonstrated that this LLM-based method performs comparably to, and often surpasses, state-of-the-art embedding methods, especially in challenging cross-architecture and cross-optimization tasks. For instance, it achieved 42% and 62% for recall@1 in cross-architecture and cross-optimization tasks respectively, which is competitive with or better than embedding methods that require extensive training.
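
For readers unfamiliar with the metric, recall@1 is commonly computed as the fraction of queries whose top-ranked candidate is the true match. The helper below is a generic sketch of that definition, with hypothetical input shapes; it does not reproduce the paper's evaluation setup or candidate pool sizes.

```python
def recall_at_k(rankings: dict, ground_truth: dict, k: int = 1) -> float:
    """Fraction of queries whose true match appears in the top-k ranked candidates.

    rankings maps query id -> list of candidate ids, best first.
    ground_truth maps query id -> the id of the correct match.
    """
    if not rankings:
        raise ValueError("no queries to evaluate")
    hits = sum(
        1 for query, ranked in rankings.items() if ground_truth[query] in ranked[:k]
    )
    return hits / len(rankings)
```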

Furthermore, the research shows that combining these interpretable features with generic embedding models in a hybrid framework can significantly outperform existing methods, demonstrating that accuracy, scalability, and interpretability can indeed coexist. This hybrid approach could use the textual features for initial filtering, then refine results with embeddings for top candidates.
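
A two-stage pipeline of the kind described here could be sketched as follows. The Jaccard filter, the cosine reranker, the corpus layout, and all names are assumptions for illustration; the article does not specify how the paper combines the two signals.

```python
import math


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u: list, v: list) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def hybrid_search(query_tokens, query_vec, corpus, shortlist_size=10, top_k=3):
    """corpus maps func_id -> (feature_token_set, embedding_vector)."""
    # Stage 1: cheap filter on the interpretable textual features.
    shortlist = sorted(
        corpus,
        key=lambda fid: (-jaccard(query_tokens, corpus[fid][0]), fid),
    )[:shortlist_size]
    # Stage 2: rerank only the shortlist by embedding similarity.
    reranked = sorted(
        shortlist,
        key=lambda fid: (-cosine(query_vec, corpus[fid][1]), fid),
    )
    return reranked[:top_k]
```

Only the shortlist is ever compared in embedding space, so the expensive vector comparison runs on a handful of candidates rather than the whole corpus.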

While the method currently relies on powerful LLMs, which can be computationally intensive, the benefits in interpretability, scalability, and generalization offer a compelling new direction for binary code similarity detection. This work opens exciting avenues for future research, including fine-tuning smaller LLMs using larger models as a guide, and exploring different output formats for even greater efficiency and interpretability.
