
AI’s Growing Prowess in Smart Contract Exploits: Understanding the Threat and Defense

TLDR: A new research paper introduces REX, a framework that uses Large Language Models (LLMs) to automatically generate and validate exploits for vulnerable smart contracts. The study found that LLMs, particularly Gemini 2.5 Pro and GPT-4.1, can reliably create functional exploits with high success rates, driven by their internal reasoning capabilities rather than contract complexity. The paper also explores various defense strategies, showing that combined measures can significantly reduce exploit success, though some vulnerability types remain challenging.

Smart contracts, the self-executing agreements on blockchain, have revolutionized various industries. However, their immutable nature means that even a small vulnerability can lead to permanent and substantial financial losses. A notable example is the February 2025 exploit on Bybit’s Safe multi-signature wallet, which resulted in a staggering 1.5 billion US dollars being drained.

While traditional vulnerability detection tools like Slither and Mythril exist, they often struggle with accuracy and scalability, performing poorly in real-world scenarios. This is where Large Language Models (LLMs) come into play. LLMs have shown impressive capabilities in code-related tasks, including generation, summarization, and bug fixing, making them a promising avenue for identifying smart contract vulnerabilities.

Introducing REX: Automated Exploit Generation

A recent research paper, Prompt to Pwn: Automated Exploit Generation for Smart Contracts, explores the feasibility of using LLMs for Automated Exploit Generation (AEG) against vulnerable smart contracts. The researchers introduce REX, a novel framework that integrates LLM-based exploit synthesis with the Foundry testing suite. REX enables the automated generation and validation of proof-of-concept (PoC) exploits, offering an end-to-end pipeline for exploit generation, compilation, execution, and verification.

The REX framework operates in five key steps:

  • Preprocessing: Input smart contracts are stripped of comments and other non-functional content, ensuring the LLM focuses solely on the core contract logic.
  • Script generation: Given a vulnerable contract, the LLM generates two Foundry scripts: an exploit contract that triggers the vulnerability and a test contract that validates its success. The LLMs iteratively optimize their prompts and reason step by step to improve accuracy.
  • Script optimization (optional): Common errors, such as non-checksummed Ethereum addresses and missing ‘payable’ casts, are fixed automatically.
  • Compilation and testing: The generated scripts are compiled and executed within a Foundry project, verifying them both syntactically and semantically.
  • Iterative feedback: If compilation or testing fails, error messages are returned to the LLM so it can correct and regenerate the scripts, until a valid exploit is found or a retry limit is reached.
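The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `generate` and `compile_and_test` callables are hypothetical stand-ins for the LLM and the Foundry test harness.

```python
import re

# Minimal sketch of REX's generate/compile/test feedback loop, assuming
# hypothetical `generate` and `compile_and_test` callables stand in for
# the LLM and the Foundry harness described in the paper.

def strip_comments(source: str) -> str:
    """Step 1: drop comments so the model sees only core contract logic."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)  # block comments
    return re.sub(r"//[^\n]*", "", source)                      # line comments

def rex_loop(generate, compile_and_test, contract_source, max_retries=5):
    """Steps 2-5: generate scripts, validate them, and feed errors back."""
    contract = strip_comments(contract_source)
    feedback = None
    for _ in range(max_retries):
        # Step 2: the LLM emits an exploit contract plus a validating test,
        # optionally conditioned on error messages from the previous round.
        exploit, test = generate(contract, feedback)
        # Steps 3-4: auto-fixing, then compiling and running under Foundry.
        passed, errors = compile_and_test(exploit, test)
        if passed:
            return exploit  # validated proof-of-concept exploit
        # Step 5: return the failure output to the LLM and retry.
        feedback = errors
    return None  # retry budget exhausted without a working exploit
```

The loop terminates either with a validated exploit or after the retry limit, mirroring the paper's end-to-end generate-validate cycle.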

LLM Performance in Exploit Generation

The research evaluated five state-of-the-art LLMs: GPT-4.1, Gemini 2.5 Pro, Claude Opus 4, DeepSeek, and Qwen3 Plus. They were tested on both synthetic benchmarks (SmartBugs-Curated) and real-world smart contracts affected by known high-impact exploits (WEB3-AEG). The results were compelling: modern LLMs can reliably generate functional PoC exploits for diverse vulnerability types, with success rates reaching up to 92%.

Notably, Gemini 2.5 Pro and GPT-4.1 consistently outperformed the others in both synthetic and real-world scenarios. Gemini 2.5 Pro achieved the highest average success rate of 67.3% across vulnerability types, excelling in arithmetic bugs (92.9%), front-running (75.0%), and unchecked low-level calls (56.7%). It even demonstrated autonomous exploit discovery and expert-like reasoning in some real-world cases. That said, while LLMs showed strong performance, they primarily generated single-contract exploits. Human experts, in contrast, often craft complex exploit chains that span multiple contracts and interact with DeFi protocols for maximum profit.

Factors Influencing Exploit Generation

The study also delved into factors affecting AEG effectiveness. The LLM’s inherent capabilities emerged as the primary determinant of success. Models with stronger general coding abilities, as evidenced by benchmarks like Aider and LMArena, performed better in exploit generation. Recurring failure patterns included cryptographic limitations (e.g., incorrect Ethereum addresses) and semantic misunderstandings (e.g., misusing the ‘payable’ modifier).

Interestingly, structural properties of target contracts, such as code length or complexity, showed only weak correlations with AEG success. This suggests that while complexity might correlate with vulnerability, it doesn’t reliably predict how easily an LLM can exploit it. Vulnerability types with predictable structures, like arithmetic overflows, were more exploitable due to their simplicity and fixed patterns. Lastly, prompt engineering, while helpful for output format, had limited effect on overall AEG performance, indicating that the LLM’s internal reasoning capacity is more crucial than external instructions.

Defending Against LLM-Based Threats

The research also proposes several defense strategies to mitigate LLM-driven threats. These include:

  • Externalization via Code Splitting: Decomposing contract logic into modular components (e.g., separating proxies) forces LLMs to reason across multiple contracts, increasing exploit generation difficulty.
  • Structural, Not Superficial, Complexity: Using deep inheritance trees, abstract interfaces, and polymorphic dispatch complicates semantic tracing for LLMs, reducing exploit success.
  • Breaking Canonical Signatures: Diversifying vulnerability contexts with redundant logic, unconventional naming, or control-flow indirection can disrupt the LLM’s pattern-matching abilities.
  • Decoy Vulnerabilities: Intentionally introducing false-positive patterns that resemble canonical vulnerabilities can mislead LLMs.
  • Use of Edge Syntax and Low-Level Features: Implementing critical logic using less common Solidity constructs like inline Yul or assembly can introduce semantic obfuscation.
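As a rough illustration of the "Breaking Canonical Signatures" idea, the sketch below renames textbook identifiers (e.g. `withdraw`, `balances`) in a Solidity snippet, disrupting the token-level patterns an LLM matches on without changing contract semantics. The renaming map and snippet are illustrative assumptions, not from the paper, and a production tool would operate on an AST rather than raw text to avoid touching strings and comments.

```python
import re

# Hypothetical identifier-diversification pass: canonical names that appear
# in textbook vulnerability examples are swapped for unconventional ones.
# This targets the LLM's pattern matching, not the contract's behavior.
CANONICAL_RENAMES = {
    "withdraw": "settleOutbound",
    "balances": "ledgerState",
    "owner": "custodian",
}

def diversify_identifiers(solidity_source: str) -> str:
    for old, new in CANONICAL_RENAMES.items():
        # \b restricts the rename to whole identifiers, so e.g.
        # "withdrawAll" would be left untouched.
        solidity_source = re.sub(rf"\b{old}\b", new, solidity_source)
    return solidity_source

# Illustrative snippet with a classic reentrancy shape (external call
# before the balance update).
snippet = """
function withdraw(uint amount) public {
    require(balances[msg.sender] >= amount);
    (bool ok, ) = msg.sender.call{value: amount}("");
    balances[msg.sender] -= amount;
}
"""
```

Applied to the snippet, the pass yields a contract that still contains the same reentrancy flaw, but no longer matches the `withdraw`/`balances` surface pattern seen in training corpora.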

While individual defensive modifications showed limited impact, combining multiple techniques significantly reduced the success rate of LLM-based AEG. However, even with strengthened protection, certain vulnerability types, such as bad randomness and time manipulation, remained susceptible to LLM-generated attacks.

In conclusion, LLMs are proving to be powerful tools for automated exploit generation in smart contracts, driven primarily by their reasoning and code generation abilities. This research not only highlights their potential but also provides valuable insights into developing more robust defense mechanisms for the evolving landscape of blockchain security.

Dev Sundaram (https://blogs.edgentiq.com)
Dev Sundaram is an investigative tech journalist with a nose for exclusives and leaks. With stints in cybersecurity and enterprise AI reporting, Dev thrives on breaking big stories—product launches, funding rounds, regulatory shifts—and giving them context. He believes journalism should push the AI industry toward transparency and accountability, especially as Generative AI becomes mainstream. You can reach him at: [email protected]
