Unpacking AI's Role in Software Security: A Deep Dive into Vulnerability Detection

TLDR: This systematic review examines 227 studies from 2020-2025 on using Large Language Models (LLMs) for software vulnerability detection. It categorizes existing methods by how tasks are set up, how code is represented, system designs, and adaptation techniques. The review also thoroughly analyzes the datasets used, highlighting issues like limited vulnerability types and data imbalance. It concludes by identifying key challenges, such as the need for more realistic datasets, better code understanding, and improved model explainability, and suggests future research directions to make LLM-based vulnerability detection more practical and reliable.

The rapid rise of Large Language Models (LLMs) has sparked considerable interest in their potential to change many fields, including software engineering. One promising application is the detection of software vulnerabilities, a critical step in ensuring the security and reliability of modern software systems. However, the pace of development in this area has left the research landscape fragmented, making it hard to get a clear picture of the state of the art or to compare approaches effectively.

To address this challenge, a comprehensive systematic literature review (SLR) was conducted, analyzing 227 studies published between January 2020 and June 2025. This review aims to provide a structured overview of LLM-based software vulnerability detection, categorizing studies by how tasks are formulated, how input code is represented, the system architectures employed, and the adaptation techniques used. Furthermore, it delves into the characteristics, vulnerability coverage, and diversity of the datasets utilized in these studies.

The Critical Role of Vulnerability Detection

Identifying security vulnerabilities early in the software development lifecycle is essential to preventing their exploitation in deployed systems. The complexity of modern software, coupled with an ever-expanding threat landscape, has driven a sharp increase in reported vulnerabilities. For instance, over 40,000 Common Vulnerabilities and Exposures (CVEs) were published in 2024 alone, with more than 12,000 reported in the first quarter of 2025. Traditional tools such as static analyzers exist, but they often rely on manually written rule sets and suffer from high false positive rates, so they struggle to keep up with current remediation demands.

LLMs, despite their impressive capabilities in understanding and generating code, also introduce new risks. Code generated by LLMs can lack awareness of existing libraries or internal codebases, leading to redundant or insecure code. LLMs are also prone to ‘hallucinations’, which makes automated, scalable, and reliable vulnerability assessment pipelines more crucial than ever.

Understanding How LLMs Detect Vulnerabilities

The review introduces a detailed taxonomy to classify how LLMs are applied to vulnerability detection:

Detection Task: Most commonly, vulnerability detection is framed as a classification problem: binary (vulnerable/not vulnerable), vulnerability-specific (e.g., a particular CWE-ID), or multi-class (identifying the specific type of vulnerability). While binary classification is prevalent, multi-class offers more detailed insights crucial for real-world remediation. Beyond classification, some studies extend LLMs to tasks like vulnerability localization, severity estimation, repair, security testing, and even reasoning about root causes.
Input Representation: How code is fed to the LLM significantly impacts performance. ‘Raw’ input treats code as plain text, relying on the model’s pre-trained understanding. ‘Structure-aware’ representations convert code into graphs (like Abstract Syntax Trees or Control Flow Graphs) or code slices to capture deeper semantic relationships. ‘Prompt’ engineering involves carefully designed text instructions, sometimes augmented with external information like vulnerability reports. ‘Conversation-style’ interactions allow for multi-step analysis and refinement.
System Architecture: Systems can be ‘LLM-centric,’ where the LLM is the primary analytical component, using either general-purpose LLMs (e.g., GPT series, LLaMA) or specialized ‘Code LLMs’ (e.g., CodeBERT, StarCoder) pre-trained on code. Alternatively, ‘Hybrid’ systems combine LLMs with other deep learning architectures like Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), or Graph Neural Networks (GNNs) to leverage their complementary strengths.
Adaptation Techniques: LLMs are adapted through ‘prompt engineering’ (zero-shot, few-shot, Retrieval-Augmented Generation (RAG), Chain-of-Thought, self-verification, and agentic approaches) or ‘training’ (full-parameter fine-tuning or more efficient methods like LoRA). Advanced ‘learning paradigms’ such as contrastive learning, causal learning, multi-task learning, knowledge distillation, and continual learning are also explored to enhance model performance and generalization.
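
To make the difference between zero-shot and few-shot prompting concrete, here is a minimal Python sketch of how a binary-classification prompt might be assembled. The function name, the prompt wording, and the C snippets are illustrative assumptions, not material from the review, and no model is actually called.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    """A hypothetical labeled snippet used as a few-shot demonstration."""
    code: str
    label: str  # "vulnerable" or "not vulnerable"


def build_prompt(code: str, examples: tuple = ()) -> str:
    """Assemble a binary-classification prompt.

    With no examples this is a zero-shot prompt; with examples it is few-shot.
    The wording is an assumption chosen for illustration only.
    """
    parts = [
        "You are a security reviewer. Reply with exactly one of: "
        "'vulnerable' or 'not vulnerable'."
    ]
    # Each demonstration pairs a snippet with its known label.
    for ex in examples:
        parts.append(f"Code:\n{ex.code}\nAnswer: {ex.label}")
    # The target snippet comes last, with the answer left open for the model.
    parts.append(f"Code:\n{code}\nAnswer:")
    return "\n\n".join(parts)


target = "char buf[8]; strcpy(buf, user_input);"
zero_shot = build_prompt(target)
few_shot = build_prompt(
    target,
    examples=(
        LabeledExample("char b[4]; gets(b);", "vulnerable"),
        LabeledExample("char b[4]; fgets(b, sizeof b, stdin);", "not vulnerable"),
    ),
)
print(few_shot)
```

Switching from binary to multi-class detection in this sketch would only change the instruction line and the label vocabulary, for example by asking for a CWE identifier instead of a yes/no answer.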

The Role and Challenges of Datasets

Datasets are fundamental to training and evaluating LLM-based vulnerability detection models. The review categorizes datasets by type (synthetic, real-world, mixed), granularity (project, file, function, line level), source (open-source, collected, constructed, closed-source), and labeling method (security vendor-provided, developer-provided, tool-created, synthetically created).

A significant finding is the heavy bias towards C/C++ in commonly used datasets, limiting applicability to other languages. Many datasets also suffer from class imbalance, where vulnerable samples are rare, and a ‘long-tail distribution’ of vulnerability types, meaning a few Common Weakness Enumeration (CWE) types are heavily represented while many others are barely present. This can lead to models that perform well on common vulnerabilities but struggle with rarer ones.
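
A small Python sketch can show why class imbalance and a long-tail CWE distribution matter in practice. All labels and counts below are invented for illustration; they do not come from any dataset discussed in the review. Inverse-frequency weighting is shown as one common mitigation, not as a technique the review prescribes.

```python
from collections import Counter

# Hypothetical labels: 1 = vulnerable, 0 = not vulnerable (5% positives).
y_true = [1] * 5 + [0] * 95
# A degenerate "model" that always predicts "not vulnerable".
y_pred = [0] * 100


def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = binary_metrics(y_true, y_pred)
print(metrics)  # accuracy is 0.95 even though no vulnerability is found

# Hypothetical CWE labels with a long tail: a few types dominate.
cwe_labels = (
    ["CWE-119"] * 60 + ["CWE-20"] * 30 + ["CWE-416"] * 7
    + ["CWE-190"] * 2 + ["CWE-78"]
)
counts = Counter(cwe_labels)
# Inverse-frequency class weights: rare types get larger weights.
weights = {cwe: len(cwe_labels) / (len(counts) * n) for cwe, n in counts.items()}
for cwe, n in counts.most_common():
    print(f"{cwe}: {n} samples, weight {weights[cwe]:.2f}")
```

The always-negative predictor scores high accuracy with zero recall, which is why precision, recall, and F1 on the vulnerable class are more informative than accuracy on imbalanced data.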

The persistent use of older datasets like Devign and Big-Vul, alongside a growing trend of creating new or custom datasets, highlights a lack of standardized benchmarks. This fragmentation complicates direct comparison of research results and hinders reproducibility across studies.
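
One way to make dataset usage more reproducible is to derive the train/test assignment from the sample itself rather than from a random shuffle. The Python sketch below is an assumption about how this could be done, not a protocol proposed by the review: it removes whitespace-only duplicates and assigns each remaining sample to a split based on a hash of its normalized code. The function names and sample snippets are hypothetical.

```python
import hashlib


def normalize(code: str) -> str:
    """Collapse whitespace so trivially reformatted copies compare equal."""
    return " ".join(code.split())


def assign_split(code: str, test_percent: int = 20) -> str:
    """Assign a sample to 'train' or 'test' from a hash of its normalized code."""
    digest = hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "test" if bucket < test_percent else "train"


samples = [
    "int f(int a) { return a + 1; }",
    "int f(int a)   {\n  return a + 1;\n}",  # whitespace-only duplicate
    "void g(char *s) { char b[8]; strcpy(b, s); }",
]

seen = set()
for code in samples:
    key = normalize(code)
    if key in seen:
        continue  # drop duplicates before splitting
    seen.add(key)
    print(assign_split(code), "|", key)
```

Because the split depends only on the code text, two groups working from the same raw data would obtain the same partition, and a duplicated function could not end up on both sides of it.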

Looking Ahead: Addressing Limitations and Future Opportunities

The review identifies several key limitations that impede the practical adoption of LLMs in software vulnerability detection:

Limited Detection Granularity: Current models often focus on individual functions, missing complex vulnerabilities spanning multiple files or requiring broader context.
Dataset Quality: Synthetic datasets oversimplify, while automated labeling in real-world datasets introduces noise and redundancy. Class imbalance and skewed CWE distributions remain major issues.
Evaluation and Comparability: Inconsistent dataset usage, lack of predefined data splits, and diverse modifications to existing datasets make it difficult to compare results across studies.
Outdated Knowledge: Static datasets quickly become obsolete, and retraining large models is computationally expensive, limiting awareness of new vulnerabilities.
Code Representation: Models often rely on superficial textual patterns rather than deep semantic understanding, making them vulnerable to simple code transformations.
Interpretability: LLMs offer limited insight into their predictions, sometimes providing ‘hallucinated’ or misleading justifications, which undermines trust.
Integration into Workflows: Few studies evaluate LLMs in realistic development settings, and high computational costs hinder practical deployment.
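
The Code Representation limitation above can be illustrated with a semantics-preserving transformation. In the Python sketch below, two functions differ only in variable names, so their raw text differs while a structure-aware view treats them as identical. Python's standard `ast` module is used purely for convenience (most datasets in the review target C/C++), and the `RenameIdentifiers` class, `canonical_form` helper, and sample functions are hypothetical names introduced for this example.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Replace variable and argument names with positional placeholders."""

    def __init__(self):
        self.mapping = {}

    def _placeholder(self, name: str) -> str:
        # The first distinct name becomes v0, the second v1, and so on.
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self._placeholder(node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self._placeholder(node.id)
        return node


def canonical_form(source: str) -> str:
    """Parse the source, normalize identifiers, and dump the syntax tree."""
    tree = RenameIdentifiers().visit(ast.parse(source))
    return ast.dump(tree)


original = '''
def run(user_input):
    command = "ping " + user_input
    return os.system(command)
'''

# Same behavior, different variable names.
renamed = '''
def run(x):
    tmp = "ping " + x
    return os.system(tmp)
'''

print("raw text equal:      ", original == renamed)
print("canonical form equal:", canonical_form(original) == canonical_form(renamed))
```

A detector that keys on surface tokens such as `user_input` or `command` could change its verdict between the two versions even though the command-injection risk is the same. Comparing predictions on such paired variants is one simple way to probe that kind of brittleness.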

To overcome these challenges, the authors propose several actionable research directions. These include developing context-enhanced and multilingual vulnerability detection, building new datasets with improved label quality and balanced CWE coverage, and establishing standardized evaluation protocols. Further exploration of RAG and continual learning techniques is crucial to keep models updated with new vulnerability knowledge. Investing in structure-aware input representations and hybrid architectures, along with developing metrics for explanation trustworthiness, will enhance model robustness and interpretability. Finally, integrating LLM-based tools into real-world development pipelines and exploring model compression techniques are vital for practical applicability.

This comprehensive review serves as a valuable guide for researchers and practitioners, offering a structured overview of the field, identifying key limitations, and outlining actionable future research opportunities. The authors have also made all artifacts publicly available to support the community and foster more comparable and reproducible research. You can find the artifacts and a living repository of LLM-based software vulnerability detection studies at Awesome-LLM4SVD.
