TLDR: A.S.E (AI Code Generation Security Evaluation) is a new repository-level benchmark for rigorously evaluating the security of AI-generated code. It builds its tasks from real-world vulnerabilities (CVEs) in actual software projects and uses a reproducible, containerized framework to assess security, build quality, and generation stability. Evaluating 26 LLMs, the study finds that models struggle with secure coding despite producing high-quality code, that concise “fast-thinking” decoding strategies patch vulnerabilities more effectively than elaborate multi-step reasoning, and that open-source models are highly competitive with proprietary ones on security.
As large language models (LLMs) become increasingly integrated into software development, from writing new code to fixing bugs, a critical question arises: how secure is the code they generate? While LLMs are powerful, their output can inadvertently introduce, spread, or even worsen security vulnerabilities, especially in complex projects with many interconnected files.
Existing methods for evaluating the security of AI-generated code often fall short. Many focus on small, isolated code snippets, use inconsistent evaluation techniques that are hard to reproduce, and don’t adequately link the quality of the input context to the security of the output code. This leaves a significant gap in understanding how well LLMs perform in real-world software engineering scenarios.
Introducing A.S.E: A New Standard for AI Code Security
To address these challenges, a team of researchers from Tencent, Peking University, Fudan University, Shanghai Jiao Tong University, Tsinghua University, Zhejiang University, Institute of Information Engineering, Chinese Academy of Sciences, and Singapore Management University has introduced A.S.E (AI Code Generation Security Evaluation). This groundbreaking benchmark is designed for evaluating secure code generation at the repository level, meaning it considers the entire project context, including build systems and dependencies between different files.
A.S.E stands out by constructing its evaluation tasks from real-world software repositories that have documented Common Vulnerabilities and Exposures (CVEs). This approach ensures that the benchmark reflects genuine security challenges faced in practical development. It also maintains the full repository context, forcing LLMs to reason about how code changes affect the entire project, not just isolated parts.
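To make the task construction concrete, here is a minimal, hypothetical sketch of what a repository-level task entry could look like. The field names and values are illustrative assumptions for this article, not A.S.E’s actual data format.

```python
# Hypothetical task record for a repository-level secure-coding benchmark.
# All field names and values are illustrative; this is not A.S.E's actual schema.
from dataclasses import dataclass, field

@dataclass
class RepoSecurityTask:
    cve_id: str                # documented real-world vulnerability the task is built from
    repo_url: str              # source repository providing the full project context
    target_file: str           # file the model must complete or patch
    cwe_category: str          # vulnerability class, e.g. "Path Traversal (CWE-22)"
    context_files: list[str] = field(default_factory=list)  # cross-file dependencies kept intact

task = RepoSecurityTask(
    cve_id="CVE-2023-XXXXX",   # placeholder identifier
    repo_url="https://github.com/example/project",
    target_file="src/handlers/upload.py",
    cwe_category="Path Traversal (CWE-22)",
    context_files=["src/config.py", "src/utils/paths.py"],
)
```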
The benchmark’s evaluation framework is highly reproducible and auditable. It uses containerized environments (like Docker) to ensure consistent results and employs expert-defined rules, combining industry-grade analyzers like CodeQL and Joern with specific logic tailored to different vulnerability types. This provides stable and transparent assessments of code security, build quality, and the consistency of the generated code.
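As a rough illustration of what such a reproducible check might look like, the sketch below builds a CodeQL database for a repository and analyzes it inside a container. The Docker image name and query pack are placeholder assumptions; only the `codeql database create` and `codeql database analyze` commands are standard CodeQL CLI usage, and A.S.E’s actual pipeline may differ.

```python
# Minimal sketch of a containerized static-analysis step, in the spirit of a
# reproducible evaluation pipeline. "codeql-runner:latest" is a placeholder image
# assumed to contain the CodeQL CLI; the query pack name is likewise illustrative.
import subprocess

def analyze_repo(repo_dir: str, language: str = "python") -> str:
    """Build a CodeQL database for the repo and run a query suite inside Docker,
    writing findings to a SARIF file for rule-based post-processing."""
    sarif_out = "results.sarif"
    docker = [
        "docker", "run", "--rm",
        "-v", f"{repo_dir}:/src", "-w", "/src",
        "codeql-runner:latest",  # placeholder image with the CodeQL CLI installed
    ]
    # The database lives under /src so it persists across the two container runs.
    subprocess.run(docker + [
        "codeql", "database", "create", "/src/.codeql-db",
        f"--language={language}", "--source-root=/src",
    ], check=True)
    subprocess.run(docker + [
        "codeql", "database", "analyze", "/src/.codeql-db",
        f"codeql/{language}-queries",  # illustrative query pack; a real setup would pin versions
        "--format=sarif-latest", f"--output={sarif_out}",
    ], check=True)
    return sarif_out
```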
Key Insights from A.S.E’s Evaluation
The researchers evaluated 26 leading LLMs, including both proprietary and open-source models, using A.S.E. Their findings offer crucial insights into the current state of AI code security:
- LLMs Struggle with Secure Coding: Despite advancements, no evaluated LLM exceeded a 50% score in Code Security. While models like Claude-3.7-Sonnet achieved high scores in code quality (producing syntactically correct and functional code), their security scores were significantly lower. This highlights a tendency for LLMs to prioritize correctness over security.
- Repository-Level Complexity is a Major Hurdle: Models that perform well on simpler, snippet-level security tasks often struggle with A.S.E’s repository-level challenges, which involve understanding cross-file dependencies and long contexts.
- “Fast-Thinking” Outperforms “Slow-Thinking” for Security: Surprisingly, decoding strategies that are more concise and direct (dubbed “fast-thinking”) consistently achieved better security patching results than complex, multi-step reasoning approaches (“slow-thinking”). This suggests that increased reasoning budget doesn’t always translate to better security fixes at the repository level.
- Stability Doesn’t Guarantee Security: Some models showed high generation stability (consistent output across multiple runs) but still produced highly vulnerable code. This underscores the importance of evaluating security independently from stability.
- Open-Source Models are Highly Competitive: The security performance gap between proprietary and open-source models was found to be narrow. In fact, Qwen3-235B-A22B-Instruct, an open-source model, achieved the highest security score, surpassing even leading proprietary models.
Specific Challenges and Future Directions
The A.S.E benchmark revealed that “Path Traversal” vulnerabilities pose the greatest challenge for LLMs. These attacks, which involve manipulating file paths to access unauthorized directories, are subtle and context-dependent, indicating that current LLMs lack robust reasoning about file system operations.
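To see why this class of bug is easy to miss, here is a small illustrative example (not drawn from the benchmark; the paths and function names are hypothetical) contrasting a vulnerable file read with a hardened version that confines resolved paths to an allowed directory.

```python
# Illustrative path-traversal example; paths and function names are hypothetical.
from pathlib import Path

BASE_DIR = Path("/srv/app/uploads").resolve()

def read_upload_vulnerable(filename: str) -> bytes:
    # Joining untrusted input directly: "../../etc/passwd" escapes BASE_DIR entirely.
    return (BASE_DIR / filename).read_bytes()

def read_upload_safe(filename: str) -> bytes:
    resolved = (BASE_DIR / filename).resolve()
    # Reject any path that resolves outside the allowed directory (Python 3.9+).
    if not resolved.is_relative_to(BASE_DIR):
        raise PermissionError(f"path escapes upload directory: {filename}")
    return resolved.read_bytes()
```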
The study also noted that Mixture-of-Experts (MoE) architectures, commonly found in leading open-source LLMs, generally showed stronger security performance compared to dense models.
In conclusion, A.S.E provides a robust and realistic foundation for evaluating the security of AI-generated code. Its findings have significant implications for the development of more secure and reliable LLMs for software engineering, suggesting that prompting strategies are as crucial as model choice and that open-source models are strong contenders in the security arena. For more details, you can read the full research paper here.


