TLDR: A.S.E (AI Code Generation Security Evaluation) is a new repository-level benchmark for rigorously evaluating the security of AI-generated code. It builds its tasks from real-world vulnerabilities (CVEs) in actual software projects and uses a reproducible, containerized framework to assess security, build quality, and generation stability. Evaluating 26 LLMs, the study finds that models struggle with secure coding despite producing high-quality code, that concise “fast-thinking” decoding strategies patch vulnerabilities more effectively than elaborate multi-step reasoning, and that open-source models are highly competitive with proprietary ones on security.
As large language models (LLMs) become increasingly integrated into software development, from writing new code to fixing bugs, a critical question arises: how secure is the code they generate? While LLMs are powerful, their output can inadvertently introduce, spread, or even worsen security vulnerabilities, especially in complex projects with many interconnected files.
Existing methods for evaluating the security of AI-generated code often fall short. Many focus on small, isolated code snippets, use inconsistent evaluation techniques that are hard to reproduce, and don’t adequately link the quality of the input context to the security of the output code. This leaves a significant gap in understanding how well LLMs perform in real-world software engineering scenarios.
Introducing A.S.E: A New Standard for AI Code Security
To address these challenges, a team of researchers from Tencent, Peking University, Fudan University, Shanghai Jiao Tong University, Tsinghua University, Zhejiang University, Institute of Information Engineering, Chinese Academy of Sciences, and Singapore Management University has introduced A.S.E (AI Code Generation Security Evaluation). This groundbreaking benchmark is designed for evaluating secure code generation at the repository level, meaning it considers the entire project context, including build systems and dependencies between different files.
A.S.E stands out by constructing its evaluation tasks from real-world software repositories that have documented Common Vulnerabilities and Exposures (CVEs). This approach ensures that the benchmark reflects genuine security challenges faced in practical development. It also maintains the full repository context, forcing LLMs to reason about how code changes affect the entire project, not just isolated parts.
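To make the task construction concrete, here is a minimal, hypothetical sketch of what a repository-level task entry could look like. The field names and values are illustrative assumptions for this article, not A.S.E’s actual data format.

```python
# Hypothetical task record for a repository-level secure-coding benchmark.
# All field names and values are illustrative; this is not A.S.E's actual schema.
from dataclasses import dataclass, field

@dataclass
class RepoSecurityTask:
    cve_id: str                # documented real-world vulnerability the task is built from
    repo_url: str              # source repository providing the full project context
    target_file: str           # file the model must complete or patch
    cwe_category: str          # vulnerability class, e.g. "Path Traversal (CWE-22)"
    context_files: list[str] = field(default_factory=list)  # cross-file dependencies kept intact

task = RepoSecurityTask(
    cve_id="CVE-2023-XXXXX",   # placeholder identifier
    repo_url="https://github.com/example/project",
    target_file="src/handlers/upload.py",
    cwe_category="Path Traversal (CWE-22)",
    context_files=["src/config.py", "src/utils/paths.py"],
)
```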
The benchmark’s evaluation framework is highly reproducible and auditable. It uses containerized environments (like Docker) to ensure consistent results and employs expert-defined rules, combining industry-grade analyzers like CodeQL and Joern with specific logic tailored to different vulnerability types. This provides stable and transparent assessments of code security, build quality, and the consistency of the generated code.
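As a rough illustration of what such a reproducible check might look like, the sketch below builds a CodeQL database for a repository and analyzes it inside a container. The Docker image name and query pack are placeholder assumptions; only the `codeql database create` and `codeql database analyze` commands are standard CodeQL CLI usage, and A.S.E’s actual pipeline may differ.

```python
# Minimal sketch of a containerized static-analysis step, in the spirit of a
# reproducible evaluation pipeline. "codeql-runner:latest" is a placeholder image
# assumed to contain the CodeQL CLI; the query pack name is likewise illustrative.
import subprocess

def analyze_repo(repo_dir: str, language: str = "python") -> str:
    """Build a CodeQL database for the repo and run a query suite inside Docker,
    writing findings to a SARIF file for rule-based post-processing."""
    sarif_out = "results.sarif"
    docker = [
        "docker", "run", "--rm",
        "-v", f"{repo_dir}:/src", "-w", "/src",
        "codeql-runner:latest",  # placeholder image with the CodeQL CLI installed
    ]
    # The database lives under /src so it persists across the two container runs.
    subprocess.run(docker + [
        "codeql", "database", "create", "/src/.codeql-db",
        f"--language={language}", "--source-root=/src",
    ], check=True)
    subprocess.run(docker + [
        "codeql", "database", "analyze", "/src/.codeql-db",
        f"codeql/{language}-queries",  # illustrative query pack; a real setup would pin versions
        "--format=sarif-latest", f"--output={sarif_out}",
    ], check=True)
    return sarif_out
```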
Key Insights from A.S.E’s Evaluation
The researchers evaluated 26 leading LLMs, including both proprietary and open-source models, using A.S.E. Their findings offer crucial insights into the current state of AI code security:
- LLMs Struggle with Secure Coding: Despite advancements, no evaluated LLM exceeded a 50% score in Code Security. While models like Claude-3.7-Sonnet achieved high scores in code quality (producing syntactically correct and functional code), their security scores were significantly lower. This highlights a tendency for LLMs to prioritize correctness over security.
- Repository-Level Complexity is a Major Hurdle: Models that perform well on simpler, snippet-level security tasks often struggle with A.S.E’s repository-level challenges, which involve understanding cross-file dependencies and long contexts.
- “Fast-Thinking” Outperforms “Slow-Thinking” for Security: Surprisingly, decoding strategies that are more concise and direct (dubbed “fast-thinking”) consistently achieved better security patching results than complex, multi-step reasoning approaches (“slow-thinking”). This suggests that increased reasoning budget doesn’t always translate to better security fixes at the repository level.
- Stability Doesn’t Guarantee Security: Some models showed high generation stability (consistent output across multiple runs) but still produced highly vulnerable code. This underscores the importance of evaluating security independently from stability.
- Open-Source Models are Highly Competitive: The security performance gap between proprietary and open-source models was found to be narrow. In fact, Qwen3-235B-A22B-Instruct, an open-source model, achieved the highest security score, surpassing even leading proprietary models.
Specific Challenges and Future Directions
The A.S.E benchmark revealed that “Path Traversal” vulnerabilities pose the greatest challenge for LLMs. These attacks, which involve manipulating file paths to access unauthorized directories, are subtle and context-dependent, indicating that current LLMs lack robust reasoning about file system operations.
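To see why this class of bug is easy to miss, here is a small illustrative example (not drawn from the benchmark; the paths and function names are hypothetical) contrasting a vulnerable file read with a hardened version that confines resolved paths to an allowed directory.

```python
# Illustrative path-traversal example; paths and function names are hypothetical.
from pathlib import Path

BASE_DIR = Path("/srv/app/uploads").resolve()

def read_upload_vulnerable(filename: str) -> bytes:
    # Joining untrusted input directly: "../../etc/passwd" escapes BASE_DIR entirely.
    return (BASE_DIR / filename).read_bytes()

def read_upload_safe(filename: str) -> bytes:
    resolved = (BASE_DIR / filename).resolve()
    # Reject any path that resolves outside the allowed directory (Python 3.9+).
    if not resolved.is_relative_to(BASE_DIR):
        raise PermissionError(f"path escapes upload directory: {filename}")
    return resolved.read_bytes()
```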
The study also noted that Mixture-of-Experts (MoE) architectures, commonly found in leading open-source LLMs, generally showed stronger security performance compared to dense models.
In conclusion, A.S.E provides a robust and realistic foundation for evaluating the security of AI-generated code. Its findings have significant implications for the development of more secure and reliable LLMs for software engineering, suggesting that prompting strategies are as crucial as model choice and that open-source models are strong contenders in the security arena. For more details, you can read the full research paper here.


