TLDR: The CIRCLE benchmark evaluates the security of LLM code interpreters against resource exhaustion attacks (CPU, memory, disk). It uses both direct and indirect malicious prompts and executes the generated code to assess vulnerabilities. Findings show significant inconsistencies and high susceptibility to indirect, socially-engineered prompts across commercial models, highlighting an urgent need for better interpreter-specific cybersecurity measures and industry standards.
Large language models (LLMs) are becoming increasingly powerful, especially with the integration of native code interpreters. These interpreters allow LLMs to execute code in real-time, greatly expanding their capabilities for tasks like code generation, mathematical reasoning, and data analysis. However, this advancement also introduces new cybersecurity risks, specifically at the system level, which are different from traditional prompt-based vulnerabilities.
A new benchmark called CIRCLE (Code-Interpreter Resilience Check for LLM Exploits) has been developed to systematically evaluate these interpreter-specific security risks. This benchmark consists of 1,260 prompts designed to test an LLM’s resilience against resource exhaustion attacks targeting CPU, memory, and disk. The prompts are categorized into two types: ‘direct’ prompts, which are explicitly malicious, and ‘indirect’ prompts, which appear benign but are designed to subtly trigger resource-intensive operations.
The CIRCLE benchmark uses an automated evaluation framework. It doesn’t just check if an LLM refuses or generates risky code; it actually executes the generated code within the interpreter environment. This allows for a comprehensive assessment of the LLM’s behavior, including whether the code is correct, if the LLM simplifies the code to make it safe, or if the execution times out. A separate ‘judge LLM’ is used to classify the outcomes into six categories: refusal, reframe, follow-up, incorrect code, fulfilled, or timeout.
Key Findings from the Evaluation
The evaluation of seven commercially available models from OpenAI and Google revealed significant and inconsistent vulnerabilities. For instance, OpenAI’s o4-Mini model showed a high rate of refusing risky requests (7.1%), but paradoxically, it also had the highest rate of unsafe code execution (70.2%) once execution began. This suggests that while it might initially be robust, it becomes highly susceptible once it starts executing code.
Another important observation was the variability in timeout behaviors, especially among Gemini models. Gemini 2.5 Pro Preview, for example, had a 65.1% timeout rate, partly due to Google’s documented 30-second timeout policy. This highlights how critical provider-specific timeout policies are in managing resource exhaustion vulnerabilities.
The research also found that indirect, socially-engineered prompts are particularly effective at bypassing model defenses. These prompts, disguised as normal tasks, consistently weakened the security measures of the models, emphasizing the substantial threat they pose.
Why This Matters
Existing security benchmarks for LLMs often focus on vulnerabilities in third-party software or the robustness of evaluation sandboxes. CIRCLE fills a crucial gap by specifically addressing denial-of-service vectors (CPU, memory, disk) that occur within the LLM’s own execution context and can be triggered by a single prompt. This interpreter-centric approach demands new ways of measuring outcomes, such as distinguishing between a timeout and a successfully fulfilled but risky task.
The benchmark’s contributions include a comprehensive risk taxonomy with both direct and indirect prompt variants, an automated multi-provider evaluation system that executes generated code, and its open-source accessibility to encourage further research.
Also Read:
- AI Ecosystems Face Mounting Threats from LLM Plugin Vulnerabilities
- Navigating the AI Frontier: Large Language Models in Social Simulation
Limitations and Ethical Considerations
While CIRCLE is a significant step, it has limitations. Its static nature means it might not adapt quickly to new threats. The cost of extensive evaluations, especially with current API pricing, can also be prohibitive. Future work could explore more dynamic prompt databases and cost-efficient evaluation methods. The benchmark primarily assesses API-native interpreters, and expanding it to include third-party or local interpreter frameworks would enhance its comprehensiveness.
Ethical considerations were paramount in its design. All prompts adhere to typical interpreter environment constraints (e.g., limited memory, no network access) to minimize real-world harm. Prompts avoid sensitive operations, and the nature of the prompts is disclosed to model providers before public release to facilitate remediation and strengthen industry standards.
In conclusion, the CIRCLE benchmark is a vital tool for understanding and addressing interpreter-specific cybersecurity vulnerabilities in LLMs. Its initial findings underscore the urgent need for ongoing benchmarking, specialized mitigation tools, and clear industry standards to ensure the safe and responsible deployment of these increasingly capable AI systems. You can find the full research paper here: A Simple Benchmark for LLM Code Interpreter Security.


