TLDR: STAF (Security Test Automation Framework) is a novel approach that uses Large Language Models (LLMs) and a four-step self-corrective Retrieval-Augmented Generation (RAG) framework to automate the generation of executable security test cases from attack trees. Designed for modern automotive development, STAF significantly improves the efficiency, accuracy, and scalability of security testing, addressing the labor-intensive and error-prone nature of traditional methods. It generates comprehensive and executable test suites, including Python scripts and LTL properties, by analyzing attack trees, adaptively retrieving information, generating test cases, and iteratively refining them.
In the rapidly evolving world of automotive technology, ensuring the security of vehicle systems against sophisticated cyber threats is paramount. Traditional security testing methods, which often rely on “attack trees” to map out potential vulnerabilities, are typically labor-intensive, prone to errors, and struggle with automation, especially for complex vehicular systems.
A new research paper introduces STAF (Security Test Automation Framework) to address these problems. STAF combines Large Language Models (LLMs) with a four-step self-corrective Retrieval-Augmented Generation (RAG) framework to automate the creation of executable security test cases directly from attack trees. The aim is an end-to-end workflow, from threat model to runnable test, that covers the attack surface of modern automotive systems.
Understanding STAF: How It Works
STAF’s innovative approach streamlines the process of generating security test cases. It integrates with existing threat modeling tools, like AVL ThreatGuard, which can create attack trees from Threat Analysis and Risk Assessment (TARA) inputs. These attack trees then serve as the foundation for STAF to generate executable Python scripts or Linear Temporal Logic (LTL) properties for model checking.
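To make the LTL output more concrete, here is a minimal sketch of a safety property of the form G(write_accepted -> authenticated), read as "at every step, a write is accepted only if the session is authenticated". The property, the event names, and the `holds_globally` helper are illustrative assumptions and do not come from the paper. A real model checker would verify the property against a model of the system; this sketch only checks it on a single finite recorded trace.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]


def holds_globally(trace: List[State], invariant: Callable[[State], bool]) -> bool:
    """Check the pattern G(invariant) on a finite recorded trace."""
    return all(invariant(state) for state in trace)


def write_requires_auth(state: State) -> bool:
    # The implication "write_accepted -> authenticated" as a boolean expression.
    return (not state["write_accepted"]) or state["authenticated"]


# Hypothetical traces: each entry is one observed step of the system under test.
safe_trace = [
    {"authenticated": False, "write_accepted": False},
    {"authenticated": True, "write_accepted": True},
]
unsafe_trace = [
    {"authenticated": False, "write_accepted": False},
    {"authenticated": False, "write_accepted": True},  # write without unlock
]

print(holds_globally(safe_trace, write_requires_auth))    # True
print(holds_globally(unsafe_trace, write_requires_auth))  # False
```

A violated property like the second trace is what a security test would flag: the system accepted a write it should have refused.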
The framework operates through four interconnected stages:
- Attack-tree Analysis: An LLM analyzes the structured JSON format of attack trees to understand the relationships between threats, attack vectors, and system weaknesses. It extracts crucial details like affected components, potential impacts, preconditions, and required access levels.
- Adaptive Information Retrieval: This stage ensures STAF has access to current and relevant knowledge. It uses keywords from the attack tree analysis to search a vectorized database of automotive cybersecurity knowledge, including the Auto-ISAC Automotive Threat Matrix and test libraries from AVL TestGuard. If the initial results are insufficient, it falls back to a targeted web search. Behavioral models of protocols (Mealy machines) can also be included to enhance the LLM’s contextual understanding.
- Test-case Generation: With the gathered knowledge, STAF generates structured test cases in JSON format. The LLM is guided by prompts to include essential elements such as a descriptive title, scenario overview, setup instructions, executable test scripts, tear-down procedures, and expected outcomes.
- Iterative Refinement: Using an “LLM-as-a-judge” approach, STAF evaluates the generated test cases for alignment with the attack tree, completeness, runnability, and overall quality. If a test case doesn’t meet the quality benchmarks, the framework adjusts or regenerates it based on suggested improvements, continuing this cycle until satisfactory scores are achieved.
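The control flow of the last three stages can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: `stub_generate` and `stub_judge` are hypothetical stand-ins for the LLM calls, the knowledge base is a plain dictionary rather than a vector store, and the field names, the score threshold of 9.0, and the round limit are all invented for illustration.

```python
import json
from typing import Callable, Dict, List, Tuple

TestCase = Dict[str, str]
# Hypothetical field names for the elements the article lists.
REQUIRED_FIELDS = ["title", "scenario", "setup", "script", "teardown", "expected_outcome"]


def retrieve(keywords: List[str], knowledge_base: Dict[str, str],
             web_search: Callable[[List[str]], List[str]], min_hits: int = 1) -> List[str]:
    """Stage 2: query the local knowledge base first, fall back to web search."""
    hits = [text for key, text in knowledge_base.items() if key in keywords]
    if len(hits) < min_hits:
        hits.extend(web_search(keywords))
    return hits


def refine_until_good(generate, judge, context: List[str],
                      threshold: float = 9.0, max_rounds: int = 3):
    """Stages 3 and 4: generate, score, and regenerate using the judge's feedback."""
    feedback = ""
    for attempt in range(1, max_rounds + 1):
        test_case = generate(context, feedback)
        score, feedback = judge(test_case)
        if score >= threshold:
            break
    return test_case, score, attempt


def stub_generate(context: List[str], feedback: str) -> TestCase:
    # Stand-in for the LLM: adds a teardown step only once the judge asks for it.
    case = {
        "title": "Intercept UDS Communication",
        "scenario": "Passive capture of diagnostic traffic",
        "setup": "Connect to the bench CAN interface",
        "script": "print('capture placeholder')",
        "expected_outcome": "No sensitive payload is readable",
    }
    if "teardown" in feedback:
        case["teardown"] = "Disconnect from the bench CAN interface"
    return case


def stub_judge(test_case: TestCase) -> Tuple[float, str]:
    # Stand-in for LLM-as-a-judge: scores completeness only, on a 0-10 scale.
    missing = [name for name in REQUIRED_FIELDS if name not in test_case]
    score = 10.0 * (len(REQUIRED_FIELDS) - len(missing)) / len(REQUIRED_FIELDS)
    feedback = "add missing fields: " + ", ".join(missing) if missing else ""
    return score, feedback


kb = {"uds": "internal note on UDS diagnostics"}
context = retrieve(["uds", "mitm"], kb, web_search=lambda kws: ["web result for " + kws[0]])
final_case, final_score, rounds = refine_until_good(stub_generate, stub_judge, context)
print(json.dumps(final_case, indent=2))
print(final_score, rounds)
```

In this toy run the first draft lacks a teardown step, the judge reports it, and the second draft passes. The round limit matters in practice because each extra iteration costs another pair of model calls.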
Significant Advancements and Performance
The evaluation showed that STAF improves efficiency, accuracy, and scalability compared to using general-purpose (vanilla) LLMs on their own. The researchers ran GPT-4.1 and DeepSeek-V3 both with and without STAF. STAF, especially when combined with Mealy models (STAF&MM), consistently produced more tests and substantial gains across three metrics: alignment (how well tests address the threats), runnability (whether the code executes), and completeness (how thorough the test cases are).
For instance, GPT-4.1 integrated with STAF saw its overall score increase from 7.17 to 9.11, with a notable rise in alignment from 7.00 to 9.80. The inclusion of learned protocol models further boosted the quality of generated test cases, enabling the LLM to craft more specific and effective tests, such as utilizing undocumented sub-functions in UDS protocol attacks, which vanilla LLMs failed to achieve.
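To illustrate what a protocol model of this kind looks like, here is a minimal sketch of a Mealy machine, where each (state, input) pair maps to a next state and an output. The states, input symbols, and outputs below are a hand-written toy loosely modeled on a diagnostic unlock sequence. They are assumptions for illustration, not a learned model and not taken from the paper.

```python
from typing import Dict, List, Tuple

# (state, input) -> (next_state, output). Toy model, not a learned one.
Transitions = Dict[Tuple[str, str], Tuple[str, str]]

TOY_MODEL: Transitions = {
    ("locked", "request_seed"): ("seed_sent", "seed"),
    ("locked", "send_valid_key"): ("locked", "sequence_error"),
    ("locked", "write_data"): ("locked", "access_denied"),
    ("seed_sent", "request_seed"): ("seed_sent", "seed"),
    ("seed_sent", "send_valid_key"): ("unlocked", "ok"),
    ("seed_sent", "write_data"): ("seed_sent", "access_denied"),
    ("unlocked", "request_seed"): ("unlocked", "seed"),
    ("unlocked", "send_valid_key"): ("unlocked", "sequence_error"),
    ("unlocked", "write_data"): ("unlocked", "ok"),
}


def run(model: Transitions, inputs: List[str], start: str = "locked") -> List[str]:
    """Feed an input sequence through the machine and collect the outputs."""
    state = start
    outputs = []
    for symbol in inputs:
        state, output = model[(state, symbol)]
        outputs.append(output)
    return outputs


def as_prompt_context(model: Transitions) -> str:
    """Serialize the transitions as text lines that a prompt could include."""
    return "\n".join(
        f"{src} --{inp}/{out}--> {dst}"
        for (src, inp), (dst, out) in sorted(model.items())
    )


print(run(TOY_MODEL, ["write_data", "request_seed", "send_valid_key", "write_data"]))
print(as_prompt_context(TOY_MODEL))
```

Because the model spells out which inputs are accepted in which state, a generator that sees it can propose input sequences aimed at specific transitions instead of guessing at the protocol's behavior.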
Real-World Application: Battery Management System Case Study
To demonstrate its practical utility, STAF was applied in a case study involving the Battery Management System (BMS) of a vehicle. By analyzing an attack tree targeting a “Man-in-the-Middle Attack via UDS Message Collection,” STAF successfully generated security test cases for attack vectors like “Intercept UDS Communication” and “Inject Malicious UDS Messages.” This case study highlighted STAF’s ability to translate complex threat models into actionable security tests in a realistic scenario.
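As a rough illustration of the input side of this case study, the sketch below encodes the attack tree as JSON and extracts its leaf attack vectors, which are the units a test would be generated for. The node names come from the case study, but the JSON schema (`name`, `target`, `children`) and the `leaf_attack_vectors` helper are assumptions, since the article does not show the actual format.

```python
import json
from typing import Dict, List

# Hypothetical schema; only the node names are taken from the case study.
ATTACK_TREE_JSON = """
{
  "name": "Man-in-the-Middle Attack via UDS Message Collection",
  "target": "Battery Management System",
  "children": [
    {"name": "Intercept UDS Communication", "children": []},
    {"name": "Inject Malicious UDS Messages", "children": []}
  ]
}
"""


def leaf_attack_vectors(node: Dict) -> List[str]:
    """Return the names of all leaf nodes, depth first, left to right."""
    children = node.get("children", [])
    if not children:
        return [node["name"]]
    leaves: List[str] = []
    for child in children:
        leaves.extend(leaf_attack_vectors(child))
    return leaves


tree = json.loads(ATTACK_TREE_JSON)
for vector in leaf_attack_vectors(tree):
    print(vector)
```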
Looking Ahead
While STAF marks a substantial advancement, the researchers acknowledge limitations: specific implementation details (e.g., CAN baud rates) still require manual input, and running multiple refinement iterations is resource-intensive for complex applications. Future work aims to address these by expressing test cases in Domain-Specific Languages (DSLs), which would make it easier to inject implementation details, and by incorporating feedback loops from testing frameworks to further refine test quality and automation.
STAF represents a significant leap forward in automating automotive security testing, offering a scalable and adaptable solution that enhances the robustness of modern vehicles against cyber threats.


