TLDR: ExCyTIn-Bench is a new benchmark for evaluating LLM agents on cyber threat investigation tasks. It uses a simulated Azure tenant with real-world security logs and multi-stage attacks to generate questions from threat investigation graphs. The benchmark provides a MySQL environment for agents to interact with and offers fine-grained rewards for intermediate steps. Initial experiments show that the task is challenging for current LLMs, leaving significant room for improvement, and highlight the rapid progress of open-source models.
Cybersecurity is a critical field in our increasingly digital world, but it faces a growing challenge: the sheer volume and complexity of cyberattacks. Traditional defense mechanisms are often outmaneuvered by sophisticated attackers, making human-led threat investigations essential. However, these investigations are time-consuming, requiring analysts to manually sift through vast amounts of security logs and piece together evidence.
This is where the potential of Large Language Model (LLM) agents comes into play. LLMs have shown remarkable capabilities in understanding complex environments and performing multi-step actions, making them promising candidates for automating cyber threat investigations. To truly advance this area, a robust and standardized way to evaluate these LLM agents is needed.
A new benchmark called ExCyTIn-Bench has been introduced to address this need. It’s designed to assess how well LLM agents can perform cyber threat investigations. Unlike previous benchmarks that focused on knowledge recall, ExCyTIn-Bench evaluates the agent’s ability to investigate and reason through security scenarios.
What is ExCyTIn-Bench?
ExCyTIn-Bench is built on a realistic foundation. It uses data from a simulated Microsoft Azure tenant, which mimics a real-world corporate environment. This tenant includes 57 different log tables, capturing various activities like login events, email events, and virtual environment actions. Crucially, it also contains data from 8 simulated multi-stage cyberattacks, providing a rich and diverse dataset for evaluation.
The benchmark’s creators developed a unique method for generating questions. They construct “threat investigation graphs” from security alerts and incidents. These graphs map out the relationships between alerts and entities (like user accounts, hosts, or IP addresses) involved in an attack. By using these graphs, they can automatically generate 589 specific questions. Each question is tied to a clear “start” and “end” point on the graph, ensuring that there’s a verifiable answer and a logical path to reach it. This structured approach not only provides ground truth for evaluation but also allows for the automatic generation of procedural tasks, which could be useful for training agents using reinforcement learning.
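As a rough illustration of the idea, the sketch below builds a small alert-entity graph and finds the path between a "start" and an "end" node with a breadth-first search. The alert and entity names are hypothetical, and the search is only an assumed stand-in for however the benchmark actually derives its solution paths.

```python
from collections import deque

# Hypothetical alert-entity edges; names are illustrative, not from the benchmark.
EDGES = [
    ("alert:suspicious_login", "entity:user_alice"),
    ("alert:suspicious_login", "entity:ip_10.0.0.5"),
    ("alert:malicious_email", "entity:user_alice"),
    ("alert:malicious_email", "entity:host_ws01"),
    ("alert:lateral_movement", "entity:host_ws01"),
    ("alert:lateral_movement", "entity:user_bob"),
]


def build_graph(edges):
    """Build an undirected adjacency map linking alerts and entities."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, end):
    """Breadth-first search; returns the node list from start to end, or None."""
    if start not in graph or end not in graph:
        return None
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


graph = build_graph(EDGES)
# The "start" supplies the question's context; the "end" is the expected answer.
path = shortest_path(graph, "alert:suspicious_login", "entity:user_bob")
print(" -> ".join(path))
```

In this toy graph, a question could describe the suspicious login and ask which other account was later involved; the path gives the chain of alerts and entities an investigator would have to follow to get there.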
To let agents interact with this data, ExCyTIn-Bench sets up a MySQL Docker environment. Agents submit SQL queries as “actions” and receive the results as “observations,” simulating how a security analyst would query a log database. The evaluation does not just check the final answer; it also assigns “decayed rewards” to intermediate steps. If an agent finds useful information along the way, it earns partial credit even when it never reaches the final answer. This fine-grained feedback helps in understanding an agent’s investigative process and can support future training efforts.
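One way to picture a decayed reward is the sketch below. It is not the benchmark's scoring code: the geometric decay, the factor `gamma=0.4`, and the function name `decayed_reward` are all assumptions made for illustration. A correct final answer earns full credit; otherwise each intermediate step the agent uncovered earns less credit the further it sits from the final answer.

```python
def decayed_reward(steps_found, gamma=0.4):
    """Score an investigation given which solution steps the agent found.

    steps_found: list of bools ordered from the first step to the final answer.
    gamma: assumed decay factor applied per step of distance from the answer.
    """
    if not steps_found:
        return 0.0
    if steps_found[-1]:
        # Final answer is correct: full credit.
        return 1.0
    reward = 0.0
    # Walk backward from the step just before the final answer.
    for distance, found in enumerate(reversed(steps_found[:-1]), start=1):
        if found:
            reward += gamma ** distance
    # Partial credit never exceeds the credit for a correct answer.
    return min(reward, 1.0)


print(decayed_reward([True, True, True]))    # correct final answer
print(decayed_reward([True, True, False]))   # both intermediate steps found
print(decayed_reward([True, False, False]))  # only the earliest step found
```

Under these assumptions, an agent that finds the steps closest to the answer scores higher than one that only uncovers early context, which is the kind of graded signal the article describes.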
Key Findings from Experiments
The researchers tested a wide range of LLMs, including both proprietary and open-source models, as well as different types like chat and reasoning models. The results showed that the task is quite challenging. The average reward across all evaluated models was 0.249, with the best model achieving 0.368. This indicates significant room for improvement in future research.
Interestingly, the “o4-mini” model delivered the best performance. The experiments also highlighted that open-source models are rapidly improving and closing the gap with proprietary ones. The study found a strong correlation between an agent’s ability to generate successful database queries and its overall reward, emphasizing the importance of accurate database interaction skills.
The benchmark also allows for analysis of agent behavior. For instance, agents often start by exploring table schemas and then refine their SQL queries based on errors or empty results, much like a human analyst would. This demonstrates that the benchmark supports various problem-solving strategies.
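The explore-then-refine pattern can be sketched with a toy action/observation loop. To keep the example self-contained, it uses Python's built-in `sqlite3` as a stand-in for the benchmark's MySQL environment; the table `signin_logs`, its columns, its rows, and the scripted queries are all hypothetical.

```python
import sqlite3


def make_env():
    """Create an in-memory database with one hypothetical log table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signin_logs (account TEXT, ip TEXT, result TEXT)")
    conn.executemany(
        "INSERT INTO signin_logs VALUES (?, ?, ?)",
        [("alice", "10.0.0.5", "failure"), ("bob", "10.0.0.7", "success")],
    )
    return conn


def step(conn, sql, max_rows=5):
    """Run one SQL action and return a text observation for the agent."""
    try:
        rows = conn.execute(sql).fetchmany(max_rows)
    except sqlite3.Error as exc:
        # Errors are returned, not raised, so the agent can correct its query.
        return f"ERROR: {exc}"
    if not rows:
        return "EMPTY: query returned no rows"
    return "\n".join(str(row) for row in rows)


# A scripted stand-in for an agent: explore the schema, hit an error,
# get an empty result, then land on a query that works.
ACTIONS = [
    "SELECT name FROM sqlite_master WHERE type = 'table'",
    "SELECT username FROM signin_logs",
    "SELECT account, ip FROM signin_logs WHERE result = 'blocked'",
    "SELECT account, ip FROM signin_logs WHERE result = 'failure'",
]

conn = make_env()
observations = [step(conn, sql) for sql in ACTIONS]
for sql, obs in zip(ACTIONS, observations):
    print(f"> {sql}\n{obs}\n")
```

Against a real MySQL server the schema-exploration step would typically use `SHOW TABLES` or `DESCRIBE` instead of querying `sqlite_master`, and a real agent would choose each query based on the previous observation rather than follow a fixed script.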
Future Outlook
ExCyTIn-Bench is a significant step forward in evaluating LLM agents for cybersecurity. By providing a realistic dataset, a structured question generation method, and a detailed evaluation framework, it aims to accelerate research in this field. The environment’s ability to provide fine-grained rewards makes it particularly well-suited for training LLM agents using reinforcement learning. Future work could also explore how to leverage the inherent graph structure of the environment to further enhance agent training.
For more in-depth information, you can refer to the full research paper: ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation.


