TLDR: CyberSOCEval is a new open-source benchmark suite designed to evaluate Large Language Models (LLMs) for critical Security Operations Center (SOC) tasks: Malware Analysis and Threat Intelligence Reasoning. It aims to provide a clear standard for AI developers to improve models and for cyber defenders to select effective tools, revealing that while larger LLMs perform better, there’s significant room for improvement, especially in security-specific reasoning and multimodal data processing.
Cyber defenders today face an overwhelming barrage of security alerts and threat intelligence, underscoring an urgent need for AI systems that can meaningfully augment operational security work. While Large Language Models (LLMs) hold immense potential to automate and scale Security Operations Center (SOC) operations, existing evaluations often fall short in assessing their capabilities in realistic security scenarios.
This gap in comprehensive evaluation has significant implications. AI developers lack a clear direction for improving their models, and cyber defenders struggle to reliably select the most effective tools. Furthermore, with malicious actors increasingly leveraging AI to scale cyber attacks, the need for robust, open-source benchmarks to drive community-driven improvement among defenders and AI model developers has become critical.
Introducing CyberSOCEval: A New Benchmark for Cyber Defense
To address these pressing challenges, researchers have introduced CyberSOCEval, a groundbreaking suite of open-source benchmarks. As part of the broader CyberSecEval 4 initiative, CyberSOCEval is specifically designed to evaluate LLMs in two core defensive domains: Malware Analysis and Threat Intelligence Reasoning. These areas are crucial for cyber defenders but have historically received inadequate coverage in existing security benchmarks.
The evaluations conducted using CyberSOCEval have yielded several key insights. First, larger and more recent LLMs generally perform better, consistent with established training scaling laws. More surprisingly, reasoning models, which typically deliver significant gains in areas like coding and mathematics, do not show the same improvement in cybersecurity analysis. This suggests that these models have not been specifically trained to reason about cybersecurity, pointing to a significant opportunity for future improvement.
Crucially, the research also reveals that current LLMs are far from saturating the evaluations. This indicates that CyberSOCEval presents a substantial challenge for AI developers, offering a clear path and incentive to enhance AI capabilities in cyber defense.
Key Contributions of CyberSOCEval
The introduction of CyberSOCEval marks a significant step forward for cybersecurity evaluation. Unlike previous closed benchmarks, it provides an open-source framework for evaluating AI systems, serving as a ‘North Star’ for AI model developers and a practical selection metric for cyber practitioners. The benchmark highlights a considerable ‘hill to climb’ for AI developers, demonstrating ample room to improve cyber defensive capabilities, and it identifies performance themes across many popular LLMs, helping practitioners choose the right LLM for their needs.
Deep Dive into the Benchmarks
CyberSOCEval focuses on two critical areas, chosen for their essential role in defensive operations:
Malware Analysis
This benchmark assesses an LLM’s precision and recall in identifying malicious activity from potential malware, such as detecting ransomware or remote access trojans. Accurate threat detection with minimal false positives is vital for SOC analysts, and the task today is heavily manual and expertise-dependent. Test cases are derived from Hybrid Analysis report data, produced by detonating malware samples in a controlled environment such as the CrowdStrike Falcon® Sandbox. These logs provide detailed information about running processes, extracted files, and static signature detections. The dataset covers five malware categories: EDR/AV Killers, Ransomware, RAT (REMCOS Family), Infostealers, and UM (user-mode) Unhooking. Questions are multiple-choice, grounded in JSON-formatted system log data, and scored for accuracy.
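To make that format concrete, here is a minimal sketch of what a multiple-choice test case over sandbox-style JSON logs and a simple accuracy scorer could look like. The field names (`system_logs`, `question`, `choices`, `answer`), the log contents, and the grading logic are illustrative assumptions, not CyberSOCEval’s actual schema:

```python
import json

# A hypothetical test case; field names and log contents are illustrative,
# not CyberSOCEval's actual schema.
test_case = {
    "system_logs": json.dumps({
        "processes": [
            {"pid": 4211, "name": "svchost.exe", "cmdline": "svchost.exe -k netsvcs"},
            {"pid": 5120, "name": "vssadmin.exe",
             "cmdline": "vssadmin delete shadows /all /quiet"},
        ],
        "signatures": ["deletes_volume_shadow_copies"],
    }),
    "question": "Which malware category best explains this activity?",
    "choices": ["A) Infostealer", "B) Ransomware", "C) RAT", "D) EDR/AV Killer"],
    "answer": "B",
}

def build_prompt(case: dict) -> str:
    """Assemble the JSON logs, question, and answer choices into one prompt."""
    return (
        f"System logs (JSON):\n{case['system_logs']}\n\n"
        f"{case['question']}\n" + "\n".join(case["choices"]) +
        "\nAnswer with the letter only."
    )

def score(cases: list[dict], ask_model) -> float:
    """Accuracy over multiple-choice cases; ask_model is any callable prompt -> str."""
    correct = sum(
        ask_model(build_prompt(c)).strip().upper().startswith(c["answer"])
        for c in cases
    )
    return correct / len(cases)

# Stub "model" that always answers B, just to exercise the scorer:
print(score([test_case], lambda prompt: "B"))  # -> 1.0
```

In a real harness, the stub would be replaced by an actual model call, with accuracy aggregated per malware category.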
Threat Intelligence Reasoning
This component evaluates an AI system’s ability to understand, analyze, and extract actionable insights from threat intelligence reports. Going beyond basic document comprehension, it emphasizes security reasoning, such as identifying threat actors not explicitly mentioned, understanding complex attack chains, and mapping tactics to frameworks like MITRE ATT&CK. SOC analysts rely on synthesizing threat intelligence to inform detection strategies and prioritize vulnerabilities, but the sheer volume of information often overwhelms human capacity. The benchmark tests models’ ability to approximate this manual work, measuring their capacity to process unstructured reports, often supplied as a series of images (one per report page), and answer multiple-choice questions based on the content.
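As a rough illustration of what ingesting a page-image report might involve, the sketch below packs page images and a multiple-choice question into a multimodal chat payload. It follows the widely used OpenAI-style message format; the helper names, the example question and choices, and the `client` call are hypothetical, not CyberSOCEval’s actual harness:

```python
import base64
from pathlib import Path

def encode_page(path: Path) -> dict:
    """Encode one report-page image as an OpenAI-style image_url content part."""
    b64 = base64.b64encode(path.read_bytes()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def build_messages(page_paths: list[Path], question: str, choices: list[str]) -> list[dict]:
    """One user message carrying every page image plus the multiple-choice question."""
    content = [encode_page(p) for p in page_paths]
    content.append({
        "type": "text",
        "text": question + "\n" + "\n".join(choices) + "\nAnswer with the letter only.",
    })
    return [{"role": "user", "content": content}]

# Hypothetical usage; `client` stands in for any OpenAI-compatible SDK client:
# messages = build_messages(sorted(Path("report_pages").glob("*.png")),
#                           "Which MITRE ATT&CK technique does the report describe?",
#                           ["A) T1059", "B) T1486", "C) T1055", "D) T1003"])
# response = client.chat.completions.create(model="...", messages=messages)
```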
Insights and Future Directions
The research also explored the impact of input modality. Interestingly, models often performed better when given threat intelligence reports as extracted text than as page images, or even as a combination of text and images. This suggests that while multimodal capabilities are crucial, there is still significant progress to be made in how AI models reason over text ingested visually. The finding has direct implications for SOC teams selecting models for threat intelligence tasks, underscoring the need for robust multimodal understanding alongside security-specific reasoning.
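A SOC team could replicate this kind of check on its own candidate models by running an identical question set under each input condition and comparing accuracy. Here is a minimal sketch, assuming each case carries pre-built prompts keyed by modality (the `prompts`/`answer` structure is an assumption, not the benchmark’s):

```python
def compare_modalities(cases: list[dict], ask_model) -> dict[str, float]:
    """Score the same multiple-choice questions under three input conditions.

    Each case is assumed to hold pre-built prompts keyed by modality plus the
    correct answer letter; ask_model is any callable mapping prompt -> str.
    """
    results = {}
    for modality in ("text", "images", "text+images"):
        correct = sum(
            ask_model(case["prompts"][modality]).strip().upper().startswith(case["answer"])
            for case in cases
        )
        results[modality] = correct / len(cases)
    return results

# A gap such as results["text"] > results["text+images"] would mirror the
# finding above that extracted text can outperform image or mixed input.
```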
Looking ahead, the CyberSOCEval team plans several extensions, including expanding attack coverage to include fileless attacks and sensitive data exfiltration, extending to additional operating environments like Linux and mobile, and increasing the emphasis on integrating information from charts and graphics in threat reports. They also aim to develop a standardized threat intelligence ontology and new benchmarks for incident response and triage capabilities. The full research paper can be found here.
In conclusion, CyberSOCEval is a crucial open-source benchmark for LLMs in SOC activities. It addresses a critical evaluation gap, providing security teams with robust tools to assess and enhance their capabilities against emerging AI-driven threats. The benchmark clearly shows that more research and development are needed to fine-tune LLMs for optimal cybersecurity defense, encouraging the community to build upon these results and contribute to its ongoing evolution.


