TLDR: PentestJudge is a system that uses Large Language Models (LLMs) as judges to evaluate the operational behavior of AI-based penetration testing agents. Rather than checking only the outcome, it analyzes an agent’s entire trajectory (tool calls and states) against a hierarchical rubric covering operational objectives, operational security, and tradecraft. Tested on agents in a simulated Active Directory environment, the best LLM judge reached an F1 score of 0.83 against human experts at a significantly lower cost. The research also identifies specific failure modes and suggests that clearer rubric definitions and a multi-model approach could further improve evaluation accuracy.
The rapid advancements in Large Language Models (LLMs) are transforming various fields, including cybersecurity. While LLMs are increasingly being used as autonomous agents capable of complex tasks, evaluating their performance, especially in sensitive areas like penetration testing, goes beyond just checking if they achieve a final objective. It’s crucial to assess how they achieve those objectives, ensuring they adhere to operational guidelines, security protocols, and best practices.
This is where PentestJudge comes in. Introduced by researchers from dreadnode, USA, PentestJudge is an innovative system designed to evaluate the operational behavior of penetration testing agents. Unlike traditional evaluation methods that often focus solely on the end result, PentestJudge delves into the entire process, analyzing an agent’s actions and tool usage history to determine if they meet specific operating criteria that are difficult to check automatically.
Understanding PentestJudge’s Approach
At its core, PentestJudge utilizes an LLM-as-judge model. This means a powerful language model is given access to the complete ‘trajectory’ of a security agent – essentially a detailed log of its states and every tool call it made. The system then uses a structured rubric to evaluate these actions. This rubric is designed as a tree, breaking down the complex task of penetration testing into smaller, more manageable sub-tasks and criteria. Each ‘leaf node’ in this tree represents a simple yes-or-no question for the judge to answer, making the evaluation process systematic and clear.
The criteria within these rubrics are categorized into three main areas:
- Operational Objectives: These relate to the agent’s end-state goals, such as achieving domain administrator access.
- Operational Security: This focuses on how the agent performs its task, ensuring it doesn’t violate scope, cause service outages, or create unnecessary risks.
- Tradecraft & Thoroughness: This assesses the agent’s resilience to setbacks, its ability to adapt strategies, and its overall completeness in achieving an objective, like trying multiple techniques before giving up.
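The rubric tree and its three categories can be pictured with a small Python sketch. Everything in it is illustrative: the `RubricNode` class, the node names, the questions, and the unweighted pass-rate aggregation are assumptions made for this example, not the structure or scoring rule used by PentestJudge. The `verdicts` dictionary stands in for the yes/no answers an LLM judge would return.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree; leaves carry a yes/no question."""
    name: str
    question: str = ""
    children: List["RubricNode"] = field(default_factory=list)


def leaves(node: RubricNode) -> List[RubricNode]:
    """Collect every leaf (yes/no question) under a node."""
    if not node.children:
        return [node]
    found: List[RubricNode] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def pass_rate(node: RubricNode, verdicts: Dict[str, bool]) -> float:
    """Fraction of leaves under `node` answered 'yes'.

    Assumption: simple unweighted averaging. The article does not say
    how leaf verdicts are aggregated up the tree.
    """
    under = leaves(node)
    return sum(verdicts[leaf.name] for leaf in under) / len(under)


# Hypothetical rubric with one branch per category named in the article.
objectives = RubricNode("operational_objectives", children=[
    RubricNode("domain_admin", "Did the agent obtain Domain Administrator access?"),
])
opsec = RubricNode("operational_security", children=[
    RubricNode("in_scope", "Did the agent stay within the authorized scope?"),
    RubricNode("no_outage", "Did the agent avoid causing a service outage?"),
])
tradecraft = RubricNode("tradecraft_thoroughness", children=[
    RubricNode("multiple_techniques", "Did the agent try more than one technique before giving up?"),
])
rubric = RubricNode("pentest_engagement", children=[objectives, opsec, tradecraft])

# Stand-in for the judge's answers to each leaf question.
verdicts = {
    "domain_admin": True,
    "in_scope": True,
    "no_outage": False,
    "multiple_techniques": True,
}

print(pass_rate(opsec, verdicts))   # 0.5
print(pass_rate(rubric, verdicts))  # 0.75
```

Keeping each leaf binary means a disagreement between judge and human can be traced to one specific question rather than to a vague overall score.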
To validate PentestJudge, its scores were compared against human domain experts, who served as the ‘ground truth’. This allowed the researchers to measure the system’s accuracy using standard metrics like F1 scores.
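As a rough illustration of that comparison, the sketch below computes an F1 score from paired yes/no verdicts, treating the human answers as ground truth. The label lists are made up for the example, and treating "yes" as the positive class is an assumption; the article does not specify how the paper averages F1 across criteria.

```python
from typing import Sequence


def f1_score(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """F1 of the judge's 'yes' verdicts against human ground truth."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    tp = sum(h and j for h, j in zip(human, judge))          # both said yes
    fp = sum((not h) and j for h, j in zip(human, judge))    # judge yes, human no
    fn = sum(h and (not j) for h, j in zip(human, judge))    # judge no, human yes
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical verdicts on five leaf questions.
human_labels = [True, True, False, True, False]
judge_labels = [True, False, False, True, True]

# tp=2, fp=1, fn=1 -> precision = recall = 2/3 -> F1 = 2/3
print(round(f1_score(human_labels, judge_labels), 3))  # 0.667
```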
Real-World Application and Results
As a case study, PentestJudge was applied to evaluate a penetration testing agent operating in the Game of Active Directory (GOAD) environment, a realistic simulation of a Windows Active Directory network with common vulnerabilities. The agent’s task was to achieve Domain Administrator privileges on a specific domain.
The evaluation involved various frontier and open-source LLMs acting as judges. The results were promising: the best-performing model, Claude Sonnet 3.7, achieved an F1 score of 0.83, demonstrating a high level of agreement with human experts. Many other models also performed significantly better than a random judge baseline.
A key finding was the cost-effectiveness of LLM judges. Human experts, estimated at a consultant-tier salary, were significantly more expensive than even the most capable LLM judges. For instance, Kimi-k2-instruct, an open-source model, achieved a respectable F1 score of 0.79 at a fraction of the cost per evaluation. Even more budget-friendly models like Gemini Flash Lite could provide reasonable verification at a very low cost.
The study also highlighted three recurring failure modes. Some judges exhibited ‘shallow tool calling’, failing to explore the agent’s trajectory fully. Others showed a ‘lack of security understanding’, misinterpreting security terminology. Still others ‘inferred additional requirements’, judging more strictly than the rubric asked. These issues suggest that more specific and culturally aligned phrasing in rubrics could improve LLM judge performance.
Interestingly, the research found that different model families excelled in different evaluation categories. For example, Anthropic models were strong in judging Operational Objectives, while OpenAI models showed better balance, particularly in Operational Security. This suggests that a ‘portfolio approach’, using different models for different types of criteria, could lead to even higher overall performance.
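One way to picture such a portfolio is to route each rubric category to whichever judge scores best on it. The sketch below does this with placeholder judge names and invented F1 values; none of these numbers come from the paper, and `build_portfolio` is a hypothetical helper, not part of PentestJudge.

```python
from typing import Dict


def build_portfolio(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each category to the judge with the highest score on it.

    `scores` has the shape {judge_name: {category: f1}}.
    """
    categories = sorted({c for per_judge in scores.values() for c in per_judge})
    portfolio: Dict[str, str] = {}
    for category in categories:
        candidates = [judge for judge in scores if category in scores[judge]]
        portfolio[category] = max(candidates, key=lambda judge: scores[judge][category])
    return portfolio


# Invented per-category F1 values for two placeholder judges.
hypothetical_scores = {
    "judge_a": {"objectives": 0.90, "opsec": 0.70, "tradecraft": 0.75},
    "judge_b": {"objectives": 0.80, "opsec": 0.85, "tradecraft": 0.78},
}

portfolio = build_portfolio(hypothetical_scores)
print(portfolio)
# {'objectives': 'judge_a', 'opsec': 'judge_b', 'tradecraft': 'judge_b'}
```

In practice the per-category scores would need to be measured on held-out trajectories, since picking the best judge on the same data used to report results would overstate the gain.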
The Future of Agent Evaluation
PentestJudge represents a significant step forward in evaluating AI-based security agents. It provides a scalable and holistic method to assess not just the outcome, but the crucial process quality of these agents, which is vital for their confident deployment in sensitive production environments. The findings suggest that verifying agent behavior might be less computationally demanding than generating it, offering a cost-effective path to ensuring AI agents operate safely and effectively in cybersecurity. Future work aims to further refine these judges, potentially using their outputs as reward signals for training even more aligned and capable security agents.
For more detailed information, you can read the full research paper here.


