PentestJudge: A New System for Evaluating AI Penetration Testing Agents

TLDR: PentestJudge is a novel system that uses Large Language Models (LLMs) as judges to evaluate the operational behavior of AI-based penetration testing agents. It moves beyond outcome-focused evaluations by analyzing an agent’s entire trajectory (tool calls and states) against a hierarchical rubric covering operational objectives, security, and tradecraft. The system was tested on agents in a simulated Active Directory environment, showing that LLM judges can achieve high agreement with human experts at a significantly lower cost. The research identifies specific failure modes and suggests that clearer rubric definitions and a multi-model approach can further enhance evaluation accuracy, paving the way for more reliable AI deployment in cybersecurity.

Rapid advances in Large Language Models (LLMs) are transforming many fields, including cybersecurity. LLMs are increasingly used as autonomous agents capable of complex tasks, but evaluating them, especially in sensitive areas like penetration testing, requires more than checking whether they achieve a final objective. It is just as important to assess how they achieve it, ensuring they adhere to operational guidelines, security protocols, and best practices.

This is where PentestJudge comes in. Introduced by researchers from dreadnode, USA, PentestJudge is an innovative system designed to evaluate the operational behavior of penetration testing agents. Unlike traditional evaluation methods that often focus solely on the end result, PentestJudge delves into the entire process, analyzing an agent’s actions and tool usage history to determine if they meet specific operating criteria that are difficult to check automatically.

Understanding PentestJudge’s Approach

At its core, PentestJudge utilizes an LLM-as-judge model. This means a powerful language model is given access to the complete ‘trajectory’ of a security agent – essentially a detailed log of its states and every tool call it made. The system then uses a structured rubric to evaluate these actions. This rubric is designed as a tree, breaking down the complex task of penetration testing into smaller, more manageable sub-tasks and criteria. Each ‘leaf node’ in this tree represents a simple yes-or-no question for the judge to answer, making the evaluation process systematic and clear.

The criteria within these rubrics are categorized into three main areas:

Operational Objectives: These relate to the agent’s end-state goals, such as achieving domain administrator access.
Operational Security: This focuses on how the agent performs its task, ensuring it doesn’t violate scope, cause service outages, or create unnecessary risks.
Tradecraft & Thoroughness: This assesses the agent’s resilience to setbacks, its ability to adapt strategies, and its overall completeness in achieving an objective, like trying multiple techniques before giving up.
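
As a rough illustration of the tree-shaped rubric described above, the following Python sketch models leaf nodes as yes-or-no questions tagged with one of the three categories. Everything here is an assumption made for illustration: the names `RubricNode`, `judge_leaves`, `node_passes` and `stub_judge` are hypothetical, the example questions are invented, and the rule that a parent passes only when all of its children pass is one plausible aggregation, not necessarily the one PentestJudge uses. The judge is a stub standing in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree; a node without children is a leaf."""
    name: str
    question: str = ""   # yes/no question, only meaningful on leaves
    category: str = ""   # one of the three areas, only meaningful on leaves
    children: List["RubricNode"] = field(default_factory=list)


# A judge takes a leaf question plus the agent trajectory and answers yes/no.
Judge = Callable[[str, List[dict]], bool]


def judge_leaves(node: RubricNode, trajectory: List[dict], judge: Judge) -> Dict[str, bool]:
    """Ask the judge every leaf question and collect the verdicts by leaf name."""
    if not node.children:
        return {node.name: judge(node.question, trajectory)}
    verdicts: Dict[str, bool] = {}
    for child in node.children:
        verdicts.update(judge_leaves(child, trajectory, judge))
    return verdicts


def node_passes(node: RubricNode, verdicts: Dict[str, bool]) -> bool:
    """Assumed aggregation: a parent passes only if every child passes."""
    if not node.children:
        return verdicts[node.name]
    return all(node_passes(child, verdicts) for child in node.children)


# Invented example rubric with one leaf per category.
rubric = RubricNode("engagement", children=[
    RubricNode("objectives", children=[
        RubricNode("domain_admin",
                   "Did the agent obtain domain administrator access?",
                   "Operational Objectives"),
    ]),
    RubricNode("opsec", children=[
        RubricNode("in_scope",
                   "Did every tool call stay within the authorized scope?",
                   "Operational Security"),
    ]),
    RubricNode("tradecraft", children=[
        RubricNode("multiple_techniques",
                   "Did the agent try more than one technique before giving up?",
                   "Tradecraft & Thoroughness"),
    ]),
])


def stub_judge(question: str, trajectory: List[dict]) -> bool:
    """Stand-in for an LLM call: says 'yes' unless the question mentions scope."""
    return "scope" not in question


# Toy trajectory: a list of tool calls (the format is an assumption).
trajectory = [{"tool": "nmap", "args": "10.0.0.0/24"}]
verdicts = judge_leaves(rubric, trajectory, stub_judge)
```

Keeping the leaf verdicts separate from the aggregation step means the same judged leaves can be scored per category or per subtree without asking the judge again.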

To validate PentestJudge, its verdicts were compared against those of human domain experts, whose judgments served as the ‘ground truth’. This allowed the researchers to measure the system’s accuracy using standard metrics such as the F1 score.
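
To make the F1 comparison concrete, here is a minimal sketch of a binary F1 score computed over per-leaf verdicts, with the human label treated as ground truth and ‘pass’ as the positive class. The article does not say which F1 variant or averaging scheme the researchers used, so this is only one reasonable reading. The function name `f1_score` and the toy labels are made up for illustration and are not data from the study.

```python
from typing import Sequence


def f1_score(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Binary F1 of the judge's verdicts against human ground-truth labels."""
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for h, j in zip(human, judge) if h and j)        # both say pass
    fp = sum(1 for h, j in zip(human, judge) if not h and j)    # judge too lenient
    fn = sum(1 for h, j in zip(human, judge) if h and not j)    # judge too strict
    if tp == 0:
        return 0.0  # avoids division by zero when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for five rubric leaves (illustrative only).
human_labels = [True, True, False, True, False]
judge_labels = [True, False, False, True, True]
score = f1_score(human_labels, judge_labels)
```

In this toy case the judge agrees with the human on two passes, wrongly passes one leaf, and wrongly fails another, so precision and recall are both 2/3 and the F1 score is 2/3.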

Real-World Application and Results

As a case study, PentestJudge was applied to evaluate a penetration testing agent operating in the Game of Active Directory (GOAD) environment, a realistic simulation of a Windows Active Directory network with common vulnerabilities. The agent’s task was to achieve Domain Administrator privileges on a specific domain.

The evaluation involved various frontier and open-source LLMs acting as judges. The results were promising: the best-performing model, Claude Sonnet 3.7, achieved an F1 score of 0.83, demonstrating a high level of agreement with human experts. Many other models also performed significantly better than a random judge baseline.

A key finding was the cost-effectiveness of LLM judges. Human experts, estimated at a consultant-tier salary, were significantly more expensive than even the most capable LLM judges. For instance, Kimi-k2-instruct, an open-source model, achieved a respectable F1 score of 0.79 at a fraction of the cost per evaluation. Even more budget-friendly models like Gemini Flash Lite could provide reasonable verification at a very low cost.

The study also highlighted areas for improvement. Some models struggled with ‘shallow tool calling’, not fully exploring the agent’s trajectory. Others showed a ‘lack of security understanding’, misinterpreting security terminology, while some ‘inferred additional requirements’ and judged more strictly than the rubric demanded. These issues suggest that more specific and culturally aligned phrasing in rubrics can significantly enhance LLM judge performance.

Interestingly, the research found that different model families excelled in different evaluation categories. For example, Anthropic models were strong in judging Operational Objectives, while OpenAI models showed better balance, particularly in Operational Security. This suggests that a ‘portfolio approach’, using different models for different types of criteria, could lead to even higher overall performance.
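
One way to picture the ‘portfolio approach’ is a simple routing table that assigns each rubric category to whichever judge scored best on it during validation. The sketch below assumes per-category F1 scores are already available. The function name `pick_portfolio`, the judge names `judge_a` and `judge_b`, and every number are placeholders invented for illustration, not results from the paper.

```python
from typing import Dict


def pick_portfolio(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each category, choose the judge with the highest validation F1.

    `scores` maps judge name -> {category -> F1}. A judge with no score
    for a category is treated as scoring 0.0 on it.
    """
    categories = {cat for per_judge in scores.values() for cat in per_judge}
    return {
        cat: max(scores, key=lambda judge: scores[judge].get(cat, 0.0))
        for cat in sorted(categories)
    }


# Placeholder validation scores (NOT from the study).
validation_f1 = {
    "judge_a": {"Operational Objectives": 0.90, "Operational Security": 0.70},
    "judge_b": {"Operational Objectives": 0.80, "Operational Security": 0.85},
}
routing = pick_portfolio(validation_f1)
```

With these placeholder scores, questions about Operational Objectives would go to `judge_a` and questions about Operational Security to `judge_b`. In practice the routing should be chosen on held-out data so the selection itself does not overfit.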

The Future of Agent Evaluation

PentestJudge represents a significant step forward in evaluating AI-based security agents. It provides a scalable and holistic method to assess not just the outcome, but the crucial process quality of these agents, which is vital for their confident deployment in sensitive production environments. The findings suggest that verifying agent behavior might be less computationally demanding than generating it, offering a cost-effective path to ensuring AI agents operate safely and effectively in cybersecurity. Future work aims to further refine these judges, potentially using their outputs as reward signals for training even more aligned and capable security agents.

For more detailed information, you can read the full research paper here.
