TLDR: JudgeAgent is a new framework for evaluating large language models (LLMs) that uses an “interviewer-style” approach. It dynamically assesses LLMs by first grading them on benchmarks, then interactively extending tests with difficulty-adaptive questions, and finally providing detailed feedback for improvement. This method helps identify precise knowledge and capability boundaries, offering more effective and targeted optimization suggestions than traditional static evaluations.
Evaluating the capabilities of large language models (LLMs) is a crucial step before they are widely deployed. Traditionally, LLMs are tested on predefined sets of questions, and their responses are then scored. This approach is simple, but it has several limitations: it involves little deep interaction with the model, offers limited control over the test’s difficulty, and makes it hard to confirm that the evaluation results are valid. As a result, it is hard to tell what an LLM actually knows and where its limits lie.
To overcome these challenges, a new framework called JudgeAgent has been proposed. JudgeAgent introduces an “interviewer-style” evaluation method that dynamically adapts to the LLM being tested and its knowledge base, much as a human interviewer probes a candidate’s knowledge and skills.
How JudgeAgent Works
JudgeAgent employs a comprehensive evaluation process with three main components:
1. Benchmark Grading: This is the initial step, similar to a preliminary written test. JudgeAgent first evaluates the target LLM on publicly available static datasets. Based on the model’s performance, it determines an approximate capability range, categorizing it into Easy, Medium, or Hard tiers. This initial assessment guides the difficulty of subsequent questions.
2. Interactive Extension: This is the core of JudgeAgent’s dynamic evaluation, acting as the “interview” itself. In this phase, JudgeAgent iteratively expands on the initial questions. It retrieves related knowledge from a knowledge base, using a context-graph-based approach to ensure both breadth and depth in the information. Crucially, it then generates new questions, dynamically adjusting their difficulty (Easy, Medium, or Hard) based on the LLM’s ongoing performance. This adaptive approach allows for a more precise understanding of the LLM’s knowledge boundaries.
3. Evaluation Feedback: After the benchmark grading and multiple rounds of interactive extension, JudgeAgent compiles all the test results. It then provides a comprehensive evaluation report, identifying specific deficiencies in the LLM’s semantic understanding, logical reasoning, or ability to capture details. Importantly, it also offers actionable suggestions for optimizing the model. JudgeAgent even includes a novel way to validate its own evaluation: by prompting the LLM to answer the same questions again after receiving these suggestions and comparing the accuracy before and after.
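The three stages above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the function names, the accuracy thresholds (0.4 and 0.7), and the "step up after a correct answer, step down after a wrong one" rule are all assumptions made for the example, since the summary does not specify how tiers are assigned or how difficulty is adjusted. Question generation and knowledge retrieval are left out entirely; the sketch only tracks difficulty levels and the before/after accuracy comparison used for validation.

```python
# Hypothetical sketch of JudgeAgent-style tiering, adaptive difficulty,
# and before/after validation. All names and thresholds are assumptions.

TIERS = ["Easy", "Medium", "Hard"]


def grade_tier(accuracy: float, low: float = 0.4, high: float = 0.7) -> str:
    """Map benchmark accuracy to a starting tier (thresholds are assumed)."""
    if accuracy < low:
        return "Easy"
    if accuracy < high:
        return "Medium"
    return "Hard"


def next_difficulty(current: str, answered_correctly: bool) -> str:
    """Move one tier up after a correct answer and one tier down after a
    wrong one, staying within the Easy..Hard range (assumed rule)."""
    i = TIERS.index(current)
    if answered_correctly:
        i = min(i + 1, len(TIERS) - 1)
    else:
        i = max(i - 1, 0)
    return TIERS[i]


def interview(start_tier: str, outcomes: list[bool]) -> list[str]:
    """Replay a sequence of correct/incorrect outcomes and return the
    difficulty that was asked in each round."""
    asked = []
    tier = start_tier
    for correct in outcomes:
        asked.append(tier)
        tier = next_difficulty(tier, correct)
    return asked


def accuracy_gain(before: list[bool], after: list[bool]) -> float:
    """Validation step: accuracy on the same questions after feedback
    minus accuracy before feedback."""
    if not before or len(before) != len(after):
        raise ValueError("before and after must be non-empty and equal length")
    return sum(after) / len(after) - sum(before) / len(before)


if __name__ == "__main__":
    start = grade_tier(0.55)
    print(start)
    print(interview(start, [True, True, False]))
    print(accuracy_gain([True, False, False, False], [True, True, True, False]))
```

In this sketch, a positive value from `accuracy_gain` would suggest that the feedback helped the model on the questions it was re-asked, which mirrors the self-validation idea described in the third step.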
Experimental Validation
The effectiveness of JudgeAgent was validated through extensive experiments using various datasets, including MedQA and MultiHop-RAG (which are knowledge-intensive) and QuALITY (which focuses on comprehension and reasoning). Several LLMs, such as Qwen3, GLM4-Flash, GPT-4.1, and Gemini-2.5-pro, were used as target models.
The results showed that JudgeAgent effectively identifies knowledge gaps in LLMs and helps mitigate them by providing targeted prompts. For datasets emphasizing reasoning, JudgeAgent offered valuable guidance to address shortcomings in logical reasoning and semantic understanding. The framework proved particularly beneficial for relatively weaker models and for questions of higher difficulty, demonstrating its ability to accurately assess capability boundaries and provide adaptive optimization.
Ablation studies, where different components of JudgeAgent were removed, further highlighted the importance of each module. The context graph, difficulty-adaptive mechanism, and interactive extension all played crucial roles in enhancing the evaluation’s effectiveness.
A case study illustrated JudgeAgent’s superiority over direct evaluation. When an LLM failed a medical question, a direct evaluator offered generic feedback. In contrast, JudgeAgent, through its knowledge-driven synthesis and adaptive questioning, pinpointed the exact knowledge gap (e.g., insufficient understanding of duodenal injury symptoms) and provided targeted guidance, enabling the LLM to answer correctly.
In conclusion, JudgeAgent offers a powerful and dynamic framework for evaluating LLMs, providing precise assessments of their knowledge and capabilities, along with targeted suggestions for improvement. This research marks a significant step towards more reliable and effective evaluation tools for LLMs. The full research paper is titled “JudgeAgent: Dynamically Evaluate LLMs with Agent-as-Interviewer.”


