
CompassJudger-2: Advancing LLM Evaluation with a New Generalist Judge Model

TLDR: CompassJudger-2 is a new generalist judge model designed to overcome the limitations of existing specialized and less robust LLM evaluators. It uses a unique training approach with diverse, verifiable data and a refined learning objective. The paper also introduces JudgerBenchV2, a comprehensive benchmark that uses a “Mix-of-Judgers” for ground truth and evaluates both judgment accuracy and ranking consistency. CompassJudger-2 demonstrates superior performance, competitive accuracy with much larger models, and strong critique generation capabilities, paving the way for more reliable LLM evaluation.

In the rapidly evolving field of artificial intelligence, large language models (LLMs) are becoming increasingly sophisticated, capable of everything from understanding complex language to generating creative text and code. As these models are deployed in real-world applications, ensuring the quality and accuracy of their responses is paramount. This is where “judge models” come into play, acting as evaluators for LLM outputs.

However, existing judge models often face limitations. They tend to be highly specialized, meaning they perform well only on specific types of tasks or datasets, and they can lack robustness, struggling with the diverse and often unpredictable nature of LLM outputs. This can hinder their ability to provide truly comprehensive evaluations.

A new research paper, “CompassJudger-2: Towards Generalist Judge Model Via Verifiable Rewards”, introduces CompassJudger-2, a novel approach designed to overcome these challenges. Developed by researchers Taolin Zhang, Maosong Cao, Alexander Lam, Songyang Zhang, and Kai Chen from Shanghai AI Laboratory and Tsinghua University, CompassJudger-2 aims to be a more versatile and robust judge model.

A New Approach to Training Judge Models

The core of CompassJudger-2’s innovation lies in its training methodology. The researchers adopted a “task-driven, multi-domain data curation strategy.” This means they carefully collected and prepared a wide variety of data from different sources and across various domains to train the model. They focused on “verifiable rewards,” which essentially means guiding the model’s learning with clear, objective signals about the correctness of its judgments. This helps the model develop intrinsic critical reasoning abilities.

The data pipeline for CompassJudger-2 is quite sophisticated, involving both "data curation" and "data synthesis." For data curation, the researchers gathered existing public judge and reward datasets, and rectified outdated judgments by using a powerful LLM (Qwen2.5-72B-Instruct) to reconstruct the judgments and verify their correctness. To enhance diversity, they replaced the original prompt templates with templates drawn from various subjective evaluation datasets.

Data synthesis involved creating new data from “knowledge-based datasets” (like MMLU and GSM8K) and “chat-based datasets.” For knowledge-based data, they used an LLM to evaluate model outputs and generate detailed rationales, retaining only the verified correct evaluations. For chat-based data, they generated contrasting response pairs and had an LLM select the superior one based on style requirements, creating style-sensitive judgment data.
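As a rough illustration of that "retain only the verified correct evaluations" step, here is a minimal Python sketch. The record fields and the function name are assumptions made for illustration; the paper does not publish this exact code:

```python
def filter_verified(records: list[dict]) -> list[dict]:
    """Keep synthesized judgments whose verdict agrees with the known answer.

    Each record is assumed to carry the LLM judge's verdict, its generated
    rationale, and the verifiable ground-truth label that knowledge-based
    datasets such as GSM8K or MMLU provide.
    """
    return [r for r in records if r["verdict"] == r["ground_truth"]]

# Toy records standing in for synthesized judgment data.
records = [
    {"verdict": "correct", "rationale": "...", "ground_truth": "correct"},
    {"verdict": "incorrect", "rationale": "...", "ground_truth": "correct"},
]
kept = filter_verified(records)
print(len(kept))  # prints: 1
```

Only judgments that survive this check enter the training set, so the synthesized rationales are anchored to verifiably correct conclusions.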

Enhancing Judgment Accuracy with Verifiable Rewards

A key aspect of CompassJudger-2’s training is the integration of verifiable rewards through a technique called “policy gradient optimization” combined with “rejection sampling.” This process guides the model to think critically before making a judgment. The model is prompted to analyze the user’s demand, identify strengths and weaknesses of different model responses, reason based on this analysis, and then make a final prediction.

The “verifiable reward” acts as a clear signal: if the model’s prediction matches the ground truth, it gets a reward of 1; otherwise, it gets 0. This signal is used to optimize the model. Rejection sampling further refines this by generating multiple response candidates and filtering out those that don’t match the ground truth, ensuring that the model learns from high-quality examples and improves its generalization ability.
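The reward signal and rejection-sampling filter described above can be sketched as follows. This is a simplified toy version: `generate_judgment` stands in for sampling a candidate judgment from the judge model, and the function names are illustrative, not from the paper:

```python
import random

def verifiable_reward(prediction: str, ground_truth: str) -> int:
    # Binary verifiable reward: 1 if the judge's final verdict
    # matches the ground truth, 0 otherwise.
    return 1 if prediction == ground_truth else 0

def rejection_sample(generate_judgment, ground_truth: str, k: int = 8) -> list[str]:
    """Draw k candidate judgments and keep only those earning reward 1."""
    candidates = [generate_judgment() for _ in range(k)]
    return [c for c in candidates if verifiable_reward(c, ground_truth) == 1]

# Toy generator standing in for sampling from the judge model.
mock_judge = lambda: random.choice(["A", "B"])
kept = rejection_sample(mock_judge, ground_truth="A", k=8)
assert all(c == "A" for c in kept)  # only matching candidates survive
```

In the actual training loop, the surviving high-quality judgments (and the binary reward itself, via policy gradient optimization) would be used to update the model; that optimization step is omitted here.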

Introducing JudgerBenchV2: A Comprehensive Evaluation Benchmark

Recognizing the limitations of existing benchmarks for evaluating judge models, the researchers also introduced “JudgerBenchV2.” This new benchmark is designed to be more robust and comprehensive. It includes 10,000 questions across 10 different scenarios, covering a wide range of judging capabilities.

A notable feature of JudgerBenchV2 is its use of a “Mix-of-Judgers (MoJ)” consensus as ground truth. Instead of relying on a single human or model, it leverages judgments from several high-performing LLMs (DeepSeek-R1, DeepSeek-v3-0324, and Qwen3-235B-A22B), taking their majority consensus as the true answer. This helps mitigate bias from a single source.

Furthermore, JudgerBenchV2 introduces new metrics that assess both “sample-level accuracy” (how often the judge model agrees with the ground truth on individual judgments) and “model-level rank consistency” (how well the judge model’s overall ranking of LLMs aligns with the ground truth ranking). This provides a more nuanced and reliable evaluation of judge models.
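The two metrics could be approximated as follows, using simple agreement for sample-level accuracy and the standard Spearman formula (valid only for tie-free rankings) for model-level rank consistency. The paper's exact metric definitions may differ:

```python
def sample_accuracy(judge: list[str], gold: list[str]) -> float:
    """Fraction of individual judgments that agree with the ground truth."""
    return sum(j == g for j, g in zip(judge, gold)) / len(gold)

def rank_consistency(scores_judge: dict[str, float],
                     scores_gold: dict[str, float]) -> float:
    """Spearman correlation between the judge's model ranking and the gold one.

    Uses the closed-form 1 - 6*sum(d^2)/(n*(n^2-1)), which assumes no ties.
    """
    models = sorted(scores_judge)

    def ranks(scores: dict[str, float]) -> dict[str, int]:
        order = sorted(models, key=lambda m: -scores[m])
        return {m: i for i, m in enumerate(order)}

    rj, rg = ranks(scores_judge), ranks(scores_gold)
    n = len(models)
    d2 = sum((rj[m] - rg[m]) ** 2 for m in models)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Perfectly consistent rankings yield 1.0; fully reversed ones yield -1.0.
print(rank_consistency({"m1": 0.9, "m2": 0.5, "m3": 0.1},
                       {"m1": 0.8, "m2": 0.6, "m3": 0.2}))  # prints: 1.0
```

Reporting both numbers matters because a judge can agree with the ground truth on most individual samples yet still rank the models under evaluation in the wrong order, or vice versa.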


Impressive Performance and Future Outlook

Empirical results show that CompassJudger-2 achieves superior performance across multiple judge and reward benchmarks. The 7B model, for instance, demonstrates competitive judgment accuracy even when compared to significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. It also maintains strong performance on general objective and subjective benchmarks, suggesting a positive correlation between judging ability and general language understanding.

Beyond just judging, CompassJudger-2 also shows a strong “critique ability.” When used to generate analyses of model responses, it provides high-quality critiques that can actually help other LLMs improve their outputs. This highlights its potential for enhancing training performance during model iterations.

While CompassJudger-2 represents a significant step forward, the researchers acknowledge some limitations, such as the inference costs associated with rejection sampling and the potential for hallucinations when synthesizing data. However, CompassJudger-2 paves the way for more adaptable, interpretable, and efficient judge services, promising to advance the evaluation and improvement of LLMs in real-world applications.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
