
CompassJudger-2: Advancing LLM Evaluation with a New Generalist Judge Model

TLDR: CompassJudger-2 is a new generalist judge model designed to overcome the limitations of existing specialized and less robust LLM evaluators. It uses a unique training approach with diverse, verifiable data and a refined learning objective. The paper also introduces JudgerBenchV2, a comprehensive benchmark that uses a “Mix-of-Judgers” for ground truth and evaluates both judgment accuracy and ranking consistency. CompassJudger-2 demonstrates superior performance, competitive accuracy with much larger models, and strong critique generation capabilities, paving the way for more reliable LLM evaluation.

In the rapidly evolving field of artificial intelligence, large language models (LLMs) are becoming increasingly sophisticated, capable of everything from understanding complex language to generating creative text and code. As these models are deployed in real-world applications, ensuring the quality and accuracy of their responses is paramount. This is where “judge models” come into play, acting as evaluators for LLM outputs.

However, existing judge models often face limitations. They tend to be highly specialized, meaning they perform well only on specific types of tasks or datasets, and they can lack robustness, struggling with the diverse and often unpredictable nature of LLM outputs. This can hinder their ability to provide truly comprehensive evaluations.

A new research paper, “CompassJudger-2: Towards Generalist Judge Model Via Verifiable Rewards”, introduces CompassJudger-2, a novel approach designed to overcome these challenges. Developed by researchers Taolin Zhang, Maosong Cao, Alexander Lam, Songyang Zhang, and Kai Chen from Shanghai AI Laboratory and Tsinghua University, CompassJudger-2 aims to be a more versatile and robust judge model.

A New Approach to Training Judge Models

The core of CompassJudger-2’s innovation lies in its training methodology. The researchers adopted a “task-driven, multi-domain data curation strategy.” This means they carefully collected and prepared a wide variety of data from different sources and across various domains to train the model. They focused on “verifiable rewards,” which essentially means guiding the model’s learning with clear, objective signals about the correctness of its judgments. This helps the model develop intrinsic critical reasoning abilities.

The data pipeline for CompassJudger-2 is quite sophisticated, involving both "data curation" and "data synthesis." For data curation, the researchers gathered existing public judge and reward datasets, and rectified outdated judgments by using a powerful LLM (Qwen2.5-72B-Instruct) to reconstruct the judgments and verify their correctness. To enhance diversity, they replaced the original prompt templates with templates drawn from various subjective evaluation datasets.

Data synthesis involved creating new data from “knowledge-based datasets” (like MMLU and GSM8K) and “chat-based datasets.” For knowledge-based data, they used an LLM to evaluate model outputs and generate detailed rationales, retaining only the verified correct evaluations. For chat-based data, they generated contrasting response pairs and had an LLM select the superior one based on style requirements, creating style-sensitive judgment data.
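As a rough illustration of that "retain only the verified correct evaluations" step, here is a minimal Python sketch. The record fields and the function name are assumptions made for illustration; the paper does not publish this exact code:

```python
def filter_verified(records: list[dict]) -> list[dict]:
    """Keep synthesized judgments whose verdict agrees with the known answer.

    Each record is assumed to carry the LLM judge's verdict, its generated
    rationale, and the verifiable ground-truth label that knowledge-based
    datasets such as GSM8K or MMLU provide.
    """
    return [r for r in records if r["verdict"] == r["ground_truth"]]

# Toy records standing in for synthesized judgment data.
records = [
    {"verdict": "correct", "rationale": "...", "ground_truth": "correct"},
    {"verdict": "incorrect", "rationale": "...", "ground_truth": "correct"},
]
kept = filter_verified(records)
print(len(kept))  # prints: 1
```

Only judgments that survive this check enter the training set, so the synthesized rationales are anchored to verifiably correct conclusions.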

Enhancing Judgment Accuracy with Verifiable Rewards

A key aspect of CompassJudger-2’s training is the integration of verifiable rewards through a technique called “policy gradient optimization” combined with “rejection sampling.” This process guides the model to think critically before making a judgment. The model is prompted to analyze the user’s demand, identify strengths and weaknesses of different model responses, reason based on this analysis, and then make a final prediction.

The “verifiable reward” acts as a clear signal: if the model’s prediction matches the ground truth, it gets a reward of 1; otherwise, it gets 0. This signal is used to optimize the model. Rejection sampling further refines this by generating multiple response candidates and filtering out those that don’t match the ground truth, ensuring that the model learns from high-quality examples and improves its generalization ability.
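The reward signal and rejection-sampling filter described above can be sketched as follows. This is a simplified toy version: `generate_judgment` stands in for sampling a candidate judgment from the judge model, and the function names are illustrative, not from the paper:

```python
import random

def verifiable_reward(prediction: str, ground_truth: str) -> int:
    # Binary verifiable reward: 1 if the judge's final verdict
    # matches the ground truth, 0 otherwise.
    return 1 if prediction == ground_truth else 0

def rejection_sample(generate_judgment, ground_truth: str, k: int = 8) -> list[str]:
    """Draw k candidate judgments and keep only those earning reward 1."""
    candidates = [generate_judgment() for _ in range(k)]
    return [c for c in candidates if verifiable_reward(c, ground_truth) == 1]

# Toy generator standing in for sampling from the judge model.
mock_judge = lambda: random.choice(["A", "B"])
kept = rejection_sample(mock_judge, ground_truth="A", k=8)
assert all(c == "A" for c in kept)  # only matching candidates survive
```

In the actual training loop, the surviving high-quality judgments (and the binary reward itself, via policy gradient optimization) would be used to update the model; that optimization step is omitted here.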

Introducing JudgerBenchV2: A Comprehensive Evaluation Benchmark

Recognizing the limitations of existing benchmarks for evaluating judge models, the researchers also introduced “JudgerBenchV2.” This new benchmark is designed to be more robust and comprehensive. It includes 10,000 questions across 10 different scenarios, covering a wide range of judging capabilities.

A notable feature of JudgerBenchV2 is its use of a “Mix-of-Judgers (MoJ)” consensus as ground truth. Instead of relying on a single human or model, it leverages judgments from several high-performing LLMs (DeepSeek-R1, DeepSeek-v3-0324, and Qwen3-235B-A22B), taking their majority consensus as the true answer. This helps mitigate bias from a single source.

Furthermore, JudgerBenchV2 introduces new metrics that assess both “sample-level accuracy” (how often the judge model agrees with the ground truth on individual judgments) and “model-level rank consistency” (how well the judge model’s overall ranking of LLMs aligns with the ground truth ranking). This provides a more nuanced and reliable evaluation of judge models.
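The two metrics could be approximated as follows, using simple agreement for sample-level accuracy and the standard Spearman formula (valid only for tie-free rankings) for model-level rank consistency. The paper's exact metric definitions may differ:

```python
def sample_accuracy(judge: list[str], gold: list[str]) -> float:
    """Fraction of individual judgments that agree with the ground truth."""
    return sum(j == g for j, g in zip(judge, gold)) / len(gold)

def rank_consistency(scores_judge: dict[str, float],
                     scores_gold: dict[str, float]) -> float:
    """Spearman correlation between the judge's model ranking and the gold one.

    Uses the closed-form 1 - 6*sum(d^2)/(n*(n^2-1)), which assumes no ties.
    """
    models = sorted(scores_judge)

    def ranks(scores: dict[str, float]) -> dict[str, int]:
        order = sorted(models, key=lambda m: -scores[m])
        return {m: i for i, m in enumerate(order)}

    rj, rg = ranks(scores_judge), ranks(scores_gold)
    n = len(models)
    d2 = sum((rj[m] - rg[m]) ** 2 for m in models)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Perfectly consistent rankings yield 1.0; fully reversed ones yield -1.0.
print(rank_consistency({"m1": 0.9, "m2": 0.5, "m3": 0.1},
                       {"m1": 0.8, "m2": 0.6, "m3": 0.2}))  # prints: 1.0
```

Reporting both numbers matters because a judge can agree with the ground truth on most individual samples yet still rank the models under evaluation in the wrong order, or vice versa.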


Impressive Performance and Future Outlook

Empirical results show that CompassJudger-2 achieves superior performance across multiple judge and reward benchmarks. The 7B model, for instance, demonstrates competitive judgment accuracy even when compared to significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. It also maintains strong performance on general objective and subjective benchmarks, suggesting a positive correlation between judging ability and general language understanding.

Beyond just judging, CompassJudger-2 also shows a strong “critique ability.” When used to generate analyses of model responses, it provides high-quality critiques that can actually help other LLMs improve their outputs. This highlights its potential for enhancing training performance during model iterations.

While CompassJudger-2 represents a significant step forward, the researchers acknowledge some limitations, such as the inference costs associated with rejection sampling and the potential for hallucinations when synthesizing data. However, CompassJudger-2 paves the way for more adaptable, interpretable, and efficient judge services, promising to advance the evaluation and improvement of LLMs in real-world applications.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
