
The Case for Deliberate AI: Enhancing LLM Judging with Explicit Reasoning

TL;DR: A systematic study comparing "thinking" (explicit-reasoning) and "non-thinking" Large Language Models (LLMs) as automated judges found that thinking models significantly outperform non-thinking ones in accuracy, efficiency, and robustness to bias, even in multilingual settings. Augmentations for non-thinking models exist, but they are less effective and more computationally expensive. The research makes the case that explicit reasoning is a superior, low-cost strategy for reliable LLM evaluation, though model capacity and specialized rubrics for tasks like safety remain important considerations.

Large Language Models (LLMs) are increasingly becoming the go-to solution for automated evaluation in various applications, from benchmarking to reward modeling. This shift, known as the LLM-as-a-judge paradigm, offers scalable and adaptable assessments. However, the reliability of these judgments isn’t just about the model’s size; it also depends on how the model processes information internally.

A recent systematic study, titled Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness, delves into the differences between “thinking” and “non-thinking” LLMs in this evaluation role. Thinking models are those that generate clear, intermediate reasoning steps before arriving at a final decision, much like a human thinking aloud. Non-thinking models, on the other hand, produce a verdict directly without showing their internal thought process.
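
To make the distinction concrete, here is a minimal Python sketch of the two prompting styles. The `generate()` function and the prompt wording are illustrative assumptions for this article, not the paper's actual setup.

```python
# A minimal sketch of the two judging modes. generate() is a hypothetical
# stand-in for whatever serves the model; the prompt wording is illustrative.

def generate(prompt: str) -> str:
    """Placeholder for a model inference call; wire this to your own backend."""
    raise NotImplementedError("connect to an LLM inference API")

def judge_non_thinking(question: str, answer_a: str, answer_b: str) -> str:
    """Direct-verdict judging: the model must answer immediately."""
    prompt = (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer is better? Reply with exactly one letter: A or B."
    )
    return generate(prompt).strip()

def judge_thinking(question: str, answer_a: str, answer_b: str) -> str:
    """Explicit-reasoning judging: the model reasons step by step first."""
    prompt = (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Compare the answers step by step for correctness, completeness, and "
        "clarity. Then write 'Verdict: A' or 'Verdict: B' on the final line."
    )
    reply = generate(prompt)
    # Keep only the final decision, discarding the intermediate reasoning.
    return reply.rsplit("Verdict:", 1)[-1].strip()
```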

The research, conducted by Pratik Jayarao, Himanshu Gupta, Neeraj Varshney, and Chaitanya Dwivedi, used open-source Qwen3 models of three sizes (0.6B, 1.7B, and 4B parameters). They evaluated these models on RewardBench tasks, measuring accuracy, computational efficiency (in FLOPs), and robustness to a range of biases. The study also explored augmentation strategies for non-thinking models, such as in-context learning, rubric-guided judging, reference-based evaluation, and n-best aggregation.
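
Of these augmentations, n-best aggregation is the easiest to picture: sample the judge several times and take a majority vote. The sketch below reuses the hypothetical `judge_non_thinking()` helper from the previous snippet; it is one plausible reading of the strategy, not the paper's exact recipe.

```python
from collections import Counter

def judge_n_best(question: str, answer_a: str, answer_b: str, n: int = 5) -> str:
    """n-best aggregation: sample the judge n times and majority-vote.

    Reuses the hypothetical judge_non_thinking() from the sketch above.
    Sampling only helps if decoding is stochastic (temperature > 0), and
    each sample is a full forward pass, so cost grows linearly with n.
    """
    verdicts = [judge_non_thinking(question, answer_a, answer_b) for _ in range(n)]
    winner, _count = Counter(verdicts).most_common(1)[0]
    return winner
```

The linear cost in n is worth noting here, since it is exactly the kind of overhead the study weighs against simply enabling explicit reasoning.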

Key Findings: Thinking Models Lead the Way

The results strongly favor thinking models. They achieved roughly 10 percentage points higher accuracy with minimal computational overhead, typically under twice the FLOPs of their non-thinking counterparts. This stands in stark contrast to augmentation strategies such as few-shot learning, which delivered only modest accuracy gains at a much higher cost (over 8 times the FLOPs).

Beyond accuracy, thinking models demonstrated superior robustness. They stayed significantly more consistent under a range of bias conditions, including positional, bandwagon, identity, diversity, and random biases, improving consistency by about 6% on average. This suggests that explicit reasoning helps models disregard superficial cues and focus on the substantive quality of the content, yielding more principled, less biased evaluations.
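
Positional bias, for example, is typically probed by swapping the order of the candidate answers and checking whether the verdict follows the content rather than the position. A small illustrative harness, not the paper's exact protocol, might look like this:

```python
def positionally_consistent(judge, question: str, answer_a: str, answer_b: str) -> bool:
    """Return True if the judge picks the same underlying answer regardless
    of presentation order. `judge` is any function with the signature used
    in the earlier sketches; this harness is illustrative only.
    """
    first = judge(question, answer_a, answer_b)    # original order
    second = judge(question, answer_b, answer_a)   # order swapped
    # Map the swapped-run verdict back to the original labeling.
    swapped_back = {"A": "B", "B": "A"}.get(second, second)
    return first == swapped_back
```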

The benefits of explicit reasoning were not limited to English. The study extended its experiments to a multilingual setting using M-RewardBench, confirming that thinking models also excel in diverse language contexts, with an average multilingual evaluation score gain of 8.88 points.

When Smaller Models Struggle and Rubrics Shine

The research also highlighted a crucial point about model capacity: a baseline level of capability is essential for reliable judging. The smallest model tested, Qwen3 0.6B, struggled significantly on challenging tasks such as "Chat Hard" and "Safety," at times performing worse than random selection. This underscores the risks of deploying very small LLMs as automated judges, even with explicit reasoning enabled.

Interestingly, while thinking models generally came out ahead, a key exception emerged in the "Safety" category. Here, rubric-based prompting consistently achieved higher accuracy across all model scales. This is attributed to the policy-driven nature of safety evaluation, which rewards adherence to specific, nuanced criteria spelled out in a structured rubric. For such specialized domains, a well-defined rubric can guide the model more effectively than open-ended generative reasoning alone.
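
As a rough illustration of rubric-guided judging, the sketch below embeds explicit criteria in the prompt. The rubric text is an invented example and `generate()` is the same hypothetical stand-in used earlier; the paper's actual safety rubric is not reproduced here.

```python
# Illustrative rubric; the paper's actual safety criteria are not shown here.
SAFETY_RUBRIC = (
    "1. Refuses or safely deflects requests for harmful content.\n"
    "2. Provides no actionable dangerous instructions.\n"
    "3. Stays helpful on benign requests instead of over-refusing.\n"
)

def judge_safety_with_rubric(question: str, answer_a: str, answer_b: str) -> str:
    """Rubric-guided judging: assess each answer against explicit criteria
    before choosing. generate() is the hypothetical call defined earlier."""
    prompt = (
        f"Rubric:\n{SAFETY_RUBRIC}\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Assess each answer against every rubric item, then write "
        "'Verdict: A' or 'Verdict: B' on the final line."
    )
    reply = generate(prompt)
    return reply.rsplit("Verdict:", 1)[-1].strip()
```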

Implications for the Future of AI Evaluation

The findings of this study have important implications for both practitioners and researchers. For those deploying LLMs as judges, enabling an explicit reasoning mode appears to be a low-cost, high-reward strategy for enhancing the quality of automated evaluations and reward modeling pipelines. It offers a more efficient alternative to computationally intensive methods for improving the performance of smaller models.

For the broader research community, this work adds to the growing evidence that optimizing inference-time computation can matter as much as, or more than, simply scaling up model parameters. This paves the way for more efficient and accessible state-of-the-art models, suggesting a future where AI judges are not only powerful but also transparent, fair, and computationally sensible.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
