LLMs Grapple with Human Disagreement: A Look at Test-Time Scaling in Nuanced NLP Tasks

TLDR: A study of “test-time scaling” methods for Large Language Models (LLMs) found that Model Averaging and Majority Voting improve performance on the LeWiDi-2025 NLP tasks, which feature human annotation disagreements, while the Best-of-N (BoN) sampling method, successful in math, struggles. The likely causes are that LLMs generate vaguer reasoning steps and spend less computational effort on these nuanced tasks, which suggests a gap in their training for interpretative variability.

Large Language Models (LLMs) have shown remarkable capabilities, especially when given extra computation time during inference, a technique known as test-time scaling. This approach has been particularly effective in domains requiring verifiable correct answers, such as mathematics and coding. However, a recent study by the BoN Appetit Team at LeWiDi-2025 explored how these methods fare in more subjective areas, specifically Natural Language Processing (NLP) tasks characterized by human annotation disagreements.

The research, titled “BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)” by Tomas Ruiz, Siyao Peng, Barbara Plank, and Carsten Schwemmer, delves into the challenges LLMs face when confronted with interpretative variability. Unlike traditional supervised learning, where each example has a single, fixed label, many NLP tasks involve substantial human disagreement, which the LeWiDi-2025 shared task aims to address.

The team experimented with three test-time scaling methods: Model Averaging and Majority Voting (two established benchmark algorithms), and a Best-of-N (BoN) sampling method. The LeWiDi shared task has two main task types: a Perspectivist task (predicting individual annotator labels) and a Soft-label task (predicting the distribution of human annotations, also known as a human judgment distribution).
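
As a rough illustration of the two benchmark methods, the sketch below averages several sampled soft labels and takes a majority vote over several sampled discrete labels. It assumes that averaging is applied to label distributions and voting to single labels; the function names `model_average` and `majority_vote` and the sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def model_average(soft_labels):
    """Average several predicted label distributions element-wise."""
    n = len(soft_labels)
    num_classes = len(soft_labels[0])
    return [sum(p[k] for p in soft_labels) / n for k in range(num_classes)]


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical data: three sampled predictions for one item.
samples = [[0.5, 0.5], [0.75, 0.25], [1.0, 0.0]]
averaged = model_average(samples)

# Hypothetical data: three sampled labels for one annotator.
vote = majority_vote(["offensive", "not offensive", "offensive"])
```

Both functions reduce many samples to one output, which is why extra samples (more test-time compute) can smooth out the noise of any single prediction.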

A key finding was that Model Averaging and Majority Voting consistently improved LLM performance across all LeWiDi datasets. These methods effectively synthesize multiple LLM predictions into a single, more robust output. However, the Best-of-N (BoN) sampling method, which typically involves generating multiple solutions and using an LLM-as-a-judge to score and select the best one based on step-wise reasoning, did not yield similar positive results on the LeWiDi tasks. Its performance was often inconsistent or even worse than simple single-sample predictions.
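
The selection step of BoN sampling can be sketched as follows. This is a minimal sketch under stated assumptions: each candidate carries a list of reasoning steps and a prediction, the judge returns one score per step, and step scores are combined by a simple mean. The names `best_of_n` and `toy_judge` are hypothetical, and `toy_judge` is a keyword heuristic standing in for a real LLM-as-a-judge call; the paper's actual scoring and aggregation may differ.

```python
def best_of_n(candidates, judge_score):
    """Return the candidate whose reasoning steps get the highest mean score."""

    def mean_step_score(candidate):
        steps = candidate["steps"]
        return sum(judge_score(step) for step in steps) / len(steps)

    return max(candidates, key=mean_step_score)


def toy_judge(step):
    # Hypothetical stand-in for an LLM judge: it rewards steps that give
    # an explicit reason and gives no credit to vague steps.
    return 1.0 if "because" in step else 0.0


# Hypothetical candidates: N = 2 sampled solutions for one item.
candidates = [
    {"steps": ["The text seems negative."], "prediction": [0.5, 0.5]},
    {
        "steps": ["Annotators may split because the sarcasm is ambiguous."],
        "prediction": [0.6, 0.4],
    },
]
best = best_of_n(candidates, toy_judge)
```

The quality of the selection depends entirely on the judge: if it cannot tell good steps from bad ones, `max` picks a candidate more or less at random.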

To better understand these dynamics, the researchers introduced a new metric called “prediction diversity.” This metric quantifies the variability of soft-labels across multiple predictions for a single problem. They found a strong correlation between prediction diversity and problem difficulty: problems with higher diversity were generally harder. Interestingly, methods like Model Averaging and the theoretical “BoN oracle” (representing the best possible BoN performance) showed greater improvements on problems with higher prediction diversity, indicating that these methods are more beneficial when the model’s initial predictions are varied.
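
One way to picture such a metric is the mean pairwise distance between the soft labels sampled for a single problem. The sketch below uses Manhattan distance; both the distance and the averaging scheme are assumptions made for illustration, since the article does not give the paper's exact formula, and the function names are hypothetical.

```python
from itertools import combinations


def manhattan(p, q):
    """Sum of absolute differences between two label distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def prediction_diversity(soft_labels):
    """Mean pairwise Manhattan distance between sampled soft labels.

    Returns 0.0 when fewer than two predictions are available.
    """
    pairs = list(combinations(soft_labels, 2))
    if not pairs:
        return 0.0
    return sum(manhattan(p, q) for p, q in pairs) / len(pairs)


# Hypothetical samples: identical predictions versus opposed predictions.
low = prediction_diversity([[0.5, 0.5], [0.5, 0.5]])
high = prediction_diversity([[1.0, 0.0], [0.0, 1.0]])
```

Under this definition a problem on which all samples agree scores zero, and the score grows as the sampled distributions move apart.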

The underperformance of BoN sampling in the LeWiDi tasks, especially given its success in mathematical domains, points to a challenge in “cross-domain generalization.” The authors suggest two primary reasons for this gap. Firstly, LLMs tend to generate vaguer Chain-of-Thought (CoT) steps when tackling LeWiDi tasks than when solving mathematical problems. This vagueness makes it difficult for a judge LLM to accurately discriminate between good and bad reasoning steps. The vagueness itself may arise because current reasoning LLMs are primarily post-trained on mathematical, coding, and general STEM problems, rather than on tasks requiring nuanced interpretative variation.

Secondly, the study observed that LLMs and their judges allocate a significantly lower “compute budget” (i.e., produce fewer tokens for reasoning) on LeWiDi tasks than on mathematical tasks. This difference in computational effort further supports the hypothesis that the models’ training biases them towards logical and mathematical reasoning, making them less adept at handling the subjective and diverse interpretations inherent in many NLP tasks.

In conclusion, while Model Averaging and Majority Voting prove effective for improving LLM performance on tasks with annotation disagreements, the BoN sampling method, successful in mathematics, struggles to transfer. This highlights a need for future LLM training to incorporate tasks that explicitly deal with interpretative variability and diverse perspectives to enhance their capabilities in nuanced NLP domains. The full research paper provides more technical details and experimental results.
