TLDR: A study on Large Language Models (LLMs) and “test-time scaling” methods found that while Model Averaging and Majority Voting improve performance on NLP tasks with human annotation disagreements (LeWiDi-2025), the Best-of-N (BoN) sampling method, successful in math, struggles. This is likely due to LLMs generating vague reasoning steps and allocating less computational effort for these nuanced tasks, suggesting a gap in their training for interpretative variability.
Large Language Models (LLMs) have shown remarkable capabilities, especially when given extra computation time during inference, a technique known as test-time scaling. This approach has been particularly effective in domains requiring verifiable correct answers, such as mathematics and coding. However, a recent study by the BoN Appetit Team at LeWiDi-2025 explored how these methods fare in more subjective areas, specifically Natural Language Processing (NLP) tasks characterized by human annotation disagreements.
The research, titled “BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)” by Tomas Ruiz, Siyao Peng, Barbara Plank, and Carsten Schwemmer, delves into the challenges LLMs face when confronted with interpretative variability. Unlike traditional supervised learning, where each example has a single, fixed label, many NLP tasks involve substantial human disagreement, which the LeWiDi-2025 shared task aims to address.
The team experimented with three test-time scaling methods: Model Averaging and Majority Voting (both established benchmark algorithms), and a Best-of-N (BoN) sampling method. The LeWiDi tasks come in two main types: a Perspectivist task (predicting the labels of individual annotators) and a Soft-label task (predicting the distribution of human annotations, also known as a human judgment distribution).
A key finding was that Model Averaging and Majority Voting consistently improved LLM performance across all LeWiDi datasets. These methods effectively synthesize multiple LLM predictions into a single, more robust output. However, the Best-of-N (BoN) sampling method, which typically involves generating multiple solutions and using an LLM-as-a-judge to score and select the best one based on step-wise reasoning, did not yield similar positive results on the LeWiDi tasks. Its performance was often inconsistent or even worse than simple single-sample predictions.
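The three aggregation strategies described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. The function names, the representation of a soft label as a plain list of probabilities, and the `judge_score` callable standing in for an LLM-as-a-judge are all assumptions made for this example.

```python
from collections import Counter


def model_average(soft_labels):
    """Average N predicted distributions element-wise into one soft label.

    Assumes every prediction is a list of probabilities over the same classes.
    """
    n = len(soft_labels)
    num_classes = len(soft_labels[0])
    return [sum(p[i] for p in soft_labels) / n for i in range(num_classes)]


def majority_vote(labels):
    """Return the most frequent discrete label among N sampled predictions.

    Ties are resolved in favor of the label that was seen first.
    """
    return Counter(labels).most_common(1)[0][0]


def best_of_n(candidates, judge_score):
    """Return the candidate that the judge scores highest.

    `judge_score` is a hypothetical stand-in for an LLM-as-a-judge: any
    callable that maps a candidate to a number.
    """
    return max(candidates, key=judge_score)


# Three sampled soft labels for one item (made-up numbers for illustration).
samples = [[0.6, 0.4], [0.8, 0.2], [0.4, 0.6]]
averaged = model_average(samples)

# Three sampled discrete labels, as one might collect for a single annotator.
voted = majority_vote(["offensive", "not offensive", "offensive"])
```

Note the structural difference: averaging and voting use every sample, while BoN discards all but one and therefore depends entirely on how well the judge can rank candidates.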
To better understand these dynamics, the researchers introduced a new metric called “prediction diversity.” This metric quantifies the variability of soft-labels across multiple predictions for a single problem. They found a strong correlation between prediction diversity and problem difficulty: problems with higher diversity were generally harder. Interestingly, methods like Model Averaging and the theoretical “BoN oracle” (representing the best possible BoN performance) showed greater improvements on problems with higher prediction diversity, indicating that these methods are more beneficial when the model’s initial predictions are varied.
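To make the idea of prediction diversity concrete, here is one possible way to compute such a quantity. This summary does not give the exact formula used in the paper, so the choice below (mean pairwise total variation distance between sampled soft labels) is an assumption made purely for illustration, as are the function names.

```python
from itertools import combinations


def total_variation(p, q):
    """Total variation distance between two distributions over the same classes."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def prediction_diversity(soft_labels):
    """Mean pairwise distance between N sampled soft labels for one problem.

    Assumed stand-in for the paper's metric, not its actual definition.
    Returns 0.0 when fewer than two predictions are available.
    """
    pairs = list(combinations(soft_labels, 2))
    if not pairs:
        return 0.0
    return sum(total_variation(p, q) for p, q in pairs) / len(pairs)
```

Under this definition, identical predictions give a diversity of 0, and predictions that put all their mass on different classes give the maximum value of 1.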
The underperformance of BoN sampling in the LeWiDi tasks, especially given its success in mathematical domains, points to a challenge in “cross-domain generalization.” The authors suggest two primary reasons for this gap. Firstly, LLMs tend to generate vaguer Chain-of-Thought (CoT) steps when tackling LeWiDi tasks than when solving mathematical problems. This vagueness makes it difficult for a judge LLM to accurately discriminate between good and bad reasoning steps. The vagueness may stem from the fact that current reasoning LLMs are primarily post-trained on mathematical, coding, and general STEM problems, rather than on tasks requiring nuanced interpretative variation.
Secondly, the study observed that LLMs and their judges allocate a significantly lower “compute budget” (i.e., produce fewer tokens for reasoning) on LeWiDi tasks than on mathematical tasks. This difference in computational effort further supports the hypothesis that the models’ training biases them towards logical and mathematical reasoning, making them less adept at handling the subjective and diverse interpretations inherent in many NLP tasks.
In conclusion, while Model Averaging and Majority Voting prove effective for improving LLM performance on tasks with annotation disagreements, the BoN sampling method, successful in mathematics, struggles to transfer. This highlights a need for future LLM training to incorporate tasks that explicitly deal with interpretative variability and diverse perspectives to enhance their capabilities in nuanced NLP domains. The full research paper provides more technical details and experimental results.


