Opt-ICL: A Winning Strategy for Modeling Human Disagreement in NLP

TLDR: The Opt-ICL system, developed by Taylor Sorensen and Yejin Choi, won the LeWiDi-2025 competition by effectively modeling human variation and disagreement in NLP tasks. It leverages large language models’ in-context learning abilities through a two-step meta-learning process: post-training (Spectrum Tuning) and dataset-specific fine-tuning. Key findings show that including rater examples in-context is crucial, while demographics had less impact. The system’s performance also scales with model size, though specialized training remains vital.

In the evolving landscape of Natural Language Processing (NLP), tasks often grapple with subjectivity, ambiguity, and genuine disagreement among human annotators. Traditionally, this disagreement was seen as ‘noise’ to be eliminated, often attributed to unclear instructions or faulty data. However, a new perspective is emerging: disagreement can be a valuable ‘signal’ of diverse interpretations and human variation, crucial for building robust and uncertainty-calibrated AI systems.

Addressing this challenge, the Learning With Disagreements (LeWiDi) competition was established to inspire methods for integrating human variation into AI evaluation and modeling. The competition featured two main tasks: a ‘perspectivist’ task, aiming to predict an individual annotator’s rating, and a ‘soft label’ task, focused on predicting the distribution of labels from a pool of annotators.

A system named Opt-ICL (Optimizing In-Context Learning), developed by Taylor Sorensen from the University of Washington and Yejin Choi from Stanford University, emerged as the overall winner in both LeWiDi-2025 tasks. Their approach takes a fully perspectivist stance, predicting individual annotator responses and then aggregating these into a distribution for the soft task. The core of Opt-ICL lies in leveraging the in-context learning abilities of large language models (LLMs), specifically the google/gemma-3-12b-pt model.
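The aggregation step of this perspectivist approach is straightforward in principle: predict a label for each individual annotator, then pool those predictions into a distribution for the soft-label task. A minimal sketch of that pooling (the function name and label format are illustrative, not from the paper):

```python
from collections import Counter

def aggregate_soft_label(annotator_predictions, labels):
    """Pool per-annotator predicted labels into a soft label
    distribution over the full label set."""
    counts = Counter(annotator_predictions)
    total = len(annotator_predictions)
    return {label: counts.get(label, 0) / total for label in labels}

# e.g. five annotators' predicted sarcasm judgments
dist = aggregate_soft_label(["yes", "yes", "no", "yes", "no"], ["yes", "no"])
# dist == {"yes": 0.6, "no": 0.4}
```

In practice the per-annotator predictions could themselves be probability distributions rather than hard labels, in which case averaging the distributions would replace the counting shown here.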

How Opt-ICL Works: A Three-Component System

The Opt-ICL system is built upon three key components:

1. Spectrum Tuning (SpecT): This involves post-training an autoregressive LLM on a vast collection of over 40 datasets that exhibit human variation, stochasticity, or epistemic uncertainty. This step enhances the model’s in-context learning capabilities and teaches it a unified prompt format.

2. Dataset-Specific Fine-Tuning: After Spectrum Tuning, the model undergoes further specialization on the particular dataset of interest. This fine-tuning uses in-context demonstrations from each rater, essentially meta-learning how best to adapt to individual rater examples.
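One plausible way to build such fine-tuning data is a leave-one-out construction per rater: hold out each of a rater's annotations as the prediction target, with their remaining annotations serving as in-context demonstrations. The sketch below assumes this construction; the paper's exact data pipeline may differ.

```python
def make_finetuning_examples(rater_to_items, k_context):
    """For each rater, hold out one annotated item as the target and
    place up to k_context of their other items in-context, so the model
    meta-learns to adapt to a rater from their example ratings."""
    examples = []
    for rater, items in rater_to_items.items():
        for i, target in enumerate(items):
            context = [item for j, item in enumerate(items) if j != i][:k_context]
            examples.append({"rater": rater, "context": context, "target": target})
    return examples
```

A rater with n annotations thus yields n training examples, each pairing a different held-out target with that rater's remaining demonstrations.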

3. Inference with In-Context Annotator Information: During inference, the system includes annotator demographics and as many example training ratings as can fit within the context window, placing the target instance to be evaluated at the end. This allows the model to predict a rater’s response by directly calculating the probability of each label.
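"Directly calculating the probability of each label" typically means scoring each candidate label by the LLM's (log-)probability of generating it after the prompt, then renormalizing over the label set. A minimal sketch of that renormalization, with a toy scorer standing in for an actual LLM's summed token log-probabilities (the function names are illustrative):

```python
import math

def label_distribution(logprob_fn, prompt, labels):
    """Turn per-label sequence log-probabilities into a normalized
    distribution over the candidate labels (softmax over log-probs)."""
    logps = {label: logprob_fn(prompt, label) for label in labels}
    m = max(logps.values())  # subtract the max for numerical stability
    exps = {label: math.exp(lp - m) for label, lp in logps.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}

# toy scorer: in a real system this would sum the LLM's token log-probs
toy_scorer = lambda prompt, label: {"1": -1.0, "2": -2.0}[label]
dist = label_distribution(toy_scorer, "…prompt…", ["1", "2"])
```

With a real model, `logprob_fn` would feed the prompt plus label tokens through the LLM and sum the log-probabilities of the label tokens given the preceding context.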

The prompt structure is crucial, incorporating a task description, annotator demographics, input instance, and the expected output (rating and explanation). The inclusion of rater explanations, even if not directly evaluated, is believed to provide predictive information.
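The article does not reproduce the exact template, but the described structure can be sketched as follows (field names and formatting are hypothetical):

```python
def build_prompt(task_description, demographics, rater_examples, target_instance):
    """Assemble an in-context prompt: task description first, then the
    annotator's demographics, their example ratings with explanations,
    and finally the target instance awaiting a rating."""
    parts = [task_description]
    if demographics:
        parts.append(f"Annotator: {demographics}")
    for ex in rater_examples:
        parts.append(
            f"Input: {ex['text']}\nRating: {ex['rating']}\n"
            f"Explanation: {ex['explanation']}"
        )
    parts.append(f"Input: {target_instance}\nRating:")
    return "\n\n".join(parts)
```

Ending the prompt at `Rating:` lets the model's next-token probabilities over the label strings be read off directly, matching the inference procedure described above.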

Key Findings and Ablation Studies

The Opt-ICL system demonstrated superior performance, achieving the best (lowest) average rank across all teams in the LeWiDi-2025 competition. An ablation study was conducted to understand the contribution of each system component:

  • In-Context Rater Examples are Crucial: The study found a substantial performance degradation when restricting the model to only a single rater demonstration instead of many. This highlights the critical role of in-context demonstrations and the LLM’s ability to learn from them.
  • Demographics Not Significantly Helpful: Omitting rater demographics did not cause a significant drop in performance, suggesting the system did not rely heavily on sociodemographic information for its predictive accuracy.
  • Dataset-Specific Fine-Tuning Matters for Large Datasets: For larger datasets like MultiPIco (MP) and Conversational Sarcasm Corpus (CSC), dataset-specific fine-tuning significantly improved performance. This is hypothesized to help the model meta-learn how to utilize in-context examples, build better priors, and specialize to the data distribution. However, its impact was less significant on smaller datasets like Paraphrase Detection (Par) and VariErrNLI (VEN).
  • Spectrum Tuning Helped on One Dataset: Spectrum Tuning significantly aided performance on the MP dataset but showed less impact on others. The reasons for this selective benefit are still being explored.
  • Performance Scales with Model Size, But Training is Key: While larger models generally performed better, the 12B Opt-ICL system, with its specialized training (SpecT and SFT), outperformed a larger 27B pretrained model without such training on the larger datasets. This indicates that model scale alone does not compensate for targeted, dataset-specific training.

The Opt-ICL system offers several advantages, including using a single model per dataset, potential adaptation to new raters at test time, strong performance even with limited data, and a consistent approach for both perspectivist and soft tasks. However, limitations include expensive inference due to long prompt lengths and the inability to leverage rater demonstrations that exceed the context window. Future work aims to explore how performance scales with more in-context examples, selective example inclusion, and the impact of rater explanations.

For more in-depth technical details, you can read the full research paper here.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach out to her at: [email protected]
