Opt-ICL: A Winning Strategy for Modeling Human Disagreement in NLP

TLDR: The Opt-ICL system, developed by Taylor Sorensen and Yejin Choi, won the LeWiDi-2025 competition by effectively modeling human variation and disagreement in NLP tasks. It leverages large language models’ in-context learning abilities through a two-step meta-learning process: post-training (Spectrum Tuning) and dataset-specific fine-tuning. Key findings show that including rater examples in-context is crucial, while demographics had less impact. The system’s performance also scales with model size, though specialized training remains vital.

In the evolving landscape of Natural Language Processing (NLP), tasks often grapple with subjectivity, ambiguity, and genuine disagreement among human annotators. Traditionally, this disagreement was seen as ‘noise’ to be eliminated, often attributed to unclear instructions or faulty data. However, a new perspective is emerging: disagreement can be a valuable ‘signal’ of diverse interpretations and human variation, crucial for building robust and uncertainty-calibrated AI systems.

Addressing this challenge, the Learning With Disagreements (LeWiDi) competition was established to inspire methods for integrating human variation into AI evaluation and modeling. The competition featured two main tasks: a ‘perspectivist’ task, aiming to predict an individual annotator’s rating, and a ‘soft label’ task, focused on predicting the distribution of labels from a pool of annotators.

A system named Opt-ICL (Optimizing In-Context Learning), developed by Taylor Sorensen from the University of Washington and Yejin Choi from Stanford University, emerged as the overall winner in both LeWiDi-2025 tasks. Their approach takes a fully perspectivist stance, predicting individual annotator responses and then aggregating these into a distribution for the soft task. The core of Opt-ICL lies in leveraging the in-context learning abilities of large language models (LLMs), specifically the google/gemma-3-12b-pt model.
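The aggregation step of this perspectivist approach is straightforward in principle: predict a label for each individual annotator, then pool those predictions into a distribution for the soft-label task. A minimal sketch of that pooling (the function name and label format are illustrative, not from the paper):

```python
from collections import Counter

def aggregate_soft_label(annotator_predictions, labels):
    """Pool per-annotator predicted labels into a soft label
    distribution over the full label set."""
    counts = Counter(annotator_predictions)
    total = len(annotator_predictions)
    return {label: counts.get(label, 0) / total for label in labels}

# e.g. five annotators' predicted sarcasm judgments
dist = aggregate_soft_label(["yes", "yes", "no", "yes", "no"], ["yes", "no"])
# dist == {"yes": 0.6, "no": 0.4}
```

In practice the per-annotator predictions could themselves be probability distributions rather than hard labels, in which case averaging the distributions would replace the counting shown here.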

How Opt-ICL Works: A Three-Component System

The Opt-ICL system is built upon three key components:

1. Spectrum Tuning (SpecT): This involves post-training an autoregressive LLM on a vast collection of over 40 datasets that exhibit human variation, stochasticity, or epistemic uncertainty. This step enhances the model’s in-context learning capabilities and teaches it a unified prompt format.

2. Dataset-Specific Fine-Tuning: After Spectrum Tuning, the model undergoes further specialization on the particular dataset of interest. This fine-tuning uses in-context demonstrations from each rater, essentially meta-learning how best to adapt to individual rater examples.
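One plausible way to build such fine-tuning data is a leave-one-out construction per rater: hold out each of a rater's annotations as the prediction target, with their remaining annotations serving as in-context demonstrations. The sketch below assumes this construction; the paper's exact data pipeline may differ.

```python
def make_finetuning_examples(rater_to_items, k_context):
    """For each rater, hold out one annotated item as the target and
    place up to k_context of their other items in-context, so the model
    meta-learns to adapt to a rater from their example ratings."""
    examples = []
    for rater, items in rater_to_items.items():
        for i, target in enumerate(items):
            context = [item for j, item in enumerate(items) if j != i][:k_context]
            examples.append({"rater": rater, "context": context, "target": target})
    return examples
```

A rater with n annotations thus yields n training examples, each pairing a different held-out target with that rater's remaining demonstrations.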

3. Inference with In-Context Annotator Information: During inference, the system includes annotator demographics and as many example training ratings as can fit within the context window, placing the target instance to be evaluated at the end. This allows the model to predict a rater’s response by directly calculating the probability of each label.
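"Directly calculating the probability of each label" typically means scoring each candidate label by the LLM's (log-)probability of generating it after the prompt, then renormalizing over the label set. A minimal sketch of that renormalization, with a toy scorer standing in for an actual LLM's summed token log-probabilities (the function names are illustrative):

```python
import math

def label_distribution(logprob_fn, prompt, labels):
    """Turn per-label sequence log-probabilities into a normalized
    distribution over the candidate labels (softmax over log-probs)."""
    logps = {label: logprob_fn(prompt, label) for label in labels}
    m = max(logps.values())  # subtract the max for numerical stability
    exps = {label: math.exp(lp - m) for label, lp in logps.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}

# toy scorer: in a real system this would sum the LLM's token log-probs
toy_scorer = lambda prompt, label: {"1": -1.0, "2": -2.0}[label]
dist = label_distribution(toy_scorer, "…prompt…", ["1", "2"])
```

With a real model, `logprob_fn` would feed the prompt plus label tokens through the LLM and sum the log-probabilities of the label tokens given the preceding context.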

The prompt structure is crucial, incorporating a task description, annotator demographics, input instance, and the expected output (rating and explanation). The inclusion of rater explanations, even if not directly evaluated, is believed to provide predictive information.
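The article does not reproduce the exact template, but the described structure can be sketched as follows (field names and formatting are hypothetical):

```python
def build_prompt(task_description, demographics, rater_examples, target_instance):
    """Assemble an in-context prompt: task description first, then the
    annotator's demographics, their example ratings with explanations,
    and finally the target instance awaiting a rating."""
    parts = [task_description]
    if demographics:
        parts.append(f"Annotator: {demographics}")
    for ex in rater_examples:
        parts.append(
            f"Input: {ex['text']}\nRating: {ex['rating']}\n"
            f"Explanation: {ex['explanation']}"
        )
    parts.append(f"Input: {target_instance}\nRating:")
    return "\n\n".join(parts)
```

Ending the prompt at `Rating:` lets the model's next-token probabilities over the label strings be read off directly, matching the inference procedure described above.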

Key Findings and Ablation Studies

The Opt-ICL system demonstrated superior performance, achieving the best (lowest) average rank across all teams in the LeWiDi-2025 competition. An ablation study was conducted to understand the contribution of each system component:

  • In-Context Rater Examples are Crucial: The study found a substantial performance degradation when restricting the model to only a single rater demonstration instead of many. This highlights the critical role of in-context demonstrations and the LLM’s ability to learn from them.
  • Demographics Not Significantly Helpful: Omitting rater demographics did not cause a significant drop in performance, suggesting the system did not rely heavily on sociodemographic information for its predictive accuracy.
  • Dataset-Specific Fine-Tuning Matters for Large Datasets: For larger datasets like MultiPIco (MP) and Conversational Sarcasm Corpus (CSC), dataset-specific fine-tuning significantly improved performance. This is hypothesized to help the model meta-learn how to utilize in-context examples, build better priors, and specialize to the data distribution. However, its impact was less significant on smaller datasets like Paraphrase Detection (Par) and VariErrNLI (VEN).
  • Spectrum Tuning Helped on One Dataset: Spectrum Tuning significantly aided performance on the MP dataset but showed less impact on others. The reasons for this selective benefit are still being explored.
  • Performance Scales with Model Size, But Training is Key: While larger models generally performed better, the 12B Opt-ICL system, with its specialized training (SpecT and SFT), outperformed a larger 27B pretrained model without such training on the larger datasets. This indicates that model scale alone does not compensate for targeted, dataset-specific training.

The Opt-ICL system offers several advantages, including using a single model per dataset, potential adaptation to new raters at test time, strong performance even with limited data, and a consistent approach for both perspectivist and soft tasks. However, limitations include expensive inference due to long prompt lengths and the inability to leverage rater demonstrations that exceed the context window. Future work aims to explore how performance scales with more in-context examples, selective example inclusion, and the impact of rater explanations.

For more in-depth technical details, you can read the full research paper here.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach out to her at: [email protected]
