TL;DR: This research paper proposes a novel framework for collecting high-quality preference data for LLM alignment directly from end-users. By generating the two responses shown in comparison mode from *different* LLMs, the system can infer each user’s ‘attentiveness level’ via a probabilistic behavioral model fitted with an EM algorithm. Filtering out casual or noisy feedback yields a smaller but higher-quality dataset that significantly improves downstream alignment methods such as Direct Preference Optimization (DPO).
Large Language Models (LLMs) have become incredibly popular, and their ability to understand and generate human-like text is constantly improving. A crucial part of this improvement involves aligning these models with human preferences and values. Traditionally, this alignment relies on data collected by professional human annotators who compare different model responses and indicate which one they prefer. However, this method is expensive and doesn’t scale well.
A new research paper, titled “Users as Annotators: LLM Preference Learning from Comparison Mode,” explores an innovative way to gather this valuable preference data directly from the vast user base of LLMs. Think of it like the ‘comparison mode’ you might see in some LLM interfaces, where you’re shown two responses to your query and asked to pick your favorite. This approach has a huge advantage: users are the ultimate experts in judging responses to their own questions.
However, there’s a significant challenge with user-generated feedback: quality control. Unlike professional annotators who are incentivized and trained to provide consistent judgments, everyday users might not always be attentive or consistent. They might casually select a response, or even pick one randomly, making it difficult to distinguish high-quality feedback from noisy data.
This paper introduces a clever framework to tackle this quality control issue. The core idea involves a slight but significant change to how responses are generated in comparison mode. Instead of presenting two responses from the same LLM, the framework proposes generating the two responses from *different* LLMs, or different versions of the same model. This asymmetry is key.
Here’s why this asymmetry is so important: if one model (say, Model A) is generally more powerful or produces better responses than another (Model B), attentive users are expected to favor Model A more often. Casual users, on the other hand, might choose between the two models with roughly equal probability, regardless of which one is objectively better. By tracking a user’s preference history over time, the system can infer their ‘attentiveness level’ – essentially, how careful and committed they are to providing high-quality feedback.
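To make that behavioral assumption concrete, here is a minimal sketch of the two-type model it implies. This is an illustrative simplification on my part, not the paper’s exact parameterization: an attentive user picks the stronger model’s response with some probability above 0.5 (the 0.8 below is an arbitrary example value), while a casual user picks at coin-flip odds, so a long enough choice history separates the two hypotheses.

```python
import numpy as np

# Sketch of the two-type behavioral model (illustrative; the paper's
# exact parameterization may differ). Each comparison pits a stronger
# Model A against a weaker Model B.

def choice_log_likelihood(n_picked_a: int, n_total: int, p_pick_a: float) -> float:
    """Log-likelihood of a user's choice history under a Bernoulli pick-rate."""
    n_picked_b = n_total - n_picked_a
    return n_picked_a * np.log(p_pick_a) + n_picked_b * np.log(1.0 - p_pick_a)

# A user who picked Model A in 14 of 16 comparisons fits the attentive
# hypothesis (pick-rate 0.8, assumed) far better than the casual one (0.5):
ll_attentive = choice_log_likelihood(14, 16, p_pick_a=0.8)
ll_casual = choice_log_likelihood(14, 16, p_pick_a=0.5)
print(ll_attentive - ll_casual)  # positive: evidence of attentiveness
```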
The researchers developed a probabilistic model of this behavior and an Expectation-Maximization (EM) algorithm that estimates a latent quality factor, in effect an attentiveness score, for each user. Once attentiveness has been inferred, the system filters the user-annotated data, retaining only feedback from users deemed ‘attentive.’ This smaller but higher-quality dataset can then be used for downstream LLM alignment tasks, such as Direct Preference Optimization (DPO).
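As a concrete illustration, below is a minimal EM sketch under a simplifying assumption: each user is latently either ‘attentive’ (picks the stronger model with a learned probability p > 0.5) or ‘casual’ (picks at 0.5, fixed). The paper’s actual model and estimator are more general; the initialization values, variable names, and the 0.9 threshold are my own illustrative choices, not the authors’ code.

```python
import numpy as np

def em_attentiveness(picks_a, totals, n_iters=100):
    """EM for a two-type user mixture (illustrative sketch, not the paper's code).

    picks_a[i]: number of comparisons in which user i chose the stronger model
    totals[i]:  user i's total number of comparisons
    Returns each user's posterior probability of being attentive.
    """
    picks_a = np.asarray(picks_a, dtype=float)
    totals = np.asarray(totals, dtype=float)
    pi, p = 0.5, 0.7  # initial attentive fraction and attentive pick-rate (assumed)
    for _ in range(n_iters):
        # E-step: per-user log-likelihoods under each type -> responsibilities
        ll_att = picks_a * np.log(p) + (totals - picks_a) * np.log(1.0 - p)
        ll_cas = totals * np.log(0.5)
        log_att = np.log(pi) + ll_att
        log_cas = np.log(1.0 - pi) + ll_cas
        m = np.maximum(log_att, log_cas)  # log-sum-exp trick for stability
        resp = np.exp(log_att - m) / (np.exp(log_att - m) + np.exp(log_cas - m))
        # M-step: re-estimate the attentive fraction and pick-rate
        pi = resp.mean()
        p = np.clip((resp * picks_a).sum() / (resp * totals).sum(), 0.51, 0.99)
    return resp

# Keep only users whose posterior attentiveness clears a (tunable) threshold:
posterior = em_attentiveness(picks_a=[14, 9, 20], totals=[16, 18, 40])
keep_mask = posterior > 0.9
```

The threshold (0.9 here) trades data quantity against quality, which is exactly the trade-off the paper’s experiments explore.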
Experiments showed that this data filtering approach significantly improves DPO performance. Even though filtering reduces the total amount of training data, the higher quality of the remaining data leads to better average reward scores and increased win rates over baseline models. The paper also discusses trade-offs, such as finding the optimal filtering threshold and the impact of the performance gap between the two generating LLMs on the effectiveness of the filtering process.
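For the downstream step, here is a hedged sketch of how the retained feedback could be packaged for DPO training: common DPO implementations (for example, Hugging Face TRL’s DPOTrainer) consume prompt/chosen/rejected triples. The record field names below are assumptions about the logged comparison data, not the paper’s schema.

```python
# Sketch: convert retained users' comparisons into DPO-style triples.
# Field names (user_id, prompt, response_a, response_b, picked_a) are
# illustrative assumptions, not the paper's actual data schema.
def build_dpo_dataset(records, attentive_user_ids):
    dataset = []
    for r in records:
        if r["user_id"] not in attentive_user_ids:
            continue  # drop feedback from users judged inattentive
        if r["picked_a"]:
            chosen, rejected = r["response_a"], r["response_b"]
        else:
            chosen, rejected = r["response_b"], r["response_a"]
        dataset.append({"prompt": r["prompt"], "chosen": chosen, "rejected": rejected})
    return dataset
```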
This innovative framework not only provides a scalable way to collect preference data but also ensures its quality, making user feedback a powerful tool for improving LLMs. It opens doors for future advancements, including modeling attentiveness at a sample level (rather than just user level) and adapting to diverse user prompt distributions. For more technical details, you can read the full paper here.


