TLDR: The Steerable Pluralistic Model (SPM) is a new method for aligning large language models (LLMs) with diverse individual user preferences rather than with an average preference. It uses few-shot comparative regression: the LLM scores each candidate response on fine-grained attributes, and a distance function selects the response whose scores best match the user's target. The paper also proposes two new evaluation benchmarks and reports that the approach outperforms existing methods while being more interpretable and adaptable.
Large language models, or LLMs, are becoming increasingly common in our daily lives, from helping us write emails to assisting with complex decision-making. However, a significant challenge with these powerful AI systems is ensuring they align with human intentions and values. Traditionally, LLMs are aligned using methods like reinforcement learning from human feedback (RLHF), which often relies on a single, scalar reward. This means the AI learns to reflect average user preferences, potentially overlooking the rich diversity of individual human values and perspectives.
Imagine an AI that needs to provide advice on a sensitive topic. An average alignment might not cater to someone who prioritizes compassion over strict adherence to rules, or vice versa. This is where the concept of “pluralistic alignment” comes in. Instead of a one-size-fits-all approach, pluralistic alignment aims to capture and adapt to a wide range of user preferences across various attributes, moving beyond just being generally helpful or harmless.
Researchers at Kitware Inc. have introduced a novel approach to address this challenge with their “Steerable Pluralistic Model” (SPM). This new model is designed to be adaptable to individual user preferences through a technique called few-shot comparative regression. At its core, the SPM leverages the LLM’s ability to understand and reason about fine-grained attributes. When presented with a question and multiple possible responses, the LLM is prompted to score each response based on how well it aligns with a set of specific attributes, such as ‘care,’ ‘fairness,’ ‘helpfulness,’ or ‘correctness’.
The LLM doesn’t pick a response directly. Instead, it acts as a ‘judge,’ assigning attribute scores to each option. A separate function then calculates the ‘distance’ between these predicted scores and a user-defined ‘alignment target,’ a vector representing the user’s desired attribute values. The response with the smallest distance to the target is selected. This indirect approach helps reduce the inherent biases that LLMs might have from their initial training, allowing for more precise steering towards specific user profiles.
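The selection step can be illustrated with a minimal Python sketch. It assumes scores on a 0 to 1 scale and a Euclidean distance over the attributes named in the target. The article does not specify the scale or the distance function, so both are illustrative choices, as are the function names and the example scores.

```python
from math import sqrt

def euclidean_distance(scores, target):
    """Distance between predicted scores and the target.

    Only attributes named in the target are compared (an assumption
    made for this sketch, not a detail taken from the paper).
    """
    return sqrt(sum((scores[attr] - value) ** 2 for attr, value in target.items()))

def select_response(predicted_scores, target):
    """Return the index of the response whose scores are closest to the target."""
    return min(
        range(len(predicted_scores)),
        key=lambda i: euclidean_distance(predicted_scores[i], target),
    )

# Hypothetical scores that an LLM judge might assign to three candidate responses.
predicted = [
    {"care": 0.9, "fairness": 0.4},
    {"care": 0.2, "fairness": 0.8},
    {"care": 0.5, "fairness": 0.5},
]

high_care = {"care": 1.0}   # user profile that wants high 'care'
low_care = {"care": 0.0}    # user profile that wants low 'care'

best_for_high = select_response(predicted, high_care)
best_for_low = select_response(predicted, low_care)
```

With these made-up scores, the same set of predictions yields a different selected response for each profile, which is the sense in which the model is steerable: only the target vector changes, not the scoring.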
A key innovation of this method is its use of “in-context learning” (ICL) with “few-shot examples.” This means the LLM is given a few examples of how responses should be scored against attributes, essentially providing it with a rubric to follow. This significantly improves the accuracy of the regression. Furthermore, the model is designed to produce “reasoning statements,” explaining why a particular response received its score, which enhances the interpretability of the AI’s decisions.
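To make the few-shot rubric idea concrete, here is a hedged sketch of how such a scoring prompt might be assembled. The prompt wording, the 0 to 1 scale, the field names, and the `build_scoring_prompt` helper are all hypothetical. The paper's actual prompt and output schema are not described in this article.

```python
def build_scoring_prompt(attribute, examples, question, responses):
    """Assemble a few-shot prompt asking an LLM to score responses on one attribute.

    `examples` is a list of dicts with 'question', 'response', 'reasoning',
    and 'score' keys (a hypothetical format chosen for this sketch).
    """
    lines = [
        f"Rate how strongly each response expresses the attribute '{attribute}' "
        "on a scale from 0 to 1.",
        "Explain your reasoning in one sentence before giving each rating.",
        "",
    ]
    # Few-shot examples act as a rubric for the judge.
    for ex in examples:
        lines.append(f"Question: {ex['question']}")
        lines.append(f"Response: {ex['response']}")
        lines.append(f"Reasoning: {ex['reasoning']}")
        lines.append(f"Score: {ex['score']}")
        lines.append("")
    # The new item to be scored, with all candidate responses listed together.
    lines.append(f"Question: {question}")
    for i, response in enumerate(responses, start=1):
        lines.append(f"Response {i}: {response}")
    lines.append("Provide a reasoning statement and a rating for every response.")
    return "\n".join(lines)

# Made-up examples for illustration only.
few_shot = [
    {"question": "My friend failed an exam. What should I say?",
     "response": "Tell them you are there for them.",
     "reasoning": "The reply centers the friend's feelings.",
     "score": 0.9},
    {"question": "My friend failed an exam. What should I say?",
     "response": "They should have studied harder.",
     "reasoning": "The reply dismisses the friend's feelings.",
     "score": 0.1},
]

prompt = build_scoring_prompt(
    "care",
    few_shot,
    "A coworker is overwhelmed. How should I respond?",
    ["Offer to take on one of their tasks.", "Ignore it; it is not your problem."],
)
```

The resulting string would be sent to whichever LLM serves as the judge. Including a reasoning line in each example is what prompts the model to return an explanation alongside each score.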
To properly evaluate their SPM, the researchers also developed two new “steerable pluralistic benchmarks” by adapting existing open-source datasets: the Moral Integrity Corpus (MIC) for value-based decision-making and HelpSteer2 for reward modeling. These benchmarks allow for testing how well a model can be customized to a particular set of target attributes, a crucial step that was previously lacking in the field.
In experiments, the proposed SPM consistently outperformed other baseline and state-of-the-art methods, demonstrating better alignment accuracy across diverse user profiles. Notably, it showed less susceptibility to the implicit biases found in unaligned LLMs and traditional reward models, which often lean towards responses with generally ‘high’ moral or preference attributes. This means the SPM can effectively align with a full spectrum of preferences, including those that might be considered ‘low’ on certain attributes if that’s what the user desires.
The researchers acknowledge some limitations, primarily increased runtime due to longer prompts and the use of a structured output schema. The trade-off is extra inference cost in exchange for more accurate and flexible steering across pluralistic profiles. Future work aims to explore weighted multi-attribute alignment objectives and user studies to further refine the model’s capabilities.
This research is a step toward AI systems that are fairer, more representative, and more adaptable to the nuanced and diverse preferences of individual users. By enabling LLMs to align with specific values and perspectives, this work contributes to the development of more ethical and user-centric AI. You can find more details about this research in the paper: Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression.


