
Enhancing LLM Alignment: A Novel Method to Combat Over-Optimization

TLDR: Weights-Rotated Preference Optimization (RoPO) is a new algorithm designed to improve Large Language Model (LLM) alignment by addressing the “reward hacking” problem in Direct Preference Optimization (DPO). Reward hacking produces overly long, repetitive generations and causes the model to forget previously learned knowledge. RoPO tackles this by implicitly constraining the output layer and explicitly constraining intermediate hidden states with a multi-granularity orthogonal matrix. This dual constraint preserves angle-encoded knowledge, yielding better benchmark performance, shorter generations, greater diversity, and less knowledge forgetting with minimal trainable parameters.

Large Language Models (LLMs) have shown incredible performance, but they often struggle to consistently meet human expectations. To address this, researchers use alignment techniques, with Direct Preference Optimization (DPO) being a prominent method. DPO helps LLMs learn human preferences by comparing chosen and rejected responses.
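To make the setup concrete, here is a minimal sketch of the standard DPO loss from the original DPO paper, assuming the summed log-probabilities of each response have already been computed; beta is the usual KL-regularization strength:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss; each argument is a (batch,) tensor holding the
    total log-probability a model assigns to the chosen/rejected response."""
    # Implicit rewards are log-ratios of the policy to the frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Minimizing this pushes the chosen-vs-rejected reward margin up.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that this loss can be driven down simply by pushing the rejected response’s probability toward zero, and the next paragraph describes where that shortcut leads.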

However, DPO faces a significant challenge known as “reward hacking” or “reward overoptimization.” This occurs when an LLM excessively reduces the probability of rejected responses to maximize its reward, rather than genuinely improving its output. This leads to several undesirable outcomes: models generate overly lengthy content, responses lack diversity, and the model can even forget previously learned knowledge, a phenomenon called catastrophic forgetting. Imagine an AI that, when asked a simple question, gives a very long, repetitive answer that doesn’t quite hit the mark, as illustrated in the paper’s Figure 1.

The researchers behind the new Weights-Rotated Preference Optimization (RoPO) algorithm investigated the root cause of this reward hacking. They found that DPO optimization can cause the model’s neurons to “collapse” in the parameter space, leading to what they call “representation redundancy.” This means the model’s internal representations become less distinct, impairing its expressive capabilities and leading to knowledge forgetting.
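The paper’s exact redundancy measurement isn’t reproduced in this summary, but one hypothetical way to probe for this kind of neuron collapse is to track how aligned a layer’s weight rows become after DPO training:

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(weight: torch.Tensor) -> float:
    """Average off-diagonal cosine similarity between the rows of a weight
    matrix; a rise after training hints at neurons collapsing together."""
    w = F.normalize(weight, dim=1)                # unit-norm rows
    sim = w @ w.T                                 # row-vs-row cosine similarities
    mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return sim[mask].mean().item()                # ignore self-similarity
```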

To combat this, RoPO applies dual constraints during optimization. It implicitly relies on the KL-divergence regularization inherited from DPO to constrain the output layer, keeping expressions diverse and fluent. Crucially, RoPO also explicitly constrains the intermediate hidden states by fine-tuning them with a multi-granularity orthogonal matrix, composed of global Householder reflection matrices and fine-grained Givens rotation matrices, which together rotate the model’s weights. Because rotations preserve norms and the angles between vectors, this protects the angle-encoded knowledge and expressive capabilities acquired during pre-training and supervised fine-tuning.
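The article doesn’t spell out RoPO’s exact parameterization, but both building blocks are textbook constructions: a Householder reflection H = I - 2vv^T (with unit vector v) is orthogonal and acts on all coordinates at once, while a Givens rotation mixes just two coordinates by a learned angle. A hypothetical sketch of a rotation-style weight update in PyTorch, where only the reflection direction and the angle are trainable (names like RotatedLinear are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class RotatedLinear(nn.Module):
    """Illustrative rotation-style update W' = G @ H @ W of a frozen weight,
    with H a global Householder reflection and G a Givens rotation on one
    coordinate pair. A sketch of the idea, not the paper's parameterization."""
    def __init__(self, frozen_weight: torch.Tensor, pair=(0, 1)):
        super().__init__()
        d = frozen_weight.size(0)
        self.register_buffer("W", frozen_weight)   # frozen SFT weight
        self.v = nn.Parameter(torch.randn(d))      # Householder direction
        self.theta = nn.Parameter(torch.zeros(1))  # Givens rotation angle
        self.pair = pair

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = self.v / self.v.norm()
        d = v.numel()
        H = torch.eye(d, device=v.device) - 2.0 * torch.outer(v, v)
        i, j = self.pair
        G = torch.eye(d, device=v.device)
        c, s = torch.cos(self.theta).squeeze(), torch.sin(self.theta).squeeze()
        G[i, i], G[j, j] = c, c                    # rotate only coordinates i, j
        G[i, j], G[j, i] = -s, s
        return x @ (G @ H @ self.W).T              # rotated weights, W stays frozen
```

Because both factors are orthogonal, the rotated matrix preserves the norms of, and angles between, the frozen weights’ rows, which is the mechanism claimed to protect angle-encoded knowledge.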

By preventing the policy model from deviating too far from the reference model, RoPO effectively mitigates the reward hacking problem. The experimental results are compelling: RoPO achieves significant performance improvements on benchmarks like AlpacaEval 2 and MT-Bench, outperforming existing baselines. Importantly, it does so while maintaining shorter generation lengths, addressing the verbosity issue. Furthermore, RoPO demonstrates superior knowledge retention, preventing the catastrophic forgetting observed in other DPO methods, and enhances the diversity of generated content.

One of RoPO’s notable advantages is its efficiency. It achieves these strong results with only 0.015% of the trainable parameters of full-parameter baselines, making it a highly parameter-efficient fine-tuning method that can be trained faster and with fewer computational resources.
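To see why a rotation parameterization stays so small, compare a full weight matrix to its rotation factors; the model and hidden sizes below are illustrative assumptions, not figures from the article:

```python
# Back-of-the-envelope parameter counts (model size is an assumed example).
full_params = 7_000_000_000                 # e.g., a 7B-parameter model
ropo_fraction = 0.00015                     # 0.015% trainable, per the paper
print(f"RoPO-style trainable params: {full_params * ropo_fraction:,.0f}")

d = 4096                                    # hidden size (assumed)
full_matrix = d * d                         # full fine-tune of one weight matrix
householder = d                             # one reflection = one d-vector
givens = 1                                  # one rotation = one angle
print(f"Per matrix: {full_matrix:,} vs {householder + givens:,}")
```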


In essence, RoPO offers a robust solution to a critical problem in LLM alignment, enabling models to better align with human preferences without sacrificing their expressive power or forgetting their foundational knowledge. For more technical details, you can refer to the full research paper here.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach out to her at: [email protected]
