
Solving Diversity Collapse in LLMs with Diversity-Preserving Hybrid RL

TL;DR: This paper introduces Diversity-Preserving Hybrid RL (DPH-RL), a new framework that tackles “diversity collapse” and “catastrophic forgetting” in Large Language Models (LLMs) fine-tuned with Reinforcement Learning with Verifiable Reward (RLVR). Traditional methods often degrade multi-attempt performance (Pass@k) and lose previously learned skills. DPH-RL uses mass-covering f-divergences (such as forward-KL and JS-divergence) as a “rehearsal mechanism,” continuously referencing the model’s initial policy so that it maintains a broad range of solution styles. Experiments on math and SQL tasks show DPH-RL significantly improves both single-attempt (Pass@1) and multi-attempt (Pass@k) performance, even on new tasks, while being more training-efficient.

In the rapidly evolving field of Artificial Intelligence, Large Language Models (LLMs) are being fine-tuned with advanced techniques like Reinforcement Learning with Verifiable Reward (RLVR) to enhance their capabilities in complex tasks such as mathematical problem-solving and code generation. While these methods have shown promise in improving single-attempt accuracy, a significant challenge known as ‘diversity collapse’ often emerges. The paradox is that while a model may get a single answer right more often (Pass@1), its ability to produce at least one correct solution across k sampled attempts (Pass@k) can actually degrade, sometimes falling below the performance of the original, untrained model. This issue is frequently accompanied by ‘catastrophic forgetting,’ where the model loses previously acquired skills. For a deeper dive into this research, you can read the full paper here.
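
For context, Pass@k is typically computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): draw n ≥ k samples per problem, count the c correct ones, and estimate the probability that a random size-k subset contains at least one correct sample. A minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples, 4 correct -> Pass@1 = 0.25, Pass@8 ~ 0.96
print(pass_at_k(16, 4, 1), pass_at_k(16, 4, 8))
```

Diversity collapse is precisely the regime where Pass@1 rises while Pass@k falls: the model concentrates its probability mass on one solution style, so extra samples stop adding coverage.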

A new research paper titled “The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward” by Long Li, Jiaran Hao, Jason Klein Liu, Zhijian Zhou, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, and Yuan Qi addresses this critical problem head-on. The authors argue that standard approaches in RLVR, which either use a ‘mode-seeking’ reverse KL divergence or omit the divergence term entirely, lack a crucial mechanism for retaining knowledge and diversity. Reverse KL is mode-seeking: it actively pushes the model to converge on a single, most probable solution, narrowing its focus and suppressing the diversity of its outputs. Without any divergence term, the model has no safeguard against drifting away from its diverse knowledge base.
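
To make the distinction concrete, write π for the policy being trained and π₀ for the frozen initial policy. The two KL directions (standard definitions; notation ours) are:

```latex
\underbrace{\mathrm{KL}(\pi \,\|\, \pi_0)}_{\text{reverse, mode-seeking}}
  = \mathbb{E}_{y \sim \pi}\!\left[ \log \frac{\pi(y \mid x)}{\pi_0(y \mid x)} \right],
\qquad
\underbrace{\mathrm{KL}(\pi_0 \,\|\, \pi)}_{\text{forward, mass-covering}}
  = \mathbb{E}_{y \sim \pi_0}\!\left[ \log \frac{\pi_0(y \mid x)}{\pi(y \mid x)} \right]
```

Because the reverse-KL expectation is taken under the current policy π, the model pays no penalty for abandoning modes of π₀. Forward KL, by contrast, blows up wherever π₀ still has probability mass that π has dropped, which is exactly the rehearsal pressure DPH-RL exploits.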

Introducing Diversity-Preserving Hybrid RL (DPH-RL)

The researchers propose a fundamental shift in perspective: using the divergence term itself as a solution rather than just a constraint. Their framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages ‘mass-covering’ f-divergences, such as forward KL and Jensen-Shannon (JS) divergence. These divergences act as a “rehearsal mechanism” that continuously references the model’s initial policy, forcing the model to maintain broad coverage of potential solutions and effectively preventing diversity collapse and catastrophic forgetting.
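
For reference, an f-divergence is defined by a convex generator function f with f(1) = 0; choosing f appropriately recovers the divergences named above. These are the standard generator forms (the paper’s exact parameterization and direction convention may differ):

```latex
D_f(\pi_0 \,\|\, \pi)
  = \mathbb{E}_{y \sim \pi}\!\left[ f\!\left( \frac{\pi_0(y \mid x)}{\pi(y \mid x)} \right) \right],
\qquad
f_{\text{forward-KL}}(t) = t \log t,
\qquad
f_{\text{JS}}(t) = \tfrac{1}{2}\!\left( t \log t - (t+1)\log\tfrac{t+1}{2} \right)
```

With t = π₀/π, the generator f(t) = t log t reproduces exactly the forward KL shown earlier, and the JS generator yields the symmetric Jensen-Shannon divergence.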

The DPH-RL framework operates in two main phases: a pre-sampling stage and an online training stage. In the pre-sampling stage, the initial dataset is partitioned into a “perfect” dataset (queries the base model already handles well) and an “exploration” dataset (challenging queries requiring improvement). During online training, different loss functions are applied to the two datasets: on the exploration dataset, the model is given maximum freedom to learn from rewards, while on the perfect dataset, the f-divergence constraint ensures the model retains its original capabilities. A key advantage of DPH-RL is its training efficiency: because the f-divergence is computed via its generator function, only samples drawn once from the initial policy are required, eliminating the need for an online reference model.
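
Below is a minimal sketch of how this two-stage recipe might look, assuming per-sequence log-probabilities and a plain policy-gradient objective. Every name here (partition_dataset, dph_step, beta, the threshold) is our illustrative stand-in, not the paper’s implementation; the point it demonstrates is that the divergence term only needs log-probs of responses pre-sampled once from the frozen initial policy π₀, so no reference model runs during training.

```python
import torch

def partition_dataset(queries, solve_rate, threshold=1.0):
    """Pre-sampling stage (illustrative): queries the base model already
    answers reliably form the 'perfect' set; the rest go to 'exploration'."""
    perfect = [q for q in queries if solve_rate[q] >= threshold]
    exploration = [q for q in queries if solve_rate[q] < threshold]
    return perfect, exploration

def forward_kl_term(logp_pi, logp_pi0):
    """Monte-Carlo estimate of KL(pi0 || pi) over responses y ~ pi0 that
    were generated offline: E_{y~pi0}[log pi0(y) - log pi(y)].
    logp_pi0 is a cached constant, so no live reference model is needed."""
    return (logp_pi0 - logp_pi).mean()

def dph_step(logp_pi, logp_pi0, advantages, is_perfect_batch, beta=0.1):
    """One illustrative hybrid loss step (1-D per-sequence log-prob tensors)."""
    if is_perfect_batch:
        # Rehearsal: a mass-covering divergence anchored to pi0's own
        # samples keeps the model from drifting off known-good solutions.
        return beta * forward_kl_term(logp_pi, logp_pi0)
    # Exploration: plain policy-gradient term with no divergence penalty,
    # so rewards alone drive improvement on the hard queries.
    return -(advantages * logp_pi).mean()
```

A JS-divergence variant would swap forward_kl_term for the corresponding generator-function estimate; the hybrid perfect/exploration structure is unchanged.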


Demonstrated Superiority and Generalization

The effectiveness of DPH-RL was rigorously tested through extensive experiments on complex reasoning tasks, including math and SQL generation. These experiments utilized various LLM architectures, specifically Llama and Qwen models ranging from 7B to 32B parameters. DPH-RL consistently outperformed existing methods like GRPO, DAPO, and standard reverse-KL approaches.

The results showed that DPH-RL not only resolves the degradation of multi-attempt performance but also significantly improves both single-attempt (Pass@1) and multi-attempt (Pass@k) scores, both within the training domain and on entirely new, out-of-domain tasks. For instance, on SQL tasks, DPH-RL maintained higher Pass@k scores than the baselines, especially on out-of-domain datasets like Spider, where other methods showed significant performance collapse. Similarly, in mathematical reasoning, DPH-RL delivered a more balanced improvement, raising both Pass@k and mean@k without sacrificing one for the other.

The research highlights that while mode-seeking divergences like reverse-KL can cause models to over-focus and lose generalization, mass-covering divergences in DPH-RL enable models to maintain a richer, more diverse set of solution strategies. This leads to more robust, general, and diverse reasoning models, achieved without requiring external knowledge from stronger models. The work underscores the critical, often overlooked, importance of selecting the appropriate divergence measure in reinforcement learning for LLMs.

