
Solving Diversity Collapse in LLMs with Diversity-Preserving Hybrid RL

TL;DR: This paper introduces Diversity-Preserving Hybrid RL (DPH-RL), a new framework that tackles “diversity collapse” and “catastrophic forgetting” in Large Language Models (LLMs) fine-tuned with Reinforcement Learning with Verifiable Reward (RLVR). Traditional methods often degrade multi-attempt performance (Pass@k) and lose previously learned skills. DPH-RL uses mass-covering f-divergences (such as forward-KL and JS-divergence) as a “rehearsal mechanism,” continuously referencing the model’s initial policy so that it maintains a broad range of solution styles. Experiments on math and SQL tasks show DPH-RL significantly improves both single-attempt (Pass@1) and multi-attempt (Pass@k) performance, even on new tasks, while being more training-efficient.

In the rapidly evolving field of Artificial Intelligence, Large Language Models (LLMs) are being fine-tuned with advanced techniques like Reinforcement Learning with Verifiable Reward (RLVR) to enhance their capabilities in complex tasks such as mathematical problem-solving and code generation. While these methods have shown promise in improving single-attempt accuracy, a significant challenge known as ‘diversity collapse’ often emerges. The paradox is that while a model may get a single answer right more often (Pass@1), its ability to produce at least one correct solution across k sampled attempts (Pass@k) can actually degrade, sometimes falling below the performance of the original, untrained model. This issue is frequently accompanied by ‘catastrophic forgetting,’ where the model loses previously acquired skills. For a deeper dive into this research, you can read the full paper here.
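
For context, Pass@k is typically computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): draw n ≥ k samples per problem, count the c correct ones, and estimate the probability that a random size-k subset contains at least one correct sample. A minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples, 4 correct -> Pass@1 = 0.25, Pass@8 ~ 0.96
print(pass_at_k(16, 4, 1), pass_at_k(16, 4, 8))
```

Diversity collapse is precisely the regime where Pass@1 rises while Pass@k falls: the model concentrates its probability mass on one solution style, so extra samples stop adding coverage.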

A new research paper titled “The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward” by Long Li, Jiaran Hao, Jason Klein Liu, Zhijian Zhou, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, and Yuan Qi addresses this critical problem head-on. The authors argue that standard approaches in RLVR, which either use a ‘mode-seeking’ reverse KL divergence or omit the divergence term entirely, lack a crucial mechanism for retaining knowledge and diversity. Reverse KL is mode-seeking: it actively pushes the model to converge on a single, most probable solution, narrowing its focus and suppressing the diversity of its outputs. Without any divergence term, the model has no safeguard against drifting away from its diverse knowledge base.
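
To make the distinction concrete, write π for the policy being trained and π₀ for the frozen initial policy. The two KL directions (standard definitions; notation ours) are:

```latex
\underbrace{\mathrm{KL}(\pi \,\|\, \pi_0)}_{\text{reverse, mode-seeking}}
  = \mathbb{E}_{y \sim \pi}\!\left[ \log \frac{\pi(y \mid x)}{\pi_0(y \mid x)} \right],
\qquad
\underbrace{\mathrm{KL}(\pi_0 \,\|\, \pi)}_{\text{forward, mass-covering}}
  = \mathbb{E}_{y \sim \pi_0}\!\left[ \log \frac{\pi_0(y \mid x)}{\pi(y \mid x)} \right]
```

Because the reverse-KL expectation is taken under the current policy π, the model pays no penalty for abandoning modes of π₀. Forward KL, by contrast, blows up wherever π₀ still has probability mass that π has dropped, which is exactly the rehearsal pressure DPH-RL exploits.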

Introducing Diversity-Preserving Hybrid RL (DPH-RL)

The researchers propose a fundamental shift in perspective: using the divergence term itself as a solution rather than just a constraint. Their framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages ‘mass-covering’ f-divergences, such as forward KL and Jensen-Shannon (JS) divergence. These divergences act as a “rehearsal mechanism” that continuously references the model’s initial policy, forcing the model to maintain broad coverage of potential solutions and effectively preventing diversity collapse and catastrophic forgetting.
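
For reference, an f-divergence is defined by a convex generator function f with f(1) = 0; choosing f appropriately recovers the divergences named above. These are the standard generator forms (the paper’s exact parameterization and direction convention may differ):

```latex
D_f(\pi_0 \,\|\, \pi)
  = \mathbb{E}_{y \sim \pi}\!\left[ f\!\left( \frac{\pi_0(y \mid x)}{\pi(y \mid x)} \right) \right],
\qquad
f_{\text{forward-KL}}(t) = t \log t,
\qquad
f_{\text{JS}}(t) = \tfrac{1}{2}\!\left( t \log t - (t+1)\log\tfrac{t+1}{2} \right)
```

With t = π₀/π, the generator f(t) = t log t reproduces exactly the forward KL shown earlier, and the JS generator yields the symmetric Jensen-Shannon divergence.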

The DPH-RL framework operates in two main phases: a pre-sampling stage and an online training stage. In the pre-sampling stage, the initial dataset is partitioned into a “perfect” dataset (queries the base model already handles well) and an “exploration” dataset (challenging queries requiring improvement). During online training, different loss functions are applied to the two datasets: on the exploration dataset, the model is given maximum freedom to learn from rewards, while on the perfect dataset, the f-divergence constraint ensures the model retains its original capabilities. A key advantage of DPH-RL is its training efficiency: because the f-divergence is computed via its generator function, only samples drawn once from the initial policy are required, eliminating the need for an online reference model.
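
Below is a minimal sketch of how this two-stage recipe might look, assuming per-sequence log-probabilities and a plain policy-gradient objective. Every name here (partition_dataset, dph_step, beta, the threshold) is our illustrative stand-in, not the paper’s implementation; the point it demonstrates is that the divergence term only needs log-probs of responses pre-sampled once from the frozen initial policy π₀, so no reference model runs during training.

```python
import torch

def partition_dataset(queries, solve_rate, threshold=1.0):
    """Pre-sampling stage (illustrative): queries the base model already
    answers reliably form the 'perfect' set; the rest go to 'exploration'."""
    perfect = [q for q in queries if solve_rate[q] >= threshold]
    exploration = [q for q in queries if solve_rate[q] < threshold]
    return perfect, exploration

def forward_kl_term(logp_pi, logp_pi0):
    """Monte-Carlo estimate of KL(pi0 || pi) over responses y ~ pi0 that
    were generated offline: E_{y~pi0}[log pi0(y) - log pi(y)].
    logp_pi0 is a cached constant, so no live reference model is needed."""
    return (logp_pi0 - logp_pi).mean()

def dph_step(logp_pi, logp_pi0, advantages, is_perfect_batch, beta=0.1):
    """One illustrative hybrid loss step (1-D per-sequence log-prob tensors)."""
    if is_perfect_batch:
        # Rehearsal: a mass-covering divergence anchored to pi0's own
        # samples keeps the model from drifting off known-good solutions.
        return beta * forward_kl_term(logp_pi, logp_pi0)
    # Exploration: plain policy-gradient term with no divergence penalty,
    # so rewards alone drive improvement on the hard queries.
    return -(advantages * logp_pi).mean()
```

A JS-divergence variant would swap forward_kl_term for the corresponding generator-function estimate; the hybrid perfect/exploration structure is unchanged.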


Demonstrated Superiority and Generalization

The effectiveness of DPH-RL was rigorously tested through extensive experiments on complex reasoning tasks, including math and SQL generation. These experiments utilized various LLM architectures, specifically Llama and Qwen models ranging from 7B to 32B parameters. DPH-RL consistently outperformed existing methods like GRPO, DAPO, and standard reverse-KL approaches.

The results showed that DPH-RL not only resolves the degradation of multi-attempt performance but also significantly improves both single-attempt (Pass@1) and multi-attempt (Pass@k) scores, both within the training domain and on entirely new, out-of-domain tasks. For instance, on SQL tasks, DPH-RL maintained higher Pass@k scores than the baselines, especially on out-of-domain datasets like Spider, where other methods showed significant performance collapse. Similarly, in mathematical reasoning, DPH-RL delivered a more balanced improvement, raising both Pass@k and mean@k without sacrificing one for the other.

The research highlights that while mode-seeking divergences like reverse-KL can cause models to over-focus and lose generalization, mass-covering divergences in DPH-RL enable models to maintain a richer, more diverse set of solution strategies. This leads to more robust, general, and diverse reasoning models, achieved without requiring external knowledge from stronger models. The work underscores the critical, often overlooked, importance of selecting the appropriate divergence measure in reinforcement learning for LLMs.

