TLDR: MEML-GRPO is a novel framework that enhances Large Language Models’ (LLMs) reasoning capabilities by addressing reward sparsity in Reinforcement Learning with Verifiable Rewards (RLVR). It uses diverse ‘expert prompts’ to generate a wider range of responses, increasing the likelihood of finding correct solutions. An inter-expert mutual learning mechanism facilitates knowledge sharing, and a hard example accumulation feature ensures continuous learning from challenging problems. Experiments show significant performance gains with Qwen and Llama models on various reasoning benchmarks.
Large Language Models (LLMs) have shown remarkable progress in reasoning tasks, especially when combined with Reinforcement Learning with Verifiable Rewards (RLVR). This approach helps LLMs improve by giving them feedback based on whether their answers are correct or not, like a student learning from a test. However, a significant challenge in RLVR is ‘reward sparsity’. This happens when an LLM consistently produces incorrect answers for complex problems, leading to zero rewards. Without any positive feedback, the model doesn’t get a clear signal to learn and improve, essentially getting stuck in its existing knowledge.
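To make the stall concrete, here is a minimal sketch that assumes the group-normalized advantage commonly used in GRPO (each reward minus the group mean, divided by the group standard deviation). The function name and the `eps` term are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every sampled answer is wrong: all advantages are zero, so there is
# nothing for the policy update to reinforce.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

# A single correct answer is enough to produce a non-zero signal:
# the correct sample gets a positive advantage, the rest a negative one.
one_right = group_advantages([0.0, 0.0, 1.0, 0.0])
```

Under this assumption, a group with identical rewards contributes no gradient at all, which is why finding even one correct response per problem matters so much.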
To tackle this problem, a framework called Multi-Expert Mutual Learning GRPO (MEML-GRPO) has been introduced. It aims to make LLMs more robust and better at exploring new reasoning paths, even on tasks where their first attempts fail.
MEML-GRPO works by leveraging the strengths of multiple, diverse ‘experts’. Imagine several capable problem solvers, each with a different way of approaching a question. MEML-GRPO brings this idea to LLMs through ‘expert prompts’: distinct instructions that steer the model toward a wider variety of responses. With more varied answers per problem, the chance that at least one is correct goes up, and a single correct answer is enough to supply the learning signal that reward sparsity otherwise removes.
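The idea can be sketched roughly as follows. The prompt texts and function names are hypothetical (the paper's actual expert prompts are not given here), and the probability estimate assumes the experts succeed independently, which is a simplification.

```python
from math import prod

# Hypothetical expert prompts, for illustration only.
EXPERT_PROMPTS = {
    "step_by_step": "Solve the problem step by step.",
    "equations": "Translate the problem into equations, then solve them.",
    "verify": "Propose an answer, then check it against the problem.",
}

def build_prompts(question):
    """Pair one question with every expert instruction."""
    return {
        name: f"{instruction}\n\nQuestion: {question}"
        for name, instruction in EXPERT_PROMPTS.items()
    }

def p_any_correct(per_expert_success):
    """Chance that at least one expert is correct, assuming independence."""
    return 1.0 - prod(1.0 - p for p in per_expert_success)

prompts = build_prompts("What is 2 + 2?")
# Three experts that each succeed 10% of the time on a hard problem
# give a combined chance of about 27% that at least one succeeds.
combined = p_any_correct([0.1, 0.1, 0.1])
```

In practice the experts share one underlying model, so their successes are likely correlated and the real gain would be smaller than the independent estimate suggests.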
Beyond just generating diverse responses, MEML-GRPO also incorporates an ‘inter-expert mutual learning’ mechanism. This means that the different ‘experts’ within the system don’t just work in isolation; they learn from each other. If one expert finds a successful way to solve a problem, that knowledge is shared and transferred to other experts, helping the weaker ones improve. This collaborative learning boosts the overall performance of the model, allowing all experts to become more competitive.
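One way to picture this knowledge sharing is the sketch below. It is an assumption made for illustration, not necessarily the mechanism the paper uses: responses from experts that solved a problem become training targets for the experts that did not. All names and the data layout are hypothetical.

```python
def build_transfer_examples(results):
    """Build (student prompt, teacher response) pairs for one problem.

    `results` maps an expert name to a dict with the keys
    'prompt', 'response', and 'reward' (hypothetical layout).
    """
    winners = [name for name, r in results.items() if r["reward"] > 0]
    losers = [name for name, r in results.items() if r["reward"] <= 0]
    examples = []
    for loser in losers:
        for winner in winners:
            examples.append({
                "student": loser,
                "teacher": winner,
                # The failing expert keeps its own prompt...
                "prompt": results[loser]["prompt"],
                # ...but is trained toward the successful expert's response.
                "target": results[winner]["response"],
            })
    return examples

results = {
    "step_by_step": {"prompt": "P1", "response": "The answer is 4.", "reward": 1.0},
    "equations": {"prompt": "P2", "response": "The answer is 5.", "reward": 0.0},
    "verify": {"prompt": "P3", "response": "The answer is 3.", "reward": 0.0},
}
pairs = build_transfer_examples(results)
```

If no expert solves the problem, the function returns an empty list, which is the case the hard example mechanism is meant to cover.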
Furthermore, MEML-GRPO includes a ‘hard example accumulation’ feature. For problems where all experts struggle to find a correct answer, the system stores these ‘hard examples’. It then uses a technique called supervised fine-tuning (SFT) with the actual correct answers to ensure that the model continues to learn from these challenging cases. This prevents the model from stalling on particularly difficult problems and ensures continuous progress.
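A minimal sketch of such a store is shown below, assuming binary rewards where anything above zero counts as correct. The class and method names are hypothetical, and the actual SFT step is left out.

```python
class HardExampleBuffer:
    """Collects problems that no expert solved, for later SFT."""

    def __init__(self):
        self.examples = []

    def maybe_add(self, question, reference_answer, rewards):
        """Store the problem only if every expert's reward is zero or less."""
        if rewards and max(rewards) <= 0:
            self.examples.append({"prompt": question, "target": reference_answer})
            return True
        return False

    def drain(self):
        """Return the accumulated examples as an SFT batch and reset."""
        batch, self.examples = self.examples, []
        return batch

buffer = HardExampleBuffer()
buffer.maybe_add("Hard question", "42", rewards=[0.0, 0.0, 0.0])  # stored
buffer.maybe_add("Easy question", "4", rewards=[1.0, 0.0, 1.0])   # skipped
sft_batch = buffer.drain()
```

Pairing each stored question with its reference answer gives the model a supervised target for exactly the cases where reinforcement learning alone produced no signal.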
Experiments show that MEML-GRPO delivers substantial improvements across several reasoning benchmarks, covering mathematical reasoning (GSM8K, MathQA) and commonsense reasoning (StrategyQA). It achieved an average performance gain of 4.89% with Qwen models and 11.33% with Llama models compared to traditional RLVR methods. These results indicate that MEML-GRPO is effective at overcoming a core limitation of existing RLVR approaches, the reward sparsity problem.
In essence, MEML-GRPO strikes a balance between exploring new solutions and exploiting known successful strategies. By dynamically integrating complementary strengths from multiple reasoning approaches, it ensures steady learning progress and pushes the boundaries of what LLMs can achieve in complex reasoning tasks. For more technical details, you can refer to the full research paper: MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement.


