TLDR: A new research paper introduces ROVER (Random Policy Valuation for Diverse Reasoning), a simplified reinforcement learning algorithm for Large Language Models (LLMs) in math reasoning tasks. By leveraging the specific, simpler structure of these tasks, ROVER bypasses complex policy optimization loops, instead deriving optimal actions from the Q-function of a fixed uniformly random policy. This minimalist approach leads to superior performance in both solution quality and diversity, more stable training, and better generalization across various mathematical benchmarks, demonstrating that simpler methods can achieve state-of-the-art results.
In the rapidly evolving field of artificial intelligence, Large Language Models (LLMs) are increasingly being trained to tackle complex reasoning tasks, particularly in mathematics. A promising approach for enhancing these capabilities is Reinforcement Learning with Verifiable Rewards (RLVR). However, current RLVR methods, often relying on sophisticated policy optimization frameworks like PPO and GRPO, frequently encounter challenges such as training instability and a reduction in the diversity of solutions. These issues necessitate complex adjustments and careful tuning, making the training process intricate and often unpredictable.
A recent research paper, titled “Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards,” introduces a novel and surprisingly simple algorithm called ROVER (Random Policy Valuation for Diverse Reasoning). Authored by Haoran He, Yuxiao Ye, Qingpeng Cai, Chen Hu, Binxing Jiao, Daxin Jiang, and Ling Pan from institutions including the Hong Kong University of Science and Technology, Kuaishou Technology, and StepFun, this work offers a fresh perspective on how LLMs can achieve high-quality and diverse reasoning.
The Core Insight: Simpler Problems, Simpler Solutions
The researchers observed that standard RLVR in math reasoning can be formalized as a specialized type of Markov Decision Process (MDP). Unlike the general-purpose control settings for which algorithms like PPO were originally developed (e.g., computer games or robotics with complex, graph-like state transitions), LLM math reasoning tasks often involve deterministic state transitions, tree-structured dynamics, and binary terminal rewards (either correct or incorrect). This underlying structure is significantly simpler, suggesting that many of the sophisticated techniques used in existing RL methods might be unnecessary.
Based on this insight, the paper presents a surprising theoretical result: in these specific MDPs, the optimal action can be recovered by evaluating the Q-function of a fixed, uniformly random policy and acting greedily with respect to it. This means the traditional generalized policy iteration (GPI) loop, which alternates between evaluating a policy and improving it, can be bypassed entirely. This simplification eliminates many of the heuristic tricks and the delicate tuning that plague current methods.
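To make the intuition concrete, here is a minimal toy sketch (not taken from the paper): in a small deterministic tree with binary terminal rewards, the uniform-policy Q-value of an action is positive exactly when a correct answer is still reachable beneath it, so the greedy action under those values coincides with the optimal one.

```python
# Hypothetical toy example (not from the paper): a depth-2 tree MDP with
# deterministic transitions and binary terminal rewards. We compute the
# Q-function of a *uniformly random* policy by backward recursion and check
# that acting greedily on it picks the branch that can still reach a
# correct terminal state, exactly as an optimal policy would.

# Each state maps actions to (next_state or None, reward on that transition).
tree = {
    "root": {"a": ("s_a", 0.0), "b": ("s_b", 0.0)},
    "s_a": {"x": (None, 0.0), "y": (None, 0.0)},   # no correct answer reachable
    "s_b": {"x": (None, 1.0), "y": (None, 0.0)},   # one correct terminal below
}

def q_uniform(state, action):
    """Q-value of (state, action) under a uniformly random continuation."""
    next_state, reward = tree[state][action]
    if next_state is None:                      # terminal transition
        return reward
    succ = tree[next_state]
    return reward + sum(q_uniform(next_state, a) for a in succ) / len(succ)

q_root = {a: q_uniform("root", a) for a in tree["root"]}
print(q_root)                        # {'a': 0.0, 'b': 0.5}
print(max(q_root, key=q_root.get))   # 'b' -> the branch leading to the correct answer
```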
Introducing ROVER: Simplicity Meets Effectiveness
ROVER translates this theoretical principle into a practical and scalable algorithm. Instead of iteratively optimizing a policy, ROVER focuses on evaluating a fixed uniformly random policy. The Q-values derived from this evaluation are then used to guide action selection. While a naive greedy selection based on these Q-values guarantees optimality, it can lead to a lack of diversity in solutions. To address this, ROVER samples actions from a softmax distribution over these uniform-policy Q-values. This approach maintains diversity by allowing exploration of multiple valid reasoning pathways, aligning well with modern LLM decoding strategies.
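A minimal sketch of this sampling step, assuming the uniform-policy Q-values for the candidate next tokens are already available as a vector; the temperature knob here is illustrative, not a value taken from the paper:

```python
import numpy as np

def sample_action(q_values: np.ndarray, temperature: float = 1.0) -> int:
    """Sample an action from a softmax over uniform-policy Q-values.

    Greedy selection (argmax) would also be optimal in this setting, but
    sampling keeps multiple valid reasoning paths alive, which is where the
    diversity gains come from.
    """
    logits = q_values / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))
```

This mirrors standard temperature sampling in LLM decoding, which is why the approach slots naturally into existing generation pipelines.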
The practical implementation of ROVER also introduces several clever techniques. It parameterizes the Q-function directly with the LLM’s own parameters, removing the need for a separate value network. To stabilize training and provide a dense reward signal, ROVER employs a low-variance reward mechanism: it samples multiple responses for each prompt, mean-centers their verifiable rewards, and broadcasts each centered reward to every token of the corresponding response.
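The reward shaping described above might look roughly like the sketch below, assuming binary verifiable rewards and a hypothetical helper `token_level_rewards`; the exact normalization ROVER uses may differ.

```python
import numpy as np

def token_level_rewards(group_rewards, response_lengths):
    """Mean-center the verifiable rewards of a group of responses sampled for one
    prompt, then broadcast each centered reward to every token of its response.

    Illustrative sketch of the low-variance, dense reward signal described in
    the article; not the authors' exact implementation.
    """
    rewards = np.asarray(group_rewards, dtype=float)   # e.g. 1.0 correct, 0.0 incorrect
    centered = rewards - rewards.mean()                # mean-centering within the group
    return [np.full(n, r) for r, n in zip(centered, response_lengths)]

# Example: 4 sampled responses, 2 correct -> per-token rewards of +0.5 or -0.5
per_token = token_level_rewards([1, 0, 1, 0], [5, 3, 4, 6])
```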
Impressive Results Across Benchmarks
Despite its radical simplification, ROVER demonstrates superior performance across multiple base models and standard math reasoning benchmarks. On competition-level tasks like AIME24, AIME25, and HMMT25, ROVER showed significant improvements in both solution quality (e.g., +8.2 on pass@1 and +16.8 on pass@256) and diversity (+17.6%) compared to strong but more complex existing methods. It also exhibited remarkable generalization on out-of-distribution tasks such as GPQA-diamond, a challenging benchmark of graduate-level science questions.
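As background on these metrics, pass@k is commonly computed with the standard unbiased estimator shown below; this is general background rather than a detail specific to ROVER.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generated solutions, of which c
    are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=256, c=32, k=1))    # ~0.125
print(pass_at_k(n=256, c=32, k=256))  # 1.0
```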
Behavioral analysis revealed that ROVER encourages enhanced reflection behaviors in LLMs, leading to a higher frequency of tokens associated with rethinking and self-correction. Furthermore, ROVER was found to discover novel reasoning strategies that were absent in base models and those trained with standard RL approaches, pushing the boundaries of LLM reasoning. The algorithm also scales robustly at test-time, consistently improving upon base models across various metrics like majority voting (maj@k).
Ablation studies confirmed the importance of the expected Q-value of the successor state in the Bellman target, showing that while this term is crucial, ROVER is not overly sensitive to its precise scaling.
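For context, the Bellman target in question evaluates the fixed uniform policy rather than a greedy/max target; in standard notation it takes roughly the form below (a sketch only, as the paper's exact parameterization and scaling may differ):

$$
Q^{U}(s_t, a_t) \;=\; r(s_t, a_t) \;+\; \mathbb{E}_{a' \sim \mathrm{Uniform}(\mathcal{A}(s_{t+1}))}\!\left[ Q^{U}(s_{t+1}, a') \right],
$$

where the second term is the expected successor-state Q-value whose contribution the ablation probes.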
This research provides a strong foundation for simplifying RLVR in deterministic, tree-structured MDPs with binary terminal rewards, a structure that naturally aligns with autoregressive LLM generation. The paper can be accessed in full at arXiv:2509.24981.
ROVER’s minimalist yet highly effective approach challenges the conventional wisdom that complex problems require complex solutions, opening new avenues for developing more robust and simplified methods for LLM reasoning and beyond.


