TLDR: wd1 is a new reinforcement learning method for diffusion-based large language models (dLLMs) that improves reasoning capabilities and computational efficiency. It uses a weighted log-likelihood objective, requiring only a single likelihood approximation, which avoids bias from policy ratios and reduces overhead. Experiments show wd1 outperforms existing methods like d1 on reasoning benchmarks, achieving higher accuracy and faster training without needing supervised fine-tuning.
In the rapidly evolving field of artificial intelligence, large language models (LLMs) are constantly being refined to improve their reasoning capabilities. A particularly promising area involves diffusion-based large language models (dLLMs), which generate text by iteratively refining entire response sequences through a denoising process. This can make inference considerably more efficient than with traditional autoregressive models, which produce one token at a time.
However, applying reinforcement learning (RL) to enhance dLLMs’ reasoning has faced a significant hurdle: the complexity of their likelihood functions. Existing RL methods, such as Group Relative Policy Optimization (GRPO) adapted for dLLMs (like d1), require approximating the likelihoods of multiple policies (current, old, and reference) at each optimization step. This not only adds substantial computational overhead but also introduces potential biases, especially when approximation errors occur in the policy ratios used for importance sampling.
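To make the bias concern concrete, the following minimal sketch shows how an error in approximated log-likelihoods behaves once it enters a policy ratio. The numbers and the names `policy_ratio`, `err_new`, and `err_old` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def policy_ratio(logp_new: float, logp_old: float) -> float:
    """Importance ratio pi_new / pi_old computed from log-likelihoods."""
    return math.exp(logp_new - logp_old)


# Hypothetical "true" sequence log-likelihoods under the new and old policy.
true_new, true_old = -42.0, -42.5

# Hypothetical approximation errors (a dLLM likelihood can only be estimated).
err_new, err_old = 0.3, -0.2

exact = policy_ratio(true_new, true_old)
approx = policy_ratio(true_new + err_new, true_old + err_old)

# The errors are exponentiated, so they scale the ratio multiplicatively:
# approx / exact == exp(err_new - err_old).
ratio_distortion = approx / exact

# A term that is linear in the log-likelihood only shifts additively.
weight = 0.5  # hypothetical fixed weight on the log-likelihood term
linear_shift = weight * (true_new + err_new) - weight * true_new

print(f"ratio distorted by a factor of {ratio_distortion:.3f}")
print(f"weighted log-likelihood shifted by {linear_shift:.3f}")
```

The sketch only illustrates the arithmetic: errors in two separately approximated likelihoods compound inside the exponential, whereas an objective that is linear in a single log-likelihood is perturbed in proportion to the error itself.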
Introducing wd1: A Novel Approach to dLLM Optimization
To overcome these challenges, researchers from the UCL AI Centre have introduced a new policy optimization method called wd1. It reformulates the objective as a weighted likelihood, so that only a single approximation is required: that of the current parametrized policy’s likelihood. This removes the need for explicit policy ratios, which mitigates the large biases that approximation errors can cause and reduces computational demands.
The core idea behind wd1 is to optimize a weighted log-likelihood objective. This objective is derived from approximating a closed-form solution of single-iterate reverse-KL-constrained policy optimization. Crucially, wd1 also incorporates a complementary penalty term that minimizes the likelihood of low-advantage completions, ensuring that both positive and negative samples are effectively utilized in the optimization process. This is achieved through the use of group-relative advantage to determine positive and negative weights.
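The idea described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: it assumes that the positive and negative weights are softmax-normalized over the group-relative advantages, that the advantage is the reward minus the group mean, and that a `temperature` parameter controls the sharpness of the weights. The function names and all values are hypothetical; in real training, `logps` would be the dLLM's approximated sequence log-likelihoods inside an autodiff framework.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Centre each reward on the mean of its group (assumed form)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_loglik_loss(
    logps: List[float], rewards: List[float], temperature: float = 1.0
) -> float:
    """Negative weighted log-likelihood over one group of completions.

    Positive weights favour high-advantage completions; negative weights
    penalize the likelihood of low-advantage completions.
    """
    adv = group_relative_advantages(rewards)
    w_pos = softmax([a / temperature for a in adv])
    w_neg = softmax([-a / temperature for a in adv])
    return -sum((wp - wn) * lp for wp, wn, lp in zip(w_pos, w_neg, logps))


# Hypothetical group of four completions sampled for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
logps = [-10.0, -12.0, -11.0, -9.0]
print(weighted_loglik_loss(logps, rewards))
```

Note that only the log-likelihood under the current policy appears in the loss; no old-policy or reference-policy likelihood is evaluated. Under these assumptions, raising the log-likelihood of a high-reward completion lowers the loss, raising that of a low-reward completion increases it, and a group with identical rewards contributes nothing.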
Performance and Efficiency Gains
Experiments conducted on widely used reasoning benchmarks, including Sudoku, Countdown, GSM8K, and MATH500, demonstrate that wd1 delivers superior performance. Without the need for supervised fine-tuning (SFT) or any supervised data, wd1 consistently outperforms existing RL methods for dLLMs, such as d1. For instance, wd1 achieved up to 16% higher accuracy on these benchmarks. On the Countdown task, it showed up to a 25% improvement with maximum length 256, and a remarkable 38% gain relative to the base LLaDA model.
Beyond accuracy, wd1 also brings substantial computational benefits. Unlike d1, wd1 eliminates the need for an SFT stage, which alone can account for hours of training time. During the RL training phase, wd1 is faster still, with reduced runtime, fewer FLOPs, and a lower number of function evaluations (NFEs) per gradient step. This efficiency is attributed to wd1 bypassing the need to approximate the likelihood of the old policy.
The training dynamics further highlight wd1’s advantages, showing a notably faster reward increase and superior sample efficiency compared to d1. For math reasoning tasks like GSM8K and MATH500, wd1 converges to shorter output sequences, indicating improved token efficiency while maintaining or enhancing performance.
The simplicity of wd1’s implementation and its R1-Zero-like training (no SFT) position it as a more effective and efficient method for applying reinforcement learning to dLLMs for reasoning tasks. For more technical details, refer to the full research paper, “wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models”.


