TL;DR: Reward-Weighted Sampling (RWS) is a new decoding method for Masked Diffusion Models (MDMs) that uses an external reward model to guide text generation. Unlike standard methods, which often produce sequential, autoregressive-like outputs, RWS evaluates the quality of the entire predicted sequence at each step and adjusts token selection accordingly. This promotes a more non-autoregressive generation order, improving text quality, coherence, and flexibility at the cost of some computational overhead.
Large Language Models (LLMs) have transformed how we interact with technology, excelling in various natural language tasks. Traditionally, many LLMs operate in an ‘autoregressive’ manner, meaning they generate text sequentially, token by token. While effective, this approach can sometimes lead to cumulative errors, where an early mistake can propagate and affect the coherence of longer texts.
Recently, a promising alternative has emerged: Masked Diffusion Models (MDMs). Unlike their autoregressive counterparts, MDMs generate text by iteratively unmasking tokens in parallel, leveraging a full, bidirectional understanding of the text. This non-autoregressive approach holds the potential to mitigate the error propagation issues seen in sequential generation.
However, a challenge with current MDMs is that their standard decoding methods, such as confidence-based sampling, often inadvertently fall back into an autoregressive-like pattern. This happens because tokens adjacent to already unmasked parts of the text tend to receive higher confidence scores, leading to a sequential, left-to-right unmasking process. This limits the true non-autoregressive potential of MDMs, especially for tasks requiring global coherence.
To address this, researchers Daehoon Gwak, Minseo Jung, Junwoo Park, Minho Park, ChaeHun Park, Junha Hyung, and Jaegul Choo have introduced a novel decoding strategy called Reward-Weighted Sampling (RWS). This method is designed to enhance the non-autoregressive characteristics of MDMs by integrating a ‘global signal’ during the text generation process.
How Reward-Weighted Sampling Works
RWS introduces an external ‘reward model’ into the iterative decoding process of MDMs. Here’s a simplified breakdown of its steps, with a code sketch after the list:
- Potential Full Sequence Prediction: At each step of the diffusion process, the model first predicts what the complete, unmasked text might look like given its current state.
- Reward Evaluation: This predicted full sequence is then fed into an external reward model. This model evaluates the overall quality and coherence of the entire sequence, providing a ‘reward score’. This score reflects the global quality, not just individual token confidences.
- Reward-Weighted Logit Scaling: The original prediction scores (logits) for individual tokens are then scaled by a factor derived from this global reward. A high reward increases the scaling factor, which can cause a ‘rank reversal’ in token selection: tokens that initially had lower confidence may now be prioritized.
- Guided Token Selection: Finally, tokens are selected to be unmasked based on these reward-adjusted scores. This adaptive adjustment encourages a more diverse and non-sequential generation order, moving away from the default left-to-right pattern.
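To make these steps concrete, here is a minimal PyTorch sketch of a single RWS decoding step. The `mdm` and `reward_model` callables, the mask token id, and the sigmoid-based scaling schedule are illustrative assumptions, not the paper’s exact formulation:

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # LLaDA's mask token id (assumed)

def rws_step(mdm, reward_model, tokens, num_to_unmask, alpha=1.0):
    """One reward-weighted unmasking step over a partially masked sequence.

    tokens: 1-D LongTensor; masked positions hold MASK_ID.
    mdm(tokens) -> logits of shape (seq_len, vocab_size)  (assumed API)
    reward_model(candidate) -> scalar reward tensor       (assumed API)
    """
    masked = tokens == MASK_ID

    # Step 1: predict a candidate completion for every masked position.
    logits = mdm(tokens)                              # (seq_len, vocab)
    candidate = tokens.clone()
    candidate[masked] = logits[masked].argmax(dim=-1)

    # Step 2: score the full candidate sequence with the reward model.
    reward = reward_model(candidate)                  # global quality score

    # Step 3: scale the logits by a reward-dependent factor; a higher
    # reward sharpens the distribution (this schedule is an assumption).
    scale = 1.0 + alpha * torch.sigmoid(reward)
    probs = F.softmax(scale * logits, dim=-1)

    # Step 4: unmask the masked positions with the highest reward-adjusted
    # confidence; these need not be the leftmost ones.
    conf = probs.max(dim=-1).values
    conf[~masked] = float("-inf")                     # only fill masked slots
    pick = conf.topk(num_to_unmask).indices
    tokens = tokens.clone()
    tokens[pick] = candidate[pick]
    return tokens
```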
The paper’s theoretical analysis shows that this reward-based logit scaling consistently improves the expected reward of generated tokens and creates conditions under which tokens with initially lower confidence become preferred, promoting a genuinely non-autoregressive generation pattern.
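To build intuition for this rank reversal, consider a toy example. Because softmax is nonlinear, multiplying all logits by a reward-dependent factor can change which masked position carries the highest top-token confidence. The numbers below are invented purely for illustration:

```python
import torch
import torch.nn.functional as F

# Two masked positions with different logit profiles over three candidates.
pos_a = torch.tensor([2.0, 0.0, 0.0])    # one dominant candidate
pos_b = torch.tensor([1.5, 1.4, -10.0])  # two near-tied candidates

for scale in (0.1, 1.0):  # low reward -> small scale; high reward -> large
    conf_a = F.softmax(scale * pos_a, dim=-1).max().item()
    conf_b = F.softmax(scale * pos_b, dim=-1).max().item()
    winner = "A" if conf_a > conf_b else "B"
    print(f"scale={scale}: conf_A={conf_a:.3f}, conf_B={conf_b:.3f} -> {winner}")

# scale=0.1 picks B (0.379 vs 0.433); scale=1.0 picks A (0.787 vs 0.525):
# the confidence ranking across positions reverses as the scale grows.
```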
Demonstrated Improvements
Experiments using LLaDA-8B-Instruct as the base MDM, evaluated across various reward models and benchmarks, showed significant improvements:
- Non-Autoregressive Behavior: RWS consistently achieved higher Generation Order Deviation (GOD) values, indicating a greater departure from strict left-to-right generation compared to standard methods (a rough sketch of such a metric follows this list).
- Generation Quality: The method outperformed baseline confidence-based sampling across multiple metrics, including win rates on the RewardBench dataset (which assesses alignment with human preferences) and MT-Bench (for multi-turn conversations).
- Fluency and Coherence: In keyword-constrained generation tasks, RWS produced text with lower perplexity (indicating better fluency) while successfully incorporating all required keywords.
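The paper defines GOD formally; as a rough proxy (an assumption here, not the paper’s formula), one can measure how far the order in which positions were unmasked deviates from strict left-to-right decoding:

```python
def generation_order_deviation(unmask_order):
    """Rough proxy for GOD: mean absolute displacement between the step at
    which each position was unmasked and its left-to-right rank, normalized
    so that 0.0 means strictly sequential decoding. (Assumed definition,
    not the paper's exact formula.)

    unmask_order: list where unmask_order[i] is the sequence position
    unmasked at step i.
    """
    n = len(unmask_order)
    if n < 2:
        return 0.0
    # Strict left-to-right decoding gives unmask_order == [0, 1, ..., n-1].
    displacement = sum(abs(step - pos) for step, pos in enumerate(unmask_order))
    max_disp = n * n // 2  # attained by fully reversed (right-to-left) order
    return displacement / max_disp

print(generation_order_deviation([0, 1, 2, 3]))  # 0.0: purely left-to-right
print(generation_order_deviation([3, 0, 2, 1]))  # 0.75: non-autoregressive
```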
For instance, in multi-turn dialogues, RWS generated more coherent and contextually rich responses, avoiding the repetitive and vague outputs often seen with default sampling.
Considerations and Future Directions
While RWS offers substantial benefits, it does introduce some computational overhead, increasing inference time by approximately 21-33% compared to standard sampling. Additionally, its performance is inherently tied to the quality and potential biases of the external reward models used. Future research aims to reduce this overhead and explore more robust techniques for selecting and calibrating reward models.
In conclusion, Reward-Weighted Sampling represents a significant step forward in unlocking the full non-autoregressive potential of Masked Diffusion Models, leading to higher-quality, more coherent, and flexible text generation. You can read the full research paper here.