TLDR: Discrete Diffusion Forcing (D2F) is a method that lets Diffusion Large Language Models (dLLMs) run inference faster than traditional autoregressive (AR) LLMs. By combining block-wise autoregressive generation, which makes KV caching possible, with inter-block parallel decoding, D2F turns a dLLM into an efficient AR-diffusion hybrid. With it, open-source dLLMs reach up to 2.5x the throughput of comparable AR models and run over 50x faster than vanilla dLLMs, while maintaining comparable output quality.
Large Language Models (LLMs) have become central to text generation, with autoregressive (AR) models traditionally dominating the field. However, a newer class of models, Diffusion Large Language Models (dLLMs), has emerged, promising the ability to decode multiple tokens simultaneously, potentially offering a significant leap in inference speed. Despite this theoretical advantage, open-source dLLMs have struggled to match or surpass the inference speeds of their AR counterparts – until now.
A recent research paper introduces a breakthrough strategy called Discrete Diffusion Forcing (D2F), which enables dLLMs to achieve faster-than-AR inference speeds. This marks a significant milestone, as D2F-equipped dLLMs are the first open-source models to demonstrate superior inference throughput compared to similarly sized AR LLMs.
The Challenge of dLLMs
Traditional AR LLMs generate text token by token, a sequential process that can be slow, especially for long outputs. dLLMs, on the other hand, aim to denoise a fully masked sequence iteratively, allowing for parallel prediction of all tokens. While closed-source dLLMs like Gemini Diffusion and Mercury have shown impressive speeds, open-source dLLMs have faced hurdles, primarily due to incompatibility with standard KV (Key-Value) cache mechanisms and limitations in parallelization. Existing acceleration methods for dLLMs have offered only limited speedups, often failing to match the efficiency of AR models.
Introducing Discrete Diffusion Forcing (D2F)
D2F addresses these challenges by transforming vanilla dLLMs into an AR-diffusion hybrid paradigm. It equips dLLMs with two crucial capabilities:
- Block-wise Autoregressive Generation: This allows dLLMs to utilize the efficient KV cache, significantly reducing redundant computations. Instead of processing the entire sequence at once, D2F breaks it into blocks, processing them in a way that allows previously generated block states to be reused.
- Inter-block Parallel Decoding: D2F trains the model to predict subsequent tokens without needing prior blocks to be fully completed. This means multiple blocks can be decoded in parallel, maximizing the number of tokens generated in each inference step.
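To make the block-wise idea concrete, here is a minimal sketch of a block-causal attention mask. It assumes that tokens attend freely within their own block and to all earlier blocks, but never to later ones. The function name `block_causal_mask` and the sizes are illustrative, not taken from the paper's code.

```python
def block_causal_mask(seq_len, block_size):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption: attention is bidirectional inside a block and causal
    across blocks, so a position sees its own block and every earlier one.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # 4 tokens split into 2 blocks of 2 tokens each.
    for row in block_causal_mask(4, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this kind of mask, a finished block never depends on later blocks, so its keys and values can be cached and reused rather than recomputed at every denoising step.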
The implementation of D2F involves an asymmetric distillation process. This means a D2F dLLM (the student) is trained by mimicking the predictions of a pre-trained, standard bidirectional dLLM (the teacher). The student model learns to predict content from a causally restricted view (only seeing preceding noisy blocks), while the teacher provides a global view. This distillation efficiently transfers the mask prediction capabilities of existing dLLMs into the new D2F framework.
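The distillation objective can be sketched roughly as follows. This toy version assumes a KL-divergence loss between teacher and student token distributions, averaged over masked positions only; the exact loss in the paper may differ, and the helper names `softmax` and `distill_loss` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_logits, student_logits, masked):
    """Mean KL(teacher || student) over masked positions (assumed objective).

    teacher_logits: per-position logits from the bidirectional teacher.
    student_logits: per-position logits from the causally restricted student.
    masked: per-position booleans, True where the token is still masked.
    """
    total, count = 0.0, 0
    for t, s, is_masked in zip(teacher_logits, student_logits, masked):
        if not is_masked:
            continue  # only masked positions contribute to the loss
        p, q = softmax(t), softmax(s)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    teacher = [[2.0, 0.5, 0.1], [0.3, 1.7, 0.2]]
    student = [[1.0, 0.9, 0.4], [0.3, 1.7, 0.2]]
    print(distill_loss(teacher, student, masked=[True, True]))
```

The loss is zero when the student reproduces the teacher's distribution at every masked position and grows as the two diverge.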
Furthermore, the paper proposes a pipelined parallel decoding algorithm for inference. This algorithm uses a sliding window of active blocks, dynamically adding new masked blocks as decoding progresses. It also incorporates a dual-state decoding mechanism, where newly added blocks start in a ‘semi-activated’ state for conservative decoding and transition to ‘fully-activated’ for more aggressive decoding once sufficient context is available from preceding blocks. This synergy optimizes both per-step efficiency and inter-block parallelism.
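The pipeline described above can be imitated with a small simulation. This is a sketch under stated assumptions, not the paper's algorithm: random numbers stand in for model confidences, a semi-activated block commits only tokens above a confidence threshold, and a fully-activated block additionally commits at least one token per step. The thresholds `tau_add`, `tau_act`, and `tau_conf` and their default values are illustrative.

```python
import random


def pipelined_decode(num_blocks, block_size, tau_add=0.1, tau_act=0.95,
                     tau_conf=0.9, seed=0):
    """Toy simulation of pipelined block decoding; returns the step count."""
    rng = random.Random(seed)
    remaining = [block_size]  # masked tokens left per block added so far
    steps = 0
    while len(remaining) < num_blocks or any(r > 0 for r in remaining):
        steps += 1
        # Snapshot of each block's completion ratio at the start of the step.
        done = [1 - r / block_size for r in remaining]
        for b, r in enumerate(remaining):
            if r == 0:
                continue  # block already fully decoded
            # Fully-activated once the preceding block is nearly complete.
            fully = b == 0 or done[b - 1] >= tau_act
            # Stand-in confidences for the still-masked tokens (assumption).
            confident = sum(rng.random() > tau_conf for _ in range(r))
            if fully:
                confident = max(confident, 1)  # aggressive: always progress
            remaining[b] = r - confident
        # Slide the window: append a new masked block once the newest
        # block has decoded enough of its tokens.
        newest_done = 1 - remaining[-1] / block_size
        if len(remaining) < num_blocks and newest_done >= tau_add:
            remaining.append(block_size)
    return steps


if __name__ == "__main__":
    total_tokens = 4 * 8
    print(pipelined_decode(4, 8), "steps for", total_tokens, "tokens")
```

Because the earliest unfinished block is always fully-activated, every step commits at least one token, so the step count never exceeds the token count. Any savings come from several blocks making progress in the same step.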
Groundbreaking Results
Empirical evaluations demonstrate D2F’s effectiveness. D2F dLLMs reach up to roughly 2.5 times the inference throughput of leading AR LLMs such as LLaMA3 and Qwen2.5 on benchmarks such as GSM8K. For instance, D2F-Dream-Base-7B achieved a throughput of 119.9 tokens/second on GSM8K, compared with 48.0 tokens/second for LLaMA3-Instruct-8B and 52.7 tokens/second for Qwen2.5-Base-7B.
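The speedups implied by these GSM8K throughput figures can be checked with a few lines of arithmetic. The variable names below are illustrative; the numbers are the ones quoted above.

```python
# Throughput in tokens/second on GSM8K, as quoted in the article.
d2f_dream = 119.9
llama3_instruct = 48.0
qwen25_base = 52.7

speedup_vs_llama = d2f_dream / llama3_instruct
speedup_vs_qwen = d2f_dream / qwen25_base

print(f"vs LLaMA3-Instruct-8B: {speedup_vs_llama:.2f}x")
print(f"vs Qwen2.5-Base-7B:    {speedup_vs_qwen:.2f}x")
```

The ratios come out to roughly 2.5x against LLaMA3-Instruct-8B and roughly 2.3x against Qwen2.5-Base-7B.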
Compared to vanilla dLLMs like LLaDA and Dream, the acceleration is even more dramatic, exceeding 50x while maintaining comparable output quality. D2F-LLaDA-Instruct-8B, for example, achieved a 52.9x speedup on MBPP with minimal performance difference. These results surpass existing dLLM acceleration techniques and make D2F-equipped models the first open-source dLLMs to outrun AR models in throughput, which significantly improves the practical utility of dLLMs.
The research also includes ablation studies, confirming that both the KV cache utilization and the parallel decoding pipeline are crucial for these performance gains. The D2F training strategy, with its structured progressive noising, also proved superior to random noise schedules.
This work represents a significant step forward in the field of large language models, making dLLMs a more viable and efficient alternative for various text generation tasks. The code for Discrete Diffusion Forcing is available for public use, fostering further innovation in the community. You can find more details in the full research paper.


