TLDR: Planned Diffusion is a hybrid text-generation method for large language models that combines autoregressive planning with parallel diffusion-based execution. The model first generates a high-level plan of the text structure using control tags, then generates the content of the independent text segments simultaneously. On AlpacaEval, this yields speedups of 1.27x to 1.81x over pure autoregressive generation with a 0.87% to 5.4% drop in win rate, and runtime parameters let users adjust the balance between speed and quality.
Large language models (LLMs) face a fundamental tension: how to generate high-quality text quickly. Traditional autoregressive models, which generate text one token at a time, produce excellent results but are slow. Diffusion models, by contrast, can generate many tokens in parallel, but often need many denoising iterations to match the quality of autoregressive models.
A research paper introduces an approach called Planned Diffusion that aims to ease this trade-off. The hybrid method combines the autoregressive and diffusion paradigms to achieve faster, high-quality text generation.
How Planned Diffusion Works
Planned Diffusion operates in two distinct stages, making text generation a dynamic parallel scheduling problem:
First, the model enters a sequential ‘planning’ stage. Here, it acts autoregressively, meaning it generates a high-level execution plan token by token. This plan isn’t the final text itself, but rather a structural outline composed of special ‘control tags’. These tags break down the overall output into smaller, independent segments or ‘spans’. For example, if the model needs to answer a question with a bulleted list, the planning stage would define each bullet point as a separate, independent span.
Second, once the plan is established, the model moves into a ‘parallel diffusion’ stage. In this phase, it executes the plan by simultaneously generating the text for all the independent spans defined in the planning stage. This means multiple parts of the answer can be created at the same time, significantly speeding up the process. The model uses diffusion techniques to fill in the content for each span.
Imagine asking an AI, “What is Aurora Borealis?” Planned Diffusion might first plan to define it, then describe it, and finally state its location. In the second stage, it would generate the definition, description, and location text all at once, then combine them into a coherent answer.
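The two stages can be illustrated with a toy sketch. Everything named here is hypothetical: the `<span topic=... len=.../>` tag syntax, the function names, and the stand-in denoiser are invented for illustration, since this article does not spell out the paper's actual control tag language or model interface.

```python
import re
from dataclasses import dataclass

MASK = "_"  # placeholder for a not-yet-generated token

@dataclass
class Span:
    topic: str
    length: int  # number of tokens the plan reserves for this span

# Hypothetical tag syntax, NOT the paper's real control tag language.
TAG = re.compile(r'<span topic="([^"]+)" len="(\d+)"/>')

def parse_plan(plan: str) -> list[Span]:
    """Stage 1 output (the plan) -> list of independent spans."""
    return [Span(topic, int(n)) for topic, n in TAG.findall(plan)]

def toy_denoise_step(span: Span, tokens: list[str]) -> list[str]:
    """Stand-in for one diffusion step: reveal one masked position."""
    out = list(tokens)
    for i, tok in enumerate(out):
        if tok == MASK:
            out[i] = f"{span.topic}{i}"
            break
    return out

def execute_plan(spans: list[Span]) -> tuple[list[list[str]], int]:
    """Stage 2: fill every span, updating all unfinished spans per step."""
    canvases = [[MASK] * s.length for s in spans]  # all spans start masked
    steps = 0
    while any(MASK in c for c in canvases):
        # A real model would update all spans in one batched forward pass.
        canvases = [
            toy_denoise_step(s, c) if MASK in c else c
            for s, c in zip(spans, canvases)
        ]
        steps += 1
    return canvases, steps

plan = (
    '<span topic="definition" len="3"/>'
    '<span topic="description" len="4"/>'
    '<span topic="location" len="2"/>'
)
spans = parse_plan(plan)
canvases, steps = execute_plan(spans)
print(steps, [len(c) for c in canvases])
```

In this toy version the nine tokens are filled in as many steps as the longest span needs, because all three spans advance together rather than one after another.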
Key Innovations and Performance
The researchers developed a specific ‘control tag language’ to enable this two-stage process, along with a tailored training methodology and an inference algorithm that uses KV caching for efficiency. The model is trained to understand these tags and switch between autoregressive and diffusion-based generation seamlessly.
Evaluated on AlpacaEval, a benchmark of 805 instruction-following prompts, Planned Diffusion achieves a Pareto-optimal trade-off between quality and latency. It delivers a speedup of 1.27x to 1.81x over pure autoregressive generation, with a drop in win rate of only 0.87% to 5.4%. In practice, answers arrive up to 1.81x faster at a modest cost in output quality.
The speedup is largely attributed to a shorter ‘critical path’ of generation, meaning the longest chain of steps that must run one after another. Because multiple parts of the text are generated in parallel, the total number of sequential steps is significantly reduced. The study also found that Planned Diffusion continues to improve with more training, unlike autoregressive baselines, which tend to plateau.
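A rough way to see the critical-path argument is to count sequential steps. The sketch below is an illustrative model rather than the paper's measurement method: the function names and the example numbers are made up, and it assumes the planning stage costs one step per plan token while the spans then run fully in parallel.

```python
def critical_path_autoregressive(span_lengths: list[int]) -> int:
    """Pure autoregressive decoding: one sequential step per content token."""
    return sum(span_lengths)

def critical_path_planned(plan_tokens: int, span_steps: list[int]) -> int:
    """Sequential plan tokens, then the slowest span bounds the parallel stage."""
    return plan_tokens + max(span_steps, default=0)

# Hypothetical example: three spans of 30, 40 and 20 tokens, a 12-token plan,
# and (pessimistically) one denoising step per token in each span.
span_lengths = [30, 40, 20]
ar_steps = critical_path_autoregressive(span_lengths)
pd_steps = critical_path_planned(plan_tokens=12, span_steps=span_lengths)
print(ar_steps, pd_steps)
```

Step counts are not wall-clock time: a denoising step over several spans may cost more than a single autoregressive step, so the reduction in steps is only a proxy for the measured speedup.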
Flexible Control
Planned Diffusion also offers flexibility at inference time. Users can adjust runtime parameters such as the ‘step ratio’ (which determines the number of denoising steps relative to span length) and a ‘confidence threshold’ for decoding. These ‘knobs’ control the balance between generation speed and output quality, letting users prioritize one over the other based on their needs.
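The sketch below shows one plausible reading of these two knobs. It is an assumption, not the paper's definition: it takes the step count to be the step ratio multiplied by the span length (rounded up), and it treats the confidence threshold as a rule for which masked positions get committed in a given step. The function names are hypothetical.

```python
import math

def num_denoising_steps(span_length: int, step_ratio: float) -> int:
    """Assumed rule: steps scale linearly with span length, at least one."""
    return max(1, math.ceil(step_ratio * span_length))

def select_positions(
    confidences: list[float], masked: list[int], threshold: float
) -> list[int]:
    """Pick which masked positions to commit in this step.

    Commits every masked position whose confidence meets the threshold;
    if none does, commits the single most confident one so decoding
    always makes progress.
    """
    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen:
        chosen = [max(masked, key=lambda i: confidences[i])]
    return chosen

print(num_denoising_steps(40, 0.5))
print(select_positions([0.9, 0.2, 0.95, 0.5], [0, 1, 2, 3], 0.8))
```

Under these assumptions, a smaller step ratio or a lower threshold commits more tokens per step, trading some quality for speed, while larger values do the reverse.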
This hybrid architecture, detailed further in the research paper, offers a practical path toward faster and more efficient large language models.


