TLDR: DiffusionNFT is a novel online reinforcement learning (RL) paradigm for diffusion models that directly optimizes on the forward process using flow matching. It contrasts positive and negative generations to define an implicit policy improvement direction, eliminating the need for likelihood estimation and supporting arbitrary black-box solvers. The method is up to 25 times more efficient than FlowGRPO and significantly boosts performance across various benchmarks without requiring Classifier-Free Guidance (CFG).
Online reinforcement learning (RL) has been a game-changer for improving large language models after their initial training, helping them align better with human preferences and enhance their reasoning. However, bringing similar success to diffusion models, which are powerful tools for visual generation, has been a significant challenge. The main hurdle lies in the difficulty of calculating exact likelihoods in diffusion models, which are crucial for traditional RL methods.
Previous attempts to apply RL to diffusion models often involved discretizing the reverse sampling process, essentially turning diffusion generation into a multi-step decision-making problem. While this allowed for the use of existing RL algorithms like GRPO, it came with several drawbacks. These methods often suffered from a lack of consistency with the forward diffusion process, restrictions on the types of solvers that could be used, and complicated integration with Classifier-Free Guidance (CFG), a technique commonly used to improve image quality.
Introducing DiffusionNFT: A New Approach
A new paradigm called Diffusion Negative-aware FineTuning (DiffusionNFT) has been introduced to overcome these limitations. Instead of relying on the traditional Policy Gradient framework, DiffusionNFT optimizes diffusion models directly on the forward process using a technique called flow matching. This method cleverly contrasts positive and negative generations to define an implicit direction for policy improvement, seamlessly integrating reinforcement signals into the standard supervised learning objective.
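To make "optimizing on the forward process" concrete, here is a minimal sketch of how a flow-matching training pair can be built from nothing but a clean sample. It assumes the linear (rectified-flow) noising path x_t = (1 - t) * x0 + t * eps with velocity target eps - x0. The function names `forward_noising` and `flow_matching_loss` are illustrative and do not come from the paper.

```python
import random


def forward_noising(x0, t, rng):
    """Noise a clean sample x0 to time t and return (x_t, velocity target).

    Assumption: linear path x_t = (1 - t) * x0 + t * eps, so the
    regression target for the velocity network is eps - x0.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]
    v_target = [e - a for a, e in zip(x0, eps)]
    return x_t, v_target


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]            # stand-in for a clean image (or latent)
x_t, v_target = forward_noising(x0, t=0.3, rng=rng)
```

Nothing here depends on how `x0` was sampled, which is why training on the forward process leaves the choice of sampler open.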
The core idea is to split generated samples into positive and negative groups based on a reward function. By learning from both good and bad examples, DiffusionNFT can guide the model towards better generations. This approach offers several practical benefits:
- Solver Flexibility: DiffusionNFT allows for the use of any black-box solvers during data collection, unlike previous methods that were restricted to first-order SDE samplers.
- Efficiency: It eliminates the need to store entire sampling trajectories, requiring only clean images and their associated rewards for policy optimization.
- CFG-Free Operation: The method naturally incorporates reinforcement guidance directly into the optimized policy, making Classifier-Free Guidance (CFG) unnecessary. This simplifies the training process and improves efficiency.
- Likelihood-Free: DiffusionNFT bypasses the need for complex and often biased likelihood estimations, which is a fundamental constraint for many other diffusion RL methods.
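The positive/negative contrast can be sketched in a few lines. This article does not spell out the exact objective, so the form below is an assumption for illustration only. The trained velocity `v_theta` and the frozen sampling-policy velocity `v_old` are mixed into an implicit "positive" prediction and a mirrored "negative" prediction, and each is regressed onto the flow-matching target with weights r and 1 - r. The names `split_by_reward` and `contrastive_loss` and the parameter `beta` are hypothetical.

```python
def split_by_reward(samples, rewards, threshold):
    """Partition samples into positives (reward >= threshold) and negatives."""
    positives = [s for s, r in zip(samples, rewards) if r >= threshold]
    negatives = [s for s, r in zip(samples, rewards) if r < threshold]
    return positives, negatives


def _mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(b)


def contrastive_loss(v_theta, v_old, v_target, r, beta=1.0):
    """Reward-weighted flow-matching loss on implicit policies (assumed form).

    r is the probability in [0, 1] that the sample is 'good'.
    v_pos moves from v_old toward v_theta; v_neg is its mirror image,
    so bad samples push v_theta away from their velocity target.
    """
    v_pos = [(1.0 - beta) * o + beta * n for o, n in zip(v_old, v_theta)]
    v_neg = [(1.0 + beta) * o - beta * n for o, n in zip(v_old, v_theta)]
    return r * _mse(v_pos, v_target) + (1.0 - r) * _mse(v_neg, v_target)


good, bad = split_by_reward(["img_a", "img_b", "img_c"], [0.9, 0.2, 0.5], threshold=0.5)
loss = contrastive_loss(v_theta=[0.2, 0.1], v_old=[0.0, 0.0], v_target=[1.0, 1.0], r=0.9)
```

Note that the loss only needs the clean sample (through `v_target`) and its reward, not the sampling trajectory or any likelihood.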
Performance and Efficiency
The effectiveness of DiffusionNFT has been demonstrated through extensive experiments. In a head-to-head comparison with FlowGRPO, DiffusionNFT was up to 25 times more efficient. For instance, it improved the GenEval score from 0.24 to 0.98 within just 1,000 steps, while FlowGRPO took over 5,000 steps, and additionally relied on CFG, to reach 0.95.
Furthermore, by leveraging multiple reward models, DiffusionNFT substantially boosted the performance of SD3.5-Medium across various benchmarks, including GenEval, OCR, PickScore, ClipScore, HPSv2.1, Aesthetic, ImageReward, and UnifiedReward. Remarkably, it achieved this while being entirely CFG-free, even outperforming larger CFG-based models like SD3.5-L and FLUX.1-Dev in some metrics.
Practical Implementation Details
The practical implementation of DiffusionNFT involves a few key design choices. Rewards, which are often continuous scalars, are transformed into an optimality probability between 0 and 1. The sampling policy is updated using a ‘soft’ Exponential Moving Average (EMA) approach, balancing learning speed and stability. An adaptive weighting scheme is used for the flow-matching loss, further enhancing training stability. The decision to operate in a CFG-free setting, despite leading to a lower initial performance, proved beneficial as the model quickly surpassed CFG baselines through RL post-training.
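Two of these design choices are easy to illustrate. The sketch below maps raw rewards to a value in [0, 1] by centering, scaling, and clipping, and blends the sampling policy toward the trained weights with an EMA. The specific normalization (centering on the batch mean, dividing by the batch standard deviation) and the decay value are assumptions made for this example, not details confirmed by the article, and both function names are hypothetical.

```python
from statistics import mean, pstdev


def optimality_probability(raw_rewards, scale=None):
    """Map continuous rewards to [0, 1] (assumed normalization).

    Rewards are centered on the batch mean, divided by `scale`
    (batch standard deviation if not given), clipped to [-1, 1],
    and shifted so an average sample lands at 0.5.
    """
    mu = mean(raw_rewards)
    z = scale if scale is not None else (pstdev(raw_rewards) or 1.0)
    return [0.5 + 0.5 * max(-1.0, min(1.0, (r - mu) / z)) for r in raw_rewards]


def soft_ema_update(sampling_params, trained_params, decay=0.9):
    """Blend the sampling policy toward the trained policy.

    decay = 0 copies the trained weights outright (fast but less stable);
    decay close to 1 changes the sampling policy slowly.
    """
    return [decay * s + (1.0 - decay) * p
            for s, p in zip(sampling_params, trained_params)]


probs = optimality_probability([1.0, 2.0, 3.0])
new_sampling = soft_ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9)
```

The `decay` knob is where the trade-off between learning speed and stability shows up: a lower value lets the data-collection policy track the trained model more closely.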
This work represents a significant step towards unifying supervised and reinforcement learning in the diffusion domain, highlighting the forward process as a promising foundation for scalable, efficient, and theoretically sound diffusion RL. For more in-depth technical details, you can refer to the full research paper: DiffusionNFT: Online Diffusion Reinforcement with Forward Process.


