TLDR: Two new methods, Direct-Align and Semantic Relative Preference Optimization (SRPO), improve the quality of AI image generation. Direct-Align allows diffusion models to be optimized across the entire denoising trajectory, not just the final steps, which helps prevent common issues like “reward hacking.” SRPO lets users adjust image preferences online using text prompts, reducing the need for costly offline fine-tuning. Together, the two methods produce more realistic and aesthetically pleasing AI-generated images and train far more efficiently than existing online reinforcement learning methods.
A new research paper introduces an approach to improving the quality and realism of AI-generated images that addresses key limitations of current diffusion models. Titled “Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference,” the work is by Xiangwei Shen, Zhimin Li, Zhantao Yang, Shiyi Zhang, Yingfang Zhang, Donghao Li, Chunyu Wang, Qinglin Lu, and Yansong Tang, from Hunyuan, Tencent; The Chinese University of Hong Kong, Shenzhen; and Tsinghua University. It presents two methods: Direct-Align and Semantic Relative Preference Optimization (SRPO).
Current methods for aligning diffusion models with human preferences face two major hurdles. First, they are computationally intensive: reward scoring relies on multi-step denoising with gradient computation. This restricts optimization to only a few diffusion steps and makes models susceptible to “reward hacking,” where a model achieves high reward scores for low-quality images. Second, reaching a desired aesthetic quality, such as photorealism or a specific lighting effect, typically requires repeated and costly offline adjustment of the reward model, because these methods offer no online mechanism for adjusting rewards in real time.
Direct-Align: Optimizing the Full Image Creation Process
To tackle the limitation of multi-step denoising, the researchers propose Direct-Align. This method predefines a noise prior, which allows the original image to be recovered from any time step through interpolation. It relies on the fact that diffusion states are interpolations between noise and target images: if the noise that was mixed in is known, the interpolation can be inverted directly rather than approximated through many denoising steps. As a result, Direct-Align avoids over-optimization in the later stages of image generation and lets the reinforcement learning algorithm be applied across the entire diffusion trajectory, from early, noisy stages to the final clean image. According to the authors, this full-trajectory optimization is crucial for preventing artifacts and improving overall image quality.
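The interpolation idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a simple linear schedule in which the noisy state is (1 - sigma) * x0 + sigma * eps, and the function names add_noise and recover_image are hypothetical. It only shows that when the injected noise is known in advance, the clean image can be recovered in closed form from any noise level.

```python
import random

def add_noise(x0, eps, sigma):
    """Mix a clean image x0 with a known noise sample eps.

    Assumes a linear schedule: x_t = (1 - sigma) * x0 + sigma * eps.
    """
    return [(1.0 - sigma) * a + sigma * e for a, e in zip(x0, eps)]

def recover_image(xt, eps, sigma):
    """Invert the interpolation in closed form, given the same eps."""
    if not 0.0 <= sigma < 1.0:
        raise ValueError("sigma must be in [0, 1)")
    return [(x - sigma * e) / (1.0 - sigma) for x, e in zip(xt, eps)]

# Toy "image": 16 pixel values in [-1, 1], plus a predefined noise sample.
rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(16)]
eps = [rng.gauss(0.0, 1.0) for _ in range(16)]

# Recovery works at light and heavy noise levels alike.
max_error = 0.0
for sigma in (0.1, 0.5, 0.9, 0.99):
    xt = add_noise(x0, eps, sigma)
    x0_hat = recover_image(xt, eps, sigma)
    error = max(abs(a - b) for a, b in zip(x0_hat, x0))
    max_error = max(max_error, error)

print("largest reconstruction error:", max_error)
```

In a real training loop the model's own noise prediction would take part in this recovery, so that the reward computed on the recovered image carries a gradient back to the model. The sketch leaves that out and shows only the algebra that makes early, noisy time steps usable.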
Semantic Relative Preference Optimization (SRPO): Online Control and Bias Mitigation
Complementing Direct-Align, the paper introduces Semantic Relative Preference Optimization (SRPO). In SRPO, rewards are formulated as text-conditioned signals, which allows them to be adjusted online through positive and negative prompt augmentations. In practice, users can steer the model’s preferences in real time by adding descriptive words to their prompts, reducing the heavy reliance on offline reward fine-tuning. SRPO also helps mitigate reward hacking by regularizing the reward signal. It evaluates each sample under both a positive and a negative prompt-conditioned preference and uses the relative difference, which filters out information irrelevant to the semantic guidance and neutralizes general biases in the reward model.
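A toy Python sketch of a relative, text-conditioned reward follows. It is an illustration under stated assumptions rather than the paper's formulation: the two-dimensional embeddings are invented for the example (one "bias" axis shared by both prompts and one "realism" axis), the reward is taken to be a plain dot product between an image embedding and a text embedding, and the names dot and relative_reward are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def relative_reward(image_emb, pos_text_emb, neg_text_emb):
    """Score under the positive prompt minus score under the negative one."""
    return dot(image_emb, pos_text_emb) - dot(image_emb, neg_text_emb)

# Invented text embeddings: axis 0 is a shared bias, axis 1 is "realism".
pos_text = [0.8, 0.6]    # e.g. the prompt with a positive control word added
neg_text = [0.8, -0.6]   # e.g. the prompt with a negative control word added

# Invented image embeddings.
image_a = [0.9, 0.3]     # moderately realistic
image_b = [0.9, -0.3]    # leans toward the negative description
image_c = [5.0, 0.3]     # same realism as image_a, inflated bias component

# A single-prompt reward is dominated by the shared bias axis.
plain_a = dot(image_a, pos_text)
plain_c = dot(image_c, pos_text)

# The relative reward cancels the shared axis and keeps only the difference.
rel_a = relative_reward(image_a, pos_text, neg_text)
rel_b = relative_reward(image_b, pos_text, neg_text)
rel_c = relative_reward(image_c, pos_text, neg_text)

print("plain rewards:", plain_a, plain_c)
print("relative rewards:", rel_a, rel_b, rel_c)
```

In this toy setup image_c scores far higher than image_a under the single-prompt reward purely because of the shared axis, which is the kind of shortcut reward hacking exploits. Under the relative reward the two images score the same, and image_b scores negatively.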
Breakthrough Results and Efficiency
The researchers fine-tuned the FLUX.1.dev model using their SRPO framework. In human evaluation, the fine-tuned model improves on the baseline by more than 3x in both realism and aesthetic quality: an approximately 3.7-fold increase in perceived realism and a 3.1-fold improvement in aesthetic quality. Efficiency is another highlight. The method converges in just 10 minutes on 32 NVIDIA H20 GPUs, a 75x improvement in training efficiency over state-of-the-art online reinforcement learning methods such as DanceGRPO, while matching or exceeding their image quality.
The evaluations, which include both automatic metrics and human assessments, indicate that Direct-Align and SRPO achieve state-of-the-art performance. The approach is also robust across different CLIP-based reward models, consistently enhancing image realism and detail complexity with no reward hacking observed. This work is a significant step forward in aligning text-to-image models with fine-grained human preferences, offering more controllable, realistic, and aesthetically pleasing AI-generated images at a much lower training cost.


