TLDR: PCPO (Proportionate Credit Policy Optimization) is a new framework that addresses training instability and model collapse in text-to-image (T2I) models. It achieves this by reformulating the training objective for numerical stability and, critically, by enforcing proportional credit assignment across timesteps during generation. This leads to significantly accelerated convergence, superior image quality, and effective mitigation of model collapse, outperforming current state-of-the-art methods.
Text-to-image (T2I) models can now generate striking visuals from simple text prompts, but keeping those images consistently aligned with human preferences remains a significant challenge. Reinforcement learning (RL) techniques have been instrumental in improving these models, yet they often suffer from training instability, slow convergence, and “model collapse,” where the generated images lose diversity and quality over the course of training.
Researchers Jeongjae Lee and Jong Chul Ye from KAIST have identified a core reason behind these issues: “disproportionate credit assignment.” In simpler terms, during the training process, the feedback signals given to the model across different stages of image generation are often inconsistent and highly volatile. This makes it difficult for the model to learn effectively, leading to the observed instabilities and quality degradation.
To tackle this, they introduce a framework called Proportionate Credit Policy Optimization, or PCPO. It aims to stabilize the training of T2I models by ensuring that the feedback provided to the model is proportional across all steps of the image generation process. PCPO does this through two mechanisms. First, it reformulates the training objective to improve numerical stability, making the learning process smoother. Second, and more crucially, it reweights the timesteps during training so that each step contributes proportionally to the overall policy update.
How PCPO Works Its Magic
For diffusion models, which are a popular type of T2I model, PCPO re-engineers the underlying variance schedule. This technical adjustment ensures that the “weight” or influence of each timestep on the model’s learning is kept constant, preventing the volatile and non-uniform feedback that previously hampered training. For flow models, another class of generative models, PCPO directly reweights the training objective to achieve the same proportionality.
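The idea of keeping each timestep's influence constant can be illustrated with a small sketch. This is not the paper's implementation: it assumes the objective can be written as a sum of per-timestep terms, each scaled by an implicit positive weight, and the function names and numbers below are hypothetical, chosen only for illustration.

```python
def equalizing_coefficients(raw_weights):
    """Return coefficients c_t so that raw_weights[t] * c_t is identical for all t.

    Assumption: raw_weights are the (positive) implicit per-timestep weights
    that a sampler places on each term of the training objective.
    """
    if any(w <= 0 for w in raw_weights):
        raise ValueError("raw weights must be positive")
    # Keep the average scale of the objective unchanged.
    target = sum(raw_weights) / len(raw_weights)
    return [target / w for w in raw_weights]


def reweighted_objective(per_step_terms, raw_weights):
    """Sum the per-timestep terms with a constant effective weight per timestep."""
    coeffs = equalizing_coefficients(raw_weights)
    return sum(w * c * g for w, c, g in zip(raw_weights, coeffs, per_step_terms))


# Made-up example values: early timesteps dominate the update.
raw_weights = [4.0, 2.0, 1.0, 1.0]
per_step_terms = [0.5, -0.25, 1.0, 0.75]  # illustrative unweighted terms

coeffs = equalizing_coefficients(raw_weights)
effective = [w * c for w, c in zip(raw_weights, coeffs)]
print(effective)  # every timestep now carries the same weight
print(reweighted_objective(per_step_terms, raw_weights))
```

In this toy setting the first timestep originally carries four times the weight of the last one; after rescaling, all four carry the same weight, which is the kind of uniform influence the article describes. How PCPO actually obtains this (a modified variance schedule for diffusion models, a reweighted objective for flow models) is specified in the paper, not here.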
The impact of PCPO is substantial. Experiments show that it significantly accelerates the training process, with speedups ranging from 24.6% to over 41% compared to existing methods. This means models can be trained faster and more efficiently. More importantly, PCPO leads to superior image quality and effectively mitigates model collapse. Instead of producing blurry or repetitive outputs, PCPO-trained models generate clear, diverse, and high-fidelity images.
The research demonstrates that PCPO consistently outperforms state-of-the-art policy gradient baselines, including DanceGRPO and DDPO, across various metrics. For instance, it achieves better (lower) Fréchet Inception Distance (FID) scores, indicating higher sample fidelity, and it lowers the Inception Score (IS) in settings where a high IS is a symptom of model collapse rather than of quality. Human evaluators also strongly preferred images generated by PCPO, even when compared to baselines that had been trained for longer.
One of the key advantages of PCPO is that it offers the benefits typically associated with larger batch sizes, such as improved stability and diversity, without the significant computational overhead that larger batches incur. This makes it a more efficient and comprehensive solution for enhancing the alignment and quality of T2I models.
This breakthrough represents a significant step forward in making text-to-image generation models more robust, efficient, and capable of producing outputs that truly reflect human preferences. For more technical details, you can refer to the full research paper: PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models.


