This is the official repository of "Reinforcing Diffusion Models by Direct Group Preference Optimization" by Yihong Luo, Tianyang Hu, and Jing Tang.
The key insight of our work is that the success of methods like GRPO stems from leveraging fine-grained, relative preference information within a group of samples, not from the policy-gradient formulation itself. Existing methods for diffusion models force the use of inefficient stochastic (SDE) samplers to fit the policy-gradient framework, leading to slow training and suboptimal sample quality.
DGPO circumvents this problem by optimizing group-level preferences directly, extending the Direct Preference Optimization (DPO) framework to handle pairwise groups instead of pairwise samples. This allows us to:
- Use Efficient Samplers: Employ fast, high-fidelity deterministic ODE samplers to generate training data, yielding better-quality rollouts (see the sketch after this list).
- Learn Directly from Preferences: Optimize the model by maximizing the likelihood of group-wise preferences, eliminating the need for a stochastic policy and inefficient random exploration.
- Train Efficiently: Avoid training on the entire sampling trajectory, significantly reducing the computational cost of each iteration.
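To make the first point concrete, here is a minimal sketch of generating a group of rollouts with the deterministic ODE sampler that SD3.5 ships with in `diffusers`, then scoring them with a task reward. The pipeline calls are standard `diffusers` API; `reward_fn` is a hypothetical placeholder for whichever reward (e.g., GenEval, OCR accuracy, or PickScore) is being optimized, and the sampler settings are illustrative rather than the paper's exact configuration.

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Load the SD3.5-Medium base model. Its default scheduler
# (FlowMatchEulerDiscreteScheduler) is a deterministic ODE sampler,
# so rollouts need no SDE noise injection.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "three red cubes stacked on a blue table"
group_size = 8  # rollouts per prompt, forming one preference group

# Generate one group of samples with the fast deterministic sampler.
images = pipe(
    prompt,
    num_images_per_prompt=group_size,
    num_inference_steps=28,
    guidance_scale=4.5,
).images

# reward_fn is a hypothetical stand-in for the task reward model
# (GenEval / OCR accuracy / PickScore); it is not part of this repo's API.
rewards = torch.tensor([reward_fn(prompt, img) for img in images])
```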
For a group of generated samples, we partition them into a positive group and a negative group based on their rewards, and train the model to prefer the former over the latter. We refer to our paper for the details; an illustrative sketch of such a group-level objective is given below.
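The following is a minimal, illustrative sketch of a group-level preference objective in plain PyTorch, not the paper's exact loss. It assumes each sample already carries an implicit log-likelihood ratio between the current model and a frozen reference (for diffusion/flow models this can be approximated from denoising-error differences at a few randomly sampled timesteps, which is what lets training skip the full trajectory). All names here (`partition_by_reward`, `group_dpo_loss`, `beta`) are ours, and the median-split partition rule is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def partition_by_reward(logratios, rewards):
    """Split one group of samples into positive/negative halves by reward.

    logratios: (G,) per-sample implicit log pi_theta / pi_ref estimates
    rewards:   (G,) task rewards for the same samples
    (Illustrative partition rule; see the paper for the actual one.)
    """
    order = rewards.argsort(descending=True)
    half = order.numel() // 2
    return logratios[order[:half]], logratios[order[half:]]

def group_dpo_loss(logratio_pos, logratio_neg, beta=0.1):
    # Group-level analogue of the DPO loss: maximize the likelihood
    # that the positive group's aggregate log-ratio beats the negative
    # group's, instead of comparing a single pair of samples.
    margin = logratio_pos.mean() - logratio_neg.mean()
    return -F.logsigmoid(beta * margin)

# Usage with dummy values: 8 rollouts, top half by reward is positive.
logratios = torch.randn(8, requires_grad=True)  # stand-in estimates
rewards = torch.randn(8)
pos, neg = partition_by_reward(logratios, rewards)
loss = group_dpo_loss(pos, neg)
loss.backward()
```

Aggregating within groups is what preserves the fine-grained relative signal that makes GRPO work, while keeping a direct DPO-style likelihood objective, so no stochastic policy or policy-gradient estimator is needed.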
DGPO consistently outperforms Flow-GRPO on the target metrics for compositional image generation, visual text rendering, and human preference alignment, while matching or exceeding it on out-of-domain quality and preference scores.
The first three columns are the per-task target metrics; the remaining columns are out-of-domain quality and preference metrics (PickScore appears once as a target and once as an out-of-domain score).

| Model | GenEval | OCR Acc. | PickScore | Aesthetic | DeQA | ImgRwd | PickScore | UniRwd |
|---|---|---|---|---|---|---|---|---|
| SD3.5-M (base) | 0.63 | 0.59 | 21.72 | 5.39 | 4.07 | 0.87 | 22.34 | 3.33 |
| **Compositional Image Generation** | | | | | | | | |
| Flow-GRPO | 0.95 | --- | --- | 5.25 | 4.01 | 1.03 | 22.37 | 3.51 |
| DGPO (Ours) | 0.97 | --- | --- | 5.31 | 4.03 | 1.08 | 22.41 | 3.60 |
| **Visual Text Rendering** | | | | | | | | |
| Flow-GRPO | --- | 0.92 | --- | 5.32 | 4.06 | 0.95 | 22.44 | 3.42 |
| DGPO (Ours) | --- | 0.96 | --- | 5.37 | 4.09 | 1.02 | 22.52 | 3.48 |
| **Human Preference Alignment** | | | | | | | | |
| Flow-GRPO | --- | --- | 23.31 | 5.92 | 4.22 | 1.28 | 23.53 | 3.66 |
| DGPO (Ours) | --- | --- | 23.89 | 6.08 | 4.40 | 1.32 | 23.91 | 3.74 |
Please contact Yihong Luo ([email protected]) if you have any questions about this work.
```bibtex
@misc{luo2025dgpo,
  title={Reinforcing Diffusion Models by Direct Group Preference Optimization},
  author={Yihong Luo and Tianyang Hu and Jing Tang},
  year={2025},
  eprint={2510.08425},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.08425},
}
```