We introduce ReDiff, a refining-enhanced vision-language diffusion model.
The ReDiff model is now available on Hugging Face Hub. To quickly test the model with a visual instruction demo, follow these simple steps:
- Clone the repository

```bash
git clone https://github.com/jiyt17/ReDiff
cd train
```

- Initialize the environment

Run the environment setup script to install necessary dependencies:

```bash
bash init_env.sh
```

- Run the demo script

Execute the demo script to test ReDiff on an example image:

```bash
python generate_demo.py
```
Our model is fine-tuned from LLaDA-V.
We train the model to revise two types of synthetic errors: syntactic errors and semantic hallucinations.
The syntactic errors are injected by randomly replacing a fraction of tokens with other tokens from the vocabulary, while the semantic hallucinations come from the ViCrit dataset, which provides pairs of correct captions and captions with factual errors.
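For intuition, here is a minimal sketch of such token-level corruption (the function name, corruption ratio, and uniform sampling are illustrative assumptions, not the exact settings of the ReDiff data pipeline):

```python
import random

def inject_syntactic_errors(token_ids, vocab_size, corrupt_ratio=0.1, seed=None):
    """Illustrative sketch: replace a random fraction of tokens with random
    vocabulary tokens. The ratio and uniform sampling are assumptions, not the
    exact settings of the ReDiff data pipeline."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    num_corrupt = max(1, int(len(corrupted) * corrupt_ratio))
    positions = rng.sample(range(len(corrupted)), num_corrupt)
    for pos in positions:
        corrupted[pos] = rng.randrange(vocab_size)  # random replacement token
    return corrupted

# Example: corrupt 10% of a toy token sequence
print(inject_syntactic_errors(list(range(20)), vocab_size=32000, seed=0))
```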
- Training data preparation:
We select detailed image captioning as the representative task to validate our framework's ability to enhance the generative capabilities of vision-language diffusion models.
The data sources include ViCrit, LLaVA-1.5, and ShareGPT4V, contributing 160k, 20k, and 80k samples, respectively.
```bash
cd train/data_pipe
python preprocess_vicrit.py
```

- Training script:
```bash
cd train
bash ./scripts/llada_v_finetune_vicrit.sh
```
In the second stage, ReDiff-base generates its own flawed "drafts". These drafts, containing the model's grammatical and hallucinatory errors, are then revised by an expert AI assistant.
- Data preparation with o4-mini:
```bash
cd train/data_pipe
bash inference.sh
python data_pipeline_gpt.py
python process.py
```

Data sample (see the sketch below the training script for how the `revise` pairs map onto the draft):

```json
{
  "image": "ViCrit-Train/images/535.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nWrite a detailed description of the given image."
    },
    {
      "from": "gpt",
      "value": "In the image a smiling black man is standing outside a brick wall. He is holding a a gray magazine with a white cover on it. The man dressed in a gray suit suit and a white shirt shirt with a yellow tie. His tie tie is a gold color and has blue stripes. He is a wearing a brown hat with a a white logo on it. The magazine he is holding has a picture of on and it and a black background with a white text on it. The title \"the scene\" is also visible in the magazine. The man appears to be the featured on the front cover of the magazine."
    }
  ],
  "revise": [
    { "org": "a white shirt shirt", "target": "a light blue shirt" },
    { "org": "tie tie", "target": "gold tie" },
    { "org": "a picture of on and it", "target": "a man on the front cover" },
    { "org": "black background", "target": "red background" }
  ]
}
```
- Training script:
```bash
cd train
bash ./scripts/llada_v_finetune_o4.sh
```
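To make the `revise` format above concrete, here is a minimal sketch (for illustration only; `apply_revisions` and `sample.json` are assumed names, not part of the repository) that applies each `org` → `target` replacement to the flawed draft:

```python
import json

def apply_revisions(draft: str, revisions: list) -> str:
    """Illustrative helper (not part of the ReDiff codebase): apply each
    org -> target replacement from the `revise` field to the draft caption."""
    corrected = draft
    for pair in revisions:
        corrected = corrected.replace(pair["org"], pair["target"])
    return corrected

# Example usage with one sample in the format shown above
with open("sample.json") as f:  # hypothetical path to a single data sample
    sample = json.load(f)

draft = sample["conversations"][-1]["value"]  # the model's flawed draft
print(apply_revisions(draft, sample["revise"]))
```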
We aim to improve the generation quality of vision-language diffusion models, and demonstrate the effectiveness of the refining-enhanced diffusion framework on three detailed image captioning benchmarks: CapMAS (three metrics: CLAIR for overall caption quality, Coverage for the comprehensiveness of the description, and Factuality for the accuracy of the content), CapArena (a score based on pairwise comparison), and DetailCaps-4870 (metric: CAPTURE).
Evaluation script:
```bash
cd eval
bash inference.sh
```

The code is largely based on LLaDA-V, and the training data sources include ViCrit and LLaVA-OneVision. We thank the authors for their great work.
```bibtex
@article{ji2025denoising,
  title={From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model},
  author={Ji, Yatai and Wang, Teng and Ge, Yuying and Liu, Zhiheng and Yang, Sidi and Shan, Ying and Luo, Ping},
  journal={arXiv preprint arXiv:2510.19871},
  year={2025}
}
```