This repository is the official implementation of "Differential-Perceptive and Retrieval-Augmented MLLM for Change Captioning" (ACM MM 2024).
Differential-Perceptive and Retrieval-Augmented MLLM for Change Captioning
Xian Zhang^1, Haokun Wen^1, Jianlong Wu^1, Pengda Qin^2, Hui Xue^2, Liqiang Nie^1
^1Harbin Institute of Technology, Shenzhen, ^2Alibaba Group
The codebase is mainly built with the following libraries:
- Python 3.9
- PyTorch and torchvision
- huggingface
- lavis
- loralib
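The dependencies above can be installed along these lines (a sketch only; the exact package names such as `salesforce-lavis` and version pins are assumptions, not specified by this list):

```shell
# Environment setup sketch -- adjust to the repo's own requirements if provided
conda create -n finer-mllm python=3.9 -y
conda activate finer-mllm
pip install torch torchvision      # PyTorch
pip install transformers           # huggingface
pip install salesforce-lavis       # lavis
pip install loralib                # LoRA layers
```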
The raw pretrained weights of the Vicuna-7B version of InstructBLIP are available at Hugging Face.
For CLEVR-Change
The official data can be found here: Google Drive link provided by Robust Change Captioning (ICCV 2019).
For Spot-the-Diff
Resized images can be downloaded from Learning to Describe Differences Between Pairs of Similar Images (EMNLP 2018). Raw captions can be downloaded from the link.
For Image-Editing-Request
The official data can be found here: Google Drive link provided by Expressing Visual Relationships via Language (ACL 2019).
For CLEVR-Change
To evaluate captions, we first need to reformat the caption annotations into the COCO eval tool format. Please run the command python utils/eval_utils.py following the instructions given under Evaluation in Robust Change Captioning (ICCV 2019).
Rename the output file to clevr_test_change_captions_reformat.json.
For Spot-the-Diff and Image-Editing-Request
Run the command python preprocess/eval_utils.py and rename the output file to spot_test_change_captions_reformat.json.
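For reference, the reformatting step produces COCO-style caption annotations along these lines (a minimal sketch; the input layout and field names are assumptions based on the standard COCO caption format — the repo's eval_utils.py scripts are authoritative):

```python
import json

def reformat_to_coco(raw_captions):
    """Convert {image_id: [caption, ...]} into a COCO-style annotation dict."""
    images, annotations = [], []
    ann_id = 0
    for image_id, caps in raw_captions.items():
        images.append({"id": image_id})
        for cap in caps:
            annotations.append({"image_id": image_id, "id": ann_id, "caption": cap})
            ann_id += 1
    return {"images": images, "annotations": annotations}

# Example: one image pair with two reference captions
coco = reformat_to_coco({"CLEVR_000001": ["the red cube moved", "no change was made"]})
print(json.dumps(coco, indent=2))
```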
We provide these evaluation files in the eval_data/ directory:
eval_data
|–– clevr_test_change_captions_reformat.json
|–– spot_val_change_captions_reformat.json
|–– spot_test_change_captions_reformat.json
|–– IER_val_change_captions_reformat.json
|–– IER_test_change_captions_reformat.json
Retrieval-augmented generation (RAG) requires an effective retriever for visual-to-text retrieval. We utilize the pretrained model from CLIP4IDC to retrieve texts, where the corpus is the collection of captions in the training set of each dataset.
The retrieval corpus used in our work can be downloaded from FINER-MLLM ModelScope, which provides the following files:
rag_store
|–– clevr_retrieval_corpus_store.json
|–– spot_retrieval_corpus_store.json
|–– IER_retrieval_corpus_store.json
Note: Since CLIP4IDC does not provide pre-trained weights for Image-Editing-Request, we retrieved the corpus after reproducing the method described in its paper. The retrieval results we reproduced on Image-Editing-Request are slightly worse than the retrieval metrics reported in the original paper. We also provide the detailed retrieval-corpus construction process on FINER-MLLM ModelScope.
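The retrieval step itself reduces to a cosine-similarity top-k search over corpus text embeddings. A minimal sketch (the actual retriever is CLIP4IDC; the embeddings, corpus, and helper name below are illustrative placeholders, not the repo's API):

```python
import numpy as np

def retrieve_topk(query_emb, corpus_embs, corpus_texts, k=3):
    """Return the k corpus texts most similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)                       # normalize query
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                                                    # cosine similarities
    top = np.argsort(-sims)[:k]                                     # indices of best matches
    return [corpus_texts[i] for i in top]
```

The retrieved captions are then supplied to the MLLM as in-context references during caption generation.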
In our experiments, we utilized a single GPU for training and evaluation.
Detailed commands for training the model on a single GPU:
# Clevr-change
bash scripts/train_clevr.sh
# Spot-the-diff
bash scripts/train_spot.sh
# Image-editing-request
bash scripts/train_IER.sh
The commands for evaluation:
bash scripts/eval.sh
You can download the pretrained weights and other necessary files from FINER-MLLM ModelScope.
If you find the repo useful for your research, please consider citing our paper:
@inproceedings{zhang2024finermllm,
title={Differential-Perceptive and Retrieval-Augmented {MLLM} for Change Captioning},
author={Xian Zhang and Haokun Wen and Jianlong Wu and Pengda Qin and Hui Xue and Liqiang Nie},
booktitle={ACM Multimedia 2024},
year={2024},
url={https://openreview.net/forum?id=eiGs5VCsYM}
}