This repository is the official implementation of "Differential-Perceptive and Retrieval-Augmented MLLM for Change Captioning" (ACM MM 2024).
Differential-Perceptive and Retrieval-Augmented MLLM for Change Captioning
Xian Zhang^1, Haokun Wen^1, Jianlong Wu^1, Pengda Qin^2, Hui Xue^2, Liqiang Nie^1
^1Harbin Institute of Technology, Shenzhen, ^2Alibaba Group
The codebase is mainly built with the following libraries:
- Python 3.9
- PyTorch and torchvision
- huggingface
- lavis
- loralib
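The dependencies above can be installed along these lines (a sketch only; the exact package names such as `salesforce-lavis` and version pins are assumptions, not specified by this list):

```shell
# Environment setup sketch -- adjust to the repo's own requirements if provided
conda create -n finer-mllm python=3.9 -y
conda activate finer-mllm
pip install torch torchvision      # PyTorch
pip install transformers           # huggingface
pip install salesforce-lavis       # lavis
pip install loralib                # LoRA layers
```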
The raw pretrained weights of the Vicuna-7B version of InstructBLIP are available at Hugging Face.
For CLEVR-Change
The official data can be found here: Google Drive link provided by Robust Change Captioning (ICCV 2019).
For Spot-the-Diff
Resized images can be downloaded from Learning to Describe Differences Between Pairs of Similar Images (EMNLP 2018). Raw captions can be downloaded from the link.
For Image-Editing-Request
The official data can be found here: Google Drive link provided by Expressing Visual Relationships via Language (ACL 2019).
For CLEVR-Change
To evaluate captions, we first need to reformat the caption annotations into the COCO eval tool format. Please run the command python utils/eval_utils.py following the instructions given under Evaluation in Robust Change Captioning (ICCV 2019).
Rename the output file to clevr_test_change_captions_reformat.json.
For Spot-the-Diff and Image-Editing-Request
Run the command python preprocess/eval_utils.py and rename the output file to spot_test_change_captions_reformat.json.
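For reference, the reformatting step produces COCO-style caption annotations along these lines (a minimal sketch; the input layout and field names are assumptions based on the standard COCO caption format — the repo's eval_utils.py scripts are authoritative):

```python
import json

def reformat_to_coco(raw_captions):
    """Convert {image_id: [caption, ...]} into a COCO-style annotation dict."""
    images, annotations = [], []
    ann_id = 0
    for image_id, caps in raw_captions.items():
        images.append({"id": image_id})
        for cap in caps:
            annotations.append({"image_id": image_id, "id": ann_id, "caption": cap})
            ann_id += 1
    return {"images": images, "annotations": annotations}

# Example: one image pair with two reference captions
coco = reformat_to_coco({"CLEVR_000001": ["the red cube moved", "no change was made"]})
print(json.dumps(coco, indent=2))
```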
We provide these evaluation files in the eval_data/ directory:
eval_data
|–– clevr_test_change_captions_reformat.json
|–– spot_val_change_captions_reformat.json
|–– spot_test_change_captions_reformat.json
|–– IER_val_change_captions_reformat.json
|–– IER_test_change_captions_reformat.json
Retrieval-augmented generation (RAG) requires an effective retriever for visual-to-text retrieval. We utilize the pretrained model from CLIP4IDC to retrieve texts, where the corpus is the collection of captions in the training set of each dataset.
The retrieval corpus used in our work can be downloaded from FINER-MLLM ModelScope, which provides the following files:
rag_store
|–– clevr_retrieval_corpus_store.json
|–– spot_retrieval_corpus_store.json
|–– IER_retrieval_corpus_store.json
Note: Since CLIP4IDC does not provide pre-trained weights for Image-Editing-Request, we retrieved the corpus after reproducing the method described in its paper. The retrieval results we reproduced on Image-Editing-Request are slightly worse than the retrieval metrics reported in the original paper. We also provide the detailed retrieval-corpus construction process on FINER-MLLM ModelScope.
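The retrieval step itself reduces to a cosine-similarity top-k search over corpus text embeddings. A minimal sketch (the actual retriever is CLIP4IDC; the embeddings, corpus, and helper name below are illustrative placeholders, not the repo's API):

```python
import numpy as np

def retrieve_topk(query_emb, corpus_embs, corpus_texts, k=3):
    """Return the k corpus texts most similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)                       # normalize query
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                                                    # cosine similarities
    top = np.argsort(-sims)[:k]                                     # indices of best matches
    return [corpus_texts[i] for i in top]
```

The retrieved captions are then supplied to the MLLM as in-context references during caption generation.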
In our experiments, we utilized a single GPU for training and evaluation.
Detailed commands for training the model on a single GPU:
# Clevr-change
bash scripts/train_clevr.sh
# Spot-the-diff
bash scripts/train_spot.sh
# Image-editing-request
bash scripts/train_IER.sh
The commands for evaluation:
bash scripts/eval.sh
You can download the pretrained weights and other necessary files from FINER-MLLM ModelScope.
If you find the repo useful for your research, please consider citing our paper:
@inproceedings{zhang2024finermllm,
title={Differential-Perceptive and Retrieval-Augmented {MLLM} for Change Captioning},
author={Xian Zhang and Haokun Wen and Jianlong Wu and Pengda Qin and Hui Xue and Liqiang Nie},
booktitle={ACM Multimedia 2024},
year={2024},
url={https://openreview.net/forum?id=eiGs5VCsYM}
}