
Differential-Perceptive and Retrieval-Augmented MLLM for Change Captioning (ACM MM 2024)

FINER-MLLM Framework

This repository is the official implementation of "Differential-Perceptive and Retrieval-Augmented MLLM for Change Captioning". ACM MM, 2024.

Differential-Perceptive and Retrieval-Augmented MLLM for Change Captioning
Xian Zhang^1, Haokun Wen^1, Jianlong Wu^1, Pengda Qin^2, Hui Xue^2, Liqiang Nie^1
^1Harbin Institute of Technology, Shenzhen, ^2Alibaba Group

🔨 Installation

The codebase is mainly built with the following libraries:

The raw pretrained weights of the Vicuna-7B version of InstructBLIP are available on Hugging Face.
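For reference, the public Vicuna-7B InstructBLIP checkpoint can be loaded with Hugging Face transformers. This is a minimal sketch assuming the Salesforce/instructblip-vicuna-7b model ID; the repository's own loading code may differ:

import torch
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

# Load the public InstructBLIP (Vicuna-7B) release from the Hugging Face Hub.
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b",
    torch_dtype=torch.float16,  # half precision to fit a single GPU more easily
)
model.eval()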

➡️ Data Preparation

Dataset

For CLEVR-Change

The official data can be found here: google drive link provided by Robust Change Captioning (ICCV19).

For Spot-the-Diff

Resized images can be downloaded from Learning to Describe Differences Between Pairs of Similar Images (EMNLP18). Raw captions can be downloaded from this link.

For Image-Editing-Request

The official data can be found here: google drive link provided by Expressing Visual Relationships via Language (ACL 2019).

Prepare for Evaluation

For CLEVR-Change

To evaluate captions, we first need to reformat the caption annotations into the COCO eval tool format. Please run python utils/eval_utils.py according to the instructions given in the Evaluation section of Robust Change Captioning (ICCV19).

Rename the output file to clevr_test_change_captions_reformat.json.
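For intuition, below is a minimal sketch of what this reformatting produces, assuming the raw annotations map an image id to its captions. The field names are illustrative assumptions; the actual conversion is performed by the eval_utils.py scripts:

# Illustrative sketch: group raw change captions by image id into the
# {"images": [...], "annotations": [...]} layout consumed by the COCO
# caption evaluation tool. Input/output keys here are assumptions.
import json

def reformat_to_coco(raw_path, out_path):
    with open(raw_path) as f:
        raw = json.load(f)  # assumed layout: {image_id: [caption, ...], ...}

    images, annotations = [], []
    ann_id = 0
    for image_id, captions in raw.items():
        images.append({"id": image_id})
        for cap in captions:
            annotations.append({"id": ann_id, "image_id": image_id, "caption": cap})
            ann_id += 1

    with open(out_path, "w") as f:
        json.dump({"images": images, "annotations": annotations,
                   "type": "captions", "info": {}, "licenses": []}, f)

reformat_to_coco("change_captions.json", "clevr_test_change_captions_reformat.json")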

For Spot-the-Diff and Image-Editing-Request

Run the command python preprocess/eval_utils.py, then rename the output file to spot_test_change_captions_reformat.json (and analogously for Image-Editing-Request).

We provide these evaluation files in the eval_data/ directory:

eval_data
|–– clevr_test_change_captions_reformat.json
|–– spot_val_change_captions_reformat.json
|–– spot_test_change_captions_reformat.json
|–– IER_val_change_captions_reformat.json
|–– IER_test_change_captions_reformat.json

Retrieval corpus for RAG

Retrieval-augmented generation (RAG) requires an effective retriever for visual-to-text retrieval. We utilize the pretrained model from CLIP4IDC to retrieve texts, where the corpus is the collection of captions in the training set of each dataset.
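Conceptually, the retrieval step embeds the image pair and every corpus caption into a shared space and keeps the top-k most similar captions as context for the MLLM. A minimal sketch follows; the encoders and corpus layout stand in for CLIP4IDC and are assumptions:

# Conceptual sketch of visual-to-text retrieval for RAG: rank all corpus
# captions by cosine similarity to the image-pair embedding and keep the
# top-k. The embeddings are assumed to come from CLIP4IDC-style encoders.
import torch
import torch.nn.functional as F

def retrieve_topk(pair_embedding, corpus_embeddings, corpus_texts, k=5):
    """Return the k corpus captions most similar to the image-pair embedding."""
    # pair_embedding: (d,), corpus_embeddings: (N, d), corpus_texts: list of N strings
    sims = F.cosine_similarity(pair_embedding.unsqueeze(0), corpus_embeddings)
    topk = sims.topk(k).indices
    return [corpus_texts[i] for i in topk]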

The retrieval corpora used in our work can be downloaded from FINER-MLLM ModelScope. You can get the following files:

rag_store
|–– clevr_retrieval_corpus_store.json
|–– spot_retrieval_corpus_store.json
|–– IER_retrieval_corpus_store.json

Note: Since CLIP4IDC does not provide pretrained weights for Image-Editing-Request, we reproduced it following the paper's method and used our reproduction to retrieve the corpus. Our reproduced retrieval results on Image-Editing-Request are slightly worse than the retrieval metrics reported in the original paper. The detailed procedure for building the retrieval corpus is also provided on FINER-MLLM ModelScope.

🔄 Running

In our experiments, we utilized a single GPU for training and evaluation.

Detailed commands for training the model with a single GPU:

# CLEVR-Change
bash scripts/train_clevr.sh

# Spot-the-Diff
bash scripts/train_spot.sh

# Image-Editing-Request
bash scripts/train_IER.sh

The commands for evaluation:

bash scripts/eval.sh
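For reference, the captioning metrics can be computed with the standard COCO caption evaluation tool against the reformatted files in eval_data/. Below is a minimal sketch, assuming a results.json in COCO result format; it stands in for whatever scripts/eval.sh runs internally:

# Minimal sketch: score generated captions against the reformatted ground
# truth using pycocotools + pycocoevalcap.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground truth is a reformatted annotation file; results.json is assumed to be
# [{"image_id": ..., "caption": "..."}, ...] in COCO result format.
coco = COCO("eval_data/clevr_test_change_captions_reformat.json")
coco_res = coco.loadRes("results.json")

evaluator = COCOEvalCap(coco, coco_res)
evaluator.evaluate()
for metric, score in evaluator.eval.items():
    print(f"{metric}: {score:.4f}")  # BLEU-1..4, METEOR, ROUGE_L, CIDEr, SPICE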

📍 Pretrained Weights

You can download the pretrained weights and other necessary files from FINER-MLLM ModelScope.

✏️ Citation

If you find the repo useful for your research, please consider citing our paper:

@inproceedings{zhang2024finermllm,
  title={Differential-Perceptive and Retrieval-Augmented {MLLM} for Change Captioning},
  author={Xian Zhang and Haokun Wen and Jianlong Wu and Pengda Qin and Hui Xue and Liqiang Nie},
  booktitle={ACM Multimedia 2024},
  year={2024},
  url={https://openreview.net/forum?id=eiGs5VCsYM}
}
