This repository provides the official implementation of the paper "Provable Dynamic Fusion for Low-Quality Multimodal Data", presented at ICML 2023, by Qingyang Zhang, Haitao Wu, et al.
- Theoretical Framework: This paper introduces a theoretical framework to understand the criterion for robust dynamic multimodal fusion.
- Novel Method: A novel dynamic multimodal fusion method, termed Quality-aware Multimodal Fusion (QMF), is proposed, demonstrating provably better generalization ability.
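As a rough illustration of the quality-aware idea, the sketch below gates a late fusion of per-modality logits by a softmax over scalar confidence scores, so a degraded modality contributes less. This is a conceptual, stdlib-only sketch under our own naming (`dynamic_fusion`, `confidences` are illustrative); QMF's actual confidence estimation and training objective live in `train_qmf.py`.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dynamic_fusion(logits_per_modality, confidences):
    """Confidence-gated late fusion (conceptual sketch, not QMF itself).

    Each modality's logits are weighted by a softmax over its scalar
    confidence score, so low-quality modalities are down-weighted.
    """
    weights = softmax(confidences)
    n_classes = len(logits_per_modality[0])
    fused = [0.0] * n_classes
    for w, logits in zip(weights, logits_per_modality):
        for k in range(n_classes):
            fused[k] += w * logits[k]
    return fused, weights
```

With two disagreeing modalities, the one with higher confidence dominates the fused prediction, which is the behavior a quality-aware fusion rule should exhibit.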
To set up the environment, run the following command:

```shell
pip install -r requirements.txt
```

This project uses two types of multimodal datasets: Text-Image Classification and RGBD Scene Recognition.
- Download Datasets:
  - Download food101
  - Download MVSA_Single
  - Place them in the `datasets` folder.
  - (Baidu Netdisk links for convenience: food101 (pwd: 5jy4), MVSA_Single (pwd: 18fw))
- Prepare Splits: The `train`/`dev`/`test` splits (jsonl files) are prepared following the MMBT settings and are provided in their corresponding folders.
- Optional: Pre-trained Models for Text Embeddings:
  - Glove: For the Bow model, download glove.840B.300d.txt and place it in the `datasets/glove_embeds` folder.
  - BERT: For the Bert model, download bert-base-uncased (Google Drive Link) and place it in the root folder `bert-base-uncased/`.
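To catch path mistakes early, a small stdlib check of the expected layout can help. The dataset folder names follow the steps above; the split filenames (`train.jsonl`, `dev.jsonl`, `test.jsonl`) are an assumption based on the MMBT-style jsonl splits, so adjust them if your files differ.

```python
from pathlib import Path

# Expected layout under ./datasets/. Folder names come from the download
# steps above; the jsonl filenames are illustrative MMBT-style splits.
EXPECTED = {
    "food101": ["train.jsonl", "dev.jsonl", "test.jsonl"],
    "MVSA_Single": ["train.jsonl", "dev.jsonl", "test.jsonl"],
}

def check_datasets(root="./datasets"):
    """Return a list of expected split files that are missing."""
    missing = []
    for ds, splits in EXPECTED.items():
        for split in splits:
            p = Path(root) / ds / split
            if not p.is_file():
                missing.append(str(p))
    return missing

if __name__ == "__main__":
    gaps = check_datasets()
    print("OK" if not gaps else "Missing: " + ", ".join(gaps))
```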
We provide the trained models for download. Please ensure you have the necessary tools to access Baidu Netdisk if using those links.
- Trained QMF Models: Baidu Netdisk (pwd: 8995)
- Pre-trained BERT Model: Baidu Netdisk (pwd: zu13)
- Pre-trained ResNet18 (for RGB-D tasks): the official PyTorch pre-trained `resnet18` can be downloaded from this link.
To run our method (QMF) on benchmark datasets:
```shell
python train_qmf.py --alg qmf --noise_level 0.0 --noise_type Gaussian \
```

To evaluate and get the reported accuracy in our paper:
```shell
python train_qmf.py --alg qmf --epoch 0 --noise_level 5.0 --noise_type Gaussian \
```

To run TMC (Trusted Multi-View Classification, ICLR'21):
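For context before the launch commands below: TMC fuses per-view Dirichlet opinions (beliefs plus an uncertainty mass) with a reduced Dempster combination rule. The stdlib sketch below is illustrative only, with our own function names, and is not the repo's implementation.

```python
def evidence_to_opinion(evidence):
    """Map non-negative per-class evidence to a subjective-logic opinion.

    With Dirichlet parameters alpha_k = e_k + 1 and S = sum(alpha),
    belief b_k = e_k / S and uncertainty u = K / S.
    """
    K = len(evidence)
    S = sum(e + 1.0 for e in evidence)
    beliefs = [e / S for e in evidence]
    return beliefs, K / S

def dempster_combine(b1, u1, b2, u2):
    """Reduced Dempster rule fusing two views' opinions into one."""
    K = len(b1)
    # Conflict: belief mass the two views assign to incompatible classes.
    conflict = sum(b1[i] * b2[j] for i in range(K) for j in range(K) if i != j)
    scale = 1.0 - conflict
    b = [(b1[k] * b2[k] + b1[k] * u2 + b2[k] * u1) / scale for k in range(K)]
    u = (u1 * u2) / scale
    return b, u
```

The combined beliefs and uncertainty still sum to one, and when two views agree the fused uncertainty drops below either view's own, which is what makes the fusion "trusted".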
```shell
# Set parameters
task="MVSA_Single"             # or "food101"
task_type="classification"
model="latefusion"             # TMC often involves a fusion step; "latefusion" is used as an example base
i=0                            # Example seed
name="${task}_tmc_model_run_${i}"  # Naming convention for TMC runs

python train_tmc.py --batch_sz 16 --gradient_accumulation_steps 40 \
    --savedir "./saved/${task}" --name "${name}" --data_path "./datasets/" \
    --task "${task}" --task_type "${task_type}" --model "${model}" --num_image_embeds 3 \
    --freeze_txt 5 --freeze_img 3 --patience 5 --dropout 0.1 --lr 5e-05 --warmup 0.1 \
    --max_epochs 100 --seed "${i}"
```

If our QMF method or the idea of dynamic multimodal fusion is helpful in your research, please consider citing our paper:
@inproceedings{zhang2023provable,
title={Provable Dynamic Fusion for Low-Quality Multimodal Data},
author={Zhang, Qingyang and Wu, Haitao and Zhang, Changqing and Hu, Qinghua and Fu, Huazhu and Zhou, Joey Tianyi and Peng, Xi},
booktitle={International Conference on Machine Learning},
year={2023}
}

The code implementation is inspired by the following excellent works:
Here are some interesting works related to this paper:
- Uncertainty-based Fusion Network for Automatic Skin Lesion Diagnosis
- Uncertainty Estimation for Multi-view Data: The Power of Seeing the Whole Picture
- Reliable Multimodality Eye Disease Screening via Mixture of Student's t Distributions
- Trusted Multi-Scale Classification Framework for Whole Slide Image
- Fast Road Segmentation via Uncertainty-aware Symmetric Network
- Trustworthy multimodal regression with mixture of normal-inverse gamma distributions
- Uncertainty-Aware Multiview Deep Learning for Internet of Things Applications
- Automated crystal system identification from electron diffraction patterns using multiview opinion fusion machine learning
- Trustworthy Long-Tailed Classification
- Trusted multi-view deep learning with opinion aggregation
- EvidenceCap: Towards trustworthy medical image segmentation via evidential identity cap
- Federated Uncertainty-Aware Aggregation for Fundus Diabetic Retinopathy Staging
- Multimodal dynamics: Dynamical fusion for trustworthy multimodal classification
For any additional questions, feel free to email [email protected].