This repository provides the official implementation of the paper "Provable Dynamic Fusion for Low-Quality Multimodal Data", presented at ICML 2023, by Qingyang Zhang, Haitao Wu, et al.
- Theoretical Framework: This paper introduces a theoretical framework to understand the criterion for robust dynamic multimodal fusion.
- Novel Method: A novel dynamic multimodal fusion method, termed Quality-aware Multimodal Fusion (QMF), is proposed, demonstrating provably better generalization ability.
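As a rough illustration of the quality-aware idea, the sketch below gates a late fusion of per-modality logits by a softmax over scalar confidence scores, so a degraded modality contributes less. This is a conceptual, stdlib-only sketch under our own naming (`dynamic_fusion`, `confidences` are illustrative); QMF's actual confidence estimation and training objective live in `train_qmf.py`.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dynamic_fusion(logits_per_modality, confidences):
    """Confidence-gated late fusion (conceptual sketch, not QMF itself).

    Each modality's logits are weighted by a softmax over its scalar
    confidence score, so low-quality modalities are down-weighted.
    """
    weights = softmax(confidences)
    n_classes = len(logits_per_modality[0])
    fused = [0.0] * n_classes
    for w, logits in zip(weights, logits_per_modality):
        for k in range(n_classes):
            fused[k] += w * logits[k]
    return fused, weights
```

With two disagreeing modalities, the one with higher confidence dominates the fused prediction, which is the behavior a quality-aware fusion rule should exhibit.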
To set up the environment, run the following command:

```shell
pip install -r requirements.txt
```

This project uses two types of multimodal datasets: Text-Image Classification and RGBD Scene Recognition.
- Download Datasets:
  - Download food101
  - Download MVSA_Single
  - Place them in the `datasets` folder.
  - (Baidu Netdisk links for convenience: food101 (pwd: 5jy4), MVSA_Single (pwd: 18fw))
- Prepare Splits: The `train`/`dev`/`test` splits (jsonl files) are prepared following the MMBT settings and are provided in their corresponding folders.
- Optional: Pre-trained Models for Text Embeddings:
  - Glove: For the Bow model, download glove.840B.300d.txt and place it in the `datasets/glove_embeds` folder.
  - BERT: For the Bert model, download bert-base-uncased (Google Drive Link) and place it in the root folder `bert-base-uncased/`.
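To catch path mistakes early, a small stdlib check of the expected layout can help. The dataset folder names follow the steps above; the split filenames (`train.jsonl`, `dev.jsonl`, `test.jsonl`) are an assumption based on the MMBT-style jsonl splits, so adjust them if your files differ.

```python
from pathlib import Path

# Expected layout under ./datasets/. Folder names come from the download
# steps above; the jsonl filenames are illustrative MMBT-style splits.
EXPECTED = {
    "food101": ["train.jsonl", "dev.jsonl", "test.jsonl"],
    "MVSA_Single": ["train.jsonl", "dev.jsonl", "test.jsonl"],
}

def check_datasets(root="./datasets"):
    """Return a list of expected split files that are missing."""
    missing = []
    for ds, splits in EXPECTED.items():
        for split in splits:
            p = Path(root) / ds / split
            if not p.is_file():
                missing.append(str(p))
    return missing

if __name__ == "__main__":
    gaps = check_datasets()
    print("OK" if not gaps else "Missing: " + ", ".join(gaps))
```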
We provide the trained models for download. Please ensure you have the necessary tools to access Baidu Netdisk if using those links.
- Trained QMF Models: Baidu Netdisk (pwd: 8995)
- Pre-trained BERT Model: Baidu Netdisk (pwd: zu13)
- Pre-trained ResNet18 (for RGB-D tasks): the official PyTorch pre-trained `resnet18` can be downloaded from this link.
To run our method (QMF) on benchmark datasets:
```shell
python train_qmf.py --alg qmf --noise_level 0.0 --noise_type Gaussian \
```

To evaluate and get the reported accuracy in our paper:
```shell
python train_qmf.py --alg qmf --epoch 0 --noise_level 5.0 --noise_type Gaussian \
```

To run TMC (Trusted Multi-View Classification, ICLR'21):
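For context before the launch commands below: TMC fuses per-view Dirichlet opinions (beliefs plus an uncertainty mass) with a reduced Dempster combination rule. The stdlib sketch below is illustrative only, with our own function names, and is not the repo's implementation.

```python
def evidence_to_opinion(evidence):
    """Map non-negative per-class evidence to a subjective-logic opinion.

    With Dirichlet parameters alpha_k = e_k + 1 and S = sum(alpha),
    belief b_k = e_k / S and uncertainty u = K / S.
    """
    K = len(evidence)
    S = sum(e + 1.0 for e in evidence)
    beliefs = [e / S for e in evidence]
    return beliefs, K / S

def dempster_combine(b1, u1, b2, u2):
    """Reduced Dempster rule fusing two views' opinions into one."""
    K = len(b1)
    # Conflict: belief mass the two views assign to incompatible classes.
    conflict = sum(b1[i] * b2[j] for i in range(K) for j in range(K) if i != j)
    scale = 1.0 - conflict
    b = [(b1[k] * b2[k] + b1[k] * u2 + b2[k] * u1) / scale for k in range(K)]
    u = (u1 * u2) / scale
    return b, u
```

The combined beliefs and uncertainty still sum to one, and when two views agree the fused uncertainty drops below either view's own, which is what makes the fusion "trusted".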
```shell
# Set parameters
task="MVSA_Single"             # or "food101"
task_type="classification"
model="latefusion"             # TMC often involves a fusion step; "latefusion" is used as an example base
i=0                            # Example seed
name="${task}_tmc_model_run_${i}"  # Naming convention for TMC runs

python train_tmc.py --batch_sz 16 --gradient_accumulation_steps 40 \
    --savedir "./saved/${task}" --name "${name}" --data_path "./datasets/" \
    --task "${task}" --task_type "${task_type}" --model "${model}" --num_image_embeds 3 \
    --freeze_txt 5 --freeze_img 3 --patience 5 --dropout 0.1 --lr 5e-05 --warmup 0.1 \
    --max_epochs 100 --seed "${i}"
```

If our QMF method or the idea of dynamic multimodal fusion is helpful in your research, please consider citing our paper:
@inproceedings{zhang2023provable,
title={Provable Dynamic Fusion for Low-Quality Multimodal Data},
author={Zhang, Qingyang and Wu, Haitao and Zhang, Changqing and Hu, Qinghua and Fu, Huazhu and Zhou, Joey Tianyi and Peng, Xi},
booktitle={International Conference on Machine Learning},
year={2023}
}

The code implementation is inspired by the following excellent works:
Here are some interesting works related to this paper:
- Uncertainty-based Fusion Network for Automatic Skin Lesion Diagnosis
- Uncertainty Estimation for Multi-view Data: The Power of Seeing the Whole Picture
- Reliable Multimodality Eye Disease Screening via Mixture of Student's t Distributions
- Trusted Multi-Scale Classification Framework for Whole Slide Image
- Fast Road Segmentation via Uncertainty-aware Symmetric Network
- Trustworthy multimodal regression with mixture of normal-inverse gamma distributions
- Uncertainty-Aware Multiview Deep Learning for Internet of Things Applications
- Automated crystal system identification from electron diffraction patterns using multiview opinion fusion machine learning
- Trustworthy Long-Tailed Classification
- Trusted multi-view deep learning with opinion aggregation
- EvidenceCap: Towards trustworthy medical image segmentation via evidential identity cap
- Federated Uncertainty-Aware Aggregation for Fundus Diabetic Retinopathy Staging
- Multimodal dynamics: Dynamical fusion for trustworthy multimodal classification
For any additional questions, feel free to email [email protected].