This repository includes code for the paper:
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Vaidehi Patil, Yi-Lin Sung, Peter Hase, Jie Peng, Tianlong Chen, and Mohit Bansal
Multimodal LLMs (MLLMs) can inadvertently store sensitive information, making them vulnerable to extraction attacks. To address this, we introduce UnLOK-VQA, a benchmark and attack-defense framework for evaluating multimodal unlearning. We show that multimodal attacks are more effective than text-only or image-only attacks, that the best defense reduces attack success rates significantly, and that larger models demonstrate greater resilience.
Link: Dataset Link
This dataset contains approximately 500 entries with the following key attributes:
- "id": Unique Identifier for each entry
- "src": The question whose answer is to be deleted โ
- "pred": The answer to the question meant for deletion โ
- "loc": Related neighborhood questions ๐
- "loc_ans": Answers to the neighborhood questions ๐ฃ๏ธ
- "image_id": The ID corresponding to the image in the COCO dataset ๐ผ๏ธ
To access the Neighborhood Images (Easy, Hard) and Paraphrase Images (Easy, Medium, Hard), visit the following link: Neighborhood & Paraphrase Images.
Save the UNLOK-VQA.json file as zsre_mend_eval.json in the "DATA_DIR" path specified in globals.yml.
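As a quick illustration, here is a minimal sketch (not part of the repository) of loading the file and inspecting one entry, assuming the JSON top level is a list of entry dictionaries and that DATA_DIR resolves to a hypothetical ./data directory:

```python
import json
import os

DATA_DIR = "./data"  # hypothetical; substitute the DATA_DIR value from globals.yml

with open(os.path.join(DATA_DIR, "zsre_mend_eval.json")) as f:
    entries = json.load(f)

entry = entries[0]
print(entry["src"])       # question whose answer is to be deleted
print(entry["pred"])      # answer targeted for deletion
print(entry["loc"])       # related neighborhood question
print(entry["loc_ans"])   # answer to the neighborhood question
print(entry["image_id"])  # ID of the associated COCO image
```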
This codebase is built on top of code from LLaVA and Belief-localization.
For needed packages, first create a conda virtual environment via
conda env create -f mmmedit.yml
and activate it using the following command, where $CONDA_PATH is the path to your conda installation:
source $CONDA_PATH/bin/activate mmmedit
To set up LLaVA, clone the repository and install the required dependencies (version 1.1.1 of LLaVA is the one compatible with this codebase):
cd LLaVA
pip install -e .
Then, install the remaining requirements:
cd third_party
pip install -r requirements.txt
python -c "import nltk; nltk.download('punkt')"
Set the following parameters to run attacks against the respective defenses:
- --fact_erasure
- --dummy_string
- Neither --fact_erasure nor --dummy_string
- --fact_erasure, --margin_loss, --margin_layers 23 24 25 26 27 28 29 30 31 32
- --fact_erasure, --entropy_loss, --entropy_layers 23 24 25 26 27 28 29 30 31 32
- --fact_erasure, --rephrase_defense
- -n: set this to the highest ID in the dataset (see the sketch after this list)
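For example, the following is a minimal sketch (not part of the repository) of looking up the highest ID to pass as -n, assuming the dataset JSON is a list of entries whose "id" fields can be parsed as integers and has been saved under a hypothetical ./data directory:

```python
import json

# Hypothetical path; substitute the DATA_DIR value configured in globals.yml.
with open("./data/zsre_mend_eval.json") as f:
    entries = json.load(f)

# -n should be set to the highest ID present in the dataset.
highest_id = max(int(entry["id"]) for entry in entries)
print(highest_id)  # e.g., the sample commands below pass -n 4973
```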
Our experiments involve iterating over the dataset to delete each instance one-by-one from the model, then evaluating with an attack whether the instance has actually been deleted. The attack success rate is the percentage of instances in the dataset recovered by the attack.
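As a concrete illustration of the metric (not the repository's evaluation code), the sketch below computes the attack success rate from per-instance outcomes, assuming a boolean per deleted instance that indicates whether the attack recovered the erased answer:

```python
from typing import List

def attack_success_rate(recovered: List[bool]) -> float:
    """Percentage of deleted instances whose erased answer the attack recovered."""
    if not recovered:
        return 0.0
    return 100.0 * sum(recovered) / len(recovered)

# Example: the attack recovers the deleted answer for 2 of 5 instances -> 40.0
print(attack_success_rate([True, False, True, False, False]))
```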
Below are some sample commands: evaluating the Fact Erasure defense against the Question Rephrase Attack, the Fact Erasure + Question Rephrase defense against the Head Projection Attack, and the Error Injection defense against the PD^2 Attack.
Evaluate Fact Erasure defense against the Question Rephrase Attack
CUDA_VISIBLE_DEVICES="7" python3 -m experiments.evaluate_llava_mm_parap -n 4973 --alg_name FT --window_sizes "1" --ds_name zsre --model_name liuhaotian/llava-v1.5-7b --run 1 --edit_layer 7 --correctness_filter 1 --norm_constraint 1e-4 --kl_factor 1 --fact_token subject_last --overwrite --retain_rate --skip_generation_test --num_attack_parap 4 --bb_num_samples 5 --attack bb --img_attack_parap orig --lft_edit --fact_erasure
Evaluate Fact Erasure + Question Rephrase defense against the Head Projection Attack
CUDA_VISIBLE_DEVICES="3" python3 -m experiments.evaluate_llava_mm -n 4973 --alg_name FT --window_sizes "1" --ds_name zsre --model_name liuhaotian/llava-v1.5-7b --run 1 --edit_layer 7 --correctness_filter 1 --norm_constraint 1e-4 --kl_factor 1 --fact_token subject_last --overwrite --retain_rate --skip_generation_test --attack hp --img_attack_parap orig --lft_edit --fact_erasure --use_img_token --debug --layers_wb_attack "25 29 30 31 32" --k 4 --epoch 10 --lora_lr 9e-3 --neigh_img --neigh_mm --neigh_type em --rephrase_defense
Evaluate Error Injection defense against the PD^2 Attack
CUDA_VISIBLE_DEVICES="2" python3 -m experiments.evaluate_llava_mm -n 4973 --alg_name FT --window_sizes "1" --ds_name zsre --model_name liuhaotian/llava-v1.5-7b --run 1 --edit_layer 7 --correctness_filter 1 --norm_constraint 1e-4 --kl_factor 1 --fact_token subject_last --overwrite --retain_rate --skip_generation_test --attack hess --img_attack_parap orig --lft_edit --fact_erasure --use_img_token --debug --layers_wb_attack "26 27 28 29 30 31 32" --k 4 --epoch 3 --lora_lr 2e-2 --neigh_img --neigh_mm --neigh_type em --lft_edit --lora_enable
To run LoRA fine-tuning instead of full fine-tuning, add --lora_enable to the above commands:
CUDA_VISIBLE_DEVICES="2" python3 -m experiments.evaluate_llava_mm -n 4973 --alg_name FT --window_sizes "1" --ds_name zsre --model_name liuhaotian/llava-v1.5-7b --run 1 --edit_layer 7 --correctness_filter 1 --norm_constraint 1e-4 --kl_factor 1 --fact_token subject_last --overwrite --retain_rate --skip_generation_test --attack hess --img_attack_parap orig --lft_edit --fact_erasure --use_img_token --debug --layers_wb_attack "26 27 28 29 30 31 32" --k 4 --epoch 3 --lora_lr 2e-4 --neigh_img --neigh_mm --neigh_type em --lft_edit --lora_enable
@article{
patil2024unlearning,
title={Unlearning Sensitive Information in Multimodal {LLM}s: Benchmark and Attack-Defense Evaluation},
author={Vaidehi Patil and Yi-Lin Sung and Peter Hase and Jie Peng and Tianlong Chen and Mohit Bansal},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2024},
url={https://openreview.net/forum?id=YcnjgKbZQS},
note={}
}