
UnsafeBench: A Comprehensive Benchmark for Vision-Language Model Safety

UnsafeBench is an evaluation framework for assessing the effectiveness and robustness of image safety classifiers and Vision-Language Models (VLMs) on unsafe image content.

🚀 Features

  • Multi-modal Safety Assessment: Evaluate both image safety classifiers and vision-language models
  • Comprehensive Dataset Support: Built-in support for multiple safety datasets including SMID, NSFWDataset, MultiHeaded_Dataset, Violence_Dataset, and Self-harm_Dataset
  • Adversarial Robustness Testing: Tools for evaluating model robustness against adversarial attacks
  • Standardized Evaluation: Consistent evaluation protocols across different model types
  • Extensible Architecture: Easy to add new models and datasets

📦 Installation

Prerequisites

  • Python 3.10 or higher
  • CUDA-compatible GPU
  • Git
  • Conda

Quick Setup

  1. Set up the environment:
bash setup.sh
conda activate llava
  2. For InstructBLIP models (optional):
bash setup_lavis_env.sh
conda activate lavis
  3. Set up tokens:
export HF_TOKEN=hf_xxx   # required
export OPENAI_API_KEY=sk-proj-xxx   # optional

🎯 UnsafeBench Evaluation

Attention: due to ethical considerations, the UnsafeBench dataset is provided for research and educational purposes only. To use it, please first request access on Hugging Face: https://huggingface.co/datasets/yiting/UnsafeBench
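Once access is granted, the dataset can be loaded with the token exported during setup. A minimal sketch using the huggingface_hub and datasets libraries:

import os
from datasets import load_dataset
from huggingface_hub import login

# Authenticate with the HF_TOKEN exported during setup; access to the
# gated UnsafeBench dataset must already have been granted on Hugging Face.
login(token=os.environ["HF_TOKEN"])

dataset = load_dataset("yiting/UnsafeBench")
print(dataset)  # splits and sizes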

Effectiveness Evaluation

  1. Generate VLM responses using the UnsafeBench dataset:
python query_vlms.py --model_name llava-v1.5-7b --save_dir outputs/vlm_responses
conda activate lavis   # switch to the lavis environment for InstructBLIP
python query_vlms.py --model_name instructblip-7b --save_dir outputs/vlm_responses

GPT-4V (gpt-4-vision-preview) has been deprecated, so it can no longer be queried. Nonetheless, we provide the previously generated responses at the link.

For reproducible evaluation, you can skip this step and use our responses directly; they will be downloaded automatically in step 2.

  2. Obtain evaluation results:
python assess_effectiveness.py --classifiers "Q16" "SD_Filter" "llava-v1.5-7b" \
    --vlm_response_dir ./outputs/vlm_responses \
    --save_dir ./outputs/effectiveness/UnsafeBench

This reproduces the effectiveness results in Table 3.
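To inspect the saved predictions yourself, per-classifier metrics can be computed with scikit-learn. A minimal sketch, where the file name and JSON layout are hypothetical illustrations rather than the exact format written by assess_effectiveness.py:

import json
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical prediction file: a list of {"label": 0/1, "pred": 0/1}
# records; the actual format written by assess_effectiveness.py may differ.
with open("outputs/effectiveness/UnsafeBench/Q16.json") as f:
    records = json.load(f)

labels = [r["label"] for r in records]
preds = [r["pred"] for r in records]

print(f"Accuracy: {accuracy_score(labels, preds):.3f}")
print(f"F1 score: {f1_score(labels, preds):.3f}")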

Robustness Evaluation

  1. Conduct adversarial attacks against conventional classifiers:
python assess_robustness.py --classifiers "Q16" "MultiHeaded" "SD_Filter" "NSFW_Detector" "NudeNet" \
    --attack_types "GN" "FGSM" "PGD" "DeepFool" \
    --eps 0.01 \
    --prediction_path ./outputs/effectiveness/UnsafeBench \
    --save_dir ./outputs/robustness

Running this will yield the Robust Accuracy of conventional classifiers, as shown in Table 4.
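For intuition, the gradient-based attacks perturb each image within an eps-ball; FGSM, for instance, takes a single eps-sized step along the sign of the loss gradient. A minimal PyTorch sketch of FGSM (illustrative only; the repository's attacks live in assess_robustness.py):

import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, eps=0.01):
    # Single-step FGSM: move each pixel by eps in the direction that
    # increases the classification loss, then clamp back to a valid
    # pixel range (assuming inputs in [0, 1]).
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv_images = images + eps * images.grad.sign()
    return adv_images.clamp(0, 1).detach()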

  2. Conduct adversarial attacks against VLM classifiers:

For LLaVA:

python llava_adv_attack.py --model-base liuhaotian/llava-v1.5-7b \
    --attack_types "GN" "FGSM" "PGD" "DeepFool" \
    --eps 0.01 \
    --prediction_path ./outputs/effectiveness/UnsafeBench \
    --save_dir ./outputs/robustness

For InstructBLIP:

python lavis_adv_attack.py \
    --attack_types "GN" "FGSM" "PGD" "DeepFool" \
    --eps 0.01 \
    --prediction_path ./outputs/effectiveness/UnsafeBench \
    --save_dir ./outputs/robustness

Running these will yield the Robust Accuracy of LLaVA and InstructBLIP, as shown in Table 4.
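Robust Accuracy here measures the fraction of adversarial examples that a model still labels correctly. A minimal sketch, reusing the hypothetical fgsm_attack helper above:

import torch

@torch.no_grad()
def robust_accuracy(model, adv_images, labels):
    # Fraction of adversarial examples that are still classified correctly.
    preds = model(adv_images).argmax(dim=-1)
    return (preds == labels).float().mean().item()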

🎯 PerspectiveVision

  1. For direct inference:
cd PerspectiveVision
python eval.py --dataset_names UnsafeBench_test SMID NSFWDataset MultiHeaded_Dataset Violence_Dataset Self-harm_Dataset

This reproduces the effectiveness result of PerspectiveVision in Table 5.

  2. For training:
python build_training_set.py

Train LLaVA using LoRA:

bash train_llava.sh
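train_llava.sh fine-tunes LLaVA with LoRA, which trains small low-rank adapter matrices instead of the full model weights. For readers unfamiliar with LoRA, a minimal sketch of attaching an adapter to a causal LM with the peft library (illustrative only; the actual script uses LLaVA's own training code and settings):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base LLM only, for illustration; LLaVA additionally wires in a vision tower.
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# Illustrative LoRA hyperparameters; those in train_llava.sh may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable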

📊 Supported Models

Image Safety Classifiers

  • Q16: CLIP-based multi-category safety classifier
  • NudeNet: Nudity detection model
  • NSFW_Detector: General NSFW content detector
  • MultiHeaded: Multi-head classification model
  • SD_Filter: Stable Diffusion safety filter

Vision-Language Models

  • LLaVA-v1.5: Large Language and Vision Assistant
  • InstructBLIP: Instruction-tuned BLIP model
  • GPT-4V: OpenAI's GPT-4 with vision (deprecated; precomputed responses are provided)

Usage Example:

from classifiers import Q16
import torch

# Load the CLIP-based Q16 classifier
classifier = Q16()

# Preprocess a batch of image paths, then classify them
image_paths = ["data/test.png"]
images = classifier.preprocess_images(image_paths)
logits = classifier.classify(images)
pred = torch.argmax(logits, dim=-1)
print(f"Prediction: {pred} (0: safe, 1: unsafe)")

🗂️ Dataset Support

The framework supports multiple safety datasets:

  • UnsafeBench: Comprehensive multi-category unsafe content dataset
  • SMID: Socio-Moral Image Database
  • NSFWDataset: Not Safe For Work content dataset
  • MultiHeaded_Dataset: Multi-head classification dataset
  • Violence_Dataset: Violence detection dataset
  • Self-harm_Dataset: Self-harm content detection dataset

Usage Example:

UnsafeBench:

from datasets import load_dataset
dataset = load_dataset("yiting/UnsafeBench")["train"]
print(dataset[0])

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1600x1435 at 0x7FB291227D90>,
'safety_label': 'Safe',
'category': 'Hate',
'source': 'Laion5B',
'text': "xxx"}

Other External Datasets:

from datasets import load_dataset
import io
import base64
from PIL import Image

dataset_name = "SMID"

dataset = load_dataset(f"yiting/{dataset_name}", split="train")

encoded_image = dataset[0]["image"]

def decode_base64_to_image(base64_string, target_size=-1):
    image_data = base64.b64decode(base64_string)
    image = Image.open(io.BytesIO(image_data))
    if image.mode in ('RGBA', 'P'):
        image = image.convert('RGB')
    if target_size > 0:
        image.thumbnail((target_size, target_size))
    return image

image = decode_base64_to_image(encoded_image)
label = dataset[0]["label"]
print(image, label)  # label: 0 = safe, 1 = unsafe

[Attention!] When using these external datasets, please do not cite our work; cite their original papers instead.

🛠️ Project Structure

├── README.md                 # Project documentation
├── pyproject.toml            # Project configuration
├── unsafe_datasets.py        # Dataset loading and management
├── classifiers.py            # Image safety classifiers
├── vlms.py                   # Vision-language models
├── assess_effectiveness.py   # Model effectiveness evaluation
├── assess_robustness.py      # Robustness evaluation
├── query_vlms.py             # VLM querying utilities
├── utils.py                  # Helper functions
├── data/                     # Dataset storage
├── outputs/                  # Evaluation results
└── PerspectiveVision/        # PerspectiveVision evaluation and training

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

📧 Contact

For questions, issues, or collaborations, please open an issue in this repository.

📚 Citation

If you use UnsafeBench in your research, please cite:

@inproceedings{QSWBZZ24,
  author    = {Yiting Qu and Xinyue Shen and Yixin Wu and Michael Backes and Savvas Zannettou and Yang Zhang},
  title     = {{UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images}},
  booktitle = {{ACM SIGSAC Conference on Computer and Communications Security (CCS)}},
  publisher = {ACM},
  year      = {2025}
}
