MJ-Bench

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

If you like our project, please consider giving us a star ⭐ 🥹🙏


Setup

Installation

Create environment and install dependencies.

conda create -n MM python=3.8
conda activate MM
pip install -r requirements.txt

Dataset Preparation

We host the MJ-Bench dataset on Hugging Face. You should first request access on the dataset page; access is granted automatically. Then you can load the dataset via:

from datasets import load_dataset
dataset = load_dataset("MJ-Bench/MJ-Bench")
# use streaming mode to load on the fly
dataset = load_dataset("MJ-Bench/MJ-Bench", streaming=True)
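
To sanity-check what you loaded, you can peek at one example (the split name below is an assumption; inspect the returned object for the actual splits and field layout):

# peek at the first example of the streaming dataset
# ("train" is an assumed split name; check list(dataset) for the real splits)
example = next(iter(dataset["train"]))
print(example.keys())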

Judge Model Configuration

config/config.yaml contains the configuration for the three types of multimodal judges you can evaluate. You can copy the default configuration to a new file and modify the model_path and api_key for your own environment. If you add new models, make sure you also add the load_model and get_score functions in the corresponding files under reward_models/, as sketched below.
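
As a rough illustration, a new score-based judge would expose something like the following (a minimal sketch; the function signatures, config layout, and scoring logic are all assumptions, so follow the existing files under reward_models/ for the real interface):

import yaml

def load_model(config_path, model_name):
    # Read this judge's entry from the YAML config.
    # (Layout assumed here: one top-level key per model; check config/config.yaml.)
    with open(config_path) as f:
        config = yaml.safe_load(f)
    model_config = config[model_name]
    # ... instantiate the real model from model_config["model_path"] / model_config["api_key"] ...
    return model_config

def get_score(model, prompt, image_path):
    # Return a scalar preference score for (prompt, image).
    # Placeholder body: a real judge runs the model here and returns its score.
    return 0.0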

Judge Model Evaluation

To get the inference result from a multimodal judge, simply run

python inference.py --model [MODEL_NAME] --config_path [CONFIG_PATH] --dataset [DATASET] --perspective [PERSPECTIVE] --save_dir [SAVE_DIR] --threshold [THRESHOLD] --multi_image [MULTI_IMAGE] --prompt_template_path [PROMPT_PATH]

where:

- MODEL_NAME is the name of the reward model to evaluate;
- CONFIG_PATH is the path to the configuration file;
- DATASET is the dataset to evaluate on (default: MJ-Bench/MJ-Bench);
- PERSPECTIVE is the data subset to evaluate (e.g. alignment, safety, quality, bias);
- SAVE_DIR is the directory to save the results;
- THRESHOLD is the preference threshold for the score-based RMs (i.e., image_0 is preferred only if score(image_0) - score(image_1) > THRESHOLD);
- MULTI_IMAGE indicates whether to feed multiple images to the judge at once (only closed-source VLMs and some open-source VLMs support this);
- PROMPT_PATH is the path to the prompt template for the VLM judges (it must be consistent with MULTI_IMAGE).
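
For example, an evaluation run might look like the following (the model name here is a placeholder; use a name defined in your config file):

python inference.py \
    --model MODEL_FROM_YOUR_CONFIG \
    --config_path config/config.yaml \
    --dataset MJ-Bench/MJ-Bench \
    --perspective alignment \
    --save_dir results \
    --threshold 0.0

With --threshold 0.0, image_0 is preferred whenever score(image_0) - score(image_1) > 0.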

Citation

@article{chen2024mj,
  title={MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?},
  author={Chen, Zhaorun and Du, Yichao and Wen, Zichen and Zhou, Yiyang and Cui, Chenhang and Weng, Zhenzhen and Tu, Haoqin and Wang, Chaoqi and Tong, Zhengwei and Huang, Qinglan and others},
  journal={arXiv preprint arXiv:2407.04842},
  year={2024}
}
