"HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models"
Eslam Abdelrahman, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan, Li Erran Li, and Mohamed Elhoseiny
This is a forked version of HRS-Bench, modified from the Attention-Refocusing paper. Users can benchmark both text-to-image and box-layout-to-image generation tasks. For box-layout-to-image tasks, the GPT-4-generated layouts provided by the Attention-Refocusing repository are used.
The core functionality of HRS-Bench is included in the `src` directory, largely copied from Attention-Refocusing.
Using the HRS prompts and the GPT-4-generated box layouts from Attention-Refocusing, I pre-processed the complete dataset into `hrs_dataset` in JSONL format.
The dataset includes four main categories: counting, spatial relationship, size, and color. Although HRS provides 3,000/1,002/501/501 prompts for counting/spatial/size/color, some of them are duplicated. In addition, the last prompt of each task is mysteriously missing from the `*.p` pickle files, so slightly fewer unique prompts can actually be evaluated:

| Category | HRS prompts | Unique prompts | Evaluable unique prompts |
|----------|------------:|---------------:|-------------------------:|
| counting | 3,000       | 2,990          | 2,990                    |
| spatial  | 1,002       | 898            | 896                      |
| size     | 501         | 424            | 423                      |
| color    | 501         | 484            | 483                      |
Set up the virtual environment first:

```bash
uv sync
source .venv/bin/activate
```
Download the UniDet and MaskDINO weights with the following commands:

```bash
gdown 110JSpmfNU__7T3IMSJwv0QSfLLo_AqtZ
wget https://github.com/IDEA-Research/detrex-storage/releases/download/maskdino-v0.1.0/maskdino_swinl_50ep_300q_hid2048_3sd1_instance_maskenhanced_mask52.3ap_box59.0ap.pth
```
The HRS-Bench dataset is structured to support benchmarking of both text-to-image (T2I) and layout-to-image (L2I) tasks. The dataset includes prompts and corresponding box layouts for each image. Users can generate images using either the prompts alone or the prompts together with the box layouts.

All dataset specifications, including prompt formats and layout details, are provided in `hrs_dataset/` in JSONL format. The core fields users need to be aware of are listed below, followed by a minimal loading sketch.
- `prompt`: the text prompt used for image generation.
- `phrases`: simple descriptions of the corresponding box layouts.
- `bounding_boxes`: the box layout information, one 4-tuple `(x_min, y_min, x_max, y_max)` per box on a 0-1 scale.
- About tags:
  - `expected_obj1` to `expected_obj4` (at most four) provide the object tags contained in each prompt. These tags are crucial for tag-based attention modulation.
- About instance counts:
  - For `spatial`, `size`, and `color` tasks, only one instance is presented per object category, i.e., `n=1` is the default setting for these tasks.
  - For `counting` tasks, multiple instances may be presented for at most two object categories. Refer to `expected_n1` and `expected_n2` to retrieve the ground-truth instance counts.
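For illustration, here is a minimal sketch of reading one of the JSONL files (the per-task file name is an assumption; check `hrs_dataset/` for the actual names):

```python
import json

# Load one task's records (file name assumed; see hrs_dataset/ for the real ones).
with open("hrs_dataset/counting.jsonl") as f:
    records = [json.loads(line) for line in f]

rec = records[0]
print(rec["prompt"])          # text prompt for T2I generation
print(rec["phrases"])         # one short phrase per bounding box
print(rec["bounding_boxes"])  # (x_min, y_min, x_max, y_max) on a 0-1 scale, per box
# Tag and count fields such as rec["expected_obj1"] and rec["expected_n1"]
# carry the ground-truth object tags and instance counts described above.
```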
The generated images for each task should be saved in a separate folder, with all task folders sharing the same parent directory, and each image should follow the naming convention `<prompt_idx>_<level>_<prompt>.[png|jpg]`. For example:

```
/path/to/IMAGE_ROOT/
├── color_seed42/
├── counting_seed42/
├── size_seed42/
└── spatial_seed42/
```
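A minimal sketch of a helper that builds output paths following this convention (the helper name and the prompt-sanitization rule are my assumptions; verify against the scripts in `src` for the exact convention the evaluator expects):

```python
import os

def build_save_path(image_root: str, task: str, seed: int,
                    prompt_idx: int, level: str, prompt: str) -> str:
    """Hypothetical helper: <IMAGE_ROOT>/<task>_seed<seed>/<prompt_idx>_<level>_<prompt>.png"""
    folder = os.path.join(image_root, f"{task}_seed{seed}")
    os.makedirs(folder, exist_ok=True)
    # Replacing path separators in the prompt is an assumption; check src/.
    safe_prompt = prompt.replace("/", " ").strip()
    return os.path.join(folder, f"{prompt_idx}_{level}_{safe_prompt}.png")

# e.g. build_save_path("/path/to/IMAGE_ROOT", "counting", 42, 0, "easy", "two dogs and a cat")
# -> /path/to/IMAGE_ROOT/counting_seed42/0_easy_two dogs and a cat.png
```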
`run_hrs_benchmark.sh` provides an easy way to evaluate the generated images against the HRS dataset. The script automatically runs the evaluation process and saves the results in a specified output directory. More details about the intermediate steps can be found in the `README.md` in the `src` directory.
Run the benchmark script as follows:

```bash
bash run_hrs_benchmark.sh <METHOD_NAME> <IMAGE_ROOT> <GENERATION_SEED>
```

where

- `<METHOD_NAME>`: the name of the evaluated method; it is used to name the output directory (e.g., `SD1.5`).
- `<IMAGE_ROOT>`: the root directory containing the generated images to evaluate.
- `<GENERATION_SEED>`: the seed used for image generation, which helps in reproducing results.
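For example, to evaluate images generated by Stable Diffusion 1.5 with seed 42 under the directory layout shown above, you would run `bash run_hrs_benchmark.sh SD1.5 /path/to/IMAGE_ROOT 42`.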