🎨ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies
ComplexBench-Edit is an image-editing benchmark specifically designed to assess performance on complex instructions involving multiple combined and dependent modifications. It systematically evaluates how well models handle both parallel and, critically, chain-dependent instructions. Furthermore, we propose a novel vision-consistency evaluation method that excludes the influence of modified content by assessing consistency only in the remaining, unaltered regions. We also introduce a simple yet powerful Chain-of-Thought (CoT) based approach for image editing.
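To illustrate the idea of region-masked consistency scoring, here is a minimal, hypothetical sketch. It is not the benchmark's actual metric (see `evaluation/` for that); it simply uses masked mean absolute pixel error as a stand-in, and the function name `masked_consistency` and the score normalization are our own assumptions.

```python
def masked_consistency(src, edited, edit_mask):
    """Score consistency between two images, ignoring edited regions.

    src, edited: 2-D lists (H x W) of pixel intensities in [0, 255].
    edit_mask:   2-D list of bools; True marks edited pixels to exclude.
    Returns a score in [0, 1]; 1.0 means all unaltered pixels match.
    """
    total, count = 0.0, 0
    for s_row, e_row, m_row in zip(src, edited, edit_mask):
        for s, e, m in zip(s_row, e_row, m_row):
            if not m:  # only compare pixels outside the edited region
                total += abs(s - e) / 255.0
                count += 1
    return 1.0 - total / count if count else 0.0

src    = [[10, 20], [30, 40]]
edited = [[10, 200], [30, 40]]            # only the top-right pixel changed
mask   = [[False, True], [False, False]]  # that pixel is masked out
print(masked_consistency(src, edited, mask))  # 1.0: unaltered pixels identical
```

Because the changed pixel is masked out, the score reflects only the unaltered background, which is exactly the property the benchmark's vision-consistency evaluation targets.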
- [2025.6.3] We release comparison cases between different baselines and GPT-4o.
- [2025.6.2] We release the source images and editing instructions of the ComplexBench-Edit benchmark.
- [2025.6.1] We release the evaluation code.
- Clone the repository:

```shell
git clone https://github.com/llllly26/ComplexBench-Edit
cd ComplexBench-Edit
```

- Install dependencies:

```shell
pip install -r requirements.txt
```
- Download datasets: The source images can be downloaded from [ Here ]; put them in the `data/more-object-no-multi3` directory. An overview of the data layout:
ComplexBench-Edit/
├── LICENSE
├── README.md
├── baselines/ # Contains implementations of some baseline models
│ ├── icedit.py
├── data/ # Contains benchmark images and instructions in json file.
│ ├── instructions/
│ │ ├── COCO-obj-attr-global/
│ │ ├── COCO-three-obj/
│ │ ├── COCO-two-obj-one-attr/
│ │ ├── three-chain/
│ │ └── two-chain/
│ ├── more-object-no-multi3/
├── edited-image/ # Stores editing images of models
│ └── Gemini/ # Example: Images edited by Gemini
└── evaluation/ # Contains evaluation scripts and prompts
├── count_score.py
├── eval-detection.py
├── eval_prompt/ # Evaluation prompts
├── final_score.py
├── get-bbox.py
├── ins_eval.py
└── read.txt
For the evaluation of all baselines, we use the demo-code parameters provided in their respective original repositories. We thank all the authors.
Example of running a baseline:

```shell
python .\baselines\icedit.py
```

Example of running the instruction-following evaluation:

```shell
python .\evaluation\ins_eval.py --results_folder ".\edited-image\Gemini\COCO-three-obj\testResults_42" --json_path ".\data\COCO-three-obj\final_update_v2.json" --output_dir ".\edited-image\Gemini\COCO-three-obj\testResults_42_eval_v3_thinking_01_21"
```

Here, we showcase several examples from our ComplexBench-Edit benchmark. The image demonstrates the evaluation results of leading instruction-driven editing methods, including GPT-4o.

If you find this work useful for your research, please give it a star ⭐ and consider citing:

```bibtex
@article{wang2025complexbench,
  title={ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies},
  author={Wang, Chenglin and Zhou, Yucheng and Wang, Qianning and Wang, Zhe and Zhang, Kai},
  journal={arXiv preprint arXiv:2506.12830},
  year={2025}
}
```
