This is the codebase for the CVPR 2022 paper "Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model".
Download the datasets following the instructions for LVIS, VOC, COCO and Objects365. Precomputed proposals, generated by an RPN trained only on base classes, can be downloaded from google drive or baiduyun (code:yadt). We recommend downloading and extracting the datasets somewhere outside the project directory and symlinking the dataset root to data, as below.
├── mmdet
├── tools
├── configs
├── data
│   ├── lvis_v1
│   │   ├── annotations
│   │   ├── train2017
│   │   ├── val2017
│   │   ├── proposals
│   ├── coco
│   │   ├── annotations
│   │   ├── train2017
│   │   ├── val2017
│   ├── VOCdevkit
│   │   ├── VOC2007
│   │   ├── VOC2012
│   ├── objects365
│   │   ├── annotations
│   │   ├── train
│   │   ├── val
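The layout above can be set up and checked with a few lines of Python. This is only a sketch: `link_dataset_root` and `missing_dirs` are hypothetical helpers that are not part of this repo, and the directory list is copied from the tree above.

```python
from pathlib import Path

# Expected sub-directories under data/, copied from the tree above.
EXPECTED_DIRS = [
    "lvis_v1/annotations",
    "lvis_v1/train2017",
    "lvis_v1/val2017",
    "lvis_v1/proposals",
    "coco/annotations",
    "coco/train2017",
    "coco/val2017",
    "VOCdevkit/VOC2007",
    "VOCdevkit/VOC2012",
    "objects365/annotations",
    "objects365/train",
    "objects365/val",
]


def link_dataset_root(dataset_root, project_root="."):
    """Symlink <project_root>/data to dataset_root (hypothetical helper)."""
    link = Path(project_root) / "data"
    # Leave an existing data/ directory or symlink untouched.
    if not (link.exists() or link.is_symlink()):
        link.symlink_to(Path(dataset_root).resolve(), target_is_directory=True)
    return link


def missing_dirs(data_root="data"):
    """Return the expected sub-directories that are absent under data_root."""
    root = Path(data_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    for d in missing_dirs():
        print(f"missing: data/{d}")
```

Running the script from the project root lists any dataset folder that has not been extracted or linked yet. Only the datasets you intend to use need to be present.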
All models use a backbone pretrained with SoCo, which can be downloaded from google drive or baiduyun (code:kwps). Put the pretrained backbone under data/.
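| Model | Lr Schd | AP<sup>bb</sup><sub>r</sub> | AP<sup>bb</sup><sub>c</sub> | AP<sup>bb</sup><sub>f</sub> | AP<sup>bb</sup> | AP<sup>mk</sup><sub>r</sub> | AP<sup>mk</sup><sub>c</sub> | AP<sup>mk</sup><sub>f</sub> | AP<sup>mk</sup> | Config | Prompt | Model |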
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViLD* | 20 epochs | 17.4 | 27.5 | 31.9 | 27.5 | 16.8 | 25.6 | 28.5 | 25.2 | config | google drive baiduyun (code:a5ni) | google drive baiduyun (code:cyhv) | 
| DetPro (Mask R-CNN) | 20 epochs | 20.8 | 27.8 | 32.4 | 28.4 | 19.8 | 25.6 | 28.9 | 25.9 | config | google drive baiduyun (code:uvab) | google drive baiduyun (code:apmq) | 
| DetPro (Cascade R-CNN) | 20 epochs | 21.7 | 29.6 | 35.0 | 30.5 | 20.0 | 26.7 | 30.4 | 27.0 | config | google drive baiduyun (code:uvab) | google drive baiduyun (code:5ee9) | 
In the original implementation of ViLD, training takes up to 180,000 iterations with a batch size of 256 (approximately 460 epochs), which is computationally unaffordable. We re-implement ViLD (denoted as ViLD*) with a backbone pretrained using SoCo. Our re-implementation achieves AP comparable to the original implementation while reducing training from 460 epochs to 20.
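The 460-epoch figure follows from simple arithmetic. The sketch below assumes the LVIS v1 training split contains 100,170 images; that number is not stated in this README.

```python
def iterations_to_epochs(iterations, batch_size, num_train_images):
    """Convert an iteration budget into passes over the training set."""
    return iterations * batch_size / num_train_images


# Assumption: size of the LVIS v1 training split (not given in this README).
LVIS_V1_TRAIN_IMAGES = 100_170

epochs = iterations_to_epochs(180_000, 256, LVIS_V1_TRAIN_IMAGES)
print(f"{epochs:.0f} epochs")  # prints "460 epochs"
```

The same helper shows that a 20-epoch schedule sees about 2 million images in total, against roughly 46 million for the original schedule.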
- Python 3.8
- PyTorch 1.7.0
- CUDA 11.0
This repo is built on mmdetection, CLIP and CoOp.
pip install -r requirements/build.txt
pip install -e .
pip install git+https://github.com/openai/CLIP.git
pip uninstall pycocotools -y
pip uninstall mmpycocotools -y
pip install mmpycocotools
pip install git+https://github.com/lvis-dataset/lvis-api.git
pip install mmcv-full==1.2.5 -f https://download.openmmlab.com/mmcv/dist/cu110/torch1.7.0/index.html

./tools/dist_test.sh <config> <model> <gpu_num> --eval bbox segm --cfg-options model.roi_head.prompt_path=<prompt> model.roi_head.load_feature=False
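When evaluating several checkpoints, it can help to assemble the `./tools/dist_test.sh` invocation programmatically. The sketch below only builds the argument list shown above. `build_test_command` is a hypothetical helper, and the paths and GPU count in the usage part are placeholders, not files shipped with this repo.

```python
import shlex


def build_test_command(config, model, gpu_num, prompt):
    """Build the dist_test.sh argument list (hypothetical helper)."""
    return [
        "./tools/dist_test.sh", config, model, str(gpu_num),
        "--eval", "bbox", "segm",
        "--cfg-options",
        f"model.roi_head.prompt_path={prompt}",
        "model.roi_head.load_feature=False",
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own config, checkpoint and prompt.
    cmd = build_test_command(
        "path/to/config.py", "path/to/model.pth", 8, "path/to/prompt.pth"
    )
    print(" ".join(shlex.quote(part) for part in cmd))
    # To launch it for real: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string lets the command be passed to `subprocess.run` without shell quoting issues.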
To extract the CLIP image embeddings of the precomputed proposals, see prepare.sh.
This process takes a long time, so we also provide the extracted CLIP image embeddings of the precomputed proposals: baiduyun (code:o4n5). Download all of the zip files and merge them into one file (lvis_clip_image_embedding.zip).
The training code and checkpoint are available here: baiduyun (code:tqsd).
To train DetPro, see detpro.sh.
To train ViLD with DetPro, see vild_detpro.sh.
For the transfer experiments, see transer.sh.
The empty prompt is provided here; you can use it to generate the prompts for COCO, VOC and Objects365.
@article{du2022learning,
  title={Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model},
  author={Du, Yu and Wei, Fangyun and Zhang, Zihe and Shi, Miaojing and Gao, Yue and Li, Guoqi},
  journal={arXiv preprint arXiv:2203.14940},
  year={2022}
}