- Python 3.11
- CUDA 11.8
- PyTorch 2.1.0
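Before creating the environment, it is worth confirming that your local driver (and toolkit, if installed) match these versions; a quick check, assuming nvidia-smi and nvcc are on your PATH:
nvidia-smi        # driver must support CUDA 11.8
nvcc --version    # if a local toolkit is installed, it should report release 11.8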
- Create a conda environment
conda create -n mmdet3 python=3.11 -y
conda activate mmdet3
- Install PyTorch
pip install torch==2.1.0+cu118 torchvision==0.16.0+cu118 torchaudio==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118
- Install the MMDetection ecosystem using OpenMIM
pip install -U openmim
mim install mmengine==0.10.5
mim install mmcv==2.1.0
mim install mmdet==3.3.0
- Install other dependencies
pip install -r requirements.txt
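A quick sanity check that the installation succeeded and PyTorch sees the GPUs (the exact version strings may carry local build suffixes):
python -c "import torch, mmengine, mmcv, mmdet; print(torch.__version__, torch.cuda.is_available(), mmengine.__version__, mmcv.__version__, mmdet.__version__)"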
The expected directory structure for datasets:
data/
├── coco
│   ├── annotations
│   ├── mdetr_annotations
│   ├── train2014
│   ├── train2017
│   └── val2017
├── flickr30k_entities
│   ├── flickr30k_images
│   └── flickr_train_vg7.jsonl
├── gqa
│   ├── gqa_train_vg7.jsonl
│   └── images
├── mmovod
│   ├── merged.json
│   ├── pseudo_list.pth
│   └── samples
├── objects365
│   ├── annotations
│   └── train
├── qwen
│   ├── annotations
│   └── features
├── retrival
│   └── object_detection.json
└── v3det
    ├── annotations
    └── images
All required annotations have been uploaded to Google Drive. You can download them from: [placeholder]
After downloading, extract and place them in the corresponding directories as shown in the structure above.
Download the pretrained MM-Grounding-DINO models:
mm_grounding_dino/
├── grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth
├── grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth
└── grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
You can download them using:
wget https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
wget https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-b_pretrain_obj365_goldg_v3det/grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth
wget https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-l_pretrain_obj365_goldg/grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth
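The wget commands download into the current directory; to match the layout above (and the load_from paths used in the training commands below), move the checkpoints into mm_grounding_dino/:
mkdir -p mm_grounding_dino
mv grounding_dino_swin-*.pth mm_grounding_dino/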
Before using the distillation technique, features must be extracted offline. The extraction code is provided in the main branch; refer to that branch for the specific extraction command.
All experiments were conducted on 8x NVIDIA RTX 4090 (24GB) GPUs.
Train the text-based distillation model using EVA-CLIP features:
PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=8 tools/train.py \
configs/ov_distill_shortest_edge.py \
--work-dir work_dirs/ov_distill_0.025_0.25_0.025_eva-clip_shortest_edge \
--cfg-options \
model.obj_loss_weight=0.025 \
model.block_loss_weight=0.25 \
model.global_loss_weight=0.025 \
load_from=mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
--resume \
--launcher pytorch
Train the image-based distillation model using LLM-extracted features:
PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=8 tools/train.py \
configs/fs_llm_features_distill.py \
--work-dir work_dirs/fs_distill_0.03_0.8_0.8 \
--cfg-options \
model.w_distill=0.03 \
model.w_global=0.8 \
model.w_structure=0.8 \
load_from=mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
--resume \
--launcher pytorch
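Both training commands assume 8 GPUs. On fewer GPUs the same entry point can be used with a smaller --nproc_per_node; the sketch below is illustrative only (the work-dir name is made up), and you may need to adjust the learning rate or per-GPU batch size in the config to compensate for the smaller total batch size:
PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=4 tools/train.py \
    configs/ov_distill_shortest_edge.py \
    --work-dir work_dirs/ov_distill_4gpu \
    --cfg-options load_from=mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
    --launcher pytorch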
Evaluate the text-based distillation model on LVIS open-vocabulary detection:
PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=8 tools/test.py \
configs/evaluation/lvis_val_ov.py \
work_dirs/ov_distill_0.025_0.25_0.025_eva-clip_shortest_edge/iter_150000.pth \
--work-dir work_dirs/ov_distill_0.025_0.25_0.025_eva-clip_shortest_edge/150000 \
--launcher pytorch
Evaluate the image-based distillation model on LVIS validation:
PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=8 tools/test.py \
configs/evaluation/lvis_val.py \
work_dirs/fs_distill_0.03_0.8_0.8/iter_16000.pth \
--work-dir work_dirs/fs_distill_0.03_0.8_0.8/iter_16000 \
--launcher pytorch