- Python 3.11
- CUDA 11.8
- PyTorch 2.1.0
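Before creating the environment, it is worth confirming that your local driver (and toolkit, if installed) match these versions; a quick check, assuming nvidia-smi and nvcc are on your PATH:
nvidia-smi        # driver must support CUDA 11.8
nvcc --version    # if a local toolkit is installed, it should report release 11.8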
- Create a conda environment
conda create -n mmdet3 python=3.11 -y
conda activate mmdet3
- Install PyTorch
pip install torch==2.1.0+cu118 torchvision==0.16.0+cu118 torchaudio==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118
- Install the MMDetection ecosystem using OpenMIM
pip install -U openmim
mim install mmengine==0.10.5
mim install mmcv==2.1.0
mim install mmdet==3.3.0
- Install other dependencies
pip install -r requirements.txt
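A quick sanity check that the installation succeeded and PyTorch sees the GPUs (the exact version strings may carry local build suffixes):
python -c "import torch, mmengine, mmcv, mmdet; print(torch.__version__, torch.cuda.is_available(), mmengine.__version__, mmcv.__version__, mmdet.__version__)"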
The expected directory structure for datasets:
data/
├── coco
│   ├── annotations
│   ├── mdetr_annotations
│   ├── train2014
│   ├── train2017
│   └── val2017
├── flickr30k_entities
│   ├── flickr30k_images
│   └── flickr_train_vg7.jsonl
├── gqa
│   ├── gqa_train_vg7.jsonl
│   └── images
├── mmovod
│   ├── merged.json
│   ├── pseudo_list.pth
│   └── samples
├── objects365
│   ├── annotations
│   └── train
├── qwen
│   ├── annotations
│   └── features
├── retrival
│   └── object_detection.json
└── v3det
    ├── annotations
    └── images
All required annotations have been uploaded to Google Drive. You can download them from: [placeholder]
After downloading, extract and place them in the corresponding directories as shown in the structure above.
Download the pretrained MM-Grounding-DINO models:
mm_grounding_dino/
├── grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth
├── grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth
└── grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
You can download them using:
wget https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
wget https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-b_pretrain_obj365_goldg_v3det/grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth
wget https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-l_pretrain_obj365_goldg/grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth
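The wget commands download into the current directory; to match the layout above (and the load_from paths used in the training commands below), move the checkpoints into mm_grounding_dino/:
mkdir -p mm_grounding_dino
mv grounding_dino_swin-*.pth mm_grounding_dino/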
Before using the distillation technique, features must be extracted offline. The extraction code is provided in the main branch; refer to that branch for the specific extraction command.
All experiments were conducted on 8x NVIDIA RTX 4090 (24GB) GPUs.
Train the text-based distillation model using EVA-CLIP features:
PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=8 tools/train.py \
configs/ov_distill_shortest_edge.py \
--work-dir work_dirs/ov_distill_0.025_0.25_0.025_eva-clip_shortest_edge \
--cfg-options \
model.obj_loss_weight=0.025 \
model.block_loss_weight=0.25 \
model.global_loss_weight=0.025 \
load_from=mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
--resume \
--launcher pytorch
Train the image-based distillation model using LLM-extracted features:
PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=8 tools/train.py \
configs/fs_llm_features_distill.py \
--work-dir work_dirs/fs_distill_0.03_0.8_0.8 \
--cfg-options \
model.w_distill=0.03 \
model.w_global=0.8 \
model.w_structure=0.8 \
load_from=mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
--resume \
--launcher pytorch
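Both training commands assume 8 GPUs. On fewer GPUs the same entry point can be used with a smaller --nproc_per_node; the sketch below is illustrative only (the work-dir name is made up), and you may need to adjust the learning rate or per-GPU batch size in the config to compensate for the smaller total batch size:
PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=4 tools/train.py \
    configs/ov_distill_shortest_edge.py \
    --work-dir work_dirs/ov_distill_4gpu \
    --cfg-options load_from=mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
    --launcher pytorch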
Evaluate the text-based distillation model on LVIS open-vocabulary detection:
PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=8 tools/test.py \
configs/evaluation/lvis_val_ov.py \
work_dirs/ov_distill_0.025_0.25_0.025_eva-clip_shortest_edge/iter_150000.pth \
--work-dir work_dirs/ov_distill_0.025_0.25_0.025_eva-clip_shortest_edge/150000 \
--launcher pytorch
Evaluate the image-based distillation model on LVIS validation:
PYTHONPATH=$(pwd):$PYTHONPATH torchrun --nproc_per_node=8 tools/test.py \
configs/evaluation/lvis_val.py \
work_dirs/fs_distill_0.03_0.8_0.8/iter_16000.pth \
--work-dir work_dirs/fs_distill_0.03_0.8_0.8/iter_16000 \
--launcher pytorch