This repository is the official PyTorch implementation of DeCLIP.
DeCLIP is an unsupervised fine-tuning framework for open-vocabulary dense perception tasks. It decouples CLIP's self-attention module to obtain "content" and "context" features, which learn from CLIP itself and from vision foundation models (VFMs), respectively, to enhance local discriminability and spatial consistency.
- We analyze CLIP and find that its limitation in open-vocabulary dense prediction arises from image tokens failing to aggregate information from spatially or semantically related regions.
- To address this issue, we propose DeCLIP, a simple yet effective unsupervised fine-tuning framework that enhances the discriminability and spatial consistency of CLIP's local features via a decoupled feature enhancement strategy (a rough sketch of the decoupling follows the highlights below).
- DeCLIP outperforms previous state-of-the-art models on a broad range of open-vocabulary dense prediction benchmarks.
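For intuition, the sketch below shows one way the decoupling described above could look in code. It is not the DeCLIP implementation: the module name, the choice of query-query and key-key correlations, and all shapes are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledAttentionSketch(nn.Module):
    """Illustrative decoupling of a CLIP self-attention layer into a "content"
    branch and a "context" branch. NOT the DeCLIP code; the q-q / k-k
    correlation choice and all names are assumptions."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, num_tokens, dim)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))

        # "Content" branch: token-to-token correlations from one projection,
        # intended to be distilled from CLIP itself for local discriminability.
        content_attn = F.softmax((q @ q.transpose(-2, -1)) * self.scale, dim=-1)
        content = (content_attn @ v).transpose(1, 2).reshape(B, N, C)

        # "Context" branch: correlations from another projection, intended to be
        # aligned with a VFM (e.g. DINOv2) for spatial consistency.
        context_attn = F.softmax((k @ k.transpose(-2, -1)) * self.scale, dim=-1)
        context = (context_attn @ v).transpose(1, 2).reshape(B, N, C)

        return self.proj(content), self.proj(context)
```

Conceptually, the "content" branch learns from CLIP itself and the "context" branch from a vision foundation model, matching the description above; the actual projections and losses used by DeCLIP live in the training code of this repository.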
- [2025.05.07] We will release the complete training and inference code as well as the model weights. Stay tuned!
- [2025.02.27] Our work has been accepted at CVPR 2025.
- Initialize Project
- Release the training and inference code of DeCLIP
- Release evaluation code for DeCLIP on open-vocabulary semantic segmentation based on VLM features. Please refer to the ZSSS branch.
- Release the code to integrate DeCLIP into CAT-Seg. Please refer to DeCLIP_CATSeg.
- Release the code to integrate DeCLIP into F-ViT and OV-DQUO.
- Linux with Python == 3.10.0
- CUDA 11.7
- The provided environment is suggested for reproducing our results; similar configurations may also work.
conda create -n DeCLIP python=3.10.0
conda activate DeCLIP
pip install -r requirements.txt
pip install -e . -v
The distillation process of DeCLIP does not rely on any annotations and only requires input images. In our paper, we use the COCO dataset. Please download the dataset and organize the folders as follows.
DeCLIP/
├── dataset
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json  # only access images
│   │   │   ├── panoptic_val2017.json     # for validation
│   │   ├── panoptic_val2017
│   │   ├── train2017
│   │   ├── val2017
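As a quick sanity check of the layout above, you can enumerate the training images with a small snippet like the following (hypothetical, not part of the repository; the annotation file is only used to list image names):

```python
import json
import os

# Hypothetical sanity check: since distillation only needs images, the
# annotation file is read merely to enumerate them.
data_root = "dataset/coco"  # adjust to your local path
with open(os.path.join(data_root, "annotations", "instances_train2017.json")) as f:
    images = json.load(f)["images"]

paths = [os.path.join(data_root, "train2017", im["file_name"]) for im in images]
missing = [p for p in paths if not os.path.isfile(p)]
print(f"{len(paths)} training images listed, {len(missing)} missing on disk")
```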
Please download the pretrained weights from EVA-CLIP and organize them as shown below.
DeCLIP/
├── checkpoints
│   ├── EVA02_CLIP_B_psz16_s8B.pt
│   ├── EVA02_CLIP_L_336_psz14_s6B.pt
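Optionally, you can verify that the downloaded checkpoints deserialize cleanly before training. The snippet below is a hypothetical check, not part of the repository; the nesting of the state dict is an assumption that may differ for your file:

```python
import torch

# Optional, hypothetical check: make sure the EVA-CLIP checkpoint loads on CPU.
ckpt = torch.load("checkpoints/EVA02_CLIP_B_psz16_s8B.pt", map_location="cpu")
# Depending on how the weights were exported, the tensors may sit at the top
# level or be nested under a key such as "state_dict" or "module".
state_dict = next((ckpt[k] for k in ("state_dict", "module")
                   if isinstance(ckpt, dict) and k in ckpt), ckpt)
print(f"{len(state_dict)} entries loaded; sample keys: {list(state_dict)[:3]}")
```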
Before starting training, please set the following variables in the training script.
data_root=** # path to your coco dataset
pretrain_ckpt=** # path to your EVA-CLIP checkpoint
exp_name=** # output folder name
vfm_type=** # which VFM to use; we use dinov2-B and dinov2-L by default. Options: {sam-B, sam-L, dinov2-B, dinov2-L, dino-B-8, dino-B-16}
Note: If you encounter freezing or hanging during training, it is likely caused by problems downloading or loading the vision foundation models (VFMs) from torch.hub. We highly recommend manually downloading the SAM, DINOv2, or DINO code and weights locally before starting DeCLIP training.
You can modify the relevant code in build_vfm(), located in DeCLIP/src/training.
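For example, a locally cloned DINOv2 could be loaded roughly as follows. This is only a sketch: the paths are placeholders, the hub entry name and `pretrained` flag assume the standard DINOv2 hubconf, and how build_vfm() consumes the model depends on your local modification.

```python
import torch

# Sketch of loading a locally cloned DINOv2 instead of pulling it from the
# torch hub at training time. Paths are placeholders; adapt them to where you
# cloned the DINOv2 repository and stored the weights.
vfm = torch.hub.load(
    "/path/to/local/dinov2",   # local clone of facebookresearch/dinov2
    "dinov2_vitb14",           # hub entry point for the ViT-B/14 backbone
    source="local",
    pretrained=False,          # avoid the online download
)
state_dict = torch.load("/path/to/dinov2_vitb14_pretrain.pth", map_location="cpu")
vfm.load_state_dict(state_dict)
vfm.eval()
```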
To train DeCLIP on the COCO dataset, run one of the following scripts:
# dist training, EVA-CLIP, ViT-B-16
bash scripts/dist_DeCLIP_eva_vitb16_coco.sh
# dist training, EVA-CLIP, ViT-L-14-336
bash scripts/dist_DeCLIP_eva_vitL14_336_coco.sh
For open-vocabulary semantic segmentation evaluation based on VLM features, please refer to the ZSSS branch.
To integrate DeCLIP into CAT-Seg, please refer to DeCLIP_CATSeg.
Our work builds upon the methods and codebases of CLIPSelf, ClearCLIP, CAT-Seg, EVA-CLIP, and OpenCLIP. We sincerely thank the authors for their remarkable contributions, which provided an essential foundation for our research.
@inproceedings{wang2025declip,
title={DeCLIP: Decoupled learning for open-vocabulary dense perception},
author={Wang, Junjie and Chen, Bin and Li, Yulin and Kang, Bin and Chen, Yichi and Tian, Zhuotao},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={14824--14834},
year={2025}
}
@article{wang2025generalized,
title={Generalized decoupled learning for enhancing open-vocabulary dense perception},
author={Wang, Junjie and Chen, Keyu and Li, Yulin and Chen, Bin and Zhao, Hengshuang and Qi, Xiaojuan and Tian, Zhuotao},
journal={arXiv preprint arXiv:2508.11256},
year={2025}
}