This repository is the official PyTorch implementation of DeCLIP.
DeCLIP is an unsupervised fine-tuning framework for open-vocabulary dense perception tasks. It decouples CLIP's self-attention module to obtain "content" and "context" features, which learn from CLIP itself and from vision foundation models (VFMs), respectively, to enhance local discriminability and spatial consistency.
- We analyze CLIP and find that its limitation in open-vocabulary dense prediction arises from image tokens failing to aggregate information from spatially or semantically related regions.
- To address this issue, we propose DeCLIP, a simple yet effective unsupervised fine-tuning framework that enhances the discriminability and spatial consistency of CLIP's local features via a decoupled feature enhancement strategy (a rough sketch of the decoupling follows the highlights below).
- DeCLIP outperforms previous state-of-the-art models on a broad range of open-vocabulary dense prediction benchmarks.
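For intuition, the sketch below shows one way the decoupling described above could look in code. It is not the DeCLIP implementation: the module name, the choice of query-query and key-key correlations, and all shapes are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledAttentionSketch(nn.Module):
    """Illustrative decoupling of a CLIP self-attention layer into a "content"
    branch and a "context" branch. NOT the DeCLIP code; the q-q / k-k
    correlation choice and all names are assumptions."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, num_tokens, dim)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))

        # "Content" branch: token-to-token correlations from one projection,
        # intended to be distilled from CLIP itself for local discriminability.
        content_attn = F.softmax((q @ q.transpose(-2, -1)) * self.scale, dim=-1)
        content = (content_attn @ v).transpose(1, 2).reshape(B, N, C)

        # "Context" branch: correlations from another projection, intended to be
        # aligned with a VFM (e.g. DINOv2) for spatial consistency.
        context_attn = F.softmax((k @ k.transpose(-2, -1)) * self.scale, dim=-1)
        context = (context_attn @ v).transpose(1, 2).reshape(B, N, C)

        return self.proj(content), self.proj(context)
```

Conceptually, the "content" branch learns from CLIP itself and the "context" branch from a vision foundation model, matching the description above; the actual projections and losses used by DeCLIP live in the training code of this repository.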
- [2025.05.07] We will release the complete training and inference code as well as the model weights. Stay tuned!
- [2025.02.27] Our work has been accepted at CVPR 2025.
- Initialize Project
- Release the training and inference code of DeCLIP
- Release evaluation code for DeCLIP on open-vocabulary semantic segmentation based on VLM features. Please refer to the ZSSS branch.
- Release the code to integrate DeCLIP into CAT-Seg. Please refer to DeCLIP_CATSeg.
- Release the code to integrate DeCLIP into F-ViT and OV-DQUO.
- Linux with Python == 3.10.0
- CUDA 11.7
- The provided environment is suggested for reproducing our results; similar configurations may also work.
conda create -n DeCLIP python=3.10.0
conda activate DeCLIP
pip install -r requirements.txt
pip install -e . -v
The distillation process of DeCLIP does not rely on any annotations and only requires input images. In our paper, we use the COCO dataset. Please download the dataset and organize the folders as follows.
DeCLIP/
├── dataset
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json  # only access images
│   │   │   ├── panoptic_val2017.json     # for validation
│   │   ├── panoptic_val2017
│   │   ├── train2017
│   │   ├── val2017
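As a quick sanity check of the layout above, you can enumerate the training images with a small snippet like the following (hypothetical, not part of the repository; the annotation file is only used to list image names):

```python
import json
import os

# Hypothetical sanity check: since distillation only needs images, the
# annotation file is read merely to enumerate them.
data_root = "dataset/coco"  # adjust to your local path
with open(os.path.join(data_root, "annotations", "instances_train2017.json")) as f:
    images = json.load(f)["images"]

paths = [os.path.join(data_root, "train2017", im["file_name"]) for im in images]
missing = [p for p in paths if not os.path.isfile(p)]
print(f"{len(paths)} training images listed, {len(missing)} missing on disk")
```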
Please download the pretrained weights from EVA-CLIP and organize them as shown below.
DeCLIP/
├── checkpoints
│   ├── EVA02_CLIP_B_psz16_s8B.pt
│   ├── EVA02_CLIP_L_336_psz14_s6B.pt
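Optionally, you can verify that the downloaded checkpoints deserialize cleanly before training. The snippet below is a hypothetical check, not part of the repository; the nesting of the state dict is an assumption that may differ for your file:

```python
import torch

# Optional, hypothetical check: make sure the EVA-CLIP checkpoint loads on CPU.
ckpt = torch.load("checkpoints/EVA02_CLIP_B_psz16_s8B.pt", map_location="cpu")
# Depending on how the weights were exported, the tensors may sit at the top
# level or be nested under a key such as "state_dict" or "module".
state_dict = next((ckpt[k] for k in ("state_dict", "module")
                   if isinstance(ckpt, dict) and k in ckpt), ckpt)
print(f"{len(state_dict)} entries loaded; sample keys: {list(state_dict)[:3]}")
```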
Before starting training, please set the following variables in the training script.
data_root=** # path to your coco dataset
pretrain_ckpt=** # path to your EVA-CLIP checkpoint
exp_name=** # output folder name
vfm_type=** # which VFM to use; we use dinov2-B and dinov2-L by default. Options: {sam-B, sam-L, dinov2-B, dinov2-L, dino-B-8, dino-B-16}
Note: If you encounter freezing or hanging during training, it is likely caused by problems downloading or loading the vision foundation models (VFMs) from torch.hub. We highly recommend manually downloading the SAM, DINOv2, or DINO code and weights locally before starting DeCLIP training.
You can modify the relevant code in build_vfm(), located in DeCLIP/src/training.
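For example, a locally cloned DINOv2 could be loaded roughly as follows. This is only a sketch: the paths are placeholders, the hub entry name and `pretrained` flag assume the standard DINOv2 hubconf, and how build_vfm() consumes the model depends on your local modification.

```python
import torch

# Sketch of loading a locally cloned DINOv2 instead of pulling it from the
# torch hub at training time. Paths are placeholders; adapt them to where you
# cloned the DINOv2 repository and stored the weights.
vfm = torch.hub.load(
    "/path/to/local/dinov2",   # local clone of facebookresearch/dinov2
    "dinov2_vitb14",           # hub entry point for the ViT-B/14 backbone
    source="local",
    pretrained=False,          # avoid the online download
)
state_dict = torch.load("/path/to/dinov2_vitb14_pretrain.pth", map_location="cpu")
vfm.load_state_dict(state_dict)
vfm.eval()
```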
To train DeCLIP on the COCO dataset, run one of the following scripts:
# dist training, EVA-CLIP, ViT-B-16
bash scripts/dist_DeCLIP_eva_vitb16_coco.sh
# dist training, EVA-CLIP, ViT-L-14-336
bash scripts/dist_DeCLIP_eva_vitL14_336_coco.sh
For open-vocabulary semantic segmentation evaluation based on VLM features, please refer to the ZSSS branch.
To integrate DeCLIP into CAT-Seg, please refer to DeCLIP_CATSeg.
Our work builds upon the methods and codebases of CLIPSelf, ClearCLIP, CAT-Seg, EVA-CLIP, and OpenCLIP. We sincerely thank the authors for their remarkable contributions, which provided an essential foundation for our research.
@inproceedings{wang2025declip,
title={DeCLIP: Decoupled learning for open-vocabulary dense perception},
author={Wang, Junjie and Chen, Bin and Li, Yulin and Kang, Bin and Chen, Yichi and Tian, Zhuotao},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={14824--14834},
year={2025}
}
@article{wang2025generalized,
title={Generalized decoupled learning for enhancing open-vocabulary dense perception},
author={Wang, Junjie and Chen, Keyu and Li, Yulin and Chen, Bin and Zhao, Hengshuang and Qi, Xiaojuan and Tian, Zhuotao},
journal={arXiv preprint arXiv:2508.11256},
year={2025}
}