Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
Talk2DINO is an open-vocabulary segmentation architecture that combines the localized, semantically rich patch-level features of DINOv2 with the multimodal understanding of CLIP. It does so by learning a projection from the CLIP text encoder into the embedding space of DINOv2 using only image-caption pairs, exploiting DINOv2's self-attention to determine which parts of the image should be aligned with the corresponding caption.
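As a rough, illustrative sketch (not the repository's actual training code): DINOv2 patch tokens are pooled with the backbone's own self-attention, and the pooled visual embedding is pulled towards the projected CLIP caption embedding with a contrastive InfoNCE objective while both backbones stay frozen. The function names below are hypothetical.
import torch
import torch.nn.functional as F
def attention_pool(patch_tokens, cls_attn):
    # patch_tokens: (B, N, D) DINOv2 patch embeddings
    # cls_attn:     (B, N) self-attention weights of the [CLS] token over the patches
    weights = cls_attn.softmax(dim=-1).unsqueeze(-1)          # (B, N, 1)
    return (weights * patch_tokens).sum(dim=1)                # (B, D) caption-level visual embedding
def infonce(visual_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of matching (image, caption) pairs.
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
In this view, only the lightweight text projection (an MLP mapping CLIP sentence embeddings into the DINOv2 space, as suggested by the vitb_mlp_infonce config name) receives gradients.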
- 10/2025: Added support for DINOv3!
- 10/2025: Gradio demo is now live! Try Talk2DINO interactively on Hugging Face Spaces.
- 09/2025: Talk2DINO ViT-B and Talk2DINO ViT-L are now available on the Hugging Face Hub.
- 06/2025: "Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation" has been accepted to ICCV 2025 in Honolulu!
Qualitative comparison on sample images: Image, Ground Truth, FreeDA, ProxyCLIP, CLIP-DINOiser, and Ours (Talk2DINO).
Installation comes in two flavors: a lightweight setup for Hugging Face inference, and a full MMCV/MMSegmentation setup for benchmark evaluation.
To quickly run Talk2DINO on your own images:
# Clone the repository
git clone https://github.com/lorebianchi98/Talk2DINO.git
cd Talk2DINO
# Install dependencies
pip install -r requirements.txt
# Install PyTorch (CUDA 12.6 example)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
This setup allows you to load the Hugging Face models (Talk2DINO-ViTB / Talk2DINO-ViTL) and generate segmentation masks without setting up MMCV or MMSegmentation.
If you want to perform benchmark evaluation using MMSegmentation:
# Create a dedicated environment
conda create --name talk2dino python=3.10 -c conda-forge
conda activate talk2dino
# Install C++/CUDA compilers
conda install -c conda-forge "gxx_linux-64=11.*" "gcc_linux-64=11.*"
# Install CUDA toolkit and cuDNN
conda install -c nvidia/label/cuda-11.7.0 cuda
conda install -c nvidia/label/cuda-11.7.0 cuda-nvcc
conda install -c conda-forge cudnn cudatoolkit=11.7.0
# Install PyTorch 2.1 + CUDA 11.8
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
# Install remaining dependencies
pip install -r requirements.txt
pip install -U openmim
mim install mmengine
# Install MMCV (compatible with PyTorch 2.1 + CUDA 11.8)
pip install mmcv-full==1.7.2 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.1.0/index.html
# Install MMSegmentation
pip install mmsegmentation==0.30.0
Talk2DINO enables you to align CLIP text embeddings with the patch-level embedding space of DINOv2.
You can try it in two ways:
Easily load pretrained models with the HF interface:
from transformers import AutoModel
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModel.from_pretrained("lorebianchi98/Talk2DINO-ViTB").to(device).eval()
with torch.no_grad():
    text_embed = model.encode_text("a pikachu")
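Continuing the snippet above, here is a hedged sketch of one way to use the projected embedding: compare it against DINOv2 patch features to obtain a coarse relevance map. It assumes that encode_text returns a vector in the DINOv2 ViT-B/14 (with registers) patch space, that assets/pikachu.png exists, and that standard ImageNet preprocessing is appropriate; hf_demo.ipynb shows the intended end-to-end usage.
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
# Load a DINOv2 backbone (assumed here to be the one matching Talk2DINO-ViTB).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14_reg").to(device).eval()
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
image = preprocess(Image.open("assets/pikachu.png").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    patches = dino.get_intermediate_layers(image, n=1, reshape=True)[0]              # (1, D, 32, 32)
    sim = F.cosine_similarity(patches, text_embed.reshape(1, -1)[:, :, None, None], dim=1)
# 'sim' is a 32x32 relevance map for "a pikachu"; upsample it to the image size to visualize it.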
If you prefer local configs and weights:
import clip
from src.model import ProjectionLayer
import torch, os
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load Talk2DINO projection layer
proj_name = 'vitb_mlp_infonce'
config_path = os.path.join("configs", f"{proj_name}.yaml")
weights_path = os.path.join("weights", f"{proj_name}.pth")
talk2dino = ProjectionLayer.from_config(config_path)
talk2dino.load_state_dict(torch.load(weights_path, map_location=device))
talk2dino.to(device)
# Load CLIP model
clip_model, _ = clip.load("ViT-B/16", device=device, jit=False)
tokenizer = clip.tokenize
# Example: Tokenize and project text features
texts = ["a cat"]
text_tokens = tokenizer(texts).to(device)
text_features = clip_model.encode_text(text_tokens)
projected_text_features = talk2dino.project_clip_txt(text_features)
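Reusing clip_model, tokenizer, and talk2dino from above, the following is a hypothetical sketch of zero-shot patch labelling: project one CLIP embedding per category and assign each patch to the most similar category. patch_feats is a random placeholder standing in for real DINOv2 patch embeddings (e.g., from dinov2_vitb14_reg at 448x448, as used in the feature-extraction step below); demo.py is the supported entry point for this.
import torch
import torch.nn.functional as F
categories = ["a cat", "a dog", "grass"]
with torch.no_grad():
    cat_tokens = tokenizer(categories).to(device)
    # Cast to fp32 as a precaution, since CLIP may return fp16 features on GPU.
    cat_feats = talk2dino.project_clip_txt(clip_model.encode_text(cat_tokens).float())   # (C, D) in DINOv2 space
    cat_feats = F.normalize(cat_feats, dim=-1)
    # Placeholder patch features: replace with real (D, H, W) DINOv2 patch embeddings of an image.
    patch_feats = torch.randn(cat_feats.shape[-1], 32, 32, device=device)
    patch_feats = F.normalize(patch_feats, dim=0)
    scores = torch.einsum("cd,dhw->chw", cat_feats, patch_feats)   # cosine similarity per category
    seg = scores.argmax(dim=0)                                     # (H, W) patch-level label map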
To speed up training, we use pre-extracted features. Follow these steps:
- Download the 2014 images and annotations from the COCO website.
- Run the following commands to extract features:
mkdir ../coco2014_b14
python dino_extraction_v2.py --ann_path ../coco/captions_val2014.json --out_path ../coco2014_b14/val.pth --model dinov2_vitb14_reg --resize_dim 448 --crop_dim 448 --extract_avg_self_attn --extract_disentangled_self_attn
python dino_extraction_v2.py --ann_path ../coco/captions_train2014.json --out_path ../coco2014_b14/train.pth --model dinov2_vitb14_reg --resize_dim 448 --crop_dim 448 --extract_avg_self_attn --extract_disentangled_self_attn
python text_features_extraction.py --ann_path ../coco2014_b14/train.pth
python text_features_extraction.py --ann_path ../coco2014_b14/val.pth
To train the model, use the following command (this example runs training for the ViT-Base configuration):
python train.py --model configs/vitb_mlp_infonce.yaml --train_dataset ../coco2014_b14/train.pth --val_dataset ../coco2014_b14/val.pth
This section is adapted from GroupViT, TCL, and FreeDA. The segmentation datasets should be organized as follows:
data
├── cityscapes
│   ├── leftImg8bit
│   │   ├── train
│   │   ├── val
│   ├── gtFine
│   │   ├── train
│   │   ├── val
├── VOCdevkit
│   ├── VOC2012
│   │   ├── JPEGImages
│   │   ├── SegmentationClass
│   │   ├── ImageSets
│   │   │   ├── Segmentation
│   ├── VOC2010
│   │   ├── JPEGImages
│   │   ├── SegmentationClassContext
│   │   ├── ImageSets
│   │   │   ├── SegmentationContext
│   │   │   │   ├── train.txt
│   │   │   │   ├── val.txt
│   │   ├── trainval_merged.json
│   ├── VOCaug
│   │   ├── dataset
│   │   │   ├── cls
├── ade
│   ├── ADEChallengeData2016
│   │   ├── annotations
│   │   │   ├── training
│   │   │   ├── validation
│   │   ├── images
│   │   │   ├── training
│   │   │   ├── validation
├── coco_stuff164k
│   ├── images
│   │   ├── train2017
│   │   ├── val2017
│   ├── annotations
│   │   ├── train2017
│   │   ├── val2017
Please download and set up the PASCAL VOC, PASCAL Context, COCO-Stuff164k, Cityscapes, and ADE20k datasets following the MMSegmentation data preparation document.
The COCO-Object dataset uses only the object classes of the COCO-Stuff164k dataset, collected from its instance segmentation annotations. Run the following command to convert the instance segmentation annotations to semantic segmentation annotations:
python convert_dataset/convert_coco.py data/coco_stuff164k/ -o data/coco_stuff164k/
To evaluate the model on open-vocabulary segmentation benchmarks, use the src/open_vocabulary_segmentation/main.py script. Select the appropriate configuration based on the model, benchmark, and PAMR settings. The available models are [vitb, vitl], while the available benchmarks are [ade, cityscapes, voc, voc_bg, context, context_bg, coco_object, stuff]. Below we provide the list of evaluations to reproduce the results reported in the paper for the ViT-Base architecture:
# ADE20K
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/ade/dinotext_ade_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/ade/eval_ade_pamr.yml
# Cityscapes
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/cityscapes/dinotext_cityscapes_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/cityscapes/eval_cityscapes_pamr.yml
# Pascal VOC (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc/dinotext_voc_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc/eval_voc_pamr.yml
# Pascal VOC (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc_bg/dinotext_voc_bg_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc_bg/eval_voc_bg_pamr.yml
# Pascal Context (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context/dinotext_context_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context/eval_context_pamr.yml
# Pascal Context (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context_bg/dinotext_context_bg_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context_bg/eval_context_bg_pamr.yml
# COCOStuff
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/stuff/dinotext_stuff_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/stuff/eval_stuff_pamr.yml
# COCO Object
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/coco_object/dinotext_coco_object_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/coco_object/eval_coco_object_pamr.yml
The corresponding evaluations for the ViT-Large architecture are:
# ADE20K
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/ade/dinotext_ade_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/ade/eval_ade_pamr.yml
# Cityscapes
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/cityscapes/dinotext_cityscapes_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/cityscapes/eval_cityscapes_pamr.yml
# Pascal VOC (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc/dinotext_voc_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc/eval_voc_pamr.yml
# Pascal VOC (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc_bg/dinotext_voc_bg_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc_bg/eval_voc_bg_vitl_pamr.yml
# Pascal Context (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context/dinotext_context_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context/eval_context_pamr.yml
# Pascal Context (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context_bg/dinotext_context_bg_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context_bg/eval_context_bg_vitl_pamr.yml
# COCOStuff
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/stuff/dinotext_stuff_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/stuff/eval_stuff_pamr.yml
# COCO Object
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/coco_object/dinotext_coco_object_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/coco_object/eval_coco_object_vitl_pamr.yml
We provide two simple entry points for trying out Talk2DINO:
- hf_demo.ipynb: an interactive notebook showing how to generate segmentation masks directly using the Hugging Face interface.
- demo.py: a lightweight script for running inference on a custom image with your own textual categories. Run:
python demo.py --input custom_input_image --output custom_output_seg [--with_background] --textual_categories category_1,category_2,..
Example:
python demo.py --input assets/pikachu.png --output pikachu_seg.png --textual_categories pikachu,traffic_sign,forest,route
Result:
Thanks to AyoubDamak for contributing to the updated installation instructions.
If you found this code useful, please cite the following paper:
@misc{barsellotti2024talkingdinobridgingselfsupervised,
title={Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation},
author={Luca Barsellotti and Lorenzo Bianchi and Nicola Messina and Fabio Carrara and Marcella Cornia and Lorenzo Baraldi and Fabrizio Falchi and Rita Cucchiara},
year={2024},
eprint={2411.19331},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.19331},
}