
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

arXiv Paper | Project Website | Demo on Hugging Face | Hugging Face Models

Talk2DINO is an open-vocabulary segmentation architecture that combines the localized, semantically rich patch-level features of DINOv2 with the multimodal understanding of CLIP. It learns a projection from the CLIP text encoder to the DINOv2 embedding space using only image-caption pairs, exploiting the self-attention properties of DINOv2 to identify which parts of the image should be aligned with the corresponding caption.
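
The idea can be made concrete with a short, heavily simplified sketch (ours, not the official training code: the single hidden-layer MLP, the attention weights, and the temperature below are illustrative assumptions). A small trainable projection maps frozen CLIP caption embeddings into DINOv2 space, DINOv2 patch tokens are pooled with self-attention weights, and the two views are aligned with an InfoNCE-style contrastive loss over the batch.

import torch
import torch.nn.functional as F

B, D_CLIP, D_DINO, P = 32, 512, 768, 1024        # batch, CLIP dim, DINOv2 dim, patches (illustrative sizes)

# Only this projection is trained; CLIP and DINOv2 stay frozen.
proj = torch.nn.Sequential(
    torch.nn.Linear(D_CLIP, D_DINO),
    torch.nn.Tanh(),
    torch.nn.Linear(D_DINO, D_DINO),
)

clip_txt = torch.randn(B, D_CLIP)                # frozen CLIP caption embeddings
patch_tokens = torch.randn(B, P, D_DINO)         # frozen DINOv2 patch tokens
attn = torch.softmax(torch.randn(B, P), dim=-1)  # DINOv2 self-attention weights used for pooling

txt = F.normalize(proj(clip_txt), dim=-1)                                   # captions projected into DINOv2 space
vis = F.normalize((attn.unsqueeze(-1) * patch_tokens).sum(dim=1), dim=-1)   # attention-pooled image embedding

logits = txt @ vis.T / 0.07                      # InfoNCE: matching pairs lie on the diagonal
loss = F.cross_entropy(logits, torch.arange(B))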

Updates

  • β˜„οΈ 10/2025: Added support for DINOv3 πŸ¦–πŸ¦–πŸ¦•!
  • πŸš€ 10/2025: Gradio demo is now live! Try Talk2DINO interactively on the Hugging Face Spaces πŸ¦–
  • πŸ€— 09/2025: Talk2DINO ViT-B and Talk2DINO ViT-L are now available on the Hugging Face Hub πŸŽ‰
  • πŸ”₯ 06/2025: "Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation" has been accepted to ICCV2025 in Honolulu! πŸŒΊπŸŒ΄πŸ–οΈ

Results

Qualitative comparison (left to right): input image, ground truth, FreeDA, ProxyCLIP, CLIP-DINOiser, and Ours (Talk2DINO).

Installation

The installation is split into two setups: a lightweight Hugging Face interface for quick inference, and a full MMCV/MMSegmentation environment for benchmark evaluation.

1️⃣ Hugging Face Interface (for inference)

To quickly run Talk2DINO on your own images:

# Clone the repository
git clone https://github.com/lorebianchi98/Talk2DINO.git
cd Talk2DINO

# Install dependencies
pip install -r requirements.txt

# Install PyTorch (CUDA 12.6 example)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

This setup allows you to load Hugging Face models (Talk2DINO-ViTB / Talk2DINO-ViTL) and generate segmentation masks without setting up MMCV or MMSegmentation.


2️⃣ MMCV Interface (for evaluation & full pipelines)

If you want to perform benchmark evaluation using MMSegmentation:

# Create a dedicated environment
conda create --name talk2dino python=3.10 -c conda-forge
conda activate talk2dino

# Install C++/CUDA compilers
conda install -c conda-forge "gxx_linux-64=11.*" "gcc_linux-64=11.*"

# Install CUDA toolkit and cuDNN
conda install -c nvidia/label/cuda-11.7.0 cuda 
conda install -c nvidia/label/cuda-11.7.0 cuda-nvcc
conda install -c conda-forge cudnn cudatoolkit=11.7.0

# Install PyTorch 2.1 + CUDA 11.8
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118

# Install remaining dependencies
pip install -r requirements.txt
pip install -U openmim
mim install mmengine

# Install MMCV (compatible with PyTorch 2.1 + CUDA 11.8)
pip install mmcv-full==1.7.2 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.1.0/index.html

# Install MMSegmentation
pip install mmsegmentation==0.30.0
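
As a quick, optional sanity check (a minimal sketch, assuming the environment above is activated), you can verify that the GPU build of PyTorch and the OpenMMLab packages import correctly:

# Quick sanity check: the core packages import and CUDA is visible to PyTorch.
import torch, mmcv, mmseg

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmcv:", mmcv.__version__)    # expected: 1.7.2
print("mmseg:", mmseg.__version__)  # expected: 0.30.0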

Mapping CLIP Text Embeddings to DINOv2 space with Talk2DINO

Talk2DINO enables you to align CLIP text embeddings with the patch-level embedding space of DINOv2.
You can try it in two ways:

🔹 Using the Hugging Face Hub

Easily load pretrained models with the HF interface:

from transformers import AutoModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModel.from_pretrained("lorebianchi98/Talk2DINO-ViTB").to(device).eval()

with torch.no_grad():
    text_embed = model.encode_text("a pikachu")
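
As a follow-up (a minimal sketch, assuming encode_text accepts a single string and returns a 2-D tensor of shape (1, D) as in the call above), you can encode several candidate categories at once and keep them as a single matrix for later scoring:

# Encode a small open vocabulary and stack the embeddings row-wise.
categories = ["a pikachu", "a traffic sign", "a forest"]
with torch.no_grad():
    text_embeds = torch.cat([model.encode_text(c) for c in categories], dim=0)
text_embeds = torch.nn.functional.normalize(text_embeds, dim=-1)  # one unit-norm row per category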

🔹 Using the Original Talk2DINO Interface

If you prefer local configs and weights:

import clip
from src.model import ProjectionLayer
import torch, os

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load Talk2DINO projection layer
proj_name = 'vitb_mlp_infonce'
config_path = os.path.join("configs", f"{proj_name}.yaml")
weights_path = os.path.join("weights", f"{proj_name}.pth")

talk2dino = ProjectionLayer.from_config(config_path)
talk2dino.load_state_dict(torch.load(weights_path, map_location=device))
talk2dino.to(device)

# Load CLIP model
clip_model, _ = clip.load("ViT-B/16", device=device, jit=False)
tokenizer = clip.tokenize

# Example: Tokenize and project text features
texts = ["a cat"]
text_tokens = tokenizer(texts).to(device)
text_features = clip_model.encode_text(text_tokens)
projected_text_features = talk2dino.project_clip_txt(text_features)
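
Continuing the snippet above, here is a minimal sketch (ours, not the official segmentation pipeline) of how the projected text feature can be scored against DINOv2 patch tokens. It assumes the dinov2_vitb14_reg backbone (768-d patches), a 448x448 input, and that project_clip_txt returns a (1, 768) tensor:

# Score DINOv2 patch tokens against the projected text embedding
# to obtain a coarse 32x32 patch-level similarity map.
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14_reg").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
# Example image from the repository assets; replace with your own.
image = preprocess(Image.open("assets/pikachu.png").convert("RGB")).unsqueeze(0).to(device)

with torch.no_grad():
    patch_tokens = dinov2.forward_features(image)["x_norm_patchtokens"]  # (1, 1024, 768): 448 / 14 = 32 patches per side
    text_embed = F.normalize(projected_text_features.float(), dim=-1)    # (1, 768)
    patches = F.normalize(patch_tokens, dim=-1)
    sim_map = (patches @ text_embed.T).squeeze(-1).reshape(1, 32, 32)    # one cosine similarity per 14x14 patch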

Feature Extraction

To speed up training, we use pre-extracted features. Follow these steps:

  1. Download the 2014 images and annotations from the COCO website.
  2. Run the following commands to extract features:
    mkdir ../coco2014_b14
    python dino_extraction_v2.py --ann_path ../coco/captions_val2014.json --out_path ../coco2014_b14/val.pth --model dinov2_vitb14_reg --resize_dim 448 --crop_dim 448 --extract_avg_self_attn --extract_disentangled_self_attn
    python dino_extraction_v2.py --ann_path ../coco/captions_train2014.json --out_path ../coco2014_b14/train.pth --model dinov2_vitb14_reg --resize_dim 448 --crop_dim 448 --extract_avg_self_attn --extract_disentangled_self_attn
    python text_features_extraction.py --ann_path ../coco2014_b14/train.pth
    python text_features_extraction.py --ann_path ../coco2014_b14/val.pth
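
Optionally, you can peek into one of the produced files to confirm the extraction worked (a minimal sketch: it only loads the file and reports its structure, since the exact layout depends on the extraction flags used above):

# Inspect the extracted features without assuming their exact layout.
import torch

data = torch.load("../coco2014_b14/val.pth", map_location="cpu")
print(type(data))
if isinstance(data, dict):
    for key, value in list(data.items())[:3]:
        print(key, getattr(value, "shape", type(value)))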

Training

To train the model, use the following command (this example runs training for the ViT-Base configuration):

python train.py --model configs/vitb_mlp_infonce.yaml --train_dataset ../coco2014_b14/train.pth --val_dataset ../coco2014_b14/val.pth

Evaluation

This section is adapted from GroupViT, TCL, and FreeDA. The segmentation datasets should be organized as follows:

data
β”œβ”€β”€ cityscapes
β”‚   β”œβ”€β”€ leftImg8bit
β”‚   β”‚   β”œβ”€β”€ train
β”‚   β”‚   β”œβ”€β”€ val
β”‚   β”œβ”€β”€ gtFine
β”‚   β”‚   β”œβ”€β”€ train
β”‚   β”‚   β”œβ”€β”€ val
β”œβ”€β”€ VOCdevkit
β”‚   β”œβ”€β”€ VOC2012
β”‚   β”‚   β”œβ”€β”€ JPEGImages
β”‚   β”‚   β”œβ”€β”€ SegmentationClass
β”‚   β”‚   β”œβ”€β”€ ImageSets
β”‚   β”‚   β”‚   β”œβ”€β”€ Segmentation
β”‚   β”œβ”€β”€ VOC2010
β”‚   β”‚   β”œβ”€β”€ JPEGImages
β”‚   β”‚   β”œβ”€β”€ SegmentationClassContext
β”‚   β”‚   β”œβ”€β”€ ImageSets
β”‚   β”‚   β”‚   β”œβ”€β”€ SegmentationContext
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ train.txt
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ val.txt
β”‚   β”‚   β”œβ”€β”€ trainval_merged.json
β”‚   β”œβ”€β”€ VOCaug
β”‚   β”‚   β”œβ”€β”€ dataset
β”‚   β”‚   β”‚   β”œβ”€β”€ cls
β”œβ”€β”€ ade
β”‚   β”œβ”€β”€ ADEChallengeData2016
β”‚   β”‚   β”œβ”€β”€ annotations
β”‚   β”‚   β”‚   β”œβ”€β”€ training
β”‚   β”‚   β”‚   β”œβ”€β”€ validation
β”‚   β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”‚   β”œβ”€β”€ training
β”‚   β”‚   β”‚   β”œβ”€β”€ validation
β”œβ”€β”€ coco_stuff164k
β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”œβ”€β”€ train2017
β”‚   β”‚   β”œβ”€β”€ val2017
β”‚   β”œβ”€β”€ annotations
β”‚   β”‚   β”œβ”€β”€ train2017
β”‚   β”‚   β”œβ”€β”€ val2017

Please download and set up the PASCAL VOC, PASCAL Context, COCO-Stuff164k, Cityscapes, and ADE20k datasets following the MMSegmentation data preparation document.
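
A small optional check (a sketch, assuming the data root shown above) that the expected top-level dataset folders are in place before launching an evaluation:

# Report which of the expected dataset roots are present under data/.
import os

expected = ["cityscapes", "VOCdevkit", "ade", "coco_stuff164k"]
for name in expected:
    path = os.path.join("data", name)
    print(f"{path}: {'found' if os.path.isdir(path) else 'MISSING'}")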

The COCO-Object dataset uses only the object classes of COCO-Stuff164k, obtained by collecting its instance segmentation annotations. Run the following command to convert the instance segmentation annotations to semantic segmentation annotations:

python convert_dataset/convert_coco.py data/coco_stuff164k/ -o data/coco_stuff164k/

To evaluate the model on open-vocabulary segmentation benchmarks, use the src/open_vocabulary_segmentation/main.py script. Select the appropriate configuration based on the model, benchmark, and PAMR settings. The available models are [vitb, vitl], while the available benchmarks are [ade, cityscapes, voc, voc_bg, context, context_bg, coco_object, stuff]. Below is the list of evaluations that reproduces the results reported in the paper for the ViT-Base architecture:

# ADE20K
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/ade/dinotext_ade_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/ade/eval_ade_pamr.yml

# Cityscapes
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/cityscapes/dinotext_cityscapes_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/cityscapes/eval_cityscapes_pamr.yml

# Pascal VOC (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc/dinotext_voc_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc/eval_voc_pamr.yml

# Pascal VOC (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc_bg/dinotext_voc_bg_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc_bg/eval_voc_bg_pamr.yml

# Pascal Context (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context/dinotext_context_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context/eval_context_pamr.yml

# Pascal Context (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context_bg/dinotext_context_bg_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context_bg/eval_context_bg_pamr.yml

# COCOStuff
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/stuff/dinotext_stuff_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/stuff/eval_stuff_pamr.yml

# COCO Object
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/coco_object/dinotext_coco_object_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/coco_object/eval_coco_object_pamr.yml

The corresponding evaluations for the ViT-Large architecture are:

# ADE20K
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/ade/dinotext_ade_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/ade/eval_ade_pamr.yml

# Cityscapes
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/cityscapes/dinotext_cityscapes_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/cityscapes/eval_cityscapes_pamr.yml

# Pascal VOC (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc/dinotext_voc_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc/eval_voc_pamr.yml

# Pascal VOC (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc_bg/dinotext_voc_bg_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc_bg/eval_voc_bg_vitl_pamr.yml

# Pascal Context (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context/dinotext_context_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context/eval_context_pamr.yml

# Pascal Context (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context_bg/dinotext_context_bg_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context_bg/eval_context_bg_vitl_pamr.yml

# COCOStuff
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/stuff/dinotext_stuff_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/stuff/eval_stuff_pamr.yml

# COCO Object
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/coco_object/dinotext_coco_object_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/coco_object/eval_coco_object_vitl_pamr.yml

Demo

We provide two simple entry points for trying out Talk2DINO:

  • hf_demo.ipynb – an interactive notebook showing how to generate segmentation masks directly using the Hugging Face interface.
  • demo.py – a lightweight script for running inference on a custom image with your own textual categories. Run:
python demo.py --input custom_input_image --output custom_output_seg [--with_background] --textual_categories category_1,category_2,..

Example:

python demo.py --input assets/pikachu.png --output pikachu_seg.png --textual_categories pikachu,traffic_sign,forest,route

The resulting segmentation mask is written to the path given by --output (pikachu_seg.png in this example).

Acknowledgments

Thanks to AyoubDamak for contributing to the updated installation instructions.

Reference

If you found this code useful, please cite the following paper:

@misc{barsellotti2024talkingdinobridgingselfsupervised,
      title={Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation}, 
      author={Luca Barsellotti and Lorenzo Bianchi and Nicola Messina and Fabio Carrara and Marcella Cornia and Lorenzo Baraldi and Fabrizio Falchi and Rita Cucchiara},
      year={2024},
      eprint={2411.19331},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.19331}, 
}
