NeurIPS 2025
Paul Couairon*, Loick Chambon*, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome
*Equal Contribution.
By upsampling features from any backbone, JAFAR improves metrics on many downstream tasks: semantic segmentation, depth estimation, class activation maps, zero-shot open-vocabulary segmentation, and bird's eye view segmentation.
Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR—a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries—derived from low-level image features—and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks.
JAFAR is an efficient attention-based feature upsampler that allows upsampling to any resolution.
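For intuition, the upsampler described above can be sketched as a cross-attention layer in which high-resolution queries derived from low-level image features attend to semantically enriched low-resolution keys modulated by Spatial Feature Transform (SFT). The snippet below is a minimal, simplified sketch of that idea in PyTorch; module names, dimensions, and the pooled SFT modulation are illustrative choices, not the actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionUpsamplerSketch(nn.Module):
    """Simplified sketch of an attention-based feature upsampler (illustrative only)."""

    def __init__(self, img_dim=64, feat_dim=384, attn_dim=128):
        super().__init__()
        self.to_q = nn.Linear(img_dim, attn_dim)   # queries from high-resolution, low-level image features
        self.to_k = nn.Linear(feat_dim, attn_dim)  # keys from low-resolution backbone features
        self.sft = nn.Linear(img_dim, 2 * attn_dim)  # SFT-style scale/shift predicted from image features

    def forward(self, img_feats, lr_feats, out_hw):
        # img_feats: (B, H*W, img_dim) flattened high-resolution image features
        # lr_feats:  (B, h*w, feat_dim) flattened low-resolution backbone features
        q = self.to_q(img_feats)                                   # (B, H*W, attn_dim)
        k = self.to_k(lr_feats)                                    # (B, h*w, attn_dim)
        gamma, beta = self.sft(img_feats.mean(1, keepdim=True)).chunk(2, dim=-1)
        k = gamma * k + beta                                       # modulate keys (simplified SFT)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        hr_feats = attn @ lr_feats                                 # (B, H*W, feat_dim)
        return hr_feats.transpose(1, 2).reshape(hr_feats.shape[0], -1, *out_hw)

# Example: upsample 16x16 backbone features to 64x64 using 64x64 image features.
up = CrossAttentionUpsamplerSketch()
hr = up(torch.randn(1, 64 * 64, 64), torch.randn(1, 16 * 16, 384), (64, 64))
print(hr.shape)  # torch.Size([1, 384, 64, 64])
```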
- 【24/09/2025】 JAFAR checkpoints for Radio v2.5 (B/L/H) released.
- 【23/09/2025】 JAFAR checkpoints for DINOv3 (S/S+/B/L) released.
- 【18/09/2025】 JAFAR is accepted at NeurIPS 2025!
- 【16/06/2025】 JAFAR is now on arXiv.
- 【10/06/2025】 Code released.
PCA visualization of features from various upsamplers.
Linear probing results for semantic segmentation across various upsamplers.
📊 Linear Probing Results
Linear probing results for depth estimation across various upsamplers.
📊 Linear Probing Results
Class Activation Map visualizations across various upsamplers.
📊 Evaluation
Vehicle segmentation in Bird's Eye View using DINOv2 + JAFAR.
📊 Evaluation
➡️ Install.
Run the following commands to create a mamba (or conda) environment and install the dependencies.
git clone https://github.com/...
cd JAFAR
micromamba create -n jafar python==3.10.14 -y -c conda-forge
micromamba activate jafar
micromamba install pytorch==2.4.1 torchvision==0.19.1 pytorch-cuda=11.8 -c pytorch -c nvidia -c conda-forge -y
pip install uv
uv pip install einops==0.8.0 matplotlib==3.7.0 numpy==1.24.4 timm==1.0.11 plotly tensorboard hydra-core ipykernel rich pytest scikit-learn torchmetrics==1.6.2 transformers

➡️ Datasets.
See Preparing Datasets for JAFAR for details on how to download the datasets.
To train JAFAR with the dinov2 backbone, execute the following command:
python train.py backbone.name=vit_small_patch14_dinov2.lvd142m hydra.run.dir=output/jafar/dinov2

You can switch to any other backbone available in the timm library by simply changing the backbone.name argument.
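Since backbones are instantiated through timm, you can browse candidate backbone.name values directly in Python; a quick example (the wildcard pattern is just for illustration):

```python
import timm

# List timm model names matching a pattern, e.g. all DINOv2 ViT variants.
print(timm.list_models("*dinov2*"))
```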
For fast prototyping, we provide a sanity argument that runs the code for only a few steps, helping you check that everything is working properly. You can use it as follows:
python train.py sanity=True

To evaluate the model on segmentation on the VOC dataset with the dinov2 backbone, execute:
python evaluation/train_probes.py eval.task=seg dataset_evaluation=voc \
backbone.name=vit_small_patch14_dinov2.lvd142m \
eval.model_ckpt=model.pth \
hydra.run.dir=evaluation/unsupervised/voc/vit_small_patch14_dinov2.lvd142m

You can change the dataset and the backbone to any other available dataset and timm backbone by changing the dataset_evaluation and backbone.name arguments. Logs, TensorBoard files, and checkpoints are saved in the hydra directory.
We provide a script to benchmark the evaluation time and memory usage of the model. You can run it as follows:
pytest test/test_time_and_memory.py -s -v

We provide notebooks for training, inference, and visualization.
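For reference, PCA feature visualizations like those shown above can be produced by projecting upsampled features onto their top three principal components and rendering them as RGB; a minimal sketch, assuming you already have a (C, H, W) feature map as a NumPy array:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def pca_rgb(hr_feats):
    """Project a (C, H, W) feature map onto 3 principal components and map to [0, 1] RGB."""
    C, H, W = hr_feats.shape
    flat = hr_feats.reshape(C, H * W).T              # (H*W, C)
    comps = PCA(n_components=3).fit_transform(flat)  # (H*W, 3)
    comps = (comps - comps.min(0)) / (comps.max(0) - comps.min(0) + 1e-8)
    return comps.reshape(H, W, 3)

# Example usage with random features (replace with real upsampled features):
rgb = pca_rgb(np.random.randn(384, 224, 224).astype(np.float32))
plt.imshow(rgb); plt.axis("off"); plt.show()
```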
We provide pre-trained JAFAR models for various backbones. You can download the model weights from the links in the table below; a short usage sketch follows the table.
| Backbone Name | Download Link |
|---|---|
| ViT-B-16 | Download |
| ViT-B-16-DINO | Download |
| ViT-S-14-DINOv2 | Download |
| ViT-B-14-DINOv2 | Download |
| ViT-S-Reg4-14-DINOv2 | Download |
| ViT-S-16-DINOv3 | Download |
| ViT-S+-16-DINOv3 | Download |
| ViT-B-16-DINOv3 | Download |
| ViT-L-16-DINOv3 | Download |
| ViT-B-16-CLIP | Download |
| ViT-B-16-SigLIP2 | Download |
| Radio_v2.5-B | Download |
| Radio_v2.5-L | Download |
| Radio_v2.5-H | Download |
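As a rough guide to using a downloaded checkpoint, the sketch below shows the general flow: extract low-resolution features with a timm backbone, then upsample them with JAFAR. The import path, constructor, checkpoint filename, and call signature are assumptions for illustration, not the repository's documented API; the provided notebooks show the actual loading code.

```python
import timm
import torch

# NOTE: the import path, constructor, checkpoint name, and call signature below are
# illustrative assumptions; refer to the notebooks for the exact API.
from jafar import JAFAR  # hypothetical import

backbone = timm.create_model("vit_small_patch14_dinov2.lvd142m", pretrained=True).eval()
upsampler = JAFAR()  # hypothetical constructor / default config
ckpt = torch.load("jafar_vit_s_14_dinov2.pth", map_location="cpu")  # hypothetical filename
upsampler.load_state_dict(ckpt.get("state_dict", ckpt))

image = torch.randn(1, 3, 518, 518)                      # dummy input image
lr_feats = backbone.forward_features(image)              # low-resolution backbone tokens
hr_feats = upsampler(image, lr_feats, output_size=(518, 518))  # hypothetical call
print(hr_feats.shape)
```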
Do not hesitate to open an issue if you need a specific backbone or have questions about training it yourself.
Many thanks to these excellent open source projects:
- https://github.com/mhamilton723/FeatUp
- https://github.com/saksham-s/lift/tree/main
- https://github.com/Jiawei-Yang/Denoising-ViT
- https://github.com/chongzhou96/MaskCLIP
- https://github.com/valeoai/PointBeV
To structure our code we used:
If this work is helpful for your research, please consider citing the following BibTeX entry and giving this repository a star.
@misc{couairon2025jafar,
title={JAFAR: Jack up Any Feature at Any Resolution},
author={Paul Couairon and Loick Chambon and Louis Serrano and Jean-Emmanuel Haugeard and Matthieu Cord and Nicolas Thome},
year={2025},
eprint={2506.11136},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.11136},
}