This is the official repository for the ICLR 2024 paper "Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition".
This paper presents a novel method to realize Seamless adaptation of pre-trained foundation models for the (two-stage) VPR task, named SelaVPR. By adding a few tunable lightweight adapters to the frozen pre-trained model, we achieve an efficient hybrid global-local adaptation to get both global features for retrieving candidate places and dense local features for re-ranking. The SelaVPR feature representation can focus on discriminative landmarks, thus closing the gap between the pre-training and VPR tasks (fully unleash the capability of pre-trained models for VPR). SelaVPR can directly match the local features without spatial verification, making the re-ranking much faster.
The global adaptation is achieved by adding adapters after the multi-head attention layer and in parallel to the MLP layer in each transformer block (see adapter1 and adapter2 in /backbone/dinov2/block.py).
The local adaptation is implemented by adding up-convolutional layers after the entire ViT backbone to upsample the feature map and get dense local features (see LocalAdapt in network.py).
This repo follows the Visual Geo-localization Benchmark. You can refer to it (VPR-datasets-downloader) to prepare datasets.
The dataset should be organized in a directory tree as such:
├── datasets_vg
    └── datasets
        └── pitts30k
            └── images
                ├── train
                │   ├── database
                │   └── queries
                ├── val
                │   ├── database
                │   └── queries
                └── test
                    ├── database
                    └── queries
Before training, you should download the pre-trained foundation model DINOv2(ViT-L/14) HERE.
Finetuning on MSLS
python3 train.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=msls --queries_per_epoch=30000 --foundation_model_path=/path/to/pre-trained/dinov2_vitl14_pretrain.pth
Further finetuning on Pitts30k
python3 train.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=pitts30k --queries_per_epoch=5000 --resume=/path/to/finetuned/msls/model/SelaVPR_msls.pth
The model finetuned on MSLS (for diverse scenes).
| DOWNLOAD | MSLS-val | Nordland-test | St. Lucia | ||||||
|---|---|---|---|---|---|---|---|---|---|
| R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | |
| LINK | 90.8 | 96.4 | 97.2 | 85.2 | 95.5 | 98.5 | 99.8 | 100.0 | 100.0 | 
The model further finetuned on Pitts30k (only for urban scenes).
| DOWNLOAD | Tokyo24/7 | Pitts30k | Pitts250k | ||||||
|---|---|---|---|---|---|---|---|---|---|
| R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | |
| LINK | 94.0 | 96.8 | 97.5 | 92.8 | 96.8 | 97.7 | 95.7 | 98.8 | 99.2 | 
Set rerank_num=100 to reproduce the results in paper, and set rerank_num=20 to achieve a close result with only 1/5 re-ranking runtime (0.018s for a query).
python3 eval.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=pitts30k --resume=/path/to/finetuned/pitts30k/model/SelaVPR_pitts30k.pth --rerank_num=100
By adding registers, DINOv2 can achieve better local matching performance. A pre-trained DINOv2+registers model can be downloaded HERE.
You can simply add --registers to the (train or test) run command and load the model with registers to use the SelaVPR model based on DINOv2+registers backbone, for example
python3 train.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=msls --queries_per_epoch=30000 --foundation_model_path=/path/to/pre-trained/dinov2_vitl14_reg4_pretrain.pth --registers
python3 eval.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=msls --resume=/path/to/finetuned/msls/model/SelaVPR_reg4_msls.pth --rerank_num=100 --registers
The finetuned (on MSLS) SelaVPR model with registers can be downloaded HERE.
For the (dense or coarse) local matching between two images, run
python3 visualize_pairs.py --datasets_folder=./ --resume=/path/to/finetuned/msls/model/SelaVPR_reg4_msls.pth --registers
To obtain the heatmap of an input image, run
python3 visual.py --datasets_folder=/home/lufeng/data/VPR/datasets_vg/datasets --resume=/path/to/finetuned/msls/model/SelaVPR_msls.pth
The test_efficient_ram_usage() function in test.py is used to address the issue of RAM out of memory (this issue may cause the program to be killed). This function saves the extracted local features in ./output_local_features/ and loads only the local features currently needed into RAM each time. You can simply add --efficient_ram_testing to the (train or test) run command to use it, for example
python3 train.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=pitts30k --queries_per_epoch=5000 --resume=/path/to/finetuned/msls/model/SelaVPR_msls.pth --efficient_ram_testing
python3 eval.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=pitts30k --resume=/path/to/finetuned/pitts30k/model/SelaVPR_pitts30k.pth --rerank_num=100 --efficient_ram_testing
MSLS-val: We use the official version of MSLS-val (only contains 740 query images) for testing, which is a subset of the MSLS-val formated by VPR-datasets-downloader (contains about 11k query images). More detail can be found here.
Nordland-test: Download the Downsampled version here.
Parts of this repo are inspired by the following repositories:
Visual Geo-localization Benchmark
If you find this repo useful for your research, please consider leaving a star⭐️ and citing the paper
@inproceedings{selavpr,
  title={Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition},
  author={Lu, Feng and Zhang, Lijun and Lan, Xiangyuan and Dong, Shuting and Wang, Yaowei and Yuan, Chun},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}