Code for the image-caption retrieval methods from VSE++: Improving Visual-Semantic Embeddings with Hard Negatives, F. Faghri, D. J. Fleet, J. R. Kiros, S. Fidler, Proceedings of the British Machine Vision Conference (BMVC), 2018. (BMVC Spotlight)
We recommend using Anaconda to install the following packages.
- Python 2.7 (check out branch python3)
- PyTorch (>0.2) (check out branch pytorch4.1)
- NumPy (>1.12.1)
- Punkt Sentence Tokenizer:

import nltk
nltk.download()
> d punkt

Download the dataset files and pre-trained models. We use splits produced by Andrej Karpathy. The precomputed image features are from here and here. To use full image encoders, download the images from their original sources here, here and here.
wget http://www.cs.toronto.edu/~faghri/vsepp/vocab.tar
wget http://www.cs.toronto.edu/~faghri/vsepp/data.tar
wget http://www.cs.toronto.edu/~faghri/vsepp/runs.tar

We refer to the path of the files extracted from data.tar as $DATA_PATH and the path of the files extracted from runs.tar as $RUN_PATH. Extract vocab.tar to the ./vocab directory.
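The extraction step can also be scripted. The following is a minimal sketch using only the Python 3 standard library; the helper name `extract_archive` and the destination directories in the usage note are illustrative assumptions, not part of this repository. The internal layout of the archives is not described here, so inspect the returned entries before setting $DATA_PATH and $RUN_PATH.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a tar archive into dest_dir and return its top-level entries.

    Hypothetical helper for illustration. Only extract archives you trust,
    since tar members can contain unexpected paths.
    """
    # exist_ok requires Python 3; on Python 2.7 create the directory manually.
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        tar.extractall(dest_dir)
    # Listing the destination shows whether the archive added its own
    # top-level folder, which affects the path you export afterwards.
    return sorted(os.listdir(dest_dir))
```

For example, `extract_archive("vocab.tar", "./vocab")` would unpack the vocabulary files, and the same call with `data.tar` or `runs.tar` and a directory of your choice gives the locations to use as $DATA_PATH and $RUN_PATH.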
Update: The vocabulary was originally built using all sets (including test set captions). Please see issue #29 for details. Please consider not using test set captions if building upon this project.
To evaluate a pre-trained model on the test split:
python -c "\
from vocab import Vocabulary
import evaluation
evaluation.evalrank('$RUN_PATH/coco_vse++/model_best.pth.tar', data_path='$DATA_PATH', split='test')"

To do cross-validation on MSCOCO, pass fold5=True with a model trained using --data_name coco.
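The same evaluation can be wrapped in a small script instead of a shell one-liner. This is a sketch under stated assumptions: the helper names `checkpoint_path` and `evaluate` are hypothetical, and it must be run from the repository root so that `vocab` and `evaluation` can be imported. The `evalrank` arguments (`data_path`, `split`, `fold5`) are the ones shown above.

```python
import os


def checkpoint_path(run_path, run_name="coco_vse++"):
    """Return the best-checkpoint path for a run stored under run_path."""
    return os.path.join(run_path, run_name, "model_best.pth.tar")


def evaluate(run_path, data_path, run_name="coco_vse++", fold5=False):
    """Call evaluation.evalrank with the same arguments as the one-liner."""
    # Imported lazily so this file can be imported outside the repository.
    # Vocabulary is imported exactly as in the one-liner; the assumption is
    # that it must be importable when the saved vocabulary is loaded.
    from vocab import Vocabulary  # noqa: F401
    import evaluation

    evaluation.evalrank(
        checkpoint_path(run_path, run_name),
        data_path=data_path,
        split="test",
        fold5=fold5,
    )
```

With the environment variables set, `evaluate(os.environ["RUN_PATH"], os.environ["DATA_PATH"])` reproduces the command above. For cross-validation, pass `fold5=True` together with the `run_name` of a model trained using --data_name coco.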
Run train.py:
python train.py --data_path "$DATA_PATH" --data_name coco_precomp --logger_name runs/coco_vse++ --max_violation

Arguments used to train pre-trained models:
| Method | Arguments |
|---|---|
| VSE0 | --no_imgnorm |
| VSE++ | --max_violation |
| Order0 | --measure order --use_abs --margin .05 --learning_rate .001 |
| Order++ | --measure order --max_violation |
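To avoid retyping the flags, the table can be turned into a lookup that assembles the train.py command for each method. This is a minimal sketch: the names `METHOD_ARGS`, `build_train_command`, and `format_command` are hypothetical, the flags are copied verbatim from the table and the training command above, and the logger directory is left for the caller to choose.

```python
import shlex

# Method-specific arguments, copied from the table above.
METHOD_ARGS = {
    "VSE0": ["--no_imgnorm"],
    "VSE++": ["--max_violation"],
    "Order0": ["--measure", "order", "--use_abs",
               "--margin", ".05", "--learning_rate", ".001"],
    "Order++": ["--measure", "order", "--max_violation"],
}


def build_train_command(method, data_path, logger_name,
                        data_name="coco_precomp"):
    """Return the train.py invocation for a method as an argument list."""
    if method not in METHOD_ARGS:
        raise ValueError("unknown method: {!r}".format(method))
    base = ["python", "train.py",
            "--data_path", data_path,
            "--data_name", data_name,
            "--logger_name", logger_name]
    return base + METHOD_ARGS[method]


def format_command(parts):
    """Join an argument list into a shell-safe command string."""
    return " ".join(shlex.quote(part) for part in parts)
```

For example, `format_command(build_train_command("Order0", "/path/to/data", "runs/coco_order0"))` gives a command line that can be pasted into a shell, where the data path and logger name are placeholders. The argument list itself can be passed directly to `subprocess.run`.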
If you found this code useful, please cite the following paper:
@inproceedings{faghri2018vse++,
title={VSE++: Improving Visual-Semantic Embeddings with Hard Negatives},
author={Faghri, Fartash and Fleet, David J and Kiros, Jamie Ryan and Fidler, Sanja},
booktitle = {Proceedings of the British Machine Vision Conference ({BMVC})},
url = {https://github.com/fartashf/vsepp},
year={2018}
}