By Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki.
Official implementation of "Bottom Up and Top Down Detection Transformers for Language Grounding in Images and Point Clouds", accepted at ECCV 2022.
Note: This is the code for the 2D BUTD-DETR. For the 3D version, check the main branch.
- Linux, GCC>=5.5 and <=10.0
- Python>=3.8

  We recommend using Anaconda to create a conda environment:

  ```
  conda create -n bdetr2d python=3.8
  ```

  Then, activate the environment:

  ```
  conda activate bdetr2d
  ```
- PyTorch>=1.5.1, torchvision>=0.6.1 (following the instructions here)

  For example, if your CUDA version is 10.2, you could install PyTorch and torchvision as follows:

  ```
  conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=10.2 -c pytorch
  ```
- Other requirements:

  ```
  pip install -r requirements.txt
  ```
- Compiling CUDA operators:

  ```
  sh init.sh
  ```
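After compiling, you can sanity-check the build with a quick import. This is a hypothetical smoke test: it assumes the compiled extension is named `MultiScaleDeformableAttention`, as in Deformable-DETR (on which this code builds); adjust the name if your build differs.

```python
# Hypothetical smoke test for the compiled CUDA operators.
import torch
import MultiScaleDeformableAttention  # assumed extension name, from Deformable-DETR

print("CUDA available:", torch.cuda.is_available())
print("Deformable attention extension imported successfully.")
```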
- Download the original Flickr30k image dataset from the Flickr30K webpage and update `flickr_img_path` to the folder containing the images.
- Download the original Flickr30k entities annotations from Flickr30k annotations (by cloning it) and update `flickr_dataset_path` to the folder with the annotations.
- Download the GQA images at GQA images and update `vg_img_path` to point to the folder containing the images.
- Download the COCO images (Coco train2014) and update `coco_path` to the folder containing the downloaded images.
- Download MDETR's pre-processed annotations, which are converted to COCO format (all datasets are present in the same zip folder for MDETR annotations): Pre-processed annotations.
- Download additional files, which include the bottom-up detected boxes from a Faster R-CNN trained on VG: extra_data. If you want to download it using the terminal, you can install gdown and run `gdown 1tIL7VBXHfG71ccIXwPSauP_7WyHAL0rg` (gdown is not always reliable, though).
- Unzip `extra_data.zip` and move `instances_train2014.json` from the unzipped folder to the parent folder of the `train2014` folder (in other words, `train2014` and `instances_train2014.json` should be inside the same parent folder); see the layout sketch below.
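This is one hypothetical layout consistent with the steps above; the directory names are illustrative assumptions, and what matters is only that each path variable points at the corresponding folder:

```
data/
├── flickr30k/
│   ├── images/                   # flickr_img_path
│   └── annotations/              # flickr_dataset_path (Flickr30k entities)
├── gqa/
│   └── images/                   # vg_img_path
└── coco/                         # coco_path (assumed parent of train2014)
    ├── train2014/                # COCO train2014 images
    └── instances_train2014.json  # moved here from extra_data.zip
```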
Download our checkpoints for pretraining, RefCOCO, and RefCOCO+. Add `--resume CKPT_NAME` to the scripts below in order to utilize the stored checkpoints. Since we don't do additional fine-tuning on Flickr, you can use the pretraining checkpoint to evaluate on Flickr. You can also use the pretraining checkpoint to fine-tune on your own datasets.
Note that these checkpoints were stored while using DistributedDataParallel. To load them into a model that does not use DistributedDataParallel, take a look here.
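Concretely, a common way to do this (a minimal sketch, not code from this repo; the checkpoint path and the "model" key are assumptions) is to strip the `module.` prefix that DistributedDataParallel adds to parameter names:

```python
import torch

# Load the checkpoint on CPU; the path below is a placeholder.
ckpt = torch.load("pretrained_checkpoint.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # the "model" key is an assumption

# DistributedDataParallel prefixes every parameter name with "module.";
# strip it so the weights load into a plain (non-DDP) model.
state_dict = {
    key[len("module."):] if key.startswith("module.") else key: value
    for key, value in state_dict.items()
}

model.load_state_dict(state_dict)  # `model` is your instantiated BUTD-DETR model
```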
Scripts for running training/evaluation are as follows:
```
sh scripts/run_train_pretrain.sh     # for pretraining
sh scripts/run_train_flickr.sh       # for flickr
sh scripts/run_train_refcoco.sh      # for refcoco
sh scripts/run_train_refcoco_plus.sh # for refcoco+
```

For running on multiple GPUs, you can change the run_train files. For example, to run pre-training on 8 GPUs, you can change `run_train_pretrain.sh` to:

```
GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 8 ./configs/pretrain.sh
```

Some other useful flags are:

- `--eval`: Add it to skip training and just evaluate. You can remove it to train the model.
- `--resume`: Add this flag and the path to the checkpoint you want to evaluate.
- We have found the default learning rates in `main.py` to work well if your effective batch size is low (e.g. <10). If you are running on a larger number of GPUs, you might want to enable the `--large_scale` flag and increase the learning rates (refer to `configs/pretrain.sh`, whose hyperparameters are set to work with 64 GPUs). Due to computational limitations, we couldn't tune these hyperparameters, so you might get better results by tuning them yourself; see the sketch after this list.
- You might see slightly worse results when evaluating with batch size > 1. This is because we used batch size 1 during training, so the model never saw padded images. For more details, please refer to this issue.
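As a starting point for that tuning, a common heuristic (an illustrative sketch only, not this repo's recipe; every number below is a placeholder assumption) is to scale the learning rate linearly with the effective batch size:

```python
# Linear learning-rate scaling heuristic (assumption, not taken from this repo):
# scale the base LR by the ratio of your effective batch size to the
# reference batch size that the default hyperparameters were tuned for.
base_lr = 1e-4           # placeholder base learning rate
reference_batch = 8      # placeholder reference effective batch size
effective_batch = 64     # e.g. 8 GPUs x per-GPU batch size 8

scaled_lr = base_lr * effective_batch / reference_batch
print(f"Scaled learning rate: {scaled_lr:g}")  # prints 0.0008
```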
Parts of this code were based on the codebases of MDETR and Deformable-DETR.
If you find BUTD-DETR useful in your research, please consider citing:
```
@misc{jain2021bottomup,
  doi = {10.48550/ARXIV.2112.08879},
  url = {https://arxiv.org/abs/2112.08879},
  author = {Jain, Ayush and Gkanatsios, Nikolaos and Mediratta, Ishita and Fragkiadaki, Katerina},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Bottom Up and Top Down Detection Transformers for Language Grounding in Images and Point Clouds},
  publisher = {arXiv},
  year = {2021},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
The majority of BUTD-DETR code is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: MDETR and Deformable-DETR are licensed under the Apache 2.0 license.
