
Bottom Up and Top Down Detection Transformers for Language Grounding in Images and Point Clouds

By Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki.

Official implementation of "Bottom Up and Top Down Detection Transformers for Language Grounding in Images and Point Clouds", accepted by ECCV 2022.

[Teaser figure]

Note:

This is the code for the 2D version of BUTD-DETR. For the 3D version, check the main branch.

Installation

Requirements

  • Linux, GCC>=5.5 <=10.0

  • Python>=3.8

    We recommend using Anaconda to create a conda environment:

    conda create -n bdetr2d python=3.8

    Then, activate the environment:

    conda activate bdetr2d
  • PyTorch>=1.5.1, torchvision>=0.6.1 (following instructions here)

    For example, if your CUDA version is 10.2, you can install PyTorch and torchvision as follows:

    conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=10.2 -c pytorch
  • Other requirements

    pip install -r requirements.txt
  • Compiling CUDA operators

    sh init.sh
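
After installation, a quick check that the environment matches the requirements above can save time later. The snippet below is only a convenience sketch (it is not part of the repository):

import sys

import torch
import torchvision

print("Python:     ", sys.version.split()[0])      # should be >= 3.8
print("PyTorch:    ", torch.__version__)           # should be >= 1.5.1
print("torchvision:", torchvision.__version__)     # should be >= 0.6.1
print("CUDA build: ", torch.version.cuda)          # e.g. 10.2
print("CUDA usable:", torch.cuda.is_available())   # must be True to compile/run the CUDA operators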

Data Preparation

  • Download the original Flickr30k image dataset from the Flickr30K webpage and update flickr_img_path to point to the folder containing the images.
  • Download the original Flickr30k Entities annotations by cloning Flickr30k annotations and update flickr_dataset_path to point to the folder with the annotations.
  • Download the GQA images from GQA images and update vg_img_path to point to the folder containing the images.
  • Download the COCO images from Coco train2014 and update coco_path to point to the folder containing the downloaded images.
  • Download MDETR's pre-processed annotations, which are converted to COCO format (all datasets are present in the same zip for MDETR annotations): Pre-processed annotations.
  • Download the additional files, which include the bottom-up boxes detected by a Faster R-CNN trained on VG: extra_data. To download from the terminal, you can install gdown and run gdown 1tIL7VBXHfG71ccIXwPSauP_7WyHAL0rg (gdown is not always reliable, though).
  • Unzip extra_data.zip and move instances_train2014.json from the unzipped folder to the parent folder of the train2014 folder (in other words, train2014 and instances_train2014.json should be inside the same parent folder); a sketch of this step follows the list.
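
A minimal sketch of the extra_data step in Python, assuming it is run from the dataset root and that the COCO images live in a train2014 folder under that root (DATA_ROOT and the extraction layout are placeholders; adapt them to your setup):

import shutil
import zipfile
from pathlib import Path

import gdown  # pip install gdown

DATA_ROOT = Path("data")  # placeholder: the parent folder of train2014
FILE_ID = "1tIL7VBXHfG71ccIXwPSauP_7WyHAL0rg"

# Download extra_data.zip from Google Drive (gdown can be flaky; download manually if this fails).
zip_path = DATA_ROOT / "extra_data.zip"
gdown.download(f"https://drive.google.com/uc?id={FILE_ID}", str(zip_path), quiet=False)

# Unzip next to the archive.
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(DATA_ROOT / "extra_data")

# train2014 and instances_train2014.json must share the same parent folder.
# Adjust the source path if the archive nests its contents differently.
src = DATA_ROOT / "extra_data" / "instances_train2014.json"
shutil.move(str(src), str(DATA_ROOT / "instances_train2014.json"))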

Pre-trained checkpoints

Download our checkpoints for pretraining, RefCOCO and RefCOCO+. Add --resume CKPT_NAME to the training/evaluation scripts in the Usage section below to use the stored checkpoints. Since we do not do additional fine-tuning on Flickr, you can use the pretraining checkpoint to evaluate on Flickr. You can also use the pretraining checkpoint to fine-tune on your own datasets.

Note that these checkpoints were stored while using DistributedDataParallel. To load them without DistributedDataParallel, take a look here.
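
The usual fix is to strip the "module." prefix that DistributedDataParallel adds to parameter names before calling load_state_dict. A minimal sketch, assuming the checkpoint stores the weights under a "model" key (the key name is an assumption; inspect the checkpoint to confirm):

import torch
from torch import nn


def load_without_ddp(model: nn.Module, ckpt_path: str) -> nn.Module:
    """Load a checkpoint saved from a DistributedDataParallel-wrapped model
    into a plain, unwrapped model by stripping the 'module.' prefix."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)  # assumed key; falls back to the raw dict
    state_dict = {
        (k[len("module."):] if k.startswith("module.") else k): v
        for k, v in state_dict.items()
    }
    model.load_state_dict(state_dict, strict=False)
    return model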

Usage

Training

Scripts for running training/evaluation are as follows:

sh scripts/run_train_pretrain.sh          # for pretraining
sh scripts/run_train_flickr.sh            # for flickr
sh scripts/run_train_refcoco.sh           # for refcoco
sh scripts/run_train_refcoco_plus.sh      # for refcoco+

To run on multiple GPUs, you can modify the run_train scripts. For example, to run pre-training on 8 GPUs, change run_train_pretrain.sh to:

GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 8  ./configs/pretrain.sh 

Some other useful flags are:

  • --eval: Add it to skip training and only run evaluation; remove it to train the model.
  • --resume: Add this flag with the path to the checkpoint you want to load for evaluation or to resume training from.

General Recommendations

  • We have found the default learning rates in main.py to work well when the effective batch size is low (e.g., under 10). If you are running on a larger number of GPUs, you might want to enable the --large_scale flag and increase the learning rates (refer to configs/pretrain.sh, whose hyperparameters are set to work with 64 GPUs). Due to computational limitations, we could not tune these hyperparameters, so you might get better results with further hyperparameter tuning.
  • You might see slightly worse results when evaluating with batch size > 1. This is because we used batch size 1 during training, so the model never saw padded images. For more details, please refer to this issue.

Acknowledgements

Parts of this code are based on the codebases of MDETR and Deformable-DETR.

Citing BUTD-DETR

If you find BUTD-DETR useful in your research, please consider citing:

@misc{https://doi.org/10.48550/arxiv.2112.08879,
  doi = {10.48550/ARXIV.2112.08879},
  url = {https://arxiv.org/abs/2112.08879},
  author = {Jain, Ayush and Gkanatsios, Nikolaos and Mediratta, Ishita and Fragkiadaki, Katerina},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds},
  publisher = {arXiv},
  year = {2021},
  copyright = {Creative Commons Attribution 4.0 International}
}

License

The majority of the BUTD-DETR code is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: MDETR and Deformable-DETR are licensed under the Apache 2.0 license.
