
Dmmm1997/InstanceVG


InstanceVG: Improving Generalized Visual Grounding with Instance-aware Joint Learning

Ming Dai1, Wenxuan Cheng1, Jiang-Jiang Liu2, Lingfeng Yang3, Zhenhua Feng4, Wankou Yang1*, Jingdong Wang2

1Southeast University    2Baidu VIS    3Jiangnan University    4Nanjing University of Science and Technology


📢 News

  • [2025.10.11] Code, pretrained models, and datasets are now released! 🎉

🧩 Abstract

Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical paradigm by accommodating multi-target and non-target scenarios. While GREC focuses on coarse-level bounding box localization, GRES aims for fine-grained pixel-level segmentation.

Existing approaches typically treat these tasks independently, ignoring the potential benefits of joint learning and cross-granularity consistency. Moreover, most methods treat GRES as plain semantic segmentation, lacking instance-aware reasoning that links boxes and masks.

We propose InstanceVG, a multi-task generalized visual grounding framework that unifies GREC and GRES via instance-aware joint learning. InstanceVG introduces instance queries with prior reference points to ensure consistent prediction of points, boxes, and masks across granularities.

To our knowledge, InstanceVG is the first framework to jointly tackle both GREC and GRES while integrating instance-aware consistency learning. Extensive experiments on 10 datasets across 4 tasks demonstrate that InstanceVG achieves state-of-the-art performance, substantially surpassing existing methods across various evaluation metrics.


🏗️ Framework Overview


⚙️ Installation

Environment requirements

CUDA == 11.8
torch == 2.0.0
torchvision == 0.15.1

Install dependencies

pip install -r requirements.txt

InstanceVG depends on components from detrex and detectron2.

python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
git clone https://github.com/IDEA-Research/detrex.git
cd detrex
git submodule init && git submodule update
pip install -e .

Finally, install InstanceVG in editable mode:

pip install -e .

🧮 Data Preparation

Prepare the MS-COCO dataset and download the referring and foreground annotations from HF-Data.

Expected directory structure:

data/
└── seqtr_type/
    ├── annotations/
    │   ├── mixed-seg/
    │   │   └── instances_nogoogle_withid.json
    │   ├── grefs/instance.json
    │   ├── ref-zom/instance.json
    │   └── rrefcoco/instance.json
    └── images/
        └── mscoco/
            └── train2014/
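The layout above can be created with a short shell sketch. Note this only builds the directory skeleton using the paths shown; the annotation JSON files and the COCO train2014 images must still be downloaded separately (from HF-Data and the MS-COCO site, respectively):

```shell
# Create the directory skeleton expected by the configs.
# Annotation JSONs and COCO images are downloaded separately.
mkdir -p data/seqtr_type/annotations/mixed-seg \
         data/seqtr_type/annotations/grefs \
         data/seqtr_type/annotations/ref-zom \
         data/seqtr_type/annotations/rrefcoco \
         data/seqtr_type/images/mscoco/train2014
```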

🧠 Pretrained Weights

InstanceVG uses BEiT-3 as both the backbone and multi-modal fusion module.

Download pretrained weights and tokenizer from BEiT-3’s official repository.

mkdir pretrain_weights

Place the following files:

pretrain_weights/
├── beit3_base_patch16_224.zip
├── beit3_large_patch16_224.zip
└── beit3.spm
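As a quick sanity check before training, the following sketch verifies that the three BEiT-3 assets listed above are in place (file names are the ones shown; download them from the official BEiT-3 repository first):

```shell
# Report which of the expected BEiT-3 assets are present.
for f in pretrain_weights/beit3_base_patch16_224.zip \
         pretrain_weights/beit3_large_patch16_224.zip \
         pretrain_weights/beit3.spm; do
  if [ -f "$f" ]; then
    echo "ok: $f"
  else
    echo "missing: $f"
  fi
done
```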

🚀 Demo

Example 1 — GRES task

python tools/demo.py \
  --img "asserts/imgs/Figure_1.jpg" \
  --expression "three skateboard guys" \
  --config "configs/gres/InstanceVG-grefcoco.py" \
  --checkpoint /PATH/TO/InstanceVG-grefcoco.pth

Example 2 — RIS task

python tools/demo.py \
  --img "asserts/imgs/Figure_2.jpg" \
  --expression "full half fruit" \
  --config "configs/refcoco/InstanceVG-refcoco.py" \
  --checkpoint /PATH/TO/InstanceVG-refcoco.pth

For additional options (e.g., thresholds, alternate checkpoints), see tools/demo.py.


🧩 Training

To train InstanceVG from scratch:

bash tools/dist_train.sh [PATH_TO_CONFIG] [NUM_GPUS]

📊 Evaluation

To reproduce reported results:

bash tools/dist_test.sh [PATH_TO_CONFIG] [NUM_GPUS] \
  --load-from [PATH_TO_CHECKPOINT_FILE]

🏆 Model Zoo

All pretrained checkpoints are available via the links below.

| Task / Train Set | Config | Checkpoint |
| --- | --- | --- |
| RefCOCO/+/g (Base) | configs/refcoco/InstanceVG-B-refcoco.py | InstanceVG-B-refcoco.pth |
| RefCOCO/+/g (Large) | configs/refcoco/InstanceVG-L-refcoco.py | InstanceVG-L-refcoco.pth |
| gRefCOCO | configs/gres/InstanceVG-grefcoco.py | InstanceVG-grefcoco.pth |
| Ref-ZOM | configs/refzom/InstanceVG-refzom.py | InstanceVG-refzom.pth |
| RRefCOCO | configs/rrefcoco/InstanceVG-rrefcoco.py | InstanceVG-rrefcoco.pth |

Example reproduction:

bash tools/dist_test.sh configs/refcoco/InstanceVG-B-refcoco.py 1 \
  --load-from work_dir/refcoco/InstanceVG-B-refcoco.pth

📚 Citation

If you find our work useful, please cite:

@article{instancevg,
  author={Dai, Ming and Cheng, Wenxuan and Liu, Jiang-Jiang and Yang, Lingfeng and Feng, Zhenhua and Yang, Wankou and Wang, Jingdong},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title={Improving Generalized Visual Grounding with Instance-aware Joint Learning},
  year={2025},
  doi={10.1109/TPAMI.2025.3607387}
}

@article{dai2024simvg,
  title={SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-Modal Fusion},
  author={Dai, Ming and Yang, Lingfeng and Xu, Yihao and Feng, Zhenhua and Yang, Wankou},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={121670--121698},
  year={2024}
}

@inproceedings{dai2025multi,
  title={Multi-Task Visual Grounding with Coarse-to-Fine Consistency Constraints},
  author={Dai, Ming and Li, Jian and Zhuang, Jiedong and Zhang, Xian and Yang, Wankou},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={3},
  pages={2618--2626},
  year={2025}
}

⭐ Acknowledgements

Our implementation builds upon detrex, detectron2, and BEiT-3.

We thank these excellent open-source projects for their contributions to the community.

About

[TPAMI2025] Improving Generalized Visual Grounding with Instance-aware Joint Learning
