Ming Dai1, Wenxuan Cheng1, Jiang-Jiang Liu2, Lingfeng Yang3, Zhenhua Feng4, Wankou Yang1*, Jingdong Wang2
1Southeast University 2Baidu VIS 3Jiangnan University 4Nanjing University of Science and Technology
- [2025.10.11] Code, pretrained models, and datasets are now released! 🎉
Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical paradigm by accommodating multi-target and non-target scenarios. While GREC focuses on coarse-level bounding box localization, GRES aims for fine-grained pixel-level segmentation.
Existing approaches typically treat these tasks independently, ignoring the potential benefits of joint learning and cross-granularity consistency. Moreover, most methods cast GRES as plain semantic segmentation, lacking instance-aware reasoning that links boxes and masks.
We propose InstanceVG, a multi-task generalized visual grounding framework that unifies GREC and GRES via instance-aware joint learning. InstanceVG introduces instance queries with prior reference points to ensure consistent prediction of points, boxes, and masks across granularities.
To our knowledge, InstanceVG is the first framework to jointly tackle both GREC and GRES while integrating instance-aware consistency learning. Extensive experiments on 10 datasets across 4 tasks demonstrate that InstanceVG achieves state-of-the-art performance, substantially surpassing existing methods across various evaluation metrics.
Environment requirements
CUDA == 11.8
torch == 2.0.0
torchvision == 0.15.1

pip install -r requirements.txt

InstanceVG depends on components from detrex and detectron2:
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
git clone https://github.com/IDEA-Research/detrex.git
cd detrex
git submodule init && git submodule update
pip install -e .

Finally, install InstanceVG in editable mode (run from the InstanceVG root directory):

pip install -e .
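After installation, a quick sanity check can confirm that the expected PyTorch/CUDA versions are active and that detectron2 and detrex import correctly. This is a minimal sketch, not part of the repository's tooling:

```python
# check_env.py -- hypothetical sanity check; not shipped with InstanceVG.
import torch

print("torch:", torch.__version__)            # expected 2.0.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)    # expected 11.8

# Both imports should succeed after the steps above.
import detectron2
import detrex

print("detectron2:", detectron2.__version__)
print("detrex imported from:", detrex.__file__)
```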
Prepare the MS-COCO dataset and download the referring and foreground annotations from HF-Data.
Expected directory structure:
data/
└── seqtr_type/
├── annotations/
│ ├── mixed-seg/
│ │ └── instances_nogoogle_withid.json
│ ├── grefs/instance.json
│ ├── ref-zom/instance.json
│ └── rrefcoco/instance.json
└── images/
└── mscoco/
└── train2014/
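As a convenience, the sketch below (hypothetical; the paths simply mirror the layout above) checks that the annotation files and the image folder are where the configs expect them:

```python
# check_data.py -- hypothetical helper; paths mirror the expected directory layout above.
from pathlib import Path

ROOT = Path("data/seqtr_type")
EXPECTED = [
    ROOT / "annotations/mixed-seg/instances_nogoogle_withid.json",
    ROOT / "annotations/grefs/instance.json",
    ROOT / "annotations/ref-zom/instance.json",
    ROOT / "annotations/rrefcoco/instance.json",
    ROOT / "images/mscoco/train2014",
]

for path in EXPECTED:
    print(f"[{'ok' if path.exists() else 'MISSING'}] {path}")
```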
InstanceVG uses BEiT-3 as both the backbone and multi-modal fusion module.
Download pretrained weights and tokenizer from BEiT-3’s official repository.
mkdir pretrain_weights

Place the following files:
pretrain_weights/
├── beit3_base_patch16_224.zip
├── beit3_large_patch16_224.zip
└── beit3.spm
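To verify the downloaded tokenizer file, it can be loaded directly. The snippet below is a sketch that assumes the transformers and sentencepiece packages are installed; loading beit3.spm through XLMRobertaTokenizer follows the official BEiT-3 repository's usage:

```python
# check_tokenizer.py -- a sketch; assumes `transformers` and `sentencepiece` are installed.
from transformers import XLMRobertaTokenizer

# BEiT-3 ships its vocabulary as a SentencePiece model (beit3.spm).
tokenizer = XLMRobertaTokenizer("pretrain_weights/beit3.spm")
print(tokenizer.tokenize("three skateboard guys"))
```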
Example 1 — GRES task
python tools/demo.py \
--img "asserts/imgs/Figure_1.jpg" \
--expression "three skateboard guys" \
--config "configs/gres/InstanceVG-grefcoco.py" \
--checkpoint /PATH/TO/InstanceVG-grefcoco.pth

Example 2 — RIS task
python tools/demo.py \
--img "asserts/imgs/Figure_2.jpg" \
--expression "full half fruit" \
--config "configs/refcoco/InstanceVG-refcoco.py" \
--checkpoint /PATH/TO/InstanceVG-refcoco.pth

For additional options (e.g., thresholds, alternate checkpoints), see tools/demo.py.
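To run the demo over several image/expression pairs, a small wrapper like the one below can call tools/demo.py in a loop. This is a sketch: it reuses only the CLI flags shown above, the case list is illustrative, and the checkpoint paths are placeholders:

```python
# batch_demo.py -- hypothetical wrapper; reuses only the demo.py flags shown above.
import subprocess

cases = [
    # (image, expression, config, checkpoint) -- checkpoint paths are placeholders
    ("asserts/imgs/Figure_1.jpg", "three skateboard guys",
     "configs/gres/InstanceVG-grefcoco.py", "/PATH/TO/InstanceVG-grefcoco.pth"),
    ("asserts/imgs/Figure_2.jpg", "full half fruit",
     "configs/refcoco/InstanceVG-refcoco.py", "/PATH/TO/InstanceVG-refcoco.pth"),
]

for img, expression, config, checkpoint in cases:
    subprocess.run(
        ["python", "tools/demo.py",
         "--img", img,
         "--expression", expression,
         "--config", config,
         "--checkpoint", checkpoint],
        check=True,
    )
```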
To train InstanceVG from scratch:
bash tools/dist_train.sh [PATH_TO_CONFIG] [NUM_GPUS]

To reproduce reported results:
bash tools/dist_test.sh [PATH_TO_CONFIG] [NUM_GPUS] \
--load-from [PATH_TO_CHECKPOINT_FILE]

All pretrained checkpoints are available on Model.
| Task / Train Set | Config | Checkpoint |
|---|---|---|
| RefCOCO/+/g (Base) | configs/refcoco/InstanceVG-B-refcoco.py | InstanceVG-B-refcoco.pth |
| RefCOCO/+/g (Large) | configs/refcoco/InstanceVG-L-refcoco.py | InstanceVG-L-refcoco.pth |
| gRefCOCO | configs/gres/InstanceVG-grefcoco.py | InstanceVG-grefcoco.pth |
| Ref-ZOM | configs/refzom/InstanceVG-refzom.py | InstanceVG-refzom.pth |
| RRefCOCO | configs/rrefcoco/InstanceVG-rrefcoco.py | InstanceVG-rrefcoco.pth |
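To evaluate all released checkpoints in sequence, the rows of the table can be driven by a small loop such as the sketch below; the checkpoint directory and the single-GPU setting are assumptions to adjust for your setup:

```python
# eval_all.py -- hypothetical batch evaluation; config names come from the table above,
# while CHECKPOINT_DIR and NUM_GPUS are assumptions to adapt.
import subprocess
from pathlib import Path

CHECKPOINT_DIR = Path("work_dir")  # directory holding the downloaded .pth files
NUM_GPUS = "1"

RUNS = [
    ("configs/refcoco/InstanceVG-B-refcoco.py", "InstanceVG-B-refcoco.pth"),
    ("configs/refcoco/InstanceVG-L-refcoco.py", "InstanceVG-L-refcoco.pth"),
    ("configs/gres/InstanceVG-grefcoco.py", "InstanceVG-grefcoco.pth"),
    ("configs/refzom/InstanceVG-refzom.py", "InstanceVG-refzom.pth"),
    ("configs/rrefcoco/InstanceVG-rrefcoco.py", "InstanceVG-rrefcoco.pth"),
]

for config, ckpt in RUNS:
    subprocess.run(
        ["bash", "tools/dist_test.sh", config, NUM_GPUS,
         "--load-from", str(CHECKPOINT_DIR / ckpt)],
        check=True,
    )
```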
Example reproduction:
bash tools/dist_test.sh configs/refcoco/InstanceVG-B-refcoco.py 1 \
--load-from work_dir/refcoco/InstanceVG-B-refcoco.pth

If you find our work useful, please cite:
@ARTICLE{instancevg,
author={Dai, Ming and Cheng, Wenxuan and Liu, Jiang-Jiang and Yang, Lingfeng and Feng, Zhenhua and Yang, Wankou and Wang, Jingdong},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={Improving Generalized Visual Grounding with Instance-aware Joint Learning},
year={2025},
doi={10.1109/TPAMI.2025.3607387}
}
@article{dai2024simvg,
title={SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-Modal Fusion},
author={Dai, Ming and Yang, Lingfeng and Xu, Yihao and Feng, Zhenhua and Yang, Wankou},
journal={Advances in Neural Information Processing Systems},
volume={37},
pages={121670--121698},
year={2024}
}
@inproceedings{dai2025multi,
title={Multi-Task Visual Grounding with Coarse-to-Fine Consistency Constraints},
author={Dai, Ming and Li, Jian and Zhuang, Jiedong and Zhang, Xian and Yang, Wankou},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={39},
number={3},
pages={2618--2626},
year={2025}
}

Our implementation builds upon open-source projects such as detrex, detectron2, and BEiT-3.
We thank these excellent open-source projects for their contributions to the community.