Evangelos Kazakos1, Cordelia Schmid2, Josef Sivic1
1Czech Institute of Informatics, Robotics and Cybernetics at the Czech Technical University in Prague
2Inria, École normale supérieure, CNRS, PSL Research University
arXiv | Project Website
- 09/11/2025: The HowToGround1M and iGround datasets are now available on 🤗 Hugging Face: HowToGround1M | iGround. They can be loaded directly with load_dataset() from the 🤗 Datasets library (see the Python sketch after this list).
- 02/09/2025: We release grove-transformers, a lightweight, inference-only interface for GROVE, implemented with 🤗 Transformers.
- 21/08/2025: Code, checkpoints, and datasets released!
- 25/06/2025: Paper accepted to ICCV 2025 🎉
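A minimal sketch of loading the datasets with the 🤗 Datasets library. The repository IDs below are hypothetical placeholders, not verified names; use the exact IDs shown on the dataset pages linked above.

```python
from datasets import load_dataset

# Hypothetical repository IDs -- substitute the exact IDs from the
# HowToGround1M and iGround pages on the Hugging Face Hub.
howtoground = load_dataset("username/HowToGround1M")
iground = load_dataset("username/iGround")

print(howtoground)  # inspect the available splits and features
print(iground)
```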
BibTeX
@inproceedings{kazakos2025grove,
title = {Large-scale Pre-training for Grounded Video Caption Generation},
author = {Evangelos Kazakos and Cordelia Schmid and Josef Sivic},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025}
}

If you only need inference for GROVE (quick experimentation, no training), use the lightweight grove-transformers package instead. Installation instructions are in its README and on the Hugging Face Hub.
If you want to train and/or evaluate GROVE end-to-end, follow the instructions below.
First, create a new conda environment:
conda create -n grove python=3.12
conda activate grove

Choose the CUDA version that matches your system (e.g., cu124, cu121, cu118).
Example for CUDA 12.4:
pip install --index-url https://download.pytorch.org/whl/cu124 \
torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
pip install torchtext==0.18.0 torchdata==0.9.0

💡 Replace cu124 in the URL with the correct CUDA version tag for your machine.
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
git checkout cfd5d3a985b0249de009b67d04f37263e11cdf3d
pip install -e . --no-build-isolation
cd ..

git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout 57c4e25e06e2d4f8a9357c84bcd24089a284dc88
pip install -r requirements/optional.txt
pip install -e . -v
cd ..

git clone https://github.com/facebookresearch/sam2.git
cd sam2
git checkout 2b90b9f5ceec907a1c18123530e92e794ad901a4
pip install -e . --no-build-isolation
cd ..

pip install flash-attn==2.7.3 --no-build-isolation

- Stanford CoreNLP 3.4.1 (for evaluation in HowToGround1M/iGround)
- Stanford CoreNLP 4.5.7 (for evaluation in ActivityNet-Entities)

pip install -r requirements.txt
- Download HowTo100M videos (for pre-training on HowToGround1M)
  - Follow the instructions on the HowTo100M webpage
  - The webpage has some broken links and is currently under construction. The domain will change and the link above will be updated to the new domain.
- Download iGround videos (for fine-tuning/evaluating on iGround)
  - Fill in this form to obtain links to the iGround videos
  - Run the following script to download the iGround videos using the provided links:
    bash scripts/download_iGround.sh iGround_links.txt /path/to/iground_videos_dir
  - Caution: the links expire in 7 days
- Download annotations
- Preprocess annotations
  - Run the following command to split the annotations into separate files per video:
    python scripts/preprocess_howtoground_annot.py /path/to/{HowToGround1M,iGround}.pkl target_dir
Note: The iGround annotations come in both processed and raw versions (e.g., iGround_train_set_processed.pkl vs iGround_train_set_raw.pkl). The processed annotations were used to train GROVE: processing merges multiple instances of the same object type in a frame into a single annotation by taking the union of their bounding boxes (illustrated in the sketch below). The raw annotations are unprocessed, so the same object type may appear multiple times in a frame, each instance with its own bounding box.
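For intuition, the sketch below illustrates the merging step described in the note. It operates on plain [x1, y1, x2, y2] boxes paired with object-type labels; this input format is an illustrative assumption, not the actual iGround annotation schema.

```python
from collections import defaultdict

def merge_boxes_per_object(frame_annotations):
    """Merge all boxes of the same object type in a frame into their
    enclosing (union) box. Boxes are [x1, y1, x2, y2]; the input format
    is illustrative, not the actual iGround annotation schema."""
    grouped = defaultdict(list)
    for obj_type, box in frame_annotations:
        grouped[obj_type].append(box)

    merged = {}
    for obj_type, boxes in grouped.items():
        merged[obj_type] = [
            min(b[0] for b in boxes),  # left-most x1
            min(b[1] for b in boxes),  # top-most y1
            max(b[2] for b in boxes),  # right-most x2
            max(b[3] for b in boxes),  # bottom-most y2
        ]
    return merged

# Two "person" boxes collapse into the single box that encloses both.
print(merge_boxes_per_object([("person", [10, 20, 50, 80]),
                              ("person", [60, 25, 90, 85]),
                              ("knife", [30, 40, 45, 55])]))
# {'person': [10, 20, 90, 85], 'knife': [30, 40, 45, 55]}
```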
- Download ActivityNet videos
  - From Hugging Face
- Download annotations
- Preprocess videos:
  bash scripts/preprocess_anet_videos.sh input_dataset_dir preprocessed_dataset_dir
- Download VidSTG videos
  - From Hugging Face
- Download annotations
- Download GROVE pre-trained on HowToGround1M from link
- Download GROVE fine-tuned on iGround from link
- Download the SAM checkpoint from link
- Run:
  mkdir checkpoints
  mv /path/to/checkpoints checkpoints/
- In train_scripts/train_{howtoground,vidstg,anet}.sh:
  - (Optional) Adjust the sbatch configuration to match your cluster, though the provided settings are recommended
  - Modify the paths to the data and checkpoint
- Run:
  bash train_scripts/train_{howtoground,vidstg,anet}.sh

Note: train_scripts/train_howtoground.sh can be used for both the HowToGround1M and iGround datasets.
Below we show how to run inference and evaluation on the iGround validation and test sets. For the other datasets, use the corresponding scripts in infer_eval_scripts/.
- For iGround validation set:
bash infer_eval_scripts/infer_eval_iground.sh checkpoints/grove_ft_iground_ckpt.bin /path/to/save/token_embeddings.pt /path/to/save/preds.pkl /path/to/iGround_val_set_raw.pkl /path/to/iground_videos_dir 0.5 /path/to/stanford-corenlp-full-2014-08-27
- For iGround test set:
bash infer_eval_scripts/infer_eval_iground.sh checkpoints/grove_ft_iground_ckpt.bin /path/to/save/token_embeddings.pt /path/to/save/preds.pkl /path/to/iGround_test_set_raw.pkl /path/to/iground_videos_dir 0.5 /path/to/stanford-corenlp-full-2014-08-27
Note: Downloading Stanford CoreNLP from the links provided in the installation instructions gives you a directory stanford-corenlp-full-2014-08-27, which contains Stanford CoreNLP 3.4.1 (used above for evaluation in iGround), and a directory stanford-corenlp-4.5.7, which contains Stanford CoreNLP 4.5.7 (used for evaluation in ActivityNet-Entities).
