Evangelos Kazakos1, Cordelia Schmid2, Josef Sivic1
1Czech Institute of Informatics, Robotics and Cybernetics at the Czech Technical University in Prague
2Inria, École normale supérieure, CNRS, PSL Research University
arXiv | Project Website
- 09/11/2025: The HowToGround1M and iGround datasets are now available on 🤗 Hugging Face: HowToGround1M | iGround. They can be loaded directly with load_dataset() from the 🤗 Datasets library (see the Python sketch after this list).
- 02/09/2025: We release grove-transformers, a lightweight, inference-only interface for GROVE, implemented with 🤗 Transformers.
- 21/08/2025: Code, checkpoints, and datasets released!
- 25/06/2025: Paper accepted to ICCV 2025 🎉
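A minimal sketch of loading the datasets with the 🤗 Datasets library. The repository IDs below are hypothetical placeholders, not verified names; use the exact IDs shown on the dataset pages linked above.

```python
from datasets import load_dataset

# Hypothetical repository IDs -- substitute the exact IDs from the
# HowToGround1M and iGround pages on the Hugging Face Hub.
howtoground = load_dataset("username/HowToGround1M")
iground = load_dataset("username/iGround")

print(howtoground)  # inspect the available splits and features
print(iground)
```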
BibTeX
@inproceedings{kazakos2025grove,
title = {Large-scale Pre-training for Grounded Video Caption Generation},
author = {Evangelos Kazakos and Cordelia Schmid and Josef Sivic},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025}
}

If you only need inference for GROVE (quick experimentation, no training), use the lightweight grove-transformers package instead. Installation instructions are in its README and on the Hugging Face Hub.
If you want to train and/or evaluate GROVE end-to-end, follow the instructions below.
First, create a new conda environment:
conda create -n grove python=3.12
conda activate grove

Choose the CUDA version that matches your system (e.g., cu124, cu121, cu118).
Example for CUDA 12.4:
pip install --index-url https://download.pytorch.org/whl/cu124 \
torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
pip install torchtext==0.18.0 torchdata==0.9.0

💡 Replace cu124 in the URL with the correct CUDA version tag for your machine.
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
git checkout cfd5d3a985b0249de009b67d04f37263e11cdf3d
pip install -e . --no-build-isolation
cd ..

git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout 57c4e25e06e2d4f8a9357c84bcd24089a284dc88
pip install -r requirements/optional.txt
pip install -e . -v
cd ..

git clone https://github.com/facebookresearch/sam2.git
cd sam2
git checkout 2b90b9f5ceec907a1c18123530e92e794ad901a4
pip install -e . --no-build-isolation
cd ..

pip install flash-attn==2.7.3 --no-build-isolation

- Stanford CoreNLP 3.4.1 (for evaluation in HowToGround1M/iGround)
- Stanford CoreNLP 4.5.7 (for evaluation in ActivityNet-Entities)

pip install -r requirements.txt
- Download HowTo100M videos (for pre-training on HowToGround1M)
  - Follow the instructions on the HowTo100M webpage
  - The webpage has some broken links and is currently under construction. The domain will change and the link above will be updated to the new domain.
- Download iGround videos (for fine-tuning/evaluating on iGround)
  - Fill in this form to obtain links to the iGround videos
  - Run the following script to download the iGround videos using the provided links:
    bash scripts/download_iGround.sh iGround_links.txt /path/to/iground_videos_dir
  - Caution: the links expire in 7 days
- Download annotations
- Preprocess annotations
  - Run the following command to split the annotations into separate files per video:
    python scripts/preprocess_howtoground_annot.py /path/to/{HowToGround1M,iGround}.pkl target_dir
Note: The iGround annotations come in both processed and raw versions (e.g., iGround_train_set_processed.pkl vs iGround_train_set_raw.pkl). The processed annotations were used to train GROVE: processing merges multiple instances of the same object type in a frame into a single annotation by taking the union of their bounding boxes (illustrated in the sketch below). The raw annotations are unprocessed, so the same object type may appear multiple times in a frame, each instance with its own bounding box.
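For intuition, the sketch below illustrates the merging step described in the note. It operates on plain [x1, y1, x2, y2] boxes paired with object-type labels; this input format is an illustrative assumption, not the actual iGround annotation schema.

```python
from collections import defaultdict

def merge_boxes_per_object(frame_annotations):
    """Merge all boxes of the same object type in a frame into their
    enclosing (union) box. Boxes are [x1, y1, x2, y2]; the input format
    is illustrative, not the actual iGround annotation schema."""
    grouped = defaultdict(list)
    for obj_type, box in frame_annotations:
        grouped[obj_type].append(box)

    merged = {}
    for obj_type, boxes in grouped.items():
        merged[obj_type] = [
            min(b[0] for b in boxes),  # left-most x1
            min(b[1] for b in boxes),  # top-most y1
            max(b[2] for b in boxes),  # right-most x2
            max(b[3] for b in boxes),  # bottom-most y2
        ]
    return merged

# Two "person" boxes collapse into the single box that encloses both.
print(merge_boxes_per_object([("person", [10, 20, 50, 80]),
                              ("person", [60, 25, 90, 85]),
                              ("knife", [30, 40, 45, 55])]))
# {'person': [10, 20, 90, 85], 'knife': [30, 40, 45, 55]}
```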
- Download ActivityNet videos
  - From Hugging Face
- Download annotations
- Preprocess videos:
  bash scripts/preprocess_anet_videos.sh input_dataset_dir preprocessed_dataset_dir
- Download VidSTG videos
  - From Hugging Face
- Download annotations
- Download GROVE pre-trained on HowToGround1M from link
- Download GROVE fine-tuned on iGround from link
- Download the SAM checkpoint from link
- Run:
  mkdir checkpoints
  mv /path/to/checkpoints checkpoints/
- In train_scripts/train_{howtoground,vidstg,anet}.sh:
  - (Optional) Adjust the sbatch configuration to match your cluster, though the provided settings are recommended
  - Modify the paths to the data and checkpoint
- Run:
  bash train_scripts/train_{howtoground,vidstg,anet}.sh

Note: train_scripts/train_howtoground.sh can be used for both the HowToGround1M and iGround datasets.
Below we show how to run inference and evaluation on the iGround validation and test sets. For the other datasets, use the corresponding scripts in infer_eval_scripts/.
- For iGround validation set:
bash infer_eval_scripts/infer_eval_iground.sh checkpoints/grove_ft_iground_ckpt.bin /path/to/save/token_embeddings.pt /path/to/save/preds.pkl /path/to/iGround_val_set_raw.pkl /path/to/iground_videos_dir 0.5 /path/to/stanford-corenlp-full-2014-08-27
- For iGround test set:
bash infer_eval_scripts/infer_eval_iground.sh checkpoints/grove_ft_iground_ckpt.bin /path/to/save/token_embeddings.pt /path/to/save/preds.pkl /path/to/iGround_test_set_raw.pkl /path/to/iground_videos_dir 0.5 /path/to/stanford-corenlp-full-2014-08-27
Note: Downloading Stanford CoreNLP from the links provided in the installation instructions gives you a directory stanford-corenlp-full-2014-08-27, which contains Stanford CoreNLP 3.4.1 (used above for evaluation in iGround), and a directory stanford-corenlp-4.5.7, which contains Stanford CoreNLP 4.5.7 (used for evaluation in ActivityNet-Entities).
