Towards Open-Vocabulary Audio-Visual Event Localization

Jinxing Zhou, Dan Guo, Ruohao Guo, Yuxin Mao, Jingjing Hu, Yiran Zhong, Xiaojun Chang, Meng Wang

Official code for our CVPR 2025 paper: Towards Open-Vocabulary Audio-Visual Event Localization

Introduction

In this paper, we propose the Open-Vocabulary Audio-Visual Event Localization (OV-AVEL) task, which aims to localize both seen and unseen audio-visual events in test videos. To the best of our knowledge, this work is the first to advance the AVEL area toward more practical applications in open-vocabulary scenarios. To facilitate this new task, we construct the OV-AVEBench dataset, which includes segment-level manual event annotations. We also establish standard evaluation metrics that cover typical accuracy as well as segment-level and event-level F1-scores. We propose two simple baselines: a training-free baseline, and a variant that upgrades it through further fine-tuning on available training data. We hope that our benchmark will inspire future research in this field.
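As a rough illustration of what a segment-level F1-score measures, the sketch below compares predicted and annotated event labels segment by segment for a single video. It is a generic F1 computation and not the official OV-AVEBench evaluation code; the function name `segment_f1`, the set-based label format, and the handling of the empty case are all assumptions made for this example.

```python
def segment_f1(pred, target):
    """Generic F1 over (segment, label) pairs for one video.

    pred, target: lists of sets, one set of event labels per segment.
    NOTE: hypothetical helper for illustration, not the benchmark's metric code.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must cover the same number of segments")

    tp = fp = fn = 0
    for p, t in zip(pred, target):
        tp += len(p & t)  # labels predicted and annotated in this segment
        fp += len(p - t)  # labels predicted but not annotated
        fn += len(t - p)  # labels annotated but not predicted

    denom = 2 * tp + fp + fn
    # Assumption: a video with no predicted and no annotated events counts as perfect.
    return 2 * tp / denom if denom else 1.0


# Four segments: one correct "dog", one spurious "dog", one wrong label.
pred = [{"dog"}, {"dog"}, set(), {"cat"}]
target = [{"dog"}, set(), set(), {"dog"}]
print(segment_f1(pred, target))
```

An event-level score would instead group consecutive segments with the same label into events before matching them, which this sketch does not attempt.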

Data Preparation

Dataset

The proposed OV-AVEBench dataset is now available. You may directly download the preprocessed audio (.wav) and visual (.png) files from this link to develop your own models for the OV-AVEL task. The raw videos are also available here. Please put the downloaded preprocessed data into the `ovave_dataset_preprocessed` directory.
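After downloading, a quick sanity check can confirm that the data landed in the expected place. The sketch below only counts `.wav` and `.png` files under `ovave_dataset_preprocessed`; it makes no assumption about the internal folder layout, and the helper name `summarize_preprocessed` is hypothetical, not part of this repository.

```python
from pathlib import Path


def summarize_preprocessed(root="ovave_dataset_preprocessed"):
    """Count the .wav and .png files found anywhere under `root`.

    NOTE: hypothetical helper for illustration; not shipped with the repo.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        raise FileNotFoundError(f"Expected directory not found: {root_path}")
    # The sub-folder structure is not assumed, so search recursively.
    return {
        "wav": sum(1 for _ in root_path.rglob("*.wav")),
        "png": sum(1 for _ in root_path.rglob("*.png")),
    }


if __name__ == "__main__":
    try:
        print(summarize_preprocessed())
    except FileNotFoundError as err:
        print(err)
```

If either count is zero, the archive was probably extracted into a different directory than the one the scripts expect.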

Pretrained Backbone

Download the ImageBind_Huge model from https://github.com/facebookresearch/ImageBind/tree/main

Training-free Baseline

bash run_baseline_v0.sh

Fine-tuning Baseline

bash run_baseline_v1_train_fully.sh

Citation

If our work is helpful for your research, please consider giving us a star and citing our paper:

@article{zhou2024towards,
  title={Towards Open-Vocabulary Audio-Visual Event Localization},
  author={Zhou, Jinxing and Guo, Dan and Guo, Ruohao and Mao, Yuxin and Hu, Jingjing and Zhong, Yiran and Chang, Xiaojun and Wang, Meng},
  journal={arXiv preprint arXiv:2411.11278},
  year={2024}
}
