Chao Huang,
Ruohan Gao,
J. M. F. Tsang,
Jan Kurcius,
Cagdas Bilen,
Chenliang Xu,
Anurag Kumar,
Sanjeel Parekh
Table of Contents

- News
- Overview
- Installation
- Dataset
- Training
- Evaluation
- Gallery
- Inference Examples
- FAQ
- Contributing
- Contact
- Acknowledgements
- Citation
- [2025.07] VisAH dataset (base & full) and model checkpoints are now available on the Hugging Face Hub.
- [2025.03] Released training and evaluation code for VisAH.
- [2025.02] VisAH is accepted to CVPR 2025.
VisAH (Visually Guided Audio Highlighting) is a framework that learns to highlight important audio elements in movie scenes by leveraging visual cues. The approach addresses the challenge of automatically enhancing audio elements that align with the visual content, improving the overall multimedia experience. This repository contains the official implementation of the CVPR 2025 paper.
Clone the repository and create a conda environment:
git clone https://github.com/WikiChao/VisAH.git
conda create --name VisAH python=3.10
conda activate VisAH

Install dependencies:
git clone https://github.com/facebookresearch/ImageBind.git
cd ImageBind
pip install .
cd ..
python -m pip install lightning==2.3.0
pip install -U tensorboardX
pip install hear21passt
python3 -m pip install -U demucs
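As a quick optional sanity check that the environment resolved correctly, the imports below should all succeed. The package names follow the pip installs above; the ImageBind import path is taken from the facebookresearch/ImageBind repository, so adjust it if your install differs.

```python
# Optional sanity check: confirm the dependencies installed above import cleanly.
import torch
import lightning
import tensorboardX
from imagebind.models import imagebind_model   # from facebookresearch/ImageBind
from hear21passt.base import get_basic_model   # PaSST audio transformer
import demucs.separate                         # Demucs source separation

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("lightning", lightning.__version__)
```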
We have prepared all data and features needed to reproduce the training and evaluation process described in our paper.

Option 1: Base Dataset (Recommended for Quick Start)
- Contains essential files: visual features, text features, input audio, and ground truth audio
- Download Base Dataset
- After downloading, unzip and rename the folder from `Muddy_Mix_base` to `Muddy_Mix`
- Place it in the `visah/data/` directory
Option 2: Full Dataset
- Contains everything in the base dataset plus extracted frames, separated audios, and original video clips
- Please check download links in `dataset/dowload_links`
Directory structure:

Muddy_Mix
├── _2EQFo-vIH0
│   ├── sub-video
│   │   ├── _2EQFo-vIH0_000
│   │   │   ├── audio_raw                      # Ground truth movie audio
│   │   │   │   └── _2EQFo-vIH0_000.wav
│   │   │   ├── frames                         # Video frames
│   │   │   │   ├── 001.png
│   │   │   │   └── ...
│   │   │   ├── frames_feats                   # Extracted visual features
│   │   │   │   └── visual_feats.pt
│   │   │   ├── frames_captions                # Extracted textual features
│   │   │   │   └── InternVL2-8B_prompt1_feats.pt
│   │   │   ├── remix_global                   # Mixed audio data
│   │   │   │   ├── ...
│   │   │   │   └── target_mix.wav
│   │   │   └── separated                      # Separated wav files from the original waveform
│   │   └── _2EQFo-vIH0_000.mkv
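To get a feel for what each clip contains, here is a minimal loading sketch based on the file names in the tree above. torchaudio is an extra assumption (any wav loader works), and the tensor shapes depend on the extraction settings.

```python
# Minimal sketch of inspecting one clip, following the directory tree above.
# torchaudio is assumed only for wav loading; adjust paths to your data root.
from pathlib import Path

import torch
import torchaudio

clip = Path("visah/data/Muddy_Mix/_2EQFo-vIH0/sub-video/_2EQFo-vIH0_000")

visual_feats = torch.load(clip / "frames_feats" / "visual_feats.pt", map_location="cpu")
text_feats = torch.load(clip / "frames_captions" / "InternVL2-8B_prompt1_feats.pt", map_location="cpu")
mix, sr = torchaudio.load(str(clip / "remix_global" / "target_mix.wav"))  # mixed audio data
gt, _ = torchaudio.load(str(clip / "audio_raw" / f"{clip.name}.wav"))     # ground-truth movie audio

print("visual:", tuple(visual_feats.shape) if torch.is_tensor(visual_feats) else type(visual_feats))
print("text:", tuple(text_feats.shape) if torch.is_tensor(text_feats) else type(text_feats))
print("mix:", tuple(mix.shape), "| gt:", tuple(gt.shape), "| sample rate:", sr)
```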
We generated the dataset once for the experiments in our paper. However, you can generate additional data for augmentation using the example in `preprocessing/Degradation_generation.py`.
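The concrete transformations live in that script. Purely as an illustration of the general idea (perturbing separated stems and remixing them), a toy version might look like the following; the gain range, stem handling, and function name are hypothetical, not the paper's settings.

```python
# Toy illustration only: build a degraded remix by applying a random gain to each
# separated stem before summing. The actual augmentation logic is in
# preprocessing/Degradation_generation.py; everything below is hypothetical.
import random
from pathlib import Path

import torch
import torchaudio


def random_gain_remix(separated_dir: Path, out_path: Path, gain_db_range=(-12.0, 0.0)) -> None:
    stems, sr = [], None
    for wav_file in sorted(separated_dir.glob("*.wav")):
        wav, sr = torchaudio.load(str(wav_file))
        gain_db = random.uniform(*gain_db_range)        # attenuate each stem by a random amount
        stems.append(wav * 10.0 ** (gain_db / 20.0))
    remix = torch.stack(stems).sum(dim=0)               # assumes equal-length stems
    remix = remix / remix.abs().max().clamp(min=1e-8)   # peak-normalize to avoid clipping
    torchaudio.save(str(out_path), remix, sr)
```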
We now host both versions of our dataset on the Hugging Face Hub for easy download:
- Base Dataset: Hugging Face - Muddy_Mix Base
- Full Dataset: Hugging Face - Muddy_Mix Full
You can also programmatically fetch the datasets:
from huggingface_hub import snapshot_download
# Base version
snapshot_download(
repo_id="ChaoHuangCS/Muddy_Mix_base",
repo_type="dataset",
local_dir="visah/data/Muddy_Mix_base"
)
# Full version
snapshot_download(
repo_id="ChaoHuangCS/Muddy_Mix",
repo_type="dataset",
local_dir="visah/data/Muddy_Mix"
)

After setting your dataset path, start training with:
cd visah
python run_model.py --config configs/main_config.yaml

To evaluate the model:
- Set `mode: test` in `configs/main_config.yaml`
- Run: `python run_model.py --config configs/main_config.yaml`

Download our pretrained model checkpoints from here.
Our model checkpoints are also available on the Hugging Face Hub:
- VisAH Model: Hugging Face - VisAH Model
You can fetch them programmatically:
from huggingface_hub import snapshot_download

snapshot_download(
repo_id="ChaoHuangCS/VisAH",
repo_type="model",
local_dir="visah/checkpoints"
)
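As a quick check that the download succeeded, you can peek inside a checkpoint without instantiating the model. The `visah/checkpoints` path matches the `snapshot_download` call above; the file extensions searched for are assumptions, since the exact checkpoint file names may differ.

```python
# Peek inside a downloaded checkpoint without needing the model class.
from pathlib import Path

import torch

ckpt_dir = Path("visah/checkpoints")
candidates = [p for ext in ("*.ckpt", "*.pt", "*.pth") for p in ckpt_dir.rglob(ext)]
assert candidates, f"no checkpoint files found under {ckpt_dir}"

# weights_only=False because Lightning-style checkpoints store more than raw
# tensors; only do this for checkpoints from a source you trust.
ckpt = torch.load(candidates[0], map_location="cpu", weights_only=False)
print(candidates[0].name, "| top-level keys:", list(ckpt)[:8])
```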
Please refer to the Gallery, which showcases audio highlighting results alongside the original movie clips.

To run inference on your own videos, follow these steps:
- Prepare your video files and place them in the `input_videos` directory.
- Run the inference script: `python run_inference.py --config configs/inference_config.yaml`
- The enhanced audio files will be saved in the `output_audio` directory (see the optional muxing sketch below).
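To preview a result in context, one option (outside this repo) is to mux the enhanced audio back into the source video with ffmpeg; the file names below are placeholders and ffmpeg must be installed separately.

```python
# Replace a video's audio track with the enhanced audio using ffmpeg.
# File names are placeholders; adjust them to your own inputs/outputs.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "input_videos/example.mp4",   # original video (placeholder name)
        "-i", "output_audio/example.wav",   # enhanced audio from run_inference.py
        "-map", "0:v", "-map", "1:a",       # take video from input 0, audio from input 1
        "-c:v", "copy",                     # copy the video stream without re-encoding
        "-shortest",
        "example_highlighted.mp4",
    ],
    check=True,
)
```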
- Q: Can I use a different dataset for training?
  - A: Yes, you can modify the dataset path in the configuration file and ensure the data format matches our requirements.
- Q: How can I contribute to this project?
  - A: Please refer to the Contributing section for guidelines.
We welcome contributions from the community! If you would like to contribute, please follow these steps:
- Fork the repository
- Create a new branch (`git checkout -b feature-branch`)
- Make your changes
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature-branch`)
- Create a new Pull Request
If you have any questions or need further assistance, feel free to reach out to us:
- Chao Huang: [email protected]
We utilized code from the bandit Cinematic Audio Source Separation repository to generate the imperfect separation results.
If you use this code for your research, please cite our work:
@inproceedings{huang2025learning,
title={Learning to Highlight Audio by Watching Movies},
author={Huang, Chao and Gao, Ruohan and Tsang, JMF and Kurcius, Jan and Bilen, Cagdas and Xu, Chenliang and Kumar, Anurag and Parekh, Sanjeel},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={23925--23935},
year={2025}
}