Chao Huang,
Ruohan Gao,
J. M. F. Tsang,
Jan Kurcius,
Cagdas Bilen,
Chenliang Xu,
Anurag Kumar,
Sanjeel Parekh
Table of Contents

- News
- Overview
- Installation
- Dataset
- Training
- Evaluation
- Gallery
- Inference Examples
- FAQ
- Contributing
- Contact
- Acknowledgements
- Citation
- [2025.07] VisAH dataset (base & full) and model checkpoints are now available on the Hugging Face Hub.
- [2025.03] Released training and evaluation code for VisAH.
- [2025.02] VisAH is accepted to CVPR 2025.
VisAH (Visually Guided Audio Highlighting) is a framework that learns to highlight important audio elements in movie scenes by leveraging visual cues. The approach addresses the challenge of automatically enhancing audio elements that align with the visual content, improving the overall multimedia experience. This repository contains the official implementation of the CVPR 2025 paper.
Clone the repository and create a conda environment:
git clone https://github.com/WikiChao/VisAH.git
conda create --name VisAH python=3.10
conda activate VisAH

Install dependencies:
git clone https://github.com/facebookresearch/ImageBind.git
cd ImageBind
pip install .
cd ..
python -m pip install lightning==2.3.0
pip install -U tensorboardX
pip install hear21passt
python3 -m pip install -U demucs
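As a quick optional sanity check that the environment resolved correctly, the imports below should all succeed. The package names follow the pip installs above; the ImageBind import path is taken from the facebookresearch/ImageBind repository, so adjust it if your install differs.

```python
# Optional sanity check: confirm the dependencies installed above import cleanly.
import torch
import lightning
import tensorboardX
from imagebind.models import imagebind_model   # from facebookresearch/ImageBind
from hear21passt.base import get_basic_model   # PaSST audio transformer
import demucs.separate                         # Demucs source separation

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("lightning", lightning.__version__)
```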
We have prepared all data and features needed to reproduce the training and evaluation process described in our paper.

Option 1: Base Dataset (Recommended for Quick Start)
- Contains essential files: visual features, text features, input audio, and ground truth audio
- Download Base Dataset
- After downloading, unzip and rename the folder from `Muddy_Mix_base` to `Muddy_Mix`
- Place it in the `visah/data/` directory
Option 2: Full Dataset
- Contains everything in the base dataset plus extracted frames, separated audios, and original video clips
- Please check download links in `dataset/dowload_links`
Directory structure:

Muddy_Mix
├── _2EQFo-vIH0
│   ├── sub-video
│   │   ├── _2EQFo-vIH0_000
│   │   │   ├── audio_raw                      # Ground truth movie audio
│   │   │   │   └── _2EQFo-vIH0_000.wav
│   │   │   ├── frames                         # Video frames
│   │   │   │   ├── 001.png
│   │   │   │   └── ...
│   │   │   ├── frames_feats                   # Extracted visual features
│   │   │   │   └── visual_feats.pt
│   │   │   ├── frames_captions                # Extracted textual features
│   │   │   │   └── InternVL2-8B_prompt1_feats.pt
│   │   │   ├── remix_global                   # Mixed audio data
│   │   │   │   ├── ...
│   │   │   │   └── target_mix.wav
│   │   │   └── separated                      # Separated wav files from the original waveform
│   │   └── _2EQFo-vIH0_000.mkv
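To get a feel for what each clip contains, here is a minimal loading sketch based on the file names in the tree above. torchaudio is an extra assumption (any wav loader works), and the tensor shapes depend on the extraction settings.

```python
# Minimal sketch of inspecting one clip, following the directory tree above.
# torchaudio is assumed only for wav loading; adjust paths to your data root.
from pathlib import Path

import torch
import torchaudio

clip = Path("visah/data/Muddy_Mix/_2EQFo-vIH0/sub-video/_2EQFo-vIH0_000")

visual_feats = torch.load(clip / "frames_feats" / "visual_feats.pt", map_location="cpu")
text_feats = torch.load(clip / "frames_captions" / "InternVL2-8B_prompt1_feats.pt", map_location="cpu")
mix, sr = torchaudio.load(str(clip / "remix_global" / "target_mix.wav"))  # mixed audio data
gt, _ = torchaudio.load(str(clip / "audio_raw" / f"{clip.name}.wav"))     # ground-truth movie audio

print("visual:", tuple(visual_feats.shape) if torch.is_tensor(visual_feats) else type(visual_feats))
print("text:", tuple(text_feats.shape) if torch.is_tensor(text_feats) else type(text_feats))
print("mix:", tuple(mix.shape), "| gt:", tuple(gt.shape), "| sample rate:", sr)
```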
We generated the dataset once for the experiments in our paper. However, you can generate additional data for augmentation using the example in `preprocessing/Degradation_generation.py`.
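The concrete transformations live in that script. Purely as an illustration of the general idea (perturbing separated stems and remixing them), a toy version might look like the following; the gain range, stem handling, and function name are hypothetical, not the paper's settings.

```python
# Toy illustration only: build a degraded remix by applying a random gain to each
# separated stem before summing. The actual augmentation logic is in
# preprocessing/Degradation_generation.py; everything below is hypothetical.
import random
from pathlib import Path

import torch
import torchaudio


def random_gain_remix(separated_dir: Path, out_path: Path, gain_db_range=(-12.0, 0.0)) -> None:
    stems, sr = [], None
    for wav_file in sorted(separated_dir.glob("*.wav")):
        wav, sr = torchaudio.load(str(wav_file))
        gain_db = random.uniform(*gain_db_range)        # attenuate each stem by a random amount
        stems.append(wav * 10.0 ** (gain_db / 20.0))
    remix = torch.stack(stems).sum(dim=0)               # assumes equal-length stems
    remix = remix / remix.abs().max().clamp(min=1e-8)   # peak-normalize to avoid clipping
    torchaudio.save(str(out_path), remix, sr)
```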
We now host both versions of our dataset on the Hugging Face Hub for easy download:
- Base Dataset: Hugging Face - Muddy_Mix Base
- Full Dataset: Hugging Face - Muddy_Mix Full
You can also programmatically fetch the datasets:
from huggingface_hub import snapshot_download
# Base version
snapshot_download(
repo_id="ChaoHuangCS/Muddy_Mix_base",
repo_type="dataset",
local_dir="visah/data/Muddy_Mix_base"
)
# Full version
snapshot_download(
repo_id="ChaoHuangCS/Muddy_Mix",
repo_type="dataset",
local_dir="visah/data/Muddy_Mix"
)

After setting your dataset path, start training with:
cd visah
python run_model.py --config configs/main_config.yaml

To evaluate the model:
- Set `mode: test` in `configs/main_config.yaml`
- Run: `python run_model.py --config configs/main_config.yaml`

Download our pretrained model checkpoints from here.
Our model checkpoints are also available on the Hugging Face Hub:
- VisAH Model: Hugging Face - VisAH Model
You can fetch them programmatically:
from huggingface_hub import snapshot_download

snapshot_download(
repo_id="ChaoHuangCS/VisAH",
repo_type="model",
local_dir="visah/checkpoints"
)
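As a quick check that the download succeeded, you can peek inside a checkpoint without instantiating the model. The `visah/checkpoints` path matches the `snapshot_download` call above; the file extensions searched for are assumptions, since the exact checkpoint file names may differ.

```python
# Peek inside a downloaded checkpoint without needing the model class.
from pathlib import Path

import torch

ckpt_dir = Path("visah/checkpoints")
candidates = [p for ext in ("*.ckpt", "*.pt", "*.pth") for p in ckpt_dir.rglob(ext)]
assert candidates, f"no checkpoint files found under {ckpt_dir}"

# weights_only=False because Lightning-style checkpoints store more than raw
# tensors; only do this for checkpoints from a source you trust.
ckpt = torch.load(candidates[0], map_location="cpu", weights_only=False)
print(candidates[0].name, "| top-level keys:", list(ckpt)[:8])
```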
Please refer to the Gallery, which showcases audio highlighting results alongside the original movie clips.

To run inference on your own videos, follow these steps:
- Prepare your video files and place them in the `input_videos` directory.
- Run the inference script: `python run_inference.py --config configs/inference_config.yaml`
- The enhanced audio files will be saved in the `output_audio` directory (see the optional muxing sketch below).
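To preview a result in context, one option (outside this repo) is to mux the enhanced audio back into the source video with ffmpeg; the file names below are placeholders and ffmpeg must be installed separately.

```python
# Replace a video's audio track with the enhanced audio using ffmpeg.
# File names are placeholders; adjust them to your own inputs/outputs.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "input_videos/example.mp4",   # original video (placeholder name)
        "-i", "output_audio/example.wav",   # enhanced audio from run_inference.py
        "-map", "0:v", "-map", "1:a",       # take video from input 0, audio from input 1
        "-c:v", "copy",                     # copy the video stream without re-encoding
        "-shortest",
        "example_highlighted.mp4",
    ],
    check=True,
)
```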
- Q: Can I use a different dataset for training?
  - A: Yes, you can modify the dataset path in the configuration file and ensure the data format matches our requirements.
- Q: How can I contribute to this project?
  - A: Please refer to the Contributing section for guidelines.
We welcome contributions from the community! If you would like to contribute, please follow these steps:
- Fork the repository
- Create a new branch (`git checkout -b feature-branch`)
- Make your changes
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature-branch`)
- Create a new Pull Request
If you have any questions or need further assistance, feel free to reach out to us:
- Chao Huang: [email protected]
We utilized code from the bandit Cinematic Audio Source Separation repository to generate the imperfect separation results.
If you use this code for your research, please cite our work:
@inproceedings{huang2025learning,
title={Learning to Highlight Audio by Watching Movies},
author={Huang, Chao and Gao, Ruohan and Tsang, JMF and Kurcius, Jan and Bilen, Cagdas and Xu, Chenliang and Kumar, Anurag and Parekh, Sanjeel},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={23925--23935},
year={2025}
}