
Learning to Highlight Audio by Watching Movies

University of Rochester, University of Maryland College Park, Meta Reality Labs Research
If our project helps you, please give us a star ⭐ on GitHub to support us.

📋 Table of Contents

  • News
  • Overview
  • Installation
  • Dataset
  • Training
  • Evaluation
  • Pretrained model
  • Gallery
  • Inference Examples
  • FAQ
  • Contributing
  • Contact
  • Acknowledgements
  • Citation

📰 News

  • [2025.07] 🚀 VisAH dataset (base & full) and model checkpoints are now available on the Hugging Face Hub.
  • [2025.03] 🔥🔥 Released training and evaluation code for VisAH.
  • [2025.02] 🎉🎉 VisAH is accepted to CVPR 2025.

๐Ÿ“ Overview

VisAH (Visually Guided Audio Highlighting) is a novel framework that learns to highlight important audio elements in movie scenes by leveraging visual cues. The approach addresses the challenge of automatically enhancing the audio elements that align with the visual content, improving the overall multimedia experience. This repository contains the official implementation of the CVPR 2025 paper.

๐Ÿ› ๏ธ Installation

1. Clone the repository and create the environment

Clone the repository and create a conda environment:

git clone https://github.com/WikiChao/VisAH.git
conda create --name VisAH python=3.10
conda activate VisAH

2. Install dependencies

Install dependencies:

git clone https://github.com/facebookresearch/ImageBind.git
cd ImageBind
pip install .
cd ..
python -m pip install lightning==2.3.0
pip install -U tensorboardX
pip install hear21passt
python3 -m pip install -U demucs
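
As an optional sanity check, the minimal sketch below simply imports the main dependencies installed above; it assumes they expose their usual import names (adjust if your installation differs):

# Optional sanity check: import the main dependencies installed above.
# Import names are assumed to be the usual ones for these packages.
import torch
import lightning
import tensorboardX
import hear21passt
import demucs
import imagebind  # installed from the ImageBind repository above

print("torch", torch.__version__)
print("lightning", lightning.__version__)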

🤖 Dataset

1. Download "The Muddy Mix" Dataset

We have prepared all data and features needed to reproduce the training and evaluation process described in our paper.

Download Options:

Option 1: Base Dataset (Recommended for Quick Start)

  • Contains essential files: visual features, text features, input audio, and ground truth audio
  • Download the base dataset from the Hugging Face Hub (ChaoHuangCS/Muddy_Mix_base; see the Hugging Face Hub subsection below)
  • After downloading, unzip and rename the folder from Muddy_Mix_base to Muddy_Mix
  • Place it in the visah/data/ directory

Option 2: Full Dataset

  • Contains everything in the base dataset plus extracted frames, separated audios, and original video clips
  • Please check download links in dataset/dowload_links

Directory structure:

Muddy_Mix
├── _2EQFo-vIH0
│   ├── sub-video
│   │   ├── _2EQFo-vIH0_000
│   │   │   ├── audio_raw                     # Ground truth movie audio
│   │   │   │   ├── _2EQFo-vIH0_000.wav
│   │   │   ├── frames                        # Video frames
│   │   │   │   ├── 001.png
│   │   │   │   ├── ...
│   │   │   ├── frames_feats                  # Extracted visual features
│   │   │   │   ├── visual_feats.pt
│   │   │   ├── frames_captions               # Extracted textual features
│   │   │   │   ├── InternVL2-8B_prompt1_feats.pt
│   │   │   ├── remix_global                  # Mixed audio data
│   │   │   │   ├── ...
│   │   │   │   ├── target_mix.wav
│   │   │   ├── separated                     # Separated wav files from original waveform
│   │   ├── _2EQFo-vIH0_000.mkv

2. Degradation Method

We generated the dataset once for the experiments in our paper. However, you can generate additional data for augmentation using the example in preprocessing/Degradation_generation.py.
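
For intuition only, the sketch below shows the general idea of creating a degraded ("muddy") input by re-combining separated stems with randomized attenuation. It is an illustration, not the repository's actual degradation script, and it assumes numpy and soundfile are available and that the stems share a sample rate, length, and channel layout:

# Illustrative only -- not the repository's degradation pipeline.
# Remix separated stems with random attenuation to create a degraded mix.
import glob
import numpy as np
import soundfile as sf

stems = sorted(glob.glob("separated/*.wav"))   # illustrative stem location
assert stems, "no separated stems found"

mix, sr = None, None
for path in stems:
    audio, sr = sf.read(path)
    gain = np.random.uniform(0.2, 1.0)         # attenuate each stem randomly
    mix = audio * gain if mix is None else mix + audio * gain

mix = mix / max(1e-8, float(np.abs(mix).max()))  # simple peak normalization
sf.write("degraded_mix.wav", mix, sr)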


3. Hugging Face Hub

We host both versions of the dataset on the Hugging Face Hub (ChaoHuangCS/Muddy_Mix_base and ChaoHuangCS/Muddy_Mix) for easy download. You can fetch them programmatically:

from huggingface_hub import snapshot_download

# Base version
snapshot_download(
    repo_id="ChaoHuangCS/Muddy_Mix_base",
    repo_type="dataset",
    local_dir="visah/data/Muddy_Mix_base"
)

# Full version
snapshot_download(
    repo_id="ChaoHuangCS/Muddy_Mix",
    repo_type="dataset",
    local_dir="visah/data/Muddy_Mix"
)
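
If you download the base version this way, recall that the earlier instructions expect the data under visah/data/Muddy_Mix; a minimal sketch of the rename step (assuming the snapshot_download call above completed into visah/data/Muddy_Mix_base):

# Optional: move the base download to the folder name used elsewhere in this README.
import os

if os.path.isdir("visah/data/Muddy_Mix_base") and not os.path.exists("visah/data/Muddy_Mix"):
    os.rename("visah/data/Muddy_Mix_base", "visah/data/Muddy_Mix")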

๐Ÿ—๏ธ Training

After setting your dataset path, start training with:

cd visah
python run_model.py --config configs/main_config.yaml
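
If you are unsure which entry in the config holds the dataset path, the sketch below simply loads and prints the config so you can locate it; it assumes PyYAML is available, and the key names themselves are not documented here:

# Inspect the training config to locate the dataset-path (and mode) entries.
# Assumes PyYAML; the exact key names depend on the actual config file.
import yaml

with open("configs/main_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg)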

✅ Evaluation

To evaluate the model:

  1. Set mode: test in configs/main_config.yaml (or flip it programmatically, as sketched after this list)
  2. Run:
python run_model.py --config configs/main_config.yaml
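
The sketch below flips the mode entry without hand-editing the file; it assumes PyYAML and that mode is a top-level key (as step 1 suggests), and note that rewriting the YAML this way drops comments and reorders keys:

# Set mode: test in the config before running evaluation.
# Assumes PyYAML and a top-level "mode" key; rewriting drops YAML comments.
import yaml

with open("configs/main_config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["mode"] = "test"

with open("configs/main_config.yaml", "w") as f:
    yaml.safe_dump(cfg, f)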

🌎 Pretrained model

Download our pretrained model checkpoints from the Hugging Face Hub (see the next section).

🌎 Hugging Face Hub Checkpoints

Our model checkpoints are also available on the Hugging Face Hub (ChaoHuangCS/VisAH). You can fetch them programmatically:

from huggingface_hub import snapshot_download

# Model checkpoints
snapshot_download(
    repo_id="ChaoHuangCS/VisAH",
    repo_type="model",
    local_dir="visah/checkpoints"
)
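
Once downloaded, a quick check that a checkpoint loads can be useful; this is a sketch only, and the filename below is a placeholder, so list visah/checkpoints/ to find the real one:

# Quick check that a downloaded checkpoint can be loaded.
# The filename is a placeholder -- list visah/checkpoints/ for the real one.
# On newer PyTorch you may need weights_only=False for full Lightning checkpoints.
import os
import torch

print(os.listdir("visah/checkpoints"))
ckpt = torch.load("visah/checkpoints/<checkpoint-file>.ckpt", map_location="cpu")
print(list(ckpt)[:5] if isinstance(ckpt, dict) else type(ckpt))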

🎯 Gallery

Please refer to our Gallery, which showcases audio highlighting results alongside the original movie clips.

🚀 Inference Examples

To run inference on your own videos, follow these steps:

  1. Prepare your video files and place them in the input_videos directory.
  2. Run the inference script:
python run_inference.py --config configs/inference_config.yaml
  3. The enhanced audio files will be saved in the output_audio directory (a small sanity check is sketched after this list).
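
The sketch below lists the enhanced files and prints their duration and sample rate; it assumes soundfile is available and that the outputs are .wav files in output_audio:

# Sanity-check the enhanced audio written by the inference script.
# Assumes soundfile and .wav outputs in the output_audio directory.
import glob
import soundfile as sf

for path in sorted(glob.glob("output_audio/*.wav")):
    audio, sr = sf.read(path)
    print(path, f"{len(audio) / sr:.2f} s", f"{sr} Hz")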

โ“ FAQ

  • Q: Can I use a different dataset for training?

    • A: Yes, you can modify the dataset path in the configuration file and ensure the data format matches our requirements.
  • Q: How can I contribute to this project?

    • A: Please refer to the Contributing section for guidelines.

๐Ÿ‘ Contributing

We welcome contributions from the community! If you would like to contribute, please follow these steps:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature-branch)
  3. Make your changes
  4. Commit your changes (git commit -am 'Add new feature')
  5. Push to the branch (git push origin feature-branch)
  6. Create a new Pull Request

📧 Contact

If you have any questions or need further assistance, feel free to reach out to us.

๐Ÿ‘ Acknowledgements

We use code from the bandit Cinematic Audio Source Separation repository to generate the imperfect separation results.

📑 Citation

If you use this code for your research, please cite our work:

@inproceedings{huang2025learning,
  title={Learning to Highlight Audio by Watching Movies},
  author={Huang, Chao and Gao, Ruohan and Tsang, JMF and Kurcius, Jan and Bilen, Cagdas and Xu, Chenliang and Kumar, Anurag and Parekh, Sanjeel},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={23925--23935},
  year={2025}
}
