
📗 Paper || 🏠 Project Page

License: MIT

Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

Official PyTorch implementation of AvED.

Abstract

In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation.

We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits.
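
To give a rough intuition for the "delta denoising" part (this sketch uses our own shorthand, not the paper's exact objective): in single-modality Delta Denoising Score (DDS), an edited latent $z$ parameterized by $\theta$ is optimized with the gradient

$$\nabla_{\theta}\mathcal{L}_{\mathrm{DDS}} \approx \big(\epsilon_{\phi}(z_t, y_{\mathrm{edit}}, t) - \epsilon_{\phi}(\hat{z}_t, y_{\mathrm{src}}, t)\big)\,\frac{\partial z}{\partial \theta},$$

where $\hat{z}$ is the source latent, both latents are noised with the same noise at timestep $t$, $\epsilon_{\phi}$ is the pretrained denoiser, and $y_{\mathrm{src}}$, $y_{\mathrm{edit}}$ are the source and target prompts. Subtracting the two noise predictions cancels the prompt-agnostic component and keeps only the edit direction; AvED applies this delta-denoising idea jointly to the video and audio diffusion models so that the two edits stay synchronized.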

Setup

Requirements

conda create -n aved python=3.10
conda activate aved

# Install dependencies
pip install -r requirements.txt

The original Stable Diffusion 2.1 checkpoint has been removed from Hugging Face; results obtained with an alternative checkpoint may differ slightly from those reported.
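
If the original checkpoint is unavailable, one workaround (our suggestion, not an official instruction from the authors) is to download whichever SD 2.1 checkpoint or mirror you can still access to a local folder and point the code at that path:

# Example only: the repo ID and local path below are placeholders.
huggingface-cli download stabilityai/stable-diffusion-2-1-base --local-dir ./checkpoints/sd-2-1-base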

Using the Demo Script

bash run_single_video.sh

You can increase the FPS for demo purposes. (Small trick: generate at around fps=15, then upsample to fps=30; this saves time and somewhat improves visual temporal smoothness.)
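
For example, the upsampling step can be done with ffmpeg's motion-interpolation filter (the file names here are placeholders; this post-processing is not part of run_single_video.sh):

# Interpolate a 15-fps result up to 30 fps; the audio track is copied unchanged.
ffmpeg -i edited_15fps.mp4 -vf "minterpolate=fps=30:mi_mode=mci" -c:a copy edited_30fps.mp4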

Dataset

AvED-Bench

AvED-Bench is drawn from VGGSound, so if your model is trained on VGGSound, make sure the benchmark samples are excluded from your training split.

Citation

If you find this work useful, please consider citing:

@InProceedings{lin2026aved,
author = {Lin, Yan-Bo and Lin, Kevin and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Chung-Ching and Wang, Xiaofei and Bertasius, Gedas and Wang, Lijuan},
title = {Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year = {2026}
}

Acknowledgments

This work builds upon Delta Denoising Score (DDS) and CDS, and leverages pretrained models from Stable Diffusion and AudioLDM2.
