Official PyTorch implementation of AvED.
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without any additional model training. To evaluate this task, we curate AvED-Bench, a benchmark dataset designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each 10 seconds long, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation.
We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits.
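As a rough intuition for the delta-denoising idea the framework builds on, here is a minimal, self-contained sketch of one Delta Denoising Score (DDS)-style update. The `toy_denoiser` is a stand-in for a pretrained diffusion model's noise predictor (AvED uses pretrained Stable Diffusion and AudioLDM2); all function names and the update rule shown here are illustrative, not the repository's actual implementation.

```python
import numpy as np

def toy_denoiser(z, prompt_embed, t):
    # Stand-in for a pretrained diffusion model's noise prediction
    # (a real pipeline would call the UNet of Stable Diffusion / AudioLDM2).
    return 0.1 * z + 0.05 * prompt_embed

def dds_step(z_src, z_tgt, y_src, y_tgt, t, lr=0.1, rng=None):
    """One delta-denoising-style update (conceptual sketch).

    Both branches share the same noise and timestep, so subtracting the
    two noise predictions cancels the prompt-agnostic component and keeps
    only the edit direction implied by swapping y_src -> y_tgt.
    """
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(z_src.shape)
    eps_tgt = toy_denoiser(z_tgt + noise, y_tgt, t)  # edited branch
    eps_src = toy_denoiser(z_src + noise, y_src, t)  # reference branch
    grad = eps_tgt - eps_src                         # delta score
    return z_tgt - lr * grad                         # step on the edited latent
```

With identical latents and prompts the delta vanishes and the latent is left unchanged, which is the property that keeps unedited regions stable.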
conda create -n aved python=3.10
conda activate aved
# Install dependencies
pip install -r requirements.txt
Note: the original Stable Diffusion 2.1 weights have been removed from Hugging Face; results with a substitute checkpoint may differ slightly.
bash run_single_video.sh
You can increase the FPS for demo purposes. (Small trick: run editing at around fps=15, then upsample to fps=30; this saves time and somewhat improves visual temporal smoothness.)
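One way to apply the upsampling trick above is with ffmpeg; the filenames below are placeholders for your own edited clip, not files produced by this repository.

```shell
# Plain frame duplication from 15 fps to 30 fps (fast, no new content):
ffmpeg -i edited_15fps.mp4 -r 30 edited_30fps.mp4

# Or motion-compensated interpolation for smoother playback (slower):
ffmpeg -i edited_15fps.mp4 -vf "minterpolate=fps=30" edited_30fps_smooth.mp4
```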
If your model is trained on VGGSound, make sure to exclude samples from its training split when evaluating on AvED-Bench.
If you find this work useful, please consider citing:
@InProceedings{lin2026aved,
author = {Lin, Yan-Bo and Lin, Kevin and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Chung-Ching and Wang, Xiaofei and Bertasius, Gedas and Wang, Lijuan},
title = {Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year = {2026}
}
This work builds upon Delta Denoising Score (DDS) and CDS, and leverages pretrained models from Stable Diffusion and AudioLDM2.
