
📗 Paper || 🏠 Project Page

License: MIT

Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

Official PyTorch implementation of AvED.

Abstract

In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation.

We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits.
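
To give a rough intuition for the "delta denoising" part (this sketch uses our own shorthand, not the paper's exact objective): in single-modality Delta Denoising Score (DDS), an edited latent $z$ parameterized by $\theta$ is optimized with the gradient

$$\nabla_{\theta}\mathcal{L}_{\mathrm{DDS}} \approx \big(\epsilon_{\phi}(z_t, y_{\mathrm{edit}}, t) - \epsilon_{\phi}(\hat{z}_t, y_{\mathrm{src}}, t)\big)\,\frac{\partial z}{\partial \theta},$$

where $\hat{z}$ is the source latent, both latents are noised with the same noise at timestep $t$, $\epsilon_{\phi}$ is the pretrained denoiser, and $y_{\mathrm{src}}$, $y_{\mathrm{edit}}$ are the source and target prompts. Subtracting the two noise predictions cancels the prompt-agnostic component and keeps only the edit direction; AvED applies this delta-denoising idea jointly to the video and audio diffusion models so that the two edits stay synchronized.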

Setup

Requirements

conda create -n aved python=3.10
conda activate aved

# Install dependencies
pip install -r requirements.txt

The original Stable Diffusion 2.1 checkpoint has been removed from Hugging Face; results obtained with an alternative checkpoint may differ slightly from those reported.
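
If the original checkpoint is unavailable, one workaround (our suggestion, not an official instruction from the authors) is to download whichever SD 2.1 checkpoint or mirror you can still access to a local folder and point the code at that path:

# Example only: the repo ID and local path below are placeholders.
huggingface-cli download stabilityai/stable-diffusion-2-1-base --local-dir ./checkpoints/sd-2-1-base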

Using the Demo Script

bash run_single_video.sh

You can increase the FPS for demo purposes. (Small trick: generate at around fps=15, then upsample to fps=30; this saves time and somewhat improves visual temporal smoothness.)
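
For example, the upsampling step can be done with ffmpeg's motion-interpolation filter (the file names here are placeholders; this post-processing is not part of run_single_video.sh):

# Interpolate a 15-fps result up to 30 fps; the audio track is copied unchanged.
ffmpeg -i edited_15fps.mp4 -vf "minterpolate=fps=30:mi_mode=mci" -c:a copy edited_30fps.mp4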

Dataset

AvED-Bench

AvED-Bench is drawn from VGGSound, so if your model is trained on VGGSound, make sure the benchmark samples are excluded from your training split.

Citation

If you find this work useful, please consider citing:

@InProceedings{lin2026aved,
author = {Lin, Yan-Bo and Lin, Kevin and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Chung-Ching and Wang, Xiaofei and Bertasius, Gedas and Wang, Lijuan},
title = {Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year = {2026}
}

Acknowledgments

This work builds upon Delta Denoising Score (DDS) and CDS, and leverages pretrained models from Stable Diffusion and AudioLDM2.
