CAFA (Controllable Automatic Foley Artist) is a controllable text-video-to-audio model for Foley sound generation. Given a short video and a textual prompt, CAFA generates a synchronized audio waveform that matches both the visual content and the desired semantics described in the prompt. This allows users to modify or override the natural sound of the video by changing the prompt, enabling fine-grained control over the generated audio.
This repository provides the inference tools, pretrained weights, and test results to reproduce our results or build upon our work.
*(Demo video: `demo_video_compressed.mp4`)*
For more examples, visit our demo page.
- We recommend working in a fresh environment:
```bash
git clone https://github.com/finmickey/CAFA.git
cd CAFA

# create env
python -m venv env
source env/bin/activate
```
- Install the prerequisites if not installed yet:
```bash
pip install torch torchvision torchaudio wheel

# Used for downloading the ckpt
git lfs install
```
- Install the requirements (using the legacy resolver speeds up installation of the CUDA dependencies, but it is optional):
```bash
pip install -r requirements.txt --use-deprecated=legacy-resolver
```
- Download the checkpoints and config files. Note that we use the avclip model from Synchformer, but with a different config:
```bash
mkdir ckpts
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/MichaelFinkelson/CAFA-avclip ckpts/
cd ckpts
git lfs pull
cd ..
wget -O ckpts/avclip.pt https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/sync/sync_models/24-01-04T16-39-21/24-01-04T16-39-21.pt
```
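As an optional sanity check, you can confirm that PyTorch was installed correctly and that a CUDA device is visible before running inference (assuming a CUDA-capable machine):
```bash
# optional: print the installed torch version and whether a CUDA GPU is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```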
An inference script is provided at `demo.py`:
```bash
python demo.py --video_path <path_to_video_file.mp4> --prompt "your_prompt"
```
The output is saved to `./output` by default, as both a `.wav` file and a combined `.mp4` file.
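Since generation is conditioned on both the video and the prompt, the same clip can be re-rendered with different sounds by changing only the prompt. A minimal sketch, using a hypothetical file name and prompts:
```bash
# same video, two different prompts -> two different Foley tracks
python demo.py --video_path dog_park.mp4 --prompt "a dog barking"
python demo.py --video_path dog_park.mp4 --prompt "footsteps on gravel"
```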
The computed video embeddings are cached at `./embeds` and can be reused for faster generation:
```bash
python demo.py --video_path <path_to_video_file.mp4> --embed_path <path_to_embedding_file.npy> --prompt "your_prompt"
```
The model supports generation of up to 10 seconds of audio; longer videos are trimmed to 10 seconds.
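For example, a first run computes and caches the embedding, and later runs can point `--embed_path` at the cached file. The file names below are hypothetical; check `./embeds` for the actual cached file name:
```bash
# first run: computes the video embedding and caches it under ./embeds
python demo.py --video_path dog_park.mp4 --prompt "a dog barking"

# later run: reuse the cached embedding (hypothetical file name) for faster generation
python demo.py --video_path dog_park.mp4 --embed_path embeds/dog_park.npy --prompt "rain on a tin roof"
```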
Common parameters:
- `--cfg`: Classifier-free guidance scale (default: 7.0)
- `--steps`: Number of diffusion steps (default: 50)
- `--seed`: Random seed (default: 42)
- `--asym_cfg`: Asymmetric CFG scale (default: 0.5)
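For example (illustrative values and a hypothetical file name), the sampling behavior can be adjusted as follows:
```bash
# stronger guidance and more diffusion steps, with a fixed seed for reproducibility
python demo.py --video_path dog_park.mp4 --prompt "a dog barking" \
    --cfg 9.0 --steps 100 --seed 0 --asym_cfg 0.5
```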
For all available options:
```bash
python demo.py --help
```
We provide the model's generations on the VGGSound test set at this huggingface dataset.
```bibtex
@inproceedings{benita2025controllableautomaticfoleyartist,
  title={Controllable Automatic Foley Artist},
  author={Roi Benita and Michael Finkelson and Tavi Halperin and Gleb Sterkin and Yossi Adi},
  year={2025},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  url={https://arxiv.org/abs/2504.06778},
}
```
The code is primarily based on stable-audio-tools and Synchformer.