This repository contains implementation details of “TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining”.
TACOS is a dataset with strong captions, i.e., textual descriptions of acoustic events with their corresponding temporal onsets and offsets.
The dataset can be downloaded via Zenodo: https://zenodo.org/records/15379789
Traditional audio-language models rely on global (clip-level) captions, which provide at best a rough temporal position of acoustic events.
TACOS solves this by providing:
- 12,358 audio recordings annotated with
- 47,748 temporally-aligned captions linked to specific regions
The regions' onsets and offsets can be used to provide stronger supervision during text-audio pretraining.
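To make the idea of stronger supervision concrete, here is a minimal sketch of how one strong caption could be turned into frame-level targets. The class name, field names, frame rate, and example caption are all hypothetical and chosen for illustration; they are not the dataset's actual schema or the repository's code.

```python
from dataclasses import dataclass


@dataclass
class StrongCaption:
    """One temporally-aligned caption (hypothetical field names)."""
    text: str
    onset: float   # seconds from the start of the recording
    offset: float  # seconds from the start of the recording


def frame_targets(caption, clip_duration, frames_per_second=10):
    """Return a 0/1 list marking the frames that overlap the caption's region."""
    n_frames = int(round(clip_duration * frames_per_second))
    targets = []
    for i in range(n_frames):
        frame_start = i / frames_per_second
        frame_end = (i + 1) / frames_per_second
        # A frame is active if it overlaps the interval [onset, offset).
        overlaps = frame_start < caption.offset and frame_end > caption.onset
        targets.append(1 if overlaps else 0)
    return targets


# Made-up example caption, not taken from the dataset.
caption = StrongCaption(text="a dog barks twice", onset=1.0, offset=2.5)
targets = frame_targets(caption, clip_duration=4.0, frames_per_second=2)
print(targets)  # [0, 0, 1, 1, 1, 0, 0, 0]
```

A weak (clip-level) caption corresponds to the special case where every frame is marked active, which is why it carries no information about where in the clip the event occurs.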
The following figure illustrates the difference between weak captions (left) and strong captions (right):
Prerequisites
- Linux (tested on Ubuntu 24.04)
- conda, e.g., Miniconda3-latest-Linux-x86_64.sh
- Clone this repository.
git clone [email protected]:OptimusPrimus/tacos.git
- Create and activate a conda environment with Python 3.11:
conda create -n d25_t6 python=3.11
conda activate d25_t6
- Install 7z
# (on linux, via apt)
sudo apt install p7zip-full
# (alternatively on linux, via conda)
conda install -c conda-forge p7zip
# (on windows)
conda install -c conda-forge 7zip
- Install a PyTorch version that suits your system. For example:
# for cuda >= 12.1 (check with nvidia-smi)
pip3 install torch torchvision torchaudio
# for cuda 11.8
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# for other versions see: https://pytorch.org/get-started/locally/
- Install other dependencies:
pip3 install -r requirements.txt
- If you have not used Weights and Biases for logging before, you can create a free account. On your machine, run
wandb login
and paste your API key into the command line when prompted.
- Download the TACOS dataset. The dataset is available on Zenodo:
- https://zenodo.org/records/15379789
- place it into the folder called data in the main directory
- Download AudioSet Strong
- AudioSet recordings are not publicly available due to licensing issues.
- A download script will be provided upon request to [email protected].
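Before starting a training run, it can help to sanity-check the prerequisites listed above. The following is a minimal sketch, not part of the repository: the function name check_setup is hypothetical, and it assumes that wandb is among the packages installed from requirements.txt and that the dataset folder is named data, as described above.

```python
import importlib.util
import shutil
import sys
from pathlib import Path


def check_setup(data_dir="data"):
    """Collect simple pass/fail checks for the prerequisites (hypothetical helper)."""
    return {
        # The conda environment above is created with Python 3.11.
        "python_3_11": sys.version_info[:2] == (3, 11),
        # find_spec only looks the package up; it does not import it.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        # Assumption: wandb is installed via requirements.txt.
        "wandb_installed": importlib.util.find_spec("wandb") is not None,
        # Depending on the package, the 7-Zip binary may be named 7z, 7za, or 7zr.
        "7z_on_path": any(shutil.which(name) for name in ("7z", "7za", "7zr")),
        "data_dir_exists": Path(data_dir).is_dir(),
    }


if __name__ == "__main__":
    for name, ok in check_setup().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

Run it from the main directory of the repository so that the relative data path resolves correctly.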
Pre-Training on Clotho
python -m srv.train \
--data_path=data \
--strong_weight=0.0 \
--weak_weight=1.0
Strong Fine-Tuning on TACOS
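Fine-tuning differs from pre-training mainly in which loss term is switched on. A common way to read the --strong_weight and --weak_weight flags is as coefficients of a weighted sum of a region-level (strong) loss and a clip-level (weak) loss. The sketch below assumes exactly that; the function name combined_loss and the loss values are hypothetical, and the repository's training code may combine the terms differently.

```python
def combined_loss(strong_loss, weak_loss, strong_weight, weak_weight):
    """Weighted sum of a region-level (strong) and a clip-level (weak) loss."""
    return strong_weight * strong_loss + weak_weight * weak_loss


# Illustrative loss values, not results from any experiment.
strong_loss, weak_loss = 0.8, 0.3

# Pre-training setting: only the weak (clip-level) term contributes.
pretraining = combined_loss(strong_loss, weak_loss, strong_weight=0.0, weak_weight=1.0)

# Fine-tuning setting: only the strong (region-level) term contributes.
fine_tuning = combined_loss(strong_loss, weak_loss, strong_weight=1.0, weak_weight=0.0)

print(pretraining, fine_tuning)
```

Under this reading, the command below sets the weak weight to zero, so only the temporally-aligned captions drive the update.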
python -m srv.train \
--no-clotho \
--tacos \
--data_path=data \
--strong_weight=1.0 \
--weak_weight=0.0 \
--test_on_audioset \
--test_on_audioset_full \
--load_ckpt_path=PATH_TO_PRETRAINING_CHECKPOINT.ckpt
If you use our dataset, please cite our WASPAA paper:
- TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining