TACOS: Temporally-Aligned Audio Captions for Audio-Language Pretraining

This repository contains implementation details of “TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining”.

TACOS is a dataset with strong captions, i.e., textual descriptions of acoustic events with their corresponding temporal onsets and offsets.
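As a minimal sketch of what a strong caption carries, the record below pairs a description with its onset and offset in seconds. The class and field names are hypothetical and chosen for illustration only; they are not the file format of the Zenodo release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrongCaption:
    """One temporally-aligned caption (hypothetical field names)."""
    text: str      # textual description of the acoustic event
    onset: float   # start of the described region, in seconds
    offset: float  # end of the described region, in seconds

    def __post_init__(self) -> None:
        # Reject regions that are empty, reversed, or start before the clip.
        if not 0.0 <= self.onset < self.offset:
            raise ValueError("expected 0 <= onset < offset")

    @property
    def duration(self) -> float:
        """Length of the described region, in seconds."""
        return self.offset - self.onset


caption = StrongCaption(text="a dog barks twice", onset=1.5, offset=3.0)
print(caption.text, caption.duration)
```

A weak (clip-level) caption would correspond to the same record without the onset and offset fields.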

The dataset can be downloaded via Zenodo: https://zenodo.org/records/15379789

Overview

Traditional audio-language models rely on global (clip-level) captions, which only occasionally provide a rough temporal position of acoustic events.

TACOS addresses this limitation by providing:

12,358 audio recordings annotated with
47,748 temporally-aligned captions linked to specific regions

The regions' onsets and offsets can be used to provide stronger supervision during text-audio pretraining.
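One way to turn region boundaries into frame-level supervision is to mark which audio frames each caption covers. The sketch below is an illustration under stated assumptions, not the training code of this repository: the function name, the frame-centre rule, and the example regions are all hypothetical.

```python
from typing import List, Tuple


def region_to_frame_mask(onset: float, offset: float,
                         clip_duration: float, num_frames: int) -> List[int]:
    """Return a 0/1 mask with 1 for every frame whose centre lies in [onset, offset)."""
    frame_len = clip_duration / num_frames
    mask = []
    for i in range(num_frames):
        centre = (i + 0.5) * frame_len  # centre time of frame i, in seconds
        mask.append(1 if onset <= centre < offset else 0)
    return mask


# Hypothetical example: a 10 s clip split into 10 frames of 1 s each.
regions: List[Tuple[str, float, float]] = [
    ("a door slams", 0.0, 2.0),
    ("people talking", 3.0, 8.0),
]
for text, onset, offset in regions:
    print(text, region_to_frame_mask(onset, offset, 10.0, 10))
```

Stacking one mask per caption gives a caption-by-frame target matrix that a frame-level text-audio similarity could be trained against.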

The following figure illustrates the difference between weak captions (left) and strong captions (right):

Quick Start to Run Experiments

Installation

Prerequisites

Linux (tested on Ubuntu 24.04)
conda, e.g., Miniconda3-latest-Linux-x86_64.sh

Clone this repository.

git clone git@github.com:OptimusPrimus/tacos.git

Create and activate a conda environment with Python 3.11:

conda create -n d25_t6 python=3.11
conda activate d25_t6

Install 7z (pick the option that matches your system):

# on Linux, via apt
sudo apt install p7zip-full
# on Linux, via conda
conda install -c conda-forge p7zip
# on Windows, via conda
conda install -c conda-forge 7zip

Install a PyTorch version that suits your system. For example:

# for CUDA >= 12.1 (check with nvidia-smi)
pip3 install torch torchvision torchaudio
# for CUDA 11.8
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# for other versions, see: https://pytorch.org/get-started/locally/
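After installing, a quick check along the following lines can confirm that PyTorch is importable and whether it sees a CUDA device. The helper name is hypothetical; it only uses `torch.__version__` and `torch.cuda.is_available()`.

```python
import importlib.util


def describe_torch_install() -> str:
    """Report whether PyTorch is importable and whether it sees a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch  # imported lazily so the check also runs without PyTorch
    device = "CUDA available" if torch.cuda.is_available() else "CPU only"
    return f"PyTorch {torch.__version__} ({device})"


print(describe_torch_install())
```

If the output reports "CPU only" on a machine with a GPU, the installed wheel probably does not match the local CUDA version.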

Install other dependencies:

pip3 install -r requirements.txt

If you have not used Weights and Biases for logging before, you can create a free account. On your machine, run wandb login and paste your API key into the command line when prompted.

Download the TACOS dataset. The dataset is available on Zenodo:

https://zenodo.org/records/15379789
Place it into a folder called data in the main directory of the repository.
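Before starting a run, it can help to confirm that the data folder exists and is not empty. The helper below is a hypothetical sketch; it makes no assumption about which files the Zenodo archive contains.

```python
from pathlib import Path
from typing import List


def list_data_dir(root: str = "data") -> List[str]:
    """Return the sorted entry names in the data folder; raise if it is missing or empty."""
    path = Path(root)
    if not path.is_dir():
        raise FileNotFoundError(f"expected a folder at {path.resolve()}")
    entries = sorted(p.name for p in path.iterdir())
    if not entries:
        raise FileNotFoundError(f"{path.resolve()} exists but is empty")
    return entries
```

Calling `list_data_dir("data")` from the repository's main directory lists what was extracted there.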

Download AudioSet Strong

AudioSet recordings are not publicly available due to licensing issues.
A download script will be provided upon request to [email protected].

Example Training Command

Pre-Training on Clotho

python -m src.train \
  --data_path=data \
  --strong_weight=0.0 \
  --weak_weight=1.0

Strong Fine-Tuning on TACOS

python -m src.train \
  --no-clotho \
  --tacos \
  --data_path=data \
  --strong_weight=1.0 \
  --weak_weight=0.0 \
  --test_on_audioset \
  --test_on_audioset_full \
  --load_ckpt_path=PATH_TO_PRETRAINING_CHECKPOINT.ckpt
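The two commands differ mainly in how --strong_weight and --weak_weight are set. Assuming these flags scale a region-level (strong) and a clip-level (weak) loss term that are then summed, which is an assumption about the training code rather than something stated here, the combination could be sketched as follows. The function name and the loss values are hypothetical.

```python
def combine_losses(weak_loss: float, strong_loss: float,
                   weak_weight: float, strong_weight: float) -> float:
    """Weighted sum of a clip-level (weak) and a region-level (strong) loss."""
    return weak_weight * weak_loss + strong_weight * strong_loss


# Pre-training setting: only the weak objective contributes.
pretrain = combine_losses(weak_loss=0.8, strong_loss=0.5,
                          weak_weight=1.0, strong_weight=0.0)
# Strong fine-tuning setting: only the strong objective contributes.
finetune = combine_losses(weak_loss=0.8, strong_loss=0.5,
                          weak_weight=0.0, strong_weight=1.0)
print(pretrain, finetune)
```

Under this reading, intermediate values for both weights would mix the two objectives.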

Citation

If you use our dataset, please cite our WASPAA paper:

TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
