## Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-Modal Retrieval (TNN-C-CCA)
This paper was published as a research article in ACM TOMM 2020: https://dl.acm.org/doi/abs/10.1145/3387164
arXiv link: https://arxiv.org/pdf/1908.03737.pdf
Donghuo Zeng, Yi Yu, Keizo Oyama. Deep triplet neural networks with Cluster-CCA for audio-visual cross-modal retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2020, 16(3), pp. 1-23.
@article{zeng2020deep,
title={Deep triplet neural networks with cluster-cca for audio-visual cross-modal retrieval},
author={Zeng, Donghuo and Yu, Yi and Oyama, Keizo},
journal={ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)},
volume={16},
number={3},
pages={1--23},
year={2020},
publisher={ACM New York, NY, USA}
}

@article{zeng2023learning,
title={Learning explicit and implicit dual common subspaces for audio-visual cross-modal retrieval},
author={Zeng, Donghuo and Wu, Jianming and Hattori, Gen and Xu, Rong and Yu, Yi},
journal={ACM Transactions on Multimedia Computing, Communications and Applications},
volume={19},
number={2s},
pages={1--23},
year={2023},
publisher={ACM New York, NY}
}

This paper addresses cross-modal retrieval between audio and (silent) visual data using one-to-one paired audio-visual datasets. The main work focuses on learning joint embeddings in a shared subspace for computing similarity across modalities, where new representations are generated to maximize the correlation between the audio and visual spaces. Our model is implemented in three steps:
# Feature Extraction: extract features from the raw data to reduce the dimensionality of the input by removing redundant information, so that the correlation between the audio and visual modalities can be learned in a common space more effectively and accurately. We use the pre-trained VGGish model to extract audio features and the pre-trained Inception V3 model to extract visual features; more details are given in the Pre-trained Models section below and in the original paper.
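As a rough illustration (not the exact extraction script used for the paper), the sketch below shows how frame-level visual features can be obtained with the pre-trained Keras Inception V3 model; the audio side is handled analogously by passing log-mel examples through the VGGish model from the TensorFlow AudioSet repository. The file paths and frame-sampling choices are hypothetical.

```python
# Hedged sketch: frame-level visual feature extraction with pre-trained
# Inception V3 (ImageNet weights). The VGGish audio features are extracted
# analogously with the AudioSet VGGish code; only the visual side is shown.
import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.preprocessing import image

# Global-average-pooled Inception V3 gives a 2048-d vector per frame.
visual_model = InceptionV3(weights='imagenet', include_top=False, pooling='avg')

def frame_feature(frame_path):
    """Return a 2048-d Inception V3 feature for one video frame."""
    img = image.load_img(frame_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return visual_model.predict(x)[0]

# Example: average the features over sampled frames of one clip (paths are hypothetical).
frame_paths = ['frames/clip0001/%03d.jpg' % i for i in range(10)]
video_feature = np.mean([frame_feature(p) for p in frame_paths], axis=0)
```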
# TNN-C-CCA Model: this model contains two parts, Cluster-CCA and a triplet neural network.
Cluster-CCA [1] is a supervised variant of Canonical Correlation Analysis (CCA). Unlike standard CCA, which assumes a pairwise correspondence between individual data points, Cluster-CCA partitions each set into multiple clusters or classes, and the class labels define the correspondences between the sets. We use this model because Cluster-CCA is able to learn discriminative low-dimensional representations that maximize the correlation between the audio and visual sets while segregating the different classes in the common space.
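A minimal way to see what Cluster-CCA does (a sketch, not the paper's implementation): expand the two sets into all within-class audio-visual pairs and then fit ordinary CCA on the expanded pairing, e.g. with scikit-learn. The array names and dimensions below are assumptions.

```python
# Hedged sketch of Cluster-CCA: pair every audio sample with every visual
# sample of the same class, then fit standard CCA on the expanded pairs.
import numpy as np
from sklearn.cross_decomposition import CCA

def cluster_cca(audio_feats, audio_labels, visual_feats, visual_labels, dim=10):
    """audio_feats: (Na, Da), visual_feats: (Nv, Dv); labels define the clusters."""
    A, V = [], []
    for c in np.unique(audio_labels):
        a_idx = np.where(audio_labels == c)[0]
        v_idx = np.where(visual_labels == c)[0]
        for i in a_idx:          # all within-class cross-modal pairs
            for j in v_idx:
                A.append(audio_feats[i])
                V.append(visual_feats[j])
    cca = CCA(n_components=dim)
    cca.fit(np.array(A), np.array(V))
    return cca                   # cca.transform(...) projects into the shared space
```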
The Triplet Neural Network [2] uses the triplet loss as its loss function to train a three-branch network. The triplet loss minimizes the distance between an anchor and a positive, both of which belong to the same class, and maximizes the distance between the anchor and a negative from a different class.
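For reference, a minimal triplet loss written with the Keras backend (the margin value and embedding layout here are illustrative, not necessarily the exact settings used in the paper):

```python
# Hedged sketch of the triplet loss used to train the three-branch network:
# pull the anchor towards the positive (same class) and push it away from
# the negative (different class) by at least a margin.
from keras import backend as K

def triplet_loss(anchor, positive, negative, margin=1.0):
    pos_dist = K.sum(K.square(anchor - positive), axis=-1)
    neg_dist = K.sum(K.square(anchor - negative), axis=-1)
    return K.mean(K.maximum(pos_dist - neg_dist + margin, 0.0))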
# Evaluation: we use MAP and PRC as metrics to evaluate our architecture. For a query in one modality, the system generates a ranked list of items in the other modality; the documents in the ranked list that share the query's class are regarded as relevant (correct).
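As a concrete reading of the MAP metric (a sketch; the evaluation script in this repository may differ in details): for each query, rank all items of the other modality by similarity, treat items with the same class label as relevant, and average the precision at each relevant rank.

```python
# Hedged sketch of MAP for label-based cross-modal retrieval:
# an item in the ranked list is relevant iff it shares the query's class.
import numpy as np

def mean_average_precision(similarity, query_labels, gallery_labels):
    """similarity: (Nq, Ng) matrix, e.g. cosine similarity between projected
    query embeddings (one modality) and gallery embeddings (the other)."""
    aps = []
    for q in range(similarity.shape[0]):
        order = np.argsort(-similarity[q])                  # best match first
        relevant = (gallery_labels[order] == query_labels[q])
        if relevant.sum() == 0:
            continue
        hits = np.cumsum(relevant)
        precision_at_k = hits / (np.arange(len(order)) + 1.0)
        aps.append((precision_at_k * relevant).sum() / relevant.sum())
    return float(np.mean(aps))
```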
Environment:
conda 4.8.5: https://docs.conda.io/projects/conda/en/latest/user-guide/install/download.html
python 3.5
keras 2.0.5
tensorflow 1.4.1 (pip install tensorflow==1.4.1 --ignore-installed)
theano 1.0.5
scipy 1.4.1
numpy 1.18.5
h5py 2.10.0
pip 20.3.4
sklearn
MV-10K dataset: https://drive.google.com/drive/u/1/folders/1-N1uQDkwvWEBJmzRexHUA0QrRFDuvKBu
5-folds: https://drive.google.com/drive/u/1/folders/1-N1uQDkwvWEBJmzRexHUA0QrRFDuvKBu
The MV-10K dataset includes audio-visual features extracted with the pre-trained models.
More details: http://research.google.com/youtube8m/download.html
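If you want to inspect the downloaded feature files with h5py (which is in the environment list above), a minimal sketch follows; the file name and dataset keys are hypothetical placeholders, so check the actual layout of the download first.

```python
# Hedged sketch: inspect a downloaded feature file with h5py.
# The file name and keys below are hypothetical placeholders.
import h5py

with h5py.File('mv10k_features.h5', 'r') as f:    # hypothetical file name
    print(list(f.keys()))                          # inspect the real layout first
    # audio = f['audio_features'][:]               # hypothetical keys
    # visual = f['visual_features'][:]
    # labels = f['labels'][:]
```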
VEGAS dataset:
The raw dataset from https://arxiv.org/abs/1712.01393 is no longer available; I did not keep a copy from my previous desktop.
If you want it, please contact me.
You can also download the audio and visual features we extracted from my Google Drive: https://drive.google.com/drive/folders/1ZtjV-sdCUpLhxd8-Ge4hnm83Rqlj5PU1?usp=drive_link
Please kindly cite my paper, thank you!
AVE dataset:
Original dataset homepage: https://sites.google.com/view/audiovisualresearch
Features extracted by us: https://drive.google.com/drive/folders/1ZtjV-sdCUpLhxd8-Ge4hnm83Rqlj5PU1?usp=drive_link
The pre-trained model is in the same Google Drive folder as above, named pretrain.zip.
ffmpeg:
ffmpeg is a tool for processing video and audio; more details: http://ffmpeg.org/. Here, I use it to extract the audio track from each video.
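A minimal sketch of that step, calling ffmpeg from Python (the input/output file names are placeholders; the 16 kHz mono WAV format matches what VGGish expects):

```python
# Hedged sketch: call ffmpeg from Python to strip the audio track from a
# video as a 16 kHz mono WAV file.
import subprocess

def extract_audio(video_path, wav_path):
    subprocess.check_call([
        'ffmpeg', '-y', '-i', video_path,
        '-vn',                    # drop the video stream
        '-acodec', 'pcm_s16le',   # uncompressed 16-bit PCM
        '-ar', '16000',           # 16 kHz sample rate
        '-ac', '1',               # mono
        wav_path,
    ])

extract_audio('clip0001.mp4', 'clip0001.wav')  # hypothetical file names
```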
If you have any questions, please email [email protected] (I am Zeng).
[1] Rasiwasia, Nikhil, et al. "Cluster canonical correlation analysis." Artificial Intelligence and Statistics (AISTATS). PMLR, 2014.
[2] Hermans, Alexander, Lucas Beyer, and Bastian Leibe. "In defense of the triplet loss for person re-identification." arXiv preprint arXiv:1703.07737 (2017).