DiariZen is a speaker diarization toolkit driven by AudioZen and Pyannote 3.1.
# create virtual python environment
conda create --name diarizen python=3.10
conda activate diarizen
# install diarizen
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt && pip install -e .
# install pyannote-audio
cd pyannote-audio && pip install -e ".[dev,testing]"
# install dscore
git submodule init
git submodule update
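After installation, it can help to confirm that the pinned versions from the commands above were actually picked up. The following is a minimal sketch, not part of DiariZen: the helper name `check_versions` is hypothetical, and it assumes the packages expose standard metadata under the names `torch`, `torchvision`, and `torchaudio`.

```python
from importlib import metadata

# Versions pinned by the conda install command above.
EXPECTED = {
    "torch": "2.1.1",
    "torchvision": "0.16.1",
    "torchaudio": "2.1.1",
}


def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_expected)}.

    Hypothetical helper for illustration only.
    """
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Some builds carry a local suffix such as "+cu121"; compare the public part only.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(EXPECTED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: expected {EXPECTED[name]}, found {installed} [{status}]")
```

Run it inside the activated `diarizen` environment; a `MISMATCH` line usually means a different environment is active or the install step was skipped.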
We use SDM (single distant microphone: the first channel from the first far-field microphone array) data from the public AMI, AISHELL-4, and AliMeeting corpora for model training and evaluation. Please download these datasets first. Our data partition is here.
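If a corpus ships multichannel array recordings, the first channel has to be extracted before use. The snippet below is a rough sketch of that step using only the Python standard library; `extract_first_channel` is a hypothetical helper, it assumes uncompressed PCM WAV input, and it is not the preprocessing script used in this repository.

```python
import wave


def extract_first_channel(src_path, dst_path):
    """Write channel 0 of a PCM WAV file to a new mono WAV file.

    Hypothetical helper; reads the whole file into memory, which is
    acceptable for a sketch but not ideal for very long recordings.
    """
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()  # bytes per sample
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # PCM frames are interleaved: [ch0, ch1, ..., chN-1] per time step.
    frame_size = n_channels * sample_width
    first = b"".join(
        frames[i:i + sample_width] for i in range(0, len(frames), frame_size)
    )

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(sample_width)
        dst.setframerate(frame_rate)
        dst.writeframes(first)
```

For example, `extract_first_channel("meeting_array.wav", "meeting_sdm.wav")`, where both file names are placeholders.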
- download WavLM Base+ model
- download ResNet34-LM model
- modify the path of used dataset and configuration file
cd recipes/diar_ssl && bash -i run_stage.sh
- our pre-trained checkpoints and the estimated rttm files can be found here. The local experimental path has been anonymized. To use the pre-trained models, please check the diar_ssl/run_stage.sh.
- in case you have trouble reproducing our experiments, we also provide the intermediate results of EN2002a, an AMI test recording, during inference for debugging.
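To inspect the estimated rttm files, a small parser is often enough. The sketch below assumes the standard ten-field `SPEAKER` line layout of the RTTM format; `Turn`, `parse_rttm`, and `speaking_time` are hypothetical names, and the sample lines in the usage note are made up for illustration, not taken from the released outputs.

```python
from typing import Dict, List, NamedTuple


class Turn(NamedTuple):
    file_id: str
    onset: float     # seconds
    duration: float  # seconds
    speaker: str


def parse_rttm(text: str) -> List[Turn]:
    """Parse SPEAKER lines of an RTTM document into Turn records."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip blank lines and other record types
        # Fields: type, file, channel, onset, duration, ortho, stype, name, conf, slat
        turns.append(Turn(fields[1], float(fields[3]), float(fields[4]), fields[7]))
    return turns


def speaking_time(turns: List[Turn]) -> Dict[str, float]:
    """Total speaking time per speaker label, in seconds."""
    totals: Dict[str, float] = {}
    for turn in turns:
        totals[turn.speaker] = totals.get(turn.speaker, 0.0) + turn.duration
    return totals
```

Usage, with illustrative content: `speaking_time(parse_rttm("SPEAKER EN2002a 1 0.50 1.25 <NA> <NA> spk0 <NA> <NA>"))` gives the total duration attributed to `spk0`.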
We aim to make the whole pipeline as simple as possible. Therefore, for the results below:
- we did not use any simulated data
- we did not apply advanced learning-rate scheduling strategies
- we did not perform further domain adaptation to each dataset
- all experiments share the same hyper-parameters for clustering
DER (%), collar=0s
--------------------------------------------------------------
System       Features        AMI    AISHELL-4   AliMeeting
--------------------------------------------------------------
Pyannote3    SincNet         21.1   13.9        22.8
Proposed     Fbank           19.7   12.5        21.0
             WavLM-frozen    17.0   11.7        19.9
             WavLM-updated   15.4   11.7        17.6
--------------------------------------------------------------
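To read the collar=0s table above in relative terms, the short sketch below compares the Pyannote3 (SincNet) row with the Proposed (WavLM-updated) row. The numbers are copied from the table and treated as error rates where lower is better; `relative_reduction` is a hypothetical helper.

```python
# Values copied from the collar=0s table above.
pyannote3_sincnet = {"AMI": 21.1, "AISHELL-4": 13.9, "AliMeeting": 22.8}
proposed_wavlm_updated = {"AMI": 15.4, "AISHELL-4": 11.7, "AliMeeting": 17.6}


def relative_reduction(baseline: float, system: float) -> float:
    """Relative reduction of `system` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - system) / baseline


reductions = {
    name: relative_reduction(pyannote3_sincnet[name], proposed_wavlm_updated[name])
    for name in pyannote3_sincnet
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative reduction")
# AMI: 27.0% relative reduction
# AISHELL-4: 15.8% relative reduction
# AliMeeting: 22.8% relative reduction
```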
DER (%), collar=0.25s
--------------------------------------------------------------
System       Features        AMI    AISHELL-4   AliMeeting
--------------------------------------------------------------
Pyannote3    SincNet         13.7   7.7         13.6
Proposed     Fbank           12.9   6.9         12.6
             WavLM-frozen    10.9   6.1         12.0
             WavLM-updated   9.8    5.9         10.2
--------------------------------------------------------------
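The two tables differ only in the scoring collar, that is, the region around each reference segment boundary that is excluded from scoring. The sketch below illustrates the idea, assuming the md-eval-style convention in which the collar extends on both sides of every boundary; `no_score_regions` is a hypothetical helper, and the exact convention of the scoring tool you use should be checked against its documentation.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def no_score_regions(reference: List[Interval], collar: float) -> List[Interval]:
    """Return merged intervals ignored during scoring for a given collar.

    Assumption: each boundary t of a reference segment produces the
    no-score interval [t - collar, t + collar].
    """
    if collar <= 0:
        return []  # collar=0s: every frame is scored

    boundaries = sorted(t for segment in reference for t in segment)
    merged: List[Interval] = []
    for t in boundaries:
        start, end = max(0.0, t - collar), t + collar
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With collar=0.25s, short pauses and boundary disagreements fall inside the excluded regions, which is one reason the numbers in the second table are lower than in the first.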
Note:
The results above differ from those in our ICASSP submission.
We updated a few experimental numbers, but the conclusions of our paper remain the same.
If you found this work helpful, please consider citing: J. Han, F. Landini, J. Rohdin, A. Silnova, M. Diez, and L. Burget, Leveraging Self-Supervised Learning for Speaker Diarization, arXiv preprint arXiv:2409.09408, 2024.
@article{han2024leveragingselfsupervisedlearningspeaker,
title={Leveraging Self-Supervised Learning for Speaker Diarization},
author={Jiangyu Han and Federico Landini and Johan Rohdin and Anna Silnova and Mireia Diez and Lukas Burget},
journal={arXiv preprint arXiv:2409.09408},
year={2024}
}
This repository is released under the MIT license.
If you have any comments or questions, please contact [email protected]