3D-Speaker is an open-source toolkit for single- and multi-modal speaker verification, speaker recognition, and speaker diarization. All pretrained models are accessible on ModelScope.
git clone https://github.com/alibaba-damo-academy/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt
# Speaker verification: CAM++ on VoxCeleb
cd egs/sv-cam++/voxceleb/
bash run.sh
# Self-supervised speaker verification: RDINO on VoxCeleb
cd egs/sv-rdino/voxceleb/
bash run.sh
All pretrained models are released on ModelScope.
# Install modelscope
pip install modelscope
# CAM++ trained on VoxCeleb
model_id=damo/speech_campplus_sv_en_voxceleb_16k
# CAM++ trained on 200k labeled speakers
model_id=damo/speech_campplus_sv_zh-cn_16k-common
# Run CAM++ inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
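For quick verification without the training recipes, the pretrained models can also be invoked through ModelScope's Python pipeline API. A minimal sketch, assuming the standard `modelscope` pipeline interface; the wav paths are placeholders:

```python
# Minimal sketch: speaker verification with the pretrained CAM++ model via
# the ModelScope pipeline API. The wav paths below are placeholders.
from modelscope.pipelines import pipeline

sv_pipeline = pipeline(
    task='speaker-verification',
    model='damo/speech_campplus_sv_zh-cn_16k-common',
)

# With two wavs, the pipeline scores whether they come from the same speaker.
result = sv_pipeline(['speaker1_a.wav', 'speaker1_b.wav'])
print(result)  # a dict containing a similarity score
```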
# RDINO trained on VoxCeleb
model_id=damo/speech_rdino_ecapa_tdnn_sv_en_voxceleb_16k
# Run RDINO inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
| Task | Dataset | Model | Performance |
|---|---|---|---|
| speaker verification | VoxCeleb | CAM++ | EER = 0.73% |
| self-supervised speaker verification | VoxCeleb | RDINO | EER = 3.24% |
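The EER (equal error rate) reported above is the operating point at which the false-acceptance and false-rejection rates coincide. A minimal sketch of computing it from raw trial scores, using hypothetical scores and labels rather than the toolkit's own evaluation script:

```python
# Minimal sketch: equal error rate (EER) from verification trial scores.
# Hypothetical inputs: `scores` are cosine similarities, `labels` are 1 for
# same-speaker trials and 0 for different-speaker trials.
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    order = np.argsort(scores)[::-1]          # sort trials by score, descending
    labels = labels[order]
    tp = np.cumsum(labels)                    # true accepts at each threshold
    fp = np.cumsum(1 - labels)                # false accepts at each threshold
    fnr = 1 - tp / labels.sum()               # false-rejection rate
    far = fp / (1 - labels).sum()             # false-acceptance rate
    idx = np.argmin(np.abs(fnr - far))        # threshold where the two cross
    return float((fnr[idx] + far[idx]) / 2)

scores = np.array([0.92, 0.81, 0.40, 0.35, 0.10])
labels = np.array([1, 1, 0, 1, 0])
print(f"EER = {compute_eer(scores, labels):.2%}")
```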
- [2023.4] RDINO training recipes on VoxCeleb released. RDINO is a self-supervised learning framework for speaker verification that aims to alleviate model collapse in non-contrastive methods. It contains a teacher network and a student network with identical architectures but different parameters. RDINO introduces two regularization terms, namely diversity regularization and redundancy-elimination regularization (see the schematic sketch after this list). RDINO achieves 3.05% EER and 0.220 MinDCF on VoxCeleb using single-stage self-supervised training.
- [2023.4] CAM++ pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.
- [2023.4] CAM++ training recipe on VoxCeleb released. CAM++ is a fast and efficient speaker embedding extractor based on a densely connected time-delay neural network (D-TDNN). It adopts a novel multi-granularity pooling method to perform context-aware masking (a simplified sketch follows this list). CAM++ achieves an EER of 0.73% on VoxCeleb and 6.78% on CN-Celeb, outperforming mainstream speaker embedding models such as ECAPA-TDNN and ResNet34, while having lower computational cost and faster inference speed.
- [2023.5] ERes2Net (Enhanced Res2Net) training framework released.
- [2023.5] ERes2Net pretrained model released, trained on over 100k labeled speakers.
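As referenced in the RDINO news item above, the two regularizers can be pictured as batch-level penalties on the student embeddings. A schematic PyTorch sketch under simplified, assumed formulations; the exact losses are defined in the RDINO paper:

```python
# Schematic sketch (simplified, assumed forms) of RDINO's two regularizers,
# applied to a batch of embeddings `emb` of shape (batch, dim).
import torch
import torch.nn.functional as F

def diversity_regularization(emb: torch.Tensor) -> torch.Tensor:
    """Discourage collapse across the batch: penalize high pairwise cosine
    similarity between different utterances' embeddings."""
    e = F.normalize(emb, dim=1)
    sim = e @ e.t()                                  # (batch, batch)
    off_diag = sim - torch.diag(torch.diag(sim))     # zero out self-similarity
    return off_diag.abs().mean()

def redundancy_elimination(emb: torch.Tensor) -> torch.Tensor:
    """Decorrelate embedding dimensions, Barlow Twins-style: push the
    off-diagonal entries of the feature correlation matrix toward zero."""
    e = (emb - emb.mean(0)) / (emb.std(0) + 1e-6)    # standardize per dimension
    corr = (e.t() @ e) / e.shape[0]                  # (dim, dim)
    off_diag = corr - torch.diag(torch.diag(corr))
    return (off_diag ** 2).sum()
```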
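Likewise, the context-aware masking in CAM++ can be viewed as pooling frame features at multiple granularities and predicting a soft gate from the pooled context. A simplified, assumed sketch; the actual module is part of the released D-TDNN recipe:

```python
# Simplified sketch (assumed) of context-aware masking: combine global and
# segment-level pooled context to predict a per-frame, per-channel soft mask.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareMask(nn.Module):
    def __init__(self, channels: int, segment: int = 50):
        super().__init__()
        self.segment = segment                  # frames per local segment
        self.bottleneck = nn.Sequential(
            nn.Conv1d(2 * channels, channels // 4, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),                       # mask values in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        b, c, t = x.shape
        g = x.mean(dim=2, keepdim=True).expand(b, c, t)      # global context
        pad = (-t) % self.segment                            # pad to a multiple
        xs = F.pad(x, (0, pad)).reshape(b, c, -1, self.segment)
        s = xs.mean(dim=3, keepdim=True).expand_as(xs)       # segment context
        s = s.reshape(b, c, -1)[:, :, :t]
        mask = self.bottleneck(torch.cat([g, s], dim=1))     # (b, c, t)
        return x * mask                                      # gate the frames
```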
3D-Speaker is released under the Apache License 2.0.
3D-Speaker contains third-party components and code modified from other open-source repositories.
If you have any comments or questions about 3D-Speaker, please contact us by
- email: [email protected], [email protected]
@inproceedings{rdino,
title={Pushing the limits of self-supervised speaker verification using regularized distillation framework},
author={Yafeng Chen and Siqi Zheng and Hui Wang and Luyao Cheng and Qian Chen},
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2023},
organization={IEEE}
}
@article{cam++,
title={CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking},
author={Hui Wang and Siqi Zheng and Yafeng Chen and Luyao Cheng and Qian Chen},
journal={arXiv preprint arXiv:2303.00332},
year={2023}
}