Add script to use speaker averaged xvectors in TTS training #5244
kan-bayashi merged 4 commits into espnet:master
Codecov Report
@@ Coverage Diff @@
## master #5244 +/- ##
==========================================
+ Coverage 74.43% 74.99% +0.55%
==========================================
Files 642 655 +13
Lines 57611 58553 +942
==========================================
+ Hits 42885 43909 +1024
+ Misses 14726 14644 -82
Flags with carried forward coverage won't be shown. 48 files have indirect coverage changes. |
|
Can you also add an example (config file?) to use this averaged xvector? |
|
The script modifies the xvector.scp file so that each entry references the corresponding location in spk_xvector.ark for the desired speaker(s). No further modifications are needed, because the modified xvector.scp is the file used during training. The original xvector.scp is backed up, so the changes can be reverted manually. |
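To make the remapping concrete, here is a minimal sketch of the idea, not the script added in this PR. It assumes Kaldi-style text files: `utt2spk` with `<utt-id> <spk-id>` lines, a speaker-level scp with `<spk-id> <ark-location>` lines, and an utterance-level `xvector.scp` with `<utt-id> <ark-location>` lines whose locations contain no spaces. The function name `remap_to_spk_xvectors` and the `.bak` backup suffix are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the PR's script): point every utterance entry in
# xvector.scp at its speaker's averaged x-vector.
# Usage: remap_to_spk_xvectors <utt2spk> <spk_xvector.scp> <xvector.scp>
remap_to_spk_xvectors() {
    local utt2spk=$1 spk_scp=$2 utt_scp=$3

    # Keep a backup so the change can be reverted by hand (suffix is an assumption).
    cp "${utt_scp}" "${utt_scp}.bak"

    # File 1: speaker -> ark location. File 2: utterance -> speaker.
    # File 3: the original xvector.scp, which fixes the utterance set and order.
    awk '
        FILENAME == ARGV[1] { loc[$1] = $2; next }
        FILENAME == ARGV[2] { spk[$1] = $2; next }
        ($1 in spk) && (spk[$1] in loc) { print $1, loc[spk[$1]]; next }
        { print }  # no speaker-level entry found: keep the original line
    ' "${spk_scp}" "${utt2spk}" "${utt_scp}.bak" > "${utt_scp}"
}
```

With this sketch, reverting amounts to copying `xvector.scp.bak` back over `xvector.scp`.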
|
OK, where will it be used then? |
|
It is applied after xvector extraction (stage 2) and before model training (stage 6). I have trained models on averaged xvectors for Icelandic (the talromur and talromur2 datasets), but I have not completed a proper evaluation. |
Very cool!
Could you add a brief description of your new function here?
https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1#multi-speaker-model-with-x-vector-training
(e.g., example command to replace xvector with spk-xvector)
|
I solved this issue in a different way, by modifying only the tts.sh file (espnet/egs2/TEMPLATE/tts1/tts.sh, lines 402 to 417 in abd3aa7) and adding a flag, use_ave_xvector:

    # Assume that others toolkits are python-based
    log "Stage 2+: Extract X-vector: data/ -> ${dumpdir}/xvector using python toolkits"
    if "${use_ave_xvector}"; then
        use_dsets=""
        for dset in "${train_set}" "${valid_set}" ${test_sets}; do
            if [ "${dset}" = "${train_set}" ] || [ "${dset}" = "${valid_set}" ]; then
                _suf="/org"
            else
                _suf=""
            fi
            use_dsets+=" ${data_feats}${_suf}/${dset}"
        done
        utils/combine_data.sh ${data_feats}/allsplits ${use_dsets}
        pyscripts/utils/extract_xvectors.py \
            --pretrained_model ${xvector_model} \
            --toolkit ${xvector_tool} \
            ${data_feats}/allsplits \
            ${dumpdir}/xvector/averaged
        for dset in "${train_set}" "${valid_set}" ${test_sets}; do
            mkdir -p ${dumpdir}/xvector/${dset}
            if [ "${dset}" = "${train_set}" ] || [ "${dset}" = "${valid_set}" ]; then
                _suf="/org"
            else
                _suf=""
            fi
            <"${dumpdir}/xvector/averaged/ave_xvector.scp" \
                utils/filter_scp.pl "${data_feats}${_suf}/${dset}/wav.scp" \
                >"${dumpdir}/xvector/${dset}/xvector.scp"
        done
    else
        for dset in "${train_set}" "${valid_set}" ${test_sets}; do
            if [ "${dset}" = "${train_set}" ] || [ "${dset}" = "${valid_set}" ]; then
                _suf="/org"
            else
                _suf=""
            fi
            pyscripts/utils/extract_xvectors.py \
                --pretrained_model ${xvector_model} \
                --toolkit ${xvector_tool} \
                ${data_feats}${_suf}/${dset} \
                ${dumpdir}/xvector/${dset}
        done
    fi

Of course, I implemented this locally and only for the Python-based toolkits, because I do not use Kaldi for xvector extraction. |
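The per-split step in the script above relies on `utils/filter_scp.pl`, which keeps only the lines of the scp on stdin whose first field appears in the ID list given as its argument. As a rough illustration of that behaviour, here is a sketch assuming both files are keyed by the same IDs in their first column; `filter_by_ids` is a hypothetical stand-in, not the Kaldi utility.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `utils/filter_scp.pl <id_list>` reading the scp from stdin.
# Usage: filter_by_ids <id_list> < in.scp > out.scp
filter_by_ids() {
    local id_list=$1
    awk -v ids="${id_list}" '
        BEGIN {
            # Collect the first field of every non-empty line in the ID list.
            while ((getline line < ids) > 0) {
                if (split(line, f, " ") > 0) keep[f[1]] = 1
            }
        }
        ($1 in keep)  # print stdin lines whose key is in the list, in stdin order
    '
}
```

Because the combined extraction is run once and then filtered, each split's xvector.scp ends up containing only that split's entries.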
|
You may find some samples at: https://1drv.ms/f/s!AliZ3I0uDW8HgTxDtaGlsH8FmYcA. |
Using speaker averaged xvectors in TTS training may generalise better to inference tasks, where the utterance-specific xvector is unknown.
I added a small script to modify xvector.scp to refer to spk_xvector.ark entries instead of utterance-specific ones. It works well for my task so I figured it may be of use for others.
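For readers new to the idea, speaker averaging itself can be sketched as follows. This is only an illustration under simplifying assumptions: vectors are stored as plain text lines of the form `<utt-id> v1 v2 ...` (not the Kaldi ark format), all vectors have the same dimension, and `utt2spk` is non-empty. The function name `average_by_speaker` is hypothetical and the PR does not compute the averages this way.

```shell
#!/usr/bin/env bash
# Hypothetical illustration: average plain-text utterance vectors per speaker.
# Usage: average_by_speaker <utt2spk> <vectors.txt>
average_by_speaker() {
    local utt2spk=$1 vectors=$2
    awk '
        FILENAME == ARGV[1] { spk[$1] = $2; next }   # utterance -> speaker
        ($1 in spk) {
            s = spk[$1]
            n[s]++                                   # utterances seen for this speaker
            dim[s] = NF - 1                          # vector dimension (assumed constant)
            for (i = 2; i <= NF; i++) sum[s, i] += $i
        }
        END {
            for (s in n) {
                line = s
                for (i = 2; i <= dim[s] + 1; i++)
                    line = line sprintf(" %.4f", sum[s, i] / n[s])
                print line
            }
        }
    ' "${utt2spk}" "${vectors}" | sort               # awk iteration order is unspecified
}
```

Every utterance of a speaker then shares that one averaged vector, so the embedding seen in training no longer depends on the individual utterance.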