Speaker embedding extractor (with ESPnet pre-trained speaker model)#5579

Merged
sw005320 merged 47 commits into espnet:master from ftshijt:spk_inference on Jan 10, 2024

Conversation

@ftshijt
Collaborator

@ftshijt ftshijt commented Dec 5, 2023

What?

  • Add an ESPnet speaker embedding extractor (inference script)
  • Add an ESPnet speaker embedding extractor for TTS
  • Separate the speaker-embedding extraction stage from the speaker-ID conversion stage in TTS
    • This adds flexibility (e.g., a different speaker embedding can be used after the waveforms are formatted)
  • Rename xvector to spk_embed, as suggested by @Jungjee

TODO

  • replace the current spk.sh template with the new inference
  • upload a pre-trained VCTK model that uses the ESPnet pre-trained speaker model
  • add a test function for spk_inference.py

@ftshijt ftshijt requested review from Fhrozen and kan-bayashi and removed request for Fhrozen December 5, 2023 10:13
@ftshijt ftshijt added the Documentation, TTS (Text-to-speech), and SID (Speaker identification/embedding) labels and removed the README label Dec 5, 2023
@ftshijt ftshijt added this to the v.202312 milestone Dec 5, 2023
@mergify mergify bot added the README label Dec 5, 2023
@ftshijt
Collaborator Author

ftshijt commented Dec 5, 2023

@Jungjee Please feel free to check the implementation.

@codecov

codecov bot commented Dec 5, 2023

Codecov Report

Attention: 29 lines in your changes are missing coverage. Please review.

Comparison is base (4771515) 76.53% compared to head (d0740d1) 76.49%.
Report is 2 commits behind head on master.

Files                         Patch %  Lines
espnet2/bin/spk_inference.py  51.66%   29 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5579      +/-   ##
==========================================
- Coverage   76.53%   76.49%   -0.04%     
==========================================
  Files         720      720              
  Lines       66639    66607      -32     
==========================================
- Hits        51001    50951      -50     
- Misses      15638    15656      +18     
Flag                        Coverage Δ
test_configuration_espnet2  ∅ <ø> (∅)
test_integration_espnet1    62.92% <ø> (+0.14%) ⬆️
test_integration_espnet2    49.47% <100.00%> (-0.63%) ⬇️
test_python_espnet1         19.09% <0.00%> (+<0.01%) ⬆️
test_python_espnet2         52.55% <53.22%> (+0.15%) ⬆️
test_utils                  22.15% <ø> (ø)

Contributor

@Jungjee Jungjee left a comment

Thanks for your effort @ftshijt !! I left some comments.
Mostly looks good to me.

@sw005320
Contributor

sw005320 commented Dec 6, 2023

@Jungjee Hi, I applied the name change and verified most of the process (though I still need to double-check some previous TTS checkpoints). I believe it is ready for review.

Btw, for an example usage of the API:

import numpy as np

from espnet2.bin.spk_inference import Speech2SpkEmbedding

# Load a pre-trained model from Hugging Face
speech2spk_embed = Speech2SpkEmbedding.from_pretrained(model_tag="espnet/voxcelebs12_rawnet3")
# Dummy input: an all-zero waveform of 16500 samples
speech2spk_embed(np.zeros(16500))

# Load a local checkpoint
speech2spk_embed = Speech2SpkEmbedding(model_file="model.pth", train_config="config.yaml")
# Dummy input: an all-zero waveform of 32000 samples
speech2spk_embed(np.zeros(32000))
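
A common next step after extracting two embeddings is to compare them. The sketch below is not part of this PR: it is a minimal, standard-library cosine similarity, and it assumes each embedding returned by the extractor has already been flattened into a 1-D sequence of floats. The function name cosine_similarity and the toy vectors are illustrative only.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length 1-D sequences of floats."""
    a = [float(x) for x in a]
    b = [float(x) for x in b]
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors
    return dot / max(norm_a * norm_b, eps)


# Toy stand-ins for real embeddings (hypothetical values, not model output)
emb_same_1 = [0.9, 0.1, 0.0]
emb_same_2 = [0.8, 0.2, 0.1]
emb_other = [0.0, 0.1, 0.9]

score_same = cosine_similarity(emb_same_1, emb_same_2)
score_other = cosine_similarity(emb_same_1, emb_other)
print(score_same > score_other)
```

In practice the score would be thresholded or compared across candidate speakers; the right threshold depends on the model and data and is not given in this thread.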

One naming-level comment.
How about changing the class name from Speech2SpkEmbedding to Speech2Embedding?
We may also provide possible other embedding vectors (e.g., lang or whatever) with the same API name.
Speech2Text is based on this policy (it would be ASR or OWSM S2T).

@Jungjee
Contributor

Jungjee commented Dec 6, 2023

One naming-level comment.
How about changing the class name from Speech2SpkEmbedding to Speech2Embedding?
We may also provide possible other embedding vectors (e.g., lang or whatever) with the same API name.
Speech2Text is based on this policy (it would be ASR or OWSM S2T).

I see, I didn't think about that.
I think it's a good suggestion! @ftshijt, sorry, let's go with your first choice!

@sw005320
Contributor

sw005320 commented Jan 4, 2024

LGTM.
Is it ready for merge?

The TODO item "replace the current spk.sh template with the new inference" is not checked yet.

@ftshijt
Collaborator Author

ftshijt commented Jan 4, 2024

LGTM. Is it ready for merge?

The TODO item "replace the current spk.sh template with the new inference" is not checked yet.

Sorry, it is not done yet. Recently I have mostly focused on checking the TTS performance (which is good). I will get back to that later this week.

@sw005320
Contributor

sw005320 commented Jan 4, 2024

Sounds good.
Please ping me if you finish it.

@Jungjee
Contributor

Jungjee commented Jan 6, 2024

Sorry, it is not done yet. Recently I have mostly focused on checking the TTS performance (which is good). I will get back to that later this week.

FYI, I think replacing the existing stage 6 with this extraction can be done in another PR, since it can affect the speed of the current model inference and also needs several tests.
(I'm a bit worried about losing the easy multi-GPU extraction that I currently have. Keeping that speed while also using the new HF-based extraction would, I think, require fixing quite a lot of code.)

Also (maybe not a good reason), several users are already trying to use the models we uploaded, e.g., @Emrys365 for the SE challenge and @underdogliu for ASVspoof5, which is another reason for me to split the PR.
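
For context on the multi-GPU concern above: parallel extraction generally relies on splitting the utterance list so that each worker handles its own share. The sketch below shows one generic way to do that, round-robin sharding. It is not the existing stage 6 implementation, and the names shard_round_robin and utt_ids are hypothetical.

```python
def shard_round_robin(items, num_shards):
    """Split a list into num_shards lists, assigning items in round-robin order."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Shard i takes items i, i + num_shards, i + 2 * num_shards, ...
    return [items[i::num_shards] for i in range(num_shards)]


# Hypothetical utterance IDs; each shard would go to one GPU worker
utt_ids = ["utt1", "utt2", "utt3", "utt4", "utt5"]
shards = shard_round_robin(utt_ids, 2)
for worker_index, shard in enumerate(shards):
    print(worker_index, shard)
```

Round-robin keeps shard sizes within one item of each other, but it does not balance by audio duration; a duration-aware split may be preferable when utterance lengths vary widely.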

@sw005320
Contributor

@ftshijt
Given the discussion with @Jungjee, it would be good to split off the "replace the current spk.sh template with the new inference" item into a separate PR.
So, I just merged this PR.
Thanks for your great PR, and please continue the work in the other PR!

@sw005320 sw005320 merged commit 3b2e0d3 into espnet:master Jan 10, 2024
G-Thor added a commit to G-Thor/espnet that referenced this pull request Mar 21, 2024
After an extra stage was added to tts.sh in espnet#5579 , following stage numbers were updated. A few were missed in the update and this PR covers those that remained.
@ftshijt ftshijt deleted the spk_inference branch May 19, 2025 07:53

Labels

Documentation, ESPnet2, README, SID (Speaker identification/embedding), TTS (Text-to-speech)

4 participants