
Terry/parallelize spk emb extraction#6227

Merged
sw005320 merged 12 commits into espnet:master from ZhuoyanTao:terry/parallelize-spk-emb-extraction
Sep 12, 2025

Conversation

@ZhuoyanTao
Contributor

What did you change?


Parallelize speaker embedding extraction for TTS recipes

Why did you make this change?


Speaker embedding extraction used to run sequentially and took 40+ hours for datasets with many speakers.

Is your PR small enough?


yes

Additional Context

The new script extracts speaker embeddings in parallel using the specified toolkit; it covers argument parsing, model loading, and batch processing of audio files. A companion utility sorts the standard Kaldi data files and aligns the embedding SCP files in the provided directories, and the conditional check for use_spk_embed is fixed.
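The parallel-extraction idea can be sketched as below. This is a minimal illustration, not the PR's actual code: `extract_embedding` is a hypothetical stand-in for the toolkit-specific extractor, and the thread pool simply fans utterances out across workers.

```python
# Sketch of parallel speaker-embedding extraction.
# `extract_embedding` is a placeholder; a real extractor would run a
# speaker model (e.g. RawNet3 or a SpeechBrain model) on the waveform.
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def extract_embedding(wav, sr):
    # Placeholder embedding: the mean sample value, shape (1,).
    return np.asarray(wav, dtype=np.float32).mean(keepdims=True)


def extract_all(wav_scp, num_workers=8):
    """Extract embeddings for a {utt_id: (sr, wav)} mapping in parallel."""
    def work(item):
        utt_id, (sr, wav) = item
        return utt_id, extract_embedding(wav, sr)

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map preserves input order, so results pair up with utterance IDs.
        return dict(pool.map(work, wav_scp.items()))


embs = extract_all({"utt1": (16000, [0.1, 0.2]), "utt2": (16000, [0.3, 0.4])})
```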
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. SID Speaker identification/embedding TTS Text-to-speech labels Aug 28, 2025
@mergify mergify bot added the ESPnet2 label Aug 28, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces parallel processing for speaker embedding extraction, which is a great improvement for performance. The changes include a new Python script for parallel extraction, a utility script for sorting embedding files, and an update to the TTS recipe to use it.

My review focuses on ensuring the correctness and efficiency of the new parallel implementation. I've identified a critical issue with incorrect arguments being passed to the new sorting script in tts.sh. Additionally, in the parallel extraction script, there's a significant performance issue with the rawnet implementation which is not properly batched, and a race condition in the audio loading logic. Addressing these points will make the parallel implementation more robust and efficient.

Comment on lines +248 to +256
elif self.toolkit == "rawnet":
    outs = []
    for w in wav_list:
        if torch.is_tensor(w):
            w = w.detach().cpu().numpy()
        outs.append(
            self._rawnet_extract_embd(np.asarray(w, dtype=np.float32))
        )
    return np.asarray(outs, dtype=np.float32)
Contributor

high

The current implementation for rawnet in extract_batch processes utterances sequentially within a loop. This underutilizes the GPU and negates the benefits of batching, creating a performance bottleneck. A true batched implementation should be used, where segments from all utterances in wav_list are collected and processed in a single forward pass on the GPU.

            elif self.toolkit == "rawnet":
                n_samples = 48000
                n_segments = 10
                all_segments = []
                utterance_segment_counts = []

                for w in wav_list:
                    if torch.is_tensor(w):
                        w = w.detach().cpu().numpy()
                    audio = np.asarray(w, dtype=np.float32)

                    if len(audio.shape) > 1:
                        raise ValueError(
                            "RawNet3 supports mono input only."
                            f"Input data has a shape of {audio.shape}."
                        )
                    if len(audio) < n_samples:
                        shortage = n_samples - len(audio) + 1
                        audio = np.pad(audio, (0, shortage), "wrap")

                    startframe = np.linspace(0, len(audio) - n_samples, num=n_segments)
                    segments = [audio[int(asf) : int(asf) + n_samples] for asf in startframe]
                    all_segments.extend(segments)
                    utterance_segment_counts.append(len(segments))

                if not all_segments:
                    return np.array([], dtype=np.float32)

                stacked_segments = torch.from_numpy(
                    np.stack(all_segments, axis=0).astype(np.float32)
                ).to(self.device)

                with torch.no_grad():
                    output = self.model(stacked_segments)

                output_np = output.detach().cpu().numpy()

                embs = []
                start_idx = 0
                for count in utterance_segment_counts:
                    end_idx = start_idx + count
                    utterance_embs = output_np[start_idx:end_idx]
                    embs.append(utterance_embs.mean(axis=0))
                    start_idx = end_idx

                return np.asarray(embs, dtype=np.float32)

Contributor Author

This is because RawNet does not accept a batch as a single input. I tried the suggested approach during development and hit errors. Please let me know if my understanding is incorrect or if there is a better approach.

Collaborator

Thanks for clarification. I think the current solution is fine.

ZhuoyanTao and others added 2 commits August 28, 2025 00:22
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Added a lock for thread-safe access to resamplers.
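The thread-safety fix mentioned in this commit can be sketched as a lock-guarded lazy cache: each (input rate, output rate) pair gets its resampler built exactly once, even when multiple worker threads request it concurrently. `make_resampler` below is a hypothetical stand-in (a real implementation might use `torchaudio.transforms.Resample`).

```python
# Sketch of thread-safe lazy caching of resamplers, assuming a
# hypothetical factory `make_resampler(in_sr, out_sr)`.
import threading


def make_resampler(in_sr, out_sr):
    # Placeholder: return a callable mapping in_sr audio to out_sr audio.
    return lambda wav: wav  # identity, for illustration only


_resamplers = {}
_resampler_lock = threading.Lock()


def get_resampler(in_sr, out_sr):
    """Return a cached resampler, creating it at most once per rate pair."""
    key = (in_sr, out_sr)
    with _resampler_lock:
        # Check-and-insert happens under the lock, so two threads can
        # never construct two resamplers for the same key.
        if key not in _resamplers:
            _resamplers[key] = make_resampler(in_sr, out_sr)
        return _resamplers[key]
```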
@codecov

codecov bot commented Aug 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 55.82%. Comparing base (412aa11) to head (3168ebe).
⚠️ Report is 548 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6227       +/-   ##
===========================================
+ Coverage   20.63%   55.82%   +35.19%     
===========================================
  Files          95      889      +794     
  Lines       10347    84275    +73928     
===========================================
+ Hits         2135    47049    +44914     
- Misses       8212    37226    +29014     
Flag Coverage Δ
test_integration_espnet2 46.15% <ø> (?)
test_integration_espnetez 36.94% <ø> (?)
test_python_espnet2 50.53% <ø> (?)
test_python_espnetez 12.82% <ø> (?)
test_utils 18.77% <ø> (-1.87%) ⬇️

Flags with carried forward coverage won't be shown.


@Fhrozen
Member

Fhrozen commented Aug 28, 2025

This pull request introduces a new utility script to ensure that speaker embedding SCP files are sorted and aligned with the main data files, and integrates this sorting step into the TTS data preparation pipeline. This helps prevent mismatches between speaker embeddings and utterance IDs during training and evaluation.

Speaker embedding sorting and integration:

  • Added a new script scripts/utils/sort_spk_embed_scp.sh that sorts all standard Kaldi data files and aligns speaker embedding SCP files with the corresponding text files, including sanity checks for key order.
  • Integrated the new sorting script into the TTS data preparation pipeline (tts.sh), so that if speaker embeddings are used, the script is called automatically to fix the order of speaker-embed SCP files to match the text.
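The alignment step described above can be sketched in Python (the actual PR does this in shell, in `sort_spk_embed_scp.sh`): reorder an embedding SCP so its keys follow the key order of the text file, with a sanity check for missing utterances.

```python
# Sketch of the SCP alignment logic: reorder embedding entries to match
# the utterance order of the Kaldi `text` file. Illustration only.

def read_scp(lines):
    """Parse 'key value' lines into a dict keyed by utterance ID."""
    entries = {}
    for line in lines:
        key, value = line.strip().split(maxsplit=1)
        entries[key] = value
    return entries


def align_scp(text_lines, embed_lines):
    """Return embedding SCP lines reordered to match the text key order."""
    text_keys = [line.split(maxsplit=1)[0] for line in text_lines]
    embeds = read_scp(embed_lines)
    # Sanity check: every utterance in `text` must have an embedding.
    missing = [k for k in text_keys if k not in embeds]
    if missing:
        raise KeyError(f"utterances without embeddings: {missing}")
    return [f"{k} {embeds[k]}" for k in text_keys]
```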

@Fhrozen Fhrozen added this to the v.202509 milestone Aug 28, 2025
Comment on lines +304 to +316
# for speaker in tqdm(spk2utt):
# spk_embeddings = list()
# for utt in spk2utt[speaker]:
# in_sr, wav = wav_scp[utt]
# # Speaker Embedding
# embeds = spk_embed_extractor(wav, in_sr)
# writer_utt[utt] = np.squeeze(embeds)
# spk_embeddings.append(embeds)

# # Speaker Normalization
# embeds = np.mean(np.stack(spk_embeddings, 0), 0)
# writer_spk[speaker] = embeds
# Build flat work list (speaker, utt)
Contributor

What is your intention in keeping these commented-out lines?
If they are needed, please explain; if not, please delete them.

@sw005320
Contributor

Can you make a test script?

@ZhuoyanTao
Contributor Author

ZhuoyanTao commented Aug 28, 2025 via email

Collaborator

@ftshijt ftshijt left a comment

Thanks @ZhuoyanTao for supporting the parallel version!


@@ -0,0 +1,401 @@
#!/usr/bin/env python3
Collaborator

Could you please also update tts.sh correspondingly to use the parallel version (or let the user select which version to use)?

Removed commented-out code for speaker embeddings processing.
@sw005320
Contributor

sw005320 commented Sep 9, 2025

Any update?

@ZhuoyanTao
Contributor Author

ZhuoyanTao commented Sep 9, 2025 via email

This script runs sequential and parallel speaker embedding extraction, creates synthetic audio data, and compares the outputs.
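The comparison at the heart of such a test can be sketched as follows: run both extraction paths on the same synthetic data and check the per-utterance embeddings agree numerically. `embeddings_match` is an illustrative helper, not the PR's actual test code.

```python
# Sketch of comparing sequential vs. parallel extraction outputs,
# as dicts mapping utterance ID -> embedding array.
import numpy as np


def embeddings_match(seq_embs, par_embs, atol=1e-5):
    """Return True if both dicts have the same keys and close values."""
    if set(seq_embs) != set(par_embs):
        return False
    return all(
        np.allclose(seq_embs[u], par_embs[u], atol=atol) for u in seq_embs
    )
```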
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Sep 9, 2025
pre-commit-ci bot and others added 3 commits September 9, 2025 20:52
@ZhuoyanTao
Contributor Author

Any update?

I have added a .sh test script, which tests on a dummy dataset. The script passes with no errors. Please let me know if any further modifications are necessary. Thanks!

@sw005320
Contributor

@ftshijt, can you check his update?
It looks good to me.

Collaborator

@ftshijt ftshijt left a comment

Many thanks for your update! The overall design is great and the details are good, except for the minor comment below. We can merge once it is resolved.

Comment on lines +532 to +535
if "${use_spk_embed}"; then
log "Fixing order of speaker-embed scp to match text"
scripts/utils/sort_spk_embed_scp.sh "${dumpdir}" "${spk_embed_tag}"
fi
Collaborator

Is there a specific reason to separate the "fix order" step and the extraction into different stages? It might be better to include them together in a single stage (to avoid issues for people who choose not to run stage 4).

…ence of spk emb files to avoid confusing bugs

I added the use_spk_embed check at the end instead of inside each case during extraction, due to the multiple branches of logic in stage 3.
@Fhrozen Fhrozen modified the milestones: v.202509, v.202512 Sep 12, 2025
@sw005320 sw005320 merged commit 7e7200c into espnet:master Sep 12, 2025
32 of 33 checks passed
@Fhrozen Fhrozen modified the milestones: v.202512, v.202511 Nov 14, 2025

Labels

ESPnet2 SID Speaker identification/embedding size:XL This PR changes 500-999 lines, ignoring generated files. TTS Text-to-speech

4 participants