codec fix: DDP logic and dead code revival logic #6284
Conversation
Code Review
This pull request introduces significant and necessary fixes for DDP training and the dead code revival logic within the EuclideanCodebook. The implementation of all_reduce to synchronize statistics across distributed workers is a crucial correction for DDP training. The fix for the EMA update logic, which now correctly resets state for revived codebook vectors, resolves a critical bug that prevented the codebook from recovering. The overall changes are well-reasoned and address the described issues effectively. I have two suggestions rated as 'high' severity to improve numerical stability and code simplicity in the new distributed logic.
```python
return [size if i == dim else s for i, s in enumerate(shape)]
# ...
means = sample_vectors(samples, num_clusters)
# ...
def sample_multinomial(total_count, probs):
    device = probs.device
    probs = probs.cpu()

    total_count = probs.new_full((), total_count)
    remainder = probs.new_ones(())
    sample = torch.empty_like(probs, dtype=torch.long)

    num_probs = len(probs)

    for i, prob in enumerate(probs):
        is_last = i == (num_probs - 1)

        s = (
            torch.binomial(total_count, prob / remainder)
            if not is_last
            else total_count
```
The current implementation of sample_multinomial is functionally correct but overly complex and potentially fragile due to manual floating-point arithmetic for re-normalizing probabilities (prob / remainder). A much simpler, more robust, and idiomatic way to achieve this is by using torch.distributions.Multinomial. This will also avoid the unnecessary device transfer to CPU and back.
Suggested change:

```diff
 return [size if i == dim else s for i, s in enumerate(shape)]
 means = sample_vectors(samples, num_clusters)
-def sample_multinomial(total_count, probs):
-    device = probs.device
-    probs = probs.cpu()
-    total_count = probs.new_full((), total_count)
-    remainder = probs.new_ones(())
-    sample = torch.empty_like(probs, dtype=torch.long)
-    num_probs = len(probs)
-    for i, prob in enumerate(probs):
-        is_last = i == (num_probs - 1)
-        s = (
-            torch.binomial(total_count, prob / remainder)
-            if not is_last
-            else total_count
+from torch.distributions import Multinomial
+dist = Multinomial(total_count, probs=probs)
+return dist.sample().to(dtype=torch.long)
```
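For reference, a minimal standalone sketch (assuming PyTorch is installed) of the suggested approach: `torch.distributions.Multinomial` stays on the tensor's original device and draws integer counts that sum to `total_count`, with no `.cpu()` round-trip. The input tensor here is illustrative.

```python
import torch
from torch.distributions import Multinomial

# Hypothetical stand-in for the codebook's cluster probabilities.
probs = torch.tensor([0.1, 0.2, 0.3, 0.4])
total_count = 100

# One multinomial draw; counts come back as floats, so cast to long
# just as the suggested replacement does.
sample = Multinomial(total_count, probs=probs).sample().to(dtype=torch.long)
```

`sample` has the same shape as `probs`, and its entries always sum to `total_count`.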
```python
buckets = torch.argmax(dists, dim=-1)
bins = torch.bincount(buckets, minlength=num_clusters)
```
The current order of operations in the k-means update step (division then all_reduce) is mathematically correct but can be numerically unstable. Performing all_reduce on the summed vectors before dividing by the counts is generally more robust as it avoids potential precision loss from intermediate divisions, especially when cluster counts (bins) are large.
Suggested change:

```diff
 buckets = torch.argmax(dists, dim=-1)
 bins = torch.bincount(buckets, minlength=num_clusters)
+all_reduce_fn(new_means)
+new_means = new_means / rearrange(bins_min_clamped, '... -> ... 1')
```
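To make the ordering concrete, here is a hedged single-process sketch of the reduce-before-divide pattern: per-cluster sums and counts are accumulated first, reduced across workers, and divided exactly once at the end. The helper name `kmeans_update` is hypothetical, and `all_reduce_fn` defaults to a no-op stand-in for `dist.all_reduce`.

```python
import torch

def kmeans_update(samples, means, all_reduce_fn=lambda t: t):
    # Assign each sample to its nearest mean.
    buckets = torch.cdist(samples, means).argmin(dim=-1)
    num_clusters = means.shape[0]
    bins = torch.bincount(buckets, minlength=num_clusters)

    # Accumulate per-cluster sums first ...
    new_means = torch.zeros_like(means)
    new_means.index_add_(0, buckets, samples)

    # ... reduce the sums and the counts across workers (no-op here) ...
    all_reduce_fn(new_means)
    all_reduce_fn(bins)

    # ... and divide exactly once, after the reduction.
    bins_min_clamped = bins.clamp(min=1)
    return new_means / bins_min_clamped.unsqueeze(-1)

samples = torch.tensor([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
means = torch.tensor([[1.0, 1.0], [9.0, 9.0]])
updated = kmeans_update(samples, means)  # [[1., 1.], [10., 10.]]
```

In the real DDP path, passing `dist.all_reduce` as `all_reduce_fn` sums both buffers over all ranks before the single division.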
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master    #6284      +/-   ##
==========================================
- Coverage   56.49%   56.45%    -0.04%
==========================================
  Files         896      896
  Lines       84814    84881      +67
==========================================
+ Hits        47914    47923       +9
- Misses      36900    36958      +58
```
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
ftshijt left a comment:
Many thanks for the update and for catching the issue. I have two questions about this setup:
```python
self.embed.data.copy_(embed_normalized)
self.update_ema()

self.expire_codes_(x)
```
If `expire_codes_` runs at a later stage, shall we do another `all_reduce` to keep all GPUs the same?
I think it's OK:

- `expire_codes_` determines dead codes by checking `self.cluster_size < self.threshold_ema_dead_code`, and `self.cluster_size` has just been synchronized across all GPUs, so the calculated mask is identical.
- The replacement vectors are determined by `self.sample_fn`, which in the DDP environment is `sample_vectors_distributed`. This function is designed for distributed sampling and ensures every GPU gets the exact same new vectors. Though each GPU handles the dead codes independently, the resulting codebook remains synchronized.
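A small illustration of the first point: once `cluster_size` has been `all_reduce`d, every rank evaluates the same comparison on identical inputs and therefore builds the same dead-code mask. The tensors below are illustrative, not from the PR.

```python
import torch

threshold_ema_dead_code = 2.0
# After all_reduce, every rank holds this identical cluster_size buffer.
synced_cluster_size = torch.tensor([0.5, 10.0, 1.9, 7.0])

# Each rank computes the mask independently, from identical inputs ...
mask_rank0 = synced_cluster_size < threshold_ema_dead_code
mask_rank1 = synced_cluster_size < threshold_ema_dead_code

# ... so the set of codes chosen for revival agrees on every rank.
assert torch.equal(mask_rank0, mask_rank1)
```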
Thanks for sharing the update. It looks good to me.
Thanks for your great contribution!
**What did you change?**

- Fixed `EuclideanCodebook`'s EMA update.
- Added distributed synchronization (`all_reduce`) for all codebook updates, including K-Means init, EMA statistics (`cluster_size`, `embed_sum`), and dead code sampling.

**Why did you make this change?**

- When a dead code in `self.embed` was replaced with a new vector, its corresponding EMA state (`self.embed_avg`, `self.cluster_size`) was not reset. This caused the new vector to be immediately overwritten by a stale value (calculated from the old, dead state) in the same forward pass, preventing the codebook from recovering from collapse.
- The EMA statistics (`cluster_size` and `embed_sum`) were calculated using only the local batch on each worker, with no step-by-step synchronization. Synchronization only occurred intermittently during initialization (`init_embed_`) and code expiration (`expire_codes_`), where `broadcast_tensors` was used to copy the buffers from Rank 0 to all other workers, discarding the updates computed by other ranks. This PR changes the logic to use `all_reduce` on these statistics on every training step, ensuring the EMA update is calculated using the full global batch and keeping codebooks consistent across all workers.

**Is your PR small enough?**

Yes.
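A hedged sketch of the revival fix described above: when a dead code is replaced, its EMA state is reset as well, so the fresh vector survives the next EMA step instead of being pulled back toward the stale average. Names follow the PR description (`embed`, `embed_avg`, `cluster_size`); the helper `revive_dead_codes` and the reset value are illustrative, not the actual implementation.

```python
import torch

def revive_dead_codes(embed, embed_avg, cluster_size, new_vectors, mask,
                      reset_cluster_size=1.0):
    # Replace dead codebook entries with freshly sampled vectors.
    embed[mask] = new_vectors[mask]
    # Also reset the EMA numerator/denominator for the revived entries,
    # so the next EMA update keeps the new vector instead of restoring
    # the stale average computed from the old, dead state.
    embed_avg[mask] = new_vectors[mask] * reset_cluster_size
    cluster_size[mask] = reset_cluster_size

embed = torch.zeros(3, 2)
embed_avg = torch.full((3, 2), 9.0)       # stale EMA numerator
cluster_size = torch.full((3,), 9.0)      # stale EMA denominator
new_vectors = torch.full((3, 2), 5.0)
mask = torch.tensor([True, False, True])  # codes 0 and 2 are dead

revive_dead_codes(embed, embed_avg, cluster_size, new_vectors, mask)
# embed_avg / cluster_size now reproduces the new vector for revived codes.
```

Without the reset, `embed_avg / cluster_size` for a revived entry would still reflect the dead state and overwrite the replacement on the next forward pass, which is exactly the bug the PR describes.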
**Additional Context**
The logic in https://github.com/cisco-open/espnet/blob/master/espnet2/gan_codec/shared/quantizer/modules/core_vq.py was used as the main reference.
It was also compared against https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/src/mimo_audio_tokenizer/quantization.py.
The code has been validated for correctness through training.