Qingzheng-Wang · 2025-06-19T06:08:09Z

What did you change?

This PR introduces two sampling strategies to support balanced mini-batch construction for language identification (LID), addressing both language (category) and dataset-level imbalance.
These samplers can also be used for other category-aware and multi-dataset tasks.

Specifically:

CategoryPowerSampler:
- Configurable via batch_type=catpow.
- Implements power-law sampling over language categories.
- Enables low-resource language upsampling by adjusting a power factor.
- Batches are constructed by sampling categories proportionally to $(n_l / N)^\beta$, then uniformly sampling utterances within each category.
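The category-level rule above can be sketched in a few lines of plain Python. This is only an illustration of the formula $(n_l / N)^\beta$ followed by a uniform draw within the chosen category; the function names, the toy language codes, and the utterance IDs are assumptions made for the example and are not taken from the PR's implementation.

```python
import random


def category_power_probs(counts, beta):
    """Return p_l proportional to (n_l / N) ** beta, normalized to sum to 1."""
    total = sum(counts.values())
    weights = {cat: (n / total) ** beta for cat, n in counts.items()}
    norm = sum(weights.values())
    return {cat: w / norm for cat, w in weights.items()}


def sample_utterance(utts_by_cat, beta, rng):
    """Draw a category under the power law, then an utterance uniformly within it."""
    counts = {cat: len(utts) for cat, utts in utts_by_cat.items()}
    probs = category_power_probs(counts, beta)
    cats = list(probs)
    cat = rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]
    return cat, rng.choice(utts_by_cat[cat])


# Hypothetical toy corpus: one high-resource and one low-resource language.
utts_by_cat = {
    "eng": [f"eng_{i}" for i in range(900)],
    "swa": [f"swa_{i}" for i in range(100)],
}
counts = {cat: len(utts) for cat, utts in utts_by_cat.items()}

for beta in (1.0, 0.5, 0.0):
    print(beta, category_power_probs(counts, beta))

print(sample_utterance(utts_by_cat, beta=0.5, rng=random.Random(0)))
```

With these toy counts, a power factor of 1.0 reproduces the natural 0.9 / 0.1 split, 0.5 moves it to roughly 0.75 / 0.25, and 0.0 makes both categories equally likely, which is the upsampling effect described above.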
CategoryDatasetPowerSampler:
- Configurable via batch_type=catpow_balance_dataset.
- Extends the above to support hierarchical sampling over both datasets and categories.
- First samples a dataset, then a language category within that dataset, both under configurable power-law distributions (β_D and β_L).
- Useful for training on heterogeneous multi-domain LID corpora with imbalanced distributions.
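The two-level draw can be sketched as follows. This is a minimal illustration under stated assumptions: dataset weights are taken to be proportional to (dataset size / total size) raised to β_D, and category weights inside the chosen dataset to (category size / dataset size) raised to β_L. The corpus names, counts, and function names are hypothetical, and the actual sampler in the PR may normalize or batch differently.

```python
import random


def power_probs(counts, beta):
    """Normalize (n / N) ** beta over the keys of `counts`."""
    total = sum(counts.values())
    weights = {key: (n / total) ** beta for key, n in counts.items()}
    norm = sum(weights.values())
    return {key: w / norm for key, w in weights.items()}


def sample_dataset_then_category(corpus_counts, beta_d, beta_l, rng):
    """corpus_counts maps dataset -> {category: number of utterances}."""
    # Stage 1: pick a dataset under the dataset-level power law (beta_d).
    dataset_sizes = {d: sum(c.values()) for d, c in corpus_counts.items()}
    p_d = power_probs(dataset_sizes, beta_d)
    dataset = rng.choices(list(p_d), weights=list(p_d.values()), k=1)[0]
    # Stage 2: pick a category inside that dataset under beta_l.
    p_l = power_probs(corpus_counts[dataset], beta_l)
    category = rng.choices(list(p_l), weights=list(p_l.values()), k=1)[0]
    return dataset, category


# Hypothetical corpora with very different sizes and language mixes.
corpus_counts = {
    "corpus_a": {"eng": 9000, "deu": 900, "swa": 100},
    "corpus_b": {"eng": 50, "swa": 50},
}

rng = random.Random(0)
print([sample_dataset_then_category(corpus_counts, 0.0, 0.5, rng) for _ in range(5)])
```

Setting the dataset-level exponent to 0 in this sketch gives every dataset the same chance of being picked regardless of size, while the category-level exponent independently controls how strongly rare languages are upsampled within the chosen dataset.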

Both samplers:

Support batch_bins-based batching and work with ESPnet’s dynamic batch construction pipeline.
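As a rough picture of what bin-based batching means here, the sketch below keeps drawing utterances until the next one would push the summed length past a `batch_bins` budget. It is an assumption-laden simplification: `build_batch`, `draw_utt`, and the toy lengths are hypothetical, the draw is uniform rather than category-aware, and ESPnet's real pipeline may count bins differently (for example, by accounting for padding).

```python
import random


def build_batch(draw_utt, lengths, batch_bins):
    """Collect utterance IDs until the next draw would exceed batch_bins.

    Draws are made with replacement, so an ID can appear twice in a batch.
    A rejected draw is simply discarded in this sketch.
    """
    batch, used = [], 0
    while True:
        utt = draw_utt()
        if batch and used + lengths[utt] > batch_bins:
            break
        # An oversized utterance is still accepted when the batch is empty,
        # so that every batch contains at least one item.
        batch.append(utt)
        used += lengths[utt]
        if used >= batch_bins:
            break
    return batch


# Hypothetical utterance lengths (e.g. in frames).
lengths = {f"utt{i}": 100 + 10 * i for i in range(20)}
utt_ids = sorted(lengths)
rng = random.Random(0)

batch = build_batch(lambda: rng.choice(utt_ids), lengths, batch_bins=1000)
print(batch, sum(lengths[u] for u in batch))
```

In the samplers described above, the uniform `draw_utt` would be replaced by the power-law draw over categories (and datasets), so the batch budget and the balancing rule stay independent of each other.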

espnet2/tasks/abs_task.py is updated to integrate these samplers.

Why did you make this change?

Language identification tasks often face significant data imbalance, both within datasets (low-resource languages) and across datasets (domain shifts, size mismatches).
These samplers enable controlled and reproducible upsampling to improve training stability and generalization, especially in multilingual and cross-domain settings.

The category-level sampler enables targeted control over category frequency.
The hierarchical sampler balances dataset exposure during training.

Is your PR small enough?

Yes

Additional Context

These samplers are general-purpose and can be used in any classification task with categorical imbalance and heterogeneous corpora.
Depends on:
Reference:
- Scaling Speech Technology to 1,000+ Languages (https://arxiv.org/pdf/2305.13516)

for more information, see https://pre-commit.ci

codecov · 2025-06-20T01:01:29Z

Codecov Report

❌ Patch coverage is 18.45494% with 190 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.84%. Comparing base (40eb85a) to head (d7a13cf).
⚠️ Report is 13 commits behind head on master.

Files with missing lines	Patch %	Lines
espnet2/samplers/category_power_sampler.py	10.43%	163 Missing ⚠️
espnet2/samplers/build_batch_sampler.py	26.92%	19 Missing ⚠️
espnet2/tasks/abs_task.py	68.00%	8 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           master    #6158       +/-   ##
===========================================
+ Coverage   27.20%   55.84%   +28.63%     
===========================================
  Files         880      883        +3     
  Lines       83774    83968      +194     
===========================================
+ Hits        22787    46888    +24101     
+ Misses      60987    37080    -23907

Flag	Coverage Δ
test_integration_espnet2	`46.17% <18.02%> (?)`
test_integration_espnetez	`36.94% <18.02%> (?)`
test_python_espnet1	`?`
test_python_espnet2	`50.52% <14.16%> (?)`
test_python_espnetez	`12.83% <13.73%> (-0.03%)`	⬇️
test_utils	`18.77% <ø> (-1.87%)`	⬇️

Copilot

Pull Request Overview

This PR adds two new sampling strategies—CategoryPowerSampler and CategoryDatasetPowerSampler—to address language and dataset imbalance in mini-batch construction for language identification.

- Introduces samplers configurable via batch_type settings ("catpow" and "catpow_balance_dataset").
- Adds new CLI arguments for controlling upsampling and batch size scaling.
- Updates the task integration in espnet2/tasks/abs_task.py to support these new samplers.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
espnet2/tasks/abs_task.py	New sampler-specific arguments and integration into task logic.
espnet2/samplers/category_power_sampler.py	Implementation of CategoryPowerSampler and CategoryDatasetPowerSampler.

espnet2/tasks/abs_task.py

espnet2/samplers/category_power_sampler.py

Emrys365

Sorry for the late review. I think the code looks good now.

espnet2/samplers/category_power_sampler.py

Fhrozen · 2025-08-23T02:33:49Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces two new powerful samplers, CategoryPowerSampler and CategoryDatasetPowerSampler, to handle data imbalance in language identification tasks. The implementation is well-structured, but there are a few critical areas that need attention. Specifically, there's a potential regression in build_category_chunk_iter_factory that causes an argument to be ignored. Additionally, the docstrings in the new sampler file have some discrepancies with the implementation which could cause confusion. Finally, there is significant code duplication in abs_task.py that should be refactored to improve maintainability. Addressing these points will make the new feature more robust and easier to maintain.

espnet2/tasks/abs_task.py

espnet2/samplers/category_power_sampler.py

espnet2/tasks/abs_task.py

sw005320 · 2025-08-25T11:42:22Z

@Qingzheng-Wang, please check Gemini's reviews.

sw005320 · 2025-08-26T21:52:54Z

I don't know why, but your PR keeps getting stuck on the following test failures:

FAILED test/espnet2/enh/separator/test_bsrnn_separator.py::test_bsrnn_separator_forward_backward_subbands[erb-False-48000-3-16-1-481] - Failed: Timeout >2.0s
FAILED test/espnet2/gan_tts/jets/test_jets.py::test_jets_is_trainable_and_decodable[gen_dict2-dis_dict2-loss_dict2] - Failed: Timeout >15.0s
FAILED test/espnet2/gan_tts/vits/test_vits.py::test_vits_is_trainable_and_decodable[gen_dict2-dis_dict2-loss_dict2] - Failed: Timeout >10.0s

I know these tests are unrelated to your changes, but could you increase their timeout thresholds so that your PR can pass?

Qingzheng-Wang · 2025-08-26T22:08:45Z

Sure.

sw005320 · 2025-08-27T12:04:25Z

Thanks, @Qingzheng-Wang!

Add catpow and catpow_balance_dataset batch samplers.

a67a7c0

dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 19, 2025

mergify bot added the ESPnet2 label Jun 19, 2025

dosubot bot added the Enhancement label Jun 19, 2025

[pre-commit.ci] auto fixes from pre-commit.com hooks

64b3b67

for more information, see https://pre-commit.ci

This was referenced Jun 19, 2025

LID-5: Tri-stage learning rate scheduler #6159

Merged

LID-6: LID recipe template #6160

Merged

sw005320 requested a review from Copilot June 19, 2025 11:33

This was referenced Jun 19, 2025

LID-1: Training and task setup #6155

Merged

LID-2: Model, loss and pooling modules #6156

Merged

LID-3: Inference, embedding extraction and t-SNE visualization #6157

Merged

Add catpow and catpow_balance_dataset integration.

023815c

dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jun 19, 2025

Qingzheng-Wang and others added 3 commits June 19, 2025 16:55

Fix overlong lines.

8742175

[pre-commit.ci] auto fixes from pre-commit.com hooks

d42cf1d

for more information, see https://pre-commit.ci

Fix lengthy lines.

c271940

Qingzheng-Wang mentioned this pull request Jun 26, 2025

LID-7: VoxLingua107 recipe #6174

Merged

Fhrozen requested a review from Copilot June 27, 2025 05:49

Copilot AI reviewed Jun 27, 2025

espnet2/tasks/abs_task.py (outdated, resolved)

espnet2/samplers/category_power_sampler.py (resolved)

espnet2/samplers/category_power_sampler.py (resolved)

Qingzheng-Wang and others added 8 commits June 27, 2025 18:02

Fix typo.

381f6c2

Co-authored-by: Copilot <[email protected]>

Fix condition expression.

5c7271e

Fix condition expression.

735d10a

[pre-commit.ci] auto fixes from pre-commit.com hooks

b40f131

for more information, see https://pre-commit.ci

Fix condition expression.

0cdb144

Rerun CI.

95545ae

[pre-commit.ci] auto fixes from pre-commit.com hooks

f9221cb

for more information, see https://pre-commit.ci

Update error note.

5a4381f

Fix conditions when meet max batch size.

a230913

Emrys365 approved these changes Aug 18, 2025

espnet2/samplers/category_power_sampler.py (resolved)

espnet2/samplers/category_power_sampler.py (resolved)

dosubot bot added the lgtm This PR has been approved by a maintainer label Aug 18, 2025

Qingzheng-Wang and others added 3 commits August 18, 2025 15:33

Merge branch 'master' into lid_release4

92387f9

Re CI.

e119d4c

[pre-commit.ci] auto fixes from pre-commit.com hooks

4a1ca2b

for more information, see https://pre-commit.ci

Qingzheng-Wang mentioned this pull request Aug 20, 2025

LID-9: Geolocation-aware LID recipe and codes #6212

Open

Merge branch 'master' into lid_release4

3104671

gemini-code-assist bot reviewed Aug 23, 2025

espnet2/tasks/abs_task.py (outdated, resolved)

espnet2/samplers/category_power_sampler.py (outdated, resolved)

espnet2/samplers/category_power_sampler.py (outdated, resolved)

espnet2/tasks/abs_task.py (outdated, resolved)

Qingzheng-Wang and others added 9 commits August 25, 2025 11:41

Fixed build_category_chunk_iter_factory.

652b227

Fixed comments.

3b76f3e

Add unified function to build category samplers.

edaf971

Merge branch 'master' into lid_release4

9c0c888

[pre-commit.ci] auto fixes from pre-commit.com hooks

4ebc657

for more information, see https://pre-commit.ci

Fixed missing sampler_args.

c120b06

Merge branch 'master' into lid_release4

26a3715

Re CI.

525377b

[pre-commit.ci] auto fixes from pre-commit.com hooks

269219f

for more information, see https://pre-commit.ci

Qingzheng-Wang and others added 2 commits August 26, 2025 15:06

Merge branch 'master' into lid_release4

22027c1

Improve timeout limit.

40eb85a

Qingzheng-Wang and others added 3 commits August 26, 2025 15:30

Merge branch 'master' into lid_release4

d2341d3

Add test time limit.

a982e42

Improve time limit.

d7a13cf

sw005320 merged commit 0e90714 into espnet:master Aug 27, 2025
31 of 32 checks passed

Fhrozen mentioned this pull request Sep 11, 2025

Version Release #6236

Merged
