
LID-4: Category- and dataset-aware balanced sampler #6158

Merged
sw005320 merged 59 commits into espnet:master from Qingzheng-Wang:lid_release4
Aug 27, 2025

Conversation

@Qingzheng-Wang
Contributor

@Qingzheng-Wang Qingzheng-Wang commented Jun 19, 2025

What did you change?

This PR introduces two sampling strategies to support balanced mini-batch construction for language identification (LID), addressing both language (category) and dataset-level imbalance.
These samplers can also be used for other category-aware and multi-dataset tasks.

Specifically:

  • CategoryPowerSampler:

    • Configurable via batch_type=catpow.
    • Implements power-law sampling over language categories.
    • Enables low-resource language upsampling by adjusting a power factor.
    • Batches are constructed by sampling categories proportionally to $(n_l / N)^\beta$, then uniformly sampling utterances within each category.
  • CategoryDatasetPowerSampler:

    • Configurable via batch_type=catpow_balance_dataset.
    • Extends the above to support hierarchical sampling over both datasets and categories.
    • First samples a dataset, then a language category within that dataset, both under configurable power-law distributions (β_D and β_L).
    • Useful for training on heterogeneous multi-domain LID corpora with imbalanced distributions.
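
The two sampling schemes described above can be sketched as follows. This is a minimal illustration, not the code in espnet2/samplers/category_power_sampler.py: the function names, the `(utt_id, dataset, language)` tuple layout, and the normalization of the power-law weights into probabilities are all assumptions made for the example.

```python
import random
from collections import defaultdict


def power_law_probs(counts, beta):
    """Return probabilities proportional to (n / N) ** beta for each key."""
    total = sum(counts.values())
    weights = {key: (n / total) ** beta for key, n in counts.items()}
    norm = sum(weights.values())
    return {key: w / norm for key, w in weights.items()}


def sample_key(probs, rng):
    """Draw one key according to its probability."""
    keys = list(probs)
    return rng.choices(keys, weights=[probs[k] for k in keys], k=1)[0]


def sample_utterance(utts, beta_d, beta_l, rng):
    """Hierarchical draw: dataset first, then language, then utterance.

    utts is a list of (utt_id, dataset, language) tuples; this layout is
    hypothetical and chosen only for the sketch.
    """
    by_dataset = defaultdict(lambda: defaultdict(list))
    for utt_id, dataset, lang in utts:
        by_dataset[dataset][lang].append(utt_id)

    # Dataset level: power law over dataset sizes (beta_d).
    dataset_counts = {
        d: sum(len(ids) for ids in langs.values())
        for d, langs in by_dataset.items()
    }
    dataset = sample_key(power_law_probs(dataset_counts, beta_d), rng)

    # Category level: power law over language counts in that dataset (beta_l).
    lang_counts = {l: len(ids) for l, ids in by_dataset[dataset].items()}
    lang = sample_key(power_law_probs(lang_counts, beta_l), rng)

    # Utterance level: uniform within the chosen category.
    return rng.choice(by_dataset[dataset][lang])
```

With a single dataset, `sample_utterance` reduces to the category-only scheme; passing a seeded `random.Random` instance as `rng` keeps the draws reproducible.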

Both samplers:

  • Support batch_bins-based batching and work with ESPnet’s dynamic batch construction pipeline.
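
The description does not spell out how the samplers count bins, so the following is only a sketch of one plausible reading: utterances are drawn (with replacement) from power-law-weighted categories until the summed utterance length would exceed a `batch_bins` budget. The function name `build_batch`, its arguments, and the definition of a bin as one unit of utterance length are assumptions for illustration.

```python
import random


def build_batch(utts_by_category, lengths, probs, batch_bins, rng):
    """Draw utterances until adding one more would exceed batch_bins.

    utts_by_category: {category: [utt_id, ...]}
    lengths:          {utt_id: length, e.g. in frames}
    probs:            {category: sampling probability}
    """
    categories = list(probs)
    weights = [probs[c] for c in categories]
    batch, used = [], 0
    while True:
        # Pick a category by its probability, then an utterance uniformly.
        cat = rng.choices(categories, weights=weights, k=1)[0]
        utt = rng.choice(utts_by_category[cat])
        # Always accept the first utterance so a batch is never empty.
        if batch and used + lengths[utt] > batch_bins:
            break
        batch.append(utt)
        used += lengths[utt]
        if used >= batch_bins:
            break
    return batch
```

Because the budget is expressed in bins rather than utterance count, batches of short utterances hold more items than batches of long ones.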

espnet2/tasks/abs_task.py is modified to integrate these samplers.


Why did you make this change?

Language identification tasks often face significant data imbalance, both within datasets (low-resource languages) and across datasets (domain shifts, size mismatches).
These samplers enable controlled and reproducible upsampling to improve training stability and generalization, especially in multilingual and cross-domain settings.

  • The category-level sampler enables targeted control over category frequency.
  • The hierarchical sampler balances dataset exposure during training.
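
As a worked illustration of how the power factor controls category frequency, take two languages with $n_1 = 900$ and $n_2 = 100$ utterances, so $N = 1000$. The counts are invented for the example, and normalizing the weights over categories is an assumption consistent with sampling "proportionally to" $(n_l / N)^\beta$:

$$
p_l = \frac{(n_l / N)^{\beta}}{\sum_{k} (n_k / N)^{\beta}}
$$

$$
\beta = 1:\ (p_1, p_2) = (0.9,\ 0.1) \qquad
\beta = 0.5:\ (p_1, p_2) = (0.75,\ 0.25) \qquad
\beta = 0:\ (p_1, p_2) = (0.5,\ 0.5)
$$

Lowering $\beta$ from 1 toward 0 moves the distribution from the natural data proportions toward uniform, which is what upsamples the low-resource language.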

Is your PR small enough?

Yes


Additional Context

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 19, 2025
@mergify mergify bot added the ESPnet2 label Jun 19, 2025
@dosubot dosubot bot added the Enhancement Enhancement label Jun 19, 2025

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jun 19, 2025
@codecov

codecov bot commented Jun 20, 2025

Codecov Report

❌ Patch coverage is 18.45494% with 190 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.84%. Comparing base (40eb85a) to head (d7a13cf).
⚠️ Report is 13 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| espnet2/samplers/category_power_sampler.py | 10.43% | 163 Missing ⚠️ |
| espnet2/samplers/build_batch_sampler.py | 26.92% | 19 Missing ⚠️ |
| espnet2/tasks/abs_task.py | 68.00% | 8 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6158       +/-   ##
===========================================
+ Coverage   27.20%   55.84%   +28.63%     
===========================================
  Files         880      883        +3     
  Lines       83774    83968      +194     
===========================================
+ Hits        22787    46888    +24101     
+ Misses      60987    37080    -23907     

| Flag | Coverage Δ |
|---|---|
| test_integration_espnet2 | 46.17% <18.02%> (?) |
| test_integration_espnetez | 36.94% <18.02%> (?) |
| test_python_espnet1 | ? |
| test_python_espnet2 | 50.52% <14.16%> (?) |
| test_python_espnetez | 12.83% <13.73%> (-0.03%) ⬇️ |
| test_utils | 18.77% <ø> (-1.87%) ⬇️ |

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds two new sampling strategies—CategoryPowerSampler and CategoryDatasetPowerSampler—to address language and dataset imbalance in mini-batch construction for language identification.

  • Introduces samplers configurable via batch_type settings ("catpow" and "catpow_balance_dataset").
  • Adds new CLI arguments for controlling upsampling and batch size scaling.
  • Updates the task integration in espnet2/tasks/abs_task.py to support these new samplers.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| espnet2/tasks/abs_task.py | New sampler-specific arguments and integration into task logic. |
| espnet2/samplers/category_power_sampler.py | Implementation of CategoryPowerSampler and CategoryDatasetPowerSampler. |

Collaborator

@Emrys365 Emrys365 left a comment

Sorry for the late review. I think the code looks good now.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Aug 18, 2025
@Fhrozen
Member

Fhrozen commented Aug 23, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces two new powerful samplers, CategoryPowerSampler and CategoryDatasetPowerSampler, to handle data imbalance in language identification tasks. The implementation is well-structured, but there are a few critical areas that need attention. Specifically, there's a potential regression in build_category_chunk_iter_factory that causes an argument to be ignored. Additionally, the docstrings in the new sampler file have some discrepancies with the implementation which could cause confusion. Finally, there is significant code duplication in abs_task.py that should be refactored to improve maintainability. Addressing these points will make the new feature more robust and easier to maintain.

@sw005320
Contributor

@Qingzheng-Wang, please check Gemini's reviews.

@sw005320
Contributor

I don't know why, but your PR keeps getting stuck on

FAILED test/espnet2/enh/separator/test_bsrnn_separator.py::test_bsrnn_separator_forward_backward_subbands[erb-False-48000-3-16-1-481] - Failed: Timeout >2.0s
FAILED test/espnet2/gan_tts/jets/test_jets.py::test_jets_is_trainable_and_decodable[gen_dict2-dis_dict2-loss_dict2] - Failed: Timeout >15.0s
FAILED test/espnet2/gan_tts/vits/test_vits.py::test_vits_is_trainable_and_decodable[gen_dict2-dis_dict2-loss_dict2] - Failed: Timeout >10.0s

I know that this is not your part, but can you increase the timeout thresholds of these tests to pass your PR?

@Qingzheng-Wang
Contributor Author

I don't know why, but your PR keeps getting stuck on

FAILED test/espnet2/enh/separator/test_bsrnn_separator.py::test_bsrnn_separator_forward_backward_subbands[erb-False-48000-3-16-1-481] - Failed: Timeout >2.0s
FAILED test/espnet2/gan_tts/jets/test_jets.py::test_jets_is_trainable_and_decodable[gen_dict2-dis_dict2-loss_dict2] - Failed: Timeout >15.0s
FAILED test/espnet2/gan_tts/vits/test_vits.py::test_vits_is_trainable_and_decodable[gen_dict2-dis_dict2-loss_dict2] - Failed: Timeout >10.0s

I know that this is not your part, but can you increase the timeout thresholds of these tests to pass your PR?

Sure.

@sw005320 sw005320 merged commit 0e90714 into espnet:master Aug 27, 2025
31 of 32 checks passed
@sw005320
Contributor

Thanks, @Qingzheng-Wang!

@Fhrozen Fhrozen mentioned this pull request Sep 11, 2025

Labels

Enhancement Enhancement ESPnet2 lgtm This PR has been approved by a maintainer size:XL This PR changes 500-999 lines, ignoring generated files.

5 participants