LID-4: Category- and dataset-aware balanced sampler #6158
sw005320 merged 59 commits into espnet:master
Codecov Report
Coverage diff, master vs. #6158:
| Metric | master | #6158 | +/- |
|---|---|---|---|
| Coverage | 27.20% | 55.84% | +28.63% |
| Files | 880 | 883 | +3 |
| Lines | 83774 | 83968 | +194 |
| Hits | 22787 | 46888 | +24101 |
| Misses | 60987 | 37080 | -23907 |
Pull Request Overview
This PR adds two new sampling strategies—CategoryPowerSampler and CategoryDatasetPowerSampler—to address language and dataset imbalance in mini-batch construction for language identification.
- Introduces samplers configurable via batch_type settings ("catpow" and "catpow_balance_dataset").
- Adds new CLI arguments for controlling upsampling and batch size scaling.
- Updates the task integration in espnet2/tasks/abs_task.py to support these new samplers.
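As a rough illustration of what category-level power sampling can mean, the sketch below assumes the common scheme in which a category with n_c utterances is drawn with probability proportional to n_c^β. The function name `power_probs` and the parameter `beta` are hypothetical and are not taken from `category_power_sampler.py`; the formula used by the actual sampler may differ.

```python
from collections import Counter


def power_probs(labels, beta):
    """Return {category: probability} with p_c proportional to n_c ** beta.

    Assumed scheme (not confirmed by the PR):
    beta = 1.0 keeps the natural, imbalanced distribution;
    beta = 0.0 gives every category the same probability.
    """
    counts = Counter(labels)
    weights = {c: n ** beta for c, n in counts.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Toy label list: one high-resource and one low-resource language.
labels = ["eng"] * 900 + ["swa"] * 100
natural = power_probs(labels, beta=1.0)
flat = power_probs(labels, beta=0.0)
smoothed = power_probs(labels, beta=0.5)
```

With these toy counts, an intermediate exponent such as 0.5 raises the share of the low-resource language above its natural share without making the two languages fully uniform.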
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| espnet2/tasks/abs_task.py | New sampler-specific arguments and integration into task logic. |
| espnet2/samplers/category_power_sampler.py | Implementation of CategoryPowerSampler and CategoryDatasetPowerSampler. |
Emrys365 left a comment
Sorry for the late review. I think the code looks good now.
/gemini review
Code Review
This pull request introduces two new powerful samplers, CategoryPowerSampler and CategoryDatasetPowerSampler, to handle data imbalance in language identification tasks. The implementation is well-structured, but there are a few critical areas that need attention. Specifically, there's a potential regression in build_category_chunk_iter_factory that causes an argument to be ignored. Additionally, the docstrings in the new sampler file have some discrepancies with the implementation which could cause confusion. Finally, there is significant code duplication in abs_task.py that should be refactored to improve maintainability. Addressing these points will make the new feature more robust and easier to maintain.
@Qingzheng-Wang, please check Gemini's reviews.
I don't know why, but your PR keeps getting stuck on some of the tests. I know that this is not your part, but can you increase the timeout thresholds of these tests so that your PR can pass?
Sure.
Thanks, @Qingzheng-Wang!
What did you change?
This PR introduces two sampling strategies to support balanced mini-batch construction for language identification (LID), addressing both language (category) and dataset-level imbalance.
These samplers can also be used for other category-aware and multi-dataset tasks.
Specifically:
- CategoryPowerSampler: enabled with batch_type=catpow.
- CategoryDatasetPowerSampler: enabled with batch_type=catpow_balance_dataset; balances at both the dataset and the language level (β_D and β_L).

Both samplers:
- Support batch_bins-based batching and work with ESPnet's dynamic batch construction pipeline.

espnet2/tasks/abs_task.py is changed to integrate these samplers.
Why did you make this change?
Language identification tasks often face significant data imbalance, both within datasets (low-resource languages) and across datasets (domain shifts, size mismatches).
These samplers enable controlled and reproducible upsampling to improve training stability and generalization, especially in multilingual and cross-domain settings.
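To make "controlled and reproducible upsampling" across both datasets and languages more concrete, here is a minimal sketch under stated assumptions. It first draws a dataset with probability proportional to its size raised to `beta_d`, then a language inside that dataset with probability proportional to its count raised to `beta_l`, using a seeded generator. The function `two_level_sample` and its parameters are hypothetical, and the roles assumed for β_D and β_L are a guess from their names; this is not the implementation in `CategoryDatasetPowerSampler`.

```python
import random
from collections import defaultdict


def two_level_sample(utts, beta_d, beta_l, num_samples, seed=0):
    """Sample utterance ids from (utt_id, dataset, language) tuples.

    Assumed scheme: dataset drawn with p proportional to size ** beta_d,
    language drawn with p proportional to count ** beta_l, then one
    utterance of that language drawn uniformly (with replacement).
    """
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    pool = defaultdict(lambda: defaultdict(list))
    for utt_id, dataset, lang in utts:
        pool[dataset][lang].append(utt_id)

    datasets = sorted(pool)  # sorted for a deterministic order
    d_weights = [
        sum(len(ids) for ids in pool[d].values()) ** beta_d for d in datasets
    ]

    picked = []
    for _ in range(num_samples):
        d = rng.choices(datasets, weights=d_weights, k=1)[0]
        langs = sorted(pool[d])
        l_weights = [len(pool[d][lang]) ** beta_l for lang in langs]
        lang = rng.choices(langs, weights=l_weights, k=1)[0]
        picked.append(rng.choice(pool[d][lang]))
    return picked


# Toy corpus: a large dataset dominated by one language and a small one.
utts = (
    [(f"a_eng_{i}", "setA", "eng") for i in range(90)]
    + [(f"a_swa_{i}", "setA", "swa") for i in range(10)]
    + [(f"b_swa_{i}", "setB", "swa") for i in range(5)]
)
first = two_level_sample(utts, beta_d=0.5, beta_l=0.5, num_samples=20, seed=1)
second = two_level_sample(utts, beta_d=0.5, beta_l=0.5, num_samples=20, seed=1)
```

Because the generator is seeded and the dataset and language lists are sorted, calling the function twice with the same arguments returns the same sequence of utterance ids.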
Is your PR small enough?
Yes
Additional Context