LID-1: Training and task setup #6155

Merged
sw005320 merged 14 commits into espnet:master from Qingzheng-Wang:lid_release1
Jun 20, 2025

Conversation

@Qingzheng-Wang
Contributor

@Qingzheng-Wang Qingzheng-Wang commented Jun 19, 2025

What did you change?

This PR introduces the core training and evaluation pipeline for the LID task in ESPnet, including:


Why did you make this change?

This is the foundational step to enable language identification (LID) training within ESPnet. It defines the task and enables downstream LID model training.


Is your PR small enough?

Yes


Additional Context

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Jun 19, 2025
@mergify mergify bot added the ESPnet2 label Jun 19, 2025
@dosubot dosubot bot added the New Features label Jun 19, 2025
@sw005320
Contributor

This pull request introduces significant enhancements to the espnet2 framework, including new batch sampling strategies, additional arguments for batch configuration, and a specialized preprocessor for Language Identification (LID) tasks. These changes improve flexibility in data handling and expand functionality for specific use cases.

Batch Sampling Enhancements:

  • Added two new batch samplers, CategoryPowerSampler and CategoryDatasetPowerSampler, to support advanced sampling strategies for low-resource categories and datasets.
  • Introduced new arguments (--upsampling_factor, --language_upsampling_factor, --dataset_upsampling_factor, --dataset_scaling_factor, and --max_batch_size) for configuring batch sampling, particularly for the new samplers.
  • Updated the build_category_iter_factory method to implement logic for the new batch types (catpow and catpow_balance_dataset) and added error handling for missing required files (dataset2utt and utt2dataset). [1] [2] [3]
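
To make the sampling idea concrete, here is a minimal sketch of power-law category sampling, in which a category with n utterances is drawn with probability proportional to n ** alpha. This is only an illustration of the general technique. The function name `power_sampling_probs`, the `alpha` parameter, and its relation to `--upsampling_factor` are assumptions, not the actual CategoryPowerSampler code.

```python
from collections import Counter


def power_sampling_probs(utt2category, alpha):
    """Return {category: probability} with p_c proportional to n_c ** alpha.

    Hypothetical helper for illustration only. alpha = 1.0 keeps the natural
    distribution; smaller values flatten it, which upsamples low-resource
    categories; alpha = 0.0 gives a uniform distribution over categories.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    counts = Counter(utt2category.values())
    weights = {cat: n ** alpha for cat, n in counts.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}


# Toy data: 900 English utterances and 100 Swahili utterances.
utt2lang = {f"eng_{i}": "eng" for i in range(900)}
utt2lang.update({f"swa_{i}": "swa" for i in range(100)})

natural = power_sampling_probs(utt2lang, alpha=1.0)    # eng 0.9, swa 0.1
flattened = power_sampling_probs(utt2lang, alpha=0.5)  # eng 0.75, swa 0.25
```

With alpha = 0.5 the low-resource language goes from 10% to 25% of the draws, which is the kind of rebalancing the new batch types are meant to provide.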

Scheduler Integration:

  • Added TristageLR to the list of supported learning rate schedulers, enabling more complex learning rate adjustments during training.
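
For readers unfamiliar with the schedule shape, the sketch below shows a generic tri-stage learning rate: linear warmup, a hold at the peak, then exponential decay to a floor. It follows the commonly used fairseq-style formulation. The function `tri_stage_lr` and all of its parameter names are hypothetical, and the TristageLR class added here may differ in its arguments and details.

```python
import math


def tri_stage_lr(step, peak_lr, warmup_steps, hold_steps, decay_steps,
                 init_lr_scale=0.01, final_lr_scale=0.01):
    """Learning rate at `step` for a generic tri-stage schedule (illustrative)."""
    init_lr = init_lr_scale * peak_lr
    final_lr = final_lr_scale * peak_lr

    # Stage 1: linear warmup from init_lr to peak_lr.
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    step -= warmup_steps

    # Stage 2: hold at the peak.
    if step < hold_steps:
        return peak_lr
    step -= hold_steps

    # Stage 3: exponential decay from peak_lr towards final_lr.
    if step < decay_steps:
        decay_rate = -math.log(final_lr_scale) / decay_steps
        return peak_lr * math.exp(-decay_rate * step)

    # After the decay stage the rate stays at the floor.
    return final_lr


# Example: 100 warmup steps, 200 hold steps, 700 decay steps.
schedule = [tri_stage_lr(s, 1e-3, 100, 200, 700) for s in (0, 100, 650, 1000)]
```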

Preprocessor for LID Tasks:

  • Introduced the LIDPreprocessor class, which provides specialized preprocessing for Language Identification tasks, including duration fixing, noise augmentation, and label mapping based on lang2utt files.
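
As a rough illustration of two of these steps, the sketch below builds a label map from lang2utt-style lines and fixes an utterance to a target length. It assumes a Kaldi-style line format (`lang utt1 utt2 ...`), a random crop for long inputs, and repeat-padding for short ones. The helper names `load_lang2utt` and `fix_duration` are hypothetical, noise augmentation is omitted, and none of this is taken from the LIDPreprocessor code itself.

```python
import random


def load_lang2utt(lines):
    """Parse 'lang utt1 utt2 ...' lines into (utt2lang, lang2label).

    The line format is an assumption; labels are assigned in sorted
    language order so that the mapping is deterministic.
    """
    utt2lang = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        lang, utts = fields[0], fields[1:]
        for utt in utts:
            utt2lang[utt] = lang
    langs = sorted(set(utt2lang.values()))
    lang2label = {lang: idx for idx, lang in enumerate(langs)}
    return utt2lang, lang2label


def fix_duration(samples, target_len, rng=random):
    """Return exactly target_len samples: random crop if long, repeat if short."""
    if not samples:
        raise ValueError("empty utterance")
    if len(samples) >= target_len:
        start = rng.randint(0, len(samples) - target_len)  # inclusive bounds
        return samples[start:start + target_len]
    repeats = -(-target_len // len(samples))  # ceiling division
    return (samples * repeats)[:target_len]


utt2lang, lang2label = load_lang2utt(["eng utt1 utt2", "swa utt3"])
label = lang2label[utt2lang["utt3"]]
padded = fix_duration([1, 2, 3], 7)
cropped = fix_duration(list(range(10)), 4, random.Random(0))
```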

Contributor

Copilot AI left a comment

Pull Request Overview

Adds foundational support for language identification (LID) training and evaluation in ESPnet by defining a new preprocessor and extending task configuration for sampling strategies.

  • Introduce LIDPreprocessor in espnet2/train/preprocessor.py
  • Integrate new samplers and schedulers in espnet2/tasks/abs_task.py
  • Expose CLI arguments for CategoryPowerSampler and CategoryDatasetPowerSampler

Reviewed Changes

Copilot reviewed 2 out of 4 changed files in this pull request and generated 4 comments.

File                            Description
espnet2/train/preprocessor.py   Add LIDPreprocessor class for LID-specific preprocessing
espnet2/tasks/abs_task.py       Import new samplers/schedulers and hook up CLI options

@sw005320 sw005320 added this to the v.202506 milestone Jun 19, 2025
@sw005320
Contributor

@Qingzheng-Wang, can you try to fix the CI errors for all your PRs?

@Qingzheng-Wang
Contributor Author

> @Qingzheng-Wang, can you try to fix the CI errors for all your PRs?

Yes, I'm fixing them.

@codecov

codecov bot commented Jun 20, 2025

Codecov Report

Attention: Patch coverage is 7.80142% with 130 lines in your changes missing coverage. Please review.

Project coverage is 46.52%. Comparing base (d3db636) to head (3c077ef).
Report is 49 commits behind head on master.

Files with missing lines        Patch %   Lines
espnet2/train/preprocessor.py   7.80%     130 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (d3db636) and HEAD (3c077ef). Click for more details.

HEAD has 15 uploads less than BASE
Flag                        BASE (d3db636)   HEAD (3c077ef)
test_python_espnet2         7                0
test_utils                  7                0
test_integration_espnetez   1                0
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6155      +/-   ##
==========================================
- Coverage   55.45%   46.52%   -8.94%     
==========================================
  Files         882      542     -340     
  Lines       82812    49601   -33211     
==========================================
- Hits        45927    23079   -22848     
+ Misses      36885    26522   -10363     
Flag                        Coverage Δ
test_integration_espnet2    46.52% <7.80%> (?)
test_integration_espnetez   ?
test_python_espnet2         ?
test_utils                  ?

@sw005320
Copy link
Contributor

LGTM.
I want to make sure that a test (in the other PRs or follow-up PRs) can go through this change.

@sw005320 sw005320 merged commit 3bff1f0 into espnet:master Jun 20, 2025
42 of 69 checks passed
@Fhrozen Fhrozen mentioned this pull request Sep 11, 2025

Labels

ESPnet2 New Features size:XL This PR changes 500-999 lines, ignoring generated files.

3 participants