LID-6: LID recipe template#6160

Merged
sw005320 merged 52 commits into espnet:master from Qingzheng-Wang:lid_release6
Sep 9, 2025

Conversation

@Qingzheng-Wang
Contributor

What did you change?

Added a new LID recipe template under egs2/TEMPLATE/lid1, including config files and basic run scripts for reproducible experiments.


Why did you make this change?

This provides a ready-to-use starting point for LID experiments, aligned with ESPnet2 standards.


Is your PR small enough?

This PR is slightly over 1000 lines, but keeping it unified is better for clarity since the components are tightly coupled.


Additional Context

@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files.) label Jun 19, 2025
@dosubot dosubot bot added the Recipe label Jun 19, 2025
@codecov

codecov bot commented Jun 19, 2025

Codecov Report

❌ Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.82%. Comparing base (fc8e461) to head (dcadf5b).
⚠️ Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
espnet2/bin/lid_inference.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6160   +/-   ##
=======================================
  Coverage   55.82%   55.82%           
=======================================
  Files         884      884           
  Lines       84012    84012           
=======================================
  Hits        46903    46903           
  Misses      37109    37109           
Flag Coverage Δ
test_integration_espnet2 46.16% <ø> (ø)
test_integration_espnetez 36.94% <ø> (ø)
test_python_espnet2 50.51% <60.00%> (ø)
test_python_espnetez 12.83% <0.00%> (ø)
test_utils 18.77% <ø> (ø)

@sw005320 sw005320 requested a review from Copilot June 19, 2025 11:34
@sw005320
Contributor

This pull request introduces a new template for spoken language identification (lid1) in ESPnet2. The changes include detailed documentation, configuration files, scripts, and utilities to support the language identification pipeline. Key updates focus on recipe flow, job scheduling, scoring, and data preparation.

Recipe and Documentation Updates:

  • egs2/TEMPLATE/lid1/README.md: Added comprehensive documentation for the lid1 recipe, detailing the pipeline stages, data preparation, model training, inference, scoring, and visualization. Includes an example for VoxLingua107 training.

Job Scheduling and Configuration:

  • egs2/TEMPLATE/lid1/cmd.sh: Introduced backend selection for job scheduling systems (local, slurm, sge, etc.), enabling flexible job execution environments.
  • Configuration files (pbs.conf, queue.conf, slurm.conf): Added default settings for job scheduling systems, including memory, GPU, and thread specifications.

Scoring and Evaluation:

  • egs2/TEMPLATE/lid1/local/score.py: Implemented scoring script to calculate accuracy, macro accuracy, and error frequencies for language identification predictions. Supports detailed error analysis.
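
The three quantities named in this summary can be illustrated with a minimal sketch. It assumes references and predictions are available as dictionaries from utterance ID to language label; the function name `score_lid` and that input format are hypothetical and are not taken from local/score.py.

```python
from collections import Counter


def score_lid(refs, hyps):
    """Return accuracy, macro accuracy, and error frequencies.

    refs and hyps map utterance ID -> language label (assumed input format).
    """
    lang_total = Counter()    # utterances per reference language
    lang_correct = Counter()  # correctly identified utterances per language
    errors = Counter()        # (reference, predicted) -> count
    for utt, ref in refs.items():
        hyp = hyps[utt]
        lang_total[ref] += 1
        if hyp == ref:
            lang_correct[ref] += 1
        else:
            errors[(ref, hyp)] += 1

    total = sum(lang_total.values())
    accuracy = sum(lang_correct.values()) / total
    # Macro accuracy: mean of per-language accuracies, so every language
    # counts equally regardless of how many utterances it has.
    per_lang = {lang: lang_correct[lang] / n for lang, n in lang_total.items()}
    macro_accuracy = sum(per_lang.values()) / len(per_lang)
    return accuracy, macro_accuracy, errors.most_common()


refs = {"u1": "eng", "u2": "eng", "u3": "eng", "u4": "deu"}
hyps = {"u1": "eng", "u2": "eng", "u3": "deu", "u4": "deu"}
accuracy, macro_accuracy, errors = score_lid(refs, hyps)
print(accuracy, macro_accuracy, errors)
```

On this toy input the overall accuracy is 3/4, while the macro accuracy is the mean of 2/3 (eng) and 1 (deu), which shows how the two diverge on unbalanced data.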

Data Preparation and Utilities:

Symlinks and Setup:

  • Symlinks (pyscripts, scripts, steps, utils) and setup.sh: Created symlinks to shared resources and added setup script for initializing the lid1 recipe directory.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds a new LID recipe template under egs2/TEMPLATE/lid1 with configuration files, run scripts, and utility scripts for reproducible LID experiments following ESPnet2 standards.

  • Added symlinked utility folders (utils, steps, scripts, pyscripts) referencing the asr1 recipe.
  • Introduced setup, scoring, and data copying scripts along with various job scheduler configuration files.

Reviewed Changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.

File Description
egs2/TEMPLATE/lid1/utils Points to asr1 utilities via a relative path.
egs2/TEMPLATE/lid1/steps Points to asr1 steps via a relative path.
egs2/TEMPLATE/lid1/setup.sh Adds environment setup with copying and symlink creation logic.
egs2/TEMPLATE/lid1/scripts Points to asr1 scripts via a relative path.
egs2/TEMPLATE/lid1/pyscripts Points to asr1 pyscripts via a relative path.
egs2/TEMPLATE/lid1/path.sh Sets up the PATH and other environment variables.
egs2/TEMPLATE/lid1/local/score.py Provides scoring logic for LID experiments.
egs2/TEMPLATE/lid1/local/copy_data_dir.sh Copies and maps data files with optional prefix/suffix modifications.
egs2/TEMPLATE/lid1/db.sh Declares download paths for various corpora.
egs2/TEMPLATE/lid1/conf/* Contains configuration files for different schedulers.
egs2/TEMPLATE/lid1/cmd.sh Configures command execution strategy for different backends.
egs2/TEMPLATE/lid1/README.md Documents the LID recipe flow and user instructions.
Comments suppressed due to low confidence (1)

egs2/TEMPLATE/lid1/local/copy_data_dir.sh:119

  • The variable 'spk_map' is used here, but it is never defined or created in the script. Consider using 'lang_map' or ensuring that 'spk_map' is properly initialized before this step.
  utils/apply_map.pl -f 1 $destdir/spk_map <$srcdir/spk2gender >$destdir/spk2gender

@sw005320
Contributor

@ftshijt, can you review this PR?
I took a look at this and it looks good to me.

@Fhrozen Fhrozen closed this Aug 29, 2025
@Fhrozen Fhrozen reopened this Aug 29, 2025
@Fhrozen
Member

Fhrozen commented Aug 29, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive new recipe template for Language Identification (LID), which is a great addition. The structure follows existing ESPnet conventions. However, my review identified several critical and high-severity issues. These include a script that will fail due to a missing file, bugs in the main recipe script related to incorrect logic and filenames, and the use of fragile file-renaming workarounds that could leave the system in an inconsistent state. Addressing these issues is crucial for ensuring the robustness and correctness of this new template.

Comment on lines +249 to +250
mv "data/${train_set}/utt2lang" "data/${train_set}/utt2spk"
mv "data/${train_set}/lang2utt" "data/${train_set}/spk2utt"
Contributor

high

Temporarily renaming utt2lang to utt2spk and lang2utt to spk2utt is a fragile practice. If the script fails during the speed perturbation stage, the data directory will be left in an inconsistent state. A more robust approach would be to modify local/perturb_lid_data_dir_speed.sh to handle utt2lang and lang2utt files directly, which would eliminate the need for these mv commands.

Contributor Author

I can modify the perturb_lid_data_dir_speed.sh script to replace utt2spk with utt2lang and spk2utt with lang2utt. However, lid.sh still relies on many util scripts that are originally designed for speaker-based setups, such as utils/combine_data.sh, utils/data/fix_data_dir.sh, utils/validate_data_dir.sh, and utils/subset_data_dir.sh. This leaves us with two options:

  1. Before calling these scripts, rename the files (mv ./utt2lang ./utt2spk; mv ./lang2utt ./spk2utt), run the scripts without any modifications, then rename the files back (mv ./utt2spk ./utt2lang; mv ./spk2utt ./lang2utt).

    • Pros: No need to duplicate or maintain modified versions of existing scripts.
    • Cons: If execution stops between renaming steps, the directory may become inconsistent. That said, such cases are rare, and users can simply clean up and rerun the stage if needed.
  2. Copy these scripts from the utils/ directory into the local folder (e.g., cp utils/fix_data_dir.sh local/fix_data_dir.sh), change all the utt2spk/spk2utt references in the copies to utt2lang/lang2utt, and call the modified copies in the recipe.

    • Pros: Keeps file naming (utt2lang/lang2utt) consistent at all times.
    • Cons: Creates repetitive code, and any updates to the upstream utils scripts would need to be manually synced.

Although option 1 may cause file inconsistency if the process breaks between renaming steps, I think such cases are rare, and users can simply delete the generated files and rerun the stage. Option 2, on the other hand, would create a lot of repetitive code, which may not be necessary. Therefore, I prefer to use option 1, and in other lid1 recipes, when I create data, I will also follow this style.
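
The inconsistency risk of option 1 could in principle be narrowed by always undoing the renames when the wrapped step fails. The following is a minimal sketch of that idea only, not what the recipe does: it assumes the utility step is driven from Python, and `as_speaker_files` is a hypothetical helper that exists neither in ESPnet nor in this PR. In the shell recipe itself, a `trap ... EXIT` handler would play a comparable role.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def as_speaker_files(data_dir):
    """Temporarily expose utt2lang/lang2utt under the Kaldi speaker file names."""
    pairs = [("utt2lang", "utt2spk"), ("lang2utt", "spk2utt")]
    renamed = []
    try:
        for lang_name, spk_name in pairs:
            os.rename(os.path.join(data_dir, lang_name),
                      os.path.join(data_dir, spk_name))
            renamed.append((lang_name, spk_name))
        yield data_dir
    finally:
        # Undo only the renames that actually happened, even if the body raised.
        for lang_name, spk_name in reversed(renamed):
            os.rename(os.path.join(data_dir, spk_name),
                      os.path.join(data_dir, lang_name))


# Demonstration on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as d:
    for name in ("utt2lang", "lang2utt"):
        with open(os.path.join(d, name), "w") as f:
            f.write("placeholder\n")
    try:
        with as_speaker_files(d):
            during = sorted(os.listdir(d))
            raise RuntimeError("simulated failure of a utility script")
    except RuntimeError:
        pass
    after = sorted(os.listdir(d))
print(during, after)
```

This restores the LID file names after an ordinary failure, but it cannot help if the process is killed outright, so the manual clean-up and rerun described above would still be the fallback.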

Comment on lines +140 to +144
mv $destdir/lang2utt $destdir/spk2utt
mv $destdir/utt2lang $destdir/utt2spk
utils/validate_data_dir.sh $validate_opts $destdir
mv $destdir/spk2utt $destdir/lang2utt
mv $destdir/utt2spk $destdir/utt2lang
Contributor

high

Renaming lang2utt to spk2utt and utt2lang to utt2spk just to run validate_data_dir.sh is a fragile approach. If the validation script fails, the data directory will be left in an inconsistent state. To improve robustness, please consider creating a LID-specific validation script or adapting validate_data_dir.sh to avoid these temporary file moves.

Contributor Author

The reason for keeping this style is the same as with perturb_lid_data_dir_speed.sh discussed earlier; please refer to the previous conversation.

Comment on lines +56 to +58
if [[ ! -f ${srcdir}/utt2spk ]]; then
echo "$0: no such file ${srcdir}/utt2spk"
exit 1;
Contributor

high

This script is intended for LID data but expects Kaldi-style speaker information files (utt2spk) instead of language information files (utt2lang). This forces the main lid.sh script to use a fragile mv hack. Please modify this script to work directly with utt2lang and lang2utt to improve robustness. You should check for utt2lang instead of utt2spk.

Suggested change
- if [[ ! -f ${srcdir}/utt2spk ]]; then
- echo "$0: no such file ${srcdir}/utt2spk"
- exit 1;
+ if [[ ! -f ${srcdir}/utt2lang ]]; then
+ echo "$0: no such file ${srcdir}/utt2lang"
+ exit 1;
+ fi

Contributor Author

No. Before this script is called, the file has already been renamed from utt2lang to utt2spk.

For more details, please refer to previous comments on perturb_lid_data_dir_speed.sh.

@sw005320
Contributor

@Qingzheng-Wang, please check Gemini's reviews and reflect on them

@ftshijt, can you review this PR?

@Qingzheng-Wang
Contributor Author

@Qingzheng-Wang, please check Gemini's reviews and reflect on them

@ftshijt, can you review this PR?

Yes, I’ve just addressed Gemini’s comments.

Collaborator

@ftshijt ftshijt left a comment

The current setup looks good to me in general. I just have a few potential additions that might be interesting to consider as a follow-up:

  • whether we can consider multiple languages in a single utterance (related to code-switching or long-audio processing scenarios)
  • Please also include the lid task in mini_an4 so that it is integrated into the CI

Comment on lines +366 to +369
for x in music noise speech; do
if [ -f data/musan_${x}.scp ]; then
cp data/musan_${x}.scp ${data_feats}/musan_${x}.scp
fi
Collaborator

Do we need to fix the augmentation corpus to MUSAN (e.g., what if we need another noise database for augmentation)?

Contributor Author

We can consider making this more flexible in the future. For now, it supports using Musan. In my experiments, the performance remains strong even without it.

Comment on lines +56 to +58
accuracy = correct / total
accuracy_per_lang = {lang: lang_correct[lang] / lang_total[lang] for lang in langs}
macro_accuracy = sum(accuracy_per_lang.values()) / len(langs)
Collaborator

While accuracy works well for a balanced set, shall we also consider precision, recall, etc.?

Contributor Author

Good suggestion. I've just added precision, recall, and F1 score computation.
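
For readers unfamiliar with how these metrics are defined per language, here is a minimal sketch. It assumes parallel lists of reference and predicted labels; the function name `precision_recall_f1` and the zero fallback for empty denominators are illustrative choices and may differ from what local/score.py actually does.

```python
from collections import Counter


def precision_recall_f1(refs, hyps):
    """Per-language (precision, recall, F1) from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, hyp in zip(refs, hyps):
        if ref == hyp:
            tp[ref] += 1
        else:
            fn[ref] += 1  # the reference language was missed
            fp[hyp] += 1  # the predicted language was a false alarm
    metrics = {}
    for lang in sorted(set(refs) | set(hyps)):
        p_den = tp[lang] + fp[lang]
        r_den = tp[lang] + fn[lang]
        # Assumption: report 0.0 when a denominator is empty.
        precision = tp[lang] / p_den if p_den else 0.0
        recall = tp[lang] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[lang] = (precision, recall, f1)
    return metrics


refs = ["eng", "eng", "eng", "deu"]
hyps = ["eng", "eng", "deu", "deu"]
metrics = precision_recall_f1(refs, hyps)
for lang, (p, r, f1) in metrics.items():
    print(f"{lang}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Unlike accuracy, these values separate the two error directions: here eng is never predicted wrongly (precision 1) but one eng utterance is missed (recall 2/3), while deu shows the opposite pattern.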

@Fhrozen
Member

Fhrozen commented Sep 2, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new recipe template for Language Identification (LID). The changes are comprehensive, adding all the necessary scripts and configurations for a new lid1 task. My review focuses on improving the robustness and correctness of the new scripts. I've identified several areas for improvement:

  • The shell scripts use fragile mv commands to temporarily rename files to be compatible with existing utility scripts. This can leave the data directory in an inconsistent state if the script is interrupted. I've suggested modifying the utility scripts to handle LID-specific files directly.
  • Some Python scripts do not correctly handle utterance IDs that may contain spaces, which could lead to parsing errors. I've provided suggestions to fix this.
  • The setup.sh script references a file that is missing from the pull request, which would cause it to fail.

Overall, this is a great addition, and addressing these points will make the new recipe template more robust and reliable.

@Qingzheng-Wang
Contributor Author

Qingzheng-Wang commented Sep 2, 2025

The current setup looks good to me in general. I just have a few potential additions that might be interesting to consider as a follow-up:

  • whether we can consider multiple languages in a single utterance (related to code-switching or long-audio processing scenarios)
  • Please also include the lid task in mini_an4 so that it is integrated into the CI

Thank you for the review!

Currently we only support one language per utterance. For code-switching cases or long audio scenarios, I think this would require language diarization techniques, which is an interesting direction to explore in the future.

I've also added the mini_an4 lid1 recipe in PR #6210.

@sw005320 sw005320 merged commit 80bf926 into espnet:master Sep 9, 2025
34 checks passed
@sw005320
Contributor

sw005320 commented Sep 9, 2025

Thanks!
Please move to the next one.

@Fhrozen Fhrozen mentioned this pull request Sep 11, 2025

Labels

ESPnet2 README Recipe size:XXL This PR changes 1000+ lines, ignoring generated files.

5 participants