for more information, see https://pre-commit.ci
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #6160 +/- ##
=======================================
Coverage 55.82% 55.82%
=======================================
Files 884 884
Lines 84012 84012
=======================================
Hits 46903 46903
Misses 37109 37109
This pull request introduces a new template for spoken language identification (
Recipe and Documentation Updates:
Job Scheduling and Configuration:
Scoring and Evaluation:
Data Preparation and Utilities:
Symlinks and Setup:
Pull Request Overview
This PR adds a new LID recipe template under egs2/TEMPLATE/lid1 with configuration files, run scripts, and utility scripts for reproducible LID experiments following ESPnet2 standards.
- Added symlinked utility folders (utils, steps, scripts, pyscripts) referencing the asr1 recipe.
- Introduced setup, scoring, and data copying scripts along with various job scheduler configuration files.
Reviewed Changes
Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| egs2/TEMPLATE/lid1/utils | Points to asr1 utilities via a relative path. |
| egs2/TEMPLATE/lid1/steps | Points to asr1 steps via a relative path. |
| egs2/TEMPLATE/lid1/setup.sh | Adds environment setup with copying and symlink creation logic. |
| egs2/TEMPLATE/lid1/scripts | Points to asr1 scripts via a relative path. |
| egs2/TEMPLATE/lid1/pyscripts | Points to asr1 pyscripts via a relative path. |
| egs2/TEMPLATE/lid1/path.sh | Sets up the PATH and other environment variables. |
| egs2/TEMPLATE/lid1/local/score.py | Provides scoring logic for LID experiments. |
| egs2/TEMPLATE/lid1/local/copy_data_dir.sh | Copies and maps data files with optional prefix/suffix modifications. |
| egs2/TEMPLATE/lid1/db.sh | Declares download paths for various corpora. |
| egs2/TEMPLATE/lid1/conf/* | Contains configuration files for different schedulers. |
| egs2/TEMPLATE/lid1/cmd.sh | Configures command execution strategy for different backends. |
| egs2/TEMPLATE/lid1/README.md | Documents the LID recipe flow and user instructions. |
Comments suppressed due to low confidence (1)
egs2/TEMPLATE/lid1/local/copy_data_dir.sh:119
- The variable 'spk_map' is used here, but it is never defined or created in the script. Consider using 'lang_map' or ensuring that 'spk_map' is properly initialized before this step.
utils/apply_map.pl -f 1 $destdir/spk_map <$srcdir/spk2gender >$destdir/spk2gender
Change local files from copies of the template to symlinks to the template.
@ftshijt, can you review this PR?
/gemini review
Code Review
This pull request introduces a comprehensive new recipe template for Language Identification (LID), which is a great addition. The structure follows existing ESPnet conventions. However, my review identified several critical and high-severity issues. These include a script that will fail due to a missing file, bugs in the main recipe script related to incorrect logic and filenames, and the use of fragile file-renaming workarounds that could leave the system in an inconsistent state. Addressing these issues is crucial for ensuring the robustness and correctness of this new template.
mv "data/${train_set}/utt2lang" "data/${train_set}/utt2spk"
mv "data/${train_set}/lang2utt" "data/${train_set}/spk2utt"
Temporarily renaming utt2lang to utt2spk and lang2utt to spk2utt is a fragile practice. If the script fails during the speed perturbation stage, the data directory will be left in an inconsistent state. A more robust approach would be to modify local/perturb_lid_data_dir_speed.sh to handle utt2lang and lang2utt files directly, which would eliminate the need for these mv commands.
I can modify the perturb_lid_data_dir_speed.sh script to replace utt2spk with utt2lang and spk2utt with lang2utt. However, lid.sh still relies on many utility scripts that were originally designed for speaker-based setups, such as utils/combine_data.sh, utils/data/fix_data_dir.sh, utils/validate_data_dir.sh, and utils/subset_data_dir.sh. This leaves us with two options:
1. Before using these scripts, run mv ./utt2lang ./utt2spk and mv ./lang2utt ./spk2utt, use the scripts without any modifications, then move the files back with mv ./utt2spk ./utt2lang and mv ./spk2utt ./lang2utt.
   - Pros: No need to duplicate or maintain modified versions of existing scripts.
   - Cons: If execution stops between renaming steps, the directory may become inconsistent. That said, such cases are rare, and users can simply clean up and rerun the stage if needed.
2. Modify these scripts: copy them from the utils/ directory into the local folder (e.g., cp utils/fix_data_dir.sh local/fix_data_dir.sh), change all the utt2spk/spk2utt references in the copied version to utt2lang/lang2utt, and call the modified versions in the recipe.
   - Pros: Keeps file naming (utt2lang/lang2utt) consistent at all times.
   - Cons: Creates repetitive code, and any updates to the upstream utils scripts would need to be manually synced.
Although option 1 may cause file inconsistency if the process breaks between renaming steps, I think such cases are rare, and users can simply delete the generated files and rerun the stage. Option 2, on the other hand, would create a lot of repetitive code, which may not be necessary. Therefore, I prefer to use option 1, and in other lid1 recipes, when I create data, I will also follow this style.
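For reference, the main drawback of option 1 (an inconsistent directory after an interrupted stage) can be reduced by making the restore step unconditional. The snippet below is only a sketch of that idea in Python, not what lid.sh does: the `kaldi_style_names` helper, the `LID_TO_KALDI` mapping, and the `demo` function are hypothetical names introduced for illustration. In the bash recipe itself, the closest equivalent would be a `trap` that moves the files back on exit.

```python
import contextlib
import os
import tempfile

# Hypothetical mapping: LID file name -> Kaldi-style name the utils scripts expect.
LID_TO_KALDI = {"utt2lang": "utt2spk", "lang2utt": "spk2utt"}


@contextlib.contextmanager
def kaldi_style_names(data_dir, mapping=LID_TO_KALDI):
    """Temporarily rename LID files, restoring them even if the body raises."""
    renamed = []  # (kaldi_path, lid_path) pairs that were actually moved
    try:
        for lid_name, kaldi_name in mapping.items():
            lid_path = os.path.join(data_dir, lid_name)
            kaldi_path = os.path.join(data_dir, kaldi_name)
            if os.path.exists(lid_path):
                os.rename(lid_path, kaldi_path)
                renamed.append((kaldi_path, lid_path))
        yield data_dir
    finally:
        # Undo in reverse order; skip files the wrapped step deleted.
        for kaldi_path, lid_path in reversed(renamed):
            if os.path.exists(kaldi_path):
                os.rename(kaldi_path, lid_path)


def demo():
    """Show that names are restored even when the wrapped step fails."""
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "utt2lang"), "w").close()
        try:
            with kaldi_style_names(d):
                seen = sorted(os.listdir(d))  # file names while renamed
                raise RuntimeError("simulated failure of a utils script")
        except RuntimeError:
            pass
        return seen, sorted(os.listdir(d))
```

This keeps the "no duplicated utils scripts" benefit of option 1. It does not protect against a hard kill of the process, so cleaning up and rerunning the stage remains the fallback in that case.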
mv $destdir/lang2utt $destdir/spk2utt
mv $destdir/utt2lang $destdir/utt2spk
utils/validate_data_dir.sh $validate_opts $destdir
mv $destdir/spk2utt $destdir/lang2utt
mv $destdir/utt2spk $destdir/utt2lang
Renaming lang2utt to spk2utt and utt2lang to utt2spk just to run validate_data_dir.sh is a fragile approach. If the validation script fails, the data directory will be left in an inconsistent state. To improve robustness, please consider creating a LID-specific validation script or adapting validate_data_dir.sh to avoid these temporary file moves.
The reason for keeping this style is the same as with perturb_lid_data_dir_speed.sh discussed earlier; please refer to the previous conversation.
if [[ ! -f ${srcdir}/utt2spk ]]; then
  echo "$0: no such file ${srcdir}/utt2spk"
  exit 1;
This script is intended for LID data but expects Kaldi-style speaker information files (utt2spk) instead of language information files (utt2lang). This forces the main lid.sh script to use a fragile mv hack. Please modify this script to work directly with utt2lang and lang2utt to improve robustness. You should check for utt2lang instead of utt2spk.
- if [[ ! -f ${srcdir}/utt2spk ]]; then
-   echo "$0: no such file ${srcdir}/utt2spk"
-   exit 1;
+ if [[ ! -f ${srcdir}/utt2lang ]]; then
+   echo "$0: no such file ${srcdir}/utt2lang"
+   exit 1;
+ fi
No. Before this script is called, the utt2lang file has already been renamed to utt2spk.
For more details, please refer to previous comments on perturb_lid_data_dir_speed.sh.
@Qingzheng-Wang, please check Gemini's reviews and address them.
@ftshijt, can you review this PR?
Yes, I've just addressed Gemini's comments.
ftshijt left a comment
The current setup looks good to me in general. I just have a few potential additions that might be interesting to consider as a follow-up:
- Whether we can consider multiple languages in a single utterance (related to code-switching or long audio processing scenarios).
- Please also include the lid task in mini_an4 to integrate it into the CI.
for x in music noise speech; do
  if [ -f data/musan_${x}.scp ]; then
    cp data/musan_${x}.scp ${data_feats}/musan_${x}.scp
  fi
Do we need to hard-code MUSAN for augmentation (e.g., what if we need another noise database for augmentation)?
We can consider making this more flexible in the future. For now, it supports using Musan. In my experiments, the performance remains strong even without it.
accuracy = correct / total
accuracy_per_lang = {lang: lang_correct[lang] / lang_total[lang] for lang in langs}
macro_accuracy = sum(accuracy_per_lang.values()) / len(langs)
Accuracy would be good for a balanced set. Shall we also consider precision, recall, etc.?
Good suggestion. I've just added precision, recall, and F1 score computation.
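For readers following along, here is a rough sketch of how per-language precision, recall, and F1 could be computed from reference and predicted labels. It is an illustration under assumptions, not the code in local/score.py: the function name `lid_metrics`, its arguments, and the returned dictionary layout are all hypothetical. Note that per-language recall is the same quantity as `accuracy_per_lang` in the snippet above, so macro recall matches `macro_accuracy`.

```python
from collections import Counter


def lid_metrics(refs, hyps):
    """Compute accuracy plus per-language and macro precision/recall/F1.

    refs, hyps: equal-length lists of reference and predicted language labels.
    """
    assert refs and len(refs) == len(hyps), "need non-empty, aligned labels"
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, hyp in zip(refs, hyps):
        if ref == hyp:
            tp[ref] += 1
        else:
            fp[hyp] += 1  # hyp was predicted but is wrong
            fn[ref] += 1  # ref was missed

    def ratio(num, den):
        # Define undefined ratios (zero denominator) as 0.0.
        return num / den if den else 0.0

    langs = sorted(set(refs))  # macro-average over reference languages only
    per_lang = {}
    for lang in langs:
        p = ratio(tp[lang], tp[lang] + fp[lang])
        r = ratio(tp[lang], tp[lang] + fn[lang])
        per_lang[lang] = {
            "precision": p,
            "recall": r,
            "f1": ratio(2 * p * r, p + r),
        }

    macro = {
        key: sum(m[key] for m in per_lang.values()) / len(langs)
        for key in ("precision", "recall", "f1")
    }
    return {
        "accuracy": sum(tp.values()) / len(refs),
        "per_lang": per_lang,
        "macro": macro,
    }
```

Macro-averaging weights every language equally, which is why it is more informative than plain accuracy when the evaluation set is imbalanced across languages.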
/gemini review
Code Review
This pull request introduces a new recipe template for Language Identification (LID). The changes are comprehensive, adding all the necessary scripts and configurations for a new lid1 task. My review focuses on improving the robustness and correctness of the new scripts. I've identified several areas for improvement:
- The shell scripts use fragile mv commands to temporarily rename files to be compatible with existing utility scripts. This can leave the data directory in an inconsistent state if the script is interrupted. I've suggested modifying the utility scripts to handle LID-specific files directly.
- Some Python scripts do not correctly handle utterance IDs that may contain spaces, which could lead to parsing errors. I've provided suggestions to fix this.
- The setup.sh script references a file that is missing from the pull request, which would cause it to fail.
Overall, this is a great addition, and addressing these points will make the new recipe template more robust and reliable.
Thank you for the review! Currently we only support one language per utterance. For code-switching cases or long audio scenarios, I think this would require language diarization techniques, which is an interesting direction to explore in the future. I've also added the mini_an4 lid1 recipe in PR #6210.
Thanks!
What did you change?
Added a new LID recipe template under egs2/TEMPLATE/lid1, including config files and basic run scripts for reproducible experiments.
Why did you make this change?
This provides a ready-to-use starting point for LID experiments, aligned with ESPnet2 standards.
Is your PR small enough?
This PR is slightly over 1000 lines, but keeping it unified is better for clarity since the components are tightly coupled.
Additional Context