LID-9: Geolocation-aware LID recipe and codes#6212
Qingzheng-Wang wants to merge 48 commits into espnet:master
Conversation
for more information, see https://pre-commit.ci
This pull request introduces a new geolocation-aware spoken language identification (LID) recipe in ESPnet, along with detailed documentation, job scheduling configurations, and multiple training configurations. The main contributions are a robust, geolocation-conditioned LID model, comprehensive usage instructions, and support for various cluster environments. The changes fall into three areas:

- Documentation and Usage
- Job Scheduling and Cluster Support
- Model Configuration
This would be the last PR.
@Qingzheng-Wang, @brianyan918 wants to use this.

Sure!
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #6212       +/-   ##
===========================================
+ Coverage        0    56.71%    +56.71%
===========================================
  Files           0       892       +892
  Lines           0     84667     +84667
===========================================
+ Hits            0     48022     +48022
- Misses          0     36645     +36645
```
/gemini review
Code Review
This PR introduces a comprehensive geolocation-aware language identification recipe. The implementation is extensive, covering new model architectures, loss functions, frontend components, and a complete set of data preparation and training scripts. My review focuses on the core implementation and script correctness. I've identified several critical issues related to script failures (missing imports, filename mismatches, hardcoded paths) and potential runtime errors in the model logic. There are also high-severity concerns regarding maintainability, such as dependencies on personal forks and violations of class interface contracts. Addressing these points will significantly improve the robustness and long-term viability of this new recipe.
```bash
    log "Directory dump/raw/train_babel_over_10s_lang exists."
else
    log "Directory dump/raw/train_babel_over_10s_lang does not exist. Running local/filter_babel_train.sh."
    local/filter_babel_train.sh || exit 1
```
This script calls local/filter_babel_train.sh, but this file does not exist in the pull request. It seems the intended script is local/filter_babel_over_10s.sh. Please correct the script name to avoid a command not found error.
Suggested change:
```diff
-local/filter_babel_train.sh || exit 1
+local/filter_babel_over_10s.sh || exit 1
```
```bash
. utils/parse_options.sh || exit 1;

for split in $splits; do
    python local/filter_babel_over_10s.py --babel_dir $dump_dir/$split --babel_over_10s_dir $dump_dir/$split_over_10s
```
This script attempts to execute local/filter_babel_over_10s.py, but the Python script added in this pull request is named local/filter_babel_overl_10s.py (with an extra 'l'). This filename mismatch will cause a FileNotFoundError and prevent the script from running. Please correct the name of the Python script to match the one being called here, or update the call to use the correct filename.
```python
import os


def parse_args():
```
The argparse module is used to parse command-line arguments, but it is not imported in this script. This will result in a NameError when the script is executed. Please add import argparse to the top of the file.
Suggested change:
```diff
 import os
+import argparse

 def parse_args():
```
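For reference, a minimal self-contained version of the parser could look like the following. The argument names are taken from the call site quoted above; the description and help texts are illustrative:

```python
import argparse


def parse_args(argv=None):
    """Parse the filter script's command-line arguments.

    Passing `argv` explicitly makes the function easy to test;
    `parse_args(None)` falls back to `sys.argv` as usual.
    """
    parser = argparse.ArgumentParser(
        description="Filter BABEL utterances longer than 10 seconds."
    )
    parser.add_argument("--babel_dir", required=True, help="input dump directory")
    parser.add_argument(
        "--babel_over_10s_dir",
        required=True,
        help="output directory for utterances over 10 seconds",
    )
    return parser.parse_args(argv)


args = parse_args(
    ["--babel_dir", "dump/raw/dev", "--babel_over_10s_dir", "dump/raw/dev_over_10s"]
)
```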
```bash
local train_list="$4"
local dev_list="$5"

local base_dir="/scratch/bbjs/shared/corpora/babel/${lang_code}/conversational"
```
The base_dir is hardcoded to an absolute path /scratch/bbjs/shared/corpora/babel/.... This makes the script non-portable and will cause it to fail on any machine where this path does not exist. It should use the $dataset_path variable, which is passed as an argument to the script, to construct the path.
Suggested change:
```diff
-local base_dir="/scratch/bbjs/shared/corpora/babel/${lang_code}/conversational"
+local base_dir="${dataset_path}/${lang_code}/conversational"
```
```python
import os
import sys
import traceback
```
```python
loss = (
    1 - self.loss.lang2vec_weight
) * lid_class_loss_all + self.loss.lang2vec_weight * lang2vec_loss_all
```
There is a potential TypeError in this loss calculation. lang2vec_loss_all can be None if lang2vec_loss is None (e.g., if lang2vec_weight is not configured in the loss module). If lang2vec_loss_all is None, the multiplication self.loss.lang2vec_weight * lang2vec_loss_all will raise a TypeError. You should add a check to ensure lang2vec_loss_all is not None before performing this calculation.
Suggested change:
```diff
-loss = (
-    1 - self.loss.lang2vec_weight
-) * lid_class_loss_all + self.loss.lang2vec_weight * lang2vec_loss_all
+if lang2vec_loss_all is not None:
+    loss = (
+        1 - self.loss.lang2vec_weight
+    ) * lid_class_loss_all + self.loss.lang2vec_weight * lang2vec_loss_all
+else:
+    loss = lid_class_loss_all
```
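The guarded combination can also be sketched in isolation (the function name is illustrative, and plain floats stand in for tensors; the arithmetic is the same):

```python
def combine_losses(lid_class_loss, lang2vec_loss, lang2vec_weight):
    # Guard the optional lang2vec term: fall back to the classification
    # loss alone when no lang2vec loss was computed, avoiding the
    # TypeError from multiplying a weight by None.
    if lang2vec_loss is None:
        return lid_class_loss
    return (1 - lang2vec_weight) * lid_class_loss + lang2vec_weight * lang2vec_loss
```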
This project requires **modified versions** of s3prl and transformers for geolocation conditioning functionality.

**Install Modified s3prl:**
```bash
# If you have already installed s3prl, please uninstall it first
pip uninstall s3prl  # (Optional if already installed)

# Clone and install the modified version
git clone -b lid https://github.com/Qingzheng-Wang/s3prl.git
cd s3prl
pip install -e .
cd ..
```

**Install Modified Transformers:**
```bash
# If you have already installed transformers, please uninstall it first
pip uninstall transformers  # (Optional if already installed)

# Clone and install the modified version
git clone -b v4.51.3-qingzheng https://github.com/Qingzheng-Wang/transformers.git
cd transformers
pip install -e .
cd ..
```
The installation instructions rely on personal forks of s3prl and transformers from a personal GitHub account. This introduces a significant maintainability and reproducibility risk for the project. If these forks are changed or deleted, it will break this recipe. For long-term stability, these modified dependencies should be vendored within ESPnet, merged upstream, or maintained in a repository under the espnet organization.
@sw005320 I am wondering about this particular issue. Is it required to use a cloned repo? How many functions/classes were changed? Is it not possible to add them to espnet2 directly? Also, in the case of transformers, is it not possible to use custom models and import them via AutoClass registration? https://huggingface.co/docs/transformers/custom_models. Personal repos may complicate and limit future releases.
I agree.
This is not our normal practice.
@Qingzheng-Wang, please avoid these forks:
- s3prl: you can make a PR to them
- transformers: you can try Nelson's suggestion
OK, I’ll try to work on this, though it may take quite a bit of time for me. I’ll open a PR to the S3PRL repo and see whether they accept it.
Please add me to the PR.
If the change is reasonable, I expect they’ll address it quickly.
If not, we can create a fork under the ESPnet organization and maintain it ourselves.
Before deciding, I’d like to review the changes you made in your PR.
Got it, thank you!
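As a rough sketch of the AutoClass route mentioned in the thread above (the class names and the `geolid` model type are hypothetical, not part of this PR), a custom model can be registered without forking transformers:

```python
import torch
from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel


class GeoLidConfig(PretrainedConfig):
    model_type = "geolid"  # made-up model type for this sketch

    def __init__(self, hidden_size=32, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size


class GeoLidModel(PreTrainedModel):
    config_class = GeoLidConfig

    def __init__(self, config):
        super().__init__(config)
        self.proj = torch.nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, feats):
        return self.proj(feats)


# Make the custom classes resolvable through the Auto API.
AutoConfig.register("geolid", GeoLidConfig)
AutoModel.register(GeoLidConfig, GeoLidModel)

model = AutoModel.from_config(GeoLidConfig())
```

Combined with `save_pretrained`/`from_pretrained`, the conditioning code could then live inside espnet2 rather than in a personal fork.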
```python
        uttid = uttid.replace("hin", lang)
        break
else:
    raise ValueError(f"{indian_langs} not in {uttid}")
```
The script raises a ValueError if a hin utterance ID does not contain one of ["tam", "tel", "guj"]. This assumption might be too strong and could cause the script to fail if there are legitimate Hindi utterances that don't follow this naming convention. It would be more robust to log a warning and skip the utterance instead of crashing.
Suggested change:
```diff
-raise ValueError(f"{indian_langs} not in {uttid}")
+# If no Indian language code is found in the utterance ID for a 'hin' entry,
+# we assume it's a standard Hindi utterance and do nothing.
+# A warning could be logged here if this is an unexpected case.
+pass
```
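A warn-and-keep variant of that loop could look like the following (the function name and fallthrough behavior are illustrative; the language list comes from the review comment):

```python
import logging

logger = logging.getLogger(__name__)

INDIAN_LANGS = ["tam", "tel", "guj"]


def relabel_hindi_uttid(uttid):
    """Swap the 'hin' tag for a matching Indian language code, if present."""
    for lang in INDIAN_LANGS:
        if lang in uttid:
            return uttid.replace("hin", lang)
    # No match: warn and keep the original ID instead of raising ValueError.
    logger.warning("%s not found in %s; keeping 'hin'", INDIAN_LANGS, uttid)
    return uttid
```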
```python
def forward(
    self, input: torch.Tensor, input_lengths: torch.Tensor, labels: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
```
The forward method signature forward(self, input: torch.Tensor, input_lengths: torch.Tensor, labels: torch.Tensor) is incompatible with the base class AbsFrontend, which defines forward(self, input: torch.Tensor, input_lengths: torch.Tensor). This violates the Liskov Substitution Principle and can lead to unexpected errors if this frontend is used in a context expecting a standard AbsFrontend. The labels should be handled within the main model, not passed through the frontend, to maintain proper separation of concerns.
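To illustrate the contract this comment describes, here is a simplified sketch (plain Python objects stand in for tensors, and all class names other than `AbsFrontend` are placeholders): the frontend keeps the two-argument signature, and labels enter only at the model level.

```python
from abc import ABC, abstractmethod


class AbsFrontend(ABC):
    """Simplified stand-in for espnet2's frontend contract."""

    @abstractmethod
    def forward(self, input, input_lengths):
        """Frontends see only features and lengths -- no labels."""


class ConditionFrontend(AbsFrontend):
    def forward(self, input, input_lengths):
        # Feature extraction only; no label handling here.
        return input, input_lengths


class LidModel:
    """Labels are consumed here, so any AbsFrontend can be plugged in."""

    def __init__(self, frontend):
        self.frontend = frontend

    def forward(self, speech, speech_lengths, labels):
        feats, feats_lengths = self.frontend.forward(speech, speech_lengths)
        # Conditioning on `labels` happens at the model level, after the
        # frontend call, preserving the base-class signature.
        return feats, feats_lengths, labels
```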
```python
assert (
    0 < self.lang2vec_weight < 1
), f"lang2vec_weight should be in (0, 1), but got {self.lang2vec_weight}"
```
The assertion 0 < self.lang2vec_weight < 1 is too strict. It prevents setting lang2vec_weight to 0 or 1, which are valid scenarios to disable the language vector loss or the classification loss, respectively. This can be useful for ablation studies or for using the model in a classification-only mode. The assertion should be relaxed to 0 <= self.lang2vec_weight <= 1.
Suggested change:
```diff
-assert (
-    0 < self.lang2vec_weight < 1
-), f"lang2vec_weight should be in (0, 1), but got {self.lang2vec_weight}"
+assert (
+    0 <= self.lang2vec_weight <= 1
+), f"lang2vec_weight should be in [0, 1], but got {self.lang2vec_weight}"
```
|
This PR is stale because it has been open for 90 days with no activity.
What did you change?
This PR adds a complete geolocation-aware language identification recipe and supporting infrastructure.
Core Implementation:
- `espnet2/lid/espnet_model_upstream_condition.py`: ESPnet LID model with upstream lang2vec conditioning and downstream prediction
- `espnet2/lid/frontend/s3prl_condition.py`: Modified S3PRL frontend supporting geolocation-aware conditioning
- `espnet2/lid/loss/aamsoftmax_sc_topk_lang2vec.py`: AAMSoftmax loss with lang2vec prediction (supporting geo, phonology_knn, syntax_knn, inventory_knn)

Recipe:
- `egs2/geolid/lid1/`: Complete recipe with data preparation, training scripts, and configurations

Why did you make this change?
This is an implementation of our ASRU paper "Geolocation-Aware Robust Spoken Language Identification". Our proposed geolocation-aware LID improves the robustness of SSL-based LID systems to dialectal and accented variation.
Is your PR small enough?
Not exactly, but the majority consists of data preparation scripts that should be added together, and the core implementation components are interdependent.
Additional Context