
LID-9: Geolocation-aware LID recipe and codes #6212

Open

Qingzheng-Wang wants to merge 48 commits into espnet:master from Qingzheng-Wang:lid_release9

Conversation

@Qingzheng-Wang (Contributor)

What did you change?

This PR adds a complete geolocation-aware language identification recipe and supporting infrastructure.

Core Implementation:

  • espnet2/lid/espnet_model_upstream_condition.py: ESPnet LID model with upstream lang2vec conditioning and downstream prediction
  • espnet2/lid/frontend/s3prl_condition.py: Modified S3PRL frontend supporting geolocation-aware conditioning
  • espnet2/lid/loss/aamsoftmax_sc_topk_lang2vec.py: AAMSoftmax loss with lang2vec prediction (supporting geo, phonology_knn, syntax_knn, inventory_knn)
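To make the conditioning idea concrete, here is a minimal pure-Python sketch of what "upstream conditioning" on a lang2vec-style vector amounts to: project the vector into the feature dimension and add it to every frame. The function names, list-based tensors, and shapes are illustrative assumptions, not the PR's actual API.

```python
def project(geo_vec, weights):
    """Project a length-G lang2vec-style vector into the feature dimension D.

    weights is a D x G matrix given as nested lists.
    """
    return [sum(w * g for w, g in zip(row, geo_vec)) for row in weights]


def condition(features, geo_vec, weights):
    """Add the projected geolocation vector to every frame of a T x D feature map."""
    offset = project(geo_vec, weights)
    return [[f + o for f, o in zip(frame, offset)] for frame in features]
```

In the real model the projection would be a learned (possibly frozen) linear layer and the features would be intermediate SSL-layer outputs; this sketch only shows the broadcast-and-add structure.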

Recipe:

  • egs2/geolid/lid1/: Complete recipe with data preparation, training scripts, and configurations
  • Support for VoxLingua107-only and combined-dataset training (five datasets across multiple domains, 157 languages, 9,865 hours)
  • Multiple model configurations: independent/shared conditioning, frozen/trainable conditioning projections

Why did you make this change?

This PR implements our ASRU paper Geolocation-Aware Robust Spoken Language Identification. The proposed geolocation-aware LID improves the robustness of SSL-based LID systems to dialectal and accented variation.


Is your PR small enough?

Not exactly, but most of the diff consists of data preparation scripts that belong together, and the core implementation components are interdependent.


Additional Context

@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. ESPnet2 New Features labels Aug 20, 2025
@mergify mergify bot added the README label Aug 20, 2025
@Fhrozen Fhrozen requested a review from Copilot August 21, 2025 12:59
@Fhrozen (Member)

Fhrozen commented Aug 21, 2025

This pull request introduces a new geolocation-aware spoken language identification (LID) recipe in ESPnet, along with detailed documentation, job scheduling configurations, and multiple training configurations. The main contributions are the addition of a robust, geolocation-conditioned LID model, comprehensive usage instructions, and support for various cluster environments.

Documentation and Usage:

  • Added a comprehensive README.md for the geolid/lid1 recipe, detailing the model's geolocation-aware innovations, installation of required dependencies, usage instructions, datasets, evaluation results, and citation information.

Job Scheduling and Cluster Support:

  • Added and documented cmd.sh to support multiple job scheduling backends (local, SGE, Slurm, PBS, SSH, JHU), enabling flexible training and inference on different compute environments.
  • Added default configuration files for SGE (conf/queue.conf), Slurm (conf/slurm.conf), and PBS (conf/pbs.conf) to facilitate job submission and resource allocation on various cluster types. [1] [2] [3]

Model Configuration:

  • Added a training configuration for the combined dataset setup (conf/combined/mms_ecapa_upcon_32_44_it0.4_shared_trainable.yaml), featuring shared geolocation conditioning modules, multilayer features, and advanced training strategies.
  • Added a training configuration for VoxLingua107-only with independent and frozen conditioning modules (conf/voxlingua107_only/mms_ecapa_upcon_32_44_it0.4_independent_frozen.yaml), supporting reproducibility and ablation studies.

@Fhrozen Fhrozen modified the milestones: v.202509, v.202512 Sep 12, 2025
@sw005320 (Contributor)

This would be the last PR.
Please continue working on it.

@sw005320 (Contributor)

@Qingzheng-Wang, @brianyan918 wants to use this.
Can you accelerate this PR?

@Qingzheng-Wang (Contributor, Author)

Qingzheng-Wang commented Sep 24, 2025

@Qingzheng-Wang, @brianyan918 wants to use this. Can you accelerate this PR?

Sure!

@codecov

codecov bot commented Sep 24, 2025

Codecov Report

❌ Patch coverage is 40.90909% with 195 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.71%. Comparing base (265913a) to head (06c8961).
⚠️ Report is 749 commits behind head on master.

Files with missing lines Patch % Lines
espnet2/lid/espnet_model_upstream_condition.py 19.64% 90 Missing ⚠️
espnet2/lid/frontend/s3prl_condition.py 31.81% 45 Missing ⚠️
espnet2/train/preprocessor.py 26.47% 25 Missing ⚠️
espnet2/lid/loss/aamsoftmax_sc_topk_lang2vec.py 78.12% 21 Missing ⚠️
espnet2/tasks/lid.py 36.36% 14 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6212       +/-   ##
===========================================
+ Coverage        0   56.71%   +56.71%     
===========================================
  Files           0      892      +892     
  Lines           0    84667    +84667     
===========================================
+ Hits            0    48022    +48022     
- Misses          0    36645    +36645     
Flag Coverage Δ
test_integration_espnet2 46.63% <19.69%> (?)
test_integration_espnetez 36.91% <2.94%> (?)
test_python_espnet2 51.14% <37.27%> (?)
test_python_espnetez 12.76% <0.30%> (?)
test_utils 18.77% <ø> (?)

Flags with carried forward coverage won't be shown.

@Fhrozen (Member)

Fhrozen commented Sep 25, 2025

/gemini review

@gemini-code-assist (bot) left a comment

Code Review

This PR introduces a comprehensive geolocation-aware language identification recipe. The implementation is extensive, covering new model architectures, loss functions, frontend components, and a complete set of data preparation and training scripts. My review focuses on the core implementation and script correctness. I've identified several critical issues related to script failures (missing imports, filename mismatches, hardcoded paths) and potential runtime errors in the model logic. There are also high-severity concerns regarding maintainability, such as dependencies on personal forks and violations of class interface contracts. Addressing these points will significantly improve the robustness and long-term viability of this new recipe.

log "Directory dump/raw/train_babel_over_10s_lang exists."
else
log "Directory dump/raw/train_babel_over_10s_lang does not exist. Running local/filter_babel_train.sh."
local/filter_babel_train.sh || exit 1

critical

This script calls local/filter_babel_train.sh, but this file does not exist in the pull request. It seems the intended script is local/filter_babel_over_10s.sh. Please correct the script name to avoid a command not found error.

Suggested change
- local/filter_babel_train.sh || exit 1
+ local/filter_babel_over_10s.sh || exit 1

. utils/parse_options.sh || exit 1;

for split in $splits; do
python local/filter_babel_over_10s.py --babel_dir $dump_dir/$split --babel_over_10s_dir $dump_dir/$split_over_10s

critical

This script attempts to execute local/filter_babel_over_10s.py, but the Python script added in this pull request is named local/filter_babel_overl_10s.py (with an extra 'l'). This filename mismatch will cause a FileNotFoundError and prevent the script from running. Please correct the name of the Python script to match the one being called here, or update the call to use the correct filename.

Comment on lines +2 to +5
import os


def parse_args():

critical

The argparse module is used to parse command-line arguments, but it is not imported in this script. This will result in a NameError when the script is executed. Please add import argparse to the top of the file.

Suggested change
  import os
+ import argparse


  def parse_args():

local train_list="$4"
local dev_list="$5"

local base_dir="/scratch/bbjs/shared/corpora/babel/${lang_code}/conversational"

critical

The base_dir is hardcoded to an absolute path /scratch/bbjs/shared/corpora/babel/.... This makes the script non-portable and will cause it to fail on any machine where this path does not exist. It should use the $dataset_path variable, which is passed as an argument to the script, to construct the path.

Suggested change
- local base_dir="/scratch/bbjs/shared/corpora/babel/${lang_code}/conversational"
+ local base_dir="${dataset_path}/${lang_code}/conversational"

Comment on lines +1 to +4
import os
import sys
import traceback


critical

The argparse module is used to parse command-line arguments, but it is not imported. This will cause a NameError at runtime. Please add import argparse at the beginning of the file.

Suggested change
+ import argparse
  import os
  import sys
  import traceback

Comment on lines +310 to +312
loss = (
1 - self.loss.lang2vec_weight
) * lid_class_loss_all + self.loss.lang2vec_weight * lang2vec_loss_all

critical

There is a potential TypeError in this loss calculation. lang2vec_loss_all can be None if lang2vec_loss is None (e.g., if lang2vec_weight is not configured in the loss module). If lang2vec_loss_all is None, the multiplication self.loss.lang2vec_weight * lang2vec_loss_all will raise a TypeError. You should add a check to ensure lang2vec_loss_all is not None before performing this calculation.

Suggested change
- loss = (
-     1 - self.loss.lang2vec_weight
- ) * lid_class_loss_all + self.loss.lang2vec_weight * lang2vec_loss_all
+ if lang2vec_loss_all is not None:
+     loss = (
+         1 - self.loss.lang2vec_weight
+     ) * lid_class_loss_all + self.loss.lang2vec_weight * lang2vec_loss_all
+ else:
+     loss = lid_class_loss_all

Comment on lines +20 to +44
This project requires **modified versions** of s3prl and transformers for geolocation conditioning functionality.

**Install Modified s3prl:**
```bash
# If you have already installed s3prl, please uninstall it first
pip uninstall s3prl # (Optional if already installed)

# Clone and install the modified version
git clone -b lid https://github.com/Qingzheng-Wang/s3prl.git
cd s3prl
pip install -e .
cd ..
```

**Install Modified Transformers:**
```bash
# If you have already installed transformers, please uninstall it first
pip uninstall transformers # (Optional if already installed)

# Clone and install the modified version
git clone -b v4.51.3-qingzheng https://github.com/Qingzheng-Wang/transformers.git
cd transformers
pip install -e .
cd ..
```

high

The installation instructions rely on personal forks of s3prl and transformers from a personal GitHub account. This introduces a significant maintainability and reproducibility risk for the project. If these forks are changed or deleted, it will break this recipe. For long-term stability, these modified dependencies should be vendored within ESPnet, merged upstream, or maintained in a repository under the espnet organization.

Member

@sw005320 I am wondering about this particular issue. Is a cloned repo really required? How many functions/classes were changed? Is it not possible to add them to espnet2 directly? Also, in the case of transformers, is it not possible to use custom models and import them using autoclass.register? https://huggingface.co/docs/transformers/custom_models. A private repo may complicate and limit future releases.
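The custom-model route suggested here boils down to a registry: transformers keeps a mapping from a model-type string to config/model classes, and registering adds your own pair so the Auto classes can instantiate it. A dependency-free sketch of that mechanism (all class and method names below are hypothetical stand-ins, not the transformers API itself):

```python
class ModelRegistry:
    """Minimal stand-in for the registry pattern behind transformers'
    AutoConfig/AutoModel registration."""

    _registry = {}

    @classmethod
    def register(cls, model_type, model_cls):
        # Map a string identifier to a model class.
        cls._registry[model_type] = model_cls

    @classmethod
    def from_type(cls, model_type, **kwargs):
        # Instantiate whatever class was registered under model_type.
        if model_type not in cls._registry:
            raise KeyError(f"unknown model_type: {model_type}")
        return cls._registry[model_type](**kwargs)


class GeoCondEncoder:
    """Hypothetical custom encoder carrying geolocation conditioning."""

    def __init__(self, hidden_size=256):
        self.hidden_size = hidden_size


ModelRegistry.register("geo_cond_encoder", GeoCondEncoder)
```

With the real library, the analogous step is registering a custom config and model class so they load through the Auto classes instead of requiring a forked transformers install.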

Contributor

I agree.
This is not our normal practice.

@Qingzheng-Wang, please avoid them

  • s3prl: You can make a PR to them
  • transformers: you can try Nelson's suggestions

Contributor Author

OK, I’ll try to work on this, though it may take quite a bit of time for me. I’ll open a PR to the S3PRL repo and see whether they accept it.

Contributor

Please add me to the PR.
If the change is reasonable, I expect they’ll address it quickly.
If not, we can create a fork under the ESPnet organization and maintain it ourselves.
Before deciding, I’d like to review the changes you made in your PR.

Contributor Author

Got it, thank you!

uttid = uttid.replace("hin", lang)
break
else:
raise ValueError(f"{indian_langs} not in {uttid}")

high

The script raises a ValueError if a hin utterance ID does not contain one of ["tam", "tel", "guj"]. This assumption might be too strong and could cause the script to fail if there are legitimate Hindi utterances that don't follow this naming convention. It would be more robust to log a warning and skip the utterance instead of crashing.

Suggested change
- raise ValueError(f"{indian_langs} not in {uttid}")
+ # If no Indian language code is found in the utterance ID for a 'hin' entry,
+ # we assume it's a standard Hindi utterance and do nothing.
+ # A warning could be logged here if this is an unexpected case.
+ pass
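The warn-and-skip behavior suggested above could look roughly like this (a sketch only; `remap_uttid` and its arguments are illustrative, not the recipe's actual code):

```python
import logging


def remap_uttid(uttid, indian_langs=("tam", "tel", "guj")):
    """Remap a 'hin' utterance ID when an Indian language code is present;
    otherwise keep the ID and log a warning instead of raising."""
    for lang in indian_langs:
        if lang in uttid:
            return uttid.replace("hin", lang)
    logging.warning("no code from %s in %s; keeping as Hindi", indian_langs, uttid)
    return uttid
```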

Comment on lines +131 to +133
def forward(
self, input: torch.Tensor, input_lengths: torch.Tensor, labels: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:

high

The forward method signature forward(self, input: torch.Tensor, input_lengths: torch.Tensor, labels: torch.Tensor) is incompatible with the base class AbsFrontend, which defines forward(self, input: torch.Tensor, input_lengths: torch.Tensor). This violates the Liskov Substitution Principle and can lead to unexpected errors if this frontend is used in a context expecting a standard AbsFrontend. The labels should be handled within the main model, not passed through the frontend, to maintain proper separation of concerns.
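One conventional way to keep the base-class contract is to let the model hand the labels to the frontend through a setter before calling forward(), so the signature stays unchanged. A minimal sketch with simplified stand-in classes (these are not espnet2's real interfaces, and the numeric "conditioning" is a placeholder):

```python
from abc import ABC, abstractmethod


class AbsFrontend(ABC):
    """Simplified stand-in for the abstract frontend interface."""

    @abstractmethod
    def forward(self, input, input_lengths):
        ...


class CondFrontend(AbsFrontend):
    """Receives conditioning via a setter instead of widening forward()."""

    def __init__(self):
        self._labels = None

    def set_condition(self, labels):
        self._labels = labels

    def forward(self, input, input_lengths):
        labels = self._labels
        self._labels = None  # consume once so stale labels never leak
        if labels is not None:
            # Placeholder for the real conditioning step; here: shift features.
            input = [x + labels for x in input]
        return input, input_lengths
```

This keeps the frontend substitutable anywhere an AbsFrontend is expected, while the main model remains responsible for deciding when conditioning applies.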

Comment on lines +231 to +233
assert (
0 < self.lang2vec_weight < 1
), f"lang2vec_weight should be in (0, 1), but got {self.lang2vec_weight}"

high

The assertion 0 < self.lang2vec_weight < 1 is too strict. It prevents setting lang2vec_weight to 0 or 1, which are valid scenarios to disable the language vector loss or the classification loss, respectively. This can be useful for ablation studies or for using the model in a classification-only mode. The assertion should be relaxed to 0 <= self.lang2vec_weight <= 1.

Suggested change
- assert (
-     0 < self.lang2vec_weight < 1
- ), f"lang2vec_weight should be in (0, 1), but got {self.lang2vec_weight}"
+ assert (
+     0 <= self.lang2vec_weight <= 1
+ ), f"lang2vec_weight should be in [0, 1], but got {self.lang2vec_weight}"
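With the relaxed bounds, the loss combination can short-circuit at the endpoints, which also avoids multiplying a weight into a missing term. A hedged sketch (`combine_losses` is illustrative, not the PR's actual method):

```python
def combine_losses(class_loss, lang2vec_loss, weight):
    """Combine classification and lang2vec losses with weight in [0, 1].

    weight == 0 disables the lang2vec term, weight == 1 disables the
    classification term (useful for ablation studies).
    """
    assert 0.0 <= weight <= 1.0, f"weight should be in [0, 1], got {weight}"
    if lang2vec_loss is None or weight == 0.0:
        return class_loss
    if weight == 1.0:
        return lang2vec_loss
    return (1.0 - weight) * class_loss + weight * lang2vec_loss
```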

@github-actions

This PR is stale because it has been open for 90 days with no activity.
It will be closed if no further activity occurs.
Thank you for your contributions.

@github-actions github-actions bot added Stale For probot and removed Stale For probot labels Dec 31, 2025

Labels

ESPnet2 New Features README size:XXL This PR changes 1000+ lines, ignoring generated files.

4 participants