LID-9: Geolocation-aware LID recipe and codes#6212
Qingzheng-Wang wants to merge 48 commits into espnet:master
Conversation
for more information, see https://pre-commit.ci
This pull request introduces a new geolocation-aware spoken language identification (LID) recipe in ESPnet, along with detailed documentation, job scheduling configurations, and multiple training configurations. The main contributions are a robust, geolocation-conditioned LID model, comprehensive usage instructions, and support for various cluster environments. The changes fall into three areas:

- Documentation and Usage
- Job Scheduling and Cluster Support
- Model Configuration
This would be the last PR.
@Qingzheng-Wang, @brianyan918 wants to use this.

Sure!
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #6212       +/-   ##
===========================================
+ Coverage        0    56.71%    +56.71%
===========================================
  Files           0       892       +892
  Lines           0     84667     +84667
===========================================
+ Hits            0     48022     +48022
- Misses          0     36645     +36645
```
/gemini review
Code Review
This PR introduces a comprehensive geolocation-aware language identification recipe. The implementation is extensive, covering new model architectures, loss functions, frontend components, and a complete set of data preparation and training scripts. My review focuses on the core implementation and script correctness. I've identified several critical issues related to script failures (missing imports, filename mismatches, hardcoded paths) and potential runtime errors in the model logic. There are also high-severity concerns regarding maintainability, such as dependencies on personal forks and violations of class interface contracts. Addressing these points will significantly improve the robustness and long-term viability of this new recipe.
```bash
    log "Directory dump/raw/train_babel_over_10s_lang exists."
else
    log "Directory dump/raw/train_babel_over_10s_lang does not exist. Running local/filter_babel_train.sh."
    local/filter_babel_train.sh || exit 1
```
This script calls local/filter_babel_train.sh, but this file does not exist in the pull request. It seems the intended script is local/filter_babel_over_10s.sh. Please correct the script name to avoid a command not found error.
Suggested change:
```diff
-local/filter_babel_train.sh || exit 1
+local/filter_babel_over_10s.sh || exit 1
```
```bash
. utils/parse_options.sh || exit 1;

for split in $splits; do
    python local/filter_babel_over_10s.py --babel_dir $dump_dir/$split --babel_over_10s_dir $dump_dir/$split_over_10s
```
This script attempts to execute local/filter_babel_over_10s.py, but the Python script added in this pull request is named local/filter_babel_overl_10s.py (with an extra 'l'). This filename mismatch will cause a FileNotFoundError and prevent the script from running. Please correct the name of the Python script to match the one being called here, or update the call to use the correct filename.
```python
import os


def parse_args():
```
The argparse module is used to parse command-line arguments, but it is not imported in this script. This will result in a NameError when the script is executed. Please add import argparse to the top of the file.
Suggested change:
```diff
 import os
+import argparse

 def parse_args():
```
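For reference, a minimal self-contained version of the parser could look like the following. The argument names are taken from the call site quoted above; the description and help texts are illustrative:

```python
import argparse


def parse_args(argv=None):
    """Parse the filter script's command-line arguments.

    Passing `argv` explicitly makes the function easy to test;
    `parse_args(None)` falls back to `sys.argv` as usual.
    """
    parser = argparse.ArgumentParser(
        description="Filter BABEL utterances longer than 10 seconds."
    )
    parser.add_argument("--babel_dir", required=True, help="input dump directory")
    parser.add_argument(
        "--babel_over_10s_dir",
        required=True,
        help="output directory for utterances over 10 seconds",
    )
    return parser.parse_args(argv)


args = parse_args(
    ["--babel_dir", "dump/raw/dev", "--babel_over_10s_dir", "dump/raw/dev_over_10s"]
)
```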
```bash
local train_list="$4"
local dev_list="$5"

local base_dir="/scratch/bbjs/shared/corpora/babel/${lang_code}/conversational"
```
The base_dir is hardcoded to an absolute path /scratch/bbjs/shared/corpora/babel/.... This makes the script non-portable and will cause it to fail on any machine where this path does not exist. It should use the $dataset_path variable, which is passed as an argument to the script, to construct the path.
Suggested change:
```diff
-local base_dir="/scratch/bbjs/shared/corpora/babel/${lang_code}/conversational"
+local base_dir="${dataset_path}/${lang_code}/conversational"
```
```python
import os
import sys
import traceback
```
```python
loss = (
    1 - self.loss.lang2vec_weight
) * lid_class_loss_all + self.loss.lang2vec_weight * lang2vec_loss_all
```
There is a potential TypeError in this loss calculation. lang2vec_loss_all can be None if lang2vec_loss is None (e.g., if lang2vec_weight is not configured in the loss module). If lang2vec_loss_all is None, the multiplication self.loss.lang2vec_weight * lang2vec_loss_all will raise a TypeError. You should add a check to ensure lang2vec_loss_all is not None before performing this calculation.
Suggested change:
```diff
-loss = (
-    1 - self.loss.lang2vec_weight
-) * lid_class_loss_all + self.loss.lang2vec_weight * lang2vec_loss_all
+if lang2vec_loss_all is not None:
+    loss = (
+        1 - self.loss.lang2vec_weight
+    ) * lid_class_loss_all + self.loss.lang2vec_weight * lang2vec_loss_all
+else:
+    loss = lid_class_loss_all
```
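The guarded combination can also be sketched in isolation (the function name is illustrative, and plain floats stand in for tensors; the arithmetic is the same):

```python
def combine_losses(lid_class_loss, lang2vec_loss, lang2vec_weight):
    # Guard the optional lang2vec term: fall back to the classification
    # loss alone when no lang2vec loss was computed, avoiding the
    # TypeError from multiplying a weight by None.
    if lang2vec_loss is None:
        return lid_class_loss
    return (1 - lang2vec_weight) * lid_class_loss + lang2vec_weight * lang2vec_loss
```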
This project requires **modified versions** of s3prl and transformers for geolocation conditioning functionality.

**Install Modified s3prl:**
```bash
# If you have already installed s3prl, please uninstall it first
pip uninstall s3prl  # (Optional if already installed)

# Clone and install the modified version
git clone -b lid https://github.com/Qingzheng-Wang/s3prl.git
cd s3prl
pip install -e .
cd ..
```

**Install Modified Transformers:**
```bash
# If you have already installed transformers, please uninstall it first
pip uninstall transformers  # (Optional if already installed)

# Clone and install the modified version
git clone -b v4.51.3-qingzheng https://github.com/Qingzheng-Wang/transformers.git
cd transformers
pip install -e .
cd ..
```
The installation instructions rely on personal forks of s3prl and transformers from a personal GitHub account. This introduces a significant maintainability and reproducibility risk for the project. If these forks are changed or deleted, it will break this recipe. For long-term stability, these modified dependencies should be vendored within ESPnet, merged upstream, or maintained in a repository under the espnet organization.
@sw005320 I am wondering about this particular issue. Is it required to use a cloned repo? How many functions/classes were changed? Is it not possible to add them to espnet2 directly? Also, in the case of transformers, is it not possible to use custom models and import them via AutoClass registration? https://huggingface.co/docs/transformers/custom_models. Personal repos may complicate and limit future releases.
I agree.
This is not our normal practice.
@Qingzheng-Wang, please avoid these forks:
- s3prl: you can make a PR to them
- transformers: you can try Nelson's suggestion
OK, I’ll try to work on this, though it may take quite a bit of time for me. I’ll open a PR to the S3PRL repo and see whether they accept it.
Please add me to the PR.
If the change is reasonable, I expect they’ll address it quickly.
If not, we can create a fork under the ESPnet organization and maintain it ourselves.
Before deciding, I’d like to review the changes you made in your PR.
Got it, thank you!
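As a rough sketch of the AutoClass route mentioned in the thread above (the class names and the `geolid` model type are hypothetical, not part of this PR), a custom model can be registered without forking transformers:

```python
import torch
from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel


class GeoLidConfig(PretrainedConfig):
    model_type = "geolid"  # made-up model type for this sketch

    def __init__(self, hidden_size=32, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size


class GeoLidModel(PreTrainedModel):
    config_class = GeoLidConfig

    def __init__(self, config):
        super().__init__(config)
        self.proj = torch.nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, feats):
        return self.proj(feats)


# Make the custom classes resolvable through the Auto API.
AutoConfig.register("geolid", GeoLidConfig)
AutoModel.register(GeoLidConfig, GeoLidModel)

model = AutoModel.from_config(GeoLidConfig())
```

Combined with `save_pretrained`/`from_pretrained`, the conditioning code could then live inside espnet2 rather than in a personal fork.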
```python
        uttid = uttid.replace("hin", lang)
        break
else:
    raise ValueError(f"{indian_langs} not in {uttid}")
```
The script raises a ValueError if a hin utterance ID does not contain one of ["tam", "tel", "guj"]. This assumption might be too strong and could cause the script to fail if there are legitimate Hindi utterances that don't follow this naming convention. It would be more robust to log a warning and skip the utterance instead of crashing.
Suggested change:
```diff
-raise ValueError(f"{indian_langs} not in {uttid}")
+# If no Indian language code is found in the utterance ID for a 'hin' entry,
+# we assume it's a standard Hindi utterance and do nothing.
+# A warning could be logged here if this is an unexpected case.
+pass
```
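A warn-and-keep variant of that loop could look like the following (the function name and fallthrough behavior are illustrative; the language list comes from the review comment):

```python
import logging

logger = logging.getLogger(__name__)

INDIAN_LANGS = ["tam", "tel", "guj"]


def relabel_hindi_uttid(uttid):
    """Swap the 'hin' tag for a matching Indian language code, if present."""
    for lang in INDIAN_LANGS:
        if lang in uttid:
            return uttid.replace("hin", lang)
    # No match: warn and keep the original ID instead of raising ValueError.
    logger.warning("%s not found in %s; keeping 'hin'", INDIAN_LANGS, uttid)
    return uttid
```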
```python
def forward(
    self, input: torch.Tensor, input_lengths: torch.Tensor, labels: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
```
The forward method signature forward(self, input: torch.Tensor, input_lengths: torch.Tensor, labels: torch.Tensor) is incompatible with the base class AbsFrontend, which defines forward(self, input: torch.Tensor, input_lengths: torch.Tensor). This violates the Liskov Substitution Principle and can lead to unexpected errors if this frontend is used in a context expecting a standard AbsFrontend. The labels should be handled within the main model, not passed through the frontend, to maintain proper separation of concerns.
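To illustrate the contract this comment describes, here is a simplified sketch (plain Python objects stand in for tensors, and all class names other than `AbsFrontend` are placeholders): the frontend keeps the two-argument signature, and labels enter only at the model level.

```python
from abc import ABC, abstractmethod


class AbsFrontend(ABC):
    """Simplified stand-in for espnet2's frontend contract."""

    @abstractmethod
    def forward(self, input, input_lengths):
        """Frontends see only features and lengths -- no labels."""


class ConditionFrontend(AbsFrontend):
    def forward(self, input, input_lengths):
        # Feature extraction only; no label handling here.
        return input, input_lengths


class LidModel:
    """Labels are consumed here, so any AbsFrontend can be plugged in."""

    def __init__(self, frontend):
        self.frontend = frontend

    def forward(self, speech, speech_lengths, labels):
        feats, feats_lengths = self.frontend.forward(speech, speech_lengths)
        # Conditioning on `labels` happens at the model level, after the
        # frontend call, preserving the base-class signature.
        return feats, feats_lengths, labels
```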
```python
assert (
    0 < self.lang2vec_weight < 1
), f"lang2vec_weight should be in (0, 1), but got {self.lang2vec_weight}"
```
The assertion 0 < self.lang2vec_weight < 1 is too strict. It prevents setting lang2vec_weight to 0 or 1, which are valid scenarios to disable the language vector loss or the classification loss, respectively. This can be useful for ablation studies or for using the model in a classification-only mode. The assertion should be relaxed to 0 <= self.lang2vec_weight <= 1.
Suggested change:
```diff
-assert (
-    0 < self.lang2vec_weight < 1
-), f"lang2vec_weight should be in (0, 1), but got {self.lang2vec_weight}"
+assert (
+    0 <= self.lang2vec_weight <= 1
+), f"lang2vec_weight should be in [0, 1], but got {self.lang2vec_weight}"
```
|
This PR is stale because it has been open for 90 days with no activity.
What did you change?
This PR adds a complete geolocation-aware language identification recipe and supporting infrastructure.
Core Implementation:
- `espnet2/lid/espnet_model_upstream_condition.py`: ESPnet LID model with upstream lang2vec conditioning and downstream prediction
- `espnet2/lid/frontend/s3prl_condition.py`: Modified S3PRL frontend supporting geolocation-aware conditioning
- `espnet2/lid/loss/aamsoftmax_sc_topk_lang2vec.py`: AAMSoftmax loss with lang2vec prediction (supporting geo, phonology_knn, syntax_knn, inventory_knn)

Recipe:
- `egs2/geolid/lid1/`: Complete recipe with data preparation, training scripts, and configurations

Why did you make this change?
This is an implementation of our ASRU paper "Geolocation-Aware Robust Spoken Language Identification". Our proposed geolocation-aware LID improves the robustness of SSL-based LID systems to dialectal and accented variation.
Is your PR small enough?
Not exactly, but the majority consists of data preparation scripts that should be added together, and the core implementation components are interdependent.
Additional Context