SpeechLM Data Infra: Data batchfy, sampling and iterator #6260
sw005320 merged 12 commits into espnet:master
Conversation
Code Review
This pull request introduces a significant data infrastructure for SpeechLM, including data batching, sampling, and iteration logic. It adds new recipes for GigaSpeech and LibriSpeech, along with several Python modules to handle multi-modal data loading using Lhotse, dataset management, and batch creation. The overall structure is well-designed and modular. I've identified a couple of high-severity issues: one related to hardcoded environment-specific paths in a template file, which harms portability, and a bug in data key generation that would lead to mismatches. The rest of the implementation appears solid and follows good practices.
if [[ "$(hostname)" == dt* ]] || [[ "$(hostname)" == gh* ]] || [[ "$(hostname)" == gpu* ]]; then # For Delta/DeltaAI
    ESPNET_DATASET_REGISTRY+=":/work/nvme/bbjs/shared/data_registry/train_shared.yaml"
    ESPNET_DATASET_REGISTRY+=":/work/nvme/bbjs/shared/data_registry/valid_shared.yaml"
fi
The logic to set ESPNET_DATASET_REGISTRY is specific to an internal environment (wavlab, with hardcoded hostnames like dt*, gh*, and paths like /work/nvme/bbjs/shared/...). Including this in a TEMPLATE script makes it not general-purpose and difficult for other users to adapt. Environment-specific configurations should be placed in a local/path.sh file, which is sourced but not tracked by git, or managed through a separate configuration mechanism.
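A minimal sketch of the suggested alternative (the file name local/path.sh and the variable usage are assumptions for illustration, not the PR's actual mechanism):

```shell
# Hypothetical sketch: keep site-specific registry entries in an untracked
# local/path.sh and source it from the template only if it exists.
ESPNET_DATASET_REGISTRY=""
if [ -f local/path.sh ]; then
    # local/path.sh (git-ignored) would hold lines such as:
    # ESPNET_DATASET_REGISTRY+=":/my/site/data_registry/train_shared.yaml"
    . local/path.sh
fi
echo "registry: ${ESPNET_DATASET_REGISTRY:-<empty>}"
```

This keeps the tracked TEMPLATE free of hostnames and site paths while letting each cluster append its own registries.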
allowed_keys.extend([f"audio{x}" for x in range(0, 10)])
allowed_keys.extend([f"text{x}" for x in range(0, 10)])
The allowed_keys for audio and text entries are generated using range(0, 10), which produces keys like audio0, text0, ..., audio9, text9. This is inconsistent with the documentation (which mentions audio1, text1, etc.) and other configuration files like espnet2/speechlm/configuration/task_conf.py which use 1-based indexing (range(1, 11)). This will cause key mismatches and errors. The range should be changed to range(1, 11) to generate keys from 1 to 10.
Suggested change:
- allowed_keys.extend([f"audio{x}" for x in range(0, 10)])
- allowed_keys.extend([f"text{x}" for x in range(0, 10)])
+ allowed_keys.extend([f"audio{x}" for x in range(1, 11)])
+ allowed_keys.extend([f"text{x}" for x in range(1, 11)])
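A quick way to see the off-by-one (variable names as in the snippet above; the surrounding context is assumed):

```python
# 1-based indexing yields audio1..audio10 / text1..text10, matching the
# documented keys and task_conf.py; range(0, 10) would yield audio0..audio9.
allowed_keys = []
allowed_keys.extend([f"audio{x}" for x in range(1, 11)])
allowed_keys.extend([f"text{x}" for x in range(1, 11)])

assert "audio1" in allowed_keys and "audio10" in allowed_keys
assert "audio0" not in allowed_keys and "audio11" not in allowed_keys
```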
Codecov Report

❌ Patch coverage is
Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #6260      +/-   ##
==========================================
- Coverage   56.77%   56.49%    -0.29%
==========================================
  Files         889      896        +7
  Lines       84361    84789      +428
==========================================
  Hits        47898    47898
- Misses      36463    36891      +428

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
I just wanted to confirm whether you've checked the lhotse.kaldi package. It contains a number of utility files for Lhotse–Kaldi interaction, and I think it is a more mature codebase than implementing everything from scratch.
Oh, thanks for the reminder. It's great that Lhotse provides native support for the Kaldi format, and I'd love to adopt it.
Maybe we can have a chat with Samuele and William later, when William is back.
The interface is already there, so this switch should be easy.
from typing import List

# List of supported entries for task configurations
SUPPORTED_ENTRIES = (
Are these audio/text labels the same as the keys we have in prepare_dataset_json.allowed_keys?
I think it would be nice to define them here as a constant and import it in prepare_dataset_json.py.
Good idea, will update this.
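A sketch of what sharing the constant could look like (module paths come from the discussion above; the exact contents of the tuple are assumed):

```python
# Hypothetical: define once, e.g. in espnet2/speechlm/configuration/task_conf.py,
# so the key list exists in exactly one place.
SUPPORTED_ENTRIES = tuple(
    f"{kind}{i}" for kind in ("audio", "text") for i in range(1, 11)
)

# ... then prepare_dataset_json.py imports it instead of re-deriving the keys:
# from espnet2.speechlm.configuration.task_conf import SUPPORTED_ENTRIES
allowed_keys = list(SUPPORTED_ENTRIES)
```

A single source of truth avoids exactly the kind of 0-based/1-based mismatch flagged earlier in this review.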
Do you plan to include all future readers in this file?
Since the text loader doesn't use Lhotse, I thought it might also make sense to split the file. I'm mainly thinking about future development: for example, if we add image or video readers, they might rely on other specialized packages that aren't needed for audio tasks. This approach isn't fully compatible with the current ESPnet 3 design (task-based dependencies during installation), but if possible, it might be nice to separate the dependencies for each class.
I'll split the file into text_loader.py and audio_loader.py. Thanks!
@sw005320 I think this PR can also be merged. @Masao-Someki has reviewed it and I have revised accordingly.
@jctian98 this is amazing! Do you plan to add any unit tests? It might be helpful for maintaining this.
@siddhu001 I may need more discussion with @Masao-Someki and @wanchichen about how to make test cases for the recent PRs, mostly because these PRs have some dependencies (e.g., Lhotse manifests) that may not be easy to make compatible with the current CI. @sw005320 it may be OK to merge this PR.
Thanks, @jctian98, @siddhu001, and @Masao-Someki!
The data iterator block of the SpeechLM repo.
This PR should not be reviewed or merged before #6257.
(1) espnet2/speechlm/configuration: general-purpose and language-model-specific task template design.
(2) espnet2/speechlm/dataloader/iterator.py: the iterator factory logic to build the dataloader object, with state tracking and deterministic behavior.
(3) espnet2/speechlm/dataloader/batch.py: the method to batchify the data.
(4) Plus some minor fixes.
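As a rough illustration of the batchify idea in (3), and not the PR's actual API (the function name and packing policy here are hypothetical), length-budgeted batching can be sketched as:

```python
# Hypothetical sketch: greedily pack utterance indices into batches so the
# padded size (num_utts * max_len) never exceeds a fixed budget.
def batchify(lengths, budget):
    batches, cur, cur_max = [], [], 0
    for idx, length in enumerate(lengths):
        new_max = max(cur_max, length)
        if cur and (len(cur) + 1) * new_max > budget:
            batches.append(cur)  # adding this item would exceed the budget
            cur, cur_max = [idx], length
        else:
            cur.append(idx)
            cur_max = new_max
    if cur:
        batches.append(cur)
    return batches

print(batchify([3, 4, 5, 10, 2], budget=12))  # → [[0, 1], [2], [3], [4]]
```

Padding-aware budgets like this keep per-batch memory roughly constant even when utterance lengths vary widely, which matters for multi-modal SpeechLM data.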
Also see: #6258
(1) SpeechLM preprocessor;
(2) SpeechLM model files;
(3) a SpeechLM-specific trainer;
(4) a PR to connect all components;
(5) inference.