
SpeechLM Data Infra: Data batchfy, sampling and iterator #6260

Merged
sw005320 merged 12 commits into espnet:master from jctian98:iterator
Oct 20, 2025

jctian98 (Collaborator) commented Oct 7, 2025

The data iterator block of the SpeechLM repo.

This PR should not be reviewed or merged before #6257

(1) espnet2/speechlm/configuration: general-purpose and language-model-specific task template design.
(2) espnet2/speechlm/dataloader/iterator.py: the iterator factory logic that builds the dataloader object, with state tracking and deterministic behavior.
(3) espnet2/speechlm/dataloader/batch.py: the method to batchify the data.
(4) Some minor fixes.
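
As a rough illustration of what "state tracking and deterministic behavior" can mean for a data iterator, here is a minimal sketch. It is not the code in iterator.py: the class name `ResumableBatchIterator`, its parameters, and the `state_dict`/`load_state_dict` methods are hypothetical and chosen only for this example. The batch order depends solely on `(seed, epoch)`, so a restarted job can rebuild the same order and continue from the saved step.

```python
import random
from typing import Any, Dict, Iterator, List


class ResumableBatchIterator:
    """Hypothetical sketch: fixed-size batches in a seeded, per-epoch order."""

    def __init__(self, keys: List[str], batch_size: int, seed: int = 0) -> None:
        self.keys = list(keys)
        self.batch_size = batch_size
        self.seed = seed
        self.epoch = 0
        self.step = 0  # number of batches already yielded in this epoch

    def _batches(self) -> List[List[str]]:
        # The order depends only on (seed, epoch), so it can be rebuilt later.
        order = list(self.keys)
        random.Random(f"{self.seed}:{self.epoch}").shuffle(order)
        return [
            order[i : i + self.batch_size]
            for i in range(0, len(order), self.batch_size)
        ]

    def __iter__(self) -> Iterator[List[str]]:
        batches = self._batches()
        while self.step < len(batches):
            batch = batches[self.step]
            self.step += 1
            yield batch
        # Epoch finished: advance the epoch and reset the in-epoch position.
        self.epoch += 1
        self.step = 0

    def state_dict(self) -> Dict[str, Any]:
        return {"seed": self.seed, "epoch": self.epoch, "step": self.step}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.seed = state["seed"]
        self.epoch = state["epoch"]
        self.step = state["step"]
```

Saving `state_dict()` alongside a model checkpoint and calling `load_state_dict()` on a fresh iterator would then resume with the batch that follows the last one consumed.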

Also see: #6258

(1) SpeechLM preprocessor
(2) SpeechLM model files
(3) A SpeechLM-specific trainer
(4) A PR to connect all components
(5) Inference

dosubot bot added the size:XL (This PR changes 500-999 lines, ignoring generated files.) and Recipe labels Oct 7, 2025
mergify bot added the ESPnet2 label Oct 7, 2025
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a significant data infrastructure for SpeechLM, including data batching, sampling, and iteration logic. It adds new recipes for GigaSpeech and LibriSpeech, along with several Python modules to handle multi-modal data loading using Lhotse, dataset management, and batch creation. The overall structure is well-designed and modular. I've identified a couple of high-severity issues: one related to hardcoded environment-specific paths in a template file, which harms portability, and a bug in data key generation that would lead to mismatches. The rest of the implementation appears solid and follows good practices.

Comment on lines +27 to +30
if [[ "$(hostname)" == dt* ]] || [[ "$(hostname)" == gh* ]] || [[ "$(hostname)" == gpu* ]] ; then # For Delta/DeltaAI
    ESPNET_DATASET_REGISTRY+=":/work/nvme/bbjs/shared/data_registry/train_shared.yaml"
    ESPNET_DATASET_REGISTRY+=":/work/nvme/bbjs/shared/data_registry/valid_shared.yaml"
fi

high

The logic to set ESPNET_DATASET_REGISTRY is specific to an internal environment (wavlab, with hardcoded hostnames like dt*, gh*, and paths like /work/nvme/bbjs/shared/...). Including this in a TEMPLATE script makes it not general-purpose and difficult for other users to adapt. Environment-specific configurations should be placed in a local/path.sh file, which is sourced but not tracked by git, or managed through a separate configuration mechanism.

Comment on lines +22 to +23
allowed_keys.extend([f"audio{x}" for x in range(0, 10)])
allowed_keys.extend([f"text{x}" for x in range(0, 10)])

high

The allowed_keys for audio and text entries are generated using range(0, 10), which produces keys like audio0, text0, ..., audio9, text9. This is inconsistent with the documentation (which mentions audio1, text1, etc.) and other configuration files like espnet2/speechlm/configuration/task_conf.py which use 1-based indexing (range(1, 11)). This will cause key mismatches and errors. The range should be changed to range(1, 11) to generate keys from 1 to 10.

Suggested change
- allowed_keys.extend([f"audio{x}" for x in range(0, 10)])
- allowed_keys.extend([f"text{x}" for x in range(0, 10)])
+ allowed_keys.extend([f"audio{x}" for x in range(1, 11)])
+ allowed_keys.extend([f"text{x}" for x in range(1, 11)])
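
To make the mismatch concrete, the short sketch below builds both key sets from the two ranges quoted in this comment and prints the keys that appear in only one of them. The variable names are illustrative and do not come from the PR.

```python
# Keys as generated by the original lines (0-based indexing).
zero_based = {f"audio{x}" for x in range(0, 10)} | {f"text{x}" for x in range(0, 10)}
# Keys as generated by the suggested change (1-based indexing).
one_based = {f"audio{x}" for x in range(1, 11)} | {f"text{x}" for x in range(1, 11)}

only_zero_based = sorted(zero_based - one_based)
only_one_based = sorted(one_based - zero_based)

print(only_zero_based)  # ['audio0', 'text0']
print(only_one_based)   # ['audio10', 'text10']
```

With 0-based generation, `audio0` and `text0` would be accepted even though no 1-based configuration uses them, while `audio10` and `text10` would be rejected.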

codecov bot commented Oct 7, 2025

Codecov Report

❌ Patch coverage is 0% with 298 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.49%. Comparing base (cc1f815) to head (a5a9ea6).
⚠️ Report is 16 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| espnet2/speechlm/dataloader/iterator.py | 0.00% | 150 Missing ⚠️ |
| espnet2/speechlm/dataloader/batch.py | 0.00% | 73 Missing ⚠️ |
| ...eechlm/dataloader/multimodal_loader/text_loader.py | 0.00% | 39 Missing ⚠️ |
| ...pnet2/speechlm/configuration/task_conf_speechlm.py | 0.00% | 20 Missing ⚠️ |
| espnet2/speechlm/configuration/task_conf.py | 0.00% | 8 Missing ⚠️ |
| espnet2/speechlm/dataloader/dataset.py | 0.00% | 8 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6260      +/-   ##
==========================================
- Coverage   56.77%   56.49%   -0.29%     
==========================================
  Files         889      896       +7     
  Lines       84361    84789     +428     
==========================================
  Hits        47898    47898              
- Misses      36463    36891     +428     
| Flag | Coverage Δ |
|---|---|
| test_integration_espnet2 | 46.80% <ø> (ø) |
| test_integration_espnetez | 36.92% <ø> (ø) |
| test_python_espnet2 | 50.94% <0.00%> (-0.26%) ⬇️ |
| test_python_espnetez | 12.75% <0.00%> (-0.07%) ⬇️ |
| test_utils | 18.77% <ø> (ø) |

Flags with carried forward coverage won't be shown.

Contributor

I just wanted to confirm whether you've checked the lhotse.kaldi package. It contains a number of utilities for Lhotse-Kaldi interaction, and I think it is a more mature codebase than implementing everything from scratch.

Collaborator Author

Oh, thanks for the reminder. It's great if Lhotse provides native support for the Kaldi format, and I'd love to adopt it.

Maybe I'll have a chat with Samuele and William later, when William is back.
The interface is already there, so switching should be easy.

from typing import List

# List of supported entries for task configurations
SUPPORTED_ENTRIES = (
Contributor

Are these audio/text labels the same as the keys we have in prepare_dataset_json.allowed_keys?
I think it would be nice to define them here as a constant and import it in prepare_dataset_json.py.

Collaborator Author

Good idea, will update this.
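
One way to read this suggestion is sketched below: the entry names are built once and every consumer validates against that single tuple. This is only an assumption about how it could look. The real contents of `SUPPORTED_ENTRIES` are not shown in this thread, so the helper `build_supported_entries`, the constant `MAX_ENTRIES_PER_MODALITY`, and `find_unsupported_keys` are hypothetical stand-ins.

```python
from typing import List, Tuple

# Assumed from the range(1, 11) mentioned in the review; not confirmed by the PR.
MAX_ENTRIES_PER_MODALITY = 10


def build_supported_entries(
    modalities: Tuple[str, ...] = ("audio", "text"),
) -> Tuple[str, ...]:
    """Build 1-based entry names such as audio1..audio10 and text1..text10."""
    return tuple(
        f"{modality}{index}"
        for modality in modalities
        for index in range(1, MAX_ENTRIES_PER_MODALITY + 1)
    )


# Defined once; in the layout suggested above, the dataset-preparation script
# would import this constant instead of rebuilding its own key list.
SUPPORTED_ENTRIES = build_supported_entries()


def find_unsupported_keys(keys: List[str]) -> List[str]:
    """Return the keys that are not listed in SUPPORTED_ENTRIES."""
    return [key for key in keys if key not in SUPPORTED_ENTRIES]


print(find_unsupported_keys(["audio1", "text10", "audio0"]))  # ['audio0']
```

Keeping one definition means an indexing change only has to be made in one place.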

Contributor

Do you plan to include all future readers in this file?
Since the text loader doesn't use Lhotse, I thought it might also make sense to split the files. I'm mainly thinking about future development: for example, if we add image or video readers, they might rely on other specialized packages that aren't needed for audio tasks. This approach isn't fully compatible with the current ESPnet 3 design (task-based dependencies during installation), but if possible, it might be nice to separate the dependencies for each class.

Collaborator Author

I'll split the file into text_loader.py and audio_loader.py. Thanks!
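
Beyond splitting the files, per-loader dependencies can also be kept apart by importing each backend only when its modality is requested. The sketch below illustrates that idea under clearly stated assumptions: `LOADER_REGISTRY` and `get_loader` are hypothetical names, and standard-library modules stand in for the real loader modules so the example runs without Lhotse or any other optional package.

```python
import importlib
from typing import Any, Dict, Tuple

# Hypothetical mapping from modality to (module name, attribute name).
# Standard-library modules are stand-ins for real loader backends here.
LOADER_REGISTRY: Dict[str, Tuple[str, str]] = {
    "text": ("json", "loads"),
    "audio": ("wave", "open"),
    "video": ("not_installed_video_backend", "open"),  # deliberately missing
}


def get_loader(modality: str) -> Any:
    """Import the backend for one modality only when it is requested."""
    try:
        module_name, attribute = LOADER_REGISTRY[modality]
    except KeyError:
        raise ValueError(f"Unknown modality: {modality!r}") from None
    try:
        module = importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"The {modality!r} loader needs the optional package {module_name!r}."
        ) from err
    return getattr(module, attribute)


text_loader = get_loader("text")
print(text_loader('{"utt1": "hello"}'))  # {'utt1': 'hello'}
```

A user who only works with text never triggers the import of the audio or video backend, so a missing optional package fails only when that modality is actually used.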

jctian98 (Collaborator Author) commented Oct 8, 2025

@sw005320 I think this PR can also be merged. @Masao-Someki has reviewed it and I have revised it accordingly.
The CI issue is the same as in #6257.

siddhu001 (Collaborator)

@jctian98 this is amazing! Do you plan to add any unit tests? Might be helpful for maintaining this.

jctian98 (Collaborator Author)

@siddhu001 I may need more discussion with @Masao-Someki and @wanchichen about how to write test cases for the recent PRs, mostly because these PRs have some dependencies (e.g., Lhotse manifests) that may not be easy to make compatible with the current CI.

@sw005320 Maybe it's OK to merge this PR.
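
One common pattern for this situation, shown here only as a sketch and not as the approach the maintainers chose, is to let dependency-free tests always run and to skip Lhotse-dependent tests when the package is not installed. The test class names and test bodies are hypothetical; `unittest.skipUnless` and `importlib.util.find_spec` are standard-library calls.

```python
import importlib.util
import unittest

# True only when the optional dependency can be imported in this environment.
HAS_LHOTSE = importlib.util.find_spec("lhotse") is not None


class TestKeyGeneration(unittest.TestCase):
    """Runs everywhere: needs no optional dependency."""

    def test_one_based_keys(self):
        keys = [f"audio{x}" for x in range(1, 11)]
        self.assertEqual(keys[0], "audio1")
        self.assertEqual(keys[-1], "audio10")


@unittest.skipUnless(HAS_LHOTSE, "lhotse is not installed")
class TestWithLhotse(unittest.TestCase):
    """Skipped on CI machines that do not provide lhotse."""

    def test_cutset_is_available(self):
        import lhotse

        self.assertTrue(hasattr(lhotse, "CutSet"))
```

This keeps the CI green on machines without Lhotse while still exercising the dependent tests wherever the package is available.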

sw005320 merged commit 9a2db4c into espnet:master Oct 20, 2025
31 of 32 checks passed
sw005320 (Contributor)

Thanks, @jctian98, @siddhu001, and @Masao-Someki!


5 participants