SpeechLM Data Infra: Data batchfy, sampling and iterator #6260
sw005320 merged 12 commits into espnet:master
Conversation
Code Review
This pull request introduces a significant data infrastructure for SpeechLM, including data batching, sampling, and iteration logic. It adds new recipes for GigaSpeech and LibriSpeech, along with several Python modules to handle multi-modal data loading using Lhotse, dataset management, and batch creation. The overall structure is well-designed and modular. I've identified a couple of high-severity issues: one related to hardcoded environment-specific paths in a template file, which harms portability, and a bug in data key generation that would lead to mismatches. The rest of the implementation appears solid and follows good practices.
if [[ "$(hostname)" == dt* ]] || [[ "$(hostname)" == gh* ]] || [[ "$(hostname)" == gpu* ]]; then # For Delta/DeltaAI
    ESPNET_DATASET_REGISTRY+=":/work/nvme/bbjs/shared/data_registry/train_shared.yaml"
    ESPNET_DATASET_REGISTRY+=":/work/nvme/bbjs/shared/data_registry/valid_shared.yaml"
fi
The logic to set ESPNET_DATASET_REGISTRY is specific to an internal environment (wavlab, with hardcoded hostnames like dt*, gh*, and paths like /work/nvme/bbjs/shared/...). Including this in a TEMPLATE script makes it not general-purpose and difficult for other users to adapt. Environment-specific configurations should be placed in a local/path.sh file, which is sourced but not tracked by git, or managed through a separate configuration mechanism.
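A minimal sketch of the suggested alternative (the file name local/path.sh and the variable usage are assumptions for illustration, not the PR's actual mechanism):

```shell
# Hypothetical sketch: keep site-specific registry entries in an untracked
# local/path.sh and source it from the template only if it exists.
ESPNET_DATASET_REGISTRY=""
if [ -f local/path.sh ]; then
    # local/path.sh (git-ignored) would hold lines such as:
    # ESPNET_DATASET_REGISTRY+=":/my/site/data_registry/train_shared.yaml"
    . local/path.sh
fi
echo "registry: ${ESPNET_DATASET_REGISTRY:-<empty>}"
```

This keeps the tracked TEMPLATE free of hostnames and site paths while letting each cluster append its own registries.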
allowed_keys.extend([f"audio{x}" for x in range(0, 10)])
allowed_keys.extend([f"text{x}" for x in range(0, 10)])
The allowed_keys for audio and text entries are generated using range(0, 10), which produces keys like audio0, text0, ..., audio9, text9. This is inconsistent with the documentation (which mentions audio1, text1, etc.) and other configuration files like espnet2/speechlm/configuration/task_conf.py which use 1-based indexing (range(1, 11)). This will cause key mismatches and errors. The range should be changed to range(1, 11) to generate keys from 1 to 10.
Suggested change:
- allowed_keys.extend([f"audio{x}" for x in range(0, 10)])
- allowed_keys.extend([f"text{x}" for x in range(0, 10)])
+ allowed_keys.extend([f"audio{x}" for x in range(1, 11)])
+ allowed_keys.extend([f"text{x}" for x in range(1, 11)])
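A quick way to see the off-by-one (variable names as in the snippet above; the surrounding context is assumed):

```python
# 1-based indexing yields audio1..audio10 / text1..text10, matching the
# documented keys and task_conf.py; range(0, 10) would yield audio0..audio9.
allowed_keys = []
allowed_keys.extend([f"audio{x}" for x in range(1, 11)])
allowed_keys.extend([f"text{x}" for x in range(1, 11)])

assert "audio1" in allowed_keys and "audio10" in allowed_keys
assert "audio0" not in allowed_keys and "audio11" not in allowed_keys
```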
Codecov Report

❌ Patch coverage is
Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #6260      +/-   ##
==========================================
- Coverage   56.77%   56.49%    -0.29%
==========================================
  Files         889      896        +7
  Lines       84361    84789      +428
==========================================
  Hits        47898    47898
- Misses      36463    36891      +428

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
I just wanted to confirm whether you've checked the lhotse.kaldi package. It contains a number of utility files for Lhotse–Kaldi interaction, and I think it is a more mature codebase than implementing everything from scratch.
Oh, thanks for the reminder. It's great that Lhotse provides native support for the Kaldi format, and I'd love to adopt it.
Maybe we can have a chat with Samuele and William later, when William is back.
The interface is already there, so this switch should be easy.
from typing import List

# List of supported entries for task configurations
SUPPORTED_ENTRIES = (
Are these audio/text labels the same as the keys we have in prepare_dataset_json.allowed_keys?
I think it would be nice to define them here as a constant and import it in prepare_dataset_json.py.
Good idea, will update this.
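A sketch of what sharing the constant could look like (module paths come from the discussion above; the exact contents of the tuple are assumed):

```python
# Hypothetical: define once, e.g. in espnet2/speechlm/configuration/task_conf.py,
# so the key list exists in exactly one place.
SUPPORTED_ENTRIES = tuple(
    f"{kind}{i}" for kind in ("audio", "text") for i in range(1, 11)
)

# ... then prepare_dataset_json.py imports it instead of re-deriving the keys:
# from espnet2.speechlm.configuration.task_conf import SUPPORTED_ENTRIES
allowed_keys = list(SUPPORTED_ENTRIES)
```

A single source of truth avoids exactly the kind of 0-based/1-based mismatch flagged earlier in this review.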
Do you plan to include all future readers in this file?
Since the text loader doesn't use Lhotse, I thought it might also make sense to split the file. I'm mainly thinking about future development: for example, if we add image or video readers, they might rely on other specialized packages that aren't needed for audio tasks. This approach isn't fully compatible with the current ESPnet 3 design (task-based dependencies during installation), but if possible, it might be nice to separate the dependencies for each class.
I'll split the file into text_loader.py and audio_loader.py. Thanks!
@sw005320 I think this PR can also be merged. @Masao-Someki has reviewed it and I have revised accordingly.
@jctian98 this is amazing! Do you plan to add any unit tests? It might be helpful for maintaining this.
@siddhu001 I may need more discussion with @Masao-Someki and @wanchichen about how to make test cases for the recent PRs, mostly because these PRs have some dependencies (e.g., Lhotse manifests) that may not be easy to make compatible with the current CI. @sw005320 it may be OK to merge this PR.
Thanks, @jctian98, @siddhu001, and @Masao-Someki!
The data iterator block of the SpeechLM repo.
This PR should not be reviewed or merged before #6257.
(1) espnet2/speechlm/configuration: general-purpose and language-model-specific task template design.
(2) espnet2/speechlm/dataloader/iterator.py: the iterator factory logic to build the dataloader object, with state tracking and deterministic behavior.
(3) espnet2/speechlm/dataloader/batch.py: the method to batchify the data.
(4) Plus some minor fixes.
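As a rough illustration of the batchify idea in (3), and not the PR's actual API (the function name and packing policy here are hypothetical), length-budgeted batching can be sketched as:

```python
# Hypothetical sketch: greedily pack utterance indices into batches so the
# padded size (num_utts * max_len) never exceeds a fixed budget.
def batchify(lengths, budget):
    batches, cur, cur_max = [], [], 0
    for idx, length in enumerate(lengths):
        new_max = max(cur_max, length)
        if cur and (len(cur) + 1) * new_max > budget:
            batches.append(cur)  # adding this item would exceed the budget
            cur, cur_max = [idx], length
        else:
            cur.append(idx)
            cur_max = new_max
    if cur:
        batches.append(cur)
    return batches

print(batchify([3, 4, 5, 10, 2], budget=12))  # → [[0, 1], [2], [3], [4]]
```

Padding-aware budgets like this keep per-batch memory roughly constant even when utterance lengths vary widely, which matters for multi-modal SpeechLM data.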
Also see: #6258
(1) SpeechLM preprocessor;
(2) SpeechLM model files;
(3) a SpeechLM-specific trainer;
(4) a PR to connect all components;
(5) inference.