S2T Recipe for IPAPack++: Data Preparation#6169
Conversation
for more information, see https://pre-commit.ci
Pull Request Overview
This pull request introduces a comprehensive data preparation pipeline for the IPAPack++ dataset, covering dataset downloading, processing, and formatting to support speech-to-text tasks. Key changes include:
- New scripts for subsampling, filtering, renaming, and combining dataset files (subset.py).
- A processing script that converts the IPAPack++ data into the target format with language normalization and statistics (process_ipapack.py).
- Auxiliary scripts for downloading, fixing file paths, generating non-linguistic symbols, and orchestrating data preparation.
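The target format referred to above is the Kaldi-style data directory layout (`wav.scp`, `text`, `utt2spk`). As a hedged illustration of that layout only — the function and file handling here are mine, not the recipe's actual `process_ipapack.py`, which additionally does text normalization, language statistics, and task-specific text files:

```python
import os

def write_kaldi_dir(out_dir, utts):
    """Write a minimal Kaldi-style data directory: wav.scp, text, utt2spk.

    utts: iterable of (utt_id, wav_path, transcript) tuples.
    Illustrative sketch only, not the recipe's processing script.
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "wav.scp"), "w", encoding="utf-8") as wav_f, \
         open(os.path.join(out_dir, "text"), "w", encoding="utf-8") as txt_f, \
         open(os.path.join(out_dir, "utt2spk"), "w", encoding="utf-8") as spk_f:
        # Kaldi tools expect these files sorted by utterance id.
        for utt_id, wav_path, transcript in sorted(utts):
            wav_f.write(f"{utt_id} {wav_path}\n")
            txt_f.write(f"{utt_id} {transcript}\n")
            # With no speaker labels, fall back to utt_id as the speaker id.
            spk_f.write(f"{utt_id} {utt_id}\n")
```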
Reviewed Changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| egs2/ipapack_plus/s2t1/local/subset.py | Implements subsampling, filtering, renaming, split generation, and file combination for dataset preparation. |
| egs2/ipapack_plus/s2t1/local/process_ipapack.py | Converts and processes the raw data into the expected format with text normalization and language statistics. |
| egs2/ipapack_plus/s2t1/local/path.sh | Provides environmental path setup (minimal content). |
| egs2/ipapack_plus/s2t1/local/generate_nlsyms.py | Generates lists of non-linguistic symbols for BPE tokenization and scoring. |
| egs2/ipapack_plus/s2t1/local/fix_doreco.py | Corrects dataset file paths and converts DoReCo transcripts to a Kaldi-compatible format. |
| egs2/ipapack_plus/s2t1/local/download.sh | Installs dependencies, triggers dataset download, and extracts audio archives. |
| egs2/ipapack_plus/s2t1/local/download.py | Downloads dataset partitions from Hugging Face with retry logic. |
| egs2/ipapack_plus/s2t1/local/data.sh | Orchestrates the various stages of data preparation, invoking the other scripts. |
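The `download.py` row above mentions retry logic. The general pattern can be sketched like this — the function name, backoff values, and use of `urllib` are assumptions for illustration, not the script's actual implementation:

```python
import time
import urllib.request

def download_with_retry(url, dest, max_retries=3, backoff=2.0):
    """Fetch url into dest, retrying with linearly growing backoff.

    Sketch of a generic retry loop; the recipe's download.py may use a
    Hugging Face client instead of urllib.
    """
    for attempt in range(1, max_retries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return dest
        except OSError:  # URLError subclasses OSError
            if attempt == max_retries:
                raise  # give up after the final attempt
            time.sleep(backoff * attempt)
```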
Comments suppressed due to low confidence (1)
egs2/ipapack_plus/s2t1/local/download.sh, line 8:
- The script uses a `log` command without defining it; either define the `log` function or replace it with `echo` to avoid potential runtime errors.

```shell
if [ -z "${IPAPACK_PLUS}" ]; then
```
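One way to address this is a minimal `log` helper; the exact body below is an assumption, not the recipe's code:

```shell
# Minimal log helper (assumed, not the recipe's exact definition):
# timestamp each message so stage output is easy to trace.
log() {
    echo -e "$(date '+%Y-%m-%dT%H:%M:%S') $*"
}

# The check above can then report and fail cleanly when the variable is unset.
check_ipapack_plus() {
    if [ -z "${IPAPACK_PLUS}" ]; then
        log "Error: \$IPAPACK_PLUS is not set."
        return 1
    fi
}
```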
Co-authored-by: Copilot <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           master    #6169     +/-  ##
=========================================
- Coverage   46.53%   43.87%    -2.66%
=========================================
  Files         542      631      +89
  Lines       49601    57635    +8034
=========================================
+ Hits        23080    25287    +2207
- Misses      26521    32348    +5827
```

Flags with carried forward coverage won't be shown.
```diff
@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
...
+pip3 install -r local/requirements.txt
```

Could you add this requirements.txt?
Kazuki-Ya left a comment:

I've commented on the parts that caught my attention! I think the rest looks good!
```python
# Read orthography and phoneme sequences
with open(os.path.join(dump_dir, "orthography"), "r") as orth_file:
    orthography = orth_file.readlines()
```
The orthography file doesn't seem to be placed in dump_dir with the current processing, so this will likely cause an error. I think we might need to copy the orthography file to dump_dir, but I'm not entirely sure about this.
Thanks for spotting this! I changed it to use the path to data/{dataset} (data_dir_path).
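The fix can be sketched as follows — a minimal sketch, assuming data_dir_path is the data/{dataset} directory and the file holds one orthography line per utterance (the function name is mine):

```python
import os

def read_orthography(data_dir_path):
    """Read per-utterance orthography lines from the data/{dataset} directory.

    Illustrative sketch of the corrected read path; the recipe's script
    may structure this differently.
    """
    path = os.path.join(data_dir_path, "orthography")
    with open(path, "r", encoding="utf-8") as orth_file:
        return [line.rstrip("\n") for line in orth_file]
```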
```python
dst_file = os.path.join(process_dir, file_name)
if os.path.exists(src_file):
    with open(src_file, "r") as src, open(dst_file, "w") as dst:
        dst.write(src.read())
```
This could be dangerous if src_file and dst_file point to the same file. The open(dst_file, "w") would truncate the file before src.read() is called, potentially causing data loss. Could you consider adding a check like:

```python
if os.path.abspath(src_file) != os.path.abspath(dst_file):
    shutil.copy2(src_file, dst_file)
    print(f"{dst_file} copied!")
else:
    print(f"Skipping {file_name}: source and destination are the same")
```
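For completeness, the suggested guard as a self-contained helper — a sketch, not the recipe's code; note shutil.copy2 also preserves file metadata, which the plain read/write pair does not:

```python
import os
import shutil

def safe_copy(src_file, dst_file):
    """Copy src_file to dst_file unless both resolve to the same path.

    Guards against open(dst_file, "w") truncating the source before
    it has been read.
    """
    if os.path.abspath(src_file) == os.path.abspath(dst_file):
        return False  # same file: skip rather than risk truncation
    shutil.copy2(src_file, dst_file)
    return True
```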
egs2/ipapack_plus/s2t1/local/data.sh (outdated):

```shell
for dir in data/*; do
    utils/fix_data_dir.sh --utt_extra_files "orthography" $dir
    utils/validate_data_dir.sh --no-feats $dir || exit 1
```
I think we might need to add --non-print to utils/validate_data_dir.sh, but I'm not entirely sure.
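One reason --non-print tends to matter here: IPA transcripts are full of characters outside printable ASCII, which strict validation checks can reject. A hedged sketch for spotting such utterances — illustrative only, not the check validate_data_dir.sh actually performs:

```python
def find_non_ascii(text_path):
    """Yield (utt_id, chars) for Kaldi text entries with characters outside
    printable ASCII, e.g. IPA symbols such as 'ʃ' or 'ŋ'.

    Illustrative only; validate_data_dir.sh uses its own checks.
    """
    with open(text_path, encoding="utf-8") as f:
        for line in f:
            utt_id, _, transcript = line.rstrip("\n").partition(" ")
            bad = [ch for ch in transcript if not 0x20 <= ord(ch) < 0x7F]
            if bad:
                yield utt_id, bad
```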
@chinjouli, please fix the conflict (due to the merge of #6168).

Thanks!
This pull request introduces a comprehensive data preparation pipeline for the IPAPack++ dataset, enabling efficient downloading, processing, and formatting of data for speech-to-text tasks. The changes include new scripts for downloading the dataset, preparing data, handling language-specific processing, and generating subsets. Below is a breakdown of the most important changes by theme:
Data Preparation and Processing:
- local/data.sh to orchestrate the data preparation pipeline, including downloading the dataset, preparing data directories, and validating data integrity. This script also introduces a logging mechanism and stage-based execution.
- local/process_ipapack.py to convert IPAPack++ data into the expected format for downstream tasks. This includes text normalization, language duration statistics, and generating task-specific text files.

Dataset Download and Extraction:
- local/download.py to download dataset partitions from Hugging Face, with retry logic for robustness.
- local/download.sh to manage dataset dependencies, install required Python packages, and extract audio files from tar archives while preserving directory structures.

Auxiliary Scripts for Data Handling:
- local/fix_doreco.py to correct file paths and convert DoReCo transcripts into Kaldi-compatible formats.
- local/subset.py to enable dataset subsampling, filtering, renaming, and combining. This script supports creating subsets for training, development, and testing.

Language-Specific Features:
- local/generate_nlsyms.py to generate non-linguistic symbols (NLSyms) for BPE tokenization and scoring, ensuring proper handling of special tokens and phonemes.

What did you change?
Data preparation part for the new s2t recipe for IPAPack++
Why did you make this change?
Provide a basic setup for developing a multitask phone recognition model
Is your PR small enough?
Exceeds # lines, but all files here are for local data preparation, including a 400-line mapping json file.
Additional Context
Depends on #6168