
S2T Recipe for IPAPack++: Data Preparation #6169

Merged
sw005320 merged 21 commits into espnet:master from chinjouli:ipapack_dataprep on Jul 2, 2025

Conversation

@chinjouli (Contributor) commented Jun 23, 2025

This pull request introduces a comprehensive data preparation pipeline for the IPAPack++ dataset, enabling efficient downloading, processing, and formatting of data for speech-to-text tasks. The changes include new scripts for downloading the dataset, preparing data, handling language-specific processing, and generating subsets. Below is a breakdown of the most important changes by theme:

Data Preparation and Processing:

  • Added local/data.sh to orchestrate the data preparation pipeline, including downloading the dataset, preparing data directories, and validating data integrity. This script also introduces a logging mechanism and stage-based execution.
  • Implemented local/process_ipapack.py to convert IPAPack++ data into the expected format for downstream tasks. This includes text normalization, language duration statistics, and generating task-specific text files.
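
For readers unfamiliar with the target layout, the conversion boils down to emitting parallel Kaldi-style files keyed by utterance id. A minimal sketch of that idea (the record fields and the utt-id-as-speaker fallback are illustrative assumptions, not necessarily what process_ipapack.py does):

```python
import os

def write_kaldi_dir(records, out_dir):
    """Write minimal Kaldi-style text/wav.scp/utt2spk files.

    `records` is a list of dicts with hypothetical keys
    'utt_id', 'wav_path', and 'phones' (space-separated IPA tokens).
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "text"), "w", encoding="utf-8") as text_f, \
         open(os.path.join(out_dir, "wav.scp"), "w", encoding="utf-8") as wav_f, \
         open(os.path.join(out_dir, "utt2spk"), "w", encoding="utf-8") as u2s_f:
        # Kaldi tools expect the files sorted by utterance id.
        for r in sorted(records, key=lambda r: r["utt_id"]):
            text_f.write(f'{r["utt_id"]} {r["phones"]}\n')
            wav_f.write(f'{r["utt_id"]} {r["wav_path"]}\n')
            # No speaker labels available: fall back to utt_id as speaker id.
            u2s_f.write(f'{r["utt_id"]} {r["utt_id"]}\n')

records = [{"utt_id": "eng_0001", "wav_path": "downloads/eng_0001.wav",
            "phones": "h ɛ l oʊ"}]
write_kaldi_dir(records, "data/demo")
```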

Dataset Download and Extraction:

  • Added local/download.py to download dataset partitions from Hugging Face, with retry logic for robustness.
  • Added local/download.sh to manage dataset dependencies, install required Python packages, and extract audio files from tar archives while preserving directory structures.
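
The retry behavior can be summarized as a small wrapper around each partition download. The function below is a simplified stand-in: the actual download.py talks to the Hugging Face hub, which this sketch abstracts away as an arbitrary callable.

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call `fn`, retrying on any exception with a fixed delay.

    Re-raises the last exception once `attempts` calls have failed.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Example: a flaky operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient network error")
    return "ok"

print(with_retries(flaky, attempts=3, delay=0.0))  # → ok
```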

Auxiliary Scripts for Data Handling:

  • Added local/fix_doreco.py to correct file paths and convert DoReCo transcripts into Kaldi-compatible formats.
  • Added local/subset.py to enable dataset subsampling, filtering, renaming, and combining. This script supports creating subsets for training, development, and testing.
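
Subsampling a Kaldi-style data directory reduces to picking a set of utterance ids and then filtering every per-utterance file by its first field. A hedged sketch of that idea (function names are illustrative, not the actual subset.py API):

```python
import random

def subsample_utts(utt_ids, n, seed=0):
    """Pick a reproducible random subset of n utterance ids."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(utt_ids), min(n, len(utt_ids))))

def filter_kaldi_lines(lines, keep):
    """Keep only lines whose first field (the utt id) is in `keep`."""
    return [ln for ln in lines if ln.split(maxsplit=1)[0] in keep]

utts = [f"utt{i:03d}" for i in range(100)]
keep = set(subsample_utts(utts, 10))
text_lines = [f"utt{i:03d} some phones" for i in range(100)]
subset = filter_kaldi_lines(text_lines, keep)
print(len(subset))  # → 10
```

The same filter can be applied to text, wav.scp, utt2spk, and any extra files (such as orthography) so that all files in the subset stay consistent.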

Language-Specific Features:

  • Added local/generate_nlsyms.py to generate non-linguistic symbols (NLSyms) for BPE tokenization and scoring, ensuring proper handling of special tokens and phonemes.
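
One common way to collect non-linguistic symbols is to scan the training text for angle-bracketed special tokens so they can be protected during BPE training and scoring. The token convention below is an assumption for illustration, not necessarily what generate_nlsyms.py does:

```python
import re

def collect_nlsyms(lines):
    """Collect angle-bracketed special tokens (e.g. <eng>, <asr>)
    from Kaldi-style text lines, deduplicated and sorted."""
    syms = set()
    for ln in lines:
        syms.update(re.findall(r"<[^<>\s]+>", ln))
    return sorted(syms)

lines = ["utt1 <eng> <asr> h ɛ l oʊ", "utt2 <deu> <asr> h a l oʊ"]
print(collect_nlsyms(lines))  # → ['<asr>', '<deu>', '<eng>']
```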

What did you change?

Data preparation part for the new s2t recipe for IPAPack++


Why did you make this change?

Provide a basic setup for developing a multitask phone recognition model


Is your PR small enough?

It exceeds the # line limit, but all files here are for local data preparation, including a 400-line JSON mapping file.


Additional Context

Depends on #6168

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Jun 23, 2025
@mergify mergify bot added the ESPnet2 label Jun 23, 2025
@dosubot dosubot bot added the Recipe label Jun 23, 2025
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Jun 23, 2025
@sw005320 sw005320 requested a review from Copilot June 23, 2025 17:35
@sw005320 sw005320 added this to the v.202506 milestone Jun 23, 2025
Copilot AI (Contributor) left a comment

Pull Request Overview

This pull request introduces a comprehensive data preparation pipeline for the IPAPack++ dataset, covering dataset downloading, processing, and formatting to support speech-to-text tasks. Key changes include:

  • New scripts for subsampling, filtering, renaming, and combining dataset files (subset.py).
  • A processing script that converts the IPAPack++ data into the target format with language normalization and statistics (process_ipapack.py).
  • Auxiliary scripts for downloading, fixing file paths, generating non-linguistic symbols, and orchestrating data preparation.

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

Summary per file:
egs2/ipapack_plus/s2t1/local/subset.py Implements subsampling, filtering, renaming, split generation, and file combination for dataset preparation.
egs2/ipapack_plus/s2t1/local/process_ipapack.py Converts and processes the raw data into the expected format with text normalization and language statistics.
egs2/ipapack_plus/s2t1/local/path.sh Provides environmental path setup (minimal content).
egs2/ipapack_plus/s2t1/local/generate_nlsyms.py Generates lists of non-linguistic symbols for BPE tokenization and scoring.
egs2/ipapack_plus/s2t1/local/fix_doreco.py Corrects dataset file paths and converts DoReCo transcripts to a Kaldi-compatible format.
egs2/ipapack_plus/s2t1/local/download.sh Installs dependencies, triggers dataset download, and extracts audio archives.
egs2/ipapack_plus/s2t1/local/download.py Downloads dataset partitions from Hugging Face with retry logic.
egs2/ipapack_plus/s2t1/local/data.sh Orchestrates the various stages of data preparation, invoking the other scripts.
Comments suppressed due to low confidence (1)

egs2/ipapack_plus/s2t1/local/download.sh:8

  • The script uses a 'log' command without defining it; either define the 'log' function or replace it with 'echo' to avoid potential runtime errors.
if [ -z "${IPAPACK_PLUS}" ]; then

codecov bot commented Jun 25, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 43.87%. Comparing base (333b6f7) to head (a3d1b68).
Report is 22 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6169      +/-   ##
==========================================
- Coverage   46.53%   43.87%   -2.66%     
==========================================
  Files         542      631      +89     
  Lines       49601    57635    +8034     
==========================================
+ Hits        23080    25287    +2207     
- Misses      26521    32348    +5827     
Flag Coverage Δ
test_integration_espnet2 46.53% <ø> (ø)
test_integration_espnetez 37.16% <ø> (?)

Flags with carried forward coverage won't be shown.


@@ -0,0 +1,28 @@
#!/usr/bin/env bash

pip3 install -r local/requirements.txt


Could you add this requirements.txt?

chinjouli (Contributor, Author) replied:

Added!

@Kazuki-Ya left a comment

I've commented on the parts that caught my attention! I think the rest looks good!


# Read orthography and phoneme sequences
with open(os.path.join(dump_dir, "orthography"), "r") as orth_file:
    orthography = orth_file.readlines()


The orthography file doesn't seem to be placed in dump_dir with the current processing, so this will likely cause an error. I think we might need to copy the orthography file to dump_dir, but I'm not entirely sure about this.

chinjouli (Contributor, Author) replied:

Thanks for spotting this! I changed it to the path to data/{dataset} (data_dir_path)

dst_file = os.path.join(process_dir, file_name)
if os.path.exists(src_file):
    with open(src_file, "r") as src, open(dst_file, "w") as dst:
        dst.write(src.read())


This could be dangerous if src_file and dst_file point to the same file. The open(dst_file, "w") would truncate the file before src.read() is called, potentially causing data loss. Could you consider adding a check like:

if os.path.abspath(src_file) != os.path.abspath(dst_file):
    shutil.copy2(src_file, dst_file)
    print(f"{dst_file} copied!")
else:
    print(f"Skipping {file_name}: source and destination are the same")

chinjouli (Contributor, Author) replied:

Added!


for dir in data/*; do
    utils/fix_data_dir.sh --utt_extra_files "orthography" $dir
    utils/validate_data_dir.sh --no-feats $dir || exit 1


I think we might need to add --non-print to utils/validate_data_dir.sh , but I'm not entirely sure.

@sw005320 (Contributor) commented:

@chinjouli, please fix the conflict (due to the merge of #6168).

@mergify mergify bot added the README label Jul 1, 2025
@sw005320 sw005320 merged commit 78570db into espnet:master Jul 2, 2025
40 of 41 checks passed

sw005320 commented Jul 2, 2025

Thanks!

@chinjouli deleted the ipapack_dataprep branch October 27, 2025 05:13