
S2T Recipe for IPAPack++: Data Preparation #6169

Merged
sw005320 merged 21 commits into espnet:master from chinjouli:ipapack_dataprep on Jul 2, 2025

Conversation

@chinjouli (Contributor) commented Jun 23, 2025

This pull request introduces a comprehensive data preparation pipeline for the IPAPack++ dataset, enabling efficient downloading, processing, and formatting of data for speech-to-text tasks. The changes include new scripts for downloading the dataset, preparing data, handling language-specific processing, and generating subsets. Below is a breakdown of the most important changes by theme:

Data Preparation and Processing:

  • Added local/data.sh to orchestrate the data preparation pipeline, including downloading the dataset, preparing data directories, and validating data integrity. This script also introduces a logging mechanism and stage-based execution.
  • Implemented local/process_ipapack.py to convert IPAPack++ data into the expected format for downstream tasks. This includes text normalization, language duration statistics, and generating task-specific text files.
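
For readers unfamiliar with the target layout, the conversion boils down to emitting parallel Kaldi-style files keyed by utterance id. A minimal sketch of that idea (the record fields and the utt-id-as-speaker fallback are illustrative assumptions, not necessarily what process_ipapack.py does):

```python
import os

def write_kaldi_dir(records, out_dir):
    """Write minimal Kaldi-style text/wav.scp/utt2spk files.

    `records` is a list of dicts with hypothetical keys
    'utt_id', 'wav_path', and 'phones' (space-separated IPA tokens).
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "text"), "w", encoding="utf-8") as text_f, \
         open(os.path.join(out_dir, "wav.scp"), "w", encoding="utf-8") as wav_f, \
         open(os.path.join(out_dir, "utt2spk"), "w", encoding="utf-8") as u2s_f:
        # Kaldi tools expect the files sorted by utterance id.
        for r in sorted(records, key=lambda r: r["utt_id"]):
            text_f.write(f'{r["utt_id"]} {r["phones"]}\n')
            wav_f.write(f'{r["utt_id"]} {r["wav_path"]}\n')
            # No speaker labels available: fall back to utt_id as speaker id.
            u2s_f.write(f'{r["utt_id"]} {r["utt_id"]}\n')

records = [{"utt_id": "eng_0001", "wav_path": "downloads/eng_0001.wav",
            "phones": "h ɛ l oʊ"}]
write_kaldi_dir(records, "data/demo")
```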

Dataset Download and Extraction:

  • Added local/download.py to download dataset partitions from Hugging Face, with retry logic for robustness.
  • Added local/download.sh to manage dataset dependencies, install required Python packages, and extract audio files from tar archives while preserving directory structures.
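
The retry behavior can be summarized as a small wrapper around each partition download. The function below is a simplified stand-in: the actual download.py talks to the Hugging Face hub, which this sketch abstracts away as an arbitrary callable.

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call `fn`, retrying on any exception with a fixed delay.

    Re-raises the last exception once `attempts` calls have failed.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Example: a flaky operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient network error")
    return "ok"

print(with_retries(flaky, attempts=3, delay=0.0))  # → ok
```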

Auxiliary Scripts for Data Handling:

  • Added local/fix_doreco.py to correct file paths and convert DoReCo transcripts into Kaldi-compatible formats.
  • Added local/subset.py to enable dataset subsampling, filtering, renaming, and combining. This script supports creating subsets for training, development, and testing.
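
Subsampling a Kaldi-style data directory reduces to picking a set of utterance ids and then filtering every per-utterance file by its first field. A hedged sketch of that idea (function names are illustrative, not the actual subset.py API):

```python
import random

def subsample_utts(utt_ids, n, seed=0):
    """Pick a reproducible random subset of n utterance ids."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(utt_ids), min(n, len(utt_ids))))

def filter_kaldi_lines(lines, keep):
    """Keep only lines whose first field (the utt id) is in `keep`."""
    return [ln for ln in lines if ln.split(maxsplit=1)[0] in keep]

utts = [f"utt{i:03d}" for i in range(100)]
keep = set(subsample_utts(utts, 10))
text_lines = [f"utt{i:03d} some phones" for i in range(100)]
subset = filter_kaldi_lines(text_lines, keep)
print(len(subset))  # → 10
```

The same filter can be applied to text, wav.scp, utt2spk, and any extra files (such as orthography) so that all files in the subset stay consistent.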

Language-Specific Features:

  • Added local/generate_nlsyms.py to generate non-linguistic symbols (NLSyms) for BPE tokenization and scoring, ensuring proper handling of special tokens and phonemes.
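
One common way to collect non-linguistic symbols is to scan the training text for angle-bracketed special tokens so they can be protected during BPE training and scoring. The token convention below is an assumption for illustration, not necessarily what generate_nlsyms.py does:

```python
import re

def collect_nlsyms(lines):
    """Collect angle-bracketed special tokens (e.g. <eng>, <asr>)
    from Kaldi-style text lines, deduplicated and sorted."""
    syms = set()
    for ln in lines:
        syms.update(re.findall(r"<[^<>\s]+>", ln))
    return sorted(syms)

lines = ["utt1 <eng> <asr> h ɛ l oʊ", "utt2 <deu> <asr> h a l oʊ"]
print(collect_nlsyms(lines))  # → ['<asr>', '<deu>', '<eng>']
```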

What did you change?

Data preparation part for the new s2t recipe for IPAPack++


Why did you make this change?

Provide a basic setup for developing a multitask phone recognition model


Is your PR small enough?

It exceeds the # line limit, but all files here are for local data preparation, including a 400-line JSON mapping file.


Additional Context

Depends on #6168

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Jun 23, 2025
@mergify mergify bot added the ESPnet2 label Jun 23, 2025
@dosubot dosubot bot added the Recipe label Jun 23, 2025
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Jun 23, 2025
@sw005320 sw005320 requested a review from Copilot June 23, 2025 17:35
@sw005320 sw005320 added this to the v.202506 milestone Jun 23, 2025
Copilot AI (Contributor) left a comment

Pull Request Overview

This pull request introduces a comprehensive data preparation pipeline for the IPAPack++ dataset, covering dataset downloading, processing, and formatting to support speech-to-text tasks. Key changes include:

  • New scripts for subsampling, filtering, renaming, and combining dataset files (subset.py).
  • A processing script that converts the IPAPack++ data into the target format with language normalization and statistics (process_ipapack.py).
  • Auxiliary scripts for downloading, fixing file paths, generating non-linguistic symbols, and orchestrating data preparation.

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

Summary per file:
egs2/ipapack_plus/s2t1/local/subset.py Implements subsampling, filtering, renaming, split generation, and file combination for dataset preparation.
egs2/ipapack_plus/s2t1/local/process_ipapack.py Converts and processes the raw data into the expected format with text normalization and language statistics.
egs2/ipapack_plus/s2t1/local/path.sh Provides environmental path setup (minimal content).
egs2/ipapack_plus/s2t1/local/generate_nlsyms.py Generates lists of non-linguistic symbols for BPE tokenization and scoring.
egs2/ipapack_plus/s2t1/local/fix_doreco.py Corrects dataset file paths and converts DoReCo transcripts to a Kaldi-compatible format.
egs2/ipapack_plus/s2t1/local/download.sh Installs dependencies, triggers dataset download, and extracts audio archives.
egs2/ipapack_plus/s2t1/local/download.py Downloads dataset partitions from Hugging Face with retry logic.
egs2/ipapack_plus/s2t1/local/data.sh Orchestrates the various stages of data preparation, invoking the other scripts.
Comments suppressed due to low confidence (1)

egs2/ipapack_plus/s2t1/local/download.sh:8

  • The script uses a 'log' command without defining it; either define the 'log' function or replace it with 'echo' to avoid potential runtime errors.
if [ -z "${IPAPACK_PLUS}" ]; then

codecov bot commented Jun 25, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 43.87%. Comparing base (333b6f7) to head (a3d1b68).
Report is 22 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6169      +/-   ##
==========================================
- Coverage   46.53%   43.87%   -2.66%     
==========================================
  Files         542      631      +89     
  Lines       49601    57635    +8034     
==========================================
+ Hits        23080    25287    +2207     
- Misses      26521    32348    +5827     
Flag Coverage Δ
test_integration_espnet2 46.53% <ø> (ø)
test_integration_espnetez 37.16% <ø> (?)

Flags with carried forward coverage won't be shown.


@@ -0,0 +1,28 @@
#!/usr/bin/env bash

pip3 install -r local/requirements.txt


Could you add this requirements.txt?

chinjouli (Contributor, Author) replied:

Added!

@Kazuki-Ya left a comment

I've commented on the parts that caught my attention! I think the rest looks good!


# Read orthography and phoneme sequences
with open(os.path.join(dump_dir, "orthography"), "r") as orth_file:
    orthography = orth_file.readlines()


The orthography file doesn't seem to be placed in dump_dir with the current processing, so this will likely cause an error. I think we might need to copy the orthography file to dump_dir, but I'm not entirely sure about this.

chinjouli (Contributor, Author) replied:

Thanks for spotting this! I changed it to the path to data/{dataset} (data_dir_path)

dst_file = os.path.join(process_dir, file_name)
if os.path.exists(src_file):
    with open(src_file, "r") as src, open(dst_file, "w") as dst:
        dst.write(src.read())


This could be dangerous if src_file and dst_file point to the same file. The open(dst_file, "w") would truncate the file before src.read() is called, potentially causing data loss. Could you consider adding a check like:

if os.path.abspath(src_file) != os.path.abspath(dst_file):
    shutil.copy2(src_file, dst_file)
    print(f"{dst_file} copied!")
else:
    print(f"Skipping {file_name}: source and destination are the same")

chinjouli (Contributor, Author) replied:

Added!


for dir in data/*; do
    utils/fix_data_dir.sh --utt_extra_files "orthography" $dir
    utils/validate_data_dir.sh --no-feats $dir || exit 1


I think we might need to add --non-print to utils/validate_data_dir.sh , but I'm not entirely sure.

@sw005320 (Contributor) commented:

@chinjouli, please fix the conflict (due to the merge of #6168).

@mergify mergify bot added the README label Jul 1, 2025
@sw005320 sw005320 merged commit 78570db into espnet:master Jul 2, 2025
40 of 41 checks passed

sw005320 commented Jul 2, 2025

Thanks!

@chinjouli deleted the ipapack_dataprep branch October 27, 2025 05:13