Masao-Someki · 2025-06-22T19:58:04Z

What did you change?

Removed ESPnet-EZ, as we are transitioning to ESPnet-3.
Added a data organizer module along with accompanying unit tests.

Why did you make this change?

The new data organizer is designed to support multi-dataset experiments. It allows users to easily add or remove datasets from their experiments, improving flexibility and modularity.

Is your PR small enough?

Yes. Although the removal of ESPnet-EZ results in 14 files being deleted, only 4 new files are added.
The added code is well-documented, with comprehensive docstrings, which accounts for a relatively high number of lines.

Additional Context

#6133

- This will be removed when espnet#6139 is merged into master.

Masao-Someki · 2025-06-22T20:10:01Z

Here are the details of the unit test cases.

🔷 1. Basic Integration

Basic Integration of train/valid/test combinations

Test Function Name	Description
test_data_organizer_init	Initializes DataOrganizer with train/valid/test
test_data_organizer_without_test	Verifies behavior without test section
test_data_organizer_test_only	Validates usage of test-only pipelines
test_data_organizer_train_valid_multiple	Ensures multiple train/valid datasets are supported
test_data_organizer_empty_train_valid_ok	Confirms train/valid can be empty lists

🔷 2. Transform / Preprocessor Application

Behavior of transform and preprocessor combinations

Test Function Name	Description
test_combined_dataset	Combines datasets and applies transform
test_data_organizer_transform_only	Applies only transform to data
test_data_organizer_preprocessor_only	Applies only preprocessor to data
test_data_organizer_transform_and_preprocessor	Applies both transform and preprocessor
test_espnet_preprocessor_without_transform	Uses only ESPnet-style preprocessor (UID-based)
test_espnet_preprocessor_with_transform	Combines transform with ESPnet preprocessor

🔷 3. Test Set Variants

Tests for variations in test dataset configurations

Test Function Name	Description
test_data_organizer_test_multiple_sets	Handles multiple named test sets

🔶 Error Cases

Test Function Name	Description	Expected Exception
`test_data_organizer_train_only_assertion`	Raises error when only train is provided without valid	`RuntimeError`
`test_data_organizer_inconsistent_keys`	Fails when dataset output keys are inconsistent in a combined setup	`AssertionError`
`test_data_organizer_transform_none`	Simulates transform failure that raises an internal exception	`ValueError`
`test_data_organizer_invalid_preprocessor_type`	Fails when a non-callable object is used as preprocessor	`AssertionError`
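
Two of the error cases above can be illustrated with a small, self-contained sketch. This is not the ESPnet-3 implementation: `ToyCombinedDataset` and its constructor arguments are hypothetical names made up for this example, and the sketch assumes the checks are plain `assert` statements, which would match the `AssertionError` entries in the table.

```python
from typing import Callable, Dict, Optional, Sequence


class ToyCombinedDataset:
    """Hypothetical stand-in for a combined dataset (not the ESPnet-3 class)."""

    def __init__(
        self,
        datasets: Sequence[Sequence[Dict]],
        preprocessor: Optional[Callable[[Dict], Dict]] = None,
    ) -> None:
        # Assumed check: a preprocessor, if given, must be callable.
        assert preprocessor is None or callable(preprocessor), (
            "preprocessor must be callable"
        )
        self.datasets = list(datasets)
        self.preprocessor = preprocessor

        # Assumed check: the first sample of every non-empty dataset
        # must expose the same set of keys.
        key_sets = [set(d[0].keys()) for d in self.datasets if len(d) > 0]
        assert all(k == key_sets[0] for k in key_sets), "inconsistent output keys"

    def __len__(self) -> int:
        return sum(len(d) for d in self.datasets)

    def __getitem__(self, idx: int) -> Dict:
        # Walk the datasets in order until the index falls inside one of them.
        for d in self.datasets:
            if idx < len(d):
                sample = dict(d[idx])
                return self.preprocessor(sample) if self.preprocessor else sample
            idx -= len(d)
        raise IndexError("index out of range")
```

With this toy class, combining `[{"text": "a"}]` with `[{"audio": 1}]` fails the key check, and passing `preprocessor=42` fails the callable check.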

Masao-Someki · 2025-06-22T20:14:21Z

The current CI error is mainly caused by the pyproject.toml configuration.
Since formatting-related changes are being handled separately in #6164 , the pyproject.toml in the base branch is not yet compatible with the current CI setup.

This PR should be merged only after #6139 has been merged and its CI has passed.

sw005320 · 2025-06-22T21:30:49Z

This pull request introduces support for ESPnet-3 across multiple CI workflows and adds new functionality for dataset organization in the espnet3 module. Key changes include updates to CI configurations, the addition of new Python scripts, and the introduction of a DataOrganizer class for managing datasets.

CI Workflow Updates:

.github/workflows/ci_on_debian11.yml: Added support for the espnet3 branch and updated job names and scripts to include ESPnet-3 testing.
.github/workflows/ci_on_macos.yml: Added support for the espnet3 branch.
.github/workflows/ci_on_ubuntu.yml: Updated CI jobs to include ESPnet-3 testing, expanded Python and PyTorch versions for matrix testing, and added a step to fetch PR labels.
.github/workflows/ci_on_windows.yml: Added support for the espnet3 branch.

ESPnet-3 Module Enhancements:

ci/test_python_espnet3.sh: Renamed from ci/test_python_espnetez.sh and updated to include ESPnet-3-specific paths and testing configurations.
espnet3/data/__init__.py: Added imports for data_organizer and dataset.
espnet3/data/data_organizer.py: Introduced the DataOrganizer class to manage training, validation, and test datasets, with support for modular dataset configuration and preprocessing.

sw005320 · 2025-07-29T09:45:17Z

@Masao-Someki, can you fix the conflict?

sw005320 · 2025-07-29T12:30:28Z

.github/workflows/ci_on_ubuntu.yml

-        chainer-version: ["6.0.0"]
+        python-version: ["3.10", "3.11"]
+        pytorch-version: [2.1.2, 2.2.2, 2.3.1, 2.4.1, 2.5.1, 2.6.0]
+        chainer-version: [6.0.0]


Do we need it?

I thought it would be good to include those versions in the tests for ESPnet-3 as well, since we already test them for ESPnet1/2 and have a table for them in the README.

I was asking about Chainer itself, not the versions.
Do you need Chainer?

Thank you, I understand now. Chainer is required by espnet-1 and is supposed to be removed in #6179.
However, it looks like the PR didn’t actually remove it, so I’ll go ahead and remove the Chainer version in that PR.

sw005320 · 2025-07-29T13:31:06Z

.github/workflows/ci_on_ubuntu.yml

-        chainer-version: [6.0.0]
+        python-version: ["3.10"]
+        pytorch-version: ["2.6.0"]
+        chainer-version: ["6.0.0"]


Can we keep chainer here?

sw005320 · 2025-07-29T13:34:27Z

Please fix https://github.com/espnet/espnet/actions/runs/16596452146/job/46944176786?pr=6167#step:8:302

Copilot

Pull Request Overview

This PR transitions from ESPnet-EZ to ESPnet-3 by removing the ESPnet-EZ framework and introducing a new data organizer module for multi-dataset experiments.

Removes entire ESPnet-EZ framework including trainer, task, config, and preprocessing modules
Adds new ESPnet-3 data organizer module with CombinedDataset and DataOrganizer classes
Updates CI workflows to run ESPnet-3 tests instead of ESPnet-EZ tests

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

File	Description
espnet3/data/	Adds new data organizer and dataset classes for ESPnet-3
test/espnet3/	Comprehensive unit tests for the new data organizer module
espnetez/	Complete removal of ESPnet-EZ framework
test/espnetez/	Removal of ESPnet-EZ integration and unit tests
.github/workflows/	Updates CI to test ESPnet-3 instead of ESPnet-EZ
ci/	Updates test scripts for ESPnet-3

Copilot · 2025-07-29T13:34:46Z

test/espnet3/test_data_organizer.py

+# | Test Function Name               | Description                                                              | # noqa: E501
+# |----------------------------------|--------------------------------------------------------------------------| # noqa: E501
+# | test_data_organizer_init         | Initializes DataOrganizer with train/valid/test                          | # noqa: E501
+# | test_data_organizer_without_test | Verifies behavior without test section                                   | # noqa: E501
+# | test_data_organizer_test_only    | Validates usage of test-only pipelines                                   | # noqa: E501
+# | test_data_organizer_train_valid_multiple | Ensures multiple train/valid datasets are supported             | # noqa: E501
+# | test_data_organizer_empty_train_valid_ok | Confirms train/valid can be empty lists                          | # noqa: E501
+#
+# Transform / Preprocessor Application
+# | Test Function Name                    | Description                                                              | # noqa: E501
+# |--------------------------------------|--------------------------------------------------------------------------| # noqa: E501
+# | test_combined_dataset                | Combines datasets and applies transform                                  | # noqa: E501
+# | test_data_organizer_transform_only   | Applies only transform to data                                           | # noqa: E501
+# | test_data_organizer_preprocessor_only| Applies only preprocessor to data                                        | # noqa: E501
+# | test_data_organizer_transform_and_preprocessor | Applies both transform and preprocessor                        | # noqa: E501
+# | test_espnet_preprocessor_without_transform | Uses only ESPnet-style preprocessor (UID-based)               | # noqa: E501
+# | test_espnet_preprocessor_with_transform    | Combines transform with ESPnet preprocessor                   | # noqa: E501
+#
+# Test Set Variants
+# | Test Function Name                 | Description                                                              | # noqa: E501
+# |-----------------------------------|--------------------------------------------------------------------------| # noqa: E501
+# | test_data_organizer_test_multiple_sets | Handles multiple named test sets                                   | # noqa: E501
+#
+# Error Cases
+# | Test Function Name                         | Description                                                  | Expected Exception | # noqa: E501
+# |--------------------------------------------|--------------------------------------------------------------|--------------------| # noqa: E501
+# | test_data_organizer_train_only_assertion   | Raises error when only train is provided without valid       | RuntimeError       | # noqa: E501
+# | test_data_organizer_inconsistent_keys      | Fails when dataset output keys are inconsistent in combined  | AssertionError     | # noqa: E501
+# | test_data_organizer_transform_none         | Simulates transform failure that raises an internal exception| ValueError         | # noqa: E501
+# | test_data_organizer_invalid_preprocessor_type | Fails when a non-callable is used as preprocessor         | AssertionError     | # noqa: E501


[nitpick] The comment lines with # noqa: E501 suggest line length issues. Consider using a more concise table format or breaking the table into sections to improve readability.

Suggested change

-# | Test Function Name | Description | # noqa: E501
-# |----------------------------------|--------------------------------------------------------------------------| # noqa: E501
-# | test_data_organizer_init | Initializes DataOrganizer with train/valid/test | # noqa: E501
-# | test_data_organizer_without_test | Verifies behavior without test section | # noqa: E501
-# | test_data_organizer_test_only | Validates usage of test-only pipelines | # noqa: E501
-# | test_data_organizer_train_valid_multiple | Ensures multiple train/valid datasets are supported | # noqa: E501
-# | test_data_organizer_empty_train_valid_ok | Confirms train/valid can be empty lists | # noqa: E501
-#
-# Transform / Preprocessor Application
-# | Test Function Name | Description | # noqa: E501
-# |--------------------------------------|--------------------------------------------------------------------------| # noqa: E501
-# | test_combined_dataset | Combines datasets and applies transform | # noqa: E501
-# | test_data_organizer_transform_only | Applies only transform to data | # noqa: E501
-# | test_data_organizer_preprocessor_only| Applies only preprocessor to data | # noqa: E501
-# | test_data_organizer_transform_and_preprocessor | Applies both transform and preprocessor | # noqa: E501
-# | test_espnet_preprocessor_without_transform | Uses only ESPnet-style preprocessor (UID-based) | # noqa: E501
-# | test_espnet_preprocessor_with_transform | Combines transform with ESPnet preprocessor | # noqa: E501
-#
-# Test Set Variants
-# | Test Function Name | Description | # noqa: E501
-# |-----------------------------------|--------------------------------------------------------------------------| # noqa: E501
-# | test_data_organizer_test_multiple_sets | Handles multiple named test sets | # noqa: E501
-#
-# Error Cases
-# | Test Function Name | Description | Expected Exception | # noqa: E501
-# |--------------------------------------------|--------------------------------------------------------------|--------------------| # noqa: E501
-# | test_data_organizer_train_only_assertion | Raises error when only train is provided without valid | RuntimeError | # noqa: E501
-# | test_data_organizer_inconsistent_keys | Fails when dataset output keys are inconsistent in combined | AssertionError | # noqa: E501
-# | test_data_organizer_transform_none | Simulates transform failure that raises an internal exception| ValueError | # noqa: E501
-# | test_data_organizer_invalid_preprocessor_type | Fails when a non-callable is used as preprocessor | AssertionError | # noqa: E501
+# | Test Function Name | Description |
+# |----------------------------------|------------------------------------------|
+# | test_data_organizer_init | Initializes DataOrganizer with |
+# | | train/valid/test |
+# | test_data_organizer_without_test | Verifies behavior without test section |
+# | test_data_organizer_test_only | Validates usage of test-only pipelines |
+# | test_data_organizer_train_valid_multiple | Ensures multiple train/valid datasets |
+# | | are supported |
+# | test_data_organizer_empty_train_valid_ok | Confirms train/valid can be empty lists |
+#
+# Transform / Preprocessor Application
+# | Test Function Name | Description |
+# |--------------------------------------|------------------------------------------|
+# | test_combined_dataset | Combines datasets and applies transform |
+# | test_data_organizer_transform_only | Applies only transform to data |
+# | test_data_organizer_preprocessor_only| Applies only preprocessor to data |
+# | test_data_organizer_transform_and_preprocessor | Applies both transform and |
+# | | preprocessor |
+# | test_espnet_preprocessor_without_transform | Uses only ESPnet-style preprocessor |
+# | | (UID-based) |
+# | test_espnet_preprocessor_with_transform | Combines transform with ESPnet |
+# | | preprocessor |
+#
+# Test Set Variants
+# | Test Function Name | Description |
+# |-----------------------------------|------------------------------------------|
+# | test_data_organizer_test_multiple_sets | Handles multiple named test sets |
+#
+# Error Cases
+# | Test Function Name | Description | Expected Exception |
+# |--------------------------------------------|------------------------------------------|--------------------|
+# | test_data_organizer_train_only_assertion | Raises error when only train is provided| RuntimeError |
+# | | without valid | |
+# | test_data_organizer_inconsistent_keys | Fails when dataset output keys are | AssertionError |
+# | | inconsistent in combined | |
+# | test_data_organizer_transform_none | Simulates transform failure that raises | ValueError |
+# | | an internal exception | |
+# | test_data_organizer_invalid_preprocessor_type | Fails when a non-callable is used as | AssertionError |
+# | | preprocessor | |

Copilot · 2025-07-29T13:34:47Z

espnet3/data/data_organizer.py

+        if ((self.train is None) ^ (self.valid is None)) or (
+            isinstance(self.train, CombinedDataset)
+            ^ isinstance(self.valid, CombinedDataset)


[nitpick] The boolean logic using XOR operator could be simplified for better readability. Consider using explicit conditions like (self.train is None and self.valid is not None) or (self.train is not None and self.valid is None).

Suggested change

-        if ((self.train is None) ^ (self.valid is None)) or (
-            isinstance(self.train, CombinedDataset)
-            ^ isinstance(self.valid, CombinedDataset)
+        if ((self.train is None and self.valid is not None) or (self.train is not None and self.valid is None)) or (
+            (isinstance(self.train, CombinedDataset) and not isinstance(self.valid, CombinedDataset)) or
+            (not isinstance(self.train, CombinedDataset) and isinstance(self.valid, CombinedDataset))
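
For readers weighing the two spellings, the short check below confirms that `^` on two `bool` values and the explicit "exactly one is true" form agree on every input. The function names are made up for this illustration. It relies on both operands being real booleans, which holds for `x is None` and `isinstance(...)`; on plain integers `^` is a bitwise operator and the equivalence would not carry over.

```python
from itertools import product


def mismatch_xor(a: bool, b: bool) -> bool:
    # XOR form: true when exactly one operand is true.
    return a ^ b


def mismatch_explicit(a: bool, b: bool) -> bool:
    # Spelled-out form proposed in the review suggestion.
    return (a and not b) or (not a and b)


def forms_agree() -> bool:
    # Exhaustively compare both forms over the four boolean combinations.
    return all(
        mismatch_xor(a, b) == mismatch_explicit(a, b)
        for a, b in product([False, True], repeat=2)
    )
```

Since the two forms are equivalent for booleans, the choice between them is purely about readability.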

Copilot · 2025-07-29T13:34:47Z

espnet3/data/dataset.py

+            if self.multiple_iterator and not isinstance(dataset, ShardedDataset):
+                raise RuntimeError(
+                    "If any dataset is a subclass of ShardedDataset,"
+                    " then all dataset should be a subclass of ShardedDataset."
+                )
+            if isinstance(dataset, ShardedDataset):
+                self.multiple_iterator = True


The condition checks self.multiple_iterator before it's properly set. This could cause incorrect behavior since multiple_iterator is initialized to False and only set to True inside the loop. Move this check after the multiple_iterator flag is set.

Suggested change

-            if self.multiple_iterator and not isinstance(dataset, ShardedDataset):
-                raise RuntimeError(
-                    "If any dataset is a subclass of ShardedDataset,"
-                    " then all dataset should be a subclass of ShardedDataset."
-                )
-            if isinstance(dataset, ShardedDataset):
-                self.multiple_iterator = True
+            if isinstance(dataset, ShardedDataset):
+                self.multiple_iterator = True
+            if self.multiple_iterator and not isinstance(dataset, ShardedDataset):
+                raise RuntimeError(
+                    "If any dataset is a subclass of ShardedDataset,"
+                    " then all dataset should be a subclass of ShardedDataset."
+                )
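
A possible alternative, sketched here under assumptions rather than taken from the PR: compute the sharded/non-sharded status of every dataset first and then compare, so the result does not depend on the order in which datasets are listed. Judging only from the quoted snippet (flag starting at `False`, no other validation), a flag updated inside the loop would not raise when a non-sharded dataset comes before a sharded one. `ShardedDataset` below is a hypothetical empty stand-in for the class in the diff, and `uses_multiple_iterator` is a name invented for this example.

```python
from typing import Sequence


class ShardedDataset:
    """Hypothetical marker base class standing in for the one in the diff."""


def uses_multiple_iterator(datasets: Sequence[object]) -> bool:
    """Return True if all datasets are sharded, False if none are.

    Raises RuntimeError when sharded and non-sharded datasets are mixed,
    regardless of the order in which they appear.
    """
    flags = [isinstance(d, ShardedDataset) for d in datasets]
    if any(flags) and not all(flags):
        raise RuntimeError(
            "If any dataset is a subclass of ShardedDataset,"
            " then all datasets should be subclasses of ShardedDataset."
        )
    return any(flags)
```

The returned value plays the role of the `multiple_iterator` flag; an empty list yields `False`.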

Copilot · 2025-07-29T13:34:48Z

espnet3/data/dataset.py

+        for i, (dataset, transform) in enumerate(zip(self.datasets, self.transforms)):
+            if len(dataset) == 0:
+                continue  # Skip empty datasets
+            sample = transform[0](dataset[0].copy())


Calling .copy() assumes all dataset items are dictionaries with a copy method. This could fail if dataset items are not dictionaries or don't implement copy(). Consider using a safer approach like copy.deepcopy() or handle the case where copy() is not available.

Suggested change

-            sample = transform[0](dataset[0].copy())
+            sample = transform[0](copy.deepcopy(dataset[0]))
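
The difference between the two calls can be seen on a toy sample. The dictionary below is invented for illustration and is not an ESPnet data format. `dict.copy()` is shallow, so nested objects stay shared with the original, while `copy.deepcopy()` duplicates them and also works on items that have no `copy` method, such as tuples.

```python
import copy

# Toy sample with a nested, mutable value.
sample = {"text": "hello", "feats": [1, 2, 3]}

shallow = sample.copy()        # new dict, but "feats" is the same list object
deep = copy.deepcopy(sample)   # new dict and a new, independent list

shallow["feats"].append(4)     # also visible through `sample`
deep["feats"].append(99)       # only changes the deep copy

# Items without a .copy() method (for example tuples) still deep-copy fine.
tuple_item = ("hello", [1, 2, 3])
tuple_copy = copy.deepcopy(tuple_item)
```

The trade-off is cost: `deepcopy` walks the whole object, which can be noticeably slower for samples that hold large arrays.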

espnet3/data/dataset.py

- Following the review comment on the `multiple_iterator` flag, I realized that the exception tied to this flag was not properly tested, so I added a new test case for it.

sw005320 · 2025-07-30T07:22:09Z

It looks good to me.
I'll merge after CI.
Please work on the next PRs.

codecov · 2025-07-30T07:32:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 27.65%. Comparing base (8e018bc) to head (f074ae3).
⚠️ Report is 19 commits behind head on espnet3.

❗ There is a different number of reports uploaded between BASE (8e018bc) and HEAD (f074ae3). Click for more details.

HEAD has 5 uploads less than BASE

Flag BASE (8e018bc) HEAD (f074ae3)

test_integration_espnetez 5 0

Additional details and impacted files

@@             Coverage Diff              @@
##           espnet3    #6167       +/-   ##
============================================
- Coverage    47.14%   27.65%   -19.49%     
============================================
  Files          563      876      +313     
  Lines        51417    83576    +32159     
============================================
- Hits         24239    23115     -1124     
- Misses       27178    60461    +33283
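
One way to read the diff above is to recompute the percentages from the raw counts. The sketch below uses only the numbers shown in the report; the function and variable names are made up for this example. It suggests the drop comes mostly from the denominator: the number of tracked lines grew by 32159 while hits fell by only 1124.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage of tracked lines."""
    return 100.0 * hits / lines


# Counts copied from the Codecov diff (BASE 8e018bc, HEAD f074ae3).
base_hits, base_lines = 24239, 51417
head_hits, head_lines = 23115, 83576

base_cov = coverage_percent(base_hits, base_lines)
head_cov = coverage_percent(head_hits, head_lines)

added_lines = head_lines - base_lines   # lines newly tracked on HEAD
lost_hits = base_hits - head_hits       # covered lines lost on HEAD

print(f"base {base_cov:.2f}%  head {head_cov:.2f}%")
print(f"tracked lines +{added_lines}, hits -{lost_hits}")
```

The missing `test_integration_espnetez` uploads noted in the report may also contribute, so the comparison between BASE and HEAD is not like for like.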

Flag	Coverage Δ
test_integration_espnet2	`46.56% <ø> (ø)`
test_integration_espnetez	`?`
test_python_espnet3	`0.46% <ø> (?)`

Masao-Someki · 2025-08-01T17:03:21Z

Could you merge or rerun the CI? It seems one integration test failed with a timeout:

FAILED test/espnet2/bin/test_asr_inference.py::test_Speech2Text_hugging_face_causal_lm[--akreal/tiny-random-BloomForCausalLM] - Failed: Timeout >30.0s

Masao-Someki added 9 commits June 22, 2025 11:39

Removed espnetez

6575ef4

Add data organizer and dataset class

43e194a

Add unit test

acca3f4

Add data organizer

af9b203

Applied black and isort

f59da27

Add unit test to CI

a0ec3b6

Fix unit test

e99f683

Enhance docstrings

b2700e5

Add espnet-3 branch for triggering CI

4b69bca

- This will be removed when espnet#6139 is merged into master.

dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Jun 22, 2025

mergify bot added the CI Travis, Circle CI, etc label Jun 22, 2025

dosubot bot added ESPnet3 New Features labels Jun 22, 2025

Masao-Someki mentioned this pull request Jun 22, 2025

Development plan for ESPnet-3 #6133

Open

52 tasks

sw005320 requested a review from Copilot June 22, 2025 21:30

Masao-Someki added 4 commits June 23, 2025 21:27

Fixed copilot review and format

000f1ae

Added test case in test script and format

13e8266

Bugfix based on unit tests

bebe477

Fix pycodestyle

63e7a04

Merge remote-tracking branch 'upstream/espnet3' into espnet3/data_org…

d314ef5

…anizer

sw005320 reviewed Jul 29, 2025

Fixed format issue in workflow file

7090176

sw005320 added this to the v.202506 milestone Jul 29, 2025

sw005320 reviewed Jul 29, 2025

sw005320 requested a review from Copilot July 29, 2025 13:33

Copilot AI reviewed Jul 29, 2025

Masao-Someki added 2 commits July 30, 2025 15:27

Fixed shell-check CI

7217501

Fixed copilot's review

f074ae3

- Following the review comment on the `multiple_iterator` flag, I realized that the exception tied to this flag was not properly tested, so I added a new test case for it.

sw005320 self-requested a review July 30, 2025 07:20

sw005320 approved these changes Jul 30, 2025

dosubot bot added the lgtm This PR has been approved by a maintainer label Jul 30, 2025

sw005320 added the auto-merge Enable auto-merge label Aug 2, 2025

sw005320 merged commit 385e034 into espnet:espnet3 Aug 4, 2025
37 of 38 checks passed

Masao-Someki mentioned this pull request Sep 16, 2025

[espnet3-5] (2) Add parallel module and collect_stats #6242

Merged

Masao-Someki mentioned this pull request Sep 26, 2025

[espnet3-7](2) Add Callbacks #6249

Merged

Masao-Someki mentioned this pull request Nov 17, 2025

[espnet3-10] Merge espnet3 branch into master #6304

Merged

Masao-Someki deleted the espnet3/data_organizer branch November 26, 2025 18:24
