[espnet3-1] Add Data Organizer #6167

Merged
sw005320 merged 17 commits into espnet:espnet3 from Masao-Someki:espnet3/data_organizer
Aug 4, 2025

Conversation

@Masao-Someki
Contributor

What did you change?

  1. Removed ESPnet-EZ, as we are transitioning to ESPnet-3.
  2. Added a data organizer module along with accompanying unit tests.

Why did you make this change?

The new data organizer is designed to support multi-dataset experiments. It allows users to easily add or remove datasets from their experiments, improving flexibility and modularity.
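
To make the multi-dataset idea concrete, here is a minimal sketch of how several datasets could sit behind one index space, each with its own transform. `ToyCombinedDataset` and `tag` are hypothetical names used only for illustration; they are not the classes added in this PR, and the actual `CombinedDataset` API may differ.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Any, Callable, Dict, List, Sequence

Sample = Dict[str, Any]


class ToyCombinedDataset:
    """Hypothetical stand-in: concatenates datasets behind one index space."""

    def __init__(
        self,
        datasets: Sequence[Sequence[Sample]],
        transforms: Sequence[Callable[[Sample], Sample]],
    ) -> None:
        assert len(datasets) == len(transforms), "one transform per dataset"
        self.datasets = list(datasets)
        self.transforms = list(transforms)
        # Cumulative sizes, e.g. sizes [2, 3] give offsets [2, 5].
        self.offsets: List[int] = list(accumulate(len(d) for d in self.datasets))

    def __len__(self) -> int:
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx: int) -> Sample:
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Find the dataset that owns the global index, then the local index.
        ds_id = bisect_right(self.offsets, idx)
        start = self.offsets[ds_id - 1] if ds_id > 0 else 0
        sample = dict(self.datasets[ds_id][idx - start])  # shallow copy
        return self.transforms[ds_id](sample)


def tag(name: str) -> Callable[[Sample], Sample]:
    """Return a transform that records which corpus a sample came from."""

    def _transform(sample: Sample) -> Sample:
        sample["corpus"] = name
        return sample

    return _transform


corpus_a = [{"text": "a0"}, {"text": "a1"}]
corpus_b = [{"text": "b0"}, {"text": "b1"}, {"text": "b2"}]
combined = ToyCombinedDataset([corpus_a, corpus_b], [tag("A"), tag("B")])
```

With this layout, adding or removing a corpus amounts to adding or removing one entry in each of the two lists; the index mapping is rebuilt from the dataset sizes.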


Is your PR small enough?

Yes. Although the removal of ESPnet-EZ results in 14 files being deleted, only 4 new files are added.
The added code is well-documented, with comprehensive docstrings, which accounts for a relatively high number of lines.


Additional Context

#6133

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Jun 22, 2025
@mergify mergify bot added the CI Travis, Circle CI, etc label Jun 22, 2025
@Masao-Someki
Contributor Author

Here are the details of the unit test cases.

🔷 1. Basic Integration

Basic Integration of train/valid/test combinations

| Test Function Name | Description |
|---|---|
| test_data_organizer_init | Initializes DataOrganizer with train/valid/test |
| test_data_organizer_without_test | Verifies behavior without test section |
| test_data_organizer_test_only | Validates usage of test-only pipelines |
| test_data_organizer_train_valid_multiple | Ensures multiple train/valid datasets are supported |
| test_data_organizer_empty_train_valid_ok | Confirms train/valid can be empty lists |

🔷 2. Transform / Preprocessor Application

Behavior of transform and preprocessor combinations

| Test Function Name | Description |
|---|---|
| test_combined_dataset | Combines datasets and applies transform |
| test_data_organizer_transform_only | Applies only transform to data |
| test_data_organizer_preprocessor_only | Applies only preprocessor to data |
| test_data_organizer_transform_and_preprocessor | Applies both transform and preprocessor |
| test_espnet_preprocessor_without_transform | Uses only ESPnet-style preprocessor (UID-based) |
| test_espnet_preprocessor_with_transform | Combines transform with ESPnet preprocessor |
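
The combinations in this group can be pictured with a small sketch. It assumes the transform runs before the preprocessor and that an ESPnet-style preprocessor is called with the utterance ID as well as the sample; both are assumptions for illustration, and `apply_pipeline`, `uid_based`, and the three sample callables are hypothetical names rather than part of the PR.

```python
from typing import Any, Callable, Dict, Optional

Sample = Dict[str, Any]


def apply_pipeline(
    uid: str,
    sample: Sample,
    transform: Optional[Callable[[Sample], Sample]] = None,
    preprocessor: Optional[Callable[..., Sample]] = None,
    uid_based: bool = False,
) -> Sample:
    """Hypothetical helper: run the transform first, then the preprocessor."""
    out = dict(sample)  # shallow copy so the source item is left untouched
    if transform is not None:
        out = transform(out)
    if preprocessor is not None:
        # Assumption: ESPnet-style preprocessors also receive the utterance ID.
        out = preprocessor(uid, out) if uid_based else preprocessor(out)
    return out


def lowercase(sample: Sample) -> Sample:
    """Toy transform: lowercase the text field."""
    return {**sample, "text": sample["text"].lower()}


def add_length(sample: Sample) -> Sample:
    """Toy preprocessor: attach the text length."""
    return {**sample, "length": len(sample["text"])}


def uid_preprocessor(uid: str, sample: Sample) -> Sample:
    """Toy UID-based preprocessor: attach the utterance ID."""
    return {**sample, "uid": uid}


raw = {"text": "Hello"}
transform_only = apply_pipeline("utt1", raw, transform=lowercase)
both = apply_pipeline("utt1", raw, transform=lowercase, preprocessor=add_length)
espnet_style = apply_pipeline(
    "utt1", raw, transform=lowercase, preprocessor=uid_preprocessor, uid_based=True
)
```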

🔷 3. Test Set Variants

Tests for variations in test dataset configurations

| Test Function Name | Description |
|---|---|
| test_data_organizer_test_multiple_sets | Handles multiple named test sets |

🔶 Error Cases

| Test Function Name | Description | Expected Exception |
|---|---|---|
| test_data_organizer_train_only_assertion | Raises error when only train is provided without valid | RuntimeError |
| test_data_organizer_inconsistent_keys | Fails when dataset output keys are inconsistent in a combined setup | AssertionError |
| test_data_organizer_transform_none | Simulates transform failure that raises an internal exception | ValueError |
| test_data_organizer_invalid_preprocessor_type | Fails when a non-callable object is used as preprocessor | AssertionError |
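
Two of the error cases above (inconsistent keys and a non-callable preprocessor) can be sketched as standalone checks. This is only an illustration of the kind of validation being tested, not the PR's implementation; `check_consistent_keys`, `check_preprocessor`, and `raises` are hypothetical helpers.

```python
from typing import Any, Callable, Dict, Optional, Sequence, Set


def check_consistent_keys(datasets: Sequence[Sequence[Dict[str, Any]]]) -> None:
    """Hypothetical check: first samples of non-empty datasets share keys."""
    reference: Optional[Set[str]] = None
    for i, dataset in enumerate(datasets):
        if len(dataset) == 0:
            continue  # nothing to compare for an empty dataset
        keys = set(dataset[0].keys())
        if reference is None:
            reference = keys
        assert keys == reference, f"dataset {i} has keys {sorted(keys)}"


def check_preprocessor(preprocessor: Any) -> None:
    """Hypothetical check: a preprocessor must be callable (or None)."""
    assert preprocessor is None or callable(preprocessor), "not callable"


def raises(exc_type: type, fn: Callable[..., Any], *args: Any) -> bool:
    """Return True if fn(*args) raises exc_type (a tiny pytest.raises stand-in)."""
    try:
        fn(*args)
    except exc_type:
        return True
    return False


consistent = [[{"audio": 1, "text": "a"}], [], [{"audio": 2, "text": "b"}]]
inconsistent = [[{"audio": 1, "text": "a"}], [{"audio": 2}]]
```

In a real test suite the same expectations would usually be written with `pytest.raises(AssertionError)` as a context manager.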

@Masao-Someki
Contributor Author

The current CI error is mainly caused by the pyproject.toml configuration.
Since formatting-related changes are being handled separately in #6164, the pyproject.toml in the base branch is not yet compatible with the current CI setup.

This PR should be merged only after #6139 has been merged and its CI has passed.

@Masao-Someki Masao-Someki mentioned this pull request Jun 22, 2025
52 tasks
@sw005320 sw005320 requested a review from Copilot June 22, 2025 21:30
@sw005320
Contributor

This pull request introduces support for ESPnet-3 across multiple CI workflows and adds new functionality for dataset organization in the espnet3 module. Key changes include updates to CI configurations, the addition of new Python scripts, and the introduction of a DataOrganizer class for managing datasets.

CI Workflow Updates:

ESPnet-3 Module Enhancements:

  • ci/test_python_espnet3.sh: Renamed from ci/test_python_espnetez.sh and updated to include ESPnet-3-specific paths and testing configurations.
  • espnet3/data/__init__.py: Added imports for data_organizer and dataset.
  • espnet3/data/data_organizer.py: Introduced the DataOrganizer class to manage training, validation, and test datasets, with support for modular dataset configuration and preprocessing.

This comment was marked as outdated.

@sw005320
Contributor

@Masao-Someki, can you fix the conflict?

chainer-version: ["6.0.0"]
python-version: ["3.10", "3.11"]
pytorch-version: [2.1.2, 2.2.2, 2.3.1, 2.4.1, 2.5.1, 2.6.0]
chainer-version: [6.0.0]
Contributor

Do we need it?

Contributor Author

I thought it would be good to include those versions in the tests for ESPnet-3 as well, since we already test them for ESPnet1/2 and have a table for them in the README.

Contributor

I meant Chainer itself, not its version.
Do you need Chainer?

Contributor Author

Thank you, I understand now. Chainer is required by espnet-1 and is supposed to be removed in #6179.
However, it looks like the PR didn’t actually remove it, so I’ll go ahead and remove the Chainer version in that PR.

@sw005320 sw005320 added this to the v.202506 milestone Jul 29, 2025
chainer-version: [6.0.0]
python-version: ["3.10"]
pytorch-version: ["2.6.0"]
chainer-version: ["6.0.0"]
Contributor

Can we keep chainer here?

@sw005320 sw005320 requested a review from Copilot July 29, 2025 13:33
@sw005320
Contributor

Please fix https://github.com/espnet/espnet/actions/runs/16596452146/job/46944176786?pr=6167#step:8:302

Contributor

Copilot AI left a comment

Pull Request Overview

This PR transitions from ESPnet-EZ to ESPnet-3 by removing the ESPnet-EZ framework and introducing a new data organizer module for multi-dataset experiments.

  • Removes entire ESPnet-EZ framework including trainer, task, config, and preprocessing modules
  • Adds new ESPnet-3 data organizer module with CombinedDataset and DataOrganizer classes
  • Updates CI workflows to run ESPnet-3 tests instead of ESPnet-EZ tests

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| espnet3/data/ | Adds new data organizer and dataset classes for ESPnet-3 |
| test/espnet3/ | Comprehensive unit tests for the new data organizer module |
| espnetez/ | Complete removal of ESPnet-EZ framework |
| test/espnetez/ | Removal of ESPnet-EZ integration and unit tests |
| .github/workflows/ | Updates CI to test ESPnet-3 instead of ESPnet-EZ |
| ci/ | Updates test scripts for ESPnet-3 |

Comment on lines +15 to +44
# | Test Function Name | Description | # noqa: E501
# |----------------------------------|--------------------------------------------------------------------------| # noqa: E501
# | test_data_organizer_init | Initializes DataOrganizer with train/valid/test | # noqa: E501
# | test_data_organizer_without_test | Verifies behavior without test section | # noqa: E501
# | test_data_organizer_test_only | Validates usage of test-only pipelines | # noqa: E501
# | test_data_organizer_train_valid_multiple | Ensures multiple train/valid datasets are supported | # noqa: E501
# | test_data_organizer_empty_train_valid_ok | Confirms train/valid can be empty lists | # noqa: E501
#
# Transform / Preprocessor Application
# | Test Function Name | Description | # noqa: E501
# |--------------------------------------|--------------------------------------------------------------------------| # noqa: E501
# | test_combined_dataset | Combines datasets and applies transform | # noqa: E501
# | test_data_organizer_transform_only | Applies only transform to data | # noqa: E501
# | test_data_organizer_preprocessor_only| Applies only preprocessor to data | # noqa: E501
# | test_data_organizer_transform_and_preprocessor | Applies both transform and preprocessor | # noqa: E501
# | test_espnet_preprocessor_without_transform | Uses only ESPnet-style preprocessor (UID-based) | # noqa: E501
# | test_espnet_preprocessor_with_transform | Combines transform with ESPnet preprocessor | # noqa: E501
#
# Test Set Variants
# | Test Function Name | Description | # noqa: E501
# |-----------------------------------|--------------------------------------------------------------------------| # noqa: E501
# | test_data_organizer_test_multiple_sets | Handles multiple named test sets | # noqa: E501
#
# Error Cases
# | Test Function Name | Description | Expected Exception | # noqa: E501
# |--------------------------------------------|--------------------------------------------------------------|--------------------| # noqa: E501
# | test_data_organizer_train_only_assertion | Raises error when only train is provided without valid | RuntimeError | # noqa: E501
# | test_data_organizer_inconsistent_keys | Fails when dataset output keys are inconsistent in combined | AssertionError | # noqa: E501
# | test_data_organizer_transform_none | Simulates transform failure that raises an internal exception| ValueError | # noqa: E501
# | test_data_organizer_invalid_preprocessor_type | Fails when a non-callable is used as preprocessor | AssertionError | # noqa: E501
Copilot AI Jul 29, 2025

[nitpick] The comment lines with # noqa: E501 suggest line length issues. Consider using a more concise table format or breaking the table into sections to improve readability.

Suggested change
# | Test Function Name | Description |
# |----------------------------------|------------------------------------------|
# | test_data_organizer_init | Initializes DataOrganizer with |
# | | train/valid/test |
# | test_data_organizer_without_test | Verifies behavior without test section |
# | test_data_organizer_test_only | Validates usage of test-only pipelines |
# | test_data_organizer_train_valid_multiple | Ensures multiple train/valid datasets |
# | | are supported |
# | test_data_organizer_empty_train_valid_ok | Confirms train/valid can be empty lists |
#
# Transform / Preprocessor Application
# | Test Function Name | Description |
# |--------------------------------------|------------------------------------------|
# | test_combined_dataset | Combines datasets and applies transform |
# | test_data_organizer_transform_only | Applies only transform to data |
# | test_data_organizer_preprocessor_only| Applies only preprocessor to data |
# | test_data_organizer_transform_and_preprocessor | Applies both transform and |
# | | preprocessor |
# | test_espnet_preprocessor_without_transform | Uses only ESPnet-style preprocessor |
# | | (UID-based) |
# | test_espnet_preprocessor_with_transform | Combines transform with ESPnet |
# | | preprocessor |
#
# Test Set Variants
# | Test Function Name | Description |
# |-----------------------------------|------------------------------------------|
# | test_data_organizer_test_multiple_sets | Handles multiple named test sets |
#
# Error Cases
# | Test Function Name | Description | Expected Exception |
# |--------------------------------------------|------------------------------------------|--------------------|
# | test_data_organizer_train_only_assertion | Raises error when only train is provided| RuntimeError |
# | | without valid | |
# | test_data_organizer_inconsistent_keys | Fails when dataset output keys are | AssertionError |
# | | inconsistent in combined | |
# | test_data_organizer_transform_none | Simulates transform failure that raises | ValueError |
# | | an internal exception | |
# | test_data_organizer_invalid_preprocessor_type | Fails when a non-callable is used as | AssertionError |
# | | preprocessor | |

Copilot uses AI. Check for mistakes.
Comment on lines +174 to +176
if ((self.train is None) ^ (self.valid is None)) or (
isinstance(self.train, CombinedDataset)
^ isinstance(self.valid, CombinedDataset)
Copilot AI Jul 29, 2025

[nitpick] The boolean logic using XOR operator could be simplified for better readability. Consider using explicit conditions like (self.train is None and self.valid is not None) or (self.train is not None and self.valid is None).

Suggested change
if ((self.train is None and self.valid is not None) or (self.train is not None and self.valid is None)) or (
(isinstance(self.train, CombinedDataset) and not isinstance(self.valid, CombinedDataset)) or
(not isinstance(self.train, CombinedDataset) and isinstance(self.valid, CombinedDataset))
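
The XOR form and the spelled-out form are logically equivalent as long as both operands are real booleans, which holds for the results of `is None` and `isinstance(...)`. A small sketch that enumerates the truth table (the function names are hypothetical):

```python
from itertools import product


def mismatch_xor(a: bool, b: bool) -> bool:
    """True when exactly one of the two flags is set."""
    return a ^ b


def mismatch_explicit(a: bool, b: bool) -> bool:
    """The spelled-out form from the suggestion."""
    return (a and not b) or (not a and b)


def mismatch_neq(a: bool, b: bool) -> bool:
    """For two bools, inequality expresses the same test."""
    return a != b


# Every combination of the two flags with the XOR result.
truth_table = [
    (a, b, mismatch_xor(a, b)) for a, b in product([False, True], repeat=2)
]
```

Note that `^` is a bitwise operator on integers, so the equivalence relies on both operands being `bool`; for these particular expressions that is always the case.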

Comment on lines +99 to +105
if self.multiple_iterator and not isinstance(dataset, ShardedDataset):
raise RuntimeError(
"If any dataset is a subclass of ShardedDataset,"
" then all dataset should be a subclass of ShardedDataset."
)
if isinstance(dataset, ShardedDataset):
self.multiple_iterator = True
Copilot AI Jul 29, 2025

The condition checks self.multiple_iterator before it's properly set. This could cause incorrect behavior since multiple_iterator is initialized to False and only set to True inside the loop. Move this check after the multiple_iterator flag is set.

Suggested change
if isinstance(dataset, ShardedDataset):
self.multiple_iterator = True
if self.multiple_iterator and not isinstance(dataset, ShardedDataset):
raise RuntimeError(
"If any dataset is a subclass of ShardedDataset,"
" then all dataset should be a subclass of ShardedDataset."
)
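
To see how the ordering of the two statements plays out, the sketch below replays the check over a list of datasets, assuming the snippet runs once per dataset in list order. Under that assumption both orderings catch a plain dataset that follows a sharded one, and neither catches a sharded dataset that follows a plain one; a check that inspects all datasets up front is order-independent. `PlainDataset`, `check_in_loop`, `check_up_front`, and the local `ShardedDataset` stub are hypothetical stand-ins, not the PR's code.

```python
from typing import Callable, Sequence


class ShardedDataset:
    """Hypothetical stand-in for the ShardedDataset base class."""


class PlainDataset:
    """Hypothetical non-sharded dataset."""


def check_in_loop(datasets: Sequence[object], flag_first: bool) -> bool:
    """Replay the per-dataset check; return the final multiple_iterator flag."""
    multiple_iterator = False
    for dataset in datasets:
        sharded = isinstance(dataset, ShardedDataset)
        if flag_first and sharded:  # ordering from the suggested change
            multiple_iterator = True
        if multiple_iterator and not sharded:
            raise RuntimeError("all datasets should be ShardedDataset")
        if not flag_first and sharded:  # ordering from the original code
            multiple_iterator = True
    return multiple_iterator


def check_up_front(datasets: Sequence[object]) -> bool:
    """Order-independent variant: inspect every dataset before deciding."""
    sharded = [isinstance(d, ShardedDataset) for d in datasets]
    if any(sharded) and not all(sharded):
        raise RuntimeError("all datasets should be ShardedDataset")
    return any(sharded)


def raises_runtime_error(fn: Callable[[], object]) -> bool:
    """Return True if calling fn raises RuntimeError."""
    try:
        fn()
    except RuntimeError:
        return True
    return False


sharded_first = [ShardedDataset(), PlainDataset()]
plain_first = [PlainDataset(), ShardedDataset()]
```

A unit test for the mixed case is most informative when it covers both `sharded_first` and `plain_first` orderings.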

for i, (dataset, transform) in enumerate(zip(self.datasets, self.transforms)):
if len(dataset) == 0:
continue # Skip empty datasets
sample = transform[0](dataset[0].copy())
Copilot AI Jul 29, 2025

Calling .copy() assumes all dataset items are dictionaries with a copy method. This could fail if dataset items are not dictionaries or don't implement copy(). Consider using a safer approach like copy.deepcopy() or handle the case where copy() is not available.

Suggested change
sample = transform[0](copy.deepcopy(dataset[0]))
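
The practical difference between the two calls shows up with nested or non-dict items. A short sketch using only the standard library (the sample items are made up for illustration):

```python
import copy

item = {"text": "hello", "tokens": [1, 2, 3]}

shallow = item.copy()       # new dict, but the inner list is shared
deep = copy.deepcopy(item)  # new dict and a new inner list

shallow["text"] = "changed"  # rebinding a key does not touch `item`
shallow["tokens"].append(4)  # mutating the shared list does

tuple_item = ("hello", [1, 2, 3])
tuple_has_copy = hasattr(tuple_item, "copy")  # tuples have no .copy() method
tuple_deep = copy.deepcopy(tuple_item)        # deepcopy still works
```

`copy.deepcopy` is the more general choice but also the more expensive one, since it walks the whole object graph; whether that cost matters depends on how large a single dataset item is.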

- Following the review comment on the `multiple_iterator` flag, I realized that the exception tied to the flag was not properly tested, so I added a new test case for it.
@sw005320 sw005320 self-requested a review July 30, 2025 07:20
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jul 30, 2025
@sw005320
Contributor

It looks good to me.
I'll merge after CI.
Please work on the next PRs.

@codecov

codecov bot commented Jul 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 27.65%. Comparing base (8e018bc) to head (f074ae3).
⚠️ Report is 19 commits behind head on espnet3.

❗ There is a different number of reports uploaded between BASE (8e018bc) and HEAD (f074ae3). Click for more details.

HEAD has 5 uploads less than BASE
| Flag | BASE (8e018bc) | HEAD (f074ae3) |
|---|---|---|
| test_integration_espnetez | 5 | 0 |
Additional details and impacted files
@@             Coverage Diff              @@
##           espnet3    #6167       +/-   ##
============================================
- Coverage    47.14%   27.65%   -19.49%     
============================================
  Files          563      876      +313     
  Lines        51417    83576    +32159     
============================================
- Hits         24239    23115     -1124     
- Misses       27178    60461    +33283     
| Flag | Coverage Δ |
|---|---|
| test_integration_espnet2 | 46.56% <ø> (ø) |
| test_integration_espnetez | ? |
| test_python_espnet3 | 0.46% <ø> (?) |
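
The headline percentages follow directly from the hits and line counts in the diff above. The sketch below recomputes them; it truncates to two decimals because that reproduces the reported 27.65%, which is an observation about these numbers rather than a statement about how Codecov computes its display values. `coverage_pct` is a hypothetical helper.

```python
import math


def coverage_pct(hits: int, lines: int) -> float:
    """Coverage in percent, truncated (not rounded) to two decimals."""
    return math.floor(hits / lines * 10_000) / 100


base_hits, base_lines = 24239, 51417
head_hits, head_lines = 23115, 83576

base_cov = coverage_pct(base_hits, base_lines)
head_cov = coverage_pct(head_hits, head_lines)
delta = round(head_cov - base_cov, 2)

# Misses are simply the lines that were not hit.
base_misses = base_lines - base_hits
head_misses = head_lines - head_hits
```

Most of the drop comes from the denominator: tracked lines grew by 32159 while hits fell by only 1124, which fits the missing `test_integration_espnetez` uploads and the mostly uncovered new files in the head report.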


@Masao-Someki
Contributor Author

Could you merge or rerun the CI? It seems one integration test failed with a timeout:

FAILED test/espnet2/bin/test_asr_inference.py::test_Speech2Text_hugging_face_causal_lm[--akreal/tiny-random-BloomForCausalLM] - Failed: Timeout >30.0s

@sw005320 sw005320 added the auto-merge Enable auto-merge label Aug 2, 2025
@sw005320 sw005320 merged commit 385e034 into espnet:espnet3 Aug 4, 2025
37 of 38 checks passed
@Masao-Someki Masao-Someki deleted the espnet3/data_organizer branch November 26, 2025 18:24

Labels

  • auto-merge: Enable auto-merge
  • CI: Travis, Circle CI, etc
  • ESPnet3
  • lgtm: This PR has been approved by a maintainer
  • New Features
  • size:XXL: This PR changes 1000+ lines, ignoring generated files.


3 participants