[espnet3-1] Add Data Organizer#6167
- This will be removed when espnet#6139 is merged into master.
Here are the details of the unit test cases.
🔷 1. Basic Integration: basic integration of train/valid/test combinations
🔷 2. Transform / Preprocessor Application: behavior of transform and preprocessor combinations
🔷 3. Test Set Variants: tests for variations in test dataset configurations
🔶 Error Cases
The current CI error is mainly caused by the pyproject.toml configuration. This PR should be merged only after #6139 has been merged and its CI has passed.
This pull request introduces support for ESPnet-3 across multiple CI workflows and adds new functionality for dataset organization in the
CI Workflow Updates:
ESPnet-3 Module Enhancements:
@Masao-Someki, can you fix the conflict?
| chainer-version: ["6.0.0"] | ||
| python-version: ["3.10", "3.11"] | ||
| pytorch-version: [2.1.2, 2.2.2, 2.3.1, 2.4.1, 2.5.1, 2.6.0] | ||
| chainer-version: [6.0.0] |
I thought it would be good to include those versions in the tests for ESPnet-3 as well, since we already test them for ESPnet1/2 and have a table for them in the README.
I was asking about Chainer itself, not the versions.
Do you need Chainer?
Thank you, I understand now. Chainer is required by espnet-1 and is supposed to be removed in #6179.
However, it looks like the PR didn’t actually remove it, so I’ll go ahead and remove the Chainer version in that PR.
| chainer-version: [6.0.0] | ||
| python-version: ["3.10"] | ||
| pytorch-version: ["2.6.0"] | ||
| chainer-version: ["6.0.0"] |
Can we keep chainer here?
Pull Request Overview
This PR transitions from ESPnet-EZ to ESPnet-3 by removing the ESPnet-EZ framework and introducing a new data organizer module for multi-dataset experiments.
- Removes entire ESPnet-EZ framework including trainer, task, config, and preprocessing modules
- Adds new ESPnet-3 data organizer module with CombinedDataset and DataOrganizer classes
- Updates CI workflows to run ESPnet-3 tests instead of ESPnet-EZ tests
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| espnet3/data/ | Adds new data organizer and dataset classes for ESPnet-3 |
| test/espnet3/ | Comprehensive unit tests for the new data organizer module |
| espnetez/ | Complete removal of ESPnet-EZ framework |
| test/espnetez/ | Removal of ESPnet-EZ integration and unit tests |
| .github/workflows/ | Updates CI to test ESPnet-3 instead of ESPnet-EZ |
| ci/ | Updates test scripts for ESPnet-3 |
test/espnet3/test_data_organizer.py
Outdated
| # | Test Function Name | Description | # noqa: E501 | ||
| # |----------------------------------|--------------------------------------------------------------------------| # noqa: E501 | ||
| # | test_data_organizer_init | Initializes DataOrganizer with train/valid/test | # noqa: E501 | ||
| # | test_data_organizer_without_test | Verifies behavior without test section | # noqa: E501 | ||
| # | test_data_organizer_test_only | Validates usage of test-only pipelines | # noqa: E501 | ||
| # | test_data_organizer_train_valid_multiple | Ensures multiple train/valid datasets are supported | # noqa: E501 | ||
| # | test_data_organizer_empty_train_valid_ok | Confirms train/valid can be empty lists | # noqa: E501 | ||
| # | ||
| # Transform / Preprocessor Application | ||
| # | Test Function Name | Description | # noqa: E501 | ||
| # |--------------------------------------|--------------------------------------------------------------------------| # noqa: E501 | ||
| # | test_combined_dataset | Combines datasets and applies transform | # noqa: E501 | ||
| # | test_data_organizer_transform_only | Applies only transform to data | # noqa: E501 | ||
| # | test_data_organizer_preprocessor_only| Applies only preprocessor to data | # noqa: E501 | ||
| # | test_data_organizer_transform_and_preprocessor | Applies both transform and preprocessor | # noqa: E501 | ||
| # | test_espnet_preprocessor_without_transform | Uses only ESPnet-style preprocessor (UID-based) | # noqa: E501 | ||
| # | test_espnet_preprocessor_with_transform | Combines transform with ESPnet preprocessor | # noqa: E501 | ||
| # | ||
| # Test Set Variants | ||
| # | Test Function Name | Description | # noqa: E501 | ||
| # |-----------------------------------|--------------------------------------------------------------------------| # noqa: E501 | ||
| # | test_data_organizer_test_multiple_sets | Handles multiple named test sets | # noqa: E501 | ||
| # | ||
| # Error Cases | ||
| # | Test Function Name | Description | Expected Exception | # noqa: E501 | ||
| # |--------------------------------------------|--------------------------------------------------------------|--------------------| # noqa: E501 | ||
| # | test_data_organizer_train_only_assertion | Raises error when only train is provided without valid | RuntimeError | # noqa: E501 | ||
| # | test_data_organizer_inconsistent_keys | Fails when dataset output keys are inconsistent in combined | AssertionError | # noqa: E501 | ||
| # | test_data_organizer_transform_none | Simulates transform failure that raises an internal exception| ValueError | # noqa: E501 | ||
| # | test_data_organizer_invalid_preprocessor_type | Fails when a non-callable is used as preprocessor | AssertionError | # noqa: E501 |
[nitpick] The comment lines with # noqa: E501 suggest line length issues. Consider using a more concise table format or breaking the table into sections to improve readability.
| # | Test Function Name | Description | | |
| # |----------------------------------|------------------------------------------| | |
| # | test_data_organizer_init | Initializes DataOrganizer with | | |
| # | | train/valid/test | | |
| # | test_data_organizer_without_test | Verifies behavior without test section | | |
| # | test_data_organizer_test_only | Validates usage of test-only pipelines | | |
| # | test_data_organizer_train_valid_multiple | Ensures multiple train/valid datasets | | |
| # | | are supported | | |
| # | test_data_organizer_empty_train_valid_ok | Confirms train/valid can be empty lists | | |
| # | |
| # Transform / Preprocessor Application | |
| # | Test Function Name | Description | | |
| # |--------------------------------------|------------------------------------------| | |
| # | test_combined_dataset | Combines datasets and applies transform | | |
| # | test_data_organizer_transform_only | Applies only transform to data | | |
| # | test_data_organizer_preprocessor_only| Applies only preprocessor to data | | |
| # | test_data_organizer_transform_and_preprocessor | Applies both transform and | | |
| # | | preprocessor | | |
| # | test_espnet_preprocessor_without_transform | Uses only ESPnet-style preprocessor | | |
| # | | (UID-based) | | |
| # | test_espnet_preprocessor_with_transform | Combines transform with ESPnet | | |
| # | | preprocessor | | |
| # | |
| # Test Set Variants | |
| # | Test Function Name | Description | | |
| # |-----------------------------------|------------------------------------------| | |
| # | test_data_organizer_test_multiple_sets | Handles multiple named test sets | | |
| # | |
| # Error Cases | |
| # | Test Function Name | Description | Expected Exception | | |
| # |--------------------------------------------|------------------------------------------|--------------------| | |
| # | test_data_organizer_train_only_assertion | Raises error when only train is provided| RuntimeError | | |
| # | | without valid | | | |
| # | test_data_organizer_inconsistent_keys | Fails when dataset output keys are | AssertionError | | |
| # | | inconsistent in combined | | | |
| # | test_data_organizer_transform_none | Simulates transform failure that raises | ValueError | | |
| # | | an internal exception | | | |
| # | test_data_organizer_invalid_preprocessor_type | Fails when a non-callable is used as | AssertionError | | |
| # | | preprocessor | | |
espnet3/data/data_organizer.py
Outdated
| if ((self.train is None) ^ (self.valid is None)) or ( | ||
| isinstance(self.train, CombinedDataset) | ||
| ^ isinstance(self.valid, CombinedDataset) |
[nitpick] The boolean logic using XOR operator could be simplified for better readability. Consider using explicit conditions like (self.train is None and self.valid is not None) or (self.train is not None and self.valid is None).
| if ((self.train is None and self.valid is not None) or (self.train is not None and self.valid is None)) or ( | |
| (isinstance(self.train, CombinedDataset) and not isinstance(self.valid, CombinedDataset)) or | |
| (not isinstance(self.train, CombinedDataset) and isinstance(self.valid, CombinedDataset)) |
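For reference, the XOR form and the expanded form agree for every combination of boolean inputs, because `x is None` and `isinstance(...)` always return `bool`. The sketch below checks this exhaustively; the helper names are hypothetical and are not part of `data_organizer.py`. It also shows `!=` as a third spelling that some readers may find clearer than either.

```python
from itertools import product


def mismatch_xor(a: bool, b: bool) -> bool:
    # Form used in the original diff: true when exactly one flag is set.
    return a ^ b


def mismatch_explicit(a: bool, b: bool) -> bool:
    # Expanded form proposed in the review suggestion.
    return (a and not b) or (not a and b)


def mismatch_neq(a: bool, b: bool) -> bool:
    # For two bools, inequality means the same thing as XOR.
    return a != b


# Exhaustive check over the four possible input pairs.
for a, b in product([False, True], repeat=2):
    assert mismatch_xor(a, b) == mismatch_explicit(a, b) == mismatch_neq(a, b)
```

Since all three are equivalent, the choice is purely about readability and line length.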
espnet3/data/dataset.py
Outdated
| if self.multiple_iterator and not isinstance(dataset, ShardedDataset): | ||
| raise RuntimeError( | ||
| "If any dataset is a subclass of ShardedDataset," | ||
| " then all dataset should be a subclass of ShardedDataset." | ||
| ) | ||
| if isinstance(dataset, ShardedDataset): | ||
| self.multiple_iterator = True |
The condition checks self.multiple_iterator before it's properly set. This could cause incorrect behavior since multiple_iterator is initialized to False and only set to True inside the loop. Move this check after the multiple_iterator flag is set.
| if isinstance(dataset, ShardedDataset): | |
| self.multiple_iterator = True | |
| if self.multiple_iterator and not isinstance(dataset, ShardedDataset): | |
| raise RuntimeError( | |
| "If any dataset is a subclass of ShardedDataset," | |
| " then all dataset should be a subclass of ShardedDataset." | |
| ) |
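As far as the quoted lines show, either ordering of the two `if` statements raises only when a non-sharded dataset follows a sharded one; a sharded dataset that follows a non-sharded one passes through. One order-independent alternative is to classify every dataset first and validate once. The sketch below uses stand-in classes and a hypothetical helper `uses_multiple_iterator`; it is not the code in `espnet3/data/dataset.py`.

```python
class ShardedDataset:
    """Hypothetical stand-in for the real ShardedDataset base class."""


class PlainDataset:
    """Hypothetical non-sharded dataset used only for this sketch."""


def uses_multiple_iterator(datasets) -> bool:
    """Return True if every dataset is sharded, False if none are.

    Raises RuntimeError for a mix of sharded and non-sharded datasets,
    regardless of the order in which they appear.
    """
    sharded = [isinstance(d, ShardedDataset) for d in datasets]
    if any(sharded) and not all(sharded):
        raise RuntimeError(
            "If any dataset is a subclass of ShardedDataset,"
            " then all dataset should be a subclass of ShardedDataset."
        )
    return any(sharded)
```

A test for this behavior would cover both orderings of a mixed list, which is the case the in-loop check can miss.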
espnet3/data/dataset.py
Outdated
| for i, (dataset, transform) in enumerate(zip(self.datasets, self.transforms)): | ||
| if len(dataset) == 0: | ||
| continue # Skip empty datasets | ||
| sample = transform[0](dataset[0].copy()) |
Calling .copy() assumes all dataset items are dictionaries with a copy method. This could fail if dataset items are not dictionaries or don't implement copy(). Consider using a safer approach like copy.deepcopy() or handle the case where copy() is not available.
| sample = transform[0](copy.deepcopy(dataset[0])) |
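The practical difference between the two calls can be seen with a transform that mutates a nested value. In the sketch below, `append_padding` is a hypothetical in-place transform and the sample layout is an assumption made for illustration; neither comes from the PR.

```python
import copy


def append_padding(sample):
    """Hypothetical transform that mutates the nested list in place."""
    sample["feats"].append(0.0)
    sample["text"] = sample["text"].upper()
    return sample


# dict.copy() is shallow: the copy gets its own top-level keys,
# but the nested list is shared with the original item.
shallow_item = {"feats": [1.0, 2.0], "text": "hello"}
append_padding(shallow_item.copy())
# shallow_item["feats"] is now [1.0, 2.0, 0.0]; shallow_item["text"] is still "hello".

# copy.deepcopy() duplicates nested objects too, so the original is untouched.
deep_item = {"feats": [1.0, 2.0], "text": "hello"}
append_padding(copy.deepcopy(deep_item))
# deep_item["feats"] is still [1.0, 2.0].
```

`copy.deepcopy` also accepts items that have no `.copy()` method, such as tuples, at the cost of duplicating every nested object.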
- Following the review comment on the `multiple_iterator` flag, I realized that the exception tied to this flag was not properly tested, so I've added a new test case for it.
It looks good to me.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## espnet3 #6167 +/- ##
============================================
- Coverage 47.14% 27.65% -19.49%
============================================
Files 563 876 +313
Lines 51417 83576 +32159
============================================
- Hits 24239 23115 -1124
- Misses 27178 60461 +33283
Could you merge or rerun the CI? It seems one integration test is failing with a timeout.
What did you change?
Why did you make this change?
The new data organizer is designed to support multi-dataset experiments. It allows users to easily add or remove datasets from their experiments, improving flexibility and modularity.
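To illustrate the idea of combining several datasets behind one index space, here is a minimal sketch. `ToyCombinedDataset` is a hypothetical class written for this example and is not the `CombinedDataset` added in this PR; it assumes list-like datasets of dictionaries and one callable transform per dataset, and it includes a simple output-key consistency check.

```python
from bisect import bisect_right
from itertools import accumulate


class ToyCombinedDataset:
    """Illustrative only; not the espnet3 CombinedDataset implementation."""

    def __init__(self, datasets, transforms):
        assert len(datasets) == len(transforms)
        self.datasets = datasets
        self.transforms = transforms
        # Cumulative end offsets, e.g. lengths [2, 1] -> [2, 3].
        self.offsets = list(accumulate(len(d) for d in datasets))
        # All non-empty datasets must yield the same output keys.
        expected = None
        for dataset, transform in zip(datasets, transforms):
            if len(dataset) == 0:
                continue  # Skip empty datasets
            keys = set(transform(dict(dataset[0])))
            if expected is None:
                expected = keys
            assert keys == expected, "inconsistent output keys"

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Find the dataset whose index range contains idx.
        ds = bisect_right(self.offsets, idx)
        start = self.offsets[ds - 1] if ds > 0 else 0
        return self.transforms[ds](dict(self.datasets[ds][idx - start]))


set_a = [{"text": "a"}, {"text": "b"}]
set_b = [{"text": "c"}]
combined = ToyCombinedDataset(
    [set_a, set_b],
    [lambda s: {**s, "src": "A"}, lambda s: {**s, "src": "B"}],
)
```

With this layout, adding or removing a dataset only changes the two lists passed to the constructor.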
Is your PR small enough?
Yes. Although the removal of ESPnet-EZ results in 14 files being deleted, only 4 new files are added.
The added code is well-documented, with comprehensive docstrings, which accounts for a relatively high number of lines.
Additional Context
#6133