Support external dataset library for ESPnetEasy #5584
mergify[bot] merged 53 commits into espnet:master
- ASRTransducerTask
- S2STTask
- S2TTask
- SpeakerTask
- Support `torchaudio.datasets` and any other dataset format (if it can be made to fit the ESPnet format).
for more information, see https://pre-commit.ci
Currently, several
…ki/espnet into feature/multi_dataset_support
Hey @Masao-Someki @sw005320 - Lovely that you are adding support for datasets. Do let me know if you hit any roadblocks and I'd be more than happy to help you! 🤗
- Add TEMPLATE folder
- Modified directory structure of current recipes
@sw005320 @wanchichen @juice500ml Could you please review the recipe folder, especially the directory structure? Any comments you may have would be greatly appreciated!
- Forgot to comment out an unneeded annotation.
sw005320 left a comment
- Please try to consistently use `ez`. `easy` is too general a word, and it causes confusion.
- Did you include a fine-tuning test? This is the most important use case. We may use the OWSM base model (or we may make a very small pre-trained model for this test).
@sw005320
- Remove unneeded lines in test_integration_espnetez
- Rename test_integrate_easy to test_integration_espnetez
- Rename test_easy to test_ez
- Reduce the model size in the transducer config
- Minor refactoring
This looks good, but you could optionally look at some of the existing pytest tests in https://github.com/espnet/espnet/blob/master/test/ and perhaps use pytest.mark.parametrize instead of args.
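For illustration, the `pytest.mark.parametrize` suggestion could look roughly like the sketch below. The function `make_exp_dir`, the task names, and the directory layout are hypothetical stand-ins chosen for the example; they are not taken from the ESPnet test suite.

```python
import pytest


# Hypothetical stand-in for the code under test; not an ESPnet function.
def make_exp_dir(task: str, use_finetune: bool) -> str:
    suffix = "ft" if use_finetune else "train"
    return f"exp/{task}_{suffix}"


# Stacked parametrize decorators produce every combination of the two
# argument lists (2 tasks x 2 modes = 4 test cases), replacing the need
# to pass the same values through command-line args.
@pytest.mark.parametrize("task", ["asr", "s2t"])
@pytest.mark.parametrize("use_finetune", [False, True])
def test_make_exp_dir(task, use_finetune):
    exp_dir = make_exp_dir(task, use_finetune)
    assert exp_dir.startswith(f"exp/{task}_")
    assert exp_dir.endswith("ft") == use_finetune
```

Running `pytest` on a file containing this sketch would collect each combination as a separate test case, so a failure points at the exact parameter set that caused it.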
I think this fine-tuning test should be an integration test, because it requires trained checkpoints from test_integration_espnetez.py.
Same as before. You could consider adding it to https://github.com/espnet/espnet/blob/master/test/ and perhaps use pytest.mark.parametrize instead of args.
Thanks @Masao-Someki for your great work! I have added some minor suggestions but overall, the PR looks good to me.
- Moved integration test and unit test into `test/espnetez`
- Rename `test_finetune_espnetez` to `test_integration_espnetez_ft`
sw005320 left a comment
This looks good to me.
Once the CI tests pass, I’ll merge this PR.
Can you fix the CI issue?
- Format shell script
What?
This PR adds support for several external dataset libraries and formats, and adds instruction documents for them.
Currently, I'm planning to support the following datasets:
Why?
While working on the espnetez project, we found that it is possible to support external dataset formats.
Espnetez can make it easier to write fine-tuning code in Python, but we cannot take full advantage of it unless we can define our own custom datasets.
Supporting typical dataset libraries in this PR therefore improves the usability of espnetez.
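As a rough illustration of what defining a custom dataset could involve, the sketch below wraps an arbitrary indexable dataset so that each item becomes an (utterance ID, dict of named fields) pair. That item shape is an assumption made for this example, and `KeyedDatasetAdapter`, the field names `speech` and `text`, and the toy data are all hypothetical; none of them are part of the espnetez API.

```python
from typing import Any, Callable, Dict, Sequence, Tuple


class KeyedDatasetAdapter:
    """Wrap any indexable dataset so each item becomes (utterance_id, fields).

    Hypothetical helper for illustration only; not part of ESPnet.
    """

    def __init__(
        self,
        dataset: Sequence[Any],
        field_getters: Dict[str, Callable[[Any], Any]],
    ):
        self.dataset = dataset
        # Maps an output field name to a function extracting it from a raw item.
        self.field_getters = field_getters

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, index: int) -> Tuple[str, Dict[str, Any]]:
        raw = self.dataset[index]
        fields = {name: get(raw) for name, get in self.field_getters.items()}
        # The index doubles as the utterance ID in this sketch.
        return str(index), fields


# Toy stand-in for an external dataset whose items are
# (waveform, sample_rate, transcript) tuples.
toy_dataset = [
    ([0.0, 0.1, -0.1], 16000, "hello"),
    ([0.2, 0.0], 16000, "world"),
]

adapted = KeyedDatasetAdapter(
    toy_dataset,
    {
        "speech": lambda item: item[0],
        "text": lambda item: item[2],
    },
)
```

Keeping the field extraction in small per-field functions means the same adapter can be reused for datasets with different item layouts by changing only the mapping.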
See also
#5372