Support external dataset library for ESPnetEasy #5584

Merged
mergify[bot] merged 53 commits into espnet:master from Masao-Someki:feature/multi_dataset_support
Mar 18, 2024

@Masao-Someki (Contributor) commented Dec 10, 2023

What?

This PR adds support for several external dataset libraries and formats, and adds documentation on how to use them.
I currently plan to support the following:

  • torchdata (torchaudio.dataset)
  • Lhotse
  • huggingface dataset (optional)
  • WebDataset (optional)

Why?

While working on the espnetez project, we found that it is possible to support external dataset formats.
Espnetez can make it easier to write fine-tuning code in Python, but users cannot take full advantage of it unless they can define their own custom datasets.
Supporting the common dataset libraries in this PR therefore improves the usability of espnetez.
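
To illustrate what "fitting an external dataset to the ESPnet format" could look like, here is a minimal sketch. It is not the implementation from this PR: `DatasetAdapter`, the field names `speech` and `text`, and the `(id, fields)` return shape are all assumptions made for illustration, and a plain list stands in for an external dataset object.

```python
from typing import Any, Callable, Dict, Sequence, Tuple


class DatasetAdapter:
    """Hypothetical wrapper; not part of ESPnet or espnetez."""

    def __init__(
        self,
        dataset: Sequence[Any],
        data_info: Dict[str, Callable[[Any], Any]],
    ):
        self.dataset = dataset
        # Maps an output field name to a function that extracts it
        # from one raw example of the external dataset.
        self.data_info = data_info

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, index: int) -> Tuple[str, Dict[str, Any]]:
        example = self.dataset[index]
        # Apply every extractor to build the named fields for one sample.
        fields = {name: extract(example) for name, extract in self.data_info.items()}
        return str(index), fields


# Toy stand-in for an external dataset (assumed layout, for illustration only).
external = [
    {"audio": {"array": [0.0, 0.1, -0.1]}, "sentence": "hello"},
    {"audio": {"array": [0.2, 0.0]}, "sentence": "world"},
]

# Assumed field names; a real recipe would use whatever its task expects.
data_info = {
    "speech": lambda ex: ex["audio"]["array"],
    "text": lambda ex: ex["sentence"],
}

train_dataset = DatasetAdapter(external, data_info)
uid, sample = train_dataset[1]
print(uid, sample)
```

Keeping the extraction logic in a dictionary of small functions means the wrapper itself does not need to know anything about the source library, only how to index into it.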

See also

#5372

- ASRTransducerTask
- S2STTask
- S2TTask
- SpeakerTask
- Support torchaudio.dataset, and any other dataset format (if it can fit to ESPnet format).
@mergify mergify bot added the ESPnet2 label Dec 10, 2023
@Masao-Someki (Contributor, Author)

Currently, several iter_factory-related functions are not yet supported, and I'm working on this part.

@sw005320 sw005320 added this to the v.202312 milestone Dec 13, 2023
@Vaibhavs10

Hey @Masao-Someki @sw005320 - Lovely that you are adding support for datasets. Do let me know if you hit any roadblocks and I'd be more than happy to help you! 🤗
Very exciting!!

Masao-Someki and others added 3 commits December 25, 2023 02:04
- Add TEMPLATE folder
- Modified directory structure of current recipes
…ki/espnet into feature/multi_dataset_support
@Masao-Someki (Contributor, Author)

@sw005320 @wanchichen @juice500ml
Hi, I've incorporated the template directory into the egsez and adjusted the recipe directory to align with the structure used in other recipes, like egs2.

Could you please review the recipe folder, especially for the directory structure? Any comments you may have would be greatly appreciated!

@kan-bayashi kan-bayashi modified the milestones: v.202312, v.202405 Feb 6, 2024
@Masao-Someki Masao-Someki changed the title [WIP] Support external dataset library Support external dataset library for ESPnetEasy Mar 17, 2024
@sw005320 (Contributor) left a comment

  • Please use "ez" consistently. "easy" is too general a word, and it causes confusion.
  • Did you include a fine-tuning test? This is the most important use case. We may use the OWSM base model (or we may make a very small pre-trained model for this test).

@Masao-Someki (Contributor, Author)

@sw005320
Thank you for your review. I overlooked adding tests for fine-tuning. I plan to use a small trained model in the CI test so that both training and fine-tuning can be done in one file.

Masao-Someki and others added 3 commits March 17, 2024 21:53
- Removed unrequired lines in test_integration_espnetez
- Rename test_integrate_easy to test_integration_espnetez
- Rename test_easy to test_ez
- Downsize the model size in transducer config
- and little refactoring
Collaborator

This looks good, but you can optionally check some of the pytest tests inside https://github.com/espnet/espnet/blob/master/test/, and instead of args you could use pytest.mark.parametrize.
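
As a rough illustration of the suggestion above, the sketch below shows how `pytest.mark.parametrize` can replace command-line args for covering several configurations. The helper `build_config`, the task names, and the `use_finetune` flag are hypothetical and do not come from the test files in this PR.

```python
import pytest


def build_config(task: str, use_finetune: bool) -> dict:
    """Hypothetical helper standing in for the code under test."""
    return {"task": task, "mode": "finetune" if use_finetune else "train"}


# Stacked decorators run the test once per combination (3 tasks x 2 modes).
@pytest.mark.parametrize("task", ["asr", "asr_transducer", "s2t"])
@pytest.mark.parametrize("use_finetune", [False, True])
def test_build_config(task, use_finetune):
    config = build_config(task, use_finetune)
    assert config["task"] == task
    assert config["mode"] in ("train", "finetune")
```

With this layout each combination is reported as its own test case, so a failure points directly at the task and mode that broke.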

Contributor Author

I think this fine-tuning test should be an integration test, because it requires trained checkpoints from test_integration_espnetez.py.

Collaborator

Same as before. You can consider adding it to https://github.com/espnet/espnet/blob/master/test/, and instead of args you could use pytest.mark.parametrize.

@siddhu001 (Collaborator)

Thanks @Masao-Someki for your great work! I have added some minor suggestions but overall, the PR looks good to me.

- Moved integration test and unit test into `test/espnetez`
- Rename `test_finetune_espnetez` to `test_integration_espnetez_ft`
…ki/espnet into feature/multi_dataset_support
@sw005320 (Contributor) left a comment

This looks good to me.
Once the CI tests pass, I'll merge this PR.

@sw005320 sw005320 added the auto-merge Enable auto-merge label Mar 18, 2024
@sw005320 (Contributor)

Can you fix the CI issue?

- Format shellscript
@mergify mergify bot merged commit c3064f9 into espnet:master Mar 18, 2024
@Masao-Someki Masao-Someki deleted the feature/multi_dataset_support branch March 26, 2024 12:54

Labels: auto-merge (Enable auto-merge), CI (Travis, Circle CI, etc), ESPnet2, New Features, OWSM (Open Whisper-style Speech Model)

7 participants