25,000 hours of diverse English ASR data (dataset name hidden) (code from Samsung AI Center Cambridge) #2802
Conversation
…peechbrain-released into titou/LargeScaleASR
Here I'd love to have an opinion from @Adel-Moumen, @pplantinga, and @mravanelli about including this kind of recipe in SB. It's quite uncommon to have such a large part of the code devoted to preparing a dataset. I could use an official review from someone as well.
I think this sort of recipe is sorely needed for open-source research. NeMo has a similar recipe for their ASRset that is not open-sourced. I'm wondering if this recipe could be developed further to involve more sophisticated sample filtering: automatically transcribing via multiple ASR systems and keeping the samples with low rates of transcription differences. From what I understand, this is a common technique for large-scale ASR systems these days. As for the recipe itself, it looks like it repeats a lot of the dataset preparation for datasets we already have. Is there any way we can re-use some of the scripts already available?
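For reference, the agreement-based filtering described above could look roughly like the minimal sketch below. The `jiwer` dependency, the `whisper`/`wav2vec2` hypothesis keys, and the 0.1 threshold are illustrative assumptions, not part of this PR:

```python
# Minimal sketch: keep only samples where two independent ASR systems
# mostly agree, using the WER between their hypotheses as a proxy for
# label quality. Keys and threshold are assumptions for illustration.
import jiwer


def keep_sample(hyp_a: str, hyp_b: str, max_disagreement: float = 0.1) -> bool:
    """Return True if two ASR hypotheses disagree below the threshold.

    A high inter-hypothesis WER suggests the audio is hard or the
    transcript is unreliable, so the sample is dropped.
    """
    # jiwer.wer treats its first argument as the reference; since both
    # inputs are hypotheses, the direction is arbitrary here.
    return jiwer.wer(hyp_a, hyp_b) <= max_disagreement


samples = [
    {"id": "utt1", "whisper": "hello world", "wav2vec2": "hello world"},
    {"id": "utt2", "whisper": "good morning", "wav2vec2": "could mourning"},
]
filtered = [s for s in samples if keep_sample(s["whisper"], s["wav2vec2"])]
print([s["id"] for s in filtered])  # -> ['utt1']
```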
Agreed, @pplantinga. Let me answer the recipe part. All the scripts are different because the CSV rows and the steps/filtering are not the same across datasets. There is also some file copying involved, so I cannot reuse the existing data preparation. As you can see in the PR, there is also a new TextNormaliser class that I use. The ASR recipe PR will be much easier to review and merge.
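The actual TextNormaliser from the PR is not shown in this thread; below is a hypothetical sketch of what such a class typically does for ASR transcripts (the class name matches the PR, but every rule in the body is an assumption):

```python
# Hypothetical sketch of an ASR text normaliser: strip accents,
# lowercase, drop punctuation, collapse whitespace. The real class in
# the PR may apply different rules.
import re
import unicodedata


class TextNormaliser:
    _PUNCT = re.compile(r"[^\w\s']")
    _SPACES = re.compile(r"\s+")

    def normalise(self, text: str) -> str:
        # Decompose accented characters, then drop the combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in text if not unicodedata.combining(c))
        text = text.lower()
        # Replace punctuation with spaces, then collapse runs of spaces.
        text = self._PUNCT.sub(" ", text)
        return self._SPACES.sub(" ", text).strip()


norm = TextNormaliser()
print(norm.normalise("Café, déjà-vu!"))  # -> "cafe deja vu"
```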
Hey, what should we do about this PR? If I am not mistaken, at some point you were thinking of closing this PR, right? |
Yes, but not now. |
This PR introduces the data preparation of [insert hidden name] (the name is withheld for now due to conference anonymity rules).
The 6 subsets are (more details in the README):
According to models already trained on it, this gives the best English ASR model that SpeechBrain has had so far. This PR is here so that someone can recreate the dataset and upload it to HuggingFace (the data preparation produces a properly sharded HuggingFace dataset).
The code is in progress until the dataset has been uploaded to HuggingFace by someone.
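Once the sharded dataset is up, consuming it should be the standard `datasets` flow. A minimal sketch, where the repo id, subset name, and column names are placeholders since the real ones are hidden by the anonymity rules above:

```python
# Sketch of loading the sharded dataset from the HuggingFace Hub.
# "user/hidden-asr-dataset", "clean", "id", and "text" are placeholders.
from datasets import load_dataset

ds = load_dataset(
    "user/hidden-asr-dataset",
    "clean",
    split="train",
    streaming=True,  # avoids downloading all 25,000 hours up front
)

# Peek at the first two samples without materialising the dataset.
for sample in ds.take(2):
    print(sample["id"], sample["text"])
```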