
Conversation

TParcollet
Collaborator

This PR introduces the data preparation of [insert hidden name] (name withheld for now due to conference anonymity rules).

The 6 subsets are (more details in the readme):

  1. large contains 25,000 hours of read / spontaneous and clean / noisy transcribed speech.
  2. medium contains 2,500 hours of read / spontaneous and clean / noisy transcribed speech.
  3. small contains 250 hours of read / spontaneous and clean / noisy transcribed speech.
  4. clean contains 13,000 hours of read and clean / less noisy transcribed speech.
  5. dev contains 17 hours.
  6. test contains 17 hours.

Based on the models already trained on it, this is the best English ASR dataset that SpeechBrain has had so far. The purpose of this PR is for someone to recreate the dataset and upload it to HuggingFace (the data preparation produces a properly sharded HuggingFace dataset)...

The code is a work in progress until the dataset has been uploaded to HuggingFace by someone.
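To illustrate what "properly sharded" means in practice, here is a minimal, stdlib-only sketch of splitting prepared rows into fixed-size JSONL shards. This is purely illustrative: the actual recipe shards through the HuggingFace `datasets` library, and the function and file names here are invented for the example.

```python
import json
import os

def write_shards(rows, out_dir, rows_per_shard=1000):
    """Split `rows` (a list of dicts) into numbered JSONL shard files.

    Hypothetical sketch only: the real preparation uses the HuggingFace
    `datasets` library, not hand-rolled JSONL shards.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(rows), rows_per_shard):
        chunk = rows[start:start + rows_per_shard]
        # Zero-padded shard index keeps files lexicographically sorted.
        path = os.path.join(out_dir, f"shard-{start // rows_per_shard:05d}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for row in chunk:
                f.write(json.dumps(row) + "\n")
        paths.append(path)
    return paths
```

A sharded layout like this lets a downstream loader stream one file at a time instead of materialising 25,000 hours of metadata in memory.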

@TParcollet TParcollet added the recipes Changes to recipes only (add/edit) label Jan 16, 2025
@TParcollet TParcollet self-assigned this Jan 16, 2025
@TParcollet
Collaborator Author

I'd love to hear opinions from @Adel-Moumen, @pplantinga, and @mravanelli about having this kind of recipe in SB. It's quite uncommon for such a large part of the code to be devoted to preparing a dataset. I could also use an official review from someone.

@pplantinga
Collaborator

I think this sort of recipe is sorely needed for open-source research. NeMo has a similar recipe for their ASRset that is not open-sourced. I'm wondering if this recipe could be developed further to include more sophisticated sample filtering: automatically transcribing each sample with multiple ASR systems and keeping only the samples with low rates of transcription disagreement -- from what I understand this is a common technique for large-scale ASR systems these days.
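The filtering idea above can be sketched as follows: keep a sample only if every pair of ASR hypotheses agrees within some WER threshold. This is a minimal illustration of the technique being suggested, not code from the PR; the function names and the 10% threshold are assumptions.

```python
def wer(ref_words, hyp_words):
    """Word error rate via Levenshtein edit distance over word lists."""
    n, m = len(ref_words), len(hyp_words)
    # prev[j] holds the edit distance between ref_words[:i-1] and hyp_words[:j]
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            sub = prev[j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, sub)
        prev = cur
    return prev[m] / max(n, 1)

def keep_sample(hypotheses, max_pairwise_wer=0.1):
    """Keep a sample only if all ASR hypotheses pairwise agree closely.

    `hypotheses` is a list of transcripts of the same audio produced by
    different ASR systems (hypothetical interface, for illustration).
    """
    words = [h.lower().split() for h in hypotheses]
    return all(
        wer(words[a], words[b]) <= max_pairwise_wer
        for a in range(len(words))
        for b in range(a + 1, len(words))
    )
```

With real systems one would typically normalise the transcripts first and tune the threshold per subset, trading data quantity against label quality.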

As for the recipe itself, it looks like it repeats a lot of the dataset preparation for datasets we already have. Is there any way we can re-use some of the scripts already available?

@TParcollet
Collaborator Author

Agreed @pplantinga. Let me answer the recipe part. All the scripts are different because the CSV rows and the preparation steps / filtering are not the same; there is also some file copying involved, so I cannot reuse the existing data preparation. As you can see in the PR, there is also a new TextNormaliser class that I use. The ASR recipe PR will be much easier to review and merge...

@Adel-Moumen
Collaborator

Hey, what should we do about this PR? If I'm not mistaken, at some point you were thinking of closing it, right?

@TParcollet
Collaborator Author

Yes, but not now.
