
Conversation

TParcollet
Collaborator

This PR introduces the data preparation of [insert hidden name] (name withheld for now due to conference anonymity rules).

The 6 subsets are (more details in the readme):

  1. large contains 25,000 hours of read / spontaneous and clean / noisy transcribed speech.
  2. medium contains 2,500 hours of read / spontaneous and clean / noisy transcribed speech.
  3. small contains 250 hours of read / spontaneous and clean / noisy transcribed speech.
  4. clean contains 13,000 hours of read and clean / less noisy transcribed speech.
  5. dev contains 17 hours.
  6. test contains 17 hours.

Based on the models already trained on it, this is the best English ASR dataset that SpeechBrain has had so far. The purpose of this PR is for someone to recreate the dataset and upload it to HuggingFace (the data preparation produces a properly sharded HuggingFace dataset)...

The code is a work in progress until the dataset has been uploaded to HuggingFace by someone.
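To illustrate what "properly sharded" means in practice, here is a minimal, stdlib-only sketch of splitting prepared rows into fixed-size JSONL shards. This is purely illustrative: the actual recipe shards through the HuggingFace `datasets` library, and the function and file names here are invented for the example.

```python
import json
import os

def write_shards(rows, out_dir, rows_per_shard=1000):
    """Split `rows` (a list of dicts) into numbered JSONL shard files.

    Hypothetical sketch only: the real preparation uses the HuggingFace
    `datasets` library, not hand-rolled JSONL shards.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(rows), rows_per_shard):
        chunk = rows[start:start + rows_per_shard]
        # Zero-padded shard index keeps files lexicographically sorted.
        path = os.path.join(out_dir, f"shard-{start // rows_per_shard:05d}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for row in chunk:
                f.write(json.dumps(row) + "\n")
        paths.append(path)
    return paths
```

A sharded layout like this lets a downstream loader stream one file at a time instead of materialising 25,000 hours of metadata in memory.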

@TParcollet TParcollet added the recipes Changes to recipes only (add/edit) label Jan 16, 2025
@TParcollet TParcollet self-assigned this Jan 16, 2025
@TParcollet
Collaborator Author

I'd love to hear opinions from @Adel-Moumen, @pplantinga, and @mravanelli about having this kind of recipe in SB. It's quite uncommon for such a large part of the code to be devoted to preparing a dataset. I could also use an official review from someone.

@pplantinga
Collaborator

I think this sort of recipe is sorely needed for open-source research. NeMo has a similar recipe for their ASRset that is not open-sourced. I'm wondering if this recipe could be developed further to include more sophisticated sample filtering: automatically transcribing each sample with multiple ASR systems and keeping only the samples with low rates of transcription disagreement -- from what I understand this is a common technique for large-scale ASR systems these days.
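The filtering idea above can be sketched as follows: keep a sample only if every pair of ASR hypotheses agrees within some WER threshold. This is a minimal illustration of the technique being suggested, not code from the PR; the function names and the 10% threshold are assumptions.

```python
def wer(ref_words, hyp_words):
    """Word error rate via Levenshtein edit distance over word lists."""
    n, m = len(ref_words), len(hyp_words)
    # prev[j] holds the edit distance between ref_words[:i-1] and hyp_words[:j]
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            sub = prev[j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, sub)
        prev = cur
    return prev[m] / max(n, 1)

def keep_sample(hypotheses, max_pairwise_wer=0.1):
    """Keep a sample only if all ASR hypotheses pairwise agree closely.

    `hypotheses` is a list of transcripts of the same audio produced by
    different ASR systems (hypothetical interface, for illustration).
    """
    words = [h.lower().split() for h in hypotheses]
    return all(
        wer(words[a], words[b]) <= max_pairwise_wer
        for a in range(len(words))
        for b in range(a + 1, len(words))
    )
```

With real systems one would typically normalise the transcripts first and tune the threshold per subset, trading data quantity against label quality.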

As for the recipe itself, it looks like it repeats a lot of the dataset preparation for datasets we already have. Is there any way we can re-use some of the scripts already available?

@TParcollet
Collaborator Author

Agreed @pplantinga. Let me answer the recipe part. All the scripts are different because the CSV rows and the preparation steps / filtering are not the same; there is also some file copying involved, so I cannot reuse the existing data preparation. As you can see in the PR, there is also a new TextNormaliser class that I use. The ASR recipe PR will be much easier to review and merge...

@Adel-Moumen
Collaborator

Hey, what should we do about this PR? If I'm not mistaken, at some point you were thinking of closing it, right?

@TParcollet
Collaborator Author

Yes, but not now.
