A few updates for asr2 and hubert #5285
Codecov Report
@@           Coverage Diff           @@
##           master    #5285   +/-   ##
=======================================
  Coverage   76.10%   76.10%
=======================================
  Files         658      658
  Lines       59156    59156
=======================================
  Hits        45022    45022
  Misses      14134    14134
sampler = NumElementsBatchSampler(
    batch_bins=batch_bins,
    shape_files=[utt2num_samples],
)
batches = list(sampler)
iterator = SequenceIterFactory(
    dataset=dataset,
    batches=batches,
    collate_fn=CommonCollateFn(float_pad_value=0.0, int_pad_value=-1),
    num_workers=2,
).build_iter(0)
Is there a specific reason for using the numel sampler and the sequence iterator? I thought the streaming iterator would be enough (it is also easier to trace, since it does not go through the sampler).
Not really. I just think it is too complicated to support multiple choices of iterators and samplers, so I picked one arbitrarily. I am not sure which one is good enough, but in the end we may find one that covers most of the use cases, which may be enough, right?
The reason for using the numel sampler and the sequence iterator instead of the streaming iterator is that the utterances are sorted by length, so less computation is wasted on padding. Another reason for using the numel sampler is that it makes OOM errors easier to avoid than other samplers do, e.g. the sorted sampler.
That's fair, thanks for the clarification.
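To make the padding and OOM argument above concrete, here is a minimal sketch of batching by number of elements. It is not ESPnet's `NumElementsBatchSampler` implementation; the function names `batch_by_num_elements` and `padding_waste`, the utterance IDs, and the lengths are all hypothetical and chosen only for illustration. The assumption is that a batch's memory cost is roughly its padded size (longest utterance times batch size), so capping that product with `batch_bins` bounds memory, while sorting by length keeps padding small.

```python
from typing import Dict, List


def batch_by_num_elements(lengths: Dict[str, int], batch_bins: int) -> List[List[str]]:
    """Group utterance IDs so each padded batch holds at most batch_bins elements."""
    # Sort by length so that neighbours in a batch need little padding.
    keys = sorted(lengths, key=lambda k: lengths[k])
    batches: List[List[str]] = []
    current: List[str] = []
    for key in keys:
        # Keys are ascending, so the newest utterance sets the padded length.
        padded_size = lengths[key] * (len(current) + 1)
        if current and padded_size > batch_bins:
            batches.append(current)
            current = []
        # An utterance longer than batch_bins still gets a batch of its own.
        current.append(key)
    if current:
        batches.append(current)
    return batches


def padding_waste(batches: List[List[str]], lengths: Dict[str, int]) -> int:
    """Count padded positions that carry no real samples."""
    waste = 0
    for batch in batches:
        longest = max(lengths[k] for k in batch)
        waste += sum(longest - lengths[k] for k in batch)
    return waste


# Hypothetical utterance lengths in samples.
lengths = {"utt_a": 16000, "utt_b": 48000, "utt_c": 17000, "utt_d": 50000}
batches = batch_by_num_elements(lengths, batch_bins=100000)
sorted_waste = padding_waste(batches, lengths)

# Fixed-size batches taken in the original, unsorted order, for comparison.
unsorted_waste = padding_waste([["utt_a", "utt_b"], ["utt_c", "utt_d"]], lengths)
print(batches, sorted_waste, unsorted_waste)
```

With these toy lengths, the sorted, element-capped batches pad far fewer positions than fixed-size batches in arrival order, and no batch exceeds the `batch_bins` budget. Real samplers handle more cases (multiple shape files, minimum batch sizes), which this sketch leaves out.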
…etrained hubert checkpoints. Related to hubert and asr2 recipes
for more information, see https://pre-commit.ci
Do these new parts go through the test?
Yeah, it will be tested in
OK, sounds good.
Update hubert.sh to support downloading pretrained checkpoints.