Add Whisper SOT recipe for Librimix #5371
Conversation
simpleoier
left a comment
Thanks! It looks good to me in general.
    --train_set "${train_set}" \
    --valid_set "${valid_set}" \
    --test_sets "${test_sets}" \
    --lm_train_text "data/${train_set}/text_spk1 data/${train_set}/text_spk2 data/local/other_text/text" \
Can we simply use ${train_set}/text, which already contains the <sc> tokens in the text?
LM is not used in this recipe. I'll remove it.
    --valid_set "${valid_set}" \
    --test_sets "${test_sets}" \
    --lm_train_text "data/${train_set}/text_spk1 data/${train_set}/text_spk2 data/local/other_text/text" \
    --bpe_train_text "data/${train_set}/text_spk1 data/${train_set}/text_spk2" "$@"
Ditto. (Not sure about the LM, but BPE should be fine to use the text file.)
espnet2/bin/asr_inference.py
Outdated
        converter = TokenIDConverter(token_list=token_list)
    else:
        converter = OpenAIWhisperTokenIDConverter(model_type=bpemodel)
        converter = OpenAIWhisperTokenIDConverter(model_type=bpemodel, sot=sot_asr)
Is it necessary to check whether <sc> is in the token list in this case? Or has that already been done in the token ID converter class?
If sot_asr is True, the OpenAIWhisperTokenIDConverter will add the <sc> token in its init function.
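A minimal sketch of the pattern described above, using a hypothetical converter class (not the actual ESPnet implementation): when `sot=True`, the speaker-change token is appended to the vocabulary once, at construction time, so callers never have to check for it themselves.

```python
class SketchTokenIDConverter:
    """Hypothetical stand-in for a Whisper-style token ID converter."""

    def __init__(self, base_vocab, sot=False):
        self.vocab = list(base_vocab)
        if sot:
            # Register the speaker-change token during __init__.
            self.vocab.append("<sc>")
        self.token2id = {tok: i for i, tok in enumerate(self.vocab)}

    def tokens2ids(self, tokens):
        return [self.token2id[t] for t in tokens]


conv = SketchTokenIDConverter(["hello", "world"], sot=True)
print(conv.tokens2ids(["world", "<sc>", "hello"]))  # -> [1, 2, 0]
```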
In the inference part, we can retrieve the value of sot_asr from the asr_train_args, which eliminates the need for duplicate params in both train_args and decode_args, making them independent of each other.
In most cases, I think decode_args should only contain the params specific to inference, like beam_size, lm_weight, etc., without including those related to model or tokenizer initialization.
Thanks for the suggestion! I updated asr_inference.py, and now it gets sot_asr from asr_train_args.
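The pattern agreed on here can be sketched as follows; `asr_train_args` is a stand-in namespace, not the real ESPnet object. Model-related flags come from the saved training config, so the decode config only carries inference-specific parameters.

```python
from types import SimpleNamespace

# Stand-in for the training config loaded alongside the checkpoint.
asr_train_args = SimpleNamespace(sot_asr=True, token_type="whisper")

# Default to False so checkpoints trained before the flag existed still load.
sot_asr = getattr(asr_train_args, "sot_asr", False)
print(sot_asr)  # -> True
```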
Co-authored-by: Wangyou Zhang <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #5371 +/- ##
=======================================
Coverage 77.17% 77.17%
=======================================
Files 684 684
Lines 62643 62686 +43
=======================================
+ Hits 48343 48380 +37
- Misses 14300 14306 +6
Flags with carried forward coverage won't be shown.
... and 1 file with indirect coverage changes
    if sot_asr:
        full_vocab_size += 1
        fout.write("<sc>" + "\n")
Is it possible to import custom tokens from a file, like additional_vocab? This would allow us to add any tokens as needed, rather than only adding a token for the SOT training.
Whisper's vocabulary is not imported from a file, but loaded from the whisper PyPI package. So, here I add the special token in the code.
I think it would be better to have at least a way to specify which special token is used as a speaker change. The hard-coded token symbol may confuse the user.
I guess this would depend on other places, so if it is difficult to make this speaker-change token a variable, we could at least note in the source code and README, in an appropriate place, that it is a reserved token.
Thanks for your suggestions! Now it's configurable.
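A sketch of the configurable variant requested above, using a hypothetical helper (the real recipe code differs): the speaker-change symbol becomes a parameter with "<sc>" as the default, so the reserved token is visible in the function signature instead of being a hard-coded literal.

```python
def export_vocab(tokens, sot_asr=False, sc_token="<sc>"):
    """Return the token list, appending the speaker-change token if enabled."""
    out = list(tokens)
    if sot_asr:
        # The speaker-change token is configurable, not hard-coded below.
        out.append(sc_token)
    return out


vocab = export_vocab(["a", "b"], sot_asr=True)
print(len(vocab), vocab[-1])  # -> 3 <sc>
```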
This pull request is now in conflict :(
for more information, see https://pre-commit.ci
Thanks, @LiChenda!
@pengchengguo, this is a reminder. Can you review this PR again?
pengchengguo
left a comment
I only have a few suggestions, and the rest looks good to me.
    paste -d "" \
        <(<data/${dset}/text_spk1 awk '{$0=$0" <sc>"; print($0)}') \
        <(<data/${dset}/text_spk1 awk '{$0=$0" <sc> "; print($0)}') \
Xuankai and I have considered whether a "space" should be added after "<sc>". We don't know if it matters and how the original paper does it. Do you have any comments?
I think there should be no essential difference, but the second makes the text look more natural. My uploaded pre-trained model was trained with a "space" after "<sc>". Do you have any technical concerns about that "space"?
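The two reference-text variants being compared can be illustrated with a toy pair of transcripts (the strings here are made up for illustration):

```python
spk1 = "HELLO THERE"
spk2 = "GOOD MORNING"

# Old form: no space after <sc>, so the token touches the next word.
no_space = spk1 + " <sc>" + spk2
# Form used by the uploaded pre-trained model: a space on each side of <sc>.
with_space = spk1 + " <sc> " + spk2
print(with_space)  # -> HELLO THERE <sc> GOOD MORNING
```

With a space on each side, "<sc>" tokenizes as a standalone symbol between two ordinary words, which is why the second form reads more naturally.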
    for i in range(full_vocab_size - vocab_size):
        fout.write("()" + "\n")

    if sot_asr:
The exported token.txt file includes 50257 normal tokens, 1501 "{}" placeholders (which should be timestamps), and 1 <sc> token.
During training, the tokenizer will add an additional 1501 timestamp tokens and 1 <sc> token, as implemented in whisper_token_id_converter.py and whisper_tokenizer.py.
Although the token.txt file will not be used in practice, it is better to change the "{}" placeholders to real timestamps and make them consistent with the training process.
Thanks for your comments; I also feel that would be better. Now updated.
espnet2/text/whisper_tokenizer.py
Outdated
        raise ValueError("tokenizer unsupported:", model_type)

    self.tokenizer = copy.deepcopy(self.tokenizer)
    timestamps = [f"<|{i*30/1500:.2f}|>" for i in range(0, 1501)]
I'm assuming that 30 comes from Whisper's 30-second segment length and 1500 comes from the 20 ms shift.
It's okay to embed such numbers, but it would be great to add some comments (logging?) and also make them variables inside the function.
Thank you for pointing it out! I updated the code according to your comments.
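The timestamp generation discussed here can be written with the magic numbers named, as the review suggests (variable names here are illustrative): 30 seconds divided into 20 ms frames gives 1500 frames, and the inclusive range produces 1501 tokens from <|0.00|> to <|30.00|>.

```python
SEGMENT_SECONDS = 30  # Whisper's maximum segment length
NUM_FRAMES = 1500     # 30 s / 0.02 s per frame (20 ms shift)

# One token per frame boundary, inclusive of both endpoints.
timestamps = [
    f"<|{i * SEGMENT_SECONDS / NUM_FRAMES:.2f}|>" for i in range(NUM_FRAMES + 1)
]
print(timestamps[0], timestamps[1], timestamps[-1], len(timestamps))
# -> <|0.00|> <|0.02|> <|30.00|> 1501
```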
What?
This PR adds the Whisper SOT-style multi-talker ASR recipe for the Librimix dataset.
Why?
To add the multi-talker Whisper recipe.