Support Whisper-style training as a new task S2T#5120
sw005320 merged 72 commits into espnet:master
Conversation
Very cool, @pyf98!
egs2/must_c_v2/s2t1/conf/tuning/train_s2t_ebf_lr1e-3_warmup5k.yaml (outdated; review comment resolved)
class ESPnetS2TModel(AbsESPnetModel):
Maybe we can inherit the ASR model? That would be especially helpful for getting the latest updates from there.
This seems great. Let me check it. Thanks.
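The suggestion above could look roughly like the following sketch. The stand-in base class and its arguments are assumptions for illustration; the real ESPnet classes are much richer, and only the inheritance pattern is the point here.

```python
# Hypothetical sketch of the reviewer's suggestion: instead of duplicating the
# ASR model, let the S2T model inherit from it and override only what differs.
# ESPnetASRModel here is a minimal stand-in, not the actual ESPnet class.

class ESPnetASRModel:  # stand-in for espnet2.asr.espnet_model.ESPnetASRModel
    def __init__(self, vocab_size, special_tokens=()):
        self.vocab_size = vocab_size
        self.special_tokens = tuple(special_tokens)

    def forward(self, speech, text):
        # Placeholder for the encoder/decoder and CTC loss computation.
        return {"loss": 0.0, "ntokens": len(text)}


class ESPnetS2TModel(ESPnetASRModel):
    """Whisper-style S2T model: same backbone, extra special tokens."""

    def __init__(self, vocab_size):
        # <sop> marks the start of the prompt; <na> marks missing text.
        super().__init__(vocab_size, special_tokens=("<sop>", "<na>"))


model = ESPnetS2TModel(vocab_size=5000)
print(model.special_tokens)  # ('<sop>', '<na>')
```

Any S2T-specific behavior (prompt handling, timestamp tokens) would then live only in the subclass, and backbone fixes in the ASR model would be picked up automatically.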
class S2TTask(AbsTask):
Similar to the comment above, maybe inheriting the ASR task would be a good option?
Thanks. Will check it.
Codecov Report
@@ Coverage Diff @@
## master #5120 +/- ##
==========================================
- Coverage 77.15% 77.10% -0.05%
==========================================
Files 676 681 +5
Lines 61337 62288 +951
==========================================
+ Hits 47327 48030 +703
- Misses 14010 14258 +248
Hi @sw005320, can we merge this PR now (given the ASRU result)? Thanks!
sw005320 left a comment:
It's almost ready, but I just want to check some points.
What kind of statistics are you getting, and what are they used for?
It shows the number of hours for every language pair in the specified data directory. It is not used for training or decoding; it is just a separate utility tool.
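Such a utility could be sketched as below. The record layout (language pair plus a duration in seconds, as one might assemble from a Kaldi-style utt2dur file and per-utterance language tags) is an assumption for illustration, not the actual script in this PR.

```python
# Hypothetical sketch of a per-language-pair duration summary: sum utterance
# durations (in seconds) per (src, tgt) pair and report hours.
from collections import defaultdict


def hours_per_pair(records):
    """records: iterable of (src_lang, tgt_lang, duration_sec) tuples."""
    totals = defaultdict(float)
    for src, tgt, dur in records:
        totals[(src, tgt)] += dur
    return {pair: sec / 3600.0 for pair, sec in totals.items()}


# Two ST utterances (en->de) and one ASR utterance (en->en):
records = [("en", "de", 7200.0), ("en", "de", 1800.0), ("en", "en", 3600.0)]
print(hours_per_pair(records))
# {('en', 'de'): 2.5, ('en', 'en'): 1.0}
```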
It seems that this part does not have good test coverage.
Are there any specific reasons?
One reason is that we cannot easily test the long-form decoding, which relies on the predicted timestamps. With a random input or a random model, we cannot generate structured output, which makes the decoding hang forever.
What is the difference between s2t_inference and this?
s2t_inference requires a language token and performs the full decoding.
Here, it only predicts the language token, which means we set the output length to be exactly 1.
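The one-step language identification described above can be sketched as follows. The token list and the scoring interface are illustrative assumptions, not ESPnet's actual API; the point is only that the output length is fixed to 1 and the argmax is taken over language tokens.

```python
# Sketch of language identification as a single decoding step: restrict the
# output length to exactly 1 and pick the most likely language token.

LANG_TOKENS = ["<en>", "<de>", "<ja>"]  # assumed token inventory


def identify_language(first_step_logits, lang_token_ids):
    """Return the id of the highest-scoring language token at step 1."""
    return max(lang_token_ids, key=lambda i: first_step_logits[i])


# Suppose the vocabulary maps the language tokens to ids 0..2, and these are
# the decoder's logits at the first (and only) output step:
logits = [0.1, 2.3, -0.5]
lang_id = identify_language(logits, range(len(LANG_TOKENS)))
print(LANG_TOKENS[lang_id])  # <de>
```

In full decoding (as in s2t_inference), this same token would instead be the first of many generated tokens; here generation simply stops after it.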
Would it be possible to realize this with s2t_inference by adding an option that sets the output length to 1?
If there are any other differences, we can leave it as it is.
class S2TPreprocessor(CommonPreprocessor):
Can you leave some comments on how it is different from the CommonPreprocessor?
Thanks for your great efforts.
Hi, this PR adds a new task s2t1 (speech-to-text) in ESPnet2 which follows OpenAI Whisper's training style. It has two major features:

The training data has the following format:

where <sop> is a special token denoting the start of the prev/prompt sentence. The timestamps are also treated as special tokens because the audio has a fixed length (30s) and resolution (20ms). An example looks like:

During data preparation, three text files are generated:
- text contains the normal target sentence, i.e., the text between <sos> and <eos>.
- text.prev contains the previous sentence, i.e., the text between <sop> and <sos>. This might be unavailable at the beginning of a talk; in such cases, a special token <na> is used.
- text.ctc contains the ASR transcript without any special tokens, which is used for the CTC loss. For ASR utterances, this can be derived from text, but for ST utterances, it is in a different language. If the ASR transcription is not available, <na> is used.

For decoding, the model can perform utterance-level ASR or ST, which follows the same procedure as the standard tasks. It can also perform long-form ASR or ST based on the predicted timestamps.
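The three-file layout described above can be sketched as follows. The function and field names are assumptions for illustration, not the actual data-preparation script in this recipe; the sketch only shows the <na> fallback behavior for text.prev and text.ctc.

```python
# Hedged sketch of the data-preparation step: for each utterance, emit one
# line per output file (text, text.prev, text.ctc), substituting <na> when
# the previous sentence or the ASR transcript is unavailable.

def prepare_entries(utt_id, target, prev=None, asr_transcript=None):
    """Return the per-utterance lines for the three text files."""
    return {
        "text": f"{utt_id} {target}",
        "text.prev": f"{utt_id} {prev if prev is not None else '<na>'}",
        "text.ctc": f"{utt_id} {asr_transcript if asr_transcript is not None else '<na>'}",
    }


# First utterance of a talk (no previous sentence); an ST utterance whose
# ASR transcript is in the source language:
entries = prepare_entries("utt1", "guten Tag", prev=None, asr_transcript="good day")
print(entries["text.prev"])  # utt1 <na>
print(entries["text.ctc"])   # utt1 good day
```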