CoVoST2 ASR2 recipe and new ST2 recipe #5318
simpleoier wants to merge 3 commits into espnet:master
Conversation
ftshijt
left a comment
The main implementation looks good. My only concern is the potential confusion around "src" in the ST task.
In the ST task we usually do multi-task training, where we also predict the source-language transcript, which we refer to as "src". That is different from the input speech. Could you please factor out that part if possible instead of reusing it? Otherwise it could cause a lot of confusion between the st1 and st2 implementations.
Force-pushed from d4e8eb1 to 8babca8
Codecov Report
❌ Patch coverage is

```text
@@            Coverage Diff             @@
##           master    #5318      +/-   ##
==========================================
- Coverage   77.17%   77.14%   -0.03%
==========================================
  Files         684      684
  Lines       62643    62735      +92
==========================================
+ Hits        48345    48399      +54
- Misses      14298    14336      +38
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
This pull request is now in conflict :(
Thanks!
What does it look like?
Could you paste an example?
It is like the tqdm progress bar, showing the percentage progress, current_batch / total_batches, time_so_far < est_time_to_finish, and time_per_batch:
```text
0%| | 1/1577 [00:20<8:49:13, 20.15s/it]
```
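Those fields can be reproduced with a small helper; a minimal sketch (the function `format_progress` is my own, not part of ESPnet or tqdm):

```python
def format_progress(batch_idx, total_batches, elapsed_sec):
    """Format a tqdm-style line: percent, batch count, elapsed<ETA, s/it."""
    per_batch = elapsed_sec / batch_idx                 # time_per_batch
    eta = per_batch * (total_batches - batch_idx)       # est_time_to_finish

    def hms(t):
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        return f"{h}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"

    pct = 100 * batch_idx // total_batches
    return (f"{pct}%| | {batch_idx}/{total_batches} "
            f"[{hms(elapsed_sec)}<{hms(eta)}, {per_batch:.2f}s/it]")

print(format_progress(1, 1577, 20.15))
```

With one batch done after 20.15 s, the ETA is simply 20.15 s multiplied by the 1576 remaining batches.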
egs2/TEMPLATE/asr2/README.md
Outdated
```text
8. cp ../../librispeech/asr2/conf/tuning/train_discrete_asr_e_branchformer1.yaml conf/ # copy training conf
9. cp ../../librispeech/asr2/conf/decode_ctc0.3.yaml conf/ # copy confs
10. EDIT run.sh by checking ../asr1/run.sh
    a. We may skip an LM
```
```diff
- a. We may skip an LM
+ a. We may skip an LM by adding an option `--use_lm false`
```
egs2/TEMPLATE/asr2/README.md
Outdated
* SSL model choice can affect the performance a lot; e.g., wavlm models may not work for non-English data.
* Layer selection is also important: different layers retain different information. For example, based on the training criterion, the 24th layer of HuBERT_large tries to match the information from HuBERT_base layer 9. If you don't have prior experience, please refer to Fig. 4 of this [CCA paper](https://arxiv.org/pdf/2211.03929.pdf), which is usually helpful.
* The number of kmeans clusters also affects the variance in pronunciation, etc.
* Please check the kmeans labels in `dump/extracted/{kmeans_feat_type}/layer{layer}/{dset}/pseudo_label_km{nclusters}.txt`. In my experience, a good km result for ASR should have an obvious pattern of repetitions, e.g.
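One quick sanity check (my own sketch, not an ESPnet script): measure how much a label sequence shrinks when consecutive repeats are collapsed. An ASR-friendly clustering tends to hold the same label across several frames of a phone, so the deduplicated sequence is much shorter; near-random frame labels barely compress.

```python
from itertools import groupby

def repetition_ratio(labels):
    """Fraction of frames removed when collapsing consecutive repeats.
    Higher means a more obvious repetition pattern."""
    dedup = [k for k, _ in groupby(labels)]
    return 1.0 - len(dedup) / len(labels)

good = [3, 3, 3, 7, 7, 12, 12, 12, 12, 5]  # repetitive, phone-like runs
bad = [3, 9, 1, 7, 4, 12, 8, 2, 6, 5]      # near-random frame labels
assert repetition_ratio(good) > repetition_ratio(bad)
```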
Maybe you can add a bad example as well?
```diff
  ./asr2.sh \
-     --kmeans_opts "--batch_bins 4800000" \
+     --kmeans_opts "--batch_bins 4800000 --nj 4" \
```
You may need to increase --num_threads to deal with the large memory consumption in scikit-learn?
(Ideally, I want you to solve it by avoiding such a less-refined implementation.)
OK. Working on this item.
espnet2/bin/mt_inference.py
Outdated
```python
if quantize_mt_model or quantize_lm:
    if quantize_dtype == "float16" and torch.__version__ < LooseVersion(
        "1.5.0"
    ):
        raise ValueError(
            "float16 dtype for dynamic quantization is not supported with "
            "torch version < 1.5.0. Switch to qint8 dtype instead."
        )
```
Since CI does not support 1.5.0, we can remove these lines.
The CTC BPE token part looks complicated and tricky.
It requires some documentation (in the source code and the asr2 or st2 documents).
The idea is simple; maybe my implementation is a bit complicated. In st2, we use different text targets for CTC and the attention decoder:
- CTC target: ASR transcriptions for ASR or ST, while <not_available> is used in MT.
- Att-Dec target: ASR transcriptions for ASR, translation transcriptions for ST / MT.
For this purpose, we need different text inputs as data. In the ESPnet preprocessor, the number of tokenizers should match the number of text inputs:
```python
assert (
    len(token_type) == len(token_list) == len(bpemodel) == len(text_name)
), "token_type, token_list, bpemodel, or processing text_name mismatched"
```
But in practice, the bpe model for CTC and Att-Dec is the same: we combine the vocabularies of the ASR language and the translation language. However, I made the CTC text tokenizer an explicit option, which is easy to change.
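To make the per-task target assignment concrete, here is a toy sketch (the names `NOT_AVAILABLE` and `pick_targets` are mine, not the actual ESPnet code) of how the CTC and attention-decoder targets could be chosen, with the placeholder symbol used for MT's CTC branch:

```python
NOT_AVAILABLE = "<not_available>"  # hypothetical placeholder symbol

def pick_targets(task, transcript, translation):
    """Return (ctc_target, attdec_target) following the scheme above."""
    if task == "asr":
        return transcript, transcript
    if task == "st":
        return transcript, translation
    if task == "mt":
        return NOT_AVAILABLE, translation
    raise ValueError(f"unknown task: {task}")

assert pick_targets("st", "hola mundo", "hello world") == ("hola mundo", "hello world")
assert pick_targets("mt", "hola mundo", "hello world")[0] == NOT_AVAILABLE
```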
```python
speech_name: str = "speech",
text_name: List[str] = ["text"],
tokenizer_encode_conf: List[Dict] = [dict(), dict()],
not_available_symbol: str = None,
```
Can you explain it and embed the explanation in the source code?
I'll put the following explanation down below.
```python
# not_available_symbol is a special symbol used as a placeholder in the text, e.g.
#     "utt_id <na>"  # an item in the text input
# Such samples will not have the corresponding text signals.
# The resulting tensor will be processed as torch.LongTensor([-1])
```
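A minimal sketch of that behavior (plain Python lists standing in for torch tensors; `tokens_or_na` is a hypothetical helper, not the actual preprocessor):

```python
def tokens_or_na(text, token2id, not_available_symbol="<na>"):
    """Map text to token ids; a placeholder utterance becomes [-1],
    signalling 'no target for this branch' downstream."""
    if text.strip() == not_available_symbol:
        return [-1]  # the preprocessor would emit torch.LongTensor([-1])
    return [token2id[tok] for tok in text.split()]

vocab = {"hello": 0, "world": 1}
assert tokens_or_na("hello world", vocab) == [0, 1]
assert tokens_or_na("<na>", vocab) == [-1]
```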
```bash
speech_token_lang="wavlm_large_21_km2000"  # speech discrete token type abbrev. id (e.g., wavlm_large_21_km2000)
src_tgt_text_case="lc.rm"  # source / target transcript case. Note, all source / target text should use the same case for now.
src_tgt_text_lang=en  # source / target language abbrev. id (e.g., en). Multiple langs are supported to support multiple tasks, with space between (e.g., "es/en"); from the data's perspective, src_lang of text is the first.
tgt_tasks="asr/st"  # task abbrev. id (e.g., st). Multiple tasks are supported, with space between (e.g., "asr/st")
```
space between or "/" between?
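For reference, the example values above ("es/en", "asr/st") use "/" as the separator; a bash sketch (my own, not from st.sh) of splitting such a field into its parts:

```shell
#!/usr/bin/env bash
# Split a "/"-separated field like tgt_tasks="asr/st" into an array.
tgt_tasks="asr/st"
IFS='/' read -r -a task_array <<< "${tgt_tasks}"
for task in "${task_array[@]}"; do
    echo "task: ${task}"
done
```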
Since it is for multi-task, is it a good idea to put it in the st task, or in a more general one?
I'm asking because the major reasons we split st from asr in the previous context were:
- the different architectures (for st-only tasks, we also have a few unique architecture designs for each component, e.g., separate asr/mt decoders, a two-pass framework with multi-decoder, etc.)
- data preparation (designed for tgt_text, src_text, and src_speech)
- specific evaluation (bleu calculation and multi-reference support)
However, I feel many of the above parts are not shared here (e.g., the architecture is still discrete asr; we skip the joint-task framework and instead do either asr or st in a single run; there is no support for multi-reference scoring).
Given the above reasons, I lean toward calling it s2t2 instead of st2. Please let me know your thoughts!
```diff
@@ -0,0 +1,2066 @@
+#!/usr/bin/env bash
```
The script looks great!
One issue remaining on my side is the support of multi-reference scenarios for ST evaluation (I double-checked, but it seems not supported in the preparation yet). Please refer to https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/st1/st.sh#L547-L553 for some details on how we process those.
For more information on the name:
https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/st1/st.sh#L1585C75-L1585C75
```bash
hf_task=automatic-speech-recognition
# shellcheck disable=SC2034
espnet_task=ASR
```
This PR is stale because it has been open for 90 days with no activity.
This PR is closed. Please re-open if needed.
ST2 recipe
A combination of ASR + ST + MT tasks, which can use speech discrete tokens and text transcriptions together.
(More details will be filled in.)
- `numel_sampler`: to support same-task data in a single batch.

ASR2 recipe
Misc.
- `pyscripts/feats/ssl_feature_utils.py` and `pyscripts/feats/dump_km_labels.py`
- `librispeech/asr2`, `covost2/asr2`, `covost2/st2`
- `--speech_feats_type extracted`
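A toy sketch of the single-task batching idea behind `numel_sampler` (pure Python; the function name and structure are my own illustration, not ESPnet's actual sampler):

```python
from collections import defaultdict

def single_task_batches(samples, batch_size):
    """Group utterance ids by task so each batch contains one task only.

    samples: list of (utt_id, task) pairs; batch_size: max ids per batch.
    """
    by_task = defaultdict(list)
    for uid, task in samples:
        by_task[task].append(uid)
    batches = []
    for task, uids in by_task.items():
        for i in range(0, len(uids), batch_size):
            batches.append(uids[i:i + batch_size])
    return batches

samples = [("u1", "asr"), ("u2", "st"), ("u3", "asr"), ("u4", "st")]
assert single_task_batches(samples, 2) == [["u1", "u3"], ["u2", "u4"]]
```

Keeping each mini-batch to a single task means the loss computation never has to mix CTC/attention target layouts from different tasks within one forward pass.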