CoVoST2 ASR2 recipe and new ST2 recipe #5318

Closed
simpleoier wants to merge 3 commits into espnet:master from simpleoier:discrete_asr

Conversation

@simpleoier
Collaborator

@simpleoier simpleoier commented Jul 21, 2023

ST2 recipe

A combined ASR + ST + MT task, which can use discrete speech tokens and text transcriptions together.
(More details to be filled in.)

  • utt2category in numel_sampler: ensures each mini-batch contains data from a single task/category.
  • new st2 template
  • covost2 recipe example
  • mini_an4/st2 test
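The utt2category idea in the first bullet can be sketched as follows: group utterances by category so that a batch never mixes tasks. This is a simplified illustration, not the actual espnet2 sampler — the real num-elements sampler also balances batches by element count, which is omitted here.

```python
from collections import defaultdict

def batch_by_category(utt2category, utt_ids, batch_size):
    """Group utterances so each mini-batch holds a single task/category.

    Simplified sketch of the utt2category idea; the real numel sampler
    additionally sizes batches by the number of elements per utterance.
    """
    buckets = defaultdict(list)
    for utt in utt_ids:
        buckets[utt2category[utt]].append(utt)
    batches = []
    for utts in buckets.values():
        # Chunk each category's utterances into fixed-size batches.
        for i in range(0, len(utts), batch_size):
            batches.append(utts[i:i + batch_size])
    return batches
```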

ASR2 recipe

  • Add ASR2 recipe for CoVoST2 data.

Misc.

  • tqdm progress bar in pyscripts/feats/ssl_feature_utils.py and pyscripts/feats/dump_km_labels.py
  • limit number of GPU jobs in recipes: librispeech/asr2, covost2/asr2, covost2/st2.
  • update readme for recipe creation (thanks to @sw005320 )
  • update readme for asr2 tips
  • show an example line of pseudo labels.
  • add data filtering in stage 6 of asr2 / st2.
  • add support for extracted features in kmeans, making it easy to extend to other feature types:
    --speech_feats_type extracted

@sw005320 sw005320 added the ST Speech translation label Jul 21, 2023
@sw005320 sw005320 added this to the v.202307 milestone Jul 21, 2023
@sw005320 sw005320 requested a review from ftshijt July 21, 2023 03:49
Collaborator

@ftshijt ftshijt left a comment

The main implementation looks good. My only concern is the potential confusion around "src" in the ST task.

In the ST task we usually do multi-task training, also predicting the source-language transcript, which we refer to as "src". That is different from the input speech. Could you please factor out that part if possible instead of reusing it? Otherwise it could bring a lot of confusion between the st1 and st2 implementations.

@simpleoier simpleoier force-pushed the discrete_asr branch 2 times, most recently from d4e8eb1 to 8babca8 Compare July 21, 2023 21:20
@simpleoier simpleoier changed the title [WIP] CoVoST2 ASR2 recipe and new ST2 recipe CoVoST2 ASR2 recipe and new ST2 recipe Jul 21, 2023
@codecov

codecov bot commented Jul 22, 2023

Codecov Report

❌ Patch coverage is 78.57143% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.14%. Comparing base (92207e2) to head (dde2f85).
⚠️ Report is 5449 commits behind head on master.

Files with missing lines Patch % Lines
espnet2/samplers/num_elements_batch_sampler.py 77.55% 11 Missing ⚠️
espnet2/tasks/mt.py 52.38% 10 Missing ⚠️
espnet2/bin/mt_inference.py 72.41% 8 Missing ⚠️
espnet2/asr/discrete_asr_espnet_model.py 94.00% 3 Missing ⚠️
espnet2/train/preprocessor.py 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5318      +/-   ##
==========================================
- Coverage   77.17%   77.14%   -0.03%     
==========================================
  Files         684      684              
  Lines       62643    62735      +92     
==========================================
+ Hits        48345    48399      +54     
- Misses      14298    14336      +38     
Flag                        Coverage Δ
test_configuration_espnet2  ∅ <ø> (∅)
test_integration_espnet1    65.73% <ø> (-0.03%) ⬇️
test_integration_espnet2    49.10% <45.45%> (+<0.01%) ⬆️
test_python_espnet1         19.85% <0.00%> (-0.11%) ⬇️
test_python_espnet2         52.26% <51.94%> (-0.03%) ⬇️
test_utils                  23.10% <ø> (ø)

Flags with carried forward coverage won't be shown.

@mergify
Contributor

mergify bot commented Jul 22, 2023

This pull request is now in conflict :(

Contributor

Thanks!
What does it look like?
Could you paste an example?

Collaborator Author

It is a tqdm progress bar showing the percentage, current_batch / total_batches, time_so_far / estimated_time_to_finish, and time_per_batch.

0%|          | 1/1577 [00:20<8:49:13, 20.15s/it]
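A minimal sketch of how such a bar is produced (illustrative only; the internals of pyscripts/feats/dump_km_labels.py are not reproduced here, and the function name below is hypothetical):

```python
try:
    from tqdm import tqdm
except ImportError:  # fall back to a plain iterator if tqdm is unavailable
    def tqdm(iterable, **kwargs):
        return iterable

def dump_labels(batches, process):
    """Process batches while showing a tqdm bar like the one above:
    percentage, current/total batches, elapsed<remaining, seconds/it."""
    results = []
    for batch in tqdm(batches, desc="dump_km_labels"):
        results.append(process(batch))
    return results
```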

8. cp ../../librispeech/asr2/conf/tuning/train_discrete_asr_e_branchformer1.yaml conf/ # copy training conf
9. cp ../../librispeech/asr2/conf/decode_ctc0.3.yaml conf/ # copy confs
10. EDIT run.sh by checking ../asr1/run.sh
a. We may skip an LM
Contributor

Suggested change
- a. We may skip an LM
+ a. We may skip an LM by adding an option `--use_lm false`

* SSL model choice can affect the performance a lot; e.g., wavlm models may not work well for non-English data.
* Layer selection is also important: different layers retain different information. For example, based on the training criterion, the 24th layer of HuBERT_large tries to match the information from HuBERT_base layer 9. If you don't have experience with this, Fig. 4 of this [CCA paper](https://arxiv.org/pdf/2211.03929.pdf) is usually helpful.
* The number of kmeans clusters also affects the variance in pronunciation, etc.
* Please check the kmeans labels in `dump/extracted/{kmeans_feat_type}/layer{layer}/{dset}/pseudo_label_km{nclusters}.txt`. In my experience, a good km result for ASR should have an obvious pattern of repetitions, e.g.
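As a rough way to check the "obvious pattern of repetitions" mentioned in the last tip, one could measure how often consecutive k-means labels repeat. This is a hypothetical diagnostic, not part of the recipe:

```python
def repetition_rate(labels):
    """Fraction of adjacent positions where the k-means label repeats.

    Rough diagnostic for pseudo-labels: good ASR-oriented labels tend to
    show long runs of the same cluster id (high rate), while near-random
    labels show almost none.
    """
    if len(labels) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return repeats / (len(labels) - 1)
```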
Contributor

Maybe you can add a bad example as well?

Collaborator Author

Sure.

Contributor

@ftshijt, can you review it?


  ./asr2.sh \
-     --kmeans_opts "--batch_bins 4800000" \
+     --kmeans_opts "--batch_bins 4800000 --nj 4" \
Contributor

You may need to increase --num_threads to deal with the large memory consumption in scikit-learn?
(Ideally, I want you to solve it by avoiding such a less-refined implementation.)

Collaborator Author

OK. Working on this item.

Comment on lines +82 to +89
if quantize_mt_model or quantize_lm:
if quantize_dtype == "float16" and torch.__version__ < LooseVersion(
"1.5.0"
):
raise ValueError(
"float16 dtype for dynamic quantization is not supported with "
"torch version < 1.5.0. Switch to qint8 dtype instead."
)
Contributor

Since CI does not support torch < 1.5.0, we can remove these lines.

Contributor

CTC BPE token part looks complicated and tricky.
It requires some documentation (in the source code and asr2 or st2 documents).

Collaborator Author

@simpleoier simpleoier Jul 23, 2023

The idea is simple, though maybe my way is a bit complicated. In st2, we use different text targets for CTC and the attention decoder.

  • CTC target: the ASR transcription for ASR or ST, while <not_available> is used for MT.
  • Att-Dec target: the ASR transcription for ASR, and the translation for ST / MT.

For this purpose, we need different text inputs as data. In the ESPnet preprocessor, the number of tokenizers should match the number of text inputs.

        assert (
            len(token_type) == len(token_list) == len(bpemodel) == len(text_name)
        ), "token_type, token_list, bpemodel, or processing text_name mismatched"

But in practice, the BPE model for CTC and Att-Dec is the same: we combine the vocabularies of the ASR language and the translation language. However, I made the CTC text tokenizer an explicit option, which is easy to change.
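The per-task target scheme described above can be sketched as follows. The function and variable names here are illustrative, not the actual espnet2 API:

```python
NA = "<na>"  # placeholder symbol for an unavailable target

def select_targets(task, src_text, tgt_text):
    """Pick (ctc_target, attdec_target) per task, following the scheme
    above: CTC uses the ASR transcript when speech is present, and the
    attention decoder uses the translation for ST / MT."""
    if task == "asr":
        return src_text, src_text   # CTC: ASR transcript, Att-Dec: ASR transcript
    if task == "st":
        return src_text, tgt_text   # CTC: ASR transcript, Att-Dec: translation
    if task == "mt":
        return NA, tgt_text         # CTC unavailable for text-only input
    raise ValueError(f"unknown task: {task}")
```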

speech_name: str = "speech",
text_name: List[str] = ["text"],
tokenizer_encode_conf: List[Dict] = [dict(), dict()],
not_available_symbol: str = None,
Contributor

Can you explain it and embed the explanation in the source code?

Collaborator Author

I'll put the following explanation down below.

        # not_available_symbol is a placeholder symbol in the text, e.g.
        #     "utt_id <na>" as an item in the text input.
        # Such samples then have no corresponding text signal, and the
        # resulting tensor is processed as torch.LongTensor([-1]).
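A minimal sketch of that placeholder behavior, with plain Python lists standing in for torch tensors (names are illustrative, not the actual preprocessor API):

```python
NA_SYMBOL = "<na>"

def encode_text(text, tokenize):
    # "<na>" marks a sample with no corresponding text signal; it is
    # mapped to the sentinel sequence [-1] instead of being tokenized.
    # (Sketch: the real preprocessor builds a torch.LongTensor.)
    if text == NA_SYMBOL:
        return [-1]
    return tokenize(text)
```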

@simpleoier simpleoier mentioned this pull request Jul 26, 2023
2 tasks
@kan-bayashi kan-bayashi modified the milestones: v.202307, v.202312 Aug 3, 2023
@mergify
Contributor

mergify bot commented Sep 23, 2023

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Sep 23, 2023
@mergify mergify bot removed the conflicts label Sep 27, 2023
speech_token_lang="wavlm_large_21_km2000" # speech discrete token type abbrev. id (e.g., wavlm_large_21_km2000)
src_tgt_text_case="lc.rm" # source / target transcript case. Note: all source / target text should use the same case for now.
src_tgt_text_lang=en # source / target language abbrev. id (e.g., en). Multiple langs are supported for multiple tasks, with space between (e.g., "es/en"); from the data's perspective, the src_lang of the text comes first.
tgt_tasks="asr/st" # task abbrev. id (e.g., st). Multiple tasks are supported, with space between (e.g., "asr/st")
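For illustration, splitting such multi-valued variables could look like the following in Python. Note the "/" separator here is an assumption based on the examples, since the comments above say "space between" while the examples use "/":

```python
def parse_multi(value, sep="/"):
    # Split a multi-valued recipe variable such as tgt_tasks="asr/st"
    # into its components, dropping empty entries.
    return [v for v in value.split(sep) if v]
```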
Collaborator

space between or "/" between?

Collaborator

Since it is for multi-task, is it a good idea to put it in the st task, or in a more general one?

I'm asking because the major reasons we split st from asr in the previous context are:

  • the different architectures (for st-only tasks, we also have a few unique architecture designs for each component, e.g., separate asr/mt decoders, a two-pass framework with multi-decoder, etc.)
  • data preparation (designed for tgt_text, src_text, and src_speech)
  • specific evaluation (BLEU calculation and multi-reference support)

However, I feel many of the above parts are not shared here (e.g., the architecture is still discrete asr; we skip the joint-task framework and instead do either asr or st in a single run; there is no support for multi-reference scoring).

Given the above reasons, I lean toward calling it s2t2 instead of st2. Please let me know your thoughts!

@@ -0,0 +1,2066 @@
#!/usr/bin/env bash
Collaborator

The script looks great!

One remaining issue on my side is the support of multi-reference scenarios for ST evaluation (I double-checked, but it seems not yet supported in the preparation). Please refer to https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/st1/st.sh#L547-L553 for some details on how we process those.

Comment on lines +2046 to +2048
hf_task=automatic-speech-recognition
# shellcheck disable=SC2034
espnet_task=ASR
Collaborator

consider changing it?

@kan-bayashi kan-bayashi modified the milestones: v.202310, v.202312 Oct 25, 2023
@mergify
Contributor

mergify bot commented Oct 25, 2023

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Oct 25, 2023
@kan-bayashi kan-bayashi modified the milestones: v.202312, v.202405 Feb 6, 2024
@mergify
Contributor

mergify bot commented Feb 6, 2024

This pull request is now in conflict :(

@Fhrozen Fhrozen modified the milestones: v.202409, v.202412 Oct 1, 2024
@Fhrozen Fhrozen modified the milestones: v.202412, v.202503 Dec 4, 2024
@mergify mergify bot removed the conflicts label Mar 18, 2025
@mergify
Contributor

mergify bot commented Mar 18, 2025

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Mar 18, 2025
@Fhrozen Fhrozen modified the milestones: v.202503, v.202506 Mar 27, 2025
@mergify mergify bot removed the conflicts label Jun 13, 2025
@mergify
Contributor

mergify bot commented Jun 13, 2025

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Jun 13, 2025
@Fhrozen Fhrozen modified the milestones: v.202506, v.202509 Aug 11, 2025
@Fhrozen Fhrozen modified the milestones: v.202509, v.202512 Sep 12, 2025
@github-actions

This PR is stale because it has been open for 90 days with no activity.
It will be closed if no further activity occurs.
Thank you for your contributions.

@github-actions github-actions bot added the Stale For probot label Dec 11, 2025
@github-actions

This PR is closed. Please re-open if needed.

@github-actions github-actions bot closed this Dec 19, 2025
@mergify mergify bot removed the conflicts label Dec 19, 2025
5 participants