
Support arbitrary language finetune for Whisper models.#5344

Merged
sw005320 merged 9 commits into espnet:master from pengchengguo:whisper
Aug 3, 2023

Conversation

@pengchengguo
Collaborator

  1. For the `asr.sh` script, use `--lang` directly as the language ID when exporting the Whisper vocabulary.
  2. For the training procedure, add an additional `tokenizer_language` option for the preprocessor in the config files, for example:

     ```yaml
     preprocessor: default
     preprocessor_conf:
         tokenizer_language: "zh"
     ```

  3. Add the fine-tuning results on the Aishell corpus. Compared with other methods, fine-tuning Whisper achieves the best results.
[image: Aishell fine-tuning results table]
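The configuration flow above can be sketched in plain Python. This is a minimal sketch; the class and function names below are illustrative stand-ins, not the actual ESPnet preprocessor API:

```python
# Hypothetical sketch of how a preprocessor_conf entry like the one above
# could be threaded down to a Whisper tokenizer wrapper.
preprocessor_conf = {"tokenizer_language": "zh"}


class WhisperTokenizerStub:
    """Stand-in for the real Whisper tokenizer wrapper."""

    def __init__(self, language: str):
        self.language = language


def build_preprocessor_tokenizer(conf: dict) -> WhisperTokenizerStub:
    # Fall back to English when no tokenizer_language is configured.
    language = conf.get("tokenizer_language", "en")
    return WhisperTokenizerStub(language=language)


tok = build_preprocessor_tokenizer(preprocessor_conf)
print(tok.language)  # → zh
```

The point of the extra option is only to carry the language choice from the YAML config into the tokenizer construction step.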

@sw005320 sw005320 requested a review from simpleoier July 22, 2023 14:37
@sw005320 sw005320 added this to the v.202307 milestone Jul 22, 2023
@sw005320 sw005320 added New Features ASR Automatic speech recognition labels Jul 22, 2023
Contributor

@sw005320 sw005320 left a comment


LGTM.
Please also add some tests.


- ASR config: [conf/tuning/train_asr_whisper_medium_finetune.yaml](conf/tuning/train_asr_whisper_medium_finetune.yaml)
- #Params: 762.32 M
- Model link:
Contributor

As we discussed, please upload a model.

Collaborator

@simpleoier simpleoier left a comment

Thanks!
I only have one concern about unseen lang_id used in whisper.


## Results

- ASR config: [conf/tuning/train_asr_whisper_medium_finetune.yaml](conf/tuning/train_asr_whisper_medium_finetune.yaml)
Collaborator

Is decode config needed here?

Collaborator Author

Thanks, it should be included.

```bash
fi

_opts=""
if [ "${token_type}" = "whisper_multilingual" ]; then
```
Collaborator

Would default lang=noinfo work here?

Collaborator Author

@pengchengguo pengchengguo Jul 24, 2023

I added a LANGUAGES_CODE_MAPPING to map the language codes of ESPnet to the language IDs of Whisper and to make sure the input language code is supported by the Whisper models.
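The idea can be sketched as follows. The entries and the helper function below are illustrative only; the real table in espnet2/text/whisper_tokenizer.py is larger and its entries may differ:

```python
# Illustrative sketch of a LANGUAGES_CODE_MAPPING from ESPnet language
# codes to Whisper language IDs (not the actual table).
LANGUAGES_CODE_MAPPING = {
    "zh": "chinese",
    "jp": "japanese",
    "en": "english",
}


def map_language(lang: str) -> str:
    """Map an ESPnet language code to a Whisper language ID,
    rejecting codes the Whisper models do not support."""
    whisper_lang = LANGUAGES_CODE_MAPPING.get(lang)
    if whisper_lang is None:
        raise ValueError(f"language: {lang} unsupported for Whisper model")
    return whisper_lang


print(map_language("zh"))  # → chinese
```

Unsupported codes fail fast with a ValueError instead of silently falling back to a wrong language token.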

```diff
 else:
-    converter = OpenAIWhisperTokenIDConverter(model_type=bpemodel)
+    converter = OpenAIWhisperTokenIDConverter(
+        model_type=bpemodel, language=tokenizer_language
```
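The intent of this diff can be sketched with a stub; `ConverterStub` and `build_converter` below are hypothetical stand-ins for the real OpenAIWhisperTokenIDConverter, which requires the whisper package:

```python
class ConverterStub:
    """Stand-in for OpenAIWhisperTokenIDConverter."""

    def __init__(self, model_type, language=None):
        self.model_type = model_type
        self.language = language


def build_converter(bpemodel, tokenizer_language=None):
    # Before the change, the language argument was never passed, so the
    # converter always used its default language token; after the change,
    # tokenizer_language selects the Whisper language token explicitly.
    if tokenizer_language is None:
        return ConverterStub(model_type=bpemodel)
    return ConverterStub(model_type=bpemodel, language=tokenizer_language)


conv = build_converter("whisper_multilingual", tokenizer_language="zh")
print(conv.language)  # → zh
```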
Collaborator

Can we specify any language ID here, or only the languages supported by the Whisper model?

@mergify
Contributor

mergify bot commented Jul 24, 2023

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Jul 24, 2023
@pengchengguo
Collaborator Author

pengchengguo commented Jul 24, 2023

I have made several updates as discussed:

  1. Included the decode config, as @simpleoier mentioned (see asr1/README.md).
  2. Updated the HF model link and noted that the model size is very large (see asr1/README.md).
  3. Added a starter recipe for Whisper fine-tuning (see asr1/run_whisper_finetune.sh).
  4. Added arbitrary-language evaluation with the original Whisper models. The language check is excluded in this part because it is already done in the Whisper code (see pyscripts/utils/evaluate_whisper_inference.py and scripts/evaluate_asr.sh).
  5. Added a LANGUAGES_CODE_MAPPING to map the language codes of ESPnet to the language IDs of Whisper. For the languages included in ESPnet, I tried to find as many mappings as possible, and we can maintain the mapping dictionary in the future (see espnet2/text/whisper_tokenizer.py).
  6. Checked whether the Whisper model supports the input language code and, if not, raised a ValueError (see espnet2/bin/whisper_export_vocabulary.py, espnet2/text/whisper_token_id_converter.py, and espnet2/text/whisper_tokenizer.py).
  7. Fixed CI test errors (see espnet2/bin/asr_inference.py).

> Thanks! I only have one concern about unseen lang_id used in whisper.

Currently, for unseen language IDs that the Whisper models do not support, we will raise a ValueError.

@mergify mergify bot removed the conflicts label Jul 24, 2023
@codecov

codecov bot commented Jul 24, 2023

Codecov Report

Merging #5344 (75e844d) into master (4847b5f) will decrease coverage by 6.33%.
The diff coverage is 82.60%.

```diff
@@            Coverage Diff             @@
##           master    #5344      +/-   ##
==========================================
- Coverage   76.11%   69.79%   -6.33%
==========================================
  Files         672      671       -1
  Lines       59864    59793      -71
==========================================
- Hits        45567    41733    -3834
- Misses      14297    18060    +3763
```

| Flag | Coverage Δ |
| --- | --- |
| test_configuration_espnet2 | ∅ <ø> (∅) |
| test_integration_espnet1 | 65.93% <ø> (ø) |
| test_integration_espnet2 | 47.92% <31.25%> (-0.01%) ⬇️ |
| test_python_espnet1 | ? |
| test_python_espnet2 | 51.36% <82.60%> (+<0.01%) ⬆️ |
| test_utils | 23.17% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| espnet2/text/build_tokenizer.py | 78.37% <0.00%> (ø) |
| espnet2/train/preprocessor.py | 44.83% <ø> (ø) |
| espnet2/bin/asr_inference.py | 87.46% <40.00%> (-0.45%) ⬇️ |
| espnet2/bin/whisper_export_vocabulary.py | 91.83% <100.00%> (+0.92%) ⬆️ |
| espnet2/text/whisper_token_id_converter.py | 85.18% <100.00%> (+2.57%) ⬆️ |
| espnet2/text/whisper_tokenizer.py | 85.71% <100.00%> (+2.38%) ⬆️ |

... and 113 files with indirect coverage changes


```diff
 ${python} -m espnet2.bin.whisper_export_vocabulary \
     --whisper_model "${token_type}" \
-    --output "${token_list}"
+    --output "${token_list}" ${_opts}
```
Collaborator

I think it would be useful if the script could exit when ${lang} is not recognized. Is this already satisfied? Or could it be done just by adding || exit 1 after this line? I'm not sure whether the Python script satisfies the condition of returning a non-zero status.

Collaborator Author

The shell script directly terminates when whisper_export_vocabulary raises "ValueError: language unsupported for Whisper model", so adding || exit 1 or not does not affect the shell script's behavior.
[image: screenshot of the shell output]
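For reference, a Python process that dies with an uncaught exception always exits with a non-zero status, which is what lets the shell stop here. A minimal demonstration, independent of ESPnet:

```python
import subprocess
import sys

# An uncaught exception makes CPython exit with status 1, so a shell
# running under `set -e` (or a command followed by `|| exit 1`) stops
# at that point.
proc = subprocess.run(
    [sys.executable, "-c",
     "raise ValueError('language unsupported for Whisper model')"],
    capture_output=True,
    text=True,
)
print(proc.returncode)               # → 1
print("ValueError" in proc.stderr)   # → True
```

So the explicit `|| exit 1` is redundant as long as the ValueError is not caught inside the Python script.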

@kan-bayashi kan-bayashi modified the milestones: v.202307, v.202312 Aug 3, 2023
Collaborator

@simpleoier simpleoier left a comment

LGTM! Thanks!

@sw005320 sw005320 merged commit 093a315 into espnet:master Aug 3, 2023
@sw005320
Contributor

sw005320 commented Aug 3, 2023

Thanks, @pengchengguo!

4 participants