AVSR recipe for Easycom Dataset #5630
…ble for audio-only training and inference
Add easycom dataset
for more information, see https://pre-commit.ci
sw005320
left a comment
I added several comments.
|---|---|---|---|---|---|---|---|---|
|inference_asr_model_valid.acc.ave/test_with_LRS3|694|8886|70.4|18.6|11.0|5.0|34.6|75.4|

## Audio-only Speech Recognition Results (Audio-only) <br> exp/asr_train_avsr_avhubert_large_with_lrs3_noise_extracted_en_bpe1000
Audio-only training significantly degrades the performance.
Can you provide some reasons?
Is it due to the AV-HuBERT architecture?
The dataset is very challenging because of noise and far-field (long-distance) speech.
A previous ASR model (wav2vec 2.0) trained on 60k hours of data reaches only 87.5% WER on it (https://arxiv.org/pdf/2212.11377.pdf). Visual information therefore improves performance greatly, because it compensates for audio that is degraded by noise, overlapped speech, and long-distance voice.
The model trained with this recipe uses 1,759 hours of data for pre-training (AV-HuBERT) and 438 hours for fine-tuning. Considering the amount of data, the current performance seems reasonable.
One possible direction for improving performance is to use more audio-visual data, such as LRS2, VoxCeleb, and AVSpeech.
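For reference, the error rate in the result row quoted above can be checked with a little arithmetic. This is a minimal sketch that assumes the row follows the usual sclite-style column order (sentences, words, Corr, Sub, Del, Ins, Err, S.Err). The column names do not appear in the quoted diff, and `ScoreRow` and `approx_error_count` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoreRow:
    """One result row; the field names are assumed, not taken from the PR."""

    n_words: int
    corr: float  # % of reference words recognized correctly
    sub: float   # % substitutions
    dele: float  # % deletions
    ins: float   # % insertions

    @property
    def err(self) -> float:
        # Word error rate = substitutions + deletions + insertions.
        return round(self.sub + self.dele + self.ins, 1)

    @property
    def approx_error_count(self) -> int:
        # Approximate absolute number of word errors implied by the rate.
        return round(self.n_words * self.err / 100)


# Values copied from the test_with_LRS3 row in the review comment.
row = ScoreRow(n_words=8886, corr=70.4, sub=18.6, dele=11.0, ins=5.0)

print(row.err)                 # matches the 34.6 reported in the row
print(row.approx_error_count)  # roughly 3075 word errors out of 8886
```

Under that assumed layout, Corr, Sub, and Del sum to 100% of the reference words, and insertions are counted on top, which is why the error rate can exceed the share of words that were not correct.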
ftshijt
left a comment
Very cool extension! Many thanks for the effort.
Could you please also add an entry in egs2/README.md for the dataset?
Also, two minor comments as follows:
Codecov Report
@@ Coverage Diff @@
## master #5630 +/- ##
==========================================
+ Coverage 76.11% 76.13% +0.01%
==========================================
Files 743 743
Lines 69117 69151 +34
==========================================
+ Hits 52608 52647 +39
+ Misses 16509 16504 -5
Thanks a lot!
What?
New recipe for training an audio-visual speech recognition (AVSR) model on the Easycom dataset.
The recipe is based on the LRS3 AVSR recipe, which uses a pre-trained AV-HuBERT model (dumped features).
I added data augmentation techniques to espnet2/asr/encoder/avhubert_encoder.py.
See also
The Easycom dataset is too small to reach good performance on its own, so the recipe trains the model on both the Easycom and LRS3 datasets.