中文語音辨識 (Chinese Speech Recognition)
End-to-end speech recognition on the AISHELL dataset using PyTorch.
The entire system is an attention-based sequence-to-sequence model [1]. The encoder is a bidirectional GRU network with BatchNorm, and the decoder is another GRU network that applies Luong-style attention [3].
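For orientation, here is a minimal PyTorch sketch of this kind of encoder and attention module (layer counts, hidden sizes, and the 240-dimensional stacked input are illustrative assumptions, not the repo's exact implementation):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Bidirectional GRU encoder with BatchNorm over the feature dimension.
    def __init__(self, input_dim=240, hidden_dim=256):
        super().__init__()
        self.bn = nn.BatchNorm1d(input_dim)
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, time, input_dim)
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.gru(x)                   # out: (batch, time, 2 * hidden_dim)
        return out

class LuongAttention(nn.Module):
    # "General" (bilinear) score from Luong et al. [3].
    def __init__(self, dec_dim, enc_dim):
        super().__init__()
        self.proj = nn.Linear(dec_dim, enc_dim, bias=False)

    def forward(self, dec_state, enc_out):     # dec_state: (batch, dec_dim)
        scores = torch.bmm(enc_out, self.proj(dec_state).unsqueeze(2))  # (batch, time, 1)
        weights = torch.softmax(scores.squeeze(2), dim=1)               # (batch, time)
        context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)   # (batch, enc_dim)
        return context, weights
```

In a Luong-style decoder, each step consumes the previous character embedding, and the attention context is combined with the decoder hidden state to predict the next character.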
The acoustic features are 80-dimensional filter banks. We apply SpecAugment [4] to these features to improve generalization. The features are also stacked every 3 consecutive frames, which reduces the time resolution by a factor of 3.
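A rough sketch of that pipeline, using torchaudio for the filter banks (the exact parameters and the repo's own feature code may differ):

```python
import torchaudio

def extract_features(wav_path, stack=3):
    # 80-dimensional (log) Mel filter banks, computed Kaldi-style.
    wav, sr = torchaudio.load(wav_path)
    fbank = torchaudio.compliance.kaldi.fbank(wav, num_mel_bins=80, sample_frequency=sr)
    # SpecAugment (training only) would mask random time/frequency bands here,
    # e.g. with torchaudio.transforms.TimeMasking / FrequencyMasking on a (freq, time) view.
    # Stack every 3 consecutive frames -> 240-dim vectors at 1/3 the original frame rate.
    t = (fbank.size(0) // stack) * stack
    return fbank[:t].reshape(-1, stack * fbank.size(1))
```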
With this code you can achieve a character error rate (CER) of about 10% on the test set after 100 epochs.
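CER is computed as the character-level edit distance between the hypothesis and the reference, normalized by the reference length; a minimal sketch:

```python
def cer(hyp, ref):
    # Levenshtein distance between character sequences, divided by the reference length.
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
    return d[len(ref)] / max(len(ref), 1)

print(cer("你电池市场也在向好", "锂电池市场也在向好"))  # 1/9 ≈ 0.11, matching the example below
```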
Install the dependencies:

$ pip install -r requirements.txt

- Download the AISHELL dataset (data_aishell.tgz) from http://www.openslr.org/33/.
- Extract data_aishell.tgz:
$ python extract_aishell.py ${PATH_TO_FILE}

- Create lists (*.csv) of audio file paths along with their transcripts:

$ python prepare_data.py ${DIRECTORY_OF_AISHELL}
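The generated lists pair each audio path with its transcript; the exact column layout depends on prepare_data.py, but they can be inspected with something like the following (the file name and column names here are hypothetical):

```python
import csv

with open("train.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row)  # e.g. {'audio': '.../xxx.wav', 'transcript': '...'}
        break
```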
Check the available training options:

$ python train.py -h

Use the default configuration for training:
$ python train.py exp/default.yaml
You can also write your own configuration file based on exp/default.yaml and pass it to train.py:

$ python train.py ${PATH_TO_YOUR_CONFIG}
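The configuration is a plain YAML file, so a copy of exp/default.yaml can be edited and reloaded as a nested dict (the field names inside are whatever default.yaml defines, not shown here):

```python
import yaml

# Load an experiment configuration; the resulting dict mirrors the YAML structure.
with open("exp/default.yaml") as f:
    cfg = yaml.safe_load(f)
print(cfg)
```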
With the default configuration, the training logs are stored in exp/default/history.csv; if you train with your own configuration, substitute the path to its log file accordingly. To plot the training curves, run:
$ python show_history.py exp/default/history.csv
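show_history.py plots the logged metrics; a rough stand-in, assuming history.csv holds one row per epoch with whatever column names the training script writes:

```python
import pandas as pd
import matplotlib.pyplot as plt

history = pd.read_csv("exp/default/history.csv")
history.plot(x=history.columns[0])  # plot every logged metric against the first (epoch) column
plt.show()
```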
During training, the program keeps monitoring the error rate on the development set. The checkpoint with the lowest error rate will be saved in the logging directory (by default exp/default/best.pth).
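The best-checkpoint logic amounts to something like the sketch below (run_epoch, evaluate, and the loaders are hypothetical stand-ins for the repo's training loop):

```python
import torch

def train(model, train_loader, dev_loader, num_epochs, run_epoch, evaluate,
          ckpt_path="exp/default/best.pth"):
    best_cer = float("inf")
    for epoch in range(num_epochs):
        run_epoch(model, train_loader)
        dev_cer = evaluate(model, dev_loader)    # error rate on the development set
        if dev_cer < best_cer:                   # new best -> overwrite the saved checkpoint
            best_cer = dev_cer
            torch.save(model.state_dict(), ckpt_path)
    return best_cer
```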
To evaluate the checkpoint on the test set (with a beam width of 5), run:

$ python eval.py exp/default/best.pth --beams 5

Or you can run inference on random audio from the test set and visualize the attention weights:
$ python inference.py exp/default/best.pth --beams 5
Prediction:
你 电 池 市 场 也 在 向 好
Ground-truth:
锂 电 池 市 场 也 在 向 好

Features:
- Beam Search
- Restore checkpoint and resume previous training
- SpecAugment
- LM Rescoring
- Label Smoothing (see the sketch after this list)
- Polyak Averaging
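Label smoothing, for example, mixes the one-hot target with a uniform distribution before computing the cross-entropy; a minimal sketch (the smoothing factor of 0.1 is an assumption):

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, smoothing=0.1):
    # logits: (batch, vocab), targets: (batch,) integer class indices.
    log_probs = F.log_softmax(logits, dim=-1)
    n_classes = logits.size(-1)
    true_dist = torch.full_like(log_probs, smoothing / (n_classes - 1))
    true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    return torch.mean(torch.sum(-true_dist * log_probs, dim=-1))
```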
References:

[1] W. Chan et al., "Listen, Attend and Spell", https://arxiv.org/abs/1508.01211
[2] J. Chorowski et al., "Attention-Based Models for Speech Recognition", https://arxiv.org/abs/1506.07503
[3] M. Luong et al., "Effective Approaches to Attention-based Neural Machine Translation", https://arxiv.org/abs/1508.04025
[4] D. Park et al., "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition", https://arxiv.org/abs/1904.08779

