This repo is for utilizing Self-Supervised Learning (SSL) speech models in speech translation. It supports three tasks relevant to speech translation:
- T2T: machine translation, from transcription to translation.
- S2T: speech translation, from speech SSL features to translation.
- U2T: speech translation, from speech SSL discrete units to translation.
- fairseq

  ```
  git clone https://github.com/facebookresearch/fairseq.git
  cd fairseq
  git checkout 0f078de343d985e0cba6a5c1dc8a6394698c95c7
  pip install -e ./
  ```

- torch (1.12.0 recommended)
Please first set up the following configuration in `script/setup.sh`. (You can do this by copying and modifying `script/setup_example.sh`.)

- `sslst_data_root`: where the processed data is put.
- `sslst_feat_root`: where the extracted features are stored. (Large storage will be needed.)
- `sslst_output_root`: where the checkpoints are stored.
- `sslst_data_bin_root`: where the binarized text data is stored.
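A minimal sketch of what `script/setup.sh` could contain (the variable names come from this README; the paths are placeholders, and whether the assignments need `export` should be checked against `script/setup_example.sh`):

```bash
#!/bin/bash
# Sketch of script/setup.sh -- paths are placeholders, adjust to your machine.
export sslst_data_root=/path/to/data            # processed data (tsv files, units, ...)
export sslst_feat_root=/path/to/features        # extracted SSL features (needs large storage)
export sslst_output_root=/path/to/checkpoints   # training checkpoints
export sslst_data_bin_root=/path/to/data-bin    # binarized text data
```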
The speech datasets will be processed into TSV files. By default, this repo supports three common datasets, Librispeech, Libritrans, and CoVoST2, and it also allows you to add new datasets.
- Librispeech
  - Download Librispeech from the official website. We set `train-clean-100,dev-clean,test-clean` as the default splits; modify `prepare_data/Librispeech.py` to change this setting.
  - Set `$sslst_librispeech_root` in `script/setup.sh`.
  - Run `bash prepare_data/librispeech.sh`.
- Libritrans
  - Download Libritrans from this repo.
  - Set `$sslst_libritrans_root` in `script/setup.sh`.
  - Run `bash prepare_data/libritrans.sh`.
- CoVoST2
  - Download Common Voice from the official website and set `$sslst_cv_version` in `script/setup.sh`. Choose the source language based on the translation direction; note that we use De -> En as the default translation direction.
  - Clone CoVoST and set `$sslst_covost_root` and `$sslst_covost2_tsv_root` in `script/setup.sh`.
  - Run `bash prepare_data/pre_covost2.sh` and `bash prepare_data/covost2.sh`.
The data will be prepared under `$sslst_data_root/tsv` in TSV format.

To add a new dataset, use `prepare_data/Example.py` as a template; it will be detected automatically.
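The dataset-specific variables introduced above can be collected in `script/setup.sh` as well; here is a sketch with placeholder values (the exact form of `$sslst_cv_version` should be checked against `script/setup_example.sh`):

```bash
# Dataset roots -- placeholders, adjust to your download locations.
export sslst_librispeech_root=/path/to/LibriSpeech
export sslst_libritrans_root=/path/to/libritrans
export sslst_cv_version=...   # Common Voice version identifier; format per setup_example.sh
export sslst_covost_root=/path/to/covost
export sslst_covost2_tsv_root=/path/to/covost2_tsv

# After running a prepare_data script, the prepared TSVs should appear here:
ls "$sslst_data_root/tsv"
```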
For all the text data, we apply the following preprocessing steps:

- Normalize punctuation
- Remove unprintable characters
- Lowercase all characters
- Build a BPE tokenizer with size = 8000 and character coverage = 1
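For intuition, these steps roughly correspond to the standalone pipeline sketched below. This is a sketch only (the repo's `script/t2t/*.sh` scripts are authoritative); it assumes English text in a file `train.en`, Moses scripts from `$sslst_mosesdecoder_root`, and SentencePiece for the BPE step (suggested by the character-coverage parameter):

```bash
MOSES=$sslst_mosesdecoder_root/scripts/tokenizer

# Steps 1-3: normalize punctuation, drop non-printing characters, lowercase.
cat train.en \
  | perl "$MOSES/normalize-punctuation.perl" -l en \
  | perl "$MOSES/remove-non-printing-char.perl" \
  | perl "$MOSES/lowercase.perl" \
  > train.clean.en

# Step 4: train a BPE tokenizer with vocab size 8000 and full character coverage.
spm_train --input=train.clean.en --model_prefix=bpe8k \
          --vocab_size=8000 --character_coverage=1.0 --model_type=bpe
```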
To run the preprocessing:

- Clone mosesdecoder and set `$sslst_mosesdecoder_root` in `script/setup.sh`.
- Run `bash script/t2t/[DATASET].sh` to preprocess the dataset you want to use.
- Use `script/t2t/fairseq_preprocess.sh` to binarize the data:

  ```
  bash script/t2t/fairseq_preprocess.sh [DATASET] [SRC_LANG] [TGT_LANG]
  ```
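As a concrete example, for the Libritrans English-to-French direction (the dataset tag `libritrans-en-fr` is inferred from example paths later in this README, and the per-dataset script name is an assumption; check `script/t2t/` for the actual file):

```bash
bash script/t2t/libritrans.sh                                  # hypothetical [DATASET].sh name
bash script/t2t/fairseq_preprocess.sh libritrans-en-fr en fr
```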
### Speech to hidden unit
- Create the manifest:

  ```
  bash script/s2u/create_manifest_[DATASET].sh
  ```

- Clone and install Fairseq, and set `$sslst_fairseq_root` in `script/setup.sh`.

- Train the K-means model (see the worked example after this list):

  ```
  bash script/s2u/train_kmeans_simple.sh [DATASET] [KM_TAG] [SSL_MODEL] [LAYER] [N_CLUSTER] [PERCENTAGE]
  ```

  If you are not using librispeech, change `split` from `train-clean-100` to another split in `script/s2u/train_kmeans_simple.sh`. The K-means model can be found at `$sslst_data_root/kmeans_model/[SSL_MODEL]-[KM_TAG][PERCENTAGE]p-L[LAYER]-km[N_CLUSTER].bin`, e.g. `data/kmeans_model/hubert-ls0.01p-L9-km500.bin`.
- Dump SSL features and apply K-means clustering:

  ```
  bash script/s2u/apply_kmeans_simple.sh [DATASET] [SSL_MODEL] [LAYER] [N_CLUSTER] [KM_TAG]
  ```

  The results can be found at `$sslst_data_root/[DATASET]/[SPLIT].[SSL_MODEL]_l[LAYER]_[KM_TAG][N_CLUSTER]`, e.g. `data/libritrans-en-fr/dev.hubert_l9_ls0.01p500`. The dumped SSL features are stored in `$sslst_feat_root/[DATASET]/[SSL_MODELS]/[LAYER]`.
- (Optional) Reduce the hidden units:

  ```
  bash script/s2u/reduce_hidden_unit.sh [DATASET] [SUFFIX] [MODE]
  ```

  If `mode == simple`, consecutive identical units are merged (e.g. aaabb -> ab). If `mode == addN`, the run length is appended after each unit (e.g. aaabb -> a _3 b _2).

- Use `script/t2t/fairseq_preprocess.sh` to binarize the data:

  ```
  bash script/t2t/fairseq_preprocess.sh [DATASET] [SUFFIX] [TGT_LANG]
  ```
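Putting the argument patterns together, a plausible end-to-end run for HuBERT layer 9 with 500 clusters could look like the following. All concrete values are assumptions inferred from the example file names `hubert-ls0.01p-L9-km500.bin` and `dev.hubert_l9_ls0.01p500`, so verify them against the scripts:

```bash
# Train K-means on 1% of the librispeech features from HuBERT layer 9
# (KM_TAG "ls" plus PERCENTAGE "0.01" appears to yield the "ls0.01p" tag above).
bash script/s2u/train_kmeans_simple.sh librispeech ls hubert 9 500 0.01

# Quantize the libritrans features with the trained K-means model.
bash script/s2u/apply_kmeans_simple.sh libritrans-en-fr hubert 9 500 ls0.01p

# Optional reduction, then binarization with the matching [SUFFIX].
bash script/s2u/reduce_hidden_unit.sh libritrans-en-fr hubert_l9_ls0.01p500 simple
bash script/t2t/fairseq_preprocess.sh libritrans-en-fr hubert_l9_ls0.01p500 fr
```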
### Speech to text

- Dump the SSL features:

  ```
  bash script/dump_feature.sh [DATASET] [SSL_MODEL] [LAYER] seperate
  ```

  If you have already run the speech-to-hidden-unit scripts, you can simply split those features instead:

  ```
  bash script/s2t/split_feature.sh [DATASET] [SSL_MODEL] [LAYER]
  ```

- Create the Speech-to-Text task configuration:

  ```
  bash script/s2t/speech2text.sh [DATASET] [SSL_MODEL] [SSL_DIM] [LAYER] [SRC_LANG] [TGT_LANG]
  ```
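For instance, with HuBERT-base features on Libritrans (768 is the standard HuBERT-base hidden size, assumed here as [SSL_DIM]; the dataset tag is the same assumption as above):

```bash
bash script/dump_feature.sh libritrans-en-fr hubert 9 seperate
bash script/s2t/speech2text.sh libritrans-en-fr hubert 768 9 en fr
```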
For T2T, we use Fairseq's `transformer_iwslt_de_en` as the default model architecture:

```
bash script/train/translation_t2t.sh [DATASET] [SRC_LANG] [TGT_LANG]
```

For U2T, we also use Fairseq's `transformer_iwslt_de_en` as the default model architecture:

```
bash script/train/translation_u2t.sh [DATASET] [SUFFIX] [TGT_LANG]
```

For S2T, we use Fairseq's `s2t_transformer_s` as the default model architecture:

```
bash script/train/speech2text.sh [DATASET] [SRC_LANG] [TGT_LANG]
```

The details of the hyperparameters can be found in the training scripts.
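Continuing the running Libritrans example (dataset tag and unit suffix are the same assumptions as in the earlier sketches):

```bash
bash script/train/translation_t2t.sh libritrans-en-fr en fr
bash script/train/translation_u2t.sh libritrans-en-fr hubert_l9_ls0.01p500 fr
bash script/train/speech2text.sh libritrans-en-fr en fr
```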
We report SacreBLEU as the performance metric.
- Speech-to-text-to-text:

  ```
  bash script/generate/cascade_s2t2t.sh [DATASET] [SRC_LANG] [MID_LANG] [TGT_LANG]
  ```

- Unit-to-text-to-text:

  ```
  bash script/generate/cascade_u2t2t.sh [DATASET] [SRC_LANG] [MID_LANG] [TGT_LANG]
  ```

- Text-to-text:

  ```
  bash script/generate/translation_t2t.sh [DATASET] [SRC_LANG] [TGT_LANG]
  ```

- Unit-to-text:

  ```
  bash script/generate/translation_u2t.sh [DATASET] [SRC_LANG] [TGT_LANG]
  ```

- Speech-to-text:

  ```
  bash script/generate/speech2text.sh [DATASET] [SRC_LANG] [TGT_LANG]
  ```
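For example, direct and cascaded evaluation on the Libritrans direction might be invoked as follows; [MID_LANG] is assumed here to be the transcript (source) language, which should be verified against the cascade scripts:

```bash
bash script/generate/speech2text.sh libritrans-en-fr en fr
bash script/generate/cascade_s2t2t.sh libritrans-en-fr en en fr
```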
- Download `mbart.cc25` from fairseq.
- Unzip it and put the folder into `$sslst_data_root` (see the sketch after this list).
- Set the dataset and language properly in `script/finetune/mbart-t2t.sh`.
- Run the script to create the binarized dataset:

  ```
  bash script/finetune/mbart-t2t.sh
  ```
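A minimal sketch of the download-and-place step, assuming the `mbart.cc25` archive has already been fetched from fairseq's mBART page (archive and folder names may differ between releases):

```bash
# Extract the pretrained mBART checkpoint and move it under the data root.
tar -xzvf mbart.CC25.tar.gz            # archive name as distributed by fairseq; may differ
mv mbart.cc25 "$sslst_data_root/"      # folder name may differ; match the archive contents
```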
### Convert hidden units into mBART's subwords
- Create the mapping between hidden units and subwords:

  ```
  bash script/finetune/mbart-create_hidden_unit_mapping.sh [DATASET] [LANG]
  ```

- Apply the mapping and create the binarized dataset:

  ```
  bash script/finetune/mbart-u2t.sh [DATASET] [SRC_LANG] [MBART_TGT_LANG]
  ```
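Note that mBART cc25 uses suffixed language codes such as `en_XX`, `fr_XX`, and `de_DE`, so [MBART_TGT_LANG] is presumably in that form. A hypothetical invocation for the Libritrans direction (values are assumptions):

```bash
bash script/finetune/mbart-create_hidden_unit_mapping.sh libritrans-en-fr fr
bash script/finetune/mbart-u2t.sh libritrans-en-fr en fr_XX
```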
Training:

```
bash script/train/translation_t2t_mbart.sh [DATASET] [SRC_LANG] [TGT_LANG]
bash script/train/translation_u2t_mbart.sh [DATASET] [SRC_LANG] [TGT_LANG]
```

Generation:

```
bash script/generate/translation_t2t_mbart.sh [DATASET] [SRC_LANG] [TGT_LANG]
bash script/generate/translation_u2t_mbart.sh [DATASET] [SRC_LANG] [TGT_LANG]
```
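For example, fine-tuning and evaluating the unit-to-text mBART model on the running Libritrans setup (whether [TGT_LANG] here is `fr` or the mBART code `fr_XX` should be checked in the scripts):

```bash
bash script/train/translation_u2t_mbart.sh libritrans-en-fr en fr
bash script/generate/translation_u2t_mbart.sh libritrans-en-fr en fr
```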