This repo is for utilizing Self-Supervised Learning (SSL) speech models in speech translation. It supports three tasks relevant to speech translation:
- T2T: machine translation, from transcription to translation.
- S2T: speech translation, from speech SSL features to translation.
- U2T: speech translation, from speech SSL discrete units to translation.
- fairseq

  ```
  git clone https://github.com/facebookresearch/fairseq.git
  cd fairseq
  git checkout 0f078de343d985e0cba6a5c1dc8a6394698c95c7
  pip install -e ./
  ```

- torch (1.12.0 recommended)
Please first set up the following configuration in `script/setup.sh`. (You can do this by copying and modifying `script/setup_example.sh`.)

- `sslst_data_root`: where the processed data is put.
- `sslst_feat_root`: where the extracted features are stored. (Large storage will be needed.)
- `sslst_output_root`: where the checkpoints are stored.
- `sslst_data_bin_root`: where the binarized text data is stored.
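A minimal sketch of what `script/setup.sh` could contain (the variable names come from this README; the paths are placeholders, and whether the assignments need `export` should be checked against `script/setup_example.sh`):

```bash
#!/bin/bash
# Sketch of script/setup.sh -- paths are placeholders, adjust to your machine.
export sslst_data_root=/path/to/data            # processed data (tsv files, units, ...)
export sslst_feat_root=/path/to/features        # extracted SSL features (needs large storage)
export sslst_output_root=/path/to/checkpoints   # training checkpoints
export sslst_data_bin_root=/path/to/data-bin    # binarized text data
```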
The speech datasets will be processed into TSV files. By default, this repo supports three common datasets, Librispeech, Libritrans, and CoVoST2, and it also allows you to add new datasets.
- Librispeech
  - Download Librispeech from the official website. We set `train-clean-100,dev-clean,test-clean` as the default splits; modify `prepare_data/Librispeech.py` to change this setting.
  - Set `$sslst_librispeech_root` in `script/setup.sh`.
  - Run `bash prepare_data/librispeech.sh`.
- Libritrans
  - Download Libritrans from this repo.
  - Set `$sslst_libritrans_root` in `script/setup.sh`.
  - Run `bash prepare_data/libritrans.sh`.
- CoVoST2
  - Download Common Voice from the official website and set `$sslst_cv_version` in `script/setup.sh`. Choose the source language based on the translation direction; note that we use De -> En as the default translation direction.
  - Clone CoVoST and set `$sslst_covost_root` and `$sslst_covost2_tsv_root` in `script/setup.sh`.
  - Run `bash prepare_data/pre_covost2.sh` and `bash prepare_data/covost2.sh`.
The data will be prepared under `$sslst_data_root/tsv` in TSV format.

To add a new dataset, use `prepare_data/Example.py` as a template; it will be detected automatically.
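The dataset-specific variables introduced above can be collected in `script/setup.sh` as well; here is a sketch with placeholder values (the exact form of `$sslst_cv_version` should be checked against `script/setup_example.sh`):

```bash
# Dataset roots -- placeholders, adjust to your download locations.
export sslst_librispeech_root=/path/to/LibriSpeech
export sslst_libritrans_root=/path/to/libritrans
export sslst_cv_version=...   # Common Voice version identifier; format per setup_example.sh
export sslst_covost_root=/path/to/covost
export sslst_covost2_tsv_root=/path/to/covost2_tsv

# After running a prepare_data script, the prepared TSVs should appear here:
ls "$sslst_data_root/tsv"
```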
For all the text data, we apply the following preprocessing steps:

- Normalize punctuation
- Remove unprintable characters
- Lowercase all characters
- Build a BPE tokenizer with size = 8000 and character coverage = 1
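For intuition, these steps roughly correspond to the standalone pipeline sketched below. This is a sketch only (the repo's `script/t2t/*.sh` scripts are authoritative); it assumes English text in a file `train.en`, Moses scripts from `$sslst_mosesdecoder_root`, and SentencePiece for the BPE step (suggested by the character-coverage parameter):

```bash
MOSES=$sslst_mosesdecoder_root/scripts/tokenizer

# Steps 1-3: normalize punctuation, drop non-printing characters, lowercase.
cat train.en \
  | perl "$MOSES/normalize-punctuation.perl" -l en \
  | perl "$MOSES/remove-non-printing-char.perl" \
  | perl "$MOSES/lowercase.perl" \
  > train.clean.en

# Step 4: train a BPE tokenizer with vocab size 8000 and full character coverage.
spm_train --input=train.clean.en --model_prefix=bpe8k \
          --vocab_size=8000 --character_coverage=1.0 --model_type=bpe
```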
To run the preprocessing:

- Clone mosesdecoder and set `$sslst_mosesdecoder_root` in `script/setup.sh`.
- Run `bash script/t2t/[DATASET].sh` to preprocess the dataset you want to use.
- Use `script/t2t/fairseq_preprocess.sh` to binarize the data:

  ```
  bash script/t2t/fairseq_preprocess.sh [DATASET] [SRC_LANG] [TGT_LANG]
  ```
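As a concrete example, for the Libritrans English-to-French direction (the dataset tag `libritrans-en-fr` is inferred from example paths later in this README, and the per-dataset script name is an assumption; check `script/t2t/` for the actual file):

```bash
bash script/t2t/libritrans.sh                                  # hypothetical [DATASET].sh name
bash script/t2t/fairseq_preprocess.sh libritrans-en-fr en fr
```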
### Speech to hidden unit
- Create the manifest:

  ```
  bash script/s2u/create_manifest_[DATASET].sh
  ```

- Clone and install Fairseq, and set `$sslst_fairseq_root` in `script/setup.sh`.

- Train the K-means model (see the worked example after this list):

  ```
  bash script/s2u/train_kmeans_simple.sh [DATASET] [KM_TAG] [SSL_MODEL] [LAYER] [N_CLUSTER] [PERCENTAGE]
  ```

  If you are not using librispeech, change `split` from `train-clean-100` to another split in `script/s2u/train_kmeans_simple.sh`. The K-means model can be found at `$sslst_data_root/kmeans_model/[SSL_MODEL]-[KM_TAG][PERCENTAGE]p-L[LAYER]-km[N_CLUSTER].bin`, e.g. `data/kmeans_model/hubert-ls0.01p-L9-km500.bin`.
- Dump SSL features and apply K-means clustering:

  ```
  bash script/s2u/apply_kmeans_simple.sh [DATASET] [SSL_MODEL] [LAYER] [N_CLUSTER] [KM_TAG]
  ```

  The results can be found at `$sslst_data_root/[DATASET]/[SPLIT].[SSL_MODEL]_l[LAYER]_[KM_TAG][N_CLUSTER]`, e.g. `data/libritrans-en-fr/dev.hubert_l9_ls0.01p500`. The dumped SSL features are stored in `$sslst_feat_root/[DATASET]/[SSL_MODELS]/[LAYER]`.
- (Optional) Reduce the hidden units:

  ```
  bash script/s2u/reduce_hidden_unit.sh [DATASET] [SUFFIX] [MODE]
  ```

  If `mode == simple`, consecutive identical units are merged (e.g. aaabb -> ab). If `mode == addN`, the run length is appended after each unit (e.g. aaabb -> a _3 b _2).

- Use `script/t2t/fairseq_preprocess.sh` to binarize the data:

  ```
  bash script/t2t/fairseq_preprocess.sh [DATASET] [SUFFIX] [TGT_LANG]
  ```
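Putting the argument patterns together, a plausible end-to-end run for HuBERT layer 9 with 500 clusters could look like the following. All concrete values are assumptions inferred from the example file names `hubert-ls0.01p-L9-km500.bin` and `dev.hubert_l9_ls0.01p500`, so verify them against the scripts:

```bash
# Train K-means on 1% of the librispeech features from HuBERT layer 9
# (KM_TAG "ls" plus PERCENTAGE "0.01" appears to yield the "ls0.01p" tag above).
bash script/s2u/train_kmeans_simple.sh librispeech ls hubert 9 500 0.01

# Quantize the libritrans features with the trained K-means model.
bash script/s2u/apply_kmeans_simple.sh libritrans-en-fr hubert 9 500 ls0.01p

# Optional reduction, then binarization with the matching [SUFFIX].
bash script/s2u/reduce_hidden_unit.sh libritrans-en-fr hubert_l9_ls0.01p500 simple
bash script/t2t/fairseq_preprocess.sh libritrans-en-fr hubert_l9_ls0.01p500 fr
```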
### Speech to text

- Dump the SSL features:

  ```
  bash script/dump_feature.sh [DATASET] [SSL_MODEL] [LAYER] seperate
  ```

  If you have already run the speech-to-hidden-unit scripts, you can simply split those features instead:

  ```
  bash script/s2t/split_feature.sh [DATASET] [SSL_MODEL] [LAYER]
  ```

- Create the Speech-to-Text task configuration:

  ```
  bash script/s2t/speech2text.sh [DATASET] [SSL_MODEL] [SSL_DIM] [LAYER] [SRC_LANG] [TGT_LANG]
  ```
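For instance, with HuBERT-base features on Libritrans (768 is the standard HuBERT-base hidden size, assumed here as [SSL_DIM]; the dataset tag is the same assumption as above):

```bash
bash script/dump_feature.sh libritrans-en-fr hubert 9 seperate
bash script/s2t/speech2text.sh libritrans-en-fr hubert 768 9 en fr
```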
For T2T, we use Fairseq's `transformer_iwslt_de_en` as the default model architecture:

```
bash script/train/translation_t2t.sh [DATASET] [SRC_LANG] [TGT_LANG]
```

For U2T, we also use Fairseq's `transformer_iwslt_de_en` as the default model architecture:

```
bash script/train/translation_u2t.sh [DATASET] [SUFFIX] [TGT_LANG]
```

For S2T, we use Fairseq's `s2t_transformer_s` as the default model architecture:

```
bash script/train/speech2text.sh [DATASET] [SRC_LANG] [TGT_LANG]
```

The details of the hyperparameters can be found in the training scripts.
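Continuing the running Libritrans example (dataset tag and unit suffix are the same assumptions as in the earlier sketches):

```bash
bash script/train/translation_t2t.sh libritrans-en-fr en fr
bash script/train/translation_u2t.sh libritrans-en-fr hubert_l9_ls0.01p500 fr
bash script/train/speech2text.sh libritrans-en-fr en fr
```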
We report SacreBLEU as the performance metric.
- Speech-to-text-to-text:

  ```
  bash script/generate/cascade_s2t2t.sh [DATASET] [SRC_LANG] [MID_LANG] [TGT_LANG]
  ```

- Unit-to-text-to-text:

  ```
  bash script/generate/cascade_u2t2t.sh [DATASET] [SRC_LANG] [MID_LANG] [TGT_LANG]
  ```

- Text-to-text:

  ```
  bash script/generate/translation_t2t.sh [DATASET] [SRC_LANG] [TGT_LANG]
  ```

- Unit-to-text:

  ```
  bash script/generate/translation_u2t.sh [DATASET] [SRC_LANG] [TGT_LANG]
  ```

- Speech-to-text:

  ```
  bash script/generate/speech2text.sh [DATASET] [SRC_LANG] [TGT_LANG]
  ```
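For example, direct and cascaded evaluation on the Libritrans direction might be invoked as follows; [MID_LANG] is assumed here to be the transcript (source) language, which should be verified against the cascade scripts:

```bash
bash script/generate/speech2text.sh libritrans-en-fr en fr
bash script/generate/cascade_s2t2t.sh libritrans-en-fr en en fr
```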
- Download `mbart.cc25` from fairseq.
- Unzip it and put the folder into `$sslst_data_root` (see the sketch after this list).
- Set the dataset and language properly in `script/finetune/mbart-t2t.sh`.
- Run the script to create the binarized dataset:

  ```
  bash script/finetune/mbart-t2t.sh
  ```
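A minimal sketch of the download-and-place step, assuming the `mbart.cc25` archive has already been fetched from fairseq's mBART page (archive and folder names may differ between releases):

```bash
# Extract the pretrained mBART checkpoint and move it under the data root.
tar -xzvf mbart.CC25.tar.gz            # archive name as distributed by fairseq; may differ
mv mbart.cc25 "$sslst_data_root/"      # folder name may differ; match the archive contents
```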
### Convert hidden units into mBART's subwords
- Create the mapping between hidden units and subwords:

  ```
  bash script/finetune/mbart-create_hidden_unit_mapping.sh [DATASET] [LANG]
  ```

- Apply the mapping and create the binarized dataset:

  ```
  bash script/finetune/mbart-u2t.sh [DATASET] [SRC_LANG] [MBART_TGT_LANG]
  ```
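Note that mBART cc25 uses suffixed language codes such as `en_XX`, `fr_XX`, and `de_DE`, so [MBART_TGT_LANG] is presumably in that form. A hypothetical invocation for the Libritrans direction (values are assumptions):

```bash
bash script/finetune/mbart-create_hidden_unit_mapping.sh libritrans-en-fr fr
bash script/finetune/mbart-u2t.sh libritrans-en-fr en fr_XX
```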
Training:

```
bash script/train/translation_t2t_mbart.sh [DATASET] [SRC_LANG] [TGT_LANG]
bash script/train/translation_u2t_mbart.sh [DATASET] [SRC_LANG] [TGT_LANG]
```

Generation:

```
bash script/generate/translation_t2t_mbart.sh [DATASET] [SRC_LANG] [TGT_LANG]
bash script/generate/translation_u2t_mbart.sh [DATASET] [SRC_LANG] [TGT_LANG]
```
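For example, fine-tuning and evaluating the unit-to-text mBART model on the running Libritrans setup (whether [TGT_LANG] here is `fr` or the mBART code `fr_XX` should be checked in the scripts):

```bash
bash script/train/translation_u2t_mbart.sh libritrans-en-fr en fr
bash script/generate/translation_u2t_mbart.sh libritrans-en-fr en fr
```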