This repository contains the official implementation of the paper *XDLM: Cross-lingual Diffusion Language Model for Machine Translation*.
The codebase is built on fairseq. To install the dependencies, run the following commands (preferably inside a virtual environment):
```bash
pip install -r requirements.txt
# install our package of discrete diffusion models
pip install -e discrete_diffusion
# install our fork of fairseq
cd fairseq
python3 setup.py build develop
cd ..
```

Note
- The environment is tested with Python 3.8.10, PyTorch 1.10.0/1.12.0, and CUDA 11.3.
- Our fork of fairseq modifies several files of the original codebase; using more recent versions of fairseq might lead to unexpected dependency conflicts.
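As an optional sanity check (assuming the two packages install under the module names `fairseq` and `discrete_diffusion`, matching the directory names above), you can verify that both are importable:

```bash
# optional: confirm both packages import from the active environment
python3 -c "import fairseq; print('fairseq', fairseq.__version__)"
python3 -c "import discrete_diffusion; print('discrete_diffusion OK')"
```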
To preprocess the OPUS dataset, we use a script to download the corresponding language pairs as follows:
```bash
# Download and tokenize parallel data in 'data/wiki/para/en-zh.{en,zh}.{train,valid,test}'
./get-data-para.sh en-zh &
```
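If the download succeeds, the tokenized splits should appear under the path shown in the comment above; a quick way to check (file names assumed from that pattern) is:

```bash
# confirm the parallel splits exist and the en/zh sides have matching line counts
wc -l data/wiki/para/en-zh.{en,zh}.{train,valid,test}
```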
For BPE, we use the tools introduced here; a script is used to learn the BPE codes and produce the processed dataset. Then binarize the data with the following command. The processed `data-bin/para` data is also available here.
```bash
TEXT=para
fairseq-preprocess --joined-dictionary \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train.en-de --validpref $TEXT/valid.en-de --testpref $TEXT/test.en-de \
    --destdir data-bin/para --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20
```
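After binarization, `data-bin/para` should contain the joined dictionaries together with the binarized splits; roughly:

```bash
# fairseq-preprocess writes one dictionary per language (identical under --joined-dictionary)
# plus .bin/.idx files for each split
ls data-bin/para
# dict.en.txt  dict.de.txt  train.en-de.en.bin  train.en-de.en.idx  ...  preprocess.log
```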
We obtain the original data source from Hugging Face here. For the fine-tuning datasets, we reuse the BPE codes learned on the OPUS data and generate the processed dataset with the applyBPE step in the script. We then process and binarize the fine-tuning dataset with the following command. The processed `data-bin/wmt14_ende` data is also available here.
```bash
TEXT=wmt14_ende/IWSLT
fairseq-preprocess --joined-dictionary \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train.en-de --validpref $TEXT/valid.en-de --testpref $TEXT/test.en-de \
    --destdir data-bin/wmt14_ende --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20
```
Before running `fairseq-preprocess`, don't forget to replace `dict.en.txt` and `dict.de.txt` in the downloaded folder with the vocabulary built in the previous stage.
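A minimal sketch of that replacement, assuming the joined vocabulary from the previous stage lives in `data-bin/para` and `<downloaded_folder>` is a placeholder for the folder holding the downloaded fine-tuning data:

```bash
# overwrite the downloaded dictionaries with the vocabulary built during pretraining
cp data-bin/para/dict.en.txt <downloaded_folder>/dict.en.txt
cp data-bin/para/dict.de.txt <downloaded_folder>/dict.de.txt
```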
As introduced in the paper, we use the reparameterized multinomial diffusion model for both pretraining and fine-tuning.
```bash
bash experiments/mt_train.sh -m reparam-multinomial -d para -s default -e True --not-diffusing-special-sym --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear
```
Before fine-tuning from the pretrained model, move the pretrained checkpoint into the checkpoint directory of the fine-tuning run and rename it to `checkpoint_last.pt`.
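A minimal sketch (both checkpoint paths are hypothetical placeholders; adjust them to where your pretraining run saved its checkpoints and where the fine-tuning script expects to find them):

```bash
# seed the fine-tuning run with the pretrained weights
mkdir -p <finetune_checkpoint_dir>
cp <pretrain_checkpoint_dir>/checkpoint_best.pt <finetune_checkpoint_dir>/checkpoint_last.pt
```

Then launch fine-tuning: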
```bash
bash experiments/mt_train_finetune.sh -m reparam-multinomial -d wmt14/IWSLT -s default -e True --not-diffusing-special-sym --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear
```
By passing `--decoding-strategy default`, the vanilla sampling scheme (specific to each discrete diffusion process) is used.

A more advanced decoding approach can be invoked by passing `--decoding-strategy reparam-<conditioning-of-v>-<topk_mode>-<schedule>`. This approach is based on the proposed reparameterization in our paper and allows for more effective decoding procedures. The options specify the decoding algorithm via

- `<conditioning-of-v>`: `uncond` or `cond` (default `uncond`): whether to generate the routing variable $v_t$ in a conditional or unconditional manner;
- `<topk_mode>`: `stochastic<float>` or `deterministic` (default `deterministic`): whether to use stochastic or deterministic top-$k$ selection. The float value in `stochastic<float>` specifies the degree of randomness in the stochastic top-$k$ selection;
- `<schedule>`: `linear` or `cosine` (default `cosine`): the schedule for $k$ during our denoising procedure, which is used to control the number of top-$k$ tokens to be denoised for the next decoding step.
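For instance, the three options compose into strategy strings such as the following (these flag values are illustrative combinations of the options above, not required settings):

```bash
# unconditional routing variable, deterministic top-k selection, cosine schedule for k
--decoding-strategy reparam-uncond-deterministic-cosine
# conditional routing variable, stochastic top-k selection with noise scale 1.0, linear schedule
--decoding-strategy reparam-cond-stochastic1.0-linear
```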
See the implementation for more details about the options.
Please see the scripts below for details.
Note
- All tasks considered in this work operate on the original data and do not adopt Knowledge Distillation (KD).
First `cd` into the fairseq folder, then run the following commands to train the models.
```bash
######## training scripts for IWSLT'14, WMT'14, and WMT'16
# first cd to fairseq
# we use 1 GPU for IWSLT'14, 4 GPUs for WMT'14, and 2 GPUs for WMT'16, respectively.
CUDA_VISIBLE_DEVICES=0 bash experiments/mt_train.sh -m absorbing -d <iwslt/wmt14/wmt16> -s default -e True --store-ema --label-smoothing 0.1
CUDA_VISIBLE_DEVICES=1 bash experiments/mt_train.sh -m multinomial -d <iwslt/wmt14/wmt16> -s default -e True --not-diffusing-special-sym --store-ema --label-smoothing 0.0
CUDA_VISIBLE_DEVICES=2 bash experiments/mt_train.sh -m reparam-absorbing -d <iwslt/wmt14/wmt16> -s default -e True --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear
CUDA_VISIBLE_DEVICES=3 bash experiments/mt_train.sh -m reparam-multinomial -d <iwslt/wmt14/wmt16> -s default -e True --not-diffusing-special-sym --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear
```

Note
- `-s <str>` is used to specify the name of the experiment.
- Custom arguments specific to training can be passed by appending them after `-e True`.
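For example, to give a run a custom name and forward an extra training option (the `--seed` flag below is just an illustration of the mechanism; any training-specific argument could go there):

```bash
# everything after `-e True` is appended to the training invocation
CUDA_VISIBLE_DEVICES=0 bash experiments/mt_train.sh -m reparam-absorbing -d iwslt -s my_run -e True \
    --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear --seed 1
```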
The evaluation pipeline is handled by `experiments/mt_generate.sh`. The script generates the translation results and evaluates the BLEU score.
```bash
########### IWSLT'14 and WMT'14 datasets
# we recommend putting each checkpoint into a separate folder,
# since the script writes the decoded results into a file under the same folder as each checkpoint.
CUDA_VISIBLE_DEVICES=0 bash experiments/mt_generate.sh -a false -c <checkpoint_path> -d <iwslt/wmt14>
```

Arguments:
- `-a`: whether to average multiple checkpoints.
- `-c`: the location of the checkpoint. If `-a false` (do not average checkpoints), pass the checkpoint path; if `-a true`, pass the directory that stores multiple checkpoints at different training steps for averaging.
- `-d`: the dataset name.
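For example, to average checkpoints saved at different training steps before decoding (the directory name is a hypothetical placeholder):

```bash
# -a true: the script averages the checkpoints found under the given directory, then decodes
CUDA_VISIBLE_DEVICES=0 bash experiments/mt_generate.sh -a true -c <dir_with_checkpoints> -d iwslt
```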
We also provide the checkpoints of our trained models.
| Dataset | Model | Checkpoint link |
|---|---|---|
| Opus | Pretrain | link |
| IWSLT'14 | Finetune | link |
| WMT'14 | Finetune | link |
```bibtex
@misc{chen2023xdlm,
      title={XDLM: Cross-lingual Diffusion Language Model for Machine Translation},
      author={Linyao Chen and Aosong Feng and Boming Yang and Zihui Li},
      year={2023},
      eprint={2307.13560},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```