This software project accompanies the research paper *PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model*, published at NeurIPS 2023.
For further information, you can refer to our research highlight post on the latent language diffusion model.
- PLANNER is a latent text diffusion model that effectively generates text by utilizing both latent semantic diffusion and autoregressive generation.
- This is accomplished by integrating an autoregressive decoder for "decoding" with a latent diffusion module for "planning", producing paragraphs in a coarse-to-fine manner (see the toy sketch below).
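The coarse-to-fine flow can be sketched with toy modules. Everything below (module names, shapes, the fixed number of reverse steps) is illustrative only and is not the repository's actual implementation:

```python
# Toy illustration of PLANNER-style coarse-to-fine generation:
# a diffusion "planner" denoises a short sequence of latent codes,
# which an autoregressive decoder then expands into tokens.
import torch
import torch.nn as nn

LATENT_LEN, LATENT_DIM, VOCAB, MAX_TOKENS = 16, 64, 100, 32

class LatentPlanner(nn.Module):
    """One illustrative reverse step: map a noisy latent plan to a cleaner one.
    A real reverse diffusion step would also condition on the timestep/noise schedule."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.GELU(),
                                 nn.Linear(128, LATENT_DIM))
    def forward(self, z_noisy):
        return self.net(z_noisy)

class ARDecoder(nn.Module):
    """Greedy autoregressive decoder conditioned on the latent plan."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, LATENT_DIM)
        self.rnn = nn.GRU(LATENT_DIM, LATENT_DIM, batch_first=True)
        self.lm_head = nn.Linear(LATENT_DIM, VOCAB)
    def forward(self, z_plan):
        # Use the mean latent code as the initial hidden state; decode greedily.
        h = z_plan.mean(dim=1, keepdim=True).transpose(0, 1).contiguous()
        tok = torch.zeros(z_plan.size(0), 1, dtype=torch.long)  # token 0 acts as BOS
        out = []
        for _ in range(MAX_TOKENS):
            emb = self.embed(tok[:, -1:])
            step, h = self.rnn(emb, h)
            tok_next = self.lm_head(step).argmax(-1)
            out.append(tok_next)
            tok = torch.cat([tok, tok_next], dim=1)
        return torch.cat(out, dim=1)

planner, decoder = LatentPlanner(), ARDecoder()
z = torch.randn(1, LATENT_LEN, LATENT_DIM)  # start from pure noise
for _ in range(10):                         # a few reverse "planning" steps
    z = planner(z)                          # refine the latent paragraph plan
tokens = decoder(z)                         # "decoding": expand the plan into tokens
print(tokens.shape)                         # torch.Size([1, 32])
```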
Please follow the setup command below.
```bash
bash setup.sh
```

Next, preprocess the data. This step tokenizes the dataset in a specified folder that contains `.json` files, and saves it into a folder `parsed_raw_pre` containing three `.pt` files for train, dev, and test.
```bash
python text_autoencoder/prepro.py --corpus data-bin/dummy_data
```
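If you want to peek at the preprocessed output, a minimal sketch is below. The path and split file names are assumptions (a `parsed_raw_pre` folder inside the corpus directory with `train`/`dev`/`test` files), and the structure of the saved object depends on `prepro.py`:

```python
# Illustrative inspection of a preprocessed split; path and file name are assumed.
# Newer PyTorch versions may require torch.load(..., weights_only=False) for pickled objects.
import torch

split_path = "data-bin/dummy_data/parsed_raw_pre/train.pt"  # assumed location
data = torch.load(split_path)
print(type(data))
if hasattr(data, "__len__"):
    print(f"{len(data)} items in {split_path}")
```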
See the example below for training a variational paragraph embedder.

```bash
bash ./bash/ae/run_ae.sh
```

The main arguments are:

- `--seed`: Seed for random number generation.
- `--lr`, `--enc_lr`, `--dec_lr`: Initial learning rates for the overall model, encoder, and decoder, respectively.
- `--epochs`: Number of training epochs.
- `--batch_size`: Batch size for training.
- `--valid_size`: Size of the validation set.
- `--lr_decay_interval`: Interval (in epochs) for learning rate decay.
- `--dropout`: Dropout ratio to prevent overfitting.
- `--gradient_accumulation_steps`: Number of steps for gradient accumulation.
- `--enc_model`: Encoder model to be used (`bert-large-uncased`, `google/flan-t5-xl`, etc.).
- `--dec_model`: Decoder model (`gpt2-medium`, `gpt2-large`, etc.).
- `--latent_size`: Size of the latent variable.
- `--n_head`: Size of the attention head.
- `--num_layer`: Number of layers in the model.
- `--save_dir`: Directory path where model snapshots are saved.
- `--train_pt_dir`, `--dev_pt_dir`: Paths for the training and development data.
- `--resume_ckpt`: Path to resume training from a specific checkpoint.
- `--exp_name`: Name of the experiment.
- `--out_layer`: Last-layer choice for deconvolution (`pred_token`, `pred_emb`, `lm_head`).
- `--reg_layer`: Regularization layer (`bn`, `ln`, `none`).
- `--embed_dim`: Number of embedding dimensions.
- `--filter_size`: Filter size for convolution.
- `--filter_shape`: Shape of the filter to use for convolution.
- `--tau`: Temperature parameter for training.
- `--noiser`, `--noiser_ratio`: Noise type and ratio for data corruption (see the sketch after this list).
- `--h_noiser`, `--h_noiser_ratio`: Hidden noise type and ratio.
- `--world_size`: Total number of distributed processes.
- `--gpus`: Number of GPUs to use.
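To make the corruption flags concrete, here is a generic denoising-autoencoder-style sketch of what a noise type plus ratio can mean: one function for token-level corruption (`--noiser`, `--noiser_ratio`) and one for noise on the hidden representation (`--h_noiser`, `--h_noiser_ratio`). These functions are illustrative and are not the noisers implemented in this repository:

```python
# Generic corruption sketch: mask a fraction of tokens, and mix hidden states with Gaussian noise.
import torch

def corrupt_tokens(token_ids: torch.Tensor, ratio: float, mask_id: int = 0) -> torch.Tensor:
    """Randomly replace a fraction `ratio` of token ids with `mask_id`."""
    noisy = token_ids.clone()
    drop = torch.rand(token_ids.shape) < ratio
    noisy[drop] = mask_id
    return noisy

def corrupt_hidden(h: torch.Tensor, ratio: float) -> torch.Tensor:
    """Mix hidden states with Gaussian noise, weighted by `ratio`."""
    return (1.0 - ratio) * h + ratio * torch.randn_like(h)

tokens = torch.randint(1, 30000, (2, 16))       # toy batch of token ids
hidden = torch.randn(2, 16, 64)                 # toy encoder hidden states
print(corrupt_tokens(tokens, ratio=0.15)[0])
print(corrupt_hidden(hidden, ratio=0.1).shape)  # torch.Size([2, 16, 64])
```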
Again, `.pt` files for train, dev, and test need to be created, this time for a (source, target) dataset. For example, to create a dataset for summarization, first join each document file with its summary file into a single tab-separated `.txt` file:
```bash
cd data-bin/dummy_sum_data
for file in *.document; do
base=$(basename "$file" .document)
if [[ -e "$base.summary" ]]; then
paste "$base.document" "$base.summary" > "$base.txt"
fi
done
cd -
```

Then run the following command:
```bash
python text_autoencoder/prepro_ground.py --corpus ./data-bin/dummy_sum_data/
```

This will create three folders (`train`, `dev`, `test`) under `data-bin/dummy_sum_data/parsed_raw_pre`.
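If preprocessing produces unexpected results, one quick check is that every intermediate `.txt` file built by the shell loop above really holds tab-separated (document, summary) pairs. A minimal sketch, assuming the dummy directory used above:

```python
# Sanity-check the paired .txt files created by the shell loop above:
# every line should contain exactly one tab, separating document from summary.
from pathlib import Path

data_dir = Path("data-bin/dummy_sum_data")   # the example directory used above
for txt in sorted(data_dir.glob("*.txt")):
    with txt.open() as f:
        for i, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                print(f"{txt.name}:{i} has {len(fields)} tab-separated fields")
```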
To train the latent diffusion model and then run the conditional generation pipeline, use:

```bash
bash ./bash/diffusion/run_diffusion.sh
bash ./bash/diffusion/pipeline_cond_gen.sh
```

Please consider citing our work if it is helpful to your research.
```bibtex
@inproceedings{zhang2023planner,
  title={PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model},
  author={Zhang, Yizhe and Gu, Jiatao and Wu, Zhuofeng and Zhai, Shuangfei and Susskind, Josh and Jaitly, Navdeep},
  booktitle={NeurIPS},
  year={2023}
}
```
**PLANNER** poster for NeurIPS 2023.