This repository provides the datasets and code for preprocessing, training, and testing models for Iterative Text Revision, along with the official Hugging Face implementation of the following paper:
Understanding Iterative Revision from Human-Written Text
Wanyu Du, Vipul Raheja, Dhruv Kumar, Zae Myung Kim, Melissa Lopez and Dongyeop Kang
ACL 2022
It is mainly based on `transformers`.
The following command installs all necessary packages:
```bash
pip install -r requirements.txt
```
The project was tested using Python 3.7.
We uploaded both our datasets and model checkpoints to the Hugging Face Hub. You can load our data directly with `datasets` and our models with `transformers`:
```python
# load our dataset
from datasets import load_dataset
dataset = load_dataset("wanyu/IteraTeR_human_sent")

# load our model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("wanyu/IteraTeR-PEGASUS-Revision-Generator")
model = AutoModelForSeq2SeqLM.from_pretrained("wanyu/IteraTeR-PEGASUS-Revision-Generator")
```
You can change the following data and model specifications:
- "wanyu/IteraTeR_human_sent": sentence-level IteraTeR-HUMAN dataset;
- "wanyu/IteraTeR_human_doc": document-level IteraTeR-HUMAN dataset;
- "wanyu/IteraTeR_full_sent": sentence-level IteraTeR-FULL dataset;
- "wanyu/IteraTeR_full_doc": document-level IteraTeR-FULL dataset;
- "wanyu/IteraTeR-PEGASUS-Revision-Generator": PEGASUS model fine-tuned on the sentence-level IteraTeR-FULL dataset (see the usage example above);
- "wanyu/IteraTeR-BART-Revision-Generator": BART model fine-tuned on the sentence-level IteraTeR-FULL dataset.
We also provide demo code showing how to use them for iterative text revision.
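Since the checkpoints are standard seq2seq models, one revision step is a tokenize → generate → decode round trip, and iterative revision simply feeds each output back in. The sketch below is a minimal illustration, not the official demo: the `<intent>`-tag input format, the `revise`/`demo` helper names, and the generation settings (`num_beams`, `max_length`) are assumptions — check the model card for the exact input format the checkpoints expect.

```python
def format_input(intent: str, sentence: str) -> str:
    """Hypothetical input format: prepend an edit-intention tag,
    e.g. "<fluency> I likes coffee." -- verify against the model card."""
    return f"<{intent}> {sentence}"

def revise(model, tokenizer, sentence: str, intent: str = "fluency") -> str:
    """One revision step with a seq2seq checkpoint."""
    inputs = tokenizer(format_input(intent, sentence), return_tensors="pt")
    outputs = model.generate(**inputs, num_beams=4, max_length=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def demo(draft: str, max_rounds: int = 3) -> str:
    """Iterative revision: feed each output back in until it stops changing.
    Downloads the checkpoint on first call."""
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    name = "wanyu/IteraTeR-PEGASUS-Revision-Generator"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    for _ in range(max_rounds):
        revised = revise(model, tokenizer, draft)
        if revised == draft:  # converged: the model has no further edits
            break
        draft = revised
    return draft
```

Calling `demo("I likes coffee very much.")` would run up to three revision rounds, stopping early once the model returns its input unchanged.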
You can load our dataset using Hugging Face's `datasets` library, and you can also download the raw data under datasets/.
We split the IteraTeR dataset as follows:
Dataset | Train (doc-level) | Dev (doc-level) | Test (doc-level) | Train (sent-level) | Dev (sent-level) | Test (sent-level) |
---|---|---|---|---|---|---|
IteraTeR-FULL | 29848 | 856 | 927 | 157579 | 19705 | 19703 |
IteraTeR-HUMAN | 481 | 27 | 51 | 3254 | 400 | 364 |
All data and detailed description for the data structure can be found under datasets/.
Code for collecting the revision history data can be found under code/crawler/.
Model | Dataset | Edit-Intention | Precision | Recall | F1 |
---|---|---|---|---|---|
RoBERTa | IteraTeR-HUMAN | Clarity | 0.75 | 0.63 | 0.69 |
RoBERTa | IteraTeR-HUMAN | Fluency | 0.74 | 0.86 | 0.80 |
RoBERTa | IteraTeR-HUMAN | Coherence | 0.29 | 0.36 | 0.32 |
RoBERTa | IteraTeR-HUMAN | Style | 1.00 | 0.07 | 0.13 |
RoBERTa | IteraTeR-HUMAN | Meaning-changed | 0.44 | 0.69 | 0.53 |
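As a sanity check on the table above, the reported F1 scores are consistent (up to rounding of the reported precision and recall) with the usual harmonic mean F1 = 2PR / (P + R):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) per edit intention, from the table above
rows = {
    "Clarity":         (0.75, 0.63, 0.69),
    "Fluency":         (0.74, 0.86, 0.80),
    "Coherence":       (0.29, 0.36, 0.32),
    "Style":           (1.00, 0.07, 0.13),
    "Meaning-changed": (0.44, 0.69, 0.53),
}

for intent, (p, r, reported) in rows.items():
    # Small deviations are expected because P and R are themselves rounded.
    assert abs(f1(p, r) - reported) < 0.01, intent
```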
The code and instructions for the training and inference of the intent classifier model can be found under code/model/intent_classification/.
Model | Dataset | SARI | BLEU | ROUGE-L | Avg. |
---|---|---|---|---|---|
BART | IteraTeR-FULL | 37.28 | 77.50 | 86.14 | 66.97 |
PEGASUS | IteraTeR-FULL | 37.11 | 77.60 | 86.84 | 67.18 |
The code and instructions for the training and inference of the PEGASUS and BART models can be found under code/model/generation/.
This repository also contains the code and data of the following paper:
Read, Revise, Repeat: A System Demonstration for Human-in-the-loop Iterative Text Revision
Wanyu Du*, Zae Myung Kim*, Vipul Raheja, Dhruv Kumar, and Dongyeop Kang (*equal contribution)
First Workshop on Intelligent and Interactive Writing Assistants (ACL 2022)
The IteraTeR_v2 dataset is larger than IteraTeR, with around 24K more unique documents and 170K more edits. It is split as follows:
Dataset | Train | Dev | Test |
---|---|---|---|
IteraTeR_v2 | 292929 | 34029 | 39511 |
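A quick arithmetic check on the stated sizes, assuming each sentence-level pair counts as one edit: the IteraTeR_v2 splits total about 366K edits, roughly 170K more than the ~197K sentence-level pairs in IteraTeR-FULL, matching the claim above.

```python
# Split sizes from the two tables above
iterater_v2 = 292929 + 34029 + 39511          # train + dev + test
iterater_full_sent = 157579 + 19705 + 19703   # sentence-level IteraTeR-FULL

assert iterater_v2 == 366469
assert iterater_full_sent == 196987
# The difference is ~169.5K, i.e. "around 170K more edits"
assert abs((iterater_v2 - iterater_full_sent) - 170_000) < 1_000
```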
Human-model interaction data in R3: we also provide the human-model interaction data collected with R3 in dataset/R3_eval_data.zip.
If you find this work useful for your research, please cite our papers:
@inproceedings{du2022iterater,
title = "Understanding Iterative Revision from Human-Written Text",
author = "Du, Wanyu and Raheja, Vipul and Kumar, Dhruv and Kim, Zae Myung and Lopez, Melissa and Kang, Dongyeop",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
year = "2022",
publisher = "Association for Computational Linguistics",
}
@inproceedings{du2022r3,
title = "Read, Revise, Repeat: A System Demonstration for Human-in-the-loop Iterative Text Revision",
author = "*Du, Wanyu and *Kim, Zae Myung and Raheja, Vipul and Kumar, Dhruv and Kang, Dongyeop",
booktitle = "Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants",
year = "2022",
publisher = "Association for Computational Linguistics",
}