This repository contains the code of our paper:
Improving Relation Extraction by Pre-trained Language Representations.
Christoph Alt*, Marc Hübner*, Leonhard Hennig
We fine-tune the pre-trained OpenAI GPT [1] to the task of relation extraction and show that it achieves state-of-the-art results on SemEval 2010 Task 8 and TACRED relation extraction datasets.
Our code depends on huggingface's PyTorch reimplementation of the OpenAI GPT [2] - so thanks to them.
First, clone the repository to your machine and install the requirements with the following command:
pip install -r requirements.txtWe also need the weights of the pre-trained Transformer, which can be downloaded with the following command:
./download-model.sh
The english spacy model is required for sentence segmentation:
python -m spacy download en
We evaluate our model on SemEval 2010 Task 8 and TACRED, which is available through LDC.
Our model expects the input dataset to be in JSONL format. To convert a dataset run the following command:
python dataset_converter.py <DATASET DIR> <CONVERTED DATASET DIR> --dataset=<DATASET NAME>E.g. for training on the TACRED dataset, run the following command:
CUDA_VISIBLE_DEVICES=0 python relation_extraction.py train \
  --write-model True \
  --masking-mode grammar_and_ner \
  --batch-size 8 \
  --max-epochs 3 \
  --lm-coef 0.5 \
  --learning-rate 5.25e-5 \
  --learning-rate-warmup 0.002 \
  --clf-pdrop 0.1 \
  --attn-pdrop 0.1 \
  --word-pdrop 0.0 \
  --dataset tacred \
  --data-dir <CONVERTED DATASET DIR> \
  --seed=0 \
  --log-dir ./logs/CUDA_VISIBLE_DEVICES=0 python relation_extraction.py evaluate \
  --dataset tacred \
  --masking_mode grammar_and_ner \
  --test_file ./data/tacred/test.jsonl \
  --save_dir ./logs/ \
  --model_file <MODEL FILE (e.g. model_epoch...)> \
  --batch_size 8 \
  --log_dir ./logs/The models we trained on SemEval and TACRED to produce our paper results can be found here:
| Dataset | Masking Mode | P | R | F1 | Download | 
|---|---|---|---|---|---|
| TACRED | grammar_and_ner | 70.0 | 65.0 | 67.4 | Link | 
| SemEval | None | 87.6 | 86.8 | 87.1 | Link | 
First, download the archive corresponding to the model you want to evaluate (links in the table above).
wget --content-disposition <DOWNLOAD URL>Extract the model archive containing model.pt, text_encoder.pkl, and label_encoder.pkl.
tar -xvzf <MODEL ARCHIVE>- dataset: dataset to evaluate, can be one of "semeval" or "tacred".
- test-file: path to the JSONL test file used during evaluation
- log-dir: directory to store the evaluation results and predictions
- save-dir: directory containing the downloaded model files (model.pt, text_encoder.pkl, and label_encoder.pkl)
- masking-mode: masking mode to use during evaluation, can be one of "None", "grammar_and_ner", "grammar", "ner" or "unk" (caution: must match the mode for training)
For example, to evaluate the TACRED model with "grammar_and_ner" masking, run the following command:
CUDA_VISIBLE_DEVICES=0 python relation_extraction.py evaluate \
      --dataset tacred \
      --test-file ./<CONVERTED DATASET DIR>/test.jsonl \
      --log-dir <RESULTS DIR> \
      --save-dir <MODEL DIR> \
      --masking_mode grammar_and_nerIf you use our code in your research or find our repository useful, please consider citing our work.
@InProceedings{alt_improving_2019,
  author = {Alt, Christoph and H\"{u}bner, Marc and Hennig, Leonhard},
  title = {Improving Relation Extraction by Pre-trained Language Representations},
  booktitle = {Proceedings of AKBC 2019},
  year = {2019},
  url = {https://openreview.net/forum?id=BJgrxbqp67},
}
lm-transformer-re is released under the MIT license. See LICENSE for additional details.
- Improving language understanding by generative pre-training. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
- PyTorch implementation of OpenAI's Finetuned Transformer Language Model