PyTorch implementation of Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence, CHIL 2022.
This model (CIM) corrects misspellings with a character-based language model and a corruption model (edit distance). The model is pre-trained on a clinical corpus and evaluated on clinical spelling correction datasets. Please see the paper for a more detailed explanation.
- Python 3.8 and packages in `requirements.txt`
- The MIMIC-III dataset (v1.4): PhysioNet link
- BlueBERT: GitHub link
- The SPECIALIST Lexicon of UMLS: LSG website
- English dictionary (DWYL): GitHub link
Clone this repository:
$ git clone --recursive https://github.com/dalgu90/cim-misspelling.git
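A minimal environment setup, assuming a plain virtualenv is acceptable and Python 3.8 is available as `python3.8` (the repository itself only specifies Python 3.8 and the packages in `requirements.txt`):

```
$ cd cim-misspelling
$ python3.8 -m venv venv && source venv/bin/activate
$ pip install -r requirements.txt
```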
- Download the MIMIC-III dataset from PhysioNet, especially `NOTEEVENTS.csv`, and put it under `data/mimic3`.
- Download `LRWD` and `prevariants` of the SPECIALIST Lexicon from the LSG website (2018AB version) and put them under `data/umls`.
- Download the English dictionary `english.txt` from here (commit 7cb484d) and put it under `data/english_words`.
- Run `scripts/build_vocab_corpus.ipynb` to build the dictionary and split the MIMIC-III notes into files.
- Run the Jupyter notebook for the dataset that you want to download/pre-process.
- Download the BlueBERT model from here and put it under `bert/ncbi_bert_{base|large}`.
  - For CIM-Base, please download "BlueBERT-Base, Uncased, PubMed+MIMIC-III"
  - For CIM-Large, please download "BlueBERT-Large, Uncased, PubMed+MIMIC-III"
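After these steps, the directory layout should roughly match the paths above. A quick sanity check (the exact contents of the BlueBERT folders depend on the release you download):

```
$ ls data/mimic3          # NOTEEVENTS.csv
$ ls data/umls            # LRWD, prevariants
$ ls data/english_words   # english.txt
$ ls bert                 # ncbi_bert_base/ and/or ncbi_bert_large/
```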
Please run `pretrain_cim_base.sh` (CIM-Base) or `pretrain_cim_large.sh` (CIM-Large) to pre-train the character language model of CIM.
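For example (invoke with `bash` instead if the scripts are not marked executable):

```
$ ./pretrain_cim_base.sh    # CIM-Base
$ ./pretrain_cim_large.sh   # CIM-Large
```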
The pre-training will evaluate the LM periodically by correcting synthetic misspellings generated from the MIMIC-III data.
You may need 2~4 GPUs (XXGB+ of GPU memory for CIM-Base and YYGB+ for CIM-Large) to pre-train with a batch size of 256.
There are several options you may want to configure:
- `num_gpus`: number of GPUs
- `batch_size`: batch size
- `training_step`: total number of steps to train
- `init_ckpt`/`init_step`: the checkpoint file/step to resume pre-training from
- `num_beams`: beam search width for evaluation
- `mimic_csv_dir`: directory of the MIMIC-III csv splits
- `bert_dir`: directory of the BlueBERT files
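How these options are exposed depends on the scripts themselves; as a purely illustrative sketch, assuming they are plain shell variables near the top of `pretrain_cim_base.sh`, a configuration might look like the following. The values and paths below are hypothetical; check the script for the actual option names, defaults, and how they are passed to the trainer.

```
# Hypothetical excerpt -- not the actual contents of pretrain_cim_base.sh.
num_gpus=4                            # number of GPUs
batch_size=256                        # batch size (256 as mentioned above)
training_step=500000                  # total number of training steps (illustrative)
num_beams=8                           # beam search width for evaluation (illustrative)
mimic_csv_dir="/path/to/mimic3_csv"   # directory of the MIMIC-III csv splits (placeholder)
bert_dir="bert/ncbi_bert_base"        # BlueBERT-Base directory
```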
You can also download the pre-trained LMs and put them under `model/` (e.g., the CIM-Base checkpoint is placed as `model/cim_base/ckpt-475000.pkl`).
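For example, for CIM-Base (the source path of the downloaded file below is just a placeholder):

```
$ mkdir -p model/cim_base
$ mv /path/to/ckpt-475000.pkl model/cim_base/
```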
Please specify the dataset directory and the file to evaluate in the evaluation script (`eval_cim_base.sh` or `eval_cim_large.sh`), and run the script.
You may want to set `init_step` to specify the checkpoint to load.
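For example, after editing `eval_cim_base.sh` (dataset directory, evaluation file, and `init_step`, e.g. 475000 for the released CIM-Base checkpoint above):

```
$ ./eval_cim_base.sh
```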
@InProceedings{juyong2022context,
title = {Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence},
author = {Kim, Juyong and Weiss, Jeremy C and Ravikumar, Pradeep},
booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
pages = {234--247},
year = {2022},
volume = {174},
series = {Proceedings of Machine Learning Research},
month = {07--08 Apr},
publisher = {PMLR}
}