TSPMN

Matching-based Term Semantics Pre-training for Spoken Patient Query Understanding, ICASSP 2023.

Requirements

pytorch 1.10.0
transformers 4.18.0
deepspeed 0.6.3
apex 0.1
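
As a convenience, the pinned versions above can be checked against the current environment before training. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and the distribution names (for example `torch` for pytorch) are assumptions about how the packages were installed.

```python
from importlib import metadata

# Versions listed in the requirements above. The distribution names
# (e.g. "torch" for pytorch) are assumptions, not taken from the repository.
REQUIRED = {
    "torch": "1.10.0",
    "transformers": "4.18.0",
    "deepspeed": "0.6.3",
    "apex": "0.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_requirements(required, lookup=installed_version):
    """Map each package name to a (required, found, ok) tuple."""
    report = {}
    for name, wanted in required.items():
        found = lookup(name)
        # Local build tags such as "1.10.0+cu113" still count as a match.
        ok = found is not None and found.split("+")[0] == wanted
        report[name] = (wanted, found, ok)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_requirements(REQUIRED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: required {wanted}, found {found} [{status}]")
```

A mismatch does not necessarily mean the code will fail; the listed versions are simply the ones this README names.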

Usage

Pretraining

Download the MedDialog, kamed and ReMeDi-large datasets and save them to the data4pretrain folder.

cd data4pretrain/VRBot-sigir2021-datasets
unzip 'kamed-*.zip'
cd ..

Processing medical dialogues:

python preprocess4pretrain.py

Constructing the dictionary. Download the Sogou medical dictionary and the medical-domain dictionary from THUOCL. Then:

python medical_dict/dict.py
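
To illustrate what building a combined term dictionary could look like, here is a minimal sketch. It does not reproduce `medical_dict/dict.py`. It assumes both sources have already been converted to plain text with one term per line, optionally followed by a whitespace-separated frequency column; the helper names and the `min_len` filter are hypothetical.

```python
def load_terms(lines, min_len=2):
    """Collect terms from lines of the form 'term' or 'term<whitespace>frequency'.

    min_len is an assumed filter that drops single-character entries,
    which tend to match too broadly.
    """
    terms = set()
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        term = parts[0]
        if len(term) >= min_len:
            terms.add(term)
    return terms


def merge_dicts(*sources):
    """Union several term sources and sort longest-first, then by code point."""
    merged = set()
    for source in sources:
        merged |= load_terms(source)
    return sorted(merged, key=lambda t: (-len(t), t))


if __name__ == "__main__":
    # Toy inputs standing in for the two downloaded dictionaries.
    sogou_lines = ["感冒", "高血压", ""]
    thuocl_lines = ["高血压\t1200", "糖尿病\t900", "痛\t5"]
    for term in merge_dicts(sogou_lines, thuocl_lines):
        print(term)
```

Sorting longest-first is a design choice that makes a later longest-match lookup straightforward.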

Constructing the dialogue-term pairs for pretraining:

cd data4pretrain
python medical_word_extract.py

Pretraining:

sh pretraining_deepspeed/run_train.sh

The pretrained checkpoint can be found on Baidu Netdisk (百度云盘), extraction code: b820.

Finetuning

Download the MSL Dataset. Then:

python data_process_MSL.py

Training:

CUDA_VISIBLE_DEVICES=2 nohup python -m torch.distributed.launch --master_port 19600 --nproc_per_node=1 train_Parallel.py > run.log 2>&1 &

Evaluating:

python evaluate.py

Pseudo-labeled data (medical dialogue corpus with pseudo-labeled terms)

We have released the processed dataset of dialogue-term pairs, obtained by pseudo-labeling medical terms via string matching. The processed dataset can be found on Baidu Netdisk (百度云盘), extraction code: wueo. Note that the original data comes from publicly available datasets; we only added the pseudo-labels for medical terms on top of them. The pseudo-labeled terms are not limited to term extraction: they can also support research on related downstream tasks, such as medical dialogue generation enhanced with term knowledge and medical dialogue recommendation.
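
Pseudo-labeling via string matching can be sketched as follows. This is only an illustration under stated assumptions: it uses greedy longest-match against a term list, which may differ from the matching strategy actually used to build the released corpus, and the function names and output layout are hypothetical.

```python
def match_terms(utterance, terms):
    """Return non-overlapping (start, end, term) matches in an utterance.

    At each position the longest matching term wins (greedy longest-match,
    an assumed strategy, not confirmed by the repository).
    """
    by_len = sorted(set(terms), key=len, reverse=True)
    matches = []
    i = 0
    while i < len(utterance):
        for term in by_len:
            if term and utterance.startswith(term, i):
                matches.append((i, i + len(term), term))
                i += len(term)  # skip past the matched span
                break
        else:
            i += 1  # no term starts here, move one character forward
    return matches


def build_pairs(dialogue, terms):
    """Pair every utterance of a dialogue with the terms found in it."""
    pairs = []
    for utterance in dialogue:
        found = [term for _, _, term in match_terms(utterance, terms)]
        pairs.append({"utterance": utterance, "terms": found})
    return pairs


if __name__ == "__main__":
    toy_terms = ["高血压", "血压", "头痛"]
    toy_dialogue = ["我有高血压，最近头痛", "血压偏高"]
    for pair in build_pairs(toy_dialogue, toy_terms):
        print(pair)
```

Because the labels come from exact matching, they inherit the coverage and noise of the dictionary, which is why they are described as pseudo-labels.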
