A natural language processing project based on HuggingFace Transformers for named entity recognition, specifically mutation recognition in PubMed abstracts of the SETH corpus.
- anaconda 4.10.1
- python 3.7.11 (64-bit)
- pytorch 1.5
- transformers 4.9.2
- datasets 1.11.0
- seqeval 1.2.2
- pandas 1.3.2
- numpy 1.19.2
- scikit-learn 0.24.2
- wandb 0.12.1
Script to transform the SETH corpus from .ann to .iob format (a minimal conversion sketch follows the dependency list).
- SETH corpus is accessed from: https://raw.githubusercontent.com/Erechtheus/mutationCorpora/master/corpora/original/SETH/corpus.txt
- Output file: corpus_IOB.csv
- 2 columns: Word | Tag
- each row contains a token (tokenized with spaCy) with the corresponding IOB tag
- rows starting with '#' contain the PubMed ID of the following tokenized abstract
- after each sentence there is an empty row
- pandas 1.3.2
- urllib3 1.26.6
- spacy 3.1.2
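A minimal sketch of the conversion step, assuming entity annotations arrive as character offsets; the `abstract_to_iob` helper, the example offsets, and the `Mutation` label are illustrative assumptions, not the exact corpus layout:

```python
# Sketch of the .ann -> IOB conversion; offsets and labels are assumptions.
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def abstract_to_iob(pubmed_id, text, entities):
    """entities: list of (start_char, end_char, label) tuples."""
    rows = [("# " + pubmed_id, "")]           # comment row with the PubMed ID
    doc = nlp(text)
    for sent in doc.sents:
        for token in sent:
            tag = "O"
            for start, end, label in entities:
                if token.idx == start:        # token starts the entity span
                    tag = "B-" + label
                elif start < token.idx < end: # token continues the span
                    tag = "I-" + label
            rows.append((token.text, tag))
        rows.append(("", ""))                 # empty row between sentences
    return rows

rows = abstract_to_iob("12345678", "The R117H mutation was observed.",
                       [(4, 9, "Mutation")])
pd.DataFrame(rows, columns=["Word", "Tag"]).to_csv("corpus_IOB.csv", index=False)
```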
Data exploration script for plotting the distribution of the labels over the training and test set (a minimal plotting sketch follows the dependency list).
- pandas 1.3.2
- scikit-learn 0.24.2
- matplotlib 3.2.2
- seaborn 0.11.2
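A minimal sketch of the plot, assuming the corpus_IOB.csv produced above and a simple token-level split for illustration (the actual script may split differently, e.g. by abstract):

```python
# Sketch of the label-distribution plot over an illustrative train/test split.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df = pd.read_csv("corpus_IOB.csv")
df = df.dropna(subset=["Word"])                              # drop empty separator rows
df = df[~df["Word"].astype(str).str.startswith("#")]         # drop PubMed ID comment rows

train, test = train_test_split(df, test_size=0.2, random_state=42)
both = pd.concat([train.assign(split="train"), test.assign(split="test")])

sns.countplot(data=both, x="Tag", hue="split")
plt.title("Label distribution over training and test set")
plt.show()
```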
Main named entity recognition script including the following steps (a condensed sketch of one fold follows the dependency list):
- loading and preprocessing the corpus in IOB format in such a way that a transformer model can be trained on it
- splitting into train and final test set
- splitting the train set into k subsets using k-fold cross validation
- for each fold: tokenization and dataset creation in such a way that the token classification model can be trained on it
- training and evaluation on the evaluation set using the HuggingFace Trainer API
- saving the model with the highest overall F1 score on the evaluation set
- loading the model that achieved the best overall F1 score and evaluating it on the final test set
- returning the metrics and the average F1 score of the k-fold cross-validation
- wandb integration for model monitoring and hyperparameter tuning
- pipeline for making NER predictions using the best model
- pytorch 1.5
- transformers 4.9.2
- datasets 1.11.0
- seqeval 1.2.2
- pandas 1.3.2
- numpy 1.19.2
- scikit-learn 0.24.2
- wandb 0.12.1
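A condensed sketch of one cross-validation fold covering these steps; the label set, the hyperparameters, and the `load_iob_sentences` helper are illustrative assumptions, not the script's exact code:

```python
# Sketch of k-fold token classification with the HuggingFace Trainer.
import numpy as np
from datasets import Dataset, load_metric
from sklearn.model_selection import KFold
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

label_list = ["O", "B-Mutation", "I-Mutation"]  # assumed label set
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
metric = load_metric("seqeval")

def tokenize_and_align(batch):
    # Sub-word pieces inherit their word's label; special tokens get -100
    # so the loss ignores them.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc["labels"] = [
        [-100 if w is None else batch["ner_tags"][i][w]
         for w in enc.word_ids(batch_index=i)]
        for i in range(len(batch["tokens"]))
    ]
    return enc

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=2)
    true = [[label_list[l] for l in row if l != -100] for row in p.label_ids]
    pred = [[label_list[l] for l, g in zip(prow, grow) if g != -100]
            for prow, grow in zip(preds, p.label_ids)]
    return metric.compute(predictions=pred, references=true)

# sentences: list of token lists; tags: matching lists of label ids
sentences, tags = load_iob_sentences("corpus_IOB.csv")  # hypothetical helper

fold_f1 = []
for fold, (tr, ev) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=42).split(sentences)):
    ds = {split: Dataset.from_dict(
              {"tokens": [sentences[i] for i in idx],
               "ner_tags": [tags[i] for i in idx]}
          ).map(tokenize_and_align, batched=True)
          for split, idx in (("train", tr), ("eval", ev))}
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-cased", num_labels=len(label_list))
    trainer = Trainer(
        model=model,
        args=TrainingArguments(f"out_fold{fold}", num_train_epochs=3,
                               report_to="wandb"),
        train_dataset=ds["train"], eval_dataset=ds["eval"],
        data_collator=DataCollatorForTokenClassification(tokenizer),
        tokenizer=tokenizer, compute_metrics=compute_metrics)
    trainer.train()
    fold_f1.append(trainer.evaluate()["eval_overall_f1"])

print("average F1 over folds:", np.mean(fold_f1))
```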
The following hyperparameters were tuned with Bayesian search using wandb (a sample sweep configuration follows this list):
- batch size
- epochs
- learning rate
- BERT model name
- seed
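A minimal sketch of such a sweep configuration, assuming the Trainer logs the evaluation metric as `eval/overall_f1`; the value ranges and model names are illustrative:

```python
# Sketch of a Bayesian wandb sweep over the hyperparameters listed above.
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/overall_f1", "goal": "maximize"},
    "parameters": {
        "batch_size": {"values": [8, 16, 32]},
        "epochs": {"values": [2, 3, 4]},
        "learning_rate": {"min": 1e-5, "max": 5e-5},
        "model_name": {"values": ["bert-base-cased",
                                  "dmis-lab/biobert-base-cased-v1.1"]},
        "seed": {"values": [0, 42, 1234]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="NER")
print(sweep_id)  # use this ID in the `wandb agent` command below
```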
- create a new sweep on wandb and set the parameters
- open a terminal inside the directory that contains the NER_tuning.py file and run:
- general:
CUDA_VISIBLE_DEVICES=GPU_NUM wandb agent wandb_user_name/wandb_project_name/wandb_sweep_id
- example:
CUDA_VISIBLE_DEVICES=0 wandb agent seyamy/NER/5x9xgg4p
- general:
- open the file NER_prediction.py and set the text variable to the text to be predicted. Make sure that the name of the model to be used is correct.
- the name of the output file containing the results is set to "output.txt"
- open a terminal inside the directory that contains the NER_prediction.py file and run:
python NER_prediction.py
- output.txt file (see the sketch after this list):
- first line: text to be predicted
- remaining lines: one prediction (label) per token
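A minimal sketch of the prediction step, assuming the best model was saved to a local directory named `best_model`; the input text is a placeholder:

```python
# Sketch of the NER prediction pipeline writing per-token labels to output.txt.
from transformers import pipeline

ner = pipeline("ner", model="best_model", tokenizer="best_model")
text = "We identified the R117H mutation in exon 4."

with open("output.txt", "w") as f:
    f.write(text + "\n")                       # first line: the input text
    for pred in ner(text):                     # one prediction per token
        f.write(f"{pred['word']}\t{pred['entity']}\t{pred['score']:.4f}\n")
```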