Results and Findings:
Three datasets are considered:
1. The features dataset:
Shape: (143, 3)
2. The patient_notes dataset:
Shape: (42146, 3)
3. A sample of the train.csv data:
Shape: (14300, 6)
The above three datasets are merged into a single data frame (merged_df).
A preview of merged_df is shown below:
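A minimal sketch of how this merge might be performed with pandas; the file paths, join keys, and sampling step are assumptions based on the dataset layout described above:

import pandas as pd

features = pd.read_csv("features.csv")            # shape (143, 3)
patient_notes = pd.read_csv("patient_notes.csv")  # shape (42146, 3)
train = pd.read_csv("train.csv").sample(n=14300, random_state=0)  # sampled to (14300, 6)

# Join the label rows to their feature text and patient note text
merged_df = train.merge(features, on=["feature_num", "case_num"], how="left")
merged_df = merged_df.merge(patient_notes, on=["pn_num", "case_num"], how="left")
print(merged_df.shape)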
Visualizing:
Based on the patient notes, the cases are categorized. The result shows that most of the
patients belong to case_num = 3.
The figure below displays a count plot of the most frequent case numbers. Based on the
graph, case_num = 5 has a high count, indicating that many patients are related to this particular case.
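A minimal sketch of how such a count plot might be produced with seaborn, assuming the merged frame from above:

import seaborn as sns
import matplotlib.pyplot as plt

# Count of records per case number, as in the figure described above
sns.countplot(x="case_num", data=merged_df)
plt.title("Most frequent case numbers")
plt.show()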
The table below gives a statistical presentation of the most commonly used words in the patient notes.
Observing these words, they are stop words, which in no way contribute to our
model's predictions. They are common English words that appear in every patient note.
Hence, we removed these stop words, as sketched below.
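A minimal sketch of the stop-word removal step using NLTK; the pn_history column name is an assumption about how the notes are stored in merged_df:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Drop stop words from each note (pn_history column name is an assumption)
merged_df["pn_history_clean"] = merged_df["pn_history"].apply(
    lambda note: " ".join(w for w in note.split() if w.lower() not in stop_words))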
After removing the stop words, the table below shows the most common words in the patient
notes.
A basic word cloud of the common words in the patient notes after removing stop words:
Although there are useful words such as 'pain' and 'epigastric', some stop words still remain in
the patient notes, even after removing them using the following two modules:
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords
Hence, the data is very complex and contains many irregularities, which makes prediction hard for the
model. This is one of the drawbacks: the imported stop-word lists may not always work for
complex raw sentences.
Tokenization techniques:
Tokenization is the process of breaking a raw sentence into smaller blocks referred to
as tokens. This process helps in developing the NLP model: insights into the text are obtained
by analyzing the series of tokens. Each token is assigned a numeric value so that it can
be considered by the model.
input_ids are the indices of each token in the sentence.
attention_mask indicates whether a token should be attended to or not.
token_type_ids indicates the sequence to which a token belongs when there are multiple
sequences.
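To make these three outputs concrete, here is a minimal sketch using the Hugging Face tokenizer for bert-base-uncased (the model named below); the example texts are purely illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sequence pair makes token_type_ids meaningful
encoded = tokenizer("dull epigastric pain after meals",
                    "17-year-old male, no prior episodes",
                    padding="max_length", truncation=True, max_length=32)
print(encoded["input_ids"])       # index of each token in the vocabulary
print(encoded["attention_mask"])  # 1 = real token, 0 = padding
print(encoded["token_type_ids"])  # 0 = first sequence, 1 = second sequence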
Model building:
PyTorch, an ML framework, is used because it provides various modules that support building
NLP models. Hence, we use the modules below for our use case. After tokenization, we
used the BERT technique to obtain the final outcome.
import torch.nn as nn
from torch import optim
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
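As a sketch of how the imported Dataset class might be used here, the wrapper below pairs the tokenized encodings with per-token labels; the class name, inputs, and label scheme are all assumptions:

import torch

class NotesDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of token lists from the tokenizer
        self.labels = labels        # per-token binary labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

train_dataset = NotesDataset(train_encodings, train_labels)  # hypothetical inputs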
The architecture of the model is defined as below:
Model name: bert-base-uncased
Hyperparameters:
Dropout: 0.5
Learning rate: 1e-5
Optimizer: AdamW
Building block: Linear
Loss function used: BCEWithLogitsLoss
Layers:
outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                    token_type_ids=token_type_ids)
logits = self.fc1(outputs[0])  # outputs[0]: last hidden state, one 768-dim vector per token
logits = self.fc2(self.dropout(logits))
logits = self.fc3(self.dropout(logits)).squeeze(-1)  # one logit per token
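Putting the pieces together, a sketch of the full model class is shown below. The baseline layer sizes (768 → 512 → 300 → 1) are inferred from the tuning step later in this section, and the class name is illustrative:

import torch.nn as nn
from transformers import AutoModel

class NotesModel(nn.Module):
    def __init__(self, dropout=0.5):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        # Baseline sizes inferred from the tuning step below
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, 300)
        self.fc3 = nn.Linear(300, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        logits = self.fc1(outputs[0])
        logits = self.fc2(self.dropout(logits))
        logits = self.fc3(self.dropout(logits)).squeeze(-1)
        return logits  # raw logits; BCEWithLogitsLoss applies the sigmoid internally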
No. of epochs: 3
Batch Size: 10
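A minimal sketch of the training loop under these hyperparameters, reusing the hypothetical NotesModel and train_dataset from the sketches above; the per-token label handling is an assumption:

import torch.nn as nn
from torch import optim
from torch.utils.data import DataLoader

model = NotesModel(dropout=0.5)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-5)
loader = DataLoader(train_dataset, batch_size=10, shuffle=True)

for epoch in range(3):
    model.train()
    for batch in loader:
        optimizer.zero_grad()
        logits = model(batch["input_ids"], batch["attention_mask"],
                       batch["token_type_ids"])
        loss = criterion(logits, batch["labels"])
        loss.backward()
        optimizer.step()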
With this setup, the model has been trained and the below metrics were calculated.
Time taken to build the model: 95 minutes.
Mean F1 score: 75.9%
Performance Tuning:
1. Added one more linear layer, with input size 300 and output size 150:
self.fc1 = nn.Linear(768, 512)
self.fc2 = nn.Linear(512, 300)
self.fc3 = nn.Linear(300, 150)  # the newly added layer
self.fc4 = nn.Linear(150, 1)
Observed values are as below:
Mean F1 score: 73.2%
2. Changed the dropout value to 0.05
Results are as below:
Mean of the metrics:
{'Accuracy': 0.9934958112237656,
'f1': 0.7842475466560781,
'precision': 0.7697296996353311,
'recall': 0.7993235625704622}
Mean F1 score: 78%
Best model:
Considering the second performance-tuning case, i.e., changing the dropout value to 0.05, we
obtain the best F1 score of 78%. This is the highest among all the runs.
Below are the plots of the metric values observed in each epoch.
Observations: Based on the above graphs, the accuracy is observed to be around 99% throughout. In
classification models, achieving an accuracy of 99% is not always to be expected; it depends on
the data being considered.
In this classification problem, accuracy is defined as:
Accuracy = Number of correctly classified cases / Total cases
Here, because the data is imbalanced, the model can count almost every case as correctly
classified simply by favoring the majority class, which inflates the accuracy.
Reason for considering the F1 score:
The F1 metric is considered because it is much less affected by imbalanced data. The dataset is highly
imbalanced, and with this imbalance it is hard for the model to predict as expected.
Hence, we consider the F1 score: as the harmonic mean of precision and recall, it reflects how
the positive class is actually distributed and extracted from the features, rather than being
dominated by the majority class. Choosing the right metric for the data always gives a better
assessment of model performance.
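For reference, F1 combines precision and recall as follows; plugging in the best run's precision and recall from above reproduces the reported score:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
   = 2 × (0.7697 × 0.7993) / (0.7697 + 0.7993) ≈ 0.784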
Therefore, the mean of the F1 score is used to assess the model's predictions, thereby
addressing our problem statement.
Since the data is huge and complex (in raw text format), the model takes up to 95 minutes
per run. This reflects two of the Vs of Big Data: Volume and Variety.