Results and Findings:
Three datasets are considered:
1. The features dataset:
Shape: (143, 3)
2. The patient_notes dataset:
Shape: (42146, 3)
3. A sample of the train.csv data:
Shape: (14300, 6)
The above three datasets are merged into a single data frame (merged_df).
A preview of merged_df is shown below:
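A minimal sketch of how this merge might be performed with pandas; the file paths, join keys, and sampling step are assumptions based on the dataset layout described above:

import pandas as pd

features = pd.read_csv("features.csv")            # shape (143, 3)
patient_notes = pd.read_csv("patient_notes.csv")  # shape (42146, 3)
train = pd.read_csv("train.csv").sample(n=14300, random_state=0)  # sampled to (14300, 6)

# Join the label rows to their feature text and patient note text
merged_df = train.merge(features, on=["feature_num", "case_num"], how="left")
merged_df = merged_df.merge(patient_notes, on=["pn_num", "case_num"], how="left")
print(merged_df.shape)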
Visualizing:
Based on the patient notes, the cases are categorized. The result shows that most of the
patients belong to case_num = 3.
The figure below displays a count plot of the most frequent case numbers. Based on the
graph, case_num = 5 has a high count, indicating that many patients are related to this particular case.
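A minimal sketch of how such a count plot might be produced with seaborn, assuming the merged frame from above:

import seaborn as sns
import matplotlib.pyplot as plt

# Count of records per case number, as in the figure described above
sns.countplot(x="case_num", data=merged_df)
plt.title("Most frequent case numbers")
plt.show()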
The table below gives a statistical presentation of the most commonly used words in the patient notes.
Observing these words, they are stop words, which in no way contribute to our
model's predictions. They are common English words that appear in every patient note.
Hence, we removed these stop words, as sketched below.
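A minimal sketch of the stop-word removal step using NLTK; the pn_history column name is an assumption about how the notes are stored in merged_df:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Drop stop words from each note (pn_history column name is an assumption)
merged_df["pn_history_clean"] = merged_df["pn_history"].apply(
    lambda note: " ".join(w for w in note.split() if w.lower() not in stop_words))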
After removing the stop words, the table below shows the most common words in the patient
notes.
A basic word cloud of the common words in the patient notes after removing stop words:
Although there are useful words such as 'pain' and 'epigastric', some stop words still remain in
the patient notes, even after removing them using the following two modules:
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords
Hence, the data is very complex and contains many irregularities, which makes prediction hard for the
model. This is one of the drawbacks: the imported stop-word lists may not always work for
complex raw sentences.
Tokenization techniques:
Tokenization is the process of breaking a raw sentence into smaller blocks referred to
as tokens. This process helps in developing the NLP model: insights into the text are obtained
by analyzing the series of tokens. Each token is assigned a numeric value so that it can
be considered by the model.
input_ids are the indices of each token in the sentence.
attention_mask indicates whether a token should be attended to or not.
token_type_ids indicates the sequence to which a token belongs when there are multiple
sequences.
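To make these three outputs concrete, here is a minimal sketch using the Hugging Face tokenizer for bert-base-uncased (the model named below); the example texts are purely illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sequence pair makes token_type_ids meaningful
encoded = tokenizer("dull epigastric pain after meals",
                    "17-year-old male, no prior episodes",
                    padding="max_length", truncation=True, max_length=32)
print(encoded["input_ids"])       # index of each token in the vocabulary
print(encoded["attention_mask"])  # 1 = real token, 0 = padding
print(encoded["token_type_ids"])  # 0 = first sequence, 1 = second sequence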
Model building:
PyTorch, an ML framework, is used because it provides various modules that support building
NLP models. Hence, we use the modules below for our use case. After tokenization, we
used the BERT technique to obtain the final outcome.
import torch.nn as nn
from torch import optim
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
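As a sketch of how the imported Dataset class might be used here, the wrapper below pairs the tokenized encodings with per-token labels; the class name, inputs, and label scheme are all assumptions:

import torch

class NotesDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of token lists from the tokenizer
        self.labels = labels        # per-token binary labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

train_dataset = NotesDataset(train_encodings, train_labels)  # hypothetical inputs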
The architecture of the model is defined as below:
Model name: bert-base-uncased
Hyperparameters:
Dropout: 0.5
Learning rate: 1e-5
Optimizer: AdamW
Building block: Linear
Loss function used: BCEWithLogitsLoss
Layers:
outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                    token_type_ids=token_type_ids)
logits = self.fc1(outputs[0])  # outputs[0]: last hidden state, one 768-dim vector per token
logits = self.fc2(self.dropout(logits))
logits = self.fc3(self.dropout(logits)).squeeze(-1)  # one logit per token
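Putting the pieces together, a sketch of the full model class is shown below. The baseline layer sizes (768 → 512 → 300 → 1) are inferred from the tuning step later in this section, and the class name is illustrative:

import torch.nn as nn
from transformers import AutoModel

class NotesModel(nn.Module):
    def __init__(self, dropout=0.5):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        # Baseline sizes inferred from the tuning step below
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, 300)
        self.fc3 = nn.Linear(300, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        logits = self.fc1(outputs[0])
        logits = self.fc2(self.dropout(logits))
        logits = self.fc3(self.dropout(logits)).squeeze(-1)
        return logits  # raw logits; BCEWithLogitsLoss applies the sigmoid internally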
No. of epochs: 3
Batch Size: 10
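A minimal sketch of the training loop under these hyperparameters, reusing the hypothetical NotesModel and train_dataset from the sketches above; the per-token label handling is an assumption:

import torch.nn as nn
from torch import optim
from torch.utils.data import DataLoader

model = NotesModel(dropout=0.5)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-5)
loader = DataLoader(train_dataset, batch_size=10, shuffle=True)

for epoch in range(3):
    model.train()
    for batch in loader:
        optimizer.zero_grad()
        logits = model(batch["input_ids"], batch["attention_mask"],
                       batch["token_type_ids"])
        loss = criterion(logits, batch["labels"])
        loss.backward()
        optimizer.step()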
With this setup, the model has been trained and the below metrics were calculated.
Time taken to build the model: 95 minutes.
Mean F1 score: 75.9%
Performance Tuning:
1. Added one more linear layer, with input size 300 and output size 150:
self.fc1 = nn.Linear(768, 512)
self.fc2 = nn.Linear(512, 300)
self.fc3 = nn.Linear(300, 150)  # the newly added layer
self.fc4 = nn.Linear(150, 1)
Observed values are as below:
Mean F1 score: 73.2%
2. Changed the dropout value to 0.05
Results are as below:
Mean of the metrics:
{'Accuracy': 0.9934958112237656,
'f1': 0.7842475466560781,
'precision': 0.7697296996353311,
'recall': 0.7993235625704622}
Mean F1 score: 78%
Best model:
Considering the second performance-tuning case, i.e., changing the dropout value to 0.05, we
obtain the best F1 score of 78%. This is the highest among all the runs.
Below are the plots of the metric values observed in each epoch.
Observations: Based on the above graphs, the accuracy is observed to be around 99% throughout. In
classification models, achieving an accuracy of 99% is not always to be expected; it depends on
the data being considered.
In this classification problem, accuracy is defined as:
Accuracy = Number of correctly classified cases / Total cases
Here, because the data is imbalanced, the model can count almost every case as correctly
classified simply by favoring the majority class, which inflates the accuracy.
Reason for considering the F1 score:
The F1 metric is considered because it is much less affected by imbalanced data. The dataset is highly
imbalanced, and with this imbalance it is hard for the model to predict as expected.
Hence, we consider the F1 score: as the harmonic mean of precision and recall, it reflects how
the positive class is actually distributed and extracted from the features, rather than being
dominated by the majority class. Choosing the right metric for the data always gives a better
assessment of model performance.
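For reference, F1 combines precision and recall as follows; plugging in the best run's precision and recall from above reproduces the reported score:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
   = 2 × (0.7697 × 0.7993) / (0.7697 + 0.7993) ≈ 0.784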
Therefore, the mean of the F1 score is used to assess the model's predictions, thereby
addressing our problem statement.
Since the data is huge and complex (in raw text format), the model takes up to 95 minutes
per run. This reflects two of the Vs of Big Data: Volume and Variety.