
Multimodal Prediction of Cardiovascular Diseases

Jaskaran Singh (2020306)    Aditya Ahuja (2020275)    Pratyush Kumar (2020454)    Vedant Gupta (2020261)

Abstract

Cardiovascular diseases (or heart diseases) are the leading cause of death worldwide, with around 18 million deaths yearly. Several diagnosis methods are available worldwide, but they are expensive in terms of time, money and personnel. The motivation for our project is to conduct data analysis and build classification models using machine learning techniques to help practitioners make more accurate decisions when classifying heart diseases. We aim to combine multiple models for data analysis and apply relevant model evaluation techniques to present a proper, functioning model. GitHub link for the project: Link.

1. Introduction

Cardiovascular Diseases (CVDs) are a group of diseases related to the blood vessels and muscles around the heart; they can cause heart attacks and strokes, both cases of interrupted blood flow between the heart and the brain. CVDs are the leading cause of death globally, representing 32% of all global deaths. Most CVDs can be prevented via early detection and management, and studies show that 80% of CVDs can be prevented through early diagnosis and intervention [1].

This project provides multiple preliminary prediction models which label an individual as either 1 (likely to be suffering from a CVD) or 0 (not likely to be suffering from a CVD). This classification can further be used to decide whether the subject is in need of further medical attention. We hope that this research assists medical practitioners in deciding the best course of action for their patients and further reduces the mortality rate of CVDs.

2. Literature Review

2.1. Implementation of a Heart Disease Risk Prediction Model Using Machine Learning [3]

This paper performs a multi-modal classification on a subset of our dataset to determine heart diseases, with a chi-square statistical test performed on certain attributes. Six algorithms were used for the classification, namely SVM, Gaussian Naive Bayes, Logistic Regression, LightGBM, XGBoost and Random Forest, with varying but largely similar accuracy, save for Random Forest, which came out the highest at 88.5%.

The paper has sections on various causes of CVDs, current prediction methods (largely the practitioner's evaluation), and the benefits of using ML techniques. It also goes through multiple related works with various ML techniques achieving similar levels of accuracy. For its own study, the paper performed feature selection with the K-best attributes per the sklearn library, along with many data visualization techniques, which we will implement to infer important attributes and patterns. We will keep SVM and boosting approaches like LightGBM and XGBoost for later and continue preliminary model testing on our data with the other techniques illustrated in this paper and the next one.

The given data is not wholly separable with a linear boundary as per the visualizations, so aggregation techniques like Random Forest, XGBoost and LightGBM have almost no training error compared to the other models; however, only Random Forest performs better than LR, GNB and SVM on testing data.

2.2. Analysis and Prediction of Cardiovascular Disease using Machine Learning Classifiers [4]

The previous paper didn't include a few major classification algorithms, which this paper covers. It has a similar introduction and background on cardiovascular diseases. In addition, it discusses artificial neural networks (ANNs) with respect to various diseases in general, relating to the kidneys, liver, lungs and so on, and how a certain level of caution must be maintained when it comes to heart diseases.

The paper also goes into detail about how tree-based algorithms perform well on this type of data, as well as the strong performance of ANNs in various medical methodologies. The data was cleaned, scaled and split into multifold validation sets before training the algorithms. It used five classification algorithms: Logistic Regression, Decision Trees, Linear SVM, K-Nearest Neighbours and Random Forest, and performance metrics for all of them were reported along with ROC-AUC analysis. Random Forest once again came out as the best-performing algorithm on the given dataset, followed by Decision Tree, Logistic Regression, SVM and K-Nearest Neighbours, which was significantly worse than the other algorithms.
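The K-best feature selection referenced in the review (and used again in Section 3.3) comes from sklearn's SelectKBest. The following is an illustrative sketch only: the toy data below is a stand-in for the heart-disease attributes, not the cited study's code.

```python
# Illustrative sketch of SelectKBest feature selection (sklearn);
# the random toy data here is a stand-in, not the study's dataset.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 4))      # 4 candidate features (non-negative, as chi2 requires)
y = (X[:, 0] + X[:, 2] > 4).astype(int)    # toy target depending on features 0 and 2

selector = SelectKBest(score_func=chi2, k=2)   # keep the 2 highest-scoring features
X_new = selector.fit_transform(X, y)

print(X_new.shape)             # (100, 2)
print(selector.get_support())  # boolean mask over the original columns
```

Swapping `chi2` for `f_classif` or `mutual_info_classif` changes the scoring criterion without changing the workflow.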
3. Dataset details & Pre-processing Techniques

3.1. Dataset information

The original dataset is from 1988 and contains 76 attributes, but all experiments use a subset of 14 of them. We combined 4 datasets, from Cleveland, Hungary, Switzerland, and the VA Long Beach, and removed the duplicates. The combined dataset has 12 common features and 918 records.

The 12 attributes used to detect heart diseases are as follows:

• Age - Age of the patient
• Sex - Gender, classified as Male or Female
• ChestPainType - Chest pain, classified as Typical Angina (TA), Atypical Angina (ATA), Non-Anginal Pain (NAP) and Asymptomatic (ASY)
• RestingBP - Resting blood pressure in [mm Hg]
• Cholesterol - Serum cholesterol in [mg/dl]
• FastingBS - Fasting blood sugar, classified as '1' if FastingBS ≥ 120 mg/dl and '0' otherwise
• RestingECG - Resting electrocardiogram results, classified as 'Normal', 'ST' and 'LVH'
• MaxHR - Maximum heart rate achieved
• ExerciseAngina - Exercise-induced angina, classified as Yes (Y) and No (N)
• Oldpeak - ST depression induced by exercise relative to rest
• ST Slope - Slope of the peak exercise ST segment, classified as 'Up' for upsloping, 'Flat', and 'Down' for downsloping
• HeartDisease - Target variable, classified as 0 and 1

3.2. Data Preprocessing Techniques

The non-numerical columns, viz., Sex, ChestPainType, RestingECG, ExerciseAngina, ST Slope and Cholesterol, were encoded using LabelEncoder to convert them into numerical data for data visualization and model training. The encoding has been tabulated in the code file.

3.3. Data Analysis

The number of males (725) in this dataset is higher than that of females (193), with 63.17% of males testing positive and 25.90% of females testing positive.

The analysis of a histogram plot of individuals suffering from heart disease shows that individuals between 50 and 60 years are more likely to suffer from some form of CVD.

Upon analyzing the maximum heart rate of individuals diagnosed with CVDs, we found that most had a maximum heart rate between 110 and 140, while the average maximum heart rate of individuals below 70 yrs is above 150 bpm [2].

Using SelectKBest with the k value set to 3 resulted in 'ChestPainType', 'MaxHR' and 'ExerciseAngina'. This was somewhat expected, as all three are commonly used by medical practitioners to diagnose an individual's cardiovascular condition accurately.

Below is a plot of the correlation matrix obtained from the data.

Figure 1. Heatmap of pairwise correlation between features.

The t-SNE graph tells us that the data is not separable. The majority of features have low correlation, as seen from the heatmap. Most of the points are very close to the points of other classes.

Figure 2. t-SNE plot for the data.

PCA analysis was done on the dataset, with the first 3 components plotted below. As we can see, the data points are quite intermingled even along the highest-variance axes.

4. Methodology & Model details

We have used the sklearn library to create and train the following supervised learning models: Logistic Regression, Naive Bayes, Random Forest Classifier, Support
Vector Machine (SVM) and K-Nearest Neighbour Clustering. In addition to these models, two more boosting frameworks, LightGBM and XGBoost, were used to implement gradient-boosted decision trees. Keras was used to implement an Artificial Neural Network classification model.

Figure 3. PCA plot for the data.

4.1. Logistic Regression

Since our dataset is small, we decided to train a Logistic Regression model with bootstrapping by using resample. We created 100 bags (bootstrapped datasets) sampled from 80% of the original dataset. Each of these samples was then used to train a Logistic Regression classifier, with the final accuracy scores being the mean over these classifiers. Hyperparameter tuning was done using GridSearchCV, and the model with the optimal parameters was used to train the 100 classifiers.

4.2. Naive Bayes

We trained the Naive Bayes model with bootstrapping, similar to the other models. Gaussian and Bernoulli prior distributions were used to train two different models using GaussianNB and BernoulliNB, with optimal parameters obtained through hyperparameter tuning.

4.3. k-Nearest Neighbours

k-Nearest Neighbours classification was used to gauge raw performance, even though the t-SNE plot suggested that the data is quite mixed. Hyperparameter tuning was done to get the best performance, which resulted in the distance metric being Manhattan distance. It can be seen in the same t-SNE plot that even though the data is mixed, there are patches where similar kinds of data are clustered, and the distance between them is rather uniform.

4.4. Support Vector Machine

SVM was used to classify the patients with heart diseases with a linear boundary along with an optimum margin. Different kernels were utilised to transform the data, since it was not linearly separable. Hyperparameter tuning gave us 'rbf' as the best kernel.

4.5. Artificial Neural Network

An Artificial Neural Network built with the TensorFlow Sequential framework was used to check the performance of neural networks on the given dataset. The literature review suggested neural networks to be among the better-performing models, so various combinations of hidden layers and neurons were tested along with various optimizers. An early-stopping condition was also used to stop the training once the validation loss stagnated. From testing these parameters, the best layer sizes came out to be [11, 24, 16, 8, 4, 1], with leaky ReLU used as the activation and sigmoid used for the final layer. 250 epochs with a batch size of 25 were run, with the early-stopping condition restoring the best weight combination.

4.6. Random Forest Classifier

A Random Forest classifier with decision tree base estimators is used along with bootstrapping. We used AdaBoost to improve the accuracy obtained from the 'vanilla' random forest model.

4.7. Boosting Models

The LightGBM and XGBoost frameworks were used to implement gradient-boosted decision trees. Both models were trained with bootstrapping and gave optimal results, with an average accuracy score of 91.70%. LightGBM has some advantages over XGBoost, in that it is faster and has lower memory overhead.

5. Results & Analysis

The dataset was bootstrapped 100 times, with each iteration split 80:20 into training and testing sets. Four performance metrics from sklearn.metrics, viz., Accuracy, Recall, Precision and F1 scores, were calculated for each model. ROC-AUC analysis was done based on the confusion matrix obtained on the test set. The table at the end shows each tested model's corresponding scores and AUC values.

The linear classifiers that were tested, i.e. Logistic Regression, Naive Bayes and SVM, didn't perform as well as the following models. The data analysis had suggested the given data points were not linearly separable and were intertwined. Both Bayes models we tested performed worse than the Logistic Regression model, with the corresponding accuracies being 82.74, 83.74 and 84.46%. SVM in particular performed the worst out of all models. We believe this is because of the lack of a separating hyperplane in any of the available dimensions. Even with the best parameters chosen, the accuracy of SVM could only go up to 69.1%. Every kernel applied to the dataset failed to provide a hyperplane, and so the linear models were out of the question for this dataset. The following is the ROC-AUC curve for the Logistic Regression model, the best-performing linear model:

Next, the k-Nearest Neighbour classifier used the best parameters taken via GridSearchCV, which came
out to be the Manhattan Distance metric and 49 neighbours to be used for queries.

Figure 4. ROC-AUC Curve for Logistic Regression.

The model notably performed well on the dataset, with more than an 85% accuracy score, which was higher than any linear classifier tried above.

Figure 5. ROC-AUC Curve for k-Nearest Neighbour classifier.

A big part of model training and tuning was done for the Artificial Neural Network model, with different combinations of layers and activation functions. The best parameters were chosen, and the testing accuracy didn't prove to be much better than the linear classifiers, coming in at a level similar to the kNN classifier, with an accuracy of 85%.

Figure 6. Loss Curve for ANN.

Figure 7. Accuracy curve for ANN.

We can see that the tree-based algorithms, namely Random Forest and AdaBoost, perform particularly well with the given data, having 85.87% and 88.04% accuracies. The AUC is also the highest for both algorithms. The Confusion Matrix and ROC-AUC curve for AdaBoost can be seen below:

Figure 8. Confusion Matrix for AdaBoost model on the testing dataset.

Figure 9. ROC-AUC Curve for the AdaBoost model on the testing dataset.

Feature importance was also calculated for the random forest model. We can see how ST Slope has the most information gain in Random Forest, which is consistent with it having the highest correlation value in the heatmap. The next page contains a table with accuracy values corresponding to all the previous models.

Following the success of tree-based algorithms, other boosting techniques were used to further improve the
model. XGBoost, with 500 estimators and each estimator being 3 nodes deep, performed really well on the given data, with a 91.6% accuracy score. Another boosting technique used was LightGBM, which gave a similar accuracy score of 91.6% with 100 estimators, each 10 nodes deep with 20 leaves, after parameter tuning. Both of these methods performed better than AdaBoost and significantly better than the Random Forest classifier. The ROC-AUC curves for the same can be seen below:

Figure 10. Feature importance in Random Forest model.

Figure 11. ROC-AUC Curve for XGBoost.

Figure 12. ROC-AUC Curve for LightGBM.

6. Conclusion

In this project, we tested various machine learning models on a dataset containing various features of patients, classifying them as patients with heart conditions or not. The models used were Logistic Regression, Naive Bayes, Support Vector Machine, k-Nearest Neighbours, Artificial Neural Network, Random Forest, AdaBoost, XGBoost and LightGBM. The poor performance of SVM and the relatively lower performance of the linear classifiers suggested a lack of a linear hyperplane in the data. All the linear classifiers had accuracy scores ranging from 82% to 84%, with the exception of SVM at 69%. Kernelisation of the data didn't prove fruitful either, with a similar accuracy for SVM. This led us to switch our modelling philosophy.

In the t-SNE plot, even though the points weren't separable, there were a significant number of patches where similar points were gathered. This was confirmed by the relatively high accuracy of the k-Nearest Neighbours model. However, it was only marginally better than the Logistic Regression model.

Seeing the lack of a hyperplane in the dataset, we decided to use an Artificial Neural Network to create a complex decision boundary. Hyperparameter tuning led us to the aforementioned parameters, but the results were not much better than the kNN model either, with 85% accuracy being reported.

Finally, we tried the tree-based algorithms, which outperformed the linear models as well as the ANN by some margin. Random Forest with 100 decision trees didn't perform as expected, but boosting with AdaBoost came out to be the third-best model on the dataset. From these findings, we can safely assume that tree-based models will perform better on this data. Seeing this result, we tried two popular boosting techniques, XGBoost and LightGBM, which use gradient boosting with a multitude of trees. Both of them came out to be the best models, with the difference in their accuracy within the margin of error at 91.6%. The results conclude that a systematic analysis of features and tree-based predictions were the most helpful, rather than fitting a single function through the data points. Further study on improving the model can be done, but the lack of recently compiled data hindered us. Data collection on this subject remains very scarce, and so that remains to be looked at.
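The bootstrap evaluation loop used throughout Sections 4 and 5 can be sketched as follows. This is an illustrative sketch only, not the project's code: synthetic data stands in for the heart-disease dataset, only 10 bags are drawn (the project used 100), and Random Forest stands in for the full model suite.

```python
# Sketch of the bootstrap evaluation described in the report:
# repeatedly split 80:20, draw a bootstrap bag (sampling with
# replacement) from the training part, fit a classifier, and
# average the test-set metrics. Synthetic stand-in data; NOT
# the project's actual pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=918, n_features=11, random_state=0)

accs, f1s = [], []
for seed in range(10):  # the report uses 100 bags; 10 here for brevity
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    X_bag, y_bag = resample(X_tr, y_tr, random_state=seed)  # bootstrap bag
    clf = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_bag, y_bag)
    pred = clf.predict(X_te)
    accs.append(accuracy_score(y_te, pred))
    f1s.append(f1_score(y_te, pred))

print(f"mean accuracy: {np.mean(accs):.3f}, mean F1: {np.mean(f1s):.3f}")
```

Averaging metrics over many bags gives a more stable estimate on a small dataset than a single train/test split.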
Figure 13. Performance of various tested models.

References

[1] Piepoli MF, Hoes AW, Agewall S, Albus C, Brotons C, Catapano AL, et al. 2016 European Guidelines on cardiovascular disease prevention in clinical practice: The Sixth Joint Task Force of the European Society of Cardiology and Other Societies on Cardiovascular Disease Prevention in Clinical Practice (constituted by representatives of 10 societies and by invited experts), developed with the special contribution of the European Association for Cardiovascular Prevention and Rehabilitation (EACPR). Eur Heart J. (2016) 37:2315–2381. doi: 10.1093/eurheartj/ehw106

[2] Hirofumi Tanaka, Kevin D Monahan, Douglas R Seals, Age-predicted maximal heart rate revisited, Journal of the American College of Cardiology, Volume 37, Issue 1, 2001, Pages 153-156, ISSN 0735-1097, https://doi.org/10.1016/S0735-1097(00)01054-8.

[3] Karthick K, Aruna SK, Samikannu R, Kuppusamy R, Teekaraman Y, Thelkar AR. Implementation of a Heart Disease Risk Prediction Model Using Machine Learning. Comput Math Methods Med. 2022 May 2;2022:6517716. doi: 10.1155/2022/6517716. PMID: 35547562; PMCID: PMC9085310.

[4] N. K. Kumar, G. S. Sindhu, D. K. Prashanthi and A. S. Sulthana, "Analysis and Prediction of Cardio Vascular Disease using Machine Learning Classifiers," 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), 2020, pp. 15-21, doi: 10.1109/ICACCS48705.2020.9074183.

[5] Repository

[6] Dataset Link

The individual contribution of each team member is as follows:

Jaskaran Singh: Data preprocessing, Data Analysis, Model Training and Tuning, Report, Slides.

Aditya Ahuja: Data preprocessing, Data Analysis, Model Training, Report.

Pratyush Kumar: Literature Review, Model Analysis, Report, Slides.

Vedant Gupta: Domain study, Data Analysis, Feature Engineering, Slides.