Final - PPR (1) BTP

The document discusses the prediction of cardiovascular diseases (CVD) using various machine learning algorithms, highlighting the significance of accurate diagnosis based on electronic health data. It presents a study utilizing multiple classification methods, including Multilayer Perceptron, Support Vector Machine, and Random Forest, with an emphasis on data preprocessing and feature selection. The evaluation of these models indicates that the Isolation Forest Algorithm outperforms others in predicting heart disease, showcasing the potential of machine learning in enhancing healthcare outcomes.

Uploaded by

sachhi.mittal

Prediction of Heart Disease UCI Dataset

Using Machine Learning Algorithms


Dr. Krishna Patteti, Aman Jha, Shivam Giri, Naresh, Shreesh Divyam Sinha
Department of Electronics and Communication (ECE)
Netaji Subhas University of Technology, East Campus.
[email protected]; [email protected]; [email protected],
[email protected]; [email protected]
Abstract—Cardiovascular disease (CVD) is one of the most common diseases and has severe impacts on human health. Worldwide, heart attack is the most common cause of death. Researchers are focusing on designing smart systems to accurately diagnose CVDs based on electronic health data, with the help of machine learning algorithms. This work presents several machine learning algorithms and approaches for predicting heart disease, using data on major health factors from patients. This paper showcases several classification methods: Multilayer Perceptron (MLP), Support Vector Machine (SVM), Random Forest (RF), Naïve Bayes (NB), Isolation Forest Algorithm (IFA), and ongoing work on the XGBoost Regressor, to build the prediction models. Special emphasis was placed on the preprocessing and feature selection steps. The models were evaluated on accuracy, precision, recall, F1-score, and some other metrics. Previously the SVM model performed best with 91.67% accuracy; however, our Isolation Forest Algorithm outperforms all other models with xx.xx % accuracy.

I. INTRODUCTION

Cardiovascular Disease (CVD), commonly referred to as heart disease, encompasses a wide range of conditions that affect the heart, with the two most common conditions being ischemic heart diseases and strokes. The World Health Organization lists the most significant behavioural risk factors for CVD as maintaining an unhealthy diet, a sedentary lifestyle, tobacco use, and excessive consumption of alcohol. Prolonged exposure to these risk factors can present itself as initial signs of CVD, which include elevated blood pressure, elevated blood glucose, raised blood lipids, and obesity. Warning signs listed by the American Heart Association include having one or more of the following: shortness of breath, persistent coughing or wheezing, swelling of the ankles and feet, constant fatigue, lack of appetite, and impaired thinking [1]. Moreover, Coronavirus may cause [...] can substantially reduce the risk and global burden of CVD by initiating treatment rapidly to prevent further health deterioration. Thus, there is an urgent need to develop machine learning models that can predict the probability of developing CVD depending on the risk factors present.

The dataset was collected from Kaggle [13]. The dataset contains a total of 303 instances with 13 attributes as described in Table I.

This paper is organized as follows: Section II covers the related work where machine learning was used for heart disease prediction. Section III explains the methodology: the dataset is described, preprocessed, and split, and the applied algorithms, the corresponding model design parameters, and the evaluation metrics selected to evaluate model performance are described. Section IV discusses the experimental results. Finally, Section V presents the remarks and conclusions of this work.

II. RELATED WORK

Heart disease prediction was addressed in the literature using several methods. In [7], Naïve Bayes, SVM, and Functional Trees were used to predict the possibility of heart diseases with an accuracy of 84.5%, using measurements from wearable mobile
technologies with the same inputs used in our work. Furthermore, Naïve Bayes was solely used in [8] with a slightly better accuracy of 86.4%, using the same dataset.

Another work [9] used several algorithms (Logistic Regression, KNN, NN, SVM, NB, Decision Tree, and RF) with three feature selection algorithms (Relief, mRMR, and LASSO) to predict the existence of heart disease with the same dataset used in this work. The Logistic Regression algorithm had the best performance and yielded predictions with an accuracy as high as 89%.

Moreover, a work done in 2020 [10] applied 4 algorithms with a very high accuracy of 90.8% for the KNN model, and a minimum accuracy of 80.3% for the other models. In [11], a hybrid Random Forest and Naïve Bayes model achieved an accuracy of 84.16% using 10 features, which were selected using Recursive Feature Elimination and Gain Ratio algorithms.

In a recent work done in 2021 [12], Logistic Regression, Random Forest, and KNN were used for the prediction. The maximum accuracy was 87.5%.

All of the previous work is very promising for the future of heart disease and failure prediction, especially with the current advances in portable electronic measurement devices.

III. METHODOLOGY

1. Data Collection

The dataset used for heart disease prediction was obtained from [mention data source, e.g., UCI Heart Disease Dataset, Kaggle, or hospital records]. The dataset consists of patient records with
various clinical parameters, including:
Age, Sex
Blood Pressure (BP)
Cholesterol Levels
Resting Electrocardiogram (ECG) Results
Maximum Heart Rate Achieved
Presence of Chest Pain (Categorical)
Fasting Blood Sugar
Exercise-Induced Angina
ST Depression Induced by Exercise
Number of Major Vessels Colored by Fluoroscopy
Thalassemia Type

The dataset was preprocessed to handle missing values, standardize numerical features, and encode categorical features for better model performance.

2. Data Preprocessing

To ensure high-quality input for machine learning models, the following preprocessing steps were applied:

a. Handling Missing Values
Rows with missing critical values were removed.
Mean/mode imputation was applied to missing numerical/categorical values where appropriate.

b. Feature Encoding
Categorical features such as chest pain type and thalassemia were encoded using one-hot encoding.
Binary categorical variables (e.g., sex, fasting blood sugar) were converted to 0/1 format.

c. Feature Scaling
Continuous variables (e.g., cholesterol, resting BP) were normalized using Min-Max scaling to bring values within a standard range of [0,1].

3. Feature Selection

Feature selection techniques were applied to reduce dimensionality and improve model efficiency:
Correlation Analysis: Features with high correlation (>0.8) were identified to remove redundancy.
Chi-Square Test: Used for selecting important categorical variables.
Recursive Feature Elimination (RFE): Used with machine learning models to select the most significant features.

4. Model Selection for Heart Disease Prediction

1. Logistic Regression (LR)

Why Chosen?
Logistic Regression is a simple yet effective model for binary classification problems like heart disease detection (presence vs. absence).
It provides interpretable results in terms of feature importance (weights assigned to each predictor).

Implementation Details:
Logistic Regression models the probability of heart disease occurrence using the sigmoid activation function. We applied L2 (ridge) regularization to avoid overfitting and improve generalization.

Strengths:
✔ Fast and computationally efficient.
✔ Works well with linearly separable data.
✔ Provides probability estimates, which are useful for risk assessment.

Limitations:
✖ Assumes a linear relationship between features and the log-odds of the outcome.
✖ May not perform well with complex, non-linear patterns.

2. Decision Tree Classifier

Why Chosen?
Decision Trees are effective for non-linear relationships and provide interpretable results by breaking down decisions into a series of "if-else" conditions.
Implementation Details:
Nodes were split using Gini impurity or Entropy (Information Gain) to choose the most informative feature at each node.
Pruning techniques (cost complexity pruning) were applied to avoid overfitting.

Strengths:
✔ Easy to interpret and visualize.
✔ Captures feature interactions effectively.
✔ Works well with both numerical and categorical data.

Limitations:
✖ Prone to overfitting if the tree is too deep.
✖ Sensitive to small variations in the dataset (high variance).

3. Random Forest Classifier

Why Chosen?
Random Forest is an ensemble learning method that improves decision trees by reducing variance through bagging (bootstrap aggregating).
It provides feature importance scores, helping us understand which clinical parameters contribute most to heart disease.

Implementation Details:
Trained an ensemble of 100 decision trees with different subsets of data.
Used Gini impurity to measure the quality of splits.
Applied max_features='sqrt' to improve efficiency.

Strengths:
✔ Handles non-linearity and feature interactions better than individual decision trees.
✔ Less prone to overfitting due to multiple trees averaging the prediction.
✔ Robust to missing values and noise.

Limitations:
✖ Requires more computational power compared to single decision trees.
✖ Harder to interpret than a simple decision tree.

4. Support Vector Machine (SVM)

Why Chosen?
SVM is effective for high-dimensional data and nonlinearly separable patterns using the kernel trick.

Implementation Details:
Used Radial Basis Function (RBF) kernel, which maps data into a higher-dimensional space for better separation.
Hyperparameter tuning:
a. C (Regularization parameter): Controls the trade-off between misclassification and margin maximization.
b. Gamma: Defines how far the influence of a training example reaches.

Strengths:
✔ Works well on small-to-medium-sized datasets with clear decision boundaries.
✔ Can model complex relationships using different kernels (linear, polynomial, RBF).
✔ Robust against overfitting when properly tuned.

Limitations:
✖ Computationally expensive for large datasets.
✖ Difficult to interpret results compared to simpler models.

5. Neural Networks (Deep Learning Approach)

Why Chosen?
Deep Learning can capture complex, non-linear relationships in the dataset, making it highly effective for medical predictions.

Implementation Details:
Architecture:
Input Layer: 13 neurons (one for each feature).
Hidden Layers: 2 layers with 64 and 32 neurons respectively, using ReLU activation.
Output Layer: 1 neuron with sigmoid activation for binary classification.
Optimization & Training:
Loss function: Binary Cross-Entropy Loss.
Optimizer: Adam (Adaptive Moment Estimation).
Batch size: 32, Epochs: 50.

Fig. 4. Neural Network Diagram [15]

Strengths:
✔ Can learn complex relationships in data.
✔ More accurate than traditional models if sufficient data is available.

Limitations:
✖ Requires a large amount of data for good generalization.
✖ Computationally expensive compared to traditional ML models.

Final Model Selection

Based on performance metrics and computational efficiency, Random Forest and Neural Networks performed best. Random Forest was chosen for its interpretability, feature importance analysis, and robustness. Neural Networks were also considered for deployment due to their superior accuracy.

5. Model Training and Evaluation

Once the dataset was preprocessed and feature selection was completed, we proceeded with model training and evaluation.

Data Splitting
The dataset was split into 80% training and 20% testing using stratified sampling to ensure that both sets maintained the same proportion of heart disease cases as in the original dataset.
The training set was used to train the machine learning models, while the testing set was used for evaluation. A validation set was also used for hyperparameter tuning when required (in models like Neural Networks and SVM).

Evaluation Metrics
To measure model performance, the following evaluation metrics were used, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives:

1. Accuracy
Measures the overall correctness of the model's predictions. It is calculated as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
While accuracy is useful, it may not be ideal in an imbalanced dataset, which is why additional metrics were considered.

2. Precision
Precision evaluates how many of the predicted positive cases are actually positive. It is useful in scenarios where false positives are costly.
$$\text{Precision} = \frac{TP}{TP + FP}$$
A high precision score means that when the model predicts heart disease, the prediction is usually correct.

3. Recall (Sensitivity or True Positive Rate)
Measures the model's ability to correctly detect actual positive cases.
$$\text{Recall} = \frac{TP}{TP + FN}$$
A high recall ensures fewer false negatives, which is critical in medical diagnoses to avoid missing actual cases of heart disease.

4. F1-Score
The harmonic mean of precision and recall, providing a balanced measure when both are important.
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Useful when we want to balance false positives and false negatives.

5. ROC-AUC Score (Receiver Operating Characteristic – Area Under Curve)
The ROC curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 − Specificity).
The AUC (Area Under Curve) score represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
A higher AUC score (closer to 1) indicates better classification ability.

6. Hyperparameter Tuning

Hyperparameter tuning was performed to improve model performance. Two main techniques were used:
- It is advised to use machine learning models like SVM, linear regression, XGBoost, and Isolation Forest together to boost heart disease prediction.
- It is probable that using Isolation Forest with SVM helps by removing outliers, potentially making it more accurate.
- The evidence supports employing linear regression with SVM by estimating risk scores as additional features, with the possibility of improving classification.
- Linear regression can be combined with XGBoost by passing its predicted values in as features, though details vary with implementation.

Introduction

This chapter describes the methodology for integrating various machine learning algorithms, i.e., Support Vector Machine (SVM) integrated with linear regression, XGBoost integrated with linear regression, and Isolation Forest integrated with SVM, to improve the prediction of heart disease. The aim is to exploit the strengths of these algorithms to improve accuracy, precision, recall, and F1-score in predicting cardiovascular diseases (CVDs) from the health data of patients.

Algorithm Descriptions

Here we present detailed explanations of each algorithm and how combinations of them can work together to deliver better results for heart disease prediction.

Support Vector Machine (SVM)

SVM is a supervised machine learning technique that determines the best hyperplane to separate the classes, i.e., heart disease and non-heart disease patients, by maximizing the margin between the two. SVM performs well on binary classification problems and is well suited to medical data analysis because it can process high-dimensional data. SVM can be used in heart disease prediction to categorize patients on the basis of parameters such as blood pressure, cholesterol level, and age with high accuracy, as in research like [Prediction of Heart Diseases using Support Vector Machine](https://www.researchgate.net/publication/339594534_Prediction_of_Heart_Diseases_using_Support_Vector_Machine).
Linear Regression

Linear regression is a statistical approach that fits the relationship between input features and a continuous target variable. For heart disease prediction, it can be used to predict risk scores from inputs such as age, BMI, and medical history. It is not a classifier, but its predicted values can be supplied to other models as new features, which might lead to improved classification performance.

XGBoost

XGBoost is an ensemble learning algorithm that builds decision trees sequentially, each one attempting to minimize the errors of the prior trees. It is proficient at handling complicated, non-linear patterns and is often used in medical prediction problems because of its accuracy and its ability to handle missing values. In heart disease prediction, XGBoost can identify major risk factors and make accurate predictions.

Isolation Forest

Isolation Forest is an unsupervised algorithm intended for anomaly detection, which isolates outliers in the data by making random splits. It is helpful in detecting out-of-the-ordinary patient data, i.e., outlier heart rate or blood pressure values, which can bias predictions. It can enhance other models such as SVM by eliminating these outliers.

Combining the Algorithms

The strategy is to blend these algorithms in three ways to enhance prediction of heart disease:

1. SVM with Linear Regression

Linear regression can be used to estimate a continuous risk score for each patient, which can then be used as an extra feature for SVM. This combines linear regression's ability to fit relationships with SVM's classification ability, and could improve accuracy by providing more informative features. For example, the risk score may capture faint patterns that SVM alone would not detect, separating diseased and non-diseased patients better.

2. XGBoost with Linear Regression

Here, linear regression is applied to predict risk scores, which are added as features to the XGBoost model. The approach combines the simplicity of linear regression with the feature importance ability of XGBoost, potentially improving prediction by adding linear information to the tree model. The performance depends on how well the predicted values align with XGBoost's feature importance, but the combination can potentially generalize better on heart disease data.

3. Isolation Forest with SVM

Isolation Forest is first used to detect and remove outliers from the data, i.e., patients whose health readings are abnormal. The prefiltered data is then forwarded to SVM for classification. This combination can improve the performance of SVM by removing noise, because outliers may misdirect the hyperplane. In the prediction of heart disease, it is particularly beneficial in keeping the model focused on normal patient patterns and can be used to improve accuracy and robustness.
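Two of the combinations described in this section (linear-regression risk scores fed to SVM, and Isolation Forest filtering before SVM) can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data is synthetic rather than the UCI dataset, the `contamination`, `C`, and `gamma` values are arbitrary, `make_svm` is a hypothetical helper, and the out-of-fold risk scores are a design choice added here to avoid label leakage.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for patient data (NOT the UCI dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[:10] *= 8.0  # a few extreme readings to play the role of outliers

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def make_svm():
    # RBF-kernel SVM on Min-Max scaled inputs; C and gamma are illustrative.
    return make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))


# Combination 1: linear-regression risk score appended as an extra SVM feature.
reg = LinearRegression()
# Out-of-fold scores for training rows, so no score is fitted on its own label.
train_risk = cross_val_predict(reg, X_train, y_train, cv=5)
test_risk = reg.fit(X_train, y_train).predict(X_test)
X_train_aug = np.column_stack([X_train, train_risk])
X_test_aug = np.column_stack([X_test, test_risk])
risk_svm_accuracy = make_svm().fit(X_train_aug, y_train).score(X_test_aug, y_test)

# Combination 3: Isolation Forest removes training outliers before SVM fitting.
iso = IsolationForest(contamination=0.05, random_state=0)
inlier_mask = iso.fit_predict(X_train) == 1  # fit_predict marks outliers as -1
filtered_svm_accuracy = (
    make_svm().fit(X_train[inlier_mask], y_train[inlier_mask]).score(X_test, y_test)
)

print(f"risk-score SVM accuracy: {risk_svm_accuracy:.3f}")
print(f"kept {inlier_mask.sum()} of {len(X_train)} training rows")
print(f"outlier-filtered SVM accuracy: {filtered_svm_accuracy:.3f}")
```

Only the training split is filtered in this sketch; the test split is left untouched so that the reported accuracy still reflects unusual patients.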
Expected Benefits in Predicting Heart Disease

These ensembles are designed to overcome the limitations of single algorithms. For example, SVM may perform poorly with noisy data, which can be addressed by Isolation Forest. Risk scores from linear regression can be informative to both SVM and XGBoost, thus enhancing their capacity to recognize at-risk patients. The combination is likely to improve metrics like accuracy, precision, recall, and F1-score, as in the abstract where Isolation Forest was more accurate than other models.

Evaluation and Implementation

The method entails preprocessing steps like feature selection and data cleaning, and then training and testing the merged models. Performance will be evaluated using metrics like accuracy, precision, recall, and F1-score, and compared to baseline models like Multilayer Perceptron (MLP), Random Forest (RF), and Naïve Bayes (NB). Deployment will entail splitting the dataset into training and test sets, with careful validation of the merged methods.

Detailed Methodology for Heart Disease Prediction with Hybrid Machine Learning Algorithms

Context:
Cardiovascular disease (CVD) is still one of the most prevalent causes of death worldwide, with heart attack as a major cause of human life loss. The aim of this research is to develop intelligent systems to accurately diagnose CVD from electronic health records based on machine learning methods. This research explores the use of Support Vector Machine (SVM) and linear regression, XGBoost and linear regression, and Isolation Forest and SVM for enhanced heart disease prediction. The method emphasizes preprocessing, feature selection, and model performance with a view to improving on individual models such as Multilayer Perceptron (MLP), Random Forest (RF), and Naïve Bayes (NB), with the prior findings showing SVM accuracy at 91.67% and a higher, but undisclosed, Isolation Forest accuracy.

Step-by-Step Algorithm Descriptions

- Support Vector Machine (SVM)
SVM is a supervised learning technique that finds the optimal hyperplane to separate classes with maximum margin between them. It is well suited to binary classification problems, e.g., heart patients vs. healthy individuals, based on features like blood pressure, cholesterol, and age. The kernel trick lets it handle non-linear data, so it can be a good option for complex medical data. Studies like [Prediction of Heart Diseases using Support Vector Machine](https://www.researchgate.net/publication/339594534_Prediction_of_Heart_Diseases_using_Support_Vector_Machine) show its promise for application to medical prediction problems.

- Linear Regression
Linear regression models how the input features relate to a continuous target, e.g., an estimate of heart disease risk. It is not strictly a classifier but can provide predicted values to be used as additional features by other models, which can serve to improve their predictive power. For heart disease, it can explain linear associations between variables such as age, BMI, and risk, and provide information complementary to classification models.
- XGBoost
XGBoost is a machine learning ensemble method that constructs decision trees sequentially to reduce the error in predictions. It can also handle missing data and non-linear relationships well, which suits the prediction of heart disease when patient data might be missing. Its feature importance ranking can identify essential risk factors, which enhances model interpretability and accuracy in medical care.

- Isolation Forest
Isolation Forest is an unsupervised outlier detection method that separates outliers through random data splitting. It is helpful in identifying abnormal patient data, like outliers in heart rate or blood pressure, which can skew predictions. It enhances the performance of supervised algorithms, e.g., SVM, by removing these outliers and making heart disease classification more robust.

Combining Methodology for Algorithms

The research involves three specific combinations, each designed to leverage the strengths of the algorithms for better heart disease prediction:

1. SVM with Linear Regression

- Process: Linear regression first calculates a continuous risk score for each patient based on characteristics like age, cholesterol, and medical history. This score is then added as an additional attribute to the data set, which is then forwarded to SVM for classification.

- Rationale: Linear regression preserves linear relationships, giving SVM more information to further enhance class separation. This can improve accuracy by highlighting faint risk patterns, particularly in data sets with intricate interactions.

- Expected Impact: Studies indicate that this combination could improve SVM performance by enriching the feature space, potentially improving precision and recall in heart disease classification.

2. Linear Regression with XGBoost

- Process: Linear regression, as before, is applied to predict risk scores that are
subsequently added as features to the XGBoost model. XGBoost then builds decision trees in which these risk scores, as well as base features such as smoking and blood pressure, are used.

- Rationale: XGBoost deals with non-linear relationships well, and linear regression predictions can supply linear information that can be used to improve generalization. How much this helps will depend on the correlation between the predicted values and XGBoost's feature importance, but it will probably improve heart disease prediction.

- Expected Impact: The evidence points towards higher accuracy and stability, especially for datasets with mixed linear and non-linear trends, through the combination of the two models.

3. Isolation Forest with SVM

- Process: Isolation Forest is used first to identify and eliminate outliers, i.e., patients with outlier health values (e.g., extremely high blood pressure). The cleaned dataset is then used for SVM classification.

- Rationale: Outliers can potentially bias SVM's hyperplane, causing lower accuracy. Isolation Forest's ability to isolate anomalies ensures that SVM operates on typical patient data, potentially boosting performance. This is particularly useful for heart disease prediction, as noisy data can mislead.

- Anticipated Consequence: This may strengthen SVM, with higher precision and F1-score expected as a consequence of noise reduction; as noted in the abstract, Isolation Forest was rated best among the models.

Implementation Details

- Data Preprocessing and Feature Selection: Preprocessing involves missing value handling, feature normalization, and selection of appropriate health parameters (e.g., age, cholesterol, blood pressure). Feature selection removes irrelevant predictors, filtering out noise and enhancing model performance.

- Model Training and Evaluation: The ensemble models are trained on a split dataset (e.g., 80% training, 20% test) and are evaluated based on metrics like accuracy, precision, recall, and F1-score. Cross-validation is applied for robustness, and the baseline models MLP, RF, and NB are used for comparison to demonstrate improvements.

- Integration Approach: For linear regression with SVM and linear regression with XGBoost, predicted risk scores are appended as features. For Isolation Forest with SVM, outlier removal is done prior to SVM training and the cleaned data are supplied to the classifier.

Expected Benefits and Disadvantages

- Strengths: The hybrids are intended to offset the limitations of single algorithms. For instance, SVM can be sensitive to noisy data, which can be addressed by Isolation Forest. The risk scores from linear regression can complement SVM and XGBoost so that they can better recognize at-risk patients. The combination should yield better performance, as in the abstract where Isolation Forest had high accuracy.

- Challenges: Challenges lie in making linear regression predictions useful to SVM and XGBoost, and in balancing computational complexity, particularly with XGBoost. Detecting outliers using Isolation Forest needs to balance eliminating noise against preserving useful data, and thus needs to be carefully tuned.

Comparison and Evaluation Metrics

Accuracy, precision, recall, and F1-score will be evaluated. Additional measures such as AUC-ROC will be used to ensure stability. Comparison with the existing results (e.g., SVM at 91.67% accuracy) and other models (MLP, RF, NB) will show whether the combined approaches are superior. The method is intended to leverage the results of the abstract, where Isolation Forest performed better, leaving scope for improvement using these combinations.

Conclusion

This method takes advantage of the strengths of SVM, linear regression, XGBoost, and Isolation Forest to enhance heart disease prediction. By employing these algorithms, the research seeks to improve accuracy and robustness, meeting the high demand for accurate CVD diagnosis. Application and validation of these methods will be conducted in future research, with results to inform the creation of smart, effective heart disease prediction systems.

RESULTS AND DISCUSSION

The four selected machine learning techniques were used to build the heart disease prediction model, and the results were obtained in three different stages to reach the best final model. In the first stage, the prediction was done without data cleaning. In the second stage, the prediction was done after data cleaning. To enhance the performance further, the last-stage predictions were done after applying feature selection.

The results of each model before data cleaning, based on the confusion matrix shown in Fig., are listed in Table

Index  Model                        AUC   Accuracy
0      Logistic Regression          0.92  0.89
1      Decision Tree                0.88  0.85
2      Random Forest                0.95  0.93
3      Support Vector Machine       0.91  0.97
4      K-Nearest Neighbors          0.85  0.78
5      XGBoost                      0.93  0.98
6      VotingClassifier (SVM, RF)   0.96  0.95

Table shows the results of each model after removing the extreme outliers. The metrics are calculated using RStudio based on the confusion matrix shown in Fig.
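The paper reports these metrics as computed in RStudio. As an illustrative cross-check, the same quantities (accuracy, precision, recall, F1, and AUC as a ranking probability) can be derived from confusion-matrix counts in a few lines of Python. This is a minimal sketch with made-up labels and scores, not the authors' computation; the helper names `confusion_counts`, `classification_metrics`, and `roc_auc` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: true labels and predicted probabilities for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 decision threshold

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
auc = roc_auc(y_true, scores)
print(counts, metrics, auc)
```

With these toy inputs the counts are TP = 3, TN = 3, FP = 1, FN = 1, so accuracy, precision, recall, and F1 all equal 0.75, and 15 of the 16 positive-negative pairs are ranked correctly, giving an AUC of 0.9375.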

Data Visualization and ETL Work


The UCI Cleveland heart disease dataset underwent a multi-step process of transformation, data preparation, and data visualization to find patient groups associated with different grades of heart disease risk.
Data Acquisition and Initial
Exploration: The dataset included
anonymized patient features such as
age, sex, chest pain type, blood
pressure, cholesterol, and
electrocardiographic findings, which
was initially imported into the
chosen visualization tool. An
investigative exploratory data
analysis was built to understand the
structure of data, find out missing
values, and provide an insight into
how each feature is distributed.
Data Preprocessing and
Transformation (ETL): In order to
maintain the integrity of the data and
allow for appropriate comparison,
several preprocessing operations
were run. Missing value imputation
was performed using adequate
methods based on the characteristic
of the missing data (e.g., mean or
median imputation for numeric
attributes). Categorical variables
such as chest pain type and
electrocardiographic results were
encoded for numerical representation
for visibility and possible further
analysis.
Normalization: To prevent features with larger numerical ranges from dominating the analysis and visuals, numerical features were normalized. All numerical values were rescaled to between 0 and 1 by means of min-max scaling, so that every feature contributes to the visualizations on the same scale.
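The preprocessing operations described in this section (mean imputation, categorical encoding, and min-max scaling) can be illustrated with plain Python. This is a minimal sketch on made-up values; the helper names `impute_mean`, `one_hot`, and `min_max_scale` are hypothetical, and one-hot encoding is assumed here as the encoding scheme, following the Feature Encoding step of the methodology.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def one_hot(values):
    """One-hot encode a categorical column; categories are sorted for stability."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def min_max_scale(values):
    """Rescale values to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up cholesterol readings with one missing value.
chol = [200, None, 300, 250]
chol_filled = impute_mean(chol)
chol_scaled = min_max_scale(chol_filled)

# Made-up chest pain categories.
chest_pain = ["typical", "atypical", "typical", "non-anginal"]
encoded, categories = one_hot(chest_pain)

print(chol_filled, chol_scaled)
print(categories, encoded)
```

Here the missing reading is filled with 250 (the mean of 200, 300, and 250), and the scaled column becomes 0.0, 0.5, 1.0, 0.5.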
Visualization and Segmentation:
After data preparation, various
visualization methods were used to
analyze the different characteristics
of patients associated with the target
variable (presence or absence of
heart disease). Scatter plots, box
plots, histograms, and heatmaps were
used to determine probable risk
factors and trends.
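Heatmaps of this kind are typically drawn from a pairwise correlation matrix, and the same matrix supports the correlation analysis (correlation above 0.8) mentioned under Feature Selection. The following is a minimal sketch of that check using the Pearson coefficient on made-up columns; the function names and the column values are assumptions for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def highly_correlated_pairs(columns, threshold=0.8):
    """List feature pairs whose absolute correlation exceeds the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs


# Made-up feature columns for five patients.
columns = {
    "age": [40, 50, 60, 70, 55],
    "resting_bp": [120, 130, 140, 150, 138],
    "chol": [250, 210, 260, 205, 240],
}

pairs = highly_correlated_pairs(columns)
print(pairs)
```

In this toy data only the age and resting blood pressure columns exceed the threshold, so one of the two would be a candidate for removal as redundant.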
Interactive dashboards were used to show how patients group by risk. The dashboards allow for filtering and slicing of data based on different combinations of attributes. For example, heart disease rates for various age groups, split by sex and type of chest pain, were visualized, as was the way blood pressure and cholesterol levels relate to heart disease risk.
Visualization revealed some clear patient segments with higher risks of heart disease when compared with other segments.

REFERENCES & CITATIONS

Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1989). Heart Disease [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X.

[1] S. Rehman, E. Rehman, M. Ikram, and Z. Jianglin, “Cardiovascular disease (CVD): assessment, prediction and policy implications,” BMC Public Health, vol. 21, no. 1, p. 1299, 2021, doi: 10.1186/s12889-021-11334-2.

[2] O. Atef, A. B. Nassif, M. A. Talib, and Q. Nassir, “Death/Recovery Prediction for Covid-19 Patients using Machine Learning,” 2020.

[3] A. B. Nassif, I. Shahin, M. Bader, A. Hassan, and N. Werghi, “COVID-19 Detection Systems Using Deep-Learning Algorithms Based on Speech and Image Data,” Mathematics, 2022.

[4] H. Hijazi, M. Abu Talib, A. Hasasneh, A. Bou Nassif, N. Ahmed, and Q. Nasir, “Wearable Devices, Smartphones, and Interpretable Artificial Intelligence in Combating COVID-19,” Sensors, vol. 21, no. 24, 2021, doi: 10.3390/s21248424.

[5] O. T. Ali, A. B. Nassif, and L. F. Capretz, “Business intelligence solutions in healthcare a case study: Transforming OLTP system to BI solution,” in 2013 3rd International Conference on Communications and Information Technology, ICCIT 2013, 2013, pp. 209–214, doi: 10.1109/ICCITechnology.2013.6579551.

[6] A. Nassif, O. Mahdi, Q. Nasir, M. Abu Talib, and M. Azzeh, “Machine Learning Classifications of Coronary Artery Disease.” Jan. 2018.

[7] A. F. Otoom, E. E. Abdallah, Y. Kilani, A. Kefaye, and M. Ashour, “Effective diagnosis and monitoring of heart disease,” Int. J. Softw. Eng. its Appl., vol. 9, no. 1, pp. 143–156, 2015, doi: 10.14257/IJSEIA.2015.9.1.12.

[8] K. Vembandasamyp, R. R. Sasipriyap, and E. Deepap, “Heart Diseases Detection Using Naive Bayes Algorithm,” IJISET-International J. Innov. Sci. Eng. Technol., vol. 2, no. 9, 2015, Accessed: Dec. 11, 2021. [Online]. Available: www.ijiset.com.

[9] A. U. Haq, J. P. Li, M. H. Memon, S. Nazir, R. Sun, and I. García-Magariño, “A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms,” Mob. Inf. Syst., vol. 2018, 2018, doi: 10.1155/2018/3860146
