M.Tech. (Software Engineering) : Comparative Analysis of Heart Disease Prediction Using Machine Learning Algorithms
M.Tech.(Software Engineering)
by
May, 2024
DECLARATION
I further declare that the work reported in this thesis has not been submitted and
will not be submitted, either in part or in full, for the award of any other degree or diploma
in this institute or any other institute or university.
Place: Vellore
CERTIFICATE
Date: ……………….
EXTERNAL SUPERVISOR
ABSTRACT
This project aimed to develop and evaluate machine learning models for the prediction of
heart disease based on a dataset obtained from the IEEE repository. The dataset comprised
1025 instances with 14 features, including demographic information, medical
measurements, and clinical indicators. The methodology involved data preprocessing,
feature scaling, hyperparameter tuning, and the implementation of various classification
algorithms such as Logistic Regression, K-Nearest Neighbors (KNN), Support Vector
Machine (SVM), Decision Tree, and Random Forest.
Initial model training and testing without hyperparameter tuning revealed varying levels of
accuracy across different algorithms. Notably, SVM demonstrated perfect training
accuracy but lower testing accuracy, suggesting potential overfitting. Moreover, Decision
Tree and Random Forest achieved 100% training accuracy but comparatively lower testing
accuracy, indicating potential issues with generalization. Hyperparameter tuning was then
applied to enhance model performance, resulting in improved accuracy for Logistic
Regression and Random Forest models. However, SVM exhibited reduced testing
accuracy post-tuning, highlighting the need for further optimization strategies.
Future directions for this project include exploring advanced hyperparameter tuning
techniques, conducting extensive feature engineering, experimenting with ensemble
methods, and focusing on the interpretability of models. These endeavors aim to refine
predictive models for heart disease diagnosis, ensuring robustness, generalization, and
clinical relevance. Overall, this study underscores the potential of machine learning in
healthcare and lays the foundation for further research in cardiovascular disease prediction
and management.
ACKNOWLEDGEMENT
CONTENTS
Title Page No
List of Tables viii
List of Figures ix
List of Abbreviations x
Chapter Page
1. Introduction 1
1.1. Theoretical Background 1
1.2. Motivation 2
1.3. Aim of the proposed Work 2
1.4. Objective(s) of the proposed work 3
1.5. Report Organization 4
2. Literature Survey 4
2.1. Survey of the Existing Models/Work 4
2.2. Gaps Identified in the Survey 7
2.3. Problem Statement 8
3. Overview of the Proposed System 8
4. Requirements Analysis and Design 9
4.1. Requirements Analysis 9
4.1.1. Functional Requirements 9
4.1.1.1. Product Perspective 9
4.1.1.2. Product Features 9
4.1.1.3. User Characteristics 10
4.1.1.4. Assumption & Dependencies 10
4.1.1.5. Domain Requirements 10
4.1.2. Non-Functional Requirements 10
4.1.3. System Modeling 12
4.1.4. Engineering Standard Requirements 12
4.1.5. System Requirements 14
4.1.5.1. Hardware Requirements 14
4.1.5.2. Software Requirements 14
4.2. System Design 15
4.2.1. System Architecture 15
4.2.2. Detailed Design 16
5. Implementation and Testing 23
6. Results and Discussion 30
7. Conclusion and Scope for Future Work 42
7.1. Conclusion 42
7.2. Scope for Future Work 42
Annexure – I - Sample Code 44
References 51
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
Abbreviation Expansion
RF Random Forest
LR Logistic Regression
DT Decision Tree
KNN K-Nearest Neighbors
SVM Support Vector Machine
1. INTRODUCTION
1.1. THEORETICAL BACKGROUND
Developing a heart disease prediction system that employs machine learning algorithms requires an understanding of the guiding principles and development processes involved. The following are the main elements of the theoretical background:
Support Vector Machines (SVM): This effective approach can classify data in both
linear and nonlinear ways. By maximising the margin between them, it seeks to
identify an ideal hyperplane that divides various classes (such as people with and
without heart disease) from one another.
The theoretical background of the machine learning models also involves various techniques for data pre-processing, feature extraction, and model optimization, such as:
Data pre-processing
Feature extraction
Model optimization
1.2. MOTIVATION
Heart disease is one of the major causes of death nowadays. The potential of machine learning predictions motivated us to take up this project, and the specific motivations are mentioned below.
Early Intervention and Prevention: One of the top causes of death worldwide is heart
disease. Implementing preventive interventions and lifestyle changes that can
dramatically lower the risk of developing cardiovascular problems requires early
illness detection. Healthcare workers can intervene early and avoid negative
consequences by using machine learning algorithms' timely predictions.
Data-Driven Insights: The opportunity to use patient data for predictive analytics is
presented by the availability of enormous amounts of patient data, including
electronic health records, medical imaging, genetic information, and lifestyle factors.
These data can be examined by machine learning algorithms to find hidden
correlations, trends, and risk factors for heart disease. Using this information, healthcare professionals can design individualized treatment regimens that are catered to the needs of specific patients and make better decisions.
Research and Development: Machine learning methods for heart disease prediction
are another source of research and advancement in the field of cardiovascular health.
The use of sophisticated algorithms promotes the investigation of novel risk factors,
the development of novel biomarkers, and the detection of hitherto unrecognized
correlations between variables. These results advance our knowledge of heart disease
and provide guidance for ongoing study.
1.3. AIM OF PROPOSED WORK
Heart disease prediction using machine learning algorithms offers the potential for
early intervention, personalized medicine, improved accuracy, and more efficient
healthcare delivery. By harnessing the power of data-driven insights, healthcare
providers can enhance their decision-making processes, leading to better patient
outcomes and a reduction in the global burden of heart disease. In addition, hyperparameter tuning of the machine learning algorithms is applied in this work to further improve their predictive performance.
1.4. OBJECTIVES OF PROPOSED WORK
• Finding out patterns which affect heart disease.
• Comparative analysis of machine learning algorithms in predicting whether a patient has heart disease or not.
• Determining the model which provides the best accuracy among the available algorithms.
• Comparing each algorithm with and without hyperparameter tuning in order to find out how hyperparameter tuning reduces the factors affecting the algorithm.
2. LITERATURE SURVEY
2.1 SURVEY OF THE EXISTING MODELS/WORK
1. Heart disease Prediction: a review (Riyaz, Lubna). This study offers a thorough examination of the effectiveness of several machine learning techniques used for accurate cardiac disease prediction, diagnosis, and treatment. Support vector machines (SVM), decision trees (DT), Naive Bayes (NB), K-nearest neighbour (KNN), artificial neural networks (ANN), and other machine learning techniques are investigated in the proposed research for the prediction of the presence of cardiac diseases. Then, it was determined which technique performed best and worst overall by calculating the average forecast accuracy for each one. Metrics: Accuracy, F1-score.
3. An integrated Machine Learning Techniques for Accurate Heart Disease Prediction (A. A. Ahdal, M. Rakhra, S. Badotra and T. Fadhaeel). This study compares and analyses the results on the UCI dataset using a number of machine learning methodologies and several machine learning algorithms. Only 14 characteristics are used out of the 75 columns, and the confusion matrix and accuracy are calculated. Several optimistic findings are confirmed. The dataset contains several irrelevant attributes that were handled and normalised for better results. Metrics: Accuracy, F1-score.
5. Analyzing the impact of feature selection on the accuracy of heart disease prediction (Pathan, Muhammad Salman). Machine learning classification models were investigated using the complete and reduced feature subsets as inputs for the experimentation analysis. The trained classifiers were evaluated based on Accuracy, Receiver Operating Characteristics (ROC) curve, and F1-score. The classification results of the models proved that there is a high impact of relevant features on the classification accuracy. Metrics: Accuracy, ROC.
7. Heart disease prediction using SVM (Madhu H. K., D. Ramesh). SVM, a supervised model, is implemented to predict heart attack. The 13 features considered include personal details such as chest pain type, blood pressure, cholesterol level and heart rate. The implemented model is tested on the UCI health care heart disease dataset. The efficacy of the proposed model is justified using performance measures and the confusion matrix. The accuracy obtained is 83%. Metrics: Accuracy.
2.2 GAPS IDENTIFIED IN THE SURVEY
It is clear from the literature review that machine learning techniques have produced promising results in the prediction of heart disease. However, certain gaps still must be filled. The following are some of the shortcomings found:
To close these gaps, further research is required to create standardized techniques for
data collecting and preprocessing as well as to carry out extensive investigations to
confirm the efficacy of various machine learning models.
2.3 PROBLEM STATEMENT
Heart disease is one of the major curses of the present generation. There is no single factor that we can conclude is the reason behind the heart disease of a person. Heart patients generally show different symptoms, and it is quite complex to pattern them for heart disease detection. We would like to use data mining techniques to figure out the patterns and machine learning classification to find out the better model which gives higher accuracy, and to compare which of the machine learning algorithms gives better performance.
3. OVERVIEW OF THE PROPOSED SYSTEM
Heart disease, also known as cardiovascular disease, is a general term used to describe
a range of conditions that affect the heart and blood vessels. These conditions include
coronary artery disease, heart valve problems, arrhythmias, and heart muscle damage.
Heart disease is the leading cause of death worldwide and can be caused by a variety
of factors, including high blood pressure, high cholesterol, smoking, obesity, and
diabetes. Symptoms of heart disease may include chest pain, shortness of breath,
fatigue, and irregular heartbeats. According to the WHO (World Health Organization), about 10 million people die of heart disease every year. One of the major challenges facing the present generation is providing quality
services and diagnosis in health care. Despite the fact that heart disorders are now the
leading cause of death worldwide, they are also the ones that may be effectively
managed and controlled. When a disease is discovered at the right moment, it can be
managed with complete accuracy. In order to prevent negative effects, the suggested
work aims to identify these heart problems at an early stage.
This paper's major objective is to give clinicians a tool for early heart disease
detection. As a result, patients will receive effective care and serious repercussions
will be avoided. To uncover hidden discrete patterns and analyze the provided data,
machine learning (ML) plays a critical role. ML approaches aid in the early detection
and prediction of cardiac disease after data processing. This study examines the
effectiveness of different machine learning (ML) methods for early heart disease prediction, including Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Decision Tree, and Random Forest. Afterwards, we perform hyperparameter tuning on each of these algorithms and find the parameters which result in the best performance on the given validation set.
4. REQUIREMENTS ANALYSIS AND DESIGN
A stand-alone web application that uses machine learning algorithms to forecast the
likelihood of heart disease will be the heart disease prediction system. To generate a
prediction, it will need input from the user on their personal information, lifestyle
habits, and medical history. In order to make precise predictions, the system will use
machine learning models that have been trained on enormous datasets.
For use by people concerned about their risk of developing heart disease, the heart
disease prediction system will be created. Adults of any age and gender who want to
evaluate their risk for heart disease are eligible to use the service.
In order to produce reliable forecasts, the heart disease prediction system relies on
users providing accurate and full data inputs. Prediction accuracy will also be
influenced by the caliber and volume of training data utilized to create the machine
learning models.
4.1.1.5. Domain Requirements
The heart disease prediction system has to abide by all applicable data security and
privacy legislation. Additionally, for the creation, validation, and deployment of
machine learning models, the system must adhere to best practices and industry
standards.
4.1.2.1.1. Efficiency
The heart disease prediction system needs to be efficient in both time and space. It should be able to process user inputs quickly and produce predictions in a timely manner. It should also be optimized to use the least amount of resources
possible, such as memory and computing power, to lessen its impact on system
performance.
4.1.2.1.2. Reliability
The heart disease prediction system should be reliable and consistently deliver accurate results. It should handle a high volume of user requests without glitches or system failures. It should also undergo thorough testing to guarantee that it makes correct predictions under various circumstances and inputs.
4.1.2.1.3. Portability
The system for predicting cardiac disease ought to be portable and usable on a range
of hardware and software. Any device with an internet connection and a current web
browser should be able to access the system. It should not need any further software
installation and should work with a variety of operating systems, including Windows,
MacOS, and Linux.
4.1.2.1.4. Usability
The heart disease prediction system should be intuitive and simple to use. Data entry instructions should be provided in a clear and straightforward manner, and the system should be created with everyone's needs in mind. The system needs to be usable by users with various levels of technical proficiency and ought to be able to
offer predictions and recommendations in a language that the user can easily
comprehend. Additionally, the system should be developed to reduce errors and give
users feedback on any faults or incorrect input.
4.1.4. Engineering Standard Requirements
• Economic
Heart disease prediction should be offered at a viable cost that can be afforded by all organizations. The project's economic feasibility will determine its sustainability in society.
• Environmental
• Social
The heart prediction project aims to improve the social impact of healthcare by
providing an accurate and reliable tool for early detection of heart disease. The
project's social applicability is evident as it will improve the quality of healthcare and
potentially save lives.
• Political
The heart disease prediction project does not directly affect politics, but it is crucial to make sure the project conforms with all applicable healthcare legislation and norms. The main goal of the initiative is to increase patient safety and healthcare outcomes through early cardiac disease detection. As a result, the project must comply with all applicable healthcare safety standards and laws.
• Ethical
The heart disease prediction project's ethical ramifications are quite important.
Making sure the initiative doesn't discriminate against any patient groups or result in
skewed judgements is crucial. The initiative should also guarantee the confidentiality
and privacy of patient information.
• Sustainability
In terms of its technical and financial viability, the heart disease prediction effort needs to be long-lasting. The project's long-term viability will guarantee that healthcare
institutions can use it and gain from it in the future.
• Legality
Health-related rules and regulations must be followed by the heart disease prediction
project. No data security or privacy laws should be broken by the project.
• Inspectability
To make sure that the heart disease prediction project complies with all applicable
laws and ethical standards, it should be open to inspection by the appropriate
authorities and transparent. Additionally, the project must be inspectable in order to
spot and address any potential technical or operational problems.
4.2 SYSTEM DESIGN
4.2.2. Detailed Design
Class Diagram
Sequence Diagram
Activity Diagram
Use Case Diagram
Modules Description
Data Collection
The dataset was collected from the IEEE repository. It has 1025 rows and 14 columns.
Columns:
1. age - age of the person
2. sex - male: 1, female: 0
3. cp - chest pain type
• 0 - Typical angina: chest pain related to decreased blood supply to the heart
• 1 - Atypical angina: chest pain not related to the heart
• 2 - Non-anginal pain: typically esophageal spasms (non-heart related)
• 3 - Asymptomatic: chest pain not showing signs of disease
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital); anything above 130-140 is typically cause for concern
5. chol - serum cholesterol in mg/dl (serum = LDL + HDL + 0.2 * triglycerides); above 200 is cause for concern
6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg - resting electrocardiographic results
• 0: Nothing to note
• 1: ST-T wave abnormality; can range from mild symptoms to severe problems, signals a non-normal heart beat
• 2: Possible or definite left ventricular hypertrophy (enlarged heart's main pumping chamber)
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest; looks at the stress of the heart during exercise (an unhealthy heart will stress more)
11. slope - the slope of the peak exercise ST segment
• 0: Upsloping: better heart rate with exercise (uncommon)
• 1: Flatsloping: minimal change (typical healthy heart)
• 2: Downsloping: signs of an unhealthy heart
12. target - has heart disease or not (1 = yes, 0 = no)
Data Preprocessing
Data preprocessing is the process of cleaning, transforming, and preparing raw data
into a format that is suitable for analysis. It is an essential step in data analysis and
machine learning, as raw data often contains errors, missing values, and
inconsistencies that need to be addressed before the data can be used effectively. In this dataset there are no null values, but duplicate rows are present; these were removed using pandas, and the categorical and continuous variables were then separated into separate lists.
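A minimal sketch of this step is shown below; it assumes the IEEE CSV has been saved locally as heart.csv (an illustrative file name) and uses the same cutoff of fewer than 10 unique values that the annexure code uses to separate categorical from continuous columns.
import pandas as pd

# load the dataset (file name is illustrative)
df = pd.read_csv("heart.csv")

# no null values are expected, but duplicate rows are removed
print(df.isnull().sum().sum(), "missing values")
df = df.drop_duplicates()

# columns with fewer than 10 unique values are treated as categorical
categorical = [c for c in df.columns if df[c].nunique() < 10]
continuous = [c for c in df.columns if df[c].nunique() >= 10]
print("categorical:", categorical)
print("continuous:", continuous)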
Feature Scaling
Feature scaling is a data preprocessing technique that is used to standardize the range
of values of input features. It is often used in machine learning algorithms, as some
algorithms are sensitive to the scale of input features, and can perform poorly if the
features have different ranges or scales. Technique used: Standardization. This technique scales the values of the input features to have zero mean and unit variance.
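A minimal standardization sketch with scikit-learn's StandardScaler is given below, assuming the df and continuous variables from the preprocessing sketch above; note that the annexure code scales the full dataset before splitting, which is mirrored here.
from sklearn.preprocessing import StandardScaler

# standardize the continuous columns: z = (x - mean) / standard deviation
scaler = StandardScaler()
df[continuous] = scaler.fit_transform(df[continuous])

# after scaling, each continuous column has mean ~0 and variance ~1
print(df[continuous].describe().loc[["mean", "std"]])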
Hyper-parameter tuning
Hyperparameter tuning is the process of selecting the best set of hyperparameters for a
machine learning algorithm. Hyperparameters are parameters that are not learned
from data during training, but are set by the user before training begins.
Hyperparameters include the learning rate, regularization parameter, number of hidden layers in a neural network, and so on. Grid search: this involves specifying a set of
values for each hyperparameter, and training the model with all possible combinations
of these values. The best combination of hyperparameters is then selected based on
the performance on a validation set.
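The sketch below illustrates grid search for the KNN classifier; the parameter values are illustrative rather than the exact grid used in this work, and xtrain/ytrain are assumed to come from a train/test split of the scaled features.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# candidate hyperparameter values (illustrative grid)
param_grid = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
    "p": [1, 2],   # 1 = Manhattan distance, 2 = Euclidean distance
}

# every combination is trained and scored with 5-fold cross-validation
knn_cv = GridSearchCV(KNeighborsClassifier(), param_grid,
                      scoring="accuracy", cv=5, n_jobs=-1)
knn_cv.fit(xtrain, ytrain)

print("best parameters:", knn_cv.best_params_)
print("best cross-validated accuracy:", knn_cv.best_score_)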
Algorithms
• Logistic Regression
• KNN
• SVM
SVM-stands for Support Vector Machines. The algorithm aims to find the optimal
hyperplane that separates the data points into different classes, with the widest
possible margin between the hyperplane and the closest data points. In SVM, each
data point is represented as a vector in a high-dimensional feature space, and the
algorithm searches for a hyperplane that separates the data points into different
classes. The hyperplane is defined by a set of weights and biases, which are learned from the training data. The hyperplane is defined by the equation: w^T x + b
= 0 where w is a vector of weights and b is the bias term. The distance between the
hyperplane and a data point x_i is given by: d_i = y_i * (w^T x_i + b) / ||w|| where ||
w|| is the Euclidean norm of the weight vector.
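To make the hyperplane notation concrete, the sketch below fits a linear-kernel SVC (settings are illustrative) on the assumed xtrain/ytrain split; coef_ gives the weight vector w and intercept_ gives the bias b, so the signed distance of each point to the hyperplane can be computed directly.
import numpy as np
from sklearn.svm import SVC

# linear SVM: learns w and b such that w^T x + b = 0 is the separating hyperplane
svm_clf = SVC(kernel="linear", C=1.0)
svm_clf.fit(xtrain, ytrain)

w = svm_clf.coef_[0]        # weight vector w
b = svm_clf.intercept_[0]   # bias term b

# signed distance of each training point to the hyperplane: (w^T x_i + b) / ||w||
distances = (xtrain @ w + b) / np.linalg.norm(w)
print("margin width:", 2 / np.linalg.norm(w))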
• Decision Tree
A decision tree helps us analyse a dataset and discover patterns or rules that can forecast an important event. It is especially helpful for problems with discrete decision-making processes and can handle categorical as well as
numerical data. Since the resulting tree structure enables researchers to comprehend
the decision-making process and identify the most crucial features impacting the
prediction, decision trees are renowned for their interpretability.
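As an illustration of this interpretability, the sketch below fits a shallow tree (max_depth=3 is an arbitrary choice for readability) on the assumed xtrain/ytrain and prints the learned decision rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# a shallow tree keeps the printed rule set small and readable
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(xtrain, ytrain)

# the rules appear as nested if/else conditions on the input features
print(export_text(tree, feature_names=list(xtrain.columns)))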
• Random forest
Model Evaluation
TP be the number of true positives (i.e., instances that are correctly classified as
positive),
TN be the number of true negatives (i.e., instances that are correctly classified as
negative),
FP be the number of false positives (i.e., instances that are incorrectly classified as
positive),
FN be the number of false negatives (i.e., instances that are incorrectly classified as
negative).
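The evaluation metrics follow directly from these four counts; the small sketch below computes them, using a hypothetical confusion matrix purely for illustration.
def evaluation_metrics(tp, tn, fp, fn):
    """Compute standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# illustrative counts, not results from this study
acc, prec, rec, f1 = evaluation_metrics(tp=80, tn=70, fp=10, fn=15)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")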
5. IMPLEMENTATION AND TESTING
1. Data Collection
- Source: Obtained dataset from IEEE with 1025 rows and 14 columns containing
attributes related to heart disease.
- Attributes: Columns include Age, Sex, Chest Pain Type, Resting Blood Pressure,
Serum Cholesterol, Fasting Blood Sugar, Resting Electrocardiographic Results,
Maximum Heart Rate, Exercise Induced Angina, ST Depression, Slope, and Target
(presence of heart disease).
2. Data Preprocessing
- Data Cleaning: Removed duplicate values using pandas library, ensuring data
integrity.
- Null Handling: No null values found in the dataset, eliminating the need for
imputation.
- Categorical Encoding: Converted categorical variables into numerical representations for machine learning algorithms.
3. Feature Scaling
4. Hyperparameter Tuning
5. Machine Learning Algorithms
6. Model Evaluation
- Performance Metrics: Evaluated models using accuracy, precision, recall, and F1-
score to assess predictive power.
- Cross-Validation: Ensured robustness of models by validating performance across multiple data subsets (a minimal sketch follows below).
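A minimal cross-validation sketch is given below, assuming the feature matrix x and labels y prepared as in the annexure code; the 5-fold setting mirrors the cv=5 used during grid search.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# 5-fold cross-validated accuracy for one of the candidate models
rf = RandomForestClassifier(random_state=42)
scores = cross_val_score(rf, x, y, cv=5, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())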
7. System Architecture
During unit testing, each module of the system is tested individually to verify its
functionality. This involves validating data processing, model training, and other
functionalities to ensure they perform as expected. Additionally, input-output
validation is conducted to confirm correct data flow between different components.
Performance testing measures the efficiency and scalability of the system under
various loads. This involves assessing system response times, resource utilization, and
scalability to ensure that the system can handle increasing user loads without
performance degradation.
Security testing is conducted to identify and address potential security risks, ensuring
data privacy and integrity. Vulnerability assessments are performed to identify
security weaknesses, and penetration testing is conducted to test the system's
resilience to cyber threats.
User Acceptance Testing (UAT) involves gathering feedback from end-users to assess
the system's usability and satisfaction. Users provide feedback on the user interface,
functionality, and overall experience, and any issues raised are addressed to improve
system performance.
Through comprehensive testing, the Heart Disease Prediction System aims to deliver
a robust, reliable, and user-friendly solution for predicting heart disease risk and
assisting healthcare professionals in making informed decisions.
DATASET
Dataset Description
The dataset used for developing the Heart Disease Prediction System consists of 1025
instances and 14 attributes. These attributes provide essential information related to
various factors associated with heart disease.
- Sex: Indicates the gender of the individual, with 1 representing male and 0
representing female.
- Chest Pain Type (CP): Describes the type of chest pain experienced by the
individual, categorized into four types.
- Resting Blood Pressure (Trestbps): Indicates the resting blood pressure of the
individual upon admission to the hospital.
- Serum Cholesterol (Chol): Represents the serum cholesterol level in mg/dl, which
is a combination of LDL, HDL, and triglycerides.
- Fasting Blood Sugar (Fbs): Indicates whether the individual has fasting blood
sugar levels above 120 mg/dl.
- Maximum Heart Rate Achieved (Thalach): Indicates the maximum heart rate
achieved during testing.
- Number of Major Vessels (Ca): Indicates the number of major vessels colored by
fluoroscopy.
- Target: Denotes whether the individual has heart disease (1) or not (0).
This dataset provides a comprehensive range of attributes that are essential for
predicting heart disease risk and assisting in diagnosis and treatment decisions. By
analyzing these attributes, the Heart Disease Prediction System aims to provide
accurate and reliable predictions to support healthcare professionals in delivering
personalized care to individuals at risk of heart disease.
Implementation Setup
Hardware Setup
Software Setup
Frameworks and Tools
6. RESULTS AND DISCUSSION
Logistic Regression
KNN
SVM
Decision tree
Random Forest
Table 1: Models accuracy
On the other hand, the SVM model displayed perfect training accuracy of 100%,
implying that it perfectly separated the training data into distinct classes. However,
this high training accuracy did not translate well to the testing set, where the accuracy
dropped to 75%. This discrepancy suggests that the SVM model may have overfit the
training data, failing to generalize effectively to unseen data. Similarly, the decision
tree model also achieved perfect training accuracy of 100% but exhibited a testing
accuracy of only 75%. This indicates that the decision tree model may have
memorized the training data, resulting in poor performance on new data instances.
Lastly, the random forest model showed perfect training accuracy like the SVM and
decision tree models but displayed a testing accuracy similar to logistic
regression and KNN, indicating that while it performed well on the training data, its
ability to generalize to unseen data was limited, albeit better than SVM and decision
tree.
Hyper-tuned algorithms
1. Hypertuned Logistic Regression
2. Hypertuned KNN
3. Hypertuned SVM
4. Hypertuned Decision tree
5. Hypertuned Random Forest
Table 2: Hypertuned-models accuracy
Large datasets are analyzed to uncover intricate connections between risk variables
and heart disease outcomes using machine learning methods including logistic
regression, decision trees, random forests, and support vector machines. These
algorithms can extract significant insights and produce predictions with a high degree
of accuracy through feature selection, engineering, hyper-parameter tuning, and
model training. The hyper-tuned logistic regression gives the maximum accuracy. Hyperparameter tuning showed the improvement in each model.
Given the conclusion of the analysis, it's evident that hyperparameter tuning has led to
improvements in the model accuracies, particularly noticeable in the Logistic
Regression and Random Forest Classifier models. However, some models like the
Support Vector Machine (SVM) showed a decrease in testing accuracy after
hyperparameter tuning, suggesting that further exploration is needed to optimize these
models effectively. The hyperparameters used in the tuning process might not have
been optimal for certain models, leading to unexpected results. Further
experimentation with a wider range of hyperparameters or different tuning techniques like Bayesian optimization could yield better results. The current analysis might not have utilized the
full potential of the dataset. Exploring additional feature engineering techniques or
incorporating domain knowledge to create new features could enhance the predictive
power of the models. Ensemble methods like stacking or boosting could be explored
to combine the strengths of multiple models and mitigate their individual weaknesses.
This approach often leads to more robust and accurate predictions by leveraging the
diversity of different models. If the dataset is limited or contains missing values,
techniques such as data augmentation or imputation could be employed to generate
synthetic data or fill in missing values, respectively. This could help improve the
generalization and reliability of the models.
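As one concrete direction, the sketch below shows a stacking ensemble that combines the models already compared in this study; the base-learner settings are illustrative and assume the same xtrain/xtest split used earlier.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# base learners drawn from the models compared in this work
estimators = [
    ("knn", KNeighborsClassifier(n_neighbors=7)),
    ("svm", SVC(kernel="rbf", probability=True)),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
]

# a logistic regression meta-learner combines the base learners' predictions
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(xtrain, ytrain)
print("stacking test accuracy:", stack.score(xtest, ytest))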
While complex models like Random Forests can offer high accuracy, they often lack
interpretability. Exploring interpretable models like decision trees or logistic
regression could provide valuable insights into the factors influencing the predictions,
which is crucial for real-world applications where interpretability is important.
In conclusion, while the current analysis has provided valuable insights into the
performance of various machine learning models for predicting heart disease, there
are still opportunities for further refinement and exploration. By fine-tuning
hyperparameters, exploring new features, leveraging ensemble methods, and focusing
on interpretability, future works can continue to advance the accuracy and reliability
of predictive models in this domain.
ANNEXURE-1
Sample Code
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler  # used for feature scaling below
df = pd.read_csv("C:\\Users\\surya\\Downloads\\csv files\\heart_statlog_cleveland_hungary_final.csv")
df.head()
df.tail()
df.describe()
# conclusion: there are no missing values in the given dataset
df.isnull().sum()
df.duplicated().sum()
df.drop_duplicates(inplace=True)
# separate category and continuous variables
category, continuous = [], []
for i in df.columns:
    if len(np.unique(df[i])) < 10:
        category.append(i)
    else:
        continuous.append(i)
print("category =", category)
print("continuous =", continuous)
print(category)
print(continuous)
newcatdf=df[category]
newcondf=df[continuous]
sns.jointplot(y='target',x='age',data=df)
sns.countplot(x='sex',hue='target',data=df)
sns.boxplot(x='target',y='age',data=df)
sns.countplot(x='chest pain type',hue='target',data=df)
features = enumerate(category)
plt.figure(figsize=(15, 30))
for i in enumerate(category):
    plt.subplot(3, 3, i[0] + 1)
    sns.countplot(x=i[1], hue='target', data=df)
    plt.xlabel(i[1])
plt.figure(figsize=(15, 15))
s_sc = StandardScaler()
col_to_scale = continuous
df[col_to_scale] = s_sc.fit_transform(df[col_to_scale])
continuous_data=df[continuous]
continuous_data.head(5)
continuous_data['target']=df['target']
continuous_data.corr(method='pearson')
plt.figure(figsize=(20,12))
sns.set_context('notebook',font_scale = 1.3)
sns.heatmap(df.corr(),annot=True,linewidth =2)
plt.tight_layout()
def print_score(clf, xtrain, ytrain, xtest, ytest, train=True):
    # print accuracy, classification report and confusion matrix for train or test split
    if train:
        pred = clf.predict(xtrain)
        clf_report = pd.DataFrame(classification_report(ytrain, pred, output_dict=True))
        print("Train Result:\n**********************************************************")
        print(f"Accuracy Score: {accuracy_score(ytrain, pred) * 100:.2f}%")
        print(" ")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print(" ")
        print(f"Confusion Matrix: \n {confusion_matrix(ytrain, pred)}\n")
    else:
        pred = clf.predict(xtest)
        clf_report = pd.DataFrame(classification_report(ytest, pred, output_dict=True))
        print("Test Result:\n*********************************************************")
        print(f"Accuracy Score: {accuracy_score(ytest, pred) * 100:.2f}%")
        print(" ")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print(" ")
        print(f"Confusion Matrix: \n {confusion_matrix(ytest, pred)}\n")
x=df.drop(['target'],axis=1)
y=df['target']
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=2)
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(xtrain,ytrain)
pred=model.predict(xtest)
from sklearn.metrics import accuracy_score
accuracy_score(pred,ytest)
log_reg = LogisticRegression(solver='liblinear')
log_reg.fit(xtrain, ytrain)
# hyperparameter grid for the decision tree (the exact grid is not shown in the
# report; the values below are assumed for completeness)
params = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}
tree_clf = DecisionTreeClassifier(random_state=42)
tree_cv = GridSearchCV(tree_clf, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=5)
tree_cv.fit(xtrain, ytrain)
best_params = tree_cv.best_params_
print(f'Best_params: {best_params}')
tree_clf = DecisionTreeClassifier(**best_params)
tree_clf.fit(xtrain, ytrain)
print_score(tree_clf, xtrain, ytrain, xtest, ytest, train=True)
print_score(tree_clf, xtrain, ytrain, xtest, ytest, train=False)
test_score = accuracy_score(ytest, tree_clf.predict(xtest)) * 100
train_score = accuracy_score(ytrain, tree_clf.predict(xtrain)) * 100
results3 = pd.DataFrame(
    data=[["Tuned Decision Tree Classifier", train_score, test_score]],
    columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
# 'results' holds the scores of the earlier models (defined in the full notebook);
# pd.concat replaces the deprecated DataFrame.append
results = pd.concat([results, results3], ignore_index=True)
results
# hyperparameter ranges for the random forest (max_depth, min_samples_split and
# min_samples_leaf were not shown in the report; the values below are assumed)
n_estimators = [500, 900, 1100, 1500]
max_features = ['auto', 'sqrt']   # note: 'auto' is removed in recent scikit-learn versions
max_depth = [5, 10, 15, None]             # assumed
min_samples_split = [2, 5, 10]            # assumed
min_samples_leaf = [1, 2, 5]              # assumed
# knn_clf is the tuned KNN model fitted earlier in the full notebook
test_score = accuracy_score(ytest, knn_clf.predict(xtest)) * 100
train_score = accuracy_score(ytrain, knn_clf.predict(xtrain)) * 100
params_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf
}
rf_clf = RandomForestClassifier(random_state=42)
rf_cv = GridSearchCV(rf_clf, params_grid, scoring="accuracy", cv=5, verbose=1, n_jobs=-1)
rf_cv.fit(xtrain, ytrain)
best_params = rf_cv.best_params_
print(f"Best parameters: {best_params}")
rf_clf = RandomForestClassifier(**best_params)
rf_clf.fit(xtrain, ytrain)
continuous_data.corr(method='pearson')
continuous_data['target']=df['target']
continuous_data.head()
continuous_data.corr(method='pearson')
continuous_data.corr(method='kendall')
continuousdf=df[continuous]
continuousdf
continuousdf['target']=df['target']
continuousdf.head()
df.age.max()
df.age.min()
df.trestbps.max()
continuousdf.describe()
import seaborn as sns
import matplotlib.pyplot as plt
data = df.copy()
data.drop_duplicates(inplace=True)
data = data.drop(['target'], axis=1)   # drop the label so it is not used as a feature
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
df.drop_duplicates(inplace=True)
xtrain, xtest, ytrain, ytest = train_test_split(data, df['target'], train_size=0.7)
model=LogisticRegression()
model.fit(xtrain,ytrain)
from sklearn.metrics import accuracy_score
pred=model.predict(xtest)
score=accuracy_score(ytest,pred)
score
REFERENCES
1. Riyaz, Lubna, et al. "Heart Disease Prediction Using Machine Learning Techniques: A Quantitative Review." International Conference on Innovative Computing and Communications: Proceedings of ICICC 2021, Volume 3. Springer Singapore, 2022.
2. Azmi, Javed, et al. "A Systematic Review on Machine Learning Approaches for Cardiovascular Disease Prediction Using Medical Big Data." Medical Engineering & Physics (2022): 103825.
3. A. A. Ahdal, M. Rakhra, S. Badotra and T. Fadhaeel. "An Integrated Machine Learning Techniques for Accurate Heart Disease Prediction." 2022 International Mobile and Embedded Technology Conference (MECON), Noida, India, 2022, pp. 594-598, doi: 10.1109/MECON53876.2022.9752342.
4. Umarani Nagavelli, Debabrata Samanta, Partha Chakraborty. "Machine Learning Technology-Based Heart Disease Detection Models." Journal of Healthcare Engineering, vol. 2022, Article ID 7351061, 9 pages, 2022. https://doi.org/10.1155/2022/7351061
5. Pathan, Muhammad Salman, et al. "Analyzing the Impact of Feature Selection on the Accuracy of Heart Disease Prediction." Healthcare Analytics 2 (2022): 100060.
6. Shah, Devansh, Samir Patel, and Santosh Kumar Bharti. "Heart Disease Prediction Using Machine Learning Techniques." SN Computer Science 1.6 (2020): 1-6.
7. Madhu, H. K., and D. Ramesh. "Heart Disease Prediction Using SVM." International Journal of Computer Applications (0975-8887), Volume 183, No. 27, September 2021.
8. V. Sharma, S. Yadav and M. Gupta. "Heart Disease Prediction Using Machine Learning Techniques." 2020 2nd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India, 2020, pp. 177-181, doi: 10.1109/ICACCCN51052.2020.9362842.