A project report on

COMPARATIVE ANALYSIS OF HEART DISEASE PREDICTION USING MACHINE
LEARNING ALGORITHMS
Submitted in partial fulfillment for the award of the degree
of

M.Tech.(Software Engineering)
by

E. GURU MOHAN (19MIS0243)

SCHOOL OF COMPUTER SCIENCE ENGINEERING AND INFORMATION SYSTEMS

May, 2024

DECLARATION

I hereby declare that the thesis entitled “COMPARATIVE ANALYSIS OF HEART DISEASE
PREDICTION USING MACHINE LEARNING ALGORITHMS” submitted by me, for the award of
the degree of M.Tech (Software Engineering) is a record of bonafide work carried out
by me under the supervision of Dr. Gitanjali J.

I further declare that the work reported in this thesis has not been submitted and
will not be submitted, either in part or in full, for the award of any other degree or diploma
in this institute or any other institute or university.

Place: Vellore

Date: Signature of the Candidate

CERTIFICATE

This is to certify that the thesis entitled “COMPARATIVE ANALYSIS OF HEART DISEASE
PREDICTION USING MACHINE LEARNING ALGORITHMS” by E. GURU MOHAN (19MIS0243),
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore
for the award of the degree M. Tech (Software Engineering) is a record of bonafide work
carried out by the candidate under my supervision.
The contents of this report have not been submitted and will not be submitted either
in part or in full, for the award of any other degree or diploma in this institute or any other
institute or university. The Project report fulfils the requirements and regulations of
VELLORE INSTITUTE OF TECHNOLOGY, VELLORE and in my opinion meets the
necessary standards for submission.

Signature of the Guide Signature of the HoD

Internal Examiner External Examiner

Date: ……………….

CERTIFICATE BY THE EXTERNAL GUIDE

This is to certify that the project report entitled “COMPARATIVE ANALYSIS OF HEART
DISEASE PREDICTION USING MACHINE LEARNING ALGORITHMS” submitted by
E. GURU MOHAN (19MIS0243) to Vellore Institute of Technology in
partial fulfillment of the requirement for the award of the degree of M.Tech Integrated in
Software Engineering is a record of bonafide work carried out by him under my
guidance. The project fulfills the requirements as per the regulations of this Institute and in
my opinion meets the necessary standards for submission. The contents of this report have
not been submitted and will not be submitted either in part or in full, for the award of any
other degree or diploma in this institute or any other institute or university.

EXTERNAL SUPERVISOR

ABSTRACT

This project aimed to develop and evaluate machine learning models for the prediction of
heart disease based on a dataset obtained from the IEEE repository. The dataset comprised
1025 instances with 14 features, including demographic information, medical
measurements, and clinical indicators. The methodology involved data preprocessing,
feature scaling, hyperparameter tuning, and the implementation of various classification
algorithms such as Logistic Regression, K-Nearest Neighbors (KNN), Support Vector
Machine (SVM), Decision Tree, and Random Forest.

Initial model training and testing without hyperparameter tuning revealed varying levels of
accuracy across different algorithms. Notably, SVM demonstrated perfect training
accuracy but lower testing accuracy, suggesting potential overfitting. Moreover, Decision
Tree and Random Forest achieved 100% training accuracy but comparatively lower testing
accuracy, indicating potential issues with generalization. Hyperparameter tuning was then
applied to enhance model performance, resulting in improved accuracy for Logistic
Regression and Random Forest models. However, SVM exhibited reduced testing
accuracy post-tuning, highlighting the need for further optimization strategies.

Future directions for this project include exploring advanced hyperparameter tuning
techniques, conducting extensive feature engineering, experimenting with ensemble
methods, and focusing on the interpretability of models. These endeavors aim to refine
predictive models for heart disease diagnosis, ensuring robustness, generalization, and
clinical relevance. Overall, this study underscores the potential of machine learning in
healthcare and lays the foundation for further research in cardiovascular disease prediction
and management.

ACKNOWLEDGEMENT

It is my pleasure to express a deep sense of gratitude to Prof. Gitanjali J,
Associate Professor Grade 1, Vellore Institute of Technology, for his constant guidance,
continual encouragement, and understanding; more than all, he taught me patience in my
endeavor. My association with him is not confined to academics only; it has been a great
opportunity on my part to work with an intellectual and expert in the field of Information
Technology.

I would like to express my gratitude to Dr. G. Viswanathan, Chancellor, Vellore
Institute of Technology, Vellore; Mr. Sankar Viswanathan, Dr. Sekar Viswanathan, and
Mr. G. V. Selvam, Vice-Presidents, Vellore Institute of Technology, Vellore;
Dr. V. S. Kanchana Bhaaskaran, I/c Vice-Chancellor; Dr. Partha Sharathi Mallick,
Pro-Vice Chancellor; and Dr. S. Sumathy, Dean, School of Computer Science Engineering
and Information Systems, for providing an environment to work in and for their
inspiration during the tenure of the course.

In a jubilant mood, I sincerely express my whole-hearted thanks to Dr. Neelu Khare,
HoD/Professor, and to all teaching staff and members of our university for their
selfless enthusiasm and timely encouragement, which prompted the acquirement of the
requisite knowledge to finalize my course study successfully. I would like to thank my
parents for their support.

It is indeed a pleasure to thank my friends who persuaded and encouraged me to take up
and complete this task. Last but not least, I express my gratitude and appreciation to
all those who have helped me directly or indirectly toward the successful completion of
this project.

Place: Vellore E. GURU MOHAN

Date: Name of the student

CONTENTS
List of Tables
List of Figures
List of Abbreviations

1. Introduction
1.1. Theoretical Background
1.2. Motivation
1.3. Aim of the proposed Work
1.4. Objective(s) of the proposed work
1.5. Report Organization
2. Literature Survey
2.1. Survey of the Existing Models/Work
2.2. Gaps Identified in the Survey
2.3. Problem Statement
3. Overview of the Proposed System
4. Requirements Analysis and Design
4.1. Requirements Analysis
4.1.1. Functional Requirements
4.1.1.1. Product Perspective
4.1.1.2. Product Features
4.1.1.3. User Characteristics
4.1.1.4. Assumption & Dependencies
4.1.1.5. Domain Requirements
4.1.2. Non-Functional Requirements
4.1.3. System Modeling
4.1.4. Engineering Standard Requirements
4.1.5. System Requirements
4.1.5.1. Hardware Requirements
4.1.5.2. Software Requirements
4.2. System Design
4.2.1. System Architecture
4.2.2. Detailed Design
5. Implementation and Testing
6. Results and Discussion
7. Conclusion and Scope for Future Work
7.1. Conclusion
7.2. Scope for Future Work
Annexure – I - Sample Code
References

LIST OF TABLES

Table 6.1 Accuracy Table

LIST OF FIGURES

Fig 4.1 DFD
Fig 4.3 System Architecture
Fig 4.4 Class Diagram
Fig 4.5 Sequence Diagram
Fig 4.6 Activity Diagram
Fig 4.7 Use Case Diagram
Fig 4.8 Removing duplicate values
Fig 5.1 Dataset Image
Fig 6.1 Logistic regression
Fig 6.2 KNN
Fig 6.3 SVM
Fig 6.4 Decision tree
Fig 6.5 Random forest
Fig 6.6 Hyper tuned Logistic regression
Fig 6.7 Hyper tuned random forest
Fig 6.8 Hyper tuned model accuracy

LIST OF ABBREVIATIONS

Abbreviation Expansion
RF Random Forest
LR Logistic Regression
DT Decision Tree
KNN K-Nearest Neighbors
SVM Support Vector Machine

1. INTRODUCTION

1.1. THEORETICAL BACKGROUND

Developing a heart disease prediction system with machine learning algorithms requires
an understanding of its guiding principles and development process. The main elements
of the theoretical background are as follows:

Heart disease is a complex ailment influenced by a number of risk factors, including
age, gender, family history, smoking, obesity, high blood pressure, cholesterol levels,
diabetes, and lifestyle choices. Building precise prediction models requires a thorough
understanding of these risk factors and how they relate to heart disease.

Machine Learning Algorithms: Machine learning algorithms are computational models that
recognize patterns and predict outcomes from data without being explicitly programmed.
A variety of algorithms may be used to predict cardiac disease. Tree ensembles such as
Random Forest successfully handle high-dimensional data by combining the findings of
several trees.

Support Vector Machines (SVM): This effective approach can classify data in both
linear and nonlinear ways. By maximising the margin between them, it seeks to
identify an ideal hyperplane that divides various classes (such as people with and
without heart disease) from one another.
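
As a rough illustration of the hyperplane idea, the sketch below evaluates a linear decision function w.x + b and the margin width 2/||w|| that an SVM tries to maximise. The weights, bias, and two-feature points are made-up values chosen for illustration; they are not parameters fitted to the heart disease dataset, and no training is performed here.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Return 1 (e.g. heart disease) on the non-negative side of the hyperplane, else 0."""
    return 1 if decision_value(w, b, x) >= 0 else 0

def margin_width(w):
    """Distance between the margin boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane over two standardized features (illustrative values only).
w = [3.0, 4.0]
b = -1.0

print(predict(w, b, [1.0, 1.0]))   # 3 + 4 - 1 = 6, non-negative side -> 1
print(predict(w, b, [-1.0, 0.0]))  # -3 - 1 = -4, negative side -> 0
print(margin_width(w))             # 2 / 5 = 0.4
```

A smaller norm of w gives a wider margin, which is why SVM training is usually posed as minimising ||w|| subject to the points being classified correctly.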

The theoretical background through machine learning models also involves various
techniques for data pre-processing, feature extraction, and model optimization, such
as:

Data pre-processing

Feature extraction

Model optimization

1.2. MOTIVATION

Heart disease is one of the major causes of death nowadays. The predictive power of
machine learning motivated us to pursue this project, along with the considerations
described below.

Early Intervention and Prevention: One of the top causes of death worldwide is heart
disease. Implementing preventive interventions and lifestyle changes that can
dramatically lower the risk of developing cardiovascular problems requires early
illness detection. Healthcare workers can intervene early and avoid negative
consequences by using machine learning algorithms' timely predictions.

Data-Driven Insights: The availability of enormous amounts of patient data, including
electronic health records, medical imaging, genetic information, and lifestyle factors,
presents an opportunity for predictive analytics. Machine learning algorithms can
examine these data to find hidden correlations, trends, and risk factors for heart
disease. Using this information, healthcare professionals can make better decisions and
design individualized treatment regimens tailored to the needs of specific patients.

Decision Support Systems: Machine learning algorithms can be incorporated into
decision support systems, giving medical personnel insightful knowledge and
suggestions for managing and preventing cardiac disease. Clinical decisions made
with the aid of these technologies can be prioritized and optimized, improving patient
care and results.

Research and Development: Machine learning methods for heart disease prediction
are another source of research and advancement in the field of cardiovascular health.
The use of sophisticated algorithms promotes the investigation of novel risk factors,
the development of novel biomarkers, and the detection of hitherto unrecognized
correlations between variables. These results advance our knowledge of heart disease
and provide guidance for ongoing study.

1.3. AIM OF PROPOSED WORK

Heart disease prediction using machine learning algorithms offers the potential for
early intervention, personalized medicine, improved accuracy, and more efficient
healthcare delivery. By harnessing the power of data-driven insights, healthcare
providers can enhance their decision-making processes, leading to better patient
outcomes and a reduction in the global burden of heart disease. Hyperparameter tuning
of machine learning algorithms helps in the following ways:

Improve Model Performance: The main aim of hyperparameter tuning is to enhance
the performance of the machine learning model. By systematically adjusting the
values of hyperparameters, we can fine-tune the model to achieve better accuracy,
precision, recall, or other performance metrics.

Avoid Overfitting or Underfitting: Overfitting occurs when a model performs
exceptionally well on the training data but fails to generalize well to unseen data.
Underfitting, on the other hand, happens when the model fails to capture the
underlying patterns in the data. Proper hyperparameter tuning helps strike a balance
between overfitting and underfitting, leading to a model that performs well on unseen
data.

Optimize Resource Utilization: Hyperparameter tuning can also optimize the
utilization of computational resources. By fine-tuning the hyperparameters, we can
ensure that the model achieves the desired level of performance while avoiding
unnecessary computational complexity or excessive resource usage.

Robustness and Generalization: A well-tuned model is more likely to be robust and
generalize well to new, unseen data. By finding the optimal hyperparameters, we can
improve the model's ability to capture the underlying patterns in the data, making it
more reliable and effective in real-world scenarios.
1.4. OBJECTIVES OF PROPOSED WORK

• Find the patterns that affect heart disease.
• Carry out a comparative analysis of machine learning algorithms in predicting
whether a patient has heart disease or not.
• Determine the model that provides the best accuracy among the available
algorithms.
• Compare each algorithm with and without hyperparameter tuning in order to
find out how hyperparameter tuning affects its performance.
2. LITERATURE SURVEY

2.1. SURVEY OF THE EXISTING MODELS/WORK

S.no 1
Name: Heart disease prediction: a review
Author: Riyaz, Lubna
Methodology: This study offers a thorough examination of the effectiveness of several
machine learning techniques used for accurate cardiac disease prediction, diagnosis, and
treatment. Support vector machines (SVM), decision trees (DT), Naive Bayes (NB),
K-nearest neighbour (KNN), artificial neural networks (ANN), and other machine learning
techniques are investigated in the proposed research for the prediction of the presence
of cardiac diseases. Then, it was determined which technique performed best and worst
overall by calculating the average forecast accuracy for each one.
Metrics: Accuracy, F1-score

S.no 2
Name: A systematic review on ML approaches for cardiovascular disease prediction using
medical big data
Author: Azmi, Javed, et al.
Methodology: The study is an in-depth analysis of around forty-one papers related to
cardiovascular disease by using machine learning techniques. This study evaluates the
selected publications rigorously and identifies gaps in the available literature, making
it useful for researchers to develop and apply in clinical fields, primarily on datasets
related to heart disease.
Metrics: Accuracy, precision

S.no 3
Name: An integrated Machine Learning Techniques for Accurate Heart Disease Prediction
Author: A. A. Ahdal, M. Rakhra, S. Badotra and T. Fadhaeel
Methodology: This study compares and analyses the results of the UCI dataset using a
number of machine learning methodologies and several machine learning algorithms. Only
14 characteristics are used out of the 75 columns, and the confusion and accuracy
matrix are calculated. Several optimistic findings are confirmed. The dataset contains
several irrelevant attributes that were handled and normalised for better results.
Metrics: Accuracy, F1-score

S.no 4
Name: Machine Learning Technology Based Heart Disease Detection Models
Author: Umarani Nagavelli, Debabrata Samanta, Partha Chakraborty
Methodology: An effective heart disease prediction model (HDPM) is used, which includes
density based spatial clustering of applications with noise (DBSCAN) for outlier
detection and elimination, a hybrid synthetic minority over-sampling technique-edited
nearest neighbor (SMOTE-ENN) for balancing the training data distribution, and XGBoost
for heart disease prediction.
Metrics: Accuracy, recall, precision, rms

S.no 5
Name: Analyzing the impact of feature selection on the accuracy of heart disease
prediction
Author: Pathan, Muhammad Salman
Methodology: Machine learning classification models were investigated using complete and
reduced feature subsets as inputs for experimentation analysis. The trained classifiers
were evaluated based on Accuracy, Receiver Operating Characteristics (ROC) curve, and
F1-Score. The classification results of the models proved that there is a high impact
of relevant features on the classification accuracy.
Metrics: Accuracy, ROC

S.no 6
Name: Heart Disease Prediction using Machine Learning Techniques
Author: Shah, Devansh, Samir Patel, and Santosh Kumar Bhart
Methodology: This research paper presents various attributes related to heart disease,
and a model based on supervised learning algorithms such as Naïve Bayes, decision tree,
K-nearest neighbour, and random forest. It uses the existing dataset from the
Cleveland database of the UCI repository of heart disease patients. The dataset comprises
303 instances and 76 attributes. Of these 76 attributes, only 14 attributes are
considered for testing, important to substantiate the performance of different
algorithms. This research paper aims to envision the probability of developing heart
disease in the patients. The results portray that the highest accuracy score is
achieved with K-nearest neighbour.
Metrics: Accuracy

S.no 7
Name: Heart disease prediction using SVM
Author: Madhu H. K., D. Ramesh
Methodology: SVM, a supervised model, is implemented to predict heart attack. Thirteen
features are considered, which include personal details like chest pain type, blood
pressure, cholesterol level and heart rate. The implemented model is tested on the UCI
health care heart disease data set. The efficacy of the proposed model is justified
using performance and confusion matrix. The accuracy obtained is 83%.
Metrics: Accuracy

S.no 8
Name: Heart disease prediction using machine learning techniques (2022)
Author: V. Sharma, Yadav
Methodology: With the aid of conventional machine learning techniques, the authors of
the study also attempted to identify correlations between the various attributes that
were included in the dataset in order to use those connections to predict the
likelihood of developing heart disease. The results demonstrate that, when compared to
other ML approaches, Random Forest provides a more accurate prediction in a shorter
amount of time.
Metrics: Accuracy

2.2 GAPS IDENTIFIED IN THE SURVEY

It is clear from the literature review that machine learning techniques have produced
promising results in the prediction of heart disease. But there are still certain gaps
that must be filled. The following are some of the shortcomings found:

• Lack of standardization in data gathering and preprocessing techniques.
• Limited sample sizes in several studies, which can cause results to be
skewed.
• Insufficient comparison of various machine learning models.
• Some models have limited interpretability, making it challenging to
comprehend the underlying causes influencing the forecast.
• Incompatibility with clinical decision-making processes.

To close these gaps, further research is required to create standardized techniques for
data collection and preprocessing as well as to carry out extensive investigations to
confirm the efficacy of various machine learning models.

2.3. PROBLEM STATEMENT

Heart disease is one of the major curses of the present generation. There is no single
factor that can be identified as the cause of a person's heart disease.
Heart patients generally show different symptoms, and it is quite complex to find
patterns in them for heart disease detection. We would like to use data mining
techniques to figure out the patterns and machine learning classification to find the
model that gives the best accuracy, and to compare the machine learning algorithms to
see which gives better performance.
3. OVERVIEW OF THE PROPOSED SYSTEM

Heart disease, also known as cardiovascular disease, is a general term used to describe
a range of conditions that affect the heart and blood vessels. These conditions include
coronary artery disease, heart valve problems, arrhythmias, and heart muscle damage.
Heart disease is the leading cause of death worldwide and can be caused by a variety
of factors, including high blood pressure, high cholesterol, smoking, obesity, and
diabetes. Symptoms of heart disease may include chest pain, shortness of breath,
fatigue, and irregular heartbeats. According to the World Health Organization (WHO),
about 10 million people die of heart disease every year. One of
the major challenges present before the present generation is providing quality
services and diagnosis in health care. Despite the fact that heart disorders are now the
leading cause of death worldwide, they are also the ones that may be effectively
managed and controlled. When a disease is discovered at the right moment, it can be
managed with complete accuracy. In order to prevent negative effects, the suggested
work aims to identify these heart problems at an early stage.

This paper's major objective is to give clinicians a tool for early heart disease
detection. As a result, patients will receive effective care and serious repercussions
will be avoided. To uncover hidden discrete patterns and analyze the provided data,
machine learning (ML) plays a critical role. ML approaches aid in the early detection
and prediction of cardiac disease after data processing. This study examines the
effectiveness of different machine learning (ML) methods for early heart disease
prediction, including Naive Bayes, Decision Trees, Logistic Regression, and Random
Forest. We then perform hyperparameter tuning on each of these algorithms to find
the parameters that give the best performance on a given validation set.

4. REQUIREMENTS ANALYSIS AND DESIGN

4.1. REQUIREMENT ANALYSIS

4.1.1. Functional Requirements

4.1.1.1. Product Perspective

The heart disease prediction system will be a stand-alone web application that uses
machine learning algorithms to forecast the likelihood of heart disease. To generate a
prediction, it will need input from the user on their personal information, lifestyle
habits, and medical history. In order to make precise predictions, the system will use
machine learning models that have been trained on large datasets.

4.1.1.2. Product Features

• User authentication for privacy and security.
• Easy-to-use interfaces for better user understanding.
• Machine learning techniques that are reliable for precise prediction.
• Prediction results are presented clearly with the appropriate recommendations.

4.1.1.3. User Characteristics

The heart disease prediction system will be created for people concerned about their
risk of developing heart disease. Adults of any age and gender who want to
evaluate their risk for heart disease are eligible to use the service.

4.1.1.4. Assumption & Dependencies

In order to produce reliable forecasts, the heart disease prediction system relies on
users providing accurate and full data inputs. Prediction accuracy will also be
influenced by the caliber and volume of training data utilized to create the machine
learning models.

4.1.1.5. Domain Requirements

The heart disease prediction system has to abide by all applicable data security and
privacy legislation. Additionally, for the creation, validation, and deployment of
machine learning models, the system must adhere to best practices and industry
standards.

4.1.1.6. User Requirements

The heart disease prediction system has to abide by all applicable data security and
privacy legislation. Additionally, for the creation, validation, and deployment of
machine learning models, the system must adhere to best practices and industry
standards.

4.1.2. Non-functional requirements

4.1.2.1. Product Requirements

4.1.2.1.1. Efficiency

The system for heart disease prediction needs to be efficient in both time and space. It
should be possible for the system to handle user inputs quickly and produce predictions in
a timely manner. It should also be optimized to use the least amount of resources
possible, such as memory and computing power, to lessen its impact on system
performance.

4.1.2.1.2. Reliability

The technology used to forecast heart disease should be trustworthy and consistently
deliver accurate results. The system should handle a high volume of user requests and
keep running without any glitches or system failures. In order to guarantee that it
makes correct predictions under various circumstances and inputs, it should also
undergo thorough testing.

4.1.2.1.3. Portability

The system for predicting cardiac disease ought to be portable and usable on a range
of hardware and software. Any device with an internet connection and a current web
browser should be able to access the system. It should not need any further software
installation and should work with a variety of operating systems, including Windows,
MacOS, and Linux.

4.1.2.1.4. Usability

The system for predicting heart disease should be intuitive and simple to use. Data
entry instructions should be provided in a clear and straightforward manner, and the
system should be created with everyone's needs in mind. The system needs to be
usable by users with various levels of technical proficiency and ought to be able to
offer predictions and recommendations in a language that the user can easily
comprehend. Additionally, the system should be developed to reduce errors and give
users feedback on any faults or incorrect input.

4.1.3. System modeling

Fig: 4.1 Flow diagram with Hyperparameter tuning of

4.1.4. Engineering standard Requirements

• Economic

Heart disease prediction should be available at a viable cost that can be afforded by
all organizations. The project's economic feasibility will determine its sustainability
in society.

• Environmental

Since the project is software-based, its environmental impact ought to be negligible or
nonexistent. However, it is crucial to make sure that the project's implementation and
use do not have a harmful influence on the environment.

• Social

The heart prediction project aims to improve the social impact of healthcare by
providing an accurate and reliable tool for early detection of heart disease. The
project's social applicability is evident as it will improve the quality of healthcare and
potentially save lives.

• Political

The heart disease forecasting study does not directly affect politics. But it is crucial to
make sure the project conforms with all applicable healthcare legislation and norms.

• Health and Safety

The main goal of the initiative is to increase patient safety and healthcare outcomes by
early cardiac disease detection. As a result, the project must comply with all
applicable healthcare safety standards and laws.

• Ethical

The heart disease prediction project's ethical ramifications are quite important.
Making sure the initiative doesn't discriminate against any patient groups or result in
skewed judgements is crucial. The initiative should also guarantee the confidentiality
and privacy of patient information.

• Sustainability

In terms of its technical and financial viability, the heart disease prediction effort
needs to be long-lasting. The project's long-term viability will guarantee that healthcare
institutions can use it and gain from it in the future.

• Legality

Health-related rules and regulations must be followed by the heart disease prediction
project. No data security or privacy laws should be broken by the project.

• Inspectability

To make sure that the heart disease prediction project complies with all applicable
laws and ethical standards, it should be open to inspection by the appropriate
authorities and transparent. Additionally, the project must be inspectable in order to
spot and address any potential technical or operational problems.

4.1.5. System Requirements

4.1.5.1. Hardware Requirements

• Processor: Intel Core i5 or higher
• RAM: 8 GB or more
• Hard Disk Space: 50 GB or more
• Graphics Card: NVIDIA or AMD with at least 2 GB memory

4.1.5.2. Software Requirements

• Operating System: Windows 10 or Linux (Ubuntu)
• Python 3.6 or higher with required libraries like Scikit-learn, TensorFlow,
Keras, Pandas, NumPy, Matplotlib, etc.
• Integrated Development Environment (IDE): Jupyter Notebook, PyCharm or
Spyder

4.2 SYSTEM DESIGN

4.2.1 System Architecture

Fig: 4.2 System Architecture diagram of Heart Disease Prediction System

Fig:4.3 System Architecture diagram with hyperparameter tuning

4.2.2. Detailed Design

• Class Diagram

Fig: 4.4 Class Diagram of Heart Disease Prediction System

• Sequence Diagram

Fig: 4.5 Sequence Diagram of Heart Disease Prediction System

• Activity Diagram

Fig: 4.6 Activity Diagram of Heart Disease Prediction System

• Use Case Diagram

Fig: 4.7 Use Case Diagram of Heart Disease Prediction System

Modules Description

Data Collection

The dataset was collected from the IEEE repository. It has 1025 rows and 14 columns.
Columns:
1. age - age of the person
2. sex - male: 1, female: 0
3. cp - chest pain type
• 0 - Typical angina: chest pain related to decreased blood supply to the heart
• 1 - Atypical angina: chest pain not related to the heart
• 2 - Non-anginal pain: typically esophageal spasms (non-heart related)
• 3 - Asymptomatic: chest pain not showing signs of disease
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital);
anything above 130-140 is typically cause for concern
5. chol - serum cholesterol in mg/dl; serum = LDL + HDL + 0.2 * triglycerides;
above 200 is cause for concern
6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg - resting electrocardiographic results
• 0: Nothing to note
• 1: ST-T wave abnormality; can range from mild symptoms to severe problems;
signals non-normal heart beat
• 2: Possible or definite left ventricular hypertrophy; enlarged heart's main
pumping chamber
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest; looks at stress of
the heart during exercise; an unhealthy heart will stress more
11. slope - the slope of the peak exercise ST segment
• 0: Upsloping: better heart rate with exercise (uncommon)
• 1: Flatsloping: minimal change (typical healthy heart)
• 2: Downsloping: signs of unhealthy heart
12. target - have disease or not (1 = yes, 0 = no)

Data Preprocessing

Data preprocessing is the process of cleaning, transforming, and preparing raw data
into a format that is suitable for analysis. It is an essential step in data analysis and
machine learning, as raw data often contains errors, missing values, and
inconsistencies that need to be addressed before the data can be used effectively. The
dataset contains no null values, but duplicate rows are present. We removed the
duplicates using pandas and then separated the categorical and continuous features
into separate groups.

Fig :4.8 Removing Duplicate values
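
The preprocessing steps described above can be sketched without any third-party dependency. The report itself used pandas, where `DataFrame.isnull()` and `DataFrame.drop_duplicates()` cover the first two steps; the plain-Python version below only mirrors the idea. The sample rows, the helper names, and the rule "a column is categorical if it has at most `max_categories` distinct values" are assumptions made for illustration, not the author's actual code.

```python
def has_missing_values(rows):
    """True if any cell in any row is None (the stand-in here for a null value)."""
    return any(value is None for row in rows for value in row)

def drop_duplicate_rows(rows):
    """Remove exact duplicate rows, keeping the first occurrence and the original order."""
    return list(dict.fromkeys(rows))

def split_feature_types(rows, columns, max_categories):
    """Assumed heuristic: few distinct values -> categorical, otherwise continuous."""
    categorical, continuous = [], []
    for idx, name in enumerate(columns):
        distinct = {row[idx] for row in rows}
        if len(distinct) <= max_categories:
            categorical.append(name)
        else:
            continuous.append(name)
    return categorical, continuous

# Hypothetical sample with three of the dataset's columns.
columns = ["age", "sex", "chol"]
rows = [
    (63, 1, 233),
    (37, 1, 250),
    (63, 1, 233),  # exact duplicate of the first row
    (41, 0, 204),
    (56, 1, 236),
]

print(has_missing_values(rows))
unique_rows = drop_duplicate_rows(rows)
print(len(rows), "->", len(unique_rows))

# The threshold is kept tiny because the sample is tiny; a real dataset would need a larger one.
categorical, continuous = split_feature_types(unique_rows, columns, max_categories=2)
print(categorical, continuous)
```

Keeping the first occurrence of each duplicate matches the default behaviour of pandas `drop_duplicates()`.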

Feature Scaling

Feature scaling is a data preprocessing technique that is used to standardize the range
of values of input features. It is often used in machine learning, as some
algorithms are sensitive to the scale of input features and can perform poorly if the
features have different ranges or scales. The technique used here is standardization,
which scales the values of the input features to have zero mean and unit variance.

The formula for standardization is:


x_scaled = (x - mean(x)) / std(x)
where x is the original feature value,
mean(x) is the mean of the feature values,
std(x) is the standard deviation of the feature values, and x_scaled is the scaled feature

2
value.
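As a worked example of the formula, the sketch below standardizes a short list of made-up resting blood pressure readings using only the Python standard library. The function name standardize and the sample values are illustrative assumptions. The population standard deviation is used here because scikit-learn's StandardScaler also divides by the population (ddof = 0) standard deviation.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return (x - mean(x)) / std(x) for every x in values."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (ddof = 0)
    return [(v - mu) / sigma for v in values]

# Illustrative readings in mm Hg, not taken from the dataset.
resting_bp = [120.0, 130.0, 140.0, 150.0, 160.0]
scaled = standardize(resting_bp)

print([round(v, 3) for v in scaled])
print("mean:", round(fmean(scaled), 6), "std:", round(pstdev(scaled), 6))
```

After scaling, the values have mean 0 and standard deviation 1, so features measured in different units become comparable for distance-based models such as KNN and SVM.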

Hyper-parameter tuning

Hyperparameter tuning is the process of selecting the best set of hyperparameters for a
machine learning algorithm. Hyperparameters are parameters that are not learned
from data during training, but are set by the user before training begins. Examples
include the learning rate, the regularization parameter, and the number of hidden
layers in a neural network.

Grid search: This involves specifying a set of values for each hyperparameter and
training the model with all possible combinations of these values. The best
combination of hyperparameters is then selected based on the performance on a
validation set.
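A minimal sketch of grid search with scikit-learn's GridSearchCV is given below. It runs on synthetic data generated by make_classification rather than on the heart dataset, and the grid values are illustrative assumptions, not the ones used in this project. GridSearchCV scores every combination with k-fold cross-validation, so the validation set mentioned above is rotated across the folds of the training data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not the heart disease dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 4 values x 2 values = 8 candidate combinations (illustrative grid).
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"mean cross-validated accuracy: {search.best_score_:.3f}")
# By default the best combination is refitted on the whole training split,
# so the search object can be scored directly on the held-out test split.
print(f"test accuracy: {search.score(X_test, y_test):.3f}")
```

The number of fits grows with the product of the grid sizes, which is why large grids such as the random forest grid in Annexure-1 take a long time to run.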

Algorithms

• Logistic Regression

Logistic regression is a statistical method used to analyse the relationship between a
dependent variable and one or more independent variables and to predict the
probability of an event occurring. It is a type of regression analysis that is commonly
used for binary classification problems, where the dependent variable takes on only
two possible values (e.g., 0 or 1, Yes or No). The logistic function is typically
expressed as:

p(y=1|x;β) = 1 / (1 + exp(-z))

where p(y=1|x;β) is the probability of the dependent variable y being 1 given the
independent variables x and the parameters β. The quantity z is a linear combination
of the independent variables and the parameters, given by:

z = β_0 + β_1x_1 + β_2x_2 + ... + β_px_p

where β_0, β_1, β_2, ..., β_p are the parameters to be estimated and x_1, x_2, ..., x_p
are the independent variables.
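The two equations can be evaluated directly. The sketch below is illustrative only: the coefficient values, the input vector and the function name predict_proba are hypothetical and are not fitted to the heart dataset. The 0.5 threshold is the usual default for turning a probability into a class label.

```python
import math

def predict_proba(x, beta):
    """Return p(y=1|x) for a logistic model; beta[0] is the intercept β_0."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters (β_0, β_1, β_2) and one observation (x_1, x_2).
beta = [-1.0, 0.8, -0.5]
x = [2.0, 1.0]

p = predict_proba(x, beta)      # z = -1.0 + 1.6 - 0.5 = 0.1
label = 1 if p >= 0.5 else 0    # threshold the probability at 0.5

print(f"p(y=1|x) = {p:.4f}, predicted class = {label}")
```

When z is 0 the predicted probability is exactly 0.5, so the sign of z decides the predicted class.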

• KNN

K-Nearest Neighbours (KNN) is a supervised machine learning algorithm used for
classification and regression tasks. For classification, the KNN algorithm searches for
the K closest data points in the training dataset and assigns the new data point the
majority class label of those K nearest neighbours. To find the K nearest neighbours,
a distance metric is used to measure the similarity between data points in the feature
space. The most commonly used distance metric is Euclidean distance, which is
calculated as follows:

d(x, y) = sqrt((x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_n - y_n)^2)

where x and y are two data points in the feature space,
and n is the number of features.
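The distance formula and the majority vote can be sketched in a few lines of plain Python. The training points, labels and helper names below are made up for illustration; the project itself uses scikit-learn's KNeighborsClassifier.

```python
import math
from collections import Counter

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Two small, well separated groups of points (illustrative values).
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 5.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(euclidean((0.0, 0.0), (3.0, 4.0)))
print(knn_predict(train_X, train_y, (2.0, 2.0)))
print(knn_predict(train_X, train_y, (6.0, 6.5)))
```

An odd K avoids tied votes in a two-class problem. Because the distance sums squared differences over all features, features with large numeric ranges dominate unless the data is standardized first.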

• SVM

SVM stands for Support Vector Machine. The algorithm aims to find the optimal
hyperplane that separates the data points into different classes, with the widest
possible margin between the hyperplane and the closest data points. In SVM, each
data point is represented as a vector in a high-dimensional feature space. The
hyperplane is defined by a set of weights and a bias, which are learned from the
training data, through the equation:

w^T x + b = 0

where w is a vector of weights and b is the bias term. The signed distance between the
hyperplane and a data point x_i with class label y_i (+1 or -1) is given by:

d_i = y_i * (w^T x_i + b) / ||w||

where ||w|| is the Euclidean norm of the weight vector.
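The distance formula can be checked numerically. In the sketch below the weight vector, bias and data points are hypothetical values chosen so that the arithmetic is easy to follow; they are not learned from data, and the function name signed_distance is an assumption. Class labels are taken to be +1 and -1.

```python
import math

def signed_distance(w, b, x, y):
    """d = y * (w^T x + b) / ||w||, with class label y in {+1, -1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return y * score / norm

# Hypothetical hyperplane 3*x_1 + 4*x_2 - 5 = 0, so ||w|| = 5.
w, b = [3.0, 4.0], -5.0

d_pos = signed_distance(w, b, [3.0, 4.0], +1)    # (9 + 16 - 5) / 5
d_neg = signed_distance(w, b, [0.0, 0.0], -1)    # -1 * (-5) / 5
d_wrong = signed_distance(w, b, [0.0, 0.0], +1)  # wrong side of the hyperplane

print(d_pos, d_neg, d_wrong)
```

A positive value means the point lies on the side of the hyperplane that matches its label, and a negative value means it is misclassified. The margin that SVM maximizes is the smallest of these distances over the training points.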

• Decision Tree

A decision tree helps us to analyse a dataset and discover patterns or rules that can
forecast an important event. It is especially helpful for problems with discrete
decision-making processes and can handle categorical as well as numerical data.
Decision trees are renowned for their interpretability, since the resulting tree
structure enables researchers to comprehend the decision-making process and
identify the most crucial features impacting the prediction.
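The interpretability mentioned above can be illustrated with scikit-learn, which can print the learned rules and the importance of each feature. The sketch below uses synthetic data and placeholder feature names (f0 to f3); it is not fitted to the heart dataset, and the depth limit of 2 is chosen only to keep the printed tree short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = ["f0", "f1", "f2", "f3"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Human-readable if/else rules learned by the tree.
rules = export_text(tree, feature_names=names)
print(rules)

# Share of the total impurity reduction contributed by each feature.
importances = dict(zip(names, tree.feature_importances_))
print(importances)
```

Each printed branch is a threshold test on one feature, so a prediction can be traced from the root to a leaf by hand.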

• Random forest

Random forest classification is a supervised machine learning algorithm that
combines multiple decision tree classifiers to improve the accuracy and stability of
the classification. In random forest classification, a number of decision trees are built
independently and randomly from different subsets of the training data, and their
predictions are combined to obtain the final classification. Each decision tree is
constructed by selecting a random subset of the features and a random subset of the
training data and using the decision tree algorithm to build a tree based on these
subsets.
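The way the individual trees are combined can be inspected directly in scikit-learn. The sketch below uses synthetic data and an arbitrary forest size of 25 trees, both of which are illustrative assumptions. In scikit-learn's implementation each tree is trained on a bootstrap sample, the random feature subset is drawn at every split, and the forest averages the class probabilities of its trees instead of counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the heart disease dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=25,       # number of trees (illustrative)
    max_features="sqrt",   # size of the random feature subset tried at each split
    bootstrap=True,        # each tree sees a bootstrap sample of the rows
    random_state=0,
).fit(X, y)

# Average the per-tree class probabilities for the first five rows by hand.
per_tree = np.stack([t.predict_proba(X[:5]) for t in forest.estimators_])
averaged = per_tree.mean(axis=0)
forest_proba = forest.predict_proba(X[:5])

print(np.round(averaged, 3))
print(np.round(forest_proba, 3))
```

The two printed arrays agree, which shows that the forest prediction is the average over its trees. Averaging many decorrelated trees is what reduces the variance of a single deep tree.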

Model Evaluation

Accuracy is a measure of the performance of a classification model and is calculated
as the proportion of correctly classified instances over the total number of instances in
the dataset. It is typically expressed as a percentage. Accuracy can be calculated as
follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where
TP is the number of true positives (i.e., instances that are correctly classified as
positive),
TN is the number of true negatives (i.e., instances that are correctly classified as
negative),
FP is the number of false positives (i.e., instances that are incorrectly classified as
positive),
FN is the number of false negatives (i.e., instances that are incorrectly classified as
negative).
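As a worked example, the sketch below computes accuracy from the four counts, together with the precision, recall and F1-score that are also reported during model evaluation and are derived from the same counts. The counts are hypothetical and do not come from the experiments in this report, and the function name classification_metrics is an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard metrics from confusion-matrix counts.

    Assumes at least one predicted positive and one actual positive,
    so that no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for a test set of 100 patients.
metrics = classification_metrics(tp=50, tn=35, fp=10, fn=5)

for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

In a medical setting recall deserves particular attention, because a false negative corresponds to a patient with heart disease who is not flagged.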

5. IMPLEMENTATION AND TESTING

1. Data Collection

- Source: Obtained dataset from IEEE with 1025 rows and 14 columns containing
attributes related to heart disease.
- Attributes: Columns include Age, Sex, Chest Pain Type, Resting Blood Pressure,
Serum Cholesterol, Fasting Blood Sugar, Resting Electrocardiographic Results,
Maximum Heart Rate, Exercise Induced Angina, ST Depression, Slope, and Target
(presence of heart disease).

2. Data Preprocessing

- Data Cleaning: Removed duplicate values using pandas library, ensuring data
integrity.
- Null Handling: No null values found in the dataset, eliminating the need for
imputation.
- Categorical Encoding: Converted categorical variables into numerical
representations for machine learning algorithms.

3. Feature Scaling

- Standardization: Rescaled the input features to zero mean and unit variance so
that all variables share a comparable scale, which helps scale-sensitive algorithms.

4. Hyperparameter Tuning

- Grid Search: Explored various combinations of hyperparameters for each machine
learning algorithm to optimize performance.
- Cross-Validation: Utilized cross-validation techniques to assess model
generalization and prevent overfitting.

5. Machine Learning Algorithms

- Implementation: Developed models using Logistic Regression, K-Nearest
Neighbors, Support Vector Machines, Decision Trees, and Random Forest algorithms.
- Training: Trained each model on preprocessed data to learn patterns and make
predictions.

6. Model Evaluation

- Performance Metrics: Evaluated models using accuracy, precision, recall, and F1-
score to assess predictive power.
- Cross-Validation: Ensured robustness of models by validating performance across
multiple data subsets.

7. System Architecture

- Design: Developed a stand-alone web application with user authentication for
privacy and security.
- Integration: Integrated machine learning models into the application for precise
prediction.
- User Interface: Implemented user-friendly interfaces for improved understanding
and interaction.

Testing is a crucial phase in the development of the Heart Disease Prediction
System to ensure its reliability, accuracy, and usability. The testing process
encompasses various stages, including unit testing, integration testing, system testing,
performance testing, security testing, and user acceptance testing (UAT).

During unit testing, each module of the system is tested individually to verify its
functionality. This involves validating data processing, model training, and other
functionalities to ensure they perform as expected. Additionally, input-output
validation is conducted to confirm correct data flow between different components.

Integration testing focuses on testing the integration of components to ensure
seamless interaction. This involves testing how well different modules work together
and validating the data exchange between them. Input-output validation is again
performed to ensure that data is transmitted accurately between integrated
components.

System testing evaluates the overall functionality of the system, including user
authentication, prediction accuracy, and recommendation generation. It assesses the
system's performance against specified requirements and evaluates the user interface
for ease of use and clarity.

Performance testing measures the efficiency and scalability of the system under
various loads. This involves assessing system response times, resource utilization, and
scalability to ensure that the system can handle increasing user loads without
performance degradation.

Security testing is conducted to identify and address potential security risks, ensuring
data privacy and integrity. Vulnerability assessments are performed to identify
security weaknesses, and penetration testing is conducted to test the system's
resilience to cyber threats.

User Acceptance Testing (UAT) involves gathering feedback from end-users to assess
the system's usability and satisfaction. Users provide feedback on the user interface,
functionality, and overall experience, and any issues raised are addressed to improve
system performance.

Finally, deployment testing validates the deployment of the system in a production
environment and monitors its performance. Post-deployment testing is conducted to
ensure system stability and reliability in real-world scenarios.

Through comprehensive testing, the Heart Disease Prediction System aims to deliver
a robust, reliable, and user-friendly solution for predicting heart disease risk and
assisting healthcare professionals in making informed decisions.

DATASET

Dataset Description

The dataset used for developing the Heart Disease Prediction System consists of 1025
instances and 14 attributes. These attributes provide essential information related to
various factors associated with heart disease.

- Age: Represents the age of the individual.

- Sex: Indicates the gender of the individual, with 1 representing male and 0
representing female.

- Chest Pain Type (CP): Describes the type of chest pain experienced by the
individual, categorized into four types.

- Resting Blood Pressure (Trestbps): Indicates the resting blood pressure of the
individual upon admission to the hospital.

- Serum Cholesterol (Chol): Represents the serum cholesterol level in mg/dl, which
is a combination of LDL, HDL, and triglycerides.

- Fasting Blood Sugar (Fbs): Indicates whether the individual has fasting blood
sugar levels above 120 mg/dl.

- Resting Electrocardiographic Results (Restecg): Provides information on the
resting electrocardiographic results, which can range from normal to abnormal.

- Maximum Heart Rate Achieved (Thalach): Indicates the maximum heart rate
achieved during testing.

- Exercise-Induced Angina (Exang): Indicates whether the individual experiences
exercise-induced angina.

- ST Depression Induced by Exercise (Oldpeak): Represents the ST depression
induced by exercise relative to rest.

- Slope: Describes the slope of the peak exercise ST segment.

- Number of Major Vessels (Ca): Indicates the number of major vessels colored by
fluoroscopy.

- Thalassemia: Represents a blood disorder, categorized into three types.

- Target: Denotes whether the individual has heart disease (1) or not (0).

This dataset provides a comprehensive range of attributes that are essential for
predicting heart disease risk and assisting in diagnosis and treatment decisions. By
analyzing these attributes, the Heart Disease Prediction System aims to provide
accurate and reliable predictions to support healthcare professionals in delivering
personalized care to individuals at risk of heart disease.

Fig:4.9 Dataset of Heart Disease Prediction system

Implementation Setup

Hardware Setup

• Computer: A standard laptop or desktop computer with sufficient processing
power and memory to handle the data processing and machine learning tasks.
• Processor: A multicore processor (e.g., Intel Core i5 or higher) for faster
computation of machine learning algorithms.
• Memory (RAM): At least 8 GB of RAM to accommodate the data processing
and training of machine learning models efficiently.
• Storage: Adequate storage space to store datasets, libraries, and project files.
SSD storage is preferable for faster read/write speeds.

Software Setup

• Operating System: Windows, macOS, or Linux-based operating systems are
suitable for running machine learning projects.
• Python: The Python programming language is the primary language used for
data analysis, machine learning, and model development.
• Python Libraries: Install the following libraries using the pip or conda package
manager:
- NumPy: For numerical computing and array operations.
- Pandas: For data manipulation and preprocessing tasks.
- Scikit-learn: For implementing machine learning algorithms and model
evaluation.
- Matplotlib and Seaborn: For data visualization and plotting.
• Integrated Development Environment (IDE): Choose an IDE for Python
development, such as PyCharm, Jupyter Notebook, or VS Code, to write,
execute, and debug Python code efficiently.
• Version Control: Optionally, use a version control system like Git and a
platform like GitHub to manage project versions and collaborate with team
members.

Frameworks and Tools

• Machine Learning Frameworks: Install additional machine learning
frameworks and libraries for advanced model development and
experimentation, such as TensorFlow or PyTorch.
• Grid Search Tools: Use grid search tools like GridSearchCV from Scikit-learn
for hyperparameter tuning.
• Data Collection Tools: Depending on the data source, web scraping tools or
APIs can be used to collect data from online repositories or databases.
• Database Management System: If working with large datasets stored in
databases, consider installing database management systems like MySQL or
PostgreSQL for efficient data retrieval and storage.

By setting up the hardware, installing the necessary software components, and
leveraging frameworks and tools, you can create an efficient and productive
environment for implementing the Heart Disease Prediction System and conducting
machine learning experiments effectively.

6. RESULTS AND DISCUSSION

Logistic Regression

KNN

SVM

Decision tree

Random Forest

Table 1: Models accuracy

In the initial implementation without hyperparameter tuning, the logistic regression
model demonstrated decent performance with a training accuracy of approximately
85.56% and a testing accuracy of 84.78%. This indicates that the model generalized
well to unseen data, with both training and testing accuracies showing consistency.
However, the KNN model outperformed logistic regression in terms of training
accuracy, achieving approximately 88.56%, but exhibited the same testing accuracy
of 84.78%. This suggests that while KNN performed slightly better on the training set,
its performance on the testing set remained similar to logistic regression,
indicating potential overfitting or limitations in capturing the underlying patterns in
the data.

On the other hand, the SVM model displayed perfect training accuracy of 100%,
implying that it perfectly separated the training data into distinct classes. However,
this high training accuracy did not translate well to the testing set, where the accuracy
dropped to 75%. This discrepancy suggests that the SVM model may have overfit the
training data, failing to generalize effectively to unseen data. Similarly, the decision
tree model also achieved perfect training accuracy of 100% but exhibited a testing
accuracy of only 75%. This indicates that the decision tree model may have
memorized the training data, resulting in poor performance on new data instances.
Lastly, the random forest model showed perfect training accuracy like the SVM and
decision tree models but displayed a testing accuracy similar to logistic
regression and KNN, indicating that while it performed well on the training data, its
ability to generalize to unseen data was limited, albeit better than SVM and decision
tree.
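The gap between training and testing accuracy discussed above can be reproduced in miniature. The sketch below is an illustration on synthetic, deliberately noisy data and does not use the heart dataset or reproduce the reported numbers. It compares an unconstrained decision tree, which can memorize the training split, with a depth-limited one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y injects label noise that a model can only memorize.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def train_test_accuracy(model):
    """Fit on the training split and return (training accuracy, testing accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

deep_train, deep_test = train_test_accuracy(DecisionTreeClassifier(random_state=0))
shallow_train, shallow_test = train_test_accuracy(
    DecisionTreeClassifier(max_depth=3, random_state=0)
)

print(f"unconstrained tree: train {deep_train:.2%}, test {deep_test:.2%}")
print(f"max_depth=3 tree:   train {shallow_train:.2%}, test {shallow_test:.2%}")
```

A training accuracy of 100% therefore says little on its own; the testing accuracy, or better a cross-validated score, is the figure that indicates how the model behaves on unseen patients.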

Hyper tuned algorithms

1. Hypertuned Logistic Regression

2. Hypertuned KNN

3. Hypertuned SVM

4. Hypertuned Decision tree

5. Hypertuned Random Forest

Table 2: Hypertuned-models accuracy

In the implementation with hyperparameter tuning, the models exhibited
varying degrees of improvement in their testing accuracies compared to the untuned
models. Logistic regression, after hyperparameter tuning, showed a slight increase in
both training and testing accuracies, with training accuracy rising to approximately
85.69% and testing accuracy to approximately 85.33%. While KNN maintained its
high training accuracy of approximately 88.56%, its testing accuracy remained the
same at approximately 84.78%, suggesting that hyperparameter tuning had minimal
impact on its generalization performance. Conversely, the support vector machine
model displayed a noticeable improvement in testing accuracy after tuning, reaching
approximately 83.70% from the initial 75%, indicating that the tuned hyperparameters
enhanced its ability to generalize to unseen data. Similarly, the decision tree classifier
and random forest classifier also demonstrated improvements in testing accuracies
after hyperparameter tuning, although their performance gains were not as substantial
as that of the SVM model.

However, despite the improvements seen in several models, some challenges
persisted. The decision tree classifier, despite achieving perfect training accuracy,
exhibited a lower testing accuracy of approximately 80.43% after hyperparameter
tuning, suggesting potential overfitting or limitations in its ability to generalize.
Additionally, while the training accuracy of the random forest classifier changed
from 100% to approximately 93.32% after tuning, its testing accuracy remained
the same as before, at approximately 85.33%, indicating that further adjustments
might be necessary to
optimize its generalization performance. Overall, hyperparameter tuning led to
enhancements in the models' testing accuracies, particularly in logistic regression,
SVM, and to a lesser extent, decision tree and random forest classifiers, highlighting
the importance of fine-tuning model parameters to improve predictive performance on
unseen data.

7.CONCLUSION AND SCOPE FOR FUTURE WORKS

It can be concluded that heart disease prediction using machine learning algorithms
supports early identification and individualized risk assessment. By utilizing the
power of data analysis and pattern recognition, machine learning algorithms are able
to analyze many risk variables and provide precise predictions about a person's
likelihood of acquiring heart disease. This makes early intervention, preventive
measures, and individualized treatment programs possible, which has the potential to
revolutionize healthcare.

Large datasets are analyzed to uncover intricate connections between risk variables
and heart disease outcomes using machine learning methods including logistic
regression, decision trees, random forests, and support vector machines. These
algorithms can extract significant insights and produce predictions with a high degree
of accuracy through feature selection, engineering, hyper-parameter tuning, and
model training. Hypertuned logistic regression gives us the maximum accuracy.
Hyperparameter tuning improved the accuracy of most of the models.

Given the results of the analysis, it is evident that hyperparameter tuning has led to
improvements in the model accuracies, particularly noticeable in the Support Vector
Machine (SVM) and, to a smaller degree, the Logistic Regression and Decision Tree
models. However, some models, such as KNN, showed no change in testing accuracy
after hyperparameter tuning, suggesting that further exploration is needed to optimize
these models effectively. The hyperparameters used in the tuning process might not
have been optimal for certain models. Further experimentation with a wider range of
hyperparameters or different tuning techniques like Bayesian optimization could
yield better results.

The current analysis might not have utilized the full potential of the dataset.
Exploring additional feature engineering techniques or incorporating domain
knowledge to create new features could enhance the predictive power of the models.
Ensemble methods like stacking or boosting could be explored to combine the
strengths of multiple models and mitigate their individual weaknesses. This approach
often leads to more robust and accurate predictions by leveraging the diversity of
different models. If the dataset is limited or contains missing values, techniques such
as data augmentation or imputation could be employed to generate synthetic data or
fill in missing values, respectively. This could help improve the generalization and
reliability of the models.

While complex models like Random Forests can offer high accuracy, they often lack
interpretability. Exploring interpretable models like decision trees or logistic
regression could provide valuable insights into the factors influencing the predictions,
which is crucial for real-world applications where interpretability is important.

In conclusion, while the current analysis has provided valuable insights into the
performance of various machine learning models for predicting heart disease, there
are still opportunities for further refinement and exploration. By fine-tuning
hyperparameters, exploring new features, leveraging ensemble methods, and focusing
on interpretability, future works can continue to advance the accuracy and reliability
of predictive models in this domain.

ANNEXURE-1

Sample Code

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
df=pd.read_csv("C:\\Users\\surya\\Downloads\\csv files\\heart_statlog_cleveland_hungary_final.csv")
df.head()
df.tail()
df.describe()
# conclusion: there are no missing values in the given dataset
df.isnull().sum()
df.duplicated().sum()
df.drop_duplicates(inplace=True)
# separate categorical and continuous variables:
# a column with fewer than 10 distinct values is treated as categorical
category,continuous=[],[]
for i in df.columns:
    if len(np.unique(df[i]))<10:
        category.append(i)
    else:
        continuous.append(i)
print("category=", category)
print("continuous=", continuous)
print(category)
print(continuous)
newcatdf=df[category]
newcondf=df[continuous]
sns.jointplot(y='target',x='age',data=df)
sns.countplot(x='sex',hue='target',data=df)
sns.boxplot(x='target',y='age',data=df)
sns.countplot(x='chest pain type',hue='target',data=df)
features=enumerate(category)
# count plot of every categorical variable, split by target
plt.figure(figsize=(15,30))
for i in enumerate(category):
    plt.subplot(3,3,i[0]+1)
    sns.countplot(x=i[1],hue='target',data=df)
    plt.xlabel(i[1])
# histogram of every continuous variable, one colour per target class
plt.figure(figsize=(15, 15))
for i, column in enumerate(continuous, 1):
    plt.subplot(3, 2, i)
    df[df["target"] == 0][column].hist(bins=35, color='blue', label='Have Heart Disease = 0', alpha=0.6)
    df[df["target"] == 1][column].hist(bins=35, color='red', label='Have Heart Disease = 1', alpha=0.6)
    plt.legend()
    plt.xlabel(column)
from sklearn.preprocessing import StandardScaler

s_sc = StandardScaler()
col_to_scale = continuous
df[col_to_scale] = s_sc.fit_transform(df[col_to_scale])
# take a copy so that adding the target column does not write into a slice of df
continuous_data=df[continuous].copy()
continuous_data.head(5)
continuous_data['target']=df['target']
continuous_data.corr(method='pearson')
plt.figure(figsize=(20,12))
sns.set_context('notebook',font_scale = 1.3)
sns.heatmap(df.corr(),annot=True,linewidth =2)
plt.tight_layout()
def print_score(clf, xtrain, ytrain, xtest, ytest, train=True):
    # print accuracy, classification report and confusion matrix
    # for the training split (train=True) or the testing split (train=False)
    if train:
        pred = clf.predict(xtrain)
        clf_report = pd.DataFrame(classification_report(ytrain, pred, output_dict=True))
        print("Train Result:\n**********************************************************")
        print(f"Accuracy Score: {accuracy_score(ytrain, pred) * 100:.2f}%")
        print(" ")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print(" ")
        print(f"Confusion Matrix: \n {confusion_matrix(ytrain, pred)}\n")
    else:
        pred = clf.predict(xtest)
        clf_report = pd.DataFrame(classification_report(ytest, pred, output_dict=True))
        print("Test Result:\n*********************************************************")
        print(f"Accuracy Score: {accuracy_score(ytest, pred) * 100:.2f}%")
        print(" ")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print(" ")
        print(f"Confusion Matrix: \n {confusion_matrix(ytest, pred)}\n")

x=df.drop(['target'],axis=1)
y=df['target']
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=2)
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(xtrain,ytrain)
pred=model.predict(xtest)

from sklearn.metrics import accuracy_score
accuracy_score(pred,ytest)
log_reg = LogisticRegression(solver='liblinear')
log_reg.fit(xtrain, ytrain)

print_score(log_reg, xtrain, ytrain, xtest, ytest, True)
print_score(log_reg, xtrain, ytrain, xtest, ytest, False)
test_score = accuracy_score(ytest, log_reg.predict(xtest)) * 100
train_score = accuracy_score(ytrain, log_reg.predict(xtrain)) * 100

results = pd.DataFrame(data=[["Logistic Regression", train_score, test_score]],
                       columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results
knn = KNeighborsClassifier()
knn.fit(xtrain, ytrain)
print_score(knn, xtrain, ytrain, xtest, ytest, True)
print_score(knn, xtrain, ytrain, xtest, ytest, False)
test_score = accuracy_score(ytest, knn.predict(xtest)) * 100
train_score = accuracy_score(ytrain, knn.predict(xtrain)) * 100

results2 = pd.DataFrame(data=[["knn", train_score, test_score]],
                        columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
results = pd.concat([results, results2], ignore_index=True)
results
svm = SVC(kernel='rbf', gamma=0.1, C=1.0)
svm.fit(xtrain, ytrain)

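# hyperparameter grid for the decision tree; it must exist before the grid search uses it
params = {"criterion":("gini", "entropy"),
          "splitter":("best", "random"),
          "max_depth":(list(range(1, 20))),
          "min_samples_split":[2, 3, 4],
          "min_samples_leaf":list(range(1, 20))
          }

tree_clf = DecisionTreeClassifier(random_state=42)
tree_cv = GridSearchCV(tree_clf, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=5)
tree_cv.fit(xtrain, ytrain)
best_params = tree_cv.best_params_
print(f'Best_params: {best_params}')
# refit a tree with the best combination found by the grid search
tree_clf = DecisionTreeClassifier(**best_params)
tree_clf.fit(xtrain, ytrain)
print_score(tree_clf, xtrain, ytrain, xtest, ytest, train=True)
print_score(tree_clf, xtrain, ytrain, xtest, ytest, train=False)
test_score = accuracy_score(ytest, tree_clf.predict(xtest)) * 100
train_score = accuracy_score(ytrain, tree_clf.predict(xtrain)) * 100

results3 = pd.DataFrame(
    data=[["Tuned Decision Tree Classifier", train_score, test_score]],
    columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
results = pd.concat([results, results3], ignore_index=True)
results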
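n_estimators = [500, 900, 1100, 1500]
# 'auto' was an alias of 'sqrt' for classifiers and is rejected by scikit-learn >= 1.3
max_features = ['sqrt']
# knn_clf is the tuned KNN classifier; the grid search that builds it is not part of this listing
test_score = accuracy_score(ytest, knn_clf.predict(xtest)) * 100
train_score = accuracy_score(ytrain, knn_clf.predict(xtrain)) * 100

results3 = pd.DataFrame(data=[["hyper tuned Knn", train_score, test_score]],
                        columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
results = pd.concat([results, results3], ignore_index=True)
results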
params = {"criterion":("gini", "entropy"),
"splitter":("best", "random"),
"max_depth":(list(range(1, 20))),
"min_samples_split":[2, 3, 4],
"min_samples_leaf":list(range(1, 20))
}


max_depth = [2, 3, 5, 10, 15, None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]

# hyperparameter grid for the random forest
params_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf
}

rf_clf = RandomForestClassifier(random_state=42)
rf_cv = GridSearchCV(rf_clf, params_grid, scoring="accuracy", cv=5, verbose=1, n_jobs=-1)
rf_cv.fit(xtrain, ytrain)
best_params = rf_cv.best_params_
print(f"Best parameters: {best_params}")

# refit a forest with the best combination found by the grid search
rf_clf = RandomForestClassifier(**best_params)
rf_clf.fit(xtrain, ytrain)

print_score(rf_clf, xtrain, ytrain, xtest, ytest, train=True)

continuous_data.corr(method='pearson')
continuous_data['target']=df['target']
continuous_data.head()
continuous_data.corr(method='pearson')
continuous_data.corr(method='kendall')
# take a copy so that adding the target column does not write into a slice of df
continuousdf=df[continuous].copy()
continuousdf
continuousdf['target']=df['target']
continuousdf.head()
df.age.max()
df.age.min()
df.trestbps.max()
continuousdf.describe()
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df.drop_duplicates(inplace=True)
# drop() returns a new frame; assign it so that 'target' is really excluded from the features
data=df.drop(['target'],axis=1)
xtrain,xtest,ytrain,ytest=train_test_split(data,df['target'],train_size=0.7)
model=LogisticRegression()
model.fit(xtrain,ytrain)
pred=model.predict(xtest)
score=accuracy_score(ytest,pred)
score

