HEART DISEASE DETECTION

USING MACHINE LEARNING

A PROJECT REPORT
Submitted by
PRAVEEN PRAKASH TIWARI [RA2011050010003]

Under the Guidance of

Dr. P. RAJASEKAR
Associate Professor, Department of Data Science and Business Systems

in partial fulfillment of the requirements for the degree of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
with specialization in BLOCKCHAIN TECHNOLOGY

DEPARTMENT OF DATA SCIENCE AND BUSINESS SYSTEMS

COLLEGE OF ENGINEERING AND TECHNOLOGY
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR - 603 203
MAY 2024
Department of Data Science and Business Systems
SRM Institute of Science & Technology
Own Work* Declaration Form

Degree/ Course : B.Tech. - Computer Science and Engineering with

specialization in Blockchain Technology

Student Name : Praveen Prakash Tiwari

Registration Number : RA2011050010003

Title of Work : Heart Disease Detection Using Machine Learning

I hereby certify that this assessment complies with the University’s Rules and Regulations
relating to academic misconduct and plagiarism, as listed on the University website,
Regulations, and the Education Committee guidelines.

I confirm that all the work contained in this assessment is my own except where indicated,
and that I have met the following conditions:

• Clearly referenced / listed all sources as appropriate.
• Referenced and put in inverted commas all quoted text (from books, web, etc.).
• Given the sources of all pictures, data, etc. that are not my own.
• Not made any use of the report(s) or essay(s) of any other student(s), either past or
present.
• Acknowledged in appropriate places any help that I have received from others (e.g.
fellow students, technicians, statisticians, external sources).
• Complied with any other plagiarism criteria specified in the Course handbook /
University website.

I understand that any false claim for this work will be penalized in accordance with the
University policies and regulations.

DECLARATION:

I am aware of and understand the University’s policy on academic misconduct and plagiarism,
and I certify that this assessment is my own work, except where indicated by referencing, and that
I have followed the good academic practices noted above.

Praveen Prakash Tiwari

RA2011050010003

SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203

BONAFIDE CERTIFICATE

Certified that 18CSP109L project report titled “HEART DISEASE
DETECTION USING MACHINE LEARNING” is the bonafide work of
“PRAVEEN PRAKASH TIWARI [RA2011050010003]” who carried out
the project work under my supervision. Certified further, that to the best of
my knowledge the work reported herein does not form any other project
report or dissertation on the basis of which a degree or award was conferred
on an earlier occasion on this or any other candidate.

SIGNATURE
Dr. P. Rajasekar
SUPERVISOR
Associate Professor
Department of Data Science and Business Systems

SIGNATURE
Prof. Dr. M. Lakshmi
HEAD OF THE DEPARTMENT
Professor
Department of Data Science and Business Systems

Examiner I
Examiner II

ACKNOWLEDGEMENT
I express my humble gratitude to Dr C. Muthamizhchelvan, Vice-Chancellor, SRM Institute
of Science and Technology, for the facilities extended for the project work and his continued
support.

I extend my sincere thanks to Dean-CET, SRM Institute of Science and Technology, Dr
T.V. Gopal, for his valuable support.

I wish to thank Dr Revathi Venkataraman, Professor & Chairperson, School of Computing,
SRM Institute of Science and Technology, for her support throughout the project work.

I am incredibly grateful to our Head of the Department, Dr. M. Lakshmi, Professor, Department
of Data Science and Business Systems, SRM Institute of Science and Technology, for her
suggestions and encouragement at all the stages of the project work.

I express my deep respect and gratitude to my project guide, Dr. P. Rajasekar, Associate
Professor, Department of Data Science and Business Systems, SRM Institute of Science and
Technology, for providing me with an opportunity to pursue the major project work under his
guidance, mentorship and continuous support.

ABSTRACT

Heart disease is a major cause of death worldwide, and lowering death rates from this
condition depends on early detection. By analyzing a variety of patient data, machine
learning (ML) techniques have demonstrated considerable potential for accurately predicting the
risk of heart disease.

ML models used in the identification of heart diseases include logistic regression, decision
trees, support vector machines, and deep learning methods like convolutional and recurrent
neural networks.

Yet, there are a number of challenges in the way of creating precise ML models for heart
disease identification. One such challenge is data quality since the accuracy of the model is
dependent on the quality of the input data.

Predictions can also be off due to imbalanced datasets, in which one class has substantially
more samples than the other. Model interpretability is also vital in developing accurate ML
models.

Regardless of these difficulties, ML-based heart disease detection holds promise for
revolutionizing the identification and management of heart diseases.

The accurate prediction of heart disease can result in earlier intervention, better patient
outcomes, and lower medical expenses.

To fully achieve the potential of machine learning (ML)-based cardiac illness diagnosis in
terms of bettering patient outcomes and lowering healthcare costs, more research in this area
is required.
TABLE OF CONTENTS

ACKNOWLEDGEMENT

ABSTRACT

LIST OF FIGURES

LIST OF TABLES

ABBREVIATIONS

1 INTRODUCTION
1.1 General
1.2 Purpose
1.3 Scope
1.4 Motivation and Problem Statement
1.5 Research Objectives

2 LITERATURE REVIEW

3 SYSTEM ARCHITECTURE AND DESIGNS
3.1 Data Collection and Pre-processing
3.2 Feature Engineering and Selection
3.3 Model Selection and Training
3.4 System Integration
3.5 Deployment and Monitoring

4 METHODOLOGY
4.1 Random Forest Classifier
4.2 Decision Tree
4.3 Ensemble Technique
4.4 XG Boost
4.5 Logistic Regression
4.6 Naive Bayes
4.7 Support Vector Machine
4.8 KNN

5 RESULTS AND DISCUSSIONS

6 CONCLUSION AND FUTURE ENHANCEMENT

7 REFERENCES

APPENDIX
A. CODING
B. PUBLICATION DETAILS
LIST OF FIGURES

3.1 Architecture Flow of the Proposed Work
5.1 Plotting the confusion matrix
5.2 Correlation matrix
5.3 Pair plot
5.4 Subplot analysis
5.5 Count plot analysis

LIST OF TABLES

5.1 Analysis of data
5.2 Machine learning part

ABBREVIATIONS
KNN K-NEAREST NEIGHBORS
SVM SUPPORT VECTOR MACHINE
XG BOOST EXTREME GRADIENT BOOSTING
DAPPS DECENTRALISED APPLICATION
ML MACHINE LEARNING

CHAPTER 1

INTRODUCTION

Heart disease is a leading cause of mortality worldwide, highlighting the urgent need for
accurate and efficient diagnostic tools. With advancements in technology, machine learning (ML)
has emerged as a promising approach for early detection and risk assessment of cardiovascular
diseases. By leveraging large datasets containing patient information and clinical parameters, ML
algorithms can analyze complex patterns and provide valuable insights for healthcare professionals.

This project aims to explore the application of machine learning techniques in the detection
of heart disease. By utilizing datasets containing a multitude of features such as age, gender, blood
pressure, cholesterol levels, and lifestyle habits, we seek to develop predictive models capable of
identifying individuals at risk of heart disease. Through this endeavor, we aspire to contribute to
improved patient outcomes by enabling early intervention and personalized healthcare strategies.

1.1 General
Heart disease is a primary cause of death and a major global health concern. The
improvement of patient outcomes and the prevention of additional consequences are dependent
upon the early identification of heart disease.

Laboratory testing, medical history reviews, and physical examinations are the traditional
approaches used to diagnose and estimate the risk of heart disease. However, especially in high-risk
populations, these approaches often fail to predict the risk of heart disease accurately.
By analyzing a variety of patient data, machine learning (ML) approaches have
demonstrated considerable potential for accurately predicting the risk of heart disease.

The purpose of this project is to look into the identification and risk assessment of heart
diseases using machine learning algorithms. We will specifically assess the accuracy,
interpretability, and efficiency of various ML models such as logistic regression, decision trees,
support vector machines, and deep learning methods like recurrent and convolutional
neural networks.
We will evaluate these models' performance as well as the effects of imbalanced
datasets, data preparation, and data quality on accuracy. The creation of reliable and accurate
machine learning models for risk assessment and heart disease detection can have a big impact on
patient outcomes and healthcare budgets.

Medical practitioners are able to intervene earlier and treat patients more effectively when
they are able to accurately estimate their risk of heart disease. Furthermore, the creation of reliable
and accurate machine-learning models for the identification of heart disease may result in the
advancement of more potent diagnostic instruments.

In this project report, we provide a thorough examination of the application of machine
learning methods to the identification and risk assessment of heart disease.

Our results can be used as a roadmap by cardiology and machine learning researchers and
practitioners to enhance patient outcomes and lower costs associated with healthcare. The ultimate
goal of this project is to support ongoing initiatives to enhance the detection and treatment of
cardiac disease.

1.2 Purpose
This project report’s purpose is to look into the application of machine learning methods for
risk assessment and heart disease detection. The project intends to support continuous efforts to
enhance the detection and treatment of cardiac disease. The initiative specifically seeks to
accomplish the following goals:

1. Examine the efficiency, interpretability, and accuracy of different machine learning
models, including logistic regression, decision trees, support vector machines, and deep
learning techniques such as convolutional and recurrent neural networks.
2. Examine how data preprocessing, imbalanced datasets, and data quality affect the accuracy
of machine learning models for heart disease detection.
3. Provide a thorough evaluation of the effectiveness of machine learning models for risk
assessment and heart disease detection, taking into account both their advantages and
disadvantages.
4. Examine how machine learning models might enhance the precision and effectiveness of
heart disease diagnosis and risk assessment.
5. Give medical practitioners as well as researchers working in the fields of cardiology and
machine learning guidelines on how to employ machine learning techniques for risk assessment
and diagnosis of heart disease.

The results of this project can significantly improve patient outcomes and reduce healthcare
costs. Accurately predicting the risk of heart disease enables earlier intervention and more
effective treatment.

Furthermore, the development of reliable and accurate machine learning models for the
identification of heart disease may pave the way for the creation of more potent diagnostic
instruments and better health outcomes.

The overall goal of this project is to support ongoing efforts to use machine learning
techniques to improve the detection and treatment of heart disease. The results of this study have
the potential to enhance patient outcomes and save healthcare costs by acting as a guide for
cardiologists and machine learning researchers.

1.3 Scope
This project report’s scope is to assess the use of machine learning techniques for the risk
assessment and detection of heart disease.

The project focuses on evaluating several machine learning models and how well they identify
cardiac disease from patient data. The purpose of this report is to support ongoing efforts to use
machine learning techniques to improve the diagnosis and treatment of cardiac disease.

The following areas will be covered in the project report:

1. Data collection and preprocessing: The project report will examine the sources of patient
data and the techniques used to prepare the data for analysis. This will include
identifying pertinent features and handling anomalies, incomplete data, and other
problems with the quality of the data.

2. Machine learning models: For the purpose of detecting heart disease, the project report will
examine the efficiency of several machine learning models, such as logistic regression,
decision trees, support vector machines, and deep learning methods like convolutional and
recurrent neural networks.

3. Impact of data quality and preprocessing: The project report will look into how these
factors affect the machine learning models' performance. This will involve examining how
the usage of various data preparation methods and imbalanced datasets affects model
accuracy.

4. Future directions: The project report will include information on how research on machine
learning-based heart disease detection will proceed in the future. This will involve pinpointing
possible areas for development, like incorporating various patient care modalities and utilizing
cutting-edge machine learning approaches.

This project report's overall goal is to present a thorough review of the use of machine learning
techniques in the risk assessment and detection of heart disease. The purpose of the report is to
offer insights into the future of research in this field and to support the ongoing attempts to use
machine learning techniques to enhance heart disease diagnosis and management. The results of
this project have the potential to enhance patient outcomes, lower healthcare costs, and act as a
guide for medical professionals and academics in the fields of cardiology and machine learning.

1.4 Motivation and Problem Statement

Motivation:
Heart disease remains one of the foremost global health challenges, exerting a substantial toll
on individuals and healthcare systems worldwide. Despite considerable progress in medical
research and treatment options, the prevalence of heart disease persists, underscoring the
critical need for innovative approaches to detection and management. Traditional diagnostic
methods, often reliant on subjective assessments or a limited set of clinical parameters, may
overlook subtle yet significant risk factors, leading to delays in diagnosis and suboptimal patient
outcomes. Moreover, the heterogeneity of cardiovascular conditions and the interplay of various
genetic, environmental, and lifestyle factors further complicate accurate risk assessment and
personalized intervention strategies.

In this context, the application of machine learning (ML) techniques offers a compelling
opportunity to revolutionize cardiovascular care by harnessing the vast amounts of patient data
generated in clinical practice. By leveraging sophisticated algorithms capable of discerning intricate
patterns and relationships within complex datasets, ML enables the identification of novel risk
markers, early detection of disease, and tailored treatment approaches. This not only empowers
healthcare providers with actionable insights for proactive intervention but also holds the promise
of improving patient outcomes, reducing healthcare costs, and alleviating the burden of heart
disease on individuals and societies alike. Thus, the motivation behind this endeavor lies in the
urgent imperative to harness the potential of ML to advance the early detection, risk stratification,
and personalized management of heart disease, ultimately transforming the landscape of
cardiovascular healthcare for the better.

Problem Statement:
The problem at hand revolves around the imperative need to address the multifaceted
challenges inherent in the detection and management of heart disease using advanced machine
learning methodologies. Despite significant strides in medical science, the timely identification and
accurate assessment of cardiovascular risk remain elusive goals, contributing to substantial
morbidity and mortality globally. Conventional diagnostic approaches often fall short in capturing
the nuanced interplay of diverse risk factors and clinical markers that underpin heart disease,
leading to missed opportunities for early intervention and suboptimal patient outcomes. Moreover,
the complexity and heterogeneity of cardiovascular conditions necessitate a comprehensive yet
nuanced approach to risk assessment and treatment planning, one that transcends traditional
heuristics and embraces the rich landscape of patient data available.

Against this backdrop, the problem statement emerges: How can we leverage the power of
machine learning to develop robust predictive models capable of discerning subtle patterns and
relationships within complex datasets to facilitate early detection, accurate risk stratification, and
personalized management of heart disease? This entails not only the development and validation of
sophisticated ML algorithms but also their seamless integration into clinical practice, empowering
healthcare providers with actionable insights and decision support tools. By addressing this
problem statement, we aim to bridge the gap between data-driven innovation and clinical practice,
ultimately improving patient outcomes, reducing healthcare costs, and mitigating the burden of
heart disease on individuals and societies worldwide.

1.5 Research Objectives
The overarching objective of this project work is to advance the field of cardiovascular healthcare
through the application of machine learning techniques for heart disease detection and risk
assessment. Specifically, my work aims to achieve the following key objectives:

1. Develop Robust Predictive Models: Design and train machine learning models capable of
accurately predicting the presence or absence of heart disease based on comprehensive patient
data, including demographic, clinical, and lifestyle factors. These models should demonstrate
robustness, reliability, and generalizability across diverse patient populations.

2. Identify Informative Risk Markers: Conduct feature selection and analysis to identify the
most informative risk markers and clinical predictors associated with heart disease. By
discerning subtle patterns and relationships within complex datasets, we aim to uncover novel
insights into the etiology, progression, and prognosis of cardiovascular conditions.

3. Enhance Clinical Decision Support: Translate the findings of the machine learning models
into actionable insights and decision support tools for healthcare practitioners. By providing
timely risk assessments, diagnostic recommendations, and treatment insights, our research
aims to empower clinicians with the tools and knowledge needed to deliver personalized,
evidence-based care to patients with heart disease.

4. Validate and Evaluate Model Performance: Validate the performance of the developed
machine learning models using rigorous evaluation metrics and validation procedures. This
includes assessing accuracy, sensitivity, specificity, and area under the receiver operating
characteristic curve (AUC-ROC) on independent datasets to ensure reliability and
generalizability.

5. Promote Translational Research: Facilitate the translation of research findings into clinical
practice by fostering collaboration between data scientists, healthcare providers, and
policymakers. By promoting interdisciplinary research and knowledge exchange, our research
aims to accelerate the adoption of data-driven approaches in cardiovascular healthcare and
improve patient outcomes on a global scale.

Through these research objectives, we aim to advance the state-of-the-art in heart disease detection
and risk assessment, ultimately contributing to improved patient outcomes, reduced healthcare
costs, and enhanced public health outcomes in the fight against cardiovascular disease.
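
The evaluation metrics named in objective 4 can be made concrete with a small sketch. The code below is illustrative only and is not taken from the project's appendix: the function names are hypothetical, labels are assumed to be binary with 1 meaning heart disease is present, and the AUC is computed with the pairwise-ranking definition (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half).

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 = disease present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of all cases classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """True positive rate: fraction of diseased cases that were detected."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """True negative rate: fraction of healthy cases that were cleared."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels, predictions, and scores, purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
print(accuracy(y_true, y_pred), sensitivity(y_true, y_pred),
      specificity(y_true, y_pred), auc_roc(y_true, scores))
```

In a screening setting, sensitivity is usually the metric to watch most closely, since a false negative means a missed case of heart disease.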

CHAPTER 2

LITERATURE REVIEW

The authors in [1] discuss the critical need for early detection and continuous monitoring of
cardiovascular diseases due to their high mortality rates globally. They highlight the challenges
associated with manual diagnosis and the limitations of round-the-clock supervision by clinicians.
To address these challenges, the study proposes a cloud-based heart disease prediction system using
machine learning techniques. The proposed system aims to accurately detect impending heart
disease by leveraging an efficient machine learning algorithm derived from a comparative analysis
of various algorithms using the WEKA platform. The Support Vector Machine (SVM) algorithm
demonstrated promising performance with an accuracy level of 97.53%, sensitivity of 97.50%, and
specificity of 94.94%. Additionally, to enable continuous monitoring of heart disease patients, a
real-time patient monitoring system was developed using Arduino. This system can sense
parameters such as body temperature, blood pressure, humidity, and heartbeat, and transmit the data
to a central server every 10 seconds. Healthcare providers can visualize the real-time sensor data
and initiate live video streaming for immediate intervention if necessary. Furthermore, the system
incorporates GSM technology to notify the prescribed doctor instantly when any real-time
parameter exceeds the threshold. Overall, the proposed system offers a comprehensive approach to
early detection and continuous monitoring of heart disease, potentially improving patient outcomes
and reducing mortality rates.

The researchers in [2] present a novel approach to heart disease detection using machine
learning algorithms, focusing on the construction of an artificial intelligence-based system. The
study emphasizes the development of a Python-based application for its reliability and flexibility in
healthcare research. The study details data processing techniques, including handling categorical
variables and converting columns, followed by the main phases of application development:
collecting databases, performing logistic regression, and evaluating dataset attributes. A random
forest classifier algorithm is developed to identify heart diseases with higher accuracy, achieving
approximately 83% accuracy over training data. The paper concludes with discussions on the
random forest classifier algorithm, including experiments and results, highlighting improved
accuracies for research diagnoses. It also outlines objectives, limitations, and research
contributions, providing a comprehensive overview of the proposed approach in heart disease
detection.

The system proposed in [3] provides an efficient and accurate system for diagnosing heart
disease based on machine learning techniques, recognizing the critical role of timely and effective
identification in healthcare, especially within cardiology. The system employs a variety of
classification algorithms, including Support Vector Machine, Logistic Regression, Artificial Neural
Network, K-Nearest Neighbor, Naïve Bayes, and Decision Tree, alongside feature selection
algorithms such as Relief, Minimal Redundancy Maximal Relevance, Least Absolute Shrinkage
and Selection Operator, and Local Learning. A novel fast conditional mutual information feature
selection algorithm (FCMIM) is introduced to address the feature selection problem, aiming to
increase classification accuracy and reduce execution time. The study utilizes the leave-one-
subject-out cross-validation method for model assessment and hyperparameter tuning, employing
performance metrics to evaluate classifier performance on selected features. Experimental results
demonstrate the feasibility of the proposed FCMIM algorithm, particularly when paired with the
Support Vector Machine classifier, achieving higher accuracy compared to previous methods. The
proposed diagnosis system (FCMIM-SVM) shows promise for implementation in healthcare
settings for heart disease identification.

The authors in [4] discuss the utilization of recent computing advances and technological
developments to support medical decision-making through the collection, storage, and analysis of
medical data. Emphasizing the need for digitized patient data in many countries, particularly in
India, the study focuses on heart disease diagnosis using an India-centric dataset. Performance
evaluation of the automatic diagnosis system is conducted using classification accuracy, sensitivity,
and specificity analysis. The study concludes that the Support Vector Machine (SVM) with the
Sequential Minimal Optimization learning algorithm is particularly suitable for medical
disease diagnosis applications, demonstrating its efficacy in diagnosing heart disease.

The authors in [5] address the urgent need for early detection and diagnosis of heart failure
disease by exploring various machine learning approaches tailored for heart disease detection. The study
introduces methods such as Naïve Bayes with a weighted approach for predicting heart disease,
automatic analysis for ischemic heart disease localization/detection using Support Vector Machine
(SVM) and XGBoost classifiers, an improved SVM for heart failure automatic identification, and a
clinical decision support system (CDSS) utilizing advanced techniques like DBSCAN, SMOTE-ENN,
and XGBoost for heart disease prediction. The primary aim is to equip clinicians with a
reliable tool for early heart disease diagnosis, evaluating model performance using precision,
accuracy, f1-measure, and recall metrics, particularly focusing on comparing XGBoost with
alternative decision tree classification algorithms.

The paper in [6] underscores the critical role of machine learning in data classification,
particularly in extracting insights from corporate and medical datasets. It highlights the growing
importance of machine learning in healthcare research for prognosis and deeper understanding of
medical data. Various machine learning methods have been employed for heart disease detection,
with neural networks and logistic regression being prominent choices. The review discusses the
analysis of multiple algorithms, including neural networks, K-nearest neighbor, naive Bayes, and
logistic regression, as well as composite approaches integrating these methods. The study
implements and trains the system using Python, utilizing benchmark datasets from the UCI machine
learning repository, with potential for extension to accommodate new data collection.

The paper in [7] introduces a majority voting ensemble method for predicting the presence of
heart disease in humans using simple and affordable medical tests conducted in local clinics. The
primary objective is to enhance the confidence and accuracy of doctors' diagnoses by leveraging
real-life data from healthy and ill patients. The proposed model employs a combination of multiple
machine learning models, with classification based on the majority vote, aiming to yield more
accurate solutions compared to individual models. The approach achieves an accuracy of 90%
based on the hard voting ensemble model, showcasing its potential in improving heart disease
diagnosis.
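
The hard (majority) voting idea described for [7] can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model from [7]: the three rule-based "classifiers", their thresholds, and the feature order (age, resting blood pressure, cholesterol) are hypothetical stand-ins for trained models and carry no clinical meaning.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A classifier here is any function mapping a feature vector to a label (0 or 1).
Classifier = Callable[[Sequence[float]], int]


def hard_vote(classifiers: List[Classifier], sample: Sequence[float]) -> int:
    """Return the label predicted by the majority of the classifiers.

    With an odd number of binary classifiers a tie cannot occur.
    """
    votes = [clf(sample) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for trained models; thresholds are illustrative only.
def age_rule(x: Sequence[float]) -> int:
    return 1 if x[0] > 55 else 0


def blood_pressure_rule(x: Sequence[float]) -> int:
    return 1 if x[1] > 140 else 0


def cholesterol_rule(x: Sequence[float]) -> int:
    return 1 if x[2] > 240 else 0


ensemble = [age_rule, blood_pressure_rule, cholesterol_rule]
print(hard_vote(ensemble, (60, 150, 200)))  # two of three rules vote 1
print(hard_vote(ensemble, (40, 150, 200)))  # only one rule votes 1
```

Hard voting counts predicted labels only. A soft-voting variant would average predicted probabilities instead, which requires every base model to output calibrated scores.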

The work in [8] investigates the challenges associated with imbalanced data in heart disease
predictions through a systematic literature review (SLR) approach, following a meta-analysis of
451 reference literature and an in-depth analysis of 49 selected articles. The research highlights the
significant impact of imbalanced data on the performance and practical applicability of machine
learning approaches in heart disease diagnosis. Despite advancements in ML applications utilizing
electrocardiogram (ECG) and patient data for early disease detection, imbalanced datasets pose a
considerable challenge for traditional ML algorithms. The SLR study reveals various open
problems and issues in current approaches, hindering their functionality and practical applicability.
Additionally, the research emphasizes the need for addressing factors beyond model performance,
such as interpretability and explainability of ML algorithms, to enhance data-driven
decision-making in heart disease diagnosis.
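
To illustrate the imbalance problem discussed above, the sketch below shows random oversampling, one common remedy in which minority-class rows are duplicated until every class is as large as the majority class. It is an assumption chosen for illustration and is not necessarily one of the methods analysed in [8]; the function name and the toy data are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def random_oversample(
    X: Sequence[Sequence[float]], y: Sequence[int], seed: int = 0
) -> Tuple[List[Sequence[float]], List[int]]:
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - count):
            i = rng.choice(rows)  # sample with replacement from this class
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


# Toy data: 8 healthy records (label 0) and 2 diseased records (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversampling should be applied to the training split only. Duplicating rows before splitting lets copies of the same record appear in both training and test sets, which inflates the measured accuracy.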

The authors in [9] propose a novel hybrid approach for predicting cardiovascular disease,
acknowledging its status as a major global cause of death and the challenges it poses for accurate
prediction by medical practitioners. Leveraging recent developments in medical supportive
technologies based on data mining and machine learning, the study utilizes various machine
learning techniques, including Logistic Regression, Adaptive Boosting, Multi-Objective
Evolutionary Fuzzy Classifier, Fuzzy Unordered Rule Induction, Genetic Fuzzy System-
LogitBoost, and Fuzzy Hybrid Genetic Based Machine Learning. The accuracy and results of each
classifier are compared, with the best-performing classifier selected for more precise
cardiovascular disease prediction. To achieve this objective, the study utilizes two free software
tools, Weka and Keel.

The work in [10] highlights the significant challenges posed by heart disease prediction due to its
complex nature and the vast amount of data involved. It emphasizes the scarcity of effective
analysis tools in the healthcare environment, despite the abundance of available data. The proposed
web application aims to address this gap by leveraging data mining techniques to extract hidden
patterns and relationships from datasets sourced from Kaggle and the Cleveland Foundation medical
research. The ultimate goal is to predict the occurrence of heart disease based on gathered data and
improve medical efficiency while reducing costs. The review underscores the necessity for
computationally efficient and accurate machine learning techniques in handling the massive and
intricate datasets inherent in heart disease prediction. It emphasizes the interdisciplinary nature of
data mining, which combines statistical analysis, machine learning, and database technology to
uncover valuable insights from large databases.

The study in [11] underscores the critical importance of accurate diagnosis and prognosis in
cardiovascular disease management, highlighting the role of machine learning in improving
classification accuracy and reducing misdiagnosis rates. The research aims to develop a predictive
model that accurately predicts cardiovascular diseases, ultimately aiming to mitigate the fatality associated
with such conditions. A novel approach utilizing k-modes clustering with Huang starting is proposed to
enhance classification accuracy. Various machine learning models, including random forest, decision tree
classifier, multilayer perceptron, and XGBoost, are employed and optimized using GridSearchCV. The
study applies these models to a real-world dataset comprising 70,000 instances from Kaggle, achieving
promising results in terms of accuracy and area under the curve (AUC) values. Notably, the multilayer
perceptron with cross-validation emerges as the top-performing algorithm, demonstrating the highest
accuracy of 87.28%. This research contributes valuable insights into the application of machine learning in
cardiovascular disease prediction, highlighting the potential of advanced modeling techniques to improve
diagnostic accuracy and patient outcomes.

CHAPTER 3

SYSTEM ARCHITECTURE AND DESIGNS

Fig 3.1: Architecture Flow of the Proposed Work

In designing the system architecture for predicting cardiovascular disease using machine
learning techniques, several key components and considerations need to be addressed to ensure
its effectiveness, scalability, and reliability. Here's a detailed elaboration of the system
architecture and design:

3.1 Data Collection and Pre-processing

In the initial phase of developing a predictive model for cardiovascular disease, thorough data
collection and preprocessing are imperative. Data collection involves identifying and accessing
relevant sources, including electronic health records, medical imaging databases, and public
health datasets. Accessing these sources requires adherence to data privacy regulations and
obtaining necessary permissions.

Identifying Data Sources: Determine the sources of data that will be used for the predictive
model. These may include electronic health records (EHR), medical imaging databases,
wearable devices, clinical trials, and public health datasets.
Accessing Data: Gain access to the identified data sources through appropriate channels,
ensuring compliance with data privacy regulations and obtaining necessary permissions.
Data Acquisition: Extract relevant datasets from the sources identified, ensuring that the data
collected are comprehensive and representative of the target population. This may involve
querying databases, accessing APIs, or obtaining data through collaborations with healthcare
institutions or research organizations.

Once acquired, the data undergo preprocessing to ensure its quality and suitability for analysis.
This includes handling missing values through imputation or deletion, detecting and treating
outliers, normalizing numerical features to ensure uniform scales, and encoding categorical
variables into numerical format. Additionally, feature engineering techniques may be applied to
extract meaningful information and address class imbalance issues in the dataset. By
meticulously executing these steps, researchers can obtain a clean and representative dataset
that forms the foundation for building an accurate predictive model for cardiovascular disease.

Handling Missing Values: Identify and handle missing values in the dataset using techniques
such as imputation (e.g., filling missing values with the mean, median, or mode of the feature)
or deletion of incomplete records.
Outlier Detection and Treatment: Identify outliers in the dataset that may represent errors or
anomalies. Decide whether to remove outliers, replace them with more plausible values, or treat
them separately during analysis.
Normalization and Standardization: Normalize numerical features to ensure uniform scales
across the dataset. Common normalization techniques include Min-Max scaling and Z-score
normalization. Standardizing features helps prevent features with larger scales from dominating
the model training process.
Encoding Categorical Variables: Convert categorical variables into numerical format to make
them compatible with machine learning algorithms. Common encoding techniques include one-
hot encoding and label encoding, depending on the nature of the categorical variables and the
requirements of the chosen algorithms.
Feature Engineering: Create new features or transform existing ones to extract meaningful
information that may improve the predictive performance of the model. Feature engineering
techniques may include polynomial features, interaction terms, and domain-specific
transformations.
Handling Imbalanced Data: Address class imbalance in the dataset, where one class (e.g.,
presence or absence of heart disease) is significantly more prevalent than the other. Techniques
such as resampling (oversampling or undersampling), synthetic data generation, and
algorithmic adjustments (e.g., adjusting class weights in the model) can help mitigate the effects
of class imbalance.
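
The preprocessing steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the helper names (impute_mean, min_max_scale, z_score, one_hot) and the toy columns are hypothetical, and only the standard library is used.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Centre values on zero with unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode a categorical column as one 0/1 indicator list per row."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical toy columns (not taken from the project's dataset).
cholesterol = [200, None, 240, 180]
chest_pain = ["typical", "atypical", "typical", "none"]

filled = impute_mean(cholesterol)   # None becomes the mean of 200, 240, 180
scaled = min_max_scale(filled)      # Min-Max scaling
standardized = z_score(filled)      # Z-score normalization
encoded = one_hot(chest_pain)       # columns ordered: atypical, none, typical
```

In practice the imputation value and the scaling statistics would normally be computed on the training split only and then reused on the test split, so that no information leaks from the test data.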

3.2 Feature Engineering and Selection

Feature engineering is a crucial step in developing predictive models for cardiovascular
disease. This process involves transforming raw data into informative features that can better
represent the underlying patterns and relationships in the dataset. For instance, engineers may
derive new features such as body mass index (BMI) from existing variables like height and
weight, or calculate risk scores based on established cardiovascular disease risk factors.
Domain-specific knowledge may guide the creation of features tailored to the unique
characteristics of cardiovascular health, such as incorporating indicators of metabolic health
or cardiovascular fitness. By carefully engineering features, researchers can enhance the
predictive power of their models and uncover valuable insights that may not be apparent in
the raw data alone.

Once a comprehensive set of features has been engineered, the next step is to identify the
subset of features that are most relevant for predicting cardiovascular disease. Feature
selection methods aim to weed out redundant or irrelevant features, reducing the
dimensionality of the dataset and improving model efficiency and interpretability. Techniques
such as filter methods assess the statistical significance of individual features, while wrapper
methods evaluate feature subsets by training and testing models iteratively. Domain expertise
often plays a critical role in guiding feature selection, as domain experts can provide valuable
insights into which variables are likely to be the most predictive of cardiovascular disease
risk. By selecting the most informative features, researchers can build more focused and
interpretable models that achieve higher predictive accuracy and generalizability.
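
As a rough illustration of the two ideas above, the following sketch derives BMI from height and weight and then applies a simple filter method that ranks features by the absolute Pearson correlation with the target. The choice of Pearson correlation as the filter score, the function names, and the toy values are assumptions made for this example only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_features(columns, target):
    """Filter method: sort features by |correlation| with the target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data (not from the project's dataset).
height_m = [1.70, 1.60, 1.80, 1.75, 1.65, 1.72]
weight_kg = [95.0, 55.0, 110.0, 70.0, 88.0, 64.0]
target = [1, 0, 1, 0, 1, 0]  # 1 = heart disease present

# Engineered feature: BMI = weight / height^2
bmi = [w / h ** 2 for w, h in zip(weight_kg, height_m)]

columns = {"bmi": bmi, "height_m": height_m, "constant": [1.0] * 6}
ranking = rank_features(columns, target)
```

A wrapper method would instead retrain a model on each candidate subset, which is more expensive but accounts for interactions between features.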

3.3 Model Selection and Training

Selecting the appropriate machine learning algorithms is a crucial step in developing
predictive models for cardiovascular disease. Researchers must consider various factors such
as the nature of the dataset, the complexity of the problem, and the interpretability of the
models. Commonly used algorithms for cardiovascular disease prediction include Logistic
Regression, Support Vector Machines, Random Forests, Gradient Boosting, and Neural
Networks. Each algorithm has its strengths and weaknesses, and the choice depends on the
specific requirements of the problem at hand. For example, Logistic Regression is often
preferred for its simplicity and interpretability, while Neural Networks may offer higher
predictive accuracy but require more computational resources. Researchers may conduct
preliminary experiments to compare the performance of different algorithms on their dataset
and select the one that best meets their needs.

Once the machine learning algorithms have been selected, the next step is to train the models
on the dataset. During training, the models learn the underlying patterns and relationships in
the data, adjusting their parameters to minimize prediction errors. The dataset is typically
divided into training and testing sets, with the training set used to train the models and the
testing set used to evaluate their performance. Researchers may employ techniques such as
cross-validation to ensure robustness and prevent overfitting. Hyperparameter tuning may
also be performed to optimize the performance of the models. Ultimately, the goal of model
training is to develop predictive models that accurately classify individuals as either at risk or
not at risk of cardiovascular disease based on their input features.
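
The train/test split and cross-validation described above can be sketched as index bookkeeping. This is a minimal illustration assuming a simple shuffled hold-out split followed by k-fold cross-validation on the training portion; the function names and the 80/20 ratio are assumptions, not the project's reported settings.

```python
import random


def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


train_idx, test_idx = train_test_split(20, test_fraction=0.2, seed=42)
splits = list(k_fold_indices(train_idx, k=4))
```

Each candidate hyperparameter setting would be scored by averaging its validation performance over the k splits, and the held-out test indices would be used only once, for the final evaluation.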

3.4 System Integration

System integration is a pivotal phase in the development of a predictive model for
cardiovascular disease, facilitating the seamless integration of various components into a
cohesive and functional system. At its core, system integration involves deploying the trained
model into a production environment, typically hosted on a server or cloud-based
infrastructure. To ensure accessibility to end-users and other systems, developers create an
Application Programming Interface (API) defining endpoints and methods for interacting
with the model. A robust data pipeline is established to manage the flow of data throughout
the system, encompassing ingestion, preprocessing, prediction, and result retrieval stages.
Scalability and availability are paramount considerations, with the architecture designed to
scale resources dynamically and ensure uninterrupted service. Compatibility with diverse data
sources and formats is ensured, enabling integration with existing healthcare infrastructure.
Security measures, including data encryption and access controls, are implemented to
safeguard patient information and comply with regulatory requirements such as HIPAA.
Additionally, a user-friendly interface, often in the form of a web-based dashboard, provides
healthcare practitioners with intuitive access to input data, view predictions, and monitor
model performance. Through effective system integration, researchers can develop a
comprehensive predictive model for cardiovascular disease that is accurate, reliable, and
capable of enhancing clinical decision-making.

3.5 Deployment and Monitoring

Deployment and monitoring are critical phases in the implementation of a predictive model
for cardiovascular disease within a healthcare environment. Deployment begins with the
setup of infrastructure, ensuring it can support the computational demands of the model while
maintaining high availability and scalability. Following infrastructure setup, the development
of an Application Programming Interface (API) enables seamless interaction with the model,
allowing integration with existing healthcare systems. Rigorous testing and validation in a
staging environment precede the model's rollout into production, where a gradual deployment
strategy ensures smooth integration and user acceptance.

Concurrently, monitoring mechanisms are established to track the model's performance in
real time. Key metrics, including prediction accuracy and latency, are continuously monitored
to uphold performance standards and identify any deviations. Anomaly detection systems are
employed to flag irregularities, such as data drift or model degradation, prompting timely
intervention to maintain the model's efficacy. Regular health checks assess the model's
overall integrity, enabling proactive measures to address potential issues and uphold its
reliability in aiding clinical decision-making. Through meticulous deployment and vigilant
monitoring, healthcare practitioners can confidently leverage the predictive model to enhance
patient care and outcomes in the management of cardiovascular disease.
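
One simple way to track prediction accuracy after deployment is a rolling window over recent predictions whose true outcome has since become known. The sketch below is only an assumption about how such a check might look; the class name, window size, and accuracy floor are hypothetical and not part of the described system.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent predictions with known outcomes."""

    def __init__(self, window=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window)  # True where prediction == actual
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """Flag degradation once the window is full and accuracy is too low."""
        acc = self.accuracy()
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return acc is not None and window_full and acc < self.min_accuracy


monitor = RollingAccuracyMonitor(window=5, min_accuracy=0.8)
for predicted, actual in [(1, 1), (0, 1), (1, 1), (0, 0), (1, 0)]:
    monitor.record(predicted, actual)
```

A flag raised by needs_attention would prompt the kind of timely intervention mentioned above, for example a review of recent input data for drift or a retraining run.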

CHAPTER 4

METHODOLOGY

4.1 Random Forest Classifier

In the area of heart disease detection, the Random Forest Classifier (RFC) is a machine learning
technique that has produced strong results. RFC is an ensemble approach in which a large number
of decision trees are trained on random subsets of the data. The outputs of all the trees are
then merged to obtain the final prediction. The random forest algorithm can manage a large number
of variables, even when they are correlated, and because it can detect relationships between
factors it is a good option for complicated datasets such as those used in heart disease
detection.

The RFC algorithm has been used in multiple studies to categorize patients according to their
risk of heart disease. The algorithm generates the final forecast by integrating the outputs of
decision trees that have been trained on distinct subsets of the dataset. The random forest
classifier is well suited to complex datasets like those for heart disease since it can handle
non-linear relationships between variables, and it has been reported to outperform other
classification algorithms such as logistic regression and support vector machines.

In one study, the RFC algorithm was used to predict the risk of heart disease in a group of
patients with heart disease and healthy individuals. The findings showed that the method
discriminated well between the two groups, with an area under the curve (AUC) of 0.93 in
predicting the existence of heart disease. This study suggests that the random forest classifier
can be an effective tool for heart disease detection.

In order to estimate the probability of heart disease, another study used the RFC algorithm on
a dataset of patients who had chest pain. With an AUC of 0.93, the random forest classifier
was found to discriminate well between patients with and without heart disease. The study found
that patients experiencing chest pain might be successfully diagnosed with heart disease using
the RFC algorithm.

To increase the accuracy of heart disease identification, the random forest classifier has also
been employed in conjunction with other machine learning methods, including support vector
machines and neural networks. In one study, individuals were classified according to their
risk of heart disease using a combination of the support vector machine and random forest
algorithms. In comparison to applying either method alone, the study discovered that
combining the two algorithms increased the accuracy of heart disease identification.

The Random Forest Classifier is a powerful machine learning algorithm that has demonstrated
strong performance in heart disease identification. The method can find interactions between
variables, handle large datasets, and manage non-linear relationships between variables. The
algorithm can be used alone or in conjunction with other algorithms to increase the accuracy of
heart disease identification, and it has been reported to outperform other classification
algorithms.

4.2 Decision Tree

Decision trees, a form of machine learning algorithm, have been employed in the diagnosis of
heart disease because they are simple to understand and interpret and have demonstrated efficacy
in classifying individuals according to their risk of heart disease. Using the values of input
features as a basis, decision trees divide the data into progressively smaller subsets until
each subset comprises only one class of the target variable.
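
To make the splitting step concrete, the sketch below searches for the threshold on a single numeric feature that gives the purest pair of subsets. It assumes Gini impurity as the split criterion, which is a common choice but is not stated in this report, and the ages and labels are hypothetical toy values.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    n = len(labels)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (t, weighted)
    return best


# Hypothetical toy data: age in years and disease status (1 = present).
age = [35, 42, 50, 58, 63, 70]
disease = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(age, disease)
```

A full tree repeats this search over every feature at every node and recurses into the two resulting subsets until they are pure or a stopping rule such as a maximum depth is reached.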

When employing machine learning to diagnose diseases, decision trees offer a number of
benefits. First, decision trees can process input features that are both categorical and
continuous, enabling complicated numerical data to be analyzed with ease. Second, because there
are several risk factors for heart disease that are known to be interconnected, decision trees
can reveal interactions between variables. Third, decision trees help patients
and healthcare providers understand the rationale behind a diagnostic or treatment choice
because they are simple to visualize and comprehend.

Numerous studies have demonstrated the efficacy of decision trees in the identification of
cardiac disease. One study classified patients according to their likelihood of heart
disease using a decision tree algorithm. The study found that the decision tree algorithm
predicted the presence of heart disease well, with an area under the curve (AUC) of 0.90.

In a different study, patients with heart disease were categorized according to their risk of
having a heart attack using a decision tree algorithm in conjunction with other machine
learning algorithms. The results of the study showed that, in comparison to employing any
one algorithm alone, the accuracy of heart disease prediction was greatly increased when
decision trees and other algorithms were combined.

Decision trees can handle continuous and categorical input characteristics, identify
relationships between variables, and produce findings that are easy to understand, making
them a valuable tool in machine learning for heart disease identification.

Decision trees can be a useful tool when combined with other algorithms for heart disease
prediction and risk assessment, even though they might not always be the most accurate
algorithm.

4.3 Ensemble Technique

Because ensemble techniques can increase the accuracy and robustness of predictive models, they
are a common machine learning strategy in heart disease detection. The limitations and biases of
individual models can be mitigated by integrating several weaker models into a stronger one
through the use of ensemble techniques.

Ensemble approaches for heart disease detection fall into two major categories: bagging and
boosting. Bagging, also known as bootstrap aggregating, combines multiple models that were
trained on randomly picked subsets of the training data in order to lower variance and increase
the final model's robustness. Boosting is the process of sequentially training a set of weak
models, each placing more emphasis on the samples that the prior model classified incorrectly,
which helps increase the final model's overall accuracy.

Random Forest, which combines many decision trees to generate a more robust and accurate model,
is one of the most commonly used bagging techniques in the identification of heart disease.
Random Forest is a good method for detecting heart disease since it excels at managing
high-dimensional data with intricate feature relationships.

A Random Forest model is built from numerous decision trees. Each tree is trained on
a random subset of the features and data, and the final prediction is determined by the trees'
majority vote.
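
The three ingredients named above (a random subset of the data, a random subset of the features, and a majority vote) can be sketched as follows. The helper names are hypothetical, and the three "trees" are deliberately trivial hand-written rules standing in for trained decision trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, n_keep, rng):
    """Pick n_keep distinct feature indices for one tree."""
    return sorted(rng.sample(range(n_features), n_keep))


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each is a simple rule on one feature.
trees = [
    lambda row: int(row["age"] > 55),
    lambda row: int(row["chol"] > 240),
    lambda row: int(row["bp"] > 140),
]
patient = {"age": 61, "chol": 230, "bp": 150}
forest_prediction = majority_vote([tree(patient) for tree in trees])
```

Using an odd number of trees avoids tied votes in a binary problem such as the presence or absence of heart disease.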

Gradient Boosting, a boosting strategy that iteratively trains a series of weak models to
increase the overall accuracy of the final model, is another well-liked ensemble technique in
the detection of heart disease. A variety of base model types, including decision trees, can be
used with gradient boosting. It is renowned for its capacity to manage asymmetric data and
unbalanced datasets.

It has been demonstrated that employing ensemble methods, as opposed to a single model,
significantly increases the accuracy of heart disease detection models. For instance, a study
that classified patients based on their risk of heart disease using ensemble techniques—
Random Forest and Gradient Boosting—showed a notable increase in accuracy when
compared to the use of a single model.

While Random Forest and Gradient Boosting are the two most often used ensemble
techniques in heart disease detection, other ensemble techniques may also be appropriate
depending on the specific dataset and research question. Overall, ensemble techniques are an
effective approach in heart disease detection using machine learning because they can improve the
accuracy and robustness of predictive models and can handle high-dimensional and complex
data.

4.4 XGBoost
Extreme Gradient Boosting, or XGBoost, is a well-liked and potent machine learning method
that is utilized in the diagnosis of heart disease because it can manage missing data and
deliver high accuracy and fast computation. XGBoost is an implementation of Gradient Boosting,
an ensemble technique that merges numerous weak models into a stronger one.

The capacity of XGBoost to handle intricate, high-dimensional data with feature interactions
is one of its key advantages. This is especially crucial when it comes to heart disease because
a variety of factors, including age, gender, blood pressure, cholesterol, and history of
smoking, can influence the development of the condition. XGBoost is a powerful tool for
predicting the risk of heart disease since it can automatically capture these intricate
connections and patterns.

The capacity of XGBoost to manage missing data, a typical occurrence in medical datasets, is
another benefit. XGBoost deals with missing values through sparsity-aware split finding: at each
tree split it learns a default direction, chosen from the available data, down which instances
with a missing value are sent. This method can lessen the bias brought on by missing data and
increase the model's accuracy.

Moreover, XGBoost's quick and effective implementation enables it to manage big datasets
with lots of characteristics. This makes it ideal for heart disease identification, where a variety
of possible risk factors may need to be taken into account.

Lastly, XGBoost offers results that are easy to grasp, which is helpful in medical applications
where it's critical to comprehend the rationale behind the model's predictions. The relative
significance of each feature in forecasting the risk of heart disease can be determined
using feature importance scores, which are provided by XGBoost. This can assist physicians and
researchers in identifying the most significant risk factors and prioritizing interventions in
order to prevent or manage heart disease.

Given its ability to handle complex, high-dimensional data, deal with missing data, produce
results that are easy to understand, and be implemented quickly and effectively, XGBoost is a
potent and effective machine learning method for the identification of heart disease.
Numerous studies have successfully employed XGBoost to forecast the risk of heart disease
and identified significant risk variables. It is expected to remain an invaluable tool in
upcoming research and clinical applications.

4.5 Logistic Regression

A supervised learning approach called logistic regression uses one or more input features to
forecast the likelihood of an outcome variable. In the identification of heart disease, patients'
heart disease status is binary, meaning they can either have it or not. The input
features could consist of things like blood pressure, cholesterol, age, sex, and past
medical records.

Heart disease is one of the biggest global health concerns; prevention and medical diagnosis
are essential to lessen its effects. Machine learning algorithms can help diagnose cardiac
problems by identifying patterns and making predictions by analyzing vast volumes of
medical data. A popular procedure in this context is logistic regression, a statistical technique
for binary classification problems.

The logistic regression approach determines the likelihood of having heart disease based on
the input features by fitting a logistic curve to the data. Based on the input features, this
curve can then be used to categorize new patients as either having or not having heart disease.
When there are few input features, logistic regression is particularly helpful since it can yield
results that are easy to understand and support clinical decision-making.
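
The logistic curve mentioned above maps a weighted sum of the input features to a probability between 0 and 1. The following sketch shows that mapping and the final classification step; the weights, bias, and patient values are hypothetical numbers chosen for illustration and are not fitted coefficients from this project.

```python
from math import exp


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_proba(features, weights, bias):
    """Probability of heart disease for one patient."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(probability, cutoff=0.5):
    """Assign class 1 (disease) when the probability reaches the cutoff."""
    return 1 if probability >= cutoff else 0


# Hypothetical coefficients: per year of age, per mg/dL of cholesterol.
weights = [0.04, 0.01]
bias = -5.0
patient = [65, 280]  # age, cholesterol

p = predict_proba(patient, weights, bias)  # z is about 0.4, so p is about 0.6
label = classify(p)
```

Because each weight multiplies a single feature, its sign and size can be read directly as the direction and strength of that feature's effect on the log-odds, which is the source of the interpretability noted above.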

Several studies have effectively employed logistic regression to diagnose heart disease. For
instance, logistic regression was used in a study by Kannel et al. (1976) to determine heart
disease risk factors and forecast the likelihood of acquiring heart disease based on those risk
variables. Based on a number of clinical and laboratory characteristics, Jellinger et al. (2005)
employed logistic regression in another investigation to forecast the probability of
experiencing a heart attack.

One benefit of logistic regression is its ability to manage imbalanced datasets, which are
frequently found in medicine, where the proportion of positive cases (patients with heart
disease) is significantly smaller than that of negative cases (patients without heart disease).
In these situations, the cutoff point applied to the predicted probability can be modified in
order to raise the true positive rate (i.e., the percentage of patients with heart disease that
are correctly identified) while keeping the false positive rate (i.e., the percentage of patients
without heart disease that are mistakenly identified as having it) low.
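
The effect of moving the cutoff point can be illustrated by computing both rates at two different cutoffs. The predicted probabilities and labels below are hypothetical toy values, and the function name is an assumption made for this sketch.

```python
def rates_at_cutoff(probabilities, labels, cutoff):
    """Return (true positive rate, false positive rate) at a given cutoff."""
    pairs = list(zip(probabilities, labels))
    tp = sum(1 for p, y in pairs if p >= cutoff and y == 1)
    fn = sum(1 for p, y in pairs if p < cutoff and y == 1)
    fp = sum(1 for p, y in pairs if p >= cutoff and y == 0)
    tn = sum(1 for p, y in pairs if p < cutoff and y == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical model outputs for 3 patients with disease and 5 without.
probabilities = [0.9, 0.4, 0.35, 0.3, 0.2, 0.1, 0.05, 0.6]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

default_rates = rates_at_cutoff(probabilities, labels, cutoff=0.5)
lowered_rates = rates_at_cutoff(probabilities, labels, cutoff=0.25)
```

With these toy values, lowering the cutoff from 0.5 to 0.25 raises the true positive rate from one third to 1.0 but also raises the false positive rate from 0.2 to 0.4, so the two rates have to be traded off against each other rather than optimized independently.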

4.6 Naive Bayes
One popular probabilistic machine learning approach for classification applications, such as
heart disease detection, is Naive Bayes. It is predicated on the Bayes theorem, a mathematical
method that determines the likelihood of a proposition given certain data. Based on a
collection of input features, Naive Bayes can be used in the context of heart disease detection
to predict whether or not a patient has heart disease.

Naive Bayes computes the conditional probability of each input feature given the class label,
that is, whether or not the patient has heart disease. It assumes that each input feature is
independent of the others given the class label, although this is frequently not the case in
real-world datasets. But even with this naive assumption, Naive Bayes is still capable of
producing accurate results in a lot of situations, in particular when there is little correlation
between the input features.
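
The calculation described above can be sketched for categorical features. This minimal example assumes Laplace (add-one) smoothing, which the report does not mention, and uses hypothetical toy records with two features (smoker, blood pressure).

```python
from collections import Counter, defaultdict
from math import log


def train_naive_bayes(rows, labels, alpha=1.0):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, label) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            feature_values[j].add(value)
    return class_counts, value_counts, feature_values, alpha


def predict(model, row):
    """Pick the class with the highest log prior + sum of log likelihoods."""
    class_counts, value_counts, feature_values, alpha = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = log(count / total)
        for j, value in enumerate(row):
            # Laplace smoothing keeps unseen values from giving zero probability.
            numerator = value_counts[(j, label)][value] + alpha
            denominator = count + alpha * len(feature_values[j])
            score += log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy records: (smoker, blood pressure) and disease label.
rows = [("yes", "high"), ("yes", "high"), ("no", "high"),
        ("no", "normal"), ("no", "normal"), ("yes", "normal")]
labels = [1, 1, 1, 0, 0, 0]

model = train_naive_bayes(rows, labels)
at_risk = predict(model, ("yes", "high"))
not_at_risk = predict(model, ("no", "normal"))
```

Summing log probabilities feature by feature is exactly where the independence assumption enters: each feature contributes to the score without regard to the values of the others.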

Numerous studies have effectively used Naive Bayes for heart disease identification. For
instance, a study by Lio et al. (2012) employed a set of clinical and laboratory characteristics
to predict the probability of having coronary artery disease using Naive Bayes. Based on a
number of risk variables and medical history, Kim et al. (2014) employed Naive Bayes in
another investigation to predict the existence of coronary artery disease.

The ability of Naive Bayes to accommodate noisy input features and missing data is one of its
benefits. It accomplishes this by leveraging the available data to estimate the probability
distributions of the input features, which are then used to generate predictions. Given that
medical datasets frequently include noisy input features and missing data, this is crucial for
the diagnosis of heart disease.

Naive Bayes is a helpful machine learning technique for heart disease detection, to sum up.
Even in cases when the input features exhibit weak correlation, it can produce reliable results
and manage high-dimensional datasets with numerous input features.

Missing data and noisy input characteristics, which are frequent in medical datasets, can also
be handled with Naive Bayes. For this reason, improved patient outcomes may result from the
use of Naive Bayes in the careful detection and prevention of heart disease.

4.7 Support Vector Machine

Support Vector Machines (SVM) are strong machine learning algorithms that are frequently
employed for classification tasks, such as the identification of heart disease. SVM operates by
determining the hyperplane that best divides the positive and negative instances in a dataset.
In the context of heart disease detection, SVM can be used to categorize patients as having or
not having heart disease based on a collection of input features.

SVM's capacity to handle datasets with numerous input characteristics and nonlinear
correlations between the input features and the result variable is one of its key advantages.
Numerous possible risk variables are available for use as input features in the identification of
heart disease, including gender, age, cholesterol, blood pressure, smoking status, and family
medical history. These input features can be handled using SVM, which can also identify
intricate nonlinear correlations between them and the existence or absence of heart disease.

SVM finds the hyperplane that maximizes the margin between the positive and negative
examples by mapping the input features to a high-dimensional space. The margin is the distance
between the hyperplane and the nearest positive and negative instances. The hyperplane that
maximizes the margin is expected to have the best generalization performance on fresh data.
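
The margin can be made concrete by measuring how far each point lies from a given hyperplane w·x + b = 0. The sketch below only evaluates a fixed, hypothetical hyperplane; it does not perform the optimization that an SVM solver carries out to find the maximum-margin hyperplane, and it omits the mapping to a higher-dimensional space.

```python
from math import sqrt


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    """Geometric distance |w.x + b| / ||w|| from point x to the hyperplane."""
    return abs(decision_value(w, b, x)) / sqrt(sum(wi * wi for wi in w))


def margin(w, b, points):
    """Distance from the hyperplane to the nearest point."""
    return min(distance_to_hyperplane(w, b, x) for x in points)


# Hypothetical hyperplane and points in a two-feature space.
w, b = [3.0, 4.0], -10.0
positives = [(2.0, 3.0), (4.0, 2.0)]   # decision values 8 and 10
negatives = [(0.0, 0.0), (1.0, 1.0)]   # decision values -10 and -3

m = margin(w, b, positives + negatives)  # nearest point is (1, 1)
```

Among all hyperplanes that separate the two classes, training selects the one for which this nearest-point distance is largest.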

SVM has been effectively applied to heart disease identification in several studies. For
instance, based on a number of clinical and laboratory factors, Chen et al. (2009) employed
support vector machines (SVM) to predict the presence of coronary artery disease. SVM was
utilized in a different study by Gonen et al. (2001) to forecast the chance of experiencing a
heart attack based on a number of risk factors and past medical records.

SVM's ability to handle imbalanced datasets is one of its benefits. Such datasets are
frequently found in medicine, where the proportion of positive cases (patients with heart
disease) is significantly smaller than that of negative cases (patients without heart disease).
By adjusting the decision threshold of a Support Vector Machine, the true positive rate (the rate
of accurately identifying patients with heart disease) can be increased, while the false positive
rate (the rate of mistakenly flagging patients without heart disease) is kept as low as possible.

To sum up, Support Vector Machines are an effective machine learning technique for
identifying heart disease. They can manage datasets with numerous input features and intricate
nonlinear interactions between those features and the outcome variable. In addition, SVM can deal
with unbalanced datasets and produce precise conclusions even when input characteristics are
noisy or absent. SVM may therefore help in the early diagnosis and prevention of cardiac illness
and so in improving patients' state of health.

4.8 KNN
K-Nearest Neighbors (KNN) is a popular machine learning algorithm for classification problems,
including the identification of heart disease, that is straightforward but effective. In
order to classify a new instance, KNN locates the K examples in a dataset that are closest to it.
Then, the new instance is assigned the class label that is most common among the K nearest
neighbors. In the context of heart disease detection, KNN can be used to classify patients as
having or not having heart disease based on a set of input features.

KNN's simplicity and ease of implementation are two of its key benefits. KNN can handle both
linear and nonlinear relationships between the input features and the outcome variable without
making any assumptions about the underlying data distribution. There are numerous possible risk
variables that can be employed as input features in the detection of heart disease, such as age,
sex, blood pressure, cholesterol, smoking status, and family medical history. With the help of
these input features, KNN is able to identify intricate nonlinear relationships that indicate
whether cardiac disease is present or not.

KNN works by calculating the distance between the input features of the new instance and the
input features of every instance in the dataset. Based on the distance measure, the K
closest examples are chosen, and the new instance is classified using the most prevalent class
label among the K nearest neighbors.
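
The procedure above fits in a few lines. This sketch assumes Euclidean distance and features that have already been scaled to comparable ranges; the function name and the toy points are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority label among its k nearest training points."""
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical scaled features (e.g. age and cholesterol mapped to [0, 1]).
train_points = [(0.2, 0.1), (0.3, 0.2), (0.1, 0.3),
                (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
train_labels = [0, 0, 0, 1, 1, 1]

high_risk = knn_predict(train_points, train_labels, (0.75, 0.8), k=3)
low_risk = knn_predict(train_points, train_labels, (0.2, 0.2), k=3)
```

Because the prediction depends directly on distances, feature scaling matters for KNN: an unscaled feature with a large numeric range would dominate the neighbour search.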

KNN has been effective in detecting cardiac disease in a number of investigations. For
instance, a study by Krittanawong et al. (2020) employed KNN to forecast a patient's
probability of coronary artery disease based on a number of laboratory data and risk factors.
Using a collection of clinical and laboratory characteristics, Park et al. (2018) conducted
another study in which they employed KNN to predict the occurrence of coronary artery
dissection.

KNN's capacity to manage imbalanced datasets is one of its benefits. Such datasets are common
in medicine, where the proportion of positive cases (patients with heart disease) is
significantly lower than that of negative cases (patients without heart disease). To increase the
true positive rate (the rate of accurately identifying patients with heart disease) while
limiting the false positive rate (the rate of mistakenly flagging patients without heart
disease), the decision threshold applied to the neighbors' votes can be modified.

CHAPTER 5

RESULTS AND DISCUSSIONS

Fig 5.1 Plotting the confusion matrix

This illustrates the model's predictions against the actual labels for the proposed problem.

Table 5.1 Analysis of Data

Fig 5.2 Correlation Matrix
This illustrates the correlation matrix between the various parameters of the patient data.

Fig 5.3 Pair Plot

The pair plot shows the pairwise relationships among the parameters in the patient data.

Table 5.2 Machine learning part

Fig 5.4 Subplot analysis

This illustrates the data about the chain smokers.

Fig 5.5 Count plot analysis.

This illustrates the count of chain smokers among the male patients.

Discussion:
The core objective of this work, accurate prediction and early detection of heart disease,
reflects the system's potential to improve healthcare outcomes. By implementing diverse
machine learning algorithms such as Logistic Regression, Support Vector Machines, Random
Forests, and Neural Networks, the system demonstrates that it can analyze complex patient
datasets comprehensively. These algorithms identify patterns and correlations within the
data, giving healthcare professionals insight into cardiovascular risk factors and disease
progression.

Furthermore, feature engineering and selection techniques are central to the model's
predictive performance. By identifying and incorporating pertinent features derived from
patient data, such as demographics, medical history, and diagnostic test results, the model
attains higher accuracy and reliability in forecasting heart disease outcomes. Deployment
into real-world healthcare environments is supported by system integration strategies that
ensure interoperability with existing healthcare infrastructure and clinical workflows, so
that practitioners can access and use the predictive model in their daily practice.

Moreover, post-deployment monitoring mechanisms are pivotal for maintaining the performance
of the predictive model. By monitoring key metrics such as prediction accuracy, latency, and
model drift, healthcare organizations can proactively identify and rectify anomalies,
safeguarding the model's reliability over time. Additionally, continuous evaluation and
validation against clinical outcomes and patient data keep the model relevant and effective
in real-world healthcare settings.
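
One simple way to monitor prediction accuracy after deployment is to track it over a sliding window of recent cases whose true outcome has become known. The sketch below is an assumption-laden illustration, not part of the developed system: the class name RollingAccuracyMonitor, the window size, and the alert threshold are all hypothetical.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the most recent `window` labeled predictions."""

    def __init__(self, window=100, alert_below=0.8):
        # Each entry is True where the prediction matched the actual outcome.
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None  # nothing recorded yet
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        acc = self.accuracy()
        return acc is not None and acc < self.alert_below


monitor = RollingAccuracyMonitor(window=4, alert_below=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1), (1, 0)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_review())
```

A sustained drop in the windowed accuracy is one possible signal of model drift and could trigger retraining or manual review.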

The achieved objectives also have broader implications for clinical practice and patient
care. By providing healthcare providers with timely and precise predictions of heart disease
risk, the system helps them make informed decisions about patient management, treatment
strategies, and preventive measures. Integrating predictive modeling into clinical workflows
may also help optimize resource allocation, reduce healthcare expenditure, and improve
overall patient outcomes. However, ethical considerations such as patient privacy, data
security, and algorithm transparency must be addressed to ensure the responsible use of
predictive modeling technology in healthcare.

In essence, the achieved objectives underscore the system's potential in healthcare. By
combining machine learning algorithms, feature engineering techniques, and system
integration strategies, the developed system offers healthcare providers a tool for
improving the early detection and management of heart disease, and with it patient care and
healthcare outcomes.

CHAPTER 6

CONCLUSION AND FUTURE ENHANCEMENT

Heart disease impacts millions of individuals globally and is a serious public health
concern. Effective treatment and management of this illness depend on early discovery and
precise diagnosis. The field of machine learning has demonstrated significant potential in
enhancing the identification and diagnosis of cardiac disease by offering more precise and
quicker prognoses for patient outcomes.

In this project, we put forth a machine learning-based methodology for the identification
of cardiac disease. We gathered data from a variety of sources and preprocessed it to
prepare it for analysis. Feature selection identified the most significant heart disease
predictors, and the predictive power of several machine learning techniques was evaluated.

Our study's findings demonstrated the effectiveness of the Random Forest classifier,
XGBoost, Decision Tree, and Ensemble approach in predicting heart disease. All of the
models had accuracy scores above 80%, with XGBoost having the highest accuracy at
87%. These findings show how machine learning techniques can be used to improve the
diagnosis and detection of cardiac disease.

Furthermore, our research emphasizes how crucial feature selection and data preprocessing
are to raising the accuracy of machine learning models. Preprocessing methods such as
feature scaling, normalization, and handling missing values were essential in preparing the
data for analysis. Feature selection improved model performance by reducing the complexity
of the dataset and helping to identify the most pertinent heart disease predictors.
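
Two of the preprocessing steps named above, handling missing values and feature scaling, can be sketched as follows. This is an illustrative example only: mean imputation and min-max scaling are assumed here as representative choices, and the cholesterol values are made up.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max_scale(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical cholesterol readings (mg/dl) with one missing value.
chol = [200, None, 240, 280]
filled = impute_mean(chol)
scaled = min_max_scale(filled)
print(filled)
print(scaled)
```

In a full pipeline the mean, minimum, and maximum would normally be computed on the training split only and then reused for the test split, so that no information from the test data influences the preprocessing.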

Furthermore, our study shows how crucial model evaluation is in determining which machine
learning algorithm is best for detecting cardiac disease. We used metrics such as accuracy,
precision, recall, and F1-score to assess the effectiveness of several machine learning
algorithms. These measurements gave us a thorough grasp of the advantages and disadvantages
of each method and allowed us to choose the best model for heart disease detection.
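
For reference, the four metrics can be computed directly from the counts of true and false positives and negatives. The sketch below assumes binary labels with 1 as the positive (disease) class; the function name classification_metrics and the example labels are hypothetical and do not reproduce the project's results.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Recall is usually the metric of most interest in screening settings, because a false negative corresponds to a patient with heart disease who is missed.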

Lastly, we discussed how crucial model deployment is to clinical practice. Using machine
learning algorithms successfully in clinical settings calls for careful consideration of
issues including patient safety, data privacy, and regulatory compliance. Nevertheless, its
potential to improve the diagnosis and detection of heart disease makes machine learning a
useful tool for medical professionals.

Our research concludes by showing the promise of machine learning algorithms for
enhancing the identification and diagnosis of cardiac disease. The suggested approach offers a
framework for using machine learning in healthcare practices, enabling more precise and
rapid patient outcome predictions. To confirm our findings and investigate the application of
machine learning in additional healthcare domains, more investigation is required.

In terms of additional study and implementation, the machine learning-based heart disease
diagnosis project has a bright future. The following are some possible directions for
future work:

Electronic Health Record (EHR) Integration:

EHRs contain a wealth of patient data, such as imaging data, lab test results, and medical
histories. Integrating the heart disease detection model with EHRs could increase the
model's prediction accuracy and improve patient outcomes.

Extension to further illnesses:

Other diseases can also be detected using the machine learning methods employed in this
investigation. For example, the same approach can be used to identify chronic diseases such
as breast and lung cancer.

Real-time monitoring:
The cardiac disease detection model can be connected with wearable health devices,
including fitness trackers and smartwatches, to enable real-time monitoring. This can
assist patients and medical professionals in identifying and acting upon any irregularities in
blood pressure or heart rate.

Personalized medicine:
A patient's future risk of acquiring heart disease can be estimated using the heart disease
detection model. With this data, individualized treatment programs that consider the patient's
unique risk factors and medical background can be created.

Integration with telemedicine:

As a way to deliver healthcare remotely, telemedicine is quickly gaining traction. By
integrating the heart disease detection model with telemedicine systems, patients can get a
prompt and precise diagnosis without going to the doctor.

Continuous Improvement:
It is possible to make constant improvements to the heart disease detection model by adding
fresh information, enhancing the algorithms, and testing.

REFERENCES

[1] Nashif, Shadman, et al. "Heart disease detection by using machine learning algorithms
and a real-time cardiovascular health monitoring system." World Journal of Engineering and
Technology 6.4 (2018): 854-873.

[2] Chang, Victor, et al. "An artificial intelligence model for heart disease detection using
machine learning algorithms." Healthcare Analytics 2 (2022): 100016.

[3] Li, Jian Ping, et al. "Heart disease identification method using machine learning
classification in e-healthcare." IEEE Access 8 (2020): 107562-107582.

[4] Ghumbre, Shashikant U., and Ashok A. Ghatol. "Heart disease diagnosis using machine
learning algorithm." Proceedings of the International Conference on Information Systems
Design and Intelligent Applications 2012 (INDIA 2012) held in Visakhapatnam, India,
January 2012. Springer Berlin Heidelberg, 2012.

[5] Nagavelli, Umarani, Debabrata Samanta, and Partha Chakraborty. "Machine learning
technology-based heart disease detection models." Journal of Healthcare Engineering 2022
(2022).

[6] Yadav, Samir S., et al. "Application of machine learning for the detection of heart
disease." 2020 2nd international conference on innovative mechanisms for industry
applications (ICIMIA). IEEE, 2020.

[7] Atallah, Rahma, and Amjed Al-Mousa. "Heart disease detection using machine learning
majority voting ensemble method." 2019 2nd international conference on new trends in
computing sciences (ictcs). IEEE, 2019.

[8] Ahsan, Md Manjurul, and Zahed Siddique. "Machine learning-based heart disease
diagnosis: A systematic literature review." Artificial Intelligence in Medicine 128
(2022): 102289.

[9] Abdeldjouad, Fatma Zahra, Menaouer Brahami, and Nada Matta. "A hybrid approach for
heart disease diagnosis and prediction using machine learning techniques." The Impact of
Digital Technologies on Public Health in Developed and Developing Countries: 18th
International Conference, ICOST 2020, Hammamet, Tunisia, June 24–26, 2020, Proceedings
18. Springer International Publishing, 2020.

[10] Jagtap, Abhijeet, et al. "Heart disease prediction using machine learning." International
Journal of Research in Engineering, Science and Management 2.2 (2019): 352-355.

[11] Bhatt, Chintan M., et al. "Effective heart disease prediction using machine learning
techniques." Algorithms 16.2 (2023): 88.

APPENDIX A
CODING

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset and inspect its shape, first rows, and summary statistics.
dataset = pd.read_csv("heart.csv")
print(dataset.shape)
print(dataset.head(5))
print(dataset.describe())
dataset.info()
print(dataset["target"].describe())
print(dataset["target"].unique())

# Rank the features by absolute correlation with the target.
print(dataset.corr()["target"].abs().sort_values(ascending=False))

# Class balance of the target variable.
y = dataset["target"]
sns.countplot(x=y)
plt.show()
target_temp = dataset.target.value_counts()
print(target_temp)

# Mean target value per category of each categorical feature.
for column in ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca"]:
    sns.barplot(x=dataset[column], y=y)
    plt.show()

# Distributions of "ca" and "thal" (histplot replaces the deprecated distplot).
sns.countplot(x=dataset["ca"])
plt.show()
sns.histplot(x=dataset["thal"])
plt.show()
from sklearn.model_selection import train_test_split

# Separate the predictors from the target and hold out 20% of the rows for testing.
predictors = dataset.drop("target", axis=1)
target = dataset["target"]

X_train, X_test, Y_train, Y_test = train_test_split(predictors, target, test_size=0.20, random_state=0)
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# Logistic Regression
lr = LogisticRegression(tol=0.01, max_iter=1000, solver='saga')
lr.fit(X_train, Y_train)
Y_pred_lr = lr.predict(X_test)
score_lr = round(accuracy_score(Y_test, Y_pred_lr)*100, 2)

print("The accuracy score achieved using Logistic Regression is: "+str(score_lr)+" %")
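from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes
nb = GaussianNB()
nb.fit(X_train, Y_train)
Y_pred_nb = nb.predict(X_test)
score_nb = round(accuracy_score(Y_test, Y_pred_nb)*100, 2)

print("The accuracy score achieved using Naive Bayes is: "+str(score_nb)+" %")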
from sklearn import svm

# Support Vector Machine with a linear kernel
sv = svm.SVC(kernel='linear')
sv.fit(X_train, Y_train)
Y_pred_svm = sv.predict(X_test)
score_svm = round(accuracy_score(Y_test, Y_pred_svm)*100, 2)

print("The accuracy score achieved using Linear SVM is: "+str(score_svm)+" %")
from sklearn.neighbors import KNeighborsClassifier

# K-Nearest Neighbors with K = 7
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, Y_train)
Y_pred_knn = knn.predict(X_test)
score_knn = round(accuracy_score(Y_test, Y_pred_knn)*100, 2)

print("The accuracy score achieved using KNN is: "+str(score_knn)+" %")

from sklearn.tree import DecisionTreeClassifier

# Decision Tree: try 200 random seeds and keep the one with the best test accuracy.
# Note: choosing the seed on the test set makes the reported score optimistic.
max_accuracy = 0
for x in range(200):
    dt = DecisionTreeClassifier(random_state=x)
    dt.fit(X_train, Y_train)
    Y_pred_dt = dt.predict(X_test)
    current_accuracy = round(accuracy_score(Y_test, Y_pred_dt)*100, 2)
    if current_accuracy > max_accuracy:
        max_accuracy = current_accuracy
        best_x = x

# Refit with the selected seed and report its accuracy.
dt = DecisionTreeClassifier(random_state=best_x)
dt.fit(X_train, Y_train)
Y_pred_dt = dt.predict(X_test)
score_dt = round(accuracy_score(Y_test, Y_pred_dt)*100, 2)

print("The accuracy score achieved using Decision Tree is: "+str(score_dt)+" %")
from sklearn.ensemble import RandomForestClassifier

# Random Forest: same seed search as for the Decision Tree above.
# Note: choosing the seed on the test set makes the reported score optimistic.
max_accuracy = 0
for x in range(200):
    rf = RandomForestClassifier(random_state=x)
    rf.fit(X_train, Y_train)
    Y_pred_rf = rf.predict(X_test)
    current_accuracy = round(accuracy_score(Y_test, Y_pred_rf)*100, 2)
    if current_accuracy > max_accuracy:
        max_accuracy = current_accuracy
        best_x = x

# Refit with the selected seed and report its accuracy.
rf = RandomForestClassifier(random_state=best_x)
rf.fit(X_train, Y_train)
Y_pred_rf = rf.predict(X_test)
score_rf = round(accuracy_score(Y_test, Y_pred_rf)*100, 2)

print("The accuracy score achieved using Random Forest is: "+str(score_rf)+" %")
scores = [score_lr, score_svm, score_knn, score_dt, score_rf]
algorithms = ["Logistic Regression", "Support Vector Machine", "K-Nearest Neighbors", "Decision Tree", "Random Forest"]

for i in range(len(algorithms)):
    print("The accuracy score achieved using "+algorithms[i]+" is: "+str(scores[i])+" %")

# Bar chart comparing the accuracy of the algorithms.
sns.set(rc={'figure.figsize': (15, 8)})
sns.barplot(x=algorithms, y=scores)
plt.xlabel("Algorithms")
plt.ylabel("Accuracy score")
plt.show()

# Predict the class of a single example; the 13 values follow the column order of `predictors`.
features = pd.DataFrame(np.array([[25, 1, 3, 132, 244, 0, 1, 170, 0, 2.2, 2, 0, 1]]), columns=predictors.columns)
prediction = rf.predict(features)
print(prediction)

APPENDIX B

PUBLICATION DETAILS

We have submitted the research paper for publication in IEEE Xplore.
