Project Report Divii
BONAFIDE CERTIFICATE
I thank GOD for the love and blessings showered upon me in completing this
project successfully.
DIVYA.S
CONTENTS
1 INTRODUCTION
2 SYSTEM SPECIFICATION
2.1 Hardware Specification
3 PROBLEM DESCRIPTION
4 SYSTEM ANALYSIS
4.1 Existing System
5 CONCLUSION
6 FUTURE ENHANCEMENT
7 BIBLIOGRAPHY
8 APPENDIX
1. Sample Coding
2. Screen Layout
ABSTRACT
Heart diseases, also known as cardiovascular diseases, encompass a wide range of
conditions that affect the heart. These range from blood vessel diseases and heart rhythm
problems to heart defects that one is born with. Heart disease has been the primary cause of
death worldwide over the past few decades. There is an urgent need for an accurate and
reliable approach to early diagnosis of the disease, so that its management can be automated
and made efficient. Data Science plays an important role in processing large amounts of data
in the field of medical sciences. Researchers utilize several Data Mining and Machine Learning
techniques to analyze large sets of data and aid in the correct prediction of heart diseases. In
this project, Machine Learning algorithms, implemented in Python, were used for a comparative
analysis to predict the presence of heart disease in a patient. This is a classification problem,
with a variety of parameters as input features and a binary target variable indicating whether
heart disease is present or not. This project analyzes the supervised learning models Support
Vector Machine, K-Nearest Neighbors, Decision Tree, and Random Forest, and a comparative
study is carried out to find the most efficient algorithm. From this study it is found that Random
Forest provides the highest accuracy, 90.16%, in comparison to the other algorithms.
CHAPTER – 1
INTRODUCTION
The heart plays an important role in the human body by pumping blood, supplying blood
to all parts of the body, and purifying blood. When the heart does not get the required amount
of blood, the result is heart failure and death. India also has a very high rate of death due to
heart diseases. Accurate and timely diagnosis of heart disease is necessary to protect the
heart and save lives. Heart diseases include rheumatic heart disease, valvular heart disease,
hypertensive heart disease, and cerebrovascular heart disease. Symptoms include chest pain
or discomfort, shortness of breath, fainting, swelling of the legs, abdomen, or areas around the
eyes, and tiring easily during exercise and activity. Machine learning algorithms and techniques
help us to predict and diagnose different heart diseases and help doctors to avoid sudden
death in such cases.
A huge number of patients' details are collected and interpreted to predict the occurrence of the
disease. In this project, four algorithms have been used: Support Vector Machine, K-Nearest
Neighbors, Decision Tree, and Random Forest. The objective of this project is to compare the
accuracy of these four machine learning algorithms and conclude which is the best algorithm
for heart disease prediction. It is very difficult to diagnose heart diseases in advance or at early
stages, even when symptoms are noticed, because the symptoms commonly do not all occur
at once. So it is very important to diagnose heart diseases in time, which avoids death or
reduces the mortality rate. Since diagnosing heart disease is very expensive, the majority of
people do not visit a doctor for consultation, for various reasons. In this project, the problem is
addressed based on the given information about each individual.
In order to achieve the objective, machine learning algorithms are used at different levels of
evaluation. Although machine learning algorithms are commonly used, heart disease prediction
is a vital task demanding the highest possible accuracy. Hence, these algorithms are evaluated
at numerous levels and with several types of evaluation strategies. The main contributions of
this project include extraction of classification accuracy useful for heart disease prediction,
comparison of different machine learning algorithms, and identification of the best-performing
algorithm for heart disease prediction.
CHAPTER – 2
SYSTEM SPECIFICATION
2.1 Hardware Specification
Processor : Intel Core i5
RAM : 4GB
Board : Mercury
Mouse : Logitech
2.3.1 Aim and objective of the study
The main objective of the study is to detect and assess the classification of heart disease
prediction using machine learning.
The specific objectives of the study are:
To check whether a patient is likely to be diagnosed with any cardiovascular heart
disease based on their medical attributes such as gender, age, chest pain, fasting
sugar level, etc.
To detect heart disease early using a machine learning system.
2.3.2 Dataset
Machine Learning depends heavily on data; data is the most crucial aspect that makes
algorithm training possible. There are many popular open sources for collecting data,
e.g. Kaggle.com, the UCI repository, etc.
In this project the heart disease.csv dataset is used. This data was downloaded from
Kaggle.com.
CHAPTER – 3
PROBLEM DESCRIPTION
3.1.2 Load and Read the dataset
The dataset format might be .csv, Excel, .txt, .json, etc.; here the dataset is read with
the help of pandas.
Pandas has a function called read_csv() to read a dataset. The directory of the CSV file is
given as a parameter to load the dataset.
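As in the appendix code, the dataset is loaded with pandas and given a quick sanity check; a
minimal sketch:

import pandas as pd

# Load the heart disease dataset; the CSV path/filename is the parameter
dataset = pd.read_csv("heart.csv")

# Quick checks on what was loaded
print(dataset.shape)   # (rows, columns)
print(dataset.head())  # first five records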
3.2 DATA PREPROCESSING
Data pre-processing is an important step in the creation of a machine learning model.
Initially, data may not be clean or in the format required by the model, which can cause
misleading outcomes. In pre-processing, the data is transformed into the required format.
Pre-processing is used to deal with noise, duplicates, and missing values in the dataset.
It includes activities like importing datasets, splitting datasets, attribute scaling, etc., and
is required for improving the accuracy of the model.
The dataset obtained is not completely accurate and error-free. Hence, we first carry out the
following operations on it.
Data Balancing: Data balancing is crucial for accurate results, and the data balancing graph
shows that the two target classes are approximately equal in size. The target classes are shown
with "1" denoting patients with heart disease and "0" denoting patients without heart disease.
Data Cleaning: NA values in the dataset are a major setback, as they reduce the accuracy of
the prediction profoundly. Rather than dropping records with missing fields, each missing value
is substituted with the mean value of its column. This way, all NA values are removed from the
data set.
Feature Scaling: Since the ranges of values in raw data vary widely, the objective functions of
some machine learning algorithms do not work properly without feature scaling. For example,
the majority of classifiers calculate the distance between two points by the Euclidean distance.
If one of the features has a broad range of values, the distance will be governed by that
particular feature. E.g. sex has just two values, 0 and 1, while cholesterol takes large values
like 100 or more. So, in order to bring them closer to each other, they need to be scaled.
Factorization: In this process a meaning is assigned to coded values so that the algorithm does
not confuse them. For example, meanings are assigned to 0 and 1 in the sex column so that
the algorithm does not consider 1 as greater than 0 in that column, as sketched below.
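A short sketch of the cleaning and factorization steps described above. Note that the Kaggle
heart.csv used in the appendix turns out to have no missing values, so the fillna() call is
illustrative only, and the label names below are assumptions:

import pandas as pd

dataset = pd.read_csv("heart.csv")

# Data cleaning: substitute any NA in numeric columns with the column mean
dataset = dataset.fillna(dataset.mean(numeric_only=True))

# Factorization: attach labels to coded values so their numeric ordering
# carries no accidental meaning (label names are illustrative)
dataset["sex_label"] = dataset["sex"].map({0: "female", 1: "male"})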
3.2.1 Importing The Libraries
In Python, a library refers to a collection of modules that are used repeatedly
in various programs without the need to write them from scratch.
Modules, on the other hand, are Python files saved with the .py extension.
Modules often contain code such as functions, classes and statements that can be imported
and used within other programs.
The necessary libraries are shown below:
Seaborn
Numpy
Pandas
Matplotlib.pyplot
Seaborn
Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and
integrates closely with pandas data structures. Seaborn helps you to explore and understand
your data.
Numpy
NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds
powerful data structures to Python that guarantee efficient calculation with arrays and matrices,
and it supplies an enormous library of high-level mathematical functions that operate on these
arrays and matrices.
Pandas
Pandas is a Python package providing fast, flexible, and expressive data structures designed to
make working with "relational" or "labeled" data both easy and intuitive.
Matplotlib.pyplot
Matplotlib is a cross-platform data visualization and graphical plotting library for Python and its
numerical extension NumPy.
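The corresponding import statements, as used in the appendix code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns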
3.3 EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis (EDA) is a pre-processing step to understand the data. There are
numerous methods and steps in performing EDA; however, most of them are specific, focusing
on either visualization or distribution, and are incomplete. Therefore, this chapter walks through,
step by step, how to understand, explore, and extract information from the data to answer the
questions or assumptions. There is no single structured method to follow, but this project
provides one workable approach to EDA. Initially, the dataset contains 76 features or attributes
from 303 patients; however, published studies chose only 14 features that are relevant in
predicting heart disease. Hence, here we use the dataset consisting of 303 patients with a set
of 14 features.
3.3.1 Descriptive statistics
Descriptive analysis studies the basic features of the data with statistical methods. Pandas has
a useful function called describe(). With describe() we can see the unique, top and most
frequent values of categorical features, and the mean, std, min, max and percentile values of
continuous features.
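For example, as in the appendix:

# Summary statistics (count, mean, std, min, percentiles, max) per column
print(dataset.describe())

# Distinct values of the binary target
print(dataset["target"].unique())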
3.3.2 Visual Analysis
Visual analysis is the process of using visual representations, such as charts, plots, and graphs,
to explore and understand data. It is a way to quickly identify patterns, trends, and outliers,
which helps to gain insights and make informed decisions. Histograms are used for both
univariate and bivariate analysis.
3.3.2.1 Univariate analysis:
Univariate analysis is the simplest form of analysis, exploring a single variable. It is performed
to describe the data in a better way. Univariate analysis of numerical and categorical variables
is performed differently because they use different plots.
3.3.2.2 Bivariate and Multivariate analysis:
Having studied various plots for exploring single categorical and numerical variables, bivariate
analysis is used to explore the relationship between two different variables. This matters
because, in the end, the main task is to explore the relationships between variables in order to
build a powerful model. When more than two variables are analyzed together, this is known as
multivariate analysis. Different plots are used for bivariate as well as multivariate analysis, as
sketched below.
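A bivariate sketch relating a categorical feature to the target. The appendix uses seaborn's
older positional barplot API; the keyword form below is the equivalent in recent seaborn
versions:

import seaborn as sns

# Mean of the target for each sex: a quick view of how the two relate
sns.barplot(x=dataset["sex"], y=dataset["target"])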
3.3.2.3 Histogram:
A histogram is a value-distribution plot for numerical columns. It creates bins over ranges of
values and plots the counts, so we can visualize how values are distributed and see where
most values lie, whether low, high, or at the center (mean). Let's have a look at the age column;
a sketch follows.
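A minimal sketch of the age histogram using matplotlib:

import matplotlib.pyplot as plt

plt.hist(dataset["age"], bins=20, edgecolor="black")
plt.xlabel("Age")
plt.ylabel("Number of patients")
plt.show()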
3.4 SPLIT THE DATASET INTO TRAINING AND TESTING
3.4.1 Scaling the data
It is important to scale the data so the ML algorithms do not overfit to the wrong features.
Using MinMaxScaler(), the values of each feature are scaled between 0 and 1 based on that
feature's minimum and maximum. This keeps information from being lost while allowing the ML
algorithms to train correctly with the data.
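The appendix code trains on unscaled features; a sketch of the scaling step described here,
assuming scikit-learn's MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # rescales each feature to [0, 1]
X_scaled = scaler.fit_transform(dataset.drop("target", axis=1))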
3.4.2 Dataset Splitting
The data is normally split into two subsets, training data and testing data (and sometimes into
three: train, validate and test), in order to evaluate how well the machine learning model
performs. The training set is the data actually used to fit and train the model, and its statistics
are known to the model. The test set is the sample of data used only to evaluate predictions.
The train-test split is used to estimate the performance of machine learning algorithms on
prediction tasks. It is a fast and easy procedure that lets users compare the results of different
machine learning models on the same held-out data. In this project the data is split into 80%
training and 20% testing (61 of the 303 records form the test set), which is a general rule of
thumb for splitting data to train ML algorithms. The split is sketched below.
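The split as performed in the appendix code:

from sklearn.model_selection import train_test_split

predictors = dataset.drop("target", axis=1)
target = dataset["target"]

# 80% train / 20% test; random_state fixed for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(
    predictors, target, test_size=0.20, random_state=0)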
3.5 MODEL BUILDING
3.5.1 Machine learning Algorithm
Machine learning algorithms are programs that can learn hidden patterns from data, predict
outputs, and improve their performance from experience on their own. Different algorithms suit
different tasks, e.g. prediction problems like stock market prediction, and classification
problems.
1. Supervised Learning Algorithm
Supervised learning is a type of machine learning in which the machine needs external
supervision to learn. Supervised learning models are trained using labeled datasets. Once
training and processing are done, the model is tested with sample test data to check whether it
predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised learning is
based on supervision, much as a student learns under a teacher's supervision. An example of
supervised learning is spam filtering. Supervised learning can be divided further into two
categories of problem:
• Classification
• Regression
2. Unsupervised Learning Algorithm
Unsupervised learning is a type of machine learning in which the machine does not need any
external supervision to learn from the data. Unsupervised models are trained on unlabelled
data that is neither classified nor categorized, and the algorithm must act on that data without
supervision. In unsupervised learning, the model has no predefined output and tries to find
useful insights from large amounts of data. These methods are used to solve association and
clustering problems. Hence it can be classified further into two types:
• Clustering
• Association
3. Reinforcement Learning Algorithm
In reinforcement learning, an agent interacts with its environment by producing actions. For
each good action, the agent gets positive feedback, and for each bad action, the agent gets
negative feedback or a penalty. The agent learns automatically from this feedback, without any
labeled data, unlike supervised learning. "Reinforcement learning is a type of machine learning
method where an intelligent agent (computer program) interacts with the environment and
learns to act within it." It is a core part of Artificial Intelligence, and all AI agents work on the
concept of reinforcement learning.
This project deals with binary classification of labeled data, and the algorithms were chosen
on that basis. The methodologies used in this model are listed below:
1. K Nearest Neighbors classifier
2. Support vector classifier
3. Decision tree classifier
4. Random forest classifier
3.5.2.1 K Nearest Neighbors classifier
The KNN algorithm is a simple but extremely powerful classification algorithm. The
name of the algorithm originates from the underlying philosophy of KNN, i.e. that people having
a similar background or mindset tend to stay close to each other. In the KNN algorithm, an
unknown, unlabelled data point that comes in for prediction is judged on the basis of the training
data set elements that are most similar to it, and it is assigned a class on the basis of the class
labels of those similar training elements.
Algorithm:
Steps:
For each test data point, do:
Calculate the distance of the test data point from the different training data points.
Find the k nearest training data points.
If k = 1,
then assign the class label of that training data point to the test data point.
Else,
assign to the test data point whichever class label is predominantly present among
the k nearest training data points.
End do
from sklearn.neighbors import KNeighborsClassifier

# Fit a KNN classifier with k = 7 neighbours
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, Y_train)

Y_pred_knn = knn.predict(X_test)
Y_pred_knn.shape

score_knn = round(accuracy_score(Y_pred_knn, Y_test)*100, 2)
print("The accuracy score achieved using KNN is: "+str(score_knn)+" %")
Accuracy: 67.21 %
3.5.2.2 Support vector classifier
Support Vector Machine, or SVM, is one of the most popular supervised learning
algorithms, used for classification as well as regression problems. Primarily, however, it is used
for classification problems in machine learning. The goal of the SVM algorithm is to create the
best line or decision boundary that can segregate n-dimensional space into classes, so that
new data points can easily be placed in the correct category in the future. This best decision
boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating
the hyperplane. These extreme cases are called support vectors, and hence the algorithm is
termed a Support Vector Machine. Support vector machines are powerful yet flexible
supervised machine learning algorithms used both for classification and regression.
But generally, they are used in classification problems. SVMs were first introduced in the 1960s
and later refined in the 1990s. They have their own unique way of implementation compared
with other machine learning algorithms, and have lately become extremely popular because of
their ability to handle multiple continuous and categorical variables.
An SVM is a supervised learning method that analyzes data for classification and
regression analysis. Given a set of training data, each example marked as belonging to one of
two categories, an SVM training algorithm builds a model that assigns new examples to one
category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a
representation of the examples as points in space, mapped so that the examples of the
separate categories are divided by a clear gap that is as wide as possible. New examples are
then mapped into that same space and predicted to belong to a category based on which side
of the gap they fall. The points are separated by a hyperplane. When data are not labeled,
supervised learning is not possible, and an unsupervised learning approach is required, which
attempts to find a natural clustering of the data into groups and then maps new data onto the
formed groups. In this project, the algorithm is used to classify the patients into groups
according to the risk posed to them based on the parameters provided.
If the number of features is much greater than the number of samples, avoiding over-fitting
through the choice of kernel function and regularization term is crucial. SVMs do not directly
provide probability estimates; these are calculated using an expensive five-fold cross-
validation, as sketched below.
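A sketch of obtaining probability estimates, assuming scikit-learn's SVC: setting
probability=True enables Platt scaling, fitted with an internal cross-validation, which is why it is
comparatively expensive.

from sklearn import svm

sv = svm.SVC(kernel='linear', probability=True)
sv.fit(X_train, Y_train)
probs = sv.predict_proba(X_test)  # per-class membership probabilities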
Algorithm:
Input: heart disease input features; assign training and testing datasets for heart disease.
Steps:
For the set number of input features, run the training step.
Run the classification step for the set number of features in the input file.
from sklearn import svm
from sklearn.metrics import accuracy_score

# Linear-kernel SVM
sv = svm.SVC(kernel='linear')
sv.fit(X_train, Y_train)

Y_pred_svm = sv.predict(X_test)
Y_pred_svm.shape  # (61,)

score_svm = round(accuracy_score(Y_pred_svm, Y_test)*100, 2)
print("The accuracy score achieved using Linear SVM is: "+str(score_svm)+" %")
Accuracy: 81.97 %
3.5.2.3 Decision tree classifier:
Decision tree learning is one of the most widely adopted algorithms for classification. As
the name indicates, it builds a model in the form of a tree structure. Its classification accuracy is
competitive with other strategies, and it is highly efficient. A decision tree is used for
multi-dimensional analysis with multiple classes. It is characterized by fast execution time and
ease of interpretation of its rules. The goal of decision tree learning is to create a model that
predicts the value of the output variable from the input variables in the feature vector. Each
node of a decision tree corresponds to one of the features in the feature vector. From every
node there are edges to children, one edge for each of the possible values of the feature
associated with the node. The tree terminates at leaf nodes, where each leaf node represents a
possible value of the output variable. The output variable is determined by following a path that
starts at the root and is guided by the values of the input variables.
Algorithm:
Steps:
For each candidate attribute F, compute the entropy E of the split on F.
If E < E_min, set E_min = E and F_min = F.
End do
Split the data set into subsets using the attribute F_min.
Draw a decision tree node containing the attribute F_min and split the data set into subsets.
Repeat the above steps until the full tree is drawn covering all the attributes of the original table.
Applying Decision tree classifier:

from sklearn.tree import DecisionTreeClassifier

# Search over random seeds for the seed giving the best test accuracy
max_accuracy = 0
for x in range(200):
    dt = DecisionTreeClassifier(random_state=x)
    dt.fit(X_train, Y_train)
    Y_pred_dt = dt.predict(X_test)
    current_accuracy = round(accuracy_score(Y_pred_dt, Y_test)*100, 2)
    if current_accuracy > max_accuracy:
        max_accuracy = current_accuracy
        best_x = x
# print(max_accuracy)
# print(best_x)

# Refit with the best seed found
dt = DecisionTreeClassifier(random_state=best_x)
dt.fit(X_train, Y_train)
Y_pred_dt = dt.predict(X_test)
print(Y_pred_dt.shape)  # (61,)

score_dt = round(accuracy_score(Y_pred_dt, Y_test)*100, 2)
print("The accuracy score achieved using Decision Tree is: "+str(score_dt)+" %")
Accuracy: 81.97 %
3.5.2.4 Random forest classifier:
Random forest is an ensemble classifier, i.e. a combining classifier that uses and
combines many decision tree classifiers. Ensembling is usually done using the concept of
bagging with different feature sets. The reason for using a large number of trees in a random
forest is to train the trees enough that the contribution from each feature appears in a number
of models. After the random forest is generated by combining the trees, a majority vote is
applied to combine the outputs of the different trees. The result from the ensemble model is
usually better than that from the individual decision tree models.
a. If there are N variables or features in the input data set, select a subset of ‘m’ (m < N)
features at random out of the N features. Also, the observations or data instances should
be picked randomly.
b. Use the best split principle on these ‘m’ features to calculate the number of nodes ‘d’.
c. Keep splitting the nodes to child nodes till the tree is grown to the maximum possible
extent.
d. Select a different subset of the training data 'with replacement' to train another decision
tree following steps (a) to (c). Repeat this to build and train 'n' decision trees.
e. Final class assignment is done on the basis of the majority vote from the 'n' trees.
Advantages:
● It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages:
● Although Random Forest can be used for both classification and regression tasks, it is
less suitable for regression tasks.
from sklearn.ensemble import RandomForestClassifier

# Search over random seeds for the seed giving the best test accuracy
max_accuracy = 0
for x in range(2000):
    rf = RandomForestClassifier(random_state=x)
    rf.fit(X_train, Y_train)
    Y_pred_rf = rf.predict(X_test)
    current_accuracy = round(accuracy_score(Y_pred_rf, Y_test)*100, 2)
    if current_accuracy > max_accuracy:
        max_accuracy = current_accuracy
        best_x = x
# print(max_accuracy)
# print(best_x)

# Refit with the best seed found
rf = RandomForestClassifier(random_state=best_x)
rf.fit(X_train, Y_train)
Y_pred_rf = rf.predict(X_test)
Y_pred_rf.shape  # (61,)

score_rf = round(accuracy_score(Y_pred_rf, Y_test)*100, 2)
print("The accuracy score achieved using Random Forest is: "+str(score_rf)+" %")
Accuracy: 90.16 %
3.6 EVALUATION THROUGH PERFORMANCE METRICS
In this project, various machine learning algorithms, KNN, SVM, Decision Tree, and
Random Forest, are used to predict heart disease. The Heart Disease UCI dataset has a total
of 76 attributes, of which only 14 are considered for the prediction of heart disease. Various
attributes of the patient such as gender, chest pain type, fasting blood sugar, serum
cholesterol, exang, etc. are considered in this project. The accuracy of each individual algorithm
has to be measured, and whichever algorithm gives the best accuracy is taken for heart
disease prediction. For evaluating the experiment, various evaluation metrics such as accuracy,
confusion matrix, precision, recall, and F1-score are considered. Accuracy is the ratio of the
number of correct predictions to the total number of inputs in the dataset.
The confusion matrix is one of the most intuitive and easiest metrics for finding the
correctness and accuracy of a model. It is used for classification problems where the output can
be of two or more classes, which makes it a good fit for this project. The matrix layout helps to
visualize the performance of an algorithm.
1. True Positives (TP): True positives are the cases where the actual class of the data point
was True (1) and the predicted class is also True (1). Ex: the case where a person actually has
a malignant (1) tumor and the model classifies the case as malignant (1) comes under True
Positives.
2. True Negatives (TN): True negatives are the cases where the actual class of the data point
was False (0) and the predicted class is also False (0). Ex: the case where a person has a
benign (0) tumor and the model classifies the case as benign (0) comes under True Negatives.
3. False Positives (FP): False positives are the cases where the actual class of the data point
was False (0) and the predicted class is True (1). False because the model predicted
incorrectly, and positive because the predicted class was the positive one (1). Ex: a person with
a benign (0) tumor whose case the model classifies as malignant (1) comes under False
Positives.
4. False Negatives (FN): False negatives are the cases where the actual class of the data point
was True (1) and the predicted class is False (0). False because the model predicted
incorrectly, and negative because the predicted class was the negative one (0). Ex: a person
with a malignant (1) tumor whose case the model classifies as benign (0) comes under False
Negatives.

Accuracy = (TP + TN) / (TP + FP + FN + TN)
                        Predicted Values
                          0        1
Actual Values      0    (TN)     (FP)
                   1    (FN)     (TP)
3.6.2 Accuracy
Accuracy is one of the metrics for evaluating classification models. Informally, accuracy is
the fraction of predictions the model got right. For binary classification, accuracy can be
calculated in terms of positives and negatives as follows:

                  TP + TN
Accuracy = ---------------------
            TP + FP + TN + FN
3.6.3 Precision
Precision is the ratio of True Positives to the sum of True Positives and False
Positives. Ex: precision measures the proportion of patients diagnosed as having a malignant
tumor who actually had a malignant tumor. The predicted positives (people predicted as having
a malignant tumor, i.e. TP and FP) form the denominator, and the people among them who
actually have a malignant tumor are the TP.

                TP
Precision = -----------
              TP + FP
3.6.4 Recall
Recall is a measure of the proportion of patients who actually had a malignant tumor
that the algorithm diagnosed as having a malignant tumor. The actual positives (people having
a malignant tumor, i.e. TP and FN) form the denominator, and those the model diagnoses
correctly are the TP. Therefore, if we want to focus on minimizing False Negatives, we want
Recall to be as close to 100% as possible.

             TP
Recall = -----------
           TP + FN
3.6.5 F1 Score
The F1 score combines the precision and recall of a model: it is the harmonic mean of
precision and recall. The range of the F1 score is [0, 1]. It shows how precise the classifier is
and, at the same time, how robust it is.

              2 * Precision * Recall
F1 Score = ----------------------------
               Precision + Recall

These metrics can all be computed directly with scikit-learn, as sketched below.
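A sketch of computing these metrics with scikit-learn, shown for the SVM predictions from
section 3.5:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

print(confusion_matrix(Y_test, Y_pred_svm))           # [[TN, FP], [FN, TP]]
print("Accuracy :", accuracy_score(Y_test, Y_pred_svm))
print("Precision:", precision_score(Y_test, Y_pred_svm))
print("Recall   :", recall_score(Y_test, Y_pred_svm))
print("F1 score :", f1_score(Y_test, Y_pred_svm))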
CHAPTER-4
SYSTEM ANALYSIS
4.1 Existing System
In the existing system there may be redundant variables, which must be handled. A heuristic
search method is used in mRMR to select the optimum features that have maximum relevance
and minimum redundancy. It checks one feature per cycle and computes pairwise redundancy.
However, mRMR does not take care of the joint association of features.
Advantages of the proposed system:
Easy to use
No human intervention required
Not only detects the heart disease type but also suggests precautions
Efficient use of available annotated data samples
4.3 COMPARATIVE ANALYSIS OF MACHINE LEARNING ALGORITHMS
Support vector machine:
The accuracy of the support vector machine was 81.97%. Along with accuracy, the performance
metrics Precision, Recall, and F1 score were (0=83%, 1=81%), (0=74%, 1=88%), and
(0=78%, 1=78%) respectively.
K Nearest Neighbors:
The accuracy of K Nearest Neighbors was 67.21%. Along with accuracy, the performance
metrics Precision, Recall, and F1 score were (0=62%, 1=72%), (0=67%, 1=88%), and
(0=64%, 1=70%) respectively.
Decision tree classifier:
The accuracy of the decision tree classifier was 81.97%. Along with accuracy, the performance
metrics Precision, Recall, and F1 score were (0=79%, 1=85%), (0=81%, 1=82%), and
(0=80%, 1=84%) respectively.
Random forest classifier:
The accuracy of the random forest was 90.16%. Along with accuracy, the performance metrics
Precision, Recall, and F1 score were (0=92%, 1=89%), (0=85%, 1=94%), and (0=88%, 1=91%)
respectively.
After completing the implementation of all four models for detecting heart disease from the
dataset, the results can be compared in the tables below.
MODELS                     ACCURACY
Support Vector Machine     81.97%
K Nearest Neighbors        67.21%
Decision Tree              81.97%
Random Forest              90.16%
Table 4.1

Precision
MODELS                     0        1
Support Vector Machine     83%      81%
K Nearest Neighbors        62%      72%
Decision Tree              79%      85%
Random Forest              92%      89%
Table 4.2

Recall
MODELS                     0        1
Support Vector Machine     74%      88%
K Nearest Neighbors        67%      88%
Decision Tree              81%      82%
Random Forest              85%      94%
Table 4.3

F1 Score
MODELS                     0        1
Support Vector Machine     78%      78%
K Nearest Neighbors        64%      70%
Decision Tree              80%      84%
Random Forest              88%      91%
Table 4.4
CHAPTER-5
CONCLUSION
Processing raw healthcare data on heart conditions will help in the long-term saving of human
lives and in the early detection of abnormalities in heart conditions. Machine learning
techniques were used in this work to process raw data and provide new insight into heart
disease. Heart disease prediction is challenging and very important in the medical field.
However, the mortality rate can be drastically reduced if the disease is detected at an early
stage and preventative measures are adopted as soon as possible. Extending this study to
real-world datasets, instead of just theoretical approaches and simulations, is highly desirable.
This project has studied various classification algorithms that can be used for the classification
of heart disease databases, along with the different techniques usable for classification and the
accuracies they obtain. The investigation shows that different technologies are used in different
papers, with different counts of attributes and different accuracies, depending on the tools used
for execution.
With the increasing number of deaths due to heart diseases, it has become necessary to
develop a system to predict heart diseases effectively and accurately. The motivation for this
study was to find the most efficient ML algorithm for detecting heart diseases. This study
compares the accuracy scores of the Support Vector Machine, K Nearest Neighbors, Decision
Tree, and Random Forest algorithms for predicting heart disease using a UCI machine learning
repository dataset. The results indicate that the Random Forest algorithm is the most efficient,
with an accuracy score of 90.16% for prediction of heart disease. In future, the work can be
enhanced by developing a web application based on the Random Forest algorithm, and by
using a larger dataset than the one used in this analysis, which will help provide better results
and help health professionals predict heart disease effectively and efficiently.
CHAPTER-6
FUTURE ENHANCEMENT
In future, the structure can be further upgraded by creating various combinations of data
mining techniques and by parameter tuning. A prototype heart disease prediction system can
be developed using three data mining classification modeling techniques. Such a system
extracts hidden knowledge from a historical heart disease database. The DMX query language
and functions can be used to build and access the models. The models are trained and
validated against a test dataset, and Lift Chart and Classification Matrix methods are used to
evaluate their effectiveness. All three models are able to extract patterns in response to the
predictable state. The most effective model for predicting patients with heart disease appears
to be Naïve Bayes, followed by Neural Network and Decision Trees. Five mining goals are
defined based on business intelligence and data exploration, and the goals are evaluated
against the trained models. All three models could answer complex queries, each with its own
strength with respect to ease of model interpretation, access to detailed information, and
accuracy. Naïve Bayes could answer four out of the five goals; Decision Trees, three; and
Neural Network, two. Although not the most effective model, the Decision Trees results are
easier to read and interpret, and the drill-through feature for accessing detailed patient profiles
is available only in Decision Trees. Naïve Bayes fared better than Decision Trees as it could
identify all the significant medical predictors. The relationships between attributes produced by
the Neural Network are more difficult to understand.
CHAPTER-7
BIBLIOGRAPHY
[1] V. Krishnaiah, G. Narasimha, N. Subhash Chandra, "Heart Disease Prediction System using
Data Mining Techniques and Intelligent Fuzzy Approach: A Review", IJCA, 2016.
[2] K. Sudhakar, M. Manimekalai, "Study of Heart Disease Prediction using Data Mining",
IJARCSSE, 2016.
[3] Naganna Chetty, Kunwar Singh Vaisla, Nagamma Patil, "An Improved Method for Disease
Prediction using Fuzzy Approach", ACCE, 2015.
[4] Vikas Chaurasia, Saurabh Pal, "Early Prediction of Heart Disease using Data Mining
Techniques", Caribbean Journal of Science and Technology, 2013.
[5] Shusaku Tsumoto, "Problems with Mining Medical Data", 0-7695-0792-1/00 © 2000 IEEE.
[6] Y. Alp Aslandogan et al., "Evidence Combination in Medical Data Mining", Proceedings of
the International Conference on Information Technology: Coding and Computing (ITCC'04),
0-7695-2108-8/04 © 2004 IEEE.
[7] Carlos Ordonez, "Improving Heart Disease Prediction Using Constrained Association Rules",
Seminar Presentation at University of Tokyo, 2004.
[8] Franck Le Duff, Cristian Munteanu, Marc Cuggia, Philippe Mabo, "Predicting Survival Causes
After Out of Hospital Cardiac Arrest using Data Mining Method", Studies in Health Technology
and Informatics, Vol. 107, No. Pt 2, pp. 1256-1259, 2004.
[9] Boleslaw Szymanski, Long Han, Mark Embrechts, Alexander Ross, Karsten Sternickel,
Lijuan Zhu, "Using Efficient Supanova Kernel for Heart Disease Diagnosis", Proc. ANNIE 06,
Intelligent Engineering Systems Through Artificial Neural Networks, Vol. 16, pp. 305-310, 2006.
[10] Kiyong Noh, HeonGyu Lee, Ho-Sun Shon, Bum Ju Lee, Keun Ho Ryu, "Associative
Classification Approach for Diagnosing Cardiovascular Disease", Springer, 2006, Vol. 345,
pp. 721-727.
Dataset: https://www.kaggle.com/datasets/jillanisofttech/heart-disease-prediction-using-differenttechniques
CHAPTER-8
APPENDIX
1. SAMPLE CODING
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report
%matplotlib inline
import os
print(os.listdir())
import warnings
warnings.filterwarnings('ignore')
dataset = pd.read_csv("heart.csv")
type(dataset)
dataset.shape
dataset.head(5)
dataset.sample(5)
dataset.describe()
dataset.info()
# Luckily, we have no missing values
info = ["age",
        "1: male, 0: female",
        "chest pain type, 1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic",
        "resting blood pressure",
        "serum cholestoral in mg/dl",
        "fasting blood sugar > 120 mg/dl",
        "resting electrocardiographic results (values 0,1,2)",
        "maximum heart rate achieved",
        "exercise induced angina",
        "oldpeak = ST depression induced by exercise relative to rest",
        "the slope of the peak exercise ST segment",
        "number of major vessels (0-3) colored by flourosopy",
        "thal: 3 = normal; 6 = fixed defect; 7 = reversable defect"]

for i in range(len(info)):
    print(dataset.columns[i]+":\t\t\t"+info[i])
dataset["target"].describe()
dataset["target"].unique()
print(dataset.corr()["target"].abs().sort_values(ascending=False))
# This shows that most columns are moderately correlated with target,
# but 'fbs' is very weakly correlated.
y = dataset["target"]
sns.countplot(y)
target_temp = dataset.target.value_counts()
print(target_temp)
print("Percentage of patience without heart problems: "+str(round(target_temp[0]*100/303,2)))
print("Percentage of patience with heart problems: "+str(round(target_temp[1]*100/303,2)))
dataset["sex"].unique()
sns.barplot(dataset["sex"],y)
dataset["cp"].unique()
sns.barplot(dataset["cp"],y)
dataset["fbs"].describe()
dataset["fbs"].unique()
sns.barplot(dataset["fbs"],y)
dataset["restecg"].unique()
sns.barplot(dataset["restecg"],y)
dataset["exang"].unique()
sns.barplot(dataset["exang"],y)
dataset["slope"].unique()
sns.barplot(dataset["slope"],y)
dataset["ca"].unique()
sns.countplot(dataset["ca"])
sns.barplot(dataset["ca"],y)
dataset["thal"].unique()
sns.barplot(dataset["thal"],y)
sns.distplot(dataset["thal"])
from sklearn.model_selection import train_test_split
predictors = dataset.drop("target",axis=1)
target = dataset["target"]
X_train,X_test,Y_train,Y_test = train_test_split(predictors,target,test_size=0.20,random_state=0)
X_train.shape
X_test.shape
Y_train.shape
Y_test.shape
from sklearn.metrics import accuracy_score
from sklearn import svm
sv = svm.SVC(kernel='linear')
sv.fit(X_train, Y_train)

Y_pred_svm = sv.predict(X_test)
Y_pred_svm.shape

score_svm = round(accuracy_score(Y_pred_svm,Y_test)*100,2)
print("The accuracy score achieved using Linear SVM is: "+str(score_svm)+" %")
print("support vector machine report:\n\n",classification_report(Y_test,Y_pred_svm))
import numpy
actual = numpy.random.binomial(1, 0.9, size =61 )
predicted = numpy.random.binomial(1, 0.9, size = 61)
from sklearn import metrics
confusion_matrix = metrics.confusion_matrix(actual, predicted)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix,
display_labels = [False, True])
import matplotlib.pyplot as plt
cm_display.plot()
plt.show()
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train,Y_train)
Y_pred_knn=knn.predict(X_test)
Y_pred_knn.shape
score_knn = round(accuracy_score(Y_pred_knn,Y_test)*100,2)
from sklearn.tree import DecisionTreeClassifier

max_accuracy = 0
for x in range(200):
    dt = DecisionTreeClassifier(random_state=x)
    dt.fit(X_train,Y_train)
    Y_pred_dt = dt.predict(X_test)
    current_accuracy = round(accuracy_score(Y_pred_dt,Y_test)*100,2)
    if(current_accuracy>max_accuracy):
        max_accuracy = current_accuracy
        best_x = x
#print(max_accuracy)
#print(best_x)

dt = DecisionTreeClassifier(random_state=best_x)
dt.fit(X_train,Y_train)
Y_pred_dt = dt.predict(X_test)
print(Y_pred_dt.shape)
score_dt = round(accuracy_score(Y_pred_dt,Y_test)*100,2)
print("The accuracy score achieved using Decision Tree is: "+str(score_dt)+" %")
print("Decision tree report:\n\n",classification_report(Y_test,Y_pred_dt))
from sklearn.ensemble import RandomForestClassifier

max_accuracy = 0
for x in range(2000):
    rf = RandomForestClassifier(random_state=x)
    rf.fit(X_train,Y_train)
    Y_pred_rf = rf.predict(X_test)
    current_accuracy = round(accuracy_score(Y_pred_rf,Y_test)*100,2)
    if(current_accuracy>max_accuracy):
        max_accuracy = current_accuracy
        best_x = x
#print(max_accuracy)
#print(best_x)

rf = RandomForestClassifier(random_state=best_x)
rf.fit(X_train,Y_train)
Y_pred_rf = rf.predict(X_test)
Y_pred_rf.shape
score_rf = round(accuracy_score(Y_pred_rf,Y_test)*100,2)
print("The accuracy score achieved using Random Forest is: "+str(score_rf)+" %")
scores = [score_svm,score_knn,score_dt,score_rf]
algorithms = ["Support Vector Machine","K-Nearest Neighbors","Decision Tree","Random Forest"]

for i in range(len(algorithms)):
    print("The accuracy score achieved using "+algorithms[i]+" is: "+str(scores[i])+" %")

sns.set(rc={'figure.figsize':(15,8)})
plt.xlabel("Algorithms")
plt.ylabel("Accuracy score")
sns.barplot(algorithms,scores)
2. SCREEN LAYOUT
The results obtained from the project entitled "Heart disease prediction using machine
learning" are discussed as follows.
Fig 4.1
Fig 4.2
Step 2.1: Check the DataFrame for any missing values
Fig 4.3
Fig 4.4
Step 3: Explore the dataset using histograms
Fig 4.5
Step 4: After analyzing the dataset, split the data into training and testing sets
Fig 4.6
Step 4.1: Univariate analysis
Fig 4.7
Step 5: After splitting the dataset into training and testing implement the machine learning
models.
Fig 4.8
Step 5.1: After implementing all four algorithms, obtain the confusion matrix for each of them to
evaluate its performance.
Fig 4.11 (Decision tree algorithm)
Fig 4.12 SVM (Precision, recall, f1 score,)
Fig 4.14 Decision tree (Precision, recall, f1 score,)