Final Report
2. ATUL TIWARY (2211981114)
3. SIDNATH KUMAR (2211981386)
4. SHIVAM DUBEY (2211981364)
PROBLEM STATEMENT: Cases of heart-related disease are increasing rapidly, so it is
important to predict the disease before it occurs.
NEED STATEMENT – Developing a deep learning model for accurate and early prediction
of heart disease based on comprehensive patient data, aiming to enhance proactive healthcare
interventions and reduce cardiovascular risks.
STACK USED:-
1) Kaggle:- The dataset was taken from here.
2) Python:- The language used for programming.
3) Machine learning:- Algorithms such as Logistic Regression, KNN, SVM, DT and Naïve
Bayes are used in this project.
ABSTRACT :- Heart disease is one of the major health problems of the present era.
Cardiovascular disease is the leading cause of death worldwide. It covers various
heart-related conditions such as blood vessel disease (coronary artery disease), heart
problems present from birth (congenital heart defects), irregular heartbeats (arrhythmias)
and many more. The disease is also increasingly seen in the younger generation; studies
attribute heart disease in young people to chronic stress, smoking, poor lifestyle and
similar factors. The growing number of cases motivates work on advanced algorithms for
heart disease prediction. The purpose of this work is to determine whether a person is
suffering from, or is likely to suffer from, heart disease. Hospitals can predict heart
disease with the help of automated diagnosis based on data science algorithms and data
mining approaches. This work implements supervised machine learning algorithms such as
Decision Tree (DT), Random Forest (RF), K-Nearest Neighbours (KNN) and Gaussian Naïve
Bayes. The dataset used in this work was collected from Kaggle, and the accuracy of our
project is 98%.
INTRODUCTION:-
Heart Disease:- The heart is one of the most important organs of the human body. Its
function is to pump blood to every part of the body. If it fails to work properly, most
other organs stop functioning, and if it stops altogether a person can die within minutes.
The main factors that influence heart disease include excess body weight, high blood
pressure (BP), heredity (family history) and heavy alcohol consumption.
Cardiovascular diseases (CVDs) are spreading rapidly in the 2020s. According to the World
Health Organization (WHO), an estimated 17.9 million people die from CVDs every year,
mainly from heart attacks and strokes, and a substantial share of these deaths occur in
people below the age of 70. Several tests are used to diagnose CVD, such as blood sugar
tests, cholesterol level checks and ECG tests.
Heart illness encompasses a wide spectrum of disorders that affect the structure and the
functioning of the heart. At present, one of the most significant heart diseases is
coronary heart disease, which is a major cause of death in this generation. It appears
when plaque accumulates in the walls of the coronary arteries and restricts the flow of
blood to the heart muscle, which can result in a heart attack.
Chest pain is the major indicator that the heart is not getting sufficient blood flow.
The most common signs of heart disease are palpitations, chest pain, arm pain and
shortness of breath.
In cardiovascular disease, time plays a major role for the patient, so it is essential to
predict heart-related illness in time, before it is too late.
At present, many medical organizations collect data on heart-related disease. However,
not all of this data is accurate or complete, which makes it difficult to take significant
decisions based on it, and the volume of data makes it hard for humans to study and to
turn into predictions. Machine learning algorithms make this task easier: they can
predict heart-related issues from such data with good accuracy.
Overview of Machine Learning:-
Machine learning summarizes knowledge from training data supplied by humans, and its
predictions improve as it learns from more data. There are mainly four types of machine
learning. The algorithms used in this work are:
• KNN
• Logistic Regression
• SVM
• Naïve Bayes
• Decision Tree
• Random Forest
Figure 1:- Workflow diagram of our model
Literature Review:-
Several machine learning algorithms, such as k-nearest neighbour and support vector
machine, have been used by many researchers in an effort to design non-invasive
healthcare systems that address the problems caused by invasive procedures.
In an organized manner, the data has been taken only from patients who have heart
disease problems, and all of it has been recorded according to medical conditions [2].
The heart is affected by various conditions. The WHO reports that a high share of
mortality among middle-aged people is due to heart disease. This data source has 303 rows,
which hold the information of patients of different age groups. From this dataset we
collect important medical attributes such as age, BP, blood sugar level and cholesterol,
which help in detecting whether a patient suffers from heart disease. The dataset
contains the medical records of 303 patients in 303 rows, with 13 medical attributes that
help to identify whether a patient is at risk of heart disease. The heart disease data is
taken from the UCI repository. From this data, the patterns that help to detect people
prone to heart disease are extracted. The dataset is divided into two parts: a training
dataset and a testing dataset. It has 303 rows and 14 columns (the 13 medical attributes
plus the target), where every row corresponds to one record. The attributes used in this
dataset include:
XIV. Number (disorder) (target): is heart disease present or not. Values: not present /
present in the four major types.
a. Dataset - There is no single standard dataset for this purpose. Most researchers use
the UCI dataset because it is easy to read and already filtered.
c. Accuracy – Alizadehsani et al. proposed the most accurate model, based on a
combination of several classifiers, namely SVM, NB and C4.5, with an accuracy of
96.4%.
METHODOLOGY:- This study aims to anticipate the probability of heart disease through
programmatic heart disease prediction, which can be beneficial for medical professionals
and patients. We apply machine learning algorithms to the dataset we gathered and present
the results in this report. For better results we eliminate uninformative data and add
derived features such as MAP and BMI. Next, we separate the dataset by gender and apply
k-modes clustering.
Finally, we train the models on the prepared data. This enhanced methodology is intended
to produce more accurate results and better model performance. The steps are described
below:-
1. Data Source:- The dataset used in our study comprises 70,000 patient records with 12
different features. These features include age, gender, systolic blood pressure and
diastolic blood pressure.
The target class, “cardio,” shows whether a patient has cardiovascular disease (represented as
1) or is healthy (represented as 0).
2. Removing Outliers:- The dataset initially had 70,000 records, which were reduced to
57,155 after outliers were removed. Values of blood pressure, weight and height were
treated as outliers and eliminated if they fell outside the 2.5% to 97.5% percentile
range, since such values are likely due to data entry errors. Removing these unusable
records improves model performance.
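The percentile filter described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study; the column names `weight` and `ap_hi` and the toy records are assumptions made for the example.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("empty input")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def remove_outliers(records, columns, low=2.5, high=97.5):
    """Keep records whose value lies within the [low, high] percentiles of every column."""
    bounds = {}
    for col in columns:
        values = sorted(r[col] for r in records)
        bounds[col] = (percentile(values, low), percentile(values, high))
    return [
        r for r in records
        if all(bounds[c][0] <= r[c] <= bounds[c][1] for c in columns)
    ]


# Hypothetical toy records (column names are assumptions, not the dataset schema).
records = [{"weight": w, "ap_hi": 120} for w in range(50, 100)]
records.append({"weight": 900, "ap_hi": 120})  # implausible value, likely an entry error
cleaned = remove_outliers(records, ["weight", "ap_hi"])
print(len(records), "->", len(cleaned))
```

Note that a two-sided percentile filter always trims some legitimate extreme values along with the erroneous ones, which is consistent with the reduction from 70,000 to 57,155 records reported above.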
3. Feature Selection and Reduction:- In our study, continuous variables such as age,
weight and blood pressure were grouped into bins to improve the performance of the
classification algorithms and to make the data more readable. For example, age was
converted from days to years and grouped into 5-year intervals, which helps clear
analysis. Height and weight were combined into a Body Mass Index (BMI) category,
improving the model's ability to predict cardiovascular disease (CVD) risk. Mean Arterial
Pressure (MAP), calculated from the blood pressure readings, was also retained and shows
a direct relation to CVD risk. This move to categorical data makes the data easier to
interpret and increases the predictive capability of the models.
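The derived features described above can be computed roughly as follows. This sketch assumes the standard definitions BMI = weight (kg) / height (m)^2 and MAP = (systolic + 2 × diastolic) / 3, and it assumes the usual WHO adult BMI cut-offs; the exact bin edges used in the study are not stated, so the function names and category labels here are illustrative.

```python
def age_years(age_days):
    """Convert age in days to whole years (leap days ignored)."""
    return age_days // 365


def age_group(years, width=5):
    """Label of the 5-year interval containing `years`, e.g. 52 -> '50-54'."""
    start = (years // width) * width
    return f"{start}-{start + width - 1}"


def bmi(weight_kg, height_cm):
    """Body Mass Index: weight in kilograms over height in metres squared."""
    height_m = height_cm / 100
    return weight_kg / (height_m ** 2)


def bmi_category(value):
    """Assumed WHO adult cut-offs; the study's own bins may differ."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    if value < 30:
        return "overweight"
    return "obese"


def mean_arterial_pressure(systolic, diastolic):
    """MAP approximated as (systolic + 2 * diastolic) / 3."""
    return (systolic + 2 * diastolic) / 3


# Hypothetical patient record used only for illustration.
years = age_years(18393)
print(age_group(years), bmi_category(bmi(70, 175)), round(mean_arterial_pressure(120, 80), 1))
```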
4. Clustering:- Our study used k-modes clustering to organize patient data into groups
that share the same characteristics, focusing on non-numeric (categorical) information
such as age groups, BMI categories and blood pressure categories. This method suits
types of data for which average values are not meaningful, unlike numbers or quantities.
To find a good number of groups (clusters) to divide the data into, the study uses the
elbow method. This method involves plotting a chart in which every point reflects a
candidate number of clusters, and finding the point beyond which adding more clusters no
longer noticeably improves the quality of the grouping. When clustering the data
separately for men and women, the study found that the best number of clusters for both
was two. Treating the sexes separately acknowledges the differences between men and
women in heart disease risk and in how the disease develops, which can help tailor
prevention and treatment strategies more precisely.
Additionally, the study looks at how different factors such as blood pressure and age
relate to each other using a correlation table. This helps to see which factors tend to
go hand in hand, which is useful for understanding how different risk factors combine to
affect heart disease risk.
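To make the clustering step concrete, the following is a minimal pure-Python sketch of k-modes with the cost values an elbow plot would use. It is a simplified illustration (random initial modes, Hamming distance, fixed iteration cap), not the implementation used in the study, and the toy rows of age group, BMI category and blood pressure category are invented for the example.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def assign(rows, modes):
    """Index of the nearest mode (by Hamming distance) for every row."""
    return [min(range(len(modes)), key=lambda j: hamming(r, modes[j])) for r in rows]


def k_modes(rows, k, n_iter=20, seed=0):
    """Minimal k-modes; returns (modes, labels, cost). Rows must be tuples."""
    rng = random.Random(seed)
    modes = rng.sample(sorted(set(rows)), k)  # k distinct rows as initial modes
    for _ in range(n_iter):
        labels = assign(rows, modes)
        new_modes = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if not members:  # keep the old mode if a cluster becomes empty
                new_modes.append(modes[j])
                continue
            # New mode: most frequent category of each attribute within the cluster.
            new_modes.append(tuple(Counter(col).most_common(1)[0][0] for col in zip(*members)))
        if new_modes == modes:
            break
        modes = new_modes
    labels = assign(rows, modes)
    cost = sum(hamming(r, modes[lab]) for r, lab in zip(rows, labels))
    return modes, labels, cost


# Hypothetical categorical records: (age group, BMI category, blood pressure category).
rows = [
    ("40-44", "normal", "normal"),
    ("40-44", "normal", "elevated"),
    ("45-49", "normal", "normal"),
    ("60-64", "obese", "high"),
    ("60-64", "obese", "high"),
    ("55-59", "obese", "high"),
]

# Elbow method: compute the total mismatch cost for each candidate k and look for the bend.
costs = {k: k_modes(rows, k)[2] for k in (1, 2, 3)}
print(costs)
```

On this toy data the cost drops sharply from one cluster to two and changes little afterwards, which is the kind of bend the elbow method looks for.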
5. Modeling Process:-
1. Splitting the Dataset: The dataset is divided into two parts: 80% is used for training
the models (training dataset), and the remaining 20% is used to test how well these models
work (testing dataset).
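An 80/20 split of this kind can be sketched with the standard library alone. The function name, the fixed seed and the integer stand-in records are assumptions for the example; libraries such as scikit-learn provide an equivalent helper.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out `test_fraction` of them for testing."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 patient records
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before splitting matters when the records are stored in some order (for example by age or by outcome), because otherwise the test set would not be representative.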
1. Decision Tree Classifier: Imagine a tree in which each branch represents a choice
between options, leading along different paths through the data and ending in a decision
(such as diagnosing a disease). It is simple and easy to understand. The decision tree
achieved around 73% accuracy.
2. Random Forest: This is like a forest of decision trees working together. Every tree
votes, and the most common result becomes the final outcome. It is usually more accurate
because it combines the decisions of many trees, reducing the errors a single tree makes.
Random forest also showed about 73% accuracy.
3. Multilayer Perceptron (MLP): This is a neural network with layers of nodes, loosely
inspired by the human brain, capable of learning complex patterns. It is especially good
for complex problems that are not linearly separable. MLP uses a technique called
backpropagation to adjust its weights and reduce prediction errors.
4. XGBoost: Stands for Extreme Gradient Boosting. It builds one tree at a time, where
every new tree helps to correct the errors made by the previous trees. It gives more
emphasis to the harder-to-predict instances, making the model better over time. XGBoost
reached a 98% accuracy rate in a study.
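The voting step of the random forest described above can be illustrated in a few lines. The votes below are invented for the example; in a real forest each vote would come from a tree trained on a bootstrap sample of the data.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one patient (1 = disease, 0 = no disease).
tree_votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(tree_votes)
print(forest_prediction)
```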
• Figure 5 counts the number of patients with and without heart disease. The data has
479 patients who have heart disease and 420 patients who do not.
• In percentage terms, about 46.72% of patients do not have heart disease (0) and about
53.28% have some kind of heart disease.
• The ratio between patients with and without heart disease is not exactly 50:50, but it
is close enough that class imbalance will not cause problems.
• Figure 6 shows the frequency of heart disease by age. The green colour represents
patients with no disease and the red colour represents patients with some kind of heart
disease.
• Age 51 has the highest number of patients with heart disease, while age 58 has the
highest number of patients without any kind of heart disease.
• About 37 patients aged 58 do not have the disease, while 35 patients aged 51 have
it.
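The class proportions discussed for Figure 5 follow directly from the two patient counts. As a small worked check, using only the counts quoted in the text:

```python
# Patient counts quoted for Figure 5.
counts = {"heart disease": 479, "no heart disease": 420}

total = sum(counts.values())
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}
print(total, shares)
```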
Figure 7- Frequency of heart disease for each sex
• Figure 7 shows the frequency of heart disease for male and female patients.
• The percentage of female patients in our data is 32.25% and the percentage of male
patients is 67.25%.
• According to the figure, females have a lower chance of having heart disease than
males.
Figure 8- Relation between Maximum Heart Rate and Heart Disease
• The highest heart rate recorded is over 200, for a 30-year-old patient who has heart
disease.
• Generally, patients with a higher maximum heart rate have been diagnosed with heart
disease.
Figure 9- Frequency of heart disease according to chest pain type
• This figure shows the relation between chest pain type and heart disease
frequency.
• Chest pain type 2 has the highest frequency of heart disease patients.
Figure 10- Frequency of heart disease according to FBS
• This figure shows the relation between fasting blood sugar (FBS) and heart disease
frequency.
• The variable indicates whether FBS > 120 mg/dl, where 1 represents true and 0 false.
• 60 patients have FBS above 120 mg/dl and have heart disease.
Implementation:-
Figure 12-Updated data point before and after applying KNN algorithm
❖ Decision Tree:- The DT algorithm is also a supervised learning algorithm. It is mainly
used for classification. It contains two kinds of node: decision nodes and leaf nodes.
Decision nodes split into further internal nodes, which represent the features of the
given dataset. The branches represent the decision rules.
The idea of the DT comes from the structure of an actual tree.
A DT breaks every decision node down into its different possible outcomes, but a leaf
node cannot be split further.
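The distinction between decision nodes and leaf nodes can be shown with a small hand-built tree. The features, thresholds and labels below are purely illustrative assumptions; they are not learned from the report's data and carry no clinical meaning.

```python
def predict(node, sample):
    """Follow the branches from the root until a leaf is reached; return its label."""
    while "label" not in node:  # decision node: test one feature and pick a branch
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]  # leaf node: holds the final class and is not split further


# Hypothetical tree: two decision nodes and three leaf nodes.
tree = {
    "feature": "chest_pain_type", "threshold": 0,
    "left": {"label": 0},
    "right": {
        "feature": "max_heart_rate", "threshold": 150,
        "left": {"label": 0},
        "right": {"label": 1},
    },
}

print(predict(tree, {"chest_pain_type": 2, "max_heart_rate": 170}))
```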
The top-left value is the true negative (TN): the actual value was No and the model's
prediction was also No.
The top-right value is the false positive (FP): the actual value was No but the model's
prediction was Yes. This is known as a Type-I error.
The bottom-left value is the false negative (FN), also known as a Type-II error: the
actual value was Yes, but the model's prediction was No.
The bottom-right value is the true positive (TP): the actual value was Yes and the
model's prediction was also Yes.
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
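The four confusion-matrix cells and the accuracy formula can be computed from a list of actual and predicted labels as sketched below. The label vectors are invented for illustration and do not correspond to any of the models in this report.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels, where 1 = disease and 0 = no disease."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1  # Type-I error
        else:
            fn += 1  # Type-II error
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical actual and predicted labels for eight patients.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(tn, fp, fn, tp, accuracy(tn, fp, fn, tp))
```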
Figure 18- KNN confusion matrix
The true negative value for KNN is 71, which means the model correctly predicted no heart
disease 71 times.
The true positive value is 80, which means the model correctly predicted the presence of
heart disease 80 times.
Figure 19- SVM confusion matrix
The true negative value for SVM is 41, which means the model correctly predicted no
heart disease 41 times.
The true positive value is 76, which means the model correctly predicted the presence of
heart disease 76 times.
Figure 20- DT confusion matrix
The true negative value for DT is 86, which means the model correctly predicted no heart
disease 86 times.
The true positive value is 91, which means the model correctly predicted the presence of
heart disease 91 times.
Figure 21- RF confusion matrix
The true negative value for RF is 86, which means the model correctly predicted no heart
disease 86 times.
The true positive value is 91, which means the model correctly predicted the presence of
heart disease 91 times.
Figure 22- Logistic Regression confusion matrix
The true negative value for Logistic Regression is 30, which means the model correctly
predicted no heart disease 30 times.
The true positive value is 82, which means the model correctly predicted the presence of
heart disease 82 times.
Figure 23- Naïve Bayes confusion matrix
The true negative value for Naïve Bayes is 67, which means the model correctly predicted
no heart disease 67 times.
The true positive value is 81, which means the model correctly predicted the presence of
heart disease 81 times.
Algorithms Used    Accuracy Score
SVM                65.00%
KNN                90.56%
Conclusion:- Heart diseases are highly dangerous, taking the lives of millions of people
around the world every year, as discussed above. It is necessary that any potential heart
disease is discovered early so that the patient can undergo treatment. As invasive
methods have their drawbacks, it is important that machine learning models are developed
and improved in order to detect heart disease. Possible treatments vary depending on the
type of damage, with lifestyle changes such as a healthy diet, regular exercise, a good
amount of sleep and no smoking being a few of them. Medication is also available, and
some patients might need to undergo surgery if lifestyle changes alone do not work.
[2] Dangare C S and Apte S S (2012). Improved study of heart disease prediction system
using data mining classification techniques. International Journal of Computer
Applications, 47(10), 44-8.
[3] Ordonez C (2006). Association rule discovery with the train and test approach for heart
disease prediction. IEEE Transactions on Information Technology in Biomedicine, 10(2), 334-43.
[4] Shinde R, Arjun S, Patil P and Waghmare J (2015). An intelligent heart disease
prediction system using k-means clustering and Naïve Bayes algorithm. International Journal
of Computer Science and Information Technologies, 6(1), 637-9.
[5] Bashir S, Qamar U and Javed M Y (2014, November). An ensemble based decision
support framework for intelligent heart disease diagnosis. In International Conference on
Information Society (i-Society 2014) (pp. 259-64). IEEE.
[8] Jabbar M A, Deekshatulu B L and Chandra P (2013, March). Heart disease prediction
using lazy associative classification. In 2013 International Multi-Conference on Automation,
Computing, Communication, Control and Compressed Sensing (iMac4s) (pp. 40-6). IEEE.
[9] Dangare Chaitrali S and Sulabha S Apte. "Improved study of heart disease prediction
system using data mining classification techniques." International Journal of Computer
Applications 47.10 (2012): 44-8.
[10] Soni Jyoti. "Predictive data mining for medical diagnosis: An overview of heart disease
prediction." International Journal of Computer Applications 17.8 (2011): 43-8.
[11] Chen A H, Huang S Y, P S, Cheng C H and Lin E J (2011, September). HDPS: Heart
disease prediction system. In 2011 Computing in Cardiology (pp. 557-60). IEEE.
[12] Parthiban, Latha and R Subramanian. "Intelligent heart disease prediction system using
CANFIS and genetic algorithm." International Journal of Biological, Biomedical and Medical
Sciences 3.3 (2008).
[16] Amin Ul Haq, Jian Ping Li, Muhammad Hammad Memon, Shah Nazir, Ruinan Sun, A
Hybrid Intelligent System Framework for the Prediction of Heart Disease Using Machine
Learning Algorithms, Mobile Information Systems, Volume 2018, doi:
10.1155/2018/3860146
[17] S. Mohan, C. Thirumalai and G. Srivastava, "Effective Heart Disease Prediction Using
Hybrid Machine Learning Techniques," in IEEE Access, vol. 7, pp. 81542-81554, 2019, doi:
10.1109/ACCESS.2019.2923707.
[18] L. Ali et al., "An Optimized Stacked Support Vector Machines Based Expert System for the Effective
Prediction of Heart Failure," in IEEE Access, vol. 7, pp. 54007-54014, 2019, doi: 10.1109/ACCESS.2019.2909969.
[19] TR, R., Lilhore, U. K., M, P., Simaiya, S., Kaur, A., & Hamdi, M. (2022). PREDICTIVE ANALYSIS
OF HEART DISEASES WITH MACHINE LEARNING APPROACHES. Malaysian Journal of Computer
Science, 132–148. https://doi.org/10.22452/mjcs.sp2022no1.10
[21] Pooja Anbuselvan, 2020, Heart Disease Prediction using Machine Learning Techniques, INTERNATIONAL
JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) Volume 09, Issue 11 (November
2020).
[22] "HEART DISEASE PREDICTION USING MACHINE LEARNING", International Journal of Emerging
Technologies and Innovative Research (www.jetir.org), ISSN:2349-5162, Vol.7, Issue 6, page no.2081-2085,
June-2020, Available: http://www.jetir.org/papers/JETIR2006301.pdf
[23] Palaniappan, S. & Awang, R. In 2012 IEEE/ACS International Conference on Computer Systems and
Applications 108–115 (IEEE, New York).
[24] Olaniyi, Ebenezer & Oyedotun, Oyebade & Khashman, Adnan. (2015). Heart Diseases Diagnosis Using
Neural Networks Arbitration. International Journal of Intelligent Systems and Applications. 7. 75-82.
10.5815/ijisa.2015.12.08.
[25] Resul Das, Ibrahim Turkoglu, Abdulkadir Sengur, Effective diagnosis of heart disease through neural
networks ensembles, Expert Systems with Applications, Volume 36, Issue 4, 2009, Pages 7675-7680, ISSN
0957-4174, https://doi.org/10.1016/j.eswa.2008.09.013.
[26] C. Boukhatem, H. Y. Youssef and A. B. Nassif, "Heart Disease Prediction Using Machine Learning," 2022
Advances in Science and Engineering Technology International Conferences (ASET), Dubai, United Arab
Emirates, 2022, pp. 1-6, doi: 10.1109/ASET53988.2022.9734880.
[27] Rindhe, Baban & Ahire, Nikita & Patil, Rupali & Gagare, Shweta & Darade, Manisha. (2021). Heart
Disease Prediction Using Machine Learning. International Journal of Advanced Research in Science,
Communication and Technology. 267-276. 10.48175/IJARSCT-1131.
[28] Singh, Yeshvendra & Sinha, Nikhil & Singh, Sanjay. (2017). Heart Disease Prediction System Using
Random Forest. 613-623. 10.1007/978-981-10-5427-3_63.
[29] Apurb Rajdhan, Avi Agarwal, Milan Sai, Dundigalla Ravi, Dr. Poonam Ghuli, 2020, Heart Disease
Prediction using Machine Learning, INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH &
TECHNOLOGY (IJERT) Volume 09, Issue 04 (April 2020)
[30] Muhammad, Y., Tahir, M., Hayat, M. et al. Early and accurate detection and diagnosis of heart disease
using intelligent computational model. Sci Rep 10, 19747 (2020). https://doi.org/10.1038/s41598-020-76635-9