HEART DISEASE PREDICTION
USING MACHINE LEARNING
SUBMITTED BY
K. PRAVEEN
(227Z1A0582)
INDEX
S.NO TOPIC
1. Introduction
2. Algorithm Descriptions
3. Benefits & Objectives
4. High- & Low-level Requirements
5. System Requirements
6. Source Code
7. Result
8. Conclusion
Introduction
Heart disease is one of the leading causes of death across the world,
affecting millions of people every year. According to the World
Health Organization, it is responsible for a large number of health
complications and fatalities. Early detection and timely diagnosis play
a major role in reducing the risk and planning proper treatment. In
recent years, machine learning (ML) has gained attention in the
healthcare field for its ability to find patterns in medical data and
support better decision-making.
This project uses machine learning techniques to predict whether a
person has heart disease. Using a real-world heart disease dataset
that includes key medical information such as age, blood pressure,
and cholesterol levels, the goal is to measure how accurately
different ML models predict heart conditions. We compare four popular
classification algorithms: k-Nearest Neighbours (k-NN), Decision
Tree, Logistic Regression, and Support Vector Machine (SVM). Special
attention is given to reducing false negatives (patients with heart
disease who are predicted to be healthy), which are dangerous in
situations where early diagnosis is critical.
The aim of this study is not only to test accuracy but also to find the
most reliable model that could be used in real-life medical settings,
helping doctors diagnose heart conditions more quickly, safely, and
affordably.
Algorithm Descriptions
k-Nearest Neighbours (k-NN)
k-Nearest Neighbours is a simple, intuitive algorithm that classifies a
data point based on how its neighbours are labelled. It works by
calculating the distance between the input point and other points in
the dataset, then selecting the "k" closest neighbours. The majority
class among these neighbours determines the prediction. It doesn't
make assumptions about the data, which makes it flexible, but it can
become slow with large datasets and is sensitive to the scale of
features.
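The steps described above (measure distances, pick the k closest points, take a majority vote) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation inside scikit-learn's KNeighborsClassifier; the function name knn_predict and the toy data are made up for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Euclidean distance from new_point to every training point
    distances = [math.dist(p, new_point) for p in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    # Majority class among the k neighbours
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration: two features, two classes
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

near_class_0 = knn_predict(points, labels, (1.1, 1.0))  # close to the first cluster
near_class_1 = knn_predict(points, labels, (5.1, 5.0))  # close to the second cluster
print(near_class_0, near_class_1)
```

Because the vote depends on raw distances, a feature measured in large units (such as cholesterol in mg/dl) would dominate the result unless the features are standardized first.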
Logistic Regression
Logistic Regression is a statistical algorithm used for binary
classification problems. It calculates the probability that a given input
belongs to a particular class using the logistic (sigmoid) function.
Despite its name, it is a classification algorithm, not a regression one.
It performs well when the log-odds of the outcome are approximately
linear in the features, and it is especially useful for medical and risk
prediction tasks due to its simplicity and efficiency.
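As a rough sketch of the calculation described above, the snippet below passes a weighted sum of the features through the logistic (sigmoid) function and applies a 0.5 cut-off. The weights, bias, and feature values are hypothetical and chosen only to show the arithmetic; in practice they would be learned from the training data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def positive_class_probability(features, weights, bias):
    """Probability that the input belongs to the positive class."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical model parameters and input (not learned from real data)
weights = [0.8, -0.5]
bias = 0.1
features = [1.0, 2.0]

p = positive_class_probability(features, weights, bias)
label = 1 if p >= 0.5 else 0  # common default decision threshold
print(f"probability = {p:.3f}, predicted class = {label}")
```

Here the weighted sum is 0.8 - 1.0 + 0.1 = -0.1, so the probability falls slightly below 0.5 and the example is assigned to class 0.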
Support Vector Machine (SVM)
Support Vector Machine is a powerful supervised learning algorithm
that finds the best boundary (or hyperplane) that separates data points
of different classes. It tries to maximize the margin between the two
classes, which tends to improve generalization to unseen data. SVM is
effective in high-dimensional spaces and handles both linear and
non-linear classification, the latter through the kernel trick.
However, it can be computationally expensive and harder to interpret
compared to simpler models.
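The kernel trick mentioned above can be illustrated with a small numeric check. The sketch below uses a degree-2 polynomial kernel as an assumed example and shows that evaluating the kernel in the original 2-D space gives the same value as an inner product in an explicitly mapped 3-D space. The helper names feature_map, dot, and poly_kernel are hypothetical.

```python
import math

def feature_map(point):
    """Explicit degree-2 feature map for a 2-D point."""
    x1, x2 = point
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return dot(a, b) ** 2

a = (1.0, 2.0)
b = (3.0, -1.0)

explicit = dot(feature_map(a), feature_map(b))  # inner product in the mapped 3-D space
implicit = poly_kernel(a, b)                    # computed directly in the original 2-D space
print(explicit, implicit)
```

The two values agree up to floating-point rounding, which is why an SVM can learn a non-linear boundary without ever constructing the higher-dimensional features. The SVC class in scikit-learn uses an RBF kernel by default, which relies on the same idea.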
Benefits & Objectives
Benefits
Predicts the likelihood of heart disease with high accuracy, aiding in early diagnosis and intervention.
Helps identify individuals at risk, enabling timely medical attention and lifestyle changes.
Provides a simple interface for users to input health data and receive instant predictions.
Compares multiple machine learning models using key metrics and visual tools to determine the best-performing model.
Objectives
Apply machine learning algorithms for binary classification of heart disease data.
Preprocess and split the dataset into training and testing sets to ensure robust evaluation.
Evaluate model performance using metrics such as accuracy, confusion matrix, F1-score, and ROC-AUC.
Focus on minimizing false negatives so that as few cases of heart disease as possible are misclassified as healthy.
Create a user-friendly interface for real-time prediction based on user input.
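To make the evaluation metrics named in the objectives concrete, the following sketch derives accuracy, precision, recall, and F1-score from the four confusion-matrix counts. The labels are made up for ten imaginary patients and are not results from this project; the helper confusion_counts is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = heart disease."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1  # a sick patient predicted as healthy
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Made-up labels for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # share of actual heart disease cases that were found
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the accuracy is 0.80 while the recall is only 0.60, because two of the five patients with heart disease were missed. This is why a project that cares about false negatives should look at recall and the confusion matrix, not accuracy alone.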
High- & Low-level Requirements
High-level Requirements
Input Interface
Model Processing
Prediction Output
Performance Comparison
Visual Representation
Low-level Requirements
Dataset Handling
Data Preprocessing
Model Implementation
Model Evaluation
User Input Validation
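As one possible way to meet the user input validation requirement, the sketch below keeps asking until the user enters a whole number within an allowed range. The helper read_int and its input_fn parameter are hypothetical and are not part of the project's source code; the 0-3 range for chest pain type follows the prompt used in this report.

```python
def read_int(prompt, low, high, input_fn=input):
    """Ask until the user enters a whole number within [low, high]."""
    while True:
        raw = input_fn(prompt)
        try:
            value = int(raw)
        except ValueError:
            print("Please enter a whole number.")
            continue
        if low <= value <= high:
            return value
        print(f"Please enter a value between {low} and {high}.")

# Scripted answers stand in for a live console: "abc" is not a number,
# 7 is out of range, and 2 is accepted.
answers = iter(["abc", "7", "2"])
cp = read_int("Chest pain type (0-3): ", 0, 3, input_fn=lambda _: next(answers))
print(cp)
```

Passing the input function as a parameter keeps the helper easy to test without typing values by hand; in normal use the default built-in input is used.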
SYSTEM REQUIREMENTS
Hardware
Minimum 1 GHz processor
1 GB RAM or more
100 MB free storage
Software
OS: Windows/Linux/macOS
IDE: Jupyter, VS Code, or PyCharm
Libraries: scikit-learn, matplotlib, NumPy, pandas
Features
Input interface for user values (via console or web-based)
Classification using multiple machine learning algorithms (e.g., k-NN, Decision Tree, Logistic Regression, SVM)
Accuracy and ROC-AUC score reporting
ROC curve plotting for visual performance comparison
Real-time prediction for custom input values (health data)
SOURCE CODE
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, confusion_matrix,
                             accuracy_score, roc_curve, auc)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Load dataset (expects heart.csv in the working directory)
df = pd.read_csv("heart.csv")

# Features and target
X = df.drop("target", axis=1)
y = df["target"]

# Train-test split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale features: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define models (Logistic Regression is included so that all four
# algorithms named in this report are compared)
models = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "SVM": SVC(probability=True, random_state=42)
}
# Train and evaluate models
roc_data = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_prob = model.predict_proba(X_test_scaled)[:, 1]  # probability of class 1

    print(f"\n=== {name} ===")
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))

    # ROC curve points and area under the curve for the comparison plot
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    roc_data[name] = (fpr, tpr, roc_auc)
# Plot ROC curves
plt.figure(figsize=(10, 6))
for name, (fpr, tpr, roc_auc) in roc_data.items():
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison")
plt.legend()
plt.grid(True)
plt.show()
# Ask input from the user
print("\nPlease enter the following details for prediction:")
age = int(input("Age: "))
sex = int(input("Sex (1 = male, 0 = female): "))
cp = int(input("Chest pain type (0-3): "))
resting_blood_pressure = int(input("Resting blood pressure: "))
serum_cholesterol = int(input("Serum cholesterol (mg/dl): "))
fasting_blood_sugar = int(input("Fasting blood sugar > 120 mg/dl (1 = yes, 0 = no): "))

# Prepare input for prediction. Only the first six features are asked for;
# the remaining ones are filled with a default value (here the training-set
# mean rather than 0, so that the defaults lie within the range of the data).
sample_values = X_train.mean().to_numpy(copy=True)
sample_values[:6] = [age, sex, cp, resting_blood_pressure,
                     serum_cholesterol, fasting_blood_sugar]
sample_input = pd.DataFrame([sample_values], columns=X.columns)

# Scale the input using the same scaler as training data
sample_input_scaled = scaler.transform(sample_input)

# Predict with all models
print("\n--- Sample Prediction from All Models ---")
for name, model in models.items():
    prediction = model.predict(sample_input_scaled)[0]
    result = "Heart Disease Detected" if prediction == 1 else "No Heart Disease Detected"
    print(f"{name}: {result}")
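Since this report emphasizes reducing false negatives, one common option, which is not used in the code above, is to lower the cut-off applied to the predicted probability of heart disease. The sketch below shows the effect on made-up probabilities that stand in for the output of model.predict_proba(X_test_scaled)[:, 1]; the helper names and the 0.3 threshold are assumptions chosen for illustration.

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Label a case as heart disease (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def false_negatives(y_true, y_pred):
    """Count sick patients (1) that were predicted as healthy (0)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

def false_positives(y_true, y_pred):
    """Count healthy patients (0) that were predicted as sick (1)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

# Made-up values for illustration only (not results from this project)
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_prob = [0.90, 0.45, 0.35, 0.40, 0.10, 0.70, 0.55, 0.20]

for threshold in (0.5, 0.3):
    y_pred = predict_with_threshold(y_prob, threshold)
    print(f"threshold={threshold}: "
          f"false negatives={false_negatives(y_true, y_pred)}, "
          f"false positives={false_positives(y_true, y_pred)}")
```

Lowering the threshold removes the missed cases in this toy example but raises the number of false alarms, so the cut-off should be chosen on validation data with that trade-off in mind.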
OUTPUT
CONCLUSION
The heart disease prediction system demonstrates how machine
learning can effectively assist in the early diagnosis of cardiovascular
conditions. By applying multiple classification algorithms and
comparing their performance, the project identifies the most reliable
model for accurate predictions. The system allows users to input basic
medical data and receive a real-time risk assessment, making it a
potentially valuable tool for both patients and healthcare providers.
With its focus on reducing false negatives, the system aims to ensure
that at-risk individuals are not overlooked, ultimately supporting
timely intervention and better health outcomes.