Project File
PREDICTING DIABETES USING MACHINE LEARNING
Submitted To:
Ms. Gurpreet Kaur
(Assistant Professor)
Submitted By:
Muskan Altaf
Univ. Roll No. 2105116
Clg. Roll No. 2122515
BTech CSE/7th Sem
Computer Science Engineering
ACKNOWLEDGEMENT
A good project involves the hard work of many people,
including the students and the teacher. While working on this
project, I received unconditional support and guidance from
my teachers.
I would like to express my sincere gratitude to my teacher,
Ms. Gurpreet Kaur, for giving me the opportunity to work on
this project. Her valuable words and advice have truly
motivated me. Preparing this project in collaboration with my
teacher was a refreshing experience. I have learned many
useful things from this project. Her guidance and constant
support pushed me to complete it successfully.
Sincerely
Muskan Altaf
INTRODUCTION
Diabetes is a chronic disease that affects millions of people
worldwide. It is characterized by high levels of glucose
(sugar) in the blood, which result from the body's inability to
produce or properly use insulin. Early detection and
management of diabetes can
help prevent complications such as heart disease, kidney
failure, and blindness.
In this project, we aim to build a machine learning model that
can predict whether a person has diabetes or not based on
their medical information. We will use the Pima Indians
Diabetes Dataset, which contains information about female
patients of Pima Indian heritage and whether they have
diabetes or not. By analysing this dataset and building a
model based on it, we hope to contribute to the field of
medical diagnosis and improve the accuracy of diabetes
diagnosis.
Aim of the Project
The aim of this project is to build a machine learning model
that can predict whether a person has diabetes or not based on
their medical information. By analysing the Pima Indians
Diabetes Dataset, which contains information about female
patients of Pima Indian heritage and whether they have
diabetes or not, we hope to contribute to the field of medical
diagnosis and improve the accuracy of diabetes diagnosis.
Early detection and management of diabetes can help prevent
complications such as heart disease, kidney failure, and
blindness, so accurately predicting diabetes can have a
significant impact on public health.
STEPS INVOLVED:
Data Exploration:
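We loaded the dataset into our Python program using Pandas
and printed the first few rows and summary statistics for each
column. The dataset contains 768 rows and 9 columns: 8 input
variables and the `Outcome` target. Some values are missing;
in the commonly distributed version of this dataset they
appear as zeros rather than as empty cells.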
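A minimal sketch of how zero placeholders could be counted during exploration. The small table below is hypothetical stand-in data, not rows from the real file, and the helper name `count_zero_placeholders` is also hypothetical. The column names are assumed to match the usual `diabetes.csv` layout.

```python
import pandas as pd

# Hypothetical stand-in rows; with the real file, use:
#   data = pd.read_csv('diabetes.csv')
data = pd.DataFrame({
    'Glucose':       [148, 85, 0, 89, 137],
    'BloodPressure': [72, 66, 64, 0, 40],
    'BMI':           [33.6, 26.6, 23.3, 28.1, 0.0],
    'Outcome':       [1, 0, 1, 0, 1],
})

def count_zero_placeholders(df, columns):
    """Return, per column, how many entries are exactly zero."""
    return (df[columns] == 0).sum()

# Number of rows and columns in the table
print(data.shape)

# Zeros in these columns are assumed to mark missing measurements
zero_counts = count_zero_placeholders(data, ['Glucose', 'BloodPressure', 'BMI'])
print(zero_counts)
```

On the real dataset, the same call would show which measurement columns contain zeros that may need to be treated as missing before training.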
Data Preparation:
We split the dataset into a training set and a testing set using
the `train_test_split()` function from Scikit-learn. We also
separated the input variables from the target variable. We
used logistic regression to build our machine learning model
and trained it using the training set.
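As a rough illustration of the split sizes, the sketch below applies `train_test_split()` with the same `test_size=0.3` and `random_state=42` used in this project to randomly generated stand-in arrays of the same shape as the dataset (768 rows, 8 input columns). The arrays `X_demo` and `y_demo` are hypothetical and carry no medical meaning.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))      # hypothetical stand-in for the 8 input columns
y_demo = rng.integers(0, 2, size=768)   # hypothetical stand-in for the Outcome column

# Hold out 30% of the rows for testing; the rest are used for training
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

print(X_tr.shape, X_te.shape)
```

With 768 rows, this leaves 537 rows for training and 231 rows for testing.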
Model Evaluation:
We evaluated the performance of our model by predicting
whether a person has diabetes or not in the testing set and
comparing the predictions with the actual values. We
calculated the accuracy of our model using the
`accuracy_score()` function from Scikit-learn and found it to
be around 75%.
We also created a confusion matrix to see how many true
positives, true negatives, false positives, and false negatives
our model has. From the confusion matrix, we can see that our
model correctly predicted diabetes in 50% of cases and
correctly predicted no diabetes in 82% of cases.
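The sketch below shows one way such per-class percentages can be derived from a confusion matrix, assuming they refer to the share of actual diabetes cases and actual non-diabetes cases that were predicted correctly (often called sensitivity and specificity). The labels are hypothetical and chosen only for illustration; they are not the project's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (1 = diabetes, 0 = no diabetes)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)   # share of actual diabetes cases predicted correctly
specificity = tn / (tn + fp)   # share of actual non-diabetes cases predicted correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, '
      f'accuracy={accuracy:.2f}')
```

Reporting both rates alongside accuracy makes it visible when a model is much better at recognizing one class than the other.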
Results Visualization:
We visualized the results of our model using Matplotlib by
creating a bar chart of the confusion matrix. The bar chart
shows the number of true positives, true negatives, false
positives, and false negatives our model has.
ALGORITHM:
STEP 1. Import the necessary libraries for data analysis and
machine learning.
STEP 2. Load the Pima Indians Diabetes Dataset into your
Python program using Pandas, for example
`data = pd.read_csv('diabetes.csv')`.
STEP 3. Explore the dataset by printing the first few rows and
some statistics about the data.
STEP 4. Prepare the data by separating the input variables
(features) from the target variable (outcome), and splitting the
data into a training set and a testing set using the
`train_test_split()` function from Scikit-learn.
STEP 5. Build the machine learning model using logistic
regression and train it using the training set.
STEP 6. Evaluate the performance of the model by predicting
whether a person has diabetes or not in the testing set and
comparing the predictions with the actual values. Calculate
the accuracy of the model using the `accuracy_score()`
function from Scikit-learn.
STEP 7. Create a confusion matrix to see how many true
positives, true negatives, false positives, and false negatives
the model has.
STEP 8. Visualize the results of the model using Matplotlib
by creating a bar chart of the confusion matrix.
FULL SOURCE CODE:
# Step 1: Importing the Required Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline
# Step 2: Loading the Dataset
data = pd.read_csv('diabetes.csv')
# Step 3: Exploring the Dataset
print(data.head())
print(data.describe())
# Step 4: Preparing the Data
X = data.drop('Outcome', axis=1)
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
# Step 5: Building the Model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# Step 6: Evaluating the Model
y_pred = logreg.predict(X_test)
print('Accuracy score:', accuracy_score(y_test, y_pred))
confusion_mat = confusion_matrix(y_test, y_pred)
print('Confusion matrix:', confusion_mat)
# Step 7: Visualizing the Results
plt.bar(
    ['True Negative', 'False Positive', 'False Negative', 'True Positive'],
    confusion_mat.ravel())
plt.title('Confusion Matrix')
plt.xlabel('Prediction')
plt.ylabel('Count')
plt.show()
SOFTWARE USED:
Jupyter Notebook:
Jupyter notebooks are used for all sorts of data science tasks
such as exploratory data analysis (EDA), data cleaning and
transformation, data visualization, statistical modeling,
machine learning, and deep learning.
A Jupyter Notebook is an open source web application that
allows data scientists to create and share documents that
include live code, equations, and other multimedia resources.
How Does a Jupyter Notebook Work?
A Jupyter notebook has two components: a front-end web
page and a back-end kernel. The front-end web page allows
data scientists to enter programming code or text in
rectangular "cells." The browser then passes the code to the
back-end kernel, which runs the code and returns the results.
LANGUAGE USED:
PYTHON
FUTURE SCOPE:
The future scope of this project includes several potential
areas
for improvement and further research. Some of these include:
1. Feature engineering: The current model uses the original
features of the dataset, but there may be opportunities to
create new features that better capture the relationship
between the input variables and the target variable. For
example, creating a new feature that combines BMI and age
may better capture the risk of diabetes.
2. Algorithm selection: While logistic regression is a common
and effective algorithm for binary classification problems like
this one, there may be other machine learning algorithms that
can achieve better accuracy on this dataset. Experimenting
with different algorithms such as decision trees, random
forests, or support vector machines may be worth exploring.
3. Hyperparameter tuning: The current model was built using
default hyperparameters for logistic regression. However,
there may be opportunities to optimize the hyperparameters to
achieve better accuracy. Techniques like grid search or
random search can be used to find the optimal
hyperparameters for the model.
4. Data augmentation: The dataset used in this project is
relatively small, which can limit the performance of machine
learning models. Oversampling or synthetic data generation
may be used to increase the number of training examples, and
resampling techniques (including undersampling) can help
balance the two classes, which may improve the performance
of the model.
5. External validation: While the model performed well on the
testing set, it is important to validate the model on external
datasets to ensure its generalizability. Future research may
involve testing the model on other datasets with similar
characteristics to the Pima Indians Diabetes Dataset.
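As a hedged illustration of the algorithm selection idea in the list above, the sketch below compares three classifiers with 5-fold cross-validation. It runs on synthetic data from `make_classification`, not on the Pima Indians Diabetes Dataset, so the scores it prints say nothing about diabetes prediction. The settings shown (`max_iter=1000`, `n_estimators=50`) are illustrative assumptions, not the project's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 input columns (not the real dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)

candidates = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(random_state=42),
    'random forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean accuracy over 5 folds for each candidate model
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring='accuracy')
    mean_scores[name] = scores.mean()
    print(f'{name}: {mean_scores[name]:.3f}')
```

Cross-validation averages over several splits, so the comparison depends less on one particular train/test split than a single accuracy figure does.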
Overall, there is significant potential for further development
and improvement of the machine learning model for diabetes
prediction using the Pima Indians Diabetes Dataset.
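One possible starting point for the hyperparameter tuning mentioned above is a grid search over the regularization strength `C` of logistic regression. This is a minimal sketch on synthetic data, assuming a pipeline with feature scaling; the grid values and the scaling step are illustrative choices, not part of the project's code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 8 input columns (not the real dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)

# Scale the features, then fit logistic regression
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Candidate values for C; the step name comes from make_pipeline
param_grid = {'logisticregression__C': [0.01, 0.1, 1.0, 10.0]}

# Evaluate each candidate with 5-fold cross-validation
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_syn, y_syn)

print(search.best_params_, search.best_score_)
```

On the real dataset, `X_syn` and `y_syn` would be replaced by the training features and labels, with the test set kept aside for a final check.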
Conclusion:
In this project, we successfully built a machine learning model
to predict whether a person has diabetes or not using the Pima
Indians Diabetes Dataset. Our model achieved an accuracy of
around 75% and correctly predicted diabetes in 50% of cases
and correctly predicted no diabetes in 82% of cases. This
project demonstrates how machine learning can be used to
make predictions based on medical data. However, further
improvements can be made to the model to increase its
accuracy.