Diabetes Project MuskanAltaf

The project focuses on predicting diabetes using a machine learning model built from the Pima Indians Diabetes Dataset. The model achieved an accuracy of approximately 75%, with specific performance metrics indicating 50% correct predictions for diabetes and 82% for no diabetes. Future improvements include feature engineering, exploring different algorithms, hyperparameter tuning, data augmentation, and external validation.

Uploaded by

Mudasir Bashir
Project File
PREDICTING DIABETES USING
MACHINE LEARNING

Submitted To:
Ms. Gurpreet Kaur
(Assistant Professor)

Submitted By:
Muskan Altaf
Univ. Roll No. 2105116
Clg. Roll No. 2122515
BTech CSE / 7th Sem

Computer Science Engineering


ACKNOWLEDGEMENT

A good project involves the hard work of many people,
including the students and the teacher. While working on this
project, I have received unconditional support and guidance
from my teachers.

I would like to express my sincere gratitude to my teacher,
Ms. Gurpreet Kaur, for giving me such a golden opportunity
to work on this wonderful project. Her valuable words and
advice have truly motivated me. Preparing this project in
collaboration with my teacher was a refreshing experience.

I have learned many useful things from this project. Her
guidance and constant support have pushed me to successfully
complete this project.

Sincerely,
Muskan Altaf


INTRODUCTION

Diabetes is a chronic disease that affects millions of people
worldwide. It is characterized by high levels of glucose
(sugar) in the blood, which result from the body's inability to
produce or properly use insulin. Early detection and management
of diabetes can help prevent complications such as heart
disease, kidney failure, and blindness.

In this project, we aim to build a machine learning model that
can predict whether a person has diabetes or not based on
their medical information. We will use the Pima Indians
Diabetes Dataset, which contains information about female
patients of Pima Indian heritage and whether they have
diabetes or not. By analysing this dataset and building a
model based on it, we hope to contribute to the field of
medical diagnosis and improve the accuracy of diabetes
diagnosis.


Aim of the Project

The aim of this project is to build a machine learning model
that can predict whether a person has diabetes or not based on
their medical information. By analysing the Pima Indians
Diabetes Dataset, which contains information about female
patients of Pima Indian heritage and whether they have
diabetes or not, we hope to contribute to the field of medical
diagnosis and improve the accuracy of diabetes diagnosis.
Early detection and management of diabetes can help prevent
complications such as heart disease, kidney failure, and
blindness, so accurately predicting diabetes can have a
significant impact on public health.


STEPS INVOLVED:

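Data Exploration:
We loaded the dataset into our Python program using Pandas
and printed the first few rows and some statistics about the
data. We found that the dataset contains 768 rows and 9
columns. Some measurements are missing; in this dataset they
typically appear as zeros (for example in the glucose, blood
pressure, or BMI columns) rather than as empty cells.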
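
The exploration step can be sketched as follows. This is a minimal illustration that builds a tiny hand-made table instead of reading the real file: the rows are made-up values, not records from the Pima dataset, and the column names (`Glucose`, `BMI`, `Age`, `Outcome`) are assumed to match the usual CSV header. Missing measurements in this dataset are commonly recorded as zeros rather than empty cells, so the sketch counts zeros as well as NaN values.

```python
import pandas as pd

# Hypothetical miniature table; the values are illustrative only.
data = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

print(data.shape)       # (number of rows, number of columns)
print(data.head())      # first few rows
print(data.describe())  # summary statistics per column

# Count zeros in columns where a zero reading is not plausible.
zero_counts = (data[["Glucose", "BMI"]] == 0).sum()
print(zero_counts)

# Explicit missing values (NaN) are counted separately.
nan_counts = data.isna().sum()
print(nan_counts)
```

On the real file, the same calls would be applied to the frame returned by `pd.read_csv('diabetes.csv')`.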

Data Preparation:
We separated the input variables (features) from the target
variable, then split the dataset into a training set and a
testing set using the `train_test_split()` function from
Scikit-learn. We used logistic regression to build our machine
learning model and trained it using the training set.
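
To make the role of logistic regression more concrete, here is a small sketch of how a fitted model turns one patient's features into a prediction: it computes a weighted sum of the features, passes it through the sigmoid function to get a probability, and predicts diabetes when that probability reaches 0.5. The weights, bias, and feature values below are hypothetical and chosen only for illustration; they are not the coefficients learned in this project.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to a value in the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(features, weights, bias, threshold=0.5):
    """Return 1 (diabetes) if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical coefficients for two standardized features.
weights = [1.2, 0.8]
bias = -0.5

patient_a = [1.0, 0.5]    # weighted sum: -0.5 + 1.2 + 0.4 = 1.1
patient_b = [-1.0, -0.5]  # weighted sum: -0.5 - 1.2 - 0.4 = -2.1

print(predict_proba(patient_a, weights, bias), predict(patient_a, weights, bias))
print(predict_proba(patient_b, weights, bias), predict(patient_b, weights, bias))
```

Training with Scikit-learn's `LogisticRegression` amounts to finding the weights and bias that make these probabilities fit the training labels as well as possible.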

Model Evaluation:
We evaluated the performance of our model by predicting
whether a person has diabetes or not in the testing set and
comparing the predictions with the actual values. We
calculated the accuracy of our model using the
`accuracy_score()` function from Scikit-learn and found it to
be around 75%.

We also created a confusion matrix to see how many true
positives, true negatives, false positives, and false negatives
our model produced. From the confusion matrix, we can see that
our model correctly predicted diabetes in 50% of cases and
correctly predicted no diabetes in 82% of cases.
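
The relationship between the four confusion matrix counts and such percentages can be sketched as below. The sketch assumes the two percentages are per-class rates, that is, the share of actual diabetes cases that were caught (sensitivity) and the share of actual non-diabetes cases that were recognized (specificity). The counts are hypothetical and are not the counts obtained in this project.

```python
def rates_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive accuracy and per-class rates from the four confusion matrix counts."""
    total = tn + fp + fn + tp
    return {
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / total,
        # Share of actual positives predicted as positive.
        "sensitivity": tp / (tp + fn),
        # Share of actual negatives predicted as negative.
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts for a test set of 100 patients.
rates = rates_from_confusion(tn=50, fp=10, fn=15, tp=25)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy mixes both classes, a model can reach a fairly high accuracy while still missing a large share of the positive cases, which is why the per-class rates are worth reporting alongside it.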

Results Visualization:
We visualized the results of our model using Matplotlib by
creating a bar chart of the confusion matrix. The bar chart
shows the number of true positives, true negatives, false
positives, and false negatives our model produced.



ALGORITHM:

STEP 1. Import the necessary libraries for data analysis and
machine learning.

STEP 2. Load the Pima Indians Diabetes Dataset into your
Python program using Pandas, as in
`data = pd.read_csv('diabetes.csv')`.

STEP 3. Explore the dataset by printing the first few rows and
some statistics about the data.

STEP 4. Prepare the data by separating the input variables
(features) from the target variable (outcome), and splitting the
data into a training set and a testing set using the
`train_test_split()` function from Scikit-learn.

STEP 5. Build the machine learning model using logistic
regression and train it using the training set.

STEP 6. Evaluate the performance of the model by predicting
whether a person has diabetes or not in the testing set and
comparing the predictions with the actual values. Calculate
the accuracy of the model using the `accuracy_score()`
function from Scikit-learn.

STEP 7. Create a confusion matrix to see how many true
positives, true negatives, false positives, and false negatives
the model has.

STEP 8. Visualize the results of the model using Matplotlib
by creating a bar chart of the confusion matrix.

Computer Science Engineering


FULL SOURCE CODE:
# Step 1: Importing the Required Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
# The next line is a Jupyter magic; remove it when running as a plain script.
%matplotlib inline

# Step 2: Loading the Dataset
data = pd.read_csv('diabetes.csv')

# Step 3: Exploring the Dataset
print(data.head())      # first few rows
print(data.describe())  # summary statistics

# Step 4: Preparing the Data
X = data.drop('Outcome', axis=1)  # input variables (features)
y = data['Outcome']               # target variable
# Hold out 30% of the rows for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 5: Building the Model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Step 6: Evaluating the Model
y_pred = logreg.predict(X_test)
print('Accuracy score:', accuracy_score(y_test, y_pred))
confusion_mat = confusion_matrix(y_test, y_pred)
print('Confusion matrix:', confusion_mat)

# Step 7: Visualizing the Results
# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
plt.bar(['True Negative', 'False Positive', 'False Negative',
         'True Positive'],
        confusion_mat.ravel())
plt.title('Confusion Matrix')
plt.xlabel('Prediction')
plt.ylabel('Count')
plt.show()
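
The bar labels in the plotting step rely on the order in which Scikit-learn lays out a binary confusion matrix: rows are the actual classes, columns are the predicted classes, and flattening the matrix yields true negatives, false positives, false negatives, true positives. A short check on made-up labels (not taken from the project's data) illustrates this:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for eight patients: 1 = diabetes, 0 = no diabetes.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows: actual class, columns: predicted class

# Flatten row by row: TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```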

SOFTWARE USED:
Jupyter Notebook:
Jupyter notebooks are used for all sorts of data science tasks
such as exploratory data analysis (EDA), data cleaning and
transformation, data visualization, statistical modeling,
machine learning, and deep learning.
A Jupyter Notebook is an open source web application that
allows data scientists to create and share documents that
include live code, equations, and other multimedia resources.
How does Jupyter Notebook work?
A Jupyter notebook has two components: a front-end web page
and a back-end kernel. The front-end web page allows data
scientists to enter programming code or text in rectangular
"cells." The browser then passes the code to the back-end
kernel, which runs the code and returns the results.

LANGUAGE USED:
PYTHON

FUTURE SCOPE:
The future scope of this project includes several potential
areas for improvement and further research. Some of these
include:
1. Feature engineering: The current model uses the original
features of the dataset, but there may be opportunities to
create new features that better capture the relationship
between the input variables and the target variable. For
example, creating a new feature that combines BMI and age
may better capture the risk of diabetes.
2. Algorithm selection: While logistic regression is a common
and effective algorithm for binary classification problems like
this one, there may be other machine learning algorithms that
can achieve better accuracy on this dataset. Experimenting
with different algorithms such as decision trees, random
forests, or support vector machines may be worth exploring.

3. Hyperparameter tuning: The current model was built using
default hyperparameters for logistic regression. However,
there may be opportunities to optimize the hyperparameters to
achieve better accuracy. Techniques like grid search or random
search can be used to find the optimal hyperparameters for the
model.

4. Data augmentation: The dataset used in this project is
relatively small (768 rows), which can limit the performance
of machine learning models. Oversampling the minority class or
generating synthetic data may be used to increase the amount
of training data, while undersampling the majority class is an
alternative way to balance the two classes. Either approach
may improve the performance of the model.

5. External validation: While the model reached about 75%
accuracy on the testing set, it is important to validate the
model on external datasets to ensure its generalizability.
Future research may involve testing the model on other
datasets with similar characteristics to the Pima Indians
Diabetes Dataset.
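
As one illustration of the list above, the grid search idea from item 3 might look like the following sketch. It uses synthetic data from `make_classification` as a stand-in for the real dataset, and the candidate values for the regularization strength `C`, the feature scaling step, and `max_iter=1000` are assumptions made for this example rather than settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 8 features (not the Pima dataset).
X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Scale the features, then fit logistic regression.
pipeline = make_pipeline(StandardScaler(),
                         LogisticRegression(max_iter=1000))

# Candidate values for the inverse regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training set for each candidate.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best C:", search.best_params_["logisticregression__C"])
print("Cross-validated accuracy:", search.best_score_)
print("Held-out accuracy:", search.score(X_test, y_test))
```

The same pattern extends to item 2: replacing the final pipeline step with another classifier and adjusting the parameter grid lets different algorithms be compared under the same cross-validation scheme.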
Overall, there is significant potential for further development
and improvement of the machine learning model for diabetes
prediction using the Pima Indians Diabetes Dataset.

Conclusion:

In this project, we successfully built a machine learning model
to predict whether a person has diabetes or not using the Pima
Indians Diabetes Dataset. Our model achieved an accuracy of
around 75% and correctly predicted diabetes in 50% of cases
and correctly predicted no diabetes in 82% of cases. This
project demonstrates how machine learning can be used to make
predictions based on medical data. However, further
improvements can be made to the model to increase its
accuracy.
