DIABETES DETECTION USING MACHINE LEARNING
A Project Report
Submitted by:
in partial fulfilment of the requirements for the degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
(June 2025)
CERTIFICATE
This is to certify that the project report titled “DIABETES DETECTION USING MACHINE
LEARNING” being submitted by Ipsita Sahoo (2141019099), Brajesh Mohanty (2141013219) and
Reetuparna Baral (2141002025) of CSE-S to the Institute of Technical Education and Research, Siksha
‘O’ Anusandhan (Deemed to be University), Bhubaneswar, in partial fulfilment of the requirements for the
degree of Bachelor of Technology in Computer Science and Engineering is a record of original bona fide
work carried out by them under our supervision and guidance. The project work, in our opinion, has
reached the requisite standard, fulfilling the requirements for the degree of Bachelor of Technology.
The results contained in this project work have not been submitted in part or full to any other University
or Institute for the award of any degree or diploma.
ACKNOWLEDGEMENT
We would like to express our sincere thanks to our guide Dr. Brajesh Kumar Umrao for his expert
guidance and support in completing our project. We would also like to extend our gratitude and
respect to our SDP teacher Dr. Farida A Ali for providing us with the facilities and knowledge
about so many new things.
Date: 20/06/2025
DECLARATION
We declare that this written submission represents our ideas in our own words and that, where others'
ideas or words have been included, we have adequately cited and referenced the original sources.
We also declare that we have adhered to all principles of academic honesty and integrity and
have not misrepresented, fabricated, or falsified any idea/fact/source in our submission. We
understand that any violation of the above will be cause for disciplinary action by the University
and can also evoke penal action from the sources which have not been properly cited or from
whom proper permission has not been taken when needed.
Date: 20/06/2025
REPORT APPROVAL
Examiner(s)
________________________
________________________
________________________
Supervisor
________________________
Project Coordinator
________________________
PREFACE
One of the most common and debilitating chronic diseases, diabetes affects millions of individuals
worldwide. It is brought on by either the pancreas producing insufficient amounts of insulin or the body
using insulin inefficiently. The incidence of diabetes cases has increased dramatically in recent years,
especially in developing countries, as a result of dietary changes, lifestyle alterations, and a lack of early
identification. This delay in diagnosis often results in serious health problems, such as cardiovascular
illnesses, brain impairment, and kidney failure. Therefore, early and accurate detection is crucial for
managing diabetes and preventing complications.
Traditionally, diabetes has been diagnosed through standard clinical testing, which can be expensive, time-
consuming, and not always accessible. Many people do not receive routine tests because of a lack of
awareness, financial constraints, or geographic limits. To tackle these problems, researchers have been
exploring the potential of machine learning (ML) techniques to develop predictive models that, using
common medical data, can aid in early diagnosis.
XGBoost, Random Forest, Decision Trees, and Logistic Regression are just a few of the machine learning
methods that have shown promising results in identifying diabetes-related patterns in patient data. These
models can be trained on datasets containing attributes such as blood pressure, insulin and glucose levels,
age, BMI, and the diabetes pedigree function to identify whether a person has diabetes. In our work, we
examined a number of machine learning models and evaluated their predictive power for diabetes using
performance metrics such as accuracy, precision, and F1-score.
The main objectives of our project are both the use of these models and the idea of integrating them into
wearable technologies, such as smartwatches. By making diabetes monitoring more continuous, real-time,
and accessible, this has the potential to transform personal healthcare. The goal of this project is to create
a diabetes diagnosis model that is precise, quick, and simple for patients and medical professionals to
understand. The importance of feature selection and data quality for medical machine learning
applications is also emphasized in this report.
INDIVIDUAL CONTRIBUTIONS
TABLE OF CONTENTS
Title Page i
Certificate ii
Acknowledgement iii
Declaration iv
Report Approval v
Preface vi
Individual Contributions viii
Table of Contents ix
List of Figures x
List of Tables xi
1. INTRODUCTION 1
Project Overview 1
Motivation 3
Uniqueness of the Work 5
Report Layout 7
2. LITERATURE SURVEY 7
2.1 Existing System 8
2.2 Problem Identification 13
3. METHODS 15
3.1 Dataset Description 15
3.2 Schematic Layout 17
3.3 Methods Used 17
3.4 Evaluation Measures 18
4. RESULTS 18
4.1 System Specification 19
4.2 Experimental Outcomes 19
5. CONCLUSIONS 22
6. REFERENCES 24
7. APPENDICES 25
8. REFLECTION OF THE TEAM MEMBERS ON 27
THE PROJECT
9. SIMILARITY REPORT 28
LIST OF FIGURES
LIST OF TABLES
1 Dataset Description 16
2 Results 19
3 Appendix table 26
1. INTRODUCTION
Under-treated diabetes can lead to dire complications, affecting millions across the globe. The
importance of bringing the date of diagnosis forward, both to lessen the load on the healthcare
system and to improve individual health outcomes, cannot be stressed enough. We examine the
feasibility of applying traditional health indicators such as age, blood pressure, insulin, glucose,
skinfold thickness, number of pregnancies, and family history to the early prediction of diabetes.
The aim is to bring about change by applying machine learning to these medical indicators for
faster and better diagnosis.
The ultimate goal is to assess the predictive ability of various machine learning models on
diabetes data; if successful, this could lead to the development of innovative healthcare
technologies with the ability to empower providers and patients.
Diabetes is one of the many chronic illnesses afflicting millions of people throughout the globe.
It often develops unnoticed until its complications have run their full course. Early detection is
crucial because it allows a person to act before their health gets worse. This project revolves
around shaping a machine-learning prediction approach, a class of tools considered very useful in
healthcare for providing early information and improving outcomes. The system we aim to
develop allows an individual to learn how likely they are to develop diabetes by analysing their
physiological and clinical data, increasing the availability and personalization of early care.
This work is based on a rich dataset containing indicators such as age, blood pressure, glucose
levels, insulin levels, skinfold thickness, number of pregnancies, and family history of diabetes.
Each entry relates to a real person, tagged as having diabetes or not. The richness of the data
enables the model to learn subtle patterns indicative of a higher risk of this condition.
Data cleaning was the first task before training any models. Missing values, outliers, and
inconsistencies, all common in real-world data, would otherwise have affected the results. We
normalized values so that they were comparable across records, balanced distributions, and
carefully handled missing entries using imputation so that the most trustworthy input would be
given to our models. Another important part of this project was the choice of relevant machine
learning models: we tried XGBoost, Random Forest, Decision Tree, and Logistic Regression
under a variety of settings, each with its own advantages.
XGBoost can uncover complicated patterns. Random Forest combines many trees to reduce
overfitting and improve accuracy. A Decision Tree is interpretable and easy to explain, whereas
Logistic Regression, being simpler, does a lot of good in binary classification tasks like this one.
We benchmarked how well these models performed by training them on the pre-processed data
and then measuring results using popular metrics such as accuracy, precision, and F1-score.
Accuracy reflects overall performance across the whole test set. Precision reflects the extent to
which a model avoids false positives. The F1-score balances the two, which matters especially
when the goal is finding people who really have diabetes. These metrics highlighted both the
lessons learned and the areas that required refinement.
The Pima Indian Diabetes Dataset (PIDD) is the repository we are working from in this research;
it contains medical records from a group of 768 women aged 21 and above. It is widely respected
because of how rich its health data are. We separated the data into training and test sets: 80% of
the data is used for training the models, while the remaining 20% is held out to test them. This
division ensures that our trained models can apply their learned behaviour to new, unseen data
and do not just memorize patterns.
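As a minimal illustration of this split, the sketch below assumes the PIDD is available locally as a CSV file named diabetes.csv with the standard column names; the file name and the random seed are our own illustrative choices, not fixed parts of the method.

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed local copy of the Pima Indian Diabetes Dataset (hypothetical file name).
data = pd.read_csv("diabetes.csv")

# "Outcome" is the standard PIDD label column: 1 = diabetic, 0 = non-diabetic.
X = data.drop(columns=["Outcome"])
y = data["Outcome"]

# 80/20 train-test split, stratified so both sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # expected: (614, 8) (154, 8)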
Our objective, however, goes beyond building accurate models. The intent is to build something
credible that can affect clinical practice. Systems of this nature could enable individuals to
appreciate their level of risk, while healthcare practitioners channel their attention towards
delivering the right care at the appropriate time. Imagine being able to diagnose potential cases
of diabetes early and modify treatment before complications arise.
This project showcases how AI is changing the medical field. It allows medical professionals to
spend more time managing patients rather than working through the huge volumes of captured
data with laborious analytical methods. Such models can even spot trends that conventional
statistics would overlook, pointing to possible new risk factors or behaviours that might cause
diabetes.
Real-time health data collection from wearables is how we envision extending this system in the
future. Imagine real-time readings of glucose, heart rate, activity, and other measures from smart
devices, coupled with AI recommendations personalized to individual behaviour and able to
identify early warning signs. That kind of coupling would really allow people to take their health
into their own hands, with medical advice at the right time.
As we said, this project is not only about algorithms; it is about people. Diabetes takes an
emotional and a physical toll, and managing it can be overwhelming. If we can give people an
early heads-up and motivate preventive measures, that is a win, not only for healthcare systems
but for quality of life.
Machine learning is becoming ever more relevant to healthcare. This project is a step towards
using that opportunity not only to react to diseases but also to pre-empt them as much as possible.
Improved models, wider datasets, and integration into daily healthcare routines would all
contribute toward making early detection a routine part of managing diabetes, before
complications ever happen.
1.2 Motivation
Diabetes has become one of the most concerning health issues of our time, silently affecting
millions across the globe. It often develops without noticeable symptoms, only coming to light
when serious complications emerge. Many people only receive a diagnosis after significant
warning signs appear—often due to missed routine checkups or downplaying changes in their
health. These delays are contributing to a growing public health challenge, putting both
individuals and healthcare systems under pressure.
The need for a more innovative method of diabetes detection is what motivated this project. The
goal is to develop a system that can use quantifiable health data to estimate an individual's risk
rather than waiting for symptoms to worsen. We intend to bridge the gap between early indicators
and actual diagnosis by utilising machine learning. We're also envisioning how this system might
integrate with wearable technology, such as fitness trackers or smartwatches, to provide people
with convenient real-time health monitoring. In this manner, they can act proactively before any
problems worsen.
Building a dataset that reflected the different causes of diabetes was the first step in developing a
reliable model. A strong dataset is required to ensure that the system produces accurate
predictions. We examined factors such as age, blood pressure, skin thickness, insulin and glucose
levels, family history of diabetes, and the number of pregnancies. Each data entry indicates
whether or not a person has diabetes. The model can identify risk trends across different
backgrounds thanks to the diversity of the data.
Before diving into model training, the data had to be carefully cleaned and prepared. Like most
real-world data, it had a few hiccups—missing values, uneven distributions, and inconsistent
entries. We tackled these by filling in gaps, reshaping features, and standardizing values. This step
was crucial in ensuring the models had dependable input and could learn from the data effectively.
The next step was to choose and evaluate various machine learning models. We examined a
number of them, each with their own advantages: XGBoost, Random Forest, Decision Tree, and
Logistic Regression. Because of its reputation for identifying intricate relationships in data,
XGBoost is an effective tool for risk assessment. Multiple decision trees are combined by Random
Forest to increase accuracy and decrease error. Healthcare practitioners benefit from decision tree
models' ease of interpretation. Additionally, despite its simplicity, logistic regression is still a good
option for yes/no predictions, such as diabetes diagnosis.
We employed common evaluation metrics to determine each model's performance. Accuracy let
us gauge the overall frequency of correct predictions, while precision helped us evaluate the
models' ability to avoid false alarms. The F1-score struck a balance between the two, preventing
too many false positives while ensuring that we did not overlook real cases. These measurements
helped us identify the most viable options by clearly illustrating the advantages and
disadvantages of each model.
The PIDD, a reputable resource that contains health data from 768 women aged 21 and up, is the
dataset we used for this project. It is especially helpful for machine learning research because of
its comprehensive features. 20% of the data was used to test the model's ability to predict novel,
unseen cases, and the remaining 80% was used to train the model. This method made sure the
model gained valuable insights rather than merely memorising the data.
The main goal of this project is to improve healthcare by making it more accessible and intelligent.
People can better manage their health and receive assistance sooner if early alerts are made
available. This vision heavily relies on wearable technology. A smartwatch or fitness tracker that
incorporates machine learning could provide people with timely health information, encouraging
them to make healthier decisions or warning them when it's time to visit a doctor.
There are many more applications of machine learning in healthcare than just this one. Predictive
tools like these could aid physicians in making earlier diagnoses, better customising treatments,
and even identifying previously unknown health risks. They assist medical professionals,
ensuring that no important information is overlooked, rather than taking their place.
It's simple to envision how this strategy might develop in the future. AI systems that provide
tailored recommendations could be integrated with devices that track everyday health metrics,
such as heart rate, blood sugar, and physical activity. Technology like this has the potential to
change the way we treat chronic illnesses by emphasising prevention over emergency response.
There is more to this research than just numbers and algorithms, of course. It has a profoundly
human aspect. Keeping track of medication, controlling diet, and maintaining an active lifestyle
can all be emotionally taxing for people with diabetes. There's a lot to manage. Early support
systems can give people encouragement and peace of mind, making them feel less alone and more
in control of their health journey.
Numerous aspects of life are already changing as a result of machine learning, but its potential in
healthcare is particularly intriguing. By demonstrating how technology can aid in disease
prevention rather than just treatment, this project makes a significant advancement in that
direction. We intend to create something that helps people before they even realise they need it, by
continuing to improve our models, grow our data, and integrate with commonplace tools.
The remainder of this report is organised as follows. Section 2 summarises research work in this
area, emphasizing important studies and their results concerning predicting diabetes. Section 3
describes the model on which this work is built, covering the dataset, the system structure, and
the methods used. Section 4 deals with the experimentation and evaluation of the model using
appropriate metrics, while Section 5 concludes with the results drawn from the study and indicates possible
avenues for future research and development. Section 6 lists the sources cited in the report,
including research papers and articles related to diabetes prediction using machine learning
techniques. Section 7 provides supplementary materials that support the report, including dataset
descriptions, machine learning models and hyperparameters, system specifications, and model
evaluation metrics. Section 8 reflects on the team's experience working on the project,
highlighting the challenges they faced, the skills they developed, and the importance of
teamwork, communication, and patience in completing the project successfully.
2. LITERATURE SURVEY
This review sifts through different methods for diabetes detection across research works that
demonstrate how machine learning has enhanced medical predictions. One study fed nine health
parameters (including age, BMI, and blood sugar levels) from a dataset of 800 patient records into
its models. The models tested included Random Forest, Gradient Boosting Classifier, and Logistic
Regression, with the first two achieving 76 percent accuracy; Logistic Regression nevertheless
emerged as the winner among these models, proving competent for medical classification. This
shows how the nature of a dataset influences the choice of algorithm.
Another research effort, in the domain of diabetes management systems, applied Random Forest
and Logistic Regression to a dataset of 796 entries. The method showcased the potency of
ensemble learning for structured medical data, reaching an accuracy of 80.52%. Interestingly, this
method surpassed several conventional models, including Decision Tree, Support Vector Machine
(SVM), and Soft Voting, thereby demonstrating its capability in medical diagnostic applications.
Classification techniques such as SVM, Decision Tree, Logistic Regression, Random Forest, and
Gradient Boosting have also been evaluated on the well-known standard PIDD, which contains
768 samples and nine medical features. Random Forest consistently gave the best accuracy across
several tests, reaching an overall performance of 77%. However, Logistic Regression also
deserves special mention, since it achieved 96% accuracy in a few test cases, confirming its
suitability for binary classification problems like diabetes detection.
2.1 Existing System
Diabetes affects a large number of people worldwide, making early detection vital to its timely
treatment. Conventional diagnosis based on traditional biochemical assays, glucose measurement
being the most common, can be costly, time-consuming, and not always accessible. With the
arrival of computational technologies, advances in machine learning have provided strong tools
for diagnosis, revealing patterns in a patient's history that were never made visible through
conventional techniques.
Random Forest and Logistic Regression are the most commonly used and well-studied
algorithms for diabetes detection models. Logistic Regression predicts the probability of having
diabetes from key health indicators and is favoured for its sound probability estimates, while
Random Forest is a powerful classifier that handles the complexity of many interacting variables;
together they enhance diagnostic correctness.
Numerous studies have considered several machine-learning models for risk estimation in
diabetes. In one assessment, with a study sample of 800 subjects, logistic regression yielded a
reliable accuracy of 76% where the relationships among the variables were approximately linear.
Another study using Random Forest and Logistic Regression reports an accuracy of 80.52% on a
dataset of 796 patients, which has to a large extent provided impetus for ensemble learning to
enhance classification.
Classifiers including SVM, Decision Tree, Logistic Regression, Random Forest, and Gradient
Boosting were further investigated on the well-known Pima Indian Diabetes dataset, which is rich
in diverse medical attributes and comprises 768 samples. Random Forest proved the most
consistent classifier, coping well with interactions among complex features and high-dimensional
data. Remarkable, nevertheless, is the accuracy achieved by Logistic Regression, which in some
reported tests reached 96%, confirming its competence on binary classification problems.
Inevitably, these developments have not come without challenges. Most studies use publicly
available datasets, which may misrepresent population diversity and jeopardize applicability in
real-life healthcare settings. Reliance on ensemble approaches that combine many models for
enhanced reliability will be critical in mitigating this issue going forward. Random Forest and
Logistic Regression, with their strong classification performance and capacity for estimating
probabilities, strike a balanced chord in the detection of diabetes.
Future improvements in deep learning, perhaps by incorporating neural network approaches,
would further enhance prediction accuracy. AI-enabled portable health technology would open a
new paradigm for continuous surveillance, leading to real-time personalized therapy for diabetes
management. Machine learning has greatly changed diabetes diagnosis, leading to the
development of effective screening methods that empower health workers to take preventive
measures early against the disease. Considerable effort is being directed at algorithm
interpretability and biases in data. Random Forest and Logistic Regression nevertheless perform
commendably on real classification challenges, where machine learning directly contributes to
improved patient outcomes through early prevention and easier access to healthcare services. In
the foreseeable future, deep learning and AI-powered wearables should flourish and consequently
enhance diagnostic accuracy, changing the game in diabetes control by enabling continuous
monitoring of subjects. [1]
The investigation reported in [2] notes that many thousands of people worldwide live with
diabetes, making it one of the most common metabolic disorders, and one whose serious
complications can be averted when it is detected in time. Conventional diagnostic techniques use
biochemical tests such as glucose level readings which, though very accurate, can be costly and
time consuming. As computational technologies have advanced, machine learning has emerged
as a powerful instrument for medical diagnostics that effectively analyses complex patterns in
patient data.
Models aimed at improving diabetes detection performance have used a variety of machine
learning methods such as Random Forest, Decision Trees, and Adaptive Boosting; these models
form the basis on which strong predictions can be made from structured medical data. Another
study, published in [3], examines efforts to develop an AI-driven diabetes prediction system,
indicating the trend towards intelligent algorithms in health. The suggested system has several
features, including an AI chatbot, a diabetes risk prediction module, and appointment scheduling,
to further improve diagnostic accuracy and patient management.
The study uses the Random Forest algorithm to predict diabetes, achieving an accuracy of 90.4%,
notably higher than traditional classification methods. One of its major findings concerns data
preprocessing, which improves a model's performance: outliers, irrelevant features, and missing
values can play a huge role in degrading prediction accuracy. The study addresses these
data-related problems through robust scaling normalization to reduce the effect of outliers, mean
imputation for missing data, and oversampling techniques to balance class distributions.
To optimize machine learning models and ensure that only the most important variables affect
predictions, feature selection is essential. The study compared various classification models such
as Decision Trees, SVM, and Naive Bayes using benchmark datasets such as the PIDD. Results
show that ensemble learning methods, especially Random Forest, provide better classification
performance because they can handle high-dimensional data and capture complex relationships
between medical variables. Adaptive boosting further enhances model reliability by merging
several weaker classifiers into a more powerful predictive system.
The research adopts a systematic approach, beginning with data preprocessing to ensure accuracy
and standardization of values. This allows the system to learn from past patient data without
sacrificing generalization to new cases. The classification process has two steps: in the first, all
features are used for prediction, and the results indicate Random Forest to be the most accurate.
Evaluation metrics such as precision, recall, and F1-score reveal model performance, while
validation loss trends indicate that more training epochs increase accuracy and lower the risk of
overfitting.
Predicted results are compared against actual patient diagnoses to measure practical
applicability, demonstrating effectiveness in real-world medical situations. The study notes its
merits along with some limitations and offers suggestions for future research. Hyperparameter
tuning, additional external parameters such as genetic markers and lifestyle factors, and hybrid
models that merge machine learning and deep learning approaches could all add to predictive
accuracy in future studies.
In the future, real-time adaptability is likely to be a significant area of focus, allowing models to
update predictions dynamically from data collected from individual patients in real time rather
than relying purely on past patterns. The study showcases how machine learning can give medical
professionals practical support in advancing their diagnostic techniques, reflecting the growing
role of artificial intelligence in healthcare. Healthcare professionals implementing predictive
AI-powered models are thus better positioned to guide patients through the therapies they
undergo.
It is expected that, as AI unfolds, innovative deep learning algorithms will improve existing
diagnostic instruments and could lead to a revolution in diabetes treatment. Research on wearable
AI may lead to the development of customized healthcare and continuous monitoring methods,
and further changes in diabetes management will come through data-driven, well-tested solutions.
Machine learning holds great promise in augmenting healthcare by making it proactive,
accessible, and effective through the combination of medical knowledge and computational
intelligence.
Diabetes is stated to have affected millions around the world. The ailment, regardless of how it
manifests, carries a profile of associated complications, namely kidney disease, changes in vision,
and cardiovascular disorders. Because of the rising prevalence of diabetes, researchers are
working on machine-learning techniques for better early detection and prognosis, hoping to
improve patient outcomes through data-driven health insights. Older diagnostic methods rely on
blood glucose tests and the doctor's assessment; machine learning opens a new route that is more
fruitful in exploring complex health trends.
Prediction models for diabetes risk have been studied using various methodologies. Algorithms
in this research field include Logistic Regression, SVM, Naïve Bayes, Decision Trees, and
Random Forest, while ensemble methods such as AdaBoost, Gradient Boosting, and XGBoost
have also been applied. M. Hasan and F. Yasmin [2] conducted a study to devise a new framework
in which deep learning architectures aid conventional classifiers to yield improved predictive
accuracy.
This study compares several machine learning models on a diabetes dataset from Kaggle. It
shows that Random Forest and XGBoost outperform classical methodologies by handling
structured medical data well. Going further, the study incorporates a combination of Long
Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) layers. This hybrid
architecture performs phenomenally well, reaching an accuracy of 99.79%: the CNN layers
discover critical features from the patient profile while the LSTM layers capture sequential
dependencies, letting the system predict diabetes with almost perfect accuracy.
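The exact architecture of that study is not reproduced here; the sketch below is only a rough illustration, assuming Keras/TensorFlow, of the kind of Conv1D-plus-LSTM classifier described above, with layer sizes chosen arbitrarily.

from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical setup: the eight clinical features are treated as a length-8 sequence
# with one channel so that Conv1D and LSTM layers can be stacked.
n_features = 8
model = keras.Sequential([
    layers.Input(shape=(n_features, 1)),
    layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),  # local feature extraction
    layers.LSTM(16),                                                      # sequential dependencies
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                                # probability of diabetes
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# Training would then look like:
# model.fit(X_train_scaled.reshape(-1, n_features, 1), y_train, epochs=50, validation_split=0.2)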
Another point to consider is that irrelevant predictors can dilute forecasts, which makes feature
selection part of the optimization process for any machine learning model. In this sense, Hasan
and Yasmin identified age, family history, BMI, and glucose levels as the important variables
through Random Forest feature importance, which made their model reliable and interpretable.
Their research also highlights the improvement brought by ensemble learning: combining several
classifiers increases predictive power, especially for highly complex applications in medicine.
Neural networks, including hybrids of CNNs and LSTMs, have performed extraordinarily well
at spotting the nuanced, hidden patterns in medical data, while simpler models such as logistic
regression offer decisions that are easier to explain. Improvements have certainly been made, but
not everything is smooth yet. Studies usually rely on open-source databases such as the PIDD,
which does not represent a diverse population, so the use of realistic electronic health records
needs to be promoted.
In addition, optimization choices such as the Adam optimizer and hyperparameter tuning have
contributed greatly to model stability and effectiveness. Model assessment also relies on
important performance metrics such as accuracy, precision, recall, F1-score, and the area under
the receiver operating characteristic curve (ROC-AUC) to support reliability. High ROC-AUC
values are recorded in Hasan and Yasmin's study, proving the success of their model in
differentiating between patients with and without diabetes.
Predictive systems are expected to develop into more advanced models capable of updating
forecasts from real-time patient data rather than relying only on information learned in the past.
Future work could focus on hybrid models combining deep learning and traditional classifiers,
better feature extraction methods, and the use of wearable medical technology for continuous
diabetes monitoring.
Deep learning has come into use in diabetes detection and is increasingly improving diagnostic
accuracy while providing real-time, evidence-based insight to health professionals for earlier
detection and individualized treatment planning. Machine learning is capable of transforming
diabetes care through intelligent and flexible health solutions, courtesy of the continued growth
of artificial intelligence and wearable technology [3].
The study in [4] covers the incorporation of deep learning approaches into hourly electricity
price forecasting, specifically long short-term memory networks, which have proven effective in
handling time-series data. Electricity markets are complicated and volatile, with a whole host of
factors involved, such as weather, system loads, regulatory policies, and shifting market
dynamics; an electricity price forecasting model must therefore embody these complex
relationships and adapt to the changes that continuously occur in the large datasets it handles.
Issues such as data availability and data quality must be resolved before machine-learning-based
diabetes prediction modelling can even begin. The small and sparse availability of widely
accepted data is considered one of the greatest limitations. The PIDD, the most commonly used
dataset, consists of only 768 records. Although useful in the early stages of research, its small
sample size hardly allows a model to generalize to larger target populations. When machine
learning models are trained on very sparse datasets, overfitting results: they perform well on the
training set but poorly when predicting outcomes on new, unseen data.
A second challenge arises from the imbalanced nature of the dataset: the distribution of cases
with and without type-2 diabetes is uneven. The majority class tends to dominate model fitting,
and diabetic patients are more often left out of the identification loop owing to this class
imbalance. Researchers have used various techniques to balance the data, such as oversampling
and under-sampling. Another approach is to artificially generate synthetic samples so that the
minority class has better representation, using methods such as the Synthetic Minority
Over-Sampling Technique (SMOTE).
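A minimal sketch of SMOTE applied to the training split is shown below, assuming the third-party imbalanced-learn package and the X_train/y_train names introduced in the earlier sketches; it is illustrative rather than part of the cited studies.

from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample only the training data; the test set must keep its natural class distribution.
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print("Before balancing:", Counter(y_train))
print("After balancing:", Counter(y_train_bal))  # both classes equally represented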
Other preprocessing hurdles arise from empty or inconsistent values, particularly those
associated with evaluating patients' metabolic health, such as insulin levels and skin thickness.
Missing data of this kind can compromise consistency during training and testing, which later
affects predictive reliability. To remedy this, researchers have populated missing values through
predictive modelling based on correlations with other variables, median substitution, and mean
imputation. Such adjustments increase a dataset's completeness; however, they need to be made
with caution so as not to introduce bias.
Feature selection further improves how a model functions by de-emphasizing the many
candidate attributes in favour of the factors of major concern for diabetes risk prediction (blood
glucose levels, body mass index, age, and family history). Narrowing the analysis down to the
most pertinent factors helps reduce dataset complexity, speed up computation, and increase the
interpretability of predictions. Methods for assessing which factors matter most for accurate
prediction include tree-based feature importance assessments, correlation analysis, and Recursive
Feature Elimination (RFE).
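As a small illustration, the sketch below applies scikit-learn's RFE with a logistic regression estimator to the training split; the choice of estimator and the number of features kept are assumptions made for illustration, not settings taken from the cited work.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively eliminate the weakest features until four remain.
estimator = LogisticRegression(max_iter=1000)
selector = RFE(estimator, n_features_to_select=4)
selector.fit(X_train, y_train)

selected = X_train.columns[selector.support_]
print("Selected features:", list(selected))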
To adjust the parameters that govern model performance, researchers utilize hyperparameter-
tuning methods such as Grid Search and Random Search. These techniques help tune the models
towards their best achievable accuracy.
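A minimal sketch of a grid search over a Random Forest is given below, assuming scikit-learn and the training split defined earlier; the parameter grid is an arbitrary example rather than the grid used in any cited study.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5],
}

# 5-fold cross-validated grid search, scored on F1 to respect the class imbalance.
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))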
Despite these data-related obstacles, machine-learning techniques for diabetes prediction
continue to improve. Ensemble techniques such as Random Forest and Gradient Boosting have
been used to make predictions more accurate and more trustworthy. With far greater precision,
deep learning techniques such as CNNs and LSTM networks can now recognize complex health
data patterns on their own. Such advanced models ease diagnosis with fewer constraints of
manual feature engineering, and the limitations of classical datasets can be countered by
integrating real-time patient monitoring with multiple sources of medical data. Diabetes
prediction has thus shifted from a retrospective evaluation to a proactive, real-time surveillance
system driven by wearable health gadgets monitoring blood glucose and physical activity levels.
With these developments in AI, it is imperative that machine learning experts and medical
researchers collaborate on defining the next generation of predictive healthcare models. Given
advanced feature engineering, clever data augmentation techniques, and enhanced deep learning
methodologies, they can harness highly accurate models for personalized healthcare solutions.
Future studies could focus on adaptive learning models, which update predictions in real time
based on patterns and changes in a patient's data. These advancements will primarily improve the
accuracy and utility of diabetes evaluation and management. Machine learning innovations will
ensure enhanced accessibility, accuracy, and proactiveness in health care, thereby empowering
patients and health care professionals to intervene in time for disease prevention and
management.
3. METHODS
3.1 Dataset Description
The Pima Indian Diabetes Dataset (PIDD) was used, a dataset commonly employed for machine
learning applications in health. The full dataset has 768 records of individual persons with eight
clinical attributes: number of pregnancies, glucose level, blood pressure, skinfold thickness,
insulin level, Body Mass Index (BMI), diabetes pedigree function (a measure of family diabetes
history), and age.
Each entry also carries an outcome label that indicates whether or not the individual was
diagnosed with diabetes. We divided the dataset in the ratio of 80 to 20 so that, while learning, our
models discovered the relevant features efficiently and generalized well to new data.
We spent some time on data preprocessing before moving on to building the models. This
included handling missing or zero values, feature scaling, and normalizing the data so that all
attributes were on an equal footing. This step was important to make sure that our models learned
from consistent, clean data and therefore produced predictions that were more likely to be
trustworthy and accurate.
Table 1: Dataset Description
Table 1 shows a sample of the dataset, listing the number of pregnancies, glucose level, blood
pressure, skin thickness, insulin level, BMI, diabetes pedigree function, age, and outcome for five
individuals. The data is organized into columns, with each row representing a single individual.
Figure 1: Dataset Distribution
In figure 1 the pie chart illustrates the distribution of various factors contributing to diabetes, with
glucose being the most significant factor at 30.97%, followed by BMI at 20.16%, and age at 16.13%.
The remaining factors, including pregnancies, blood pressure, skin thickness, insulin, diabetes pedigree
function, and outcome, contribute smaller percentages to the overall distribution.
3.2 Schematic Layout
Figure 2 presents a flowchart of the system workflow for the diabetes prediction model,
comprising six stages: data collection, data preprocessing, feature selection, model training,
hyperparameter tuning, and model evaluation. The workflow checks whether the model achieves
acceptable prediction accuracy; a "yes" indicates success, while a "no" sends the process back for
further hyperparameter tuning.
3.3 Methods
Data Preprocessing
Errors were investigated thoroughly prior to training models with the raw PIDD. Several
variables, including blood pressure, glucose level, skin thickness, and insulin, contained zeros that
are unreasonable in a medical context. These zeros were treated as missing values and handled
through imputation, substituting mean or median values. Standardisation was then applied to
bring all features to a common scale, preventing any one attribute from skewing the results or
otherwise seriously affecting the model.
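A minimal sketch of this cleaning step is shown below, assuming pandas/scikit-learn, the standard PIDD column names, and the X_train/X_test split from the sketch in Section 1; the use of median imputation and StandardScaler follows the description above, but the exact settings are illustrative.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Columns where a zero is physiologically implausible and really means "missing".
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
X_train[zero_as_missing] = X_train[zero_as_missing].replace(0, np.nan)
X_test[zero_as_missing] = X_test[zero_as_missing].replace(0, np.nan)

# Impute with medians learned from the training split only, then standardize.
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imp)
X_test_scaled = scaler.transform(X_test_imp)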
Changes to Features
With all features standardized, smaller-valued variables were not overshadowed by larger ones,
allowing each model to perform at its best. This balanced scaling ensured that each feature was
weighted fairly without distorting the eventual outcomes. Since the dataset is entirely numerical,
no categorical encoding was required, which streamlined preprocessing while preserving data
integrity.
Model Selection
We tested four machine learning algorithms in order to develop a robust diabetes prediction
system:
Logistic Regression - chosen for its straightforward implementation, good predictive power, and
easy interpretation when identifying diabetes risk factors.
Decision Tree - a helpful tool for probing the hierarchical patterns behind the decision-making
process in the dataset.
Random Forest - an ensemble method used to boost the accuracy and stability of predictions.
XGBoost - best known for its high classification accuracy, regularization features, and strong
handling of structured data.
Each model was trained on 80% of the dataset, with the remaining 20% kept aside to test how it
performs on new, unseen cases.
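The sketch below shows how these four models might be trained side by side, assuming scikit-learn and the xgboost package together with the X_train_scaled array and y_train labels from the preprocessing sketch above; the hyperparameters are illustrative defaults rather than our tuned values.

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

# Fit every model on the 80% training portion.
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    print(name, "trained.")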
Model Evaluation
To evaluate the performance of our models, we took into consideration several evaluation metrics:
Accuracy: Measures the proportion of correct predictions over the entire dataset, thus giving an
overall measure of trust in the model.
Precision: Indicates how many of the predicted positive cases were actually correct, ensuring that
the model does not raise too many false alarms.
F1-Score: The F1-Score represents a trade-off between precision and recall, and is therefore
useful for situations of an imbalanced dataset.
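The sketch below computes these three metrics for each trained model on the held-out 20%, assuming the models dictionary and the test arrays defined in the sketches above.

from sklearn.metrics import accuracy_score, precision_score, f1_score

for name, model in models.items():
    y_pred = model.predict(X_test_scaled)
    print(f"{name}: "
          f"accuracy={accuracy_score(y_test, y_pred):.3f}, "
          f"precision={precision_score(y_test, y_pred):.3f}, "
          f"F1={f1_score(y_test, y_pred):.3f}")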
4. RESULTS
This section describes the technical specifications used in the project and the variables that were
changed or held fixed for the experiments and simulations. It then presents the experimental
results, together with statistical data and examples demonstrating the efficacy and functionality of
the system.
"Output and results" will talk about the technical specifications used in the project. Apart from
that, it can include the variables that were modified or unmodified in conducting
experiments/simulations. Eventually, experimental results will be found within this section as
well, including statistical information or examples that demonstrate the system's efficacy or
functionality.
Table 2: Results
The table presents the results of four machine learning algorithms: Logistic Regression, Decision
Tree, Random Forest, and XGBoost. The metrics used to evaluate these algorithms are Accuracy,
Precision, and F1 Score.
The coming change in healthcare goes beyond technology. AI researchers will work hand in hand
with medical practitioners in future healthcare. Medical systems that combine computational
tools with clinical expertise can devise treatment plans tailored to the individual risk factors of a
patient. AI diagnostics has the potential to reduce healthcare costs, improve patient outcomes, and
make preventive care accessible as the disease burden grows. Machine learning has not merely
improved the prediction of diabetes; it is helping to build a responsive healthcare system focused
on early detection, prevention, and personalized treatment strategies.
The results demonstrate the central idea: predictive algorithms can increase diagnostic precision
by facilitating early detection, clearly showing the possibilities of machine learning in diabetes
prediction. Including these models in health monitoring devices would help motivate people to
take proactive steps to manage their health and to stay informed in a timely manner.
Indeed, improvements in disease prediction have been made possible by the remarkable capacity
of XGBoost, underscoring the value of complex ensemble methods in healthcare applications.
Synergy between data scientists and practising doctors will bring about many technological
advances as AI becomes more personalized, promising more accuracy, flexibility, and prevention
in the healthcare industry.
Figure 3 shows the logistic regression model, which delivers moderate performance in predicting
diabetes risk, with an AUC score of 0.73. The confusion matrix reveals that the model correctly
classified 30 true negatives and 14 true positives but produced 20 false negatives and 36 false
positives. This indicates that while the model is good at identifying non-diabetic cases, it struggles
to detect diabetic patients accurately.
Figure 5: Evaluation of Random Forest
Figure 5 shows the random forest model, which exhibits a slightly lower performance than the
decision tree, with an AUC score of 0.85. The confusion matrix indicates that the model correctly
classified 32 true negatives and 25 true positives, with 8 false negatives and 35 false positives.
Although the random forest model is still effective, it is not as accurate as the decision tree model
in predicting diabetes risk.
Figure 6 represents the XGBoost model which shows the best performance among all models,
with an AUC score of 0.91. The confusion matrix reveals that the model correctly classified 34
true negatives and 28 true positives, with only 6 false negatives and 32 false positives. This
indicates that the XGBoost model is highly effective in predicting diabetes risk and outperforms
the other models.
The visuals show how well the XGBoost and logistic regression models predict the risk of
diabetes. ROC curves and confusion matrices help to assess how well each model discriminates
between diabetic and non-diabetic cases.
The confusion matrices show how accurately each model classified cases into positive and
negative categories. Logistic regression performed reasonably at times, but it produced false
negatives and hence sometimes failed to identify true diabetes patients. Superior results were
obtained with XGBoost, which reduced misclassification and increased true-positive
identification.
The ROC curves show each model's ability to differentiate between cases with and without
diabetes. The higher AUC score of XGBoost compared with logistic regression implies that it
produces more accurate predictions.
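A minimal sketch of how such plots can be produced for one model is given below, assuming matplotlib/scikit-learn and the fitted models dictionary and test arrays from the Methods sketches; it is an illustration, not the exact plotting code behind the figures above.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

model = models["XGBoost"]
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Confusion matrix: counts of true/false positives and negatives on the test set.
ConfusionMatrixDisplay.from_estimator(model, X_test_scaled, y_test, ax=axes[0])
axes[0].set_title("XGBoost confusion matrix")

# ROC curve, with the AUC reported in the legend.
RocCurveDisplay.from_estimator(model, X_test_scaled, y_test, ax=axes[1])
axes[1].set_title("XGBoost ROC curve")

plt.tight_layout()
plt.show()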
Overall, the comparison sets out the merits of using more advanced machine learning methods,
such as XGBoost, for clinical prognostication. Simpler models can provide useful insights;
however, ensemble methods and more elaborate algorithms often deliver improved accuracy and
are therefore more suitable for medical applications. The outcomes underscore how important it
is to choose the best model for predicting disease if the results are to be accurate enough to enable
earlier intervention and better outcomes for patients.
5. CONCLUSION
Through machine learning and artificial intelligence, disorders such as diabetes can be detected
much more efficiently at an early stage, and their prediction and diagnosis become much easier.
As data in the field become more complex, traditional statistical methods are often incapable of
picking up the subtle patterns that act as harbingers of serious disease. Machine learning, by
contrast, can not only unearth insights from heaps of patient data but also facilitate exact
diagnosis and proactive management of the condition. Beyond individual interventions, these
strategies pave the way for clinicians to personalize treatment schedules, with significantly better
long-term health outcomes achieved through the prompt identification of warning signs and
immediate, timely interventions.
Diabetes is among the most prevalent metabolic disorders, affecting millions of people in the
world today, which makes its detection important for averting complications in the future.
Diagnosis has always depended upon biochemical evaluation employing fasting glucose readings
and haemoglobin A1c levels. These tests are good, but they generally confirm diabetes only after
the illness is established. Machine-learning technology now enables an advance towards
preclinical management of diabetes, before symptoms appear. These models can flag at-risk
individuals using major health parameters such as glucose levels, body mass index, age, insulin
resistance, and family history, prompting individual risk re-evaluation and early access to
preventive healthcare.
In this study, we examined several machine learning models and their effectiveness in predicting
diabetes risk. All four models, i.e., XGBoost, Decision Tree, Random Forest, and Logistic
Regression, had their merits but also their disadvantages. The simpler models performed
reasonably well; however, Decision Tree and Logistic Regression struggled to identify complex,
nonlinear patterns. Logistic regression assumes a direct, proportional relationship between
features and outcomes, which does not necessarily hold for medical conditions such as diabetes.
Decision trees, while able to capture nonlinearities, tend to overfit: they fit the training dataset
very well but fail to generalize to unseen patients, which often results in low precision and
F1-scores on new cases.
Ensemble methods such as Random Forest refine prediction accuracy by combining multiple
decision trees to yield more reliable classification results; averaging the results across trees
reduces the overfitting caused by overly specific predictions. XGBoost, however, performed best
across the modelling trials, achieving higher accuracy and F1 scores than all the other models
tested. It is particularly well suited to diabetes prediction because it handles the unbalanced
datasets common in the medical field. Unlike classical classifiers, it is based on gradient
boosting, iteratively improving weak learners to achieve stronger accuracy. One of its major
benefits is the emphasis it places on misclassified instances during learning, which ensures that
challenging cases with very subtle early signs of diabetes receive greater attention.
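As a hedged sketch (not the project's exact code), the models described above could be configured as follows; the parameter values are illustrative, scale_pos_weight is one common way XGBoost compensates for class imbalance, and X_train and y_train are assumed placeholders.

# Sketch: ensemble (Random Forest) and boosting (XGBoost) models with
# imbalance handling. Values are illustrative assumptions.
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

neg, pos = (y_train == 0).sum(), (y_train == 1).sum()

rf = RandomForestClassifier(n_estimators=200, random_state=42)  # bagging of decision trees
xgb = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    scale_pos_weight=neg / pos,   # up-weight the minority (diabetic) class
    eval_metric="logloss",
)

rf.fit(X_train, y_train)
xgb.fit(X_train, y_train)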
Hyperparameter tuning also played a part in XGBoost outperforming the other models.
Overfitting was a key concern: a model can achieve strong results on the training data yet
generalize poorly to newly available samples. Overfitting occurs when a model fixates on the
peculiarities of one dataset rather than learning trends that apply to larger and more diverse
populations. We therefore tried to balance the trade-off between complexity and generalizability
through the learning rate, tree depth, and number of estimators. By applying L1 or L2
regularization, we prevented overly complex models from attaching too much importance to
redundant or noisy features. This resulted in a model accurate enough for predicting diabetes in
real-time scenarios.
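As a hedged illustration of the regularization described above, XGBoost exposes L1 and L2 penalty terms through reg_alpha and reg_lambda; the specific values below are placeholders, not the exact settings used in this project.

# Sketch: the hyperparameters discussed above, including L1/L2 regularization.
# Concrete values are illustrative placeholders.
from xgboost import XGBClassifier

regularized_xgb = XGBClassifier(
    learning_rate=0.05,   # smaller steps reduce the risk of overfitting
    max_depth=3,          # shallower trees generalize better
    n_estimators=300,     # more, but weaker, learners
    reg_alpha=0.5,        # L1 penalty: pushes weights of noisy features toward zero
    reg_lambda=1.0,       # L2 penalty: discourages overly large weights
    eval_metric="logloss",
)
regularized_xgb.fit(X_train, y_train)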
If these models are further improved and integrated with electronic medical record (EMR)
systems, the future of diabetes care could be transformed. Machine learning could then serve as
an important bridge between medical diagnostics and personalized healthcare, making early
disease detection far more feasible and actionable. AI-enabled approaches that deliver timely
information to patients and care providers can support sound, time-sensitive decisions and
improved outcomes.
6. REFERENCES
[4] Kaur, S., & Kumar, R. (2020). A review on predictive models for diabetes using
machine learning techniques. International Journal of Engineering Research and
Applications, 10(6), 1-6. doi:10.35629/7728-10060106
7. APPENDICES
The dataset used in this study contains key medical features related to diabetes diagnosis. It
consists of records, each representing an individual with various clinical attributes that contribute
to diabetes risk assessment. Below is a description of the features used in the study:
The dataset was pre-processed before training the models: missing values were handled,
continuous features were normalized, and feature selection was performed to enhance
classification accuracy.
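The exact pre-processing code is not reproduced here; the following is a minimal sketch, assuming a pandas DataFrame df with Pima-style clinical columns and an Outcome label (assumed names), of how these steps can be chained with scikit-learn.

# Sketch: pre-processing pipeline (imputation, scaling, feature selection).
# `df` and its column names are assumptions based on a Pima-style diabetes dataset.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Outcome"])
y = df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),       # handle missing values
    ("scale", StandardScaler()),                        # normalize continuous features
    ("select", SelectKBest(score_func=f_classif, k=6))  # keep the most informative features
])

X_train_prep = preprocess.fit_transform(X_train, y_train)
X_test_prep = preprocess.transform(X_test)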
Various machine learning algorithms were implemented and tuned to improve diabetes prediction.
Below is a summary of the models used and key hyperparameters adjusted:
XGBoost: Learning Rate, Max Depth, Number of Estimators
Hyperparameter tuning was performed using Grid Search and Random Search techniques to
optimize model performance while reducing overfitting.
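A hedged sketch of how Grid Search and Random Search can be applied to the XGBoost hyperparameters listed above; the parameter grids, scoring choice, and variable names carried over from the earlier sketches are illustrative assumptions.

# Sketch: Grid Search and Random Search over the XGBoost hyperparameters above.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 4, 5],
    "n_estimators": [100, 200, 300],
}

grid = GridSearchCV(XGBClassifier(eval_metric="logloss"), param_grid,
                    scoring="f1", cv=5)
grid.fit(X_train_prep, y_train)

rand = RandomizedSearchCV(XGBClassifier(eval_metric="logloss"), param_grid,
                          n_iter=10, scoring="f1", cv=5, random_state=42)
rand.fit(X_train_prep, y_train)

print("Grid Search best params:", grid.best_params_)
print("Random Search best params:", rand.best_params_)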
The system setup used for model training and evaluation was as follows:
This setup ensured efficient computation, enabling seamless processing of large datasets and model
training tasks.
To assess model effectiveness, standard evaluation metrics were used, providing insights into
accuracy and reliability. Below is a summary of model performance:
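As a hedged illustration of how the metrics in such a summary can be computed, the sketch below reuses the tuned model and held-out test data from the earlier sketches (assumed names, not the project's exact code).

# Sketch: computing the standard evaluation metrics summarized above.
from sklearn.metrics import accuracy_score, classification_report

best_model = grid.best_estimator_          # tuned XGBoost from the search sketch
y_pred = best_model.predict(X_test_prep)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["non-diabetic", "diabetic"]))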
8. REFLECTION OF THE TEAM MEMBERS ON THE PROJECT
In completing this project, we improved our abilities in scientific writing, teamwork, and time
management. The idea of doing both the research component and the write-up initially made us
feel overwhelmed, but that feeling quickly vanished once we came up with a strong plan. At the
start we were quite worried for a number of reasons, including the realization that our previous
assignments, study, and work were not sufficient preparation on their own.
Additionally, this initiative encouraged us to improve our communication skills. We had to change
our first approach to teamwork, where we would divide the work and merely examine each other's
contributions, to one where we would complete each component individually before combining them
to create the best final product.
This was a long-term design project, and everything was new to us. In our opinion, the most
important requirements for such lengthy work were patience and regular communication with
the other team members and our supervisor, whose knowledge and guidance helped us greatly
throughout the project.
On the other hand, the primary research was the most difficult and time-consuming part for us.
It was hard to keep up with the submission dates because all the concepts were new to us. To
conclude, we overcame the difficulties as a team and, under the guidance of our supervisor,
completed the project on time, learning new concepts and gaining experience along the way.