ML Models for Cervical Cancer Detection
ALGORITHMS
2023
DECLARATION
This research is our original work and has not been presented anywhere for a degree award.
Name: JOHN OTIENO ANAM Reg. No: SCM 223-0154/2020
Signature...............................................................Date:13/11/2023
This research has been submitted for examination with my approval as a university
supervisor.
Name of Supervisor: Dr. Samuel Mutua
Signature.................................... Date: 13/11/2023
Table of Contents
INTRODUCTION
1.1 Background of the Study
1.2 Problem Statement
1.3 Objectives of the Study
1.3.1 General Objective
1.3.2 Specific Objectives
1.4 Justification of the Study
LITERATURE REVIEW
2.1 Introduction
2.2 Related Work
2.3 Summary and Critiques
METHODOLOGY
3.1 Introduction
3.2 XGBoost Model
3.2.1 Objective Function
3.2.2 Loss Function
3.2.3 Regularization Term
3.2.4 Gradient Descent Update
3.3 Random Forest
3.3.1 Nodes in a Decision Tree
3.3.2 Decision Tree Pruning
3.4 Comparing Model Performance
3.4.1 Confusion Matrix
3.4.2 Precision
3.4.3 Recall
3.4.4 F1 Score
3.4.5 ROC Curve and AUC
3.5 Research Dataset
References
APPENDIX 1: WORK PLAN
APPENDIX 2: BUDGET
LIST OF ABBREVIATIONS
DT Decision Tree
MI Mutual information
ML Machine Learning
CHAPTER ONE
INTRODUCTION
1.1 Background of the Study
Cervical cancer is the fourth most frequent cancer among women globally, making it a
serious public health concern. Updated figures released by the International Agency for
Research on Cancer (Ferlay et al., 2021) show that about 570,000 new cases and 311,000
deaths were recorded in 2018. The prevalence of cervical cancer remains high even where
screening techniques are available, especially in low-resource settings. This high incidence
and mortality can be attributed to numerous factors, such as restricted access to healthcare
facilities, a lack of knowledge about preventive measures, and obstacles to early diagnosis
and treatment (Arbyn et al., 2020).
Kenya, like many developing countries, faces knowledge disparities and resource constraints
that weaken health service delivery and, in turn, limit screening services, which act as the
principal preventive measure since cervical cancer is largely preventable. In countries with
better health facilities, workforce, and services, screening has proven effective and has led to
a reduction in cervical cancer mortality and morbidity (Khan et al., 2014).
Computational approaches have been rapidly integrated into the healthcare profession in
recent years to improve patient care and illness management. Topic modelling is one of the
many Natural Language Processing (NLP) techniques that have become effective tools for
gleaning meaningful insights from massive volumes of textual data.
The application of machine learning (ML) techniques has demonstrated potential in tackling
these issues by offering automated, precise, and economical methods for risk assessment and
cervical cancer detection. Complex patterns in cervical cancer data can be analysed by ML
algorithms, allowing for early detection and individualized treatment plans.
ML models are capable of processing a wide range of data types, such as clinical history,
patient demographics, and medical imaging, in order to detect aberrant cervical cells. For
example, using deep learning algorithms on colposcopy images can help identify
precancerous lesions with a high degree of accuracy (Li et al., 2019). Furthermore, to help
with early diagnosis, ML algorithms can examine genomic data to find genetic markers
linked to cervical cancer (Lou et al., 2017).
Machine learning based predictive modelling aids in determining the risk variables for each
patient and forecasts the chance of acquiring cervical cancer.
ML algorithms may categorize patients into distinct risk groups by examining a variety of
risk indicators, including age, sexual behavior, and HPV infection status. According to
Balasubramanian et al. (2018), this makes tailored preventative tactics and targeted therapies
possible.
Although machine learning holds considerable promise for cervical cancer research, there are
obstacles and ethical issues to take into account. Researchers need to manage several
important issues, including data privacy, algorithm bias, and the requirement for consistent
datasets. Working together, data scientists, ethicists, and healthcare professionals can
guarantee the ethical development and application of machine learning based cervical cancer
treatments.
The application of machine learning techniques to cervical cancer research has the potential
to significantly enhance risk assessment, early detection, screening precision, and
individualized treatment. Machine learning (ML) based methods have the potential to
transform cervical cancer prevention and treatment, ultimately saving lives and lessening the
disease's worldwide burden, as long as researchers keep improving algorithms, addressing
ethical issues, and fostering greater collaboration.
While useful, conventional screening techniques like the Pap smear test have certain inherent
drawbacks. These include the subjective interpretation of results, the possibility of
inconsistent diagnoses, and a significant false-negative rate, which delays identification and
response. Additionally, general knowledge of cervical cancer and its preventive measures is
lacking in many low-income nations, leading to late-stage diagnoses and worse survival
rates.
Given these difficulties, a comprehensive approach to the cervical cancer issue is urgently
required. It is crucial to create novel, precise, and easily understandable screening techniques
that can get around the drawbacks of conventional procedures.
Additionally, focused awareness efforts and educational programs are required to inform
women about cervical cancer, its risk factors, and the value of routine screenings. It is crucial
to close the healthcare gap by utilizing technology and machine learning algorithms to deliver
precise and quick diagnoses, especially in areas with a lack of healthcare resources.
This study seeks to make a significant contribution to this important field by investigating
approaches using machine learning algorithms.
High precision analysis of large datasets, such as patient histories and medical imaging, is
possible with machine learning techniques. Healthcare workers can identify aberrant cervical
cells more quickly and correctly than with traditional approaches by using machine learning
models to automate the examination of Pap smears, colposcopy images, and other diagnostic
tests. According to Li et al. (2019), early detection is essential because it allows for prompt
therapies, which in turn raise the likelihood of cure and survival.
Large scale patient data analysis can be performed by machine learning algorithms, which
can also produce insightful reports for medical professionals. By providing evidence-based
information, these insights enable clinicians to make well-informed decisions. According to
Wang et al. (2020), machine learning plays a significant role in improving patient care,
lowering healthcare expenses, and improving overall healthcare quality by assisting
healthcare workers in selecting the most effective treatments and interventions.
Women sometimes encounter obstacles while trying to access services for cervical cancer
screening and treatment, especially in rural and underserved areas. The development of
portable, low-cost screening instruments and telemedicine applications can be aided by
machine learning. By offering easily accessible and reasonably priced screening options,
these technologies can close the gap in healthcare disparities and guarantee that a greater
number of people can take advantage of early diagnosis and prompt interventions.
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
This chapter presents background on related work. It reviews recent and prominent papers,
journal articles, and other publications in the field of public health that apply machine
learning to modelling cervical cancer.
2.2 Related Work
They used Principal Component Analysis (PCA) to reduce dimensionality without sacrificing
model accuracy, and the Synthetic Minority Oversampling Technique (SMOTE) to address
the problem of imbalanced data. To avoid overfitting, a stratified 10-fold cross-validation
approach was used to assess the models.
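For readers unfamiliar with SMOTE, its core idea can be sketched in a few lines of NumPy: each synthetic sample is interpolated between a minority-class point and one of its nearest minority-class neighbours. This is an illustrative simplification with names chosen here for exposition, not the implementation used in the reviewed study:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating each
    chosen sample toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                # exclude each point itself
    neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                    # random minority sample
        j = rng.choice(neighbours[i])          # one of its neighbours
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two existing minority points, the oversampled class stays inside the region the minority class already occupies.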
Using the SMOTE voting model produced notable gains in accuracy, sensitivity, and area
under the Receiver Operating Characteristic curve (ROC_AUC) for all target variables.
Improvements in sensitivity, accuracy, and Positive Predictive Agreement (PPA) ranged from
0.93% to 5.13%, 39.26% to 46.97%, and 2% to 29%, respectively.
The application of PCA shortened the duration of computational processing, improving the
overall effectiveness of the model. Based on particular evaluation metrics, a comparative
analysis with prior research revealed the efficacy of their models and their capacity to
diagnose cervical cancer more quickly.
Tanimu et al. (2022) used the decision tree (DT) classification technique to examine risk
factors associated with cervical cancer. They investigated Least Absolute Shrinkage and
Selection Operator (LASSO) and Recursive Feature Elimination (RFE) feature selection
strategies in order to improve prediction accuracy. Due to the missing values and extreme
imbalance in the dataset, SMOTE-Tomek, a combination of undersampling and oversampling
techniques, was applied.
Their investigation proved the efficacy of the DT classifier, especially when employing
certain characteristics from SMOTE Tomek and RFE. This method produced remarkable
sensitivity (100%) and accuracy (98.72%). The DT classifier demonstrated remarkable
efficacy in resolving cervical cancer classification issues by decreasing the feature count and
mitigating the problem of class imbalance. The study highlighted the significance of feature
selection and addressed class imbalances, offering insightful information for the creation of
prediction models for the diagnosis and prognosis of cervical cancer.
Zhang and Wang (2017) used supervised machine learning techniques to analyze different
methods with the goal of identifying cervical cancer in its early stages.
A complete dataset on cervical cancer including 858 patient records, 36 risk factors, and one
outcome variable was taken from the UCI repository in order to conduct their research.
Artificial neural networks, Bayesian networks, Support Vector Machines (SVM), random
trees, logistic trees, and XGBoost trees were the six classification algorithms used. The
models trained with and without feature selection techniques were evaluated in the study.
Three distinct feature selection techniques were applied: LASSO regression, wrapper method,
and relief rank.
The results of the investigation demonstrated the effectiveness of XGBoost, which, when all
features were taken into account, attained a maximum accuracy of 94.94%.
Interestingly, certain cases showed that using feature selection techniques increased
efficiency. The study demonstrated how machine learning outperforms conventional
statistical models in managing intricate, massive datasets and detecting crucial prognostic
characteristics.
The study recognized current limitations even though machine learning has enormous
potential for improving cervical cancer treatment and practical applications. More work is
required to improve the precision, dependability, and usefulness of clinical outcome forecasts
due to obstacles such as overfitting, incomplete and simplified data, and interpretability
problems in prediction models. This study highlights the necessity for ongoing developments
to properly address the difficulties of cervical cancer and represents the achievements made
in the use of machine learning for detection.
Ashok and Aruna (2016) developed an optimized feature selection model that identified
important characteristics using various selection techniques. Furthermore, the research
endeavored to develop classifier models using multiple algorithms, such as C5.0, random
forest, rpart, KNN, and SVM.
The researchers assessed each model's performance, thoroughly investigated training
options, and provided the results of these algorithms in the paper after careful analysis.
They tackled a number of important issues, including the uneven distribution of data on
cervical cancer. They used the Synthetic Minority Oversampling Technique (SMOTE) as a
preprocessing method to address this problem. SMOTE ensured a more representative sample
for analysis by helping to balance the dataset.
The effectiveness of these categorization techniques was carefully assessed by the researchers
utilizing crucial criteria like specificity, sensitivity, and accuracy.
According to their findings, using SMOTE as a preprocessing method greatly improved the
classification models' performance. In particular, the SMOTE-SSVM method fared better
than SMOTE-TSSVM and SMOTE-CART, indicating that this strategy is useful for
enhancing the precision and dependability of predictions regarding cervical cancer survival.
In the context of cervical cancer survival, this finding emphasizes the significance of
resolving data imbalance for more precise prediction models.
Feature selection was an important part of their process, and it has a big impact on the
diagnosis's accuracy. The researchers used a variety of techniques, including Mutual
Information (MI), Sequential Forward Search (SFS), Sequential Floating Forward Search
(SFFS), and Random Subset Feature Selection (RSFS), to determine which features were
most pertinent. These methods assisted in the identification of the most illuminating
characteristics required for a precise diagnosis of cervical cancer.
Through the combination of advanced feature selection techniques, feature extraction, and
image segmentation, the researchers developed a successful cervical cancer diagnosis
approach. This novel method not only increases diagnosis accuracy but also streamlines the
difficult process of examining Pap smear slides, helping pathologists diagnose cervical
cancer with greater precision and knowledge.
Novaković et al. (2017) aimed to select the best option from a range of classification models
produced by an intricate and repetitive model-building procedure. To do this, the study
examined the evaluation techniques used to gauge these models' performance.
The study covered the drawbacks of relying only on classification accuracy as an assessment
criterion. They noted that accuracy depends on the class distribution in the dataset and cannot
take into consideration the distinctions between the errors produced by the models. In order
to provide a more complex picture of the models' performance, the researchers looked at
alternate assessment metrics.
A study by Raschka (2018) provides a thorough analysis of methods that are essential for
model evaluation, model selection, and algorithm selection in the context of scholarly
machine learning research and practical implementations. Drawing from both theoretical
frameworks and practical data, it analyzes the benefits and drawbacks of various strategies
and underlines how important it is to use them effectively.
This paper explores several comparison methodologies, including omnibus tests, multiple-
comparison adjustments, and statistical tests for algorithm comparisons. Furthermore, when
working with limited datasets, alternative approaches to algorithm selection are advised,
such as nested cross-validation and the combined 5x2 cross-validation F test.
The holdout approach and other popular techniques for selecting and evaluating models are
covered in the article. It does, however, advise against using them with tiny datasets because
of possible problems. As a substitute for assessing performance uncertainty, the authors
present many iterations of the bootstrap technique.
They also cover cross-validation methods, such as k-fold cross-validation and leave-one-out
cross-validation, highlighting the necessity of balancing the bias-variance trade-off when
determining the right 'k' value. Practical, empirically grounded recommendations are given
for the best possible 'k' selection.
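The k-fold partitioning these authors discuss can be sketched as follows. This is a generic illustration (round-robin assignment of indices to folds), not code from any reviewed paper:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k folds; yield (train, val) index lists,
    with each fold serving exactly once as the validation set."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val
```

With k = n this degenerates to leave-one-out cross-validation, which is the bias-variance trade-off the review discusses: larger k lowers the pessimistic bias of the performance estimate but raises its variance and cost.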
Hall (1999) proposed that features highly correlated with the class yet uncorrelated with one
another should make up an effective feature set. In order to put this theory into practice, the
study presents a test-theory-inspired feature evaluation formula that serves as the foundation
for the Correlation-based Feature Selection (CFS) method.
The feature assessment formula, a suitable correlation measure, and a heuristic search
approach are all combined by CFS. Experiments using artificial and natural datasets and
three machine learning algorithms (the decision tree learner C4.5, the instance-based learner
IB1, and naive Bayes) were used to thoroughly assess the algorithm.
Dokduang et al. (2014) created and evaluated a method to forecast how radiation therapy
will affect bone metastases in cases of cervical cancer. The problem they encountered was
handling imbalanced data, which frequently arises in cervical cancer cases involving bone
metastases. Class imbalance learning (CIL), a significant difficulty posed by imbalanced data
in data mining, causes problems in machine learning and lowers classifier performance.
To address this issue, the study compared various methods designed to deal with data
imbalance in classification tasks. The imbalanced data was balanced using the Synthetic
Minority Oversampling Technique (SMOTE), and the effectiveness of several classification
models, such as Ant Miner, RIPPER, Ridor, PART, ADTree, C4.5, ELM, and Weighted
ELM, was assessed. The study employed accuracy, G-mean, and F-measure as evaluation
criteria.
Bentéjac et al. (2021) investigated recent developments in the gradient boosting algorithm
family, such as XGBoost, LightGBM, and CatBoost, in order to improve speed and accuracy
in machine learning applications.
XGBoost, a scalable ensemble technique, was emphasized as a dependable and effective
approach to machine learning problems. LightGBM, on the other hand, focuses on rapid
training performance by selectively sampling high-gradient instances, making it extremely
fast. With the goal of enhancing model accuracy by preventing prediction shifts, CatBoost
updated the gradient computation process.
The outcomes showed that, despite the relatively slight variations between the approaches,
CatBoost produced the best results in terms of both AUC and generalization accuracy
throughout the datasets under study. Despite sacrificing some accuracy, LightGBM proved to
be the fastest approach. With respect to accuracy and training speed, XGBoost did well,
coming in second.
2.3 Summary and Critiques
The reviews above indicate that, despite the utilization of several traditional algorithms for
cervical cancer prediction, the models have not reached a desirable level of performance.
This deficiency primarily stems from a critical aspect of machine learning: feature selection.
The authors did not specify the process through which features were chosen, leaving a
significant gap in the methodology.
Furthermore, relying solely on standard deep learning algorithms does not guarantee a
model's suitability for practical implementation in the healthcare sector. Customization is
essential to mitigate overfitting and enhance speed for clinical applications.
Several critiques were identified across the reviewed literature. Firstly, many studies lack
detailed information on the feature selection process, creating a gap in understanding the
factors influencing model predictions. Some papers focus on algorithm comparison without
clearly delineating the impact of these algorithms on cervical cancer prediction, leaving a gap
in actionable insights. Additionally, certain studies lack specific results or detailed insights,
making it challenging to assess the effectiveness of proposed methodologies in the context of
cervical cancer prediction. Improved reporting on feature selection, context specific
algorithmic evaluations, and real-world validation are essential for advancing the field.
CHAPTER THREE
METHODOLOGY
3.1 Introduction
This chapter outlines the techniques employed to construct the specified machine learning
models and the methods used to evaluate them.
3.2 XGBoost Model
Ensemble Learning: XGBoost is an ensemble learning method, which means it combines the
predictions of multiple weak learners (usually decision trees) to create a stronger, more
robust model.
Gradient Boosting: XGBoost is based on the concept of gradient boosting, where models are
built sequentially to correct the errors of the previous ones. Each new model focuses on the
mistakes made by the combined ensemble of existing models.
Gradient Descent: XGBoost uses gradient descent optimization to minimize the objective
function. In each iteration, it calculates the gradient of the objective function with respect to
the model's predictions and updates the model parameters in the direction that minimizes the
objective.
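The sequential error-correcting idea described above can be illustrated with a minimal gradient-boosting sketch using one-level trees (stumps) and squared loss, where the negative gradient is simply the residual. This is a didactic simplification, not the actual XGBoost algorithm, which also uses second-order gradient information and a regularized objective:

```python
import numpy as np

def fit_stump(X, r):
    """Find the single-feature threshold split minimizing squared error on r."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            if left.all() or (~left).all():
                continue                       # split must be non-trivial
            pred = np.where(left, r[left].mean(), r[~left].mean())
            err = ((r - pred) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, f, t, r[left].mean(), r[~left].mean())
    return best[1:]  # (feature, threshold, left_value, right_value)

def predict_stump(stump, X):
    f, t, lv, rv = stump
    return np.where(X[:, f] <= t, lv, rv)

def gradient_boost(X, y, n_rounds=20, eta=0.3):
    """Fit stumps sequentially to the residuals y - prediction, which for
    squared loss is exactly the negative gradient of the loss."""
    pred = np.full(len(y), y.mean())           # initial constant model
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y - pred)         # fit to current residuals
        pred += eta * predict_stump(stump, X)  # shrunken additive update
        stumps.append(stump)
    return y.mean(), stumps, pred
```

Each round corrects the mistakes of the ensemble built so far, and the learning rate eta shrinks each correction, which is the same role it plays in XGBoost.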
3.2.1 Objective Function:
The objective function in XGBoost is a sum of a loss function and a regularization term for
each tree in the ensemble. The objective is to minimize this function during training.
Objective = Σ_{i=1}^{n} L(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k)
Where:
L is the loss function measuring the difference between the true label y_i and the predicted
label ŷ_i, n is the number of training samples, and K is the number of trees in the ensemble.
The first term, Σ_{i=1}^{n} L(y_i, ŷ_i), represents the training loss. The choice of loss
function L depends on the prediction task; for binary classification, the logistic loss given
below is used.
The second term, Σ_{k=1}^{K} Ω(f_k), is the regularization term that penalizes the
complexity of the individual trees in the ensemble. The regularization term Ω typically
includes the depth of the tree and the weights of the leaf nodes.
3.2.2 Loss Function:
For binary classification, the logistic loss is used:
L(y_i, ŷ_i) = y_i · log(1 + e^(−ŷ_i)) + (1 − y_i) · log(1 + e^(ŷ_i))
3.2.3 Regularization Term:
Ω(f_k) = γT + (1/2) λ Σ_{j=1}^{T} w_j²
where T is the number of leaves in the tree, w_j is the weight of leaf j, and γ and λ are
regularization parameters that control the impact of the regularization terms.
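To make the objective concrete, its two terms can be evaluated numerically. The sketch below is an illustration (the function names are ours, not XGBoost's API): it computes the logistic training loss for raw scores and the regularization penalty for given leaf weights:

```python
import numpy as np

def logistic_loss(y, y_hat):
    """L(y_i, y_hat_i) = y_i*log(1+exp(-y_hat_i)) + (1-y_i)*log(1+exp(y_hat_i)),
    with y in {0, 1} and y_hat a raw (pre-sigmoid) score."""
    return y * np.log1p(np.exp(-y_hat)) + (1 - y) * np.log1p(np.exp(y_hat))

def regularization(leaf_weights, gamma, lam):
    """Omega(f) = gamma*T + 0.5*lambda*sum_j w_j^2, T = number of leaves."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * len(w) + 0.5 * lam * np.sum(w ** 2)

def objective(y, y_hat, trees_leaf_weights, gamma=0.1, lam=1.0):
    """Training loss summed over samples plus Omega summed over trees."""
    loss = logistic_loss(np.asarray(y, float), np.asarray(y_hat, float)).sum()
    return loss + sum(regularization(w, gamma, lam) for w in trees_leaf_weights)
```

Note how γ penalizes every additional leaf while λ penalizes large leaf weights, which is how the objective discourages overly complex trees during training.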
3.2.4 Gradient Descent Update:
In each boosting round, the ensemble prediction is updated by adding the new tree's
contribution, scaled by the learning rate η:
f_k(x) = f_{k−1}(x) + η · h_k(x)
where h_k is the tree fitted in round k.
3.3 Random Forest:
A random forest is built as follows:
1) A subset of the training set is randomly selected with replacement to form a set of
decision trees (bootstrapping).
2) Each decision tree gives a vote on the possible class of the test object.
3) The class receiving the majority of votes is taken as the final prediction.
3.3.1 Nodes in a decision tree:
A decision tree is made up of three types of nodes: the root node, internal nodes, and leaf
nodes. The root node is the starting point; internal nodes have arrows pointing both to them
and away from them; leaf nodes only have arrows pointing to them. Classification stops
when a leaf node is reached, since the node is considered pure and cannot be split further.
How to select the variable in the root node:
The variable used at the root node is selected based on the level of impurity. For classification
problems, the CART algorithm is used to quantify the level of impurity. The algorithm uses
Gini index criterion as shown below:
GI = 1 − Σ_{i=1}^{c} P_i²
where c is the number of classes and P_i is the proportion of observations belonging to
class i.
The level of impurity of each variable in the training dataset is calculated. Based on the
threshold value, the root node is split into two: values that satisfy the threshold are grouped
into the left node as true, and those that do not are grouped into the right node as false. The
Gini indices of the left node (G1) and right node (G2) are then calculated. Because the left
and right nodes generally contain different numbers of observations, the total Gini impurity
is given by the weighted average of the Gini impurities of the two nodes:
Total GI = W1 ∗ G1 + W2 ∗ G2
where W1 and W2 are the proportions of observations in the left and right nodes,
respectively.
The same procedure is repeated for all variables. The variable with the lowest Gini impurity
is used as the root node.
From the root node, the tree is split into two categories, true and false. The total Gini
impurity of the remaining variables is calculated again. The variable with the lowest Gini
impurity is taken as the next internal node.
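The Gini computations above translate directly into code. The following sketch (illustrative, not tied to a specific library) computes the Gini impurity of a node and the weighted impurity of a candidate split:

```python
def gini(labels):
    """GI = 1 - sum_i P_i^2 over the class proportions in `labels`."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gini(labels, mask):
    """Weighted Gini of a split: W1*G1 + W2*G2, where the weights are the
    proportions of observations falling in the left and right nodes."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A perfectly pure split yields a weighted Gini of 0, while a 50/50 class mix in a node yields the maximum binary impurity of 0.5; the split with the lowest weighted Gini is the one the CART procedure selects.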
3.3.2 Decision Tree Pruning
Pruning is a data compression technique in machine learning that reduces the size of
decision trees by removing sections of the tree that are non-critical or redundant for
classifying instances. It trims a full tree down to reduce the complexity of the model.
3.4 Comparing Model Performance:
3.4.1 Confusion Matrix
A confusion matrix cross-tabulates actual classes against predicted classes and yields four
counts:
i. True Positive (TP): the actual value and predicted value are both positive.
ii. False Positive (FP): the actual value is negative while the predicted value is positive.
iii. True Negative (TN): the actual value and predicted value are both negative.
iv. False Negative (FN): the actual value is positive while the predicted value is negative.
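The four counts can be tallied as follows (an illustrative sketch for binary labels coded 0/1):

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}
```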
3.4.2. Precision
Precision is defined as the number of true positives divided by the number of predicted
positives. It is a useful metric when we want to reduce the number of false positives.
Precision = TP / (TP + FP)
3.4.3. Recall
Recall is defined as the number of true positives divided by the total number of actual
positives. It is a useful metric when we want to minimize the number of false negatives.
Recall = TP / (TP + FN)
3.4.4 F1 score:
The F1 score is the harmonic mean of precision and recall, giving equal importance to false
positives and false negatives. The F1 score is maximized when precision equals recall.
F1 = (2 ∗ Precision ∗ Recall) / (Precision + Recall)
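The three metrics above reduce to one-line functions of the confusion-matrix counts (an illustrative sketch; zero denominators are not handled):

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that were correctly predicted."""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)
```

For example, with TP = 8, FP = 2, FN = 8, precision is 0.8 but recall is only 0.5, and the F1 score sits between the two, closer to the smaller value, which is exactly why it is preferred over a plain average when the two metrics diverge.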
3.4.5 ROC Curve and AUC:
The ROC curve plots the true positive rate against the false positive rate across
classification thresholds. AUC, or Area Under the ROC Curve, measures a classification
model's overall effectiveness: the closer the AUC is to 1, the better the model separates the
two classes.
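AUC can also be computed directly from scores via its rank interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. A brute-force sketch (quadratic in the sample counts, for illustration only):

```python
def auc(scores_pos, scores_neg):
    """AUC = P(score_pos > score_neg), ties counted as 1/2 — the
    Mann-Whitney formulation of the area under the ROC curve."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

An AUC of 1.0 means every positive outranks every negative, 0.5 corresponds to random scoring, and values below 0.5 indicate a model whose ranking is systematically inverted.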
References
Ferlay, J., Colombet, M., Soerjomataram, I., Parkin, D. M., Piñeros, M., Znaor, A., & Bray, F.
(2021). Cancer statistics for the year 2020: An overview. International Journal of Cancer,
149(4), 778-789.
Arbyn, M., Weiderpass, E., Bruni, L., de Sanjosé, S., Saraiya, M., Ferlay, J., & Bray, F.
(2020). Estimates of incidence and mortality of cervical cancer in 2018: a worldwide
analysis. The Lancet Global Health, 8(2), e191-e203.
Khan, M. I. R., Asgher, M., & Khan, N. A. (2014). Alleviation of salt induced photosynthesis
and growth inhibition by salicylic acid involves glycinebetaine and ethylene in mungbean
(Vigna radiata L.). Plant Physiology and Biochemistry, 80, 67-74.
Li, X., Feng, J., Meng, Y., Han, Q., Wu, F., & Li, J. (2019). A unified MRC framework for
named entity recognition. arXiv preprint arXiv:1910.11476.
Wang, S., Guan, B. Y., Lu, Y., & Lou, X. W. D. (2017). Formation of hierarchical In2S3–
CdIn2S4 heterostructured nanotubes for efficient and stable visible light CO2 reduction.
Journal of the American Chemical Society, 139(48), 17305-17308.
Chattopadhay, A., Sarkar, A., Howlader, P., & Balasubramanian, V. N. (2018, March).
Grad-CAM++: Generalized gradient based visual explanations for deep convolutional
networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV)
(pp. 839-847). IEEE.
Huang, C., Wang, Y., Li, X., Ren, L., Zhao, J., Hu, Y., ... & Cao, B. (2020). Clinical features
of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet, 395(10223),
497-506.
Bouvard, V., Wentzensen, N., Mackie, A., Berkhof, J., Brotherton, J., Giorgi Rossi, P., ... &
Lauby Secretan, B. (2021). The IARC perspective on cervical cancer screening. New
England Journal of Medicine, 385(20), 1908-1918.
Mabotuwana, T., Hombal, V., Dalal, S., Hall, C. S., & Gunn, M. (2018). Determining
adherence to follow up imaging recommendations. Journal of the American College of
Radiology, 15(3), 422-428.
Tanimu, J. J., Hamada, M., Hassan, M., Kakudi, H., & Abiodun, J. O. (2022). A machine
learning method for classification of cervical cancer. Electronics, 11(3), 463.
Zehir, A., Benayed, R., Shah, R. H., Syed, A., Middha, S., Kim, H. R., ... & Berger, M. F.
(2017). Mutational landscape of metastatic cancer revealed from prospective clinical
sequencing of 10,000 patients. Nature Medicine, 23(6), 703-713.
Ashok, B., & Aruna, P. (2016). Comparison of feature selection methods for diagnosis of
cervical cancer using SVM classifier. Int. J. Eng. Res. Appl, 6, 94-99.
Alquran, H., Mustafa, W. A., Qasmieh, I. A., Yacob, Y. M., Alsalatie, M., Al Issa, Y., &
Alqudah, A. M. (2022). Cervical cancer classification using combined machine learning and
deep learning approach. Comput. Mater. Contin, 72(3), 5117-5134.
Purnami, S. W., Khasanah, P. M., Sumartini, S. H., Chosuvivatwong, V., & Sriplung, H.
(2016, April). Cervical cancer survival prediction using hybrid of SMOTE, CART and
smooth support vector machine. In AIP Conference Proceedings (Vol. 1723, No. 1). AIP
Publishing.
Novaković, J. D., Veljović, A., Ilić, S. S., Papić, Ž., & Tomović, M. (2017). Evaluation of
classification models in machine learning. Theory and Applications of Mathematics &
Computer Science, 7(1), 39.
Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine
learning. arXiv preprint arXiv:1811.12808.
Hall, M. A. (1999). Correlation based feature selection for machine learning (Doctoral
dissertation, The University of Waikato).
Dokduang, K., Chiewchanwattana, S., Sunat, K., & Tangvoraphonkchai, V. (2014). A
comparative machine learning algorithm to predict the bone metastasis cervical cancer with
imbalance data problem. In Recent Advances in Information and Communication
Technology: Proceedings of the 10th International Conference on Computing and
Information Technology (IC2IT2014) (pp. 93-102). Springer International Publishing.
Bentéjac, C., Csörgő, A., & Martínez Muñoz, G. (2021). A comparative analysis of gradient
boosting algorithms. Artificial Intelligence Review, 54, 1937-1967.
APPENDIX 1: WORK PLAN
Duration: October to April.
Activities: proposal writing; proposal presentation; data analysis and interpretation; project
compilation; project presentation.
APPENDIX 2: BUDGET
Item                        Cost    Total cost
Printing and photocopying   420     420
Total                               Ksh. 5,320