ML Models for Cervical Cancer Detection
ALGORITHMS
2023
DECLARATION
This research is our original work and has not been presented anywhere for a degree award.
Name: JOHN OTIENO ANAM Reg. No: SCM 223-0154/2020
Signature...............................................................Date:13/11/2023
This research has been submitted for examination with my approval as a university
supervisor.
Name of Supervisor: Dr. Samuel Mutua
Signature.................................... Date: 13/11/2023
Table of Contents
INTRODUCTION
1.1 Background of the Study
1.2 Problem Statement
1.3 Objectives of the Study
1.3.1 General Objective
1.3.2 Specific Objectives
1.4 Justification of the Study
LITERATURE REVIEW
2.1 Introduction
2.2 Related Work
2.3 Summary and Critiques
METHODOLOGY
3.1 Introduction
3.2 XGBoost Model
3.2.1 Objective Function
3.2.2 Loss Function
3.2.3 Regularization Term
3.2.4 Gradient Descent Update
3.3 Random Forest
3.3.1 Nodes in a Decision Tree
3.3.2 Decision Tree Pruning
3.4 Comparing Model Performance
3.4.1 Confusion Matrix
3.4.2 Precision
3.4.3 Recall
3.4.4 F1 Score
3.4.5 ROC Curve and AUC
3.5 Research Dataset
References
APPENDIX 1: WORK PLAN
APPENDIX 2: BUDGET
LIST OF ABBREVIATIONS
DT Decision Tree
MI Mutual information
ML Machine Learning
CHAPTER ONE
INTRODUCTION
1.1 Background of the Study
Cervical cancer is the fourth most frequent cancer among women globally, making it a
serious public health concern. Updated figures released by the International Agency for
Research on Cancer (Ferlay et al., 2021) show that about 570,000 new cases and 311,000
deaths were recorded in 2018. The prevalence of cervical cancer remains high even where
screening techniques are available, especially in low-resource settings. This high incidence
and mortality can be attributed to numerous factors, such as restricted access to healthcare
facilities, a lack of knowledge about preventive measures, and obstacles to early diagnosis
and treatment (Arbyn et al., 2020).
Kenya, like many developing countries, faces knowledge disparities and resource constraints
that weaken health service delivery and, in turn, limit screening services, which act as the
principal preventive measure since cervical cancer is largely preventable. In countries with
better health facilities, workforce, and services, screening has proven effective and has led to
a reduction in cervical cancer mortality and morbidity (Khan et al., 2014).
Computational approaches have been rapidly integrated into the healthcare profession in
recent years to improve patient care and illness management. Topic modelling is one of the
many Natural Language Processing (NLP) techniques that have become effective tools for
gleaning meaningful insights from massive volumes of textual data.
The application of machine learning (ML) techniques has demonstrated potential in tackling
these issues by offering automated, precise, and economical methods for risk assessment and
cervical cancer detection. Complex patterns in cervical cancer data can be analysed by ML
algorithms, allowing for early detection and individualized treatment plans.
ML models are capable of processing a wide range of data types, such as clinical history,
patient demographics, and medical imaging, in order to detect aberrant cervical cells. For
example, using deep learning algorithms on colposcopy images can help identify
precancerous lesions with a high degree of accuracy (Li et al., 2019). Furthermore, to help
with early diagnosis, ML algorithms can examine genomic data to find genetic markers
linked to cervical cancer (Lou et al., 2017).
Machine learning based predictive modelling aids in determining the risk variables for each
patient and forecasts the chance of acquiring cervical cancer.
ML algorithms may categorize patients into distinct risk groups by examining a variety of
risk indicators, including age, sexual behavior, and HPV infection status. According to
Balasubramanian et al. (2018), this makes tailored preventative tactics and targeted therapies
possible.
Although machine learning holds considerable promise for cervical cancer research, there are
obstacles and ethical issues to take into account. Researchers need to manage several
important issues, including data privacy, algorithm bias, and the requirement for consistent
datasets. Working together, data scientists, ethicists, and healthcare professionals can
guarantee the ethical development and application of machine learning based cervical cancer
treatments.
The application of machine learning techniques to cervical cancer research has the potential
to significantly enhance risk assessment, early detection, screening precision, and
individualized treatment. Machine learning (ML) based methods have the potential to
transform cervical cancer prevention and treatment, ultimately saving lives and lessening the
disease's worldwide burden, as long as researchers keep improving algorithms, addressing
ethical issues, and fostering greater collaboration.
While useful, conventional screening techniques like the Pap smear test have certain inherent
drawbacks. These include the subjective interpretation of results, the possibility of
inconsistent diagnoses, and a significant false-negative rate, which delays identification and
response. Additionally, general knowledge of cervical cancer and its preventive measures is
lacking in many low-income nations, leading to late-stage diagnoses and worse survival
rates.
Given these difficulties, a comprehensive approach to the cervical cancer issue is urgently
required. It is crucial to create novel, precise, and easily understandable screening techniques
that can get around the drawbacks of conventional procedures.
Additionally, focused awareness efforts and educational programs are required to inform
women about cervical cancer, its risk factors, and the value of routine screenings. It is crucial
to close the healthcare gap by utilizing technology and machine learning algorithms to deliver
precise and quick diagnoses, especially in areas with a lack of healthcare resources.
This study seeks to make a significant contribution to this important field by investigating
approaches using machine learning algorithms.
High precision analysis of large datasets, such as patient histories and medical imaging, is
possible with machine learning techniques. Healthcare workers can identify aberrant cervical
cells more quickly and correctly than with traditional approaches by using machine learning
models to automate the examination of Pap smears, colposcopy images, and other diagnostic
tests. According to Li et al. (2019), early detection is essential because it allows for prompt
therapies, which in turn raise the likelihood of cure and survival.
Large scale patient data analysis can be performed by machine learning algorithms, which
can also produce insightful reports for medical professionals. By providing evidence-based
information, these insights enable clinicians to make well-informed decisions. According to
Wang et al. (2020), machine learning plays a significant role in improving patient care,
lowering healthcare expenses, and improving overall healthcare quality by assisting
healthcare workers in selecting the most effective treatments and interventions.
Women sometimes encounter obstacles while trying to access services for cervical cancer
screening and treatment, especially in rural and underserved areas. The development of
portable, low-cost screening instruments and telemedicine applications can be aided by
machine learning. By offering easily accessible and reasonably priced screening options,
these technologies can close the gap in healthcare disparities and guarantee that a greater
number of people can take advantage of early diagnosis and prompt interventions.
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
This chapter presents background on related work. It reviews recent and prominent papers,
journal articles, and other publications in the field of public health that apply machine
learning to modelling cervical cancer.
2.2 Related Work
They used Principal Component Analysis (PCA) to reduce dimensionality without sacrificing
model accuracy, and the Synthetic Minority Oversampling Technique (SMOTE) to address
the problem of imbalanced data. To avoid overfitting, a stratified 10-fold cross-validation
approach was used to assess the models.
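For readers unfamiliar with SMOTE, its core idea can be sketched in a few lines of NumPy: each synthetic sample is interpolated between a minority-class point and one of its nearest minority-class neighbours. This is an illustrative simplification with names chosen here for exposition, not the implementation used in the reviewed study:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating each
    chosen sample toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                # exclude each point itself
    neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                    # random minority sample
        j = rng.choice(neighbours[i])          # one of its neighbours
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two existing minority points, the oversampled class stays inside the region the minority class already occupies.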
Using the SMOTE voting model produced notable gains in accuracy, sensitivity, and area
under the Receiver Operating Characteristic curve (ROC_AUC) for all target variables.
Improvements in sensitivity, accuracy, and Positive Predictive Agreement (PPA) ranged from
0.93% to 5.13%, 39.26% to 46.97%, and 2% to 29%, respectively.
The application of PCA shortened the duration of computational processing, improving the
overall effectiveness of the model. Based on particular evaluation metrics, a comparative
analysis with prior research revealed the efficacy of their models and their capacity to
diagnose cervical cancer more quickly.
Tanimu et al. (2022) used the decision tree (DT) classification technique to examine risk
factors associated with cervical cancer. They investigated Least Absolute Shrinkage and
Selection Operator (LASSO) and Recursive Feature Elimination (RFE) feature selection
strategies in order to improve prediction accuracy. Due to the missing values and extreme
imbalance in the dataset, SMOTE-Tomek, a combination of undersampling and oversampling
techniques, was applied.
Their investigation proved the efficacy of the DT classifier, especially when employing
certain characteristics from SMOTE Tomek and RFE. This method produced remarkable
sensitivity (100%) and accuracy (98.72%). The DT classifier demonstrated remarkable
efficacy in resolving cervical cancer classification issues by decreasing the feature count and
mitigating the problem of class imbalance. The study highlighted the significance of feature
selection and addressed class imbalances, offering insightful information for the creation of
prediction models for the diagnosis and prognosis of cervical cancer.
Zhang and Wang (2017) used supervised machine learning techniques to analyze different
methods with the goal of identifying cervical cancer in its early stages.
A complete dataset on cervical cancer including 858 patient records, 36 risk factors, and one
outcome variable was taken from the UCI repository in order to conduct their research.
Artificial neural networks, Bayesian networks, Support Vector Machines (SVM), random
trees, logistic trees, and XGBoost trees were the six classification algorithms used. The
models trained with and without feature selection techniques were evaluated in the study.
Three distinct feature selection techniques were applied: LASSO regression, wrapper method,
and relief rank.
The results of the investigation demonstrated the effectiveness of XGBoost, which, when all
features were taken into account, attained a maximum accuracy of 94.94%.
Interestingly, certain cases showed that using feature selection techniques increased
efficiency. The study demonstrated how machine learning outperforms conventional
statistical models in managing intricate, massive datasets and detecting crucial prognostic
characteristics.
The study recognized current limitations even though machine learning has enormous
potential for improving cervical cancer treatment and practical applications. More work is
required to improve the precision, dependability, and usefulness of clinical outcome forecasts
due to obstacles such as overfitting, incomplete and simplified data, and interpretability
problems in prediction models. This study highlights the necessity for ongoing developments
to properly address the difficulties of cervical cancer and represents the achievements made
in the use of machine learning for detection.
Ashok and Aruna (2016) developed an optimized feature selection model that identified
important characteristics using various selection techniques. Furthermore, the research
endeavored to develop classifier models using multiple algorithms, such as C5.0, random
forest, rpart, KNN, and SVM.
The researchers assessed each model's performance, thoroughly investigated training
options, and provided the results of these algorithms in the paper after careful analysis.
They tackled a number of important issues, including the uneven distribution of data on
cervical cancer. They used the Synthetic Minority Oversampling Technique (SMOTE) as a
preprocessing method to address this problem. SMOTE ensured a more representative sample
for analysis by helping to balance the dataset.
The effectiveness of these categorization techniques was carefully assessed by the researchers
utilizing crucial criteria like specificity, sensitivity, and accuracy.
According to their findings, using SMOTE as a preprocessing method greatly improved the
classification models' performance. In particular, the SMOTE-SSVM method fared better
than SMOTE-TSSVM and SMOTE-CART, indicating that this strategy is useful for
enhancing the precision and dependability of predictions regarding cervical cancer survival.
In the context of cervical cancer survival, this finding emphasizes the significance of
resolving data imbalance for more precise prediction models.
Feature selection was an important part of their process, and it has a big impact on the
diagnosis's accuracy. The researchers used a variety of techniques, including Mutual
Information (MI), Sequential Forward Search (SFS), Sequential Floating Forward Search
(SFFS), and Random Subset Feature Selection (RSFS), to determine which features were
most pertinent. These methods assisted in the identification of the most illuminating
characteristics required for a precise diagnosis of cervical cancer.
Through the combination of advanced feature selection techniques, feature extraction, and
image segmentation, the researchers developed a successful cervical cancer diagnosis
approach. This novel method not only increases diagnosis accuracy but also streamlines the
difficult process of examining Pap smear slides, helping pathologists diagnose cervical
cancer with greater precision and knowledge.
Novaković et al. (2017) aimed to select the best option from a range of classification models
produced by an intricate and repetitive model-building procedure. To do this, the study
examined the evaluation techniques used to gauge these models' performance.
The study covered the drawbacks of relying only on classification accuracy as an assessment
criterion. They noted that accuracy depends on the class distribution in the dataset and cannot
take into consideration the distinctions between the errors produced by the models. In order
to provide a more complex picture of the models' performance, the researchers looked at
alternate assessment metrics.
A study by Raschka (2018) provides a thorough analysis of methods that are essential for
model evaluation, model selection, and algorithm selection in the context of scholarly
machine learning research and practical implementations. Drawing from both theoretical
frameworks and practical data, it analyzes the benefits and drawbacks of various strategies
and underlines how important it is to use them effectively.
This paper explores several comparison methodologies, including omnibus tests, multiple-
comparison adjustments, and statistical tests for algorithm comparisons. Furthermore, when
working with limited datasets, alternative approaches to algorithm selection are advised,
such as nested cross-validation and the combined 5x2 cross-validation F test.
The holdout approach and other popular techniques for selecting and evaluating models are
covered in the article. It does, however, advise against using them with tiny datasets because
of possible problems. As a substitute for assessing performance uncertainty, the authors
present many iterations of the bootstrap technique.
They also cover cross-validation methods, such as k-fold cross-validation and leave-one-out
cross-validation, highlighting the necessity of balancing the bias-variance trade-off when
determining the right 'k' value. Practical, empirically grounded recommendations are given
for the best possible 'k' selection.
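The k-fold partitioning these authors discuss can be sketched as follows. This is a generic illustration (round-robin assignment of indices to folds), not code from any reviewed paper:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k folds; yield (train, val) index lists,
    with each fold serving exactly once as the validation set."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val
```

With k = n this degenerates to leave-one-out cross-validation, which is the bias-variance trade-off the review discusses: larger k lowers the pessimistic bias of the performance estimate but raises its variance and cost.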
Hall (1999) proposed that features highly correlated with the class yet uncorrelated with one
another should make up an effective feature set. In order to put this theory into practice, the
study presents a test-theory-inspired feature evaluation formula that serves as the foundation
for the Correlation-based Feature Selection (CFS) method.
The feature assessment formula, a suitable correlation measure, and a heuristic search
approach are all combined by CFS. Experiments using artificial and natural datasets and
three machine learning algorithms (the decision tree learner C4.5, the instance-based learner
IB1, and naive Bayes) were used to thoroughly assess the algorithm.
Dokduang et al. (2014) created and evaluated a method to forecast how radiation therapy
will affect bone metastases in cases of cervical cancer. The problem they encountered was
handling imbalanced data, which frequently arises in cervical cancer cases involving bone
metastases. Class imbalance learning (CIL), a significant difficulty posed by imbalanced data
in data mining, causes problems in machine learning and lowers classifier performance.
To address this issue, the study compared various methods designed to deal with data
imbalance in classification tasks. The imbalanced data was balanced using the Synthetic
Minority Oversampling Technique (SMOTE), and the effectiveness of several classification
models, such as Ant Miner, RIPPER, Ridor, PART, ADTree, C4.5, ELM, and Weighted
ELM, was assessed. The study employed accuracy, G-mean, and F-measure as evaluation
criteria.
Bentéjac et al. (2021) investigated recent developments in the gradient boosting algorithm
family, such as XGBoost, LightGBM, and CatBoost, in order to improve speed and accuracy
in machine learning applications.
XGBoost, a scalable ensemble technique, was emphasized as a dependable and effective
approach to machine learning problems. LightGBM, on the other hand, focuses on rapid
training performance by selectively sampling high-gradient instances, making it extremely
fast. With the goal of enhancing model accuracy by preventing prediction shifts, CatBoost
updated the gradient computation process.
The outcomes showed that, despite the relatively slight variations between the approaches,
CatBoost produced the best results in terms of both AUC and generalization accuracy
throughout the datasets under study. Despite sacrificing some accuracy, LightGBM proved to
be the fastest approach. With respect to accuracy and training speed, XGBoost did well,
coming in second.
2.3 Summary and Critiques
The reviews above indicate that, despite the utilization of several traditional algorithms for
cervical cancer prediction, the models have not reached a desirable level of performance.
This deficiency primarily stems from a critical aspect of machine learning: feature selection.
The authors did not specify the process through which features were chosen, leaving a
significant gap in the methodology.
Furthermore, relying solely on standard deep learning algorithms does not guarantee a
model's suitability for practical implementation in the healthcare sector. Customization is
essential to mitigate overfitting and enhance speed for clinical applications.
Several critiques were identified across the reviewed literature. Firstly, many studies lack
detailed information on the feature selection process, creating a gap in understanding the
factors influencing model predictions. Some papers focus on algorithm comparison without
clearly delineating the impact of these algorithms on cervical cancer prediction, leaving a gap
in actionable insights. Additionally, certain studies lack specific results or detailed insights,
making it challenging to assess the effectiveness of proposed methodologies in the context of
cervical cancer prediction. Improved reporting on feature selection, context specific
algorithmic evaluations, and real-world validation are essential for advancing the field.
CHAPTER THREE
METHODOLOGY
3.1 Introduction
This chapter outlines the techniques employed to construct the specified machine learning
models and the methods used to evaluate them.
3.2 XGBoost Model
Ensemble Learning: XGBoost is an ensemble learning method, which means it combines the
predictions of multiple weak learners (usually decision trees) to create a stronger, more
robust model.
Gradient Boosting: XGBoost is based on the concept of gradient boosting, where models are
built sequentially to correct the errors of the previous ones. Each new model focuses on the
mistakes made by the combined ensemble of existing models.
Gradient Descent: XGBoost uses gradient descent optimization to minimize the objective
function. In each iteration, it calculates the gradient of the objective function with respect to
the model's predictions and updates the model parameters in the direction that minimizes the
objective.
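The sequential error-correcting idea described above can be illustrated with a minimal gradient-boosting sketch using one-level trees (stumps) and squared loss, where the negative gradient is simply the residual. This is a didactic simplification, not the actual XGBoost algorithm, which also uses second-order gradient information and a regularized objective:

```python
import numpy as np

def fit_stump(X, r):
    """Find the single-feature threshold split minimizing squared error on r."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            if left.all() or (~left).all():
                continue                       # split must be non-trivial
            pred = np.where(left, r[left].mean(), r[~left].mean())
            err = ((r - pred) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, f, t, r[left].mean(), r[~left].mean())
    return best[1:]  # (feature, threshold, left_value, right_value)

def predict_stump(stump, X):
    f, t, lv, rv = stump
    return np.where(X[:, f] <= t, lv, rv)

def gradient_boost(X, y, n_rounds=20, eta=0.3):
    """Fit stumps sequentially to the residuals y - prediction, which for
    squared loss is exactly the negative gradient of the loss."""
    pred = np.full(len(y), y.mean())           # initial constant model
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y - pred)         # fit to current residuals
        pred += eta * predict_stump(stump, X)  # shrunken additive update
        stumps.append(stump)
    return y.mean(), stumps, pred
```

Each round corrects the mistakes of the ensemble built so far, and the learning rate eta shrinks each correction, which is the same role it plays in XGBoost.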
3.2.1 Objective Function:
The objective function in XGBoost is a sum of a loss function and a regularization term for
each tree in the ensemble. The objective is to minimize this function during training.
Objective = Σ_{i=1}^{n} L(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k)
Where:
L is the loss function measuring the difference between the true label y_i and the predicted
label ŷ_i, n is the number of training samples, and K is the number of trees in the ensemble.
The first term, Σ_{i=1}^{n} L(y_i, ŷ_i), represents the training loss. The choice of loss
function L depends on the prediction task; for binary classification, the logistic loss given
below is used.
The second term, Σ_{k=1}^{K} Ω(f_k), is the regularization term that penalizes the
complexity of the individual trees in the ensemble. The regularization term Ω typically
includes the depth of the tree and the weights of the leaf nodes.
3.2.2 Loss Function:
For binary classification, the logistic loss is used:
L(y_i, ŷ_i) = y_i · log(1 + e^(−ŷ_i)) + (1 − y_i) · log(1 + e^(ŷ_i))
3.2.3 Regularization Term:
Ω(f_k) = γT + (1/2) λ Σ_{j=1}^{T} w_j²
where T is the number of leaves in the tree, w_j is the weight of leaf j, and γ and λ are
regularization parameters that control the impact of the regularization terms.
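To make the objective concrete, its two terms can be evaluated numerically. The sketch below is an illustration (the function names are ours, not XGBoost's API): it computes the logistic training loss for raw scores and the regularization penalty for given leaf weights:

```python
import numpy as np

def logistic_loss(y, y_hat):
    """L(y_i, y_hat_i) = y_i*log(1+exp(-y_hat_i)) + (1-y_i)*log(1+exp(y_hat_i)),
    with y in {0, 1} and y_hat a raw (pre-sigmoid) score."""
    return y * np.log1p(np.exp(-y_hat)) + (1 - y) * np.log1p(np.exp(y_hat))

def regularization(leaf_weights, gamma, lam):
    """Omega(f) = gamma*T + 0.5*lambda*sum_j w_j^2, T = number of leaves."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * len(w) + 0.5 * lam * np.sum(w ** 2)

def objective(y, y_hat, trees_leaf_weights, gamma=0.1, lam=1.0):
    """Training loss summed over samples plus Omega summed over trees."""
    loss = logistic_loss(np.asarray(y, float), np.asarray(y_hat, float)).sum()
    return loss + sum(regularization(w, gamma, lam) for w in trees_leaf_weights)
```

Note how γ penalizes every additional leaf while λ penalizes large leaf weights, which is how the objective discourages overly complex trees during training.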
3.2.4 Gradient Descent Update:
In each boosting round, the ensemble prediction is updated by adding the new tree's
contribution, scaled by the learning rate η:
f_k(x) = f_{k−1}(x) + η · h_k(x)
where h_k is the tree fitted in round k.
3.3 Random Forest:
A random forest is built as follows:
1) A subset of the training set is randomly selected with replacement to form a set of
decision trees (bootstrapping).
2) Each decision tree gives a vote on the possible class of the test object.
3) The class receiving the majority of votes is taken as the final prediction.
3.3.1 Nodes in a decision tree:
A decision tree is made up of three types of nodes: the root node, internal nodes, and leaf
nodes. The root node is the starting point; internal nodes have arrows pointing both to them
and away from them; leaf nodes only have arrows pointing to them. Classification stops
when a leaf node is reached, since the node is considered pure and cannot be split further.
How to select the variable in the root node:
The variable used at the root node is selected based on the level of impurity. For classification
problems, the CART algorithm is used to quantify the level of impurity. The algorithm uses
Gini index criterion as shown below:
GI = 1 − Σ_{i=1}^{c} P_i²
where c is the number of classes and P_i is the proportion of observations belonging to
class i.
The level of impurity of each variable in the training dataset is calculated. Based on the
threshold value, the root node is split into two: values that satisfy the threshold are grouped
into the left node as true, and those that do not are grouped into the right node as false. The
Gini indices of the left node (G1) and right node (G2) are then calculated. Because the left
and right nodes generally contain different numbers of observations, the total Gini impurity
is given by the weighted average of the Gini impurities of the two nodes:
Total GI = W1 ∗ G1 + W2 ∗ G2
where W1 and W2 are the proportions of observations in the left and right nodes,
respectively.
The same procedure is repeated for all variables. The variable with the lowest Gini impurity
is used as the root node.
From the root node, the tree is split into two categories, true and false. The total Gini
impurity of the remaining variables is calculated again. The variable with the lowest Gini
impurity is taken as the next internal node.
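The Gini computations above translate directly into code. The following sketch (illustrative, not tied to a specific library) computes the Gini impurity of a node and the weighted impurity of a candidate split:

```python
def gini(labels):
    """GI = 1 - sum_i P_i^2 over the class proportions in `labels`."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gini(labels, mask):
    """Weighted Gini of a split: W1*G1 + W2*G2, where the weights are the
    proportions of observations falling in the left and right nodes."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A perfectly pure split yields a weighted Gini of 0, while a 50/50 class mix in a node yields the maximum binary impurity of 0.5; the split with the lowest weighted Gini is the one the CART procedure selects.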
3.3.2 Decision Tree Pruning
Pruning is a data compression technique in machine learning that reduces the size of
decision trees by removing sections of the tree that are non-critical or redundant for
classifying instances. It trims a full tree down to reduce the complexity of the model.
3.4 Comparing Model Performance:
3.4.1 Confusion Matrix
A confusion matrix cross-tabulates actual classes against predicted classes and yields four
counts:
i. True Positive (TP): the actual value and predicted value are both positive.
ii. False Positive (FP): the actual value is negative while the predicted value is positive.
iii. True Negative (TN): the actual value and predicted value are both negative.
iv. False Negative (FN): the actual value is positive while the predicted value is negative.
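The four counts can be tallied as follows (an illustrative sketch for binary labels coded 0/1):

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}
```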
3.4.2. Precision
Precision is defined as the number of true positives divided by the number of predicted
positives. It is a useful metric when we want to reduce the number of false positives.
Precision = TP / (TP + FP)
3.4.3. Recall
Recall is defined as the number of true positives divided by the total number of actual
positives. It is a useful metric when we want to minimize the number of false negatives.
Recall = TP / (TP + FN)
3.4.4 F1 score:
The F1 score is the harmonic mean of precision and recall, giving equal importance to false
positives and false negatives. The F1 score is maximized when precision equals recall.
F1 = (2 ∗ Precision ∗ Recall) / (Precision + Recall)
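The three metrics above reduce to one-line functions of the confusion-matrix counts (an illustrative sketch; zero denominators are not handled):

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that were correctly predicted."""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)
```

For example, with TP = 8, FP = 2, FN = 8, precision is 0.8 but recall is only 0.5, and the F1 score sits between the two, closer to the smaller value, which is exactly why it is preferred over a plain average when the two metrics diverge.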
3.4.5 ROC Curve and AUC:
The ROC curve plots the true positive rate against the false positive rate across
classification thresholds. AUC, or Area Under the ROC Curve, measures a classification
model's overall effectiveness: the closer the AUC is to 1, the better the model separates the
two classes.
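AUC can also be computed directly from scores via its rank interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. A brute-force sketch (quadratic in the sample counts, for illustration only):

```python
def auc(scores_pos, scores_neg):
    """AUC = P(score_pos > score_neg), ties counted as 1/2 — the
    Mann-Whitney formulation of the area under the ROC curve."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

An AUC of 1.0 means every positive outranks every negative, 0.5 corresponds to random scoring, and values below 0.5 indicate a model whose ranking is systematically inverted.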
References
Ferlay, J., Colombet, M., Soerjomataram, I., Parkin, D. M., Piñeros, M., Znaor, A., & Bray, F.
(2021). Cancer statistics for the year 2020: An overview. International Journal of Cancer,
149(4), 778-789.
Arbyn, M., Weiderpass, E., Bruni, L., de Sanjosé, S., Saraiya, M., Ferlay, J., & Bray, F.
(2020). Estimates of incidence and mortality of cervical cancer in 2018: a worldwide
analysis. The Lancet Global Health, 8(2), e191-e203.
Khan, M. I. R., Asgher, M., & Khan, N. A. (2014). Alleviation of salt induced photosynthesis
and growth inhibition by salicylic acid involves glycinebetaine and ethylene in mungbean
(Vigna radiata L.). Plant Physiology and Biochemistry, 80, 67-74.
Li, X., Feng, J., Meng, Y., Han, Q., Wu, F., & Li, J. (2019). A unified MRC framework for
named entity recognition. arXiv preprint arXiv:1910.11476.
Wang, S., Guan, B. Y., Lu, Y., & Lou, X. W. D. (2017). Formation of hierarchical In2S3–
CdIn2S4 heterostructured nanotubes for efficient and stable visible light CO2 reduction.
Journal of the American Chemical Society, 139(48), 17305-17308.
Chattopadhay, A., Sarkar, A., Howlader, P., & Balasubramanian, V. N. (2018, March).
Grad-CAM++: Generalized gradient based visual explanations for deep convolutional
networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV)
(pp. 839-847). IEEE.
Huang, C., Wang, Y., Li, X., Ren, L., Zhao, J., Hu, Y., ... & Cao, B. (2020). Clinical features
of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet, 395(10223),
497-506.
Bouvard, V., Wentzensen, N., Mackie, A., Berkhof, J., Brotherton, J., Giorgi Rossi, P., ... &
Lauby Secretan, B. (2021). The IARC perspective on cervical cancer screening. New
England Journal of Medicine, 385(20), 1908-1918.
Mabotuwana, T., Hombal, V., Dalal, S., Hall, C. S., & Gunn, M. (2018). Determining
adherence to follow up imaging recommendations. Journal of the American College of
Radiology, 15(3), 422-428.
Tanimu, J. J., Hamada, M., Hassan, M., Kakudi, H., & Abiodun, J. O. (2022). A machine
learning method for classification of cervical cancer. Electronics, 11(3), 463.
Zehir, A., Benayed, R., Shah, R. H., Syed, A., Middha, S., Kim, H. R., ... & Berger, M. F.
(2017). Mutational landscape of metastatic cancer revealed from prospective clinical
sequencing of 10,000 patients. Nature Medicine, 23(6), 703-713.
Ashok, B., & Aruna, P. (2016). Comparison of feature selection methods for diagnosis of
cervical cancer using SVM classifier. Int. J. Eng. Res. Appl, 6, 94-99.
Alquran, H., Mustafa, W. A., Qasmieh, I. A., Yacob, Y. M., Alsalatie, M., Al Issa, Y., &
Alqudah, A. M. (2022). Cervical cancer classification using combined machine learning and
deep learning approach. Comput. Mater. Contin, 72(3), 5117-5134.
Purnami, S. W., Khasanah, P. M., Sumartini, S. H., Chosuvivatwong, V., & Sriplung, H.
(2016, April). Cervical cancer survival prediction using hybrid of SMOTE, CART and
smooth support vector machine. In AIP Conference Proceedings (Vol. 1723, No. 1). AIP
Publishing.
Novaković, J. D., Veljović, A., Ilić, S. S., Papić, Ž., & Tomović, M. (2017). Evaluation of
classification models in machine learning. Theory and Applications of Mathematics &
Computer Science, 7(1), 39.
Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine
learning. arXiv preprint arXiv:1811.12808.
Hall, M. A. (1999). Correlation based feature selection for machine learning (Doctoral
dissertation, The University of Waikato).
Dokduang, K., Chiewchanwattana, S., Sunat, K., & Tangvoraphonkchai, V. (2014). A
comparative machine learning algorithm to predict the bone metastasis cervical cancer with
imbalance data problem. In Recent Advances in Information and Communication
Technology: Proceedings of the 10th International Conference on Computing and
Information Technology (IC2IT2014) (pp. 93-102). Springer International Publishing.
Bentéjac, C., Csörgő, A., & Martínez Muñoz, G. (2021). A comparative analysis of gradient
boosting algorithms. Artificial Intelligence Review, 54, 1937-1967.
APPENDIX 1: WORK PLAN
Duration: October to April.
Activities: proposal writing; proposal presentation; data analysis and interpretation; project
compilation; project presentation.
APPENDIX 2: BUDGET
Item                        Cost    Total cost
Printing and photocopying   420     420
Total                               Ksh. 5,320