Loan Approval Prediction
Algorithms
by
Reak Roy
22301776
Tahsin Alam
19301171
Syed Hafiz Kabir
23241063
Mirza Abyaz Awsaf
20101146
Shadik Ul Haque
23141087
1. The Project submitted is my/our own original work while completing degree
at Brac University.
3. The thesis does not contain material which has been accepted, or submitted,
for any other degree or diploma at a university or other institution.
22301776 23241063
19301171 20101146
Shadik Ul Haque
23141087
Approval
The thesis/project titled “Loan Approval Prediction Using Machine Learning” sub-
mitted by
1. Reak Roy (22301776)
2. Syed Hafiz Kabir (23241063)
3. Tahsin Alam (19301171)
4. Mirza Abyaz Awsaf (20101146)
5. Shadik ul Haque (23141087)
Of October, 2024 has been accepted as satisfactory in partial fulfillment of the
requirement for the degree of B.Sc. in Computer Science on Day.
Examining Committee:
Supervisor:
(Member)
Amitabha Chakraborty
Assistant Professor
CSE
BRAC University
Program Coordinator:
(Member)
Designation
Department
Brac University
Head of Department:
(Chair)
Designation
Department of Computer Science and Engineering
Brac University
Abstract
This research paper examines the potential of several classical machine learning
classifiers and deep neural network architectures for predicting the status of a
loan application. A dataset of 9824 observations and 13 features, containing
information about the applicants and their credit profiles, was used together
with other techniques, such as bootstrapping, to improve data quality. Several
imputation strategies were applied to deal with missing values, and features
were carefully selected using ANOVA, Mutual Information, and tree-based
approaches, among other statistical methods. To validate model performance,
the dataset was split into two parts: training (70%) and testing (30%).
Many classical machine learning algorithms were applied, including but not limited
to Logistic Regression, Support Vector Machines, Decision Trees, Random Forests,
Multi-Layer Perceptron, Gradient Boosting machines, and K-Nearest Neighbors.
Of the classical models used in the research, the Random Forest classifier achieved
the highest accuracy (86.09%) and F1-score (0.9043), making it the best
performing one. Resampling methods such as SMOTE and ADASYN were also used to
handle class imbalance; after resampling, K-Nearest Neighbors reached an
accuracy of 87%. In a separate but related analysis, five neural network
architectures, a Simple Recurrent Neural Network (RNN), LSTM, CNN, FCNN, and FCN,
were built using TensorFlow, Scikit-learn, and NumPy in Google Colaboratory notebooks.
The results showed that the Fully Convolutional Network (FCN) had the best
validation accuracy (89.75%) and validation loss (0.2255) among the models built.
Acknowledgement
Firstly, all praise to the Great Allah, by whose grace our project has been completed
without any major interruption.
Secondly, to our advisor Amitabha Chakrabarty sir for his kind support and advice
in our work. He helped us whenever we needed help.
Though our paper was not accepted there, all the reviews they gave helped us a lot in
our later work.
And finally to our parents, without whose constant support it might not have been possible.
With their kind support and prayers we are now on the verge of our graduation.
Table of Contents
Declaration i
Approval ii
Abstract iii
Acknowledgment iii
Dedication iv
Table of Contents v
Nomenclature viii
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Project Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Aims and Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Summary of the Contribution . . . . . . . . . . . . . . . . . . . . . . 5
1.7 Project Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Related Works 6
2.1 Machine Learning / Classifier . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Machine Learning / Classifier Summary . . . . . . . . . . . . . . . . 16
3 Proposed Model 19
3.1 Workplan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 The Nine Classification Model . . . . . . . . . . . . . . . . . . . . . . 20
3.2.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.2 ADAboost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.3 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.4 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.5 K-Neighbours . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.6 Multilayer Perceptron . . . . . . . . . . . . . . . . . . . . . . 22
3.2.7 Gaussian Naive Bayes . . . . . . . . . . . . . . . . . . . . . . 22
3.2.8 Gradient Boosting Classifier . . . . . . . . . . . . . . . . . . . 23
3.2.9 Support Vector Classifiers . . . . . . . . . . . . . . . . . . . . 25
4 Dataset 26
4.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Data Preprocessing: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.1 Handling Missing Values . . . . . . . . . . . . . . . . . . . . . 27
4.2.2 SMOTE and ADASYN . . . . . . . . . . . . . . . . . . . . . . 28
4.2.3 Feature Selection: . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 Data Visualization: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Bibliography 58
Bibliography 59
List of Figures
3.1 Workplan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
List of Tables
Chapter 1
Introduction
Loan approval matters both to bank employees and to applicants. The main aim of
this paper is to provide an easy way to select capable candidates.
All loans are handled by finance companies, which operate across local, semi-local,
and rural areas. After a corporation or bank confirms a client's eligibility, the
client submits an application. An institution, typically a bank, wishes to speed up
the loan eligibility process (in real time) on the basis of the information clients
supply on the application form. These details can include age, financial status,
banking transactions, source of income, withdrawal amount, and credit history. The
dataset we collected contains the set of parameters on which loans were approved,
and the model is built on these parameters in order to obtain accurate findings.
The main target of this project is to forecast the safety of loans. AdaBoost,
Gaussian Naïve Bayes, MLP, KNeighbors, Decision Tree, Logistic Regression, Random
Forest, Gradient Boosting, and SVC algorithms are used to predict loan safety. The
data is first cleaned to remove any missing values from the dataset. Loan approval
is a very important process for banking institutions: the system approves or
rejects loan applications. Loan recovery is one of the most significant
contributing factors to a bank's financial results, yet it is difficult to predict
whether a client will repay a loan.
1.1 Motivation
Recent studies show that the loan approval process is at the core of most credit
systems and is critical to both the credit-granting organization and the loan
applicant. In practice, loan applications are usually evaluated by several experts
who assess each application against a set of metrics and target profiles, combining
stereotypes with an evaluation of financial and demographic data. Such procedures
tend to be slow and dependent on personal judgment, and there is always a risk that
the so-called “human factor” affects the outcome. The growing flow of loan requests
calls for more precise, efficient, and, most importantly, automated decision-making
systems.
Machine learning (ML) and deep learning (DL) have changed the outlook of
many industries, including the finance sector. These techniques have made
it possible to improve internal processes such as loan approval and hence to make
decisions more accurately and in less time. Here, ML and DL models learn from
historical data, which in the present case is complex and involves many
interdependencies. Applied to loan approval, they make it easier to reduce the
risk of default, shorten the processing time, and reduce subjectivity in
decision making.
Furthermore, big data analytics, with dozens of variables on applicants' profiles,
banking history, demographics, and credit scores, makes it possible to build
sophisticated decision-making models. This study is driven by the pressing question
of how modern machine learning and deep learning techniques can be used to predict
outcomes such as loan approval, and it aims to enhance the precision, scalability,
and fairness of such predictions for financial institutions and customers.
Feature Engineering and Selection: To make the models perform better,
advanced statistical approaches such as ANOVA, Mutual Information (MI), and
tree-based feature selection are incorporated in the study. With these approaches,
only important variables such as credit rating, income, loan amount, and the
applicant's work history are kept in the final version of the model.
Assessment of Performance Using Various Models: The models are assessed
using a number of metrics such as accuracy, F1-score, precision, recall, and
validation loss. This multi-metric evaluation enables a more detailed understanding
of each model's performance, especially when handling imbalanced datasets. The
Random Forest classifier records the best performance among classical ML
algorithms, while FCN records the best results among deep learning models.
very rapid. Moreover, we will be able to predict loan repayment probability, eliminate
biases and discrimination, personalize loan offers, understand customer behavior,
and, lastly, enable data-driven decision making. Thus, recommendations from
customers may increase.
1.5 Methodology
In this project, the research methodology is designed to produce a well-structured
model, both in terms of its development and validation. The procedure works as
follows:
Data Collection and Preparation: For the purpose of this study, a dataset consisting
of 9,824 records and 13 variables describing individuals, including their background,
history, amount required, income, and creditworthiness, is used. Preprocessing steps
include the imputation of missing values using several strategies, normalization
of continuous variables, encoding of categorical variables, and the (simple)
K-means bootstrapping (KMT b) method. Data quality is also improved using other
methods such as bootstrapping.
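The preprocessing steps named above (imputation, normalization, and categorical encoding) can be illustrated with a minimal, library-free sketch. The column names and values are hypothetical and are not taken from the project's dataset, and the mean/mode strategies and min-max scaling are assumptions chosen for illustration; the project may well have used other strategies.

```python
from statistics import mean, mode


def impute(values, strategy="mean"):
    """Replace None entries with the mean (numeric) or mode (categorical) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale numeric values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode a categorical column as one list of 0/1 indicators per row."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical toy columns (assumption: None marks a missing value).
income = [4000, None, 6000, 5000]
area = ["Urban", "Rural", None, "Urban"]

income_filled = impute(income, "mean")
income_scaled = min_max_scale(income_filled)
area_encoded = one_hot(impute(area, "mode"))
```

In this toy example the missing income is filled with the mean of the observed incomes and the missing area with the most frequent category, before scaling and encoding are applied.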
Feature Selection and Engineering: Feature selection is important in this study since
the models can only learn from significant variables. ANOVA, Mutual Information,
and tree-based selection are used here to reduce dimensionality and improve
model performance. While tree-based methods offer a means to assess the
importance of features, their main aim here is to help alleviate the dimensionality
issues presented by the dataset.
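To make the ANOVA criterion concrete, the sketch below computes the one-way ANOVA F-statistic of a single numeric feature grouped by class label: the ratio of between-class variance to within-class variance. A larger value suggests the feature separates the classes better. The feature values are hypothetical, and this is only an illustration of the statistic, not the project's actual selection code.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F-statistic for a feature whose values are grouped by class label."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the individual values around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical feature values split by loan outcome (approved, rejected).
f_credit = anova_f([[700, 720, 740], [600, 620, 640]])  # class means differ
f_age = anova_f([[30, 40, 50], [35, 40, 45]])           # class means coincide
```

Under these toy values, the first feature would be ranked far above the second, whose class means are identical.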
Model Development and Training: Classical machine learning models, comprising
Logistic Regression, Decision Trees, Support Vector Machines, K-Nearest Neighbors,
Random Forests, and Gradient Boosting, are implemented. In addition, deep learning
models, namely Simple RNN, LSTM, CNN, Fully Connected Neural Network, and Fully
Convolutional Network, are constructed using TensorFlow, Scikit-learn, and other
frameworks.
Handling Imbalance in Classes: The dataset exhibits class imbalance, since there
are considerably more accepted than rejected loan applications. The minority class,
which in this case consists of rejected loans, is oversampled using the SMOTE and
ADASYN techniques to create more examples of this class. This helps ensure that the
dataset is balanced and that the models are not biased toward the majority class.
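The core idea behind SMOTE-style oversampling is to create synthetic minority samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The following is a simplified, library-free sketch of that idea under stated assumptions (Euclidean distance, a hypothetical helper name, toy two-feature points); it is not the implementation used in the project, which would more likely rely on the `SMOTE` and `ADASYN` classes of the imbalanced-learn package.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Rank the remaining minority points by squared Euclidean distance.
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p))
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class (rejected loan) samples with two scaled features.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
synthetic = smote_like_oversample(minority, n_new=4)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class.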
Evaluation of Models: Several metrics are used to evaluate the performance of each
model. The dataset is split into training and testing sets in proportions of
seventy and thirty percent, respectively, and cross-validation is performed to check
that the findings are not only correct but also generalize to different populations.
Important metrics include accuracy, F1 score, precision, recall, and validation
loss, with the aim of forming an overall picture of each model's predictive ability.
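The 70/30 split and the classification metrics named above can be sketched without any library as follows. The function names, the random seed, and the toy labels are assumptions for illustration; in practice scikit-learn's `train_test_split` and its metric functions would typically be used instead.

```python
import random


def split_70_30(rows, seed=42):
    """Shuffle the rows and split them into 70% training and 30% testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Row indices standing in for the 9,824 records described above.
train_rows, test_rows = split_70_30(range(9824))

# Hypothetical labels: 1 = approved, 0 = rejected.
metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 1, 0, 1],
    y_pred=[1, 1, 0, 0, 1, 1, 0, 1],
)
```

Reporting precision, recall, and F1 alongside accuracy matters here because, on an imbalanced dataset, accuracy alone can look good even when one class is predicted poorly.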
Tools and Frameworks: TensorFlow, Scikit-learn, Pandas, and NumPy are among the
Python libraries that the models make use of. The research is carried out in
Google Colaboratory notebooks, a cloud-based environment suitable for efficient
computation and model training.
1.6 Summary of the Contribution
The contributions made in this project are summarized as follows:
Comparison of Classical ML Algorithms: An in-depth comparison of the performance
of several classical machine learning algorithms is presented, in which Random
Forest was the best algorithm for predicting loan approval, with an accuracy of
86.09% and an F1 score of 0.9043.
Performance of Deep Learning Models: Of the deep learning architectures examined,
FCN had the highest validation accuracy (89.75%) and the lowest validation loss
(0.2255) for loan approval prediction.
Class Imbalance and Resampling Techniques: With the help of SMOTE and ADASYN,
class imbalance is effectively addressed, improving the performance of classifiers
such as K-Nearest Neighbors.
Feature Engineering and Selection: Advanced feature selection methods were applied
so that only relevant features enter the model, thereby enhancing its performance
and effectiveness.
Chapter 2: Literature Review - This chapter reviews the existing literature on
loan approval prediction, machine learning techniques, and the application of deep
learning in finance, and identifies where there is need for improvement.
Chapter 3: Data Preprocessing and Feature Engineering - This chapter outlines the
dataset used and the data preprocessing and feature engineering steps carried
out in the research work.
Chapter 5: Results and Evaluation - This chapter contains the findings of the
model evaluations, in which various models are assessed on several performance
metrics.
Chapter 6: Conclusion and Future Work - This chapter offers an overview of the
results and their interpretation, addresses the limitations identified in this
research, and offers ideas on how the field of loan approval prediction could be
advanced in the future.
Chapter 2
Related Works
In the paper “Exploring the Machine Learning Algorithm for Prediction of the Loan
Sanctioning Process” (2019), the authors E. Chandra Blessie and R. Rekha
implemented various machine learning models to predict loan defaults. The prediction
is achieved by feeding the historical records of clients who acquired loans from the
bank into a trained machine learning model, which can yield an accurate result.
The focus of the paper is to determine whether it is safe to allocate a loan to a
given candidate. The paper has three sections: (i) Collection of Data, (ii) Data
Cleaning, and (iii) Performance Evaluation.
This paper presents a new machine learning model aimed at minimizing loan defaults
and maintaining privacy by protecting sensitive information from borrowers'
financial reports. The model uses machine learning and data mining to predict loan
eligibility for users, automating the process by identifying eligible segments from
online loan application forms. Decision tree algorithms, widely used in banking for
classification and regression tasks, are used for loan prediction and severity
forecasting. The R package is used for data mining visualization, but real-time
consumer data collections may contain imputed or missing data. The decision tree
(DT) is a supervised learning algorithm for classification and regression problems
that uses a tree representation for prediction. The analytical process includes
data cleansing, missing value imputation, exploratory analysis, and model
construction. The best accuracy on the public test set is 0.811; applicants with
a poor credit history are less likely to be approved, and high-income applicants
are more likely to repay their loans [14].
The modern banking system relies heavily on its credit system for income, and risk
evaluation is recommended to minimize losses and decrease non-performing assets.
Customer information is crucial in estimating loan acceptance, and artificial
intelligence techniques are used to provide reliable results. This study collects
financial information from a finance organization, performs pre-processing, and
fits two classifiers, logistic regression and a decision tree, to determine whether
an applicant qualifies for a loan. The best model for predicting loan acceptance is
determined by comparing these classifiers. The Dream Housing Finance Company
provided the data, which was cleaned and stripped of white space. The banking
industry frequently uses the decision tree, a non-parametric supervised machine
learning method, to solve classification and regression problems. Logistic
regression uses a binary dependent variable to lessen system complexity, employing
a larger target variable and a bigger number of samples to find the category. Of
the collected data, 70 percent was used for training and 30 percent for testing.
The LR algorithm had an accuracy of 83.7 percent, while the decision tree algorithm
achieved 85.4 percent. The proposed model predicts loan acceptance using machine
learning methods, with the decision tree achieving the most accurate results [15].
Microcredit is a small-loan program for impoverished borrowers without collateral
or credit history, who are often rejected by traditional financial institutions.
China's online lending market has expanded, providing basic financial services to a
large user base. A framework, features, and a reinforcement learning-based search
strategy are proposed based on user behavior data from 360 Financial's online
system. The data records interactions between users and the platform, with the
event id being a unique index. Feature Tools is used to generate discriminative
features through a novel search strategy, aiming to answer questions about feature
derivation, guidance signal introduction, and feature value calculation. The feature
engineering problem is transformed into a reinforcement learning problem using a
Markov chain transformation link, with the aim of finding high-information features
through a policy gradient method. The proposed method was evaluated on an actual
default problem and compared with professional judgment and conventional genetic
programming. The method was trained on 100,000 users from the 360 Financial online
lending system, showing a nearly fourfold improvement in both velocity and velocity+
features compared to random policies. The paper proposes a performance-driven
framework for automated feature generation from raw data using reinforcement
learning, unifying feature structure, interpretation, and calculation logic, and
reformulating the feature generation problem as a reinforcement learning problem.
Experiments show that the proposed method reduces human effort and avoids the local
optima of traditional genetic programming [16].
Banks distribute loans, and their main asset is the earnings from such loans. By
predicting applicant safety and automating feature validation, machine learning may
aid in the development of a Loan Prediction System. Both bank workers and loan
applicants gain from this quick and easy process, which gives loan applicants a time
window for the approval of their loans. The proposed methodology is the collection
and deployment of the dataset, followed by training the model on the training
dataset, testing it on the testing dataset, and then analysing the results. Kumar
Arun, Garg Ishan, and Kaur Sanmeet (2016) proposed six machine learning models in
their paper [1]: Decision Tree, Random Forest, Support Vector Machine, Linear Model,
Neural Networks, and AdaBoost. The first model used in the paper is an extension of
the C4.5 algorithm, which requires all attributes to be discretized. The random
forest is an ensemble learning method that works by building a large number of
decision trees. The linear model is noted for its use with both discrete qualitative
and multiple quantitative factors, even though it is mathematically identical to
other models.
In the paper [2], the authors proposed a model that is also based on the decision
tree. Decision trees are widely used in the banking sector because of their highly
accurate results, and their ability to form a statistical model makes them more
desirable. The decision tree also handles classification and regression tasks
effectively [1]. The methodology used in the paper [2] is the collection and
preprocessing of the data, followed by building the classification model using a
decision tree and, finally, prediction from the results [2]. The collected data may
contain inconsistencies, so preprocessing is needed to make the algorithm more
efficient. The decision tree algorithm is used to predict loan defaulters and
non-defaulters; its tendency to provide better results, together with its intuitive
implementation, interpretable predictions, unbiased error estimates, easy tuning,
and high accuracy, makes it well suited for implementation in the project.
The analytical process involved data cleaning and processing and missing value
imputation with the mice package, followed by building a model that achieved an
accuracy of 0.811 on the public test set. The results indicated that people with
high income and low loan requests are more likely to be approved, as they can repay
their loans easily. However, basic characteristics such as gender and marital status
are not taken into consideration by the company.
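As an illustration of how a decision tree chooses a split, the sketch below scores every candidate threshold of one numeric feature by the weighted Gini impurity of the two resulting groups and keeps the best one. The Gini criterion (as used in CART-style trees) is an assumption made here for illustration; the cited papers do not state which splitting criterion they used, and the income values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical applicant incomes (in thousands) and outcomes (1 = repaid, 0 = defaulted).
incomes = [20, 25, 30, 60, 70, 80]
outcomes = [0, 0, 0, 1, 1, 1]

root_impurity = gini(outcomes)
split = best_threshold(incomes, outcomes)
```

A full tree repeats this search recursively on each side of the chosen split, which is also what makes its predictions easy to read off as a chain of threshold rules.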
In the paper [3], the authors used a dataset provided by Xiamen International Bank
and applied various machine learning models, including XGBoost, Random Forest,
AdaBoost, k-nearest neighbours (kNN), and the multilayer perceptron (MLP), to
predict whether a loan will be approved. The dataset included 132029 records, whose
fields can be divided into three groups: (1) basic user attributes, (2) loan-related
information, and (3) information related to user credit reporting. Feature
extraction was done by removing useless data and classifying the data. The models
were then applied to the data. The base model of XGBoost, Random Forest, and
AdaBoost is a decision tree. In boosting, a base learner is trained first, and the
distribution of the training samples is adjusted according to its performance.
Based on the adjusted sample distribution, another base learner is trained, and the
process is repeated until the value T is reached. kNN makes predictions based on the
information of the k nearest neighbours. The MLP is an artificial neural network in
which each neuron has a series of learnable parameters and uses a nonlinear
activation function; the nonlinearity makes the MLP more effective. The results
show that RF gives 0.5010 accuracy, AdaBoost 1.0000, XGBoost 0.7166, kNN 0.5036, and
MLP 0.5000. AdaBoost thus gives 100 percent accuracy, which is higher than all the
other models used.
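The sample redistribution step in boosting can be made concrete with one round of the standard discrete AdaBoost update, shown below as a sketch. It assumes labels in {-1, +1} and weights that sum to 1; the toy predictions are hypothetical, and the paper [3] does not specify which exact variant it used.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return the learner's vote (alpha) and the new,
    normalized sample weights. Labels must be -1 or +1; weights must sum to 1."""
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0 < error < 1:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples (t * p == -1) are scaled up, correct ones down.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five equally weighted samples; the base learner gets only the last one wrong.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
```

After the update, the single misclassified sample carries half of the total weight, so the next base learner is pushed to classify it correctly.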
Banks primarily generate revenue from credit lines, which depend on loan repayment
and client default rates. Predicting loan defaulters can help lower Non-Performing
Assets. Research shows various methods for loan default control, with Logistic
regression models being a crucial predictive analytics tool. Kaggle data is used for
analysis and prediction.
Small loans are crucial for aspiring entrepreneurs, but they also carry the risk of
default. This is a common issue in the financial industry, and banks often offset
the loss with other fully paid loans. Peer-to-peer lending companies like Lending
Club provide a platform for borrowers to create small unsecured personal loans, with
investors choosing which loans to invest in. This shifts the burden of loss from a
single bank to several individual investors, requiring diversification to avoid winners
and losers. Machine learning, a subfield of artificial intelligence, automates data
processing and creates analytical models with minimal human intervention.
In this paper, decision tree, naive Bayes classification, ordinary least squares
regression, logistic regression, support vector machine (SVM), and clustering
algorithms were used as machine learning algorithms.
On the public test set, data cleaning, missing value imputation, exploratory
analysis, and model development yielded a best accuracy of 0.811 [8].
For over a century, banks have relied on accurate default risk prediction. With the
availability of massive datasets and open-source data, interest in risk prediction
has grown. Automating loan approval procedures can expand financing options for
small firms and individuals, promoting equitable access to loans. P2P lending, with
sites such as Lending Club having lent over 45 billion dollars, has gained
popularity in less developed economies.
A neural network was applied, but only to default prediction. L2 regularization
was the most commonly selected regularization strategy in the grid search for LR and
SVMs to prevent overfitting. The study used recall and AUC metrics for result
validation, considering credit risk and rating in relation to other loans. Logistic
regression was applied to the combined data, and a hyperparameter grid search was
used to maximize the average unweighted recall. Macro recall was prioritized over
AUC to avoid overfitting to the rejected class, as AUC weights accuracy over
forecasts.
The result is automated P2P loan acceptance and default prediction with high
rejection and default recalls. The methodology could reduce defaults to 10 percent,
improving market efficiency, using Logistic Regression for approval and Deep
Neural Networks for default prediction [9].
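The average unweighted (macro) recall mentioned above gives every class equal weight regardless of its size. The sketch below, with hypothetical labels, shows why this matters on imbalanced loan data: a model that accepts every application scores well on accuracy but poorly on macro recall.

```python
def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = []
    for label in sorted(set(y_true)):
        # Predictions made for the samples that truly belong to this class.
        predicted_for_class = [p for t, p in zip(y_true, y_pred) if t == label]
        hits = sum(1 for p in predicted_for_class if p == label)
        recalls.append(hits / len(predicted_for_class))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced sample: 8 accepted (1) and 2 rejected (0) applications.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10  # a trivial model that accepts everything

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
score = macro_recall(y_true, y_pred)
```

Here the recall is 1.0 for the accepted class and 0.0 for the rejected class, so the macro recall exposes the failure that the accuracy figure hides.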
The banking industry's credit lending sector is facing rapid expansion and
competition from new start-ups, leading to credit losses. To address this, research
is needed to design effective models that exploit existing data and provide strong
predictive prototypes. This will help banks maximize profits by understanding
applicant tendencies, money usage, and default predictions.
The dataset contains 850 bank default payment records, which were preprocessed
using techniques such as cleaning, integration, formatting, and normalization.
The predictive model's accuracy was assessed using methods such as MLR, DT, SVM,
Random Forest, and other algorithms. The study used a dichotomous default payment
as the dependent variable and compared categorization findings to the destination's
score. The research was implemented in Python on a local machine using the Jupyter
kernel. Eight major explanatory factors were identified, including age, educational
background, employment status, address, income, debt, credit-to-debt ratio, and
other factors.
After data pre-processing, 15 Python-based classification methods, including LR,
SVM, and Naive Bayes, were applied. Several metrics, such as the confusion matrix
(CM), precision, recall, and F1, were evaluated, which increases the likelihood of
identifying and fixing algorithm errors and leads to improved results.
This paper uses classification algorithms to predict bank loan defaults, focusing on
job experience and debt income. Python performance indicators are used to identify
problematic consumers, improving credit approval. This was the outcome [10].
This research uses data from former bank clients to predict loan safety using a
machine learning model trained on 1500 examples, 10 numerical characteristics, and 8
categorical features. Factors such as CIBIL Score, Business Value, and client assets
are considered when deciding whether to grant a loan. Logistic regression is a
popular machine learning approach for classification problems that focuses on
predictive analysis. It describes data and explains the relationships between
independent nominal, ordinal, and ratio-level variables and a binary dependent
variable. The sigmoid function is used in the model's development, with binary
outcomes as the target. Bank clients' data is split into training and test sets,
and any missing values are filled in using the mean, median, or mode. For
exploratory data analysis, feature engineering approaches are applied with a focus
on loan-eligible consumers. Data preparation, processing, imputation, exploratory
analysis, model construction, assessment, and testing are all steps in the
prediction process. The best accuracy is 0.811. Loan applications with modest loan
amounts are more likely to be granted than those with excellent credit scores and
lower credit limits. Gender and marital status are not taken into account [4].
Distribution of loans is a basic business function of banks, and credit risk
assessment is essential for banks globally. The main goal is to place assets in
trustworthy hands. Although many banks grant loans following a rigorous process of
verification and validation, there is no assurance that the applicant chosen is the
appropriate one. A loan prediction system can swiftly and simply identify worthy
candidates, giving banks a particular edge. The system permits priority application
checks, time constraints for applicants to verify loan sanction status, and
computerized calculation of the weight of characteristics in loan processing. The
conclusions of this document, which is intended only for management authorities at
banks and financial institutions, may be forwarded to relevant departments for
necessary action. Ashlesha Vaidya's article forecasts loan approvals using logistic
regression as a machine learning method. Power terms and nonlinear effects may be
accommodated by the model, but parameter estimation needs independent variables and
a sizable sample. The application of artificial neural networks as an early warning
system for identifying loan risk is covered by Yang, Baoan, et al. In a prediction
model for dynamic lending, genetic algorithms are employed to optimize profit and
reduce loan approval mistakes. Modeling accuracies for Logistic Regression, Decision
Trees, and Random Forest are 80.945 percent, 93.648 percent, and 83.388 percent,
while cross-validation results are 80.945 percent; 72 This means
that although the decision tree based model achieves the best accuracy on the
dataset, random forest is more interpretable and generalizes better, despite having
only a slightly higher cross-validation score than logistic regression [5].
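The role of the sigmoid function in logistic regression, mentioned in the discussion of [4] above, can be shown with a short sketch: a weighted sum of the applicant's features is squashed into a probability between 0 and 1 and then thresholded into an approve/reject decision. The weights, bias, feature values, and 0.5 threshold are hypothetical and chosen only for illustration; they are not fitted to any dataset.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_approval(features, weights, bias, threshold=0.5):
    """Return (probability, decision) for one applicant; decision 1 = approve."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical scaled features: [credit history flag, debt ratio].
probability, decision = predict_approval(
    features=[1.0, 0.0], weights=[2.0, -1.0], bias=-1.0
)
```

Training a logistic regression model amounts to choosing the weights and bias so that these probabilities agree as closely as possible with the observed outcomes.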
A key component of a bank’s operations is loan distribution, with the main objective
being to place assets in trustworthy hands. There is no assurance that the chosen
applicant is the deserving one, despite the fact that many banks grant loans using a
similar procedure. A Loan Prediction System employs machine learning to automate
feature validation and forecast applicant safety. This approach offers rapid, immedi-
ate, and simple ways to choose suitable candidates, which is advantageous for bank
workers and applicants. It establishes time limitations for applicants, determines the
weight of characteristics in loan processing, and provides priority review of partic-
ular applications. Only management authorities of banks and financial institutions
may use this system. Customer segmentation, high-risk loan applications, anticipating
default payments, promotions, collateral monitoring, asset grading, regular sales
management, stock holdings management, cash management, and cross-selling are
just a few areas of the financial industry where data mining is an essential tool. It is essential
for managing client profiles and transaction data in banking, enabling users to make
informed choices. A person’s likelihood of repaying financial obligations depends
on their credit score, which categorizes applicants into those with excellent credit
and those with low credit. While credit rating distinguishes between present and
future customers, credit evaluation links a customer’s characteristics to previous
borrowers. It’s crucial for both banks and clients to monitor default vulnerability.
Weka is one of the most widely used tools in data mining, making pre-processing,
classification, clustering, association, regression, and feature selection stages eas-
ier. An algorithm for classification is also used to forecast results based on data.
The loan credibility prediction framework screens loan applications from an
existing bank data set, split into a 66 percent training set and a 34 percent
test set, using decision tree and logistic regression induction data mining.
The application helps banks anticipate credit status and make informed decisions,
reducing bad loans and cut-offs. It uses machine learning algorithms and packages to
analyze data. This technology aids in identifying necessary
data from vast information, reducing bad credit issues. It also aids in attracting
new clients, maintaining credit, avoiding fraud, identifying misrepresentations,
offering customer-based products, and enhancing customer relationships [6]. With
the development and application of diverse concepts, the technical world is moving
closer to automation. An important characteristic that attempts to imitate intellect
similar to that of humans in computers is artificial intelligence. In this modern age,
technologists want to collaborate with people to bring forth new discoveries. Some
of these include machine learning, neural networks, fuzzy logic, NLP (natural
language processing), and expert systems. Advanced industries are leveraging machine
learning to boost sales growth, because the ability to build an analytic model with
much less coding is what makes it a powerful technology to begin with. Machine
learning offers a way to replace some of the exploitation of humans with something
else for as long as “big” data keeps growing in a big data economy. Data sets are
used by money lending companies to decide which applicants will be granted loans.
These files provide pertinent data on things like gender, education, income, and
property type. Using logistic regression and variables including education, credit
history, self-employment, and property area, the model forecasts the likelihood of a
loan being approved. The model must adhere to the requirements of money lend-
ing companies and be accurate and quick. The logistic regression parameters are
used to compute the likelihood that the loan will be authorized, and if the likeli-
hood is larger than 0.5, the loan will be approved. The loan will not be granted if
the likelihood is less than 0.5. The logistic regression model, a statistical machine
learning model used for predictive analysis, is covered in this work. It emphasizes
precision, handling of non-linear effects, and power terms. However, logistic regression
has drawbacks such as the need for a large sample size, reliance on independent variables,
and the inability to deliver continuous outputs like forecasts of temperature rise [7].
more, Machine Learning Classifiers can handle the high dimensional data to find
out connections from various features. These classifiers can be trained and tested
thoroughly for more optimized results, giving a flexible framework to accurately
predict loan approval.
The use of machine learning for loan approval prediction and credit card fraud
detection is explained in this paper [17]. The banking sector aims to secure assets
through verification processes, but this process can be time-consuming and
ineffective. A system has been developed to predict loan applicant suitability using
machine learning algorithms, achieving 92% accuracy using the Random Forest
algorithm. The paper also used an online banking transaction repository data set to
classify transactions as fraudulent or normal. A new web application for predicting
loan approvals and detecting fraudulent transactions has been implemented. It used
the Random Forest and Support Vector Machine algorithms for improved accuracy.
The paper also highlights the need to find fraudulent transactions and protect
against them, and it utilized the Support Vector Machine algorithm to analyze and
preprocess data. The loan approval model was tested on a 615-row training data set,
achieving 92% accuracy, and the fraud detection model was tested on a
30,000-customer data set, achieving 94% accuracy.
In this paper [18], machine learning models such as XGBoost, a random forest
classifier, and a support vector machine classifier are used to predict loan approval.
Many people are unable to pay back loans to banks, leading to losses for the banks.
The main reason for taking a loan is to meet a need, such as business growth or
recovery from a loss. The problem arises because not every borrower can repay, and
when a borrower cannot repay, the lender, company, or bank bears the loss. The main
objective of this paper is to predict whether a new applicant is granted a loan using
machine learning models trained on historical data. It uses data from the user, such
as monthly income, marital status, loan amount, and loan duration, to determine
whether the loan will be given or not. Different classification systems are employed
to build the model, and the classifier assigns the data items to their appropriate
class. After testing, XGBoost gives 77.7778% accuracy, Random Forest gives 76.3889%
accuracy, and Decision Tree gives 64.58% accuracy.
To enhance predictions and minimize defaults, the paper [19] uses a logistic
regression model on Kaggle data that considers not only checking account information
but also personal customer attributes. Sensitivity and specificity are compared
between models, revealing marginal improvement. The logistic regression algorithm is
trained on data from previously approved loans. The data set covers 1500 cases with
10 numerical and 8 categorical attributes, including CIBIL Score, Business Value,
and customer assets. Parameters such as qualification, income, loan amount, and
credit history contribute to the model’s efficacy. A logistic regression model with
a sigmoid function is employed. The preprocessing phase, involving exploratory data
analysis and feature engineering, consumes significant time. Two separate datasets
are fed into the model for robust predictions. Imputation, feature engineering, data
mining, and cleaning are used for a better result. Evaluation methods such as the
confusion matrix, accuracy, precision, recall, and F1 score guide the selection
process. The chosen model must meet the stakeholder requirements and constraints.
The best case on this data set obtains an accuracy rate of 81.10%.
Due to fierce rivalry, banks frequently struggle to gain the upper hand over one
another and improve their overall business. A vast amount of data is readily
available, and the establishment of knowledge bases and their effective use have
enabled banks to create effective delivery channels. Data mining can be used to
optimize business choices. The main source of risk that the banking business faces
is credit risk, which includes the risk of loss and loan defaults. The primary
feature of this loan credibility prediction system is its use of the Decision Tree
Induction data mining algorithm for loan request screening and filtering. A
pre-existing bank dataset with 4520 records and 17 attributes is mined to create the
tree. After preprocessing, the final dataset is split into a 66% training set and a
34% test set, and the classifier’s final output is validated using the test set. The
proposed system was implemented in ASP.NET MVC5. For the experimental analysis,
which estimates loan repayment capacity, a moderately sized existing bank dataset
(4520 records) was utilized. Following the pre-processing stage, the dataset was
manually reduced to 3271 records by performing dimensionality reduction. Using
Information Gain as the attribute evaluator and Ranker as the search method yields
the ranks of the attributes [20].
Both customers and bank representatives find loan prediction to be of great use.
This project’s goal is to provide a quick, easy, and expedient method for selecting
the primary client. The purpose of the Loan Prediction System is to enable prompt
processing of an application so that it may be verified based on need. This project
is exclusively intended for the supervisory authority of the bank or finance
organization. The primary objective of data mining research is to extract knowledge
from large amounts of collected data, making it a very active and significant field
of study. Data mining is becoming mainstream in the financial sector, given that
effective analytical methods exist for uncovering obscure information. By applying
the covering technique, a set of attributes emerged as fundamental among the total
of 31 attributes in the variable importance chart of the media transmissions
collection. The forecast model is said to be greatly impacted by these 21
attributes. The accuracy in predicting a customer’s purchasing behavior, measured
through a confusion matrix, is 91.36. ROC curves are often used to examine the
classifier’s output: the y-axis shows the true positive rate, while the x-axis shows
the false positive rate, which ranges in value from 0.1 to 1.0. For the second
model, the accuracy in predicting a customer’s purchasing behavior, measured through
a confusion matrix, is 92.18, and its ROC curve is read in the same way.
Consequently, the accuracy of the two models remains almost unchanged [21].
Artificial intelligence algorithms and machine learning models have applications in
a variety of industries, including education, healthcare, entertainment, and other
professions. Credit ratings and loan conditions are the characteristics that most
likely influence the outcome, as the authors found during their investigation. Their
validation methodology was 5-fold cross-validation.
The dataset was split into a training set (80% of the data, or 3416 observations)
and a test set (20% of the data, or 853 observations) prior to training. Using the
training set, the authors used MATLAB R2023a to train 27 machine learning
classification models. Bayesian optimization was used to optimize the
hyperparameters of three of the models. Table 1 of that paper summarizes the
findings. The best result, 98.45% accuracy on the training set (validation) and
98.83% on the test set, was achieved with narrow neural networks. An optimized
ensemble classification model came second, with 98.42% validation accuracy and
98.83% test accuracy; its optimal hyperparameters were determined using Bayesian
optimization. The minimum classification error plot (Fig. 10 of that paper) shows
that during the optimization process, the classification error dropped to
0.015809 [22].
The majority of bank revenues come from loans, so loan approval is important to
banks. As rates rise, banks struggle to assess requests appropriately and mitigate
risks when predicting consumer loan payments. Numerous researchers have studied
loan approval prediction in recent years, and machine learning is useful for
forecasting on large data sets. Loans are banks’ principal source of both income and
risk: many of a bank’s assets come from loan interest, while the risk that borrowers
do not repay loans on schedule is termed “credit risk”. The credibility of loan
approval or denial was determined. This paper attempts to explain machine learning
techniques that accurately identify loan beneficiaries and let banks detect loan
defaulters, decreasing credit risk. The models include Random Forest, Decision Tree,
Naive Bayes, and Logistic Regression. The analysis begins with data cleaning and
missing value processing, followed by exploratory analysis, model creation, and
model evaluation. The model with the highest accuracy and other performance criteria
on the public test set is taken as the best. This document can help predict whether
a candidate will receive a bank loan or not [23].
A bank loan is a credit offer made to a customer or business by a bank. Lending is
one of a bank’s fundamental financial products, and interest on loans creates the
majority of its profits. Loan companies grant a loan after a careful sequence of
verification and validation. However, they are still unsure whether a particular
applicant will be able to repay the debt. In banking operations, manual procedures
are typically employed to decide whether an applicant is qualified for a loan. This
project’s principal objective was to analyse whether an applicant is acceptable for
a loan by collecting information from numerous sources and employing machine
learning algorithms to extract essential data. This would enable banks and lending
organisations to decide on the best course of action for each loan approval. The
field of artificial intelligence known as “machine learning” is dedicated to
teaching computers how to learn without the need for predetermined, explicit
guidelines. Machine learning systems learn from experience, unlike a rule-based
system, which performs a job using a given set of clear rules. Performance can be
enhanced by exposing the system to extra data. It learns and adapts without
following any particular set of directions. The project focuses on using existing
customers’ details and applying a few machine learning algorithms to predict which
future applications can be accepted for a loan. The models employed are logistic
regression, decision tree, random forest, and XGBoost, which achieved 77.8%, 68.1%,
73.5%, and 76.7% accuracy, respectively. To conclude, logistic regression provided
the best results. Perhaps in the near future, this prediction module and the
automated processing system module will be integrated [24].
With the growth of the banking business, a large number of people are requesting
bank loans. However, the bank can only issue loans to a restricted number of
applicants due to its limited resources, so deciding who would be the best candidate
for a loan and which will be more financially sound for the bank is a frequent
procedure. The research therefore strives to minimise the risk in identifying safe
applicants, in order to save a substantial amount of bank resources and labor. This
is performed by mining big data on previous loan recipients. Based on these data and
experiences, a machine learning model was trained to generate the most accurate
output. Almost all banks’ principal operation is the distribution of loans, and the
majority of the assets owned by a bank come directly from the earnings on the loans
it has given. These days, many financial institutions grant loans following a
lengthy and thorough verification and validation process, but it is still unclear
whether the selected applicant is the most worthy candidate of all. This approach
can assess whether a certain applicant is safe or not, and machine learning
automates the entire feature validation operation. The models used include Random
Forest (RF), SVM, and a Tree model with Genetic Algorithm (TGA). The application
meets all banker requirements, functions as planned, is highly accurate, and is
straightforward to tie into a variety of other systems. Multiple computer failures,
content errors, and, most critically, fixed feature weights were identified as
problems in computerized prediction systems. In the near future, banking software
that connects with an automated processing unit may be more dependable, accurate,
and dynamic, with safer and more consistent weight modification [25].
In the paper ‘Loan Analysis Predicting Defaulters’, the authors used a dataset from
Kaggle consisting of 855969 records, of which 46467 are failed loans. LAPD is a
credit risk scoring model that uses historic data to predict future defaulters by
identifying patterns. The data were preprocessed, and a label encoder technique was
used to convert categorical variables to numerical values. The data was then split
into train and test sets in a 7:3 ratio, with 70% used for training and the
remaining 30% for testing the model. The algorithms used by the authors were
Logistic Regression, Decision Tree, Random Forest, and AdaBoost. Logistic Regression
gave an accuracy of 62%, the Decision Tree 90%, AdaBoost 86%, and Random Forest 92%,
which the authors concluded was the highest [26].
In the paper ‘Loan Prediction Using Random Forest and Decision Trees’, the authors
collected a data set with 12 attributes from the banking sector. After
preprocessing, the data was split into training and testing sets. The training set
was used to train the model, and the testing set was then given to the model to
check whether the predictions are right. Two machine learning classification models
were used by
the authors: Random Forest and Decision Tree. The experiment used the J48 Decision
Tree, which is an implementation of the C4.5 classification algorithm. With a
confidence factor of 0.15 the accuracy was 62.12%, and with a confidence factor of
0.25 the accuracy was 63.39%, so the higher confidence factor gave the higher
accuracy. Random Forest was tested several times with different parameters, and the
best result, obtained without attribute selection, was 87.75% [27].
The authors mention that the data for the paper came from a housing finance company
and consists of both demographic and socioeconomic characteristics of individual
borrowers. It was retrieved from the Kaggle data repository. The data was
preprocessed and split into 70% for training and 30% for testing. The algorithms
used were SVM, LR, and Naive Bayes. The results are presented in two tables, one
with the attribute Early R and the other without it, and the results differ between
the tables. In the first table, LR has an accuracy of 92%, SVM 83.6%, and Naive
Bayes 91.8%. In the second table, without the attribute Early R, the accuracy is
lower than in the first table: LR has 87%, SVM 83.6%, and Naive Bayes 85.6%. It is
therefore concluded that LR gives the best results in both cases [28].
Table 2.1 – Continued from previous page
Ref | Task | Classifier/Model | Data set | Accuracy
19 | Streamline loan approval prediction by prioritizing data preparation and optimizing accuracy, focusing on credit scores, income, and loan amounts, with limited impact from gender and marital status. | Logistic Regression | Kaggle | 81.10%
20 | Loan Credibility Prediction System Based on Decision Tree Algorithm | Decision Tree | N/A | 66% training set and 34% test.
21 | A Comparative Analysis of Feature Selection for Loan Prediction Model | Random Forest classifier, Boruta classifier | N/A | 92.18%
22 | Comparing Machine Learning Classification Models on a Loan Approval Prediction Dataset | Machine Learning Classification, Ensemble Model | N/A | 98.83%
23 | Loan approval Prediction Based on Random Forest Algorithm. | Random Forest algorithm, Decision Tree algorithm, Naive Bayes algorithm, Logistic Regression | N/A | Needs to be tested.
24 | Collecting eligible application by utilizing machine learning techniques. | Logistic regression, Decision tree, Random forest, XG boost | N/A | 77.8%, 68.1%, 73.5%, 76.7%
25 | Determining who would be the best candidate for a loan and which will be more financially sound for the bank is a common procedure. | Random Forest (RF), SVM and Tree model with Genetic Algorithm (TGA) | N/A | Needs to be Tested
26 | Building LAPD(Loan Prediction System) and Integrating it with Web Application | Logistic Regression, Decision Tree, Random Forest, Ada Boost | Kaggle | 62%, 90%, 92%, 86%
27 | Making a Loan prediction system using random forest and decision trees | Random Forest, Decision tree | Banking Sector | 87.75%. For confidence factor 0.15 accuracy 62.12%. For confidence factor 0.25 the accuracy 63.39%
Chapter 3
Proposed Model
3.1 Workplan
3.2 The Nine Classification Models
A total of nine classification models are used in our proposed model: Logistic
Regression, AdaBoost, Support Vector Classifier (SVC), Multi-layer Perceptron
Classifier (MLP), Random Forest Classifier (RFC), Decision Tree, K-Neighbours,
Gaussian Naive Bayes Classifier, and Gradient Boosting. We used multiple classifiers
so that we could compare them and see which one produces the best results.
$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n \tag{3.1}$$
where:
i. z is the linear combination (also called the logit or log-odds).
Logistic regression uses the sigmoid function to convert z into a probability.
The sigmoid function is defined as: $\sigma(z) = \frac{1}{1 + e^{-z}}$
This function outputs values between 0 and 1, making it suitable for probability
estimation.
3. Prediction: The output of the sigmoid function, ŷ, is interpreted as the proba-
bility that the input x belongs to the positive class (usually labeled as 1):
$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n)}} \tag{3.2}$$
4. Training: During training, the model learns the values of the coefficients
β0 , β1 , . . . , βn by minimizing a loss function. The commonly used loss function for
logistic regression is the log-loss (binary cross-entropy), defined as:
$$\text{Log-Loss} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \tag{3.3}$$
where:
• m is the number of training examples.
• yi is the true label for the i-th example.
• ŷi is the predicted probability for the i-th example.
Optimization algorithms like gradient descent are used to minimize the log-loss by
updating the coefficients iteratively.
To conclude, logistic regression uses a linear model to calculate a value z and applies
the sigmoid function to convert z into a probability, and then it uses this probability
to classify the input. The model is trained by adjusting the coefficients to minimize
the log-loss function.
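The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the toy data, the learning rate, the number of epochs, and the function names (`sigmoid`, `predict_proba`, `log_loss`, `fit`) are all hypothetical.

```python
import math

def sigmoid(z):
    """Map a real-valued logit z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(beta, x):
    """beta[0] is the intercept; beta[1:] pair with the features in x (Eq. 3.2)."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return sigmoid(z)

def log_loss(beta, X, y):
    """Mean binary cross-entropy, as in Eq. (3.3)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(beta, xi), eps), 1 - eps)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -total / len(y)

def fit(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the log-loss (assumed hyperparameters)."""
    n, m = len(X[0]), len(y)
    beta = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(beta, xi) - yi  # derivative of the loss w.r.t. z
            grad[0] += err
            for j in range(n):
                grad[j + 1] += err * xi[j]
        beta = [b - lr * g / m for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: one feature, label 1 for the larger values
X = [[0.0], [0.5], [1.0], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
beta = fit(X, y)
labels = [1 if predict_proba(beta, xi) > 0.5 else 0 for xi in X]
```

The 0.5 threshold in the last line is the usual default for turning the predicted probability into a class label; other thresholds can be chosen when false approvals and false rejections have different costs.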
3.2.2 ADAboost
AdaBoost, short for Adaptive Boosting, is an ensemble learning algorithm that
combines multiple weak classifiers to create a strong classifier. The main idea
behind AdaBoost is to improve the accuracy of predictions by focusing on the
instances that are difficult to classify, which leads to a more accurate and robust
final model. It introduced a fresh approach and concept to the design of learning
algorithms. Open problems in boosting theory include discovering a more accurate
weak learning condition for multiclass problems and deriving a tighter
generalization error bound. Thorough research should also be conducted on
AdaBoost’s stopping conditions, the enhancement of its anti-noise capability, and
accuracy improvement through optimization of base classifier diversity.
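To make "focusing on difficult instances" concrete, the sketch below shows one round of the standard discrete AdaBoost weight update. It is an illustration only: labels are assumed to be in {-1, +1}, the weak learner's predictions are given rather than trained, and the function name `adaboost_round` is hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost round: returns (alpha, new_weights).

    weights: current sample weights, summing to 1
    y_true, y_pred: labels and weak-learner predictions in {-1, +1}
    """
    # Weighted error of the weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
    # Vote strength of this weak learner in the final ensemble
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are up-weighted, correct ones down-weighted
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, 1, 1]   # the weak learner misclassifies the 4th sample
weights = [0.2] * 5
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After this round the misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.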
Equation for Gini Impurity: $\text{Gini}(D) = 1 - \sum_{i=1}^{C} p_i^2$
Equation for Information Gain: $IG(D, A) = \text{Entropy}(D) - \sum_{v \in \text{Values}(A)} \frac{|D_v|}{|D|} \text{Entropy}(D_v)$
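The two splitting criteria can be computed directly from class counts. The following is a small sketch, assuming base-2 entropy; the attribute name `credit_history` and its values are hypothetical toy data.

```python
import math
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(D, A): entropy of D minus the weighted entropy of each subset D_v."""
    n = len(labels)
    groups = {}
    for label, value in zip(labels, attribute_values):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

labels = [1, 1, 0, 0]
credit_history = ["good", "good", "bad", "bad"]  # hypothetical attribute
```

Here the attribute separates the two classes perfectly, so the information gain equals the full entropy of the labels, while the Gini impurity of the unsplit set is 0.5.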
3.2.4 Random Forest
The Random Forest algorithm is a versatile and widely used ensemble learning
method for classification and regression tasks. It operates by constructing a
multitude of decision trees during training and outputs either the mode of the
classes predicted by the individual trees (for classification) or their mean
prediction (for regression). Random Forest is a powerful and flexible algorithm
capable of producing high-quality predictions across a variety of domains and
applications. It can handle large datasets with high dimensionality [39].
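The aggregation step described above (mode for classification, mean for regression) can be sketched as follows. Only the voting is shown; tree construction and bootstrapping are omitted, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_predictions):
    """Majority vote (mode) over the per-tree class predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Mean of the per-tree numeric predictions."""
    return mean(tree_predictions)

votes = [1, 0, 1, 1, 0]  # hypothetical predictions of five trees for one applicant
```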
3.2.5 K-Neighbours
The K-Nearest Neighbours (KNN) algorithm is a non-parametric technique used in both
regression and classification. KNN has been extensively utilised in machine learning
and data mining because it is straightforward yet highly practical. After the
training samples are stored, classification forecasts the labels of test data
points. Although various classification techniques have been proposed by researchers
in the last few decades, KNN remains one of the most often used techniques for
classifying data sets. For each query point, the input consists of its k nearest
examples in the feature space; the neighbours are chosen from a set of objects with
known labels, which can be thought of as the training dataset [40].
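A minimal KNN classifier can be written in a few lines. This sketch assumes Euclidean distance and an unweighted majority vote; the two-dimensional toy points and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = [0, 0, 0, 1, 1, 1]
```

Because distances drive the prediction, features on very different scales (for example income versus number of dependents) are normally scaled before KNN is applied.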
$$P(C_k \mid x) = \frac{P(x \mid C_k) \cdot P(C_k)}{P(x)} \tag{3.4}$$
Since P (x) is constant for all classes, it can be ignored in the context of classification:
$$P(C_k \mid x) \propto P(x \mid C_k) \cdot P(C_k) \tag{3.5}$$
2. Naive Independence Assumption: The “naive” assumption is that all features
$x_i$ are independent given the class $C_k$. Thus,
$$P(x \mid C_k) = P(x_1, x_2, \ldots, x_n \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k) \tag{3.6}$$
where:
Here’s a detailed explanation of how the Gradient Boosting Classifier works, along
with the necessary formulas.
1. The Loss Function: For a classification task, the loss function L(y, F (x))
measures the difference between the true labels y and the predicted values F (x). A
common loss function for classification is the logistic loss.
2. Additive Model: The model F (x) is built in an additive manner:
Fit a weak learner (e.g., a decision tree) hm (x) to the pseudo-residuals rim .
Compute the optimal step size ν:
$$\nu = \arg\min_{\nu} \sum_{i=1}^{N} L\left(y_i, F_{m-1}(x_i) + \nu h_m(x_i)\right)$$
In practice, a fixed learning rate is often used instead of finding the optimal ν.
Update the model: $F_m(x) = F_{m-1}(x) + \nu h_m(x)$
In conclusion, Gradient Boosting Classifier builds a strong model by combining the
predictions of multiple weak learners, each correcting the errors of the previous ones,
resulting in a highly accurate and robust classifier.
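The loop described in this section can be sketched in plain Python. This is a simplified illustration under stated assumptions: a single feature, labels in {0, 1}, the logistic loss, regression stumps as weak learners, and a fixed learning rate in place of the optimal step size. All function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(x, residuals):
    """Least-squares regression stump on one feature: (threshold, left, right)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def stump_predict(stump, xi):
    t, lv, rv = stump
    return lv if xi <= t else rv

def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    F = [0.0] * len(x)  # F_0(x) = 0, interpreted as log-odds
    stumps = []
    for _ in range(n_rounds):
        # Pseudo-residuals: negative gradient of the logistic loss w.r.t. F
        residuals = [yi - sigmoid(fi) for yi, fi in zip(y, F)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Additive update with a fixed learning rate
        F = [fi + learning_rate * stump_predict(stump, xi) for fi, xi in zip(F, x)]
    return stumps, F

def logistic_loss(y, F):
    return -sum(
        yi * math.log(sigmoid(fi)) + (1 - yi) * math.log(1 - sigmoid(fi))
        for yi, fi in zip(y, F)
    ) / len(y)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]
stumps, F = gradient_boost(x, y)
predictions = [1 if sigmoid(fi) > 0.5 else 0 for fi in F]
```

Each round fits the stump to what the current model still gets wrong, which is the "correcting the errors of the previous ones" behaviour described above.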
Chapter 4
Dataset
4.2.1 Handling Missing Values
We used different imputation methods, replacing missing values with the mode or the
mean depending on whether the feature is categorical or numerical. Categorical
variables: categorical columns were encoded using label encoding and direct
replacement of column values to make them compatible with machine learning
algorithms.
As shown in figure 4.1, many columns have a small number of null values, so we
replaced the NaN values with the mean or the mode. Y values were replaced with 1 and
N values with 0, and the same was done for other Boolean columns. Then, using a
label encoder, selected categorical columns in both the training and testing data
sets were replaced with numerical representations. Figure 4.2 shows that no missing
values remain in the data set.
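The preprocessing steps above can be sketched with the Python standard library alone. This is an illustration, not the code used in this work: missing values are assumed to be represented as `None`, the column contents are toy values, and the helper names (`impute`, `encode_yes_no`, `label_encode`) are hypothetical.

```python
from statistics import mean, mode

def impute(column, numeric):
    """Replace None with the column mean (numeric) or mode (categorical)."""
    present = [v for v in column if v is not None]
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in column]

def encode_yes_no(column):
    """Map 'Y' to 1 and anything else ('N') to 0."""
    return [1 if v == "Y" else 0 for v in column]

def label_encode(column):
    """Assign integer codes to categories in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [mapping[v] for v in column], mapping

# Hypothetical toy columns
loan_amount = [100.0, None, 200.0, 150.0]
married = ["Yes", None, "Yes", "No"]
loan_status = ["Y", "N", "Y"]
property_area = ["Urban", "Rural", "Semiurban", "Urban"]

loan_amount_filled = impute(loan_amount, numeric=True)
married_filled = impute(married, numeric=False)
status_encoded = encode_yes_no(loan_status)
area_encoded, area_mapping = label_encode(property_area)
```

To avoid leaking information from the test set, the fill values and the category mapping are usually computed on the training data only and then reused on the test data.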
4.2.2 SMOTE and ADASYN
Using SMOTE and ADASYN, we apply two different resampling and data augmentation
methods to overcome the problem of imbalanced class distribution in a binary
classification task.
Before resampling:
SMOTE Resampling:
SMOTE balances the data by adding new synthetic samples for the minority class
(class 0) until it has the same number of samples as the majority class (class 1).
ADASYN Resampling:
Like SMOTE, ADASYN creates extra synthetic samples for the minority class, leading
to a distribution that is very close to but not exactly equal. Figure 4.3 shows the
balance of the dataset after both techniques were applied, and table 4.1 shows the
number of instances after SMOTE and ADASYN are applied.
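The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as below. This is a simplified stand-in for a library implementation: the points, the value of `k`, the seed, and the function name `smote` are hypothetical, and ADASYN's density-based weighting is not shown.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by linear interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

majority = [(5.0, 5.0), (6.0, 5.5), (5.5, 6.0), (6.5, 6.5), (7.0, 6.0), (6.0, 7.0)]
minority = [(1.0, 1.0), (1.5, 1.2), (1.2, 1.8)]
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

Resampling is normally applied to the training split only, so that the test set keeps the original class distribution.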
Figure 4.3: Balancing the Dataset
Dependents: Number of dependent family members that this potential borrower
has living with them.
LoanStatus: The target variable, a binary variable indicating whether the loan
was approved (1) or not (0).
The first step in our process was to split the dataset into training and testing
subsets. The training subset consists of 70% of the data, or about 6,876 records.
The testing subset has the remaining 30%, or 2,948 records. This ensures that the
models have enough information to learn from while still leaving enough unseen data
for evaluation.
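The split sizes quoted above follow directly from the 9,824 observations. The sketch below shows a shuffled 70/30 split in plain Python; the seed and the helper name `train_test_split` are hypothetical, and integer record IDs stand in for the real rows.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it at the training fraction."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:cut], shuffled[cut:]

records = list(range(9824))  # stand-in IDs for the 9,824 observations
train, test = train_test_split(records)
```

With 9,824 records this yields 6,876 training and 2,948 testing records, matching the counts reported in the text.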
Some of the factors considered for loan approval prediction are Gender, Married,
Dependents, Applicant Income, Self Employed, Credit History, and Property Area. In
this section we predict LoanStatus with well-known classifiers and assess
performance metrics such as accuracy, precision, recall, and F1-score to understand
the overall quality of each model.
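The four metrics named above all derive from the confusion matrix counts. A minimal sketch, assuming class 1 (approved) is the positive class; the label vectors are hypothetical and the function name is illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Precision penalises loans approved for applicants who should have been rejected, while recall penalises rejections of applicants who should have been approved, which is why both are reported alongside accuracy.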
Figure 4.5: The Parallel coordinates graph of the dataset
Chapter 5
Training Score:
This shows the performance of the model on the sample of data, which was used
for training. Typically, high training accuracy indicates that the model learned the
patterns contained in the data. However, if training accuracy is very high while
test accuracy is much lower, it usually means that the model was overfitted and
learned irrelevant patterns that may not be useful for predicting new data.
Accuracy Score:
This shows the performance of the model on data that was not used for training
(the test set). The closer the test accuracy is to the training accuracy, the better
the model’s generalization. The score results are shown below in table 5.1 and
figure 5.1.
Table 5.1: Score Results of test and train for classic models
5.1.3 Test Classifier Performance Evaluation:
Both the Support Vector Machine and Decision Tree models exhibit high
performance, with similar accuracy on the training and test datasets, as shown
in table 5.1. This indicates that these models do not suffer from overfitting,
as they are able to achieve good performance on unseen data.
The Random Forest and K-Nearest Neighbors (KNN) classifiers also yield good
results, though the test accuracy of the former is slightly lower than its
training accuracy, suggesting a small degree of overfitting, yet a reasonable
result.
Logistic Regression, AdaBoost, and Gaussian Naive Bayes achieve notably lower
accuracy than the SVM and Decision Tree models, although their performance is
consistent across the training and test datasets. This suggests that these
models have a lower risk of overfitting, but they may be too simple in
structure to fit complex problems.
The performance of Gradient Boosting is fair, with accuracy in the high
seventies. Its performance might improve with hyperparameter tuning.
The Multi-Layer Perceptron leaves room for improvement, since the gap between
its training accuracy and test accuracy is wider than for the other models,
hinting at possible overfitting.
The best training and testing accuracy is observed in the Decision Tree and SVM
models. Simpler models such as Logistic Regression and Naive Bayes cannot be
expected to capture nonlinear patterns; however, they are stable. Other models
such as Random Forest and KNN are accurate, but they are more complex and
need tuning to prevent overfitting.
Performance Metrics
The following table 5.2 summarizes the performance of each classifier in terms of
accuracy, precision, recall, F1-score, and confusion matrix statistics. Figure 5.1
and figure 5.2 show the accuracy and the evaluation metrics of the classical models.
Figure 5.2: Evaluation Metrics of the Classic ML Models
Model                         True Positives (TP)  False Positives (FP)  True Negatives (TN)  False Negatives (FN)
Logistic Regression           1937                 611                   316                  36
SVC                           2000                 623                   304                  21
Decision Tree Classifier      1918                 312                   615                  103
Random Forest Classifier      1937                 326                   601                  84
MLP Classifier                1949                 396                   531                  72
Gradient Boosting Classifier  1945                 431                   496                  76
K-Neighbors Classifier        1880                 333                   594                  141
GaussianNB                    2000                 623                   304                  21
AdaBoost Classifier           1936                 537                   390                  85
Key Insights:
According to the findings in table 5.2, the Random Forest Classifier proved to be
the most accurate model, with an accuracy of 0.8609 and an F1 score of 0.9043,
which makes it the most effective and balanced model for this dataset.
The Decision Tree Classifier also showed strong results, with high accuracy
(0.8592), precision (0.8601) and F1 score (0.9024). This model provided the best
balance between precision and recall, making it well suited to applications
where both matter. Likewise, SVC and GaussianNB have the highest recall
(0.9896), which makes them suitable for tasks where false negatives are a
greater concern than false positives.
According to the results in table 5.2, the Random Forest Classifier offers the best
overall performance, particularly in terms of accuracy and F1 score. The Decision
Tree Classifier also performed well, with a strong balance of precision, recall,
and accuracy. For applications where recall is critical, SVC and GaussianNB may
be preferable due to their ability to minimize false negatives. Future work can
explore further model optimization and alternative performance metrics based on
domain-specific requirements.
Moreover, from the confusion matrix it can be seen that all the classifiers have a
high true positive count, which indicates that positive classes are classified
correctly; the best performance is recorded by the SVC and Gaussian Naive Bayes
classifiers.
The Decision Tree Classifier records the most true negatives, hence it is a
classifier that predicts negative classes effectively. Random Forest also
achieves a good true negative count.
There are models, SVC in particular, that exhibit high false positive counts,
which can lead to serious misclassification in very important functions.
From table 5.3 we see that the SVC has the fewest false negatives, making it
sensitive to positive instances, which works well in situations where the cost
of a false negative is high.
Based on the confusion matrices in table 5.3, we can dissect the performance of
the various classifiers, and we found that the Random Forest classifier performed
the best. It does well by having very high true positives (1937), adequate true
negatives (601) and few false negatives (84).
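The reported Random Forest scores can be checked directly from its confusion-matrix counts. The sketch below uses the counts quoted above (1937 true positives, 601 true negatives, 84 false negatives) together with the 326 false positives from the Random Forest row of the table; the function name `classification_metrics` is an assumption made for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the four headline metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Random Forest counts as reported in this chapter
rf = classification_metrics(tp=1937, fp=326, tn=601, fn=84)
print(round(rf["accuracy"], 4), round(rf["f1"], 4))  # 0.8609 0.9043
```

The result agrees with the accuracy (0.8609) and F1 score (0.9043) reported for the Random Forest Classifier, and the same function can be applied to any other row of the table.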
The analysis of the performance metrics shows that actual negatives are rarely
classified as positives, while the positives are reliably detected. This
combination is important because in many instances both types of errors can
be very costly.
It can be seen from this analysis that the Random Forest algorithm is the best in
terms of classification accuracy, since it maintains a good number of true
negatives while keeping true positives very high.
Model                         True Positives (TP)  False Positives (FP)  False Negatives (FN)  True Negatives (TN)
Logistic Regression           480                  476                   75                    1917
SVC                           480                  476                   21                    1971
Decision Tree Classifier      848                  108                   280                   1712
Random Forest Classifier      848                  108                   280                   1712
MLP Classifier                874                  82                    430                   1562
Gradient Boosting Classifier  675                  281                   200                   1792
K-Neighbors Classifier        722                  234                   115                   1877
GaussianNB                    479                  477                   87                    1905
AdaBoost Classifier           560                  396                   131                   1861
Figure 5.5: Confusion matrix of the Classical model after SMOTE
So, based on the exploration of the confusion matrix and key metrics, we can decide:
Models to Avoid: Models like AdaBoost and Logistic Regression do not perform
well here, as they produce a large number of false negatives, which means they
tend to miss more positive cases than the other models and are therefore poorly
suited to an imbalanced dataset.
In summary, the KNN, Random Forest, and Gradient Boosting models perform
effectively on SMOTE-resampled data.
5.3 Evaluation of Performance with Different Classifiers After ADASYN
ADASYN has been applied to adjust the class imbalance within the dataset, and the
classifiers’ performance has been assessed and compared using several metrics:
Accuracy, Precision, Recall, and F1 Score. Figure 5.6 shows the evaluation
metrics obtained after ADASYN was applied. Figure 5.7 and table 5.5 show the
confusion matrix of the models after ADASYN has been applied.
Model                         True Positives (TP)  False Positives (FP)  False Negatives (FN)  True Negatives (TN)
Logistic Regression           644                  312                   507                   1485
SVC                           781                  175                   558                   1434
Decision Tree Classifier      886                  70                    418                   1574
Random Forest Classifier      886                  70                    418                   1574
MLP Classifier                872                  84                    432                   1560
Gradient Boosting Classifier  799                  157                   472                   1520
K-Neighbors Classifier        765                  191                   190                   1802
GaussianNB                    493                  463                   178                   1814
AdaBoost Classifier           704                  252                   591                   1401
Best Classifier: K-Nearest Neighbors (KNN) has the highest recall of 0.942 and
the best F1 score of 0.912, which makes it stand out. It also has the best
accuracy score of 0.870, which indicates a good balance between precision and
recall.
Strong Performers: Gradient Boosting and Gaussian Naive Bayes also exhibit
good performance, with GaussianNB generally having superior recall at the risk
of more false positives, and Gradient Boosting performing roughly equally on
precision and recall.
Models to Avoid: AdaBoost and SVC have the lowest recall and accuracy, coupled
with high levels of false negatives, making these models ineffective for the
skewed class distribution of the given problem.
In comparison with SMOTE, K-Nearest Neighbors delivers good results using both
methods, and thus it can be concluded that KNN is not adversely affected by
class imbalance regardless of the resampling technique applied.
Figure 5.8: Mean Evaluation Metrics of the different Resampling technique
As indicated by the results in table 5.6 and figure 5.8, No Resampling is the
most accurate approach, with the highest mean accuracy of 0.843924. It makes
the highest number of correct predictions on average. No Resampling comes
first, followed by SMOTE at 0.821762, while ADASYN is rated last at 0.775592
in terms of accuracy.
The No Resampling method also scored the highest precision, 0.831013, which
shows that it produced the fewest false positives compared to SMOTE and
ADASYN. SMOTE takes second position with a precision of 0.805487, while
ADASYN is slightly lower at 0.790420. For recall (sensitivity), No Resampling
and SMOTE achieved similar results of 0.810418 and 0.799864 respectively,
which shows how effective they are in detecting the true positive cases.
ADASYN again comes last with the lowest score.
SMOTE (0.791545) has the best F1 score, which means that it offers the best
balance between precision and recall. Its performance is almost identical to
that of No Resampling (0.790043), while that of ADASYN (0.766264) falls below
both. The No Resampling approach thus leads on accuracy and precision,
although its F1 score is slightly lower than that of SMOTE. When the priority
is overall correctness (accuracy) while minimizing false positives (high
precision), this option is the best. The SMOTE method gives the highest F1
score, which implies a slightly better balance between precision and recall.
It may be the better strategy in cases where both precision and recall are
critical, particularly in skewed datasets.
In all metrics, ADASYN scored the lowest, hence it is the least preferred
among the three. All in all, either No Resampling or SMOTE is a sound choice:
No Resampling if the priority is a high level of accuracy and precision, and
SMOTE if the most critical aspect to consider is the F1 score (i.e. the
combination of precision and recall).
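The ranking described above can be reproduced from the mean scores quoted in this section. The sketch below is illustrative only; the dictionary layout and the helper name `best_per_metric` are assumptions, and the ADASYN recall is left out because its value is not quoted in the text.

```python
# Mean scores as quoted in this section (ADASYN recall is not stated in the text)
mean_scores = {
    "No Resampling": {"accuracy": 0.843924, "precision": 0.831013,
                      "recall": 0.810418, "f1": 0.790043},
    "SMOTE": {"accuracy": 0.821762, "precision": 0.805487,
              "recall": 0.799864, "f1": 0.791545},
    "ADASYN": {"accuracy": 0.775592, "precision": 0.790420, "f1": 0.766264},
}

def best_per_metric(scores):
    """Return the best-scoring technique for every metric that has a value."""
    metrics = sorted({m for row in scores.values() for m in row})
    best = {}
    for metric in metrics:
        candidates = {name: row[metric] for name, row in scores.items() if metric in row}
        best[metric] = max(candidates, key=candidates.get)
    return best

winners = best_per_metric(mean_scores)
for metric, technique in winners.items():
    print(f"{metric}: {technique}")
```

No Resampling leads on accuracy and precision, while SMOTE leads on F1 score by a margin of roughly 0.0015, which is why the choice between the two depends on which metric matters most for the application.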
Activation Function
The sigmoid activation function is introduced in order to fit any real number value
into a number between 0 and 1. This can be expressed mathematically as:
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \tag{5.2} \]
Binary Cross-Entropy Loss
The Binary Cross-Entropy loss is suitable for this RNN model because it allows for
probabilistic interpretation of outputs, making it effective for training purposes.
\[ L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \tag{5.3} \]
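Equations (5.2) and (5.3) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the training code used in this work; the example logits and labels are hypothetical, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def sigmoid(x):
    """Equation (5.2), evaluated in a numerically stable way for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Equation (5.3): mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

logits = [2.0, -1.0, 0.0]  # hypothetical raw model outputs
labels = [1, 0, 1]         # hypothetical true classes
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(labels, probs)
print(round(loss, 4))
```

Confident correct predictions contribute little to the loss, while the uncertain third prediction (probability 0.5) contributes the most.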
• Output Gate: Determines what part or how much of the hidden state can
be output.
These gates can be represented as follows:
the target variable and true class labels which requires the adjustment of network
parameters through backpropagation which is usually done in layers [47]. These can
be expressed mathematically as:
\[ z = W \cdot v + b \tag{5.9} \]
Output Layer: For binary classification, the output layer often uses a sigmoid
activation function:
\[ \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}} \tag{5.10} \]
\[ y = \sigma(W \cdot x + b) \tag{5.11} \]
convolutional layers only [44]. Thus, they can learn spatial characteristics of the
presented data effectively without much feature engineering.
\[ h_i^{(l)} = \sigma\left( \sum_{k=1}^{K} W_k^{(l)} x_{i+k-1}^{(l-1)} + b^{(l)} \right) \tag{5.12} \]
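Equation (5.12) can be read as a sliding weighted sum followed by a sigmoid. The sketch below implements a single-channel version in plain Python under that reading; the feature values, kernel weights and bias are hypothetical, and the function name `conv1d_layer` is an assumption made for illustration.

```python
import math

def sigmoid(z):
    """Logistic activation applied to each convolution output."""
    return 1.0 / (1.0 + math.exp(-z))

def conv1d_layer(x, weights, bias):
    """Single-channel 1D convolution without padding, as in equation (5.12).

    Zero-based index i + j here corresponds to i + k - 1 in the equation.
    """
    k = len(weights)
    return [
        sigmoid(sum(weights[j] * x[i + j] for j in range(k)) + bias)
        for i in range(len(x) - k + 1)
    ]

features = [0.2, -0.5, 1.0, 0.3, -0.1]  # hypothetical scaled input features
kernel = [0.5, -1.0, 0.25]              # hypothetical kernel of size K = 3
out = conv1d_layer(features, kernel, bias=0.1)
print(len(out))  # 3
```

With five inputs and a kernel of size three, the layer yields three activations, each summarising a window of three neighbouring features.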
Each model was trained for 100 epochs with a batch size of 32 using the Adam
optimizer and binary cross-entropy loss. The models were evaluated based on ac-
curacy, precision, recall, and F1-score to compare their performance on the test data.
Model       Accuracy  Precision  Recall  F1 Score  Confusion Matrix (TP, FP, FN, TN)
Simple RNN  0.8504    0.8552     0.9372  0.8944    (1867, 316, 125, 640)
LSTM        0.8558    0.8429     0.9669  0.9006    (1926, 359, 66, 597)
CNN         0.8823    0.8831     0.9518  0.9162    (1896, 251, 96, 705)
FCNN        0.8803    0.8796     0.9533  0.9150    (1899, 260, 93, 696)
FCN         0.8975    0.8634     0.9804  0.9182    (1953, 309, 39, 647)
Table 5.8: Training and Validation Loss with Validation Accuracy for Different
Models
Figures 5.10 to 5.19 show the accuracy and loss of the implemented neural
models. Moreover, figures 5.20 to 5.24 show the confusion matrices of the
models.
RNN
Figure 5.10: RNN Loss
LSTM
Figure 5.12: LSTM Loss
Figure 5.14: CNN Loss
Figure 5.16: FCNN Loss
Figure 5.18: FCN Loss
Figure 5.20: Confusion matrix for LSTM
Figure 5.22: Confusion matrix for FCNN
Loss and Generalization: The model also proved unsatisfactory for deployment
due to its high validation loss, which plateaus at about 0.3328.
5.7.2 LSTM
Performance: LSTM demonstrated a slight improvement in accuracy (0.8558) and
recall (0.9669) over Simple RNN, indicating the ability of the model to capture
long-term dependencies.
Loss and Generalization: The validation loss of the model (0.3123) is also lower
than that of Simple RNN, which means it generalizes more effectively. However,
its precision (0.8429) is lower than that of both CNN and FCN, leading to more
false positives.
Loss and Generalization: Although the CNN model’s validation loss (0.2450) is
not the lowest among the models, it is still suitable for the loan approval task
since its validation accuracy is high at 0.8823.
Loss and Generalization: Although its validation loss is low (0.2483), implying
the network will generalize well, the FCNN model does not surpass the CNN model
in generalization capability or accuracy.
Loss and Generalization: It is also worth noting that FCN has the lowest
validation loss among the models, 0.2255, indicating that it is better than the
other models at predicting unseen data. This further reinforces the view that
the FCN, the most complex model, works well in offering balanced precision and
recall characteristics while reducing errors.
5.7.6 Trade-Offs
RNN: Recorded lower precision and F1-scores than more advanced architectures
such as LSTM and CNN.
LSTM: Provided better recall, which is an advantage in use cases where false
negatives can be expensive.
CNN and FCN: These models return the best results for all the metrics,
indicating that convolutional layers are useful for this kind of prediction task.
Simple RNN: This model is lightweight, allowing for quick training, but it is
hampered by the vanishing gradient effect, which in turn results in lower
performance.
LSTM: This model is advantageous as it can capture longer dependencies, but the
computation and time needed for training are considerably higher than for
simpler structures like RNN or CNN.
CNN and FCNN: These models have a lower training time than LSTM, especially
with datasets that exhibit little time variation. The way CNN is structured
allows it to be highly optimised and trained within a short span of time without
compromising on the predictive results.
Therefore, the results of this study indicate that Fully Convolutional Neural Net-
works (FCN) provide the best performance for loan approval prediction. FCN’s high
validation accuracy and strong generalization capabilities make it an ideal choice for
deployment in financial institutions. CNN also shows excellent promise, providing
a balance between performance and efficiency.
Chapter 6
6.1 Conclusion
This research highlights the importance of employing machine learning classifiers
for the prediction of loan approval status which will be beneficial for financial in-
stitutions aiming at improving their processes of making decisions. Our extensive
analysis demonstrates that suitable data pre-processing and feature selection play
critical roles in attaining the best performance of the model. As a result, the Ran-
dom Forest Classifier has also been shown to be the most effective method with high
accuracy and F1 score values while striking a good balance of precision and recall.
Its strength makes it a good candidate for finance-related applications in the real
world. Also, the use of SMOTE and ADASYN reduced the problem of class
imbalance to a large extent, enabling classifiers such as K-Nearest Neighbors
to still perform satisfactorily. The set of evaluation metrics that were applied
distinguished the strengths and weaknesses of each model relative to the others,
and this sets a basis for further exploration in this area.
Out of the five types of neural networks analyzed, the Fully Convolutional
Network (FCN) architecture had the best predictive performance, with 89.75%
validation accuracy, 0.9804 recall, and the lowest validation loss of 0.2255.
This makes FCN the best option for real-time use where accurate decisions are
highly sought. Both CNN and FCNN showed good performance, but the comparative
study also revealed important trade-offs between precision and recall on the
one hand and the time taken to run the model on the other.
On the other hand, the less sophisticated architectures such as Simple RNNs and
LSTMs, although useful in some aspects, did not perform as well as the more
sophisticated models. This suggests that there is still much room for
development and improvement of machine learning applications in the financial
space. Further studies may focus on the application of ensemble techniques, more
extensive hyperparameter tuning, and extending the scope of these models to
additional factors that influence loan approval, in order to improve prediction
and generalization.
6.2 Future Works
The increasing application of machine learning and deep learning technologies in
ascertaining loan default risk in the financial sector also presents certain hurdles.
These barriers should be surmounted to increase the performance and the trustwor-
thiness of the predictive models and also create a better business environment in
the given industries. Some of the issues we have encountered as well as strategies
that can be applied in the coming years are outlined below.
Solution: Future research should be directed toward constructing effective data
gathering and data cleaning methods. More data augmentation methods can also be
used to add synthetic data for rare cases, so that the model performs well
without overfitting. Gaining access to such datasets may entail establishing
partnerships with consumer-focused financial institutions, in order to improve
the model with historical records of applicants and the applications processed
[41].
6.2.3 Bias and Fairness
Problem: Predictive models are often influenced by prejudice in the training
datasets, which leads to biased results when discriminating between loan
applicants. For instance, certain demographic groups may have been favored in
historical lending practices, which ends up producing models that are biased.
Solution: Ensuring fairness at the prediction stage of the proposed work is
very important. Future studies may work on how to detect biases in outcomes and
how to reduce or eliminate them, if they exist. This includes, for instance,
employing techniques for context-sensitive adversarial training, or applying
fairness criteria during model training so that predictions do not discriminate
against any group based on group-defining characteristics [43].
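One simple way to detect such bias is to compare the true positive rate across groups, which is the idea behind the equal opportunity criterion of Hardt et al. [31]. The sketch below is a minimal illustration under that assumption; the labels, predictions and the group names "A" and "B" are hypothetical and do not come from the loan dataset.

```python
from collections import defaultdict

def true_positive_rate_by_group(y_true, y_pred, groups):
    """Share of truly approved applicants that the model approves, per group."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for y, p, g in zip(y_true, y_pred, groups):
        if y == 1:
            positives[g] += 1
            hits[g] += int(p == 1)
    return {g: hits[g] / positives[g] for g in positives}

# Hypothetical labels, predictions and group membership
y_true = [1, 1, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = true_positive_rate_by_group(y_true, y_pred, groups)
gap = max(rates.values()) - min(rates.values())
print(rates, round(gap, 3))
```

A gap close to zero suggests that deserving applicants are approved at similar rates in every group, while a large gap is a signal that the model should be examined further before deployment.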
Solution: Additional research may explore the design of modular and deployable
structures enabling smoother embedding of the prediction models into current
processes. Cloud solutions favor the incorporation of machine learning
techniques into existing systems with minimal changes to the legacy
systems [44].
Bibliography
[1] Kumar Arun, Garg Ishan, Kaur Sanmeet(2016). Loan Approval Prediction
based on Machine Learning Approach. IOSR Journal of Computer Engineering
(IOSR-JCE) e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 3, Ver. I
(May-Jun. 2016), PP 79-81 www.iosrjournals.org
[3] Lai, L. (2020). Loan Default Prediction with Machine Learning Techniques.
2020 International Conference on Computer Communication and Network Se-
curity (CCNS).
[4] Mohammad Ahmad Sheikh; Amit Kumar Goel; Tapas Kumar(2020) An Ap-
proach for Prediction of Loan Approval using Machine Learning Algorithm
[8] Mohammad Ahmad Sheikh; Amit Kumar Goel; Tapas Kumar(2020) An Ap-
proach for Prediction of Loan Approval using Machine Learning Algorithm
[9] Jeremy D. Turiel, Tomaso Aste, P2P LOAN ACCEPTANCE AND DEFAULT
PREDICTION WITH ARTIFICIAL INTELLIGENCE
[10] Mayank Anand, Arun Velu, Pawan Whig,Prediction of Loan Behaviour with
Machine Learning Models for Secure Banking
[11] S. Vimala, K.C. Sharmili, “Prediction of Loan Risk using NB and Support
Vector Machine”, International Conference on Advancements in Computing
Technologies (ICACT 2018), vol. 4, no. 2, pp. 110-113, 2018.
[12] Pidikiti Supriya, Myneedi Pavani, Nagarapu Saisushma, Namburi Vimala
Kumari, Vikash, “Loan Prediction by using Machine Learning Models”,
International Journal of Engineering and Techniques, Volume 5 Issue 2, Mar-Apr 2019
[13] Nikhil Madane, Siddharth Nanda, “Loan Prediction using Decision tree”,
Journal of the Gujarat Research History, Volume 21 Issue 14s, December 2019.
[14] Rutika Pramod Kathe, Sakshi Dattatray Panhale, Pooja Prakash Avhad,
Punam Laxman Dapse, Ghorpade Dinesh B. Prediction Of Loan Approval Using
Machine Learning Algorithm: A Review Paper. International Journal Of
Creative Research Thoughts (IJCRT).
[15] S. Sobana, P. Jasmine Lois Ebenezer. A COMPARATIVE STUDY ON MACHINE
LEARNING ALGORITHMS FOR LOAN APPROVAL PREDICTION ANALYSIS. International
Research Journal of Modernization in Engineering Technology and Science.
[16] Mengnan Song, Jiasong Wang, Tongtong Zhang, Guoguang Zhang, Ruijun Zhang,
Suisui Su. Effective Automated Feature Derivation via Reinforcement Learning
for Microcredit Default Prediction.
[17] Arunkumar, G. A., Panchuram, C. R., Afzal, K. M. A., Yadav, N. S., Goradiya,
U. (2023). Predictive Analysis in Banking using Machine Learning. Interna-
tional Journal of Scientific Research in Computer Science, Engineering and
Information Technology, Volume 10 (Issue 2), Page Number 434-439. DOI:
https://doi.org/10.32628/CSEIT2390247
[18] Singh, V., Yadav, A., Awasthi, R. (2021). Prediction of modernized
loan approval system based on machine learning approach. In Proceedings
of the 2021 International Conference on Intelligent Technologies (CONIT)
(pp. 1-6). Institute of Electrical and Electronics Engineers (IEEE). DOI:
https://doi.org/10.1109/CONIT51480.2021.9498475
[19] M.A. Sheikh, A.K. Goel and T. Kumar, ”An Approach for Prediction of Loan
Approval using Machine Learning Algorithm”, 2020 International Conference
On Electronics and Sustainable Communication Systems (ICESC), pp. 490-494,
2020. DOI: https://doi.org/10.1109/ICESC48915.2020.9155614
[20] Sivasree M S ,Rekha Sunny T(2015), Loan Credibility Prediction System Based
on Decision Tree Algorithm, Volume 4, Issue 9
[21] Karthikeyan S.M, Pushpa Ravikumar(2021), A Comparative Analysis of Fea-
ture Selection for Loan Prediction Model, Volume 174, No. 11
[22] Ladislav Végh , Krisztina Czakóová and Ondrej Takáč(2023), Comparing Ma-
chine Learning Classification Models on a Loan Approval Prediction Dataset,
Volume 7, pp. 98-103
[23] Subhiksha, Vaishnavi, Shalini, Mr. N. Manikandan(2022) Bank Loan Approval
Prediction Using Data Science Technique (ML)
[24] Shruti Mishra, Shailki Sharma and Shreyansh Singh(2022) Loan approval pre-
diction
[25] Deepak Ishwar Gouda, Ashok Kumar , Anil Manjunatha Madivala, Dilip Ku-
mar, Dr.Ravikumar(2021) LOAN APPROVAL PREDICTION BASED ON
MACHINE LEARNING
[26] Mudit Manish Agarwal, Harshal Mahendra Shirke, Vivek Prafullbhai Vadhiya,
Manya Gidwani,’Loan Analysis Predicting Defaulters’,2022 Volume 9, Issue 4
[27] Kshitiz Gautam, Arun Pratap Singh, Keshav Tyagi, Mr.Suresh Kumar,’Loan
Prediction Using Decision Tree and Random Forest’ 2020 Volume 7, Issue 8
[29] Lundberg, S. M., Lee, S.-I. (2017). A unified approach to interpreting model
predictions. In Advances in Neural Information Processing Systems 30
(NIPS 2017) (pp. 4765-4774). (Discusses SHAP values for interpretability of
various models)
[31] Hardt, M., Price, E., Srebro, N. (2016, June). Equality of opportunity in su-
pervised learning. In Advances in Neural Information Processing Systems (pp.
3315-3323).
[32] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.
(2014). Dropout: A simple way to prevent neural networks from overfitting.
The Journal of Machine Learning Research, 15(1), 1929-1958.
[35] Nusinovici, S., Tham, Y. C., Yan, M. Y. C., Ting, D. S. W., Li, J., Sa-
banayagam, C., Wong, T. Y., Cheng, C. (2020). Logistic regression was as good
as machine learning for predicting major chronic diseases. Journal of Clinical
Epidemiology, 122, 56–69. https://doi.org/10.1016/j.jclinepi.2020.03.002
[36] Sufriyana, H., Husnayain, A., Chen, Y. L., Kuo, C. Y., Singh, O., Yeh, T. Y.,
Wu, Y. W., Su, E. C. Y. (2020). Comparison of multivariable logistic regres-
sion and other machine learning algorithms for prognostic prediction studies in
Pregnancy Care: Systematic Review and Meta-Analysis. JMIR Medical Infor-
matics, 8(11), e16503. https://doi.org/10.2196/16503
[38] Ying CAO , Qi-Guang MIAO , Jia-Chen LIU , Lin GAO. (June
2013). Acta Automatica Sinica Volume 39, Issue 6, Pages 745-758.
https://doi.org/10.1016/S1874-1029(13)60052-X
[39] Li Lang, Liang Tiancai, Ai Shan, Tang Xiangyan (May 2021). An improved
random forest algorithm and its application to wind pressure prediction Volume
37, Issue 2 , pages: 1802-1802. https://doi.org/10.1002/int.22448
[40] Amit Pandey, Achin Jain, “Comparative Analysis of KNN Algorithm using
Various Normalization Techniques”, International Journal of Computer Network
and Information Security (IJCNIS), Vol. 9, No. 11, pp. 36-42, 2017.
DOI: 10.5815/ijcnis.2017.11.04
[41] Chan, K. Y., Abu-Salih, B., Qaddoura, R., Al-Zoubi, A. M., Palade, V., Pham,
D., Del Ser, J., Muhammad, K. (2023). Deep neural networks in the cloud:
Review, applications, challenges and research directions. Neurocomputing, 545,
126327. https://doi.org/10.1016/j.neucom.2023.126327
[48] Haykin, S. (2009). Neural Networks and Learning Machines (3rd ed.). Prentice
Hall.
[49] Long, J., Shelhamer, E., & Darrell, T. (2015). “Fully convolutional
networks for semantic segmentation.” Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 3431-3440.
[50] Nusinovici, S., Tham, Y. C., Yan, M. Y. C., Ting, D. S. W., Li, J., Sa-
banayagam, C., Wong, T. Y., Cheng, C. (2020). Logistic regression was as good
as machine learning for predicting major chronic diseases. Journal of Clinical
Epidemiology, 122, 56–69. https://doi.org/10.1016/j.jclinepi.2020.03.002
[52] Yakobu Dasari, Katiki Rishitha and Ongole Gandhi, Prediction of Bank Loan
Status Using Machine Learning Algorithm (2023), International Journal of
Computing and Digital Systems, ISSN (2210-142X), Int. J. Com. Dig. Sys. 14,
No. 1 (Jul-23), http://dx.doi.org/10.12785/ijcds/140113