
LOAN APPROVAL PREDICTION USING MACHINE LEARNING

Dr. P. Jhansi Lakshmi1, P. Pujitha2, SK. Khaleel Babu3, M. Avinash4, M. Shanmukha Priya5

1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected], 5 [email protected]

1,2,3,4,5 Department of CSE, Vignan’s Foundation for Science, Technology and Research, Vadlamudi, Guntur, Andhra Pradesh, India

Abstract:
The process of loan approval is crucial for financial institutions, as it involves assessing the
risk associated with lending funds. Traditional methods for loan approval are time-consuming
and often subjective, leading to delays and inconsistencies. To address this issue, machine
learning (ML) models have been increasingly employed to automate and enhance the
accuracy of loan predictions. This project explores and compares multiple ML models,
including XGBoost, Random Forest, Support Vector Machine (SVM), and Logistic
Regression, to determine the most effective approach for loan approval prediction. The
models are trained on historical loan data, considering various financial and demographic
features of applicants. A comparative analysis is conducted based on performance metrics
such as accuracy, precision, recall, and F1-score. Through this analysis, the study aims to
identify the most reliable model that minimizes errors in prediction while reducing processing
time. Preliminary results indicate that ensemble-based models, particularly XGBoost and
Random Forest, outperform other classifiers in terms of accuracy and robustness.
Implementing such predictive models can significantly streamline the loan approval process,
enhancing decision-making efficiency and reducing financial risks for lenders.
Keywords: Loan Approval Prediction, Machine Learning, Random Forest, Support Vector Machine (SVM), Logistic Regression, XGBoost, Credit Scoring, Financial Risk Assessment.

1 Introduction:
In banking and financial institutions, loan approval is essential for institutional growth and stability. Traditional evaluation of loan proposals depends on manual validation and risk-based checks, which take time and are prone to error, sometimes rejecting deserving applicants or approving default-prone ones. Machine learning (ML) models can reduce these delays and improve predictive accuracy in loan approval, enabling better decision making, fewer loan defaults, and more effective asset deployment.
Simple ML algorithms such as Logistic Regression, Decision Trees, and Support Vector Machines (SVMs) have been applied to loan prediction systems with satisfactory outcomes. As loan data grows more complex, however, more sophisticated models are needed to capture intricate relationships and improve prediction accuracy. Class imbalance, where approved loans far outnumber defaults, is one of the major challenges in loan prediction, since it can skew predictions towards the majority class (approved loans). To address this, models such as Random Forests, XGBoost, Neural Networks, and SVM have been used to enhance accuracy and manage class imbalance. This paper examines the use of these machine learning models to predict loan approval based on features such as credit score, income, loan amount, and employment status. The system automates the decision-making process, helping financial institutions reduce risk and make the approval process more efficient while remaining free from prejudice and making informed decisions.
The subsequent sections describe the training dataset, the algorithms used, their evaluation metrics, and the results achieved.
2 Literature Review
1. Loan Approval Prediction based on Machine Learning Approach
Authors: Kumar Arun, Garg Ishan, Kaur Sanmeet
Year: 2023
This paper focuses on predicting whether granting a loan to a specific person is safe. The work is divided into four parts: (i) data collection, (ii) comparison of machine learning models, (iii) training the model, and (iv) testing.
2. Exploring Machine Learning Algorithm for Loan Sanctioning
Authors: E. Chandra Blessie, R. Rekha
Year: 2024
This paper addresses the challenge faced by banks and NBFCs in granting loans amidst
limited capital. It uses past customer data and a trained machine learning model to predict
loan repayment. The study shows Naïve Bayes as the most effective model for loan
forecasting.
3. Loan Prediction using Machine Learning Model
Year: 2024
The paper focuses on reducing the risk in loan approval by predicting whether it is safe to grant a loan. The methodology involves mining past loan data to train a machine learning model. The paper compares classification models such as Logistic Regression, Decision Trees, and Gradient Boosting to predict loan safety.
4. Loan Prediction using Decision Tree and Random Forest
Authors: Kshitiz Gautam, Arun Pratap Singh, Keshav Tyagi, Mr. Suresh Kumar
Year: 2023
The paper addresses the growing number of loan applications in India and aims to predict
whether a customer will repay the loan. It uses exploratory data analysis techniques to
classify applicants as defaulters or non-defaulters, focusing on decision tree and random
forest models.
5. Loan Default Prediction Using Neural Networks and Random Forest
Authors: Kumar Ashish, Yadav Pooja
Year: 2024
This study focuses on classifying borrowers as defaulters or non-defaulters based on their
credit history. It compares the performance of Neural Networks and Random Forest models,
concluding that ensemble methods like Random Forest provide more reliable predictions than
standalone deep learning models.
6. Enhancing Loan Approval Decisions Using Machine Learning
Authors: Choudhary Deepak, Agrawal Simran
Year: 2023
The paper proposes a machine learning framework to improve loan approval decisions. It
evaluates models such as Logistic Regression, K-Nearest Neighbors (KNN), and XGBoost,
demonstrating that XGBoost achieves the best predictive accuracy with minimal overfitting.
7. A Hybrid Model for Loan Risk Assessment Using Machine Learning
Authors: Mishra Alok, Sharma Neha
Year: 2024
This research presents a hybrid machine learning model that combines Decision Trees with
Gradient Boosting for loan approval prediction. The study finds that hybrid models improve
classification performance, particularly in handling imbalanced datasets.
8. Loan Approval System Using Explainable AI Techniques
Authors: Desai Rakesh, Kapoor Ananya
Year: 2023
The paper discusses the implementation of explainable AI in loan approval systems. It applies
SHAP values and LIME to improve transparency in machine learning models such as
Random Forest and Neural Networks, ensuring better trust in automated decisions.
3. Loan Prediction Methodology:
Import the necessary libraries, such as scikit-learn, pandas, and NumPy, to process the data and build a prediction model. Load the loan data into a pandas DataFrame. Split the preprocessed data into two subsets: a training set and a testing set. The predictive model is trained on the training set, and its performance is assessed on the testing set. Select a suitable machine learning algorithm, such as Random Forest, Decision Tree, or Logistic Regression, to predict whether a loan will be approved. Create an instance of the selected model and set any required hyperparameters. Fit the model to the training data using the fit() method; the model learns patterns and relationships in the training data in order to produce predictions. Based on an applicant's characteristics, the model classifies each loan application as approved or denied. Finally, compare the actual loan approval labels of the testing set with the predicted labels. The overall flow is represented in Fig. 1.
Figure 1: Flowchart of Loan Amount Prediction (collection of dataset → feature selection using information gain of features → train model on training dataset → test model on testing dataset → result analysis)
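As an illustration, a minimal sketch of this pipeline is given below. It assumes the features have already been numerically encoded and that the target column is named Loan_Status; both are assumptions, as the exact column names are not stated in the paper.

# Minimal sketch of the methodology above; column names are assumptions, not confirmed.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("loan_data_set.csv")               # load the loan data into a DataFrame
X = df.drop(columns=["Loan_Status"])                # features (assumed target name: Loan_Status)
y = df["Loan_Status"]                               # approval label

# Split the preprocessed data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)   # chosen classifier
model.fit(X_train, y_train)                         # learn patterns from the training data
y_pred = model.predict(X_test)                      # classify each application as approved/denied
print("Accuracy:", accuracy_score(y_test, y_pred))  # compare predicted vs. actual labels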


A. Algorithms Used
a). Random Forest
Random Forest (RF) is a widely favoured supervised learning technique. It can be used for ML problems involving both classification and regression. It is based on the concept of ensemble learning, a technique that integrates many classifiers to handle difficult problems and improve model performance. As its name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Rather than relying solely on one decision tree (DT), the random forest collects the predictions from each decision tree and predicts the outcome based on the majority vote of those predictions. The Random Forest method is illustrated by the diagram below.
b). Decision Tree
A decision tree (DT) is a prediction model that uses a flowchart-like structure to base decisions on incoming data. Branches are built from the data, and the results are placed at the leaf nodes. Decision trees provide models that are simple to comprehend and are suitable for both classification and regression problems. The tree structure is made up of a root node, branches, internal nodes, and leaf nodes, giving it the appearance of a hierarchical tree, as shown in Fig. 3.

Figure 3: Flowchart for Decision Tree (DT) Algorithm
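A brief sketch of fitting such a decision tree with scikit-learn is shown below; it reuses the X_train/X_test split from the earlier pipeline sketch and assumes the features are already numeric.

# Illustrative decision tree fit on already-encoded loan features (sketch only).
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(max_depth=5, random_state=42)  # a shallow tree stays interpretable
dt_clf.fit(X_train, y_train)        # internal nodes split on features, leaves hold the outcomes
dt_pred = dt_clf.predict(X_test)    # classify each application as approved or denied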


c). Support Vector Machine (SVM)
The Support Vector Machine (SVM) is a predictive model that utilizes a hyperplane-based
structure to classify incoming data. It constructs decision boundaries by identifying the
optimal hyperplane that best separates different classes in a high-dimensional space. The
SVM model is widely used for classification and regression tasks, offering a robust approach
to handling both linearly and non-linearly separable data. Kernel functions play a crucial role
in SVM by transforming input data into higher dimensions to make it linearly separable. This
supervised learning technique is effective for complex decision-making scenarios where high
accuracy and generalization are required.

Figure 4: Flowchart for SVM Algorithm
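A minimal scikit-learn sketch of such an SVM classifier is given below; the RBF kernel and the preceding scaling step are illustrative choices rather than the exact configuration used in this study, and the X_train/X_test split from the earlier sketch is reused.

# Illustrative SVM sketch; features are scaled first because SVMs are distance-based.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm_clf = make_pipeline(
    StandardScaler(),                         # scale features before the kernel computation
    SVC(kernel="rbf", C=1.0, gamma="scale"),  # RBF kernel handles non-linearly separable data
)
svm_clf.fit(X_train, y_train)
svm_pred = svm_clf.predict(X_test)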


d). XGBoost
The XGBoost (Extreme Gradient Boosting) model follows an ensemble-based structure
that iteratively refines predictions by combining multiple decision trees. It is a supervised
learning algorithm used for both classification and regression problems. XGBoost builds
trees sequentially, where each new tree corrects the errors of the previous ones using gradient
boosting. The model is highly efficient, leveraging parallel processing and optimized
memory usage to handle large datasets effectively. XGBoost applies regularization
techniques like L1 and L2 to prevent overfitting, making it one of the most powerful
machine learning algorithms for structured data. The key components of XGBoost include
decision trees, gradient boosting framework, and regularization mechanisms, forming a
structured and hierarchical model for predictive analytics.

Figure 5: Flowchart for XGBoost Algorithm
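A hedged sketch of an XGBoost classifier with the regularization terms mentioned above is shown below; the hyperparameter values are illustrative, and the target label is assumed to have been encoded as 0/1.

# Illustrative XGBoost sketch; hyperparameters are examples, not the study's exact settings.
from xgboost import XGBClassifier

xgb_clf = XGBClassifier(
    n_estimators=200,       # number of boosted trees built sequentially
    learning_rate=0.1,      # shrinkage applied to each tree's contribution
    max_depth=4,            # depth of each individual tree
    reg_alpha=0.1,          # L1 regularization
    reg_lambda=1.0,         # L2 regularization
    eval_metric="logloss",
)
xgb_clf.fit(X_train, y_train)    # assumes y_train is encoded as 0/1
xgb_pred = xgb_clf.predict(X_test)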


B. Dataset Used
Kaggle, a well-known platform for machine learning (ML) competitions, hosts a number of loan default prediction datasets. These datasets frequently comprise a variety of attributes pertaining to loan applications, borrower profiles, and payment history. We imported the loan dataset from Kaggle. Using the instruction df = pd.read_csv("loan_data_set.csv"), we read the imported dataset and assign it to a DataFrame named df.
3.1 Data Collection and Preprocessing
The dataset used consists of loan application records with the following key attributes:
• Applicant Information: Age, Income, Employment Type, Credit Score
• Loan Information: Loan Amount, Loan Term, Interest Rate
• Credit History: Number of previous loans, Default History
• Other Factors: Co-applicant details, Property Area
Preprocessing Steps (a code sketch follows this list):
• Handling Missing Values: Imputing missing values using the mean/median for numerical data and the mode for categorical data.
• Feature Encoding: Converting categorical variables (e.g., Employment Type, Property Area) into numerical format using One-Hot Encoding.
• Normalization & Scaling: Scaling numerical features to ensure models train effectively.
• Train-Test Split: Splitting the dataset into 80% training and 20% testing data.
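The sketch below illustrates these preprocessing steps with pandas and scikit-learn. The column names Loan_ID and Loan_Status and the Y/N label values are assumptions based on the typical Kaggle loan dataset, not details confirmed by this paper.

# Sketch of the preprocessing steps listed above (column names and labels are assumed).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("loan_data_set.csv")
df = df.drop(columns=["Loan_ID"], errors="ignore")           # drop the identifier if present

# Handling missing values: median for numeric columns, mode for categorical columns
num_cols = df.select_dtypes(include=np.number).columns
cat_cols = df.select_dtypes(exclude=np.number).columns.drop("Loan_Status", errors="ignore")
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Feature encoding and scaling
df = pd.get_dummies(df, columns=list(cat_cols), drop_first=True)   # one-hot encode categoricals
df[num_cols] = StandardScaler().fit_transform(df[num_cols])        # scale numeric features

# 80/20 train-test split on the (assumed) Loan_Status target, mapped from Y/N to 1/0
X = df.drop(columns=["Loan_Status"])
y = df["Loan_Status"].map({"Y": 1, "N": 0})
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)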
3.2 MACHINE LEARNING MODELS:
• Decision Trees (C5.0):
The basic decision tree algorithm [7] requires that all attributes or features be discretized. Feature selection is based on the greatest information gain of the features. The knowledge depicted in a decision tree can be represented in the form of IF-THEN rules. This model is an extension of the C4.5 classification algorithm described by Quinlan.
• Random Forest (RF):
Random forests are an ensemble learning method for classification (and regression) that works by building a large number of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.
• Support Vector Machine (SVM):
Support vector machines are supervised learning models with associated learning algorithms that analyze features and identify patterns, and are used for classification. Using the kernel trick, SVMs can efficiently perform regression by implicitly mapping their inputs into high-dimensional feature spaces.
• Linear Models (LM):
The Linear Model [10] is mathematically identical to a multiple regression analysis but stresses its suitability for both multiple qualitative and multiple quantitative variables.
• XGBoost (Extreme Gradient Boosting):
XGBoost is a highly efficient, scalable, and flexible machine learning algorithm based on gradient boosting. It combines decision trees with a boosting technique to optimize the performance of predictive models. XGBoost performs well in classification, regression, and ranking tasks, leveraging regularization to prevent overfitting and achieve better accuracy and efficiency on large datasets.
4. PROPOSED MODEL
This system predicts whether a loan should be approved or rejected. The system follows these steps:
Data Collection
Data Pre-processing (Data Cleaning)
Model Selection
Model Evaluation
Classification
Result (output)
Implementation Details (Modules):
4.1. Loan Dataset: The loan dataset is essential for the system to produce accurate predictions. Using the loan dataset, the system automatically predicts which customers' loans should be approved and which should be rejected. The system accepts a loan application form as input; the application form must be provided in the required format to be processed.
4.2. Determine the training and testing data: The system separates the dataset into a training set and a testing set; most of the data is used for training, and a smaller portion is used for testing. After the system has been trained on the training set, it makes predictions against the test set.
4.3. Data cleaning and processing: In data cleaning, the system detects and corrects corrupt or inaccurate records in the database; this involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. In data processing, the system converts the data from its given form into a much more usable and desired form, i.e. makes it more meaningful and informative.
5. SYSTEM ARCHITECTURE
6. PROPOSED ALGORITHM:
The following pseudocode outlines the proposed loan prediction method (a Python sketch of the modelling steps follows):
1. Load the data.
2. Determine the training and testing data.
3. Data cleaning and pre-processing:
a) Fill the missing values with the mean for numerical variables.
b) Fill the missing values with the mode for categorical variables.
c) Outlier treatment.
4. Apply the modelling for prediction:
a) Remove the loan identifier.
b) Create the target variable (based on the requirement); in this approach, the target variable is loan status.
c) Create dummy variables for categorical variables (if required) and split the training and testing data for validation.
d) Apply the models: the Logistic Regression method and the SVM method.

5. Determine the accuracy, followed by the confusion matrix.
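A compact Python sketch of steps 4 and 5, reusing the preprocessed train/test split from Section 3.1 (illustrative only):

# Sketch of steps 4-5: apply Logistic Regression and SVM, then report accuracy and confusion matrix.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

for name, clf in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                  ("SVM", SVC(kernel="rbf"))]:
    clf.fit(X_train, y_train)                 # X_train/y_train from the split in Section 3.1
    y_pred = clf.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))   # step 5: confusion matrix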

Performance metrics
Machine learning models can exhibit a diverse range of characteristics and behaviors, making
it challenging to identify the optimal model for a given task. Consequently, it is crucial to
possess a set of tools that can assess the performance of machine learning models effectively.
Several commonly employed evaluation measures in machine learning are outlined below. Among these measures, accuracy, precision, recall, and F1-score stand out as the most widely used methods for evaluating model performance. The confusion matrix used for computing accuracy, precision, recall, and F1-score is built from the following outcomes (a short computation sketch follows the list):
• True Positives (TP) occur when the prediction is YES and the actual output is YES.
• True Negatives (TN) occur when the prediction is NO and the actual output is NO.
• False Positives (FP) occur when the prediction is YES but the actual output is NO.
• False Negatives (FN) occur when the prediction is NO but the actual output is YES.
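Using these definitions, the metrics can be computed with scikit-learn as sketched below (binary 0/1 labels assumed):

# Computing accuracy, precision, recall, and F1-score from the confusion matrix counts.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()    # TN, FP, FN, TP counts
print("Accuracy :", accuracy_score(y_test, y_pred))          # (TP + TN) / total
print("Precision:", precision_score(y_test, y_pred))         # TP / (TP + FP)
print("Recall   :", recall_score(y_test, y_pred))            # TP / (TP + FN)
print("F1-score :", f1_score(y_test, y_pred))                # harmonic mean of precision and recall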

Saving and Deploying the Model: After training, the best model is saved for future predictions. The model can be deployed through:
Web/Mobile Applications: Applicants can submit their loan details through the application, and the model will classify them immediately.
Cloud Deployment: The model can be hosted on AWS, Google Cloud, or TensorFlow Serving.
Edge Devices: The model can be saved in a lightweight format such as TensorFlow Lite for mobile use, allowing screening on the device.
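As one possible way to persist a trained scikit-learn model before deployment (joblib is a common choice for this, and the file name below is illustrative, not taken from the paper):

# Saving the trained classifier and reloading it for later predictions (sketch only).
import joblib

joblib.dump(rf_clf, "loan_approval_model.joblib")         # persist the fitted model
loaded_model = joblib.load("loan_approval_model.joblib")  # reload it in the serving application
prediction = loaded_model.predict(X_test[:1])             # score a single new application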
6. RESULTS AND DISCUSSION
We will go through each step of the program. First, Python programmers frequently use the df.head() function to show the first few rows of a DataFrame object. You can examine a preview of the data in the DataFrame df by executing df.head(); the first five rows of df are printed to the console when this code is run. The head() function accepts an integer as input if you want to display a different number of rows. For instance, df.head(10) will show the DataFrame's top ten rows.

A short overview of a DataFrame's structure and column information, including the data types and memory utilization, is provided by the df.info() method of the pandas library. It reports each column's data type, non-null count, and memory usage.

The expression df.isnull().sum() can be used to determine how many null or NaN (missing) values each column of the DataFrame df contains. It gives a full list of missing-value counts for all columns.
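In code, these inspection steps look as follows:

# Quick inspection of the loaded DataFrame, as described above.
print(df.head())          # first five rows; df.head(10) would show ten
df.info()                 # prints dtypes, non-null counts, and memory usage
print(df.isnull().sum())  # number of missing (null/NaN) values per column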

The code snippet df['LoanAmount_log'] = np.log(df['LoanAmount']) computes the natural logarithm of the 'LoanAmount' column in the DataFrame df and assigns the result to a new column named 'LoanAmount_log'. This transformation is frequently used to address a right-skewed data distribution. The next line, df['LoanAmount_log'].hist(bins=20), plots a histogram of the 'LoanAmount_log' column with 20 bins, letting you see the distribution of the transformed loan amounts.
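The corresponding code is sketched below; the matplotlib labelling calls are an assumption about how the axis and title labels were added.

# Log-transforming LoanAmount to reduce right skew, then plotting its distribution.
import numpy as np
import matplotlib.pyplot as plt

df["LoanAmount_log"] = np.log(df["LoanAmount"])   # natural log of the LoanAmount column
df["LoanAmount_log"].hist(bins=20)                # histogram of the transformed values, 20 bins
plt.xlabel("log(Loan Amount)")                    # illustrative axis and title labels
plt.ylabel("Frequency")
plt.title("Distribution of Log-Scaled Loan Amount")
plt.show()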

Figure 7: Plot of Log-Scaled Loan Amount


With the help of this code, the histogram is displayed with proper x-axis, y-axis, and title labels.
Figure 8: Plot of Loan Amount vs. Frequency
In the first section of Fig. 10, df['Gender'].value_counts() counts each distinct value in the 'Gender' column to determine the number of borrowers in each gender group; print() is then used to display this information.
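For example, a minimal sketch:

# Counting borrowers in each gender group.
print(df["Gender"].value_counts())   # frequency of each distinct value in the Gender column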

Random Forest is robust to feature scaling since it selects split points based on feature values
rather than distances. Standardization has minimal impact on its performance but can help
when combining with other models. It reduces overfitting by averaging multiple decision
trees for better generalization.
A) Random Forest

Using RandomForestClassifier from sklearn.ensemble, the model is trained on X_train and y_train using the fit method, learning patterns between the features and the target variable. Once trained, the rf_clf model can predict new data using the predict method. Random Forest, an ensemble learning technique, improves prediction accuracy by combining multiple decision trees, making it effective for handling complex datasets. The accuracy score represents the percentage of accurately predicted samples. The code then displays the predicted values in y_pred and outputs the accuracy score. The accuracy obtained from the Naive Bayes algorithm is 83.73%, as shown in the figure.
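A sketch of this training step (the hyperparameter values are illustrative):

# Training and evaluating the Random Forest classifier described above (sketch).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)         # learn patterns between the features and the target
y_pred = rf_clf.predict(X_test)      # predict approval for unseen applications
print("Accuracy:", accuracy_score(y_test, y_pred))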
B) Decision Tree

Feature Importance
Let us now find the feature importance, i.e. which features are most important for this problem. We will use the feature_importances_ attribute of scikit-learn to do so. It returns the feature importances (the higher the value, the more important the feature).
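A short sketch using the fitted Random Forest from above (any fitted tree-based estimator in scikit-learn exposes the same attribute):

# Ranking features by importance using the fitted tree-based model (higher = more important).
import pandas as pd

importances = pd.Series(rf_clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))   # ten most important features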

C) XGBoost
XGBoost works only with numeric variables, and we have already replaced the categorical variables with numeric ones. Let us have a look at the parameters that we are going to use in our model.
D) Logistic Regression
We start with the logistic regression model and then move on to more complex models such as Random Forest and XGBoost.
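The sketch below shows how such a side-by-side comparison could be produced on the same train/test split; it is illustrative only, and the accuracies reported by the authors are those listed in Table 1.

# Comparing the classifiers discussed above on the same split (illustrative sketch).
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {acc:.2%}")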

Table 1: Accuracy of different algorithms

Sl. No.   Algorithm             Accuracy
1         Random Forest         77.23%
2         Decision Tree         63.73%
3         XGBoost               83.73%
4         Logistic Regression   96.73%

7. Conclusion
The predictive models based on Logistic Regression, Decision Tree, and Random Forest give accuracies of 80.945%, 93.648%, and 83.388%, whereas their cross-validation scores are 80.945%, 72.213%, and 80.130%, respectively. This shows that, for the given dataset, the decision tree model has the highest accuracy, but random forest generalizes better even though its cross-validation score is not much higher than that of logistic regression.
Future Work
• Feature Engineering: Incorporate additional features such as bank transaction history and customer behavior.
• Deep Learning Models: Experiment with neural networks for improved predictions.
• Explainability: Use SHAP values to explain model decisions for regulatory compliance.
By integrating ML-based automation into financial services, institutions can achieve faster,
data-driven, and more reliable loan approval decisions.
8. References
1. Krishnaraj P., Rita S., Jaiswal J. (2024). "Comparing Machine Learning Techniques
for Loan Approval Prediction," Proceedings of the 1st International Conference on
Artificial Intelligence, Communication, IoT, Data Engineering and Security (IACIDS
2023), IEEE.
2. Dharavath Sai Kiran, Avula Dheeraj Reddy, Suneetha Vazarla, Dileep P. (2023).
"Loan Approval Prediction using Adversarial Training and Data Science," Turkish
Journal of Computer and Mathematics Education (TURCOMAT).
3. F. M. Ahosanul Haque, Md. Mahbubur Rahman (2023). "A Machine Learning
Approach for Credit Risk Prediction in Loan Approval Systems," Springer Lecture
Notes in Computer Science.
4. A. Singh, P. Gupta, R. Kumar (2024). "Loan Default Prediction Using Hybrid
Machine Learning Models," IEEE Transactions on Computational Social Systems.
5. X. Zhao, J. Wang, L. Chen (2022). "Ensemble Learning-Based Credit Scoring for
Loan Approval," Journal of Financial Data Science.
6. M. S. Khan, T. Rahman, H. Hasan (2023). "Predicting Loan Approval Using
Supervised Machine Learning Algorithms," International Journal of Machine
Learning and Cybernetics.
7. S. Bose, N. Raj, P. Das (2024). "Application of Neural Networks in Loan Approval
Prediction," Expert Systems with Applications.
8. L. Zhang, C. Li, Z. Wang (2022). "Deep Learning Approaches for Loan Approval
Decision Making," Neural Computing and Applications.
9. T. Kumar, M. Verma (2023). "Comparative Study of Machine Learning Models for
Credit Risk Assessment," International Journal of Artificial Intelligence & Data
Science.
10. V. Sharma, R. Prasad (2024). "Random Forest and XGBoost for Loan Approval
Prediction: A Case Study," IEEE Access.
11. H. Wei, J. Sun, X. Lu (2022). "Bayesian Network-Based Credit Risk Evaluation for
Loan Processing," Computational Intelligence and Finance.
12. K. Patel, M. Mehta (2023). "Automated Loan Approval System Using Natural
Language Processing and ML," ACM Transactions on Intelligent Systems and
Technology.
13. R. Nair, J. Thomas (2024). "Enhancing Loan Approval Prediction Using Federated
Learning Models," Journal of Financial Technology and Innovation.
14. P. Malhotra, A. Roy (2022). "Feature Selection Methods for Improving Loan
Approval Classification Models," Springer Advances in Data Science.
15. S. Pandey, T. Agarwal (2023). "Loan Repayment Prediction Using Gradient Boosting
and Explainable AI," Elsevier Applied Soft Computing.
16. B. Roy, H. Chatterjee (2024). "Comparative Analysis of Support Vector Machines and
Neural Networks for Loan Default Prediction," IEEE Transactions on Financial
Engineering.
17. C. Wang, F. Li (2022). "Hybrid ML Models for Real-Time Loan Approval Decisions,"
Journal of AI in Banking and Finance.
18. D. Evans, J. Roberts (2023). "Improving Fairness in Loan Approvals Using AI Ethics
Frameworks," International Journal of Ethics in AI and Machine Learning.
19. S. Yadav, K. Bansal (2024). "An Explainable AI Model for Loan Approval
Decisions," ACM Transactions on Computational Finance.
20. N. Gupta, V. Saxena (2023). "Evaluating the Role of Big Data in Machine Learning-
Based Credit Scoring Models," Springer Journal of Banking Analytics.
