Predicting Credit Card Approvals using ML Techniques
Ravjot Singh
Oct 27, 2020
In this project, we'll be using the Credit Card Approval dataset from the UCI Machine Learning Repository. The structure of our project will be as follows:
Get a basic introduction to the project and the business problem associated with it.
Load and view the dataset.
Handle any missing entries in the dataset.
Perform exploratory data analysis (EDA) on our dataset.
Pre-process the data before applying a machine learning model to the dataset.
Apply a machine learning model that can predict whether an individual's application for a credit card will be accepted or not.
Credit Card Applications and the problems associated with them
Nowadays, banks receive a lot of applications for the issuance of credit cards. Many of them are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is an error-prone and time-consuming process. Luckily, this task can be automated with the power of machine learning, and pretty much every bank does so nowadays. In this project, we will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.
FIRST TASK
Importing the pandas package and loading the dataset.
1. pandas: Pandas is used to read the dataset file and import it as a DataFrame, which is similar to a table with rows and columns.
Importing the pandas package and loading the dataset
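Below is a minimal sketch of this step (the original code is shown as a screenshot). The local file name crx.data and the DataFrame name cc_apps are assumptions; header=None is passed because the UCI file ships without a header row.

```python
# Import pandas and load the dataset (assumed local file name: crx.data)
import pandas as pd

cc_apps = pd.read_csv("crx.data", header=None)

# View the first few rows of the DataFrame
print(cc_apps.head())
```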
OUTPUT
On observing the above, the output appears a bit confusing at first sight, but let's try to figure out the most important features of a credit card application. We find that since the data is confidential, the contributor of this dataset has anonymized the feature names to protect privacy, but this blog gives us a pretty good overview of the probable features. The probable features in a typical credit card application are Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income, and finally the ApprovalStatus.
SECOND TASK
As we can see from our first look at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some pre-processing, but before we do that, let's learn a bit more about the dataset to see if there are other issues that need to be fixed.
So, let's start by printing the summary statistics and the DataFrame information:
Data analysis part
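The following sketch shows what this step might look like, reusing the cc_apps DataFrame assumed in the earlier snippet:

```python
# Print summary statistics of the numeric columns
print(cc_apps.describe())

# Print DataFrame information: column dtypes and non-null counts
print(cc_apps.info())

# Inspect the last rows, where some of the '?' missing-value markers appear
print(cc_apps.tail(17))
```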
OUTPUT
THIRD TASK
Manipulating the data — Part -1
We’ve uncovered some issues that will affect the performance of our machine learning
model if they go unchanged:
Our dataset contains both numeric and non-numeric data (specifically, data of the float64, int64, and object types). Features 2, 7, 10, and 14 contain numeric values (of types float64, float64, int64, and int64 respectively), and all the other features contain non-numeric values (of type object).
The dataset also contains values from several ranges. Some features have a value range
of 0–28, some have a range of 2–67, and some have a range of 1017–100000. Apart
from these, we can get useful statistical information (like mean , max , and min ) about the
features that have numerical values.
Finally, the dataset has missing values, which we’ll take care of in this task. The missing
values in the dataset are labeled with ‘?’, which can be seen in the last cell’s output.
Now, let’s temporarily replace these missing value question marks with NaN.
numpy package: numpy enables us to work with arrays with great efficiency.
Importing numpy package and manipulating the dataset
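A sketch of this replacement, assuming the cc_apps DataFrame from the earlier snippets:

```python
# Import numpy and replace the '?' markers with NaN
import numpy as np

cc_apps = cc_apps.replace("?", np.nan)

# Verify the replacement by viewing the same rows as before
print(cc_apps.tail(17))
```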
OUTPUT
FOURTH TASK
Manipulating the data — Part -2
We have replaced all the question marks with NaNs. This is going to help us in the next
missing value treatment that we are going to perform in this task.
An important question that gets raised here is: why are we giving so much importance to missing values? Can't they just be ignored? Ignoring missing values can heavily affect the performance of a machine learning model. By ignoring the missing values, our machine learning model may miss out on information about the dataset that could be useful for its training. There are also many models that cannot handle missing values implicitly, such as LDA.
So, to avoid this problem, we are going to impute the missing values with a strategy
called mean imputation.
Imputing the missing values with mean
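A sketch of mean imputation on the numeric columns, again assuming the cc_apps DataFrame used above:

```python
# Fill missing values in the numeric columns with the mean of each column
cc_apps.fillna(cc_apps.mean(numeric_only=True), inplace=True)

# Count the remaining missing values per column; the non-numeric
# columns will still show NaNs at this point
print(cc_apps.isnull().sum())
```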
OUTPUT
FIFTH TASK
Manipulating the data — Part -3
We have successfully taken care of the missing values present in the numeric columns.
There are still some missing values to be imputed for columns 0, 1, 3, 4, 5, 6, and 13. All of these columns contain non-numeric data, and this is why the mean imputation strategy would not work here. This needs a different treatment.
We are going to impute these missing values with the most frequent value present in the respective column. This is good practice when it comes to imputing missing values for categorical data in general.
Imputing the missing values with the most frequent value in that column
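A sketch of most-frequent-value imputation for the non-numeric columns, under the same assumptions as above:

```python
# Impute missing values in the object (non-numeric) columns with
# the most frequent value of the respective column
for col in cc_apps.columns:
    if cc_apps[col].dtype == "object":
        cc_apps[col] = cc_apps[col].fillna(cc_apps[col].value_counts().index[0])

# Confirm that no missing values remain
print(cc_apps.isnull().sum())
```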
“OUTPUT” shows that there are no more missing values in the dataset
SIXTH TASK
Pre-processing the data — Part -1
The missing values are now successfully handled.
There is still some minor but essential data pre-processing needed before we proceed towards building our machine learning model. We are going to divide these remaining pre-processing steps into three main tasks:
1. Convert the non-numeric data into numeric.
2. Split the data into train and test sets.
3. Scale the feature values to a uniform range.
First, we will be converting all the non-numeric values into numeric ones. We do this not only because it results in faster computation, but also because many machine learning models (like XGBoost, and especially the ones developed using scikit-learn) require the data to be in a strictly numeric format. We will do this by using a technique called label encoding.
Converting the non-numeric values into numeric values
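A sketch of label encoding with scikit-learn, assuming the cc_apps DataFrame from the previous steps:

```python
from sklearn.preprocessing import LabelEncoder

# Label-encode every remaining non-numeric column so the whole
# DataFrame contains only numeric values
le = LabelEncoder()
for col in cc_apps.columns:
    if cc_apps[col].dtype == "object":
        cc_apps[col] = le.fit_transform(cc_apps[col])
```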
SEVENTH TASK
Splitting the dataset into training and test sets
We have successfully converted all the non-numeric values to numeric ones.
Now, we will split our data into a train set and a test set to prepare it for the two different phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to scale the training data or to direct the training process of a machine learning model. Hence, we first split the data and then apply the scaling.
Also, features like DriversLicense and ZipCode are not as important as the other features in
the dataset for predicting credit card approvals. We should drop them to design our
machine learning model with the best set of features. In Data Science literature, this is often
referred to as feature selection.
Splitting the data into training set (70%) and test set (30%)
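A sketch of the split. The column positions 11 (DriversLicense) and 13 (ZipCode), the assumption that the last column holds the ApprovalStatus label, and the random_state value are all inferred from the feature list above rather than taken from the original code:

```python
from sklearn.model_selection import train_test_split

# Drop DriversLicense (column 11) and ZipCode (column 13); positions assumed
cc_apps = cc_apps.drop([11, 13], axis=1)
cc_apps = cc_apps.values  # convert the DataFrame to a NumPy array

# Separate features (X) and label (y); the last column is ApprovalStatus
X, y = cc_apps[:, :-1], cc_apps[:, -1]

# Split into a 70% training set and a 30% test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```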
EIGHTH TASK
Pre-processing the data — Part -2
The data is now split into two separate sets — train and test sets respectively. We are only
left with one final pre-processing step of scaling before we can fit a machine learning model
to the data.
Now, let’s try to understand what these scaled values mean in the real world. Let’s
use CreditScore as an example. The credit score of a person is their creditworthiness based
on their credit history. The higher this number, the more financially trustworthy a person is
considered to be. So, a CreditScore of 1 is the highest since we're rescaling all the values to
the range of 0-1.
Scaling the feature values to a given range
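A sketch of the scaling step with MinMaxScaler, fitting on the training set only and reusing the split assumed above:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale every feature to the 0-1 range; fit the scaler on the training
# data only, then apply the same transformation to the test data
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
```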
NINTH TASK
Fitting a Logistic Regression Model to the training set
Essentially, predicting if a credit card application will be approved or not is a classification
task. According to UCI, our dataset contains more instances that correspond to “Denied”
status than instances corresponding to “Approved” status. Specifically, out of 690 instances,
there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got
approved.
This gives us a benchmark. A good machine learning model should be able to accurately
predict the status of the applications with respect to these statistics.
Which model should we pick? A question to ask is: are the features that affect the credit
card approval decision process correlated with each other? Although we can measure
correlation, that is outside the scope of this notebook, so we’ll rely on our intuition that they
indeed are correlated for now. Because of this correlation, we’ll take advantage of the fact
that generalized linear models perform well in these cases. Let’s start our machine learning
modeling with a Logistic Regression model (a generalized linear model).
sklearn package: This machine learning library includes numerous machine learning algorithms already built in, with sensible defaults for their parameters, so they work right out of the box.
Importing Logistic Regression classification model from sklearn package
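A sketch of fitting the classifier with its default parameters on the scaled training data from the previous step:

```python
from sklearn.linear_model import LogisticRegression

# Instantiate the logistic regression classifier with default parameters
# and fit it to the scaled training data
logreg = LogisticRegression()
logreg.fit(rescaledX_train, y_train)
```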
OUTPUT
TENTH TASK
Making predictions and evaluating the performance of the model
But how well does our model perform?
We will now evaluate our model on the test set with respect to classification accuracy. But we will also take a look at the model's confusion matrix. In the case of predicting credit card applications, it is equally important to see whether our machine learning model correctly labels as denied the applications that originally got denied. If our model is not performing well in this aspect, then it might end up approving applications that should have been denied. The confusion matrix helps us view our model's performance from these aspects.
Predicting the accuracy of model on the test set
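A sketch of the evaluation step, reusing the fitted logreg model and the scaled test data assumed above:

```python
from sklearn.metrics import confusion_matrix

# Predict approval status for the test set and measure classification accuracy
y_pred = logreg.predict(rescaledX_test)
print("Accuracy of logistic regression classifier:",
      logreg.score(rescaledX_test, y_test))

# Rows are the actual classes, columns are the predicted classes
print(confusion_matrix(y_test, y_pred))
```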
“OUTPUT” showing the accuracy of our classification model
ELEVENTH TASK
Grid Search and making the model perform better
Our model was pretty good! It was able to yield an accuracy score of almost 84%.
For the confusion matrix, the first element of the first row denotes the true negatives, meaning the number of negative instances (denied applications) predicted correctly by the model. The last element of the second row denotes the true positives, meaning the number of positive instances (approved applications) predicted correctly by the model.
Let’s see if we can do better. We can perform a grid search of the model parameters to
improve the model’s ability to predict credit card approvals.
scikit-learn's implementation of logistic regression has several hyperparameters, but we will grid search over the following two:
tol
max_iter
Applying Hyper-parameters to make the model perform better
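A sketch of defining the grid; the specific candidate values for tol and max_iter are illustrative assumptions, not necessarily the values used in the original notebook:

```python
# Candidate values for the two hyperparameters we will grid search over
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# GridSearchCV expects the grid as a dictionary mapping
# parameter names to lists of candidate values
param_grid = dict(tol=tol, max_iter=max_iter)
```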
TWELFTH TASK
We have defined the grid of hyperparameter values and converted them into a single
dictionary format which GridSearchCV() expects as one of its parameters. Now, we will
begin the grid search to see which values perform best.
We will instantiate GridSearchCV() with our earlier logreg model with all the data we have.
Instead of passing train and test sets separately, we will supply X (scaled version) and y .
We will also instruct GridSearchCV() to perform a cross-validation of five folds.
We’ll end the notebook by storing the best-achieved score and the respective best
parameters.
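A sketch of this final step, reusing the logreg model, the param_grid dictionary, and the scaler assumed in the earlier snippets:

```python
from sklearn.model_selection import GridSearchCV

# Scale the full feature matrix and run a 5-fold cross-validated grid search
rescaledX = scaler.fit_transform(X)
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)
grid_model_result = grid_model.fit(rescaledX, y)

# Store and report the best score and the hyperparameters that achieved it
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))
```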
Best score after applying hyper-parameters
OUTPUT
CONCLUSION:
While building this credit card approval predictor, we tackled some of the most widely known pre-processing steps such as scaling, label encoding, and missing value imputation. We finished by building a machine learning model to predict whether a person's application for a credit card would be approved, given some information about that person.