M4 Data Mining W4 Business Report

The document discusses using clustering and classification models to analyze customer data for a bank and an insurance company. For the bank, K-means clustering with 4 clusters is used to segment customers and recommend promotional strategies. For the insurance firm, CART, random forest, and ANN models are built on claim data, and random forest is identified as the best-performing model.


Problem 1: Clustering

A leading bank wants to develop a customer segmentation to give promotional offers


to its customers. They collected a sample that summarizes the activities of users
during the past few months. You are given the task to identify the segments based
on credit card usage.

1.1 Read the data and do exploratory data analysis. Describe the data briefly.

After loading the data, we can see that there are 7 features and 210 records in the dataset.

The 7 features are:
 spending: Amount spent by the customer per month (in 1000s)
 advance_payments: Amount paid by the customer in advance by cash (in 100s)
 probability_of_full_payment: Probability of payment done in full by the customer to the bank
 current_balance: Balance amount left in the account to make purchases (in 1000s)
 credit_limit: Limit of the amount in the credit card (in 10000s)
 min_payment_amt: Minimum amount paid by the customer while making payments for purchases made monthly (in 100s)
 max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)

There are no duplicate records in the dataset.

There are no null values in the dataset.

Let's see information about the dataset. All columns have 'float64' as their datatype.

Let's see the description/five-point summary of the dataset.


Let's see the histograms.

Let's see the covariance of the dataset.

Let's see the correlation of the dataset.

Let's see the correlation matrix.

Let's see the pair plot.
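
Below is a minimal Python sketch of the EDA steps above; the file name 'bank_customers.csv' is an assumption and should be replaced with the actual data file.

# EDA sketch; the file name 'bank_customers.csv' is an assumption.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("bank_customers.csv")

print(df.shape)               # expect (210, 7)
print(df.duplicated().sum())  # duplicate records
print(df.isnull().sum())      # null values per column
df.info()                     # datatypes (all float64)
print(df.describe())          # five-point summary

df.hist(figsize=(10, 8))      # histograms of each feature
plt.show()

print(df.cov())               # covariance
print(df.corr())              # correlation

sns.heatmap(df.corr(), annot=True)  # correlation matrix
plt.show()

sns.pairplot(df)              # pair plot
plt.show()
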
1.2 Do you think scaling is necessary for clustering in this case? Justify

Let's look at the description of the data to understand the 'weight' of each feature on the overall dataset.

The 'max' values of the features are on very different scales, so their impact on the overall dataset differs widely. To standardize the impact of all features, scaling is needed.

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum
clusters using Dendrogram and briefly describe them

Let's first scale the data.

Let's see the dendrogram.

Let's see the truncated dendrogram, showing the last 25 clusters only.

Let's find the optimum number of clusters, first based on the 'maxclust' criterion. We get two clusters.

Now, let's find the clusters based on the 'distance' criterion. Again, we get two clusters.
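
Below is a minimal sketch of the scaling and hierarchical clustering steps, reusing the DataFrame df from the EDA sketch; the 'ward' linkage method and the distance threshold of 20 are assumptions, not values taken from the report.

# Scaling and hierarchical clustering sketch; 'ward' linkage and the
# distance threshold are assumptions.
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

scaled = StandardScaler().fit_transform(df)

link = linkage(scaled, method="ward")

dendrogram(link)                               # full dendrogram
plt.show()

dendrogram(link, truncate_mode="lastp", p=25)  # truncated: last 25 clusters
plt.show()

labels_maxclust = fcluster(link, 2, criterion="maxclust")   # 'maxclust' criterion
labels_distance = fcluster(link, 20, criterion="distance")  # 'distance' criterion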

1.4 Apply K-Means clustering on scaled data and determine optimum clusters.
Apply elbow curve and silhouette score.
First, let's see the WSS (within-cluster sum of squares) scores.

Now let's plot the WSS values as an elbow curve.

We get four optimal clusters.
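
Below is a minimal sketch of the elbow curve and silhouette score computation, reusing the scaled array from the previous sketch; the range of k values tried is an assumption.

# WSS (elbow) and silhouette scores for k = 2..10; the range is an assumption.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

wss, sil = [], []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(scaled)
    wss.append(km.inertia_)                       # within-cluster sum of squares
    sil.append(silhouette_score(scaled, labels))  # silhouette score for this k

plt.plot(list(ks), wss, marker="o")  # elbow curve
plt.xlabel("number of clusters k")
plt.ylabel("WSS (inertia)")
plt.show()

# Final model with the chosen number of clusters (4, per the elbow curve)
km4 = KMeans(n_clusters=4, random_state=42)
cluster_labels = km4.fit_predict(scaled)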

1.5 Describe cluster profiles for the clusters defined. Recommend different pro-
motional strategies for different clusters.
Recommendations

1. Cluster 0 should be given fewer general offers and more 'on the payment' club offers.

2. Cluster 1 should be given fewer general offers and more 'full payment' club offers.

3. Cluster 2 should be given all offers, as it holds the second-highest values in all attributes.

4. Cluster 3 holds high spending and advance payments, and the highest credit limit. Provide offers clubbing minimum-payment and full-payment benefits.

Description

 Cluster 0: Customers with the lowest values in all attributes other than min_payment_amt and max_spent_in_single_shopping. This cluster holds the highest min_payment_amt.

 Cluster 1: Customers with the second-lowest values in all attributes other than min_payment_amt and max_spent_in_single_shopping. This cluster holds the lowest min_payment_amt and max_spent_in_single_shopping.

 Cluster 2: Customers with the second-highest values in all attributes other than min_payment_amt.

 Cluster 3: Customers with the highest values in all attributes other than min_payment_amt.
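
Below is a minimal sketch of how the cluster profiles above can be derived, reusing the unscaled DataFrame df and the K-means labels from the previous sketch.

# Attach cluster labels to the original (unscaled) data and profile each cluster.
df_profiled = df.copy()
df_profiled["cluster"] = cluster_labels

profile = df_profiled.groupby("cluster").mean()                         # mean of each feature per cluster
profile["count"] = df_profiled["cluster"].value_counts().sort_index()   # cluster sizes
print(profile)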

Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The
management decides to collect data from the past few years. You are assigned the
task to make a model which predicts the claim status and provide recommendations
to management. Use CART, RF & ANN and compare the models' performances in
train and test sets.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it.

Let's check the head of the dataset.

We removed 'Agency_Code', which seems not useful for this analysis.

Let's see the dataset information.

The dataset contains no null values.

Let's see the dataset description.

The dataset contains 139 duplicate records.

Let's see the pair plot of the dataset.

Let's see the correlation matrix.

The dataset looks imbalanced.

Let's see whether the dataset contains outliers or not.

Let's treat the outliers.


Inferences
* After a successful data load, we see "Agency_Code", "Product Name" and "Destination" as columns that add little value, so we drop them.

* We check the dataset information and find no null values, but there are object-type columns to be taken care of.

* We encode the object-type columns to numeric values so they can be used for modelling (a sketch follows after this list).

* We see slightly imbalanced data, roughly a 70/30 ratio for the 'claimed' target attribute.
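
Below is a minimal sketch of the ingestion and preparation steps above; the file name 'insurance_claims.csv' and the exact column names (including the target 'Claimed') are assumptions and should be adjusted to the actual dataset.

# Ingestion and preparation sketch; file name and column names are assumptions.
import pandas as pd

ins = pd.read_csv("insurance_claims.csv")

print(ins.head())
ins.info()                      # datatypes and null-value check
print(ins.describe())
print(ins.duplicated().sum())   # duplicate records (139 per the report)

# Drop columns judged not useful for this analysis
ins = ins.drop(columns=["Agency_Code", "Product Name", "Destination"])

# Encode object-type columns to numeric codes
for col in ins.select_dtypes(include="object").columns:
    ins[col] = pd.Categorical(ins[col]).codes

# Check class balance of the target (roughly 70/30 per the report)
print(ins["Claimed"].value_counts(normalize=True))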

2.2 Data Split: Split the data into test and train, build classification model CART,
Random Forest, Artificial Neural Network

CART Model

Let's split the dataset and check its dimensions.

Let's see the variable importance.

Let's see the predicted classes and probabilities.

We have a CART model with the training results below.

We have a CART model with the test results below.
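
The results referenced above can be produced along the following lines; this is a minimal sketch reusing the prepared DataFrame ins and the assumed target column 'Claimed' from the earlier sketch, with the 70/30 split ratio and tree settings also being assumptions.

# Train/test split and CART (decision tree) sketch; split ratio and
# hyperparameters are assumptions.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = ins.drop(columns=["Claimed"])
y = ins["Claimed"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # check dimensions

cart = DecisionTreeClassifier(criterion="gini", random_state=42)
cart.fit(X_train, y_train)

print(dict(zip(X.columns, cart.feature_importances_)))  # variable importance
pred_class = cart.predict(X_test)         # predicted classes
pred_proba = cart.predict_proba(X_test)   # predicted probabilities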

Random Forest Model

We have a random forest model with the training results below.

We have a random forest model with the test results below.
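
A minimal sketch of the random forest model, reusing the train/test split above; the number of trees is an assumption.

# Random forest sketch; n_estimators is an assumption.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

rf_train_proba = rf.predict_proba(X_train)[:, 1]  # for training-set metrics
rf_test_proba = rf.predict_proba(X_test)[:, 1]    # for test-set metrics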


Neural Network Model

We have a neural network model with the training results below.

We have a neural network model with the test results below.
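
A minimal sketch of the neural network model, reusing the split above; the hidden-layer size, iteration limit, and the use of standard-scaled inputs are assumptions.

# Neural network (MLP) sketch; hyperparameters and input scaling are assumptions.
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

ann = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)
ann.fit(X_train_s, y_train)

ann_test_proba = ann.predict_proba(X_test_s)[:, 1]  # for test-set metrics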

2.3 Performance Metrics: Check the performance of Predictions on Train and Test


sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model

CART Model

We have a CART model with the training results below.

We have a CART model with the test results below.


Training Data AUC: 0.838

Testing Data AUC: 0.818

Random Forest Model

We have a random forest model with the training results below.

We have a random forest model with the test results below.

Training Data AUC: 0.872

Testing Data AUC: 0.822

Neural Network Model

We have a neural network model with the training results below.

We have a neural network model with the test results below.

Training Data AUC: 0.782

Testing Data AUC: 0.736
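
The metrics above can be computed as in the following minimal sketch, shown for the CART model; the same calls apply to the random forest and neural network models.

# Accuracy, confusion matrix, ROC curve and AUC for the CART model.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_curve, roc_auc_score)
import matplotlib.pyplot as plt

test_pred = cart.predict(X_test)
test_proba = cart.predict_proba(X_test)[:, 1]

print("Accuracy:", accuracy_score(y_test, test_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, test_pred))
print("AUC:", roc_auc_score(y_test, test_proba))

fpr, tpr, _ = roc_curve(y_test, test_proba)
plt.plot(fpr, tpr, label="CART (test)")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()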

2.4 Final Model: Compare all the model and write an inference which model is best/
optimized.

ROC Curve for the 3 models on the Training data


ROC Curve for the 3 models on the Test data

Based on the comparison above, 'Random Forest' appears to be the best model in this case.

2.5 Inference: Based on the whole Analysis, what are the business insights and re-
commendations

Out of the 3 models, Random Forest has slightly better performance than the CART and neural network models.

Overall, all 3 models are reasonably stable and can be used for making future predictions. From the CART and Random Forest models, the variable 'change' is found to be the most important feature for predicting claim status. If 'change' is yes, those customers are more likely to file a claim.
