Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers
to its customers. They collected a sample that summarizes the activities of users
during the past few months. You are given the task to identify the segments based
on credit card usage.
1.1 Read the data and do exploratory data analysis. Describe the data briefly.
After loading the data, we can see that there are 7 features and 210 records in the dataset.
The 7 features are:
spending: Amount spent by the customer per month (in 1000s)
advance_payments: Amount paid by the customer in advance by cash (in 100s)
probability_of_full_payment: Probability of payment done in full by the customer to the bank
current_balance: Balance amount left in the account to make purchases (in 1000s)
credit_limit: Limit of the amount in the credit card (in 10000s)
min_payment_amt: Minimum amount paid by the customer while making payments for purchases made monthly (in 100s)
max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)
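A minimal sketch of the loading and inspection steps described here (the file name is an assumption; adjust to the actual dataset path):

```python
import pandas as pd

# Assumed file name; adjust to the actual dataset path
df = pd.read_csv("bank_marketing_part1_Data.csv")

print(df.shape)               # expected (210, 7)
print(df.head())
print(df.duplicated().sum())  # duplicate-record check
print(df.isnull().sum())      # null-value check
df.info()                     # datatypes: float64 for all columns
print(df.describe().T)        # 5-point summary
```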
There are no duplicate records and no null values in the dataset.
Let's look at the dataset information. All columns have the 'float64' datatype.
Let's look at the description/5-point summary of the dataset.
Let's look at the histograms of the features.
Let's look at the covariance of the dataset.
Let's look at the correlation of the dataset.
Let's look at the correlation matrix.
Let's look at the pairplot.
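A sketch of these EDA steps, assuming the dataframe df from the load step above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histograms of all numeric features
df.hist(figsize=(12, 8))
plt.tight_layout()
plt.show()

# Covariance and correlation tables
print(df.cov())
print(df.corr())

# Correlation matrix as an annotated heatmap
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()

# Pairwise scatter plots of all features
sns.pairplot(df)
plt.show()
```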
1.2 Do you think scaling is necessary for clustering in this case? Justify
Let's look at the data description to understand the 'weight' of each feature in the overall dataset.
We can see that the 'max' values of the features are on very different scales (for example, spending is in 1000s while probability_of_full_payment lies between 0 and 1), so unscaled features would contribute very unequally to the distance calculations that clustering relies on. To standardize the impact of all features, scaling is needed.
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum
clusters using Dendrogram and briefly describe them
Let's first scale the data.
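A sketch using StandardScaler (z-scoring; the report does not state which scaler was used):

```python
from sklearn.preprocessing import StandardScaler
import pandas as pd

# z-score scaling: every feature gets mean 0 and unit variance
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled_df.describe().round(2))
```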
Let's look at the dendrogram.
Let's look at the truncated dendrogram showing only the last 25 merged clusters.
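A sketch of the dendrogram steps; Ward linkage is an assumption, as the report does not state the linkage method:

```python
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hierarchical clustering on the scaled features (Ward linkage assumed)
wardlink = linkage(scaled_df, method="ward")

# Full dendrogram
dendrogram(wardlink)
plt.show()

# Truncated dendrogram: show only the last 25 merges
dendrogram(wardlink, truncate_mode="lastp", p=25)
plt.show()
```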
Let's find the optimum number of clusters, first based on the 'maxclust' criterion. We get two clusters.
Now let's find clusters based on the 'distance' criterion. Again we get two clusters.
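A sketch of both criteria with scipy's fcluster; the distance threshold is an illustrative assumption that would be read off the dendrogram:

```python
from scipy.cluster.hierarchy import fcluster
import pandas as pd

# 'maxclust' criterion: ask directly for 2 clusters
clusters_max = fcluster(wardlink, 2, criterion="maxclust")
print(pd.Series(clusters_max).value_counts())

# 'distance' criterion: cut the tree at a chosen height
# (the threshold 23 is an illustrative assumption)
clusters_dist = fcluster(wardlink, 23, criterion="distance")
print(pd.Series(clusters_dist).value_counts())
```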
1.4 Apply K-Means clustering on scaled data and determine optimum clusters.
Apply elbow curve and silhouette score.
First, let's compute the WSS (within-cluster sum of squares) for a range of k values.
Now let's plot the WSS values as an elbow curve.
We get four optimal clusters.
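A sketch of the elbow curve plus silhouette scores (the latter to corroborate the choice of k, per the question), assuming scaled_df from the scaling step:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# WSS (inertia) for k = 1..10
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1).fit(scaled_df)
    wss.append(km.inertia_)

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WSS (inertia)")
plt.show()

# Silhouette scores for k = 2..10 to corroborate the elbow
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(scaled_df)
    print(k, silhouette_score(scaled_df, labels))
```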
1.5 Describe cluster profiles for the clusters defined. Recommend different pro-
motional strategies for different clusters.
Recommendations
1. Cluster 0 should be given fewer general offers and more 'on the payment' club offers.
2. Cluster 1 should be given fewer general offers and more 'full payment' club offers.
3. Cluster 2 should be given all offers, as it holds the second-highest values in most attributes.
4. Cluster 3 should be given all offers, as it holds the highest values in most attributes: high spending and advance payments, and the highest credit limit. Provide offers clubbing minimum payment and full payment.
Description
Cluster 0: Customers with the lowest values in all attributes other than min_payment_amt and max_spent_in_single_shopping. It holds the highest min_payment_amt.
Cluster 1: Customers with the second-lowest values in all attributes other than min_payment_amt and max_spent_in_single_shopping. It holds the lowest min_payment_amt and max_spent_in_single_shopping.
Cluster 2: Customers with the second-highest values in all attributes other than min_payment_amt.
Cluster 3: Customers with the highest values in all attributes other than min_payment_amt.
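A sketch of how profiles like these can be produced, assuming df and scaled_df from the earlier steps:

```python
from sklearn.cluster import KMeans

# Fit the chosen 4-cluster solution and profile clusters on the original scale
km4 = KMeans(n_clusters=4, random_state=1)
df["cluster"] = km4.fit_predict(scaled_df)

profile = df.groupby("cluster").mean()
profile["freq"] = df["cluster"].value_counts().sort_index()
print(profile.round(2))
```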
Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The
management decides to collect data from the past few years. You are assigned the
task to make a model which predicts the claim status and provide recommendations
to management. Use CART, RF & ANN and compare the models' performances in
train and test sets.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and a null value condition check; write an inference on it.
Let's check the head of the dataset.
We removed 'Agency_Code', which seems of no use for analysis in this dataset.
Let's look at the dataset information.
The dataset contains no null values.
Let's look at the dataset description.
The dataset contains 139 duplicate records.
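A minimal sketch of these ingestion and inspection steps (the file name is an assumption; the dropped columns follow the inferences listed below):

```python
import pandas as pd

# Assumed file name; adjust to the actual dataset path
ins_df = pd.read_csv("insurance_part2_data.csv")

print(ins_df.head())
ins_df.info()                     # datatypes and non-null counts
print(ins_df.isnull().sum())      # no null values expected
print(ins_df.duplicated().sum())  # 139 duplicate records here
print(ins_df.describe(include="all").T)

# Drop the columns judged of no analytical use (see the inferences below)
ins_df = ins_df.drop(columns=["Agency_Code", "Product Name", "Destination"])
```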
Let's look at the pairplot of the dataset.
Let's look at the correlation matrix.
The dataset looks imbalanced with respect to the target.
Let's check whether the dataset contains outliers.
Let's treat the outliers.
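One common treatment is to cap values outside the IQR whiskers; the report does not state the method used, so this is a sketch under that assumption:

```python
import numpy as np

def cap_outliers(series):
    # Cap values outside the 1.5*IQR whiskers (assumed treatment method)
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

num_cols = ins_df.select_dtypes(include=np.number).columns
ins_df[num_cols] = ins_df[num_cols].apply(cap_outliers)
```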
Inferences
* After a successful data load, we see "Agency_Code", "Product Name" and "Destination" as columns of no analytical use, so we drop them.
* We check the dataset information and find no null values, but we have object columns to be taken care of.
* We encode the columns whose datatype is object to numeric so that the models can use them (a sketch follows this list).
* We see slightly imbalanced data, roughly a 70/30 ratio for the 'Claimed' attribute of the dataset.
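A sketch of the encoding step; categorical codes are used here, though the exact encoder in the report is not stated:

```python
import pandas as pd

# Encode every object column as numeric category codes
# (pd.Categorical codes; the report's exact encoder is an assumption)
for col in ins_df.select_dtypes(include="object").columns:
    ins_df[col] = pd.Categorical(ins_df[col]).codes
```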
2.2 Data Split: Split the data into test and train, build classification model CART,
Random Forest, Artificial Neural Network
CART Model
Let's split the dataset and check its dimensions.
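A sketch of the split; the 70/30 ratio and the random seed are assumptions, as the report does not state them:

```python
from sklearn.model_selection import train_test_split

X = ins_df.drop(columns=["Claimed"])
y = ins_df["Claimed"]

# 70/30 train/test split (ratio and seed assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)
```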
Let's look at the variable importance.
Let's look at the predicted classes and class probabilities.
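A sketch of the CART fit, importance listing and predictions, assuming the split above; the hyperparameters are illustrative assumptions:

```python
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Gini-based CART; max_depth is an illustrative assumption
cart = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=1)
cart.fit(X_train, y_train)

# Variable importance, highest first
print(pd.Series(cart.feature_importances_, index=X.columns)
        .sort_values(ascending=False))

# Predicted class and class-1 probability on the test set
pred_class = cart.predict(X_test)
pred_prob = cart.predict_proba(X_test)[:, 1]
```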
We have the CART model with the training results below.
We have the CART model with the test results below.
Random Forest Model
We have the random forest model with the training results below.
We have the random forest model with the test results below.
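A sketch of the random forest fit; the hyperparameters are illustrative, as the report does not state them:

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators is an illustrative assumption
rf = RandomForestClassifier(n_estimators=300, random_state=1)
rf.fit(X_train, y_train)
print(rf.score(X_train, y_train), rf.score(X_test, y_test))
```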
Neural Network Model
We have the neural network model with the training results below.
We have the neural network model with the test results below.
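A sketch using sklearn's MLPClassifier; the architecture and iteration budget are illustrative assumptions:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# MLPs are sensitive to feature scale, so scale the inputs first
sc = StandardScaler().fit(X_train)
X_train_s, X_test_s = sc.transform(X_train), sc.transform(X_test)

# Hidden-layer size and max_iter are illustrative assumptions
ann = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=1)
ann.fit(X_train_s, y_train)
print(ann.score(X_train_s, y_train), ann.score(X_test_s, y_test))
```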
2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model
CART Model
We have the CART model with the training results below.
We have the CART model with the test results below.
Training Data AUC: 0.838
Testing Data AUC: 0.818
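A sketch of how these metrics can be produced for any of the fitted models above (for the neural network, the scaled matrices X_train_s/X_test_s would be passed instead):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_curve, roc_auc_score)
import matplotlib.pyplot as plt

def evaluate(model, X, y, label):
    """Print accuracy and confusion matrix, and plot the ROC curve."""
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]
    print(label, "accuracy:", accuracy_score(y, pred))
    print(confusion_matrix(y, pred))
    fpr, tpr, _ = roc_curve(y, prob)
    plt.plot(fpr, tpr, label=f"{label} AUC={roc_auc_score(y, prob):.3f}")

evaluate(cart, X_train, y_train, "CART train")
evaluate(cart, X_test, y_test, "CART test")
plt.plot([0, 1], [0, 1], "k--")
plt.legend()
plt.show()
```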
Random Forest Model
We have the random forest model with the training results below.
We have the random forest model with the test results below.
Training Data AUC: 0.872
Testing Data AUC: 0.822
Neural Network Model
We have the neural network model with the training results below.
We have the neural network model with the test results below.
Neural network Training Data AUC: 0.782
Neural network Testing Data AUC: 0.736
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.
ROC Curve for the 3 models on the Training data
ROC Curve for the 3 models on the Test data
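A sketch of the test-set comparison plot, assuming the three fitted models from section 2.2:

```python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Overlay the three test-set ROC curves for a side-by-side comparison
for name, model, Xs in [("CART", cart, X_test),
                        ("Random Forest", rf, X_test),
                        ("Neural Network", ann, X_test_s)]:
    prob = model.predict_proba(Xs)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, prob)
    plt.plot(fpr, tpr, label=f"{name} AUC={roc_auc_score(y_test, prob):.3f}")

plt.plot([0, 1], [0, 1], "k--")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()
```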
Based on the comparison above, Random Forest appears to be the best model in this case.
2.5 Inference: Based on the whole analysis, what are the business insights and recommendations?
Out of the 3 models, Random Forest has slightly better performance than the CART and neural network models.
Overall, all 3 models are reasonably stable enough to be used for making future predictions. From the CART and Random Forest models, the variable 'change' is found to be the most useful feature among all others for predicting whether a customer will file a claim. If 'change' is yes, those customers have a higher chance of filing a claim.