Data Mining: Clustering & Model Evaluation
Shrirekh Girnale
Contents
Problem 1: Clustering..................................................................................................................................................4
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis)........................................................................................................................4
1.2 Do you think scaling is necessary for clustering in this case? Justify............................................9
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.............................................................................................................10
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and silhouette score. Explain the results properly. Interpret and write inferences on the
finalized clusters.....................................................................................................................................................12
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters...........................................................................................................................13
Problem 2: CART-RF-ANN.........................................................................................................................................15
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis)......................................................................................................................15
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest,
Artificial Neural Network........................................................................................................................................20
2.2.1 Splitting data into train and test sets:..................................................................................................20
2.2.2 Building models:.....................................................................................................................................21
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification
reports for each model...........................................................................................................................................22
2.3.1 Decision Tree_Model evaluation:....................................................................................................22
2.3.2 Random forest_Model evaluation:...................................................................................................24
2.4 Final Model - Compare all the models and write an inference which model is best/optimized........28
2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations...................................................................................................................................................30
Figures
Figure 1 Box plots for all the columns in Bank Marketing Part 1 data set..............................................................5
Figure 2 Distribution plots for all the columns in Bank Marketing Part 1 data set..................................................6
Figure 3 Distribution plots for all the columns in Bank Marketing Part 1 data set..................................................7
Figure 4 Heat map for all the columns in Bank Marketing Part 1 data set..............................................................8
Figure 5 Dendrogram for hierarchical clustering......................................................................................................10
Figure 6 Classification of ‘scaled Bank Marketing’ into clusters using fcluster method......................................11
Figure 7 Classification of ‘scaled Bank Marketing’ into clusters using agglomerative method..........................12
Figure 8 Elbow curve for k-mean clustering..............................................................................................................13
Figure 9 Box plots of columns in Insurance data set...............................................................................................16
Figure 10 Pairplot of columns in Insurance dataset.................................................................................................17
Figure 11 Correlation plot of columns in Insurance data set..................................................................................18
Figure 12 Codes for objects in features with original data type as 'object'...........................................................20
Figure 13 Dimensions of Train and Test data sets..................................................................................................20
Figure 14 Decision tree best parameters...................................................................................................................21
Figure 15 Decision tree................................................................................................................................................21
Figure 16 Random Forest_best parameters.............................................................................................................22
Figure 17 Neural Network_best parameters.............................................................................................................22
Figure 18 AUC for training set (Decision tree)..........................................................................................................23
Figure 19 AUC for test set (Decision tree)................................................................................................................23
Figure 20 Confusion matrix_Decision tree................................................................................................................23
Figure 21 Confusion matrix_Random forest.............................................................................................................24
Figure 22 AUC for training set (Random_forest)......................................................................................................25
Figure 23 AUC for test set (Random_forest)............................................................................................................26
Figure 24 Confusion matrix_Neural Network classifier............................................................................................26
Figure 25 AUC for training set (Neural network classifier)......................................................................................27
Figure 26 AUC for test set (Neural network classifier)............................................................................................28
Figure 27 Comparison of AUC curves for three models..........................................................................................29
Figure 28 Crosstab chart (Agency_code vs. Claimed)............................................................................................30
Figure 29 Crosstab chart (Claimed vs. Type)...........................................................................................................30
Figure 30 Bar chart (Claimed vs. Sales)....................................................................................................................31
Figure 31 Bar chart (Agency_code vs. Sales)..........................................................................................................31
Tables
Problem 1: Clustering
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate,
Bi-variate, and multivariate analysis).
Introduction:
The purpose of Part 1 of this report is to develop a customer segmentation for a leading bank so that it can offer targeted promotions to its customers. Summarized data of users' activities during the past few months is analysed for this purpose, and segments are identified based on each customer's credit card usage.
The data set contains a total of seven columns, none of which contains null values. All columns are of the float data type.
iii. The basic statistics of the above data set (Table 1) are as follows:
iv. Boxplot:
The box plots in Figure 1 show that outliers are present only in ‘min_payment_amt’ and ‘probability_of_full_payment’.
Figure 1 Box plots for all the columns in Bank Marketing Part 1 data set
v. Distribution plot:
Figure 2 Distribution plots for all the columns in Bank Marketing Part 1 data set
Distribution plots of the columns give a first-hand idea of whether the data is normally distributed, right-skewed or left-skewed. ‘Credit_limit’, ‘min_payment_amt’ and ‘max_spent_in_single_shopping’ show a normal distribution. ‘Spending’, ‘advance_payments’ and ‘current_balance’ show a right-skewed distribution. ‘Probability_of_full_payment’ shows a left-skewed distribution.
Figure 3 Distribution plots for all the columns in Bank Marketing Part 1 data set
Figure 3 is a pair plot; the diagonal panels show the distribution of each individual variable, which confirms the inferences drawn from the distribution plots. Many of the off-diagonal scatter plots show a positive correlation between the variables.
Figure 4 Heat map for all the columns in Bank Marketing Part 1 data set
From the correlation plot in Figure 4, we can see that several attributes of the data set are highly correlated with each other. Correlation values close to 1 or -1 indicate strong positive and strong negative correlation respectively, while values close to 0 indicate little or no correlation.
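The plots above can be reproduced with a short script. The following is a minimal sketch, assuming the data has been read into a pandas DataFrame named bank_df and that the file name is a placeholder (neither name is taken from the report):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

bank_df = pd.read_csv('bank_marketing_part1.csv')  # placeholder file name

# Box plots for every column (the data set has seven float columns)
bank_df.plot(kind='box', subplots=True, layout=(2, 4), figsize=(12, 6))
plt.tight_layout()
plt.show()

# Pair plot: diagonal panels show each variable's distribution,
# off-diagonal panels show pairwise relationships
sns.pairplot(bank_df)
plt.show()

# Correlation heat map
sns.heatmap(bank_df.corr(), annot=True, cmap='coolwarm')
plt.show()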
1.2 Do you think scaling is necessary for clustering in this case? Justify
Scaling the data is required when the ranges of the raw features vary widely, because distance-based machine learning algorithms work best when all features are on a comparable scale.
As can be seen in Table 4, the minimum and maximum values show a wide range across the different features. The means of the features vary from 0.8 to 14.8, and the standard deviations vary from 0.02 to 2.9, which is a large variation. Therefore, scaling is required for efficient clustering in this case.
In this case, scaling can be performed using either the min-max or the standard scaler method. The standard scaler method scales each feature so that its distribution is centred around zero with a standard deviation of one, while the min-max scaler method shrinks the range of each feature to between zero and one. Because some of the features contain outliers and the min-max scaler is sensitive to outliers, the standard scaler method is used to scale the data.
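A minimal sketch of the scaling step, reusing the bank_df placeholder from the earlier sketch:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standard scaling: each column is centred at zero with unit standard deviation
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(bank_df), columns=bank_df.columns)

# Means should now be ~0 and standard deviations ~1
print(scaled_df.describe().round(2))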
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters
using Dendrogram and briefly describe them.
1. The basic statistics of the scaled data are as follows:
3. Hierarchical clustering is applied to the scaled data using the ‘ward’ linkage method, and clustering based on ‘spending’ yields the following dendrogram. The dendrogram is cut at 10 nodes, as there are only two main clusters. A code sketch of these steps follows this list.
Cluster 1 contains … data points, Cluster 2 contains 67 data points and Cluster 3 contains 73 data points.
6. Agglomerative clustering:
i. The scaled data is clustered using the agglomerative clustering method with 3 clusters, ‘euclidean’ affinity and ‘average’ linkage; the following clusters are obtained.
Figure 7 Classification of ‘scaled Bank Marketing’ into clusters using agglomerative method
ii. After applying agglomerative clustering, the following clusters are obtained. Table 9 shows that there are 73, 68 and 69 data points in clusters 1, 2 and 3 respectively.
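A minimal sketch of the hierarchical clustering steps described above (ward-linkage dendrogram, flat clusters via fcluster, and agglomerative clustering). It reuses the scaled_df placeholder from the scaling sketch, and the truncation parameters are assumptions based on the description of the dendrogram:

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.cluster import AgglomerativeClustering

# Ward linkage on the scaled data
wardlink = linkage(scaled_df, method='ward')

# Dendrogram truncated to the last 10 merged clusters
dendrogram(wardlink, truncate_mode='lastp', p=10)
plt.title('Dendrogram (ward linkage)')
plt.show()

# Flat cluster labels from the linkage matrix (3 clusters)
f_clusters = fcluster(wardlink, 3, criterion='maxclust')

# Agglomerative clustering with euclidean affinity and average linkage
# (newer scikit-learn versions use the parameter name metric instead of affinity)
agg = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='average')
agg_clusters = agg.fit_predict(scaled_df)
print(pd.Series(agg_clusters).value_counts())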
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and silhouette score. Explain the results properly. Interpret and write inferences on
the finalized clusters.
1. Forming clusters with K from 1 to 10 and comparing the WSS (within-cluster sum of squares), we obtain the following results:
It can be seen from Figure 8 that three clusters are a suitable choice. To confirm the number of clusters, the silhouette scores for 3 and 4 clusters are checked. The silhouette score for three clusters is 0.3304 and for four clusters is 0.3276. Since the silhouette score for three clusters is higher than that for four clusters, three clusters are formed.
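A minimal sketch of the elbow curve and silhouette check, again reusing the scaled_df placeholder; the random_state value is an assumption:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Within-cluster sum of squares (WSS / inertia) for k = 1..10
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(scaled_df)
    wss.append(km.inertia_)

plt.plot(range(1, 11), wss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WSS (inertia)')
plt.show()

# Silhouette scores for the candidate numbers of clusters
for k in (3, 4):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(scaled_df)
    print(k, 'clusters:', round(silhouette_score(scaled_df, labels), 4))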
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.
1. Hierarchical clustering:
i. F-cluster
Table 11 Cluster Profiling using F-cluster
Cluster 1 in the f-cluster and agglomerative results contains customers with a very good credit record. These customers can be offered promotions such as cash-back on spending above a particular limit, so as to encourage further spending.
Cluster 2 in the f-cluster results and cluster 0 in the agglomerative results contain customers with a very poor credit record. Offers should be designed for these customers that encourage timely repayment of the credit taken, so as to avoid bad loans.
Cluster 3 in the f-cluster results and cluster 2 in the agglomerative results contain customers with an average credit record. As the probability of full payment for this segment is good, these are the best potential customers and can be encouraged to utilise more credit. Since their advance-payment record is poor, more interest on credit can be earned from this segment.
2. K-mean clustering:
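The cluster profiles summarised in this section (for the hierarchical as well as the K-means labels) can be produced by attaching the cluster labels to the original, unscaled data and averaging each feature per cluster. A minimal sketch, reusing the placeholder names from the earlier sketches:

# Profile the f-cluster labels against the original (unscaled) data
profile = bank_df.copy()
profile['f_cluster'] = f_clusters          # labels from the hierarchical sketch
print(profile.groupby('f_cluster').mean().round(2))   # mean of each feature per cluster
print(profile['f_cluster'].value_counts())            # number of data points per cluster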
Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate,
Bi-variate, and multivariate analysis).
Introduction:
The data set contains data from an insurance firm that provides tour insurance and is facing a higher claim frequency. This part of the report compares the performance of three models, namely CART, Random Forest (RF) and Artificial Neural Network (ANN), on the train and test sets.
iii. The basic statistics of the above data set (Table 14) are as follows:
iv. Boxplot
The box plots in Figure 9 show that there are outliers in all the integer-type columns.
v. Pairplot
The correlation plot (Figure 11) shows that a strong correlation exists only between commission and sales. Sales and duration show a good correlation, duration and commission show a weak correlation, and age has an extremely weak correlation with the other variables.
Table 17 shows that there are no null values in any of the features.
2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network
2.2.1 Splitting data into train and test sets:
i. First, it is necessary to convert all features with the ‘object’ data type to ‘categorical’ and to assign an integer code to each unique category within every feature. The following figure shows the unique categories in each feature and their respective codes (a code sketch of this step and of the train-test split follows this list).
Figure 12 Codes for objects in features with original data type as 'object'
iii. Splitting the data set into train and test sets: The data set is split into train and test sets with test_size of 0.25 and random state of 1. A test_size of 0.25 indicates that 25% of the data will be used for testing and 75% for training the model. Fixing the random state ensures that the results are reproducible.
iv. The dimensions of the train and test data sets are as follows:
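A minimal sketch of the encoding and split described in items i and iii; insurance_df, the file name and the target column ‘Claimed’ are placeholder assumptions based on the figure captions rather than the report's code:

import pandas as pd
from sklearn.model_selection import train_test_split

insurance_df = pd.read_csv('insurance_part2_data.csv')  # placeholder file name

# Convert every 'object' column to 'category' and replace it with integer codes
for col in insurance_df.select_dtypes(include='object').columns:
    insurance_df[col] = insurance_df[col].astype('category').cat.codes

# Separate predictors and the target ('Claimed'), then split 75% / 25%
X = insurance_df.drop('Claimed', axis=1)
y = insurance_df['Claimed']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

print(X_train.shape, X_test.shape)   # dimensions of the train and test sets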
Random Forest:
The best parameters for the Random Forest, obtained with 10-fold cross-validation, are shown in Figure 16. Scoring is based on the ‘recall’ of the model.
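The report only states that the best parameters were found with 10-fold cross-validation scored on recall; a grid search is one common way to do this. A minimal sketch under that assumption, with a placeholder parameter grid (the actual values searched are those behind Figure 16):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder grid: the real grid is whatever produced the parameters in Figure 16
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 7, 10],
    'min_samples_leaf': [5, 10],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=10,             # 10-fold cross-validation
    scoring='recall',  # scoring based on recall, as stated above
)
grid.fit(X_train, y_train)

print(grid.best_params_)
rf_model = grid.best_estimator_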
Figure 17 Neural Network_best parameters
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model.
2.3.1 Decision Tree_Model evaluation:
AUC and ROC for the training set:
Confusion matrix:
Inference:
i. The classification reports for both the training and test data show that recall for 0's (No claim) is higher than that for 1's (Claimed).
ii. The F1 score, which balances precision and recall, is similar for the train and test data sets.
iii. The accuracy of the train and test data sets is approximately the same, showing that the model is not overfitted.
iv. From the ROC curve it can be inferred that a threshold value of about 0.25 gives a good balance between the true positive rate and the false positive rate.
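The same set of metrics is reported for every model. The following is a minimal sketch for one model, assuming the fitted decision tree is named dt_model (a placeholder) and reusing the train/test split from the earlier sketch:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

# dt_model is a placeholder for the fitted CART / decision tree classifier
for name, X_, y_ in (('train', X_train, y_train), ('test', X_test, y_test)):
    pred = dt_model.predict(X_)
    prob = dt_model.predict_proba(X_)[:, 1]   # probability of class 1 (Claimed)

    print(name, 'accuracy:', round(accuracy_score(y_, pred), 3))
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))
    print(name, 'AUC:', round(roc_auc_score(y_, prob), 3))

    fpr, tpr, _ = roc_curve(y_, prob)
    plt.plot(fpr, tpr, label=name)

plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()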
Figure 23 AUC for test set (Random_forest)
Inference:
i. The classification reports for both the training and test data show that recall for 0's (No claim) is higher than that for 1's (Claimed).
ii. The F1 score, which balances precision and recall, is less well balanced between the train and test data sets.
iii. The accuracy of the train and test data sets is approximately the same, showing that the model is not overfitted.
iv. From the ROC curve it can be inferred that a threshold value of about 0.25 gives a good balance between the true positive rate and the false positive rate.
Inference:
i. The classification reports for both the training and test data show that recall for 0's (No claim) is higher than that for 1's (Claimed).
ii. The F1 score, which balances precision and recall, is less well balanced on the test data set.
iii. The accuracy of the train and test data sets is approximately the same, showing that the model is not overfitted.
iv. From the ROC curve it can be inferred that a threshold value of about 0.25 gives a good balance between the true positive rate and the false positive rate.
2.4 Final Model - Compare all the models and write an inference which model is
best/optimized.
Table 26 shows that the CART model performs the best among the three models for the following reasons:
i. The training accuracy of the CART model is the highest of the three models.
ii. Recall for both the training and test data is highest for CART. There is also a better balance between the training and test recall for CART than for the other models.
iii. The F1 score, which balances precision and recall, also has the best values and the best train-test balance for the CART model compared with the other models.
It can also be seen from Table 26 and Figure 27 that there is no significant difference between the AUC/ROC curves of the three models.
2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations
i. According to the CART model, the most important feature impacting claims is ‘Agency_Code’. It can be seen from Figure 28 that Agency_Code ‘0’ (C2B) has the highest number of claims, followed by ‘EPX’, ‘CWT’ and ‘JZI’.
ii. It can be seen from Figure 29 (crosstab of Claimed vs. Type) that …
iii. Sales also plays an important role in claims, as seen in Figure 30 and Figure 31; claims increase as sales increase. Sales is the second most important feature as per the CART model.