
Data Mining project

Shrirekh Girnale

SEPTEMBER 30, 2021


PG-DSBA

Contents
Problem 1: Clustering
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
1.2 Do you think scaling is necessary for clustering in this case? Justify
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters
Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial Neural Network
2.2.1 Splitting data into train and test sets
2.2.2 Building models
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model
2.3.1 Decision Tree_Model evaluation
2.3.2 Random forest_Model evaluation
2.4 Final Model - Compare all the models and write an inference which model is best/optimized
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations


Figures
Figure 1 Box plots for all the columns in Bank Marketing Part 1 data set
Figure 2 Distribution plots for all the columns in Bank Marketing Part 1 data set
Figure 3 Distribution plots for all the columns in Bank Marketing Part 1 data set
Figure 4 Heat map for all the columns in Bank Marketing Part 1 data set
Figure 5 Dendrogram for hierarchical clustering
Figure 6 Classification of ‘scaled Bank Marketing’ into clusters using fcluster method
Figure 7 Classification of ‘scaled Bank Marketing’ into clusters using agglomerative method
Figure 8 Elbow curve for k-mean clustering
Figure 9 Box plots of columns in Insurance data set
Figure 10 Pairplot of columns in Insurance dataset
Figure 11 Correlation plot of columns in Insurance data set
Figure 12 Codes for objects in features with original data type as 'object'
Figure 13 Dimensions of Train and Test data sets
Figure 14 Decision tree best parameters
Figure 15 Decision tree
Figure 16 Random Forest_best parameters
Figure 17 Neural Network_best parameters
Figure 18 AUC for training set (Decision tree)
Figure 19 AUC for test set (Decision tree)
Figure 20 Confusion matrix_Decision tree
Figure 21 Confusion matrix_Random forest
Figure 22 AUC for training set (Random_forest)
Figure 23 AUC for test set (Random_forest)
Figure 24 Confusion matrix_Neural Network classifier
Figure 25 AUC for training set (Neural network classifier)
Figure 26 AUC for test set (Neural network classifier)
Figure 27 Comparison of AUC curves for three models
Figure 28 Crosstab chart (Agency_code vs. Claimed)
Figure 29 Crosstab chart (Claimed vs. Type)
Figure 30 Bar chart (Claimed vs. Sales)
Figure 31 Bar chart (Agency_code vs. Sales)


Tables

Table 1 Sample of Bank Marketing Part 1 data set
Table 2 Information of Bank Marketing Part 1 data set
Table 3 Description of Bank Marketing Part 1 data set
Table 4 Investigation for Scaling
Table 5 Description of the scaled data
Table 6 Sample of Insurance data set
Table 7 Information of Insurance data set
Table 8 Description of Insurance dataset
Table 9 Null value check for Insurance dataset
Table 10 Decision Tree_Feature Importance
Table 15 Random Forest_Feature Importance
Table 11 Classification report of training data_Decision tree
Table 12 Classification report of test data_Decision tree
Table 13 Classification report of training data_Random forest
Table 14 Classification report of test data_Random forest
Table 16 Classification report of training data_Neural network classifier
Table 17 Classification report of test data_Neural network classifier
Table 18 Comparison of performance metrics of three models


Problem 1: Clustering
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate,
Bi-variate, and multivariate analysis).
Introduction:

Part 1 of this report develops a customer segmentation for a leading bank so that it can
target promotional offers to its customers. A summarized data set of customers' activity
during the past few months is analysed for this purpose, and segments are identified
based on customers' credit card usage.

Exploratory Data Analysis:

i. The sample data set is as follows:

Table 1 Sample of Bank Marketing Part 1 data set

ii. The information of the above data set is as follows:


Table 2 Information of Bank Marketing Part 1 data set

The data set contains seven columns in total, none of which contains null values. All the
columns are of float data type.


iii. The basic statistics of the above data set (Table 1) are as follows:

Table 3 Description of Bank Marketing Part 1 data set

There are no duplicate values in the data set.

iv. Boxplot:

The box plots in Figure 1 show outliers only in ‘min_payment_amt’ and
‘probability_of_full_payment’.

Figure 1 Box plots for all the columns in Bank Marketing Part 1 data set


v. Distribution plot:

Figure 2 Distribution plots for all the columns in Bank Marketing Part 1 data set

The distribution plots give a first indication of whether each column is normally distributed, right-skewed or
left-skewed. ‘Credit_limit’, ‘min_payment_amt’ and ‘max_spent_in_single_shopping’ show approximately normal
distributions. ‘Spending’, ‘advance_payments’ and ‘current_balance’ show right-skewed distributions.
‘Probability_of_full_payment’ shows a left-skewed distribution.


vi. Pair plot:

Figure 3 Pair plot for all the columns in Bank Marketing Part 1 data set

In Figure 3, the plots on the diagonal show the distribution of each single variable, and
they confirm the inferences drawn from the distribution plots. Many of the off-diagonal
plots show a positive correlation.


‘Min_payment_amt’ and ‘probability_of_full_payment’ show a scattered distribution against all the
other columns. The other columns show a positive linear correlation with each other.

vii. Correlation plot:

Figure 4 Heat map for all the columns in Bank Marketing Part 1 data set

From the correlation plot in Figure 4, we can see that several attributes in the data set are
highly correlated with each other. Correlation values near 1 indicate a strong positive
correlation and values near -1 a strong negative correlation. Values near 0 indicate that the
two attributes are not linearly correlated.


viii. Univariate Analysis:


• The box plots show that only ‘min_payment_amt’ and ‘probability_of_full_payment’ contain
outliers.

ix. Multivariate Analysis:


• A strong correlation is observed between a few fields. ‘Advance_payments’ is
strongly correlated with ‘spending’, and ‘spending’, ‘advance_payments’ and
‘current_balance’ are each highly correlated with ‘max_spent_in_single_shopping’.
• ‘min_payment_amt’ shows negative correlation values with all the other columns.

1.2 Do you think scaling is necessary for clustering in this case? Justify
Scaling is required when the ranges of the raw features vary widely, because distance-based
algorithms such as clustering would otherwise be dominated by the features with the largest
values.

As can be seen in Table 4, the minimum and maximum values span very different ranges across
features. The feature means vary from 0.8 to 14.8, and the standard deviations vary from
0.02 to 2.9, which is a large variation. Therefore, scaling is required for effective
clustering in this case.

Table 4 Investigation for Scaling

In this case, scaling can be performed using either the min-max or the standard scaler method.
The standard scaler rescales each feature so that its distribution is centred around zero with a
standard deviation of one. The min-max scaler shrinks the range of each feature to between zero
and one. Because some of the features contain outliers and the min-max scaler is sensitive to
outliers, we will use the standard scaler to scale the data.
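The difference between the two scalers can be illustrated with a minimal sketch. The single feature below is hypothetical (it is not taken from the bank data set) and contains one deliberate outlier; the sketch assumes scikit-learn's StandardScaler and MinMaxScaler are the scalers being compared.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with one outlier (50.0); not the report's data.
x = np.array([[1.0], [2.0], [3.0], [4.0], [50.0]])

z = StandardScaler().fit_transform(x)  # centred on 0, standard deviation 1
m = MinMaxScaler().fit_transform(x)    # mapped onto the interval [0, 1]

print("standard scaler:", z.ravel().round(2))
print("min-max scaler: ", m.ravel().round(2))
```

With min-max scaling the outlier alone defines the upper end of the range, so the four ordinary values are squeezed into a narrow band near zero.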

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters
using Dendrogram and briefly describe them.
1. The basic statistics of the scaled data are as follows:

Table 5 Description of the scaled data

2. A sample of the scaled data is as follows:


Table 6 Sample of scaled Bank Marketing Part 1 data set


3. Hierarchical clustering is applied to the above scaled data with ‘ward’ as the linkage method,
with clustering based on ‘spending’, to obtain the following dendrogram. The dendrogram
is cut at 10 nodes, as there are only two main clusters.

Figure 5 Dendrogram for hierarchical clustering


4. The dendrogram shows that there are only two main clusters, orange and green. The orange
cluster has 70 data points and the green cluster contains 140 data points.
5. Clustering using fcluster:
i. The scaled data is clustered with the fcluster method using the distance criterion, and the
following clusters are obtained.


ii. Appended scaled dataset with cluster feature:


Table 7 Appended scaled dataset with cluster feature with f cluster method

iii. The frequency of each cluster is shown in Table 8. Cluster 1 contains 70 data points,
Cluster 2 contains 67 data points and Cluster 3 contains 73 data points.

Table 8 Appended scaled dataset with cluster frequency feature
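The ward linkage, dendrogram and fcluster steps described above can be sketched with SciPy as follows. This is an illustration under stated assumptions: the data is a synthetic stand-in for the scaled data set, "cut at 10 nodes" is assumed to mean truncating the dendrogram to its last 10 merged nodes, and the distance threshold t=5.0 is a hypothetical value chosen for the synthetic data, not the one used in the report.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(1)
# Synthetic stand-in for the scaled data: three well-separated groups of 20 points.
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (-3.0, 0.0, 3.0)])

Z = linkage(X, method="ward")  # ward linkage matrix

# Keep only the last 10 merged nodes; no_plot=True returns the layout
# as a dictionary instead of drawing it.
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

# Cut the tree at a distance threshold (hypothetical value for this toy data).
labels = fcluster(Z, t=5.0, criterion="distance")
sizes = np.bincount(labels)[1:]  # fcluster labels start at 1
print("cluster sizes:", sizes)
```

Appending `labels` as a new column of the scaled data frame and counting its values gives a cluster frequency table of the kind shown in Table 8.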

6. Agglomerative clustering:
i. The scaled data is clustered using the agglomerative clustering method with 3
clusters, ‘euclidean’ affinity and ‘average’ linkage, and the following clusters are obtained.

Figure 7 Classification of ‘scaled Bank Marketing’ into clusters using agglomerative method

ii. Table 9 shows the resulting agglomerative clusters: there are 73, 68 and 69 data points
in clusters 1, 2 and 3 respectively.

Table 9 Clustering using Agglomerative method
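A comparable sketch for the agglomerative step, assuming scikit-learn's AgglomerativeClustering on synthetic stand-in data. Euclidean distance is the library default, so it is not passed explicitly here; the name of that parameter differs between scikit-learn versions (‘affinity’ in older releases, ‘metric’ in newer ones).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
# Synthetic stand-in for the scaled data: three well-separated groups of 20 points.
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (-3.0, 0.0, 3.0)])

# Three clusters with average linkage; the distance metric defaults to Euclidean.
model = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = model.fit_predict(X)

counts = np.bincount(labels)  # labels start at 0 here
print("cluster sizes:", counts)
```

Note that scikit-learn numbers its clusters from 0, whereas fcluster numbers them from 1, so the same group can carry different labels in the two methods.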



1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and silhouette score. Explain the results properly. Interpret and write inferences on
the finalized clusters.
1. Forming clusters with k from 1 to 10 and comparing the within-cluster sum of squares (WSS), we obtain the following results:

Table 10 WSS for k from 1 to 10

2. The WSS profile is as follows:


Figure 8 Elbow curve for k-mean clustering

• It can be seen from the elbow curve in Figure 8 that three clusters can be used for
clustering. To confirm the number of clusters, the silhouette scores for three and four
clusters are checked.
• The silhouette score is 0.3304 for three clusters and 0.3276 for four clusters. As the
score for three clusters is higher than for four clusters, three clusters will be
formed.
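The elbow and silhouette checks can be sketched as below, assuming scikit-learn's KMeans and silhouette_score. The data is synthetic, and n_init=10 and random_state=1 are illustrative settings rather than the report's own, so the scores will not match the values quoted above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Synthetic stand-in for the scaled data: three groups of 30 points.
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in (-3.0, 0.0, 3.0)])

# WSS (inertia) for k = 1..10; plotting these values against k gives the elbow curve.
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(km.inertia_)

# Silhouette score for the two candidate values of k.
sil = {}
for k in (3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    sil[k] = silhouette_score(X, labels)

print([round(w, 1) for w in wss])
print({k: round(s, 4) for k, s in sil.items()})
```

The candidate k with the higher silhouette score is preferred, which is the comparison made between three and four clusters above.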


1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.
1. Hierarchical clustering:
i. F-cluster
Table 11 Cluster Profiling using F-cluster

ii. Agglomerative clustering


Table 12 Cluster Profiling using agglomerative method

• Cluster 1 in both the fcluster and agglomerative results contains customers with a very good
credit record. These customers can be given attractive promotional offers, such as cash-back
on spending above a particular limit, to encourage their spending.
• Cluster 2 in fcluster and cluster 0 in agglomerative clustering contain customers with a very
bad credit record. Offers need to be given to these customers to encourage timely repayment of
the credit taken and to avoid bad loans.
• Cluster 3 in fcluster and cluster 2 in agglomerative clustering contain customers with an
average credit record. As the probability of full payment for customers in this segment is good,
these are the best potential customers to encourage to utilize more credit. As the advance
payment record is poor for these customers, more interest on credit can be earned from this
segment.

2. K-means clustering:

Table 13 Cluster Profiling using K-means clustering

• Cluster 2 contains the customers with the best credit record.
• Cluster 1 contains the customers with a poor credit record.
• Cluster 0 contains the customers with an average credit record. This segment of customers
needs to be focused on to increase the profit from interest earned.
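Cluster profiles of this kind are typically built by averaging each feature within a cluster and adding the cluster frequency. A minimal sketch with pandas follows; the rows and cluster labels are hypothetical values invented for the example, and only the column names come from the report.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the report.
df = pd.DataFrame({
    "spending": [19.0, 18.5, 11.9, 12.1, 14.4, 14.8],
    "probability_of_full_payment": [0.88, 0.89, 0.85, 0.84, 0.88, 0.87],
    "cluster": [1, 1, 2, 2, 3, 3],
})

# Mean of every feature per cluster, plus the number of customers in each cluster.
profile = df.groupby("cluster").mean()
profile["freq"] = df["cluster"].value_counts().sort_index()
print(profile)
```

Reading the profile row by row (high, low or middling means per feature) is what supports labels such as good, poor or average credit record.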


Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate,
Bi-variate, and multivariate analysis).
Introduction:
The data set comes from an insurance firm that provides tour insurance and is facing a higher
claim frequency. This report compares the performance of three models, namely CART, RF and
ANN, on the train and test sets.

Exploratory Data Analysis:


i. Sample data set is as follows:

Table 14 Sample of Insurance data set

ii. The information of the above data set is as follows:

Table 15 Information of Insurance data set

iii. The basic statistics of the above data set (Table 14) are as follows:


Table 16 Description of Insurance dataset

The dataset contains 139 duplicates.

iv. Boxplot

Figure 9 Box plots of columns in Insurance data set

Boxplot in Figure 9 shows that there are outliers in all the integer datatype columns.
v. Pairplot


Figure 10 Pairplot of columns in Insurance dataset


• The diagonal plots show that all the variables are right-skewed.
• There is no correlation between the variables, except between sales and commission.


vi. Correlation plot

Figure 11 Correlation plot of columns in Insurance data set

The correlation plot shows a strong correlation only between commission and sales.
Sales and duration show a good correlation, duration and commission show a weak
correlation, and age has an extremely poor correlation with the other variables.

vii. Null value check


Table 17 Null value check for Insurance dataset

Table 17 shows that there are no null values in any of the features.

2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network
2.2.1 Splitting data into train and test sets:
i. First, it is necessary to convert all the features with ‘object’ data type to ‘categorical’ and
to assign an appropriate code to each unique value within each feature. The following figure
shows the unique values in each feature and their respective codes.

Figure 12 Codes for objects in features with original data type as 'object'
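One common way to perform this conversion is with pandas.Categorical, sketched below. The frame is a hypothetical five-row example; the column names and category values are those mentioned in this report, and the exact spelling of the values in the real data set may differ.

```python
import pandas as pd

# Hypothetical rows using category values mentioned in the report.
df = pd.DataFrame({
    "Agency_Code": ["C2B", "EPX", "CWT", "JZI", "C2B"],
    "Type": ["Airlines", "Travel Agency", "Travel Agency", "Airlines", "Airlines"],
})

for col in ["Agency_Code", "Type"]:
    cat = pd.Categorical(df[col])               # categories are sorted alphabetically
    print(col, dict(enumerate(cat.categories))) # mapping: code -> original value
    df[col] = cat.codes                         # replace the strings with integer codes

print(df)
```

Because the categories are sorted before codes are assigned, ‘C2B’ receives code 0 and ‘Airlines’ receives code 0, which agrees with the codes quoted later in the report.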


ii. Splitting the data set into train and test sets: The data set is split into train and test sets
with test_size as 0.25 and random state as 1. A test_size of 0.25 means that 25% of the data is
used for testing and 75% for training the model. Fixing the random state ensures that the
split, and therefore the model results, are reproducible.
iii. The dimensions of the train and test data are as follows:

Figure 13 Dimensions of Train and Test data sets
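The split can be sketched with scikit-learn's train_test_split using the test_size and random state quoted above. The arrays are placeholders of 100 rows, not the insurance data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and target: 100 rows, 2 columns.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state reproduces the same split.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)
print(np.array_equal(X_train, X_train_again))
```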

2.2.2 Building models:


2.2.2.1 Decision tree:
The best parameters for the decision tree classifier with the ‘gini’ criterion and 10-fold
cross-validation are shown in Figure 14. The search is scored on recall.

Figure 14 Decision tree best parameters
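A parameter search of this kind can be sketched with scikit-learn's GridSearchCV. The data is synthetic and the parameter grid is hypothetical; the report does not list the grid that was searched, so only the criterion, the 10 folds and the recall scoring are taken from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the insurance data.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Hypothetical grid; the values actually searched are not given in the report.
param_grid = {"max_depth": [3, 5], "min_samples_leaf": [10, 20]}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=1),
    param_grid,
    cv=10,             # 10-fold cross-validation
    scoring="recall",  # pick the combination with the best recall
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern applies to the random forest and neural network searches by swapping the estimator and the grid.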

2.2.2.2 Decision Tree_Feature Importance:

Table 18 shows the importance of the independent features impacting the dependent
feature.

Table 18 Decision Tree_Feature Importance

Figure 15 Decision tree

2.2.2.3 Random Forest:
The best parameters for the random forest classifier with 10-fold cross-validation are shown
in Figure 16. The search is scored on recall.

Figure 16 Random Forest_best parameters

2.2.2.4 Random Forest_Feature Importance:


Table 19 Random Forest_Feature Importance
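Feature importance tables like the ones above can be produced from a fitted tree-based model through its feature_importances_ attribute. A minimal sketch on synthetic data follows; the feature names f0 to f3 and the model settings are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 4 features, 2 of which carry signal.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
cols = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Importances sum to 1; sort them so the most influential feature comes first.
importance = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(importance)
```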

2.2.2.5 Neural Network classifier:


The best parameters for the neural network classifier with 10-fold cross-validation are shown
in Figure 17. The search is scored on recall.

Figure 17 Neural Network_best parameters

2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC
score, classification reports for each model.
2.3.1 Decision Tree_Model evaluation:
• AUC and ROC for the training set:

Figure 18 AUC for training set (Decision tree)

• AUC and ROC for the test set:

Figure 19 AUC for test set (Decision tree)

• From Figure 18 and Figure 19, it can be seen that the area under the curve is 0.830 for the
training set and 0.806 for the test set.
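The ROC curve and AUC for a train and test set can be computed as in the sketch below, assuming scikit-learn's roc_curve and roc_auc_score. The data and the tree depth are illustrative, so the scores will differ from the 0.830 and 0.806 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and an illustrative shallow tree.
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)

auc = {}
for name, (Xs, ys) in {"train": (X_train, y_train), "test": (X_test, y_test)}.items():
    proba = clf.predict_proba(Xs)[:, 1]          # probability of class 1
    fpr, tpr, thresholds = roc_curve(ys, proba)  # points of the ROC curve
    auc[name] = roc_auc_score(ys, proba)         # area under that curve

print({k: round(v, 3) for k, v in auc.items()})
```

Plotting `fpr` against `tpr` for each set gives curves of the kind shown in Figure 18 and Figure 19.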

• Confusion matrix:

Figure 20 Confusion matrix_Decision tree

• Classification report for training data:

Table 20 Classification report of training data_Decision tree

• Classification report for test data:

Table 21 Classification report of test data_Decision tree

• Inference:
i. The classification reports for both the training and test data show that recall is higher
for 0’s (No claim) than for 1’s (Claimed).
ii. The F1 score, which balances precision and recall, is balanced across the train and
test data sets.
iii. The accuracy on the train and test data sets is approximately the same, suggesting that
the model generalizes well rather than overfitting.
iv. From the ROC curve it can be inferred that a threshold value of 0.25 gives a good
balance between true positive rate and false positive rate.
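The effect of moving the classification threshold from the default 0.5 to 0.25 can be illustrated with a small sketch. The labels and predicted probabilities below are hypothetical values invented for the example, not model output from this report.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predicted probabilities of class 1 (Claimed).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
proba = np.array([0.10, 0.20, 0.30, 0.05, 0.80, 0.28, 0.60, 0.40, 0.15, 0.22])

results = {}
for threshold in (0.5, 0.25):
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    # Recall of class 1 (true positive rate) and false positive rate.
    results[threshold] = (tp / (tp + fn), fp / (fp + tn))
    print(threshold, results[threshold])

# Full report for the last threshold evaluated (0.25).
print(classification_report(y_true, y_pred))
```

Lowering the threshold raises recall for the claimed class at the cost of more false positives, which is the trade-off read off the ROC curve.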

2.3.2 Random forest_Model evaluation:


• Confusion matrix:

Figure 21 Confusion matrix_Random forest

• Classification report for training data:

Table 22 Classification report of training data_Random forest

• Classification report for test data:

Table 23 Classification report of test data_Random forest

• AUC and ROC for the training set:

Figure 22 AUC for training set (Random_forest)

• AUC and ROC for the test set:

Figure 23 AUC for test set (Random_forest)

• Inference:
i. The classification reports for both the training and test data show that recall is higher
for 0’s (No claim) than for 1’s (Claimed).
ii. The F1 score, which balances precision and recall, is less balanced across the train
and test data sets.
iii. The accuracy on the train and test data sets is approximately the same, suggesting that
the model generalizes well rather than overfitting.
iv. From the ROC curve it can be inferred that a threshold value of 0.25 gives a good
balance between true positive rate and false positive rate.

2.3.3 Neural Network classifier_Model evaluation:

• Confusion matrix:

Figure 24 Confusion matrix_Neural Network classifier

• Classification report for training data:

Table 24 Classification report of training data_Neural network classifier

• Classification report for test data:

Table 25 Classification report of test data_Neural network classifier

• AUC and ROC for the training set:

Figure 25 AUC for training set (Neural network classifier)

• AUC and ROC for the test set:

Figure 26 AUC for test set (Neural network classifier)

• Inference:
i. The classification reports for both the training and test data show that recall is higher
for 0’s (No claim) than for 1’s (Claimed).
ii. The F1 score, which balances precision and recall, is less balanced in the test data
set.
iii. The accuracy on the train and test data sets is approximately the same, suggesting that
the model generalizes well rather than overfitting.
iv. From the ROC curve it can be inferred that a threshold value of 0.25 gives a good
balance between true positive rate and false positive rate.

2.4 Final Model - Compare all the models and write an inference which model is
best/optimized.

Table 26 Comparison of performance metrics of three models

Table 26 shows that the CART model performs best among the three models used in this
analysis, for the following reasons:

i. Accuracy on the training data is highest for the CART model.
ii. Recall on both the training and test data is highest for CART. Also, there is a better
balance between training and test recall in CART than in the other models.
iii. The F1 score, which balances precision and recall, is also best balanced and has the
best values for the CART model.

It can also be seen from Table 26 and Figure 27 that there is no significant difference in
ROC AUC between the three models.


Figure 27 Comparison of AUC curves for three models
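A comparison table of this kind can be assembled by evaluating each fitted model with the same metrics. The sketch below uses synthetic data and illustrative, untuned model settings, so its numbers say nothing about which model is best on the insurance data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# Illustrative settings only; the report's tuned parameters are in Figures 14, 16 and 17.
models = {
    "CART": DecisionTreeClassifier(max_depth=3, random_state=1),
    "RF": RandomForestClassifier(n_estimators=50, random_state=1),
    "ANN": MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=1),
}

rows = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    rows[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "auc": roc_auc_score(y_te, proba),
    }

# One column per model, one row per metric (test set only).
comparison = pd.DataFrame(rows).round(3)
print(comparison)
```

Running the same loop on the training set as well gives the train and test columns needed to judge the balance between them.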

2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations
i. According to the CART model, the most important feature impacting the claims is
‘Agency_code’. It can be seen from Figure 28 that Agency_Code ‘0’ (C2B) has the highest
number of claims, followed by ‘EPX’, ‘CWT’ and ‘JZI’.

Figure 28 Crosstab chart (Agency_code vs. Claimed)

ii. It can be seen from Figure 29 that more claims are made for type ‘0’ (Airlines) than for
type ‘1’ (Travel agency).

Figure 29 Crosstab chart (Claimed vs. Type)

iii. Sales also play an important role in claims, as seen in Figure 30 and Figure 31: claims
increase as sales increase. Sales is the second most important feature according to the CART model.


Figure 30 Bar chart (Claimed vs. Sales)

Figure 31 Bar chart (Agency_code vs. Sales)
