Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers
to its customers. They collected a sample that summarizes the activities of users
during the past few months. You are given the task to identify the segments based
on credit card usage.
1.1 Read the data and do exploratory data analysis. Describe the data briefly.
After loading the data, we can see that there are 7 features and 210 records in the dataset.
The 7 features are:
spending: Amount spent by the customer per month (in 1000s)
advance_payments: Amount paid by the customer in advance by cash (in 100s)
probability_of_full_payment: Probability of payment done in full by the customer to the bank
current_balance: Balance amount left in the account to make purchases (in 1000s)
credit_limit: Limit of the amount in the credit card (in 10000s)
min_payment_amt: Minimum amount paid by the customer while making payments for purchases made monthly (in 100s)
max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)
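A minimal sketch of the loading and inspection steps described here (the file name is an assumption; adjust to the actual dataset path):

```python
import pandas as pd

# Assumed file name; adjust to the actual dataset path
df = pd.read_csv("bank_marketing_part1_Data.csv")

print(df.shape)               # expected (210, 7)
print(df.head())
print(df.duplicated().sum())  # duplicate-record check
print(df.isnull().sum())      # null-value check
df.info()                     # datatypes: float64 for all columns
print(df.describe().T)        # 5-point summary
```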
There are no duplicate records and no null values in the dataset.
Let's look at the dataset information. All columns have the 'float64' datatype.
Let's look at the description/5-point summary of the dataset.
Let's look at the histograms of the features.
Let's look at the covariance of the dataset.
Let's look at the correlation of the dataset.
Let's look at the correlation matrix.
Let's look at the pairplot.
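A sketch of these EDA steps, assuming the dataframe df from the load step above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histograms of all numeric features
df.hist(figsize=(12, 8))
plt.tight_layout()
plt.show()

# Covariance and correlation tables
print(df.cov())
print(df.corr())

# Correlation matrix as an annotated heatmap
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()

# Pairwise scatter plots of all features
sns.pairplot(df)
plt.show()
```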
1.2 Do you think scaling is necessary for clustering in this case? Justify
Let's look at the data description to understand the 'weight' of each feature in the overall dataset.
We can see that the 'max' values of the features are on very different scales (for example, spending is in 1000s while probability_of_full_payment lies between 0 and 1), so unscaled features would contribute very unequally to the distance calculations that clustering relies on. To standardize the impact of all features, scaling is needed.
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum
clusters using Dendrogram and briefly describe them
Let's first scale the data.
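A sketch using StandardScaler (z-scoring; the report does not state which scaler was used):

```python
from sklearn.preprocessing import StandardScaler
import pandas as pd

# z-score scaling: every feature gets mean 0 and unit variance
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled_df.describe().round(2))
```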
Let's look at the dendrogram.
Let's look at the truncated dendrogram showing only the last 25 merged clusters.
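A sketch of the dendrogram steps; Ward linkage is an assumption, as the report does not state the linkage method:

```python
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hierarchical clustering on the scaled features (Ward linkage assumed)
wardlink = linkage(scaled_df, method="ward")

# Full dendrogram
dendrogram(wardlink)
plt.show()

# Truncated dendrogram: show only the last 25 merges
dendrogram(wardlink, truncate_mode="lastp", p=25)
plt.show()
```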
Let's find the optimum number of clusters, first based on the 'maxclust' criterion. We get two clusters.
Now let's find clusters based on the 'distance' criterion. Again we get two clusters.
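A sketch of both criteria with scipy's fcluster; the distance threshold is an illustrative assumption that would be read off the dendrogram:

```python
from scipy.cluster.hierarchy import fcluster
import pandas as pd

# 'maxclust' criterion: ask directly for 2 clusters
clusters_max = fcluster(wardlink, 2, criterion="maxclust")
print(pd.Series(clusters_max).value_counts())

# 'distance' criterion: cut the tree at a chosen height
# (the threshold 23 is an illustrative assumption)
clusters_dist = fcluster(wardlink, 23, criterion="distance")
print(pd.Series(clusters_dist).value_counts())
```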
1.4 Apply K-Means clustering on scaled data and determine optimum clusters.
Apply elbow curve and silhouette score.
First, let's compute the WSS (within-cluster sum of squares) for a range of k values.
Now let's plot the WSS values as an elbow curve.
We get four optimal clusters.
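A sketch of the elbow curve plus silhouette scores (the latter to corroborate the choice of k, per the question), assuming scaled_df from the scaling step:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# WSS (inertia) for k = 1..10
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1).fit(scaled_df)
    wss.append(km.inertia_)

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WSS (inertia)")
plt.show()

# Silhouette scores for k = 2..10 to corroborate the elbow
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(scaled_df)
    print(k, silhouette_score(scaled_df, labels))
```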
1.5 Describe cluster profiles for the clusters defined. Recommend different pro-
motional strategies for different clusters.
Recommendations
1. Cluster 0 should be given fewer general offers and more 'on the payment' club offers.
2. Cluster 1 should be given fewer general offers and more 'full payment' club offers.
3. Cluster 2 should be given all offers, as it holds the second-highest values in most attributes.
4. Cluster 3 should be given all offers, as it holds the highest values in most attributes: high spending and advance payments, and the highest credit limit. Provide offers clubbing minimum payment and full payment.
Description
Cluster 0: Customers with the lowest values in all attributes other than min_payment_amt and max_spent_in_single_shopping. It holds the highest min_payment_amt.
Cluster 1: Customers with the second-lowest values in all attributes other than min_payment_amt and max_spent_in_single_shopping. It holds the lowest min_payment_amt and max_spent_in_single_shopping.
Cluster 2: Customers with the second-highest values in all attributes other than min_payment_amt.
Cluster 3: Customers with the highest values in all attributes other than min_payment_amt.
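A sketch of how profiles like these can be produced, assuming df and scaled_df from the earlier steps:

```python
from sklearn.cluster import KMeans

# Fit the chosen 4-cluster solution and profile clusters on the original scale
km4 = KMeans(n_clusters=4, random_state=1)
df["cluster"] = km4.fit_predict(scaled_df)

profile = df.groupby("cluster").mean()
profile["freq"] = df["cluster"].value_counts().sort_index()
print(profile.round(2))
```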
Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The
management decides to collect data from the past few years. You are assigned the
task to make a model which predicts the claim status and provide recommendations
to management. Use CART, RF & ANN and compare the models' performances in
train and test sets.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and a null value condition check; write an inference on it.
Let's check the head of the dataset.
We removed 'Agency_Code', which seems of no use for analysis in this dataset.
Let's look at the dataset information.
The dataset contains no null values.
Let's look at the dataset description.
The dataset contains 139 duplicate records.
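A minimal sketch of these ingestion and inspection steps (the file name is an assumption; the dropped columns follow the inferences listed below):

```python
import pandas as pd

# Assumed file name; adjust to the actual dataset path
ins_df = pd.read_csv("insurance_part2_data.csv")

print(ins_df.head())
ins_df.info()                     # datatypes and non-null counts
print(ins_df.isnull().sum())      # no null values expected
print(ins_df.duplicated().sum())  # 139 duplicate records here
print(ins_df.describe(include="all").T)

# Drop the columns judged of no analytical use (see the inferences below)
ins_df = ins_df.drop(columns=["Agency_Code", "Product Name", "Destination"])
```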
Let's look at the pairplot of the dataset.
Let's look at the correlation matrix.
The dataset looks imbalanced with respect to the target.
Let's check whether the dataset contains outliers.
Let's treat the outliers.
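One common treatment is to cap values outside the IQR whiskers; the report does not state the method used, so this is a sketch under that assumption:

```python
import numpy as np

def cap_outliers(series):
    # Cap values outside the 1.5*IQR whiskers (assumed treatment method)
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

num_cols = ins_df.select_dtypes(include=np.number).columns
ins_df[num_cols] = ins_df[num_cols].apply(cap_outliers)
```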
Inferences
* After a successful data load, we see "Agency_Code", "Product Name" and "Destination" as columns of no analytical use, so we drop them.
* We check the dataset information and find no null values, but we have object columns to be taken care of.
* We encode the columns whose datatype is object to numeric so that the models can use them (a sketch follows this list).
* We see slightly imbalanced data, roughly a 70/30 ratio for the 'Claimed' attribute of the dataset.
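A sketch of the encoding step; categorical codes are used here, though the exact encoder in the report is not stated:

```python
import pandas as pd

# Encode every object column as numeric category codes
# (pd.Categorical codes; the report's exact encoder is an assumption)
for col in ins_df.select_dtypes(include="object").columns:
    ins_df[col] = pd.Categorical(ins_df[col]).codes
```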
2.2 Data Split: Split the data into test and train, build classification model CART,
Random Forest, Artificial Neural Network
CART Model
Let's split the dataset and check its dimensions.
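A sketch of the split; the 70/30 ratio and the random seed are assumptions, as the report does not state them:

```python
from sklearn.model_selection import train_test_split

X = ins_df.drop(columns=["Claimed"])
y = ins_df["Claimed"]

# 70/30 train/test split (ratio and seed assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)
```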
Let's look at the variable importance.
Let's look at the predicted classes and class probabilities.
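A sketch of the CART fit, importance listing and predictions, assuming the split above; the hyperparameters are illustrative assumptions:

```python
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Gini-based CART; max_depth is an illustrative assumption
cart = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=1)
cart.fit(X_train, y_train)

# Variable importance, highest first
print(pd.Series(cart.feature_importances_, index=X.columns)
        .sort_values(ascending=False))

# Predicted class and class-1 probability on the test set
pred_class = cart.predict(X_test)
pred_prob = cart.predict_proba(X_test)[:, 1]
```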
We have the CART model with the training results below.
We have the CART model with the test results below.
Random Forest Model
We have the random forest model with the training results below.
We have the random forest model with the test results below.
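A sketch of the random forest fit; the hyperparameters are illustrative, as the report does not state them:

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators is an illustrative assumption
rf = RandomForestClassifier(n_estimators=300, random_state=1)
rf.fit(X_train, y_train)
print(rf.score(X_train, y_train), rf.score(X_test, y_test))
```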
Neural Network Model
We have the neural network model with the training results below.
We have the neural network model with the test results below.
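A sketch using sklearn's MLPClassifier; the architecture and iteration budget are illustrative assumptions:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# MLPs are sensitive to feature scale, so scale the inputs first
sc = StandardScaler().fit(X_train)
X_train_s, X_test_s = sc.transform(X_train), sc.transform(X_test)

# Hidden-layer size and max_iter are illustrative assumptions
ann = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=1)
ann.fit(X_train_s, y_train)
print(ann.score(X_train_s, y_train), ann.score(X_test_s, y_test))
```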
2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model
CART Model
We have the CART model with the training results below.
We have the CART model with the test results below.
Training Data AUC: 0.838
Testing Data AUC: 0.818
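A sketch of how these metrics can be produced for any of the fitted models above (for the neural network, the scaled matrices X_train_s/X_test_s would be passed instead):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_curve, roc_auc_score)
import matplotlib.pyplot as plt

def evaluate(model, X, y, label):
    """Print accuracy and confusion matrix, and plot the ROC curve."""
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]
    print(label, "accuracy:", accuracy_score(y, pred))
    print(confusion_matrix(y, pred))
    fpr, tpr, _ = roc_curve(y, prob)
    plt.plot(fpr, tpr, label=f"{label} AUC={roc_auc_score(y, prob):.3f}")

evaluate(cart, X_train, y_train, "CART train")
evaluate(cart, X_test, y_test, "CART test")
plt.plot([0, 1], [0, 1], "k--")
plt.legend()
plt.show()
```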
Random Forest Model
We have the random forest model with the training results below.
We have the random forest model with the test results below.
Training Data AUC: 0.872
Testing Data AUC: 0.822
Neural Network Model
We have the neural network model with the training results below.
We have the neural network model with the test results below.
Neural network Training Data AUC: 0.782
Neural network Testing Data AUC: 0.736
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.
ROC Curve for the 3 models on the Training data
ROC Curve for the 3 models on the Test data
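A sketch of the test-set comparison plot, assuming the three fitted models from section 2.2:

```python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Overlay the three test-set ROC curves for a side-by-side comparison
for name, model, Xs in [("CART", cart, X_test),
                        ("Random Forest", rf, X_test),
                        ("Neural Network", ann, X_test_s)]:
    prob = model.predict_proba(Xs)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, prob)
    plt.plot(fpr, tpr, label=f"{name} AUC={roc_auc_score(y_test, prob):.3f}")

plt.plot([0, 1], [0, 1], "k--")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()
```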
Based on the comparison above, Random Forest appears to be the best model in this case.
2.5 Inference: Based on the whole analysis, what are the business insights and recommendations?
Out of the 3 models, Random Forest has slightly better performance than the CART and neural network models.
Overall, all 3 models are reasonably stable enough to be used for making future predictions. From the CART and Random Forest models, the variable 'change' is found to be the most useful feature among all others for predicting whether a customer will file a claim. If 'change' is yes, those customers have a higher chance of filing a claim.