Bank Customer Segmentation Guide

The bank wants to develop a customer segmentation model to target customers with promotional offers. They collected data on customers' credit card usage over the past few months for analysis. The task is to identify customer segments based on variables related to credit card spending, payments, balances and limits in the dataset. Exploratory data analysis found the data has 210 records with no missing values, all variables are numeric, and descriptive statistics show the variables are distributed relatively evenly.

Problem 1: Clustering

A leading bank wants to develop a customer segmentation to give promotional offers to its
customers.

They collected a sample that summarizes the activities of users during the past few months.

You are given the task to identify the segments based on credit card usage.

Data Dictionary for Market Segmentation:

- spending: Amount spent by the customer per month (in 1000s)
- advance_payments: Amount paid by the customer in advance by cash (in 100s)
- probability_of_full_payment: Probability of payment done in full by the customer to the bank
- current_balance: Balance amount left in the account to make purchases (in 1000s)
- credit_limit: Limit of the amount in credit card (in 10000s)
- min_payment_amt: Minimum paid by the customer while making payments for purchases made monthly (in 100s)
- max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)

--------------------------------------------------------------------------------

Import the libraries


In [1]:
#Load the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#Plot styling
import seaborn as sns; sns.set() # for plot styling
%matplotlib inline

In [3]:
# Import stats from scipy
from scipy import stats

--------------------------------------------------------------------------------

1.1 Read the data and do exploratory data analysis. Describe the data briefly
Load data
In [4]:
#Read the csv file
orginal_dataset = pd.read_csv('../input/bank-marketing/bank_marketing_part1_Data.csv')
Checking the data
In [5]:
orginal_dataset.head()

Out[5]:

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
0     19.94             16.92                       0.8752            6.675         3.763            3.252                         6.550
1     15.99             14.89                       0.9064            5.363         3.582            3.336                         5.144
2     18.95             16.42                       0.8829            6.248         3.755            3.368                         6.148
3     10.83             12.96                       0.8099            5.278         2.641            5.182                         5.185
4     17.99             15.86                       0.8992            5.890         3.694            2.068                         5.837

In [6]:
orginal_dataset.tail()

Out[6]:

     spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
205     13.89             14.02                       0.8880            5.439         3.199            3.986                         4.738
206     16.77             15.62                       0.8638            5.927         3.438            4.920                         5.795
207     14.03             14.16                       0.8796            5.438         3.201            1.717                         5.001
208     16.12             15.00                       0.9000            5.709         3.485            2.270                         5.443
209     15.57             15.15                       0.8527            5.920         3.231            2.640                         5.879

Observation

Data looks good based on the initial records seen in the top 5 and bottom 5 rows.

In [7]:
orginal_dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 spending 210 non-null float64
1 advance_payments 210 non-null float64
2 probability_of_full_payment 210 non-null float64
3 current_balance 210 non-null float64
4 credit_limit 210 non-null float64
5 min_payment_amt 210 non-null float64
6 max_spent_in_single_shopping 210 non-null float64
dtypes: float64(7)
memory usage: 11.6 KB

Observation:

- 7 variables and 210 records.
- No missing records based on initial analysis.
- All variables are of numeric type.
In [8]:
### data dimensions

orginal_dataset.shape

Out[8]:
(210, 7)
In [9]:
### Checking for Missing Values
orginal_dataset.isnull().sum()

Out[9]:
spending 0
advance_payments 0
probability_of_full_payment 0
current_balance 0
credit_limit 0
min_payment_amt 0
max_spent_in_single_shopping 0
dtype: int64

Observation

No missing value.

Univariate analysis
Checking the Summary Statistics
In [10]:
## Initial descriptive analysis of the data

orginal_dataset.describe(percentiles=[.25,0.50,0.75,0.90]).T

Out[10]:

                              count       mean       std      min       25%      50%       75%      90%      max
spending                      210.0  14.847524  2.909699  10.5900  12.27000  14.35500  17.305000  18.9880  21.1800
advance_payments              210.0  14.559286  1.305959  12.4100  13.45000  14.32000  15.715000  16.4540  17.2500
probability_of_full_payment   210.0   0.870999  0.023629   0.8081   0.85690   0.87345   0.887775   0.8993   0.9183
current_balance               210.0   5.628533  0.443063   4.8990   5.26225   5.52350   5.979750   6.2733   6.6750
credit_limit                  210.0   3.258605  0.377714   2.6300   2.94400   3.23700   3.561750   3.7865   4.0330
min_payment_amt               210.0   3.700201  1.503557   0.7651   2.56150   3.59900   4.768750   5.5376   8.4560
max_spent_in_single_shopping  210.0   5.408071  0.491480   4.5190   5.04500   5.22300   5.877000   6.1850   6.5500

Observation

- Based on the summary statistics, the data looks good.
- For most of the variables, mean and median are nearly equal.
- The 90th percentile was included to check variation; the variables look evenly distributed.
- Standard deviation is high for the spending variable.

Spending variable
In [11]:
print('Range of values: ', orginal_dataset['spending'].max()-orginal_dataset['spending'].min())
Range of values: 10.59
In [12]:
#Central values
print('Minimum spending: ', orginal_dataset['spending'].min())
print('Maximum spending: ',orginal_dataset['spending'].max())
print('Mean value: ', orginal_dataset['spending'].mean())
print('Median value: ',orginal_dataset['spending'].median())
print('Standard deviation: ', orginal_dataset['spending'].std())
print('Null values: ',orginal_dataset['spending'].isnull().any())
Minimum spending: 10.59
Maximum spending: 21.18
Mean value: 14.847523809523818
Median value: 14.355
Standard deviation: 2.909699430687361
Null values: False
In [13]:
#Quartiles

Q1=orginal_dataset['spending'].quantile(q=0.25)
Q3=orginal_dataset['spending'].quantile(q=0.75)
print('spending - 1st Quartile (Q1) is: ', Q1)
print('spending - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of spending is ', stats.iqr(orginal_dataset['spending']))
spending - 1st Quartile (Q1) is: 12.27
spending - 3rd Quartile (Q3) is: 17.305
Interquartile range (IQR) of spending is 5.035
In [14]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in spending: ', L_outliers)
print('Upper outliers in spending: ', U_outliers)
Lower outliers in spending: 4.717499999999999
Upper outliers in spending: 24.8575
In [15]:
print('Number of outliers in spending upper : ', orginal_dataset[orginal_dataset['spending']>24.8575]['spending'].count())
print('Number of outliers in spending lower : ', orginal_dataset[orginal_dataset['spending']<4.717499]['spending'].count())
print('% of Outlier in spending upper: ', round(orginal_dataset[orginal_dataset['spending']>24.8575]['spending'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in spending lower: ', round(orginal_dataset[orginal_dataset['spending']<4.717499]['spending'].count()*100/len(orginal_dataset)), '%')
Number of outliers in spending upper : 0
Number of outliers in spending lower : 0
% of Outlier in spending upper: 0.0 %
% of Outlier in spending lower: 0.0 %
In [16]:
plt.title('spending')
sns.boxplot(orginal_dataset['spending'], orient='h', color='purple')

Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f64619a9210>

In [17]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='spending',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('spending', fontsize=15)
ax1.set_title('Distribution of spending', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['spending'],ax=ax2)
ax2.set_xlabel('spending', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['spending'])
ax3.set_xlabel('spending', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
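The same range, central-value, quartile, and outlier computations are repeated below for each remaining variable. As an aside (not part of the original notebook), a small helper along the lines of the hypothetical univariate_summary sketch below could replace that repetition:

from scipy import stats

def univariate_summary(df, col):
    # Print range, central values, quartiles and IQR outlier fences for one column
    s = df[col]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = stats.iqr(s)
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    print(col, '- min:', s.min(), ', max:', s.max(), ', mean:', s.mean(), ', median:', s.median())
    print(col, '- Q1:', q1, ', Q3:', q3, ', IQR:', iqr)
    print(col, '- outliers below', lower, ':', (s < lower).sum(), '; above', upper, ':', (s > upper).sum())

# Example usage (hypothetical): univariate_summary(orginal_dataset, 'spending')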

advance_payments variable
In [18]:
print('Range of values: ', orginal_dataset['advance_payments'].max()-orginal_dataset['advance_payments'].min())
Range of values: 4.84
In [19]:
#Central values
print('Minimum advance_payments: ', orginal_dataset['advance_payments'].min())
print('Maximum advance_payments: ',orginal_dataset['advance_payments'].max())
print('Mean value: ', orginal_dataset['advance_payments'].mean())
print('Median value: ',orginal_dataset['advance_payments'].median())
print('Standard deviation: ', orginal_dataset['advance_payments'].std())
print('Null values: ',orginal_dataset['advance_payments'].isnull().any())
Minimum advance_payments: 12.41
Maximum advance_payments: 17.25
Mean value: 14.559285714285727
Median value: 14.32
Standard deviation: 1.305958726564022
Null values: False
In [20]:
#Quartiles

Q1=orginal_dataset['advance_payments'].quantile(q=0.25)
Q3=orginal_dataset['advance_payments'].quantile(q=0.75)
print('advance_payments - 1st Quartile (Q1) is: ', Q1)
print('advance_payments - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of advance_payments is ', stats.iqr(orginal_dataset['advance_payments']))
advance_payments - 1st Quartile (Q1) is: 13.45
advance_payments - 3rd Quartile (Q3) is: 15.715
Interquartile range (IQR) of advance_payments is 2.2650000000000006
In [21]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in advance_payments: ', L_outliers)
print('Upper outliers in advance_payments: ', U_outliers)
Lower outliers in advance_payments: 10.052499999999998
Upper outliers in advance_payments: 19.1125
In [22]:
print('Number of outliers in advance_payments upper : ', orginal_dataset[orginal_dataset['advance_payments']>19.1125]['advance_payments'].count())
print('Number of outliers in advance_payments lower : ', orginal_dataset[orginal_dataset['advance_payments']<10.052499]['advance_payments'].count())
print('% of Outlier in advance_payments upper: ', round(orginal_dataset[orginal_dataset['advance_payments']>19.1125]['advance_payments'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in advance_payments lower: ', round(orginal_dataset[orginal_dataset['advance_payments']<10.052499]['advance_payments'].count()*100/len(orginal_dataset)), '%')
Number of outliers in advance_payments upper : 0
Number of outliers in advance_payments lower : 0
% of Outlier in advance_payments upper: 0.0 %
% of Outlier in advance_payments lower: 0.0 %
In [23]:
plt.title('advance_payments')
sns.boxplot(orginal_dataset['advance_payments'], orient='h', color='purple')

Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b422690>
In [24]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='advance_payments',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('advance_payments', fontsize=15)
ax1.set_title('Distribution of advance_payments', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['advance_payments'],ax=ax2)
ax2.set_xlabel('advance_payments', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['advance_payments'])
ax3.set_xlabel('advance_payments', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()

probability_of_full_payment variable

In [25]:
print('Range of values: ', orginal_dataset['probability_of_full_payment'].max()-orginal_dataset['probability_of_full_payment'].min())
Range of values: 0.11019999999999996
In [26]:
#Central values
print('Minimum probability_of_full_payment: ', orginal_dataset['probability_of_full_payment'].min())
print('Maximum probability_of_full_payment: ', orginal_dataset['probability_of_full_payment'].max())
print('Mean value: ', orginal_dataset['probability_of_full_payment'].mean())
print('Median value: ', orginal_dataset['probability_of_full_payment'].median())
print('Standard deviation: ', orginal_dataset['probability_of_full_payment'].std())
print('Null values: ', orginal_dataset['probability_of_full_payment'].isnull().any())
Minimum probability_of_full_payment: 0.8081
Maximum probability_of_full_payment: 0.9183
Mean value: 0.8709985714285714
Median value: 0.8734500000000001
Standard deviation: 0.023629416583846496
Null values: False
In [27]:
#Quartiles

Q1=orginal_dataset['probability_of_full_payment'].quantile(q=0.25)
Q3=orginal_dataset['probability_of_full_payment'].quantile(q=0.75)
print('probability_of_full_payment - 1st Quartile (Q1) is: ', Q1)
print('probability_of_full_payment - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of probability_of_full_payment is ', stats.iqr(orginal_dataset['probability_of_full_payment']))
probability_of_full_payment - 1st Quartile (Q1) is: 0.8569
probability_of_full_payment - 3rd Quartile (Q3) is: 0.887775
Interquartile range (IQR) of probability_of_full_payment is 0.030874999999999986
In [28]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in probability_of_full_payment: ', L_outliers)
print('Upper outliers in probability_of_full_payment: ', U_outliers)
Lower outliers in probability_of_full_payment: 0.8105875
Upper outliers in probability_of_full_payment: 0.9340875
In [29]:
print('Number of outliers in probability_of_full_payment upper : ', orginal_dataset[orginal_dataset['probability_of_full_payment']>0.9340875]['probability_of_full_payment'].count())
print('Number of outliers in probability_of_full_payment lower : ', orginal_dataset[orginal_dataset['probability_of_full_payment']<0.8105875]['probability_of_full_payment'].count())
print('% of Outlier in probability_of_full_payment upper: ', round(orginal_dataset[orginal_dataset['probability_of_full_payment']>0.9340875]['probability_of_full_payment'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in probability_of_full_payment lower: ', round(orginal_dataset[orginal_dataset['probability_of_full_payment']<0.8105875]['probability_of_full_payment'].count()*100/len(orginal_dataset)), '%')
Number of outliers in probability_of_full_payment upper : 0
Number of outliers in probability_of_full_payment lower : 3
% of Outlier in probability_of_full_payment upper: 0.0 %
% of Outlier in probability_of_full_payment lower: 1.0 %
In [30]:
plt.title('probability_of_full_payment')
sns.boxplot(orginal_dataset['probability_of_full_payment'], orient='h', color='purple')

Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b57fc50>

In [31]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='probability_of_full_payment',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('probability_of_full_payment', fontsize=15)
ax1.set_title('Distribution of probability_of_full_payment', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['probability_of_full_payment'],ax=ax2)
ax2.set_xlabel('probability_of_full_payment', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['probability_of_full_payment'])
ax3.set_xlabel('probability_of_full_payment', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()

current_balance variable

In [32]:
print('Range of values: ', orginal_dataset['current_balance'].max()-orginal_dataset['current_balance'].min())
Range of values: 1.7759999999999998
In [33]:
#Central values
print('Minimum current_balance: ', orginal_dataset['current_balance'].min())
print('Maximum current_balance: ',orginal_dataset['current_balance'].max())
print('Mean value: ', orginal_dataset['current_balance'].mean())
print('Median value: ',orginal_dataset['current_balance'].median())
print('Standard deviation: ', orginal_dataset['current_balance'].std())
print('Null values: ',orginal_dataset['current_balance'].isnull().any())
Minimum current_balance: 4.899
Maximum current_balance: 6.675
Mean value: 5.628533333333334
Median value: 5.5235
Standard deviation: 0.4430634777264493
Null values: False
In [34]:
#Quartiles

Q1=orginal_dataset['current_balance'].quantile(q=0.25)
Q3=orginal_dataset['current_balance'].quantile(q=0.75)
print('current_balance - 1st Quartile (Q1) is: ', Q1)
print('current_balance - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of current_balance is ', stats.iqr(orginal_dataset['current_balance']))
current_balance - 1st Quartile (Q1) is: 5.26225
current_balance - 3rd Quartile (Q3) is: 5.97975
Interquartile range (IQR) of current_balance is 0.7175000000000002
In [35]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in current_balance: ', L_outliers)
print('Upper outliers in current_balance: ', U_outliers)
Lower outliers in current_balance: 4.186
Upper outliers in current_balance: 7.056000000000001
In [36]:
print('Number of outliers in current_balance upper : ', orginal_dataset[orginal_dataset['current_balance']>7.056000000000001]['current_balance'].count())
print('Number of outliers in current_balance lower : ', orginal_dataset[orginal_dataset['current_balance']<4.186]['current_balance'].count())
print('% of Outlier in current_balance upper: ', round(orginal_dataset[orginal_dataset['current_balance']>7.056000000000001]['current_balance'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in current_balance lower: ', round(orginal_dataset[orginal_dataset['current_balance']<4.186]['current_balance'].count()*100/len(orginal_dataset)), '%')
Number of outliers in current_balance upper : 0
Number of outliers in current_balance lower : 0
% of Outlier in current_balance upper: 0.0 %
% of Outlier in current_balance lower: 0.0 %
In [37]:
plt.title('current_balance')
sns.boxplot(orginal_dataset['current_balance'], orient='h', color='purple')

Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b209dd0>

In [38]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='current_balance',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('current_balance', fontsize=15)
ax1.set_title('Distribution of current_balance', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['current_balance'],ax=ax2)
ax2.set_xlabel('current_balance', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['current_balance'])
ax3.set_xlabel('current_balance', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()

credit_limit variable

In [39]:
print('Range of values: ', orginal_dataset['credit_limit'].max()-orginal_dataset['credit_limit'].min())
Range of values: 1.4030000000000005
In [40]:
#Central values
print('Minimum credit_limit: ', orginal_dataset['credit_limit'].min())
print('Maximum credit_limit: ',orginal_dataset['credit_limit'].max())
print('Mean value: ', orginal_dataset['credit_limit'].mean())
print('Median value: ',orginal_dataset['credit_limit'].median())
print('Standard deviation: ', orginal_dataset['credit_limit'].std())
print('Null values: ',orginal_dataset['credit_limit'].isnull().any())
Minimum credit_limit: 2.63
Maximum credit_limit: 4.033
Mean value: 3.258604761904763
Median value: 3.237
Standard deviation: 0.3777144449065874
Null values: False
In [41]:
#Quartiles

Q1=orginal_dataset['credit_limit'].quantile(q=0.25)
Q3=orginal_dataset['credit_limit'].quantile(q=0.75)
print('credit_limit - 1st Quartile (Q1) is: ', Q1)
print('credit_limit - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of credit_limit is ', stats.iqr(orginal_dataset['credit_limit']))
credit_limit - 1st Quartile (Q1) is: 2.944
credit_limit - 3rd Quartile (Q3) is: 3.56175
Interquartile range (IQR) of credit_limit is 0.61775
In [42]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in credit_limit: ', L_outliers)
print('Upper outliers in credit_limit: ', U_outliers)
Lower outliers in credit_limit: 2.017375
Upper outliers in credit_limit: 4.488375
In [43]:
print('Number of outliers in credit_limit upper : ', orginal_dataset[orginal_dataset['credit_limit']>4.488375]['credit_limit'].count())
print('Number of outliers in credit_limit lower : ', orginal_dataset[orginal_dataset['credit_limit']<2.017375]['credit_limit'].count())
print('% of Outlier in credit_limit upper: ', round(orginal_dataset[orginal_dataset['credit_limit']>4.488375]['credit_limit'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in credit_limit lower: ', round(orginal_dataset[orginal_dataset['credit_limit']<2.017375]['credit_limit'].count()*100/len(orginal_dataset)), '%')
Number of outliers in credit_limit upper : 0
Number of outliers in credit_limit lower : 0
% of Outlier in credit_limit upper: 0.0 %
% of Outlier in credit_limit lower: 0.0 %
In [44]:
plt.title('credit_limit')
sns.boxplot(orginal_dataset['credit_limit'], orient='h', color='purple')

Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b036fd0>

In [45]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='credit_limit',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('credit_limit', fontsize=15)
ax1.set_title('Distribution of credit_limit', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['credit_limit'],ax=ax2)
ax2.set_xlabel('credit_limit', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['credit_limit'])
ax3.set_xlabel('credit_limit', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()

min_payment_amt variable

In [46]:
print('Range of values: ', orginal_dataset['min_payment_amt'].max()-orginal_dataset['min_payment_amt'].min())
Range of values: 7.690899999999999
In [47]:
#Central values
print('Minimum min_payment_amt: ', orginal_dataset['min_payment_amt'].min())
print('Maximum min_payment_amt: ',orginal_dataset['min_payment_amt'].max())
print('Mean value: ', orginal_dataset['min_payment_amt'].mean())
print('Median value: ',orginal_dataset['min_payment_amt'].median())
print('Standard deviation: ', orginal_dataset['min_payment_amt'].std())
print('Null values: ',orginal_dataset['min_payment_amt'].isnull().any())
Minimum min_payment_amt: 0.7651
Maximum min_payment_amt: 8.456
Mean value: 3.7002009523809507
Median value: 3.599
Standard deviation: 1.5035571308217792
Null values: False
In [48]:
#Quartiles

Q1=orginal_dataset['min_payment_amt'].quantile(q=0.25)
Q3=orginal_dataset['min_payment_amt'].quantile(q=0.75)
print('min_payment_amt - 1st Quartile (Q1) is: ', Q1)
print('min_payment_amt - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of min_payment_amt is ', stats.iqr(orginal_dataset['min_payment_amt']))
min_payment_amt - 1st Quartile (Q1) is: 2.5614999999999997
min_payment_amt - 3rd Quartile (Q3) is: 4.76875
Interquartile range (IQR) of min_payment_amt is 2.20725
In [49]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in min_payment_amt: ', L_outliers)
print('Upper outliers in min_payment_amt: ', U_outliers)
Lower outliers in min_payment_amt: -0.7493750000000006
Upper outliers in min_payment_amt: 8.079625
In [50]:
print('Number of outliers in min_payment_amt upper : ', orginal_dataset[orginal_dataset['min_payment_amt']>8.079625]['min_payment_amt'].count())
print('Number of outliers in min_payment_amt lower : ', orginal_dataset[orginal_dataset['min_payment_amt']<-0.749375]['min_payment_amt'].count())
print('% of Outlier in min_payment_amt upper: ', round(orginal_dataset[orginal_dataset['min_payment_amt']>8.079625]['min_payment_amt'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in min_payment_amt lower: ', round(orginal_dataset[orginal_dataset['min_payment_amt']<-0.749375]['min_payment_amt'].count()*100/len(orginal_dataset)), '%')
Number of outliers in min_payment_amt upper : 2
Number of outliers in min_payment_amt lower : 0
% of Outlier in min_payment_amt upper: 1.0 %
% of Outlier in min_payment_amt lower: 0.0 %
In [51]:
plt.title('min_payment_amt')
sns.boxplot(orginal_dataset['min_payment_amt'], orient='h', color='purple')

Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645ae1e090>
In [52]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='min_payment_amt',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('min_payment_amt', fontsize=15)
ax1.set_title('Distribution of min_payment_amt', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['min_payment_amt'],ax=ax2)
ax2.set_xlabel('min_payment_amt', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['min_payment_amt'])
ax3.set_xlabel('min_payment_amt', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()

max_spent_in_single_shopping variable

In [53]:
print('Range of values: ', orginal_dataset['max_spent_in_single_shopping'].max()-orginal_dataset['max_spent_in_single_shopping'].min())
Range of values: 2.0309999999999997
In [54]:
#Central values
print('Minimum max_spent_in_single_shopping: ', orginal_dataset['max_spent_in_single_shopping'].min())
print('Maximum max_spent_in_single_shopping: ', orginal_dataset['max_spent_in_single_shopping'].max())
print('Mean value: ', orginal_dataset['max_spent_in_single_shopping'].mean())
print('Median value: ', orginal_dataset['max_spent_in_single_shopping'].median())
print('Standard deviation: ', orginal_dataset['max_spent_in_single_shopping'].std())
print('Null values: ', orginal_dataset['max_spent_in_single_shopping'].isnull().any())
Minimum max_spent_in_single_shopping: 4.519
Maximum max_spent_in_single_shopping: 6.55
Mean value: 5.408071428571429
Median value: 5.223000000000001
Standard deviation: 0.4914804991024054
Null values: False
In [55]:
#Quartiles

Q1=orginal_dataset['max_spent_in_single_shopping'].quantile(q=0.25)
Q3=orginal_dataset['max_spent_in_single_shopping'].quantile(q=0.75)
print('max_spent_in_single_shopping - 1st Quartile (Q1) is: ', Q1)
print('max_spent_in_single_shopping - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of max_spent_in_single_shopping is ', stats.iqr(orginal_dataset['max_spent_in_single_shopping']))
max_spent_in_single_shopping - 1st Quartile (Q1) is: 5.045
max_spent_in_single_shopping - 3rd Quartile (Q3) is: 5.877000000000001
Interquartile range (IQR) of max_spent_in_single_shopping is 0.8320000000000007
In [56]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in max_spent_in_single_shopping: ', L_outliers)
print('Upper outliers in max_spent_in_single_shopping: ', U_outliers)
Lower outliers in max_spent_in_single_shopping: 3.796999999999999
Upper outliers in max_spent_in_single_shopping: 7.125000000000002
In [57]:
print('Number of outliers in max_spent_in_single_shopping upper : ', orginal_dataset[orginal_dataset['max_spent_in_single_shopping']>7.125000000000002]['max_spent_in_single_shopping'].count())
print('Number of outliers in max_spent_in_single_shopping lower : ', orginal_dataset[orginal_dataset['max_spent_in_single_shopping']<3.796999999999999]['max_spent_in_single_shopping'].count())
print('% of Outlier in max_spent_in_single_shopping upper: ', round(orginal_dataset[orginal_dataset['max_spent_in_single_shopping']>7.125000000000002]['max_spent_in_single_shopping'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in max_spent_in_single_shopping lower: ', round(orginal_dataset[orginal_dataset['max_spent_in_single_shopping']<3.796999999999999]['max_spent_in_single_shopping'].count()*100/len(orginal_dataset)), '%')
Number of outliers in max_spent_in_single_shopping upper : 0
Number of outliers in max_spent_in_single_shopping lower : 0
% of Outlier in max_spent_in_single_shopping upper: 0.0 %
% of Outlier in max_spent_in_single_shopping lower: 0.0 %
In [58]:
plt.title('max_spent_in_single_shopping')
sns.boxplot(orginal_dataset['max_spent_in_single_shopping'], orient='h', color='purple')

Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b26d910>

In [59]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='max_spent_in_single_shopping',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('max_spent_in_single_shopping', fontsize=15)
ax1.set_title('Distribution of max_spent_in_single_shopping', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['max_spent_in_single_shopping'],ax=ax2)
ax2.set_xlabel('max_spent_in_single_shopping', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['max_spent_in_single_shopping'])
ax3.set_xlabel('max_spent_in_single_shopping', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
In [60]:
# Let us only plot the distributions of the independent attributes
orginal_dataset.hist(figsize=(12,16),layout=(4,2));
In [61]:
# Let's check the skewness values quantitatively
orginal_dataset.skew().sort_values(ascending=False)

Out[61]:
max_spent_in_single_shopping 0.561897
current_balance 0.525482
min_payment_amt 0.401667
spending 0.399889
advance_payments 0.386573
credit_limit 0.134378
probability_of_full_payment -0.537954
dtype: float64
In [62]:
# distplot combines the matplotlib hist function with seaborn's kdeplot()
# KDE (Kernel Density Estimate) visualizes the probability density of a continuous variable.

plt.figure(figsize=(10,50))
for i in range(len(orginal_dataset.columns)):
    plt.subplot(17, 1, i+1)
    sns.distplot(orginal_dataset[orginal_dataset.columns[i]], kde_kws={"color": "b", "lw": 3, "label": "KDE"}, hist_kws={"color": "g"})
    plt.title(orginal_dataset.columns[i])

plt.tight_layout()
Observations

- Credit limit average is around 3.258 (in 10000s)
- Distribution is skewed with a right tail for all variables except probability_of_full_payment, which has a left tail

Multivariate analysis
Check for multicollinearity
In [63]:
sns.pairplot(orginal_dataset,diag_kind='kde');

Observation
- Strong positive correlation between:
  - spending & advance_payments
  - advance_payments & current_balance
  - credit_limit & spending
  - spending & current_balance
  - credit_limit & advance_payments
  - max_spent_in_single_shopping & current_balance

In [64]:
#correlation matrix

orginal_dataset.corr().T

Out[64]:

                              spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
spending                      1.000000          0.994341                      0.608288         0.949985      0.970771        -0.229572                      0.863693
advance_payments              0.994341          1.000000                      0.529244         0.972422      0.944829        -0.217340                      0.890784
probability_of_full_payment   0.608288          0.529244                      1.000000         0.367915      0.761635        -0.331471                      0.226825
current_balance               0.949985          0.972422                      0.367915         1.000000      0.860415        -0.171562                      0.932806
credit_limit                  0.970771          0.944829                      0.761635         0.860415      1.000000        -0.258037                      0.749131
min_payment_amt              -0.229572         -0.217340                     -0.331471        -0.171562     -0.258037         1.000000                     -0.011079
max_spent_in_single_shopping  0.863693          0.890784                      0.226825         0.932806      0.749131        -0.011079                      1.000000

In [65]:
#creating a heatmap for better visualization
plt.figure(figsize=(10,8))
sns.heatmap(orginal_dataset.corr(),annot=True,fmt=".2f",cmap="viridis")
plt.show()

In [66]:
# Let us see the significant correlations, either negative or positive, among independent attributes.
c = orginal_dataset.corr().abs()  # absolute values, since correlation may be positive as well as negative
s = c.unstack()
so = s.sort_values(ascending=False)  # sorting according to the correlation
so = so[(so<1) & (so>0.3)].drop_duplicates().to_frame()  # due to symmetry, dropping duplicate entries
so.columns = ['correlation']
so

Out[66]:

                                                           correlation
spending                     advance_payments                 0.994341
advance_payments             current_balance                  0.972422
credit_limit                 spending                         0.970771
spending                     current_balance                  0.949985
credit_limit                 advance_payments                 0.944829
max_spent_in_single_shopping current_balance                  0.932806
advance_payments             max_spent_in_single_shopping     0.890784
spending                     max_spent_in_single_shopping     0.863693
current_balance              credit_limit                     0.860415
probability_of_full_payment  credit_limit                     0.761635
max_spent_in_single_shopping credit_limit                     0.749131
spending                     probability_of_full_payment      0.608288
advance_payments             probability_of_full_payment      0.529244
current_balance              probability_of_full_payment      0.367915
probability_of_full_payment  min_payment_amt                  0.331471

Strategy to remove outliers: We choose to replace attribute outlier values with their respective medians instead of dropping the records, as dropping would lose the other column information; the outliers are present in only two variables and within 5 records.
In [67]:
clean_dataset=orginal_dataset.copy()

In [68]:
def check_outliers(data):
    vData_num = data.loc[:, data.columns != 'class']
    Q1 = vData_num.quantile(0.25)
    Q3 = vData_num.quantile(0.75)
    IQR = Q3 - Q1
    count = 0
    # checking for outliers; True represents an outlier
    vData_num_mod = ((vData_num < (Q1 - 1.5 * IQR)) | (vData_num > (Q3 + 1.5 * IQR)))
    # iterating over columns to check the no. of outliers in each of the numerical attributes
    for col in vData_num_mod:
        if 1 in vData_num_mod[col].value_counts().index:
            print("No. of outliers in %s: %d" % (col, vData_num_mod[col].value_counts().iloc[1]))
            count += 1
    print("\n\nNo of attributes with outliers are :", count)

check_outliers(orginal_dataset)
No. of outliers in probability_of_full_payment: 3
No. of outliers in min_payment_amt: 2

No of attributes with outliers are : 2


Let us remove the outliers:

for column in clean_dataset.columns.tolist():
    Q1 = clean_dataset[column].quantile(.25)  # 1st quartile
    Q3 = clean_dataset[column].quantile(.75)  # 3rd quartile
    IQR = Q3 - Q1                             # inter-quartile range
    # Replace elements of columns that fall below Q1-1.5*IQR and above Q3+1.5*IQR
    clean_dataset[column].replace(clean_dataset.loc[(clean_dataset[column] > Q3+1.5*IQR) | (clean_dataset[column] < Q1-1.5*IQR), column], clean_dataset[column].median())
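Note that Series.replace here returns a new Series that is never assigned back, so clean_dataset is not actually modified, which is consistent with the unchanged outlier counts in the next cell. A minimal sketch of a variant that does persist the medians (our addition, not the original notebook's code):

for column in clean_dataset.columns:
    Q1 = clean_dataset[column].quantile(.25)
    Q3 = clean_dataset[column].quantile(.75)
    IQR = Q3 - Q1
    outside = (clean_dataset[column] < Q1 - 1.5*IQR) | (clean_dataset[column] > Q3 + 1.5*IQR)
    # mask() writes the column median over the flagged cells; re-assigning persists the change
    clean_dataset[column] = clean_dataset[column].mask(outside, clean_dataset[column].median())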
In [69]:
check_outliers(clean_dataset)
No. of outliers in probability_of_full_payment: 3
No. of outliers in min_payment_amt: 2

No of attributes with outliers are : 2


In [70]:
# Let us check presence of outliers
plt.figure(figsize=(18,14))
box = sns.boxplot(data=clean_dataset)
box.set_xticklabels(labels=box.get_xticklabels(),rotation=90);

Observation
Most of the outliers have been treated, and now we are good to go.

In [71]:
plt.title('probability_of_full_payment')
sns.boxplot(clean_dataset['probability_of_full_payment'], orient='h', color='purple')

Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f64585ef310>

Observation

Though we treated the outliers, the boxplot still shows one; this is acceptable, as it is not extreme and lies on the lower band.

--------------------------------------------------------------------------------

1.2 Do you think scaling is necessary for clustering in this case? Justify

Scaling needs to be done, as the variables are measured on different scales.

spending and advance_payments have different value ranges, and without scaling they may get more weightage.

Plots of the data prior to and after scaling are shown below.

Scaling brings all the values into relatively the same range.

I have used the z-score to standardise the data to roughly the same scale of -3 to +3.
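As a quick illustration (a sketch using a tiny made-up array, not the actual dataset), scipy's zscore transforms each column x to (x - mean(x)) / std(x):

import numpy as np
from scipy.stats import zscore

x = np.array([10.59, 14.355, 21.18])   # three sample spending values (min, median, max)
manual = (x - x.mean()) / x.std()       # std with ddof=0, matching zscore's default
print(np.allclose(manual, zscore(x)))   # True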

In [72]:
# prior to scaling
plt.plot(clean_dataset)
plt.show()
In [73]:
# Scaling the attributes.

from scipy.stats import zscore


clean_dataset_Scaled=orginal_dataset.apply(zscore)
clean_dataset_Scaled.head()

Out[73]:

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
0  1.754355          1.811968                      0.178230         2.367533      1.338579        -0.298806                      2.328998
1  0.393582          0.253840                      1.501773        -0.600744      0.858236        -0.242805                     -0.538582
2  1.413300          1.428192                      0.504874         1.401485      1.317348        -0.221471                      1.509107
3 -1.384034         -1.227533                     -2.591878        -0.793049     -1.639017         0.987884                     -0.454961
4  1.082581          0.998364                      1.196340         0.591544      1.155464        -1.088154                      0.874813

In [74]:
#after scaling
plt.plot(clean_dataset_Scaled)
plt.show()
--------------------------------------------------------------------------------

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
Creating the Dendrogram

Importing dendrogram and linkage module

In [75]:
from scipy.cluster.hierarchy import dendrogram, linkage

Choosing average linkage method

In [76]:
link_method = linkage(clean_dataset_Scaled, method = 'average')

In [77]:
dend = dendrogram(link_method)
Cutting the Dendrogram with suitable clusters

In [78]:
dend = dendrogram(link_method,
truncate_mode='lastp',
p = 10)

In [79]:
dend = dendrogram(link_method,
truncate_mode='lastp',
p = 25)

Importing fcluster module to create clusters

In [80]:
from scipy.cluster.hierarchy import fcluster

In [81]:
# Set criterion as maxclust, then create 3 clusters, and store the result in another object 'clusters_3'

clusters_3 = fcluster(link_method, 3, criterion='maxclust')


clusters_3

Out[81]:
array([1, 3, 1, 2, 1, 3, 2, 2, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 2, 2, 2,
1, 2, 3, 1, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1, 1, 3, 1, 1,
2, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2, 1, 3, 1,
1, 3, 1, 2, 3, 2, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 3, 1, 2, 1, 1, 1,
1, 3, 3, 1, 3, 2, 2, 1, 1, 1, 2, 1, 3, 1, 3, 1, 3, 1, 1, 2, 3, 1,
1, 3, 1, 2, 2, 1, 3, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 3, 2, 3,
3, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 2, 2, 1, 2, 3, 2, 3, 2, 3, 1,
3, 3, 2, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 3, 3, 2, 3, 2, 3, 1, 1, 1,
3, 2, 3, 2, 3, 2, 3, 3, 1, 1, 3, 1, 3, 2, 3, 3, 2, 1, 3, 1, 1, 2,
1, 2, 3, 3, 3, 2, 1, 3, 1, 3, 3, 1], dtype=int32)
In [82]:
cluster3_dataset=orginal_dataset.copy()

In [83]:
cluster3_dataset['clusters-3'] = clusters_3

In [84]:
cluster3_dataset.head()

Out[84]:

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  clusters-3
0     19.94             16.92                       0.8752            6.675         3.763            3.252                         6.550           1
1     15.99             14.89                       0.9064            5.363         3.582            3.336                         5.144           3
2     18.95             16.42                       0.8829            6.248         3.755            3.368                         6.148           1
3     10.83             12.96                       0.8099            5.278         2.641            5.182                         5.185           2
4     17.99             15.86                       0.8992            5.890         3.694            2.068                         5.837           1

Cluster Frequency

In [85]:
cluster3_dataset['clusters-3'].value_counts().sort_index()

Out[85]:
1 75
2 70
3 65
Name: clusters-3, dtype: int64

Cluster Profiles
In [86]:
aggdata=cluster3_dataset.groupby('clusters-3').mean()
aggdata['Freq']=cluster3_dataset['clusters-3'].value_counts().sort_index()
aggdata

Out[86]:

             spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  Freq
clusters-3
1           18.129200         16.058000                      0.881595         6.135747      3.648120         3.650200                      5.987040    75
2           11.916857         13.291000                      0.846766         5.258300      2.846000         4.619000                      5.115071    70
3           14.217077         14.195846                      0.884869         5.442000      3.253508         2.768418                      5.055569    65

Another method - ward

In [87]:
wardlink = linkage(clean_dataset_Scaled, method = 'ward')

In [88]:
dend_wardlink = dendrogram(wardlink)

In [89]:
dend_wardlink = dendrogram(wardlink,
truncate_mode='lastp',
p = 10,
)

In [90]:
clusters_wdlk_3 = fcluster(wardlink, 3, criterion='maxclust')
clusters_wdlk_3

Out[90]:
array([1, 3, 1, 2, 1, 2, 2, 3, 1, 2, 1, 3, 2, 1, 3, 2, 3, 2, 3, 2, 2, 2,
1, 2, 3, 1, 3, 2, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 1, 1, 3, 1, 1,
2, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2, 3, 3, 1,
1, 3, 1, 2, 3, 2, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 3, 1, 2, 3, 3, 1,
1, 2, 3, 1, 3, 2, 2, 1, 1, 1, 2, 1, 2, 1, 3, 1, 3, 1, 1, 2, 2, 1,
3, 3, 1, 2, 2, 1, 3, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 3, 2, 3,
3, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 3, 2, 1, 2, 3, 2, 3, 2, 3, 3,
3, 3, 3, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 3, 3, 3, 3, 2, 3, 1, 1, 1,
3, 3, 1, 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 2, 3, 3, 2, 1, 3, 1, 1, 2,
1, 2, 3, 1, 3, 2, 1, 3, 1, 3, 1, 3], dtype=int32)
In [91]:
cluster_w_3_dataset=orginal_dataset.copy()

In [92]:
cluster_w_3_dataset['clusters-3'] = clusters_wdlk_3

In [93]:
cluster_w_3_dataset.head()

Out[93]:

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  clusters-3
0     19.94             16.92                       0.8752            6.675         3.763            3.252                         6.550           1
1     15.99             14.89                       0.9064            5.363         3.582            3.336                         5.144           3
2     18.95             16.42                       0.8829            6.248         3.755            3.368                         6.148           1
3     10.83             12.96                       0.8099            5.278         2.641            5.182                         5.185           2
4     17.99             15.86                       0.8992            5.890         3.694            2.068                         5.837           1

In [94]:
cluster_w_3_dataset['clusters-3'].value_counts().sort_index()

Out[94]:
1 70
2 67
3 73
Name: clusters-3, dtype: int64
In [95]:
aggdata_w=cluster_w_3_dataset.groupby('clusters-3').mean()
aggdata_w['Freq']=cluster_w_3_dataset['clusters-3'].value_counts().sort_index()
aggdata_w

Out[95]:

             spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  Freq
clusters-3
1           18.371429         16.145429                      0.884400         6.158171      3.684629         3.639157                      6.017371    70
2           11.872388         13.257015                      0.848072         5.238940      2.848537         4.949433                      5.122209    67
3           14.199041         14.233562                      0.879190         5.478233      3.226452         2.612181                      5.086178    73

Observation
Both methods give almost similar means, with minor variation, which is expected.

For the cluster grouping based on the dendrogram, 3 or 4 clusters look good. After further analysis of the dataset, I went with the 3-group cluster solution for hierarchical clustering.

In a real-world setting, more variables would likely have been captured - tenure, BALANCE_FREQUENCY, balance, purchase, installment of purchase, and others.

The three-group cluster solution gives a pattern based on high/medium/low spending, together with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
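One quantitative way (not part of the original analysis) to compare the two linkage methods is the cophenetic correlation, which measures how faithfully each dendrogram preserves the original pairwise distances; a hedged sketch:

from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# higher correlation = the dendrogram distances better reflect the original distances
c_avg, _ = cophenet(link_method, pdist(clean_dataset_Scaled))
c_ward, _ = cophenet(wardlink, pdist(clean_dataset_Scaled))
print('average linkage cophenetic corr:', round(c_avg, 3))
print('ward linkage cophenetic corr:', round(c_ward, 3))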

--------------------------------------------------------------------------------

1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score.
In [96]:
from sklearn.cluster import KMeans

In [97]:
k_means = KMeans(n_clusters = 1)
k_means.fit(clean_dataset_Scaled)
k_means.inertia_

Out[97]:
1469.9999999999998
In [98]:
k_means = KMeans(n_clusters = 2)
k_means.fit(clean_dataset_Scaled)
k_means.inertia_

Out[98]:
659.171754487041
In [99]:
k_means = KMeans(n_clusters = 3)
k_means.fit(clean_dataset_Scaled)
k_means.inertia_

Out[99]:
430.6589731513006
In [100]:
k_means = KMeans(n_clusters = 4)
k_means.fit(clean_dataset_Scaled)
k_means.inertia_

Out[100]:
371.74655984791394
In [101]:

wss =[]

In [102]:
for i in range(1,11):
    KM = KMeans(n_clusters=i)
    KM.fit(clean_dataset_Scaled)
    wss.append(KM.inertia_)

In [103]:
wss

Out[103]:
[1469.9999999999998,
659.171754487041,
430.6589731513006,
371.74655984791394,
327.5720355975522,
289.7454594701129,
261.99257202366164,
239.88573666700952,
221.00108292809628,
209.55056673071783]
In [104]:
plt.plot(range(1,11), wss)
plt.xlabel("Clusters")
plt.ylabel("Inertia in the cluster")
plt.show()
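Reading the elbow off the plot is subjective; a rough numeric heuristic (our addition, not in the original notebook) is to pick the k where the decrease in inertia slows down the most, i.e. the largest second difference:

import numpy as np

drops = np.diff(wss)                    # change in inertia from k to k+1 (negative values)
elbow = np.argmax(np.diff(drops)) + 2   # index of the largest slowdown, mapped back to k
print('Suggested elbow at k =', elbow)  # with the wss values above this flags k = 2

The silhouette analysis below gives a second opinion on the choice between small values of k.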

In [105]:
k_means_4 = KMeans(n_clusters = 4)
k_means_4.fit(clean_dataset_Scaled)
labels_4 = k_means_4.labels_

In [106]:
kmeans4_dataset=orginal_dataset.copy()

In [107]:
kmeans4_dataset["Clus_kmeans"] = labels_4
kmeans4_dataset.head(5)
Out[107]:

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  Clus_kmeans
0     19.94             16.92                       0.8752            6.675         3.763            3.252                         6.550            3
1     15.99             14.89                       0.9064            5.363         3.582            3.336                         5.144            0
2     18.95             16.42                       0.8829            6.248         3.755            3.368                         6.148            3
3     10.83             12.96                       0.8099            5.278         2.641            5.182                         5.185            2
4     17.99             15.86                       0.8992            5.890         3.694            2.068                         5.837            3

In [108]:
from sklearn.metrics import silhouette_samples, silhouette_score

In [109]:
silhouette_score(clean_dataset_Scaled,labels_4)

Out[109]:
0.3291966792017613
In [110]:
from sklearn import metrics

In [111]:
scores = []
k_range = range(2, 11)

for k in k_range:
    km = KMeans(n_clusters=k, random_state=2)
    km.fit(clean_dataset_Scaled)
    scores.append(metrics.silhouette_score(clean_dataset_Scaled, km.labels_))

scores

Out[111]:
[0.46577247686580914,
0.4007270552751299,
0.3291966792017613,
0.28316654897654814,
0.2897583830272518,
0.2694844355168535,
0.25437316027505635,
0.2623959398663564,
0.2673980772529917]
In [112]:
#plotting the sc scores
plt.plot(k_range,scores)
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()

Insights

From the silhouette scores, the optimal number of clusters could be 3 or 4
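For reference, the k with the highest silhouette score can also be read off programmatically (a small addition):

best_k = k_range[scores.index(max(scores))]
print('Best k by silhouette:', best_k)  # 2 with the scores above; 3 is the next best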

In [113]:
sil_width = silhouette_samples(clean_dataset_Scaled,labels_4)

In [114]:
kmeans4_dataset["sil_width"] = sil_width
kmeans4_dataset.head(5)

Out[114]:

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  Clus_kmeans  sil_width
0     19.94             16.92                       0.8752            6.675         3.763            3.252                         6.550            3   0.432658
1     15.99             14.89                       0.9064            5.363         3.582            3.336                         5.144            0   0.099543
2     18.95             16.42                       0.8829            6.248         3.755            3.368                         6.148            3   0.425893
3     10.83             12.96                       0.8099            5.278         2.641            5.182                         5.185            2   0.529852
4     17.99             15.86                       0.8992            5.890         3.694            2.068                         5.837            3   0.082791

In [115]:
silhouette_samples(clean_dataset_Scaled,labels_4).min()

Out[115]:
-0.05115805932867967
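A negative silhouette width means a point sits closer to a neighbouring cluster than to its own, so it is worth counting how many such points exist (our addition):

print('Points with negative silhouette width:', (sil_width < 0).sum())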
3-Cluster Solution
In [116]:
km_3 = KMeans(n_clusters=3,random_state=123)

In [117]:
#fitting the Kmeans
km_3.fit(clean_dataset_Scaled)
km_3.labels_

Out[117]:
array([0, 2, 0, 1, 0, 1, 1, 2, 0, 1, 0, 2, 1, 0, 2, 1, 2, 1, 1, 1, 1, 1,
0, 1, 2, 0, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0,
1, 1, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2, 1, 1, 2, 2, 0,
0, 2, 0, 1, 2, 1, 0, 0, 1, 0, 2, 1, 0, 2, 2, 2, 2, 0, 1, 2, 0, 2,
0, 1, 2, 0, 2, 1, 1, 0, 0, 0, 1, 0, 2, 0, 2, 0, 2, 0, 0, 1, 1, 0,
2, 2, 0, 1, 1, 0, 2, 2, 1, 0, 2, 1, 1, 1, 2, 2, 0, 1, 2, 2, 1, 2,
2, 0, 1, 0, 0, 1, 0, 2, 2, 2, 1, 1, 2, 1, 0, 1, 2, 1, 2, 1, 2, 2,
1, 2, 2, 1, 2, 0, 0, 1, 0, 0, 0, 1, 2, 2, 2, 1, 2, 1, 2, 0, 0, 0,
2, 1, 2, 1, 2, 2, 2, 2, 0, 0, 1, 2, 2, 1, 1, 2, 1, 0, 2, 0, 0, 1,
0, 1, 2, 0, 2, 1, 0, 2, 0, 2, 2, 2], dtype=int32)
In [118]:
#proportion of labels classified

pd.Series(km_3.labels_).value_counts()

Out[118]:
1 72
2 71
0 67
dtype: int64
K-Means Clustering & Cluster Information
In [119]:
kmeans1_dataset=orginal_dataset.copy()

In [120]:
# Fitting K-Means to the dataset
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(clean_dataset_Scaled)

#beginning of the cluster numbering with 1 instead of 0

y_kmeans1=y_kmeans
y_kmeans1=y_kmeans+1

# New Dataframe called cluster

cluster = pd.DataFrame(y_kmeans1)

# Adding cluster to the Dataset1

kmeans1_dataset['cluster'] = cluster
#Mean of clusters

kmeans_mean_cluster = pd.DataFrame(round(kmeans1_dataset.groupby('cluster').mean(),1))
kmeans_mean_cluster

Out[120]:

         spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
cluster
1            18.5              16.2                           0.9              6.2           3.7              3.6                           6.0
2            11.9              13.2                           0.8              5.2           2.8              4.7                           5.1
3            14.4              14.3                           0.9              5.5           3.3              2.7                           5.1

In [121]:
def ClusterPercentage(datafr, name):
    """Common utility function to calculate the percentage and size of each cluster."""
    size = pd.Series(datafr[name].value_counts().sort_index())
    percent = pd.Series(round(datafr[name].value_counts()/datafr.shape[0] * 100, 2)).sort_index()
    size_df = pd.concat([size, percent], axis=1)
    size_df.columns = ["Cluster_Size", "Cluster_Percentage"]
    return size_df

In [122]:
ClusterPercentage(kmeans1_dataset,"cluster")
Out[122]:

   Cluster_Size  Cluster_Percentage
1            67               31.90
2            72               34.29
3            71               33.81

In [123]:
#transposing the cluster
cluster_3_T = kmeans_mean_cluster.T

In [124]:
cluster_3_T

Out[124]:

cluster                          1     2     3
spending                      18.5  11.9  14.4
advance_payments              16.2  13.2  14.3
probability_of_full_payment    0.9   0.8   0.9
current_balance                6.2   5.2   5.5
credit_limit                   3.7   2.8   3.3
min_payment_amt                3.6   4.7   2.7
max_spent_in_single_shopping   6.0   5.1   5.1


Note

I am going with 3 clusters via K-Means, but I am also showing the analysis for the 4- and 5-cluster K-Means solutions. Based on the current dataset, the 3-cluster solution makes sense given the spending pattern (High, Medium, Low).
4-Cluster Solution

In [125]:
km_4 = KMeans(n_clusters=4,random_state=123)

In [126]:
#fitting the Kmeans
km_4.fit(clean_dataset_Scaled)
km_4.labels_

Out[126]:
array([2, 1, 2, 0, 2, 0, 0, 1, 2, 0, 2, 3, 0, 2, 1, 0, 1, 0, 1, 0, 0, 0,
2, 0, 1, 3, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 2, 2, 1, 3, 2,
0, 0, 1, 2, 2, 2, 0, 2, 2, 2, 2, 3, 0, 0, 0, 2, 1, 0, 0, 3, 1, 2,
2, 1, 2, 1, 1, 0, 2, 2, 0, 2, 1, 0, 3, 1, 1, 1, 1, 2, 0, 3, 3, 3,
3, 0, 1, 2, 1, 0, 1, 2, 2, 3, 0, 2, 1, 2, 3, 2, 1, 2, 2, 0, 1, 2,
3, 1, 2, 0, 0, 3, 1, 3, 0, 2, 1, 0, 0, 0, 1, 1, 2, 0, 1, 1, 0, 1,
1, 2, 0, 2, 2, 0, 3, 1, 3, 1, 0, 0, 1, 0, 2, 0, 1, 0, 1, 0, 1, 3,
1, 1, 1, 0, 1, 2, 2, 0, 2, 3, 2, 0, 3, 1, 1, 0, 1, 0, 1, 2, 2, 2,
1, 1, 3, 0, 1, 1, 1, 1, 3, 3, 1, 3, 1, 0, 1, 1, 0, 2, 1, 3, 2, 0,
2, 0, 1, 3, 1, 0, 3, 1, 3, 1, 3, 3], dtype=int32)
In [127]:
#proportion of labels classified

pd.Series(km_4.labels_).value_counts()

Out[127]:
1 65
0 64
2 51
3 30
dtype: int64

K-Means Clustering & Cluster Information

In [128]:
kmeans14_dataset=orginal_dataset.copy()

In [129]:
# Fitting K-Means to the dataset

kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)


y_kmeans = kmeans.fit_predict(clean_dataset_Scaled)

#beginning of the cluster numbering with 1 instead of 0


y_kmeans1=y_kmeans
y_kmeans1=y_kmeans+1

# New Dataframe called cluster

cluster = pd.DataFrame(y_kmeans1)

# Adding cluster to the Dataset1

kmeans14_dataset['cluster'] = cluster
#Mean of clusters

kmeans_mean_cluster = pd.DataFrame(round(kmeans14_dataset.groupby('cluster').mean(),1))
kmeans_mean_cluster

Out[129]:

         spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
cluster
1            16.4              15.3                           0.9              5.9           3.4              3.9                           5.7
2            14.0              14.1                           0.9              5.4           3.2              2.6                           5.0
3            19.2              16.5                           0.9              6.3           3.8              3.5                           6.1
4            11.8              13.2                           0.8              5.2           2.8              4.9                           5.1

In [130]:
ClusterPercentage(kmeans14_dataset,"cluster")

Out[130]:

   Cluster_Size  Cluster_Percentage
1            30               14.29
2            67               31.90
3            48               22.86
4            65               30.95

In [131]:
#transposing the cluster
cluster_4_T = kmeans_mean_cluster.T

In [132]:
cluster_4_T

Out[132]:

cluster                          1     2     3     4
spending                      16.4  14.0  19.2  11.8
advance_payments              15.3  14.1  16.5  13.2
probability_of_full_payment    0.9   0.9   0.9   0.8
current_balance                5.9   5.4   6.3   5.2
credit_limit                   3.4   3.2   3.8   2.8
min_payment_amt                3.9   2.6   3.5   4.9
max_spent_in_single_shopping   5.7   5.0   6.1   5.1

5-Cluster Solution

In [133]:
kmeans15_dataset=orginal_dataset.copy()

In [134]:
# Fitting K-Means to the dataset

kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)


y_kmeans = kmeans.fit_predict(clean_dataset_Scaled)

#beginning of the cluster numbering with 1 instead of 0

y_kmeans1=y_kmeans
y_kmeans1=y_kmeans+1

# New Dataframe called cluster

cluster = pd.DataFrame(y_kmeans1)

# Adding cluster to the Dataset1

kmeans15_dataset['cluster'] = cluster
#Mean of clusters

kmeans_mean_cluster = pd.DataFrame(round(kmeans15_dataset.groupby('cluster').mean(),1))
kmeans_mean_cluster

Out[134]:

         spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
cluster
1            19.2              16.5                           0.9              6.3           3.8              3.5                           6.1
2            11.7              13.2                           0.8              5.3           2.8              4.5                           5.2
3            14.3              14.3                           0.9              5.5           3.3              2.4                           5.1
4            16.4              15.3                           0.9              5.9           3.4              3.9                           5.7
5            12.3              13.3                           0.9              5.2           3.0              5.0                           5.0

In [135]:
ClusterPercentage(kmeans15_dataset,"cluster")
Out[135]:

   Cluster_Size  Cluster_Percentage
1            48               22.86
2            41               19.52
3            56               26.67
4            29               13.81
5            36               17.14

In [136]:
#transposing the cluster
cluster_5_T = kmeans_mean_cluster.T

In [137]:
cluster_5_T

Out[137]:

cluster 1 2 3 4 5

spending 19.2 11.7 14.3 16.4 12.3

advance_payments 16.5 13.2 14.3 15.3 13.3

probability_of_full_payment 0.9 0.8 0.9 0.9 0.9

current_balance 6.3 5.3 5.5 5.9 5.2

credit_limit 3.8 2.8 3.3 3.4 3.0


cluster 1 2 3 4 5

min_payment_amt 3.5 4.5 2.4 3.9 5.0

max_spent_in_single_shoppin
6.1 5.2 5.1 5.7 5.0
g

--------------------------------------------------------------------------------

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.
3 group cluster via Kmeans
In [138]:
cluster_3_T

Out[138]:

cluster                          1     2     3
spending                      18.5  11.9  14.4
advance_payments              16.2  13.2  14.3
probability_of_full_payment    0.9   0.8   0.9
current_balance                6.2   5.2   5.5
credit_limit                   3.7   2.8   3.3
min_payment_amt                3.6   4.7   2.7
max_spent_in_single_shopping   6.0   5.1   5.1

3 group cluster via hierarchical clustering


In [139]:
aggdata_w.T

Out[139]:

clusters-3                            1          2          3
spending                      18.371429  11.872388  14.199041
advance_payments              16.145429  13.257015  14.233562
probability_of_full_payment    0.884400   0.848072   0.879190
current_balance                6.158171   5.238940   5.478233
credit_limit                   3.684629   2.848537   3.226452
min_payment_amt                3.639157   4.949433   2.612181
max_spent_in_single_shopping   6.017371   5.122209   5.086178
Freq                          70.000000  67.000000  73.000000

Cluster Group Profiles

- Group 1 : High Spending
- Group 3 : Medium Spending
- Group 2 : Low Spending
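To make these profiles explicit in the data, the numeric cluster ids can be mapped to the labels above (a sketch assuming the 3-cluster K-Means frame kmeans1_dataset built earlier):

label_map = {1: 'High Spending', 2: 'Low Spending', 3: 'Medium Spending'}
kmeans1_dataset['segment'] = kmeans1_dataset['cluster'].map(label_map)
kmeans1_dataset['segment'].value_counts()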


Promotional strategies for each cluster

Group 1 : High Spending Group

- Giving reward points might increase their purchases.
- max_spent_in_single_shopping is high for this group, so they can be offered discounts/offers on subsequent transactions upon full payment.
- Increase their credit limit.
- Encourage higher spending habits.
- Give loans against the credit card, as they are customers with a good repayment record.
- Tie up with luxury brands, which will drive more maximum one-time spending.

Group 3 : Medium Spending Group

- They are potential target customers who are paying bills, making purchases, and maintaining a comparatively good credit score. So we can increase the credit limit or lower the interest rate.
- Promote premium cards/loyalty cards to increase transactions.
- Increase spending habits by tying up with premium e-commerce sites, travel portals, and airlines/hotels, as this will encourage them to spend more.

Group 2 : Low Spending Group

- These customers should be given reminders for payments. Offers can be provided on early payments to improve their payment rate.
- Increase their spending habits by tying up with grocery stores and utilities (electricity, phone, gas, and others).

--------------------------------------------------------------------------------
