Bank Customer Segmentation Guide
Bank Customer Segmentation Guide
A leading bank wants to develop a customer segmentation to give promotional offers to its
customers.
They collected a sample that summarizes the activities of users during the past few months.
You are given the task to identify the segments based on credit card usage.
--------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------
In [2]:
#Plot styling
import seaborn as sns; sns.set() # for plot styling
%matplotlib inline
In [3]:
# Import stats from scipy
`
--------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------
Out[5]:
In [6]:
orginal_dataset.tail()
Out[6]:
20
13.89 14.02 0.8880 5.439 3.199 3.986 4.738
5
20
16.77 15.62 0.8638 5.927 3.438 4.920 5.795
6
20
16.12 15.00 0.9000 5.709 3.485 2.270 5.443
8
20
15.57 15.15 0.8527 5.920 3.231 2.640 5.879
9
Observation
Data looks good based on intial records seen in top 5 and bottom 5.
In [7]:
orginal_dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 spending 210 non-null float64
1 advance_payments 210 non-null float64
2 probability_of_full_payment 210 non-null float64
3 current_balance 210 non-null float64
4 credit_limit 210 non-null float64
5 min_payment_amt 210 non-null float64
6 max_spent_in_single_shopping 210 non-null float64
dtypes: float64(7)
memory usage: 11.6 KB
Observation:
orginal_dataset.shape
Out[8]:
(210, 7)
In [9]:
### Checking for Missing Values
orginal_dataset.isnull().sum()
Out[9]:
spending 0
advance_payments 0
probability_of_full_payment 0
current_balance 0
credit_limit 0
min_payment_amt 0
max_spent_in_single_shopping 0
dtype: int64
Observation
No missing value.
Univariate analysis
Checking the Summary Statistic
In [10]:
## Intital descriptive analysis of the data
orginal_dataset.describe(percentiles=[.25,0.50,0.75,0.90]).T
Out[10]:
cou
mean std min 25% 50% 75% 90% max
nt
probability_of_full_paym 210. 0.87099 0.0236 0.808 0.8569 0.8734 0.88777 0.899 0.918
ent 0 9 29 1 0 5 5 3 3
max_spent_in_single_sho 210. 5.40807 0.4914 4.519 5.0450 5.2230 5.87700 6.185 6.550
pping 0 1 80 0 0 0 0 0 0
Observation
Spending variable
In [11]:
print('Range of values: ', orginal_dataset['spending'].max()-
orginal_dataset['spending'].min())
Range of values: 10.59
In [12]:
#Central values
print('Minimum spending: ', orginal_dataset['spending'].min())
print('Maximum spending: ',orginal_dataset['spending'].max())
print('Mean value: ', orginal_dataset['spending'].mean())
print('Median value: ',orginal_dataset['spending'].median())
print('Standard deviation: ', orginal_dataset['spending'].std())
print('Null values: ',orginal_dataset['spending'].isnull().any())
Minimum spending: 10.59
Maximum spending: 21.18
Mean value: 14.847523809523818
Median value: 14.355
Standard deviation: 2.909699430687361
Null values: False
In [13]:
#Quartiles
Q1=orginal_dataset['spending'].quantile(q=0.25)
Q3=orginal_dataset['spending'].quantile(q=0.75)
print('spending - 1st Quartile (Q1) is: ', Q1)
print('spending - 3st Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of spending is ',
stats.iqr(orginal_dataset['spending']))
spending - 1st Quartile (Q1) is: 12.27
spending - 3st Quartile (Q3) is: 17.305
Interquartile range (IQR) of spending is 5.035
In [14]:
#Outlier detection from Interquartile range (IQR) in original data
# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in spending: ', L_outliers)
print('Upper outliers in spending: ', U_outliers)
Lower outliers in spending: 4.717499999999999
Upper outliers in spending: 24.8575
In [15]:
print('Number of outliers in spending upper : ',
orginal_dataset[orginal_dataset['spending']>24.8575]['spending'].count())
print('Number of outliers in spending lower : ',
orginal_dataset[orginal_dataset['spending']<4.717499]['spending'].count())
print('% of Outlier in spending upper:
',round(orginal_dataset[orginal_dataset['spending']>24.8575]
['spending'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in spending lower:
',round(orginal_dataset[orginal_dataset['spending']<4.717499]
['spending'].count()*100/len(orginal_dataset)), '%')
Number of outliers in spending upper : 0
Number of outliers in spending lower : 0
% of Outlier in spending upper: 0.0 %
% of Outlier in spending lower: 0.0 %
In [16]:
plt.title('spending')
sns.boxplot(orginal_dataset['spending'],orient='horizondal',color='purple')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f64619a9210>
In [17]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='spending',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('spending', fontsize=15)
ax1.set_title('Distribution of spending', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(orginal_dataset['spending'],ax=ax2)
ax2.set_xlabel('spending', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(orginal_dataset['spending'])
ax3.set_xlabel('spending', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
advance_payments variable
In [18]:
print('Range of values: ', orginal_dataset['advance_payments'].max()-
orginal_dataset['advance_payments'].min())
Range of values: 4.84
In [19]:
#Central values
print('Minimum advance_payments: ', orginal_dataset['advance_payments'].min())
print('Maximum advance_payments: ',orginal_dataset['advance_payments'].max())
print('Mean value: ', orginal_dataset['advance_payments'].mean())
print('Median value: ',orginal_dataset['advance_payments'].median())
print('Standard deviation: ', orginal_dataset['advance_payments'].std())
print('Null values: ',orginal_dataset['advance_payments'].isnull().any())
Minimum advance_payments: 12.41
Maximum advance_payments: 17.25
Mean value: 14.559285714285727
Median value: 14.32
Standard deviation: 1.305958726564022
Null values: False
In [20]:
#Quartiles
Q1=orginal_dataset['advance_payments'].quantile(q=0.25)
Q3=orginal_dataset['advance_payments'].quantile(q=0.75)
print('advance_payments - 1st Quartile (Q1) is: ', Q1)
print('advance_payments - 3st Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of advance_payments is ',
stats.iqr(orginal_dataset['advance_payments']))
advance_payments - 1st Quartile (Q1) is: 13.45
advance_payments - 3st Quartile (Q3) is: 15.715
Interquartile range (IQR) of advance_payments is 2.2650000000000006
In [21]:
#Outlier detection from Interquartile range (IQR) in original data
# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in advance_payments: ', L_outliers)
print('Upper outliers in advance_payments: ', U_outliers)
Lower outliers in advance_payments: 10.052499999999998
Upper outliers in advance_payments: 19.1125
In [22]:
print('Number of outliers in advance_payments upper : ',
orginal_dataset[orginal_dataset['advance_payments']>19.1125]
['advance_payments'].count())
print('Number of outliers in advance_payments lower : ',
orginal_dataset[orginal_dataset['advance_payments']<10.052499]
['advance_payments'].count())
print('% of Outlier in advance_payments upper:
',round(orginal_dataset[orginal_dataset['advance_payments']>19.1125]
['advance_payments'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in advance_payments lower:
',round(orginal_dataset[orginal_dataset['advance_payments']<10.052499]
['advance_payments'].count()*100/len(orginal_dataset)), '%')
Number of outliers in advance_payments upper : 0
Number of outliers in advance_payments lower : 0
% of Outlier in advance_payments upper: 0.0 %
% of Outlier in advance_payments lower: 0.0 %
In [23]:
plt.title('advance_payments')
sns.boxplot(orginal_dataset['advance_payments'],orient='horizondal',color='purple'
)
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b422690>
In [24]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='advance_payments',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('advance_payments', fontsize=15)
ax1.set_title('Distribution of advance_payments', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(orginal_dataset['advance_payments'],ax=ax2)
ax2.set_xlabel('advance_payments', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(orginal_dataset['advance_payments'])
ax3.set_xlabel('advance_payments', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
probability_of_full_payment variable
In [25]:
print('Range of values: ', orginal_dataset['probability_of_full_payment'].max()-
orginal_dataset['probability_of_full_payment'].min())
Range of values: 0.11019999999999996
In [26]:
#Central values
print('Minimum probability_of_full_payment ',
orginal_dataset['probability_of_full_payment'].min())
print('Maximum probability_of_full_payment:
',orginal_dataset['probability_of_full_payment'].max())
print('Mean value: ', orginal_dataset['probability_of_full_payment'].mean())
print('Median value: ',orginal_dataset['probability_of_full_payment'].median())
print('Standard deviation: ',
orginal_dataset['probability_of_full_payment'].std())
print('Null values:
',orginal_dataset['probability_of_full_payment'].isnull().any())
Minimum probability_of_full_payment 0.8081
Maximum probability_of_full_payment: 0.9183
Mean value: 0.8709985714285714
Median value: 0.8734500000000001
Standard deviation: 0.023629416583846496
Null values: False
In [27]:
#Quartiles
Q1=orginal_dataset['probability_of_full_payment'].quantile(q=0.25)
Q3=orginal_dataset['probability_of_full_payment'].quantile(q=0.75)
print('probability_of_full_payment - 1st Quartile (Q1) is: ', Q1)
print('probability_of_full_payment - 3st Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of probability_of_full_payment is ',
stats.iqr(orginal_dataset['probability_of_full_payment']))
probability_of_full_payment - 1st Quartile (Q1) is: 0.8569
probability_of_full_payment - 3st Quartile (Q3) is: 0.887775
Interquartile range (IQR) of probability_of_full_payment is
0.030874999999999986
In [28]:
#Outlier detection from Interquartile range (IQR) in original data
# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in probability_of_full_payment: ', L_outliers)
print('Upper outliers in probability_of_full_payment: ', U_outliers)
Lower outliers in probability_of_full_payment: 0.8105875
Upper outliers in probability_of_full_payment: 0.9340875
In [29]:
print('Number of outliers in probability_of_full_payment upper : ',
orginal_dataset[orginal_dataset['probability_of_full_payment']>0.9340875]
['probability_of_full_payment'].count())
print('Number of outliers in probability_of_full_payment lower : ',
orginal_dataset[orginal_dataset['probability_of_full_payment']<0.8105875]
['probability_of_full_payment'].count())
print('% of Outlier in probability_of_full_payment upper:
',round(orginal_dataset[orginal_dataset['probability_of_full_payment']>0.9340875]
['probability_of_full_payment'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in probability_of_full_payment lower:
',round(orginal_dataset[orginal_dataset['probability_of_full_payment']<0.8105875]
['probability_of_full_payment'].count()*100/len(orginal_dataset)), '%')
Number of outliers in probability_of_full_payment upper : 0
Number of outliers in probability_of_full_payment lower : 3
% of Outlier in probability_of_full_payment upper: 0.0 %
% of Outlier in probability_of_full_payment lower: 1.0 %
In [30]:
plt.title('probability_of_full_payment')
sns.boxplot(orginal_dataset['probability_of_full_payment'],orient='horizondal',col
or='purple')
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b57fc50>
In [31]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='probability_of_full_payment',data=orginal_dataset,orient='v',ax=ax1
)
ax1.set_ylabel('probability_of_full_payment', fontsize=15)
ax1.set_title('Distribution of probability_of_full_payment', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(orginal_dataset['probability_of_full_payment'],ax=ax2)
ax2.set_xlabel('probability_of_full_payment', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(orginal_dataset['probability_of_full_payment'])
ax3.set_xlabel('probability_of_full_payment', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
current_balance variable
In [32]:
print('Range of values: ', orginal_dataset['current_balance'].max()-
orginal_dataset['current_balance'].min())
Range of values: 1.7759999999999998
In [33]:
#Central values
print('Minimum current_balance: ', orginal_dataset['current_balance'].min())
print('Maximum current_balance: ',orginal_dataset['current_balance'].max())
print('Mean value: ', orginal_dataset['current_balance'].mean())
print('Median value: ',orginal_dataset['current_balance'].median())
print('Standard deviation: ', orginal_dataset['current_balance'].std())
print('Null values: ',orginal_dataset['current_balance'].isnull().any())
Minimum current_balance: 4.899
Maximum current_balance: 6.675
Mean value: 5.628533333333334
Median value: 5.5235
Standard deviation: 0.4430634777264493
Null values: False
In [34]:
#Quartiles
Q1=orginal_dataset['current_balance'].quantile(q=0.25)
Q3=orginal_dataset['current_balance'].quantile(q=0.75)
print('current_balance - 1st Quartile (Q1) is: ', Q1)
print('current_balance - 3st Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of current_balance is ',
stats.iqr(orginal_dataset['current_balance']))
current_balance - 1st Quartile (Q1) is: 5.26225
current_balance - 3st Quartile (Q3) is: 5.97975
Interquartile range (IQR) of current_balance is 0.7175000000000002
In [35]:
#Outlier detection from Interquartile range (IQR) in original data
# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in current_balance: ', L_outliers)
print('Upper outliers in current_balance: ', U_outliers)
Lower outliers in current_balance: 4.186
Upper outliers in current_balance: 7.056000000000001
In [36]:
print('Number of outliers in current_balance upper : ',
orginal_dataset[orginal_dataset['current_balance']>7.056000000000001]
['current_balance'].count())
print('Number of outliers in current_balance lower : ',
orginal_dataset[orginal_dataset['current_balance']<4.186]
['current_balance'].count())
print('% of Outlier in current_balance upper:
',round(orginal_dataset[orginal_dataset['current_balance']>7.056000000000001]
['current_balance'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in current_balance lower:
',round(orginal_dataset[orginal_dataset['current_balance']<4.186]
['current_balance'].count()*100/len(orginal_dataset)), '%')
Number of outliers in current_balance upper : 0
Number of outliers in current_balance lower : 0
% of Outlier in current_balance upper: 0.0 %
% of Outlier in current_balance lower: 0.0 %
In [37]:
plt.title('current_balance')
sns.boxplot(orginal_dataset['current_balance'],orient='horizondal',color='purple')
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b209dd0>
In [38]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='current_balance',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('current_balance', fontsize=15)
ax1.set_title('Distribution of current_balance', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(orginal_dataset['current_balance'],ax=ax2)
ax2.set_xlabel('current_balance', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(orginal_dataset['current_balance'])
ax3.set_xlabel('current_balance', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
credit_limit variable
In [39]:
print('Range of values: ', orginal_dataset['credit_limit'].max()-
orginal_dataset['credit_limit'].min())
Range of values: 1.4030000000000005
In [40]:
#Central values
print('Minimum credit_limit: ', orginal_dataset['credit_limit'].min())
print('Maximum credit_limit: ',orginal_dataset['credit_limit'].max())
print('Mean value: ', orginal_dataset['credit_limit'].mean())
print('Median value: ',orginal_dataset['credit_limit'].median())
print('Standard deviation: ', orginal_dataset['credit_limit'].std())
print('Null values: ',orginal_dataset['credit_limit'].isnull().any())
Minimum credit_limit: 2.63
Maximum credit_limit: 4.033
Mean value: 3.258604761904763
Median value: 3.237
Standard deviation: 0.3777144449065874
Null values: False
In [41]:
#Quartiles
Q1=orginal_dataset['credit_limit'].quantile(q=0.25)
Q3=orginal_dataset['credit_limit'].quantile(q=0.75)
print('credit_limit - 1st Quartile (Q1) is: ', Q1)
print('credit_limit - 3st Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of credit_limit is ',
stats.iqr(orginal_dataset['credit_limit']))
credit_limit - 1st Quartile (Q1) is: 2.944
credit_limit - 3st Quartile (Q3) is: 3.56175
Interquartile range (IQR) of credit_limit is 0.61775
In [42]:
#Outlier detection from Interquartile range (IQR) in original data
# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in credit_limit: ', L_outliers)
print('Upper outliers in credit_limit: ', U_outliers)
Lower outliers in credit_limit: 2.017375
Upper outliers in credit_limit: 4.488375
In [43]:
print('Number of outliers in credit_limit upper : ',
orginal_dataset[orginal_dataset['credit_limit']>4.488375]['credit_limit'].count())
print('Number of outliers in credit_limit lower : ',
orginal_dataset[orginal_dataset['credit_limit']<2.017375]['credit_limit'].count())
print('% of Outlier in credit_limit upper:
',round(orginal_dataset[orginal_dataset['credit_limit']>4.488375]
['credit_limit'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in credit_limit lower:
',round(orginal_dataset[orginal_dataset['credit_limit']<2.017375]
['credit_limit'].count()*100/len(orginal_dataset)), '%')
Number of outliers in credit_limit upper : 0
Number of outliers in credit_limit lower : 0
% of Outlier in credit_limit upper: 0.0 %
% of Outlier in credit_limit lower: 0.0 %
In [44]:
plt.title('credit_limit')
sns.boxplot(orginal_dataset['credit_limit'],orient='horizondal',color='purple')
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b036fd0>
In [45]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='credit_limit',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('credit_limit', fontsize=15)
ax1.set_title('Distribution of credit_limit', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(orginal_dataset['credit_limit'],ax=ax2)
ax2.set_xlabel('credit_limit', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(orginal_dataset['credit_limit'])
ax3.set_xlabel('credit_limit', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
min_payment_amt variable
In [46]:
print('Range of values: ', orginal_dataset['min_payment_amt'].max()-
orginal_dataset['min_payment_amt'].min())
Range of values: 7.690899999999999
In [47]:
#Central values
print('Minimum min_payment_amt: ', orginal_dataset['min_payment_amt'].min())
print('Maximum min_payment_amt: ',orginal_dataset['min_payment_amt'].max())
print('Mean value: ', orginal_dataset['min_payment_amt'].mean())
print('Median value: ',orginal_dataset['min_payment_amt'].median())
print('Standard deviation: ', orginal_dataset['min_payment_amt'].std())
print('Null values: ',orginal_dataset['min_payment_amt'].isnull().any())
Minimum min_payment_amt: 0.7651
Maximum min_payment_amt: 8.456
Mean value: 3.7002009523809507
Median value: 3.599
Standard deviation: 1.5035571308217792
Null values: False
In [48]:
#Quartiles
Q1=orginal_dataset['min_payment_amt'].quantile(q=0.25)
Q3=orginal_dataset['min_payment_amt'].quantile(q=0.75)
print('min_payment_amt - 1st Quartile (Q1) is: ', Q1)
print('min_payment_amt - 3st Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of min_payment_amt is ',
stats.iqr(orginal_dataset['min_payment_amt']))
min_payment_amt - 1st Quartile (Q1) is: 2.5614999999999997
min_payment_amt - 3st Quartile (Q3) is: 4.76875
Interquartile range (IQR) of min_payment_amt is 2.20725
In [49]:
#Outlier detection from Interquartile range (IQR) in original data
# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in min_payment_amt: ', L_outliers)
print('Upper outliers in min_payment_amt: ', U_outliers)
Lower outliers in min_payment_amt: -0.7493750000000006
Upper outliers in min_payment_amt: 8.079625
In [50]:
print('Number of outliers in min_payment_amt upper : ',
orginal_dataset[orginal_dataset['min_payment_amt']>8.079625]
['min_payment_amt'].count())
print('Number of outliers in min_payment_amt lower : ',
orginal_dataset[orginal_dataset['min_payment_amt']<-0.749375]
['min_payment_amt'].count())
print('% of Outlier in min_payment_amt upper:
',round(orginal_dataset[orginal_dataset['min_payment_amt']>8.079625]
['min_payment_amt'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in min_payment_amt lower:
',round(orginal_dataset[orginal_dataset['min_payment_amt']<-0.749375]
['min_payment_amt'].count()*100/len(orginal_dataset)), '%')
Number of outliers in min_payment_amt upper : 2
Number of outliers in min_payment_amt lower : 0
% of Outlier in min_payment_amt upper: 1.0 %
% of Outlier in min_payment_amt lower: 0.0 %
In [51]:
plt.title('min_payment_amt')
sns.boxplot(orginal_dataset['min_payment_amt'],orient='horizondal',color='purple')
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645ae1e090>
In [52]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='min_payment_amt',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('min_payment_amt', fontsize=15)
ax1.set_title('Distribution of min_payment_amt', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(orginal_dataset['min_payment_amt'],ax=ax2)
ax2.set_xlabel('min_payment_amt', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(orginal_dataset['min_payment_amt'])
ax3.set_xlabel('min_payment_amt', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
max_spent_in_single_shopping variable
In [53]:
print('Range of values: ', orginal_dataset['max_spent_in_single_shopping'].max()-
orginal_dataset['max_spent_in_single_shopping'].min())
Range of values: 2.0309999999999997
In [54]:
#Central values
print('Minimum max_spent_in_single_shopping: ',
orginal_dataset['max_spent_in_single_shopping'].min())
print('Maximum max_spent_in_single_shoppings:
',orginal_dataset['max_spent_in_single_shopping'].max())
print('Mean value: ', orginal_dataset['max_spent_in_single_shopping'].mean())
print('Median value: ',orginal_dataset['max_spent_in_single_shopping'].median())
print('Standard deviation: ',
orginal_dataset['max_spent_in_single_shopping'].std())
print('Null values:
',orginal_dataset['max_spent_in_single_shopping'].isnull().any())
Minimum max_spent_in_single_shopping: 4.519
Maximum max_spent_in_single_shoppings: 6.55
Mean value: 5.408071428571429
Median value: 5.223000000000001
Standard deviation: 0.4914804991024054
Null values: False
In [55]:
#Quartiles
Q1=orginal_dataset['max_spent_in_single_shopping'].quantile(q=0.25)
Q3=orginal_dataset['max_spent_in_single_shopping'].quantile(q=0.75)
print('max_spent_in_single_shopping - 1st Quartile (Q1) is: ', Q1)
print('max_spent_in_single_shopping - 3st Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of max_spent_in_single_shopping is ',
stats.iqr(orginal_dataset['max_spent_in_single_shopping']))
max_spent_in_single_shopping - 1st Quartile (Q1) is: 5.045
max_spent_in_single_shopping - 3st Quartile (Q3) is: 5.877000000000001
Interquartile range (IQR) of max_spent_in_single_shopping is
0.8320000000000007
In [56]:
#Outlier detection from Interquartile range (IQR) in original data
# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in max_spent_in_single_shopping: ', L_outliers)
print('Upper outliers in max_spent_in_single_shopping: ', U_outliers)
Lower outliers in max_spent_in_single_shopping: 3.796999999999999
Upper outliers in max_spent_in_single_shopping: 7.125000000000002
In [57]:
print('Number of outliers in max_spent_in_single_shopping upper : ',
orginal_dataset[orginal_dataset['max_spent_in_single_shopping']>7.125000000000002]
['max_spent_in_single_shopping'].count())
print('Number of outliers in max_spent_in_single_shopping lower : ',
orginal_dataset[orginal_dataset['max_spent_in_single_shopping']<3.796999999999999]
['max_spent_in_single_shopping'].count())
print('% of Outlier in max_spent_in_single_shopping upper:
',round(orginal_dataset[orginal_dataset['max_spent_in_single_shopping']>7.12500000
0000002]['max_spent_in_single_shopping'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in max_spent_in_single_shopping lower:
',round(orginal_dataset[orginal_dataset['max_spent_in_single_shopping']<3.79699999
9999999]['max_spent_in_single_shopping'].count()*100/len(orginal_dataset)), '%')
Number of outliers in max_spent_in_single_shopping upper : 0
Number of outliers in max_spent_in_single_shopping lower : 0
% of Outlier in max_spent_in_single_shopping upper: 0.0 %
% of Outlier in max_spent_in_single_shopping lower: 0.0 %
In [58]:
plt.title('max_spent_in_single_shopping')
sns.boxplot(orginal_dataset['max_spent_in_single_shopping'],orient='horizondal',co
lor='purple')
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b26d910>
In [59]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='max_spent_in_single_shopping',data=orginal_dataset,orient='v',ax=ax
1)
ax1.set_ylabel('max_spent_in_single_shopping', fontsize=15)
ax1.set_title('Distribution of max_spent_in_single_shopping', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(orginal_dataset['max_spent_in_single_shopping'],ax=ax2)
ax2.set_xlabel('max_spent_in_single_shopping', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(orginal_dataset['max_spent_in_single_shopping'])
ax3.set_xlabel('max_spent_in_single_shopping', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
In [60]:
# Let's us only plot the distributions of independent attributes
orginal_dataset.hist(figsize=(12,16),layout=(4,2));
In [61]:
# Let's check the skewness values quantitatively
orginal_dataset.skew().sort_values(ascending=False)
Out[61]:
max_spent_in_single_shopping 0.561897
current_balance 0.525482
min_payment_amt 0.401667
spending 0.399889
advance_payments 0.386573
credit_limit 0.134378
probability_of_full_payment -0.537954
dtype: float64
In [62]:
# distplot combines the matplotlib.hist function with seaborn kdeplot()
# KDE Plot represents the Kernel Density Estimate
# KDE is used for visualizing the Probability Density of a continuous variable.
# KDE demonstrates the probability density at different values in a continuous
variable.
plt.figure(figsize=(10,50))
for i in range(len(orginal_dataset.columns)):
plt.subplot(17, 1, i+1)
sns.distplot(orginal_dataset[orginal_dataset.columns[i]], kde_kws={"color": "b",
"lw": 3, "label": "KDE"}, hist_kws={"color": "g"})
plt.title(orginal_dataset.columns[i])
plt.tight_layout()
Observations
Multivariate analysis
Check for multicollinearity
In [63]:
sns.pairplot(orginal_dataset,diag_kind='kde');
Observation
- Strong positive correlation between
- max_spent_in_single_shopping current_balance
In [64]:
#correlation matrix
orginal_dataset.corr().T
Out[64]:
1.00 0.9707
spending 0.994341 0.608288 0.949985 -0.229572 0.863693
0000 71
0.94 0.8604
current_balance 0.972422 0.367915 1.000000 -0.171562 0.932806
9985 15
0.97 1.0000
credit_limit 0.944829 0.761635 0.860415 -0.258037 0.749131
0771 00
- -
min_payment_am -
0.22 -0.217340 -0.331471 0.2580 1.000000 -0.011079
t 0.171562
9572 37
In [65]:
#creating a heatmap for better visualization
plt.figure(figsize=(10,8))
sns.heatmap(orginal_dataset.corr(),annot=True,fmt=".2f",cmap="viridis")
plt.show()
In [66]:
# Let us see the significant correlation either negative or positive among
independent attributes..
c = orginal_dataset.corr().abs() # Since there may be positive as well as -ve
correlation
s = c.unstack() #
so = s.sort_values(ascending=False) # Sorting according to the correlation
so=so[(so<1) & (so>0.3)].drop_duplicates().to_frame() # Due to symmetry..
dropping duplicate entries.
so.columns = ['correlation']
so
Out[66]:
correlation
max_spent_in_single_shoppin current_balance
0.932806
g
max_spent_in_single_shoppin credit_limit
0.749131
g
In [68]:
def check_outliers(data):
vData_num = data.loc[:,data.columns != 'class']
Q1 = vData_num.quantile(0.25)
Q3 = vData_num.quantile(0.75)
IQR = Q3 - Q1
count = 0
# checking for outliers, True represents outlier
vData_num_mod = ((vData_num < (Q1 - 1.5 * IQR)) |(vData_num > (Q3 + 1.5 *
IQR)))
#iterating over columns to check for no.of outliers in each of the numerical
attributes.
for col in vData_num_mod:
if(1 in vData_num_mod[col].value_counts().index):
print("No. of outliers in %s: %d" %( col,
vData_num_mod[col].value_counts().iloc[1]))
count += 1
print("\n\nNo of attributes with outliers are :", count)
check_outliers(orginal_dataset)
No. of outliers in probability_of_full_payment: 3
No. of outliers in min_payment_amt: 2
# Replace elements of columns that fall below Q1-1.5*IQR and above Q3+1.5*IQR
clean_dataset[column].replace(clean_dataset.loc[(clean_dataset[column] >
Q3+1.5*IQR)|(clean_dataset[column] < Q1-1.5*IQR), column],
clean_dataset[column].median())
In [69]:
check_outliers(clean_dataset)
No. of outliers in probability_of_full_payment: 3
No. of outliers in min_payment_amt: 2
Observation
Most of the outlier has been treated and now we are good to go.
In [71]:
plt.title('probability_of_full_payment')
sns.boxplot(clean_dataset['probability_of_full_payment'],orient='horizondal',color
='purple')
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f64585ef310>
Observation
Though we did treated the outlier, we still see one as per the boxplot, it is okay, as it is no extrme
and on lower band.
--------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------
spending, advance_payments are in different values and this may get more weightage.
Also have shown below the plot of the data prior and after scaling.
Scaling will have all the values in the relative same range.
I have used zscore to standarised the data to relative same scale -3 to +3.
In [72]:
# prior to scaling
plt.plot(clean_dataset)
plt.show()
In [73]:
# Scaling the attributes.
Out[73]:
1.754 1.33857
0 1.811968 0.178230 2.367533 -0.298806 2.328998
355 9
0.393 0.85823
1 0.253840 1.501773 -0.600744 -0.242805 -0.538582
582 6
1.413 1.31734
2 1.428192 0.504874 1.401485 -0.221471 1.509107
300 8
- -
3 1.384 -1.227533 -2.591878 -0.793049 1.63901 0.987884 -0.454961
034 7
1.082 1.15546
4 0.998364 1.196340 0.591544 -1.088154 0.874813
581 4
In [74]:
#after scaling
plt.plot(clean_dataset_Scaled)
plt.show()
--------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------
In [75]:
from scipy.cluster.hierarchy import dendrogram, linkage
In [76]:
link_method = linkage(clean_dataset_Scaled, method = 'average')
In [77]:
dend = dendrogram(link_method)
Cutting the Dendrogram with suitable clusters
In [78]:
dend = dendrogram(link_method,
truncate_mode='lastp',
p = 10)
In [79]:
dend = dendrogram(link_method,
truncate_mode='lastp',
p = 25)
In [80]:
from scipy.cluster.hierarchy import fcluster
In [81]:
# Set criterion as maxclust,then create 3 clusters, and store the result in
another object 'clusters'
Out[81]:
array([1, 3, 1, 2, 1, 3, 2, 2, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 2, 2, 2,
1, 2, 3, 1, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1, 1, 3, 1, 1,
2, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2, 1, 3, 1,
1, 3, 1, 2, 3, 2, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 3, 1, 2, 1, 1, 1,
1, 3, 3, 1, 3, 2, 2, 1, 1, 1, 2, 1, 3, 1, 3, 1, 3, 1, 1, 2, 3, 1,
1, 3, 1, 2, 2, 1, 3, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 3, 2, 3,
3, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 2, 2, 1, 2, 3, 2, 3, 2, 3, 1,
3, 3, 2, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 3, 3, 2, 3, 2, 3, 1, 1, 1,
3, 2, 3, 2, 3, 2, 3, 3, 1, 1, 3, 1, 3, 2, 3, 3, 2, 1, 3, 1, 1, 2,
1, 2, 3, 3, 3, 2, 1, 3, 1, 3, 3, 1], dtype=int32)
In [82]:
cluster3_dataset=orginal_dataset.copy()
In [83]:
cluster3_dataset['clusters-3'] = clusters_3
In [84]:
cluster3_dataset.head()
Out[84]:
Cluster Frequency
In [85]:
cluster3_dataset['clusters-3'].value_counts().sort_index()
Out[85]:
1 75
2 70
3 65
Name: clusters-3, dtype: int64
Cluster Profiles
In [86]:
aggdata=cluster3_dataset.groupby('clusters-3').mean()
aggdata['Freq']=cluster3_dataset['clusters-3'].value_counts().sort_index()
aggdata
Out[86]:
clust
ers-3
18.12 3.6481
1 16.058000 0.881595 6.135747 3.650200 5.987040 75
9200 20
11.91 2.8460
2 13.291000 0.846766 5.258300 4.619000 5.115071 70
6857 00
14.21 3.2535
3 14.195846 0.884869 5.442000 2.768418 5.055569 65
7077 08
In [87]:
wardlink = linkage(clean_dataset_Scaled, method = 'ward')
In [88]:
dend_wardlink = dendrogram(wardlink)
In [89]:
dend_wardlink = dendrogram(wardlink,
truncate_mode='lastp',
p = 10,
)
In [90]:
clusters_wdlk_3 = fcluster(wardlink, 3, criterion='maxclust')
clusters_wdlk_3
Out[90]:
array([1, 3, 1, 2, 1, 2, 2, 3, 1, 2, 1, 3, 2, 1, 3, 2, 3, 2, 3, 2, 2, 2,
1, 2, 3, 1, 3, 2, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 1, 1, 3, 1, 1,
2, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2, 3, 3, 1,
1, 3, 1, 2, 3, 2, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 3, 1, 2, 3, 3, 1,
1, 2, 3, 1, 3, 2, 2, 1, 1, 1, 2, 1, 2, 1, 3, 1, 3, 1, 1, 2, 2, 1,
3, 3, 1, 2, 2, 1, 3, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 3, 2, 3,
3, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 3, 2, 1, 2, 3, 2, 3, 2, 3, 3,
3, 3, 3, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 3, 3, 3, 3, 2, 3, 1, 1, 1,
3, 3, 1, 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 2, 3, 3, 2, 1, 3, 1, 1, 2,
1, 2, 3, 1, 3, 2, 1, 3, 1, 3, 1, 3], dtype=int32)
In [91]:
cluster_w_3_dataset=orginal_dataset.copy()
In [92]:
cluster_w_3_dataset['clusters-3'] = clusters_wdlk_3
In [93]:
cluster_w_3_dataset.head()
Out[93]:
In [94]:
cluster_w_3_dataset['clusters-3'].value_counts().sort_index()
Out[94]:
1 70
2 67
3 73
Name: clusters-3, dtype: int64
In [95]:
aggdata_w=cluster_w_3_dataset.groupby('clusters-3').mean()
aggdata_w['Freq']=cluster_w_3_dataset['clusters-3'].value_counts().sort_index()
aggdata_w
Out[95]:
clust
ers-3
18.37 3.6846
1 16.145429 0.884400 6.158171 3.639157 6.017371 70
1429 29
11.87 2.8485
2 13.257015 0.848072 5.238940 4.949433 5.122209 67
2388 37
14.19 3.2264
3 14.233562 0.879190 5.478233 2.612181 5.086178 73
9041 52
Observation
Both the method are almost similer means , minor variation, which we know it occurs.
We for cluster grouping based on the dendrogram, 3 or 4 looks good. Did the further analysis,
and based on the dataset had gone for 3 group cluster solution based on the hierarchical
clustering
Also in real time, there colud have been more variables value captured - tenure,
BALANCE_FREQUENCY, balance, purchase, installment of purchase, others.
And three group cluster solution gives a pattern based on high/medium/low spending with
max_spent_in_single_shopping (high value item) and probability_of_full_payment(payment
made).
--------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------
In [97]:
k_means = KMeans(n_clusters = 1)
k_means.fit(clean_dataset_Scaled)
k_means.inertia_
Out[97]:
1469.9999999999998
In [98]:
k_means = KMeans(n_clusters = 2)
k_means.fit(clean_dataset_Scaled)
k_means.inertia_
Out[98]:
659.171754487041
In [99]:
k_means = KMeans(n_clusters = 3)
k_means.fit(clean_dataset_Scaled)
k_means.inertia_
Out[99]:
430.6589731513006
In [100]:
k_means = KMeans(n_clusters = 4)
k_means.fit(clean_dataset_Scaled)
k_means.inertia_
Out[100]:
371.74655984791394
In [101]:
wss =[]
In [102]:
for i in range(1,11):
KM = KMeans(n_clusters=i)
KM.fit(clean_dataset_Scaled)
wss.append(KM.inertia_)
In [103]:
wss
Out[103]:
[1469.9999999999998,
659.171754487041,
430.6589731513006,
371.74655984791394,
327.5720355975522,
289.7454594701129,
261.99257202366164,
239.88573666700952,
221.00108292809628,
209.55056673071783]
In [104]:
plt.plot(range(1,11), wss)
plt.xlabel("Clusters")
plt.ylabel("Inertia in the cluster")
plt.show()
In [105]:
k_means_4 = KMeans(n_clusters = 4)
k_means_4.fit(clean_dataset_Scaled)
labels_4 = k_means_4.labels_
In [106]:
kmeans4_dataset=orginal_dataset.copy()
In [107]:
kmeans4_dataset["Clus_kmeans"] = labels_4
kmeans4_dataset.head(5)
Out[107]:
In [108]:
from sklearn.metrics import silhouette_samples, silhouette_score
In [109]:
silhouette_score(clean_dataset_Scaled,labels_4)
Out[109]:
0.3291966792017613
In [110]:
from sklearn import metrics
In [111]:
scores = []
k_range = range(2, 11)
for k in k_range:
km = KMeans(n_clusters=k, random_state=2)
km.fit(clean_dataset_Scaled)
scores.append(metrics.silhouette_score(clean_dataset_Scaled, km.labels_))
scores
Out[111]:
[0.46577247686580914,
0.4007270552751299,
0.3291966792017613,
0.28316654897654814,
0.2897583830272518,
0.2694844355168535,
0.25437316027505635,
0.2623959398663564,
0.2673980772529917]
In [112]:
#plotting the sc scores
plt.plot(k_range,scores)
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
Insights
In [113]:
sil_width = silhouette_samples(clean_dataset_Scaled,labels_4)
In [114]:
kmeans4_dataset["sil_width"] = sil_width
kmeans4_dataset.head(5)
Out[114]:
19.9 0.43
0 16.92 0.8752 6.675 3.763 3.252 6.550 3
4 2658
15.9 0.09
1 14.89 0.9064 5.363 3.582 3.336 5.144 0
9 9543
18.9 0.42
2 16.42 0.8829 6.248 3.755 3.368 6.148 3
5 5893
spen advance_p probability_of_f current_ credit min_paym max_spent_in_sin Clus_k sil_w
ding ayments ull_payment balance _limit ent_amt gle_shopping means idth
10.8 0.52
3 12.96 0.8099 5.278 2.641 5.182 5.185 2
3 9852
17.9 0.08
4 15.86 0.8992 5.890 3.694 2.068 5.837 3
9 2791
In [115]:
silhouette_samples(clean_dataset_Scaled,labels_4).min()
Out[115]:
-0.05115805932867967
3 Cluster Solution
In [116]:
km_3 = KMeans(n_clusters=3,random_state=123)
In [117]:
#fitting the Kmeans
km_3.fit(clean_dataset_Scaled)
km_3.labels_
Out[117]:
array([0, 2, 0, 1, 0, 1, 1, 2, 0, 1, 0, 2, 1, 0, 2, 1, 2, 1, 1, 1, 1, 1,
0, 1, 2, 0, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0,
1, 1, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2, 1, 1, 2, 2, 0,
0, 2, 0, 1, 2, 1, 0, 0, 1, 0, 2, 1, 0, 2, 2, 2, 2, 0, 1, 2, 0, 2,
0, 1, 2, 0, 2, 1, 1, 0, 0, 0, 1, 0, 2, 0, 2, 0, 2, 0, 0, 1, 1, 0,
2, 2, 0, 1, 1, 0, 2, 2, 1, 0, 2, 1, 1, 1, 2, 2, 0, 1, 2, 2, 1, 2,
2, 0, 1, 0, 0, 1, 0, 2, 2, 2, 1, 1, 2, 1, 0, 1, 2, 1, 2, 1, 2, 2,
1, 2, 2, 1, 2, 0, 0, 1, 0, 0, 0, 1, 2, 2, 2, 1, 2, 1, 2, 0, 0, 0,
2, 1, 2, 1, 2, 2, 2, 2, 0, 0, 1, 2, 2, 1, 1, 2, 1, 0, 2, 0, 0, 1,
0, 1, 2, 0, 2, 1, 0, 2, 0, 2, 2, 2], dtype=int32)
In [118]:
#proportion of labels classified
pd.Series(km_3.labels_).value_counts()
Out[118]:
1 72
2 71
0 67
dtype: int64
K-Means Clustering & Cluster Information
In [119]:
kmeans1_dataset=orginal_dataset.copy()
In [120]:
# Fitting K-Means to the dataset
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(clean_dataset_Scaled)
y_kmeans1=y_kmeans
y_kmeans1=y_kmeans+1
cluster = pd.DataFrame(y_kmeans1)
kmeans1_dataset['cluster'] = cluster
#Mean of clusters
kmeans_mean_cluster =
pd.DataFrame(round(kmeans1_dataset.groupby('cluster').mean(),1))
kmeans_mean_cluster
Out[120]:
clus
ter
In [121]:
def ClusterPercentage(datafr,name):
"""Common utility function to calculate the percentage and size of cluster"""
size = pd.Series(datafr[name].value_counts().sort_index())
percent = pd.Series(round(datafr[name].value_counts()/datafr.shape[0] *
100,2)).sort_index()
return(size_df)
In [122]:
ClusterPercentage(kmeans1_dataset,"cluster")
Out[122]:
Cluster_Size Cluster_Percentage
1 67 31.90
2 72 34.29
3 71 33.81
In [123]:
#transposing the cluster
cluster_3_T = kmeans_mean_cluster.T
In [124]:
cluster_3_T
Out[124]:
cluster 1 2 3
In [125]:
km_4 = KMeans(n_clusters=4,random_state=123)
In [126]:
#fitting the Kmeans
km_4.fit(clean_dataset_Scaled)
km_4.labels_
Out[126]:
array([2, 1, 2, 0, 2, 0, 0, 1, 2, 0, 2, 3, 0, 2, 1, 0, 1, 0, 1, 0, 0, 0,
2, 0, 1, 3, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 2, 2, 1, 3, 2,
0, 0, 1, 2, 2, 2, 0, 2, 2, 2, 2, 3, 0, 0, 0, 2, 1, 0, 0, 3, 1, 2,
2, 1, 2, 1, 1, 0, 2, 2, 0, 2, 1, 0, 3, 1, 1, 1, 1, 2, 0, 3, 3, 3,
3, 0, 1, 2, 1, 0, 1, 2, 2, 3, 0, 2, 1, 2, 3, 2, 1, 2, 2, 0, 1, 2,
3, 1, 2, 0, 0, 3, 1, 3, 0, 2, 1, 0, 0, 0, 1, 1, 2, 0, 1, 1, 0, 1,
1, 2, 0, 2, 2, 0, 3, 1, 3, 1, 0, 0, 1, 0, 2, 0, 1, 0, 1, 0, 1, 3,
1, 1, 1, 0, 1, 2, 2, 0, 2, 3, 2, 0, 3, 1, 1, 0, 1, 0, 1, 2, 2, 2,
1, 1, 3, 0, 1, 1, 1, 1, 3, 3, 1, 3, 1, 0, 1, 1, 0, 2, 1, 3, 2, 0,
2, 0, 1, 3, 1, 0, 3, 1, 3, 1, 3, 3], dtype=int32)
In [127]:
#proportion of labels classified
pd.Series(km_4.labels_).value_counts()
Out[127]:
1 65
0 64
2 51
3 30
dtype: int64
In [128]:
kmeans14_dataset=orginal_dataset.copy()
In [129]:
# Fitting K-Means to the dataset
cluster = pd.DataFrame(y_kmeans1)
kmeans14_dataset['cluster'] = cluster
#Mean of clusters
kmeans_mean_cluster =
pd.DataFrame(round(kmeans14_dataset.groupby('cluster').mean(),1))
kmeans_mean_cluster
Out[129]:
clus
ter
In [130]:
ClusterPercentage(kmeans14_dataset,"cluster")
Out[130]:
Cluster_Size Cluster_Percentage
1 30 14.29
2 67 31.90
Cluster_Size Cluster_Percentage
3 48 22.86
4 65 30.95
In [131]:
#transposing the cluster
cluster_4_T = kmeans_mean_cluster.T
In [132]:
cluster_4_T
Out[132]:
cluster 1 2 3 4
max_spent_in_single_shoppin
5.7 5.0 6.1 5.1
g
5 cluster
In [133]:
kmeans15_dataset=orginal_dataset.copy()
In [134]:
# Fitting K-Means to the dataset
y_kmeans1=y_kmeans
y_kmeans1=y_kmeans+1
cluster = pd.DataFrame(y_kmeans1)
kmeans15_dataset['cluster'] = cluster
#Mean of clusters
kmeans_mean_cluster =
pd.DataFrame(round(kmeans15_dataset.groupby('cluster').mean(),1))
kmeans_mean_cluster
Out[134]:
clus
ter
In [135]:
ClusterPercentage(kmeans15_dataset,"cluster")
Out[135]:
Cluster_Size Cluster_Percentage
1 48 22.86
2 41 19.52
3 56 26.67
4 29 13.81
5 36 17.14
In [136]:
#transposing the cluster
cluster_5_T = kmeans_mean_cluster.T
In [137]:
cluster_5_T
Out[137]:
cluster 1 2 3 4 5
max_spent_in_single_shoppin
6.1 5.2 5.1 5.7 5.0
g
--------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------
Out[138]:
cluster 1 2 3
Out[139]:
clusters-3 1 2 3
11.87238
spending 18.371429 14.199041
8
13.25701
advance_payments 16.145429 14.233562
5
67.00000
Freq 70.000000 73.000000
0
- Give loan against the credit card, as they are customers with good repayment
record.
- Tie up with luxary brands, which will drive more one_time_maximun spending
--------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------