Bank Customer Segmentation Guide

The bank wants to develop a customer segmentation model to target customers with promotional offers. They collected data on customers' credit card usage over the past few months for analysis. The task is to identify customer segments based on variables related to credit card spending, payments, balances and limits in the dataset. Exploratory data analysis found the data has 210 records with no missing values, all variables are numeric, and descriptive statistics show the variables are distributed relatively evenly.

Problem 1: Clustering

A leading bank wants to develop a customer segmentation to give promotional offers to its
customers.

They collected a sample that summarizes the activities of users during the past few months.

You are given the task to identify the segments based on credit card usage.

Data Dictionary for Market Segmentation:

- spending: Amount spent by the customer per month (in 1000s)
- advance_payments: Amount paid by the customer in advance by cash (in 100s)
- probability_of_full_payment: Probability of payment done in full by the customer to the bank
- current_balance: Balance amount left in the account to make purchases (in 1000s)
- credit_limit: Limit of the amount in credit card (in 10000s)
- min_payment_amt: Minimum paid by the customer while making payments for purchases made monthly (in 100s)
- max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)

--------------------------------------------------------------------------------

Import the libraries


In [1]:
#Load the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#Plot styling
import seaborn as sns; sns.set() # for plot styling
%matplotlib inline

In [3]:
# Import stats from scipy
from scipy import stats

--------------------------------------------------------------------------------

1.1 Read the data and do exploratory data analysis. Describe the data briefly
Load data
In [4]:
#Read the csv file
orginal_dataset = pd.read_csv('../input/bank-marketing/bank_marketing_part1_Data.csv')
Checking the data
In [5]:
orginal_dataset.head()

Out[5]:

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
0     19.94             16.92                       0.8752            6.675         3.763            3.252                         6.550
1     15.99             14.89                       0.9064            5.363         3.582            3.336                         5.144
2     18.95             16.42                       0.8829            6.248         3.755            3.368                         6.148
3     10.83             12.96                       0.8099            5.278         2.641            5.182                         5.185
4     17.99             15.86                       0.8992            5.890         3.694            2.068                         5.837

In [6]:
orginal_dataset.tail()

Out[6]:

     spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
205     13.89             14.02                       0.8880            5.439         3.199            3.986                         4.738
206     16.77             15.62                       0.8638            5.927         3.438            4.920                         5.795
207     14.03             14.16                       0.8796            5.438         3.201            1.717                         5.001
208     16.12             15.00                       0.9000            5.709         3.485            2.270                         5.443
209     15.57             15.15                       0.8527            5.920         3.231            2.640                         5.879

Observation

Data looks good based on the initial records seen in the top 5 and bottom 5 rows.

In [7]:
orginal_dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 spending 210 non-null float64
1 advance_payments 210 non-null float64
2 probability_of_full_payment 210 non-null float64
3 current_balance 210 non-null float64
4 credit_limit 210 non-null float64
5 min_payment_amt 210 non-null float64
6 max_spent_in_single_shopping 210 non-null float64
dtypes: float64(7)
memory usage: 11.6 KB

Observation:

- 7 variables and 210 records.
- No missing records based on initial analysis.
- All variables are of numeric type.
In [8]:
### data dimensions

orginal_dataset.shape

Out[8]:
(210, 7)
In [9]:
### Checking for Missing Values
orginal_dataset.isnull().sum()

Out[9]:
spending 0
advance_payments 0
probability_of_full_payment 0
current_balance 0
credit_limit 0
min_payment_amt 0
max_spent_in_single_shopping 0
dtype: int64

Observation

No missing value.

Univariate analysis
Checking the Summary Statistics
In [10]:
## Initial descriptive analysis of the data

orginal_dataset.describe(percentiles=[.25,0.50,0.75,0.90]).T

Out[10]:

                              count       mean       std      min       25%      50%       75%      90%      max
spending                      210.0  14.847524  2.909699  10.5900  12.27000  14.35500  17.305000  18.9880  21.1800
advance_payments              210.0  14.559286  1.305959  12.4100  13.45000  14.32000  15.715000  16.4540  17.2500
probability_of_full_payment   210.0   0.870999  0.023629   0.8081   0.85690   0.87345   0.887775   0.8993   0.9183
current_balance               210.0   5.628533  0.443063   4.8990   5.26225   5.52350   5.979750   6.2733   6.6750
credit_limit                  210.0   3.258605  0.377714   2.6300   2.94400   3.23700   3.561750   3.7865   4.0330
min_payment_amt               210.0   3.700201  1.503557   0.7651   2.56150   3.59900   4.768750   5.5376   8.4560
max_spent_in_single_shopping  210.0   5.408071  0.491480   4.5190   5.04500   5.22300   5.877000   6.1850   6.5500

Observation

- Based on the summary statistics, the data looks good.
- For most of the variables, mean and median are nearly equal.
- The 90th percentile was included to check variation; the variables look evenly distributed.
- Standard deviation is high for the spending variable.

Spending variable
In [11]:
print('Range of values: ', orginal_dataset['spending'].max()-orginal_dataset['spending'].min())
Range of values: 10.59
In [12]:
#Central values
print('Minimum spending: ', orginal_dataset['spending'].min())
print('Maximum spending: ',orginal_dataset['spending'].max())
print('Mean value: ', orginal_dataset['spending'].mean())
print('Median value: ',orginal_dataset['spending'].median())
print('Standard deviation: ', orginal_dataset['spending'].std())
print('Null values: ',orginal_dataset['spending'].isnull().any())
Minimum spending: 10.59
Maximum spending: 21.18
Mean value: 14.847523809523818
Median value: 14.355
Standard deviation: 2.909699430687361
Null values: False
In [13]:
#Quartiles

Q1=orginal_dataset['spending'].quantile(q=0.25)
Q3=orginal_dataset['spending'].quantile(q=0.75)
print('spending - 1st Quartile (Q1) is: ', Q1)
print('spending - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of spending is ', stats.iqr(orginal_dataset['spending']))
spending - 1st Quartile (Q1) is: 12.27
spending - 3rd Quartile (Q3) is: 17.305
Interquartile range (IQR) of spending is 5.035
In [14]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in spending: ', L_outliers)
print('Upper outliers in spending: ', U_outliers)
Lower outliers in spending: 4.717499999999999
Upper outliers in spending: 24.8575
In [15]:
print('Number of outliers in spending upper : ', orginal_dataset[orginal_dataset['spending']>24.8575]['spending'].count())
print('Number of outliers in spending lower : ', orginal_dataset[orginal_dataset['spending']<4.717499]['spending'].count())
print('% of Outlier in spending upper: ', round(orginal_dataset[orginal_dataset['spending']>24.8575]['spending'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in spending lower: ', round(orginal_dataset[orginal_dataset['spending']<4.717499]['spending'].count()*100/len(orginal_dataset)), '%')
Number of outliers in spending upper : 0
Number of outliers in spending lower : 0
% of Outlier in spending upper: 0.0 %
% of Outlier in spending lower: 0.0 %
In [16]:
plt.title('spending')
sns.boxplot(orginal_dataset['spending'], orient='h', color='purple')

Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f64619a9210>

In [17]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='spending',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('spending', fontsize=15)
ax1.set_title('Distribution of spending', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['spending'],ax=ax2)
ax2.set_xlabel('spending', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['spending'])
ax3.set_xlabel('spending', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
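The same range, central-value, quartile, and outlier computations are repeated below for each remaining variable. As an aside (not part of the original notebook), a small helper along the lines of the hypothetical univariate_summary sketch below could replace that repetition:

from scipy import stats

def univariate_summary(df, col):
    # Print range, central values, quartiles and IQR outlier fences for one column
    s = df[col]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = stats.iqr(s)
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    print(col, '- min:', s.min(), ', max:', s.max(), ', mean:', s.mean(), ', median:', s.median())
    print(col, '- Q1:', q1, ', Q3:', q3, ', IQR:', iqr)
    print(col, '- outliers below', lower, ':', (s < lower).sum(), '; above', upper, ':', (s > upper).sum())

# Example usage (hypothetical): univariate_summary(orginal_dataset, 'spending')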

advance_payments variable
In [18]:
print('Range of values: ', orginal_dataset['advance_payments'].max()-orginal_dataset['advance_payments'].min())
Range of values: 4.84
In [19]:
#Central values
print('Minimum advance_payments: ', orginal_dataset['advance_payments'].min())
print('Maximum advance_payments: ',orginal_dataset['advance_payments'].max())
print('Mean value: ', orginal_dataset['advance_payments'].mean())
print('Median value: ',orginal_dataset['advance_payments'].median())
print('Standard deviation: ', orginal_dataset['advance_payments'].std())
print('Null values: ',orginal_dataset['advance_payments'].isnull().any())
Minimum advance_payments: 12.41
Maximum advance_payments: 17.25
Mean value: 14.559285714285727
Median value: 14.32
Standard deviation: 1.305958726564022
Null values: False
In [20]:
#Quartiles

Q1=orginal_dataset['advance_payments'].quantile(q=0.25)
Q3=orginal_dataset['advance_payments'].quantile(q=0.75)
print('advance_payments - 1st Quartile (Q1) is: ', Q1)
print('advance_payments - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of advance_payments is ', stats.iqr(orginal_dataset['advance_payments']))
advance_payments - 1st Quartile (Q1) is: 13.45
advance_payments - 3rd Quartile (Q3) is: 15.715
Interquartile range (IQR) of advance_payments is 2.2650000000000006
In [21]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in advance_payments: ', L_outliers)
print('Upper outliers in advance_payments: ', U_outliers)
Lower outliers in advance_payments: 10.052499999999998
Upper outliers in advance_payments: 19.1125
In [22]:
print('Number of outliers in advance_payments upper : ', orginal_dataset[orginal_dataset['advance_payments']>19.1125]['advance_payments'].count())
print('Number of outliers in advance_payments lower : ', orginal_dataset[orginal_dataset['advance_payments']<10.052499]['advance_payments'].count())
print('% of Outlier in advance_payments upper: ', round(orginal_dataset[orginal_dataset['advance_payments']>19.1125]['advance_payments'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in advance_payments lower: ', round(orginal_dataset[orginal_dataset['advance_payments']<10.052499]['advance_payments'].count()*100/len(orginal_dataset)), '%')
Number of outliers in advance_payments upper : 0
Number of outliers in advance_payments lower : 0
% of Outlier in advance_payments upper: 0.0 %
% of Outlier in advance_payments lower: 0.0 %
In [23]:
plt.title('advance_payments')
sns.boxplot(orginal_dataset['advance_payments'], orient='h', color='purple')

Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b422690>
In [24]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='advance_payments',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('advance_payments', fontsize=15)
ax1.set_title('Distribution of advance_payments', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['advance_payments'],ax=ax2)
ax2.set_xlabel('advance_payments', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['advance_payments'])
ax3.set_xlabel('advance_payments', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()

probability_of_full_payment variable

In [25]:
print('Range of values: ', orginal_dataset['probability_of_full_payment'].max()-orginal_dataset['probability_of_full_payment'].min())
Range of values: 0.11019999999999996
In [26]:
#Central values
print('Minimum probability_of_full_payment: ', orginal_dataset['probability_of_full_payment'].min())
print('Maximum probability_of_full_payment: ', orginal_dataset['probability_of_full_payment'].max())
print('Mean value: ', orginal_dataset['probability_of_full_payment'].mean())
print('Median value: ', orginal_dataset['probability_of_full_payment'].median())
print('Standard deviation: ', orginal_dataset['probability_of_full_payment'].std())
print('Null values: ', orginal_dataset['probability_of_full_payment'].isnull().any())
Minimum probability_of_full_payment: 0.8081
Maximum probability_of_full_payment: 0.9183
Mean value: 0.8709985714285714
Median value: 0.8734500000000001
Standard deviation: 0.023629416583846496
Null values: False
In [27]:
#Quartiles

Q1=orginal_dataset['probability_of_full_payment'].quantile(q=0.25)
Q3=orginal_dataset['probability_of_full_payment'].quantile(q=0.75)
print('probability_of_full_payment - 1st Quartile (Q1) is: ', Q1)
print('probability_of_full_payment - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of probability_of_full_payment is ', stats.iqr(orginal_dataset['probability_of_full_payment']))
probability_of_full_payment - 1st Quartile (Q1) is: 0.8569
probability_of_full_payment - 3rd Quartile (Q3) is: 0.887775
Interquartile range (IQR) of probability_of_full_payment is 0.030874999999999986
In [28]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in probability_of_full_payment: ', L_outliers)
print('Upper outliers in probability_of_full_payment: ', U_outliers)
Lower outliers in probability_of_full_payment: 0.8105875
Upper outliers in probability_of_full_payment: 0.9340875
In [29]:
print('Number of outliers in probability_of_full_payment upper : ', orginal_dataset[orginal_dataset['probability_of_full_payment']>0.9340875]['probability_of_full_payment'].count())
print('Number of outliers in probability_of_full_payment lower : ', orginal_dataset[orginal_dataset['probability_of_full_payment']<0.8105875]['probability_of_full_payment'].count())
print('% of Outlier in probability_of_full_payment upper: ', round(orginal_dataset[orginal_dataset['probability_of_full_payment']>0.9340875]['probability_of_full_payment'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in probability_of_full_payment lower: ', round(orginal_dataset[orginal_dataset['probability_of_full_payment']<0.8105875]['probability_of_full_payment'].count()*100/len(orginal_dataset)), '%')
Number of outliers in probability_of_full_payment upper : 0
Number of outliers in probability_of_full_payment lower : 3
% of Outlier in probability_of_full_payment upper: 0.0 %
% of Outlier in probability_of_full_payment lower: 1.0 %
In [30]:
plt.title('probability_of_full_payment')
sns.boxplot(orginal_dataset['probability_of_full_payment'], orient='h', color='purple')

Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b57fc50>

In [31]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='probability_of_full_payment',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('probability_of_full_payment', fontsize=15)
ax1.set_title('Distribution of probability_of_full_payment', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['probability_of_full_payment'],ax=ax2)
ax2.set_xlabel('probability_of_full_payment', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['probability_of_full_payment'])
ax3.set_xlabel('probability_of_full_payment', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()

current_balance variable

In [32]:
print('Range of values: ', orginal_dataset['current_balance'].max()-orginal_dataset['current_balance'].min())
Range of values: 1.7759999999999998
In [33]:
#Central values
print('Minimum current_balance: ', orginal_dataset['current_balance'].min())
print('Maximum current_balance: ',orginal_dataset['current_balance'].max())
print('Mean value: ', orginal_dataset['current_balance'].mean())
print('Median value: ',orginal_dataset['current_balance'].median())
print('Standard deviation: ', orginal_dataset['current_balance'].std())
print('Null values: ',orginal_dataset['current_balance'].isnull().any())
Minimum current_balance: 4.899
Maximum current_balance: 6.675
Mean value: 5.628533333333334
Median value: 5.5235
Standard deviation: 0.4430634777264493
Null values: False
In [34]:
#Quartiles

Q1=orginal_dataset['current_balance'].quantile(q=0.25)
Q3=orginal_dataset['current_balance'].quantile(q=0.75)
print('current_balance - 1st Quartile (Q1) is: ', Q1)
print('current_balance - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of current_balance is ', stats.iqr(orginal_dataset['current_balance']))
current_balance - 1st Quartile (Q1) is: 5.26225
current_balance - 3rd Quartile (Q3) is: 5.97975
Interquartile range (IQR) of current_balance is 0.7175000000000002
In [35]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in current_balance: ', L_outliers)
print('Upper outliers in current_balance: ', U_outliers)
Lower outliers in current_balance: 4.186
Upper outliers in current_balance: 7.056000000000001
In [36]:
print('Number of outliers in current_balance upper : ', orginal_dataset[orginal_dataset['current_balance']>7.056000000000001]['current_balance'].count())
print('Number of outliers in current_balance lower : ', orginal_dataset[orginal_dataset['current_balance']<4.186]['current_balance'].count())
print('% of Outlier in current_balance upper: ', round(orginal_dataset[orginal_dataset['current_balance']>7.056000000000001]['current_balance'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in current_balance lower: ', round(orginal_dataset[orginal_dataset['current_balance']<4.186]['current_balance'].count()*100/len(orginal_dataset)), '%')
Number of outliers in current_balance upper : 0
Number of outliers in current_balance lower : 0
% of Outlier in current_balance upper: 0.0 %
% of Outlier in current_balance lower: 0.0 %
In [37]:
plt.title('current_balance')
sns.boxplot(orginal_dataset['current_balance'], orient='h', color='purple')

Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b209dd0>

In [38]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='current_balance',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('current_balance', fontsize=15)
ax1.set_title('Distribution of current_balance', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['current_balance'],ax=ax2)
ax2.set_xlabel('current_balance', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['current_balance'])
ax3.set_xlabel('current_balance', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()

credit_limit variable

In [39]:
print('Range of values: ', orginal_dataset['credit_limit'].max()-orginal_dataset['credit_limit'].min())
Range of values: 1.4030000000000005
In [40]:
#Central values
print('Minimum credit_limit: ', orginal_dataset['credit_limit'].min())
print('Maximum credit_limit: ',orginal_dataset['credit_limit'].max())
print('Mean value: ', orginal_dataset['credit_limit'].mean())
print('Median value: ',orginal_dataset['credit_limit'].median())
print('Standard deviation: ', orginal_dataset['credit_limit'].std())
print('Null values: ',orginal_dataset['credit_limit'].isnull().any())
Minimum credit_limit: 2.63
Maximum credit_limit: 4.033
Mean value: 3.258604761904763
Median value: 3.237
Standard deviation: 0.3777144449065874
Null values: False
In [41]:
#Quartiles

Q1=orginal_dataset['credit_limit'].quantile(q=0.25)
Q3=orginal_dataset['credit_limit'].quantile(q=0.75)
print('credit_limit - 1st Quartile (Q1) is: ', Q1)
print('credit_limit - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of credit_limit is ', stats.iqr(orginal_dataset['credit_limit']))
credit_limit - 1st Quartile (Q1) is: 2.944
credit_limit - 3rd Quartile (Q3) is: 3.56175
Interquartile range (IQR) of credit_limit is 0.61775
In [42]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in credit_limit: ', L_outliers)
print('Upper outliers in credit_limit: ', U_outliers)
Lower outliers in credit_limit: 2.017375
Upper outliers in credit_limit: 4.488375
In [43]:
print('Number of outliers in credit_limit upper : ', orginal_dataset[orginal_dataset['credit_limit']>4.488375]['credit_limit'].count())
print('Number of outliers in credit_limit lower : ', orginal_dataset[orginal_dataset['credit_limit']<2.017375]['credit_limit'].count())
print('% of Outlier in credit_limit upper: ', round(orginal_dataset[orginal_dataset['credit_limit']>4.488375]['credit_limit'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in credit_limit lower: ', round(orginal_dataset[orginal_dataset['credit_limit']<2.017375]['credit_limit'].count()*100/len(orginal_dataset)), '%')
Number of outliers in credit_limit upper : 0
Number of outliers in credit_limit lower : 0
% of Outlier in credit_limit upper: 0.0 %
% of Outlier in credit_limit lower: 0.0 %
In [44]:
plt.title('credit_limit')
sns.boxplot(orginal_dataset['credit_limit'], orient='h', color='purple')

Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b036fd0>

In [45]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='credit_limit',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('credit_limit', fontsize=15)
ax1.set_title('Distribution of credit_limit', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['credit_limit'],ax=ax2)
ax2.set_xlabel('credit_limit', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['credit_limit'])
ax3.set_xlabel('credit_limit', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()

min_payment_amt variable

In [46]:
print('Range of values: ', orginal_dataset['min_payment_amt'].max()-orginal_dataset['min_payment_amt'].min())
Range of values: 7.690899999999999
In [47]:
#Central values
print('Minimum min_payment_amt: ', orginal_dataset['min_payment_amt'].min())
print('Maximum min_payment_amt: ',orginal_dataset['min_payment_amt'].max())
print('Mean value: ', orginal_dataset['min_payment_amt'].mean())
print('Median value: ',orginal_dataset['min_payment_amt'].median())
print('Standard deviation: ', orginal_dataset['min_payment_amt'].std())
print('Null values: ',orginal_dataset['min_payment_amt'].isnull().any())
Minimum min_payment_amt: 0.7651
Maximum min_payment_amt: 8.456
Mean value: 3.7002009523809507
Median value: 3.599
Standard deviation: 1.5035571308217792
Null values: False
In [48]:
#Quartiles

Q1=orginal_dataset['min_payment_amt'].quantile(q=0.25)
Q3=orginal_dataset['min_payment_amt'].quantile(q=0.75)
print('min_payment_amt - 1st Quartile (Q1) is: ', Q1)
print('min_payment_amt - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of min_payment_amt is ', stats.iqr(orginal_dataset['min_payment_amt']))
min_payment_amt - 1st Quartile (Q1) is: 2.5614999999999997
min_payment_amt - 3rd Quartile (Q3) is: 4.76875
Interquartile range (IQR) of min_payment_amt is 2.20725
In [49]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in min_payment_amt: ', L_outliers)
print('Upper outliers in min_payment_amt: ', U_outliers)
Lower outliers in min_payment_amt: -0.7493750000000006
Upper outliers in min_payment_amt: 8.079625
In [50]:
print('Number of outliers in min_payment_amt upper : ', orginal_dataset[orginal_dataset['min_payment_amt']>8.079625]['min_payment_amt'].count())
print('Number of outliers in min_payment_amt lower : ', orginal_dataset[orginal_dataset['min_payment_amt']<-0.749375]['min_payment_amt'].count())
print('% of Outlier in min_payment_amt upper: ', round(orginal_dataset[orginal_dataset['min_payment_amt']>8.079625]['min_payment_amt'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in min_payment_amt lower: ', round(orginal_dataset[orginal_dataset['min_payment_amt']<-0.749375]['min_payment_amt'].count()*100/len(orginal_dataset)), '%')
Number of outliers in min_payment_amt upper : 2
Number of outliers in min_payment_amt lower : 0
% of Outlier in min_payment_amt upper: 1.0 %
% of Outlier in min_payment_amt lower: 0.0 %
In [51]:
plt.title('min_payment_amt')
sns.boxplot(orginal_dataset['min_payment_amt'], orient='h', color='purple')

Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645ae1e090>
In [52]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='min_payment_amt',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('min_payment_amt', fontsize=15)
ax1.set_title('Distribution of min_payment_amt', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['min_payment_amt'],ax=ax2)
ax2.set_xlabel('min_payment_amt', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['min_payment_amt'])
ax3.set_xlabel('min_payment_amt', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()

max_spent_in_single_shopping variable

In [53]:
print('Range of values: ', orginal_dataset['max_spent_in_single_shopping'].max()-orginal_dataset['max_spent_in_single_shopping'].min())
Range of values: 2.0309999999999997
In [54]:
#Central values
print('Minimum max_spent_in_single_shopping: ', orginal_dataset['max_spent_in_single_shopping'].min())
print('Maximum max_spent_in_single_shopping: ', orginal_dataset['max_spent_in_single_shopping'].max())
print('Mean value: ', orginal_dataset['max_spent_in_single_shopping'].mean())
print('Median value: ', orginal_dataset['max_spent_in_single_shopping'].median())
print('Standard deviation: ', orginal_dataset['max_spent_in_single_shopping'].std())
print('Null values: ', orginal_dataset['max_spent_in_single_shopping'].isnull().any())
Minimum max_spent_in_single_shopping: 4.519
Maximum max_spent_in_single_shopping: 6.55
Mean value: 5.408071428571429
Median value: 5.223000000000001
Standard deviation: 0.4914804991024054
Null values: False
In [55]:
#Quartiles

Q1=orginal_dataset['max_spent_in_single_shopping'].quantile(q=0.25)
Q3=orginal_dataset['max_spent_in_single_shopping'].quantile(q=0.75)
print('max_spent_in_single_shopping - 1st Quartile (Q1) is: ', Q1)
print('max_spent_in_single_shopping - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of max_spent_in_single_shopping is ', stats.iqr(orginal_dataset['max_spent_in_single_shopping']))
max_spent_in_single_shopping - 1st Quartile (Q1) is: 5.045
max_spent_in_single_shopping - 3rd Quartile (Q3) is: 5.877000000000001
Interquartile range (IQR) of max_spent_in_single_shopping is 0.8320000000000007
In [56]:
#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in max_spent_in_single_shopping: ', L_outliers)
print('Upper outliers in max_spent_in_single_shopping: ', U_outliers)
Lower outliers in max_spent_in_single_shopping: 3.796999999999999
Upper outliers in max_spent_in_single_shopping: 7.125000000000002
In [57]:
print('Number of outliers in max_spent_in_single_shopping upper : ', orginal_dataset[orginal_dataset['max_spent_in_single_shopping']>7.125000000000002]['max_spent_in_single_shopping'].count())
print('Number of outliers in max_spent_in_single_shopping lower : ', orginal_dataset[orginal_dataset['max_spent_in_single_shopping']<3.796999999999999]['max_spent_in_single_shopping'].count())
print('% of Outlier in max_spent_in_single_shopping upper: ', round(orginal_dataset[orginal_dataset['max_spent_in_single_shopping']>7.125000000000002]['max_spent_in_single_shopping'].count()*100/len(orginal_dataset)), '%')
print('% of Outlier in max_spent_in_single_shopping lower: ', round(orginal_dataset[orginal_dataset['max_spent_in_single_shopping']<3.796999999999999]['max_spent_in_single_shopping'].count()*100/len(orginal_dataset)), '%')
Number of outliers in max_spent_in_single_shopping upper : 0
Number of outliers in max_spent_in_single_shopping lower : 0
% of Outlier in max_spent_in_single_shopping upper: 0.0 %
% of Outlier in max_spent_in_single_shopping lower: 0.0 %
In [58]:
plt.title('max_spent_in_single_shopping')
sns.boxplot(orginal_dataset['max_spent_in_single_shopping'], orient='h', color='purple')

Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f645b26d910>

In [59]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

#boxplot
sns.boxplot(x='max_spent_in_single_shopping',data=orginal_dataset,orient='v',ax=ax1)
ax1.set_ylabel('max_spent_in_single_shopping', fontsize=15)
ax1.set_title('Distribution of max_spent_in_single_shopping', fontsize=15)
ax1.tick_params(labelsize=15)

#distplot
sns.distplot(orginal_dataset['max_spent_in_single_shopping'],ax=ax2)
ax2.set_xlabel('max_spent_in_single_shopping', fontsize=15)
ax2.tick_params(labelsize=15)

#histogram
ax3.hist(orginal_dataset['max_spent_in_single_shopping'])
ax3.set_xlabel('max_spent_in_single_shopping', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
In [60]:
# Let us only plot the distributions of the independent attributes
orginal_dataset.hist(figsize=(12,16),layout=(4,2));
In [61]:
# Let's check the skewness values quantitatively
orginal_dataset.skew().sort_values(ascending=False)

Out[61]:
max_spent_in_single_shopping 0.561897
current_balance 0.525482
min_payment_amt 0.401667
spending 0.399889
advance_payments 0.386573
credit_limit 0.134378
probability_of_full_payment -0.537954
dtype: float64
In [62]:
# distplot combines the matplotlib hist function with seaborn's kdeplot()
# KDE (Kernel Density Estimate) visualizes the probability density of a continuous variable.

plt.figure(figsize=(10,50))
for i in range(len(orginal_dataset.columns)):
    plt.subplot(17, 1, i+1)
    sns.distplot(orginal_dataset[orginal_dataset.columns[i]], kde_kws={"color": "b", "lw": 3, "label": "KDE"}, hist_kws={"color": "g"})
    plt.title(orginal_dataset.columns[i])

plt.tight_layout()
Observations

- Credit limit average is around 3.258 (in 10000s)
- Distribution is skewed with a right tail for all variables except probability_of_full_payment, which has a left tail

Multivariate analysis
Check for multicollinearity
In [63]:
sns.pairplot(orginal_dataset,diag_kind='kde');

Observation
- Strong positive correlation between:
  - spending & advance_payments
  - advance_payments & current_balance
  - credit_limit & spending
  - spending & current_balance
  - credit_limit & advance_payments
  - max_spent_in_single_shopping & current_balance

In [64]:
#correlation matrix

orginal_dataset.corr().T

Out[64]:

                              spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
spending                      1.000000          0.994341                      0.608288         0.949985      0.970771        -0.229572                      0.863693
advance_payments              0.994341          1.000000                      0.529244         0.972422      0.944829        -0.217340                      0.890784
probability_of_full_payment   0.608288          0.529244                      1.000000         0.367915      0.761635        -0.331471                      0.226825
current_balance               0.949985          0.972422                      0.367915         1.000000      0.860415        -0.171562                      0.932806
credit_limit                  0.970771          0.944829                      0.761635         0.860415      1.000000        -0.258037                      0.749131
min_payment_amt              -0.229572         -0.217340                     -0.331471        -0.171562     -0.258037         1.000000                     -0.011079
max_spent_in_single_shopping  0.863693          0.890784                      0.226825         0.932806      0.749131        -0.011079                      1.000000

In [65]:
#creating a heatmap for better visualization
plt.figure(figsize=(10,8))
sns.heatmap(orginal_dataset.corr(),annot=True,fmt=".2f",cmap="viridis")
plt.show()

In [66]:
# Let us see the significant correlations, either negative or positive, among independent attributes.
c = orginal_dataset.corr().abs()  # absolute values, since correlation may be positive as well as negative
s = c.unstack()
so = s.sort_values(ascending=False)  # sorting according to the correlation
so = so[(so<1) & (so>0.3)].drop_duplicates().to_frame()  # due to symmetry, dropping duplicate entries
so.columns = ['correlation']
so

Out[66]:

                                                           correlation
spending                     advance_payments                 0.994341
advance_payments             current_balance                  0.972422
credit_limit                 spending                         0.970771
spending                     current_balance                  0.949985
credit_limit                 advance_payments                 0.944829
max_spent_in_single_shopping current_balance                  0.932806
advance_payments             max_spent_in_single_shopping     0.890784
spending                     max_spent_in_single_shopping     0.863693
current_balance              credit_limit                     0.860415
probability_of_full_payment  credit_limit                     0.761635
max_spent_in_single_shopping credit_limit                     0.749131
spending                     probability_of_full_payment      0.608288
advance_payments             probability_of_full_payment      0.529244
current_balance              probability_of_full_payment      0.367915
probability_of_full_payment  min_payment_amt                  0.331471

Strategy to remove outliers: We choose to replace attribute outlier values with their respective medians instead of dropping the records, as dropping would lose the other column information; the outliers are present in only two variables and within 5 records.
In [67]:
clean_dataset=orginal_dataset.copy()

In [68]:
def check_outliers(data):
    vData_num = data.loc[:, data.columns != 'class']
    Q1 = vData_num.quantile(0.25)
    Q3 = vData_num.quantile(0.75)
    IQR = Q3 - Q1
    count = 0
    # checking for outliers; True represents an outlier
    vData_num_mod = ((vData_num < (Q1 - 1.5 * IQR)) | (vData_num > (Q3 + 1.5 * IQR)))
    # iterating over columns to check the no. of outliers in each of the numerical attributes
    for col in vData_num_mod:
        if 1 in vData_num_mod[col].value_counts().index:
            print("No. of outliers in %s: %d" % (col, vData_num_mod[col].value_counts().iloc[1]))
            count += 1
    print("\n\nNo of attributes with outliers are :", count)

check_outliers(orginal_dataset)
No. of outliers in probability_of_full_payment: 3
No. of outliers in min_payment_amt: 2

No of attributes with outliers are : 2


Let us remove the outliers:

for column in clean_dataset.columns.tolist():
    Q1 = clean_dataset[column].quantile(.25)  # 1st quartile
    Q3 = clean_dataset[column].quantile(.75)  # 3rd quartile
    IQR = Q3 - Q1                             # inter-quartile range
    # Replace elements of columns that fall below Q1-1.5*IQR and above Q3+1.5*IQR
    clean_dataset[column].replace(clean_dataset.loc[(clean_dataset[column] > Q3+1.5*IQR) | (clean_dataset[column] < Q1-1.5*IQR), column], clean_dataset[column].median())
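Note that Series.replace here returns a new Series that is never assigned back, so clean_dataset is not actually modified, which is consistent with the unchanged outlier counts in the next cell. A minimal sketch of a variant that does persist the medians (our addition, not the original notebook's code):

for column in clean_dataset.columns:
    Q1 = clean_dataset[column].quantile(.25)
    Q3 = clean_dataset[column].quantile(.75)
    IQR = Q3 - Q1
    outside = (clean_dataset[column] < Q1 - 1.5*IQR) | (clean_dataset[column] > Q3 + 1.5*IQR)
    # mask() writes the column median over the flagged cells; re-assigning persists the change
    clean_dataset[column] = clean_dataset[column].mask(outside, clean_dataset[column].median())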
In [69]:
check_outliers(clean_dataset)
No. of outliers in probability_of_full_payment: 3
No. of outliers in min_payment_amt: 2

No of attributes with outliers are : 2


In [70]:
# Let us check presence of outliers
plt.figure(figsize=(18,14))
box = sns.boxplot(data=clean_dataset)
box.set_xticklabels(labels=box.get_xticklabels(),rotation=90);

Observation
Most of the outliers have been treated, and now we are good to go.

In [71]:
plt.title('probability_of_full_payment')
sns.boxplot(clean_dataset['probability_of_full_payment'], orient='h', color='purple')

Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f64585ef310>

Observation

Though we treated the outliers, the boxplot still shows one; this is acceptable, as it is not extreme and lies on the lower band.

--------------------------------------------------------------------------------

1.2 Do you think scaling is necessary for clustering in this case? Justify

Scaling needs to be done, as the variables are measured on different scales.

spending and advance_payments have different value ranges, and without scaling they may get more weightage.

Plots of the data prior to and after scaling are shown below.

Scaling brings all the values into relatively the same range.

I have used the z-score to standardise the data to roughly the same scale of -3 to +3.
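As a quick illustration (a sketch using a tiny made-up array, not the actual dataset), scipy's zscore transforms each column x to (x - mean(x)) / std(x):

import numpy as np
from scipy.stats import zscore

x = np.array([10.59, 14.355, 21.18])   # three sample spending values (min, median, max)
manual = (x - x.mean()) / x.std()       # std with ddof=0, matching zscore's default
print(np.allclose(manual, zscore(x)))   # True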

In [72]:
# prior to scaling
plt.plot(clean_dataset)
plt.show()
In [73]:
# Scaling the attributes.

from scipy.stats import zscore


clean_dataset_Scaled=orginal_dataset.apply(zscore)
clean_dataset_Scaled.head()

Out[73]:

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
0  1.754355          1.811968                      0.178230         2.367533      1.338579        -0.298806                      2.328998
1  0.393582          0.253840                      1.501773        -0.600744      0.858236        -0.242805                     -0.538582
2  1.413300          1.428192                      0.504874         1.401485      1.317348        -0.221471                      1.509107
3 -1.384034         -1.227533                     -2.591878        -0.793049     -1.639017         0.987884                     -0.454961
4  1.082581          0.998364                      1.196340         0.591544      1.155464        -1.088154                      0.874813

In [74]:
#after scaling
plt.plot(clean_dataset_Scaled)
plt.show()
--------------------------------------------------------------------------------

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
Creating the Dendrogram

Importing dendrogram and linkage module

In [75]:
from scipy.cluster.hierarchy import dendrogram, linkage

Choosing average linkage method

In [76]:
link_method = linkage(clean_dataset_Scaled, method = 'average')

In [77]:
dend = dendrogram(link_method)
Cutting the Dendrogram with suitable clusters

In [78]:
dend = dendrogram(link_method,
truncate_mode='lastp',
p = 10)

In [79]:
dend = dendrogram(link_method,
truncate_mode='lastp',
p = 25)

Importing fcluster module to create clusters

In [80]:
from scipy.cluster.hierarchy import fcluster

In [81]:
# Set criterion as maxclust, then create 3 clusters, and store the result in another object 'clusters_3'

clusters_3 = fcluster(link_method, 3, criterion='maxclust')


clusters_3

Out[81]:
array([1, 3, 1, 2, 1, 3, 2, 2, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 2, 2, 2,
1, 2, 3, 1, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1, 1, 3, 1, 1,
2, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2, 1, 3, 1,
1, 3, 1, 2, 3, 2, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 3, 1, 2, 1, 1, 1,
1, 3, 3, 1, 3, 2, 2, 1, 1, 1, 2, 1, 3, 1, 3, 1, 3, 1, 1, 2, 3, 1,
1, 3, 1, 2, 2, 1, 3, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 3, 2, 3,
3, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 2, 2, 1, 2, 3, 2, 3, 2, 3, 1,
3, 3, 2, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 3, 3, 2, 3, 2, 3, 1, 1, 1,
3, 2, 3, 2, 3, 2, 3, 3, 1, 1, 3, 1, 3, 2, 3, 3, 2, 1, 3, 1, 1, 2,
1, 2, 3, 3, 3, 2, 1, 3, 1, 3, 3, 1], dtype=int32)
In [82]:
cluster3_dataset=orginal_dataset.copy()

In [83]:
cluster3_dataset['clusters-3'] = clusters_3

In [84]:
cluster3_dataset.head()

Out[84]:

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  clusters-3
0     19.94             16.92                       0.8752            6.675         3.763            3.252                         6.550           1
1     15.99             14.89                       0.9064            5.363         3.582            3.336                         5.144           3
2     18.95             16.42                       0.8829            6.248         3.755            3.368                         6.148           1
3     10.83             12.96                       0.8099            5.278         2.641            5.182                         5.185           2
4     17.99             15.86                       0.8992            5.890         3.694            2.068                         5.837           1

Cluster Frequency

In [85]:
cluster3_dataset['clusters-3'].value_counts().sort_index()

Out[85]:
1 75
2 70
3 65
Name: clusters-3, dtype: int64

Cluster Profiles
In [86]:
aggdata=cluster3_dataset.groupby('clusters-3').mean()
aggdata['Freq']=cluster3_dataset['clusters-3'].value_counts().sort_index()
aggdata

Out[86]:

             spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  Freq
clusters-3
1           18.129200         16.058000                      0.881595         6.135747      3.648120         3.650200                      5.987040    75
2           11.916857         13.291000                      0.846766         5.258300      2.846000         4.619000                      5.115071    70
3           14.217077         14.195846                      0.884869         5.442000      3.253508         2.768418                      5.055569    65

Another method - ward

In [87]:
wardlink = linkage(clean_dataset_Scaled, method = 'ward')

In [88]:
dend_wardlink = dendrogram(wardlink)

In [89]:
dend_wardlink = dendrogram(wardlink,
truncate_mode='lastp',
p = 10,
)

In [90]:
clusters_wdlk_3 = fcluster(wardlink, 3, criterion='maxclust')
clusters_wdlk_3

Out[90]:
array([1, 3, 1, 2, 1, 2, 2, 3, 1, 2, 1, 3, 2, 1, 3, 2, 3, 2, 3, 2, 2, 2,
1, 2, 3, 1, 3, 2, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 1, 1, 3, 1, 1,
2, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2, 3, 3, 1,
1, 3, 1, 2, 3, 2, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 3, 1, 2, 3, 3, 1,
1, 2, 3, 1, 3, 2, 2, 1, 1, 1, 2, 1, 2, 1, 3, 1, 3, 1, 1, 2, 2, 1,
3, 3, 1, 2, 2, 1, 3, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 3, 2, 3,
3, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 3, 2, 1, 2, 3, 2, 3, 2, 3, 3,
3, 3, 3, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 3, 3, 3, 3, 2, 3, 1, 1, 1,
3, 3, 1, 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 2, 3, 3, 2, 1, 3, 1, 1, 2,
1, 2, 3, 1, 3, 2, 1, 3, 1, 3, 1, 3], dtype=int32)
In [91]:
cluster_w_3_dataset=orginal_dataset.copy()

In [92]:
cluster_w_3_dataset['clusters-3'] = clusters_wdlk_3

In [93]:
cluster_w_3_dataset.head()

Out[93]:

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  clusters-3
0     19.94             16.92                       0.8752            6.675         3.763            3.252                         6.550           1
1     15.99             14.89                       0.9064            5.363         3.582            3.336                         5.144           3
2     18.95             16.42                       0.8829            6.248         3.755            3.368                         6.148           1
3     10.83             12.96                       0.8099            5.278         2.641            5.182                         5.185           2
4     17.99             15.86                       0.8992            5.890         3.694            2.068                         5.837           1

In [94]:
cluster_w_3_dataset['clusters-3'].value_counts().sort_index()

Out[94]:
1 70
2 67
3 73
Name: clusters-3, dtype: int64
In [95]:
aggdata_w=cluster_w_3_dataset.groupby('clusters-3').mean()
aggdata_w['Freq']=cluster_w_3_dataset['clusters-3'].value_counts().sort_index()
aggdata_w

Out[95]:

             spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  Freq
clusters-3
1           18.371429         16.145429                      0.884400         6.158171      3.684629         3.639157                      6.017371    70
2           11.872388         13.257015                      0.848072         5.238940      2.848537         4.949433                      5.122209    67
3           14.199041         14.233562                      0.879190         5.478233      3.226452         2.612181                      5.086178    73

Observation
Both methods give almost similar means, with minor variation, which is expected.

For the cluster grouping based on the dendrogram, 3 or 4 clusters look good. After further analysis of the dataset, I went with the 3-group cluster solution for hierarchical clustering.

In a real-world setting, more variables would likely have been captured - tenure, BALANCE_FREQUENCY, balance, purchase, installment of purchase, and others.

The three-group cluster solution gives a pattern based on high/medium/low spending, together with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
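One quantitative way (not part of the original analysis) to compare the two linkage methods is the cophenetic correlation, which measures how faithfully each dendrogram preserves the original pairwise distances; a hedged sketch:

from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# higher correlation = the dendrogram distances better reflect the original distances
c_avg, _ = cophenet(link_method, pdist(clean_dataset_Scaled))
c_ward, _ = cophenet(wardlink, pdist(clean_dataset_Scaled))
print('average linkage cophenetic corr:', round(c_avg, 3))
print('ward linkage cophenetic corr:', round(c_ward, 3))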

--------------------------------------------------------------------------------

1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score.
In [96]:
from sklearn.cluster import KMeans

In [97]:
k_means = KMeans(n_clusters = 1)
k_means.fit(clean_dataset_Scaled)
k_means.inertia_

Out[97]:
1469.9999999999998
In [98]:
k_means = KMeans(n_clusters = 2)
k_means.fit(clean_dataset_Scaled)
k_means.inertia_

Out[98]:
659.171754487041
In [99]:
k_means = KMeans(n_clusters = 3)
k_means.fit(clean_dataset_Scaled)
k_means.inertia_

Out[99]:
430.6589731513006
In [100]:
k_means = KMeans(n_clusters = 4)
k_means.fit(clean_dataset_Scaled)
k_means.inertia_

Out[100]:
371.74655984791394
In [101]:

wss =[]

In [102]:
for i in range(1,11):
    KM = KMeans(n_clusters=i)
    KM.fit(clean_dataset_Scaled)
    wss.append(KM.inertia_)

In [103]:
wss

Out[103]:
[1469.9999999999998,
659.171754487041,
430.6589731513006,
371.74655984791394,
327.5720355975522,
289.7454594701129,
261.99257202366164,
239.88573666700952,
221.00108292809628,
209.55056673071783]
In [104]:
plt.plot(range(1,11), wss)
plt.xlabel("Clusters")
plt.ylabel("Inertia in the cluster")
plt.show()
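Reading the elbow off the plot is subjective; a rough numeric heuristic (our addition, not in the original notebook) is to pick the k where the decrease in inertia slows down the most, i.e. the largest second difference:

import numpy as np

drops = np.diff(wss)                    # change in inertia from k to k+1 (negative values)
elbow = np.argmax(np.diff(drops)) + 2   # index of the largest slowdown, mapped back to k
print('Suggested elbow at k =', elbow)  # with the wss values above this flags k = 2

The silhouette analysis below gives a second opinion on the choice between small values of k.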

In [105]:
k_means_4 = KMeans(n_clusters = 4)
k_means_4.fit(clean_dataset_Scaled)
labels_4 = k_means_4.labels_

In [106]:
kmeans4_dataset=orginal_dataset.copy()

In [107]:
kmeans4_dataset["Clus_kmeans"] = labels_4
kmeans4_dataset.head(5)
Out[107]:

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  Clus_kmeans
0     19.94             16.92                       0.8752            6.675         3.763            3.252                         6.550            3
1     15.99             14.89                       0.9064            5.363         3.582            3.336                         5.144            0
2     18.95             16.42                       0.8829            6.248         3.755            3.368                         6.148            3
3     10.83             12.96                       0.8099            5.278         2.641            5.182                         5.185            2
4     17.99             15.86                       0.8992            5.890         3.694            2.068                         5.837            3

In [108]:
from sklearn.metrics import silhouette_samples, silhouette_score

In [109]:
silhouette_score(clean_dataset_Scaled,labels_4)

Out[109]:
0.3291966792017613
In [110]:
from sklearn import metrics

In [111]:
scores = []
k_range = range(2, 11)

for k in k_range:
    km = KMeans(n_clusters=k, random_state=2)
    km.fit(clean_dataset_Scaled)
    scores.append(metrics.silhouette_score(clean_dataset_Scaled, km.labels_))

scores

Out[111]:
[0.46577247686580914,
0.4007270552751299,
0.3291966792017613,
0.28316654897654814,
0.2897583830272518,
0.2694844355168535,
0.25437316027505635,
0.2623959398663564,
0.2673980772529917]
In [112]:
#plotting the sc scores
plt.plot(k_range,scores)
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()

Insights

From the silhouette scores, the optimal number of clusters could be 3 or 4
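For reference, the k with the highest silhouette score can also be read off programmatically (a small addition):

best_k = k_range[scores.index(max(scores))]
print('Best k by silhouette:', best_k)  # 2 with the scores above; 3 is the next best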

In [113]:
sil_width = silhouette_samples(clean_dataset_Scaled,labels_4)

In [114]:
kmeans4_dataset["sil_width"] = sil_width
kmeans4_dataset.head(5)

Out[114]:

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  Clus_kmeans  sil_width
0     19.94             16.92                       0.8752            6.675         3.763            3.252                         6.550            3   0.432658
1     15.99             14.89                       0.9064            5.363         3.582            3.336                         5.144            0   0.099543
2     18.95             16.42                       0.8829            6.248         3.755            3.368                         6.148            3   0.425893
3     10.83             12.96                       0.8099            5.278         2.641            5.182                         5.185            2   0.529852
4     17.99             15.86                       0.8992            5.890         3.694            2.068                         5.837            3   0.082791

In [115]:
silhouette_samples(clean_dataset_Scaled,labels_4).min()

Out[115]:
-0.05115805932867967
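A negative silhouette width means a point sits closer to a neighbouring cluster than to its own, so it is worth counting how many such points exist (our addition):

print('Points with negative silhouette width:', (sil_width < 0).sum())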
3-Cluster Solution
In [116]:
km_3 = KMeans(n_clusters=3,random_state=123)

In [117]:
#fitting the Kmeans
km_3.fit(clean_dataset_Scaled)
km_3.labels_

Out[117]:
array([0, 2, 0, 1, 0, 1, 1, 2, 0, 1, 0, 2, 1, 0, 2, 1, 2, 1, 1, 1, 1, 1,
0, 1, 2, 0, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0,
1, 1, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2, 1, 1, 2, 2, 0,
0, 2, 0, 1, 2, 1, 0, 0, 1, 0, 2, 1, 0, 2, 2, 2, 2, 0, 1, 2, 0, 2,
0, 1, 2, 0, 2, 1, 1, 0, 0, 0, 1, 0, 2, 0, 2, 0, 2, 0, 0, 1, 1, 0,
2, 2, 0, 1, 1, 0, 2, 2, 1, 0, 2, 1, 1, 1, 2, 2, 0, 1, 2, 2, 1, 2,
2, 0, 1, 0, 0, 1, 0, 2, 2, 2, 1, 1, 2, 1, 0, 1, 2, 1, 2, 1, 2, 2,
1, 2, 2, 1, 2, 0, 0, 1, 0, 0, 0, 1, 2, 2, 2, 1, 2, 1, 2, 0, 0, 0,
2, 1, 2, 1, 2, 2, 2, 2, 0, 0, 1, 2, 2, 1, 1, 2, 1, 0, 2, 0, 0, 1,
0, 1, 2, 0, 2, 1, 0, 2, 0, 2, 2, 2], dtype=int32)
In [118]:
#proportion of labels classified

pd.Series(km_3.labels_).value_counts()

Out[118]:
1 72
2 71
0 67
dtype: int64
K-Means Clustering & Cluster Information
In [119]:
kmeans1_dataset=orginal_dataset.copy()

In [120]:
# Fitting K-Means to the dataset
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(clean_dataset_Scaled)

#beginning of the cluster numbering with 1 instead of 0

y_kmeans1=y_kmeans
y_kmeans1=y_kmeans+1

# New Dataframe called cluster

cluster = pd.DataFrame(y_kmeans1)

# Adding cluster to the Dataset1

kmeans1_dataset['cluster'] = cluster
#Mean of clusters

kmeans_mean_cluster = pd.DataFrame(round(kmeans1_dataset.groupby('cluster').mean(),1))
kmeans_mean_cluster

Out[120]:

         spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
cluster
1            18.5              16.2                           0.9              6.2           3.7              3.6                           6.0
2            11.9              13.2                           0.8              5.2           2.8              4.7                           5.1
3            14.4              14.3                           0.9              5.5           3.3              2.7                           5.1

In [121]:
def ClusterPercentage(datafr, name):
    """Common utility function to calculate the percentage and size of each cluster."""
    size = pd.Series(datafr[name].value_counts().sort_index())
    percent = pd.Series(round(datafr[name].value_counts()/datafr.shape[0] * 100, 2)).sort_index()
    size_df = pd.concat([size, percent], axis=1)
    size_df.columns = ["Cluster_Size", "Cluster_Percentage"]
    return size_df

In [122]:
ClusterPercentage(kmeans1_dataset,"cluster")
Out[122]:

   Cluster_Size  Cluster_Percentage
1            67               31.90
2            72               34.29
3            71               33.81

In [123]:
#transposing the cluster
cluster_3_T = kmeans_mean_cluster.T

In [124]:
cluster_3_T

Out[124]:

cluster                          1     2     3
spending                      18.5  11.9  14.4
advance_payments              16.2  13.2  14.3
probability_of_full_payment    0.9   0.8   0.9
current_balance                6.2   5.2   5.5
credit_limit                   3.7   2.8   3.3
min_payment_amt                3.6   4.7   2.7
max_spent_in_single_shopping   6.0   5.1   5.1


Note

I am going with 3 clusters via K-Means, but I am also showing the analysis for the 4- and 5-cluster K-Means solutions. Based on the current dataset, the 3-cluster solution makes sense given the spending pattern (High, Medium, Low).
4-Cluster Solution

In [125]:
km_4 = KMeans(n_clusters=4,random_state=123)

In [126]:
#fitting the Kmeans
km_4.fit(clean_dataset_Scaled)
km_4.labels_

Out[126]:
array([2, 1, 2, 0, 2, 0, 0, 1, 2, 0, 2, 3, 0, 2, 1, 0, 1, 0, 1, 0, 0, 0,
2, 0, 1, 3, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 2, 2, 1, 3, 2,
0, 0, 1, 2, 2, 2, 0, 2, 2, 2, 2, 3, 0, 0, 0, 2, 1, 0, 0, 3, 1, 2,
2, 1, 2, 1, 1, 0, 2, 2, 0, 2, 1, 0, 3, 1, 1, 1, 1, 2, 0, 3, 3, 3,
3, 0, 1, 2, 1, 0, 1, 2, 2, 3, 0, 2, 1, 2, 3, 2, 1, 2, 2, 0, 1, 2,
3, 1, 2, 0, 0, 3, 1, 3, 0, 2, 1, 0, 0, 0, 1, 1, 2, 0, 1, 1, 0, 1,
1, 2, 0, 2, 2, 0, 3, 1, 3, 1, 0, 0, 1, 0, 2, 0, 1, 0, 1, 0, 1, 3,
1, 1, 1, 0, 1, 2, 2, 0, 2, 3, 2, 0, 3, 1, 1, 0, 1, 0, 1, 2, 2, 2,
1, 1, 3, 0, 1, 1, 1, 1, 3, 3, 1, 3, 1, 0, 1, 1, 0, 2, 1, 3, 2, 0,
2, 0, 1, 3, 1, 0, 3, 1, 3, 1, 3, 3], dtype=int32)
In [127]:
#proportion of labels classified

pd.Series(km_4.labels_).value_counts()

Out[127]:
1 65
0 64
2 51
3 30
dtype: int64

K-Means Clustering & Cluster Information

In [128]:
kmeans14_dataset=orginal_dataset.copy()

In [129]:
# Fitting K-Means to the dataset

kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)


y_kmeans = kmeans.fit_predict(clean_dataset_Scaled)

#beginning of the cluster numbering with 1 instead of 0


y_kmeans1=y_kmeans
y_kmeans1=y_kmeans+1

# New Dataframe called cluster

cluster = pd.DataFrame(y_kmeans1)

# Adding cluster to the Dataset1

kmeans14_dataset['cluster'] = cluster
#Mean of clusters

kmeans_mean_cluster = pd.DataFrame(round(kmeans14_dataset.groupby('cluster').mean(),1))
kmeans_mean_cluster

Out[129]:

         spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
cluster
1            16.4              15.3                           0.9              5.9           3.4              3.9                           5.7
2            14.0              14.1                           0.9              5.4           3.2              2.6                           5.0
3            19.2              16.5                           0.9              6.3           3.8              3.5                           6.1
4            11.8              13.2                           0.8              5.2           2.8              4.9                           5.1

In [130]:
ClusterPercentage(kmeans14_dataset,"cluster")

Out[130]:

   Cluster_Size  Cluster_Percentage
1            30               14.29
2            67               31.90
3            48               22.86
4            65               30.95

In [131]:
#transposing the cluster
cluster_4_T = kmeans_mean_cluster.T

In [132]:
cluster_4_T

Out[132]:

cluster                          1     2     3     4
spending                      16.4  14.0  19.2  11.8
advance_payments              15.3  14.1  16.5  13.2
probability_of_full_payment    0.9   0.9   0.9   0.8
current_balance                5.9   5.4   6.3   5.2
credit_limit                   3.4   3.2   3.8   2.8
min_payment_amt                3.9   2.6   3.5   4.9
max_spent_in_single_shopping   5.7   5.0   6.1   5.1

5-Cluster Solution

In [133]:
kmeans15_dataset=orginal_dataset.copy()

In [134]:
# Fitting K-Means to the dataset

kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)


y_kmeans = kmeans.fit_predict(clean_dataset_Scaled)

#beginning of the cluster numbering with 1 instead of 0

y_kmeans1=y_kmeans
y_kmeans1=y_kmeans+1

# New Dataframe called cluster

cluster = pd.DataFrame(y_kmeans1)

# Adding cluster to the Dataset1

kmeans15_dataset['cluster'] = cluster
#Mean of clusters

kmeans_mean_cluster = pd.DataFrame(round(kmeans15_dataset.groupby('cluster').mean(),1))
kmeans_mean_cluster

Out[134]:

         spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
cluster
1            19.2              16.5                           0.9              6.3           3.8              3.5                           6.1
2            11.7              13.2                           0.8              5.3           2.8              4.5                           5.2
3            14.3              14.3                           0.9              5.5           3.3              2.4                           5.1
4            16.4              15.3                           0.9              5.9           3.4              3.9                           5.7
5            12.3              13.3                           0.9              5.2           3.0              5.0                           5.0

In [135]:
ClusterPercentage(kmeans15_dataset,"cluster")
Out[135]:

   Cluster_Size  Cluster_Percentage
1            48               22.86
2            41               19.52
3            56               26.67
4            29               13.81
5            36               17.14

In [136]:
#transposing the cluster
cluster_5_T = kmeans_mean_cluster.T

In [137]:
cluster_5_T

Out[137]:

cluster 1 2 3 4 5

spending 19.2 11.7 14.3 16.4 12.3

advance_payments 16.5 13.2 14.3 15.3 13.3

probability_of_full_payment 0.9 0.8 0.9 0.9 0.9

current_balance 6.3 5.3 5.5 5.9 5.2

credit_limit 3.8 2.8 3.3 3.4 3.0


cluster 1 2 3 4 5

min_payment_amt 3.5 4.5 2.4 3.9 5.0

max_spent_in_single_shoppin
6.1 5.2 5.1 5.7 5.0
g

--------------------------------------------------------------------------------

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.
3 group cluster via Kmeans
In [138]:
cluster_3_T

Out[138]:

cluster                          1     2     3
spending                      18.5  11.9  14.4
advance_payments              16.2  13.2  14.3
probability_of_full_payment    0.9   0.8   0.9
current_balance                6.2   5.2   5.5
credit_limit                   3.7   2.8   3.3
min_payment_amt                3.6   4.7   2.7
max_spent_in_single_shopping   6.0   5.1   5.1

3 group cluster via hierarchical clustering


In [139]:
aggdata_w.T

Out[139]:

clusters-3                            1          2          3
spending                      18.371429  11.872388  14.199041
advance_payments              16.145429  13.257015  14.233562
probability_of_full_payment    0.884400   0.848072   0.879190
current_balance                6.158171   5.238940   5.478233
credit_limit                   3.684629   2.848537   3.226452
min_payment_amt                3.639157   4.949433   2.612181
max_spent_in_single_shopping   6.017371   5.122209   5.086178
Freq                          70.000000  67.000000  73.000000

Cluster Group Profiles

- Group 1 : High Spending
- Group 3 : Medium Spending
- Group 2 : Low Spending
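To make these profiles explicit in the data, the numeric cluster ids can be mapped to the labels above (a sketch assuming the 3-cluster K-Means frame kmeans1_dataset built earlier):

label_map = {1: 'High Spending', 2: 'Low Spending', 3: 'Medium Spending'}
kmeans1_dataset['segment'] = kmeans1_dataset['cluster'].map(label_map)
kmeans1_dataset['segment'].value_counts()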


Promotional strategies for each cluster

Group 1 : High Spending Group

- Giving reward points might increase their purchases.
- max_spent_in_single_shopping is high for this group, so they can be offered discounts/offers on subsequent transactions upon full payment.
- Increase their credit limit.
- Encourage higher spending habits.
- Give loans against the credit card, as they are customers with a good repayment record.
- Tie up with luxury brands, which will drive more maximum one-time spending.

Group 3 : Medium Spending Group

- They are potential target customers who are paying bills, making purchases, and maintaining a comparatively good credit score. So we can increase the credit limit or lower the interest rate.
- Promote premium cards/loyalty cards to increase transactions.
- Increase spending habits by tying up with premium e-commerce sites, travel portals, and airlines/hotels, as this will encourage them to spend more.

Group 2 : Low Spending Group

- These customers should be given reminders for payments. Offers can be provided on early payments to improve their payment rate.
- Increase their spending habits by tying up with grocery stores and utilities (electricity, phone, gas, and others).

--------------------------------------------------------------------------------
