Data Mining Project - Ankit Bhagat

This document summarizes a data mining project in two case studies. The first performs k-means clustering on bank customer transaction data: it loads and cleans the data, scales the features, runs k-means for k from 1 to 10, and evaluates candidate solutions with silhouette scores; three clusters score 0.4007 against 0.3276 for four, so the three-cluster solution separates the customers more cleanly. The second builds decision-tree classifiers on travel insurance data to predict claim status, including a grid search over tree hyperparameters.

In [ ]:
#Case study 1

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

In [2]:
od=pd.read_csv(r'C:\Users\SBHAGAT\Downloads\bank_marketing_part1_Data-1.csv')

In [3]:
od.head()

Out[3]:    spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  …
        0     19.94             16.92                       0.8752            6.675         3.763  …
        1     15.99             14.89                       0.9064            5.363         3.582  …
        2     18.95             16.42                       0.8829            6.248         3.755  …
        3     10.83             12.96                       0.8099            5.278         2.641  …
        4     17.99             15.86                       0.8992            5.890         3.694  …

In [4]:
od.shape

Out[4]: (210, 7)

In [5]:
od.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   spending                      210 non-null    float64
 1   advance_payments              210 non-null    float64
 2   probability_of_full_payment   210 non-null    float64
 3   current_balance               210 non-null    float64
 4   credit_limit                  210 non-null    float64
 5   min_payment_amt               210 non-null    float64
 6   max_spent_in_single_shopping  210 non-null    float64
dtypes: float64(7)
memory usage: 11.6 KB

In [6]:
od.isnull().sum()

Out[6]: spending                        0
        advance_payments                0
        probability_of_full_payment     0
        current_balance                 0
        credit_limit                    0
        min_payment_amt                 0
        max_spent_in_single_shopping    0
        dtype: int64

In [7]:
od.describe().T

Out[7]:                               count       mean       std      min       25%       50%       75%  max
        spending                      210.0  14.847524  2.909699  10.5900  12.27000  14.35500  17.305000    …
        advance_payments              210.0  14.559286  1.305959  12.4100  13.45000  14.32000  15.715000    …
        probability_of_full_payment   210.0   0.870999  0.023629   0.8081   0.85690   0.87345   0.887775    …
        current_balance               210.0   5.628533  0.443063   4.8990   5.26225   5.52350   5.979750    …
        credit_limit                  210.0   3.258605  0.377714   2.6300   2.94400   3.23700   3.561750    …
        min_payment_amt               210.0   3.700201  1.503557   0.7651   2.56150   3.59900   4.768750    …
        max_spent_in_single_shopping  210.0   5.408071  0.491480   4.5190   5.04500   5.22300   5.877000    …

In [8]:
dups = od.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))

Number of duplicate rows = 0

In [11]:
from sklearn.preprocessing import StandardScaler

In [12]:
X = StandardScaler()

In [13]:
scaled_clust = pd.DataFrame(X.fit_transform(od), columns=od.columns)

In [14]:
scaled_clust.head()

Out[14]:   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  …
        0  1.754355          1.811968                     0.178230         2.367533      1.338579  …
        1  0.393582          0.253840                     1.501773        -0.600744      0.858236  …
        2  1.413300          1.428192                     0.504874         1.401485      1.317348  …
        3 -1.384034         -1.227533                    -2.591878        -0.793049     -1.639017  …
        4  1.082581          0.998364                     1.196340         0.591544      1.155464  …
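Scaling matters here because k-means relies on Euclidean distances, which would otherwise be dominated by the features with the largest ranges. A quick sanity check on the scaled frame (a minimal sketch, not part of the original run):

In [ ]:
print(scaled_clust.mean().round(6))  # every feature should be ~0
print(scaled_clust.std().round(3))   # every feature should be ~1 (pandas uses ddof=1, so not exactly 1.0)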

In [15]:
#creating clusters using K means
k_means = KMeans(n_clusters = 2,random_state=1)

In [16]:
k_means.fit(scaled_clust)

Out[16]: KMeans(n_clusters=2, random_state=1)


In [17]:
k_means.labels_

Out[17]: array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
        1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1,
        0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
        1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
        1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1,
        1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1,
        0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0,
        1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1])
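A quick look at how the 210 customers split between the two clusters (np.bincount counts how often each label occurs):

In [ ]:
print(np.bincount(k_means.labels_))  # size of cluster 0 and cluster 1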

In [18]:
wss=[]

In [19]:
for i in range(1,11):
    KM = KMeans(n_clusters=i,random_state=1)
    KM.fit(scaled_clust)
    wss.append(KM.inertia_)

C:\Users\SBHAGAT\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:881: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1.
  warnings.warn(

In [20]:
wss

Out[20]: [1469.9999999999995,
         659.1717544870411,
         430.65897315130064,
         371.301721277542,
         327.9608240079031,
         290.5900305968219,
         264.83153087478144,
         240.6837259501598,
         220.85285825594738,
         206.3829103601579]
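The within-cluster sum of squares falls steeply up to k = 3 and flattens afterwards, which is easier to see as an elbow plot. A minimal sketch using the wss list computed above:

In [ ]:
plt.plot(range(1, 11), wss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster sum of squares (inertia)')
plt.title('Elbow curve')
plt.show()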

In [21]:
k_means = KMeans(n_clusters = 3,random_state=1)
k_means.fit(scaled_clust)
labels = k_means.labels_

In [22]:
silhouette_score(scaled_clust,labels,random_state=1)

Out[22]: 0.40072705527512986

In [23]:
k_means = KMeans(n_clusters = 4,random_state=1)
k_means.fit(scaled_clust)
labels = k_means.labels_

In [24]:
silhouette_score(scaled_clust,labels,random_state=1)

Out[24]: 0.32757426605518075
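The 3-cluster solution scores noticeably higher (0.4007 vs 0.3276), so it separates the customers more cleanly. Beyond the average score, per-observation silhouette widths can flag individual points that sit poorly in their cluster; a minimal sketch using the 3-cluster labels:

In [ ]:
labels3 = KMeans(n_clusters=3, random_state=1).fit_predict(scaled_clust)
sil_width = silhouette_samples(scaled_clust, labels3)
print('Minimum silhouette width:', sil_width.min())  # negative widths flag possibly misassigned points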


In [ ]:
from scipy.cluster.hierarchy import dendrogram, linkage

In [ ]:
# average-linkage hierarchy on the scaled frame built above (scaled_clust)
link_method = linkage(scaled_clust, method = 'average')

In [ ]:
dend = dendrogram(link_method)
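The dendrogram shows the merge order; to get flat cluster assignments comparable to the k-means labels, the tree can be cut with scipy's fcluster. A minimal sketch assuming three clusters:

In [ ]:
from scipy.cluster.hierarchy import fcluster
h_labels = fcluster(link_method, 3, criterion='maxclust')  # cut the tree into 3 flat clusters
print(np.bincount(h_labels)[1:])  # cluster sizes (fcluster labels start at 1)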

In [25]:
#case study 2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

In [26]:
df_insured = pd.read_csv(r'C:\Users\SBHAGAT\Downloads\insurance_part2_data-2.csv')

In [27]:
df_insured.head()

Out[27]:    Age Agency_Code           Type Claimed  Commision Channel  Duration  Sales       Product Name Destination
         0   48         C2B       Airlines      No       0.70  Online         7   2.51    Customised Plan        ASIA
         1   36         EPX  Travel Agency      No       0.00  Online        34  20.00    Customised Plan        ASIA
         2   39         CWT  Travel Agency      No       5.94  Online         3   9.90    Customised Plan    Americas
         3   36         EPX  Travel Agency      No       0.00  Online         4  26.00  Cancellation Plan        ASIA
         4   33         JZI       Airlines      No       6.30  Online        53  18.00        Bronze Plan        ASIA

In [28]:
df_insured.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Age           3000 non-null   int64
 1   Agency_Code   3000 non-null   object
 2   Type          3000 non-null   object
 3   Claimed       3000 non-null   object
 4   Commision     3000 non-null   float64
 5   Channel       3000 non-null   object
 6   Duration      3000 non-null   int64
 7   Sales         3000 non-null   float64
 8   Product Name  3000 non-null   object
 9   Destination   3000 non-null   object
dtypes: float64(2), int64(2), object(6)
memory usage: 234.5+ KB

In [29]:
df_insured.isnull().sum()

Out[29]: Age             0
         Agency_Code     0
         Type            0
         Claimed         0
         Commision       0
         Channel         0
         Duration        0
         Sales           0
         Product Name    0
         Destination     0
         dtype: int64

In [30]:
df_insured.describe(include="all").T

Out[30]:               count   unique  top              freq  mean       std         min   25%   50%   75%     max
         Age           3000.0  NaN     NaN              NaN   38.091     10.463518   8.0   32.0  36.0  42.0      …
         Agency_Code   3000    4       EPX              1365  NaN        NaN         NaN   NaN   NaN   NaN     NaN
         Type          3000    2       Travel Agency    1837  NaN        NaN         NaN   NaN   NaN   NaN     NaN
         Claimed       3000    2       No               2076  NaN        NaN         NaN   NaN   NaN   NaN     NaN
         Commision     3000.0  NaN     NaN              NaN   14.529203  25.481455   0.0   0.0   4.63  17.235    …
         Channel       3000    2       Online           2954  NaN        NaN         NaN   NaN   NaN   NaN     NaN
         Duration      3000.0  NaN     NaN              NaN   70.001333  134.053313  -1.0  11.0  26.5  63.0      …
         Sales         3000.0  NaN     NaN              NaN   60.249913  70.733954   0.0   20.0  33.0  69.0      …
         Product Name  3000    5       Customised Plan  1136  NaN        NaN         NaN   NaN   NaN   NaN     NaN
         Destination   3000    3       ASIA             2465  NaN        NaN         NaN   NaN   NaN   NaN     NaN

In [31]:
dups = df_insured.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))

Number of duplicate rows = 139

In [32]:
df_insured.drop_duplicates(inplace=True)

In [33]:
dups = df_insured.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
print(df_insured.shape)

Number of duplicate rows = 0
(2861, 10)

In [34]:
plt.figure(figsize=(10,5))
df_insured[['Age','Commision', 'Duration', 'Sales']].boxplot()

Out[34]: <AxesSubplot:>

In [35]:
sns.pairplot(df_insured[['Age','Commision', 'Duration', 'Sales']])

Out[35]: <seaborn.axisgrid.PairGrid at 0x1759e7d89d0>


In [36]:
#correlation heatmap of the numeric features
plt.figure(figsize=(10,8))
sns.set(font_scale=1.2)
sns.heatmap(df_insured[['Age','Commision', 'Duration', 'Sales']].corr(), annot=True)

Out[36]: <AxesSubplot:>


In [37]:
#converting all object columns to categorical codes
for feature in df_insured.columns:
    if df_insured[feature].dtype == 'object':
        print('\n')
        print('feature:',feature)
        print(pd.Categorical(df_insured[feature].unique()))
        print(pd.Categorical(df_insured[feature].unique()).codes)
        df_insured[feature] = pd.Categorical(df_insured[feature]).codes

feature: Agency_Code
['C2B', 'EPX', 'CWT', 'JZI']
Categories (4, object): ['C2B', 'CWT', 'EPX', 'JZI']
[0 2 1 3]

feature: Type
['Airlines', 'Travel Agency']
Categories (2, object): ['Airlines', 'Travel Agency']
[0 1]

feature: Claimed
['No', 'Yes']
Categories (2, object): ['No', 'Yes']
[0 1]

feature: Channel
['Online', 'Offline']
Categories (2, object): ['Offline', 'Online']
[1 0]

feature: Product Name
['Customised Plan', 'Cancellation Plan', 'Bronze Plan', 'Silver Plan', 'Gold Plan']
Categories (5, object): ['Bronze Plan', 'Cancellation Plan', 'Customised Plan', 'Gold Plan', 'Silver Plan']
[2 1 0 4 3]

feature: Destination
['ASIA', 'Americas', 'EUROPE']
Categories (3, object): ['ASIA', 'Americas', 'EUROPE']
[0 1 2]
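The integer codes follow the sorted category order shown above, so the label mapping is easy to rebuild for interpretation later. A small sketch for Agency_Code (the other columns work the same way):

In [ ]:
agency_map = dict(enumerate(['C2B', 'CWT', 'EPX', 'JZI']))  # codes 0-3 follow the sorted categories above
print(agency_map)  # {0: 'C2B', 1: 'CWT', 2: 'EPX', 3: 'JZI'}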

In [38]:
df_insured.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2861 entries, 0 to 2999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Age           2861 non-null   int64
 1   Agency_Code   2861 non-null   int8
 2   Type          2861 non-null   int8
 3   Claimed       2861 non-null   int8
 4   Commision     2861 non-null   float64
 5   Channel       2861 non-null   int8
 6   Duration      2861 non-null   int64
 7   Sales         2861 non-null   float64
 8   Product Name  2861 non-null   int8
 9   Destination   2861 non-null   int8
dtypes: float64(2), int64(2), int8(6)
memory usage: 193.1 KB

In [39]:
df_insured.head()

Out[39]:    Age  Agency_Code  Type  Claimed  Commision  Channel  Duration  Sales  Product Name  Destination
         0   48            0     0        0       0.70        1         7   2.51             2            0
         1   36            2     1        0       0.00        1        34  20.00             2            0
         2   39            1     1        0       5.94        1         3   9.90             2            1
         3   36            2     1        0       0.00        1         4  26.00             1            0
         4   33            3     0        0       6.30        1        53  18.00             0            0

In [40]:
df_insured.Claimed.value_counts(normalize=True)

Out[40]: 0    0.680531
         1    0.319469
         Name: Claimed, dtype: float64

In [41]:
#2.2 Data Split: split the data into train and test
X = df_insured.drop("Claimed", axis=1)
y = df_insured.pop("Claimed")  # pop removes Claimed from df_insured and returns it as the target series
X.head()

Out[41]:    Age  Agency_Code  Type  Commision  Channel  Duration  Sales  Product Name  Destination
         0   48            0     0       0.70        1         7   2.51             2            0
         1   36            2     1       0.00        1        34  20.00             2            0
         2   39            1     1       5.94        1         3   9.90             2            1
         3   36            2     1       0.00        1         4  26.00             1            0
         4   33            3     0       6.30        1        53  18.00             0            0

In [42]:
# the random_state value was cut off in the export; 1 matches the seed used elsewhere in this notebook
X_train, X_test, train_labels, test_labels = train_test_split(X, y, test_size=.30, random_state=1)

In [43]:
print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('train_labels',train_labels.shape)
print('test_labels',test_labels.shape)

X_train (2002, 9)
X_test (859, 9)
train_labels (2002,)
test_labels (859,)
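Since Claimed is imbalanced (roughly 68% No vs 32% Yes), a stratified split keeps that ratio in both partitions. A sketch of this optional variant (not what the notebook ran):

In [ ]:
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=.30, random_state=1, stratify=y)  # stratify=y preserves the 68/32 class ratio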

In [44]:
dt_model = DecisionTreeClassifier(criterion = 'gini' )

In [45]:
dt_model.fit(X_train, train_labels)

Out[45]: DecisionTreeClassifier()
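A fully grown tree usually overfits the training data, which is what the grid search below is meant to rein in. A quick look at the unpruned tree's size (a minimal sketch using attributes of the fitted model):

In [ ]:
print('depth:', dt_model.get_depth(), '| leaves:', dt_model.get_n_leaves())  # unpruned trees tend to grow deep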

In [46]:
print(pd.DataFrame(dt_model.feature_importances_, columns = ["Imp"], index = X_train.columns))

                   Imp
Age           0.194224
Agency_Code   0.168120
Type          0.000000
Commision     0.077408
Channel       0.003908
Duration      0.274566
Sales         0.221966
Product Name  0.032130
Destination   0.027677
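Duration, Sales, and Age carry the most weight in the unpruned tree. Sorting makes the ranking explicit; a minimal sketch reusing the fitted dt_model:

In [ ]:
imp = pd.Series(dt_model.feature_importances_, index=X_train.columns)
print(imp.sort_values(ascending=False))  # most to least important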

In [49]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [4,5,6],
    'min_samples_leaf': [20,40,60,70],
    'min_samples_split': [150,200,250,300]
}

dt_model = DecisionTreeClassifier()

grid_search = GridSearchCV(estimator=dt_model, param_grid=param_grid, cv=10)

In [50]:
#fitting the grid search to the training dataset
grid_search.fit(X_train,train_labels)

In [51]:
best_grid = grid_search.best_estimator_
best_grid
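With the tuned tree in hand, the metrics imported at the top of case study 2 can score it on the held-out split. A minimal sketch assuming best_grid from the grid search above:

In [ ]:
test_pred = best_grid.predict(X_test)
print(confusion_matrix(test_labels, test_pred))
print(classification_report(test_labels, test_pred))
print('AUC:', roc_auc_score(test_labels, best_grid.predict_proba(X_test)[:, 1]))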

In [52]:
#case 1, section 1.3: hierarchical clustering with Ward linkage (requires scaled_clust from case study 1)
from scipy.cluster.hierarchy import dendrogram, linkage

wardlink = linkage(scaled_clust, method='ward')

dend = dendrogram(wardlink)