Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
27 views12 pages

SPPUML3

Machine learning lab assignment 3

Uploaded by

kanaseaditya800
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views12 pages

SPPUML3

Machine learning lab assignment 3

Uploaded by

kanaseaditya800
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

In [39]: 1 #Name:- Kanase Aditya Madhukar

2 #Roll No.:- 2441059


3 #Batch:- D
4 #Assignment no 3

Given a bank customer, build a neural


network-based classifier that can determine
whether
they will leave or not in the next 6 months. Dataset Description: The case study is from an
open-source dataset from Kaggle. The dataset contains 10,000 sample points with 14
distinct features such as CustomerId, CreditScore, Geography, Gender, Age, Tenure,
Balance, etc. Link to the Kaggle project: https://www.kaggle.com/barelydedicated/bank-
customer-churn-modeling (https://www.kaggle.com/barelydedicated/bank-customer-churn-
modeling) Perform following steps:

1. Read the dataset.


2. Distinguish the feature and target set and divide the data set into training and test sets.
3. Normalize the train and test data.
4. Initialize and build the model. Identify the points of improvement and implement the
same.
5. Print the accuracy score and confusion matrix (5 points).

In [ ]: 1 import pandas as pd
2 import numpy as np
3 import matplotlib.pyplot as plt
4 import seaborn as sns
5 from sklearn.preprocessing import StandardScaler
6 import io

Read the Dataset


In [2]: 1 from google.colab import files
2 uploaded=files.upload()

Choose Files No file chosen


Upload widget is only available when the cell has been executed in the current browser session.
Please rerun this cell to enable.

Saving bank.csv to bank.csv


In [40]: 1 df=pd.read_csv(io.StringIO(uploaded['bank.csv'].decode('utf-8')))
2 df.head()

Out[40]: RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Bala

0 1 15634602 Hargrave 619 France Female 42 2

1 2 15647311 Hill 608 Spain Female 41 1 8380

2 3 15619304 Onio 502 France Female 42 8 15966

3 4 15701354 Boni 699 France Female 39 1

4 5 15737888 Mitchell 850 Spain Female 43 2 12551

2. Drop the Columns which are unique for all users

In [41]: 1 df=df.drop(['RowNumber','CustomerId','Surname'],axis=1)
2 df.head()

Out[41]: CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard Is

0 619 France Female 42 2 0.00 1 1

1 608 Spain Female 41 1 83807.86 1 0

2 502 France Female 42 8 159660.80 3 1

3 699 France Female 39 1 0.00 2 0

4 850 Spain Female 43 2 125510.82 1 1

In [42]: 1 df.isna().any()
2 df.isna().sum()

Out[42]: CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
Exited 0
dtype: int64
BiVariate Analysis
In [43]: 1 print(df.shape)
2 df.info()

(10000, 11)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CreditScore 10000 non-null int64
1 Geography 10000 non-null object
2 Gender 10000 non-null object
3 Age 10000 non-null int64
4 Tenure 10000 non-null int64
5 Balance 10000 non-null float64
6 NumOfProducts 10000 non-null int64
7 HasCrCard 10000 non-null int64
8 IsActiveMember 10000 non-null int64
9 EstimatedSalary 10000 non-null float64
10 Exited 10000 non-null int64
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB

In [44]: 1 df.describe()

Out[44]: CreditScore Age Tenure Balance NumOfProducts HasCrCar

count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.0000

mean 650.528800 38.921800 5.012800 76485.889288 1.530200 0.7055

std 96.653299 10.487806 2.892174 62397.405202 0.581654 0.4558

min 350.000000 18.000000 0.000000 0.000000 1.000000 0.0000

25% 584.000000 32.000000 3.000000 0.000000 1.000000 0.0000

50% 652.000000 37.000000 5.000000 97198.540000 1.000000 1.0000

75% 718.000000 44.000000 7.000000 127644.240000 2.000000 1.0000

max 850.000000 92.000000 10.000000 250898.090000 4.000000 1.0000

Before performing Bivariate analysis, Lets bring all the features to the same
range
In [45]: 1 ## Scale the data
2 scaler=StandardScaler()
3 ## Extract only the Numerical Columns to perform Bivariate Analysis
4 subset=df.drop(['Geography','Gender','HasCrCard','IsActiveMember'],axis
5 scaled=scaler.fit_transform(subset)
6 scaled_df=pd.DataFrame(scaled,columns=subset.columns)
7 sns.pairplot(scaled_df,diag_kind='kde')
8 ​

Out[45]: <seaborn.axisgrid.PairGrid at 0x7fe8126f0940>


In [46]: 1 sns.heatmap(scaled_df.corr(),annot=True,cmap='rainbow')

Out[46]: <matplotlib.axes._subplots.AxesSubplot at 0x7fe7cb4ef9b0>


From the above plots, We can see that there is no significant
Linear relationship between the features

In [47]: 1 ## Categorical Features vs Target Variable


2 sns.countplot(x='Geography',data=df,hue='Exited')
3 plt.show()
4 sns.countplot(x='Gender',data=df,hue='Exited')
5 plt.show()
6 sns.countplot(x='HasCrCard',data=df,hue='Exited')
7 plt.show()
8 sns.countplot(x='IsActiveMember',data=df,hue='Exited')
9 plt.show()
Analysing the Numerical Features relationship with the Target
variable. Here 'Exited' is the Target Feature.

In [50]: 1 subset=subset.drop('Exited',axis=1)
2 for i in subset.columns:
3 sns.boxplot(df['Exited'],df[i],hue=df['Gender'])
4 plt.show()

/usr/local/lib/python3.6/dist-packages/seaborn/_decorators.py:43: Futur
eWarning: Pass the following variables as keyword args: x, y. From vers
ion 0.12, the only valid positional argument will be `data`, and passin
g other arguments without an explicit keyword will result in an error o
r misinterpretation.
FutureWarning

Insights from Bivariate Plots


1. The Avg Credit Score seem to be almost the same for Active and Churned customers
2. Young People seem to stick to the bank compared to older people
3. The Average Bank Balance is high for Churned Customers
4. The churning rate is high with German Customers
5. The Churning rate is high among the Non-Active Members

4. Distinguish the Target and Feature Set and divide the dataset
into Training and Test sets

In [51]: 1 X=df.drop('Exited',axis=1)
2 y=df.pop('Exited')
3 ​
In [52]: 1 from sklearn.model_selection import train_test_split
2 X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.10,rando
3 X_train,X_val,y_train,y_val=train_test_split(X_train,y_train,test_size=
4 print("X_train size is {}".format(X_train.shape[0]))
5 print("X_val size is {}".format(X_val.shape[0]))
6 print("X_test size is {}".format(X_test.shape[0]))
7 ​
8 ​

X_train size is 8100


X_val size is 900
X_test size is 1000

In [53]: 1 ## Standardising the train, Val and Test data


2 from sklearn.preprocessing import StandardScaler
3 scaler=StandardScaler()
4 num_cols=['CreditScore','Age','Tenure','Balance','NumOfProducts','Estim
5 num_subset=scaler.fit_transform(X_train[num_cols])
6 X_train_num_df=pd.DataFrame(num_subset,columns=num_cols)
7 X_train_num_df['Geography']=list(X_train['Geography'])
8 X_train_num_df['Gender']=list(X_train['Gender'])
9 X_train_num_df['HasCrCard']=list(X_train['HasCrCard'])
10 X_train_num_df['IsActiveMember']=list(X_train['IsActiveMember'])
11 X_train_num_df.head()
12 ## Standardise the Validation data
13 num_subset=scaler.fit_transform(X_val[num_cols])
14 X_val_num_df=pd.DataFrame(num_subset,columns=num_cols)
15 X_val_num_df['Geography']=list(X_val['Geography'])
16 X_val_num_df['Gender']=list(X_val['Gender'])
17 X_val_num_df['HasCrCard']=list(X_val['HasCrCard'])
18 X_val_num_df['IsActiveMember']=list(X_val['IsActiveMember'])
19 ## Standardise the Test data
20 num_subset=scaler.fit_transform(X_test[num_cols])
21 X_test_num_df=pd.DataFrame(num_subset,columns=num_cols)
22 X_test_num_df['Geography']=list(X_test['Geography'])
23 X_test_num_df['Gender']=list(X_test['Gender'])
24 X_test_num_df['HasCrCard']=list(X_test['HasCrCard'])
25 X_test_num_df['IsActiveMember']=list(X_test['IsActiveMember'])
26 ​
27 ​

In [54]: 1 ## Convert the categorical features to numerical


2 X_train_num_df=pd.get_dummies(X_train_num_df,columns=['Geography','Gend
3 X_test_num_df=pd.get_dummies(X_test_num_df,columns=['Geography','Gender
4 X_val_num_df=pd.get_dummies(X_val_num_df,columns=['Geography','Gender']
5 X_train_num_df.head()

Out[54]: CreditScore Age Tenure Balance NumOfProducts EstimatedSalary HasCrCard

0 -1.178587 -1.041960 -1.732257 0.198686 0.820905 1.560315 1

1 -0.380169 -1.326982 1.730718 -0.022020 -0.907991 -0.713592 1

2 -0.349062 1.808258 -0.693364 0.681178 0.820905 -1.126515 1

3 0.625629 2.378302 -0.347067 -1.229191 0.820905 -1.682740 1

4 -0.203895 -1.136967 1.730718 0.924256 -0.907991 1.332535 1


Initialise and build the Model

In [55]: 1 from tensorflow.keras import Sequential


2 from tensorflow.keras.layers import Dense
3 ​
4 model=Sequential()
5 model.add(Dense(7,activation='relu'))
6 model.add(Dense(10,activation='relu'))
7 model.add(Dense(1,activation='sigmoid'))

In [56]: 1 import tensorflow as tf


2 optimizer=tf.keras.optimizers.Adam(0.01)
3 model.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['

In [57]: 1 model.fit(X_train_num_df,y_train,epochs=100,batch_size=10,verbose=1)

Epoch 1/100
810/810 [==============================] - 1s 1ms/step - loss: 0.4511 -
accuracy: 0.8054
Epoch 2/100
810/810 [==============================] - 1s 1ms/step - loss: 0.3623 -
accuracy: 0.8493
Epoch 3/100
810/810 [==============================] - 1s 1ms/step - loss: 0.3543 -
accuracy: 0.8541
Epoch 4/100
810/810 [==============================] - 1s 1ms/step - loss: 0.3433 -
accuracy: 0.8561
Epoch 5/100
810/810 [==============================] - 1s 1ms/step - loss: 0.3291 -
accuracy: 0.8692
Epoch 6/100
810/810 [==============================] - 1s 1ms/step - loss: 0.3488 -
accuracy: 0.8560
Epoch 7/100
810/810 [ ] 1s 1ms/step loss: 0 3439

Predict the Results using 0.5 threshold


In [58]: 1 y_pred_val=model.predict(X_val_num_df)
2 y_pred_val[y_pred_val>0.5]=1
3 y_pred_val[y_pred_val <0.5]=0
In [59]: 1 y_pred_val=y_pred_val.tolist()
2 X_compare_val=X_val.copy()
3 X_compare_val['y_actual']=y_val
4 X_compare_val['y_pred']=y_pred_val
5 X_compare_val.head(10)

Out[59]: CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard

340 642 Germany Female 40 6 129502.49 2 0

8622 706 Germany Male 36 9 58571.18 2 1

8401 535 Spain Male 58 1 0.00 2 1

4338 714 Spain Male 25 2 0.00 1 1

8915 606 France Male 36 1 155655.46 1 1

2624 605 Spain Female 29 3 116805.82 1 0

2234 720 France Female 38 10 0.00 2 1

349 582 France Male 39 5 0.00 2 1

3719 850 France Female 62 1 124678.35 1 1

2171 526 Germany Male 58 9 190298.89 2 1

Confusion Matrix of the Validation set


In [60]: 1 from sklearn.metrics import confusion_matrix
2 cm_val=confusion_matrix(y_val,y_pred_val)
3 cm_val

Out[60]: array([[694, 22],


[ 96, 88]])

From the above confusion matrix, Out of 900


Validation dataset observations, our model accurately
predicted 694+88=782 and made 96+22=118 incorrect
predictions.
In [61]: 1 Accuracy=782/900
2 print("Accuracy of the Model on the Validation Data set is 86.89%")

Accuracy of the Model on the Validation Data set is 86.89%


In [62]: 1 loss1,accuracy1=model.evaluate(X_train_num_df,y_train,verbose=False)
2 loss2,accuracy2=model.evaluate(X_val_num_df,y_val,verbose=False)
3 print("Train Loss {}".format(loss1))
4 print("Train Accuracy {}".format(accuracy1))
5 print("Val Loss {}".format(loss2))
6 print("Val Accuracy {}".format(accuracy2))

Train Loss 0.33421364426612854


Train Accuracy 0.8649382591247559
Val Loss 0.348032146692276
Val Accuracy 0.8688889145851135

Since our Training Accuracy and Validation Accuracy are pretty


close, we can conclude that our model generalises well. So, lets
apply the model on the Test set and make predictions and
evaluate the model against the Test.

In [63]: 1 from sklearn import metrics


2 y_pred_test=model.predict(X_test_num_df)
3 y_pred_test[y_pred_test>0.5]=1
4 y_pred_test[y_pred_test <0.5]=0
5 cm_test=metrics.confusion_matrix(y_test,y_pred_test)
6 cm_test
7 print("Test Confusion Matrix")
8 ​

Test Confusion Matrix

In [64]: 1 cm_test

Out[64]: array([[756, 38],


[121, 85]])

In [65]: 1 loss3,accuracy3=model.evaluate(X_test_num_df,y_test,verbose=False)
2 print("Test Accuracy is {}".format(accuracy3))
3 print("Test loss is {}".format(loss3))

Test Accuracy is 0.8410000205039978


Test loss is 0.38615888357162476

In [ ]: 1 ​

You might also like