MACHINE LEARNING_PROJECT-PROBLEM 1
Importing required Libraries
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style
%matplotlib inline
import seaborn as sns; sns.set() # for plot styling
from scipy import stats
import scipy.cluster.hierarchy as sch
from scipy.cluster.hierarchy import dendrogram,linkage,fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import metrics,model_selection
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from scipy.stats import zscore
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
import warnings
warnings.filterwarnings('ignore')
In [2]:
Elect_df= pd.read_excel("Election_Data.xlsx",sheet_name="Election_Dataset_Two Classes",index_col=0)
Elect_df.head()
Out[2]:
vote age economic.cond.national economic.cond.household Blair Hague Europe politic
1 Labour 43 3 3 4 1 2
2 Labour 36 4 4 4 4 5
3 Labour 35 4 4 5 2 3
4 Labour 24 4 2 2 1 4
5 Labour 41 2 2 1 1 6
In [3]:
# Shape function displays the number of rows and columns in a dataframe.
print('The dataset has {} rows and {} columns'.format(Elect_df.shape[0],Elect_df.shape[1]))
The dataset has 1525 rows and 9 columns
In [4]:
# Checking Data info
Elect_df.info();
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1525 entries, 1 to 1525
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 vote 1525 non-null object
1 age 1525 non-null int64
2 economic.cond.national 1525 non-null int64
3 economic.cond.household 1525 non-null int64
4 Blair 1525 non-null int64
5 Hague 1525 non-null int64
6 Europe 1525 non-null int64
7 political.knowledge 1525 non-null int64
8 gender 1525 non-null object
dtypes: int64(7), object(2)
memory usage: 119.1+ KB
In [5]:
# Handling missing data
# Test whether there is any null value in our dataset or not. We can do this using isnull()
Elect_df.isnull().sum()
print("There are", Elect_df.isnull().values.sum(),"Missing Values in dataset")
There are 0 Missing Values in dataset
In [6]:
cat=[]
num=[]
for i in Elect_df.columns:
    if Elect_df[i].dtype=="object":
        cat.append(i)
    else:
        num.append(i)
print(cat)
print(num)
['vote', 'gender']
['age', 'economic.cond.national', 'economic.cond.household', 'Blair', 'Hague', 'Europe', 'political.knowledge']
In [7]:
for variable in cat:
    print(variable,":", sum(Elect_df[variable] == '?'))
vote : 0
gender : 0
In [8]:
Elect_df[num].describe().T
Out[8]:
count mean std min 25% 50% 75% max
age 1525.0 54.182295 15.711209 24.0 41.0 53.0 67.0 93.0
economic.cond.national 1525.0 3.245902 0.880969 1.0 3.0 3.0 4.0 5.0
economic.cond.household 1525.0 3.140328 0.929951 1.0 3.0 3.0 4.0 5.0
Blair 1525.0 3.334426 1.174824 1.0 2.0 4.0 4.0 5.0
Hague 1525.0 2.746885 1.230703 1.0 2.0 2.0 4.0 5.0
Europe 1525.0 6.728525 3.297538 1.0 4.0 6.0 10.0 11.0
political.knowledge 1525.0 1.542295 1.083315 0.0 0.0 2.0 2.0 3.0
In [9]:
Elect_df[cat].describe().T
Out[9]:
count unique top freq
vote 1525 2 Labour 1063
gender 1525 2 female 812
In [10]:
# Checking for Duplicates
dups=Elect_df.duplicated()
print("Total no of duplicate values = %d" % (dups.sum()))
Elect_df[dups]
Total no of duplicate values = 8
Out[10]:
vote age economic.cond.national economic.cond.household Blair Hague Europ
68 Labour 35 4 4 5 2
627 Labour 39 3 4 4 2
871 Labour 38 2 4 2 2
984 Conservative 74 4 3 2 4
1155 Conservative 53 3 4 2 2
1237 Labour 36 3 3 2 2
1245 Labour 29 4 4 4 2
1439 Labour 40 4 3 4 2
Removing Duplicate Data
In [11]:
Elect_df.drop_duplicates(inplace=True)
In [12]:
Elect_df.shape
Out[12]:
(1517, 9)
unique values for categorical variables
In [13]:
### unique values for categorical variables
for column in Elect_df.columns:
    if Elect_df[column].dtype == 'object':
        print(column.upper(),': ',Elect_df[column].nunique())
        print(Elect_df[column].value_counts().sort_values())
        print('\n')
VOTE : 2
Conservative 460
Labour 1057
Name: vote, dtype: int64
GENDER : 2
male 709
female 808
Name: gender, dtype: int64
In [14]:
# Checking the Skewness in data
Elect_df.skew(axis=0,skipna=True)
Out[14]:
age 0.139800
economic.cond.national -0.238474
economic.cond.household -0.144148
Blair -0.539514
Hague 0.146191
Europe -0.141891
political.knowledge -0.422928
dtype: float64
Univariate Analysis
In [15]:
a=1
plt.figure(figsize=(15,112))
for i in Elect_df.columns:
    if Elect_df[i].dtype != 'object':
        plt.subplot(21,3,a)
        sns.distplot(Elect_df[i])
        plt.title("Distribution plot for:" + i)
        plt.subplot(21,3,a+1)
        sns.histplot(Elect_df[i])
        plt.title("Histogram for:" + i)
        plt.subplot(21,3,a+2)
        sns.boxplot(Elect_df[i])
        plt.title("Boxplot for:" + i)
        a+=3
Bivariate and Multivariate Analysis
In [16]:
fig, (ax1,ax2,ax3,ax4)=plt.subplots(1,4,figsize=(16,5))
fig, (ax5,ax6,ax7)=plt.subplots(1,3,figsize=(12,5))
sns.stripplot(Elect_df["vote"], Elect_df['age'],orient='v',jitter=True,ax=ax1)
ax1.set_xlabel('vote', fontsize=15)
ax1.set_title('Distribution of vote', fontsize=15)
ax1.tick_params(labelsize=15)
sns.stripplot(Elect_df["vote"], Elect_df['economic.cond.national'], jitter=True, ax=ax2)
ax2.set_xlabel('Vote', fontsize=15)
ax2.set_title('Distribution of Vote', fontsize=15)
ax2.tick_params(labelsize=15)
sns.stripplot(Elect_df["vote"], Elect_df['economic.cond.household'], jitter=True, ax=ax3)
ax3.set_xlabel('vote', fontsize=15)
ax3.set_title('Distribution of vote', fontsize=15)
ax3.tick_params(labelsize=15)
sns.stripplot(Elect_df["vote"], Elect_df['Blair'], jitter=True, ax=ax4)
ax4.set_xlabel('vote', fontsize=15)
ax4.set_title('Distribution of vote', fontsize=15)
ax4.tick_params(labelsize=15)
sns.stripplot(Elect_df["vote"], Elect_df['Hague'], jitter=True, ax=ax5)
ax5.set_xlabel('vote', fontsize=15)
ax5.set_title('Distribution of vote', fontsize=15)
ax5.tick_params(labelsize=15)
sns.stripplot(Elect_df["vote"], Elect_df['Europe'], jitter=True, ax=ax6)
ax6.set_xlabel('vote', fontsize=15)
ax6.set_title('Distribution of vote', fontsize=15)
ax6.tick_params(labelsize=15)
sns.stripplot(Elect_df["vote"], Elect_df['political.knowledge'], jitter=True, ax=ax7)
ax7.set_xlabel('vote', fontsize=15)
ax7.set_title('Distribution of vote', fontsize=15)
ax7.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
In [17]:
fig, (ax1,ax2,ax3,ax4)=plt.subplots(1,4,figsize=(16,5))
fig, (ax5,ax6,ax7)=plt.subplots(1,3,figsize=(12,5))
sns.stripplot(Elect_df["gender"], Elect_df['age'],orient='v',jitter=True,ax=ax1)
ax1.set_xlabel('gender', fontsize=15)
ax1.set_title('Distribution of gender', fontsize=15)
ax1.tick_params(labelsize=15)
sns.stripplot(Elect_df["gender"], Elect_df['economic.cond.national'], jitter=True, ax=ax2)
ax2.set_xlabel('gender', fontsize=15)
ax2.set_title('Distribution of gender', fontsize=15)
ax2.tick_params(labelsize=15)
sns.stripplot(Elect_df["gender"], Elect_df['economic.cond.household'], jitter=True, ax=ax3)
ax3.set_xlabel('gender', fontsize=15)
ax3.set_title('Distribution of gender', fontsize=15)
ax3.tick_params(labelsize=15)
sns.stripplot(Elect_df["gender"], Elect_df['Blair'], jitter=True, ax=ax4)
ax4.set_xlabel('gender', fontsize=15)
ax4.set_title('Distribution of gender', fontsize=15)
ax4.tick_params(labelsize=15)
sns.stripplot(Elect_df["gender"], Elect_df['Hague'], jitter=True, ax=ax5)
ax5.set_xlabel('gender', fontsize=15)
ax5.set_title('Distribution of gender', fontsize=15)
ax5.tick_params(labelsize=15)
sns.stripplot(Elect_df["gender"], Elect_df['Europe'], jitter=True, ax=ax6)
ax6.set_xlabel('gender', fontsize=15)
ax6.set_title('Distribution of gender', fontsize=15)
ax6.tick_params(labelsize=15)
sns.stripplot(Elect_df["gender"], Elect_df['political.knowledge'], jitter=True, ax=ax7)
ax7.set_xlabel('gender', fontsize=15)
ax7.set_title('Distribution of gender', fontsize=15)
ax7.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
Check for Data Distribution w.r.t Vote
In [18]:
### Data Distribution
plt.figure(figsize=(24,8))
sns.pairplot(Elect_df,hue='vote');
<Figure size 1728x576 with 0 Axes>
In [19]:
#correlation matrix
Elect_df.corr()
Out[19]:
age economic.cond.national economic.cond.household Bla
age 1.000000 0.018687 -0.038868 0.03208
economic.cond.national 0.018687 1.000000 0.347687 0.32614
economic.cond.household -0.038868 0.347687 1.000000 0.21582
Blair 0.032084 0.326141 0.215822 1.00000
Hague 0.031144 -0.200790 -0.100392 -0.24350
Europe 0.064562 -0.209150 -0.112897 -0.29594
political.knowledge -0.046598 -0.023510 -0.038528 -0.02129
In [20]:
# plot the correlation coefficients as a heatmap
plt.subplots(figsize=(15,10))
sns.heatmap(Elect_df.corr(), annot=True, fmt='.2f', cmap='Blues', vmax=1, vmin=-1);
Check for Outliers
In [21]:
#Check for presence of outliers
plt.figure(figsize=(15,10))
Elect_df[num].boxplot(patch_artist = True, color='red',notch=True)
plt.title('Rectangular box plot')
plt.show();
There are almost no outliers in the numerical columns; the only outliers are in the economic.cond.national and economic.cond.household variables. In Gaussian Naive Bayes, outliers distort the shape of the fitted Gaussian distribution by shifting the estimated mean and variance, so depending on the use case it makes sense to treat them.
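As a quick toy illustration (made-up numbers, not the election data), a single extreme value drags the mean, the statistic a Gaussian fit relies on, while the median stays put:
# Toy illustration: one extreme value shifts the mean but not the median
vals = pd.Series([3, 3, 3, 4, 4])
vals_out = pd.Series([3, 3, 3, 4, 40])
print(vals.mean(), vals.median())         # 3.4 3.0
print(vals_out.mean(), vals_out.median()) # 10.6 3.0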
In [22]:
print('Range of values: ', Elect_df['economic.cond.national'].max()-Elect_df['economic.cond.national'].min())
Range of values: 4
In [23]:
#Central values
print('Minimum value economic.cond.national: ', Elect_df['economic.cond.national'].min())
print('Maximum economic.cond.national: ',Elect_df['economic.cond.national'].max())
print('Mean value economic.cond.national: ', Elect_df['economic.cond.national'].mean())
print('Median value economic.cond.national: ',Elect_df['economic.cond.national'].median())
print('Standard deviation economic.cond.national: ', Elect_df['economic.cond.national'].std())
print('Null values economic.cond.national: ',Elect_df['economic.cond.national'].isnull().any())
Minimum value economic.cond.national: 1
Maximum economic.cond.national: 5
Mean value economic.cond.national: 3.245220830586684
Median value economic.cond.national: 3.0
Standard deviation economic.cond.national: 0.8817924638047195
Null values economic.cond.national: False
In [24]:
#Quartiles
Q1=Elect_df['economic.cond.national'].quantile(q=0.25)
Q3=Elect_df['economic.cond.national'].quantile(q=0.75)
print('economic.cond.national - 1st Quartile (Q1) is: ', Q1)
print('economic.cond.national - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of economic.cond.national is ', stats.iqr(Elect_df['economic.cond.national']))
economic.cond.national - 1st Quartile (Q1) is: 3.0
economic.cond.national - 3rd Quartile (Q3) is: 4.0
Interquartile range (IQR) of economic.cond.national is 1.0
In [25]:
#Outlier detection from Interquartile range (IQR) in original data
# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in economic.cond.national: ', L_outliers)
print('Upper outliers in economic.cond.national: ', U_outliers)
Lower outliers in economic.cond.national: 1.5
Upper outliers in economic.cond.national: 5.5
In [26]:
print('Number of outliers in economic.cond.national upper : ', Elect_df[Elect_df['economic.cond.national']>U_outliers]['economic.cond.national'].count())
print('Number of outliers in economic.cond.national lower : ', Elect_df[Elect_df['economic.cond.national']<L_outliers]['economic.cond.national'].count())
print('% of Outlier in economic.cond.national upper: ',round(Elect_df[Elect_df['economic.cond.national']>U_outliers]['economic.cond.national'].count()*100/len(Elect_df)),'%')
print('% of Outlier in economic.cond.national lower: ',round(Elect_df[Elect_df['economic.cond.national']<L_outliers]['economic.cond.national'].count()*100/len(Elect_df)),'%')
Number of outliers in economic.cond.national upper : 0
Number of outliers in economic.cond.national lower : 1517
% of Outlier in economic.cond.national upper: 0 %
% of Outlier in economic.cond.national lower: 100 %
Outlier Treatment
In [27]:
def remove_outlier(col):
    # IQR-based whiskers: anything beyond Q1-1.5*IQR or Q3+1.5*IQR is an outlier
    Q1,Q3=np.percentile(col,[25,75])
    IQR=Q3-Q1
    lower_range= Q1-(1.5 * IQR)
    upper_range= Q3+(1.5 * IQR)
    return lower_range, upper_range
In [28]:
lr,ur=remove_outlier(Elect_df["economic.cond.national"])
Elect_df["economic.cond.national"]=np.where(Elect_df["economic.cond.national"]>ur,ur,Elect_
Elect_df["economic.cond.national"]=np.where(Elect_df["economic.cond.national"]<lr,lr,Elect_
lr,ur=remove_outlier(Elect_df["economic.cond.household"])
Elect_df["economic.cond.household"]=np.where(Elect_df["economic.cond.household"]>ur,ur,Elec
Elect_df["economic.cond.household"]=np.where(Elect_df["economic.cond.household"]<lr,lr,Elec
In [29]:
#Check for presence of outliers
plt.figure(figsize=(15,10))
Elect_df[num].boxplot(patch_artist = True, color='red',notch=True)
plt.title('Rectangular box plot')
plt.show();
Get_dummies of the object variables
In [30]:
cat
Out[30]:
['vote', 'gender']
In [31]:
cat1 = ['vote', 'gender']
drop_first drops one of the dummy columns created from the levels of each categorical variable; if all levels are kept, the dummies are perfectly collinear and introduce multicollinearity. This ensures we do not fall into the dummy variable trap.
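A minimal toy sketch (hypothetical data, not this dataset) of the trap drop_first avoids: without it, the dummy columns always sum to 1, so each one is a perfect linear combination of the others.
# Toy example: dummies with and without drop_first
toy = pd.DataFrame({'gender': ['male', 'female', 'female']})
print(pd.get_dummies(toy))                   # gender_female + gender_male == 1 in every row
print(pd.get_dummies(toy, drop_first=True))  # keeps only gender_male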
In [32]:
df=pd.get_dummies(Elect_df, columns=cat1,drop_first=True)
df.head()
Out[32]:
age economic.cond.national economic.cond.household Blair Hague Europe political.knowl
1 43 3.0 3.0 4 1 2
2 36 4.0 4.0 4 4 5
3 35 4.0 4.0 5 2 3
4 24 4.0 2.0 2 1 4
5 41 2.0 2.0 1 1 6
In [33]:
# Copy all the predictor variables into X dataframe
X=df.drop('vote_Labour',axis=1)
# Copy target into the y dataframe.
y=df['vote_Labour']
In [34]:
# Var prior to scaling
X.var()
Out[34]:
age 246.544655
economic.cond.national 0.728713
economic.cond.household 0.785491
Blair 1.380089
Hague 1.519005
Europe 10.883687
political.knowledge 1.175961
gender_male 0.249099
dtype: float64
In [35]:
# Data prior to scaling
plt.plot(X)
plt.title('Data prior to scaling ', fontsize=15)
plt.show()
Is Scaling necessary here or not?
In [36]:
# Scaling the attributes.
X[['age','economic.cond.national','economic.cond.household','Blair','Hague','Europe','political.knowledge','gender_male']]=X.apply(zscore)
In [37]:
# Var post scaling
X.var()
Out[37]:
age 1.00066
economic.cond.national 1.00066
economic.cond.household 1.00066
Blair 1.00066
Hague 1.00066
Europe 1.00066
political.knowledge 1.00066
gender_male 1.00066
dtype: float64
In [38]:
# Data post scaling
plt.plot(X)
plt.title('Data post scaling ', fontsize=15)
plt.show()
In [39]:
X.head()
Out[39]:
age economic.cond.national economic.cond.household Blair Hague Europe
1 -0.716161 -0.301648 -0.179682 0.565802 -1.419969 -1.437338
2 -1.162118 0.870183 0.949003 0.565802 1.014951 -0.527684
3 -1.225827 0.870183 0.949003 1.417312 -0.608329 -1.134120
4 -1.926617 0.870183 -1.308366 -1.137217 -1.419969 -0.830902
5 -0.843577 -1.473479 -1.308366 -1.988727 -1.419969 -0.224465
In [40]:
y.head()
Out[40]:
1 1
2 1
3 1
4 1
5 1
Name: vote_Labour, dtype: uint8
Train-Test Split: split X and y into training and test sets in a 70:30 ratio with random_state=1
In [41]:
# Split X and y into training and test set in 70:30 ratio
X_train,X_test, y_train, y_test=train_test_split(X,y,test_size=0.30, random_state=1)
In [42]:
print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('y_train',y_train.shape)
print('y_test',y_test.shape)
X_train (1061, 8)
X_test (456, 8)
y_train (1061,)
y_test (456,)
In [43]:
Logistic_model = LogisticRegression(solver='newton-cg',max_iter=10000,penalty='none',verbose=True,n_jobs=2)
Logistic_model.fit(X_train, y_train)
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 1 out of 1 | elapsed: 1.1s finished
Out[43]:
LogisticRegression(max_iter=10000, n_jobs=2, penalty='none', solver='newton-cg', verbose=True)
The LogisticRegression classifier is now built and trained on the training data using the fit() method. With the classifier trained, the model is ready to make predictions; we use the predict() method with the test-set features as its argument.
In [44]:
## Performance Matrix on train data set
y_train_predict=Logistic_model.predict(X_train)
Logistic_model_score_train=Logistic_model.score(X_train,y_train) ## Accuracy
print("The Logistic Regression Model Score on train data set is %.3f " % Logistic_model_sco
print(metrics.confusion_matrix(y_train,y_train_predict)) ## Confusion Matrix
print(metrics.classification_report(y_train,y_train_predict)) ## Classification r
The Logistic Regression Model Score on train data set is 0.834
[[197 110]
[ 66 688]]
precision recall f1-score support
0 0.75 0.64 0.69 307
1 0.86 0.91 0.89 754
accuracy 0.83 1061
macro avg 0.81 0.78 0.79 1061
weighted avg 0.83 0.83 0.83 1061
In [45]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,Logistic_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()
In [46]:
## Performance Matrix on test data set
y_test_predict=Logistic_model.predict(X_test)
Logistic_model_score_test=Logistic_model.score(X_test,y_test) ## Accuracy
print("The Logistic Regression Model Score on test data set is %.3f " % Logistic_model_sco
print(metrics.confusion_matrix(y_test,y_test_predict)) ## Confusion Matrix
print(metrics.classification_report(y_test,y_test_predict)) ## Classification re
The Logistic Regression Model Score on test data set is 0.829
[[111 42]
[ 36 267]]
precision recall f1-score support
0 0.76 0.73 0.74 153
1 0.86 0.88 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.81 456
weighted avg 0.83 0.83 0.83 456
In [47]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,Logistic_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()
Training Data and Test Data Confusion Matrix Comparison
In [48]:
f,a = plt.subplots(1,2,sharex=True,sharey=True,squeeze=False)
#Plotting confusion matrix for the different models for the Training Data
plot_0 = sns.heatmap((metrics.confusion_matrix(y_train,y_train_predict)),annot=True,fmt='.5g',cmap='Blues',ax=a[0][0])
a[0][0].set_title('Training Data')
plot_1 = sns.heatmap((metrics.confusion_matrix(y_test,y_test_predict)),annot=True,fmt='.5g',cmap='Blues',ax=a[0][1])
a[0][1].set_title('Test Data');
Training Data and Test Data Classification Report Comparison
In [49]:
print('Classification Report of the training data:\n\n',metrics.classification_report(y_train,y_train_predict))
print('Classification Report of the test data:\n\n',metrics.classification_report(y_test,y_test_predict))
Classification Report of the training data:
precision recall f1-score support
0 0.75 0.64 0.69 307
1 0.86 0.91 0.89 754
accuracy 0.83 1061
macro avg 0.81 0.78 0.79 1061
weighted avg 0.83 0.83 0.83 1061
Classification Report of the test data:
precision recall f1-score support
0 0.76 0.73 0.74 153
1 0.86 0.88 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.81 456
weighted avg 0.83 0.83 0.83 456
Applying GridSearchCV for Logistic Regression
In [50]:
grid={'penalty':['l2','none','l1','elasticnet'],
'solver':['liblinear','lbfgs','newton-cg'],
'tol':[0.0001,0.00001],
'max_iter': [10000, 5000,15000]}
In [51]:
from sklearn.model_selection import RepeatedStratifiedKFold
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator = Logistic_model, param_grid = grid, cv = cv, n_jobs=2, scoring='f1')
grid_search.fit(X_train, y_train)
[LibLinear]
Out[51]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1),
             estimator=LogisticRegression(max_iter=10000, n_jobs=2, penalty='none',
                                          solver='newton-cg', verbose=True),
             n_jobs=2,
             param_grid={'max_iter': [10000, 5000, 15000],
                         'penalty': ['l2', 'none', 'l1', 'elasticnet'],
                         'solver': ['liblinear', 'lbfgs', 'newton-cg'],
                         'tol': [0.0001, 1e-05]},
             scoring='f1')
In [52]:
print(grid_search.best_params_,'\n')
print(grid_search.best_estimator_)
{'max_iter': 10000, 'penalty': 'l2', 'solver': 'liblinear', 'tol': 0.0001}
LogisticRegression(max_iter=10000, n_jobs=2, solver='liblinear', verbose=True)
In [53]:
best_model_lr = grid_search.best_estimator_
In [54]:
# Prediction on the training set
ytrain_predict_lr = best_model_lr.predict(X_train)
ytest_predict_lr = best_model_lr.predict(X_test)
In [55]:
## Getting the probabilities on the test set
ytest_predict_prob=best_model_lr.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()
Out[55]:
0 1
0 0.428858 0.571142
1 0.155518 0.844482
2 0.006996 0.993004
3 0.839503 0.160497
4 0.066109 0.933891
Model Evaluation for Train Data
In [56]:
print("The Best Logistic Regression Model Score on train data set post tuning is %.3f " % b
The Best Logistic Regression Model Score on train data set post tuning is 0.
834
In [57]:
# Get the confusion matrix on the train data
confusion_matrix(y_train,best_model_lr.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lr.predict(X_train)),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()
In [58]:
# predict probabilities
probs = best_model_lr.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for LR Tuned Model train data set %.2f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='green')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for for LR Tuned Model train data set",fontsize=14,color = 'red');
The ROC_AUC score for LR Tuned Model train data set 0.89
Model Evaluation for Test Data
In [59]:
print("The Best Logistic Regression Model Score on train data post tuning set is %.3f " % b
The Best Logistic Regression Model Score on train data post tuning set is 0.
829
In [60]:
# Get the confusion matrix on the test data
confusion_matrix(y_test,best_model_lr.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lr.predict(X_test)),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()
In [61]:
# predict probabilities
probs = best_model_lr.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for LR Tuned Model test data set %.2f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='green')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for for LR Tuned Model test data set",fontsize=14,color = 'red');
The ROC_AUC score for LR Tuned Model test data set 0.88
In [62]:
print('Classification Report of the training data:\n\n',classification_report(y_train, ytrain_predict_lr))
print('Classification Report of the test data:\n\n',classification_report(y_test, ytest_predict_lr))
Classification Report of the training data:
precision recall f1-score support
0 0.75 0.64 0.69 307
1 0.86 0.91 0.89 754
accuracy 0.83 1061
macro avg 0.81 0.78 0.79 1061
weighted avg 0.83 0.83 0.83 1061
Classification Report of the test data:
precision recall f1-score support
0 0.76 0.73 0.74 153
1 0.86 0.88 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.81 456
weighted avg 0.83 0.83 0.83 456
In [63]:
(best_model_lr.score(X_train, y_train)-best_model_lr.score(X_test, y_test))
Out[63]:
0.00517138746961654
LDA (linear discriminant analysis)
In [64]:
LDA_model=LinearDiscriminantAnalysis()
LDA_model.fit(X_train,y_train)
Out[64]:
LinearDiscriminantAnalysis()
In [65]:
## Performance Matrix on train data set
y_train_predict=LDA_model.predict(X_train)
LDA_model_score_train=LDA_model.score(X_train,y_train)
print("The LDA Model Score on train data set is %.3f " % LDA_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))
The LDA Model Score on train data set is 0.834
[[200 107]
[ 69 685]]
precision recall f1-score support
0 0.74 0.65 0.69 307
1 0.86 0.91 0.89 754
accuracy 0.83 1061
macro avg 0.80 0.78 0.79 1061
weighted avg 0.83 0.83 0.83 1061
In [66]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,LDA_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()
In [67]:
#Performance Matrix on test data set
y_test_predict=LDA_model.predict(X_test)
LDA_model_score_test=LDA_model.score(X_test,y_test)
print("The LDA Model Score on test data set is %.3f " % LDA_model_score_test)
print(metrics.confusion_matrix(y_test,y_test_predict))
print(metrics.classification_report(y_test,y_test_predict))
The LDA Model Score on test data set is 0.831
[[111 42]
[ 35 268]]
precision recall f1-score support
0 0.76 0.73 0.74 153
1 0.86 0.88 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.81 456
weighted avg 0.83 0.83 0.83 456
In [68]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,LDA_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()
Applying GridSearchCV for LDA
In [69]:
grid_lda ={'solver' :['svd', 'lsqr', 'eigen']}
grid_search_lda = GridSearchCV(estimator = LDA_model, param_grid = grid_lda, cv = cv, n_jobs=2)
grid_search_lda.fit(X_train, y_train)
best_model_lda = grid_search_lda.best_estimator_
Model Evaluation for Train Data
In [70]:
ytrain_predict_lda = best_model_lda.predict(X_train)
ytest_predict_lda= best_model_lda.predict(X_test)
In [71]:
## Getting the probabilities on the test set
ytest_predict_prob=best_model_lda.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()
Out[71]:
0 1
0 0.466328 0.533672
1 0.137291 0.862709
2 0.005950 0.994050
3 0.866706 0.133294
4 0.053474 0.946526
In [72]:
#### Model Evaluation for Train Data
print("The Best LDA Model Score on train data set post tuning is %.3f " % best_model_lda.sc
The Best LDA Model Score on train data set post tuning is 0.835
In [73]:
# Get the confusion matrix on the train data
confusion_matrix(y_train,best_model_lda.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lda.predict(X_train)),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()
In [74]:
# predict probabilities
probs = best_model_lda.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for LDA Tuned Model train data set %.3f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for for LDA Tuned Model train data set",fontsize=14,color = 'red');
The ROC_AUC score for LDA Tuned Model train data set 0.890
Model Evaluation for Test Data
In [75]:
#### Model Evaluation for Test Data
print("The Best LDA Model Score on test data set post tuning is %.3f " % best_model_lda.score(X_test, y_test))
The Best LDA Model Score on test data set post tuning is 0.831
In [76]:
# Get the confusion matrix on the Test data
confusion_matrix(y_test,best_model_lda.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lda.predict(X_test)),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()
In [77]:
# predict probabilities
probs = best_model_lda.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for LDA Tuned Model test data set %.3f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for for LDA Tuned Model test data set",fontsize=14,color = 'red');
The ROC_AUC score for LDA Tuned Model test data set 0.888
In [78]:
### Classification of Best LDA Model on Train and Test Data
print('Classification Report of the training data:\n\n',classification_report(y_train, ytrain_predict_lda))
print('Classification Report of the test data:\n\n',classification_report(y_test, ytest_predict_lda))
Classification Report of the training data:
precision recall f1-score support
0 0.74 0.65 0.70 307
1 0.87 0.91 0.89 754
accuracy 0.84 1061
macro avg 0.81 0.78 0.79 1061
weighted avg 0.83 0.84 0.83 1061
Classification Report of the test data:
precision recall f1-score support
0 0.76 0.73 0.74 153
1 0.86 0.88 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.81 456
weighted avg 0.83 0.83 0.83 456
In [79]:
(best_model_lda.score(X_train, y_train)-best_model_lda.score(X_test, y_test))*100
Out[79]:
0.3920912082279293
KNN Model
KNN generally performs best when the data are preprocessed so that all variables are on a similar scale and centered.
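A small sketch with made-up numbers shows why: without scaling, the Euclidean distances KNN relies on are dominated by whichever feature has the largest range (here, age).
# Hypothetical voters as (age, rating) pairs
a = np.array([25, 3])
b = np.array([60, 3])   # far apart in age, same rating
c = np.array([25, 5])   # same age, different rating
print(np.linalg.norm(a - b))  # 35.0, the age gap dominates
print(np.linalg.norm(a - c))  # 2.0, the rating gap barely registers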
In [80]:
KNN_model=KNeighborsClassifier()
KNN_model.fit(X_train,y_train)
Out[80]:
KNeighborsClassifier()
In [81]:
## Performance Matrix on train data set
y_train_predict = KNN_model.predict(X_train)
KNN_model_score_train=KNN_model.score(X_train, y_train)
print("The KNN Model Score on Train data %.3f " % KNN_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The KNN Model Score on Train data 0.857
[[217 90]
[ 62 692]]
precision recall f1-score support
0 0.78 0.71 0.74 307
1 0.88 0.92 0.90 754
accuracy 0.86 1061
macro avg 0.83 0.81 0.82 1061
weighted avg 0.85 0.86 0.85 1061
In [82]:
## Performance Matrix on test data set
y_test_predict = KNN_model.predict(X_test)
KNN_model_score_test = KNN_model.score(X_test, y_test)
print("The KNN Model Score on Test data %.3f " % KNN_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The KNN Model Score on Test data 0.827
[[109 44]
[ 35 268]]
precision recall f1-score support
0 0.76 0.71 0.73 153
1 0.86 0.88 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.80 456
weighted avg 0.82 0.83 0.83 456
Run KNN with the number of neighbours set to 1, 3, 5, ..., 19 and find the optimal number of neighbours using the misclassification error.
Misclassification error (MCE) = 1 - test accuracy score. For example, a test accuracy of 0.8289 gives MCE = 1 - 0.8289 = 0.1711. Calculate the MCE for each model with neighbours = 1, 3, 5, ..., 19 and pick the model with the lowest MCE.
In [83]:
# empty list that will hold accuracy scores
ac_scores = []
# perform accuracy metrics for values from 1,3,5....19
for k in range(1,20,2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # evaluate test accuracy
    scores = knn.score(X_test, y_test)
    ac_scores.append(scores)
# changing to misclassification error
MCE = [1 - x for x in ac_scores]
MCE
Out[83]:
[0.2149122807017544,
0.19736842105263153,
0.17324561403508776,
0.1842105263157895,
0.18201754385964908,
0.17105263157894735,
0.17763157894736847,
0.16885964912280704,
0.16666666666666663,
0.17105263157894735]
Plot misclassification error vs k (with k value on X-axis)
In [84]:
# plot misclassification error vs k
plt.plot(range(1,20,2), MCE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.title("Misclassicication error Vs K Value",fontsize=14,color = 'red');
plt.show()
K = 11 gives one of the lowest misclassification errors on the test data, so we will build the model with k = 11.
In [85]:
from sklearn.neighbors import KNeighborsClassifier
KNN_model_1=KNeighborsClassifier(n_neighbors= 11)
KNN_model_1.fit(X_train,y_train)
Out[85]:
KNeighborsClassifier(n_neighbors=11)
Performance Matrix of KNN New Model on train data set
In [86]:
## Performance Matrix on train data set
y_train_predict = KNN_model_1.predict(X_train)
KNN_model_score_train_New=KNN_model_1.score(X_train, y_train)
print("The KNN Model Score on Train data %.3f " % KNN_model_score_train_New)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The KNN Model Score on Train data 0.843
[[206 101]
[ 66 688]]
precision recall f1-score support
0 0.76 0.67 0.71 307
1 0.87 0.91 0.89 754
accuracy 0.84 1061
macro avg 0.81 0.79 0.80 1061
weighted avg 0.84 0.84 0.84 1061
In [87]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,KNN_model_1.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('KNN-Confusion Matrix-Train Data')
plt.show()
In [88]:
# predict probabilities
probs = KNN_model_1.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for KNN train data set %.3f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for for KNN test data set",fontsize=14,color = 'red');
The ROC_AUC score for KNN train data set 0.911
Performance Matrix of KNN New Model on test data set
In [89]:
## Performance Matrix on test data set
y_test_predict = KNN_model_1.predict(X_test)
KNN_model_score_test_New = KNN_model_1.score(X_test, y_test)
print("The KNN Model Score on Test data %.3f " % KNN_model_score_test_New)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The KNN Model Score on Test data 0.829
[[105 48]
[ 30 273]]
precision recall f1-score support
0 0.78 0.69 0.73 153
1 0.85 0.90 0.88 303
accuracy 0.83 456
macro avg 0.81 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456
In [90]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,KNN_model_1.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('KNN-Confusion Matrix-Test Data')
plt.show()
In [91]:
# predict probabilities
probs = KNN_model_1.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for KNN train data set %.3f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for for KNN test data set",fontsize=14,color = 'red');
The ROC_AUC score for KNN train data set 0.889
Naive Bayes
In [92]:
NB_model=GaussianNB()
NB_model.fit(X_train, y_train)
Out[92]:
GaussianNB()
The GaussianNB classifier is now built and trained on the training data using the fit() method. With the classifier trained, the model is ready to make predictions; we use the predict() method with the test-set features as its argument.
In [93]:
#Performance Matrix on train data set
y_train_predict=NB_model.predict(X_train)
Naive_Bayes_model_score_train=NB_model.score(X_train, y_train)  ## Accuracy
print("The Naive Bayes Model Score on train data is %.3f " % Naive_Bayes_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict)) ## confusion_matrix
print(metrics.classification_report(y_train,y_train_predict)) ## classification_report
The Naive Bayes Model Score on train data is 0.834
[[212 95]
[ 81 673]]
precision recall f1-score support
0 0.72 0.69 0.71 307
1 0.88 0.89 0.88 754
accuracy 0.83 1061
macro avg 0.80 0.79 0.80 1061
weighted avg 0.83 0.83 0.83 1061
In [94]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,NB_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('NB-Confusion Matrix-Train Data')
plt.show()
In [95]:
## Performance Matrix on test data set
y_test_predict = NB_model.predict(X_test)
Naive_Bayes_model_score_test=NB_model.score(X_test, y_test) ## Accuracy
print("The Naive Bayes Model Score on test data is %.3f " % Naive_Bayes_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict)) ## confusion_matrix
print(metrics.classification_report(y_test, y_test_predict)) ## classification_report
The Naive Bayes Model Score on test data is 0.822
[[112 41]
[ 40 263]]
precision recall f1-score support
0 0.74 0.73 0.73 153
1 0.87 0.87 0.87 303
accuracy 0.82 456
macro avg 0.80 0.80 0.80 456
weighted avg 0.82 0.82 0.82 456
In [96]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,NB_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('NB-Confusion Matrix-Test Data')
plt.show()
Naive Bayes with SMOTE
In [97]:
from imblearn.over_sampling import SMOTE
#SMOTE is only applied on the train data set
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())
In [98]:
X_train.shape
Out[98]:
(1061, 8)
In [99]:
## Let's check the shape after SMOTE
X_train_res.shape
Out[99]:
(1508, 8)
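As a quick sanity check (not part of the original run), the resampled training target should now be perfectly balanced between the two classes:
# Class counts before and after SMOTE
print(pd.Series(y_train).value_counts())      # 754 vs 307 before resampling
print(pd.Series(y_train_res).value_counts())  # 754 vs 754 after resampling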
In [100]:
NB_SM_model = GaussianNB()
NB_SM_model.fit(X_train_res, y_train_res)
Out[100]:
GaussianNB()
In [101]:
## Performance Matrix on train data set with SMOTE
y_train_predict = NB_SM_model.predict(X_train_res)
SMOTE_model_score_train = NB_SM_model.score(X_train_res, y_train_res)
print("The SMOTE Model Score for train data set is %.3f " % SMOTE_model_score_train)
print(metrics.confusion_matrix(y_train_res, y_train_predict))
print(metrics.classification_report(y_train_res ,y_train_predict))
The SMOTE Model Score for train data set is 0.822
[[616 138]
[131 623]]
precision recall f1-score support
0 0.82 0.82 0.82 754
1 0.82 0.83 0.82 754
accuracy 0.82 1508
macro avg 0.82 0.82 0.82 1508
weighted avg 0.82 0.82 0.82 1508
In [102]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train_res,NB_SM_model.predict(X_train_res))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()
ROC_AUC Curve for Naive Bayes with SMOTE Model on train data set
In [103]:
probs = NB_SM_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for Naive Bayes with SMOTE train data set %.3f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for for Naive Bayes with SMOTE train data set",fontsize=14,color = 're
The ROC_AUC score for Naive Bayes with SMOTE train data set 0.887
In [104]:
## Performance Matrix on test data set
y_test_predict = NB_SM_model.predict(X_test)
SMOTE_model_score_test = NB_SM_model.score(X_test, y_test)
print("The SMOTE Model Score for test data set is %.3f " % SMOTE_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The SMOTE Model Score for test data set is 0.809
[[125 28]
[ 59 244]]
precision recall f1-score support
0 0.68 0.82 0.74 153
1 0.90 0.81 0.85 303
accuracy 0.81 456
macro avg 0.79 0.81 0.80 456
weighted avg 0.82 0.81 0.81 456
In [105]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,NB_SM_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()
ROC_AUC Curve for Naive Bayes with SMOTE Model on test data set
In [106]:
probs_test = NB_SM_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The ROC_AUC score for Naive Bayes with SMOTE test data set %.3f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr)
plt.title("ROC Curve for Naive Bayes with SMOTE test data set",fontsize=14,color = 'red');
The ROC_AUC score for Naive Bayes with SMOTE test data set 0.876
Random Forest
In [107]:
RF_model=RandomForestClassifier(n_estimators=100,random_state=1)
RF_model.fit(X_train, y_train)
Out[107]:
RandomForestClassifier(random_state=1)
In [108]:
## Performance Matrix on train data set
y_train_predict = RF_model.predict(X_train)
RF_model_score_train =RF_model.score(X_train, y_train)
print("The random Forest Score on train data is %.2f " % RF_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The random Forest Score on train data is 1.00
[[307 0]
[ 0 754]]
precision recall f1-score support
0 1.00 1.00 1.00 307
1 1.00 1.00 1.00 754
accuracy 1.00 1061
macro avg 1.00 1.00 1.00 1061
weighted avg 1.00 1.00 1.00 1061
In [109]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,RF_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()
In [110]:
## Performance Matrix on test data set
y_test_predict = RF_model.predict(X_test)
RF_model_score_test = RF_model.score(X_test, y_test)
print("The random Forest Score on test data is %.3f " % RF_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The random Forest Score on test data is 0.831
[[104 49]
[ 28 275]]
precision recall f1-score support
0 0.79 0.68 0.73 153
1 0.85 0.91 0.88 303
accuracy 0.83 456
macro avg 0.82 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456
In [111]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,RF_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()
In [112]:
(RF_model_score_train-RF_model_score_test)*100
Out[112]:
16.885964912280706
Bagging
In [113]:
cart=RandomForestClassifier()
Bagging_model=BaggingClassifier(base_estimator=cart,n_estimators=100, random_state=1)
Bagging_model.fit(X_train,y_train)
Out[113]:
BaggingClassifier(base_estimator=RandomForestClassifier(), n_estimators=100,
random_state=1)
In [114]:
## Performance Matrix on train data set
y_train_predict=Bagging_model.predict(X_train)
Bagging_model_score_train=Bagging_model.score(X_train,y_train)
print("The Bagging Model Score for train data set is %.2f " % Bagging_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))
The Bagging Model Score for train data set is 0.97
[[278 29]
[ 5 749]]
precision recall f1-score support
0 0.98 0.91 0.94 307
1 0.96 0.99 0.98 754
accuracy 0.97 1061
macro avg 0.97 0.95 0.96 1061
weighted avg 0.97 0.97 0.97 1061
In [115]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,Bagging_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Bagging-Confusion Matrix-Train Data')
plt.show()
In [116]:
## Performance Matrix on test data set
y_test_predict=Bagging_model.predict(X_test)
Bagging_model_score_test=Bagging_model.score(X_test,y_test)
print("The Bagging Model Score for test data set is %.2f " % Bagging_model_score_test)
print(metrics.confusion_matrix(y_test,y_test_predict))
print(metrics.classification_report(y_test,y_test_predict))
The Bagging Model Score for test data set is 0.83
[[104 49]
[ 29 274]]
precision recall f1-score support
0 0.78 0.68 0.73 153
1 0.85 0.90 0.88 303
accuracy 0.83 456
macro avg 0.82 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456
In [117]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,Bagging_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Bagging-Confusion Matrix-Test Data')
plt.show()
In [118]:
(Bagging_model_score_train-Bagging_model_score_test)
Out[118]:
0.13900739123964478
Boosting
Ada Boost
In [119]:
ADB_model=AdaBoostClassifier(n_estimators=100,random_state=1)
ADB_model.fit(X_train,y_train)
Out[119]:
AdaBoostClassifier(n_estimators=100, random_state=1)
In [120]:
## Performance Matrix on train data set
y_train_predict=ADB_model.predict(X_train)
ADB_model_score_train=ADB_model.score(X_train,y_train)
print("The ADA boost Model Score for train data set is %.3f " % ADB_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))
The ADA boost Model Score for train data set is 0.850
[[214 93]
[ 66 688]]
precision recall f1-score support
0 0.76 0.70 0.73 307
1 0.88 0.91 0.90 754
accuracy 0.85 1061
macro avg 0.82 0.80 0.81 1061
weighted avg 0.85 0.85 0.85 1061
In [121]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,ADB_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('ADA Boost-Confusion Matrix-Train Data')
plt.show()
In [122]:
## Performance Matrix on test data set
y_test_predict = ADB_model.predict(X_test)
ADB_model_score_test = ADB_model.score(X_test, y_test)
print("The ADA boost Model Score for test data set is %.3f " % ADB_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The ADA boost Model Score for test data set is 0.814
[[103 50]
[ 35 268]]
precision recall f1-score support
0 0.75 0.67 0.71 153
1 0.84 0.88 0.86 303
accuracy 0.81 456
macro avg 0.79 0.78 0.79 456
weighted avg 0.81 0.81 0.81 456
In [123]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,ADB_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('ADA boost-Confusion Matrix-Test Data')
plt.show()
In [124]:
(ADB_model_score_train-ADB_model_score_test)*100
Out[124]:
3.654488483225027
Gradient Boosting
In [125]:
gbc_model=GradientBoostingClassifier(random_state=1)
gbc_model.fit(X_train, y_train)
Out[125]:
GradientBoostingClassifier(random_state=1)
In [126]:
## Performance Matrix on train data set
y_train_predict = gbc_model.predict(X_train)
gbc_model_score_train = gbc_model.score(X_train, y_train)
print("The Gradient Boosting Score for train data set is %.2f " % gbc_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The Gradient Boosting Score for train data set is 0.89
[[239 68]
[ 46 708]]
precision recall f1-score support
0 0.84 0.78 0.81 307
1 0.91 0.94 0.93 754
accuracy 0.89 1061
macro avg 0.88 0.86 0.87 1061
weighted avg 0.89 0.89 0.89 1061
In [127]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,gbc_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Gradient Boost-Confusion Matrix-Train Data')
plt.show()
In [128]:
## Performance metrics on test data set
y_test_predict = gbc_model.predict(X_test)
gbc_model_score_test = gbc_model.score(X_test, y_test)
print("The Gradient Boosting Score for test data set is %.2f " % gbc_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The Gradient Boosting Score for test data set is 0.84
[[105 48]
[ 27 276]]
precision recall f1-score support
0 0.80 0.69 0.74 153
1 0.85 0.91 0.88 303
accuracy 0.84 456
macro avg 0.82 0.80 0.81 456
weighted avg 0.83 0.84 0.83 456
In [129]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,gbc_model.predict(X_test))),annot=True,fmt='.5g')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Gradient Boosting-Confusion Matrix-Test Data')
plt.show()
In [130]:
(gbc_model_score_train-gbc_model_score_test)*100
Out[130]:
5.702787836698253
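Gradient Boosting shows a larger train-test gap (about 5.7 percentage points) than AdaBoost, and the model above was fitted with default hyperparameters. A small grid search is one way to trade a little train accuracy for a smaller gap; the sketch below uses illustrative, assumed parameter ranges rather than tuned values (GridSearchCV is already imported):
# Sketch only: a modest grid over the main regularizing knobs of GBC.
param_grid = {'n_estimators': [100, 200],
              'learning_rate': [0.05, 0.1],
              'max_depth': [2, 3]}
gbc_grid = GridSearchCV(GradientBoostingClassifier(random_state=1),
                        param_grid, cv=5, scoring='accuracy', n_jobs=-1)
gbc_grid.fit(X_train, y_train)
print(gbc_grid.best_params_, gbc_grid.best_score_)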
Performance Metrics of Logistic Regression on train data set
In [131]:
## Performance metrics on train data set
print("The Best Logistic Regression Model Score on train data set is %.2f " % best_model_lr.score(X_train, y_train))
# Get the confusion matrix on the train data
confusion_matrix(y_train,best_model_lr.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lr.predict(X_train)),annot=True, fmt='.5g')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LR-Confusion Matrix-Train Data')
plt.show()
The Best Logistic Regression Model Score on train data set is 0.83
ROC_AUC Curve for Logistic Regression on train data set
In [132]:
# predict probabilities
probs = best_model_lr.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print('The ROC_AUC score for Logistic Regression Train data set: %.3f' % auc)
# calculate ROC curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Logistic Regression Train data set",fontsize=14,color = 'red');
The ROC_AUC score for Logistic Regression Train data set: 0.890
Performance Metrics of Logistic Regression on test data set
In [133]:
## Performance metrics on test data set
print("The Best Logistic Regression Model Score on test data set is %.2f " % best_model_lr.score(X_test, y_test))
# Get the confusion matrix on the test data
confusion_matrix(y_test,best_model_lr.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lr.predict(X_test)),annot=True, fmt='.5g')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LR-Confusion Matrix-test data')
plt.show()
The Best Logistic Regression Model Score on test data set is 0.83
ROC_AUC Curve for Logistic Regression on test data set
In [134]:
# predict probabilities
probs = best_model_lr.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('The ROC_AUC score for Logistic Regression Test data set : %.3f' % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for Logistic Regression Test data set ",fontsize=14,color = 'red');
The ROC_AUC score for Logistic Regression Test data set : 0.883
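Since the train and test FPR/TPR arrays from the two cells above are still in scope, they can be overlaid on a single plot to make the small AUC difference (0.890 vs 0.883) visible at a glance. This overlay is an added sketch, not one of the original cells:
# Sketch only: overlay the train and test ROC curves computed above.
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr, label='Train (AUC = 0.890)')
plt.plot(test_fpr, test_tpr, label='Test (AUC = 0.883)')
plt.legend()
plt.title("ROC Curves for Logistic Regression: Train vs Test",fontsize=14,color = 'red');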
Performance Metrics of LDA (linear discriminant analysis) on train data set
In [135]:
## Performance metrics on train data set
print("The Best LDA Model Score on train data set is %.2f " % best_model_lda.score(X_train, y_train))
# Get the confusion matrix on the train data
confusion_matrix(y_train,best_model_lda.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lda.predict(X_train)),annot=True, fmt='.5g')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LDA-Confusion Matrix-Train data')
plt.show()
The Best LDA Model Score on train data set is 0.84
ROC_AUC Curve for LDA (linear discriminant analysis) on train data set
In [136]:
# predict probabilities
probs = best_model_lda.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for LDA Train data set %.2f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for LDA Train data set",fontsize=14,color = 'red');
The ROC_AUC score for LDA Train data set is 0.89
Performance Metrics of LDA (linear discriminant analysis) on test data set
In [137]:
# Performance metrics on test data set
print("The Best LDA Model Score on test data set is %.2f " % best_model_lda.score(X_test, y_test))
# Get the confusion matrix on the Test data
confusion_matrix(y_test,best_model_lda.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lda.predict(X_test)),annot=True, fmt='.5g')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LDA-Confusion Matrix-Test Data')
plt.show()
The Best LDA Model Score on test data set is 0.83
ROC_AUC Curve for LDA (linear discriminant analysis) on test data set
In [138]:
probs = best_model_lda.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for LDA Test data set is %.3f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for LDA Test data set",fontsize=14,color = 'red');
The ROC_AUC score for LDA Test data set is 0.888
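As a side note, roc_curve also returns the candidate thresholds, so the probability cutoff maximizing Youden's J statistic (TPR minus FPR) can be read directly off the arrays computed above. This is an added illustration, not part of the original analysis:
# Sketch only: pick the threshold maximizing TPR - FPR on the test curve.
j_scores = test_tpr - test_fpr
best_idx = np.argmax(j_scores)
print("Best threshold by Youden's J: %.3f (TPR=%.3f, FPR=%.3f)"
      % (test_thresholds[best_idx], test_tpr[best_idx], test_fpr[best_idx]))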
Performance Metrics of KNN on train data set
In [139]:
## Performance metrics on train data set
y_train_predict = KNN_model_1.predict(X_train)
KNN_model_score_train_New=KNN_model_1.score(X_train, y_train)
print("The KNN Model Score on Train data is %.2f " % KNN_model_score_train_New)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The KNN Model Score on Train data is 0.84
[[206 101]
[ 66 688]]
precision recall f1-score support
0 0.76 0.67 0.71 307
1 0.87 0.91 0.89 754
accuracy 0.84 1061
macro avg 0.81 0.79 0.80 1061
weighted avg 0.84 0.84 0.84 1061
ROC_AUC Curve for KNN on train data set
In [140]:
# predict probabilities
probs = KNN_model_1.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for KNN train data set %.2f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for for KNN test data set",fontsize=14,color = 'red');
The ROC_AUC score for KNN train data set 0.91
Performance Metrics of KNN on test data set
In [141]:
## Performance metrics on test data set
y_test_predict = KNN_model_1.predict(X_test)
KNN_model_score_test_New = KNN_model_1.score(X_test, y_test)
print("The KNN Model Score on Test data is %.2f " % KNN_model_score_test_New)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The KNN Model Score on Test data is 0.83
[[105 48]
[ 30 273]]
precision recall f1-score support
0 0.78 0.69 0.73 153
1 0.85 0.90 0.88 303
accuracy 0.83 456
macro avg 0.81 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456
ROC_AUC Curve for KNN on test data set
In [142]:
# predict probabilities
probs = KNN_model_1.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for KNN train data set %.2f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for for KNN test data set",fontsize=14,color = 'red');
The ROC_AUC score for KNN train data set 0.89
Performance Metrics of Naive Bayes with SMOTE on train data set
In [143]:
## Performance metrics on train data set with SMOTE
y_train_predict = NB_SM_model.predict(X_train_res)
SMOTE_model_score_train = NB_SM_model.score(X_train_res, y_train_res)
print("The SMOTE Model Score for train data set is %.2f " % SMOTE_model_score_train)
print(metrics.confusion_matrix(y_train_res, y_train_predict))
print(metrics.classification_report(y_train_res ,y_train_predict))
The SMOTE Model Score for train data set is 0.82
[[616 138]
[131 623]]
precision recall f1-score support
0 0.82 0.82 0.82 754
1 0.82 0.83 0.82 754
accuracy 0.82 1508
macro avg 0.82 0.82 0.82 1508
weighted avg 0.82 0.82 0.82 1508
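The X_train_res and y_train_res arrays come from an earlier SMOTE resampling step. For context only, here is a minimal sketch of how such a step is typically done with the imbalanced-learn library; this is an assumption about that earlier cell, not a copy of it. Note that both classes end up with 754 rows, matching the support counts in the report above:
# Sketch only (assumed resampling step): SMOTE oversamples the minority class.
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=1)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
print(pd.Series(y_train_res).value_counts())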
ROC_AUC Curve for Naive Bayes with SMOTE Model on train data set
In [144]:
probs = NB_SM_model.predict_proba(X_train_res)
probs = probs[:, 1]
auc = roc_auc_score(y_train_res, probs)
print("The ROC_AUC score for Naive Bayes with SMOTE train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train_res, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Naive Bayes with SMOTE train data set",fontsize=14,color = 'red');
The ROC_AUC score for Naive Bayes with SMOTE train data set is 0.90
Performance Metrics of Naive Bayes with SMOTE on test data set
In [145]:
## Performance metrics on test data set
y_test_predict = NB_SM_model.predict(X_test)
SMOTE_model_score_test = NB_SM_model.score(X_test, y_test)
print("The SMOTE Model Score for test data set is %.2f " % SMOTE_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The SMOTE Model Score for test data set is 0.81
[[125 28]
[ 59 244]]
precision recall f1-score support
0 0.68 0.82 0.74 153
1 0.90 0.81 0.85 303
accuracy 0.81 456
macro avg 0.79 0.81 0.80 456
weighted avg 0.82 0.81 0.81 456
ROC_AUC Curve for Naive Bayes with SMOTE Model on test data set
In [146]:
probs_test = NB_SM_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The ROC_AUC score for Naive Bayes with SMOTE Model on test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr)
plt.title("ROC Curve for Naive Bayes with SMOTE Model on test data set",fontsize=14,color =
The ROC_AUC score for Naive Bayes with SMOTE Model on test data set is 0.88
Performance Metrics of Random Forest on train data set
In [147]:
## Performance metrics on train data set
y_train_predict = RF_model.predict(X_train)
RF_model_score_train = RF_model.score(X_train, y_train)
print("The Random Forest Score on train data is %.2f " % RF_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The Random Forest Score on train data is 1.00
[[307 0]
[ 0 754]]
precision recall f1-score support
0 1.00 1.00 1.00 307
1 1.00 1.00 1.00 754
accuracy 1.00 1061
macro avg 1.00 1.00 1.00 1061
weighted avg 1.00 1.00 1.00 1061
In [148]:
Recall=(754/(0+754))
print("Random Forest-Train Data Set-Recall for class 1 is %.2f " % Recall)
Random Forest-Train Data Set-Recall for class 1 is 1.00
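Computing recall by hand from the confusion-matrix counts works, but sklearn can do it directly. The one-liner below is an equivalent added sketch (metrics is already imported):
# Sketch only: recall for class 1 straight from sklearn.
print("Recall for class 1 is %.2f" % metrics.recall_score(y_train, RF_model.predict(X_train)))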
ROC_AUC Curve for Random Forest on train data set
In [149]:
probs = RF_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The AUC_ROC score for Random Forest train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Random Forest train data",fontsize=14,color = 'red');
The ROC_AUC score for Random Forest train data set is 1.00
Performance Metrics of Random Forest on test data set
In [150]:
## Performance metrics on test data set
y_test_predict = RF_model.predict(X_test)
RF_model_score_test = RF_model.score(X_test, y_test)
print("The Random Forest Score on test data is %.2f " % RF_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The Random Forest Score on test data is 0.83
[[104 49]
[ 28 275]]
precision recall f1-score support
0 0.79 0.68 0.73 153
1 0.85 0.91 0.88 303
accuracy 0.83 456
macro avg 0.82 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456
In [151]:
Recall=(275/(28+275))
print("Random Forest-Test Data Set-Recall for class 1 is %.2f " % Recall)
Random Forest-Test Data Set-Recall for class 1 is 0.91
ROC_AUC Curve for Random Forest on test data set
In [152]:
probs_test = RF_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The AUC_ROC score for Random Forest test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for Random Forest Test data set",fontsize=14,color = 'red');
The ROC_AUC score for Random Forest test data set is 0.90
Performance Metrics of Bagging on train data set
In [153]:
## Performance metrics on train data set
y_train_predict=Bagging_model.predict(X_train)
Bagging_model_score_train=Bagging_model.score(X_train,y_train)
print("The Bagging Model Score for train data set is %.2f " % Bagging_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))
The Bagging Model Score for train data set is 0.97
[[278 29]
[ 5 749]]
precision recall f1-score support
0 0.98 0.91 0.94 307
1 0.96 0.99 0.98 754
accuracy 0.97 1061
macro avg 0.97 0.95 0.96 1061
weighted avg 0.97 0.97 0.97 1061
ROC_AUC Curve for Bagging on train data set
In [154]:
probs = Bagging_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for Bagging train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Bagging Train data set",fontsize=14,color = 'red');
The ROC_AUC score for Bagging train data set is 1.00
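A perfect train AUC says little about how a bagged ensemble generalizes. One option, sketched below, is to refit with oob_score=True so each tree is evaluated on the bootstrap samples it never saw, giving an out-of-bag accuracy estimate without touching the test set. The n_estimators value here is an assumption, since the original Bagging_model settings are defined in an earlier cell:
# Sketch only: out-of-bag accuracy estimate for a bagging ensemble.
Bagging_oob = BaggingClassifier(n_estimators=100, oob_score=True, random_state=1)
Bagging_oob.fit(X_train, y_train)
print("OOB accuracy estimate: %.3f" % Bagging_oob.oob_score_)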
Performance Metrics of Bagging on test data set
In [155]:
## Performance metrics on test data set
y_test_predict=Bagging_model.predict(X_test)
Bagging_model_score_test=Bagging_model.score(X_test,y_test)
print("The Bagging Model Score for test data set is %.2f " % Bagging_model_score_test)
print(metrics.confusion_matrix(y_test,y_test_predict))
print(metrics.classification_report(y_test,y_test_predict))
The Bagging Model Score for test data set is 0.83
[[104 49]
[ 29 274]]
precision recall f1-score support
0 0.78 0.68 0.73 153
1 0.85 0.90 0.88 303
accuracy 0.83 456
macro avg 0.82 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456
ROC_AUC Curve for Bagging on test data set
In [156]:
probs_test = Bagging_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The AUC_ROC score for Bagging test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for Bagging Test data set",fontsize=14,color = 'red');
The ROC_AUC score for Bagging test data set is 0.90
Performance Metrics of AdaBoost on train data set
In [157]:
## Performance metrics on train data set
y_train_predict=ADB_model.predict(X_train)
ADB_model_score_train=ADB_model.score(X_train,y_train)
print("The AdaBoost Model Score for train data set is %.3f " % ADB_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))
The AdaBoost Model Score for train data set is 0.850
[[214 93]
[ 66 688]]
precision recall f1-score support
0 0.76 0.70 0.73 307
1 0.88 0.91 0.90 754
accuracy 0.85 1061
macro avg 0.82 0.80 0.81 1061
weighted avg 0.85 0.85 0.85 1061
ROC_AUC Curve for AdaBoost on train data set
In [158]:
probs = ADB_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The AUC_ROC score for ADB Model train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for ADB Model train data set",fontsize=14,color = 'red');
The AUC_ROC score for ADB Model train data set 0.91
Performance Metrics of AdaBoost on test data set
In [159]:
## Performance metrics on test data set
y_test_predict = ADB_model.predict(X_test)
ADB_model_score_test = ADB_model.score(X_test, y_test)
print("The AdaBoost Model Score for test data set is %.2f " % ADB_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The AdaBoost Model Score for test data set is 0.81
[[103 50]
[ 35 268]]
precision recall f1-score support
0 0.75 0.67 0.71 153
1 0.84 0.88 0.86 303
accuracy 0.81 456
macro avg 0.79 0.78 0.79 456
weighted avg 0.81 0.81 0.81 456
ROC_AUC Curve for AdaBoost on test data set
In [160]:
probs_test = ADB_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The AUC_ROC score for ADB Model test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for ADB Model test data set",fontsize=14,color = 'red');
The AUC_ROC score for ADB Model test data set 0.88
Performance Metrics of Gradient Boosting on train data set
In [161]:
## Performance metrics on train data set
y_train_predict = gbc_model.predict(X_train)
gbc_model_score_train = gbc_model.score(X_train, y_train)
print("The Gradient Boosting Score for train data set is %.3f " % gbc_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The Gradient Boosting Score for train data set is 0.893
[[239 68]
[ 46 708]]
precision recall f1-score support
0 0.84 0.78 0.81 307
1 0.91 0.94 0.93 754
accuracy 0.89 1061
macro avg 0.88 0.86 0.87 1061
weighted avg 0.89 0.89 0.89 1061
In [162]:
Recall=(708/(46+708))
print("Gradient Boosting-Train Data Set-Recall for class 1 is %.3f " % Recall)
Gradient Boosting-Train Data Set-Recall for class 1 is 0.939
ROC_AUC Curve for Gradient Boosting on train data set
In [163]:
probs = gbc_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for Gradient Boosting train data set %.3f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Gradient Boosting train data set",fontsize=14,color = 'red');
The ROC_AUC score for Gradient Boosting train data set is 0.951
Performance Metrics of Gradient Boosting on test data set
In [164]:
## Performance metrics on test data set
y_test_predict = gbc_model.predict(X_test)
gbc_model_score_test = gbc_model.score(X_test, y_test)
print("The Gradient Boosting Score for test data set is %.3f " % gbc_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The Gradient Boosting Score for test data set is 0.836
[[105 48]
[ 27 276]]
precision recall f1-score support
0 0.80 0.69 0.74 153
1 0.85 0.91 0.88 303
accuracy 0.84 456
macro avg 0.82 0.80 0.81 456
weighted avg 0.83 0.84 0.83 456
In [165]:
Recall=(276/(27+276))
print("Gradient Boosting-Test Data Set-Recall for class 1 is %.3f " % Recall)
Gradient Boosting-Test Data Set-Recall for class 1 is 0.911
ROC_AUC Curve for Gradient Boosting on test data set
In [166]:
probs_test = gbc_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The ROC_AUC score for Gradient Boosting test data set %.3f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr)
plt.title("ROC Curve for Gradient Boosting test data set",fontsize=14,color = 'red');
The ROC_AUC score for Gradient Boosting test data set is 0.899
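Because boosting builds trees sequentially, staged_predict_proba can replay the fitted ensemble after each round, so the test AUC can be tracked per stage to see where additional trees stop helping. This diagnostic is an added sketch, not one of the original cells:
# Sketch only: test AUC after each boosting stage of the fitted model.
test_aucs = [roc_auc_score(y_test, p[:, 1])
             for p in gbc_model.staged_predict_proba(X_test)]
print("Best test ROC_AUC %.3f at stage %d" % (max(test_aucs), int(np.argmax(test_aucs)) + 1))
plt.plot(range(1, len(test_aucs) + 1), test_aucs)
plt.xlabel('Boosting stage')
plt.ylabel('Test ROC_AUC')
plt.title("Gradient Boosting: test ROC_AUC per stage",fontsize=14,color = 'red');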
Comparison of Different Models
In [168]:
print("The Logistic Regression Model Score Post Tuning on train data set is %.3f " % best_m
print("The Logistic Regression Model Score Post Tuning on test data set is %.3f " % best_m
print("The LDA Model Score Post Tuning on train data set is %.3f " % best_model_lda.score(X
print("The LDA Model Score Post Tuning on test data set is %.3f " % best_model_lda.score(X
print("The KNN Model Score Post Tuning on Train data %.3f " % KNN_model_1.score(X_train, y_
print("The KNN Model Score Post Tuning on Test data %.3f " % KNN_model_1.score(X_test, y_te
print("The Naive Bayes Model Score Post Tuning on train data is %.3f " % NB_SM_model.score(
print("The Naive Bayes Model Score Post Tuning on test data is %.3f " % NB_SM_model.score(X
The Logistic Regression Model Score Post Tuning on train data set is 0.834
The Logistic Regression Model Score Post Tuning on test data set is 0.829
The LDA Model Score Post Tuning on train data set is 0.835
The LDA Model Score Post Tuning on test data set is 0.831
The KNN Model Score Post Tuning on Train data 0.843
The KNN Model Score Post Tuning on Test data 0.829
The Naive Bayes Model Score Post Tuning on train data is 0.822
The Naive Bayes Model Score Post Tuning on test data is 0.809
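The same comparison reads more easily as a table. The sketch below collects the scores printed above into a DataFrame (assuming, as in the prints, that the Naive Bayes train score is computed on the SMOTE-resampled data):
# Sketch only: tabulate the post-tuning train/test scores side by side.
comparison = pd.DataFrame(
    {'Train score': [best_model_lr.score(X_train, y_train),
                     best_model_lda.score(X_train, y_train),
                     KNN_model_1.score(X_train, y_train),
                     NB_SM_model.score(X_train_res, y_train_res)],
     'Test score': [best_model_lr.score(X_test, y_test),
                    best_model_lda.score(X_test, y_test),
                    KNN_model_1.score(X_test, y_test),
                    NB_SM_model.score(X_test, y_test)]},
    index=['Logistic Regression', 'LDA', 'KNN', 'Naive Bayes (SMOTE)'])
print(comparison.round(3))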
In [169]:
print("Variance in Test and train Scores of LDA Model is %.5f " % (best_model_lr.score(X_tr
Variance in Test and train Scores of LDA Model is 0.00517
In [170]:
print("Variance in Test and train Scores of LDA Model is %.5f " % (best_model_lda.score(X_t
Variance in Test and train Scores of LDA Model is 0.00392
In [171]:
print("Variance in Test and train Scores of KNN Model for is %.5f " % (KNN_model_1.score(X
Variance in Test and train Scores of KNN Model for is 0.01365
In [172]:
print("Variance in Test and train Scores of LR Model for is %.5f " % (NB_SM_model.score(X_
Variance in Test and train Scores of LR Model for is 0.01241
Cross Validation
In [173]:
from sklearn.model_selection import cross_val_score
In [174]:
scores = cross_val_score(best_model_lda, X_train, y_train, cv=10)
scores
Out[174]:
array([0.78504673, 0.77358491, 0.83962264, 0.85849057, 0.85849057,
0.8490566 , 0.81132075, 0.8490566 , 0.81132075, 0.82075472])
In [175]:
scores = cross_val_score(best_model_lda, X_test, y_test, cv=10)
scores
Out[175]:
array([0.80434783, 0.76086957, 0.86956522, 0.82608696, 0.89130435,
0.86956522, 0.93333333, 0.84444444, 0.75555556, 0.84444444])
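The raw fold scores are easier to compare as a mean and standard deviation. The summary below is an added sketch; note also that running cross_val_score on the held-out test set refits the model on test folds, so the second array is best read as a stability check rather than a true out-of-sample estimate:
# Sketch only: summarize the 10-fold scores reported above.
train_cv = cross_val_score(best_model_lda, X_train, y_train, cv=10)
test_cv = cross_val_score(best_model_lda, X_test, y_test, cv=10)
print("Train CV accuracy: %.3f +/- %.3f" % (train_cv.mean(), train_cv.std()))
print("Test CV accuracy : %.3f +/- %.3f" % (test_cv.mean(), test_cv.std()))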
--------------------------------------END OF PROBLEM 1------------