MACHINE LEARNING_PROJECT-PROBLEM 1
Importing required Libraries
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style
%matplotlib inline
import seaborn as sns; sns.set() # for plot styling
from scipy import stats
import scipy.cluster.hierarchy as sch
from scipy.cluster.hierarchy import dendrogram,linkage,fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import metrics,model_selection
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from scipy.stats import zscore
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
import warnings
warnings.filterwarnings('ignore')
In [2]:
Elect_df= pd.read_excel("Election_Data.xlsx",sheet_name="Election_Dataset_Two Classes",index_col=0)
Elect_df.head()
Out[2]:
vote age economic.cond.national economic.cond.household Blair Hague Europe politic
1 Labour 43 3 3 4 1 2
2 Labour 36 4 4 4 4 5
3 Labour 35 4 4 5 2 3
4 Labour 24 4 2 2 1 4
5 Labour 41 2 2 1 1 6
In [3]:
# Shape function displays the number of rows and columns in a dataframe.
print('The dataset has {} rows and {} columns'.format(Elect_df.shape[0],Elect_df.shape[1]))
The dataset has 1525 rows and 9 columns
In [4]:
# Checking Data info
Elect_df.info();
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1525 entries, 1 to 1525
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 vote 1525 non-null object
1 age 1525 non-null int64
2 economic.cond.national 1525 non-null int64
3 economic.cond.household 1525 non-null int64
4 Blair 1525 non-null int64
5 Hague 1525 non-null int64
6 Europe 1525 non-null int64
7 political.knowledge 1525 non-null int64
8 gender 1525 non-null object
dtypes: int64(7), object(2)
memory usage: 119.1+ KB
In [5]:
# Handling missing data
# Test whether there is any null value in our dataset or not. We can do this using isnull()
Elect_df.isnull().sum()
print("There are", Elect_df.isnull().values.sum(),"Missing Values in dataset")
There are 0 Missing Values in dataset
In [6]:
cat=[]
num=[]
for i in Elect_df.columns:
    if Elect_df[i].dtype=="object":
        cat.append(i)
    else:
        num.append(i)
print(cat)
print(num)
['vote', 'gender']
['age', 'economic.cond.national', 'economic.cond.household', 'Blair', 'Hague', 'Europe', 'political.knowledge']
In [7]:
for variable in cat:
    print(variable,":", sum(Elect_df[variable] == '?'))
vote : 0
gender : 0
In [8]:
Elect_df[num].describe().T
Out[8]:
count mean std min 25% 50% 75% max
age 1525.0 54.182295 15.711209 24.0 41.0 53.0 67.0 93.0
economic.cond.national 1525.0 3.245902 0.880969 1.0 3.0 3.0 4.0 5.0
economic.cond.household 1525.0 3.140328 0.929951 1.0 3.0 3.0 4.0 5.0
Blair 1525.0 3.334426 1.174824 1.0 2.0 4.0 4.0 5.0
Hague 1525.0 2.746885 1.230703 1.0 2.0 2.0 4.0 5.0
Europe 1525.0 6.728525 3.297538 1.0 4.0 6.0 10.0 11.0
political.knowledge 1525.0 1.542295 1.083315 0.0 0.0 2.0 2.0 3.0
In [9]:
Elect_df[cat].describe().T
Out[9]:
count unique top freq
vote 1525 2 Labour 1063
gender 1525 2 female 812
In [10]:
# Checking for Duplicates
dups=Elect_df.duplicated()
print("Total no of duplicate values = %d" % (dups.sum()))
Elect_df[dups]
Total no of duplicate values = 8
Out[10]:
vote age economic.cond.national economic.cond.household Blair Hague Europ
68 Labour 35 4 4 5 2
627 Labour 39 3 4 4 2
871 Labour 38 2 4 2 2
984 Conservative 74 4 3 2 4
1155 Conservative 53 3 4 2 2
1237 Labour 36 3 3 2 2
1245 Labour 29 4 4 4 2
1439 Labour 40 4 3 4 2
Removing Duplicate Data
In [11]:
Elect_df.drop_duplicates(inplace=True)
In [12]:
Elect_df.shape
Out[12]:
(1517, 9)
unique values for categorical variables
In [13]:
### unique values for categorical variables
for column in Elect_df.columns:
    if Elect_df[column].dtype == 'object':
        print(column.upper(),': ',Elect_df[column].nunique())
        print(Elect_df[column].value_counts().sort_values())
        print('\n')
VOTE : 2
Conservative 460
Labour 1057
Name: vote, dtype: int64
GENDER : 2
male 709
female 808
Name: gender, dtype: int64
In [14]:
# Checking the Skewness in data
Elect_df.skew(axis=0,skipna=True)
Out[14]:
age 0.139800
economic.cond.national -0.238474
economic.cond.household -0.144148
Blair -0.539514
Hague 0.146191
Europe -0.141891
political.knowledge -0.422928
dtype: float64
Univariate Analysis
In [15]:
a=1
plt.figure(figsize=(15,112))
for i in Elect_df.columns:
    if Elect_df[i].dtype != 'object':
        plt.subplot(21,3,a)
        sns.distplot(Elect_df[i])
        plt.title("Distribution plot for:" + i)
        plt.subplot(21,3,a+1)
        sns.histplot(Elect_df[i])
        plt.title("Histogram for:" + i)
        plt.subplot(21,3,a+2)
        sns.boxplot(Elect_df[i])
        plt.title("Boxplot for:" + i)
        a+=3
Bivariate and Multivariate Analysis
In [16]:
fig, (ax1,ax2,ax3,ax4)=plt.subplots(1,4,figsize=(16,5))
fig, (ax5,ax6,ax7)=plt.subplots(1,3,figsize=(12,5))
sns.stripplot(Elect_df["vote"], Elect_df['age'],orient='v',jitter=True,ax=ax1)
ax1.set_xlabel('vote', fontsize=15)
ax1.set_title('Distribution of vote', fontsize=15)
ax1.tick_params(labelsize=15)
sns.stripplot(Elect_df["vote"], Elect_df['economic.cond.national'], jitter=True, ax=ax2)
ax2.set_xlabel('Vote', fontsize=15)
ax2.set_title('Distribution of Vote', fontsize=15)
ax2.tick_params(labelsize=15)
sns.stripplot(Elect_df["vote"], Elect_df['economic.cond.household'], jitter=True, ax=ax3)
ax3.set_xlabel('vote', fontsize=15)
ax3.set_title('Distribution of vote', fontsize=15)
ax3.tick_params(labelsize=15)
sns.stripplot(Elect_df["vote"], Elect_df['Blair'], jitter=True, ax=ax4)
ax4.set_xlabel('vote', fontsize=15)
ax4.set_title('Distribution of vote', fontsize=15)
ax4.tick_params(labelsize=15)
sns.stripplot(Elect_df["vote"], Elect_df['Hague'], jitter=True, ax=ax5)
ax5.set_xlabel('vote', fontsize=15)
ax5.set_title('Distribution of vote', fontsize=15)
ax5.tick_params(labelsize=15)
sns.stripplot(Elect_df["vote"], Elect_df['Europe'], jitter=True, ax=ax6)
ax6.set_xlabel('vote', fontsize=15)
ax6.set_title('Distribution of vote', fontsize=15)
ax6.tick_params(labelsize=15)
sns.stripplot(Elect_df["vote"], Elect_df['political.knowledge'], jitter=True, ax=ax7)
ax7.set_xlabel('vote', fontsize=15)
ax7.set_title('Distribution of vote', fontsize=15)
ax7.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
In [17]:
fig, (ax1,ax2,ax3,ax4)=plt.subplots(1,4,figsize=(16,5))
fig, (ax5,ax6,ax7)=plt.subplots(1,3,figsize=(12,5))
sns.stripplot(Elect_df["gender"], Elect_df['age'],orient='v',jitter=True,ax=ax1)
ax1.set_xlabel('gender', fontsize=15)
ax1.set_title('Distribution of gender', fontsize=15)
ax1.tick_params(labelsize=15)
sns.stripplot(Elect_df["gender"], Elect_df['economic.cond.national'], jitter=True, ax=ax2)
ax2.set_xlabel('gender', fontsize=15)
ax2.set_title('Distribution of gender', fontsize=15)
ax2.tick_params(labelsize=15)
sns.stripplot(Elect_df["gender"], Elect_df['economic.cond.household'], jitter=True, ax=ax3)
ax3.set_xlabel('gender', fontsize=15)
ax3.set_title('Distribution of gender', fontsize=15)
ax3.tick_params(labelsize=15)
sns.stripplot(Elect_df["gender"], Elect_df['Blair'], jitter=True, ax=ax4)
ax4.set_xlabel('gender', fontsize=15)
ax4.set_title('Distribution of gender', fontsize=15)
ax4.tick_params(labelsize=15)
sns.stripplot(Elect_df["gender"], Elect_df['Hague'], jitter=True, ax=ax5)
ax5.set_xlabel('gender', fontsize=15)
ax5.set_title('Distribution of gender', fontsize=15)
ax5.tick_params(labelsize=15)
sns.stripplot(Elect_df["gender"], Elect_df['Europe'], jitter=True, ax=ax6)
ax6.set_xlabel('gender', fontsize=15)
ax6.set_title('Distribution of gender', fontsize=15)
ax6.tick_params(labelsize=15)
sns.stripplot(Elect_df["gender"], Elect_df['political.knowledge'], jitter=True, ax=ax7)
ax7.set_xlabel('gender', fontsize=15)
ax7.set_title('Distribution of gender', fontsize=15)
ax7.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
Check for Data Distribution w.r.t Vote
In [18]:
### Data Distribution
plt.figure(figsize=(24,8))
sns.pairplot(Elect_df,hue='vote');
<Figure size 1728x576 with 0 Axes>
In [19]:
#correlation matrix
Elect_df.corr()
Out[19]:
age economic.cond.national economic.cond.household Bla
age 1.000000 0.018687 -0.038868 0.03208
economic.cond.national 0.018687 1.000000 0.347687 0.32614
economic.cond.household -0.038868 0.347687 1.000000 0.21582
Blair 0.032084 0.326141 0.215822 1.00000
Hague 0.031144 -0.200790 -0.100392 -0.24350
Europe 0.064562 -0.209150 -0.112897 -0.29594
political.knowledge -0.046598 -0.023510 -0.038528 -0.02129
In [20]:
# plot the correlation coefficients as a heatmap
plt.subplots(figsize=(15,10))
sns.heatmap(Elect_df.corr(), annot=True, fmt='.2f', cmap='Blues', vmax=1, vmin=-1);
Check for Outliers
In [21]:
#Check for presence of outliers
plt.figure(figsize=(15,10))
Elect_df[num].boxplot(patch_artist = True, color='red',notch=True)
plt.title('Rectangular box plot')
plt.show();
There are almost no outliers in the numerical columns; the only outliers are in the economic.cond.national and economic.cond.household variables. In Gaussian Naive Bayes, outliers distort the shape of the fitted Gaussian distribution by shifting the estimated mean and variance, so depending on the use case it makes sense to treat them.
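As a quick toy illustration (made-up numbers, not the election data), a single extreme value drags the mean, the statistic a Gaussian fit relies on, while the median stays put:
# Toy illustration: one extreme value shifts the mean but not the median
vals = pd.Series([3, 3, 3, 4, 4])
vals_out = pd.Series([3, 3, 3, 4, 40])
print(vals.mean(), vals.median())         # 3.4 3.0
print(vals_out.mean(), vals_out.median()) # 10.6 3.0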
In [22]:
print('Range of values: ', Elect_df['economic.cond.national'].max()-Elect_df['economic.cond.national'].min())
Range of values: 4
In [23]:
#Central values
print('Minimum value economic.cond.national: ', Elect_df['economic.cond.national'].min())
print('Maximum economic.cond.national: ',Elect_df['economic.cond.national'].max())
print('Mean value economic.cond.national: ', Elect_df['economic.cond.national'].mean())
print('Median value economic.cond.national: ',Elect_df['economic.cond.national'].median())
print('Standard deviation economic.cond.national: ', Elect_df['economic.cond.national'].std())
print('Null values economic.cond.national: ',Elect_df['economic.cond.national'].isnull().any())
Minimum value economic.cond.national: 1
Maximum economic.cond.national: 5
Mean value economic.cond.national: 3.245220830586684
Median value economic.cond.national: 3.0
Standard deviation economic.cond.national: 0.8817924638047195
Null values economic.cond.national: False
In [24]:
#Quartiles
Q1=Elect_df['economic.cond.national'].quantile(q=0.25)
Q3=Elect_df['economic.cond.national'].quantile(q=0.75)
print('economic.cond.national - 1st Quartile (Q1) is: ', Q1)
print('economic.cond.national - 3rd Quartile (Q3) is: ', Q3)
print('Interquartile range (IQR) of economic.cond.national is ', stats.iqr(Elect_df['economic.cond.national']))
economic.cond.national - 1st Quartile (Q1) is: 3.0
economic.cond.national - 3rd Quartile (Q3) is: 4.0
Interquartile range (IQR) of economic.cond.national is 1.0
In [25]:
#Outlier detection from Interquartile range (IQR) in original data
# IQR=Q3-Q1
#lower 1.5*IQR whisker i.e Q1-1.5*IQR
#upper 1.5*IQR whisker i.e Q3+1.5*IQR
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
print('Lower outliers in economic.cond.national: ', L_outliers)
print('Upper outliers in economic.cond.national: ', U_outliers)
Lower outliers in economic.cond.national: 1.5
Upper outliers in economic.cond.national: 5.5
In [26]:
print('Number of outliers in economic.cond.national upper : ', Elect_df[Elect_df['economic.cond.national']>U_outliers]['economic.cond.national'].count())
print('Number of outliers in economic.cond.national lower : ', Elect_df[Elect_df['economic.cond.national']<L_outliers]['economic.cond.national'].count())
print('% of Outlier in economic.cond.national upper: ',round(Elect_df[Elect_df['economic.cond.national']>U_outliers]['economic.cond.national'].count()*100/len(Elect_df)),'%')
print('% of Outlier in economic.cond.national lower: ',round(Elect_df[Elect_df['economic.cond.national']<L_outliers]['economic.cond.national'].count()*100/len(Elect_df)),'%')
Number of outliers in economic.cond.national upper : 0
Number of outliers in economic.cond.national lower : 1517
% of Outlier in economic.cond.national upper: 0 %
% of Outlier in economic.cond.national lower: 100 %
Outlier Treatment
In [27]:
def remove_outlier(col):
    # IQR-based whiskers: anything beyond Q1-1.5*IQR or Q3+1.5*IQR is an outlier
    Q1,Q3=np.percentile(col,[25,75])
    IQR=Q3-Q1
    lower_range= Q1-(1.5 * IQR)
    upper_range= Q3+(1.5 * IQR)
    return lower_range, upper_range
In [28]:
lr,ur=remove_outlier(Elect_df["economic.cond.national"])
Elect_df["economic.cond.national"]=np.where(Elect_df["economic.cond.national"]>ur,ur,Elect_
Elect_df["economic.cond.national"]=np.where(Elect_df["economic.cond.national"]<lr,lr,Elect_
lr,ur=remove_outlier(Elect_df["economic.cond.household"])
Elect_df["economic.cond.household"]=np.where(Elect_df["economic.cond.household"]>ur,ur,Elec
Elect_df["economic.cond.household"]=np.where(Elect_df["economic.cond.household"]<lr,lr,Elec
In [29]:
#Check for presence of outliers
plt.figure(figsize=(15,10))
Elect_df[num].boxplot(patch_artist = True, color='red',notch=True)
plt.title('Rectangular box plot')
plt.show();
Get_dummies of the object variables
In [30]:
cat
Out[30]:
['vote', 'gender']
In [31]:
cat1 = ['vote', 'gender']
drop_first drops one of the dummy columns created from the levels of each categorical variable; if all levels are kept, the dummies are perfectly collinear and introduce multicollinearity. This ensures we do not fall into the dummy variable trap.
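A minimal toy sketch (hypothetical data, not this dataset) of the trap drop_first avoids: without it, the dummy columns always sum to 1, so each one is a perfect linear combination of the others.
# Toy example: dummies with and without drop_first
toy = pd.DataFrame({'gender': ['male', 'female', 'female']})
print(pd.get_dummies(toy))                   # gender_female + gender_male == 1 in every row
print(pd.get_dummies(toy, drop_first=True))  # keeps only gender_male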
In [32]:
df=pd.get_dummies(Elect_df, columns=cat1,drop_first=True)
df.head()
Out[32]:
age economic.cond.national economic.cond.household Blair Hague Europe political.knowl
1 43 3.0 3.0 4 1 2
2 36 4.0 4.0 4 4 5
3 35 4.0 4.0 5 2 3
4 24 4.0 2.0 2 1 4
5 41 2.0 2.0 1 1 6
In [33]:
# Copy all the predictor variables into X dataframe
X=df.drop('vote_Labour',axis=1)
# Copy target into the y dataframe.
y=df['vote_Labour']
In [34]:
# Var prior to scaling
X.var()
Out[34]:
age 246.544655
economic.cond.national 0.728713
economic.cond.household 0.785491
Blair 1.380089
Hague 1.519005
Europe 10.883687
political.knowledge 1.175961
gender_male 0.249099
dtype: float64
In [35]:
# Data prior to scaling
plt.plot(X)
plt.title('Data prior to scaling ', fontsize=15)
plt.show()
Is Scaling necessary here or not?
In [36]:
# Scaling the attributes.
X[['age','economic.cond.national','economic.cond.household','Blair','Hague','Europe','political.knowledge','gender_male']]=X.apply(zscore)
In [37]:
# Var post scaling
X.var()
Out[37]:
age 1.00066
economic.cond.national 1.00066
economic.cond.household 1.00066
Blair 1.00066
Hague 1.00066
Europe 1.00066
political.knowledge 1.00066
gender_male 1.00066
dtype: float64
In [38]:
# Data post scaling
plt.plot(X)
plt.title('Data post scaling ', fontsize=15)
plt.show()
In [39]:
X.head()
Out[39]:
age economic.cond.national economic.cond.household Blair Hague Europe
1 -0.716161 -0.301648 -0.179682 0.565802 -1.419969 -1.437338
2 -1.162118 0.870183 0.949003 0.565802 1.014951 -0.527684
3 -1.225827 0.870183 0.949003 1.417312 -0.608329 -1.134120
4 -1.926617 0.870183 -1.308366 -1.137217 -1.419969 -0.830902
5 -0.843577 -1.473479 -1.308366 -1.988727 -1.419969 -0.224465
In [40]:
y.head()
Out[40]:
1 1
2 1
3 1
4 1
5 1
Name: vote_Labour, dtype: uint8
Train-Test Split: split X and y into training and test sets in a 70:30 ratio with random_state=1
In [41]:
# Split X and y into training and test set in 70:30 ratio
X_train,X_test, y_train, y_test=train_test_split(X,y,test_size=0.30, random_state=1)
In [42]:
print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('y_train',y_train.shape)
print('y_test',y_test.shape)
X_train (1061, 8)
X_test (456, 8)
y_train (1061,)
y_test (456,)
In [43]:
Logistic_model = LogisticRegression(solver='newton-cg',max_iter=10000,penalty='none',verbose=True,n_jobs=2)
Logistic_model.fit(X_train, y_train)
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 1 out of 1 | elapsed: 1.1s finished
Out[43]:
LogisticRegression(max_iter=10000, n_jobs=2, penalty='none', solver='newton-cg', verbose=True)
The LogisticRegression classifier is now built and trained on the training data using the fit() method. With the classifier trained, the model is ready to make predictions; we use the predict() method with the test-set features as its argument.
In [44]:
## Performance Matrix on train data set
y_train_predict=Logistic_model.predict(X_train)
Logistic_model_score_train=Logistic_model.score(X_train,y_train) ## Accuracy
print("The Logistic Regression Model Score on train data set is %.3f " % Logistic_model_sco
print(metrics.confusion_matrix(y_train,y_train_predict)) ## Confusion Matrix
print(metrics.classification_report(y_train,y_train_predict)) ## Classification r
The Logistic Regression Model Score on train data set is 0.834
[[197 110]
[ 66 688]]
precision recall f1-score support
0 0.75 0.64 0.69 307
1 0.86 0.91 0.89 754
accuracy 0.83 1061
macro avg 0.81 0.78 0.79 1061
weighted avg 0.83 0.83 0.83 1061
In [45]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,Logistic_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()
In [46]:
## Performance Matrix on test data set
y_test_predict=Logistic_model.predict(X_test)
Logistic_model_score_test=Logistic_model.score(X_test,y_test) ## Accuracy
print("The Logistic Regression Model Score on test data set is %.3f " % Logistic_model_sco
print(metrics.confusion_matrix(y_test,y_test_predict)) ## Confusion Matrix
print(metrics.classification_report(y_test,y_test_predict)) ## Classification re
The Logistic Regression Model Score on test data set is 0.829
[[111 42]
[ 36 267]]
precision recall f1-score support
0 0.76 0.73 0.74 153
1 0.86 0.88 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.81 456
weighted avg 0.83 0.83 0.83 456
In [47]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,Logistic_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()
Training Data and Test Data Confusion Matrix Comparison
In [48]:
f,a = plt.subplots(1,2,sharex=True,sharey=True,squeeze=False)
#Plotting confusion matrix for the different models for the Training Data
plot_0 = sns.heatmap((metrics.confusion_matrix(y_train,y_train_predict)),annot=True,fmt='.5g',cmap='Blues',ax=a[0][0])
a[0][0].set_title('Training Data')
plot_1 = sns.heatmap((metrics.confusion_matrix(y_test,y_test_predict)),annot=True,fmt='.5g',cmap='Blues',ax=a[0][1])
a[0][1].set_title('Test Data');
Training Data and Test Data Classification Report Comparison
In [49]:
print('Classification Report of the training data:\n\n',metrics.classification_report(y_train,y_train_predict))
print('Classification Report of the test data:\n\n',metrics.classification_report(y_test,y_test_predict))
Classification Report of the training data:
precision recall f1-score support
0 0.75 0.64 0.69 307
1 0.86 0.91 0.89 754
accuracy 0.83 1061
macro avg 0.81 0.78 0.79 1061
weighted avg 0.83 0.83 0.83 1061
Classification Report of the test data:
precision recall f1-score support
0 0.76 0.73 0.74 153
1 0.86 0.88 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.81 456
weighted avg 0.83 0.83 0.83 456
Applying GridSearchCV for Logistic Regression
In [50]:
grid={'penalty':['l2','none','l1','elasticnet'],
'solver':['liblinear','lbfgs','newton-cg'],
'tol':[0.0001,0.00001],
'max_iter': [10000, 5000,15000]}
In [51]:
from sklearn.model_selection import RepeatedStratifiedKFold
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator = Logistic_model, param_grid = grid, cv = cv, n_jobs=2, scoring='f1')
grid_search.fit(X_train, y_train)
[LibLinear]
Out[51]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1),
             estimator=LogisticRegression(max_iter=10000, n_jobs=2, penalty='none',
                                          solver='newton-cg', verbose=True),
             n_jobs=2,
             param_grid={'max_iter': [10000, 5000, 15000],
                         'penalty': ['l2', 'none', 'l1', 'elasticnet'],
                         'solver': ['liblinear', 'lbfgs', 'newton-cg'],
                         'tol': [0.0001, 1e-05]},
             scoring='f1')
In [52]:
print(grid_search.best_params_,'\n')
print(grid_search.best_estimator_)
{'max_iter': 10000, 'penalty': 'l2', 'solver': 'liblinear', 'tol': 0.0001}
LogisticRegression(max_iter=10000, n_jobs=2, solver='liblinear', verbose=True)
In [53]:
best_model_lr = grid_search.best_estimator_
In [54]:
# Prediction on the training set
ytrain_predict_lr = best_model_lr.predict(X_train)
ytest_predict_lr = best_model_lr.predict(X_test)
In [55]:
## Getting the probabilities on the test set
ytest_predict_prob=best_model_lr.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()
Out[55]:
0 1
0 0.428858 0.571142
1 0.155518 0.844482
2 0.006996 0.993004
3 0.839503 0.160497
4 0.066109 0.933891
Model Evaluation for Train Data
In [56]:
print("The Best Logistic Regression Model Score on train data set post tuning is %.3f " % b
The Best Logistic Regression Model Score on train data set post tuning is 0.
834
In [57]:
# Get the confusion matrix on the train data
confusion_matrix(y_train,best_model_lr.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lr.predict(X_train)),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()
In [58]:
# predict probabilities
probs = best_model_lr.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for LR Tuned Model train data set %.2f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='green')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for for LR Tuned Model train data set",fontsize=14,color = 'red');
The ROC_AUC score for LR Tuned Model train data set 0.89
Model Evaluation for Test Data
In [59]:
print("The Best Logistic Regression Model Score on train data post tuning set is %.3f " % b
The Best Logistic Regression Model Score on train data post tuning set is 0.
829
In [60]:
# Get the confusion matrix on the test data
confusion_matrix(y_test,best_model_lr.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lr.predict(X_test)),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()
In [61]:
# predict probabilities
probs = best_model_lr.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for LR Tuned Model test data set %.2f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='green')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for for LR Tuned Model test data set",fontsize=14,color = 'red');
The ROC_AUC score for LR Tuned Model test data set 0.88
In [62]:
print('Classification Report of the training data:\n\n',classification_report(y_train, ytrain_predict_lr))
print('Classification Report of the test data:\n\n',classification_report(y_test, ytest_predict_lr))
Classification Report of the training data:
precision recall f1-score support
0 0.75 0.64 0.69 307
1 0.86 0.91 0.89 754
accuracy 0.83 1061
macro avg 0.81 0.78 0.79 1061
weighted avg 0.83 0.83 0.83 1061
Classification Report of the test data:
precision recall f1-score support
0 0.76 0.73 0.74 153
1 0.86 0.88 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.81 456
weighted avg 0.83 0.83 0.83 456
In [63]:
(best_model_lr.score(X_train, y_train)-best_model_lr.score(X_test, y_test))
Out[63]:
0.00517138746961654
LDA (linear discriminant analysis)
In [64]:
LDA_model=LinearDiscriminantAnalysis()
LDA_model.fit(X_train,y_train)
Out[64]:
LinearDiscriminantAnalysis()
In [65]:
## Performance Matrix on train data set
y_train_predict=LDA_model.predict(X_train)
LDA_model_score_train=LDA_model.score(X_train,y_train)
print("The LDA Model Score on train data set is %.3f " % LDA_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))
The LDA Model Score on train data set is 0.834
[[200 107]
[ 69 685]]
precision recall f1-score support
0 0.74 0.65 0.69 307
1 0.86 0.91 0.89 754
accuracy 0.83 1061
macro avg 0.80 0.78 0.79 1061
weighted avg 0.83 0.83 0.83 1061
In [66]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,LDA_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()
In [67]:
#Performance Matrix on test data set
y_test_predict=LDA_model.predict(X_test)
LDA_model_score_test=LDA_model.score(X_test,y_test)
print("The LDA Model Score on test data set is %.3f " % LDA_model_score_test)
print(metrics.confusion_matrix(y_test,y_test_predict))
print(metrics.classification_report(y_test,y_test_predict))
The LDA Model Score on test data set is 0.831
[[111 42]
[ 35 268]]
precision recall f1-score support
0 0.76 0.73 0.74 153
1 0.86 0.88 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.81 456
weighted avg 0.83 0.83 0.83 456
In [68]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,LDA_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()
Applying GridSearchCV for LDA
In [69]:
grid_lda ={'solver' :['svd', 'lsqr', 'eigen']}
grid_search_lda = GridSearchCV(estimator = LDA_model, param_grid = grid_lda, cv = cv, n_jobs=2)
grid_search_lda.fit(X_train, y_train)
best_model_lda = grid_search_lda.best_estimator_
Model Evaluation for Train Data
In [70]:
ytrain_predict_lda = best_model_lda.predict(X_train)
ytest_predict_lda= best_model_lda.predict(X_test)
In [71]:
## Getting the probabilities on the test set
ytest_predict_prob=best_model_lda.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()
Out[71]:
0 1
0 0.466328 0.533672
1 0.137291 0.862709
2 0.005950 0.994050
3 0.866706 0.133294
4 0.053474 0.946526
In [72]:
#### Model Evaluation for Train Data
print("The Best LDA Model Score on train data set post tuning is %.3f " % best_model_lda.sc
The Best LDA Model Score on train data set post tuning is 0.835
In [73]:
# Get the confusion matrix on the train data
confusion_matrix(y_train,best_model_lda.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lda.predict(X_train)),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()
In [74]:
# predict probabilities
probs = best_model_lda.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for LDA Tuned Model train data set %.3f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for for LDA Tuned Model train data set",fontsize=14,color = 'red');
The ROC_AUC score for LDA Tuned Model train data set 0.890
Model Evaluation for Test Data
In [75]:
#### Model Evaluation for Test Data
print("The Best LDA Model Score on test data set post tuning is %.3f " % best_model_lda.score(X_test, y_test))
The Best LDA Model Score on test data set post tuning is 0.831
In [76]:
# Get the confusion matrix on the Test data
confusion_matrix(y_test,best_model_lda.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lda.predict(X_test)),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()
In [77]:
# predict probabilities
probs = best_model_lda.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for LDA Tuned Model test data set %.3f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for for LDA Tuned Model test data set",fontsize=14,color = 'red');
The ROC_AUC score for LDA Tuned Model test data set 0.888
In [78]:
### Classification of Best LDA Model on Train and Test Data
print('Classification Report of the training data:\n\n',classification_report(y_train, ytrain_predict_lda))
print('Classification Report of the test data:\n\n',classification_report(y_test, ytest_predict_lda))
Classification Report of the training data:
precision recall f1-score support
0 0.74 0.65 0.70 307
1 0.87 0.91 0.89 754
accuracy 0.84 1061
macro avg 0.81 0.78 0.79 1061
weighted avg 0.83 0.84 0.83 1061
Classification Report of the test data:
precision recall f1-score support
0 0.76 0.73 0.74 153
1 0.86 0.88 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.81 456
weighted avg 0.83 0.83 0.83 456
In [79]:
(best_model_lda.score(X_train, y_train)-best_model_lda.score(X_test, y_test))*100
Out[79]:
0.3920912082279293
KNN Model
KNN generally performs best when the data are preprocessed so that all variables are on a similar scale and centered.
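A small sketch with made-up numbers shows why: without scaling, the Euclidean distances KNN relies on are dominated by whichever feature has the largest range (here, age).
# Hypothetical voters as (age, rating) pairs
a = np.array([25, 3])
b = np.array([60, 3])   # far apart in age, same rating
c = np.array([25, 5])   # same age, different rating
print(np.linalg.norm(a - b))  # 35.0, the age gap dominates
print(np.linalg.norm(a - c))  # 2.0, the rating gap barely registers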
In [80]:
KNN_model=KNeighborsClassifier()
KNN_model.fit(X_train,y_train)
Out[80]:
KNeighborsClassifier()
In [81]:
## Performance Matrix on train data set
y_train_predict = KNN_model.predict(X_train)
KNN_model_score_train=KNN_model.score(X_train, y_train)
print("The KNN Model Score on Train data %.3f " % KNN_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The KNN Model Score on Train data 0.857
[[217 90]
[ 62 692]]
precision recall f1-score support
0 0.78 0.71 0.74 307
1 0.88 0.92 0.90 754
accuracy 0.86 1061
macro avg 0.83 0.81 0.82 1061
weighted avg 0.85 0.86 0.85 1061
In [82]:
## Performance Matrix on test data set
y_test_predict = KNN_model.predict(X_test)
KNN_model_score_test = KNN_model.score(X_test, y_test)
print("The KNN Model Score on Test data %.3f " % KNN_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The KNN Model Score on Test data 0.827
[[109 44]
[ 35 268]]
precision recall f1-score support
0 0.76 0.71 0.73 153
1 0.86 0.88 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.80 456
weighted avg 0.82 0.83 0.83 456
Run KNN with the number of neighbours set to 1, 3, 5, ..., 19 and find the optimal number of neighbours using the misclassification error.
Misclassification error (MCE) = 1 - test accuracy score. For example, a test accuracy of 0.8289 gives MCE = 1 - 0.8289 = 0.1711. Calculate the MCE for each model with neighbours = 1, 3, 5, ..., 19 and pick the model with the lowest MCE.
In [83]:
# empty list that will hold accuracy scores
ac_scores = []
# perform accuracy metrics for values from 1,3,5....19
for k in range(1,20,2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # evaluate test accuracy
    scores = knn.score(X_test, y_test)
    ac_scores.append(scores)
# changing to misclassification error
MCE = [1 - x for x in ac_scores]
MCE
Out[83]:
[0.2149122807017544,
0.19736842105263153,
0.17324561403508776,
0.1842105263157895,
0.18201754385964908,
0.17105263157894735,
0.17763157894736847,
0.16885964912280704,
0.16666666666666663,
0.17105263157894735]
Plot misclassification error vs k (with k value on X-axis)
In [84]:
# plot misclassification error vs k
plt.plot(range(1,20,2), MCE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.title("Misclassicication error Vs K Value",fontsize=14,color = 'red');
plt.show()
K = 11 gives one of the lowest misclassification errors on the test data, so we will build the model with k = 11.
In [85]:
from sklearn.neighbors import KNeighborsClassifier
KNN_model_1=KNeighborsClassifier(n_neighbors= 11)
KNN_model_1.fit(X_train,y_train)
Out[85]:
KNeighborsClassifier(n_neighbors=11)
Performance Matrix of KNN New Model on train data set
In [86]:
## Performance Matrix on train data set
y_train_predict = KNN_model_1.predict(X_train)
KNN_model_score_train_New=KNN_model_1.score(X_train, y_train)
print("The KNN Model Score on Train data %.3f " % KNN_model_score_train_New)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The KNN Model Score on Train data 0.843
[[206 101]
[ 66 688]]
precision recall f1-score support
0 0.76 0.67 0.71 307
1 0.87 0.91 0.89 754
accuracy 0.84 1061
macro avg 0.81 0.79 0.80 1061
weighted avg 0.84 0.84 0.84 1061
In [87]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,KNN_model_1.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('KNN-Confusion Matrix-Train Data')
plt.show()
In [88]:
# predict probabilities
probs = KNN_model_1.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for KNN train data set %.3f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for for KNN test data set",fontsize=14,color = 'red');
The ROC_AUC score for KNN train data set 0.911
Performance Matrix of KNN New Model on test data set
In [89]:
## Performance Matrix on test data set
y_test_predict = KNN_model_1.predict(X_test)
KNN_model_score_test_New = KNN_model_1.score(X_test, y_test)
print("The KNN Model Score on Test data %.3f " % KNN_model_score_test_New)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The KNN Model Score on Test data 0.829
[[105 48]
[ 30 273]]
precision recall f1-score support
0 0.78 0.69 0.73 153
1 0.85 0.90 0.88 303
accuracy 0.83 456
macro avg 0.81 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456
In [90]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,KNN_model_1.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('KNN-Confusion Matrix-Test Data')
plt.show()
In [91]:
# predict probabilities
probs = KNN_model_1.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for KNN train data set %.3f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for for KNN test data set",fontsize=14,color = 'red');
The ROC_AUC score for KNN train data set 0.889
Naive Bayes
In [92]:
NB_model=GaussianNB()
NB_model.fit(X_train, y_train)
Out[92]:
GaussianNB()
The GaussianNB classifier is now built and trained on the training data using the fit() method. With the classifier trained, the model is ready to make predictions; we use the predict() method with the test-set features as its argument.
In [93]:
#Performance Matrix on train data set
y_train_predict=NB_model.predict(X_train)
Naive_Bayes_model_score_train=NB_model.score(X_train, y_train)  ## Accuracy
print("The Naive Bayes Model Score on train data is %.3f " % Naive_Bayes_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict)) ## confusion_matrix
print(metrics.classification_report(y_train,y_train_predict)) ## classification_report
The Naive Bayes Model Score on train data is 0.834
[[212 95]
[ 81 673]]
precision recall f1-score support
0 0.72 0.69 0.71 307
1 0.88 0.89 0.88 754
accuracy 0.83 1061
macro avg 0.80 0.79 0.80 1061
weighted avg 0.83 0.83 0.83 1061
In [94]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,NB_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('NB-Confusion Matrix-Train Data')
plt.show()
In [95]:
## Performance Matrix on test data set
y_test_predict = NB_model.predict(X_test)
Naive_Bayes_model_score_test=NB_model.score(X_test, y_test) ## Accuracy
print("The Naive Bayes Model Score on test data is %.3f " % Naive_Bayes_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict)) ## confusion_matrix
print(metrics.classification_report(y_test, y_test_predict)) ## classification_report
The Naive Bayes Model Score on test data is 0.822
[[112 41]
[ 40 263]]
precision recall f1-score support
0 0.74 0.73 0.73 153
1 0.87 0.87 0.87 303
accuracy 0.82 456
macro avg 0.80 0.80 0.80 456
weighted avg 0.82 0.82 0.82 456
In [96]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,NB_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('NB-Confusion Matrix-Test Data')
plt.show()
Naive Bayes with SMOTE
In [97]:
from imblearn.over_sampling import SMOTE
#SMOTE is only applied on the train data set
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())
In [98]:
X_train.shape
Out[98]:
(1061, 8)
In [99]:
## Let's check the shape after SMOTE
X_train_res.shape
Out[99]:
(1508, 8)
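As a quick sanity check (not part of the original run), the resampled training target should now be perfectly balanced between the two classes:
# Class counts before and after SMOTE
print(pd.Series(y_train).value_counts())      # 754 vs 307 before resampling
print(pd.Series(y_train_res).value_counts())  # 754 vs 754 after resampling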
In [100]:
NB_SM_model = GaussianNB()
NB_SM_model.fit(X_train_res, y_train_res)
Out[100]:
GaussianNB()
In [101]:
## Performance Matrix on train data set with SMOTE
y_train_predict = NB_SM_model.predict(X_train_res)
SMOTE_model_score_train = NB_SM_model.score(X_train_res, y_train_res)
print("The SMOTE Model Score for train data set is %.3f " % SMOTE_model_score_train)
print(metrics.confusion_matrix(y_train_res, y_train_predict))
print(metrics.classification_report(y_train_res ,y_train_predict))
The SMOTE Model Score for train data set is 0.822
[[616 138]
[131 623]]
precision recall f1-score support
0 0.82 0.82 0.82 754
1 0.82 0.83 0.82 754
accuracy 0.82 1508
macro avg 0.82 0.82 0.82 1508
weighted avg 0.82 0.82 0.82 1508
In [102]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train_res,NB_SM_model.predict(X_train_res))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()
ROC_AUC Curve for Naive Bayes with SMOTE Model on train data set
In [103]:
probs = NB_SM_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for Naive Bayes with SMOTE train data set %.3f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for for Naive Bayes with SMOTE train data set",fontsize=14,color = 're
The ROC_AUC score for Naive Bayes with SMOTE train data set 0.887
In [104]:
## Performance Matrix on test data set
y_test_predict = NB_SM_model.predict(X_test)
SMOTE_model_score_test = NB_SM_model.score(X_test, y_test)
print("The SMOTE Model Score for test data set is %.3f " % SMOTE_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The SMOTE Model Score for test data set is 0.809
[[125 28]
[ 59 244]]
precision recall f1-score support
0 0.68 0.82 0.74 153
1 0.90 0.81 0.85 303
accuracy 0.81 456
macro avg 0.79 0.81 0.80 456
weighted avg 0.82 0.81 0.81 456
In [105]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,NB_SM_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()
ROC_AUC Curve for Naive Bayes with SMOTE Model on test data set
In [106]:
probs_test = NB_SM_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The ROC_AUC score for Naive Bayes with SMOTE test data set %.3f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr)
plt.title("ROC Curve for Naive Bayes with SMOTE test data set",fontsize=14,color = 'red');
The ROC_AUC score for Naive Bayes with SMOTE test data set 0.876
Random Forest
In [107]:
RF_model=RandomForestClassifier(n_estimators=100,random_state=1)
RF_model.fit(X_train, y_train)
Out[107]:
RandomForestClassifier(random_state=1)
In [108]:
## Performance Matrix on train data set
y_train_predict = RF_model.predict(X_train)
RF_model_score_train =RF_model.score(X_train, y_train)
print("The random Forest Score on train data is %.2f " % RF_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The random Forest Score on train data is 1.00
[[307 0]
[ 0 754]]
precision recall f1-score support
0 1.00 1.00 1.00 307
1 1.00 1.00 1.00 754
accuracy 1.00 1061
macro avg 1.00 1.00 1.00 1061
weighted avg 1.00 1.00 1.00 1061
In [109]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,RF_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Train Data')
plt.show()
In [110]:
## Performance Matrix on test data set
y_test_predict = RF_model.predict(X_test)
RF_model_score_test = RF_model.score(X_test, y_test)
print("The random Forest Score on test data is %.3f " % RF_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The random Forest Score on test data is 0.831
[[104 49]
[ 28 275]]
precision recall f1-score support
0 0.79 0.68 0.73 153
1 0.85 0.91 0.88 303
accuracy 0.83 456
macro avg 0.82 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456
In [111]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,RF_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix-Test Data')
plt.show()
In [112]:
(RF_model_score_train-RF_model_score_test)*100
Out[112]:
16.885964912280706
Bagging
In [113]:
cart=RandomForestClassifier()
Bagging_model=BaggingClassifier(base_estimator=cart,n_estimators=100, random_state=1)
Bagging_model.fit(X_train,y_train)
Out[113]:
BaggingClassifier(base_estimator=RandomForestClassifier(), n_estimators=100,
random_state=1)
In [114]:
## Performance Matrix on train data set
y_train_predict=Bagging_model.predict(X_train)
Bagging_model_score_train=Bagging_model.score(X_train,y_train)
print("The Bagging Model Score for train data set is %.2f " % Bagging_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))
The Bagging Model Score for train data set is 0.97
[[278 29]
[ 5 749]]
precision recall f1-score support
0 0.98 0.91 0.94 307
1 0.96 0.99 0.98 754
accuracy 0.97 1061
macro avg 0.97 0.95 0.96 1061
weighted avg 0.97 0.97 0.97 1061
In [115]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,Bagging_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Bagging-Confusion Matrix-Train Data')
plt.show()
In [116]:
## Performance Matrix on test data set
y_test_predict=Bagging_model.predict(X_test)
Bagging_model_score_test=Bagging_model.score(X_test,y_test)
print("The Bagging Model Score for test data set is %.2f " % Bagging_model_score_test)
print(metrics.confusion_matrix(y_test,y_test_predict))
print(metrics.classification_report(y_test,y_test_predict))
The Bagging Model Score for test data set is 0.83
[[104 49]
[ 29 274]]
precision recall f1-score support
0 0.78 0.68 0.73 153
1 0.85 0.90 0.88 303
accuracy 0.83 456
macro avg 0.82 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456
In [117]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,Bagging_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Bagging-Confusion Matrix-Test Data')
plt.show()
In [118]:
(Bagging_model_score_train-Bagging_model_score_test)
Out[118]:
0.13900739123964478
Boosting
Ada Boost
In [119]:
ADB_model=AdaBoostClassifier(n_estimators=100,random_state=1)
ADB_model.fit(X_train,y_train)
Out[119]:
AdaBoostClassifier(n_estimators=100, random_state=1)
In [120]:
## Performance Matrix on train data set
y_train_predict=ADB_model.predict(X_train)
ADB_model_score_train=ADB_model.score(X_train,y_train)
print("The ADA boost Model Score for train data set is %.3f " % ADB_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))
The ADA boost Model Score for train data set is 0.850
[[214 93]
[ 66 688]]
precision recall f1-score support
0 0.76 0.70 0.73 307
1 0.88 0.91 0.90 754
accuracy 0.85 1061
macro avg 0.82 0.80 0.81 1061
weighted avg 0.85 0.85 0.85 1061
In [121]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,ADB_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('ADA Boost-Confusion Matrix-Train Data')
plt.show()
In [122]:
## Performance Matrix on test data set
y_test_predict = ADB_model.predict(X_test)
ADB_model_score_test = ADB_model.score(X_test, y_test)
print("The ADA boost Model Score for test data set is %.3f " % ADB_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The ADA boost Model Score for test data set is 0.814
[[103 50]
[ 35 268]]
precision recall f1-score support
0 0.75 0.67 0.71 153
1 0.84 0.88 0.86 303
accuracy 0.81 456
macro avg 0.79 0.78 0.79 456
weighted avg 0.81 0.81 0.81 456
In [123]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,ADB_model.predict(X_test))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('ADA boost-Confusion Matrix-Test Data')
plt.show()
In [124]:
(ADB_model_score_train-ADB_model_score_test)*100
Out[124]:
3.654488483225027
Gradient Boosting
In [125]:
gbc_model=GradientBoostingClassifier(random_state=1)
gbc_model.fit(X_train, y_train)
Out[125]:
GradientBoostingClassifier(random_state=1)
In [126]:
## Performance Matrix on train data set
y_train_predict = gbc_model.predict(X_train)
gbc_model_score_train = gbc_model.score(X_train, y_train)
print("The Gradient Boosting Score for train data set is %.2f " % gbc_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The Gradient Boosting Score for train data set is 0.89
[[239 68]
[ 46 708]]
precision recall f1-score support
0 0.84 0.78 0.81 307
1 0.91 0.94 0.93 754
accuracy 0.89 1061
macro avg 0.88 0.86 0.87 1061
weighted avg 0.89 0.89 0.89 1061
In [127]:
# Get the confusion matrix on the train data
sns.heatmap((metrics.confusion_matrix(y_train,gbc_model.predict(X_train))),annot=True,fmt='.5g',cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Gradient Boost-Confusion Matrix-Train Data')
plt.show()
In [128]:
## Performance metrics on test data set
y_test_predict = gbc_model.predict(X_test)
gbc_model_score_test = gbc_model.score(X_test, y_test)
print("The Gradient Boosting Score for test data set is %.2f " % gbc_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The Gradient Boosting Score for test data set is 0.84
[[105 48]
[ 27 276]]
precision recall f1-score support
0 0.80 0.69 0.74 153
1 0.85 0.91 0.88 303
accuracy 0.84 456
macro avg 0.82 0.80 0.81 456
weighted avg 0.83 0.84 0.83 456
In [129]:
# Get the confusion matrix on the test data
sns.heatmap((metrics.confusion_matrix(y_test,gbc_model.predict(X_test))),annot=True,fmt='.5g')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Gradient Boosting-Confusion Matrix-Test Data')
plt.show()
In [130]:
(gbc_model_score_train-gbc_model_score_test)*100
Out[130]:
5.702787836698253
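Gradient Boosting shows a larger train-test gap (about 5.7 percentage points) than AdaBoost, and the model above was fitted with default hyperparameters. A small grid search is one way to trade a little train accuracy for a smaller gap; the sketch below uses illustrative, assumed parameter ranges rather than tuned values (GridSearchCV is already imported):
# Sketch only: a modest grid over the main regularizing knobs of GBC.
param_grid = {'n_estimators': [100, 200],
              'learning_rate': [0.05, 0.1],
              'max_depth': [2, 3]}
gbc_grid = GridSearchCV(GradientBoostingClassifier(random_state=1),
                        param_grid, cv=5, scoring='accuracy', n_jobs=-1)
gbc_grid.fit(X_train, y_train)
print(gbc_grid.best_params_, gbc_grid.best_score_)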
Performance Metrics of Logistic Regression on train data set
In [131]:
## Performance metrics on train data set
print("The Best Logistic Regression Model Score on train data set is %.2f " % best_model_lr.score(X_train, y_train))
# Get the confusion matrix on the train data
confusion_matrix(y_train,best_model_lr.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lr.predict(X_train)),annot=True, fmt='.5g')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LR-Confusion Matrix-Train Data')
plt.show()
The Best Logistic Regression Model Score on train data set is 0.83
ROC_AUC Curve for Logistic Regression on train data set
In [132]:
# predict probabilities
probs = best_model_lr.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print('The ROC_AUC score for Logistic Regression Train data set: %.3f' % auc)
# calculate ROC curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Logistic Regression Train data set",fontsize=14,color = 'red');
The ROC_AUC score for Logistic Regression Train data set: 0.890
Performance Metrics of Logistic Regression on test data set
In [133]:
## Performance metrics on test data set
print("The Best Logistic Regression Model Score on test data set is %.2f " % best_model_lr.score(X_test, y_test))
# Get the confusion matrix on the test data
confusion_matrix(y_test,best_model_lr.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lr.predict(X_test)),annot=True, fmt='.5g')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LR-Confusion Matrix-test data')
plt.show()
The Best Logistic Regression Model Score on test data set is 0.83
ROC_AUC Curve for Logistic Regression on test data set
In [134]:
# predict probabilities
probs = best_model_lr.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('The ROC_AUC score for Logistic Regression Test data set : %.3f' % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for Logistic Regression Test data set ",fontsize=14,color = 'red');
The ROC_AUC score for Logistic Regression Test data set : 0.883
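Since the train and test FPR/TPR arrays from the two cells above are still in scope, they can be overlaid on a single plot to make the small AUC difference (0.890 vs 0.883) visible at a glance. This overlay is an added sketch, not one of the original cells:
# Sketch only: overlay the train and test ROC curves computed above.
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr, label='Train (AUC = 0.890)')
plt.plot(test_fpr, test_tpr, label='Test (AUC = 0.883)')
plt.legend()
plt.title("ROC Curves for Logistic Regression: Train vs Test",fontsize=14,color = 'red');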
Performance Metrics of LDA (linear discriminant analysis) on train data set
In [135]:
## Performance metrics on train data set
print("The Best LDA Model Score on train data set is %.2f " % best_model_lda.score(X_train, y_train))
# Get the confusion matrix on the train data
confusion_matrix(y_train,best_model_lda.predict(X_train))
sns.heatmap(confusion_matrix(y_train,best_model_lda.predict(X_train)),annot=True, fmt='.5g')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LDA-Confusion Matrix-Train data')
plt.show()
The Best LDA Model Score on train data set is 0.84
ROC_AUC Curve for LDA (linear discriminant analysis) on train data set
In [136]:
# predict probabilities
probs = best_model_lda.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for LDA Train data set %.2f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for LDA Train data set",fontsize=14,color = 'red');
The ROC_AUC score for LDA Train data set is 0.89
Performance Metrics of LDA (linear discriminant analysis) on test data set
In [137]:
# Performance metrics on test data set
print("The Best LDA Model Score on test data set is %.2f " % best_model_lda.score(X_test, y_test))
# Get the confusion matrix on the Test data
confusion_matrix(y_test,best_model_lda.predict(X_test))
sns.heatmap(confusion_matrix(y_test,best_model_lda.predict(X_test)),annot=True, fmt='.5g')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('LDA-Confusion Matrix-Test Data')
plt.show()
The Best LDA Model Score on test data set is 0.83
ROC_AUC Curve for LDA (linear discriminant analysis) on test data set
In [138]:
probs = best_model_lda.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for LDA Test data set is %.3f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for LDA Test data set",fontsize=14,color = 'red');
The ROC_AUC score for LDA Test data set is 0.888
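As a side note, roc_curve also returns the candidate thresholds, so the probability cutoff maximizing Youden's J statistic (TPR minus FPR) can be read directly off the arrays computed above. This is an added illustration, not part of the original analysis:
# Sketch only: pick the threshold maximizing TPR - FPR on the test curve.
j_scores = test_tpr - test_fpr
best_idx = np.argmax(j_scores)
print("Best threshold by Youden's J: %.3f (TPR=%.3f, FPR=%.3f)"
      % (test_thresholds[best_idx], test_tpr[best_idx], test_fpr[best_idx]))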
Performance Metrics of KNN on train data set
In [139]:
## Performance metrics on train data set
y_train_predict = KNN_model_1.predict(X_train)
KNN_model_score_train_New=KNN_model_1.score(X_train, y_train)
print("The KNN Model Score on Train data is %.2f " % KNN_model_score_train_New)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The KNN Model Score on Train data is 0.84
[[206 101]
[ 66 688]]
precision recall f1-score support
0 0.76 0.67 0.71 307
1 0.87 0.91 0.89 754
accuracy 0.84 1061
macro avg 0.81 0.79 0.80 1061
weighted avg 0.84 0.84 0.84 1061
ROC_AUC Curve for KNN on train data set
In [140]:
# predict probabilities
probs = KNN_model_1.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for KNN train data set %.2f " % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for for KNN test data set",fontsize=14,color = 'red');
The ROC_AUC score for KNN train data set 0.91
Performance Metrics of KNN on test data set
In [141]:
## Performance metrics on test data set
y_test_predict = KNN_model_1.predict(X_test)
KNN_model_score_test_New = KNN_model_1.score(X_test, y_test)
print("The KNN Model Score on Test data is %.2f " % KNN_model_score_test_New)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The KNN Model Score on Test data is 0.83
[[105 48]
[ 30 273]]
precision recall f1-score support
0 0.78 0.69 0.73 153
1 0.85 0.90 0.88 303
accuracy 0.83 456
macro avg 0.81 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456
ROC_AUC Curve for KNN on test data set
In [142]:
# predict probabilities
probs = KNN_model_1.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print("The ROC_AUC score for KNN train data set %.2f " % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for for KNN test data set",fontsize=14,color = 'red');
The ROC_AUC score for KNN train data set 0.89
Performance Metrics of Naive Bayes with SMOTE on train data set
In [143]:
## Performance metrics on train data set with SMOTE
y_train_predict = NB_SM_model.predict(X_train_res)
SMOTE_model_score_train = NB_SM_model.score(X_train_res, y_train_res)
print("The SMOTE Model Score for train data set is %.2f " % SMOTE_model_score_train)
print(metrics.confusion_matrix(y_train_res, y_train_predict))
print(metrics.classification_report(y_train_res ,y_train_predict))
The SMOTE Model Score for train data set is 0.82
[[616 138]
[131 623]]
precision recall f1-score support
0 0.82 0.82 0.82 754
1 0.82 0.83 0.82 754
accuracy 0.82 1508
macro avg 0.82 0.82 0.82 1508
weighted avg 0.82 0.82 0.82 1508
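The X_train_res and y_train_res arrays come from an earlier SMOTE resampling step. For context only, here is a minimal sketch of how such a step is typically done with the imbalanced-learn library; this is an assumption about that earlier cell, not a copy of it. Note that both classes end up with 754 rows, matching the support counts in the report above:
# Sketch only (assumed resampling step): SMOTE oversamples the minority class.
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=1)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
print(pd.Series(y_train_res).value_counts())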
ROC_AUC Curve for Naive Bayes with SMOTE Model on train data set
In [144]:
probs = NB_SM_model.predict_proba(X_train_res)
probs = probs[:, 1]
auc = roc_auc_score(y_train_res, probs)
print("The ROC_AUC score for Naive Bayes with SMOTE train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train_res, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Naive Bayes with SMOTE train data set",fontsize=14,color = 'red');
The ROC_AUC score for Naive Bayes with SMOTE train data set is 0.90
Performance Metrics of Naive Bayes with SMOTE on test data set
In [145]:
## Performance metrics on test data set
y_test_predict = NB_SM_model.predict(X_test)
SMOTE_model_score_test = NB_SM_model.score(X_test, y_test)
print("The SMOTE Model Score for test data set is %.2f " % SMOTE_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The SMOTE Model Score for test data set is 0.81
[[125 28]
[ 59 244]]
precision recall f1-score support
0 0.68 0.82 0.74 153
1 0.90 0.81 0.85 303
accuracy 0.81 456
macro avg 0.79 0.81 0.80 456
weighted avg 0.82 0.81 0.81 456
ROC_AUC Curve for Naive Bayes with SMOTE Model on test data set
In [146]:
probs_test = NB_SM_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The ROC_AUC score for Naive Bayes with SMOTE Model on test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr)
plt.title("ROC Curve for Naive Bayes with SMOTE Model on test data set",fontsize=14,color =
The ROC_AUC score for Naive Bayes with SMOTE Model on test data set is 0.88
Performance Metrics of Random Forest on train data set
In [147]:
## Performance metrics on train data set
y_train_predict = RF_model.predict(X_train)
RF_model_score_train = RF_model.score(X_train, y_train)
print("The Random Forest Score on train data is %.2f " % RF_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The Random Forest Score on train data is 1.00
[[307 0]
[ 0 754]]
precision recall f1-score support
0 1.00 1.00 1.00 307
1 1.00 1.00 1.00 754
accuracy 1.00 1061
macro avg 1.00 1.00 1.00 1061
weighted avg 1.00 1.00 1.00 1061
In [148]:
Recall=(754/(0+754))
print("Random Forest-Train Data Set-Recall for class 1 is %.2f " % Recall)
Random Forest-Train Data Set-Recall for class 1 is 1.00
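Computing recall by hand from the confusion-matrix counts works, but sklearn can do it directly. The one-liner below is an equivalent added sketch (metrics is already imported):
# Sketch only: recall for class 1 straight from sklearn.
print("Recall for class 1 is %.2f" % metrics.recall_score(y_train, RF_model.predict(X_train)))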
ROC_AUC Curve for Random Forest on train data set
In [149]:
probs = RF_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The AUC_ROC score for Random Forest train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Random Forest train data",fontsize=14,color = 'red');
The ROC_AUC score for Random Forest train data set is 1.00
Performance Metrics of Random Forest on test data set
In [150]:
## Performance metrics on test data set
y_test_predict = RF_model.predict(X_test)
RF_model_score_test = RF_model.score(X_test, y_test)
print("The Random Forest Score on test data is %.2f " % RF_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The Random Forest Score on test data is 0.83
[[104 49]
[ 28 275]]
precision recall f1-score support
0 0.79 0.68 0.73 153
1 0.85 0.91 0.88 303
accuracy 0.83 456
macro avg 0.82 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456
In [151]:
Recall=(275/(28+275))
print("Random Forest-Test Data Set-Recall for class 1 is %.2f " % Recall)
Random Forest-Test Data Set-Recall for class 1 is 0.91
ROC_AUC Curve for Random Forest on test data set
In [152]:
probs_test = RF_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The AUC_ROC score for Random Forest test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for Random Forest Test data set",fontsize=14,color = 'red');
The ROC_AUC score for Random Forest test data set is 0.90
Performance Metrics of Bagging on train data set
In [153]:
## Performance metrics on train data set
y_train_predict=Bagging_model.predict(X_train)
Bagging_model_score_train=Bagging_model.score(X_train,y_train)
print("The Bagging Model Score for train data set is %.2f " % Bagging_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))
The Bagging Model Score for train data set is 0.97
[[278 29]
[ 5 749]]
precision recall f1-score support
0 0.98 0.91 0.94 307
1 0.96 0.99 0.98 754
accuracy 0.97 1061
macro avg 0.97 0.95 0.96 1061
weighted avg 0.97 0.97 0.97 1061
ROC_AUC Curve for Bagging on train data set
In [154]:
probs = Bagging_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for Bagging train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Bagging Train data set",fontsize=14,color = 'red');
The ROC_AUC score for Bagging train data set is 1.00
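A perfect train AUC says little about how a bagged ensemble generalizes. One option, sketched below, is to refit with oob_score=True so each tree is evaluated on the bootstrap samples it never saw, giving an out-of-bag accuracy estimate without touching the test set. The n_estimators value here is an assumption, since the original Bagging_model settings are defined in an earlier cell:
# Sketch only: out-of-bag accuracy estimate for a bagging ensemble.
Bagging_oob = BaggingClassifier(n_estimators=100, oob_score=True, random_state=1)
Bagging_oob.fit(X_train, y_train)
print("OOB accuracy estimate: %.3f" % Bagging_oob.oob_score_)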
Performance Metrics of Bagging on test data set
In [155]:
## Performance metrics on test data set
y_test_predict=Bagging_model.predict(X_test)
Bagging_model_score_test=Bagging_model.score(X_test,y_test)
print("The Bagging Model Score for test data set is %.2f " % Bagging_model_score_test)
print(metrics.confusion_matrix(y_test,y_test_predict))
print(metrics.classification_report(y_test,y_test_predict))
The Bagging Model Score for test data set is 0.83
[[104 49]
[ 29 274]]
precision recall f1-score support
0 0.78 0.68 0.73 153
1 0.85 0.90 0.88 303
accuracy 0.83 456
macro avg 0.82 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456
ROC_AUC Curve for Bagging on test data set
In [156]:
probs_test = Bagging_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The AUC_ROC score for Bagging test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for Bagging Test data set",fontsize=14,color = 'red');
The ROC_AUC score for Bagging test data set is 0.90
Performance Metrics of AdaBoost on train data set
In [157]:
## Performance metrics on train data set
y_train_predict=ADB_model.predict(X_train)
ADB_model_score_train=ADB_model.score(X_train,y_train)
print("The AdaBoost Model Score for train data set is %.3f " % ADB_model_score_train)
print(metrics.confusion_matrix(y_train,y_train_predict))
print(metrics.classification_report(y_train,y_train_predict))
The AdaBoost Model Score for train data set is 0.850
[[214 93]
[ 66 688]]
precision recall f1-score support
0 0.76 0.70 0.73 307
1 0.88 0.91 0.90 754
accuracy 0.85 1061
macro avg 0.82 0.80 0.81 1061
weighted avg 0.85 0.85 0.85 1061
ROC_AUC Curve for AdaBoost on train data set
In [158]:
probs = ADB_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The AUC_ROC score for ADB Model train data set %.2f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for ADB Model train data set",fontsize=14,color = 'red');
The AUC_ROC score for ADB Model train data set 0.91
Performance Metrics of AdaBoost on test data set
In [159]:
## Performance metrics on test data set
y_test_predict = ADB_model.predict(X_test)
ADB_model_score_test = ADB_model.score(X_test, y_test)
print("The AdaBoost Model Score for test data set is %.2f " % ADB_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The AdaBoost Model Score for test data set is 0.81
[[103 50]
[ 35 268]]
precision recall f1-score support
0 0.75 0.67 0.71 153
1 0.84 0.88 0.86 303
accuracy 0.81 456
macro avg 0.79 0.78 0.79 456
weighted avg 0.81 0.81 0.81 456
ROC_AUC Curve for AdaBoost on test data set
In [160]:
probs_test = ADB_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The AUC_ROC score for ADB Model test data set %.2f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);
plt.title("ROC Curve for ADB Model test data set",fontsize=14,color = 'red');
The AUC_ROC score for ADB Model test data set 0.88
Performance Metrics of Gradient Boosting on train data set
In [161]:
## Performance metrics on train data set
y_train_predict = gbc_model.predict(X_train)
gbc_model_score_train = gbc_model.score(X_train, y_train)
print("The Gradient Boosting Score for train data set is %.3f " % gbc_model_score_train)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
The Gradient Boosting Score for train data set is 0.893
[[239 68]
[ 46 708]]
precision recall f1-score support
0 0.84 0.78 0.81 307
1 0.91 0.94 0.93 754
accuracy 0.89 1061
macro avg 0.88 0.86 0.87 1061
weighted avg 0.89 0.89 0.89 1061
In [162]:
Recall=(708/(46+708))
print("Gradient Boosting-Train Data Set-Recall for class 1 is %.3f " % Recall)
Gradient Boosting-Train Data Set-Recall for class 1 is 0.939
ROC_AUC Curve for Gradient Boosting on train data set
In [163]:
probs = gbc_model.predict_proba(X_train)
probs = probs[:, 1]
auc = roc_auc_score(y_train, probs)
print("The ROC_AUC score for Gradient Boosting train data set %.3f " % auc)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(train_fpr, train_tpr);
plt.title("ROC Curve for Gradient Boosting train data set",fontsize=14,color = 'red');
The ROC_AUC score for Gradient Boosting train data set is 0.951
Performance Metrics of Gradient Boosting on test data set
In [164]:
## Performance metrics on test data set
y_test_predict = gbc_model.predict(X_test)
gbc_model_score_test = gbc_model.score(X_test, y_test)
print("The Gradient Boosting Score for test data set is %.3f " % gbc_model_score_test)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
The Gradient Boosting Score for test data set is 0.836
[[105 48]
[ 27 276]]
precision recall f1-score support
0 0.80 0.69 0.74 153
1 0.85 0.91 0.88 303
accuracy 0.84 456
macro avg 0.82 0.80 0.81 456
weighted avg 0.83 0.84 0.83 456
In [165]:
Recall=(276/(27+276))
print("Gradient Boosting-Test Data Set-Recall for class 1 is %.3f " % Recall)
Gradient Boosting-Test Data Set-Recall for class 1 is 0.911
ROC_AUC Curve for Gradient Boosting on test data set
In [166]:
probs_test = gbc_model.predict_proba(X_test)
probs_test = probs_test[:, 1]
auc = roc_auc_score(y_test, probs_test)
print("The ROC_AUC score for Gradient Boosting test data set %.3f " % auc)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr)
plt.title("ROC Curve for Gradient Boosting test data set",fontsize=14,color = 'red');
The ROC_AUC score for Gradient Boosting test data set is 0.899
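Because boosting builds trees sequentially, staged_predict_proba can replay the fitted ensemble after each round, so the test AUC can be tracked per stage to see where additional trees stop helping. This diagnostic is an added sketch, not one of the original cells:
# Sketch only: test AUC after each boosting stage of the fitted model.
test_aucs = [roc_auc_score(y_test, p[:, 1])
             for p in gbc_model.staged_predict_proba(X_test)]
print("Best test ROC_AUC %.3f at stage %d" % (max(test_aucs), int(np.argmax(test_aucs)) + 1))
plt.plot(range(1, len(test_aucs) + 1), test_aucs)
plt.xlabel('Boosting stage')
plt.ylabel('Test ROC_AUC')
plt.title("Gradient Boosting: test ROC_AUC per stage",fontsize=14,color = 'red');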
Comparison of Different Models
In [168]:
print("The Logistic Regression Model Score Post Tuning on train data set is %.3f " % best_m
print("The Logistic Regression Model Score Post Tuning on test data set is %.3f " % best_m
print("The LDA Model Score Post Tuning on train data set is %.3f " % best_model_lda.score(X
print("The LDA Model Score Post Tuning on test data set is %.3f " % best_model_lda.score(X
print("The KNN Model Score Post Tuning on Train data %.3f " % KNN_model_1.score(X_train, y_
print("The KNN Model Score Post Tuning on Test data %.3f " % KNN_model_1.score(X_test, y_te
print("The Naive Bayes Model Score Post Tuning on train data is %.3f " % NB_SM_model.score(
print("The Naive Bayes Model Score Post Tuning on test data is %.3f " % NB_SM_model.score(X
The Logistic Regression Model Score Post Tuning on train data set is 0.834
The Logistic Regression Model Score Post Tuning on test data set is 0.829
The LDA Model Score Post Tuning on train data set is 0.835
The LDA Model Score Post Tuning on test data set is 0.831
The KNN Model Score Post Tuning on Train data 0.843
The KNN Model Score Post Tuning on Test data 0.829
The Naive Bayes Model Score Post Tuning on train data is 0.822
The Naive Bayes Model Score Post Tuning on test data is 0.809
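The same comparison reads more easily as a table. The sketch below collects the scores printed above into a DataFrame (assuming, as in the prints, that the Naive Bayes train score is computed on the SMOTE-resampled data):
# Sketch only: tabulate the post-tuning train/test scores side by side.
comparison = pd.DataFrame(
    {'Train score': [best_model_lr.score(X_train, y_train),
                     best_model_lda.score(X_train, y_train),
                     KNN_model_1.score(X_train, y_train),
                     NB_SM_model.score(X_train_res, y_train_res)],
     'Test score': [best_model_lr.score(X_test, y_test),
                    best_model_lda.score(X_test, y_test),
                    KNN_model_1.score(X_test, y_test),
                    NB_SM_model.score(X_test, y_test)]},
    index=['Logistic Regression', 'LDA', 'KNN', 'Naive Bayes (SMOTE)'])
print(comparison.round(3))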
In [169]:
print("Variance in Test and train Scores of LDA Model is %.5f " % (best_model_lr.score(X_tr
Variance in Test and train Scores of LDA Model is 0.00517
In [170]:
print("Variance in Test and train Scores of LDA Model is %.5f " % (best_model_lda.score(X_t
Variance in Test and train Scores of LDA Model is 0.00392
In [171]:
print("Variance in Test and train Scores of KNN Model for is %.5f " % (KNN_model_1.score(X
Variance in Test and train Scores of KNN Model for is 0.01365
In [172]:
print("Variance in Test and train Scores of LR Model for is %.5f " % (NB_SM_model.score(X_
Variance in Test and train Scores of LR Model for is 0.01241
Cross Validation
In [173]:
from sklearn.model_selection import cross_val_score
In [174]:
scores = cross_val_score(best_model_lda, X_train, y_train, cv=10)
scores
Out[174]:
array([0.78504673, 0.77358491, 0.83962264, 0.85849057, 0.85849057,
0.8490566 , 0.81132075, 0.8490566 , 0.81132075, 0.82075472])
In [175]:
scores = cross_val_score(best_model_lda, X_test, y_test, cv=10)
scores
Out[175]:
array([0.80434783, 0.76086957, 0.86956522, 0.82608696, 0.89130435,
0.86956522, 0.93333333, 0.84444444, 0.75555556, 0.84444444])
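The raw fold scores are easier to compare as a mean and standard deviation. The summary below is an added sketch; note also that running cross_val_score on the held-out test set refits the model on test folds, so the second array is best read as a stability check rather than a true out-of-sample estimate:
# Sketch only: summarize the 10-fold scores reported above.
train_cv = cross_val_score(best_model_lda, X_train, y_train, cv=10)
test_cv = cross_val_score(best_model_lda, X_test, y_test, cv=10)
print("Train CV accuracy: %.3f +/- %.3f" % (train_cv.mean(), train_cv.std()))
print("Test CV accuracy : %.3f +/- %.3f" % (test_cv.mean(), test_cv.std()))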
--------------------------------------END OF PROBLEM 1------------