This document forms the report to predict whether or not an employee will use a car as their mode of transport, using Machine Learning algorithms.
Project 5
Machine Learning
R S Prashanti
Table of Contents
No. Question Pts
1. EDA - Basic data summary, Univariate, Bivariate analysis, graphs, check for outliers and missing values, and check the summary of the dataset 7
2. EDA - Illustrate the insights based on EDA 5
3. EDA - Check for Multicollinearity - plot the graph based on Multicollinearity & treat it 3
4. Data Preparation (SMOTE) 10
5. Applying Logistic Regression & interpret results 3
6. Applying KNN Model & interpret results 3
7. Applying Naïve Bayes Model & interpret results (is it applicable here? Comment, and if it is not applicable, how can you build an NB model in this case?) 3
8. Confusion matrix interpretation 3
9. Remarks on model validation exercise <which model performed the best> 3
10. Bagging 7.5
11. Boosting 7.5
12. Actionable Insights and Recommendations 5
All the questions are answered in the same order as in the Table of Contents.
Project Objective
This project requires an understanding of which mode of transport employees prefer for commuting to their office. The attached data 'Cars.csv' includes information about each employee's mode of transport as well as personal and professional details such as age, salary and work experience. We need to predict whether or not an employee will use a car as their mode of transport, and also identify which variables are significant predictors of this decision.
The following steps will be performed to predict the commute mode:
Perform an EDA on the data
Illustrate the insights based on EDA
Check for Multicollinearity
Data Preparation
Prepare the data for analysis (SMOTE)
Modeling
Create multiple models and explore how each model performs using appropriate model
performance metrics:
KNN
Naïve Bayes
Logistic Regression
Apply both bagging and boosting modeling procedures to create two more models and compare their
accuracy with the best model from the step above.
Note: Each question referenced in the table above has been answered under the same numbering.
Q1. Basic data summary, Univariate, Bivariate analysis, graphs, Check for Outliers and missing values and
check the summary of the dataset.
A1.
setwd("D:/users/Prashanti Sharma/Personal/PROJS/Pred-2")
library(readr)
data <- read.csv("Cars.csv")
1.1 ##Understanding the data
variable.names(data)
[1] "Age" "Gender" "Engineer" "MBA" "Work.Exp" "Salary" "Distance"
[8] "license" "Transport"
str(data)
'data.frame': 444 obs. of 9 variables:
$ Age : int 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
$ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
$ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : int 0 0 0 0 0 1 0 0 0 0 ...
$ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...
dim(data)
> dim(data)
[1] 444 9
View(data)
# Basic data summary
Variables like Engineer, MBA and license have been read as numeric (integer), so they should be converted to factors first.
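A minimal sketch of that conversion (assuming the data frame is named data, as above); the same step is also repeated later before SMOTE:
# Convert the 0/1 coded columns to factors so they are summarised as categories
data$Engineer <- as.factor(data$Engineer)
data$MBA <- as.factor(data$MBA)
data$license <- as.factor(data$license)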
summary(data)
> summary(data)
Age Gender Engineer MBA Work.Exp Salary
Min. :18.00 Female:128 0:109 0 :331 Min. : 0.0 Min. : 6.50
1st Qu.:25.00 Male :316 1:335 1 :112 1st Qu.: 3.0 1st Qu.: 9.80
Median :27.00 NA's: 1 Median : 5.0 Median :13.60
Mean :27.75 Mean : 6.3 Mean :16.24
3rd Qu.:30.00 3rd Qu.: 8.0 3rd Qu.:15.72
Max. :43.00 Max. :24.0 Max. :57.00
Distance license Transport
Min. : 3.20 0:340 2Wheeler : 83
1st Qu.: 8.80 1:104 Car : 61
Median :11.00 Public Transport:300
Mean :11.32
3rd Qu.:13.43
Max. :23.40
We can conclude that the majority of employees are male (approx. 71%, 316 of 444).
Similarly, Engineers outnumber MBAs.
The total number of Engineers and MBAs is greater than 444, hence some candidates possibly hold a dual degree.
One data point for MBA is missing.
Salary appears to have a skewed distribution.
Public transport is the most common mode of transportation.
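A quick cross-tabulation (a sketch, assuming the factor conversion above) supports the dual-degree observation, since employees with Engineer = 1 and MBA = 1 hold both degrees:
table(data$Engineer, data$MBA, useNA = "ifany")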
# Checking null data
sapply(data,function(x) sum(is.na(x)))
> sapply(data,function(x) sum(is.na(x)))
Age Gender Engineer MBA Work.Exp Salary Distance license Transport
0 0 0 1 0 0 0 0 0
#Checking for Outliers
summary(data$Gender)
summary(data$Engineer)
summary(data$MBA)
summary(data$Work.Exp)
summary(data$Salary)
summary(data$Distance)
summary(data$license)
summary(data$Transport)
summary(data$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 25.00 27.00 27.75 30.00 43.00
> summary(data$Gender)
Female Male
128 316
> summary(data$Engineer)
0 1
109 335
> summary(data$MBA)
0 1 NA's
331 112 1
> summary(data$Work.Exp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 3.0 5.0 6.3 8.0 24.0
> summary(data$Salary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.50 9.80 13.60 16.24 15.72 57.00
> summary(data$Distance)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.20 8.80 11.00 11.32 13.43 23.40
> summary(data$license)
0 1
340 104
> summary(data$Transport)
2Wheeler Car Public Transport
83 61 300
#Checking for Missing Values
data.frame(data)
is.na(data)
sum(is.na(data))
> sum(is.na(data))
[1] 1
As stated earlier in the summary analysis, one data point (for MBA) is missing, which is confirmed above.
#Univariate analysis
Univariate analysis of all the continuous factors (numeric)
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
data %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(col = "blue")
#Bivariate Analysis
boxplot(data$Age ~data$Engineer, main = "Age vs Eng.")
boxplot(data$Age ~data$MBA, main ="Age Vs MBA")
As expected, there is not much difference here; people of all qualifications and experience levels would be employed in the firm.
boxplot(data$Salary ~ data$Engineer, main = "Salary vs Eng.")
boxplot(data$Salary ~ data$MBA, main = "Salary vs MBA")
We do not see any appreciable difference in the salaries of Engineers vs non-Engineers or MBAs vs non-MBAs.
Also, the mean salary for both MBAs and Engineers is around 16.
hist(data$Work.Exp, col = "blue", main = "Distribution of work exp")
This distribution is right-skewed; again, this is on expected lines, as there would be more juniors than seniors in any firm.
table(data$license,data$Transport)
2Wheeler Car Public Transport
0 60 13 267
1 23 48 33
boxplot(data$Work.Exp ~ data$Gender)
There is not much difference in mean work experience between the two genders, so the population is similarly distributed for males and females.
Hypothesis Testing
1. The higher the salary, the greater the chance of using a car for the commute.
boxplot(data$Salary~data$Transport, main="Salary vs Transport")
The plot clearly shows that as salary increases, the inclination to commute by car is higher.
2. Similarly, with age or work experience (Age and Work.Exp are expected to be collinear), the propensity to use a car increases.
cor(data$Age, data$Work.Exp)
> cor(data$Age, data$Work.Exp)
[1] 0.9322364
boxplot(data$Age~data$Transport, main="Age vs Transport")
As was the case with salary, we can see a clear demarcation in transport usage: at lower ages a 2-wheeler is preferred, while at higher ages and work experience a car is preferred.
3. As distance increases, employees would prefer a car for comfort and ease.
boxplot(data$Distance~data$Transport, main="Distance vs Transport")
There is a slight pattern observable here: for greater distances a car is preferred, followed by a 2-wheeler and then public transport.
4. Females would prefer private transport more than public transport.
table(data$Gender,data$Transport)
table(data$Gender,data$Transport)
2Wheeler Car Public Transport
Female 38 13 77
Male 45 48 223
We can see that around 40% of females use private transport and about 10% use a car, compared to males, where about 15% prefer a car and around 30% in total use private transport. Thus, although males show a higher percentage of car usage, they are also heavy users of public transport.
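The percentages quoted above can be reproduced as row proportions (a sketch; the output was not part of the original run):
prop.table(table(data$Gender, data$Transport), margin = 1)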
A.2 Insights from EDA
The majority of employees are male (approx. 71%).
Similarly, Engineers outnumber MBAs.
The total number of Engineers and MBAs is greater than 444, hence some candidates possibly hold a dual degree.
One data point for MBA is missing.
Salary appears to have a skewed distribution.
Public transport is the most common mode of transportation.
Around 40% of females use private transport and about 10% use a car, compared to males, where about 15% prefer a car and around 30% in total use private transport; thus, although males show a higher percentage of car usage, they are also heavy users of public transport.
For greater distances a car is preferred, followed by a 2-wheeler and then public transport.
As was the case with salary, there is a clear demarcation in transport usage: at lower ages a 2-wheeler is preferred, while at higher ages and work experience a car is preferred.
With age or work experience (Age and Work.Exp are collinear), the propensity to use a car increases.
A.3 Multicollinearity
library(corrplot)
# Correlation can only be computed on numeric columns, so subset them first
num_vars <- data[, sapply(data, is.numeric)]
cor_matrix <- cor(num_vars, use = "complete.obs")
corrplot(cor_matrix)
library(ppcor)
pcor(na.omit(num_vars), method = "pearson")
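As an additional check not included in the original run, variance inflation factors (VIF) quantify how strongly each predictor is explained by the others; the helper regression below is purely illustrative. Age and Work.Exp are expected to show high VIFs given their correlation of about 0.93, so one of the two can be dropped (or the pair combined) when modelling, which is how the multicollinearity is treated later in this report:
library(car)   # assumed to be installed; provides vif()
vif(lm(Salary ~ Age + Work.Exp + Distance, data = data))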
A.4 SMOTE
## SMOTE
The columns Engineer, MBA and license need to be converted into factors.
library(DMwR)    # provides knnImputation and SMOTE
library(caret)   # provides createDataPartition, train and confusionMatrix
data$Engineer <- as.factor(data$Engineer)
data$MBA <- as.factor(data$MBA)
data$license <- as.factor(data$license)
data <- knnImputation(data)
data$CarUsage<-ifelse(data$Transport =='Car',1,0)
table(data$CarUsage)
sum(data$CarUsage == 1)/nrow(data)
data$CarUsage<-as.factor(data$CarUsage)
data$Engineer<-as.factor(data$Engineer)
> data$MBA<-as.factor(data$MBA)
> data$license<-as.factor( data$license)
> data<-knnImputation(data)
Warning message:
In knnImputation(data) :
No case has missing values. Stopping as there is nothing to do.
>
>
> data$CarUsage<-ifelse(data$Transport =='Car',1,0)
> table(data$CarUsage)
0 1
383 61
>
> sum(data$CarUsage == 1)/nrow(data)
[1] 0.1373874
>
> data$CarUsage<-as.factor(data$CarUsage)
The records for people travelling by car are in the minority, hence we need to apply an appropriate sampling method on the train data.
##Split the data into test and train
set.seed(400)
carindex<-createDataPartition(data$CarUsage, p=0.7,list = FALSE,times = 1)
carsdatatrain<-data[carindex,]
carsdatatest<-data[-carindex,]
prop.table(table(carsdatatrain$CarUsage))
> set.seed(400)
> carindex<-createDataPartition(data$CarUsage, p=0.7,list = FALSE,times = 1)
> carsdatatrain<-data[carindex,]
> carsdatatest<-data[-carindex,]
> prop.table(table(carsdatatrain$CarUsage))
0 1
0.8621795 0.1378205
prop.table(table(carsdatatest$CarUsage))
> prop.table(table(carsdatatest$CarUsage))
0 1
0.8636364 0.1363636
carsdatatrain<-carsdatatrain[,c(1:8,10)]
carsdatatest<-carsdatatest[,c(1:8,10)]
## The train and test data have almost same percentage of cars usage as the base data
## Apply SMOTE on Training data set
library(DMwR)
attach(carsdatatrain)
carsdataSMOTE<-SMOTE(CarUsage~., carsdatatrain, perc.over = 250,perc.under = 150)
prop.table(table(carsdataSMOTE$CarUsage))
> carsdataSMOTE<-SMOTE(CarUsage~., carsdatatrain, perc.over = 250,perc.under = 150)
> prop.table(table(carsdataSMOTE$CarUsage))
0 1
0.5 0.5
We now have an equal split in the data between car users and non-car users. Let us proceed with building the models.
## Model Building
We will use the logistic regression method on the SMOTE data to understand the factors influencing car usage. Since we have only a limited number of variables, we will use them all in model building.
outcomevar<-'CarUsage'
regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)
carsglm<-train(carsdataSMOTE[,regressors],carsdataSMOTE[,outcomevar],method = "glm", family =
"binomial",trControl = trainctrl)
> outcomevar<-'CarUsage'
> regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
> trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)
> carsglm<-train(carsdataSMOTE[,regressors],carsdataSMOTE[,outcomevar],method = "glm", family = "b
inomial",trControl = trainctrl)
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(carsglm$finalModel)
> summary(carsglm$finalModel)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-2.09669 -0.01064 0.00000 0.02798 2.11942
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -64.8739 18.4863 -3.509 0.000449 ***
Age 1.7911 0.5273 3.397 0.000681 ***
Work.Exp -0.2056 0.4347 -0.473 0.636175
Salary -0.1061 2.1572 -0.049 0.960759
Distance 0.6172 0.1957 3.154 0.001611 **
license1 1.5699 0.8530 1.840 0.065711 .
Engineer1 3.1162 1.2024 2.592 0.009554 **
MBA1 -3.4351 1.1784 -2.915 0.003557 **
GenderMale 1.6470 0.9453 1.742 0.081460 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 357.664 on 257 degrees of freedom
Residual deviance: 49.909 on 249 degrees of freedom
AIC: 67.909
Number of Fisher Scoring iterations: 9
carglmcoeff<-exp(coef(carsglm$finalModel))
write.csv(carglmcoeff,file = "Coeffs.csv")
varImp(object = carsglm)
carglmcoeff<-exp(coef(carsglm$finalModel))
> write.csv(carglmcoeff,file = "Coeffs.csv")
> varImp(object = carsglm)
glm variable importance
Overall
Age 100.00
Distance 92.74
MBA1 85.60
Engineer1 75.94
license1 53.50
GenderMale 50.57
Work.Exp 12.66
Salary 0.00
plot(varImp(object = carsglm), main="Variable Importance for Logistic Regression")
From the model we see that Age, Distance, Engineer and MBA are the most significant predictors, with license and Gender marginally significant. Exponentiating the coefficients gives the odds table written to Coeffs.csv: each additional year of age multiplies the odds of car usage by roughly 6, each additional unit of commute distance by roughly 1.9, and holding a license by roughly 4.8, while an MBA is associated with much lower odds; Salary and Work.Exp are not significant in this model. The null deviance of the model is 357.664 and the residual deviance is 49.909, which yields a McFadden R-squared of about 0.86, indicating a very good fit. The cross-validated Accuracy and Kappa values are also high, so we shall do the prediction based on this model.
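A quick worked computation of the McFadden pseudo R-squared from the deviances reported in the glm summary above:
null_dev <- 357.664
resid_dev <- 49.909
1 - resid_dev / null_dev   # McFadden pseudo R-squared, about 0.86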
#confusion matrix
carusageprediction<-predict.train(object = carsglm,carsdatatest[,regressors],type = "raw")
confusionMatrix(carusageprediction,carsdatatest[,outcomevar], positive='1')
> carusageprediction<-predict.train(object = carsglm,carsdatatest[,regressors],type = "raw")
> confusionMatrix(carusageprediction,carsdatatest[,outcomevar], positive='1')
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 111 3
1 3 15
Accuracy : 0.9545
95% CI : (0.9037, 0.9831)
No Information Rate : 0.8636
P-Value [Acc > NIR] : 0.0005559
Kappa : 0.807
Mcnemar's Test P-Value : 1.0000000
Sensitivity : 0.8333
Specificity : 0.9737
Pos Pred Value : 0.8333
Neg Pred Value : 0.9737
Prevalence : 0.1364
Detection Rate : 0.1136
Detection Prevalence : 0.1364
Balanced Accuracy : 0.9035
'Positive' Class : 1
The confusion matrix shows that 126 of the 132 test records are classified correctly (accuracy 95.45%, well above the no-information rate of 86.36%), with sensitivity of 83.33% and specificity of 97.37% for the car-user class (3 false positives and 3 false negatives).
A.5 Applying Logistic Regression
1. Logistic Regression
### Model Building - Logistic regression
install.packages("MASS")
library(class)
model <-glm(Transport ~.,family=binomial(link='logit'),data=cars_train)
summary(model)
model <-glm(Transport ~.,family=binomial(link='logit'),data=cars_train)
> summary(model)
Call:
glm(formula = Transport ~ ., family = binomial(link = "logit"),
data = cars_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5733 0.1351 0.3372 0.5675 1.8440
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.48505 2.68180 -4.283 1.85e-05 ***
Age 0.52837 0.11253 4.696 2.66e-06 ***
GenderMale 1.52604 0.38092 4.006 6.17e-05 ***
Engineer1 0.33374 0.38788 0.860 0.38956
MBA1 0.64960 0.43553 1.492 0.13583
Work.Exp -0.07111 0.12017 -0.592 0.55403
Salary -2.55115 2.29565 -1.111 0.26644
Distance -0.15098 0.05167 -2.922 0.00348 **
license1 -2.32637 0.46451 -5.008 5.49e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 302.59 on 311 degrees of freedom
Residual deviance: 223.09 on 303 degrees of freedom
AIC: 241.09
Number of Fisher Scoring iterations: 6
The most relevant features are Age, Gender, Distance and license, given their low p-values. Note that since Transport has three levels and glm with a binomial family treats the first level ('2Wheeler') as failure and all other levels as success, this particular model separates 2-wheeler users from car and public-transport users rather than predicting car usage directly.
Also, the AIC score is 241.09. This will be tracked in subsequent stages as we refine the model; the model with the lowest AIC score is the preferred and most optimized one.
anova(model, test="Chisq")
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: Transport
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 311 302.59
Age 1 26.1316 310 276.46
Gender 1 10.2970 309 266.16
Engineer 1 1.3586 308 264.80
MBA 1 0.8274 307 263.98
Work.Exp 1 2.6335 306 261.34
Salary 1 0.1789 305 261.16
Distance 1 11.1693 304 249.99
license 1 26.9018 303 223.09
## Cross Validation
fitted.results <- predict(model, newdata = cars_train, type = 'response')
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
# the binomial glm models P(Transport != '2Wheeler'), so compare against that indicator
actual <- ifelse(cars_train$Transport == '2Wheeler', 0, 1)
misClassificError <- mean(fitted.results != actual)
print(paste('Accuracy', 1 - misClassificError))
[1] "Accuracy 0.775925925925926"
## Decision Tree
library(rpart)
model2 <- rpart(Transport ~ Salary + Distance + license
+Work.Exp, data=cars_train, method="class")
plot(model2, uniform=TRUE,
main="Classification Tree for data")
text(model2, use.n=TRUE, all=TRUE, cex=.8)
treePred <- predict(model2, cars_train, type = 'class')
table(treePred, cars_train$Transport)
mean(treePred==cars_train$Transport)
library(rpart)
> model2 <- rpart(Transport ~ Salary + Distance + license
+ +Work.Exp, data=cars_train, method="class")
>
> plot(model2, uniform=TRUE,
+ main="Classification Tree for data")
> text(model2, use.n=TRUE, all=TRUE, cex=.8)
> treePred <- predict(model2, cars_train, type = 'class')
> table(treePred, cars_train$Transport)
treePred 2Wheeler Car Public Transport
2Wheeler 24 0 10
Car 3 36 0
Public Transport 32 7 200
> mean(treePred==cars_train$Transport)
[1] 0.8333333
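The 83% accuracy above is measured on the training data; a sketch of the same check on the held-out test set (treePredTest is a new name introduced here; results were not part of the original run):
treePredTest <- predict(model2, cars_test, type = 'class')
table(treePredTest, cars_test$Transport)
mean(treePredTest == cars_test$Transport)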
A.6 Applying KNN Model
#KNN
library(caret)
library(mlbench)
library(class)
random <- createDataPartition(data$Transport, p = 0.70, list = FALSE)
cars_train <- data[ random,]
cars_test <- data[-random,]
trControl <- trainControl(method = "cv", number = 10)
fit.knn <- train(Transport ~ .,
                 method = "knn",
                 tuneGrid = expand.grid(k = 2:20),
                 trControl = trControl,
                 metric = "Accuracy",
                 preProcess = c("center","scale"),
                 data = cars_train)
fit.knn
k-Nearest Neighbors
312 samples
8 predictors
3 classes: '2Wheeler', 'Car', 'Public Transport'
Pre-processing: centered (8), scaled (8)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 281, 281, 280, 281, 280, 282, ...
Resampling results across tuning parameters:
k Accuracy Kappa
2 0.7365457 0.4543489
3 0.7855712 0.5248631
4 0.7629839 0.4800127
5 0.7828562 0.5081854
6 0.7734812 0.4905393
7 0.7634005 0.4624704
8 0.7408065 0.4118105
9 0.7534005 0.4199273
10 0.7536022 0.4116860
11 0.7598454 0.4168749
12 0.7662970 0.4266860
13 0.7662970 0.4213708
14 0.7566129 0.3930122
15 0.7661895 0.4135919
16 0.7660887 0.4090611
17 0.7566129 0.3862387
18 0.7629637 0.3926229
19 0.7661895 0.4026549
20 0.7661895 0.3942178
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 3
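A quick way to visualize how accuracy varies with k (a sketch; the plot itself is not reproduced here):
plot(fit.knn)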
KNN_predictions = predict(fit.knn,cars_train)
table(KNN_predictions, cars_train$Transport)
KNN_predictions 2Wheeler Car Public Transport
2Wheeler 37 0 8
Car 0 35 2
Public Transport 22 8 200
KNN_predictions = predict(fit.knn,cars_test)
table(KNN_predictions, cars_test$Transport)
KNN_predictions 2Wheeler Car Public Transport
2Wheeler 9 0 11
Car 1 15 3
Public Transport 14 3 76
predict(fit.knn,carsTest)
[1] Public Transport Public Transport
Levels: 2Wheeler Car Public Transport
Data Split
For a classification problem, it is important to ensure that the train and test sets have approximately the same percentage of samples of each target class. Hence, we will use stratified sampling.
install.packages("caret")
library(caret)
set.seed(1234)
trainIndex = createDataPartition(class, p=0.7, list = FALSE, times = 1)
train.data = data[trainIndex, ]
test.data = data[-trainIndex,]
dim(train.data)
dim(test.data)
prop.table(table(data$class))
prop.table(table(train.data$class))
prop.table(table(test.data$class))
> set.seed(1234)
> trainIndex = createDataPartition(class, p=0.7, list = FALSE, times = 1)
> train.data = data[trainIndex, ]
> test.data = data[-trainIndex,]
>
> dim(train.data)
[1] 538 9
> dim(test.data)
[1] 230 9
>
> prop.table(table(data$class))
0 1
0.6510417 0.3489583
> prop.table(table(train.data$class))
0 1
0.6728625 0.3271375
> prop.table(table(test.data$class))
0 1
0.6 0.4
A.7 Applying Naïve Bayes
Data Cleaning
Missing values
anyNA(data)
anyNA(data)
[1] TRUE
Finding out where the missing value is
data[!complete.cases(data), ]
data[!complete.cases(data), ]
Age Gender Engineer MBA Work.Exp Salary Distance license Transport
145 28 Female 0 <NA> 6 13.7 9.4 0 Public Transport
Use the kNN imputation method (knnImputation from the DMwR package) to impute the missing value.
library(DMwR)
data = knnImputation(data, 5)
Normalize continuous variables
# Model Building and Predictions
Naïve Bayes
library(e1071)
Naive_Bayes_Model=naiveBayes(cars_train$Transport ~., data=cars_train)
Naive_Bayes_Model
library(e1071)
> Naive_Bayes_Model=naiveBayes(cars_train$Transport ~., data=cars_train)
> Naive_Bayes_Model
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
2Wheeler Car Public Transport
0.1891026 0.1378205 0.6730769
Conditional probabilities:
Age
Y [,1] [,2]
2Wheeler 25.11864 3.173624
Car 35.20930 3.211276
Public Transport 26.76190 2.952653
Gender
Y Female Male
2Wheeler 0.4915254 0.5084746
Car 0.1395349 0.8604651
Public Transport 0.2285714 0.7714286
Engineer
Y 0 1
2Wheeler 0.2542373 0.7457627
Car 0.1627907 0.8372093
Public Transport 0.2714286 0.7285714
MBA
Y 0 1
2Wheeler 0.7627119 0.2372881
Car 0.8604651 0.1395349
Public Transport 0.7142857 0.2857143
Work.Exp
Y [,1] [,2]
2Wheeler 4.000000 3.449138
Car 15.046512 4.690180
Public Transport 5.004762 3.181131
Salary
Y [,1] [,2]
2Wheeler 2.442295 0.4234003
Car 3.479695 0.4330174
Public Transport 2.520766 0.3208979
Distance
Y [,1] [,2]
2Wheeler 12.03051 3.178172
Car 15.07209 3.903559
Public Transport 10.11524 2.964633
license
Y 0 1
2Wheeler 0.7288136 0.2711864
Car 0.2093023 0.7906977
Public Transport 0.8857143 0.1142857
This gives us the conditional probabilities that help explain an employee's choice of transport.
The general way to interpret this output for a factor variable, say license, is that about 73% of 2-wheeler users do not hold a license and about 27% do.
For continuous variables, for example Distance, the two columns are the within-class mean and standard deviation: 2-wheeler users commute about 12.0 km on average with a standard deviation of about 3.2.
Regarding applicability: Naïve Bayes assumes the predictors are conditionally independent given the class, and here Age, Work.Exp and Salary are strongly correlated, so that assumption is only approximately met; the continuous predictors are handled by assuming a normal distribution within each class, as seen in the tables above, so the model can still be built and used as a baseline.
#Prediction on the test dataset
NB_Predictions=predict(Naive_Bayes_Model,cars_test)
table(NB_Predictions,cars_test$Transport)
NB_Predictions=predict(Naive_Bayes_Model,cars_test)
> table(NB_Predictions,cars_test$Transport)
NB_Predictions 2Wheeler Car Public Transport
2Wheeler 7 0 5
Car 0 14 3
Public Transport 17 4 82
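The overall accuracy of the Naïve Bayes model can be read off the table above (a sketch; NB_tab is a new name introduced here):
NB_tab <- table(NB_Predictions, cars_test$Transport)
sum(diag(NB_tab)) / sum(NB_tab)   # about 0.78 for the table shown above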
# prediction for test sample
NB_Predictions=predict(Naive_Bayes_Model,carsTest)
NB_Predictions
Levels: 2Wheeler Car Public Transport
A.10 Boosting
#BOOSTING
install.packages("xgboost")
boostcontrol <- trainControl(number=10)
xgbGrid <- expand.grid(
eta = 0.3,
max_depth = 1,
nrounds = 50,
gamma = 0,
colsample_bytree = 0.6,
min_child_weight = 1, subsample = 1
)
carsxgb <- train(Transport ~ .,cars_train,trControl = boostcontrol,tuneGrid = xgbGrid,metric =
"Accuracy",method = "xgbTree")
carsxgb$finalModel
carsxgb <- train(Transport ~ .,cars_train,trControl = boostcontrol,tuneGrid = xgbGrid,metric = "Accuracy
",method = "xgbTree")
>
> carsxgb$finalModel
##### xgb.Booster
raw: 38.5 Kb
call:
xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
gamma = param$gamma, colsample_bytree = param$colsample_bytree,
min_child_weight = param$min_child_weight, subsample = param$subsample),
data = x, nrounds = param$nrounds, num_class = length(lev),
objective = "multi:softprob")
params (as set within xgb.train):
eta = "0.3", max_depth = "1", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsam
ple = "1", num_class = "3", objective = "multi:softprob", silent = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
# of features: 8
niter: 50
nfeatures : 8
xNames : Age GenderMale Engineer1 MBA1 Work.Exp Salary Distance license1
problemType : Classification
tuneValue :
nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
1 50 1 0.3 0 0.6 1 1
obsLevels : 2Wheeler Car Public Transport
param :
list()
Let us predict using the test dataset
predictions_xgb<-predict(carsxgb,cars_test)
confusionMatrix(predictions_xgb,cars_test$Transport)
> predictions_xgb<-predict(carsxgb,cars_test)
> confusionMatrix(predictions_xgb,cars_test$Transport)
Confusion Matrix and Statistics
Reference
Prediction 2Wheeler Car Public Transport
2Wheeler 5 0 4
Car 2 17 2
Public Transport 17 1 84
Overall Statistics
Accuracy : 0.803
95% CI : (0.7249, 0.8671)
No Information Rate : 0.6818
P-Value [Acc > NIR] : 0.001323
Kappa : 0.5514
Mcnemar's Test P-Value : 0.015591
Statistics by Class:
Class: 2Wheeler Class: Car Class: Public Transport
Sensitivity 0.20833 0.9444 0.9333
Specificity 0.96296 0.9649 0.5714
Pos Pred Value 0.55556 0.8095 0.8235
Neg Pred Value 0.84553 0.9910 0.8000
Prevalence 0.18182 0.1364 0.6818
Detection Rate 0.03788 0.1288 0.6364
Detection Prevalence 0.06818 0.1591 0.7727
Balanced Accuracy 0.58565 0.9547 0.7524
The overall accuracy is about 80% (95% CI 72.5% to 86.7%); class-wise, the boosted model identifies Car very well (sensitivity 94.4%) but 2-Wheeler poorly (sensitivity 20.8%).
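To see which features drive the boosted model (a sketch; this call was not part of the original run):
varImp(carsxgb)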
A.11 Bagging
# Bagging
library(ipred)
library(rpart)
# carsdatatrain/carsdatatest carry the binary CarUsage target (Transport was dropped
# when the SMOTE data was prepared), so bagging is applied to CarUsage here
cars.bagging <- bagging(CarUsage ~ .,
                        data = carsdatatrain,
                        control = rpart.control(maxdepth = 5, minsplit = 4))
carsdatatest$pred.CarUsage <- predict(cars.bagging, carsdatatest)
# CarUsage is a factor, so predict() already returns class labels
confusionMatrix(data = factor(carsdatatest$pred.CarUsage),
                reference = factor(carsdatatest$CarUsage),
                positive = '1')
A.12 Actionable Insights and Recommendations
We see that all three models predict Public Transport for the two held-out test samples.
Let us summarize the conclusions from the analysis and models for an employee's decision on whether or not to use a car:
Important variables are Age, Work.Exp, Distance and license.
Age and Work.Exp are highly correlated, hence we can use either one (Work.Exp is preferred here).
Employees with work experience of 10 years and above are likely to use a car.
Employees who must commute a distance greater than 12 are more likely to prefer a car.
Looking at license, nearly 80% of those who commute by car hold a license, and about 89% of those who use public transport do not; surprisingly, about 72% of 2-wheeler users do not hold a license either.
Again, people with higher salaries (above 20) are likely to use cars.