This document forms the report to predict whether or not an employee will use a car as their mode of transport, using Machine Learning algorithms.
Project 5
Machine Learning
R S Prashanti
Table of Contents
No. Question Pts
1. EDA - Basic data summary, Univariate, Bivariate analysis, graphs, check for outliers and missing values, and check the summary of the dataset 7
2. EDA - Illustrate the insights based on EDA 5
3. EDA - Check for Multicollinearity - plot the graph based on Multicollinearity & treat it 3
4. Data Preparation (SMOTE) 10
5. Applying Logistic Regression & interpret results 3
6. Applying KNN Model & interpret results 3
7. Applying Naïve Bayes Model & interpret results (is it applicable here? Comment, and if it is not applicable, how can you build an NB model in this case?) 3
8. Confusion matrix interpretation 3
9. Remarks on model validation exercise <which model performed the best> 3
10. Bagging 7.5
11. Boosting 7.5
12. Actionable Insights and Recommendations 5
All the questions are answered in the same order as in the Table of Contents.
Project Objective
This project requires an understanding of which mode of transport employees prefer for commuting to their office. The attached data 'Cars.csv' includes information about each employee's mode of transport as well as personal and professional details such as age, salary and work experience. We need to predict whether or not an employee will use a car as their mode of transport, and also identify which variables are significant predictors of this decision.
The following steps will be performed to predict the commute mode:
Perform an EDA on the data
Illustrate the insights based on EDA
Check for Multicollinearity
Data Preparation
Prepare the data for analysis (SMOTE)
Modeling
Create multiple models and explore how each model performs using appropriate model
performance metrics:
KNN
Naïve Bayes
Logistic Regression
Apply both bagging and boosting modeling procedures to create two more models and compare their
accuracy with the best model from the step above.
Note: Each question referenced in the table above has been answered under the same numbering.
Q1. Basic data summary, Univariate, Bivariate analysis, graphs, Check for Outliers and missing values and
check the summary of the dataset.
A1.
setwd("D:/users/Prashanti Sharma/Personal/PROJS/Pred-2")
library(readr)
data <- read.csv("Cars.csv")
1.1 ##Understanding the data
variable.names(data)
[1] "Age" "Gender" "Engineer" "MBA" "Work.Exp" "Salary" "Distance"
[8] "license" "Transport"
str(data)
'data.frame': 444 obs. of 9 variables:
$ Age : int 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
$ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
$ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : int 0 0 0 0 0 1 0 0 0 0 ...
$ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...
dim(data)
> dim(data)
[1] 444 9
View(data)
# Basic data summary
Variables like Engineer, MBA and license have been read as numeric (integer), so they should be converted to factors first.
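A minimal sketch of that conversion (assuming the data frame is named data, as above); the same step is also repeated later before SMOTE:
# Convert the 0/1 coded columns to factors so they are summarised as categories
data$Engineer <- as.factor(data$Engineer)
data$MBA <- as.factor(data$MBA)
data$license <- as.factor(data$license)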
summary(data)
> summary(data)
Age Gender Engineer MBA Work.Exp Salary
Min. :18.00 Female:128 0:109 0 :331 Min. : 0.0 Min. : 6.50
1st Qu.:25.00 Male :316 1:335 1 :112 1st Qu.: 3.0 1st Qu.: 9.80
Median :27.00 NA's: 1 Median : 5.0 Median :13.60
Mean :27.75 Mean : 6.3 Mean :16.24
3rd Qu.:30.00 3rd Qu.: 8.0 3rd Qu.:15.72
Max. :43.00 Max. :24.0 Max. :57.00
Distance license Transport
Min. : 3.20 0:340 2Wheeler : 83
1st Qu.: 8.80 1:104 Car : 61
Median :11.00 Public Transport:300
Mean :11.32
3rd Qu.:13.43
Max. :23.40
We can conclude that the majority of employees are male (approx. 71%, 316 of 444).
Similarly, Engineers outnumber MBAs.
The total number of Engineers and MBAs is greater than 444, hence some candidates possibly hold a dual degree.
One data point for MBA is missing.
Salary appears to have a skewed distribution.
Public transport is the most common mode of transportation.
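A quick cross-tabulation (a sketch, assuming the factor conversion above) supports the dual-degree observation, since employees with Engineer = 1 and MBA = 1 hold both degrees:
table(data$Engineer, data$MBA, useNA = "ifany")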
# Checking null data
sapply(data,function(x) sum(is.na(x)))
> sapply(data,function(x) sum(is.na(x)))
Age Gender Engineer MBA Work.Exp Salary Distance license Transport
0 0 0 1 0 0 0 0 0
#Checking for Outliers
summary(data$Gender)
summary(data$Engineer)
summary(data$MBA)
summary(data$Work.Exp)
summary(data$Salary)
summary(data$Distance)
summary(data$license)
summary(data$Transport)
summary(data$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 25.00 27.00 27.75 30.00 43.00
> summary(data$Gender)
Female Male
128 316
> summary(data$Engineer)
0 1
109 335
> summary(data$MBA)
0 1 NA's
331 112 1
> summary(data$Work.Exp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 3.0 5.0 6.3 8.0 24.0
> summary(data$Salary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.50 9.80 13.60 16.24 15.72 57.00
> summary(data$Distance)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.20 8.80 11.00 11.32 13.43 23.40
> summary(data$license)
0 1
340 104
> summary(data$Transport)
2Wheeler Car Public Transport
83 61 300
#Checking for Missing Values
data.frame(data)
is.na(data)
sum(is.na(data))
> sum(is.na(data))
[1] 1
As stated earlier in the summary analysis, one data point (for MBA) is missing, which is confirmed above.
#Univariate analysis
Univariate analysis of all the continuous factors (numeric)
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
data %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(col = "blue")
#Bivariate Analysis
boxplot(data$Age ~data$Engineer, main = "Age vs Eng.")
boxplot(data$Age ~data$MBA, main ="Age Vs MBA")
As expected, there is not much difference here; people of all qualifications and experience levels would be employed in the firm.
boxplot(data$Salary ~ data$Engineer, main = "Salary vs Eng.")
boxplot(data$Salary ~ data$MBA, main = "Salary vs MBA")
We do not see any appreciable difference in the salaries of Engineers vs non-Engineers or MBAs vs non-MBAs.
Also, the mean salary for both MBAs and Engineers is around 16.
hist(data$Work.Exp, col = "blue", main = "Distribution of work exp")
This distribution is right-skewed; again, this is on expected lines, as there would be more juniors than seniors in any firm.
table(data$license,data$Transport)
2Wheeler Car Public Transport
0 60 13 267
1 23 48 33
boxplot(data$Work.Exp ~ data$Gender)
There is not much difference in mean work experience between the two genders, so the population is similarly distributed for males and females.
Hypothesis Testing
1. The higher the salary, the greater the chance of using a car for the commute.
boxplot(data$Salary~data$Transport, main="Salary vs Transport")
The plot clearly shows that as salary increases, the inclination to commute by car is higher.
2. Similarly, with age or work experience (Age and Work.Exp are expected to be collinear), the propensity to use a car increases.
cor(data$Age, data$Work.Exp)
> cor(data$Age, data$Work.Exp)
[1] 0.9322364
boxplot(data$Age~data$Transport, main="Age vs Transport")
As was the case with salary, we can see a clear demarcation in transport usage: at lower ages a 2-wheeler is preferred, while at higher ages and work experience a car is preferred.
3. As distance increases, employees would prefer a car for comfort and ease.
boxplot(data$Distance~data$Transport, main="Distance vs Transport")
There is a slight pattern observable here: for greater distances a car is preferred, followed by a 2-wheeler and then public transport.
4. Females would prefer private transport more than public transport.
table(data$Gender,data$Transport)
table(data$Gender,data$Transport)
2Wheeler Car Public Transport
Female 38 13 77
Male 45 48 223
We can see that around 40% of females use private transport and about 10% use a car, compared to males, where about 15% prefer a car and around 30% in total use private transport. Thus, although males show a higher percentage of car usage, they are also heavy users of public transport.
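The percentages quoted above can be reproduced as row proportions (a sketch; the output was not part of the original run):
prop.table(table(data$Gender, data$Transport), margin = 1)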
A.2 Insights from EDA
The majority of employees are male (approx. 71%).
Similarly, Engineers outnumber MBAs.
The total number of Engineers and MBAs is greater than 444, hence some candidates possibly hold a dual degree.
One data point for MBA is missing.
Salary appears to have a skewed distribution.
Public transport is the most common mode of transportation.
Around 40% of females use private transport and about 10% use a car, compared to males, where about 15% prefer a car and around 30% in total use private transport; thus, although males show a higher percentage of car usage, they are also heavy users of public transport.
For greater distances a car is preferred, followed by a 2-wheeler and then public transport.
As was the case with salary, there is a clear demarcation in transport usage: at lower ages a 2-wheeler is preferred, while at higher ages and work experience a car is preferred.
With age or work experience (Age and Work.Exp are collinear), the propensity to use a car increases.
A.3 Multicollinearity
library(corrplot)
# Correlation can only be computed on numeric columns, so subset them first
num_vars <- data[, sapply(data, is.numeric)]
cor_matrix <- cor(num_vars, use = "complete.obs")
corrplot(cor_matrix)
library(ppcor)
pcor(na.omit(num_vars), method = "pearson")
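As an additional check not included in the original run, variance inflation factors (VIF) quantify how strongly each predictor is explained by the others; the helper regression below is purely illustrative. Age and Work.Exp are expected to show high VIFs given their correlation of about 0.93, so one of the two can be dropped (or the pair combined) when modelling, which is how the multicollinearity is treated later in this report:
library(car)   # assumed to be installed; provides vif()
vif(lm(Salary ~ Age + Work.Exp + Distance, data = data))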
A.4 SMOTE
## SMOTE
The columns Engineer, MBA and license need to be converted into factors.
library(DMwR)    # provides knnImputation and SMOTE
library(caret)   # provides createDataPartition, train and confusionMatrix
data$Engineer <- as.factor(data$Engineer)
data$MBA <- as.factor(data$MBA)
data$license <- as.factor(data$license)
data <- knnImputation(data)
data$CarUsage<-ifelse(data$Transport =='Car',1,0)
table(data$CarUsage)
sum(data$CarUsage == 1)/nrow(data)
data$CarUsage<-as.factor(data$CarUsage)
data$Engineer<-as.factor(data$Engineer)
> data$MBA<-as.factor(data$MBA)
> data$license<-as.factor( data$license)
> data<-knnImputation(data)
Warning message:
In knnImputation(data) :
No case has missing values. Stopping as there is nothing to do.
>
>
> data$CarUsage<-ifelse(data$Transport =='Car',1,0)
> table(data$CarUsage)
0 1
383 61
>
> sum(data$CarUsage == 1)/nrow(data)
[1] 0.1373874
>
> data$CarUsage<-as.factor(data$CarUsage)
The records for people travelling by car are in the minority, hence we need to apply an appropriate sampling method on the train data.
##Split the data into test and train
set.seed(400)
carindex<-createDataPartition(data$CarUsage, p=0.7,list = FALSE,times = 1)
carsdatatrain<-data[carindex,]
carsdatatest<-data[-carindex,]
prop.table(table(carsdatatrain$CarUsage))
> set.seed(400)
> carindex<-createDataPartition(data$CarUsage, p=0.7,list = FALSE,times = 1)
> carsdatatrain<-data[carindex,]
> carsdatatest<-data[-carindex,]
> prop.table(table(carsdatatrain$CarUsage))
0 1
0.8621795 0.1378205
prop.table(table(carsdatatest$CarUsage))
> prop.table(table(carsdatatest$CarUsage))
0 1
0.8636364 0.1363636
carsdatatrain<-carsdatatrain[,c(1:8,10)]
carsdatatest<-carsdatatest[,c(1:8,10)]
## The train and test data have almost same percentage of cars usage as the base data
## Apply SMOTE on Training data set
library(DMwR)
attach(carsdatatrain)
carsdataSMOTE<-SMOTE(CarUsage~., carsdatatrain, perc.over = 250,perc.under = 150)
prop.table(table(carsdataSMOTE$CarUsage))
> carsdataSMOTE<-SMOTE(CarUsage~., carsdatatrain, perc.over = 250,perc.under = 150)
> prop.table(table(carsdataSMOTE$CarUsage))
0 1
0.5 0.5
We now have an equal split in the data between car users and non-car users. Let us proceed with building the models.
## Model Building
We will use the logistic regression method on the SMOTE data to understand the factors influencing car usage. Since we have only a limited number of variables, we will use them all in model building.
outcomevar<-'CarUsage'
regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)
carsglm<-train(carsdataSMOTE[,regressors],carsdataSMOTE[,outcomevar],method = "glm", family =
"binomial",trControl = trainctrl)
> outcomevar<-'CarUsage'
> regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
> trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)
> carsglm<-train(carsdataSMOTE[,regressors],carsdataSMOTE[,outcomevar],method = "glm", family = "b
inomial",trControl = trainctrl)
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(carsglm$finalModel)
> summary(carsglm$finalModel)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-2.09669 -0.01064 0.00000 0.02798 2.11942
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -64.8739 18.4863 -3.509 0.000449 ***
Age 1.7911 0.5273 3.397 0.000681 ***
Work.Exp -0.2056 0.4347 -0.473 0.636175
Salary -0.1061 2.1572 -0.049 0.960759
Distance 0.6172 0.1957 3.154 0.001611 **
license1 1.5699 0.8530 1.840 0.065711 .
Engineer1 3.1162 1.2024 2.592 0.009554 **
MBA1 -3.4351 1.1784 -2.915 0.003557 **
GenderMale 1.6470 0.9453 1.742 0.081460 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 357.664 on 257 degrees of freedom
Residual deviance: 49.909 on 249 degrees of freedom
AIC: 67.909
Number of Fisher Scoring iterations: 9
carglmcoeff<-exp(coef(carsglm$finalModel))
write.csv(carglmcoeff,file = "Coeffs.csv")
varImp(object = carsglm)
carglmcoeff<-exp(coef(carsglm$finalModel))
> write.csv(carglmcoeff,file = "Coeffs.csv")
> varImp(object = carsglm)
glm variable importance
Overall
Age 100.00
Distance 92.74
MBA1 85.60
Engineer1 75.94
license1 53.50
GenderMale 50.57
Work.Exp 12.66
Salary 0.00
plot(varImp(object = carsglm), main="Variable Importance for Logistic Regression")
From the model we see that Age, Distance, Engineer and MBA are the most significant predictors, with license and Gender marginally significant. Exponentiating the coefficients gives the odds table written to Coeffs.csv: each additional year of age multiplies the odds of car usage by roughly 6, each additional unit of commute distance by roughly 1.9, and holding a license by roughly 4.8, while an MBA is associated with much lower odds; Salary and Work.Exp are not significant in this model. The null deviance of the model is 357.664 and the residual deviance is 49.909, which yields a McFadden R-squared of about 0.86, indicating a very good fit. The cross-validated Accuracy and Kappa values are also high, so we shall do the prediction based on this model.
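A quick worked computation of the McFadden pseudo R-squared from the deviances reported in the glm summary above:
null_dev <- 357.664
resid_dev <- 49.909
1 - resid_dev / null_dev   # McFadden pseudo R-squared, about 0.86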
#confusion matrix
carusageprediction<-predict.train(object = carsglm,carsdatatest[,regressors],type = "raw")
confusionMatrix(carusageprediction,carsdatatest[,outcomevar], positive='1')
> carusageprediction<-predict.train(object = carsglm,carsdatatest[,regressors],type = "raw")
> confusionMatrix(carusageprediction,carsdatatest[,outcomevar], positive='1')
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 111 3
1 3 15
Accuracy : 0.9545
95% CI : (0.9037, 0.9831)
No Information Rate : 0.8636
P-Value [Acc > NIR] : 0.0005559
Kappa : 0.807
Mcnemar's Test P-Value : 1.0000000
Sensitivity : 0.8333
Specificity : 0.9737
Pos Pred Value : 0.8333
Neg Pred Value : 0.9737
Prevalence : 0.1364
Detection Rate : 0.1136
Detection Prevalence : 0.1364
Balanced Accuracy : 0.9035
'Positive' Class : 1
The confusion matrix shows that 126 of the 132 test records are classified correctly (accuracy 95.45%, well above the no-information rate of 86.36%), with sensitivity of 83.33% and specificity of 97.37% for the car-user class (3 false positives and 3 false negatives).
A.5 Applying Logistic Regression
1. Logistic Regression
### Model Building - Logistic regression
install.packages("MASS")
library(class)
model <-glm(Transport ~.,family=binomial(link='logit'),data=cars_train)
summary(model)
model <-glm(Transport ~.,family=binomial(link='logit'),data=cars_train)
> summary(model)
Call:
glm(formula = Transport ~ ., family = binomial(link = "logit"),
data = cars_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5733 0.1351 0.3372 0.5675 1.8440
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.48505 2.68180 -4.283 1.85e-05 ***
Age 0.52837 0.11253 4.696 2.66e-06 ***
GenderMale 1.52604 0.38092 4.006 6.17e-05 ***
Engineer1 0.33374 0.38788 0.860 0.38956
MBA1 0.64960 0.43553 1.492 0.13583
Work.Exp -0.07111 0.12017 -0.592 0.55403
Salary -2.55115 2.29565 -1.111 0.26644
Distance -0.15098 0.05167 -2.922 0.00348 **
license1 -2.32637 0.46451 -5.008 5.49e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 302.59 on 311 degrees of freedom
Residual deviance: 223.09 on 303 degrees of freedom
AIC: 241.09
Number of Fisher Scoring iterations: 6
The most relevant features are Age, Gender, Distance and license, given their low p-values. Note that since Transport has three levels and glm with a binomial family treats the first level ('2Wheeler') as failure and all other levels as success, this particular model separates 2-wheeler users from car and public-transport users rather than predicting car usage directly.
Also, the AIC score is 241.09. This will be tracked in subsequent stages as we refine the model; the model with the lowest AIC score is the preferred and most optimized one.
anova(model, test="Chisq")
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: Transport
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 311 302.59
Age 1 26.1316 310 276.46
Gender 1 10.2970 309 266.16
Engineer 1 1.3586 308 264.80
MBA 1 0.8274 307 263.98
Work.Exp 1 2.6335 306 261.34
Salary 1 0.1789 305 261.16
Distance 1 11.1693 304 249.99
license 1 26.9018 303 223.09
## Cross Validation
fitted.results <- predict(model, newdata = cars_train, type = 'response')
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
# the binomial glm models P(Transport != '2Wheeler'), so compare against that indicator
actual <- ifelse(cars_train$Transport == '2Wheeler', 0, 1)
misClassificError <- mean(fitted.results != actual)
print(paste('Accuracy', 1 - misClassificError))
[1] "Accuracy 0.775925925925926"
## Decision Tree
library(rpart)
model2 <- rpart(Transport ~ Salary + Distance + license
+Work.Exp, data=cars_train, method="class")
plot(model2, uniform=TRUE,
main="Classification Tree for data")
text(model2, use.n=TRUE, all=TRUE, cex=.8)
treePred <- predict(model2, cars_train, type = 'class')
table(treePred, cars_train$Transport)
mean(treePred==cars_train$Transport)
library(rpart)
> model2 <- rpart(Transport ~ Salary + Distance + license
+ +Work.Exp, data=cars_train, method="class")
>
> plot(model2, uniform=TRUE,
+ main="Classification Tree for data")
> text(model2, use.n=TRUE, all=TRUE, cex=.8)
> treePred <- predict(model2, cars_train, type = 'class')
> table(treePred, cars_train$Transport)
treePred 2Wheeler Car Public Transport
2Wheeler 24 0 10
Car 3 36 0
Public Transport 32 7 200
> mean(treePred==cars_train$Transport)
[1] 0.8333333
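The 83% accuracy above is measured on the training data; a sketch of the same check on the held-out test set (treePredTest is a new name introduced here; results were not part of the original run):
treePredTest <- predict(model2, cars_test, type = 'class')
table(treePredTest, cars_test$Transport)
mean(treePredTest == cars_test$Transport)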
A.6 Applying KNN Model
#KNN
library(caret)
library(mlbench)
library(class)
random <- createDataPartition(data$Transport, p = 0.70, list = FALSE)
cars_train <- data[ random,]
cars_test <- data[-random,]
trControl <- trainControl(method = "cv", number = 10)
fit.knn <- train(Transport ~ .,
                 method = "knn",
                 tuneGrid = expand.grid(k = 2:20),
                 trControl = trControl,
                 metric = "Accuracy",
                 preProcess = c("center","scale"),
                 data = cars_train)
fit.knn
k-Nearest Neighbors
312 samples
8 predictors
3 classes: '2Wheeler', 'Car', 'Public Transport'
Pre-processing: centered (8), scaled (8)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 281, 281, 280, 281, 280, 282, ...
Resampling results across tuning parameters:
k Accuracy Kappa
2 0.7365457 0.4543489
3 0.7855712 0.5248631
4 0.7629839 0.4800127
5 0.7828562 0.5081854
6 0.7734812 0.4905393
7 0.7634005 0.4624704
8 0.7408065 0.4118105
9 0.7534005 0.4199273
10 0.7536022 0.4116860
11 0.7598454 0.4168749
12 0.7662970 0.4266860
13 0.7662970 0.4213708
14 0.7566129 0.3930122
15 0.7661895 0.4135919
16 0.7660887 0.4090611
17 0.7566129 0.3862387
18 0.7629637 0.3926229
19 0.7661895 0.4026549
20 0.7661895 0.3942178
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 3
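A quick way to visualize how accuracy varies with k (a sketch; the plot itself is not reproduced here):
plot(fit.knn)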
KNN_predictions = predict(fit.knn,cars_train)
table(KNN_predictions, cars_train$Transport)
KNN_predictions 2Wheeler Car Public Transport
2Wheeler 37 0 8
Car 0 35 2
Public Transport 22 8 200
KNN_predictions = predict(fit.knn,cars_test)
table(KNN_predictions, cars_test$Transport)
KNN_predictions 2Wheeler Car Public Transport
2Wheeler 9 0 11
Car 1 15 3
Public Transport 14 3 76
predict(fit.knn,carsTest)
[1] Public Transport Public Transport
Levels: 2Wheeler Car Public Transport
Data Split
For a classification problem, it is important to ensure that the train and test sets have approximately the same percentage of samples of each target class. Hence, we will use stratified sampling.
install.packages("caret")
library(caret)
set.seed(1234)
trainIndex = createDataPartition(class, p=0.7, list = FALSE, times = 1)
train.data = data[trainIndex, ]
test.data = data[-trainIndex,]
dim(train.data)
dim(test.data)
prop.table(table(data$class))
prop.table(table(train.data$class))
prop.table(table(test.data$class))
> set.seed(1234)
> trainIndex = createDataPartition(class, p=0.7, list = FALSE, times = 1)
> train.data = data[trainIndex, ]
> test.data = data[-trainIndex,]
>
> dim(train.data)
[1] 538 9
> dim(test.data)
[1] 230 9
>
> prop.table(table(data$class))
0 1
0.6510417 0.3489583
> prop.table(table(train.data$class))
0 1
0.6728625 0.3271375
> prop.table(table(test.data$class))
0 1
0.6 0.4
A.7 Applying Naïve Bayes
Data Cleaning
Missing values
anyNA(data)
anyNA(data)
[1] TRUE
Finding out where the missing value is
data[!complete.cases(data), ]
data[!complete.cases(data), ]
Age Gender Engineer MBA Work.Exp Salary Distance license Transport
145 28 Female 0 <NA> 6 13.7 9.4 0 Public Transport
Use the kNN imputation method (knnImputation from the DMwR package) to impute the missing value.
library(DMwR)
data = knnImputation(data, 5)
Normalize continuous variables
# Model Building and Predictions
Naïve Bayes
library(e1071)
Naive_Bayes_Model=naiveBayes(cars_train$Transport ~., data=cars_train)
Naive_Bayes_Model
library(e1071)
> Naive_Bayes_Model=naiveBayes(cars_train$Transport ~., data=cars_train)
> Naive_Bayes_Model
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
2Wheeler Car Public Transport
0.1891026 0.1378205 0.6730769
Conditional probabilities:
Age
Y [,1] [,2]
2Wheeler 25.11864 3.173624
Car 35.20930 3.211276
Public Transport 26.76190 2.952653
Gender
Y Female Male
2Wheeler 0.4915254 0.5084746
Car 0.1395349 0.8604651
Public Transport 0.2285714 0.7714286
Engineer
Y 0 1
2Wheeler 0.2542373 0.7457627
Car 0.1627907 0.8372093
Public Transport 0.2714286 0.7285714
MBA
Y 0 1
2Wheeler 0.7627119 0.2372881
Car 0.8604651 0.1395349
Public Transport 0.7142857 0.2857143
Work.Exp
Y [,1] [,2]
2Wheeler 4.000000 3.449138
Car 15.046512 4.690180
Public Transport 5.004762 3.181131
Salary
Y [,1] [,2]
2Wheeler 2.442295 0.4234003
Car 3.479695 0.4330174
Public Transport 2.520766 0.3208979
Distance
Y [,1] [,2]
2Wheeler 12.03051 3.178172
Car 15.07209 3.903559
Public Transport 10.11524 2.964633
license
Y 0 1
2Wheeler 0.7288136 0.2711864
Car 0.2093023 0.7906977
Public Transport 0.8857143 0.1142857
This gives us the conditional probabilities that help explain an employee's choice of transport.
The general way to interpret this output for a factor variable, say license, is that about 73% of 2-wheeler users do not hold a license and about 27% do.
For continuous variables, for example Distance, the two columns are the within-class mean and standard deviation: 2-wheeler users commute about 12.0 km on average with a standard deviation of about 3.2.
Regarding applicability: Naïve Bayes assumes the predictors are conditionally independent given the class, and here Age, Work.Exp and Salary are strongly correlated, so that assumption is only approximately met; the continuous predictors are handled by assuming a normal distribution within each class, as seen in the tables above, so the model can still be built and used as a baseline.
#Prediction on the test dataset
NB_Predictions=predict(Naive_Bayes_Model,cars_test)
table(NB_Predictions,cars_test$Transport)
NB_Predictions=predict(Naive_Bayes_Model,cars_test)
> table(NB_Predictions,cars_test$Transport)
NB_Predictions 2Wheeler Car Public Transport
2Wheeler 7 0 5
Car 0 14 3
Public Transport 17 4 82
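The overall accuracy of the Naïve Bayes model can be read off the table above (a sketch; NB_tab is a new name introduced here):
NB_tab <- table(NB_Predictions, cars_test$Transport)
sum(diag(NB_tab)) / sum(NB_tab)   # about 0.78 for the table shown above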
# prediction for test sample
NB_Predictions=predict(Naive_Bayes_Model,carsTest)
NB_Predictions
Levels: 2Wheeler Car Public Transport
A.10 Boosting
#BOOSTING
install.packages("xgboost")
boostcontrol <- trainControl(number=10)
xgbGrid <- expand.grid(
eta = 0.3,
max_depth = 1,
nrounds = 50,
gamma = 0,
colsample_bytree = 0.6,
min_child_weight = 1, subsample = 1
)
carsxgb <- train(Transport ~ .,cars_train,trControl = boostcontrol,tuneGrid = xgbGrid,metric =
"Accuracy",method = "xgbTree")
carsxgb$finalModel
carsxgb <- train(Transport ~ .,cars_train,trControl = boostcontrol,tuneGrid = xgbGrid,metric = "Accuracy
",method = "xgbTree")
>
> carsxgb$finalModel
##### xgb.Booster
raw: 38.5 Kb
call:
xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
gamma = param$gamma, colsample_bytree = param$colsample_bytree,
min_child_weight = param$min_child_weight, subsample = param$subsample),
data = x, nrounds = param$nrounds, num_class = length(lev),
objective = "multi:softprob")
params (as set within xgb.train):
eta = "0.3", max_depth = "1", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsam
ple = "1", num_class = "3", objective = "multi:softprob", silent = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
# of features: 8
niter: 50
nfeatures : 8
xNames : Age GenderMale Engineer1 MBA1 Work.Exp Salary Distance license1
problemType : Classification
tuneValue :
nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
1 50 1 0.3 0 0.6 1 1
obsLevels : 2Wheeler Car Public Transport
param :
list()
Let us predict using the test dataset
predictions_xgb<-predict(carsxgb,cars_test)
confusionMatrix(predictions_xgb,cars_test$Transport)
> predictions_xgb<-predict(carsxgb,cars_test)
> confusionMatrix(predictions_xgb,cars_test$Transport)
Confusion Matrix and Statistics
Reference
Prediction 2Wheeler Car Public Transport
2Wheeler 5 0 4
Car 2 17 2
Public Transport 17 1 84
Overall Statistics
Accuracy : 0.803
95% CI : (0.7249, 0.8671)
No Information Rate : 0.6818
P-Value [Acc > NIR] : 0.001323
Kappa : 0.5514
Mcnemar's Test P-Value : 0.015591
Statistics by Class:
Class: 2Wheeler Class: Car Class: Public Transport
Sensitivity 0.20833 0.9444 0.9333
Specificity 0.96296 0.9649 0.5714
Pos Pred Value 0.55556 0.8095 0.8235
Neg Pred Value 0.84553 0.9910 0.8000
Prevalence 0.18182 0.1364 0.6818
Detection Rate 0.03788 0.1288 0.6364
Detection Prevalence 0.06818 0.1591 0.7727
Balanced Accuracy 0.58565 0.9547 0.7524
The overall accuracy is about 80% (95% CI 72.5% to 86.7%); class-wise, the boosted model identifies Car very well (sensitivity 94.4%) but 2-Wheeler poorly (sensitivity 20.8%).
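To see which features drive the boosted model (a sketch; this call was not part of the original run):
varImp(carsxgb)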
A.11 Bagging
# Bagging
library(ipred)
library(rpart)
# carsdatatrain/carsdatatest carry the binary CarUsage target (Transport was dropped
# when the SMOTE data was prepared), so bagging is applied to CarUsage here
cars.bagging <- bagging(CarUsage ~ .,
                        data = carsdatatrain,
                        control = rpart.control(maxdepth = 5, minsplit = 4))
carsdatatest$pred.CarUsage <- predict(cars.bagging, carsdatatest)
# CarUsage is a factor, so predict() already returns class labels
confusionMatrix(data = factor(carsdatatest$pred.CarUsage),
                reference = factor(carsdatatest$CarUsage),
                positive = '1')
A.12 Actionable Insights and Recommendations
We see that all three models predict Public Transport for the two held-out test samples.
Let us summarize the conclusions from the analysis and models for an employee's decision on whether or not to use a car:
Important variables are Age, Work.Exp, Distance and license.
Age and Work.Exp are highly correlated, hence we can use either one (Work.Exp is preferred here).
Employees with work experience of 10 years and above are likely to use a car.
Employees who must commute a distance greater than 12 are more likely to prefer a car.
Looking at license, nearly 80% of those who commute by car hold a license, and about 89% of those who use public transport do not; surprisingly, about 72% of 2-wheeler users do not hold a license either.
Again, people with higher salaries (above 20) are likely to use cars.