R Machine Learning
Trainer: Dr Ghazaleh Babanejad
Website: www.tertiarycourses.com.my
Email: [email protected]
About the Trainer
Dr Ghazaleh Babanejad received her PhD from the
Faculty of Computer Science and Information Technology,
Universiti Putra Malaysia. Her PhD thesis covers
recommender systems in the field of skyline queries over
dynamic and incomplete databases. She also works in
data science as a trainer and data scientist, and has
worked on machine learning and process mining projects.
She holds several international certificates, including
Practical Machine Learning (Johns Hopkins University),
Mining Massive Datasets (Stanford University), Process
Mining (Eindhoven University), Hadoop (University of San
Diego) and MongoDB for DBAs (MongoDB Inc). She has more
than 5 years of experience as a lecturer and database
administrator.
Agenda
Module 1 Introduction to Machine Learning
- What is Machine Learning
- Installing mlr package
- Supervised vs Unsupervised Learning
- Regression vs Classification
Module 2 Datasets
- Iris Dataset
- Boston Housing Price Dataset
- Mtcars Dataset
Module 3 Preprocessing
- Sampling
- Impute missing values
- Normalize columns
- Split data into train and test set
Module 4 Regression based Models
- What is Supervised Learning
- Linear Regression
- Logistic Regression Classifier
- Linear Discriminant Analysis
Module 5 Tree based Models
- Decision Tree
- Random Forest
- Gradient Boost
- XGBoost
Module 6 Nearest Neighbor Models
- Naive Bayes
- KNN
Module 7 Unsupervised Learning
- What is Unsupervised Learning
- Clustering
- Dimensionality Reduction
Module 8 Intro to Neural Network (Optional)
- What is Neural Network
- Multi Layer Perceptron
Prerequisite
Basic knowledge of R is assumed
Exercise Files
Download the exercise file from
https://github.com/rkrtiwari/rMachineLearning
Module 1
Getting Started
What is Machine Learning?
• Machine Learning is about building
programs with tunable parameters that
are adjusted automatically so as to
improve their behavior by adapting to
previously seen data
• Machine Learning is a subfield of
Artificial Intelligence
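To make "tunable parameters adjusted automatically" concrete, here is a minimal sketch. The built-in `cars` data and `lm()` are chosen only for illustration and are not part of the course material.

```r
# Sketch: a model with two tunable parameters (intercept and slope)
# that are fitted automatically from previously seen data.
data(cars)                            # built-in data: speed and stopping distance
fit <- lm(dist ~ speed, data = cars)  # parameters are estimated from the data
params <- coef(fit)                   # the learned parameters
print(params)

# Use the learned parameters to predict for a speed the model has not seen
new_pred <- predict(fit, newdata = data.frame(speed = 21))
print(new_pred)
```

Fitting on more or different data would change `params`; that adjustment is the "learning".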
Why Machine Learning?
http://www.goratings.org/
Machine Learning
• Supervised Learning
– Classification
– Regression
• Unsupervised Learning
– Clustering
– Dimensionality Reduction
R Packages for ML
• rpart
• randomForest
• e1071
• glmnet
• nnet
• class
• FNN
• xgboost
• MASS (lda)
Installing and Loading R ML
Packages
install.packages("mlr")
library(mlr)
Module 2
Datasets
Iris Flower Dataset
setosa (0) versicolor (1) virginica (2)
Iris flower dataset, introduced in 1936 by
Sir Ronald Fisher
Iris Flower Dataset
Features in the Iris dataset:
• sepal length in cm
• sepal width in cm
• petal length in cm
• petal width in cm
Target classes to predict:
• setosa
• versicolor
• virginica
Load Iris Dataset
data(iris)
dim(iris)
levels(iris$Species)
head(iris)
Boston Housing
Price Dataset
Boston Housing Price Dataset
There are 13 features and one target (MEDV) in this dataset.
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV median value of owner-occupied homes in $1000s (the target)
In MASS::Boston the column names are lowercase (for example medv and lstat), and B is named black.
Load Boston Housing Dataset
library(MASS)
Boston
dim(Boston)
head(Boston)
Mtcars Dataset
Motor Trend Car (mtcars) dataset
There are 11 features for this dataset.
- mpg Miles/(US) gallon
- cyl Number of cylinders
- disp Displacement (cu.in.)
- hp Gross horsepower
- drat Rear axle ratio
- wt Weight (1000 lbs)
- qsec 1/4 mile time
- vs Engine (0 = V-shaped, 1 = straight)
- am Transmission (0 = automatic, 1 = manual)
- gear Number of forward gears
- carb Number of carburetors
Load MTCars Dataset
mtcars
dim(mtcars)
head(mtcars)
Module 3
Pre-processing data
Sampling
## selecting rows and columns
iris2.1=subset(iris,
select=c("Sepal.Length","Sepal.Width"))
# select only these 2 columns
iris2.2=iris[1:100, ]
# select the first 100 rows
Sampling
## random sampling
# take a random sample of 50 rows from the iris dataset
# sample without replacement
myiris <- iris[sample(1:nrow(iris), 50,
replace=FALSE),]
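Random sampling returns different rows on every run. A minimal sketch of making the sample reproducible with `set.seed()` follows; the seed value 42 is an arbitrary assumption.

```r
# Sketch: reproducible random sampling (the seed value 42 is arbitrary)
data(iris)
set.seed(42)
idx1 <- sample(1:nrow(iris), 50, replace = FALSE)
set.seed(42)
idx2 <- sample(1:nrow(iris), 50, replace = FALSE)
identical(idx1, idx2)     # same seed gives the same sample

myiris <- iris[idx1, ]
dim(myiris)               # 50 rows, 5 columns
anyDuplicated(idx1)       # 0: without replacement no row is drawn twice
```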
Impute missing values
data(airquality)
aqr=airquality
summary(aqr)
imp = impute(aqr, classes = list(integer =
imputeMean(), factor = imputeMode()),
dummy.classes = "integer")
summary(imp$data)
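To see what mean imputation does to the data, the idea can be sketched in base R. This is an illustration of the technique, not a description of how mlr implements `imputeMean()`.

```r
# Sketch: mean imputation in base R (illustration only, not mlr's internals)
data(airquality)
aqr <- airquality
colSums(is.na(aqr))                  # missing values per column before imputation

for (col in names(aqr)) {
  if (is.numeric(aqr[[col]])) {
    missing <- is.na(aqr[[col]])
    # replace each missing value with the mean of the observed values
    aqr[[col]][missing] <- mean(aqr[[col]], na.rm = TRUE)
  }
}
colSums(is.na(aqr))                  # no missing values remain
```

Unlike the mlr call above, this sketch does not add dummy columns marking which values were imputed.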
Normalize data
## normalize columns
iris2.4 = normalizeFeatures(iris[,1:4],
method = "range")
summary(iris2.4)
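The "range" method rescales each column to a fixed interval. A minimal base R sketch of min-max scaling to [0, 1] is shown here; the helper name `range01` is hypothetical, and the assumption is that mlr's default target interval is also [0, 1].

```r
# Sketch: min-max ("range") scaling to [0, 1] in base R
data(iris)
range01 <- function(x) (x - min(x)) / (max(x) - min(x))   # hypothetical helper
iris.scaled <- as.data.frame(lapply(iris[, 1:4], range01))
summary(iris.scaled)      # every column now has min 0 and max 1
```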
Train and test set
## create train and test set
nr <- nrow(iris)
inTrain <- sample(1:nr, 0.6*nr)
iris.train <- iris[inTrain,]
iris.test <- iris[-inTrain,]
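A quick way to check a split is to confirm that the two sets do not overlap and together cover the data. The sketch below adds a seed (an arbitrary assumption) so the split can be repeated.

```r
# Sketch: a seeded 60/40 split with two sanity checks
data(iris)
set.seed(1)                                   # arbitrary seed for reproducibility
nr <- nrow(iris)
inTrain <- sample(1:nr, 0.6 * nr)
iris.train <- iris[inTrain, ]
iris.test  <- iris[-inTrain, ]

c(train = nrow(iris.train), test = nrow(iris.test))
# No row may appear in both sets
length(intersect(rownames(iris.train), rownames(iris.test)))
# A purely random split does not guarantee equal class counts
table(iris.train$Species)
```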
Module 4
Regression Models
What is Supervised Learning
• In Supervised Learning, we have a
dataset consisting of both features and
labels.
• The input data (X) is associated with a
target label (y)
Supervised Learning Examples
• Spam Email Filter
• Tumor Classification
Classification Steps
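# Step 1: Load the classifier library ("package" is a placeholder)
library(package)
# Step 2: Split the data ("data" is your data frame; about 60% train, 40% test)
index <- sample(c(TRUE, FALSE), nrow(data), replace = TRUE, prob = c(0.6, 0.4))
train <- data[index, ]
test <- data[!index, ]
# Step 3: Training ("classifier" is a placeholder for the model function)
model <- classifier(y ~ ., data = train)
# Step 4: Prediction
class <- predict(model, newdata = test)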
Multiple Linear
Regression
Split the Boston dataset
### Splitting data
data(Boston)
nr <- nrow(Boston)
inTrain <- sample(1:nr, 0.6*nr)
bh.train <- Boston[inTrain,]
bh.test <- Boston[-inTrain,]
Making tasks and learner
### Making Tasks
library(mlr)
regr.task = makeRegrTask(id = "bh", data
= bh.train, target= "medv")
regr.task
### Making learner
regr.lrn = makeLearner("regr.lm")
regr.lrn
Train the model
mod = train(regr.lrn, regr.task)
mod
names(mod)
getLearnerModel(mod)
Make predictions
regr.pred = predict(mod, newdata =
bh.test)
regr.pred
performance(regr.pred, measures =
list(rmse))
head(getPredictionTruth(regr.pred))
head(getPredictionResponse(regr.pred))
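For comparison, the same regression can be sketched without mlr, using base R's `lm()` and an RMSE computed by hand. The seed and variable names are assumptions made for this illustration.

```r
# Sketch: multiple linear regression with lm() and a hand-computed RMSE
library(MASS)                                 # provides the Boston data
data(Boston)
set.seed(1)                                   # arbitrary seed
nr <- nrow(Boston)
inTrain <- sample(1:nr, 0.6 * nr)
bh.train <- Boston[inTrain, ]
bh.test  <- Boston[-inTrain, ]

fit <- lm(medv ~ ., data = bh.train)          # medv explained by all other columns
pred <- predict(fit, newdata = bh.test)

# Root mean squared error on the held-out rows
rmse.test <- sqrt(mean((bh.test$medv - pred)^2))
rmse.test
```

The RMSE is in the units of medv, so a value of 5 would mean a typical error of about $5,000.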
Visualize Results
plotLearnerPrediction(regr.lrn,
features="lstat", task=regr.task)
Ex: Multiple Linear Regression
Use the mlr linear regression learner to build a model to
predict the median house price (MEDV) using the
Boston dataset
Time: 5 mins
Logistic Regression
Classifier
Split the Iris Dataset
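data(iris)
# logistic regression needs a two-class target: keep two species
iris2 = subset(iris, subset = Species %in% c("versicolor","virginica"))
iris2$Species = factor(iris2$Species)
nr <- nrow(iris2)
inTrain <- sample(1:nr, 0.6*nr)
ir2.train <- iris2[inTrain,]
ir2.test <- iris2[-inTrain,]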
Making task and learner
log.task = makeClassifTask(id = "ir2", data =
ir2.train, target= "Species")
log.task
log.lrn = makeLearner("classif.logreg", predict.type = "prob")
# predict.type = "prob" returns class probabilities (needed for the ROC curve)
log.lrn
Train the model
mod = train(log.lrn, log.task)
mod
names(mod)
getLearnerModel(mod)
Make Prediction
log.pred = predict(mod, newdata =
ir2.test)
log.pred
performance(log.pred, measures =
list(mmce, acc))
head(getPredictionTruth(log.pred))
head(getPredictionResponse(log.pred))
Verify Model
### Confusion Matrix
calculateConfusionMatrix(log.pred)
### ROC curve
# for ROC the prediction must be of type "prob"
df = generateThreshVsPerfData(log.pred, measures = list(fpr, tpr, mmce))
plotROCCurves(df)
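The measures used above can be reproduced by hand. The sketch below fits a logistic regression with base R's `glm()` and derives accuracy and mean misclassification error from a confusion matrix; the two petal features, the 0.5 threshold and the seed are assumptions made for this illustration.

```r
# Sketch: logistic regression with glm() and a hand-built confusion matrix
data(iris)
iris2 <- subset(iris, Species %in% c("versicolor", "virginica"))
iris2$Species <- factor(iris2$Species)        # drop the unused "setosa" level
set.seed(1)                                   # arbitrary seed
nr <- nrow(iris2)
inTrain <- sample(1:nr, 0.6 * nr)
ir2.train <- iris2[inTrain, ]
ir2.test  <- iris2[-inTrain, ]

fit <- glm(Species ~ Petal.Length + Petal.Width,
           data = ir2.train, family = binomial)
# glm models the probability of the second factor level ("virginica")
prob <- predict(fit, newdata = ir2.test, type = "response")
pred <- factor(ifelse(prob > 0.5, "virginica", "versicolor"),
               levels = levels(ir2.test$Species))

cm <- table(truth = ir2.test$Species, response = pred)
cm
acc  <- sum(diag(cm)) / sum(cm)               # accuracy
mmce <- 1 - acc                               # mean misclassification error
c(acc = acc, mmce = mmce)
```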
Visualize Results
plotLearnerPrediction(log.lrn,
features=c("Petal.Length","Petal.Width"),
task=log.task)
Ex: Logistic Regression Classifier
Use Logistic regression to build a model to
predict am variable using mtcars dataset
Time: 5 mins
Linear Discriminant
Analysis
Split the Iris Dataset
nr <- nrow(iris2)
inTrain <- sample(1:nr, 0.6*nr)
ir2.train <- iris2[inTrain,]
ir2.test <- iris2[-inTrain,]
Making task and learner
lda.task = makeClassifTask(id = "ir2", data =
ir2.train, target= "Species")
lda.task
lda.lrn = makeLearner("classif.lda")
lda.lrn
Train the model
mod = train(lda.lrn, lda.task)
mod
names(mod)
getLearnerModel(mod)
Make Prediction
lda.pred = predict(mod, newdata =
ir2.test)
lda.pred
performance(lda.pred, measures =
list(mmce, acc))
head(getPredictionTruth(lda.pred))
head(getPredictionResponse(lda.pred))
Verify Model
### Confusion Matrix
calculateConfusionMatrix(lda.pred)
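LDA is not limited to two classes. As a sketch under that assumption, the example below calls `MASS::lda()` directly on all three iris species rather than on the two-class iris2 subset used in the slides.

```r
# Sketch: linear discriminant analysis with MASS::lda() on three classes
library(MASS)
data(iris)
set.seed(1)                                   # arbitrary seed
nr <- nrow(iris)
inTrain <- sample(1:nr, 0.6 * nr)

fit <- lda(Species ~ ., data = iris[inTrain, ])
pred <- predict(fit, newdata = iris[-inTrain, ])$class

cm <- table(truth = iris$Species[-inTrain], response = pred)
cm
mean(pred == iris$Species[-inTrain])          # accuracy on the test rows
```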
Visualize Results
plotLearnerPrediction(lda.lrn,
features=c("Petal.Length","Petal.Width"),
task=lda.task)
Ex: Linear Discriminant
Use LDA to build a model to predict gear
variable using mtcars dataset
Time: 5 mins
Decision Tree
Split the Iris Dataset
nr <- nrow(iris2)
inTrain <- sample(1:nr, 0.6*nr)
ir2.train <- iris2[inTrain,]
ir2.test <- iris2[-inTrain,]
Making task and learner
rpart.task = makeClassifTask(id = "ir2", data
= ir2.train, target= "Species")
rpart.task
rpart.lrn = makeLearner("classif.rpart")
rpart.lrn
Train the model
mod = train(rpart.lrn, rpart.task)
mod
names(mod)
getLearnerModel(mod)
Make Prediction
rpart.pred = predict(mod, newdata =
ir2.test)
rpart.pred
performance(rpart.pred, measures =
list(mmce, acc))
head(getPredictionTruth(rpart.pred))
head(getPredictionResponse(rpart.pred))
Verify Model
### Confusion Matrix
calculateConfusionMatrix(rpart.pred)
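A decision tree is easiest to understand by reading its split rules. The following sketch calls `rpart()` directly on the full three-class iris data (an assumption made for illustration) and prints the fitted tree.

```r
# Sketch: a classification tree with rpart() and its split rules
library(rpart)
data(iris)
set.seed(1)                                   # arbitrary seed
nr <- nrow(iris)
inTrain <- sample(1:nr, 0.6 * nr)

fit <- rpart(Species ~ ., data = iris[inTrain, ], method = "class")
print(fit)                                    # text view of the split rules

pred <- predict(fit, newdata = iris[-inTrain, ], type = "class")
table(truth = iris$Species[-inTrain], response = pred)
```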
Visualize Results
plotLearnerPrediction(rpart.lrn,
features=c("Petal.Length","Petal.Width"),
task=rpart.task)
Ex: Decision Tree
Use Decision Tree to build a model to
predict gear variable using mtcars dataset
Time: 5 mins
Random Forest
Split the Iris Dataset
nr <- nrow(iris2)
inTrain <- sample(1:nr, 0.6*nr)
ir2.train <- iris2[inTrain,]
ir2.test <- iris2[-inTrain,]
Making task and learner
rf.task = makeClassifTask(id = "ir2", data =
ir2.train, target= "Species")
rf.task
rf.lrn = makeLearner("classif.randomForest")
rf.lrn
Train the model
mod = train(rf.lrn, rf.task)
mod
names(mod)
getLearnerModel(mod)
Make Prediction
rf.pred = predict(mod, newdata =
ir2.test)
rf.pred
performance(rf.pred, measures =
list(mmce, acc))
head(getPredictionTruth(rf.pred))
head(getPredictionResponse(rf.pred))
Verify Model
### Confusion Matrix
calculateConfusionMatrix(rf.pred)
Visualize Results
plotLearnerPrediction(rf.lrn,
features=c("Petal.Length","Petal.Width"),
task=rf.task)
Ex: Random Forest
Use Random Forest to build a model to
predict gear variable using mtcars
dataset
Time: 5 mins
Gradient Booster
Split the Iris Dataset
nr <- nrow(iris2)
inTrain <- sample(1:nr, 0.6*nr)
ir2.train <- iris2[inTrain,]
ir2.test <- iris2[-inTrain,]
Making task and learner
gbm.task = makeClassifTask(id = "ir2", data
= ir2.train, target= "Species")
gbm.task
gbm.lrn = makeLearner("classif.gbm")
gbm.lrn
Train the model
mod = train(gbm.lrn, gbm.task)
mod
names(mod)
getLearnerModel(mod)
Make Prediction
gbm.pred = predict(mod, newdata =
ir2.test)
gbm.pred
performance(gbm.pred, measures =
list(mmce, acc))
head(getPredictionTruth(gbm.pred))
head(getPredictionResponse(gbm.pred))
Verify Model
### Confusion Matrix
calculateConfusionMatrix(gbm.pred)
Visualize Results
plotLearnerPrediction(gbm.lrn,
features=c("Petal.Length","Petal.Width"),
task=gbm.task)
Ex: Gradient Boost
Use GBM tree to build a model to predict
gear variable using mtcars dataset
Time: 5 mins
XG Boost
Split the Iris Dataset
nr <- nrow(iris2)
inTrain <- sample(1:nr, 0.6*nr)
ir2.train <- iris2[inTrain,]
ir2.test <- iris2[-inTrain,]
Making task and learner
xg.task = makeClassifTask(id = "ir2", data =
ir2.train, target= "Species")
xg.task
xg.lrn = makeLearner("classif.xgboost")
xg.lrn
Train the model
mod = train(xg.lrn, xg.task)
mod
names(mod)
getLearnerModel(mod)
Make Prediction
xg.pred = predict(mod, newdata =
ir2.test)
xg.pred
performance(xg.pred, measures =
list(mmce, acc))
head(getPredictionTruth(xg.pred))
head(getPredictionResponse(xg.pred))
Verify Model
### Confusion Matrix
calculateConfusionMatrix(xg.pred)
Visualize Results
plotLearnerPrediction(xg.lrn,
features=c("Petal.Length","Petal.Width"),
task=xg.task)
Ex: XG Boost
Use XG boost tree to build a model to
predict gear variable using mtcars
dataset
Time: 5 mins
Naïve Bayes
Split the Iris Dataset
nr <- nrow(iris2)
inTrain <- sample(1:nr, 0.6*nr)
ir2.train <- iris2[inTrain,]
ir2.test <- iris2[-inTrain,]
Making task and learner
nb.task = makeClassifTask(id = "ir2", data =
ir2.train, target= "Species")
nb.task
nb.lrn = makeLearner("classif.naiveBayes")
nb.lrn
Train the model
mod = train(nb.lrn, nb.task)
mod
names(mod)
getLearnerModel(mod)
Make Prediction
nb.pred = predict(mod, newdata =
ir2.test)
nb.pred
performance(nb.pred, measures =
list(mmce, acc))
head(getPredictionTruth(nb.pred))
head(getPredictionResponse(nb.pred))
Verify Model
### Confusion Matrix
calculateConfusionMatrix(nb.pred)
Visualize Results
plotLearnerPrediction(nb.lrn,
features=c("Petal.Length","Petal.Width"),
task=nb.task)
Ex: Naive Bayes
Use Naive Bayes to build a model to
predict gear variable using mtcars
dataset
Time: 5 mins
k Nearest Neighbour
Split the Iris Dataset
nr <- nrow(iris2)
inTrain <- sample(1:nr, 0.6*nr)
ir2.train <- iris2[inTrain,]
ir2.test <- iris2[-inTrain,]
Making task and learner
knn.task = makeClassifTask(id = "ir2", data =
ir2.train, target= "Species")
knn.task
knn.lrn = makeLearner("classif.knn")
knn.lrn
Train the model
mod = train(knn.lrn, knn.task)
mod
names(mod)
getLearnerModel(mod)
Make Prediction
knn.pred = predict(mod, newdata =
ir2.test)
knn.pred
performance(knn.pred, measures =
list(mmce, acc))
head(getPredictionTruth(knn.pred))
head(getPredictionResponse(knn.pred))
Verify Model
### Confusion Matrix
calculateConfusionMatrix(knn.pred)
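kNN has no separate training step: each test row is classified by a vote among its nearest training rows. A minimal sketch with `class::knn()` follows; k = 3, the seed and the use of all three species are assumptions made for illustration.

```r
# Sketch: k nearest neighbours with class::knn()
library(class)
data(iris)
set.seed(1)                                   # arbitrary seed (knn breaks ties at random)
nr <- nrow(iris)
inTrain <- sample(1:nr, 0.6 * nr)
train.x <- iris[inTrain, 1:4]                 # numeric features only
test.x  <- iris[-inTrain, 1:4]

# Each test row gets the majority class of its 3 nearest training rows
pred <- knn(train = train.x, test = test.x,
            cl = iris$Species[inTrain], k = 3)

table(truth = iris$Species[-inTrain], response = pred)
mean(pred == iris$Species[-inTrain])          # accuracy on the test rows
```

Because kNN relies on distances, features on very different scales are usually normalized first.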
Visualize Results
plotLearnerPrediction(knn.lrn,
features=c("Petal.Length","Petal.Width"),
task=knn.task)
Ex: k Nearest Neighbour
Use kNN to build a model to predict gear
variable using mtcars dataset
Time: 5 mins
Support Vector
Machines
Split the Iris Dataset
### Splitting data
data(iris)
iris2=subset(iris, subset=iris$Species
%in% c("versicolor","virginica"))
iris2$Species=factor(iris2$Species)
nr <- nrow(iris2)
inTrain <- sample(1:nr, 0.6*nr)
ir2.train <- iris2[inTrain,]
ir2.test <- iris2[-inTrain,]
Making task and learner
svm.task = makeClassifTask(id = "ir2", data
= ir2.train, target= "Species")
svm.task
svm.lrn = makeLearner("classif.svm")
svm.lrn
Train the model
mod = train(svm.lrn, svm.task)
mod
names(mod)
getLearnerModel(mod)
Make Prediction
svm.pred = predict(mod, newdata =
ir2.test)
svm.pred
performance(svm.pred, measures =
list(mmce, acc))
head(getPredictionTruth(svm.pred))
head(getPredictionResponse(svm.pred))
Verify Model
### Confusion Matrix
calculateConfusionMatrix(svm.pred)
Visualize Results
plotLearnerPrediction(svm.lrn,
features=c("Petal.Length","Petal.Width"),
task=svm.task)
Ex: Support Vector Machines
Use svm to build a model to predict am
variable using mtcars dataset
Time: 5 mins
Unsupervised
Learning
What is Unsupervised Learning
• In Unsupervised Learning, we have a
dataset consisting of features only,
without labels.
• The most common method is cluster
analysis, which is used for exploratory
data analysis to find hidden patterns or
grouping in data.
Unsupervised Learning Examples
• Image grouping
• Grouping of drug molecules
K-means clustering
Split the Iris Dataset
### Splitting data
data(iris)
nr <- nrow(iris)
inTrain <- sample(1:nr, 0.6*nr)
ir2.train <- iris[inTrain,-5]
ir2.test <- iris[-inTrain,-5]
ir2.Class<- iris[-inTrain,5]
## clustering only deals with numeric data
Making task and learner
kmeans.task = makeClusterTask(id = "ir2",
data = ir2.train)
kmeans.task
kmeans.lrn = makeLearner("cluster.kmeans",
centers = 3)
# specify how many clusters you want
kmeans.lrn
Train the model
mod = train(kmeans.lrn, kmeans.task)
mod
names(mod)
getLearnerModel(mod)
Make Prediction
kmeans.pred = predict(mod, newdata =
ir2.test)
kmeans.pred
head(getPredictionResponse(kmeans.pred))
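Since clustering uses no labels, one way to judge the result on iris is to compare the clusters with the known species afterwards. The sketch below uses base R's `kmeans()`; the seed and `nstart = 10` are assumptions made for illustration.

```r
# Sketch: k-means with stats::kmeans(), compared with the known species
data(iris)
set.seed(1)                                   # k-means starts from random centers
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

km$size                                       # number of rows in each cluster
# The species column was not used for fitting; it is only used here for comparison
table(cluster = km$cluster, species = iris$Species)
```

Cluster numbers are arbitrary labels, so cluster 1 does not have to correspond to the first species.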
Visualize Results
plotLearnerPrediction(kmeans.lrn,
features=c("Petal.Length","Petal.Width"),
task=kmeans.task)
Ex: kMeans
Use k-means to cluster the mtcars
dataset into 5 groups
Time: 5 mins
Dimensionality Reduction - PCA
cor(iris[,1:4])
names(iris)
plot(iris$Sepal.Length, col = iris$Species)
plot(iris$Sepal.Width, col = iris$Species)
## PCA on full data set
pc2 <- prcomp(iris[,1:4], scale. = TRUE)
pc2$x[1:3,]
plot(pc2$x[,1], col = iris$Species)
Dimensionality Reduction - PCA
# use a new name so the iris2 data frame from earlier sections is not overwritten
iris.sc <- scale(iris[,1:4])
iris.sc[1:5,]%*%pc2$rotation
pc2$x[1:5,]
summary(pc2)
vars <- apply(pc2$x, 2, var)
props <- vars / sum(vars)
cumsum(props)
barplot(cumsum(props))
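The two PCA slides rest on two facts: the scores are the scaled data multiplied by the rotation matrix, and the variance explained comes from the squared standard deviations of the components. A compact sketch that checks both, with variable names chosen for this illustration:

```r
# Sketch: PCA scores and proportion of variance explained
data(iris)
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance per component, from the component standard deviations
props <- pc$sdev^2 / sum(pc$sdev^2)
round(cumsum(props), 3)                       # cumulative proportion

# The scores are the centred, scaled data multiplied by the rotation matrix
scores <- scale(iris[, 1:4]) %*% pc$rotation
all.equal(unname(scores), unname(pc$x))
```

The cumulative proportions help decide how many components to keep.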
Neural Network
(optional)
One Layer MLP
Split the Iris Dataset
### Splitting data
data(iris)
nr <- nrow(iris)
inTrain <- sample(1:nr, 0.6*nr)
ir2.train <- iris[inTrain,]
ir2.test <- iris[-inTrain,]
Making task and learner
nn.task = makeClassifTask(id = "ir2", data =
ir2.train, target= "Species")
nn.task
nn.lrn = makeLearner("classif.nnet")
Train the model
mod = train(nn.lrn, nn.task)
mod
names(mod)
getLearnerModel(mod)
Make Prediction
nn.pred = predict(mod, newdata =
ir2.test)
nn.pred
performance(nn.pred, measures =
list(mmce, acc))
head(getPredictionTruth(nn.pred))
head(getPredictionResponse(nn.pred))
Verify Model
### Confusion Matrix
calculateConfusionMatrix(nn.pred)
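The one-hidden-layer network can also be sketched by calling `nnet()` directly. The hidden layer size of 3 and the seed are assumptions made for this illustration; results vary with the random starting weights.

```r
# Sketch: a single-hidden-layer network with nnet::nnet()
library(nnet)
data(iris)
set.seed(1)                                   # weights are initialised at random
nr <- nrow(iris)
inTrain <- sample(1:nr, 0.6 * nr)

# size = number of units in the hidden layer (assumed value)
fit <- nnet(Species ~ ., data = iris[inTrain, ], size = 3, trace = FALSE)
pred <- predict(fit, newdata = iris[-inTrain, ], type = "class")

table(truth = iris$Species[-inTrain], response = pred)
```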
Visualize Results
plotLearnerPrediction(nn.lrn,
features=c("Petal.Length","Petal.Width"),
task=nn.task)
Ex: Neural Network
Use neural nets to build a model to
predict gear variable using mtcars
dataset
Time: 5 mins
Summary
Parting
Message
Q&A
Feedback
https://goo.gl/EDezXH
Thank You!
Ghazaleh Babanejad
[email protected]
01123005257