PE IV - Practical Machine Learning
in sample error = the error from applying your prediction algorithm to the dataset you built it with
– also known as resubstitution error
– often optimistic (lower than on a new sample) because the model may be tuned to the noise of that particular sample
out of sample error = the error from applying your prediction algorithm to a new data set
– also known as generalization error
– the out of sample error is the more important of the two, as it better reflects how the model will perform on new data
in sample error < out of sample error
– the reason is over-fitting: the model is too adapted/optimized to the initial dataset
when discussing the outcome decided on by the algorithm, positive = identified and negative = rejected
– True positive = correctly identified (predicted true when true)
– False positive = incorrectly identified (predicted true when false)
– True negative = correctly rejected (predicted false when false)
– False negative = incorrectly rejected (predicted false when true)
example: medical testing (a short sketch of the resulting rates follows this list)
– True positive = Sick people correctly diagnosed as sick
– False positive = Healthy people incorrectly identified as sick
– True negative = Healthy people correctly identified as healthy
– False negative = Sick people incorrectly identified as healthy
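A minimal sketch of how these counts translate into sensitivity, specificity, and accuracy, using made-up vectors of true and predicted diagnoses (values are illustrative only):
# hypothetical true and predicted diagnoses (illustrative values only)
truth     <- factor(c("sick","sick","healthy","healthy","sick","healthy"), levels=c("sick","healthy"))
predicted <- factor(c("sick","healthy","healthy","sick","sick","healthy"), levels=c("sick","healthy"))
tab <- table(predicted, truth)                # rows = prediction, columns = truth
TP <- tab["sick","sick"];    FP <- tab["sick","healthy"]
FN <- tab["healthy","sick"]; TN <- tab["healthy","healthy"]
sensitivity <- TP / (TP + FN)                 # probability of detecting sickness when truly sick
specificity <- TN / (TN + FP)                 # probability of clearing a truly healthy person
accuracy    <- (TP + TN) / sum(tab)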
Receiver Operating Characteristic Curves:
a commonly used technique to measure the quality of a prediction algorithm.
predictions for binary classification are often quantitative (e.g. a probability, or a score on a scale of 1 to 10)
different cutoffs/thresholds for classification (e.g. score > 0.8 → one outcome) yield different results/predictions
Receiver Operating Characteristic curves are generated to compare the different outcomes
x-axis = 1 - specificity (or, probability of false positive)
y-axis = sensitivity (or, probability of true positive)
each plotted point corresponds to one cutoff (one sensitivity/specificity combination)
area under the curve (AUC) quantifies how good the prediction model is: 0.5 is no better than random guessing, 1.0 is a perfect classifier
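A minimal sketch of building an ROC curve and its AUC with the pROC package; the labels and scores below are simulated, purely for illustration:
library(pROC)
set.seed(1)
labels <- factor(rep(c("disease","healthy"), each=50))                  # simulated true classes
scores <- c(rnorm(50, mean=0.7, sd=0.2), rnorm(50, mean=0.4, sd=0.2))   # simulated predicted scores
roc_obj <- roc(response=labels, predictor=scores)   # evaluates every cutoff
plot(roc_obj)                                       # sensitivity vs. 1 - specificity
auc(roc_obj)                                        # area under the curve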
Cross Validation:
Random subsampling: a test set is randomly sampled (subsetted out) from the original training set; the predictor is built on the remaining training data and applied to the test set
K-folds method: break the data into k subsets (folds); for each fold, build the model on the other k-1 folds and test on the held-out fold, then average the results. Larger k = less bias, more variance; smaller k = more bias, less variance
library(caret)
k <- 5 # Number of folds for cross-validation
folds <- createFolds(data$Species, k = k, list = TRUE, returnTrain = FALSE)
# Explanation of Parameters:
# data$Species: The outcome variable or grouping factor used for stratified sampling
# k: Number of folds for cross-validation
# list = TRUE: Returns a list of indices; each list element represents a fold
# returnTrain = FALSE: Only the indices for the test/validation sets are returned
Leave one out: leave exactly one sample out, train the model on all of the remaining samples, and test on the one left out; repeat so each sample is left out once.
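Leave-one-out can be requested directly through caret's trainControl; a sketch on the iris data (method "rpart" is an arbitrary choice for illustration):
library(caret)
ctrl <- trainControl(method="LOOCV")                 # leave-one-out cross-validation
fit <- train(Species ~ ., data=iris, method="rpart", trControl=ctrl)
fit$results                                          # accuracy estimated from the held-out samples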
Data Preprocessing: (Caret Package)
createDataPartition - splits the data into training and test partitions
library(caret)
createDataPartition(y=data$var, times=1, p=0.75, list=FALSE)
createFolds - Multiple folds in the data
library(caret)
createFolds(y=data$var, k=10, list=TRUE, returnTrain=TRUE)
createResample - creates multiple bootstrap resamples of the data
library(caret)
resamples <- createResample(y=spam$type,times=10,list=TRUE)
createTimeSlices - creates cross-validation slices for time series data (initialWindow = size of each training window, horizon = number of subsequent values to predict)
library(caret)
tme <- 1:1000
# create time slices
folds <- createTimeSlices(y=tme,initialWindow=20,horizon=10)
Training Options:
train() default method is "rf" (random forest)
train(y ~ x, data=df, method="glm")
# function to apply the machine learning algorithm to construct model from training data
trainControl() creates an object that sets many options for how the model will be trained
method = "boot" -> bootstrapping (resampling with replacement)
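A sketch of setting the training options explicitly, swapping the default bootstrap for 10-fold cross-validation; mtcars is used purely for illustration:
library(caret)
ctrl <- trainControl(method="cv", number=10)                        # 10-fold CV instead of the default "boot"
modFit <- train(mpg ~ ., data=mtcars, method="lm", trControl=ctrl)  # mtcars used purely for illustration
modFit$results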
Plotting Predictors:
featurePlot(x=preds, y= outcomes, plot = "pairs")
#Plots relationship between preds and outcomes
qplot(age, wage, color=education, data=training)
# qplot can also be used, colouring points by a third variable
cut2(variable, g=3)  # from the Hmisc package
# creates a new factor variable by cutting the specified variable into g groups (3 here) based on quantiles
Centering and scaling:
train(y~x, data=training, preProcess = c("center","scale"))
# other preProcess methods include BoxCox and knnImpute
Creating dummy variables by converting factor variables
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# create a dummy variable object
dummies <- dummyVars(wage ~ jobclass,data=training)
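The dummy variable object is then applied with predict() to generate the indicator columns (assuming the Wage data split above):
# generate the 0/1 indicator columns for jobclass on the training set
head(predict(dummies, newdata=training))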
Removing near zero covariates, since they have (almost) no variability and can hurt the model
nearZeroVar(training, saveMetrics = TRUE)
# returns variability metrics (freqRatio, percentUnique, zeroVar, nzv) for each variable
Creating splines with the splines package: bs() and poly() functions
library(splines)
bsBasis <- bs(training$age, df=3)  # spline basis with 3 degrees of freedom for age
Multicore parallel processing:
doMC::registerDoMC(cores=4)
# registers multiple cores for compute-intensive model fitting; requires the doMC package
PCA is used to reduce the number of predictors (data compression) while capturing most of the information; most useful for linear-type models
pr <- prcomp(data)
# performs PCA on the data
# note: RMSE on the test set is typically greater than RMSE on the training set
# Generate sample data
set.seed(123)
data <- matrix(rnorm(100), ncol = 5) # Creating a random data matrix
# Perform PCA
pca_result <- prcomp(data, scale. = TRUE) # Scale. = TRUE scales the variables to have unit variance
# Standard deviation of each principal component
sd_pca <- pca_result$sdev
# Eigenvectors (loadings) of each principal component
eigenvectors <- pca_result$rotation
# Percentage of variance explained by each principal component
percentage_var <- (sd_pca^2) / sum(sd_pca^2) * 100
# load the spam data and create train and test sets
library(caret); library(kernlab); data(spam)
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# create preprocess object
preProc <- preProcess(log10(training[,-58]+1),method="pca",pcaComp=2)
# calculate PCs for training data
trainPC <- predict(preProc,log10(training[,-58]+1))
# run model on the outcome and principal components
modelFit <- train(training$type ~ .,method="glm",data=trainPC)
# calculate PCs for test data
testPC <- predict(preProc,log10(testing[,-58]+1))
# compare results
confusionMatrix(testing$type,predict(modelFit,testPC))
Prediction with Trees: iteratively split the variables into groups; produces a nonlinear model
Measures of impurity (formulas below):
Misclassification error: 1 minus the proportion of the most common class in the leaf; 0 = perfect purity, 0.5 = no purity
Gini index: 0 = perfect purity, 0.5 = no purity
Deviance: 0 = perfect purity, 1 = no purity (natural log)
Information gain: 0 = perfect purity, 1 = no purity (log base 2)
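With \hat{p}_{mk} denoting the proportion of class k in leaf m, these measures have the standard forms (LaTeX notation):
\text{misclassification error} = 1 - \hat{p}_{m\,k(m)} \qquad \text{Gini} = \sum_{k} \hat{p}_{mk}(1 - \hat{p}_{mk}) = 1 - \sum_{k} \hat{p}_{mk}^{2}
\text{deviance} = -\sum_{k} \hat{p}_{mk}\,\ln \hat{p}_{mk} \qquad \text{information} = -\sum_{k} \hat{p}_{mk}\,\log_{2} \hat{p}_{mk}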
#Constructing trees
train(y~., data=train, method="rpart")
Bagging (bootstrap aggregating): resample the training dataset with replacement and refit the model on each resample
Average the predictions together or take a majority vote; caret bagging methods include bagEarth, treebag, bagFDA
bag(x=predictors, y=outcome, B=10, bagControl=bagControl(fit=ctreeBag$fit, predict=ctreeBag$pred, aggregate=ctreeBag$aggregate))
# B = number of bootstrap samples; ctreeBag (shipped with caret) is one example set of fit/predict/aggregate functions
Random forest: extension of bagging on classification trees
Process: take bootstrap samples; at each split, bootstrap (randomly subset) the candidate variables; grow multiple trees and vote/average them
drawbacks: overfitting, slow, hard to interpret
#Constructing trees random forest method
rf<-train(outcome~., data=train, method="rf",ntree=500)
#ntree=500 = specify number of trees that should be constructed
Boosting (e.g. gradient boosting): take a group of possibly weak predictors, weight them and add them up to form a stronger predictor; method="gbm" does boosting with trees
#Boosting
gbm <- train(outcome ~ variables, method="gbm", data=train, verbose=F)
Model-based prediction: applies Bayes' theorem under certain distributional assumptions
linear discriminant analysis: assumes the same covariance matrix for every class
method = "lda"
lda <- train(Species ~ .,data=training,method="lda")
# predict test outcomes using LDA model
pred.lda <- predict(lda,testing)
quadratic discriminant analysis: allows a different covariance matrix for each class
Naive Bayes: assumes the predictor variables are independent within each class; the class probability is proportional to the numerator of Bayes' theorem (prior × product of per-predictor likelihoods)
method = "nb"
nb <- train(Species ~ ., data=training,method="nb")
# predict test outcomes using naive Bayes model
pred.nb <- predict(nb,testing)
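A quick way to compare the two classifiers on the test set (assuming training and testing were split from the iris data as above):
table(pred.lda, pred.nb)                                       # agreement between the two models
confusionMatrix(testing$Species, pred.lda)$overall["Accuracy"]
confusionMatrix(testing$Species, pred.nb)$overall["Accuracy"]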
Model Selection:
1) Training set: error keeps decreasing as more predictors are added
2) Test set: error decreases and then increases as more predictors are added
3) Goal: avoid overfitting and minimize test-set error
How?
1) Split the data into training/test/validation sets (e.g. 60/20/20)
2) Pick the model that minimizes the expected prediction error
Problems: needs a lot of data, high computation/complexity
goal of prediction model = minimize overall expected prediction error
Regularized Regression Concept:
Regularization shrinks large coefficients; this increases bias but decreases variance, which can reduce prediction error
Penalized residual sum of squares (PRSS):
PRSS = prediction squared error + penalty term
Higher PRSS = worse fit; lambda (λ) is the tuning parameter controlling the strength of the penalty
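For ridge regression the PRSS takes the standard form (λ is the tuning parameter):
\mathrm{PRSS}(\beta) = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda \sum_{j=1}^{p} \beta_j^{2}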
# as λ → 0, the ridge solution approaches the ordinary least squares solution
# as λ → ∞, the penalty dominates and all of the ridge coefficients shrink toward zero collectively
# Ridge Regression (lm.ridge is in the MASS package)
library(MASS)
ridge <- lm.ridge(outcome ~ predictors, data=training, lambda=5)
Lasso regression
Similar to ridge regression,
controls the size of the coefficients / the amount of regularization
large values of λ set some coefficients exactly to zero (so lasso also performs variable selection)
# lars package
library(lars)
lasso <- lars(as.matrix(x), y, type="lasso", trace=TRUE)
Combining Predictors:
Combine multiple classifiers by majority vote or by averaging their predictions
this generally improves accuracy but reduces interpretability and increases computation
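A minimal stacking sketch in caret; the data are simulated and the two base methods (lm, rpart) are arbitrary choices for illustration:
library(caret)
set.seed(62433)
# simulated data, purely for illustration
n <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2*df$x1 - df$x2 + rnorm(n)
inTrain  <- createDataPartition(df$y, p=0.7, list=FALSE)
training <- df[inTrain, ]; testing <- df[-inTrain, ]
mod1 <- train(y ~ ., data=training, method="lm")      # first base model
mod2 <- train(y ~ ., data=training, method="rpart")   # second base model
pred1 <- predict(mod1, testing); pred2 <- predict(mod2, testing)
# fit a combining model on the two sets of predictions (for classes, a majority vote could be used instead)
combDF   <- data.frame(pred1, pred2, y = testing$y)
combMod  <- train(y ~ ., data=combDF, method="lm")
combPred <- predict(combMod, combDF)
In practice the combining model would be fit on a separate validation set and the stacked predictor evaluated on a final held-out test set.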
Forecasting:
Predicting future values using time series data
subsampling/cross-validation is harder because observations are dependent over time
typical patterns: trends, seasonal patterns, cycles
prediction via smoothing, e.g. simple moving averages (SMA) and exponential moving averages/smoothing (EMA)
# forecast package
library(forecast)
ma(ts, order=3)
# calculates the simple moving average for the order specified
# order=3 = order of moving average smoother, effectively the number of values that should be
# used to calculate the moving average
ets(train, model="MMM")
# fits an exponential smoothing model to the training data
# model="MMM": multiplicative error, multiplicative trend, multiplicative seasonality
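Once an ets model is fit, forecasts and their accuracy on held-out data come from forecast() and accuracy(); the series below is simulated, purely for illustration:
library(forecast)
set.seed(1)
ts1 <- ts(cumsum(rnorm(120)), frequency=12)     # simulated monthly series
tsTrain <- window(ts1, end=c(8,12))             # first 8 "years" for training
tsTest  <- window(ts1, start=c(9,1))            # remaining observations for testing
fit   <- ets(tsTrain)                           # exponential smoothing, model chosen automatically
fcast <- forecast(fit, h=length(tsTest))        # forecast as many steps ahead as the test set holds
accuracy(fcast, tsTest)                         # compares forecasts against the held-out values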
Unsupervised prediction:
Used when the class labels are not known in advance.
create clusters, label them, build prediction models on those labels, then predict the clusters for new data
use kmeans()
kmeans(data, centers=3)
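A minimal sketch on iris, pretending the Species labels are unknown: cluster, attach the cluster labels, then train a classifier to predict them:
library(caret)
set.seed(123)
noLabels <- subset(iris, select=-Species)            # drop the "unknown" labels
kM <- kmeans(noLabels, centers=3)                    # create 3 clusters
clustered <- noLabels
clustered$cluster <- as.factor(kM$cluster)           # treat cluster assignments as the new labels
clustFit <- train(cluster ~ ., data=clustered, method="rpart")  # model that predicts cluster membership
predict(clustFit, head(noLabels))                    # predict the cluster for new observations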