PE IV - Practical Machine Learning
in sample error = the error from applying your prediction algorithm to the dataset you built it with
– also known as resubstitution error
– often optimistic (lower than on a new sample) because the model may be tuned to the noise of that particular sample
out of sample error = the error from applying your prediction algorithm to a new data set
– also known as generalization error
– the out of sample error is the more important of the two, as it better reflects how the model will perform on new data
in sample error < out of sample error
– the reason is over-fitting: the model is too adapted/optimized to the initial dataset
when discussing the outcome decided on by the algorithm, positive = identified and negative = rejected
– True positive = correctly identified (predicted true when true)
– False positive = incorrectly identified (predicted true when false)
– True negative = correctly rejected (predicted false when false)
– False negative = incorrectly rejected (predicted false when true)
example: medical testing (a short sketch of the resulting rates follows this list)
– True positive = Sick people correctly diagnosed as sick
– False positive = Healthy people incorrectly identified as sick
– True negative = Healthy people correctly identified as healthy
– False negative = Sick people incorrectly identified as healthy
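A minimal sketch of how these counts translate into sensitivity, specificity, and accuracy, using made-up vectors of true and predicted diagnoses (values are illustrative only):
# hypothetical true and predicted diagnoses (illustrative values only)
truth     <- factor(c("sick","sick","healthy","healthy","sick","healthy"), levels=c("sick","healthy"))
predicted <- factor(c("sick","healthy","healthy","sick","sick","healthy"), levels=c("sick","healthy"))
tab <- table(predicted, truth)                # rows = prediction, columns = truth
TP <- tab["sick","sick"];    FP <- tab["sick","healthy"]
FN <- tab["healthy","sick"]; TN <- tab["healthy","healthy"]
sensitivity <- TP / (TP + FN)                 # probability of detecting sickness when truly sick
specificity <- TN / (TN + FP)                 # probability of clearing a truly healthy person
accuracy    <- (TP + TN) / sum(tab)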
Receiver Operating Characteristic Curves:
a commonly used technique to measure the quality of a prediction algorithm.
predictions for binary classification are often quantitative (e.g. a probability, or a score on a scale of 1 to 10)
different cutoffs/thresholds for classification (e.g. score > 0.8 → one outcome) yield different results/predictions
Receiver Operating Characteristic curves are generated to compare the different outcomes
x-axis = 1 - specificity (or, probability of false positive)
y-axis = sensitivity (or, probability of true positive)
each plotted point corresponds to one cutoff (one sensitivity/specificity combination)
area under the curve (AUC) quantifies how good the prediction model is: 0.5 is no better than random guessing, 1.0 is a perfect classifier
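A minimal sketch of building an ROC curve and its AUC with the pROC package; the labels and scores below are simulated, purely for illustration:
library(pROC)
set.seed(1)
labels <- factor(rep(c("disease","healthy"), each=50))                  # simulated true classes
scores <- c(rnorm(50, mean=0.7, sd=0.2), rnorm(50, mean=0.4, sd=0.2))   # simulated predicted scores
roc_obj <- roc(response=labels, predictor=scores)   # evaluates every cutoff
plot(roc_obj)                                       # sensitivity vs. 1 - specificity
auc(roc_obj)                                        # area under the curve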
Cross Validation:
Random subsampling: a test set is randomly sampled (subsetted out) from the original training set; the predictor is built on the remaining training data and applied to the test set
K-folds method: break the data into k subsets (folds); for each fold, build the model on the other k-1 folds and test on the held-out fold, then average the results. Larger k = less bias, more variance; smaller k = more bias, less variance
library(caret)
k <- 5 # Number of folds for cross-validation
folds <- createFolds(data$Species, k = k, list = TRUE, returnTrain = FALSE)
# Explanation of Parameters:
# data$Species: The outcome variable or grouping factor used for stratified sampling
# k: Number of folds for cross-validation
# list = TRUE: Returns a list of indices; each list element represents a fold
# returnTrain = FALSE: Only the indices for the test/validation sets are returned
Leave one out: leave exactly one sample out, train the model on all of the remaining samples, and test on the one left out; repeat so each sample is left out once.
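Leave-one-out can be requested directly through caret's trainControl; a sketch on the iris data (method "rpart" is an arbitrary choice for illustration):
library(caret)
ctrl <- trainControl(method="LOOCV")                 # leave-one-out cross-validation
fit <- train(Species ~ ., data=iris, method="rpart", trControl=ctrl)
fit$results                                          # accuracy estimated from the held-out samples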
Data Preprocessing: (Caret Package)
createDataPartition - splits the data into training and test partitions
library(caret)
createDataPartition(y=data$var, times=1, p=0.75, list=FALSE)
createFolds - Multiple folds in the data
library(caret)
createFolds(y=data$var, k=10, list=TRUE, returnTrain=TRUE)
createResample - creates multiple bootstrap resamples of the data
library(caret)
resamples <- createResample(y=spam$type,times=10,list=TRUE)
createTimeSlices - creates cross-validation slices for time series data (initialWindow = size of each training window, horizon = number of subsequent values to predict)
library(caret)
tme <- 1:1000
# create time slices
folds <- createTimeSlices(y=tme,initialWindow=20,horizon=10)
Training Options:
train() default method is "rf" (random forest)
train(y ~ x, data=df, method="glm")
# function to apply the machine learning algorithm to construct model from training data
trainControl() creates an object that sets many options for how the model will be trained
method = "boot" -> bootstrapping (resampling with replacement)
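A sketch of setting the training options explicitly, swapping the default bootstrap for 10-fold cross-validation; mtcars is used purely for illustration:
library(caret)
ctrl <- trainControl(method="cv", number=10)                        # 10-fold CV instead of the default "boot"
modFit <- train(mpg ~ ., data=mtcars, method="lm", trControl=ctrl)  # mtcars used purely for illustration
modFit$results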
Plotting Predictors:
featurePlot(x=preds, y= outcomes, plot = "pairs")
#Plots relationship between preds and outcomes
qplot(age, wage, color=education, data=training)
# qplot can also be used, colouring points by a third variable
cut2(variable, g=3)  # from the Hmisc package
# creates a new factor variable by cutting the specified variable into g groups (3 here) based on quantiles
Centering and scaling:
train(y~x, data=training, preProcess = c("center","scale"))
# other preProcess methods include BoxCox and knnImpute
Creating dummy variables by converting factor variables
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# create a dummy variable object
dummies <- dummyVars(wage ~ jobclass,data=training)
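The dummy variable object is then applied with predict() to generate the indicator columns (assuming the Wage data split above):
# generate the 0/1 indicator columns for jobclass on the training set
head(predict(dummies, newdata=training))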
Removing near zero covariates, since they have (almost) no variability and can hurt the model
nearZeroVar(training, saveMetrics = TRUE)
# returns variability metrics (freqRatio, percentUnique, zeroVar, nzv) for each variable
Creating splines with the splines package: bs() and poly() functions
library(splines)
bsBasis <- bs(training$age, df=3)  # spline basis with 3 degrees of freedom for age
Multicore parallel processing:
doMC::registerDoMC(cores=4)
# registers multiple cores for compute-intensive model fitting; requires the doMC package
PCA is used to reduce the number of predictors (data compression) while capturing most of the information; most useful for linear-type models
pr <- prcomp(data)
# performs PCA on the data
# note: RMSE on the test set is typically greater than RMSE on the training set
# Generate sample data
set.seed(123)
data <- matrix(rnorm(100), ncol = 5) # Creating a random data matrix
# Perform PCA
pca_result <- prcomp(data, scale. = TRUE) # Scale. = TRUE scales the variables to have unit variance
# Standard deviation of each principal component
sd_pca <- pca_result$sdev
# Eigenvectors (loadings) of each principal component
eigenvectors <- pca_result$rotation
# Percentage of variance explained by each principal component
percentage_var <- (sd_pca^2) / sum(sd_pca^2) * 100
# load the spam data and create train and test sets
library(caret); library(kernlab); data(spam)
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# create preprocess object
preProc <- preProcess(log10(training[,-58]+1),method="pca",pcaComp=2)
# calculate PCs for training data
trainPC <- predict(preProc,log10(training[,-58]+1))
# run model on the outcome and principal components
modelFit <- train(training$type ~ .,method="glm",data=trainPC)
# calculate PCs for test data
testPC <- predict(preProc,log10(testing[,-58]+1))
# compare results
confusionMatrix(testing$type,predict(modelFit,testPC))
Prediction with Trees: iteratively split the variables into groups; produces a nonlinear model
Measures of impurity (formulas below):
Misclassification error: 1 minus the proportion of the most common class in the leaf; 0 = perfect purity, 0.5 = no purity
Gini index: 0 = perfect purity, 0.5 = no purity
Deviance: 0 = perfect purity, 1 = no purity (natural log)
Information gain: 0 = perfect purity, 1 = no purity (log base 2)
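With \hat{p}_{mk} denoting the proportion of class k in leaf m, these measures have the standard forms (LaTeX notation):
\text{misclassification error} = 1 - \hat{p}_{m\,k(m)} \qquad \text{Gini} = \sum_{k} \hat{p}_{mk}(1 - \hat{p}_{mk}) = 1 - \sum_{k} \hat{p}_{mk}^{2}
\text{deviance} = -\sum_{k} \hat{p}_{mk}\,\ln \hat{p}_{mk} \qquad \text{information} = -\sum_{k} \hat{p}_{mk}\,\log_{2} \hat{p}_{mk}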
#Constructing trees
train(y~., data=train, method="rpart")
Bagging (bootstrap aggregating): resample the training dataset with replacement and refit the model on each resample
Average the predictions together or take a majority vote; caret bagging methods include bagEarth, treebag, bagFDA
bag(x=predictors, y=outcome, B=10, bagControl=bagControl(fit=ctreeBag$fit, predict=ctreeBag$pred, aggregate=ctreeBag$aggregate))
# B = number of bootstrap samples; ctreeBag (shipped with caret) is one example set of fit/predict/aggregate functions
Random forest: extension of bagging on classification trees
Process: take bootstrap samples; at each split, bootstrap (randomly subset) the candidate variables; grow multiple trees and vote/average them
drawbacks: overfitting, slow, hard to interpret
#Constructing trees random forest method
rf<-train(outcome~., data=train, method="rf",ntree=500)
#ntree=500 = specify number of trees that should be constructed
Boosting (e.g. gradient boosting): take a group of possibly weak predictors, weight them and add them up to form a stronger predictor; method="gbm" does boosting with trees
#Boosting
gbm <- train(outcome ~ variables, method="gbm", data=train, verbose=F)
Model-based prediction: applies Bayes' theorem under certain distributional assumptions
linear discriminant analysis: assumes the same covariance matrix for every class
method = "lda"
lda <- train(Species ~ .,data=training,method="lda")
# predict test outcomes using LDA model
pred.lda <- predict(lda,testing)
quadratic discriminant analysis: allows a different covariance matrix for each class
Naive Bayes: assumes the predictor variables are independent within each class; the class probability is proportional to the numerator of Bayes' theorem (prior × product of per-predictor likelihoods)
method = "nb"
nb <- train(Species ~ ., data=training,method="nb")
# predict test outcomes using naive Bayes model
pred.nb <- predict(nb,testing)
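A quick way to compare the two classifiers on the test set (assuming training and testing were split from the iris data as above):
table(pred.lda, pred.nb)                                       # agreement between the two models
confusionMatrix(testing$Species, pred.lda)$overall["Accuracy"]
confusionMatrix(testing$Species, pred.nb)$overall["Accuracy"]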
Model Selection:
1) Training set: error keeps decreasing as more predictors are added
2) Test set: error decreases and then increases as more predictors are added
3) Goal: avoid overfitting and minimize test-set error
How?
1) Split the data into training/test/validation sets (e.g. 60/20/20)
2) Pick the model that minimizes the expected prediction error
Problems: needs a lot of data, high computation/complexity
goal of prediction model = minimize overall expected prediction error
Regularized Regression Concept:
Regularization shrinks large coefficients; this increases bias but decreases variance, which can reduce prediction error
Penalized residual sum of squares (PRSS):
PRSS = prediction squared error + penalty term
Higher PRSS = worse fit; lambda (λ) is the tuning parameter controlling the strength of the penalty
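For ridge regression the PRSS takes the standard form (λ is the tuning parameter):
\mathrm{PRSS}(\beta) = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda \sum_{j=1}^{p} \beta_j^{2}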
# as λ → 0, the ridge solution approaches the ordinary least squares solution
# as λ → ∞, the penalty dominates and all of the ridge coefficients shrink toward zero collectively
# Ridge Regression (lm.ridge is in the MASS package)
library(MASS)
ridge <- lm.ridge(outcome ~ predictors, data=training, lambda=5)
Lasso regression
Similar to ridge regression,
controls the size of the coefficients / the amount of regularization
large values of λ set some coefficients exactly to zero (so lasso also performs variable selection)
# lars package
library(lars)
lasso <- lars(as.matrix(x), y, type="lasso", trace=TRUE)
Combining Predictors:
Combine multiple classifiers by majority vote or by averaging their predictions
this generally improves accuracy but reduces interpretability and increases computation
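A minimal stacking sketch in caret; the data are simulated and the two base methods (lm, rpart) are arbitrary choices for illustration:
library(caret)
set.seed(62433)
# simulated data, purely for illustration
n <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2*df$x1 - df$x2 + rnorm(n)
inTrain  <- createDataPartition(df$y, p=0.7, list=FALSE)
training <- df[inTrain, ]; testing <- df[-inTrain, ]
mod1 <- train(y ~ ., data=training, method="lm")      # first base model
mod2 <- train(y ~ ., data=training, method="rpart")   # second base model
pred1 <- predict(mod1, testing); pred2 <- predict(mod2, testing)
# fit a combining model on the two sets of predictions (for classes, a majority vote could be used instead)
combDF   <- data.frame(pred1, pred2, y = testing$y)
combMod  <- train(y ~ ., data=combDF, method="lm")
combPred <- predict(combMod, combDF)
In practice the combining model would be fit on a separate validation set and the stacked predictor evaluated on a final held-out test set.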
Forecasting:
Predicting future values using time series data
subsampling/cross-validation is harder because observations are dependent over time
typical patterns: trends, seasonal patterns, cycles
prediction via smoothing, e.g. simple moving averages (SMA) and exponential moving averages/smoothing (EMA)
# forecast package
library(forecast)
ma(ts, order=3)
# calculates the simple moving average for the order specified
# order=3 = order of moving average smoother, effectively the number of values that should be
# used to calculate the moving average
ets(train, model="MMM")
# fits an exponential smoothing model to the training data
# model="MMM": multiplicative error, multiplicative trend, multiplicative seasonality
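Once an ets model is fit, forecasts and their accuracy on held-out data come from forecast() and accuracy(); the series below is simulated, purely for illustration:
library(forecast)
set.seed(1)
ts1 <- ts(cumsum(rnorm(120)), frequency=12)     # simulated monthly series
tsTrain <- window(ts1, end=c(8,12))             # first 8 "years" for training
tsTest  <- window(ts1, start=c(9,1))            # remaining observations for testing
fit   <- ets(tsTrain)                           # exponential smoothing, model chosen automatically
fcast <- forecast(fit, h=length(tsTest))        # forecast as many steps ahead as the test set holds
accuracy(fcast, tsTest)                         # compares forecasts against the held-out values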
Unsupervised prediction:
Used when the class labels are not known in advance.
create clusters, label them, build prediction models on those labels, then predict the clusters for new data
use kmeans()
kmeans(data, centers=3)
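A minimal sketch on iris, pretending the Species labels are unknown: cluster, attach the cluster labels, then train a classifier to predict them:
library(caret)
set.seed(123)
noLabels <- subset(iris, select=-Species)            # drop the "unknown" labels
kM <- kmeans(noLabels, centers=3)                    # create 3 clusters
clustered <- noLabels
clustered$cluster <- as.factor(kM$cluster)           # treat cluster assignments as the new labels
clustFit <- train(cluster ~ ., data=clustered, method="rpart")  # model that predicts cluster membership
predict(clustFit, head(noLabels))                    # predict the cluster for new observations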