Machine Learning with mlr in R
Contents
mlr Tutorial 4
Quick start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Basics 5
Learning Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Task types and creation . . . . . . . . . . . . . . . . . . . . . . . 5
Further settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Accessing a learning task . . . . . . . . . . . . . . . . . . . . . . 10
Modifying a learning task . . . . . . . . . . . . . . . . . . . . . . 13
Example tasks and convenience functions . . . . . . . . . . . . . 15
Learners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Constructing a learner . . . . . . . . . . . . . . . . . . . . . . . . 15
Accessing a learner . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Modifying a learner . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Listing learners . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Training a Learner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Accessing learner models . . . . . . . . . . . . . . . . . . . . . . . 23
Further options and comments . . . . . . . . . . . . . . . . . . . 26
Predicting Outcomes for New Data . . . . . . . . . . . . . . . . . . . . 27
Accessing the prediction . . . . . . . . . . . . . . . . . . . . . . . 29
Adjusting the threshold . . . . . . . . . . . . . . . . . . . . . . . 33
Visualizing the prediction . . . . . . . . . . . . . . . . . . . . . . 35
Evaluating Learner Performance . . . . . . . . . . . . . . . . . . . . . 38
Available performance measures . . . . . . . . . . . . . . . . . . . 38
Listing measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Calculate performance measures . . . . . . . . . . . . . . . . . . 40
Access a performance measure . . . . . . . . . . . . . . . . . . . . 41
Binary classification . . . . . . . . . . . . . . . . . . . . . . . . . 42
Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Stratified resampling . . . . . . . . . . . . . . . . . . . . . . . . . 47
Advanced 97
Configuring mlr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Example: Reducing the output on the console . . . . . . . . . . . 97
Accessing and resetting the configuration . . . . . . . . . . . . . 98
Example: Turning off parameter checking . . . . . . . . . . . . . 99
Example: Handling errors in a learning method . . . . . . . . . . 100
Wrapper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Example: Bagging wrapper . . . . . . . . . . . . . . . . . . . . . 102
Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Fusing learners with preprocessing . . . . . . . . . . . . . . . . . 107
Preprocessing with makePreprocWrapperCaret . . . . . . . . . . 108
Writing a custom preprocessing wrapper . . . . . . . . . . . . . . 114
Imputation of Missing Values . . . . . . . . . . . . . . . . . . . . . . . 120
Imputation and reimputation . . . . . . . . . . . . . . . . . . . . 120
Fusing a learner with imputation . . . . . . . . . . . . . . . . . . 125
Generic Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Changing the type of prediction . . . . . . . . . . . . . . . . . . . 128
Advanced Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Iterated F-Racing for mixed spaces and dependencies . . . . . . . 130
Tuning across whole model spaces with ModelMultiplexer . . . . 131
Multi-criteria evaluation and optimization . . . . . . . . . . . . . 133
Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Filter methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Wrapper methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Nested Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Benchmark experiments . . . . . . . . . . . . . . . . . . . . . . . 159
Cost-Sensitive Classification . . . . . . . . . . . . . . . . . . . . . . . . 166
Class-dependent misclassification costs . . . . . . . . . . . . . . . 166
Example-dependent misclassification costs . . . . . . . . . . . . . 180
Imbalanced Classification Problems . . . . . . . . . . . . . . . . . . . . 183
Sampling-based approaches . . . . . . . . . . . . . . . . . . . . . 183
(Simple) over- and undersampling . . . . . . . . . . . . . . . . . 184
Cost-based approaches . . . . . . . . . . . . . . . . . . . . . . . . 188
ROC Analysis and Performance Curves . . . . . . . . . . . . . . . . . 189
Performance plots with plotROCCurves . . . . . . . . . . . . . . 190
Performance plots with asROCRPrediction . . . . . . . . . . . . 196
Viper charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Multilabel Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Creating a task . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Constructing a learner . . . . . . . . . . . . . . . . . . . . . . . . 201
Train . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Predict . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Binary performance . . . . . . . . . . . . . . . . . . . . . . . . . 205
Learning Curve Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Plotting the learning curve . . . . . . . . . . . . . . . . . . . . . 207
Exploring Learner Predictions . . . . . . . . . . . . . . . . . . . . . . . 210
Generating partial dependences . . . . . . . . . . . . . . . . . . . 211
Functional ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Plotting partial dependences . . . . . . . . . . . . . . . . . . . . 220
Classifier Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Evaluating Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . 232
Generating hyperparameter tuning data . . . . . . . . . . . . . . 233
Visualizing the effect of a single hyperparameter . . . . . . . . . 236
Visualizing the effect of 2 hyperparameters . . . . . . . . . . . . 240
Extend 247
Integrating Another Learner . . . . . . . . . . . . . . . . . . . . . . . . 247
Classes, constructors, and naming schemes . . . . . . . . . . . . . 248
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
Survival analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Multilabel classification . . . . . . . . . . . . . . . . . . . . . . . 255
Creating a new feature importance method . . . . . . . . . . . . 256
Registering your learner . . . . . . . . . . . . . . . . . . . . . . . 257
Integrating Another Measure . . . . . . . . . . . . . . . . . . . . . . . 257
Performance measures and aggregation schemes . . . . . . . . . . 258
mlr Tutorial
This web page provides an in-depth introduction on how to use the mlr framework
for machine learning experiments in R.
We focus on the comprehension of the basic functions and applications. More
detailed technical information can be found in the manual pages which are
regularly updated and reflect the documentation of the current package version
on CRAN.
An offline version of this tutorial is available for download
• here for the current mlr release on CRAN
• and here for the mlr devel version on Github.
The tutorial explains the basic analysis of a data set step by step. Please refer
to sections of the menu above: Basics, Advanced, Extend and Appendix.
During the tutorial we present various simple examples from classification,
regression, cluster and survival analysis to illustrate the main features of the
package.
Enjoy reading!
Quick start
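Only the final resampling call of the Quick start example survived in this copy. A minimal setup that makes it runnable might look as follows; the iris data, the LDA learner and stratified 3-fold cross-validation are assumptions, not necessarily the original example.
library(mlr)
## Define the task
task = makeClassifTask(id = "tutorial", data = iris, target = "Species")
## Define the learner
lrn = makeLearner("classif.lda")
## Define the resampling strategy
rdesc = makeResampleDesc("CV", iters = 3L, stratify = TRUE)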
## Do the resampling
r = resample(learner = lrn, task = task, resampling = rdesc,
show.info = FALSE)
Basics
Learning Tasks
Learning tasks encapsulate the data set and further relevant information about
a machine learning problem, for example the name of the target variable for
supervised problems.
The tasks are organized in a hierarchy, with the generic Task at the top. The
following tasks can be instantiated and all inherit from the virtual superclass
Task:
• RegrTask for regression problems,
• ClassifTask for binary and multi-class classification problems (cost-sensitive
classification with class-dependent costs can be handled as well),
• SurvTask for survival analysis,
• ClusterTask for cluster analysis,
• MultilabelTask for multilabel classification problems,
• CostSensTask for general cost-sensitive classification (with example-specific
costs).
To create a task, just call make<TaskType>, e.g., makeClassifTask. All tasks
require an identifier (argument id) and a data.frame (argument data). If no ID
is provided it is automatically generated using the variable name of the data. The
ID will be later used to name results, for example of benchmark experiments, and
to annotate plots. Depending on the nature of the learning problem, additional
arguments may be required and are discussed in the following sections.
Regression
For supervised learning like regression (as well as classification and survival
analysis) we have to specify, in addition to data, the name of the target
variable.
data(BostonHousing, package = "mlbench")
regr.task = makeRegrTask(id = "bh", data = BostonHousing, target = "medv")
regr.task
#> Supervised task: bh
#> Type: regr
#> Target: medv
#> Observations: 506
#> Features:
#> numerics factors ordered
#> 12 1 0
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
As you can see, the Task records the type of the learning problem and basic
information about the data set, e.g., the types of the features (numeric vectors,
factors or ordered factors), the number of observations, or whether missing values
are present.
Creating tasks for classification and survival analysis follows the same scheme;
only the data type of the target variables included in data differs. For
each of these learning problems some specifics are described below.
Classification
For classification the target column has to be a factor.
In the following example we define a classification task for the BreastCancer data
set and exclude the variable Id from all further model fitting and evaluation.
data(BreastCancer, package = "mlbench")
df = BreastCancer
df$Id = NULL
classif.task = makeClassifTask(id = "BreastCancer", data = df,
target = "Class")
classif.task
#> Supervised task: BreastCancer
#> Type: classif
#> Target: Class
#> Observations: 699
#> Features:
#> numerics factors ordered
#> 0 4 5
#> Missings: TRUE
#> Has weights: FALSE
#> Has blocking: FALSE
In binary classification the two classes are usually referred to as positive and
negative class with the positive class being the category of greater interest. This
is relevant for many performance measures like the true positive rate or ROC
curves. Moreover, where possible, mlr permits setting options (like the decision
threshold or class weights) and returns and plots results (like class posterior
probabilities) for the positive class only.
makeClassifTask by default selects the first factor level of the target variable
as the positive class, in the above example benign. Class malignant can be
manually selected as follows:
classif.task = makeClassifTask(id = "BreastCancer", data = df,
target = "Class", positive = "malignant")
Survival analysis
Survival tasks use two target columns. For left and right censored problems
these consist of the survival time and a binary event indicator. For interval
censored data the two target columns must be specified in the "interval2"
format (see Surv).
data(lung, package = "survival")
lung$status = (lung$status == 2) # convert to logical
surv.task = makeSurvTask(data = lung, target = c("time",
"status"))
surv.task
#> Supervised task: lung
#> Type: surv
#> Target: time,status
#> Observations: 228
#> Features:
#> numerics factors ordered
#> 8 0 0
#> Missings: TRUE
#> Has weights: FALSE
#> Has blocking: FALSE
The type of censoring can be specified via the argument censoring, which
defaults to "rcens" for right censored data.
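As a small, hedged illustration of that argument (the value below is just the documented default; other admissible values are listed on the makeSurvTask help page):
surv.task = makeSurvTask(data = lung, target = c("time", "status"), censoring = "rcens")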
Multilabel classification
In multilabel classification each object can belong to more than one category at
the same time.
The data are expected to contain as many target columns as there are class
labels. The target columns should be logical vectors that indicate which class
labels are present. The names of the target columns are taken as class labels
and need to be passed to the target argument of makeMultilabelTask.
In the following example we get the data of the yeast data set, extract the label
names, and pass them to the target argument in makeMultilabelTask.
yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast,
target = labels)
yeast.task
#> Supervised task: multi
#> Type: multilabel
#> Target: label1,label2,label3,label4,label5,label6,label7,label8,label9,label10,label11,label12,label13,label14
#> Observations: 2417
#> Features:
#> numerics factors ordered
#> 103 0 0
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Classes: 14
#>  label1  label2  label3  label4  label5  label6  label7  label8  label9
#>     762    1038     983     862     722     597     428     480     178
#> label10 label11 label12 label13 label14
#>     253     289    1816    1799      34
Cluster analysis
As cluster analysis is unsupervised, the only mandatory argument to construct a
cluster analysis task is the data. Below we create a learning task from the data
set mtcars.
data(mtcars, package = "datasets")
cluster.task = makeClusterTask(data = mtcars)
cluster.task
#> Unsupervised task: mtcars
#> Type: cluster
#> Observations: 32
#> Features:
#> numerics factors ordered
#> 11 0 0
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
Cost-sensitive classification
The standard objective in classification is to obtain a high prediction accuracy,
i.e., to minimize the number of errors. All types of misclassification errors are
thereby deemed equally severe. However, in many applications different kinds of
errors cause different costs.
In case of class-dependent costs, that solely depend on the actual and predicted
class labels, it is sufficient to create an ordinary ClassifTask.
In order to handle example-specific costs it is necessary to generate a CostSensTask.
In this scenario, each example (x, y) is associated with an individual cost
vector of length K with K denoting the number of classes. The k-th component
indicates the cost of assigning x to class k. Naturally, it is assumed that the cost
of the intended class label y is minimal.
As the cost vector contains all relevant information about the intended class y,
only the feature values x and a cost matrix, which contains the cost vectors for
all examples in the data set, are required to create the CostSensTask.
In the following example we use the iris data and an artificial cost matrix (which
is generated as proposed by Beygelzimer et al., 2005):
df = iris
cost = matrix(runif(150 * 3, 0, 2000), 150) * (1 -
diag(3))[df$Species,]
df$Species = NULL
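The surviving text breaks off before the task is actually constructed. The missing step is presumably a call to makeCostSensTask along the following lines (the exact argument name of the cost matrix should be checked on its help page); printing the resulting object produces output whose tail is shown below.
costsens.task = makeCostSensTask(data = df, cost = cost) # check ?makeCostSensTask for the cost-matrix argument name
costsens.task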
#> Features:
#> numerics factors ordered
#> 4 0 0
#> Missings: FALSE
#> Has blocking: FALSE
#> Classes: 3
#> y1, y2, y3
Further settings
The Task help page also lists several other arguments to describe further details
of the learning problem.
For example, we could include a blocking factor in the task. This would indicate
that some observations “belong together” and should not be separated when
splitting the data into training and test sets for resampling.
Another option is to assign weights to observations. These can simply indicate
observation frequencies or result from the sampling scheme used to collect the
data.
Note that you should use this option only if the weights really belong to the
task. If you plan to train some learning algorithms with different weights on the
same Task, mlr offers several other ways to set observation or class weights (for
supervised classification). See for example the tutorial page about training or
function makeWeightedClassesWrapper.
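As a hedged illustration, both settings are passed directly to the task constructor; w and b below are hypothetical weight and blocking vectors of length nrow(df).
### Observation weights and a blocking factor (w and b are hypothetical vectors)
classif.task.w = makeClassifTask(data = df, target = "Class", weights = w, blocking = b)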
Accessing a learning task
We provide many operators to access the elements stored in a Task. The most
important ones are listed in the documentation of Task and getTaskData.
To access the task description that contains basic information about the task
you can use:
getTaskDescription(classif.task)
#> $id
#> [1] "BreastCancer"
#>
#> $type
#> [1] "classif"
#>
#> $target
#> [1] "Class"
#>
#> $size
#> [1] 699
#>
#> $n.feat
#> numerics factors ordered
#> 0 4 5
#>
#> $has.missings
#> [1] TRUE
#>
#> $has.weights
#> [1] FALSE
#>
#> $has.blocking
#> [1] FALSE
#>
#> $class.levels
#> [1] "benign" "malignant"
#>
#> $positive
#> [1] "malignant"
#>
#> $negative
#> [1] "benign"
#>
#> attr(,"class")
#> [1] "TaskDescClassif" "TaskDescSupervised" "TaskDesc"
Note that task descriptions have slightly different elements for different types of
Tasks. Frequently required elements can also be accessed directly.
### Get the ID
getTaskId(classif.task)
#> [1] "BreastCancer"
Note that getTaskData offers many options for converting the data set into a
convenient format. This especially comes in handy when you integrate a new
learner from another R package into mlr. In this regard function getTaskFormula
is also useful.
Modifying a learning task
mlr provides several functions to alter an existing Task, which is often more
convenient than creating a new Task from scratch. Here are some examples.
### Select observations and/or features
cluster.task = subsetTask(cluster.task, subset = 4:17)
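Further helpers exist for altering a Task; the following sketch shows a few of them (the dropped feature names refer to the BostonHousing regression task created above, and details are on the respective help pages):
### Remove constant features
cluster.task = removeConstantFeatures(cluster.task)
### Remove selected features
regr.task = dropFeatures(regr.task, c("lstat", "chas"))
### Standardize numerical features
regr.task = normalizeFeatures(regr.task, method = "standardize")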
For more functions and more detailed explanations have a look at the data
preprocessing page.
Example tasks and convenience functions
For your convenience mlr provides pre-defined Tasks for each type of learning
problem. These are also used throughout this tutorial in order to get shorter
and more readable code. A list of all Tasks can be found in the Appendix.
Moreover, mlr’s function convertMLBenchObjToTask can generate Tasks from
the data sets and data generating functions in package mlbench.
Learners
The following classes provide a unified interface to all popular machine learning
methods in R: (cost-sensitive) classification, regression, survival analysis, and
clustering. Many are already integrated in mlr, others are not, but the package
is specifically designed to make extensions simple.
Section integrated learners shows the already implemented machine learning
methods and their properties. If your favorite method is missing, either open an
issue or take a look at how to integrate a learning method yourself. This basic
introduction demonstrates how to use already implemented learners.
Constructing a learner
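The construction examples themselves did not survive in this copy. A sketch of typical makeLearner calls is given below; the concrete learners and settings are assumptions chosen to be consistent with the objects (surv.lrn, cluster.lrn, ...) used in the rest of this section.
### Random forest, set up to predict probabilities
classif.lrn = makeLearner("classif.randomForest", predict.type = "prob", fix.factors.prediction = TRUE)
### Regression gradient boosting machine, hyperparameters passed via a list
regr.lrn = makeLearner("regr.gbm", par.vals = list(n.trees = 500, interaction.depth = 3))
### Cox proportional hazards model with a custom id
surv.lrn = makeLearner("surv.coxph", id = "cph")
### K-means with 5 clusters
cluster.lrn = makeLearner("cluster.kmeans", centers = 5)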
surv.lrn
#> Learner cph from package survival
#> Type: surv
#> Name: Cox Proportional Hazard Model; Short name: coxph
#> Class: surv.coxph
#> Properties: numerics,factors,weights,rcens
#> Predict-Type: response
#> Hyperparameters:
All generated learners are objects of class Learner. This class contains the
properties of the method, e.g., which types of features it can handle, what
kind of output is possible during prediction, and whether multi-class problems,
observations weights or missing values are supported.
As you might have noticed, there is currently no special learner class for cost-
sensitive classification. For ordinary misclassification costs you can use standard
classification methods. For example-dependent costs there are several ways
to generate cost-sensitive learners from ordinary regression and classification
learners. This is explained in greater detail in the section about cost-sensitive
classification.
Accessing a learner
The Learner object is a list and the following elements contain information
regarding the hyperparameters and the type of prediction.
### Get the configured hyperparameter settings that deviate from the defaults
cluster.lrn$par.vals
#> $centers
#> [1] 5
We can also use getParamSet to get a quick overview of the available hyperparameters
and defaults of a learning method without explicitly constructing it
(by calling makeLearner).
getParamSet("classif.randomForest")
#>                    Type  len   Def   Constr Req Tunable Trafo
#> ntree           integer    -   500 1 to Inf   -    TRUE     -
#> mtry            integer    -     - 1 to Inf   -    TRUE     -
#> replace         logical    -  TRUE        -   -    TRUE     -
#> classwt   numericvector <NA>     - 0 to Inf   -    TRUE     -
#> cutoff    numericvector <NA>     -   0 to 1   -    TRUE     -
#> strata          untyped    -     -        -   -    TRUE     -
#> sampsize  integervector <NA>     - 1 to Inf   -    TRUE     -
#> nodesize        integer    -     1 1 to Inf   -    TRUE     -
Modifying a learner
There are also some functions that enable you to change certain aspects of a
Learner without needing to create a new Learner from scratch. Here are some
examples.
### Change the ID
surv.lrn = setLearnerId(surv.lrn, "CoxModel")
surv.lrn
#> Learner CoxModel from package survival
#> Type: surv
#> Name: Cox Proportional Hazard Model; Short name: coxph
#> Class: surv.coxph
#> Properties: numerics,factors,weights,rcens
#> Predict-Type: response
#> Hyperparameters:
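Besides setLearnerId, a few more modifiers are available; a short sketch, consistent with the learners constructed earlier in this section:
### Change the prediction type, predict class labels instead of probabilities
classif.lrn = setPredictType(classif.lrn, "response")
### Change hyperparameter values
cluster.lrn = setHyperPars(cluster.lrn, centers = 4)
### Go back to default hyperparameter values
regr.lrn = removeHyperPars(regr.lrn, c("n.trees", "interaction.depth"))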
Listing learners
A list of all learners integrated in mlr and their respective properties is shown in
the Appendix.
If you would like a list of available learners, possibly restricted to those with
certain properties or suitable for a certain learning Task, use the function listLearners.
### List everything in mlr
lrns = listLearners()
head(lrns[c("class", "package")])
#> class package
#> 1 classif.ada ada
#> 2 classif.avNNet nnet
#> 3 classif.bartMachine bartMachine
#> 4 classif.bdk kohonen
#> 5 classif.binomial stats
#> 6 classif.blackboost mboost,party
### The calls above return character vectors, but you can also create learner objects
head(listLearners("cluster", create = TRUE), 2)
#> [[1]]
#> Learner cluster.cmeans from package e1071,clue
#> Type: cluster
#> Name: Fuzzy C-Means Clustering; Short name: cmeans
#> Class: cluster.cmeans
#> Properties: numerics,prob
#> Predict-Type: response
#> Hyperparameters: centers=2
#>
#>
#> [[2]]
#> Learner cluster.Cobweb from package RWeka
#> Type: cluster
#> Name: Cobweb Clustering Algorithm; Short name: cobweb
#> Class: cluster.Cobweb
#> Properties: numerics
#> Predict-Type: response
#> Hyperparameters:
Training a Learner
Training a learner means fitting a model to a given data set. In mlr this can be
done by calling function train on a Learner and a suitable Task.
We start with a classification example and perform a linear discriminant analysis
on the iris data set.
### Generate the task
task = makeClassifTask(data = iris, target = "Species")
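The training call itself is missing at this point; it presumably consisted of constructing the learner and passing both objects to train, roughly:
### Generate the learner
lrn = makeLearner("classif.lda")
### Train the learner
mod = train(lrn, task)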
In the above example creating the Learner explicitly is not absolutely necessary.
As a general rule, you have to generate the Learner yourself if you want to change
any defaults, e.g., setting hyperparameter values or altering the predict type.
Otherwise, train and many other functions also accept the class name of the
learner and call makeLearner internally with default settings.
mod = train("classif.lda", task)
mod
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = iris; obs = 150; features = 4
#> Hyperparameters:
Training a learner works the same way for every type of learning problem. Below
is a survival analysis example where a Cox proportional hazards model is fitted
to the lung data set. Note that we use the corresponding lung.task provided by
mlr. All available Tasks are listed in the Appendix.
mod = train("surv.coxph", lung.task)
mod
#> Model for learner.id=surv.coxph; learner.class=surv.coxph
#> Trained on: task.id = lung-example; obs = 167; features = 8
#> Hyperparameters:
[Figure: scatter plot (y against x) of the two-feature data set used for the k-means clustering model inspected below.]
mod$learner
#> Learner cluster.kmeans from package stats,clue
#> Type: cluster
#> Name: K-Means; Short name: kmeans
#> Class: cluster.kmeans
#> Properties: numerics,prob
#> Predict-Type: response
#> Hyperparameters: centers=4
mod$features
#> [1] "x" "y"
mod$time
#> [1] 0.001
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss" "withinss"
#> [5] "tot.withinss" "betweenss" "size" "iter"
#> [9] "ifault"
By default, the whole data set in the Task is used for training. The subset argu-
ment of train takes a logical or integer vector that indicates which observations
to use, for example if you want to split your data into a training and a test set
or if you want to fit separate models to different subgroups in the data.
Below we fit a linear regression model to the BostonHousing data set (bh.task)
and randomly select 1/3 of the data set for training.
### Get the number of observations
n = getTaskSize(bh.task)
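The remainder of this code block was lost at a page break; based on the description above it was presumably similar to:
### Use 1/3 of the observations for training
train.set = sample(n, size = n/3)
### Train the learner
mod = train("regr.lm", bh.task, subset = train.set)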
Note, for later, that all standard resampling strategies are supported. Therefore
you usually do not have to subset the data yourself.
Moreover, if the learner supports this, you can specify observation weights that
reflect the relevance of observations in the training process. Weights can be useful
in many regards, for example to express the reliability of the training observations,
reduce the influence of outliers or, if the data were collected over a longer time
period, increase the influence of recent data. In supervised classification weights
can be used to incorporate misclassification costs or account for class imbalance.
For example in the BreastCancer data set class benign is almost twice as frequent
as class malignant. In order to grant both classes equal importance in training
the classifier we can weight the examples according to the inverse class frequencies
in the data set as shown in the following R code.
### Calculate the observation weights
target = getTaskTargets(bc.task)
tab = as.numeric(table(target))
w = 1/tab[target]
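The corresponding training call is missing here; with the weights from above it would look roughly like this (the choice of classif.rpart is an assumption):
### Train a decision tree with the inverse class frequencies as observation weights
mod = train("classif.rpart", task = bc.task, weights = w)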
Note, for later, that mlr offers much more functionality to deal with imbalanced
classification problems.
As another side remark for more advanced readers: By varying the weights in the
calls to train, you could also implement your own variant of a general boosting
type algorithm on arbitrary mlr base learners.
As you may recall, it is also possible to set observation weights when creating the
Task. As a general rule, you should specify them in make*Task if the weights
really “belong” to the task and always should be used. Otherwise, pass them to
train. The weights in train take precedence over the weights in Task.
Predicting Outcomes for New Data
Predicting the target values for new observations is implemented the same way
as most of the other predict methods in R. In general, all you need to do is call
predict on the object returned by train and pass the data you want predictions
for.
There are two ways to pass the data:
• Either pass the Task via the task argument or
• pass a data frame via the newdata argument.
The first way is preferable if you want predictions for data already included in a
Task.
Just as train, the predict function has a subset argument, so you can set aside
different portions of the data in Task for training and prediction (more advanced
methods for splitting the data in train and test set are described in the section
on resampling).
In the following example we fit a gradient boosting machine to every second
observation of the BostonHousing data set and make predictions on the remaining
data in bh.task.
n = getTaskSize(bh.task)
train.set = seq(1, n, by = 2)
test.set = seq(2, n, by = 2)
lrn = makeLearner("regr.gbm", n.trees = 100)
mod = train(lrn, bh.task, subset = train.set)
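The prediction step did not survive here; the object task.pred used below was presumably created like this:
task.pred = predict(mod, task = bh.task, subset = test.set)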
The second way is useful if you want to predict data not included in the Task.
Here we cluster the iris data set without the target variable. All observations
with an odd index are included in the Task and used for training. Predictions
are made for the remaining observations.
n = nrow(iris)
iris.train = iris[seq(1, n, by = 2), -5]
iris.test = iris[seq(2, n, by = 2), -5]
task = makeClusterTask(data = iris.train)
mod = train("cluster.kmeans", task)
Note that for supervised learning you do not have to remove the target columns
from the data. These columns are automatically removed prior to calling the
underlying predict method of the learner.
Accessing the prediction
Function predict returns a named list of class Prediction. Its most important
element is $data which is a data frame that contains columns with the true
values of the target variable (in case of supervised learning problems) and the
predictions. Use as.data.frame for direct access.
In the following the predictions on the BostonHousing and the iris data sets are
shown. As you may recall, the predictions in the first case were made from a
Task and in the second case from a data frame.
### Result of predict with data passed via task argument
head(as.data.frame(task.pred))
#> id truth response
#> 2 2 21.6 22.28539
#> 4 4 33.4 23.33968
#> 6 6 28.7 22.40896
#> 8 8 27.1 22.12750
#> 10 10 18.9 22.12750
#> 12 12 18.9 22.12750
As you can see when predicting from a Task, the resulting data frame contains
an additional column, called id, which tells us which element in the original
data set the prediction corresponds to.
A direct way to access the true and predicted values of the target variable(s) is
provided by functions getPredictionTruth and getPredictionResponse.
head(getPredictionTruth(task.pred))
#> [1] 21.6 33.4 28.7 27.1 18.9 18.9
head(getPredictionResponse(task.pred))
#> [1] 22.28539 23.33968 22.40896 22.12750 22.12750 22.12750
Extract Probabilities
The predicted probabilities can be extracted from the Prediction using the func-
tion getPredictionProbabilities. (Function getProbabilities has been deprecated
in favor of getPredictionProbabilities in mlr version 2.5.) Here is another cluster
analysis example. We use fuzzy c-means clustering on the mtcars data set.
lrn = makeLearner("cluster.cmeans", predict.type = "prob")
mod = train(lrn, mtcars.task)
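The prediction and extraction steps are missing in this copy; a sketch (output omitted):
pred = predict(mod, task = mtcars.task)
### Extract the predicted cluster membership probabilities
head(getPredictionProbabilities(pred))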
For classification problems there are some more things worth mentioning. By
default, class labels are predicted.
### Linear discriminant analysis on the iris data set
mod = train("classif.lda", task = iris.task)
head(as.data.frame(pred))
#> truth prob.setosa prob.versicolor prob.virginica response
#> 1 setosa 1 0 0 setosa
#> 2 setosa 1 0 0 setosa
#> 3 setosa 1 0 0 setosa
#> 4 setosa 1 0 0 setosa
#> 5 setosa 1 0 0 setosa
#> 6 setosa 1 0 0 setosa
In addition to the probabilities, class labels are predicted by choosing the class
with the maximum probability and breaking ties at random.
As mentioned above, the predicted posterior probabilities can be accessed via
the getPredictionProbabilities function.
head(getPredictionProbabilities(pred))
#> setosa versicolor virginica
#> 1 1 0 0
#> 2 1 0 0
#> 3 1 0 0
#> 4 1 0 0
#> 5 1 0 0
#> 6 1 0 0
Confusion matrix
A confusion matrix can be obtained by calling calculateConfusionMatrix. The
columns represent predicted and the rows true class labels.
calculateConfusionMatrix(pred)
#> predicted
#> true setosa versicolor virginica -err.-
#> setosa 50 0 0 0
#> versicolor 0 49 1 1
#> virginica 0 5 45 5
#> -err.- 0 5 1 6
You can see the number of correctly classified observations on the diagonal of
the matrix. Misclassified observations are on the off-diagonal. The total number
of errors for single (true and predicted) classes is shown in the -err.- row and
column, respectively.
To get relative frequencies in addition to the absolute numbers, we can set
relative = TRUE.
conf.matrix = calculateConfusionMatrix(pred, relative = TRUE)
conf.matrix
#> Relative confusion matrix (normalized by row/column):
#> predicted
#> true setosa versicolor virginica -err.-
#> setosa 1.00/1.00 0.00/0.00 0.00/0.00 0.00
#> versicolor 0.00/0.00 0.98/0.91 0.02/0.02 0.02
#> virginica 0.00/0.00 0.10/0.09 0.90/0.98 0.10
#> -err.- 0.00 0.09 0.02 0.08
#>
#>
#> Absolute confusion matrix:
#> predicted
#> true setosa versicolor virginica -err.-
#> setosa 50 0 0 0
#> versicolor 0 49 1 1
#> virginica 0 5 45 5
#> -err.- 0 5 1 6
Finally, we can also add the absolute number of observations for each predicted
and true class label to the matrix (both absolute and relative) by setting sums
= TRUE.
calculateConfusionMatrix(pred, relative = TRUE, sums = TRUE)
#> Relative confusion matrix (normalized by row/column):
#> predicted
#> true setosa versicolor virginica -err.- -n-
#> setosa 1.00/1.00 0.00/0.00 0.00/0.00 0.00 50
#> versicolor 0.00/0.00 0.98/0.91 0.02/0.02 0.02 54
#> virginica 0.00/0.00 0.10/0.09 0.90/0.98 0.10 46
#> -err.- 0.00 0.09 0.02 0.08 <NA>
#> -n- 50 50 50 <NA> 150
#>
#>
#> Absolute confusion matrix:
Adjusting the threshold
We can set the threshold value that is used to map the predicted posterior
probabilities to class labels. Note that for this purpose we need to create a
Learner that predicts probabilities. For binary classification, the threshold
determines when the positive class is predicted. The default is 0.5. Now, we
set the threshold for the positive class to 0.9 (that is, an example is assigned
to the positive class if its posterior probability exceeds 0.9). Which of the two
classes is the positive one can be seen by accessing the Task. To illustrate binary
classification, we use the Sonar data set from the mlbench package.
lrn = makeLearner("classif.rpart", predict.type = "prob")
mod = train(lrn, task = sonar.task)
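The prediction and threshold adjustment are missing here; the object pred2 printed below, with threshold M=0.90, was presumably produced like this:
pred1 = predict(mod, task = sonar.task)
### Set the threshold for the positive class M to 0.9
pred2 = setThreshold(pred1, 0.9)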
pred2
#> Prediction: 208 observations
#> predict.type: prob
#> threshold: M=0.90,R=0.10
#> time: 0.00
#> id truth prob.M prob.R response
#> 1 1 R 0.1060606 0.8939394 R
#> 2 2 R 0.7333333 0.2666667 R
calculateConfusionMatrix(pred2)
#> predicted
#> true M R -err.-
#> M 84 27 27
#> R 6 91 6
#> -err.- 6 27 33
If you are interested in tuning the threshold (vector) have a look at the section
about performance curves and threshold tuning.
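Visualizing the prediction
The function plotLearnerPrediction visualizes predictions for two selected features of a task. The call producing the classification plot summarized below is not preserved; judging from the plot annotations (an rpart model on the iris features Sepal.Length and Sepal.Width) it was presumably along these lines:
lrn = makeLearner("classif.rpart")
plotLearnerPrediction(lrn, task = iris.task)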
[Figure: plotLearnerPrediction output for a classification tree on the iris task; Sepal.Width against Sepal.Length, points colored by Species (setosa, versicolor, virginica). Panel title: rpart: xval=0; Train: mmce=0.207; CV: mmce.test.mean=0.253.]
For clustering we also get a scatter plot of two selected features. The color of
the points indicates the predicted cluster.
lrn = makeLearner("cluster.kmeans")
plotLearnerPrediction(lrn, task = mtcars.task, features =
c("disp", "drat"), cv = 0)
[Figure: plotLearnerPrediction output for cluster.kmeans on the mtcars task; drat against disp, points colored by the predicted cluster (response 1 and 2).]
For regression, there are two types of plots. The 1D plot shows the target values
in relation to a single feature, the regression curve and, if the chosen learner
supports this, the estimated standard error of the prediction. The 2D variant
plots two features against each other and codes the target value by color.
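The call producing the 1D plot summarized below is not preserved; based on the plot annotations (a linear model, medv against lstat on the BostonHousing task) it was presumably similar to:
plotLearnerPrediction("regr.lm", task = bh.task, features = "lstat")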
[Figure: 1D regression plot of medv against lstat for a linear model. Panel title: lm; Train: mse=38.5; CV: mse.test.mean=38.7.]
[Figure: 2D regression plot of rm against lstat with medv coded by color, for a linear model. Panel title: lm; Train: mse=30.5; CV: mse.test.mean=31.1.]
Evaluating Learner Performance
The quality of the predictions of a model in mlr can be assessed with respect to a
number of different performance measures. In order to calculate the performance
measures, call performance on the object returned by predict and specify the
desired performance measures.
Available performance measures
mlr provides a large number of performance measures for all types of learning
problems. Typical performance measures for classification are the mean misclas-
sification error (mmce), accuracy (acc) or measures based on ROC analysis. For
regression the mean of squared errors (mse) or mean of absolute errors (mae) are
usually considered. For clustering tasks, measures such as the Dunn index (dunn)
are provided, while for survival predictions, the Concordance Index (cindex) is
supported, and for cost-sensitive predictions the misclassification penalty (mcp)
and others. It is also possible to access the time to train the learner (timetrain),
the time to compute the prediction (timepredict) and their sum (timeboth) as
performance measures.
To see which performance measures are implemented, have a look at the table of
performance measures and the measures documentation page.
If you want to implement an additional measure or include a measure with non-
standard misclassification costs, see the section on creating custom measures.
Listing measures
The properties and requirements of the individual measures are shown in the
table of performance measures.
If you would like a list of available measures with certain properties or suitable
for a certain learning Task use the function listMeasures.
### Performance measures for classification with multiple classes
listMeasures("classif", properties = "classif.multi")
#> [1] "multiclass.brier" "multiclass.aunp" "multiclass.aunu"
#> [4] "qsr" "ber" "logloss"
#> [7] "timeboth" "timepredict" "acc"
#> [10] "lsr" "featperc" "multiclass.au1p"
#> [13] "multiclass.au1u" "ssr" "timetrain"
#> [16] "mmce"
### Performance measure suitable for the iris classification task
listMeasures(iris.task)
#> [1] "multiclass.brier" "multiclass.aunp" "multiclass.aunu"
#> [4] "qsr" "ber" "logloss"
#> [7] "timeboth" "timepredict" "acc"
#> [10] "lsr" "featperc" "multiclass.au1p"
#> [13] "multiclass.au1u" "ssr" "timetrain"
#> [16] "mmce"
For convenience there exists a default measure for each type of learning problem,
which is calculated if nothing else is specified. As defaults we chose the most
commonly used measures for the respective types, e.g., the mean squared error
(mse) for regression and the misclassification rate (mmce) for classification. The
help page of function getDefaultMeasure lists all defaults for all types of learning
problems. The function itself returns the default measure for a given task type,
Task or Learner.
### Get default measure for iris.task
getDefaultMeasure(iris.task)
#> Name: Mean misclassification error
#> Performance measure: mmce
#> Properties: classif,classif.multi,req.pred,req.truth
#> Minimize: TRUE
#> Best: 0; Worst: 1
#> Aggregated by: test.mean
#> Note:
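Calculate performance measures
In the surviving text the prediction object pred used below is never constructed. A sketch consistent with the regression output that follows (mean squared error on the BostonHousing task) is given here; the gradient boosting learner and the odd/even train/test split are assumptions.
n = getTaskSize(bh.task)
lrn = makeLearner("regr.gbm", n.trees = 100)
mod = train(lrn, bh.task, subset = seq(1, n, by = 2))
pred = predict(mod, task = bh.task, subset = seq(2, n, by = 2))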
performance(pred)
#> mse
#> 42.68414
The following code computes the median of squared errors (medse) instead.
performance(pred, measures = medse)
#> medse
#> 9.134965
For the other types of learning problems and measures, calculating the perfor-
mance basically works in the same way.
Also bear in mind that many of the performance measures that are available
for classification, e.g., the false positive rate (fpr), are only suitable for binary
problems.
Access a performance measure
Performance measures in mlr are objects of class Measure. If you are interested
in the properties or requirements of a single measure you can access it directly.
See the help page of Measure for information on the individual slots.
### Mean misclassification error
str(mmce)
#> List of 10
#> $ id : chr "mmce"
#> $ minimize : logi TRUE
#> $ properties: chr [1:4] "classif" "classif.multi" "req.pred"
"req.truth"
#> $ fun :function (task, model, pred, feats, extra.args)
Binary classification
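The code of this subsection is largely missing. The objects pred and d used below can be reconstructed roughly as follows; the LDA learner is an assumption, while the Sonar task is implied by the class labels M and R in the output further down. generateThreshVsPerfData evaluates the chosen measures over a grid of decision thresholds.
lrn = makeLearner("classif.lda", predict.type = "prob")
mod = train(lrn, task = sonar.task)
pred = predict(mod, task = sonar.task)
### Evaluate several measures across all possible thresholds
d = generateThreshVsPerfData(pred, measures = list(fpr, fnr, mmce))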
plotThreshVsPerf(d)
[Figure: plotThreshVsPerf output; three panels showing the performance measures as functions of the decision threshold from 0 to 1.]
ROC measures
For binary classification a large number of specialized measures exist, which
can be nicely formatted into one matrix, see for example the receiver operating
characteristic page on wikipedia.
We can generate a similar table with the calculateROCMeasures function.
r = calculateROCMeasures(pred)
r
#> predicted
#> true M R
#> M 0.7 0.3 tpr: 0.7 fnr: 0.3
#> R 0.25 0.75 fpr: 0.25 tnr: 0.75
#> ppv: 0.76 for: 0.32 lrp: 2.79 acc: 0.72
#> fdr: 0.24 npv: 0.68 lrm: 0.4 dor: 6.88
#>
#>
#> Abbreviations:
#> tpr - True positive rate (Sensitivity, Recall)
#> fpr - False positive rate (Fall-out)
#> fnr - False negative rate (Miss rate)
The top left 2 × 2 matrix is the confusion matrix, which shows the relative
frequency of correctly and incorrectly classified observations. Below and to the
right a large number of performance measures that can be inferred from the
confusion matrix are added. By default some additional info about the measures
is printed. You can turn this off using the abbreviations argument of the print
method: print(r, abbreviations = FALSE).
Resampling
If you want to read up further details, the paper Resampling Strategies for Model
Assessment and Selection by Simon is probably not a bad choice. Bernd has
also published a paper Resampling methods for meta-model validation with rec-
ommendations for evolutionary computation which contains detailed descriptions
and lots of statistical background information on resampling methods.
In mlr the resampling strategy can be chosen via the function makeResampleDesc.
The supported resampling strategies are:
• Cross-validation ("CV"),
• Leave-one-out cross-validation ("LOO"),
• Repeated cross-validation ("RepCV"),
• Out-of-bag bootstrap and other variants ("Bootstrap"),
• Subsampling, also called Monte-Carlo cross-validation ("Subsample"),
• Holdout (training/test) ("Holdout").
The resample function evaluates the performance of a Learner using the specified
resampling strategy for a given machine learning Task.
In the following example the performance of the Cox proportional hazards
model on the lung data set is calculated using 3-fold cross-validation. Generally,
in K-fold cross-validation the data set D is partitioned into K subsets of
(approximately) equal size. In the i-th step of the K iterations, the i-th subset
is used for testing, while the union of the remaining parts forms the training set.
The default performance measure in survival analysis is the concordance index
(cindex).
### Specify the resampling strategy (3-fold cross-validation)
rdesc = makeResampleDesc("CV", iters = 3)
The columns iter and set indicate the resampling iteration and whether an individual
prediction was made on the test or the training data set.
In the above example the performance measure is the concordance index (cindex).
Of course, it is possible to compute multiple performance measures at once by
passing a list of measures (see also the previous section on evaluating learner
performance).
In the following we estimate the Dunn index (dunn), the Davies-Bouldin cluster
separation measure (db), and the time for training the learner (timetrain) by
subsampling with 5 iterations. In each iteration the data set D is randomly
partitioned into a training and a test set according to a given percentage, e.g.,
2/3 training and 1/3 test set. If there is just one iteration, the strategy is
commonly called holdout or test sample estimation.
### cluster iris feature data
task = makeClusterTask(data = iris[,-5])
### Subsampling with 5 iterations and default split 2/3
rdesc = makeResampleDesc("Subsample", iters = 5)
### Subsampling with 5 iterations and 4/5 training data
rdesc = makeResampleDesc("Subsample", iters = 5, split = 4/5)
Stratified resampling
For classification, it is usually desirable to have the same proportion of the classes
in all of the partitions of the original data set. Stratified resampling ensures
this. This is particularly useful in case of imbalanced classes and small data sets.
Otherwise it may happen, for example, that observations of less frequent classes
are missing in some of the training sets which can decrease the performance of
the learner, or lead to model crashes. In order to conduct stratified resampling,
set stratify = TRUE when calling makeResampleDesc.
### 3-fold cross-validation
rdesc = makeResampleDesc("CV", iters = 3, stratify = TRUE)
Stratification is also available for survival tasks. Here the stratification balances
the censoring rate.
Sometimes it is required to also stratify on the input data, e.g. to ensure that all
subgroups are represented in all training and test sets. To stratify on the input
columns, specify the factor columns of your task data via stratify.cols:
rdesc = makeResampleDesc("CV", iters = 3, stratify.cols = "chas")
r = resample("regr.rpart", bh.task, rdesc)
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mse.test.mean=23.2
Keeping only certain information instead of entire models, for example the
variable importance in a regression tree, can be achieved using the extract
argument of resample.
The result rdesc is an object of class ResampleDesc and contains, as the name
implies, a description of the resampling strategy. In principle, this is an instruc-
tion for drawing training and test sets including the necessary parameters like
the number of iterations, the sizes of the training and test sets etc.
Based on this description, the data set is randomly partitioned into multiple
training and test sets. For each iteration, we get a set of index vectors indicating
the training and test examples. These are stored in a ResampleInstance.
If a ResampleDesc is passed to resample, it is instantiated internally. Naturally,
it is also possible to pass a ResampleInstance directly.
A ResampleInstance can be created through the function makeResampleInstance
given a ResampleDesc and either the size of the data set at hand or the Task.
It basically performs the random drawing of indices to separate the data into
training and test sets according to the description.
### Create a resample instance based on a task
rin = makeResampleInstance(rdesc, task = iris.task)
rin
#> Resample instance for 150 cases.
#> Resample description: cross-validation with 3 iterations.
#> Predict: test
#> Stratification: FALSE
### Create a resample instance given the size of the data set
rin = makeResampleInstance(rdesc, size = nrow(iris))
str(rin)
#> List of 5
#> $ desc :List of 4
#> ..$ id : chr "cross-validation"
#> ..$ iters : int 3
#> ..$ predict : chr "test"
#> ..$ stratify: logi FALSE
While having two separate objects (resample descriptions and instances) in addition
to the resample function may seem overly complicated, this design has several advantages:
• Resample instances allow for paired experiments, that is comparing the
performance of several learners on exactly the same training and test
sets. This is particularly useful if you want to add another method to a
comparison experiment you already did.
rdesc = makeResampleDesc("CV", iters = 3)
rin = makeResampleInstance(rdesc, task = iris.task)
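The two resampling runs on the shared instance are missing in this copy; they presumably looked like the following (which learner produced the first aggregated value below is an assumption):
### Compare two learners on exactly the same training/test sets
r.lda = resample("classif.lda", iris.task, rin, show.info = FALSE)
r.rpart = resample("classif.rpart", iris.task, rin, show.info = FALSE)
r.lda$aggr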
#> mmce.test.mean
#> 0.02666667
r.rpart$aggr
#> mmce.test.mean
#> 0.06
• It is easy to add other resampling methods later on. You can simply
derive from the ResampleInstance class, but you do not have to touch any
methods that use the resampling strategy.
As mentioned above, when calling makeResampleInstance the index sets are
drawn randomly. Mainly for holdout (test sample) estimation you might want
full control over the training and test sets and specify them manually. This
can be done using the function makeFixedHoldoutInstance.
rin = makeFixedHoldoutInstance(train.inds = 1:100, test.inds =
101:150, size = 150)
rin
#> Resample instance for 150 cases.
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
The aggregation method of a Measure can be changed via the function setAggre-
gation. See the documentation of aggregations for available methods.
Example: Bootstrap
In out-of-bag bootstrap estimation B new data sets D1 to DB are drawn from the
data set D with replacement, each of the same size as D. In the i-th iteration,
Di forms the training set, while the remaining elements from D, i.e., elements
not in the training set, form the test set.
The variants b632 and b632+ calculate a convex combination of the training per-
formance and the out-of-bag bootstrap performance and thus require predictions
on the training sets and an appropriate aggregation strategy.
rdesc = makeResampleDesc("Bootstrap", predict = "both", iters =
10)
b632.mmce = setAggregation(mmce, b632)
b632plus.mmce = setAggregation(mmce, b632plus)
b632.mmce
#> Name: Mean misclassification error
#> Performance measure: mmce
#> Properties: classif,classif.multi,req.pred,req.truth
#> Minimize: TRUE
#> Best: 0; Worst: 1
#> Aggregated by: b632
#> Note:
Convenience functions
When quickly trying out some learners, it can get tedious to write the R code
for generating a resample instance, setting the aggregation strategy and so on.
For this reason mlr provides some convenience functions for the frequently used
resampling strategies, for example holdout, crossval or bootstrapB632. But note
that you do not have as much control and flexibility as when using resample
with a resample description or instance.
Tuning Hyperparameters
Basics
In this tutorial, we show how to specify the search space and optimization
algorithm, how to do the tuning and how to access the tuning result, and how
to visualize the hyperparameter tuning effects through several examples.
Throughout this section we consider classification examples. For the other types
of learning problems, you can follow the same process analogously.
We use the iris classification task for illustration and tune the hyperparameters
of an SVM (function ksvm from the kernlab package) with a radial basis kernel.
The following examples tune the cost parameter C and the RBF kernel parameter
sigma of the ksvm function.
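The definitions of the two parameter sets used throughout this section were lost at a page break; they are reproduced here from the identical definitions that appear further below.
### Discrete search space for C and sigma
discrete_ps = makeParamSet(
  makeDiscreteParam("C", values = c(0.5, 1.0, 1.5, 2.0)),
  makeDiscreteParam("sigma", values = c(0.5, 1.0, 1.5, 2.0))
)
### Continuous search space on a log scale
num_ps = makeParamSet(
  makeNumericParam("C", lower = -10, upper = 10, trafo = function(x) 10^x),
  makeNumericParam("sigma", lower = -10, upper = 10, trafo = function(x) 10^x)
)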
Many other parameters can be created, check out the examples in makeParamSet.
In order to standardize your workflow across several packages, whenever parame-
ters in the underlying R functions should be passed in a list structure, mlr tries
to give you direct access to each parameter and get rid of the list structure!
This is the case with the kpar argument of ksvm which is a list of kernel
parameters like sigma. This allows us to interface with learners from different
packages in the same way when defining parameters to tune!
In the case of num_ps above, since we have only specified the upper and lower
bounds for the search space, grid search will create a grid using equally-sized steps.
By default, grid search will span the space in 10 equal-sized steps. The number
of steps can be changed with the resolution argument. Here we change to 15
equal-sized steps in the space defined within the ParamSet object. For num_ps,
this means 15 steps in the form of 10 ^ seq(-10, 10, length.out = 15):
ctrl = makeTuneControlGrid(resolution = 15L)
Many other types of optimization algorithms are available. Check out TuneCon-
trol for some examples.
Since grid search is normally too slow in practice, we’ll also examine random
search. In the case of discrete_ps, random search will randomly choose from
the specified values. The maxit argument controls the number of iterations.
ctrl = makeTuneControlRandom(maxit = 10L)
In the case of num_ps, random search will randomly choose points within the
space according to the specified bounds. Perhaps in this case we would want to
increase the number of iterations to ensure we adequately cover the space:
ctrl = makeTuneControlRandom(maxit = 200L)
Finally, by combining all the previous pieces, we can tune the SVM parameters
by calling tuneParams. We will use discrete_ps with grid search:
discrete_ps = makeParamSet(
makeDiscreteParam("C", values = c(0.5, 1.0, 1.5, 2.0)),
makeDiscreteParam("sigma", values = c(0.5, 1.0, 1.5, 2.0))
)
ctrl = makeTuneControlGrid()
rdesc = makeResampleDesc("CV", iters = 3L)
res = tuneParams("classif.ksvm", task = iris.task, resampling =
rdesc,
par.set = discrete_ps, control = ctrl)
#> [Tune] Started tuning learner classif.ksvm for parameter set:
#> Type len Def Constr Req Tunable Trafo
#> C discrete - - 0.5,1,1.5,2 - TRUE -
#> sigma discrete - - 0.5,1,1.5,2 - TRUE -
#> With control class: TuneControlGrid
#> Imputation value: 1
#> [Tune-x] 1: C=0.5; sigma=0.5
#> [Tune-y] 1: mmce.test.mean=0.04; time: 0.0 min; memory: 176Mb use, 711Mb max
#> [Tune-x] 2: C=1; sigma=0.5
#> [Tune-y] 2: mmce.test.mean=0.04; time: 0.0 min; memory: 176Mb use, 711Mb max
#> [Tune-x] 3: C=1.5; sigma=0.5
#> [Tune-y] 3: mmce.test.mean=0.0467; time: 0.0 min; memory: 176Mb use, 711Mb max
#> [Tune-x] 4: C=2; sigma=0.5
res
#> Tune result:
#> Op. pars: C=0.5; sigma=1.5
#> mmce.test.mean=0.0333
tuneParams simply performs the cross-validation for every element of the cross-
product and selects the parameter setting with the best mean performance. As
no performance measure was specified, by default the error rate (mmce) is used.
Note that each measure “knows” if it is minimized or maximized during tuning.
### error rate
mmce$minimize
#> [1] TRUE
### accuracy
acc$minimize
#> [1] FALSE
Of course, you can pass other measures and also a list of measures to tuneParams.
In the latter case the first measure is optimized during tuning, the others
are simply evaluated. If you are interested in optimizing several measures
simultaneously have a look at Advanced Tuning.
In the example below we calculate the accuracy (acc) instead of the error rate. We
use function setAggregation, as described on the resampling page, to additionally
obtain the standard deviation of the accuracy. We also use random search with
100 iterations on the num_set we defined above and set show.info to FALSE to
hide the output for all 100 iterations:
num_ps = makeParamSet(
makeNumericParam("C", lower = -10, upper = 10, trafo =
function(x) 10^x),
makeNumericParam("sigma", lower = -10, upper = 10, trafo =
function(x) 10^x)
)
ctrl = makeTuneControlRandom(maxit = 100L)
res = tuneParams("classif.ksvm", task = iris.task, resampling =
rdesc, par.set = num_ps,
control = ctrl, measures = list(acc, setAggregation(acc,
test.sd)), show.info = FALSE)
res
#> Tune result:
#> Op. pars: C=95.2; sigma=0.0067
#> acc.test.mean=0.987,acc.test.sd=0.0231
#>
#> $sigma
#> [1] 0.006695534
res$y
#> acc.test.mean acc.test.sd
#> 0.98666667 0.02309401
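Before refitting, the tuned values have to be written back into a learner; that step is missing here and presumably looked roughly like this:
lrn = setHyperPars(makeLearner("classif.ksvm"), par.vals = res$x)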
Then you can proceed as usual. Here we refit and predict the learner on the
complete iris data set:
m = train(lrn, iris.task)
predict(m, task = iris.task)
#> Prediction: 150 observations
#> predict.type: response
#> threshold:
#> time: 0.00
#> id truth response
#> 1 1 setosa setosa
#> 2 2 setosa setosa
#> 3 3 setosa setosa
#> 4 4 setosa setosa
#> 5 5 setosa setosa
#> 6 6 setosa setosa
#> ... (150 rows, 3 cols)
But what if you wanted to inspect the other points on the search path, not just
the optimum?
generateHyperParsEffectData(res)
#> HyperParsEffectData:
#> Hyperparameters: C,sigma
#> Measures: acc.test.mean,acc.test.sd
#> Optimizer: TuneControlRandom
#> Nested CV Used: FALSE
#> Snapshot of data:
#>            C      sigma acc.test.mean acc.test.sd iteration exec.time
#> 1 -9.9783231  1.0531818     0.2733333  0.02309401         1     0.051
#> 2 -0.5292817  3.2214785     0.2733333  0.02309401         2     0.053
#> 3 -0.3544567  4.1644832     0.2733333  0.02309401         3     0.052
#> 4  0.6341910  7.8640461     0.2866667  0.03055050         4     0.052
#> 5  5.7640748 -3.3159251     0.9533333  0.03055050         5     0.051
#> 6 -6.5880397  0.4600323     0.2733333  0.02309401         6     0.052
Note that we can also generate performance values on the training data along with the
validation/test data, as discussed on the resampling tutorial page:
rdesc2 = makeResampleDesc("Holdout", predict = "both")
res2 = tuneParams("classif.ksvm", task = iris.task, resampling =
rdesc2, par.set = num_ps,
control = ctrl, measures = list(acc, setAggregation(acc,
train.mean)), show.info = FALSE)
generateHyperParsEffectData(res2)
#> HyperParsEffectData:
#> Hyperparameters: C,sigma
#> Measures: acc.test.mean,acc.train.mean
#> Optimizer: TuneControlRandom
#> Nested CV Used: FALSE
#> Snapshot of data:
#>           C      sigma acc.test.mean acc.train.mean iteration exec.time
#> 1  9.457202 -4.0536025          0.98           0.97         1     0.040
#> 2  9.900523  1.8815923          0.40           1.00         2     0.030
#> 3  2.363975  5.3202458          0.26           1.00         3     0.029
#> 4 -1.530251  4.7579424          0.26           0.37         4     0.031
#> 5 -7.837476  2.4352698          0.26           0.37         5     0.029
#> 6  8.782931 -0.4143757          0.92           1.00         6     0.029
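The plotting call that produced the figure below is not reproduced in this excerpt.
Presumably the hyperparameter effect data is shown as a line plot over the iterations,
roughly like this (a sketch):
plotHyperParsEffect(generateHyperParsEffectData(res2), x = "iteration",
  y = "acc.test.mean", plot.type = "line")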
[Figure: accuracy (y-axis, roughly 0.4 to 1.0) of the current global optimum over the 100 tuning iterations (x-axis).]
Note that by default only the current global optimum is plotted. This can be
changed with the global.only argument.
For an in-depth exploration of generating hyperparameter tuning effects and
plotting the data, check out Hyperparameter Tuning Effects.
Further comments
• Tuning works for all other tasks like regression, survival analysis and so on
in a completely similar fashion.
• In longer running tuning experiments it is very annoying if the computation
stops due to numerical or other errors. Have a look at on.learner.error
in configureMlr as well as the examples given in section Configure mlr of
this tutorial. You might also want to inform yourself about impute.val
in TuneControl.
• As we continually optimize over the same data during tuning, the estimated
performance value might be optimistically biased. A clean approach to ensure
unbiased performance estimation is nested resampling, where the whole model
selection process is embedded in an outer resampling loop (a minimal sketch
follows below).
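A minimal sketch of nested resampling with mlr's tuning wrapper, reusing num_ps and
ctrl from above (the resampling choices here are illustrative, not taken from the tutorial):
### Tuning happens inside each outer training set; the outer loop estimates performance
inner = makeResampleDesc("CV", iters = 3L)
outer = makeResampleDesc("CV", iters = 5L)
lrn.tuned = makeTuneWrapper("classif.ksvm", resampling = inner, par.set = num_ps, control = ctrl)
r = resample(lrn.tuned, iris.task, resampling = outer, extract = getTuneResult)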
Benchmark Experiments
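The benchmark call itself is not shown in this excerpt. Judging from the output
below (holdout resampling of lda and rpart on the Sonar data), it presumably
looked roughly like this:
lrns = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))
rdesc = makeResampleDesc("Holdout")
bmr = benchmark(lrns, sonar.task, rdesc)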
bmr
#> task.id learner.id mmce.test.mean
#> 1 Sonar-example classif.lda 0.3000000
#> 2 Sonar-example classif.rpart 0.2857143
In the printed table every row corresponds to one pair of Task and Learner. The
entries show the mean misclassification error (mmce), the default performance
measure for classification, on the test data set.
The result bmr is an object of class BenchmarkResult. Basically, it contains a list
of lists of ResampleResult objects, first ordered by Task and then by Learner.
Learner performances
Let’s have a look at the benchmark result above. getBMRPerformances returns
individual performances in resampling runs, while getBMRAggrPerformances
gives the aggregated values.
getBMRPerformances(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> iter mmce
#> 1 1 0.3
#>
#> $`Sonar-example`$classif.rpart
#> iter mmce
#> 1 1 0.2857143
getBMRAggrPerformances(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> mmce.test.mean
#> 0.3
#>
#> $`Sonar-example`$classif.rpart
#> mmce.test.mean
#> 0.2857143
Predictions
Per default, the BenchmarkResult contains the learner predictions. If you do
not want to keep them, e.g., to conserve memory, set keep.pred = FALSE when
calling benchmark.
You can access the predictions using function getBMRPredictions. Per default,
you get a list of lists of ResamplePrediction objects. In most cases you might
prefer the data.frame version.
getBMRPredictions(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> Resampled Prediction for:
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
#> predict.type: response
#> threshold:
#> time (mean): 0.00
#> id truth response iter set
#> 1 180 M M 1 test
#> 2 100 M R 1 test
#> 3 53 R M 1 test
#> 4 89 R R 1 test
#> 5 92 R M 1 test
#> 6 11 R R 1 test
#> ... (70 rows, 5 cols)
#>
#>
#> $`Sonar-example`$classif.rpart
It is also easily possible to access results for certain learners or tasks via their IDs.
For this purpose many “getter” functions have a learner.ids and a task.ids
argument.
head(getBMRPredictions(bmr, learner.ids = "classif.rpart", as.df
= TRUE))
#> task.id learner.id id truth response iter set
#> 1 Sonar-example classif.rpart 180 M M 1 test
#> 2 Sonar-example classif.rpart 100 M M 1 test
#> 3 Sonar-example classif.rpart 53 R R 1 test
#> 4 Sonar-example classif.rpart 89 R M 1 test
#> 5 Sonar-example classif.rpart 92 R M 1 test
#> 6 Sonar-example classif.rpart 11 R R 1 test
If you don’t like the default IDs, you can set the IDs of learners and tasks via
the id option of makeLearner and make*Task. Moreover, you can conveniently
change the ID of a Learner via function setLearnerId.
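For example (hypothetical IDs chosen for illustration):
lrn = makeLearner("classif.rpart", id = "CART")   ### set the ID at construction time
lrn = setLearnerId(lrn, "DecisionTree")           ### or change it afterwards
getLearnerId(lrn)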
IDs
The IDs of all Learners, Tasks and Measures in a benchmark experiment can be
retrieved as follows:
getBMRTaskIds(bmr)
#> [1] "Sonar-example"
getBMRLearnerIds(bmr)
#> [1] "classif.lda" "classif.rpart"
getBMRMeasureIds(bmr)
#> [1] "mmce"
Learner models
Per default the BenchmarkResult also contains the fitted models for all learners
on all tasks. If you do not want to keep them set models = FALSE when calling
benchmark. The fitted models can be retrieved by function getBMRModels. It
returns a list of lists of WrappedModel objects.
getBMRModels(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> $`Sonar-example`$classif.lda[[1]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = Sonar-example; obs = 138; features = 60
#> Hyperparameters:
#>
#>
#> $`Sonar-example`$classif.rpart
#> $`Sonar-example`$classif.rpart[[1]]
#> Model for learner.id=classif.rpart;
learner.class=classif.rpart
#> Trained on: task.id = Sonar-example; obs = 138; features = 60
#> Hyperparameters: xval=0
getBMRLearners(bmr)
#> $classif.lda
#> Learner classif.lda from package MASS
#> Type: classif
#> Name: Linear Discriminant Analysis; Short name: lda
#> Class: classif.lda
#> Properties: twoclass,multiclass,numerics,factors,prob
#> Predict-Type: response
#> Hyperparameters:
#>
#>
#> $classif.rpart
#> Learner classif.rpart from package rpart
#> Type: classif
#> Name: Decision Tree; Short name: rpart
#> Class: classif.rpart
#> Properties:
twoclass,multiclass,missings,numerics,factors,ordered,prob,weights,featimp
#> Predict-Type: response
#> Hyperparameters: xval=0
getBMRMeasures(bmr)
#> [[1]]
#> Name: Mean misclassification error
#> Performance measure: mmce
#> Properties: classif,classif.multi,req.pred,req.truth
#> Minimize: TRUE
#> Best: 0; Worst: 1
#> Aggregated by: test.mean
#> Note:
Sometimes after completing a benchmark experiment it turns out that you want
to extend it by another Learner or another Task. In this case you can perform
an additional benchmark experiment and then merge the results to get a single
BenchmarkResult object that can be accessed and analyzed as usual.
mlr provides two functions to merge results: mergeBenchmarkResultLearner
combines two or more benchmark results for different sets of learners on the
same Tasks, while mergeBenchmarkResultTask fuses results obtained with the
same Learners on different sets of Tasks.
For example, in the benchmark experiment above we applied lda and rpart to
the sonar.task. We now perform a second experiment using a random forest and
quadratic discriminant analysis (qda) and merge the results.
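A sketch of how this might look (the merge function name is taken from the paragraph
above; the resampling settings and the exact call signature are assumptions):
lrns2 = list(makeLearner("classif.randomForest"), makeLearner("classif.qda"))
bmr2 = benchmark(lrns2, sonar.task, makeResampleDesc("Holdout"), show.info = FALSE)
mergeBenchmarkResultLearner(bmr, bmr2)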
Note that in the above examples in each case a resample description was passed
to the benchmark function. For this reason lda and rpart were most likely
evaluated on a different training/test set pair than random forest and qda.
Differing training/test set pairs across learners pose an additional source of
variation in the results, which can make it harder to detect actual performance
differences between learners. Therefore, if you suspect that you will have to
extend your benchmark experiment by another Learner later on it’s probably
easiest to work with ResampleInstances from the start. These can be stored and
used for any additional experiments.
Alternatively, if you used a resample description in the first benchmark experi-
ment you could also extract the ResampleInstances from the BenchmarkResult
bmr and pass these to all further benchmark calls.
rin = getBMRPredictions(bmr)[[1]][[1]]$instance
rin
#> Resample instance for 208 cases.
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
mlr offers several ways to analyze the results of a benchmark experiment. This
includes visualization, ranking of learning algorithms and hypothesis tests to
assess performance differences between learners.
In order to demonstrate the functionality we conduct a slightly larger benchmark
experiment with three learning algorithms that are applied to five classification
tasks.
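The setup of this larger experiment is not reproduced in the excerpt. Based on the
tasks, learners and measures that appear in the output below, it presumably
resembled the following sketch (the sample sizes for the mlbench tasks are illustrative):
lrns = list(
  makeLearner("classif.lda", id = "lda"),
  makeLearner("classif.rpart", id = "rpart"),
  makeLearner("classif.randomForest", id = "randomForest")
)
tasks = list(iris.task, sonar.task, pid.task,
  convertMLBenchObjToTask("mlbench.ringnorm", n = 600),
  convertMLBenchObjToTask("mlbench.waveform", n = 600))
rdesc = makeResampleDesc("CV", iters = 10)
meas = list(mmce, ber, timetrain)
bmr = benchmark(lrns, tasks, rdesc, meas, show.info = FALSE)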
#>                                              0.31153361
#> 15 Sonar-example randomForest 0.17785714    0.17442696
#> timetrain.test.mean
#> 1 0.0022
#> 2 0.0035
#> 3 0.0374
#> 4 0.0062
#> 5 0.0088
#> 6 0.3726
#> 7 0.0066
#> 8 0.0256
#> 9 0.4191
#> 10 0.0035
#> 11 0.0048
#> 12 0.3431
#> 13 0.0127
#> 14 0.0105
#> 15 0.2280
From the aggregated performance values we can see that for the iris- and
PimaIndiansDiabetes-example linear discriminant analysis performs well while
for all other tasks the random forest seems superior. Training takes longer for
the random forest than for the other learners.
In order to draw any conclusions from the average performances at least their vari-
ability has to be taken into account or, preferably, the distribution of performance
values across resampling iterations.
The individual performances on the 10 folds for every task, learner, and measure
are retrieved below.
perf = getBMRPerformances(bmr, as.df = TRUE)
head(perf)
#> task.id learner.id iter mmce ber timetrain
#> 1 iris-example lda 1 0.0000000 0.0000000 0.002
#> 2 iris-example lda 2 0.1333333 0.1666667 0.002
#> 3 iris-example lda 3 0.0000000 0.0000000 0.002
#> 4 iris-example lda 4 0.0000000 0.0000000 0.003
#> 5 iris-example lda 5 0.0000000 0.0000000 0.002
#> 6 iris-example lda 6 0.0000000 0.0000000 0.002
A closer look at the result reveals that the random forest outperforms the
classification tree in every instance, while linear discriminant analysis performs
better than rpart most of the time. Additionally lda sometimes even beats the
random forest. With increasing size of such benchmark experiments, those tables
become almost unreadable and hard to comprehend.
Integrated plots
Plots are generated using ggplot2. Further customization, such as renaming plot
elements or changing colors, is easily possible.
Visualizing performances
plotBMRBoxplots creates box or violin plots which show the distribution of
performance values across resampling iterations for one performance measure and
for all learners and tasks (and thus visualize the output of getBMRPerformances).
Below are both variants, box and violin plots. The first plot shows the mmce
and the second plot the balanced error rate (ber). Moreover, in the second plot
we color the boxes according to the learners to make them easier to distinguish.
plotBMRBoxplots(bmr, measure = mmce)
[Figure: box plots of the mean misclassification error (y-axis) for learners lda, rpart and randomForest on each task; visible panels include PimaIndiansDiabetes-example and Sonar-example.]
plotBMRBoxplots(bmr, measure = ber, style = "violin", pretty.names = FALSE) +
aes(color = learner.id) +
theme(strip.text.x = element_text(size = 8))
[Figure: violin plots of the balanced error rate (ber, y-axis) per learner and task, with the violins coloured by learner.id.]
Note that by default the measure names are used as labels for the y-axis.
mmce$name
#> [1] "Mean misclassification error"
mmce$id
#> [1] "mmce"
If you prefer the shorter ids like mmce and ber set pretty.names = FALSE (as
done for the second plot). Of course you can also use the ylab function to choose
a completely different label.
Another thing which probably comes up quite often is changing the panel headers
(which default to the Task IDs) and the learner names on the x-axis (which
default to the Learner IDs). For example looking at the above plots we would
like to remove the “example” suffixes and the “mlbench” prefixes from the panel
headers. Moreover, compared to the other learner names “randomForest” seems
a little long. Currently, the probably simplest solution is to change the factor
levels of the plotted data as shown below.
plt = plotBMRBoxplots(bmr, measure = mmce)
head(plt$data)
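A sketch of renaming the factor levels (not necessarily the tutorial's exact code;
inspect the levels first to be sure of their order):
levels(plt$data$task.id) = gsub("-example$", "", levels(plt$data$task.id))
levels(plt$data$task.id) = gsub("^mlbench\\.", "", levels(plt$data$task.id))
levels(plt$data$learner.id) = gsub("randomForest", "rF", levels(plt$data$learner.id))
plt + ylab("Error rate")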
[Figure: the customized box plots — panel headers shortened (e.g. Diabetes, Sonar), learners relabelled (lda, rpart, rF) and the y-axis labelled "Error rate".]
plotBMRSummary(bmr)
[Figure: plotBMRSummary output — aggregated mmce per task (Sonar-example, PimaIndiansDiabetes-example, mlbench.waveform, mlbench.ringnorm, iris-example) with one point per learner (lda, rpart, rf).]
Methods with best performance, i.e., with lowest mmce, are assigned the lowest
rank. Linear discriminant analysis is best for the iris and PimaIndiansDiabetes-
examples while the random forest shows best results on the remaining tasks.
plotBMRRanksAsBarChart with option pos = "tile" shows a corresponding
heat map. The ranks are displayed on the x-axis and the learners are color-coded.
plotBMRRanksAsBarChart(bmr, pos = "tile")
[Figure: plotBMRRanksAsBarChart with pos = "tile" — a heat map of learner ranks (1 to 3, x-axis) per task, learners colour-coded (lda, rpart, rf).]
A similar plot can also be obtained via plotBMRSummary. With option trafo
= "rank" the ranks are displayed instead of the aggregated performances.
plotBMRSummary(bmr, trafo = "rank", jitter = 0)
[Figure: plotBMRSummary with trafo = "rank" — learner ranks per task instead of the aggregated performances.]
Alternatively, you can draw stacked bar charts (the default) or bar charts
with juxtaposed bars (pos = "dodge") that are better suited to compare the
frequencies of learners within and across ranks.
plotBMRRanksAsBarChart(bmr)
plotBMRRanksAsBarChart(bmr, pos = "dodge")
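The overall test for performance differences across all learners is carried out with
friedmanTestBMR; its output is not reproduced in this excerpt:
friedmanTestBMR(bmr)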
In order to keep the computation time for this tutorial small, the Learners are
only evaluated on five tasks. This also means that we operate at a relatively
lenient significance level of α = 0.1. As we can reject the null hypothesis of the
Friedman test at this significance level, we might now want to test where exactly
these differences lie.
friedmanPostHocTestBMR(bmr, p.value = 0.1)
#>
#> Pairwise comparisons using Nemenyi multiple comparison test
#> with q approximation for unreplicated blocked
data
#>
#> data: mmce.test.mean and learner.id and task.id
#>
#> lda rpart
#> rpart 0.254 -
#> randomForest 0.802 0.069
#>
#> P value adjustment method: none
At this level of significance, we can reject the null hypothesis that there is no
performance difference between the decision tree (rpart) and the random forest.
[Figures: critical differences plots — average ranks (0 to 4) of lda, rpart and rf.]
Custom plots
You can easily generate your own visualizations by customizing the ggplot objects
returned by the plots above, by retrieving the data from the ggplot objects and
using it as the basis for your own plots, or by relying on the data.frames returned by
getBMRPerformances or getBMRAggrPerformances. Here are some examples.
Instead of boxplots (as in plotBMRBoxplots) we could create density plots to
show the performance values resulting from individual resampling iterations.
perf = getBMRPerformances(bmr, as.df = TRUE)
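One way to draw such density plots (a sketch restricted to two of the tasks,
matching the figure below):
library(ggplot2)
qplot(mmce, colour = learner.id, facets = . ~ task.id, geom = "density",
  data = subset(perf, task.id %in% c("iris-example", "Sonar-example")))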
[Figure: density plots of the individual mmce values per learner (lda, rpart, randomForest) for the iris-example and Sonar-example tasks.]
[Figure: box plots of the individual mmce and timetrain values (x-axis: measure, y-axis: performance) per learner, one panel per task (panels include PimaIndiansDiabetes-example and Sonar-example).]
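The data.frame df used below is not constructed in this excerpt. Judging from the
panel labels of the following plot (mmce_lda, mmce_rpart, mmce_randomForest), it
presumably contains one column of mmce values per learner, e.g. obtained by
reshaping perf:
library(reshape2)
df = melt(perf, id.vars = c("task.id", "learner.id", "iter"))
df = dcast(df, task.id + iter ~ variable + learner.id)
head(df)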
GGally::ggpairs(df, 3:5)
[Figure: GGally::ggpairs scatter plot matrix of mmce_lda, mmce_rpart and mmce_randomForest with pairwise correlations 0.145, 0.306 and 0.292.]
Further comments
• Note that for supervised classification mlr offers some more plots that
operate on BenchmarkResult objects and allow you to compare the per-
formance of learning algorithms. See for example the tutorial page on
ROC curves and functions generateThreshVsPerfData, plotROCCurves,
and plotViperCharts as well as the page about classifier calibration and
function generateCalibrationData.
• In the examples shown in this section we applied “raw” learning algorithms,
but often things are more complicated. At the very least, many learners
have hyperparameters that need to be tuned to get sensible results. Reliable
performance estimates can be obtained by nested resampling, i.e., by doing
the tuning in an inner resampling loop while estimating the performance
in an outer loop. Moreover, you might want to combine learners with pre-
processing steps like imputation, scaling, outlier removal, dimensionality
reduction or feature selection and so on. All this can be easily done
using mlr’s wrapper functionality. The general principle is explained in
the section about wrapped learners in the Advanced part of this tutorial.
There are also several sections devoted to common pre-processing steps.
• Benchmark experiments can very quickly become computationally demand-
ing. mlr offers some possibilities for parallelization.
Parallelization
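The excerpt picks up after a parallel backend was already started and used. With
the parallelMap package the typical pattern looks roughly like this (backend and
number of workers chosen for illustration):
library(parallelMap)
parallelStartSocket(2)   ### start a local socket cluster with 2 workers
rdesc = makeResampleDesc("CV", iters = 3)
r = resample("classif.lda", iris.task, rdesc)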
parallelStop()
#> Stopped parallelization. All cleaned up.
Parallelization levels
mlr offers different parallelization levels for fine-grained control over the
parallelization. E.g., if you do not want to parallelize the benchmark function
because it has only very few iterations but want to parallelize the resampling of
each learner instead, you can specifically pass the level "mlr.resample" to the
parallelStart* function. Currently the following levels are supported:
parallelGetRegisteredLevels()
#> mlr: mlr.benchmark, mlr.resample, mlr.selectFeatures,
mlr.tuneParams
If you have implemented a custom learner locally, you currently need to export it
to the slave processes yourself. So if you see an error like the following after
calling, e.g., a parallelized version of resample:
no applicable method for 'trainLearner' applied to an object of
class <my_new_learner>
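then you can export the required objects to the slaves with parallelExport before
the parallel call. The object names below are hypothetical placeholders for your
own learner's methods:
parallelExport("trainLearner.classif.myNewLearner", "predictLearner.classif.myNewLearner")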
The end
Visualization
Some examples
In the example below we create a plot of classifier performance as a function of the
decision threshold for the binary classification problem sonar.task. The generation
function generateThreshVsPerfData creates an object of class ThreshVsPerfData
which contains the data for the plot in slot $data.
lrn = makeLearner("classif.lda", predict.type = "prob")
n = getTaskSize(sonar.task)
mod = train(lrn, task = sonar.task, subset = seq(1, n, by = 2))
pred = predict(mod, task = sonar.task, subset = seq(2, n, by =
2))
d = generateThreshVsPerfData(pred, measures = list(fpr, fnr,
mmce))
class(d)
#> [1] "ThreshVsPerfData"
head(d$data)
#> fpr fnr mmce threshold
#> 1 1.0000000 0.0000000 0.4615385 0.00000000
#> 2 0.3541667 0.1964286 0.2692308 0.01010101
[Figure: plotThreshVsPerf output — fpr, fnr and mmce (y-axis: performance) as functions of the decision threshold (x-axis, 0 to 1), one panel per measure.]
Note that by default the Measure names are used to annotate the panels.
fpr$name
#> [1] "False positive rate"
fpr$id
#> [1] "fpr"
This applies not only to plotThreshVsPerf but also to other plot functions that
show performance measures, for example plotLearningCurve. You can use the ids
instead of the names by setting pretty.names = FALSE.
Customizing plots
As mentioned above, it is easy to customize the built-in plots or to make your
own visualizations from scratch based on the generated data.
What will probably come up most often is changing labels and annotations.
Generally, this can be done by manipulating the ggplot object, in this example
the object returned by plotThreshVsPerf, using the usual ggplot2 functions
like ylab or labeller. Moreover, you can change the underlying data, either
d$data (resulting from generateThreshVsPerfData) or the possibly reshaped data
stored in the ggplot object (plt$data).
levels(plt$data$measure)
#> [1] "fpr" "fnr" "mmce"
[Figure: the same plot with relabelled panels, y-axis "Performance" and x-axis "Cutoff".]
Using the labeller function requires calling facet_wrap (or facet_grid), which
can be useful if you want to change how the panels are positioned (number of
rows and columns) or influence the axis limits.
plt = plotThreshVsPerf(d, pretty.names = FALSE)
measure_names = c(
fpr = "False positive rate",
fnr = "False negative rate",
mmce = "Error rate"
)
### Manipulate the measure names via the labeller function and
### arrange the panels in two columns and choose common axis
limits for all panels
plt = plt + facet_wrap( ~ measure, labeller = labeller(measure =
measure_names), ncol = 2)
plt = plt + xlab("Decision threshold") + ylab("Performance")
plt
[Figure: the plot rearranged by facet_wrap into two columns with common axis limits; panels labelled with the full measure names (e.g. "Error rate"), x-axis "Decision threshold", y-axis "Performance".]
[Figure: a further threshold-vs-performance plot with panels for fpr and the other measures on their own scales (x-axis: threshold).]
Let’s conclude with a brief look at a second example. Here we use plotPartialDependence,
but extract the data from the ggplot object plt and use it to create a traditional
graphics::plot in addition to the ggplot2 plot.
sonar = getTaskData(sonar.task)
pd = generatePartialDependenceData(mod, sonar, "V11")
plt = plotPartialDependence(pd)
head(plt$data)
#> Class Probability Feature Value
#> 1 M 0.2737158 V11 0.0289000
#> 2 M 0.3689970 V11 0.1072667
#> 3 M 0.4765742 V11 0.1856333
#> 4 M 0.5741233 V11 0.2640000
#> 5 M 0.6557857 V11 0.3423667
#> 6 M 0.7387962 V11 0.4207333
plt
[Figure: plotPartialDependence output — predicted probability for class M (y-axis) as a function of V11.]
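A base-graphics version of the same curve can be drawn from the extracted data (a sketch):
plot(Probability ~ Value, data = plt$data, type = "b", xlab = "V11", ylab = "Probability")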
[Figure: the same partial dependence curve drawn with base graphics — Probability (y-axis) vs. V11 (x-axis).]
The following table shows the currently available generation and plotting functions.
It also references the tutorial pages that provide in-depth descriptions of the listed
functions.
Note that some plots, e.g., plotTuneMultiCritResult are not described here since
they lack a generation function. Both plotThreshVsPerf and plotROCCurves
operate on the result of generateThreshVsPerfData.
The ggvis functions are experimental and subject to change, though they should
work. Most generate interactive shiny applications that automatically start and
run locally.
generation function              ggplot2 plotting function          ggvis plotting function        tutorial page
generateThreshVsPerfData         plotThreshVsPerf, plotROCCurves    plotThreshVsPerfGGVIS          Performance, ROC Analysis
generateCritDifferencesData      plotCritDifferences                –                              Benchmark Experiments
generateHyperParsEffectData      plotHyperParsEffect                –                              Tuning
generateFilterValuesData         plotFilterValues                   plotFilterValuesGGVIS          Feature Selection
generateLearningCurveData        plotLearningCurve                  plotLearningCurveGGVIS         Learning Curves
generatePartialDependenceData    plotPartialDependence              plotPartialDependenceGGVIS     Partial Dependence Plots
generateFunctionalANOVAData      plotPartialDependence              –                              Partial Dependence Plots
generateCalibrationData          plotCalibration                    –                              Classifier Calibration Plots
Advanced
Configuring mlr
mlr is designed to make usage errors due to typos or invalid parameter values as
unlikely as possible. Occasionally, you might want to break those barriers and
get full access, for example to reduce the amount of output on the console or to
turn off checks. For all available options simply refer to the documentation of
configureMlr. In the following we show some common use cases.
Generally, function configureMlr permits setting options globally for your current
R session.
It is also possible to set options locally.
• All options referring to the behavior of learners (these are all options except
show.info) can be set for an individual learner via the config argument
of makeLearner. The local configuration takes precedence over the global one.
• Some functions like resample, benchmark, selectFeatures, tuneParams,
and tuneParamsMultiCrit have a show.info flag that controls if progress
messages are shown. The default value of show.info can be set by config-
ureMlr.
Are you bothered by all the output on the console, like in this example?
rdesc = makeResampleDesc("Holdout")
r = resample("classif.multinom", iris.task, rdesc)
#> [Resample] holdout iter: 1
#> # weights: 18 (10 variable)
#> initial value 109.861229
#> iter 10 value 12.256619
#> iter 20 value 3.638740
You can suppress the output for this Learner and this resample call as follows:
lrn = makeLearner("classif.multinom", config =
list(show.learner.output = FALSE))
r = resample(lrn, iris.task, rdesc, show.info = FALSE)
(Note that multinom has a trace switch that can alternatively be used to turn
off the progress messages.)
To globally suppress the output for all subsequent learners and calls to resample,
benchmark etc. do the following:
configureMlr(show.learner.output = FALSE, show.info = FALSE)
r = resample("classif.multinom", iris.task, rdesc)
Function getMlrOptions returns the current configuration; calling configureMlr
without arguments resets all options to their defaults:
getMlrOptions()
#> $on.learner.error
#> [1] "stop"
#>
#> $on.learner.warning
#> [1] "warn"
#>
#> $on.par.out.of.bounds
#> [1] "stop"
#>
#> $on.par.without.desc
#> [1] "stop"
#>
#> $show.info
#> [1] TRUE
#>
#> $show.learner.output
#> [1] TRUE
It might happen that you want to set a parameter of a Learner, but the parameter
is not registered in the learner’s parameter set yet. In this case you might want
to contact us or open an issue as well! But until the problem is fixed you can
turn off mlr’s parameter checking. The parameter setting will then be passed to
the underlying function without further ado.
### Support Vector Machine with linear kernel and new parameter
'newParam'
lrn = makeLearner("classif.ksvm", kernel = "vanilladot",
newParam = 3)
#> Error in setHyperPars2.Learner(learner, insert(par.vals,
args)): classif.ksvm: Setting parameter newParam without
available description object!
#> Did you mean one of these hyperparameters instead: degree
scaled kernel
#> You can switch off this check by using configureMlr!
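The step that switches the check off is not shown here; presumably the
on.par.without.desc option of configureMlr was relaxed, after which the learner
can be constructed (with a warning instead of an error) and trained:
configureMlr(on.par.without.desc = "warn")
lrn = makeLearner("classif.ksvm", kernel = "vanilladot", newParam = 3)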
train(lrn, iris.task)
#> Model for learner.id=classif.ksvm; learner.class=classif.ksvm
#> Trained on: task.id = iris-example; obs = 150; features = 4
#> Hyperparameters: fit=FALSE,kernl=vanilladot,newParam=3
larger benchmark study with multiple data sets and learners, you usually don’t
want the whole experiment stopped due to one error. You can prevent this using
the on.learner.error option of configureMlr.
### This call gives an error caused by the low number of
observations in class "virginica"
train("classif.qda", task = iris.task, subset = 1:104)
#> Error in qda.default(x, grouping, ...): some group is too
small for 'qda'
#> Timing stopped at: 0.003 0 0.002
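The configuration change and the call that produced mod are not shown in this
excerpt; presumably the error was demoted to a warning so that training returns a
FailureModel instead of stopping:
configureMlr(on.learner.error = "warn")
mod = train("classif.qda", task = iris.task, subset = 1:104)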
mod
#> Model for learner.id=classif.qda; learner.class=classif.qda
#> Trained on: task.id = iris-example; obs = 104; features = 4
#> Hyperparameters:
#> Training failed: Error in qda.default(x, grouping, ...) :
#> some group is too small for 'qda'
#>
#> Training failed: Error in qda.default(x, grouping, ...) :
#> some group is too small for 'qda'
performance(pred)
#> mmce
#> NA
Wrapper
data(iris)
task = makeClassifTask(data = iris, target = "Species", weights
= as.integer(iris$Species))
Next, we use makeBaggingWrapper to create the base learners and the bagged
learner. We choose to set equivalents of ntree (100 base learners) and mtry
(proportion of randomly selected features).
base.lrn = makeLearner("classif.rpart")
wrapped.lrn = makeBaggingWrapper(base.lrn, bw.iters = 100,
bw.feats = 0.5)
print(wrapped.lrn)
#> Learner classif.rpart.bagged from package rpart
#> Type: classif
#> Name: ; Short name:
#> Class: BaggingWrapper
#> Properties:
twoclass,multiclass,missings,numerics,factors,ordered,prob,weights,featimp
#> Predict-Type: response
#> Hyperparameters: xval=0,bw.iters=100,bw.feats=0.5
As we can see in the output, the wrapped learner inherited all properties from
the base learner; in particular, the "weights" property is still present. We can use
this newly constructed learner like any other learner, i.e., we can use it in train,
benchmark, resample, etc.
benchmark(tasks = task, learners = list(base.lrn, wrapped.lrn))
#> Task: iris, Learner: classif.rpart
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] cross-validation iter: 4
#> [Resample] cross-validation iter: 5
#> [Resample] cross-validation iter: 6
#> [Resample] cross-validation iter: 7
#> [Resample] cross-validation iter: 8
#> [Resample] cross-validation iter: 9
#> [Resample] cross-validation iter: 10
#> [Resample] Result: mmce.test.mean=0.0667
#> Task: iris, Learner: classif.rpart.bagged
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] cross-validation iter: 4
#> [Resample] cross-validation iter: 5
#> [Resample] cross-validation iter: 6
#> [Resample] cross-validation iter: 7
So far we are quite happy with our new learner. But we hope for better
performance by tuning some hyperparameters of both the decision trees and the
bagging wrapper. Let's have a look at the available hyperparameters of the fused
learner:
getParamSet(wrapped.lrn)
#> Type len Def Constr Req Tunable Trafo
#> bw.iters integer - 10 1 to Inf - TRUE -
#> bw.replace logical - TRUE - - TRUE -
#> bw.size numeric - - 0 to 1 - TRUE -
#> bw.feats numeric - 0.667 0 to 1 - TRUE -
#> minsplit integer - 20 1 to Inf - TRUE -
#> minbucket integer - - 1 to Inf - TRUE -
#> cp numeric - 0.01 0 to 1 - TRUE -
#> maxcompete integer - 4 0 to Inf - TRUE -
#> maxsurrogate integer - 5 0 to Inf - TRUE -
#> usesurrogate discrete - 2 0,1,2 - TRUE -
#> surrogatestyle discrete - 0 0,1 - TRUE -
#> maxdepth integer - 30 1 to 30 - TRUE -
#> xval integer - 10 0 to Inf - FALSE -
#> parms untyped - - - - TRUE -
We choose to tune the parameters minsplit and bw.feats for the mmce using
a random search in a 3-fold CV:
ctrl = makeTuneControlRandom(maxit = 10)
rdesc = makeResampleDesc("CV", iters = 3)
par.set = makeParamSet(
makeIntegerParam("minsplit", lower = 1, upper = 10),
makeNumericParam("bw.feats", lower = 0.25, upper = 1)
)
tuned.lrn = makeTuneWrapper(wrapped.lrn, rdesc, mmce, par.set,
ctrl)
print(tuned.lrn)
#> Learner classif.rpart.bagged.tuned from package rpart
#> Type: classif
#> Name: ; Short name:
#> Class: TuneWrapper
#> Properties:
numerics,factors,ordered,missings,weights,prob,twoclass,multiclass,featimp
#> Predict-Type: response
#> Hyperparameters: xval=0,bw.iters=100,bw.feats=0.5
Calling the train method of the newly constructed learner performs the following
steps:
1. The tuning wrapper sets parameters for the underlying model in slot
$next.learner and calls its train method.
2. Next in line is the bagging wrapper. The passed-down argument bw.feats
is used in the bagging wrapper's training function, while minsplit is passed
further down to its $next.learner. The bagging wrapper calls the base learner
bw.iters times and stores the resulting models.
3. The bagged models are evaluated using the mean mmce (default aggregation
for this performance measure) and new parameters are selected using the
tuning method.
4. This is repeated until the tuner terminates. Output is a tuned bagged
learner.
lrn = train(tuned.lrn, task = task)
#> [Tune] Started tuning learner classif.rpart.bagged for
parameter set:
#> Type len Def Constr Req Tunable Trafo
#> minsplit integer - - 1 to 10 - TRUE -
#> bw.feats numeric - - 0.25 to 1 - TRUE -
#> With control class: TuneControlRandom
#> Imputation value: 1
#> [Tune-x] 1: minsplit=5; bw.feats=0.935
#> [Tune-y] 1: mmce.test.mean=0.0467; time: 0.1 min; memory: 179Mb use, 711Mb max
#> [Tune-x] 2: minsplit=9; bw.feats=0.675
#> [Tune-y] 2: mmce.test.mean=0.0467; time: 0.1 min; memory: 179Mb use, 711Mb max
#> [Tune-x] 3: minsplit=2; bw.feats=0.847
#> [Tune-y] 3: mmce.test.mean=0.0467; time: 0.1 min; memory: 179Mb use, 711Mb max
#> [Tune-x] 4: minsplit=4; bw.feats=0.761
#> [Tune-y] 4: mmce.test.mean=0.0467; time: 0.1 min; memory: 179Mb use, 711Mb max
#> [Tune-x] 5: minsplit=6; bw.feats=0.338
#> [Tune-y] 5: mmce.test.mean=0.0867; time: 0.0 min; memory: 179Mb use, 711Mb max
#> [Tune-x] 6: minsplit=1; bw.feats=0.637
#> [Tune-y] 6: mmce.test.mean=0.0467; time: 0.1 min; memory: 179Mb use, 711Mb max
#> [Tune-x] 7: minsplit=1; bw.feats=0.998
Data Preprocessing
Data preprocessing refers to any transformation of the data done before ap-
plying a learning algorithm. This comprises for example finding and resolving
inconsistencies, imputation of missing values, identifying, removing or replacing
outliers, discretizing numerical data or generating numerical dummy variables
for categorical data, any kind of transformation like standardization of predictors
or Box-Cox, dimensionality reduction and feature extraction and/or selection.
mlr offers several options for data preprocessing. Some of the following simple
methods to change a Task (or data.frame) were already mentioned on the page
about learning tasks:
• capLargeValues: Convert large/infinite numeric values.
• createDummyFeatures: Generate dummy variables for factor features.
• dropFeatures: Remove selected features.
• joinClassLevels: Only for classification: Merge existing classes to new,
larger classes.
• mergeSmallFactorLevels: Merge infrequent levels of factor features.
• normalizeFeatures: Normalize features by different methods, e.g., stan-
dardization or scaling to a certain range.
• removeConstantFeatures: Remove constant features.
• subsetTask: Remove observations and/or features from a Task.
Moreover, there are tutorial pages devoted to
• Feature selection and
• the preprocessing is not done globally, i.e., for the whole data set, but for
every pair of training/test data sets in, e.g., resampling,
• any parameters controlling the preprocessing, e.g., the percentage of outliers to
be removed, can be tuned together with the base learner parameters.
We start with some examples for makePreprocWrapperCaret.
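The wrapper construction itself is not shown in this excerpt; the general pattern
is makePreprocWrapperCaret(learner, ...). For example (the specific options here —
knn imputation and PCA with 10 components — are illustrative guesses based on the
surrounding text):
lrn = makePreprocWrapperCaret("classif.qda", ppc.knnImpute = TRUE, ppc.pca = TRUE, ppc.pcaComp = 10)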
Here, learner is an mlr Learner or the name of a learner class like "classif.lda".
If you enable multiple preprocessing options (like knn imputation and principal
component analysis above) these are executed in a certain order detailed on the
help page of function preProcess.
In the following we show an example where principal component analysis (PCA)
is used for dimensionality reduction. This should never be applied blindly, but it
can be beneficial for learners that struggle with high dimensionality or that can
profit from a rotation of the data.
We consider the sonar.task, which poses a binary classification problem with 208
observations and 60 features.
sonar.task
#> Supervised task: Sonar-example
#> Type: classif
#> Target: Class
#> Observations: 208
#> Features:
#> numerics factors ordered
#> 60 0 0
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Classes: 2
#> M R
#> 111 97
#> Positive class: M
getLearnerModel(mod)
#> Model for learner.id=classif.qda; learner.class=classif.qda
Below the performances of qda with and without PCA preprocessing are com-
pared in a benchmark experiment. Note that we use stratified resampling to
prevent errors in qda due to a too small number of observations from either
class.
rin = makeResampleInstance("CV", iters = 3, stratify = TRUE,
task = sonar.task)
res = benchmark(list(makeLearner("classif.qda"), lrn),
sonar.task, rin, show.info = FALSE)
res
#> task.id learner.id mmce.test.mean
PCA preprocessing in this case turns out to be really beneficial for the perfor-
mance of Quadratic Discriminant Analysis.
Alternatively, we can tune the number of principal components (ppc.pcaComp)
directly. Moreover, for qda we try two different ways to estimate the posterior
probabilities (parameter predict.method): the usual plug-in estimates and unbiased
estimates. We perform a grid search and set the resolution to 10. This is just for
demonstration; you might want to use a finer resolution.
ps = makeParamSet(
makeIntegerParam("ppc.pcaComp", lower = 1, upper =
getTaskNFeats(sonar.task)),
makeDiscreteParam("predict.method", values = c("plug-in",
"debiased"))
)
ctrl = makeTuneControlGrid(resolution = 10)
res = tuneParams(lrn, sonar.task, rin, par.set = ps, control =
ctrl, show.info = FALSE)
res
#> Tune result:
#> Op. pars: ppc.pcaComp=8; predict.method=plug-in
#> mmce.test.mean=0.192
as.data.frame(res$opt.path)[1:3]
#> ppc.pcaComp predict.method mmce.test.mean
#> 1 1 plug-in 0.4757074
#> 2 8 plug-in 0.1920635
#> 3 14 plug-in 0.2162871
#> 4 21 plug-in 0.2643202
#> 5 27 plug-in 0.2454106
#> 6 34 plug-in 0.2645273
#> 7 40 plug-in 0.2742581
#> 8 47 plug-in 0.3173223
#> 9 53 plug-in 0.3512767
#> 10 60 plug-in 0.3941339
#> 11 1 debiased 0.5336094
#> 12 8 debiased 0.2450656
#> 13 14 debiased 0.2403037
#> 14 21 debiased 0.2546584
#> 15 27 debiased 0.3075224
#> 16 34 debiased 0.3172533
#> 17 40 debiased 0.3125604
#> 18 47 debiased 0.2979986
#> 19 53 debiased 0.3079365
#> 20 60 debiased 0.3654244
trainfun = function(data, target, args = list(center, scale)) {
cns = colnames(data)
nums = setdiff(cns[sapply(data, is.numeric)], target)
## Extract numerical features from the data set and call scale
x = as.matrix(data[, nums, drop = FALSE])
x = scale(x, center = args$center, scale = args$scale)
## Store the scaling parameters in control
## These are needed to preprocess the data before prediction
control = args
if (is.logical(control$center) && control$center)
control$center = attr(x, "scaled:center")
if (is.logical(control$scale) && control$scale)
control$scale = attr(x, "scaled:scale")
## Recombine the data
data = data[, setdiff(cns, nums), drop = FALSE]
data = cbind(data, as.data.frame(x))
return(list(data = data, control = control))
}
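makePreprocWrapper also needs a matching predict function that applies the scaling
parameters stored in control to new data. It is not shown in this excerpt, but a
minimal sketch looks like this:
predictfun = function(data, target, args, control) {
  ### Extract the numerical features and scale them with the parameters stored during training
  cns = colnames(data)
  nums = cns[sapply(data, is.numeric)]
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = control$center, scale = control$scale)
  ### Recombine with the non-numeric columns
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(data)
}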
Let’s compare the cross-validated mean squared error (mse) on the Boston
Housing data set with and without scaling.
rdesc = makeResampleDesc("CV", iters = 3)
Often it’s not clear which preprocessing options work best with a certain learning
algorithm. As already shown for the number of principal components in makePre-
procWrapperCaret we can tune them easily together with other hyperparameters
of the learner.
In our scaling example we can check whether nnet works best with both centering
and scaling the data, or whether it is better to omit one of the two operations or to
do no preprocessing at all. In order to tune center and scale we have to add
appropriate LearnerParams to the parameter set of the wrapped learner.
As mentioned above scale allows for numeric and logical center and scale
arguments. As we want to use the latter option we declare center and scale
as logical learner parameters.
lrn = makeLearner("regr.nnet", trace = FALSE)
lrn = makePreprocWrapper(lrn, train = trainfun, predict =
predictfun,
par.set = makeParamSet(
makeLogicalLearnerParam("center"),
makeLogicalLearnerParam("scale")
),
par.vals = list(center = TRUE, scale = TRUE))
lrn
#> Learner regr.nnet.preproc from package nnet
#> Type: regr
#> Name: ; Short name:
#> Class: PreprocWrapper
#> Properties: numerics,factors,weights
#> Predict-Type: response
#> Hyperparameters: size=3,trace=FALSE,center=TRUE,scale=TRUE
getParamSet(lrn)
#> Type len Def Constr Req Tunable Trafo
#> center logical - - - - TRUE -
#> scale logical - - - - TRUE -
#> size integer - 3 0 to Inf - TRUE -
#> maxit integer - 100 1 to Inf - TRUE -
#> linout logical - FALSE - Y TRUE -
#> entropy logical - FALSE - Y TRUE -
#> softmax logical - FALSE - Y TRUE -
#> censored logical - FALSE - Y TRUE -
#> skip logical - FALSE - - TRUE -
#> rang numeric - 0.7 -Inf to Inf - TRUE -
#> decay numeric - 0 0 to Inf - TRUE -
#> Hess logical - FALSE - - TRUE -
#> trace logical - TRUE - - FALSE -
Now we do a simple grid search for the decay parameter of nnet and the center
and scale parameters.
rdesc = makeResampleDesc("Holdout")
ps = makeParamSet(
makeDiscreteParam("decay", c(0, 0.05, 0.1)),
makeLogicalParam("center"),
makeLogicalParam("scale")
)
ctrl = makeTuneControlGrid()
res = tuneParams(lrn, bh.task, rdesc, par.set = ps, control =
ctrl, show.info = FALSE)
res
#> Tune result:
#> Op. pars: decay=0.05; center=FALSE; scale=TRUE
#> mse.test.mean=14.8
as.data.frame(res$opt.path)
#>    decay center scale mse.test.mean dob eol error.message exec.time
#> 1      0   TRUE  TRUE      49.38128   1  NA          <NA>     0.038
#> 2   0.05   TRUE  TRUE      20.64761   2  NA          <NA>     0.045
#> 3    0.1   TRUE  TRUE      22.42986   3  NA          <NA>     0.050
#> 4      0  FALSE  TRUE      96.25474   4  NA          <NA>     0.022
#> 5   0.05  FALSE  TRUE      14.84306   5  NA          <NA>     0.047
#> 6    0.1  FALSE  TRUE      16.65383   6  NA          <NA>     0.044
#> 7      0   TRUE FALSE      40.51518   7  NA          <NA>     0.044
#> 8   0.05   TRUE FALSE      68.00069   8  NA          <NA>     0.044
#> 9    0.1   TRUE FALSE      55.42210   9  NA          <NA>     0.046
#> 10     0  FALSE FALSE      96.25474  10  NA          <NA>     0.022
#> 11  0.05  FALSE FALSE      56.25758  11  NA          <NA>     0.044
#> 12   0.1  FALSE FALSE      42.85529  12  NA          <NA>     0.045
lrn = makePreprocWrapperScale("classif.lda")
train(lrn, iris.task)
#> Model for learner.id=classif.lda.preproc;
learner.class=PreprocWrapper
#> Trained on: task.id = iris-example; obs = 150; features = 4
#> Hyperparameters: center=TRUE,scale=TRUE
mlr provides several imputation methods, which are listed on the help page
imputations. These include standard techniques such as imputation by a constant
value (e.g., a fixed constant, the mean, median or mode) and by random numbers
(drawn either from the empirical distribution of the feature under consideration or
from a certain distribution family). Moreover, missing values in one feature can be
replaced based on the other features by predictions from any supervised Learner
integrated into mlr.
If your favourite option is not implemented in mlr yet, you can easily create your
own imputation method.
Also note that some of the learning algorithms included in mlr can deal with
missing values in a sensible way, i.e., other than simply deleting observations
with missing values. Those Learners have the property "missings" and thus
can be identified using listLearners.
### Regression learners that can deal with missing values
listLearners("regr", properties = "missings")[c("class",
"package")]
#> class package
#> 1 regr.blackboost mboost,party
#> 2 regr.cforest party
#> 3 regr.ctree party
#> 4 regr.cubist Cubist
#> 5 regr.gbm gbm
#> 6 regr.randomForestSRC randomForestSRC
#> 7 regr.randomForestSRCSyn randomForestSRC
#> 8 regr.rpart rpart
Imputation is performed by function impute. Imputation methods can be specified
either for individual features or for classes of features such as numerics or
factors. Moreover, you can generate dummy variables that indicate which values
are missing, also either for classes of features or for individual features. These
allow you to identify the patterns and reasons for missing data and permit treating
imputed and observed values differently in a subsequent analysis.
Let’s have a look at the airquality data set.
data(airquality)
summary(airquality)
#>      Ozone           Solar.R            Wind             Temp
#>  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
#>  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
#>  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
#>  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
#>  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
#>  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
#>  NA's   :37       NA's   :  7
#>      Month            Day
#>  Min.   :5.000   Min.   : 1.0
#>  1st Qu.:6.000   1st Qu.: 8.0
#>  Median :7.000   Median :16.0
#>  Mean   :6.993   Mean   :15.8
#>  3rd Qu.:8.000   3rd Qu.:23.0
#>  Max.   :9.000   Max.   :31.0
#>
There are 37 NA's in variable Ozone (ozone pollution) and 7 NA's in variable
Solar.R (solar radiation). For demonstration purposes we insert artificial NA's
in column Wind (wind speed) and coerce it into a factor.
airq = airquality
ind = sample(nrow(airq), 10)
airq$Wind[ind] = NA
airq$Wind = cut(airq$Wind, c(0,8,16,24))
summary(airq)
#> Ozone Solar.R Wind Temp
#> Min. : 1.00 Min. : 7.0 (0,8] :51 Min. :56.00
#> 1st Qu.: 18.00 1st Qu.:115.8 (8,16] :86 1st Qu.:72.00
#> Median : 31.50 Median :205.0 (16,24]: 6 Median :79.00
#> Mean : 42.13 Mean :185.9 NA's :10 Mean :77.88
#> 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:85.00
If you want to impute NA's in all integer features (these include Ozone and
Solar.R) by the mean, in all factor features (Wind) by the mode and additionally
generate dummy variables for all integer features, you can do this as follows:
imp = impute(airq, classes = list(integer = imputeMean(), factor
= imputeMode()),
dummy.classes = "integer")
impute returns a list where slot $data contains the imputed data set. Per default,
the dummy variables are factors with levels "TRUE" and "FALSE". It is also
possible to create numeric zero-one indicator variables.
head(imp$data, 10)
#>       Ozone  Solar.R    Wind Temp Month Day Ozone.dummy Solar.R.dummy
#> 1  41.00000 190.0000   (0,8]   67     5   1       FALSE         FALSE
#> 2  36.00000 118.0000   (0,8]   72     5   2       FALSE         FALSE
#> 3  12.00000 149.0000  (8,16]   74     5   3       FALSE         FALSE
#> 4  18.00000 313.0000  (8,16]   62     5   4       FALSE         FALSE
#> 5  42.12931 185.9315  (8,16]   56     5   5        TRUE          TRUE
#> 6  28.00000 185.9315  (8,16]   66     5   6       FALSE          TRUE
#> 7  23.00000 299.0000  (8,16]   65     5   7       FALSE         FALSE
#> 8  19.00000  99.0000  (8,16]   59     5   8       FALSE         FALSE
#> 9   8.00000  19.0000 (16,24]   61     5   9       FALSE         FALSE
#> 10 42.12931 194.0000  (8,16]   69     5  10        TRUE         FALSE
Slot $desc is an ImputationDesc object that stores all relevant information about
the imputation. For the current example this includes the means and the mode
computed on the non-missing data.
imp$desc
#> Imputation description
#> Target:
#> Features: 6; Imputed: 6
#> impute.new.levels: TRUE
#> recode.factor.levels: TRUE
#> dummy.type: factor
The imputation description shows the name of the target variable (not present),
the number of features and the number of imputed features. Note that the latter
number refers to the features for which an imputation method was specified
(five integers plus one factor) and not to the features actually containing NA's.
dummy.type indicates that the dummy variables are factors. For details on
impute.new.levels and recode.factor.levels see the help page of function
impute.
Let’s have a look at another example involving a target variable. A possible
learning task associated with the airquality data is to predict the ozone pollution
based on the meteorological features. Since we do not want to use columns Day
and Month we remove them.
airq = subset(airq, select = 1:4)
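The examples below operate on a training portion airq.train and a test portion
airq.test of the data; the split itself is not shown in this excerpt, but presumably
resembles the following:
airq.train = airq[1:100, ]
airq.test = airq[101:nrow(airq), ]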
In case of a supervised learning problem you need to pass the name of the target
variable to impute. This prevents imputation and creation of a dummy variable
for the target variable itself and makes sure that the target variable is not used
to impute the features.
In contrast to the example above we specify imputation methods for individual
features instead of classes of features.
Missing values in Solar.R are imputed by random numbers drawn from the
empirical distribution of the non-missing observations.
Function imputeLearner allows using any supervised learning algorithm integrated
into mlr for imputation. The type of the Learner (regr, classif) must
correspond to the class of the feature to be imputed. The missing values in
Wind are replaced by the predictions of a classification tree (rpart). Per default,
all available columns in airq.train except the target variable (Ozone) and the
variable to be imputed (Wind) are used as features in the classification tree, here
Solar.R and Temp. You can also select manually which columns to use. Note
that rpart can deal with missing feature values, therefore the NA's in column
Solar.R do not pose a problem.
imp = impute(airq.train, target = "Ozone", cols = list(Solar.R =
imputeHist(),
Wind = imputeLearner("classif.rpart")), dummy.cols =
c("Solar.R", "Wind"))
summary(imp$data)
#> Ozone Solar.R Wind Temp
#> Min. : 1.00 Min. : 7.00 (0,8] :34 Min. :56.00
#> 1st Qu.: 16.00 1st Qu.: 98.75 (8,16] :61 1st Qu.:69.00
#> Median : 34.00 Median :221.50 (16,24]: 5 Median :79.50
#> Mean : 41.59 Mean :191.54 Mean :76.87
#> 3rd Qu.: 63.00 3rd Qu.:274.25 3rd Qu.:84.00
#> Max. :135.00 Max. :334.00 Max. :93.00
#> NA's :31
#> Solar.R.dummy Wind.dummy
#> FALSE:93 FALSE:92
#> TRUE : 7 TRUE : 8
#>
#>
#>
#>
#>
imp$desc
#> Imputation description
#> Target: Ozone
#> Features: 3; Imputed: 2
#> impute.new.levels: TRUE
#> recode.factor.levels: TRUE
#> dummy.type: factor
The ImputationDesc object can be used by function reimpute to impute the test
data set the same way as the training data.
airq.test.imp = reimpute(airq.test, imp$desc)
head(airq.test.imp)
#> Ozone Solar.R Wind Temp Solar.R.dummy Wind.dummy
#> 1 110 207 (0,8] 90 FALSE FALSE
#> 2 NA 222 (8,16] 92 FALSE FALSE
#> 3 NA 137 (8,16] 86 FALSE FALSE
#> 4 44 192 (8,16] 86 FALSE FALSE
#> 5 28 273 (8,16] 82 FALSE FALSE
#> 6 65 157 (8,16] 80 FALSE FALSE
Before training the resulting Learner, impute is applied to the training set.
Before prediction, reimpute is called on the test set using the ImputationDesc
object from the training stage.
We again aim to predict the ozone pollution from the meteorological variables.
In order to create the Task we need to delete observations with missing values
in the target variable.
airq = subset(airq, subset = !is.na(airq$Ozone))
task = makeRegrTask(data = airq, target = "Ozone")
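The fused learner itself is created with makeImputeWrapper; a sketch with
illustrative imputation choices:
lrn = makeImputeWrapper("regr.rpart",
  classes = list(numeric = imputeMedian(), factor = imputeMode()),
  dummy.classes = "numeric")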
Generic Bagging
One reason why random forests perform so well is that they use bagging as a
technique to gain more stability. But why limit yourself to the classifiers already
implemented in well-known random forest packages when it is really easy to build
your own bagged ensemble with mlr?
Just bag any mlr learner with makeBaggingWrapper.
As in a random forest, we need a Learner which is trained on a subset of the data
during each iteration of the bagging process. The subsets are chosen according
to the parameters given to makeBaggingWrapper:
• bw.iters On how many subsets (samples) do we want to train our Learner?
• bw.replace Sample with replacement (also known as bootstrapping)?
• bw.size Percentage size of the samples. If bw.replace = TRUE, bw.size
= 1 is the default. This does not mean that each sample will contain all
the observations, since observations are drawn with replacement and will
typically occur multiple times per sample.
• bw.feats Percentage size of randomly selected features for each iteration.
Of course we also need a Learner which we have to pass to makeBaggingWrapper.
lrn = makeLearner("classif.rpart")
bag.lrn = makeBaggingWrapper(lrn, bw.iters = 50, bw.replace =
TRUE, bw.size = 0.8, bw.feats = 3/4)
Now we can compare the performance with and without bagging. First let’s try
it without bagging:
rdesc = makeResampleDesc("CV", iters = 10)
r = resample(learner = lrn, task = sonar.task, resampling =
rdesc, show.info = FALSE)
r$aggr
#> mmce.test.mean
#> 0.2735714
Training more learners takes more time, but the bagged ensemble can outperform a
single learner on noisy data with many features.
Note that it is not relevant if the base learner itself can predict probabilities
and that for this reason the predict type of the base learner always has to be
"response".
For regression the mean value across predictions is computed. Moreover, the
standard deviation across predictions is estimated if the predict type of the
bagging learner is changed to "se". Below, we give a small example for regression.
n = getTaskSize(bh.task)
train.inds = seq(1, n, 3)
test.inds = setdiff(1:n, train.inds)
lrn = makeLearner("regr.rpart")
bag.lrn = makeBaggingWrapper(lrn)
bag.lrn = setPredictType(bag.lrn, predict.type = "se")
mod = train(learner = bag.lrn, task = bh.task, subset =
train.inds)
With function getLearnerModel, you can access the models fitted in the individual
iterations.
head(getLearnerModel(mod), 2)
#> [[1]]
#> Model for learner.id=regr.rpart; learner.class=regr.rpart
#> Trained on: task.id = BostonHousing-example; obs = 169;
features = 13
#> Hyperparameters: xval=0
#>
#> [[2]]
#> Model for learner.id=regr.rpart; learner.class=regr.rpart
#> Trained on: task.id = BostonHousing-example; obs = 169;
features = 13
#> Hyperparameters: xval=0
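The prediction object pred referred to below is not created in this excerpt;
presumably the bagged model is applied to the held-out observations:
pred = predict(mod, task = bh.task, subset = test.inds)
head(as.data.frame(pred))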
In the column labelled se the standard deviation for each prediction is given.
Let’s visualise this a bit using ggplot2. Here we plot the percentage of lower
status of the population (lstat) against the prediction.
library("ggplot2")
library("reshape2")
data = cbind(as.data.frame(pred), getTaskData(bh.task, subset =
test.inds))
g = ggplot(data, aes(x = lstat, y = response, ymin =
response-se, ymax = response+se, col = age))
g + geom_point() + geom_linerange(alpha=0.5)
[Figure: predicted response with error bars (response ± se, y-axis) plotted against lstat (x-axis), points coloured by age.]
Advanced Tuning
The package supports a large number of tuning algorithms, which can all be
looked up and selected via TuneControl. One of the cooler algorithms is iterated
F-racing from the irace package. This not only works for arbitrary parameter
types (numeric, integer, discrete, logical), but also for so-called dependent /
hierarchical parameters:
ps = makeParamSet(
makeNumericParam("C", lower = -12, upper = 12, trafo =
function(x) 2^x),
makeDiscreteParam("kernel", values = c("vanilladot",
"polydot", "rbfdot")),
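  ### The dependent kernel parameters and the irace control object were cut off at this
  ### point of the excerpt; a plausible continuation (bounds chosen for illustration) is:
  makeNumericParam("sigma", lower = -12, upper = 12, trafo = function(x) 2^x,
    requires = quote(kernel == "rbfdot")),
  makeIntegerParam("degree", lower = 2L, upper = 5L,
    requires = quote(kernel == "polydot"))
)
ctrl = makeTuneControlIrace(maxExperiments = 200L)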
See how we made the kernel parameters like sigma and degree dependent on
the kernel selection parameter? This approach allows you to tune parameters
of multiple kernels at once, efficiently concentrating on the ones which work best
for your given data set.
We can now take the following example even one step further. If we use the
ModelMultiplexer we can tune over different model classes at once, just as we
did with the SVM kernels above.
base.learners = list(
makeLearner("classif.ksvm"),
makeLearner("classif.randomForest")
)
lrn = makeModelMultiplexer(base.learners)
The parameter names are automatically prefixed with the corresponding learner IDs,
and the requires element is set, too, to make all parameters subordinate to
selected.learner.
ps = makeModelMultiplexerParamSet(lrn,
makeNumericParam("sigma", lower = -12, upper = 12, trafo =
function(x) 2^x),
makeIntegerParam("ntree", lower = 1L, upper = 500L)
)
print(ps)
#>                                Type len Def                            Constr Req Tunable Trafo
#> selected.learner           discrete   -   - classif.ksvm,classif.randomForest   -    TRUE     -
#> classif.ksvm.sigma          numeric   -   -                         -12 to 12   Y    TRUE     Y
#> classif.randomForest.ntree  integer   -   -                          1 to 500   Y    TRUE     -
head(as.data.frame(trafoOptPath(res$opt.path)))
#>              C        sigma fpr.test.mean fnr.test.mean dob eol
#> 1 6.731935e+01 1.324673e+03     1.0000000     0.0000000   1  NA
#> 2 4.719282e-02 7.660068e-04     1.0000000     0.0000000   2  NA
#> 3 7.004097e+00 1.211249e+01     1.0000000     0.0000000   3  NA
#> 4 1.207932e+00 6.096186e+00     1.0000000     0.0000000   4  NA
[Figure: tuning result with the false negative rate (fnr) on the vertical axis.]
Feature Selection
Often, data sets include a large number of features. The technique of extracting
a subset of relevant features is called feature selection. Feature selection can
enhance the interpretability of the model, speed up the learning process and
improve the learner performance. There exist different approaches to identify
the relevant features. mlr supports filter and wrapper methods.
Filter methods
Filter methods assign an importance value to each feature. Based on these values
the features can be ranked and a feature subset can be selected.
Different methods for calculating the feature importance are built into mlr’s
function generateFilterValuesData (getFilterValues has been deprecated in favor
of generateFilterValuesData.). Currently, classification, regression and survival
analysis tasks are supported. A table showing all available methods can be found
here.
Function generateFilterValuesData requires the Task and a character string
specifying the filter method.
fv = generateFilterValuesData(iris.task, method = "information.gain")
fv
#> FilterValues:
#> Task: iris-example
#> name type information.gain
#> 1 Sepal.Length numeric 0.4521286
#> 2 Sepal.Width numeric 0.2672750
#> 3 Petal.Length numeric 0.9402853
#> 4 Petal.Width numeric 0.9554360
A bar plot of importance values for the individual features can be obtained using
function plotFilterValues.
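The object fv2 plotted below is not constructed in this excerpt; judging from the two-panel figure it presumably holds importance values for two filter methods, roughly along these lines.
### Assumed construction of fv2 (two filter methods at once)
fv2 = generateFilterValuesData(iris.task,
  method = c("information.gain", "chi.squared"))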
plotFilterValues(fv2)
[Figure: plotFilterValues output for iris-example (4 features); bar plots of the importance values per feature for information.gain and chi.squared.]
With mlr’s function filterFeatures you can create a new Task by leaving out
features of lower importance.
There are several ways to select a feature subset based on feature importance
values:
• Keep a certain absolute number (abs) of features with highest importance.
• Keep a certain percentage (perc) of features with highest importance.
• Keep all features whose importance exceeds a certain threshold value
(threshold).
Function filterFeatures supports these three methods as shown in the following
example. Moreover, you can either specify the method for calculating the feature
importance or you can use previously computed importance values via argument
fval.
### Keep the 2 most important features
filtered.task = filterFeatures(iris.task, method = "information.gain", abs = 2)
In a proper experimental setup the selection of features should be automated so
that it can be part of the validation method of your choice. A Learner can be
fused with a filter method by function makeFilterWrapper. The resulting
Learner has the additional class attribute FilterWrapper.
In the following example we calculate the 10-fold cross-validated error rate
(mmce) of the k nearest neighbor classifier with preceding feature selection on
the iris data set. We use "information.gain" as importance measure and select
the 2 features with highest importance. In each resampling iteration feature
selection is carried out on the corresponding training data set before fitting the
learner.
lrn = makeFilterWrapper(learner = "classif.fnn", fw.method = "information.gain",
  fw.abs = 2)
rdesc = makeResampleDesc("CV", iters = 10)
r = resample(learner = lrn, task = iris.task, resampling = rdesc,
  show.info = FALSE, models = TRUE)
r$aggr
#> mmce.test.mean
#> 0.04
You may want to know which features have been used. Luckily, we have
called resample with the argument models = TRUE, which means that r$models
contains a list of models fitted in the individual resampling iterations. In order
to access the selected feature subsets we can call getFilteredFeatures on each
model.
sfeats = sapply(r$models, getFilteredFeatures)
table(sfeats)
#> sfeats
#> Petal.Length Petal.Width
#> 10 10
After tuning we can generate a new wrapped learner with the optimal percentage
value for further use.
lrn = makeFilterWrapper(learner = "regr.lm", fw.method =
"chi.squared", fw.perc = res$x$fw.perc)
mod = train(lrn, bh.task)
mod
#> Model for learner.id=regr.lm.filtered; learner.class=FilterWrapper
#> Trained on: task.id = BostonHousing-example; obs = 506; features = 13
#> Hyperparameters: fw.method=chi.squared,fw.perc=0.5
getFilteredFeatures(mod)
#> [1] "crim" "zn" "rm" "dis" "rad" "lstat"
[Figure: false negative rate (fnr) on the vertical axis.]
Wrapper methods
The following search strategies are available (a usage sketch follows the list):
• Exhaustive search (makeFeatSelControlExhaustive),
• Genetic algorithm (makeFeatSelControlGA),
• Random search (makeFeatSelControlRandom),
• Deterministic forward or backward search (makeFeatSelControlSequential).
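A minimal sketch of how one of these control objects might be combined with selectFeatures; learner, task and all settings here are assumptions chosen to fit the survival results printed below (something like surv.coxph on wpbc.task), not the original code.
### Sketch: sequential forward search (illustrative settings)
ctrl = makeFeatSelControlSequential(method = "sfs", alpha = 0.02)
rdesc = makeResampleDesc("CV", iters = 10)
sfeats = selectFeatures(learner = "surv.coxph", task = wpbc.task,
  resampling = rdesc, control = ctrl, show.info = FALSE)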
sfeats$x
#>  [1] "mean_radius"        "mean_area"          "mean_smoothness"
#>  [4] "mean_concavepoints" "mean_symmetry"      "mean_fractaldim"
#>  [7] "SE_texture"         "SE_perimeter"       "SE_smoothness"
#> [10] "SE_compactness"     "SE_concavity"       "SE_concavepoints"
#> [13] "worst_area"         "worst_compactness"  "worst_concavepoints"
#> [16] "tsize"              "pnodes"
sfeats$y
#> cindex.test.mean
#>         0.713799
Further information about the sequential feature selection process can be obtained
by function analyzeFeatSelResult.
analyzeFeatSelResult(sfeats)
#> Features : 11
#> Performance : mse.test.mean=23.7
#> crim, zn, chas, nox, rm, dis, rad, tax, ptratio, b, lstat
#>
#> Path to optimum:
sfeats = getFeatSelResult(mod)
sfeats
#> FeatSel result:
#> Features (19): mean_radius, mean_texture, mean_perimeter, mean_area,
#>   mean_smoothness, mean_compactness, mean_concavepoints, mean_fractaldim,
#>   SE_compactness, SE_concavity, SE_concavepoints, SE_symmetry, worst_texture,
#>   worst_perimeter, worst_area, worst_concavepoints, worst_symmetry, tsize, pnodes
#> cindex.test.mean=0.631
The selected feature sets in the individual resampling iterations can be extracted
as follows:
lapply(r$models, getFeatSelResult)
#> [[1]]
#> FeatSel result:
#> Features (18): mean_texture, mean_area, mean_smoothness, mean_compactness,
#>   mean_concavity, mean_symmetry, SE_radius, SE_compactness, SE_concavity,
#>   SE_concavepoints, SE_fractaldim, worst_radius, worst_smoothness,
Nested Resampling
In order to obtain honest performance estimates for a learner all parts of the
model building like preprocessing and model selection steps should be included
in the resampling, i.e., repeated for every pair of training/test data. For steps
that themselves require resampling like parameter tuning or feature selection
(via the wrapper approach) this results in two nested resampling loops.
The graphic above illustrates nested resampling for parameter tuning with 3-fold
cross-validation in the outer and 4-fold cross-validation in the inner loop.
In the outer resampling loop, we have three pairs of training/test sets. On each
of these outer training sets parameter tuning is done, thereby executing the inner
resampling loop. This way, we get one set of selected hyperparameters for each
outer training set. Then the learner is fitted on each outer training set using the
corresponding selected hyperparameters and its performance is evaluated on the
outer test sets.
In mlr, you can get nested resampling for free without programming any looping
by using the wrapper functionality. This works as follows:
1. Generate a wrapped Learner via function makeTuneWrapper or makeFeat-
SelWrapper. Specify the inner resampling strategy using their resampling
argument.
2. Call function resample (see also the section about resampling) and pass
the outer resampling strategy to its resampling argument.
You can freely combine different inner and outer resampling strategies.
The outer strategy can be a resample description (ResampleDesc) or a resample
instance (ResampleInstance). A common setup is prediction and performance
evaluation on a fixed outer test set. This can be achieved by using function
makeFixedHoldoutInstance to generate the outer ResampleInstance.
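A minimal sketch of such a fixed outer holdout instance; the index vectors and size below are purely illustrative.
outer = makeFixedHoldoutInstance(train.inds = 1:100, test.inds = 101:150, size = 150)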
Tuning
As you might recall from the tutorial page about tuning, you need to define a
search space by function makeParamSet, a search strategy by makeTuneControl*,
and a method to evaluate hyperparameter settings (i.e., the inner resampling
strategy and a performance measure).
Below is a classification example. We evaluate the performance of a support vector
machine (ksvm) with tuned cost parameter C and RBF kernel parameter sigma.
We use 3-fold cross-validation in the outer and subsampling with 2 iterations in
the inner loop. For tuning a grid search is used to find the hyperparameters with
lowest error rate (mmce is the default measure for classification). The wrapped
Learner is generated by calling makeTuneWrapper.
Note that in practice the parameter set should be larger. A common recommen-
dation is 2^(-12:12) for both C and sigma.
### Tuning in inner resampling loop
ps = makeParamSet(
makeDiscreteParam("C", values = 2^(-2:2)),
makeDiscreteParam("sigma", values = 2^(-2:2))
)
ctrl = makeTuneControlGrid()
inner = makeResampleDesc("Subsample", iters = 2)
lrn = makeTuneWrapper("classif.ksvm", resampling = inner,
par.set = ps, control = ctrl, show.info = FALSE)
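The outer resampling call that produces the object r used below is not part of this excerpt; under the settings above it presumably resembles the following sketch (the task and the extract argument are assumptions consistent with the output shown afterwards).
outer = makeResampleDesc("CV", iters = 3)
r = resample(lrn, iris.task, resampling = outer, extract = getTuneResult,
  show.info = FALSE)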
You can obtain the error rates on the 3 outer test sets by:
r$measures.test
#> iter mmce
#> 1 1 0.02
#> 2 2 0.06
#> 3 3 0.08
names(r$extract[[1]])
#> [1] "learner"   "control"   "x"         "y"         "threshold" "opt.path"
[Figure: inner grid search results for the three outer iterations (panels 1-3), sigma on the vertical axis, coloured by mmce.test.mean.]
Feature selection
As you might recall from the section about feature selection, mlr supports the
filter and the wrapper approach.
Wrapper methods
Wrapper methods use the performance of a learning algorithm to assess the
usefulness of a feature set. In order to select a feature subset a learner is trained
repeatedly on different feature subsets and the subset which leads to the best
learner performance is chosen.
For feature selection in the inner resampling loop, you need to choose a search
strategy (function makeFeatSelControl*), a performance measure and the inner
resampling strategy. Then use function makeFeatSelWrapper to bind everything
together.
Below we use sequential forward selection with linear regression on the Boston-
Housing data set (bh.task).
### Feature selection in inner resampling loop
inner = makeResampleDesc("CV", iters = 3)
r
#> Resample Result
#> Task: BostonHousing-example
#> Learner: regr.lm.featsel
#> Aggr perf: mse.test.mean=31.7
#> Runtime: 39.7649
r$measures.test
#> iter mse
#> 1 1 35.08611
#> 2 2 28.31215
As for tuning, you can extract the optimization paths. The resulting data.frames
contain, among others, binary columns for all features, indicating if they were
included in the linear regression model, and the corresponding performances.
opt.paths = lapply(r$extract, function(x) as.data.frame(x$opt.path))
head(opt.paths[[1]])
#>   crim zn indus chas nox rm age dis rad tax ptratio b lstat mse.test.mean
#> 1    0  0     0    0   0  0   0   0   0   0       0 0     0      80.33019
#> 2    1  0     0    0   0  0   0   0   0   0       0 0     0      65.95316
#> 3    0  1     0    0   0  0   0   0   0   0       0 0     0      69.15417
#> 4    0  0     1    0   0  0   0   0   0   0       0 0     0      55.75473
#> 5    0  0     0    1   0  0   0   0   0   0       0 0     0      80.48765
#> 6    0  0     0    0   1  0   0   0   0   0       0 0     0      63.06724
#>   dob eol error.message exec.time
#> 1   1   2          <NA>     0.017
#> 2   2   2          <NA>     0.027
#> 3   2   2          <NA>     0.027
#> 4   2   2          <NA>     0.027
#> 5   2   2          <NA>     0.031
#> 6   2   2          <NA>     0.026
Benchmark experiments
### Learners
lrns = list(lrn1, lrn2)
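The construction of lrn1 and lrn2 and the benchmark call itself are not included in this excerpt. Judging from the learner ids and the number of outer iterations in the output below, the setup presumably resembles the following sketch; every setting here is an assumption.
### Sketch: two tuned learners and the benchmark call (settings assumed)
ctrl = makeTuneControlGrid()
inner = makeResampleDesc("Holdout")
lrn1 = makeTuneWrapper("classif.ksvm", resampling = inner, control = ctrl,
  par.set = makeParamSet(makeDiscreteParam("C", values = 2^(-2:2)),
    makeDiscreteParam("sigma", values = 2^(-2:2))), show.info = FALSE)
lrn2 = makeTuneWrapper("classif.kknn", resampling = inner, control = ctrl,
  par.set = makeParamSet(makeDiscreteParam("k", values = 2:5)), show.info = FALSE)
lrns = list(lrn1, lrn2)
res = benchmark(lrns, tasks = list(iris.task, sonar.task),
  resamplings = list(makeResampleDesc("Holdout"), makeResampleDesc("Subsample", iters = 2)),
  measures = list(acc, ber), show.info = FALSE)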
The print method for the BenchmarkResult shows the aggregated performances
from the outer resampling loop.
As you might recall, mlr offers several accessor functions to extract information
from the benchmark result. These are listed on the help page of BenchmarkResult
and many examples are shown on the tutorial page about benchmark experiments.
The performance values in individual outer resampling runs can be obtained
by getBMRPerformances. Note that, since we used different outer resampling
strategies for the two tasks, the number of rows per task differs.
getBMRPerformances(res, as.df = TRUE)
#> task.id learner.id iter acc ber
#> 1 iris-example classif.ksvm.tuned 1 0.9400000 0.05882353
#> 2 iris-example classif.kknn.tuned 1 0.9200000 0.08683473
#> 3 Sonar-example classif.ksvm.tuned 1 0.5373134 0.50000000
#> 4 Sonar-example classif.ksvm.tuned 2 0.5205479 0.50000000
#> 5 Sonar-example classif.kknn.tuned 1 0.8208955 0.18234767
#> 6 Sonar-example classif.kknn.tuned 2 0.7945205 0.20864662
The results from the parameter tuning can be obtained through function
getBMRTuneResults.
getBMRTuneResults(res)
#> $`iris-example`
#> $`iris-example`$classif.ksvm.tuned
#> $`iris-example`$classif.ksvm.tuned[[1]]
#> Tune result:
#> Op. pars: C=0.5; sigma=0.5
#> mmce.test.mean=0.0588
#>
#>
#> $`iris-example`$classif.kknn.tuned
#> $`iris-example`$classif.kknn.tuned[[1]]
#> Tune result:
#> Op. pars: k=3
#> mmce.test.mean=0.049
#>
#>
#>
#> $`Sonar-example`
#> $`Sonar-example`$classif.ksvm.tuned
#> $`Sonar-example`$classif.ksvm.tuned[[1]]
#> Tune result:
#> Op. pars: C=1; sigma=2
#> mmce.test.mean=0.343
#>
#> $`Sonar-example`$classif.ksvm.tuned[[2]]
#> Tune result:
#> Op. pars: C=2; sigma=0.5
#> mmce.test.mean= 0.2
#>
#>
#> $`Sonar-example`$classif.kknn.tuned
#> $`Sonar-example`$classif.kknn.tuned[[1]]
#> Tune result:
#> Op. pars: k=4
#> mmce.test.mean=0.11
#>
#> $`Sonar-example`$classif.kknn.tuned[[2]]
#> Tune result:
#> Op. pars: k=3
#> mmce.test.mean=0.0667
It is also possible to extract the tuning results for individual tasks and learners
and, as shown in earlier examples, inspect the optimization path.
tune.res = getBMRTuneResults(res, task.ids = "Sonar-example",
  learner.ids = "classif.ksvm.tuned", as.df = TRUE)
tune.res
#> task.id learner.id iter C sigma mmce.test.mean
#> 1 Sonar-example classif.ksvm.tuned 1 1 2.0 0.3428571
#> 2 Sonar-example classif.ksvm.tuned 2 2 0.5 0.2000000
getNestedTuneResultsOptPathDf(res$results[["Sonar-example"]][["classif.ksvm.tuned"]])
### Learners
lrns = list(makeLearner("regr.rpart"), lrn)
res
#> task.id learner.id mse.test.mean
#> 1 BostonHousing-example regr.rpart 25.86232
#> 2 BostonHousing-example regr.lm.featsel 25.07465
You can access results for individual learners and tasks and inspect them further.
feats = getBMRFeatSelResults(res, learner.id = "regr.lm.featsel")
feats = feats$`BostonHousing-example`$`regr.lm.featsel`
As for tuning, you can extract the optimization paths. The resulting data.frames
contain, among others, binary columns for all features, indicating if they were
included in the linear regression model, and the corresponding performances.
analyzeFeatSelResult gives a clearer overview.
analyzeFeatSelResult(feats[[1]])
#> Features         : 8
#> Performance      : mse.test.mean=26.7
#> crim, zn, chas, nox, rm, dis, ptratio, lstat
#>
#> Path to optimum:
#> - Features: 0  Init   :          Perf = 90.162  Diff: NA  *
#> - Features: 1  Add    : lstat    Perf = 42.646  Diff: 47.515  *
#> - Features: 2  Add    : ptratio  Perf = 34.52   Diff: 8.1263  *
#> - Features: 3  Add    : rm       Perf = 30.454  Diff: 4.066  *
#> - Features: 4  Add    : dis      Perf = 29.405  Diff: 1.0495  *
#> - Features: 5  Add    : nox      Perf = 28.059  Diff: 1.3454  *
#> - Features: 6  Add    : chas     Perf = 27.334  Diff: 0.72499  *
#> - Features: 7  Add    : zn       Perf = 26.901  Diff: 0.43296  *
### Learners
lrns = list(makeLearner("regr.rpart"), lrn)
res
#> task.id learner.id mse.test.mean
#> 1 BostonHousing-example regr.rpart 22.11687
#> 2 BostonHousing-example regr.lm.filtered.tuned 23.76666
Cost-Sensitive Classification
In regular classification the aim is to minimize the misclassification rate and thus
all types of misclassification errors are deemed equally severe. A more general
setting is cost-sensitive classification where the costs caused by different kinds of
errors are not assumed to be equal and the objective is to minimize the expected
costs.
In case of class-dependent costs the costs depend on the true and predicted class
label. The costs c(k, l) for predicting class k if the true label is l are usually
organized into a K × K cost matrix where K is the number of classes. Naturally,
it is assumed that the cost of predicting the correct class label y is minimal (that
is c(y, y) ≤ c(k, y) for all k = 1, . . . , K).
A further generalization of this scenario are example-dependent misclassification
costs where each example (x, y) is coupled with an individual cost vector of length
K. Its k-th component expresses the cost of assigning x to class k. A real-world
example is fraud detection where the costs do not only depend on the true and
predicted status fraud/non-fraud, but also on the amount of money involved in
each case. Naturally, the cost of predicting the true class label y is assumed to be
minimum. The true class labels are redundant information, as they can be easily
inferred from the cost vectors. Moreover, given the cost vector, the expected
costs do not depend on the true class label y. The classification problem is
therefore completely defined by the feature values x and the corresponding cost
vectors.
In the following we show ways to handle cost-sensitive classification problems
in mlr. Some of the functionality is currently experimental, and there may be
changes in the future.
166
Cost-Sensitive Classification ADVANCED
true/pred.          +1           −1
     +1       c(+1, +1)    c(−1, +1)
     −1       c(+1, −1)    c(−1, −1)
Often, the diagonal entries are zero or the cost matrix is rescaled to achieve
zeros in the diagonal (see for example O’Brien et al, 2008).
A well-known cost-sensitive classification problem is posed by the German Credit
data set (see also the UCI Machine Learning Repository). The correspond-
ing cost matrix (though Elkan (2001) argues that this matrix is economically
unreasonable) is given as:
true/pred.    Bad   Good
    Bad         0      5
    Good        1      0
As in the table above, the rows indicate true and the columns predicted classes.
In case of class-dependent costs it is sufficient to generate an ordinary ClassifTask.
A CostSensTask is only needed if the costs are example-dependent. In the R
code below we create the ClassifTask, remove two constant features from the
data set and generate the cost matrix. Per default, Bad is the positive class.
data(GermanCredit, package = "caret")
credit.task = makeClassifTask(data = GermanCredit, target = "Class")
credit.task = removeConstantFeatures(credit.task)
#> Removing 2 columns: Purpose.Vacation,Personal.Female.Single
credit.task
#> Supervised task: GermanCredit
1. Thresholding
We start by fitting a logistic regression model to the German credit data set and
predict posterior probabilities.
### Train and predict posterior probabilities
lrn = makeLearner("classif.multinom", predict.type = "prob",
trace = FALSE)
mod = train(lrn, credit.task)
pred = predict(mod, task = credit.task)
pred
#> Prediction: 1000 observations
#> predict.type: prob
#> threshold: Bad=0.50,Good=0.50
#> time: 0.01
#> id truth prob.Bad prob.Good response
#> 1 1 Good 0.03525092 0.9647491 Good
#> 2 2 Bad 0.63222363 0.3677764 Bad
#> 3 3 Good 0.02807414 0.9719259 Good
#> 4 4 Good 0.25182703 0.7481730 Good
#> 5 5 Bad 0.75193275 0.2480673 Bad
#> 6 6 Good 0.26230149 0.7376985 Good
#> ... (1000 rows, 5 cols)
The default thresholds for both classes are 0.5. But according to the cost matrix
we should predict class Good only if we are very sure that Good is indeed the
correct label. Therefore we should increase the threshold for class Good and
decrease the threshold for class Bad.
i. Theoretical thresholding
The theoretical threshold for the positive class can be calculated from the cost
matrix as
\[ t^* = \frac{c(+1, -1) - c(-1, -1)}{c(+1, -1) - c(+1, +1) + c(-1, +1) - c(-1, -1)} . \]
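Neither the cost matrix object costs nor the threshold th are created in the code shown here; a sketch consistent with the values used below (positive class Bad; rows = true, columns = predicted classes):
costs = matrix(c(0, 5,
                 1, 0), nrow = 2, byrow = TRUE)
colnames(costs) = rownames(costs) = c("Bad", "Good")
### theoretical threshold for the positive class Bad: 1/6, i.e. roughly 0.17
th = (costs["Good", "Bad"] - costs["Good", "Good"]) /
  (costs["Good", "Bad"] - costs["Bad", "Bad"] + costs["Bad", "Good"] - costs["Good", "Good"])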
As you may recall you can change thresholds in mlr either before training by
using the predict.threshold option of makeLearner or after prediction by
calling setThreshold on the Prediction object.
As we already have a prediction we use the setThreshold function. It returns an
altered Prediction object with class predictions for the theoretical threshold.
### Predict class labels according to the theoretical threshold
pred.th = setThreshold(pred, th)
pred.th
#> Prediction: 1000 observations
#> predict.type: prob
#> threshold: Bad=0.17,Good=0.83
#> time: 0.01
#> id truth prob.Bad prob.Good response
#> 1 1 Good 0.03525092 0.9647491 Good
#> 2 2 Bad 0.63222363 0.3677764 Bad
#> 3 3 Good 0.02807414 0.9719259 Good
#> 4 4 Good 0.25182703 0.7481730 Bad
#> 5 5 Bad 0.75193275 0.2480673 Bad
#> 6 6 Good 0.26230149 0.7376985 Bad
#> ... (1000 rows, 5 cols)
In order to calculate the average costs over the entire data set we first need
to create a new performance Measure. This can be done through function
makeCostMeasure. It is expected that the rows of the cost matrix indicate true
and the columns predicted class labels.
credit.costs = makeCostMeasure(id = "credit.costs", name = "Credit costs",
  costs = costs, best = 0, worst = 5)
credit.costs
#> Name: Credit costs
#> Performance measure: credit.costs
#> Properties:
classif,classif.multi,req.pred,req.truth,predtype.response,predtype.prob
#> Minimize: TRUE
#> Best: 0; Worst: 5
#> Aggregated by: test.mean
#> Note:
These performance values may be overly optimistic as we used the same data set
for training and prediction, and resampling strategies should be preferred. In the
R code below we make use of the predict.threshold argument of makeLearner
to set the threshold before doing a 3-fold cross-validation on the credit.task.
Note that we create a ResampleInstance (rin) that is used throughout the next
several code chunks to get comparable performance values.
### Cross-validated performance with theoretical thresholds
rin = makeResampleInstance("CV", iters = 3, task = credit.task)
lrn = makeLearner("classif.multinom", predict.type = "prob",
predict.threshold = th, trace = FALSE)
r = resample(lrn, credit.task, resampling = rin, measures =
list(credit.costs, mmce), show.info = FALSE)
r
#> Resample Result
#> Task: GermanCredit
#> Learner: classif.multinom
[Figure: performance versus threshold, one panel per measure.]
tuneThreshold returns the optimal threshold value for the positive class and the
corresponding performance. As expected the tuned threshold is smaller than
the theoretical threshold.
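The call producing these results is not shown in this excerpt; presumably it is something along these lines:
tune.res = tuneThreshold(pred = r$pred, measure = credit.costs)
tune.res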
2. Rebalancing
In order to minimize the average costs, observations from the less costly class
should be given higher importance during training. This can be achieved by
weighting the classes, provided that the learner under consideration has a ‘class
weights’ or an ‘observation weights’ argument. To find out which learning
methods support either type of weights have a look at the list of integrated
learners in the Appendix or use listLearners.
### Learners that accept observation weights
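### (the actual calls are not shown here; a minimal sketch)
listLearners("classif", properties = "weights")
### Learners that can deal with class weights
listLearners("classif", properties = "class.weights")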
In order to minimize the average costs, the proportion of observations in the
positive class has to be multiplied by
\[ \frac{1 - t}{t} \cdot \frac{t_0}{1 - t_0} , \]
where t is the target threshold and t_0 the original threshold. Alternatively, the
proportion of observations in the negative class can be multiplied by the inverse.
A proof is given by Elkan (2001).
In most cases, the original threshold is t_0 = 0.5 and thus the second factor
vanishes. If additionally the target threshold t equals the theoretical threshold
t^*, the proportion of observations in the positive class has to be multiplied by
(1 - t^*)/t^*.
For the credit example the theoretical threshold corresponds to a weight of 5 for
the positive class.
### Weight for positive class corresponding to theoretical threshold
w = (1 - th)/th
w
#> [1] 5
A unified and convenient way to assign class weights to a Learner (and tune
them) is provided by function makeWeightedClassesWrapper. The class weights
are specified using argument wcw.weight. For learners that support observation
weights a suitable weight vector is then generated internally during training or
resampling. If the learner can deal with class weights, the weights are basically
passed on to the appropriate learner parameter. The advantage of using the
wrapper in this case is the unified way to specify the class weights.
Below is an example using learner "classif.multinom" (multinom from package
nnet) which accepts observation weights. For binary classification problems it is
sufficient to specify the weight w for the positive class. The negative class then
automatically receives weight 1.
### Weighted learner
lrn = makeLearner("classif.multinom", trace = FALSE)
lrn = makeWeightedClassesWrapper(lrn, wcw.weight = w)
lrn
#> Learner weightedclasses.classif.multinom from package nnet
#> Type: classif
#> Name: ; Short name:
#> Class: WeightedClassesWrapper
#> Properties: twoclass,multiclass,numerics,factors,prob
#> Predict-Type: response
#> Hyperparameters: trace=FALSE,wcw.weight=5
Just like the theoretical threshold, the theoretical weights may not always be
suitable, therefore you can tune the weight for the positive class as shown in the
following example. Calculating the theoretical weight beforehand may help to
narrow down the search interval.
lrn = makeLearner("classif.multinom", trace = FALSE)
lrn = makeWeightedClassesWrapper(lrn)
ps = makeParamSet(makeDiscreteParam("wcw.weight", seq(4, 12, 0.5)))
ctrl = makeTuneControlGrid()
tune.res = tuneParams(lrn, credit.task, resampling = rin, par.set = ps,
  measures = list(credit.costs, mmce), control = ctrl, show.info = FALSE)
tune.res
#> Tune result:
#> Op. pars: wcw.weight=7.5
#> credit.costs.test.mean=0.501,mmce.test.mean=0.381
as.data.frame(tune.res$opt.path)[1:3]
Note that in the above example the learner was trained on the oversampled task
credit.task.over. In order to get the training performance on the original
task predictions were calculated for credit.task.
We usually prefer resampled performance values, but simply calling resample
on the oversampled task does not work since predictions have to be based on
the original task. The solution is to create a wrapped Learner via function
makeOversampleWrapper. Internally, oversample is called before training, but
predictions are done on the original data.
lrn = makeLearner("classif.multinom", trace = FALSE)
Of course, we can also tune the oversampling rate. For this purpose we again
have to create an OversampleWrapper. Optimal values for parameter osw.rate
can be obtained using function tuneParams.
lrn = makeLearner("classif.multinom", trace = FALSE)
lrn = makeOversampleWrapper(lrn, osw.cl = "Bad")
ps = makeParamSet(makeDiscreteParam("osw.rate", seq(3, 7, 0.25)))
ctrl = makeTuneControlGrid()
tune.res = tuneParams(lrn, credit.task, rin, par.set = ps,
measures = list(credit.costs, mmce),
control = ctrl, show.info = FALSE)
tune.res
#> Tune result:
#> Op. pars: osw.rate=6.25
#> credit.costs.test.mean=0.507,mmce.test.mean=0.355
Multi-class problems
We consider the waveform data set from package mlbench and add an artificial
cost matrix:
true/pred.    1    2    3
     1        0   30   80
     2        5    0    4
     3       10    8    0
We start by creating the Task, the cost matrix and the corresponding performance
measure.
### Task
df = mlbench::mlbench.waveform(500)
wf.task = makeClassifTask(id = "waveform", data = as.data.frame(df),
  target = "classes")
1. Thresholding
Given a vector of positive threshold values as long as the number of classes
K, the predicted probabilities for all classes are adjusted by dividing them by
the corresponding threshold value. Then the class with the highest adjusted
probability is predicted. This way, as in the binary case, classes with a low
threshold are preferred to classes with a larger threshold.
Again this can be done by function setThreshold as shown in the following
example (or alternatively by the predict.threshold option of makeLearner).
Note that the threshold vector needs to have names that correspond to the class
labels.
lrn = makeLearner("classif.rpart", predict.type = "prob")
rin = makeResampleInstance("CV", iters = 3, task = wf.task)
r = resample(lrn, wf.task, rin, measures = list(wf.costs, mmce),
  show.info = FALSE)
r
#> Resample Result
#> Task: waveform
#> Learner: classif.rpart
#> Aggr perf: wf.costs.test.mean=7.02,mmce.test.mean=0.262
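The construction of the threshold vector th and the re-thresholded prediction are not part of this excerpt; based on the description below they presumably look roughly like this:
### Thresholds proportional to the average costs of the true classes (55, 4.5, 9)
th = rowSums(costs) / 2
names(th) = getTaskClassLevels(wf.task)
pred.th = setThreshold(r$pred, threshold = th)
performance(pred.th, measures = list(wf.costs, mmce))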
The threshold vector th in the above example is chosen according to the average
costs of the true classes 55, 4.5 and 9. More exactly, th corresponds to an
artificial cost matrix of the structure mentioned above with off-diagonal elements
c(2, 1) = c(3, 1) = 55, c(1, 2) = c(3, 2) = 4.5 and c(1, 3) = c(2, 3) = 9. This
threshold vector may not be optimal but leads to smaller total costs on the data
set than the default.
ii. Empirical thresholding
As in the binary case it is possible to tune the threshold vector using function
tuneThreshold. Since the scaling of the threshold vector does not change the
predicted class labels tuneThreshold returns threshold values that lie in [0,1]
and sum to unity.
tune.res = tuneThreshold(pred = r$pred, measure = wf.costs)
tune.res
#> $th
#> 1 2 3
#> 0.01447413 0.35804444 0.62748143
#>
#> $perf
#> [1] 4.544369
2. Rebalancing
i. Weighting
In the multi-class case you have to pass a vector of weights as long as the number
of classes K to function makeWeightedClassesWrapper. The weight vector can
be tuned using function tuneParams.
lrn = makeLearner("classif.multinom", trace = FALSE)
lrn = makeWeightedClassesWrapper(lrn)
ps = makeParamSet(makeNumericVectorParam("wcw.weight", len = 3,
  lower = 0, upper = 1))
ctrl = makeTuneControlRandom()
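### (the tuning call itself is not shown in this excerpt; a sketch)
tune.res = tuneParams(lrn, wf.task, resampling = rin, par.set = ps,
  measures = list(wf.costs, mmce), control = ctrl, show.info = FALSE)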
getLearnerModel(mod)
#> [[1]]
#> Model for learner.id=classif.multinom; learner.class=classif.multinom
#> Trained on: task.id = feats; obs = 150; features = 4
#> Hyperparameters: trace=FALSE
#>
#> [[2]]
#> Model for learner.id=classif.multinom; learner.class=classif.multinom
#> Trained on: task.id = feats; obs = 150; features = 4
#> Hyperparameters: trace=FALSE
#>
#> [[3]]
#> Model for learner.id=classif.multinom; learner.class=classif.multinom
#> Trained on: task.id = feats; obs = 150; features = 4
#> Hyperparameters: trace=FALSE
Imbalanced Classification Problems
Sampling-based approaches
The basic idea of sampling methods is to simply adjust the proportion of the
classes in order to increase the weight of the minority class observations within
the model.
The sampling-based approaches can be divided further into three different cate-
gories:
1. Undersampling methods: Elimination of randomly chosen cases of the
majority class to decrease their effect on the classifier. All cases of the
minority class are kept.
2. Oversampling methods: Generation of additional cases (copies, artificial
observations) of the minority class to increase their effect on the classifier.
All cases of the majority class are kept.
3. Hybrid methods: Mixture of under- and oversampling strategies.
All these methods directly access the underlying data and “rearrange” it. In
this way the sampling is done as part of the preprocessing and can therefore be
combined with every appropriate classifier.
mlr currently supports the first two approaches.
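The code that creates task, task.over and task.under is not included here; given the class counts printed below, it presumably resembles this sketch (the rates are assumptions, and task is the imbalanced classification task of the example):
task.over = oversample(task, rate = 8)      ### minority class "A": 100 -> 800
task.under = undersample(task, rate = 1/8)  ### majority class "B": 5000 -> 625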
table(getTaskTargets(task))
#>
#> A B
#> 100 5000
table(getTaskTargets(task.over))
#>
#> A B
#> 800 5000
table(getTaskTargets(task.under))
#>
#> A B
#> 100 625
Please note that the undersampling rate has to be between 0 and 1, where 1
means no undersampling and 0.5 implies a reduction of the majority class size to
50 percent. Correspondingly, the oversampling rate must be greater than or equal
to 1, where 1 means no oversampling and 2 would result in doubling the minority
class size.
As a result the performance should improve if the model is applied to new data.
lrn = makeLearner("classif.rpart", predict.type = "prob")
mod = train(lrn, task)
mod.over = train(lrn, task.over)
mod.under = train(lrn, task.under)
data.imbal.test = rbind(
data.frame(x = rnorm(10, mean = 1), class = "A"),
data.frame(x = rnorm(500, mean = 2), class = "B")
)
In this case the performance measure has to be considered very carefully. Since
the misclassification rate (mmce) only evaluates the overall accuracy of the
predictions, the balanced error rate (ber) and the area under the ROC curve (auc)
might be more suitable here, as they take the misclassifications within each class
into account separately.
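For instance, the three models above could be compared on the new data with these measures (a sketch; the exact calls are not part of this excerpt):
perf = performance(predict(mod, newdata = data.imbal.test),
  measures = list(mmce, ber, auc))
perf.over = performance(predict(mod.over, newdata = data.imbal.test),
  measures = list(mmce, ber, auc))
perf.under = performance(predict(mod.under, newdata = data.imbal.test),
  measures = list(mmce, ber, auc))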
Extensions to oversampling
Two extensions to (simple) oversampling are available in mlr.
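The first extension is SMOTE (synthetic minority oversampling); the call creating task.smote below is not included here, but presumably resembles the following sketch (rate and nn are assumptions consistent with the class counts shown):
task.smote = smote(task, rate = 8, nn = 5)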
table(getTaskTargets(task.smote))
#>
#> A B
#> 800 5000
2. Overbagging
Another extension of oversampling consists in combining sampling with the
bagging approach. For each iteration of the bagging process, minority class
observations are oversampled with the rate given in obw.rate. The majority class
cases can either all be taken into account for each iteration (obw.maxcl = "all")
or bootstrapped with replacement to increase variability between training data
sets during iterations (obw.maxcl = "boot").
The construction of the overbagging wrapper works similarly to makeBaggingWrapper.
First an existing mlr learner has to be passed to makeOverBaggingWrapper.
The number of iterations or fitted models can be set via obw.iters.
lrn = makeLearner("classif.rpart", predict.type = "response")
obw.lrn = makeOverBaggingWrapper(lrn, obw.rate = 8, obw.iters = 3)
While overbagging slightly improves the performance of the decision tree, the auc
decreases in the second example when additional overbagging is applied. As the
random forest itself is already a strong learner (and a bagged one as well), a
further bagging step isn't very helpful here and usually won't improve the model.
Cost-based approaches
For binary classification, the single number passed to the classifier corresponds
to the weight of the positive / majority class, while the negative / minority class
receives a weight of 1. So actually, no real costs are used within this approach,
but the cost ratio is taken into account.
If the underlying learner already has a parameter for class weighting (e.g.,
class.weights in "classif.ksvm"), the wcw.weight is basically passed to the
specific class weighting parameter.
lrn = makeLearner("classif.ksvm")
wcw.lrn = makeWeightedClassesWrapper(lrn, wcw.weight = 0.01)
ROC Analysis and Performance Curves
For binary scoring classifiers a threshold (or cutoff) value controls how predicted
posterior probabilities are converted into class labels. ROC curves and other
performance plots serve to visualize and analyse the relationship between one or
two performance measures and the threshold.
This page is mainly devoted to receiver operating characteristic (ROC) curves that
plot the true positive rate (sensitivity) on the vertical axis against the false positive
rate (1 - specificity, fall-out) on the horizontal axis for all possible threshold
values. Creating other performance plots like lift charts or precision/recall graphs
works analogously and is shown briefly.
In addition to performance visualization ROC curves are helpful in
• determining an optimal decision threshold for given class prior probabili-
ties and misclassification costs (for alternatives see also the pages about
cost-sensitive classification and imbalanced classification problems in this
tutorial),
• identifying regions where one classifier outperforms another and building
suitable multi-classifier systems,
• obtaining calibrated estimates of the posterior probabilities.
For more information see the tutorials and introductory papers by Fawcett
(2004), Fawcett (2006) as well as Flach (ICML 2004).
In many applications as, e.g., diagnostic tests or spam detection, there is un-
certainty about the class priors or the misclassification costs at the time of
prediction, for example because it’s hard to quantify the costs or because costs
and class priors vary over time. Under these circumstances the classifier is
expected to work well for a whole range of decision thresholds and the area under
the ROC curve (AUC) provides a scalar performance measure for comparing and
selecting classifiers. mlr provides the AUC for binary classification (auc based on
package ROCR) and also several generalizations of the AUC to the multi-class
case (e.g., multiclass.au1p, multiclass.au1u based on Ferri et al. (2009)).
mlr offers three ways to plot ROC and other performance curves.
1. Function plotROCCurves can, based on the output of generateThreshVsPerfData,
plot performance curves for any pair of performance measures available in mlr.
2. mlr offers an interface to package ROCR through function asROCRPrediction.
3. mlr's function plotViperCharts provides an interface to ViperCharts.
With mlr version 2.8 functions generateROCRCurvesData, plotROCRCurves, and
plotROCRCurvesGGVIS were deprecated.
Below are some examples that demonstrate the three possible ways. Note that
you can only use learners that are capable of predicting probabilities. Have a
look at the learner table in the Appendix or run listLearners("classif",
properties = c("twoclass", "prob")) to get a list of all learners that support this.
Since we want to plot ROC curves we calculate the false and true positive rates
(fpr and tpr). Additionally, we also compute error rates (mmce).
df = generateThreshVsPerfData(pred1, measures = list(fpr, tpr, mmce))
Per default, plotROCCurves plots the performance values of the first two mea-
sures passed to generateThreshVsPerfData. The first is shown on the x-axis,
the second on the y-axis. Moreover, a diagonal line that represents the perfor-
mance of a random classifier is added. You can remove the diagonal by setting
diagonal = FALSE.
plotROCCurves(df)
The corresponding area under curve (auc) can be calculated as usual by calling
performance.
performance(pred1, auc)
#> auc
#> 0.847973
[Figure: fpr, tpr and mmce plotted against the threshold.]
In order to compare the performance of the two learners you might want to
display the two corresponding ROC curves in one plot. For this purpose just
pass a named list of Predictions to generateThreshVsPerfData.
df = generateThreshVsPerfData(list(lda = pred1, ksvm = pred2),
  measures = list(fpr, tpr))
plotROCCurves(df)
[Figure: ROC curves (true positive rate versus false positive rate), one panel per learner (ksvm, lda).]
It’s clear from the plot above that ksvm has a slightly higher AUC than lda.
performance(pred2, auc)
#> auc
#> 0.9214527
Based on the $data member of df you can easily generate custom plots. Below
the curves for the two learners are superposed.
qplot(x = fpr, y = tpr, color = learner, data = df$data, geom = "path")
[Figures: ROC curves (true positive rate versus false positive rate) and performance-versus-threshold panels for classif.ksvm.tuned and classif.lda.]
An alternative to averaging is to just merge the 5 test folds and draw a single
ROC curve. Merging can be achieved by manually changing the class attribute
of the prediction objects from ResamplePrediction to Prediction.
Below, the predictions are extracted from the BenchmarkResult via function
getBMRPredictions, the class is changed and the ROC curves are created.
Averaging methods are normally preferred (cp. Fawcett, 2006), as they permit to
assess the variability, which is needed to properly compare classifier performance.
### Extract predictions
preds = getBMRPredictions(bmr)[[1]]
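### Sketch: change the class attribute to merge the test folds, then plot
preds2 = lapply(preds, function(x) { class(x) = "Prediction"; x })
df = generateThreshVsPerfData(preds2, measures = list(fpr, tpr))
plotROCCurves(df)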
[Figure: ROC curves for classif.ksvm.tuned and classif.lda based on the merged test folds.]
Again, you can easily create other standard evaluation plots by passing the
appropriate performance measures to generateThreshVsPerfData and plotROC-
Curves.
Drawing performance plots with package ROCR works through three basic
commands:
1. ROCR::prediction: Create a ROCR prediction object.
2. ROCR::performance: Calculate one or two performance measures for the
prediction object.
3. ROCR::plot: Generate the plot.
Below is the same ROC curve, but we make use of some more graphical param-
eters: The ROC curve is color-coded by the threshold and selected threshold
values are printed on the curve. Additionally, the convex hull (black broken line)
of the ROC curve is drawn.
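The object ROCRperf1 is not created in this excerpt; it is presumably obtained along these lines:
ROCRpred1 = asROCRPrediction(pred1)
ROCRperf1 = ROCR::performance(ROCRpred1, "tpr", "fpr")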
### Draw ROC curve
ROCR::plot(ROCRperf1, colorize = TRUE, print.cutoffs.at = seq(0.1, 0.9, 0.1),
  lwd = 2)
We draw the vertically averaged ROC curves (solid lines) as well as the ROC
curves for the individual resampling iterations (broken lines). Moreover, standard
error bars are plotted for selected true positive rates (0.1, 0.2, . . . , 0.9). See
ROCR’s plot function for details.
### lda average ROC curve
plot(ROCRperfs[[1]], col = "blue", avg = "vertical", spread.estimate = "stderror",
  show.spread.at = seq(0.1, 0.8, 0.1), plotCI.col = "blue", plotCI.lwd = 2, lwd = 2)
### lda individual ROC curves
plot(ROCRperfs[[1]], col = "blue", lty = 2, lwd = 0.25, add = TRUE)
In order to create other evaluation plots like precision/recall graphs you just
have to change the performance measures when calling ROCR::performance.
(Note that you have to use the measures provided by ROCR listed here and not
mlr’s performance measures.)
### Extract and convert predictions
preds = getBMRPredictions(bmr)[[1]]
ROCRpreds = lapply(preds, asROCRPrediction)
If you want to plot a performance measure versus the threshold, specify only one
measure when calling ROCR::performance. Below the average accuracy over the
5 cross-validation iterations is plotted against the threshold. Moreover, boxplots
for certain threshold values (0.1, 0.2, . . . , 0.9) are drawn.
### Extract and convert predictions
preds = getBMRPredictions(bmr)[[1]]
ROCRpreds = lapply(preds, asROCRPrediction)
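### (performance computation and plot are not shown; a sketch following the
###  description above)
ROCRperf = lapply(ROCRpreds, function(x) ROCR::performance(x, "acc"))
ROCR::plot(ROCRperf[[1]], avg = "vertical", spread.estimate = "boxplot",
  show.spread.at = seq(0.1, 0.9, 0.1), lwd = 2)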
Viper charts
mlr also supports ViperCharts for plotting ROC and other performance curves.
Like generateThreshVsPerfData it has S3 methods for objects of class Predic-
tion, ResampleResult and BenchmarkResult. Below plots for the benchmark
experiment (Example 2) are generated.
z = plotViperCharts(bmr, chart = "rocc", browse = FALSE)
You can see the plot created this way here. Note that besides ROC curves you get
several other plots like lift charts or cost curves. For details, see plotViperCharts.
Multilabel Classification
Creating a task
The first thing you have to do for multilabel classification in mlr is to get
your data in the right format. You need a data.frame which consists of the
features and a logical vector for each label which indicates if the label is present
in the observation or not. After that you can create a MultilabelTask like a
normal ClassifTask. Instead of one target name you have to specify a vector
of targets which correspond to the names of logical variables in the data.frame.
In the following example we get the yeast data frame from the already existing
yeast.task, extract the 14 label names and create the task again.
yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)
yeast.task
#> Supervised task: multi
#> Type: multilabel
#> Target:
label1,label2,label3,label4,label5,label6,label7,label8,label9,label10,label11,label12,l
#> Observations: 2417
#> Features:
#> numerics factors ordered
#> 103 0 0
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Classes: 14
#>  label1  label2  label3  label4  label5  label6  label7  label8  label9
#>     762    1038     983     862     722     597     428     480     178
#> label10 label11 label12 label13 label14
#>     253     289    1816    1799      34
Constructing a learner
#> Properties:
numerics,factors,ordered,missings,weights,prob,twoclass,multiclass
#> Predict-Type: prob
#> Hyperparameters: xval=0
lrn.br2 = makeMultilabelBinaryRelevanceWrapper("classif.rpart")
lrn.br2
#> Learner multilabel.classif.rpart from package rpart
#> Type: multilabel
#> Name: ; Short name:
#> Class: MultilabelBinaryRelevanceWrapper
#> Properties:
numerics,factors,ordered,missings,weights,prob,twoclass,multiclass
#> Predict-Type: response
#> Hyperparameters: xval=0
Binary relevance
This problem transformation method converts the multilabel problem into a
binary classification problem for each label and applies a simple binary classifier
to these. In mlr this can be done by converting your binary learner to a wrapped
binary relevance multilabel learner.
Classifier chains
Models for the labels are trained consecutively on the input data, which in each
step is augmented by the labels that have already been trained (using the real
observed values). Therefore an order of the labels has to be specified. At
prediction time the labels are predicted in the same order as during training,
and the label values required in the input data are taken from the predictions
already made for the respective labels.
Nested stacking
Same as classifier chains, but the labels in the input data are not the real ones,
but estimations of the labels obtained by the already trained learners.
Stacking
Same as the dependent binary relevance method, but in the training phase the
labels used as input for each label are obtained by the binary relevance method.
Train
You can train a model as usual with a multilabel learner and a multilabel task as
input. You can also pass subset and weights arguments if the learner supports
this.
mod = train(lrn.br, yeast.task)
mod = train(lrn.br, yeast.task, subset = 1:1500, weights = rep(1/1500, 1500))
mod
#> Model for learner.id=multilabel.classif.rpart; learner.class=MultilabelBinaryRelevanceWrapper
#> Trained on: task.id = multi; obs = 1500; features = 103
#> Hyperparameters: xval=0
Predict
Prediction can be done as usual in mlr with predict and by passing a trained
model and either the task to the task argument or some new data to the newdata
argument. As always you can specify a subset of the data which should be
predicted.
pred = predict(mod, task = yeast.task, subset = 1:10)
pred = predict(mod, newdata = yeast[1501:1600,])
names(as.data.frame(pred))
#> [1] "truth.label1" "truth.label2" "truth.label3"
#> [4] "truth.label4" "truth.label5" "truth.label6"
#> [7] "truth.label7" "truth.label8" "truth.label9"
#> [10] "truth.label10" "truth.label11" "truth.label12"
#> [13] "truth.label13" "truth.label14" "prob.label1"
#> [16] "prob.label2" "prob.label3" "prob.label4"
#> [19] "prob.label5" "prob.label6" "prob.label7"
#> [22] "prob.label8" "prob.label9" "prob.label10"
Depending on the chosen predict.type of the learner you get true and predicted
values and possibly probabilities for each class label. These can be extracted
by the usual accessor functions getPredictionTruth, getPredictionResponse and
getPredictionProbabilities.
Performance
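The body of this subsection is largely missing from the excerpt. Presumably the prediction is evaluated with function performance and one or more of the multilabel measures listed below, for example:
performance(pred, measures = list(multilabel.hamloss, multilabel.subset01, multilabel.f1))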
listMeasures("multilabel")
#> [1] "multilabel.f1" "multilabel.subset01"
"multilabel.tpr"
#> [4] "multilabel.ppv" "multilabel.acc" "timeboth"
#> [7] "timepredict" "multilabel.hamloss" "featperc"
#> [10] "timetrain"
Resampling
For evaluating the overall performance of the learning algorithm you can do
some resampling. As usual you have to define a resampling strategy, either
via makeResampleDesc or makeResampleInstance. After that you can run the
resample function. Below the default measure Hamming loss is calculated.
rdesc = makeResampleDesc(method = "CV", stratify = FALSE, iters
= 3)
r = resample(learner = lrn.br, task = yeast.task, resampling =
rdesc, show.info = FALSE)
r
#> Resample Result
#> Task: multi
#> Learner: multilabel.classif.rpart
#> Aggr perf: multilabel.hamloss.test.mean=0.225
#> Runtime: 4.2915
Binary performance
If you want to calculate a binary performance measure like, e.g., the accuracy,
the mmce or the auc for each label, you can use function
getMultilabelBinaryPerformances. You can apply this function to any multilabel
prediction, e.g., also to the resampled multilabel prediction. For calculating the
auc you need predicted probabilities.
getMultilabelBinaryPerformances(pred, measures = list(acc, mmce, auc))
Learning Curve Analysis
The learning curve is an appropriate visual tool for analysing how the performance
of a learner improves as the number of observations in the training set increases.
The experiment is conducted with increasing subsample sizes and the performance
is measured. In the plot the x-axis represents the relative subsample size, whereas
the y-axis represents the performance.
Note that this function internally uses benchmark in combination with make-
DownsampleWrapper, so for every run new observations are drawn. Thus the
results are noisy. To reduce noise increase the number of resampling iterations.
You can define the resampling method in the resampling argument of generate-
LearningCurveData. It is also possible to pass a ResampleInstance (which is a
result of makeResampleInstance) to make resampling consistent for all passed
learners and each step of increasing the number of observations.
The mlr function generateLearningCurveData can generate the data for learning
curves for multiple learners and multiple performance measures at once. With
plotLearningCurve the result of generateLearningCurveData can be plotted
using ggplot2. plotLearningCurve has an argument facet which can be either
“measure” or “learner”. By default facet = "measure" and facetted subplots
are created for each measure input to generateLearningCurveData. If facet =
"measure" learners are mapped to color, and vice versa.
r = generateLearningCurveData(
learners = list("classif.rpart", "classif.knn"),
task = sonar.task,
percs = seq(0.1, 1, by = 0.2),
measures = list(tp, fp, tn, fn),
resampling = makeResampleDesc(method = "CV", iters = 5),
show.info = FALSE)
plotLearningCurve(r)
[Figure: learning curves for the measures tp, fp, tn and fn (one panel per measure), coloured by learner (classif.knn, classif.rpart).]
[Figure: learning curves (performance versus percentage of observations) for classif.randomForest, ksvm1 and ksvm2.]
We can display performance on the train set as well as the test set:
rin2 = makeResampleDesc(method = "CV", iters = 5, predict = "both")
lc2 = generateLearningCurveData(learners = lrns, task = sonar.task,
  percs = seq(0.1, 1, by = 0.1),
  measures = list(acc, setAggregation(acc, train.mean)),
  resampling = rin2, show.info = FALSE)
plotLearningCurve(lc2, facet = "learner")
[Figure: learning curves facetted by learner, showing Accuracy: Test mean and Accuracy: Training mean versus the percentage of observations.]
Exploring Learner Predictions
\[ \hat{f}_{X_s} = \frac{1}{N} \sum_{i=1}^{N} \hat{f}(X_s, x_{ic}) . \]
For regression and survival tasks the partial derivative of a single feature X_s is
the gradient of the partial dependence function, and for classification tasks where
the learner can output class probabilities it is the Jacobian. Note that if the
learner produces discontinuous partial dependence (e.g., piecewise constant
functions such as decision trees, ensembles of decision trees, etc.) the derivative
will be 0 (where the function is not changing) or trending towards positive or
negative infinity (at the discontinuities where the derivative is undefined).
Plotting the partial dependence function of such learners may give the impression
that the function is not discontinuous because the prediction grid is not composed
of all discontinuous points in the predictor space. This results in linear
interpolation between the grid points, which makes the function appear to be
piecewise linear (where the derivative would be defined except at the boundaries
of each piece).
The partial derivative can be informative regarding the additivity of the learned
function in certain features. If \hat{f}^{(i)}_{X_s} is an additive function in a
feature X_s, then its partial derivative will not depend on any other features
(X_c) that may have been used by the learner. Variation in the estimated partial
derivative indicates that there is a region of interaction between X_s and X_c in
\hat{f}. Similarly, instead of using the mean to estimate the expected value of
the function at different values of X_s, computing the variance can highlight
regions of interaction between X_s and X_c.
See Goldstein, Kapelner, Bleich, and Pitkin (2014) for more details and their
package ICEbox for the original implementation. The algorithm works for any
supervised learner with classification, regression, and survival tasks.
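The objects fit.classif and pd.lst used below are not created in this excerpt; a sketch of what they presumably look like (the learner choice and settings are assumptions):
lrn.classif = makeLearner("classif.ksvm", predict.type = "prob")
fit.classif = train(lrn.classif, iris.task)
pd.lst = generatePartialDependenceData(fit.classif, iris.task,
  features = c("Petal.Width", "Petal.Length"), interaction = FALSE)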
tail(pd.lst$data)
#> Class Probability Petal.Width Petal.Length
#> 55 virginica 0.2006336 NA 3.622222
#> 56 virginica 0.3114545 NA 4.277778
#> 57 virginica 0.4404613 NA 4.933333
#> 58 virginica 0.6005358 NA 5.588889
#> 59 virginica 0.7099841 NA 6.244444
#> 60 virginica 0.7242584 NA 6.900000
are necessarily tighter because the feature accounts for more of the variance
of the predictions, i.e., it is “used” more by the learner. More directly setting
fun = var identifies regions of interaction between Xs and Xc .
lrn.regr = makeLearner("regr.ksvm")
fit.regr = train(lrn.regr, bh.task)
pd.regr = generatePartialDependenceData(fit.regr, bh.task, "lstat", fun = median)
pd.regr
#> PartialDependenceData
#> Task: BostonHousing-example
#> Features: lstat
#> Target: medv
#> Derivative: FALSE
#> Interaction: FALSE
#> Individual: FALSE
#> medv lstat
#> 1 24.69031 1.730000
#> 2 23.72479 5.756667
#> 3 22.34841 9.783333
#> 4 20.78817 13.810000
#> 5 19.76183 17.836667
#> 6 19.33115 21.863333
#> ... (10 rows, 2 cols)
pd.classif = generatePartialDependenceData(fit.classif, iris.task, "Petal.Length",
  fun = median)
pd.classif
#> PartialDependenceData
#> Task: iris-example
#> Features: Petal.Length
#> Target: setosa, versicolor, virginica
#> Derivative: FALSE
#> Interaction: FALSE
#> Individual: FALSE
#> Class Probability Petal.Length
#> 1 setosa 0.31008788 1.000000
#> 2 setosa 0.24271454 1.655556
#> 3 setosa 0.17126036 2.311111
#> 4 setosa 0.09380787 2.966667
#> 5 setosa 0.04579912 3.622222
#> 6 setosa 0.02455344 4.277778
#> ... (30 rows, 3 cols)
tail(pd.se$data)
#> medv lstat crim lower upper
#> 15 21.65846 NA 39.54849 19.50827 23.80866
#> 16 21.64409 NA 49.43403 19.49704 23.79114
#> 17 21.63038 NA 59.31957 19.48054 23.78023
#> 18 21.61514 NA 69.20512 19.46092 23.76936
#> 19 21.61969 NA 79.09066 19.46819 23.77119
#> 20 21.61987 NA 88.97620 19.46843 23.77130
As previously mentioned, if the aggregation function is not used, i.e., it is the
identity, then the conditional expectation of \hat{f}^{(i)}_{X_s} is estimated.
If individual = TRUE then generatePartialDependenceData returns n partial
dependence estimates made at each point in the prediction grid constructed from
the features.
pd.ind.regr = generatePartialDependenceData(fit.regr, bh.task,
"lstat", individual = TRUE)
pd.ind.regr
#> PartialDependenceData
#> Task: BostonHousing-example
#> Features: lstat
#> Target: medv
#> Derivative: FALSE
#> Interaction: FALSE
#> Individual: TRUE
#> Predictions centered: FALSE
#> medv lstat idx
#> 1 25.66995 1.730000 1
#> 2 24.71747 5.756667 1
#> 3 23.64157 9.783333 1
#> 4 22.70812 13.810000 1
#> 5 22.00059 17.836667 1
#> 6 21.46195 21.863333 1
#> ... (5060 rows, 3 cols)
The resulting output, particularly the element data in the returned object, has
an additional column idx which gives the index of the observation to which the
row pertains.
For classification tasks this index references both the class and the observation
index.
pd.ind.classif = generatePartialDependenceData(fit.classif,
iris.task, "Petal.Length", individual = TRUE)
pd.ind.classif
#> PartialDependenceData
#> Task: iris-example
#> Features: Petal.Length
#> Target: setosa, versicolor, virginica
#> Derivative: FALSE
#> Interaction: FALSE
#> Individual: TRUE
#> Predictions centered: FALSE
#> Class Probability Petal.Length idx
#> 1 setosa 0.9814053 1 1.setosa
#> 2 setosa 0.9747355 1 2.setosa
#> 3 setosa 0.9815516 1 3.setosa
#> 4 setosa 0.9795761 1 4.setosa
pd.regr.der.ind = generatePartialDependenceData(fit.regr,
bh.task, "lstat", derivative = TRUE,
individual = TRUE)
head(pd.regr.der.ind$data)
#> medv lstat idx
#> 1 -0.1931323 1.730000 1
#> 2 -0.2656911 5.756667 1
#> 3 -0.2571006 9.783333 1
#> 4 -0.2033080 13.810000 1
#> 5 -0.1511472 17.836667 1
#> 6 -0.1193129 21.863333 1
pd.classif.der = generatePartialDependenceData(fit.classif,
iris.task, "Petal.Width", derivative = TRUE)
head(pd.classif.der$data)
pd.classif.der.ind = generatePartialDependenceData(fit.classif,
iris.task, "Petal.Width", derivative = TRUE,
individual = TRUE)
head(pd.classif.der.ind$data)
#> Class Probability Petal.Width idx
#> 1 setosa 0.02479474 0.1 1.setosa
#> 2 setosa 0.01710561 0.1 2.setosa
#> 3 setosa 0.01646252 0.1 3.setosa
#> 4 setosa 0.01530718 0.1 4.setosa
#> 5 setosa 0.02608577 0.1 5.setosa
#> 6 setosa 0.03925531 0.1 6.setosa
Functional ANOVA
$$\hat{g}_u(X) = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{f}(X) - \sum_{v \subset u} g_v(X) \right)$$
lrn.regr = makeLearner("regr.ksvm")
fit.regr = train(lrn.regr, bh.task)
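As a minimal sketch, first-order (main) effects can be requested with depth = 1; the object name fa and the choice of features here are illustrative:
### Sketch: first-order (main) effects of lstat and crim on medv
fa = generateFunctionalANOVAData(fit.regr, bh.task, c("lstat", "crim"), depth = 1)
plotPartialDependence(fa)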
Second order (bivariate) effects, requested below with depth = 2, are estimated by computing the univariate partial dependence for each feature and subtracting it from the bivariate partial dependence for each possible pair.
fa.bv = generateFunctionalANOVAData(fit.regr, bh.task, c("crim",
"lstat", "age"),
depth = 2)
fa.bv
#> FunctionalANOVAData
#> Task: BostonHousing-example
#> Features: crim, lstat, age
#> Target: medv
#>
#>
#> effect medv crim lstat age
#> 1 crim:lstat -22.68734 0.006320 1.73 NA
#> 2 crim:lstat -23.22114 9.891862 1.73 NA
#> 3 crim:lstat -24.77479 19.777404 1.73 NA
#> 4 crim:lstat -26.41395 29.662947 1.73 NA
#> 5 crim:lstat -27.56524 39.548489 1.73 NA
#> 6 crim:lstat -28.27952 49.434031 1.73 NA
#> ... (300 rows, 5 cols)
[Figure: partial dependence (median) of medv on lstat, cf. pd.regr above.]
With a classification task, a line is drawn for each class, which gives the estimated
partial probability of that class for a particular point in the feature grid.
plotPartialDependence(pd.classif)
[Figure: plotPartialDependence(pd.classif) — partial class probabilities against Petal.Length for setosa, versicolor and virginica.]
[Figure: partial dependence of medv on lstat.]
The same goes for plots of partial dependences where the learner has
predict.type = "se".
plotPartialDependence(pd.se)
[Figure: plotPartialDependence(pd.se) — partial dependence of medv on lstat and crim with standard-error bounds.]
[Figure: partial class probabilities against Petal.Width and Petal.Length, one panel per feature.]
[Figure: class probability curves against Petal.Width for classes setosa, versicolor and virginica (two panels).]
[Figure: individual conditional expectation curves of medv against lstat.]
When the individual curves are centered by subtracting the individual conditional expectations estimated at a particular value of $X_s$, this results in a fixed intercept, which aids in visualizing variation in predictions made by $\hat{f}^{(i)}_{X_s}$.
plotPartialDependence(pd.ind.classif)
[Figure: plotPartialDependence(pd.ind.classif) — individual curves of the class probabilities (Species, centered) against Petal.Length.]
[Figure: partial dependence of the derivative of medv with respect to lstat.]
This suggests that $\hat{f}$ is not additive in lstat except in the neighborhood of 25.
plotPartialDependence(pd.regr.der.ind)
[Figure: plotPartialDependence(pd.regr.der.ind) — individual derivative curves of medv with respect to lstat.]
This suggests that Petal.Width interacts with some other feature in the neigh-
borhood of (1.5, 2) for classes “virginica” and “versicolor”.
plotPartialDependence(pd.classif.der.ind)
[Figure: plotPartialDependence(pd.classif.der.ind) — individual derivatives of the class probabilities (Species) with respect to Petal.Width.]
[Figure: estimated main effects of crim and lstat on medv, one panel per feature.]
[Figure: bivariate crim:lstat effect plotted over crim and lstat, shaded by medv.]
Classifier Calibration
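Calibration plots like the ones below are typically produced with generateCalibrationData and plotCalibration; a minimal sketch (the choice of sonar.task and classif.rpart is an assumption suggested by the class label M in the first plot) is:
### Sketch: bin predicted probabilities and compare them with the observed
### class proportions per bin
lrn = makeLearner("classif.rpart", predict.type = "prob")
mod = train(lrn, sonar.task)
pred = predict(mod, task = sonar.task)
cal = generateCalibrationData(pred)
plotCalibration(cal)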
[Figure: calibration plot showing the proportion of class M within each of three predicted probability bins.]
[Figure: calibration plot for class M with ten equal-width probability bins.]
[Figure: calibration plots for the classes setosa, versicolor and virginica, faceted by learner (nnet, randomForest).]
Evaluating Hyperparameter Tuning
After tuning, you may want to evaluate the tuning process in order to answer
questions such as:
• How does varying the value of a hyperparameter change the performance
of the machine learning algorithm?
• What’s the relative importance of each hyperparameter?
• How did the optimization algorithm (prematurely) converge?
mlr provides methods to generate and plot the data in order to evaluate the
effect of hyperparameter tuning.
mlr separates the generation of the data from the plotting of the data in case
the user wishes to use the data in a custom way downstream.
The generateHyperParsEffectData method takes the tuning result along with 2
additional arguments: trafo and include.diagnostics. The trafo argument
will convert the hyperparameter data to be on the transformed scale in case a
transformation was used when creating the parameter (as in the case below).
The include.diagnostics argument will tell mlr whether to include the eol
and any error messages from the learner.
Below we perform random search on the C parameter for SVM on the famous
Pima Indians dataset. We generate the hyperparameter effect data so that the C
parameter is on the transformed scale and we do not include diagnostic data:
ps = makeParamSet(
makeNumericParam("C", lower = -5, upper = 5, trafo =
function(x) 2^x)
)
ctrl = makeTuneControlRandom(maxit = 100L)
rdesc = makeResampleDesc("CV", iters = 2L)
res = tuneParams("classif.ksvm", task = pid.task, control = ctrl,
measures = list(acc, mmce), resampling = rdesc, par.set = ps,
show.info = FALSE)
generateHyperParsEffectData(res, trafo = TRUE, include.diagnostics = FALSE)
#> HyperParsEffectData:
#> Hyperparameters: C
#> Measures: acc.test.mean,mmce.test.mean
#> Optimizer: TuneControlRandom
#> Nested CV Used: FALSE
#> Snapshot of data:
#> C acc.test.mean mmce.test.mean iteration exec.time
#> 1 0.3770897 0.7695312 0.2304688 1 0.055
#> 2 3.4829323 0.7526042 0.2473958 2 0.053
In the example below, we perform grid search on the C parameter for SVM
on the Pima Indians dataset using nested cross validation. We generate the
hyperparameter effect data so that the C parameter is on the untransformed
scale and we do not include diagnostic data. As you can see below, nested cross
validation is supported without any extra work by the user, who thus obtains an unbiased estimate of the performance.
ps = makeParamSet(
makeNumericParam("C", lower = -5, upper = 5, trafo =
function(x) 2^x)
)
ctrl = makeTuneControlGrid()
rdesc = makeResampleDesc("CV", iters = 2L)
lrn = makeTuneWrapper("classif.ksvm", control = ctrl,
measures = list(acc, mmce), resampling = rdesc, par.set = ps,
show.info = FALSE)
res = resample(lrn, task = pid.task, resampling = cv2, extract =
getTuneResult, show.info = FALSE)
generateHyperParsEffectData(res)
#> HyperParsEffectData:
#> Hyperparameters: C
#> Measures: acc.test.mean,mmce.test.mean
#> Optimizer: TuneControlGrid
#> Nested CV Used: TRUE
#> Snapshot of data:
#> C acc.test.mean mmce.test.mean iteration exec.time
#> 1 -5.0000000 0.6640625 0.3359375 1 0.041
#> 2 -3.8888889 0.6640625 0.3359375 2 0.039
#> 3 -2.7777778 0.6822917 0.3177083 3 0.040
#> 4 -1.6666667 0.7473958 0.2526042 4 0.040
#> 5 -0.5555556 0.7708333 0.2291667 5 0.041
#> 6 0.5555556 0.7682292 0.2317708 6 0.041
#> nested_cv_run
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 1
#> 6 1
After generating the hyperparameter effect data, the next step is to visualize it. mlr has several methods built in to visualize the data, meant to support the needs of the researcher and the engineer in industry. The next few sections will walk through the visualization support for several use cases.
In a situation when the user is tuning a single hyperparameter for a learner, the
user may wish to plot the performance of the learner against the values of the
hyperparameter.
In the example below, we tune the number of clusters against the silhouette score on the mtcars dataset. We specify the x-axis with the x argument and the y-axis with the y argument. If the plot.type argument is not specified, mlr will attempt to plot a scatterplot by default. Since plotHyperParsEffect returns a ggplot object, we can easily customize it to our liking!
ps = makeParamSet(
makeDiscreteParam("centers", values = 3:10)
)
ctrl = makeTuneControlGrid()
rdesc = makeResampleDesc("Holdout")
res = tuneParams("cluster.kmeans", task = mtcars.task, control =
ctrl,
measures = silhouette, resampling = rdesc, par.set = ps,
show.info = FALSE)
#>
#> This is package 'modeest' written by P. PONCET.
#> For a complete list of functions, use 'library(help =
"modeest")' or 'help.start()'.
data = generateHyperParsEffectData(res)
plt = plotHyperParsEffect(data, x = "centers", y =
"silhouette.test.mean")
### add our own touches to the plot
plt + geom_point(colour = "red") +
ggtitle("Evaluating Number of Cluster Centers on mtcars") +
scale_x_continuous(breaks = 3:10) +
theme_bw()
[Figure: Rousseeuw's silhouette internal cluster quality index plotted against the number of cluster centers (3 to 10).]
In the example below, we tune SVM with the C hyperparameter on the Pima dataset. We will use a simulated annealing optimizer, so we are interested in seeing whether the optimization algorithm actually improves with iterations. By default, mlr only plots improvements to the global optimum.
ps = makeParamSet(
makeNumericParam("C", lower = -5, upper = 5, trafo =
function(x) 2^x)
)
ctrl = makeTuneControlGenSA(budget = 100L)
rdesc = makeResampleDesc("Holdout")
res = tuneParams("classif.ksvm", task = pid.task, control = ctrl,
resampling = rdesc, par.set = ps, show.info = FALSE)
data = generateHyperParsEffectData(res)
plt = plotHyperParsEffect(data, x = "iteration", y =
"mmce.test.mean",
plot.type = "line")
[Figure: mean misclassification error against iteration of the simulated annealing optimizer.]
In the case of a learner crash, mlr will graphically impute the crash with the worst value and indicate the point. In the example below, we give the C parameter negative values, which will result in a learner crash for SVM.
ps = makeParamSet(
makeDiscreteParam("C", values = c(-1, -0.5, 0.5, 1, 1.5))
)
ctrl = makeTuneControlGrid()
rdesc = makeResampleDesc("CV", iters = 2L)
res = tuneParams("classif.ksvm", task = pid.task, control = ctrl,
measures = list(acc, mmce), resampling = rdesc, par.set = ps,
show.info = FALSE)
data = generateHyperParsEffectData(res)
plt = plotHyperParsEffect(data, x = "C", y = "acc.test.mean")
[Figure: accuracy against C, with crashed runs marked by learner_status (Failure/Success).]
The example below uses nested cross validation with an outer loop of 2 runs.
mlr indicates each run within the visualization.
ps = makeParamSet(
makeNumericParam("C", lower = -5, upper = 5, trafo =
function(x) 2^x)
)
ctrl = makeTuneControlGrid()
rdesc = makeResampleDesc("Holdout")
lrn = makeTuneWrapper("classif.ksvm", control = ctrl,
measures = list(acc, mmce), resampling = rdesc, par.set = ps,
show.info = FALSE)
res = resample(lrn, task = pid.task, resampling = cv2, extract =
getTuneResult, show.info = FALSE)
data = generateHyperParsEffectData(res)
plotHyperParsEffect(data, x = "C", y = "acc.test.mean",
plot.type = "line")
[Figure: accuracy against C as a line plot, one line per nested_cv_run (1 and 2).]
[Figure: heatmap of accuracy over C and sigma.]
[Figure: heatmap of accuracy over C and sigma, with learner_status overlay.]
We can also visualize how long the optimizer takes to reach an optimum for the same example:
plotHyperParsEffect(data, x = "iteration", y = "acc.test.mean",
plot.type = "line")
[Figure: accuracy against iteration as a line plot.]
In the case where we are tuning 2 hyperparameters and we have a learner crash,
mlr will indicate the respective points and impute them with the worst value.
In the example below, we tune C and sigma, forcing C to be negative for some
instances which will crash SVM. We perform interpolation to get a regular grid in
order to plot a heatmap. We can see that the interpolation creates axis-parallel lines resulting from the learner crashes.
ps = makeParamSet(
makeDiscreteParam("C", values = c(-1, 0.5, 1.5, 1, 0.2, 0.3,
0.4, 5)),
makeDiscreteParam("sigma", values = c(-1, 0.5, 1.5, 1, 0.2,
0.3, 0.4, 5)))
ctrl = makeTuneControlGrid()
rdesc = makeResampleDesc("Holdout")
learn = makeLearner("classif.ksvm", par.vals = list(kernel =
"rbfdot"))
res = tuneParams(learn, task = pid.task, control = ctrl,
244
Evaluating Hyperparameter Tuning ADVANCED
measures = acc,
resampling = rdesc, par.set = ps, show.info = FALSE)
data = generateHyperParsEffectData(res)
plotHyperParsEffect(data, x = "C", y = "sigma", z =
"acc.test.mean",
plot.type = "heatmap", interpolate = "regr.earth")
[Figure: interpolated heatmap of accuracy over C and sigma, with crashed runs marked by learner_status.]
[Figure: interpolated heatmap of accuracy over C and sigma (all runs successful).]
Extend
In order to integrate a learning algorithm into mlr some interface code has to be
written. Three functions are mandatory for each learner.
• First, define a new learner class with a name, description, capabilities,
parameters, and a few other things. (An object of this class can then be
generated by makeLearner.)
• Second, you need to provide a function that calls the learner function and
builds the model given data (which makes it possible to invoke training by
calling mlr’s train function).
• Finally, a prediction function that returns predicted values given new data is required.
class(makeLearner(cl = "regr.lm"))
#> [1] "regr.lm" "RLearnerRegr" "RLearner" "Learner"
class(makeLearner(cl = "surv.coxph"))
#> [1] "surv.coxph" "RLearnerSurv" "RLearner" "Learner"
class(makeLearner(cl = "cluster.kmeans"))
#> [1] "cluster.kmeans" "RLearnerCluster" "RLearner"
"Learner"
class(makeLearner(cl = "multilabel.rFerns"))
#> [1] "multilabel.rFerns" "RLearnerMultilabel" "RLearner"
#> [4] "Learner"
The first element of each class attribute vector is the name of the learner class
passed to the cl argument of makeLearner. Obviously, this adheres to the
naming conventions
• "classif.<R_method_name>" for classification,
• "multilabel.<R_method_name>" for multilabel classification,
• "regr.<R_method_name>" for regression,
• "surv.<R_method_name>" for survival analysis, and
• "cluster.<R_method_name>" for clustering.
Additionally, there exist intermediate classes that reflect the type of learning
problem, i.e., all classification learners inherit from RLearnerClassif, all regression
learners from RLearnerRegr and so on. Their superclasses are RLearner and
finally Learner. For all these (sub)classes there exist constructor functions makeRLearner, makeRLearnerClassif, makeRLearnerRegr etc. that are called internally by makeLearner.
A short side remark: As you might have noticed there does not exist a special
learner class for cost-sensitive classification (costsens) with example-specific costs.
This type of learning task is currently exclusively handled through wrappers like
makeCostSensWeightedPairsWrapper.
In the following we show how to integrate learners for the five types of learning
tasks mentioned above. Defining a completely new type of learner that has
special properties and does not fit into one of the existing schemes is of course
possible, but much more advanced and not covered here.
We use a classification example to explain some general principles (so even if
you are interested in integrating a learner for another type of learning task you
might want to read the following section). Examples for other types of learning
tasks are shown later on.
Classification
We show how the Linear Discriminant Analysis from package MASS has been
integrated into the classification learner classif.lda in mlr as an example.
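The first of the three steps is the definition of the learner class itself. The full definition in mlr covers all of MASS::lda's parameters; a condensed sketch of what such a makeRLearner.classif.lda definition looks like (the parameter set is abbreviated and the listed properties are assumptions) is:
makeRLearner.classif.lda = function() {
  makeRLearnerClassif(
    cl = "classif.lda",
    package = "MASS",
    par.set = makeParamSet(
      ### only one of MASS::lda's parameters is shown here for brevity
      makeDiscreteLearnerParam(id = "method", default = "moment",
        values = c("moment", "mle", "mve", "t"))
    ),
    properties = c("twoclass", "multiclass", "numerics", "factors", "prob"),
    name = "Linear Discriminant Analysis",
    short.name = "lda",
    note = ""
  )
}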
The train function, trainLearner, must fit a model on the data of the task .task with regard to the subset defined in the integer vector .subset and the parameters passed in the ... arguments. Usually, the data should be extracted from the task using getTaskData. This will take care of any subsetting as well. It must return the fitted model. mlr assumes no special data type for the return value – it will be passed to the predict function we are going to define below, so any special code the learner may need can be encapsulated there.
For our example, the definition of the function looks like this. In addition to the
data of the task, we also need the formula that describes what to predict. We
use the function getTaskFormula to extract this from the task.
trainLearner.classif.lda = function(.learner, .task, .subset,
.weights = NULL, ...) {
f = getTaskFormula(.task)
MASS::lda(f, data = getTaskData(.task, .subset), ...)
}
The prediction function, predictLearner, must predict for the new observations in the data.frame .newdata with the wrapped model .model, which is returned from the training function. The actual model the learner built is stored in the $learner.model member and can be accessed simply through .model$learner.model.
For classification, you have to return a factor of predicted classes if
.learner$predict.type is "response", or a matrix of predicted probabilities
if .learner$predict.type is "prob" and this type of prediction is supported
by the learner. In the latter case the matrix must have the same number of
columns as there are classes in the task and the columns have to be named by
the class names.
The definition for LDA looks like this. It is pretty much just a straight pass-
through of the arguments to the predict function and some extraction of predic-
tion data depending on the type of prediction requested.
predictLearner.classif.lda = function(.learner, .model,
.newdata, predict.method = "plug-in", ...) {
p = predict(.model$learner.model, newdata = .newdata, method =
predict.method, ...)
if (.learner$predict.type == "response")
return(p$class) else return(p$posterior)
}
Regression
The main difference for regression is that the type of prediction is different (numeric values instead of labels or probabilities) and that not all of the properties are relevant. In particular, whether one-, two-, or multi-class problems and posterior probabilities are supported is not applicable. Apart from this, everything explained above applies. Below is the definition for the earth learner from the earth package.
makeRLearner.regr.earth = function() {
  makeRLearnerRegr(
    cl = "regr.earth",
    package = "earth",
    par.set = makeParamSet(
      makeLogicalLearnerParam(id = "keepxy", default = FALSE, tunable = FALSE),
      makeNumericLearnerParam(id = "trace", default = 0, upper = 10, tunable = FALSE),
      makeIntegerLearnerParam(id = "degree", default = 1L, lower = 1L),
      makeNumericLearnerParam(id = "penalty"),
      makeIntegerLearnerParam(id = "nk", lower = 0L),
      makeNumericLearnerParam(id = "thres", default = 0.001),
      makeIntegerLearnerParam(id = "minspan", default = 0L),
      makeIntegerLearnerParam(id = "endspan", default = 0L),
      makeNumericLearnerParam(id = "newvar.penalty", default = 0),
      makeIntegerLearnerParam(id = "fast.k", default = 20L, lower = 0L),
      makeNumericLearnerParam(id = "fast.beta", default = 1),
      makeDiscreteLearnerParam(id = "pmethod", default = "backward",
        values = c("backward", "none", "exhaustive", "forward", "seqrep", "cv")),
      makeIntegerLearnerParam(id = "nprune")
    ),
    properties = c("numerics", "factors"),
    name = "Multivariate Adaptive Regression Splines",
    short.name = "earth",
    note = ""
  )
}
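The accompanying train and predict functions are not shown above; a sketch that follows the same pattern as the LDA example (the [, 1L] index, which drops earth's one-column prediction matrix to a vector, is an assumption) would be:
trainLearner.regr.earth = function(.learner, .task, .subset, .weights = NULL, ...) {
  f = getTaskFormula(.task)
  earth::earth(f, data = getTaskData(.task, .subset), ...)
}
predictLearner.regr.earth = function(.learner, .model, .newdata, ...) {
  ### predict.earth returns a one-column matrix for a single response
  predict(.model$learner.model, newdata = .newdata)[, 1L]
}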
Again most of the data is passed straight through to/from the train/predict
functions of the learner.
Survival analysis
Clustering
For clustering, you have to return a numeric vector with the IDs of the clusters
that the respective datum has been assigned to. The numbering should start at
1.
Below is the definition for the FarthestFirst learner from the RWeka package.
Weka starts the IDs of the clusters at 0, so we add 1 to the predicted clusters.
RWeka has a different way of setting learner parameters; we use the special
Weka_control function to do this.
makeRLearner.cluster.FarthestFirst = function() {
  makeRLearnerCluster(
    cl = "cluster.FarthestFirst",
    package = "RWeka",
    par.set = makeParamSet(
      makeIntegerLearnerParam(id = "N", default = 2L, lower = 1L),
      makeIntegerLearnerParam(id = "S", default = 1L, lower = 1L),
      makeLogicalLearnerParam(id = "output-debug-info", default = FALSE, tunable = FALSE)
    ),
    properties = c("numerics"),
    name = "FarthestFirst Clustering Algorithm",
    short.name = "farthestfirst"
  )
}
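The matching train function is not shown above; a sketch along the lines just described, collecting the parameters with Weka_control, could look like this:
trainLearner.cluster.FarthestFirst = function(.learner, .task, .subset, .weights = NULL, ...) {
  ### RWeka learners take their parameters via a Weka_control object
  ctrl = RWeka::Weka_control(...)
  RWeka::FarthestFirst(getTaskData(.task, .subset), control = ctrl)
}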
predictLearner.cluster.FarthestFirst = function(.learner,
.model, .newdata, ...) {
as.integer(predict(.model$learner.model, .newdata, ...)) + 1L
}
Multilabel classification
Some learners, for example decision trees and random forests, can calculate
feature importance values, which can be extracted from a fitted model using
function getFeatureImportance.
To make this work, if your newly integrated learner supports this you need to
• add "featimp" to the learner properties and
• implement a new S3 method for function getFeatureImportanceLearner (which is later called internally by getFeatureImportance).
This method takes the Learner .learner, the WrappedModel .model and
potential further arguments and calculates or extracts the feature importance.
It must return a named vector of importance values.
Below are two simple examples. In case of "classif.rpart" the feature impor-
tance values can be easily extracted from the fitted model.
getFeatureImportanceLearner.classif.rpart = function(.learner,
.model, ...) {
mod = getLearnerModel(.model)
mod$variable.importance
}
For the random forest from package randomForestSRC function vimp is called.
getFeatureImportanceLearner.classif.randomForestSRC =
function(.learner, .model, ...) {
mod = getLearnerModel(.model)
randomForestSRC::vimp(mod, ...)$importance[, "all"]
}
If your interface code to a new learning algorithm exists only locally, i.e., it is
not (yet) merged into mlr or does not live in an extra package with a proper
namespace you might want to register the new S3 methods to make sure that
these are found by, e.g., listLearners. You can do this as follows:
registerS3method("makeRLearner", "<awesome_new_learner_class>",
makeRLearner.<awesome_new_learner_class>)
registerS3method("trainLearner", "<awesome_new_learner_class>",
trainLearner.<awesome_new_learner_class>)
registerS3method("predictLearner",
"<awesome_new_learner_class>",
predictLearner.<awesome_new_learner_class>)
### And if you also have written a method to extract the feature importance
registerS3method("getFeatureImportanceLearner",
"<awesome_new_learner_class>",
getFeatureImportanceLearner.<awesome_new_learner_class>)
Integrating Another Measure
You might want to evaluate predictions with a performance measure which is not listed in the Appendix, or with a measure that uses a misclassification cost matrix.
Performance measures in mlr are objects of class Measure. For example the mse
(mean squared error) looks as follows.
str(mse)
#> List of 10
#> $ id : chr "mse"
#> $ minimize : logi TRUE
#> $ properties: chr [1:3] "regr" "req.pred" "req.truth"
#> $ fun :function (task, model, pred, feats, extra.args)
#> $ extra.args: list()
#> $ best : num 0
#> $ worst : num Inf
#> $ name : chr "Mean of squared errors"
#> $ note : chr ""
#> $ aggr :List of 4
#> ..$ id : chr "test.mean"
#> ..$ name : chr "Test mean"
#> ..$ fun :function (task, perf.test, perf.train,
measure, group, pred)
#> ..$ properties: chr "req.test"
#> ..- attr(*, "class")= chr "Aggregation"
#> - attr(*, "class")= chr "Measure"
mse$fun
#> function (task, model, pred, feats, extra.args)
#> {
#> measureMSE(pred$data$truth, pred$data$response)
#> }
#> <bytecode: 0xbfd1b48>
#> <environment: namespace:mlr>
measureMSE
#> function (truth, response)
#> {
#> mean((response - truth)^2)
#> }
#> <bytecode: 0xbc6a8f8>
#> <environment: namespace:mlr>
See the Measure documentation page for a detailed description of the object
slots.
At the core is slot $fun which contains the function that calculates the per-
formance value. The actual work is done by function measureMSE. Similar
functions, generally adhering to the naming scheme measure followed by the
capitalized measure ID, exist for most performance measures. See the measures
help page for a complete list.
Just like Task and Learner objects, each Measure has an identifier $id which is, for example, used to annotate results and plots. For plots there is also the option to use the longer measure $name instead. See the tutorial page on Visualization for more information.
Moreover, a Measure includes a number of $properties that indicate for which
types of learning problems it is suitable and what information is required to calcu-
late it. Obviously, most measures need the Prediction object ("req.pred") and,
for supervised problems, the true values of the target variable(s) ("req.truth").
For tuning or feature selection each Measure knows its extreme values $best
and $worst and if it wants to be minimized or maximized ($minimize).
For resampling, slot $aggr specifies how the overall performance across all resampling iterations is calculated. Typically, this is just a matter of aggregating the performance values obtained on the test sets perf.test or the training sets perf.train by a simple function. By far the most common scheme is test.mean, i.e., the unweighted mean of the performances on the test sets.
str(test.mean)
#> List of 4
#> $ id : chr "test.mean"
#> $ name : chr "Test mean"
#> $ fun :function (task, perf.test, perf.train, measure,
group, pred)
#> $ properties: chr "req.test"
#> - attr(*, "class")= chr "Aggregation"
test.mean$fun
#> function (task, perf.test, perf.train, measure, group, pred)
#> mean(perf.test)
#> <bytecode: 0xa8f7860>
#> <environment: namespace:mlr>
All aggregation schemes are objects of class Aggregation with the function in slot
$fun doing the actual work. The $properties member indicates if predictions
(or performance values) on the training or test data sets are required to calculate
the aggregation.
You can change the aggregation scheme of a Measure via function setAggregation.
See the tutorial page on resampling for some examples and the aggregations help
page for all available aggregation schemes.
You can construct your own Measure and Aggregation objects via the functions makeMeasure, makeCostMeasure, makeCustomResampledMeasure and makeAggregation. Some examples are shown in the following.
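As a first, minimal sketch, a plain Measure can be built with makeMeasure, mirroring the slots of mse shown above (my.mae is an illustrative name):
### Sketch: mean absolute error as a custom regression measure
my.mae = makeMeasure(
  id = "my.mae", name = "Mean absolute error",
  properties = c("regr", "req.pred", "req.truth"),
  minimize = TRUE, best = 0, worst = Inf,
  fun = function(task, model, pred, feats, extra.args) {
    mean(abs(pred$data$response - pred$data$truth))
  }
)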
For in-depth explanations and details see the tutorial page on cost-sensitive classification.
To create a measure that involves ordinary, i.e., class-dependent misclassification
costs you can use function makeCostMeasure. You first need to define the cost
matrix. The rows indicate true and the columns predicted classes and the rows
and columns have to be named by the class labels. The cost matrix can then be
wrapped in a Measure object and predictions can be evaluated as usual with the
performance function.
See the R documentation of function makeCostMeasure for details on the various
parameters.
### Create the cost matrix
costs = matrix(c(0, 2, 2, 3, 0, 2, 1, 1, 0), ncol = 3)
rownames(costs) = colnames(costs) = getTaskClassLevels(iris.task)
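A sketch of the wrapping step might look as follows; the argument names follow the makeCostMeasure documentation (combine aggregates the per-observation costs) and should be treated as assumptions:
### Sketch: wrap the cost matrix in a Measure and evaluate a prediction with it
my.costs = makeCostMeasure(id = "my.costs", name = "Example misclassification costs",
  costs = costs, combine = mean, minimize = TRUE, best = 0, worst = 3)
mod = train("classif.lda", iris.task)
pred = predict(mod, task = iris.task)
performance(pred, measures = my.costs)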
The aggregation function must have a certain signature detailed in the documen-
tation of makeAggregation. Usually, you will only need the performance values
on the test sets perf.test or the training sets perf.train. In rare cases, e.g.,
the Prediction object pred or information stored in the Task object might be
required to obtain the aggregated performance. For an example have a look at
the definition of function test.join.
perf.train and perf.test are both numerical vectors containing the performances on the train and test data sets. In most cases (unless you are using bootstrap as the resampling strategy or have set predict = "both" in makeResampleDesc) the perf.train vector is empty.
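A sketch of a custom aggregation matching the structure of test.mean shown above (my.range.aggr is an illustrative name; the same range aggregation shows up as mmce.test.range in the output further below) could be:
### Sketch: aggregate resampling results by the range of the test-set performances
my.range.aggr = makeAggregation(id = "test.range", name = "Test range",
  properties = "req.test",
  fun = function(task, perf.test, perf.train, measure, group, pred)
    diff(range(perf.test)))
### attach it to a measure via setAggregation, cf. above
ms.range = setAggregation(mmce, my.range.aggr)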
Now we can run a feature selection based on the first measure in the provided
list and see how the other measures turn out.
### mmce with default aggregation scheme test.mean
ms1 = mmce
head(perf.data[1:8])
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width mmce.test.mean
#> 1            0           0            0           0     0.70666667
#> 2            1           0            0           0     0.31333333
#> 3            0           1            0           0     0.50000000
#> 4            0           0            1           0     0.09333333
#> 5            0           0            0           1     0.04666667
#> 6            1           1            0           0     0.28666667
#> mmce.test.range mmce.test.min mmce.test.max
#> 1 0.16 0.60 0.76
#> 2 0.02 0.30 0.32
#> 3 0.22 0.36 0.58
#> 4 0.10 0.04 0.14
#> 5 0.08 0.02 0.10
#> 6 0.08 0.24 0.32
[Figure: mmce.test.range plotted against mmce.test.mean, coloured by as.factor(Petal.Width) and as.factor(Sepal.Width).]
The plot shows the range versus the mean misclassification error. The value
on the y-axis thus corresponds to the length of the error bars. (Note that the
points and error bars are jittered in y-direction.)
Creating an Imputation Method
The built-in mean imputation method, imputeMean, is defined essentially as follows.
imputeMean = function() {
  makeImputeMethod(
    learn = function(data, target, col) list(const = mean(data[[col]], na.rm = TRUE)),
    impute = simpleImpute)
}
The learn function calculates the mean of the non-missing observations in column
col. The mean is passed via argument const to the impute function that replaces
all missing values in feature col.
Now let’s write a new imputation method: A frequently used simple technique
for longitudinal data is last observation carried forward (LOCF). Missing values
are replaced by the most recent observed value.
In the R code below the learn function determines the last observed value
previous to each NA (values) as well as the corresponding number of consecutive
NA's (times). The impute function generates a vector by replicating the entries
in values according to times and replaces the NA's in feature col.
imputeLOCF = function() {
  makeImputeMethod(
    learn = function(data, target, col) {
      x = data[[col]]
      ind = is.na(x)
      dind = diff(ind)
      lastValue = which(dind == 1)   # position of the last observed value previous to NA
      lastNA = which(dind == -1)     # position of the last of potentially several consecutive NA's
      values = x[lastValue]          # last observed value previous to NA
      times = lastNA - lastValue     # number of consecutive NA's
      return(list(values = values, times = times))
    },
    impute = function(data, target, col, values, times) {
      x = data[[col]]
      x[is.na(x)] = rep(values, times)
      x
    }
  )
}
Note that this function is just for demonstration and is lacking some checks
for real-world usage (for example ‘What should happen if the first value in x
is already missing?’). Below it is used to impute the missing values in features
Ozone and Solar.R in the airquality data set.
data(airquality)
imp = impute(airquality, cols = list(Ozone = imputeLOCF(),
Solar.R = imputeLOCF()),
dummy.cols = c("Ozone", "Solar.R"))
head(imp$data, 10)
#> Ozone Solar.R Wind Temp Month Day Ozone.dummy Solar.R.dummy
#> 1 41 190 7.4 67 5 1 FALSE FALSE
#> 2 36 118 8.0 72 5 2 FALSE FALSE
#> 3 12 149 12.6 74 5 3 FALSE FALSE
#> 4 18 313 11.5 62 5 4 FALSE FALSE
#> 5 18 313 14.3 56 5 5 TRUE TRUE
#> 6 28 313 14.9 66 5 6 FALSE TRUE
#> 7 23 299 8.6 65 5 7 FALSE FALSE
#> 8 19 99 13.8 59 5 8 FALSE FALSE
#> 9 8 19 20.1 61 5 9 FALSE FALSE
#> 10 8 194 8.6 69 5 10 TRUE FALSE
Integrating Another Filter Method
A lot of feature filter methods are already integrated in mlr and a complete list is given in the Appendix or can be obtained using listFilterMethods. You can easily add another filter, be it a brand new one or a method which is already implemented in another package, via function makeFilter.
Filter objects
In mlr all filter methods are objects of class Filter and are registered in an
environment called .FilterRegister (where listFilterMethods looks them up
to compile the list of available methods). To get to know their structure let’s
have a closer look at the "rank.correlation" filter which interfaces function
rank.correlation in package FSelector.
filters = as.list(mlr:::.FilterRegister)
filters$rank.correlation
#> Filter: 'rank.correlation'
#> Packages: 'FSelector'
#> Supported tasks: regr
#> Supported features: numerics
str(filters$rank.correlation)
#> List of 6
#> $ name : chr "rank.correlation"
#> $ desc : chr "Spearman's correlation between feature and target"
#> $ pkg : chr "FSelector"
#> $ supported.tasks : chr "regr"
#> $ supported.features: chr "numerics"
#> $ fun :function (task, nselect, ...)
#> - attr(*, "class")= chr "Filter"
filters$rank.correlation$fun
#> function (task, nselect, ...)
#> {
#> y = FSelector::rank.correlation(getTaskFormula(task), data = getTaskData(task))
#> setNames(y[["attr_importance"]], getTaskFeatureNames(task))
#> }
#> <bytecode: 0xc73ee50>
#> <environment: namespace:mlr>
The core element is $fun which calculates the feature importance. For the
"rank.correlation" filter it just extracts the data and formula from the task
and passes them on to the rank.correlation function.
Additionally, each Filter object has a $name, which should be short and is for
example used to annotate graphics (cp. plotFilterValues), and a slightly more
detailed description in slot $desc. If the filter method is implemented by another
package its name is given in the $pkg member. Moreover, the supported task
types and feature types are listed.
You can integrate your own filter method using makeFilter. This function gener-
ates a Filter object and also registers it in the .FilterRegister environment.
The arguments of makeFilter correspond to the slot names of the Filter object
above. Currently, feature filtering is only supported for supervised learning tasks and possible values for supported.tasks are "regr", "classif" and "surv".
supported.features can be "numerics", "factors" and "ordered".
fun must be a function with at least the following formal arguments:
• task is a mlr learning Task.
• nselect corresponds to the argument of generateFilterValuesData of the
same name and specifies the number of features for which to calculate
importance scores. Some filter methods have the option to stop after a
certain number of top-ranked features have been found, in order to save
time and resources when the number of features is high. The majority of filter methods integrated in mlr do not support this and thus nselect is ignored in most cases. An exception is the minimum redundancy maximum
relevance filter from package mRMRe.
• ... for additional arguments.
fun must return a named vector of feature importance values. By convention
the most important features receive the highest scores.
If nselect is actively used, fun can either return a vector of nselect scores or a vector as long as the total number of features in the task that contains NAs for all features whose scores weren't calculated.
For writing fun many of the getter functions for Tasks come in handy, particularly
getTaskData, getTaskFormula and getTaskFeatureNames. It’s worth having a
closer look at getTaskData which provides many options for formatting the data
and recoding the target variable.
As a short demonstration we write a totally meaningless filter that determines
the importance of features according to alphabetical order, i.e., giving highest
scores to features with names that come first (decreasing = TRUE) or last
(decreasing = FALSE) in the alphabet.
makeFilter(
name = "nonsense.filter",
desc = "Calculates scores according to alphabetical order of
features",
pkg = "",
supported.tasks = c("classif", "regr", "surv"),
supported.features = c("numerics", "factors", "ordered"),
fun = function(task, nselect, decreasing = TRUE, ...) {
feats = getTaskFeatureNames(task)
imp = order(feats, decreasing = decreasing)
names(imp) = feats
imp
}
)
#> Filter: 'nonsense.filter'
#> Packages: ''
You can use it like any other filter method already integrated in mlr (i.e., via
the method argument of generateFilterValuesData or the fw.method argument
of makeFilterWrapper; see also the page on feature selection).
d = generateFilterValuesData(iris.task, method =
c("nonsense.filter", "anova.test"))
d
#> FilterValues:
#> Task: iris-example
#> name type nonsense.filter anova.test
#> 1 Sepal.Length numeric 2 119.26450
#> 2 Sepal.Width numeric 1 49.16004
#> 3 Petal.Length numeric 4 1180.16118
#> 4 Petal.Width numeric 3 960.00715
plotFilterValues(d)
[Figure: plotFilterValues(d) — bar plots of the feature scores for iris-example (4 features), one panel for nonsense.filter and one for anova.test.]
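The call that produced iris.task.filtered below is not shown above; a plausible sketch, keeping the two features that score highest under the new filter, is:
### Sketch: reduce the task to the two top-scoring features
iris.task.filtered = filterFeatures(iris.task, method = "nonsense.filter", abs = 2)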
getTaskFeatureNames(iris.task.filtered)
#> [1] "Petal.Length" "Petal.Width"
You might also want to have a look at the source code of the filter methods
already integrated in mlr for some more complex and meaningful examples.