Statistical Machine Learning (BE4M33SSU)
Lecture 1.
Czech Technical University in Prague
Course format
Teachers: Jan Drchal, Boris Flach, Vojtech Franc and Daniel Bonilla
Format: 1 lecture & 1 tutorial per week (6 credits), tutorials of two types
seminars: discussing solutions of theoretical assignments (published a week before the
class). You are expected to work on them in advance.
practical labs: explaining and discussing practical homeworks, i.e. implementation of
selected methods in Python (or Matlab). You have to submit
1. a report in PDF format (typeset preferably in LaTeX). Exception: if necessary, you
may include lengthy formula derivations as handwritten scans.
2. your code, either as a source file or as a Python notebook. The code must be
executable.
Grading: 40% homeworks + 60% written exam = 100% (+ bonus points)
Prerequisites:
probability theory and statistics (A0B01PSI)
pattern recognition and machine learning (AE4B33RPZ)
optimisation (AE4B33OPT)
More details: https://cw.fel.cvut.cz/wiki/courses/be4m33ssu/start
Goals
The aim of statistical machine learning is to develop systems (models and algorithms) for
solving prediction tasks given a set of examples and some prior knowledge about the task.
Machine learning has been successfully applied in areas such as:
text and document classification,
speech recognition and natural language processing,
computational biology (genes, proteins) and biological imaging & medical diagnosis
computer vision,
fraud detection, network intrusion,
and many others
You will gain skills to construct learning systems for typical applications by successfully
combining appropriate models and learning methods.
Characters of the play
object features x ∈ X are observable; x can be:
a categorical variable, a scalar, a real-valued vector, a tensor, a sequence of values, an
image, a labelled graph, ...
state of the object y ∈ Y is usually hidden; y can be: see above
prediction strategy (a.k.a. inference rule) h : X → Y; depending on the type of Y:
• y is a categorical variable ⇒ classification
• y is a real-valued variable ⇒ regression
training examples T = {(x, y) | x ∈ X, y ∈ Y}
loss function ℓ : Y × Y → R+ penalises wrong predictions,
i.e. ℓ(y, h(x)) is the loss for predicting y′ = h(x) when y is the true state
Goal: optimal prediction strategy h : X → Y that minimises the loss
Q: give meaningful application examples for combinations of different X, Y and
related loss functions
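To make these characters concrete, here is a minimal Python sketch; the toy thresholding strategy and the feature vector are invented purely for illustration:

```python
import numpy as np

# two common loss functions l : Y x Y -> R+
def zero_one_loss(y_true, y_pred):
    """0/1 loss for classification: 1 if the prediction is wrong, 0 otherwise."""
    return float(y_true != y_pred)

def squared_loss(y_true, y_pred):
    """Squared loss for regression."""
    return (y_true - y_pred) ** 2

# a toy prediction strategy h : X -> Y (threshold on the first feature)
def h(x):
    return 1 if x[0] > 0.5 else 0

x = np.array([0.7, 0.2])        # observable features x in X
y = 0                           # hidden true state y in Y
print(zero_one_loss(y, h(x)))   # l(y, h(x)) = 1.0, the prediction is wrong
```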
Statistical machine learning
Main assumption:
X, Y are random variables,
X, Y are related by an unknown joint p.d.f. p(x, y),
we can collect examples (x, y) drawn from p(x, y).
Typical concepts:
regression: Y = f(X) + ε, where f is unknown and ε is a random error,
classification: p(x, y) = p(y)p(x | y), where p(y) is the prior class probability and
p(x | y) the conditional feature distribution.
Consequences and problems
the inference rule h(X) and the loss ℓ(Y, h(X)) become random variables.
risk of an inference rule h(X) ⇒ expected loss
R(h) = E[ℓ(Y, h(X))] = Σ_{x∈X} Σ_{y∈Y} p(x, y) ℓ(y, h(x))
how to estimate R(h) if p(x, y) is unknown?
how to choose an optimal predictor h(x) if p(x, y) is unknown?
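When X and Y are finite and p(x, y) is fully known, the risk is just this double sum; a small illustrative Python sketch (the toy distribution and strategy below are made up for this example):

```python
import numpy as np

# toy joint distribution p(x, y) over X = {0, 1, 2}, Y = {0, 1}
p_xy = np.array([[0.30, 0.05],
                 [0.10, 0.20],
                 [0.05, 0.30]])   # rows: x, columns: y; entries sum to 1

def zero_one_loss(y, y_pred):
    return float(y != y_pred)

def h(x):
    """A toy inference rule h : X -> Y."""
    return 0 if x == 0 else 1

# risk R(h) = sum_x sum_y p(x, y) * loss(y, h(x))
risk = sum(p_xy[x, y] * zero_one_loss(y, h(x))
           for x in range(p_xy.shape[0])
           for y in range(p_xy.shape[1]))
print(risk)  # expected 0/1 loss of h under p(x, y)
```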
Statistical machine learning
Estimating R(h):
collect an i.i.d. test sample S^m = {(x^i, y^i) ∈ X × Y | i = 1, ..., m} drawn from the
distribution p(x, y),
estimate the risk R(h) of the strategy h by the empirical risk
R(h) ≈ R_{S^m}(h) = (1/m) Σ_{i=1}^m ℓ(y^i, h(x^i))
Q: how strongly can they deviate from each other? (see next lectures)
P( |R_{S^m}(h) − R(h)| > ε ) ≤ ??
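A minimal sketch of this estimate, assuming a toy distribution we can sample from (the data model, the noise level and the fixed strategy are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy model: X uniform on [0, 1], Y = 1{X > 0.5} with 10% label noise
def sample(m):
    x = rng.uniform(0.0, 1.0, size=m)
    y = ((x > 0.5) ^ (rng.uniform(size=m) < 0.1)).astype(int)
    return x, y

def h(x):
    """Fixed strategy to be evaluated (not learned here)."""
    return (x > 0.5).astype(int)

# empirical risk on an i.i.d. test sample S^m (0/1 loss)
x_test, y_test = sample(m=1000)
emp_risk = np.mean(y_test != h(x_test))
print(emp_risk)  # fluctuates around the true risk R(h) = 0.1
```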
Statistical machine learning
Choosing an optimal inference rule h(x)
If p(x, y) is known:
The smallest possible risk is
R* = inf_{h ∈ Y^X} R(h) = inf_{h ∈ Y^X} Σ_{x∈X} Σ_{y∈Y} p(x, y) ℓ(y, h(x)) = Σ_{x∈X} p(x) inf_{y′∈Y} Σ_{y∈Y} p(y | x) ℓ(y, y′)
The corresponding best possible inference rule is the Bayes inference rule
h*(x) = argmin_{y′∈Y} Σ_{y∈Y} p(y | x) ℓ(y, y′)
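For finite Y the Bayes rule is a simple argmin over candidate predictions; a sketch assuming an illustrative posterior p(y | x) and loss matrix (both invented for this example):

```python
import numpy as np

# illustrative posterior p(y | x) for one x over Y = {0, 1, 2},
# and an asymmetric loss matrix L[y, y'] = loss(true y, predicted y')
p_y_given_x = np.array([0.5, 0.3, 0.2])
L = np.array([[0, 1, 4],
              [1, 0, 1],
              [4, 1, 0]])

# Bayes rule: predict the y' minimising the expected (posterior) loss
expected_loss = p_y_given_x @ L        # one value per candidate y'
y_star = int(np.argmin(expected_loss))
print(expected_loss, y_star)
```

Note that with an asymmetric loss the Bayes decision (here y′ = 1) need not coincide with the most probable state argmax_y p(y | x) (here y = 0).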
But p(x, y) is not known and we can only collect examples drawn from it. We need:
Learning algorithms that use training data and prior assumptions/knowledge about the task
Learning types
Training data:
if T^m = {(x^i, y^i) ∈ X × Y | i = 1, ..., m} ⇒ supervised learning
if T^m = {x^i ∈ X | i = 1, ..., m} ⇒ unsupervised learning
if T^m = T_l^{m1} ∪ T_u^{m2}, with labelled training data T_l^{m1} and unlabelled training data T_u^{m2}
⇒ semi-supervised learning
Prior knowledge about the task:
Discriminative learning: assume that the optimal inference rule h* is in some class of
rules H ⇒ replace the true risk by the empirical risk
R_T(h) = (1/|T|) Σ_{(x,y)∈T} ℓ(y, h(x))
and minimise it w.r.t. h ∈ H, i.e. h*_T = argmin_{h∈H} R_T(h).
Q: How strongly can R(h*_T) deviate from R(h*)? How does this deviation depend on H?
P( |R(h*_T) − R(h*)| > ε ) ≤ ??
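A minimal sketch of empirical risk minimisation, assuming a toy hypothesis class of 1-D threshold classifiers (the data model and the brute-force search over H are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy labelled training set T (1-D features, binary labels with 10% noise)
x = rng.uniform(0.0, 1.0, size=200)
y = ((x > 0.6) ^ (rng.uniform(size=200) < 0.1)).astype(int)

# hypothesis class H: threshold classifiers h_t(x) = 1{x > t}
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_risk(t):
    """R_T(h_t) with 0/1 loss."""
    return np.mean(y != (x > t).astype(int))

# empirical risk minimisation: h*_T = argmin over H of R_T(h)
risks = np.array([empirical_risk(t) for t in thresholds])
t_star = thresholds[np.argmin(risks)]
print(t_star, risks.min())
```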
Learning types
Generative learning: assume that the true p.d. p(x, y) is in some parametrised family
of distributions, i.e. p = p_{θ*} ∈ P_Θ ⇒ use the training set T to estimate θ ∈ Θ:
1. θ*_T = argmax_{θ∈Θ} log p_θ(T), i.e. the maximum likelihood estimator,
2. set h*_T = h_{θ*_T}, where h_θ denotes the Bayes inference rule for the p.d. p_θ.
Q: How strongly can θ*_T deviate from θ*? How does this deviation depend on P_Θ?
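A minimal sketch of this two-step recipe, assuming 1-D class-conditional Gaussians as the parametrised family P_Θ (the toy data and the model choice are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# toy training data from two classes with Gaussian features
m = 300
y = (rng.uniform(size=m) < 0.4).astype(int)          # true prior p(y=1) = 0.4
x = rng.normal(loc=np.where(y == 1, 2.0, 0.0), scale=1.0)

# step 1: maximum likelihood estimates of theta = (priors, means, variances)
prior = np.array([np.mean(y == 0), np.mean(y == 1)])
mu = np.array([x[y == 0].mean(), x[y == 1].mean()])
var = np.array([x[y == 0].var(), x[y == 1].var()])

# step 2: Bayes inference rule for the fitted model (0/1 loss => argmax posterior)
def h_theta(x_new):
    log_post = (np.log(prior) - 0.5 * np.log(var)
                - 0.5 * (x_new - mu) ** 2 / var)      # + const independent of y
    return int(np.argmax(log_post))

print(h_theta(0.3), h_theta(1.8))
```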
Possible combinations (training data vs. learning type):

             discr.   gener.
superv.      yes      yes
semi-sup.    (yes)    yes
unsuperv.    no       yes
In this course:
discriminative: Support Vector Machines, Deep Neural Networks
generative: mixture models, Hidden Markov Models
other: Bayesian learning, Ensembling
Example: Classification of handwritten digits
x ∈ X - grey-valued images of size 28×28, y ∈ Y - categorical variable with 10 values
discriminative: Specify a class of strategies H and a loss function ℓ(y, y′). How would
you estimate the optimal inference rule h* ∈ H?
generative: Specify a parametrised family p_θ(x, y), θ ∈ Θ, and a loss function ℓ(y, y′).
How would you estimate the optimal θ* by using the MLE? What is the Bayes inference
rule for p_{θ*}?
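One possible answer to the generative question, sketched on random placeholder data: a Bernoulli naive Bayes model over binarised pixels, with MLE for θ and the 0/1-loss Bayes rule. The model choice, the Laplace smoothing and the stand-in data are assumptions for illustration, not the course's prescribed solution:

```python
import numpy as np

rng = np.random.default_rng(3)

# placeholder data standing in for binarised 28x28 digit images;
# real data would come from MNIST or a similar dataset
m, d, K = 1000, 28 * 28, 10
X = (rng.uniform(size=(m, d)) < 0.2).astype(int)   # fake binary images
y = rng.integers(0, K, size=m)                     # fake labels 0..9

# generative model: p_theta(x, y) = p(y) * prod_j p(x_j | y)  (Bernoulli naive Bayes)
# MLE of theta: class priors and per-class pixel probabilities (Laplace-smoothed)
prior = np.bincount(y, minlength=K) / m
pix = np.vstack([(X[y == k].sum(axis=0) + 1) / (np.sum(y == k) + 2) for k in range(K)])

def h_theta(x_img):
    """Bayes rule for 0/1 loss: argmax_y p_theta(y | x)."""
    log_post = (np.log(prior)
                + x_img @ np.log(pix).T
                + (1 - x_img) @ np.log(1 - pix).T)
    return int(np.argmax(log_post))

print(h_theta(X[0]))
```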