Introduction to Machine Learning
Introduction
林彥宇 教授
Yen-Yu Lin, Professor
國立陽明交通大學 資訊工程學系
Computer Science, National Yang Ming Chiao Tung University
Some slides are modified from Prof. Sheng-Jyh Wang,
Prof. Hwang-Tzong Chen, and Prof. Yung-Yu Chuang
Pattern recognition and machine learning
• Pattern recognition is the automated recognition of patterns
and regularities in data
➢ Discover pattern regularities
➢ Take actions, such as classification or regression, based on the discovered regularities
• Data: A set of hand-written digits and the class ground truth
• Computer algorithm: It extracts features from each image and analyzes the patterns and regularities in the data
• Model: Given a new hand-written digit, predict its class label
2
Pattern recognition and machine learning
• Machine learning: to design and develop algorithms that allow computers to make predictions based on empirical data
➢ Try to explore certain patterns or regularities
➢ Learn models from the given data
➢ Based on the given data, the learner produces a useful output in
new cases
• Machine learning is one approach to pattern recognition, while
other approaches include hand-crafted (not learned) rules or
heuristics
• Machine learning ⊂ Pattern recognition
3
Applications of machine learning
• Computer vision
• Speech recognition
• Information retrieval
• Natural language processing
• Robotics
• Bioinformatics
• Data mining
• Finance
• …
4
Problem definition of a machine learning task
• Training data
➢ A set of N training data {x1, x2, …, xN}, sometimes together with
their target vectors {t1, t2, …, tN}
• Feature extraction
➢ Original input variables are usually transformed into some new
space of variables, where the problem can be better handled
• Model learning
➢ We learn a proper model for the problem
• Generalization or testing
➢ To correctly predict new examples (testing data) that differ from
those used for training
5
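To make the four stages above concrete, here is a minimal sketch in Python. It assumes scikit-learn is available; its bundled digits dataset stands in for the hand-written digit example, and logistic regression is just one possible choice of classifier, not the method prescribed by the slides.

# Sketch of the training data -> feature extraction -> model learning -> testing pipeline
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()                                   # training data: 8x8 digit images with class labels
X = digits.images.reshape(len(digits.images), -1)        # feature extraction: flatten each image to a 64-d vector
t = digits.target

# hold out part of the data to measure generalization on unseen examples
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)                # model learning: fit a classifier on the training set
model.fit(X_train, t_train)

print("test accuracy:", model.score(X_test, t_test))     # generalization / testing on unseen images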
Cat image classification: Training data
• Collect a set of training data with target vectors
6
Cat image classification: Feature extraction
• Feature extraction is crucial
➢ Need to take feature variations into account
(Figure: example cat images showing viewpoint, illumination, background, and pose variations)
7
Cat image classification: Model learning
• Based on the given training data and the extracted features,
we learn a classifier
8
Cat image classification: Testing
• Apply the learned classifier to the testing images
9
Cat image classification: Testing
• Apply the learned classifier to the testing images and make
prediction
10
Regression
• The model maps the input data x to a predicted value y, which is a real value
(Figure: estimating the TAIEX, the Taiwan Capitalization Weighted Stock Index, on 11/5)
11
Supervised vs. Unsupervised learning
• Supervised learning: the training data comprises examples of
the input vectors along with their corresponding target vectors
• Classification: assign each input vector to one of a finite number
of discrete categories
• Regression: assign each input vector to one or more continuous
variables
• Methods: linear regression, linear classification, neural
networks, support vector machine, ensemble learning,
dimensionality reduction, deep learning, …
12
Good vs. bad features for classification
(Figures: scatter plots with axes Feature A and Feature B, comparing good features with bad features)
13
Good vs. bad features for regression
(Figures: scatter plots of Output Value against Feature, comparing a good feature with a bad feature)
14
Supervised vs. Unsupervised learning
• Unsupervised learning: the training data consist of a set of
input vectors x without any corresponding target values
• Clustering: to discover groups of similar examples within the
data
• Density estimation: to determine the distribution of data within
the input space
• Dimensionality reduction: to project the data from a high-
dimensional space down to a low-dimensional space
• Data generation: to synthesize new data with some particular conditions
15
Unsupervised learning for clustering
• Clustering: To group a set of data in such a way that data
points in the same group, called a cluster, are more similar to
each other than to those in other clusters
k-means clustering
16
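The following is a small NumPy sketch of the clustering idea just described, written from scratch as Lloyd's algorithm for k-means; the two-blob toy data and k = 2 are illustrative choices, not part of the slides.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Lloyd's algorithm: alternate between assigning each point to its nearest
    # center and moving each center to the mean of its assigned points.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# two well-separated Gaussian blobs as toy data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 5.0])
labels, centers = kmeans(X, k=2)
print(centers)          # one center near (0, 0), the other near (5, 5)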
Unsupervised learning for dimensionality reduction
• Dimensionality reduction: To project data from a high-
dimensional space to a low-dimensional one
PCA: Principal component analysis
17
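A short NumPy sketch of PCA as described above: center the data and project it onto the leading eigenvectors of its covariance matrix. The 5-dimensional random data is only a placeholder.

import numpy as np

def pca(X, n_components):
    # keep the eigenvectors of the covariance matrix with the largest
    # eigenvalues (the principal components) and project onto them
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    return X_centered @ eigvecs[:, top]               # low-dimensional projection

X = np.random.default_rng(0).normal(size=(200, 5))
Z = pca(X, n_components=2)
print(Z.shape)                                        # (200, 2)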
Unsupervised learning for density estimation
• Density estimation: Based on given data, estimate the
underlying probability density function
kernel density estimation (KDE)
18
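A minimal NumPy sketch of kernel density estimation with a Gaussian kernel; the bandwidth value and the two-mode toy data are arbitrary illustrative choices.

import numpy as np

def kde(samples, xs, bandwidth=0.3):
    # place a Gaussian kernel on every sample and average them:
    # p(x) = (1 / (N * h)) * sum_n K((x - x_n) / h)
    u = (xs[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 300)])
xs = np.linspace(-6, 6, 200)
density = kde(samples, xs)
print(xs[density.argmax()])   # the estimated density peaks near one of the two modes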
Unsupervised learning for data generation
• Given a set of natural images, we try to generate new images
that look natural and photorealistic
Generative Adversarial Networks (GAN): Given a set of images, generate new images from the same distribution
19
Applications of data generation
• Face synthesis
Tero Karras et al. “Progressive Growing of GANs for
Improved Quality, Stability, and Variation”
20
Polynomial curve fitting: Problem definition
• Training data (observations)
➢ 10 blue circles, each of which has
◆One-dimensional input
◆One target output
• The green curve sin(2𝜋𝑥) is the function used to generate these data; it is unknown to the learner
• Each point is sampled from the function with random Gaussian noise
• Goal of curve fitting: To exploit the training data to discover the underlying function so that we can make predictions of the value \hat{t} for some new input \hat{x}
21
Polynomial curve fitting: Choose a fitting function
• Fit the data using a polynomial function of the form
  y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j
➢ This function is parametrized by \mathbf{w} = (w_0, w_1, \ldots, w_M)^T
➢ w_0 is the bias term
➢ Its input is a data point, while the output is the estimated target
➢ M is the order of the polynomial function
22
Polynomial curve fitting: Error function
• An error function (objective function) is used to determine the
parameters
• In this case, we minimize the sum-of-squares error
  E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2
➢ Differentiable
➢ Closed-form solution
23
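As a sketch of this closed-form least-squares fit, the NumPy code below builds the polynomial design matrix and minimizes the sum-of-squares error; the noisy sin(2πx) data follows the running example, while the noise level and M = 3 are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)   # noisy samples of sin(2*pi*x)

M = 3
Phi = np.vander(x, M + 1, increasing=True)    # design matrix with columns x^0, ..., x^M

# least squares gives the w minimizing E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

E = 0.5 * np.sum((Phi @ w - t) ** 2)
print("w =", w, " E(w) =", E)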
Polynomial curve fitting: Model selection
• Models with different values of hyperparameter M
• Model selection: To choose a proper value of M
24
Polynomial curve fitting: Model selection
• Under-fitting: M = 0 or M = 1
➢ The constant or first-order polynomial gives a poor fit due to insufficient flexibility
• The third-order polynomial gives the best fit
• Over-fitting: M = 9
➢ All training points are perfectly fitted
➢ Poor representation of the green curve
➢ The generalization is poor
25
Polynomial curve fitting: Generalization
• Suppose we are given a set of training data and a separate set
of 100 test data
• Evaluate the generalization for each choice of M via the root-mean-square (RMS) error
  E_{RMS} = \sqrt{2 E(\mathbf{w}^\star) / N}
26
Polynomial curve fitting: Generalization
• Small values of M give relatively large values of training and
test errors
• When M is between 3 and 8, reasonable representations are
obtained
• For M=9, the training error goes to zero, but the test error
increases significantly
27
Polynomial curve fitting: Data size vs. Over-fitting
(Figures: fits with M = 9 as the number of data points increases)
• Over-fitting becomes less severe as the data size increases
• In general, the number of data points should be no less than
some multiple (say 5 or 10) of the number of adaptive
parameters in the model
• Regularization is often used to control the over-fitting
phenomenon
28
Polynomial curve fitting: Regularization
• Regularization: Add a penalty term to the error function to discourage the coefficients from reaching large values
  \tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2
  where \|\mathbf{w}\|^2 = \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \cdots + w_M^2
➢ The coefficient w_0 is usually omitted from the regularizer
➢ Techniques of this kind are called shrinkage methods in the statistics literature
➢ A quadratic regularizer is called ridge regression
➢ In neural networks, this approach is known as weight decay
29
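A sketch of the regularized fit in NumPy: the ridge solution w = (ΦᵀΦ + λI)⁻¹Φᵀt minimizes the penalized error above. For simplicity the bias w_0 is regularized here as well, and the particular value of λ is an arbitrary illustrative choice.

import numpy as np

def fit_poly_ridge(x, t, M, lam):
    # closed-form minimizer of 1/2 * sum_n (y(x_n, w) - t_n)^2 + lam/2 * ||w||^2
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

Phi = np.vander(x, 10, increasing=True)
w_unreg, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # M = 9, no regularization: huge coefficients
w_reg = fit_poly_ridge(x, t, M=9, lam=np.exp(-18))    # shrinkage keeps the coefficients small
print(np.abs(w_unreg).max(), np.abs(w_reg).max())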
Polynomial curve fitting: Regularization
30
Probability theory
• We need to handle data uncertainties, which result from
➢ Noise on measurement
➢ Finite size of data sets
• Probability theory provides a consistent framework to
manipulate uncertainties, and hence is essential to pattern
recognition research
31
A toy example
• Two boxes: r (red box) and b (blue box)
• Two types of fruits: a (apple) and o (orange)
• A trial: Randomly select a box, then randomly pick a fruit from it
• Introduce one variable B for box and one variable F for fruit
• Many trials: Repeat the process many times
• Question 1: What is the probability that an apple is picked?
➢ Marginal probability
• Question 2: Given that we have picked an orange, what is the
probability that the box we chose was the blue one?
➢ Conditional probability
32
Probability theory: A two-variable case
• Two random variables: 𝑋 and 𝑌
• Each variable has a set of discrete states
➢ 𝑋 can take any value 𝑥𝑖 where 𝑖 = 1, 2, … , 𝑀
➢ 𝑌 can take any value 𝑦𝑗 where 𝑗 = 1, 2, … , 𝐿
• 𝑁 trials where both variables 𝑋 and 𝑌 are sampled
• Some notations
➢ Let the number of trials where 𝑋 = 𝑥𝑖 and 𝑌 = 𝑦𝑗 be 𝑛𝑖𝑗
➢ Let the number of trials where 𝑋 takes value 𝑥𝑖 be 𝑐𝑖
➢ Let the number of trials where 𝑌 takes value 𝑦𝑗 be 𝑟𝑗
33
Joint, marginal, and conditional probabilities
• The probability that 𝑋 takes value 𝑥𝑖 and 𝑌 takes value 𝑦𝑗 is
called joint probability
• It is defined by the fraction of points (trials) falling in cell (i, j):
  p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}
34
Joint, marginal, and conditional probabilities
• The probability that 𝑋 takes value 𝑥𝑖 irrespective of the value
of 𝑌 is called marginal probability and is written as 𝑝(𝑋 = 𝑥𝑖 )
• It is defined by the fraction of the number of points that fall in column i, namely
  p(X = x_i) = \frac{c_i}{N}
• With the joint probability and c_i = \sum_j n_{ij}, we have the sum rule
  p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j)
35
Joint, marginal, and conditional probabilities
• If we consider only those cases where 𝑋 takes value 𝑥𝑖 , the
fraction of those cases where 𝑌 = 𝑦𝑗 is written as
𝑝 𝑌 = 𝑦𝑗 𝑋 = 𝑥𝑖 . It is called conditional probability
• It is defined by
  p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}
• Relating the joint, marginal, and conditional probabilities gives the product rule
  p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j \mid X = x_i)\, p(X = x_i)
36
Joint, marginal, and conditional probabilities
37
Bayes’ theorem
• By using the product rule and the symmetry property 𝑝(𝑋, 𝑌) = 𝑝(𝑌, 𝑋), we have
  p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad p(X) = \sum_{Y} p(X \mid Y)\, p(Y)
38
Probability with continuous variables
• The probability density 𝑝(𝑥) over a continuous variable 𝑥 must
satisfy the two conditions:
➢ Nonnegative: probabilities are nonnegative, i.e., p(x) \ge 0
➢ Sum-to-1: the value of x must lie somewhere on the real axis, i.e., \int_{-\infty}^{\infty} p(x)\, dx = 1
• The cumulative distribution function defines the probability that x lies in the interval (-\infty, z) via
  P(z) = \int_{-\infty}^{z} p(x)\, dx
39
Sum rule and product rule
• Sum rule in discrete cases: p(X) = \sum_{Y} p(X, Y)
• Sum rule in continuous cases: p(x) = \int p(x, y)\, dy
• Product rule in discrete cases: p(X, Y) = p(Y \mid X)\, p(X)
• Product rule in continuous cases: p(x, y) = p(y \mid x)\, p(x)
40
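The rules above can be checked numerically on a small discrete joint distribution; the 2×3 table below is made up purely for illustration.

import numpy as np

# toy joint distribution p(X = x_i, Y = y_j): rows index Y (L = 2), columns index X (M = 3)
joint = np.array([[0.10, 0.05, 0.15],
                  [0.20, 0.30, 0.20]])

p_x = joint.sum(axis=0)                    # sum rule: p(X) = sum_Y p(X, Y)
p_y = joint.sum(axis=1)                    # sum rule: p(Y) = sum_X p(X, Y)

p_y_given_x = joint / p_x[None, :]         # conditional: p(Y | X) = p(X, Y) / p(X)
assert np.allclose(p_y_given_x * p_x[None, :], joint)    # product rule: p(X, Y) = p(Y | X) p(X)

# Bayes' theorem: p(X | Y) = p(Y | X) p(X) / p(Y); each row is a valid distribution over X
p_x_given_y = p_y_given_x * p_x[None, :] / p_y[:, None]
assert np.allclose(p_x_given_y.sum(axis=1), 1.0)
print(p_x, p_y)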
Expectations and covariances
• The average value of some function 𝑓(𝑥) under a probability
distribution 𝑝(𝑥) is called the expectation of 𝑓(𝑥)
• For a discrete distribution, the expectation of f(x) is
  \mathbb{E}[f] = \sum_{x} p(x)\, f(x)
• For a continuous probability, the expectation of f(x) is
  \mathbb{E}[f] = \int p(x)\, f(x)\, dx
41
Expectations and covariances
• The variance of f(x) under a probability distribution p(x) is
  \mathrm{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2
• It is a measure of how much variability there is in f(x) around its mean
• For two random variables x and y, the covariance is defined by
  \mathrm{cov}[x, y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]
• It expresses the extent to which x and y vary together
42
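A quick NumPy check of these definitions by sampling: the expectation, variance, and covariance are approximated by sample averages. The particular distributions and the function f(x) = sin(x) are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
y = 0.5 * x + rng.normal(scale=1.0, size=100_000)

f = np.sin(x)
E_f = f.mean()                                        # E[f(x)] as a sample average
var_f = np.mean((f - E_f) ** 2)                       # var[f] = E[(f(x) - E[f(x)])^2]
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))     # cov[x, y] = E[xy] - E[x]E[y]

print(E_f, var_f, cov_xy)     # cov_xy should be close to 0.5 * var[x] = 2.0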
Gaussian distribution
• For a single continuous variable, the Gaussian or normal distribution is defined by
  \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Big\{ -\frac{(x - \mu)^2}{2\sigma^2} \Big\}
which is specified by two parameters: mean \mu and variance \sigma^2
43
Mean and variance of a Gaussian distribution
• The average value of a random variable x whose distribution is Gaussian: \mathbb{E}[x] = \mu
• The second-order moment of variable x: \mathbb{E}[x^2] = \mu^2 + \sigma^2
• The variance of variable x: \mathrm{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \sigma^2
44
Multivariate Gaussian
• The multivariate Gaussian distribution defined over a D-dimensional vector \mathbf{x} of continuous variables:
  \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\Big\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big\}
where the D \times D matrix \boldsymbol{\Sigma} is called the covariance matrix and |\boldsymbol{\Sigma}| denotes its determinant
45
Bayes’ theorem for polynomial curve fitting
• Recall the curve fitting problem
➢ Given a set of N observations D = {x1, x2, …, xN} and their target
values {t1, t2, …, tN}
➢ Polynomial curve fitting: Determine the values of 𝒘
• Prior probability 𝑝(𝒘): Expresses our assumption about 𝒘 before observing any data
• Likelihood function 𝑝(𝐷|𝒘): Expresses how probable the observed data D is under 𝒘. It is evaluated after the observations D are given
46
Bayes’ theorem for polynomial curve fitting
• Bayes’ theorem takes the form
  p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\, p(\mathbf{w})}{p(D)}
which allows us to evaluate the uncertainty in \mathbf{w} after we have the observations D
• p(D) is the normalization constant. Thus, we have
  \text{posterior} \propto \text{likelihood} \times \text{prior}
47
Determining Gaussian parameters by maximum
likelihood
• Given a set of N observations \mathbf{x} = (x_1, \ldots, x_N)^T
• Assume these observations are sampled from a Gaussian
distribution with mean 𝜇 and variance 𝜎 2 (unknown)
• Our goal is to determine 𝜇 and 𝜎 2 based on the observations
• We assume that data are sampled independently from the
same distribution, namely independent and identically
distributed, or i.i.d. for short
48
Determining Gaussian parameters by maximum
likelihood
• Since the data are i.i.d., the likelihood function of the data given mean \mu and variance \sigma^2 is
  p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)
• The log likelihood function is
  \ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)
• Maximum likelihood solution:
  \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2
49
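A short NumPy sketch of the maximum likelihood estimates above, applied to synthetic Gaussian data; the true mean and standard deviation are arbitrary illustrative values.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # N i.i.d. observations

mu_ml = x.mean()                                # mu_ML = (1/N) * sum_n x_n
var_ml = np.mean((x - mu_ml) ** 2)              # sigma^2_ML = (1/N) * sum_n (x_n - mu_ML)^2
print(mu_ml, var_ml)                            # close to the true values 2.0 and 1.5^2 = 2.25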
Probabilistic perspective of polynomial curve fitting
• Given N data for regression: \mathbf{x} = (x_1, \ldots, x_N)^T and \mathbf{t} = (t_1, \ldots, t_N)^T
➢ Fit the data using a polynomial function of the form y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j
➢ This function is parametrized by \mathbf{w}
• Given the value of x, we assume the corresponding value of t has a Gaussian distribution with a mean equal to y(x, \mathbf{w}):
  p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}), \beta^{-1}\big)
where \beta^{-1} is the variance \sigma^2 (\beta is called the precision)
50
Probabilistic perspective of polynomial curve fitting
• The Gaussian conditional distribution for 𝑡 given 𝑥
• If data are i.i.d., the likelihood function is
  p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\big)
51
Maximum likelihood solution
• The log likelihood function
  \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)
• Maximum likelihood (ML) solution for determining \mathbf{w} and \beta
➢ Compute the gradient of the log likelihood function w.r.t. \mathbf{w} and set it to 0. We can get \mathbf{w}_{ML}, which coincides with minimizing the sum-of-squares error
➢ By setting the gradient of the log likelihood function w.r.t. \beta to 0, \beta_{ML} is obtained by solving
  \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}_{ML}) - t_n \}^2
52
Maximum likelihood solution
• After determining the values of \mathbf{w}_{ML} and \beta_{ML}, we can make predictions for a new value of x via the predictive distribution
  p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\big(t \mid y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1}\big)
53
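The NumPy sketch below puts the last few slides together: w_ML from least squares, β_ML from the residuals, and a Gaussian predictive distribution for a new input. The data, noise level, and M = 3 are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

M = 3
Phi = np.vander(x, M + 1, increasing=True)

# maximizing the likelihood in w is equivalent to minimizing the sum-of-squares error
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# 1 / beta_ML = (1/N) * sum_n (y(x_n, w_ML) - t_n)^2
beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)

# predictive distribution for a new input: Gaussian with mean y(x_new, w_ML), variance 1/beta_ML
x_new = 0.5
phi_new = x_new ** np.arange(M + 1)
print("mean:", phi_new @ w_ml, " variance:", 1.0 / beta_ml)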
Maximum a posteriori (MAP) solution
• While the ML solution is obtained by maximizing the likelihood, the MAP solution is obtained by maximizing the posterior
• Recall the likelihood function p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) defined above
• Introduce a prior distribution over the curve parameters \mathbf{w}
  p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}) = \Big(\frac{\alpha}{2\pi}\Big)^{(M+1)/2} \exp\Big\{ -\frac{\alpha}{2} \mathbf{w}^T \mathbf{w} \Big\}
➢ M is the order of the polynomial
➢ \alpha is a hyperparameter
• The posterior distribution for \mathbf{w}
  p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)
54
Maximum a posteriori (MAP) solution
• The MAP solution, \mathbf{w}_{MAP} and \beta_{MAP}, is obtained by maximizing the posterior function, or equivalently by minimizing
  \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\alpha}{2} \mathbf{w}^T \mathbf{w}
55
Bayesian curve fitting
• We make a point estimate of \mathbf{w} in both the ML and MAP solutions
• In a full Bayesian approach, we integrate over all possible values of \mathbf{w} for regression, i.e.,
  p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w} = \mathcal{N}\big(t \mid m(x), s^2(x)\big)
with
  m(x) = \beta\, \phi(x)^T \mathbf{S} \sum_{n=1}^{N} \phi(x_n)\, t_n, \qquad s^2(x) = \beta^{-1} + \phi(x)^T \mathbf{S}\, \phi(x)
  \mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \sum_{n=1}^{N} \phi(x_n)\, \phi(x_n)^T, \qquad \phi(x_n) = (x_n^0, \ldots, x_n^M)^T
56
Probabilistic polynomial curve fitting
• Given the assumption p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}), \beta^{-1}\big)
➢ ML solution: Find \mathbf{w} that maximizes the likelihood function
  p(t \mid x, D) = p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}^{-1})
➢ MAP solution: Find \mathbf{w} that maximizes the posterior probability
  p(t \mid x, D) = p(t \mid x, \mathbf{w}_{MAP}, \beta_{MAP}^{-1})
➢ Bayesian solution: Integrate over \mathbf{w}
  p(t \mid x, D) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid D)\, d\mathbf{w}
57
Model selection
• Hyperparameters, such as 𝑀 in polynomial curve fitting, control the complexity and behavior of the model
• Model selection: determine the values of hyperparameters that
achieve the best predictive performance on new (testing) data
• Idea: split training data into a training set and a validation set
➢ Training set: Used to learn the model with particular
hyperparameters values
➢ Validation set: Used to evaluate the performance of the learned
model
58
Model selection
• About the size of the validation set
➢ A large validation set: Less training data for model learning
➢ A small validation set: Less reliable performance evaluation
59
Model selection via cross validation
• S-fold cross-validation
➢ Partition training data into S equal-sized groups
➢ S−1 groups are used to train the model, which is then evaluated on the remaining group
➢ Repeat the procedure for all S possible runs
➢ Average the performance
60
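A NumPy sketch of S-fold cross-validation used to pick the polynomial order M; S = 5, the data size, and the noise level are illustrative choices, and the fitting code reuses the plain least-squares fit from the earlier example.

import numpy as np

def fit_poly(x, t, M):
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def rms_error(x, t, w):
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=30)

S = 5
folds = np.array_split(rng.permutation(len(x)), S)   # partition indices into S equal-sized groups

for M in range(10):
    errs = []
    for s in range(S):                               # each group serves once as the validation set
        val = folds[s]
        train = np.concatenate([folds[r] for r in range(S) if r != s])
        w = fit_poly(x[train], t[train], M)
        errs.append(rms_error(x[val], t[val], w))
    print(f"M = {M}: average validation RMS error = {np.mean(errs):.3f}")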
Drawbacks of model selection
• If training data are limited, a large value of S is appropriate
• At the extreme, setting S = N (the number of training data) gives the leave-one-out technique
• Some drawbacks
➢ The number of training runs increases by a factor of S
➢ The number of hyperparameter value combinations grows exponentially with the number of hyperparameters
61
Summary
• Polynomial curve fitting for regression
➢ Fitting by minimizing the sum-of-squares error
➢ Regularization for alleviating over-fitting
• Probability density
➢ Expectation, variance, and covariance
➢ Gaussian distribution
62
Summary
• Bayes’ theorem
• When applying Bayes’ theorem to polynomial curve fitting,
➢ ML solution: Find 𝐰 that maximizes the likelihood function
➢ MAP solution: Find 𝐰 that maximizes the posterior probability
➢ Bayesian solution: Integrate over 𝐰
• Model selection by cross-validation
63
References
• Sections 1.1, 1.2, 1.3, and 1.4 of the PRML textbook (C. M. Bishop, Pattern Recognition and Machine Learning)
64
Thank You for Your Attention!
Yen-Yu Lin (林彥宇)
Email:
[email protected]
URL: https://www.cs.nycu.edu.tw/members/detail/lin
65