1 Machine Learning
Chapter 1: Introduction
孫民 (Min Sun)
清華大學 (National Tsing Hua University)
Credit: 林嘉文 (Chia-Wen Lin)
2/22/25
3 Polynomial Curve Fitting
Data Set Size: N = 10
4 Sum-of-Squares Error Function
$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2$
5 0th Order Polynomial
6 1st Order Polynomial
7 3rd Order Polynomial
8 9th Order Polynomial
9 Over-fitting
Root-Mean-Square (RMS) Error: $E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^\star)/N}$
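A minimal Python sketch of this experiment (not from the slides): it assumes the Bishop-style setup of N = 10 noisy samples of sin(2πx), fits polynomials of order M with np.polyfit, and reports E_RMS on training and test data.

```python
# Hypothetical reconstruction of the curve-fitting experiment (assumed setup:
# N = 10 samples of sin(2*pi*x) plus Gaussian noise, as in Bishop Ch. 1).
import numpy as np

rng = np.random.default_rng(0)
N = 10
x_train = np.linspace(0.0, 1.0, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=N)
x_test = np.linspace(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

def e_rms(w, x, t):
    """E_RMS = sqrt(2 E(w*) / N), with E(w) the sum-of-squares error."""
    E = 0.5 * np.sum((np.polyval(w, x) - t) ** 2)
    return np.sqrt(2 * E / len(x))

for M in (0, 1, 3, 9):
    w = np.polyfit(x_train, t_train, deg=M)   # least-squares fit of order M
    print(f"M={M}: train E_RMS={e_rms(w, x_train, t_train):.3f}, "
          f"test E_RMS={e_rms(w, x_test, t_test):.3f}")
```

As on the slides, the training error keeps dropping with M while the test error blows up for M = 9, which is the over-fitting effect.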
10 Polynomial Coefficients
11 Data Set Size: 𝑁 = 15
9th Order Polynomial
12 Data Set Size: 𝑁 = 100
9th Order Polynomial
13 Regularization
• Penalize large coefficient values by minimizing the regularized error
$\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2$   (Eq. 1.4; a fitting sketch follows below)
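A sketch of the regularized fit, assuming the polynomial basis φ_j(x) = x^j and the penalty (λ/2)‖w‖², so the minimizer of Eq. (1.4) solves (λI + ΦᵀΦ)w = Φᵀt; the data and noise level are illustrative.

```python
# Regularized least-squares polynomial fit (a sketch under assumed basis/penalty).
import numpy as np

def fit_ridge(x, t, M, lam):
    """Minimize 0.5*||Phi w - t||^2 + 0.5*lam*||w||^2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)        # columns: x^0 ... x^M
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

for ln_lam in (None, -18.0, 0.0):                     # None = no regularization
    lam = 0.0 if ln_lam is None else np.exp(ln_lam)
    w = fit_ridge(x, t, M=9, lam=lam)
    print(f"ln(lambda)={ln_lam}: max |w_j| = {np.abs(w).max():.2f}")
```

The printed coefficient magnitudes shrink dramatically as λ grows, which is exactly the effect shown in the coefficient table on the next slide.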
14 Regularization: ln 𝜆 = −18
15 Regularization: ln 𝜆 = 0
16 Regularization: E_RMS vs. ln λ
17 Polynomial Coefficients
18 Probability Theory
Apples and Oranges
(B)ox is (b)lue or (r)ed
(F)ruit is (a)pple or (o)range
19 Probability Theory – two random variables
• Marginal Probability
• Conditional Probability
• Joint Probability
20 Probability Theory
• Sum Rule
• Product Rule
21 The Rules of Probability
• Sum Rule: $p(X) = \sum_Y p(X, Y)$
• Product Rule: $p(X, Y) = p(Y \mid X)\, p(X)$
22 Bayes’ Theorem
Product Rule: $P(X, Y) = P(Y \mid X)\, P(X)$
Since $P(X, Y) = P(Y, X)$ and $P(Y, X) = P(X \mid Y)\, P(Y)$,
we have $P(Y \mid X)\, P(X) = P(X \mid Y)\, P(Y)$, i.e.
$P(Y \mid X) = \dfrac{P(X \mid Y)\, P(Y)}{P(X)}$
Posterior = Likelihood × Prior / Evidence
Posterior ∝ Likelihood × Prior
23 Probability Theory
Apples and Oranges
$p(B=r) = 4/10 = 2/5$,  $p(B=b) = 6/10 = 3/5$
$p(F=a \mid B=r) = 2/8 = 1/4$,  $p(F=a \mid B=b) = 3/4$

Overall probability of picking an apple, $p(F=a)$?
Use the Sum Rule: $p(F=a) = p(F=a, B=r) + p(F=a, B=b)$
Use the Product Rule:
$p(F=a, B=r) = p(F=a \mid B=r)\, p(B=r)$
$p(F=a, B=b) = p(F=a \mid B=b)\, p(B=b)$
Hence,
$p(F=a) = p(F=a \mid B=r)\, p(B=r) + p(F=a \mid B=b)\, p(B=b) = (1/4)(2/5) + (3/4)(3/5) = 2/20 + 9/20 = 11/20$
$p(F=o) = 1 - p(F=a) = 9/20$
24 Probability Theory
Apples and Oranges
$p(B=r) = 4/10 = 2/5$,  $p(B=b) = 6/10 = 3/5$
$p(F=a \mid B=r) = 2/8 = 1/4$,  $p(F=a \mid B=b) = 3/4$

If the fruit is an orange, which box did it come from, $p(B \mid F=o)$?
Use Bayes' Rule:
$p(B \mid F=o) = \dfrac{p(F=o \mid B)\, p(B)}{p(F=o)}$
$p(B=r \mid F=o) = \dfrac{(3/4)(2/5)}{9/20} = 6/9 = 2/3$
$p(B=b \mid F=o) = 1 - p(B=r \mid F=o) = 1/3$
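A numeric check of the sum rule, product rule, and Bayes' theorem for this example; the box contents (red: 2 apples + 6 oranges, blue: 3 apples + 1 orange) are assumed so as to reproduce the conditional probabilities above.

```python
# Verify the apples-and-oranges computation with exact fractions.
from fractions import Fraction as F

p_box = {'r': F(4, 10), 'b': F(6, 10)}                 # prior p(B)
p_fruit_given_box = {                                  # p(F | B)
    'r': {'a': F(2, 8), 'o': F(6, 8)},
    'b': {'a': F(3, 4), 'o': F(1, 4)},
}

# Product rule: p(F, B) = p(F|B) p(B); Sum rule: p(F) = sum_B p(F, B)
p_fruit = {f: sum(p_fruit_given_box[b][f] * p_box[b] for b in p_box)
           for f in ('a', 'o')}
print(p_fruit['a'], p_fruit['o'])                      # 11/20, 9/20

# Bayes' theorem: p(B=r | F=o) = p(F=o | B=r) p(B=r) / p(F=o)
p_r_given_o = p_fruit_given_box['r']['o'] * p_box['r'] / p_fruit['o']
print(p_r_given_o)                                     # 2/3
```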
25 Probability Densities (continuous variable)
Probability density: $p(x \in (a, b)) = \int_a^b p(x)\,dx$
Cumulative distribution function: $P(z) = \int_{-\infty}^{z} p(x)\,dx$
26 Transformed Densities
If $x = g(y)$, the density transforms as $p_y(y) = p_x(g(y))\,\lvert g'(y) \rvert$, where $g'(y) = \mathrm{d}g(y)/\mathrm{d}y$.
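A Monte-Carlo check of the change-of-variables formula, using the assumed example x ~ N(0, 1) with x = g(y) = y³ (so y = x^{1/3} and |g′(y)| = 3y²):

```python
# Compare a histogram of transformed samples with p_y(y) = p_x(g(y)) * |g'(y)|.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)          # x ~ N(0, 1)
y = np.cbrt(x)                        # y such that x = g(y) = y**3

def p_x(v):
    return np.exp(-0.5 * v**2) / np.sqrt(2 * np.pi)

hist, edges = np.histogram(y, bins=40, range=(-1.6, 1.6), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
formula = p_x(centers**3) * 3 * centers**2      # transformed density
print(np.max(np.abs(hist - formula)))           # small -> formula matches samples
```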
27 Expectations
Expectation: $\mathbb{E}[f] = \sum_x p(x)\, f(x)$ (discrete) or $\int p(x)\, f(x)\,dx$ (continuous)
Approximate Expectation (discrete and continuous): $\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)$
Conditional Expectation (discrete): $\mathbb{E}_x[f \mid y] = \sum_x p(x \mid y)\, f(x)$
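A quick sampling sketch of the approximate expectation, with the illustrative assumptions f(x) = x² and x ~ N(0, 1), for which E[f] = 1 exactly:

```python
# Approximate expectation by sampling: E[f] ~ (1/N) * sum_n f(x_n).
import numpy as np

rng = np.random.default_rng(0)
for N in (10, 1_000, 100_000):
    x = rng.normal(size=N)
    print(f"N={N}: E[f] approx {(x**2).mean():.4f} (exact 1.0)")
```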
28 Variances and Covariances
When $x$ and $y$ are independent, $\mathbb{E}[xy] = \mathbb{E}[x]\,\mathbb{E}[y]$; hence $\operatorname{cov}[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\,\mathbb{E}[y] = 0$.
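A small sketch showing that independently drawn x and y have sample covariance close to zero; the choice x ~ N(0, 1), y ~ Uniform(0, 1) is an assumption for illustration.

```python
# For independent x and y, E[xy] = E[x]E[y], so cov[x, y] should be ~ 0.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = rng.uniform(size=100_000)
cov_xy = np.mean(x * y) - x.mean() * y.mean()
print(f"sample cov[x, y] = {cov_xy:.4f}  (close to 0 up to sampling noise)")
```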
29 The Gaussian Distribution
$\mathcal{N}(x \mid \mu, \sigma^2) = \dfrac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\dfrac{(x - \mu)^2}{2\sigma^2} \right\}$
30 Gaussian Mean and Variance
$\mathbb{E}[x] = \mu$,  $\operatorname{var}[x] = \sigma^2$
31 The Multivariate Gaussian
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \dfrac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$
$\Sigma$ is the covariance matrix; its diagonal entries are the variances $\sigma_i^2$.
32 Gaussian Parameter Estimation
Likelihood function
$\mathbf{x} = (x_1, x_2, \ldots, x_N)$; if the observations are i.i.d.,
$p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$
33 Two Principles for Estimating Parameters
• Maximum likelihood estimation (ML)
  Choose $\boldsymbol{\theta}$ that maximizes the probability (likelihood) of the observed data:
  $\hat{\boldsymbol{\theta}}_{\mathrm{ML}} = \arg\max_{\boldsymbol{\theta}} P(D \mid \boldsymbol{\theta})$
• Maximum a posteriori estimation (MAP)
  Choose $\boldsymbol{\theta}$ that is most probable given the prior probability and the data:
  $\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{\theta}} P(\boldsymbol{\theta} \mid D) = \arg\max_{\boldsymbol{\theta}} \dfrac{P(D \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta})}{P(D)}$
34 Maximum (Log) Likelihood
$\mathbf{x} = (x_1, x_2, \ldots, x_N)$, with $\mathbf{x}$ i.i.d.
$\boldsymbol{\theta}_{\mathrm{ML}} = \arg\max_{\boldsymbol{\theta}} p(\mathbf{x} \mid \boldsymbol{\theta})$?
Equivalently, maximize the log-likelihood:
$\boldsymbol{\theta}_{\mathrm{ML}} = \arg\max_{\boldsymbol{\theta}} \ln p(\mathbf{x} \mid \boldsymbol{\theta})$
For the Gaussian this gives $\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n$ (sample mean) and $\sigma^2_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2$ (sample variance).
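A sketch of the Gaussian ML estimates on simulated data (the generating parameters μ = 2.0, σ = 0.5 are assumed for illustration): μ_ML is the sample mean and σ²_ML the 1/N sample variance.

```python
# Gaussian maximum-likelihood estimation and the resulting log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1_000)   # assumed true mu=2.0, sigma=0.5

mu_ml = x.mean()
sigma2_ml = np.mean((x - mu_ml) ** 2)            # note: divides by N, not N-1
log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma2_ml)
                 - (x - mu_ml) ** 2 / (2 * sigma2_ml))
print(f"mu_ML={mu_ml:.3f}, sigma2_ML={sigma2_ml:.3f}, log-likelihood={log_lik:.1f}")
```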
35 Properties of $\mu_{\mathrm{ML}}$ and $\sigma^2_{\mathrm{ML}}$
$\mathbb{E}[\mu_{\mathrm{ML}}] = \mu$ (unbiased)
$\mathbb{E}[\sigma^2_{\mathrm{ML}}] = \dfrac{N-1}{N}\,\sigma^2$ (biased)
$\widetilde{\sigma}^2 = \dfrac{N}{N-1}\,\sigma^2_{\mathrm{ML}}$, with $\mathbb{E}[\widetilde{\sigma}^2] = \sigma^2$ (unbiased)
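A Monte-Carlo sketch of the bias: for samples of size N from a unit-variance Gaussian, the average of σ²_ML comes out near (N−1)/N, and the N/(N−1) correction restores σ² = 1.

```python
# Empirical check that sigma2_ML is biased by a factor (N-1)/N.
import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 200_000
samples = rng.normal(size=(trials, N))           # true sigma^2 = 1
mu_ml = samples.mean(axis=1, keepdims=True)
sigma2_ml = np.mean((samples - mu_ml) ** 2, axis=1)

print("E[sigma2_ML] approx", sigma2_ml.mean())                 # about (N-1)/N = 0.8
print("corrected estimate ", (N / (N - 1)) * sigma2_ml.mean()) # about 1.0
```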
36 Curve Fitting Re-visited
$\beta$: inverse variance (precision)
37 Maximum Likelihood
Determine $\mathbf{w}_{\mathrm{ML}}$ by minimizing the sum-of-squares error $E(\mathbf{w})$:
$\mathbf{w}_{\mathrm{ML}} = \arg\min_{\mathbf{w}} \dfrac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2$
38 ML Curve Fitting
Green: actual model; Red: predicted model
39 MAP: A Step towards Bayes
Posterior ∝ Likelihood × Prior
Determine $\mathbf{w}_{\mathrm{MAP}}$ by minimizing the regularized sum-of-squares error $\widetilde{E}(\mathbf{w})$ of Eq. (1.4).
40 Bayesian Curve Fitting
Predictive Distribution:
$p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\big(t \mid m(x), s^2(x)\big)$
For a point estimate, the fitted $\mathbf{w}$ and $\beta$ supply the mean and the precision:
$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}), \beta^{-1}\big)$
(Refer to Sec. 3.3 for the detailed derivation.)
41 ML Curve Fitting vs. Bayesian Predictive Distribution
(Panels: ML Curve Fitting, Bayesian Curve Fitting)
42 Cross Validation for Model Selection
• 5-fold cross-validation: split the training data into 5 equal folds.
• Use 4 folds for training and 1 for validation, rotating the held-out fold (a sketch follows below).
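A minimal 5-fold cross-validation loop for choosing the polynomial order M; the dataset and candidate orders below are illustrative assumptions, not the slides' exact data.

```python
# 5-fold cross-validation for model selection (polynomial order M).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=50)

K = 5
idx = rng.permutation(len(x))
folds = np.array_split(idx, K)                    # 5 equal(ish) folds

for M in (1, 3, 9):
    errs = []
    for k in range(K):
        val = folds[k]                                                    # 1 fold for validation
        train = np.concatenate([folds[j] for j in range(K) if j != k])   # 4 folds for training
        w = np.polyfit(x[train], t[train], deg=M)
        resid = np.polyval(w, x[val]) - t[val]
        errs.append(np.sqrt(np.mean(resid ** 2)))
    print(f"M={M}: mean validation RMS error = {np.mean(errs):.3f}")
```

The order with the smallest average validation error is the one cross-validation would select.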
43 Cross Validation
44 Curse of Dimensionality
45 Curse of Dimensionality
Polynomial curve fitting with order M = 3: the number of coefficients grows as $D^3$ (in general, $D^M$) for $D$ input variables.
Volume of a sphere of radius $r$ in $D$ dimensions: $V_D(r) = K_D r^D$
Fraction of that volume lying in a thin shell between radius $1-\epsilon$ and $1$:
$\dfrac{V_D(1) - V_D(1-\epsilon)}{V_D(1)} = 1 - (1-\epsilon)^D$
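A short sketch evaluating 1 − (1−ε)^D: even for a thin shell (ε = 0.01), almost all of the sphere's volume lies in the shell once D is large.

```python
# Fraction of a unit sphere's volume in a thin shell near the surface.
epsilon = 0.01
for D in (1, 2, 10, 100, 1000):
    frac = 1 - (1 - epsilon) ** D
    print(f"D={D:5d}: fraction in shell = {frac:.4f}")
```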
46 Decision Theory
Given training data $(\mathbf{x}, t)$, predict $t$ for a new $\mathbf{x}$.
• Inference step
  Determine either $p(\mathbf{x}, t)$ or $p(t \mid \mathbf{x})$.
• Decision step
  For a given $\mathbf{x}$, determine the optimal $t$, or the optimal decision/action based on $t$.
47 Minimum Misclassification Rate
Assume $t$ takes one of two classes, $C_1$ or $C_2$.
$p(\text{mistake}) = p(\mathbf{x} \in \mathcal{R}_1, C_2) + p(\mathbf{x} \in \mathcal{R}_2, C_1) = \int_{\mathcal{R}_1} p(\mathbf{x}, C_2)\,d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, C_1)\,d\mathbf{x}$
Moving the decision boundary from $\hat{x}$ to $x_0$ keeps the blue and green error regions fixed but removes the red region, minimizing the misclassification rate.
48 Minimum Expected Loss
• Example: classify medical images as 'cancer' or 'normal'
Loss matrix $L_{kj}$ (row: true class, column: decision):
                  decide cancer   decide normal
  truth: cancer        0              1000
  truth: normal        1                 0
When a cancer patient is classified as normal, the loss is 1000.
49 Minimum Expected Loss
If the true class is $C_k$ but we assign class $j$ (i.e. $\mathbf{x} \in \mathcal{R}_j$), the expected loss is
$\mathbb{E}[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, C_k)\,d\mathbf{x}$
The regions $\mathcal{R}_j$ are chosen to minimize $\sum_k L_{kj}\, p(C_k \mid \mathbf{x})$ for each $\mathbf{x}$; the common factor $p(\mathbf{x})$ is eliminated. A small numeric sketch follows below.
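A numeric sketch of the minimum-expected-loss rule for the cancer/normal example; the posterior values for the test point are assumed for illustration, while the loss matrix follows the table above.

```python
# Choose the decision j minimizing sum_k L[k, j] * p(C_k | x).
import numpy as np

# rows: true class (cancer, normal); columns: decision (cancer, normal)
L = np.array([[0.0, 1000.0],
              [1.0,    0.0]])

posterior = np.array([0.01, 0.99])       # assumed p(cancer|x), p(normal|x)
expected_loss = posterior @ L            # expected loss of each decision
decision = ["cancer", "normal"][int(np.argmin(expected_loss))]
print(expected_loss, "->", decision)     # [0.99, 10.0] -> decide 'cancer'
```

Even though the posterior probability of cancer is only 1%, the asymmetric loss makes "cancer" the minimum-expected-loss decision.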
50 Reject Option – avoid making a decision
Reject inputs $\mathbf{x}$ whose largest posterior probability $\max_k p(C_k \mid \mathbf{x})$ falls below a threshold $\theta$.
51 Generative vs Discriminative
• Generative approach:
  Model the class-conditional densities $p(\mathbf{x} \mid C_k)$ and priors $p(C_k)$ (or the joint $p(\mathbf{x}, C_k)$),
  then use Bayes' theorem to obtain the posteriors $p(C_k \mid \mathbf{x})$.
• Discriminative approach:
  Model the posteriors $p(C_k \mid \mathbf{x})$ directly.
52 Why Separate Inference and Decision?
• Minimizing risk (loss matrix may change over time)
• Reject option
• Unbalanced class priors
• Combining models
53 Decision Theory for Regression
• Inference step
  Determine $p(\mathbf{x}, t)$.
• Decision step
  For a given $\mathbf{x}$, make an optimal prediction $y(\mathbf{x})$ for $t$.
• Loss function: $\mathbb{E}[L] = \iint L\big(t, y(\mathbf{x})\big)\, p(\mathbf{x}, t)\,d\mathbf{x}\,dt$
54 The Squared Loss Function
With squared loss $L\big(t, y(\mathbf{x})\big) = \{y(\mathbf{x}) - t\}^2$, $\mathbb{E}[L]$ is minimized when $y(\mathbf{x}) = \mathbb{E}_t[t \mid \mathbf{x}]$ (the conditional mean).
55 Information Theory
56 Entropy
$h(x)$ is a monotonic function of $p(x)$ and expresses the information content ($\geq 0$):
$h(x) = -\log_2 p(x)$
If $x, y$ are independent, $p(x, y) = p(x)\,p(y)$, so
$h(x, y) = -\log_2 p(x) - \log_2 p(y) = h(x) + h(y)$
The entropy $H[x]$ is the expectation of $h(x)$:
$H[x] = -\sum_x p(x) \log_2 p(x)$
57 Entropy
Important quantity in
• coding theory
• statistical physics
• machine learning
58 Entropy
59 Entropy - coding theory
• Coding theory: $x$ is discrete with 8 possible states; how many bits are needed to transmit the state of $x$?
• If all states are equally likely, $H[x] = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = 3$ bits (a sketch of the computation follows below).
  Code: 000, 001, 010, 011, 100, 101, 110, 111
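A sketch of the entropy computation: the uniform 8-state distribution gives 3 bits, while a non-uniform 8-state distribution (the example used in Bishop's text is assumed here) gives only 2 bits on average.

```python
# Entropy in bits: H[x] = -sum_i p_i * log2(p_i).
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

uniform = np.full(8, 1 / 8)
nonuniform = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
print(entropy_bits(uniform))          # 3.0 bits
print(entropy_bits(nonuniform))       # 2.0 bits
```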
60 Entropy
61 Entropy - statistical physics
In how many ways can $N$ identical objects be allocated among $M$ bins, with $n_i$ objects in the $i$-th bin?
Number of ways to allocate (the multiplicity): $W = \dfrac{N!}{\prod_i n_i!}$
With $p_i$ the probability that an object lands in the $i$-th bin, the entropy $H = -\sum_i p_i \ln p_i$ is maximized when all $p_i = 1/M$.
64 Differential Entropy – continuous x
Put bins of width $\Delta$ along the real line; letting $\Delta \to 0$ gives the differential entropy $H[x] = -\int p(x) \ln p(x)\,dx$.
Differential entropy is maximized (for fixed $\sigma^2$ and $\mu$) when $p(x) = \mathcal{N}(x \mid \mu, \sigma^2)$,
in which case $H[x] = \frac{1}{2}\{1 + \ln(2\pi\sigma^2)\}$ (it depends only on $\sigma$).
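A Monte-Carlo check that the Gaussian's differential entropy −E[ln p(x)] matches ½{1 + ln(2πσ²)}; the value of σ is an arbitrary assumption.

```python
# Differential entropy of a Gaussian: Monte-Carlo vs. closed form.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.7
x = rng.normal(scale=sigma, size=500_000)
log_p = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)

print("Monte-Carlo H[x] ~", -log_p.mean())
print("closed form      =", 0.5 * (1 + np.log(2 * np.pi * sigma**2)))
```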
65 Conditional Entropy
$h(y \mid x) = -\log_2 p(y \mid x)$
Taking the expectation gives the conditional entropy $H[y \mid x]$, and $H[x, y] = H[y \mid x] + H[x]$.
66 The Kullback-Leibler Divergence (Relative Entropy)
The unknown distribution $p(\mathbf{x})$ is modeled by an approximation $q(\mathbf{x})$. The additional information required to specify $\mathbf{x}$ is the relative entropy
$\mathrm{KL}(p \parallel q) = -\int p(\mathbf{x}) \ln \dfrac{q(\mathbf{x})}{p(\mathbf{x})}\,d\mathbf{x} \geq 0$, with equality iff $p = q$.
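A small sketch of the discrete KL divergence, showing it is zero for identical distributions, positive otherwise, and not symmetric; the distributions p and q below are arbitrary examples.

```python
# KL(p || q) = sum_x p(x) * ln(p(x) / q(x)) for discrete distributions.
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, p))             # 0.0
print(kl(p, q), kl(q, p))   # both > 0 and generally unequal (not symmetric)
```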
67 Mutual Information
Recall $h(x) = -\log_2 p(x)$; if $x, y$ are independent, $p(x, y) = p(x)\,p(y)$ and $h(x, y) = h(x) + h(y)$.
If $x, y$ are not independent, the mutual information measures how far they are from being independent:
$I[x, y] = \mathrm{KL}\big(p(x, y) \parallel p(x)\,p(y)\big) = H[x] - H[x \mid y] = H[y] - H[y \mid x]$
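A sketch computing the mutual information of a 2×2 joint distribution as KL(p(x, y) ‖ p(x)p(y)); the joint table is an arbitrary dependent example.

```python
# Mutual information from a joint table: I[x, y] = KL(p(x, y) || p(x) p(y)).
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])               # p(x, y), entries sum to 1
p_x = p_xy.sum(axis=1, keepdims=True)         # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)         # marginal p(y)

I = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print("I[x, y] =", I, "(zero only if x and y are independent)")
```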