Introduction to Machine / Deep Learning
Hung-yi Lee 李宏毅
Machine Learning
≈ Looking for a Function
• Speech Recognition
  f( audio clip ) = "How are you"
• Image Recognition
  f( picture of a cat ) = "Cat"
• Playing Go
  f( board position ) = "5-5" (next move)
Different types of Functions
Regression: The function outputs a scalar.
  Example: predict tomorrow's PM2.5.
  f( PM2.5 today, temperature, concentration of O3 ) = PM2.5 of tomorrow
Classification: Given options (classes), the function outputs the correct one.
  Example: spam filtering.
  f( an email ) = Yes / No
Different types of Functions
Classification: Given options (classes), the function outputs the correct one.
  Example: playing Go.
  f( a position on the board ) = next move
  Each position on the board is a class (19 × 19 classes).
Structured Learning: beyond regression and classification,
create something with structure (an image, a document).
How to find a function?
A Case Study
YouTube Channel
https://www.youtube.com/c/HungyiLeeNTU
The function we want to find …
  y = f( … ),  y: no. of views on 2/26
1. Function with Unknown Parameters
Model (based on domain knowledge):  y = b + w·x₁
  y: no. of views on 2/26
  x₁: no. of views on 2/25  (feature)
  w and b are unknown parameters (learned from data):
  w is the weight, b is the bias.
2. Define Loss from Training Data
• Loss is a function of the parameters: L(b, w).
• Loss: how good a set of values is.
Example: how good is L(0.5k, 1), i.e. y = 0.5k + 1·x₁?
Data from 2017/01/01 – 2020/12/31:
  2017/01/01  01/02  01/03  ……  2020/12/30  12/31
  4.8k        4.9k   7.5k        3.4k        9.8k
Predicting 01/02 from 01/01:  y = 0.5k + 1 × 4.8k = 5.3k,
label ŷ = 4.9k, so the error is  e₁ = |y − ŷ| = 0.4k.
Similarly, predicting 01/03 from 01/02:  y = 0.5k + 1 × 4.9k = 5.4k,
label ŷ = 7.5k, so  e₂ = |y − ŷ| = 2.1k; and so on (e₃, …) for every
day through 2020/12/31.
Collect one error per day over the whole training set and average:
  Loss:  L = (1/N) Σₙ eₙ
• e = |y − ŷ|  →  L is the mean absolute error (MAE)
• e = (y − ŷ)²  →  L is the mean squared error (MSE)
• If y and ŷ are both probability distributions → cross-entropy.
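As a concrete illustration, here is a minimal Python sketch of this loss computation for the linear model; the list of daily view counts is partly hypothetical (only the first few values come from the slide):

```python
import numpy as np

views = np.array([4.8, 4.9, 7.5, 3.4, 9.8])  # daily views in k (partly hypothetical)
x1 = views[:-1]      # x1: views on the previous day
y_hat = views[1:]    # label: views on the next day

def loss(b, w, kind="MAE"):
    y = b + w * x1                       # model prediction y = b + w*x1
    e = np.abs(y - y_hat) if kind == "MAE" else (y - y_hat) ** 2
    return e.mean()                      # L = (1/N) * sum of e_n

print(loss(0.5, 1.0))          # L(0.5k, 1) with MAE
print(loss(0.5, 1.0, "MSE"))   # same parameter values with MSE
```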
Trying many different values of w and b for the model y = b + w·x₁ and
computing L for each gives an error surface over the (w, b) plane:
some regions have small L, others large L.
[Figure: contour plot of the error surface L(b, w).]
Source of image: http://chico386.pixnet.net/album/photo/171572850
3. Optimization
  w*, b* = arg min over w, b of L
Gradient Descent (one parameter w for now):
• (Randomly) pick an initial value w⁰.
• Compute the slope ∂L/∂w |_{w=w⁰}.
  Negative slope → increase w;  positive slope → decrease w.
• Update w:  w¹ ← w⁰ − η · ∂L/∂w |_{w=w⁰}
  η is the learning rate, a hyperparameter (set by you, not learned).
• Update w iteratively: w⁰ → w¹ → w² → … → wᵀ.
A possible worry: gradient descent may stop at a local minimum rather
than the global minimum. (Do local minima truly cause the problem?)
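To make the update rule concrete, here is a minimal pure-Python sketch of 1-D gradient descent; the loss function and all parameter values are illustrative, not from the lecture:

```python
def L(w):
    return (w - 3.0) ** 2 + 1.0     # hypothetical loss, minimum at w = 3

def dL_dw(w):
    return 2.0 * (w - 3.0)          # its derivative

eta = 0.1                           # learning rate (hyperparameter)
w = 0.0                             # initial value w0
for _ in range(100):
    w = w - eta * dL_dw(w)          # w <- w - eta * dL/dw
print(w)                            # ends up close to 3.0, the minimum
```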
3. Optimization (both parameters)
  w*, b* = arg min over w, b of L
• (Randomly) pick initial values w⁰, b⁰.
• Compute both partial derivatives and update:
  w¹ ← w⁰ − η · ∂L/∂w |_{w=w⁰, b=b⁰}
  b¹ ← b⁰ − η · ∂L/∂b |_{w=w⁰, b=b⁰}
  (Computing the gradients can be done in one line in most deep
  learning frameworks.)
• Update w and b iteratively.
Model:  y = b + w·x₁
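For instance, here is a minimal PyTorch sketch of one such update; the data values echo the slide, and PyTorch is just one example of "most deep learning frameworks":

```python
import torch

x1 = torch.tensor([4.8, 4.9, 7.5, 3.4])      # views on day t (k)
y_hat = torch.tensor([4.9, 7.5, 3.4, 9.8])   # views on day t+1 (labels)

w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

y = b + w * x1                     # model prediction
loss = (y - y_hat).abs().mean()    # MAE loss L(b, w)
loss.backward()                    # the "one line": fills in w.grad and b.grad

eta = 0.01
with torch.no_grad():
    w -= eta * w.grad              # w1 <- w0 - eta * dL/dw
    b -= eta * b.grad              # b1 <- b0 - eta * dL/db
```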
Running gradient descent on this problem gives
  w* = 0.97,  b* = 0.1k,  L(w*, b*) = 0.48k
[Figure: trajectory of (w, b) on the error surface; each step moves by
(−η·∂L/∂w, −η·∂L/∂b).]
Machine Learning is so simple ……
Step 1: function with unknown parameters → Step 2: define loss from
training data → Step 3: optimization.
Training result for y = b + w·x₁:  w* = 0.97, b* = 0.1k.
y = 0.1k + 0.97·x₁ achieves the smallest loss L = 0.48k on the data of
2017 – 2020 (training data).
How about the data of 2021 (unseen during training)?  L′ = 0.58k
[Plot, 2021/01/01 – 2021/02/14: red = real no. of views, blue = no. of
views estimated by y = 0.1k + 0.97·x₁.]
Using more past days as features:
  Model                      2017 – 2020   2021
  y = b + w·x₁               L = 0.48k     L′ = 0.58k
  y = b + Σⱼ₌₁⁷ wⱼ·xⱼ        L = 0.38k     L′ = 0.49k
  y = b + Σⱼ₌₁²⁸ wⱼ·xⱼ       L = 0.33k     L′ = 0.46k
  y = b + Σⱼ₌₁⁵⁶ wⱼ·xⱼ       L = 0.32k     L′ = 0.46k
Learned parameters for the 7-day model:
  b = 0.05k,  w₁* = 0.79, w₂* = −0.31, w₃* = 0.12, w₄* = −0.01,
  w₅* = −0.10, w₆* = 0.30, w₇* = 0.18
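As a sketch, the 7-day model with the learned weights above can be evaluated like this; the ordering convention (xⱼ = views j days ago) and most of the input values are assumptions for illustration:

```python
import numpy as np

# Learned parameters of the 7-day model (from the table above).
w = np.array([0.79, -0.31, 0.12, -0.01, -0.10, 0.30, 0.18])
b = 0.05                                     # in k views

# x_j: views j days ago; these values are illustrative.
x = np.array([9.8, 3.4, 7.5, 4.9, 4.8, 5.1, 6.2])

y = b + w @ x                                # y = b + sum_j w_j * x_j
print(f"predicted views: {y:.2f}k")
```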
Linear models
For a linear model y = b + w·x₁, different w values change the slope
and different b values shift the line, but the relation between x₁ and
y is always a straight line. This severe limitation is called model
bias. Linear models are too simple; we need a more flexible,
sophisticated model!
red curve = constant + sum of a set of hard-sigmoid curves (the blue
curves in the slides).
All piecewise linear curves = constant + sum of a set of hard
sigmoids; more pieces require more of them.
Beyond piecewise linear? Approximate a continuous curve y(x₁) by a
piecewise linear curve; to have a good approximation, we need
sufficient pieces.
red curve = constant + sum of a set of hard sigmoids. How do we
represent one hard sigmoid? Approximate it with the sigmoid function:
  y = c · 1 / (1 + e^(−(b + w·x₁))) = c · sigmoid(b + w·x₁)
• Different w: changes the slope
• Different b: shifts the curve
• Different c: changes the height
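A minimal Python sketch of this building block; the parameter values passed in are illustrative:

```python
import numpy as np

def sigmoid_unit(x1, c, b, w):
    """c * sigmoid(b + w*x1): w sets the slope, b shifts it, c sets the height."""
    return c / (1.0 + np.exp(-(b + w * x1)))

x1 = np.linspace(-5, 5, 11)
print(sigmoid_unit(x1, c=1.0, b=0.0, w=1.0))    # a basic rising curve
print(sigmoid_unit(x1, c=2.0, b=1.0, w=-0.5))   # different c, b, w reshape it
```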
red curve = constant + sum of a set of sigmoid components, e.g. three
components c₁·sigmoid(b₁ + w₁·x₁), c₂·sigmoid(b₂ + w₂·x₁), and
c₃·sigmoid(b₃ + w₃·x₁):
  y = b + Σᵢ cᵢ · sigmoid(bᵢ + wᵢ·x₁)
New Model: More Features
From one feature:
  y = b + w·x₁        →   y = b + Σᵢ cᵢ · sigmoid(bᵢ + wᵢ·x₁)
To multiple features:
  y = b + Σⱼ wⱼ·xⱼ    →   y = b + Σᵢ cᵢ · sigmoid(bᵢ + Σⱼ wᵢⱼ·xⱼ)
where j indexes the features and i indexes the sigmoids
(both 1, 2, 3 in the running example).
Writing out the sums inside the sigmoids for i, j ∈ {1, 2, 3}
(wᵢⱼ: weight on feature xⱼ for the i-th sigmoid):
  r₁ = b₁ + w₁₁·x₁ + w₁₂·x₂ + w₁₃·x₃
  r₂ = b₂ + w₂₁·x₁ + w₂₂·x₂ + w₂₃·x₃
  r₃ = b₃ + w₃₁·x₁ + w₃₂·x₂ + w₃₃·x₃
In matrix form:
  ⎡r₁⎤   ⎡b₁⎤   ⎡w₁₁ w₁₂ w₁₃⎤ ⎡x₁⎤
  ⎢r₂⎥ = ⎢b₂⎥ + ⎢w₂₁ w₂₂ w₂₃⎥ ⎢x₂⎥      i.e.  𝒓 = 𝒃 + W 𝒙
  ⎣r₃⎦   ⎣b₃⎦   ⎣w₃₁ w₃₂ w₃₃⎦ ⎣x₃⎦
[Diagram: the three sums r₁, r₂, r₃ computed from x₁, x₂, x₃, drawn as
a network and written compactly as 𝒓 = 𝒃 + W 𝒙.]
Next, pass each rᵢ through the sigmoid:
  a₁ = sigmoid(r₁) = 1 / (1 + e^(−r₁)),  and likewise a₂, a₃.
In vector form:  𝒂 = σ(𝒓)
Finally, combine the activations with the weights cᵢ and the constant b:
  y = b + c₁·a₁ + c₂·a₂ + c₃·a₃ = b + 𝒄ᵀ𝒂
Putting it together:
  𝒓 = 𝒃 + W 𝒙,   𝒂 = σ(𝒓),   y = b + 𝒄ᵀ𝒂
so the whole model is
  y = b + 𝒄ᵀ σ(𝒃 + W 𝒙)
Function with unknown parameters
  y = b + 𝒄ᵀ σ(𝒃 + W 𝒙)
𝒙 is the feature vector; the unknown parameters are the matrix W, the
vector 𝒃, the vector 𝒄, and the scalar b. Collect them all (e.g. the
rows of W, then the biases, …) into one long vector:
  𝜽 = [θ₁, θ₂, θ₃, …]ᵀ
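A minimal NumPy sketch of this forward pass; the shapes (3 sigmoids, 3 features) and all parameter values are illustrative:

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def model(x, W, b_vec, c, b):
    """y = b + c^T sigma(b_vec + W x) for one feature vector x."""
    r = b_vec + W @ x     # r = b + W x
    a = sigmoid(r)        # a = sigma(r)
    return b + c @ a      # y = b + c^T a

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))       # 3 sigmoids x 3 features
b_vec = rng.normal(size=3)
c = rng.normal(size=3)
b = 0.0
x = np.array([4.8, 4.9, 7.5])     # e.g. views on three past days (in k)
print(model(x, W, b_vec, c, b))
```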
Back to ML Framework
Step 1: function with unknown parameters → Step 2: define loss from
training data → Step 3: optimization.
  y = b + 𝒄ᵀ σ(𝒃 + W 𝒙)
Loss
• Loss is a function of the parameters: L(𝜽).
• Loss means how good a set of values is.
Given a set of values 𝜽: feed each feature vector 𝒙 into
y = b + 𝒄ᵀ σ(𝒃 + W 𝒙), compare the prediction y with the label ŷ to
get an error e, and average:
  Loss:  L = (1/N) Σₙ eₙ
Back to ML Framework
Step 1: function with unknown parameters → Step 2: define loss from
training data → Step 3: optimization.
  y = b + 𝒄ᵀ σ(𝒃 + W 𝒙)
Optimization of New Model
  𝜽* = arg min over 𝜽 of L,   𝜽 = [θ₁, θ₂, θ₃, …]ᵀ
• (Randomly) pick initial values 𝜽⁰.
• Compute the gradient 𝒈 = ∇L(𝜽⁰), whose components are
  ∂L/∂θ₁ |_{𝜽=𝜽⁰}, ∂L/∂θ₂ |_{𝜽=𝜽⁰}, …
• Update every component at once:
  𝜽¹ ← 𝜽⁰ − η𝒈,  i.e.  θᵢ¹ ← θᵢ⁰ − η · ∂L/∂θᵢ |_{𝜽=𝜽⁰}
Optimization of New Model
  𝜽* = arg min over 𝜽 of L
• (Randomly) pick initial values 𝜽⁰.
• Compute gradient 𝒈 = ∇L(𝜽⁰);  update 𝜽¹ ← 𝜽⁰ − η𝒈
• Compute gradient 𝒈 = ∇L(𝜽¹);  update 𝜽² ← 𝜽¹ − η𝒈
• Compute gradient 𝒈 = ∇L(𝜽²);  update 𝜽³ ← 𝜽² − η𝒈
  … and so on.
Optimization of New Model (mini-batches)
  𝜽* = arg min over 𝜽 of L
In practice, divide the N training examples into batches of size B:
• (Randomly) pick initial values 𝜽⁰.
• Compute gradient 𝒈 = ∇L₁(𝜽⁰) on batch 1;  update 𝜽¹ ← 𝜽⁰ − η𝒈
• Compute gradient 𝒈 = ∇L₂(𝜽¹) on batch 2;  update 𝜽² ← 𝜽¹ − η𝒈
• Compute gradient 𝒈 = ∇L₃(𝜽²) on batch 3;  update 𝜽³ ← 𝜽² − η𝒈
  … where each Lₖ is the loss computed on one batch only.
1 epoch = seeing all the batches once.
Optimization of New Model
Example 1:
• 10,000 examples (N = 10,000), batch size 10 (B = 10)
• How many updates in 1 epoch?  10,000 / 10 = 1,000 updates
Example 2:
• 1,000 examples (N = 1,000), batch size 100 (B = 100)
• How many updates in 1 epoch?  1,000 / 100 = 10 updates
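A minimal sketch of one epoch of mini-batch gradient descent on the linear model, assuming MSE loss and placeholder data (all values are illustrative):

```python
import numpy as np

N, B, eta = 10_000, 10, 0.01          # as in Example 1: 1,000 updates per epoch
rng = np.random.default_rng(0)
x = rng.normal(size=N)                # placeholder feature values
y_hat = rng.normal(size=N)            # placeholder labels

w, b = 0.0, 0.0
order = rng.permutation(N)            # split shuffled examples into batches
for start in range(0, N, B):          # one pass over all batches = 1 epoch
    idx = order[start:start + B]
    y = b + w * x[idx]                # prediction on this batch only
    # gradient of the batch MSE loss L_k = mean((y - y_hat)^2)
    g_w = 2 * np.mean((y - y_hat[idx]) * x[idx])
    g_b = 2 * np.mean(y - y_hat[idx])
    w -= eta * g_w                    # one update per batch
    b -= eta * g_b
```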
Back to ML Framework
Step 1: function with unknown parameters → Step 2: define loss from
training data → Step 3: optimization.
  y = b + 𝒄ᵀ σ(𝒃 + W 𝒙)
More variety of models …
Sigmoid → ReLU
How else can we represent the hard sigmoid? As the sum of two
Rectified Linear Units (ReLU), each of the form c·max(0, b + w·x₁):
  hard sigmoid = c·max(0, b + w·x₁) + c′·max(0, b′ + w′·x₁)
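A minimal sketch of this two-ReLU construction; the specific corner points (lo, hi) and height are assumptions for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hard_sigmoid(x1, lo=0.0, hi=2.0, height=1.0):
    """Hard sigmoid as a sum of two ReLUs: flat at 0 below lo,
    a linear ramp between lo and hi, flat at `height` above hi."""
    slope = height / (hi - lo)
    # first ReLU starts the ramp at lo; the second (with c' = -slope)
    # cancels the slope at hi, leaving the curve flat again
    return slope * relu(x1 - lo) - slope * relu(x1 - hi)

x1 = np.linspace(-1, 3, 9)
print(hard_sigmoid(x1))   # 0 ... ramp ... 1
```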
Sigmoid → ReLU
  y = b + Σᵢ cᵢ · sigmoid(bᵢ + Σⱼ wᵢⱼ·xⱼ)
becomes (with twice as many terms, since two ReLUs make one hard
sigmoid):
  y = b + Σ₂ᵢ cᵢ · max(0, bᵢ + Σⱼ wᵢⱼ·xⱼ)
sigmoid and max(0, ·) are both called activation functions.
Which one is better?
Experimental Results
  y = b + Σ₂ᵢ cᵢ · max(0, bᵢ + Σⱼ wᵢⱼ·xⱼ)
               linear   10 ReLU   100 ReLU   1000 ReLU
  2017 – 2020   0.32k    0.32k     0.28k      0.27k
  2021          0.46k    0.45k     0.43k      0.43k
Back to ML Framework
Step 1: function with unknown parameters → Step 2: define loss from
training data → Step 3: optimization.
  y = b + 𝒄ᵀ σ(𝒃 + W 𝒙)
Even more variety of models …
Apply the same block repeatedly: the outputs 𝒂 of one layer become the
inputs of the next, with sigmoid (or ReLU) in between:
  𝒂 = σ(𝒃 + W 𝒙),   𝒂′ = σ(𝒃′ + W′ 𝒂),   ……
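A minimal sketch of this layer stacking in NumPy; the layer container, random parameter values, and input are assumptions (the shapes echo the experiment below: 100 units per layer, 56 past days as features):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def deep_model(x, layers, c, b_out):
    """Repeat a' = sigma(b' + W' a) layer by layer, then y = b + c^T a.
    `layers` is a hypothetical list of (W, b) pairs."""
    a = x
    for W, b_vec in layers:
        a = sigmoid(b_vec + W @ a)   # could equally use relu(...)
    return b_out + c @ a

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(100, 56)), rng.normal(size=100)),
          (rng.normal(size=(100, 100)), rng.normal(size=100))]
c, b_out = rng.normal(size=100), 0.0
x = rng.normal(size=56)              # e.g. views in the past 56 days
print(deep_model(x, layers, c, b_out))
```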
Experimental Results
• Loss for multiple hidden layers
• 100 ReLU for each layer
• Input features are the no. of views in the past 56 days
               1 layer   2 layers   3 layers   4 layers
  2017 – 2020   0.28k     0.18k      0.14k      0.10k
  2021          0.43k     0.39k      0.38k      0.44k
[Plot, 2021/01/01 – 2021/02/14, 3-layer model: red = real no. of
views, blue = estimated no. of views.]
Back to ML Framework
Step 1: function with unknown parameters → Step 2: define loss from
training data → Step 3: optimization.
  y = b + 𝒄ᵀ σ(𝒃 + W 𝒙)
It is not fancy enough. Let's give it a fancy name!
Each sigmoid or ReLU unit is a Neuron; many connected neurons form a
Neural Network. This mimics human brains … (???)
The layers of neurons between input and output are hidden layers, and
many layers means Deep: hence Deep Learning.
Deep = Many hidden layers
  AlexNet (2012): 8 layers, 16.4% error
  VGG (2014): 19 layers, 7.3% error
  GoogleNet (2014): 22 layers, 6.7% error
(Source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf)
Deep = Many hidden layers
  Residual Net (2015): 152 layers (special structure), 3.57% error,
  deeper than the 101 floors of Taipei 101.
  (Recall: AlexNet (2012) 16.4%, VGG (2014) 7.3%, GoogleNet (2014) 6.7%.)
Why do we want a "deep" network, not a "fat" one?
Why don’t we go deeper?
• Loss for multiple hidden layers
• 100 ReLU for each layer
• input features are the no. of views in the past 56
days
1 layer 2 layer 3 layer 4 layer
2017 – 2020 0.28k 0.18k 0.14k 0.10k
2021 0.43k 0.39k 0.38k 0.44k
Why don’t we go deeper?
• Loss for multiple hidden layers
• 100 ReLU for each layer
• input features are the no. of views in the past 56
days
1 layer 2 layer 3 layer 4 layer
2017 – 2020 0.28k 0.18k 0.14k 0.10k
2021 0.43k 0.39k 0.38k 0.44k
Better on training data, worse on unseen data
Overfitting
Let's predict no. of views today!
• If we want to select a model for predicting the no. of views today,
  which one would you use?
               1 layer   2 layers   3 layers   4 layers
  2017 – 2020   0.28k     0.18k      0.14k      0.10k
  2021          0.43k     0.39k      0.38k      0.44k
We will talk about model selection next time. ☺
To learn more ……
• Basic introduction: https://youtu.be/Dr-WRlEFefw
• Backpropagation (computing gradients in an efficient way):
  https://youtu.be/ibJpTrp5mcE