Deep Learning Basics
Lecture 3: Regularization I
Princeton University COS 495
Instructor: Yingyu Liang
What is regularization?
• In general: any method to prevent overfitting or help the optimization
• Specifically: additional terms in the training optimization objective to
prevent overfitting or help the optimization
Review: overfitting
Overfitting example: regression using polynomials
$t = \sin(2\pi x) + \epsilon$
Figure from Pattern Recognition and Machine Learning, Bishop
Overfitting example: regression using polynomials
Figure from Pattern Recognition and Machine Learning, Bishop
Overfitting
• Empirical loss and expected loss are different
• The smaller the data set, the larger the difference between the two
• The larger the hypothesis class, the easier it is to find a hypothesis that fits the
difference between the two
• Such a hypothesis has small training error but large test error (overfitting)
Prevent overfitting
• Larger data set helps
• Throwing away useless hypotheses also helps
• Classical regularization: some principled ways to constrain hypotheses
• Other types of regularization: data augmentation, early stopping, etc.
Different views of regularization
Regularization as hard constraint
• Training objective
  $\min_f \hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$
  subject to: $f \in \mathcal{H}$
• When parametrized
  $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i)$
  subject to: $\theta \in \Omega$
Regularization as hard constraint
• When $\Omega$ is measured by some quantity $R$
  $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i)$
  subject to: $R(\theta) \le r$
• Example: $l_2$ regularization
  $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i)$
  subject to: $\|\theta\|_2^2 \le r^2$
Regularization as soft constraint
• The hard-constraint optimization is equivalent to the soft-constraint form
  $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* R(\theta)$
  for some regularization parameter $\lambda^* > 0$
• Example: $l_2$ regularization
  $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$
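To make the soft-constraint objective concrete, here is a minimal NumPy sketch for a linear model with squared loss; the names (`regularized_loss`, `reg_lambda`) are illustrative, not from the lecture.

```python
import numpy as np

def regularized_loss(theta, X, y, reg_lambda):
    """Soft-constraint objective:
    (1/n) * sum_i l(theta, x_i, y_i) + lambda * ||theta||_2^2,
    with squared loss and a linear model f_theta(x) = x @ theta."""
    n = X.shape[0]
    residuals = X @ theta - y
    data_loss = np.sum(residuals ** 2) / n        # empirical loss L_hat(theta)
    penalty = reg_lambda * np.sum(theta ** 2)     # lambda * ||theta||_2^2
    return data_loss + penalty
```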
Regularization as soft constraint
• Shown via the Lagrange multiplier method
  $\mathcal{L}(\theta, \lambda) := \hat{L}(\theta) + \lambda [R(\theta) - r]$
• Suppose $\theta^*$ is the optimum of the hard-constraint optimization
  $\theta^* = \arg\min_\theta \max_{\lambda \ge 0} \mathcal{L}(\theta, \lambda)$
• Suppose $\lambda^*$ is the corresponding optimal $\lambda$ in the inner max; then
  $\theta^* = \arg\min_\theta \mathcal{L}(\theta, \lambda^*) = \arg\min_\theta \hat{L}(\theta) + \lambda^* [R(\theta) - r]$
• Since $\lambda^* r$ is a constant in $\theta$, this is exactly the soft-constraint problem with regularization parameter $\lambda^*$
Regularization as Bayesian prior
• Bayesian view: everything is a distribution
• Prior over the hypotheses: $p(\theta)$
• Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
• Likelihood: $p(\{x_i, y_i\} \mid \theta)$
• Bayes' rule:
  $p(\theta \mid \{x_i, y_i\}) = \dfrac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$
Regularization as Bayesian prior
• Bayes' rule:
  $p(\theta \mid \{x_i, y_i\}) = \dfrac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$
• Maximum A Posteriori (MAP):
  $\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \left[ \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta) \right]$
  where $\log p(\theta)$ gives the regularization term and $\log p(\{x_i, y_i\} \mid \theta)$ gives the MLE loss
Regularization as Bayesian prior
• Example: $l_2$ loss with $l_2$ regularization
  $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} \left( f_\theta(x_i) - y_i \right)^2 + \lambda^* \|\theta\|_2^2$
• Corresponds to a normal (Gaussian) likelihood $p(x, y \mid \theta)$ and a normal prior $p(\theta)$
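As a sanity check of this correspondence, the sketch below (a hypothetical helper, not from the slides) writes out the negative log-posterior assuming a conditional Gaussian model $y_i \sim \mathcal{N}(x_i^T\theta, \sigma^2)$ and a $\mathcal{N}(0, \tau^2 I)$ prior; minimizing it matches the $l_2$-regularized squared loss up to scaling, with $\lambda^* = \sigma^2/(n\tau^2)$ under these assumptions.

```python
import numpy as np

def neg_log_posterior(theta, X, y, sigma2, tau2):
    """Negative log-posterior (dropping theta-independent constants) under
    y_i ~ N(x_i @ theta, sigma2) and a N(0, tau2 * I) prior on theta.
    Rescaling by 2*sigma2/n gives the penalized objective
    (1/n) * sum_i (x_i @ theta - y_i)^2 + (sigma2 / (n * tau2)) * ||theta||_2^2."""
    neg_log_lik = np.sum((X @ theta - y) ** 2) / (2.0 * sigma2)   # -log p(data | theta)
    neg_log_prior = np.sum(theta ** 2) / (2.0 * tau2)             # -log p(theta)
    return neg_log_lik + neg_log_prior
```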
Three views
• Typical choice for optimization: soft-constraint
$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \lambda R(\theta)$
• Hard-constraint and Bayesian views: mainly conceptual, or used for derivations
Three views
• Hard-constraint preferred if
  • The explicit bound $R(\theta) \le r$ is known
  • The soft-constraint version tends to get trapped in local minima with small $\theta$
  • Projecting back onto the feasible set adds stability (see the projected-gradient sketch below)
• Bayesian view preferred if
  • The prior distribution is known
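A minimal sketch of how the hard-constraint view is often optimized in practice: projected gradient descent, here assuming an $l_2$-ball constraint. The helper names are illustrative, not from the lecture.

```python
import numpy as np

def project_l2_ball(theta, radius):
    """Project theta onto the feasible set {theta : ||theta||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def projected_gd_step(theta, grad, lr, radius):
    """One step for the hard-constraint problem: gradient step on L_hat,
    then project back onto the feasible set."""
    return project_l2_ball(theta - lr * grad, radius)
```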
Some examples
Classical regularization
• Norm penalty
  • $l_2$ regularization
  • $l_1$ regularization
• Robustness to noise
$l_2$ regularization
$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\alpha}{2} \|\theta\|_2^2$
• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of regularized objective
  $\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha \theta$
• Gradient descent update
  $\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha \theta = (1 - \eta \alpha)\theta - \eta \nabla \hat{L}(\theta)$
• Terminology: weight decay
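A one-line sketch of the update above, showing why it is called weight decay (a hypothetical helper, assuming plain gradient descent):

```python
def weight_decay_step(theta, grad, lr, alpha):
    """theta - lr * (grad + alpha * theta) == (1 - lr * alpha) * theta - lr * grad:
    the weights are first shrunk by the factor (1 - lr * alpha), then the usual
    gradient step on the unregularized loss is taken."""
    return (1.0 - lr * alpha) * theta - lr * grad
```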
Effect on the optimal solution
• Consider a quadratic approximation around 𝜃 ∗
$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$
  where $H$ is the Hessian of $\hat{L}$ at $\theta^*$
• Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so
  $\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$
  $\nabla \hat{L}(\theta) \approx H (\theta - \theta^*)$
Effect on the optimal solution
• Gradient of regularized objective
$\nabla \hat{L}_R(\theta) \approx H (\theta - \theta^*) + \alpha \theta$
• At the optimal $\theta_R^*$
  $0 = \nabla \hat{L}_R(\theta_R^*) \approx H (\theta_R^* - \theta^*) + \alpha \theta_R^*$
  $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$
Effect on the optimal solution
• The optimal
  $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$
• Suppose $H$ has the eigendecomposition $H = Q \Lambda Q^T$; then
  $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^* = Q (\Lambda + \alpha I)^{-1} \Lambda Q^T \theta^*$
• Effect: rescale along the eigenvectors of $H$; the component of $\theta^*$ along the $i$-th eigenvector is scaled by $\lambda_i / (\lambda_i + \alpha)$
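This rescaling is easy to check numerically. The snippet below builds a random positive-definite stand-in for $H$ (all values are made up for illustration) and verifies that $(H + \alpha I)^{-1} H \theta^*$ equals $Q(\Lambda + \alpha I)^{-1}\Lambda Q^T \theta^*$.

```python
import numpy as np

# Toy check of theta_R ~= (H + alpha*I)^{-1} H theta* on a synthetic Hessian.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 1e-3 * np.eye(4)        # symmetric positive-definite "Hessian"
theta_star = rng.normal(size=4)
alpha = 0.1

theta_R = np.linalg.solve(H + alpha * np.eye(4), H @ theta_star)

# Same quantity written via the eigendecomposition H = Q diag(lam) Q^T:
lam, Q = np.linalg.eigh(H)
theta_R_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ theta_star))

assert np.allclose(theta_R, theta_R_eig)
# Components with lam_i >> alpha are nearly preserved;
# components with lam_i << alpha are shrunk toward zero.
```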
Effect on the optimal solution
Notation in the figure: $\theta^* = w^*$, $\theta_R^* = \tilde{w}$
Figure from Deep Learning,
Goodfellow, Bengio and Courville
$l_1$ regularization
$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \alpha \|\theta\|_1$
• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of regularized objective
$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha\, \mathrm{sign}(\theta)$
  where sign applies to each element of $\theta$
• Gradient descent update
  $\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha\, \mathrm{sign}(\theta)$
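In code this is a subgradient step: `np.sign` applies elementwise and is a valid subgradient of $\|\theta\|_1$ (a minimal sketch, not the lecture's code).

```python
import numpy as np

def l1_subgradient_step(theta, grad, lr, alpha):
    """theta <- theta - lr * grad - lr * alpha * sign(theta),
    where sign(theta) is an elementwise subgradient of ||theta||_1."""
    return theta - lr * grad - lr * alpha * np.sign(theta)
```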
Effect on the optimal solution
• Consider a quadratic approximation around 𝜃 ∗
$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$
• Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so
  $\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$
Effect on the optimal solution
• Further assume that $H$ is diagonal and positive ($H_{ii} > 0$ for all $i$)
  • not true in general, but assumed here to build intuition
• The regularized objective is then (ignoring constants)
  $\hat{L}_R(\theta) \approx \sum_i \left[ \frac{1}{2} H_{ii} (\theta_i - \theta_i^*)^2 + \alpha |\theta_i| \right]$
• The optimal $\theta_R^*$ is, componentwise,
  $(\theta_R^*)_i \approx \begin{cases} \max\left\{\theta_i^* - \frac{\alpha}{H_{ii}},\, 0\right\} & \text{if } \theta_i^* \ge 0 \\ \min\left\{\theta_i^* + \frac{\alpha}{H_{ii}},\, 0\right\} & \text{if } \theta_i^* < 0 \end{cases}$
Effect on the optimal solution
• Effect: induce sparsity
[Figure: $(\theta_R^*)_i$ plotted against $(\theta^*)_i$; the output is exactly zero on the interval $[-\alpha/H_{ii},\, \alpha/H_{ii}]$]
Effect on the optimal solution
• Further assume that $H$ is diagonal
• Compact expression for the optimal $\theta_R^*$ (soft-thresholding)
  $(\theta_R^*)_i \approx \mathrm{sign}(\theta_i^*) \max\left\{ |\theta_i^*| - \frac{\alpha}{H_{ii}},\, 0 \right\}$
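This is the familiar soft-thresholding operator; a minimal NumPy sketch, assuming `H_diag` holds the diagonal entries $H_{ii}$:

```python
import numpy as np

def soft_threshold(theta_star, alpha, H_diag):
    """(theta_R)_i = sign(theta*_i) * max(|theta*_i| - alpha / H_ii, 0).
    Components with |theta*_i| <= alpha / H_ii become exactly zero (sparsity)."""
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - alpha / H_diag, 0.0)
```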
Bayesian view
• $l_1$ regularization corresponds to a Laplacian prior
  $p(\theta) \propto \exp\left(-\alpha \sum_i |\theta_i|\right)$
  $-\log p(\theta) = \alpha \sum_i |\theta_i| + \text{constant} = \alpha \|\theta\|_1 + \text{constant}$