Deep Learning Basics
Lecture 3: Regularization I
Princeton University COS 495
Instructor: Yingyu Liang
What is regularization?
• In general: any method to prevent overfitting or help the optimization
• Specifically: additional terms in the training optimization objective to
prevent overfitting or help the optimization
Review: overfitting
Overfitting example: regression using polynomials
$t = \sin(2\pi x) + \epsilon$
Figure from Pattern Recognition and Machine Learning, Bishop
Overfitting example: regression using polynomials
Figure from Pattern Recognition and Machine Learning, Bishop
Overfitting
• Empirical loss and expected loss are different
• The smaller the data set, the larger the difference between the two
• The larger the hypothesis class, the easier it is to find a hypothesis that fits the
difference between the two
• Such a hypothesis has small training error but large test error (overfitting)
Prevent overfitting
• Larger data set helps
• Throwing away useless hypotheses also helps
• Classical regularization: some principled ways to constrain hypotheses
• Other types of regularization: data augmentation, early stopping, etc.
Different views of regularization
Regularization as hard constraint
• Training objective
  $\min_f \hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$
  subject to: $f \in \mathcal{H}$
• When parametrized
  $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i)$
  subject to: $\theta \in \Omega$
Regularization as hard constraint
• When $\Omega$ is measured by some quantity $R$
  $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i)$
  subject to: $R(\theta) \le r$
• Example: $l_2$ regularization
  $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i)$
  subject to: $\|\theta\|_2^2 \le r^2$
Regularization as soft constraint
• The hard-constraint optimization is equivalent to the soft-constraint form
  $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* R(\theta)$
  for some regularization parameter $\lambda^* > 0$
• Example: $l_2$ regularization
  $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$
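To make the soft-constraint objective concrete, here is a minimal NumPy sketch for a linear model with squared loss; the names (`regularized_loss`, `reg_lambda`) are illustrative, not from the lecture.

```python
import numpy as np

def regularized_loss(theta, X, y, reg_lambda):
    """Soft-constraint objective:
    (1/n) * sum_i l(theta, x_i, y_i) + lambda * ||theta||_2^2,
    with squared loss and a linear model f_theta(x) = x @ theta."""
    n = X.shape[0]
    residuals = X @ theta - y
    data_loss = np.sum(residuals ** 2) / n        # empirical loss L_hat(theta)
    penalty = reg_lambda * np.sum(theta ** 2)     # lambda * ||theta||_2^2
    return data_loss + penalty
```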
Regularization as soft constraint
• Shown via the Lagrange multiplier method
  $\mathcal{L}(\theta, \lambda) := \hat{L}(\theta) + \lambda [R(\theta) - r]$
• Suppose $\theta^*$ is the optimum of the hard-constraint optimization
  $\theta^* = \arg\min_\theta \max_{\lambda \ge 0} \mathcal{L}(\theta, \lambda)$
• Suppose $\lambda^*$ is the corresponding optimal $\lambda$ in the inner max; then
  $\theta^* = \arg\min_\theta \mathcal{L}(\theta, \lambda^*) = \arg\min_\theta \hat{L}(\theta) + \lambda^* [R(\theta) - r]$
• Since $\lambda^* r$ is a constant in $\theta$, this is exactly the soft-constraint problem with regularization parameter $\lambda^*$
Regularization as Bayesian prior
• Bayesian view: everything is a distribution
• Prior over the hypotheses: $p(\theta)$
• Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
• Likelihood: $p(\{x_i, y_i\} \mid \theta)$
• Bayes' rule:
  $p(\theta \mid \{x_i, y_i\}) = \dfrac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$
Regularization as Bayesian prior
• Bayes' rule:
  $p(\theta \mid \{x_i, y_i\}) = \dfrac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$
• Maximum A Posteriori (MAP):
  $\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \left[ \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta) \right]$
  where $\log p(\theta)$ gives the regularization term and $\log p(\{x_i, y_i\} \mid \theta)$ gives the MLE loss
Regularization as Bayesian prior
• Example: $l_2$ loss with $l_2$ regularization
  $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} \left( f_\theta(x_i) - y_i \right)^2 + \lambda^* \|\theta\|_2^2$
• Corresponds to a normal (Gaussian) likelihood $p(x, y \mid \theta)$ and a normal prior $p(\theta)$
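As a sanity check of this correspondence, the sketch below (a hypothetical helper, not from the slides) writes out the negative log-posterior assuming a conditional Gaussian model $y_i \sim \mathcal{N}(x_i^T\theta, \sigma^2)$ and a $\mathcal{N}(0, \tau^2 I)$ prior; minimizing it matches the $l_2$-regularized squared loss up to scaling, with $\lambda^* = \sigma^2/(n\tau^2)$ under these assumptions.

```python
import numpy as np

def neg_log_posterior(theta, X, y, sigma2, tau2):
    """Negative log-posterior (dropping theta-independent constants) under
    y_i ~ N(x_i @ theta, sigma2) and a N(0, tau2 * I) prior on theta.
    Rescaling by 2*sigma2/n gives the penalized objective
    (1/n) * sum_i (x_i @ theta - y_i)^2 + (sigma2 / (n * tau2)) * ||theta||_2^2."""
    neg_log_lik = np.sum((X @ theta - y) ** 2) / (2.0 * sigma2)   # -log p(data | theta)
    neg_log_prior = np.sum(theta ** 2) / (2.0 * tau2)             # -log p(theta)
    return neg_log_lik + neg_log_prior
```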
Three views
• Typical choice for optimization: soft-constraint
$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \lambda R(\theta)$
• Hard-constraint and Bayesian views: mainly conceptual, or used for derivations
Three views
• Hard-constraint preferred if
  • The explicit bound $R(\theta) \le r$ is known
  • The soft-constraint version tends to get trapped in local minima with small $\theta$
  • Projecting back onto the feasible set adds stability (see the projected-gradient sketch below)
• Bayesian view preferred if
  • The prior distribution is known
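A minimal sketch of how the hard-constraint view is often optimized in practice: projected gradient descent, here assuming an $l_2$-ball constraint. The helper names are illustrative, not from the lecture.

```python
import numpy as np

def project_l2_ball(theta, radius):
    """Project theta onto the feasible set {theta : ||theta||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def projected_gd_step(theta, grad, lr, radius):
    """One step for the hard-constraint problem: gradient step on L_hat,
    then project back onto the feasible set."""
    return project_l2_ball(theta - lr * grad, radius)
```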
Some examples
Classical regularization
• Norm penalty
  • $l_2$ regularization
  • $l_1$ regularization
• Robustness to noise
$l_2$ regularization
$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\alpha}{2} \|\theta\|_2^2$
• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of regularized objective
  $\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha \theta$
• Gradient descent update
  $\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha \theta = (1 - \eta \alpha)\theta - \eta \nabla \hat{L}(\theta)$
• Terminology: weight decay
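A one-line sketch of the update above, showing why it is called weight decay (a hypothetical helper, assuming plain gradient descent):

```python
def weight_decay_step(theta, grad, lr, alpha):
    """theta - lr * (grad + alpha * theta) == (1 - lr * alpha) * theta - lr * grad:
    the weights are first shrunk by the factor (1 - lr * alpha), then the usual
    gradient step on the unregularized loss is taken."""
    return (1.0 - lr * alpha) * theta - lr * grad
```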
Effect on the optimal solution
• Consider a quadratic approximation around 𝜃 ∗
$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$
  where $H$ is the Hessian of $\hat{L}$ at $\theta^*$
• Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so
  $\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$
  $\nabla \hat{L}(\theta) \approx H (\theta - \theta^*)$
Effect on the optimal solution
• Gradient of regularized objective
$\nabla \hat{L}_R(\theta) \approx H (\theta - \theta^*) + \alpha \theta$
• At the optimal $\theta_R^*$
  $0 = \nabla \hat{L}_R(\theta_R^*) \approx H (\theta_R^* - \theta^*) + \alpha \theta_R^*$
  $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$
Effect on the optimal solution
• The optimal
  $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$
• Suppose $H$ has the eigendecomposition $H = Q \Lambda Q^T$; then
  $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^* = Q (\Lambda + \alpha I)^{-1} \Lambda Q^T \theta^*$
• Effect: rescale along the eigenvectors of $H$; the component of $\theta^*$ along the $i$-th eigenvector is scaled by $\lambda_i / (\lambda_i + \alpha)$
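This rescaling is easy to check numerically. The snippet below builds a random positive-definite stand-in for $H$ (all values are made up for illustration) and verifies that $(H + \alpha I)^{-1} H \theta^*$ equals $Q(\Lambda + \alpha I)^{-1}\Lambda Q^T \theta^*$.

```python
import numpy as np

# Toy check of theta_R ~= (H + alpha*I)^{-1} H theta* on a synthetic Hessian.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 1e-3 * np.eye(4)        # symmetric positive-definite "Hessian"
theta_star = rng.normal(size=4)
alpha = 0.1

theta_R = np.linalg.solve(H + alpha * np.eye(4), H @ theta_star)

# Same quantity written via the eigendecomposition H = Q diag(lam) Q^T:
lam, Q = np.linalg.eigh(H)
theta_R_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ theta_star))

assert np.allclose(theta_R, theta_R_eig)
# Components with lam_i >> alpha are nearly preserved;
# components with lam_i << alpha are shrunk toward zero.
```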
Effect on the optimal solution
Notation in the figure: $\theta^* = w^*$, $\theta_R^* = \tilde{w}$
Figure from Deep Learning,
Goodfellow, Bengio and Courville
$l_1$ regularization
$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \alpha \|\theta\|_1$
• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of regularized objective
$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha\, \mathrm{sign}(\theta)$
  where sign applies to each element of $\theta$
• Gradient descent update
  $\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha\, \mathrm{sign}(\theta)$
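In code this is a subgradient step: `np.sign` applies elementwise and is a valid subgradient of $\|\theta\|_1$ (a minimal sketch, not the lecture's code).

```python
import numpy as np

def l1_subgradient_step(theta, grad, lr, alpha):
    """theta <- theta - lr * grad - lr * alpha * sign(theta),
    where sign(theta) is an elementwise subgradient of ||theta||_1."""
    return theta - lr * grad - lr * alpha * np.sign(theta)
```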
Effect on the optimal solution
• Consider a quadratic approximation around 𝜃 ∗
$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$
• Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so
  $\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$
Effect on the optimal solution
• Further assume that $H$ is diagonal and positive ($H_{ii} > 0$ for all $i$)
  • not true in general, but assumed here to build intuition
• The regularized objective is then (ignoring constants)
  $\hat{L}_R(\theta) \approx \sum_i \left[ \frac{1}{2} H_{ii} (\theta_i - \theta_i^*)^2 + \alpha |\theta_i| \right]$
• The optimal $\theta_R^*$ is, componentwise,
  $(\theta_R^*)_i \approx \begin{cases} \max\left\{\theta_i^* - \frac{\alpha}{H_{ii}},\, 0\right\} & \text{if } \theta_i^* \ge 0 \\ \min\left\{\theta_i^* + \frac{\alpha}{H_{ii}},\, 0\right\} & \text{if } \theta_i^* < 0 \end{cases}$
Effect on the optimal solution
• Effect: induce sparsity
[Figure: $(\theta_R^*)_i$ plotted against $(\theta^*)_i$; the output is exactly zero on the interval $[-\alpha/H_{ii},\, \alpha/H_{ii}]$]
Effect on the optimal solution
• Further assume that $H$ is diagonal
• Compact expression for the optimal $\theta_R^*$ (soft-thresholding)
  $(\theta_R^*)_i \approx \mathrm{sign}(\theta_i^*) \max\left\{ |\theta_i^*| - \frac{\alpha}{H_{ii}},\, 0 \right\}$
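This is the familiar soft-thresholding operator; a minimal NumPy sketch, assuming `H_diag` holds the diagonal entries $H_{ii}$:

```python
import numpy as np

def soft_threshold(theta_star, alpha, H_diag):
    """(theta_R)_i = sign(theta*_i) * max(|theta*_i| - alpha / H_ii, 0).
    Components with |theta*_i| <= alpha / H_ii become exactly zero (sparsity)."""
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - alpha / H_diag, 0.0)
```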
Bayesian view
• $l_1$ regularization corresponds to a Laplacian prior
  $p(\theta) \propto \exp\left(-\alpha \sum_i |\theta_i|\right)$
  $-\log p(\theta) = \alpha \sum_i |\theta_i| + \text{constant} = \alpha \|\theta\|_1 + \text{constant}$