
Regularization in Machine Learning
David Sahner, M.D.
Senior Advisor, National Center for the Advancement of Translational Sciences, and
Executive Advisor, Data Science and Translational Medicine, Axle Research and Technologies
Presentation Outline
• General points
• Commonly used norm penalties
• Lasso (L1), Ridge Regression (L2) and Elastic Net

• L2,1 regularization - a new tool in ML?


• Tangent Prop
• Early stopping during training – several approaches
• Classic random unit dropout and newer variations
• Conclusions
• Supplemental slide: Other regularization penalties
Regularization: General Points
• Regularization affects the bias-variance trade-off with the net result that generalizability is
enhanced at the possible expense of increased training error – more bias and less variance
• People often think of parameter norm penalties to the cost function as “regularization,” but
model simplification and regularization can be achieved in other ways to reduce the risk of
overfitting:
• Dimensionality reduction (see separate learning module on this topic)
• Parameter sharing, which renders certain deep model types (e.g., CNNs) inherently less susceptible
to overfitting
• Early stopping during training
• Random dropout of units decreases model complexity and enables formation of an ensemble. May
need to start with a more complex model.
• Regularization is most critical when the model is large & complex, and the training data set
is of only small/modest size (high risk of model overfitting)
• Overregularization (extreme weight decay) can maroon us in a local cost minimum
Norm penalties
• A norm of the weight vector is multiplied by a hyperparameter and added to
the cost function, inflicting a cost hit for complexity.

• In deep learning, this affects weights for all layers.

• The most frequently used norm penalties include:


• L1 Norm (Lasso), consisting of the sum of the absolute values of the weights multiplied by a scalar hyperparameter that controls its strength. Lasso promotes sparsification of weights.

• L2 Norm (Ridge Regression), in which the sum of the squared weights (squared L2 norm) is multiplied by the hyperparameter α/2.

• Elastic net: A hybrid of the two norms above; like a weighted combination of each. May be superior to Lasso in the context of highly correlated predictors. (See the code sketch below.)
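To make the penalties concrete, here is a minimal Python sketch of the three terms as they might be added to a base cost J(w); the names alpha and l1_ratio are illustrative hyperparameters rather than notation from these slides.

import numpy as np

def l1_penalty(w, alpha):
    # Lasso: alpha times the sum of the absolute values of the weights (promotes sparsity)
    return alpha * np.sum(np.abs(w))

def l2_penalty(w, alpha):
    # Ridge: (alpha / 2) times the sum of the squared weights (smooth shrinkage)
    return (alpha / 2.0) * np.sum(w ** 2)

def elastic_net_penalty(w, alpha, l1_ratio):
    # Elastic net: a weighted combination of the L1 and L2 penalties
    return l1_ratio * l1_penalty(w, alpha) + (1.0 - l1_ratio) * l2_penalty(w, alpha)

# Regularized cost: J_reg(w) = J(w) + penalty(w)

In practice, scikit-learn's Lasso, Ridge, and ElasticNet estimators implement these penalties for linear models.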
L2 Regularization
• Recommend reviewing the theoretical background on L2 regularization, including
derivation of the equations illuminating its exact effect at
https://www.deeplearningbook.org/
• Analysis based on a quadratic approximation of the cost function (a second-order Taylor expansion), leveraging basic calculus and linear algebra, shows that weight decay rescales the component of the unregularized weight vector along the i-th eigenvector of the Hessian H by a factor of λ_i/(λ_i + α), where λ_i is the corresponding eigenvalue

• This rescaling factor implies that weights “associated” with eigenvalues of H much smaller than α are
preferentially targeted for decay.

• A small eigenvalue of H tied to a weight implies that the cost function is less sensitive to that weight.
These are the weights L2 regularization shifts toward zero.

• In linear regression, L2 regularization artificially inflates feature variance by a fixed amount.


• This disproportionately shrinks the weights of features that exhibit low covariance with the output target, as
that relatively low covariance is more greatly diluted by the L2 “variance boost”
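A small numeric sketch of the rescaling factor λ_i/(λ_i + α) described above, using an invented toy Hessian (the specific numbers are for illustration only):

import numpy as np

H = np.array([[4.0, 0.0],
              [0.0, 0.01]])      # toy Hessian: one stiff direction, one nearly flat direction
alpha = 0.1                      # L2 regularization strength
w_unreg = np.array([1.0, 1.0])   # unregularized optimum (illustrative)

eigvals, eigvecs = np.linalg.eigh(H)
rescale = eigvals / (eigvals + alpha)               # approx. [0.09, 0.98]
w_reg = eigvecs @ (rescale * (eigvecs.T @ w_unreg))
# The component of w along the flat direction (small eigenvalue, cost insensitive)
# is decayed almost to zero, while the stiff direction is nearly untouched.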
L2,1 regularization: Relatively new and less commonly used, yet . . .
• L2 norm can be applied to each row of a matrix, followed by summing
of these values over all rows (L2,1 regularization)
• Used successfully in penalized multivariate regression, aiding in variable selection when potential phenotypic trait predictors (SNPs) greatly outnumbered the genotypes (Mbebi et al., 2021)
• Performed well against several other regression techniques in this study
• Induction of sparsity in coefficient matrix
• Relations between examples considered by this approach
• https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8479665/pdf/btab212.pdf
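A minimal sketch of the L2,1 norm of a coefficient matrix W (one row per predictor, one column per output trait), as described above; adding lam * l21_norm(W) to a multivariate regression loss encourages entire rows, i.e., whole predictors, to shrink to zero. The variable names are illustrative.

import numpy as np

def l21_norm(W):
    # Apply the L2 norm to each row of W, then sum these row norms over all rows
    return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))

# Example penalized objective (Frobenius-norm data fit plus L2,1 penalty):
# loss(W) = np.sum((Y - X @ W) ** 2) + lam * l21_norm(W)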
Tangent Prop Regularizer: Fundamental Idea
• Quite different from norm penalties. Members of a class can be thought of as living
on or near a manifold that can be defined by the vector output of a classifier.

• For a new member of the class, one would like to ensure that movement along that
manifold does not affect classification.
• We would like the model output to change very minimally, if at all, for a new member of the
class compared with other members of the same class.

• This can be approximated locally by requiring model output to change only minimally if a new
member of the class resides on the class manifold’s tangent* plane.

• Caveat: Tangent prop is NOT appropriate if ReLU activation functions are used.

*In actuality, a manifold has multiple tangent planes, referred to collectively as the tangent bundle,
which can be estimated/extracted using a Contractive Auto-Encoder. See:
https://www.iro.umontreal.ca/~vincentp/Publications/MTC_nips2011.pdf
Tangent Prop Regularizer: Fundamental Idea (continued)

[Figure: toy example of a class manifold. The impact on output generated by a local move along the tangent is minimal; orthogonal movement in the direction of the outward arrow has a much greater effect on model output.]
Tangent Prop (TP) Regularizer: Formalism
• We would like the gradient of the model output with respect to input to be zero/minimal for a
succession of members of the same class. Conversely, we want the model’s output to
change when it sees a member of another class during training.
• This is achieved by adding a penalty term that is nonzero if the gradient of the output is
both:
a. Significant/large AND
b. NOT orthogonal to the tangent plane of a class manifold

• We can express this generically as a penalty term of the form:

Σ_i ( ∇_x f(x)^T t_i )^2

where f(x) is the model output and t_i is a tangent vector of the class manifold. This term can be scaled by a hyperparameter.

This enforces small output gradients along the tangent planes of the class manifold, although TP can regularize only with respect to minor (infinitesimal) input perturbations.
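A minimal PyTorch-style sketch of this penalty, assuming a model f with a scalar output and tangent vectors t_i estimated elsewhere (e.g., from known input transformations or a contractive auto-encoder); the function and variable names are illustrative, not part of the original formalism.

import torch

def tangent_prop_penalty(f, x, tangents):
    # tangents: iterable of 1-D tensors t_i spanning an estimate of the local tangent plane
    x = x.clone().requires_grad_(True)
    y = f(x)                                                  # scalar model output
    (grad_x,) = torch.autograd.grad(y, x, create_graph=True)  # gradient of output w.r.t. input
    # Sum of squared projections of the input gradient onto each tangent direction
    return sum((grad_x @ t) ** 2 for t in tangents)

# total_loss = task_loss + lam * tangent_prop_penalty(model, x, tangents)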
Early stopping during training
Can track validation loss over time, keeping a running log of parameters, and then return to the set of model parameters that yielded the lowest validation set loss.

Here, p (a "patience" hyperparameter) can be defined as the number of iterations/epochs through which training is allowed to continue with no improvement over the lowest validation set loss.

In short, training stops when a surrogate for generalizability (validation error) ceases to improve.

[Image from https://www.deeplearningbook.org/contents/regularization.html]
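A minimal sketch of patience-based early stopping; train_one_epoch, validation_loss, get_params, and set_params are hypothetical helpers assumed to be supplied by the surrounding training code.

def train_with_early_stopping(model, patience):
    best_loss, best_params, epochs_since_improvement = float("inf"), None, 0
    while epochs_since_improvement < patience:
        train_one_epoch(model)
        loss = validation_loss(model)
        if loss < best_loss:
            # Keep a running log of the best parameters seen so far
            best_loss, best_params = loss, get_params(model)
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
    set_params(model, best_params)  # return to the parameters with the lowest validation loss
    return model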
Selected variations on early stopping
• Stopping in the context of "Low Progress" during training. Intuition is that the risk of overfitting increases when the training error declines slowly. Can formalize this as:

P_k = ( Σ training errors over the strip ) / ( k · minimum training error during the strip )

where P_k is training progress over a consecutive strip of k epochs (for example, k = 5). The sum of the errors is divided by the minimum training error during that strip multiplied by k. When P_k falls below a preset threshold, training stops.

• One can stop when the "generalization loss (GL) to training progress ratio" exceeds a threshold, where GL ≡ 100∙(average error in validation set in current epoch ÷ lowest validation set error across epochs − 1)
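A minimal sketch of both stopping criteria, following the definitions given above (any constant scaling used in other formulations is omitted):

def training_progress(strip_train_errors):
    # P_k over a consecutive strip of k epochs: the sum of training errors in the strip
    # divided by (k * minimum training error in the strip); stop when it falls below a threshold
    k = len(strip_train_errors)
    return sum(strip_train_errors) / (k * min(strip_train_errors))

def generalization_loss(current_val_error, lowest_val_error):
    # GL = 100 * (current validation error / lowest validation error across epochs - 1)
    return 100.0 * (current_val_error / lowest_val_error - 1.0)

# Second variant: stop when generalization_loss(...) / training_progress(...) exceeds a preset threshold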
Random unit dropout (I)
• Random masking of units simplifies each model and creates, in effect, an
ensemble.
• The probability of retaining a hidden unit in the mask is often set at 0.5 (0.8 for input units).

• Calculating the arithmetic mean of the probability of a class over all possible models is impractical, but we can sample 10-20 masks
• Another approach relies on the geometric mean. Here the unnormalized probability of a class is given by:

p̃_ensemble(y | x) = ( Π_μ p(y | x, μ) )^(1/2^d)

where the product runs over the mask vectors μ (each defining a masked model), d is the number of units that can be dropped, and 2^d is the number of possible masked models
Random unit dropout (II)
We then normalize the predicted probability by including a denominator that sums these values across all classes. But this still entails scads of forward propagations, so we rely on an expedient that allows us to accomplish this feat (approximately) with only ONE forward propagation, namely, weight scaling inference.

[Modified image from Tian and Zhang, 2022]
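A minimal sketch of the mask-sampling route, assuming a hypothetical helper predict_with_mask(x, mask) that returns class probabilities for the sub-model defined by mask; the geometric mean is taken over a manageable number of sampled masks rather than all 2^d, and is then normalized across classes as described above.

import numpy as np

def dropout_ensemble_probs(x, n_units, predict_with_mask, n_samples=20, p_keep=0.5, seed=0):
    rng = np.random.default_rng(seed)
    masks = rng.binomial(1, p_keep, size=(n_samples, n_units))   # sampled unit masks
    probs = np.stack([predict_with_mask(x, m) for m in masks])   # shape (n_samples, n_classes)
    geo_mean = np.exp(np.log(probs).mean(axis=0))                # geometric mean over sampled masks
    return geo_mean / geo_mean.sum()                             # normalize across classes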


Weight scaling inference rule
• We can estimate the geometric mean of the probability of a given class across all potential sub-models in one forward propagation of the parent (full) model with no masking by "weighting the weights": each weight by which a unit's activity is multiplied is itself scaled by the probability of that unit's inclusion in a sub-model.

• With a retention probability of 0.5 for any hidden unit, this translates into division of each weight tethered to a hidden unit by 2.

• This technique provides an approximation of the ensemble's assigned probability of any given class if the hidden units are paired with nonlinear activation functions (and exact estimates in some other settings).
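A minimal sketch contrasting random masking during training with the weight scaling inference rule at test time for a single hidden layer; with p_keep = 0.5, scaling the weights by p_keep is the same as dividing them by 2. The layer shapes and names are illustrative.

import numpy as np

p_keep = 0.5
rng = np.random.default_rng(0)

def hidden_layer_train(h, W, b):
    # Training: randomly drop (mask) units of the previous layer
    mask = rng.binomial(1, p_keep, size=h.shape)
    return np.maximum(0.0, (h * mask) @ W + b)

def hidden_layer_test(h, W, b):
    # Inference: no masking; instead scale each weight leaving a droppable unit
    # by the probability that the unit is retained
    return np.maximum(0.0, h @ (W * p_keep) + b)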
Newer twists on standard dropout
• Curriculum dropout: Smoothly increase the dropout rate during training,
guided by the intuition that the risk of co-adaptation of feature detectors,
leading to overfitting, increases with time
• As good or better than standard dropout in image classification
• See https://arxiv.org/pdf/1703.06229

• Dropconnect: Randomly drop weights rather than unit activations

• Standout: A binary belief network influences which units are dropped, with
a preference for retaining confident units

• Dropmaps: For each batch, features are dropped in accordance with a Bernoulli distribution
Conclusions
• Regularization mitigates the risk of overfitting models with attendant failure of
generalizability
• There exists a profusion of regularization penalties, but only a few are very commonly used in machine learning.
• Elastic net, which allows one to titrate contributions of L1 and L2 regularization,
is a reasonable default regularization strategy
• L1 regularization promotes feature sparsity
• Early stopping and dropout are very useful
• Curriculum dropout may offer advantages over standard dropout, especially in the
setting of image classification

• Although not discussed in this module, other techniques are extremely helpful
in preventing overfitting, including various dimensionality reduction techniques
(see separate module) and dataset augmentation
Other types of regularization
Leaky capped L1 norm: for details, see https://www.sciencedirect.com/science/article/abs/pii/S156625352100230X

SCAD: Smoothly Clipped Absolute Deviation; MCP: Minimax concave penalty; LSP: Log sum penalty; Rat: Rational penalty; Atan: Arctangent penalty; Exp: Exponential penalty.
