
Regularization in Machine Learning
David Sahner, M.D.
Senior Advisor, National Center for the Advancement of Translational Sciences, and
Executive Advisor, Data Science and Translational Medicine, Axle Research and Technologies
Presentation Outline
• General points
• Commonly used norm penalties
• Lasso (L1), Ridge Regression (L2) and Elastic Net

• L2,1 regularization - a new tool in ML?


• Tangent Prop
• Early stopping during training – several approaches
• Classic random unit dropout and newer variations
• Conclusions
• Supplemental slide: Other regularization penalties
Regularization: General Points
• Regularization affects the bias-variance trade-off with the net result that generalizability is
enhanced at the possible expense of increased training error – more bias and less variance
• People often think of parameter norm penalties to the cost function as “regularization,” but
model simplification and regularization can be achieved in other ways to reduce the risk of
overfitting:
• Dimensionality reduction (see separate learning module on this topic)
• Parameter sharing, which renders certain deep model types (e.g., CNNs) inherently less susceptible
to overfitting
• Early stopping during training
• Random dropout of units decreases model complexity and enables formation of an ensemble. May
need to start with a more complex model.
• Regularization is most critical when the model is large & complex, and the training data set
is of only small/modest size (high risk of model overfitting)
• Overregularization (extreme weight decay) can maroon us in a local cost minimum
Norm penalties
• A norm of the weight vector is multiplied by a hyperparameter and added to
the cost function, inflicting a cost hit for complexity.

• In deep learning, this affects weights for all layers.

• The most frequently used norm penalties include:


• L1 Norm (Lasso), consisting of the sum of the absolute values of the weights multiplied by a scalar hyperparameter that controls its strength. Lasso promotes sparsification of weights.

• L2 Norm (Ridge Regression), in which the sum of the squared weights (squared L2 norm) is multiplied by the hyperparameter α/2.

• Elastic net: A hybrid of the two norms above; like a weighted combination of each. May be superior to Lasso in the context of highly correlated predictors. (See the code sketch below.)
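To make the penalties concrete, here is a minimal Python sketch of the three terms as they might be added to a base cost J(w); the names alpha and l1_ratio are illustrative hyperparameters rather than notation from these slides.

import numpy as np

def l1_penalty(w, alpha):
    # Lasso: alpha times the sum of the absolute values of the weights (promotes sparsity)
    return alpha * np.sum(np.abs(w))

def l2_penalty(w, alpha):
    # Ridge: (alpha / 2) times the sum of the squared weights (smooth shrinkage)
    return (alpha / 2.0) * np.sum(w ** 2)

def elastic_net_penalty(w, alpha, l1_ratio):
    # Elastic net: a weighted combination of the L1 and L2 penalties
    return l1_ratio * l1_penalty(w, alpha) + (1.0 - l1_ratio) * l2_penalty(w, alpha)

# Regularized cost: J_reg(w) = J(w) + penalty(w)

In practice, scikit-learn's Lasso, Ridge, and ElasticNet estimators implement these penalties for linear models.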
L2 Regularization
• Recommend reviewing the theoretical background on L2 regularization, including
derivation of the equations illuminating its exact effect at
https://www.deeplearningbook.org/
• Analysis based on a quadratic approximation of the cost function (a second-order Taylor expansion), leveraging basic calculus and linear algebra, shows that weight decay rescales the component of the unregularized weight vector along the i-th eigenvector of the Hessian H by a factor of λ_i/(λ_i + α), where λ_i is the corresponding eigenvalue

• This rescaling factor implies that weights “associated” with eigenvalues of H much smaller than α are
preferentially targeted for decay.

• A small eigenvalue of H tied to a weight implies that the cost function is less sensitive to that weight.
These are the weights L2 regularization shifts toward zero.

• In linear regression, L2 regularization artificially inflates feature variance by a fixed amount.


• This disproportionately shrinks the weights of features that exhibit low covariance with the output target, as
that relatively low covariance is more greatly diluted by the L2 “variance boost”
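A small numeric sketch of the rescaling factor λ_i/(λ_i + α) described above, using an invented toy Hessian (the specific numbers are for illustration only):

import numpy as np

H = np.array([[4.0, 0.0],
              [0.0, 0.01]])      # toy Hessian: one stiff direction, one nearly flat direction
alpha = 0.1                      # L2 regularization strength
w_unreg = np.array([1.0, 1.0])   # unregularized optimum (illustrative)

eigvals, eigvecs = np.linalg.eigh(H)
rescale = eigvals / (eigvals + alpha)               # approx. [0.09, 0.98]
w_reg = eigvecs @ (rescale * (eigvecs.T @ w_unreg))
# The component of w along the flat direction (small eigenvalue, cost insensitive)
# is decayed almost to zero, while the stiff direction is nearly untouched.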
L2,1 regularization: Relatively new and less commonly used, yet . . .
• L2 norm can be applied to each row of a matrix, followed by summing
of these values over all rows (L2,1 regularization)
• Used successfully in penalized multivariate regression, aiding in variable selection when potential phenotypic trait predictors (SNPs) greatly outnumbered the genotypes (Mbebi et al., 2021)
• Performed well against several other regression techniques in this study
• Induction of sparsity in coefficient matrix
• Relations between examples considered by this approach
• https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8479665/pdf/btab212.pdf
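A minimal sketch of the L2,1 norm of a coefficient matrix W (one row per predictor, one column per output trait), as described above; adding lam * l21_norm(W) to a multivariate regression loss encourages entire rows, i.e., whole predictors, to shrink to zero. The variable names are illustrative.

import numpy as np

def l21_norm(W):
    # Apply the L2 norm to each row of W, then sum these row norms over all rows
    return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))

# Example penalized objective (Frobenius-norm data fit plus L2,1 penalty):
# loss(W) = np.sum((Y - X @ W) ** 2) + lam * l21_norm(W)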
Tangent Prop Regularizer: Fundamental Idea
• Quite different from norm penalties. Members of a class can be thought of as living
on or near a manifold that can be defined by the vector output of a classifier.

• For a new member of the class, one would like to ensure that movement along that
manifold does not affect classification.
• We would like the model output to change very minimally, if at all, for a new member of the
class compared with other members of the same class.

• This can be approximated locally by requiring model output to change only minimally if a new
member of the class resides on the class manifold’s tangent* plane.

• Caveat: Tangent prop is NOT appropriate if ReLU activation functions are used.

*In actuality, a manifold has multiple tangent planes, referred to collectively as the tangent bundle,
which can be estimated/extracted using a Contractive Auto-Encoder. See:
https://www.iro.umontreal.ca/~vincentp/Publications/MTC_nips2011.pdf
Tangent Prop Regularizer: Fundamental Idea (continued)

[Figure: toy example of a class manifold. The impact on output generated by a local move along the tangent is minimal; orthogonal movement in the direction of the outward arrow has a much greater effect on model output.]
Tangent Prop (TP) Regularizer: Formalism
• We would like the gradient of the model output with respect to input to be zero/minimal for a
succession of members of the same class. Conversely, we want the model’s output to
change when it sees a member of another class during training.
• This is achieved by adding a penalty term that is nonzero if the gradient of the output is
both:
a. Significant/large AND
b. NOT orthogonal to the tangent plane of a class manifold

• We can express this generically as a penalty term of the form:

Σ_i ( ∇_x f(x)^T t_i )^2

where f(x) is the model output and t_i is a tangent vector of the class manifold. This term can be scaled by a hyperparameter.

This enforces small output gradients along the tangent planes of the class manifold, although TP can regularize only with respect to minor (infinitesimal) input perturbations.
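A minimal PyTorch-style sketch of this penalty, assuming a model f with a scalar output and tangent vectors t_i estimated elsewhere (e.g., from known input transformations or a contractive auto-encoder); the function and variable names are illustrative, not part of the original formalism.

import torch

def tangent_prop_penalty(f, x, tangents):
    # tangents: iterable of 1-D tensors t_i spanning an estimate of the local tangent plane
    x = x.clone().requires_grad_(True)
    y = f(x)                                                  # scalar model output
    (grad_x,) = torch.autograd.grad(y, x, create_graph=True)  # gradient of output w.r.t. input
    # Sum of squared projections of the input gradient onto each tangent direction
    return sum((grad_x @ t) ** 2 for t in tangents)

# total_loss = task_loss + lam * tangent_prop_penalty(model, x, tangents)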
Early stopping during training
Can track validation loss over time, keeping a running log of parameters, and then return to the set of model parameters that yielded the lowest validation set loss.

Here, p (a "patience" hyperparameter) can be defined as the number of iterations/epochs through which training is allowed to continue with no improvement over the lowest validation set loss.

In short, training stops when a surrogate for generalizability (validation error) ceases to improve.

[Image from https://www.deeplearningbook.org/contents/regularization.html]
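A minimal sketch of patience-based early stopping; train_one_epoch, validation_loss, get_params, and set_params are hypothetical helpers assumed to be supplied by the surrounding training code.

def train_with_early_stopping(model, patience):
    best_loss, best_params, epochs_since_improvement = float("inf"), None, 0
    while epochs_since_improvement < patience:
        train_one_epoch(model)
        loss = validation_loss(model)
        if loss < best_loss:
            # Keep a running log of the best parameters seen so far
            best_loss, best_params = loss, get_params(model)
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
    set_params(model, best_params)  # return to the parameters with the lowest validation loss
    return model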
Selected variations on early stopping
• Stopping in the context of "Low Progress" during training. Intuition is that the risk of overfitting increases when the training error declines slowly. Can formalize this as:

P_k = ( Σ training errors over the strip ) / ( k · minimum training error during the strip )

where P_k is training progress over a consecutive strip of k epochs (for example, k = 5). The sum of the errors is divided by the minimum training error during that strip multiplied by k. When P_k falls below a preset threshold, training stops.

• One can stop when the "generalization loss (GL) to training progress ratio" exceeds a threshold, where GL ≡ 100∙(average error in validation set in current epoch ÷ lowest validation set error across epochs − 1)
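A minimal sketch of both stopping criteria, following the definitions given above (any constant scaling used in other formulations is omitted):

def training_progress(strip_train_errors):
    # P_k over a consecutive strip of k epochs: the sum of training errors in the strip
    # divided by (k * minimum training error in the strip); stop when it falls below a threshold
    k = len(strip_train_errors)
    return sum(strip_train_errors) / (k * min(strip_train_errors))

def generalization_loss(current_val_error, lowest_val_error):
    # GL = 100 * (current validation error / lowest validation error across epochs - 1)
    return 100.0 * (current_val_error / lowest_val_error - 1.0)

# Second variant: stop when generalization_loss(...) / training_progress(...) exceeds a preset threshold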
Random unit dropout (I)
• Random masking of units simplifies each model and creates, in effect, an
ensemble.
• The probability of retaining a hidden unit in the mask is often set at 0.5 (0.8 for input units).

• Calculating the arithmetic mean of the probability of a class over all possible models is impractical, but we can sample 10-20 masks
• Another approach relies on the geometric mean. Here the unnormalized probability of a class is given by:

p̃_ensemble(y | x) = ( Π_μ p(y | x, μ) )^(1/2^d)

where the product runs over the mask vectors μ (each defining a masked model), d is the number of units that can be dropped, and 2^d is the number of possible masked models
Random unit dropout (II)
We then normalize the predicted probability by including a denominator that sums these values across all classes. But this still entails scads of forward propagations, so we rely on an expedient that allows us to accomplish this feat (approximately) with only ONE forward propagation, namely, weight scaling inference.

[Modified image from Tian and Zhang, 2022]
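A minimal sketch of the mask-sampling route, assuming a hypothetical helper predict_with_mask(x, mask) that returns class probabilities for the sub-model defined by mask; the geometric mean is taken over a manageable number of sampled masks rather than all 2^d, and is then normalized across classes as described above.

import numpy as np

def dropout_ensemble_probs(x, n_units, predict_with_mask, n_samples=20, p_keep=0.5, seed=0):
    rng = np.random.default_rng(seed)
    masks = rng.binomial(1, p_keep, size=(n_samples, n_units))   # sampled unit masks
    probs = np.stack([predict_with_mask(x, m) for m in masks])   # shape (n_samples, n_classes)
    geo_mean = np.exp(np.log(probs).mean(axis=0))                # geometric mean over sampled masks
    return geo_mean / geo_mean.sum()                             # normalize across classes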


Weight scaling inference rule
• We can estimate the geometric mean of the probability of a given class across all potential sub-models in one forward propagation of the parent (full) model with no masking by "weighting the weights": each weight by which a unit's activity is multiplied is itself scaled by the probability of that unit's inclusion in a sub-model.

• With a retention probability of 0.5 for any hidden unit, this translates into division of each weight tethered to a hidden unit by 2.

• This technique provides an approximation of the ensemble's assigned probability of any given class if the hidden units are paired with nonlinear activation functions (and exact estimates in some other settings).
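A minimal sketch contrasting random masking during training with the weight scaling inference rule at test time for a single hidden layer; with p_keep = 0.5, scaling the weights by p_keep is the same as dividing them by 2. The layer shapes and names are illustrative.

import numpy as np

p_keep = 0.5
rng = np.random.default_rng(0)

def hidden_layer_train(h, W, b):
    # Training: randomly drop (mask) units of the previous layer
    mask = rng.binomial(1, p_keep, size=h.shape)
    return np.maximum(0.0, (h * mask) @ W + b)

def hidden_layer_test(h, W, b):
    # Inference: no masking; instead scale each weight leaving a droppable unit
    # by the probability that the unit is retained
    return np.maximum(0.0, h @ (W * p_keep) + b)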
Newer twists on standard dropout
• Curriculum dropout: Smoothly increase the dropout rate during training,
guided by the intuition that the risk of co-adaptation of feature detectors,
leading to overfitting, increases with time
• As good or better than standard dropout in image classification
• See https://arxiv.org/pdf/1703.06229

• Dropconnect: Randomly drop weights rather than unit activations

• Standout: A binary belief network influences which units are dropped, with
a preference for retaining confident units

• Dropmaps: For each batch, features are dropped in accordance with a Bernoulli distribution
Conclusions
• Regularization mitigates the risk of overfitting models with attendant failure of
generalizability
• There exists a profusion of regularization penalties, but only a few are very commonly used in machine learning.
• Elastic net, which allows one to titrate contributions of L1 and L2 regularization,
is a reasonable default regularization strategy
• L1 regularization promotes feature sparsity
• Early stopping and dropout are very useful
• Curriculum dropout may offer advantages over standard dropout, especially in the
setting of image classification

• Although not discussed in this module, other techniques are extremely helpful
in preventing overfitting, including various dimensionality reduction techniques
(see separate module) and dataset augmentation
Other types of regularization
Leaky capped L1 norm: for details, see https://www.sciencedirect.com/science/article/abs/pii/S156625352100230X

SCAD: Smoothly Clipped Absolute Deviation; MCP: Minimax concave penalty; LSP: Log sum penalty; Rat: Rational penalty; Atan: Arctangent penalty; Exp: Exponential penalty.
