Regularization (mathematics)

In mathematics, statistics, finance,[1] and computer science, particularly in machine learning and inverse problems, regularization is a process that converts the answer to a problem to a simpler one. It is often used in solving ill-posed problems or to prevent overfitting.[2]

Although regularization procedures can be divided in many ways, the following delineation is particularly helpful:

Explicit regularization is regularization whenever one explicitly adds a term to the optimization problem. These terms could be priors, penalties, or constraints. Explicit regularization is commonly employed with ill-posed optimization problems. The regularization term, or penalty, imposes a cost on the optimization function to make the optimal solution unique.
Implicit regularization is all other forms of regularization. This includes, for example, early stopping, using a robust loss function, and discarding outliers. Implicit regularization is essentially ubiquitous in modern machine learning approaches, including stochastic gradient descent for training deep neural networks, and ensemble methods (such as random forests and gradient boosted trees).

[Figure: The green and blue functions both incur zero loss on the given data points. A learned model can be induced to prefer the green function, which may generalize better to more points drawn from the underlying unknown distribution, by adjusting $\lambda$, the weight of the regularization term.]
In explicit regularization, independent of the problem or model, there is always a data term, which corresponds to a likelihood of the measurement, and a regularization term, which corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses to be more aligned to the data or to enforce regularization (to prevent overfitting). There is a whole research branch dealing with all possible regularizations. In practice, one usually tries a specific regularization and then figures out the probability density that corresponds to that regularization to justify the choice. The choice can also be physically motivated by common sense or intuition.

In machine learning, the data term corresponds to the training data and the regularization is either the choice of the model or modifications to the algorithm. It is always intended to reduce the generalization error, i.e. the error score of the trained model on the evaluation set, not the training data.[3]

One of the earliest uses of regularization is Tikhonov regularization (ridge regression), related to the
method of least squares.

Regularization in machine learning


In machine learning, a key challenge is enabling models to accurately predict outcomes on unseen data, not just on familiar training data. Regularization is crucial for addressing overfitting, where a model memorizes training data details but fails to generalize to new data. The goal of regularization is to encourage models to learn the broader patterns within the data rather than memorizing it. Techniques like early stopping, L1 and L2 regularization, and dropout are designed to prevent overfitting and underfitting, thereby improving the model's ability to generalize to new data.[4]

Early Stopping
Stops training when validation performance deteriorates, preventing overfitting by halting before the
model memorizes training data.[4]

L1 and L2 Regularization
Adds penalty terms to the cost function to discourage complex models:

L1 regularization (also called LASSO) leads to sparse models by adding a penalty based
on the absolute value of coefficients.
L2 regularization (also called ridge regression) encourages smaller, more evenly
distributed weights by adding a penalty based on the square of the coefficients.[4]
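
As an illustration of the two penalties, here is a minimal sketch assuming scikit-learn is available (the data, `alpha` values, and seed are illustrative, not from the source):

```python
# Fit the same noisy linear problem with an L1 (Lasso) and an L2 (Ridge)
# penalty and compare the learned coefficients.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])  # only 2 informative features
y = X @ true_w + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: many coefficients driven to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: coefficients shrunk but rarely exactly 0

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```

With the L1 penalty the uninformative coefficients typically land at exactly zero (a sparse model), while the L2 penalty merely shrinks them toward zero.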

Dropout
In the context of neural networks, the Dropout technique repeatedly ignores random subsets of neurons
during training, which simulates the training of multiple neural network architectures at once to improve
generalization.[4]
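
A minimal NumPy sketch of one common variant, "inverted" dropout, follows; the function name, defaults, and rescaling convention are illustrative assumptions, not from the source:

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=None):
    """Inverted dropout: randomly zero a fraction p_drop of units during
    training and rescale the survivors so the expected activation is unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    if not training or p_drop == 0.0:
        return activations  # at test time the full network is used unchanged
    keep = rng.random(activations.shape) >= p_drop   # Boolean mask of surviving units
    return activations * keep / (1.0 - p_drop)       # rescale to preserve expectation
```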

Classification
Empirical learning of classifiers (from a finite data set) is always an underdetermined problem, because it attempts to infer a function of any $x$, given only examples $x_1, x_2, \ldots, x_n$.

A regularization term (or regularizer) $R(f)$ is added to a loss function:

$$\min_f \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda R(f)$$

where $V$ is an underlying loss function that describes the cost of predicting $f(x)$ when the label is $y$, such as the square loss or hinge loss; and $\lambda$ is a parameter which controls the importance of the regularization term. $R(f)$ is typically chosen to impose a penalty on the complexity of $f$. Concrete notions of complexity used include restrictions for smoothness and bounds on the vector space norm.[5]

A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution (as
depicted in the figure above, where the green function, the simpler one, may be preferred). From a
Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions
on model parameters.[6]
Regularization can serve multiple purposes, including learning simpler models, inducing models to be sparse, and introducing group structure into the learning problem.

The same idea arose in many fields of science. A simple form of regularization applied to integral
equations (Tikhonov regularization) is essentially a trade-off between fitting the data and reducing a norm
of the solution. More recently, non-linear regularization methods, including total variation regularization,
have become popular.

Generalization
Regularization can be motivated as a technique to improve the generalizability of a learned model.

The goal of this learning problem is to find a function that fits or predicts the outcome (label) while minimizing the expected error over all possible inputs and labels. The expected error of a function $f_n$ is:

$$I[f_n] = \int_{X \times Y} V(f_n(x), y) \, \rho(x, y) \, dx \, dy$$

where $X$ and $Y$ are the domains of input data $x$ and their labels $y$ respectively.

Typically in learning problems, only a subset of input data and labels are available, measured with some noise. Therefore, the expected error is unmeasurable, and the best surrogate available is the empirical error over the $n$ available samples:

$$I_S[f_n] = \frac{1}{n} \sum_{i=1}^{n} V(f_n(\hat{x}_i), \hat{y}_i)$$

Without bounds on the complexity of the function space (formally, the reproducing kernel Hilbert space) available, a model will be learned that incurs zero loss on the surrogate empirical error. If measurements (e.g. of $x_i$) were made with noise, this model may suffer from overfitting and display poor expected error. Regularization introduces a penalty for exploring certain regions of the function space used to build the model, which can improve generalization.
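
To make the gap between empirical and expected error concrete, here is a small synthetic sketch (the data, degree, and penalty values are illustrative assumptions, not from the source): an unpenalized degree-9 polynomial drives the empirical error to nearly zero on 10 noisy samples, but performs far worse on fresh draws than an L2-penalized fit.

```python
import numpy as np

rng = np.random.default_rng(1)
f_true = lambda x: np.sin(2 * np.pi * x)          # underlying unknown function
x_tr = rng.uniform(0, 1, 10)
y_tr = f_true(x_tr) + 0.2 * rng.normal(size=10)   # noisy measurements

def fit(x, y, degree, lam):
    """Least squares on a polynomial basis, with an optional L2 penalty lam."""
    Phi = np.vander(x, degree + 1)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)

def mse(w, x, y, degree=9):
    return np.mean((np.vander(x, degree + 1) @ w - y) ** 2)

x_te = rng.uniform(0, 1, 1000)                    # fresh samples approximate the expected error
y_te = f_true(x_te) + 0.2 * rng.normal(size=1000)

for lam in (0.0, 1e-4):                           # lam = 0 is pure empirical risk minimization
    w = fit(x_tr, y_tr, 9, lam)
    print(f"lam={lam}: train MSE={mse(w, x_tr, y_tr):.4f}, test MSE={mse(w, x_te, y_te):.4f}")
```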

Tikhonov regularization (ridge regression)


These techniques are named for Andrey Nikolayevich Tikhonov, who applied regularization to integral
equations and made important contributions in many other areas.

When learning a linear function $f$, characterized by an unknown vector $w$ such that $f(x) = w \cdot x$, one can add the $L_2$-norm of the vector $w$ to the loss expression in order to prefer solutions with smaller norms. Tikhonov regularization is one of the most common forms. It is also known as ridge regression. It is expressed as:

$$\min_w \sum_{i=1}^{n} V(\hat{x}_i \cdot w, \hat{y}_i) + \lambda \|w\|_2^2$$

where $(\hat{x}_i, \hat{y}_i), \ 1 \le i \le n$, would represent samples used for training.


In the case of a general function, the norm of the function in its reproducing kernel Hilbert space is used:

$$\min_f \sum_{i=1}^{n} V(f(\hat{x}_i), \hat{y}_i) + \lambda \|f\|_{\mathcal{H}}^2$$

As the norm is differentiable, learning can be advanced by gradient descent.

Tikhonov-regularized least squares


The learning problem with the least squares loss function and Tikhonov regularization can be solved analytically. Written in matrix form, the optimal $w$ is the one for which the gradient of the loss function with respect to $w$ is 0:

$$\min_w \frac{1}{n} (\hat{X} w - \hat{Y})^\top (\hat{X} w - \hat{Y}) + \lambda \|w\|_2^2$$

$$0 = \frac{2}{n} \hat{X}^\top (\hat{X} w - \hat{Y}) + 2 \lambda w$$

$$w = (\hat{X}^\top \hat{X} + \lambda n I)^{-1} \hat{X}^\top \hat{Y}$$

where the second statement is a first-order condition.

By construction of the optimization problem, other values of $w$ give larger values for the loss function. This can be verified by examining the second derivative $\nabla_{ww}$, which is positive definite.

During training, this algorithm takes $O(d^3 + nd^2)$ time. The terms correspond to the matrix inversion and calculating $\hat{X}^\top \hat{X}$, respectively. Testing takes $O(nd)$ time.
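
The closed form above translates directly into code; a minimal NumPy sketch (the data and $\lambda$ below are illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Tikhonov-regularized least squares: w = (X^T X + lam*n*I)^{-1} X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + 0.1 * rng.normal(size=50)
print(ridge_closed_form(X, y, lam=0.1))   # estimates close to the generating vector
```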

Early stopping
Early stopping can be viewed as regularization in time. Intuitively, a training procedure such as gradient
descent tends to learn more and more complex functions with increasing iterations. By regularizing for
time, model complexity can be controlled, improving generalization.

Early stopping is implemented using one data set for training, one statistically independent data set for
validation and another for testing. The model is trained until performance on the validation set no longer
improves and then applied to the test set.
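
A schematic of this procedure, as a hypothetical sketch (the `model` interface, `fit_one_epoch`, `evaluate`, and parameter accessors are placeholders, not a real library API):

```python
def train_with_early_stopping(model, train_set, val_set, max_epochs=100, patience=5):
    """Stop when validation loss has not improved for `patience` epochs,
    then roll back to the best parameters seen so far."""
    best_val, best_params, epochs_since_best = float("inf"), model.get_params(), 0
    for epoch in range(max_epochs):
        model.fit_one_epoch(train_set)          # hypothetical single-epoch update
        val_loss = model.evaluate(val_set)      # loss on the held-out validation set
        if val_loss < best_val:
            best_val, best_params, epochs_since_best = val_loss, model.get_params(), 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:   # validation stopped improving: halt
                break
    model.set_params(best_params)               # the test set is only used afterwards
    return model
```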

Theoretical motivation in least squares


Consider the finite approximation of the Neumann series for an invertible matrix $A$ where $\|I - A\| < 1$:

$$A^{-1} \approx \sum_{i=0}^{T-1} (I - A)^i$$

This can be used to approximate the analytical solution of unregularized least squares, if $\gamma$ is introduced to ensure the norm is less than one:

$$w_T = \frac{\gamma}{n} \sum_{i=0}^{T-1} \left(I - \frac{\gamma}{n} \hat{X}^\top \hat{X}\right)^i \hat{X}^\top \hat{Y}$$

The exact solution to the unregularized least squares learning problem minimizes the empirical error, but may fail to generalize. By limiting $T$, the only free parameter in the algorithm above, the problem is regularized for time, which may improve its generalization.

The algorithm above is equivalent to restricting the number of gradient descent iterations for the empirical risk

$$I_s[w] = \frac{1}{2n} \left\|\hat{X} w - \hat{Y}\right\|_{\mathbb{R}^n}^2$$

with the gradient descent update:

$$w_0 = 0, \qquad w_{t+1} = \left(I - \frac{\gamma}{n} \hat{X}^\top \hat{X}\right) w_t + \frac{\gamma}{n} \hat{X}^\top \hat{Y}$$

The base case is trivial. The inductive case is proved as follows:

$$w_{t+1} = \left(I - \frac{\gamma}{n} \hat{X}^\top \hat{X}\right) \frac{\gamma}{n} \sum_{i=0}^{t-1} \left(I - \frac{\gamma}{n} \hat{X}^\top \hat{X}\right)^i \hat{X}^\top \hat{Y} + \frac{\gamma}{n} \hat{X}^\top \hat{Y} = \frac{\gamma}{n} \sum_{i=0}^{t} \left(I - \frac{\gamma}{n} \hat{X}^\top \hat{X}\right)^i \hat{X}^\top \hat{Y}$$
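
The equivalence is easy to verify numerically; a short sketch (sizes, seed, and step size $\gamma$ are illustrative assumptions, with $\gamma$ small enough that the series converges):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
n, T, gamma = X.shape[0], 25, 0.1

# T steps of gradient descent on the empirical risk, starting from w_0 = 0
w = np.zeros(4)
for _ in range(T):
    w = w - (gamma / n) * X.T @ (X @ w - y)

# Truncated Neumann series form of the same iterate
M = np.eye(4) - (gamma / n) * X.T @ X
w_neumann = (gamma / n) * sum(np.linalg.matrix_power(M, i) for i in range(T)) @ X.T @ y

print(np.allclose(w, w_neumann))    # True: early-stopped descent equals the truncated series
```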

Regularizers for sparsity


Assume that a dictionary $\phi_j$ with dimension $p$ is given such that a function in the function space can be expressed as:

$$f(x) = \sum_{j=1}^{p} \phi_j(x) w_j$$

Enforcing a sparsity constraint on can lead to simpler and more interpretable models. This is useful in
many real-life applications such as computational biology. An example is developing a simple predictive
test for a disease in order to minimize the cost of performing medical tests while maximizing predictive
power.

A sensible sparsity constraint is the $L_0$ norm $\|w\|_0$, defined as the number of non-zero elements in $w$. Solving an $L_0$-regularized learning problem, however, has been demonstrated to be NP-hard.[7]
The $L_1$ norm (see also Norms) can be used to approximate the optimal $L_0$ norm via convex relaxation. It can be shown that the $L_1$ norm induces sparsity. In the case of least squares, this problem is known as LASSO in statistics and basis pursuit in signal processing.

[Figure: A comparison between the L1 ball and the L2 ball in two dimensions gives an intuition on how L1 regularization achieves sparsity.]

$L_1$ regularization can occasionally produce non-unique solutions. A simple example is provided in the figure when the space of possible solutions lies on a 45 degree line. This can be problematic for certain applications, and is overcome by combining $L_1$ with $L_2$ regularization in elastic net regularization, which takes the following form:

$$\min_w \frac{1}{n} \left\|\hat{X} w - \hat{Y}\right\|^2 + \lambda \left(\alpha \|w\|_1 + (1 - \alpha) \|w\|_2^2\right), \qquad \alpha \in [0, 1]$$

[Figure: Elastic net regularization.]

Elastic net regularization tends to have a grouping effect, where correlated input features are assigned
equal weights.

Elastic net regularization is commonly used in practice and is implemented in many machine learning
libraries.
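
For instance, scikit-learn (assuming that library is available) exposes the trade-off between the two penalties as `l1_ratio`; the data below is an illustrative construction with two strongly correlated features:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)   # two highly correlated features
y = X[:, 0] + 0.1 * rng.normal(size=100)

# l1_ratio blends the penalties: 1.0 is pure lasso, 0.0 is pure ridge.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(model.coef_, 3))   # correlated features tend to receive similar weights
```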

Proximal methods
While the $L_1$ norm does not result in an NP-hard problem, the $L_1$ norm is convex but is not strictly differentiable due to the kink at $x = 0$. Subgradient methods which rely on the subderivative can be used to solve $L_1$-regularized learning problems. However, faster convergence can be achieved through proximal methods.

For a problem $\min_{w \in H} F(w) + R(w)$ such that $F$ is convex, continuous, differentiable, with Lipschitz continuous gradient (such as the least squares loss function), and $R$ is convex, continuous, and proper, the proximal method to solve the problem is as follows. First define the proximal operator

$$\operatorname{prox}_R(v) = \underset{w \in H}{\operatorname{argmin}} \left\{ R(w) + \frac{1}{2} \|w - v\|^2 \right\},$$

and then iterate

$$w_{k+1} = \operatorname{prox}_{\gamma R}\left(w_k - \gamma \nabla F(w_k)\right)$$

The proximal method iteratively performs gradient descent and then projects the result back into the space permitted by $R$.

When $R$ is the $L_1$ regularizer, the proximal operator is equivalent to the soft-thresholding operator,

$$S_\lambda(v)_i = \begin{cases} v_i - \lambda, & \text{if } v_i > \lambda \\ 0, & \text{if } |v_i| \le \lambda \\ v_i + \lambda, & \text{if } v_i < -\lambda \end{cases}$$

This allows for efficient computation.
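
Combining the gradient step on the smooth least-squares term with this proximal operator yields the iterative soft-thresholding algorithm (ISTA); a minimal sketch, where the step size and iteration count are illustrative choices:

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(X, y, lam, n_iter=500):
    """Proximal gradient descent for min_w 0.5*||Xw - y||^2 + lam*||w||_1."""
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1/L, where L is the gradient's Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)                          # gradient step on the smooth part F
        w = soft_threshold(w - step * grad, step * lam)   # proximal step on the L1 part R
    return w
```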

Group sparsity without overlaps


Groups of features can be regularized by a sparsity constraint, which can be useful for expressing certain
prior knowledge into an optimization problem.

In the case of a linear model with non-overlapping known groups, a regularizer can be defined:

$$R(w) = \sum_{g=1}^{G} \|w_g\|_2$$

where

$$\|w_g\|_2 = \sqrt{\sum_{j=1}^{|G_g|} (w_g^j)^2}$$

This can be viewed as inducing a regularizer over the $L_2$ norm over members of each group followed by an $L_1$ norm over groups.

This can be solved by the proximal method, where the proximal operator is a block-wise soft-thresholding function:

$$\operatorname{prox}_{\lambda R}(w)_g = \begin{cases} \left(1 - \dfrac{\lambda}{\|w_g\|_2}\right) w_g, & \text{if } \|w_g\|_2 > \lambda \\ 0, & \text{if } \|w_g\|_2 \le \lambda \end{cases}$$
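
A minimal sketch of this block-wise operator (the group encoding as a list of index arrays is an illustrative convention):

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Block-wise soft-thresholding: shrink each group's sub-vector toward 0,
    zeroing the whole group when its L2 norm falls below lam.
    `groups` is a list of index arrays partitioning the coordinates of w."""
    out = np.zeros_like(w)
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        if norm > lam:
            out[idx] = (1.0 - lam / norm) * w[idx]   # the entire group survives, shrunk
        # else: the whole group is set to exactly zero
    return out

w = np.array([0.5, -0.2, 3.0, 1.0])
print(group_soft_threshold(w, [np.array([0, 1]), np.array([2, 3])], lam=1.0))
# first group zeroed entirely, second group shrunk as a block
```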

Group sparsity with overlaps


The algorithm described for group sparsity without overlaps can be applied to the case where groups do
overlap, in certain situations. This will likely result in some groups with all zero elements, and other
groups with some non-zero and some zero elements.

If it is desired to preserve the group structure, a new regularizer can be defined:

$$R(w) = \inf \left\{ \sum_{g=1}^{G} \|\bar{w}_g\|_2 : w = \sum_{g=1}^{G} \bar{w}_g \right\}$$

For each $w_g$, $\bar{w}_g$ is defined as the vector such that the restriction of $\bar{w}_g$ to the group $g$ equals $w_g$ and all other entries of $\bar{w}_g$ are zero. The regularizer finds the optimal disintegration of $w$ into parts. It can be viewed as duplicating all elements that exist in multiple groups. Learning problems with this regularizer can also be solved with the proximal method, with a complication: the proximal operator cannot be computed in closed form, but can be effectively solved iteratively, inducing an inner iteration within the proximal method iteration.

Regularizers for semi-supervised learning


When labels are more expensive to gather than input examples, semi-supervised learning can be useful.
Regularizers have been designed to guide learning algorithms to learn models that respect the structure of
unsupervised training samples. If a symmetric weight matrix $W$ is given, a regularizer can be defined:

$$R(f) = \sum_{i,j} w_{ij} \left(f(x_i) - f(x_j)\right)^2$$

If $W_{ij}$ encodes the result of some distance metric for points $x_i$ and $x_j$, it is desirable that $f(x_i) \approx f(x_j)$. This regularizer captures this intuition, and is equivalent to:

$$R(f) = \bar{f}^\top L \bar{f}$$

where $L = D - W$ is the Laplacian matrix of the graph induced by $W$.

The optimization problem $\min_{f \in \mathbb{R}^m} R(f)$, $m = u + l$, can be solved analytically if the constraint $f(x_i) = y_i$ is applied for all supervised samples. The labeled part of the vector $f$ is therefore obvious. The unlabeled part of $f$ is solved for by setting the gradient with respect to $f_u$ to zero:

$$\nabla_{f_u} \left( f_u^\top L_{uu} f_u + 2 f_u^\top L_{ul} f_l \right) = 2 L_{uu} f_u + 2 L_{ul} f_l = 0$$

$$f_u = -L_{uu}^\dagger \left(L_{ul} f_l\right)$$

The pseudo-inverse can be taken because $L_{ul}$ has the same range as $L_{uu}$.
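
A minimal NumPy sketch of this Laplacian interpolation (the chain-graph example is an illustrative construction):

```python
import numpy as np

def laplacian_interpolation(W, y_labeled, labeled_idx):
    """Solve min_f f^T L f subject to f being fixed on the labeled nodes.
    W: symmetric weight matrix; y_labeled: labels; labeled_idx: their indices."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                      # graph Laplacian L = D - W
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    L_ul = L[np.ix_(unlabeled_idx, labeled_idx)]
    f = np.zeros(n)
    f[labeled_idx] = y_labeled                          # labeled part is fixed
    f[unlabeled_idx] = -np.linalg.pinv(L_uu) @ L_ul @ y_labeled
    return f

# Chain graph 0-1-2-3-4 with the ends labeled -1 and +1:
# the interior nodes interpolate linearly, [-1, -0.5, 0, 0.5, 1].
W = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
print(laplacian_interpolation(W, np.array([-1.0, 1.0]), np.array([0, 4])))
```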

Regularizers for multitask learning


In the case of multitask learning, $T$ problems are considered simultaneously, each related in some way. The goal is to learn $T$ functions, ideally borrowing strength from the relatedness of tasks, that have predictive power. This is equivalent to learning the matrix $W : T \times D$.

Sparse regularizer on columns


$$R(W) = \sum_{i=1}^{D} \|W_{:,i}\|_2$$

This regularizer defines an $L_2$ norm on each column $W_{:,i}$ and an $L_1$ norm over all columns. It can be solved by proximal methods.

Nuclear norm regularization

$$R(W) = \|\sigma(W)\|_1$$

where $\sigma(W)$ is the vector of singular values of $W$, obtained from its singular value decomposition.
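
As a quick NumPy sketch (illustrative, not from the source), the nuclear norm is the sum of singular values and serves as a convex surrogate for the rank of $W$:

```python
import numpy as np

def nuclear_norm(W):
    """Sum of singular values; a convex surrogate for the rank of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

W = np.outer([1.0, 2.0], [3.0, 4.0])     # rank-1 matrix
print(nuclear_norm(W), np.linalg.matrix_rank(W))
```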

Mean-constrained regularization

$$R(f_1, \ldots, f_T) = \sum_{t=1}^{T} \left\| f_t - \frac{1}{T} \sum_{s=1}^{T} f_s \right\|_{\mathcal{H}_k}^2$$

This regularizer constrains the functions learned for each task to be similar to the overall average of the functions across all tasks. This is useful for expressing prior information that each task is expected to share with each other task. An example is predicting blood iron levels measured at different times of the day, where each task represents an individual.

Clustered mean-constrained regularization

$$R(f_1, \ldots, f_T) = \sum_{r=1}^{C} \sum_{t \in I(r)} \left\| f_t - \frac{1}{|I(r)|} \sum_{s \in I(r)} f_s \right\|_{\mathcal{H}_k}^2$$

where $I(r)$ is a cluster of tasks.

This regularizer is similar to the mean-constrained regularizer, but instead enforces similarity between
tasks within the same cluster. This can capture more complex prior information. This technique has been
used to predict Netflix recommendations. A cluster would correspond to a group of people who share
similar preferences.

Graph-based similarity
More generally than above, similarity between tasks can be defined by a function. The regularizer encourages the model to learn similar functions for similar tasks:

$$R(f_1, \ldots, f_T) = \sum_{t,s=1}^{T} \|f_t - f_s\|^2 M_{ts}$$

for a given symmetric similarity matrix $M$.

Other uses of regularization in statistics and machine learning


Bayesian learning methods make use of a prior probability that (usually) gives lower probability to more
complex models. Well-known model selection techniques include the Akaike information criterion (AIC),
minimum description length (MDL), and the Bayesian information criterion (BIC). Alternative methods
of controlling overfitting not involving regularization include cross-validation.

Examples of applications of different methods of regularization to the linear model are:

| Model | Fit measure | Entropy measure[5][8] |
|---|---|---|
| AIC/BIC | $\|Y - X\beta\|_2$ | $\|\beta\|_0$ |
| Lasso[9] | $\|Y - X\beta\|_2^2$ | $\lambda\|\beta\|_1$ |
| Ridge regression[10] | $\|Y - X\beta\|_2^2$ | $\lambda\|\beta\|_2^2$ |
| Basis pursuit denoising | $\|Y - X\beta\|_2^2$ | $\lambda\|\beta\|_1$ |
| Rudin–Osher–Fatemi model (TV) | $\|Y - X\beta\|_2^2$ | $\lambda\|\nabla\beta\|_1$ |
| Potts model | $\|Y - X\beta\|_2^2$ | $\lambda\|\nabla\beta\|_0$ |
| RLAD[11] | $\|Y - X\beta\|_1$ | $\lambda\|\beta\|_1$ |
| Dantzig Selector[12] | $\|X^\top(Y - X\beta)\|_\infty$ | $\lambda\|\beta\|_1$ |
| SLOPE[13] | $\|Y - X\beta\|_2^2$ | $\lambda\sum_{i=1}^{p} \lambda_i |\beta|_{(i)}$ |

See also
Bayesian interpretation of regularization
Bias–variance tradeoff
Matrix regularization
Regularization by spectral filtering
Regularized least squares
Lagrange multiplier
Variance reduction

Notes
1. Kratsios, Anastasis (2020). "Deep Arbitrage-Free Learning in a Generalized HJM Framework via Arbitrage-Regularization Data". Risks. 8 (2): 40. doi:10.3390/risks8020040. hdl:20.500.11850/456375. "Term structure models can be regularized to remove arbitrage opportunities [sic?]."
2. Bühlmann, Peter; Van De Geer, Sara (2011). Statistics for High-Dimensional Data. Springer Series in Statistics. p. 9. doi:10.1007/978-3-642-20192-9. ISBN 978-3-642-20191-2. "If p > n, the ordinary least squares estimator is not unique and will heavily overfit the data. Thus, a form of complexity regularization will be necessary."
3. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. Deep Learning (https://www.deeplearningbook.org/contents/ml.html). Retrieved 2021-01-29.
4. Guo, Jingru. "AI Notes: Regularizing neural networks" (https://deeplearning.ai/ai-notes/regularization/). deeplearning.ai. Retrieved 2024-02-04.
5. Bishop, Christopher M. (2007). Pattern Recognition and Machine Learning (corr. printing ed.). New York: Springer. ISBN 978-0-387-31073-2.
6. For the connection between maximum a posteriori estimation and ridge regression, see Weinberger, Kilian (July 11, 2018). "Linear / Ridge Regression" (https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote08.html#map-estimate). CS4780 Machine Learning Lecture 13. Cornell.
7. Natarajan, B. (1995). "Sparse Approximate Solutions to Linear Systems". SIAM Journal on Computing. 24 (2): 227–234. doi:10.1137/S0097539792240406. ISSN 0097-5397.
8. Duda, Richard O. (2004). Pattern Classification + Computer Manual: Hardcover Set (2nd ed.). New York: Wiley. ISBN 978-0-471-70350-1.
9. Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B. 58 (1): 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x. MR 1379242.
10. Hoerl, Arthur E.; Kennard, Robert W. (1970). "Ridge regression: Biased estimation for nonorthogonal problems". Technometrics. 12 (1): 55–67. doi:10.2307/1267351. JSTOR 1267351.
11. Wang, Li; Gordon, Michael D.; Zhu, Ji (2006). "Regularized Least Absolute Deviations Regression and an Efficient Algorithm for Parameter Tuning". Sixth International Conference on Data Mining. pp. 690–700. doi:10.1109/ICDM.2006.134. ISBN 978-0-7695-2701-7.
12. Candes, Emmanuel; Tao, Terence (2007). "The Dantzig selector: Statistical estimation when p is much larger than n". Annals of Statistics. 35 (6): 2313–2351. arXiv:math/0506081. doi:10.1214/009053606000001523. MR 2382644.
13. Bogdan, Małgorzata; van den Berg, Ewout; Su, Weijie; Candes, Emmanuel J. (2013). "Statistical estimation and testing via the ordered L1 norm". arXiv:1310.1969 [stat.ME].

