Deep Learning Basics
Lecture 4: Regularization II
Princeton University COS 495
Instructor: Yingyu Liang
Review
Regularization as hard constraint
• Constrained optimization
$$\min_{\theta} \; L(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i)$$
subject to: $R(\theta) \le r$
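One common way to enforce the hard constraint in practice is projected gradient descent: take a gradient step on the unregularized loss, then project back onto the feasible set $\{R(\theta) \le r\}$. A minimal NumPy sketch for the $\ell_2$-ball constraint (the projection approach and the toy least-squares setup are illustrative assumptions, not from the slide):

import numpy as np

def project_l2_ball(theta, r):
    # Project theta onto the l2 ball of radius r: rescale if it lies outside.
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

def projected_gd(grad_L, theta0, r, lr=0.1, steps=200):
    # Projected gradient descent for: min L(theta) subject to ||theta||_2 <= r.
    theta = project_l2_ball(theta0, r)
    for _ in range(steps):
        theta = project_l2_ball(theta - lr * grad_L(theta), r)
    return theta

# Toy usage: least-squares loss L(theta) = (1/2n) * ||X theta - y||^2
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 5)), rng.normal(size=50)
grad = lambda th: X.T @ (X @ th - y) / len(y)
theta_hat = projected_gd(grad, np.zeros(5), r=1.0)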
Regularization as soft constraint
• Unconstrained optimization
$$\min_{\theta} \; L_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda R(\theta)$$
for some regularization parameter $\lambda > 0$
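A minimal NumPy sketch of the soft-constraint objective with an $\ell_2$ penalty (the squared loss and the data here are illustrative assumptions, not part of the slide):

import numpy as np

def regularized_loss(theta, X, y, lam):
    # Average per-example loss plus lambda * R(theta), with R(theta) = ||theta||_2^2.
    residuals = X @ theta - y
    data_term = np.mean(residuals ** 2)
    penalty = lam * np.sum(theta ** 2)
    return data_term + penalty

# Example: evaluate the regularized objective at a random theta.
rng = np.random.default_rng(0)
X, y, theta = rng.normal(size=(50, 5)), rng.normal(size=50), rng.normal(size=5)
value = regularized_loss(theta, X, y, lam=0.1)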
Regularization as Bayesian prior
• Bayes' rule:
$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$
• Maximum A Posteriori (MAP): the evidence $p(\{x_i, y_i\})$ does not depend on $\theta$, so
$$\max_{\theta} \log p(\theta \mid \{x_i, y_i\}) = \max_{\theta} \big[ \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta) \big]$$
where $\log p(\theta)$ corresponds to the regularization and $\log p(\{x_i, y_i\} \mid \theta)$ corresponds to the MLE loss
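As a standard worked example (not spelled out on this slide): a Gaussian prior on the weights recovers the $\ell_2$ penalty. If $\theta \sim N(0, \sigma^2 I)$, then $\log p(\theta) = -\frac{\|\theta\|_2^2}{2\sigma^2} + \text{const}$, so MAP is equivalent to
$$\min_{\theta} \; -\log p(\{x_i, y_i\} \mid \theta) + \lambda \|\theta\|_2^2, \qquad \lambda = \frac{1}{2\sigma^2},$$
i.e., $\ell_2$ regularization; a Laplace prior similarly yields $\ell_1$ regularization.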
Classical regularizations
• Norm penalty
• 𝑙2 regularization
• 𝑙1 regularization
More examples
Other types of regularizations
• Robustness to noise
• Noise to the input
• Noise to the weights
• Noise to the output
• Data augmentation
• Early stopping
• Dropout
Multiple optimal solutions?
[Figure: linearly separable data, Class +1 vs. Class -1, with three separating hyperplanes $w_1$, $w_2$, $w_3$; prefer $w_2$ (higher confidence)]
Add noise to the input
[Figure: the same data with noise added to the inputs; $w_2$ still separates Class +1 from Class -1; prefer $w_2$ (higher confidence)]
Caution: not too much noise
• Too much noise makes data points cross the decision boundary
[Figure: Class +1 vs. Class -1 with heavily perturbed points crossing the boundary of $w_2$; prefer $w_2$ (higher confidence)]
Equivalence to weight decay
• Suppose the hypothesis is $f(x) = w^T x$ and the noise is $\epsilon \sim N(0, \lambda I)$
• After adding noise to the input, the loss is
$$L(f) = \mathbb{E}_{x,y,\epsilon}\big[f(x+\epsilon) - y\big]^2 = \mathbb{E}_{x,y,\epsilon}\big[f(x) + w^T\epsilon - y\big]^2$$
$$L(f) = \mathbb{E}_{x,y,\epsilon}\big[f(x) - y\big]^2 + 2\,\mathbb{E}_{x,y,\epsilon}\big[w^T\epsilon\,(f(x) - y)\big] + \mathbb{E}_{x,y,\epsilon}\big[(w^T\epsilon)^2\big]$$
$$L(f) = \mathbb{E}_{x,y,\epsilon}\big[f(x) - y\big]^2 + \lambda\,\|w\|^2$$
• The cross term vanishes because $\epsilon$ has zero mean and is independent of $(x, y)$, and $\mathbb{E}[(w^T\epsilon)^2] = \lambda\|w\|^2$; so adding input noise is equivalent to $\ell_2$ regularization (weight decay). A numerical sketch follows below.
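A minimal NumPy check of this equivalence (illustrative, not from the slides): the Monte Carlo estimate of the noisy squared loss should be close to the clean squared loss plus $\lambda\|w\|^2$.

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 500, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = rng.normal(size=d)                         # an arbitrary candidate weight vector

# Monte Carlo estimate of the noisy loss E_{x,y,eps}[(w^T(x + eps) - y)^2], eps ~ N(0, lam*I)
noisy = np.mean([
    np.mean(((X + rng.normal(scale=np.sqrt(lam), size=X.shape)) @ w - y) ** 2)
    for _ in range(2000)
])

# Clean loss plus the weight-decay term lambda * ||w||^2
clean_plus_penalty = np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

print(noisy, clean_plus_penalty)               # the two values should nearly match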
Add noise to the weights
• For the loss on each data point, add a noise term to the weights
before computing the prediction
$$\epsilon \sim N(0, \eta I), \qquad w' = w + \epsilon$$
• Prediction: $f_{w'}(x)$ instead of $f_w(x)$
• Loss becomes
$$L(f) = \mathbb{E}_{x,y,\epsilon}\big[f_{w+\epsilon}(x) - y\big]^2$$
Add noise to the weights
• Loss becomes
$$L(f) = \mathbb{E}_{x,y,\epsilon}\big[f_{w+\epsilon}(x) - y\big]^2$$
• To simplify, use a Taylor expansion (derivatives taken with respect to $w$):
$$f_{w+\epsilon}(x) \approx f_w(x) + \epsilon^T \nabla f_w(x) + \frac{\epsilon^T \nabla^2 f_w(x)\, \epsilon}{2}$$
• Plug in:
$$L(f) \approx \mathbb{E}\big[f_w(x) - y\big]^2 + \eta\,\mathbb{E}\big[(f_w(x) - y)\,\nabla^2 f_w(x)\big] + \eta\,\mathbb{E}\,\|\nabla f_w(x)\|^2$$
• The middle term is small and can be ignored; the last term $\eta\,\mathbb{E}\,\|\nabla f_w(x)\|^2$ is the regularization term (a weight-noise training step is sketched below)
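A minimal PyTorch-style sketch of one training step with weight noise (assuming torch is available; the model, loss function, and noise scale eta are illustrative assumptions):

import torch

def weight_noise_step(model, loss_fn, x, y, optimizer, eta=1e-2):
    # Perturb every parameter with Gaussian noise eps ~ N(0, eta * I) before the forward pass.
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            eps = torch.randn_like(p) * eta ** 0.5
            p.add_(eps)
            perturbations.append(eps)

    loss = loss_fn(model(x), y)        # prediction uses the noisy weights w + eps
    optimizer.zero_grad()
    loss.backward()

    # Restore the original weights, then apply the gradient computed at w + eps.
    with torch.no_grad():
        for p, eps in zip(model.parameters(), perturbations):
            p.sub_(eps)
    optimizer.step()
    return loss.item()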
Data augmentation
Figure from Image Classification with Pyramid Representation
and Rotated Data Augmentation on Torch 7, by Keven Wang
Data augmentation
• Adding noise to the input: a special kind of augmentation
• Be careful which transformations are applied; some change the label (a label-preserving pipeline is sketched below):
• Example: classifying ‘b’ and ‘d’ (a horizontal flip turns one into the other)
• Example: classifying ‘6’ and ‘9’ (a 180° rotation turns one into the other)
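A minimal sketch of a label-preserving augmentation pipeline using torchvision (assumed available; the transforms and parameters are illustrative, and flips and large rotations are deliberately left out because of the examples above):

from torchvision import transforms

# Small rotations, crops, and brightness jitter usually preserve the label;
# horizontal flips and 180-degree rotations are excluded since they can turn
# 'b' into 'd' or '6' into '9'.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomCrop(size=28, padding=2),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])

# Usage: augmented = augment(pil_image), re-sampled independently every epoch.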
Early stopping
• Idea: do not train the network all the way to the smallest possible training error
• Recall overfitting: the larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the training set and the true data distribution
• Prevent overfitting: do not push the hypothesis too hard toward the training data; use the validation error to decide when to stop
Early stopping
Figure from Deep Learning,
Goodfellow, Bengio and Courville
Early stopping
• While training, also compute the validation error
• Every time the validation error improves, store a copy of the weights
• When the validation error has not improved for some time, stop
• Return the stored copy of the weights (a minimal loop is sketched below)
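A minimal Python sketch of this procedure (the helpers train_one_epoch and validation_error, and the patience value, are assumptions for illustration):

import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              patience=10, max_epochs=1000):
    # train_one_epoch(model) runs one epoch of training in place (assumed helper).
    # validation_error(model) returns the current validation error (assumed helper).
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        error = validation_error(model)
        if error < best_error:
            best_error = error
            best_model = copy.deepcopy(model)     # store a copy of the best weights
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience:  # no improvement for some time: stop
            break

    return best_model, best_error, epoch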
Early stopping
• Hyperparameter selection: the number of training steps is the hyperparameter
• Advantages
• Efficient: runs alongside training; only needs to store an extra copy of the weights
• Simple: no change to the model or algorithm
• Disadvantage: needs held-out validation data
Early stopping
• Strategies to remove the disadvantage
• After early stopping in the first run, train a second run that also uses the validation data
• Two ways to reuse the validation data (sketched below):
1. Start fresh and train on both the training data and the validation data for the number of epochs found in the first run
2. Start from the weights of the first run and continue training on both the training data and the validation data until the validation loss falls below the training loss recorded at the early stopping point
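A rough Python sketch of the two strategies (make_model, train_epochs, loss_on, and the values carried over from the first run are assumed helpers, not part of the slides):

def retrain_strategy_1(make_model, train_epochs, train_data, val_data, best_epoch):
    # Strategy 1: start from scratch and train on training + validation data
    # for the number of epochs chosen by early stopping in the first run.
    model = make_model()
    train_epochs(model, train_data + val_data, num_epochs=best_epoch)
    return model

def retrain_strategy_2(model, train_epochs, loss_on, train_data, val_data,
                       train_loss_at_stop, max_extra_epochs=100):
    # Strategy 2: continue from the first-run weights on training + validation data
    # until the validation loss drops below the training loss at the stopping point.
    for _ in range(max_extra_epochs):
        train_epochs(model, train_data + val_data, num_epochs=1)
        if loss_on(model, val_data) < train_loss_at_stop:
            break
    return model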
Early stopping as a regularizer
Figure from Deep Learning,
Goodfellow, Bengio and Courville
Dropout
• Randomly select weights to update
• More precisely, in each update step:
• Randomly sample a different binary mask over all the input and hidden units
• Multiply the mask bits with the units and do the update as usual
• Typical dropout probability: 0.2 for input units and 0.5 for hidden units (a sketch follows below)
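A minimal NumPy sketch of the masking step for one layer during training (the inverted-dropout rescaling shown here is one common convention and an assumption, as are the sizes and probabilities):

import numpy as np

def dropout_forward(h, drop_prob, rng, training=True):
    # h: activations of a layer (input or hidden units), shape (batch, units).
    if not training or drop_prob == 0.0:
        return h
    keep_prob = 1.0 - drop_prob
    mask = rng.random(h.shape) < keep_prob       # fresh binary mask at every update step
    return h * mask / keep_prob                  # "inverted dropout": rescale so the expectation matches h

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x_dropped = dropout_forward(x, drop_prob=0.2, rng=rng)   # 0.2 for inputs, 0.5 for hidden units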
Dropout
Figure from Deep Learning,
Goodfellow, Bengio and Courville
What regularizations are frequently used?
• 𝑙2 regularization
• Early stopping
• Dropout
• Data augmentation, if the transformations are known and easy to implement