Commit 8e2fcc1

Minor updates to noisy_reg
1 parent 17b0aef commit 8e2fcc1

File tree

1 file changed: +7 -5 lines changed


_posts/2022-07-22-noisy_reg.md

Lines changed: 7 additions & 5 deletions
@@ -27,7 +27,7 @@ X_test, y_test = X[n:,:], y[n:]
## The neural network
We will use a single hidden layer neural network with a quadratic activation function. The hidden layer will have $$m = 100$$ nodes. The parameters of this network are a $$d \times m$$ matrix $$A$$ and an $$m$$-dimensional vector $$b$$. For a $$d$$-dimensional data point $$X$$, the model predicts

-$$\text{predict}(X) = \sum_{i=1}^m b_i (A_i \cdot X)^2$$
+$$\text{predict}(X) = \sum_{i=1}^m b_i (A_i \cdot X)^2.$$

This neural network should be well suited to solve the regression task since it is straightforward to find values for the parameters that solve the task exactly; for example, we may pick $$A_{11} = b_1 = 1$$ and set everything else to zero.
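
For concreteness, here is a minimal NumPy sketch of this predictor. It is not the post's actual code: the input dimension $$d$$, the random seed, and the data are placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 100                      # d is a placeholder; the post uses m = 100 hidden nodes

A = rng.normal(size=(d, m))         # d x m weight matrix; column A_i feeds hidden node i
b = rng.normal(size=m)              # m-dimensional output weights

def predict(X, A, b):
    """Return sum_i b_i (A_i . X)^2 for each row of X (shape (n, d) or a single (d,) point)."""
    X = np.atleast_2d(X)
    return ((X @ A) ** 2) @ b       # quadratic activation, then linear readout

# The exact parameters mentioned in the post: with A[0, 0] = b[0] = 1 and everything
# else zero, predict(X) reduces to X[0] ** 2.
A_exact = np.zeros((d, m)); A_exact[0, 0] = 1.0
b_exact = np.zeros(m); b_exact[0] = 1.0
```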

@@ -79,6 +79,8 @@ for lr in lr_list:

For learning rates above 0.096 the optimization diverges, so we can't go higher. We see that all learning rates up to 0.03 give the same bad MSE, but after that larger learning rates improve performance. Interestingly, we can actually solve the task, but only if we choose learning rates right at the edge of divergence.

+It should be noted that while the above plot paints a deceptively simple picture, it is not true in general that a higher learning rate is better. The experiment seems robust against random seeds for initialization and data generation, but is quite fragile against changes in other hyperparameters such as the number of hidden neurons and the scale of initialization.
+
## Label noise
We can achieve a similar effect without huge learning rates by adding "label noise". Label noise is a form of *data augmentation*, where we generate more training data by modifying the original data. In each iteration of gradient descent we will replace the training targets with the original targets plus some noise. We choose standard normal noise.
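
As a rough sketch of what this looks like in code (again not the post's code: the gradients are written out by hand for the quadratic network above, and the data, initialization scale, and learning rate are placeholder assumptions):

```python
import numpy as np

def gd_step_with_label_noise(X, y, A, b, lr, rng):
    """One full-batch gradient descent step on the MSE, resampling standard normal
    label noise for the targets on every call."""
    n = X.shape[0]
    y_noisy = y + rng.standard_normal(n)            # fresh noise each iteration
    H = X @ A                                       # (n, m): H[k, i] = A_i . X_k
    resid = (H ** 2) @ b - y_noisy                  # (n,) prediction errors
    # Hand-derived gradients of (1/n) * sum_k (predict(X_k) - y_noisy_k)^2:
    grad_b = (2.0 / n) * (H ** 2).T @ resid
    grad_A = (4.0 / n) * X.T @ (resid[:, None] * H * b[None, :])
    return A - lr * grad_A, b - lr * grad_b

# Placeholder usage; the post's actual data generation and hyperparameters are not shown here.
rng = np.random.default_rng(0)
d, m, n = 10, 100, 50
X = rng.normal(size=(n, d))
y = X[:, 0] ** 2                                    # placeholder targets
A, b = 0.1 * rng.normal(size=(d, m)), 0.1 * rng.normal(size=m)
for _ in range(1000):
    A, b = gd_step_with_label_noise(X, y, A, b, lr=0.01, rng=rng)
```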

@@ -127,15 +129,15 @@ I found the neat derivation presented above in [Implicit Gradient Regularization
# Label noise
But wait, optimizing $$L$$ is the same as optimizing $$\tilde{L}$$, right? A minimizer of $$L$$ has gradient zero, so it is also a minimizer of $$\tilde{L}$$. Well, we are in the overparameterized setting, so the optimization path might change *which* minimizer we end up at. When we add label noise it becomes clearer.

-With label noise we have $$L = E_{z \sim \mathcal{N}(0,1)}\frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i-z)^2$$ and $$\tilde{L} = L + \frac{\eta}{4} E_{z \sim \mathcal{N}(0,1)}\|\nabla \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i-z)^2\|_2^2$$. After a bit of calculation we may simplify this to
+With label noise we have $$L = E_{z_i \sim \mathcal{N}(0,1)}\frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i-z_i)^2$$ and $$\tilde{L} = L + \frac{\eta}{4} E_{z_i \sim \mathcal{N}(0,1)}\|\nabla \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i-z_i)^2\|_2^2$$. After a bit of calculation we may simplify this to

-$$\tilde{L} = \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i)^2 + \eta\frac{1}{n}\sum_{i=1}^n \Big[\big((P(X_i)-y_i)\cdot \nabla P(X_i)\big)^2 + \|\nabla P(X_i)\|_2^2\Big] + 1.$$
+$$\tilde{L} = \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i)^2 + \eta\frac{1}{n^2}\sum_{i=1}^n \Big[\big((P(X_i)-y_i)\cdot \nabla P(X_i)\big)^2 + \|\nabla P(X_i)\|_2^2\Big] + 1.$$

The $$+1$$ doesn't matter for optimization, so we remove it. At convergence $$P(X_i) \approx y_i$$, so we have

-$$\tilde{L} \approx \eta\frac{1}{n}\sum_{i=1}^n \|\nabla P(X_i)\|_2^2$$
+$$\tilde{L} \approx \eta\frac{1}{n^2}\sum_{i=1}^n \|\nabla P(X_i)\|_2^2$$

-One way to interpret this is that among the parameter configurations fitting the training data, we choose the one minimizing the gradient of the neural network output with respect to the parameters. We may further write out and analyze $$\|\nabla P(X_i)\|_2^2$$ for the case of our neural network, and that will show why this regularizer is useful for solving our regression task. I will not do that here.
+One way to interpret this is that among the parameter configurations fitting the training data, we choose the one minimizing the gradient of the neural network output with respect to the parameters. We may further write out and analyze $$\|\nabla P(X_i)\|_2^2$$ for the case of our neural network, and that will likely show why this regularizer is useful for solving our regression task. I will not do that here.
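
One way to check the $$\frac{1}{n^2}$$ factor (a sketch of the calculation, not spelled out in the post): the gradient of the noisy loss is $$\frac{2}{n}\sum_{i=1}^n(P(X_i)-y_i-z_i)\nabla P(X_i)$$, and since the $$z_i$$ are independent with mean zero and unit variance,

$$E_{z_i \sim \mathcal{N}(0,1)}\Big\|\frac{2}{n}\sum_{i=1}^n(P(X_i)-y_i-z_i)\nabla P(X_i)\Big\|_2^2 = \Big\|\frac{2}{n}\sum_{i=1}^n(P(X_i)-y_i)\nabla P(X_i)\Big\|_2^2 + \frac{4}{n^2}\sum_{i=1}^n\|\nabla P(X_i)\|_2^2.$$

Multiplying by $$\frac{\eta}{4}$$ and using $$P(X_i) \approx y_i$$ at convergence leaves the $$\eta\frac{1}{n^2}\sum_{i=1}^n \|\nabla P(X_i)\|_2^2$$ term above.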

# Comparing with small initialization
If you read the [previous blog post](/2022/07/06/dln_classifier.html), you might wonder if we can apply the implicit bias given by small initialization to solve the regression task given here. Indeed you can! Simply scaling down the outputs of the neural network by a factor of 100, we get MSE = 0.00 with learning rate 0.01.
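
As a rough sketch of that change (placeholder shapes again; only the division by 100 and the idea of shrinking the output scale come from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 100                          # placeholder input dimension; m = 100 hidden nodes
A = rng.normal(size=(d, m))
b = rng.normal(size=m)

def predict_scaled(X, A, b):
    # Same quadratic network as before, with its output scaled down by a factor of 100.
    return (((np.atleast_2d(X) @ A) ** 2) @ b) / 100.0
```

Gradient descent then proceeds as before (the post reports MSE = 0.00 with learning rate 0.01 for this scaled network).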
