## The neural network
We will use a single hidden layer neural network with a quadratic activation function. The hidden layer will have $$m = 100$$ nodes. The parameters of this network are a $$d \times m$$ matrix $$A$$ and an $$m$$-dimensional vector $$b$$. For a $$d$$-dimensional data point $$X$$, the model predicts

$$P(X) = \sum_{j=1}^m b_j \Big(\sum_{i=1}^d A_{ij} X_i\Big)^2.$$

This neural network should be well suited to the regression task, since it is straightforward to find parameter values that solve it exactly: for example, we may pick $$A_{11} = b_1 = 1$$ and set everything else to zero.
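To make this concrete, here is a minimal numpy sketch of such a network (my own illustration, not the post's code; the input dimension and all names are placeholders):

```python
import numpy as np

d, m = 10, 100  # input dimension (assumed here) and number of hidden nodes

def predict(X, A, b):
    # Single hidden layer with quadratic activation, applied row-wise to a batch X of shape (n, d):
    # P(X) = sum_j b_j * (sum_i A_ij * X_i)^2
    return ((X @ A) ** 2) @ b

# The parameter choice mentioned above: A_11 = b_1 = 1, everything else zero.
A = np.zeros((d, m)); A[0, 0] = 1.0
b = np.zeros(m);      b[0] = 1.0

X = np.random.randn(5, d)
print(predict(X, A, b) - X[:, 0] ** 2)  # all zeros: this parameter choice computes X_1^2
```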
For learning rates above 0.096 the optimization diverges, so we can't go higher. We see that all learning rates up to 0.03 give the same bad MSE, but beyond that, larger learning rates improve performance. Interestingly, we can actually solve the task, but only if we choose a learning rate right at the edge of divergence.
It should be noted that while the above plot paints a deceptively simple picture, it is not true in general that a higher learning rate is better. The experiment seems robust to the random seeds used for initialization and data generation, but it is quite fragile to changes in other hyperparameters, such as the number of hidden neurons and the scale of initialization.
## Label noise
We can achieve a similar effect without huge learning rates by adding "label noise". Label noise is a form of *data augmentation*, where we generate more training data by modifying the original data. In each iteration of gradient descent we will replace the training targets with the original targets plus some noise. We choose standard normal noise.
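As a sketch of what one run of gradient descent with label noise could look like for the quadratic network above (again my own illustration rather than the post's training code; the data, sizes, initialization scale, and learning rate are assumptions):

```python
import numpy as np

def mse_grads(X, y, A, b):
    # Gradients of the mean squared error of P(X) = ((X A)^2) b with respect to A and b,
    # derived by hand for this particular model.
    H = X @ A                                     # hidden pre-activations, shape (n, m)
    r = (H ** 2) @ b - y                          # residuals P(X_i) - y_i, shape (n,)
    grad_b = (2.0 / len(y)) * (H ** 2).T @ r
    grad_A = (4.0 / len(y)) * X.T @ (r[:, None] * H * b[None, :])
    return grad_A, grad_b

rng = np.random.default_rng(0)
n, d, m, lr = 20, 10, 100, 0.01                   # assumed sizes and learning rate
X_train = rng.standard_normal((n, d))
y_train = X_train[:, 0] ** 2                      # assumed toy target, solvable exactly by this model
A = 0.1 * rng.standard_normal((d, m))
b = 0.1 * rng.standard_normal(m)

for step in range(20_000):
    # Label noise: fresh standard normal noise is added to the targets at every iteration.
    noisy_y = y_train + rng.standard_normal(n)
    grad_A, grad_b = mse_grads(X_train, noisy_y, A, b)
    A -= lr * grad_A
    b -= lr * grad_b
```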
# Label noise
But wait, optimizing $$L$$ is the same as optimizing $$\tilde{L}$$, right? A minimizer of $$L$$ has gradient zero, so it is also a minimizer of $$\tilde{L}$$. Well, we are in the overparameterized setting, so the optimization path might change *which* minimizer we end up at. This becomes clearer when we add label noise.
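To spell out the first claim (in my notation, writing $$\theta^\ast$$ for a minimizer of $$L$$, with $$\tilde{L} = L + \frac{\eta}{4}\|\nabla L\|_2^2$$ as above and assuming $$L$$ is twice differentiable):

$$\nabla \tilde{L}(\theta^\ast) = \nabla L(\theta^\ast) + \frac{\eta}{2}\,\nabla^2 L(\theta^\ast)\,\nabla L(\theta^\ast) = 0,$$

so at a minimizer of $$L$$ the loss is minimal and the non-negative penalty $$\frac{\eta}{4}\|\nabla L\|_2^2$$ vanishes, which makes it a minimizer of $$\tilde{L}$$ as well.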
With label noise we have $$L = E_{z_i \sim \mathcal{N}(0,1)}\frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i-z_i)^2$$ and $$\tilde{L} = L + \frac{\eta}{4} E_{z_i \sim \mathcal{N}(0,1)}\|\nabla \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i-z_i)^2\|_2^2$$. After a bit of calculation we may simplify this to

$$\tilde{L} = \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i)^2 + 1 + \frac{\eta}{4}\Big\|\nabla \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i)^2\Big\|_2^2 + \frac{\eta}{n^2}\sum_{i=1}^n\|\nabla P(X_i)\|_2^2.$$
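For reference, the calculation goes roughly as follows (a sketch in my own shorthand, writing $$r_i = P(X_i) - y_i$$ and using that the $$z_i$$ are independent with $$E[z_i] = 0$$ and $$E[z_i z_j] = \delta_{ij}$$):

$$
\begin{aligned}
E_{z}\Big\|\nabla \tfrac{1}{n}\sum_{i=1}^n (r_i - z_i)^2\Big\|_2^2
&= \tfrac{4}{n^2}\sum_{i,j} E\big[(r_i - z_i)(r_j - z_j)\big]\,\big\langle\nabla P(X_i), \nabla P(X_j)\big\rangle \\
&= \tfrac{4}{n^2}\sum_{i,j} r_i r_j\,\big\langle\nabla P(X_i), \nabla P(X_j)\big\rangle + \tfrac{4}{n^2}\sum_{i=1}^n \big\|\nabla P(X_i)\big\|_2^2 \\
&= \Big\|\nabla \tfrac{1}{n}\sum_{i=1}^n r_i^2\Big\|_2^2 + \tfrac{4}{n^2}\sum_{i=1}^n \big\|\nabla P(X_i)\big\|_2^2,
\end{aligned}
$$

while the expectation in $$L$$ itself only contributes a constant: $$E_{z}\,\tfrac{1}{n}\sum_{i=1}^n (r_i - z_i)^2 = \tfrac{1}{n}\sum_{i=1}^n r_i^2 + 1$$.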
One way to interpret this is that, among the parameter configurations fitting the training data, we choose the one that minimizes the norms of the gradients of the neural network output with respect to the parameters. We may further write out and analyze $$\|\nabla P(X_i)\|_2^2$$ for the case of our neural network, and that will likely show why this regularizer is useful for solving our regression task. I will not do that here.
# Comparing with small initialization
If you read the [previous blog post](/2022/07/06/dln_classifier.html), you might wonder if we can apply the implicit bias given by small initialization to solve the regression task given here. Indeed we can! Simply scaling down the outputs of the neural network by a factor of 100, we get MSE = 0.00 with learning rate 0.01.
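As a rough sketch of what that scaling looks like for the toy model above (the factor of 100 is from the text; the function itself is my illustration):

```python
def predict_small(X, A, b):
    # Same quadratic network as before, with the output scaled down by a factor of 100,
    # so the function starts out much closer to zero for the same parameter scale.
    return 0.01 * (((X @ A) ** 2) @ b)
```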