Commit 8e2fcc1

Minor updates to noisy_reg
1 parent 17b0aef commit 8e2fcc1

File tree

1 file changed: +7 -5 lines changed


_posts/2022-07-22-noisy_reg.md

Lines changed: 7 additions & 5 deletions
@@ -27,7 +27,7 @@ X_test, y_test = X[n:,:], y[n:]
## The neural network
We will use a single hidden layer neural network with a quadratic activation function. The hidden layer will have $$m = 100$$ nodes. The parameters of this network are a $$d \times m$$ matrix $$A$$ and an $$m$$-dimensional vector $$b$$. For a $$d$$-dimensional data point $$X$$, the model predicts

-$$\text{predict}(X) = \sum_{i=1}^m b_i (A_i \cdot X)^2$$
+$$\text{predict}(X) = \sum_{i=1}^m b_i (A_i \cdot X)^2.$$

This neural network should be well suited to solve the regression task since it is straightforward to find values for the parameters that solve the task exactly; for example, we may pick $$A_{11} = b_1 = 1$$ and set everything else to zero.
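
For concreteness, here is a minimal NumPy sketch of this predictor. It is not the post's actual code: the input dimension $$d$$, the random seed, and the data are placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 100                      # d is a placeholder; the post uses m = 100 hidden nodes

A = rng.normal(size=(d, m))         # d x m weight matrix; column A_i feeds hidden node i
b = rng.normal(size=m)              # m-dimensional output weights

def predict(X, A, b):
    """Return sum_i b_i (A_i . X)^2 for each row of X (shape (n, d) or a single (d,) point)."""
    X = np.atleast_2d(X)
    return ((X @ A) ** 2) @ b       # quadratic activation, then linear readout

# The exact parameters mentioned in the post: with A[0, 0] = b[0] = 1 and everything
# else zero, predict(X) reduces to X[0] ** 2.
A_exact = np.zeros((d, m)); A_exact[0, 0] = 1.0
b_exact = np.zeros(m); b_exact[0] = 1.0
```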

@@ -79,6 +79,8 @@ for lr in lr_list:

For learning rates above 0.096 the optimization diverges, so we can't go higher. We see that all learning rates up to 0.03 give the same bad MSE, but after that larger learning rates improve performance. Interestingly, we can actually solve the task, but only if we choose learning rates right at the edge of divergence.

+It should be noted that while the above plot paints a deceptively simple picture, it is not true in general that a higher learning rate is better. The experiment seems robust against random seeds for initialization and data generation, but is quite fragile against changes in other hyperparameters such as the number of hidden neurons and the scale of initialization.
+
## Label noise
We can achieve a similar effect without huge learning rates by adding "label noise". Label noise is a form of *data augmentation*, where we generate more training data by modifying the original data. In each iteration of gradient descent we will replace the training targets with the original targets plus some noise. We choose standard normal noise.
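
As a rough sketch of what this looks like in code (again not the post's code: the gradients are written out by hand for the quadratic network above, and the data, initialization scale, and learning rate are placeholder assumptions):

```python
import numpy as np

def gd_step_with_label_noise(X, y, A, b, lr, rng):
    """One full-batch gradient descent step on the MSE, resampling standard normal
    label noise for the targets on every call."""
    n = X.shape[0]
    y_noisy = y + rng.standard_normal(n)            # fresh noise each iteration
    H = X @ A                                       # (n, m): H[k, i] = A_i . X_k
    resid = (H ** 2) @ b - y_noisy                  # (n,) prediction errors
    # Hand-derived gradients of (1/n) * sum_k (predict(X_k) - y_noisy_k)^2:
    grad_b = (2.0 / n) * (H ** 2).T @ resid
    grad_A = (4.0 / n) * X.T @ (resid[:, None] * H * b[None, :])
    return A - lr * grad_A, b - lr * grad_b

# Placeholder usage; the post's actual data generation and hyperparameters are not shown here.
rng = np.random.default_rng(0)
d, m, n = 10, 100, 50
X = rng.normal(size=(n, d))
y = X[:, 0] ** 2                                    # placeholder targets
A, b = 0.1 * rng.normal(size=(d, m)), 0.1 * rng.normal(size=m)
for _ in range(1000):
    A, b = gd_step_with_label_noise(X, y, A, b, lr=0.01, rng=rng)
```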

@@ -127,15 +129,15 @@ I found the neat derivation presented above in [Implicit Gradient Regularization
# Label noise
But wait, optimizing $$L$$ is the same as optimizing $$\tilde{L}$$, right? A minimizer of $$L$$ has gradient zero, so it is also a minimizer of $$\tilde{L}$$. Well, we are in the overparameterized setting, so the optimization path might change *which* minimizer we end up at. When we add label noise it becomes clearer.

-With label noise we have $$L = E_{z \sim \mathcal{N}(0,1)}\frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i-z)^2$$ and $$\tilde{L} = L + \frac{\eta}{4} E_{z \sim \mathcal{N}(0,1)}\|\nabla \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i-z)^2\|_2^2$$. After a bit of calculation we may simplify this to
+With label noise we have $$L = E_{z_i \sim \mathcal{N}(0,1)}\frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i-z_i)^2$$ and $$\tilde{L} = L + \frac{\eta}{4} E_{z_i \sim \mathcal{N}(0,1)}\|\nabla \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i-z_i)^2\|_2^2$$. After a bit of calculation we may simplify this to

-$$\tilde{L} = \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i)^2 + \eta\frac{1}{n}\sum_{i=1}^n \Big[\big((P(X_i)-y_i)\cdot \nabla P(X_i)\big)^2 + \|\nabla P(X_i)\|_2^2\Big] + 1.$$
+$$\tilde{L} = \frac{1}{n}\sum_{i=1}^n(P(X_i)-y_i)^2 + \eta\frac{1}{n^2}\sum_{i=1}^n \Big[\big((P(X_i)-y_i)\cdot \nabla P(X_i)\big)^2 + \|\nabla P(X_i)\|_2^2\Big] + 1.$$

The $$+1$$ doesn't matter for optimization, so we remove it. At convergence $$P(X_i) \approx y_i$$, so we have

-$$\tilde{L} \approx \eta\frac{1}{n}\sum_{i=1}^n \|\nabla P(X_i)\|_2^2$$
+$$\tilde{L} \approx \eta\frac{1}{n^2}\sum_{i=1}^n \|\nabla P(X_i)\|_2^2$$

-One way to interpret this is that among the parameter configurations fitting the training data, we choose the one minimizing the gradient of the neural network output with respect to the parameters. We may further write out and analyze $$\|\nabla P(X_i)\|_2^2$$ for the case of our neural network, and that will show why this regularizer is useful for solving our regression task. I will not do that here.
+One way to interpret this is that among the parameter configurations fitting the training data, we choose the one minimizing the gradient of the neural network output with respect to the parameters. We may further write out and analyze $$\|\nabla P(X_i)\|_2^2$$ for the case of our neural network, and that will likely show why this regularizer is useful for solving our regression task. I will not do that here.
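
One way to check the $$\frac{1}{n^2}$$ factor (a sketch of the calculation, not spelled out in the post): the gradient of the noisy loss is $$\frac{2}{n}\sum_{i=1}^n(P(X_i)-y_i-z_i)\nabla P(X_i)$$, and since the $$z_i$$ are independent with mean zero and unit variance,

$$E_{z_i \sim \mathcal{N}(0,1)}\Big\|\frac{2}{n}\sum_{i=1}^n(P(X_i)-y_i-z_i)\nabla P(X_i)\Big\|_2^2 = \Big\|\frac{2}{n}\sum_{i=1}^n(P(X_i)-y_i)\nabla P(X_i)\Big\|_2^2 + \frac{4}{n^2}\sum_{i=1}^n\|\nabla P(X_i)\|_2^2.$$

Multiplying by $$\frac{\eta}{4}$$ and using $$P(X_i) \approx y_i$$ at convergence leaves the $$\eta\frac{1}{n^2}\sum_{i=1}^n \|\nabla P(X_i)\|_2^2$$ term above.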

# Comparing with small initialization
If you read the [previous blog post](/2022/07/06/dln_classifier.html), you might wonder if we can apply the implicit bias given by small initialization to solve the regression task given here. Indeed you can! Simply scaling down the outputs of the neural network by a factor of 100, we get MSE = 0.00 with learning rate 0.01.
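
As a rough sketch of that change (placeholder shapes again; only the division by 100 and the idea of shrinking the output scale come from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 100                          # placeholder input dimension; m = 100 hidden nodes
A = rng.normal(size=(d, m))
b = rng.normal(size=m)

def predict_scaled(X, A, b):
    # Same quadratic network as before, with its output scaled down by a factor of 100.
    return (((np.atleast_2d(X) @ A) ** 2) @ b) / 100.0
```

Gradient descent then proceeds as before (the post reports MSE = 0.00 with learning rate 0.01 for this scaled network).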
