
Commit e8eb697

sync with computational astrophysics class
1 parent 6d78bd3 commit e8eb697

6 files changed: +204 additions, -180 deletions


content/11-machine-learning/gradient-descent.ipynb

Lines changed: 24 additions & 20 deletions
Large diffs are not rendered by default.

content/11-machine-learning/keras-clustering.ipynb

Lines changed: 129 additions & 129 deletions
Large diffs are not rendered by default.

content/11-machine-learning/neural-net-basics.md

Lines changed: 14 additions & 7 deletions
@@ -2,12 +2,17 @@
 
 ## Neural networks
 
-When we talk about machine learning, we often mean an [_artificial
+An [_artificial
 neural
-network_](https://en.wikipedia.org/wiki/Artificial_neural_network). A
-neural network mimics the action of neurons in your brain. We'll
+network_](https://en.wikipedia.org/wiki/Artificial_neural_network)
+mimics the action of neurons in your brain to form connections
+between nodes (neurons) that link the input to the output.
+
+```{note}
+We'll loosely
 follow the notation from _Computational Methods for Physics_ by
 Franklin.
+```
 
 Basic idea:

@@ -106,7 +111,9 @@ performance of the network.
 
 ## Basic algorithm
 
-
+We'll consider the case where we have training data---a set of inputs, ${\bf x}^k$,
+together with the expected output (answer), ${\bf y}^k$. These training pairs
+allow us to constrain the output of the network and train the weights.
 
 * Training

@@ -121,14 +128,14 @@
 This is a minimization problem, where we are minimizing:
 
 \begin{align*}
-f(A_{ij}) &= \| g({\bf A x}^k) - {\bf y}^k \|^2 \\
+\mathcal{L}(A_{ij}) &= \| g({\bf A x}^k) - {\bf y}^k \|^2 \\
 &= \sum_{i=1}^{N_\mathrm{out}} \left [ g\left (\sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j \right ) - y^k_i \right ]^2
 \end{align*}
 
-We call this function the _cost function_ or _loss function_.
+We call this function, $\mathcal{L}$, the _cost function_ or [loss function](https://en.wikipedia.org/wiki/Loss_function).
 
 ```{note}
-This is one possible choice for the cost function, $f(A_{ij})$, but [many others exist](https://en.wikipedia.org/wiki/Loss_function).
+This is called the _mean square error_ loss function, and is one possible choice for $\mathcal{L}(A_{ij})$, but [many others exist](https://en.wikipedia.org/wiki/Loss_function).
 ```
 
* Update the matrix ${\bf A}$ based on the training pair $({\bf x}^k, {\bf y^{k}})$.

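As a side note (not part of the commit), the loss written above, $\mathcal{L}(A_{ij}) = \| g({\bf A x}^k) - {\bf y}^k \|^2$, is easy to sketch in a few lines of NumPy. This is only an illustration: it assumes a sigmoid activation for $g$, and the names `g`, `loss`, `A`, `x`, `y` and the random single training pair are made up for the example.

```python
import numpy as np

def g(xi):
    # assumed activation: sigmoid, g(xi) = 1 / (1 + exp(-xi))
    return 1.0 / (1.0 + np.exp(-xi))

def loss(A, x, y):
    # L(A_ij) = || g(A x^k) - y^k ||^2 for a single training pair (x^k, y^k)
    z = g(A @ x)
    return np.sum((z - y)**2)

# illustrative data: 3 inputs -> 2 outputs, one training pair
rng = np.random.default_rng(12345)
A = rng.normal(size=(2, 3))
x = rng.normal(size=3)
y = np.array([0.0, 1.0])
print(loss(A, x, y))
```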
content/11-machine-learning/neural-net-derivation.md

Lines changed: 15 additions & 7 deletions
@@ -3,17 +3,23 @@
 For gradient descent, we need to derive the update to the matrix
 ${\bf A}$ based on training on a set of our data, $({\bf x}^k, {\bf y}^k)$.
 
+```{important}
+The derivation we do here is specific to our choice of loss function, $\mathcal{L}(A_{ij})$
+and activation function, $g(\xi)$.
+```
+
 Let's start with our cost function:
 
-$$f(A_{ij}) = \sum_{i=1}^{N_\mathrm{out}} (z_i - y_i^k)^2 = \sum_{i=1}^{N_\mathrm{out}}
+$$\mathcal{L}(A_{ij}) = \sum_{i=1}^{N_\mathrm{out}} (z_i - y_i^k)^2 = \sum_{i=1}^{N_\mathrm{out}}
 \Biggl [ g\biggl (\underbrace{\sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j}_{\equiv \alpha_i} \biggr ) - y^k_i \Biggr ]^2$$
 
-where we'll refer to the product ${\boldsymbol \alpha} \equiv {\bf Ax}$ to help simplify notation.
+where we'll refer to the product ${\boldsymbol \alpha} \equiv {\bf
+Ax}$ to help simplify notation. This means that ${\bf z} = g({\boldsymbol \alpha})$.
 
 We can compute the derivative with respect to a single matrix
 element, $A_{pq}$ by applying the chain rule:
 
-$$\frac{\partial f}{\partial A_{pq}} =
+$$\frac{\partial \mathcal{L}}{\partial A_{pq}} =
 2 \sum_{i=1}^{N_\mathrm{out}} (z_i - y^k_i) \left . \frac{\partial g}{\partial \xi} \right |_{\xi=\alpha_i} \frac{\partial \alpha_i}{\partial A_{pq}}$$
 

@@ -31,22 +37,24 @@ $$\frac{\partial g}{\partial \xi}
 which gives us:
 
 \begin{align*}
-\frac{\partial f}{\partial A_{pq}} &= 2 \sum_{i=1}^{N_\mathrm{out}}
+\frac{\partial \mathcal{L}}{\partial A_{pq}} &= 2 \sum_{i=1}^{N_\mathrm{out}}
 (z_i - y^k_i) z_i (1 - z_i) \delta_{ip} x^k_q \\
 &= 2 (z_p - y^k_p) z_p (1- z_p) x^k_q
 \end{align*}
 
 where we used the fact that the $\delta_{ip}$ means that only a single term contributes to the sum.
 
-Note that:
+```{note}
+Observe that:
 
 * $e_p^k \equiv (z_p - y_p^k)$ is the error on the output layer,
 and the correction is proportional to the error (as we would
 expect).
 
 * The $k$ superscripts here remind us that this is the result of
 only a single pair of data from the training set.
-
+```
+
 Now ${\bf z}$ and ${\bf y}^k$ are all vectors of size $N_\mathrm{out} \times 1$ and ${\bf x}^k$ is a vector of size $N_\mathrm{in} \times 1$, so we can write this expression for the matrix as a whole as:
 
 $$\frac{\partial f}{\partial {\bf A}} = 2 ({\bf z} - {\bf y}^k) \circ {\bf z} \circ (1 - {\bf z}) \cdot ({\bf x}^k)^\intercal$$
@@ -58,7 +66,7 @@ where the operator $\circ$ represents _element-by-element_ multiplication (the [
 We could do the update like we just saw with our gradient descent
 example: take a single data point, $({\bf x}^k, {\bf y}^k)$ and
 do the full minimization, continually estimating the correction,
-$\partial f/\partial {\bf A}$ and updating ${\bf A}$ until we
+$\partial \mathcal{L}/\partial {\bf A}$ and updating ${\bf A}$ until we
 reach a minimum. The problem with this is that $({\bf x}^k, {\bf y}^k)$ is only one point in our training data, and there is no
 guarantee that if we minimize completely with point $k$ that we will
 also be a minimum with point $k+1$.

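For reference (not part of the commit), the closed-form gradient above, $\partial \mathcal{L}/\partial {\bf A} = 2 ({\bf z} - {\bf y}^k) \circ {\bf z} \circ (1 - {\bf z}) \cdot ({\bf x}^k)^\intercal$, can be sketched directly in NumPy. The sigmoid activation and the names `grad_A` and `f` are assumptions for illustration; the finite-difference comparison at the end is just a sanity check on one matrix element.

```python
import numpy as np

def grad_A(A, x, y):
    """Gradient of L = ||g(A x) - y||^2 for one training pair,
    assuming the sigmoid activation so g' = g (1 - g)."""
    z = 1.0 / (1.0 + np.exp(-A @ x))       # z = g(alpha), alpha = A x
    e = z - y                              # error on the output layer
    # dL/dA = 2 (z - y) o z o (1 - z) . x^T   (o = element-wise product)
    return 2.0 * np.outer(e * z * (1.0 - z), x)

# illustrative check against a finite-difference derivative of one element
rng = np.random.default_rng(12345)
A = rng.normal(size=(2, 3))
x = rng.normal(size=3)
y = np.array([0.0, 1.0])

f = lambda M: np.sum((1.0 / (1.0 + np.exp(-M @ x)) - y)**2)
eps = 1.e-6
Ap = A.copy()
Ap[0, 1] += eps
print(grad_A(A, x, y)[0, 1], (f(Ap) - f(A)) / eps)
```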
content/11-machine-learning/neural-net-hidden.md

Lines changed: 8 additions & 8 deletions
@@ -29,7 +29,7 @@ do here generalizes to multiple hidden layers.
 ```
 
 \begin{equation}
-f(A_{lm}, B_{ij}) = \sum_{l=1}^{N_\mathrm{out}} (z_l - y^k_l)^2
+\mathcal{L}(A_{lm}, B_{ij}) = \sum_{l=1}^{N_\mathrm{out}} (z_l - y^k_l)^2
 \end{equation}
 
 $$\tilde{z}_i = g \biggl ( \underbrace{\sum_{j=1}^{N_\mathrm{in}} B_{ij} x^k_j}_{\equiv \beta_i} \biggr )$$
@@ -46,7 +46,7 @@ directly, ${\bf e}^k = {\bf z} - {\bf y}^k$. As a result, we can just use
 the result that we got for a single layer, but now the input is $\tilde{\bf z}$
 instead of ${\bf x}$:
 
-$$\frac{\partial f}{\partial {\bf A}} = 2 {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot \tilde{\bf z}^\intercal$$
+$$\frac{\partial \mathcal{L}}{\partial {\bf A}} = 2 {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot \tilde{\bf z}^\intercal$$
 
 ## Updates to ${\bf B}$

@@ -59,15 +59,15 @@ hidden layer—a process called _backpropagation_.
 Let's start with our cost function:
 
 \begin{align*}
-f(A_{lm}, B_{ij}) &= \sum_{l=1}^{N_\mathrm{out}} (z_l - y^k_l)^2 \\
+\mathcal{L}(A_{lm}, B_{ij}) &= \sum_{l=1}^{N_\mathrm{out}} (z_l - y^k_l)^2 \\
 &= \sum_{l=1}^{N_\mathrm{out}} \Biggl [ g \biggl ( \sum_{m=1}^{N_\mathrm{hidden}} A_{lm} \tilde{z}_m \biggr ) - y_l^k \Biggr ]^2 \\
 &= \sum_{l=1}^{N_\mathrm{out}} \Biggl [ g \biggl ( \sum_{m=1}^{N_\mathrm{hidden}} A_{lm} \,g \biggl ( \sum_{j=1}^{N_\mathrm{in}} B_{mj} x_j^k \biggr ) \biggr ) - y_l^k \Biggr ]^2
 \end{align*}
 
 Differentiating with respect to an element in matrix ${\bf B}$, we apply the chain rule over and over,
 giving:
 
-$$\frac{\partial f}{\partial B_{pq}} = 2 \sum_{l=1}^{N_\mathrm{out}} (z_l - y_l^k)
+$$\frac{\partial \mathcal{L}}{\partial B_{pq}} = 2 \sum_{l=1}^{N_\mathrm{out}} (z_l - y_l^k)
 \left .\frac{\partial g}{\partial \xi} \right |_{\xi = \alpha_l}
 \sum_{m=1}^{N_\mathrm{hidden}} A_{lm}\, \left . \frac{\partial g}{\partial \xi} \right |_{\xi = \beta_m}
 \sum_{j=1}^{N_\mathrm{in}} \frac{\partial B_{mj}}{\partial B_{pq}} x_j^k $$
@@ -85,7 +85,7 @@ $$\frac{\partial B_{mj}}{\partial B_{pq}} = \delta_{mp} \delta_{jq}$$
 
 Inserting these derivatives and using the $\delta$'s, we are left with:
 
-$$\frac{\partial f}{\partial B_{pq}} = 2 \sum_{l=1}^{N_\mathrm{out}}
+$$\frac{\partial \mathcal{L}}{\partial B_{pq}} = 2 \sum_{l=1}^{N_\mathrm{out}}
 \underbrace{(z_l - y_l^k)}_{ = e_l^k} z_l (1 - z_l) A_{lp} \tilde{z}_p (1 - \tilde{z}_p) x^k_q$$
 
 Now, that remaining sum is contracting on the first of the indices of
@@ -97,14 +97,14 @@ $$\tilde{e}_p^k = \sum_{l=1}^{N_\mathrm{out}} e_l^k z_l (1 - z_l) A_{lp}
 
 and we can write
 
-$$\frac{\partial f}{\partial {\bf B}} = 2 \tilde{\bf e}^k \circ \tilde{\bf z} \circ (1 - \tilde{\bf z}) \cdot ({\bf x}^k)^\intercal$$
+$$\frac{\partial \mathcal{L}}{\partial {\bf B}} = 2 \tilde{\bf e}^k \circ \tilde{\bf z} \circ (1 - \tilde{\bf z}) \cdot ({\bf x}^k)^\intercal$$
 
 
 Notice the symmetry in the update of each matrix:
 
 \begin{align*}
-\frac{\partial f}{\partial {\bf A}} &= 2 {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot \tilde{\bf z}^\intercal \\
-\frac{\partial f}{\partial {\bf B}} &= 2 \tilde{\bf e}^k \circ \tilde{\bf z} \circ (1 - \tilde{\bf z}) \cdot ({\bf x}^k)^\intercal
+\frac{\partial \mathcal{L}}{\partial {\bf A}} &= 2 {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot \tilde{\bf z}^\intercal \\
+\frac{\partial \mathcal{L}}{\partial {\bf B}} &= 2 \tilde{\bf e}^k \circ \tilde{\bf z} \circ (1 - \tilde{\bf z}) \cdot ({\bf x}^k)^\intercal
 \end{align*}
 
 Adding additional hidden layers would continue the trend, with each hidden layer's matrix update depending

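As a rough companion to the symmetric expressions above (not part of the commit), here is a minimal sketch of the two gradients for a single hidden layer. It assumes the sigmoid activation, and the names `gradients`, `z_tilde`, and `e_tilde` are illustrative only.

```python
import numpy as np

def g(xi):
    # assumed sigmoid activation
    return 1.0 / (1.0 + np.exp(-xi))

def gradients(A, B, x, y):
    """Gradients of L = ||g(A g(B x)) - y||^2 for one training pair,
    following the expressions above (sigmoid activation assumed)."""
    z_tilde = g(B @ x)                    # hidden layer output
    z = g(A @ z_tilde)                    # output layer
    e = z - y                             # output error, e^k
    e_tilde = A.T @ (e * z * (1.0 - z))   # backpropagated error, e~^k
    dLdA = 2.0 * np.outer(e * z * (1.0 - z), z_tilde)
    dLdB = 2.0 * np.outer(e_tilde * z_tilde * (1.0 - z_tilde), x)
    return dLdA, dLdB
```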
content/11-machine-learning/neural-net-improvements.md

Lines changed: 14 additions & 9 deletions
@@ -12,12 +12,12 @@ Right now, we did our training as:
 * Loop over the $T$ pairs $({\bf x}^k, {\bf y}^k)$ for $k = 1, \ldots, T$
 
 * Propagate $({\bf x}^k, {\bf y}^k)$ through the network
-* Compute the corrections $\partial f/\partial {\bf A}$, $\partial f/\partial {\bf B}$
+* Compute the corrections $\partial \mathcal{L}/\partial {\bf A}$, $\partial \mathcal{L}/\partial {\bf B}$
 * Update the matrices:
 
-$${\bf A} \leftarrow {\bf A} + \eta \frac{\partial f}{\partial {\bf A}}$$
+$${\bf A} \leftarrow {\bf A} - \eta \frac{\partial \mathcal{L}}{\partial {\bf A}}$$
 
-$${\bf B} \leftarrow {\bf B} + \eta \frac{\partial f}{\partial {\bf B}}$$
+$${\bf B} \leftarrow {\bf B} - \eta \frac{\partial \mathcal{L}}{\partial {\bf B}}$$
 
 In this manner, each training pair sees slightly different
 matrices ${\bf A}$ and ${\bf B}$, as each previous pair
@@ -31,19 +31,24 @@ each with $\tau = T/N$ training pairs and do our update as:
 * Loop over the $\tau$ pairs $({\bf x}^k, {\bf y}^k)$ for $k = 1, \ldots, \tau$ in the current batch
 
 * Propagate $({\bf x}^k, {\bf y}^k)$ through the network
-* Compute the corrections $\partial f/\partial {\bf A}^k$, $\partial f/\partial {\bf B}^k$ from the current pair
+* Compute the gradients $\partial \mathcal{L}/\partial {\bf A}^k$, $\partial \mathcal{L}/\partial {\bf B}^k$ from the current pair
 
-* Accumulate the corrections:
+* Accumulate the gradients:
 
-$$\frac{\partial f}{\partial {\bf A}} = \frac{\partial f}{\partial {\bf A}} + \frac{\partial f}{\partial {\bf A}^k}$$
+$$\frac{\partial \mathcal{L}}{\partial {\bf A}} = \frac{\partial \mathcal{L}}{\partial {\bf A}} + \frac{\partial \mathcal{L}}{\partial {\bf A}^k}$$
 
-$$\frac{\partial f}{\partial {\bf B}} = \frac{\partial f}{\partial {\bf B}} + \frac{\partial f}{\partial {\bf B}^k}$$
+$$\frac{\partial \mathcal{L}}{\partial {\bf B}} = \frac{\partial \mathcal{L}}{\partial {\bf B}} + \frac{\partial \mathcal{L}}{\partial {\bf B}^k}$$
 
 * Apply a single update to the matrices for this batch:
 
-$${\bf A} \leftarrow {\bf A} + \eta \frac{\partial f}{\partial {\bf A}}$$
+$${\bf A} \leftarrow {\bf A} - \frac{\eta}{\tau} \frac{\partial \mathcal{L}}{\partial {\bf A}}$$
 
-$${\bf B} \leftarrow {\bf B} + \eta \frac{\partial f}{\partial {\bf B}}$$
+$${\bf B} \leftarrow {\bf B} - \frac{\eta}{\tau} \frac{\partial \mathcal{L}}{\partial {\bf B}}$$
+
+```{note}
+We normalize the accumulated gradients by the batch size, $\tau$, which means that
+we are applying the average gradient over the batch.
+```
 
 The advantage of this is that the $\tau$ trainings in a batch
 can all be done in parallel now, spread across many CPU cores

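To make the batched update concrete (not part of the commit), here is a rough Python sketch of the loop described above. The function name `train_batched`, the `gradients` helper from the earlier sketch, and the values of `eta`, `n_batches`, and `n_epochs` are all illustrative assumptions.

```python
import numpy as np

def train_batched(A, B, x_train, y_train, eta=0.1, n_batches=10, n_epochs=100):
    """Minimal sketch of batched gradient descent: accumulate the gradients
    over each batch, then apply a single averaged update per batch.
    Relies on a gradients(A, B, x, y) function like the sketch above."""
    T = len(x_train)
    tau = T // n_batches                      # training pairs per batch
    for epoch in range(n_epochs):
        for n in range(n_batches):
            dLdA = np.zeros_like(A)
            dLdB = np.zeros_like(B)
            # loop over the tau pairs in the current batch
            for k in range(n * tau, (n + 1) * tau):
                dA_k, dB_k = gradients(A, B, x_train[k], y_train[k])
                dLdA += dA_k                  # accumulate the gradients
                dLdB += dB_k
            # one update per batch, normalized by the batch size
            A -= (eta / tau) * dLdA
            B -= (eta / tau) * dLdB
    return A, B
```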