
Commit e8eb697

sync with computational astrophysics class
1 parent 6d78bd3 commit e8eb697

6 files changed: +204 additions, -180 deletions


content/11-machine-learning/gradient-descent.ipynb

Lines changed: 24 additions & 20 deletions
Large diffs are not rendered by default.

content/11-machine-learning/keras-clustering.ipynb

Lines changed: 129 additions & 129 deletions
Large diffs are not rendered by default.

content/11-machine-learning/neural-net-basics.md

Lines changed: 14 additions & 7 deletions
@@ -2,12 +2,17 @@
 
 ## Neural networks
 
-When we talk about machine learning, we often mean an [_artificial
+An [_artificial
 neural
-network_](https://en.wikipedia.org/wiki/Artificial_neural_network). A
-neural network mimics the action of neurons in your brain. We'll
+network_](https://en.wikipedia.org/wiki/Artificial_neural_network)
+mimics the action of neurons in your brain to form connections
+between nodes (neurons) that link the input to the output.
+
+```{note}
+We'll loosely
 follow the notation from _Computational Methods for Physics_ by
 Franklin.
+```
 
 Basic idea:

@@ -106,7 +111,9 @@ performance of the network.
 
 ## Basic algorithm
 
-
+We'll consider the case where we have training data---a set of inputs, ${\bf x}^k$,
+together with the expected output (answer), ${\bf y}^k$. These training pairs
+allow us to constrain the output of the network and train the weights.
 
 * Training

@@ -121,14 +128,14 @@
 This is a minimization problem, where we are minimizing:
 
 \begin{align*}
-f(A_{ij}) &= \| g({\bf A x}^k) - {\bf y}^k \|^2 \\
+\mathcal{L}(A_{ij}) &= \| g({\bf A x}^k) - {\bf y}^k \|^2 \\
 &= \sum_{i=1}^{N_\mathrm{out}} \left [ g\left (\sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j \right ) - y^k_i \right ]^2
 \end{align*}
 
-We call this function the _cost function_ or _loss function_.
+We call this function, $\mathcal{L}$, the _cost function_ or [loss function](https://en.wikipedia.org/wiki/Loss_function).
 
 ```{note}
-This is one possible choice for the cost function, $f(A_{ij})$, but [many others exist](https://en.wikipedia.org/wiki/Loss_function).
+This is called the _mean square error_ loss function, and is one possible choice for $\mathcal{L}(A_{ij})$, but [many others exist](https://en.wikipedia.org/wiki/Loss_function).
 ```
 
* Update the matrix ${\bf A}$ based on the training pair $({\bf x}^k, {\bf y^{k}})$.

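As a side note (not part of the commit), the loss written above, $\mathcal{L}(A_{ij}) = \| g({\bf A x}^k) - {\bf y}^k \|^2$, is easy to sketch in a few lines of NumPy. This is only an illustration: it assumes a sigmoid activation for $g$, and the names `g`, `loss`, `A`, `x`, `y` and the random single training pair are made up for the example.

```python
import numpy as np

def g(xi):
    # assumed activation: sigmoid, g(xi) = 1 / (1 + exp(-xi))
    return 1.0 / (1.0 + np.exp(-xi))

def loss(A, x, y):
    # L(A_ij) = || g(A x^k) - y^k ||^2 for a single training pair (x^k, y^k)
    z = g(A @ x)
    return np.sum((z - y)**2)

# illustrative data: 3 inputs -> 2 outputs, one training pair
rng = np.random.default_rng(12345)
A = rng.normal(size=(2, 3))
x = rng.normal(size=3)
y = np.array([0.0, 1.0])
print(loss(A, x, y))
```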
content/11-machine-learning/neural-net-derivation.md

Lines changed: 15 additions & 7 deletions
@@ -3,17 +3,23 @@
 For gradient descent, we need to derive the update to the matrix
 ${\bf A}$ based on training on a set of our data, $({\bf x}^k, {\bf y}^k)$.
 
+```{important}
+The derivation we do here is specific to our choice of loss function, $\mathcal{L}(A_{ij})$
+and activation function, $g(\xi)$.
+```
+
 Let's start with our cost function:
 
-$$f(A_{ij}) = \sum_{i=1}^{N_\mathrm{out}} (z_i - y_i^k)^2 = \sum_{i=1}^{N_\mathrm{out}}
+$$\mathcal{L}(A_{ij}) = \sum_{i=1}^{N_\mathrm{out}} (z_i - y_i^k)^2 = \sum_{i=1}^{N_\mathrm{out}}
 \Biggl [ g\biggl (\underbrace{\sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j}_{\equiv \alpha_i} \biggr ) - y^k_i \Biggr ]^2$$
 
-where we'll refer to the product ${\boldsymbol \alpha} \equiv {\bf Ax}$ to help simplify notation.
+where we'll refer to the product ${\boldsymbol \alpha} \equiv {\bf
+Ax}$ to help simplify notation. This means that ${\bf z} = g({\boldsymbol \alpha})$.
 
 We can compute the derivative with respect to a single matrix
 element, $A_{pq}$ by applying the chain rule:
 
-$$\frac{\partial f}{\partial A_{pq}} =
+$$\frac{\partial \mathcal{L}}{\partial A_{pq}} =
 2 \sum_{i=1}^{N_\mathrm{out}} (z_i - y^k_i) \left . \frac{\partial g}{\partial \xi} \right |_{\xi=\alpha_i} \frac{\partial \alpha_i}{\partial A_{pq}}$$
 

@@ -31,22 +37,24 @@ $$\frac{\partial g}{\partial \xi}
 which gives us:
 
 \begin{align*}
-\frac{\partial f}{\partial A_{pq}} &= 2 \sum_{i=1}^{N_\mathrm{out}}
+\frac{\partial \mathcal{L}}{\partial A_{pq}} &= 2 \sum_{i=1}^{N_\mathrm{out}}
 (z_i - y^k_i) z_i (1 - z_i) \delta_{ip} x^k_q \\
 &= 2 (z_p - y^k_p) z_p (1- z_p) x^k_q
 \end{align*}
 
 where we used the fact that the $\delta_{ip}$ means that only a single term contributes to the sum.
 
-Note that:
+```{note}
+Observe that:
 
 * $e_p^k \equiv (z_p - y_p^k)$ is the error on the output layer,
 and the correction is proportional to the error (as we would
 expect).
 
 * The $k$ superscripts here remind us that this is the result of
 only a single pair of data from the training set.
-
+```
+
 Now ${\bf z}$ and ${\bf y}^k$ are all vectors of size $N_\mathrm{out} \times 1$ and ${\bf x}^k$ is a vector of size $N_\mathrm{in} \times 1$, so we can write this expression for the matrix as a whole as:
 
 $$\frac{\partial f}{\partial {\bf A}} = 2 ({\bf z} - {\bf y}^k) \circ {\bf z} \circ (1 - {\bf z}) \cdot ({\bf x}^k)^\intercal$$
@@ -58,7 +66,7 @@ where the operator $\circ$ represents _element-by-element_ multiplication (the [
 We could do the update like we just saw with our gradient descent
 example: take a single data point, $({\bf x}^k, {\bf y}^k)$ and
 do the full minimization, continually estimating the correction,
-$\partial f/\partial {\bf A}$ and updating ${\bf A}$ until we
+$\partial \mathcal{L}/\partial {\bf A}$ and updating ${\bf A}$ until we
 reach a minimum. The problem with this is that $({\bf x}^k, {\bf y}^k)$ is only one point in our training data, and there is no
 guarantee that if we minimize completely with point $k$ that we will
 also be a minimum with point $k+1$.

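For reference (not part of the commit), the closed-form gradient above, $\partial \mathcal{L}/\partial {\bf A} = 2 ({\bf z} - {\bf y}^k) \circ {\bf z} \circ (1 - {\bf z}) \cdot ({\bf x}^k)^\intercal$, can be sketched directly in NumPy. The sigmoid activation and the names `grad_A` and `f` are assumptions for illustration; the finite-difference comparison at the end is just a sanity check on one matrix element.

```python
import numpy as np

def grad_A(A, x, y):
    """Gradient of L = ||g(A x) - y||^2 for one training pair,
    assuming the sigmoid activation so g' = g (1 - g)."""
    z = 1.0 / (1.0 + np.exp(-A @ x))       # z = g(alpha), alpha = A x
    e = z - y                              # error on the output layer
    # dL/dA = 2 (z - y) o z o (1 - z) . x^T   (o = element-wise product)
    return 2.0 * np.outer(e * z * (1.0 - z), x)

# illustrative check against a finite-difference derivative of one element
rng = np.random.default_rng(12345)
A = rng.normal(size=(2, 3))
x = rng.normal(size=3)
y = np.array([0.0, 1.0])

f = lambda M: np.sum((1.0 / (1.0 + np.exp(-M @ x)) - y)**2)
eps = 1.e-6
Ap = A.copy()
Ap[0, 1] += eps
print(grad_A(A, x, y)[0, 1], (f(Ap) - f(A)) / eps)
```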
content/11-machine-learning/neural-net-hidden.md

Lines changed: 8 additions & 8 deletions
@@ -29,7 +29,7 @@ do here generalizes to multiple hidden layers.
 ```
 
 \begin{equation}
-f(A_{lm}, B_{ij}) = \sum_{l=1}^{N_\mathrm{out}} (z_l - y^k_l)^2
+\mathcal{L}(A_{lm}, B_{ij}) = \sum_{l=1}^{N_\mathrm{out}} (z_l - y^k_l)^2
 \end{equation}
 
 $$\tilde{z}_i = g \biggl ( \underbrace{\sum_{j=1}^{N_\mathrm{in}} B_{ij} x^k_j}_{\equiv \beta_i} \biggr )$$
@@ -46,7 +46,7 @@ directly, ${\bf e}^k = {\bf z} - {\bf y}^k$. As a result, we can just use
 the result that we got for a single layer, but now the input is $\tilde{\bf z}$
 instead of ${\bf x}$:
 
-$$\frac{\partial f}{\partial {\bf A}} = 2 {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot \tilde{\bf z}^\intercal$$
+$$\frac{\partial \mathcal{L}}{\partial {\bf A}} = 2 {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot \tilde{\bf z}^\intercal$$
 
 ## Updates to ${\bf B}$

@@ -59,15 +59,15 @@ hidden layer—a process called _backpropagation_.
 Let's start with our cost function:
 
 \begin{align*}
-f(A_{lm}, B_{ij}) &= \sum_{l=1}^{N_\mathrm{out}} (z_l - y^k_l)^2 \\
+\mathcal{L}(A_{lm}, B_{ij}) &= \sum_{l=1}^{N_\mathrm{out}} (z_l - y^k_l)^2 \\
 &= \sum_{l=1}^{N_\mathrm{out}} \Biggl [ g \biggl ( \sum_{m=1}^{N_\mathrm{hidden}} A_{lm} \tilde{z}_m \biggr ) - y_l^k \Biggr ]^2 \\
 &= \sum_{l=1}^{N_\mathrm{out}} \Biggl [ g \biggl ( \sum_{m=1}^{N_\mathrm{hidden}} A_{lm} \,g \biggl ( \sum_{j=1}^{N_\mathrm{in}} B_{mj} x_j^k \biggr ) \biggr ) - y_l^k \Biggr ]^2
 \end{align*}
 
 Differentiating with respect to an element in matrix ${\bf B}$, we apply the chain rule over and over,
 giving:
 
-$$\frac{\partial f}{\partial B_{pq}} = 2 \sum_{l=1}^{N_\mathrm{out}} (z_l - y_l^k)
+$$\frac{\partial \mathcal{L}}{\partial B_{pq}} = 2 \sum_{l=1}^{N_\mathrm{out}} (z_l - y_l^k)
 \left .\frac{\partial g}{\partial \xi} \right |_{\xi = \alpha_l}
 \sum_{m=1}^{N_\mathrm{hidden}} A_{lm}\, \left . \frac{\partial g}{\partial \xi} \right |_{\xi = \beta_m}
 \sum_{j=1}^{N_\mathrm{in}} \frac{\partial B_{mj}}{\partial B_{pq}} x_j^k $$
@@ -85,7 +85,7 @@ $$\frac{\partial B_{mj}}{\partial B_{pq}} = \delta_{mp} \delta_{jq}$$
 
 Inserting these derivatives and using the $\delta$'s, we are left with:
 
-$$\frac{\partial f}{\partial B_{pq}} = 2 \sum_{l=1}^{N_\mathrm{out}}
+$$\frac{\partial \mathcal{L}}{\partial B_{pq}} = 2 \sum_{l=1}^{N_\mathrm{out}}
 \underbrace{(z_l - y_l^k)}_{ = e_l^k} z_l (1 - z_l) A_{lp} \tilde{z}_p (1 - \tilde{z}_p) x^k_q$$
 
 Now, that remaining sum is contracting on the first of the indices of
@@ -97,14 +97,14 @@ $$\tilde{e}_p^k = \sum_{l=1}^{N_\mathrm{out}} e_l^k z_l (1 - z_l) A_{lp}
 
 and we can write
 
-$$\frac{\partial f}{\partial {\bf B}} = 2 \tilde{\bf e}^k \circ \tilde{\bf z} \circ (1 - \tilde{\bf z}) \cdot ({\bf x}^k)^\intercal$$
+$$\frac{\partial \mathcal{L}}{\partial {\bf B}} = 2 \tilde{\bf e}^k \circ \tilde{\bf z} \circ (1 - \tilde{\bf z}) \cdot ({\bf x}^k)^\intercal$$
 
 
 Notice the symmetry in the update of each matrix:
 
 \begin{align*}
-\frac{\partial f}{\partial {\bf A}} &= 2 {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot \tilde{\bf z}^\intercal \\
-\frac{\partial f}{\partial {\bf B}} &= 2 \tilde{\bf e}^k \circ \tilde{\bf z} \circ (1 - \tilde{\bf z}) \cdot ({\bf x}^k)^\intercal
+\frac{\partial \mathcal{L}}{\partial {\bf A}} &= 2 {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot \tilde{\bf z}^\intercal \\
+\frac{\partial \mathcal{L}}{\partial {\bf B}} &= 2 \tilde{\bf e}^k \circ \tilde{\bf z} \circ (1 - \tilde{\bf z}) \cdot ({\bf x}^k)^\intercal
 \end{align*}
 
 Adding additional hidden layers would continue the trend, with each hidden layer's matrix update depending

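As a rough companion to the symmetric expressions above (not part of the commit), here is a minimal sketch of the two gradients for a single hidden layer. It assumes the sigmoid activation, and the names `gradients`, `z_tilde`, and `e_tilde` are illustrative only.

```python
import numpy as np

def g(xi):
    # assumed sigmoid activation
    return 1.0 / (1.0 + np.exp(-xi))

def gradients(A, B, x, y):
    """Gradients of L = ||g(A g(B x)) - y||^2 for one training pair,
    following the expressions above (sigmoid activation assumed)."""
    z_tilde = g(B @ x)                    # hidden layer output
    z = g(A @ z_tilde)                    # output layer
    e = z - y                             # output error, e^k
    e_tilde = A.T @ (e * z * (1.0 - z))   # backpropagated error, e~^k
    dLdA = 2.0 * np.outer(e * z * (1.0 - z), z_tilde)
    dLdB = 2.0 * np.outer(e_tilde * z_tilde * (1.0 - z_tilde), x)
    return dLdA, dLdB
```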
content/11-machine-learning/neural-net-improvements.md

Lines changed: 14 additions & 9 deletions
@@ -12,12 +12,12 @@ Right now, we did our training as:
 * Loop over the $T$ pairs $({\bf x}^k, {\bf y}^k)$ for $k = 1, \ldots, T$
 
 * Propagate $({\bf x}^k, {\bf y}^k)$ through the network
-* Compute the corrections $\partial f/\partial {\bf A}$, $\partial f/\partial {\bf B}$
+* Compute the corrections $\partial \mathcal{L}/\partial {\bf A}$, $\partial \mathcal{L}/\partial {\bf B}$
 * Update the matrices:
 
-$${\bf A} \leftarrow {\bf A} + \eta \frac{\partial f}{\partial {\bf A}}$$
+$${\bf A} \leftarrow {\bf A} - \eta \frac{\partial \mathcal{L}}{\partial {\bf A}}$$
 
-$${\bf B} \leftarrow {\bf B} + \eta \frac{\partial f}{\partial {\bf B}}$$
+$${\bf B} \leftarrow {\bf B} - \eta \frac{\partial \mathcal{L}}{\partial {\bf B}}$$
 
 In this manner, each training pair sees slightly different
 matrices ${\bf A}$ and ${\bf B}$, as each previous pair
@@ -31,19 +31,24 @@ each with $\tau = T/N$ training pairs and do our update as:
 * Loop over the $\tau$ pairs $({\bf x}^k, {\bf y}^k)$ for $k = 1, \ldots, \tau$ in the current batch
 
 * Propagate $({\bf x}^k, {\bf y}^k)$ through the network
-* Compute the corrections $\partial f/\partial {\bf A}^k$, $\partial f/\partial {\bf B}^k$ from the current pair
+* Compute the gradients $\partial \mathcal{L}/\partial {\bf A}^k$, $\partial \mathcal{L}/\partial {\bf B}^k$ from the current pair
 
-* Accumulate the corrections:
+* Accumulate the gradients:
 
-$$\frac{\partial f}{\partial {\bf A}} = \frac{\partial f}{\partial {\bf A}} + \frac{\partial f}{\partial {\bf A}^k}$$
+$$\frac{\partial \mathcal{L}}{\partial {\bf A}} = \frac{\partial \mathcal{L}}{\partial {\bf A}} + \frac{\partial \mathcal{L}}{\partial {\bf A}^k}$$
 
-$$\frac{\partial f}{\partial {\bf B}} = \frac{\partial f}{\partial {\bf B}} + \frac{\partial f}{\partial {\bf B}^k}$$
+$$\frac{\partial \mathcal{L}}{\partial {\bf B}} = \frac{\partial \mathcal{L}}{\partial {\bf B}} + \frac{\partial \mathcal{L}}{\partial {\bf B}^k}$$
 
 * Apply a single update to the matrices for this batch:
 
-$${\bf A} \leftarrow {\bf A} + \eta \frac{\partial f}{\partial {\bf A}}$$
+$${\bf A} \leftarrow {\bf A} - \frac{\eta}{\tau} \frac{\partial \mathcal{L}}{\partial {\bf A}}$$
 
-$${\bf B} \leftarrow {\bf B} + \eta \frac{\partial f}{\partial {\bf B}}$$
+$${\bf B} \leftarrow {\bf B} - \frac{\eta}{\tau} \frac{\partial \mathcal{L}}{\partial {\bf B}}$$
+
+```{note}
+We normalize the accumulated gradients by the batch size, $\tau$, which means that
+we are applying the average gradient over the batch.
+```
 
 The advantage of this is that the $\tau$ trainings in a batch
 can all be done in parallel now, spread across many CPU cores

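To make the batched update concrete (not part of the commit), here is a rough Python sketch of the loop described above. The function name `train_batched`, the `gradients` helper from the earlier sketch, and the values of `eta`, `n_batches`, and `n_epochs` are all illustrative assumptions.

```python
import numpy as np

def train_batched(A, B, x_train, y_train, eta=0.1, n_batches=10, n_epochs=100):
    """Minimal sketch of batched gradient descent: accumulate the gradients
    over each batch, then apply a single averaged update per batch.
    Relies on a gradients(A, B, x, y) function like the sketch above."""
    T = len(x_train)
    tau = T // n_batches                      # training pairs per batch
    for epoch in range(n_epochs):
        for n in range(n_batches):
            dLdA = np.zeros_like(A)
            dLdB = np.zeros_like(B)
            # loop over the tau pairs in the current batch
            for k in range(n * tau, (n + 1) * tau):
                dA_k, dB_k = gradients(A, B, x_train[k], y_train[k])
                dLdA += dA_k                  # accumulate the gradients
                dLdB += dB_k
            # one update per batch, normalized by the batch size
            A -= (eta / tau) * dLdA
            B -= (eta / tau) * dLdB
    return A, B
```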