Deep Learning Recap
Spring 2024
Many materials from CSE447@UW (Taylor Sorensen) and CS224n@Stanford with special thanks!
Announcements
Lecture plan
(Very quick) Deep learning review
Essential elements:
• Input: vector x ∈ R^d, Output: y ∈ R^k
• Weight matrices W_i ∈ R^{d_{i+1}×d_i} and bias terms b_i ∈ R^{d_{i+1}}
• Example two-layer network: ŷ = W2 σ(W1 x + b1) + b2,
  where x ∈ R^3, ŷ ∈ R^2, W1 ∈ R^{4×3}, W2 ∈ R^{2×4}, b1 ∈ R^4, and b2 ∈ R^2
• Set of all parameters is often referred to as θ

https://en.wikipedia.org/wiki/Artificial_neural_network
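To make the shapes concrete, here is a minimal NumPy sketch of the two-layer example above; the ReLU choice for σ and the random initialization are illustrative assumptions, not part of the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes from the example: x in R^3, hidden layer of size 4, y-hat in R^2
d_in, d_hidden, d_out = 3, 4, 2

# Parameters theta = {W1, b1, W2, b2}, randomly initialized for illustration
W1 = rng.normal(size=(d_hidden, d_in))   # R^{4x3}
b1 = rng.normal(size=d_hidden)           # R^4
W2 = rng.normal(size=(d_out, d_hidden))  # R^{2x4}
b2 = rng.normal(size=d_out)              # R^2

def forward(x):
    """y_hat = W2 @ sigma(W1 @ x + b1) + b2, with sigma = ReLU (an assumption)."""
    h = np.maximum(0.0, W1 @ x + b1)  # hidden activations, R^4
    return W2 @ h + b2                # output, R^2

x = rng.normal(size=d_in)
print(forward(x).shape)  # (2,)
```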
Common non-linearities σ:
• Sigmoid: σ(x) = 1 / (1 + e^{−x})
• ReLU: max(0, x)
• Tanh: tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})
• GeLU: GeLU(x) = (x/2) · (1 + erf(x / √2))
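A small NumPy sketch of these four functions (assuming SciPy is available for the error function erf used by the exact GeLU):

```python
import numpy as np
from scipy.special import erf  # Gaussian error function, used by the exact GeLU

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))  # identical to np.tanh(x)

def gelu(x):
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

xs = np.linspace(-3.0, 3.0, 7)
for f in (sigmoid, relu, tanh, gelu):
    print(f.__name__, np.round(f(xs), 3))
```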
Learning Problem:

θ̂ = argmin_θ (1/N) Σ_{i=1}^{N} L(y^(i), ŷ^(i) = f_θ(x^(i)))
Common choices of loss function L:
• Mean Squared Error/L2 loss: L_2(y, ŷ) = ||y − ŷ||_2^2 = Σ_i (y_i − ŷ_i)^2
• Mean Absolute Error/L1 loss: L_1(y, ŷ) = ||y − ŷ||_1 = Σ_i |y_i − ŷ_i|
• 2-way classification: cross-entropy L_CE(y, ŷ) = −Σ_i y_i log(ŷ_i)  (Very related to perplexity!)
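A NumPy sketch of these three losses for a single example (the function names and the small eps guard inside the log are my own choices):

```python
import numpy as np

def l2_loss(y, y_hat):
    """Squared error: ||y - y_hat||_2^2 = sum_i (y_i - y_hat_i)^2."""
    return np.sum((y - y_hat) ** 2)

def l1_loss(y, y_hat):
    """Absolute error: ||y - y_hat||_1 = sum_i |y_i - y_hat_i|."""
    return np.sum(np.abs(y - y_hat))

def cross_entropy(y, y_hat, eps=1e-12):
    """-sum_i y_i * log(y_hat_i); y is a (one-hot or soft) label distribution,
    y_hat must already be a probability distribution (e.g., after a softmax)."""
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([0.0, 1.0, 0.0])      # gold label, one-hot
y_hat = np.array([0.2, 0.7, 0.1])  # predicted distribution
print(l2_loss(y, y_hat), l1_loss(y, y_hat), cross_entropy(y, y_hat))
```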
Gradient descent

θ̂ = argmin_θ (1/N) Σ_{i=1}^{N} L(y^(i), ŷ^(i) = f_θ(x^(i)))

• However, finding the global minimum is often impossible in practice (we would need to search over all of R^{|θ|}!)
• The gradient is the vector of partial derivatives of the loss function with respect to the parameters, evaluated at θ^(i):
  ∂L/∂θ (θ^(i)) = [ ∂L/∂θ_1 (θ^(i)), ∂L/∂θ_2 (θ^(i)), …, ∂L/∂θ_n (θ^(i)) ]^T
• Learning rate (step size): α ∈ R, α > 0 (often quite small, e.g., 3e-4)
• Randomly initialize θ^(0)
• Iteratively get a better estimate with:
  θ^(i+1) = θ^(i) − α · ∂L/∂θ (θ^(i))
  (next estimate = previous estimate − learning rate × gradient)
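A minimal sketch of this update rule on a toy quadratic loss whose gradient is known in closed form; the toy loss, the seed, and the choice of 20,000 steps are illustrative only:

```python
import numpy as np

target = np.array([1.0, -2.0])  # toy problem: the loss is minimized at theta = target

def loss(theta):
    return np.sum((theta - target) ** 2)

def grad(theta):
    # Vector of partial derivatives of the loss with respect to theta
    return 2.0 * (theta - target)

alpha = 3e-4                                      # learning rate (step size)
theta = np.random.default_rng(0).normal(size=2)   # randomly initialize theta^(0)

for step in range(20_000):
    theta = theta - alpha * grad(theta)   # next estimate = previous - alpha * gradient

print(theta, loss(theta))  # close to [1, -2], loss close to 0
```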
Stochastic gradient descent
Gradient Descent:

θ^(i+1) = θ^(i) − α · ∂L/∂θ (θ^(i))
• Problem: calculating the true gradient can be very expensive (requires running the model on the entire dataset!)
• Solution: Stochastic Gradient Descent
• Sample a subset of the data of fixed size (batch size)
• Take the gradient with respect to that subset
• Take a step in that direction; repeat
• Not only is it more computationally efficient, but it often finds better minima than vanilla gradient descent
• Why? Possibly because it does a better job skipping past plateaus in loss landscape
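A sketch of the minibatch loop described above; `grad_on_batch` is a hypothetical user-supplied function that returns the gradient of the loss on just the sampled subset, and the default batch size of 32 is arbitrary:

```python
import numpy as np

def sgd(theta, grad_on_batch, X, Y, alpha=3e-4, batch_size=32, num_steps=1_000, seed=0):
    """Minibatch SGD: estimate the gradient on a random fixed-size subset of the
    data at each step, take a step in that direction, and repeat."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(num_steps):
        idx = rng.choice(n, size=batch_size, replace=False)  # sample a minibatch
        g = grad_on_batch(theta, X[idx], Y[idx])             # gradient on the subset only
        theta = theta - alpha * g                            # take a step; repeat
    return theta
```

With `grad_on_batch` defined for the model at hand, this loop replaces the full-dataset gradient computation of vanilla gradient descent.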
Backpropagation

Chain rule: dL/dx = (dL/du) · (du/dx)

1. Forward Pass
   h1 = W1 x + b1
   h2 = σ(h1)
   ŷ = W2 h2 + b2

2. Calculate Loss
   L(y, ŷ)

3. Backwards Pass (long, messy exact derivation below)
   δ_ŷ := ∂L/∂ŷ
   ∂L/∂W2 = (∂L/∂ŷ) · (∂ŷ/∂W2) = (∂L/∂ŷ) · ∂(W2 h2 + b2)/∂W2 = δ_ŷ · h2^T
   ∂L/∂b2 = (∂L/∂ŷ) · (∂ŷ/∂b2) = (∂L/∂ŷ) · ∂(W2 h2 + b2)/∂b2 = δ_ŷ
   δ_h2 := ∂L/∂h2 = (∂L/∂ŷ) · (∂ŷ/∂h2) = (∂L/∂ŷ) · ∂(W2 h2 + b2)/∂h2 = W2^T · δ_ŷ
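A NumPy sketch of this exact computation. The squared-error loss (so that δ_ŷ = 2(ŷ − y)) and σ = ReLU are assumptions made here to keep the example concrete; the slide leaves both unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# 1. Forward pass
h1 = W1 @ x + b1
h2 = np.maximum(0.0, h1)       # h2 = sigma(h1), with sigma = ReLU here
y_hat = W2 @ h2 + b2

# 2. Calculate loss (squared error, so dL/dy_hat = 2 (y_hat - y))
L = np.sum((y_hat - y) ** 2)

# 3. Backwards pass: apply the chain rule, reusing the cached forward values
delta_yhat = 2.0 * (y_hat - y)       # dL/dy_hat
dW2 = np.outer(delta_yhat, h2)       # dL/dW2 = delta_yhat . h2^T
db2 = delta_yhat                     # dL/db2 = delta_yhat
delta_h2 = W2.T @ delta_yhat         # dL/dh2 = W2^T . delta_yhat
delta_h1 = delta_h2 * (h1 > 0)       # through the ReLU: derivative is 1 where h1 > 0
dW1 = np.outer(delta_h1, x)          # dL/dW1 = delta_h1 . x^T
db1 = delta_h1                       # dL/db1 = delta_h1

print(dW1.shape, db1.shape, dW2.shape, db2.shape)  # (4, 3) (4,) (2, 4) (2,)
```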
<latexit sha1_base64="JBCjza+kTAxS7kCJrgmLHx0JF4s=">AAAB9HicbVDJSgNBEK1xjXGLevTSGIQIEmbE7Rj04sFDBLNAMoSeTidp0rPYXRMYhnyHFw+KePVjvPk3dpI5aOKDgsd7VVTV8yIpNNr2t7W0vLK6tp7byG9ube/sFvb26zqMFeM1FspQNT2quRQBr6FAyZuR4tT3JG94w9uJ3xhxpUUYPGIScden/UD0BKNoJPe+lJyS9oBimoxPOoWiXbanIIvEyUgRMlQ7ha92N2SxzwNkkmrdcuwI3ZQqFEzycb4dax5RNqR93jI0oD7Xbjo9ekyOjdIlvVCZCpBM1d8TKfW1TnzPdPoUB3rem4j/ea0Ye9duKoIoRh6w2aJeLAmGZJIA6QrFGcrEEMqUMLcSNqCKMjQ55U0IzvzLi6R+VnYuyxcP58XKTRZHDg7hCErgwBVU4A6qUAMGT/AMr/BmjawX6936mLUuWdnMAfyB9fkDuwyRcg==</latexit>
be a probability distribution
• To force the output to be a probability distribution, we apply the
softmax function
<latexit sha1_base64="HF0JmxyJbvmDVBKrFChTqZPF5II=">AAACKnicbVBNT9tAEF1DC2koEOixl1UjpOQS2YivC1JoLxyp1ABSHKz1Zgwbdm1rd4xirfx7eulf6YUDCHHlh7AJPrTAk0Z6em9GM/PiXAqDvv/gLSx++Li03PjUXPm8urbe2tg8NVmhOQx4JjN9HjMDUqQwQIESznMNTMUSzuLrHzP/7Aa0EVn6C8scRopdpiIRnKGTotZRiDBFa7IEFZtWnbIbCXpIw0QzbkOY5p0yEt3KhqZQkZ0cBtWFHVe0dibdqhm12n7Pn4O+JUFN2qTGSdS6DccZLxSkyCUzZhj4OY4s0yi4hKoZFgZyxq/ZJQwdTZkCM7LzVyu65ZQxTTLtKkU6V/+dsEwZU6rYdSqGV+a1NxPf84YFJgcjK9K8QEj5y6KkkBQzOsuNjoUGjrJ0hHEt3K2UXzGXErp0ZyEEr19+S063e8Feb/fnTrv/vY6jQb6Sb6RDArJP+uSYnJAB4eQ3+UvuyL33x7v1HrzHl9YFr575Qv6D9/QM8LGnjA==</latexit>
exp(yi )
softmax(y)i = Pd
j=1 exp(yj )
• The values y before applying the softmax are often called “logits”
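A numerically stable NumPy sketch (subtracting the maximum logit before exponentiating leaves the result unchanged but avoids overflow):

```python
import numpy as np

def softmax(y):
    """softmax(y)_i = exp(y_i) / sum_j exp(y_j), computed in a numerically stable way."""
    z = y - np.max(y)   # shifting the logits changes nothing but prevents overflow
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())  # a probability distribution summing to 1
```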
<latexit sha1_base64="tmEvmksfb9cWDAa5xB1ckWUhUbQ=">AAAB6HicbVDLSgNBEOz1GeMr6tHLYBA8hV3xdQx68ZiAeUCyhNlJbzJmdnaZmRXCki/w4kERr36SN//GSbIHTSxoKKq66e4KEsG1cd1vZ2V1bX1js7BV3N7Z3dsvHRw2dZwqhg0Wi1i1A6pRcIkNw43AdqKQRoHAVjC6m/qtJ1Sax/LBjBP0IzqQPOSMGivVx71S2a24M5Bl4uWkDDlqvdJXtx+zNEJpmKBadzw3MX5GleFM4KTYTTUmlI3oADuWShqh9rPZoRNyapU+CWNlSxoyU39PZDTSehwFtjOiZqgXvan4n9dJTXjjZ1wmqUHJ5ovCVBATk+nXpM8VMiPGllCmuL2VsCFVlBmbTdGG4C2+vEya5xXvqnJZvyhXb/M4CnAMJ3AGHlxDFe6hBg1ggPAMr/DmPDovzrvzMW9dcfKZI/gD5/MH6v2NBw==</latexit>