
COMP 3361 Natural Language Processing

Lecture 7: Recurrent Neural Networks

Spring 2024

Many materials from CSE447@UW (Taylor Sorensen) and CS224n@Stanford with special thanks!
Announcements

• The class will not have an in-person/zoom meeting this Friday.


• We will record a video tutorial on PyTorch and Hugging Face and upload it to the course website.
• Assignment 1 due in two weeks!
• 新年快樂!Happy Chinese New Year.

Lecture plan

• Recap of Byte-pair encoding (BPE) tokenization


• Other tokenization variants
• Basics of neural networks
• Recurrent neural networks

(Very quick) Deep learning review



Neural networks
Goal: Approximate some function f : R^d → R^k

Essential elements:
• Input: vector x ∈ R^d; Output: y ∈ R^k
• Hidden representation layers h_i ∈ R^{d_i}
• Non-linear, differentiable (almost everywhere) activation function σ : R → R (applied element-wise)
• Weights connecting layers W ∈ R^{d_{i+1} × d_i} and bias term b ∈ R^{d_{i+1}}
• Set of all parameters is often referred to as θ

Example: ŷ = W_2 σ(W_1 x + b_1) + b_2
where x ∈ R^3, ŷ ∈ R^2, W_1 ∈ R^{4×3}, W_2 ∈ R^{2×4}, b_1 ∈ R^4, and b_2 ∈ R^2

https://en.wikipedia.org/wiki/Artificial_neural_network
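A minimal NumPy sketch of this example network (not from the slides), with the dimensions above and ReLU as the activation; the weight values are random placeholders:

import numpy as np

# Dimensions from the example: x in R^3, hidden layer of size 4, y_hat in R^2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # W1 in R^{4x3}, b1 in R^4
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # W2 in R^{2x4}, b2 in R^2

def sigma(z):
    return np.maximum(0.0, z)   # ReLU, applied element-wise

x = np.array([1.0, -2.0, 0.5])          # input x in R^3
y_hat = W2 @ sigma(W1 @ x + b1) + b2    # y_hat = W2 sigma(W1 x + b1) + b2, in R^2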

Common activation functions
Sigmoid: σ(x) = 1 / (1 + e^{−x})
ReLU: max(0, x)
Tanh: (e^x − e^{−x}) / (e^x + e^{−x})
GeLU: (x/2) · (1 + erf(x / √2))
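A NumPy sketch of these four activations (not from the slides); math.erf is vectorized here only so that gelu accepts arrays:

import numpy as np
from math import erf

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def tanh(x):
    return np.tanh(x)   # (e^x - e^-x) / (e^x + e^-x)

def gelu(x):
    return 0.5 * x * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))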



Learning
Required:
• Training data D = {(x^(1), y^(1)), ..., (x^(n), y^(n))}
• Model family: some specified function (e.g., ŷ = W_2 σ(W_1 x + b_1) + b_2)
  • Number/size of hidden layers, activation function, etc. are FIXED here
• (Differentiable) loss function L(y, ŷ) : R^d × R^d → R

Learning Problem:
θ̂ = argmin_θ (1/N) Σ_{i=1}^{N} L(y^(i), ŷ^(i) = f_θ(x^(i)))
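As a sketch of what this objective computes (not from the slides): the empirical risk of one candidate θ is just the average loss over the training data. The linear model and squared-error loss below are illustrative stand-ins for f_θ and L:

import numpy as np

def empirical_risk(theta, data, f, loss):
    # (1/N) * sum_i L(y_i, y_hat_i = f_theta(x_i))
    return np.mean([loss(y, f(theta, x)) for x, y in data])

data = [(np.array([1.0, 2.0]), 3.0), (np.array([0.0, 1.0]), 1.0)]   # toy training set D
f = lambda theta, x: theta @ x                                      # toy model family
loss = lambda y, y_hat: (y - y_hat) ** 2                            # squared error
print(empirical_risk(np.array([1.0, 1.0]), data, f, loss))

Learning then means searching over θ for the value that makes this number small.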

Common loss functions
• Regression problems:
  • Euclidean Distance / Mean Squared Error / L2 loss:
    L_2(y, ŷ) = (1/2) ||y − ŷ||_2^2 = (1/2) Σ_{i=1}^{k} (y_i − ŷ_i)^2
  • Mean Absolute Error / L1 loss:
    L_1(y, ŷ) = ||y − ŷ||_1 = Σ_{i=1}^{k} |y_i − ŷ_i|
• 2-way classification:
  • Binary Cross Entropy loss:
    L_BCE(y, ŷ) = −[y log(ŷ) + (1 − y) log(1 − ŷ)]
• Multi-class classification (for example, over words…):
  • Cross Entropy loss (very related to perplexity!):
    L_CE(y, ŷ) = −Σ_{i=1}^{C} y_i log(ŷ_i)
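A NumPy sketch of these losses (not from the slides); the cross-entropy versions assume the predictions are already probabilities:

import numpy as np

def l2_loss(y, y_hat):
    # 0.5 * ||y - y_hat||_2^2
    return 0.5 * np.sum((y - y_hat) ** 2)

def l1_loss(y, y_hat):
    # ||y - y_hat||_1
    return np.sum(np.abs(y - y_hat))

def bce_loss(y, y_hat):
    # y in {0, 1}, y_hat = predicted probability of the positive class
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def ce_loss(y, y_hat):
    # y is a one-hot vector, y_hat a probability distribution over C classes
    return -np.sum(y * np.log(y_hat))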

Gradient Descent
Learning Problem:
θ̂ = argmin_θ (1/N) Σ_{i=1}^{N} L(y^(i), ŷ^(i) = f_θ(x^(i)))

• However, finding the global minimum is often impossible in practice (we would need to search over all of R^{dim(θ)}!)
• Instead, get a local minimum with gradient descent

"Loss landscape" - the loss w.r.t. θ: https://www.cs.umd.edu/~tomg/projects/landscapes/

The gradient is:
• the vector of partial derivatives of the loss with respect to the parameters:
  ∂L/∂θ (θ^(i)) = [∂L/∂θ_1 (θ^(i)), ∂L/∂θ_2 (θ^(i)), ..., ∂L/∂θ_n (θ^(i))]^T
• a linear approximation of the loss function at θ^(i)

Gradient Descent:
• Learning rate α ∈ R, α > 0 (often quite small, e.g., 3e-4)
• Randomly initialize θ^(0)
• Iteratively get a better estimate (next estimate = previous estimate minus the learning rate (step size) times the gradient):
  θ^(i+1) = θ^(i) − α · ∂L/∂θ (θ^(i))
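A minimal sketch of this loop (not from the slides), minimizing a toy quadratic loss whose gradient is available in closed form; the target vector and step count are made up, and the learning rate is larger than 3e-4 only because the toy problem tolerates it:

import numpy as np

target = np.array([3.0, -1.0])

def grad(theta):
    # gradient of L(theta) = 0.5 * ||theta - target||^2
    return theta - target

alpha = 0.1                         # learning rate (step size)
theta = np.random.randn(2)          # randomly initialize theta^(0)
for i in range(100):
    theta = theta - alpha * grad(theta)   # theta^(i+1) = theta^(i) - alpha * dL/dtheta(theta^(i))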
Stochastic gradient descent
Gradient Descent:
θ^(i+1) = θ^(i) − α · ∂L/∂θ (θ^(i))
• Problem: calculating the true gradient can be very expensive (requires running model
on entire dataset!)
• Solution: Stochastic Gradient Descent
• Sample a subset of the data of fixed size (batch size)
• Take the gradient with respect to that subset
• Take a step in that direction; repeat
• Not only is it more computationally efficient, but it often finds better minima than vanilla gradient descent
• Why? Possibly because it does a better job skipping past plateaus in loss landscape
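A sketch of the stochastic variant (not from the slides): sample a fixed-size batch, take the gradient only on that batch, step, repeat. Here grad_on_batch is a placeholder for whatever computes the model's gradient on a batch:

import numpy as np

def sgd(theta, X, Y, grad_on_batch, alpha=3e-4, batch_size=32, steps=1000):
    n = len(X)
    for _ in range(steps):
        idx = np.random.choice(n, size=batch_size, replace=False)  # sample a subset of the data
        g = grad_on_batch(theta, X[idx], Y[idx])                   # gradient w.r.t. that subset only
        theta = theta - alpha * g                                  # take a step in that direction
    return theta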

Backpropagation
One efficient way to calculate the gradient is with backpropagation.
Leverages the Chain Rule: dy/dx = (dy/du) · (du/dx)

1. Forward Pass
   h_1 = W_1 x + b_1
   h_2 = σ(h_1)
   ŷ = W_2 h_2 + b_2

2. Calculate Loss
   L(y, ŷ)

3. Backwards Pass
   Calculate the gradient of the loss w.r.t. each parameter using the chain rule and intermediate outputs.

Backpropagation (chain rule: dy/dx = (dy/du) · (du/dx))

1. Forward Pass
   h_1 = W_1 x + b_1
   h_2 = σ(h_1)
   ŷ = W_2 h_2 + b_2

2. Calculate Loss
   L(y, ŷ)

3. Backwards Pass (long, messy exact derivation below)
   δ_ŷ := ∂L/∂ŷ
   ∂L/∂W_2 = (∂L/∂ŷ) · (∂ŷ/∂W_2) = (∂L/∂ŷ) · ∂(W_2 h_2 + b_2)/∂W_2 = δ_ŷ · h_2^T
   ∂L/∂b_2 = (∂L/∂ŷ) · (∂ŷ/∂b_2) = (∂L/∂ŷ) · ∂(W_2 h_2 + b_2)/∂b_2 = δ_ŷ
   δ_h2 := ∂L/∂h_2 = (∂L/∂ŷ) · (∂ŷ/∂h_2) = (∂L/∂ŷ) · ∂(W_2 h_2 + b_2)/∂h_2 = δ_ŷ · W_2^T
   δ_h1 := ∂L/∂h_1 = (∂L/∂h_2) · (∂h_2/∂h_1) = δ_h2 · ∂σ(h_1)/∂h_1
   ∂L/∂W_1 = (∂L/∂h_1) · (∂h_1/∂W_1) = δ_h1 · ∂(W_1 x + b_1)/∂W_1 = δ_h1 · x^T
   ∂L/∂b_1 = (∂L/∂h_1) · (∂h_1/∂b_1) = δ_h1 · ∂(W_1 x + b_1)/∂b_1 = δ_h1
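A NumPy sketch of this exact forward/backward pass (not from the slides), using ReLU for σ and squared-error loss so that δ_ŷ = ŷ − y; column-vector shapes are used so each weight gradient comes out as a delta times an input transposed:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))
x = rng.normal(size=(3, 1))
y = rng.normal(size=(2, 1))

# 1. Forward pass
h1 = W1 @ x + b1
h2 = np.maximum(0.0, h1)            # h2 = sigma(h1), with sigma = ReLU
y_hat = W2 @ h2 + b2

# 2. Calculate loss: L = 0.5 * ||y_hat - y||^2
L = 0.5 * np.sum((y_hat - y) ** 2)

# 3. Backwards pass
delta_yhat = y_hat - y              # dL/dy_hat for squared-error loss
dW2 = delta_yhat @ h2.T             # dL/dW2 = delta_yhat * h2^T
db2 = delta_yhat                    # dL/db2 = delta_yhat
delta_h2 = W2.T @ delta_yhat        # dL/dh2 (column-vector form of delta_yhat * W2^T)
delta_h1 = delta_h2 * (h1 > 0)      # dL/dh1 = delta_h2 * sigma'(h1), ReLU derivative
dW1 = delta_h1 @ x.T                # dL/dW1 = delta_h1 * x^T
db1 = delta_h1                      # dL/db1 = delta_h1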
Classification with deep learning

• For classification problems (like next-word prediction…) we want to predict a probability distribution over the label space
• However, a neural network's output y ∈ R^d is not guaranteed (or likely) to be a probability distribution
• To force the output to be a probability distribution, we apply the softmax function:
  softmax(y)_i = exp(y_i) / Σ_{j=1}^{d} exp(y_j)
• The values y before applying the softmax are often called "logits"
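A small NumPy sketch of the softmax (not from the slides); subtracting the maximum logit is a standard numerical-stability trick, not something the formula requires:

import numpy as np

def softmax(y):
    # exp(y_i) / sum_j exp(y_j), shifted by max(y) to avoid overflow
    z = np.exp(y - np.max(y))
    return z / np.sum(z)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)   # non-negative, sums to 1: a distribution over labels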
