
CS 229 - Deep Learning Cheatsheet

By Afshine Amidi (https://twitter.com/afshinea) and Shervine Amidi (https://twitter.com/shervinea)

Neural Networks
Neural networks are a class of models that are built with layers. Commonly used types of neural
networks include convolutional and recurrent neural networks.

❐ Architecture ― The vocabulary around neural network architectures is described in the figure below:

By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:

z_j^{[i]} = {w_j^{[i]}}^T x + b_j^{[i]}

where we note w, b, z the weight, bias and output respectively.

❐ Activation function ― Activation functions are used at the end of a hidden unit to introduce
non-linear complexities to the model. Here are the most common ones:

Sigmoid: g(z) = \frac{1}{1 + e^{-z}}

Tanh: g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

ReLU: g(z) = \max(0, z)

Leaky ReLU: g(z) = \max(\epsilon z, z), with \epsilon \ll 1
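As a rough illustration, these four activations could be written with NumPy as follows (a sketch, vectorized over arrays, with eps standing in for the leak coefficient ϵ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                  # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # zero for negative inputs

def leaky_relu(z, eps=0.01):
    return np.maximum(eps * z, z)      # small slope eps << 1 for z < 0
```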

❐ Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z, y) is
commonly used and is defined as follows:

L(z, y) = -\left[ y \log(z) + (1 - y) \log(1 - z) \right]
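A minimal NumPy version of this loss, assuming z is the predicted probability in (0, 1) (the clipping constant and names are illustrative):

```python
import numpy as np

def cross_entropy(z, y, eps=1e-12):
    """L(z, y) = -[y log(z) + (1 - y) log(1 - z)], clipped for numerical stability."""
    z = np.clip(z, eps, 1.0 - eps)
    return -(y * np.log(z) + (1.0 - y) * np.log(1.0 - z))
```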

❐ Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is Adam, which adapts the learning rate.
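For intuition, a single Adam step might look like the sketch below (defaults follow the standard recommendations; the function interface is illustrative, not from the cheatsheet):

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential averages of the gradient (m) and of its
    square (v) yield a per-parameter adaptive learning rate."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```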

❐ Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using the chain rule and is of the following form:

\frac{\partial L(z, y)}{\partial w} = \frac{\partial L(z, y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w}


As a result, the weight is updated as follows:

w \longleftarrow w - \alpha \frac{\partial L(z, y)}{\partial w}

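To make the chain rule concrete, here is a sketch of this update for a single sigmoid unit with cross-entropy loss (all names and values are illustrative; each factor below matches one term of the formula):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sigmoid unit on one example.
x = np.array([0.5, -1.2, 0.3])          # input
y = 1.0                                  # desired output
w, b = np.array([0.1, -0.2, 0.05]), 0.0  # parameters
alpha = 0.1                              # learning rate

z = w @ x + b                            # pre-activation
a = sigmoid(z)                           # actual output
dL_da = -(y / a) + (1 - y) / (1 - a)     # dL/da for cross-entropy
da_dz = a * (1 - a)                      # da/dz for the sigmoid
dz_dw = x                                # dz/dw
grad_w = dL_da * da_dz * dz_dw           # chain rule, as in the formula above
w = w - alpha * grad_w                   # w <- w - alpha * dL/dw
```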
❐ Updating weights ― In a neural network, weights are updated as follows:

Step 1: Take a batch of training data.
Step 2: Perform forward propagation to obtain the corresponding loss.
Step 3: Backpropagate the loss to get the gradients.
Step 4: Use the gradients to update the weights of the network (a toy version of the full loop follows).
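A runnable toy version of these four steps, for a single sigmoid unit on synthetic data (all data, names and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))          # Step 1: a batch of training data
Y = (X[:, 0] > 0).astype(float)       # toy labels
w, b, alpha = np.zeros(3), 0.0, 0.1

for _ in range(200):
    a = sigmoid(X @ w + b)            # Step 2: forward propagation
    loss = -np.mean(Y * np.log(a + 1e-12) + (1 - Y) * np.log(1 - a + 1e-12))
    dz = (a - Y) / len(Y)             # Step 3: backpropagate for the gradients
    grad_w, grad_b = X.T @ dz, dz.sum()
    w -= alpha * grad_w               # Step 4: update the weights
    b -= alpha * grad_b

print(loss)                           # loss decreases over the iterations
```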

❐ Dropout ― Dropout is a technique meant to prevent overfitting the training data by dropping
out units in a neural network. In practice, neurons are either dropped with probability p or kept with
probability 1 − p.
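A sketch of this in NumPy, using the common inverted-dropout convention (the 1/(1 − p) rescaling is an assumption; the cheatsheet only fixes the drop/keep probabilities):

```python
import numpy as np

def dropout(a, p, training=True):
    """Drop each unit with probability p; survivors are rescaled by 1/(1-p)
    so that expected activations match at test time."""
    if not training or p == 0.0:
        return a
    mask = (np.random.rand(*a.shape) >= p) / (1.0 - p)
    return a * mask
```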

Convolutional Neural Networks
❐ Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding and S the stride, the number of neurons N that fit in a given volume is such that:

N = \frac{W - F + 2P}{S} + 1
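A quick helper to evaluate this count (the function name is hypothetical; the assert flags hyperparameter combinations that do not tile the input exactly):

```python
def conv_output_size(W, F, P, S):
    """Number of neurons that fit along one spatial dimension."""
    span = W - F + 2 * P
    assert span % S == 0, "hyperparameters do not fit the volume"
    return span // S + 1

# e.g. a 32-wide input, 5-wide filters, padding 2, stride 1 -> 32 neurons
print(conv_output_size(W=32, F=5, P=2, S=1))  # 32
```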

❐ Batch normalization ― It is a step of hyperparameters γ, β that normalizes the batch {x_i}. By noting μ_B, σ_B^2 the mean and variance of the batch that we want to correct, it is done as follows:

x_i \longleftarrow \gamma \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta


It is usually done after a fully connected/convolutional layer and before a non-linearity layer and
aims at allowing higher learning rates and reducing the strong dependence on initialization.
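A training-time sketch of this normalization over a batch of row vectors (running statistics for inference are omitted; names are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch (rows = examples, columns = features),
    then rescale by gamma and shift by beta."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized batch
    return gamma * x_hat + beta
```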

Recurrent Neural Networks
❐ Types of gates ― Here are the different types of gates that we encounter in a typical recurrent
neural network:

Input gate ― Write to cell or not?
Forget gate ― Erase a cell or not?
Gate ― How much to write to cell?
Output gate ― How much to reveal cell?

❐ LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the
vanishing gradient problem by adding 'forget' gates.
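As a rough sketch of how the four gates above combine in one LSTM step (NumPy, single example; the weight layout over the concatenated [h, x] vector is an assumption for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, Wf, Wi, Wg, Wo, bf, bi, bg, bo):
    """One LSTM step; one weight matrix per gate acting on [h, x]."""
    hx = np.concatenate([h, x])
    f = sigmoid(Wf @ hx + bf)    # forget gate: erase the cell or not?
    i = sigmoid(Wi @ hx + bi)    # input gate: write to the cell or not?
    g = np.tanh(Wg @ hx + bg)    # gate: how much to write to the cell
    o = sigmoid(Wo @ hx + bo)    # output gate: how much to reveal the cell
    c = f * c + i * g            # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c
```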

For a more detailed overview of the concepts above, check out the Deep Learning cheatsheets
(teaching/cs-230)!

Reinforcement Learning and Control
The goal of reinforcement learning is for an agent to learn how to evolve in an environment.

Definitions
❐ Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S, A, {P_sa}, γ, R) where:

S is the set of states
A is the set of actions
{P_sa} are the state transition probabilities for s ∈ S and a ∈ A
γ ∈ [0, 1[ is the discount factor
R : S × A ⟶ ℝ or R : S ⟶ ℝ is the reward function that the algorithm wants to maximize
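To fix ideas, a toy MDP could be encoded with plain Python containers like these (all names and values are made up for illustration):

```python
# A toy 2-state, 2-action MDP.
states = ["s0", "s1"]
actions = ["a0", "a1"]
gamma = 0.9
# P[(s, a)] maps next state -> probability
P = {
    ("s0", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s0", "a1"): {"s1": 1.0},
    ("s1", "a0"): {"s0": 1.0},
    ("s1", "a1"): {"s1": 1.0},
}
R = {"s0": 0.0, "s1": 1.0}  # reward on states, R : S -> IR
```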

❐ Policy ― A policy π is a function π : S ⟶ A that maps states to actions.

Remark: we say that we execute a given policy π if given a state s we take the action a = π(s).

❐ Value function ― For a given policy π and a given state s, we define the value function V^π as follows:

V^\pi(s) = \mathbb{E}\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots \mid s_0 = s, \pi \right]


❐ Bellman equation ― The optimal Bellman equations characterize the value function V^{π*} of the optimal policy π*:

V^{\pi^*}(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V^{\pi^*}(s')

Remark: we note that the optimal policy π* for a given state s is such that:

\pi^*(s) = \underset{a \in A}{\operatorname{argmax}} \sum_{s' \in S} P_{sa}(s') V^*(s')

❐ Value iteration algorithm ― The value iteration algorithm is in two steps:

1) We initialize the value:

V_0(s) = 0

2) We iterate the value based on the values before:

V_{i+1}(s) = R(s) + \max_{a \in A} \left[ \sum_{s' \in S} \gamma P_{sa}(s') V_i(s') \right]
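Both steps translate directly into code. The sketch below uses synchronous updates and the toy MDP containers defined earlier (a fixed iteration count stands in for a convergence test):

```python
def value_iteration(states, actions, P, R, gamma, n_iters=100):
    V = {s: 0.0 for s in states}                 # step 1: V_0(s) = 0
    for _ in range(n_iters):                     # step 2: iterate on the values
        V = {
            s: R[s] + max(
                sum(gamma * p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            for s in states
        }
    return V

# with states, actions, P, R, gamma as in the toy MDP above
print(value_iteration(states, actions, P, R, gamma))
```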

❐ Maximum likelihood estimate ― The maximum likelihood estimates for the state transition
probabilities are as follows:

P_{sa}(s') = \frac{\#\text{times took action } a \text{ in state } s \text{ and got to } s'}{\#\text{times took action } a \text{ in state } s}

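A counting sketch of this estimate, assuming experience arrives as (s, a, s') transition triples (the input format is an assumption):

```python
from collections import Counter, defaultdict

def estimate_transitions(transitions):
    """Maximum likelihood estimate of P_sa(s') from (s, a, s_next) triples."""
    counts = defaultdict(Counter)
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    return {
        sa: {s2: n / sum(c.values()) for s2, n in c.items()}
        for sa, c in counts.items()
    }

print(estimate_transitions([("s0", "a0", "s1"), ("s0", "a0", "s0")]))
# {('s0', 'a0'): {'s1': 0.5, 's0': 0.5}}
```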
❐ Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:

Q(s, a) \longleftarrow Q(s, a) + \alpha \left[ R(s, a, s') + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
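One tabular update could be sketched as follows (q as a dict keyed by (state, action), missing entries treated as zero; names and defaults are illustrative):

```python
def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update on the entry q[(s, a)]."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    # move toward the target R(s,a,s') + gamma * max_a' Q(s',a')
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)
```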

For a more detailed overview of the concepts above, check out the States-based Models
cheatsheets (teaching/cs-221/cheatsheet-states-models)!
