8/13/24, 8:04 PM CS 229 - Deep Learning Cheatsheet
Deep Learning cheatsheet
By Afshine Amidi (https://twitter.com/afshinea) and Shervine Amidi (https://twitter.com/shervinea)
Neural Networks
Neural networks are a class of models that are built with layers. Commonly used types of neural
networks include convolutional and recurrent neural networks.
❐ Architecture ― The vocabulary around neural network architectures is described in the figure
below:
By noting i the i-th layer of the network and j the j-th hidden unit of the layer, we have:
$$z_j^{[i]} = {w_j^{[i]}}^T x + b_j^{[i]}$$
where we note w, b, z the weight, bias and output respectively.
❐ Activation function ― Activation functions are used at the end of a hidden unit to introduce
non-linear complexities to the model. Here are the most common ones:
Sigmoid: $g(z) = \frac{1}{1 + e^{-z}}$
Tanh: $g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
ReLU: $g(z) = \max(0, z)$
Leaky ReLU: $g(z) = \max(\epsilon z, z)$ with $\epsilon \ll 1$
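As a minimal sketch, the four activation functions listed above can be written in plain Python as follows. The function names and the default value 0.01 for ϵ are assumptions made for illustration; the text only requires ϵ ≪ 1.

```python
import math

def sigmoid(z: float) -> float:
    """Sigmoid: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z: float) -> float:
    """Tanh: (e^z - e^{-z}) / (e^z + e^{-z})."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z: float) -> float:
    """ReLU: max(0, z)."""
    return max(0.0, z)

def leaky_relu(z: float, eps: float = 0.01) -> float:
    """Leaky ReLU: max(eps * z, z); eps = 0.01 is an assumed default."""
    return max(eps * z, z)

# Evaluate each activation at a few sample points.
for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```

Sigmoid and tanh saturate for large |z|, while ReLU and Leaky ReLU stay linear for positive inputs.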
❐ Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z, y) is
commonly used and is defined as follows:
L(z, y) = −[y log(z) + (1 − y) log(1 − z)]
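The formula above can be sketched in a few lines of Python, with z the predicted probability and y the label in {0, 1}. The clamping tolerance `tol` is an assumption added here so that the logarithm stays finite; it is not part of the definition.

```python
import math

def cross_entropy(z: float, y: float, tol: float = 1e-12) -> float:
    """L(z, y) = -[y log(z) + (1 - y) log(1 - z)] for a prediction z in (0, 1)."""
    # Clamp z away from 0 and 1 (an assumed practical safeguard, not in the formula).
    z = min(max(z, tol), 1.0 - tol)
    return -(y * math.log(z) + (1.0 - y) * math.log(1.0 - z))

# A confident correct prediction costs little, a confident wrong one costs a lot.
print(cross_entropy(0.9, 1.0), cross_entropy(0.1, 1.0))
```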
❐ Learning rate ― The learning rate, often noted α or sometimes η, indicates the pace at which the
weights get updated. It can be fixed or adaptively changed. A popular choice is Adam, a method that
adapts the learning rate during training.
❐ Backpropagation ― Backpropagation is a method to update the weights in the neural network
by taking into account the actual output and the desired output. The derivative with respect to
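weight w is computed using the chain rule and is of the following form:
$$\frac{\partial L(z, y)}{\partial w} = \frac{\partial L(z, y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w}$$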
As a result, the weight is updated as follows:
$$w \longleftarrow w - \alpha \frac{\partial L(z, y)}{\partial w}$$
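To make the chain rule concrete, here is a sketch for a single unit, assuming z = w x + b, a sigmoid activation a = g(z) and the cross-entropy loss evaluated on a. These modelling choices and all function names are assumptions for illustration, not the only setting in which backpropagation applies.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss of one sigmoid unit on a single example (x, y)."""
    a = sigmoid(w * x + b)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def grad_w(w, b, x, y):
    """dL/dw = dL/da * da/dz * dz/dw, one factor per link of the chain."""
    a = sigmoid(w * x + b)
    dL_da = -(y / a) + (1.0 - y) / (1.0 - a)
    da_dz = a * (1.0 - a)
    dz_dw = x
    return dL_da * da_dz * dz_dw

def update(w, b, x, y, alpha=0.1):
    """One gradient step on w: w <- w - alpha * dL/dw."""
    return w - alpha * grad_w(w, b, x, y)
```

For this particular choice of activation and loss, the three factors simplify to (a − y) x, which is a convenient sanity check.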
❐ Updating weights ― In a neural network, weights are updated as follows:
Step 1: Take a batch of training data.
Step 2: Perform forward propagation to obtain the corresponding loss.
Step 3: Backpropagate the loss to get the gradients.
Step 4: Use the gradients to update the weights of the network.
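The four steps above can be sketched as a mini-batch training loop. The model here is a single sigmoid unit trained with cross-entropy; the data layout, the hyperparameter defaults and the `train` function itself are illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, alpha=0.5, batch_size=4, epochs=200, seed=0):
    """Mini-batch gradient descent for a = sigmoid(w * x + b) on (x, y) pairs."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not reordered
    w, b = 0.0, 0.0
    history = []       # mean loss per epoch
    for _ in range(epochs):
        rng.shuffle(data)
        epoch_loss = 0.0
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]      # Step 1: take a batch
            grad_w = grad_b = 0.0
            for x, y in batch:
                a = sigmoid(w * x + b)                  # Step 2: forward pass and loss
                epoch_loss += -(y * math.log(a) + (1 - y) * math.log(1 - a))
                grad_w += (a - y) * x                   # Step 3: gradients of the loss
                grad_b += a - y
            w -= alpha * grad_w / len(batch)            # Step 4: update the weights
            b -= alpha * grad_b / len(batch)
        history.append(epoch_loss / len(data))
    return w, b, history

# Toy data: negative inputs are labelled 0, positive inputs 1.
toy = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (1.5, 1), (2.0, 1)]
w, b, history = train(toy)
print(w, b, history[0], history[-1])
```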
❐ Dropout ― Dropout is a technique meant to prevent overfitting the training data by dropping
out units in a neural network. In practice, neurons are either dropped with probability p or kept with
probability 1 − p.
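A minimal sketch of dropout applied to a list of activations at training time. Rescaling the kept units by 1 / (1 − p) (often called inverted dropout) is an assumption added here and is not stated in the text above; it keeps the expected activation unchanged so that nothing needs to be rescaled at test time.

```python
import random

def dropout(activations, p, rng=random):
    """Drop each unit with probability p; keep it with probability 1 - p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    # Assumed inverted-dropout scaling: kept units are divided by (1 - p).
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

# With p = 0.5, roughly half of the units are zeroed and the rest are doubled.
print(dropout([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 0.5, random.Random(0)))
```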
Convolutional Neural Networks
❐ Convolutional layer requirement ― By noting W the input volume size, F the size of the
convolutional layer neurons, P the amount of zero padding and S the stride, the number of neurons N
that fit in a given volume is such that:
$$N = \frac{W - F + 2P}{S} + 1$$
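As a quick arithmetic check, the sizing rule can be sketched as a small helper. It assumes the standard form of the formula with a stride S in the denominator, N = (W − F + 2P) / S + 1, applied along one spatial dimension; the function name and the error handling are illustrative assumptions.

```python
def conv_output_size(W: int, F: int, P: int, S: int) -> int:
    """N = (W - F + 2P) / S + 1 along one spatial dimension."""
    span = W - F + 2 * P
    if span < 0 or span % S != 0:
        # The filter does not tile the padded input evenly with this stride.
        raise ValueError("incompatible W, F, P, S")
    return span // S + 1

# A 5-wide filter with padding 2 and stride 1 preserves a 32-wide input.
print(conv_output_size(W=32, F=5, P=2, S=1))
```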
❐ Batch normalization ― It is a step with learnable parameters γ, β that normalizes the batch $\{x_i\}$. By
noting $\mu_B$, $\sigma_B^2$ the mean and variance of the batch that we want to correct, it is done as follows:
$$x_i \longleftarrow \gamma \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$$
It is usually done after a fully connected/convolutional layer and before a non-linearity layer and
aims at allowing higher learning rates and reducing the strong dependence on initialization.
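A minimal sketch of the batch normalization step on a batch of scalars, following the formula above. The default values for γ, β and ϵ, and the use of the biased batch variance, are assumptions for illustration.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """x_i <- gamma * (x_i - mu_B) / sqrt(sigma_B^2 + eps) + beta over one batch."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # biased (population) variance of the batch
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

# The normalized batch has mean close to beta and variance close to gamma^2.
print(batch_norm([1.0, 2.0, 3.0, 4.0]))
```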
Recurrent Neural Networks
❐ Types of gates ― Here are the different types of gates that we encounter in a typical recurrent
neural network:
Input gate: Write to cell or not?
Forget gate: Erase a cell or not?
Gate: How much to write to cell?
Output gate: How much to reveal cell?
❐ LSTM ― A long short-term memory (LSTM) network is a type of RNN model that mitigates the
vanishing gradient problem by adding 'forget' gates.
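To show how the four gates interact, here is a sketch of one step of a scalar LSTM cell. The update equations are the commonly used LSTM formulation and are an assumption here, since the text above only names the gates; the `params` layout and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))    # input gate: write to cell or not?
    f = sigmoid(pre("forget"))   # forget gate: erase the cell or not?
    g = math.tanh(pre("gate"))   # gate: how much to write to the cell?
    o = sigmoid(pre("output"))   # output gate: how much of the cell to reveal?
    c = f * c_prev + i * g       # new cell state
    h = o * math.tanh(c)         # new hidden state
    return h, c

# With a wide-open forget gate and a closed input gate, the cell state is carried over.
params = {"input": (0.0, 0.0, -100.0), "forget": (0.0, 0.0, 100.0),
          "gate": (1.0, 0.0, 0.0), "output": (0.0, 0.0, 0.0)}
print(lstm_step(1.0, 0.0, 2.0, params))
```

The additive form of the cell update, c = f · c_prev + i · g, is what lets information and gradients pass through many time steps when f stays close to 1.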
For a more detailed overview of the concepts above, check out the Deep Learning cheatsheets
(teaching/cs-230)!
Reinforcement Learning and Control
The goal of reinforcement learning is for an agent to learn how to evolve in an environment.
Definitions
❐ Markov decision processes ― A Markov decision process (MDP) is a 5-tuple
$(S, A, \{P_{sa}\}, \gamma, R)$ where:
S is the set of states
A is the set of actions
$\{P_{sa}\}$ are the state transition probabilities for s ∈ S and a ∈ A
$\gamma \in [0, 1)$ is the discount factor
$R : S \times A \longrightarrow \mathbb{R}$ or $R : S \longrightarrow \mathbb{R}$ is the reward function that the algorithm wants to
maximize
❐ Policy ― A policy π is a function π : S ⟶ A that maps states to actions.
Remark: we say that we execute a given policy π if given a state s we take the action a = π(s).
❐ Value function ― For a given policy π and a given state s, we define the value function $V^\pi$ as
follows:
$$V^\pi(s) = E\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots \mid s_0 = s, \pi\right]$$
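The quantity inside the expectation is the discounted sum of rewards along one trajectory. A minimal sketch, assuming a finite list of rewards R(s_0), R(s_1), ... collected from a single rollout; the value function is the average of this quantity over rollouts that start in s and follow π.

```python
def discounted_return(rewards, gamma):
    """R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ... for one finite trajectory."""
    total = 0.0
    # Accumulate from the last reward backwards (Horner's scheme).
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# 1 + 0.5 * 1 + 0.25 * 1
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```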
❐ Bellman equation ― The optimal Bellman equation characterizes the value function $V^{\pi^*}$ of the
optimal policy $\pi^*$:
$$V^{\pi^*}(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V^{\pi^*}(s')$$
Remark: we note that the optimal policy $\pi^*$ for a given state s is such that:
$$\pi^*(s) = \underset{a \in A}{\operatorname{argmax}} \sum_{s' \in S} P_{sa}(s') V^*(s')$$
where $V^*$ is shorthand for $V^{\pi^*}$.
❐ Value iteration algorithm ― The value iteration algorithm is in two steps:
1) We initialize the value:
$$V_0(s) = 0$$
2) We iterate the value based on the values before:
$$V_{i+1}(s) = R(s) + \max_{a \in A} \left[\sum_{s' \in S} \gamma P_{sa}(s') V_i(s')\right]$$
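The two steps above can be sketched as follows, together with the greedy policy read off from the resulting values. The dictionary layout `P[s][a][s']`, the fixed iteration count and the two-state example MDP are hypothetical choices made for illustration.

```python
def value_iteration(states, actions, P, R, gamma, n_iters=100):
    """Iterate V_{i+1}(s) = R(s) + max_a sum_{s'} gamma * P[s][a][s'] * V_i(s')."""
    V = {s: 0.0 for s in states}                    # 1) initialize V_0(s) = 0
    for _ in range(n_iters):                        # 2) update from the previous values
        V = {
            s: R[s] + max(
                sum(gamma * p * V[s2] for s2, p in P[s][a].items())
                for a in actions
            )
            for s in states
        }
    return V

def greedy_policy(states, actions, P, V):
    """pi(s) = argmax_a sum_{s'} P[s][a][s'] * V(s')."""
    return {
        s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[s][a].items()))
        for s in states
    }

# Hypothetical two-state MDP: reward 1 in state 1, deterministic moves.
states, actions = [0, 1], ["stay", "go"]
P = {0: {"stay": {0: 1.0}, "go": {1: 1.0}},
     1: {"stay": {1: 1.0}, "go": {0: 1.0}}}
R = {0: 0.0, 1: 1.0}
V = value_iteration(states, actions, P, R, gamma=0.5)
print(V, greedy_policy(states, actions, P, V))
```

In this example the values approach 1 for state 0 and 2 for state 1, and the greedy policy moves to state 1 and stays there. A practical implementation would usually stop once successive values differ by less than a tolerance instead of running a fixed number of iterations.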
❐ Maximum likelihood estimate ― The maximum likelihood estimates for the state transition
probabilities are as follows:
$$P_{sa}(s') = \frac{\#\text{times took action } a \text{ in state } s \text{ and got to } s'}{\#\text{times took action } a \text{ in state } s}$$
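The estimate is a ratio of counts, which can be sketched directly from a log of observed transitions. The `(s, a, s_next)` tuple format and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict

def estimate_transitions(transitions):
    """Estimate P_sa(s') from an iterable of observed (s, a, s_next) tuples."""
    pair_counts = Counter()     # times action a was taken in state s
    triple_counts = Counter()   # times that choice led to s_next
    for s, a, s_next in transitions:
        pair_counts[(s, a)] += 1
        triple_counts[(s, a, s_next)] += 1
    P = defaultdict(dict)
    for (s, a, s_next), n in triple_counts.items():
        P[(s, a)][s_next] = n / pair_counts[(s, a)]
    return dict(P)

log = [("A", "go", "B"), ("A", "go", "B"), ("A", "go", "A"), ("B", "go", "A")]
print(estimate_transitions(log))
```

State-action pairs that never appear in the log get no estimate here; how to handle them (for example with a uniform default) is left open.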
❐ Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:
$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[R(s, a, s') + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
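A single application of this update rule can be sketched with a table stored in a dictionary. Here `r` stands for the observed reward R(s, a, s'); the table layout keyed by `(state, action)`, the default values of α and γ, and the function name are assumptions for illustration.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

# Unseen (state, action) pairs default to 0.
Q = defaultdict(float)
Q[("s1", "left")] = 2.0
print(q_update(Q, "s0", "right", r=1.0, s_next="s1", actions=["left", "right"], alpha=0.5))
```

The update only needs the observed transition (s, a, r, s'), which is why no model of the transition probabilities is required.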
For a more detailed overview of the concepts above, check out the States-based Models
cheatsheets (teaching/cs-221/cheatsheet-states-models)!