Neural Network Programming
Vectorization
• Vectorization is basically the art of getting rid of explicit “for” loops in
your code.
• In the deep learning era, you often find yourself training on relatively
large data sets.
• So, it's important that your code runs very quickly because otherwise,
• if it's running on a big data set, your code might take a long time to
run and you just find yourself waiting a very long time to get the
result.
• So in the deep learning era, the ability to perform vectorization has
become a key skill.
Let's start with an example.
This shows how vectorization, by using built-in functions and by avoiding explicit for
loops, allows you to speed up your code significantly.
Let's look at a few more examples.
Vectorization
Here's a Jupyter notebook where we write some Python code.
First, import the NumPy library.
Create A as an array as follows.
Now, having written this chunk of code, if I hit Shift+Enter,
then it executes the code.
Then print A: it created the array A and it
prints it out.
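As a rough sketch, the notebook cell described above might look like this (the specific array values are just an assumption for illustration):

import numpy as np            # import the numpy library

A = np.array([1, 2, 3, 4])    # create the array A
print(A)                      # prints: [1 2 3 4]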
Vectorization
Let's actually illustrate the efficiency of Vectorization with a little demo.
# Create two very large arrays, w and x, of large dimension nx.
Vectorization
• Non-vectorized version:
Vectorized version vs. non-vectorized version
• In both cases, the vectorized version and the non-vectorized version
computed the same value.
• The vectorized version took about 1.5 milliseconds.
• The explicit for-loop, non-vectorized version took about 400 to
500 milliseconds.
• So the non-vectorized version took something like 300 times longer than
the vectorized version.
• When you are implementing deep learning algorithms, you can really
get a result much faster if you vectorize your code.
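A minimal sketch of the timing demo described above might look like the following (the array size of one million and the use of time.time() are assumptions for illustration):

import time
import numpy as np

n = 1000000                      # a large dimension nx
a = np.random.rand(n)
b = np.random.rand(n)

tic = time.time()
c = np.dot(a, b)                 # vectorized version
toc = time.time()
print("Vectorized version:", 1000 * (toc - tic), "ms")

c = 0.0
tic = time.time()
for i in range(n):               # explicit for-loop (non-vectorized) version
    c += a[i] * b[i]
toc = time.time()
print("For-loop version:", 1000 * (toc - tic), "ms")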
More Vectorization Examples
• The rule of thumb to keep in mind is: when you're programming your
neural networks, or when you're programming logistic
regression, whenever possible avoid explicit for-loops.
• It is not always possible to never use a for-loop, but when you can use
a built-in function or find some other way to compute whatever you
need, you'll often go faster than if you have an explicit for-loop.
• Let's look at another example.
More Vectorization Examples
If you want to compute a vector u as the product of the matrix A and another
vector v, then the definition of matrix multiplication is that u_i equals the
sum over j of A_ij * v_j.
A non-vectorized implementation of this would be to set u = np.zeros((n, 1)),
then loop over i, and within that loop over j, accumulating A[i][j] * v[j].
The vectorized implementation is simply u = np.dot(A, v). The vectorized
version eliminates two different for-loops, and it's going to be way faster.
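A small sketch contrasting the two implementations (the matrix size n is just an assumed toy value):

import numpy as np

n = 4
A = np.random.randn(n, n)
v = np.random.randn(n)

# Non-vectorized: two explicit for-loops
u = np.zeros(n)
for i in range(n):
    for j in range(n):
        u[i] += A[i][j] * v[j]

# Vectorized: a single call, no explicit for-loops
u_vec = np.dot(A, v)
print(np.allclose(u, u_vec))     # True: both compute the same vector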
More Vectorization Examples
• Let's say you already have a vector, v, in memory and you want to
apply the exponential operation on every element of this vector v. So
you can put u equals the vector, that's e to the v1, e to the v2, and so
on, down to e to the vn.
• A non-vectorized implementation would first initialize u to a vector of
zeros, and then use a for-loop that computes the elements one at a time.
More Vectorization Examples
Python and NumPy have many built-in functions that
allow you to compute these vectors with just a single
call to a single function.
To implement this, import numpy as np, and then just call u = np.exp(v).
Notice that previously you had an explicit for-loop; with just one line of
code here (v as the input vector, u as the output vector) you've gotten rid
of the explicit for-loop, and this implementation will be much faster than
the one needing an explicit for-loop.
In fact, the NumPy library has many of these vector-valued functions.
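Here is a minimal sketch of both versions, assuming a small example vector v:

import math
import numpy as np

v = np.array([0.0, 1.0, 2.0, 3.0])

# Non-vectorized: initialize u to zeros, then fill one element at a time
u = np.zeros(v.shape)
for i in range(len(v)):
    u[i] = math.exp(v[i])

# Vectorized: a single call to a built-in NumPy function
u_vec = np.exp(v)
print(np.allclose(u, u_vec))     # True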
More Vectorization Examples
Some Built-in functions:
• np.log(v) computes the element-wise log.
• np.abs(v) computes the element-wise absolute value.
• np.maximum(v, 0) computes the element-wise maximum of every
element of v with 0.
• v**2 takes the element-wise square of each element of v.
• 1/v takes the element-wise inverse, and so on.
So, whenever you want to write a for-loop, first check whether there's a way to call a NumPy built-in function to do
it without that for-loop.
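For illustration, these built-in functions might be used as follows (the vectors are hypothetical example values):

import numpy as np

v = np.array([1.0, 4.0, 0.25])
w = np.array([-3.0, 2.0, -0.5])

print(np.log(v))          # element-wise natural log
print(np.abs(w))          # element-wise absolute value
print(np.maximum(w, 0))   # element-wise max of each element with 0
print(v ** 2)             # element-wise square
print(1 / v)              # element-wise inverse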
Recall the code for computing the derivatives for logistic regression: it had two for-loops.
In our example we have nx = 2, but
if you had more features than just 2 features, then
you'd need a for-loop over dw1, dw2, dw3, and
so on.
So we'd like to eliminate this second for-loop over the features, as sketched below.
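A sketch of how that inner for-loop over features can be replaced by a single vector dw (the shapes and toy data below are assumptions for illustration):

import numpy as np

n_x = 4                       # assumed number of features
m = 3                         # assumed number of training examples
X = np.random.randn(n_x, m)   # toy inputs, one example per column
dz = np.random.randn(m)       # toy per-example dz values

dw = np.zeros((n_x, 1))       # one vector replaces dw1, dw2, dw3, ...
db = 0.0
for i in range(m):            # the loop over examples remains (for now)
    dw += X[:, i:i+1] * dz[i] # vector operation: no loop over features
    db += dz[i]
dw /= m
db /= m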
Vectorizing Logistic Regression
• We'll talk about how you can vectorize the implementation of logistic
regression so it can process an entire training set, i.e.,
• implement a single iteration of gradient descent with respect to an
entire training set without using even a single explicit for-loop.
• Let's first examine the forward propagation steps of logistic
regression:
• If you have m training examples, then to make a prediction on the 1st
example, you need to compute z(1) = w^T x(1) + b,
• then compute the activation a(1) = sigma(z(1)).
• Similarly for the 2nd and 3rd training examples, and so on: you might
need to do this m times for m training examples.
Vectorizing Logistic Regression
There is a way to do so without using an explicit for loop.
• Recall we defined a matrix capital X to be your training inputs, stacked
together in different columns, so X is an nx by m matrix.
• But how do we compute z(1), z(2), z(3), and so on, all in one step with one line
of code?
• Construct a 1 x m matrix (really a row vector) that computes z(1), z(2), and so
on, down to z(m) all at the same time.
• This can be expressed as: Z = w^T X + [b b ... b].
Vectorizing Logistic Regression
• Doing so, after substituting the values, it becomes the matrix form:
Z = [z(1) z(2) ... z(m)] = w^T X + [b b ... b]
where w^T is a row vector.
• Just as stacking the lowercase x's horizontally gives the variable capital X,
stacking the lowercase z's horizontally gives the variable capital Z. You take
w^T x(1) through w^T x(m) and then add this second term b, b, b, and so on,
ending up adding b to each element.
• So we end up with another 1 x m vector/matrix.
• Referring back to the definitions,
• the first element is exactly the definition of z(1),
• the second element is exactly the definition of z(2), and so on.
Vectorizing Logistic Regression
• The Python implementation to compute capital Z would be:
Z = np.dot(w.T, X) + b
• With just one line of code, we can calculate capital Z, and capital Z is a 1 x m matrix
that contains all the lowercase z's: z(1) through z(m).
• Here b is a real number (or a 1 x 1 matrix), just a normal real number. But when we add
this real number to the 1 x m vector, Python automatically takes b and expands it out
into a 1 x m row vector.
This is called broadcasting in Python.
Vectorizing Logistic Regression
• What about the activations, lowercase "a"? Next we need a way to compute a(1), a(2), and so on up to a(m), all at the same time.
• Just as stacking the lowercase x's resulted in capital X, and
• stacking horizontally the lowercase z's resulted in capital Z,
• stacking the lowercase a's horizontally results in a new variable, defined as capital A.
In the programs, you see how to implement a vector-valued sigmoid function,
so that the sigmoid function takes capital Z as an input and very efficiently
outputs capital A.
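Putting the forward propagation step together, a sketch (with assumed toy shapes) might look like this:

import numpy as np

def sigmoid(z):
    # vector-valued sigmoid: applied element-wise to the whole array
    return 1 / (1 + np.exp(-z))

n_x, m = 3, 5                  # assumed toy dimensions
X = np.random.randn(n_x, m)    # training inputs stacked in columns
w = np.random.randn(n_x, 1)
b = 0.1

Z = np.dot(w.T, X) + b         # shape (1, m): all the z's at once (b is broadcast)
A = sigmoid(Z)                 # shape (1, m): all the activations at once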
Vectorizing Logistic Regression
To summarize:
• In order to carry out the forward propagation step, that is to compute
these predictions on our m training examples:
• you've just seen how you can use vectorization to very efficiently compute all
of the activations, all the lowercase A's at the same time.
• Next, we use vectorization to compute the Gradients very efficiently
for Backward Propagation.
Vectorizing Logistic Regression's
Gradient Output
• For the gradient computation, we've already computed dz(1) = a(1) − y(1) for the
first example, and so on for all m training examples.
• We define a new variable dZ = [dz(1) dz(2) ... dz(m)]:
a 1 by m matrix, or alternatively an m-dimensional row vector.
• We'd already figured out how to compute capital A = [a(1) ... a(m)],
• and similarly Y = [y(1) ... y(m)].
• Based on these definitions, dZ = A − Y.
So, with just one line of code, we can compute all of the dz's at the
same time.
Vectorizing Logistic Regression's
Gradient Output
Now, we've gotten rid of one for-loop already,
but we still have this second for-loop over the
training examples.
So, let's take these operations and vectorize
them.
Vectorizing Logistic Regression's
Gradient Output
(Figure: an N x m matrix multiplied by an m x 1 column vector, illustrating the vectorized gradient computation.)
Vectorizing Logistic Regression's
Gradient Output- Implementation
We can compute db and dw using these two lines: db = (1/m) · np.sum(dZ) and dw = (1/m) · X dZ^T.
Vectorizing Logistic Regression's
Gradient Output
This was the original, highly inefficient non-vectorized implementation. We get rid of the
inner loop by using the vector forms, and also get rid of this outer for-loop over the examples.
• We've just completed forward propagation and back propagation,
• really computing the predictions and computing the derivatives
on all m training examples without using a single explicit for-loop.
Vectorizing Logistic Regression's
Gradient Output
• Thus, gradient descent would update the parameters as: w := w − alpha · dw and b := b − alpha · db.
• With this, a single iteration of gradient descent for logistic regression
is accomplished.
Vectorizing Logistic Regression's
Gradient Output
• To implement multiple iterations of
gradient descent, you still need a for-loop over
the number of iterations.
• So, if you want to have a thousand
iterations of gradient descent, you might
still need a for-loop over the iteration
number.
• There is still that outermost for-loop; there's no way to get rid of it.
• However, it's incredibly cool that we can implement at least one
iteration of gradient descent without needing to use a single explicit
for-loop, as sketched below.
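Putting it all together, here is a rough sketch of one full vectorized implementation, wrapped in the remaining for-loop over iterations (the toy data, learning rate, and iteration count are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, m = 3, 100                          # assumed toy dimensions
X = np.random.randn(n_x, m)              # inputs, one example per column
Y = (np.random.rand(1, m) > 0.5) * 1.0   # labels in {0, 1}
w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.01                             # learning rate
num_iterations = 1000                    # the one for-loop we keep

for it in range(num_iterations):
    # Forward propagation (vectorized over all m examples)
    Z = np.dot(w.T, X) + b
    A = sigmoid(Z)
    # Backward propagation (vectorized)
    dZ = A - Y
    dw = np.dot(X, dZ.T) / m
    db = np.sum(dZ) / m
    # Gradient descent update
    w = w - alpha * dw
    b = b - alpha * db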
Broadcasting in Python
Broadcasting is another technique that you can use to make your Python code
run faster.
• Consider an example:
• Calculate the % of calories from carbs, proteins and fats for each of the four foods.
• You really need to
• sum up each of the four columns of this matrix to get the total number of calories in 100
grams of apples, beef, eggs, and potatoes, and then
• divide throughout the matrix by those column totals.
Can this be done without using an explicit for-loop?
Broadcasting in Python
• Set this matrix equal to a three-by-four matrix A.
• And then with one line of Python code we're going to sum down the
columns.
• So we're going to get four numbers corresponding to the total number of
calories in 100 grams of each of these four different types of foods.
• The second line of Python code divides each of the four columns by
its corresponding sum.
This is an example of Python broadcasting, where you take a matrix A and divide it by a 1 x 4 row vector.
Broadcasting in Python
# axis=0 means to sum vertically, down the columns.
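A sketch of this calculation (the calorie numbers below are example values used only for illustration):

import numpy as np

# Calories from carbs, protein, fat (rows) in 100 g of
# apples, beef, eggs, potatoes (columns)
A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])

cal = A.sum(axis=0)                        # axis=0: sum vertically, down the columns
percentage = 100 * A / cal.reshape(1, 4)   # divide each column by its total
print(percentage)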
Broadcasting Examples
If you add 100 to a vector, Python adds 100 to every element; in fact, we used this form of
broadcasting earlier when we added the real number b to the 1 x m vector Z.
Broadcasting Examples
So the general case would be: if you have some (m,n) matrix and
you want to add it to a (1,n) matrix,
what Python will do is copy the (1,n) matrix m times to turn it into an (m,n) matrix before adding element-wise.
Broadcasting Examples
Broadcasting
General principle of broadcasting in Python.
• If you have an (m,n) matrix and you add, subtract, multiply, or
divide it with a (1,n) matrix, then Python will copy the (1,n) matrix m times into an (m,n)
matrix and then apply the addition, subtraction, multiplication, or
division element-wise.
• Conversely, if you take the (m,n) matrix and add, subtract,
multiply, or divide by an (m,1) matrix, then Python will copy the (m,1) matrix n
times, turn it into an (m,n) matrix, and then apply the
operation element-wise.
• Further reading: the NumPy documentation on broadcasting.
• In MATLAB, use the bsxfun function for broadcasting.
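A few small examples of this general principle (the values are only illustrative):

import numpy as np

v = np.array([1, 2, 3, 4])
print(v + 100)                   # the scalar 100 is broadcast to every element

M = np.array([[1, 2, 3],
              [4, 5, 6]])        # a (2, 3) matrix
r = np.array([[100, 200, 300]])  # (1, 3): copied down the 2 rows
c = np.array([[100],
              [200]])            # (2, 1): copied across the 3 columns
print(M + r)
print(M + c)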
A note on Python/Numpy vectors
• Broadcasting's flexibility is a strength, but it also has weaknesses.
• It can introduce very subtle bugs if you're not familiar with how features like
broadcasting work.
• Example:
• You might expect that adding a column vector to a row vector would throw a dimension-
mismatch or type error,
• but you actually get back a matrix as the sum of a row vector and a column
vector.
• So there is an internal logic to these strange effects of Python.
• A couple of tips and tricks are useful to simplify your code and eliminate bugs, so you
can write bug-free Python and NumPy code much more easily.
A note on Python/Numpy vectors
• To illustrate one of the less intuitive effects of Python/NumPy, consider how you
construct vectors: calling a = np.random.randn(5) creates an array whose shape is (5,).
This is called a rank 1 array in Python.
It is neither a row vector nor a column vector.
This leads to some slightly non-intuitive effects.
A note on Python/Numpy vectors
• Print a transpose: for a rank 1 array, a.T looks the same as a.
• Print the inner product between a and a transpose: one might think np.dot(a, a.T)
would give the outer product of a with itself and produce a matrix,
• but it produces a number.
Note the differences: if instead you set a to be (5,1), then printing a transpose gives a
1 by 5 matrix; in this data structure there are two square brackets when you print a
transpose, whereas previously there was one square bracket. That's the difference
between a (5,1) matrix and a rank 1 array.
If a is a (5,1) column vector, then the product between a and a transpose gives the
required outer product of the vector, producing a matrix.
So it is recommended to commit a to be a (5,1) column vector.
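A short sketch that reproduces these effects:

import numpy as np

a = np.random.randn(5)         # rank 1 array, shape (5,)
print(a.shape)                 # (5,)
print(a.T.shape)               # (5,): the transpose looks the same as a
print(np.dot(a, a.T))          # a single number (inner product), not a matrix

a = np.random.randn(5, 1)      # (5, 1) column vector
print(a.T.shape)               # (1, 5): a row vector, printed with two brackets
print(np.dot(a, a.T).shape)    # (5, 5): the outer product matrix
assert a.shape == (5, 1)       # cheap check that the dimensions are as expected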
A note on Python/Numpy vectors
Conclusion: We created a data structure whose a.shape was (5,), called
a rank 1 array.
This data structure doesn't behave consistently as either a row vector
or a column vector, which makes it non-intuitive.
• Always use either n-by-1 matrices (basically column vectors) or
1-by-n matrices (basically row vectors); then the behavior of these
vectors may be easier to understand.
• Feel free to:
1) Print dimensions (a.shape) to check.
2) Use the reshape operation to make sure that your matrices or your
vectors have the dimension that you need.
If a.shape is (5,1), then a is a (5,1) matrix and behaves like a column vector.
If a.shape is (1,5), it behaves consistently as a row vector.
If you're not entirely sure what the dimension of one of your vectors is, use the reshape
command to make sure; in this case, reshape it to a (5,1) vector.
Logistic Regression Cost Function-
Why?
• The two equations, p(y=1|x) = y hat and p(y=0|x) = 1 − y hat, can be summarized into a
single equation: p(y|x) = y hat to the power of y, times (1 − y hat) to the power of (1 − y).
• How? Consider two cases: if y = 1, this gives p(y|x) = y hat; if y = 0, it gives p(y|x) = 1 − y hat.
• So what we've just shown is that this equation is a correct definition for p(y|x).
Logistic Regression Cost Function-
Why?
If we compute log of p(y|x), that's equal
to log of y hat to the power of y, times 1 − y hat
to the power of 1 − y.
That simplifies to y log y hat + (1 − y) times
log(1 − y hat).
Now, finally, because the log function is a strictly monotonically increasing function, maximizing log p(y|x)
gives you the same result as optimizing p(y|x).
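For reference, the same derivation written out compactly in LaTeX-style notation:

\begin{aligned}
p(y \mid x) &= \hat{y}^{\,y}\,(1-\hat{y})^{\,1-y} \\
\log p(y \mid x) &= y \log \hat{y} + (1-y)\log(1-\hat{y}) \\
&= -\mathcal{L}(\hat{y}, y)
\end{aligned}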
Logistic Regression Cost Function-
Why?
There's a negative sign there because
when training a learning algorithm we
want to make probabilities large,
whereas in logistic regression we
express this as wanting to minimize
the loss function.
So minimizing the loss corresponds to
maximizing the log of the probability.
log p(y|x) is actually the negative of the loss function that we defined previously:
L(y hat, y) = −(y log y hat + (1 − y) log(1 − y hat)).
This is what the loss function on a single example looks like. What about the overall cost?
Logistic Regression Cost Function-
Why?
If we assume that the training examples we've drawn are independently and identically
distributed, then the probability of all the labels in the training set is the product of the
probabilities:
the product from i = 1 through m of p(y(i) | x(i)).
If you want to carry out maximum likelihood estimation, then you want to find the
parameters that maximize the chance of your observations in the training set.
But maximizing this is the same as maximizing the log, so we just take logs on both sides.
So the log of the probability of the labels in the training set is equal to (since the log of a product is the sum of
the logs) the sum from i = 1 through m of log p(y(i) | x(i)). And we figured out on the previous
slide that each term is −L(y hat (i), y(i)).
Logistic Regression Cost Function-
Why?
• In statistics, there's a principle
called the principle of maximum
likelihood estimation.
• It means choosing the parameters
that maximize this probability of
the labels in the training set.
• Or in other words, choosing the
parameters that maximize the
negative sum from i = 1 through m
of L(y hat (i), y(i)) (we can just move
the negative sign outside the summation).
Logistic Regression Cost Function-
Why?
IID Example: Tossing a Coin
Suppose we toss a coin 10 times and keep track of the number of times the coin lands
on heads.
This is an example of a random variable that is independently and identically
distributed, because the following two conditions are met:
(1) Independent – The outcome of one coin toss does not affect the outcome of
another coin toss. Each toss is independent.
(2) Identically Distributed – The probability that a coin lands on heads on any given
toss is 0.5. This probability does not change from one toss to the next.

• So this justifies the cost we had for logistic regression, which is J(w,b).
• Because we now want to minimize the cost instead of maximizing the
likelihood, we've got to get rid of the minus sign.
• Finally, for convenience, to make sure that our quantities are better
scaled, we just add a 1/m extra scaling factor, giving
J(w,b) = (1/m) times the sum from i = 1 through m of L(y hat (i), y(i)).
• To summarize, by minimizing this cost function J(w,b) we're really carrying
out maximum likelihood estimation with the logistic regression model,
under the assumption that our training examples were independently and identically distributed (IID).