Introduction to Artificial Neural Networks
• An ANN is a Machine Learning model inspired by the networks of biological neurons found in our brains.
• There is now a huge quantity of data available to train neural networks, and ANNs frequently outperform other ML techniques on very large and complex problems.
• The tremendous increase in computing power since the 1990s now makes it possible to train large neural networks in a reasonable amount of time. This is in part due to Moore's law (the number of components in integrated circuits has doubled about every 2 years over the last 50 years), but also thanks to the gaming industry, which has stimulated the production of powerful GPU cards by the millions. Moreover, cloud platforms have made this power accessible to everyone.
• The training algorithms have been improved. To be fair, they are only slightly different from the ones used in the 1990s, but these relatively small tweaks have had a huge positive impact. Some theoretical limitations of ANNs have turned out to be benign in practice. For example, many people thought that ANN training algorithms were doomed because they were likely to get stuck in local optima, but it turns out that this is rather rare in practice (and when it is the case, they are usually fairly close to the global optimum).
• ANNs seem to have entered a virtuous circle of funding and progress. Amazing products based on ANNs regularly make the headline news, which pulls more and more attention and funding toward them, resulting in more and more progress and even more amazing products.
McCulloch and Pitts proposed a very simple model of the biological neuron, which later became known
as an artificial neuron: it has one or more binary (on/off) inputs and one binary output. The artificial
neuron activates its output when more than a certain number of its inputs are active. In their paper, they
showed that even with such a simplified model it is possible to build a network of artificial neurons that
computes any logical proposition you want.
Figure: ANNs performing simple logical computations: C = A (identity), C = A ∧ B, C = A ∨ B, C = A ∧ ¬B.
+ The first network on the left is the identity function: if neuron A is activated, then neuron C gets activated as well (since it receives two input signals from neuron A); but if neuron A is off, then neuron C is off as well.
+ The second network performs a logical AND: neuron C is activated only when both neurons A and B are
activated (a single input signal is not enough to activate neuron C).
+ The third network performs a logical OR: neuron C gets activated if either neuron A or neuron B is activated (or both).
+ Finally, if we suppose that an input connection can inhibit the neuron's activity (which is the case with biological neurons), then the fourth network computes a slightly more complex logical proposition: neuron C is activated only if neuron A is active and neuron B is off. If neuron A is active all the time, then you get a logical NOT: neuron C is active when neuron B is off, and vice versa.
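To make these networks concrete, here is a minimal Python sketch (an illustration, not taken from the text), assuming each neuron fires when at least two of its incoming signals are active and that neuron A sends two signals where the text says so:

# Sketch of the four McCulloch-Pitts networks described above.
def neuron(signals, threshold=2):
    # A McCulloch-Pitts unit fires when enough incoming signals are active.
    return int(sum(signals) >= threshold)

def identity(a):            # C = A: C receives two signals from A
    return neuron([a, a])

def logical_and(a, b):      # C = A AND B: a single active input is not enough
    return neuron([a, b])

def logical_or(a, b):       # C = A OR B: each input sends two signals to C
    return neuron([a, a, b, b])

def a_and_not_b(a, b):      # C = A AND (NOT B): B is an inhibitory connection
    return 0 if b else identity(a)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b), a_and_not_b(a, b))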
→ Perceptron: the Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron called a threshold logic unit (TLU), or sometimes a linear threshold unit (LTU). The inputs and the output are numbers (instead of binary on/off values), and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs (z = w1 x1 + w2 x2 + ... + wn xn = xᵀw), then applies a step function to that sum and outputs the result: h_w(x) = step(z), where z = xᵀw.
Figure: A threshold logic unit. The inputs x1, x2, x3 are multiplied by the weights w1, w2, w3 to form the weighted sum z = xᵀw; the step function is then applied, giving the output h_w(x) = step(xᵀw).
→ Threshold Logic Unit: the most common step function used in Perceptrons is the Heaviside step function; sometimes the sign function is used instead:

heaviside(z) = 0 if z < 0, 1 if z ≥ 0
sgn(z) = −1 if z < 0, 0 if z = 0, +1 if z > 0
A single TLU can be used for simple linear binary classification. It computes a linear combination of the inputs, and if the result exceeds a threshold, it outputs the positive class. Otherwise it outputs the negative class (just like a Logistic Regression or linear SVM classifier). A Perceptron is simply composed of a single layer of TLUs, with each TLU connected to all the inputs.
When all the neurons in a layer are connected to every neuron in the previous layer (i.e., its input neurons), the layer is called a fully connected layer, or a dense layer. The inputs of the Perceptron are fed to special passthrough neurons called input neurons: they output whatever input they are fed. All the input neurons form the input layer. Moreover, an extra bias feature is generally added (x0 = 1): it is typically represented using a special type of neuron called a bias neuron, which outputs 1 all the time. A Perceptron with two inputs and three outputs is represented in the figure below. This Perceptron can classify instances simultaneously into three different binary classes, which makes it a multioutput classifier.
Figure: Architecture of a Perceptron with two input neurons, one bias neuron (which always outputs 1), and three output neurons (TLUs).
→ Computing the outputs of a fully connected layer:

h_W,b(X) = φ(XW + b)
In this equation
+ As always, X represents the matrix of input features. It has one row per instance and one column per
feature.
+ The weight matrix W contains all the connection weights except for the ones from the bias neuron. It
has one row per input neuron and one column per artificial neuron in the layer.
+ The bias vector b contains all the connection weights between the bias neuron and the artificial
neurons. It has one bias term per artificial neuron.
+ The function φ is called the activation function: when the artificial neurons are TLUs, it is a step function.
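As a quick illustration of this equation, here is a small NumPy sketch (with made-up shapes and weights) that computes the outputs of a fully connected layer of TLUs for a batch of two instances:

import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # 2 instances, 3 features each
W = np.array([[ 0.5, -1.0],
              [ 0.2,  0.3],
              [-0.4,  0.8]])           # one row per input, one column per neuron
b = np.array([0.1, -0.2])              # one bias term per neuron

def step(z):                            # Heaviside step used as the activation phi
    return (z >= 0).astype(int)

outputs = step(X @ W + b)               # h_W,b(X) = phi(XW + b), shape (2, 2)
print(outputs)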
→ Perceptron learning rule (weight update):

w_i,j (next step) = w_i,j + η (y_j − ŷ_j) x_i
In this equation:
+ w_i,j is the connection weight between the ith input neuron and the jth output neuron.
+ x_i is the ith input value of the current training instance.
+ ŷ_j is the output of the jth output neuron for the current training instance.
+ y_j is the target output of the jth output neuron for the current training instance.
+ η is the learning rate.
→ Perceptrons in Scikit-Learn do not output class probabilities; instead, they make predictions based on a hard threshold.
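For instance, Scikit-Learn provides a Perceptron class implementing this learning rule; a minimal sketch on the iris dataset (detecting Iris setosa from petal length and width) could look like this:

from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]                  # petal length, petal width
y = (iris.target == 0).astype(int)        # is it an Iris setosa?

per_clf = Perceptron()                     # a single layer of TLUs under the hood
per_clf.fit(X, y)

print(per_clf.predict([[2.0, 0.5]]))       # hard class label, no probability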
→ In their 1969 monograph Perceptrons, Marvin Minsky and Seymour Papert highlighted a number of serious weaknesses of Perceptrons; in particular, the fact that they are incapable of solving some trivial problems (e.g., the Exclusive OR (XOR) classification problem). This is true of any other linear classification model (such as Logistic Regression classifiers), but researchers had expected much more from Perceptrons, and some were so disappointed that they dropped neural networks altogether in favor of higher-level problems such as logic, problem solving, and search.
It turns out that some of the limitations of Perceptrons can be eliminated by stacking multiple Perceptrons. The resulting ANN is called a Multilayer Perceptron (MLP). An MLP can solve the XOR problem, as you can verify by computing the output of the MLP represented on the right side of the figure: with inputs (0, 0) or (1, 1), the network outputs 0, and with inputs (0, 1) or (1, 0) it outputs 1.
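You can check this claim numerically. The following sketch uses hand-picked illustrative weights and thresholds (the exact values in the figure may differ), with step activations:

# A tiny MLP that computes XOR, using illustrative weights.
def step(z):
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 + x2 - 0.5)       # fires if at least one input is on (OR-like)
    h2 = step(x1 + x2 - 1.5)       # fires only if both inputs are on (AND-like)
    return step(h1 - h2 - 0.5)     # on iff "OR but not AND", i.e. XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_mlp(a, b))   # prints 0, 1, 1, 0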
→ The Multilayer Perceptron and Backpropagation:
An MLP is composed of one (passthrough) input layer, one or more layers of TLUs, called hidden layers, and one final layer of TLUs called the output layer (see the figure below). The layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers. Every layer except the output layer includes a bias neuron and is fully connected to the next layer.
Figure: Architecture of a Multilayer Perceptron with two inputs, one hidden layer, and three output neurons.

Note:
The signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a feedforward neural network (FNN).

When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN). The field of Deep Learning studies DNNs, and more generally models containing deep stacks of computations. Even so, many people talk about Deep Learning whenever neural networks are involved (even shallow ones).
→ Backpropagation is Gradient Descent using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network's error with regard to every single model parameter. In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution.
Automatically computing gradients is called automatic differentiation, or autodiff. There are various autodiff techniques, with different pros and cons. The one used by backpropagation is called reverse-mode autodiff.
Let's run through this algorithm in a bit more detail:
+ It handles one mini-batch at a time (for example, containing 32 instances each), and it goes through the full training set multiple times. Each pass is called an epoch.
+ Each mini-batch is passed to the network's input layer, which sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the forward pass: it's exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.
+ Next, the algorithm measures the network's output error (i.e., it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error).
+ Then it computes how much each output connection contributed to the error. This is done analytically by applying the chain rule (perhaps the most fundamental rule in calculus), which makes this step fast and precise.
+ The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, working backward until the algorithm reaches the input layer. As explained earlier, this reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network (hence the name of the algorithm).
+ Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.

→ Summary: for each training instance, the backpropagation algorithm first makes a prediction (forward pass) and measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally tweaks the connection weights to reduce the error (Gradient Descent step).
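To see the two passes in miniature, here is a hedged toy sketch using TensorFlow's reverse-mode autodiff (tf.GradientTape) on a single made-up neuron: the forward pass is recorded, one backward pass yields all the gradients, and a Gradient Descent step updates the parameters:

import tensorflow as tf

w = tf.Variable([[0.2], [-0.4]])          # hypothetical weights of a single neuron
b = tf.Variable([0.0])
X = tf.constant([[1.0, 2.0], [3.0, 4.0]]) # a mini-batch of 2 instances
y = tf.constant([[1.0], [0.0]])

learning_rate = 0.1
with tf.GradientTape() as tape:           # records the forward pass
    y_pred = tf.sigmoid(X @ w + b)
    loss = tf.reduce_mean((y - y_pred) ** 2)

grads = tape.gradient(loss, [w, b])       # one backward pass: gradients for all params
w.assign_sub(learning_rate * grads[0])    # Gradient Descent step
b.assign_sub(learning_rate * grads[1])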
Note:
It is important to initialize all the hidden layers' connection weights randomly, or else training will fail. For example, if you initialize all weights and biases to zero, then all neurons in a given layer will be perfectly identical, and thus backpropagation will affect them in exactly the same way, so they will remain identical. In other words, despite having hundreds of neurons per layer, your model will act as if it had only one neuron per layer: it won't be too smart. If instead you randomly initialize the weights, you break the symmetry and allow backpropagation to train a diverse team of neurons.
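A small sketch (with made-up random data and a tiny model, my own illustration) that makes the symmetry problem visible: with all-zero initialization the hidden neurons receive identical updates and stay identical, whereas the default random initializer breaks the tie:

import numpy as np
from tensorflow import keras

X = np.random.rand(64, 3)
y = np.random.rand(64, 1)

zero_model = keras.models.Sequential([
    keras.layers.Dense(4, activation="relu", kernel_initializer="zeros",
                       bias_initializer="zeros", input_shape=[3]),
    keras.layers.Dense(1, kernel_initializer="zeros", bias_initializer="zeros"),
])
zero_model.compile(loss="mse", optimizer="sgd")
zero_model.fit(X, y, epochs=5, verbose=0)

hidden_kernel = zero_model.layers[0].get_weights()[0]
print(hidden_kernel)   # every column (neuron) is still identical after training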
→ In order for backpropagation to work, the step function had to be replaced with the logistic (sigmoid) function, σ(z) = 1 / (1 + exp(−z)). This was essential because the step function contains only flat segments, so there is no gradient to work with (Gradient Descent cannot move on a flat surface), whereas the logistic function has a well-defined, nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step. Backpropagation also works well with other activation functions, such as the hyperbolic tangent, tanh(z) = 2σ(2z) − 1, and the ReLU function, ReLU(z) = max(0, z), which is continuous but not differentiable at z = 0 (and its derivative is 0 for z < 0); the softmax activation is typically used in the output layer for classification.
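Written out in NumPy (a small sketch for intuition, not from the original text), these activations and the step function look like this; note which of them have a usable gradient:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # smooth, nonzero derivative everywhere

def tanh(z):
    return np.tanh(z)                    # like the sigmoid but centered on 0

def relu(z):
    return np.maximum(0.0, z)            # derivative is 0 for z < 0 and 1 for z > 0

def step(z):
    return (z >= 0).astype(float)        # flat segments only: no gradient to follow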
→ Classification MLPs: for binary classification you need a single output neuron using the logistic activation function; for multilabel binary classification you need one output neuron per label; and for multiclass classification you need one output neuron per class, with the softmax activation applied to the whole output layer. In all cases the loss function is generally cross entropy.

Typical classification MLP architecture:

Hyperparameter          | Binary classification | Multilabel binary classification | Multiclass classification
Input and hidden layers | Same as regression    | Same as regression               | Same as regression
# output neurons        | 1                     | 1 per label                      | 1 per class
Output layer activation | Logistic              | Logistic                         | Softmax
Loss function           | Cross entropy         | Cross entropy                    | Cross entropy
→ Implementing MLPs with TensorFlow and Keras:
• Keras is a high-level Deep Learning API that allows you to easily build, train, evaluate, and execute all sorts of neural networks.
• TensorFlow itself now comes bundled with its own Keras implementation, tf.keras. It only supports TensorFlow as the backend, but it has the advantage of offering some very useful extra features (see the figure below): for example, it supports TensorFlow's Data API, which makes it easy to load and preprocess data efficiently.
Figure: Two implementations of the Keras API: the multibackend Keras library and tf.keras, which supports only the TensorFlow backend but adds TF-only features (such as the Data API).
→ Building an Image Classifier Using the Sequential API

→ Load the dataset (the Fashion MNIST dataset). Keras provides utility functions to fetch and load common datasets:

import tensorflow as tf

fashion_mnist = tf.keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

Every image is represented as a 28 × 28 array, and the pixel intensities are represented as integers (from 0 to 255).

→ Next, we create a validation set and scale the pixel intensities down to the 0 to 1 range, for both the training and validation data:

X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

→ We also need the list of class names:

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

→ Next, we create the model's architecture:
from tensorflow import keras

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dense(100, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))
Let's go through this code line by line:
+ The first line creates a Sequential model. This is the simplest kind of Keras model for neural networks that are just composed of a single stack of layers connected sequentially. This is called the Sequential API.
+ Next, we build the first layer and add it to the model. It is a Flatten layer whose role is to convert each input image into a 1D array: if it receives input data X, it computes X.reshape(-1, 28 * 28). This layer does not have any parameters; it is just there to do some simple preprocessing. Since it is the first layer in the model, you should specify the input_shape, which doesn't include the batch size, only the shape of the instances. Alternatively, you could add a keras.layers.InputLayer as the first layer, setting input_shape=[28, 28].
+ Next we add a Dense hidden layer with 300 neurons. It will use the ReLU activation function. Each Dense layer manages its own weight matrix, containing all the connection weights between the neurons and their inputs. It also manages a vector of bias terms (one per neuron). When it receives some input data, it computes the outputs of all its neurons, as in the fully connected layer equation shown earlier.
+ Then we add a second Dense hidden layer with 100 neurons, also using the ReLU activation function.
+ Finally, we add a Dense output layer with 10 neurons (one per class), using the softmax activation function (because the classes are exclusive).
→ Displaying the model's layers:

model.summary()

The summary() method displays all the model's layers, including each layer's name, its output shape, and its number of parameters. For example, the first Dense layer has 784 × 300 connection weights, plus 300 bias terms, which adds up to 235,500 parameters.

The shape of the weight matrix depends on the number of inputs. This is why it is recommended to
specify the input_shape when creating the first layer in a Sequential model. However, if you do not specify the input shape, it's OK: Keras will simply wait until it knows the input shape before it actually builds the model. This will happen either when you feed it actual data (e.g., during training), or when you call its build() method. Until the model is really built, the layers will not have any weights, and you will not be able to do certain things (such as print the model summary or save the model). So, if you know the input shape when creating the model, it is best to specify it.
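As a hedged illustration of this deferred-build behaviour, the following sketch creates a model without an input shape and then builds it explicitly; only after build() (or after seeing real data) do the layers get their weights:

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
# Calling model.summary() here would raise an error: the model is not built yet.
model.build(input_shape=[None, 28, 28])   # None stands for the batch dimension
model.summary()                            # now every layer has its weights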
→ Compiling the model:
After a model is created, you must call its compile() method to specify the loss function and the optimizer to use. Optionally, you can specify a list of extra metrics to compute during training and evaluation:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])
→ Training and Evaluating the Model:

To train the model, we simply call its fit() method:

history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))
And that's it! The neural network is trained. At each epoch during training, Keras displays the number of instances processed so far (along with a progress bar), the mean training time per sample, and the loss and accuracy (or any other extra metrics you asked for) on both the training set and the validation set.
Note:
• Instead of passing a validation set using the validation_data argument, you could set validation_split to the ratio of the training set that you want Keras to use for validation. For example, validation_split=0.1 tells Keras to use the last 10% of the data (before shuffling) for validation.
• The training set performance ends up beating the validation performance, as is generally the case when you train for long enough. You can tell that the model has not quite converged yet, as the validation loss is still going down, so you should probably continue training. It's as simple as calling the fit() method again, since Keras just continues training where it left off.
• If you are not satisfied with the performance of your model, you should go back and tune the
hyperparameters. The first one to check is the learning rate. If that doesn't help, try another optimizer (and always retune the learning rate after changing any hyperparameter). If the performance is still not great, then try tuning model hyperparameters such as the number of layers, the number of neurons per layer, and the types of activation functions to use for each hidden layer. You can also try tuning other hyperparameters, such as the batch size (it can be set in the fit() method using the batch_size argument, which defaults to 32).
→ Using the model to make predictions:

Next, we can use the model's predict() method to make predictions on new instances:

X_new = X_test[:3]
y_proba = model.predict(X_new)
y_proba.round(2)

For each instance, the model estimates one probability per class, from class 0 to class 9.
If you only care about the class with the highest estimated probability (even if that probability is quite low), then you can use the predict_classes() method instead:
y_pred = model.predict_classes(X_new)
y_pred
# array([9, 2, 1])

The classifier predicted classes 9, 2 and 1 for the three images, which correspond to the class names "Ankle boot", "Pullover" and "Trouser".
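Note that in recent versions of Keras the predict_classes() method has been removed; an equivalent (a small sketch reusing the class_names list defined earlier) is to take the argmax of the predicted probabilities:

import numpy as np

y_proba = model.predict(X_new)
y_pred = np.argmax(y_proba, axis=-1)       # index of the most probable class
print(np.array(class_names)[y_pred])       # map class indices to class names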
→ Building a Regression MLP Using the Sequential API:

Here we will use the California housing dataset to do a regression task with a Sequential API model:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split( housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)
Using the Sequential API to build, train, evaluate, and use a regression MLP to make predictions is quite
similar to what we did for classification. The main differences are the fact that the output layer has a
single neuron (since we only want to predict a single value) and uses no activation function, and the loss
function is the mean squared error. Since the dataset is quite noisy, we just use a single hidden layer with
fewer neurons than before, to avoid overfitting:
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1)])
model.compile(loss="mean_squared_error", optimizer="sgd")
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3]  # pretend these are new instances
y_pred = model.predict(X_new)
→ Building Complex Models Using the Functional API (Wide & Deep network):

Figure: A Wide & Deep neural network. The input layer is connected both directly to the output layer (the wide path) and through two hidden layers (the deep path).

• First, we need to create an Input object.
• Next, we create a Dense layer with 30 neurons, using the ReLU activation function (hidden1).
• We then create a second hidden layer (hidden2).
• Next, we create a Concatenate layer to concatenate the input and the output of the second hidden layer.
• Then we create the output layer, with a single neuron and without any activation function.
• Lastly, we create a Keras Model, specifying which inputs and outputs to use:
input_ = keras.layers.Input(shape=X_train.shape[1:])
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_, hidden2])
output = keras.layers.Dense(1)(concat)
model = keras.Model(inputs=[input_], outputs=[output])
→ A neural network can also have multiple outputs (for example, an auxiliary output used as a regularization technique, to make sure the underlying part of the network learns something useful on its own). Since each output needs its own loss function, when we compile the model we pass a list of losses and, optionally, loss weights, so that the main output counts for much more than the auxiliary output:

model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer="sgd")
→ Now we can train the model. Since each output needs its own labels, we provide the targets for both outputs (here, the same targets):

history = model.fit([X_train_A, X_train_B], [y_train, y_train],
                    epochs=20)
→ When we evaluate the model, Keras will return the total loss, as well as all the individual losses:

total_loss, main_loss, aux_loss = model.evaluate(
    [X_test_A, X_test_B], [y_test, y_test])
→ Similarly, the predict() method will return predictions for each output:

y_pred_main, y_pred_aux = model.predict([X_new_A, X_new_B])
→ Saving and Restoring a Model:

model.save("my_keras_model.h5")

Keras will use the HDF5 format to save both the model's architecture (including every layer's hyperparameters) and the values of all the model parameters.

→ Loading the model to make predictions:

model = keras.models.load_model("my_keras_model.h5")
→ Callbacks:
What if training lasts several hours? This is quite common, especially when training on large datasets. In this case, you should not only save your model at the end of training, but also save checkpoints at regular intervals during training, to avoid losing everything if your computer crashes. But how can you tell the fit() method to save checkpoints? Use callbacks.
The fit() method accepts a callbacks argument that lets you specify a list of objects that Keras will call at the start and end of training, at the start and end of each epoch, and even before and after processing each batch. For example, the ModelCheckpoint callback saves checkpoints of your model at regular intervals during training, by default at the end of each epoch:
checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5")
history = model.fit(X_train, y_train, epochs=10, callbacks=[checkpoint_cb])

Moreover, if you use a validation set during training, you can set save_best_only=True when creating the
ModelCheckpoint. In this case, it will only save your model when its performance on the validation set is
the best so far. This way, you do not need to worry about training for too long and overfitting the training
set: simply restore the last model saved after training, and this will be the best model on the validation
set.
checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5",
                                                save_best_only=True)
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb])
→ Another way to implement early stopping is to simply use the EarlyStopping callback. It will interrupt training when it measures no progress on the validation set for a number of epochs (defined by the patience argument), and it will optionally roll back to the best model. You can combine both callbacks to save checkpoints of your model and to interrupt training early when there is no more progress:

early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
                                                  restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb, early_stopping_cb])
→ Fine-Tuning Neural Network Hyperparameters:
• The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Not only can you use any imaginable network architecture, but even in a simple MLP you can change the number of layers, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and much more.
• How do you know what combination of hyperparameters is the best for your task? One option is to simply try many combinations of hyperparameters and see which one works best on the validation set (or use K-fold cross-validation). For example, we can use GridSearchCV or RandomizedSearchCV to explore the hyperparameter space (see the sketch after the list of libraries below). There are also dedicated hyperparameter optimization libraries, such as:
• Keras Tuner: an easy-to-use hyperparameter optimization library by Google for Keras models, with a hosted service for visualization and analysis.
• Scikit-Optimize (skopt): a general-purpose optimization library. The BayesSearchCV class performs Bayesian optimization using an interface similar to GridSearchCV.
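As a rough sketch of the RandomizedSearchCV route mentioned above (assuming the California housing variables from the regression example, and the older keras.wrappers.scikit_learn wrapper, which newer code bases replace with the scikeras package):

import numpy as np
from tensorflow import keras
from sklearn.model_selection import RandomizedSearchCV

def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[8]):
    # Build and compile a regression MLP with the given hyperparameters.
    model = keras.models.Sequential()
    model.add(keras.layers.InputLayer(input_shape=input_shape))
    for _ in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(1))
    model.compile(loss="mse",
                  optimizer=keras.optimizers.SGD(learning_rate=learning_rate))
    return model

keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)

param_distribs = {
    "n_hidden": [0, 1, 2, 3],
    "n_neurons": np.arange(1, 100),
    "learning_rate": [3e-4, 3e-3, 3e-2],
}
rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10, cv=3)
rnd_search_cv.fit(X_train, y_train, epochs=100,
                  validation_data=(X_valid, y_valid),
                  callbacks=[keras.callbacks.EarlyStopping(patience=10)])
print(rnd_search_cv.best_params_)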
Note:
• For many problems you can start with just one or two hidden layers and the neural network will work just fine. For instance, you can easily reach above 97% accuracy on the MNIST dataset using just one hidden layer with a few hundred neurons, and above 98% accuracy using two hidden layers with the same total number of neurons, in roughly the same amount of training time. For more complex problems, you can ramp up the number of hidden layers until you start overfitting the training set. Very complex tasks, such as large image classification or speech recognition, typically require networks with dozens of layers (or even hundreds, but not fully connected ones), and they need a huge amount of training data. You will rarely have to train such networks from scratch: it is much more common to reuse parts of a pretrained state-of-the-art network that performs a similar task.