AD3501 DL Unit 3 Notes
AD3501 - DEEP LEARNING
UNIT III - RECURRENT NEURAL NETWORKS
RECURRENT NEURAL NETWORKS
Introduction to RNN
Traditional neural networks have independent input and output layers, which makes them inefficient when dealing with sequential data. Hence, a new kind of neural network, the Recurrent Neural Network (RNN), was introduced to store the results of previous outputs in an internal memory. These results are then fed back into the network along with the inputs in order to predict the output of the layer. This allows RNNs to be used in applications like pattern detection, speech and voice recognition, natural language processing, and time series prediction.
Below is how we can convert a Feed-Forward Neural Network into a Recurrent Neural Network:
Fig: Simple Recurrent Neural Network
An RNN has hidden layers that act as memory locations, storing the outputs of a layer in a loop. Here, “x” is the input layer, “h” is the hidden layer (acting as the memory that stores the outputs of a layer in a loop), and “y” is the output layer. A, B, and C are the network parameters used to improve the output of the model. At any given time t, the current input is a combination of the input at x(t) and x(t-1). The output at any given time is fed back into the network to improve on the output.
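To make the recurrence concrete, here is a minimal NumPy sketch of an RNN forward pass (not from the notes; the weight names Wx, Wh, Wy and the toy sizes are illustrative). The key point is that the same parameters are reused at every time step and the hidden state carries the memory forward.

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, Wy, bh, by):
    """Run a simple RNN over a sequence of input vectors xs.

    The same parameters are reused at every time step:
        h_t = tanh(Wx @ x_t + Wh @ h_{t-1} + bh)
        y_t = Wy @ h_t + by
    """
    h = np.zeros(Wh.shape[0])                 # initial hidden state h0
    outputs = []
    for x in xs:                              # process the sequence one step at a time
        h = np.tanh(Wx @ x + Wh @ h + bh)     # hidden state acts as the memory
        y = Wy @ h + by                       # output at this time step
        outputs.append(y)
    return outputs, h

# Toy usage: 4 random 3-dimensional inputs, hidden size 5, output size 2
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(4)]
Wx, Wh, Wy = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
outputs, h_final = rnn_forward(xs, Wx, Wh, Wy, np.zeros(5), np.zeros(2))
print(len(outputs), h_final.shape)            # 4 outputs and a final hidden state of size 5
```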
Why Recurrent Neural Networks?
RNNs were created because there were a few issues with the feed-forward neural network:
Cannot handle sequential data
Considers only the current input
Cannot memorize previous inputs
The solution to these issues is the RNN. An RNN can handle sequential data, accepting the current input along with previously received inputs. RNNs can memorize previous inputs due to their internal memory.
How Do Recurrent Neural Networks Work?
In Recurrent Neural Networks, the information cycles through a loop to the middle hidden layer.
Fig: Working of a Recurrent Neural Network
The input layer ‘x’ takes in the input to the neural network, processes it, and passes it on to the middle layer.
The Recurrent Neural Network will standardize the different activation functions, weights, and biases so that each hidden layer has the same parameters. Then, instead of creating multiple hidden layers, it will create one and loop over it as many times as required.
Feed-Forward Neural Networks vs Recurrent Neural Networks
A feed-forward neural network allows information to flow only in the forward direction, from the input nodes, through the hidden layers, and to the output nodes. There are no cycles or loops in the network. Below is a simplified representation of a feed-forward neural network:
Fig: Feed-forward Neural Network
In a feed-forward neural network, decisions are based only on the current input. It does not memorize past data, and there is no notion of future context. Feed-forward neural networks are used in general regression and classification problems.
Applications of Recurrent Neural Networks
Image Captioning: RNNs are used to caption an image by analysing the activities present.
Time Series Prediction: Any time series problem, like predicting the prices of stocks in a
particular month, can be solved using an RNN.
Natural Language Processing: Text mining and Sentiment analysis can be carried out using an
RNN for Natural Language Processing (NLP).
Machine Translation: Given an input in one language, RNNs can be used to translate the input
into different languages as output.
Advantages of Recurrent Neural Networks
Recurrent Neural Networks (RNNs) have several advantages over other types of neural networks, including:
Ability to Handle Variable-Length Sequences: RNNs are designed to handle input sequences
of variable length, which makes them well-suited for tasks such as speech recognition, natural
language processing, and time series analysis.
Memory of Past Inputs: RNNs have a memory of past inputs, which allows them to capture information about the context of the input sequence. This makes them useful for tasks such as language modelling, where the meaning of a word depends on the context in which it appears.
Parameter Sharing: RNNs share the same set of parameters across all time steps, which reduces the number of parameters that need to be learned and can lead to better generalization.
Non-Linear Mapping: RNNs use non-linear activation functions, which allow them to learn
complex, non-linear mappings between inputs and outputs.
Sequential Processing: RNNs process input sequences one step at a time in their natural order, which lets a fixed-size model handle sequences of arbitrary length.
Flexibility: RNNs can be adapted to a wide range of tasks and input types, including text, speech, and image sequences.
Improved Accuracy: RNNs have been shown to achieve state-of-the-art performance on a variety of sequence modeling tasks, including language modeling, speech recognition, and machine translation.
Disadvantages of Recurrent Neural Networks
Although Recurrent Neural Networks (RNNs) have several advantages, they also have some
disadvantages. Here are some of the main disadvantages of RNNs:
Vanishing and Exploding Gradients: RNNs can suffer from the problem of vanishing or exploding gradients, which can make it difficult to train the network effectively. This occurs when the gradients of the loss function with respect to the parameters become very small or very large as they propagate through time.
Lack of Parallelism: RNNs are inherently sequential, which makes it difficult to parallelize the computation. This can limit the speed and scalability of the network.
Difficulty in Choosing the Right Architecture: There are many different variants of RNNs, each with its own advantages and disadvantages. Choosing the right architecture for a given task can be challenging, and may require extensive experimentation and tuning.
These disadvantages are important when deciding whether to use an RNN for a given task.
However, many of these issues can be addressed through careful design and training of the
network and through techniques such as regularization and attention mechanisms.
The four commonly used types of Recurrent Neural Networks are:
1. One-to-One
The simplest type of RNN is One-to-One, which allows a single input and a single output. It has fixed input and output sizes and acts as a traditional neural network. The One-to-One application can be found in Image Classification.
One-to-One
2. One-to-Many
One-to-Many is a type of RNN that gives multiple outputs when given a single input. It takes a fixed input size and gives a sequence of data outputs. Its applications can be found in Music Generation and Image Captioning.
One-to-Many
3. Many-to-One
Many-to-One is used when a single output is required from multiple input units or a sequence of them. It takes a sequence of inputs to display a fixed output. Sentiment Analysis is a common example of this type of Recurrent Neural Network.
4. Many-to-Many
Many-to-Many is used to generate a sequence of output data from a sequence of input units. This type of RNN is further divided into the following two subcategories:
1. Equal Unit Size: In this case, the number of input and output units is the same. A common application can be found in Named-Entity Recognition.
2. Unequal Unit Size: In this case, inputs and outputs have different numbers of units. Its
application can be found in Machine Translation.
Two Issues of Standard RNNs
1. Vanishing Gradient Problem
Recurrent Neural Networks enable us to model time-dependent and sequential data problems, such as stock market prediction, machine translation, and text generation. We will find, however, that an RNN is hard to train because of the gradient problem.
RNNs suffer from the problem of vanishing gradients. The gradients carry the information used to update the RNN, and when the gradient becomes too small, the parameter updates become insignificant. This makes the learning of long data sequences difficult.
2. Exploding Gradient Problem
While training a neural network, if the slope tends to grow exponentially instead of decaying, this is called an Exploding Gradient. This problem arises when large error gradients accumulate, resulting in very large updates to the neural network model weights during the training process. Long training time, poor performance, and bad accuracy are the major issues caused by gradient problems.
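A toy scalar calculation (an illustration, not part of the notes) shows why this happens: backpropagation through T time steps multiplies roughly T recurrent factors together, so factors below 1 shrink the gradient towards zero while factors above 1 blow it up.

```python
# Toy scalar illustration: the gradient flowing back through T time steps is roughly
# a product of T recurrent factors. Factors < 1 vanish, factors > 1 explode.
def backprop_factor(w, T):
    grad = 1.0
    for _ in range(T):
        grad *= w            # one multiplication per time step in the chain rule
    return grad

print(backprop_factor(0.9, 50))   # ~0.005 -> vanishing gradient
print(backprop_factor(1.1, 50))   # ~117   -> exploding gradient
```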
Variant RNN Architectures
There are several variant RNN architectures that have been developed over the years to address
the limitations of the standard RNN architecture. Here are a few examples:
Long Short-Term Memory (LSTM) Networks
LSTM is a type of RNN that is designed to handle the vanishing gradient problem that can occur in standard RNNs. It does this by introducing three gating mechanisms that control the flow of information through the network: the input gate, the forget gate, and the output gate. These gates allow the LSTM network to selectively remember or forget information from the input sequence, which makes it more effective for long-term dependencies.
Gated Recurrent Unit (GRU) Networks
GRU is another type of RNN that is designed to address the vanishing gradient problem. It has two gates: the reset gate and the update gate. The reset gate determines how much of the previous state should be forgotten, while the update gate determines how much of the new state should be remembered. This allows the GRU network to selectively update its internal state based on the input sequence.
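As a rough sketch of these two gates (plain NumPy, illustrative weight names, biases omitted), one GRU step could look like this:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, Wr, Wz, Wh):
    """One GRU step with a reset gate r and an update gate z."""
    xh = np.concatenate([x, h_prev])
    r = sigmoid(Wr @ xh)                                     # how much of the previous state to forget
    z = sigmoid(Wz @ xh)                                     # how much of the new candidate to keep
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_prev]))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                  # blend old state and candidate

rng = np.random.default_rng(0)
x, h = rng.normal(size=3), np.zeros(5)                       # input size 3, hidden size 5
Wr, Wz, Wh = (rng.normal(size=(5, 8)) for _ in range(3))
print(gru_step(x, h, Wr, Wz, Wh).shape)                      # (5,)
```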
Bidirectional RNNs:
Bidirectional RNNs are designed to process input sequences in both forward and backward
directions. This allows the network to capture both past and future context, which can be useful
for speech recognition and natural language processing tasks.
Encoder-Decoder RNNs:
Encoder-decoder RNNs consist of two RNNs: an encoder network that processes the input
sequence and produces a fixed-length vector representation of the input and a decoder network
that generates the output sequence based on the encoder's representation. This architecture is
commonly used for sequence-to-sequence tasks such as machine translation.
Attention Mechanisms
Attention mechanisms are a technique that can be used to improve the performance of RNNs on tasks that involve long input sequences. They work by allowing the network to attend to different parts of the input sequence selectively rather than treating all parts of the input sequence equally. This can help the network focus on the input sequence's most relevant parts and ignore irrelevant information.
These are just a few examples of the many variant RNN architectures that have been developed
over the years. The choice of architecture depends on the specific task and the characteristics of
the input and output sequences.
Encoder-Decoder Model
There are three main blocks in the encoder-decoder model:
Encoder
Hidden Vector
Decoder
The Encoder will convert the input sequence into a single fixed-length vector (the hidden vector). The decoder will convert the hidden vector into the output sequence.
Encoder-Decoder models are jointly trained to maximize the conditional probability of the target sequence given the input sequence.
SEQUENCE TO SEQUENCE RNN
How does the Sequence to Sequence Model work?
In order to fully understand the model's underlying logic, we will go over the below illustration:
Encoder-decoder sequence to sequence model
Encoder
Multiple RNN cells can be stacked together to form the encoder. The RNN reads each input sequentially.
For every time step (each input) t, the hidden state (hidden vector) h is updated according to the input at that time step, X[i].
After all the inputs are read by the encoder model, the final hidden state of the model represents the context/summary of the whole input sequence.
Example: Consider the input sequence “I am a Student” to be encoded. There will be a total of 4 time steps (4 tokens) for the encoder model. At each time step, the hidden state h will be updated using the previous hidden state and the current input.
Example: Encoder
At the first time step t1, the previous hidden state h0 is taken to be zero or randomly chosen. So the first RNN cell will update the current hidden state using the first input and h0. Each cell outputs two things: the updated hidden state and the output for that stage. The outputs at each stage are discarded and only the hidden states are propagated to the next step.
The hidden states h_i are computed using the formula: h_t = f(W_hh * h_(t-1) + W_hx * x_t).
At the second time step t2, the hidden state h1 and the second input X[2] are given as input, and the hidden state h2 is computed from both. This happens for all four time steps in the example taken.
The encoder is a stack of several recurrent units (LSTM or GRU cells for better performance), where each accepts a single element of the input sequence, collects information for that element, and propagates it forward.
In the question-answering problem, the input sequence is a collection of all words from the question. Each word is represented as x_i, where i is the order of that word.
This simple formula represents the result of an ordinary recurrent neural network. As you can see, we just apply the appropriate weights to the previous hidden state h_(t-1) and the input vector x_t.
Encoder Vector
This is the final hidden state produced from the encoder part of the model. It is calculated using the formula above.
This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate predictions.
Decoder
The input for the decoder is the final hidden vector obtained at the end of the encoder model.
Each decoder step has three inputs: the hidden vector from the previous step h_(t-1), the previous output y_(t-1), and the original hidden (context) vector h.
The second step takes the updated hidden state h1, the previous output y1, and the original hidden vector h as current inputs, and produces the hidden vector h2 and output y2.
The decoder is a stack of several recurrent units, where each predicts an output y_t at time step t.
Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.
In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is represented as y_i, where i is the order of that word.
Example: Decoder
Any hidden state h_i is computed using the formula: h_t = f(W_hh * h_(t-1)).
As you can see, we are just using the previous hidden state to compute the next one.
Output Layer
We use the Softmax activation function at the output layer.
It is used to produce a probability distribution from a vector of values, with the target class having the highest probability.
The output y_t at time step t is computed using the formula: y_t = Softmax(W_S * h_t).
We calculate the outputs using the hidden state at the current time step together with the respective weight W(S). Softmax is used to create a probability vector that will help us determine the final output (e.g. a word in the question-answering problem).
The power of this model lies in the fact that it can map sequences of different lengths to each other. As you can see, the inputs and outputs are not directly aligned and their lengths can differ. This opens up a whole new range of problems that can be solved using such an architecture.
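A minimal encoder-decoder sketch in PyTorch is given below, assuming GRU cells and illustrative class names (Encoder, Decoder) and sizes; it is meant only to show the flow of the context vector from encoder to decoder, not a production translation model.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) token ids
        _, h = self.rnn(self.embed(src))         # keep only the final hidden state
        return h                                 # (1, batch, hidden_dim) context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, h):                   # tgt: previous target tokens, h: context
        out, h = self.rnn(self.embed(tgt), h)    # decoder state initialised with the context
        return self.out(out), h                  # logits over the output vocabulary

# Toy usage: 2 source sentences of length 5 mapped to 6 output steps
enc, dec = Encoder(vocab_size=100), Decoder(vocab_size=120)
src = torch.randint(0, 100, (2, 5))
tgt = torch.randint(0, 120, (2, 6))
logits, _ = dec(tgt, enc(src))
print(logits.shape)                              # torch.Size([2, 6, 120])
```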
Applications
It possesses many applications, such as:
Google's Machine Translation
Question answering chatbots
Speech recognition
Time Series Applications, etc.
BIDIRECTIONAL RNN
In a Bi-RNN, the input data is passed through two separate RNNs: one processes the data in the forward direction, while the other processes it in the reverse direction. The outputs of these two RNNs are then combined in some way to produce the final output.
One common way to combine the outputs of the forward and reverse RNNs is to concatenate them, but other methods, such as element-wise addition or multiplication, can also be used. The choice of combination method can depend on the specific task and the desired properties of the final output.
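For illustration, the sketch below (PyTorch, illustrative sizes) runs a bidirectional RNN and shows both combination options: PyTorch concatenates the two directions in its output, and element-wise addition can be applied on top of that.

```python
import torch
import torch.nn as nn

# A single-layer RNN run in both directions; PyTorch concatenates the forward and
# backward hidden states along the feature dimension of the output.
birnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

x = torch.randn(4, 10, 8)        # 4 sequences, 10 time steps, 8 features each
out, h_n = birnn(x)
print(out.shape)                 # (4, 10, 32): forward (16) and backward (16) concatenated
print(h_n.shape)                 # (2, 4, 16): final hidden state of each direction

# Element-wise addition is one alternative way to combine the two directions:
fwd, bwd = out[..., :16], out[..., 16:]
combined = fwd + bwd             # (4, 10, 16)
```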
Need for Bi-directional RNNs
A uni-directional recurrent neural network (RNN) processes input sequences in a single
direction, either from left to right or right to left.
This means that the network can only use information from earlier time steps when making
predictions at later time steps.
This can be limiting, as the network may not capture important contextual information
relevant to the output prediction.
For example, in natural language processing tasks, a uni-directional RNN may not accurately
predict the next word in a sentence if the previous words provide important context for the
current word.
Consider an example where we could use the recurrent network to predict the masked word in a sentence.
1. Apple is my favorite ____.
2. Apple is my favorite ____, and I work there.
3. Apple is my favorite ____, and I am going to buy one.
In the first sentence, the answer could be fruit, company, or phone. But in the second and third sentences, it cannot be a fruit.
A Recurrent Neural Network that can only process the inputs from left to right might not be able to accurately predict the right answer for the sentences discussed above.
Bi-directional RNNs
A bidirectional recurrent neural network (RNN) is a type of recurrent neural network (RNN)
that processes input sequences in both forward and backward directions.
This allows the RNN to capture information from the input sequence that may be relevant to
the output prediction, but the same could be lost in a traditional RNN that only processes the
input sequence in one direction.
This allows the network to consider information from the past and future when making predictions, rather than just relying on the input data at the current time step.
This can be useful for tasks such as language processing, where understanding the context of a word or phrase can be important for making accurate predictions.
In general, bidirectional RNNs can help improve the performance of a model on a variety of
sequence-based tasks.
This means that the network has two separate RNNs:
These two RNNs are typically referred to as the forward and backward RNNs, respectively.
During the forward pass of the RNN, the forward RNN processes the input sequence in the usual way by taking the input at each time step and using it to update the hidden state. The updated hidden state is then used to predict the output at that time step.
Back-propagation through time (BPTT) is a widely used algorithm for training recurrent neural networks (RNNs). It is a variant of the back-propagation algorithm specifically designed to handle the temporal nature of RNNs, where the output at each time step depends on the inputs and outputs at previous time steps.
In the case of a bidirectional RNN, BPTT involves two separate back-propagation passes: one for the forward RNN and one for the backward RNN. During the forward pass, the forward RNN processes the input sequence in the usual way and makes predictions for the output sequence. These predictions are then compared to the target output sequence, and the error is back-propagated through the network to update the weights of the forward RNN.
During the backward pass, the backward RNN processes the input sequence in reverse order and makes predictions for the output sequence. These predictions are then compared to the target output sequence in reverse order, and the error is back-propagated through the network to update the weights of the backward RNN.
Once both passes are complete, the weights of the forward and backward RNNs are updated based on the errors computed during the forward and backward passes, respectively. This process is repeated for multiple iterations until the model converges and the predictions of the bidirectional RNN are accurate.
This allows the bidirectional RNN to consider information from past and future time steps when
making predictions, which can significantly improve the model's accuracy.
Applications of Bi-directional RNNs
Bidirectional recurrent neural networks (RNNs) can outperform traditional RNNs on various
tasks, particularly those involving sequential data processing. Some examples of tasks where
bidirectional RNNs have been shown to outperform traditional RNNs include:
Natural language processing tasks, such as language translation and sentiment analysis, where understanding the context of a word or phrase can be important for making accurate predictions.
Time series forecasting tasks, such as predicting stock prices or weather patterns, where the
sequence of past data can provide important clues about future trends.
Audio processing tasks, such as speech recognition or music generation, where the
information in the audio signal can be complex and non-linear.
In general, bidirectional RNNs can be useful for any task where the input data has a temporal structure and where both past and future context can inform the prediction.
Advantages and Disadvantages of Bi-directional RNNs
Advantages:
Bidirectional Recurrent Neural Networks (RNNs) have several advantages over traditional RNNs. Some of the key advantages of bidirectional RNNs include the following:
Improved performance on tasks that involve processing sequential data. Because bidirectional RNNs can consider information from both past and future time steps when making predictions, they can outperform traditional RNNs on tasks such as natural language processing, time series forecasting, and audio processing.
Disadvantages:
However, Bidirectional RNNs also have some disadvantages. Some of the key disadvantages of
bidirectional RNNs include the following:
Increased computational complexity. Because bidirectional RNNs have two separate RNNs
(one for the forward pass and one for the backward pass), they can require more
computational resources to train and evaluate than traditional RNNs. This can make them
more difficult to implement and less efficient in terms of runtime performance.
The need for longer input sequences. For a bidirectional RNN to capture long-term dependencies in the data, it typically requires longer input sequences than a traditional RNN. This can be a disadvantage in situations where the input data is limited or noisy, as it may not be possible to generate enough input data to train the model effectively.
RECURSIVE NEURAL NETWORKS
Recursive Neural Networks (RvNNs) are a class of deep neural networks that can learn detailed
and structured information. With RvNN, you can get a structured prediction by recursively
applying the same set of weights on structured inputs. The word recursive indicates that the
neural network is applied to its output.
Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical data.
The tree structure means combining child nodes and producing parent nodes. Each child-parent
bond has a weight matrix, and similar children have the same weights. The number of children
for every node in the tree is fixed to enable it to perform recursive operations and use the same
weights. RvNNs are used when there's a need to parse an entire sentence.
To calculate the parent node's representation, we add the products of the weight matrices (W_i) and the children's representations (C_i) and apply the transformation f: h = f(W_1 C_1 + W_2 C_2 + … + W_n C_n).
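A small NumPy sketch of this recursive composition over a binary tree is shown below; the weight names (W_left, W_right) and the random leaf vectors are illustrative assumptions, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4
# One weight matrix per child position of a binary tree, shared across the whole tree.
W_left, W_right = rng.normal(size=(DIM, DIM)), rng.normal(size=(DIM, DIM))
b = np.zeros(DIM)

def compose(node):
    """Recursively compute a node representation from its children.

    A leaf is a word vector; an internal node is a (left_child, right_child) pair.
    The same weights are applied at every level of the tree.
    """
    if isinstance(node, np.ndarray):                     # leaf: already a word embedding
        return node
    left, right = node
    c1, c2 = compose(left), compose(right)
    return np.tanh(W_left @ c1 + W_right @ c2 + b)       # parent = f(W1*C1 + W2*C2 + b)

# Tree for the phrase "a (lot (of fun))" with random word vectors as leaves
a, lot, of, fun = (rng.normal(size=DIM) for _ in range(4))
phrase = (a, (lot, (of, fun)))
print(compose(phrase).shape)                             # (4,): representation of the whole phrase
```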
Recurrent Neural Networks (RNNs) are another well-known class of neural networks used for processing sequential data. They are closely related to the Recursive Neural Network.
Recurrent Neural Networks represent temporal sequences, which is why they find application in Natural Language Processing (NLP), since language-related data like sentences and paragraphs are sequential in nature. Recurrent networks are usually chain structures. The weights are shared across the chain length, keeping the dimensionality constant.
On the other hand, Recursive Neural Networks operate on hierarchical data models due to their tree structure. There are a fixed number of children for each node in the tree so that it can execute recursive operations and use the same weights for each step. Child representations are combined into parent representations.
The efficiency of a recursive network is higher than that of a feed-forward network.
Recurrent networks are recurrent over time; in that sense, recursive networks are a generalization of the recurrent network.
Recursive Neural Network Implementation
A Recursive Neural Network is used for sentiment analysis in natural language sentences. It is
one of the most important tasks of Natural language Processing (NLP), which identifies the
writing tone and sentiments of the writer in a particular sentence. If a writer expresses any
sentiment, basic labels about the writing tone are recognized. We want to identify the smaller
components like nouns or verb phrases and order them in a syntactic hierarchy. For example, it
identifies whether the sentence showcases a constructive form of writing or negative word
choices.
A variable called 'score' is calculated at each traversal of nodes, telling us which pair of phrases and words we must combine to form the perfect syntactic tree for a given sentence.
Example: “… is a lot of fun.”
An RNN representation of this phrase would not be suitable because it considers only sequential relations. Each state varies with the preceding words' representation. So, a subsequence that doesn't occur at the beginning of the sentence can't be represented on its own. With an RNN, when processing the word 'fun,' the hidden state will represent the whole sentence.
However, with a Recursive Neural Network (RvNN), the hierarchical architecture can store the representation of the exact phrase: it lies in the hidden state of the node R_{a lot of fun}. Thus, syntactic parsing is completely implemented with the help of Recursive Neural Networks.
The two significant advantages of Recursive Neural Networks for Natural Language Processing are their structure and the reduction in network depth.
As already explained, the tree structure of Recursive Neural Networks can manage hierarchical data, like in parsing problems.
Another benefit of RvNNs is that the trees can have a logarithmic height. When there are O(n) input words, a Recursive Neural Network can represent a binary tree with height O(log n). This lessens the distance between the first and last input elements. Hence, the long-term dependency becomes shorter and easier to capture.
The main disadvantage of recursive neural networks is the tree structure itself. Using the tree structure introduces a unique inductive bias into our model: the assumption that the data follow a tree hierarchy. When that assumption does not hold, the network may not be able to learn the existing patterns.
Another disadvantage of the Recursive Neural Network is that sentence parsing can be slow and ambiguous. Interestingly, there can be many parse trees for a single sentence.
Also, it is more time-consuming and labor-intensive to label the training data for recursive neural networks than for recurrent neural networks. Manually parsing a sentence into short components is more time-consuming and tedious than assigning a label to a sentence.
Gated Architecture
LONG SHORT TERM MEMORY NETWORK (LSTM)
LSTM is used in the field of Deep Learning. It is a variety of recurrent neural network (RNN) that is capable of learning long-term dependencies, especially in sequence prediction problems.
LSTMs are predominantly used to learn, process, and classify sequential data because these
networks can learn long-term dependencies between time steps of data. Common LSTM
applications include sentiment analysis, language modelling, speech recognition, and video
analysis.
LSTM has feedback connections, i.e., it is capable of processing entire sequences of data, not just single data points such as images. This finds application in speech recognition, machine translation, etc. LSTM is a special kind of RNN which shows outstanding performance on a large variety of problems.
The Logic behind LSTM
The central role of an LSTM model is held by a memory cell known as a ‘cell state’ that
maintains its state over time. The cell state is the horizontal line that runs through the top of the
below diagram. It can be visualized as a conveyor belt through which information just flows,
unchanged.
Information can be added to or removed from the cell state in an LSTM, and this is regulated by gates. These gates optionally let information flow in and out of the cell. Each gate contains a pointwise multiplication operation and a sigmoid neural net layer that assist the mechanism.
The sigmoid layer gives out numbers between zero and one, where zero means ‘nothing should
be let through,’ and one means ‘everything should be let through.’
1. Forget Gate (f): At the forget gate, the input is combined with the previous output to generate a fraction between 0 and 1 that determines how much of the previous state needs to be preserved (or, in other words, how much of the state should be forgotten). This output is then multiplied with the previous state. Note: an activation output of 1.0 means “remember everything” and an activation output of 0.0 means “forget everything.” From a different perspective, a better name for the forget gate might be the “remember gate”.
2. Input Gate (i): The input gate operates on the same signals as the forget gate, but here the objective is to decide which new information is going to enter the state of the LSTM. The output of the input gate (again a fraction between 0 and 1) is multiplied with the output of the tanh block that produces the new values that must be added to the previous state. This gated vector is then added to the previous state to generate the current state.
3. Input Modulation Gate (g): It is often considered a sub-part of the input gate, and much of the literature on LSTMs does not even mention it, assuming it is inside the input gate. It is used to modulate the information that the input gate will write onto the internal state cell by adding non-linearity to the information and making the information zero-mean. This is done to reduce the learning time, as zero-mean input has faster convergence. Although this gate's actions are less important than the others and are often treated as a finesse-providing concept, it is good practice to include this gate in the structure of the LSTM unit.
4. Output Gate (o): At the output gate, the input and previous state are gated as before to generate another scaling fraction that is combined with the output of the tanh block that brings in the current state. This output is then given out. The output and state are fed back into the LSTM block.
The basic workflow of a Long Short Term Memory Network is similar to the workflow of a
Recurrent Neural Network with the only difference being that the Internal Cell State is also
passed forward along with the Hidden State.
Working of an LSTM recurrent unit:
1. Take as input the current input, the previous hidden state, and the previous internal cell state.
2. Calculate the values of the four different gates by following the below steps:
For each gate, calculate the parameterized vector for the current input and the previous hidden state by multiplying them with the respective weights for that gate.
Apply the respective activation function for each gate element-wise on the parameterized vectors. Below is the list of the gates with the activation function to be applied for each gate.
3. Calculate the current internal cell state by first computing the element-wise multiplication of the input gate and the input modulation gate, then computing the element-wise multiplication of the forget gate and the previous internal cell state, and then adding the two vectors.
4. Calculate the current hidden state by first taking the element-wise hyperbolic tangent of the current internal cell state vector and then performing element-wise multiplication with the output gate.
The above-stated working is illustrated below:
Note that the blue circles denote element-wise multiplication. The weight matrix W contains different weights for the current input vector and the previous hidden state for each gate.
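Putting the four gates together, a single LSTM step might be sketched in NumPy as follows (the stacked weight matrix W, the gate ordering, and the toy sizes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: forget, input, input-modulation and output gates.

    W stacks the weights of all four gates; each gate sees the concatenation of the
    current input x and the previous hidden state h_prev.
    """
    z = W @ np.concatenate([x, h_prev]) + b    # parameterized vectors for all four gates
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])                        # forget gate
    i = sigmoid(z[H:2 * H])                    # input gate
    g = np.tanh(z[2 * H:3 * H])                # input modulation gate
    o = sigmoid(z[3 * H:4 * H])                # output gate
    c = f * c_prev + i * g                     # new internal cell state
    h = o * np.tanh(c)                         # new hidden state
    return h, c

# Toy usage: input size 3, hidden size 5
rng = np.random.default_rng(0)
x, h0, c0 = rng.normal(size=3), np.zeros(5), np.zeros(5)
W, b = rng.normal(size=(4 * 5, 3 + 5)), np.zeros(4 * 5)
h1, c1 = lstm_step(x, h0, c0, W, b)
print(h1.shape, c1.shape)                      # (5,) (5,)
```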
LSTMs work in a 3-step process.
Step 1: Decide How Much Past Data It Should Remember
The first step in the LSTM is to decide which information should be omitted from the cell in that particular time step. The sigmoid function determines this. It looks at the previous state h(t-1) along with the current input x(t) and computes the function.
Consider the following two sentences:
1. Let the output of h(t-1) be “Alice is good in Physics. John, on the other hand, is good at Chemistry.”
2. Let the current input at x(t) be “John plays football well. He told me yesterday over the phone that he had served as the captain of his college football team.”
Step 2: Decide How Much This Unit Adds to the Current State
In the second layer, there are two parts. One is the sigmoid function, and the other is the tanh function. The sigmoid function decides which values to let through (0 or 1). The tanh function gives weightage to the values which are passed, deciding their level of importance (-1 to 1).
With the current input at x(t), the input gate analyses the important information: John plays football, and the fact that he was the captain of his college team is important.
“He told me yesterday over the phone” is less important; hence it is forgotten. This process of adding some new information can be done via the input gate.
Step 3: Decide What Part of the Current Cell State Makes It to the Output
The third step is to decide what the output will be. First, we run a sigmoid layer, which decides what parts of the cell state make it to the output. Then, we put the cell state through tanh to push the values to be between -1 and 1 and multiply it by the output of the sigmoid gate.
O_t (output gate): allows the passed-in information to impact the output in the current time step.
Let's consider this example to predict the next word in the sentence: “John played tremendously well against the opponent and won for his team. For his contributions, brave ____ was awarded player of the match.” There could be many choices for the empty space. The current input brave is an adjective, and adjectives describe a noun. So, “John” could be the best output after brave.
LSTM Applications
LSTM networks find useful applications in the following areas:
Language modelling
Machine translation
Handwriting recognition
Image captioning
Image generation using attention models
Question answering
Video-to-text conversion
Polyphonic music modelling
Speech synthesis
Protein secondary structure prediction
Skip Connections
Skip connections are a type of shortcut that connects the output of one layer to the input of another layer that is not adjacent to it. For example, in a CNN with four layers, A, B, C, and D, a skip connection could connect layer A to layer C, or layer B to layer D, or both.
As previously explained, using the chain rule, we must keep multiplying terms with the error gradient as we go backwards. However, in this long chain of multiplication, if we multiply many things together that are less than one, then the resulting gradient will be very small. Thus, the gradient becomes very small as we approach the earlier layers in a deep architecture. In some cases, the gradient becomes zero, meaning that we do not update the early layers at all.
There are two fundamental ways that skip connections are used:
a) Addition, as in residual architectures,
b) Concatenation, as in densely connected architectures.
We will first describe addition, which is commonly referred to as residual skip connections.
Skip connections via addition
Mathematically, we can represent the residual block as H(x) = F(x) + x, and calculate its partial derivative (gradient) given the loss function L as dL/dx = dL/dH * (dF/dx + I), where the identity term gives the gradient a direct path back to earlier layers.
Apart from the vanishing gradients, there is another reason that we commonly use them. For a plethora of tasks (such as semantic segmentation, optical flow estimation, etc.) there is some information that was captured in the initial layers, and we would like to allow the later layers to also learn from it. It has been observed that in earlier layers the learned features correspond to lower-level semantic information extracted from the input. If we had not used the skip connection, that information would have become too abstract.
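A minimal residual block sketch in PyTorch, assuming two 3x3 convolutions for F(x) and matching channel counts so the addition is valid:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the identity path gives gradients a direct route to earlier layers."""
    def __init__(self, channels=16):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(self.f(x) + x)   # addition requires matching dimensions

x = torch.randn(1, 16, 8, 8)
print(ResidualBlock()(x).shape)            # torch.Size([1, 16, 8, 8])
```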
Skip connections via concatenation
As stated, for many dense prediction problems, there is low-level information shared between the input and output, and it would be desirable to pass this information directly across the net. The alternative way that we can achieve skip connections is by concatenation of previous feature maps. The most famous such deep learning architecture is DenseNet. Below we can see an example of feature reusability by concatenation with 5 convolutional layers:
This architecture heavily uses feature concatenation so as to ensure maximum information flow between layers in the network. This is achieved by connecting all layers directly with each other via concatenation, as opposed to ResNets. Practically, what we basically do is concatenate along the feature channel dimension (a minimal sketch follows the list below). This leads to:
a) An enormous number of feature channels in the last layers of the network,
b) More compact models, and
c) Extreme feature reusability.
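A minimal DenseNet-style sketch in PyTorch (illustrative channel counts and layer count), where every layer receives the concatenation of all earlier feature maps:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all previous feature maps (DenseNet-style)."""
    def __init__(self, in_channels=8, growth=8, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_channels + i * growth, growth, kernel_size=3, padding=1)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for conv in self.layers:
            out = torch.relu(conv(torch.cat(features, dim=1)))  # concatenate along channels
            features.append(out)                                # reuse for all later layers
        return torch.cat(features, dim=1)

x = torch.randn(1, 8, 8, 8)
print(DenseBlock()(x).shape)    # torch.Size([1, 32, 8, 8]): channels grow with each layer
```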
Short and long skip connections in Deep Learning
In more practical terms, we have to be careful when introducing additive skip connections in our deep learning model. The dimensionality has to be the same for addition, and also for concatenation apart from the chosen channel dimension. That is the reason why we see that additive skip connections are used in two kinds of setups:
a) Short skip connections
b) Long skip connections
Short skip connections are used along with consecutive convolutional layers that do not change the input dimension (see ResNet), while long skip connections usually exist in encoder-decoder architectures. It is known that global information (the shape of the image and other statistics) resolves what, while local information resolves where (small details in an image patch).
Long skip connections often exist in architectures that are symmetrical, where the spatial dimensionality is reduced in the encoder part and is gradually increased in the decoder part, as illustrated below. In the decoder part, one can increase the dimensionality of a feature map via transposed convolutional layers. The transposed convolution operation forms the same connectivity as the normal convolution but in the backward direction.
Benefits of skip connections
Skip connections can provide several benefits for CNNs, such as improving accuracy and
generalization, solving the vanishing gradient problem, and enabling deeper networks. Skip
connections can help the network to learn more complex and diverse patterns from the data and
reduce the number of parameters and operations needed by the network. Additionally, skip
connections can help to alleviate the problem of vanishing gradients by providing alternative
paths for the gradients to flow. Furthermore, they can make it easier and faster to train deeper
networks, which have more expressive power and can capture more features from the data.
Drawbacks of skip connections
Skip connections are a popular and powerful technique for improving the performance and
efficiency of CNNs, but they are not a panacea. They can help preserve information and
gradients, combine features, solve the vanishing gradient problem, and enable deeper networks.
However, they can also increase complexity and memory requirements, introduce redundancy
and noise, and require careful design and tuning to match the network architecture and data
domain. Different types and locations of skip connections can have different impacts on the
network performance, with some being more beneficial or harmful than others. Thus, it is
essential to understand how skip connections work and how to use them wisely and effectively
for CNNs.
Dropouts
Dropout refers to units (neurons) that are intentionally dropped from a neural network during training to reduce over-fitting and improve generalization. A neural network is software attempting to emulate the actions of the human brain.
Neural networks are the building blocks of any machine-learning architecture. They consist of
one input layer, one or more hidden layers, and an output layer.
When we train our neural network (or model) by updating each of its weights, it might become too dependent on the dataset we are using. Then, when the model has to make a prediction or classification, it will not give satisfactory results. This is known as over-fitting. We might understand this problem through a real-world example: if a student of mathematics learns only one chapter of a book and then takes a test on the whole syllabus, he will probably fail.
To overcome this problem, we use a technique that was introduced by Geoffrey Hinton in 2012. This technique is known as dropout.
The basic idea of this method is to, based on a probability, temporarily “drop out” neurons from our original network. Doing this for every training example gives us different models for each one. Afterwards, when we want to test our model, we take the average of each model to get our answer/prediction.
Dropout during training
We assign ‘p’ to represent the probability of a neuron in the hidden layer being excluded from the network; this probability value is usually equal to 0.5. We do the same for the input layer, whose probability value is usually lower than 0.5 (e.g. 0.2). Remember, we delete the connections going into, and out of, the neuron when we drop it.
Dropout during testing
An output given by a model trained using the dropout technique is a bit different: we could take a sample of many dropped-out models and compute the geometric mean of their output neurons by multiplying the numbers together and taking the n-th root of the product. However, since this is computationally expensive, we use the original model instead and simply cut all of the hidden units' weights in half. This gives us a good approximation of the average of the different dropped-out models.
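A minimal NumPy sketch of the scheme described above, assuming a drop probability of 0.5 for a hidden layer: units are zeroed at random during training, and at test time the full layer is kept and scaled by the keep probability (equivalent to halving the weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop=0.5, training=True):
    """Standard (non-inverted) dropout, as described above.

    During training each unit is zeroed with probability p_drop; at test time the full
    layer is kept and the activations are scaled by the keep probability instead.
    """
    if training:
        mask = rng.random(h.shape) >= p_drop   # 1 = keep the unit, 0 = drop it
        return h * mask
    return h * (1.0 - p_drop)                  # approximates the average of the thinned models

h = np.ones(10)
print(dropout_forward(h, training=True))       # roughly half of the activations are zeroed
print(dropout_forward(h, training=False))      # every activation scaled by 0.5
```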
RNN DESIGN PATTERNS