Recurrent Neural Networks
What is a Recurrent Neural Network (RNN)?
A recurrent neural network is a type of neural network in which the output from
the previous step is fed as input to the current step
In traditional neural networks, all the inputs and outputs are independent of
each other, but this is not a good idea if we want to predict the next word in a
sentence
We need to remember the previous words in order to generate the next word in a
sentence; hence traditional neural networks are not efficient for NLP
applications
RNNs also have a hidden state which is used to capture information about a
sentence
RNNs have a ‘memory’, which is used to capture information about the
calculations made so far
In theory, RNNs can use information from arbitrarily long sequences, but in
practice they are limited to looking back only a few steps
Diagrammatic Representation
Unfolding means writing out the network for the complete sequence; for example, if a sequence
has 4 words then the network will be unfolded into a 4-layer neural network
We can think of st as the memory of the network, as it captures information about what
happened in all the previous steps
A traditional neural network uses different parameters at each layer, while an RNN shares the
same parameters across all the layers; in the diagram we can see that the same parameters
(U, V, W) are used across all the layers
Using the same parameters across all layers shows that we are performing the same task with
different inputs, thus reducing the total number of parameters to learn
The three sets of parameters (U, V, and W) are used to apply linear transformations over their
respective inputs
Parameter U transforms the input xt to the state st
Parameter W transforms the previous state st-1 to the current state st
And parameter V maps the computed internal state st to the output Ot
Formula to calculate the current state:
ht = f(ht-1, xt)
Here, ht is the current state, ht-1 is the previous state, and xt is the current input
After applying the activation function (tanh), the equation becomes:
ht = tanh(Whh·ht-1 + Wxh·xt)
Here, Whh is the weight at the recurrent neuron and Wxh is the weight at the input neuron
After calculating the current state, we can then produce the output
The output state can be calculated as:
Ot = Why·ht
Here, Ot is the output state, Why is the weight at the output layer, and ht is the current state
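To make the forward pass concrete, here is a minimal NumPy sketch of the recurrence above. The sizes, the random weights, and the toy 4-step input sequence are illustrative assumptions, not values from these notes; U corresponds to Wxh, W to Whh, and V to Why.

    import numpy as np

    rng = np.random.default_rng(0)
    input_size, hidden_size, output_size = 3, 5, 2   # illustrative sizes (assumptions)

    W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # U: input -> state
    W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # W: previous state -> state
    W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # V: state -> output

    def rnn_forward(xs):
        """h_t = tanh(Whh.h_{t-1} + Wxh.x_t) and O_t = Why.h_t at each step."""
        h = np.zeros(hidden_size)
        outputs = []
        for x in xs:
            h = np.tanh(W_hh @ h + W_xh @ x)   # current state from previous state and input
            outputs.append(W_hy @ h)           # output at this time step
        return outputs, h

    # Example: a sequence of 4 input vectors (e.g. 4 word embeddings)
    sequence = [rng.normal(size=input_size) for _ in range(4)]
    outputs, final_state = rnn_forward(sequence)

Note that the same three weight matrices are reused at every time step, which is exactly the parameter sharing described above.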
Backward propagation in RNN
Backward phase:
To train an RNN, we need a loss function. We will make use of cross-entropy loss which is
often paired with softmax, which can be calculated as:
L = -ln(pc)
Here, pc is the RNN’s predicted probability for the correct class (positive or negative). For
example, if a positive text is predicted to be 95% positive by the RNN, then the loss is:
L= -ln(0.95) = 0.051
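As a quick check of the worked example, the loss can be computed directly; this is just a minimal sketch using the 0.95 probability quoted above.

    import numpy as np

    def cross_entropy(p_correct):
        """L = -ln(p_c), where p_c is the predicted probability of the correct class."""
        return -np.log(p_correct)

    print(cross_entropy(0.95))   # about 0.051, matching the example above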
After calculating loss, we will train the RNN using gradient descent to minimize loss
Steps for Back Propagation
We first compute the cross-entropy error using the predicted and actual output. The
network is unfolded for each time step
Then, for each time step in the network, the gradient is calculated with respect
to the weights of each parameter
Since the weights are the same for every time step, the gradients can be combined
across all the time steps
Then we update the weights for both the recurrent neurons as well as the dense
layers, as shown in the sketch below
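Here is a minimal, illustrative sketch of these steps in NumPy, assuming a sequence that is classified only at its final time step with a softmax output; the sizes, the learning rate, and the random data are assumptions made for the sketch, not part of the notes.

    import numpy as np

    rng = np.random.default_rng(0)
    input_size, hidden_size, output_size = 3, 5, 2   # illustrative sizes (assumptions)

    W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
    W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))

    def train_step(xs, target, lr=0.1):
        # Forward pass: unfold the network over the whole sequence
        hs = [np.zeros(hidden_size)]
        for x in xs:
            hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x))
        logits = W_hy @ hs[-1]
        probs = np.exp(logits - logits.max()); probs /= probs.sum()   # softmax
        loss = -np.log(probs[target])                                 # cross-entropy error

        # Backward pass: accumulate gradients over every time step (BPTT)
        d_logits = probs.copy(); d_logits[target] -= 1
        dW_hy = np.outer(d_logits, hs[-1])
        dW_xh, dW_hh = np.zeros_like(W_xh), np.zeros_like(W_hh)
        dh = W_hy.T @ d_logits
        for t in reversed(range(len(xs))):
            dz = (1 - hs[t + 1] ** 2) * dh     # back through tanh
            dW_xh += np.outer(dz, xs[t])       # the same W_xh is shared by every step
            dW_hh += np.outer(dz, hs[t])       # the same W_hh is shared by every step
            dh = W_hh.T @ dz                   # pass the gradient to the previous step

        # Gradient-descent update on the shared weights
        for W, dW in ((W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy)):
            W -= lr * dW
        return loss

    xs = [rng.normal(size=input_size) for _ in range(4)]
    print(train_step(xs, target=1))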
Vanishing and Exploding Gradient Problem
Defining the problem
During the training of a deep network, the gradients are propagated back in time all
the way to the initial layer
Gradients that come from deeper layers go through multiple matrix multiplications
according to the chain rule, and when they approach the earlier layers, if they have
small values (<1) they shrink exponentially until they vanish
Vanishing gradients make model learning difficult
While if they have large values (>1), then they eventually blow up and crash the
model, this is the exploding gradient problem
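A tiny numeric sketch makes the problem visible: repeatedly multiplying a gradient by the recurrent weight matrix, as the chain rule does during backpropagation through time, either shrinks it toward zero or blows it up depending on how large the weights are. The matrices and scales below are arbitrary assumptions chosen only to show the trend.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden_size, steps = 5, 50

    W_small = rng.normal(scale=0.3, size=(hidden_size, hidden_size))  # "small" weights
    W_large = rng.normal(scale=1.5, size=(hidden_size, hidden_size))  # "large" weights

    g_small = np.ones(hidden_size)
    g_large = np.ones(hidden_size)
    for _ in range(steps):
        g_small = W_small.T @ g_small   # shrinks step after step -> vanishing gradient
        g_large = W_large.T @ g_large   # grows step after step   -> exploding gradient

    print(np.linalg.norm(g_small))   # a tiny value, close to zero
    print(np.linalg.norm(g_large))   # an enormous value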
Types of RNN Architectures
The common architectures which are used for sequence learning are:
One to one
One to many
Many to one
Many to many
One to one: This model is similar to a single-layer neural network, as it maps a single
fixed-size input 'x' to a single fixed-size output 'y' (e.g., image
classification)
One to many
This consists of a single input 'x', an activation 'a', and multiple outputs 'y'
Example: generating an audio stream. It takes a single audio stream as input and
generates new tones or new music based on that stream
In some cases, it propagates the output ‘y’ to the next RNN units
Many to one
This consists of multiple inputs 'x' (such as words or sentences) and activations 'a', and
produces a single output 'y' at the end
This type of architecture is mostly used to perform sentiment analysis, as it processes
the entire input (a collection of words or sentences) to produce a single output (positive,
negative, or neutral sentiment)
Many to many
In this, a single frame is taken as input by each RNN unit. The frames represent
multiple inputs 'x' whose activations 'a' are propagated through the network to
produce outputs 'y', which are the classification results for each frame
It is used mostly in video classification, where we try to classify each frame of the video
Bi- directional RNNs
In this neural network, two hidden layers running in opposite directions are
connected to produce a single output
These layers allow the neural network to receive information from both past and
future states
For example, given the word sequence 'I like programming', the forward layer will
take the sequence as it is, while the backward layer will feed the sequence in the
reverse order, 'programming like I'
The output at each time step is calculated by combining (for example, concatenating) the
hidden states of the forward and backward layers for that step
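A minimal sketch of the idea, assuming a simple tanh RNN cell in each direction; the sizes, the random weights, and the choice to concatenate the two hidden states at each step are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    input_size, hidden_size = 3, 4   # illustrative sizes (assumptions)

    def make_cell():
        return (rng.normal(scale=0.1, size=(hidden_size, input_size)),
                rng.normal(scale=0.1, size=(hidden_size, hidden_size)))

    W_xh_f, W_hh_f = make_cell()   # forward-direction parameters
    W_xh_b, W_hh_b = make_cell()   # backward-direction parameters

    def run(xs, W_xh, W_hh):
        h, states = np.zeros(hidden_size), []
        for x in xs:
            h = np.tanh(W_hh @ h + W_xh @ x)
            states.append(h)
        return states

    sequence = [rng.normal(size=input_size) for _ in range(3)]     # e.g. "I like programming"
    forward_states = run(sequence, W_xh_f, W_hh_f)                 # reads the sequence as is
    backward_states = run(sequence[::-1], W_xh_b, W_hh_b)[::-1]    # reads it in reverse order

    # Combine both directions at each time step (here, by concatenation)
    combined = [np.concatenate([f, b]) for f, b in zip(forward_states, backward_states)]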
Note:-
RNNs remember information from every time step
The memory state, which stores information about all previous states, is useful for tasks such as
sentence generation and time-series prediction
RNNs can handle inputs and outputs of arbitrary length
RNNs share the same parameters across different time steps, which means fewer
parameters to train and a lower computation cost
RNNs cannot process very long sequences effectively when using tanh or ReLU as the
activation function
RNNs face vanishing and exploding gradient problem
RNN applications
Text summarization: Summarizing the text from any literature; for example, if a
news website wants to display a brief summary of the important news from each
news article on the website, then text summarization will be helpful
Text recommendation: Text autofill or sentence generation in data-entry work;
making use of RNNs can help automate the process and make it less time
consuming
Image recognition: RNNs can be combined with CNN in order to recognize an
image and give its description
Music generation: RNNs can be used to generate new music or tunes, by feeding a
single tune as an input we can generate new notes or tunes of music.
Types of RNN networks:
Feedforward networks map one input to one output, and while we’ve visualized recurrent
neural networks in this way in the above diagrams, they do not actually have this constraint.
Instead, their inputs and outputs can vary in length, and different types of RNNs are used for
different use cases, such as music generation, sentiment classification, and machine
translation.
1. Vanishing Gradient Problem
Recurrent Neural Networks enable you to model time-dependent and sequential data
problems, such as stock market prediction, machine translation, and text generation. You will
find, however, that RNNs are hard to train because of the gradient problem.
RNNs suffer from the problem of vanishing gradients. The gradients carry information used
in the RNN, and when the gradient becomes too small, the parameter updates become
insignificant. This makes the learning of long data sequences difficult.
2. Exploding Gradient Problem
While training a neural network, if the slope tends to grow exponentially instead of decaying,
this is called an Exploding Gradient. This problem arises when large error gradients
accumulate, resulting in very large updates to the neural network model weights during the
training process.
Long training time, poor performance, and bad accuracy are the major issues in gradient
problems. Now, let’s discuss the most popular and efficient way to deal with gradient
problems, i.e., Long Short-Term Memory Network (LSTMs).
The word you predict will depend on the previous few words in context. Consider, for example,
predicting the last word in the text "I grew up in Spain… I speak fluent ____." Here, you need the
context of Spain to predict the last word, and the most suitable answer to this
sentence is "Spanish." The gap between the relevant information and the point where it's
needed may have become very large. LSTMs help you solve this problem.
Common Activation Functions
Recurrent Neural Networks (RNNs) use activation functions just like other neural networks
to introduce non-linearity to their models. Here are some common activation functions used
in RNNs:
Sigmoid Function:
The sigmoid function is commonly used in RNNs. It has a range between 0 and 1, which
makes it useful for binary classification tasks. The formula for the sigmoid function is:
σ(x) = 1 / (1 + e^(-x))
Hyperbolic Tangent (Tanh) Function:
The tanh function is also commonly used in RNNs. It has a range between -1 and 1, which
makes it useful for non-linear classification tasks. The formula for the tanh function is:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Rectified Linear Unit (ReLU) Function:
The ReLU function is a non-linear activation function that is widely used in deep neural
networks. It has a range between 0 and infinity, which makes it useful for models that require
positive outputs. The formula for the ReLU function is:
ReLU(x) = max(0, x)
Leaky ReLU Function:
The Leaky ReLU function is similar to the ReLU function, but it introduces a small slope to
negative values, which helps to prevent "dead neurons" in the model. The formula for the
Leaky ReLU function is:
Leaky ReLU(x) = max(0.01x, x)
Softmax Function:
The softmax function is often used in the output layer of RNNs for multi-class classification
tasks. It converts the network output into a probability distribution over the possible classes.
The formula for the softmax function is:
softmax(xi) = e^(xi) / ∑j e^(xj)
These are just a few examples of the activation functions used in RNNs. The choice of
activation function depends on the specific task and the model's architecture
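For reference, here is a small NumPy sketch of the activation functions listed above; the 0.01 slope in Leaky ReLU follows the formula given in the text, and the example input is arbitrary.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))       # range (0, 1)

    def tanh(x):
        return np.tanh(x)                      # range (-1, 1)

    def relu(x):
        return np.maximum(0.0, x)              # range [0, infinity)

    def leaky_relu(x, slope=0.01):
        return np.maximum(slope * x, x)        # small slope for negative inputs

    def softmax(x):
        e = np.exp(x - np.max(x))              # shifted for numerical stability
        return e / e.sum()                     # a probability distribution over classes

    print(softmax(np.array([1.0, 2.0, 3.0])))  # the outputs sum to 1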
Variant RNN Architectures
There are several variant RNN architectures that have been developed over the years to
address the limitations of the standard RNN architecture. Here are a few examples:
Long Short-Term Memory (LSTM) Networks
LSTM is a type of RNN that is designed to handle the vanishing gradient problem that can
occur in standard RNNs. It does this by introducing three gating mechanisms that control the
flow of information through the network: the input gate, the forget gate, and the output gate.
These gates allow the LSTM network to selectively remember or forget information from the
input sequence, which makes it more effective for long-term dependencies.
Gated Recurrent Unit (GRU) Networks
GRU is another type of RNN that is designed to address the vanishing gradient problem. It
has two gates: the reset gate and the update gate. The reset gate determines how much of the
previous state should be forgotten, while the update gate determines how much of the new
state should be remembered. This allows the GRU network to selectively update its internal
state based on the input sequence.
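A minimal sketch of one GRU step, assuming the common formulation with a reset gate r, an update gate z, and a candidate state; the weight shapes and random initialization are illustrative assumptions.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    input_size, hidden_size = 3, 4   # illustrative sizes (assumptions)

    def weights():
        return (rng.normal(scale=0.1, size=(hidden_size, input_size)),
                rng.normal(scale=0.1, size=(hidden_size, hidden_size)))

    W_z, U_z = weights()   # update gate parameters
    W_r, U_r = weights()   # reset gate parameters
    W_h, U_h = weights()   # candidate state parameters

    def gru_step(x, h_prev):
        z = sigmoid(W_z @ x + U_z @ h_prev)             # update gate: how much new state to keep
        r = sigmoid(W_r @ x + U_r @ h_prev)             # reset gate: how much old state to forget
        h_cand = np.tanh(W_h @ x + U_h @ (r * h_prev))  # candidate built from the reset state
        return (1 - z) * h_prev + z * h_cand            # blend old state and candidate

    h = np.zeros(hidden_size)
    for x in [rng.normal(size=input_size) for _ in range(5)]:
        h = gru_step(x, h)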
Bidirectional RNNs:
Bidirectional RNNs are designed to process input sequences in both forward and backward
directions. This allows the network to capture both past and future context, which can be
useful for speech recognition and natural language processing tasks.
Encoder-Decoder RNNs:
Encoder-decoder RNNs consist of two RNNs: an encoder network that processes the input
sequence and produces a fixed-length vector representation of the input and a decoder
network that generates the output sequence based on the encoder's representation. This
architecture is commonly used for sequence-to-sequence tasks such as machine translation.
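A very compact sketch of the encoder-decoder idea: one RNN compresses the input sequence into its final hidden state, and a second RNN is unrolled from that summary to produce the output sequence. The names, sizes, and the fixed output length are assumptions made purely for illustration, not a description of any particular library.

    import numpy as np

    rng = np.random.default_rng(0)
    in_size, hid_size, out_size = 3, 4, 6   # illustrative sizes (assumptions)

    enc_Wx = rng.normal(scale=0.1, size=(hid_size, in_size))
    enc_Wh = rng.normal(scale=0.1, size=(hid_size, hid_size))
    dec_Wh = rng.normal(scale=0.1, size=(hid_size, hid_size))
    dec_Wy = rng.normal(scale=0.1, size=(out_size, hid_size))

    def encode(xs):
        h = np.zeros(hid_size)
        for x in xs:                      # read the whole input sequence...
            h = np.tanh(enc_Wh @ h + enc_Wx @ x)
        return h                          # ...and summarize it in a fixed-length vector

    def decode(h, steps):
        outputs = []
        for _ in range(steps):            # unroll the decoder from the encoder's summary
            h = np.tanh(dec_Wh @ h)
            outputs.append(dec_Wy @ h)    # one output (as logits) per step
        return outputs

    source = [rng.normal(size=in_size) for _ in range(5)]
    translation_logits = decode(encode(source), steps=4)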
Attention Mechanisms
Attention mechanisms are a technique that can be used to improve the performance of RNNs
on tasks that involve long input sequences. They work by allowing the network to attend to
different parts of the input sequence selectively rather than treating all parts of the input
sequence equally. This can help the network focus on the input sequence's most relevant parts
and ignore irrelevant information.
These are just a few examples of the many variant RNN architectures that have been
developed over the years. The choice of architecture depends on the specific task and the
characteristics of the input and output sequences.
Long Short-Term Memory Networks
LSTMs are a special kind of RNN, capable of learning long-term dependencies; remembering
information for long periods is their default behavior.
All RNNs have the form of a chain of repeating modules of a neural network. In standard
RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
Fig: Long Short Term Memory Networks
LSTMs also have a chain-like structure, but the repeating module has a different structure.
Instead of having a single neural network layer, there are four layers that interact in a
special way.
Advantages and Shortcomings of RNNs
RNNs have various advantages, such as:
Ability to handle sequence data
Ability to handle inputs of varying lengths
Ability to store or “memorize” historical information
The disadvantages are:
The computation can be very slow.
The network does not take into account future inputs to make decisions.
Vanishing gradient problem, where the gradients used to compute the weight update
may get very close to zero, preventing the network from learning new weights. The
deeper the network, the more pronounced this problem is.
Different RNN Architectures
There are different variations of RNNs that are being applied practically in machine learning
problems:
Bidirectional Recurrent Neural Networks (BRNN)
In BRNN, inputs from future time steps are used to improve the accuracy of the network. It is
like knowing the first and last words of a sentence to predict the middle words.
Gated Recurrent Units (GRU)
These networks are designed to handle the vanishing gradient problem. They have a reset and
update gate. These gates determine which information is to be retained for future predictions.
Long Short Term Memory (LSTM)
LSTMs were also designed to address the vanishing gradient problem in RNNs. LSTMs use
three gates called input, output, and forget gate. Similar to GRU, these gates determine which
information to retain.
Key Differences Between CNN and RNN
CNN is applicable for sparse data like images. RNN is applicable for time series and
sequential data.
While training the model, CNN uses a simple backpropagation and RNN uses
backpropagation through time to calculate the loss.
RNNs have no restriction on the length of inputs and outputs, while CNNs have
fixed-size inputs and outputs.
CNNs are feedforward networks, while RNNs use loops to handle sequential data.
CNN can also be used for video and image processing. RNN is primarily used for
speech and text analysis.
Working of an RNN network:
A recurrent neural network (RNN) is the type of artificial neural network (ANN) that is
used in Apple’s Siri and Google’s voice search. RNN remembers past inputs due to an
internal memory which is useful for predicting stock prices, generating text, transcriptions,
and machine translation.
In a traditional neural network, the inputs and the outputs are independent of each other,
whereas the output of an RNN depends on the prior elements within the sequence. Recurrent
networks also share parameters across each layer of the network. Feedforward networks have
different weights at each node, whereas an RNN shares the same weights within
each layer of the network; during gradient descent, the weights and biases are adjusted
to reduce the loss.
Fig: RNN
The image above is a simple representation of recurrent neural networks. If we are
forecasting stock prices using simple data [45,56,45,49,50,…], each input from X0 to Xt will
contain a past value. For example, X0 will have 45, X1 will have 56, and these values are
used to predict the next number in a sequence.
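As a tiny illustration of how such a series is turned into supervised examples, a sliding window over the values quoted above could be used; the window length of 3 is an arbitrary assumption.

    # The first five values quoted above; a window of 3 past values predicts the next one (assumption)
    series = [45, 56, 45, 49, 50]
    window = 3
    pairs = [(series[i:i + window], series[i + window]) for i in range(len(series) - window)]
    # pairs == [([45, 56, 45], 49), ([56, 45, 49], 50)]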
How Recurrent Neural Networks Work
In RNN, the information cycles through the loop, so the output is determined by the current
input and previously received inputs.
The input layer X processes the initial input and passes it to the middle layer A. The middle
layer consists of multiple hidden layers, each with its own activation functions, weights, and
biases. These parameters are shared across time steps, so instead of creating
multiple hidden layers, the network creates one layer and loops over it.
Instead of using traditional backpropagation, recurrent neural networks
use backpropagation through time (BPTT) algorithms to determine the gradient. In
backpropagation, the model adjusts the parameter by calculating errors from the output to the
input layer. BPTT sums the error at each time step, as the RNN shares parameters across each
layer.
Recurrent Neural Networks
Humans don’t start their thinking from scratch every second. As you read this essay, you
understand each word based on your understanding of previous words. You don’t throw
everything away and start thinking from scratch again. Your thoughts have persistence.
Traditional neural networks can’t do this, and it seems like a major shortcoming. For example,
imagine you want to classify what kind of event is happening at every point in a movie. It’s
unclear how a traditional neural network could use its reasoning about previous events in the
film to inform later ones.
Recurrent neural networks address this issue. They are networks with loops in them, allowing
information to persist.
Recurrent Neural Networks have loops.
In the above diagram, a chunk of neural network, A, looks at some input xt and outputs
a value ht. A loop allows information to be passed from one step of the network to the next.
These loops make recurrent neural networks seem kind of mysterious. However, if you think a
bit more, it turns out that they aren’t all that different than a normal neural network. A recurrent
neural network can be thought of as multiple copies of the same network, each passing a
message to a successor. Consider what happens if we unroll the loop:
An unrolled recurrent neural network.
This chain-like nature reveals that recurrent neural networks are intimately related to sequences
and lists. They're the natural neural network architecture to use for such data.
And they certainly are used! In the last few years, there have been incredible success applying
RNNs to a variety of problems: speech recognition, language modeling, translation, image
captioning… The list goes on. I’ll leave discussion of the amazing feats one can achieve with
RNNs to Andrej Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent
Neural Networks. But they really are pretty amazing.
Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural
network which works, for many tasks, much much better than the standard version. Almost all
exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs
that this essay will explore.
The Problem of Long-Term Dependencies
One of the appeals of RNNs is the idea that they might be able to connect previous information
to the present task; for example, previous video frames might inform the understanding of the
present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.
Sometimes, we only need to look at recent information to perform the present task. For example,
consider a language model trying to predict the next word based on the previous ones. If we are
trying to predict the last word in “the clouds are in the sky,” we don’t need any further context –
it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the
relevant information and the place that it’s needed is small, RNNs can learn to use the past
information.
But there are also cases where we need more context. Consider trying to predict the last word in
the text “I grew up in France… I speak fluent French.” Recent information suggests that the
next word is probably the name of a language, but if we want to narrow down which language,
we need the context of France, from further back. It’s entirely possible for the gap between the
relevant information and the point where it is needed to become very large.
Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.
In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human
could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice,
RNNs don’t seem to be able to learn them.
LSTM Networks
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN,
capable of learning long-term dependencies. They were introduced by Hochreiter &
Schmidhuber (1997), and were refined and popularized by many people in following
work. They work tremendously well on a large variety of problems, and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering
information for long periods of time is practically their default behavior, not something they
struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural network.
In standard RNNs, this repeating module will have a very simple structure, such as a single tanh
layer.
The repeating module in a standard RNN contains a single layer.
LSTMs also have this chain like structure, but the repeating module has a different structure.
Instead of having a single neural network layer, there are four, interacting in a very special way.
The repeating module in an LSTM contains four interacting layers.
Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step
by step later. For now, let’s just try to get comfortable with the notation we’ll be using.
In the above diagram, each line carries an entire vector, from the output of one node to the inputs
of others. The pink circles represent pointwise operations, like vector addition, while the yellow
boxes are learned neural network layers. Lines merging denote concatenation, while a line
forking denotes its content being copied and the copies going to different locations.
The Core Idea Behind LSTMs
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only
some minor linear interactions. It’s very easy for information to just flow along it unchanged.
The LSTM does have the ability to remove or add information to the cell state, carefully
regulated by structures called gates.
Gates are a way to optionally let information through. They are composed out of a sigmoid
neural net layer and a pointwise multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how much of each
component should be let through. A value of zero means “let nothing through,” while a value of
one means “let everything through!”
An LSTM has three of these gates, to protect and control the cell state.
Step-by-Step LSTM Walk Through
The first step in our LSTM is to decide what information we’re going to throw away from the
cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks
at ht−1 and xt, and outputs a number between 0 and 1 for each number in the cell
state Ct−1. A 1 represents "completely keep this" while a 0 represents "completely
get rid of this."
Let’s go back to our example of a language model trying to predict the next word based on all
the previous ones. In such a problem, the cell state might include the gender of the present
subject, so that the correct pronouns can be used. When we see a new subject, we want to forget
the gender of the old subject.
The next step is to decide what new information we’re going to store in the cell state. This has
two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update.
Next, a tanh layer creates a vector of new candidate values, C̃t, that could be added to
the state. In the next step, we'll combine these two to create an update to the state.
In the example of our language model, we’d want to add the gender of the new subject to the
cell state, to replace the old one we’re forgetting.
It's now time to update the old cell state, Ct−1, into the new cell state Ct. The
previous steps already decided what to do; we just need to actually do it.
We multiply the old state by ft, forgetting the things we decided to forget earlier. Then we
add it ∗ C̃t. These are the new candidate values, scaled by how much we decided to
update each state value.
In the case of the language model, this is where we’d actually drop the information about the old
subject’s gender and add the new information, as we decided in the previous steps.
Finally, we need to decide what we’re going to output. This output will be based on our cell
state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the
cell state we're going to output. Then, we put the cell state through tanh (to push the values
to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only
output the parts we decided to.
For the language model example, since it just saw a subject, it might want to output information
relevant to a verb, in case that’s what is coming next. For example, it might output whether the
subject is singular or plural, so that we know what form a verb should be conjugated into if
that’s what follows next.
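The walkthrough above maps directly onto a few lines of code. Below is a minimal sketch of a single LSTM step following the standard equations (forget gate ft, input gate it, candidate C̃t, output gate ot); the weight shapes and random values are illustrative assumptions.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    input_size, hidden_size = 3, 4           # illustrative sizes (assumptions)
    concat = hidden_size + input_size        # the gates look at [h_{t-1}, x_t]

    W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(hidden_size, concat)) for _ in range(4))
    b_f = b_i = b_c = b_o = np.zeros(hidden_size)

    def lstm_step(x, h_prev, C_prev):
        z = np.concatenate([h_prev, x])
        f = sigmoid(W_f @ z + b_f)           # forget gate: what to throw away from the cell state
        i = sigmoid(W_i @ z + b_i)           # input gate: which values to update
        C_cand = np.tanh(W_c @ z + b_c)      # candidate values that could be added to the state
        C = f * C_prev + i * C_cand          # update the cell state
        o = sigmoid(W_o @ z + b_o)           # output gate: which parts of the state to expose
        h = o * np.tanh(C)                   # filtered version of the cell state
        return h, C

    h, C = np.zeros(hidden_size), np.zeros(hidden_size)
    for x in [rng.normal(size=input_size) for _ in range(5)]:
        h, C = lstm_step(x, h, C)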
Variants on Long Short Term Memory
What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the
above. In fact, it seems like almost every paper involving LSTMs uses a slightly different
version. The differences are minor, but it’s worth mentioning some of them.
One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole
connections.” This means that we let the gate layers look at the cell state.
Bidirectional recurrent neural networks
Bidirectional recurrent neural networks are a combination of two recurrent neural networks
that train in unison. One network trains from the start to the end of a sequence while the other
works in the opposite direction.
The bidirectional method that this type of recurrent neural network uses allows the model to
learn from both present and past information. Once the network has learned from this, it can
analyze future events accordingly. This feature sets it apart from other types of recurrent
neural networks. The dual nature of bidirectional recurrent neural networks is useful in
circumstances where context is required.
Long short-term memory
Long short-term memory recurrent neural networks handle long time-series data. This means
that they can recall information from earlier in a long time series.
This model has three different gates: the input gate, the output gate, and the forget gate.
These gates act as a form of control over features of the network, such as saving or removing
memory.
The input gate decides which new information moves into the cell state. The output gate, on
the other hand, regulates which information is selected from the cell state. After that decision
is made, it chooses the next hidden state for the network. Finally, the forget gate removes any
information from the cell state that is deemed irrelevant or insignificant.
Through the cell state, the network automatically controls which irrelevant information is
discarded and which relevant features are retained. The vanishing gradient problem found in some
networks can be mitigated through the use of long short-term memory networks.