Chameli Devi Group of Institutions, Indore
Department of Computer Science and Engineering
Subject Notes
CS 601- Machine Learning
UNIT-IV
Syllabus: Recurrent neural network, Long short-term memory, Gated recurrent unit,
Translation, Beam search and width, Bleu score, Attention model, Reinforcement Learning, RL-
Framework, MDP, Bellman equations, Value Iteration and Policy Iteration, Actor-critic model, Q-
learning, SARSA
Course Outcome: Students will be able to apply Recurrent Neural Network principles to model
building for related real-world problems.
Recurrent neural network
A Recurrent Neural Network (RNN) is a type of neural network in which the output from the
previous step is fed as input to the current step. In traditional neural networks, all the inputs and
outputs are independent of each other, but in cases such as predicting the next word of a
sentence, the previous words are required, and hence there is a need to remember them. RNNs
solve this issue with the help of a hidden layer. The main and most important feature of an RNN
is the hidden state, which remembers some information about the sequence.
Figure 4.1: Recurrent neural network
RNNs have a "memory" which retains information about what has been calculated so far. An
RNN uses the same parameters at every time step, since it performs the same task on each input
(or hidden state) to produce the output. This reduces the number of parameters to learn, unlike
other neural networks.
The formula for the current state can be written as –
h_t = f(h_{t-1}, x_t)
Here, h_t is the new state, h_{t-1} is the previous state, and x_t is the current input. We now use
the state computed from the previous input rather than the previous input itself, because the
network has already applied its transformation to that input. Each successive input is called a
time step.
Taking the simplest form of a recurrent neural network, let's say that the activation function is
tanh, the weight at the recurrent neuron is W_hh and the weight at the input neuron is W_xh. We
can write the equation for the state at time t as:

h_t = tanh(W_hh h_{t-1} + W_xh x_t)
The recurrent neuron in this case takes only the immediately previous state into consideration.
For longer sequences the equation can involve multiple such states. Once the final state is
calculated, we can go on to produce the output.
Now, once the current state is calculated, we can calculate the output as:

y_t = W_hy h_t
Training through RNN
1. A single time step of the input, x_t, is supplied to the network.
2. We then calculate the current state h_t using a combination of the current input and the
previous state.
3. The current h_t becomes h_{t-1} for the next time step.
4. We can go through as many time steps as the problem demands, combining the information
from all the previous states.
5. Once all the time steps are completed, the final current state is used to calculate the output
y_t.
6. The output is then compared with the target output and the error is computed.
7. The error is then backpropagated through the network to update the weights, and thus the
network is trained (a minimal sketch of the forward pass follows).
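Below is a NumPy sketch of steps 1-5, the forward pass. The weight names follow the equations
above; the sizes and the random initialization are illustrative assumptions, not part of any
particular library.

import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    # Run a simple RNN over a sequence of input vectors xs and collect the outputs.
    h = h0
    ys = []
    for x in xs:                              # one iteration per time step
        h = np.tanh(W_hh @ h + W_xh @ x)      # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        ys.append(W_hy @ h)                   # y_t = W_hy h_t
    return ys, h

# Toy usage: 3 time steps, 4-dimensional inputs, 5 hidden units, 2 outputs.
rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(3)]
W_xh = rng.normal(size=(5, 4))
W_hh = rng.normal(size=(5, 5))
W_hy = rng.normal(size=(2, 5))
outputs, final_state = rnn_forward(xs, W_xh, W_hh, W_hy, h0=np.zeros(5))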
Long short-term memory
Recurrent Neural Networks suffer from short-term memory. If a sequence is long enough, they
have a hard time carrying information from earlier time steps to later ones. So if you are trying
to process a paragraph of text to make predictions, RNNs may leave out important information
from the beginning.
Long Short-Term Memory (LSTM) is a kind of recurrent neural network that solves the above
problem. In an RNN, the output from the last step is fed as input to the current step. LSTM was
designed by Hochreiter & Schmidhuber. It tackles the problem of long-term dependencies in
RNNs, in which an RNN cannot predict a word stored in long-term memory but can give more
accurate predictions from recent information. An LSTM can by default retain information for
long periods of time. It is used for processing, predicting and classifying on the basis of time-
series data.
Structure of LSTM:
LSTM has a chain structure that contains four neural networks and different memory blocks
called cells.
Figure 4.2: LSTM Network
Information is retained by the cells and the memory manipulations are done by the gates.
There are three gates –
1. Forget gate: Information that is no longer useful in the cell state is removed by the forget
gate. Two inputs, x_t (the input at the current time step) and h_{t-1} (the previous cell output),
are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The
result is passed through a sigmoid activation function, which gives an output between 0 and 1.
If, for a particular element of the cell state, the output is close to 0, that piece of information is
forgotten; for an output close to 1, the information is retained for future use.
Figure 4.3: Forget gate
2. Input gate: The addition of useful information to the cell state is done by the input gate.
First, the information is regulated using a sigmoid function, which filters the values to be
remembered, similarly to the forget gate, using the inputs h_{t-1} and x_t. Then, a vector is
created using the tanh function, which gives an output from -1 to +1 and contains all the
candidate values from h_{t-1} and x_t. Finally, the values of the vector and the regulated
(sigmoid) values are multiplied to obtain the useful information to be added to the cell state.
Figure 4.4: Input gate
3. Output gate: The task of extracting useful information from the current cell state, to be
presented as the output, is done by the output gate. First, a vector is generated by applying the
tanh function to the cell state. Then, the information is regulated using a sigmoid function,
which filters the values to be remembered using the inputs h_{t-1} and x_t. Finally, the values
of the vector and the regulated values are multiplied and sent as the output, and also as input
to the next cell. (A code sketch of the three gate computations follows Figure 4.5.)
Figure 4.5: Output gate
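A minimal NumPy sketch of one LSTM cell step built from the three gates described above. The
per-gate parameter dictionaries W, U and b are an illustrative layout chosen for readability, not
the layout of any particular library.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])        # forget gate: what to drop from c_prev
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])        # input gate: what new info to write
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])        # output gate: what to expose as h_t
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate values in (-1, +1)
    c_t = f * c_prev + i * c_tilde                              # new cell state
    h_t = o * np.tanh(c_t)                                      # new hidden state (cell output)
    return h_t, c_t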
Gated recurrent unit
Gated Recurrent Units (GRU) are one of the popular variants of recurrent neural networks and
have been widely used in the context of machine translation. GRUs can also be regarded as a
simpler version of LSTMs (Long Short-Term Memory). The gated recurrent unit was introduced
so that each recurrent unit can adaptively capture dependencies of different time scales.
How does it work?
In a GRU, two gates are introduced: a reset gate that adjusts the incorporation of new input
with the previous memory, and an update gate that controls the preservation of the previous
memory. The reset gate and the update gate adaptively control how much each hidden unit
remembers or forgets while reading/generating a sequence.
Figure 4.6: Gated recurrent unit
In the above figure of the Gated Recurrent Unit, r and z are the reset and update gates, while h
and h˜ are the activation and the candidate activation respectively. The GRU works such that
when the reset gate is close to zero, the hidden state is forced to ignore the previous hidden
state and is reset with the current input.
This allows the hidden state to discard any information that will be irrelevant in the future,
which in turn allows a more compact representation. The update gate controls how much
information from the previous hidden state is carried over to the current hidden state. This
works in a similar manner to the memory cell in the Long Short-Term Memory network and
helps the RNN to remember long-term information.
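A minimal NumPy sketch of one GRU step, using one common gating convention; the parameter
layout and weight names are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    r = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])              # reset gate
    z = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])              # update gate
    h_tilde = np.tanh(W['h'] @ x_t + U['h'] @ (r * h_prev) + b['h'])  # candidate activation
    h_t = (1 - z) * h_prev + z * h_tilde                              # mix old state and candidate
    return h_t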
Advantages of Gated Recurrent Unit
The Gated Recurrent Unit can be used to improve the memory capacity of a recurrent neural
network and also makes the model easier to train. Its gating mechanism also helps address the
vanishing gradient problem in recurrent neural networks. GRUs can be used in various
applications, including speech signal modeling, machine translation, and handwriting
recognition, among others.
Translation
Machine translation is the task of automatically converting source text in one language to text
in another language.
In a machine translation task, the input already consists of a sequence of symbols in some
language, and the computer program must convert this into a sequence of symbols in another
language. Given a sequence of text in a source language, there is no one single best translation
of that text to another language. This is because of the natural ambiguity and flexibility of
human language. The fact is that accurate translation requires background knowledge in order
to resolve ambiguity and establish the content of the sentence.
The main approaches to machine translation are:
Statistical Machine Translation-
Statistical machine translation, or SMT for short, is the use of statistical models that learn to
translate text from a source language to a target language. The approach is data-driven,
requiring only a corpus of examples with both source and target language text. This means
linguists are no longer required to specify the rules of translation.
Neural Machine Translation-
Neural machine translation, or NMT for short, is the use of neural network models to learn a
statistical model for machine translation.
The key benefit of the approach is that a single system can be trained directly on source and
target text, no longer requiring the pipeline of specialized systems used in statistical machine
translation.
Encoder-Decoder Model
Multilayer Perceptron neural network models can be used for machine translation, although
such models are limited by a fixed-length input sequence where the output must be of the same
length.
These early models have been greatly improved upon recently through the use of recurrent
neural networks organized into an encoder-decoder architecture that allows for variable-length
input and output sequences.
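A deliberately stripped-down NumPy sketch of the encoder-decoder idea: the encoder compresses
a variable-length input into one fixed-size context vector, and the decoder unrolls from that
vector for as many steps as needed. Real NMT decoders also feed back the previously generated
token through an embedding and a softmax; those parts are omitted here, and all weight names
and sizes are illustrative assumptions.

import numpy as np

def encode(xs, W_xh, W_hh):
    # Encoder RNN: fold a variable-length list of input vectors into one context vector.
    h = np.zeros(W_hh.shape[0])
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)
    return h

def decode(context, W_hh, W_hy, n_steps):
    # Decoder RNN: unroll from the context and emit one output vector per step.
    h, outputs = context, []
    for _ in range(n_steps):
        h = np.tanh(W_hh @ h)        # evolve the decoder state
        outputs.append(W_hy @ h)     # project the state to an output vector
    return outputs

# Toy usage: encode a 4-step source sequence, then decode 6 output steps.
rng = np.random.default_rng(0)
ctx = encode([rng.normal(size=3) for _ in range(4)], rng.normal(size=(5, 3)), rng.normal(size=(5, 5)))
outs = decode(ctx, np.eye(5) * 0.9, rng.normal(size=(2, 5)), n_steps=6)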
Beam search and width
Another popular heuristic is the beam search that expands upon the greedy search and returns
a list of most likely output sequences. Instead of greedily choosing the most likely next step as
the sequence is constructed, the beam search expands all possible next steps and keeps the k
most likely, where k is a user-specified parameter and controls the number of beams or parallel
searches through the sequence of probabilities.
The local beam search algorithm keeps track of k states rather than just one. It begins with k
randomly generated states. At each step, all the successors of all k states are generated. If one
of them is a goal, the algorithm halts. Otherwise, it selects the k best successors from the
complete list and repeats. We do not need to start with random states; instead, we start with
the k most likely words as the first step in the sequence.
Common beam width values are 1 for a greedy search and values of 5 or 10 for common
benchmark problems in machine translation. Larger beam widths result in better performance
of a model, as the multiple candidate sequences increase the likelihood of better matching a
target sequence, but this increased performance comes at the cost of a decrease in decoding
speed.
The search process can halt for each candidate separately, either by reaching a maximum
length, by reaching an end-of-sequence token, or by reaching a threshold likelihood.
Example:
We can define a function to perform the beam search for a given sequence of probabilities and
beam width parameter k. At each step, each candidate sequence is expanded with all possible
next steps. Each candidate step is scored by multiplying the probabilities together. The k
sequences with the most likely probabilities are selected and all other candidates are pruned.
The process then repeats until the end of the sequence.
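A sketch of such a function in plain Python. Scores are accumulated as sums of log-probabilities,
which is equivalent to multiplying the probabilities but numerically safer; the toy probability
table is made up for illustration.

from math import log

def beam_search(prob_rows, k):
    # prob_rows: one list of token probabilities per decoding step.
    beams = [([], 0.0)]                            # (token sequence, cumulative log-probability)
    for row in prob_rows:
        candidates = []
        for seq, score in beams:
            for token, p in enumerate(row):        # expand every beam with every next token
                candidates.append((seq + [token], score + log(p)))
        # keep only the k best-scoring candidates, prune the rest
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Toy usage: 3 decoding steps over a 4-word vocabulary, beam width 2.
probs = [[0.1, 0.5, 0.3, 0.1],
         [0.4, 0.2, 0.2, 0.2],
         [0.1, 0.1, 0.7, 0.1]]
print(beam_search(probs, k=2))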
Bleu score: The bilingual evaluation understudy (Bleu) score quantifies how good a machine
translation is by computing a similarity score based on n-gram precision. It is a measurement of
the differences between an automatic translation and one or more human-created reference
translations of the same source sentence.
It is defined as follows:

p_n = ( Σ_{n-gram ∈ candidate} count_clip(n-gram) ) / ( Σ_{n-gram ∈ candidate} count(n-gram) )

where p_n is the clipped (modified) n-gram precision used in the BLEU score.
The BLEU algorithm compares consecutive phrases of the automatic translation with the
consecutive phrases it finds in the reference translation and counts the number of matches in a
weighted fashion. These matches are position independent. A higher number of matches
indicates a higher degree of similarity with the reference translation, and hence a higher score.
Intelligibility and grammatical correctness are not taken into account.
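A plain-Python sketch of the clipped n-gram precision p_n defined above. The full BLEU score
additionally combines p_1 through p_4 with a geometric mean and a brevity penalty, which are
omitted here; the example sentences are made up.

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # Clipped precision: each candidate n-gram counts at most as often as in the best reference.
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], c)
    clipped = sum(min(c, max_ref_counts[gram]) for gram, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

# Toy usage: unigram precision of a candidate against one reference translation.
cand = "the cat is on the mat".split()
refs = ["the cat sat on the mat".split()]
print(modified_precision(cand, refs, n=1))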
Attention model
Attention is proposed as a solution to the limitation of the encoder-decoder model, which
encodes the input sequence into one fixed-length vector from which each output time step is
decoded. This issue is believed to be more of a problem when decoding long sequences.
Figure 4.7: Attention Model
This model allows an RNN to pay attention to the specific parts of the input that are considered
important, which improves the performance of the resulting model in practice. Denoting by
α<t,t′> the amount of attention that the output y<t> should pay to the activation a<t′>, and by
c<t> the context at time t, we have:
c<t> = Σ_{t′} α<t,t′> a<t′>,   with   Σ_{t′} α<t,t′> = 1
Attention weight: the amount of attention that the output y<t> should pay to the activation
a<t′> is given by α<t,t′>, computed as a softmax over alignment scores e<t,t′>:

α<t,t′> = exp(e<t,t′>) / Σ_{t″=1}^{T_x} exp(e<t,t″>)
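A small NumPy sketch of these two formulas. The alignment scores e<t,t′> are assumed to be
given (in practice they come from a small learned network), and the encoder activations a<t′>
are toy vectors chosen for illustration.

import numpy as np

def attention_context(scores, activations):
    # Softmax the alignment scores into weights alpha<t,t'>, then average the activations.
    e = np.asarray(scores, dtype=float)
    alpha = np.exp(e - e.max())                    # subtract the max for numerical stability
    alpha = alpha / alpha.sum()                    # the weights sum to 1 over the input positions
    context = sum(a * h for a, h in zip(alpha, activations))   # c<t> = sum_t' alpha<t,t'> a<t'>
    return alpha, context

# Toy usage: 3 encoder activations of dimension 4.
acts = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
weights, c_t = attention_context([0.1, 0.5, 2.0], acts)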
Reinforcement Learning
Reinforcement learning is a branch of machine learning concerned with taking a sequence of
actions in order to maximize some reward.
Basically, an RL agent does not know anything about the environment; it learns what to do by
exploring the environment. It takes actions, and receives states and rewards. The agent can only
change its environment through its actions.
Concepts:
Agents take actions in an environment and receive states and rewards
The goal is to find a policy that maximizes the agent's utility function
Inspired by research on psychology and animal learning
Figure 4.8: Reinforcement Learning
Here we do not know which actions will produce rewards, nor when an action will produce a
reward; sometimes an action takes time before it produces a reward. Basically, everything is
learned through interaction with the environment.
Reinforcement learning components:
Agent: Our robot
Environment: The game, or where the agent lives
A set of states
Policy: A map from states to actions
Reward Function: Gives the immediate reward for each state
Value Function: Gives the total amount of reward the agent can expect, starting from a
particular state, over the states reachable from it. With the value function you can find a policy.
Model (Optional): Used to do planning, instead of the simple trial-and-error approach
common to reinforcement learning. It describes the possible next state after an action is taken
in a given state. (A minimal sketch of the interaction loop between these components follows.)
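Below is a plain-Python sketch of that loop, using a made-up toy environment with a Gym-like
reset()/step() interface; all names and numbers are illustrative.

import random

class LineWorld:
    # Toy environment: walk on positions 0..4; reaching 4 gives reward 1 and ends the episode.
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):                        # action: -1 (left) or +1 (right)
        self.pos = max(0, min(4, self.pos + action))
        done = self.pos == 4
        return self.pos, (1.0 if done else 0.0), done

def run_episode(env, policy, max_steps=100):
    # Agent-environment loop: act, observe the next state and reward, accumulate the return.
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                     # the agent's policy maps states to actions
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

print(run_episode(LineWorld(), policy=lambda s: random.choice([-1, 1])))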
RL-Framework
The following are five widely used RL frameworks:
1. Acme
About: Acme is a framework for distributed reinforcement learning introduced by DeepMind.
The framework is used to build readable, efficient, research-oriented RL algorithms. At its core,
Acme is designed to enable simple descriptions of RL agents that can be run at various scales of
execution, including distributed agents. This framework aims to make the results of various RL
algorithms developed in academia and industrial labs easier to reproduce and extend for the
machine learning community at large.
2. DeeR
About: DeeR is a Python library for deep reinforcement learning. The framework is built with
modularity in mind so that it can easily be adapted to any need and provides many possibilities
such as Double Q-learning, Prioritized Experience Replay, Deep deterministic Policy Gradient
(DDPG), and Combined Reinforcement via Abstract Representations (CRAR).
3. Dopamine
About: Dopamine is a popular research framework for fast prototyping of reinforcement
learning algorithms. The framework aims to fill the need for a small, easily understood codebase
in which users can freely experiment with research ideas. The design principles of this
framework include flexible development, reproducibility, easy experimentation and more.
4. Frap
About: Frap, or Framework for Reinforcement learning And Planning, is a unifying framework
that identifies the underlying dimensions on which any planning or learning algorithm has to
decide. The framework provides deeper insight into the algorithmic space of planning and
reinforcement learning and also suggests new approaches to integrate the two fields. The aim of
this framework is to provide a common language to categorize algorithms, as well as to identify
new research directions.
5. RLgraph
About: RLgraph is a reinforcement learning framework that quickly prototypes, defines and
executes reinforcement learning algorithms both in research and practice. The framework
supports TensorFlow (or static graphs in general) or eager/define-by-run execution (PyTorch)
through a single component interface. Using RLgraph, developers can combine high-level
components in a space-independent manner and define input spaces.
MDP (Markov decision process)
MDP is a framework that can solve most Reinforcement Learning problems with discrete
actions. With the Markov Decision Process, an agent can arrive at an optimal policy for
maximum rewards over time.
The Markov decision process, better known as MDP, is an approach in reinforcement learning to
take decisions in a grid world environment. A grid world environment consists of states in the
form of grids.
The aim of MDP is to train an agent to find a policy that will return the maximum cumulative
rewards from taking a series of actions in one or more states.
Figure 4.9: Markov Decision Process
Here are the most important parts:
States: A set of possible states
Model: The probability of moving to state s' when you take action a while in state s. It is
also called the transition model.
Action: The things that you can do in a particular state
Reward: A scalar value that you get for being in a state
Policy: A map that tells the agent the action to take in every state
Optimal policy: A policy that maximizes the expected reward (a small example written out
in code follows this list)
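The sketch below writes these pieces out for a hypothetical two-state MDP in plain Python; every
value is made up purely to make the notation concrete.

# States, actions, transition model P, reward function R and discount factor gamma.
states = ['A', 'B']
actions = ['stay', 'move']
# P[s][a] maps each possible next state to its probability (the transition model).
P = {
    'A': {'stay': {'A': 1.0}, 'move': {'A': 0.2, 'B': 0.8}},
    'B': {'stay': {'B': 1.0}, 'move': {'A': 0.8, 'B': 0.2}},
}
R = {'A': 0.0, 'B': 1.0}              # scalar reward for being in each state
gamma = 0.9                           # discount factor for future rewards
policy = {'A': 'move', 'B': 'stay'}   # one possible (not necessarily optimal) policy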
Bellman equations
The agent tries to get the largest expected sum of rewards from every state it lands in. In order
to achieve that, we must try to get the optimal value function, i.e. the maximum sum of
cumulative rewards. The Bellman equation helps us do so.
Using the Bellman equation, the value function can be decomposed into two parts: an immediate
reward, R_{t+1}, and the discounted value of the successor state, γV(S_{t+1}).
v(s) = 𝔼 [ G_t | S_t = s ]
We unroll the return G_t,
= 𝔼 [ R_{t+1} + γR_{t+2} + γ²R_{t+3} + … | S_t = s ]
= 𝔼 [ R_{t+1} + γ(R_{t+2} + γR_{t+3} + …) | S_t = s ]
then substitute the return G_{t+1}, starting from time step t+1,
= 𝔼 [ R_{t+1} + γG_{t+1} | S_t = s ]
Finally, since expectation is a linear operator, meaning that 𝔼(aX + bY) = a𝔼(X) + b𝔼(Y), the
expected value of the return G_{t+1} is the value of the state S_{t+1},
= 𝔼 [ R_{t+1} + γv(S_{t+1}) | S_t = s ]
That gives us the Bellman equation for MRPs,
v(s) = 𝔼 [ R_{t+1} + γv(S_{t+1}) | S_t = s ]
So, for each state in the state space, the Bellman equation gives us the value of that state,
v(s) = R_s + γ Σ_{s′∈S} P_{ss′} v(s′)
The value of the state S is the reward we get upon leaving that state, plus a discounted average
over next possible successor states, where the value of each possible successor state is
multiplied by the probability that we land in it.
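A small plain-Python sketch that iterates this equation to its fixed point for a hypothetical
two-state example; P[s] holds the transition probabilities under the current policy, and all
numbers are made up.

R = {'A': 0.0, 'B': 1.0}
P = {'A': {'A': 0.2, 'B': 0.8}, 'B': {'A': 0.8, 'B': 0.2}}
gamma = 0.9
v = {s: 0.0 for s in R}

for _ in range(200):   # repeated Bellman backups converge to the true values
    v = {s: R[s] + gamma * sum(p * v[s2] for s2, p in P[s].items()) for s in R}

print(v)               # fixed point of v(s) = R_s + gamma * sum_s' P_ss' v(s')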
Value Iteration and Policy Iteration
The value-iteration and policy-iteration algorithms are two fundamental methods for solving
MDPs. Both value-iteration and policy-iteration assume that the agent knows the MDP model of
the world (i.e. the agent knows the state-transition and reward probability functions).
Therefore, they can be used by the agent to (offline) plan its actions given knowledge about the
environment before interacting with it.
Value iteration computes the optimal state value function by iteratively improving the estimate
of V(s). The algorithm initializes V(s) to arbitrary random values and repeatedly updates the
Q(s, a) and V(s) values until they converge. Value iteration is guaranteed to converge to the
optimal values.
The value-iteration algorithm keeps improving the value function at each iteration until the
value function converges. Since the agent only cares about finding the optimal policy, the
optimal policy sometimes converges before the value function does. Therefore another
algorithm, called policy iteration, instead of repeatedly improving the value-function estimate,
re-defines the policy at each step and computes the value function according to this new policy,
until the policy converges.
Policy iteration is also guaranteed to converge to the optimal policy, and it often takes fewer
iterations to converge than the value-iteration algorithm. Both value-iteration and policy-
iteration algorithms can be used for offline planning, where the agent is assumed to have prior
knowledge about the effects of its actions on the environment (they assume the MDP model is
known). Compared with value iteration, policy iteration is computationally efficient in the sense
that it often takes considerably fewer iterations to converge, although each iteration is more
computationally expensive.
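A plain-Python sketch of value iteration on the hypothetical two-state MDP from the MDP
section; the greedy policy is read off from the converged values at the end. All numbers are
illustrative.

P = {
    'A': {'stay': {'A': 1.0}, 'move': {'A': 0.2, 'B': 0.8}},
    'B': {'stay': {'B': 1.0}, 'move': {'A': 0.8, 'B': 0.2}},
}
R = {'A': 0.0, 'B': 1.0}
gamma, theta = 0.9, 1e-6
V = {s: 0.0 for s in P}

def q_value(s, a):
    # Q(s, a) = R_s + gamma * expected value of the next state under action a.
    return R[s] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())

while True:
    delta = 0.0
    for s in P:
        new_v = max(q_value(s, a) for a in P[s])   # V(s) <- max_a Q(s, a)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:                              # stop once the values have stopped changing
        break

policy = {s: max(P[s], key=lambda a: q_value(s, a)) for s in P}
print(V, policy)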
Actor-critic model
1. The “Critic” estimates the value function. This could be the action-value (the Q value) or
state-value (the V value).
2. The “Actor” updates the policy distribution in the direction suggested by the Critic (such as
with policy gradients).
Actor-critic aims to take advantage of the strengths of both value-based and policy-based
methods while eliminating their drawbacks.
The principal idea is to split the model in two: one part computes an action based on a state,
and the other produces the Q values of (i.e. evaluates) that action.
How Actor-critic works
Imagine you play a video game with a friend who provides you with some feedback. You are the
Actor and your friend is the Critic.
Figure 4.10: Working of Actor-critic
At the beginning, you do not know how to play, so you try some actions randomly. The Critic
observes your actions and provides feedback.
Learning from this feedback, you update your policy and become better at playing the game.
On the other hand, your friend (the Critic) also updates their own way of providing feedback so
it can be better next time.
The idea of Actor-Critic is to have two neural networks. We estimate both:
a) ACTOR: a policy function, which controls how our agent acts.
b) CRITIC: a value function, which measures how good these actions are.
Both run in parallel. Because we have two models (Actor and Critic) that must be trained, we
have two sets of weights that must be optimized separately.
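A minimal NumPy sketch of one actor-critic update for tabular states with a softmax actor. It
uses the TD error as the critic's feedback, which is one common choice; the learning rates and
the parameter layout are illustrative assumptions, not a specific published implementation.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_update(theta, V, s, a, r, s_next, done,
                        alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    # theta[s] holds the actor's action preferences in state s; V[s] is the critic's value estimate.
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]                       # the critic's feedback on the transition
    V[s] += alpha_critic * td_error                # critic: move V(s) toward the TD target
    probs = softmax(theta[s])
    grad_log = -probs
    grad_log[a] += 1.0                             # gradient of log pi(a|s) for a softmax policy
    theta[s] += alpha_actor * td_error * grad_log  # actor: follow the critic's suggestion
    return td_error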
Q-learning
Consider the case where the agent does not know a priori what the effects of its actions on the
environment are, i.e. the state-transition and reward models are not known. The agent only
knows the set of possible states and actions, and can observe the current state of the
environment. In this case, the agent has to learn actively through the experience of interacting
with the environment. There are two categories of learning algorithms:
Model-based learning: In model-based learning, the agent interacts with the environment and,
from the history of its interactions, tries to approximate the environment's state-transition and
reward models. Afterwards, given the models it has learnt, the agent can use value iteration or
policy iteration to find an optimal policy.
Model-free learning: In model-free learning, the agent does not try to learn explicit models of
the environment's state-transition and reward functions. Instead, it directly derives an optimal
policy from its interactions with the environment.
Q-Learning is an example of a model-free learning algorithm. It does not assume that the agent
knows anything about the state-transition and reward models. Instead, the agent discovers the
good and bad actions by trial and error.
The basic idea of Q-Learning is to approximate the state-action value function Q from the
samples of Q(s, a) that we observe while interacting with the environment. This approach is
known as Temporal-Difference (TD) Learning.
The Q-learning algorithm Process
Figure 4.11: Q-learning Algorithm Process
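A small plain-Python sketch of the two core pieces of tabular Q-learning: an ε-greedy action
choice and the off-policy update that bootstraps from the best action in the next state. Q is
assumed to be a dictionary keyed by (state, action) pairs; all names and constants are
illustrative.

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps=0.1):
    # Explore with probability eps, otherwise act greedily with respect to Q.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy TD update: the target uses the max over next actions,
    # regardless of which action the agent actually takes next.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Toy usage on a single observed transition.
Q = defaultdict(float)
a = epsilon_greedy(Q, s=0, actions=['left', 'right'])
q_learning_update(Q, s=0, a=a, r=1.0, s_next=1, actions=['left', 'right'])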
SARSA
SARSA stands for State-Action-Reward-State-Action, which symbolizes the tuple (s, a, r, s', a');
it is an on-policy algorithm for TD learning. The major difference between it and Q-Learning is
that the maximum reward for the next state is not necessarily used for updating the Q-values.
Instead, a new action, and therefore reward, is selected using the same policy that determined
the original action. The name SARSA comes from the fact that the updates are done using the
quintuple (s, a, r, s', a'), where s, a are the original state and action, r is the reward observed in
the following state, and s', a' are the new state-action pair.
The procedural form of SARSA algorithm is as follows:
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Choose a from s using the policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of the episode):
        Take action a, observe r, s'
        Choose a' from s' using the policy derived from Q (e.g., ε-greedy)
        Q(s, a) <-- Q(s, a) + α [r + γ Q(s', a') – Q(s, a)]
        s <-- s', a <-- a'
    Until s is terminal
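A plain-Python translation of the procedure above, assuming the same toy reset()/step()
environment interface used in the Reinforcement Learning section; the learning rates and other
constants are illustrative.

import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                         # Q(s, a), initialised arbitrarily (here: 0)

    def choose(s):                                 # epsilon-greedy policy derived from Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s2, r, done = env.step(a)              # take action a, observe r, s'
            a2 = choose(s2)                        # choose a' from s' with the same policy
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # on-policy update with the chosen a'
            s, a = s2, a2                          # episode ends when s is terminal
    return Q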