Reinforcement Learning

Reinforcement learning is a machine learning approach based on a reward function that rewards desired behaviors and punishes undesired ones. The incentives from rewards hopefully drive progress toward the desired outcome, all of which happens automatically. The model works by iterating over “states”, at each of which a decision must be made.
Rewards are typically applied at the “terminal state”, a desired final condition that usually marks the end of a learning episode. No rewards are given after the terminal state is reached. States in between the desired states have relatively low (or zero) rewards, which incentivizes the algorithm to keep moving. By combining these low rewards with the time steps, the algorithm has a way to determine whether it is on a suboptimal path (or in an endless loop).
Each step is a function of the state s, the action a, the reward R(s), and the next state s′,

algorithm → (s, a, R(s), s′)

This approach is formally known as a Markov Decision Process (MDP): the future depends only on the current state, not on how the algorithm got there.

Defining the Model


The Return
The return is a way of comparing rewards and determining which ones are better or worse. It accounts for time steps and weights them according to the application. Sometimes the time steps are incredibly important, and other times they are not.

The return works by summing the rewards R from each step, weighing them with a discount factor γ, until the terminal state is reached,

Return = R₁ + γR₂ + γ²R₃ + … (until terminal state)

Because the weight applied to each subsequent reward shrinks by another factor of γ, the algorithm becomes a little impatient and is incentivized to finish as fast as possible. The lower the gamma value, the more the algorithm is penalized for taking another time step, because the discount applied to each later reward keeps shrinking.
This effect produces interesting behavior in systems with negative rewards: the discount factor actually incentivizes the system to push the negative rewards as far into the future as possible.
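As a quick illustration, the return for one finished episode can be computed directly from its list of rewards. The sketch below assumes the episode's rewards have already been collected into a list; the numbers are made up.

```python
def discounted_return(rewards, gamma):
    """Return = R1 + gamma*R2 + gamma^2*R3 + ... for one finished episode."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total

# Example: two zero-reward steps followed by a terminal reward of 100, gamma = 0.9
print(discounted_return([0, 0, 100], gamma=0.9))  # 0 + 0 + 0.81 * 100 = 81.0
```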

Making Decisions with Policies


In reinforcement learning, the first goal is to come up with a policy π that takes an input state s and maps it to some action a,

π(s) = a

The final goal is to find a policy that gives an action for every possible state, to
maximize the return.

Defining the Model Algorithm


The State-action Value Function
The state-action value function Q(s, a) or Q∗ gives the return if starting in state
s and taking action a once. Then after taking the action once, behave optimally
after. So, take whatever actions result in the highest possible return.

How does the algorithm know the optimal behavior to take afterwards, and if it knows, why couldn’t it have done this from the start? Even though this definition is slightly circular, it is possible to come up with a Q-function before the optimal policy has been defined. This is discussed in more detail further down.

When creating the policy, the Q-function can be computed for every possible action at every state. Since states depend on future states, these Q-functions are calculated recursively to cover all possible scenarios. The best possible return is then selected at each state,

max_a Q(s, a)

Thus, the best possible action is the one that maximizes the return at that state.
This defines the overall policy of the model.
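In code, for a small discrete problem, this policy is just an argmax over the Q-values at each state; the Q-table below is a made-up placeholder.

```python
import numpy as np

# Hypothetical Q-values for a tiny problem: rows are states, columns are actions.
Q = np.array([
    [1.0,  2.5, 0.3],   # state 0
    [0.0, -1.0, 4.2],   # state 1
])

def policy(state):
    """Pick the action that maximizes Q(s, a) at the given state."""
    return int(np.argmax(Q[state]))

print(policy(0))  # action 1
print(policy(1))  # action 2
```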

The Bellman Equation
The final piece in putting this model together is to define the equation that computes the Q-function: the Bellman equation. Given the current state and action, the future state and action, the discount factor, and the reward for the current state,

Q(s, a) = R(s) + γ max_{a′} Q(s′, a′)

This equation defines a recursive function that iterates through the states and uses the maximum return from the following state. From a high-level view, it combines the current reward with the discounted optimal return from the next step. Once the algorithm reaches a terminal state, the state-action value function simplifies down to Q(s, a) = R(s), and the rewards can funnel back up the recursive tree.
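Because the recursion bottoms out at the terminal state, the Bellman equation can be applied repeatedly on a small discrete problem until the Q-values stop changing. The toy chain below is entirely made up and exists only to show the recursion in code.

```python
import numpy as np

# Toy deterministic chain with states 0..3; state 3 is terminal with reward 100.
# Actions: 0 = move left, 1 = move right. All of these numbers are made up.
NUM_STATES, NUM_ACTIONS, GAMMA = 4, 2, 0.9
rewards = np.array([0.0, 0.0, 0.0, 100.0])

def next_state(s, a):
    return min(s + 1, NUM_STATES - 1) if a == 1 else max(s - 1, 0)

Q = np.zeros((NUM_STATES, NUM_ACTIONS))
Q[-1, :] = rewards[-1]                      # terminal state: Q(s, a) = R(s)
for _ in range(100):                        # repeat until the values settle
    for s in range(NUM_STATES - 1):
        for a in range(NUM_ACTIONS):
            # Bellman equation: current reward plus discounted best future return.
            Q[s, a] = rewards[s] + GAMMA * Q[next_state(s, a)].max()

print(Q)   # Q grows toward the terminal reward, discounted by distance from state 3
```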

Random (Stochastic) Environments


Actions do not work in absolutes and are in fact probabilistic. In other words, a policy does not always dictate that a specific action should be chosen over the others 100% of the time. Actions are represented as probabilities, much like choices in other machine learning models. A stochastic environment refers to this variable process, where there is some randomness or uncertainty, usually due to external factors. For example, say a robot is learning how to walk, and it slips and falls on the ground. It may go back to a previous state or even end up in an unintended state.
When working with stochastic environments, the algorithm is more concerned with maximizing the expected (average) value E of the return across many attempts. So instead of running the algorithm once and relying on that single set of rewards, the algorithm may get hundreds, thousands, or millions of attempts.

Return = E[R₁ + γR₂ + γ²R₃ + … (until terminal state)]

Stochastic environments also modify the Bellman equation. Since future states
and actions are random, the average of those is taken too.

Q(s, a) = R(s) + γ E[max_{a′} Q(s′, a′)]

Since randomness can cause non-optimal runs, the Q-values returned will not be perfect. Judging the returned rewards relative to an ideal run can be a good gauge of how efficient the model is.
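A rough sketch of estimating that expected return by averaging over many attempts; the slippery one-dimensional environment here is invented purely for illustration.

```python
import random

def run_episode(gamma=0.9, slip_prob=0.1, goal=5):
    """Made-up stochastic episode: the agent tries to step toward a goal, but
    occasionally slips backwards; the terminal state pays a reward of 100."""
    position, total, discount = 0, 0.0, 1.0
    while position < goal:
        position += -1 if random.random() < slip_prob else 1
        position = max(position, 0)
        reward = 100.0 if position == goal else 0.0
        total += discount * reward
        discount *= gamma
    return total

# Average the discounted return over many attempts to approximate E[...].
returns = [run_episode() for _ in range(10_000)]
print(sum(returns) / len(returns))
```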

Continuous State Spaces
So far, it has been assumed that the system occupies discrete states. In other words, the state was always one of a finite number of known positions, each holding a single value. However, it is possible for the state to be continuous. Instead of a single value representing one state, it could be a range of values or a spectrum, often a combination of variables meshed together. For example, if modeling an autonomous car, its position (x and y coordinates), orientation, velocity, and acceleration could all together represent one single state,

s = [x, y, θ, ẋ, ẏ, ẍ, ÿ]

The State-value Function


There is one piece left to complete the model: defining an algorithm to pick the right actions. The key to this is the state-value function, a deep reinforcement learning technique involving a neural network. The state and action are fed into the neural network as inputs, and the return (the Q-function value) is output.

The neural network computes the return for every possible action in a given state and picks the action that maximizes the return. So, how is a neural network trained to output the Q-function? The approach is to use the Bellman equation to create a training set with lots of examples (x, y), then use supervised learning (a neural network) to learn the mapping from x to y.
Taking the Bellman equation defined earlier, the input x and output y can be described as,

Q(s, a) = R(s) + γ max_{a′} Q(s′, a′)

where the input x corresponds to the pair (s, a) on the left-hand side, and the output y corresponds to the right-hand side of the equation.

And the goal of the neural network is to find some function that (ideally) produces the desired output,

f_{w,b}(x) ≈ y

So, how is a training set created with values x and y for a neural network to learn from? Well, the model can start by trying all sorts of states and actions: sometimes randomly, sometimes blindly, sometimes deliberately directed, etc. Each time an action is taken, the relevant data is added to the training set. The state and action are used to compute the input x, and the reward and next state are used to compute the output y,

(s^(i), a^(i), R(s^(i)), s′^(i))

where the first two elements form x^(i) and the last two are used to compute y^(i).

The input is pretty straightforward, since state is typically a vector of given values
and the actions are one-hot encodings (0 or 1) of whether or not each action was
taken,

x^(i) = (s^(i), a^(i))

Regarding the output, which comes from the Bellman equation, the reward is known too and is fed into the equation,

y^(i) = R(s^(i)) + γ max_{a′} Q(s′^(i), a′)

Though, one may wonder about the Q-function, as optimal future actions a′ are
not known yet. Well, initially the result of this Q-function is not known. But it turns
out, it is possible to take a totally random guess as a starting point, and the
algorithm will work nonetheless.

Now that the training set has been created, the neural network can try to predict
y as a function of x. Just like other neural networks discussed, it can use mean
squared error or any other type of loss function.
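A minimal sketch of turning one stored tuple into a supervised example. The current_q helper is a hypothetical stand-in for the current (initially random) guess of the Q-function, and the state dimensions are made up.

```python
import numpy as np

GAMMA = 0.95
NUM_ACTIONS = 4

def current_q(state):
    """Hypothetical current guess of Q(s', a') for every action; starts out arbitrary."""
    return np.zeros(NUM_ACTIONS)

def make_training_pair(s, a, reward, s_next):
    """x = (state, one-hot action); y = R(s) + gamma * max_a' Q(s', a')."""
    x = np.concatenate([s, np.eye(NUM_ACTIONS)[a]])
    y = reward + GAMMA * current_q(s_next).max()
    return x, y

# Example with a made-up two-dimensional state
x, y = make_training_pair(np.array([0.5, -1.2]), a=2, reward=1.0,
                          s_next=np.array([0.6, -1.0]))
print(x, y)
```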

Defining the Learning Algorithm


The first step in putting all of these equations and steps together is to initialize the neural network parameters completely randomly, resulting in a random guess for Q(s, a). This is much like initializing the parameters w, b completely randomly in linear regression, then using gradient descent to optimize them. What is important here is whether the algorithm can slowly improve upon these initial guesses.

Next, many actions are taken, as discussed above. Take actions that are good, bad, random, deliberate, etc., and store them as training data tuples (s, a, R(s), s′). Only the most recent tuples are kept, even though the model runs far more time steps than that; for example, one could store 10,000 examples while having run 500,000 time steps. The reason for this is to avoid excessive memory load, and the stored values are referred to as the “replay buffer”.
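A replay buffer that keeps only the most recent tuples can be sketched with a fixed-length deque; the 10,000 capacity just mirrors the example above.

```python
from collections import deque

# Older tuples are discarded automatically once the buffer is full, which keeps
# memory bounded even over hundreds of thousands of time steps.
replay_buffer = deque(maxlen=10_000)

def store_experience(state, action, reward, next_state):
    replay_buffer.append((state, action, reward, next_state))
```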
The neural network is then trained on these most recent examples in the replay buffer, by converting the tuples into training set data via the state-value function. As mentioned earlier, the neural network tries to find some function that fits the desired output: Q_new(s, a) = f_{w,b}(x) ≈ y, where Q_new simply represents the new Q-function produced by the neural network.
At this point, the model has been trained, but the learning is not finished. It is rare for a reinforcement learning algorithm to get the best return on the first training run, so this whole process is repeated with one caveat - the new Q-function produced by the neural network becomes the starting point for the next run, Q = Q_new, which eliminates the need for random guessing on future iterations.

Summary of Algorithm
Initialize neural network randomly as a guess of Q(s, a).

Repeat:

Take actions to retrieve training examples (s, a, R(s), s′ ).

Store a fraction of these as training data tuples.

Train the neural network:

Create the training set using the state-value function.

x = (s, a) and y = R(s) + γ max_{a′} Q(s′, a′)

Train the model to predict the desired outputs Q_new(s, a) ≈ y.


Set the Q-function produced by the model as the initial Q-function for the next iteration, Q = Q_new.
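A high-level sketch of that loop, shaped as Python for readability. The environment and the helpers (choose_action, build_training_set, train_network) are hypothetical stand-ins supplied by the caller, not real library calls.

```python
def learn(env, q_network, choose_action, build_training_set, train_network,
          num_iterations=100, steps_per_iteration=1000, buffer_size=10_000):
    replay_buffer = []
    state = env.reset()
    for _ in range(num_iterations):
        # 1. Take actions and store the resulting tuples (s, a, R(s), s').
        for _ in range(steps_per_iteration):
            action = choose_action(q_network, state)            # e.g. epsilon-greedy
            next_state, reward, done = env.step(action)
            replay_buffer.append((state, action, reward, next_state))
            replay_buffer = replay_buffer[-buffer_size:]         # keep only recent tuples
            state = env.reset() if done else next_state
        # 2. Build (x, y) pairs with Bellman targets and fit a new Q-function.
        q_new = train_network(build_training_set(replay_buffer, q_network))
        # 3. The new Q-function becomes the starting point for the next iteration.
        q_network = q_new
    return q_network
```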

Refining the Learning Algorithm


Improving the Neural Network Architecture

One way to improve performance is to modify the architecture of the neural network itself. Rather than having a single unit in the output layer (producing the return for one action at a time), it can be more efficient to compute the return for all possible actions at once, with each action getting its own unit in the output layer. Because all of the actions need to be evaluated anyway, it is faster to compute them in one pass and pick the best return than to feed each action through the neural network separately.

It turns out this also makes computing part of the Bellman equation more efficient, specifically max_{a′} Q(s′, a′). Because the return is now computed for every action at once, the maximum over all actions can be picked directly here too.
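A bare-bones numpy sketch of that architecture: the network takes only the state as input and produces one Q-value per action, so a single forward pass provides both the greedy action and max_{a′} Q(s′, a′). The layer sizes and random weights are arbitrary.

```python
import numpy as np

STATE_DIM, HIDDEN_UNITS, NUM_ACTIONS = 8, 64, 4

# Arbitrary, randomly initialized two-layer network.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(STATE_DIM, HIDDEN_UNITS)), np.zeros(HIDDEN_UNITS)
W2, b2 = rng.normal(size=(HIDDEN_UNITS, NUM_ACTIONS)), np.zeros(NUM_ACTIONS)

def q_values(state):
    """One forward pass returns Q(s, a) for every action at once."""
    hidden = np.maximum(state @ W1 + b1, 0.0)   # ReLU hidden layer
    return hidden @ W2 + b2                     # one output unit per action

q = q_values(rng.normal(size=STATE_DIM))
print(q)                    # Q-value for each of the 4 actions
print(int(np.argmax(q)))    # greedy action; q.max() would feed the Bellman target
```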

ϵ-greedy Policy
When running reinforcement learning, even though the model is being trained,
there still needs to be a mechanism to pick actions while learning. It was
mentioned before that this could be some combination of random, direct, good,
bad, etc. actions. However, there is a more intentional way to pick actions, which
is through the ϵ-greedy policy (epsilon-greedy policy). This is especially important
at the beginning of learning, when the model still does not have a decent grasp
on what does and doesn’t work.

Taking completely random actions all the time is not a good idea, but neither is strictly taking the action that maximizes the return Q(s, a). Combining the two actually yields better results:

Greedy: With probability P , pick the action a that maximizes Q(s, a). (e.g.
95%)

Exploration: With probability (1 − P ), pick an action a randomly. (e.g. 5%)

Occasionally picking a random action helps the algorithm try out all possibilities,
while still being biased towards the best return. Otherwise, depending on how the
model is tuned or initialized (remember, the starting return values are chosen at
random), it may never try certain actions. Sometimes a model can “believe” that
certain actions are always a bad idea, even if they’re not, and randomness can
help mitigate this by forcing the algorithm to try those actions.

ϵ=1−P

Conventionally, the epsilon value ϵ refers to whatever percentage is chosen for exploration. This value usually starts high at the beginning of model training and gradually decreases. Doing so allows the learning algorithm to try out many random actions at first, then gradually become stricter as it develops an idea of which actions it should and should not be taking.
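A minimal epsilon-greedy sketch with a decaying epsilon; the Q-values and the decay schedule are made up.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action; otherwise pick the greedy one."""
    if np.random.random() < epsilon:
        return int(np.random.randint(len(q_values)))   # exploration
    return int(np.argmax(q_values))                    # greedy

epsilon = 1.0                                # start out fully random
for step in range(10):
    action = epsilon_greedy(np.array([0.1, 0.5, -0.2]), epsilon)
    epsilon = max(0.05, epsilon * 0.9)       # gradually decay toward mostly-greedy
    print(step, action, round(epsilon, 3))
```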

One last note - because reinforcement learning algorithms are less mature than supervised learning algorithms, parameter tuning can be even more important. Comparatively, using less-than-ideal parameters can cost many times more time and computing resources. As a result, hyperparameter tuning is a crucial step of the process.

Mini-batch
Mini-batching is a process of reducing the number of training examples used at each step when a machine learning model is optimizing its parameters (through gradient descent, etc.). As such, this technique can be used for both unsupervised and supervised learning. It is fairly straightforward and involves picking some subset m′ of the m training examples, with the goal of speeding up computation. When dealing with massive training sets, it can be a big efficiency boost with negligible performance losses. Each mini-batch takes one unique chunk of the training data, so the entire dataset is used once all batches have finished. For example, when running gradient descent over 100 million training examples, each subset could contain 1,000 examples.

Compared to regular batch learning (using the entire dataset at once), mini-batch gradient descent approaches the global minimum of the cost function more noisily and less reliably. However, on average, it tends to do just as well while using less computing time and fewer resources. This has made it preferable for practical applications and a go-to approach in industry.

Going back to reinforcement learning, mini-batching is implemented after the tuples have been stored in the replay buffer. So, the replay buffer is equivalent to the full dataset, and mini-batches are pieces of it.
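A small sketch of splitting a dataset (or the replay buffer) into consecutive mini-batches, as described above; the dataset and batch size are made up.

```python
def mini_batches(data, batch_size):
    """Yield consecutive chunks so the whole dataset is covered once per pass."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

dataset = list(range(10_000))                        # made-up training examples
for batch in mini_batches(dataset, batch_size=1_000):
    pass                                             # run one gradient-descent step on `batch`
print(sum(1 for _ in mini_batches(dataset, 1_000)))  # 10 batches per pass
```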

Soft Updates
In the final step of the reinforcement learning training loop, when the Q-function is updated for the next iteration, Q = Q_new, the update can often be abrupt and sometimes even worse than before. Soft updates are implemented to avoid adopting a potentially worse and noisy Q-function through an unlucky step. They work by fractionally blending in Q_new so that the result still keeps most of the old Q values. Given parameters w, b in the Q-function and some (small) percentage p,

w = p·w_new + (1 − p)·w
b = p·b_new + (1 − p)·b

This percentage value is typically very small, say 1%, and it determines how aggressive the updates are. For example, w = 1·w_new + 0·w is the same as the original hard update Q = Q_new, because all of the weight has been given to the new parameters.

It turns out using the soft update method causes the model to converge on an
optimal return more reliably and avoid unwanted properties.
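A sketch of the soft update applied to numpy parameter arrays, using p = 0.01 as in the text; the example parameters are made up.

```python
import numpy as np

def soft_update(old_params, new_params, p=0.01):
    """Blend the new parameters into the old ones: w = p*w_new + (1 - p)*w."""
    return [p * w_new + (1 - p) * w_old
            for w_old, w_new in zip(old_params, new_params)]

# Made-up parameter arrays (e.g. one weight matrix and one bias vector).
w_old = [np.zeros((2, 2)), np.zeros(2)]
w_new = [np.ones((2, 2)), np.ones(2)]
print(soft_update(w_old, w_new))   # every entry moves only 1% of the way toward w_new
```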
