Overview
1. Function Approximation
Approximate Dynamic Programming
Does ADP converge?
2. Policy Improvement
Policy Improvement Theorem
Policy Iteration
Value Iteration
Q-Learning
SARSA
Deep Networks
Recap: Dynamic Programming
Remember the Bellman update?
$$Q(s, a) \leftarrow r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) V(s'), \qquad V(s) \leftarrow \sum_a \pi(a \mid s) Q(s, a)$$
Here, $V$ and $Q$ are tables. How can we represent value functions that have infinitely many states?
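As a quick illustration (my own sketch, not part of the original slides), the tabular Bellman update can be written in a few lines of numpy; the arrays `P`, `R`, and `pi` are hypothetical placeholders for the transition model, reward table, and policy.

```python
import numpy as np

def bellman_update(V, Q, P, R, pi, gamma):
    """One sweep of the tabular Bellman update for policy evaluation.

    P[s, a, s'] : transition probabilities, R[s, a] : rewards,
    pi[s, a]    : policy probabilities,     gamma   : discount factor.
    """
    # Q(s, a) <- r(s, a) + gamma * sum_s' p(s'|s, a) V(s')
    Q_new = R + gamma * P @ V
    # V(s) <- sum_a pi(a|s) Q(s, a)
    V_new = (pi * Q_new).sum(axis=1)
    return V_new, Q_new

# Tiny random example: 3 states, 2 actions, uniform policy
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # P[s, a, :] sums to 1
R = rng.standard_normal((3, 2))
pi = np.full((3, 2), 0.5)
V, Q = np.zeros(3), np.zeros((3, 2))
for _ in range(100):
    V, Q = bellman_update(V, Q, P, R, pi, gamma=0.9)
```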
Dynamic Programming with Function Approximation
Two ideas:
values can be represented as parametric functions, i.e., $V_\theta(s)$ (where $\theta$ is a parameter vector)
since we have infinitely many states, we need to operate with samples
Dynamic Programming with Function Approximation
Values can be parametrized, i.e., $V_\theta(s) \approx V^\pi(s)$:
Linear parametrization: $V_\theta(s) = \theta^\top \phi(s)$, with a feature vector $\phi(s)$
Neural networks: $V_\theta(s) = f_\theta(s)$, with a network $f_\theta$ whose weights are $\theta$
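To make the linear case concrete, here is a tiny sketch (my own illustration; the polynomial feature map `phi` is a hypothetical choice):

```python
import numpy as np

def phi(s, n_features=4):
    """Hypothetical feature map: polynomial features of a scalar state."""
    return np.array([s**i for i in range(n_features)], dtype=float)

def V(s, theta):
    """Linear value function V_theta(s) = theta^T phi(s)."""
    return theta @ phi(s)

theta = np.zeros(4)      # parameter vector
print(V(0.5, theta))     # value of the (continuous) state s = 0.5
```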
Dynamic Programming with Function Approximation
Assume you have a dataset of states $s_i$, actions $a_i$, rewards $r_i$, next states $s'_i$ and next actions $a'_i$, where $a_i \sim \pi(\cdot \mid s_i)$ and $a'_i \sim \pi(\cdot \mid s'_i)$.
Approximate dynamic programming consists of iterating the following equations:
$$\theta_{k+1} = \arg\min_\theta \sum_i \big( r_i + \gamma V_{\omega_k}(s'_i) - Q_\theta(s_i, a_i) \big)^2, \qquad \omega_{k+1} = \arg\min_\omega \sum_i \big( Q_{\theta_{k+1}}(s_i, a_i) - V_\omega(s_i) \big)^2$$
Approximate Dynamic Programming for Policy Evaluation
1: Input: policy $\pi$, number of episodes $N$, and parameter vectors $\theta_0$, $\omega_0$
2: Collect samples $\{(s_i, a_i, r_i, s'_i, a'_i)\}_i$ using the policy $\pi$
3: for $k = 0, 1, 2, \dots$ do
4: Minimize $\sum_i \big( r_i + \gamma V_{\omega_k}(s'_i) - Q_\theta(s_i, a_i) \big)^2$ w.r.t. $\theta$ using, e.g., gradient descent
5: Minimize $\sum_i \big( Q_{\theta_{k+1}}(s_i, a_i) - V_\omega(s_i) \big)^2$ w.r.t. $\omega$
6: end for
7: Return the final $\theta$ and $\omega$.
Approximate Dynamic Programming for Policy Evaluation
1: Input: policy $\pi$, number of episodes $N$, and a parameter vector $\theta_0$
2: Collect samples $\{(s_i, a_i, r_i, s'_i, a'_i)\}_i$ using the policy $\pi$
3: for $k = 0, 1, 2, \dots$ do
4: Minimize $\sum_i \big( r_i + \gamma Q_{\theta_k}(s'_i, a'_i) - Q_\theta(s_i, a_i) \big)^2$ w.r.t. $\theta$ using, e.g., gradient descent
5: $\theta_{k+1} \leftarrow \theta$
6: end for
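A minimal sketch of this sampled fixed-point iteration, assuming a linear $Q$-function over a hypothetical one-hot feature map `phi` (closed-form least squares stands in for gradient descent):

```python
import numpy as np

def phi(s, a, n_states=3, n_actions=2):
    """Hypothetical one-hot features for a small discrete MDP."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def fitted_policy_evaluation(data, gamma=0.9, n_iters=50, n_features=6):
    """data: list of tuples (s, a, r, s_next, a_next) collected with the policy pi."""
    theta = np.zeros(n_features)
    Phi = np.array([phi(s, a) for s, a, _, _, _ in data])
    Phi_next = np.array([phi(sn, an) for _, _, _, sn, an in data])
    rewards = np.array([r for _, _, r, _, _ in data])
    for _ in range(n_iters):
        # bootstrapped targets r_i + gamma * Q_{theta_k}(s'_i, a'_i), with theta_k frozen
        targets = rewards + gamma * Phi_next @ theta
        # minimize sum_i (targets_i - phi(s_i, a_i)^T theta)^2 in closed form
        theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return theta
```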
Gradient Descent Approximate Dynamic Programming
Let us compute the gradient of the optimization problem introduced,
$$\nabla_\theta \sum_i \big( r_i + \gamma Q_{\theta_k}(s'_i, a'_i) - Q_\theta(s_i, a_i) \big)^2 = -2 \sum_i \big( r_i + \gamma Q_{\theta_k}(s'_i, a'_i) - Q_\theta(s_i, a_i) \big) \nabla_\theta Q_\theta(s_i, a_i).$$
This gradient allows us to write an online version of dynamic programming, i.e., temporal difference with function approximation.
Gradient Updates
Let $Q_\theta$ be parametric, i.e., $Q_\theta(s, a) = \theta^\top \phi(s, a)$,
where $\phi(s, a)$ is a feature vector, and $\nabla_\theta Q_\theta(s, a) = \phi(s, a)$.
Temporal Difference with Function Approximation
Temporal Difference with Function Approximation for Value Functions
1: Input: policy $\pi$, number of episodes $K$, parameter vector $\theta$
2: for Episodes do
3: Sample first state $s$
4: for Single episode do
5: Sample action $a \sim \pi(\cdot \mid s)$
6: Update the learning rate $\alpha$
7: Apply $a$ on the environment and receive reward $r$ and next state $s'$
8: $\theta \leftarrow \theta + \alpha \big( r + \gamma V_\theta(s') - V_\theta(s) \big) \nabla_\theta V_\theta(s)$
9: $s \leftarrow s'$
10: end for
11: end for
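A compact sketch of this algorithm for a linear value function; the environment interface (`env.reset()`, `env.step(a)`) and the feature map `phi` are hypothetical placeholders:

```python
import numpy as np

def td0_linear(env, policy, phi, n_features, n_episodes=100, gamma=0.99, alpha=0.05):
    """TD(0) with linear function approximation: V_theta(s) = theta^T phi(s)."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)                          # sample a ~ pi(.|s)
            s_next, r, done = env.step(a)          # interact with the environment
            v, v_next = theta @ phi(s), theta @ phi(s_next)
            td_error = r + gamma * (0.0 if done else v_next) - v
            theta += alpha * td_error * phi(s)     # nabla_theta V_theta(s) = phi(s)
            s = s_next
    return theta
```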
Temporal Difference with Function Approximation
Temporal Difference with Function Approximation for Action-Value Functions
1: Input: policy $\pi$, number of episodes $K$, parameter vector $\theta$
2: for Episodes do
3: Sample first state $s$
4: Sample action $a \sim \pi(\cdot \mid s)$
5: for Single episode do
6: Update the learning rate $\alpha$
7: Apply $a$ on the environment and receive reward $r$ and next state $s'$
8: Sample $a' \sim \pi(\cdot \mid s')$
9: $\theta \leftarrow \theta + \alpha \big( r + \gamma Q_\theta(s', a') - Q_\theta(s, a) \big) \nabla_\theta Q_\theta(s, a)$
10: $s \leftarrow s'$, $a \leftarrow a'$
11: end for
12: end for
Convergence
Approximate dynamic programming and temporal difference with function approximation converge to a biased solution when the parametrization is linear.
When the function approximation is not linear (e.g., neural networks), ADP and TD with function approximation are not guaranteed to converge.
In either case, their estimate is biased. Such bias can be mitigated by introducing many parameters, but many parameters cause high estimation variance, which can only be compensated by using many samples.
State-of-the-art reinforcement learning tends to use deep neural networks with millions (or even billions) of parameters, and a vast number of samples.
Our Objective
Our objective is to find a policy that performs at least as well as any other policy. In mathematical terms:
$$V^{\pi^*}(s) \geq V^{\pi}(s) \quad \text{for all } s \text{ and for all } \pi.$$
We call such a policy optimal. Note: there might be several optimal policies.
Policy Improvement Theorem
Policy Improvement Theorem
If $Q^{\pi}(s, \pi'(s)) \geq V^{\pi}(s)$ for all $s$, then $V^{\pi'}(s) \geq V^{\pi}(s)$ for all $s$.
Note: the policy improvement theorem is more general in textbooks; however, this simplified (and more specific) version still allows us to derive the following corollaries and algorithms.
Greedy Policies: A policy $\pi'$ is said to be greedy (w.r.t. $Q^{\pi}$) if $\pi'(s) \in \arg\max_a Q^{\pi}(s, a)$ for all $s$.
Tabular Policy Iteration
Tabular Policy Iteration
1: Create a table $\pi(s)$ that represents a deterministic policy, and initialize it randomly
2: for $k = 1, \dots, K$ do
3: $Q^{\pi} \leftarrow$ PolicyEvaluation$(\pi)$ (e.g., use DP for policy evaluation as introduced in the previous lecture)
4: for each state $s$ do
5: $\pi(s) \leftarrow \arg\max_a Q^{\pi}(s, a)$
6: end for
7: end for
Question: does this algorithm converge to the optimal policy? Yes (for tabular $\pi$ and $Q$). However, it is computationally expensive because after each policy improvement we need to evaluate the policy.
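A small numpy sketch of tabular policy iteration (my own illustration; `P[s, a, s']` and `R[s, a]` are hypothetical model arrays, and policy evaluation is done by sweeping the Bellman update):

```python
import numpy as np

def policy_evaluation(pi, P, R, gamma, n_sweeps=200):
    """Evaluate a deterministic policy pi[s] by repeated Bellman updates."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):
        V = Q[np.arange(n_states), pi]          # V(s) = Q(s, pi(s))
        Q = R + gamma * P @ V                   # Q(s,a) = r(s,a) + gamma * E[V(s')]
    return Q

def policy_iteration(P, R, gamma=0.9, n_iters=20):
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy
    for _ in range(n_iters):
        Q = policy_evaluation(pi, P, R, gamma)  # policy evaluation
        pi = Q.argmax(axis=1)                   # greedy policy improvement
    return pi, Q
```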
Policy Improvement Theorem
Corollary 1 If there exists a state $s$ such that $\max_a Q^{\pi}(s, a) > V^{\pi}(s)$, then the greedy policy $\pi'$ w.r.t. $Q^{\pi}$ is a strict improvement, i.e., $V^{\pi'}(s) > V^{\pi}(s)$ for that state.
Corollary 2 (Optimality Bellman Equation) The optimal $Q$-function satisfies the following optimality Bellman equation:
$$Q^*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q^*(s', a').$$
Can we use the optimality Bellman equation to obtain some guarantees on the convergence of policy iteration, and to derive a more efficient algorithm?
Optimality Bellman Operator
Consider the optimality Bellman operator
$$(T^* Q)(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a').$$
The optimality Bellman operator is contractive, and, thanks to Banach's fixed-point theorem, we can state that
$$\lim_{k \to \infty} (T^*)^k Q = Q^* \quad \text{for any initial } Q.$$
Once we obtain the optimal action-value function $Q^*$, obtaining an optimal policy is trivial:
$$\pi^*(s) \in \arg\max_a Q^*(s, a)$$
(Note: such a policy is deterministic).
On the Contractivity of the Optimality Bellman Operator
Consider the optimality Bellman operator
$$(T^* Q)(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a'),$$
which is a $\gamma$-contraction in the max norm: $\| T^* Q_1 - T^* Q_2 \|_\infty \leq \gamma \| Q_1 - Q_2 \|_\infty$.
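A standard derivation sketch of the contraction property (my reconstruction; it uses the inequality $|\max_x f(x) - \max_x g(x)| \leq \max_x |f(x) - g(x)|$):

\begin{align*}
\big| (T^* Q_1)(s, a) - (T^* Q_2)(s, a) \big|
  &= \gamma \Big| \sum_{s'} p(s' \mid s, a) \big( \max_{a'} Q_1(s', a') - \max_{a'} Q_2(s', a') \big) \Big| \\
  &\leq \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} \big| Q_1(s', a') - Q_2(s', a') \big| \\
  &\leq \gamma \, \| Q_1 - Q_2 \|_\infty .
\end{align*}

Taking the maximum over $(s, a)$ on the left-hand side gives $\| T^* Q_1 - T^* Q_2 \|_\infty \leq \gamma \| Q_1 - Q_2 \|_\infty$.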
Value Iteration
Tabular Value Iteration
1: Create two tables $Q$ and $Q'$ that represent the $Q$-functions, and $\pi$ that represents the policy
2: for $k = 1, \dots, K$ do
3: for each state-action pair $(s, a)$ do
4: $Q'(s, a) \leftarrow r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a')$
5: end for
6: for each state $s$ do
7: $\pi(s) \leftarrow \arg\max_a Q'(s, a)$
8: end for
9: $Q \leftarrow Q'$
10: end for
11: Return $\pi$
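A corresponding numpy sketch of tabular value iteration (same hypothetical `P` and `R` arrays as in the policy iteration sketch):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_iters=200):
    """Tabular value iteration on the optimality Bellman operator."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Q'(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) max_a' Q(s',a')
        Q = R + gamma * P @ Q.max(axis=1)
    pi = Q.argmax(axis=1)                       # greedy policy w.r.t. Q
    return pi, Q
```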
Value Iteration
Value iteration is basically dynamic programming with a max operator that selects the action at the next state.
Value iteration is model-based (it needs knowledge of the transition model and the reward function). Similarly to what we have seen in the previous lecture, we can derive a model-free, online version, called $Q$-learning.
Online Algorithms
Like in the previous lecture, we want to devise an online algorithm that uses samples to update the $Q$-function and the policy.
Online Algorithm
1: Initialize a $Q$-function
2: for Episodes do
3: Sample first state $s$
4: for Single episode do
5: Sample action $a$ according to a policy (which policy?)
6: Apply $a$ on the environment and receive reward $r$ and next state $s'$
7: Use $(s, a, r, s')$ to update the $Q$-function
8: $s \leftarrow s'$
9: end for
10: end for
Q-learning
Like in the previous lecture, we can use online averaging and bootstrapping, i.e.,
$$Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big),$$
where $r$ is the reward and $s'$ is the next state.
$\epsilon$-Greedy Policies
We want to obtain an "online" algorithm: i.e., an algorithm that improves the policy while interacting with the environment.
A popular strategy is to use $\epsilon$-greedy policies: policies that select the greedy action with probability $1 - \epsilon$, and select a random (possibly sub-optimal) action with probability $\epsilon$.
Such policies select good actions with high probability (usually $\epsilon$ is small), while continuing to explore and avoiding getting stuck in local optima. (For the most curious: see the exploration-exploitation trade-off.)
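An $\epsilon$-greedy selection rule fits in a few lines (an illustrative sketch, assuming a hypothetical Q-table indexed as `Q[s, a]`):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random.default_rng()):
    """Select the greedy action w.r.t. Q[s, :] with prob. 1 - epsilon, a random one otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniformly random action
    return int(Q[s].argmax())                  # exploit: greedy action
```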
Q-learning
Tabular Q-learning
1: Input: discount factor $\gamma$, number of episodes $K$
2: Initialize: a table of state-action visitation counts $n(s, a)$, a table of $Q$-values $Q(s, a)$, both initialized with zeros
3: for Episodes do
4: Sample first state $s$
5: for Single episode do
6: With probability $1 - \epsilon$ select $a \in \arg\max_{a'} Q(s, a')$, otherwise select $a$ randomly # Greedy Policy
7: $n(s, a) \leftarrow n(s, a) + 1$, $\alpha \leftarrow 1 / n(s, a)$ # Learning Rate Update
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: $Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big)$ # Bellman Update
10: $s \leftarrow s'$
11: end for
12: end for
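Putting the pieces together, a minimal sketch of tabular Q-learning; the environment interface (`env.reset()`, `env.step(a)` returning `(s_next, r, done)`) is a hypothetical placeholder:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))         # Q-table initialized with zeros
    n = np.zeros((n_states, n_actions))         # state-action visitation counts
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
            n[s, a] += 1
            alpha = 1.0 / n[s, a]               # learning rate from visitation counts
            s_next, r, done = env.step(a)
            # Bellman update with the max over next actions (off-policy target)
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s_next].max()) - Q[s, a])
            s = s_next
    return Q
```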
Simulation Time
Let's simulate $Q$-learning. We use the "investor MDP" as an example (next slide). There are three states: {rich, well-off, poor}.
There are two actions: {0: no-invest, 1: invest}.
We assume a constant learning rate $\alpha$, discount factor $\gamma$, and $\epsilon$-greedy exploration. Episodes are truncated after 5 steps.
I need 5 students:
One student executes the MDP (simulate.py)
One student executes the epsilon-greedy policy (epsilon_greedy.py)
One student computes the TD target $r + \gamma \max_{a'} Q(s', a')$
One student computes the entire update
One student keeps track of the Q-table on the blackboard.
Tabular SARSA
1: Input: discount factor $\gamma$, number of episodes $K$
2: Initialize: a table of state-action visitation counts $n(s, a)$, a table of $Q$-values $Q(s, a)$, both initialized with zeros.
3: for Episodes do
4: Sample first state $s$.
5: With probability $1 - \epsilon$ select $a \in \arg\max_{a'} Q(s, a')$, otherwise select $a$ randomly.
6: for Single episode do
7: $n(s, a) \leftarrow n(s, a) + 1$, $\alpha \leftarrow 1 / n(s, a)$ # Learning Rate Update
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: With probability $1 - \epsilon$ select $a' \in \arg\max_{a''} Q(s', a'')$, otherwise select a random action $a'$. # Greedy Policy
10: $Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma Q(s', a') - Q(s, a) \big)$ # Bellman Update
11: $s \leftarrow s'$, $a \leftarrow a'$
12: end for
13: end for
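For comparison with the Q-learning sketch above, a SARSA sketch under the same hypothetical environment interface; note that the bootstrap uses the action actually chosen by the $\epsilon$-greedy policy:

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes=500, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    eps_greedy = lambda Q, s: rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
    Q = np.zeros((n_states, n_actions))
    n = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = eps_greedy(Q, s)
        while not done:
            n[s, a] += 1
            alpha = 1.0 / n[s, a]
            s_next, r, done = env.step(a)
            a_next = eps_greedy(Q, s_next)      # on-policy: next action from the same policy
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s_next, a_next]) - Q[s, a])
            s, a = s_next, a_next
    return Q
```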
Q-learning
Q-learning with Function Approximation
1: Input: discount factor $\gamma$, number of episodes $K$
2: Initialize: a parameter vector $\theta$ of the $Q$-function $Q_\theta$ (e.g., with zeros)
3: for Episodes do
4: Sample first state $s$
5: for Single episode do
6: With probability $1 - \epsilon$ select $a \in \arg\max_{a'} Q_\theta(s, a')$, otherwise select a random action $a$.
7: Update the learning rate $\alpha$.
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: $\theta \leftarrow \theta + \alpha \big( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \big) \nabla_\theta Q_\theta(s, a)$
10: $s \leftarrow s'$
11: end for
12: end for
SARSA with Function Approximation
1: Input: discount factor $\gamma$, number of episodes $K$
2: Initialize: a parameter vector $\theta$ of the $Q$-function $Q_\theta$ (e.g., with zeros).
3: for Episodes do
4: Sample first state $s$.
5: With probability $1 - \epsilon$ select $a \in \arg\max_{a'} Q_\theta(s, a')$, otherwise select $a$ randomly.
6: for Single episode do
7: Update the learning rate $\alpha$
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: With probability $1 - \epsilon$ select $a' \in \arg\max_{a''} Q_\theta(s', a'')$, otherwise select $a'$ randomly.
10: $\theta \leftarrow \theta + \alpha \big( r + \gamma Q_\theta(s', a') - Q_\theta(s, a) \big) \nabla_\theta Q_\theta(s, a)$
11: $s \leftarrow s'$, $a \leftarrow a'$
12: end for
13: end for
SARSA and Q-Learning
SARSA can be seen as an online version of policy iteration: the Bellman update only evaluates the current policy, and the current policy is ($\epsilon$-)greedy w.r.t. the $Q$-function.
Q-learning can be seen as an online version of value iteration (the Bellman update evaluates the greedy policy).
For this reason, SARSA is an on-policy algorithm (i.e., it evaluates the current policy), while $Q$-learning is off-policy, since it evaluates the greedy policy while using an $\epsilon$-greedy policy on the environment.
Deep Q-Network
Q-learning with function approximation tends to be a bit unstable.
DQN aims to mitigate these instabilities by 1) introducing a target $Q$-function, and 2) introducing randomized mini-batch updates (replay buffer).
Target $Q$-functions. To stabilize learning, it is useful to avoid bootstrapping on the parameters that are currently being updated. The idea of DQN is to keep two separate $Q$-functions, as in DP. This can be done by having two separate sets of parameters $\theta$ and $\theta^-$: the online network $Q_\theta$ and the target network $Q_{\theta^-}$.
The target parameters are updated once in a while ($\theta^- \leftarrow \theta$).
Replay buffer. In classic $Q$-learning, samples are very correlated, as they are obtained by running the MDP. Using non-i.i.d. samples with function approximation is problematic. The idea of the replay buffer is to store the last $N$ samples, and to sample each time a minibatch of (approximately) i.i.d. samples from it.
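A minimal sketch of these two ingredients, using a hypothetical parametric Q-function `q(theta, s)` that returns one value per action and its gradient `grad_q(theta, s, a)`:

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Stores the last `capacity` transitions and samples random minibatches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)
    def append(self, transition):               # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

def dqn_update(theta, theta_target, batch, q, grad_q, alpha=1e-3, gamma=0.99):
    """One semi-gradient step on a minibatch; bootstraps from the frozen target parameters."""
    for s, a, r, s_next, done in batch:
        target = r + gamma * (0.0 if done else q(theta_target, s_next).max())
        td_error = target - q(theta, s)[a]
        theta = theta + alpha * td_error * grad_q(theta, s, a)
    return theta
```

Here `theta_target` is meant to be copied from `theta` only every few episodes, matching the target update step in the DQN pseudocode below.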
Deep Q-Network
DQN
1: for Episodes do
2: Sample first state $s$
3: for Single episode do
4: With probability $1 - \epsilon$ select $a \in \arg\max_{a'} Q_\theta(s, a')$, otherwise select $a$ randomly.
5: Update the learning rate $\alpha$.
6: Apply $a$ on the environment and receive reward $r$ and next state $s'$
7: Append $(s, a, r, s')$ to the replay buffer $D$.
8: Sample a minibatch $B$ of transitions $(s_i, a_i, r_i, s'_i)$ from $D$
9: $\theta \leftarrow \theta + \alpha \sum_{i \in B} \big( r_i + \gamma \max_{a'} Q_{\theta^-}(s'_i, a') - Q_\theta(s_i, a_i) \big) \nabla_\theta Q_\theta(s_i, a_i)$
10: $s \leftarrow s'$
11: end for
12: Every $C$ episodes, $\theta^- \leftarrow \theta$ # Target Update
13: end for