Machine Learning for
Systems and Control
5SC28
Lecture 8
dr. ir. Maarten Schoukens & prof. dr. ir. Roland Tóth
Control Systems Group
Department of Electrical Engineering
Eindhoven University of Technology
Academic Year: 2024-2025 (version 25/05/2025)
Past Lectures
Introduction to Reinforcement Learning
Temporal Difference Learning
Value-Based Learning: Tabular Q-Learning
Approximate Q-Learning
DQN
2
Learning Goals
• Explain, discuss, compare and interpret the main techniques and
theory for reinforcement-learning-based control, starting from classical
Q-learning up to actor-critic and model-internalization-based
methods;
• Recommend and evaluate a machine learning method for a real-life
application in a systems and control setting;
• Implement, tailor, and apply Gaussian process, (deep) neural
network and reinforcement learning techniques for model and
control learning on real-world systems (e.g. on an inverted pendulum
laboratory setup).
3
Action-Value Reinforcement Learning
Agent learns Q* or Qπ to improve its actions
4
Action-Value Reinforcement Learning
Concept: Learn the value of actions through Q*(x,u) or Qπ(x,u)
Strategy:
Training & Data ➔ θ ➔ Q-function ➔ Policy π
5
Action-Value Reinforcement Learning
Concept: Learn the value of actions through Q*(x,u) or Qπ(x,u)
Strategy:
Training & Data ➔ θ ➔ Q-function ➔ Policy π
Disadvantages: A finite action space is needed to solve max_u Q(x,u)
Poor exploration in case of a large action space (slow)
6
Policy Gradient Reinforcement Learning
Concept: Learn the policy that ensures maximum rewards for the
actions taken
Strategy: Training & Data ➔ θ ➔ Policy π
7
Policy Gradient Reinforcement Learning
Concept: Learn the policy that ensures maximum rewards for the
actions taken
Strategy: Training & Data ➔ θ ➔ Policy π
Main Advantages: No need to solve max_u Q(x,u)
No need to discretize the action space
Can handle stochastic policies
Increased algorithm stability
(Dis)advantage: More complicated algorithm, more options to tune
8
Policy Gradient Methods
9
Stochastic Policy Representation
Stochastic policy: Learn a function that gives the probability
distribution over actions given the current state:
π_θ(u | x) = P( u_k = u | x_k = x, θ ), with θ the parameters, x the state and u the action
Stochastic policies can be optimal in a partially observed or competitive
environment (e.g. card games)
10
Stochastic Policy Representation
Stochastic policy: Learn a function that gives the probability
distribution over actions from the current state:
Discrete actions: softmax policy
Continuous actions: Gaussian policy
11
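To make these two standard parameterizations concrete, here is a minimal numpy sketch; the linear preference/mean structure and the feature vector phi_x are illustrative assumptions, not the lecture's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Softmax policy for a discrete action set (illustrative linear preferences) ---
def softmax_policy(theta, phi_x):
    """theta: (n_actions, n_features), phi_x: (n_features,) feature vector of state x."""
    h = theta @ phi_x                      # action preferences h_theta(x, u)
    p = np.exp(h - h.max())                # numerically stable softmax
    return p / p.sum()                     # pi_theta(u | x), sums to 1 over U

# --- Gaussian policy for a continuous action (illustrative linear mean) ---
def gaussian_policy_sample(theta_mu, sigma, phi_x):
    mu = theta_mu @ phi_x                  # mean mu_theta(x)
    u = rng.normal(mu, sigma)              # sample u ~ N(mu_theta(x), sigma^2)
    logp = -0.5 * ((u - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return u, logp
```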
Policy Gradient: Optimization
Objective function: J(θ) = V^{π_θ}(x_0) = E_{π_θ}[ Σ_{k≥0} γ^k r_{k+1} ]
Interpretation: J is the expected cumulative reward (via the value
function) using the policy π.
12
Policy Gradient: Optimization
Objective function: J(θ) = V^{π_θ}(x_0) = E_{π_θ}[ Σ_{k≥0} γ^k r_{k+1} ]
Interpretation: J is the expected cumulative reward (via the value
function) using the policy π.
Solution: Stochastic Gradient Ascent (SGA): θ_{i+1} = θ_i + α ∇_θ J(θ_i)
where α is the step size. Replay can be used to assist convergence.
13
Policy Gradient: Optimization
Objective function: J(θ) = V^{π_θ}(x_0) = E_{π_θ}[ Σ_{k≥0} γ^k r_{k+1} ]
Interpretation: J is the expected cumulative reward (via the value
function) using the policy π.
How do we calculate the value gradient to arrive at a gradient over
the policy?
Solution: Stochastic Gradient Ascent (SGA): θ_{i+1} = θ_i + α ∇_θ J(θ_i)
where α is the step size. Replay can be used to assist convergence.
14
Policy Gradient Theorem
Assumptions:
Assume U is a discrete action space (for simplicity)
The probability π_θ(u|x) that action u is taken for a given x, θ is nonzero over U
We know Q^π(x,u),
i.e., we can characterize the effect of the current action
made by the policy on the future based on Q^π
15
Policy Gradient Theorem
We can show in terms of the policy gradient theorem1,2:
∇_θ J(θ) = E_π[ Σ_{u∈U} Q^π(x_k, u) ∇_θ π_θ(u | x_k) ]
Equivalent reformulation:
∇_θ J(θ) = E_π[ Q^π(x_k, u_k) ∇_θ ln π_θ(u_k | x_k) ]
Or:
∇_θ J(θ) = E_π[ G_k ∇_θ ln π_θ(u_k | x_k) ]
1 R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction.” A Bradford Book, pp. 356, 2008.
2 V. Konda and J. Tsitsiklis, “Actor-Critic Algorithms.” SIAM Journal on Control and Optimization, pp. 1008-1014, 2000.
16
Policy Gradient Theorem
We can show in terms of the policy gradient theorem1,2:
∇_θ J(θ) = E_π[ Σ_{u∈U} Q^π(x_k, u) ∇_θ π_θ(u | x_k) ]
Equivalent reformulation:
∇_θ J(θ) = E_π[ Q^π(x_k, u_k) ∇_θ ln π_θ(u_k | x_k) ]
The derivative of the expected reward is the expectation of the
product of the return (or Q-value) and the gradient of the log of the policy π_θ.
Or:
∇_θ J(θ) = E_π[ G_k ∇_θ ln π_θ(u_k | x_k) ]
1 R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction.” A Bradford Book, pp. 356, 2008.
2 V. Konda and J. Tsitsiklis, “Actor-Critic Algorithms.” SIAM Journal on Control and Optimization, pp. 1008-1014, 2000.
17
Policy Gradient Algorithm Backbone
Replace u with the sampled action u_k ∼ π_θ(· | x_k) as the current action taken
18
Policy Gradient Algorithm Backbone
Replace u with the sampled action u_k ∼ π_θ(· | x_k) as the current action taken
Implement in the REINFORCE algorithm
19
REINFORCE Algorithm
20
REINFORCE Algorithm
Episodic algorithm! A full episode is completed for one parameter update
The episodic nature is needed to obtain an estimate of the return G_k
21
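A minimal REINFORCE sketch for a discrete action space, assuming a softmax policy with linear action preferences and a Gym-style environment interface (env.reset() / env.step(u) returning (x_next, r, done)); these names and the feature map phi are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def reinforce(env, phi, n_features, n_actions, alpha=1e-3, gamma=0.99, episodes=500):
    """REINFORCE: Monte-Carlo policy gradient with a softmax policy (illustrative sketch)."""
    rng = np.random.default_rng(0)
    theta = np.zeros((n_actions, n_features))

    def policy(x):
        h = theta @ phi(x)
        p = np.exp(h - h.max())
        return p / p.sum()

    for _ in range(episodes):
        # 1) Roll out one full episode with the current policy
        states, actions, rewards = [], [], []
        x, done = env.reset(), False
        while not done:
            p = policy(x)
            u = rng.choice(n_actions, p=p)
            x_next, r, done = env.step(u)
            states.append(x)
            actions.append(u)
            rewards.append(r)
            x = x_next

        # 2) Compute returns G_k backwards and take one SGA step per visited state
        G = 0.0
        for k in reversed(range(len(rewards))):
            G = rewards[k] + gamma * G
            p = policy(states[k])
            grad_log_pi = np.outer(-p, phi(states[k]))       # rows: -pi(j|x) * phi(x)
            grad_log_pi[actions[k]] += phi(states[k])        # extra term for the taken action
            theta += alpha * (gamma ** k) * G * grad_log_pi  # SGA step on J(theta)
    return theta
```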
Types of RL Control
By path to optimal solution
• Off-policy – find optimal policy regardless of the agent’s motivation
• On-policy – improve the policy that is used to make decisions
By level of interaction with the process
• Online – learn by interacting with the process
• Offline – data collected in advance (Monte-Carlo methods)
By model knowledge
• Model-free – no model, only transition data (e.g. standard Q-Learning RL)
• Model-based – the model is known (e.g. Dynamic Programming)
• Model-learning – estimate the model from transition data
22
REINFORCE + baseline algorithm
23
REINFORCE + baseline algorithm
Baseline: We can introduce an arbitrary baseline b(x) without consequence*:
∇_θ J(θ) = E_π[ (G_k − b(x_k)) ∇_θ ln π_θ(u_k | x_k) ]
Explanation: the baseline does not introduce bias in the gradient,
but it reduces the variance of the gradient estimate in practice if b(x) is chosen well (e.g. b(x) ≈ V^π(x))
* As long as b does not depend on u
24
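Why the baseline adds no bias can be seen in one line; this is a sketch of the standard argument, for a discrete action space and a fixed state x:

```latex
\mathbb{E}_{u \sim \pi_\theta(\cdot \mid x)}\!\left[\, b(x)\, \nabla_\theta \ln \pi_\theta(u \mid x) \right]
  = b(x) \sum_{u \in U} \pi_\theta(u \mid x)\, \frac{\nabla_\theta \pi_\theta(u \mid x)}{\pi_\theta(u \mid x)}
  = b(x)\, \nabla_\theta \sum_{u \in U} \pi_\theta(u \mid x)
  = b(x)\, \nabla_\theta 1 = 0 .
```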
REINFORCE + baseline algorithm
Source: Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, second edition, Figure 13.2
25
Actor-Critic Methods
26
Why Actor-Critic?
Problem: REINFORCE uses episodic learning to estimate G_k
Idea: Introduce the value function V^π(x) (cf. Lectures 5 and 6)
Estimate V^π using a temporal difference
approach (Lecture 6)
27
Actor-Critic Reinforcement Learning
Strategy: the critic aims to learn Vπ while the actor tries to improve π
28
Actor-Critic Overview
Explicitly separated value function and policy
Critic = state value function
Actor = control policy
Problem setting:
Continuous synchronous learning
Continuous action and state space
Deterministic state transition and reward function
29
Actor-Critic Parameterization
Critic: V̂^π(x; w) = w⊤ φ(x), with φ(x) the basis functions/features and w the parameter vector
Actor: π_θ(u|x), e.g. a Gaussian policy with mean μ(x; θ) = θ⊤ φ(x)
Alternatively, they can be described via a GP, DNN, etc.
Don’t forget: probabilistic policies (softmax, Gaussian, ε-greedy)
30
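A small numpy sketch of one possible instance of this parameterization (linear critic, Gaussian actor with linear mean); the class names and the fixed σ are illustrative assumptions.

```python
import numpy as np

class LinearCritic:
    """State-value function V(x) ≈ w^T phi(x)."""
    def __init__(self, n_features):
        self.w = np.zeros(n_features)
    def value(self, phi_x):
        return self.w @ phi_x

class GaussianActor:
    """Gaussian policy: u ~ N(mu(x), sigma^2) with mu(x) = theta^T phi(x)."""
    def __init__(self, n_features, sigma=0.3):
        self.theta = np.zeros(n_features)
        self.sigma = sigma
    def sample(self, phi_x, rng):
        mu = self.theta @ phi_x
        return rng.normal(mu, self.sigma)
    def grad_log_pi(self, phi_x, u):
        mu = self.theta @ phi_x
        return (u - mu) / self.sigma ** 2 * phi_x   # d/d theta of log N(u; mu, sigma^2)
```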
On-Policy Advantage Based Critic
Objective of the critic: learn V^π for the current policy π
Step 1: Use the Bellman equation for V^π: V^π(x_k) = E[ r_{k+1} + γ V^π(x_{k+1}) ]
Sampled version: V^π(x_k) ≈ r_{k+1} + γ V^π(x_{k+1}) for a transition sample (x_k, u_k, r_{k+1}, x_{k+1})
31
On-Policy Advantage Based Critic
Objective of the critic: learn V^π for the current policy π
Step 1: Use the Bellman equation for V^π: V^π(x_k) = E[ r_{k+1} + γ V^π(x_{k+1}) ]
Step 2: Compute the advantage using the temporal difference
Â_k = r_{k+1} + γ V̂^π(x_{k+1}; w) − V̂^π(x_k; w), using the transition sample (x_k, u_k, r_{k+1}, x_{k+1})
Bootstrap: the next-state value V̂^π(x_{k+1}; w) is the current estimate, not the true V^π
32
On-Policy Advantage Based Critic
Step 3: Minimize the prediction error of the current estimate V̂^π(x; w)
cost: C(w) = ½ Σ_{k∈D} ( r_{k+1} + γ V̂^π(x_{k+1}; w) − V̂^π(x_k; w) )²
with D the set of observations (current sample, replay, entire batch, …)
33
On-Policy Advantage Based Critic
Step 3: Minimize the prediction error of the current estimate V̂^π(x; w)
cost: C(w) = ½ Σ_{k∈D} ( r_{k+1} + γ V̂^π(x_{k+1}; w) − V̂^π(x_k; w) )²
with D the set of observations (current sample, replay, entire batch, …)
Take the gradient of C(w), treating the bootstrapped target as fixed; SGD gives:
w ← w + α_c Â_k φ(x_k), with α_c the learning rate
34
On-Policy Advantage Based Critic
Step 3: Minimize the prediction error of the current estimate V̂^π(x; w)
cost: C(w) = ½ Σ_{k∈D} ( r_{k+1} + γ V̂^π(x_{k+1}; w) − V̂^π(x_k; w) )²
with D the set of observations (current sample, replay, entire batch, …)
Take the gradient of C(w), treating the bootstrapped target as fixed; SGD gives:
w ← w + α_c Â_k φ(x_k), with α_c the learning rate
See also the prediction problem in Lecture 6!
35
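For one transition, the critic step above can be sketched as follows, reusing the illustrative LinearCritic from the parameterization slide and treating the bootstrapped target as fixed (an assumption of this sketch):

```python
def critic_td_update(critic, phi_x, phi_x_next, r, gamma, alpha_c):
    """One TD(0)-style critic step: the TD error serves as the advantage estimate A_hat_k."""
    advantage = r + gamma * critic.value(phi_x_next) - critic.value(phi_x)
    critic.w += alpha_c * advantage * phi_x   # SGD step on the prediction-error cost
    return advantage
```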
On-Policy Advantage Based Critic
Â_k > 0: the old estimate is too low → increase it (too negative criticism in the past)
Â_k < 0: the old estimate is too high → decrease it (too enthusiastic opinion in the past)
36
On-Policy Advantage Based Actor
Remember REINFORCE + baseline: ∇_θ J(θ) = E_π[ (G_k − b(x_k)) ∇_θ ln π_θ(u_k | x_k) ]
Step 1: Use the Bellman equation + transition sample + critic-based estimate of V^π: G_k ≈ r_{k+1} + γ V̂^π(x_{k+1}; w)
Step 2: Use the baseline b(x_k) = V̂^π(x_k; w)
Advantage: Â_k = r_{k+1} + γ V̂^π(x_{k+1}; w) − V̂^π(x_k; w)
37
On-Policy Advantage Based Actor
Step 3: Achieve maximization of J(θ, x_k) via SGA:
θ ← θ + α_a Â_k ∇_θ ln π_θ(u_k | x_k), with α_a the learning rate and π_θ the probabilistic policy
Remember: our goal was to maximize the objective
38
On-Policy Advantage Based Actor
Step 4: Gradient calculation of ∇_θ ln π_θ(u_k | x_k)
Example: choose a Gaussian policy π_θ(u|x) = N(u; μ(x; θ), σ²) with μ(x; θ) = θ⊤ φ(x), giving
∇_θ ln π_θ(u_k | x_k) = ( (u_k − μ(x_k; θ)) / σ² ) φ(x_k)
39
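The corresponding actor step, reusing the illustrative GaussianActor sketch from earlier (its grad_log_pi implements exactly the Gaussian-policy gradient above):

```python
def actor_update(actor, phi_x, u, advantage, alpha_a):
    """One SGA step on the policy parameters, weighted by the advantage estimate."""
    actor.theta += alpha_a * advantage * actor.grad_log_pi(phi_x, u)
```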
On-Policy Advantage Based Actor
Â_k > 0: action u_k has an unforeseen advantage (exploit it more in the future)
Â_k < 0: action u_k has an unforeseen disadvantage (avoid it in the future)
40
Advantage Actor-Critic (A2C) Algorithm
Advantage: Â_k = r_{k+1} + γ V̂^π(x_{k+1}; w) − V̂^π(x_k; w)
Critic update (minimization): w ← w + α_c Â_k φ(x_k)
Actor update (maximization): θ ← θ + α_a Â_k ∇_θ ln π_θ(u_k | x_k)
Note: you can also run multiple actors in parallel
41
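Putting the pieces together, a minimal single-actor A2C loop could look like the sketch below; the Gym-style environment interface, feature map phi and hyperparameter values are assumptions for illustration, not the lecture's implementation.

```python
import numpy as np

def a2c(env, phi, critic, actor, gamma=0.98, alpha_c=1e-2, alpha_a=1e-3, steps=50_000):
    rng = np.random.default_rng(0)
    x = env.reset()
    for _ in range(steps):
        phi_x = phi(x)
        u = actor.sample(phi_x, rng)                      # act with the current stochastic policy
        x_next, r, done = env.step(u)
        phi_x_next = phi(x_next)

        # Critic: the TD error doubles as the advantage estimate
        advantage = r + gamma * (0.0 if done else critic.value(phi_x_next)) - critic.value(phi_x)
        critic.w += alpha_c * advantage * phi_x           # critic update (minimization)

        # Actor: stochastic gradient ascent on J using the advantage
        actor.theta += alpha_a * advantage * actor.grad_log_pi(phi_x, u)

        x = env.reset() if done else x_next
    return actor, critic
```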
Exploration-Exploitation: Entropy Regularizer
Maximize Value Function + Entropy Regularization
Entropy: Measure of uncertainty / information
42
Exploration-Exploitation: Entropy Regularizer
Maximize Value Function + Entropy Regularization
Entropy: measure of uncertainty / information, H(π_θ(·|x)) = − Σ_{u∈U} π_θ(u|x) ln π_θ(u|x),
which increases with uncertainty
Exploration-Exploitation Trade-Off
44
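For a discrete (softmax) policy the entropy bonus is straightforward to compute and add to the per-sample actor objective; the weight β and the function names are assumed for illustration.

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete policy distribution p = pi_theta(.|x); high when uncertain."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def regularized_actor_objective(advantage, log_pi_u, p, beta=0.01):
    """Per-sample actor objective: advantage-weighted log-likelihood + entropy bonus."""
    return advantage * log_pi_u + beta * entropy(p)
```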
Types of RL Control
By path to optimal solution
• Off-policy – find optimal policy regardless of the agent’s motivation
• On-policy – improve the policy that is used to make decisions
By level of interaction with the process
• Online – learn by interacting with the process
• Offline – data collected in advance (Monte-Carlo methods)
By model knowledge
• Model-free – no model, only transition data (e.g. standard Q-Learning RL)
• Model-based – the model is known (e.g. Dynamic Programming)
• Model-learning – estimate the model from transition data
45
Examples
46
Unbalanced Disc
• State: (angle and velocity)
• Input:
• Reward:
• Discount:
• Objective: stabilize top position (swing up)
• Insufficient actuation (needs to swing back & forth)
47
Unbalanced Disc
deterministic part only
Parameterization
Critic: Actor:
Basis functions are given by normalized RBFs on an equidistant grid
48
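A sketch of normalized RBF features on an equidistant grid over the (angle, velocity) state space; the grid resolution, widths and state ranges are illustrative assumptions, not the values used in the lecture.

```python
import numpy as np

def make_normalized_rbf(lows, highs, n_per_dim=9, width_scale=1.0):
    """Normalized Gaussian RBF features on an equidistant grid over the state space."""
    grids = [np.linspace(lo, hi, n_per_dim) for lo, hi in zip(lows, highs)]
    centers = np.array(np.meshgrid(*grids)).reshape(len(lows), -1).T     # (n_centers, n_dims)
    widths = np.array([(hi - lo) / (n_per_dim - 1) * width_scale for lo, hi in zip(lows, highs)])

    def phi(x):
        x = np.asarray(x, dtype=float)
        d2 = np.sum(((x - centers) / widths) ** 2, axis=1)
        a = np.exp(-0.5 * d2)
        return a / a.sum()                                               # normalization step
    return phi

# e.g. for the disc: angle in [-pi, pi], velocity in [-8, 8] rad/s (assumed ranges)
phi = make_normalized_rbf(lows=[-np.pi, -8.0], highs=[np.pi, 8.0])
```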
Unbalanced Disc
49
Unbalanced Disc
50
Cart with Pendulum
51
Cart with Pendulum
Hierarchical control loop. The position controller is fixed to be a pre-tuned PD controller.
The objective is to learn the angle controller in closed loop.
52
Cart with Pendulum: Learning
Performance during learning
53
Cart with Pendulum: RL vs Classical
Classical Control Design vs. Reinforcement Learning
54
Cart with Pendulum: Actor and Critic
Actor Critic
55
Proximal Policy Optimization
Changes: Longer optimization on data generated by old policy
Probability Ratio instead of Policy Probability
Clipping or KL Divergence
Increased Stability
No Policy Run-Off due to Large Updates
Observed Improved Sample Efficiency
56
Proximal Policy Optimization: Efficiency
1. Data generated by the old policy
2. Advantage computed on that data
We get policy run-off if we train the new policy too long on data from the old policy
57
Proximal Policy Optimization: Efficiency
1. Data generated by the old policy
2. Advantage computed on that data
We get policy run-off if we train the new policy too long on data from the old policy
But! Importance sampling will allow us to use samples from the old policy as if they came
from the current policy distribution by applying the correct weighting:
E_{u∼π_θ}[ f(u) ] = E_{u∼π_θ_old}[ ( π_θ(u|x) / π_θ_old(u|x) ) f(u) ]
58
Proximal Policy Optimization: Efficiency
1. Data generated by the old policy
2. Advantage computed on that data → wrong gradient information
We get policy run-off if we train the new policy too long on data from the old policy
But! Importance sampling: introduce the policy ratio ρ_k(θ) = π_θ(u_k|x_k) / π_θ_old(u_k|x_k),
as we have samples from the old policy; this takes us from the AC objective
E[ Â_k ln π_θ(u_k|x_k) ] towards the PPO surrogate objective E[ ρ_k(θ) Â_k ]
59
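Numerically, the policy ratio is typically formed from log-probabilities, and the unclipped surrogate is just the ratio times the advantage; a small illustrative sketch (function names assumed):

```python
import numpy as np

def policy_ratio(logp_new, logp_old):
    """rho_k(theta) = pi_theta(u_k|x_k) / pi_theta_old(u_k|x_k), from log-probabilities."""
    return np.exp(logp_new - logp_old)

def surrogate(logp_new, logp_old, advantage):
    """Importance-sampled (unclipped) surrogate objective: rho_k(theta) * A_hat_k."""
    return policy_ratio(logp_new, logp_old) * advantage
```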
Proximal Policy Optimization: Stability
Optimizing this surrogate directly can result in excessively large policy updates
Only update where we have high trust
Limit the range of our updates
60
Proximal Policy Optimization: Stability
Optimizing this surrogate directly can result in excessively large policy updates
Only update where we have high trust
Limit the range of our updates
Option 1: Use the KL divergence to measure the difference between the old and new
action distributions and add it as a penalty with a variable
weight in the cost
Option 2: Limit the policy ratio to a given range [1-ε, 1+ε]
61
Proximal Policy Optimization: Stability
Only update where we have high trust
Limit the range of our updates
Option 2: Limit the policy ratio to a given range [1-ε, 1+ε]
Clipping: clip( ρ_k(θ), 1−ε, 1+ε )
Source: J. Schulman et al., Proximal Policy Optimization Algorithms, arXiv:1707.06347
63
Proximal Policy Optimization: Stability
Introduce clipping + min:
L^CLIP(θ) = E[ min( ρ_k(θ) Â_k, clip(ρ_k(θ), 1−ε, 1+ε) Â_k ) ]
64
Proximal Policy Optimization: Cost
Total PPO cost (maximized): L(θ, w) = E[ L^CLIP_k(θ) − c_1 ( V̂(x_k; w) − V_k^target )² + c_2 H(π_θ(·|x_k)) ]
with the surrogate objective function L^CLIP (actor), the critic objective function (squared value error) and an entropy bonus for exploration
65
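A numpy sketch of the combined PPO cost for a batch of samples, following the clipped-surrogate form above; the coefficient values c1, c2 and ε are illustrative defaults, not the lecture's settings.

```python
import numpy as np

def ppo_cost(logp_new, logp_old, advantage, v_pred, v_target, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    """Batch PPO objective (to be maximized): clipped surrogate - c1 * value loss + c2 * entropy."""
    rho = np.exp(logp_new - logp_old)                     # policy ratio per sample
    unclipped = rho * advantage
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * advantage
    l_clip = np.minimum(unclipped, clipped).mean()        # pessimistic (min) surrogate
    l_value = np.mean((v_pred - v_target) ** 2)           # critic objective
    return l_clip - c1 * l_value + c2 * np.mean(entropy)
```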
Proximal Policy Optimization: Cost
Note: you can also run multiple actors in parallel
66
Actor-Critic Properties
Proof of convergence under various assumptions
Improved stability
Probabilistic in nature
On-policy and off-policy versions
Tabular and continuous formulation
Many tricks to improve performance (reward normalization, …)
Updates can be synchronous, asynchronous or episodic
Many versions (A2C, A3C, SAC, PPO, …)
67
Overview
Introduced policy gradient reinforcement learning
REINFORCE
Actor-Critic (A2C)
PPO
Can work with probabilistic policies and handle continuous state and action spaces
The fundamental dilemma remains:
• Efficiency of DP: models make it possible to plan and synthesize a policy; otherwise we
only rely on experience. Experiments are costly and risky.
• Efficiency of RL: experiments allow us to explore and improve exploitation in the long
run. Models are inherently uncertain.
• How to have a working marriage of DP and RL?
68