
Machine Learning for

Systems and Control


5SC28
Lecture 8

dr. ir. Maarten Schoukens & prof. dr. ir. Roland Tóth

Control Systems Group


Department of Electrical Engineering
Eindhoven University of Technology

Academic Year: 2024-2025 (version 25/05/2025)


Past Lectures

Introduction to Reinforcement Learning

Temporal Difference Learning

Value-Based Learning: Tabular Q-Learning

Approximate Q-Learning

DQN

2
Learning Goals
• Explain, discuss, compare and interpret the main techniques and
theory for reinforcement learning based control starting from classical
Q-learning up to actor-critic and model internalization-based
methods;
• Recommend and evaluate a machine learning method for a real-life
application in a systems and control setting.
• Implement, tailor, and apply the Gaussian process, (deep) neural
network and reinforcement learning techniques for model and
control learning on real-world systems (e.g. on an inverted pendulum
laboratory setup).

3
Action-Value Reinforcement Learning

Agent learns Q* or Qπ to improve its actions
4


Action-Value Reinforcement Learning
Concept: Learn the value of actions through Q*(x,u) or Qπ(x,u)

Strategy:

Training & Data ➔ θ ➔ Q-function ➔ Policy π

Disadvantages:
• A finite action space is needed to solve the maximization over actions, max_u Q(x,u)
• Poor exploration in case of a large action space (slow)

6
Policy Gradient Reinforcement Learning
Concept: Learn the policy that ensures maximum rewards for the
actions taken

Strategy: Training & Data ➔ θ ➔ Policy π

Main Advantages:
• No need to solve the maximization max_u Q(x,u)
• No need to discretize the action space
• Can handle stochastic policies
• Increased algorithm stability
(Dis)advantage: More complicated algorithm, more options to tune
8
Policy Gradient Methods

9
Stochastic Policy Representation
Stochastic policy: Learn a function that gives the probability
distribution over actions from the current state:

π_θ(u | x), with parameters θ, state x and action u

Stochastic policies can be optimal in a partially observed or competitive environment (e.g. card games)

10
Stochastic Policy Representation
Stochastic policy: Learn a function that gives the probability
distribution over actions from the current state:

Discrete actions: softmax policy

Continuous actions: Gaussian policy

11
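As a concrete illustration (not part of the slides), a minimal NumPy sketch of the two policy parameterizations; the linear feature map, the fixed standard deviation and the 2-dimensional example state are assumptions made for the example:

import numpy as np

def softmax_policy(theta, x, n_actions):
    """Discrete actions: p(u|x) proportional to exp(theta[u]^T x)."""
    logits = theta @ x                      # theta: (n_actions, n_features)
    logits -= logits.max()                  # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    u = np.random.choice(n_actions, p=p)
    return u, p

def gaussian_policy(theta, x, sigma=0.1):
    """Continuous actions: u ~ N(mu_theta(x), sigma^2) with a linear mean."""
    mu = theta @ x                          # theta: (n_features,)
    u = mu + sigma * np.random.randn()
    return u, mu

# Example: 2-dimensional state, 3 discrete actions
x = np.array([0.3, -1.2])
u_d, p = softmax_policy(np.zeros((3, 2)), x, n_actions=3)
u_c, mu = gaussian_policy(np.zeros(2), x)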
Policy Gradient: Optimization
Objective function:

Interpretation: J is the expected cumulative reward (via the value function) using the policy π.

How do we calculate the gradient of the value function to arrive at a gradient over the policy?

Solution: Stochastic Gradient Ascent (SGA)

Where α is the step size. Replay can be used to assist convergence.


14
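In standard policy-gradient notation (using the course's symbols x for the state, u for the action and θ for the policy parameters), a common form of the objective and of the SGA update is the following; the exact expectation over the initial state can be defined in slightly different ways:

J(\theta) = \mathbb{E}\!\left[ V^{\pi_\theta}(x) \right]
          = \mathbb{E}_{\pi_\theta}\!\Big[ \textstyle\sum_{k \ge 0} \gamma^{k} r_{k+1} \Big],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta).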
Policy Gradient Theorem
Assumptions:
Assume U is a discrete action space (for simplicity)
Probability that action u is taken for a given x,θ is nonzero over U

We know Qπ(x,u), i.e., we can characterize the effect of the current action made by the policy on the future based on Qπ

15
Policy Gradient Theorem
We can show in terms of the policy gradient theorem1,2:

Equivalent reformulation:
The derivative of the expected reward is the expectation of the product of the reward and the gradient of the log of the policy πθ.

Or:

1 R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction.” A Bradford Book, p. 356, 2008.
2 V. Konda and J. Tsitsiklis, “Actor-Critic Algorithms.” SIAM Journal on Control and Optimization, pp. 1008-1014, 2000.
17
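The step behind the reformulation is the likelihood-ratio (log-derivative) trick; in the standard notation the policy gradient theorem can be written as:

\nabla_\theta \pi_\theta(u \mid x) = \pi_\theta(u \mid x)\, \nabla_\theta \log \pi_\theta(u \mid x)
\;\;\Longrightarrow\;\;
\nabla_\theta J(\theta) \propto \mathbb{E}_{\pi_\theta}\!\big[\, Q^{\pi_\theta}(x,u)\, \nabla_\theta \log \pi_\theta(u \mid x) \,\big].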
Policy Gradient Algorithm Backbone

Replace u with the sampled action uk (the current action taken)

Implemented in the REINFORCE algorithm

19
REINFORCE Algorithm

Episodic algorithm! Complete a full episode for one parameter update

The episodic nature is needed to obtain an estimate of the return Gk

21
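A minimal NumPy sketch of REINFORCE with a softmax policy (an illustrative toy implementation, not the lecture's code; the env.reset()/env.step() interface and the linear features are assumptions):

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def run_reinforce(env, n_features, n_actions, alpha=1e-3, gamma=0.99, episodes=500):
    theta = np.zeros((n_actions, n_features))
    for _ in range(episodes):
        # 1) Generate a full episode with the current policy (episodic algorithm!)
        xs, us, rs = [], [], []
        x, done = env.reset(), False
        while not done:
            p = softmax(theta @ x)
            u = np.random.choice(n_actions, p=p)
            x_next, r, done = env.step(u)
            xs.append(x); us.append(u); rs.append(r)
            x = x_next
        # 2) Compute the return G_k for every step of the episode
        G, returns = 0.0, []
        for r in reversed(rs):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # 3) SGA update: grad log pi(u|x) = (e_u - p) x^T for a linear softmax policy
        for k, (x, u, G) in enumerate(zip(xs, us, returns)):
            p = softmax(theta @ x)
            grad_log_pi = -np.outer(p, x)
            grad_log_pi[u] += x
            theta += alpha * (gamma ** k) * G * grad_log_pi
    return theta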
Types of RL Control
By path to optimal solution
• Off-policy – find optimal policy regardless of the agent’s motivation
• On-policy – improve the policy that is used to make decisions
By level of interaction with the process
• Online – learn by interacting with the process
• Offline – data collected in advance (Monte-Carlo methods)
By model knowledge
• Model-free – no model, only transition data (e.g. standard Q-Learning RL)
• Model-based – the model is known (e.g. Dynamic Programming)
• Model-learning – estimate the model from transition data
22
REINFORCE + baseline algorithm

Baseline: We can introduce an arbitrary baseline b(x) without consequence*:

Explanation: the baseline does not introduce bias in the gradient,

but it reduces the variance of the gradient estimate in practice if b(x) is chosen well (e.g. close to Vπ(x))

* As long as b does not depend on u
24
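The reason the baseline adds no bias, under the footnoted condition that b does not depend on u, is that the extra term has zero expectation:

\mathbb{E}_{u \sim \pi_\theta(\cdot \mid x)}\!\big[ b(x)\, \nabla_\theta \log \pi_\theta(u \mid x) \big]
 = b(x) \sum_u \nabla_\theta \pi_\theta(u \mid x)
 = b(x)\, \nabla_\theta \underbrace{\sum_u \pi_\theta(u \mid x)}_{=1} = 0 .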


REINFORCE + baseline algorithm

Source: Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, second edition, Figure 13.2
25
Actor-Critic Methods

26
Why Actor-Critic?
Problem: REINFORCE uses episodic learning to estimate Gk
Idea: Introduce the value function (cf. Lectures 5 and 6)

Estimate it using the temporal difference approach (Lecture 6)

27
Actor-Critic Reinforcement Learning

Strategy: the critic aims to learn Vπ while the actor tries to improve π
28
Actor-Critic Overview
Explicitly separated value function and policy
Critic = state value function
Actor = control policy

Problem setting:
Continuous synchronous learning
Continuous action and state space
Deterministic state transition and reward function

29
Actor-Critic Parameterization
Critic: a linear combination of basis functions/features with a parameter vector

Actor: parameterized likewise, with its own basis functions/features and parameter vector

Alternatively, they can be described via GP, DNN, etc.

Don’t forget: probabilistic policies (softmax, Gaussian, ε-greedy)


30
On-Policy Advantage Based Critic
Objective of the critic: learn Vπ for the current policy π

Step 1: Use the Bellman equation for Vπ

Step 2: Compute advantage using temporal difference

use the transition sample

Bootstrap

32
On-Policy Advantage Based Critic
Step 3: Minimize the prediction error of the current estimate
cost:

set of observations (current sample, replay, entire batch, …)

Taking the gradient of this cost, SGD gives:

learning rate

See also the prediction problem in Lecture 6!


35
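A sketch of the critic update with a linear-in-the-parameters value function, assuming a feature map phi(x) and the TD/advantage estimate from Step 2 (illustrative names, not the lecture's code):

import numpy as np

def critic_td_update(w, phi, x, r_next, x_next, gamma, alpha_c):
    """One SGD step on the squared TD error for V(x) ≈ phi(x)^T w (semi-gradient)."""
    v, v_next = phi(x) @ w, phi(x_next) @ w
    delta = r_next + gamma * v_next - v      # TD error = advantage estimate
    w = w + alpha_c * delta * phi(x)         # semi-gradient step (bootstrapped target kept fixed)
    return w, delta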
On-Policy Advantage Based Critic

Old estimate is too low → increase it (the criticism in the past was too negative)

Old estimate is too high → decrease it (the opinion in the past was too enthusiastic)
36
On-Policy Advantage Based Actor
Remember REINFORCE + baseline:

Step 1 Use the Bellman equation + transition sample + critic-based estimate of Vπ

Step 2 Use the critic's value estimate as the baseline → the remaining weight is the Advantage
37
On-Policy Advantage Based Actor
Step 3 Achieve maximization of J(θ,xk) via SGA

learning rate

Probabilistic policy

Remember our goal was to maximize the objective

38
On-Policy Advantage Based Actor
Step 4 Gradient calculation

Example: choose Gaussian policy

39
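For the Gaussian-policy example the score function has a simple closed form; a sketch with a linear mean mu(x) = phi(x)^T theta and a fixed sigma, both assumed for illustration:

import numpy as np

def actor_update_gaussian(theta, phi, x, u, delta, sigma, alpha_a):
    """SGA step: theta += alpha * delta * grad_theta log N(u; mu_theta(x), sigma^2)."""
    mu = phi(x) @ theta
    grad_log_pi = (u - mu) / sigma**2 * phi(x)   # d/dtheta log N = ((u - mu)/sigma^2) * phi(x)
    return theta + alpha_a * delta * grad_log_pi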
On-Policy Advantage Based Actor

Action uk has an unforeseen advantage (exploit it more in the future)

Action uk has an unforeseen disadvantage (avoid it in the future)


40
Advantage Actor-Critic (A2C) Algorithm

advantage
critic update - minimization
actor update - maximization

Note: you can also run multiple actors in parallel


41
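Putting the two updates together gives the backbone of an A2C loop; the sketch below is a toy single-actor version under the assumptions that the environment exposes reset()/step() and that both actor and critic are linear in the features phi:

import numpy as np

def a2c(env, phi, n_features, sigma=0.3, gamma=0.99,
        alpha_c=1e-2, alpha_a=1e-3, steps=10000):
    w = np.zeros(n_features)        # critic parameters, V(x) ≈ phi(x)^T w
    theta = np.zeros(n_features)    # actor parameters, u ~ N(phi(x)^T theta, sigma^2)
    x = env.reset()
    for _ in range(steps):
        u = phi(x) @ theta + sigma * np.random.randn()      # sample action from Gaussian policy
        x_next, r, done = env.step(u)
        # critic: advantage via the TD error, then minimize the prediction error
        delta = r + gamma * (0.0 if done else phi(x_next) @ w) - phi(x) @ w
        w += alpha_c * delta * phi(x)
        # actor: maximize J via SGA, weighted by the advantage estimate
        theta += alpha_a * delta * (u - phi(x) @ theta) / sigma**2 * phi(x)
        x = env.reset() if done else x_next
    return theta, w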
Exploration-Exploitation: Entropy Regularizer
Maximize Value Function + Entropy Regularization

Entropy: Measure of uncertainty / information (increases with uncertainty)

Exploration-Exploitation Trade-Off
44
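A small sketch of the entropy terms typically used for this regularization; the coefficient beta and the way it is combined with the objective are illustrative assumptions:

import numpy as np

def softmax_entropy(p):
    """Entropy of a discrete (softmax) policy: H = -sum_u p(u) log p(u)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def gaussian_entropy(sigma):
    """Entropy of a scalar Gaussian policy: H = 0.5 * log(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Entropy-regularized objective: encourage exploration by maximizing
#   J(theta) + beta * H(pi_theta)   (beta trades off exploration vs exploitation)
beta = 0.01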
Types of RL Control
By path to optimal solution
• Off-policy – find optimal policy regardless of the agent’s motivation
• On-policy – improve the policy that is used to make decisions
By level of interaction with the process
• Online – learn by interacting with the process
• Offline – data collected in advance (Monte-Carlo methods)
By model knowledge
• Model-free – no model, only transition data (e.g. standard Q-Learning RL)
• Model-based – the model is known (e.g. Dynamic Programming)
• Model-learning – estimate the model from transition data
45
Examples

46
Unbalanced Disc
• State: (angle and velocity)
• Input:
• Reward:
• Discount:

• Objective: stabilize top position (swing up)


• Insufficient actuation (needs to swing back & forth)

47
Unbalanced Disc
Parameterization (deterministic part only)
Critic and Actor: basis functions are given by normalized RBFs with equidistant gridding

48
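A sketch of normalized RBF features on an equidistant grid over the (angle, velocity) state, as could be used for both critic and actor; the grid sizes, ranges and widths below are illustrative assumptions, not the lecture's values:

import numpy as np

def make_normalized_rbf(angle_grid, vel_grid, width):
    """Normalized radial basis functions on an equidistant 2-D grid."""
    centers = np.array([[a, v] for a in angle_grid for v in vel_grid])
    def phi(x):
        d2 = np.sum((centers - np.asarray(x))**2 / width**2, axis=1)
        g = np.exp(-0.5 * d2)
        return g / g.sum()              # normalization: features sum to one
    return phi

# Example: 11 x 11 grid over angle in [-pi, pi] and velocity in [-10, 10] rad/s
phi = make_normalized_rbf(np.linspace(-np.pi, np.pi, 11),
                          np.linspace(-10, 10, 11),
                          width=np.array([0.5, 2.0]))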
Unbalanced Disc

49
Unbalanced Disc

50
Cart with Pendulum

51
Cart with Pendulum

Hierarchical control loop. The position controller is fixed to be a pre-tuned PD controller.


The objective is to learn the angle controller in closed loop.

52
Cart with Pendulum: Learning

Performance during learning

53
Cart with Pendulum: RL vs Classical

Classical Control Design Reinforcement Learning


54
Cart with Pendulum: Actor and Critic

Actor Critic
55
Proximal Policy Optimization

Changes:
• Longer optimization on data generated by the old policy
• Probability Ratio instead of Policy Probability
• Clipping or KL Divergence

• Increased Stability
• No Policy Run-Off due to Large Updates
• Observed Improved Sample Efficiency
56
Proximal Policy Optimization: Efficiency
1. Data generated by the old policy
2. Advantage computed on that data
→ We get policy run-off if we train the new policy too long on data from the old policy

But! Importance Sampling:

Will allow us to use samples from the old policy as if they came from the current policy distribution, by applying the correct weighting

58
Proximal Policy Optimization: Efficiency
1. Data generated by the old policy
2. Advantage computed on that data → wrong gradient information
→ We get policy run-off if we train the new policy too long on data from the old policy

But! Importance Sampling:

Introduce the policy ratio, as we have samples from the old policy (moving from the AC gradient estimate towards the PPO objective)
59
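The importance-sampling correction is usually implemented with log-probabilities; a minimal sketch (the names logp_new/logp_old and the averaging over a batch are illustrative assumptions):

import numpy as np

def policy_ratio(logp_new, logp_old):
    """r = pi_theta(u|x) / pi_theta_old(u|x), computed from log-probabilities."""
    return np.exp(logp_new - logp_old)

def surrogate(logp_new, logp_old, advantage):
    """Re-weighted objective on old-policy samples: mean of r * A over the batch."""
    return np.mean(policy_ratio(logp_new, logp_old) * advantage)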
Proximal Policy Optimization: Stability

Optimizing this results in excessively large policy updates


Only update where we have high trust
Limit the range of our updates

Option 1: Use KL-divergence to measure the difference between action distributions and add it as a penalty with a variable weight in the cost
Option 2: Limit the policy ratio to a given range [1-ε, 1+ε]

61
Proximal Policy Optimization: Stability
Only update where we have high trust
Limit the range of our updates

Option 2: Limit the policy ratio to a given range [1-ε, 1+ε]


Clipping:

Source: J. Schulman et al., Proximal Policy Optimization Algorithms, arXiv:1707.06347

63
Proximal Policy Optimization: Stability

Introduce Clipping + Min

64
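A sketch of the clipped surrogate of Schulman et al. (2017); epsilon = 0.2 is a typical choice, assumed here:

import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """L_clip = mean[ min(r * A, clip(r, 1-eps, 1+eps) * A) ] — to be maximized."""
    r = np.exp(logp_new - logp_old)
    return np.mean(np.minimum(r * advantage,
                              np.clip(r, 1.0 - eps, 1.0 + eps) * advantage))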
Proximal Policy Optimization: Cost

Entropy

Surrogate Objective Function

Critic Objective Function

65
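The three terms on the slide are typically combined into a single loss, as in Schulman et al. (2017); the weighting coefficients c1 and c2 below are common choices assumed for illustration:

import numpy as np

def ppo_loss(logp_new, logp_old, advantage, v_pred, v_target, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    """Total PPO cost: maximize L_clip - c1 * critic error + c2 * entropy
    (written here as a loss to minimize, i.e. with the sign flipped)."""
    r = np.exp(logp_new - logp_old)
    l_clip = np.mean(np.minimum(r * advantage,
                                np.clip(r, 1.0 - eps, 1.0 + eps) * advantage))
    l_vf = np.mean((v_pred - v_target) ** 2)       # critic objective
    return -(l_clip - c1 * l_vf + c2 * np.mean(entropy))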
Proximal Policy Optimization: Cost

Note: you can also run multiple actors in parallel


66
Actor-Critic Properties
Proof of convergence under various assumptions
Improved stability
Probabilistic in nature
On-policy and off-policy versions
Tabular and continuous formulation
Many tricks to improve performance (reward normalization, …)
Updates can be synchronous, asynchronous or episodic
Many versions (A2C, A3C, SAC, PPO, …)
67
Overview
Introduced policy gradient reinforcement learning
REINFORCE
Actor-Critic (A2C)
PPO
Can work with probabilistic policies and handle continuous state and action spaces

Fundamental dilemma remains:


• Efficiency of DP: Models make it possible to plan and synthesize a policy; otherwise we only rely on experience. Experiments are costly and risky.
• Efficiency of RL: Experiments allow us to explore and improve exploitation in the long run. Models are inherently uncertain.
• How to have a working marriage of DP and RL?
68
