Machine Learning for
Systems and Control
5SC28
Lecture 8
dr. ir. Maarten Schoukens & prof. dr. ir. Roland Tóth
Control Systems Group
Department of Electrical Engineering
Eindhoven University of Technology
Academic Year: 2024-2025 (version 25/05/2025)
Past Lectures
Introduction to Reinforcement Learning
Temporal Difference Learning
Value-Based Learning: Tabular Q-Learning
Approximate Q-Learning
DQN
2
Learning Goals
• Explain, discuss, compare and interpret the main techniques and
theory for reinforcement-learning-based control, starting from classical
Q-learning up to actor-critic and model-internalization-based
methods;
• Recommend and evaluate a machine learning method for a real-life
application in a systems and control setting;
• Implement, tailor, and apply Gaussian process, (deep) neural
network and reinforcement learning techniques for model and
control learning on real-world systems (e.g. on an inverted pendulum
laboratory setup).
3
Action-Value Reinforcement Learning
Agent learns Q* or Qπ to improve its actions
4
Action-Value Reinforcement Learning
Concept: Learn the value of actions through Q*(x,u) or Qπ(x,u)
Strategy:
Training & Data ➔ θ ➔ Q-function ➔ Policy π
5
Action-Value Reinforcement Learning
Concept: Learn the value of actions through Q*(x,u) or Qπ(x,u)
Strategy:
Training & Data ➔ θ ➔ Q-function ➔ Policy π
Disadvantages: A finite action space is needed to solve max_u Q(x,u)
Poor exploration in case of a large action space (slow)
6
Policy Gradient Reinforcement Learning
Concept: Learn the policy that ensures maximum rewards for the
actions taken
Strategy: Training & Data ➔ θ ➔ Policy π
7
Policy Gradient Reinforcement Learning
Concept: Learn the policy that ensures maximum rewards for the
actions taken
Strategy: Training & Data ➔ θ ➔ Policy π
Main Advantages: No need to solve max_u Q(x,u)
No need to discretize the action space
Can handle stochastic policies
Increased algorithm stability
(Dis)advantage: More complicated algorithm, more options to tune
8
Policy Gradient Methods
9
Stochastic Policy Representation
Stochastic policy: Learn a function that gives the probability
distribution over actions given the current state:
π_θ(u | x) = P( u_k = u | x_k = x, θ ), with θ the parameters, x the state and u the action
Stochastic policies can be optimal in a partially observed or competitive
environment (e.g. card games)
10
Stochastic Policy Representation
Stochastic policy: Learn a function that gives the probability
distribution over actions from the current state:
Discrete actions: softmax policy
Continuous actions: Gaussian policy
11
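To make these two standard parameterizations concrete, here is a minimal numpy sketch; the linear preference/mean structure and the feature vector phi_x are illustrative assumptions, not the lecture's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Softmax policy for a discrete action set (illustrative linear preferences) ---
def softmax_policy(theta, phi_x):
    """theta: (n_actions, n_features), phi_x: (n_features,) feature vector of state x."""
    h = theta @ phi_x                      # action preferences h_theta(x, u)
    p = np.exp(h - h.max())                # numerically stable softmax
    return p / p.sum()                     # pi_theta(u | x), sums to 1 over U

# --- Gaussian policy for a continuous action (illustrative linear mean) ---
def gaussian_policy_sample(theta_mu, sigma, phi_x):
    mu = theta_mu @ phi_x                  # mean mu_theta(x)
    u = rng.normal(mu, sigma)              # sample u ~ N(mu_theta(x), sigma^2)
    logp = -0.5 * ((u - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return u, logp
```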
Policy Gradient: Optimization
Objective function: J(θ) = V^{π_θ}(x_0) = E_{π_θ}[ Σ_{k≥0} γ^k r_{k+1} ]
Interpretation: J is the expected cumulative reward (via the value
function) using the policy π.
12
Policy Gradient: Optimization
Objective function: J(θ) = V^{π_θ}(x_0) = E_{π_θ}[ Σ_{k≥0} γ^k r_{k+1} ]
Interpretation: J is the expected cumulative reward (via the value
function) using the policy π.
Solution: Stochastic Gradient Ascent (SGA): θ_{i+1} = θ_i + α ∇_θ J(θ_i)
where α is the step size. Replay can be used to assist convergence.
13
Policy Gradient: Optimization
Objective function: J(θ) = V^{π_θ}(x_0) = E_{π_θ}[ Σ_{k≥0} γ^k r_{k+1} ]
Interpretation: J is the expected cumulative reward (via the value
function) using the policy π.
How do we calculate the value gradient to arrive at a gradient over
the policy?
Solution: Stochastic Gradient Ascent (SGA): θ_{i+1} = θ_i + α ∇_θ J(θ_i)
where α is the step size. Replay can be used to assist convergence.
14
Policy Gradient Theorem
Assumptions:
Assume U is a discrete action space (for simplicity)
The probability π_θ(u|x) that action u is taken for a given x, θ is nonzero over U
We know Q^π(x,u),
i.e., we can characterize the effect of the current action
made by the policy on the future based on Q^π
15
Policy Gradient Theorem
We can show in terms of the policy gradient theorem1,2:
∇_θ J(θ) = E_π[ Σ_{u∈U} Q^π(x_k, u) ∇_θ π_θ(u | x_k) ]
Equivalent reformulation:
∇_θ J(θ) = E_π[ Q^π(x_k, u_k) ∇_θ ln π_θ(u_k | x_k) ]
Or:
∇_θ J(θ) = E_π[ G_k ∇_θ ln π_θ(u_k | x_k) ]
1 R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction.” A Bradford Book, pp. 356, 2008.
2 V. Konda and J. Tsitsiklis, “Actor-Critic Algorithms.” SIAM Journal on Control and Optimization, pp. 1008-1014, 2000.
16
Policy Gradient Theorem
We can show in terms of the policy gradient theorem1,2:
∇_θ J(θ) = E_π[ Σ_{u∈U} Q^π(x_k, u) ∇_θ π_θ(u | x_k) ]
Equivalent reformulation:
∇_θ J(θ) = E_π[ Q^π(x_k, u_k) ∇_θ ln π_θ(u_k | x_k) ]
The derivative of the expected reward is the expectation of the
product of the return (or Q-value) and the gradient of the log of the policy π_θ.
Or:
∇_θ J(θ) = E_π[ G_k ∇_θ ln π_θ(u_k | x_k) ]
1 R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction.” A Bradford Book, pp. 356, 2008.
2 V. Konda and J. Tsitsiklis, “Actor-Critic Algorithms.” SIAM Journal on Control and Optimization, pp. 1008-1014, 2000.
17
Policy Gradient Algorithm Backbone
Replace u with the sampled action u_k ∼ π_θ(· | x_k) as the current action taken
18
Policy Gradient Algorithm Backbone
Replace u with the sampled action u_k ∼ π_θ(· | x_k) as the current action taken
Implement in the REINFORCE algorithm
19
REINFORCE Algorithm
20
REINFORCE Algorithm
Episodic algorithm! A full episode is completed for one parameter update
The episodic nature is needed to obtain an estimate of the return G_k
21
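A minimal REINFORCE sketch for a discrete action space, assuming a softmax policy with linear action preferences and a Gym-style environment interface (env.reset() / env.step(u) returning (x_next, r, done)); these names and the feature map phi are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def reinforce(env, phi, n_features, n_actions, alpha=1e-3, gamma=0.99, episodes=500):
    """REINFORCE: Monte-Carlo policy gradient with a softmax policy (illustrative sketch)."""
    rng = np.random.default_rng(0)
    theta = np.zeros((n_actions, n_features))

    def policy(x):
        h = theta @ phi(x)
        p = np.exp(h - h.max())
        return p / p.sum()

    for _ in range(episodes):
        # 1) Roll out one full episode with the current policy
        states, actions, rewards = [], [], []
        x, done = env.reset(), False
        while not done:
            p = policy(x)
            u = rng.choice(n_actions, p=p)
            x_next, r, done = env.step(u)
            states.append(x)
            actions.append(u)
            rewards.append(r)
            x = x_next

        # 2) Compute returns G_k backwards and take one SGA step per visited state
        G = 0.0
        for k in reversed(range(len(rewards))):
            G = rewards[k] + gamma * G
            p = policy(states[k])
            grad_log_pi = np.outer(-p, phi(states[k]))       # rows: -pi(j|x) * phi(x)
            grad_log_pi[actions[k]] += phi(states[k])        # extra term for the taken action
            theta += alpha * (gamma ** k) * G * grad_log_pi  # SGA step on J(theta)
    return theta
```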
Types of RL Control
By path to optimal solution
• Off-policy – find optimal policy regardless of the agent’s motivation
• On-policy – improve the policy that is used to make decisions
By level of interaction with the process
• Online – learn by interacting with the process
• Offline – data collected in advance (Monte-Carlo methods)
By model knowledge
• Model-free – no model, only transition data (e.g. standard Q-Learning RL)
• Model-based – the model is known (e.g. Dynamic Programming)
• Model-learning – estimate the model from transition data
22
REINFORCE + baseline algorithm
23
REINFORCE + baseline algorithm
Baseline: We can introduce an arbitrary baseline b(x) without consequence*:
∇_θ J(θ) = E_π[ (G_k − b(x_k)) ∇_θ ln π_θ(u_k | x_k) ]
Explanation: the baseline does not introduce bias in the gradient,
but it reduces the variance of the gradient estimate in practice if b(x) is chosen well (e.g. b(x) ≈ V^π(x))
* As long as b does not depend on u
24
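Why the baseline adds no bias can be seen in one line; this is a sketch of the standard argument, for a discrete action space and a fixed state x:

```latex
\mathbb{E}_{u \sim \pi_\theta(\cdot \mid x)}\!\left[\, b(x)\, \nabla_\theta \ln \pi_\theta(u \mid x) \right]
  = b(x) \sum_{u \in U} \pi_\theta(u \mid x)\, \frac{\nabla_\theta \pi_\theta(u \mid x)}{\pi_\theta(u \mid x)}
  = b(x)\, \nabla_\theta \sum_{u \in U} \pi_\theta(u \mid x)
  = b(x)\, \nabla_\theta 1 = 0 .
```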
REINFORCE + baseline algorithm
Source: Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, second edition, Figure 13.2
25
Actor-Critic Methods
26
Why Actor-Critic?
Problem: REINFORCE uses episodic learning to estimate G_k
Idea: Introduce the value function V^π(x) (cf. Lectures 5 and 6)
Estimate V^π using a temporal difference
approach (Lecture 6)
27
Actor-Critic Reinforcement Learning
Strategy: the critic aims to learn Vπ while the actor tries to improve π
28
Actor-Critic Overview
Explicitly separated value function and policy
Critic = state value function
Actor = control policy
Problem setting:
Continuous synchronous learning
Continuous action and state space
Deterministic state transition and reward function
29
Actor-Critic Parameterization
Critic: V̂^π(x; w) = w⊤ φ(x), with φ(x) the basis functions/features and w the parameter vector
Actor: π_θ(u|x), e.g. a Gaussian policy with mean μ(x; θ) = θ⊤ φ(x)
Alternatively, they can be described via a GP, DNN, etc.
Don’t forget: probabilistic policies (softmax, Gaussian, ε-greedy)
30
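A small numpy sketch of one possible instance of this parameterization (linear critic, Gaussian actor with linear mean); the class names and the fixed σ are illustrative assumptions.

```python
import numpy as np

class LinearCritic:
    """State-value function V(x) ≈ w^T phi(x)."""
    def __init__(self, n_features):
        self.w = np.zeros(n_features)
    def value(self, phi_x):
        return self.w @ phi_x

class GaussianActor:
    """Gaussian policy: u ~ N(mu(x), sigma^2) with mu(x) = theta^T phi(x)."""
    def __init__(self, n_features, sigma=0.3):
        self.theta = np.zeros(n_features)
        self.sigma = sigma
    def sample(self, phi_x, rng):
        mu = self.theta @ phi_x
        return rng.normal(mu, self.sigma)
    def grad_log_pi(self, phi_x, u):
        mu = self.theta @ phi_x
        return (u - mu) / self.sigma ** 2 * phi_x   # d/d theta of log N(u; mu, sigma^2)
```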
On-Policy Advantage Based Critic
Objective of the critic: learn V^π for the current policy π
Step 1: Use the Bellman equation for V^π: V^π(x_k) = E[ r_{k+1} + γ V^π(x_{k+1}) ]
Sampled version: V^π(x_k) ≈ r_{k+1} + γ V^π(x_{k+1}) for a transition sample (x_k, u_k, r_{k+1}, x_{k+1})
31
On-Policy Advantage Based Critic
Objective of the critic: learn V^π for the current policy π
Step 1: Use the Bellman equation for V^π: V^π(x_k) = E[ r_{k+1} + γ V^π(x_{k+1}) ]
Step 2: Compute the advantage using the temporal difference
Â_k = r_{k+1} + γ V̂^π(x_{k+1}; w) − V̂^π(x_k; w), using the transition sample (x_k, u_k, r_{k+1}, x_{k+1})
Bootstrap: the next-state value V̂^π(x_{k+1}; w) is the current estimate, not the true V^π
32
On-Policy Advantage Based Critic
Step 3: Minimize the prediction error of the current estimate V̂^π(x; w)
cost: C(w) = ½ Σ_{k∈D} ( r_{k+1} + γ V̂^π(x_{k+1}; w) − V̂^π(x_k; w) )²
with D the set of observations (current sample, replay, entire batch, …)
33
On-Policy Advantage Based Critic
Step 3: Minimize the prediction error of the current estimate V̂^π(x; w)
cost: C(w) = ½ Σ_{k∈D} ( r_{k+1} + γ V̂^π(x_{k+1}; w) − V̂^π(x_k; w) )²
with D the set of observations (current sample, replay, entire batch, …)
Take the gradient of C(w), treating the bootstrapped target as fixed; SGD gives:
w ← w + α_c Â_k φ(x_k), with α_c the learning rate
34
On-Policy Advantage Based Critic
Step 3: Minimize the prediction error of the current estimate V̂^π(x; w)
cost: C(w) = ½ Σ_{k∈D} ( r_{k+1} + γ V̂^π(x_{k+1}; w) − V̂^π(x_k; w) )²
with D the set of observations (current sample, replay, entire batch, …)
Take the gradient of C(w), treating the bootstrapped target as fixed; SGD gives:
w ← w + α_c Â_k φ(x_k), with α_c the learning rate
See also the prediction problem in Lecture 6!
35
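For one transition, the critic step above can be sketched as follows, reusing the illustrative LinearCritic from the parameterization slide and treating the bootstrapped target as fixed (an assumption of this sketch):

```python
def critic_td_update(critic, phi_x, phi_x_next, r, gamma, alpha_c):
    """One TD(0)-style critic step: the TD error serves as the advantage estimate A_hat_k."""
    advantage = r + gamma * critic.value(phi_x_next) - critic.value(phi_x)
    critic.w += alpha_c * advantage * phi_x   # SGD step on the prediction-error cost
    return advantage
```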
On-Policy Advantage Based Critic
Â_k > 0: the old estimate is too low → increase it (too negative criticism in the past)
Â_k < 0: the old estimate is too high → decrease it (too enthusiastic opinion in the past)
36
On-Policy Advantage Based Actor
Remember REINFORCE + baseline: ∇_θ J(θ) = E_π[ (G_k − b(x_k)) ∇_θ ln π_θ(u_k | x_k) ]
Step 1: Use the Bellman equation + transition sample + critic-based estimate of V^π: G_k ≈ r_{k+1} + γ V̂^π(x_{k+1}; w)
Step 2: Use the baseline b(x_k) = V̂^π(x_k; w)
Advantage: Â_k = r_{k+1} + γ V̂^π(x_{k+1}; w) − V̂^π(x_k; w)
37
On-Policy Advantage Based Actor
Step 3: Achieve maximization of J(θ, x_k) via SGA:
θ ← θ + α_a Â_k ∇_θ ln π_θ(u_k | x_k), with α_a the learning rate and π_θ the probabilistic policy
Remember: our goal was to maximize the objective
38
On-Policy Advantage Based Actor
Step 4: Gradient calculation of ∇_θ ln π_θ(u_k | x_k)
Example: choose a Gaussian policy π_θ(u|x) = N(u; μ(x; θ), σ²) with μ(x; θ) = θ⊤ φ(x), giving
∇_θ ln π_θ(u_k | x_k) = ( (u_k − μ(x_k; θ)) / σ² ) φ(x_k)
39
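The corresponding actor step, reusing the illustrative GaussianActor sketch from earlier (its grad_log_pi implements exactly the Gaussian-policy gradient above):

```python
def actor_update(actor, phi_x, u, advantage, alpha_a):
    """One SGA step on the policy parameters, weighted by the advantage estimate."""
    actor.theta += alpha_a * advantage * actor.grad_log_pi(phi_x, u)
```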
On-Policy Advantage Based Actor
Â_k > 0: action u_k has an unforeseen advantage (exploit it more in the future)
Â_k < 0: action u_k has an unforeseen disadvantage (avoid it in the future)
40
Advantage Actor-Critic (A2C) Algorithm
Advantage: Â_k = r_{k+1} + γ V̂^π(x_{k+1}; w) − V̂^π(x_k; w)
Critic update (minimization): w ← w + α_c Â_k φ(x_k)
Actor update (maximization): θ ← θ + α_a Â_k ∇_θ ln π_θ(u_k | x_k)
Note: you can also run multiple actors in parallel
41
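Putting the pieces together, a minimal single-actor A2C loop could look like the sketch below; the Gym-style environment interface, feature map phi and hyperparameter values are assumptions for illustration, not the lecture's implementation.

```python
import numpy as np

def a2c(env, phi, critic, actor, gamma=0.98, alpha_c=1e-2, alpha_a=1e-3, steps=50_000):
    rng = np.random.default_rng(0)
    x = env.reset()
    for _ in range(steps):
        phi_x = phi(x)
        u = actor.sample(phi_x, rng)                      # act with the current stochastic policy
        x_next, r, done = env.step(u)
        phi_x_next = phi(x_next)

        # Critic: the TD error doubles as the advantage estimate
        advantage = r + gamma * (0.0 if done else critic.value(phi_x_next)) - critic.value(phi_x)
        critic.w += alpha_c * advantage * phi_x           # critic update (minimization)

        # Actor: stochastic gradient ascent on J using the advantage
        actor.theta += alpha_a * advantage * actor.grad_log_pi(phi_x, u)

        x = env.reset() if done else x_next
    return actor, critic
```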
Exploration-Exploitation: Entropy Regularizer
Maximize Value Function + Entropy Regularization
Entropy: Measure of uncertainty / information
42
Exploration-Exploitation: Entropy Regularizer
Maximize Value Function + Entropy Regularization
Entropy: measure of uncertainty / information, H(π_θ(·|x)) = − Σ_{u∈U} π_θ(u|x) ln π_θ(u|x),
which increases with uncertainty
Exploration-Exploitation Trade-Off
44
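For a discrete (softmax) policy the entropy bonus is straightforward to compute and add to the per-sample actor objective; the weight β and the function names are assumed for illustration.

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete policy distribution p = pi_theta(.|x); high when uncertain."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def regularized_actor_objective(advantage, log_pi_u, p, beta=0.01):
    """Per-sample actor objective: advantage-weighted log-likelihood + entropy bonus."""
    return advantage * log_pi_u + beta * entropy(p)
```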
Types of RL Control
By path to optimal solution
• Off-policy – find optimal policy regardless of the agent’s motivation
• On-policy – improve the policy that is used to make decisions
By level of interaction with the process
• Online – learn by interacting with the process
• Offline – data collected in advance (Monte-Carlo methods)
By model knowledge
• Model-free – no model, only transition data (e.g. standard Q-Learning RL)
• Model-based – the model is known (e.g. Dynamic Programming)
• Model-learning – estimate the model from transition data
45
Examples
46
Unbalanced Disc
• State: (angle and velocity)
• Input:
• Reward:
• Discount:
• Objective: stabilize top position (swing up)
• Insufficient actuation (needs to swing back & forth)
47
Unbalanced Disc
deterministic part only
Parameterization
Critic: Actor:
Basis functions are given by normalized RBFs on an equidistant grid
48
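A sketch of normalized RBF features on an equidistant grid over the (angle, velocity) state space; the grid resolution, widths and state ranges are illustrative assumptions, not the values used in the lecture.

```python
import numpy as np

def make_normalized_rbf(lows, highs, n_per_dim=9, width_scale=1.0):
    """Normalized Gaussian RBF features on an equidistant grid over the state space."""
    grids = [np.linspace(lo, hi, n_per_dim) for lo, hi in zip(lows, highs)]
    centers = np.array(np.meshgrid(*grids)).reshape(len(lows), -1).T     # (n_centers, n_dims)
    widths = np.array([(hi - lo) / (n_per_dim - 1) * width_scale for lo, hi in zip(lows, highs)])

    def phi(x):
        x = np.asarray(x, dtype=float)
        d2 = np.sum(((x - centers) / widths) ** 2, axis=1)
        a = np.exp(-0.5 * d2)
        return a / a.sum()                                               # normalization step
    return phi

# e.g. for the disc: angle in [-pi, pi], velocity in [-8, 8] rad/s (assumed ranges)
phi = make_normalized_rbf(lows=[-np.pi, -8.0], highs=[np.pi, 8.0])
```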
Unbalanced Disc
49
Unbalanced Disc
50
Cart with Pendulum
51
Cart with Pendulum
Hierarchical control loop. The position controller is fixed to be a pre-tuned PD controller.
The objective is to learn the angle controller in closed loop.
52
Cart with Pendulum: Learning
Performance during learning
53
Cart with Pendulum: RL vs Classical
Classical Control Design vs. Reinforcement Learning
54
Cart with Pendulum: Actor and Critic
Actor Critic
55
Proximal Policy Optimization
Changes: Longer optimization on data generated by old policy
Probability Ratio instead of Policy Probability
Clipping or KL Divergence
Increased Stability
No Policy Run-Off due to Large Updates
Observed Improved Sample Efficiency
56
Proximal Policy Optimization: Efficiency
1. Data generated by the old policy
2. Advantage computed on that data
We get policy run-off if we train the new policy too long on data from the old policy
57
Proximal Policy Optimization: Efficiency
1. Data generated by the old policy
2. Advantage computed on that data
We get policy run-off if we train the new policy too long on data from the old policy
But! Importance sampling will allow us to use samples from the old policy as if they came
from the current policy distribution by applying the correct weighting:
E_{u∼π_θ}[ f(u) ] = E_{u∼π_θ_old}[ ( π_θ(u|x) / π_θ_old(u|x) ) f(u) ]
58
Proximal Policy Optimization: Efficiency
1. Data generated by the old policy
2. Advantage computed on that data → wrong gradient information
We get policy run-off if we train the new policy too long on data from the old policy
But! Importance sampling: introduce the policy ratio ρ_k(θ) = π_θ(u_k|x_k) / π_θ_old(u_k|x_k),
as we have samples from the old policy; this takes us from the AC objective
E[ Â_k ln π_θ(u_k|x_k) ] towards the PPO surrogate objective E[ ρ_k(θ) Â_k ]
59
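Numerically, the policy ratio is typically formed from log-probabilities, and the unclipped surrogate is just the ratio times the advantage; a small illustrative sketch (function names assumed):

```python
import numpy as np

def policy_ratio(logp_new, logp_old):
    """rho_k(theta) = pi_theta(u_k|x_k) / pi_theta_old(u_k|x_k), from log-probabilities."""
    return np.exp(logp_new - logp_old)

def surrogate(logp_new, logp_old, advantage):
    """Importance-sampled (unclipped) surrogate objective: rho_k(theta) * A_hat_k."""
    return policy_ratio(logp_new, logp_old) * advantage
```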
Proximal Policy Optimization: Stability
Optimizing this surrogate directly can result in excessively large policy updates
Only update where we have high trust
Limit the range of our updates
60
Proximal Policy Optimization: Stability
Optimizing this surrogate directly can result in excessively large policy updates
Only update where we have high trust
Limit the range of our updates
Option 1: Use the KL divergence to measure the difference between the old and new
action distributions and add it as a penalty with a variable
weight in the cost
Option 2: Limit the policy ratio to a given range [1-ε, 1+ε]
61
Proximal Policy Optimization: Stability
Only update where we have high trust
Limit the range of our updates
Option 2: Limit the policy ratio to a given range [1-ε, 1+ε]
Clipping: clip( ρ_k(θ), 1−ε, 1+ε )
Source: J. Schulman et al., Proximal Policy Optimization Algorithms, arXiv:1707.06347
63
Proximal Policy Optimization: Stability
Introduce clipping + min:
L^CLIP(θ) = E[ min( ρ_k(θ) Â_k, clip(ρ_k(θ), 1−ε, 1+ε) Â_k ) ]
64
Proximal Policy Optimization: Cost
Total PPO cost (maximized): L(θ, w) = E[ L^CLIP_k(θ) − c_1 ( V̂(x_k; w) − V_k^target )² + c_2 H(π_θ(·|x_k)) ]
with the surrogate objective function L^CLIP (actor), the critic objective function (squared value error) and an entropy bonus for exploration
65
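A numpy sketch of the combined PPO cost for a batch of samples, following the clipped-surrogate form above; the coefficient values c1, c2 and ε are illustrative defaults, not the lecture's settings.

```python
import numpy as np

def ppo_cost(logp_new, logp_old, advantage, v_pred, v_target, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    """Batch PPO objective (to be maximized): clipped surrogate - c1 * value loss + c2 * entropy."""
    rho = np.exp(logp_new - logp_old)                     # policy ratio per sample
    unclipped = rho * advantage
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * advantage
    l_clip = np.minimum(unclipped, clipped).mean()        # pessimistic (min) surrogate
    l_value = np.mean((v_pred - v_target) ** 2)           # critic objective
    return l_clip - c1 * l_value + c2 * np.mean(entropy)
```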
Proximal Policy Optimization: Cost
Note: you can also run multiple actors in parallel
66
Actor-Critic Properties
Proof of convergence under various assumptions
Improved stability
Probabilistic in nature
On-policy and off-policy versions
Tabular and continuous formulation
Many tricks to improve performance (reward normalization, …)
Updates can be synchronous, asynchronous or episodic
Many versions (A2C, A3C, SAC, PPO, …)
67
Overview
Introduced policy gradient reinforcement learning
REINFORCE
Actor-Critic (A2C)
PPO
Can work with probabilistic policies and handle continuous state and action spaces
The fundamental dilemma remains:
• Efficiency of DP: models make it possible to plan and synthesize a policy; otherwise we
only rely on experience. Experiments are costly and risky.
• Efficiency of RL: experiments allow us to explore and improve exploitation in the long
run. Models are inherently uncertain.
• How to have a working marriage of DP and RL?
68