Hierarchical Reinforcement Learning
Temporal Abstraction in RL
Khimya Khetarpal
Reasoning and Learning Lab
Mila - McGill University
Consider an autonomous robot in a warehouse
Tasks: Pick up boxes, Navigate to destination, Stack boxes
Each task can be broken down into skills: Scan room, Identify objects, Find box, Reach box, Detect obstacle, Avoid obstacle, Move, Find location, Place box
Tasks at hand could be solved quickly and efficiently with skills
Each skill can take a different number of time steps
The ability to abstract knowledge temporally over many different time scales is seamlessly integrated into human decision making!
Image Source: Boston Dynamics Robot Handle
Reinforcement Learning
At each time step, the agent:
• Executes action At
• Receives observation Ot
• Receives reward Rt
At each time step, the environment:
• Receives action At
• Emits observation Ot+1
• Emits scalar reward Rt+1
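As a concrete illustration of this interaction loop, here is a minimal sketch in Python; the toy random-walk environment and its reset/step interface are illustrative assumptions, not part of the slides.

```python
import random

# A minimal sketch of the agent-environment loop described above.
# The environment is a toy 5-state random walk (illustrative assumption).

class RandomWalkEnv:
    """States 0..4; the episode ends at either end; reward +1 only at the right end."""
    def reset(self):
        self.state = 2                       # start in the middle
        return self.state                    # initial observation O_t

    def step(self, action):                  # action in {-1, +1}
        self.state += action
        done = self.state in (0, 4)
        reward = 1.0 if self.state == 4 else 0.0
        return self.state, reward, done      # O_{t+1}, R_{t+1}, termination flag

env = RandomWalkEnv()
obs = env.reset()
done = False
while not done:
    action = random.choice([-1, +1])         # agent executes action A_t
    obs, reward, done = env.step(action)     # agent receives O_{t+1} and R_{t+1}
```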
Predictions : Value Functions
Policy π(a | s)
Value Function Vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + … | St = s]
where Rt+1 is the immediate reward, γ the discount factor, and the remaining terms the discounted future value
Learning Values
Temporal Difference Learning
V(St) ← V(St) + α(Rt+1 + γV(St+1) − V(St))
Temporal-difference error: δt = Rt+1 + γV(St+1) − V(St)
Learning rule for parameterized value functions: wt+1 = wt + α δt ∇w Vw(St)
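A minimal sketch of these updates in Python (not from the slides): with one-hot features, the semi-gradient TD(0) rule below reduces to the tabular update above; the function names and the use of NumPy are assumptions of this sketch.

```python
import numpy as np

# Semi-gradient TD(0) with a linear value function V_w(s) = w . x(s).
# With one-hot features x(s), this is exactly the tabular update shown above.

def one_hot(s, n_states):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def td0_update(w, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One TD(0) step: w <- w + alpha * delta_t * grad_w V_w(S_t)."""
    n = len(w)
    v_next = 0.0 if done else w @ one_hot(s_next, n)
    delta = r + gamma * v_next - w @ one_hot(s, n)   # TD error delta_t
    return w + alpha * delta * one_hot(s, n)         # for linear V, grad_w V_w(s) = x(s)
```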
Why Temporal Abstraction
Planning
• Generate shorter plans
• Provides robustness to model errors
• Improves sample complexity
Learning
• Improve exploration by taking shortcuts in the environment
• Facilitates Off-Policy learning
• Improves efficiency/learning speed
• Helps in transfer learning
The Options Framework
The Options Framework
Options (Sutton, Precup, and Singh, 1999) formalize the idea of temporally extended actions, also known as skills.
Sutton, Precup & Singh 1999
Options Framework
• Definition
Let S, A be the set of states and actions. A Markov option ω ∈ Ω is a triple:
(Iω ⊆ S , πω : S × A → [0, 1] , βω : S → [0, 1])
(Initiation set, Intra-option policy, Termination condition)
• Iω : set of states in which the option can be initiated, aka preconditions
• πω(s, a) : probability of taking action a ∈ A in state s ∈ S when following option ω
• βω(s) : probability of terminating option ω upon entering state s
with a policy over options πΩ : S × Ω → [0,1]
• Example
• Robot navigating in a house: when you come across a closed door (Iω), open the door (πω) until the door has been opened (βω)
Sutton, Precup & Singh 1999
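To make the definition concrete, here is a minimal sketch of a Markov option and its call-and-return execution in Python; the class, the discrete state/action encoding, and the env.step interface (returning state, reward, done, as in the toy sketch earlier) are illustrative assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """A Markov option (I_w, pi_w, beta_w) over a discrete state/action space."""
    initiation_set: Set[int]              # I_w: states where the option may be started
    policy: Callable[[int], int]          # pi_w: intra-option policy, state -> action
    termination: Callable[[int], float]   # beta_w: probability of terminating in a state

def run_option(env, state, option, gamma=0.99, max_steps=100):
    """Call-and-return execution: follow pi_w until beta_w says stop (or the episode ends).
    Returns the arrival state, the discounted reward accumulated, and gamma^k for SMDP updates."""
    assert state in option.initiation_set, "option not available in this state"
    total_reward, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = option.policy(state)
        state, reward, done = env.step(action)
        total_reward += discount * reward
        discount *= gamma
        if done or random.random() < option.termination(state):
            break
    return state, total_reward, discount
```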
Planning with Options
[Figure: value iteration with primitive actions vs. hallway options, showing Initial Values, Iteration #1, and Iteration #2]
With hallway options, values propagate much further in each iteration, so planning requires far fewer iterations than with primitive actions alone.
Sutton, Precup & Singh 1999
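To illustrate how planning can treat options like actions, here is a minimal sketch of value iteration over option models; the tabular arrays R[w, s] and P[w, s, s'] (expected discounted reward and discounted termination-state distribution of each option) and the inclusion of primitive actions as one-step options are assumptions of this sketch.

```python
import numpy as np

def value_iteration_with_options(R, P, n_iters=10):
    """R[w, s]: expected discounted reward of running option w from state s.
    P[w, s, s']: discounted probability that option w started in s terminates in s'.
    Primitive actions can be included as one-step options."""
    n_options, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # Bellman optimality backup over options:
        # V(s) = max_w [ R_w(s) + sum_s' P_w(s' | s) V(s') ]
        V = np.max(R + P @ V, axis=0)
    return V
```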
Planning with Options : Discussion
Potential Applications:
• Planning with stocks
• Planning with assets - asset management
• Clinical Domains [Y. Shahar: A framework for knowledge-based temporal abstraction]
Can we learn such temporal abstractions?
• Bacon, Harb, and Precup (2017) proposed the option-critic framework, which provides the ability to learn a set of options
• Optimize directly the discounted return, averaged over all the trajectories starting at a designated state and option:
J = EΩ,θ,ω[ ∑t=0^∞ γ^t rt+1 | s0, ω0 ]
Bacon, Harb & Precup 2017
Actor-Critic Architecture
• Actor: decides how the agent acts
• Critic: provides feedback to improve the actor
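A minimal sketch of one actor-critic step with linear function approximation and a softmax policy; the parameter shapes, step sizes, and the use of the TD error as the critic's feedback signal follow the standard textbook formulation rather than any specific architecture from the slides.

```python
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def actor_critic_step(theta, w, x_s, x_s_next, a, r, done,
                      alpha_w=0.1, alpha_theta=0.01, gamma=0.99):
    """Critic: TD(0) on V_w. Actor: policy-gradient step using the TD error as feedback.
    theta has shape (n_actions, n_features); x_s is the feature vector of the state."""
    v_next = 0.0 if done else w @ x_s_next
    delta = r + gamma * v_next - w @ x_s             # critic's TD error
    w = w + alpha_w * delta * x_s                    # critic update
    probs = softmax(theta @ x_s)                     # actor: softmax policy over actions
    grad_log = -probs[:, None] * x_s[None, :]        # d log pi(a|s) / d theta ...
    grad_log[a] += x_s                               # ... = (1[a=b] - pi(b|s)) x(s)
    theta = theta + alpha_theta * delta * grad_log   # actor update
    return theta, w
```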
Option-Critic Architecture
• Parameterize the intra-option policies πω,θ
• Parameterize the termination conditions βω,ν
Bacon, Harb & Precup 2017
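A minimal sketch of how these parameterizations could look, assuming a tabular setting with a softmax intra-option policy per option and a sigmoid termination per option; the table shapes and names are illustrative, not the paper's implementation.

```python
import numpy as np

n_states, n_actions, n_options = 10, 4, 2

# Intra-option policies pi_{w,theta}: softmax over per-option action preferences.
theta = np.zeros((n_options, n_states, n_actions))
# Termination conditions beta_{w,nu}: sigmoid of per-option state preferences.
nu = np.zeros((n_options, n_states))

def intra_option_policy(option, state):
    prefs = theta[option, state]
    probs = np.exp(prefs - prefs.max())
    return probs / probs.sum()                        # pi_{w,theta}(. | s)

def termination_prob(option, state):
    return 1.0 / (1.0 + np.exp(-nu[option, state]))   # beta_{w,nu}(s)
```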
Option-Critic with Deep RL
Bacon, Harb & Precup 2017
Hierarchical Abstract Machines (HAMs)
• Key Idea:
• Non-deterministic finite state machines
• Transitions invoke lower-level machines
• A Machine:
• Is a partial policy
• Has four kinds of states: Call, Stop, Choice, and Action states
• Example: a robot navigating around obstacles
• Upon encountering an obstacle, the machine enters a Choice state
• It can call a Follow-wall machine or a Back-off machine
• A HAM learns a policy to decide which machine is optimal to call
Parr & Russell, 1998
Feudal Learning
• Reward Hiding:
• Managers provide subtasks g for sub-managers
• Managers reward actions only if the sub-manager achieves g, irrespective of what the overall goal of the task is
• Low-level managers learn how to achieve low-level goals even if these do not directly correspond to the highest-level goal
• Information Hiding:
• Managers only know the state of the system at the granularity of their own choices of tasks
• Information is hidden both ways, upwards and downwards, in terms of the choice of sub-tasks chosen to meet the main goal
Dayan & Hinton 1993
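A minimal sketch of the reward-hiding idea, assuming subgoals can be tested against states; the function and argument names are illustrative.

```python
def submanager_reward(state_reached, assigned_subgoal, reached):
    """Reward hiding: the sub-manager is rewarded only for achieving its assigned
    subgoal g, regardless of whether that serves the overall task goal."""
    return 1.0 if reached(state_reached, assigned_subgoal) else 0.0

# Illustrative usage: subgoals are target grid cells, reached = exact match.
r = submanager_reward((2, 3), (2, 3), reached=lambda s, g: s == g)   # -> 1.0
```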
FeUdal Networks (FuN) for HRL
• Key Insights:
• The Manager chooses a subgoal direction that maximizes reward
• The Worker selects actions that maximize cosine similarity with that direction
• FuN aims to represent sub-goals as directions in latent state space
• Subgoals correspond to meaningful behaviours; subgoals act as actions for the Manager
Vezhnevets et al., 2017
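A minimal sketch of the cosine-similarity signal the Worker maximizes, assuming latent states and goal directions are stored as vectors indexed by time and averaged over a horizon c; the names and the exact averaging are assumptions of this sketch.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def worker_intrinsic_reward(latent_states, goals, t, c=10):
    """Average cosine similarity between recent latent-state changes and the
    Manager's goal directions over the last c steps."""
    if t == 0:
        return 0.0
    sims = [cosine_similarity(latent_states[t] - latent_states[t - i], goals[t - i])
            for i in range(1, min(c, t) + 1)]
    return sum(sims) / len(sims)
```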
Moving towards truly scalable RL
"Stop learning tasks, start learning skills." - Satinder Singh, NeurIPS 2018
Related Literature
• MAXQ
• HIRO
• h-DQN
• Meta Learning with Shared Hierarchies
• To be completed
Demo
Questions
Extra Slides
Option-Critic
Formulation
All options are available in all states.
The option-value function is defined as
QΩ(s, ω) = ∑a πω,θ(a | s) QU(s, ω, a)
where QU : S × Ω × A → ℝ is the value of executing an action in the context of a state-option pair, defined as:
QU(s, ω, a) = r(s, a) + γ ∑s′ P(s′ | s, a) U(ω, s′)
where U : S × Ω → ℝ is the option-value function upon arrival in a state:
U(ω, s′) = (1 − βω,ν(s′)) QΩ(s′, ω) + βω,ν(s′) VΩ(s′)
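A minimal sketch of these definitions for a small discrete MDP, assuming tabular arrays r[s, a], P[s, a, s'], Q_Omega[s, w], beta[w, s], and pi[w, s, a], plus a policy over options pi_over_options[s, w]; the array names and shapes are illustrative.

```python
import numpy as np

gamma = 0.99

def V_Omega(Q_Omega, pi_over_options, s):
    """Value of state s under the policy over options pi_Omega(. | s)."""
    return float(pi_over_options[s] @ Q_Omega[s])

def U(Q_Omega, pi_over_options, beta, w, s_next):
    """Option-value upon arrival: continue with w unless it terminates, then re-choose."""
    cont = (1.0 - beta[w, s_next]) * Q_Omega[s_next, w]
    stop = beta[w, s_next] * V_Omega(Q_Omega, pi_over_options, s_next)
    return cont + stop

def Q_U(r, P, Q_Omega, pi_over_options, beta, s, w, a):
    """Value of executing action a in the context of the state-option pair (s, w)."""
    expected_next = sum(P[s, a, s2] * U(Q_Omega, pi_over_options, beta, w, s2)
                        for s2 in range(P.shape[2]))
    return r[s, a] + gamma * expected_next

def Q_Omega_from_QU(r, P, Q_Omega, pi, pi_over_options, beta, s, w):
    """Q_Omega(s, w) = sum_a pi_{w,theta}(a | s) * Q_U(s, w, a)."""
    return sum(pi[w, s, a] * Q_U(r, P, Q_Omega, pi_over_options, beta, s, w, a)
               for a in range(pi.shape[2]))
```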