
Hierarchical Reinforcement Learning

Temporal Abstraction in RL

Khimya Khetarpal
Reasoning and Learning Lab
Mila - McGill University
Consider an autonomous robot in a warehouse

Image Source: Boston Dynamics Robot Handle


Tasks: Pick up boxes, Navigate to destination, Stack boxes

Tasks at hand could be solved quickly and efficiently with skills.

Each task decomposes further into sub-skills, e.g., Scan room, Identify objects, Find box, Reach box, Detect obstacle, Avoid obstacle, Move, Find location, Place box.

Each skill can take a different number of time steps.

The ability to abstract knowledge temporally over many different time scales is seamlessly integrated in human decision making!


Reinforcement Learning

At each time step, the agent:


• Executes action At
• Receives observation Ot
• Receives reward Rt

At each time step, the environment:


• Receives action At
• Emits observation Ot+1
• Emits scalar reward Rt+1

Predictions: Value Functions

Policy π(a | s)

Value Function Vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + … | St = s]

Immediate Reward Discount Factor Discounted Future Value

Learning Values


Temporal Difference Learning


V(St) ← V(St) + α(Rt+1 + γV(St+1) − V(St))

Temporal-difference error: δt = Rt+1 + γV(St+1) − V(St)

Learning rule for parameterized value functions: wt+1 = wt + α δt ∇w Vw(St)
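For illustration, here is a minimal tabular TD(0) evaluation sketch (it assumes a Gym-style environment with a discrete state space; the interface and step sizes are illustrative assumptions, not from the lecture):

```python
import numpy as np

def td0_evaluation(env, policy, num_episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation:
    V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t)).
    Assumes a Gym-style environment with a discrete state space."""
    V = np.zeros(env.observation_space.n)
    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
            target = r if terminated else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```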

Why Temporal Abstraction
Planning
• Generate shorter plans
• Provides robustness to model errors
• Reduces sample complexity

Learning
• Improve exploration by taking shortcuts in the environment
• Facilitates Off-Policy learning
• Improves efficiency/learning speed
• Helps in transfer learning

The Options Framework

The Options Framework
Options (Sutton, Precup, and Singh, 1999) formalize the idea of temporally extended
actions also known as skills.

Sutton, Precup & Singh 1999


Options Framework
• Definition
Let S, A be the set of states and actions. A Markov option ω ∈ Ω is a triple:

(Iω ⊆ S , πω : S × A → [0, 1] , βω : S → [0, 1])


Initiation set Intra option policy Termination condition

• Iω : the set of states in which the option can be initiated (aka preconditions)


• πω(s, a) probability of taking an action a ∈ A in state s ∈ S when following the option ω
• βω(s) probability of terminating option ω upon entering state s
with a policy over options πΩ : S × Ω → [0,1]

• Example
• Robot navigating in a house: when you come across a closed door ( Iω ), open the
door ( πω ), until the door has been opened ( βω )

Sutton, Precup & Singh 1999
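To make the triple concrete, here is a minimal sketch of a Markov option as a data structure (class, field, and placeholder names are illustrative assumptions, not from the lecture):

```python
from dataclasses import dataclass
from typing import Callable, Set

State = int
Action = int

@dataclass
class MarkovOption:
    """A Markov option omega = (I_omega, pi_omega, beta_omega)."""
    initiation_set: Set[State]                 # I_omega: states where the option can start
    policy: Callable[[State, Action], float]   # pi_omega(s, a): probability of taking a in s
    termination: Callable[[State], float]      # beta_omega(s): probability of terminating in s

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set

# Example from the slide: "open the closed door"
# (the state/action ids below are hypothetical placeholders)
STATE_AT_CLOSED_DOOR, STATE_DOOR_OPEN = 0, 1
ACTION_PUSH_DOOR = 0

open_door = MarkovOption(
    initiation_set={STATE_AT_CLOSED_DOOR},
    policy=lambda s, a: 1.0 if a == ACTION_PUSH_DOOR else 0.0,
    termination=lambda s: 1.0 if s == STATE_DOOR_OPEN else 0.0,
)
```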


Planning with Options

[Figure: value iteration with primitive actions: Initial Values, Iteration #1, Iteration #2]
[Figure: value iteration with hallway options: Initial Values, Iteration #1, Iteration #2]

Sutton, Precup & Singh 1999
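As an illustrative sketch of planning at the level of options (not the lecture's implementation), value iteration can be run directly over option models; the array names and shapes below are assumptions:

```python
import numpy as np

def smdp_value_iteration(b, F, num_iters=50):
    """Value iteration at the level of options (SMDP planning sketch).
    b: expected discounted reward model of each option, shape (num_states, num_options)
    F: discounted state-prediction model, shape (num_states, num_options, num_states);
       discounting over the option's random duration is folded into b and F."""
    V = np.zeros(b.shape[0])
    for _ in range(num_iters):
        V = np.max(b + F @ V, axis=1)   # backup over options instead of primitive actions
    return V
```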


Planning with Options: Discussion

Potential Applications:
• Planning with stocks
• Planning with assets - asset management
• Clinical Domains [Y. Shahar: A framework for knowledge-based temporal abstraction]

Can we learn such temporal abstractions?

• Bacon, Harb, and Precup (2017) proposed the option-critic framework, which makes it possible to learn a set of options

• Directly optimize the discounted return, averaged over all trajectories starting at a designated state and option:

J = EΩ,θ,ω[ ∑t≥0 γ^t rt+1 | s0, ω0 ]

Bacon, Harb & Precup 2017


Actor-Critic Architecture

• Actor: decides how the agent acts
• Critic: provides feedback to improve the actor
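A minimal one-step actor-critic sketch with a linear critic and a softmax actor (all names, shapes, and step sizes are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, w, phi_s, phi_s_next, a, r, done,
                      gamma=0.99, alpha_actor=0.01, alpha_critic=0.1):
    """One-step actor-critic update with linear function approximation.
    theta: actor parameters, shape (num_actions, num_features)
    w:     critic weights, shape (num_features,)
    phi_s, phi_s_next: feature vectors of S_t and S_{t+1}"""
    # Critic: TD error delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
    v_next = 0.0 if done else w @ phi_s_next
    delta = r + gamma * v_next - w @ phi_s
    w = w + alpha_critic * delta * phi_s
    # Actor: policy-gradient step, using the critic's TD error as feedback
    probs = softmax(theta @ phi_s)
    grad_log_pi = -np.outer(probs, phi_s)   # gradient of log softmax policy
    grad_log_pi[a] += phi_s
    theta = theta + alpha_actor * delta * grad_log_pi
    return theta, w, delta
```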
Option-Critic Architecture

• Parameterize the intra-option policies πω,θ
• Parameterize the termination conditions βω,ν

Bacon, Harb & Precup 2017


Option-Critic with Deep RL

Bacon, Harb & Precup 2017


Hierarchical Abstract Machines (HAMs)
• Key Idea:
• Non-deterministic finite state machines
• Transitions invoke lower level machines

• A Machine:
• Is a partial policy
• Has four types of states: Call, Stop, Choice, and Action

Parr & Russell, 1998


Hierarchical Abstract Machines (HAMs)
• Upon encountering an obstacle:
• Machine enters a Choice state
• Follow-wall Machine
• Back-off Machine

• The agent learns a policy at the Choice states to decide which machine is optimal to call

Parr & Russell, 1998
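A minimal sketch of what a machine with Call/Stop/Choice/Action states could look like as a data structure (all names here are illustrative assumptions; in HAMQ-style learning, the values used at Choice points would be learned):

```python
from enum import Enum, auto
import random

class NodeType(Enum):
    ACTION = auto()   # execute a primitive action in the environment
    CALL = auto()     # invoke a lower-level machine
    CHOICE = auto()   # non-deterministic branch: the learner picks the successor
    STOP = auto()     # return control to the calling machine

class Machine:
    """A partial policy: a graph of typed machine states."""
    def __init__(self, name, nodes):
        self.name = name
        self.nodes = nodes   # dict: node_id -> (NodeType, payload, successor ids)

# Two low-level machines and a parent machine with a learned choice between them
follow_wall = Machine("follow-wall", {0: (NodeType.ACTION, "hug_wall", [1]),
                                      1: (NodeType.STOP, None, [])})
back_off = Machine("back-off", {0: (NodeType.ACTION, "reverse", [1]),
                                1: (NodeType.STOP, None, [])})
navigate = Machine("navigate", {
    0: (NodeType.CHOICE, None, [1, 2]),   # upon an obstacle: which machine to call?
    1: (NodeType.CALL, follow_wall, [3]),
    2: (NodeType.CALL, back_off, [3]),
    3: (NodeType.STOP, None, []),
})

def pick_successor(choice_values, node_id, successors, epsilon=0.1):
    """Epsilon-greedy selection at a Choice state, using learned choice-point values."""
    if random.random() < epsilon:
        return random.choice(successors)
    return max(successors, key=lambda n: choice_values.get((node_id, n), 0.0))
```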




Feudal Learning
• Reward Hiding:
• The managers provide subtasks g for sub-managers
• Managers only reward the actions if the sub-manager achieves g,
irrespective of what the overall goal of the task is.
• Low-level managers learn how to achieve low-level goals even if these
do not directly correspond to the highest-level goal.

• Information Hiding:
• Managers only know the state of the system at the granularity of their
own choices of tasks.
• Information is hidden both ways, upwards and downwards, in terms of
the sub-tasks chosen to meet the main goal.

Dayan & Hinton 1993



FeUdal Networks (FuN) for HRL
• Key Insights:
• Manager chooses a subgoal direction that maximizes reward
• Worker selects actions that maximize the cosine similarity between the change in latent state and the goal direction
• FuN aims to represent sub-goals as directions in latent state space
• Sub-goals thereby correspond to meaningful behaviours, rather than being treated simply as actions

Vezhnevets et al. 2017
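A rough sketch of the worker's cosine-similarity signal under these assumptions (names and the horizon are illustrative; this is not the full FuN architecture):

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-8):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def worker_intrinsic_reward(latent_states, goals, t, horizon=10):
    """Sketch of an intrinsic reward for the worker: average cosine similarity
    between the change in latent state and the manager's goal directions over
    the last `horizon` steps.
    latent_states: sequence of latent state vectors s_0 ... s_t
    goals:         sequence of goal direction vectors g_0 ... g_t"""
    sims = []
    for i in range(1, horizon + 1):
        if t - i < 0:
            break
        delta = latent_states[t] - latent_states[t - i]   # change in latent state
        sims.append(cosine_similarity(delta, goals[t - i]))
    return float(np.mean(sims)) if sims else 0.0
```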




Moving towards truly scalable RL

"Stop learning tasks, start learning skills." - Satinder Singh, NeurIPS 2018
Related Literature

• MAXQ
• HIRO
• h-DQN
• Meta Learning with Shared Hierarchies
• To be completed
Demo
Questions
Extra Slides
Option-Critic
Formulation

All options are available in all states

The option value function is defined as


QΩ(s, ω) = ∑a πω,θ(a | s) QU(s, ω, a)

where QU : S × Ω × A → ℝ is the value of executing an action in the context of a


state-option pair defined as:


QU(s, ω, a) = r(s, a) + γ ∑s′ P(s′ | s, a) U(ω, s′)

where U : Ω × S → ℝ is the option-value function upon arrival in a state:

U(ω, s′) = (1 − βω,ν(s′))QΩ(s′, ω) + βω,ν(s′)VΩ(s′)

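For illustration only, a tabular sketch of these three quantities (array names and shapes are assumptions, not prescribed by the lecture):

```python
import numpy as np

def q_omega(pi, q_u_table, s, w):
    """Q_Omega(s, w) = sum_a pi_{w,theta}(a | s) * Q_U(s, w, a).
    pi:        intra-option policies, shape (num_options, num_states, num_actions)
    q_u_table: Q_U values, shape (num_states, num_options, num_actions)"""
    return pi[w, s] @ q_u_table[s, w]

def u_value(q_omega_table, v_omega, beta, w, s_next):
    """U(w, s') = (1 - beta_{w,nu}(s')) * Q_Omega(s', w) + beta_{w,nu}(s') * V_Omega(s')."""
    b = beta[w, s_next]
    return (1.0 - b) * q_omega_table[s_next, w] + b * v_omega[s_next]

def q_u(reward, P, q_omega_table, v_omega, beta, s, w, a, gamma=0.99):
    """Q_U(s, w, a) = r(s, a) + gamma * sum_{s'} P(s' | s, a) * U(w, s').
    reward: shape (num_states, num_actions); P: shape (num_states, num_actions, num_states)
    beta:   shape (num_options, num_states); q_omega_table: shape (num_states, num_options)"""
    u_all = (1.0 - beta[w]) * q_omega_table[:, w] + beta[w] * v_omega   # U(w, s') for all s'
    return reward[s, a] + gamma * (P[s, a] @ u_all)
```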
