Multi-Objective Planning & Learning
Shimon Whiteson & Diederik M. Roijers
Department of Computer Science
University of Oxford
Computational Intelligence
Vrije Universiteit Amsterdam
July 7, 2018
Whiteson & Roijers Multi-Objective Planning July 7, 2018 1 / 112
Schedule
08:30-09:15: Motivation & Concepts (Shimon)
09:15-09:30: Short Break
09:30-10:15: Motivation & Concepts cont’d (Shimon)
10:15-10:45: Coffee Break
10:45-11:30: Methods (Diederik)
11:30-11:45: Short Break
11:45-12:30: Methods & Applications (Diederik)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 2 / 112
Note
Get the latest version of the slides at:
http://roijers.info/motutorial.html
This tutorial is based on our survey article:
Diederik Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A
Survey of Multi-Objective Sequential Decision-Making. Journal of Artificial
Intelligence Research, 48:67–113, 2013.
and Diederik's dissertation:
Diederik Roijers. Multi-Objective Decision-Theoretic Planning. PhD thesis,
University of Amsterdam, 2016. http://roijers.info/pub/thesis.pdf
Whiteson & Roijers Multi-Objective Planning July 7, 2018 3 / 112
Part 1: Motivation & Concepts
Multi-Objective Motivation
MDPs & MOMDPs
Problem Taxonomy
Solution Concepts
Whiteson & Roijers Multi-Objective Planning July 7, 2018 4 / 112
Medical Treatment
Chance of being cured, having side effects, or dying
Whiteson & Roijers Multi-Objective Planning July 7, 2018 5 / 112
Traffic Coordination
Latency, throughput, fairness, environmental impact, etc.
Whiteson & Roijers Multi-Objective Planning July 7, 2018 6 / 112
Mining Commodities
Gold collected, silver collected
village
mine
[Roijers et al. 2013, 2014]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 7 / 112
Grid World
Getting the treasure, minimising fuel costs
Whiteson & Roijers Multi-Objective Planning July 7, 2018 8 / 112
Do We Need Multi-Objective Models?
Whiteson & Roijers Multi-Objective Planning July 7, 2018 9 / 112
Do We Need Multi-Objective Models?
Sutton's Reward Hypothesis: "All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)."
Source: http://rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html
Whiteson & Roijers Multi-Objective Planning July 7, 2018 9 / 112
Do We Need Multi-Objective Models?
Sutton's Reward Hypothesis: "All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)."
Source: http://rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html
V : Π → R
V^π = E^π[ Σ_t r_t ]
π* = arg max_π V^π
Whiteson & Roijers Multi-Objective Planning July 7, 2018 9 / 112
Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!
V : Π → Rn
Whiteson & Roijers Multi-Objective Planning July 7, 2018 10 / 112
Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!
V : Π → Rn
Objection: why not just scalarize?
Whiteson & Roijers Multi-Objective Planning July 7, 2018 10 / 112
Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!
V : Π → Rn
Objection: why not just scalarize?
Scalarization function projects multi-objective value to a scalar:
V_w^π = f(V^π, w)
Linear case:
V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π
A priori prioritization of the objectives
Whiteson & Roijers Multi-Objective Planning July 7, 2018 10 / 112
Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!
V : Π → Rn
Objection: why not just scalarize?
Scalarization function projects multi-objective value to a scalar:
V_w^π = f(V^π, w)
Linear case:
V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π
A priori prioritization of the objectives
The weak argument is necessary but not sufficient
Whiteson & Roijers Multi-Objective Planning July 7, 2018 10 / 112
Why Multi-Objective Decision Making?
The strong argument: a priori scalarization is sometimes impossible,
infeasible, or undesirable
Instead produce the coverage set of undominated solutions
Yields three scenarios for planning or off-line RL
Yields two scenarios for on-line RL
Whiteson & Roijers Multi-Objective Planning July 7, 2018 11 / 112
Unknown-Weights Planning Scenario
Weights known in execution phase but not in planning phase
Example: mining commodities [Roijers et al. 2013]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 12 / 112
Decision-Support Planning Scenario
Quantifying priorities is infeasible
Choosing between options is easier
Example: medical treatment
Whiteson & Roijers Multi-Objective Planning July 7, 2018 13 / 112
Known Weights Planning Scenario
Scalarization yields intractable problem
Whiteson & Roijers Multi-Objective Planning July 7, 2018 14 / 112
Reinforcement Learning Scenarios
Same scenarios apply for offline RL
For example, unknown-weights scenario becomes:
[Diagram: in the learning phase, a learning algorithm turns a dataset into a coverage set; in the selection phase, scalarization with the revealed weights selects a policy; in the execution phase, the selected policy is executed]
For online RL there are two more scenarios
Whiteson & Roijers Multi-Objective Planning July 7, 2018 15 / 112
Dynamic-Weights Online RL Scenario
Scalarization changes, over time, e.g., market prices
Caching policies for different prices speeds adaptation
[Natarajan & Tadepalli, 2005]
[Diagram: a combined learning and execution phase in which the learning algorithm interacts with the environment, observing rewards, states, and weights]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 16 / 112
Interactive Decision-Support Online RL Scenario
Scalarization initially unknown
Learned via user interaction
Learning from environment and user simultaneously
[Diagram: a combined learning and execution phase in which the learning algorithm interacts with both the environment and the user, followed by an execution-only phase with a single solution]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 17 / 112
Summary of Motivation
Multi-objective methods are useful because many problems are naturally characterized by multiple objectives and cannot be easily scalarized a priori.
Whiteson & Roijers Multi-Objective Planning July 7, 2018 18 / 112
Summary of Motivation
Multi-objective methods are useful because many problems are naturally characterized by multiple objectives and cannot be easily scalarized a priori.
The burden of proof rests with the a priori scalarization, not with the multi-objective modeling.
Whiteson & Roijers Multi-Objective Planning July 7, 2018 18 / 112
Part 1: Motivation & Concepts
Multi-Objective Motivation
MDPs & MOMDPs
Problem Taxonomy
Solution Concepts
Whiteson & Roijers Multi-Objective Planning July 7, 2018 19 / 112
Markov Decision Process (MDP)
A single-objective MDP is a tuple ⟨S, A, T, R, µ, γ⟩ where:
I S is a finite set of states
I A is a finite set of actions
I T : S × A × S → [0, 1] is a transition function
I R : S × A × S → R is a reward function
I µ : S → [0, 1] is a probability distribution over initial states
I γ ∈ [0, 1) is a discount factor
(figure from Poole & Mackworth, Artificial Intelligence:
Foundations of Computational Agents, 2010)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 20 / 112
Returns & Policies
Goal: maximize expected return, which is typically additive:
R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}
A stationary policy conditions only on the current state:
π : S × A → [0, 1]
A deterministic stationary policy maps states directly to actions:
π:S →A
Whiteson & Roijers Multi-Objective Planning July 7, 2018 21 / 112
Value Functions in MDPs
A state-independent value function V^π specifies the expected return when following π from the initial state:
V^π = E[R_0 | π]
A state value function of a policy π:
V^π(s) = E[R_t | π, s_t = s]
The Bellman equation restates this expectation recursively for stationary policies:
V^π(s) = Σ_a π(s, a) Σ_{s'} T(s, a, s')[R(s, a, s') + γ V^π(s')]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 22 / 112
Optimality in MDPs
Theorem
For any additive infinite-horizon single-objective MDP, there exists a
deterministic stationary optimal policy [Howard 1960]
All optimal policies share the same optimal value function:
V*(s) = max_π V^π(s)
V*(s) = max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')]
Extract the optimal policy using local action selection:
π*(s) = arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 23 / 112
Multi-Objective MDP (MOMDP)
Vector-valued reward and value:
R : S × A × S → R^n
V^π = E[ Σ_{k=0}^∞ γ^k r_{k+1} | π ]
V^π(s) = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} | π, s_t = s ]
V^π(s) imposes only a partial ordering, e.g.,
V_i^π(s) > V_i^{π'}(s) but V_j^π(s) < V_j^{π'}(s).
Definition of optimality no longer clear
Whiteson & Roijers Multi-Objective Planning July 7, 2018 24 / 112
Part 1: Motivation & Concepts
Multi-Objective Motivation
MDPs & MOMDPs
Problem Taxonomy
Solution Concepts
Whiteson & Roijers Multi-Objective Planning July 7, 2018 25 / 112
Axiomatic vs. Utility-Based Approach
Axiomatic approach: define optimal solution set to be Pareto front
Whiteson & Roijers Multi-Objective Planning July 7, 2018 26 / 112
Axiomatic vs. Utility-Based Approach
Axiomatic approach: define optimal solution set to be Pareto front
Utility-based approach:
I Execution phase: select one policy maximizing scalar utility Vwπ ,
where w may be hidden or implicit
Whiteson & Roijers Multi-Objective Planning July 7, 2018 26 / 112
Axiomatic vs. Utility-Based Approach
Axiomatic approach: define optimal solution set to be Pareto front
Utility-based approach:
I Execution phase: select one policy maximizing scalar utility Vwπ ,
where w may be hidden or implicit
I Planning phase: find set of policies containing optimal solution
for each possible w; if w unknown, size of set generally > 1
Whiteson & Roijers Multi-Objective Planning July 7, 2018 26 / 112
Axiomatic vs. Utility-Based Approach
Axiomatic approach: define optimal solution set to be Pareto front
Utility-based approach:
I Execution phase: select one policy maximizing scalar utility Vwπ ,
where w may be hidden or implicit
I Planning phase: find set of policies containing optimal solution
for each possible w; if w unknown, size of set generally > 1
I Deduce optimal solution set from three factors:
1 Multi-objective scenario
2 Properties of scalarization function
3 Allowable policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 26 / 112
Three Factors
1 Multi-objective scenario
I Known weights → single policy
I Unknown weights or decision support → multiple policies
2 Properties of scalarization function
I Linear
I Monotonically increasing
3 Allowable policies
I Deterministic
I Stochastic
Whiteson & Roijers Multi-Objective Planning July 7, 2018 27 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 28 / 112
Part 1: Motivation & Concepts
Multi-Objective Motivation
MDPs & MOMDPs
Problem Taxonomy
Solution Concepts
Whiteson & Roijers Multi-Objective Planning July 7, 2018 29 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 30 / 112
Linear Scalarization Functions
Computes inner product of w and V^π:
V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π,  w ∈ R^n
w_i quantifies importance of the i-th objective
Simple and intuitive, e.g., when utility translates to money:
revenue = #cans × ppc + #bottles × ppb
Whiteson & Roijers Multi-Objective Planning July 7, 2018 31 / 112
Linear Scalarization Functions
Computes inner product of w and V^π:
V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π,  w ∈ R^n
w_i quantifies importance of the i-th objective
Simple and intuitive, e.g., when utility translates to money:
revenue = #cans × ppc + #bottles × ppb
w is typically constrained so that V_w^π is a convex combination:
∀i w_i ≥ 0,  Σ_i w_i = 1
utility = #cans × ppc/(ppc + ppb) + #bottles × ppb/(ppc + ppb)
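As a quick illustration (not from the original slides), a minimal NumPy sketch of linear scalarization with convex weights; the value vectors and prices below are made-up numbers:

```python
import numpy as np

# Hypothetical multi-objective values: (cans collected, bottles collected) per policy.
V = np.array([[10.0, 2.0],   # policy 0
              [6.0, 6.0],    # policy 1
              [1.0, 11.0]])  # policy 2

ppc, ppb = 0.10, 0.25                    # made-up prices per can / per bottle
w = np.array([ppc, ppb]) / (ppc + ppb)   # normalise to a convex combination
assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)

scalarised = V @ w                       # V_w^pi = w . V^pi for every policy
best = int(np.argmax(scalarised))
print(f"weights {w}, scalarised values {scalarised}, best policy {best}")
```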
Whiteson & Roijers Multi-Objective Planning July 7, 2018 31 / 112
Linear Scalarization & Single Policy
No special methods required: just apply f to each reward vector
Inner product distributes over addition, yielding a normal MDP:
V_w^π = w · V^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ]
Apply standard methods to an MDP with scalar reward:
R_w(s, a, s') = w · R(s, a, s'),
yielding a single deterministic stationary policy
Whiteson & Roijers Multi-Objective Planning July 7, 2018 32 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: collecting bottles and cans
Whiteson & Roijers Multi-Objective Planning July 7, 2018 33 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: collecting bottles and cans
Note: only cell in taxonomy that does not require multi-objective methods
Whiteson & Roijers Multi-Objective Planning July 7, 2018 33 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 34 / 112
Multiple Policies
Unknown weights or decision support → multiple policies
During planning w is unknown
Size of solution set is generally > 1
Set should not contain policies suboptimal for all w
Whiteson & Roijers Multi-Objective Planning July 7, 2018 35 / 112
Undominated & Coverage Sets
Definition
The undominated set U(Π) is the subset of all possible policies Π for which there exists a w for which the scalarized value is maximal:
U(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) V_w^π ≥ V_w^{π'}}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 36 / 112
Undominated & Coverage Sets
Definition
The undominated set U(Π) is the subset of all possible policies Π for which there exists a w for which the scalarized value is maximal:
U(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) V_w^π ≥ V_w^{π'}}
Definition
A coverage set CS(Π) is a subset of U(Π) that, for every w, contains a policy with maximal scalarized value, i.e.,
CS(Π) ⊆ U(Π) ∧ (∀w)(∃π) [π ∈ CS(Π) ∧ ∀(π' ∈ Π) V_w^π ≥ V_w^{π'}]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 36 / 112
Example
V_w^π       w = true    w = false
π = π1      5           0
π = π2      0           5
π = π3      5           2
π = π4      2           2
One binary weight feature: only two possible weights
Weights are not objectives but two possible scalarizations
Whiteson & Roijers Multi-Objective Planning July 7, 2018 37 / 112
Example
V_w^π       w = true    w = false
π = π1      5           0
π = π2      0           5
π = π3      5           2
π = π4      2           2
One binary weight feature: only two possible weights
Weights are not objectives but two possible scalarizations
U(Π) = {π1 , π2 , π3 } but CS(Π) = {π1 , π2 } or {π2 , π3 }
Whiteson & Roijers Multi-Objective Planning July 7, 2018 37 / 112
Execution Phase
Single policy selected from CS(Π) and executed
Unknown weights: weights revealed and maximizing policy selected:
π* = arg max_{π ∈ CS(Π)} V_w^π
Decision support: CS(Π) is manually inspected by the user
Whiteson & Roijers Multi-Objective Planning July 7, 2018 38 / 112
Linear Scalarization & Multiple Policies
Definition
The convex hull CH(Π) is the subset of Π for which there exists a w that maximizes the linearly scalarized value:
CH(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) w · V^π ≥ w · V^{π'}}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 39 / 112
Linear Scalarization & Multiple Policies
Definition
The convex hull CH(Π) is the subset of Π for which there exists a w that maximizes the linearly scalarized value:
CH(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) w · V^π ≥ w · V^{π'}}
Definition
The convex coverage set CCS(Π) is a subset of CH(Π) that, for every w, contains a policy whose linearly scalarized value is maximal, i.e.,
CCS(Π) ⊆ CH(Π) ∧ (∀w)(∃π) [π ∈ CCS(Π) ∧ ∀(π' ∈ Π) w · V^π ≥ w · V^{π'}]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 39 / 112
Visualization
[Figure: value vectors in objective space (V_0 vs. V_1) and scalarized values in weight space (V_w vs. w_1)]
V_w = w_0 V_0 + w_1 V_1,  with w_0 = 1 − w_1
Whiteson & Roijers Multi-Objective Planning July 7, 2018 40 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: mining gold and silver
Whiteson & Roijers Multi-Objective Planning July 7, 2018 41 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 42 / 112
Monotonically Increasing Scalarization Functions
Mining example: V^{π_1} = (3, 0), V^{π_2} = (0, 3), V^{π_3} = (1, 1)
Choosing V^{π_3} implies a nonlinear scalarization function
Whiteson & Roijers Multi-Objective Planning July 7, 2018 43 / 112
Monotonically Increasing Scalarization Functions
Definition
A scalarization function is strictly monotonically increasing if changing a policy such that its value increases in one or more objectives, without decreasing in any other objective, also increases the scalarized value:
(∀i V_i^{π'} ≥ V_i^π ∧ ∃i V_i^{π'} > V_i^π) ⇒ (∀w V_w^{π'} > V_w^π)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 44 / 112
Monotonically Increasing Scalarization Functions
Definition
A scalarization function is strictly monotonically increasing if changing a policy such that its value increases in one or more objectives, without decreasing in any other objective, also increases the scalarized value:
(∀i V_i^{π'} ≥ V_i^π ∧ ∃i V_i^{π'} > V_i^π) ⇒ (∀w V_w^{π'} > V_w^π)
Definition
A policy π Pareto-dominates another policy π' when its value is at least as high in all objectives and strictly higher in at least one objective:
V^π ≻_P V^{π'} ⇔ ∀i V_i^π ≥ V_i^{π'} ∧ ∃i V_i^π > V_i^{π'}
A policy is Pareto optimal if no policy Pareto-dominates it.
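A one-function Python sketch of the Pareto-dominance test above; the first check uses vectors from the mining example earlier, the second uses a made-up vector:

```python
import numpy as np

def pareto_dominates(v, v_prime):
    """True iff value vector v Pareto-dominates v_prime: at least as good in
    every objective and strictly better in at least one."""
    v, v_prime = np.asarray(v), np.asarray(v_prime)
    return bool(np.all(v >= v_prime) and np.any(v > v_prime))

print(pareto_dominates((3, 0), (1, 1)))  # False: worse in the second objective
print(pareto_dominates((3, 1), (1, 1)))  # True (made-up vector, for illustration)
```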
Whiteson & Roijers Multi-Objective Planning July 7, 2018 44 / 112
Nonlinear Scalarization Can Destroy Additivity
Nonlinear scalarization and expectation do not commute:
V_w^π = f(V^π, w) = f(E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ], w) ≠ E[ Σ_{k=0}^∞ γ^k f(r_{t+k+1}, w) ]
Bellman-based methods not applicable
Local action selection no longer yields an optimal policy:
π*(s) ≠ arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 45 / 112
Deterministic vs. Stochastic Policies
Stochastic policies are fine in most settings
Sometimes inappropriate, e.g., medical treatment
In MDPs, requiring deterministic policies is not restrictive
Optimal value attainable with deterministic stationary policy:
π*(s) = arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 46 / 112
Deterministic vs. Stochastic Policies
Stochastic policies are fine in most settings
Sometimes inappropriate, e.g., medical treatment
In MDPs, requiring deterministic policies is not restrictive
Optimal value attainable with deterministic stationary policy:
π*(s) = arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')]
Similar for MOMDPs with linear scalarization
MOMDPs with nonlinear scalarization:
I Stochastic policies may be preferable if allowed
I Nonstationary policies may be preferable otherwise
Whiteson & Roijers Multi-Objective Planning July 7, 2018 46 / 112
White’s Example (1982)
3 actions: R(a1 ) = (3, 0), R(a2 ) = (0, 3), R(a3 ) = (1, 1)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 47 / 112
White’s Example (1982)
3 actions: R(a1 ) = (3, 0), R(a2 ) = (0, 3), R(a3 ) = (1, 1)
3 deterministic stationary policies, all Pareto-optimal:
V^{π_1} = (3/(1−γ), 0),  V^{π_2} = (0, 3/(1−γ)),  V^{π_3} = (1/(1−γ), 1/(1−γ))
Whiteson & Roijers Multi-Objective Planning July 7, 2018 47 / 112
White’s Example (1982)
3 actions: R(a1 ) = (3, 0), R(a2 ) = (0, 3), R(a3 ) = (1, 1)
3 deterministic stationary policies, all Pareto-optimal:
V^{π_1} = (3/(1−γ), 0),  V^{π_2} = (0, 3/(1−γ)),  V^{π_3} = (1/(1−γ), 1/(1−γ))
π_ns alternates between a1 and a2, starting with a1:
V^{π_ns} = (3/(1−γ²), 3γ/(1−γ²))
Whiteson & Roijers Multi-Objective Planning July 7, 2018 47 / 112
White’s Example (1982)
3 actions: R(a1 ) = (3, 0), R(a2 ) = (0, 3), R(a3 ) = (1, 1)
3 deterministic stationary policies, all Pareto-optimal:
V^{π_1} = (3/(1−γ), 0),  V^{π_2} = (0, 3/(1−γ)),  V^{π_3} = (1/(1−γ), 1/(1−γ))
π_ns alternates between a1 and a2, starting with a1:
V^{π_ns} = (3/(1−γ²), 3γ/(1−γ²))
Thus π_ns ≻_P π_3 when γ ≥ 0.5, e.g., for γ = 0.5 and f(V^π) = V_1^π · V_2^π:
f(V^{π_1}) = f(V^{π_2}) = 0,  f(V^{π_3}) = 4,  f(V^{π_ns}) = 8
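A few lines of Python, only re-computing the numbers stated above, confirm the claim for γ = 0.5:

```python
# Numeric check of White's example for gamma = 0.5 (values as in the slides).
gamma = 0.5
V_pi1 = (3 / (1 - gamma), 0.0)
V_pi2 = (0.0, 3 / (1 - gamma))
V_pi3 = (1 / (1 - gamma), 1 / (1 - gamma))
V_ns  = (3 / (1 - gamma**2), 3 * gamma / (1 - gamma**2))

f = lambda v: v[0] * v[1]          # nonlinear scalarization f(V) = V_1 * V_2
print(V_pi3, V_ns)                 # (2.0, 2.0) vs (4.0, 2.0): pi_ns Pareto-dominates pi_3
print(f(V_pi1), f(V_pi2), f(V_pi3), f(V_ns))  # 0.0 0.0 4.0 8.0
```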
[Figure: four panels plotting V_2 against V_1 for γ = 0.3, 0.5, 0.7, and 0.95]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 47 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: radiation vs. chemotherapy
Whiteson & Roijers Multi-Objective Planning July 7, 2018 48 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 49 / 112
Mixture Policies
A mixture policy π_m selects the i-th policy from a set of N deterministic policies with probability p_i, where Σ_i p_i = 1
Values are a convex combination of the values of the constituent policies
In White's example, replace π_ns by π_m:
V^{π_m} = p_1 V^{π_1} + (1 − p_1) V^{π_2} = ( 3p_1/(1−γ), 3(1−p_1)/(1−γ) )
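A minimal sketch of the mixture-policy value computation above; the mixing probability p_1 = 0.25 is a made-up choice:

```python
import numpy as np

def mixture_value(values, probs):
    """Value of a mixture policy: the convex combination of the values of its
    constituent (deterministic) policies."""
    values, probs = np.asarray(values, dtype=float), np.asarray(probs, dtype=float)
    assert np.all(probs >= 0) and np.isclose(probs.sum(), 1.0)
    return probs @ values

gamma, p1 = 0.5, 0.25                     # made-up mixing probability
V_pi1 = [3 / (1 - gamma), 0.0]
V_pi2 = [0.0, 3 / (1 - gamma)]
# matches (3*p1/(1-gamma), 3*(1-p1)/(1-gamma)) = (1.5, 4.5)
print(mixture_value([V_pi1, V_pi2], [p1, 1 - p1]))
```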
Whiteson & Roijers Multi-Objective Planning July 7, 2018 50 / 112
Problem Taxonomy
single policy multiple policies (unknown
(known weights) weights or decision support)
deterministic stochastic deterministic stochastic
linear one deterministic convex coverage set of
scalarization stationary policy deterministic stationary
policies
monotonically one one mixture Pareto convex
increasing deterministic policy of two coverage set coverage set
scalarization non- or more of of
stationary deterministic deterministic deterministic
policy stationary non- stationary
policies stationary policies
policies
Example: studying vs. networking
Whiteson & Roijers Multi-Objective Planning July 7, 2018 51 / 112
Problem Taxonomy
single policy multiple policies (unknown
(known weights) weights or decision support)
deterministic stochastic deterministic stochastic
linear one deterministic convex coverage set of
scalarization stationary policy deterministic stationary
policies
monotonically one one mixture Pareto convex
increasing deterministic policy of two coverage set coverage set
scalarization non- or more of of
stationary deterministic deterministic deterministic
policy stationary non- stationary
policies stationary policies
policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 52 / 112
Pareto Sets
Definition
The Pareto front is the set of all policies that are not Pareto dominated:
PF(Π) = {π : π ∈ Π ∧ ¬∃(π' ∈ Π) V^{π'} ≻_P V^π}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 53 / 112
Pareto Sets
Definition
The Pareto front is the set of all policies that are not Pareto dominated:
PF(Π) = {π : π ∈ Π ∧ ¬∃(π' ∈ Π) V^{π'} ≻_P V^π}
Definition
A Pareto coverage set is a subset of PF(Π) such that, for every π' ∈ Π, it contains a policy that either dominates π' or has equal value to π':
PCS(Π) ⊆ PF(Π) ∧ ∀(π' ∈ Π)(∃π) [π ∈ PCS(Π) ∧ (V^π ≻_P V^{π'} ∨ V^π = V^{π'})]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 53 / 112
Visualization
[Figure: value vectors in objective space (V_0 vs. V_1) and scalarized values in weight space (V_w vs. w_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 54 / 112
Visualization
[Figure: value vectors in objective space (V_0 vs. V_1) and scalarized values in weight space (V_w vs. w_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 55 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: radiation vs. chemotherapy (again)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 56 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: radiation vs. chemotherapy (again)
Note: the only setting that requires a Pareto front!
Whiteson & Roijers Multi-Objective Planning July 7, 2018 56 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 57 / 112
Mixture Policies
A CCS(ΠDS ) is also a CCS(Π) but not necessarily a PCS(Π)
But a PCS(Π) can be made by mixing policies in a CCS(ΠDS )
[Figure: values of mixture policies in objective space (V_0 vs. V_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 58 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: studying vs. networking (again)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 59 / 112
Part 2: Methods and Applications
Convex Coverage Set Planning Methods
I Inner Loop: Convex Hull Value Iteration
I Outer Loop: Optimistic Linear Support
Pareto Coverage Set Planning Methods
I Inner loop (non-stationary): Pareto-Q
I Outer loop issues
Interactive Online MORL: Interactive Thompson Sampling
Applications
Whiteson & Roijers Multi-Objective Planning July 7, 2018 60 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 61 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Known transition and reward functions → planning
Whiteson & Roijers Multi-Objective Planning July 7, 2018 61 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Known transition and reward functions → planning
Unknown transition and reward functions → learning
Whiteson & Roijers Multi-Objective Planning July 7, 2018 61 / 112
Background: Value Iteration
Initial value estimate V_0(s)
Apply Bellman backups until convergence:
V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V_k(s')]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 62 / 112
Background: Value Iteration
Initial value estimate V_0(s)
Apply Bellman backups until convergence:
V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V_k(s')]
Can also be written:
V_{k+1}(s) ← max_a Q_{k+1}(s, a),
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s')[R(s, a, s') + γ V_k(s')]
Optimal policy is easy to retrieve from Q-table
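A minimal Python sketch of this backup (not from the slides), assuming tabular T[s, a, s'] and R[s, a, s'] given as NumPy arrays; the toy MDP at the end is made up:

```python
import numpy as np

def value_iteration(T, R, gamma, eps=1e-8):
    """Standard single-objective value iteration.
    T[s, a, s'] transition probabilities, R[s, a, s'] scalar rewards."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_s' T[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum('ijk,ijk->ij', T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)   # optimal values and a greedy policy
        V = V_new

# toy 2-state, 2-action MDP (made-up numbers)
T = np.zeros((2, 2, 2)); T[:, 0, 0] = 1.0; T[:, 1, 1] = 1.0   # a0 -> s0, a1 -> s1
R = np.zeros((2, 2, 2)); R[:, 1, 1] = 1.0                     # reward for reaching s1
print(value_iteration(T, R, gamma=0.9))
```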
Whiteson & Roijers Multi-Objective Planning July 7, 2018 62 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 63 / 112
Scalarize MOMDP + Value Iteration
For known w
V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 64 / 112
Scalarize MOMDP + Value Iteration
For known w
V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ]
Scalarize reward function of MOMDP:
R_w = w · R
Whiteson & Roijers Multi-Objective Planning July 7, 2018 64 / 112
Scalarize MOMDP + Value Iteration
For known w
V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ]
Scalarize reward function of MOMDP:
R_w = w · R
Apply standard VI
Whiteson & Roijers Multi-Objective Planning July 7, 2018 64 / 112
Scalarize MOMDP + Value Iteration
For known w
V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ]
Scalarize reward function of MOMDP:
R_w = w · R
Apply standard VI
Does not return multi-objective value
Whiteson & Roijers Multi-Objective Planning July 7, 2018 64 / 112
Scalarized Value Iteration
Adapt Bellman backup:
w · V_{k+1}(s) ← max_a w · Q_{k+1}(s, a),
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s')[R(s, a, s') + γ V_k(s')]
Returns multi-objective value.
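One way to read the adapted backup in Python (an illustrative sketch, not the authors' implementation), assuming tabular T[s, a, s'] and vector-valued R[s, a, s', o]:

```python
import numpy as np

def scalarized_value_iteration(T, R, w, gamma, n_iter=1000):
    """Scalarized VI sketch: maximise w . Q but keep vector-valued estimates,
    so the multi-objective value of the chosen policy is returned.
    T[s, a, s'] probabilities, R[s, a, s', o] vector rewards, w[o] weights."""
    n_states, n_actions, _, n_obj = R.shape
    V = np.zeros((n_states, n_obj))
    for _ in range(n_iter):
        # Q[s, a, o] = sum_s' T[s, a, s'] * (R[s, a, s', o] + gamma * V[s', o])
        Q = np.einsum('sap,sapo->sao', T, R + gamma * V[None, None, :, :])
        best_a = (Q @ w).argmax(axis=1)        # action maximising w . Q[s, a]
        V = Q[np.arange(n_states), best_a]     # keep the vector value, not w . Q
    return V, best_a
```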
Whiteson & Roijers Multi-Objective Planning July 7, 2018 65 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 66 / 112
Inner versus Outer Loop
[Diagram: a single-objective (SO) method, a multi-objective (MO) inner-loop method, and an MO outer-loop method]
Inner loop
I Adapting operators of single objective method (e.g., value iteration)
I Series of multi-objective operations (e.g. Bellman backups)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 67 / 112
Inner versus Outer Loop
[Diagram: a single-objective (SO) method, a multi-objective (MO) inner-loop method, and an MO outer-loop method]
Inner loop
I Adapting operators of single objective method (e.g., value iteration)
I Series of multi-objective operations (e.g. Bellman backups)
Outer loop
I Single objective method as subroutine
I Series of single-objective problems
Whiteson & Roijers Multi-Objective Planning July 7, 2018 67 / 112
Inner Loop: Convex Hull Value Iteration
Barrett & Narayanan (2008)
Idea: do the backup for all w in parallel
New backup operators must handle sets of values.
At backup:
I generate all value vectors for s, a-pair
I prune away those that are not optimal for any w
Only need deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 68 / 112
Inner Loop: Convex Hull Value Iteration
Initial set of value vectors, e.g., V0 (s) = {(0, 0)}
All possible value vectors:
Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') (R(s, a, s') + γ V_k(s'))
where u + V = {u + v : v ∈ V}, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 69 / 112
Inner Loop: Convex Hull Value Iteration
Initial set of value vectors, e.g., V0 (s) = {(0, 0)}
All possible value vectors:
Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') (R(s, a, s') + γ V_k(s'))
where u + V = {u + v : v ∈ V}, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V}
Prune value vectors:
V_{k+1}(s) ← CPrune( ∪_a Q_{k+1}(s, a) )
CPrune uses linear programs (e.g., Roijers et al. (2015))
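A sketch of the set-based backup and an LP-based pruning test in this spirit (an illustration, not the authors' implementation), assuming tabular T[s, a, s'] and vector R[s, a, s'] as NumPy arrays and using scipy.optimize.linprog; after removing duplicates, a vector is kept only if some w makes it strictly best:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def c_prune(vectors, eps=1e-9):
    """Keep only vectors that are strictly best for some w (w >= 0, sum(w) = 1).
    One LP per candidate v: maximise x s.t. w . (v - u) >= x for every other u."""
    vectors = np.unique(np.asarray(vectors, dtype=float).round(10), axis=0)
    kept = []
    for i, v in enumerate(vectors):
        others = np.delete(vectors, i, axis=0)
        if len(others) == 0:
            kept.append(v)
            continue
        d = len(v)
        c = np.zeros(d + 1); c[-1] = -1.0                           # minimise -x
        A_ub = np.hstack([others - v, np.ones((len(others), 1))])   # w.(u - v) + x <= 0
        b_ub = np.zeros(len(others))
        A_eq = np.hstack([np.ones((1, d)), np.zeros((1, 1))])       # sum(w) = 1
        b_eq = np.array([1.0])
        bounds = [(0, None)] * d + [(None, None)]                   # w >= 0, x free
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        if res.success and -res.fun > eps:   # optimal x > 0: best for some w
            kept.append(v)
    return kept

def chvi_backup(T, R, V_sets, gamma, s, a):
    """Set-based Q backup: cross-sum over successor states, picking one value
    vector per successor, then pruning. The full backup is then
    V_{k+1}(s) = c_prune(union over a of chvi_backup(..., s, a))."""
    succs = [sp for sp in range(T.shape[2]) if T[s, a, sp] > 0]
    q_set = [sum(T[s, a, sp] * (np.asarray(R[s, a, sp]) + gamma * np.asarray(vp))
                 for sp, vp in zip(succs, choice))
             for choice in product(*[V_sets[sp] for sp in succs])]
    return c_prune(q_set)
```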
Whiteson & Roijers Multi-Objective Planning July 7, 2018 69 / 112
CHVI Example
Extremely simple MOMDP:
1 state: s; 2 actions: a1 and a2
Deterministic transitions
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
V_0(s) = {(0, 0)}
[Plot: scalarized value V_w as a function of w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 70 / 112
CHVI Example
Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5
Iteration 1:
V0 (s) = {(0, 0)}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 71 / 112
CHVI Example
Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5
Iteration 1:
V0 (s) = {(0, 0)}
Q1 (s, a1 ) = {(2, 0)}
Q1 (s, a2 ) = {(0, 2)}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 71 / 112
CHVI Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 1:
V_0(s) = {(0, 0)}
Q_1(s, a1) = {(2, 0)}
Q_1(s, a2) = {(0, 2)}
V_1(s) = CPrune( ∪_a Q_1(s, a) ) = {(2, 0), (0, 2)}
[Plot: scalarized values V_w of the vectors in V_1(s) over w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 71 / 112
CHVI Example
Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5
Iteration 2:
V1 (s) = {(2, 0), (0, 2)}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 72 / 112
CHVI Example
Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5
Iteration 2:
V1 (s) = {(2, 0), (0, 2)}
Q2 (s, a1 ) = {(3, 0), (2, 1)}
Q2 (s, a2 ) = {(1, 2), (0, 3)}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 72 / 112
CHVI Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 2:
V_1(s) = {(2, 0), (0, 2)}
Q_2(s, a1) = {(3, 0), (2, 1)}
Q_2(s, a2) = {(1, 2), (0, 3)}
V_2(s) = CPrune({(3, 0), (2, 1), (1, 2), (0, 3)})
[Plot: scalarized values V_w of the candidate vectors over w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 72 / 112
CHVI Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 2:
V_1(s) = {(2, 0), (0, 2)}
Q_2(s, a1) = {(3, 0), (2, 1)}
Q_2(s, a2) = {(1, 2), (0, 3)}
V_2(s) = {(3, 0), (0, 3)}
[Plot: scalarized values V_w over w_1; (2, 1) and (1, 2) are pruned]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 73 / 112
CHVI Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 3:
V_2(s) = {(3, 0), (0, 3)}
Q_3(s, a1) = {(3.5, 0), (2, 1.5)}
Q_3(s, a2) = {(1.5, 2), (0, 3.5)}
V_3(s) = CPrune({(3.5, 0), (2, 1.5), (1.5, 2), (0, 3.5)}) = {(3.5, 0), (0, 3.5)}
[Plot: scalarized values V_w of the candidate vectors over w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 74 / 112
Convex Hull Value Iteration
CPrune retains at least one optimal vector for each w
Therefore, Vw that would have been computed by VI is kept
CHVI does not retain excess value vectors
Whiteson & Roijers Multi-Objective Planning July 7, 2018 75 / 112
Convex Hull Value Iteration
CPrune retains at least one optimal vector for each w
Therefore, Vw that would have been computed by VI is kept
CHVI does not retain excess value vectors
CHVI generates a lot of excess value vectors
Removal with linear programs (CPrune) is expensive
Whiteson & Roijers Multi-Objective Planning July 7, 2018 75 / 112
Outer Loop
[Diagram: a single-objective (SO) method, a multi-objective (MO) inner-loop method, and an MO outer-loop method]
Repeatedly calls a single-objective solver
Generic multi-objective method
I multi-objective coordination graphs
I multi-objective (multi-agent) MDPs
I multi-objective partially observable MDPs
Whiteson & Roijers Multi-Objective Planning July 7, 2018 76 / 112
Outer Loop: Optimistic Linear Support
Optimistic linear support (OLS) adapts and improves linear support
for POMDPs (Cheng (1988))
Solves scalarized instances for specific w
Whiteson & Roijers Multi-Objective Planning July 7, 2018 77 / 112
Outer Loop: Optimistic Linear Support
Optimistic linear support (OLS) adapts and improves linear support
for POMDPs (Cheng (1988))
Solves scalarized instances for specific w
Terminates after checking only a finite number of weights
Returns exact CCS
Whiteson & Roijers Multi-Objective Planning July 7, 2018 77 / 112
Linear Support
[Plot: scalarized value V_w over weight w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 78 / 112
Linear Support
[Plot: scalarized value V_w over weight w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 79 / 112
Linear Support
[Plot: scalarized value V_w over weight w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 80 / 112
Linear Support
[Plot: scalarized value V_w over weight w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 81 / 112
Optimistic Linear Support
[Plot: scalarized value u_w over w_1 for value vectors (1, 8), (5, 6), and (7, 2); at the corner weight w_c between known vectors, Δ is the maximal possible improvement]
Priority queue, Q, for corner weights
Maximal possible improvement ∆ as priority
Stop when ∆ < ε
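A sketch of the two-objective special case, where OLS essentially reduces to a dichotomic search over corner weights; solve_scalarized stands in for any single-objective (scalarized) solver, and Δ appears as the improvement test at each corner weight:

```python
import numpy as np

def linear_support_2obj(solve_scalarized, eps=1e-6):
    """Sketch of linear support / OLS for 2 objectives. solve_scalarized(w)
    must return the value VECTOR of a policy optimal for weight vector w."""
    def scal(v, w):
        return float(np.dot(v, w))

    # solve the two extreme scalarizations first
    v_left = np.asarray(solve_scalarized(np.array([0.0, 1.0])), dtype=float)
    v_right = np.asarray(solve_scalarized(np.array([1.0, 0.0])), dtype=float)
    if np.allclose(v_left, v_right):
        return [v_left]
    ccs = [v_left, v_right]
    stack = [(v_left, v_right)]        # pairs of neighbouring CCS vectors to examine
    while stack:
        u, v = stack.pop()
        # corner weight where w.u == w.v, with w = (w1, 1 - w1)
        denom = (u[0] - u[1]) - (v[0] - v[1])
        if abs(denom) < 1e-12:
            continue
        w1 = (v[1] - u[1]) / denom
        if not 0.0 < w1 < 1.0:
            continue
        w = np.array([w1, 1.0 - w1])
        new_v = np.asarray(solve_scalarized(w), dtype=float)
        delta = scal(new_v, w) - scal(u, w)   # improvement at the corner weight
        if delta > eps:                       # OLS uses this delta as queue priority
            ccs.append(new_v)
            stack.append((u, new_v))
            stack.append((new_v, v))
    return ccs
```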
Whiteson & Roijers Multi-Objective Planning July 7, 2018 82 / 112
Optimistic Linear Support
Solving scalarized instance not always possible
ε-approximate solver
Produces an ε-CCS
Whiteson & Roijers Multi-Objective Planning July 7, 2018 83 / 112
Comparing Inner and Outer Loop
OLS (outer loop) advantages
I Any (cooperative) multi-objective decision problem
I Any single-objective / scalarized subroutine
I Inherits quality guarantees
I Faster for small and medium numbers of objectives
Inner loop faster for large numbers of objectives
Whiteson & Roijers Multi-Objective Planning July 7, 2018 84 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 85 / 112
Inner Loop: Pareto-Q
Similar to CHVI
Different pruning operator
Pairwise comparisons: V(s) ≻_P V'(s)
Comparisons are cheaper, but there are many more vectors
Converges to correct Pareto coverage set (White (1982))
Executing a policy is no longer trivial (Van Moffaert & Nowé (2014))
Whiteson & Roijers Multi-Objective Planning July 7, 2018 86 / 112
Inner Loop: Pareto-Q
Compute all possible vectors:
Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') (R(s, a, s') + γ V_k(s'))
where u + V = {u + v : v ∈ V}, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 87 / 112
Inner Loop: Pareto-Q
Compute all possible vectors:
Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') (R(s, a, s') + γ V_k(s'))
where u + V = {u + v : v ∈ V}, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V}
Take the union across a
Prune Pareto-dominated vectors:
V_{k+1}(s) ← PPrune( ∪_a Q_{k+1}(s, a) )
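A sketch of the Pareto pruning operator in Python; the cross-sum backup itself is the same as in the CHVI sketch earlier, only the pruning changes:

```python
import numpy as np

def p_prune(vectors):
    """Keep only Pareto-undominated (and non-duplicate) value vectors,
    using pairwise comparisons instead of linear programs."""
    vectors = [np.asarray(v, dtype=float) for v in vectors]
    kept = []
    for v in vectors:
        dominated = any(np.all(u >= v) and np.any(u > v) for u in vectors)
        if not dominated and not any(np.allclose(v, u) for u in kept):
            kept.append(v)
    return kept

# Full backup (cf. the CHVI sketch): V_{k+1}(s) <- p_prune(union over a of Q_{k+1}(s, a))
```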
Whiteson & Roijers Multi-Objective Planning July 7, 2018 87 / 112
Pareto-Q Example
Extremely simple MOMDP:
1 state: s; 2 actions: a1 and a2
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
V_0(s) = {(0, 0)}
[Plot: value vectors in objective space (V_0 vs. V_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 88 / 112
Pareto-Q Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 1:
V_0(s) = {(0, 0)}
Q_1(s, a1) = {(2, 0)}
Q_1(s, a2) = {(0, 2)}
V_1(s) = PPrune( ∪_a Q_1(s, a) ) = {(2, 0), (0, 2)}
[Plot: the vectors of V_1(s) in objective space (V_0 vs. V_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 89 / 112
Pareto-Q Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 2:
V_1(s) = {(2, 0), (0, 2)}
Q_2(s, a1) = {(3, 0), (2, 1)}
Q_2(s, a2) = {(1, 2), (0, 3)}
V_2(s) = PPrune({(3, 0), (2, 1), (1, 2), (0, 3)})
[Plot: the candidate vectors in objective space (V_0 vs. V_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 90 / 112
Pareto-Q Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 3:
V_2(s) = {(3, 0), (2, 1), (1, 2), (0, 3)}
Q_3(s, a1) = {(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5)}
Q_3(s, a2) = {(1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)}
V_3(s) = PPrune({(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5), (1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)})
[Plot: the candidate vectors in objective space (V_0 vs. V_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 91 / 112
Inner Loop: Pareto-Q
PCS size can explode
No longer deterministic
Cannot read policy from Q-table
Except for first action
Whiteson & Roijers Multi-Objective Planning July 7, 2018 92 / 112
Inner Loop: Pareto-Q
PCS size can explode
No longer deterministic
Cannot read policy from Q-table
Except for first action
“Track” a policy during execution (Van Moffaert & Nowé (2014))
I For deterministic transitions: s, a → s'
I From Q_{t=0}(s, a) subtract R(s, a)
I Correct for the discount factor → V_{t=1}(s')
I Find V_{t=1}(s') in the Q-tables for s'
For stochastic transitions, see Kristof Van Moffaert's PhD thesis
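A rough sketch of this tracking procedure for deterministic transitions; Q_sets, next_state, and R here are assumed (hypothetical) containers for the quantities named above, not part of any particular library:

```python
import numpy as np

def track_policy(Q_sets, R, next_state, s0, v_target, gamma, horizon):
    """Track a selected value vector during execution (deterministic transitions).
    Assumed helpers: Q_sets[s][a] is the set of Q-vectors from Pareto-Q,
    R[s][a] the reward vector, next_state(s, a) the deterministic successor."""
    s, target = s0, np.asarray(v_target, dtype=float)
    actions = []
    for _ in range(horizon):
        s_next = None
        for a, q_vecs in enumerate(Q_sets[s]):
            # find an action whose Q-set contains the vector we want to realise
            match = next((q for q in q_vecs if np.allclose(q, target)), None)
            if match is not None:
                actions.append(a)
                # subtract the immediate reward and undo the discount
                target = (np.asarray(match) - np.asarray(R[s][a])) / gamma
                s_next = next_state(s, a)
                break
        if s_next is None:
            raise ValueError("target vector not found in any Q-set")
        s = s_next
    return actions
```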
Whiteson & Roijers Multi-Objective Planning July 7, 2018 92 / 112
Outer Loop?
[Diagram: a single-objective (SO) method, a multi-objective (MO) inner-loop method, and an MO outer-loop method]
Outer loop very difficult:
V_w^π = f(E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ], w) ≠ E[ Σ_{k=0}^∞ γ^k f(r_{t+k+1}, w) ]
Maximization does not do the trick!
Heuristic with non-linear f (Van Moffaert, Drugan, Nowé (2013))
Not guaranteed to find optimal policy, or converge
Whiteson & Roijers Multi-Objective Planning July 7, 2018 93 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 94 / 112
Part 2: Methods and Applications
Convex Coverage Set Planning Methods
I Inner Loop: Convex Hull Value Iteration
I Outer Loop: Optimistic Linear Support
Pareto Coverage Set Planning Methods
I Inner loop (non-stationary): Pareto-Q
I Outer loop issues
Interactive Online MORL: Interactive Thompson Sampling
Applications
Whiteson & Roijers Multi-Objective Planning July 7, 2018 95 / 112
Online Interactive Decision Support
[Diagram: a combined learning and execution phase in which the learning algorithm interacts with both the environment and the user, followed by an execution-only phase with a single solution]
Simultaneous interaction with the environment and the decision maker
Whiteson & Roijers Multi-Objective Planning July 7, 2018 96 / 112
Multi-Objective Multi-Armed Bandits
Definition
A multi-objective multi-armed bandit (MOMAB) (Drugan & Nowé, 2013)
is a tuple ⟨A, P⟩ where
A is a finite set of actions or arms, and
P is a set of probability density functions, Pa (r) : Rd → [0, 1] over
vector-valued rewards r of length d, associated with each arm a ∈ A.
Whiteson & Roijers Multi-Objective Planning July 7, 2018 97 / 112
Multi-Objective Multi-Armed Bandits
Definition
A multi-objective multi-armed bandit (MOMAB) (Drugan & Nowé, 2013)
is a tuple ⟨A, P⟩ where
A is a finite set of actions or arms, and
P is a set of probability density functions, Pa (r) : Rd → [0, 1] over
vector-valued rewards r of length d, associated with each arm a ∈ A.
Can be seen as a single-state MOMDP
Whiteson & Roijers Multi-Objective Planning July 7, 2018 97 / 112
(Single-objective) Thompson Sampling
If we know the scalarization function: (SO) multi-armed bandit
Thompson sampling (Thompson, 1933) empirically best
I Basic idea: maintain posterior distributions over the mean reward µ_{a_i} of each arm
I Sample a mean reward for each a_i from its posterior
I Execute the action with the highest sampled mean reward
[Plot: posterior density p over µ(a) after 1, 3, and 10 pulls]
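A minimal single-objective Thompson sampling sketch for a Gaussian bandit (illustrative prior and noise model, made-up arm means):

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_step(counts, sums, true_means, sigma=1.0, prior_var=100.0):
    """One round of Thompson sampling with known reward noise sigma and a broad
    zero-mean Gaussian prior on each arm's mean (an illustrative choice)."""
    # Gaussian posterior over each arm's mean reward
    post_var = 1.0 / (1.0 / prior_var + counts / sigma**2)
    post_mean = post_var * (sums / sigma**2)
    # sample a mean for every arm, play the arm with the highest sample
    samples = rng.normal(post_mean, np.sqrt(post_var))
    a = int(np.argmax(samples))
    r = rng.normal(true_means[a], sigma)
    counts[a] += 1
    sums[a] += r
    return a, r

true_means = np.array([0.5, 1.0, 1.5])      # made-up single-objective bandit
counts, sums = np.zeros(3), np.zeros(3)
for t in range(1000):
    thompson_step(counts, sums, true_means)
print(counts)                                # most pulls should go to the best arm
```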
Whiteson & Roijers Multi-Objective Planning July 7, 2018 98 / 112
Multi-Objective Challenges
No maximising action?
I Model the scalarization (/utility) function explicitly
I Learn about this function through user interaction
Cannot access scalarization function directly
I Only pairwise preferences
Whiteson & Roijers Multi-Objective Planning July 7, 2018 99 / 112
Interactive Thompson Sampling
[Diagram: ITS interacts with the MOMAB, selecting arms a_1(t) and observing rewards r(t), and with the user, proposing comparisons between sampled mean vectors µ_{θ',a_1}(t) and µ_{θ',a_2}(t) and observing preferences µ_x ≻ µ_y]
(Roijers, Zintgraf, & Nowé, 2017)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 100 / 112
Interactive Thompson Sampling
[Diagram: ITS interacts with the MOMAB, selecting arms a_1(t) and observing rewards r(t), and with the user, proposing comparisons between sampled mean vectors µ_{θ',a_1}(t) and µ_{θ',a_2}(t) and observing preferences µ_x ≻ µ_y]
(Roijers, Zintgraf, & Nowé, 2017)
Open question: when to ask the user for preferences
Whiteson & Roijers Multi-Objective Planning July 7, 2018 100 / 112
Interactive Thompson Sampling
Action selection
I Sample vector-valued means from multi-variate posterior mean reward
distributions.
I Sample utility function from the posterior over utility functions
I Scalarize reward vectors with the sampled utility function
I Take maximizing action
When to query the user for preferences?
I Sample vector-valued means and utility function again
I Query when the maximising actions disagree with the first set of
samples
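A sketch of the action-selection and query rule described above; the two samplers are assumed (hypothetical) helpers, and the toy usage at the end uses made-up posteriors:

```python
import numpy as np

def its_select_action(sample_arm_means, sample_utility):
    """Action selection and query test following the bullets above.
    Assumed helpers:
      sample_arm_means() -> array of shape (n_arms, n_objectives)
      sample_utility()   -> a function mapping a value vector to a scalar utility."""
    def best_action():
        mu = sample_arm_means()            # one posterior sample per arm
        u = sample_utility()               # one sampled utility function
        return int(np.argmax([u(v) for v in mu]))

    a_first, a_second = best_action(), best_action()
    ask_user = a_first != a_second         # query a pairwise preference on disagreement
    return a_first, ask_user

# toy usage with made-up posteriors: two arms, two objectives, linear utility samples
rng = np.random.default_rng(0)
sample_means = lambda: rng.normal([[1.0, 0.0], [0.0, 1.0]], 0.3)
sample_util = lambda: (lambda v, w=rng.dirichlet([1.0, 1.0]): float(w @ v))
print(its_select_action(sample_means, sample_util))
```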
Whiteson & Roijers Multi-Objective Planning July 7, 2018 101 / 112
Interactive Thompson Sampling: Some Results
Linear utility functions (Roijers, Zintgraf, & Nowé, 2017)
NB: cumulative regret (optimal scalarized reward minus actual scalarized
reward) rather than cumulative reward
Whiteson & Roijers Multi-Objective Planning July 7, 2018 102 / 112
Conclusions Interactive Thompson Sampling
It is possible to learn about the user and the environment
simultaneously
For linear scalarization function: hardly any additional regret
Non-linear scalarization function, Gaussian processes as model of
utility function [Talk @ ALA]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 103 / 112
Part 2: Methods and Applications
Convex Coverage Set Planning Methods
I Inner Loop: Convex Hull Value Iteration
I Outer Loop: Optimistic Linear Support
Pareto Coverage Set Planning Methods
I Inner loop (non-stationary): Pareto-Q
I Outer loop issues
Interactive Online MORL: Interactive Thompson Sampling
Applications
Whiteson & Roijers Multi-Objective Planning July 7, 2018 104 / 112
Treatment planning
Lizotte (2010, 2012)
I Maximizing effectiveness of
the treatment
I Minimizing the severity of the
side-effects
Finite-horizon MOMDPs
Deterministic policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 105 / 112
Epidemic control
Anthrax response (Soh &
Demiris (2011))
I Minimizing loss of life
I Minimizing number of false
alarms
I Minimizing cost of
investigation
Partial observability
(MOPOMDP)
Finite-state controllers
Evolutionary method
Pareto coverage set
Whiteson & Roijers Multi-Objective Planning July 7, 2018 106 / 112
Semi-autonomous wheelchairs
Control system for wheelchairs
(Soh & Demiris (2011))
I Maximizing safety
I Maximizing speed
I Minimizing power
consumption.
Partial observability
(MOPOMDP)
Finite-state controllers
Evolutionary method
Pareto coverage set
Whiteson & Roijers Multi-Objective Planning July 7, 2018 107 / 112
Broader Application
“Probabilistic Planning is Multi-objective” — Bryce et al. (2007)
I The expected return is not enough
I Cost of a plan
I Probability of success of a plan
I Non-goal terminal states
Whiteson & Roijers Multi-Objective Planning July 7, 2018 108 / 112
Broader Application
“Human-aligned artificial intelligence is a multiobjective problem” –
Vamplew et al., 2018
I Philosophy journal (ethics)
I Decision problems have ethical implications
I Ethical decision-making always involves trade-offs
I To align this with people’s convictions and preferences is not trivial
I Multi-objective, whatever ethical framework you use
Whiteson & Roijers Multi-Objective Planning July 7, 2018 109 / 112
Broader Application
“Tim O’Reilly says the economy is running on the wrong algorithm” –
Wired
I Companies typically only try to optimise profit
I This is bad, as consumers experience the negative effects of this
I Consumers are customers
I Very bad as a long-term strategy
Whiteson & Roijers Multi-Objective Planning July 7, 2018 110 / 112
Closing
Consider multiple objectives
I most problems have them
I a priori scalarization can be bad
Derive your solution set
I Pareto front often not necessary
Exciting growing field
Promising applications
Whiteson & Roijers Multi-Objective Planning July 7, 2018 111 / 112
At these conferences
AAMAS
I Luisa M. Zintgraf, Diederik M. Roijers, Sjoerd Linders, Catholijn M.
Jonker, Ann Nowé — Ordered Preference Elicitation Strategies for
Supporting Multi-Objective Decision Making
ICML
I Wenlong Lyu, Fan Yang, Changhao Yan, Dian Zhou, Xuan Zeng —
Batch Bayesian Optimization via Multi-objective Acquisition Ensemble
for Automated Analog Circuit Design
I Eugenio Bargiacchi, Timothy Verstraeten, Diederik M. Roijers, Ann
Now, Hado van Hasselt — Learning to Coordinate with Coordination
Graphs in Repeated Single-Stage Multi-Agent Decision Problems
Whiteson & Roijers Multi-Objective Planning July 7, 2018 112 / 112
At these conferences
IJCAI
I Chao Bian, Chao Qian, Ke Tang — A General Approach to Running
Time Analysis of Multi-objective Evolutionary Algorithms
I Miguel Terra-Neves, Ines Lynce, Vasco Manquinho — Stratification for
Constraint-Based Multi-Objective Combinatorial Optimization
ALA workshop
I Diederik M. Roijers, Denis Steckelmacher, Ann Nowé —
Multi-objective Reinforcement Learning for the Expected Utility of the
Return
I Diederik M. Roijers, Luisa M. Zintgraf, Pieter Libin, Ann Nowé —
Interactive Multi-Objective Reinforcement Learning in Multi-Armed
Bandits for Any Utility Function
Whiteson & Roijers Multi-Objective Planning July 7, 2018 113 / 112