Multi-Objective Planning & Learning
Shimon Whiteson & Diederik M. Roijers
Department of Computer Science
University of Oxford
Computational Intelligence
Vrije Universiteit Amsterdam
July 7, 2018
Whiteson & Roijers Multi-Objective Planning July 7, 2018 1 / 112
Schedule
08:30-09:15: Motivation & Concepts (Shimon)
09:15-09:30: Short Break
09:30-10:15: Motivation & Concepts cont’d (Shimon)
10:15-10:45: Coffee Break
10:45-11:30: Methods (Diederik)
11:30-11:45: Short Break
11:45-12:30: Methods & Applications (Diederik)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 2 / 112
Note
Get the latest version of the slides at:
http://roijers.info/motutorial.html
This tutorial is based on our survey article:
Diederik Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A
Survey of Multi-Objective Sequential Decision-Making. Journal of Artificial
Intelligence Research, 48:67–113, 2013.
and Diederik's dissertation:
Diederik Roijers. Multi-Objective Decision-Theoretic Planning. PhD thesis,
University of Amsterdam, 2016. http://roijers.info/pub/thesis.pdf
Whiteson & Roijers Multi-Objective Planning July 7, 2018 3 / 112
Part 1: Motivation & Concepts
Multi-Objective Motivation
MDPs & MOMDPs
Problem Taxonomy
Solution Concepts
Whiteson & Roijers Multi-Objective Planning July 7, 2018 4 / 112
Medical Treatment
Chance of being cured, having side effects, or dying
Whiteson & Roijers Multi-Objective Planning July 7, 2018 5 / 112
Traffic Coordination
Latency, throughput, fairness, environmental impact, etc.
Whiteson & Roijers Multi-Objective Planning July 7, 2018 6 / 112
Mining Commodities
Gold collected, silver collected
village
mine
[Roijers et al. 2013, 2014]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 7 / 112
Grid World
Getting the treasure, minimising fuel costs
Whiteson & Roijers Multi-Objective Planning July 7, 2018 8 / 112
Do We Need Multi-Objective Models?
Whiteson & Roijers Multi-Objective Planning July 7, 2018 9 / 112
Do We Need Multi-Objective Models?
Sutton's Reward Hypothesis: "All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)."
Source: http://rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html
Whiteson & Roijers Multi-Objective Planning July 7, 2018 9 / 112
Do We Need Multi-Objective Models?
Sutton's Reward Hypothesis: "All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)."
Source: http://rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html
V : Π → R
V^π = E^π[ Σ_t r_t ]
π* = arg max_π V^π
Whiteson & Roijers Multi-Objective Planning July 7, 2018 9 / 112
Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!
V : Π → Rn
Whiteson & Roijers Multi-Objective Planning July 7, 2018 10 / 112
Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!
V : Π → Rn
Objection: why not just scalarize?
Whiteson & Roijers Multi-Objective Planning July 7, 2018 10 / 112
Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!
V : Π → Rn
Objection: why not just scalarize?
Scalarization function projects multi-objective value to a scalar:
V_w^π = f(V^π, w)
Linear case:
V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π
A priori prioritization of the objectives
Whiteson & Roijers Multi-Objective Planning July 7, 2018 10 / 112
Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!
V : Π → Rn
Objection: why not just scalarize?
Scalarization function projects multi-objective value to a scalar:
V_w^π = f(V^π, w)
Linear case:
V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π
A priori prioritization of the objectives
The weak argument is necessary but not sufficient
Whiteson & Roijers Multi-Objective Planning July 7, 2018 10 / 112
Why Multi-Objective Decision Making?
The strong argument: a priori scalarization is sometimes impossible,
infeasible, or undesirable
Instead produce the coverage set of undominated solutions
Yields three scenarios for planning or off-line RL
Yields two scenarios for on-line RL
Whiteson & Roijers Multi-Objective Planning July 7, 2018 11 / 112
Unknown-Weights Planning Scenario
Weights known in execution phase but not in planning phase
Example: mining commodities [Roijers et al. 2013]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 12 / 112
Decision-Support Planning Scenario
Quantifying priorities is infeasible
Choosing between options is easier
Example: medical treatment
Whiteson & Roijers Multi-Objective Planning July 7, 2018 13 / 112
Known Weights Planning Scenario
Scalarization yields intractable problem
Whiteson & Roijers Multi-Objective Planning July 7, 2018 14 / 112
Reinforcement Learning Scenarios
Same scenarios apply for offline RL
For example, unknown-weights scenario becomes:
[Diagram: in the learning phase, a learning algorithm turns a dataset into a coverage set; in the selection phase, scalarization with the revealed weights selects a policy; in the execution phase, the selected policy is executed]
For online RL there are two more scenarios
Whiteson & Roijers Multi-Objective Planning July 7, 2018 15 / 112
Dynamic-Weights Online RL Scenario
Scalarization changes, over time, e.g., market prices
Caching policies for different prices speeds adaptation
[Natarajan & Tadepalli, 2005]
[Diagram: a combined learning and execution phase in which the learning algorithm interacts with the environment, observing rewards, states, and weights]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 16 / 112
Interactive Decision-Support Online RL Scenario
Scalarization initially unknown
Learned via user interaction
Learning from environment and user simultaneously
[Diagram: a combined learning and execution phase in which the learning algorithm interacts with both the environment and the user, followed by an execution-only phase with a single solution]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 17 / 112
Summary of Motivation
Multi-objective methods are useful because many problems are naturally characterized by multiple objectives and cannot be easily scalarized a priori.
Whiteson & Roijers Multi-Objective Planning July 7, 2018 18 / 112
Summary of Motivation
Multi-objective methods are useful because many problems are naturally characterized by multiple objectives and cannot be easily scalarized a priori.
The burden of proof rests with the a priori scalarization, not with the multi-objective modeling.
Whiteson & Roijers Multi-Objective Planning July 7, 2018 18 / 112
Part 1: Motivation & Concepts
Multi-Objective Motivation
MDPs & MOMDPs
Problem Taxonomy
Solution Concepts
Whiteson & Roijers Multi-Objective Planning July 7, 2018 19 / 112
Markov Decision Process (MDP)
A single-objective MDP is a tuple ⟨S, A, T, R, µ, γ⟩ where:
I S is a finite set of states
I A is a finite set of actions
I T : S × A × S → [0, 1] is a transition function
I R : S × A × S → R is a reward function
I µ : S → [0, 1] is a probability distribution over initial states
I γ ∈ [0, 1) is a discount factor
(figure from Poole & Mackworth, Artificial Intelligence:
Foundations of Computational Agents, 2010)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 20 / 112
Returns & Policies
Goal: maximize expected return, which is typically additive:
R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}
A stationary policy conditions only on the current state:
π : S × A → [0, 1]
A deterministic stationary policy maps states directly to actions:
π:S →A
Whiteson & Roijers Multi-Objective Planning July 7, 2018 21 / 112
Value Functions in MDPs
A state-independent value function V^π specifies the expected return when following π from the initial state:
V^π = E[R_0 | π]
A state value function of a policy π:
V^π(s) = E[R_t | π, s_t = s]
The Bellman equation restates this expectation recursively for stationary policies:
V^π(s) = Σ_a π(s, a) Σ_{s'} T(s, a, s')[R(s, a, s') + γ V^π(s')]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 22 / 112
Optimality in MDPs
Theorem
For any additive infinite-horizon single-objective MDP, there exists a
deterministic stationary optimal policy [Howard 1960]
All optimal policies share the same optimal value function:
V*(s) = max_π V^π(s)
V*(s) = max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')]
Extract the optimal policy using local action selection:
π*(s) = arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 23 / 112
Multi-Objective MDP (MOMDP)
Vector-valued reward and value:
R : S × A × S → R^n
V^π = E[ Σ_{k=0}^∞ γ^k r_{k+1} | π ]
V^π(s) = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} | π, s_t = s ]
V^π(s) imposes only a partial ordering, e.g.,
V_i^π(s) > V_i^{π'}(s) but V_j^π(s) < V_j^{π'}(s).
Definition of optimality no longer clear
Whiteson & Roijers Multi-Objective Planning July 7, 2018 24 / 112
Part 1: Motivation & Concepts
Multi-Objective Motivation
MDPs & MOMDPs
Problem Taxonomy
Solution Concepts
Whiteson & Roijers Multi-Objective Planning July 7, 2018 25 / 112
Axiomatic vs. Utility-Based Approach
Axiomatic approach: define optimal solution set to be Pareto front
Whiteson & Roijers Multi-Objective Planning July 7, 2018 26 / 112
Axiomatic vs. Utility-Based Approach
Axiomatic approach: define optimal solution set to be Pareto front
Utility-based approach:
I Execution phase: select one policy maximizing scalar utility Vwπ ,
where w may be hidden or implicit
Whiteson & Roijers Multi-Objective Planning July 7, 2018 26 / 112
Axiomatic vs. Utility-Based Approach
Axiomatic approach: define optimal solution set to be Pareto front
Utility-based approach:
I Execution phase: select one policy maximizing scalar utility Vwπ ,
where w may be hidden or implicit
I Planning phase: find set of policies containing optimal solution
for each possible w; if w unknown, size of set generally > 1
Whiteson & Roijers Multi-Objective Planning July 7, 2018 26 / 112
Axiomatic vs. Utility-Based Approach
Axiomatic approach: define optimal solution set to be Pareto front
Utility-based approach:
I Execution phase: select one policy maximizing scalar utility Vwπ ,
where w may be hidden or implicit
I Planning phase: find set of policies containing optimal solution
for each possible w; if w unknown, size of set generally > 1
I Deduce optimal solution set from three factors:
1 Multi-objective scenario
2 Properties of scalarization function
3 Allowable policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 26 / 112
Three Factors
1 Multi-objective scenario
I Known weights → single policy
I Unknown weights or decision support → multiple policies
2 Properties of scalarization function
I Linear
I Monotonically increasing
3 Allowable policies
I Deterministic
I Stochastic
Whiteson & Roijers Multi-Objective Planning July 7, 2018 27 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 28 / 112
Part 1: Motivation & Concepts
Multi-Objective Motivation
MDPs & MOMDPs
Problem Taxonomy
Solution Concepts
Whiteson & Roijers Multi-Objective Planning July 7, 2018 29 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 30 / 112
Linear Scalarization Functions
Computes inner product of w and V^π:
V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π,  w ∈ R^n
w_i quantifies importance of the i-th objective
Simple and intuitive, e.g., when utility translates to money:
revenue = #cans × ppc + #bottles × ppb
Whiteson & Roijers Multi-Objective Planning July 7, 2018 31 / 112
Linear Scalarization Functions
Computes inner product of w and V^π:
V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π,  w ∈ R^n
w_i quantifies importance of the i-th objective
Simple and intuitive, e.g., when utility translates to money:
revenue = #cans × ppc + #bottles × ppb
w is typically constrained so that V_w^π is a convex combination:
∀i w_i ≥ 0,  Σ_i w_i = 1
utility = #cans × ppc/(ppc + ppb) + #bottles × ppb/(ppc + ppb)
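As a quick illustration (not from the original slides), a minimal NumPy sketch of linear scalarization with convex weights; the value vectors and prices below are made-up numbers:

```python
import numpy as np

# Hypothetical multi-objective values: (cans collected, bottles collected) per policy.
V = np.array([[10.0, 2.0],   # policy 0
              [6.0, 6.0],    # policy 1
              [1.0, 11.0]])  # policy 2

ppc, ppb = 0.10, 0.25                    # made-up prices per can / per bottle
w = np.array([ppc, ppb]) / (ppc + ppb)   # normalise to a convex combination
assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)

scalarised = V @ w                       # V_w^pi = w . V^pi for every policy
best = int(np.argmax(scalarised))
print(f"weights {w}, scalarised values {scalarised}, best policy {best}")
```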
Whiteson & Roijers Multi-Objective Planning July 7, 2018 31 / 112
Linear Scalarization & Single Policy
No special methods required: just apply f to each reward vector
Inner product distributes over addition, yielding a normal MDP:
V_w^π = w · V^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ]
Apply standard methods to an MDP with scalar reward:
R_w(s, a, s') = w · R(s, a, s'),
yielding a single deterministic stationary policy
Whiteson & Roijers Multi-Objective Planning July 7, 2018 32 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: collecting bottles and cans
Whiteson & Roijers Multi-Objective Planning July 7, 2018 33 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: collecting bottles and cans
Note: only cell in taxonomy that does not require multi-objective methods
Whiteson & Roijers Multi-Objective Planning July 7, 2018 33 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 34 / 112
Multiple Policies
Unknown weights or decision support → multiple policies
During planning w is unknown
Size of solution set is generally > 1
Set should not contain policies suboptimal for all w
Whiteson & Roijers Multi-Objective Planning July 7, 2018 35 / 112
Undominated & Coverage Sets
Definition
The undominated set U(Π) is the subset of all possible policies Π for which there exists a w for which the scalarized value is maximal:
U(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) V_w^π ≥ V_w^{π'}}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 36 / 112
Undominated & Coverage Sets
Definition
The undominated set U(Π) is the subset of all possible policies Π for which there exists a w for which the scalarized value is maximal:
U(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) V_w^π ≥ V_w^{π'}}
Definition
A coverage set CS(Π) is a subset of U(Π) that, for every w, contains a policy with maximal scalarized value, i.e.,
CS(Π) ⊆ U(Π) ∧ (∀w)(∃π) [π ∈ CS(Π) ∧ ∀(π' ∈ Π) V_w^π ≥ V_w^{π'}]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 36 / 112
Example
V_w^π       w = true    w = false
π = π1      5           0
π = π2      0           5
π = π3      5           2
π = π4      2           2
One binary weight feature: only two possible weights
Weights are not objectives but two possible scalarizations
Whiteson & Roijers Multi-Objective Planning July 7, 2018 37 / 112
Example
V_w^π       w = true    w = false
π = π1      5           0
π = π2      0           5
π = π3      5           2
π = π4      2           2
One binary weight feature: only two possible weights
Weights are not objectives but two possible scalarizations
U(Π) = {π1 , π2 , π3 } but CS(Π) = {π1 , π2 } or {π2 , π3 }
Whiteson & Roijers Multi-Objective Planning July 7, 2018 37 / 112
Execution Phase
Single policy selected from CS(Π) and executed
Unknown weights: weights revealed and maximizing policy selected:
π* = arg max_{π ∈ CS(Π)} V_w^π
Decision support: CS(Π) is manually inspected by the user
Whiteson & Roijers Multi-Objective Planning July 7, 2018 38 / 112
Linear Scalarization & Multiple Policies
Definition
The convex hull CH(Π) is the subset of Π for which there exists a w that maximizes the linearly scalarized value:
CH(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) w · V^π ≥ w · V^{π'}}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 39 / 112
Linear Scalarization & Multiple Policies
Definition
The convex hull CH(Π) is the subset of Π for which there exists a w that maximizes the linearly scalarized value:
CH(Π) = {π : π ∈ Π ∧ ∃w ∀(π' ∈ Π) w · V^π ≥ w · V^{π'}}
Definition
The convex coverage set CCS(Π) is a subset of CH(Π) that, for every w, contains a policy whose linearly scalarized value is maximal, i.e.,
CCS(Π) ⊆ CH(Π) ∧ (∀w)(∃π) [π ∈ CCS(Π) ∧ ∀(π' ∈ Π) w · V^π ≥ w · V^{π'}]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 39 / 112
Visualization
[Figure: value vectors in objective space (V_0 vs. V_1) and scalarized values in weight space (V_w vs. w_1)]
V_w = w_0 V_0 + w_1 V_1,  with w_0 = 1 − w_1
Whiteson & Roijers Multi-Objective Planning July 7, 2018 40 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: mining gold and silver
Whiteson & Roijers Multi-Objective Planning July 7, 2018 41 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 42 / 112
Monotonically Increasing Scalarization Functions
Mining example: V^{π_1} = (3, 0), V^{π_2} = (0, 3), V^{π_3} = (1, 1)
Choosing V^{π_3} implies a nonlinear scalarization function
Whiteson & Roijers Multi-Objective Planning July 7, 2018 43 / 112
Monotonically Increasing Scalarization Functions
Definition
A scalarization function is strictly monotonically increasing if changing a policy such that its value increases in one or more objectives, without decreasing in any other objective, also increases the scalarized value:
(∀i V_i^{π'} ≥ V_i^π ∧ ∃i V_i^{π'} > V_i^π) ⇒ (∀w V_w^{π'} > V_w^π)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 44 / 112
Monotonically Increasing Scalarization Functions
Definition
A scalarization function is strictly monotonically increasing if changing a policy such that its value increases in one or more objectives, without decreasing in any other objective, also increases the scalarized value:
(∀i V_i^{π'} ≥ V_i^π ∧ ∃i V_i^{π'} > V_i^π) ⇒ (∀w V_w^{π'} > V_w^π)
Definition
A policy π Pareto-dominates another policy π' when its value is at least as high in all objectives and strictly higher in at least one objective:
V^π ≻_P V^{π'} ⇔ ∀i V_i^π ≥ V_i^{π'} ∧ ∃i V_i^π > V_i^{π'}
A policy is Pareto optimal if no policy Pareto-dominates it.
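A one-function Python sketch of the Pareto-dominance test above; the first check uses vectors from the mining example earlier, the second uses a made-up vector:

```python
import numpy as np

def pareto_dominates(v, v_prime):
    """True iff value vector v Pareto-dominates v_prime: at least as good in
    every objective and strictly better in at least one."""
    v, v_prime = np.asarray(v), np.asarray(v_prime)
    return bool(np.all(v >= v_prime) and np.any(v > v_prime))

print(pareto_dominates((3, 0), (1, 1)))  # False: worse in the second objective
print(pareto_dominates((3, 1), (1, 1)))  # True (made-up vector, for illustration)
```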
Whiteson & Roijers Multi-Objective Planning July 7, 2018 44 / 112
Nonlinear Scalarization Can Destroy Additivity
Nonlinear scalarization and expectation do not commute:
V_w^π = f(V^π, w) = f(E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ], w) ≠ E[ Σ_{k=0}^∞ γ^k f(r_{t+k+1}, w) ]
Bellman-based methods not applicable
Local action selection no longer yields an optimal policy:
π*(s) ≠ arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 45 / 112
Deterministic vs. Stochastic Policies
Stochastic policies are fine in most settings
Sometimes inappropriate, e.g., medical treatment
In MDPs, requiring deterministic policies is not restrictive
Optimal value attainable with deterministic stationary policy:
π*(s) = arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 46 / 112
Deterministic vs. Stochastic Policies
Stochastic policies are fine in most settings
Sometimes inappropriate, e.g., medical treatment
In MDPs, requiring deterministic policies is not restrictive
Optimal value attainable with deterministic stationary policy:
π*(s) = arg max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V*(s')]
Similar for MOMDPs with linear scalarization
MOMDPs with nonlinear scalarization:
I Stochastic policies may be preferable if allowed
I Nonstationary policies may be preferable otherwise
Whiteson & Roijers Multi-Objective Planning July 7, 2018 46 / 112
White’s Example (1982)
3 actions: R(a1 ) = (3, 0), R(a2 ) = (0, 3), R(a3 ) = (1, 1)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 47 / 112
White’s Example (1982)
3 actions: R(a1 ) = (3, 0), R(a2 ) = (0, 3), R(a3 ) = (1, 1)
3 deterministic stationary policies, all Pareto-optimal:
V^{π_1} = (3/(1−γ), 0),  V^{π_2} = (0, 3/(1−γ)),  V^{π_3} = (1/(1−γ), 1/(1−γ))
Whiteson & Roijers Multi-Objective Planning July 7, 2018 47 / 112
White’s Example (1982)
3 actions: R(a1 ) = (3, 0), R(a2 ) = (0, 3), R(a3 ) = (1, 1)
3 deterministic stationary policies, all Pareto-optimal:
V^{π_1} = (3/(1−γ), 0),  V^{π_2} = (0, 3/(1−γ)),  V^{π_3} = (1/(1−γ), 1/(1−γ))
π_ns alternates between a1 and a2, starting with a1:
V^{π_ns} = (3/(1−γ²), 3γ/(1−γ²))
Whiteson & Roijers Multi-Objective Planning July 7, 2018 47 / 112
White’s Example (1982)
3 actions: R(a1 ) = (3, 0), R(a2 ) = (0, 3), R(a3 ) = (1, 1)
3 deterministic stationary policies, all Pareto-optimal:
V^{π_1} = (3/(1−γ), 0),  V^{π_2} = (0, 3/(1−γ)),  V^{π_3} = (1/(1−γ), 1/(1−γ))
π_ns alternates between a1 and a2, starting with a1:
V^{π_ns} = (3/(1−γ²), 3γ/(1−γ²))
Thus π_ns ≻_P π_3 when γ ≥ 0.5, e.g., for γ = 0.5 and f(V^π) = V_1^π · V_2^π:
f(V^{π_1}) = f(V^{π_2}) = 0,  f(V^{π_3}) = 4,  f(V^{π_ns}) = 8
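A few lines of Python, only re-computing the numbers stated above, confirm the claim for γ = 0.5:

```python
# Numeric check of White's example for gamma = 0.5 (values as in the slides).
gamma = 0.5
V_pi1 = (3 / (1 - gamma), 0.0)
V_pi2 = (0.0, 3 / (1 - gamma))
V_pi3 = (1 / (1 - gamma), 1 / (1 - gamma))
V_ns  = (3 / (1 - gamma**2), 3 * gamma / (1 - gamma**2))

f = lambda v: v[0] * v[1]          # nonlinear scalarization f(V) = V_1 * V_2
print(V_pi3, V_ns)                 # (2.0, 2.0) vs (4.0, 2.0): pi_ns Pareto-dominates pi_3
print(f(V_pi1), f(V_pi2), f(V_pi3), f(V_ns))  # 0.0 0.0 4.0 8.0
```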
[Figure: four panels plotting V_2 against V_1 for γ = 0.3, 0.5, 0.7, and 0.95]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 47 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: radiation vs. chemotherapy
Whiteson & Roijers Multi-Objective Planning July 7, 2018 48 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 49 / 112
Mixture Policies
A mixture policy π_m selects the i-th policy from a set of N deterministic policies with probability p_i, where Σ_i p_i = 1
Values are a convex combination of the values of the constituent policies
In White's example, replace π_ns by π_m:
V^{π_m} = p_1 V^{π_1} + (1 − p_1) V^{π_2} = ( 3p_1/(1−γ), 3(1−p_1)/(1−γ) )
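A minimal sketch of the mixture-policy value computation above; the mixing probability p_1 = 0.25 is a made-up choice:

```python
import numpy as np

def mixture_value(values, probs):
    """Value of a mixture policy: the convex combination of the values of its
    constituent (deterministic) policies."""
    values, probs = np.asarray(values, dtype=float), np.asarray(probs, dtype=float)
    assert np.all(probs >= 0) and np.isclose(probs.sum(), 1.0)
    return probs @ values

gamma, p1 = 0.5, 0.25                     # made-up mixing probability
V_pi1 = [3 / (1 - gamma), 0.0]
V_pi2 = [0.0, 3 / (1 - gamma)]
# matches (3*p1/(1-gamma), 3*(1-p1)/(1-gamma)) = (1.5, 4.5)
print(mixture_value([V_pi1, V_pi2], [p1, 1 - p1]))
```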
Whiteson & Roijers Multi-Objective Planning July 7, 2018 50 / 112
Problem Taxonomy
single policy multiple policies (unknown
(known weights) weights or decision support)
deterministic stochastic deterministic stochastic
linear one deterministic convex coverage set of
scalarization stationary policy deterministic stationary
policies
monotonically one one mixture Pareto convex
increasing deterministic policy of two coverage set coverage set
scalarization non- or more of of
stationary deterministic deterministic deterministic
policy stationary non- stationary
policies stationary policies
policies
Example: studying vs. networking
Whiteson & Roijers Multi-Objective Planning July 7, 2018 51 / 112
Problem Taxonomy
single policy multiple policies (unknown
(known weights) weights or decision support)
deterministic stochastic deterministic stochastic
linear one deterministic convex coverage set of
scalarization stationary policy deterministic stationary
policies
monotonically one one mixture Pareto convex
increasing deterministic policy of two coverage set coverage set
scalarization non- or more of of
stationary deterministic deterministic deterministic
policy stationary non- stationary
policies stationary policies
policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 52 / 112
Pareto Sets
Definition
The Pareto front is the set of all policies that are not Pareto dominated:
PF(Π) = {π : π ∈ Π ∧ ¬∃(π' ∈ Π) V^{π'} ≻_P V^π}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 53 / 112
Pareto Sets
Definition
The Pareto front is the set of all policies that are not Pareto dominated:
PF(Π) = {π : π ∈ Π ∧ ¬∃(π' ∈ Π) V^{π'} ≻_P V^π}
Definition
A Pareto coverage set is a subset of PF(Π) such that, for every π' ∈ Π, it contains a policy that either dominates π' or has equal value to π':
PCS(Π) ⊆ PF(Π) ∧ ∀(π' ∈ Π)(∃π) [π ∈ PCS(Π) ∧ (V^π ≻_P V^{π'} ∨ V^π = V^{π'})]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 53 / 112
Visualization
[Figure: value vectors in objective space (V_0 vs. V_1) and scalarized values in weight space (V_w vs. w_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 54 / 112
Visualization
[Figure: value vectors in objective space (V_0 vs. V_1) and scalarized values in weight space (V_w vs. w_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 55 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: radiation vs. chemotherapy (again)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 56 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: radiation vs. chemotherapy (again)
Note: the only setting that requires a Pareto front!
Whiteson & Roijers Multi-Objective Planning July 7, 2018 56 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 57 / 112
Mixture Policies
A CCS(ΠDS ) is also a CCS(Π) but not necessarily a PCS(Π)
But a PCS(Π) can be made by mixing policies in a CCS(ΠDS )
[Figure: values of mixture policies in objective space (V_0 vs. V_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 58 / 112
Problem Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Example: studying vs. networking (again)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 59 / 112
Part 2: Methods and Applications
Convex Coverage Set Planning Methods
I Inner Loop: Convex Hull Value Iteration
I Outer Loop: Optimistic Linear Support
Pareto Coverage Set Planning Methods
I Inner loop (non-stationary): Pareto-Q
I Outer loop issues
Interactive Online MORL: Interactive Thompson Sampling
Applications
Whiteson & Roijers Multi-Objective Planning July 7, 2018 60 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 61 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Known transition and reward functions → planning
Whiteson & Roijers Multi-Objective Planning July 7, 2018 61 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Known transition and reward functions → planning
Unknown transition and reward functions → learning
Whiteson & Roijers Multi-Objective Planning July 7, 2018 61 / 112
Background: Value Iteration
Initial value estimate V_0(s)
Apply Bellman backups until convergence:
V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V_k(s')]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 62 / 112
Background: Value Iteration
Initial value estimate V_0(s)
Apply Bellman backups until convergence:
V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V_k(s')]
Can also be written:
V_{k+1}(s) ← max_a Q_{k+1}(s, a),
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s')[R(s, a, s') + γ V_k(s')]
Optimal policy is easy to retrieve from Q-table
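A minimal Python sketch of this backup (not from the slides), assuming tabular T[s, a, s'] and R[s, a, s'] given as NumPy arrays; the toy MDP at the end is made up:

```python
import numpy as np

def value_iteration(T, R, gamma, eps=1e-8):
    """Standard single-objective value iteration.
    T[s, a, s'] transition probabilities, R[s, a, s'] scalar rewards."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_s' T[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum('ijk,ijk->ij', T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)   # optimal values and a greedy policy
        V = V_new

# toy 2-state, 2-action MDP (made-up numbers)
T = np.zeros((2, 2, 2)); T[:, 0, 0] = 1.0; T[:, 1, 1] = 1.0   # a0 -> s0, a1 -> s1
R = np.zeros((2, 2, 2)); R[:, 1, 1] = 1.0                     # reward for reaching s1
print(value_iteration(T, R, gamma=0.9))
```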
Whiteson & Roijers Multi-Objective Planning July 7, 2018 62 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 63 / 112
Scalarize MOMDP + Value Iteration
For known w
V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 64 / 112
Scalarize MOMDP + Value Iteration
For known w
V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ]
Scalarize reward function of MOMDP:
R_w = w · R
Whiteson & Roijers Multi-Objective Planning July 7, 2018 64 / 112
Scalarize MOMDP + Value Iteration
For known w
V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ]
Scalarize reward function of MOMDP:
R_w = w · R
Apply standard VI
Whiteson & Roijers Multi-Objective Planning July 7, 2018 64 / 112
Scalarize MOMDP + Value Iteration
For known w
V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ]
Scalarize reward function of MOMDP:
R_w = w · R
Apply standard VI
Does not return multi-objective value
Whiteson & Roijers Multi-Objective Planning July 7, 2018 64 / 112
Scalarized Value Iteration
Adapt Bellman backup:
w · V_{k+1}(s) ← max_a w · Q_{k+1}(s, a),
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s')[R(s, a, s') + γ V_k(s')]
Returns multi-objective value.
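One way to read the adapted backup in Python (an illustrative sketch, not the authors' implementation), assuming tabular T[s, a, s'] and vector-valued R[s, a, s', o]:

```python
import numpy as np

def scalarized_value_iteration(T, R, w, gamma, n_iter=1000):
    """Scalarized VI sketch: maximise w . Q but keep vector-valued estimates,
    so the multi-objective value of the chosen policy is returned.
    T[s, a, s'] probabilities, R[s, a, s', o] vector rewards, w[o] weights."""
    n_states, n_actions, _, n_obj = R.shape
    V = np.zeros((n_states, n_obj))
    for _ in range(n_iter):
        # Q[s, a, o] = sum_s' T[s, a, s'] * (R[s, a, s', o] + gamma * V[s', o])
        Q = np.einsum('sap,sapo->sao', T, R + gamma * V[None, None, :, :])
        best_a = (Q @ w).argmax(axis=1)        # action maximising w . Q[s, a]
        V = Q[np.arange(n_states), best_a]     # keep the vector value, not w . Q
    return V, best_a
```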
Whiteson & Roijers Multi-Objective Planning July 7, 2018 65 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 66 / 112
Inner versus Outer Loop
[Diagram: a single-objective (SO) method, a multi-objective (MO) inner-loop method, and an MO outer-loop method]
Inner loop
I Adapting operators of single objective method (e.g., value iteration)
I Series of multi-objective operations (e.g. Bellman backups)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 67 / 112
Inner versus Outer Loop
[Diagram: a single-objective (SO) method, a multi-objective (MO) inner-loop method, and an MO outer-loop method]
Inner loop
I Adapting operators of single objective method (e.g., value iteration)
I Series of multi-objective operations (e.g. Bellman backups)
Outer loop
I Single objective method as subroutine
I Series of single-objective problems
Whiteson & Roijers Multi-Objective Planning July 7, 2018 67 / 112
Inner Loop: Convex Hull Value Iteration
Barrett & Narayanan (2008)
Idea: do the backup for all w in parallel
New backup operators must handle sets of values.
At backup:
I generate all value vectors for s, a-pair
I prune away those that are not optimal for any w
Only need deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 68 / 112
Inner Loop: Convex Hull Value Iteration
Initial set of value vectors, e.g., V0 (s) = {(0, 0)}
All possible value vectors:
Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') (R(s, a, s') + γ V_k(s'))
where u + V = {u + v : v ∈ V}, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 69 / 112
Inner Loop: Convex Hull Value Iteration
Initial set of value vectors, e.g., V0 (s) = {(0, 0)}
All possible value vectors:
Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') (R(s, a, s') + γ V_k(s'))
where u + V = {u + v : v ∈ V}, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V}
Prune value vectors:
V_{k+1}(s) ← CPrune( ∪_a Q_{k+1}(s, a) )
CPrune uses linear programs (e.g., Roijers et al. (2015))
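A sketch of the set-based backup and an LP-based pruning test in this spirit (an illustration, not the authors' implementation), assuming tabular T[s, a, s'] and vector R[s, a, s'] as NumPy arrays and using scipy.optimize.linprog; after removing duplicates, a vector is kept only if some w makes it strictly best:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def c_prune(vectors, eps=1e-9):
    """Keep only vectors that are strictly best for some w (w >= 0, sum(w) = 1).
    One LP per candidate v: maximise x s.t. w . (v - u) >= x for every other u."""
    vectors = np.unique(np.asarray(vectors, dtype=float).round(10), axis=0)
    kept = []
    for i, v in enumerate(vectors):
        others = np.delete(vectors, i, axis=0)
        if len(others) == 0:
            kept.append(v)
            continue
        d = len(v)
        c = np.zeros(d + 1); c[-1] = -1.0                           # minimise -x
        A_ub = np.hstack([others - v, np.ones((len(others), 1))])   # w.(u - v) + x <= 0
        b_ub = np.zeros(len(others))
        A_eq = np.hstack([np.ones((1, d)), np.zeros((1, 1))])       # sum(w) = 1
        b_eq = np.array([1.0])
        bounds = [(0, None)] * d + [(None, None)]                   # w >= 0, x free
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        if res.success and -res.fun > eps:   # optimal x > 0: best for some w
            kept.append(v)
    return kept

def chvi_backup(T, R, V_sets, gamma, s, a):
    """Set-based Q backup: cross-sum over successor states, picking one value
    vector per successor, then pruning. The full backup is then
    V_{k+1}(s) = c_prune(union over a of chvi_backup(..., s, a))."""
    succs = [sp for sp in range(T.shape[2]) if T[s, a, sp] > 0]
    q_set = [sum(T[s, a, sp] * (np.asarray(R[s, a, sp]) + gamma * np.asarray(vp))
                 for sp, vp in zip(succs, choice))
             for choice in product(*[V_sets[sp] for sp in succs])]
    return c_prune(q_set)
```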
Whiteson & Roijers Multi-Objective Planning July 7, 2018 69 / 112
CHVI Example
Extremely simple MOMDP:
1 state: s; 2 actions: a1 and a2
Deterministic transitions
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
V_0(s) = {(0, 0)}
[Plot: scalarized value V_w as a function of w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 70 / 112
CHVI Example
Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5
Iteration 1:
V0 (s) = {(0, 0)}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 71 / 112
CHVI Example
Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5
Iteration 1:
V0 (s) = {(0, 0)}
Q1 (s, a1 ) = {(2, 0)}
Q1 (s, a2 ) = {(0, 2)}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 71 / 112
CHVI Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 1:
V_0(s) = {(0, 0)}
Q_1(s, a1) = {(2, 0)}
Q_1(s, a2) = {(0, 2)}
V_1(s) = CPrune( ∪_a Q_1(s, a) ) = {(2, 0), (0, 2)}
[Plot: scalarized values V_w of the vectors in V_1(s) over w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 71 / 112
CHVI Example
Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5
Iteration 2:
V1 (s) = {(2, 0), (0, 2)}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 72 / 112
CHVI Example
Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5
Iteration 2:
V1 (s) = {(2, 0), (0, 2)}
Q2 (s, a1 ) = {(3, 0), (2, 1)}
Q2 (s, a2 ) = {(1, 2), (0, 3)}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 72 / 112
CHVI Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 2:
V_1(s) = {(2, 0), (0, 2)}
Q_2(s, a1) = {(3, 0), (2, 1)}
Q_2(s, a2) = {(1, 2), (0, 3)}
V_2(s) = CPrune({(3, 0), (2, 1), (1, 2), (0, 3)})
[Plot: scalarized values V_w of the candidate vectors over w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 72 / 112
CHVI Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 2:
V_1(s) = {(2, 0), (0, 2)}
Q_2(s, a1) = {(3, 0), (2, 1)}
Q_2(s, a2) = {(1, 2), (0, 3)}
V_2(s) = {(3, 0), (0, 3)}
[Plot: scalarized values V_w over w_1; (2, 1) and (1, 2) are pruned]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 73 / 112
CHVI Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 3:
V_2(s) = {(3, 0), (0, 3)}
Q_3(s, a1) = {(3.5, 0), (2, 1.5)}
Q_3(s, a2) = {(1.5, 2), (0, 3.5)}
V_3(s) = CPrune({(3.5, 0), (2, 1.5), (1.5, 2), (0, 3.5)}) = {(3.5, 0), (0, 3.5)}
[Plot: scalarized values V_w of the candidate vectors over w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 74 / 112
Convex Hull Value Iteration
CPrune retains at least one optimal vector for each w
Therefore, Vw that would have been computed by VI is kept
CHVI does not retain excess value vectors
Whiteson & Roijers Multi-Objective Planning July 7, 2018 75 / 112
Convex Hull Value Iteration
CPrune retains at least one optimal vector for each w
Therefore, Vw that would have been computed by VI is kept
CHVI does not retain excess value vectors
CHVI generates a lot of excess value vectors
Removal with linear programs (CPrune) is expensive
Whiteson & Roijers Multi-Objective Planning July 7, 2018 75 / 112
Outer Loop
[Diagram: a single-objective (SO) method, a multi-objective (MO) inner-loop method, and an MO outer-loop method]
Repeatedly calls a single-objective solver
Generic multi-objective method
I multi-objective coordination graphs
I multi-objective (multi-agent) MDPs
I multi-objective partially observable MDPs
Whiteson & Roijers Multi-Objective Planning July 7, 2018 76 / 112
Outer Loop: Optimistic Linear Support
Optimistic linear support (OLS) adapts and improves linear support
for POMDPs (Cheng (1988))
Solves scalarized instances for specific w
Whiteson & Roijers Multi-Objective Planning July 7, 2018 77 / 112
Outer Loop: Optimistic Linear Support
Optimistic linear support (OLS) adapts and improves linear support
for POMDPs (Cheng (1988))
Solves scalarized instances for specific w
Terminates after checking only a finite number of weights
Returns exact CCS
Whiteson & Roijers Multi-Objective Planning July 7, 2018 77 / 112
Linear Support
[Plot: scalarized value V_w over weight w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 78 / 112
Linear Support
[Plot: scalarized value V_w over weight w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 79 / 112
Linear Support
[Plot: scalarized value V_w over weight w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 80 / 112
Linear Support
[Plot: scalarized value V_w over weight w_1]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 81 / 112
Optimistic Linear Support
[Plot: scalarized value u_w over w_1 for value vectors (1, 8), (5, 6), and (7, 2); at the corner weight w_c between known vectors, Δ is the maximal possible improvement]
Priority queue, Q, for corner weights
Maximal possible improvement ∆ as priority
Stop when ∆ < ε
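A sketch of the two-objective special case, where OLS essentially reduces to a dichotomic search over corner weights; solve_scalarized stands in for any single-objective (scalarized) solver, and Δ appears as the improvement test at each corner weight:

```python
import numpy as np

def linear_support_2obj(solve_scalarized, eps=1e-6):
    """Sketch of linear support / OLS for 2 objectives. solve_scalarized(w)
    must return the value VECTOR of a policy optimal for weight vector w."""
    def scal(v, w):
        return float(np.dot(v, w))

    # solve the two extreme scalarizations first
    v_left = np.asarray(solve_scalarized(np.array([0.0, 1.0])), dtype=float)
    v_right = np.asarray(solve_scalarized(np.array([1.0, 0.0])), dtype=float)
    if np.allclose(v_left, v_right):
        return [v_left]
    ccs = [v_left, v_right]
    stack = [(v_left, v_right)]        # pairs of neighbouring CCS vectors to examine
    while stack:
        u, v = stack.pop()
        # corner weight where w.u == w.v, with w = (w1, 1 - w1)
        denom = (u[0] - u[1]) - (v[0] - v[1])
        if abs(denom) < 1e-12:
            continue
        w1 = (v[1] - u[1]) / denom
        if not 0.0 < w1 < 1.0:
            continue
        w = np.array([w1, 1.0 - w1])
        new_v = np.asarray(solve_scalarized(w), dtype=float)
        delta = scal(new_v, w) - scal(u, w)   # improvement at the corner weight
        if delta > eps:                       # OLS uses this delta as queue priority
            ccs.append(new_v)
            stack.append((u, new_v))
            stack.append((new_v, v))
    return ccs
```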
Whiteson & Roijers Multi-Objective Planning July 7, 2018 82 / 112
Optimistic Linear Support
Solving scalarized instance not always possible
ε-approximate solver
Produces an ε-CCS
Whiteson & Roijers Multi-Objective Planning July 7, 2018 83 / 112
Comparing Inner and Outer Loop
OLS (outer loop) advantages
I Any (cooperative) multi-objective decision problem
I Any single-objective / scalarized subroutine
I Inherits quality guarantees
I Faster for small and medium numbers of objectives
Inner loop faster for large numbers of objectives
Whiteson & Roijers Multi-Objective Planning July 7, 2018 84 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 85 / 112
Inner Loop: Pareto-Q
Similar to CHVI
Different pruning operator
Pairwise comparisons: V(s) ≻_P V'(s)
Comparisons are cheaper, but there are many more vectors
Converges to correct Pareto coverage set (White (1982))
Executing a policy is no longer trivial (Van Moffaert & Nowé (2014))
Whiteson & Roijers Multi-Objective Planning July 7, 2018 86 / 112
Inner Loop: Pareto-Q
Compute all possible vectors:
Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') (R(s, a, s') + γ V_k(s'))
where u + V = {u + v : v ∈ V}, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V}
Whiteson & Roijers Multi-Objective Planning July 7, 2018 87 / 112
Inner Loop: Pareto-Q
Compute all possible vectors:
Q_{k+1}(s, a) ← ⊕_{s'} T(s, a, s') (R(s, a, s') + γ V_k(s'))
where u + V = {u + v : v ∈ V}, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V}
Take the union across a
Prune Pareto-dominated vectors:
V_{k+1}(s) ← PPrune( ∪_a Q_{k+1}(s, a) )
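A sketch of the Pareto pruning operator in Python; the cross-sum backup itself is the same as in the CHVI sketch earlier, only the pruning changes:

```python
import numpy as np

def p_prune(vectors):
    """Keep only Pareto-undominated (and non-duplicate) value vectors,
    using pairwise comparisons instead of linear programs."""
    vectors = [np.asarray(v, dtype=float) for v in vectors]
    kept = []
    for v in vectors:
        dominated = any(np.all(u >= v) and np.any(u > v) for u in vectors)
        if not dominated and not any(np.allclose(v, u) for u in kept):
            kept.append(v)
    return kept

# Full backup (cf. the CHVI sketch): V_{k+1}(s) <- p_prune(union over a of Q_{k+1}(s, a))
```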
Whiteson & Roijers Multi-Objective Planning July 7, 2018 87 / 112
Pareto-Q Example
Extremely simple MOMDP:
1 state: s; 2 actions: a1 and a2
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
V_0(s) = {(0, 0)}
[Plot: value vectors in objective space (V_0 vs. V_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 88 / 112
Pareto-Q Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 1:
V_0(s) = {(0, 0)}
Q_1(s, a1) = {(2, 0)}
Q_1(s, a2) = {(0, 2)}
V_1(s) = PPrune( ∪_a Q_1(s, a) ) = {(2, 0), (0, 2)}
[Plot: the vectors of V_1(s) in objective space (V_0 vs. V_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 89 / 112
Pareto-Q Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 2:
V_1(s) = {(2, 0), (0, 2)}
Q_2(s, a1) = {(3, 0), (2, 1)}
Q_2(s, a2) = {(1, 2), (0, 3)}
V_2(s) = PPrune({(3, 0), (2, 1), (1, 2), (0, 3)})
[Plot: the candidate vectors in objective space (V_0 vs. V_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 90 / 112
Pareto-Q Example
Deterministic rewards: R(s, a1, s) → (2, 0), R(s, a2, s) → (0, 2)
γ = 0.5
Iteration 3:
V_2(s) = {(3, 0), (2, 1), (1, 2), (0, 3)}
Q_3(s, a1) = {(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5)}
Q_3(s, a2) = {(1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)}
V_3(s) = PPrune({(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5), (1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)})
[Plot: the candidate vectors in objective space (V_0 vs. V_1)]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 91 / 112
Inner Loop: Pareto-Q
PCS size can explode
No longer deterministic
Cannot read policy from Q-table
Except for first action
Whiteson & Roijers Multi-Objective Planning July 7, 2018 92 / 112
Inner Loop: Pareto-Q
PCS size can explode
No longer deterministic
Cannot read policy from Q-table
Except for first action
“Track” a policy during execution (Van Moffaert & Nowé (2014))
I For deterministic transitions: s, a → s'
I From Q_{t=0}(s, a) subtract R(s, a)
I Correct for the discount factor → V_{t=1}(s')
I Find V_{t=1}(s') in the Q-tables for s'
For stochastic transitions, see Kristof Van Moffaert's PhD thesis
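A rough sketch of this tracking procedure for deterministic transitions; Q_sets, next_state, and R here are assumed (hypothetical) containers for the quantities named above, not part of any particular library:

```python
import numpy as np

def track_policy(Q_sets, R, next_state, s0, v_target, gamma, horizon):
    """Track a selected value vector during execution (deterministic transitions).
    Assumed helpers: Q_sets[s][a] is the set of Q-vectors from Pareto-Q,
    R[s][a] the reward vector, next_state(s, a) the deterministic successor."""
    s, target = s0, np.asarray(v_target, dtype=float)
    actions = []
    for _ in range(horizon):
        s_next = None
        for a, q_vecs in enumerate(Q_sets[s]):
            # find an action whose Q-set contains the vector we want to realise
            match = next((q for q in q_vecs if np.allclose(q, target)), None)
            if match is not None:
                actions.append(a)
                # subtract the immediate reward and undo the discount
                target = (np.asarray(match) - np.asarray(R[s][a])) / gamma
                s_next = next_state(s, a)
                break
        if s_next is None:
            raise ValueError("target vector not found in any Q-set")
        s = s_next
    return actions
```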
Whiteson & Roijers Multi-Objective Planning July 7, 2018 92 / 112
Outer Loop?
[Diagram: a single-objective (SO) method, a multi-objective (MO) inner-loop method, and an MO outer-loop method]
Outer loop very difficult:
V_w^π = f(E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ], w) ≠ E[ Σ_{k=0}^∞ γ^k f(r_{t+k+1}, w) ]
Maximization does not do the trick!
Heuristic with non-linear f (Van Moffaert, Drugan, Nowé (2013))
Not guaranteed to find optimal policy, or converge
Whiteson & Roijers Multi-Objective Planning July 7, 2018 93 / 112
Taxonomy
single policy (known weights):
  linear scalarization, deterministic or stochastic policies: one deterministic stationary policy
  monotonically increasing scalarization, deterministic policies: one deterministic non-stationary policy
  monotonically increasing scalarization, stochastic policies: one mixture policy of two or more deterministic stationary policies
multiple policies (unknown weights or decision support):
  linear scalarization, deterministic or stochastic policies: convex coverage set of deterministic stationary policies
  monotonically increasing scalarization, deterministic policies: Pareto coverage set of deterministic non-stationary policies
  monotonically increasing scalarization, stochastic policies: convex coverage set of deterministic stationary policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 94 / 112
Part 2: Methods and Applications
Convex Coverage Set Planning Methods
I Inner Loop: Convex Hull Value Iteration
I Outer Loop: Optimistic Linear Support
Pareto Coverage Set Planning Methods
I Inner loop (non-stationary): Pareto-Q
I Outer loop issues
Interactive Online MORL: Interactive Thompson Sampling
Applications
Whiteson & Roijers Multi-Objective Planning July 7, 2018 95 / 112
Online Interactive Decision Support
[Diagram: a combined learning and execution phase in which the learning algorithm interacts with both the environment and the user, followed by an execution-only phase with a single solution]
Simultaneous interaction with the environment and the decision maker
Whiteson & Roijers Multi-Objective Planning July 7, 2018 96 / 112
Multi-Objective Multi-Armed Bandits
Definition
A multi-objective multi-armed bandit (MOMAB) (Drugan & Nowé, 2013)
is a tuple ⟨A, P⟩ where
A is a finite set of actions or arms, and
P is a set of probability density functions, Pa (r) : Rd → [0, 1] over
vector-valued rewards r of length d, associated with each arm a ∈ A.
Whiteson & Roijers Multi-Objective Planning July 7, 2018 97 / 112
Multi-Objective Multi-Armed Bandits
Definition
A multi-objective multi-armed bandit (MOMAB) (Drugan & Nowé, 2013)
is a tuple ⟨A, P⟩ where
A is a finite set of actions or arms, and
P is a set of probability density functions, Pa (r) : Rd → [0, 1] over
vector-valued rewards r of length d, associated with each arm a ∈ A.
Can be seen as a single-state MOMDP
Whiteson & Roijers Multi-Objective Planning July 7, 2018 97 / 112
(Single-objective) Thompson Sampling
If we know the scalarization function: (SO) multi-armed bandit
Thompson sampling (Thompson, 1933) empirically best
I Basic idea: maintain posterior distributions over the mean reward µ_{a_i} of each arm
I Sample a mean reward for each a_i from its posterior
I Execute the action with the highest sampled mean reward
[Plot: posterior density p over µ(a) after 1, 3, and 10 pulls]
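A minimal single-objective Thompson sampling sketch for a Gaussian bandit (illustrative prior and noise model, made-up arm means):

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_step(counts, sums, true_means, sigma=1.0, prior_var=100.0):
    """One round of Thompson sampling with known reward noise sigma and a broad
    zero-mean Gaussian prior on each arm's mean (an illustrative choice)."""
    # Gaussian posterior over each arm's mean reward
    post_var = 1.0 / (1.0 / prior_var + counts / sigma**2)
    post_mean = post_var * (sums / sigma**2)
    # sample a mean for every arm, play the arm with the highest sample
    samples = rng.normal(post_mean, np.sqrt(post_var))
    a = int(np.argmax(samples))
    r = rng.normal(true_means[a], sigma)
    counts[a] += 1
    sums[a] += r
    return a, r

true_means = np.array([0.5, 1.0, 1.5])      # made-up single-objective bandit
counts, sums = np.zeros(3), np.zeros(3)
for t in range(1000):
    thompson_step(counts, sums, true_means)
print(counts)                                # most pulls should go to the best arm
```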
Whiteson & Roijers Multi-Objective Planning July 7, 2018 98 / 112
Multi-Objective Challenges
No maximising action?
I Model the scalarization (/utility) function explicitly
I Learn about this function through user interaction
Cannot access scalarization function directly
I Only pairwise preferences
Whiteson & Roijers Multi-Objective Planning July 7, 2018 99 / 112
Interactive Thompson Sampling
[Diagram: ITS interacts with the MOMAB, selecting arms a_1(t) and observing rewards r(t), and with the user, proposing comparisons between sampled mean vectors µ_{θ',a_1}(t) and µ_{θ',a_2}(t) and observing preferences µ_x ≻ µ_y]
(Roijers, Zintgraf, & Nowé, 2017)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 100 / 112
Interactive Thompson Sampling
[Diagram: ITS interacts with the MOMAB, selecting arms a_1(t) and observing rewards r(t), and with the user, proposing comparisons between sampled mean vectors µ_{θ',a_1}(t) and µ_{θ',a_2}(t) and observing preferences µ_x ≻ µ_y]
(Roijers, Zintgraf, & Nowé, 2017)
Open question: when to ask the user for preferences
Whiteson & Roijers Multi-Objective Planning July 7, 2018 100 / 112
Interactive Thompson Sampling
Action selection
I Sample vector-valued means from multi-variate posterior mean reward
distributions.
I Sample utility function from the posterior over utility functions
I Scalarize reward vectors with the sampled utility function
I Take maximizing action
When to query the user for preferences?
I Sample vector-valued means and utility function again
I Query when the maximising actions disagree with the first set of
samples
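A sketch of the action-selection and query rule described above; the two samplers are assumed (hypothetical) helpers, and the toy usage at the end uses made-up posteriors:

```python
import numpy as np

def its_select_action(sample_arm_means, sample_utility):
    """Action selection and query test following the bullets above.
    Assumed helpers:
      sample_arm_means() -> array of shape (n_arms, n_objectives)
      sample_utility()   -> a function mapping a value vector to a scalar utility."""
    def best_action():
        mu = sample_arm_means()            # one posterior sample per arm
        u = sample_utility()               # one sampled utility function
        return int(np.argmax([u(v) for v in mu]))

    a_first, a_second = best_action(), best_action()
    ask_user = a_first != a_second         # query a pairwise preference on disagreement
    return a_first, ask_user

# toy usage with made-up posteriors: two arms, two objectives, linear utility samples
rng = np.random.default_rng(0)
sample_means = lambda: rng.normal([[1.0, 0.0], [0.0, 1.0]], 0.3)
sample_util = lambda: (lambda v, w=rng.dirichlet([1.0, 1.0]): float(w @ v))
print(its_select_action(sample_means, sample_util))
```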
Whiteson & Roijers Multi-Objective Planning July 7, 2018 101 / 112
Interactive Thompson Sampling: Some Results
Linear utility functions (Roijers, Zintgraf, & Nowé, 2017)
NB: cumulative regret (optimal scalarized reward minus actual scalarized
reward) rather than cumulative reward
Whiteson & Roijers Multi-Objective Planning July 7, 2018 102 / 112
Conclusions Interactive Thompson Sampling
It is possible to learn about the user and the environment
simultaneously
For linear scalarization function: hardly any additional regret
Non-linear scalarization function, Gaussian processes as model of
utility function [Talk @ ALA]
Whiteson & Roijers Multi-Objective Planning July 7, 2018 103 / 112
Part 2: Methods and Applications
Convex Coverage Set Planning Methods
I Inner Loop: Convex Hull Value Iteration
I Outer Loop: Optimistic Linear Support
Pareto Coverage Set Planning Methods
I Inner loop (non-stationary): Pareto-Q
I Outer loop issues
Interactive Online MORL: Interactive Thompson Sampling
Applications
Whiteson & Roijers Multi-Objective Planning July 7, 2018 104 / 112
Treatment planning
Lizotte (2010, 2012)
I Maximizing effectiveness of
the treatment
I Minimizing the severity of the
side-effects
Finite-horizon MOMDPs
Deterministic policies
Whiteson & Roijers Multi-Objective Planning July 7, 2018 105 / 112
Epidemic control
Anthrax response (Soh &
Demiris (2011))
I Minimizing loss of life
I Minimizing number of false
alarms
I Minimizing cost of
investigation
Partial observability
(MOPOMDP)
Finite-state controllers
Evolutionary method
Pareto coverage set
Whiteson & Roijers Multi-Objective Planning July 7, 2018 106 / 112
Semi-autonomous wheelchairs
Control system for wheelchairs
(Soh & Demiris (2011))
I Maximizing safety
I Maximizing speed
I Minimizing power
consumption.
Partial observability
(MOPOMDP)
Finite-state controllers
Evolutionary method
Pareto coverage set
Whiteson & Roijers Multi-Objective Planning July 7, 2018 107 / 112
Broader Application
“Probabilistic Planning is Multi-objective” — Bryce et al. (2007)
I The expected return is not enough
I Cost of a plan
I Probability of success of a plan
I Non-goal terminal states
Whiteson & Roijers Multi-Objective Planning July 7, 2018 108 / 112
Broader Application
“Human-aligned artificial intelligence is a multiobjective problem” –
Vamplew et al., 2018
I Philosophy journal (ethics)
I Decision problems have ethical implications
I Ethical decision-making always involves trade-offs
I To align this with people’s convictions and preferences is not trivial
I Multi-objective, whatever ethical framework you use
Whiteson & Roijers Multi-Objective Planning July 7, 2018 109 / 112
Broader Application
“Tim O’Reilly says the economy is running on the wrong algorithm” –
Wired
I Companies typically only try to optimise profit
I This is bad, as consumers experience the negative effects of this
I Consumers are customers
I Very bad as a long-term strategy
Whiteson & Roijers Multi-Objective Planning July 7, 2018 110 / 112
Closing
Consider multiple objectives
I most problems have them
I a priori scalarization can be bad
Derive your solution set
I Pareto front often not necessary
Exciting growing field
Promising applications
Whiteson & Roijers Multi-Objective Planning July 7, 2018 111 / 112
At these conferences
AAMAS
I Luisa M. Zintgraf, Diederik M. Roijers, Sjoerd Linders, Catholijn M.
Jonker, Ann Nowé — Ordered Preference Elicitation Strategies for
Supporting Multi-Objective Decision Making
ICML
I Wenlong Lyu, Fan Yang, Changhao Yan, Dian Zhou, Xuan Zeng —
Batch Bayesian Optimization via Multi-objective Acquisition Ensemble
for Automated Analog Circuit Design
I Eugenio Bargiacchi, Timothy Verstraeten, Diederik M. Roijers, Ann
Now, Hado van Hasselt — Learning to Coordinate with Coordination
Graphs in Repeated Single-Stage Multi-Agent Decision Problems
Whiteson & Roijers Multi-Objective Planning July 7, 2018 112 / 112
At these conferences
IJCAI
I Chao Bian, Chao Qian, Ke Tang — A General Approach to Running
Time Analysis of Multi-objective Evolutionary Algorithms
I Miguel Terra-Neves, Ines Lynce, Vasco Manquinho — Stratification for
Constraint-Based Multi-Objective Combinatorial Optimization
ALA workshop
I Diederik M. Roijers, Denis Steckelmacher, Ann Nowé —
Multi-objective Reinforcement Learning for the Expected Utility of the
Return
I Diederik M. Roijers, Luisa M. Zintgraf, Pieter Libin, Ann Nowé —
Interactive Multi-Objective Reinforcement Learning in Multi-Armed
Bandits for Any Utility Function
Whiteson & Roijers Multi-Objective Planning July 7, 2018 113 / 112