Utilities and MDP:
A Lesson in Multiagent Systems
Henry Hexmoor
SIUC
Utility
• Preferences are recorded as a utility function
ui : S → R
where S is the set of observable states in the world,
ui is agent i's utility function, and
R is the set of real numbers.
• States of the world become ordered.
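As a minimal sketch of how a utility function induces an ordering over states, the snippet below uses invented state names and utility values:

```python
# Minimal sketch: a utility function ui : S -> R over a few observable
# states, and the ordering over states that it induces.
# State names and utility values are invented for illustration.

utility = {
    "s1": 0.0,
    "s2": 2.5,
    "s3": 1.0,
}

# The utility function orders the states of the world.
ordered_states = sorted(utility, key=utility.get, reverse=True)
print(ordered_states)  # ['s2', 's3', 's1']
```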
Properties of Utilities
Reflexive: ui(s) ≥ ui(s)
Transitive: if ui(a) ≥ ui(b) and ui(b) ≥ ui(c),
then ui(a) ≥ ui(c).
Comparable: for all a, b, either ui(a) ≥ ui(b) or
ui(b) ≥ ui(a).
Selfish agents:
• A rational agent is one that wants to maximize
its utilities, but intends no harm.
[Diagram: the set of agents, comprising rational agents, selfish agents, and rational, non-selfish agents.]
Utility is not money:
• While utility represents an agent’s preferences,
it is not necessarily equated with money. In
fact, the utility of money has been found to be
roughly logarithmic.
Marginal Utility
• Marginal utility is the utility gained from the next
event.
Example:
getting an A for an A student
versus an A for a B student.
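A rough sketch of the idea, treating marginal utility as the gain from one more event; the grade-point utilities below are invented for illustration:

```python
# Marginal utility: the utility gained from the next event.
# The grade-point utilities are invented for illustration.

def marginal_utility(u_before, u_after):
    return u_after - u_before

# The same grade is worth less to the A student, who already expected it,
# than to the B student, for whom it is a larger improvement.
print(marginal_utility(90, 92))  # A student: 2
print(marginal_utility(80, 88))  # B student: 8
```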
Transition function
The transition function is represented as
T(s, a, s')
and is defined as the probability of reaching
state s' from state s with action a.
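A minimal sketch of one way to store T(s, a, s'), using nested dictionaries; the states, actions, and probabilities are illustrative only:

```python
# T[s][a] maps each successor state s' to T(s, a, s'), the probability of
# reaching s' from s with action a. All values are illustrative.

T = {
    "s1": {
        "a1": {"s1": 0.2, "s2": 0.8},
        "a2": {"s1": 1.0},
    },
    "s2": {
        "a1": {"s2": 1.0},
        "a2": {"s1": 0.5, "s2": 0.5},
    },
}

def transition(s, a, s_next):
    """Return T(s, a, s'); unreachable successors have probability 0."""
    return T[s][a].get(s_next, 0.0)

print(transition("s1", "a1", "s2"))  # 0.8
```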
Expected Utility
• Expected utility is defined as the sum, over
successor states s', of the probability of reaching s'
from s with action a times the utility of that
final state:
E[ui, s, a] = Σ_{s'∈S} T(s, a, s') ui(s')
where S is the set of all possible states.
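Continuing the illustrative `utility` and `T` dictionaries from the sketches above, expected utility is a one-line sum:

```python
# E[ui, s, a] = sum over s' of T(s, a, s') * ui(s'),
# reusing the illustrative utility and T dictionaries from the sketches above.

def expected_utility(utility, T, s, a):
    return sum(p * utility[s_next] for s_next, p in T[s][a].items())

print(expected_utility(utility, T, "s1", "a1"))  # 0.2*0.0 + 0.8*2.5 = 2.0
```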
Value of Information
• The value of the information that the current state is t and
not s:
ΔE = E[ui, t, πi(t)] - E[ui, t, πi(s)]
where E[ui, t, πi(t)] uses the updated, new information and
E[ui, t, πi(s)] uses the old value.
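A sketch of the same difference in code; `best_action` below is a hypothetical one-step greedy stand-in for πi, built on the `expected_utility` helper above:

```python
# Value of the information that the true state is t rather than s:
# expected utility in t under the action chosen for t, minus the expected
# utility in t under the action that would have been chosen for s.
# best_action is a hypothetical one-step greedy stand-in for pi_i.

def best_action(utility, T, s):
    return max(T[s], key=lambda a: expected_utility(utility, T, s, a))

def value_of_information(utility, T, t, s):
    a_new = best_action(utility, T, t)   # pi_i(t): uses the new information
    a_old = best_action(utility, T, s)   # pi_i(s): uses the old belief
    return (expected_utility(utility, T, t, a_new)
            - expected_utility(utility, T, t, a_old))

# 0.0 here, since both states happen to share the same greedy action.
print(value_of_information(utility, T, "s2", "s1"))
```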
Markov Decision Processes: MDP
[Figure: graphical representation of a sample Markov decision process, with values for the transition and reward functions; the start state is s1.]
Reward Function: r(s)
• The reward function is represented as
r : S → R
Deterministic vs. Non‐Deterministic
• Deterministic world: predictable effects.
Example: each action leads to exactly one state with T = 1; all other transitions have probability 0.
• Nondeterministic world: effects vary; transition probabilities are spread over several successor states.
Policy:
• A policy is the behavior of an agent: a mapping from states
to actions.
• A policy is represented by π.
Optimal Policy
• An optimal policy is a policy that maximizes
expected utility.
• The optimal policy is represented as π*:
πi*(s) = argmax_{a∈A} E[ui, s, a]
Discounted Rewards: γ (0 ≤ γ ≤ 1)
• Discounted rewards smoothly reduce the
impact of rewards that are farther off in the
future:
γ⁰ r(s1) + γ¹ r(s2) + γ² r(s3) + ...
where γ (0 ≤ γ ≤ 1) is the discount factor.
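A small sketch of the discounted sum; the reward sequence and γ are illustrative:

```python
# Discounted return: gamma^0*r(s1) + gamma^1*r(s2) + gamma^2*r(s3) + ...
# The reward sequence and gamma are illustrative.

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
```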
Given the utilities u(s'), the optimal policy picks the action with the
highest expected utility:
π*(s) = argmax_a Σ_{s'} T(s, a, s') u(s')
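Continuing the running example, extracting the greedy policy from given utilities is an argmax over actions for each state:

```python
# pi*(s) = argmax_a sum_{s'} T(s, a, s') * u(s'),
# reusing the illustrative utility and T dictionaries and expected_utility above.

def extract_policy(utility, T):
    return {s: max(T[s], key=lambda a: expected_utility(utility, T, s, a))
            for s in T}

print(extract_policy(utility, T))  # {'s1': 'a1', 's2': 'a1'}
```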
Bellman Equation
u(s) = r(s) + γ max_a Σ_{s'} T(s, a, s') u(s')
where
r(s) represents the immediate reward and
γ max_a Σ_{s'} T(s, a, s') u(s') represents the future, discounted rewards.
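A single Bellman backup for one state, continuing the illustrative `T` and `utility` from earlier; the reward values and γ here are invented:

```python
# One Bellman backup: u(s) = r(s) + gamma * max_a sum_{s'} T(s, a, s') * u(s').
# The reward values and gamma are invented; T and utility are the sketches above.

def bellman_backup(s, utility, T, r, gamma):
    return r[s] + gamma * max(
        sum(p * utility[s_next] for s_next, p in T[s][a].items())
        for a in T[s]
    )

r = {"s1": 0.0, "s2": 1.0}
print(bellman_backup("s1", utility, T, r, gamma=0.5))  # 0.0 + 0.5 * 2.0 = 1.0
```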
Brute Force Solution
• Write n Bellman equations, one for each of the n
states, and solve …
• This is a system of non‐linear equations due to the max over a.
Value Iteration Solution
• Set the values of u(s) to random numbers.
• Use the Bellman update equation:
u_{t+1}(s) = r(s) + γ max_a Σ_{s'} T(s, a, s') u_t(s')
• Converge, and stop using this equation when
Δu < ε(1 - γ)/γ
where Δu is the maximum utility change.
Value Iteration Algorithm
VALUE-ITERATION(T, r, γ, ε)
  u' ← arbitrary initial utilities (e.g., random)
  do
    u ← u'
    δ ← 0
    for each s ∈ S
      do u'(s) ← r(s) + γ max_a Σ_{s'} T(s, a, s') u(s')
         if |u'(s) - u(s)| > δ
           then δ ← |u'(s) - u(s)|
  until δ < ε(1 - γ)/γ
  return u
Example: with γ = 0.5 and ε = 0.15, the algorithm stops after t = 4.
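A runnable sketch of the algorithm above over the dictionary-based T and r from the earlier sketches; γ and ε follow the example values on this slide:

```python
import random

# Value iteration: repeat Bellman updates until the maximum change delta
# falls below epsilon*(1 - gamma)/gamma, then return the converged utilities.
# Uses the dictionary-based T from the earlier sketches; the reward values,
# gamma, and epsilon are the example values.

def value_iteration(T, r, gamma=0.5, epsilon=0.15):
    states = list(T)
    u_new = {s: random.random() for s in states}  # arbitrary initial values
    while True:
        u = dict(u_new)                            # u <- u'
        delta = 0.0
        for s in states:
            u_new[s] = r[s] + gamma * max(         # Bellman update
                sum(p * u[s_next] for s_next, p in T[s][a].items())
                for a in T[s]
            )
            delta = max(delta, abs(u_new[s] - u[s]))
        if delta < epsilon * (1 - gamma) / gamma:  # convergence test
            return u

print(value_iteration(T, {"s1": 0.0, "s2": 1.0}))
```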
MDP for one agent
• Multiagent setting: one agent changes, the others are
stationary.
• Better approach: T(s, a, s'),
where a is a vector of size n giving each agent's
action and n is the number of agents.
• Rewards:
– Dole out equally among agents
– Reward proportional to contribution
Observation model
• Noise + the agent cannot observe the world directly …
• Belief state: b = ⟨P1, P2, P3, ..., Pn⟩
• Observation model: O(s, o) = probability of
observing o while being in state s.
• Belief update:
∀s'  b'(s') = α O(s', o) Σ_s T(s, a, s') b(s)
where α is a normalization constant.
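A sketch of the belief update, continuing the illustrative `T`; the observation model O and the observation labels are invented for the example:

```python
# Belief update: b'(s') = alpha * O(s', o) * sum_s T(s, a, s') * b(s).
# O[s][o] is the probability of observing o while in state s; O and the
# observation labels are invented, T is the sketch from earlier.

O = {
    "s1": {"bright": 0.9, "dark": 0.1},
    "s2": {"bright": 0.2, "dark": 0.8},
}

def update_belief(b, a, o, T, O):
    unnormalized = {
        s_to: O[s_to][o] * sum(T[s_from][a].get(s_to, 0.0) * b[s_from]
                               for s_from in b)
        for s_to in b
    }
    alpha = 1.0 / sum(unnormalized.values())  # normalization constant
    return {s: alpha * p for s, p in unnormalized.items()}

b = {"s1": 0.5, "s2": 0.5}
print(update_belief(b, "a1", "dark", T, O))  # most of the mass shifts to s2
```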
Partially observable MDP
The transition function over belief states:
T(b, a, b') = Σ_{s'} O(s', o) Σ_s T(s, a, s') b(s)   if (*) holds
            = 0                                       otherwise
where (*): ∀s'  b'(s') = α O(s', o) Σ_s T(s, a, s') b(s) is true for b, a, b'.
The new reward function:
ρ(b) = Σ_s b(s) r(s)
• Solving a POMDP is hard.
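A short sketch of the belief-state reward ρ(b) defined above; the belief and reward values are illustrative:

```python
# rho(b) = sum_s b(s) * r(s): expected immediate reward under belief b.
# The belief and reward values are illustrative.

def belief_reward(b, r):
    return sum(b[s] * r[s] for s in b)

print(belief_reward({"s1": 0.5, "s2": 0.5}, {"s1": 0.0, "s2": 1.0}))  # 0.5
```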