
Utilities and MDP:

A Lesson in Multiagent System


Henry Hexmoor
SIUC
Utility
• Preferences are recorded as a utility function
ui : S → R
where S is the set of observable states in the world,
ui is the utility function, and
R is the real numbers.
• States of the world become ordered.
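As a rough illustration (not from the slides), a utility function can be sketched in Python as a table from states to real numbers; the state names and values below are made up:

# A minimal sketch of a utility function ui : S -> R as a Python dict.
# The states and utility values are hypothetical placeholders.
utility = {"sunny": 8.0, "rainy": 3.5, "stormy": 1.0}

# Because utilities are real numbers, the states become totally ordered:
ordered_states = sorted(utility, key=utility.get, reverse=True)
print(ordered_states)  # ['sunny', 'rainy', 'stormy']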
Properties of Utilities
Reflexive: ui(s) ≥ ui(s)
Transitive: If ui(a) ≥ ui(b) and ui(b) ≥ ui(c)
then ui(a) ≥ ui(c).
Comparable: for all a, b, either ui(a) ≥ ui(b) or
ui(b) ≥ ui(a).
Selfish agents:
• A rational agent is one that wants to maximize
its utilities, but intends no harm.
Diagram of agent types: agents, rational agents, selfish agents, and rational, non‐selfish agents.
Utility is not money:
• While utility represents an agent's preferences,
it is not necessarily equated with money. In
fact, the utility of money has been found to be
roughly logarithmic.
Marginal Utility
• Marginal utility is the utility gained from the next
event.
Example:
getting an A, for an A student,
versus getting an A, for a B student.
Transition function
The transition function is represented as
T(s, a, s′)
The transition function is defined as the probability
of reaching s′ from s with action a.
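One possible way to represent T(s, a, s′) in code is a nested dictionary mapping a state and action to a distribution over successor states. This is a sketch; the states s1, s2 and actions a1, a2 are made-up placeholders:

# Sketch: T[s][a] is a probability distribution over successor states s'.
T = {
    "s1": {"a1": {"s1": 0.2, "s2": 0.8},
           "a2": {"s1": 1.0}},
    "s2": {"a1": {"s1": 0.5, "s2": 0.5},
           "a2": {"s2": 1.0}},
}

def transition_prob(T, s, a, s_next):
    """Probability of reaching s_next from s with action a."""
    return T[s][a].get(s_next, 0.0)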
Expected Utility
• Expected utility is defined as the sum, over s′, of the
product of the probability of reaching s′
from s with action a and the utility of the resulting
state:

E[ui, s, a] = Σ_{s′ ∈ S} T(s, a, s′) ui(s′)

where S is the set of all possible states.
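Using the hypothetical transition table T sketched above, the expected-utility sum can be written directly:

def expected_utility(T, u, s, a):
    """E[u, s, a] = sum over s' of T(s, a, s') * u(s')."""
    return sum(p * u[s2] for s2, p in T[s][a].items())

u = {"s1": 0.0, "s2": 1.0}                 # hypothetical utility table
print(expected_utility(T, u, "s1", "a1"))  # 0.2*0.0 + 0.8*1.0 = 0.8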
Value of Information
• Value of the information that the current state is t and
not s:

ΔE = E[ui, t, πi(t)] − E[ui, t, πi(s)]

where E[ui, t, πi(t)] represents the updated, new information and
E[ui, t, πi(s)] represents the old value.
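One way to read this in code (a sketch; pi is a hypothetical policy table mapping states to actions, and expected_utility is the helper sketched earlier):

def value_of_information(T, u, pi, t, s):
    """Gain from acting on the policy's action for t rather than its action for s."""
    return expected_utility(T, u, t, pi[t]) - expected_utility(T, u, t, pi[s])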
Markov Decision Processes: MDP
• (Figure) Graphical representation of a sample Markov decision process along with values for
the transition and reward functions. We let the start state be s1.
Reward Function: r(s)
• The reward function is represented as
r:S→R
Deterministic vs. Non‐Deterministic
• Deterministic world: predictable effects
Example: for each action, only one successor state has T = 1; all others have probability 0.

• Nondeterministic world: effects are not fully predictable, so values change.
Policy: π
• A policy is the behavior of an agent that maps states
to actions.
• A policy is represented by π.
Optimal Policy
• An optimal policy is a policy that maximizes
expected utility.
• The optimal policy is represented as π*:

πi*(s) = argmax_{a ∈ A} E[ui, s, a]
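A one-step greedy reading of this argmax in code, reusing the hypothetical model and the expected_utility helper from above:

def greedy_action(T, u, s, actions):
    """pi*(s) = argmax over a in A of E[u, s, a]."""
    return max(actions, key=lambda a: expected_utility(T, u, s, a))

print(greedy_action(T, u, "s1", ["a1", "a2"]))  # 'a1' (0.8 vs. 0.0)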
Discounted Rewards: γ (0 ≤ γ ≤ 1)
• Discounted rewards smoothly reduce the
impact of rewards that are farther off in the
future:

γ⁰ r(s1) + γ¹ r(s2) + γ² r(s3) + …

where γ (0 ≤ γ ≤ 1) represents the discount factor.

π*(s) = argmax_a Σ_{s′} T(s, a, s′) u(s′)
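A worked sketch of the discounted sum, with a made-up reward sequence:

def discounted_return(rewards, gamma):
    """gamma^0*r(s1) + gamma^1*r(s2) + gamma^2*r(s3) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75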
Bellman Equation

u(s) = r(s) + γ max_a Σ_{s′} T(s, a, s′) u(s′)

where
r(s) represents the immediate reward, and
γ max_a Σ_{s′} T(s, a, s′) u(s′) represents the
future, discounted rewards.
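A single Bellman backup for one state, in the same sketch style (r is a hypothetical reward table; expected_utility is the helper sketched earlier):

def bellman_backup(T, r, u, gamma, s, actions):
    """u(s) = r(s) + gamma * max_a sum_{s'} T(s, a, s') u(s')."""
    return r[s] + gamma * max(expected_utility(T, u, s, a) for a in actions)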
Brute Force Solution
• Write n Bellman equations, one for each of the n
states, and solve …
• This is a system of non‐linear equations, due to the max over a.
Value Iteration Solution
• Set the values of u(s) to random numbers.
• Use the Bellman update equation:

u_{t+1}(s) = r(s) + γ max_a Σ_{s′} T(s, a, s′) u_t(s′)

• Converge and stop using this equation when

δu < ε(1 − γ)/γ

where δu is the maximum utility change.
Value Iteration Algorithm
VALUE-ITERATION(T, r, γ, ε)
  do
    u ← u′
    δ ← 0
    for each s ∈ S
      do u′(s) ← r(s) + γ max_a Σ_{s′} T(s, a, s′) u(s′)
         if |u′(s) − u(s)| > δ
           then δ ← |u′(s) − u(s)|
  until δ < ε(1 − γ)/γ
  return u

Example: γ = 0.5 and ε = 0.15. The algorithm stops after t = 4.
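A minimal Python rendering of the pseudocode above (a sketch: it assumes the nested-dict transition table sketched earlier, and initializes values to zero rather than random numbers):

def value_iteration(T, r, gamma, epsilon):
    """Iterate Bellman updates until the largest change is below epsilon*(1-gamma)/gamma."""
    u = {s: 0.0 for s in T}                    # initial utilities (zeros, not random)
    while True:
        u_new, delta = {}, 0.0
        for s in T:
            u_new[s] = r[s] + gamma * max(
                sum(p * u[s2] for s2, p in dist.items())
                for dist in T[s].values()
            )
            delta = max(delta, abs(u_new[s] - u[s]))
        u = u_new
        if delta < epsilon * (1 - gamma) / gamma:
            return u

# Hypothetical example reusing the transition table T sketched earlier:
r = {"s1": 0.0, "s2": 1.0}
print(value_iteration(T, r, gamma=0.5, epsilon=0.15))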
MDP for one agent
• Multiagent: one agent changes, the others are
stationary.

• Better approach: T(s, a, s′)

where a is a vector of size n giving each agent's
action, and n represents the number of agents.
• Rewards:
– Dole out equally among agents
– Reward proportional to contribution
Observation model
• Noise + the agent cannot observe the world directly …

• Belief state b = ⟨P1, P2, P3, ..., Pn⟩

• Observation model O(s, o) = probability of
observing o while being in state s.

∀s′: b′(s′) = α O(s′, o) Σ_s T(s, a, s′) b(s)

where α is a normalization constant.
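A sketch of the belief-update equation above; the table names are assumptions: O[s][o] is a hypothetical observation-probability table following O(s, o), and b is a dict mapping states to probabilities:

def belief_update(T, O, b, a, o):
    """b'(s') = alpha * O(s', o) * sum_s T(s, a, s') * b(s), with alpha normalizing."""
    unnormalized = {
        s2: O[s2][o] * sum(T[s][a].get(s2, 0.0) * b[s] for s in b)
        for s2 in b
    }
    alpha = 1.0 / sum(unnormalized.values())   # assumes the observation is possible
    return {s2: alpha * p for s2, p in unnormalized.items()}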
Partially observable MDP

T(b, a, b′) = Σ_{s′} O(s′, o) Σ_s T(s, a, s′) b(s)   if (*) holds
            = 0                                       otherwise

(*): ∀s′: b′(s′) = α O(s′, o) Σ_s T(s, a, s′) b(s)  is true for b, a, b′

New reward function:

ρ(b) = Σ_s b(s) r(s)

• Solving a POMDP is hard.
