Decision Making Under Uncertainty
Russell and Norvig: Chapters 16 and 17 (CMSC421, Fall 2005)
Utility-Based Agent
[Figure: a utility-based agent connected to its environment through sensors and actuators]
Non-deterministic vs. Probabilistic Uncertainty
Non-deterministic model: an action leads to one of the outcomes {a, b, c}, with no probabilities attached; choose the decision that is best for the worst case (~ adversarial search)
Probabilistic model: an action leads to one of {a (pa), b (pb), c (pc)}, each outcome with a known probability; choose the decision that maximizes expected utility
Expected Utility
Random variable X with n values x_1, …, x_n and distribution (p_1, …, p_n)
  E.g., X is the state reached after doing an action A under uncertainty
Function U of X
  E.g., U is the utility of a state
The expected utility of A is EU[A] = Σ_{i=1..n} P(x_i | A) U(x_i)
One State/One Action Example
[Figure: from state s0, action A1 leads to s1 (p = 0.2, U = 100), s2 (p = 0.7, U = 50), or s3 (p = 0.1, U = 70)]
EU(A1) = 0.2 x 100 + 0.7 x 50 + 0.1 x 70 = 20 + 35 + 7 = 62
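The calculation can be checked with a short script. Below is a minimal sketch in Python (not part of the original slides); the probabilities and utilities are the ones from the example above.

```python
# Expected utility of one action: EU[A] = sum_i P(x_i | A) * U(x_i)
def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for the action's results."""
    return sum(p * u for p, u in outcomes)

# Action A1 from s0: s1 (p=0.2, U=100), s2 (p=0.7, U=50), s3 (p=0.1, U=70)
a1 = [(0.2, 100), (0.7, 50), (0.1, 70)]
print(expected_utility(a1))   # 62.0 (up to floating point)
```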
One State/Two Actions Example
[Figure: from state s0, action A1 leads to s1 (p = 0.2, U = 100), s2 (p = 0.7, U = 50), or s3 (p = 0.1, U = 70); action A2 leads to s2 (p = 0.2, U = 50) or s4 (p = 0.8, U = 80)]
EU(A1) = 62
EU(A2) = 0.2 x 50 + 0.8 x 80 = 74
EU(s0) = max{EU(A1), EU(A2)} = 74
Introducing Action Costs
[Figure: the same tree as before, with cost -5 attached to A1 and -25 attached to A2]
EU(A1) = 62 - 5 = 57
EU(A2) = 74 - 25 = 49
EU(s0) = max{EU(A1), EU(A2)} = 57
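Extending the sketch above to several actions with costs (again Python, illustrative only; the action names and numbers are the ones in the example):

```python
# Net expected utility per action (EU of outcomes minus action cost),
# then pick the action with the highest net EU.
def expected_utility(outcomes):
    return sum(p * u for p, u in outcomes)

actions = {
    "A1": {"cost": 5,  "outcomes": [(0.2, 100), (0.7, 50), (0.1, 70)]},
    "A2": {"cost": 25, "outcomes": [(0.2, 50), (0.8, 80)]},
}
net_eu = {name: expected_utility(spec["outcomes"]) - spec["cost"]
          for name, spec in actions.items()}
print(net_eu)                       # roughly {'A1': 57.0, 'A2': 49.0}
print(max(net_eu, key=net_eu.get))  # A1
```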
MEU Principle
A rational agent should choose the action that maximizes the agent's expected utility
This is the basis of the field of decision theory
Normative criterion for rational choice of action
Not quite
Must have complete model of:
  Actions, utilities, and states
Even with a complete model, the computation will be intractable
In fact, a truly rational agent takes into account the utility of reasoning as well (bounded rationality)
Nevertheless, great progress has been made in this area recently, and we are able to solve much more complex decision-theoretic problems than ever before
We'll look at
Decision Theoretic Planning
Simple decision making (ch. 16)
Sequential decision making (ch. 17)
Decision Networks
Extend BNs to handle actions and utilities
Also called influence diagrams
Make use of BN inference
Can do Value of Information (VOI) calculations
Decision Networks cont.
Chance nodes: random variables, as in BNs
Decision nodes: actions that the decision maker can take
Utility/value nodes: the utility of the outcome state
R&N example
Umbrella Network
[Figure: decision network with decision node Take Umbrella, chance nodes Umbrella and Rain, and utility node Happiness]
take / don't take; P(rain) = 0.4
P(umb | take) = 1.0; P(~umb | ~take) = 1.0
U(~umb, ~rain) = 100; U(~umb, rain) = -100
U(umb, ~rain) = 0; U(umb, rain) = -25
Evaluating Decision Networks
Set the evidence variables for the current state
For each possible value of the decision node:
  Set the decision node to that value
  Calculate the posterior probability of the parent nodes of the utility node, using BN inference
  Calculate the resulting expected utility for the action
Return the action with the highest expected utility
Umbrella Network
[Figure: the umbrella network as above]
take / don't take; P(rain) = 0.4
P(umb | take) = 1.0; P(~umb | ~take) = 1.0
U(~umb, ~rain) = 100; U(~umb, rain) = -100; U(umb, ~rain) = 0; U(umb, rain) = -25
#1: compute P(umb, rain | take) for (umb, rain) = (0,0), (0,1), (1,0), (1,1)
#2: compute EU(take)
Umbrella Network
[Figure: the umbrella network as above]
take / don't take; P(rain) = 0.4
P(umb | take) = 1.0; P(~umb | ~take) = 1.0
U(~umb, ~rain) = 100; U(~umb, rain) = -100; U(umb, ~rain) = 0; U(umb, rain) = -25
#1: compute P(umb, rain | ~take) for (umb, rain) = (0,0), (0,1), (1,0), (1,1)
#2: compute EU(~take)
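A sketch of the evaluation procedure above, applied to this network (Python; not part of the original slides). Because Umbrella is a deterministic function of the decision and Rain has no parents, the posterior P(umb, rain | decision) needed in step #1 factors as P(umb | decision) P(rain).

```python
# Evaluate the umbrella decision network: for each setting of the decision
# node, compute the posterior over the utility node's parents (Umbrella,
# Rain) and sum up the expected utility.
P_RAIN = 0.4
UTIL = {(0, 0): 100, (0, 1): -100, (1, 0): 0, (1, 1): -25}   # U(umb, rain)

def eu(take):
    total = 0.0
    for umb in (0, 1):
        p_umb = 1.0 if umb == int(take) else 0.0   # P(umb | take) = 1, P(~umb | ~take) = 1
        for rain in (0, 1):
            p_rain = P_RAIN if rain else 1.0 - P_RAIN
            total += p_umb * p_rain * UTIL[(umb, rain)]
    return total

print(eu(True))    # EU(take)  = 0.6*0   + 0.4*(-25)  = -10.0
print(eu(False))   # EU(~take) = 0.6*100 + 0.4*(-100) =  20.0
# With no forecast, the best decision is not to take the umbrella.
```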
Value of Information (VOI)
Suppose the agent's current knowledge is E. The value of the current best action α is:
  EU(α | E) = max_A Σ_i U(Result_i(A)) P(Result_i(A) | E, Do(A))
The value of the new best action α_ek, after new evidence E' is obtained, is:
  EU(α_ek | E, E') = max_A Σ_i U(Result_i(A)) P(Result_i(A) | E, E', Do(A))
The value of information of E' is:
  VOI(E') = Σ_k P(E' = e_k | E) EU(α_ek | e_k, E) - EU(α | E)
(where α_ek is the best action once E' = e_k has been observed)
Umbrella Network
[Figure: the umbrella network extended with a Forecast node that depends on Rain]
take / don't take; P(rain) = 0.4
P(umb | take) = 1.0; P(~umb | ~take) = 1.0
U(~umb, ~rain) = 100; U(~umb, rain) = -100; U(umb, ~rain) = 0; U(umb, rain) = -25
Forecast CPT, P(F | R): (R=0, F=0): 0.8; (R=0, F=1): 0.2; (R=1, F=0): 0.3; (R=1, F=1): 0.7
VOI
VOI(forecast) = P(rainy) EU(α_rainy | rainy) + P(~rainy) EU(α_~rainy | ~rainy) - EU(α)
P(F=rainy) = 0.4
Umbrella Network
[Figure: the umbrella network with the Forecast node]
take / don't take; P(rain) = 0.4
P(umb | take) = 1.0; P(~umb | ~take) = 1.0
U(~umb, ~rain) = 100; U(~umb, rain) = -100; U(umb, ~rain) = 0; U(umb, rain) = -25
Forecast CPT, P(F | R): (R=0, F=0): 0.8; (R=0, F=1): 0.2; (R=1, F=0): 0.3; (R=1, F=1): 0.7
Posterior P(R | F): (F=0, R=0): 0.8; (F=0, R=1): 0.2; (F=1, R=0): 0.3; (F=1, R=1): 0.7
Steps (F = 1 is a rainy forecast):
#1: compute P(umb, rain | take, rainy) and EU(take | rainy)
#2: compute P(umb, rain | ~take, rainy) and EU(~take | rainy)
#3: compute P(umb, rain | take, ~rainy) and EU(take | ~rainy)
#4: compute P(umb, rain | ~take, ~rainy) and EU(~take | ~rainy)
VOI
VOI(forecast) = P(rainy) EU(α_rainy | rainy) + P(~rainy) EU(α_~rainy | ~rainy) - EU(α)
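A sketch that works through steps #1 to #4 and the VOI formula in code (Python; not from the slides). It uses only the CPTs and utilities given above, reading F = 1 as a rainy forecast.

```python
# Value of information of the forecast in the umbrella network.
P_RAIN = 0.4
P_RAINY_FORECAST_GIVEN_R = {0: 0.2, 1: 0.7}     # P(F=1 | R), from the CPT above
UTIL = {(0, 0): 100, (0, 1): -100, (1, 0): 0, (1, 1): -25}   # U(umb, rain)

def eu(take, p_rain):
    umb = int(take)                               # umbrella follows the decision
    return (1.0 - p_rain) * UTIL[(umb, 0)] + p_rain * UTIL[(umb, 1)]

def best_eu(p_rain):
    return max(eu(True, p_rain), eu(False, p_rain))

# P(F = rainy), marginalizing over Rain.
p_rainy = (P_RAINY_FORECAST_GIVEN_R[1] * P_RAIN
           + P_RAINY_FORECAST_GIVEN_R[0] * (1.0 - P_RAIN))                        # = 0.4
# P(R = rain | F), by Bayes' rule (this reproduces the P(R | F) table above).
p_rain_given_rainy = P_RAINY_FORECAST_GIVEN_R[1] * P_RAIN / p_rainy               # 0.7
p_rain_given_sunny = (1 - P_RAINY_FORECAST_GIVEN_R[1]) * P_RAIN / (1 - p_rainy)   # 0.2

voi = (p_rainy * best_eu(p_rain_given_rainy)          # best action given "rainy"
       + (1 - p_rainy) * best_eu(p_rain_given_sunny)  # best action given "~rainy"
       - best_eu(P_RAIN))                             # best action with no forecast
print(voi)   # 0.4*(-17.5) + 0.6*60 - 20 = 9 (up to floating point)
```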
Sequential Decision Making
Finite Horizon Infinite Horizon
Simple Robot Navigation Problem
In each state, the possible actions are U, D, R, and L
Probabilistic Transition Model
In each state, the possible actions are U, D, R, and L
The effect of U is as follows (transition model):
  With probability 0.8 the robot moves up one square (if the robot is already in the top row, it does not move)
  With probability 0.1 the robot moves right one square (if the robot is already in the rightmost column, it does not move)
  With probability 0.1 the robot moves left one square (if the robot is already in the leftmost column, it does not move)
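The slides specify the effect of U only; the sketch below (Python, not from the slides) fills in D, R, and L symmetrically and assumes the blocked square at [2,2] of the Russell & Norvig 4x3 world.

```python
# A transition model for the grid robot. States are (column, row) with
# column 1..4 and row 1..3; the square (2, 2) is assumed blocked. Each action
# goes in its intended direction with probability 0.8 and slips to each
# perpendicular direction with probability 0.1; bumping into a wall or the
# blocked square means staying put.
COLS, ROWS = 4, 3
BLOCKED = {(2, 2)}
MOVES = {"U": (0, 1), "D": (0, -1), "R": (1, 0), "L": (-1, 0)}
PERP = {"U": ("L", "R"), "D": ("L", "R"), "R": ("U", "D"), "L": ("U", "D")}

def step(s, d):
    nx, ny = s[0] + MOVES[d][0], s[1] + MOVES[d][1]
    ok = 1 <= nx <= COLS and 1 <= ny <= ROWS and (nx, ny) not in BLOCKED
    return (nx, ny) if ok else s

def transition(s, a):
    """Return {next_state: probability} for taking action a in state s."""
    dist = {}
    for d, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        dist[step(s, d)] = dist.get(step(s, d), 0.0) + p
    return dist

print(transition((3, 2), "U"))   # {(3, 3): 0.8, (4, 2): 0.1, (3, 2): 0.1}
```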
Markov Property
The transition probabilities depend only on the current state, not on the previous history (how that state was reached)
Sequence of Actions
[Figure: the 4x3 grid; the robot starts in square [3,2]]
Planned sequence of actions: (U, R)
Sequence of Actions
Planned sequence of actions: (U, R)
U is executed: the possible resulting states are [3,2], [3,3], and [4,2]
Histories
Planned sequence of actions: (U, R)
U has been executed; R is executed: the possible resulting states are [3,1], [3,2], [3,3], [4,1], [4,2], and [4,3]
There are 9 possible sequences of states, called histories, and 6 possible final states for the robot!
Probability of Reaching the Goal
Note the importance of the Markov property in this derivation:
P([4,3] | (U,R).[3,2]) = P([4,3] | R.[3,3]) x P([3,3] | U.[3,2]) + P([4,3] | R.[4,2]) x P([4,2] | U.[3,2])
P([4,3] | R.[3,3]) = 0.8, P([3,3] | U.[3,2]) = 0.8
P([4,3] | R.[4,2]) = 0.1, P([4,2] | U.[3,2]) = 0.1
P([4,3] | (U,R).[3,2]) = 0.8 x 0.8 + 0.1 x 0.1 = 0.65
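A quick numeric check of this derivation (Python, illustrative): marginalize over the state reached after U, then apply R, exactly as the Markov property allows.

```python
# The probabilities given on the slide. From [3,2], R cannot reach [4,3]
# in one step, so that term contributes zero.
after_U        = {(3, 3): 0.8, (4, 2): 0.1, (3, 2): 0.1}   # P(state | U.[3,2])
p_goal_given_R = {(3, 3): 0.8, (4, 2): 0.1, (3, 2): 0.0}   # P([4,3] | R.state)

p = sum(after_U[s] * p_goal_given_R[s] for s in after_U)
print(p)   # 0.8*0.8 + 0.1*0.1 = 0.65 (up to floating point)
```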
Utility Function
[Figure: the 4x3 grid with reward +1 at [4,3] and -1 at [4,2]]
[4,3] provides power supply
[4,2] is a sand area from which the robot cannot escape
The robot needs to recharge its batteries
[4,3] and [4,2] are terminal states
Utility of a History
The utility of a history is defined by the utility of the last state (+1 or -1) minus n/25, where n is the number of moves
Utility of an Action Sequence
3 2 1 +1 -1
Consider the action sequence (U,R) from [3,2]
Utility of an Action Sequence
3 2 1 +1 -1 [3,2] [3,2] [3,3] [4,2] [3,1] [3,2] [3,3] [4,1] [4,2] [4,3]
Consider the action sequence (U,R) from [3,2] A run produces one among 7 possible histories, each with some probability
Utility of an Action Sequence
3 2 1 +1 -1 [3,2] [3,2] [3,3] [4,2] [3,1] [3,2] [3,3] [4,1] [4,2] [4,3]
Consider the action sequence (U,R) from [3,2] A run produces one among 7 possible histories, each with some probability The utility of the sequence is the expected utility of the histories:
U = ShUh P(h)
Optimal Action Sequence
Consider the action sequence (U, R) from [3,2]
The optimal sequence is the one with maximal expected utility
But a sequence is optimal only if it is executed blindly!
Is the optimal action sequence really what we want to compute?
Reactive Agent Algorithm
(Assumes an accessible, i.e., observable, state)
Repeat:
  s <- sensed state
  If s is terminal then exit
  a <- choose action (given s)
  Perform a
Policy
(Reactive/Closed-Loop Strategy)
A policy Π is a complete mapping from states to actions
Reactive Agent Algorithm
Repeat:
  s <- sensed state
  If s is terminal then exit
  a <- Π(s)
  Perform a
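The same loop as a Python sketch (the environment hooks `sense_state` and `perform` are hypothetical placeholders, not part of the slides):

```python
# Reactive, policy-driven agent: sense, look up the action, act, repeat.
def run_policy(policy, sense_state, perform, terminal_states):
    """policy: dict mapping every non-terminal state to an action."""
    while True:
        s = sense_state()              # observe the current state
        if s in terminal_states:
            return
        perform(policy[s])             # execute the action prescribed by the policy
```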
Optimal Policy
A policy Π is a complete mapping from states to actions
The optimal policy Π* is the one that always yields a history (ending at a terminal state) with maximal expected utility
This makes sense because of the Markov property
Note that [3,2] is a dangerous state that the optimal policy tries to avoid
Optimal Policy
A policy Π is a complete mapping from states to actions
The optimal policy Π* is the one that always yields a history with maximal expected utility
This problem is called a Markov Decision Problem (MDP)
How to compute Π*?
Additive Utility
History H = (s_0, s_1, …, s_n)
The utility of H is additive iff: U(s_0, s_1, …, s_n) = R(0) + U(s_1, …, s_n) = Σ_i R(i)
R(i) is the reward received in state s_i
Robot navigation example:
  R(n) = +1 if s_n = [4,3]
  R(n) = -1 if s_n = [4,2]
  R(i) = -1/25 for i = 0, …, n-1
Principle of Max Expected Utility
History H = (s_0, s_1, …, s_n)
Utility of H: U(s_0, s_1, …, s_n) = Σ_i R(i)
First-step analysis
The utility of a state is its reward plus the expected utility of the best successor (the Bellman equation):
  U(i) = R(i) + max_a Σ_k P(k | a.i) U(k)
  Π*(i) = arg max_a Σ_k P(k | a.i) U(k)
Value Iteration
Initialize the utility of each non-terminal state s_i to U_0(i) = 0
For t = 0, 1, 2, …, do:
  U_{t+1}(i) <- R(i) + max_a Σ_k P(k | a.i) U_t(k)
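A sketch of value iteration on the 4x3 grid world in Python (not from the slides). The grid layout, the blocked square at [2,2], the terminal rewards, and the per-step reward of -1/25 follow the running example; everything else is an illustrative implementation choice.

```python
# Value iteration for the 4x3 grid world.
COLS, ROWS = 4, 3
BLOCKED = {(2, 2)}
TERMINAL = {(4, 3): 1.0, (4, 2): -1.0}       # +1 and -1 terminal states
STEP_REWARD = -1.0 / 25                      # R(i) for non-terminal states

STATES = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1)
          if (x, y) not in BLOCKED]
MOVES = {"U": (0, 1), "D": (0, -1), "R": (1, 0), "L": (-1, 0)}
PERP = {"U": ("L", "R"), "D": ("L", "R"), "R": ("U", "D"), "L": ("U", "D")}

def step(s, d):
    nx, ny = s[0] + MOVES[d][0], s[1] + MOVES[d][1]
    ok = 1 <= nx <= COLS and 1 <= ny <= ROWS and (nx, ny) not in BLOCKED
    return (nx, ny) if ok else s             # bump: stay in place

def transition(s, a):
    """P(next state | a.s): 0.8 intended direction, 0.1 each perpendicular."""
    dist = {}
    for d, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        dist[step(s, d)] = dist.get(step(s, d), 0.0) + p
    return dist

def value_iteration(n_iter=100):
    U = {s: TERMINAL.get(s, 0.0) for s in STATES}    # U_0 = 0 on non-terminals
    for _ in range(n_iter):
        U = {s: U[s] if s in TERMINAL else
                STEP_REWARD + max(sum(p * U[k] for k, p in transition(s, a).items())
                                  for a in MOVES)
             for s in STATES}
    return U

U = value_iteration()
print(round(U[(3, 1)], 3))   # should approach 0.611, as in the convergence plot below
```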
Value Iteration
Initialize the utility of each non-terminal state s_i to U_0(i) = 0
For t = 0, 1, 2, …, do:
  U_{t+1}(i) <- R(i) + max_a Σ_k P(k | a.i) U_t(k)
Note the importance of the terminal states and of the connectivity of the state-transition graph
[Figure: converged utilities of the grid (row 3: 0.812, 0.868, 0.918, +1; row 2: 0.762, obstacle, 0.660, -1; row 1: 0.705, 0.655, 0.611, 0.388) and a plot of U_t([3,1]) converging to about 0.611 within roughly 20 to 30 iterations]
Policy Iteration
Pick a policy Π at random
Repeat:
  Compute the utility of each state for Π:
    U_{t+1}(i) <- R(i) + Σ_k P(k | Π(i).i) U_t(k)
    (or solve the set of linear equations U(i) = R(i) + Σ_k P(k | Π(i).i) U(k), often a sparse system)
  Compute the policy Π' given these utilities:
    Π'(i) = arg max_a Σ_k P(k | a.i) U(k)
  If Π' = Π then return Π
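A Python sketch of policy iteration on the same grid world (not from the slides). For brevity, policy evaluation iterates the fixed-policy update a fixed number of times instead of solving the linear system exactly; the grid model is repeated so the snippet stands alone.

```python
# Policy iteration for the 4x3 grid world.
COLS, ROWS, BLOCKED = 4, 3, {(2, 2)}
TERMINAL = {(4, 3): 1.0, (4, 2): -1.0}
STEP_REWARD = -1.0 / 25
STATES = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1)
          if (x, y) not in BLOCKED]
MOVES = {"U": (0, 1), "D": (0, -1), "R": (1, 0), "L": (-1, 0)}
PERP = {"U": ("L", "R"), "D": ("L", "R"), "R": ("U", "D"), "L": ("U", "D")}

def step(s, d):
    nx, ny = s[0] + MOVES[d][0], s[1] + MOVES[d][1]
    ok = 1 <= nx <= COLS and 1 <= ny <= ROWS and (nx, ny) not in BLOCKED
    return (nx, ny) if ok else s

def q(s, a, U):
    """Expected utility of the successor: sum_k P(k | a.s) U(k)."""
    total = 0.0
    for d, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        total += p * U[step(s, d)]
    return total

def policy_iteration(eval_iters=100, max_rounds=50):
    policy = {s: "U" for s in STATES if s not in TERMINAL}   # arbitrary initial policy
    U = {s: TERMINAL.get(s, 0.0) for s in STATES}
    for _ in range(max_rounds):
        # Policy evaluation: iterate U(i) <- R(i) + sum_k P(k | pi(i).i) U(k).
        for _ in range(eval_iters):
            U = {s: U[s] if s in TERMINAL else STEP_REWARD + q(s, policy[s], U)
                 for s in STATES}
        # Policy improvement: pi'(i) = argmax_a sum_k P(k | a.i) U(k).
        new_policy = {s: max(MOVES, key=lambda a: q(s, a, U)) for s in policy}
        if new_policy == policy:
            break
        policy = new_policy
    return policy, U

policy, U = policy_iteration()
print(policy[(3, 1)], round(U[(3, 1)], 3))   # action chosen at [3,1] and its utility
```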
Example: Tracking a Target
The robot must keep the target in view
The target's trajectory is not known in advance
[Figure: the robot and the moving target in the workspace]
Infinite Horizon
In many problems, e.g., the robot navigation example, histories are potentially unbounded and the same state can be reached many times
What if the robot lives forever?
One trick: use discounting to make the infinite-horizon problem mathematically tractable, e.g., U(s_0, s_1, …) = Σ_i γ^i R(i) with discount factor 0 < γ < 1, so the sum stays bounded
POMDP (Partially Observable Markov Decision Problem)
A sensing operation returns multiple states, with a probability distribution
Choosing the action that maximizes the expected utility of this state distribution, assuming state utilities computed as above, is not good enough, and actually does not make sense (it is not rational)
Example: Target Tracking
There is uncertainty in the robot's and the target's positions, and this uncertainty grows with further motion
There is a risk that the target escapes behind the corner, requiring the robot to move appropriately
But there is a positioning landmark nearby. Should the robot try to reduce its position uncertainty?
Summary
Decision making under uncertainty
Utility function
Optimal policy
Maximal expected utility
Value iteration
Policy iteration