Today I picked up the `mdp.py` code to assist some work with the Value Iteration algorithm, and I found what I think are deficiencies in the `MDP` class and the `value_iteration()` function, particularly with respect to how they handle terminal states.

- `MDP.actions(self, state)` returns `[None]` if `state` is a terminal state. This is wrong: the returned action is used as a lookup key into the transition dictionary, and `None` is not a key in that dictionary, so the lookup fails. The return value should probably be `[]`, the empty list (see the sketch after this list).
- In `value_iteration()`, if the state being updated is terminal (and the above problem is fixed), then `max()` is called over an empty sequence, which raises a `ValueError`. Adding `default=0` to the arguments of `max()` handles updating terminal states.
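For concreteness, here is a minimal sketch of the first fix and of why `default=0` is then needed. I'm assuming AIMA-style attribute names (`self.actlist`, `self.terminals`), which may not match the actual `mdp.py` exactly:

```python
# Minimal sketch only: a toy MDP with the proposed actions() fix.
# Attribute names self.actlist / self.terminals are assumptions.
class MDP:
    def __init__(self, actlist, terminals):
        self.actlist = actlist
        self.terminals = terminals

    def actions(self, state):
        """Actions available in `state`; a terminal state offers no actions."""
        if state in self.terminals:
            return []          # was: return [None]
        return self.actlist

# With the empty list, the max() over actions in value_iteration() needs
# default=0, because max() of an empty sequence raises ValueError:
mdp = MDP(['N', 'S', 'E', 'W'], terminals=[(1, 1)])
print(max((0 for a in mdp.actions((1, 1))), default=0))   # prints 0 instead of raising
```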
There is also this problem:

- If `delta` is equal to `0`, which is the only way convergence can show up when `gamma = 1`, the loop-breaking condition `delta < epsilon * (1 - gamma) / gamma` can never become true, because its right-hand side is also `0`. Since `gamma = 1` is a valid discount factor, this behaviour is a bug. It can at least be fixed by extending the condition to `delta == 0 or delta < epsilon * (1 - gamma) / gamma` (a sketch of the patched function follows).
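Putting the three changes together, a patched `value_iteration()` might look roughly like this. Again, this is only a sketch: the interface (`mdp.states`, `mdp.actions`, `mdp.T`, `mdp.R`, `mdp.gamma`, the `epsilon` parameter) is assumed from the AIMA-style code and may differ from the actual file:

```python
# Sketch of a patched value_iteration(); assumes an AIMA-style MDP interface
# (mdp.states, mdp.actions, mdp.T, mdp.R, mdp.gamma), which may differ from
# the real mdp.py.
def value_iteration(mdp, epsilon=0.001):
    U1 = {s: 0 for s in mdp.states}
    while True:
        U = U1.copy()
        delta = 0
        for s in mdp.states:
            # default=0 keeps the update well defined for terminal states,
            # where mdp.actions(s) is now the empty list.
            U1[s] = mdp.R(s) + mdp.gamma * max(
                (sum(p * U[s1] for (p, s1) in mdp.T(s, a))
                 for a in mdp.actions(s)),
                default=0)
            delta = max(delta, abs(U1[s] - U[s]))
        # `delta == 0` covers gamma == 1, where the usual threshold is 0 and
        # the original condition could never fire.
        if delta == 0 or delta < epsilon * (1 - mdp.gamma) / mdp.gamma:
            return U
```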
Appendix:
Code setup that discovered these problems:
```python
from mdp import MDP, value_iteration

act_list = ['N', 'S', 'E', 'W']
terminals = [(1, 1), (1, 3), (3, 1)]
transitions = {
    (1, 1): {
        'N': [(0.7, (1, 1)), (0.3, (1, 1))],
        'S': [(0.8, (2, 1)), (0.2, (1, 2))],
        'E': [(0.6, (1, 2)), (0.4, (2, 1))],
        'W': [(0.3, (1, 1)), (0.7, (1, 1))],
    },
    (1, 2): {
        'N': [(0.7, (1, 2)), (0.3, (1, 1))],
        'S': [(0.8, (2, 2)), (0.2, (1, 3))],
        'E': [(0.6, (1, 3)), (0.4, (2, 2))],
        'W': [(0.7, (1, 1)), (0.3, (1, 2))],
    },
    (1, 3): {
        'N': [(0.7, (1, 3)), (0.3, (1, 2))],
        'S': [(0.8, (2, 3)), (0.2, (1, 3))],
        'E': [(0.6, (1, 3)), (0.4, (2, 3))],
        'W': [(0.7, (1, 2)), (0.3, (1, 3))],
    },
    (2, 1): {
        'N': [(0.7, (1, 1)), (0.3, (2, 1))],
        'S': [(0.8, (3, 1)), (0.2, (2, 2))],
        'E': [(0.6, (2, 2)), (0.4, (3, 1))],
        'W': [(0.7, (2, 1)), (0.3, (1, 1))],
    },
    (2, 2): {
        'N': [(0.7, (1, 2)), (0.3, (2, 1))],
        'S': [(0.8, (3, 2)), (0.2, (2, 3))],
        'E': [(0.6, (2, 3)), (0.4, (3, 2))],
        'W': [(0.7, (2, 1)), (0.3, (1, 2))],
    },
    (2, 3): {
        'N': [(0.7, (1, 3)), (0.3, (2, 2))],
        'S': [(0.8, (3, 3)), (0.2, (2, 3))],
        'E': [(0.6, (2, 3)), (0.4, (3, 3))],
        'W': [(0.7, (2, 2)), (0.3, (1, 3))],
    },
    (3, 1): {
        'N': [(0.7, (2, 1)), (0.3, (3, 1))],
        'S': [(0.8, (3, 1)), (0.2, (3, 2))],
        'E': [(0.6, (3, 2)), (0.4, (3, 1))],
        'W': [(0.7, (3, 1)), (0.3, (2, 1))],
    },
    (3, 2): {
        'N': [(0.7, (2, 2)), (0.3, (3, 1))],
        'S': [(0.8, (3, 2)), (0.2, (3, 3))],
        'E': [(0.6, (3, 3)), (0.4, (3, 2))],
        'W': [(0.7, (3, 1)), (0.3, (2, 2))],
    },
    (3, 3): {
        'N': [(0.7, (2, 3)), (0.3, (3, 2))],
        'S': [(0.8, (3, 3)), (0.2, (3, 3))],
        'E': [(0.6, (3, 3)), (0.4, (3, 3))],
        'W': [(0.7, (3, 2)), (0.3, (2, 3))],
    },
}
rewards = {
    (1, 1): 20,
    (1, 2): -1,
    (1, 3): 5,
    (2, 1): -1,
    (2, 2): -1,
    (2, 3): -1,
    (3, 1): -20,
    (3, 2): -1,
    (3, 3): -1,
}
states = list(rewards.keys())
gamma = 1
init = (1, 1)

problem = MDP(init, act_list, terminals, transitions, rewards, states, gamma)
u = value_iteration(problem)
print(u)
```