Today I picked up the `mdp.py` code to assist some work with the Value Iteration algorithm, and I found what I think are deficiencies in the `MDP` class and the `value_iteration()` function, particularly with respect to how they handle terminal states.

- `MDP.actions(self, state)` returns `[None]` if `state` is a terminal state. This is wrong: the returned action is used as a lookup key into the transition dictionary, and `None` is not a key in that dictionary, so the lookup fails. The return value should probably be `[]`, the empty list (see the sketch after this list).
- In `value_iteration()`, if the state being updated is terminal (and the above problem is fixed), then `max()` is called over an empty sequence, which raises a `ValueError`. Adding `default=0` to the arguments of `max()` handles updating terminal states.
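For concreteness, here is a minimal sketch of the first fix and of why `default=0` is then needed. I'm assuming AIMA-style attribute names (`self.actlist`, `self.terminals`), which may not match the actual `mdp.py` exactly:

```python
# Minimal sketch only: a toy MDP with the proposed actions() fix.
# Attribute names self.actlist / self.terminals are assumptions.
class MDP:
    def __init__(self, actlist, terminals):
        self.actlist = actlist
        self.terminals = terminals

    def actions(self, state):
        """Actions available in `state`; a terminal state offers no actions."""
        if state in self.terminals:
            return []          # was: return [None]
        return self.actlist

# With the empty list, the max() over actions in value_iteration() needs
# default=0, because max() of an empty sequence raises ValueError:
mdp = MDP(['N', 'S', 'E', 'W'], terminals=[(1, 1)])
print(max((0 for a in mdp.actions((1, 1))), default=0))   # prints 0 instead of raising
```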
There is also this problem:

- If `delta` is equal to `0`, which is the only way convergence can show up when `gamma = 1`, the loop-breaking condition `delta < epsilon * (1 - gamma) / gamma` can never become true, because its right-hand side is also `0`. Since `gamma = 1` is a valid discount factor, this behaviour is a bug. It can at least be fixed by extending the condition to `delta == 0 or delta < epsilon * (1 - gamma) / gamma` (a sketch of the patched function follows).
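Putting the three changes together, a patched `value_iteration()` might look roughly like this. Again, this is only a sketch: the interface (`mdp.states`, `mdp.actions`, `mdp.T`, `mdp.R`, `mdp.gamma`, the `epsilon` parameter) is assumed from the AIMA-style code and may differ from the actual file:

```python
# Sketch of a patched value_iteration(); assumes an AIMA-style MDP interface
# (mdp.states, mdp.actions, mdp.T, mdp.R, mdp.gamma), which may differ from
# the real mdp.py.
def value_iteration(mdp, epsilon=0.001):
    U1 = {s: 0 for s in mdp.states}
    while True:
        U = U1.copy()
        delta = 0
        for s in mdp.states:
            # default=0 keeps the update well defined for terminal states,
            # where mdp.actions(s) is now the empty list.
            U1[s] = mdp.R(s) + mdp.gamma * max(
                (sum(p * U[s1] for (p, s1) in mdp.T(s, a))
                 for a in mdp.actions(s)),
                default=0)
            delta = max(delta, abs(U1[s] - U[s]))
        # `delta == 0` covers gamma == 1, where the usual threshold is 0 and
        # the original condition could never fire.
        if delta == 0 or delta < epsilon * (1 - mdp.gamma) / mdp.gamma:
            return U
```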
Appendix:
Code setup that discovered these problems:
```python
from mdp import MDP, value_iteration

act_list = ['N', 'S', 'E', 'W']
terminals = [(1, 1), (1, 3), (3, 1)]
transitions = {
    (1, 1): {
        'N': [(0.7, (1, 1)), (0.3, (1, 1))],
        'S': [(0.8, (2, 1)), (0.2, (1, 2))],
        'E': [(0.6, (1, 2)), (0.4, (2, 1))],
        'W': [(0.3, (1, 1)), (0.7, (1, 1))],
    },
    (1, 2): {
        'N': [(0.7, (1, 2)), (0.3, (1, 1))],
        'S': [(0.8, (2, 2)), (0.2, (1, 3))],
        'E': [(0.6, (1, 3)), (0.4, (2, 2))],
        'W': [(0.7, (1, 1)), (0.3, (1, 2))],
    },
    (1, 3): {
        'N': [(0.7, (1, 3)), (0.3, (1, 2))],
        'S': [(0.8, (2, 3)), (0.2, (1, 3))],
        'E': [(0.6, (1, 3)), (0.4, (2, 3))],
        'W': [(0.7, (1, 2)), (0.3, (1, 3))],
    },
    (2, 1): {
        'N': [(0.7, (1, 1)), (0.3, (2, 1))],
        'S': [(0.8, (3, 1)), (0.2, (2, 2))],
        'E': [(0.6, (2, 2)), (0.4, (3, 1))],
        'W': [(0.7, (2, 1)), (0.3, (1, 1))],
    },
    (2, 2): {
        'N': [(0.7, (1, 2)), (0.3, (2, 1))],
        'S': [(0.8, (3, 2)), (0.2, (2, 3))],
        'E': [(0.6, (2, 3)), (0.4, (3, 2))],
        'W': [(0.7, (2, 1)), (0.3, (1, 2))],
    },
    (2, 3): {
        'N': [(0.7, (1, 3)), (0.3, (2, 2))],
        'S': [(0.8, (3, 3)), (0.2, (2, 3))],
        'E': [(0.6, (2, 3)), (0.4, (3, 3))],
        'W': [(0.7, (2, 2)), (0.3, (1, 3))],
    },
    (3, 1): {
        'N': [(0.7, (2, 1)), (0.3, (3, 1))],
        'S': [(0.8, (3, 1)), (0.2, (3, 2))],
        'E': [(0.6, (3, 2)), (0.4, (3, 1))],
        'W': [(0.7, (3, 1)), (0.3, (2, 1))],
    },
    (3, 2): {
        'N': [(0.7, (2, 2)), (0.3, (3, 1))],
        'S': [(0.8, (3, 2)), (0.2, (3, 3))],
        'E': [(0.6, (3, 3)), (0.4, (3, 2))],
        'W': [(0.7, (3, 1)), (0.3, (2, 2))],
    },
    (3, 3): {
        'N': [(0.7, (2, 3)), (0.3, (3, 2))],
        'S': [(0.8, (3, 3)), (0.2, (3, 3))],
        'E': [(0.6, (3, 3)), (0.4, (3, 3))],
        'W': [(0.7, (3, 2)), (0.3, (2, 3))],
    },
}
rewards = {
    (1, 1): 20,
    (1, 2): -1,
    (1, 3): 5,
    (2, 1): -1,
    (2, 2): -1,
    (2, 3): -1,
    (3, 1): -20,
    (3, 2): -1,
    (3, 3): -1,
}
states = list(rewards.keys())
gamma = 1
init = (1, 1)

problem = MDP(init, act_list, terminals, transitions, rewards, states, gamma)
u = value_iteration(problem)
print(u)
```