diff --git a/images/-0.04.jpg b/images/-0.04.jpg new file mode 100644 index 000000000..3cf276421 Binary files /dev/null and b/images/-0.04.jpg differ diff --git a/images/-0.4.jpg b/images/-0.4.jpg new file mode 100644 index 000000000..b274d2ce3 Binary files /dev/null and b/images/-0.4.jpg differ diff --git a/images/-4.jpg b/images/-4.jpg new file mode 100644 index 000000000..79eefb0cd Binary files /dev/null and b/images/-4.jpg differ diff --git a/images/4.jpg b/images/4.jpg new file mode 100644 index 000000000..55e75001d Binary files /dev/null and b/images/4.jpg differ diff --git a/images/ge0.jpg b/images/ge0.jpg new file mode 100644 index 000000000..a70b18703 Binary files /dev/null and b/images/ge0.jpg differ diff --git a/images/ge1.jpg b/images/ge1.jpg new file mode 100644 index 000000000..624f16e25 Binary files /dev/null and b/images/ge1.jpg differ diff --git a/images/ge2.jpg b/images/ge2.jpg new file mode 100644 index 000000000..3a29f8f4c Binary files /dev/null and b/images/ge2.jpg differ diff --git a/images/ge4.jpg b/images/ge4.jpg new file mode 100644 index 000000000..b3a4b4acd Binary files /dev/null and b/images/ge4.jpg differ diff --git a/images/grid_mdp.jpg b/images/grid_mdp.jpg new file mode 100644 index 000000000..fa77fa276 Binary files /dev/null and b/images/grid_mdp.jpg differ diff --git a/images/grid_mdp_agent.jpg b/images/grid_mdp_agent.jpg new file mode 100644 index 000000000..3f247b6f2 Binary files /dev/null and b/images/grid_mdp_agent.jpg differ diff --git a/mdp.ipynb b/mdp.ipynb index af46f948c..4a3f1f757 100644 --- a/mdp.ipynb +++ b/mdp.ipynb @@ -30,7 +30,9 @@ "* Overview\n", "* MDP\n", "* Grid MDP\n", - "* Value Iteration Visualization" + "* Value Iteration\n", + " * Value Iteration Visualization\n", + "* Policy Iteration" ] }, { @@ -547,7 +549,7 @@ "collapsed": true }, "source": [ - "# Value Iteration\n", + "# VALUE ITERATION\n", "\n", "Now that we have looked at how to represent MDPs, let's aim at solving them. Our ultimate goal is to obtain an optimal policy. We start by looking at Value Iteration and a visualization that should help us understand it better.\n", "\n", @@ -649,6 +651,30 @@ "pseudocode(\"Value-Iteration\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### AIMA3e\n", + "__function__ VALUE-ITERATION(_mdp_, _ε_) __returns__ a utility function \n", + " __inputs__: _mdp_, an MDP with states _S_, actions _A_(_s_), transition model _P_(_s′_ | _s_, _a_), \n", + "      rewards _R_(_s_), discount _γ_ \n", + "   _ε_, the maximum error allowed in the utility of any state \n", + " __local variables__: _U_, _U′_, vectors of utilities for states in _S_, initially zero \n", + "        _δ_, the maximum change in the utility of any state in an iteration \n", + "\n", + " __repeat__ \n", + "   _U_ ← _U′_; _δ_ ← 0 \n", + "   __for each__ state _s_ in _S_ __do__ \n", + "     _U′_\[_s_\] ← _R_(_s_) + _γ_ max_a_ ∈ _A_(_s_) Σ _P_(_s′_ | _s_, _a_) _U_\[_s′_\] \n", + "     __if__ | _U′_\[_s_\] − _U_\[_s_\] | > _δ_ __then__ _δ_ ← | _U′_\[_s_\] − _U_\[_s_\] | \n", + " __until__ _δ_ < _ε_(1 − _γ_)/_γ_ \n", + " __return__ _U_ \n", + "\n", + "---\n", + "__Figure ??__ The value iteration algorithm for calculating utilities of states. The termination condition is from Equation (__??__)." + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ @@ -766,6 +792,1198 @@ "source": [ "Move the slider above to observe how the utility changes across iterations. 
It is also possible to move the slider using arrow keys or to jump to the value by directly editing the number with a double click. The **Visualize Button** will automatically animate the slider for you. The **Extra Delay Box** allows you to set a time delay of up to one second for each time step. There is also an interactive editor for grid-world problems, `grid_mdp.py`, in the gui folder for you to play around with." ] }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": true + }, + "source": [ + "# POLICY ITERATION\n", + "\n", + "We have already seen that value iteration converges to the optimal policy long before it accurately estimates the utility function. \n", + "If one action is clearly better than all the others, then the exact magnitude of the utilities in the states involved need not be precise. \n", + "The policy iteration algorithm works on this insight. \n", + "The algorithm executes two fundamental steps:\n", + "* **Policy evaluation**: Given a policy _πᵢ_, calculate _Uᵢ = U(πᵢ)_, the utility of each state if _πᵢ_ were to be executed.\n", + "* **Policy improvement**: Calculate a new policy _πᵢ₊₁_ using one-step look-ahead based on the utility values calculated.\n", + "\n", + "The algorithm terminates when the policy improvement step yields no change in the policy (and hence in the utilities). \n", + "Refer to **Figure 17.6** in the book to see how this is an improvement over value iteration.\n", + "We now have a simplified version of the Bellman equation\n", + "\n", + "$$U_i(s) = R(s) + \gamma \sum_{s'}P(s'\ |\ s, \pi_i(s))U_i(s')$$\n", + "\n", + "An important observation is that this equation doesn't have the `max` operator, which makes it linear.\n", + "For _n_ states, we have _n_ linear equations with _n_ unknowns, which can be solved exactly in time _**O(n³)**_.\n", + "For more implementation details, have a look at **Section 17.3**.\n", + "Let us now look at how the expected utility is found and how `policy_iteration` is implemented." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + " Codestin Search App\n", + " \n", + " \n", + "\n", + "\n", + "

\n", + "\n", + "
def expected_utility(a, s, U, mdp):\n",
+       "    """The expected utility of doing a in state s, according to the MDP and U."""\n",
+       "    return sum([p * U[s1] for (p, s1) in mdp.T(s, a)])\n",
+       "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "psource(expected_utility)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + " Codestin Search App\n", + " \n", + " \n", + "\n", + "\n", + "

\n", + "\n", + "
def policy_iteration(mdp):\n",
+       "    """Solve an MDP by policy iteration [Figure 17.7]"""\n",
+       "    U = {s: 0 for s in mdp.states}\n",
+       "    pi = {s: random.choice(mdp.actions(s)) for s in mdp.states}\n",
+       "    while True:\n",
+       "        U = policy_evaluation(pi, U, mdp)\n",
+       "        unchanged = True\n",
+       "        for s in mdp.states:\n",
+       "            a = argmax(mdp.actions(s), key=lambda a: expected_utility(a, s, U, mdp))\n",
+       "            if a != pi[s]:\n",
+       "                pi[s] = a\n",
+       "                unchanged = False\n",
+       "        if unchanged:\n",
+       "            return pi\n",
+       "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "psource(policy_iteration)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
Fortunately, it is not necessary to do _exact_ policy evaluation. \n", + "The utilities can instead be reasonably approximated by performing some number of simplified value iteration steps.\n", + "The simplified Bellman update equation for the process is\n", + "\n", + "$$U_{i+1}(s) \\leftarrow R(s) + \\gamma\\sum_{s'}P(s'\\ |\\ s,\\pi_i(s))U_{i}(s')$$\n", + "\n", + "and this is repeated _k_ times to produce the next utility estimate. This is called _modified policy iteration_." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + " Codestin Search App\n", + " \n", + " \n", + "\n", + "\n", + "

\n", + "\n", + "
def policy_evaluation(pi, U, mdp, k=20):\n",
+       "    """Return an updated utility mapping U from each state in the MDP to its\n",
+       "    utility, using an approximation (modified policy iteration)."""\n",
+       "    R, T, gamma = mdp.R, mdp.T, mdp.gamma\n",
+       "    for i in range(k):\n",
+       "        for s in mdp.states:\n",
+       "            U[s] = R(s) + gamma * sum([p * U[s1] for (p, s1) in T(s, pi[s])])\n",
+       "    return U\n",
+       "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "psource(policy_evaluation)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let us now solve **`sequential_decision_environment`** using `policy_iteration`." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{(0, 0): (0, 1),\n", + " (0, 1): (0, 1),\n", + " (0, 2): (1, 0),\n", + " (1, 0): (1, 0),\n", + " (1, 2): (1, 0),\n", + " (2, 0): (0, 1),\n", + " (2, 1): (0, 1),\n", + " (2, 2): (1, 0),\n", + " (3, 0): (-1, 0),\n", + " (3, 1): None,\n", + " (3, 2): None}" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "policy_iteration(sequential_decision_environment)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "pseudocode('Policy-Iteration')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### AIMA3e\n", + "__function__ POLICY-ITERATION(_mdp_) __returns__ a policy \n", + " __inputs__: _mdp_, an MDP with states _S_, actions _A_(_s_), transition model _P_(_s′_ | _s_, _a_) \n", + " __local variables__: _U_, a vector of utilities for states in _S_, initially zero \n", + "        _π_, a policy vector indexed by state, initially random \n", + "\n", + " __repeat__ \n", + "   _U_ ← POLICY\\-EVALUATION(_π_, _U_, _mdp_) \n", + "   _unchanged?_ ← true \n", + "   __for each__ state _s_ __in__ _S_ __do__ \n", + "     __if__ max_a_ ∈ _A_(_s_) Σ_s′_ _P_(_s′_ | _s_, _a_) _U_\\[_s′_\\] > Σ_s′_ _P_(_s′_ | _s_, _π_\\[_s_\\]) _U_\\[_s′_\\] __then do__ \n", + "       _π_\\[_s_\\] ← argmax_a_ ∈ _A_(_s_) Σ_s′_ _P_(_s′_ | _s_, _a_) _U_\\[_s′_\\] \n", + "       _unchanged?_ ← false \n", + " __until__ _unchanged?_ \n", + " __return__ _π_ \n", + "\n", + "---\n", + "__Figure ??__ The policy iteration algorithm for calculating an optimal policy." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": true + }, + "source": [ + "## Sequential Decision Problems\n", + "\n", + "Now that we have the tools required to solve MDPs, let us see how Sequential Decision Problems can be solved step by step and how a few built-in tools in the GridMDP class help us better analyse the problem at hand. \n", + "As always, we will work with the grid world from **Figure 17.1** from the book.\n", + "![title](images/grid_mdp.jpg)\n", + "
This is the environment for our agent.\n", "We assume for now that the environment is _fully observable_, so that the agent always knows where it is.\n", "We also assume that the transitions are **Markovian**, that is, the probability of reaching state _s'_ from state _s_ depends only on _s_ and not on the history of earlier states.\n", "Almost all stochastic decision problems can be reframed as a Markov Decision Process just by tweaking the definition of a _state_ for that particular problem.\n", "
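\n", "Formally, the Markov assumption on the transition model can be written as\n", "\n", "$$P(s_{t+1}\ |\ s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = P(s_{t+1}\ |\ s_t, a_t)$$\n", "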
\n", + "However, the actions of our agent in this environment are unreliable.\n", + "In other words, the motion of our agent is stochastic. \n", + "More specifically, the agent does the intended action with a probability of _0.8_, but with probability _0.1_, it moves to the right and with probability _0.1_ it moves to the left of the intended direction.\n", + "The agent stays put if it bumps into a wall.\n", + "![title](images/grid_mdp_agent.jpg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "These properties of the agent are called the transition properties and are hardcoded into the GridMDP class as you can see below." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + " Codestin Search App\n", + " \n", + " \n", + "\n", + "\n", + "

\n", + "\n", + "
    def T(self, state, action):\n",
+       "        if action is None:\n",
+       "            return [(0.0, state)]\n",
+       "        else:\n",
+       "            return [(0.8, self.go(state, action)),\n",
+       "                    (0.1, self.go(state, turn_right(action))),\n",
+       "                    (0.1, self.go(state, turn_left(action)))]\n",
+       "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "psource(GridMDP.T)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To completely define our task environment, we need to specify the utility function for the agent. \n", + "This is the function that gives the agent a rough estimate of how good being in a particular state is, or how much _reward_ an agent receives by being in that state.\n", + "The agent then tries to maximize the reward it gets.\n", + "As the decision problem is sequential, the utility function will depend on a sequence of states rather than on a single state.\n", + "For now, we simply stipulate that in each state s, the agent receives a finite reward _R(s)_.\n", + "\n", + "For any given state, the actions the agent can take are encoded as given below:\n", + "- Move Up: (0, 1)\n", + "- Move Down: (0, -1)\n", + "- Move Left: (-1, 0)\n", + "- Move Right: (1, 0)\n", + "- Do nothing: `None`\n", + "\n", + "We now wonder what a valid solution to the problem might look like. \n", + "We cannot have fixed action sequences as the environment is stochastic and we can eventually end up in an undesirable state.\n", + "Therefore, a solution must specify what the agent shoulddo for _any_ state the agent might reach.\n", + "
\n", + "Such a solution is known as a **policy** and is usually denoted by **π**.\n", + "
\n", + "The **optimal policy** is the policy that yields the highest expected utility an is usually denoted by **π* **.\n", + "
\n", + "The `GridMDP` class has a useful method `to_arrows` that outputs a grid showing the direction the agent should move, given a policy.\n", + "We will use this later to better understand the properties of the environment." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + " Codestin Search App\n", + " \n", + " \n", + "\n", + "\n", + "

\n", + "\n", + "
    def to_arrows(self, policy):\n",
+       "        chars = {\n",
+       "            (1, 0): '>', (0, 1): '^', (-1, 0): '<', (0, -1): 'v', None: '.'}\n",
+       "        return self.to_grid({s: chars[a] for (s, a) in policy.items()})\n",
+       "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "psource(GridMDP.to_arrows)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This method directly encodes the actions that the agent can take (described above) to characters representing arrows and shows it in a grid format for human visalization purposes. \n", + "It converts the received policy from a `dictionary` to a grid using the `to_grid` method." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + " Codestin Search App\n", + " \n", + " \n", + "\n", + "\n", + "

\n", + "\n", + "
    def to_grid(self, mapping):\n",
+       "        """Convert a mapping from (x, y) to v into a [[..., v, ...]] grid."""\n",
+       "        return list(reversed([[mapping.get((x, y), None)\n",
+       "                               for x in range(self.cols)]\n",
+       "                              for y in range(self.rows)]))\n",
+       "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "psource(GridMDP.to_grid)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have all the tools required and a good understanding of the agent and the environment, we consider some cases and see how the agent should behave for each case." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Case 1\n", + "---\n", + "R(s) = -0.04 in all states except terminal states" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# Note that this environment is also initialized in mdp.py by default\n", + "sequential_decision_environment = GridMDP([[-0.04, -0.04, -0.04, +1],\n", + " [-0.04, None, -0.04, -1],\n", + " [-0.04, -0.04, -0.04, -0.04]],\n", + " terminals=[(3, 2), (3, 1)])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will use the `best_policy` function to find the best policy for this environment.\n", + "But, as you can see, `best_policy` requires a utility function as well.\n", + "We already know that the utility function can be found by `value_iteration`.\n", + "Hence, our best policy is:" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "pi = best_policy(sequential_decision_environment, value_iteration(sequential_decision_environment, .001))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now use the `to_arrows` method to see how our agent should pick its actions in the environment." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "> > > .\n", + "^ None ^ .\n", + "^ > ^ <\n" + ] + } + ], + "source": [ + "from utils import print_table\n", + "print_table(sequential_decision_environment.to_arrows(pi))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is exactly the output we expected\n", + "
\n", + "![title](images/-0.04.jpg)\n", + "
\n", + "Notice that, because the cost of taking a step is fairly small compared with the penalty for ending up in `(4, 2)` by accident, the optimal policy is conservative. \n", + "In state `(3, 1)` it recommends taking the long way round, rather than taking the shorter way and risking getting a large negative reward of -1 in `(4, 2)`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Case 2\n", + "---\n", + "R(s) = -0.4 in all states except in terminal states" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "sequential_decision_environment = GridMDP([[-0.4, -0.4, -0.4, +1],\n", + " [-0.4, None, -0.4, -1],\n", + " [-0.4, -0.4, -0.4, -0.4]],\n", + " terminals=[(3, 2), (3, 1)])" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "> > > .\n", + "^ None ^ .\n", + "^ > ^ <\n" + ] + } + ], + "source": [ + "pi = best_policy(sequential_decision_environment, value_iteration(sequential_decision_environment, .001))\n", + "from utils import print_table\n", + "print_table(sequential_decision_environment.to_arrows(pi))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is exactly the output we expected\n", + "![title](images/-0.4.jpg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As the reward for each state is now more negative, life is certainly more unpleasant.\n", + "The agent takes the shortest route to the +1 state and is willing to risk falling into the -1 state by accident." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Case 3\n", + "---\n", + "R(s) = -4 in all states except terminal states" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "sequential_decision_environment = GridMDP([[-4, -4, -4, +1],\n", + " [-4, None, -4, -1],\n", + " [-4, -4, -4, -4]],\n", + " terminals=[(3, 2), (3, 1)])" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "> > > .\n", + "^ None > .\n", + "> > > ^\n" + ] + } + ], + "source": [ + "pi = best_policy(sequential_decision_environment, value_iteration(sequential_decision_environment, .001))\n", + "from utils import print_table\n", + "print_table(sequential_decision_environment.to_arrows(pi))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is exactly the output we expected\n", + "![title](images/-4.jpg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The living reward for each state is now more negative than the most negative terminal. Life is so painful that the agent heads for the nearest exit as even the worst exit is less painful than the current state." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Case 4\n", + "---\n", + "R(s) = 4 in all states except terminal states" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "sequential_decision_environment = GridMDP([[4, 4, 4, +1],\n", + " [4, None, 4, -1],\n", + " [4, 4, 4, 4]],\n", + " terminals=[(3, 2), (3, 1)])" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "> > < .\n", + "> None < .\n", + "> > > v\n" + ] + } + ], + "source": [ + "pi = best_policy(sequential_decision_environment, value_iteration(sequential_decision_environment, .001))\n", + "from utils import print_table\n", + "print_table(sequential_decision_environment.to_arrows(pi))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this case, the output we expect is\n", + "![title](images/4.jpg)\n", + "
\n", + "As life is positively enjoyable and the agent avoids _both_ exits.\n", + "Even though the output we get is not exactly what we want, it is definitely not wrong.\n", + "The scenario here requires the agent to anything but reach a terminal state, as this is the only way the agent can maximize its reward (total reward tends to infinity), and the program does just that.\n", + "
\n", + "Currently, the GridMDP class doesn't support an explicit marker for a \"do whatever you like\" action or a \"don't care\" condition.\n", + "You can however, extend the class to do so.\n", + "
\n", + "For in-depth knowledge about sequential decision problems, refer **Section 17.1** in the AIMA book." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Appendix\n", + "\n", + "Surprisingly, it turns out that there are six other optimal policies for various ranges of R(s). \n", + "You can try to find them out for yourself.\n", + "See **Exercise 17.5**.\n", + "To help you with this, we have a GridMDP editor in `grid_mdp.py` in the GUI folder. \n", + "
\n", + "Here's a brief tutorial about how to use it\n", + "
\n", + "Let us use it to solve `Case 2` above\n", + "1. Run `python gui/grid_mdp.py` from the master directory.\n", + "2. Enter the dimensions of the grid (3 x 4 in this case), and click on `'Build a GridMDP'`\n", + "3. Click on `Initialize` in the `Edit` menu.\n", + "4. Set the reward as -0.4 and click `Apply`. Exit the dialog. \n", + "![title](images/ge0.jpg)\n", + "
\n", + "5. Select cell (1, 1) and check the `Wall` radio button. `Apply` and exit the dialog.\n", + "![title](images/ge1.jpg)\n", + "
\n", + "6. Select cells (4, 1) and (4, 2) and check the `Terminal` radio button for both. Set the rewards appropriately and click on `Apply`. Exit the dialog. Your window should look something like this.\n", + "![title](images/ge2.jpg)\n", + "
\n", + "7. You are all set up now. Click on `Build and Run` in the `Build` menu and watch the heatmap calculate the utility function.\n", + "![title](images/ge4.jpg)\n", + "
\n", + "Green shades indicate positive utilities and brown shades indicate negative utilities. \n", + "The values of the utility function and arrow diagram will pop up in separate dialogs after the algorithm converges." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] } ], "metadata": { diff --git a/tests/test_mdp.py b/tests/test_mdp.py index 9117a32d9..1aed4b58f 100644 --- a/tests/test_mdp.py +++ b/tests/test_mdp.py @@ -70,14 +70,6 @@ def test_policy_iteration(): (2, 1): (1, 0), (2, 2): (1, 0), (3, 0): (0, 1), (3, 1): None, (3, 2): None} - assert policy_iteration(sequential_decision_environment_3) == { - (0, 0): (-1, 0), (0, 1): (0, -1), (0, 2): (0, -1), (0, 3): (0, -1), (0, 4): None, - (1, 0): (-1, 0), (1, 1): (-1, 0), (1, 4): (1, 0), - (2, 0): (-1, 0), (2, 1): (0, -1), (2, 2): None, (2, 4): (1, 0), - (3, 0): (-1, 0), (3, 2): None, (3, 3): (1, 0), (3, 4): (1, 0), - (4, 0): (-1, 0), (4, 3): (1, 0), (4, 4): (1, 0), - (5, 0): None, (5, 1): (0, 1), (5, 2): (0, 1), (5, 3): (0, 1), (5, 4): (1, 0)} - def test_best_policy(): pi = best_policy(sequential_decision_environment,