
Enhanced explanation of value iteration #736


Merged
merged 2 commits into from
Feb 23, 2018
159 changes: 149 additions & 10 deletions mdp.ipynb
@@ -59,7 +59,7 @@
"source": [
"## MDP\n",
"\n",
"To begin with let us look at the implementation of MDP class defined in mdp.py The docstring tells us what all is required to define a MDP namely - set of states,actions, initial state, transition model, and a reward function. Each of these are implemented as methods. Do not close the popup so that you can follow along the description of code below."
"To begin with let us look at the implementation of MDP class defined in mdp.py The docstring tells us what all is required to define a MDP namely - set of states, actions, initial state, transition model, and a reward function. Each of these are implemented as methods. Do not close the popup so that you can follow along the description of code below."
]
},
{
@@ -336,7 +336,7 @@
"source": [
"## GRID MDP\n",
"\n",
"Now we look at a concrete implementation that makes use of the MDP as base class. The GridMDP class in the mdp module is used to represent a grid world MDP like the one shown in in **Fig 17.1** of the AIMA Book. The code should be easy to understand if you have gone through the CustomMDP example."
"Now we look at a concrete implementation that makes use of the MDP as base class. The GridMDP class in the mdp module is used to represent a grid world MDP like the one shown in in **Fig 17.1** of the AIMA Book. We assume for now that the environment is _fully observable_, so that the agent always knows where it is. The code should be easy to understand if you have gone through the CustomMDP example."
]
},
{
@@ -551,25 +551,164 @@
"\n",
"Now that we have looked how to represent MDPs. Let's aim at solving them. Our ultimate goal is to obtain an optimal policy. We start with looking at Value Iteration and a visualisation that should help us understanding it better.\n",
"\n",
"We start by calculating Value/Utility for each of the states. The Value of each state is the expected sum of discounted future rewards given we start in that state and follow a particular policy pi.The algorithm Value Iteration (**Fig. 17.4** in the book) relies on finding solutions of the Bellman's Equation. The intuition Value Iteration works is because values propagate. This point will we more clear after we encounter the visualisation. For more information you can refer to **Section 17.2** of the book. \n"
"We start by calculating Value/Utility for each of the states. The Value of each state is the expected sum of discounted future rewards given we start in that state and follow a particular policy _pi_. The value or the utility of a state is given by\n",
"\n",
"$$U(s)=R(s)+\\gamma\\max_{a\\epsilon A(s)}\\sum_{s'} P(s'\\ |\\ s,a)U(s')$$\n",
"\n",
"This is called the Bellman equation. The algorithm Value Iteration (**Fig. 17.4** in the book) relies on finding solutions of this Equation. The intuition Value Iteration works is because values propagate through the state space by means of local updates. This point will we more clear after we encounter the visualisation. For more information you can refer to **Section 17.2** of the book. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"def value_iteration(mdp, epsilon=0.001):\n",
"    \"\"\"Solving an MDP by value iteration. [Figure 17.4]\"\"\"\n",
"    U1 = {s: 0 for s in mdp.states}\n",
"    R, T, gamma = mdp.R, mdp.T, mdp.gamma\n",
"    while True:\n",
"        U = U1.copy()\n",
"        delta = 0\n",
"        for s in mdp.states:\n",
"            U1[s] = R(s) + gamma * max([sum([p * U[s1] for (p, s1) in T(s, a)])\n",
"                                        for a in mdp.actions(s)])\n",
"            delta = max(delta, abs(U1[s] - U[s]))\n",
"        if delta < epsilon * (1 - gamma) / gamma:\n",
"            return U\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"psource(value_iteration)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It takes as inputs two parameters, an MDP to solve and epsilon the maximum error allowed in the utility of any state. It returns a dictionary containing utilities where the keys are the states and values represent utilities. Let us solve the **sequencial_decision_enviornment** GridMDP."
"It takes as inputs two parameters, an MDP to solve and epsilon the maximum error allowed in the utility of any state. It returns a dictionary containing utilities where the keys are the states and values represent utilities. <br> Value Iteration starts with arbitrary initial values for the utilities, calculates the right side of the Bellman equation and plugs it into the left hand side, thereby updating the utility of each state from the utilities of its neighbors. \n",
"This is repeated until equilibrium is reached. \n",
"It works on the principle of _Dynamic Programming_. \n",
"If U_i(s) is the utility value for state _s_ at the _i_ th iteration, the iteration step, called Bellman update, looks like this:\n",
"\n",
"$$ U_{i+1}(s) \\leftarrow R(s) + \\gamma \\max_{a \\epsilon A(s)} \\sum_{s'} P(s'\\ |\\ s,a)U_{i}(s') $$\n",
"\n",
"As you might have noticed, `value_iteration` has an infinite loop. How do we decide when to stop iterating? \n",
"The concept of _contraction_ successfully explains the convergence of value iteration. \n",
"Refer to **Section 17.2.3** of the book for a detailed explanation. \n",
"In the algorithm, we calculate a value _delta_ that measures the difference in the utilities of the current time step and the previous time step. \n",
"\n",
"$$\\delta = \\max{(\\delta, \\begin{vmatrix}U_{i + 1}(s) - U_i(s)\\end{vmatrix})}$$\n",
"\n",
"This value of delta decreases over time.\n",
"We terminate the algorithm if the delta value is less than a threshold value determined by the hyperparameter _epsilon_.\n",
"\n",
"$$\\delta \\lt \\epsilon \\frac{(1 - \\gamma)}{\\gamma}$$\n",
"\n",
"To summarize, the Bellman update is a _contraction_ by a factor of `gamma` on the space of utility vectors. \n",
"Hence, from the properties of contractions in general, it follows that `value_iteration` always converges to a unique solution of the Bellman equations whenever gamma is less than 1.\n",
"We then terminate the algorithm when a reasonable approximation is achieved.\n",
"In practice, it often occurs that the policy _pi_ becomes optimal long before the utility function converges. For the given 4 x 3 environment with _gamma = 0.9_, the policy _pi_ is optimal when _i = 4_, even though the maximum error in the utility function is stil 0.46.This can be clarified from **figure 17.6** in the book. Hence, to increase computational efficiency, we often use another method to solve MDPs called Policy Iteration which we will see in the later part of this notebook. \n",
"<br>For now, let us solve the **sequential_decision_environment** GridMDP using `value_iteration`."
]
},
{