diff --git a/images/maze.png b/images/maze.png new file mode 100644 index 000000000..f3fcd1990 Binary files /dev/null and b/images/maze.png differ diff --git a/images/mdp-d.png b/images/mdp-d.png new file mode 100644 index 000000000..8ba7cf073 Binary files /dev/null and b/images/mdp-d.png differ diff --git a/mdp_apps.ipynb b/mdp_apps.ipynb index 78542e075..50dce5427 100644 --- a/mdp_apps.ipynb +++ b/mdp_apps.ipynb @@ -31,7 +31,8 @@ " - State dependent reward function\n", " - State and action dependent reward function\n", " - State, action and next state dependent reward function\n", - "\n", + "- Grid MDP\n", + " - Pathfinding problem\n", "\n", "## SIMPLE MDP\n", "---\n", @@ -221,7 +222,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "['study', 'pub', 'sleep', 'facebook', 'quit']\n" + "['quit', 'sleep', 'study', 'pub', 'facebook']\n" ] } ], @@ -294,7 +295,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "{'class3': 'pub', 'leisure': 'quit', 'class2': 'study', 'class1': 'study', 'end': None}\n" + "{'class2': 'sleep', 'class3': 'pub', 'end': None, 'class1': 'study', 'leisure': 'quit'}\n" ] } ], @@ -318,7 +319,7 @@ "data": { "text/plain": [ "{'class1': 'study',\n", - " 'class2': 'study',\n", + " 'class2': 'sleep',\n", " 'class3': 'pub',\n", " 'end': None,\n", " 'leisure': 'quit'}" @@ -668,7 +669,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "['study', 'pub', 'sleep', 'facebook', 'quit']\n" + "['quit', 'sleep', 'study', 'pub', 'facebook']\n" ] } ], @@ -769,7 +770,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "{'class3': 'study', 'leisure': 'quit', 'class2': 'sleep', 'class1': 'facebook', 'end': None}\n" + "{'class2': 'sleep', 'class3': 'study', 'end': None, 'class1': 'facebook', 'leisure': 'quit'}\n" ] } ], @@ -832,10 +833,9 @@ "We have the following transition probability matrices:\n", "
\n", "
\n", - "Action 1: Cruising streets\n", - "
\n", + "Action 1: Cruising streets \n", "
\n", - "$$\\\\\n", + "$\\\\\n", " P^{1} = \n", " \\left[ {\\begin{array}{ccc}\n", " \\frac{1}{2} & \\frac{1}{4} & \\frac{1}{4} \\\\\n", @@ -843,13 +843,12 @@ " \\frac{1}{4} & \\frac{1}{4} & \\frac{1}{2} \\\\\n", " \\end{array}}\\right] \\\\\n", " \\\\\n", - "$$\n", + " $\n", "
\n", "
\n", - "Action 2: Waiting at the taxi stand \n", + "Action 2: Waiting at the taxi stand \n", "
\n", - "
\n", - "$$\\\\\n", + "$\\\\\n", " P^{2} = \n", " \\left[ {\\begin{array}{ccc}\n", " \\frac{1}{16} & \\frac{3}{4} & \\frac{3}{16} \\\\\n", @@ -857,13 +856,12 @@ " \\frac{1}{8} & \\frac{3}{4} & \\frac{1}{8} \\\\\n", " \\end{array}}\\right] \\\\\n", " \\\\\n", - "$$\n", + " $\n", "
\n", "
\n", "Action 3: Waiting for dispatch \n", "
\n", - "
\n", - "$$\\\\\n", + "$\\\\\n", " P^{3} =\n", " \\left[ {\\begin{array}{ccc}\n", " \\frac{1}{4} & \\frac{1}{8} & \\frac{5}{8} \\\\\n", @@ -871,7 +869,7 @@ " \\frac{3}{4} & \\frac{1}{16} & \\frac{3}{16} \\\\\n", " \\end{array}}\\right] \\\\\n", " \\\\\n", - "$$\n", + " $\n", "
\n", "
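Before giving the states and actions friendlier names, a quick sanity check can catch transcription mistakes: every row of a transition probability matrix must sum to 1. The snippet below is only a sketch under the assumption that the three matrices above are keyed in as Python lists or NumPy arrays; the names `P1`, `P2`, `P3` are placeholders for illustration, not identifiers from the notebook.

```python
import numpy as np

def is_row_stochastic(matrix, tol=1e-9):
    """Return True if every row of `matrix` sums to 1 within tolerance."""
    rows = np.asarray(matrix, dtype=float)
    return bool(np.allclose(rows.sum(axis=1), 1.0, atol=tol))

# Hypothetical usage once P1, P2, P3 hold the matrices shown above:
# assert all(is_row_stochastic(P) for P in (P1, P2, P3))
```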
\n", "For the sake of readability, we will call the states A, B and C and the actions 'cruise', 'stand' and 'dispatch'.\n", @@ -914,8 +912,7 @@ "
\n", "Action 1: Cruising streets \n", "
\n", - "
\n", - "$$\\\\\n", + "$\\\\\n", " R^{1} = \n", " \\left[ {\\begin{array}{ccc}\n", " 10 & 4 & 8 \\\\\n", @@ -923,13 +920,12 @@ " 10 & 2 & 8 \\\\\n", " \\end{array}}\\right] \\\\\n", " \\\\\n", - "$$\n", + " $\n", "
\n", "
\n", "Action 2: Waiting at the taxi stand \n", "
\n", - "
\n", - "$$\\\\\n", + "$\\\\\n", " R^{2} = \n", " \\left[ {\\begin{array}{ccc}\n", " 8 & 2 & 4 \\\\\n", @@ -937,13 +933,12 @@ " 6 & 4 & 2\\\\\n", " \\end{array}}\\right] \\\\\n", " \\\\\n", - "$$\n", + " $\n", "
\n", "
\n", "Action 3: Waiting for dispatch \n", "
\n", - "
\n", - "$$\\\\\n", + "$\\\\\n", " R^{3} = \n", " \\left[ {\\begin{array}{ccc}\n", " 4 & 6 & 4 \\\\\n", @@ -951,7 +946,7 @@ " 4 & 0 & 8\\\\\n", " \\end{array}}\\right] \\\\\n", " \\\\\n", - "$$\n", + " $\n", "
\n", "
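The cell that actually builds the transition model is not shown here, so the following is only a rough sketch: matrices like $P^{1}$, $P^{2}$, $P^{3}$ can be repacked into the nested-dictionary shape `{state: {action: [(probability, next_state), ...]}}` used by the transition models earlier in this notebook. The zero entries below are placeholders standing in for the matrix values (not all rows are visible above), and the helper name is invented for illustration.

```python
states = ['A', 'B', 'C']

# Placeholders only: copy the entries of P^1, P^2 and P^3 from the tables above.
P = {
    'cruise':   [[0.0] * 3 for _ in range(3)],
    'stand':    [[0.0] * 3 for _ in range(3)],
    'dispatch': [[0.0] * 3 for _ in range(3)],
}

def matrices_to_transition_model(P, states):
    """Repack {action: 3x3 matrix} into {state: {action: [(p, next_state), ...]}}."""
    t = {}
    for i, s in enumerate(states):
        t[s] = {}
        for action, matrix in P.items():
            t[s][action] = [(matrix[i][j], states[j]) for j in range(len(states))]
    return t

transition_model = matrices_to_transition_model(P, states)
```

The reward matrices above can be packed in an analogous way, which is what the next step describes.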
\n", "We now build the reward model as a dictionary using these matrices." @@ -1194,7 +1189,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "['cruise', 'dispatch', 'stand']\n" + "['stand', 'dispatch', 'cruise']\n" ] } ], @@ -1290,6 +1285,150 @@ "We have successfully adapted the existing code to a different scenario yet again.\n", "The takeaway from this section is that you can convert the vast majority of reinforcement learning problems into MDPs and solve for the best policy using simple yet efficient tools." ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## GRID MDP\n", + "---\n", + "### Pathfinding Problem\n", + "Markov Decision Processes can be used to find the best path through a maze. Let us consider this simple maze.\n", + "![title](images/maze.png)\n", + "\n", + "This environment can be formulated as a GridMDP.\n", + "
\n", + "To make the grid matrix, we will set the state-reward to -0.1 for every open square of the maze.\n", + "
\n", + "State (1, 1) will have a reward of -5 to signify that this state should be avoided.\n", + "
\n", + "State (9, 9) will have a reward of +5.\n", + "This will be the terminal state.\n", + "
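The next cell types this reward grid out by hand. As a purely illustrative alternative (the helper and the character-map format below are assumptions, not code from the repository), the same `None`/reward structure could be generated from a compact sketch of the maze in which `#` marks a wall:

```python
def layout_to_grid(layout, step_reward=-0.1, special=None):
    """Convert a list of equal-length strings ('#' = wall, '.' = open square)
    into the grid-of-rewards format GridMDP expects (None for walls).

    `special` maps (x, y) states to rewards that override step_reward,
    using the convention that y counts up from the bottom row."""
    special = special or {}
    height = len(layout)
    grid = []
    for r, row in enumerate(layout):
        y = height - 1 - r
        grid.append([None if ch == '#' else special.get((x, y), step_reward)
                     for x, ch in enumerate(row)])
    return grid

# Hypothetical usage, assuming `maze_rows` holds the maze drawn as strings:
# grid = layout_to_grid(maze_rows, special={(1, 1): -5.0, (9, 9): +5.0})
```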
\n", + "The matrix can be generated using the GridMDP editor or we can write it ourselves." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "grid = [\n", + " [None, None, None, None, None, None, None, None, None, None, None], \n", + " [None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None, +5.0, None], \n", + " [None, -0.1, None, None, None, None, None, None, None, -0.1, None], \n", + " [None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None], \n", + " [None, -0.1, None, None, None, None, None, None, None, None, None], \n", + " [None, -0.1, None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None], \n", + " [None, -0.1, None, None, None, None, None, -0.1, None, -0.1, None], \n", + " [None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None, -0.1, None], \n", + " [None, None, None, None, None, -0.1, None, -0.1, None, -0.1, None], \n", + " [None, -5.0, -0.1, -0.1, -0.1, -0.1, None, -0.1, None, -0.1, None], \n", + " [None, None, None, None, None, None, None, None, None, None, None]\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have only one terminal state, (9, 9)" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "terminals = [(9, 9)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We define our maze environment below" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [], + "source": [ + "maze = GridMDP(grid, terminals)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To solve the maze, we can use the `best_policy` function along with `value_iteration`." + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "pi = best_policy(maze, value_iteration(maze))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is the heatmap generated by the GridMDP editor using `value_iteration` on this environment\n", + "
\n", + "![title](images/mdp-d.png)\n", + "
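Beyond the heatmap, the computed policy can be turned into an explicit walk through the maze. The sketch below assumes the intended move always succeeds, whereas `GridMDP` transitions are actually stochastic; the function name and the chosen start state are illustrative, not part of the notebook.

```python
def trace_path(policy, start, terminal_states, max_steps=200):
    """Follow `policy` greedily from `start`, assuming every intended move
    succeeds, and return the sequence of visited states."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state in terminal_states:
            break
        dx, dy = policy[state]          # the policy maps a state to a direction tuple
        state = (state[0] + dx, state[1] + dy)
        path.append(state)
    return path

# For example, from the penalised square (1, 1) to the goal (9, 9):
print(trace_path(pi, (1, 1), terminals))
```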
\n", + "Let's print out the best policy" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "None None None None None None None None None None None\n", + "None v < < < < < < None . None\n", + "None v None None None None None None None ^ None\n", + "None > > > > > > > > ^ None\n", + "None ^ None None None None None None None None None\n", + "None ^ None > > > > v < < None\n", + "None ^ None None None None None v None ^ None\n", + "None ^ < < < < < < None ^ None\n", + "None None None None None ^ None ^ None ^ None\n", + "None > > > > ^ None ^ None ^ None\n", + "None None None None None None None None None None None\n" + ] + } + ], + "source": [ + "from utils import print_table\n", + "print_table(maze.to_arrows(pi))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can infer, we can find the path to the terminal state starting from any given state using this policy.\n", + "All maze problems can be solved by formulating it as a MDP." + ] } ], "metadata": {