
Commit d1f162b

ad71 authored and norvig committed
Enhanced mdp_apps notebook (#782)
* Added pathfinding example
* Added images
1 parent 7e763e6 commit d1f162b

File tree

3 files changed (+166 -27 lines)


images/maze.png (4.47 KB)

images/mdp-d.png (20.8 KB)

mdp_apps.ipynb

Lines changed: 166 additions & 27 deletions
@@ -31,7 +31,8 @@
 " - State dependent reward function\n",
 " - State and action dependent reward function\n",
 " - State, action and next state dependent reward function\n",
-"\n",
+"- Grid MDP\n",
+" - Pathfinding problem\n",
 "\n",
 "## SIMPLE MDP\n",
 "---\n",
@@ -221,7 +222,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"['study', 'pub', 'sleep', 'facebook', 'quit']\n"
+"['quit', 'sleep', 'study', 'pub', 'facebook']\n"
 ]
 }
 ],
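Note on this hunk and the similar output-only hunks below: the new list holds the same five actions in a different order, which is what re-running the notebook under an interpreter with a different set/dict iteration order produces (plausibly because the actions live in an unordered set; the notebook's container is not shown in this diff). A sketch of how such output could be pinned down, using the action names from the output above:

```python
# Hypothetical sketch: iterating over a Python set gives no fixed
# order, so the printed list can differ between notebook runs;
# sorting a copy makes the output stable.
actions = {'study', 'pub', 'sleep', 'facebook', 'quit'}
print(sorted(actions))  # ['facebook', 'pub', 'quit', 'sleep', 'study']
```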
@@ -294,7 +295,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"{'class3': 'pub', 'leisure': 'quit', 'class2': 'study', 'class1': 'study', 'end': None}\n"
+"{'class2': 'sleep', 'class3': 'pub', 'end': None, 'class1': 'study', 'leisure': 'quit'}\n"
 ]
 }
 ],
@@ -318,7 +319,7 @@
 "data": {
 "text/plain": [
 "{'class1': 'study',\n",
-" 'class2': 'study',\n",
+" 'class2': 'sleep',\n",
 " 'class3': 'pub',\n",
 " 'end': None,\n",
 " 'leisure': 'quit'}"
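The policy dict in this hunk comes from the notebook's `best_policy(mdp, value_iteration(mdp))` pipeline applied to the student MDP defined in earlier cells (not part of this diff). For a self-contained illustration, the same pipeline runs unchanged on aima-python's built-in 4x3 world, `sequential_decision_environment`; a minimal sketch:

```python
# Minimal sketch of the policy pipeline used throughout the notebook,
# demonstrated on the library's built-in 4x3 grid world rather than
# the notebook's custom student MDP (which is defined in earlier cells).
from mdp import sequential_decision_environment, value_iteration, best_policy

env = sequential_decision_environment
pi = best_policy(env, value_iteration(env, .01))
print(pi)  # maps each (x, y) state to its best action vector, e.g. (1, 0)
```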
@@ -668,7 +669,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"['study', 'pub', 'sleep', 'facebook', 'quit']\n"
+"['quit', 'sleep', 'study', 'pub', 'facebook']\n"
 ]
 }
 ],
@@ -769,7 +770,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"{'class3': 'study', 'leisure': 'quit', 'class2': 'sleep', 'class1': 'facebook', 'end': None}\n"
+"{'class2': 'sleep', 'class3': 'study', 'end': None, 'class1': 'facebook', 'leisure': 'quit'}\n"
 ]
 }
 ],
@@ -832,46 +833,43 @@
 "We have the following transition probability matrices:\n",
 "<br>\n",
 "<br>\n",
-"Action 1: Cruising streets\n",
-"<br>\n",
+"Action 1: Cruising streets \n",
 "<br>\n",
-"$$\\\\\n",
+"$\\\\\n",
 " P^{1} = \n",
 " \\left[ {\\begin{array}{ccc}\n",
 " \\frac{1}{2} & \\frac{1}{4} & \\frac{1}{4} \\\\\n",
 " \\frac{1}{2} & 0 & \\frac{1}{2} \\\\\n",
 " \\frac{1}{4} & \\frac{1}{4} & \\frac{1}{2} \\\\\n",
 " \\end{array}}\\right] \\\\\n",
 " \\\\\n",
-"$$\n",
+" $\n",
 "<br>\n",
 "<br>\n",
-"Action 2: Waiting at the taxi stand \n",
+"Action 2: Waiting at the taxi stand  \n",
 "<br>\n",
-"<br>\n",
-"$$\\\\\n",
+"$\\\\\n",
 " P^{2} = \n",
 " \\left[ {\\begin{array}{ccc}\n",
 " \\frac{1}{16} & \\frac{3}{4} & \\frac{3}{16} \\\\\n",
 " \\frac{1}{16} & \\frac{7}{8} & \\frac{1}{16} \\\\\n",
 " \\frac{1}{8} & \\frac{3}{4} & \\frac{1}{8} \\\\\n",
 " \\end{array}}\\right] \\\\\n",
 " \\\\\n",
-"$$\n",
+" $\n",
 "<br>\n",
 "<br>\n",
 "Action 3: Waiting for dispatch \n",
 "<br>\n",
-"<br>\n",
-"$$\\\\\n",
+"$\\\\\n",
 " P^{3} =\n",
 " \\left[ {\\begin{array}{ccc}\n",
 " \\frac{1}{4} & \\frac{1}{8} & \\frac{5}{8} \\\\\n",
 " 0 & 1 & 0 \\\\\n",
 " \\frac{3}{4} & \\frac{1}{16} & \\frac{3}{16} \\\\\n",
 " \\end{array}}\\right] \\\\\n",
 " \\\\\n",
-"$$\n",
+" $\n",
 "<br>\n",
 "<br>\n",
 "For the sake of readability, we will call the states A, B and C and the actions 'cruise', 'stand' and 'dispatch'.\n",
@@ -914,44 +912,41 @@
 "<br>\n",
 "Action 1: Cruising streets \n",
 "<br>\n",
-"<br>\n",
-"$$\\\\\n",
+"$\\\\\n",
 " R^{1} = \n",
 " \\left[ {\\begin{array}{ccc}\n",
 " 10 & 4 & 8 \\\\\n",
 " 14 & 0 & 18 \\\\\n",
 " 10 & 2 & 8 \\\\\n",
 " \\end{array}}\\right] \\\\\n",
 " \\\\\n",
-"$$\n",
+" $\n",
 "<br>\n",
 "<br>\n",
 "Action 2: Waiting at the taxi stand \n",
 "<br>\n",
-"<br>\n",
-"$$\\\\\n",
+"$\\\\\n",
 " R^{2} = \n",
 " \\left[ {\\begin{array}{ccc}\n",
 " 8 & 2 & 4 \\\\\n",
 " 8 & 16 & 8 \\\\\n",
 " 6 & 4 & 2\\\\\n",
 " \\end{array}}\\right] \\\\\n",
 " \\\\\n",
-"$$\n",
+" $\n",
 "<br>\n",
 "<br>\n",
 "Action 3: Waiting for dispatch \n",
 "<br>\n",
-"<br>\n",
-"$$\\\\\n",
+"$\\\\\n",
 " R^{3} = \n",
 " \\left[ {\\begin{array}{ccc}\n",
 " 4 & 6 & 4 \\\\\n",
 " 0 & 0 & 0 \\\\\n",
 " 4 & 0 & 8\\\\\n",
 " \\end{array}}\\right] \\\\\n",
 " \\\\\n",
-"$$\n",
+" $\n",
 "<br>\n",
 "<br>\n",
 "We now build the reward model as a dictionary using these matrices."
@@ -1194,7 +1189,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"['cruise', 'dispatch', 'stand']\n"
+"['stand', 'dispatch', 'cruise']\n"
 ]
 }
 ],
@@ -1290,6 +1285,150 @@
 "We have successfully adapted the existing code to a different scenario yet again.\n",
 "The takeaway from this section is that you can convert the vast majority of reinforcement learning problems into MDPs and solve for the best policy using simple yet efficient tools."
 ]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## GRID MDP\n",
+"---\n",
+"### Pathfinding Problem\n",
+"Markov Decision Processes can be used to find the best path through a maze. Let us consider this simple maze.\n",
+"![title](images/maze.png)\n",
+"\n",
+"This environment can be formulated as a GridMDP.\n",
+"<br>\n",
+"To make the grid matrix, we will consider the state-reward to be -0.1 for every state.\n",
+"<br>\n",
+"State (1, 1) will have a reward of -5 to signify that this state is to be prohibited.\n",
+"<br>\n",
+"State (9, 9) will have a reward of +5.\n",
+"This will be the terminal state.\n",
+"<br>\n",
+"The matrix can be generated using the GridMDP editor or we can write it ourselves."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 35,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"grid = [\n",
+" [None, None, None, None, None, None, None, None, None, None, None], \n",
+" [None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None, +5.0, None], \n",
+" [None, -0.1, None, None, None, None, None, None, None, -0.1, None], \n",
+" [None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None], \n",
+" [None, -0.1, None, None, None, None, None, None, None, None, None], \n",
+" [None, -0.1, None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None], \n",
+" [None, -0.1, None, None, None, None, None, -0.1, None, -0.1, None], \n",
+" [None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None, -0.1, None], \n",
+" [None, None, None, None, None, -0.1, None, -0.1, None, -0.1, None], \n",
+" [None, -5.0, -0.1, -0.1, -0.1, -0.1, None, -0.1, None, -0.1, None], \n",
+" [None, None, None, None, None, None, None, None, None, None, None]\n",
+"]"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We have only one terminal state, (9, 9)."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 36,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"terminals = [(9, 9)]"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We define our maze environment below."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 37,
+"metadata": {},
+"outputs": [],
+"source": [
+"maze = GridMDP(grid, terminals)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"To solve the maze, we can use the `best_policy` function along with `value_iteration`."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 38,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"pi = best_policy(maze, value_iteration(maze))"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"This is the heatmap generated by the GridMDP editor using `value_iteration` on this environment.\n",
+"<br>\n",
+"![title](images/mdp-d.png)\n",
+"<br>\n",
+"Let's print out the best policy."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 39,
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"None None None None None None None None None None None\n",
+"None v < < < < < < None . None\n",
+"None v None None None None None None None ^ None\n",
+"None > > > > > > > > ^ None\n",
+"None ^ None None None None None None None None None\n",
+"None ^ None > > > > v < < None\n",
+"None ^ None None None None None v None ^ None\n",
+"None ^ < < < < < < None ^ None\n",
+"None None None None None ^ None ^ None ^ None\n",
+"None > > > > ^ None ^ None ^ None\n",
+"None None None None None None None None None None None\n"
+]
+}
+],
+"source": [
+"from utils import print_table\n",
+"print_table(maze.to_arrows(pi))"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"As you can infer, we can find the path to the terminal state starting from any given state using this policy.\n",
+"All maze problems can be solved by formulating them as MDPs."
+]
 }
 ],
 "metadata": {
