|
11 | 11 | },
|
12 | 12 | {
|
13 | 13 | "cell_type": "code",
|
14 |
| - "execution_count": 1, |
| 14 | + "execution_count": 172, |
15 | 15 | "metadata": {
|
16 | 16 | "collapsed": true
|
17 | 17 | },
|
|
50 | 50 | },
|
51 | 51 | {
|
52 | 52 | "cell_type": "code",
|
53 |
| - "execution_count": 2, |
| 53 | + "execution_count": 173, |
54 | 54 | "metadata": {
|
55 | 55 | "collapsed": false
|
56 | 56 | },
|
|
87 | 87 | },
|
88 | 88 | {
|
89 | 89 | "cell_type": "code",
|
90 |
| - "execution_count": 3, |
| 90 | + "execution_count": 174, |
91 | 91 | "metadata": {
|
92 | 92 | "collapsed": true
|
93 | 93 | },
|
|
119 | 119 | },
|
120 | 120 | {
|
121 | 121 | "cell_type": "code",
|
122 |
| - "execution_count": 4, |
| 122 | + "execution_count": 175, |
123 | 123 | "metadata": {
|
124 | 124 | "collapsed": false
|
125 | 125 | },
|
|
153 | 153 | },
|
154 | 154 | {
|
155 | 155 | "cell_type": "code",
|
156 |
| - "execution_count": 5, |
| 156 | + "execution_count": 176, |
157 | 157 | "metadata": {
|
158 | 158 | "collapsed": false
|
159 | 159 | },
|
|
181 | 181 | },
|
182 | 182 | {
|
183 | 183 | "cell_type": "code",
|
184 |
| - "execution_count": 6, |
| 184 | + "execution_count": 177, |
185 | 185 | "metadata": {
|
186 | 186 | "collapsed": true
|
187 | 187 | },
|
|
221 | 221 | },
|
222 | 222 | {
|
223 | 223 | "cell_type": "code",
|
224 |
| - "execution_count": 7, |
| 224 | + "execution_count": 178, |
225 | 225 | "metadata": {
|
226 | 226 | "collapsed": false
|
227 | 227 | },
|
228 | 228 | "outputs": [
|
229 | 229 | {
|
230 | 230 | "data": {
|
231 | 231 | "text/plain": [
|
232 |
| - "<mdp.GridMDP at 0x7fcb2826ba58>" |
| 232 | + "<mdp.GridMDP at 0x7fbecc40ebe0>" |
233 | 233 | ]
|
234 | 234 | },
|
235 |
| - "execution_count": 7, |
| 235 | + "execution_count": 178, |
236 | 236 | "metadata": {},
|
237 | 237 | "output_type": "execute_result"
|
238 | 238 | }
|
|
241 | 241 | "sequential_decision_environment"
|
242 | 242 | ]
|
243 | 243 | },
|
| 244 | + { |
| 245 | + "cell_type": "markdown", |
| 246 | + "metadata": { |
| 247 | + "collapsed": true |
| 248 | + }, |
| 249 | + "source": [ |
| 250 | + "# Value Iteration\n", |
| 251 | + "\n", |
| 252 | + "Now that we have looked at how to represent MDPs, let's aim at solving them. Our ultimate goal is to obtain an optimal policy. We start by looking at Value Iteration and a visualisation that should help us understand it better.\n", |
| 253 | + "\n", |
| 254 | + "We start by calculating the Value/Utility of each state. The Value of a state is the expected sum of discounted future rewards, given that we start in that state and follow a particular policy pi. The Value Iteration algorithm (**Fig. 17.4** in the book) relies on finding solutions of the Bellman equation. The intuition behind why Value Iteration works is that values propagate. This point will be clearer after we encounter the visualisation. For more information you can refer to **Section 17.2** of the book.\n" |
| 255 | + ] |
| 256 | + }, |
| 257 | + { |
| 258 | + "cell_type": "code", |
| 259 | + "execution_count": 179, |
| 260 | + "metadata": { |
| 261 | + "collapsed": false |
| 262 | + }, |
| 263 | + "outputs": [], |
| 264 | + "source": [ |
| 265 | + "%psource value_iteration" |
| 266 | + ] |
| 267 | + }, |
| 268 | + { |
| 269 | + "cell_type": "markdown", |
| 270 | + "metadata": {}, |
| 271 | + "source": [ |
| 272 | + "It takes two parameters as input: an MDP to solve, and epsilon, the maximum error allowed in the utility of any state. It returns a dictionary of utilities, where the keys are the states and the values are their utilities. Let us solve the **sequential_decision_environment** GridMDP.\n" |
| 273 | + ] |
| 274 | + }, |
244 | 275 | {
|
245 | 276 | "cell_type": "code",
|
246 |
| - "execution_count": null, |
| 277 | + "execution_count": 180, |
| 278 | + "metadata": { |
| 279 | + "collapsed": false |
| 280 | + }, |
| 281 | + "outputs": [ |
| 282 | + { |
| 283 | + "data": { |
| 284 | + "text/plain": [ |
| 285 | + "{(0, 0): 0.2962883154554812,\n", |
| 286 | + " (0, 1): 0.3984432178350045,\n", |
| 287 | + " (0, 2): 0.5093943765842497,\n", |
| 288 | + " (1, 0): 0.25386699846479516,\n", |
| 289 | + " (1, 2): 0.649585681261095,\n", |
| 290 | + " (2, 0): 0.3447542300124158,\n", |
| 291 | + " (2, 1): 0.48644001739269643,\n", |
| 292 | + " (2, 2): 0.7953620878466678,\n", |
| 293 | + " (3, 0): 0.12987274656746342,\n", |
| 294 | + " (3, 1): -1.0,\n", |
| 295 | + " (3, 2): 1.0}" |
| 296 | + ] |
| 297 | + }, |
| 298 | + "execution_count": 180, |
| 299 | + "metadata": {}, |
| 300 | + "output_type": "execute_result" |
| 301 | + } |
| 302 | + ], |
| 303 | + "source": [ |
| 304 | + "value_iteration(sequential_decision_environment)" |
| 305 | + ] |
| 306 | + }, |
| 307 | + { |
| 308 | + "cell_type": "markdown", |
| 309 | + "metadata": {}, |
| 310 | + "source": [ |
| 311 | + "To illustrate that values propagate out of states, let us create a simple visualisation. We will use a modified version of the value_iteration function that stores U over time. We will also remove the parameter epsilon and instead take the number of iterations we want." |
| 312 | + ] |
| 313 | + }, |
| 314 | + { |
| 315 | + "cell_type": "code", |
| 316 | + "execution_count": 181, |
247 | 317 | "metadata": {
|
248 | 318 | "collapsed": true
|
249 | 319 | },
|
250 | 320 | "outputs": [],
|
251 |
| - "source": [] |
| 321 | + "source": [ |
| 322 | + "def value_iteration_instru(mdp, iterations=20):\n", |
| 323 | + " U_over_time = []\n", |
| 324 | + " U1 = {s: 0 for s in mdp.states}\n", |
| 325 | + " R, T, gamma = mdp.R, mdp.T, mdp.gamma\n", |
| 326 | + " for _ in range(iterations):\n", |
| 327 | + " U = U1.copy()\n", |
| 328 | + " for s in mdp.states:\n", |
| 329 | + " U1[s] = R(s) + gamma * max([sum([p * U[s1] for (p, s1) in T(s, a)])\n", |
| 330 | + " for a in mdp.actions(s)])\n", |
| 331 | + " U_over_time.append(U)\n", |
| 332 | + " return U_over_time" |
| 333 | + ] |
| 334 | + }, |
| 335 | + { |
| 336 | + "cell_type": "markdown", |
| 337 | + "metadata": {}, |
| 338 | + "source": [ |
| 339 | + "Next, we define a function to create the visualisation from the utilities returned by **value_iteration_instru**. The reader need not be concerned with the code that immediately follows, as it is simply the usage of Matplotlib with IPython Widgets. If you are interested in reading more about these, visit [ipywidgets.readthedocs.io](http://ipywidgets.readthedocs.io)" |
| 340 | + ] |
| 341 | + }, |
| 342 | + { |
| 343 | + "cell_type": "code", |
| 344 | + "execution_count": 182, |
| 345 | + "metadata": { |
| 346 | + "collapsed": true |
| 347 | + }, |
| 348 | + "outputs": [], |
| 349 | + "source": [ |
| 350 | + "columns = 4\n", |
| 351 | + "rows = 3\n", |
| 352 | + "U_over_time = value_iteration_instru(sequential_decision_environment)\n", |
| 353 | + " " |
| 354 | + ] |
| 355 | + }, |
| 356 | + { |
| 357 | + "cell_type": "code", |
| 358 | + "execution_count": 183, |
| 359 | + "metadata": { |
| 360 | + "collapsed": false |
| 361 | + }, |
| 362 | + "outputs": [], |
| 363 | + "source": [ |
| 364 | + "%matplotlib inline\n", |
| 365 | + "import matplotlib.pyplot as plt\n", |
| 366 | + "\n", |
| 367 | + "def plot_grid(iteration):\n", |
| 368 | + " data = U_over_time[iteration]\n", |
| 369 | + " grid = []\n", |
| 370 | + " for row in range(rows):\n", |
| 371 | + " current_row = []\n", |
| 372 | + " for column in range(columns):\n", |
| 373 | + " try:\n", |
| 374 | + " current_row.append(data[(column, row)])\n", |
| 375 | + " except KeyError:\n", |
| 376 | + " current_row.append(0)\n", |
| 377 | + " grid.append(current_row)\n", |
| 378 | + " grid.reverse() # output like book\n", |
| 379 | + " fig = plt.matshow(grid, cmap=plt.cm.bwr);\n", |
| 380 | + " plt.axis('off')\n", |
| 381 | + " fig.axes.get_xaxis().set_visible(False)\n", |
| 382 | + " fig.axes.get_yaxis().set_visible(False) " |
| 383 | + ] |
| 384 | + }, |
| 385 | + { |
| 386 | + "cell_type": "code", |
| 387 | + "execution_count": 184, |
| 388 | + "metadata": { |
| 389 | + "collapsed": false, |
| 390 | + "scrolled": true |
| 391 | + }, |
| 392 | + "outputs": [ |
| 393 | + { |
| 394 | + "data": { |
| 395 | + "image/png": "iVBORw0KGgoAAAANSUhEUgAAATgAAADtCAYAAAAr+2lCAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAAzZJREFUeJzt2rENwzAMAEExyP4r0wsE6Qwbj7uSalg9WGh29wAUfZ5e\nAOAuAgdkCRyQJXBAlsABWQIHZH3/Pc4cf0iA19s982vuggOyBA7IEjggS+CALIEDsgQOyBI4IEvg\ngCyBA7IEDsgSOCBL4IAsgQOyBA7IEjggS+CALIEDsgQOyBI4IEvggCyBA7IEDsgSOCBL4IAsgQOy\nBA7IEjggS+CALIEDsgQOyBI4IEvggCyBA7IEDsgSOCBL4IAsgQOyBA7IEjggS+CALIEDsgQOyBI4\nIEvggCyBA7IEDsgSOCBL4IAsgQOyBA7IEjggS+CALIEDsgQOyBI4IEvggCyBA7IEDsgSOCBL4IAs\ngQOyBA7IEjggS+CALIEDsgQOyBI4IEvggCyBA7IEDsgSOCBL4IAsgQOyBA7IEjggS+CALIEDsgQO\nyBI4IEvggCyBA7IEDsgSOCBL4IAsgQOyBA7IEjggS+CALIEDsgQOyBI4IEvggCyBA7IEDsgSOCBL\n4IAsgQOyBA7IEjggS+CALIEDsgQOyBI4IEvggCyBA7IEDsgSOCBL4IAsgQOyBA7IEjggS+CALIED\nsgQOyBI4IEvggCyBA7IEDsgSOCBL4IAsgQOyBA7IEjggS+CALIEDsgQOyBI4IEvggCyBA7IEDsgS\nOCBL4IAsgQOyBA7IEjggS+CALIEDsgQOyBI4IEvggCyBA7IEDsgSOCBL4IAsgQOyBA7IEjggS+CA\nLIEDsgQOyBI4IEvggCyBA7IEDsgSOCBL4IAsgQOyBA7IEjggS+CALIEDsgQOyBI4IEvggCyBA7IE\nDsgSOCBL4IAsgQOyBA7IEjggS+CALIEDsgQOyBI4IEvggCyBA7IEDsgSOCBL4IAsgQOyBA7IEjgg\nS+CALIEDsgQOyBI4IEvggCyBA7IEDsgSOCBL4IAsgQOyBA7IEjggS+CALIEDsgQOyBI4IEvggCyB\nA7IEDsgSOCBL4IAsgQOyBA7IEjggS+CALIEDsgQOyBI4IEvggCyBA7IEDsgSOCBL4IAsgQOyBA7I\nEjggS+CALIEDsgQOyJrdfXoHgFu44IAsgQOyBA7IEjggS+CALIEDsi6WyArVfE1QKgAAAABJRU5E\nrkJggg==\n", |
| 396 | + "text/plain": [ |
| 397 | + "<matplotlib.figure.Figure at 0x7fbea96037f0>" |
| 398 | + ] |
| 399 | + }, |
| 400 | + "metadata": {}, |
| 401 | + "output_type": "display_data" |
| 402 | + } |
| 403 | + ], |
| 404 | + "source": [ |
| 405 | + "import ipywidgets as widgets\n", |
| 406 | + "from IPython.display import display\n", |
| 407 | + "\n", |
| 408 | + "iteration_slider = widgets.IntSlider(min=0, max=15, step=1, value=0)\n", |
| 409 | + "w = widgets.interactive(plot_grid, iteration=iteration_slider)\n", |
| 410 | + "display(w)\n", |
| 411 | + " " |
| 412 | + ] |
| 413 | + }, |
| 414 | + { |
| 415 | + "cell_type": "markdown", |
| 416 | + "metadata": {}, |
| 417 | + "source": [ |
| 418 | + "Move the slider above to observe how the utility changes across iterations." |
| 419 | + ] |
252 | 420 | }
|
253 | 421 | ],
|
254 | 422 | "metadata": {
|
|