Fix MDP class and add POMDP subclass and notebook #781
Merged
14 commits
bd23fb7
Fixed typos and added inline LaTeX
bakerwho 9d5ec3c
Fixed backslash for inline LaTeX
bakerwho d8af22b
Fixed more backslashes
bakerwho a09490d
Merge branch 'master' of https://github.com/aimacode/aima-python into…
bakerwho 548c1cb
generalised MDP class and created POMDP notebook
bakerwho d04e14d
Fixed consistency issues with base MDP class
bakerwho bee037e
Small fix on CustomMDP
bakerwho 97e51cd
Set default args to pass tests
bakerwho 7e763e6
Added TableDrivenAgentProgram tests (#777)
ayushjain19 3fed661
Fixing tests
bakerwho bab969e
Fixed conflicts
bakerwho d778ec0
fixed conflicts
bakerwho 35be708
fixed test_rl
bakerwho 1a1a049
removed redundant code, fixed a comment
bakerwho
@@ -0,0 +1,240 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Partially Observable Markov decision processes (POMDPs)\n", | ||
"\n", | ||
"This Jupyter notebook acts as supporting material for POMDPs, covered in **Chapter 17 Making Complex Decisions** of the book* Artificial Intelligence: A Modern Approach*. We make use of the implementations of POMPDPs in mdp.py module. This notebook has been separated from the notebook `mdp.py` as the topics are considerably more advanced.\n", | ||
"\n", | ||
"**Note that it is essential to work through and understand the mdp.ipynb notebook before diving into this one.**\n", | ||
"\n", | ||
"Let us import everything from the mdp module to get started." | ||
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from mdp import *\n",
"from notebook import psource, pseudocode"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CONTENTS\n",
"\n",
"1. Overview of MDPs\n",
"2. POMDPs - a conceptual outline\n",
"3. POMDPs - a rigorous outline\n",
"4. Value Iteration\n",
" - Value Iteration Visualization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. OVERVIEW\n",
"\n",
"We first review Markov property and MDPs as in [Section 17.1] of the book.\n", | ||
"\n", | ||
"- A stochastic process is said to have the **Markov property**, or to have a **Markovian transition model** if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only on the present state, not on the sequence of events that preceded it.\n", | ||
"\n", | ||
" -- (Source: [Wikipedia](https://en.wikipedia.org/wiki/Markov_property))\n", | ||
"\n", | ||
"A Markov decision process or MDP is defined as:\n", | ||
"- a sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards.\n", | ||
"\n", | ||
"An MDP consists of a set of states (with an initial state $s_0$); a set $A(s)$ of actions\n", | ||
"in each state; a transition model $P(s' | s, a)$; and a reward function $R(s)$.\n", | ||
"\n", | ||
"The MDP seeks to make sequential decisions to occupy states so as to maximise some combination of the reward function $R(s)$.\n", | ||
"\n", | ||
"The characteristic problem of the MDP is hence to identify the optimal policy function $\\pi^*(s)$ that provides the _utility-maximising_ action $a$ to be taken when the current state is $s$.\n", | ||
"\n", | ||
"### Belief vector\n", | ||
"\n", | ||
"**Note**: The book refers to the _belief vector_ as the _belief state_. We use the latter terminology here to retain our ability to refer to the belief vector as a _probability distribution over states_.\n", | ||
"\n", | ||
"The solution of an MDP is subject to certain properties of the problem which are assumed and justified in [Section 17.1]. One critical assumption is that the agent is **fully aware of its current state at all times**.\n", | ||
"\n", | ||
"A tedious (but rewarding, as we will see) way of expressing this is in terms of the **belief vector** $b$ of the agent. The belief vector is a function mapping states to probabilities or certainties of being in those states.\n", | ||
"\n", | ||
"Consider an agent that is fully aware that it is in state $s_i$ in the statespace $(s_1, s_2, ... s_n)$ at the current time.\n", | ||
"\n", | ||
"Its belief vector is the vector $(b(s_1), b(s_2), ... b(s_n))$ given by the function $b(s)$:\n", | ||
"\\begin{align*}\n", | ||
"b(s) &= 0 \\quad \\text{if }s \\neq s_i \\\\ &= 1 \\quad \\text{if } s = s_i\n", | ||
"\\end{align*}\n", | ||
"\n", | ||
"Note that $b(s)$ is a probability distribution that necessarily sums to $1$ over all $s$.\n", | ||
"\n" | ||
] | ||
}, | ||
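{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a quick illustrative sketch (plain Python; the state list and the `one_hot_belief` helper are hypothetical and are not part of `mdp.py`) of such a belief vector for a fully aware agent, represented as a dictionary that places all of the probability mass on the one state the agent knows it occupies."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Hypothetical sketch: the 11 states of the 4x3 grid world, omitting the wall at (2, 2)\n",
"states = [(x, y) for y in range(1, 4) for x in range(1, 5) if (x, y) != (2, 2)]\n",
"\n",
"def one_hot_belief(known_state, states):\n",
"    \"\"\"Belief vector for an agent that is certain it is in known_state.\"\"\"\n",
"    return {s: (1.0 if s == known_state else 0.0) for s in states}\n",
"\n",
"b = one_hot_belief((1, 1), states)\n",
"assert abs(sum(b.values()) - 1.0) < 1e-9  # b is a valid probability distribution\n",
"b"
]
},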
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## 2. POMDPs - a conceptual outline\n",
"\n",
"A POMDP makes only two modifications to the **problem formulation** of an MDP.\n",
"\n",
"- **Belief state** - In the real world, the current state of an agent is often not known with complete certainty. This makes the concept of a belief vector extremely relevant. It allows the agent to represent different degrees of certainty with which it _believes_ it is in each state.\n",
"\n",
"- **Evidence percepts** - In the real world, agents often have certain kinds of evidence, collected from sensors. They can use the probability distribution of observed evidence, conditional on state, to consolidate their information. This is a known distribution $P(e\\ |\\ s)$, where $e$ is a piece of evidence (a percept) and $s$ is the state it is conditional on.\n",
"\n",
"Consider the world we used for the MDP. \n",
"\n",
"\n",
"\n",
"#### Using the belief vector\n",
"An agent beginning at $(1, 1)$ may not be certain that it is indeed in $(1, 1)$. Consider a belief vector $b$ such that:\n",
"\\begin{align*}\n",
" b((1,1)) &= 0.8 \\\\\n",
" b((2,1)) &= 0.1 \\\\\n",
" b((1,2)) &= 0.1 \\\\\n",
" b(s) &= 0 \\quad \\quad \\forall \\text{ other } s\n",
"\\end{align*}\n",
"\n",
"By concatenating the rows horizontally, we can represent this as an 11-dimensional vector (omitting the wall square $(2, 2)$).\n",
"\n",
"Thus, taking $s_1 = (1, 1)$, $s_2 = (2, 1)$, ... $s_{11} = (4,3)$, we have $b$:\n",
"\n",
"$b = (0.8, 0.1, 0, 0, 0.1, 0, 0, 0, 0, 0, 0)$ \n",
"\n",
"This vector fully captures the agent's degree of certainty about its own state.\n",
"\n",
"#### Using evidence\n",
"The evidence observed here could be the number of adjacent 'walls' or 'dead ends' perceived by the agent. We assume that the agent cannot 'orient' the walls - only count them.\n",
"\n",
"In this case, $e$ can take only two values, 1 and 2. This gives $P(e\\ |\\ s)$ as:\n", | ||
"\\begin{align*}\n", | ||
" P(e=2\\ |\\ s) &= \\frac{1}{7} \\quad \\forall \\quad s \\in \\{s_1, s_2, s_4, s_5, s_8, s_9, s_{11}\\}\\\\\n", | ||
" P(e=1\\ |\\ s) &= \\frac{1}{4} \\quad \\forall \\quad s \\in \\{s_3, s_6, s_7, s_{10}\\} \\\\\n", | ||
" P(e\\ |\\ s) &= 0 \\quad \\forall \\quad \\text{ other } s, e\n", | ||
"\\end{align*}\n", | ||
"\n", | ||
"Note that the implications of the evidence on the state must be known **a priori** to the agent. Ways of reliably learning this distribution from percepts are beyond the scope of this notebook." | ||
] | ||
}, | ||
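{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the example concrete, here is a small illustrative sketch (plain Python; these names are not defined in `mdp.py`) of the belief vector and the noise-free wall-counting sensor model above, together with sanity checks that both are valid probability distributions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Illustrative sketch only -- these names are not defined in mdp.py.\n",
"# States s1..s11, obtained by concatenating the rows of the 4x3 grid world.\n",
"states = [(1, 1), (2, 1), (3, 1), (4, 1),   # s1..s4  (bottom row)\n",
"          (1, 2), (3, 2), (4, 2),           # s5..s7  (middle row, wall (2, 2) omitted)\n",
"          (1, 3), (2, 3), (3, 3), (4, 3)]   # s8..s11 (top row)\n",
"\n",
"# The belief vector b from the discussion above\n",
"b = dict(zip(states, [0.8, 0.1, 0, 0, 0.1, 0, 0, 0, 0, 0, 0]))\n",
"assert abs(sum(b.values()) - 1.0) < 1e-9   # b sums to 1 over all states\n",
"\n",
"# Noise-free wall-counting sensor: P(e | s) = 1 for the true wall count of s\n",
"two_wall_states = {states[i] for i in (0, 1, 3, 4, 7, 8, 10)}  # s1, s2, s4, s5, s8, s9, s11\n",
"sensor = {s: ({1: 0.0, 2: 1.0} if s in two_wall_states else {1: 1.0, 2: 0.0})\n",
"          for s in states}\n",
"assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in sensor.values())"
]
},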
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. POMDPs - a rigorous outline\n",
"\n",
"A POMDP is thus a sequential decision problem for a *partially* observable, stochastic environment with a Markovian transition model, a known 'sensor model' for inferring state from observation, and additive rewards. \n",
"\n",
"Practically, a POMDP has the following, which an MDP also has:\n",
"- a set of states, each denoted by $s$\n",
"- a set of actions available in each state, $A(s)$\n",
"- a reward accrued on attaining some state, $R(s)$\n",
"- a transition probability $P(s'\\ |\\ s, a)$ of action $a$ changing the state from $s$ to $s'$\n",
"\n",
"And the following, which an MDP does not:\n",
"- a sensor model $P(e\\ |\\ s)$ on evidence conditional on states\n",
"\n",
"Additionally, because the agent is now uncertain of its current state, it also maintains:\n",
"- a belief vector $b$ representing the certainty of being in each state (as a probability distribution)\n",
"\n",
"\n",
"#### New uncertainties\n",
"\n",
"It is useful to intuitively appreciate the new uncertainties that have arisen in the agent's awareness of its own state.\n",
"\n",
"- At any point, the agent has belief vector $b$, the distribution of its believed likelihood of being in each state $s$.\n",
"- For each of these states $s$ that the agent may **actually** be in, it has some set of actions given by $A(s)$.\n",
"- Each of these actions may transport it to some other state $s'$, assuming an initial state $s$, with probability $P(s'\\ |\\ s, a)$.\n",
"- Once the action is performed, the agent receives a percept $e$. $P(e\\ |\\ s)$ now tells it the chances of having perceived $e$ for each state $s$. The agent must use this information to update its new belief state appropriately.\n",
"\n",
"#### Evolution of the belief vector - the `FORWARD` function\n",
"\n",
"The new belief vector $b'(s')$ after taking an action $a$ from the belief vector $b(s)$ and noting evidence $e$ is:\n",
"$$ b'(s') = \\alpha P(e\\ |\\ s') \\sum_s P(s'\\ | s, a) b(s)$$ \n",
"\n",
"where $\\alpha$ is a normalising constant (so that $b'$ remains a probability distribution).\n",
"\n",
"This equation simply sums, over every possible previous state $s$, the likelihood of moving to $s'$ from $s$ times the initial likelihood of being in $s$; this sum is then weighted by the likelihood that the observed evidence would actually be perceived in the new state $s'$. \n",
"\n",
"This function is represented as `b' = FORWARD(b, a, e)`.\n",
"\n",
"#### Probability distribution of the evolving belief vector\n",
"\n",
"The goal here is to find $P(b'\\ |\\ b, a)$ - the probability that action $a$ transforms belief vector $b$ into belief vector $b'$. The following steps illustrate this.\n",
"\n",
"The probability of observing evidence $e$ when action $a$ is enacted on belief vector $b$ can be distributed over each possible new state $s'$ resulting from it:\n",
"\\begin{align*}\n",
" P(e\\ |\\ b, a) &= \\sum_{s'} P(e\\ |\\ b, a, s') P(s'\\ |\\ b, a) \\\\\n",
" &= \\sum_{s'} P(e\\ |\\ s') P(s'\\ |\\ b, a) \\\\\n",
" &= \\sum_{s'} P(e\\ |\\ s') \\sum_s P(s'\\ |\\ s, a) b(s)\n",
"\\end{align*}\n",
"\n",
"The probability of getting belief vector $b'$ from $b$ by application of action $a$ can thus be summed over all possible evidences $e$:\n",
"\\begin{align*}\n",
" P(b'\\ |\\ b, a) &= \\sum_{e} P(b'\\ |\\ b, a, e) P(e\\ |\\ b, a) \\\\\n",
" &= \\sum_{e} P(b'\\ |\\ b, a, e) \\sum_{s'} P(e\\ |\\ s') \\sum_s P(s'\\ |\\ s, a) b(s)\n",
"\\end{align*}\n",
"\n",
"where $P(b'\\ |\\ b, a, e) = 1$ if $b' = $ `FORWARD(b, a, e)` and $= 0$ otherwise.\n",
"\n",
"Given initial and final belief states $b$ and $b'$, the transition probability still depends on the action $a$ and the observed evidence $e$: some belief states may be reachable under certain actions, yet assign non-zero probability to states prohibited by the evidence $e$. The condition above ensures that only valid combinations of $(b', b, a, e)$ are considered.\n",
"\n",
"#### A modified reward space\n",
"\n",
"For MDPs, the reward space was simple - one reward per available state. However, for a belief vector $b(s)$, the expected reward is now:\n",
"$$\\rho(b) = \\sum_s b(s) R(s)$$\n",
"\n",
"Since the belief vector can take infinitely many values (any probability distribution over the states), the expected reward $\\rho(b)$ also varies continuously over the belief space. Because $\\rho(b)$ is a linear combination of the components of $b$, it defines a hyperplane over the belief space."
]
},
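{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `FORWARD` update, the belief-to-belief transition probability $P(b'\\ |\\ b, a)$, and the expected reward $\\rho(b)$ can each be written in a few lines of Python. The sketch below assumes the dictionary representations used earlier, plus a transition model given as a function `T(s, a, s1)` returning $P(s'\\ |\\ s, a)$; it is illustrative only and is not the interface of the `MDP`/`POMDP` classes in `mdp.py`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Illustrative sketch only -- not the mdp.py API. Assumes:\n",
"#   b      : dict  state -> probability            (belief vector)\n",
"#   T      : callable T(s, a, s1) -> P(s1 | s, a)  (transition model)\n",
"#   sensor : dict  state -> {evidence: P(e | s)}   (sensor model)\n",
"\n",
"def forward(b, a, e, T, sensor):\n",
"    \"\"\"b' = FORWARD(b, a, e): update the belief vector after action a and evidence e.\"\"\"\n",
"    unnormalised = {s1: sensor[s1][e] * sum(T(s, a, s1) * b[s] for s in b) for s1 in b}\n",
"    alpha = 1.0 / sum(unnormalised.values())   # fails if e is impossible under b and a\n",
"    return {s1: alpha * p for s1, p in unnormalised.items()}\n",
"\n",
"def belief_transition_prob(b1, b, a, evidences, T, sensor, tol=1e-9):\n",
"    \"\"\"P(b' | b, a): add up P(e | b, a) for every e whose FORWARD update of b equals b'.\"\"\"\n",
"    total = 0.0\n",
"    for e in evidences:\n",
"        p_e = sum(sensor[s1][e] * sum(T(s, a, s1) * b[s] for s in b) for s1 in b)\n",
"        if p_e > 0 and all(abs(forward(b, a, e, T, sensor)[s] - b1[s]) < tol for s in b):\n",
"            total += p_e\n",
"    return total\n",
"\n",
"def expected_reward(b, R):\n",
"    \"\"\"rho(b) = sum over s of b(s) * R(s), with R given as a dict state -> reward.\"\"\"\n",
"    return sum(p * R[s] for s, p in b.items())"
]
},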
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Review comment: No spaces around `=` for keyword arguments.