This project aims to implement, in Python, a Bayesian approach to the Inverse Reinforcement Learning (IRL) problem. In the IRL problem, we try to learn the reward function R optimised by an agent called the tutor, given a partially described infinite-horizon MDP ⟨S, A, T, γ⟩ (partially described because the reward function is not given) and observed state-action pairs O = {(s_i, a_i) | i ∈ [M]} (with M ∈ N) realised by the tutor.
- Jeremy B.
- Rithy S.
students in M1 I2D
In this project we study the Inverse Reinforcement Learning (IRL) problem. The goal is to determine the reward function of an agent, here called the tutor.
To solve this problem we are given a partially described infinite-horizon Markov Decision Process (MDP) ⟨S, A, T, γ⟩ where:

- $S$ is the set of states; for the rest of the report a state corresponds to a cell in a grid with coordinates $(x, y)$.
- $A$ is the set of actions that the tutor may perform in any state $s$: $A = \{UP, LEFT, DOWN, RIGHT\}$, each action being encoded by an enum with a value from 0 to 3.
- $T$ is the transition function: $T(s, a, s')$ is the probability of reaching state $s'$ from $s$ when performing action $a$. For the rest of the report we consider $T$ as an array of dimension 5, where for $s = (x, y)$ and $s' = (x', y')$: $T(s, a, s') = T[x, y, a, x', y']$.
- $\gamma = 0.9$ is the discount factor.
Each state contains a set of 5 possible features: Treasure, Bomb, Mountain, Mud, Water. They will be encoded in a binary
vector.
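As an illustration, here is a minimal sketch of how a cell's features could be encoded as a 0-1 vector; the helper name and the use of the feature order listed above are our own assumptions, not necessarily those of the actual implementation:

```python
import numpy as np

# Assumed feature order: Treasure, Bomb, Mountain, Mud, Water (indices 0..4).
FEATURES = ["Treasure", "Bomb", "Mountain", "Mud", "Water"]

def encode_cell(present_features):
    """Return the 0-1 feature vector of a cell, given the features it contains."""
    cell = np.zeros(len(FEATURES))
    for name in present_features:
        cell[FEATURES.index(name)] = 1.0
    return cell

# Example: a cell containing Mud and Water.
print(encode_cell(["Mud", "Water"]))  # -> [0. 0. 0. 1. 1.]
```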
Finally, the reward function is represented by a reward vector of size 5, one value per feature.
In this project we try to determine this reward vector by using the policy iteration algorithm, which computes the best policy for the MDP given a reward. On top of this algorithm we use a Markov Chain Monte Carlo sampling method to return a reward vector that has a high probability of being the true reward vector used by the tutor, given the observations. These algorithms and methods are explained in more detail in the third part of the report. Before that, we first present the structure of the program used for this project.
Description:

- A `GridWorld` is primarily a 3D array modelling a map: $(x, y) \to cell$, where a cell is simply a 0-1 vector of size 5 (the number of features) such that $cell_i = 1 \iff \text{the cell contains the feature } \phi_i$. We added a list of `illegal_positions` in order to block some cells from being accessible. (A map example is provided in the appendix.)
- An `MDP` represents a partial MDP (i.e. with no reward function).
- A `Tutor` has starting coordinates ($s_0 = (x, y)$) on a Markov decision process and can hold observations $O$.
- The previous class contains the inner class `MAction`, which answers question 5 by providing a set $O$ of $M$ state-action pairs.
- The classes `PolicyIteration`, `PolicyWalk` & `PolicyWalkSimulatedAnnealing` inherit from `Algo`, an abstract class imposing that any object of the class implements the method `run()` (a minimal sketch of these abstract classes is given after this list).
- `Probability` is an abstract class imposing that any object of the class implements the function `prob_value()`, which returns a certain probability.
- The previous class is extended by `Likelihood` & `Prior`. The latter is extended by the different kinds of priors: `Uniform`, `Gaussian`, `Beta` & `Mixture`.
- A file named `utils` contains utility methods used by several classes.
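As an illustration, a minimal sketch of the two abstract base classes described above; the exact signatures in the project may differ:

```python
from abc import ABC, abstractmethod

class Algo(ABC):
    """Every algorithm (PolicyIteration, PolicyWalk, ...) must implement run()."""

    @abstractmethod
    def run(self):
        """Execute the algorithm and return its result."""

class Probability(ABC):
    """Every probability object (Likelihood, the priors, ...) must implement prob_value()."""

    @abstractmethod
    def prob_value(self, reward):
        """Return the probability associated with the given reward vector."""
```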
Several basic tests were written to ensure the program works correctly.
Regarding other tests, the performance tests (question 9) are built using a `Pipeline` class (which inherits from `Algo`) and live inside a test class. An object of this class takes as input a set of values for the parameter whose influence on performance we want to measure. The `Pipeline` runs the Policy Walk algorithm for each element of the given set and returns a reward for each element, with which we can evaluate the performance.
The performance of a reward vector is computed using a 0-1 policy loss function: given a reward, we build the best policy with respect to it, then compare, for each observation, the action suggested by this policy with the action taken by the tutor. Each disagreement adds a penalty of 1, so the best reward is the one whose score is equal or close to 0.
(Note that since the tutor may take a non-optimal action, a score of exactly 0 is most likely impossible.)
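A minimal sketch of this 0-1 policy loss, assuming a hypothetical `policy_iteration(mdp, reward)` helper that returns a mapping from each state to its optimal action:

```python
def zero_one_policy_loss(reward, mdp, observations, policy_iteration):
    """Count how many observed tutor actions differ from the best policy for `reward`.

    `observations` is a list of (state, action) pairs; `policy_iteration` is a
    function returning, for the given MDP and reward, a dict state -> optimal action.
    """
    policy = policy_iteration(mdp, reward)          # best policy w.r.t. the candidate reward
    return sum(1 for state, action in observations  # one penalty point per disagreement
               if policy[state] != action)
```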
In this part we explain the different tests and ideas we had about the parameters that could influence the performance of the process. Unless stated otherwise, we used the following default map with its features:
Other default parameters:

- The reward used to generate the first observation, and thus the true reward to find, is $[1, -1, -0.2, -0.4, -0.6]$ (we consider that getting the treasure is a good event, getting the bomb is the worst event, and that traversing any kind of terrain is a bad event).
- The initial state is cell $D7$.
- The observation contains $20$ pairs $(state, action)$, so the performance score lies between $0$ and $20$.
- The maximum number of iterations in the Policy Walk algorithm is $500$ (except for the performance tests).
- The priors used are `Uniform` or `Gaussian`.
- We suppose that the tutor may choose a non-optimal action with probability $0.05$.
- The step size used for the Policy Walk algorithm is $0.1$; this parameter is used to discretise the reward space.
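For reference, this default configuration can be gathered in a single dictionary; the key names below are illustrative, not the actual parameter names used in the code:

```python
DEFAULT_CONFIG = {
    # True reward to recover, in the feature order Treasure, Bomb, Mountain, Mud, Water.
    "true_reward": [1, -1, -0.2, -0.4, -0.6],
    "initial_state": "D7",      # starting cell of the tutor
    "n_observations": 20,       # number of (state, action) pairs -> score in [0, 20]
    "max_iterations": 500,      # Policy Walk iterations (reduced for performance tests)
    "priors": ["Uniform", "Gaussian"],
    "p_non_optimal": 0.05,      # probability that the tutor picks a non-optimal action
    "step_size": 0.1,           # discretisation step of the reward space
    "gamma": 0.9,               # discount factor
}
```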
In the following sections we present plots showing the performance of a reward, depending on the parameter tested, using the 0-1 policy loss function presented earlier.
(The plots are obtained by running the different Pipeline classes in ./perf/.)
How can the maximum number of iterations in the Policy Walk algorithm impact the performance?
The goal of the algorithm is to output a reward vector that has a high posterior probability given a set of observations. To this end, the algorithm walks in the reward space: at each iteration it visits one neighbour and decides whether this neighbour is a "good" reward based on its posterior probability compared to that of the current reward.
The maximum number of iterations therefore limits the number of neighbours visited during the walk. A small number of iterations means that we may not converge to the true reward vector. However, a high number of iterations may also output a reward vector that diverges from the true one. This can be due to a lack of information in the observations: if the tutor never gets near some features, the algorithm lacks information about the tutor's preference for these features, so the reward output for them may be any element of the reward space.
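A minimal sketch of the acceptance step of this walk (the actual Policy Walk algorithm also re-runs policy iteration when the optimal policy changes), assuming hypothetical `posterior(reward)` and `random_neighbor(reward, step)` helpers:

```python
import random

def policy_walk_step(current_reward, posterior, random_neighbor, step=0.1):
    """One iteration of the walk: propose a neighbour and accept it with
    probability min(1, posterior(neighbour) / posterior(current))."""
    candidate = random_neighbor(current_reward, step)        # move one grid point in the reward space
    ratio = posterior(candidate) / posterior(current_reward)
    if random.random() < min(1.0, ratio):
        return candidate                                     # accept the neighbour
    return current_reward                                    # keep the current reward
```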
In the plot obtained by running the file `tests/perf/PipelineNbIteration.py`, we can see that the mean policy loss scores (over 5 different seeds) stay roughly constant, between 10 and 12. Thus, this parameter does not seem to have a significant impact.
How can the Prior used in the Policy Walk algorithm impact the performance?
In the algorithm, to decide whether the current reward should be replaced by one of its neighbours, we compute the ratio of their posterior probabilities in the reward space. To do so we use Bayes' rule and compute the ratio of the products of the prior and the likelihood. The prior can therefore affect this computation and change the final result: we may replace the current reward more or less often and thus converge at a different rate.
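Written out with Bayes' rule (the evidence $P(O)$ cancels in the ratio), comparing a neighbour $R'$ with the current reward $R$ amounts to computing:
$$
\frac{P(R'|O)}{P(R|O)} = \frac{P(O|R')\,P(R')}{P(O|R)\,P(R)}
$$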
For these tests we used the following two priors: Uniform & Gaussian.
There does not seem to be a clear difference in performance between the two priors. For both, the score is proportional to the number of observations: the policy loss score relates to the worst possible score (i.e. the number of observations) roughly as
$$ policyLoss_M \approx 0.75 \times worstScore_M $$
where $M$ is the number of observations.
It can be noted that the Uniform prior seems to give a better score slightly more often.
How can the size of the grid impact the performance?
Changing the size of the grid may affect the performance by giving the tutor more space and more choices when moving. With more space to move in, the tutor may also make mistakes more often, since the probability of taking a non-optimal action applies at every step. Note that we only considered square grids and that the number of occurrences of each feature is the same as in the default map, though their positions are random.
Since we use a different grid, the MDP changes, so it is necessary to generate a new set of observations with the `MAction` algorithm of the `Tutor` class. Indeed, it does not make sense to compare the actions suggested by a policy in one MDP with the observation set of another MDP. The Pipeline here therefore also returns the set of observations used to compute the performance of the reward vector.
For the Policy Walk algorithm we limited the number of iterations to 100 to reduce execution time. We then obtained the following plot (the mean over 5 seeds):
We can see that increasing the size of the map seems to give slightly better performance.
How can the number of features in the map impact the performance?
If the number of features is high, we can suppose that at some point the tutor will have to choose whether to cross a state containing a given feature or to avoid it. This choice gives us more information about how the tutor values the different features, so the observations should help more during the execution of the Policy Walk algorithm.
In the previous test, we simply increased the size of the map, keeping the same number of features as the default map.
Thus, there were a lot of cells not containing any feature, so we did not get much information concerning the true
reward.
In this test we use the default map size and vary only the number of features.
Note that, as in the previous test, since we change the map used, the MDP is also different. Thus, for each MDP we use a new observation set.
In our implementation we simply consider that every feature appears the same number of times (a setting of 10 features means that each feature appears 10 times). When we generate the MDP, the features are placed randomly, as for question 3.
Recall that we use the default map size, which contains 64 cells.
When the number of features is 0, the score is almost 0. This is consistent with the fact that policy iteration simply stops after the first iteration, since there is no feature and nothing to improve; the small residual score comes from the fact that the tutor may take an action that is "not optimal" with respect to his reward (with no features, every action gives the same reward, so there is no "optimal" action in this case).
When the number of features is low, we get a higher score, around 8, because the tutor may never get near some features. The observations then do not give us enough information for our process, which causes the differences in actions.
We can remark that as the number of features increases, we get better scores. However, when the map is almost completely filled with features, we get scores similar to when there are no features. The reasoning is the same: if all cells contain all features, then the policy iteration algorithm stops after the first iteration, as there is nothing to improve.
Thus, setting a number of features that is neither too low nor too high should allow us to get more meaningful information.
How can the starting position impact the performance?
As with the previous test on the number of features, changing the starting position may give us more information about the tutor's preferences for some features.
Indeed, if the tutor starts its exploration surrounded by cells containing features, we should get more information than if it starts surrounded by empty cells.
We used the file PipelineStartingPosition for this test, with the following starting positions:
It seems that when the tutor starts near a feature of positive reward, we obtain a better score (Starting position =
When surrounded by cells containing features of negative reward, we tend to get a worse score (Starting position =
When surrounded by cells containing no features, the score varies but usually lies between those of the previous two cases. (Starting position =
We would need to run the test many more times to be more certain, but it seems that starting right next to features of negative reward does not help to find the true reward.
In this part we will discuss other approaches and methods to solve the problem.
We can try to improve the process by using a modified version of the Policy Walk algorithm. In the Policy Walk algorithm, we decide whether to replace the current reward vector by computing the following ratio:
$$
\frac{P(R'|O)}{P(R|O)}
$$
where $R$ is the current reward vector, $R'$ one of its neighbours, and $O$ the observation set. In the simulated annealing version, we instead compute:
$$
(\frac{P(R'|O)}{P(R|O)})^{\frac{1}{T_i}}
$$
where $T_i$ is the temperature at iteration $i$.
The goal is to move to a neighbour more often at the beginning of the walk (when $i$ is small): this may allow us to reach a reward with high posterior probability more quickly. It may also allow us to visit reward vectors $R'$ that the normal algorithm would not visit when the neighbours of $R'$ have low posterior probability.
When approaching the last iterations, we switch to a neighbour only when the ratio is high, which guarantees that the new reward vector is as good as or better than the current one.
We only tested an exponential function (similar to $f(x) = \exp(-x)$) for our temperature function.
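A minimal sketch of this modified acceptance step, assuming a hypothetical `posterior(reward)` helper and an exponential temperature schedule; the start and end temperatures are arbitrary assumptions:

```python
import math
import random

def annealed_step(current_reward, candidate, posterior, iteration, n_iterations,
                  t_start=10.0, t_end=0.1):
    """Accept the candidate with probability min(1, ratio ** (1 / T_i)).

    T_i follows an exponential schedule (similar to f(x) = exp(-x)): it starts at
    t_start (neighbours are accepted easily) and decays towards t_end (a neighbour
    is accepted only when its posterior is clearly better)."""
    decay = math.log(t_start / t_end) / max(1, n_iterations)
    temperature = t_start * math.exp(-decay * iteration)
    ratio = posterior(candidate) / posterior(current_reward)
    if random.random() < min(1.0, ratio ** (1.0 / temperature)):
        return candidate
    return current_reward
```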
Compared with the previous Policy Walk results, we obtain almost the same results, but this version gives higher policy loss scores, meaning that this method is not as successful as expected.
In this part we consider that the tutor can only give one demonstration on a map of our choice.
A simple idea to find the true reward vector is to use the policy walk algorithm multiple times:
First iteration:
- We choose a random reward vector for the beginning of the Policy Walk algorithm.
- We then get a reward vector with a high posterior probability.
- We store this reward vector in a list of reward vectors.
Starting from the second iteration, until a maximum number of iterations is reached:
- We start the Policy Walk algorithm with the reward vector of the previous iteration (which should have a high posterior probability).
- We store the resulting reward vector.
After all the iterations are done:
- From the list of reward vectors, we select and return the one that had the highest posterior probability (a small sketch of this restart loop is given below).
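A minimal sketch of this restart loop, assuming a hypothetical `run_policy_walk(start_reward)` function that returns a high-posterior reward and a `posterior(reward)` helper:

```python
import random

def repeated_policy_walk(run_policy_walk, posterior, n_restarts, reward_dim=5, step=0.1):
    """Chain several Policy Walk runs, each starting from the previous result,
    and return the stored reward with the highest posterior probability."""
    # First iteration: start from a random reward vector on the discretised grid.
    current = [round(random.uniform(-1, 1) / step) * step for _ in range(reward_dim)]
    results = []
    for _ in range(n_restarts):
        current = run_policy_walk(current)   # reward with high posterior probability
        results.append(current)
    return max(results, key=posterior)       # best stored reward
```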
Another idea is to take a linear programming approach:
We want to get a reward vector that satisfies a set of constraints derived from the tutor's observed behaviour.
We design a map that allows us to see precisely the tutor's preferences with regard to the different features, for example by creating paths containing only one feature each.
With this type of configuration, and considering that the tutor should maximise his reward, he should take only one of the three paths to advance further in the map (the one that maximises his reward).
Example of such path:
In a similar manner, we can create blocks of cells containing a particular feature: if the tutor decides to walk on these cells, it means the feature is positive for him; if the tutor avoids them, it means the feature is negative for him.
These preferences put constraints on the different values of the reward vector. For example, if the tutor takes the water path, then we know that the reward value for water is higher than the one for mud or for the bomb.
Whether we use linear programming methods or the Policy Walk algorithm to solve the problem, these constraints reduce the reward space and thus the number of reward vectors to consider.
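As an illustration of the linear programming idea, here is a minimal sketch using `scipy.optimize.linprog` with constraints of the kind described above; the particular constraints and the margin value are hypothetical and depend on the paths actually taken by the tutor:

```python
import numpy as np
from scipy.optimize import linprog

# Reward indices: 0 = Treasure, 1 = Bomb, 2 = Mountain, 3 = Mud, 4 = Water.
eps = 0.05  # arbitrary margin turning strict preferences into "<=" constraints

# Example constraints (A_ub @ r <= b_ub) derived from hypothetical observations:
#   the tutor took the water path   -> r_water > r_mud and r_water > r_mountain
#   the tutor picked up the treasure -> r_treasure > 0
A_ub = np.array([
    [0, 0, 0, 1, -1],    # r_mud      - r_water <= -eps
    [0, 0, 1, 0, -1],    # r_mountain - r_water <= -eps
    [-1, 0, 0, 0, 0],    # -r_treasure           <= -eps
])
b_ub = np.array([-eps, -eps, -eps])

# Any feasible point will do, so the objective is the zero vector.
res = linprog(c=np.zeros(5), A_ub=A_ub, b_ub=b_ub, bounds=[(-1, 1)] * 5)
print(res.x)  # a reward vector compatible with the observed preferences
```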
In this part we suppose that we can ask the tutor for multiple demonstrations.
If we suppose that each demonstration can be done on a different map and that we can limit the number of movements $M$, then another way to put constraints on the reward value of each feature is to design maps where the tutor can only move in a limited way within these $M$ movements.
Example of such maps:
Here, to change its position the tutor can only move LEFT or RIGHT; thus, if the tutor moves optimally, he should go towards whatever maximises his reward.
If the tutor moves UP or DOWN instead, then getting nothing is better for him than getting the reward of the Treasure or of the Bomb (the starting position has no feature, thus giving a reward of 0).
We can run multiple demonstrations in the same map to make sure that the tutor did not make a mistake when he moved.
The goal is to find out the preferences of the tutor. In this example the tutor has to choose between the following lotteries: $L1$ (getting the treasure), $L2$ (getting the bomb) and $L3$ (getting nothing), where $L_i > L_j$ means that the tutor prefers $L_i$ to $L_j$:
- If the tutor goes to the treasure most of the time (i.e. more than half the time), we can say that the tutor prefers getting the treasure over getting the bomb, and that getting the treasure is better than getting nothing. Thus $L1 > L2$, $L1 > L3$ and $r_{Treasure} > 0$.
- If the tutor goes LEFT half of the time and RIGHT the other half, then the tutor is indifferent between getting the treasure and getting the bomb, and getting either is better than getting nothing. Thus $L1 = L2$, $r_{Treasure} > 0$ and $r_{Bomb} > 0$ (otherwise the tutor would have chosen to go UP or DOWN and get a reward of $0$).
- If the tutor goes to the bomb most of the time, we can say that the tutor prefers getting the bomb over getting the treasure, and that getting the bomb is better than getting nothing. Thus $L2 > L1$, $L2 > L3$ and $r_{Bomb} > 0$.
- If the tutor goes UP or DOWN most of the time, it means that getting the treasure or the bomb is worse than getting nothing. Thus $L3 > L1$, $L3 > L2$, $r_{Treasure} < 0$ and $r_{Bomb} < 0$.
In that case we may design a map where the tutor has no choice but to choose between getting the bomb and getting the treasure, which brings us back to one of the previous three cases.
If we get the third case, then we can use the following map to determine the preference of the tutor for the two features:
In this map the tutor can face only two lotteries:
In this map the tutor has no choice but to pick the best feature to maximise his reward. Thus, we will get the preference between the two features.
We then use this kind of map to obtain all the preferences of the tutor; this allows us to reduce the reward space and accelerate the Policy Walk algorithm described earlier, or to use linear programming methods.
If we have a clear preference between two features, we may also constrain the relative values of their rewards.
If we suppose that we can only use one map, but that we can control the number of movements $M$ and the starting position, then:
- We set $M$ to 1 and request a demonstration from each cell of the map. This allows us to recover the policy $\pi$ of the tutor.
- We repeat the demonstration on each cell multiple times to be sure that the tutor did not make a mistake.
- After obtaining the policy of the tutor, the goal is to find a reward whose associated policy is as close as possible to the tutor's. Instead of using the 0-1 policy loss function to evaluate a reward as before, we count the number of differences between $\pi$ and the policy obtained from our reward with the policy iteration algorithm (a small sketch of this comparison is given after this list).
- We then look for the reward whose policy has the smallest number of differences with $\pi$ (ideally, a reward $R^*$ whose policy is exactly $\pi$).
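A minimal sketch of this policy comparison, assuming both policies are represented as dicts mapping each cell to an action:

```python
def policy_distance(tutor_policy, candidate_policy):
    """Number of cells where the candidate policy disagrees with the tutor's policy pi."""
    return sum(1 for state, action in tutor_policy.items()
               if candidate_policy.get(state) != action)
```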
Note that, as with the previous idea, we can infer the tutor's preferences towards some features. Since a cell may contain multiple features, we have to be careful when formulating these preferences.
We use a rejection sampling approach:
For each cell, we repeatedly sample the tutor's action and keep the action observed most often as the policy action for that cell, rejecting the rarer (presumably non-optimal) ones.
Using this method we can recover the policy of the tutor. However, it needs a high number of samples for each cell, which may not be feasible in a real-life setting.
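Under this reading, a minimal sketch of the per-cell sampling, assuming a hypothetical `sample_tutor_action(cell)` function that asks for one demonstration starting from that cell:

```python
from collections import Counter

def estimate_policy(cells, sample_tutor_action, n_samples=100):
    """For each cell, sample the tutor's action many times and keep the most
    frequent one; rare non-optimal actions are effectively rejected."""
    policy = {}
    for cell in cells:
        counts = Counter(sample_tutor_action(cell) for _ in range(n_samples))
        policy[cell] = counts.most_common(1)[0][0]
    return policy
```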
Through this project, we studied the Inverse Reinforcement Learning problem, which consists in finding the reward vector that is consistent with the observations given to us.
To solve it, we made use of decision-making models such as MDPs, and of algorithms such as Policy Iteration and Policy Walk, the latter being a Markov Chain Monte Carlo algorithm.
| Q. | Done | Location | Method / Section |
|---|---|---|---|
| 1 | X | mdp.py, gridworld.py | class MDP |
| 2 | X | gridworld.py | __init__(default=True) |
| 3 | X | mdp.py, gridworld.py | __init__(grid_size_x, grid_size_y, default=False) |
| 4 | X | policyIteration.py | run |
| 5 | X | tutor.py | run_explicit |
| 6 | X | probDist.py, utils.py | compute_ratio |
| 7 | X | policyWalk.py | run_highest_aposteriori_reward |
| 8 | X | policyWalkSimulatedAnnealing.py | run_highest_aposteriori_reward |
| 9 | X | Report, tests/perf/* | Algorithms & tests |
| 10 | X | Report | To go further |
| 11 | X | Report | To go further |
Default map encoding example :
H,1 H,2 H,3 m H,4 H,5 H,6 m H,7 H,8
G,1 ~ G,2 ~ G,3 m~ G,4 • G,5 G,6 m G,7 G,8
F,1 ~ F,2 ~ F,3 ~ F,4 F,5 F,6 F,7 • F,8
E,1 E,2 E,3 E,4 E,5 E,6 E,7 E,8
D,1 D,2 D,3 M D,4 M D,5 M D,6 D,7 D,8
C,1 C,2 C,3 •M C,4 M C,5 M C,6 C,7 C,8
B,1 B,2 B,3 m B,4 B,5 B,6 m B,7 B,8
A,1 * A,2 A,3 m A,4 A,5 A,6 m A,7 A,8
vvv A1 vvv vvv A8 vvv
[[[1. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 1. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 1. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 1. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 1. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 1.] [0. 0. 0. 0. 1.] [0. 0. 0. 0. 1.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.]]
[[0. 0. 0. 1. 0.] [0. 0. 0. 1. 0.] [0. 0. 0. 1. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 1. 0. 0. 0.] [0. 0. 0. 0. 0.]]
[[0. 0. 0. 1. 0.] [0. 0. 0. 1. 0.] [0. 0. 1. 1. 0.] [0. 1. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 1. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 1. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 1. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.]]]
^^^ H1 ^^^ ^^^ H8 ^^^
Legend:
* : Treasure
• : Bomb
m : Mud
M : Mountain
~ : Water
Bernard Michini and Jonathan P. How. Improving the efficiency of Bayesian inverse reinforcement learning. In 2012 IEEE International Conference on Robotics and Automation, pages 3651–3656. IEEE, 2012.
Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In IJCAI, volume 7, pages 2586–2591, 2007.