
ergo_mdp

Ergodic economics simulations using MDP formalisms

This needs a bit of work; it's still broken.

"...Apparently so, but suppose you throw a coin enough times... suppose one day, it lands on its edge."

Legacy of Kain: Soul Reaver II

Episodic MDPs, unlike their non-episodic counterparts, have proven ergodic properties:

Bojun, Huang. "Steady State Analysis of Episodic Reinforcement Learning." Advances in Neural Information Processing Systems 33 (2020).

Peters, Ole. "The ergodicity problem in economics." Nature Physics 15.12 (2019): 1216-1221.

Moldovan, Teodor Mihai, and Pieter Abbeel. "Safe exploration in Markov decision processes." Proceedings of the 29th International Conference on Machine Learning. 2012.

\lim_{T \to \infty} \frac{1}{T}\mathop{\mathbb{E}}\left[\sum_{t = 1}^T R(s_t,a_t)\right] = V_\pi(s_0)

x' = \begin{cases} x + 0.5x, & p = \frac{1}{2} \\ x - 0.4x, & p = \frac{1}{2} \end{cases}
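For intuition, here is a minimal simulation sketch (not code from this repository; it only assumes NumPy) contrasting the ensemble average of this bet with what an individual player experiences over time:

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, n_rounds = 10_000, 100

# Multiplicative factor applied to wealth each round: x -> 1.5x or 0.6x with p = 1/2.
factors = rng.choice([1.5, 0.6], size=(n_players, n_rounds))
wealth = factors.prod(axis=1)  # final wealth of each player, starting from 1

print("ensemble (expected) per-round growth:", 0.5 * 1.5 + 0.5 * 0.6)                    # 1.05 > 1
print("time-average per-round growth:", np.exp(0.5 * np.log(1.5) + 0.5 * np.log(0.6)))   # ~0.949 < 1
print("median final wealth:", np.median(wealth))                                         # collapses towards 0
print("fraction of players below their starting wealth:", (wealth < 1.0).mean())
```

The expected value of the bet grows by 5% per round, yet almost every individual trajectory shrinks; this is exactly the ergodicity problem Peters describes.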


Taleb's take on this

\begin{aligned}
R\left((x,win),null\right) &= 0.5x \\
R\left((x,lose),null\right) &= -0.4x \\
R\left((x,choose),stop\right) &= 0 \\
P\left((x,win)\mid(x,choose),play\right) &= 0.5 \\
P\left((x,lose)\mid(x,choose),play\right) &= 0.5 \\
P\left((x,stopped)\mid(x,choose),stop\right) &= 1 \\
P\left((x+0.5x,choose)\mid(x,win),null\right) &= 1 \\
P\left((x-0.4x,choose)\mid(x,lose),null\right) &= 1
\end{aligned}
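A sketch of how this MDP could be written down in code (hypothetical function names; not necessarily how this repository implements it):

```python
import random

# States are (wealth, phase), with phase in {"choose", "win", "lose", "stopped"}.

def actions(state):
    _, phase = state
    return ["play", "stop"] if phase == "choose" else ["null"]

def step(state, action):
    """Sample (next_state, reward) according to the P and R defined above."""
    x, phase = state
    if phase == "choose" and action == "play":
        return ((x, "win") if random.random() < 0.5 else (x, "lose")), 0.0
    if phase == "choose" and action == "stop":
        return (x, "stopped"), 0.0
    if phase == "win":                      # R((x, win), null) = 0.5x
        return (x + 0.5 * x, "choose"), 0.5 * x
    if phase == "lose":                     # R((x, lose), null) = -0.4x
        return (x - 0.4 * x, "choose"), -0.4 * x
    return state, 0.0                       # "stopped" is absorbing

# Example: always play; 40 environment steps = 20 bets, starting with wealth 1.
state, total_reward = (1.0, "choose"), 0.0
for _ in range(40):
    action = "play" if state[1] == "choose" else "null"
    state, reward = step(state, action)
    total_reward += reward
print(state, total_reward)
```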

I would argue that most MDPs of interest are clearly non-ergodic. An MDP combined with a stochastic policy \pi is ergodic if every deterministic policy induces a Markov Reward Process that is ergodic. Almost all RL algorithms assume ergodicity; Value Iteration, the prime example, simply backs expected rewards up through the state space.
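For reference, a generic tabular value-iteration sketch (assuming a finite MDP given as dictionaries P[s][a] -> list of (prob, next_state, reward); this is not the repository's code):

```python
def value_iteration(P, gamma=0.99, tol=1e-8):
    """Back up expected rewards until the value function stops changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Each backup is an expectation over next states -- this is where
            # low-probability catastrophes get averaged away.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```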

Equivalently, we can say that an agent with a stochastic policy should be able to visit all states. The problem with assuming ergodicity is that it makes agents overoptimistic, as it effectively assumes that all possible errors and runs of bad luck are eventually recoverable. If I train as if ergodicity holds, a 99% chance of losing everything versus a 1% chance of winning big will simply average out, and the agent may well go for the high payout.
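Concretely (the 200x payout is just an illustrative number, not something modelled here): an agent maximising expected wealth evaluates such a gamble as

\mathbb{E}[x'] = 0.99 \cdot 0 + 0.01 \cdot 200x = 2x > x

and takes it, even though repeating it leads to ruin with near certainty.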

To work around the lack of ergodicity, we make absorbing states extremely unrewarding. If you break your little toy helicopter, you get a massive negative reward. The penalty has to be big enough that, given the choice between getting further on average and breaking down every so often, breaking down every so often is considered unacceptable.
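Continuing the step() sketch above (the threshold and penalty are arbitrary illustrative numbers), the usual workaround looks something like this:

```python
CRASH_PENALTY = -1e6   # must dominate anything the agent could gain by risking ruin

def shaped_step(state, action, ruin_threshold=1e-3):
    """Wrap step() so that entering the absorbing 'ruin' region is heavily punished."""
    next_state, reward = step(state, action)
    x, _ = next_state
    if x <= ruin_threshold:                # treat near-zero wealth as a crash
        return (0.0, "stopped"), reward + CRASH_PENALTY
    return next_state, reward
```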

Generally, until now, tinkering with the reward function has been considered enough: the agent learns to avoid those absorbing states, so that, eventually, the ergodic property is reclaimed.

The problem with this approach is that it is not trivial to design these arbitrary reward functions.

[Figure: Percentages of winners and losers]

[Figure: Wealth of winners and losers]

[Figure: Tree]

Well, the model is bonkers. The vast majority of the population goes broke; the probability of being extremely wealthy gets smaller and smaller (though those who stay wealthy get wealthier as things move forward). At the very end, because you cannot subdivide an individual into less than one point and let them hold infinite wealth, the whole wealth model collapses.

So what is the
