DeepMind Reinforcement Learning Overview
Kevin P. Murphy
Contents
1 Introduction 13
1.1 Sequential decision making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.1 Maximum expected utility principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.2 Episodic vs continual tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.3 Universal model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.1.4 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2 Canonical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.1 Partially observed MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.2 Markov decision processes (MDPs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.3 Contextual MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2.4 Contextual bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2.5 Belief state MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.6 Optimization problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.6.1 Best-arm identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.2.6.2 Bayesian optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.2.6.3 Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.2.6.4 Stochastic Gradient Descent (SGD) . . . . . . . . . . . . . . . . . . . . . . . 21
1.3 Reinforcement Learning: a brief overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3.1 Value-based RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.3.2 Policy-based RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.3.3 Model-based RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.3.4 State uncertainty (partial observability) . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.3.4.1 Optimal solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3.4.2 Finite observation history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3.4.3 Stateful (recurrent) policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3.5 Model uncertainty (exploration-exploitation tradeoff) . . . . . . . . . . . . . . . . . . 24
1.3.6 Reward functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.3.6.1 The reward hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.3.6.2 Reward hacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.3.6.3 Sparse reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3.6.4 Reward shaping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3.6.5 Intrinsic reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3.7 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 Value-based RL 29
2.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.1 Value functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.2 Bellman’s equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.3 Example: 1d grid world . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2 Solving for the optimal policy in a known world model . . . . . . . . . . . . . . . . . . . . . . 31
2.2.1 Value iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.2 Real-time dynamic programming (RTDP) . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.3 Policy iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 Value function learning using samples from the world model . . . . . . . . . . . . . . . . . . . 33
2.3.1 Monte Carlo estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.2 Temporal difference (TD) learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.3 Combining TD and MC learning using TD(λ) . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.4 Eligibility traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4 SARSA: on-policy TD policy learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.1 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.2 Sarsa(λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5 Q-learning: off-policy TD policy learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5.1 Tabular Q learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5.2 Q learning with function approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5.2.1 Neural fitted Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5.2.2 DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5.2.3 Experience replay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.2.4 Prioritized experience replay . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.2.5 The deadly triad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.2.6 Target networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.2.7 Gradient TD methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.2.8 Two time-scale methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.2.9 Layer norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.2.10 Other methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.3 Maximization bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.3.1 Double Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.3.2 Double DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.3.3 Randomized ensemble DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.4 DQN extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.4.1 Q learning for continuous actions . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.4.2 Dueling DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.4.3 Noisy nets and exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.4.4 Multi-step DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.4.5 Q(λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.4.6 Rainbow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.4.7 Bigger, Better, Faster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.4.8 Other methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3 Policy-based RL 49
3.1 Policy gradient methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1.1 Likelihood ratio estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1.2 Variance reduction using reward-to-go . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.1.3 REINFORCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.1.4 The policy gradient theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.1.5 Variance reduction using a baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.1.6 REINFORCE with baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2 Actor-critic methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.1 Advantage actor critic (A2C) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.2 Generalized advantage estimation (GAE) . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2.3 Two-time scale actor critic algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.4 Natural policy gradient methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.4.1 Natural gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.4.2 Natural actor critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2.5 Architectural issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2.6 Deterministic policy gradient methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2.6.1 Deterministic policy gradient theorem . . . . . . . . . . . . . . . . . . . . . . 59
3.2.6.2 DDPG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2.6.3 Twin Delayed DDPG (TD3) . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2.6.4 Wasserstein Policy Optimization (WPO) . . . . . . . . . . . . . . . . . . . . 60
3.3 Policy improvement methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3.1 Policy improvement lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3.2 Trust region policy optimization (TRPO) . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3.3 Proximal Policy Optimization (PPO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.4 Variational Maximum a Posteriori Policy Optimization (VMPO) . . . . . . . . . . . . 64
3.4 Off-policy methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4.1 Policy evaluation using importance sampling . . . . . . . . . . . . . . . . . . . . . . . 65
3.4.2 Off-policy actor critic methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4.2.1 Learning the critic using V-trace . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4.2.2 Learning the actor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4.2.3 Example: IMPALA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4.3 Off-policy policy improvement methods . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.5 Gradient-free policy optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.6 RL as inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.6.1 Deterministic case (planning/control as inference) . . . . . . . . . . . . . . . . . . . . 70
3.6.2 Stochastic case (policy learning as variational inference) . . . . . . . . . . . . . . . . . 70
3.6.3 EM control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.6.4 KL control (maximum entropy RL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.6.5 Maximum a Posteriori Policy Optimization (MPO) . . . . . . . . . . . . . . . . . . . . 72
3.6.6 Sequential Monte Carlo Policy Optimisation (SPO) . . . . . . . . . . . . . . . . . . . . 73
3.6.7 AWR and AWAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6.8 Soft Actor Critic (SAC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6.8.1 SAC objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6.8.2 Policy evaluation: tabular case . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6.8.3 Policy evaluation: general case . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.6.8.4 Policy improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.6.8.5 Adjusting the temperature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.6.9 Active inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4 Model-based RL 79
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2 Decision-time (online) planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2.1 Receding horizon control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2.1.1 Forward search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2.1.2 Branch and bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2.1.3 Sparse sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2.1.4 Heuristic search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.2 Monte Carlo tree search (MCTS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.2.1 MCTS for 2p0s games: AlphaGo, AlphaGoZero, and AlphaZero . . . . . . . 83
4.2.2.2 MCTS with learned world model: MuZero and EfficientZero . . . . . . . . . 84
4.2.2.3 MCTS in belief space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2.3 Sequential Monte Carlo (SMC) for online planning . . . . . . . . . . . . . . . . . . . . 85
4.2.4 Model predictive control (MPC), aka open loop planning . . . . . . . . . . . . . . . . 87
4.2.4.1 Suboptimality of open-loop planning for stochastic environments . . . . . . . 87
4.2.4.2 Trajectory optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.4.3 LQR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.4.4 Random shooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.4.5 CEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.4.6 MPPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.4.7 GP-MPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.3 Background (offline) planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.3.1 A game-theoretic perspective on MBRL . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.3.2 Dyna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3.2.1 Tabular Dyna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3.2.2 Dyna with function approximation . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4 World models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.1 World models which are trained to predict observation targets . . . . . . . . . . . . . 93
4.4.1.1 Generative world models without latent variables . . . . . . . . . . . . . . . 93
4.4.1.2 Generative world models with latent variables . . . . . . . . . . . . . . . . . 93
4.4.1.3 Example: Dreamer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.1.4 Example: IRIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4.1.5 Code world models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4.1.6 Partial observation prediction . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4.2 World models which are trained to predict other targets . . . . . . . . . . . . . . . . . 96
4.4.2.1 The objective mismatch problem . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4.2.2 Observation prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4.2.3 Reward prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.4.2.4 Value prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.4.2.5 Policy prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4.2.6 Self prediction (self distillation) . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4.2.7 Avoiding self-prediction collapse using frozen targets . . . . . . . . . . . . . . 99
4.4.2.8 Avoiding self-prediction collapse using regularization . . . . . . . . . . . . . . 100
4.4.2.9 Example: JEPA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.4.2.10 Example: DinoWM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.4.2.11 Example: TD-MPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.4.2.12 Example: BYOL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.4.2.13 Example: Imagination-augmented agents . . . . . . . . . . . . . . . . . . . . 103
4.4.3 World models that are trained to help planning . . . . . . . . . . . . . . . . . . . . . . 103
4.4.4 Dealing with model errors and uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.4.4.1 Avoiding compounding errors in rollouts . . . . . . . . . . . . . . . . . . . . . 103
4.4.4.2 Unified model and planning variational lower bound . . . . . . . . . . . . . . 104
4.4.4.3 Dynamically switching between MFRL and MBRL . . . . . . . . . . . . . . . 104
4.4.5 Exploration for learning world models . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.5 Beyond one-step models: predictive representations . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5.1 General value functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5.2 Successor representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5.3 Successor models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.5.3.1 Learning SMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.5.3.2 Jumpy models using geometric policy composition . . . . . . . . . . . . . . . 109
4.5.4 Successor features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.5.4.1 Generalized policy improvement . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.5.4.2 Option keyboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5.4.3 Learning SFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5.4.4 Choosing the tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.5.5 Forwards-backwards representations: TODO . . . . . . . . . . . . . . . . . . . . . . . 112
5 Multi-agent RL 113
5.1 Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.1.1 Normal-form games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.1.2 Stochastic games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.1.3 Partially observed stochastic games (POSG) . . . . . . . . . . . . . . . . . . . . . . . 115
5.1.3.1 Data generating process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.1.3.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.1.3.3 Single agent perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.1.3.4 Factored Observation Stochastic Games (FOSG) . . . . . . . . . . . . . . . . 117
5.1.4 Extensive form games (EFG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.1.4.1 Example: Kuhn Poker as EFG . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.1.4.2 Converting FOSG to EFG . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2 Solution concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2.1 Notation and definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.2.2 Minimax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.2.3 Exploitability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2.4 Nash equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2.5 Approximate Nash equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2.6 Entropy regularized Nash equilibria (aka Quantal Response Equilibria) . . . . . . . . . 121
5.2.7 Correlated equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.2.8 Limitations of equilibrium solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.2.9 Pareto optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.2.10 Social welfare and fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.2.11 No regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3.1 Central learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3.2 Independent learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3.2.1 Independent Q learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3.2.2 Independent Actor Critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.3.2.3 Independent PPO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.3.2.4 Learning dynamics of multi-agent policy gradient methods . . . . . . . . . . 126
5.3.3 Centralized training of decentralized policies (CTDE) . . . . . . . . . . . . . . . . . . 126
5.3.3.1 Application to Diplomacy (Cicero) . . . . . . . . . . . . . . . . . . . . . . . . 127
5.3.4 Value decomposition methods for common-reward games . . . . . . . . . . . . . . . . . 127
5.3.4.1 Value decomposition network (VDN) . . . . . . . . . . . . . . . . . . . . . . 128
5.3.4.2 QMIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.3.5 Policy learning with self-play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.3.6 Policy learning with learned opponent models . . . . . . . . . . . . . . . . . . . . . . . 128
5.3.7 Best response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3.7.1 Fictitious play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3.7.2 Neural fictitious self play (NFSP) . . . . . . . . . . . . . . . . . . . . . . . . 130
5.3.8 Population-based training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.3.8.1 PSRO (policy space response oracle) . . . . . . . . . . . . . . . . . . . . . . . 130
5.3.8.2 Application to StarCraft (AlphaStar) . . . . . . . . . . . . . . . . . . . . . . 131
5.3.9 Counterfactual Regret Minimization (CFR) . . . . . . . . . . . . . . . . . . . . . . . . 131
5.3.9.1 Tabular case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.3.9.2 Deep CFR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.3.9.3 Applications to Poker and other games . . . . . . . . . . . . . . . . . . . . . 132
5.3.10 Regularized policy gradient methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.3.10.1 Magnetic Mirror Descent (MMD) . . . . . . . . . . . . . . . . . . . . . . . . 133
5.3.10.2 PPO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.3.11 Decision-time planning methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.3.11.1 Magnetic Mirror Descent Search (MMDS) . . . . . . . . . . . . . . . . . . . . 134
5.3.11.2 Belief state approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.3.11.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.3.11.4 Open questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.2.1.1 Bandit case (Gittins indices) . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
7.2.1.2 MDP case (Bayes Adaptive MDPs) . . . . . . . . . . . . . . . . . . . . . . . 154
7.2.2 Thompson sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.2.2.1 Bandit case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.2.2.2 MDP case (posterior sampling RL) . . . . . . . . . . . . . . . . . . . . . . . 155
7.2.3 Upper confidence bounds (UCBs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.2.3.1 Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.2.3.2 Bandit case: Frequentist approach . . . . . . . . . . . . . . . . . . . . . . . . 156
7.2.3.3 Bandit case: Bayesian approach . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.2.3.4 MDP case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.3 Distributional RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.3.1 Quantile regression methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.3.2 Replacing regression with classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.4 Intrinsic reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.4.1 Knowledge-based intrinsic motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.4.2 Goal-based intrinsic motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.4.3 Empowerment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.5 Hierarchical RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.5.1 Feudal (goal-conditioned) HRL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.5.1.1 Hindsight Experience Relabeling (HER) . . . . . . . . . . . . . . . . . . . . . 161
7.5.1.2 Hierarchical HER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.5.1.3 Learning the subgoal space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.5.2 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.5.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.5.2.2 Learning options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
7.6 Imitation learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
7.6.1 Imitation learning by behavior cloning . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
7.6.2 Imitation learning by inverse reinforcement learning . . . . . . . . . . . . . . . . . . . 164
7.6.3 Imitation learning by divergence minimization . . . . . . . . . . . . . . . . . . . . . . 165
7.7 Offline RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.7.1 Offline model-free RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.7.1.1 Policy constraint methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.7.1.2 Behavior-constrained policy gradient methods . . . . . . . . . . . . . . . . . 167
7.7.1.3 Uncertainty penalties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.7.1.4 Conservative Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.7.2 Offline model-based RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.7.3 Offline RL using reward-conditioned sequence modeling . . . . . . . . . . . . . . . . . 169
7.7.4 Offline-to-online methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.8 General RL, AIXI and universal AGI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8 Acknowledgements 173
Chapter 1
Introduction
where s0 is the agent’s initial state, R(st , at ) is the reward function that the agent uses to measure the
value of performing an action in a given state, Vπ (s0 ) is the value function for policy π evaluated at s0 , and
the expectation is wrt
p(a0 , s1 , a1 , . . . , aT , sT |s0 , π) = π(a0 |s0 )penv (o1 |a0 )δ(s1 = U (s0 , a0 , o1 )) (1.2)
× π(a1 |s1 )penv (o2 |a1 , o1 )δ(s2 = U (s1 , a1 , o2 )) (1.3)
× π(a2 |s2 )penv (o3 |a1:2 , o1:2 )δ(s3 = U (s2 , a2 , o3 )) . . . (1.4)
where penv is the environment’s distribution over observations (which is usually unknown).
1 For a list of real-world applications of RL, see e.g., https://bit.ly/42V7dIJ from Csaba Szepesvári (2024), https://bit.ly/3EMMYCW from Vitaly Kurin (2022), and https://github.com/montrealrobotics/DeepRLInTheWorld, which seems to be kept up to date.
Figure 1.1: A small agent interacting with a big external world.
Note that picking a policy to maximize the sum of expected rewards is an instance of the maximum
expected utility principle. (In Section 7.1, we discuss the closely related concept of choosing a policy which
minimizes the regret, which can be thought of as the difference between the expected reward of a reference
policy and that of the agent's policy.) There are various ways to design or learn such an optimal policy,
depending on the assumptions we make about the environment and the form of the agent. We will discuss
some of these options below.
where rt = R(st, at) is the reward, and Gt is the reward-to-go. For episodic tasks that terminate at time T,
we define Gt = 0 for t ≥ T. Clearly, the return satisfies the following recursive relationship:
$$G_t = r_t + \gamma G_{t+1}$$
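To make the recursion concrete, here is a minimal Python sketch (the trajectory and discount factor are made up for illustration) that computes the reward-to-go at every step of a finite episode by sweeping backwards in time, and checks the result against the direct definition of the discounted return.

```python
import numpy as np

def reward_to_go(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for an episodic trajectory,
    with G_T = 0 after termination, by sweeping backwards in time."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rewards = [0.0, 0.0, 1.0, 0.0, 2.0]   # made-up episodic rewards
gamma = 0.9
G = reward_to_go(rewards, gamma)
# Direct definition: G_0 = sum_k gamma^k r_k
assert np.isclose(G[0], sum(gamma**k * r for k, r in enumerate(rewards)))
print(G)
```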
Figure 1.2: Diagram illustrating the interaction of the agent and environment. The agent has internal state zt, and
chooses action at based on its policy πt using at ∼ πt(zt). It then predicts its next internal state, zt+1|t, via the predict
function P, and optionally predicts the resulting observation, ôt+1, via the observation decoder D. The environment has
(hidden) internal state wt, which gets updated by the world model M to give the new state wt+1 ∼ M(wt, at) in response
to the agent's action. The environment also emits an observation ot+1 via its observation model, ot+1 ∼ O(wt+1).
This gets encoded to et+1 = E(ot+1) by the agent's observation encoder E, which the agent uses to update its internal
state using zt+1 = U(zt, at, et+1). The policy is parameterized by θt, and these parameters may be updated (at a slower
time scale) by the RL algorithm A. Square nodes are functions, circles are variables (either random or deterministic).
Dashed square nodes are stochastic functions that take an extra source of randomness (not shown).
The discount factor γ plays two roles. First, it ensures the return is finite even if T = ∞ (i.e., infinite
horizon), provided we use γ < 1 and the rewards rt are bounded. Second, it puts more weight on short-term
rewards, which generally has the effect of encouraging the agent to achieve its goals more quickly. (For
example, if γ = 0.99, then an agent that reaches a terminal reward of 1.0 in 15 steps will receive an expected
discounted reward of 0.99^15 = 0.86, whereas if it takes 17 steps it will only get 0.99^17 = 0.84.) However, if γ is
too small, the agent will become too greedy. In the extreme case where γ = 0, the agent is completely myopic,
and only tries to maximize its immediate reward. In general, the discount factor reflects the assumption
that there is a probability of 1 − γ that the interaction will end at the next step. (If γ = 1 − 1/T, the agent
expects to live on the order of T steps; for example, if each step is 0.1 seconds, then γ = 0.95 corresponds to
2 seconds.) For finite horizon problems, where T is known, we can set γ = 1, since we know the lifetime of
the agent a priori.
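The arithmetic in this paragraph is easy to reproduce. The short snippet below (purely illustrative) prints the discounted value of a terminal reward of 1.0 reached after 15 vs 17 steps, and the effective horizon 1/(1 − γ) for a few discount factors.

```python
gamma = 0.99
for steps in (15, 17):
    print(f"gamma^{steps} = {gamma**steps:.2f}")   # 0.86 vs 0.84

for gamma in (0.9, 0.95, 0.99):
    # An agent that survives each step with probability gamma lives
    # ~1/(1-gamma) steps in expectation.
    horizon = 1.0 / (1.0 - gamma)
    print(f"gamma={gamma}: effective horizon ~ {horizon:.0f} steps")
```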
The environment updates its hidden internal state using $w_{t+1} = M(w_t, a_t, \epsilon^w_t)$, where M is the environment's state transition function (which is usually not known
to the agent) and $\epsilon^w_t$ is random system noise. The agent does not see the world state wt, but instead
sees a potentially noisy and/or partial observation $o_{t+1} = O(w_{t+1}, \epsilon^o_{t+1})$ at each step, where $\epsilon^o_{t+1}$ is random
observation noise. For example, when navigating a maze, the agent may only see what is in front of it, rather
than seeing everything in the world all at once; furthermore, even the current view may be corrupted by
sensor noise. Any given image, such as one containing a door, could correspond to many different locations in
the world (this is called perceptual aliasing), each of which may require a different action.
Thus the agent needs to use these observations to maintain an internal belief state about the world, denoted
by z. This gets updated using the state update function
$$z_{t+1} = SU(z_t, a_t, o_{t+1})$$
In the simplest setting, the internal state zt can just store all the past observations, ht = (o1:t, a1:t−1), but such
non-parametric models can take a lot of time and space to work with, so we will usually consider parametric
approximations. The agent can then pass its state to its policy to pick actions, using at+1 = πt(zt+1).
We can further elaborate the behavior of the agent by breaking the state-update function into two parts.
First the agent predicts its own next state, zt+1|t = P(zt, at), using a prediction function P, and then
it updates this prediction given the observation using the update function U, to give zt+1 = U(zt+1|t, ot+1).
Thus the SU function is defined as the composition of the predict and update functions:
$$SU(z_t, a_t, o_{t+1}) = U(P(z_t, a_t), o_{t+1})$$
If the observations are high dimensional (e.g., images), the agent may choose to encode its observations into
a low-dimensional embedding et+1 using an encoder, et+1 = E(ot+1); this can encourage the agent to focus
on the relevant parts of the sensory signal. In this case, the state update becomes
$$z_{t+1} = U(P(z_t, a_t), E(o_{t+1}))$$
Optionally the agent can also learn to invert this encoder by training a decoder to predict the next observation
using ôt+1 = D(zt+1|t ); this can be a useful training signal, as we will discuss in Chapter 4. Finally, the agent
needs to learn the action policy πt (zt ) = π(zt ; θt ). We can update the policy parameters using a learning
algorithm, denoted
θt = A(o1:t , a1:t , r1:t ) = A(θt−1 , at , zt , rt ) (1.13)
See Figure 1.2 for an illustration.
We see that, in general, there are three interacting stochastic processes we need to deal with: the
environment's states wt (which are usually affected by the agent's actions); the agent's internal states zt (which
reflect its beliefs about the environment based on the observed data); and the agent’s policy parameters θt
(which are updated based on the information stored in the belief state and the external observations).
or structural equation model. This is standard practice in the control theory and causality communities.
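The interaction pattern of Figure 1.2 can be written as a short program. The sketch below is only a schematic rendering of that loop, with placeholder function bodies that are not part of the text: an encoder E, a predictor P, an update function U, and a policy, driven by an environment with hidden state wt.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Environment (hidden state w_t, emits observations o_t) ---
def M(w, a):                      # world-model transition (unknown to the agent)
    return 0.9 * w + a + 0.1 * rng.normal(size=w.shape)

def O(w):                         # observation model (partial, noisy view of w)
    return w[:2] + 0.05 * rng.normal(size=2)

# --- Agent (internal state z_t) ---
def E(o):                         # observation encoder
    return o                      # identity encoder for this sketch

def P(z, a):                      # predict next internal state, z_{t+1|t}
    return z + a

def U(z_pred, e):                 # update prediction with encoded observation
    return 0.5 * z_pred + 0.5 * np.concatenate([e, z_pred[2:]])

def policy(z, theta):             # a_t ~ pi_t(z_t); here a noisy linear policy
    return float(np.tanh(theta @ z) + 0.1 * rng.normal())

# --- Interaction loop ---
w = np.zeros(3)                   # environment's hidden state
z = np.zeros(3)                   # agent's internal (belief) state
theta = rng.normal(size=3)        # policy parameters (updated slowly by algorithm A)

for t in range(5):
    a = policy(z, theta)
    w = M(w, a)                   # environment transitions
    o = O(w)                      # environment emits observation
    z = U(P(z, a), E(o))          # agent's state update: predict, encode, update
    print(t, round(a, 2), z.round(2))
```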
Figure 1.3: Illustration of an MDP as a finite state machine (FSM). The MDP has three discrete states (green
circles), two discrete actions (orange circles), and two non-zero rewards (orange arrows). The numbers on the
black edges represent state transition probabilities, e.g., p(s′ = s0 | s = s1, a = a0) = 0.7; most state transitions
are impossible (probability 0), so the graph is sparse. The numbers on the yellow wiggly edges represent expected
rewards, e.g., R(s = s1, a = a0, s′ = s0) = +5; state transitions with zero reward are not annotated. From
https://en.wikipedia.org/wiki/Markov_decision_process. Used with kind permission of Wikipedia author
waldoalvarez.
Note that we can combine these two distributions to derive the joint world model pW O (wt+1 , ot+1 |wt , at ).
Also, we can use these distributions to derive the environment’s non-Markovian observation distribution,
penv (ot+1 |o1:t , a1:t ), used in Equation (1.4), as follows:
$$p_{\text{env}}(o_{t+1}|o_{1:t}, a_{1:t}) = \sum_{w_{t+1}} p(o_{t+1}|w_{t+1})\, p(w_{t+1}|a_{1:t}) \quad (1.16)$$
$$p(w_{t+1}|a_{1:t}) = \sum_{w_1} \cdots \sum_{w_t} p(w_1|a_1)\, p(w_2|w_1, a_1) \cdots p(w_{t+1}|w_t, a_t) \quad (1.17)$$
If the world model (both p(o|z) and p(z ′ |z, a)) is known, then we can — in principle — solve for the optimal
policy. The method requires that the agent’s internal state correspond to the belief state st = bt = p(wt |ht ),
where ht = (o1:t , a1:t−1 ) is the observation history. The belief state can be updated recursively using Bayes rule.
See Section 1.2.5 for details. The belief state forms a sufficient statistic for the optimal policy. Unfortunately,
computing the belief state and the resulting optimal policy is wildly intractable [PT87; KLC98]. We discuss
some approximate methods in Section 1.3.4.
In lieu of an observation model, we assume the environment (as opposed to the agent) sends out a reward
signal, sampled from pR (rt |st , at , st+1 ). The expected reward is then given by
$$R(s_t, a_t, s_{t+1}) = \sum_r r\, p_R(r|s_t, a_t, s_{t+1}) \quad (1.19)$$
$$R(s_t, a_t) = \sum_{s_{t+1}} p_S(s_{t+1}|s_t, a_t)\, R(s_t, a_t, s_{t+1}) \quad (1.20)$$
Note that the field of control theory uses slightly different terminology and notation when describing the
same setup: the environment is called the plant, the agent is called the controller, states are denoted by
xt ∈ X ⊆ RD , actions are denoted by ut ∈ U ⊆ RK , and rewards are replaced by costs ct ∈ R.
Given a stochastic policy π(at |st ), the agent can interact with the environment over many steps. Each
step is called a transition, and consists of the tuple (st , at , rt , st+1 ), where at ∼ π(·|st ), st+1 ∼ pS (st , at ),
and rt ∼ pR(st, at, st+1). Hence, under policy π, the probability of generating a trajectory of length T,
τ = (s0, a0, r0, s1, a1, r1, s2, . . . , sT), can be written explicitly as
$$p(\tau) = p_0(s_0) \prod_{t=0}^{T-1} \pi(a_t|s_t)\, p_S(s_{t+1}|s_t, a_t)\, p_R(r_t|s_t, a_t, s_{t+1}) \quad (1.21)$$
In general, the state and action sets of an MDP can be discrete or continuous. When both sets are finite,
we can represent these functions as lookup tables; this is known as a tabular representation. In this case,
we can represent the MDP as a finite state machine, which is a graph where nodes correspond to states,
and edges correspond to actions and the resulting rewards and next states. Figure 1.3 gives a simple example
of an MDP with 3 states and 2 actions.
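Concretely, a tabular MDP is just a pair of arrays. The sketch below (with made-up numbers rather than the full Figure 1.3 specification) stores pS(s′|s, a) as a transition tensor of shape (S, A, S) and R(s, a, s′) as a tensor of the same shape, and then samples a few transitions under a uniformly random policy.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2                                # 3 states, 2 actions, as in Figure 1.3

# Transition probabilities p_S(s'|s, a): shape (S, A, S), last axis sums to 1.
P = rng.dirichlet(alpha=np.ones(S), size=(S, A))
# Expected rewards R(s, a, s'): mostly zero, a few non-zero entries.
R = np.zeros((S, A, S))
R[1, 0, 0] = +5.0                          # e.g., R(s=s1, a=a0, s'=s0) = +5

def step(s, a):
    """Sample s' ~ p_S(.|s, a) and return (s', expected reward)."""
    s_next = rng.choice(S, p=P[s, a])
    return s_next, R[s, a, s_next]

s = 0
for t in range(5):
    a = rng.integers(A)                    # uniformly random policy
    s_next, r = step(s, a)
    print(f"t={t}: s={s}, a={a}, r={r}, s'={s_next}")
    s = s_next
```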
If we know the world model pS and pR , and if the state and action space is tabular, then we can solve for
the optimal policy using dynamic programming techniques, as we discuss in Section 2.2. However, typically
the world model is unknown, and the states and actions may need complex nonlinear models to represent
their transitions. In such cases, we will have to use RL methods to learn a good policy.
A special case of a contextual bandit is a regular bandit, in which there is no context, or equivalently, st is
some fixed constant that never changes. When there are a finite number of possible actions, A = {a1 , . . . , aK },
this is called a multi-armed bandit.4 In this case the reward model has the form R(a) = f (wa ), where wa
are the parameters for arm a.
Contextual bandits have many applications. For example, consider an online advertising system. In
this case, the state st represents features of the web page that the user is currently looking at, and the action
at represents the identity of the ad which the system chooses to show. Since the relevance of the ad depends
on the page, the reward function has the form R(st , at ), and hence the problem is contextual. The goal is to
maximize the expected reward, which is equivalent to the expected number of times people click on ads; this
is known as the click through rate or CTR. (See e.g., [Gra+10; Li+10; McM+13; Aga+14; Du+21; YZ22]
for more information about this application.) Another application of contextual bandits arises in clinical
trials [VBW15]. In this case, the state st consists of features of the current patient we are treating, and the action
at is the treatment the doctor chooses to give them (e.g., a new drug or a placebo).
For more details on bandits, see e.g., [LS19; Sli19].
If we can solve this (PO)MDP, we have the optimal solution to the exploration-exploitation problem (see
Section 1.3.5).
As a simple example, consider a context-free Bernoulli bandit, where pR (r|a) = Ber(r|µa ), and
µa = pR (r = 1|a) = R(a) is the expected reward for taking action a. The only unknown parameters are
w = µ1:A . Suppose we use a factored beta prior
$$p_0(w) = \prod_a \text{Beta}(\mu_a \mid \alpha_0^a, \beta_0^a) \quad (1.24)$$
The posterior after observing the data $\mathcal{D}_t$ up to step t is then
$$p(w|\mathcal{D}_t) = \prod_a \text{Beta}(\mu_a \mid \alpha_0^a + N_t^1(a),\ \beta_0^a + N_t^0(a)) \quad (1.25)$$
where
$$N_t^r(a) = \sum_{i=1}^{t-1} \mathbb{I}(a_i = a, r_i = r) \quad (1.26)$$
4 The terminology arises by analogy to a slot machine (sometimes called a “bandit”, because it steals your money) in a casino.
If there are K slot machines, each with different rewards (payout rates), then the agent (player) must explore the different
machines (by pulling the arms) until they have discovered which one is best, and can then stick to exploiting it.
5 Technically speaking, this is a POMDP, where we assume the states are observed, and the parameters are the unknown
hidden random variables. This is in contrast to Section 1.2.1, where the states were not observed, and the parameters were
assumed to be known.
Figure 1.4: Illustration of sequential belief updating for a two-armed beta-Bernoulli bandit. The prior for the reward
for action 1 is the (blue) uniform distribution Beta(1, 1); the prior for the reward for action 2 is the (orange) unimodal
distribution Beta(2, 2). We update the parameters of the belief state based on the chosen action, and based on whether
the observed reward is success (1) or failure (0).
This is illustrated in Figure 1.4 for a two-armed Bernoulli bandit. We can use a similar method for a
Gaussian bandit, where pR (r|a) = N (r|µa , σa2 ).
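The belief updates in Figure 1.4 amount to incrementing Beta counts. The following sketch (the true reward probabilities are made up; the priors match the Beta(1, 1) and Beta(2, 2) used in the figure) maintains (α, β) for each arm of a two-armed Bernoulli bandit and updates them after each observed success or failure.

```python
import numpy as np

rng = np.random.default_rng(0)

true_mu = np.array([0.3, 0.6])      # unknown Bernoulli reward probabilities
alpha = np.array([1.0, 2.0])        # priors: Beta(1,1) for arm 0, Beta(2,2) for arm 1
beta = np.array([1.0, 2.0])

for t in range(100):
    a = rng.integers(2)             # here: explore uniformly at random
    r = rng.random() < true_mu[a]   # success (1) or failure (0)
    # Conjugate update of the belief state for the chosen arm.
    if r:
        alpha[a] += 1
    else:
        beta[a] += 1

posterior_mean = alpha / (alpha + beta)
print("posterior means:", posterior_mean.round(2), "true:", true_mu)
```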
In the case of contextual bandits, the problem is conceptually the same, but becomes more complicated
computationally. If we assume a linear regression bandit, pR (r|s, a; w) = N (r|ϕ(s, a)T w, σ 2 ), we can use
Bayesian linear regression to compute p(w|Dt ) exactly in closed form. If we assume a logistic regression
bandit, pR (r|s, a; w) = Ber(r|σ(ϕ(s, a)T w)), we have to use approximate methods for approximate Bayesian
logistic regression to compute p(w|Dt ). If we have a neural bandit of the form pR (r|s, a; w) = N (r|f (s, a; w))
for some nonlinear function f , then posterior inference is even more challenging (this is equivalent to the
problem of inference in Bayesian neural networks, see e.g., [Arb+23] for a review paper for the offline case,
and [DMKM22; JCM24] for some recent online methods).
We can generalize the above methods to compute the belief state for the parameters of an MDP in the
obvious way, by modeling both the reward function and the state transition function.
Once we have computed the belief state, we can derive a policy with optimal regret using methods
like UCB (Section 7.2.3) or Thompson sampling (Section 7.2.2).
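As a preview of Section 7.2.2, the sketch below combines the Beta-Bernoulli belief state with Thompson sampling: at each step we sample one plausible set of reward probabilities from the posterior and act greedily with respect to that sample. The environment parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([0.3, 0.6])
alpha = np.ones(2)                       # Beta(1,1) priors on each arm
beta = np.ones(2)

for t in range(500):
    mu_sample = rng.beta(alpha, beta)    # sample one plausible world from the belief
    a = int(np.argmax(mu_sample))        # act greedily with respect to the sample
    r = rng.random() < true_mu[a]
    alpha[a] += r                        # conjugate update of the belief state
    beta[a] += 1 - r

print("pulls per arm:", alpha + beta - 2)
print("posterior means:", (alpha / (alpha + beta)).round(2))
```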
6 In the contextual bandit problem, the environment state (context) does change, but not in response to the agent’s actions.
7 The simple regret is the regret incurred at the last step, namely lT. This is the difference between the function value we chose and the true optimum. Minimizing simple
regret results in a problem known as pure exploration [BMS11], where the agent needs to interact with the environment
to learn the underlying MDP; at the end, it can then solve for the resulting policy using planning methods (see Section 2.2).
However, in general RL problems, it is more common to focus on the cumulative regret, also called the total regret or just
the regret, which is defined as $L_T \triangleq \mathbb{E}\left[\sum_{t=1}^{T} l_t\right]$.
1.2.6.1 Best-arm identification
In the standard multi-armed bandit problem our goal is to maximize the sum of expected rewards. However,
in some cases, the goal is to determine the best arm given a fixed budget of T trials; this variant is known as
best-arm identification [ABM10]. Formally, this corresponds to optimizing the final reward criterion:
Vπ,πT = Ep(a1:T ,r1:T |s0 ,π) [R(â)] (1.27)
where â = πT (a1:T , r1:T ) is the estimated optimal arm as computed by the terminal policy πT applied to
the sequence of observations obtained by the exploration policy π. This can be solved by a simple adaptation
of the methods used for standard bandits.
1.2.6.2 Bayesian optimization
In Bayesian optimization, the goal is to find $w^* = \operatorname*{argmax}_{w} R(w)$ for some unknown function R, where $w \in \mathbb{R}^N$, using as few actions (function evaluations of R) as possible.
This is essentially an “infinite arm” version of the best-arm identification problem [Tou14], where we replace
the discrete choice of arms a ∈ {1, . . . , K} with the parameter vector w ∈ RN . In this case, the optimal
policy can be computed if the agent’s state st is a belief state over the unknown function, i.e., st = p(R|ht ).
A common way to represent this distribution is to use Gaussian processes. We can then use heuristics like
expected improvement, knowledge gradient or Thompson sampling to implement the corresponding policy,
wt = π(st ). For details, see e.g., [Gar23].
Approach Method Functions learned On/Off Section
Value-based SARSA Q(s, a) On Section 2.4
Value-based Q-learning Q(s, a) Off Section 2.5
Policy-based REINFORCE π(a|s) On Section 3.1.3
Policy-based A2C π(a|s), V (s) On Section 3.2.1
Policy-based TRPO/PPO π(a|s), Adv(s, a) On Section 3.3.3
Policy-based DDPG a = π(s), Q(s, a) Off Section 3.2.6.2
Policy-based Soft actor-critic π(a|s), Q(s, a) Off Section 3.6.8
Model-based MBRL p(s′ |s, a) Off Chapter 4
Table 1.1: Summary of some popular methods for RL. On/off refers to on-policy vs off-policy methods.
• What does the agent learn? Options include the value function, the policy, the model, or some
combination of the above.
• How does the agent represent its unknown functions? The two main choices are to use non-parametric
or tabular representations, or to use parametric representations based on function approximation. If
these functions are based on neural networks, this approach is called “deep RL”, where the term “deep”
refers to the use of neural networks with many layers.
• How are the actions selected? Options include on-policy methods, where actions must be selected
by the agent's current policy, and off-policy methods, where actions can be selected by any kind of
policy, including human demonstrations.
Table 1.1 lists a few common examples of RL methods, classified along these lines. More details are given
in the subsequent sections.
1.3.1 Value-based RL
In this section, we give a brief introduction to value-based RL, also called Approximate Dynamic
Programming or ADP; see Chapter 2 for more details.
We introduced the value function Vπ (s) in Equation (1.1), which we repeat here for convenience:
"∞ #
X
Vπ (s) ≜ Eπ [G0 |s0 = s] = Eπ t
γ rt |s0 = s (1.29)
t=0
The value function for the optimal policy π∗ is known to satisfy the following recursive condition, known as
Bellman's equation:
$$V^*(s) = \max_a\, R(s, a) + \gamma\, \mathbb{E}_{p_S(s'|s,a)}\left[V^*(s')\right] \quad (1.30)$$
This follows from the principle of dynamic programming, which computes the optimal solution to a
problem (here the value of state s) by combining the optimal solution of various subproblems (here the values
of the next states s′ ). This can be used to derive the following learning rule:
V (s) ← V (s) + η[r + γV (s′ ) − V (s)] (1.31)
where s′ ∼ pS(·|s, a) is the next state sampled from the environment, and r = R(s, a) is the observed reward.
This is called Temporal Difference or TD learning (see Section 2.3.2 for details). Unfortunately, it is not
clear how to derive a policy if all we know is the value function. We now describe a solution to this problem.
We first generalize the notion of a value function to assign a value to a state-action pair, by defining
the Q function as follows:
$$Q_\pi(s, a) \triangleq \mathbb{E}_\pi[G_0 \mid s_0 = s, a_0 = a] = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right] \quad (1.32)$$
This quantity represents the expected return obtained if we start by taking action a in state s, and then
follow π to choose actions thereafter. The Q function for the optimal policy satisfies a modified Bellman
equation
$$Q^*(s, a) = R(s, a) + \gamma\, \mathbb{E}_{p_S(s'|s,a)}\left[\max_{a'} Q^*(s', a')\right] \quad (1.33)$$
This can be used to derive the following Q-learning update:
$$Q(s, a) \leftarrow Q(s, a) + \eta\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right] \quad (1.34)$$
where we sample s′ ∼ pS (·|s, a) from the environment. The action is chosen at each step from the implicit
policy
$$a = \operatorname*{argmax}_{a'} Q(s, a') \quad (1.35)$$
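Putting Equations (1.34) and (1.35) together with ϵ-greedy exploration (Section 1.3.5) gives tabular Q-learning. The sketch below runs it on a small random MDP; the environment, step size, and exploration schedule are all illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eta, eps = 5, 2, 0.95, 0.1, 0.1

# A random tabular MDP to learn on (unknown to the agent).
P = rng.dirichlet(np.ones(S), size=(S, A))        # p_S(s'|s,a)
R = rng.normal(size=(S, A))                       # R(s,a)

Q = np.zeros((S, A))
s = 0
for t in range(50_000):
    # epsilon-greedy behavior policy
    a = rng.integers(A) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next = rng.choice(S, p=P[s, a])
    r = R[s, a]
    # Q-learning update: move Q(s,a) towards r + gamma * max_a' Q(s',a')
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += eta * (td_target - Q[s, a])
    s = s_next

greedy_policy = Q.argmax(axis=1)                  # implicit policy, Eq. (1.35)
print(greedy_policy)
```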
1.3.2 Policy-based RL
In this section we give a brief introduction to policy-based RL; for details see Chapter 3.
In policy-based methods, we try to directly maximize J(πθ) = Ep(s0)[Vπ(s0)] wrt the parameters θ; this
is called policy search. If J(πθ ) is differentiable wrt θ, we can use stochastic gradient ascent to optimize θ,
which is known as policy gradient (see Section 3.1).
Policy gradient methods have the advantage that they provably converge to a local optimum for many
common policy classes, whereas Q-learning may diverge when approximation is used (Section 2.5.2.5). In
addition, policy gradient methods can easily be applied to continuous action spaces, since they do not need
to compute argmaxa Q(s, a). Unfortunately, the score function estimator for ∇θ J(πθ ) can have a very high
variance, so the resulting method can converge slowly.
One way to reduce the variance is to learn an approximate value function, Vw (s), and to use it as a
baseline in the score function estimator. We can learn Vw (s) using TD learning. Alternatively, we can
learn an advantage function, Aw (s, a), and use it as a baseline. These policy gradient variants are called actor
critic methods, where the actor refers to the policy πθ and the critic refers to Vw or Aw . See Section 3.2 for
details.
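As a toy instance of these ideas, the following sketch runs a REINFORCE-style update with a simple moving-average baseline on a two-armed bandit with a softmax policy. Everything here (the environment and the learning rates) is made up for illustration; Chapter 3 gives the full treatment.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([0.2, 0.8])          # expected reward of each action
theta = np.zeros(2)                     # policy logits: pi(a) = softmax(theta)
baseline, lr, lr_b = 0.0, 0.1, 0.05

for t in range(2000):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    a = rng.choice(2, p=probs)
    r = float(rng.random() < true_mu[a])

    # Score-function (likelihood-ratio) gradient of E[r] wrt theta,
    # with a baseline subtracted to reduce variance.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * (r - baseline) * grad_log_pi
    baseline += lr_b * (r - baseline)   # running estimate of the average reward

print("pi(a) =", np.round(probs, 2))    # should put most mass on the better arm
```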
1.3.3 Model-based RL
In this section, we give a brief introduction to model-based RL; for more details, see Chapter 4.
Value-based methods, such as Q-learning, and policy search methods, such as policy gradient, can be very
sample inefficient, which means they may need to interact with the environment many times before finding
a good policy, which can be problematic when real-world interactions are expensive. In model-based RL, we
first learn the MDP, including the pS (s′ |s, a) and R(s, a) functions, and then compute the policy, either using
approximate dynamic programming on the learned model, or doing lookahead search. In practice, we often
interleave the model learning and planning phases, so we can use the partially learned policy to decide what
data to collect, to help learn a better model.
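In the tabular case, a minimal version of this loop is to estimate pS and R from counts and then plan in the learned model with value iteration (Section 2.2.1). The sketch below is illustrative only, using a random ground-truth MDP and uniform exploration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P_true = rng.dirichlet(np.ones(S), size=(S, A))
R_true = rng.normal(size=(S, A))

# 1) Collect experience with a uniform exploration policy and count transitions.
counts = np.ones((S, A, S)) * 1e-3          # small pseudo-counts avoid divide-by-zero
R_sum = np.zeros((S, A))
N_sa = np.zeros((S, A))
s = 0
for t in range(20_000):
    a = rng.integers(A)
    s_next = rng.choice(S, p=P_true[s, a])
    counts[s, a, s_next] += 1
    R_sum[s, a] += R_true[s, a]
    N_sa[s, a] += 1
    s = s_next

P_hat = counts / counts.sum(axis=2, keepdims=True)   # estimated p_S(s'|s,a)
R_hat = R_sum / np.maximum(N_sa, 1)                  # estimated R(s,a)

# 2) Plan in the learned model with value iteration.
V = np.zeros(S)
for _ in range(500):
    V = np.max(R_hat + gamma * P_hat @ V, axis=1)
policy = np.argmax(R_hat + gamma * P_hat @ V, axis=1)
print("greedy policy from learned model:", policy)
```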
1.3.4.1 Optimal solution
If we know the true latent structure of the world (i.e., both p(o|z) and p(z ′ |z, a), to use the notation of
Section 1.1.3), then we can use solution methods designed for POMDPs, discussed in Section 1.2.1. This
requires using Bayesian inference to compute a belief state, bt = p(wt |ht ) (see Section 1.2.5), and then using
this belief state to guide our decisions. However, learning the parameters of a POMDP (i.e., the generative
latent world model) is very difficult, as is recursively computing and updating the belief state, as is computing
the policy given the belief state. Indeed, optimally solving POMDPs is known to be computationally very
difficult for any method [PT87; KLC98]. So in practice simpler approximations are used. We discuss some of
these below. (For more details, see [Mur00].)
Note that it is possible to marginalize out the POMDP latent state wt , to derive a prediction over the
next observable state, p(ot+1 |ht , at ). This can then become a learning target for a model, that is trained to
directly predict future observations, without explicitly invoking the concept of latent state. This is called a
predictive state representation or PSR [LS01]. This is related to the idea of observable operator
models [Jae00], and to the concept of successor representations which we discuss in Section 4.5.2.
We can add exploration to this by sometimes picking some other, non-greedy action, as we discuss below.
One approach is to use an ϵ-greedy policy πϵ , parameterized by ϵ ∈ [0, 1]. In this case, we pick the
greedy action wrt the current model, at = argmaxa R̂t (st , a) with probability 1 − ϵ, and a random action
with probability ϵ. This rule ensures the agent’s continual exploration of all state-action combinations.
Unfortunately, this heuristic can be shown to be suboptimal, since it explores every action with at least a
constant probability ϵ/|A|, although this can be solved by annealing ϵ to 0 over time. Another problem with
ϵ-greedy is that it can result in "dithering", in which the agent continually changes its mind about what to
do. In [DOB21] they propose a simple solution to this problem, known as ϵz-greedy, that often works well.
The idea is that with probability 1 − ϵ the agent exploits, but with probability ϵ the agent explores by
repeating the sampled action for n ∼ z() steps in a row, where z(n) is a distribution over the repeat duration.
This can help the agent escape from local minima.
Another simple approach to exploration is to use Boltzmann exploration, which assigns higher
probabilities to more promising actions, taking into account the reward function. That is, we use a
policy of the form
$$\pi_\tau(a|s) = \frac{\exp(\hat{R}_t(s_t, a)/\tau)}{\sum_{a'} \exp(\hat{R}_t(s_t, a')/\tau)} \quad (1.37)$$
where τ > 0 is a temperature parameter that controls how entropic the distribution is. As τ gets close to 0,
πτ becomes close to a greedy policy. On the other hand, higher values of τ will make π(a|s) more uniform,
and encourage more exploration. Its action selection probabilities can be much "smoother" with respect to
changes in the reward estimates than ϵ-greedy, as illustrated in Table 1.2.

R̂(s, a1)  R̂(s, a2)  πϵ(a1|s)  πϵ(a2|s)  πτ(a1|s)  πτ(a2|s)
1.00      9.00      0.05      0.95      0.00      1.00
4.00      6.00      0.05      0.95      0.12      0.88
4.90      5.10      0.05      0.95      0.45      0.55
5.05      4.95      0.95      0.05      0.53      0.48
7.00      3.00      0.95      0.05      0.98      0.02
8.00      2.00      0.95      0.05      1.00      0.00

Table 1.2: Comparison of ϵ-greedy policy (with ϵ = 0.1) and Boltzmann policy (with τ = 1) for a simple MDP with 6
states and 2 actions. Adapted from Table 4.1 of [GK19].
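The two exploration rules are easy to compare directly. The sketch below mirrors the setup of Table 1.2 (ϵ = 0.1, τ = 1) and prints the action probabilities that each rule assigns for a given pair of reward estimates.

```python
import numpy as np

def eps_greedy_probs(r_hat, eps=0.1):
    probs = np.full(len(r_hat), eps / len(r_hat))
    probs[np.argmax(r_hat)] += 1.0 - eps
    return probs

def boltzmann_probs(r_hat, tau=1.0):
    logits = np.asarray(r_hat) / tau
    logits -= logits.max()                 # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

for r_hat in [(1.0, 9.0), (4.0, 6.0), (4.9, 5.1), (5.05, 4.95), (7.0, 3.0), (8.0, 2.0)]:
    print(r_hat,
          eps_greedy_probs(np.array(r_hat)).round(2),
          boltzmann_probs(np.array(r_hat)).round(2))
```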
The Boltzmann policy explores equally widely in all states. An alternative approach is to try to explore
(state,action) combinations where the consequences of the outcome might be uncertain. This can be achived
using an exploration bonus Rtb (s, a), which is large if the number of times we have tried actioon a in state
s is small. We can then add Rb to the regular reward, to bias the behavior in a way that will hopefully
cause the agent to learn useful information about the world. This is called an intrinsic reward function
(Section 7.4).
25
for making as many paper clips as possible. An optimal agent may convert the whole world into a paper
clip factory, because the user forgot to specify various constraints, such as not killing people (which might
otherwise be necessary in order to use as many resources as possible for paperclips). In the AI alignment
community, this example is known as the paperclip maximizer problem, and is due to Nick Bostrom
[Bos16]. (See e.g., https://openai.com/index/faulty-reward-functions/ for some examples that have
occurred in practice.) This is an example of a more general problem known as reward hacking [Ska+22].
For a potential solution, based on the assistance game paradigm, see Section 6.1.7.
where Φ : S → R is a potential function, then we can guarantee that the sum of shaped rewards will match
the sum of original rewards plus a constant. This is called Potential-Based Reward Shaping.
In [Wie03], they prove that (in the tabular case) this approach is equivalent to initializing the value
function to V (s) = Φ(s). In [TMM19], they propose an extension called potential-based advice, where they
show that a potential of the form F (s, a, s′ , a′ ) = γΦ(s′ , a′ ) − Φ(s, a) is also valid (and more expressive). In
[Hu+20], they introduce a reward shaping function z which can be used to down-weight or up-weight the
shaping function:
r′ (s, a) = r(s, a) + zϕ (s, a)F (s, a) (1.39)
They use bilevel optimization to optimize ϕ wrt the original task performance.
1.3.7 Software
Implementing RL algorithms is much trickier than methods for supervised learning, or generative methods
such as language modeling and diffusion, all of which have stable (easy-to-optimize) loss functions. Therefore
it is often wise to build on existing software rather than starting from scratch. We list some useful libraries
in Section 1.3.7.
In addition, RL experiments can be very high variance, making it hard to draw valid conclusions. See
[Aga+21b; Pat+24; Jor+24] for some recommended experimental practices. For example, when reporting
performance across different environments, with different intrinsic difficulties (e.g., different kinds of Atari
26
URL Language Comments
Stoix Jax Mini-library with many methods (including MBRL)
PureJaxRL Jax Single files with DQN; PPO, DPO
JaxRL Jax Single files with AWAC, DDPG, SAC, SAC+REDQ
Stable Baselines Jax Jax Library with DQN, CrossQ, TQC; PPO, DDPG, TD3, SAC
Jax Baselines Jax Library with many methods
Rejax Jax Library with DDQN, PPO, (discrete) SAC, DDPG
Dopamine Jax/TF Library with many methods
Rlax Jax Library of RL utility functions (used by Acme)
Acme Jax/TF Library with many methods (uses rlax)
CleanRL PyTorch Single files with many methods
Stable Baselines 3 PyTorch Library with DQN; A2C, PPO, DDPG, TD3, SAC, HER
TianShou PyTorch Library with many methods (including offline RL)
games), [Aga+21b] recommend reporting the interquartile mean (IQM) of the performance metric, which
is the mean of the samples between the 0.25 and 0.75 percentiles, (this is a special case of a trimmed mean).
Let this estimate be denoted by µ̂(Di ), where D is the empirical data (e.g., reward vs time) from the i’th
run. We can estimate the uncertainty in this estimate using a nonparametric method, such as bootstrap
resampling, or a parametric approximation, such as a Gaussian approximation. (This requires computing the
standard error of the mean, √σ̂n , where n is the number of trials, and σ̂ is the estimated standard deviation of
the (trimmed) data.)
27
28
Chapter 2
Value-based RL
This is the expected return obtained if we start in state s and follow π to choose actions in a continuing task
(i.e., T = ∞).
Similarly, we define the state-action value function, also known as the Q-function, as follows:
"∞ #
X
Qπ (s, a) ≜ Eπ [G0 |s0 = s, a0 = a] = Eπ t
γ rt |s0 = s, a0 = a (2.2)
t=0
This quantity represents the expected return obtained if we start by taking action a in state s, and then
follow π to choose actions thereafter.
Finally, we define the advantage function as follows:
This tells us the benefit of picking action a in state s then switching to policy π, relative to the baseline return
of always following π. Note that Advπ (s, a) can be both positive and negative, and Eπ(a|s) [Advπ (s, a)] = 0
due to a useful equality: Vπ (s) = Eπ(a|s) [Qπ (s, a)].
29
A fundamental result about the optimal value function is Bellman’s optimality equations:
Conversely, the optimal value functions are the only solutions that satisfy the equations. In other words,
although the value function is defined as the expectation of a sum of infinitely many rewards, it can be
characterized by a recursive equation that involves only one-step transition and reward models of the MDP.
Such a recursion play a central role in many RL algorithms we will see later.
Given a value function (V or Q), the discrepancy between the right- and left-hand sides of Equations (2.4)
and (2.5) are called Bellman error or Bellman residual. We can define the Bellman operator B given
an MDP M = (R, T ) and policy π as a function that takes a value function V and derives a few value function
V ′ that satisfies
V ′ (s) = BM
π
V (s) ≜ Eπ(a|s) R(s, a) + γET (s′ |s,a) [V (s′ )] (2.6)
This reduces the Bellman error. Applying the Bellman operator to a state is called a Bellman backup. If
we iterate this process, we will converge to the optimal value function V∗ , as we discuss in Section 2.2.1.
Given the optimal value function, we can derive an optimal policy using
Following such an optimal policy ensures the agent achieves maximum expected return starting from any
state.
The problem of solving for V ∗ , Q∗ or π ∗ is called policy optimization. In contrast, solving for Vπ or Qπ
for a given policy π is called policy evaluation, which constitutes an important subclass of RL problems as
will be discussed in later sections. For policy evaluation, we have similar Bellman equations, which simply
replace maxa {·} in Equations (2.4) and (2.5) with Eπ(a|s) [·].
In Equations (2.7) and (2.8), as in the Bellman optimality equations, we must take a maximum over all
actions in A, and the maximizing action is called the greedy action with respect to the value functions,
Q∗ or V ∗ . Finding greedy actions is computationally easy if A is a small finite set. For high dimensional
continuous spaces, see Section 2.5.4.1.
far-sighted. However, it does not give the agent any short-term guidance on how to behave. For example, in
s2 , it is not clear if it is should go up or down, since both actions will eventually reach the goal with identical
Q∗ -values.
Figure 2.1(d) shows Q∗ when γ = 0.9. This reflects a preference for near-term rewards, while also taking
future reward into account. This encourages the agent to seek the shortest path to the goal, which is usually
30
Q*(s, a)
R(s)
𝛄=0 𝛄=1 𝛄 = 0.9
ST1 ST1 Up Down Up Down Up Down
0
S1 S1
0 0 0 0 1.0 0 0.81
Up
S2 a1 S2
0
0 0 1.0 1.0 0.73 0.9
Down
S3 a2 S3 0
0 1.0 1.0 1.0 0.81 1.0
ST2 ST2 1
Figure 2.1: Left: illustration of a simple MDP corresponding to a 1d grid world of 3 non-absorbing states and 2
actions. Right: optimal Q-functions for different values of γ. Adapted from Figures 3.1, 3.2, 3.4 of [GK19].
what we desire. A proper choice of γ is up to the agent designer, just like the design of the reward function,
and has to reflect the desired behavior of the agent.
Note that the update rule, sometimes called a Bellman backup, is exactly the right-hand side of the
Bellman optimality equation Equation (2.4), with the unknown V ∗ replaced by the current estimate Vk . A
fundamental property of Equation (2.9) is that the update is a contraction: it can be verified that
31
In other words, every iteration will reduce the maximum value function error by a constant factor.
Vk will converge to V ∗ , after which an optimal policy can be extracted using Equation (2.8). In practice,
we can often terminate VI when Vk is close enough to V ∗ , since the resulting greedy policy wrt Vk will be
near optimal. Value iteration can be adapted to learn the optimal action-value function Q∗ .
We can guarantee that Vπ′ ≥ Vπ . This is called the policy improvement theorem. To see this, define r ′ ,
T′ and v ′ as before, but for the new policy π ′ . The definition of π ′ implies r ′ + γT′ v ≥ r + γTv = v, where
the equality is due to Bellman’s equation. Repeating the same equality, we have
v ≤ r ′ + γT′ v ≤ r ′ + γT′ (r ′ + γT′ v) ≤ r ′ + γT′ (r ′ + γT′ (r ′ + γT′ v)) ≤ · · · (2.13)
′ ′2 ′ ′ −1 ′ ′
2
= (I + γT + γ T + · · · )r = (I − γT ) r =v (2.14)
Starting from an initial policy π0 , policy iteration alternates between policy evaluation (E) and improvement
(I) steps, as illustrated below:
E I E I E
π0 → Vπ0 → π1 → Vπ1 · · · → π ∗ → V ∗ (2.15)
32
Figure 2.2: Policy iteration vs value iteration represented as backup diagrams. Empty circles represent states, solid
(filled) circles represent states and actions. Adapted from Figure 8.6 of [SB18].
The algorithm stops at iteration k, if the policy πk is greedy with respect to its own value function Vπk . In
this case, the policy is optimal. Since there are at most |A||S| deterministic policies, and every iteration
strictly improves the policy, the algorithm must converge after finite iterations.
In PI, we alternate between policy evaluation (which involves multiple iterations, until convergence of
Vπ ), and policy improvement. In VI, we alternate between one iteration of policy evaluation followed by one
iteration of policy improvement (the “max” operator in the update rule). We are in fact free to intermix any
number of these steps in any order. The process will converge once the policy is greedy wrt its own value
function.
Note that policy evaluation computes Vπ whereas value iteration computes V ∗ . This difference is illustrated
in Figure 2.2, using a backup diagram. Here the root node represents any state s, nodes at the next level
represent state-action combinations (solid circles), and nodes at the leaves representing the set of possible
resulting next state s′ for each possible action. In PE, we average over all actions according to the policy,
whereas in VI, we take the maximum over all actions.
2.3 Value function learning using samples from the world model
In the rest of this chapter, we assume the agent only has access to samples from the environment, (s′ , r) ∼
p(s′ , r|s, a). We will show how to use these samples to estimate optimal value function and Q-function, even
without explicitly knowing the MDP dynamics. This is sometimes called “learning” as opposed to “planning”,
since the latter requires access to an explicit world model.
where η is the learning rate, and the term in brackets is an error term. We can use a similar technique to
estimate Qπ (s, a) = E [Gt |st = s, at = a] by simply starting the rollout with action a.
We can use MC estimation of Q, together with policy iteration (Section 2.2.3), to learn an optimal policy.
Specifically, at iteration k, we compute a new, improved policy using πk+1 (s) = argmaxa Qk (s, a), where Qk
is approximated using MC estimation. This update can be applied to all the states visited on the sampled
trajectory. This overall technique is called Monte Carlo control.
To ensure this method converges to the optimal policy, we need to collect data for every (state, action)
pair, at least in the tabular case, since there is no generalization across different values of Q(s, a). One way
33
to achieve this is to use an ϵ-greedy policy (see Section 1.3.5). Since this is an on-policy algorithm, the
resulting method will converge to the optimal ϵ-soft policy, as opposed to the optimal policy. It is possible to
use importance sampling to estimate the value function for the optimal policy, even if actions are chosen
according to the ϵ-greedy policy. However, it is simpler to just gradually reduce ϵ.
where η is the learning rate. (See [RFP15] for ways to adaptively set the learning rate.) The δt = yt − V (st )
term is known as the TD error. A more general form of TD update for parametric value function
representations is
we see that Equation (2.16) is a special case. The TD update rule for evaluating Qπ is similar, except we
replace states with states and actions.
It can be shown that TD learning in the tabular case, Equation (2.16), converges to the correct value func-
tion, under proper conditions [Ber19]. However, it may diverge when using nonlinear function approximators,
as we discuss in Section 2.5.2.5. The reason is that this update is a “semi-gradient”, which refers to the fact
that we only take the gradient wrt the value function, ∇w V (st , wt ), treating the target Ut as constant.
The potential divergence of TD is also consistent with the fact that Equation (2.18) does not correspond
to a gradient update on any objective function, despite having a very similar form to SGD (stochastic gradient
descent). Instead, it is an example of bootstrapping, in which the estimate, Vw (st ), is updated to approach
a target, rt + γVw (st+1 ), which is defined by the value function estimate itself. This idea is shared by DP
methods like value iteration, although they rely on the complete MDP model to compute an exact Bellman
backup. In contrast, TD learning can be viewed as using sampled transitions to approximate such backups.
An example of a non-bootstrapping approach is the Monte Carlo estimation in the previous section. It
samples a complete trajectory, rather than individual transitions, to perform an update; this avoids the
divergence issue, but is often much less efficient. Figure 2.3 illustrates the difference between MC, TD, and
DP.
34
Figure 2.3: Backup diagrams of V (st ) for Monte Carlo, temporal difference, and dynamic programming updates of the
state-value function. Used with kind permission of Andy Barto.
the return for the rest of the trajectory, similar to heuristic search (Section 4.2.1.4). That is, we can use the
n-step return
Using this update can help propagate sparse terminal rewards back through many earlier states.
Rather than picking a specific lookahead value, n, we can take a weighted average of all possible values,
with a single parameter λ ∈ [0, 1], by using
∞
X
Gλt ≜ (1 − λ) λn−1 Gt:t+n (2.23)
n=1
P∞
This is called the lambda return. Note that these coefficients sum to one (since t=0 (1 − λ)λt = 1−λ 1−λ = 1,
for λ < 1), so the return is a convex combination of n-step returns. See Figure 2.4 for an illustration. We can
now use Gλt inside the TD update instead of Gt:t+n ; this is called TD(λ).
Note that, if a terminal state is entered at step T (as happens with episodic tasks), then all subsequent
n-step returns are equal to the conventional return, Gt . Hence we can write
−t−1
TX
Gλt = (1 − λ) λn−1 Gt:t+n + λT −t−1 Gt (2.24)
n=1
From this we can see that if λ = 1, the λ-return becomes equal to the regular MC return Gt . If λ = 0, the
λ-return becomes equal to the one-step return Gt:t+1 (since 0n−1 = 1 iff n = 1), so standard TD learning is
often called TD(0) learning. This episodic form also gives us the following recursive equation
35
Figure 2.4: The backup diagram for TD(λ). Standard TD learning corresponds to λ = 0, and standard MC learning
corresponds to λ = 1. From Figure 12.1 of [SB18]. Used with kind permission of Richard Sutton.
where a′ ∼ π(s′ ) is the action the agent will take in state s′ . This converges to Qπ . After Q is updated (for
policy evaluation), π also changes accordingly so it is greedy with respect to Q (for policy improvement).
This algorithm, first proposed by [RN94], was further studied and renamed to SARSA by [Sut96]; the name
comes from its update rule that involves an augmented transition (s, a, r, s′ , a′ ).
2.4.1 Convergence
In order for SARSA to converge to Q∗ , every state-action pair must be visited infinitely often, at least in
the tabular case, since the algorithm only updates Q(s, a) for (s, a) that it visits. One way to ensure this
condition is to use a “greedy in the limit with infinite exploration” (GLIE) policy. An example is the ϵ-greedy
policy, with ϵ vanishing to 0 gradually. It can be shown that SARSA with a GLIE policy will converge to Q∗
and π ∗ [Sin+00].
36
2.4.2 Sarsa(λ)
It is possible to apply the eligibility trace idea to Sarsa, since it is an on-policy method. This can help
speedup learning in sparse reward scenarios.
The basic idea, in the tabular case, is as follows. We compute an eligibility for each state action pair,
denoted N (s, a), representing the visit count. Following Equation (2.27), we perform the update
This is the update rule of Q-learning for the tabular case [WD92].
Since it is off-policy, the method can use (s, a, r, s′ ) triples coming from any data source, such as older
versions of the policy, or log data from an existing (non-RL) system. If every state-action pair is visited
infinitely often, the algorithm provably converges to Q∗ in the tabular case, with properly decayed learning
rates [Ber19]. Algorithm 1 gives a vanilla implementation of Q-learning with ϵ-greedy exploration.
37
Q-function
Episode Time Step Action (s,α,r , s') r + γ Q*(s' , α)
Q-function
episode start episode end
UP DOWN UP DOWN
S1 1 1 (S1 , D,0,S2) 0 + 0.9 X 0 = 0 S1
0 0 0 0
1 2 (S2 ,U,0,S1) 0 + 0.9 X 0 = 0
S1 S1
0 0 0 0
2 1 (S1 , D,0,S2) 0 + 0.9 x 0 = 0
2 3 (S3 , D,0,ST2) 1
S3 0 1 S3 0 1
4 7 (S2 , D,0,S3) 1
S1 S1
0 0.81 0 0.81
5 1 (S1 , U, 0,ST1) 0
S3 0.81 1 S3 0.81 1
Figure 2.5: Illustration of Q learning for one random trajectory in the 1d grid world in Figure 2.1 using ϵ-greedy
exploration. At the end of episode 1, we make a transition from S3 to ST 2 and get a reward of r = 1, so we estimate
Q(S3 , ↓) = 1. In episode 2, we make a transition from S2 to S3 , so S2 gets incremented by γQ(S3 , ↓) = 0.9. Adapted
from Figure 3.3 of [GK19].
For terminal states, s ∈ S + , we know that Q(s, a) = 0 for all actions a. Consequently, for the optimal
value function, we have V ∗ (s) = maxa′ Q∗ (s, a) = 0 for all terminal states. When performing online learning,
we don’t usually know which states are terminal. Therefore we assume that, whenever we take a step in the
environment, we get the next state s′ and reward r, but also a binary indicator done(s′ ) that tells us if s′ is
terminal. In this case, we set the target value in Q-learning to V ∗ (s′ ) = 0 yielding the modified update rule:
h i
Q(s, a) ← Q(s, a) + η r + (1 − done(s′ ))γ max
′
Q(s′ ′
, a ) − Q(s, a) (2.33)
a
For brevity, we will usually ignore this factor in the subsequent equations, but it needs to be implemented in
the code.
Figure 2.5 gives an example of Q-learning applied to the simple 1d grid world from Figure 2.1, using
γ = 0.9. We show the Q-functon at the start and end of each episode, after performing actions chosen by an
ϵ-greedy policy. We initialize Q(s, a) = 0 for all entries, and use a step size of η = 1. At convergence, we have
Q∗ (s, a) = r + γQ∗ (s′ , a∗ ), where a∗ =↓ for all states.
38
2.5.2 Q learning with function approximation
To make Q learning work with high-dimensional state spaces, we have to replace the tabular (non-parametric)
representation with a parametric approximation, denoted Qw (s, a). We can update this function using one or
more steps of SGD on the following loss function
2
L(w|s, a, r, s′ ) = (r + γ max
′
Qw (s′ , a′ )) − Qw (s, a) (2.34)
a
Since nonlinear functions need to be trained on minibatches of data, we compute the average loss over multiple
randomly sampled experience tuples (see Section 2.5.2.3 for discussion) to get
2.5.2.2 DQN
The influential deep Q-network or DQN paper of [Mni+15] also used neural nets to represent the Q function,
but performed a smaller number of gradient updates per iteration. Furthermore, they proposed to modify the
target value when fitting the Q function in order to avoid instabilities during training (see Section 2.5.2.5 for
details).
The DQN method became famous since it was able to train agents that can outperform humans when
playing various Atari games from the ALE (Atari Learning Environment) benchmark [Bel+13]. Here the
input is a small color image, and the action space corresponds to moving left, right, up or down, plus an
optional shoot action.1
1 For more discussion of ALE, see [Mac+18a], and for a recent extension to continuous actions (representing joystick control),
see the CALE benchmark of [FC24]. Note that DQN was not the first deep RL method to train an agent from pixel input; that
honor goes to [LR10], who trained an autoencoder to embed images into low-dimensional latents, and then used neural fitted Q
learning (Section 2.5.2.1) to fit the Q function.
39
Since 2015, many more extensions to DQN have been proposed, with the goal of improving performance
in various ways, either in terms of peak reward obtained, or sample efficiency (e.g., reward obtained after only
100k steps in the environment, as proposed in the Atari-100k benchmark [Kai+19]2 ), or training stability,
or all of the above. We discuss some of these extensions in Section 2.5.4.
δi = ri + γ max
′
Qw (s′i , a′ ) − Qw (si , ai ) (2.36)
a
the-art model-free deep RL algorithms” at the time of the benchmark’s creation in 2019. This excludes games like Montezuma’s
Revenge, which require more exploration and hence more training data.
40
(a) (b)
Figure 2.6: (a) A simple MDP. (b) Parameters of the policy diverge over time. From Figures 11.1 and 11.2 of [SB18].
Used with kind permission of Richard Sutton.
of MC), and off-policy learning (where the actions are sampled from some distribution other than the policy
that is being optimized). This combination is known as the deadly triad [Sut15; van+18]).
A classic example of this is the simple MDP depicted in Figure 2.6a, due to [Bai95]. (This is known as
Baird’s counter example.) It has 7 states and 2 actions. Taking the dashed action takes the environment
to the 6 upper states uniformly at random, while the solid action takes it to the bottom state. The reward is
0 in all transitions, and γ = 0.99. The value function Vw uses a linear parameterization indicated by the
expressions shown inside the states, with w ∈ R8 . The target policies π always chooses the solid action in
every state. Clearly, the true value function, Vπ (s) = 0, can be exactly represented by setting w = 0.
Suppose we use a behavior policy b to generate a trajectory, which chooses the dashed and solid actions
with probabilities 6/7 and 1/7, respectively, in every state. If we apply TD(0) on this trajectory, the
parameters diverge to ∞ (Figure 2.6b), even though the problem appears simple. In contrast, with on-policy
data (that is, when b is the same as π), TD(0) with linear approximation can be guaranteed to converge to
a good value function approximate [TR97]. The difference is that with on-policy learning, as we improve
the value function, we also improve the policy, so the two become self-consistent, whereas with off-policy
learning, the behavior policy may not match the optimal value function that is being learned, leading to
inconsistencies.
The divergence behavior is demonstrated in many value-based bootstrapping methods, including TD,
Q-learning, and related approximate dynamic programming algorithms, where the value function is represented
either linearly (like the example above) or nonlinearly [Gor95; TVR97; OCD21]. The root cause of these
divergence phenomena is that bootstrapping methods typically are not minimizing a fixed objective function.
Rather, they create a learning target using their own estimates, thus potentially creating a self-reinforcing
loop to push the estimates to infinity. More formally, the problem is that the contraction property in the
tabular case (Equation (2.10)) may no longer hold when V is approximated by Vw .
We discuss some solutions to the deadly triad problem below.
y(r, s′ ; w− ) = r + γ max
′
Qw− (s′ , a′ ) (2.39)
a
for training Qw . We can periodically set w− ← sg(w), usually after a few episodes, where the stop gradient
operator is used to prevent autodiff propagating gradients back to w. Alternatively, we can use an exponential
41
Data No LayerNorm With LayerNorm
Figure 2.7: We generate a dataset (left) with inputs x distributed in a circle with radius 0.5 and labels y = ||x||. We
then fit a two-layer MLP without LayerNorm (center) and with LayerNorm (right). LayerNorm bounds the values and
prevents catastrophic overestimation when extrapolating. From Figure 3 of [Bal+23]. Used with kind permission of
Philip Ball.
moving average (EMA) of the weights, i.e., we use w = ρw + (1 − ρ)sg(w), where ρ ≪ 1 ensures that Qw
slowly catches up with Qw . (If ρ = 0, we say that this is a detached target, since it is just a frozen copy of
the current weights.) The final loss has the form
L(w) = E(s,a,r,s′ )∼U (D) [L(w|s, a, r, s′ )] (2.40)
′ ′
L(w|s, a, r, s ) = (y(r, s ; w) − Qw (s, a)) 2
(2.41)
Theoretical work justifying this technique is given in [FSW23; Che+24a].
42
Figure 2.8: Comparison of Q-learning and double Q-learning on a simple episodic MDP using ϵ-greedy action selection
with ϵ = 0.1. The initial state is A, and squares denote absorbing states. The data are averaged over 10,000 runs.
From Figure 6.5 of [SB18]. Used with kind permission of Richard Sutton.
variables {Xa }. Thus, if we pick actions greedily according to their random scores {Xa }, we might pick a
wrong action just because random noise makes it appealing.
Figure 2.8 gives a simple example of how this can happen in an MDP. The start state is A. The right
action gives a reward 0 and terminates the episode. The left action also gives a reward of 0, but then enters
state B, from which there are many possible actions, with rewards drawn from N (−0.1, 1.0). Thus the
expected return for any trajectory starting with the left action is −0.1, making it suboptimal. Nevertheless,
the RL algorithm may pick the left action due to the maximization bias making B appear to have a positive
value.
So we see that Q1 uses Q2 to choose the best action but uses Q1 to evaluate it, and vice versa. This technique
is called double Q-learning [Has10]. Figure 2.8 shows the benefits of the algorithm over standard Q-learning
in a toy problem.
In Section 3.2.6.3 we discuss an extension called clipped double DQN which uses two Q networks and
their frozen copies to define the following target:
y(r, s′ ; w1:2 , w1:2 ) = r + γ min Qwi (s′ , argmax Qwi (s′ , a′ )) (2.45)
i=1,2 a′
43
2.5.3.3 Randomized ensemble DQN
The double DQN method is extended in the REDQ (randomized ensembled double Q learning) method
of [Che+20], which uses an ensemble of N > 2 Q-networks. Furthermore, at each step, it draws a random
sample of M ≤ N networks, and takes the minimum over them when computing the target value. That is, it
uses the following update (see Algorithm 2 in appendix of [Che+20]):
y(r, s′ ; w1:N , w1:N ) = r + γ max
′
min Qwi (s′ , a′ ) (2.46)
a i∈M
where M is a random subset from the N value functions. The ensemble reduces the variance, and the
minimum reduces the overestimation bias.3 If we set N = M = 2, we get a method similar to clipped double
Q learning. (Note that REDQ is very similiar to the Random Ensemble Mixture method of [ASN20],
which was designed for offline RL.)
(UTD) ratio (also called Replay Ratio) is critical for sample efficiency, and is commonly used in model-based RL.
44
Thus we can satisfy the constraint for the optimal policy by subtracting off maxa A(s, a) from the advantage
head. Equivalently we can compute the Q function using
In practice, the max is replaced by an average, which seems to work better empirically.
This can be implemented for episodic environments by storing experience tuples of the form
n
X
τ = (s, a, γ k−1 rk , sn , done) (2.56)
k=1
where done = 1 if the trajectory ended at any point during the n-step rollout.
Theoretically this method is only valid if all the intermediate actions, a2:n−1 , are sampled from the current
optimal policy derived from Qw , as opposed to some behavior policy, such as epsilon greedy or some samples
from the replay buffer from an old policy. In practice, we can just restrict sampling to recent samples from
the replay buffer, making the resulting method approximately on-policy.
2.5.4.5 Q(λ)
Instead of using a fixed n, it is possible to use a weighted combination of returns; this is known as the Q(λ)
algorithm [PW94; Har+16; Koz+21], and relies on the concept of eligibility traces. Unfortunately it is more
complicated than the Sarsa case in Section 2.4.2, since Q learning is off-policy, but the eligibility traces
backpropagate information obtained by the exploration policy.
2.5.4.6 Rainbow
The Rainbow method of [Hes+18] combined 6 improvements to the vanilla DQN method, as listed below.
(The paper is called “Rainbow” due to the color coding of their results plot, a modified version of which is
shown in Figure 2.9.) At the time it was published (2018), this produced SOTA results on the Atari-200M
benchmark. The 6 improvements are as follows:
• Use double DQN, as in Section 2.5.3.2.
• Use prioritized experience replay, as in Section 2.5.2.4.
• Use the categorical DQN (C51) (Section 7.3.2) distributional RL method.
45
Figure 2.9: Plot of median human-normalized score over all 57 Atari games for various DQN agents. The yellow,
red and green curves are distributional RL methods (Section 7.3), namely categorical DQN (C51) (Section 7.3.2)
Quantile Regression DQN (Section 7.3.1), and Implicit Quantile Networks [Dab+18]. Figure from https: // github.
com/ google-deepmind/ dqn_ zoo .
• Use a larger CNN with residual connections, namely the Impala network from [Esp+18] with the
modifications (including the use of spectral normalization) proposed in [SS21].
• Use Munchausen RL [VPG20], which modifies the Q learning update rule by adding an entropy-like
penalty.
• Collect 1 environment step from 64 parallel workers for each minibatch update (rather than taking
many steps from a smaller number of workers).
• Use a larger CNN with residual connections, namely a modified version of the Impala network from
[Esp+18].
• Increase the update-to-data (UTD) ratio (number of times we update the Q function for every
observation that is observed), in order to increase sample efficiency [HHA19].
• Use a periodic soft reset of (some of) the network weights to avoid loss of elasticity due to increased
network updates, following the SR-SPR method of [D’O+22].
• Use n-step returns, as in Section 2.5.4.4, and then gradually decrease (anneal) the n-step return from
n = 10 to n = 3, to reduce the bias over time.
46
• Gradually increase the discount factor from γ = 0.97 to γ = 0.997, to encourage longer term planning
once the model starts to be trained.4
• Drop noisy nets (which requires multiple network copies and thus slows down training due to increased
memory use), since it does not help.
4 The Agent 57 method of [Bad+20] automatically learns the exploration rate and discount factor using a multi-armed
bandit stratey, which lets it be more exploratory or more exploitative, depending on the game. This resulted in super human
performance on all 57 Atari games in ALE. However, it required 80 billion frames (environment steps)! This was subsequently
reduced to the “standard” 200M frames in the MEME method of [Kap+22].
47
48
Chapter 3
Policy-based RL
In the previous section, we considered methods that estimate the action-value function, Q(s, a), from which
we derive a policy. However, these methods have several disadvantages: (1) they can be difficult to apply to
continuous action spaces; (2) they may diverge if function approximation is used (see Section 2.5.2.5); (3)
the training of Q, often based on TD-style updates, is not directly related to the expected return garnered
by the learned policy; (4) they learn deterministic policies, whereas in stochastic and partially observed
environments, stochastic policies are provably better [JSJ94].
In this section, we discuss policy search methods, which directly optimize the parameters of the policy
so as to maximize its expected return. We mostly focus on policy gradient methods, that use the gradient
of the loss to guide the search (see e.g., [Aga+21a]). As we will see, these policy methods often benefit from
estimating a value or advantage function to reduce the variance in the policy search process, so we will also
use techniques from Chapter 2.
The parametric policy will be denoted by πθ (a|s), which is usually some form of neural network. For
discrete actions, the final layer is usually passed through a softmax function and then into a categorical
distribution. For continuous actions, we typically use a Gaussian output layer (potentially clipped to a
suitable range, such as [−1, 1]), although it is also possible to use more expressive (multi-modal) distributions,
such as diffusion models [Ren+24].
There are many implementation details one needs to get right to get good performance when designing
such neural networks. For example, [Fur+21] recommends using ELU instead of RELU activations, and using
LayerNorm. (In [Gal+24] they recently proved that adding layer norm to the final layer of a DQN model
is sufficient to guarantee that value learning is stable, even in the nonlinear setting.) However, we do not
discuss these details in this manuscript.
For more details on policy gradient methods, see e.g., [Wen18b; Aga+21a; Leh24].
49
where R(τ ) = γ 0 r0 +γ 1 r1 +. . . is the return along the trajectory, and pθ (τ ) is the distribution over trajectories
induced by the policy (and world model):
T
Y
pθ (τ ) = p(s1 ) T (sk+1 |sk , ak )πθ (ak |sk ) (3.2)
k=1
The expectations can be estimated using Monte Carlo sampling (rolling out the policy in the environment).
The gradient can be computed from Equation (3.2) as follows:
T
X
∇θ log pθ (τ ) = ∇θ log πθ (ak |sk ) (3.5)
k=1
Hence " ! #
T
X
∇θ J(θ) = Eτ ∇θ log πθ (ak |sk ) R(τ ) (3.6)
k=1
In statistics, the term ∇θ log πθ (a|s) is called the (Fisher) score function1 , so sometimes Equation (3.6) is
called the score function estimator or SFE [Fu15; Moh+20].
f1 r1 + f1 r2 γ + f1 r3 γ 2 + · · · f1 rT γ T −1 (3.9)
+ r1 + f2 r2 γ + f2 r3 γ 2 + · · · f2 rT γ T −1
f2 (3.10)
+ f3
r1 +
f2r2γ + f3 r3 γ 2 · · · f2 rT γ T −1 (3.11)
..
. (3.12)
+ fTr1 +
fT r2γ +fTr3γ 2
+ · · · fT rT γ T −1 (3.13)
1 This is distinct from the Stein score, which is the gradient wrt the argument of the log probability, ∇ log π (a|s), as used
a θ
in diffusion.
50
where we have canceled terms that are 0, due to the fact that the reward at step k cannot depend on actions
at time steps in the future. Plugging in this simplified expression we get
" T T
!#
X X
∇J(θ) = Eτ ∇θ log πθ (ak |sk ) rl γ l−1 (3.14)
k=1 l=k
" T T
!#
X X
= Eτ ∇θ log πθ (ak |sk ) γ k−1 l l−k
rγ (3.15)
k=1 l=k
" T
#
X
= Eτ ∇θ log πθ (ak |sk )γ k−1
Gk (3.16)
k=1
Note that the reward-to-go of a state-action pair (s, a) can be considered as a single sample approximation
of the state-action value function Qθ (s, a). Averaging over such samples gives
" T #
X
∇J(θ) = Eτ γ k−1
Qθ (sk , ak )∇θ log πθ (ak |sk ) (3.18)
k=1
3.1.3 REINFORCE
In this section, we describe an algorithm that uses the above estimate of the gradient of the policy value,
together with SGD, to fit a policy. That is, we use
T
X
θj+1 := θj + η ∇θ log πθj (ak |sk )γ k−1 Gk (3.19)
k=1
where j is the SGD iteration number, and we draw a single trajectory at each step. This is is called the
REINFORCE algorithm [Wil92].2
The update equation in Equation (3.19) can be interpreted as follows: we compute the sum of discounted
future rewards induced by a trajectory, and if this is positive, we increase θ so as to make this trajectory
more likely, otherwise we decrease θ. Thus, we reinforce good behaviors, and reduce the chances of generating
bad ones. See Algorithm 3 for the pseudocode.
2 The term “REINFORCE” is an acronym for “REward Increment = nonnegative Factor x Offset Reinforcement x Characteristic
Eligibility”. The phrase “characteristic eligibility” refers to the ∇ log πθ (at |st ) term; the phrase “offset reinforcement” refers to
the Gt − b(st ) term, where b is a baseline to be defined later; and the phrase “nonnegative factor” refers to the learning rate η of
SGD.
51
Algorithm 3: REINFORCE (episodic version)
1 Initialize policy parameters θ
2 repeat
3 Sample an episode τ = (s1 , a1 , r1 , s2 , . . . , sT ) using πθ
4 for k = 1, . . . , T do
PT
5 Gk = l=k γ l−k Rl
6 θ ← θ + ηθ γ k−1 Gk ∇θ log πθ (ak |sk )
7 until converged ;
where pπ (s0 → s, t) is the probability of going from s0 to s in t steps, and pπt (s) is the marginal probability of
being in state s at time t (after each episodic reset). Note that ργπ is a measureP ofγtime spent in non-terminal
states, but it is not a probability measure, since it is not normalized,
P∞ i.e., s ρπ (s) ̸= 1. However, we can
define a normalized version of the measure ρ by noting that t=0 γ t = 1−γ 1
for γ < 1. Hence the normalized
discounted state visitation distribution is given by the following (note the change from ρ to p):
∞
X
pγπ (s) = (1 − γ)ργπ (s) = (1 − γ) γ t pπt (s) (3.22)
t=0
We can convert from the normalized distribution back to the measure using
1
ργπ (s) = pγ (s) (3.23)
1−γ π
Using this notation, one can show (see [KL02]) that we can rewrite Equation (3.18) in terms of expectations
over states rather than over trajectories:
∇θ J(θ) = Eργπ (s)πθ (a|s) [Qπθ (s, a)∇θ log πθ (a|s)] (3.24)
1
= E γ [Qπθ (s, a)∇θ log πθ (a|s)] (3.25)
1 − γ pπ (s)πθ (a|s)
∇θ J(θ) = Eρθ (s)πθ (a|s) [(Qπθ (s, a) − b(s))∇θ log πθ (a|s)] (3.26)
Any function that satisfies E [∇θ b(s)] = 0 is a valid baseline. This follows since
X X X
∇θ πθ (a|s)(Q(s, a) − b(s)) = ∇θ πθ (a|s)Q(s, a) − ∇θ [ πθ (a|s)]b(s) (3.27)
a a a
X
= ∇θ πθ (a|s)Q(s, a) − 0 (3.28)
a
A common choice for the baseline is b(s) = V (s). This is a valid choice since E[∇θ V (s)] = 0 if we use an old
(frozen) version of the policy that is independent of θ. This is a useful choice V (s) and Q(s, a) are correlated
and have similar magnitudes, so the scaling factor in front of the gradient term will be small, ensuring the
update steps are not too big.
52
Note that Q(s, a) − V (s) = A(s, a) is the advantage function. In the finite horizon case we get
" T
# " T
#
X X
k−1 k−1
∇J(θ) = Eτ ∇θ log πθ (ak |sk )γ (Qθ (sk , ak ) − Vθ (sk )) = Eτ ∇θ log πθ (ak |sk )γ Aθ (sk , ak )
k=1 k=1
(3.29)
We can also apply a baseline to the reward-to-go formulation to get
" T
#
X
∇J(θ) = Eτ ∇θ log πθ (ak |sk )γ k−1
(Gk − b(sk )) (3.30)
k=1
We can derive analogous baselines for the infinite horizon case, defined in terms of pγπ .
T
X
θ ←θ+η γ k−1 (Gk − b(sk ))∇θ log πθ (ak |sk ) (3.31)
k=1
See Algorithm 4 for the pseudocode, where we use the value function as a baseline, estimated using TD.
53
Equation (3.19) becomes
T
X −1
θ ←θ+η γ t (Gt:t+1 − Vw (st )) ∇θ log πθ (at |st ) (3.32)
t=0
T
X −1
=θ+η γ t rt + γVw (st+1 ) − Vw (st ) ∇θ log πθ (at |st ) (3.33)
t=0
Note that δt = rt+1 + γVw (st+1 ) − Vw (st ) is a single sample approximation to the advantage function
Adv(st , at ) = Q(st , at ) − V (st ). This method is therefore called advantage actor critic or A2C. See
Algorithm 5 for the pseudo-code.3 (Note that Vw (st+1 ) = 0 if st is a done state, representing the end of an
episode.) Note that this is an on-policy algorithm, where we update the value function Vwπ to reflect the value
of the current policy π. See Section 3.2.3 for further discussion of this point.
13 until converged ;
In practice, we should use a stop-gradient operator on the target value for the TD update, for reasons
explained in Section 2.5.2.5. Furthermore, it is common to add an entropy term to the policy, to act as a
regularizer (to ensure the policy remains stochastic, which smoothens the loss function — see Section 3.6.8).
If we use a shared network with separate value and policy heads, we need to use a single loss function for
training all the parameters ϕ. Thus we get the following loss, for each trajectory, where we want to minimize
TD loss, maximize the policy gradient (expected reward) term, and maximize the entropy term.
T
1X
L(ϕ; τ ) = [λT D LT D (st , at , rt , st+1 ) − λP G JP G (st , at , rt , st+1 ) − λent Jent (st )] (3.34)
T t=1
yt = rt + γ(1 − done(st ))Vϕ (st+1 ) (3.35)
LT D (st , at , rt , st+1 ) = (sg(yt ) − Vϕ (s)) 2
(3.36)
JP G (st , at , rt , st+1 ) = (sg(yt − Vϕ (st )) log πϕ (at |st ) (3.37)
X
Jent (st ) = − πϕ (a|st ) log πϕ (a|st ) (3.38)
a
To handle the dynamically varying scales of the different loss functions, we can use the PopArt method of
[Has+16; Hes+19] to allow for a fixed set of hyper-parameter values for λi . (PopArt stands for “Preserving
Outputs Precisely, while Adaptively Rescaling Targets”.)
3 In [Mni+16], they proposed a distributed version of A2C known as A3C which stands for “asynchrononous advantage actor
critic”.
54
3.2.2 Generalized advantage estimation (GAE)
In A2C, we replaced the high variance, but unbiased, MC return Gt with the low variance, but biased,
one-step bootstrap return Gt:t+1 = rt + γVw (st+1 ). More generally, we can compute the n-step estimate
A(n)
w (st , at ) = Gt:t+n − Vw (st ) (3.40)
(1)
At = rt + γvt+1 − vt (3.41)
(2)
At 2
= rt + γrt+1 + γ vt+2 − vt (3.42)
..
. (3.43)
(∞)
At 2
= rt + γrt+1 + γ rt+2 + · · · − vt (3.44)
(1) (∞)
At is high bias but low variance, and At is unbiased but high variance.
Instead of using a single value of n, we can take a weighted average. That is, we define
PT (n)
n=1 wn At
At = PT (3.45)
n=1 wn
δt = rt + γvt+1 − vt (3.46)
T −(t+1)
At = δt + γλδt+1 + · · · + (γλ) δT −1 = δt + γλAt+1 (3.47)
Here λ ∈ [0, 1] is a parameter that controls the bias-variance tradeoff: larger values decrease the bias but
increase the variance. This is called generalized advantage estimation (GAE) [Sch+16b]. See Algorithm 6
for some pseudocode. Using this, we can define a general actor-critic method, as shown in Algorithm 7.
We can generalize this approach even further, by using gradient estimators of the form
" ∞
#
X
∇J(θ) = E Ψt ∇ log πθ (at |st ) (3.48)
t=0
55
Algorithm 7: Actor critic with GAE
1 Initialize parameters ϕ, environment state s
2 repeat
3 (s1 , a1 , r1 , . . . , sT ) = rollout(s, πϕ )
4 v1:T = Vϕ (s1:T )
5 (A1:T , y1:T ) = sg(GAE(r1:T , v1:T , γ, λ))
PT
6 L(ϕ) = T1 t=1 λT D (Vϕ (st ) − yt )2 − λP G At log πϕ (at |st ) − λent H(πϕ (·|st ))
7 ϕ := ϕ − η∇L(ϕ)
8 until converged ;
θk+1 = θk − ηk gk (3.54)
where gk = ∇θ L(θk ) is the gradient of the loss at the previous parameter values, and ηk is the learning rate.
It can be shown that the above update is equivalent to minimizing a locally linear approximation to the loss,
L̂k , subject to the constraint that the new parameters do not move too far (in Euclidean distance) from the
56
(a) (b)
Figure 3.1: Changing the mean of a Gaussian by a fixed amount (from solid to dotted curve) can have more impact
when the (shared) variance is small (as in a) compared to when the variance is large (as in b). Hence the impact (in
terms of prediction accuracy) of a change to µ depends on where the optimizer is in (µ, σ) space. From Figure 3 of
[Hon+10], reproduced from [Val00]. Used with kind permission of Antti Honkela.
previous parameters:
where the step size ηk is proportional to ϵ. This is called a proximal update [PB+14].
One problem with the SGD update is that Euclidean distance in parameter space does not make sense for
probabilistic models. For example, consider comparing two Gaussians, pθ = p(y|µ, σ) and pθ′ = p(y|µ′ , σ ′ ).
The (squared) Euclidean distance between the parameter vectors decomposes as ||θ−θ ′ ||22 = (µ−µ′ )2 +(σ−σ ′ )2 .
However, the predictive distribution has the form exp(− 2σ1 2 (y − µ)2 ), so changes in µ need to be measured
relative to σ. This is illustrated in Figure 3.1(a-b), which shows two univariate Gaussian distributions (dotted
and solid lines) whose means differ by ϵ. In Figure 3.1(a), they share the same small variance σ 2 , whereas in
Figure 3.1(b), they share the same large variance. It is clear that the difference in µ matters much more (in
terms of the effect on the distribution) when the variance is small. Thus we see that the two parameters
interact with each other, which the Euclidean distance cannot capture.
The key to NGD is to measure the notion of distance between two probability distributions in terms
of the KL divergence. This can be approximated in terms of the Fisher information matrix (FIM). In
particular, for any given input x, we have
1 T
DKL (pθ (y|x) ∥ pθ+δ (y|x)) ≈ δ Fx δ (3.57)
2
where Fx is the FIM
Fx (θ) = −Epθ (y|x) ∇2 log pθ (y|x) = Epθ (y|x) (∇ log pθ (y|x))(∇ log pθ (y|x))T (3.58)
We now replace the Euclidean distance between the parameters, d(θk , θk+1 ) = ||δ||22 , with
where δ = θk+1 − θk and Fk = Fx (θk ) for a randomly chosen input x. This gives rise to the following
constrained optimization problem:
If we replace the constraint with a Lagrange multiplier, we get the unconstrained objective:
57
Solving Jk (δ) = 0 gives the update
δ = −ηk F−1
k gk (3.62)
The term F−1 g is called the natural gradient. This is equivalent to a preconditioned gradient update,
where we use the inverse FIM as a preconditioning matrix. We can compute the (adaptive) learning rate
using r
ϵ
ηk = (3.63)
gk F−1
T
k gk
Computing the FIM can be hard. A simple approximation PN is to replace the model’s distribution
PN with the
empirical distribution. In particular, define pD (x, y) = N1 n=1 δxn (x)δyn (y), pD (x) = N1 n=1 δxn (x) and
pθ (x, y) = pD (x)p(y|x, θ). Then we can compute the empirical Fisher [Mar16] as follows:
F(θ) = Epθ (x,y) ∇ log p(y|x, θ)∇ log p(y|x, θ)T (3.64)
≈ EpD (x,y) ∇ log p(y|x, θ)∇ log p(y|x, θ) T
(3.65)
1 X
= ∇ log p(y|x, θ)∇ log p(y|x, θ)T (3.66)
|D|
(x,y)∈D
where At is the advantage function at step t of the random trajectory generated by the policy at iteration k.
Now we compute
T T
1X 1X
gk = gkt , Fk = gkt gTkt (3.68)
T t=1 T t=1
and compute δ k+1 = −ηk F−1 k gk . This approach is called natural policy gradient [Kak01; Raj+17].
We can compute F−1 k gk without having to invert Fk by using the conjugate gradient method, where
each CG step uses efficient methods for Hessian-vector products [Pea94]. This is called Hessian free
optimization [Mar10]. Similarly, we can efficiently compute gTk (F−1k gk ).
As a more accurate alternative to the empirical Fisher, [MG15] propose the KFAC method, which stands
for “Kronecker factored approximate curvature”; this approximates the FIM of a DNN as a block diagonal
matrix, where each block is a Kronecker product of two small matrices. This was applied to policy gradient
learning in [Wu+17].
58
actions. (We require that the actions are continuous, because we will take the Jacobian of the Q function wrt
the actions.)
The benefit of using a deterministic policy, as opposed to a stochastic policy, is that we can modify the
policy gradient method so that it can work off policy without needing importance sampling, as we will see. In
addition, the feedback signal for learning is based on the vector-valued gradient of the value function, which
is more informative than a scalar reward signal.
The deterministic policy gradient theorem [Sil+14] tells us that the gradient of this expression is given
by
where ∇θ µθ (s) is the Nθ × NA Jacobian matrix, and NA and Nθ are the dimensions of A and θ, respectively.
The intuition for this equation is as follows: the change in the expected value due to changing the parameters,
∇θ J(µθ ) ∈ RNθ , is equal to the change in the policy output (i.e., the actions) due to changing the parameters,
∇θ µθ (s) ∈ RNθ ×NA times the change in the expected value due to the change in the actions, ∇a Qµθ (s, a) ∈
RN A .
For stochastic policies of the form πθ (a|s) = µθ (s) + noise, the standard policy gradient theorem reduces
to the above form as the noise level goes to zero.
Note that the gradient estimate in Equation (3.71) integrates over the states but not over the actions, which
helps reduce the variance in gradient estimation from sampled trajectories. However, since the deterministic
policy does not do any exploration, we need to use an off-policy method for training. This collects data from
a stochastic behavior policy πb , whose stationary state distribution is pγπb . The original objective, J(µθ ), is
approximated by the following:
Jb (µθ ) ≜ Epγπb (s) [Vµθ (s)] = Epγπb (s) [Qµθ (s, µθ (s))] (3.72)
where we have a dropped a term that depends on ∇θ Qµθ (s, a) and is hard to estimate [Sil+14].
To apply Equation (3.73), we may learn Qw ≈ Qµθ with TD, giving rise to the following updates:
So we learn both a state-action critic Qw and an actor µθ . This method avoids importance sampling in the
actor update because of the deterministic policy gradient, and we avoids it in the critic update because of the
use of Q-learning.
If Qw is linear in w, and uses features of the form ϕ(s, a) = aT ∇θ µθ (s), then we say the function
approximator for the critic is compatible with the actor; in this case, one can show that the above
approximation does not bias the overall gradient.
The basic off-policy DPG method has been extended in various ways, some of which we describe below.
59
3.2.6.2 DDPG
The DDPG algorithm of [Lil+16], which stands for “deep deterministic policy gradient”, uses the DQN
method (Section 2.5.2.2) to learn the Q function, and then uses this to evaluate the policy. In more detail,
the actor tries to minimize the output of the critic
where the loss is averaged over states s drawn from the replay buffer. The critic tries to minimize the 1-step
TD loss, as in Q-learning:
where Qw is the target critic network, and the samples (s, a, r, a′ ) are drawn from a replay buffer. (See
Section 2.5.2.6 for a discussion of target networks.)
The D4PG algorithm [BM+18], which stands for “distributed distributional DDPG”, extends DDPG to
handle distributed training, and to handle distributional RL (see Section 7.3).
Second it uses clipped double Q learning, which is an extension of the double Q-learning discussed in
Section 2.5.3.1 to avoid over-estimation bias. In particular, the target values for TD learning are defined using
Third, it uses delayed policy updates, in which it only updates the policy after the value function has
stabilized. (See also Section 3.2.3.) See Algorithm 8 for the pseudcode.
(Note that the states are sampled from a replay buffer, so may be off-policy, but the actions are sampled
from the current policy, so are on-policy.)
If we ignore the FIM preconditioner F −1 , we see that the update is similar to the one used in the DPG
T
theorem, except we replace the Jacobian ∇θ µθ (s) with ∇θ (∇a log πθ (a|s)) ∈ RNθ ×NA . Intuitively this
4 Although the method of [Zha+18] and the SVG(0) method of [Hee+15] also support stochastic policies, they rely on the
reparameterization trick, which is not always applicable (e.g., if the policy is a mixture of Gaussians). In addition, WPO makes
use of natural gradients, whereas these are first-order methods.
5 Note that f (a, θ) = log π (a|s) is a scalar-valued function of θ and a (for a fixed s); the notation ∇ (∇ f (a, θ)T ) is
θ θ a
another way of writing the Jacobian matrix [ ∂θ∂f ] .
∂a ij
i j
60
Algorithm 8: TD3
1 Initialize environment state s, policy parameters θ, target policy parameters θ, critic parameters wi ,
target critic parameters wi = wi , replay buffer D = ∅, discount factor γ, EMA rate ρ, step size ηw ,
ηθ .
2 repeat
3 a = µθ (s) + noise
4 (s′ , r) = step(a, s)
5 D := D ∪ {(s, a, r, s′ )}
6 s ← s′
7 for G updates do
8 Sample a minibatch B = {(sj , aj , rj , s′j )} from D
9 w = update-critics(θ, w, B)
10 Sample a minibatch B = {(sj , aj , rj , s′j )} from D
11 θ = update-policy(θ, w, B)
12 until converged ;
13 .
14 def update-critics(θ, w, B):
15 Let (sj , aj , rj , s′j )B
j=1 = B
16 for j = 1 : B do
17 ãj = µθ (s′j ) + clip(noise, −c, c)
18 yj = rj + γ mini=1,2 Qwi (s′j , ãj )
19 for i = 1 : 2 do P
(s,a,r,s′ )j ∈B (Qwi (sj , aj ) − sg(yj ))
1 2
20 L(wi ) = |B|
21 wi ← wi − ηw ∇L(wi ) // Descent
22 wi := ρwi + (1 − ρ)wi //Update target networks with EMA
23 Return w1:N , w1:N
24 .
25 def update-actor(θ, w, B):
1
P 2
26 J(θ) = |B| s∈B (Qw1 (s, µθ (s)))
27 θ ← θ + ηθ ∇J(θ) // Ascent
28 θ := ρθ + (1 − ρ)θ //Update target policy network with EMA
29 Return θ, θ
61
aQ (s, a)
Q (s, a)
aQ (s, a)
captures the change in probability flow over the action space due to a change in the parameters. See
Figure 3.2 for an illustration.
However, the use of the FIM preconditioner keeps the update closer to the true gradient flow. (Indeed, in
the case of a Gaussian policy and quadratic value function, WPO is exactly the Wasserstein gradient flow if
you use the FIM, but is very different if you don’t.) Furthermore, this preconditioner can avoid numerical
issues which can arise as the policy converges to a deterministic policy, leading to a blowing up of the gradient
term ∇a log πθ (a|s).
In general, computing the FIM can be intractable. However, the authors assume the policy is a diagonal
Gaussian, for which the FIM is diagonal:
diag σ12 , σ12 , . . . , σ12 0
F(µ, σ) = 1 2 d (3.83)
2
0 diag , 2 , . . . , σ22
σ12 σ22 d
62
the total variation distance between two distributions be given by
1 1X
TV(p, q) ≜ ||p − q||1 = |p(s) − q(s)| (3.84)
2 2 s
where C π,πk = maxs |Eπ(a|s) [Aπk (s, a)] |. In the above, L(π, πk ) is a surrogate objective, and the second term
is a penalty term.
If we can optimize this lower bound (or a stochastic approximation, based on samples from the current
policy πk ), we can guarantee monotonic policy improvement (in expectation) at each step. We will replace
this objective with a trust-region update that is easier to optimize:
The constraint bounds the worst-case performance decline at each update. The overall procedure becomes
an approximate policy improvement method. There are various ways of implementing the above method in
practice, some of which we discuss below. (See also [GDWF22], who propose a framework called mirror
learning, that justifies these “approximations” as in fact being the optimal thing to do for a different kind of
objective; see also [Vas+21].)
ϵ2
then π also satisfies the TV constraint with δ = 2 . Next it considers a first-order expansion of the surrogate
objective to get
π(a|s) πk
L(π, πk ) = E pγ
πk (s)πk (a|s)
A (s, a) ≈ gTk (θ − θk ) (3.88)
πk (a|s)
where gk = ∇θ L(πθ , πk )|θk . Finally it considers a second-order expansion of the KL term to get the
approximate constraint
1
Epγπk (s) [DKL (πk ∥ π) (s)] ≈ (θ − θk )T Fk (θ − θk ) (3.89)
2
where Fk = gk gTk is an approximation to the Fisher information matrix (see Equation (3.68)). We then use
the update
θk+1 = θk + ηk vk (3.90)
q
where vk = F−1
k gk is the natural gradient, and the step size is initialized to ηk =
2δ
vTk Fk vk
. (In practice we
compute vk by approximately solving the linear system Fk v = gk using conjugate gradient methods, which
just require matrix vector multiplies.) We then use a backtracking line search procedure to ensure the trust
region is satisfied.
63
3.3.3 Proximal Policy Optimization (PPO)
In this section, we describe the the proximal policy optimization or PPO method of [Sch+17], which is a
simplification of TRPO.
We start by noting the following result:
1 π(a|s)
Epπk (s) [TV(π, πk )(s)] = E(s,a)∼pπk |
γ γ − 1| (3.91)
2 πk (a|s)
This holds provided the support of π is contained in the support of πk at every state. We then use the
following update:
πk+1 = argmax E(s,a)∼pγπk [min (ρk (s, a)Aπk (s, a), ρ̃k (s, a)Aπk (s, a))] (3.92)
π
64
Unfortunately, this kind of KL is harder to compute, since we are taking expectations wrt the unknown
distribution π.
To solve this problem, VMPO adopts an EM-type approach. In the E step, we compute a non-parametric
version of the state-action distribution given by the unknown new policy:
ψ(s, a) = π(a|s)pγπk (s) (3.94)
The optimal new distribution is given by
ψk+1 = argmax Eψ(s,a) [Aπk (s, a)] s.t. DKL (ψ ∥ ψk ) ≤ δ (3.95)
ψ
In the M step, we project this target distribution back onto the space of parametric policies, while satisfying
the KL trust region constraint:
πk+1 = argmax E(s,a)∼pγπk [w(s, a) log π(a|s)] s.t. Epγπk [DKL (ψk ∥ ψ) (s)] ≤ δ (3.100)
π
65
However, since the trajectories are sampled from πb , we use importance sampling (IS) to correct for the
distributional mismatch, as first proposed in [PSS00]. This gives
n T −1
1 X p(τ (i) |π) X t (i)
JˆIS (π) ≜ γ rt (3.102)
n i=1 p(τ (i) |πb ) t=0
It can be verified that E_{π_b}[Ĵ_IS(π)] = J(π), that is, Ĵ_IS(π) is unbiased, provided that p(τ|π_b) > 0 whenever p(τ|π) > 0. The importance ratio, p(τ^(i)|π)/p(τ^(i)|π_b), is used to compensate for the fact that the data is sampled from π_b and not π. It can be simplified as follows:
p(τ|π)/p(τ|π_b) = [ p(s_0) Π_{t=0}^{T−1} π(a_t|s_t) p_S(s_{t+1}|s_t, a_t) p_R(r_t|s_t, a_t, s_{t+1}) ] / [ p(s_0) Π_{t=0}^{T−1} π_b(a_t|s_t) p_S(s_{t+1}|s_t, a_t) p_R(r_t|s_t, a_t, s_{t+1}) ] = Π_{t=0}^{T−1} π(a_t|s_t)/π_b(a_t|s_t)    (3.103)
This simplification makes it easy to apply IS, as long as the target and behavior policies are known. (If the behavior policy is unknown, we can estimate it from D, and replace π_b by its estimate π̂_b.) For convenience, define the per-step importance ratio at time t by

ρ_t(τ) ≜ π(a_t|s_t)/π_b(a_t|s_t)    (3.104)
We can reduce the variance of the estimator by noting that the reward r_t is independent of the trajectory beyond time t. This leads to a per-decision importance sampling variant:

Ĵ_PDIS(π) ≜ (1/n) Σ_{i=1}^n Σ_{t=0}^{T−1} ( Π_{t′≤t} ρ_{t′}(τ^(i)) ) γ^t r_t^(i)    (3.105)
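The following NumPy sketch illustrates the per-decision IS estimator of Equation (3.105) on a batch of logged trajectories; the data layout (one dict per trajectory with recorded behavior-policy probabilities) is an assumption for illustration.

```python
import numpy as np

def pdis_estimate(trajectories, pi, gamma=0.99):
    """Per-decision importance sampling estimate of J(pi) (Eq. 3.105).

    Each trajectory is a dict with equal-length arrays 'states', 'actions',
    'rewards', and 'behavior_probs' (pi_b(a_t|s_t) logged at collection time).
    pi(s, a) returns the target-policy probability of action a in state s.
    """
    returns = []
    for tau in trajectories:
        rho = np.array([pi(s, a) for s, a in zip(tau["states"], tau["actions"])])
        rho = rho / tau["behavior_probs"]        # per-step ratios rho_t
        cum_rho = np.cumprod(rho)                # prod_{t' <= t} rho_{t'}
        T = len(tau["rewards"])
        discounts = gamma ** np.arange(T)
        returns.append(np.sum(cum_rho * discounts * tau["rewards"]))
    return np.mean(returns)
```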
We now turn to the V-trace method of [Esp+18] for off-policy value learning. Define δ_t = (r_t + γV(s_{t+1}) − V(s_t)) as the TD error at time t. To extend this to the off-policy case, we use the per-step importance ratio trick. However, to bound the variance of the estimator, we truncate the IS weights. In particular, we define
c_t = min( c, π(a_t|s_t)/π_b(a_t|s_t) ),    ρ_t = min( ρ, π(a_t|s_t)/π_b(a_t|s_t) )    (3.108)
where c and ρ are hyperparameters. We then define the V-trace target value for V (si ) as
v_i = V(s_i) + Σ_{t=i}^{i+n−1} γ^{t−i} ( Π_{t′=i}^{t−1} c_{t′} ) ρ_t δ_t    (3.109)
Note that we can compute these targets recursively using

v_i = V(s_i) + ρ_i δ_i + γ c_i (v_{i+1} − V(s_{i+1}))    (3.110)
The product of the weights ci . . . ct−1 (known as the “trace”) measures how much a temporal difference δt
at time t impacts the update of the value function at earlier time i. If the policies are very different, the
variance of this product will be large. So the truncation parameter c is used to reduce the variance. In
[Esp+18], they find c = 1 works best.
The use of the target ρ_t δ_t rather than δ_t means we are evaluating the value function for a policy that is somewhere between π_b and π. For ρ = ∞ (i.e., no truncation), we converge to the value function V^π, and for ρ → 0, we converge to the value function V^{π_b}. In [Esp+18], they find ρ = 1 works best. (An alternative
to clipping the importance weights is to use a resampling technique, and then use unweighted samples to
estimate the value function [Sch+19].)
Note that if c = ρ, then c_i = ρ_i. This gives rise to the simplified form

v_t = V(s_t) + Σ_{j=0}^{n−1} γ^j ( Π_{m=0}^{j} c_{t+m} ) δ_{t+j}    (3.111)
We can use the above V-trace targets to learn an approximate value function by minimizing the usual ℓ2 loss:

L(w) = E_{t∼D}[ (v_t − V_w(s_t))² ]    (3.112)
the gradient of which has the form ∇_w L(w) = E_{t∼D}[ (V_w(s_t) − v_t) ∇_w V_w(s_t) ] (up to a constant factor).
For the policy, consider the off-policy objective J_{π_b}(π_θ) = Σ_s p^γ_{π_b}(s) Σ_a π_θ(a|s) Q^π(s, a). Differentiating this and ignoring the term ∇_θ Q^π(s, a), as suggested by [DWS12], gives a way to (approximately) estimate the off-policy policy-gradient using a one-step IS correction ratio:
∇_θ J_{π_b}(π_θ) ≈ Σ_s Σ_a p^γ_{π_b}(s) ∇_θ π_θ(a|s) Q^π(s, a)    (3.115)
              = E_{p^γ_{π_b}(s), π_b(a|s)}[ (π_θ(a|s)/π_b(a|s)) ∇_θ log π_θ(a|s) Q^π(s, a) ]    (3.116)
In practice, we can approximate Q^π(s_t, a_t) by q_t = r_t + γv_{t+1}, where v_{t+1} is the V-trace estimate for state s_{t+1}. If we use V(s_t) as a baseline, to reduce the variance, we get the following gradient estimate for the policy:

∇J(θ) = E_{t∼D}[ ρ_t ∇_θ log π_θ(a_t|s_t) (r_t + γv_{t+1} − V_w(s_t)) ]    (3.117)

We can also replace the 1-step IS-weighted TD error ρ_t(r_t + γv_{t+1} − V_w(s_t)) with an IS-weighted GAE value by modifying the generalized advantage estimation method in Section 3.2.2. In particular, we just need to define λ_t = λ min(1, ρ_t). We denote the IS-weighted GAE estimate as A^ρ_t.6
6 For an implementation, see https://github.com/google-deepmind/rlax/blob/master/rlax/_src/multistep.py#L39
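Below is a small NumPy sketch of the V-trace targets of Equation (3.109), computed with the recursion noted above; the truncation levels c_bar and rho_bar correspond to c and ρ in Equation (3.108), and all array names are illustrative.

```python
import numpy as np

def vtrace_targets(values, rewards, rhos, gamma=0.99, c_bar=1.0, rho_bar=1.0):
    """Compute V-trace targets v_t for a single trajectory (Eq. 3.109).

    values:  V(s_0), ..., V(s_T)           (length T + 1)
    rewards: r_0, ..., r_{T-1}             (length T)
    rhos:    pi(a_t|s_t) / pi_b(a_t|s_t)   (length T)
    """
    T = len(rewards)
    cs = np.minimum(c_bar, rhos)
    # weighted TD errors rho_t * delta_t, with rho_t truncated at rho_bar
    weighted_deltas = np.minimum(rho_bar, rhos) * (rewards + gamma * values[1:] - values[:-1])
    vs = np.zeros(T + 1)
    vs[T] = values[T]                       # bootstrap from the final value estimate
    # backward recursion: v_t = V(s_t) + rho_t delta_t + gamma c_t (v_{t+1} - V(s_{t+1}))
    for t in reversed(range(T)):
        vs[t] = values[t] + weighted_deltas[t] + gamma * cs[t] * (vs[t + 1] - values[t + 1])
    return vs[:-1]
```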
3.4.2.3 Example: IMPALA
As an example of an off-policy AC method, we consider IMPALA, which stands for “Importance Weighted Actor-Learner Architecture” [Esp+18]. This uses shared parameters for the policy and value function (with different output heads), and adds an entropy bonus to ensure the policy remains stochastic. Thus we end up with the following objective, which is very similar to the on-policy actor-critic shown in Algorithm 7:

L(ϕ) = E_{t∼D}[ λ_TD (V_ϕ(s_t) − v_t)² − λ_PG A^ρ_t log π_ϕ(a_t|s_t) − λ_ent H(π_ϕ(·|s_t)) ]    (3.118)

The only difference from standard A2C is that we need to store the probabilities of each action, π_b(a_t|s_t), in addition to (s_t, a_t, r_t, s_{t+1}) in the dataset D, which can be used to compute the importance ratio ρ_t in Equation (3.108). [Esp+18] was able to use this method to train a single agent (using a shared CNN and LSTM for both value and policy) to play all 57 Atari games at a high level. Furthermore, they showed that their method — thanks to its off-policy corrections — outperformed the A3C method (a parallel version of A2C) described in Section 3.2.1.
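A minimal sketch of the combined loss in Equation (3.118), reusing the vtrace_targets helper above; the loss weights and the shapes of the inputs are illustrative assumptions, and in a real implementation the targets and advantages would be treated as constants (stop-gradient).

```python
import numpy as np

def impala_loss(values, rewards, logp_new, logp_behavior, entropy,
                gamma=0.99, w_td=0.5, w_pg=1.0, w_ent=0.01):
    """Value, policy-gradient, and entropy terms for one trajectory (cf. Eq. 3.118)."""
    rhos = np.exp(logp_new - logp_behavior)              # importance ratios rho_t
    vs = vtrace_targets(values, rewards, rhos, gamma)    # V-trace targets v_t
    # one-step IS-weighted advantage: rho_t (r_t + gamma v_{t+1} - V(s_t)), rho_t truncated at 1
    vs_next = np.append(vs[1:], values[-1])
    adv = np.minimum(1.0, rhos) * (rewards + gamma * vs_next - values[:-1])
    td_loss = np.mean((values[:-1] - vs) ** 2)
    pg_loss = -np.mean(adv * logp_new)
    return w_td * td_loss + w_pg * pg_loss - w_ent * np.mean(entropy)
```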
In [SHS20], they analyse the variance of the V-trace estimator, which relies on the truncated importance weights of Equation (3.108). They show that to keep this variance bounded, it is necessary to mix some off-policy data (from the replay buffer) with some fresh online data from the current policy.
Figure 3.3: A graphical model for optimal control.
3.6 RL as inference
In this section, we discuss an approach to policy optimization that reduces it to probabilistic inference. This
is called control as inference, or RL as inference, and has been discussed in numerous works (see e.g.,
[Att03; TS06; Tou09; ZABD10; RTV12; BT12; KGO12; HR17; Lev18; Fur+21; Zha+24a]). The primary
advantage of this approach is that it enables policy learning using off-policy data, while avoiding the need to
use (potentially high variance) importance sampling corrections. (This is because the inference approach
takes expectations wrt dq (s) instead of dπ (s), where q is an auxiliary distribution, π is the policy which is
being optimized, and d is the state visitation measure.) A secondary advantage is that it enables us to use
the large toolkit of methods for probabilistic modeling and inference to solve RL problems.7 The resulting
framework forms the foundation of the MPO method discussed in Section 3.6.5, the SAC method discussed in Section 3.6.8, as well as the SMC planning method discussed in Section 4.2.3, and some kinds of LLM test-time inference, as discussed in Section 6.1.3.5.
The core of these methods is based on the probabilistic model shown in Figure 3.3. This shows an MDP
augmented with new variables, O_t. These are called optimality variables, and indicate whether the action at time t is optimal or not. We assume these have the following probability distribution:

p(O_t = 1|s_t, a_t) ∝ exp(η^{−1} G(s_t, a_t))    (3.122)
where η > 0 is a temperature parameter, and G(s, a) is some quality function, such as G(s, a) = R(s, a), or
G(s, a) = Q(s, a) or G(s, a) = A(s, a). For brevity, we will just write p(O = 1|·) to denote the probability
of the event that Ot = 1 for all time steps. (Note that the specific value of 1 is arbitrary; this likelihood
function is really just a non-negative weighting term that biases the action trajectory, as we show below.)
7 Note, however, that we do not tackle the problem of epistemic uncertainty (exploration). Solving this in the context of
3.6.1 Deterministic case (planning/control as inference)
Our goal is to find trajectories that are optimal. That is, we would like to find the mode (or posterior samples)
from the following distribution:
" −1
TY
#" T #
Y
p(τ |O = 1, π) ∝ p(τ , O = 1|π) ∝ p(s1 ) π(at |st )p(st+1 |st , at ) p(Ot = 1|st , at )
t=1 t=1
(3.123)
(Typically the initial state s1 is known, in which case p(s1 ) is a delta function.)
The MAP sequence of actions, which we denote by â_{1:T}(s_1), is the optimal open loop plan. (It is called “open loop” because the agent does not need to observe the state: s_t is uniquely determined by s_1 and a_{1:t}, both of which are known.) Computing this trajectory is known as the control as inference problem
[Wat+21]. Such open loop planning problems can be solved using model predictive control methods, discussed
in Section 4.2.4.
where we define

p_π(τ) = p(s_1) Π_t p(s_{t+1}|s_t, a_t) π(a_t|s_t)    (3.126)

Since marginalizing over trajectories is difficult, we introduce a variational distribution q(τ) to simplify the computations. We assume q factors in the same way:

q(τ) = p(s_1) Π_t p(s_{t+1}|s_t, a_t) π_q(a_t|s_t)    (3.127)
Note that we use the true dynamics model p(s_{t+1}|s_t, a_t) when defining q, and only introduce the variational distribution for the actions, π_q(a_t|s_t). This is one way to avoid the optimism bias that can arise if we sample from an unconstrained q(τ|O = 1). To see this, suppose O = 1 is the event that we win the lottery. We do not want conditioning on this outcome to influence our belief in the probability of chance events, which is governed by p(s_{t+1}|s_t, a_t) and not p(s_{t+1}|s_t, a_t, O = 1). See [Lev18] for further discussion of this point.
Now note the following identity

D_KL(q(τ) ∥ p_π(τ|O = 1)) = E_q[ log q(τ) − log ( p_π(O = 1|τ) p_π(τ) / p_π(O = 1) ) ]    (3.128)
                          = E_q[ log q(τ) − log p_π(O = 1|τ) − log p_π(τ) ] + log p_π(O = 1)    (3.129)
Hence

log p_π(O = 1) = E_q[ log p(O = 1|τ) − log (q(τ)/p(τ)) + log (q(τ)/p(τ|O = 1)) ]    (3.130)
              = J(p_π, q) + D_KL(q(τ) ∥ p_π(τ|O = 1))    (3.131)

where J is defined by

J(π_p, π_q) = E_q[log p_π(O = 1|τ)] − D_KL(q(τ) ∥ p_π(τ))    (3.132)
            = E_q[ Σ_{t=1}^T η^{−1} G(s_t, a_t) − D_KL(π_q(·|s_t) ∥ π_p(·|s_t)) ]    (3.133)
Since DKL (q(τ ) ∥ pπ (τ |O = 1)) ≥ 0, we see that log p(O = 1|π) ≥ J(pπ , q); hence J is called the evidence
lower bound or ELBO. We can define the policy learning task as maximizing the ELBO, subject to the
constraints that πp and πq are distributions that integrate to 1 across actions for all states.
To extend to the infinite time discounted case, we define d_π(s) as the unnormalized discounted distribution over states

d_π(s) = Σ_{t=1}^∞ γ^t p(s_t = s|π)    (3.134)

We now replace Σ_t E_{q(s_t)} with E_{d_q(s)} to get the constrained objective

max_{π_p, π_q} J(π_p, π_q)   s.t.   ∫ d_q(s) ∫ π_p(a|s) da ds = 1,   ∫ d_q(s) ∫ π_q(a|s) da ds = 1    (3.135)
There are two main ways to solve this optimization problem, which we call “EM control” and “KL control”,
following [Fur+21]. We describe these below.
3.6.3 EM control
In this section, we discuss ways to optimize Equation (3.135) using the Expectation Maximization or EM algorithm, which is a widely used bound optimization method, also called an MM (majorize / maximize) method, that monotonically increases a lower bound on its objective (see [HL04] for a tutorial). In the E step, we maximize J wrt a non-parametric representation of the variational posterior π_q, while holding the parametric prior π_p = π_{θ_p^{k−1}} fixed at the value from the previous (k − 1)'th iteration, to get π_q^k. In the M step, we then maximize J wrt π_p, holding the variational posterior fixed at π_q^k, to get the updated policy π_{θ_p^k}.
In more detail, in the E step we maximize the following wrt π_q:

J(π_{θ_p^{k−1}}, π_q) = ∫ d_q(s) ∫ π_q(a|s) η^{−1} G(s, a) da ds
                      − ∫ d_q(s) ∫ π_q(a|s) log [ π_q(a|s) / π_{θ_p^{k−1}}(a|s) ] da ds
                      + λ ( 1 − ∫ d_q(s) ∫ π_q(a|s) da ds )    (3.136)
3.6.4 KL control (maximum entropy RL)
In KL control, we only optimize the variational posterior π_q, holding the prior π_p fixed. Thus we only have an E step. In addition, we represent π_q parametrically, as π_{θ_q}, instead of the non-parametric approach used by EM. If the prior π_p is uniform, and we use G(s, a) = R(s, a), then Equation (3.133) becomes

ηJ(π_p, π_q) = E_q[ Σ_{t=1}^T R(s_t, a_t) − η D_KL(π_q(·|s_t) ∥ unif) ] = E_q[ Σ_{t=1}^T R(s_t, a_t) + η H(π_q(·|s_t)) ] + const    (3.140)

where −H(q) = D_KL(q ∥ unif) − c = Σ_a q(a) log q(a) is the negative entropy function and c is a constant. This is called the maximum entropy RL objective [ZABD10; Haa+18a; Haa+18b]. This differs from the standard objective used in RL training (namely the expected sum of rewards) by virtue of the addition of the entropy regularizer. See Section 3.6.8 for further discussion.
In addition, the (inverse) temperature parameter η is solved for by minimizing the dual of Equation (3.137), which is given by

g(η) = ηϵ + η log E_{d_q(s) π_{θ_p^{k−1}}(a|s)}[ exp( η^{−1} Q^{π_{θ_p^{k−1}}}(s, a) ) ]    (3.143)
In the M step, MPO augments the objective in Equation (3.139) with a log prior at the k'th step of the form log p^k(θ_p) to create a MAP estimate. That is, it optimizes the following wrt θ_p:

J(q^k, π_{θ_p}) = E_{d_q(s) q(a|s)}[log π_{θ_p}(a|s)] + log p^k(θ_p)    (3.144)
We can think of this step as projecting the non-parametric policy q back to the space of parameterizable
policies Πθ .
We assume the prior is a Gaussian centered at the previous parameters,

p^k(θ) = N(θ|θ_k, λF_k) = c exp( −λ (θ − θ_k)^T F_k^{−1} (θ − θ_k) )    (3.145)

where F_k is the Fisher information matrix. If we view this as a second order approximation to the KL, we can rewrite the objective as

max_{θ_p} E_{d_q(s)}[ E_{q(a|s)}[log π(a|s, θ_p)] − λ D_KL(π(a|s, θ_k) ∥ π(a|s, θ_p)) ]    (3.146)
We can approximate the expectation wrt d_q(s) by sampling states from a replay buffer, and the expectation wrt q(a|s) by sampling from the policy. The KL term can be computed analytically for Gaussian policies. We can then optimize this objective using SGD.
Note that we can also rewrite this as a constrained optimization problem

max_{θ_p} E_{d_q(s)} E_{q(a|s)}[log π(a|s, θ_p)]   s.t.   E_{d_q(s)}[D_KL(π(a|s, θ_k) ∥ π(a|s, θ_p))] ≤ ϵ_m    (3.147)
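To make the EM-style updates concrete, here is a minimal NumPy sketch of an MPO-style E step (exponentiated-value action weights) and M step (weighted maximum likelihood) for a discrete action space; the helper names and the fixed temperature eta are simplifying assumptions (in MPO proper, η is obtained by minimizing the dual in Equation (3.143)), and the KL trust-region term of the M step is omitted for brevity.

```python
import numpy as np

def mpo_e_step(q_values, behavior_probs, eta=1.0):
    """Non-parametric E step: reweight actions by exp(Q / eta).

    q_values:       array [B, A] of Q(s, a) for a batch of sampled states
    behavior_probs: array [B, A] of the current policy pi_{theta_k}(a|s)
    Returns per-state action weights q(a|s) that sum to 1 over actions.
    """
    w = behavior_probs * np.exp(q_values / eta)
    return w / w.sum(axis=1, keepdims=True)

def mpo_m_step_loss(log_probs, weights):
    """M step: weighted maximum likelihood projection back onto the policy class
    (cf. Eq. 3.144); minimize this wrt the policy parameters."""
    return -np.mean(np.sum(weights * log_probs, axis=1))
```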
3.6.6 Sequential Monte Carlo Policy Optimisation (SPO)
In this section, we discuss the SPO method of [Mac+24]. This is a model-based version of MPO, which uses Sequential Monte Carlo (SMC) to perform approximate inference in the E step. In particular, it samples from a distribution over optimal future trajectories starting from the current state, s_t, using the current policy π_{θ_p} and dynamics model T(s′|s, a). From this it derives a non-parametric distribution over the optimal actions to take at the next step, q(a_t|s_t) (see Section 4.2.3 for details). This becomes a target for the parametric policy update in the M step, which is the same weighted maximum likelihood method used by MPO.
Note that the entropy term makes the objective easier to optimize, and encourages exploration. To optimize
this, we can perform a policy evaluation step, and then a policy improvement step.
where

V(s) = E_{π(a|s)}[Q(s, a) − α log π(a|s)]    (3.150)

is the soft value function. If we iterate Q_{k+1} = T^π Q_k, this will converge to the soft Q function for π.
In the tabular case, we can derive the optimal soft value function as follows. First, by definition, we have

V*(s) := max_π Σ_a π(a|s) [ Q*(s, a) − α log π(a|s) ]    (3.151)

Introducing a Lagrange multiplier λ for the normalization constraint Σ_a π(a|s) = 1 and setting the derivative of the Lagrangian to zero gives

∂L/∂π(a|s) = Q*(s, a) − α(1 + log π(a|s)) − λ = 0    (3.153)
Since

log π*(a|s) = Q*(s, a)/α − log Σ_{a′} exp( Q*(s, a′)/α )    (3.157)

we have

Q*(s, a) − α log π*(a|s) = α log Σ_{a′} exp( Q*(s, a′)/α )    (3.158)
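The identities above say that the optimal soft value is a log-sum-exp (softmax) of the soft Q values, and the optimal policy is a Boltzmann distribution over them. The following NumPy sketch performs tabular soft value iteration under these definitions; the transition-tensor layout and reward matrix are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(P, R, alpha=0.1, gamma=0.99, iters=500):
    """Tabular soft value iteration.

    P: transition tensor [S, A, S'] with P[s, a, s'] = p(s'|s, a)
    R: reward matrix [S, A]
    Returns the optimal soft Q function and the Boltzmann-optimal policy.
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        # V*(s) = alpha * log sum_a exp(Q*(s, a) / alpha)   (cf. Eq. 3.158)
        V = alpha * logsumexp(Q / alpha, axis=1)
        Q = R + gamma * P @ V                                # soft Bellman backup
    # pi*(a|s) proportional to exp(Q*(s, a) / alpha)         (cf. Eq. 3.157)
    V = alpha * logsumexp(Q / alpha, axis=1)
    policy = np.exp((Q - V[:, None]) / alpha)
    return Q, policy
```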
Here y is the frozen target value used in the critic loss J_Q(w), and V_w̄(s) is a frozen version of the soft value function from Equation (3.150):

V_w̄(s_t) = E_{π(a_t|s_t)}[ Q_w̄(s_t, a_t) − α log π(a_t|s_t) ]    (3.162)

where w̄ is the EMA version of w. (The use of a frozen target is to avoid the bootstrapping instabilities discussed in Section 2.5.2.5.)
To avoid the positive overestimation bias that can occur with actor-critic methods, [Haa+18a] suggest fitting two soft Q functions, by optimizing J_Q(w_i), for i = 1, 2, independently. Inspired by clipped double Q learning, used in TD3 (Section 3.2.6.3), the targets are defined as

y(r_{t+1}, s_{t+1}; w̄_{1:2}, θ) = r_{t+1} + γ [ min_{i=1,2} Q_w̄i(s_{t+1}, ã_{t+1}) − α log π_θ(ã_{t+1}|s_{t+1}) ]    (3.163)

where ã_{t+1} ∼ π_θ(s_{t+1}) is a sampled next action. In [Che+20], they propose the REDQ method (Section 2.5.3.3) which uses a random ensemble of N ≥ 2 networks instead of just 2.
where H is the target entropy (a hyper-parameter). This objective is approximated by sampling actions from
the replay buffer.
Algorithm 10: SAC
1 Initialize environment state s, policy parameters θ, N critic parameters w_i, target parameters w̄_i = w_i, replay buffer D = ∅, discount factor γ, EMA rate ρ, step sizes η_w, η_π
2 repeat
3     Take action a ∼ π_θ(·|s)
4     (s′, r) = step(a, s)
5     D := D ∪ {(s, a, r, s′)}
6     s ← s′
7     for G updates do
8         Sample a minibatch B = {(s_j, a_j, r_j, s′_j)} from D
9         w = update-critics(θ, w, B)
10        Sample a minibatch B = {(s_j, a_j, r_j, s′_j)} from D
11        θ = update-policy(θ, w, B)
12 until converged
13 .
14 def update-critics(θ, w, B):
15     Let {(s_j, a_j, r_j, s′_j)}_{j=1}^{|B|} = B
16     y_j = y(r_j, s′_j; w̄_{1:N}, θ) for j = 1 : |B|
17     for i = 1 : N do
18         L(w_i) = (1/|B|) Σ_{(s,a,r,s′)_j ∈ B} (Q_{w_i}(s_j, a_j) − sg(y_j))²
19         w_i ← w_i − η_w ∇L(w_i) // Descent
20         w̄_i := ρ w̄_i + (1 − ρ) w_i // Update target networks
21     Return w_{1:N}, w̄_{1:N}
22 .
23 def update-policy(θ, w, B):
24     Q̂(s, a) ≜ (1/N) Σ_{i=1}^N Q_{w_i}(s, a) // Average critic
25     J(θ) = (1/|B|) Σ_{s∈B} [ Q̂(s, ã_θ(s)) − α log π_θ(ã_θ(s)|s) ],   ã_θ(s) ∼ π_θ(·|s)
26     θ ← θ + η_π ∇J(θ) // Ascent
27     Return θ
3.6.9 Active inference
Control as inference is closely related to a technique known as active inference, as we explain below. For
more details on the connection, see [Mil+20; WIP20; LÖW21; Saj+21; Tsc+20].
The active inference technique was developed in the neuroscience community, which has its own vocabulary for standard ML concepts. We start with the free energy principle [Fri09; Buc+17; SKM18; Ger19; Maz+22].
Maz+22]. The FEP is equivalent to using variational inference to perform state estimation (perception) and
parameter estimation (learning) in a latent variable model. In particular, consider an LVM p(z, o|θ) with
hidden states z, observations o, and parameters θ. We define the variational free energy to be
F(o|θ) = D_KL(q(z|o, θ) ∥ p(z|o, θ)) − log p(o|θ) = E_{q(z|o,θ)}[log q(z|o, θ) − log p(o, z|θ)] ≥ − log p(o|θ)    (3.167)

which is the KL between the approximate variational posterior q and the true posterior p, minus the log evidence log p(o|θ); thus the free energy F is an upper bound on the negative log evidence (the agent's “surprise”). State estimation (perception) corresponds to solving
minq(z|o,θ) F(o|θ), and parameter estimation (model fitting) corresponds to solving minθ F(o|θ), just as in
the EM (expectation maximization) algorithm. (We can also be Bayesian about θ, as in variational Bayes
EM, instead of just computing a point estimate.) This EM procedure will minimize the VFE, which is an
upper bound on the negative log marginal likelihood of the data. In other words, it adjusts the model (belief
state and parameters) so that it better predicts the observations, so the agent is less surprised (minimizes
prediction errors).
To extend the above FEP to decision making problems, we define the expected free energy (EFE), G(a), of a candidate action sequence a, where q(o|a) is the posterior predictive distribution over future observations given a. (We should also condition on any observed history / agent state h, and the model parameters θ, but we omit this from the notation for brevity.)
We see that we can decompose the EFE into two terms. First there is the intrinsic value, known as the
epistemic drive. Minimizing this will encourage the agent to choose actions which maximize the mutual
information between the observations o and the hidden states z, thus reducing uncertainty about the hidden
states. (This is called epistemic foraging.) Second there is the extrinsic value, known as the exploitation
term. Maximizing this will encourage the agent to choose actions that result in observations that match
its prior. For example, if the agent predicts that the world will look brighter when it flips a light switch,
it can take the action of flipping the switch to fulfill this prediction. This prior can be related to a reward function by defining p(o) ∝ e^{R(o)}, encouraging goal directed behavior, exactly as in control-as-inference (cf. [Vri+25]). However, the active inference approach provides a way of choosing actions without needing to specify a reward.
Since solving for the optimal action at each step can be slow, it is possible to amortize this cost by training a policy network to compute π(a|h) = argmin_a G(a|h), where h is the observation history (or current state), as shown in [Mil20; HL20]; this is called “deep active inference”.
Overall, we see that this framework provides a unified theory of both perception and action, both of which
try to minimize some form of free energy. In particular, minimizing the expected free energy will cause the
agent to pick actions to reduce its uncertainty about its hidden states, which can then be used to improve
its predictive model pθ of observations; this in turn will help minimize the VFE of future observations, by
updating the internal belief state q(z|o, θ) to explain the observations. In other words, the agent acts so it
can learn so it becomes less surprised by what it sees. This ensures the agent is in homeostasis with its
environment.
Note that active inference is often discussed in the context of predictive coding. This is equivalent to
a special case of FEP where two assumptions are made: (1) the generative model p(z, o|θ) is a nonlinear
hierarchical Gaussian model (similar to a VAE decoder), and (2) the variational posterior approximation uses
a diagonal Laplace approximation, q(z|o, θ) = N (z|ẑ, H) with the mode ẑ being computed using gradient
descent, and H being the Hessian at the mode. This can be considered a non-amortized version of a VAE,
where inference (E step) is done with iterated gradient descent, and parameter estimation (M step) is also
done with gradient descent. (A more efficient incremental EM version of predictive coding, which updates
{ẑn : n = 1 : N } and θ in parallel, was recently presented in [Sal+24], and an amortized version in [Tsc+23].)
For more details on predictive coding, see [RB99; Fri03; Spr17; HM20; MSB21; Mar21; OK22; Sal+23;
Sal+24].
Chapter 4
Model-based RL
4.1 Introduction
Model-free approaches to RL typically need a lot of interactions with the environment to achieve good performance. For example, state-of-the-art methods for the Atari benchmark, such as Rainbow (Section 2.5.2.2), use millions of frames, equivalent to many days of playing at the standard frame rate. By contrast, humans can achieve the same performance in minutes [Tsi+17]. Similarly, OpenAI's robot hand controller [And+20] needs 100 years of simulated data to learn to manipulate a Rubik's cube.
One promising approach to greater sample efficiency is model-based RL (MBRL). In the simplest
approach to MBRL, we first learn the state transition or dynamics model pS (s′ |s, a) — also called a world
model — and the reward function R(s, a), using some offline trajectory data, and then we use these models
to compute a policy (e.g., using dynamic programming, as discussed in Section 2.2, or using some model-free
policy learning method on simulated data, as discussed in Chapter 3). It can be shown that the sample
complexity of learning the dynamics is less than the sample complexity of learning the policy [ZHR24].
However, the above two-stage approach — where we first learn the model, and then plan with it — can
suffer from the usual problems encountered in offline RL (Section 7.7), i.e., the policy may query the model
at a state for which no data has been collected, so predictions can be unreliable, causing the policy to learn
the wrong thing. To get better results, we have to interleave the model learning and policy learning, so that
one helps the other (since the policy determines what data is collected).
There are two main ways to perform MBRL. In the first approach, known as decision-time planning or
model predictive control, we use the model to choose the next action by searching over possible future
trajectories. We then score each trajectory, pick the action corresponding to the best one, take a step in the
environment, and repeat. (We can also optionally update the model based on the rollouts.) This is discussed
in Section 4.2.
The second approach is to use the current model and policy to rollout imaginary trajectories, and to use
this data (optionally in addition to empirical data) to improve the policy using model-free RL; this is called
background planning, and is discussed in Section 4.3.
The advantage of decision-time planning is that it allows us to train a world model on reward-free data,
and then use that model to optimize any reward function. This can be particularly useful if the reward
contains changing constraints, or if it is an intrinsic reward (Section 7.4) that frequently changes based on
the knowledge state of the agent. The downside of decision-time planning is that it is much slower. However,
it is possible to combine the two methods, as we discuss below. For an empirical comparison of background
planning and decision-time planning, see [AP24].
Some generic pseudo-code for an MBRL agent is given in Algorithm 11. (The rollout function is defined
in Algorithm 12; some simple code for model learning is shown in Algorithm 13, although we discuss other
loss functions in Section 4.4; finally, the code for the policy learning is given in other parts of this manuscript.)
For more details on general MBRL, see e.g., [Wan+19; Moe+23; PKP21; Luo+22].
Algorithm 11: MBRL agent
1 def MBRL-agent(M_env; T, H, N):
2     Initialize state s ∼ M_env
3     Initialize data buffer D = ∅, model M̂
4     Initialize value function V, policy proposal π
5     repeat
6         // Collect data from environment
7         τ_env = rollout(s, π, T, M_env)
8         s = τ_env[−1]
9         D = D ∪ τ_env
10        // Update model
11        if Update model online then
12            M̂ = update-model(M̂, τ_env)
13        if Update model using replay then
14            τ^n_replay = sample-trajectory(D), n = 1 : N
15            M̂ = update-model(M̂, τ^{1:N}_replay)
16        // Update policy
17        if Update on-policy with real then
18            (π, V) = update-on-policy(π, V, τ_env)
19        if Update on-policy with imagination then
20            τ^n_imag = rollout(sample-init-state(D), π, T, M̂), n = 1 : N
21            (π, V) = update-on-policy(π, V, τ^{1:N}_imag)
22        if Update off-policy with real then
23            τ^n_replay = sample-trajectory(D), n = 1 : N
24            (π, V) = update-off-policy(π, V, τ^{1:N}_replay)
25        if Update off-policy with imagination then
26            τ^n_imag = rollout(sample-state(D), π, T, M̂), n = 1 : N
27            (π, V) = update-off-policy(π, V, τ^{1:N}_imag)
28    until converged
Figure 4.1: Illustration of forward search applied to a problem with 3 discrete states and 2 discrete actions. From
Figure 9.1 of [KWW22]. Used with kind permission of Mykel Kochenderfer.
which is independent of |S|.
• Action selection: If we have not visited s before, we initialize the node by setting N (s, a) = 0 and
Q(s, a) = 0 and returning U (s) as the value, where U is some estimated value function. Otherwise we
pick the next action to explore from state s. To explore actions, we first try each action once, and we
then use the Upper Confidence Tree or UCT heuristic (based on UCB from Section 7.2.3) to select
subsequent actions, i.e. we use
a = argmax_{a∈A(s)} [ Q(s, a) + c sqrt( log N(s) / N(s, a) ) ]    (4.1)

where N(s) = Σ_a N(s, a) is the total visit count to s, and c is an exploration bonus scaling term.
(Various other expressions are used in the literature, see [Bro+12] for a discussion.) If we have a predictor or prior over actions, P(s, a), we can instead use

a = argmax_{a∈A(s)} [ Q(s, a) + c P(s, a) sqrt(N(s)) / (1 + N(s, a)) ]    (4.2)
• Expansion: After choosing action a, we sample the next state s′ ∼ p(s′ |s, a).
• Rollout: we recursively estimate u = U (s′ ) using MCTS from that node. At some depth, we stop and
use the value function to return u = r + γv(s′ ).
• Backup: Finally we update the Q function for the root node using a running average:

Q(s, a) ← Q(s, a) + (1/N(s, a)) (u − Q(s, a))    (4.3)
When we return from the recursive call, we are effectively backpropagating the value u from the leaves up
the tree, as illustrated in Figure 4.2(b). A sketch of a non-recursive version of the algorithm (Algorithm 25 of
[ACS24]) is shown in Algorithm 14.
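The following Python sketch shows a recursive UCT-style MCTS simulation matching the four steps above (action selection, expansion, rollout, backup); the model interface (model.actions, model.sample_next), the leaf value estimator U, and the assumption that states are hashable are illustrative, not part of any particular library.

```python
import math
from collections import defaultdict

class MCTS:
    def __init__(self, model, U, c=1.4, gamma=0.99, max_depth=20):
        self.model, self.U = model, U          # generative model and leaf value estimate
        self.c, self.gamma, self.max_depth = c, gamma, max_depth
        self.N = defaultdict(int)              # visit counts N(s, a)
        self.Q = defaultdict(float)            # action values Q(s, a)
        self.visited = set()

    def simulate(self, s, depth=0):
        if depth >= self.max_depth:
            return self.U(s)
        if s not in self.visited:              # initialise a new leaf node
            self.visited.add(s)
            return self.U(s)
        # action selection with the UCT bonus (Eq. 4.1)
        n_s = sum(self.N[(s, a)] for a in self.model.actions(s)) or 1
        def uct(a):
            if self.N[(s, a)] == 0:
                return float("inf")            # try each action at least once
            return self.Q[(s, a)] + self.c * math.sqrt(math.log(n_s) / self.N[(s, a)])
        a = max(self.model.actions(s), key=uct)
        s2, r = self.model.sample_next(s, a)                # expansion
        u = r + self.gamma * self.simulate(s2, depth + 1)   # rollout / recursion
        # backup with a running average (Eq. 4.3)
        self.N[(s, a)] += 1
        self.Q[(s, a)] += (u - self.Q[(s, a)]) / self.N[(s, a)]
        return u
```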
Figure 4.2: Illustration of MCTS. (a) Expanding nodes until we hit a new (previously unexplored) leaf node. (b)
Propagating leaf value u back up the tree. From Figure 9.25 of [ACS24].
AlphaGo was the first program to defeat a human grandmaster at the board game Go. AlphaGo was followed up by AlphaGoZero [Sil+17a], which had a much simpler design, and did not train on any human data, i.e., it was trained entirely using RL and self play.
It significantly outperformed the original AlphaGo. This was generalized to AlphaZero [Sil+18], which can
play expert-level Go, chess, and shogi (Japanese chess), without using any domain knowledge (except in the
design of the neural network used to guide MCTS).
In more detail, AlphaZero used MCTS (with self-play), combined with a neural network which computes (v_s, π_s) = f(s; θ), where v_s is the expected outcome of the game from state s (either +1 for a win, −1 for a loss, or 0 for a draw), and π_s is the policy (distribution over actions) for state s. The policy is used internally by MCTS whenever a new node is initialized to give an additional exploration bonus to the most promising / likely actions. This controls the breadth of the search tree. In addition, the learned value function v_s (the value head of f(s; θ)) is used to provide the value for leaf nodes in cases where we cannot afford to rollout to termination. This controls the depth of the search tree.
The policy/value network f is trained by optimizing the actor-critic loss

L(θ) = E_{(s, π^MCTS_s, V^MCTS(s))∼D}[ (V^MCTS(s) − V_θ(s))² − Σ_a π^MCTS_s(a) log π_θ(a|s) ]    (4.4)
4.2.2.3 MCTS in belief space
In [Mos+24], they present BetaZero, which performs MCTS in belief space. The current state is represented by a belief state, b_t, which is passed to the network to generate an initial policy proposal π_θ(a|b) and value function v_θ(b). (Instead of passing the belief state to the network, they actually pass features derived from the belief state, namely the mean and variance of the states.1)
To roll out trajectories from the current root node b, they proceed as follows:
• Expand the node as follows: sample the current hidden state s ∼ b, sample the next hidden state s′ ∼ T(s′|s, a), sample the observation o ∼ O(s′), sample the reward r ∼ R(s, a, s′); and finally derive the new belief state b′ = Update(b, a, o) using e.g., a particle filter (see e.g., [Lim+23]).
• Simulate future returns using rollouts to get u = r + γVθ (b′ ) (assuming single step for notational
simplicity);
At the end of tree search, they derive the tree policy π_t = π^MCTS(b_t) from the root, and compute the empirical reward-to-go g_t = Σ_{i=t}^T γ^{i−t} r_i based on all rewards observed below root node b_t; this is added to a dataset D = {(b_t, π_t, g_t)} which is used to update the policy and value network.
and η is a temperature parameter obtained by minimizing the dual in Equation (3.143). (The state-value function V can be learned via TD(0).)
Let the resulting empirical distribution over trajectories be denoted by

q̂_i(τ) = Σ_{n=1}^N w^n δ(τ − τ^n)    (4.9)

where τ^n is the n'th sample, and w^n is its (normalized) weight. We can derive the distribution over next best action as follows:

q̂(a|s_0) = Σ_n w^n δ(a − a^n_0)    (4.10)
1 The POMCP algorithm of [SV10] (Partially Observable Monte Carlo Planning) is related to BetaZero, but passes observation-action histories as input to the policy/value network, instead of features derived from the belief state. The POMCPOW algorithm of [SK18] (POMCP with observation widening) extends this to continuous domains by sampling observations and actions.
One way to sample trajectories from such a distribution is to use SMC (Sequential Monte Carlo), which
is a generalization of particle filtering. This is an approach to approximate inference in state space models
based on sequential importance sampling with resampling (see e.g., [NLS19]). At each step, we use a proposal distribution β(τ_t|τ_{1:t−1}), which extends the previous sampled trajectory with a new value of x_t = (s_t, a_t). We then compute the weight of this proposed extension by comparing it to the target q_i to get

w(τ_{1:t}) ∝ w(τ_{1:t−1}) q_i(τ_{1:t}) / β(τ_{1:t})    (4.11)

In SMC, at each step we propose a new particle according to β, and then weight it according to the above equation. We can then optionally resample the particles every few steps, or when the effective sample size becomes too small; after a resampling step, we reset the weights to 1, since we now have an (approximately) unweighted sample.
At the end, we return an empirical distribution over actions that correspond to high scoring trajectories,
from which we can estimate the next best action (e.g., by taking the mean or mode of this distribution). See
Algorithm 15 for details. (See also [Pic+19; Lio+22] for related methods.)
Note that the above framework is a special case of twisted SMC [NLS19; Law+22; Zha+24b], where the advantage function plays the role of a “twist” function, summarizing expected future rewards from the current state.
Algorithm 15: SMC-RHC (Sequential Monte Carlo for Receding Horizon Control)
1 def SMC-RHC(s_t, π_i, V_i):
2     Initialize particles: {s^n_t = s_t}_{n=1}^N
3     Initialize weights: {w^n_t = 1}_{n=1}^N
4     for j = t + 1 : t + d do
5         {a^n_j ∼ π_i(·|s^n_j)}_{n=1}^N
6         {s^n_{j+1} ∼ T̂(s^n_j, a^n_j)}_{n=1}^N
7         {r^n_j ∼ R̂(s^n_j, a^n_j)}_{n=1}^N
8         {x^n_j = (s^n_j, a^n_j, r^n_j)}_{n=1}^N
9         {A^n_j = r^n_j + V_i(s^n_{j+1}) − V_i(s^n_j)}_{n=1}^N
10        {w^n_j = w^n_{j−1} exp(A^n_j/η*_i)}_{n=1}^N
11        if Resample then
12            {x^n_{t:j}} ∼ Multinom(n; w^1_j, . . . , w^N_j)
13            {w^n_j = 1}_{n=1}^N
14    {w^n = w^n / Σ_{n′} w^{n′}}_{n=1}^N
15    Let {a^n_t}_{n=1}^N be the set of sampled actions at the start of {x^n_{t:t+d}}_{n=1}^N
16    Return q̂(a|s_t) = Σ_n w^n δ(a − a^n_t)
Figure 4.3: Illustration of the suboptimality of open-loop planning. From Figure 9.9 of [KWW22]. Used with kind
permission of Mykel Kochenderfer.
where T is the dynamics model. It then returns a∗t as the best action, takes a step, and replans.
Crucially, the future actions are chosen without knowing what the future states are; this is what is meant
by “open loop”. This can be much faster than interleaving the search for actions and future states. However, it
can also lead to suboptimal decisions, as we discuss below. Nevertheless, the fact that we replan at each step
can reduce the harms of this approximation, making the method quite popular for some problems, especially
ones where the dynamics are deterministic, and the actions are continuous (so that Equation (4.15) becomes
a standard optimization problem over the real valued sequence of vectors at:t+d−1 ).
• U(down,up) = 20
• U(down,down) = 20
Thus the best open-loop action is to choose down, with an expected reward of 20. However, closed-loop
planning can reason that, after taking the first action, the agent can sense the resulting state. If it initially
chooses to go up from s1 , then it can decide to next go up or down, depending on whether it is in s2 or s3 ,
thereby guaranteeing a reward of 30.
4.2.4.2 Trajectory optimization
If the dynamics is deterministic, the problem becomes one of solving

max_{a_{1:d}, s_{2:d}} Σ_{t=1}^d γ^t R(s_t, a_t)    (4.16)
s.t. s_{t+1} = T(s_t, a_t)    (4.17)
where T is the transition function. This is called a trajectory optimization problem. We discuss various
ways to solve this below.
4.2.4.3 LQR
If the system dynamics are linear and the reward function is quadratic, then the optimal action sequence can
be computed exactly using a method similar to Kalman filtering. This is known as the linear quadratic
regulator (LQR). For details, see e.g., [AM89; HR17; Pet08].
If the model is nonlinear, we can use differential dynamic programming (DDP) [JM70; TL05] to
approximately solve the problem. In each iteration, DDP starts with a reference trajectory, and linearizes the
system dynamics around states on the trajectory to form a locally quadratic approximation of the reward
function. This system can be solved using LQG, whose optimal solution results in a new trajectory. The
algorithm then moves to the next iteration, with the new trajectory as the reference trajectory.
4.2.4.5 CEM
As an improvement upon random shooting, it is common to use black-box (gradient-free) optimization methods like the cross-entropy method or CEM in order to find the best action sequence. The CEM method is a simple derivative-free optimization method for continuous black-box functions f : R^D → R. We start with a multivariate Gaussian, N(µ_0, Σ_0), representing a distribution over possible action sequences a. We sample from this, evaluate all the proposals, pick the top K, refit the Gaussian to these top K (by moment matching), and repeat until we find a sample with a sufficiently good score. For details, see [Rub97; RK04; Boe+05]. In Section 4.2.4.6, we discuss the MPPI method, which is a common instantiation of the CEM method. In [BXS20] they discuss how to combine CEM with gradient-based planning.
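Here is a minimal NumPy sketch of CEM for planning over an action sequence, under the assumptions of a diagonal Gaussian proposal and a user-supplied score function that rolls out a candidate action sequence in the (learned) model and returns its total reward; all names are illustrative.

```python
import numpy as np

def cem_plan(score_fn, horizon, action_dim, iters=10, pop_size=100, elite_frac=0.1):
    """Cross-entropy method over flattened action sequences a_{1:H}."""
    dim = horizon * action_dim
    mu, sigma = np.zeros(dim), np.ones(dim)          # initial Gaussian N(mu_0, Sigma_0)
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(iters):
        samples = mu + sigma * np.random.randn(pop_size, dim)    # propose candidates
        scores = np.array([score_fn(a.reshape(horizon, action_dim)) for a in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]          # keep the top K
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6  # refit by moment matching
    return mu.reshape(horizon, action_dim)           # return the mean plan
```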
4.2.4.6 MPPI
The model predictive path integral or MPPI approach [WAT17] is a version of CEM. Originally MPPI
was limited to models with linear dynamics, but it was extended to general nonlinear models in [Wil+17].
The basic idea is that the initial mean of the Gaussian at step t, namely µt = at:t+H , is computed based on
shifting µ̂t−1 forward by one step. (Here µt is known as a reference trajectory.)
In [Wag+19], they apply this method for robot control. They consider a state vector of the form s_t = (q_t, q̇_t), where q_t is the configuration of the robot. The deterministic dynamics has the form

s_{t+1} = F(s_t, a_t) = ( q_t + q̇_t Δt,  q̇_t + f(s_t, a_t) Δt )    (4.18)
where f is a 2 layer MLP. This is trained using the Dagger method of [RGB11], which alternates between
fitting the model (using supervised learning) on the current replay buffer (initialized with expert data), and
then deploying the model inside the MPPI framework to collect new data.
A similar method was used in the TD-MPC paper [HSW22; HSW24], which learns a non-generative
world model in latent space, and then uses MPPI to implement MPC (see Section 4.4.2.11 for details). They
initialize the population of K sampled action trajectories by applying the policy prior to generate J < K
samples, and then generate the remaining K − J samples using the diagonal Gaussian prior from the previous
time step.
4.2.4.7 GP-MPC
[KD18] proposed GP-MPC, which combines a Gaussian process dynamics model with model predictive
control. They compute a Gaussian approximation to the future state trajectory given a candidate action
trajectory, p(st+1:t+H |at:t+H−1 , st ), by moment matching, and use this to deterministically compute the
expected reward and its gradient wrt at:t+H−1 . Using this, they can solve Equation (4.15) to find a∗t:t+H−1 ;
finally, they execute the first step of this plan, a∗t , and repeat the whole process.
The key observation is that moment matching is a deterministic operator that maps p(st |a1:t−1 ) to
p(st+1 |a1:t ), so the problem becomes one of deterministic optimal control, for which many solution methods
exist. Indeed the whole approach can be seen as a generalization of the LQG method from classical control,
which assumes a (locally) linear dynamics model, a quadratic cost function, and a Gaussian distribution over
states [Rec19]. In GP-MPC, the moment matching plays the role of local linearization.
The advantage of GP-MPC over the earlier method known as PILCO (“probabilistic inference for learning control”), which learns a policy by maximizing the expected reward from rollouts (see [DR11; DFR15] for details), is that GP-MPC can handle constraints more easily, and it can be more data efficient, since it continually updates the GP model after every step (instead of at the end of a trajectory).
We define the loss of a model M̂ given a distribution µ(s, a) over states and actions as

ℓ(M̂, µ) = E_{(s,a)∼µ}[ D_KL( M_env(·|s, a) ∥ M̂(·|s, a) ) ]
We now define MBRL as a two-player general-sum game (see Chapter 5 for details):

max_π J(π, M̂)   (policy player),      min_{M̂} ℓ(M̂, µ^π_{M_env})   (model player)

where µ^π_{M_env} = (1/T) Σ_{t=0}^T M_env(s_t = s, a_t = a) is the induced state visitation distribution when policy π is applied in the real world M_env, so that minimizing ℓ(M̂, µ^π_{M_env}) gives the maximum likelihood estimate for M̂.
Now consider a Nash equilibrium of this game, that is, a pair (π, M̂) that satisfies ℓ(M̂, µ^π_{M_env}) ≤ ϵ_{M_env} and J(π, M̂) ≥ J(π′, M̂) − ϵ_π for all π′. (That is, the model is accurate when predicting the rollouts from π, and π cannot be improved when evaluated in M̂.) In [RMK20] they prove that the Nash equilibrium policy π is near optimal wrt the real world, in the sense that J(π*, M_env) − J(π, M_env) is bounded by a constant, where π* is an optimal policy for the real world M_env. (The constant depends on the ϵ parameters, and the TV distance between µ^π_{M_env} and µ^{π*}_{M̂}.)
A natural approach to trying to find such a Nash equilibrium is to use gradient descent ascent or GDA, in which each player simultaneously takes a small gradient step on its own objective, holding the other player's current parameters fixed. Unfortunately, GDA is often an unstable algorithm, and often needs very small learning rates η. In addition, to increase sample efficiency in the real world, it is better to make multiple policy improvement steps using synthetic data from the model M̂_k at each step.
Rather than taking small steps in parallel, the best response strategy fully optimizes each player given the previous value of the other player, in parallel. Unfortunately, making such large updates in parallel can often result in a very unstable algorithm.
To avoid the above problems, [RMK20] propose to replace the min-max game with a Stackelberg game,
which is a generalization of min-max games where we impose a specific player ordering. In particular, let the
players be A and B, let their parameters be θA and θB , and let their losses be LA (θA , θB ) and LB (θA , θB ).
If player A is the leader, the Stackelberg game corresponds to the following nested optimization problem,
also called a bilevel optimization problem:
min_{θ_A} L_A(θ_A, θ*_B(θ_A))   s.t.   θ*_B(θ_A) = argmin_θ L_B(θ_A, θ)
Since the follower B chooses the best response to the leader A, the follower’s parameters are a function of the
leader’s. The leader is aware of this, and can utilize this when updating its own parameters.
The main advantage of the Stackelberg approach is that one can derive gradient-based algorithms that
will provably converge to a local optimum [CMS07; ZS22]. In particular, suppose we choose the policy as leader (PAL). We then just have to solve a two-step optimization problem, in which we first fit the model and then conservatively improve the policy.
We can solve the first step by executing πk in the environment to collect data Dk and then fitting a local
(policy-specific) dynamics model by solving M̂k+1 = argmin ℓ(M̂, Dk ). (For example, this could be a locally
linear model, such as those used in trajectory optimization methods discussed in Section 4.2.4.6.) We then
(slightly) improve the policy to get πk+1 using a conservative update algorithm, such as natural actor-critic
(Section 3.2.4) or TRPO (Section 3.3.2), on “imaginary” model rollouts from M̂k+1 .
Alternatively, suppose we choose the model as leader (MAL). We now have to solve the analogous two-step problem, with the roles of the model and policy reversed.
We can solve the first step by using any RL algorithm on “imaginary” model rollouts from M̂k to get πk+1 . We
then apply this in the real world to get data Dk+1 , which we use to slightly improve the model to get M̂k+1
by using a conservative model update applied to Dk+1 . (In practice we can implement a conservative model
update by mixing Dk+1 with data generated from earlier models, an approach known as data aggregation
[RB12].) Compared to PAL, the resulting model will be a more global model, since it is trained on data from
a mixture of policies (including very suboptimal ones at the beginning of learning).
4.3.2 Dyna
The Dyna paper [Sut90] proposed an approach to MBRL that is related to the approach discussed in
Section 4.3.1, in the sense that it trains a policy and world model in parallel, but it differs in one crucial way: the
policy is also trained on real data, not just imaginary data. That is, we define πk+1 = πk +ηπ ∇π J(πk , D̂k ∪Dk ),
where Dk is data from the real environment and D̂k = rollout(πk , M̂k ) is imaginary data from the model.
This makes Dyna a hybrid model-free and model-based RL method, rather than a “pure” MBRL method.
In more detail, at each step of Dyna, the agent collects new data from the environment and adds it to a
real replay buffer. This is then used to do an off-policy update. It also updates its world model given the real
data. Then it simulates imaginary data, starting from a previously visited state (see sample-init-state
function in Algorithm 11), and rolling out the current policy in the learned model. The imaginary data is
then added to the imaginary replay buffer and used by an on-policy learning algorithm. This process continues until the agent runs out of time and must take the next step in the environment.
Algorithm 16: Tabular Dyna-Q
1 def dyna-Q-agent(s, Menv ; ϵ, η, γ):
2 Initialize data buffer D = ∅, Q(s, a) = 0 and M̂ (s, a) = 0
3 repeat
4 // Collect real data from environment
5 a = eps-greedy(Q, ϵ)
6 (s′ , r) = env.step(s, a)
7 D = D ∪ {(s, a, r, s′ )}
8 // Update policy on real data
9 Q(s, a) := Q(s, a) + η[r + γ maxa′ Q(s′ , a′ ) − Q(s, a)]
10 // Update model on real data
11 M̂ (s, a) = (s′ , r)
12 s := s′
13 // Update policy on imaginary data
14 for n=1:N do
15 Select (s, a) from D
16 (s′ , r) = M̂ (s, a)
17 Q(s, a) := Q(s, a) + η[r + γ maxa′ Q(s′ , a′ ) − Q(s, a)]
18 until converged
[Chu+18]).2 In [Kur+19], they combine Dyna with TRPO (Section 3.3.2) and ensemble world models, and
in [Wu+23] they combine Dyna with PPO and GP world models. (Technically speaking, these on-policy
approaches are not valid with Dyna, but they can work if the replay buffer used for policy training is not too
stale.)
4.4 World models
In this section, we discuss various kinds of world models that have been proposed in the literature. These models can be trained to predict future observations (generative WMs) or just future rewards/values and/or future latent embeddings (non-generative / non-reconstructive WMs). Once trained, the models can be used for decision-time planning, background planning, or just as an auxiliary signal to aid in things like intrinsic curiosity. See Table 4.1 for a summary.
2 In [Zhe+22b] they argue that the main benefit of an ensemble is that it limits the Lipschitz constant of the value function.
They show that more direct methods for regularizing this can work just as well, and are much faster.
4.4.1 World models which are trained to predict observation targets
In this section, we discuss different kinds of world model T(s′|s, a). We can use this to generate imaginary trajectories by sampling from the following joint distribution:

p(s_{t+1:T}, r_{t+1:T}, a_{t:T−1}|s_t) = Π_{i=t}^{T−1} π(a_i|s_i) T(s_{i+1}|s_i, a_i) R(r_{i+1}|s_i, a_i)    (4.19)

The model may be augmented with latent variables, as we discuss in Section 4.4.1.2.
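A minimal Python sketch of sampling an imaginary trajectory from Equation (4.19), given sampling functions for the policy, learned dynamics, and learned reward model; the function names are placeholders for whatever models the agent has learned.

```python
def imagine_rollout(s0, policy_sample, dynamics_sample, reward_sample, horizon):
    """Sample (s, a, r, s') tuples by rolling the policy through the learned model."""
    trajectory, s = [], s0
    for _ in range(horizon):
        a = policy_sample(s)               # a_i ~ pi(.|s_i)
        s_next = dynamics_sample(s, a)     # s_{i+1} ~ T(.|s_i, a_i)
        r = reward_sample(s, a)            # r_{i+1} ~ R(.|s_i, a_i)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory
```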
In this section, we assume the model is always trained to predict the entire observation vector, even if we use latent variables. (This is what we mean by “generative world model”.) One big disadvantage of this approach is that the observations may contain irrelevant or distractor variables; this is particularly likely to occur in the case of image data. In addition, there may be a distribution shift in the observation process between train and test time. Both of these factors can adversely affect the performance of generative WMs (see e.g., [Tom+23]). We discuss some non-generative approaches to WMs in Section 4.4.2.
where p(o_t|z_t) = D(o_t|z_t) is the decoder or likelihood function, M(z′|z, a) is the dynamics in latent space, and π(a_t|z_t) is the policy in latent space.
The world model is usually trained by maximizing the marginal likelihood of the observed outputs given
an action sequence. (We discuss non-likelihood based loss functions in Section 4.4.2.) Computing the marginal
likelihood requires marginalizing over the hidden variables zt+1:T . To make this computationally tractable, it
is common to use amortized variational inference, in which we train an encoder network, p(zt |ot ) = E(zt |ot ),
to approximate the posterior over the latents. Many papers have followed this basic approach, such as the
“world models” paper [HS18], and the methods we discuss below.
Figure 4.4: Illustration of Dreamer world model as a factor graph (squares are learnable functions, circles are variables,
diamonds are fixed cost functions). We have unrolled the forwards prediction for only 1 step. Also, we have omitted
the reward prediction loss.
(Note that Dreamer is based on an earlier approach called PlaNet [Haf+19], which used MPC instead of
background planning.)
In Dreamer, the stochastic dynamic latent variables in Equation (4.20) are replaced by deterministic
dynamic latent variables ht , since this makes the model easier to train. (We will see that ht acts like the
posterior over the hidden state at time t − 1; this is also the prior predictive belief state before we see ot .) A
“static” stochastic variable zt is now generated for each time step, and acts like a “random effect” in order
to help generate the observations, without relying on ht to store all of the necessary information. (This
simplifies the recurrent latent state.) In more detail, Dreamer defines the following functions:3
• A hidden dynamics (sequence) model: ht+1 = U(ht , at , zt )
• A latent state conditional prior: ẑt ∼ P (ẑt |ht )
• A latent state decoder (observation predictor): ôt ∼ D(ôt |ht , ẑt ).
• A reward predictor: r̂t ∼ R(r̂t |ht , ẑt )
• A latent state encoder: zt ∼ E(zt |ht , ot ).
• A policy function: at ∼ π(at |ht )
See Figure 4.4 for an illustration of the system.
We now give a simplified explanation of how the world model is trained. The loss has the form

L^WM = E_{q(z_{1:T})}[ Σ_{t=1}^T β_o L^o(o_t, ô_t) + β_z L^z(z_t, ẑ_t) ]    (4.21)

where the β terms are different weights for each loss, and q is the posterior over the latents, given by

q(z_{1:T}|h_0, o_{1:T}, a_{1:T}) = Π_{t=1}^T E(z_t|h_t, o_t) δ(h_t − U(h_{t−1}, a_{t−1}, z_{t−1}))    (4.22)
where we abuse notation somewhat, since Lz is a function of two distributions, not of the variables zt and ẑt .
In addition to the world model loss, we have the following actor-critic losses

L^critic = Σ_{t=1}^T (V(h_t) − sg(G^λ_t))²    (4.25)
L^actor = − Σ_{t=1}^T sg(G^λ_t − V(h_t)) log π(a_t|h_t)    (4.26)
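As a sketch of how the actor-critic losses in Equations (4.25) and (4.26) are computed on an imagined rollout, here is a minimal NumPy version; the λ-return computation, the treatment of the stop-gradient (a no-op on plain arrays), and the array names are simplifying assumptions rather than the exact Dreamer implementation.

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute G^lambda_t for an imagined trajectory, bootstrapping from values[-1]."""
    T = len(rewards)
    G = np.zeros(T)
    next_return = values[-1]
    for t in reversed(range(T)):
        G[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_return)
        next_return = G[t]
    return G

def dreamer_ac_losses(rewards, values, logp_actions, gamma=0.99, lam=0.95):
    """Critic regression loss (Eq. 4.25) and REINFORCE-style actor loss (Eq. 4.26).

    rewards:      r_1, ..., r_T predicted along the imagined rollout
    values:       V(h_1), ..., V(h_{T+1}) (one extra for bootstrapping)
    logp_actions: log pi(a_t | h_t) for the imagined actions
    """
    G = lambda_returns(rewards, values, gamma, lam)   # targets G^lambda_t, treated as constants
    critic_loss = np.sum((values[:-1] - G) ** 2)
    actor_loss = -np.sum((G - values[:-1]) * logp_actions)
    return critic_loss, actor_loss
```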
There have been several extensions to the original Dreamer paper. DreamerV2 [Haf+21] adds categorical
(discrete) latents and KL balancing between prior and posterior estimates. This was the first imagination-
based agent to outperform humans in Atari games. DayDreamer [Wu+22] applies DreamerV2 to real robots.
3 We can map from our notation to the notation in the paper as follows: o_t → x_t, U → f_ϕ (sequence model), P → p_ϕ(ẑ_t|h_t) (dynamics predictor), D → p_ϕ(x̂_t|h_t, ẑ_t) (decoder), E → q_ϕ(z_t|h_t, x_t) (encoder).
DreamerV3 [Haf+23] builds upon DreamerV2 using various tricks — such as symlog encodings4 for the reward,
critic, and decoder — to enable more stable optimization and domain independent choice of hyper-parameters.
It was the first method to create diamonds in the Minecraft game without requiring human demonstration
data. (However, reaching this goal took 17 days of training.) [Lin+24a] extends DreamerV3 to also model
language observations.
Many variants of Dreamer have been explored. For example, TransDreamer [Che+21a] and STORM
[Zha+23b] replace the RNN world model with transformers, and the S4WM method of [DPA23] uses S4
(Structured State Space Sequence) models. The DreamingV2 paper of [OT22] replaces the generative loss with
a non-generative self-prediction loss (see Section 4.4.2.6), and [RHH23; Sgf] use the VicReg non-generative
representation learning method (see Section 4.4.2.9).
4 The symlog function squashes large positive and negative values, while preserving small values.
Loss Policy Usage Examples
OP Observables Dyna Diamond [Alo+24], Delta-Iris [MAF24]
OP Observables MCTS TDM [Sch+23a]
OP Latents Dyna Dreamer [Haf+23]
RP, VP, PP Latents MCTS MuZero [Sch+20]
RP, VP, PP, ZP Latents MCTS EfficientZero [Ye+21]
RP, VP, ZP Latents MPC-CEM TD-MPC [HSW22; HSW24]
VP, ZP Latents Aux. Minimalist [Ni+24]
VP, ZP Latents Dyna DreamingV2 [OT22]
VP, ZP, OP Latents Dyna AIS [Sub+22]
POP Latents Dyna Denoised MDP [Wan+22]
Table 4.2: Summary of some world-modeling methods. The “loss” column refers to the loss used to train the latent
encoder (if present) and the dynamics model (OP = observation prediction, ZP = latent state prediction, RP = reward
prediction, VP = value prediction, PP = policy prediction, POP = partial observation prediction). The “policy” column
refers to the input that is passed to the policy. (For MCTS methods, the policy is just used as a proposal over action sequences to initialize the search / optimization process.) The “usage” column refers to how the world model is used: for background planning (which we call “Dyna”), for decision-time planning (which we call “MCTS”), or just as an auxiliary loss on top of standard policy/value learning (which we call “Aux”). Thus Aux methods are single-stage (“end-to-end”), whereas the other methods are two-phase, and alternate between improving the world model and then using it to improve the policy (or to search for the optimal action).
prediction error, by minimizing ℓ(M̂, µπM ). However, not all features of the state are useful for planning. For
example, if the states are images, a dynamics model with limited representational capacity may choose to focus
on predicting the background pixels rather than more control-relevant features, like small moving objects,
since predicting the background reliably reduces the MSE more. This is due to “objective mismatch”
[Lam+20; Wei+24], which refers to the discrepancy between the way a model is usually trained (to predict
the observations) vs the way its representation is used for control. To tackle this problem, in this section we
discuss methods for learning representations and models that don’t rely on predicting all the observations.
Our presentation is based in part on [Ni+24] (which in turn builds on [Sub+22]). See Table 4.2 for a summary
of some of the methods we will discuss.
where D is the decoder. In order to repeatedly apply this, we need to be able to update the encoding z = ϕ(h) in a recursive or online way. Thus we must also satisfy the following recurrent encoder condition, which
5 Note that in general, the encoder may depend on the entire history of previous observations, denoted z = ϕ(D).
Figure 4.5: Illustration of an encoder zt = E(ot ), which is passed to a value estimator vt = V (zt ), and a world model,
which predicts the next latent state ẑt+1 = M(zt , at ), the reward rt = R(zt , at ), and the termination (done) flag,
dt = done(zt ). From Figure C.2 of [AP23]. Used with kind permission of Doina Precup.
where U is the update operator. Note that belief state updates (as in a POMDP) satisfy this property.
Furthermore, belief states are a sufficient statistic to satisfy the OP condition. See Section 4.4.1.2 for a
discussion of generative models of this form.
The drawback of this approach is that in general it is very hard to predict future observations, at least in
high dimensional settings like images. Fortunately, such prediction is not necessary for optimal behavior.
Thus we now turn our attention to other training objectives.
See Figure 4.5 for an illustration. In [Ni+24], they prove that a representation that satisfies ZP and RP is
enough to satisfy value equivalence (sufficiency for Q∗ ).
Note that there is a stronger property than value equivalence called bisimulation [GDG03]. This says that two states s_1 and s_2 are bisimilar if P(s′|s_1, a) ≈ P(s′|s_2, a) and R(s_1, a) = R(s_2, a). From this, we can derive a continuous measure called the bisimulation metric [FPP04]. This has the advantage (compared to value equivalence) of being policy independent, but the disadvantage that it can be harder to compute [Cas20; Zha+21], although there has been recent progress on computationally efficient methods such as MICo [Cas+21] and KSMe [Cas+23].
The value function and reward losses may be too sparse to learn efficiently. Although the self-prediction loss can help somewhat, it does not use any extra information from the environment as feedback. Consequently it is natural to consider other kinds of prediction targets for learning the latent encoder (and dynamics). When using MCTS, it is possible to compute what the policy should be for a given state, and this can be used as a prediction target for the reactive policy a_t = π(z_t), which in turn can be used as a feedback signal for the latent state. This method is used by MuZero (Section 4.2.2.2) and EfficientZero (Section 4.2.2.2).
Unfortunately, in problems with sparse reward, predicting the value or policy may not provide enough of a feedback signal to learn quickly. Consequently it is common to augment the training with a self-prediction loss, where we train ϕ to ensure the following condition holds: the predicted mean of the next latent state under the true model (the LHS) matches the predicted mean under the learned dynamics model (the RHS). We call this the EZP condition, which stands for expected z prediction.6
A trivial way to minimize the self-prediction loss is to learn an embedding that maps everything to a constant
vector, say E(h) = 0, in which case zt+1 will be trivial for the dynamics model M to predict. However, this
is not a useful representation. This problem is known as representational collapse [Jin+22]. Fortunately, we can
provably prevent collapse (at least for linear encoders) by using a frozen target network [TCG21; Tan+23;
Ni+24]. That is, we use an auxiliary loss of the form

$\mathcal{L}_{\mathrm{ZP}}(\phi, M) = \left\lVert M(\phi(h_t), a_t) - \overline{\phi}(h_{t+1}) \right\rVert_2^2$

where

$\overline{\phi} = \rho\, \overline{\phi} + (1 - \rho)\, \mathrm{sg}(\phi) \quad (4.33)$

is the (stop-gradient) EMA of the encoder weights. (If we set ρ = 0, this is called a detached
network.) See Figure 4.6(a) for an illustration.
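To make the frozen-target approach concrete, here is a minimal sketch of a latent self-prediction loss with an EMA target encoder; it is a generic illustration (not the implementation from any of the papers cited below), and the network sizes and module names are assumptions.

```python
import copy
import torch, torch.nn as nn, torch.nn.functional as F

obs_dim, act_dim, z_dim = 32, 4, 16
encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, z_dim))
dynamics = nn.Sequential(nn.Linear(z_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, z_dim))
target_encoder = copy.deepcopy(encoder)            # frozen EMA copy of the encoder
for p in target_encoder.parameters():
    p.requires_grad_(False)

def ema_update(rho=0.99):
    # phi_bar <- rho * phi_bar + (1 - rho) * sg(phi), cf. Equation (4.33)
    with torch.no_grad():
        for p_bar, p in zip(target_encoder.parameters(), encoder.parameters()):
            p_bar.mul_(rho).add_((1 - rho) * p)

def self_prediction_loss(obs, act, next_obs):
    z = encoder(obs)                                 # online latent z_t
    z_next_pred = dynamics(torch.cat([z, act], -1))  # predicted next latent
    with torch.no_grad():
        z_next_target = target_encoder(next_obs)     # EMA target latent (no gradient)
    return F.mse_loss(z_next_pred, z_next_target)

# toy usage with random data
obs, act, next_obs = torch.randn(8, obs_dim), torch.randn(8, act_dim), torch.randn(8, obs_dim)
self_prediction_loss(obs, act, next_obs).backward()
ema_update()
```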
This approach is used in many papers, such as BYOL [Gri+20a] (BYOL stands for “bootstrap your own
latent”), SimSiam [CH20], DinoV2 [Oqu+24], JEPA (Joint embedding Prediction Architecture) [LeC22],
I-JEPA [Ass+23], V-JEPA [Bar+24], Image World Models [Gar+24], Predictron [Sil+17b], Value
Prediction Networks [OSL17], Self Predictive Representations (SPR) [Sch+21b], Efficient Zero
(Section 4.2.2.2), etc. Unfortunately, the self-prediction method has only been proven to be theoretically
sound for the case of linear encoders.
6 In [Ni+24], they also describe the ZP loss, which requires predicting the full distribution over z ′ using a stochastic transition
model. This is strictly more powerful, but somewhat more complicated, so we omit it for simplicity.
Figure 4.6: Illustration of the JEPA (Joint embedding Prediction Architecture) world model approach using two
different approaches to avoid latent collapse: (a) self-distillation; (b) information theoretic regularizers. Diamonds
represent fixed cost functions, squares represent learnable functions, circles are variables.
Various methods have been proposed to approximate the information content I(zt), mostly based on some
function of the outer product matrix $\sum_t z_t z_t^{\top}$ (since this captures second-order moments, it implicitly
assumes Gaussianity of the embedding distribution).
This method has been used in several papers, such as Barlow Twins [Zbo+21], VICReg [BPL22a],
VICRegL [BPL22b], MMCR (Maximum Manifold Capacity Representation) [Yer+23], etc. Unfortunately,
the methods in these papers do not provide a lower bound on I(zt ) and I(zt+1 ), which is needed to optimize
Equation (4.34).
An alternative approach is to modify the objective to use the change in the coding rate of the latents rather than their
absolute information content; this can be tractably approximated, as shown in MCR2 (Maximal Coding
Rate Reduction) [Yu+20b] and its closed-loop extension [Dai+22].
One way to avoid the latent collapse problem is to use self-distillation (frozen target networks), as discussed in
Section 4.4.2.7. An alternative is to add regularization terms that try to maximize the information content in
zt and zt+1, while also minimizing the prediction error, as discussed in Section 4.4.2.8. See Figure 4.6 for an
illustration of these two approaches.
In the case where the observations are high-dimensional, such as images, it is natural to use a pre-trained
representation, zt = ϕ(ot), as input to the world model (or policy). The representation function ϕ can
be pretrained on a large dataset using a non-reconstructive loss, such as the DINOv2 method [Oqu+24].
Although this can sometimes give gains (as in the DinoWM paper [Zho+24a]), in other cases better results
are obtained by training the representation from scratch [Han+23; Sch+24]. The performance is highly
dependent on the similarity between the pretraining distribution and the agent's distribution, the form of
the representation function, and its training objective.
In this section, we describe TD-MPC2 [HSW24], which is an extension of TD-MPC of [HSW22]. This
learns the following functions:
• Encoder: et = E(ot )
The model is trained using the following VP+ZP loss applied to trajectories sampled from the replay buffer:
" H
#
X
L(θ) = E(o,a,r,o′ )0:H ∼B λt ||zt′ − sg(E(o′t ))||22 + CE(r̂t , rt ) + CE(q̂t , qt ) (4.35)
t=0
We use a cross-entropy loss on a discretized representation of the reward and Q value in a log-transformed
space, in order to be robust to different value scales across time and problem settings (see Section 7.3.2). The
target value for the Q function update is a standard TD target, computed with a frozen (EMA) copy of the Q network.
The learned policy is used as a proposal (prior), in conjunction with the MPPI trajectory planning method
(Section 4.2.4.6), to select actions at run time.
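To illustrate the structure of this loss, here is a minimal sketch of an open-loop latent rollout with a consistency term plus discretized reward and value cross-entropy terms; it is not the TD-MPC2 code, and the module definitions, bin targets, and the omission of the EMA target encoder and TD-target computation are simplifying assumptions.

```python
import torch, torch.nn as nn, torch.nn.functional as F

z_dim, act_dim, n_bins = 16, 4, 51                 # illustrative sizes
E = nn.Linear(8, z_dim)                            # encoder e_t = E(o_t)
M = nn.Linear(z_dim + act_dim, z_dim)              # latent dynamics
R = nn.Linear(z_dim + act_dim, n_bins)             # reward head (logits over discrete bins)
Q = nn.Linear(z_dim + act_dim, n_bins)             # value head (logits over discrete bins)

def vp_zp_loss(obs, act, r_bins, q_bins, lam=0.95):
    """obs: (H+1, B, 8); act: (H, B, act_dim); r_bins, q_bins: (H, B) integer bin targets."""
    H = act.shape[0]
    z = E(obs[0])                                  # encode the first observation
    loss = 0.0
    for t in range(H):
        za = torch.cat([z, act[t]], -1)
        z = M(za)                                  # roll the latent forward (open loop)
        with torch.no_grad():
            z_target = E(obs[t + 1])               # sg(E(o'_t)) target, cf. Equation (4.35)
        loss = loss + lam**t * (
            F.mse_loss(z, z_target)                # latent consistency (ZP)
            + F.cross_entropy(R(za), r_bins[t])    # discretized reward prediction
            + F.cross_entropy(Q(za), q_bins[t])    # discretized value prediction
        )
    return loss

obs, act = torch.randn(6, 2, 8), torch.randn(5, 2, act_dim)
r_bins = torch.randint(0, n_bins, (5, 2)); q_bins = torch.randint(0, n_bins, (5, 2))
print(vp_zp_loss(obs, act, r_bins, q_bins))
```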
Figure 4.7: Illustration of (a simplified version of) the BYOL-Explore architecture, represented as a factor graph (so
squares are functions, circles are variables). The dotted lines represent an optional observation prediction loss. The
map from notation in this figure to the paper is as follows: U → hc (closed-loop RNN update), P → ho (open-loop
RNN update), D → g (decoder), E → f (encoder). We have unrolled the forwards prediction for only 1 step. Also, we
have omitted the reward prediction loss. The V node is the EMA version of the value function. The TD node is the
TD operator.
4.4.2.12 Example: BYOL
In [Gri+20a], they present BYOL (Bootstrap Your Own Latent), which uses the ZP and VP losses. See Figure 4.7
for the computation graph, which we see is slightly simpler than the Dreamer computation graph in Figure 4.4
due to the lack of stochastic latents.
In [Guo+22], they present BYOL-Explore, which extends BYOL by using the self-prediction error to
define an intrinsic reward. This encourages the agent to explore states where the model is uncertain. See
Section 7.4 for further discussion of this topic.
where s′ = M̂(s, a) is the learned dynamics model, subject to the constraint that the value function is derived
from the model using

$V(s) = \max_{a_{0:K-1}} \mathbb{E}_{\hat{M}}\left[ \sum_{k=0}^{K-1} \gamma^k R(s_k, a_k) + \gamma^K V(s_K) \;\middle|\; s_0 = s \right].$
This bilevel optimization problem was first proposed in the Value Iteration Network paper of [Tam+16],
and extended in the TreeQN paper [Far+18]. In [AY20] they proposed a version of this for continuous
actions based on the differentiable CEM method (see Section 4.2.4.5). In [Nik+22; Ban+23] they propose
to use implicit differentiation to avoid explicitly unrolling the inner optimization. In [ML25], they propose
D-TSN (differentiable tree search network), which is similar to TreeQN, but constructs a best-first search
tree, rather than a fixed depth tree, using a stochastic tree expansion method.
In [Jan+19a], they present the MBPO method, which uses short rollouts (inside Dyna) to prevent
compounding error (an approach which is justified in [Jia+15]). [Fra+24] is a recent extension of MBPO
which dynamically decides how much to roll out, based on model uncertainty.
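The short-rollout idea can be sketched as follows (a generic illustration in the spirit of MBPO, not the authors' implementation; `model_step`, `policy`, and the two buffers are assumed placeholders).

```python
import random

def short_model_rollouts(env_buffer, model_buffer, model_step, policy, k=5, n_starts=100):
    """Branch k-step imagined rollouts from real states to limit compounding model error."""
    for _ in range(n_starts):
        s = random.choice(env_buffer)              # start from a state seen in the real env
        for _ in range(k):                         # short horizon => small compounding error
            a = policy(s)
            s_next, r, done = model_step(s, a)     # step the learned model, not the real env
            model_buffer.append((s, a, r, s_next, done))
            if done:
                break
            s = s_next
```

The synthetic transitions collected in `model_buffer` would then be mixed with real data when updating the policy or value function, as in Dyna.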
Another approach to mitigating compounding errors is to learn a trajectory-level dynamics model, instead
of a single-step model, see e.g., [Zho+24b] which uses diffusion to train p(st+1:t+H |st , at:t+H−1 ), and uses
this inside an MPC loop.
If the model is able to predict a reliable distribution over future states, then we can leverage this
uncertainty estimate to compute an estimate of the expected reward. For example, PILCO [DR11; DFR15]
uses Gaussian processes as the world model, and uses this to analytically derive the expected reward over
trajectories as a function of policy parameters, which are then optimized using a deterministic second-order
gradient-based solver. In [Man+19], they combine the MPO algorithm (Section 3.6.5) for continuous control
with uncertainty sets on the dynamics to learn a policy that optimizes for a worst case expected return
objective.
where P(τ) is the distribution over trajectories induced by the policy applied to the true world model,
$P(\tau) = \mu(s_0) \prod_{t=0}^{\infty} M(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t)$, and Q(τ) is the distribution over trajectories using the estimated world
model, $Q(\tau) = \mu(s_0) \prod_{t=0}^{\infty} \hat{M}(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t)$. They then maximize this bound wrt π and M̂.
In [Ghu+22] they extend MNM to work with images (and other high dimensional states) by learning a
latent encoder Ê(zt |ot ) as well as latent dynamics M̂ (zt+1 |zt , at ), similar to other self-predictive methods
(Section 4.4.2.6). They call their method Aligned Latent Models.
trajectories for model fitting. Recently it has become popular to use LLMs to tackle this problem, using
methods known as AI scientists (see e.g., [Gan+25b]). It is also possible to combine LLMs as hypothesis
generators with Bayesian inference and information-theoretic reasoning principles for a more principled
approach to the problem (see e.g., [Piriyakulkij2024]).
If C(st+1 ) = Rt+1 , this reduces to the value function.7 If we define the cumulant to be the observation
vector, then the GVF will learn to predict future observations at multiple time scales; this is called nexting
[MS14; MWS14]. Predicting the GVFs for multiple cumulants can be useful as an auxiliary task for learning a
useful non-generative representation for solving the main task, as shown in [Jad+17].
[Vee+19] present an approach (based on meta-gradients) to learn which cumulants are worth predicting.
In the inner loop, the model f predicts the policy πt and value function Vt, as usual, and also predicts
the GVFs yt for the specified cumulants; the function f is called the answer network, and is denoted by
(πt, Vt, yt) = fθ(ot−i−1:t). In the outer loop, the model g learns to extract the cumulants and their discounts
given future observations; this is called the question network and is denoted by (ct, γt) = gη(ot+1:t+j). The
outer update to η is based on the gradient of the RL loss after performing K inner updates to θ using the
RL loss and auxiliary loss.
gives us the successor representation or SR [Day93]:

$M^\pi(s, \tilde{s}) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t\, \mathbb{I}(s_{t+1} = \tilde{s}) \;\middle|\; s_0 = s \right] \quad (4.39)$
Thus we see that the SR replaces information about individual transitions with their cumulants, just as the
value function replaces individual rewards with the reward-to-go.
Like the value function, the SR obeys a Bellman equation

$M^\pi(s, \tilde{s}) = \sum_a \pi(a|s) \sum_{s'} T(s'|s, a)\left( \mathbb{I}(s' = \tilde{s}) + \gamma M^\pi(s', \tilde{s}) \right) \quad (4.42)$
$\qquad\quad\;\, = \mathbb{E}\left[ \mathbb{I}(s' = \tilde{s}) + \gamma M^\pi(s', \tilde{s}) \right] \quad (4.43)$

Hence we can learn the SR using a TD update of the form

$M^\pi(s, \tilde{s}) \leftarrow M^\pi(s, \tilde{s}) + \eta \underbrace{\left( \mathbb{I}(s' = \tilde{s}) + \gamma M^\pi(s', \tilde{s}) - M^\pi(s, \tilde{s}) \right)}_{\delta} \quad (4.44)$
where s′ is the next state sampled from T (s′ |s, a). Compare this to the value-function TD update in
Equation (2.16):
$V^\pi(s) \leftarrow V^\pi(s) + \eta \underbrace{\left( R(s') + \gamma V^\pi(s') - V^\pi(s) \right)}_{\delta} \quad (4.45)$
However, with an SR, we can easily compute the value function for any reward function (given a fixed policy)
as follows:

$V^{R,\pi}(s) = \sum_{\tilde{s}} M^\pi(s, \tilde{s})\, R(\tilde{s}) \quad (4.46)$
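As a concrete tabular illustration of the TD update in Equation (4.44) and the value computation in Equation (4.46), here is a small sketch; the chain environment, restart rule, and step sizes are made up for the example.

```python
import numpy as np

def sr_td_update(M, s, s_next, gamma=0.95, eta=0.1):
    """One TD update of the successor representation, cf. Equation (4.44)."""
    indicator = np.zeros(M.shape[1])
    indicator[s_next] = 1.0
    delta = indicator + gamma * M[s_next] - M[s]   # TD error, one entry per target state
    M[s] += eta * delta
    return M

def value_from_sr(M, reward):
    """V(s) = sum over target states of M(s, s~) R(s~), cf. Equation (4.46)."""
    return M @ reward

# toy usage: 5-state chain, transitions generated by some fixed behavior policy
n_states, rng = 5, np.random.default_rng(0)
M, s = np.zeros((n_states, n_states)), 0
for _ in range(2000):
    s_next = min(s + 1, n_states - 1) if rng.random() < 0.8 else max(s - 1, 0)
    M = sr_td_update(M, s, s_next)
    s = 0 if s_next == n_states - 1 else s_next    # restart after reaching the last state
reward = np.array([-0.1, -0.1, -0.1, -0.1, 1.0])
print(value_from_sr(M, reward))                    # value of each state under this policy
```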
Similarly, for the action-conditioned SR, the TD update is

$M^\pi(s, a, \tilde{s}) \leftarrow M^\pi(s, a, \tilde{s}) + \eta \underbrace{\left( \mathbb{I}(s' = \tilde{s}) + \gamma M^\pi(s', a', \tilde{s}) - M^\pi(s, a, \tilde{s}) \right)}_{\delta} \quad (4.49)$
where s′ is the next state sampled from T (s′ |s, a) and a′ is the next action sampled from π(s′ ). Compare this
to the (on-policy) SARSA update from Equation (2.28):
Figure 4.8: Illustration of successor representation for the 2d maze environment shown in (a) with reward shown in (d),
which assigns all states a reward of -0.1 except for the goal state which has a reward of 1.0. In (b-c) we show the SRs
for a random policy and the optimal policy. In (e-f) we show the corresponding value functions. In (b), we see that the
SR under the random policy assigns high state occupancy values to states which are close (in Manhattan distance) to the
current state s13 (e.g., M π (s13 , s14 ) = 5.97) and low values to states that are further away (e.g., M π (s13 , s12 ) = 0.16).
In (c), we see that the SR under the optimal policy assigns high state occupancy values to states which are close to the
optimal path to the goal (e.g., M π (s13 , s14 ) = 1.0) and which fade with distance from the current state along that path
(e.g., M π (s13 , s12 ) = 0.66). From Figure 3 of [Car+24]. Used with kind permission of Wilka Carvalho. Generated by
https: // github. com/ wcarvalho/ jaxneurorl/ blob/ main/ successor_ representation. ipynb .
However, from an SR, we can compute the state-action value function for any reward function:

$Q^{R,\pi}(s, a) = \sum_{\tilde{s}} M^\pi(s, a, \tilde{s})\, R(\tilde{s}) \quad (4.51)$
Thus µπ (s̃|s) tells us the probability that s̃ can be reached from s within a horizon determined by γ when
following π, even though we don’t know exactly when we will reach s̃.
SMs obey a Bellman-like recursion
Hence we can learn an SM using a TD update of the form

$\mu^\pi(\tilde{s}|s, a) \leftarrow \mu^\pi(\tilde{s}|s, a) + \eta \underbrace{\left( (1-\gamma)\, T(s'|s, a) + \gamma \mu^\pi(\tilde{s}|s', a') - \mu^\pi(\tilde{s}|s, a) \right)}_{\delta} \quad (4.60)$
where s′ is the next state sampled from T (s′ |s, a) and a′ is the next action sampled from π(s′ ). With an
SM, we can compute the state-action value for any reward:
$Q^{R,\pi}(s, a) = \frac{1}{1-\gamma}\, \mathbb{E}_{\mu^\pi(\tilde{s}|s, a)}\left[ R(\tilde{s}) \right] \quad (4.61)$
This can be used to improve the policy as we discuss in Section 4.5.4.1.
where $(\mathcal{T}^\pi \mu^\pi)(\cdot|s, a)$ is the Bellman operator applied to $\mu^\pi$ and then evaluated at (s, a), i.e.,

$(\mathcal{T}^\pi \mu^\pi)(\tilde{s}|s, a) = (1-\gamma)\, T(\tilde{s}|s, a) + \gamma \sum_{s'} T(s'|s, a) \sum_{a'} \pi(a'|s')\, \mu^\pi(\tilde{s}|s', a') \quad (4.63)$
We can sample from this as follows: first sample s′ ∼ T (s′ |s, a) from the environment (or an offline replay
buffer), and then with probability 1 − γ set s̃ = s′ and terminate. Otherwise sample a′ ∼ π(a′ |s′ ) and then
create a bootstrap sample from the SM using s̃ ∼ µπ (s̃|s′ , a′ ).
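A minimal sketch of this sampling procedure is shown below; `env_step`, `policy_sample`, and `sm_sample` are assumed placeholders for the environment (or replay buffer), the policy, and the current successor-model sampler.

```python
import random

def sample_sm_target(s, a, env_step, policy_sample, sm_sample, gamma=0.95):
    """Sample s~ from the Bellman operator in Equation (4.63)."""
    s_next = env_step(s, a)               # s' ~ T(.|s, a), from the env or a replay buffer
    if random.random() < 1 - gamma:
        return s_next                     # with prob 1-gamma, terminate: s~ = s'
    a_next = policy_sample(s_next)        # a' ~ pi(.|s')
    return sm_sample(s_next, a_next)      # otherwise bootstrap: s~ ~ mu^pi(.|s', a')
```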
There are many possible density models we can use for µπ . In [Tha+22], they use a VAE. In [Tom+24],
they use an autoregressive transformer applied to a set of discrete latent tokens, which are learned using
VQ-VAE or a non-reconstructive self-supervised loss. They call their method Video Occupancy Models.
Recently, [Far+25] proposed to use diffusion (flow matching) to learn SMs.
An alternative approach to learning SMs, that avoids fitting a normalized density model over states, is to
use contrastive learning to estimate how likely s̃ is to occur after some number of steps, given (s, a), compared
to some randomly sampled negative state [ESL21; ZSE24]. Although we can’t sample from the resulting
learned model (we can only use it for evaluation), we can use it to improve a policy that achieves a target
state (an approach known as goal-conditioned policy learning, discussed in Section 7.5.1).
We can generalize SRs by working with features ϕ(s) instead of primitive states. In particular, if we define the cumulant to be
C(st+1) = ϕ(st+1), we get the following definition of the SF:

$\psi^{\pi,\phi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t\, \phi(s_{t+1}) \;\middle|\; s_0 = s,\; a_{0:\infty} \sim \pi \right] \quad (4.64)$
We will henceforth drop the ϕ superscript from the notation, for brevity. SFs obey a Bellman equation
ψ(s) = E [ϕ(s′ ) + γψ(s′ )] (4.65)
If we assume the reward function can be written as
R(s, w) = ϕ(s)T w (4.66)
then we can derive the value function for any reward as follows:
$V^{\pi,w}(s) = \mathbb{E}\left[ R(s_1) + \gamma R(s_2) + \cdots \mid s_0 = s \right] \quad (4.67)$
$\qquad\quad\;\, = \mathbb{E}\left[ \phi(s_1)^\top w + \gamma \phi(s_2)^\top w + \cdots \mid s_0 = s \right] \quad (4.68)$
$\qquad\quad\;\, = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \phi(s_{t+1}) \mid s_0 = s \right]^\top w = \psi^\pi(s)^\top w \quad (4.69)$
Figure 4.9: Illustration of successor features representation. (a) Here ϕt = ϕ(st ) is the vector of features for the state
at time t, and ψ π is the corresponding SF representation, which depends on the policy π. (b) Given a set of existing
policies and their SFs, we can create a new one by specifying a desired weight vector wnew and taking a weighted
combination of the existing SFs. From Figure 5 of [Car+24]. Used with kind permission of Wilka Carvalho.
Thus ws induces a set of policies that are active for a period of time, similar to playing a chord on a piano.
so we replace the discrete maximization over a finite number of policies, $\max_i$, with a continuous optimization problem,
$\max_{z_w}$, to be solved per state.
If we want to learn the policies and SFs at the same time, we can optimize the following losses in parallel:
where $a^* = \operatorname{argmax}_{a'} \psi_\theta(s', a', z_w)^\top w$. The first equation is the standard Q-learning loss, and the second is
the TD update rule in Equation (4.72) for the SF. In [Car+23], they present the Successor Features
Keyboard, which can learn the policy, the SFs, and the task encoding zw all simultaneously. They also
suggest replacing the squared-error regression loss in Equation (4.79) with a cross-entropy loss, where each
dimension of the SF is now a discrete probability distribution over M possible values of the corresponding
feature (cf. Section 7.3.2).
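As a concrete illustration of generalized policy improvement with successor features (cf. Figure 4.9(b)), here is a small sketch; the array shapes and random test data are assumptions.

```python
import numpy as np

def gpi_action(psi, w):
    """Generalized policy improvement with successor features.
    psi: (n_policies, n_actions, d) array holding psi^{pi_i}(s, a) for a fixed state s.
    w:   task weight vector of shape (d,), so Q_i(s, a) = psi[i, a] @ w.
    Returns the action maximizing max_i Q_i(s, a)."""
    q = psi @ w                          # shape (n_policies, n_actions)
    return int(q.max(axis=0).argmax())

# toy usage: 3 known policies, 4 actions, 2-dim features, and a new task vector w_new
rng = np.random.default_rng(0)
psi = rng.normal(size=(3, 4, 2))
w_new = np.array([1.0, -0.5])
print(gpi_action(psi, w_new))
```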
Chapter 5
Multi-agent RL
In this section, we give a brief introduction to multi-agent RL or MARL. Our presentation is based on
[ACS24]. MARL is closely related to game theory (see e.g. [LBS08]) and multi-agent systems design (see e.g.
[SLB08]), as we will see. For other surveys on MARL, see e.g. [HLKT19; YW20; Won+22; GD22].
5.1 Games
Multi-agent environments are often called games, even if they represent “real-world” problems such as
multi-robot coordination (e.g., a fleet of autonomous vehicles) or agent-based trading. In this section, we
discuss different kinds of games that have been proposed, summarized in Figure 5.1.
In the game theory community, the rules of the game (i.e., the environment dynamics, and the reward
function, aka payoff function) are usually assumed known, and the focus is on computing strategies (i.e.,
policies) for each player (i.e., agent), whereas in MARL, we usually assume the environment is unknown
and the agents have to learn just by interacting with it. (This is analogous to the distinction between DP
methods, that assume a known MDP, and RL methods, that just assume sample-based access to the MDP.)
Figure 5.1: Hierarchy of games. From Fig 3.1 of [ACS24]. Used with kind permission of Stefano Albrecht.
as follows. Each agent samples an action ai ∈ Ai with probability πi(ai); then the resulting joint action
a = (a1, . . . , an) is taken, and the reward r = (r1, . . . , rn) is given to the players, where ri = Ri(a).
Games can be classified based on the type of rewards they contain. In zero-sum games, we have
$\sum_i R_i(a) = 0$ for all a. (For a two-player zero-sum game, 2p0s, we must have R1(a) = −R2(a).) In
common-payoff games (aka common-reward games), we have Ri(a) = Rj(a) for all a. And in general-
sum games, there are no restrictions on the rewards.
In zero-sum games, the agents must compete against each other, whereas in common-reward games, the
agents generally must cooperate (although they may compete with each other over a shared resource). In
general-sum games, there can be a mix of cooperation and competition. Although common-reward games can
be easier to solve than general-sum games, it can be challenging to disentangle the contribution of each agent
to the shared reward (this is a multi-agent version of the credit assignment problem), and coordinating
actions across agents can also be difficult.
Normal-form games with 2 agents are called matrix games because they can be defined by a 2d reward
matrix. We give some well-known examples in Table 5.1.
• In rock-paper-scissors, Rock can blunt scissors, Paper can cover rock, and Scissors can cut paper;
from these constraints, we can determine which player wins or loses. This is a zero-sum game.
• In the battle of the sexes game, a male-female couple want to choose a shared activity. They have
different individual preferences (e.g., the row player prefers Opera, the column player prefers Football), but
they would both rather spend time together than alone. This is an example of a coordination game.
• In the Prisoner’s dilemma, which is a general-sum game, the players (who are prisoners being
interrogated independently in different cells) can either cooperate with each other (by both “staying
mum”, i.e., denying they committed the crime), or one can defect on the other (by claiming the other
person committed the crime). If they both cooperate, they only have to serve 1 year in jail each, based
on weak evidence. If they both defect, they each serve 3 years. But if the row player cooperates (stays
silent) and the column player defects (implicates his partner), the row player gets 5 years and the
column player gets out of jail free. This leads to an incentive for both players to defect, even though they
would be better off if they both cooperated. We discuss this example in more detail in Section 5.2.4.
(a) Rock-Paper-Scissors:
        R       P       S
  R    0,0    -1,1    1,-1
  P   1,-1     0,0    -1,1
  S   -1,1    1,-1     0,0

(b) Battle of the Sexes:
        O       F
  O    2,1     0,0
  F    0,0     1,2

(c) Prisoner's Dilemma:
        C       D
  C  -1,-1    -5,0
  D   0,5    -3,-3
Table 5.1: Three different matrix games. Here the notation (x, y) in cell (i, j) refers to the row player receiving x and
the column player receiving y in response to the joint action (i, j).
Suppose we consider matrix games with just 2 actions each. In this case, we can represent the game as
follows:
$\begin{array}{cc} a_{11}, b_{11} & a_{12}, b_{12} \\ a_{21}, b_{21} & a_{22}, b_{22} \end{array} \quad (5.1)$
where aij is the reward to player 1 (row player) and bij is the reward to player 2 (column player) if player 1
picks action i and player 2 picks action j. Suppose we further restrict attention to strictly ordinal games,
meaning that each agent ranks the 4 possible outcomes from 1 (least preferred) to 4 (most preferred). In
this case, there are 78 structurally distinct games [RG66]. These can be grouped into two main kinds. In
no-conflict games, both players have the same set of most preferred outcomes, whereas in conflict games,
the players disagree about what is best. If we consider general ordinal 2 × 2 games (where one or both players
may have equal preference for two or more outcomes), we find that there are 726 of them [KF88].
A repeated matrix game is the multi-agent analog of a multi-armed bandit problem, discussed in
Section 1.2.4. In this case, the policy has the form πi (ait |ht ), where ht = (a0 , . . . , at−1 ) is the history of
joint-actions. In some cases, the agent may choose to ignore the history, or only look at the last n joint
actions. For example, in the tit-for-tat strategy in the prisoner’s dilemma, the policy for agent i at step t is
to do the same action that agent −i did at step t − 1 (where −i means the agent other than i), so the policy
is conditioned on at−1 . (Note that this strategy will punish players who defect, and can lead to the evolution
of cooperative behavior, even in selfish agents [AH81; Axe84].)
Often we assume the policies are Markovian, in which case they can be written as πi (ait |st ).
Note that, from the perspective of each agent i, the environment transition function has the form

$T_i(s_{t+1}|s_t, a^i_t) = \sum_{a^{-i}_t} T\!\left(s_{t+1} \,\middle|\, s_t, (a^i_t, a^{-i}_t)\right) \prod_{j \neq i} \pi_j(a^j_t|s_t) \quad (5.3)$
Thus Ti depends on the policies of the other players, which are often changing, which makes these local/agent-
centric transition matrices non-stationary, even if the underlying environment is stationary. Typically agent
i does not know the policies of the other agents j, so it has to learn them, or it can just treat the other
agents as part of the environment (i.e., as another source of unmodeled “noise”) and then use single agent RL
methods (see Section 5.3.2).
Figure 5.2: Example of a two-player general-sum stochastic game. Circles represent the states, inside of which we
show the reward function in matrix form. Only one of the 4 possible transitions out of each state is shown. The little
black dots are called the after states, and correspond to an intermediate point where a joint action has been decided
by the players, but nature hasn’t yet sampled the next transition, which occurs with the specified probabilities. From Fig
3.3(b) of [ACS24]. Used with kind permission of Stefano Albrecht.
5.1.3.2 Objective
We define the sum of rewards as $G = \sum_t R_i(\mathbf{S}_t, \mathbf{A}_t)$, where we use capital letters for random variables, and
bold face for everything that is joint across all agents. The objective of player i is to maximize $J_i(\pi_i) = \mathbb{E}_{\pi_i}[G]$.
We can compute this using Bellman's equations, as follows. First define the expected state value for agent i
under joint policy π given joint history ht as

$v_i^{\boldsymbol{\pi}}(\mathbf{h}_t) = \mathbb{E}_{\boldsymbol{\pi}}[G_{\geq t} \mid \mathbf{h}_t] = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_{t' \geq t} R_i(\mathbf{S}_{t'}, \mathbf{A}_{t'}) \;\middle|\; \mathbf{h}_t \right] \quad (5.4)$
The expected state value for agent i under the joint policy π and its local history hit is
viπ (hit ) = Eπ [viπ (Ht )|hit ] (5.5)
Similarly define the expected state-action value given the joint history as
qiπ (ht , ait ) = Eπ [G≥t |ht , ait ] = Eπ [Ri (St , At ) + viπ (Ht+1 )|ht , ait ] (5.6)
and the expected state-action value given the local history as
qiπ (hit , ait ) = Eπ [qiπ (Ht , ait )|hit ] (5.7)
5.1.3.3 Single agent perspective
From the perspective of agent i, it just observes a sequence of observations generated by the following “sensor
stream distribution”, (which is non-Markovian [LMLFP11]):
$p_i(o^i_{t+1}|h^i_t, a^i_t) = \sum_{s_{t+1}} \sum_{a^{-i}_t} \hat{O}_i(o^i_{t+1}|s_{t+1}, \mathbf{a}_t)\, p_i(a^{-i}_t|h^i_t)\, p_i(s_{t+1}|h^i_t, \mathbf{a}_t) \quad (5.8)$
$p_i(a^{-i}_t|h^i_t) = \prod_{j \neq i} \hat{\pi}_{ij}(a^j_t|h^i_t) \quad (5.9)$
$p_i(s_{t+1}|h^i_t, \mathbf{a}_t) = \sum_{s_t} \hat{T}_i(s_{t+1}|s_t, \mathbf{a}_t)\, b_i(s_t|h^i_t) \quad (5.10)$
where π̂ij in Equation (5.9) is i’s estimate of j’s policy; T̂i is i’s estimate of T based on hit ; Ôi is i’s estimate
of Oi based on hit ; and bi (st |hit ) is i’s belief state (i.e., its posterior distribution over the underlying
latent state given its local observation history). The agent can either learn a policy given this “collapsed”
representation, treating the other agents as part of the environment, or it can explicitly try to learn the true
joint world model T , local observation model Oi and other agent policies πij , so it can reason about the other
agents. In this section, we follow the latter approach.
factored POSG.
Figure 5.3: EFG for Kuhn Poker. The dashed lines connect histories in the same information set. From Fig 2 of
[Kov+22]. Used with kind permission of Viliam Lisy.
Kuhn poker is a form of (two player) poker where the deck includes only three cards: Jack, Queen,
and King. First, each player places one chip into the pot as the initial forced bet (ante). Each
player is then privately dealt one card (the last card isn’t revealed). This is followed by a betting
phase (explained below). The game ends either when one player folds (forfeiting all bets made so
far to their opponent) or there is a showdown, where the private cards are revealed and the higher
card’s owner receives the bets. At the start of the betting, player one can either check/pass or
raise/bet (one chip). If they check, player two can also check/pass (leading to a showdown)
or bet. If one of the players bets, their opponent must either call (betting one chip to match the
opponent’s bet), followed by a showdown, or fold.
The EFG for this is shown in Figure 5.3. To interpret this figure, consider the left part of the tree, where
the state of nature is JQ (as determined by the chance player’s first two dealing actions). Suppose the first
player checks, leading to state JQc. The second player can either check, leading to JQcc, resulting in a
showdown with a reward of -1 to player 1 (since J < Q); or bet, leading to JQcb. In the latter case, player 1
must then either fold, leading to JQcbf with a reward to player 1 of -1 (since player 1 only put in one chip);
or player 1 must call, leading to JQcbc with a reward to player 1 of -2 (since player 1 put in two chips).
We can convert an FOSG into an EFG by “unrolling it”. First we define the information set for a given
information state as the set of consistent world state trajectories. By applying the policy of each agent to the
world model, we can derive a tree of possible world states (trajectories) and corresponding information sets
for each agent, and thus can construct a corresponding (augmented) EFG. See [Kov+22] for details.
5.2.1 Notation and definitions
First we define some notation. Let $\hat{h}_t = \{(s_k, \mathbf{o}_k, \mathbf{a}_k)_{k=1}^{t-1}, s_t, \mathbf{o}_t\}$ be the full history, containing all the past
states, joint observations, and joint actions. Let $\sigma(\hat{h}_t) = \mathbf{h}_t = (\mathbf{o}_1, \ldots, \mathbf{o}_t)$ be the history of joint observations,
and $\sigma_i(\hat{h}_t) = h^i_t = (o^i_1, \ldots, o^i_t)$ be the history of observations for agent i. (This typically also includes the
actions chosen by agent i.)
We define the expected return for agent i under joint policy π by

$U_i(\boldsymbol{\pi}) = \sum_{\hat{h}_t} p(\hat{h}_t|\boldsymbol{\pi})\, u_i(\hat{h}_t) \quad (5.11)$
and $u_i(\hat{h}_t)$ is the discounted actual return for agent i in a given full history:

$u_i(\hat{h}_t) = \sum_{k=0}^{t-1} \gamma^k R_i(s_k, \mathbf{a}_k, s_{k+1}) \quad (5.13)$

where $s(\hat{h})$ extracts the last state from $\hat{h}$. With this, we can define the expected return as in Equation (5.11).
Finally, we define the best response policy for agent i as the one that maximizes the expected return
for agent i against a given set of policies for all the other agents, $\boldsymbol{\pi}_{-i} = (\pi_1, \ldots, \pi_{i-1}, \pi_{i+1}, \ldots, \pi_n)$. That is,
$\mathrm{BR}_i(\boldsymbol{\pi}_{-i}) = \operatorname{argmax}_{\pi_i} U_i((\pi_i, \boldsymbol{\pi}_{-i}))$.
5.2.2 Minimax
The minimax solution is defined for two-agent zero-sum games. Its existence for normal-form games was
first proven by John von Neumann in 1928. We say that joint policy π = (πi , πj ) is a minimax solution if
$U_i(\boldsymbol{\pi}) = \max_{\pi'_i} \min_{\pi'_j} U_i(\pi'_i, \pi'_j) \quad (5.18)$
$\qquad\;\; = \min_{\pi'_j} \max_{\pi'_i} U_i(\pi'_i, \pi'_j) \quad (5.19)$
In other words, π is a minimax solution iff πi ∈ BRi (πj ) and πj ∈ BRj (πi ). We can solve for the minimax
solution using linear programming.
Minimax solutions also exist for two-player zero-sum stochastic games with finite episode lengths, such
as chess and Go. In the case of perfect information games, the problems are Markovian, and so dynamic
programming can be used to solve them. In general this may be slow, but minimax search (a depth-limited
version of DP that requires a heuristic function) can be used.
5.2.3 Exploitability
In the case of a 2 player zero-sum game, we can measure how close we are to a minimax solution by computing
the exploitability score, defined as
$\mathrm{exploitability}(\boldsymbol{\pi}) = \frac{1}{2}\left[ \max_{\pi'_1} J(\pi'_1, \pi_2) - \min_{\pi'_2} J(\pi_1, \pi'_2) \right] \quad (5.21)$
where J(π) is the expected reward for player 1 (which is the loss for player 2). Exploitability is the average,
over both players i ∈ {1, 2}, of the expected return that a best response achieves when playing against πi. Joint policies with
exploitability zero are Nash equilibria (see Section 5.2.4).
Note that computing the exploitability score requires computing a best response to any given policy,
which can be hard in general. However, one can use standard deep RL methods for this (see e.g., [Tim+20]).
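For a small matrix game, however, the best responses are pure strategies and the exploitability can be computed exactly. The following sketch does this for rock-paper-scissors (payoffs as in Table 5.1); it is only an illustration of Equation (5.21).

```python
import numpy as np

# Payoff matrix for player 1 in rock-paper-scissors (zero-sum: player 2 receives -A).
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def exploitability(pi1, pi2):
    """0.5 * [max_{pi1'} J(pi1', pi2) - min_{pi2'} J(pi1, pi2')], cf. Equation (5.21).
    In a matrix game the max/min are attained by pure strategies (rows/columns)."""
    best_vs_pi2 = (A @ pi2).max()        # best payoff player 1 can get against pi2
    worst_for_pi1 = (pi1 @ A).min()      # payoff to player 1 when player 2 best responds
    return 0.5 * (best_vs_pi2 - worst_for_pi1)

uniform = np.ones(3) / 3
print(exploitability(uniform, uniform))                 # 0.0: uniform play is the minimax NE
print(exploitability(np.array([1., 0., 0.]), uniform))  # 0.5: always playing rock is exploitable
```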
John Nash proved the existence of such a solution for general-sum non-repeated normal form games in 1950,
although computing such equilibria is computationally intractable in general.
Below we discuss the kinds of equilibria that exist for the games shown in Table 5.1.
• For the rock-paper-scissors game, the only NE is the mixed strategy (i.e., stochastic policy) where
each agent chooses actions uniformly at random, so πi = (1/3, 1/3, 1/3). This yields an expected return
of 0.
• For the battle-of-the sexes game, there are two pure strategy Nash equilibria: (Opera, Opera) and
(Football, Football). There’s also a mixed strategy equilibrium but it involves randomness and gives
lower expected payoffs.
• For the Prisoner’s Dilemma game, the only NE is the pure strategy of (D,D), which yields an expected
return of (-3,-3). Note that this is worse than the maximum possible expected return, which is (-1,-1)
given by the strategy of (C,C). However, such a strategy is not an NE, since each player could improve
its return if it unilaterally deviates from it (i.e., defects on its partner).
Interestingly, it can be shown that two agents that use rational (Bayesian) learning rules to update their
beliefs about the opponent’s strategy (based on observed outcomes of earlier games), and then compute a
best response to this belief, will eventually converge to Nash equilibrium [KL93].
        C         D
  A  100,100     0,0
  B    1,2       1,1          (5.24)
The unique NE is (A,C), but the ϵ-NE with ϵ = 1 is either (A,C) or (B,D), which clearly have very different
rewards.
Despite the above drawback, much computational work focuses on approximate Nash equilibria. Indeed
we can measure the rate of convergence to such a state by defining
$\mathrm{NashConv}(\boldsymbol{\pi}) = \sum_i \delta_i(\boldsymbol{\pi}) \quad (5.25)$
where δi (π) is the amount of incentive that i has to deviate to one of its best responses away from the joint
policy:
$\delta_i(\boldsymbol{\pi}) = u_i(\pi^b_i, \boldsymbol{\pi}_{-i}) - u_i(\boldsymbol{\pi}), \qquad \pi^b_i \in \mathrm{BR}(\boldsymbol{\pi}_{-i}) \quad (5.26)$
where ∆(A) is the action simplex, and q is the action-value function. If this holds for all states (decision
points) s, we say that π is α-soft optimal in the behavioral sense.
For two-player zero-sum NFGs (which have a single decision point or state), we say that a policy is a
QRE if each player’s policy is soft optimal in the normal sense conditioned on the other player not changing
its policy [MP95]. For two-player zero-sum games with multiple states (i.e., EFGs), we say that a policy is a
agent QRE if each player’s policy is soft optimal in the behavioral sense conditioned on the other player not
changing its policy [MP98].
That is, player i has no incentive to deviate from the recommendation, after receiving it. It can be shown
that the set of correlated equilibria contains the set of Nash equilibria. In particular, Nash equilibrium
is a special case of correlated equilibrium in which the joint policy $\pi^c$ is factored into independent agent
policies, $\pi^c(\mathbf{a}) = \prod_i \pi_i(a_i)$.
To see how a correlated equilibrium can give higher returns than a Nash equilibrium, consider the Chicken
game. This models two agents that are driving towards each other. Each agent can either stay on course (S)
or leave (L) and avoid a crash. The payoff matrix is as follows:

        S      L
  S    0,0    7,2
  L    2,7    6,6          (5.29)
This reflects the fact that if they both stay on course, then they both die and get reward 0; if they both
leave, they both survive and get reward 6; but if player i chooses to stay and the other one leaves, then i gets
a reward of 7 for being brave, and −i only gets a reward of 2 for chickening out.
We can represent πi by the scalar πi(S), since πi(L) = 1 − πi(S). Hence π can be defined by the
tuple (π1, π2). There are 3 uncorrelated NEs: π = (1, 0) with return (7, 2); π = (0, 1) with return (2, 7);
and π = (1/3, 1/3) with return (4.66, 4.66). Consider the CE with $\pi^c(L, L) = \pi^c(S, L) = \pi^c(L, S) = 1/3$ and
$\pi^c(S, S) = 0$. The central policy has an expected return of

$7 \cdot \tfrac{1}{3} + 2 \cdot \tfrac{1}{3} + 6 \cdot \tfrac{1}{3} = 5 \quad (5.30)$
which we see is higher than the NE of 4.66. This is because it avoids the deadly joint (S,S) action. To show
that this is a CE, consider the case where i (e.g., row player) receives recommendation L; they know that j
(column player) will choose either S or L with probability 0.5 (because the central policy is uniform). If i
sticks with the recommendation, its expected return is 0.5 · 2 + 0.5 · 6 = 4; this is greater than deviating from
the recommendation and picking S, which has expected return of 0.5 · 0 + 0.5 · 7 = 3.5. Thus π c is a CE.
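The following short sketch checks this argument numerically; the payoff arrays simply transcribe Equation (5.29).

```python
import numpy as np

# Chicken payoffs: entry [i, j] is the reward to the row player for joint action (i, j), with S=0, L=1.
R_row = np.array([[0., 7.], [2., 6.]])

# Central (correlated) policy: uniform over (S,L), (L,S), (L,L); zero mass on (S,S).
pi_c = np.array([[0., 1/3], [1/3, 1/3]])

print((pi_c * R_row).sum())                # 5.0, higher than the mixed NE value of 4.66

# Incentive check when the row player is recommended L (row index 1):
p_col_given_L = pi_c[1] / pi_c[1].sum()    # column player plays S or L with prob 0.5 each
follow  = R_row[1] @ p_col_given_L         # expected return for obeying (play L): 4.0
deviate = R_row[0] @ p_col_given_L         # expected return for deviating to S: 3.5
print(follow >= deviate)                   # True, so the recommendation is self-enforcing
```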
In [Aum87] they show that the CE solution corresponds to the behavior of a rational Bayesian agent.
The correlated equilibrium solution can be computed via linear programming.
Figure 5.4: Space of (discretized) joint policies for Chicken game. From Fig 4.4 of [ACS24]. Used with kind permission
of Stefano Albrecht.
To further constrain the space of desirable solutions, we can consider additional concepts. For example,
we define welfare optimality as

$W(\boldsymbol{\pi}) = \sum_i U_i(\boldsymbol{\pi}) \quad (5.32)$
A joint policy is welfare-optimal if π ∈ argmaxπ′ W (π ′ ). One can show that welfare optimality implies Pareto
optimality, but not (in general) vice versa.
Similarly, we define fairness optimality as

$F(\boldsymbol{\pi}) = \prod_i U_i(\boldsymbol{\pi}) \quad (5.33)$
In the battle-of-the-sexes game in Table 5.1, the only fair outcome is the joint distribution with Pr(F, F) =
Pr(O, O) = 0.5, which means the couple spend half their time watching football and half going to the opera.
In the Chicken game in Figure 5.4, there is only one solution that is both welfare-optimal and fairness-
optimal, namely the joint policy with expected return of (6,6). Note, however, that this is not a Nash
policy.
5.2.11 No regret
The quantity known as regret measures the difference between the rewards an agent received versus the
maximum rewards it could have received if it had chosen a different action. For a non-repeated normal-form
general-sum game, played over E episodes, this is defined as
$\mathrm{Regret}^E_i = \max_{a^i} \sum_{e=1}^{E} \left[ R_i((a^i, \mathbf{a}^{-i}_e)) - R_i(\mathbf{a}_e) \right] \quad (5.34)$
We can generalize the definition of regret to stochastic games and POSGs by defining the regret over policies
instead of actions. That is,
$\mathrm{Regret}^E_i = \max_{\pi^i} \sum_{e=1}^{E} \left[ U_i((\pi^i, \boldsymbol{\pi}^{-i}_e)) - U_i(\boldsymbol{\pi}_e) \right] \quad (5.35)$
In all these cases, an agent is said to have no regret if

$\forall i.\ \lim_{E \to \infty} \frac{1}{E}\, \mathrm{Regret}^E_i \leq 0 \quad (5.36)$
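As a concrete illustration, the following sketch computes the external regret of Equation (5.34) for a repeated matrix game; the rock-paper-scissors play history is made up.

```python
import numpy as np

def external_regret(payoff, my_actions, opp_actions):
    """Regret^E_i = max_a sum_e [R_i(a, a_e^{-i}) - R_i(a_e)], cf. Equation (5.34).
    payoff[i, j] is this agent's reward for playing i against opponent action j."""
    realized = sum(payoff[a, b] for a, b in zip(my_actions, opp_actions))
    best_fixed = max(sum(payoff[a, b] for b in opp_actions)
                     for a in range(payoff.shape[0]))
    return best_fixed - realized

# rock-paper-scissors: we always played rock (0) while the opponent always played paper (1)
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(external_regret(A, my_actions=[0] * 10, opp_actions=[1] * 10))  # 20: scissors would have won every round
```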
5.3 Algorithms
In this section, we discuss various MARL algorithms.
5.3.2.2 Independent Actor Critic
Instead of using value-based methods, we can also use policy learning methods. The multi-agent version of
the policy gradient theorem in Equation (3.6) is the following (see e.g., [ACS24] for derivation):
$\nabla_{\theta_i} J(\theta_{1:n}) \propto \mathbb{E}_{\hat{h} \sim p(\hat{h}|\boldsymbol{\pi}),\, a^i \sim \pi_i,\, \mathbf{a}^{-i} \sim \boldsymbol{\pi}_{-i}}\left[ Q^{\boldsymbol{\pi}}_i(\hat{h}, (a^i, \mathbf{a}^{-i}))\, \nabla_{\theta_i} \log \pi(a^i \mid h^i = \sigma_i(\hat{h}); \theta_i) \right] \quad (5.37)$
In practice, we replace the Q function with the advantage Adv, as we discussed in Section 3.2.1. This can be used inside a multi-agent version
of the advantage actor critic or A2C method (known as MAA2C) shown in Algorithm 18. (To combat the
fact that we cannot use replay buffers with an on-policy method, we assume instead that we can parallelize
over multiple (synchronous) environments, to ensure we have a sufficiently large minibatch to estimate the
loss function at each step.)
Algorithm 18: Multi-agent Advantage Actor-Critic (MAA2C)
1 Initialize n actor networks with random parameters θ1, . . . , θn
2 Initialize n critic networks with random parameters w1, . . . , wn
3 Initialize K parallel environments
4 Initialize histories h^{i,k}_0 for each agent i and environment k
5 for time step t = 0, 1, . . . do
6   for environment k = 1, . . . , K do
7     Sample actions: {a^{i,k}_t ∼ π(·|h^{i,k}_t; θi)}^n_{i=1}
8     Sample next state: s^k_{t+1} ∼ T(·|s^k_t, a^{1:n,k}_t)
9     Sample observations: {o^{i,k}_{t+1} ∼ Oi(·|s^k_{t+1}, a^{1:n,k}_t)}^n_{i=1}
10    Sample rewards: {r^{i,k}_t ∼ Ri(·|s^k_t, a^{1:n,k}_t, s^k_{t+1})}^n_{i=1}
11    Update histories: {h^{i,k}_{t+1} = (h^{i,k}_t, o^{i,k}_{t+1})}^n_{i=1}
12  for agent i = 1, . . . , n do
13    if s^k_{t+1} is terminal then
14      Adv(h^{i,k}_t, a^{i,k}_t) ← r^{i,k}_t − V(h^{i,k}_t; wi)
15      Critic target y^{i,k}_t ← r^{i,k}_t
16    else
17      Adv(h^{i,k}_t, a^{i,k}_t) ← r^{i,k}_t + γ V(h^{i,k}_{t+1}; wi) − V(h^{i,k}_t; wi)
18      Critic target y^{i,k}_t ← r^{i,k}_t + γ V(h^{i,k}_{t+1}; wi)
19    Actor loss: L(θi) ← (1/K) Σ^K_{k=1} Adv(h^{i,k}_t, a^{i,k}_t) log π(a^{i,k}_t | h^{i,k}_t; θi)
20    Critic loss: L(wi) ← (1/K) Σ^K_{k=1} ( y^{i,k}_t − V(h^{i,k}_t; wi) )^2
5.3.2.3 Independent PPO
We can implement an independent version of PPO (known as IPPO) in a similar way, by updating all the
policies in parallel [Wit+20].
The expression Uj (α, β) is analogous, with rij replaced with cij . (Note that computing Ui requires knowledge
of Ri but also of πj (and vice versa for computing Uj ); we will relax this assumption below.) Finally, we can
learn the policies using gradient ascent:
Figure 5.5: Learning dynamics of two policies for the rock-paper-scissors game. The upper and lower triangles illustrate
the policies for each agent over time as a point in the 2d simplex (noting that πei (S) = 1 − (πei (P ) + πei (R)), where S is
the scissors action, P is paper action, and R is rock action). The update rule is π e+1 = LR(De , π e ), where LR is the
learning rule known as WoLF-PHC (see Section 5.3.2.4). From Fig 5.5 of [ACS24]. Used with kind permission of
Stefano Albrecht.
In this section, we describe the Cicero system from [Met+22], which achieved human-level performance in
the complex natural-language 7-player strategy game called Diplomacy, which requires both cooperative
and competitive behavior. The method used CTDE, combining an LLM for generating and interpreting
dialog with a mix of self-play RL, imitation learning, opponent modeling, and policy generation using regret
minimization. The system uses imitation learning on human games to warm-start the initial policy and
language model, which are then refined using RL with self-play. The system uses explicit belief state modeling
over the opponents' intents and plans; this is learned via supervised learning over past dialogues and game
outcomes, and refined during self-play.
where $A^*(\mathbf{h}, c; \boldsymbol{\theta}) = \operatorname{argmax}_{\mathbf{a}} Q(\mathbf{h}, c, \mathbf{a}; \boldsymbol{\theta})$ and $A^*_i(h^i; \theta_i) = \operatorname{argmax}_{a^i} Q(h^i, a^i; \theta_i)$. This ensures that picking
actions locally for each agent will also be optimal globally.
5.3.4.1 Value decomposition network (VDN)
For example, consider the value decomposition network or VDN method of [Sun+17]. This assumes a
linear decomposition

$Q(\mathbf{h}_t, c_t, \mathbf{a}_t; \boldsymbol{\theta}) = \sum_i Q(h^i_t, a^i_t; \theta_i) \quad (5.44)$
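A minimal sketch of this decomposition is shown below; the per-agent utility networks are plain linear layers and the sizes are illustrative.

```python
import torch, torch.nn as nn

n_agents, obs_dim, n_actions = 3, 10, 5
# One utility network per agent, Q_i(h^i, a^i), cf. Equation (5.44).
agent_qs = nn.ModuleList(nn.Linear(obs_dim, n_actions) for _ in range(n_agents))

def vdn_joint_q(obs, actions):
    """Joint value = sum of the chosen per-agent utilities (linear decomposition)."""
    return sum(agent_qs[i](obs[i])[actions[i]] for i in range(n_agents))

def greedy_joint_action(obs):
    """Because the mixing is a monotone sum, per-agent argmax is also the joint argmax (IGM)."""
    return [int(agent_qs[i](obs[i]).argmax()) for i in range(n_agents)]

obs = [torch.randn(obs_dim) for _ in range(n_agents)]
a = greedy_joint_action(obs)
print(a, vdn_joint_q(obs, a))
```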
5.3.4.2 QMIX
A more general method, known as QMIX, is presented in [Ras+18]. This assumes
where fmix is a neural network that is constructed so that it is monotonically increasing in each of its
arguments. (This is ensured by requiring all the weights of the mixing network to be non-negative; the weights
themselves are predicted by another “hyper-network”, conditioned on the state $h^i_t$.) This satisfies IGM since

$\max_{\mathbf{a}} Q(\mathbf{h}_{t+1}, c_{t+1}, \mathbf{a}; \boldsymbol{\theta}) = f_{\mathrm{mix}}\left( \max_{a^1} Q(h^1_{t+1}, a^1; \theta_1), \ldots, \max_{a^n} Q(h^n_{t+1}, a^n; \theta_n) \right) \quad (5.46)$

Hence we can fit this Q function by minimizing the TD loss (with target network $\overline{\boldsymbol{\theta}}$):

$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{|B|} \sum_{(\mathbf{h}_t, c_t, \mathbf{a}_t, r_t, \mathbf{h}_{t+1}, c_{t+1}) \in B} \left( r_t + \gamma \max_{\mathbf{a}} Q(\mathbf{h}_{t+1}, c_{t+1}, \mathbf{a}; \overline{\boldsymbol{\theta}}) - Q(\mathbf{h}_t, c_t, \mathbf{a}_t; \boldsymbol{\theta}) \right)^2 \quad (5.47)$
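The following sketch shows a simplified monotonic mixer in this spirit: a state-conditioned hypernetwork outputs mixing weights that are made non-negative via an absolute value, so the mixed value is monotone in each agent utility. It omits the biases and other details of the actual QMIX architecture.

```python
import torch, torch.nn as nn

class MonotonicMixer(nn.Module):
    """Simplified QMIX-style mixer (illustrative only)."""
    def __init__(self, n_agents, state_dim, hidden=32):
        super().__init__()
        self.w1 = nn.Linear(state_dim, n_agents * hidden)   # hypernetwork for layer-1 weights
        self.w2 = nn.Linear(state_dim, hidden)               # hypernetwork for layer-2 weights
        self.hidden = hidden

    def forward(self, agent_qs, state):
        # agent_qs: (B, n_agents) per-agent utilities; state: (B, state_dim) central information
        B = agent_qs.shape[0]
        w1 = self.w1(state).abs().view(B, -1, self.hidden)   # abs() => non-negative weights
        w2 = self.w2(state).abs().view(B, self.hidden, 1)
        h = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1)) # (B, 1, hidden)
        return torch.bmm(h, w2).view(B)                      # (B,) mixed Q_tot

mixer = MonotonicMixer(n_agents=3, state_dim=8)
print(mixer(torch.randn(4, 3), torch.randn(4, 8)).shape)     # torch.Size([4])
```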
where p(s′ |s, ai , aj ) is the world model, π is i’s policy, and ψ(s) performs the role reversal.3 This is known as
self-play, and lets us learn πi using standard policy learning methods, such as PPO or policy improvement
based on search (decision-time planning). This latter approach is the one used by the AlphaZero method
(see Section 4.2.2.1), which was applied to zero-sum perfect information games like Chess and Go. In such
settings, self-play can be proved to converge to a Nash equilibrium.
Unfortunately, in general games (e.g., for imperfect information games, such as Poker or Hanabi, or for
general sum games), self-play can lead to oscillating strategies or cyclical behavior, rather than converging
to a Nash equilibrium, as we discuss in Section 5.3.2.4. Thus self-play can result in policies that are easily
exploited. We discuss more stable learning methods below.
3 For example, in a board game like chess we can represent the state as s = (x, y), where x is a vector containing the location of player 1’s pieces (or -1 if they are removed), and y is a vector containing the opponent’s pieces. Thus the policy for
player 1, π1 , just needs to access the x part of the state. For player 2, we can transform the state vector into s′ = ψ(s) = (y, x)
and then apply π1 to choose actions, so π2 (s) = π1 (ψ(s)).
Figure 5.6: Encoder-decoder architecture for agent modeling. From Fig 9.21 of [ACS24]. Used with kind permission of
Stefano Albrecht.
example, we can train an encoder-decoder network to predict the actions of other agents via a bottleneck,
and then pass this bottleneck embedding to the policy as side information, as proposed in [PCA21]. In more
detail, let $m^i_t = f^e(h^i_t; \psi^e_i)$ be the encoding of i's history, which is then passed to the decoder $f^d$ to predict
the other agents' actions using $\hat{\boldsymbol{\pi}}^{-i}_t = f^d(m^i_t; \psi^d_i)$. In addition, we pass $m^i_t$ to i's policy to compute the action
using $a^i_t \sim \pi(\cdot|h^i_t, m^i_t; \theta_i)$. This is illustrated in Figure 5.6. (A similar method could be used to predict other
aspects of the other agents' behavior.)
If there are a large number of actions, we can approximate the sum over a^{-i} using Monte Carlo sampling. The only thing left is to specify how to learn the opponent policies. We discuss this below.
In fictitious play, each agent i estimates the policies of the other players, based on their past actions. It then computes a best response. For example, imagine you're playing rock-paper-scissors repeatedly. If you notice your opponent has played “rock” 60% of the time so far, your best response is to play “paper” more often. You adjust your strategy based on the empirical frequency of their past moves. The method is called “fictitious” because each player acts as if the opponents are playing a fixed strategy, even though they're actually adapting.
The method was originally developed for non-repeated normal-form games (which are stateless, so h^i = [] and Q_i(a) = R_i(a)). In this case, we can estimate the policies by counting and averaging. That is,
π̂_j^t(a_j) = C_j^t(a_j) / ∑_{a'} C_j^t(a')    (5.51)
where C_j^t(a_j) is the number of times agent j chose action a_j in episodes up to step t. We then compute the best response, given by
BR_i^t = argmax_{a_i} ∑_{a_{-i}} R_i(a_i, a_{-i}) ∏_{j≠i} π̂_j^t(a_j)    (5.52)
Equivalently, we can say that π_i^t is the best response to π̄^{t−1} = avg(π^1, . . . , π^{t−1}). For two-player zero-sum finite games, this procedure will converge to a NE. That is, the exploitability of the average policy π̄^t generated by FP converges to zero as t grows large.
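As an illustration, here is a minimal fictitious-play loop for rock-paper-scissors (plain NumPy; the payoff matrix, smoothing prior, and loop length are the only assumptions): each player best-responds to the empirical frequency of the opponent's past actions, and the time-averaged strategies approach the uniform Nash equilibrium.

import numpy as np

# Row player's payoff matrix for rock-paper-scissors (order R, P, S); the column player's payoff is -R.
R = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

opp_counts = [np.ones(3), np.ones(3)]    # opp_counts[i][a]: times player i has seen its opponent play a
own_counts = [np.zeros(3), np.zeros(3)]  # empirical frequencies of each player's own actions

for t in range(5000):
    pi_hat = [c / c.sum() for c in opp_counts]   # Equation (5.51): empirical opponent policies
    a0 = int(np.argmax(R @ pi_hat[0]))           # Equation (5.52): best response of player 0
    a1 = int(np.argmax(-R.T @ pi_hat[1]))        # best response of player 1 (its payoff is -R)
    opp_counts[0][a1] += 1
    opp_counts[1][a0] += 1
    own_counts[0][a0] += 1
    own_counts[1][a1] += 1

print(own_counts[0] / own_counts[0].sum())       # both averages approach [1/3, 1/3, 1/3]
print(own_counts[1] / own_counts[1].sum())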
where D_t is the replay buffer containing previous states and actions of all the players. In addition, we use DQN to learn Q_i(h^i, a) for each agent. We then use this learned average policy, plus the Q functions, to compute AV_i, and hence the best response.
In the zero-sum two-player case, we can use self-play, so we just assume π^{-i} is the opposite of π^i, which can be learned by supervised learning applied just to its own states and actions. This is called fictitious self-play [HLS15]. It was extended to the neural net case in [HS16], who call it neural fictitious self-play. If the Q function converges to the optimal function, then this process converges to a NE, like standard FP.
k    Π_1^k    Π_2^k    σ_1^k    σ_2^k    π_1'    π_2'
1    R        P        1        1        S       P
2    R,S      P        (0, 1)   1        S       R
Figure 5.7: PSRO for rock-paper-scissors. We show the populations Π_i^k, distributions σ_i^k and best responses π_i' for both agents over generations k = 1 : 5. (We use the shorthand of R to denote the pure policy that always plays R, and similarly for S and P.) New entries added to the population are shown with an underline. At generation k = 3, there are 2 best responses, R and P, so the oracle can select either of them. Similarly, at generation k = 5, there are 3 best responses.
RL methods, such as [OA14]. In this approach, in each episode the policies of the other agents j ̸= i are
sampled from πj ∼ σjk , and then standard RL is applied.
It can be shown that, if PSRO uses a meta-solver that computes exact Nash equilibria for the meta-game,
and if the oracle computes the exact best-response policies in the underlying game G, then the distributions
{σik }i∈I converge to a Nash equilibrium of G. See Figure 5.7 for an example.
Note that PSRO can also be applied to general-sum, imperfect information games. For example, [Li+23]
uses (information set) MCTS, together with a learned world model, to compute the best response policy πi′
at each step of PSRO.
game, which can be solved using deep RL with self-play. This is different from the StarCraft Multi-Agent Challenge (SMAC)
[Sam+19], which is a cooperative, partially observed multi-agent game, where the agents (corresponding to individual units)
must work together as a team to defeat a fixed AI opponent in certain predefined battles.
We can decompose this as
η^π(τ) = η_i^π(τ) η_{-i}^π(τ)    (5.55)
Similarly, define η^π(τ, z) as the probability of the trajectory τ = (s_0, . . . , s_t) followed by z = (s_{t+1}, . . . , s_T), which is some continuation that ends in a terminal state. Let Z(h^i) = {(τ, z)} be the set of trajectories and their terminal extensions which are compatible with h^i, in the sense that O_i(τ) = h^i, and which end in a terminal state. Also, let τa^i be the trajectory τ followed by action a^i. We then define the counterfactual state-action value for an information state as
q_{π,i}^c(h^i, a^i) = ∑_{(τ,z)∈Z(h^i)} η_{-i}^π(τ) η^π(τa^i, z) u_i(z)    (5.56)
r_i^k(h^i, a^i) = q_{π^k,i}^c(h^i, a^i) − v_{π^k,i}^c(h^i)    (5.58)
Note that this is the counterfactual version of an advantage function, as explained in [Sri+18]. Similarly, we define the cumulative counterfactual regret to be
R_i^k(h^i, a^i) = ∑_{j=0}^k r_i^j(h^i, a^i)    (5.59)
CFR starts with a uniform random joint policy π^0 and then updates it at each iteration by performing regret matching [HMC00]. That is, letting R_i^{k,+}(h^i, a^i) = max(R_i^k(h^i, a^i), 0) denote the positive part of the cumulative regret, it updates the policy as follows:
π_i^{k+1}(h^i, a^i) = R_i^{k,+}(h^i, a^i) / ∑_{a∈A_i(h^i)} R_i^{k,+}(h^i, a)    if the denominator is positive
π_i^{k+1}(h^i, a^i) = 1 / |A_i(h^i)|                                            otherwise    (5.60)
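A minimal sketch of the regret-matching update at a single information state (NumPy; the cumulative-regret values are placeholders): the positive parts of the cumulative regrets are normalized to give the next policy, with a uniform fallback when no regret is positive, exactly as in Equation (5.60).

import numpy as np

def update_cumulative_regret(cum_regret, q_cf, v_cf):
    # Equations (5.58)-(5.59): accumulate the instantaneous counterfactual regret r = q^c - v^c.
    return cum_regret + (q_cf - v_cf)

def regret_matching_policy(cum_regret):
    # Equation (5.60): normalize positive cumulative regrets; otherwise play uniformly.
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    if total > 0:
        return pos / total
    return np.ones_like(cum_regret) / len(cum_regret)

cum_regret = np.array([2.0, -1.0, 0.5])          # regrets for 3 actions at some information state
print(regret_matching_policy(cum_regret))        # -> [0.8, 0.0, 0.2]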
5.3.10 Regularized policy gradient methods
In this section, we discuss policy gradient methods that incorporate a regularization term to ensure convergence,
even in adversarial settings, such as 2p0s games.
If we drop the magnet term, by setting α = 0, the method is equivalent to the mirror descent policy
optimization or MDPO algorithm of [Tom+20]. In this case, the optimal solution is given by
5.3.10.2 PPO
The MMD method of Section 5.3.10.1 is very similar to the PPO algorithm of Section 3.3.3. In particular, the KL-penalized version of PPO uses the following loss:
E_{s_t,a_t} [ (π(a_t|s_t)/π_old(a_t|s_t)) A_old(s_t, a_t) + α H(π(·|s_t)) − β D_KL(π_old(·|s_t), π(·|s_t)) ]    (5.64)
where A_old(s, a) = q_old(s, a) − v_old(s) is the advantage function. By comparison, if we use a uniform magnet for ρ, the MMD loss in Equation (5.61) becomes
E_{s_t,a_t} [ π(a_t|s_t) q_old(s_t, a_t) + α H(π(·|s_t)) − β D_KL(π(·|s_t), π_old(·|s_t)) ]    (5.65)
where β = 1/η is the inverse stepsize. The main difference between these equations is just the use of a reverse KL instead of a forwards KL. (The two expressions also differ by the scaling factor 1/π_old(a_t|s_t) and the offset term v_old(s_t).)
Despite the similarities, PPO has been shown in [Sok+22] to perform worse than MMD on various 2p0s games. One possible reason for PPO's poor performance is its use of a forwards rather than reverse KL penalty. However, [HMDH20] compared the use of reverse KL regularization instead of forward KL in PPO on Mujoco, and found that the two yielded similar performance. The explanation suggested in [Rud+25] is simply that the hyper-parameters in PPO (in particular, the entropy penalty α) were not tuned properly for the 2p0s setting (the latter tending to require much larger values, such as 0.05-2.0, whereas single agent PPO implementations usually use 0-0.01).
They experimentally tested this hypothesis by comparing PPO with various other algorithms (including MMD, CFR (Section 5.3.9), PSRO (Section 5.3.8.1) and NFSP (Section 5.3.7.2)) on a set of imperfect information games (partially observed or “phantom”/“dark” versions of Tic-Tac-Toe and 3x3 Hex, where the agents' actions are invisible to the non-acting player). They found that properly tuned policy gradient methods (including both PPO and MMD) performed the best, in terms of having the lowest exploitability scores. (The exploitability score is defined in Equation (5.21), and was computed by exactly solving for the optimal opponent policy given a candidate learned policy.)
The above experimental result led the authors of [Rud+25] to propose the following “Policy Gradient
Hypothesis”:
Appropriately tuned policy gradient methods that share an ethos with magnetic mirror descent
are competitive with or superior to model-free deep reinforcement learning approaches based on
fictitious play, double oracle [population-based training], or counterfactual regret minimization in
two-player zero-sum imperfect-information games.
If true, this hypothesis would be very useful, since it means we can use standard single agent policy
gradient methods, such as (suitably tuned) PPO, for multiplayer games, both cooperative (see [Yu+22]) and
adversarial (see [Rud+25]).
One way to estimate the action values q for the current state is to perform Monte Carlo search or MCS
[TG96], which unrolls possible futures using the current policy, as in DTP. Thus with enough samples, DTP
(with the correct world model, and a suitable exploratory policy) combined with this update will give the
same results as (asynchronous) policy iteration.
where π is the previous local (blueprint) policy and ρ is the local magnet policy (which can be taken as uniform). If h_t^i is the current state (root of the search tree for player i), and the actions are discrete, we can equivalently perform an SGD step on the following parametric policy loss:
L(θ) = ∑_a [ π_θ(a|h_t^i) q(h_t^i, a) − α π_θ(a|h_t^i) log (π_θ(a|h_t^i)/ρ(a)) − (1/η) π_θ(a|h_t^i) log (π_θ(a|h_t^i)/π_old(a|h_t^i)) ]    (5.69)
Algorithm 19: Magnetic Mirror Descent Search (MMDS)
1 Input: current state h_t^i, joint policy π, agent id i
2 q[a] = 0, N[a] = 0 for each action a ∈ A_i
3 repeat
4   Sample current world state using agent's local belief state: s_t ∼ P_π(·|h_t^i)
5   for a ∈ A_i do
6     Sample return G_{≥t} ∼ P_π(G_{≥t}|s_t, a) by rolling out π in the world model starting at s_t
7     q[a] = q[a] + G_{≥t}
8     N[a] = N[a] + 1
9   q[a] = q[a]/N[a] for a ∈ A_i
10  Return U(π^i(h_t^i), q) by performing SGD on Equation (5.69)
11 until search budget exhausted
To implement this algorithm, we need to sample from P_π(s_t|h_t^i), which is the distribution over world states given agent i's local history. One approach to this is to use particle filtering (cf. [Lim+23]).
Another approach is to train a belief model to predict the other player's private information, and the underlying environment state, given the current player's history, i.e., we learn to predict P(s_t, {h_t^j}|h_t^i). In the
learned belief search (LBS) method of [Hu+21] (which was designed for Hanabi, which is a Dec-POMDP),
rather than predicting the entire action-observation history for each agent, they just predict the private
information (card hand) for each agent (represented as a sequence of tokens). This can be used (together with
the shared public information) to reconstruct the environment state. They train this model (represented as a
seq2seq LSTM) using supervised learning, where agent i learns to predict its own private information given
its public history. At test time, agent i uses j’s public history as input to its model to sample j’s private
information. (This assumes that j is using the same blueprint policy to choose actions that i used during
training.) Given the imputed private information, it then reconstructs the environment state and performs
rollouts, using the joint blueprint policy, in order to locally improve its own policy.
5.3.11.3 Experiments
In [Sok+23], they implemented the above method and applied it to several imperfect information games (using the true, known world model). For the common-reward game of Hanabi (5 card and 7 card variants), they used PPO to pretrain the blueprint policy, and they pretrained a seq2seq belief model. At run time, they use 10k samples for each step of MMDS to locally improve the policy (which takes about 2 seconds). They observed modest gains over rival methods. For the 2p0s games, they used the partially observed (dark/phantom) versions of 3x3 Hex and Tic-Tac-Toe. For belief state estimation, they use a particle filter with just 10 particles, for speed. As a blueprint policy they consider uniform and MMD (trained for 1M steps). They find that MMDS can improve the blueprint, and this combination beats baselines such as PPO and NFSP. They also compare to MMD as a baseline. For the MMD-1M baseline, the blueprint matches the baseline (by construction), but the MMDS version beats it. However, the MMD-10M baseline beats MMDS, showing that enough offline computation can outperform a limited amount of online computation.
5.3.11.4 Open questions
It is an interesting open question how well this MMDS method will work when the world model needs to be learned, since this results in rollout errors, as discussed in Section 4.3.1. Similarly, errors in the belief state approximation may adversely affect the estimate of q for the root node.
In addition, it is an open question to prove convergence properties of the generalized version of MMDS, which uses more than just action-value feedback. For example, MCTS updates the local policy at internal nodes, not just the root node. In some cases, MCTS can work better than simple MCS, although this is not always the case (see e.g., [Ham+21]).
Chapter 6
LLMs and RL
In this section, we discuss connections between RL and foundation models, also called large language
models (LLMs). These are generative models (usually transformers) which are trained on large amounts of
text and image data (see e.g., [Bur25; Bro24]).1 More details on the connections between RL and LLMs can
be found in e.g., [Pte+24].
1 Such models are also called foundation models or large multimodal models, but we stick to the term LLM for simplicity.
6.1.1.2 Modeling and training method
In the standard RL for LLM setup, there is just a single state, namely the input prompt s; the action is a
sequence of tokens generated by the policy in response, and then the game ends. This is equivalent to a
contextual bandit problem, with sequence-valued input (context) and output (action). (We consider the full
multi-turn case in Section 6.1.5.) Formally, the goal is to maximize
J(θ) = E_{s_0∼D} E_{a∼π_θ(·|s_0)} [R(s_0, a)]    (6.1)
where s_0 is the context/prompt (sampled from the dataset), and a is the generated sequence of actions (tokens) sampled from the policy:
π_θ(a|s_0) = ∏_{t=1}^T π_θ(a_t|a_{1:t−1}, s_0)    (6.2)
with initial distribution δ(s = s_0). Thus the state s_t is just the set of tokens from the initial prompt s_0 plus the generated tokens up until time t. This definition of state restores the Markov property, and allows us to write the policy in the usual way as π_θ(a|s_0) = ∏_{t=1}^T π_θ(a_t|s_t).
We can then rewrite the objective in standard MDP form as follows:
J(θ) = E_{s_0∼D} [ ∑_{t=1}^T ∑_{s_t} ∑_{a_t} π_θ(a_t|s_t) δ(s_t|s_{t−1}, a_t) R(s_t, a_t) ]    (6.4)
In practice, the above approach can overfit to the reward function, so we usually regularize the problem
to ensure the policy πθ remains close to the base pre-trained LLM πref . We can do this by adding a penalty
of the form −βDKL (πθ (at |st ) ∥ πref (at |st )) to the per-token reward.
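The following NumPy sketch shows this shaped per-token reward (the task reward and log-probabilities are placeholder inputs, and the single-sample log-ratio is used as the usual estimate of the KL term): each token is penalized by β times the log-ratio between the trained policy and the frozen reference model, while the task reward is added only at the final token.

import numpy as np

def shaped_token_rewards(task_reward, logp_policy, logp_ref, beta):
    # per-token reward: -beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t)),
    # plus the task reward R(s_0, a) at the last token of the response
    rewards = -beta * (logp_policy - logp_ref)
    rewards[-1] += task_reward
    return rewards

logp_policy = np.array([-1.2, -0.4, -2.0])   # per-token log-probs under pi_theta
logp_ref    = np.array([-1.0, -0.9, -1.5])   # per-token log-probs under pi_ref
print(shaped_token_rewards(1.0, logp_policy, logp_ref, beta=0.1))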
6.1.2.3 Learning the reward model from human feedback (RLHF)
To train LLMs to do well in general tasks, such as text summarization or poetry writing, it is common to use
reinforcement learning from human feedback or RLHF, which refers to learning a reward model from
human data, and then using RL to train the LLM to maximize this.
The basic idea is as follows. We first generate a large number of (context, answer1, answer2) tuples, either by a human or an LLM. We then ask human raters if they prefer answer 1 or answer 2. Let x be the prompt (context), y_w be the winning (preferred) output, and y_l be the losing output. Let r_θ(x, y) be the reward assigned to output y. (This model is typically a shallow MLP on top of the last layer of a pretrained LLM.) We train the reward model by maximizing the likelihood of the observed preference data. The likelihood function is given by the Bradley-Terry choice model:
p_θ(y_w > y_l) = exp(r_θ(x, y_w)) / (exp(r_θ(x, y_w)) + exp(r_θ(x, y_l)))    (6.5)
In some cases, we ask human raters if they prefer answer 1 or answer 2, or if there is a tie, denoted y ∈ {1, 2, ∅}. In this case, we can optimize
L(θ) = E_{(x,y_1,y_2,y)∼D} [ I(y = 1) log p_θ(y_1 > y_2|x) + I(y = 2) log p_θ(y_1 < y_2|x)    (6.10)
    + I(y = ∅) log p_θ(y_1 > y_2|x) p_θ(y_1 < y_2|x) ]    (6.11)
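For concreteness, a tiny NumPy sketch of the Bradley-Terry objective (the reward values are placeholders for outputs of r_θ): maximizing the likelihood in Equation (6.5) is the same as minimizing −log σ(r_θ(x, y_w) − r_θ(x, y_l)).

import numpy as np

def bt_nll(r_w, r_l):
    # negative log-likelihood of the Bradley-Terry model:
    # p(y_w > y_l) = exp(r_w) / (exp(r_w) + exp(r_l)) = sigmoid(r_w - r_l)
    return -np.log(1.0 / (1.0 + np.exp(-(r_w - r_l))))

print(bt_nll(r_w=2.1, r_l=0.3))   # small loss: reward model agrees with the rater
print(bt_nll(r_w=0.3, r_l=2.1))   # large loss: reward model disagrees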
Instead of asking humans their preferences for each possible input example, we can ask an LLM. We can then
fit the reward model to this synthetically labeled data, just as in RLHF. This is called RLAIF, which stands
for RL from AI feedback. (This is an example of LLM as judge.)
In order to specify how to judge things, the LLM needs to be prompted. Anthropic (which is the company
that makes the Claude LLM) created a technique called constitutional AI [Ant22], where the prompt is
viewed as a “constitution”, which specifies what kinds of responses are desirable or undesirable. With this
method, the system can critique its own outputs, and thus self-improve.
Rather than using an LLM to generate preference labels for training a separate reward model, or for providing scalar feedback (such as a score of 7.3), we can use the LLM to provide richer textual feedback, such as “This response is helpful but makes a factual error about X” (c.f., execution feedback from running code [Geh+24]). This is called a generative reward model or GRM.
6.1.3 Algorithms
In this section, we discuss algorithms for training LLM policies that maximize the expected reward. These
methods are derived from general RL algorithms by specializing to the bandit setting (i.e., generating a single
answer in response to a single state/prompt).
6.1.3.1 REINFORCE
Given a reward function, one of the simplest ways to train the policy to maximize it is to use the REINFORCE algorithm (discussed in Section 3.1.3). This performs stochastic gradient ascent, where the gradient is derived from Equation (6.1):
∇_θ J(θ) = E_{s_0∼D} [ (1/G) ∑_{i=1}^G (R(s_0, a_i) − b_i(s_0)) ∇_θ log π_θ(a_i|s_0) ]    (6.12)
where a_i ∼ π_θ(·|s_0) is the i'th sampled response, and b_i is a baseline used to reduce the variance (see Section 3.1.5).
In the Reinforce Leave-One-Out (RLOO) method of [Ahm+24], they propose the following baseline:
b_i(s) = (1/(G−1)) ∑_{j=1, j≠i}^G R(s, a_j)    (6.13)
which is the average reward of all the other samples in the batch, excluding the current sample.
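A small NumPy sketch of the RLOO baseline and the resulting per-sample advantages (the rewards are made-up pass/fail scores): each sample's baseline is the mean reward of the other G−1 samples in the group.

import numpy as np

def rloo_advantages(rewards):
    # Equation (6.13): b_i = mean of the other samples' rewards; advantage A_i = R_i - b_i.
    rewards = np.asarray(rewards, dtype=float)
    G = len(rewards)
    baselines = (rewards.sum() - rewards) / (G - 1)
    return rewards - baselines

print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))   # -> [ 0.667, -0.667, -0.667, 0.667]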
6.1.3.2 PPO
We can also train the LLM policy using PPO (Section 3.3.3). In the bandit case, we can write the objective as follows:
J(θ) = E_s E_{a∼π_old(·|s)} [ min( ρ_θ(a|s) A(s, a), clip(ρ_θ(a|s), 1−ε, 1+ε) A(s, a) ) ]    (6.14)
where the likelihood ratio is
ρ_θ(a|s) = π_θ(a|s) / π_old(a|s)    (6.15)
and the advantage is
A(s, a) = R(s, a) − V_θ(s)    (6.16)
where V_θ(s) is a learned value function, often derived from the policy network backbone.
Since the policy generates a sequence, we can expand the above expression into token-level terms, and approximate the expectations by rolling out G trajectories τ_i from the old policy, to get
J_ppo(θ) = (1/G) ∑_{i=1}^G (1/|τ_i|) ∑_{t=1}^{|τ_i|} min( ρ_θ(τ_{i,t}|τ_{i,<t}) A_{i,t}, clip(ρ_θ(τ_{i,t}|τ_{i,<t}), 1−ε, 1+ε) A_{i,t} )    (6.17)
6.1.3.3 GRPO
The Group Relative Policy Optimization or GRPO algorithm of [Sha+24], which was used to train DeepSeek-R1-Zero (discussed in Section 6.1.4.2), is a variant of PPO which replaces the critic network with a Monte Carlo estimate of the value function. This is done to avoid the need to have two copies of the LLM, one for the actor and one for the critic, to save memory. In more detail, the advantage for the i'th rollout is given by
A(s, a_i) = (R_i − mean[R_1, . . . , R_G]) / std[R_1, . . . , R_G]    (6.18)
where R_i = R(s, a_i) is the final reward. This is shared across all tokens in trajectory τ_i.
The above expression has a bias, due to the division by the standard deviation. In the Dr GRPO (GRPO Done Right) method of [Liu+25b], they propose to drop the denominator. The resulting expression then becomes identical (up to a scaling factor of G/(G − 1)) to the advantage used in RLOO (Section 6.1.3.1). To see this, note that
(G/(G−1)) A_i^DrGRPO = (G/(G−1)) ( r_i − (1/G) ∑_{j=1}^G r_j )    (6.19)
                     = (G/(G−1)) r_i − (1/(G−1)) ∑_{j=1, j≠i}^G r_j − (1/(G−1)) r_i    (6.20)
                     = r_i − (1/(G−1)) ∑_{j=1, j≠i}^G r_j = A_i^RLOO    (6.21)
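The identity above is easy to check numerically; a tiny NumPy sketch with placeholder rewards:

import numpy as np

r = np.array([1.0, 0.0, 0.0, 1.0, 1.0])        # final rewards of G sampled answers
G = len(r)

adv_grpo   = (r - r.mean()) / r.std()          # GRPO advantage, Equation (6.18)
adv_drgrpo = r - r.mean()                      # Dr GRPO: drop the std normalization
adv_rloo   = r - (r.sum() - r) / (G - 1)       # RLOO leave-one-out baseline

print(np.allclose(adv_drgrpo * G / (G - 1), adv_rloo))   # True, as in Equations (6.19)-(6.21)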
In addition to modifying the advantage, they propose to use a regularizer, by optimizing J = J_ppo − β J_kl. They use the low-variance MC estimator of the KL divergence proposed in http://joschu.net/blog/kl-approx.html, which has the form D_KL[Q, P] = E[r − log r − 1], where r = P(a)/Q(a), P(a) = π_old(a) is the reference policy (e.g., the SFT base model), Q(a) = π_θ(a) is the policy which is being optimized, and the expectations are wrt Q(a) = π_θ(a). Thus the final GRPO objective (to be maximized) is J = J_ppo − β J_kl, where
J_ppo = E_s E_{a_i∼π_old(·|s)} [ min( ρ_i A_i, clip(ρ_i, 1−ε, 1+ε) A_i ) ]    (6.22)
ρ_i = π_θ(a_i|s) / π_old(a_i|s)    (6.23)
J_kl = E_s E_{a_i∼π_θ(·|s)} [ r_i − log r_i − 1 ]    (6.24)
r_i = π_old(a_i|s) / π_θ(a_i|s)    (6.25)
where s is a sampled question and a_i is a sampled output (answer). In practice, GRPO approximates the J_kl term by sampling from a_i ∼ π_θold(·|s), so we can share the samples for both the PPO term and the KL term.2
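A small sketch of this estimator (the per-token log-probabilities are synthetic): given samples from π_θ, it averages r − log r − 1 with r = π_old(a)/π_θ(a), and every per-sample term is non-negative, unlike the naive estimator log(π_θ/π_old).

import numpy as np

def kl_estimate(logp_theta, logp_old):
    # D_KL(pi_theta || pi_old) ~= E_{a ~ pi_theta}[ r - log r - 1 ], with r = pi_old(a)/pi_theta(a)
    r = np.exp(logp_old - logp_theta)
    return np.mean(r - np.log(r) - 1.0)

rng = np.random.default_rng(0)
logp_theta = -rng.uniform(0.5, 2.0, size=1000)            # log-probs of sampled tokens under pi_theta
logp_old = logp_theta + rng.normal(0.0, 0.1, size=1000)   # pi_old is close to pi_theta
print(kl_estimate(logp_theta, logp_old))                  # small non-negative value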
6.1.3.4 DPO
Rather than first fitting a reward model from preference data using the Bradley-Terry model in Section 6.1.2.3,
and then optimizing the policy to maximize this, it is possible to optimize the preferences directly, using the
DPO (Direct Preference Optimization) method of [Raf+23]. (This is sometimes called direct alignment.)
We now derive the DPO method. To simplify notation, we use x for the input prompt (initial state s) and y for the output (answer sequence a). The objective for KL-regularized policy learning is to maximize the following:3
J(π) = E_{x∼D, y∼π(y|x)} [ R(x, y) − β log (π(y|x)/π_ref(y|x)) ]    (6.26)
Equivalently, we can minimize the loss
L(π) = E_{x∼D, y∼π(y|x)} [ log (π(y|x)/π_ref(y|x)) − (1/β) R(x, y) ]    (6.27)
     = E_{x∼D, y∼π(y|x)} [ log ( π(y|x) / ( (1/Z(x)) π_ref(y|x) exp((1/β) R(x, y)) ) ) − log Z(x) ]    (6.28)
2 Note that this is biased, as pointed out in https://x.com/NandoDF/status/1884038052877680871. However, the bias is
likely small if θold is close to θ. If this bias is a problem, it is easy to fix using importance sampling.
3 In [Hua+25b], they recently showed that it is better to replace the KL divergence with a χ² divergence, which quantifies uncertainty more effectively than KL-regularization. The resulting algorithm is provably robust to overoptimization, unlike standard DPO.
where
Z(x) = ∑_y π_ref(y|x) exp((1/β) R(x, y))    (6.29)
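Carrying the derivation through (the partition function Z(x) cancels when two responses to the same prompt are compared under the Bradley-Terry model) yields the usual DPO objective of [Raf+23]; a minimal NumPy sketch, with sequence-level log-probabilities as placeholder inputs:

import numpy as np

def dpo_loss(logp_w, logp_l, logp_ref_w, logp_ref_l, beta):
    # -log sigmoid( beta * [ (log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l)) ] )
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# log-probabilities of the preferred (w) and dispreferred (l) answers under pi_theta and pi_ref
print(dpo_loss(logp_w=-10.0, logp_l=-12.0, logp_ref_w=-11.0, logp_ref_l=-11.0, beta=0.1))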
Sampling from such distributions can be done using various methods. A simple method is known as
best-of-N sampling, which just generates N trajectories, and picks the best. This is equivalent to ancestral
sampling from the forwards model, and then weighting by the final likelihood (a soft version of rejection
sampling), and hence is very inefficient. In addition, performance can decrease when N increases, due to
over-optimization (see e.g., [Hua+25a; FS25]).
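A minimal best-of-N sketch (the sampler and reward function are placeholder callables): draw N candidates from the base model and keep the highest-scoring one.

import numpy as np

def best_of_n(sample_fn, reward_fn, n):
    # generate n candidates from the base policy and return the one with the highest reward
    candidates = [sample_fn() for _ in range(n)]
    scores = [reward_fn(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# toy stand-ins: "responses" are random numbers, and the reward prefers values near 0.5
rng = np.random.default_rng(0)
print(best_of_n(sample_fn=lambda: rng.random(), reward_fn=lambda y: -abs(y - 0.5), n=16))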
A more efficient method is to use twisted SMC, which combines particle filtering (a form of sequential
Monte Carlo) with a “twist” function, which predicts the future reward given the current state4 , analogous
to a value function (see e.g. [NLS19; CP20; Law+18; Law+22]). This is sometimes called SMC steering,
and has been used in several papers (see e.g., [Lew+23; Zha+24b; Fen+25; Lou+25; Pur+25]). (See also
Section 4.2.3 for closely related work which uses twisted SMC to sample action sequences, where the twist is
approximated by a learned advantage function.)
The posterior sampling approach discussed above is an example of using more compute at test time to
improve the generation process of an LLM. This is known as test time compute (see e.g., [Ji+25] for a survey). This provides another kind of scaling law, known as inference time scaling, besides just improving
the size (and training time) of the base model [Sne+24].
The disadvantage of decision-time planning (online posterior sampling) is that it can be slow. Hence it can be desirable to “amortize” the cost by fine-tuning the base LLM so that it matches the tilted posterior [KPB22]. We can do this by minimizing J(θ) = D_KL(π_θ ∥ π^*), where π^* is the tilted distribution in Equation (6.36), and then sampling from π_θ.
Following Equation (3.129), we have
DKL (πθ (y) ∥ π ∗ (y)) = Eπθ (y) [log πθ (y) − log Φ(y) − log πref (y)] + log pref (O = 1) (6.38)
where the first term is the negative ELBO (evidence lower bound), and the second term, log p_ref(O = 1), is independent of the distribution being optimized (namely π_θ). Hence we can minimize the KL by maximizing the ELBO:
J ′ (θ) = Eπθ (y) [log Φ(y)] − DKL (πθ (y) ∥ πref (y)) ∝ Eπθ (y) [R(y)] − βDKL (πθ (y) ∥ πref (y)) (6.39)
4 Formally, the optimal twist function on a partial sequence y is defined as Φ∗ (y) = Eπref (y′ ) [Φ(y ′ )|y is a prefix of y ′ ].
6.1.4.1 Chain of thought prompting
The quality of the output from an LLM can be improved by prompting it to “show its work” before presenting
the final answer. These intermediate tokens are called a “Chain of Thought” [Wei+22]. Models that act in
this way are often said to be doing “reasoning” or “thinking”, although in less anthropomorphic terms,
we can think of them as just policies with dynamically unrolled computational graphs. This is motivated
by various theoretical results that show that such CoT can significantly improve the expressive power of
transformers [MS24; Li+24b].
Rather than just relying on prompting, we can explicitly train a model to think by letting it generate a
variable number of tokens “in its head” before generating the final answer. Only the final outcome is evaluated,
using a known reward function (as in the case of math and coding problems).
This approach was recently demonstrated by the DeepSeek-R1-Zero system [Dee25] (released by a
Chinese company in January 2025). They started with a strong LLM base model, known as DeepSeek-V3-
Base [Dee24], which was pre-trained on a large variety of data (including Chains of Thought). They then
used a variant of PPO, known as GRPO (see Section 6.1.3.3), to do RLFT, using a set of math and coding benchmarks where the ground truth answer is known. The resulting system got excellent performance on math and coding benchmarks.5 The closed-source models ChatGPT-o1 and ChatGPT-o3 from OpenAI6 and the Gemini 2.0 Flash Thinking model from Google Deepmind7 are believed to follow similar principles to DeepSeek-R1, although the details are not public.8 For a recent review of these reasoning methods, see e.g., [Xu+25].
Since we usually only care about maximizing the probability that the final answer is correct, and not about the values or “correctness” of the intermediate thoughts (since it can be hard to judge heuristic arguments), we can view training a thinking model as equivalent to maximizing the marginal likelihood p(y|x) = ∑_z p(y, z|x), where z are the latent thoughts (see e.g., [Hof+23; Zel+24; TWM25]).
One reason DeepSeek-R1 got so much attention in the press is that during the training process, it seemed to “spontaneously” exhibit some “emergent abilities”, such as generating increasingly long sequences of thoughts, and using self-reflection to refine its thinking, before generating the final answer.
Note that the claim that RL “caused” these emergent abilities has been disputed by many authors (see e.g., [Liu+25a; Yue+25]). Instead, the general consensus is that the base model itself was already trained on datasets that contained some CoT-style reasoning patterns. This is consistent with the findings in [Gan+25a],
which showed that applying RL to a base model that had not been pre-trained on reasoning patterns (such as
self-reflection) did not result in a final model that could exhibit such behaviors. However, RL can “expose” or
“amplify” such abilities in a base model if they are already present to a certain extent. (See also [FMR25] for
a detailed theoretical study of this issue.) Recently, the Absolute Zero Reasoner of [Zha+25] showed that it is possible to automatically generate a curriculum of questions and answers, which is used for the RL training. However, once again, this is not training from scratch, since it relies on a strong pre-trained base model.
5 Although DeepSeek-R1-Zero exhibited excellent performance on math and coding benchmarks, it did not work as well on
more general reasoning benchmarks. So their final system, called DeepSeek-R1, combined RL training with more traditional
SFT (on synthetically generated CoTs).
6 See https://openai.com/index/learning-to-reason-with-llms/.
7 See https://deepmind.google/technologies/gemini/flash-thinking.
8 However, shortly after the release of R1, the CRO of OpenAI (Mark Chen) confirmed that o1 uses some of the same core
Figure 6.1: Illustration of how to train an LLM to both “think” internally and “act” externally. We sample P initial
prompts, and for each one, we roll out N trajectories (of length up to K tokens). From [Wan+25]. Used with kind
permission of Zihan Wang.
where τ = (s0 , aT0 , r0 , . . . , sK , aTK , rK ) is a trajectory of length K, R(τ ) is the trajectory level reward, and
M is the MDP, defined by the combination of the internal deterministic transition for thinking tokens in
Equation (6.3) and the unknown external environment. In [Wan+25], they call this approach StarPO
(State-Thinking-Action-Reward Policy Optimization), although it is really just an instance of PPO.
6.1.7 Alignment and the assistance game
Encouraging an agent to behave in a way that satisfies one or more human preferences is called alignment.
We can use RL for this, by creating suitable reward functions. However, any objective-maximizing agent may
engage in reward hacking (Section 1.3.6.2), in which it finds ways to maximize the specified reward but which
humans consider undesirable. This is due to reward misspecification, or simply, the law of unintended
consequences.
A classic example of this is the poem known as The Sorcerer’s Apprentice, written by the German poet
Goethe in 1797. This was later made famous in Disney’s cartoon “Fantasia” from 1940. In the cartoon version,
Mickey Mouse is an apprentice to a powerful sorcerer. Mickey is tasked with fetching water from the well.
Feeling lazy, Mickey puts on the sorcerer’s magic hat, and enchants the broom to carry the buckets of water
for him. However, Mickey forgot to ask the magic broom to stop after the first bucket of water has been
carried, so soon the room is filled with an army of tireless, water-carrying brooms, until the room floods, at
which point Mickey asks the wizard to intervene.
The above parable is typical of the problems that arise when a reward function fails to capture all the edge cases we might not have thought of. In [Rus19], Stuart Russell proposed a clever
solution to this fundamental problem. Specifically, the human and machine are both treated as agents in a
two-player, partially observed cooperative game (an instance of a Dec-POMDP; see Section 5.1.3), called
an assistance game, where the machine’s goal is to maximize the user’s utility (reward) function, which
is inferred based on the human’s behavior using inverse RL. That is, instead of trying to learn the reward
function using RLHF, and then optimizing that, we treat the reward function as an unknown part of the
environment. If we adopt a Bayesian perspective on this, we can maintain a posterior belief over the model
parameters, which will incentivize the agent to perform information gathering actions (see Section 7.2.1.2).
For example, if the machine is uncertain about whether something is a good idea or not, it will proceed
cautiously (e.g., by asking the user for their preference), rather than blindly solving the wrong problem. For
more details on this framework, see [Sha+20]. For a more general discussion of (mis)alignment and risks
posed by AI agents and “AGI”, see e.g., [Ham+25].
Lean theorem proving environment. In this environment, the reward is 0 or 1 (proof is correct or not), the
state space is a structured set of previously proved facts and the current goal, and the action space is a set of
proof tactics. The agent itself is a separate transformer policy network (distinct from the formalizer network): a pre-trained LLM that is fine-tuned on math, Lean, and code, and then further trained using RL.
as relative location, derived from the image). These reward functions are then “verified” by applying them to
an offline set of expert and random trajectories; a good reward function should allocate high reward to the
expert trajectories and low reward to the random ones. Finally, the reward functions are used as auxiliary
rewards inside an RL agent.
There are of course many other ways an LLM could be used to help learn reward functions, and this
remains an active area of research.
Figure 6.2: Illustration of how to use a pretrained LLM (combined with RAG) as a policy. From Figure 5 of [Par+23].
Used with kind permission of Joon Park.
posterior of p(M|D(1 : t)), where M is the WM. The weight for each sampled program ρ is represented by a Beta distribution, B(α, β), with initial parameters α = C + C r(ρ) and β = C + C(1 − r(ρ)), where r(ρ) is the fraction of unit tests that pass, and C is a constant. This representation of the quality of each program is similar to the one used to represent the reward for the arms in a Bernoulli bandit, and is
based on the REx (Refine, Explore, Exploit) algorithm of [Tan+24]. At each step, it samples one of these
models (programs) from this weighted posterior, and then uses it inside of a planning algorithm, similar to
Thompson sampling (Section 7.2.2.2). The agent then executes this in the environment, and passes back
failed predictions to the LLM, asking it to improve the WM, or to fix bugs if it does not run. (This refinement
step is similar to the I and F steps of GIF-MCTS.) To encourage exploration, they introduce a new learning
objective that prefers world models which a planner thinks lead to rewarding states, particularly when the
agent is uncertain as to where the rewards are.
We can also use the LLM as a mutation operator inside of an evolutionary search algorithm, as in the
FunSearch system [RP+24] (recently rebranded as AlphaEvolve [Dee25]), where the goal is to search over
program space to find code that minimizes some objective, such as prediction errors on a given dataset.
prompting the LLM to first retrieve relevant past examples from an external “memory”, rather than explicitly
storing the entire history ht in the context (this is called retrieval augmented generation or RAG); see
Figure 6.2 for an illustration. Note that no explicit learning (in the form of parametric updates) is performed
in these systems; instead they rely entirely on in-context learning (and prompt engineering).
The above approaches do not train the LLM, but instead rely on in-context learning in order to adapt the
policy. Better results can be obtained by using RL finetuning, as we discussed in Section 6.1.5.
6.2.4.3 In-context RL
Large LLMs have shown a surprising property, known as In-Context Learning or ICL, in which they can
be “taught” to do function approximation just by being given (x, y) pairs in their context (prompt), and then
being asked to predict the output for a novel x. This can be used to train LLMs without needing to do any
gradient updates to the underlying parameters. For a review of methods that apply ICL to RL, see [Moe+25].
Chapter 7
Other topics in RL
where
π ∗ = argmax Es0 ∼M [V π (s0 |M )] (7.2)
π∈Π
is the optimal policy from some policy class Π which has access to the true MDP M . This is often referred to
as the best policy in hindsight, since if we average over enough sequences (or over enough time steps), the
policy we wish we had chosen will of course be the optimal policy that knows the true environment.
Since the true MDP is usually unknown, we can define the maximum regret of a policy as its worst-case regret wrt some class of models M:
1 We can also define the regret in terms of value functions: Regret_T(π|M, Π) = E_{s_0∼M}[V_T^{π^*}(s_0|M) − V_T^π(s_0|M)], where V^π(s|M) refers to the value of policy π starting from state s in MDP M.
We can then define the minimax optimal policy as the one that minimizes the maximum regret:2
π_MM(M, Π) = argmin_{π∈Π} max_{M∈M} Regret_T(π|π^*(M), M)    (7.4)
The main quantity of interest in the theoretical RL literature is how fast the regret grows as a function of time T. In the case of a tabular episodic MDP, the optimal minimax regret is O(√(HSAT)) (ignoring logarithmic factors), where H is the horizon length (number of steps per episode), S is the number of states, A is the number of actions, and T is the total number of steps. When using parametric functions to define the MDP, the bounds depend on the complexity of the function class. For details, see e.g., [AJO08; JOA10; LS19].
where
π_t^* = argmax_{π∈Π} V^π(s_t|M_t)    (7.6)
where the distance function measures the similarity of the MDPs at adjacent episodes (e.g., the ℓ1 distance of the reward and transition functions). The optimal dynamic regret can then be bounded in terms of V_B (see e.g., [CSLZ23; Aue+19]).
where P0 (M) is our prior over models, and A is our learning algorithm that generates the policy (decision
procedure) to use at each step, as discussed in Section 1.1.3. Note that the uncertainty over models
automatically encourages the optimal amount of exploration, as we discuss in Section 7.2.1.2. Note also that
if we can do exact inference, the optimal algorithm is uniquely determined by the prior P0 .
2We can also consider a non-stochastic setting, in which we allow an adversary to choose the state sequence, rather than
taking expectations over them. In this case, we take the maximum over individual sequences rather than models. For details, see
[HS22].
Aspect       | Bayes-Optimal (BAMDP)                     | Regret-Minimizing (Minimax)
Knowledge    | Requires a known prior over MDPs          | No prior; judged against best policy in hindsight
Objective    | Maximize expected return under the prior  | Minimize regret w.r.t. optimal policy in true MDP
Exploration  | Performs optimal Bayesian exploration     | Often uses optimism or randomness (e.g., UCB, TS)
Adaptation   | Fully adaptive via posterior updates      | May use confidence bounds, resets, or pessimism
Setting      | Bayesian RL                               | Frequentist or adversarial RL
Table 7.1: Key differences between Bayes-optimal and regret-minimizing policies in RL.
By contrast, the regret minimizing policy is the one that minimizes the maximum regret
where π ∗ is given in Equation (7.2). Unlike the Bayesian case, we must now manually design the algorithm
to solve the exploration-exploitation problem (e.g., using the Thompson sampling method of Section 7.2.2 or
the UCB method of Section 7.2.3), i.e., there is no automatic solution to the problem.
This distinction between minimizing risk and minimizing regret is equivalent to the standard difference
between Bayesian and frequentist approaches to decision making (see e.g., [Mur23, Sec 34.1]). The advantage
of the Bayesian approach is that it can use prior knowledge (e.g., based on experience with other tasks, or
knowledge from an LLM) to adapt quickly to changes, and to make predictions about the future, allowing for
optimal long-range planning. The advantage of the regret-minimizing approach is that it avoids the need to specify a prior over models, it can robustly adapt to unmodeled changes, and it can handle adversarial (non-stochastic) noise. See Table 7.1 for a summary, and [HT15] for more discussion.
of [KWW22, Sec 15.5], and consider a Bernoulli bandit with n arms. Let the belief state be denoted by
b = (w1 , l1 , . . . , wn , ln ), where wa is the number of times arm a has won (given reward 1) and la is the number
of times arm a has lost (given reward 0). Using Bellman’s equation, and the expression for the probability of
winning under a beta-Bernoulli distribution with a uniform prior, we have
V^*(b) = max_a Q^*(b, a)    (7.12)
Q^*(b, a) = [(w_a + 1)/(w_a + l_a + 2)] (1 + V^*(· · · , w_a + 1, l_a, · · · ))    (7.13)
          + [1 − (w_a + 1)/(w_a + l_a + 2)] V^*(· · · , w_a, l_a + 1, · · · )    (7.14)
In the finite horizon case, with h steps, we can compute Q^* using dynamic programming. We start with terminal belief states b with ∑_a (w_a + l_a) = h, where V^*(b) = 0. We then work backwards to states b satisfying ∑_a (w_a + l_a) = h − 1, and apply the above equation recursively until time step 0.
Unfortunately, although this process is optimal, the number of belief states is O(h^{2n}), rendering it intractable. Fortunately, for the infinite horizon discounted case, the problem can be solved efficiently using
Gittins indices [Git89] (see [PR12; Pow22] for details). However, these optimal methods do not extend to
contextual bandits, where the problem is provably intractable [PT87].
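The finite-horizon recursion in Equations (7.12)-(7.14) is straightforward to implement with memoized dynamic programming; a small Python sketch for two Bernoulli arms and a short horizon (the horizon H and the uniform Beta(1,1) prior are the only assumptions):

from functools import lru_cache

H = 6   # total number of pulls

@lru_cache(maxsize=None)
def V(belief):
    # belief = ((w_1, l_1), (w_2, l_2)); the value is 0 once the horizon is exhausted
    if sum(w + l for (w, l) in belief) == H:
        return 0.0
    return max(Q(belief, a) for a in range(len(belief)))

def Q(belief, a):
    # Equations (7.13)-(7.14) with p(win) = (w_a + 1)/(w_a + l_a + 2) under a uniform prior
    w, l = belief[a]
    p_win = (w + 1) / (w + l + 2)
    win  = tuple((w + 1, l) if i == a else arm for i, arm in enumerate(belief))
    lose = tuple((w, l + 1) if i == a else arm for i, arm in enumerate(belief))
    return p_win * (1 + V(win)) + (1 - p_win) * V(lose)

b0 = ((0, 0), (0, 0))
print(V(b0))                                   # Bayes-optimal expected reward over H pulls
print(max(range(2), key=lambda a: Q(b0, a)))   # Bayes-optimal first arm to pull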
The rewards and actions of the augmented MDP are the same as the base MDP. Thus Bellman's equation gives us
V^*(s, b) = max_a ( R(s, a) + γ ∑_{s'} P(s'|s, b, a) V^*(s', BU(s, b, a, s')) )    (7.17)
Unfortunately, this is computationally intractable to solve. Fortunately, various approximations have been proposed (see e.g., [Zin+21; AS22; Mik+20]).
Figure 7.1: Illustration of Thompson sampling applied to a linear-Gaussian contextual bandit. The context has the
form st = (1, t, t2 ). (a) True reward for each arm vs time. (b) Cumulative reward per arm vs time. (c) Cumulative
regret vs time. Generated by thompson_sampling_linear_gaussian.ipynb.
If the posterior is uncertain, the agent will sample many different actions, automatically resulting in exploration.
As the uncertainty decreases, it will start to exploit its knowledge.
To see how we can implement this method, note that we can compute the expression in Equation (7.18) by using a single Monte Carlo sample θ̃_t ∼ p(θ|h_t). We then plug this parameter into our reward model, and greedily pick the best action:
a_t = argmax_{a'} R(s_t, a'; θ̃_t)    (7.19)
This sample-then-exploit approach will choose actions with exactly the desired probability, since
p_a = ∫ I(a = argmax_{a'} R(s_t, a'; θ̃_t)) p(θ̃_t|h_t) dθ̃_t = Pr_{θ̃_t∼p(θ|h_t)}(a = argmax_{a'} R(s_t, a'; θ̃_t))    (7.20)
Despite its simplicity, this approach can be shown to achieve optimal regret (see e.g., [Rus+18] for a
survey). In addition, it is very easy to implement, and hence is widely used in practice [Gra+10; Sco10;
CL11].
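For concreteness, here is a minimal Thompson sampling loop for a Bernoulli bandit (simpler than the linear-Gaussian bandit of Figure 7.1; the arm probabilities are made up): a Beta posterior is maintained per arm, one parameter vector is sampled per step, and the agent acts greedily with respect to the sample.

import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.5, 0.7])     # unknown to the agent
n_arms = len(true_probs)
alpha = np.ones(n_arms)                     # Beta(1,1) prior per arm
beta  = np.ones(n_arms)

for t in range(2000):
    theta_tilde = rng.beta(alpha, beta)     # one posterior sample per arm
    a = int(np.argmax(theta_tilde))         # greedy wrt the sampled parameters (Equation 7.19)
    r = float(rng.random() < true_probs[a])
    alpha[a] += r                           # conjugate Beta-Bernoulli update
    beta[a]  += 1.0 - r

print(alpha + beta - 2)                     # pull counts: most pulls go to the best arm
print(alpha / (alpha + beta))               # posterior means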
In Figure 7.1, we give a simple example of Thompson sampling applied to a linear regression bandit. The context has the form s_t = (1, t, t²). The true reward function has the form R(s_t, a) = w_a^T s_t. The weights per arm are chosen as follows: w_0 = (−5, 2, 0.5), w_1 = (0, 0, 0), w_2 = (5, −1.5, −1). Thus we see that arm 0 is
initially worse (large negative bias) but gets better over time (positive slope), arm 1 is useless, and arm 2 is
initially better (large positive bias) but gets worse over time. The observation noise is the same for all arms,
σ 2 = 1. See Figure 7.1(a) for a plot of the reward function. We use a conjugate Gaussian-gamma prior and
perform exact Bayesian updating. Thompson sampling quickly discovers that arm 1 is useless. Initially it
pulls arm 2 more, but it adapts to the non-stationary nature of the problem and switches over to arm 0, as
shown in Figure 7.1(b). In Figure 7.1(c), we show that the empirical cumulative regret in blue is close to the
optimal lower bound in red.
steps of the PSRL algorithm for some simple tabular problems. Although the method worked surprisingly well (in the sense of having low Bayesian regret) on very small problems (e.g., 3 state RiverSwim), it failed on larger problems (e.g., 4 state) and more stochastic problems.
Algorithm 20: Posterior sampling RL. We define the history at step k to be the set of previous trajectories, H_k = {τ^1, . . . , τ^{k−1}}, each of length H, where τ^k = (s_1^k, a_1^k, r_1^k, . . . , s_H^k, a_H^k, r_H^k, s_{H+1}^k).
1 Input: Prior over models P(M)
2 History H_1 = ∅
3 for Episode k = 1 : K do
4   Sample model from posterior, M_k ∼ P(M|H_k)
5   Compute optimal policy π_k^* = solve(M_k)
6   Execute π_k^* for H steps to get τ^k
7   Update history H_{k+1} = H_k ∪ τ^k
8   Update posterior p(M|H_{k+1})
9 Return π_K^*
As a more computationally efficient alternative, it is also possible to maintain a posterior over policies
or Q functions instead of over world models; see e.g., [Osb+23a] for an implementation of this idea based
on epistemic neural networks [Osb+23b], and epistemic value estimation [SSTVH23] for an imple-
mentation based on Laplace approximation. Another approach is to use successor features (Section 4.5.4),
where the Q function is assumed to have the form Qπ (s, a) = ψ π (s, a)T w. In particular, [Jan+19b] proposes
Successor Uncertainties, in which they model the uncertainty over w as a Gaussian, p(w) = N(µ_w, Σ_w).
From this they can derive the posterior distribution over Q values as p(Q(s, a)) = N (Ψπ µw , Ψπ Σw (Ψπ )T ),
where Ψπ = [ψ π (s, a)]T is a matrix of features, one per state-action pair.
UCB can be viewed as a form of exploration bonus, where the optimistic estimate encourages exploration.
Typically, the amount of optimism, R̃t −R, decreases over time so that the agent gradually reduces exploration.
With properly constructed optimistic reward estimates, the UCB strategy has been shown to achieve near-
optimal regret in many variants of bandits [LS19]. (We discuss regret in Section 7.1.)
The optimistic function R̃ can be obtained in different ways, sometimes in closed forms, as we discuss
below.
Figure 7.2: Illustration of the reward distribution Q(a) for a Gaussian bandit with 3 different actions, and the
corresponding lower and upper confidence bounds. We show the posterior means Q(a) = µ(a) with a vertical dotted line,
and the scaled posterior standard deviations cσ(a) as a horizontal solid line. From [Sil18]. Used with kind permission
of David Silver.
As an example, consider again the context-free Bernoulli bandit, R(a) ∼ Ber(µ(a)). The MLE R̂t (a) =
µ̂t (a) is given by the empirical average of observed rewards whenever action a was taken:
where N_t^r(a) is the number of times (up to step t − 1) that action a has been tried and the observed reward was r, and N_t(a) is the total number of times action a has been tried:
N_t(a) = ∑_{s=1}^{t−1} I(a_s = a)    (7.23)
Then the Chernoff-Hoeffding inequality [BLM16] leads to δ_t(a) = c/√(N_t(a)) for some constant c, so
R̃_t(a) = µ̂_t(a) + c/√(N_t(a))    (7.24)
We can use similar techniques for a Gaussian bandit, where p_R(R|a, θ) = N(R|µ_a, σ_a²), µ_a is the expected reward, and σ_a² the variance. If we use a conjugate prior, we can compute p(µ_a, σ_a|D_t) in closed form. Using an uninformative version of the conjugate prior, we find E[µ_a|h_t] = µ̂_t(a), which is just the empirical mean of rewards for action a. The uncertainty in this estimate is the standard error of the mean, i.e., √(V[µ_a|h_t]) = σ̂_t(a)/√(N_t(a)), where σ̂_t(a) is the empirical standard deviation of the rewards for action a.
Once we have computed the mean and posterior standard deviation, we define the optimistic reward
estimate as
R̃t (a) = µ̂t (a) + cσ̂t (a) (7.26)
for some constant c that controls how greedy the policy is. See Figure 7.2 for an illustration. We see that
this is similar to the frequentist method based on concentration inequalities, but is more general.
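As an illustration, a minimal UCB loop for a Bernoulli bandit using the Chernoff-Hoeffding bonus of Equation (7.24) (the constant c and the arm probabilities are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.5, 0.7])
n_arms = len(true_probs)
c = 1.0

counts = np.zeros(n_arms)    # N_t(a)
sums   = np.zeros(n_arms)    # cumulative reward per arm

for t in range(2000):
    if t < n_arms:
        a = t                                      # pull each arm once to initialize
    else:
        mu_hat  = sums / counts
        r_tilde = mu_hat + c / np.sqrt(counts)     # optimistic estimate, Equation (7.24)
        a = int(np.argmax(r_tilde))
    r = float(rng.random() < true_probs[a])
    counts[a] += 1
    sums[a]   += r

print(counts)    # most pulls go to the best arm (index 2)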
Figure 7.3: Illustration of how to encode a scalar target y or distributional target Z using a categorical distribution (panels: Two-Hot, HL-Gauss, and Categorical Distributional RL). From Figure 1 of [Far+24]. Used with kind permission of Jesse Farebrother.
[AJO08] presents the more sophisticated UCRL2 algorithm, which computes confidence intervals on all the
MDP model parameters at the start of each episode; it then computes the resulting optimistic MDP and
solves for the optimal policy, which it uses to collect more data.
7.3 Distributional RL
The distributional RL approach of [BDM17; BDR23] predicts the distribution of (discounted) returns, not just the expected return. More precisely, let Z_t^π = ∑_{k=0}^{T−t} γ^k R(s_{t+k}, a_{t+k}) be a random variable representing the (discounted) reward-to-go from step t. The standard value function is defined to compute the expectation of this variable: V^π(s) = E[Z_0^π|s_0 = s]. In DRL, we instead attempt to learn the full distribution, p(Z_0^π|s_0 = s), when training the critic. We then compute the expectation of this distribution when training the actor. For a general review of distributional regression, see [KSS23]. Below we briefly mention a few algorithms in this class that have been explored in the context of RL.
for predicting the mean leverages a distribution, for robustness and ease of optimization.
target based on putting appropriate weight on the nearest two bins (see Figure 7.3). In [IW18], they proposed the HL-Gauss histogram loss, which convolves the target value y with a Gaussian, and then discretizes the resulting continuous distribution. This is more symmetric than two-hot encoding, as shown in Figure 7.3. Regardless of how the discrete target is chosen, predictions are made using ŷ(s; θ) = ∑_k p_k(s) b_k, where p_k(s) is the probability of bin k, and b_k is the bin center.
In [Far+24], they show that the HL-Gauss trick works much better than MSE, two-hot and C51 across a
variety of problems (both offline and online), especially when they scale to large networks. They conjecture
that the reason it beats MSE is that cross entropy is more robust to noisy targets (e.g., due to stochasticity)
and nonstationary targets. They also conjecture that the reason HL works better than two-hot is that HL is
closer to ordinal regression, and reduces overfitting by having a softer (more entropic) target distribution
(similar to label smoothing in classification problems).
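A small sketch of the two encodings (the bin layout and the smoothing width σ are arbitrary; SciPy is used for the Gaussian CDF): two-hot splits the probability mass between the two nearest bin centers, while HL-Gauss spreads it over several bins by integrating a Gaussian centered at the target over each bin.

import numpy as np
from scipy.stats import norm

bins = np.linspace(-5.0, 5.0, 11)    # bin centers b_k

def two_hot(y, bins):
    # put weight on the two nearest bin centers, proportional to proximity
    p = np.zeros_like(bins)
    k = int(np.clip(np.searchsorted(bins, y), 1, len(bins) - 1))
    lo, hi = bins[k - 1], bins[k]
    w_hi = (y - lo) / (hi - lo)
    p[k - 1], p[k] = 1.0 - w_hi, w_hi
    return p

def hl_gauss(y, bins, sigma=0.75):
    # integrate N(y, sigma^2) over each bin (edges placed midway between centers)
    edges = np.concatenate(([-np.inf], (bins[:-1] + bins[1:]) / 2, [np.inf]))
    return np.diff(norm.cdf(edges, loc=y, scale=sigma))

y = 1.3
print(np.round(two_hot(y, bins), 3))
print(np.round(hl_gauss(y, bins), 3))
print(two_hot(y, bins) @ bins, hl_gauss(y, bins) @ bins)   # both decode back to roughly 1.3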
This is the same as methods based on rewarding states for prediction error. Unfortunately, such methods can suffer from the noisy TV problem (also called a stochastic trap), in which an agent is attracted to states which are intrinsically hard to predict. To see this, note that by averaging over future states, we see that the above reward reduces to
R(s, a) = −E_{p^*(s'|s,a)} [log q(s'|s, a)] = H_ce(p^*, q)    (7.29)
where p^* is the true model, q is the learned dynamics model, and H_ce is the cross-entropy. Once we learn the optimal model, q = p^*, this reduces to the conditional entropy of the predictive distribution, which can be non-zero for inherently unpredictable states.
To help filter out such random noise, [Pat+17] proposes an Intrinsic Curiosity Module. This first learns an inverse dynamics model of the form a = f(s, s'), which tries to predict which action was used, given that the agent was in s and is now in s'. The classifier has the form softmax(g(ϕ(s), ϕ(s'), a)), where z = ϕ(s) is a representation function that focuses on parts of the state that the agent can control. Then the agent learns a forwards dynamics model in z-space. Finally, it defines the intrinsic reward as the prediction error of this forwards model in latent space, i.e., the discrepancy between the predicted and actual values of ϕ(s'). Thus the agent is rewarded for visiting states that lead to unpredictable consequences, where the difference in outcomes is measured in a (hopefully more meaningful) latent space.
Another solution is to replace the cross entropy with the KL divergence, R(s, a) = DKL (p||q) = Hce (p, q) −
H(p), which goes to zero once the learned model matches the true model, even for unpredictable states.
This has the desired effect of encouraging exploration towards states which have epistemic uncertainty
(reducible noise) but not aleatoric uncertainty (irreducible noise) [MP+22]. The BYOL-Hindsight method
of [Jar+23] is one recent approach that attempts to use the R(s, a) = DKL (p||q) objective. Unfortunately,
computing the DKL (p||q) term is much harder than the usual variational objective of DKL (q||p). A related
idea, proposed in the RL context by [Sch10], is to use the information gain as a reward. This is defined as R_t(s_t, a_t) = D_KL(q(s_t|h_t, a_t, θ_t) || q(s_t|h_t, a_t, θ_{t−1})), where h_t is the history of past observations, and θ_t = update(θ_{t−1}, h_t, a_t, s_t) are the new model parameters. This is closely related to the BALD (Bayesian Active Learning by Disagreement) criterion [Hou+11; KAG19], and has the advantage of being easier to compute, since it does not reference the true distribution p.
7.4.3 Empowerment
TODO
7.5 Hierarchical RL
So far we have focused on MDPs that work at a single time scale. However, this is very limiting. For example,
imagine planning a trip from San Francisco to New York: we need to choose high level actions first, such as
which airline to fly, and then medium level actions, such as how to get to the airport, followed by low level
actions, such as motor commands. Thus we need to consider actions that operate at multiple levels of temporal abstraction. This is called hierarchical RL or HRL. This is a big and important topic, and we only briefly mention a few key ideas and methods. Our summary is based in part on [Pat+22]. (See also Section 4.5
where we discuss multi-step predictive models; by contrast, in this section we focus on model-free methods.)
Figure 7.4: Illustration of a 3 level hierarchical goal-conditioned controller. From http: // bigai. cs. brown. edu/
2019/ 09/ 03/ hac. html . Used with kind permission of Andrew Levy.
[Sch+15a]. This policy optimizes an MDP in which the reward is defined as R(s, a|g) = 1 iff the goal state is achieved, i.e., R(s, a|g) = I(s = g). (We can also define a dense reward signal using some state abstraction function ϕ, by defining R(s, a|g) = sim(ϕ(s), ϕ(g)) for some similarity metric.) This approach to RL is known as goal-conditioned RL [LZZ22].
In this section, we discuss an approach to efficiently learning goal-conditioned policies, in the special case
where the set of goal states G is the same as the set of original states S. We will extend this to the hierarchical
case below.
The basic idea is as follows. We collect various trajectories in the environment, from a starting state s_0 to some terminal state s_T, and then define the goal of each trajectory as being g = s_T; this trajectory then serves as a demonstration of how to achieve this goal. This is called hindsight experience replay or HER [And+17]. This can be used to relabel the trajectories stored in the replay buffer. That is, if we have (s, a, R(s|g), s', g) tuples, we replace them with (s, a, R(s|g'), s', g') where g' = s_T. We can then use any off-policy RL method to learn π(a|s, g). In [Eys+20], they show that HER can be viewed as a special case of maximum-entropy inverse RL, since it is estimating the reward for which the corresponding trajectory was optimal.
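A minimal sketch of hindsight relabeling of one stored trajectory (the transition format and the indicator reward are simplified placeholders): the desired goal is replaced by the state actually achieved at the end of the episode, and the sparse reward is recomputed under that new goal.

import numpy as np

def her_relabel(trajectory):
    # trajectory: list of (s, a, s_next, g) tuples from one episode
    achieved_goal = trajectory[-1][2]                 # g' = s_T, the state actually reached
    relabeled = []
    for (s, a, s_next, g) in trajectory:
        r = 1.0 if np.array_equal(s_next, achieved_goal) else 0.0   # R(s, a | g') = I(s' = g')
        relabeled.append((s, a, r, s_next, achieved_goal))
    return relabeled

# toy 1-D example: integer states, episode ended in state 3 although the original goal was 5
traj = [(0, +1, 1, 5), (1, +1, 2, 5), (2, +1, 3, 5)]
for transition in her_relabel(traj):
    print(transition)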
We can leverage HER to learn a hierarchical controller in several ways. In [Nac+18] they propose HIRO (Hierarchical Reinforcement Learning with Off-policy Correction) as a way to train a two-level controller. (For a two-level controller, the top level is often called the manager, and the low level the worker.) The data for the manager are transition tuples of the form (s_t, g_t, ∑ r_{t:t+c}, s_{t+c}), where c is the time taken for the worker to reach the goal (or some maximum time), and r_t is the main task reward function at step t. The data for the worker are transition tuples of the form (s_{t+i}, g_t, a_{t+i}, r_{t+i}^{g_t}, s_{t+i+1}) for i = 0 : c, where r_t^g is the reward wrt reaching goal g. This data can be used to train the two policies. However, if the worker fails to achieve the goal in the given time limit, all the rewards will be 0, and no learning will take place. To combat this, if the worker does not achieve g_t after c timesteps, the subgoal is relabeled in the transition data with another subgoal g_t', which is sampled from p(g|τ), where τ is the observed trajectory. Thus both policies treat g_t' as the goal in hindsight, so they can use the actually collected data for training.
The hierarchical actor critic (HAC) method of [Lev+18] is a simpler version of HIRO that can be
extended to multiple levels of hierarchy, where the lowest level corresponds to primitive actions (see Figure 7.4).
In the HAC approach, the output subgoal in the higher level data, and the input subgoal in the lower-level
data, are replaced with the actual state that was achieved in hindsight. This allows the training of each level of
the hierarchy independently of the lower levels, by assuming the lower level policies are already optimal (since
they achieved the specified goal). As a result, the distribution of (s, a, s′ ) tuples experienced by a higher level
will be stable, providing a stationary learning target. By contrast, if all policies are learned simultaneously,
the distribution becomes non-stationary, which makes learning harder. For more details, see the paper, or
the corresponding blog post (with animations) at http://bigai.cs.brown.edu/2019/09/03/hac.html.
7.5.2 Options
The feudal approach to HRL is somewhat limited, since not all subroutines or skills can be defined in terms
of reaching a goal state (even if it is a partially specified one, such as being in a desired location but without
specifying the velocity). For example, consider the skill of “driving in a circle”, or “finding food”. The options
framework is a more general framework for HRL first proposed in [SPS99]. We discuss this below.
7.5.2.1 Definitions
An option ω = (I, π, β) is a tuple consisting of: the initiation set Iω ⊂ S, which is a subset of states that this
option can start from (also called the affordances of each state [Khe+20]); the subpolicy πω (a|s) ∈ [0, 1];
and the termination condition βω (s) ∈ [0, 1], which gives the probability of finishing in state s. (This
induces a geometric distribution over option durations, which we denote by τ ∼ βω .) The set of all options is
denoted Ω.
Executing an option at step t entails choosing an action using at = πω(st) and then terminating the option at
step t + 1 with probability βω(st+1), or otherwise continuing to follow it at step t + 1. (This
is an example of a semi-Markov decision process [Put94].) If we define πω(s) = a and βω(s) = 1 for all
s, then this option corresponds to the primitive action a, which terminates after one step. But with options we can
expand the repertoire of actions to include those that take many steps to finish.
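To make these semantics concrete, the following is a minimal sketch of an option as a data structure, together with a loop that executes it until termination; the environment interface (env.step returning (state, reward, done)) is an assumption for illustration.

import random
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    initiation_set: Set[Any]              # I_omega: states where the option may start
    policy: Callable[[Any], Any]          # pi_omega(s) -> action
    termination: Callable[[Any], float]   # beta_omega(s) -> probability of terminating in s

def run_option(env, state, option, gamma=0.99):
    """Execute the option from `state` until it terminates.
    Returns the discounted reward R(s, omega), the final state, and the duration tau."""
    assert state in option.initiation_set
    total, discount, tau = 0.0, 1.0, 0
    while True:
        action = option.policy(state)
        state, reward, done = env.step(action)      # assumed environment interface
        total += discount * reward
        discount *= gamma
        tau += 1
        # Terminate in the new state with probability beta_omega(state).
        if done or random.random() < option.termination(state):
            return total, state, tau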
To create an MDP with options, we need to define the reward function and dynamics model. The reward
is defined as follows:
R(s, ω) = E[ R1 + γR2 + · · · + γ^{τ−1} Rτ | S0 = s, A_{0:τ−1} ∼ πω, τ ∼ βω ]   (7.31)
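The corresponding dynamics model is the γ-discounted multi-step (“multi-time”) transition model of [SPS99], which in the notation of this section can be written as

pγ(s′|s, ω) = Σ_{k≥1} γ^k Pr(Sk = s′, τ = k | S0 = s, A_{0:k−1} ∼ πω, τ ∼ βω).

This is the quantity referred to in the next remark.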
Note that pγ(s′|s, ω) is not a conditional probability distribution, because of the γ^k term, but we can usually
treat it like one. Note also that a dynamics model that can predict multiple steps ahead is sometimes called
a jumpy model (see also Section 4.5.3.2).
We can use these definitions to define the value function for a hierarchical policy using a generalized
Bellman equation, as follows:

Vπ(s) = Σ_{ω∈Ω(s)} π(ω|s) [ R(s, ω) + Σ_{s′} pγ(s′|s, ω) Vπ(s′) ]   (7.33)
We can compute this using value iteration. We can then learn a policy using policy iteration, or a policy
gradient method. In other words, once we have defined the options, we can use all the standard RL machinery.
Note that GCRL can be considered a special case of options where each option corresponds to a different
goal. Thus the reward function has the form R(s, ω) = I (s = ω), the termination function is βω (s) = I (s = ω),
and the initiation set is the entire state space.
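As an illustration of how the standard machinery carries over, here is a sketch of value iteration with options, using the optimality (max over ω) version of the backup in Equation (7.33); the tabular arrays R[s, ω] and pγ[s, ω, s′] are an assumed data layout.

import numpy as np

def smdp_value_iteration(R, P_gamma, num_iters=1000, tol=1e-8):
    """R[s, w]         : expected discounted reward of running option w from state s.
    P_gamma[s, w, s']  : discounted multi-step model p_gamma(s' | s, w)
                         (sums to <= 1 over s' because of the gamma^k factor).
    Returns the value function over states and a greedy option per state."""
    S, W = R.shape
    V = np.zeros(S)
    for _ in range(num_iters):
        # Generalized Bellman optimality backup over options.
        Q = R + np.einsum('swt,t->sw', P_gamma, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)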
7.6 Imitation learning
In previous sections, the goal of an RL agent was to learn an optimal sequential decision-making policy that
maximizes the total reward. Imitation learning (IL), also known as apprenticeship learning and learning
from demonstration (LfD), is a different setting, in which the agent does not observe rewards, but has access
to a collection Dexp of trajectories generated by an expert policy πexp; that is, τ = (s0, a0, s1, a1, . . . , sT)
and at ∼ πexp(·|st) for τ ∈ Dexp. The goal is to learn a good policy by imitating the expert, in the absence
of reward signals. IL finds many applications in scenarios where we have demonstrations of experts (often
humans) but designing a good reward function is not easy, such as car driving and conversational systems.
(See also Section 7.7, where we discuss the closely related topic of offline RL, where we also learn from a
collection of trajectories, but no longer assume they are generated by an optimal policy.)
where the expectation wrt pγπexp may be approximated by averaging over states in Dexp . A challenge with
this method is that the loss does not consider the sequential nature of IL: the future state distribution is not
fixed, but instead depends on earlier actions. Therefore, if we learn a policy π̂ that has a low imitation error
under distribution pγπexp , as defined in Equation (7.34), it may still incur a large error under distribution pγπ̂
(when the policy π̂ is actually run). This problem has been tackled by the offline RL literature, which we
discuss in Section 7.7.
where Rθ is an unknown reward function with parameter θ. Abusing notation slightly, we denote by
Rθ(τ) = Σ_{t=0}^{T−1} Rθ(st, at) the cumulative reward along the trajectory τ. This model assigns exponentially
small probabilities to trajectories with lower cumulative rewards. The partition function, Zθ ≜ ∫_τ exp(Rθ(τ)),
is in general intractable to compute, and must be approximated. Here, we can take a sample-based approach.
Let Dexp and D be the sets of trajectories generated by an expert, and by some known distribution q,
respectively. We may infer θ by maximizing the likelihood, p(Dexp|θ), or equivalently, minimizing the negative
log-likelihood loss

L(θ) = − (1/|Dexp|) Σ_{τ∈Dexp} Rθ(τ) + log[ (1/|D|) Σ_{τ∈D} exp(Rθ(τ)) / q(τ) ]   (7.36)
Figure 7.5: Comparison of online on-policy RL, online off-policy RL, and offline RL. From Figure 1 of [Lev+20a].
Used with kind permission of Sergey Levine.
The term inside the log of the loss is an importance sampling estimate of Z that is unbiased as long as
q(τ ) > 0 for all τ . However, in order to reduce the variance, we can choose q adaptively as θ is being updated.
The optimal sampling distribution, q∗ (τ ) ∝ exp(Rθ (τ )), is hard to obtain. Instead, we may find a policy π̂
which induces a distribution that is close to q∗ , for instance, using methods of maximum entropy RL discussed
in Section 3.6.4. Interestingly, the process above produces the inferred reward Rθ as well as an approximate
optimal policy π̂. This approach is used by guided cost learning [FLA16], and has been found effective in robotics
applications.
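A minimal sketch of the sample-based loss in Equation (7.36), using the importance-sampling estimate of the partition function, might look as follows; the reward and proposal log-density functions are assumed to be supplied, and this is not the guided cost learning implementation of [FLA16].

import numpy as np

def maxent_irl_loss(theta, reward_fn, expert_trajs, sampled_trajs, logq_fn):
    """Negative log-likelihood of Equation (7.36).
    reward_fn(theta, traj) -> cumulative reward R_theta(tau) of a trajectory.
    logq_fn(traj)          -> log q(tau) under the known proposal distribution q."""
    expert_term = np.mean([reward_fn(theta, tau) for tau in expert_trajs])
    # Importance-sampling estimate of Z_theta: mean of exp(R_theta(tau)) / q(tau).
    log_weights = np.array([reward_fn(theta, tau) - logq_fn(tau)
                            for tau in sampled_trajs])
    # logsumexp trick for numerical stability: log((1/|D|) * sum exp(log_weights)).
    m = log_weights.max()
    log_Z_hat = m + np.log(np.mean(np.exp(log_weights - m)))
    return -expert_term + log_Z_hat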
min_π max_w  E_{pγπexp(s,a)}[Tw(s, a)] − E_{pγπ(s,a)}[f∗(Tw(s, a))]   (7.37)
7.7 Offline RL
Offline reinforcement learning (also called batch reinforcement learning [LGR12]) is concerned with
learning a reward maximizing policy from a fixed, static dataset, collected by some existing policy, known as
the behavior policy. Thus no interaction with the environment is allowed (see Figure 7.5). This makes
policy learning harder than the online case, since we do not know the consequences of actions that were not
taken in a given state, and cannot test any such “counterfactual” predictions by trying them. (This is the
same problem as in off-policy RL, which we discussed in Section 3.4.) In addition, the policy will be deployed
on new states that it may not have seen, requiring that the policy generalize out-of-distribution, which is the
main bottleneck for current offline RL methods [Par+24b].
A very simple and widely used offline RL method is known as behavior cloning or BC. This amounts
to training a policy to predict the observed output action at associated with each observed state st , so
we aim to ensure π(st ) ≈ at , as in supervised learning. This assumes the offline dataset was created
by an expert, and so falls under the umbrella of imitation learning (see Section 7.6.1 for details). By
contrast, offline RL methods can leverage suboptimal data. We give a brief summary of some of these
methods below. For more details, see e.g., [Lev+20b; Che+24b; Cet+24; YWW25] and the list of papers
at https://github.com/hanjuku-kaso/awesome-offline-rl. For some offline RL benchmarks, see D4RL
[Fu+20], RL Unplugged [Gul+20], OGBench (Offline Goal-Conditioned benchmark) [Par+24a], and D5RL
[Raf+24].
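For concreteness, a minimal behavior cloning baseline on an offline dataset of (s, a) pairs might look like the following sketch; the network size, squared-error loss (for continuous actions), and optimizer settings are illustrative choices.

import torch
from torch import nn, optim

def behavior_cloning(states, actions, epochs=100, lr=1e-3):
    """states: (N, state_dim) tensor, actions: (N, action_dim) tensor of offline data.
    Fits a deterministic policy pi(s) ~= a by regression (continuous actions assumed)."""
    state_dim, action_dim = states.shape[1], actions.shape[1]
    policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                           nn.Linear(256, 256), nn.ReLU(),
                           nn.Linear(256, action_dim))
    opt = optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((policy(states) - actions) ** 2).mean()   # enforce pi(s_t) ~= a_t
        loss.backward()
        opt.step()
    return policy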
One problem with the above method is that we have to fit a parametric model to πb (a|s) in order to
evaluate the divergence term. Fortunately, in the case of KL, the divergence can be enforced implicitly, as in
the advantage weighted regression or AWR method of [Pen+19], the reward weighted regression
method of [PS07], the advantage weighted actor critic or AWAC method of [Nai+20], and the advantage
weighted behavior model or ABM method of [Sie+20]. In this approach, we first solve (nonparametrically)
for the new policy under the KL divergence constraint to get π̄^{k+1}, and then we project this into the required
policy function class via supervised regression, as follows:
π̄^{k+1}(a|s) ← (1/Z) πb(a|s) exp( (1/α) Q^{πk}(s, a) )   (7.42)

π_{k+1} ← argmin_π DKL(π̄^{k+1} ∥ π)   (7.43)
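In practice, the two steps in Equations (7.42)–(7.43) collapse into a single weighted regression: each dataset action is weighted by exp(A(s, a)/α) and the policy is fit by weighted maximum likelihood. A minimal sketch is shown below; it is illustrative rather than the exact AWR/AWAC implementation, and it assumes the policy network returns a distribution object with a per-sample log_prob.

import torch

def awr_policy_loss(policy, q_fn, v_fn, states, actions, alpha=1.0, max_weight=20.0):
    """policy(states) is assumed to return a torch.distributions object whose
    log_prob(actions) gives one value per sample (e.g., an Independent Normal).
    q_fn, v_fn are the learned critic and value baseline trained on the offline data."""
    with torch.no_grad():
        adv = q_fn(states, actions) - v_fn(states)               # A(s, a)
        weights = torch.clamp(torch.exp(adv / alpha), max=max_weight)
    # Weighted maximum likelihood projects the nonparametric target policy
    # pi_b(a|s) exp(A(s, a)/alpha) onto the parametric policy class.
    return -(weights * policy(states).log_prob(actions)).mean()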
Recently, a class of simple and effective methods has been developed: we first learn a baseline policy
π(a|s) (using BC) and a Q function (by minimizing the Bellman error) on the offline data, and then update the
policy parameters to pick actions that have high expected value according to Q and which are also likely
under the BC prior. An early example of this is the Q† algorithm of [Fuj+19]. In [FG21], they present the
DDPG+BC method, which optimizes
where µπ (s) = Eπ(a|s) [a] is the mean of the predicted action, and α is a hyper-parameter. As another example,
the DQL method of [WHZ23] optimizes a diffusion policy using
min_π L(π) = Ldiffusion(π) + Lq(π) = Ldiffusion(π) − α E_{s∼D, a∼π(·|s)}[Q(s, a)]   (7.45)
where the second term is a penalty derived from Conservative Q Learning (Section 7.7.1.4), which ensures that the
Q values of the selected actions do not get too small. Finally, [Aga+22b] discusses how to transfer the policy from a previous agent
to a new agent by combining BC with Q learning.
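The common structure of these methods is an actor loss that trades off the critic's value against staying close to the behavior data. The following generic sketch illustrates such a BC-regularized actor update for a deterministic policy; it is not the exact objective of [FG21] or [WHZ23], and the weighting α is a hyper-parameter.

import torch

def bc_regularized_actor_loss(policy, q_fn, states, dataset_actions, alpha=2.5):
    """Pick actions with high Q while staying close to the actions in the dataset.
    policy(states) -> proposed actions (deterministic actor assumed)."""
    pi_actions = policy(states)
    q_term = q_fn(states, pi_actions).mean()                  # maximize Q(s, pi(s))
    bc_term = ((pi_actions - dataset_actions) ** 2).mean()    # stay near behavior data
    # Minimizing this loss maximizes Q subject to a BC penalty; alpha balances the two.
    return -alpha * q_term + bc_term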
An alternative way to avoid picking out-of-distribution actions, where the Q function might be unreliable, is
to add a penalty term to the Q function based on the estimated epistemic uncertainty, given the dataset
D, which we denote by Unc(PD (Qπ )), where PD (Qπ ) is the distribution over Q functions, and Unc is some
metric on distributions. For example, we can use a deep ensemble to represent the distribution, and use the
variance of Q(s, a) across ensemble members as a measure of uncertainty. This gives rise to the following
policy improvement update:
πk+1 ← argmax_π E_{s∼D}[ E_{π(a|s)}[ E_{PD(Q^{πk+1})}[ Q^{πk+1}(s, a) ] − α Unc(PD(Q^{πk+1})) ] ]   (7.46)
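With a deep ensemble of K critics, the inner expectation and the uncertainty term in Equation (7.46) can be approximated by the ensemble mean and standard deviation. A minimal sketch follows; using the ensemble standard deviation as Unc is one of the choices mentioned above, and the policy is assumed to return a reparameterizable distribution.

import torch

def pessimistic_actor_loss(policy, q_ensemble, states, alpha=1.0):
    """q_ensemble: list of K critics Q_k(s, a); Unc is taken to be the ensemble std.
    policy(states) is assumed to return a distribution supporting rsample()."""
    dist = policy(states)
    actions = dist.rsample()                                   # a ~ pi(.|s), reparameterized
    q_values = torch.stack([q(states, actions) for q in q_ensemble], dim=0)  # (K, B)
    q_mean = q_values.mean(dim=0)                              # E_{P_D(Q)}[Q(s, a)]
    q_unc = q_values.std(dim=0)                                # Unc(P_D(Q)) as ensemble std
    # Maximize mean Q minus the uncertainty penalty, as in Equation (7.46).
    return -(q_mean - alpha * q_unc).mean()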
7.7.1.4 Conservative Q-learning
An alternative to explicitly estimating uncertainty is to add a conservative penalty directly to the Q-learning
error term. That is, we minimize the following wrt w using each batch of data B:
where μ is the new policy derived from Q, R(μ) = −DKL(μ ∥ ρ) is a regularizer, and ρ is the action
prior, which we discuss below. Since we are minimizing C(B, w) (in addition to E(B, w)), we see that we are
simultaneously maximizing the Q values for actions that are drawn from the behavior policy while minimizing
the Q values for actions sampled from µ. This is to combat the optimism bias of Q-learning (hence the term
“conservative”).
Now we derive the expression for μ. From Section 3.6.4 we know that the optimal solution has the form
μ(a|s) = (1/Z) ρ(a|s) exp(Q(s, a)), where Z = Σ_{a′} exp(Q(s, a′)) is the normalizer, and ρ(a|s) is the prior. (For
example, we can set ρ(a|s) to be the previous policy.) We can then approximate the first term in the penalty
using importance sampling, with ρ(a|s) as the proposal:

E_{a∼μ(·|s)}[Q(s, a)] = E_{ρ(a|s)}[ (μ(a|s) / ρ(a|s)) Q(s, a) ] = E_{ρ(a|s)}[ (exp(Q(s, a)) / Σ_{a′} exp(Q(s, a′))) Q(s, a) ]   (7.49)
Alternatively, suppose we set ρ(a|s) to be uniform, as in maxent RL (Section 3.6.4). In this case, we
should replace the value function with the soft value function. From Equation (3.159), using a penalty
coefficient of α = 1, we have
Ea[Qsoft(s, a)] = Vsoft(s) = log Σ_a exp(Q(s, a))   (7.50)
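For discrete actions with a uniform prior ρ, the first term of the conservative penalty thus becomes the soft value logsumexp_a Q(s, a). A minimal sketch of the resulting penalty for a Q-network over discrete actions is shown below; it is illustrative, not the reference CQL implementation.

import torch

def cql_penalty(q_net, states, dataset_actions):
    """q_net(states) -> (B, |A|) tensor of Q values for all actions.
    dataset_actions: (B,) LongTensor of actions taken in the batch.
    Penalty = E_s[ logsumexp_a Q(s, a) ] - E_{(s,a)~B}[ Q(s, a) ]:
    pushes Q down on out-of-distribution actions and up on dataset actions."""
    q_all = q_net(states)                                      # (B, |A|)
    pushed_down = torch.logsumexp(q_all, dim=1)                # soft value, as in Eq. (7.50)
    pushed_up = q_all.gather(1, dataset_actions.unsqueeze(1)).squeeze(1)
    return (pushed_down - pushed_up).mean()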
7.7.3 Offline RL using reward-conditioned sequence modeling
Recently an approach to offline RL based on sequence modeling has become very popular. The basic idea
— known as upside down RL [Sch19] or RvS (RL via Supervised learning) [KPL19; Emm+21] — is to
train a generative model over future states and/or actions conditioned on the observed reward, rather than
predicting the reward given a state-action trajectory. At test time, the conditioning is changed to represent
the desired reward, and futures are sampled from the model. The implementation of this idea then depends
on what kind of generative model is used, as we discuss below.
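The simplest instantiation of this idea conditions the policy on the return-to-go and trains it purely by supervised learning; at test time one conditions on a high desired return. A minimal sketch is given below (per-state conditioning only; the architectures of the methods discussed next differ substantially).

import torch
from torch import nn, optim

def train_return_conditioned_policy(states, actions, returns_to_go, epochs=100):
    """Fit pi(a | s, R_to_go) by supervised learning on offline trajectories.
    states: (N, d_s), actions: (N, d_a), returns_to_go: (N, 1)."""
    inp = torch.cat([states, returns_to_go], dim=1)
    policy = nn.Sequential(nn.Linear(inp.shape[1], 256), nn.ReLU(),
                           nn.Linear(256, actions.shape[1]))
    opt = optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((policy(inp) - actions) ** 2).mean()
        loss.backward()
        opt.step()
    return policy

# At test time, act by conditioning on a large desired return:
# a_t = policy(torch.cat([s_t, desired_return], dim=1))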
The trajectory transformer method of [JLL21] learns a joint model of the form p(s1:T , a1:T , r1:T ) using
a transformer, and then samples from this using beam search, selecting the ones with high reward (similar to
MPC, Section 4.2.4). The decision transformer [Che+21b] is related, but just generates action sequences,
and conditions on the past observations and the future reward-to-go. That is, it fits
where τ = (s1 , a1 , . . . , sT , aT ) is the state-action trajectory and y is the observed (trajectory-level) reward.
The model for p(τ |z) is a causal transformer, which generates the action at each step given the previous
states and actions and the latent plan. The model for p(y|z) is just a Gaussian. The model for p(z) is a
Gaussian passed through a U-net style CNN, thus providing a richer prior. The latent variables provide a way
to “stitch together” individual (high performing) trajectories, so that the learned policy can predict p(at |st , z)
even if the current state st is not on the training manifold (thus requiring generalization, a problem that
behavior cloning faces [Ghu+24]). During decision time, they infer ẑ = argmax p(z|y = ymax ) using gradient
ascent, and then autoregressively generate actions from p(at |s1:t , a1:t−1 , ẑ).
This is called the offline-to-online (O2O) paradigm. Unfortunately, due to the significant distribution shift
between online experiences and offline data, most offline RL algorithms suffer from performance drops when
they are finetuned online. Many different methods have been proposed to tackle this, a few of which we
mention below. See https://github.com/linhlpv/awesome-offline-to-online-RL-papers for a more
extensive list.
[Nak+23] suggest pre-training with CQL followed by online finetuning. Naively this does not work that
well, because CQL can be too conservative, requiring the online learning to waste some time at the beginning
fixing the pessimism. So they propose a small modification to CQL, known as Calibrated Q learning. This
simply prevents CQL from being too conservative, by replacing the CQL regularizer in Equation (7.48) with
a slightly modified expression. Then online finetuning is performed in the usual way.
An alternative approach is the Dagger algorithm of [RGB11]. (Dagger is short for Dataset Aggregation.)
This iteratively trains the policy on expert-provided data. We start with an initial dataset D (e.g., empty)
and an initial policy π1 (e.g., random). At iteration t, we run the current policy πt in the environment to
collect states {si}. We then ask an expert policy for the correct actions a∗i = π∗(si). We then aggregate the
data to compute D = D ∪ {(si, a∗i)}, and train the new policy πt+1 on D. The key idea is to not just train on
expert trajectories as in BC, but to train on the states that the policy actually visits. This avoids overfitting
to idealized data and improves robustness (avoids compounding error).
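A sketch of this loop is given below; the environment, expert, and supervised training routine are assumed interfaces.

def dagger(env, expert_policy, train_policy_fn, initial_policy, num_iters=10,
           episodes_per_iter=10):
    """train_policy_fn(dataset) -> new policy fit by supervised learning on (s, a*) pairs.
    env.step(a) is assumed to return (next_state, reward, done)."""
    dataset = []                      # D, the aggregated dataset
    policy = initial_policy
    for _ in range(num_iters):
        # 1. Run the *current* policy to collect the states it actually visits.
        visited_states = []
        for _ in range(episodes_per_iter):
            s, done = env.reset(), False
            while not done:
                visited_states.append(s)
                s, _, done = env.step(policy(s))
        # 2. Ask the expert for the correct action in each visited state.
        dataset += [(s, expert_policy(s)) for s in visited_states]
        # 3. Retrain the policy on the aggregated dataset D.
        policy = train_policy_fn(dataset)
    return policy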
where Pr(p) is the prior probability of p, and we assume the likelihood is 1 if p can generate the observations
given the actions, and is 0 otherwise.
One important question is: what is a reasonable prior over programs? In [Hut05], Marcus Hutter proposed
to apply the idea of Solomonoff induction [Sol64] to the case of an online decision making agent. This
amounts to using the prior Pr(p) = 2−ℓ(p) , where ℓ(p) is the length of program p. This prior favors shorter
programs, and the likelihood filters out programs that cannot explain the data. The resulting agent is known
as AIXI, where “AI” stands for “Artificial Intelligence” and “XI” refers to the Greek letter ξ used in
Solomonoff induction. The AIXI agent has been called the “most intelligent general-purpose agent possible”
[HQC24], and can be viewed as the theoretical foundation of (universal) artificial general intelligence or
AGI.
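To give a flavor of this construction, the following toy sketch maintains a posterior over a finite set of candidate programs, weighting each by the length prior 2^{-ℓ(p)} and the 0/1 consistency likelihood; true Solomonoff induction mixes over all programs and is incomputable.

def posterior_over_programs(programs, history):
    """programs: list of (length, predict_fn), where predict_fn(actions) returns the
    observation sequence the program would generate given the action sequence.
    history: (actions, observations) seen so far.
    Returns normalized posterior weights using the prior Pr(p) = 2^{-length(p)}
    and a 0/1 likelihood (the program is either consistent with the data or not)."""
    actions, observations = history
    weights = []
    for length, predict_fn in programs:
        prior = 2.0 ** (-length)
        likelihood = 1.0 if predict_fn(actions)[:len(observations)] == observations else 0.0
        weights.append(prior * likelihood)
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights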
Unfortunately, the AIXI agent is intractable to compute, for two main reasons: (1) it relies on Solomonoff
induction and Kolmogorov complexity, both of which are intractable; and (2) the expectimax computation
is intractable. Fortunately, various tractable approximations have been devised. In lieu of Kolmogorov
complexity, we can use measures like MDL (minimum description length), and for Solomonoff induction, we
can use various local search or optimization algorithms through suitable function classes. For the expectimax
computation, we can use MCTS (see Section 4.2.2) to approximate it. Alternatively, [GM+24] showed that
it is possible to use meta learning to train a generic sequence predictor, such as a transformer or LSTM,
on data generated by random Turing machines, so that the transformer learns to approximate a universal
predictor. Another approach is to learn a policy (to avoid searching over action sequences) using TD-learning
(Section 2.3.2); the weighting term in the policy mixture requires that the agent predict its own future actions,
so this approach is known as self-AIXI [Cat+23].
Note that AIXI is a normative theory for optimal agents, but is not very practical, since it does not take
computational limitations into account. In [Aru+24a; Aru+24b], they describe an approach which extends
the above Bayesian framework, while also taking into account the data budget (due to limited environment
interactions) that real agents must contend with (which prohibits modeling the entire environment or
finding the optimal action). This approach, known as Capacity-Limited Bayesian RL (CBRL), combines
Bayesian inference, RL, and rate distortion theory, and can be seen as a normative theoretical foundation for
computationally bounded rational agents.
Chapter 8
Acknowledgements
Parts of this monograph are borrowed from chapters 34 and 35 of my textbook [Mur23], some of which was
written with Lihong Li. However, this text supersedes those chapters, and goes beyond them in many ways.
Thanks to the following people for feedback on the current document: Pablo Samuel Castro, Elad Hazan,
Tuan Ahn Le, Dieterich Lawson, Marc Lanctot, David Pfau, Theo Weber. And thanks to Xinghua Lou for
help with some of the figures.
Bibliography
[AJO08] P. Auer, T. Jaksch, and R. Ortner. “Near-optimal Regret Bounds for Reinforcement Learning”.
In: NIPS. Vol. 21. 2008. url: https://proceedings.neurips.cc/paper_files/paper/2008/
file/e4a6222cdb5b34375400904f03d8e6a5-Paper.pdf.
[al24] J. P.-H. et al. “Genie 2: A large-scale foundation world model”. In: (2024). url: https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/.
[Ale+23] L. N. Alegre, A. L. C. Bazzan, A. Nowé, and B. C. da Silva. “Multi-step generalized policy
improvement by leveraging approximate models”. In: NIPS. Vol. 36. Curran Associates, Inc.,
2023, pp. 38181–38205. url: https://proceedings.neurips.cc/paper_files/paper/2023/
hash/77c7faab15002432ba1151e8d5cc389a-Abstract-Conference.html.
[Alo+24] E. Alonso, A. Jelley, V. Micheli, A. Kanervisto, A. Storkey, T. Pearce, and F. Fleuret. “Diffusion
for world modeling: Visual details matter in Atari”. In: arXiv [cs.LG] (May 2024). url: http:
//arxiv.org/abs/2405.12399.
[AM89] B. D. Anderson and J. B. Moore. Optimal Control: Linear Quadratic Methods. Prentice-Hall
International, Inc., 1989.
[Ama98] S Amari. “Natural Gradient Works Efficiently in Learning”. In: Neural Comput. 10.2 (1998),
pp. 251–276. url: http://dx.doi.org/10.1162/089976698300017746.
[AMH23] A. Aubret, L. Matignon, and S. Hassas. “An information-theoretic perspective on intrinsic
motivation in reinforcement learning: A survey”. en. In: Entropy 25.2 (Feb. 2023), p. 327. url:
https://www.mdpi.com/1099-4300/25/2/327.
[Ami+21] S. Amin, M. Gomrokchi, H. Satija, H. van Hoof, and D. Precup. “A survey of exploration
methods in reinforcement learning”. In: arXiv [cs.LG] (Aug. 2021). url: http://arxiv.org/
abs/2109.00157.
[An+21] G. An, S. Moon, J.-H. Kim, and H. O. Song. “Uncertainty-Based Offline Reinforcement Learning
with Diversified Q-Ensemble”. In: NIPS. Vol. 34. Dec. 2021, pp. 7436–7447. url: https://
proceedings.neurips.cc/paper_files/paper/2021/file/3d3d286a8d153a4a58156d0e02d8570c-
Paper.pdf.
[And+17] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin,
P. Abbeel, and W. Zaremba. “Hindsight Experience Replay”. In: arXiv [cs.LG] (July 2017).
url: http://arxiv.org/abs/1707.01495.
[And+20] O. M. Andrychowicz et al. “Learning dexterous in-hand manipulation”. In: Int. J. Rob. Res.
39.1 (2020), pp. 3–20. url: https://doi.org/10.1177/0278364919887447.
[Ant22] Anthropic. “Constitutional AI: Harmlessness from AI Feedback”. In: arXiv [cs.CL] (Dec. 2022).
url: http://arxiv.org/abs/2212.08073.
[Ant+22] I. Antonoglou, J. Schrittwieser, S. Ozair, T. K. Hubert, and D. Silver. “Planning in Stochastic
Environments with a Learned Model”. In: ICLR. 2022. url: https://openreview.net/forum?
id=X6D9bAHhBQ1.
[AP23] S. Alver and D. Precup. “Minimal Value-Equivalent Partial Models for Scalable and Robust
Planning in Lifelong Reinforcement Learning”. en. In: Conference on Lifelong Learning Agents.
PMLR, Nov. 2023, pp. 548–567. url: https://proceedings.mlr.press/v232/alver23a.
html.
[AP24] S. Alver and D. Precup. “A Look at Value-Based Decision-Time vs. Background Planning
Methods Across Different Settings”. In: Seventeenth European Workshop on Reinforcement
Learning. Oct. 2024. url: https://openreview.net/pdf?id=Vx2ETvHId8.
[Arb+23] J. Arbel, K. Pitas, M. Vladimirova, and V. Fortuin. “A Primer on Bayesian Neural Networks:
Review and Debates”. In: arXiv [stat.ML] (Sept. 2023). url: http://arxiv.org/abs/2309.
16314.
[ARKP24] S. Alver, A. Rahimi-Kalahroudi, and D. Precup. “Partial models for building adaptive model-
based reinforcement learning agents”. In: COLLAS. May 2024. url: https://arxiv.org/abs/
2405.16899.
[Aru+17] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. “A Brief Survey of Deep
Reinforcement Learning”. In: IEEE Signal Processing Magazine, Special Issue on Deep Learning
for Image Understanding (2017). url: http://arxiv.org/abs/1708.05866.
[Aru+18] D. Arumugam, D. Abel, K. Asadi, N. Gopalan, C. Grimm, J. K. Lee, L. Lehnert, and M. L.
Littman. “Mitigating planner overfitting in model-based reinforcement learning”. In: arXiv
[cs.LG] (Dec. 2018). url: http://arxiv.org/abs/1812.01129.
[Aru+24a] D. Arumugam, M. K. Ho, N. D. Goodman, and B. Van Roy. “Bayesian Reinforcement Learning
With Limited Cognitive Load”. en. In: Open Mind 8 (Apr. 2024), pp. 395–438. url: https:
//direct.mit.edu/opmi/article- pdf/doi/10.1162/opmi_a_00132/2364075/opmi_a_
00132.pdf.
[Aru+24b] D. Arumugam, S. Kumar, R. Gummadi, and B. Van Roy. “Satisficing exploration for deep
reinforcement learning”. In: Finding the Frame Workshop at RLC. July 2024. url: https:
//openreview.net/forum?id=tHCpsrzehb.
[AS18] S. V. Albrecht and P. Stone. “Autonomous agents modelling other agents: A comprehensive
survey and open problems”. en. In: Artif. Intell. 258 (May 2018), pp. 66–95. url: http :
//dx.doi.org/10.1016/j.artint.2018.01.002.
[AS22] D. Arumugam and S. Singh. “Planning to the information horizon of BAMDPs via epistemic
state abstraction”. In: NIPS. Oct. 2022.
[AS66] S. M. Ali and S. D. Silvey. “A General Class of Coefficients of Divergence of One Distribution
from Another”. In: J. R. Stat. Soc. Series B Stat. Methodol. 28.1 (1966), pp. 131–142. url:
http://www.jstor.org/stable/2984279.
[ASN20] R. Agarwal, D. Schuurmans, and M. Norouzi. “An Optimistic Perspective on Offline Reinforce-
ment Learning”. en. In: ICML. PMLR, Nov. 2020, pp. 104–114. url: https://proceedings.
mlr.press/v119/agarwal20c.html.
[Ass+23] M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas.
“Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture”. In:
CVPR. Jan. 2023. url: http://arxiv.org/abs/2301.08243.
[Att03] H. Attias. “Planning by Probabilistic Inference”. In: AI-Stats. 2003. url: http://research.
goldenmetallic.com/aistats03.pdf.
[Aue+19] P. Auer, Y. Chen, P. Gajane, C.-W. Lee, H. Luo, R. Ortner, and C.-Y. Wei. “Achieving Optimal
Dynamic Regret for Non-stationary Bandits without Prior Information”. en. In: Conference on
Learning Theory. PMLR, June 2019, pp. 159–163. url: https://proceedings.mlr.press/
v99/auer19b.html.
[Aum87] R. J. Aumann. “Correlated equilibrium as an expression of Bayesian rationality”. en. In:
Econometrica 55.1 (Jan. 1987), p. 1. url: https://www.jstor.org/stable/1911154.
[Axe84] R. Axelrod. The evolution of cooperation. Basic Books, 1984.
[AY20] B. Amos and D. Yarats. “The Differentiable Cross-Entropy Method”. In: ICML. 2020. url:
http://arxiv.org/abs/1909.12830.
[Bad+20] A. P. Badia, B. Piot, S. Kapturowski, P Sprechmann, A. Vitvitskyi, D. Guo, and C Blundell.
“Agent57: Outperforming the Atari Human Benchmark”. In: ICML 119 (Mar. 2020), pp. 507–517.
url: https://proceedings.mlr.press/v119/badia20a/badia20a.pdf.
[Bai95] L. C. Baird. “Residual Algorithms: Reinforcement Learning with Function Approximation”. In:
ICML. 1995, pp. 30–37.
[Bal+23] P. J. Ball, L. Smith, I. Kostrikov, and S. Levine. “Efficient Online Reinforcement Learning with
Offline Data”. en. In: ICML. PMLR, July 2023, pp. 1577–1594. url: https://proceedings.
mlr.press/v202/ball23a.html.
[Ban+23] D. Bansal, R. T. Q. Chen, M. Mukadam, and B. Amos. “TaskMet: Task-driven metric learning for
model learning”. In: NIPS. Ed. by A Oh, T Naumann, A Globerson, K Saenko, M Hardt, and S
Levine. Vol. abs/2312.05250. Dec. 2023, pp. 46505–46519. url: https://proceedings.neurips.
cc / paper _ files / paper / 2023 / hash / 91a5742235f70ae846436d9780e9f1d4 - Abstract -
Conference.html.
[Bar+17] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver. “Suc-
cessor Features for Transfer in Reinforcement Learning”. In: NIPS. Vol. 30. 2017. url: https://
proceedings.neurips.cc/paper_files/paper/2017/file/350db081a661525235354dd3e19b8c05-
Paper.pdf.
[Bar+19] A. Barreto et al. “The Option Keyboard: Combining Skills in Reinforcement Learning”. In:
NIPS. Vol. 32. 2019. url: https://proceedings.neurips.cc/paper_files/paper/2019/
file/251c5ffd6b62cc21c446c963c76cf214-Paper.pdf.
[Bar+20] A. Barreto, S. Hou, D. Borsa, D. Silver, and D. Precup. “Fast reinforcement learning with
generalized policy updates”. en. In: PNAS 117.48 (Dec. 2020), pp. 30079–30087. url: https:
//www.pnas.org/doi/abs/10.1073/pnas.1907370117.
[Bar+24] A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. Le Cun, M. Assran, and N. Ballas.
“Revisiting Feature Prediction for Learning Visual Representations from Video”. In: (2024).
url: https://ai.meta.com/research/publications/revisiting-feature-prediction-
for-learning-visual-representations-from-video/.
[Bar+25] C. Baronio, P. Marsella, B. Pan, and S. Alberti. Kevin-32B: Multi-Turn RL for Writing CUDA
Kernels. 2025. url: https://cognition.ai/blog/kevin-32b.
[BBS95] A. G. Barto, S. J. Bradtke, and S. P. Singh. “Learning to act using real-time dynamic pro-
gramming”. In: AIJ 72.1 (1995), pp. 81–138. url: http://www.sciencedirect.com/science/
article/pii/000437029400011O.
[BDG00] C. Boutilier, R. Dearden, and M. Goldszmidt. “Stochastic dynamic programming with factored
representations”. en. In: Artif. Intell. 121.1-2 (Aug. 2000), pp. 49–107. url: http://dx.doi.
org/10.1016/S0004-3702(00)00033-3.
[BDM17] M. G. Bellemare, W. Dabney, and R. Munos. “A Distributional Perspective on Reinforcement
Learning”. In: ICML. 2017. url: http://arxiv.org/abs/1707.06887.
[BDR23] M. G. Bellemare, W. Dabney, and M. Rowland. Distributional Reinforcement Learning. http:
//www.distributional-rl.org. MIT Press, 2023.
[Bel+13] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. “The Arcade Learning Environment:
An Evaluation Platform for General Agents”. In: JAIR 47 (2013), pp. 253–279.
[Bel+16] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. “Unifying
Count-Based Exploration and Intrinsic Motivation”. In: NIPS. 2016. url: http://arxiv.org/
abs/1606.01868.
[Ber19] D. Bertsekas. Reinforcement learning and optimal control. Athena Scientific, 2019. url: http:
//www.mit.edu/~dimitrib/RLbook.html.
[BHP17] P.-L. Bacon, J. Harb, and D. Precup. “The Option-Critic Architecture”. In: AAAI. 2017.
[BKH16] J. L. Ba, J. R. Kiros, and G. E. Hinton. “Layer Normalization”. In: (2016). arXiv: 1607.06450
[stat.ML]. url: http://arxiv.org/abs/1607.06450.
[BKM24] W Bradley Knox and J. MacGlashan. “How to Specify Reinforcement Learning Objectives”. In:
Finding the Frame: An RLC Workshop for Examining Conceptual Frameworks. July 2024. url:
https://openreview.net/pdf?id=2MGEQNrmdN.
[BLM16] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory
of Independence. Oxford University Press, 2016.
[Blo+15] D. Bloembergen, K. Tuyls, D. Hennes, and M. Kaisers. “Evolutionary dynamics of multi-agent
learning: A survey”. en. In: JAIR 53 (Aug. 2015), pp. 659–697. url: https://jair.org/index.
php/jair/article/view/10952.
[BM+18] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, T. B. Dhruva, A. Muldal,
N. Heess, and T. Lillicrap. “Distributed Distributional Deterministic Policy Gradients”. In:
ICLR. 2018. url: https://openreview.net/forum?id=SyZipzbCb&noteId=SyZipzbCb.
[BMS11] S. Bubeck, R. Munos, and G. Stoltz. “Pure Exploration in Finitely-armed and Continuous-armed
Bandits”. In: Theoretical Computer Science 412.19 (2011), pp. 1832–1852.
[Boe+05] P.-T. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein. “A Tutorial on the Cross-Entropy
Method”. en. In: Ann. Oper. Res. 134.1 (2005), pp. 19–67. url: https://link.springer.com/
article/10.1007/s10479-005-5724-z.
[Boo+23] S. Booth, W. B. Knox, J. Shah, S. Niekum, P. Stone, and A. Allievi. “The Perils of Trial-and-
Error Reward Design: Misdesign through Overfitting and Invalid Task Specifications”. In: AAAI.
2023. url: https://slbooth.com/assets/projects/Reward_Design_Perils/.
[Bor+19] D. Borsa, A. Barreto, J. Quan, D. J. Mankowitz, H. van Hasselt, R. Munos, D. Silver, and
T. Schaul. “Universal Successor Features Approximators”. In: ICLR. 2019. url: https://
openreview.net/pdf?id=S1VWjiRcKX.
[Bos16] N. Bostrom. Superintelligence: Paths, Dangers, Strategies. en. London, England: Oxford Uni-
versity Press, Mar. 2016. url: https://www.amazon.com/Superintelligence- Dangers-
Strategies-Nick-Bostrom/dp/0198739834.
[Bou+23] K. Bousmalis et al. “RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation”.
In: TMLR (June 2023). url: http://arxiv.org/abs/2306.11706.
[Bow+23] M. Bowling, J. D. Martin, D. Abel, and W. Dabney. “Settling the reward hypothesis”. en. In:
ICML. 2023. url: https://arxiv.org/abs/2212.10420.
[BPL22a] A. Bardes, J. Ponce, and Y. LeCun. “VICReg: Variance-Invariance-Covariance Regularization
for Self-Supervised Learning”. In: ICLR. 2022. url: https : / / openreview . net / pdf ? id =
xm6YD62D1Ub.
[BPL22b] A. Bardes, J. Ponce, and Y. LeCun. “VICRegL: Self-supervised learning of local visual features”.
In: NIPS. Oct. 2022. url: https://arxiv.org/abs/2210.01571.
[Bra+22] D. Brandfonbrener, A. Bietti, J. Buckman, R. Laroche, and J. Bruna. “When does return-
conditioned supervised learning work for offline reinforcement learning?” In: NIPS. June 2022.
url: http://arxiv.org/abs/2206.01079.
[Bro+12] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener,
D. Perez, S. Samothrakis, and S. Colton. “A Survey of Monte Carlo Tree Search Methods”. In:
IEEE Transactions on Computational Intelligence and AI in Games 4.1 (2012).
[Bro+19] N. Brown, A. Lerer, S. Gross, and T. Sandholm. “Deep Counterfactual Regret Minimization”.
en. In: ICML. PMLR, May 2019, pp. 793–802. url: https://proceedings.mlr.press/v97/
brown19b.html.
[Bro24] W. Brown. Generative AI Handbook: A Roadmap for Learning Resources. 2024. url: https:
//genai-handbook.github.io.
[BS17] N. Brown and T. Sandholm. “Superhuman AI for heads-up no-limit poker: Libratus beats top
professionals”. en. In: Science 359.6374 (2017), pp. 418–424. url: https://www.science.org/
doi/10.1126/science.aao1733.
[BS19] N. Brown and T. Sandholm. “Superhuman AI for multiplayer poker”. en. In: Science 365.6456
(Aug. 2019), pp. 885–890. url: https://www.science.org/doi/10.1126/science.aay2400.
[BS23] A. Bagaria and T. Schaul. “Scaling goal-based exploration via pruning proto-goals”. en. In: IJCAI.
Aug. 2023, pp. 3451–3460. url: https://dl.acm.org/doi/10.24963/ijcai.2023/384.
[BSA83] A. G. Barto, R. S. Sutton, and C. W. Anderson. “Neuronlike adaptive elements that can
solve difficult learning control problems”. In: SMC 13.5 (1983), pp. 834–846. url: http :
//dx.doi.org/10.1109/TSMC.1983.6313077.
[BT12] M. Botvinick and M. Toussaint. “Planning as inference”. en. In: Trends Cogn. Sci. 16.10 (2012),
pp. 485–488. url: https://pdfs.semanticscholar.org/2ba7/88647916f6206f7fcc137fe7866c58e6211e.
pdf.
[Buc+17] C. L. Buckley, C. S. Kim, S. McGregor, and A. K. Seth. “The free energy principle for action
and perception: A mathematical review”. In: J. Math. Psychol. 81 (2017), pp. 55–79. url:
https://www.sciencedirect.com/science/article/pii/S0022249617300962.
[Bur+18] Y. Burda, H. Edwards, A Storkey, and O. Klimov. “Exploration by random network distillation”.
In: ICLR. Vol. abs/1810.12894. Sept. 2018.
[Bur25] A. Burkov. The Hundred-Page Language Models Book. 2025. url: https://thelmbook.com/.
[BV02] M. Bowling and M. Veloso. “Multiagent learning using a variable learning rate”. en. In: Artif.
Intell. 136.2 (Apr. 2002), pp. 215–250. url: http://dx.doi.org/10.1016/S0004-3702(02)
00121-2.
[BXS20] H. Bharadhwaj, K. Xie, and F. Shkurti. “Model-Predictive Control via Cross-Entropy and
Gradient-Based Optimization”. en. In: Learning for Dynamics and Control. PMLR, July 2020,
pp. 277–286. url: https://proceedings.mlr.press/v120/bharadhwaj20a.html.
[CA13] E. F. Camacho and C. B. Alba. Model predictive control. Springer, 2013.
[Cao+24] Y. Cao, H. Zhao, Y. Cheng, T. Shu, G. Liu, G. Liang, J. Zhao, and Y. Li. “Survey on large
language model-enhanced reinforcement learning: Concept, taxonomy, and methods”. In: IEEE
Transactions on Neural Networks and Learning Systems (2024). url: http://arxiv.org/abs/
2404.00282.
[Cao+25] S. Cao et al. SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning.
2025. url: https://novasky-ai.notion.site/skyrl-v0.
[Car+23] W. C. Carvalho, A. Saraiva, A. Filos, A. Lampinen, L. Matthey, R. L. Lewis, H. Lee, S. Singh, D.
Jimenez Rezende, and D. Zoran. “Combining Behaviors with the Successor Features Keyboard”.
In: NIPS. Vol. 36. 2023, pp. 9956–9983. url: https://proceedings.neurips.cc/paper_
files / paper / 2023 / hash / 1f69928210578f4cf5b538a8c8806798 - Abstract - Conference .
html.
[Car+24] W. Carvalho, M. S. Tomov, W. de Cothi, C. Barry, and S. J. Gershman. “Predictive rep-
resentations: building blocks of intelligence”. In: Neural Comput. (Feb. 2024). url: https:
//gershmanlab.com/pubs/Carvalho24.pdf.
[Cas11] P. S. Castro. “On planning, prediction and knowledge transfer in Fully and Partially Observable
Markov Decision Processes”. en. PhD thesis. McGill, 2011. url: https://www.proquest.com/
openview/d35984acba38c072359f8a8d5102c777/1?pq-origsite=gscholar&cbl=18750.
[Cas20] P. S. Castro. “Scalable methods for computing state similarity in deterministic Markov Decision
Processes”. In: AAAI. 2020.
[Cas+21] P. S. Castro, T. Kastner, P. Panangaden, and M. Rowland. “MICo: Improved representations
via sampling-based state similarity for Markov decision processes”. In: NIPS. Nov. 2021. url:
https://openreview.net/pdf?id=wFp6kmQELgu.
[Cas+23] P. S. Castro, T. Kastner, P. Panangaden, and M. Rowland. “A kernel perspective on behavioural
metrics for Markov decision processes”. In: TMLR abs/2310.19804 (Oct. 2023). url: https:
//openreview.net/pdf?id=nHfPXl1ly7.
[Cat+23] E. Catt, J. Grau-Moya, M. Hutter, M. Aitchison, T. Genewein, G. Delétang, K. Li, and J.
Veness. “Self-Predictive Universal AI”. In: NIPS. Vol. 36. 2023, pp. 27181–27198. url: https://
proceedings.neurips.cc/paper_files/paper/2023/hash/56a225639da77e8f7c0409f6d5ba996b-
Abstract-Conference.html.
[Cet+24] E. Cetin, A. Tirinzoni, M. Pirotta, A. Lazaric, Y. Ollivier, and A. Touati. “Simple ingredients
for offline reinforcement learning”. In: arXiv [cs.LG] (Mar. 2024). url: http://arxiv.org/
abs/2403.13097.
[CH20] X. Chen and K. He. “Exploring simple Siamese representation learning”. In: arXiv [cs.CV] (Nov.
2020). url: http://arxiv.org/abs/2011.10566.
[Che+20] X. Chen, C. Wang, Z. Zhou, and K. W. Ross. “Randomized Ensembled Double Q-Learning:
Learning Fast Without a Model”. In: ICLR. Oct. 2020. url: https://openreview.net/pdf?
id=AY8zfZm0tDd.
[Che+21a] C. Chen, Y.-F. Wu, J. Yoon, and S. Ahn. “TransDreamer: Reinforcement Learning with
Transformer World Models”. In: Deep RL Workshop NeurIPS. 2021. url: http://arxiv.org/
abs/2202.09481.
[Che+21b] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and
I. Mordatch. “Decision Transformer: Reinforcement Learning via Sequence Modeling”. In: arXiv
[cs.LG] (June 2021). url: http://arxiv.org/abs/2106.01345.
[Che+24a] F. Che, C. Xiao, J. Mei, B. Dai, R. Gummadi, O. A. Ramirez, C. K. Harris, A. R. Mahmood, and
D. Schuurmans. “Target networks and over-parameterization stabilize off-policy bootstrapping
with function approximation”. In: ICML. May 2024. url: http://arxiv.org/abs/2405.21043.
[Che+24b] J. Chen, B. Ganguly, Y. Xu, Y. Mei, T. Lan, and V. Aggarwal. “Deep Generative Models for
Offline Policy Learning: Tutorial, Survey, and Perspectives on Future Directions”. In: TMLR
(Feb. 2024). url: https://openreview.net/forum?id=Mm2cMDl9r5.
[Che+24c] J. Chen, B. Ganguly, Y. Xu, Y. Mei, T. Lan, and V. Aggarwal. “Deep Generative Models for
Offline Policy Learning: Tutorial, Survey, and Perspectives on Future Directions”. In: TMLR
(Feb. 2024). url: https://openreview.net/forum?id=Mm2cMDl9r5.
[Che+24d] W. Chen, O. Mees, A. Kumar, and S. Levine. “Vision-language models provide promptable
representations for reinforcement learning”. In: arXiv [cs.LG] (Feb. 2024). url: http://arxiv.
org/abs/2402.02651.
[Chr19] P. Christodoulou. “Soft Actor-Critic for discrete action settings”. In: arXiv [cs.LG] (Oct. 2019).
url: http://arxiv.org/abs/1910.07207.
[Chu+18] K. Chua, R. Calandra, R. McAllister, and S. Levine. “Deep Reinforcement Learning in a Handful
of Trials using Probabilistic Dynamics Models”. In: NIPS. 2018. url: http://arxiv.org/abs/
1805.12114.
[Chu+25] T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, S. Levine, and Y. Ma. “SFT Memorizes, RL Generalizes:
A Comparative Study of Foundation Model Post-training”. In: The Second Conference on
Parsimony and Learning. Mar. 2025. url: https://openreview.net/forum?id=d3E3LWmTar.
[CL11] O. Chapelle and L. Li. “An empirical evaluation of Thompson sampling”. In: NIPS. 2011.
[CMS07] B. Colson, P. Marcotte, and G. Savard. “An overview of bilevel optimization”. en. In: Ann.
Oper. Res. 153.1 (Sept. 2007), pp. 235–256. url: https://link.springer.com/article/10.
1007/s10479-007-0176-2.
[Cob+19] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman. “Quantifying Generalization in
Reinforcement Learning”. en. In: ICML. May 2019, pp. 1282–1289. url: https://proceedings.
mlr.press/v97/cobbe19a.html.
[Col+22] C. Colas, T. Karch, O. Sigaud, and P.-Y. Oudeyer. “Autotelic agents with intrinsically motivated
goal-conditioned reinforcement learning: A short survey”. en. In: JAIR 74 (July 2022), pp. 1159–
1199. url: https://www.jair.org/index.php/jair/article/view/13554.
[CP19] Y. K. Cheung and G. Piliouras. “Vortices instead of equilibria in MinMax optimization: Chaos
and butterfly effects of online learning in zero-sum games”. In: COLT. May 2019. url: https:
//arxiv.org/abs/1905.08396.
[CP20] N. Chopin and O. Papaspiliopoulos. An Introduction to Sequential Monte Carlo. en. 1st ed.
Springer, 2020. url: https://www.amazon.com/Introduction-Sequential-Monte-Springer-
Statistics/dp/3030478440.
[CS04] I. Csiszár and P. C. Shields. “Information theory and statistics: A tutorial”. In: (2004).
[Csi67] I. Csiszar. “Information-Type Measures of Difference of Probability Distributions and Indirect
Observations”. In: Studia Scientiarum Mathematicarum Hungarica 2 (1967), pp. 299–318.
[CSLZ23] W. C. Cheung, D. Simchi-Levi, and R. Zhu. “Nonstationary reinforcement learning: The
blessing of (more) optimism”. en. In: Manage. Sci. 69.10 (Oct. 2023), pp. 5722–5739. url:
https://pubsonline.informs.org/doi/abs/10.1287/mnsc.2023.4704.
[Dab+17] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos. “Distributional reinforcement
learning with quantile regression”. In: arXiv [cs.AI] (Oct. 2017). url: http://arxiv.org/abs/
1710.10044.
[Dab+18] W. Dabney, G. Ostrovski, D. Silver, and R. Munos. “Implicit quantile networks for distributional
reinforcement learning”. In: arXiv [cs.LG] (June 2018). url: http://arxiv.org/abs/1806.
06923.
[Dai+22] X. Dai et al. “CTRL: Closed-Loop Transcription to an LDR via Minimaxing Rate Reduction”.
en. In: Entropy 24.4 (Mar. 2022). url: http://dx.doi.org/10.3390/e24040456.
[Dai+24] N. Dainese, M. Merler, M. Alakuijala, and P. Marttinen. “Generating Code World Models
with large language models guided by Monte Carlo Tree Search”. In: NIPS. May 2024. url:
https://arxiv.org/abs/2405.15383.
[Dan+16] C. Daniel, H. van Hoof, J. Peters, and G. Neumann. “Probabilistic inference for determining
options in reinforcement learning”. en. In: Mach. Learn. 104.2-3 (Sept. 2016), pp. 337–357. url:
https://link.springer.com/article/10.1007/s10994-016-5580-x.
[Day93] P. Dayan. “Improving generalization for temporal difference learning: The successor representa-
tion”. en. In: Neural Comput. 5.4 (July 1993), pp. 613–624. url: https://ieeexplore.ieee.
org/abstract/document/6795455.
[Dee24] DeepSeek-AI. “DeepSeek-V3 Technical Report”. In: arXiv [cs.CL] (Dec. 2024). url: http:
//arxiv.org/abs/2412.19437.
[Dee25] G. DeepMind. AlphaEvolve: A Gemini-powered coding agent for designing advanced algo-
rithms. en. 2025. url: https://deepmind.google/discover/blog/alphaevolve-a-gemini-
powered-coding-agent-for-designing-advanced-algorithms/.
[Dee25] DeepSeek-AI. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement
Learning”. In: arXiv [cs.CL] (Jan. 2025). url: http://arxiv.org/abs/2501.12948.
[DFR15] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. “Gaussian Processes for Data-Efficient
Learning in Robotics and Control”. en. In: IEEE PAMI 37.2 (2015), pp. 408–423. url: http:
//dx.doi.org/10.1109/TPAMI.2013.218.
[DH92] P. Dayan and G. E. Hinton. “Feudal Reinforcement Learning”. In: NIPS 5 (1992). url: https://
proceedings.neurips.cc/paper_files/paper/1992/file/d14220ee66aeec73c49038385428ec4c-
Paper.pdf.
[Die00] T. G. Dietterich. “Hierarchical reinforcement learning with the MAXQ value function decompo-
sition”. en. In: JAIR 13 (Nov. 2000), pp. 227–303. url: https://www.jair.org/index.php/
jair/article/view/10266.
[Die+07] M. Diehl, H. G. Bock, H. Diedam, and P.-B. Wieber. “Fast Direct Multiple Shooting Algo-
rithms for Optimal Robot Control”. In: Lecture Notes in Control and Inform. Sci. 340 (2007).
url: https://www.researchgate.net/publication/29603798_Fast_Direct_Multiple_
Shooting_Algorithms_for_Optimal_Robot_Control.
[DMKM22] G. Duran-Martin, A. Kara, and K. Murphy. “Efficient Online Bayesian Inference for Neural
Bandits”. In: AISTATS. 2022. url: http://arxiv.org/abs/2112.00195.
[D’O+22] P. D’Oro, M. Schwarzer, E. Nikishin, P.-L. Bacon, M. G. Bellemare, and A. Courville. “Sample-
Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier”. In: Deep Reinforcement
Learning Workshop NeurIPS 2022. Dec. 2022. url: https : / / openreview . net / pdf ? id =
4GBGwVIEYJ.
[DOB21] W. Dabney, G. Ostrovski, and A. Barreto. “Temporally-Extended epsilon-Greedy Exploration”.
In: ICLR. 2021. url: https://openreview.net/pdf?id=ONBPHFZ7zG4.
[DPA23] F. Deng, J. Park, and S. Ahn. “Facing off world model backbones: RNNs, Transformers, and S4”.
In: NIPS abs/2307.02064 (July 2023), pp. 72904–72930. url: https://proceedings.neurips.
cc / paper _ files / paper / 2023 / hash / e6c65eb9b56719c1aa45ff73874de317 - Abstract -
Conference.html.
[DR11] M. P. Deisenroth and C. E. Rasmussen. “PILCO: A Model-Based and Data-Efficient Approach
to Policy Search”. In: ICML. 2011. url: http://www.icml-2011.org/papers/323_icmlpaper.
pdf.
[Du+21] C. Du, Z. Gao, S. Yuan, L. Gao, Z. Li, Y. Zeng, X. Zhu, J. Xu, K. Gai, and K.-C. Lee.
“Exploration in Online Advertising Systems with Deep Uncertainty-Aware Learning”. In: KDD.
KDD ’21. Association for Computing Machinery, 2021, pp. 2792–2801. url: https://doi.org/
10.1145/3447548.3467089.
[Duf02] M. Duff. “Optimal Learning: Computational procedures for Bayes-adaptive Markov decision
processes”. PhD thesis. U. Mass. Dept. Comp. Sci., 2002. url: http://envy.cs.umass.edu/
People/duff/diss.html.
[DVRZ22] S. Dong, B. Van Roy, and Z. Zhou. “Simple Agent, Complex Environment: Efficient Re-
inforcement Learning with Agent States”. In: J. Mach. Learn. Res. (2022). url: https :
//www.jmlr.org/papers/v23/21-0773.html.
[DWS12] T. Degris, M. White, and R. S. Sutton. “Off-Policy Actor-Critic”. In: ICML. 2012. url: http:
//arxiv.org/abs/1205.4839.
[Ebe+24] M. Eberhardinger, J. Goodman, A. Dockhorn, D. Perez-Liebana, R. D. Gaina, D. Cakmak, S.
Maghsudi, and S. Lucas. “From code to play: Benchmarking program search for games using large
language models”. In: arXiv [cs.AI] (Dec. 2024). url: http://arxiv.org/abs/2412.04057.
[Eco+19] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. “Go-Explore: a New Approach
for Hard-Exploration Problems”. In: (2019). arXiv: 1901.10995 [cs.LG]. url: http://arxiv.
org/abs/1901.10995.
[Eco+21] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. “First return, then explore”.
en. In: Nature 590.7847 (Feb. 2021), pp. 580–586. url: https://www.nature.com/articles/
s41586-020-03157-9.
[EGW05] D Ernst, P Geurts, and L Wehenkel. “Tree-based batch mode reinforcement learning”. In:
J. Mach. Learn. Res. 6.18 (Dec. 2005), pp. 503–556. url: https://jmlr.org/papers/v6/
ernst05a.html.
[Emm+21] S. Emmons, B. Eysenbach, I. Kostrikov, and S. Levine. “RvS: What is essential for offline RL
via Supervised Learning?” In: arXiv [cs.LG] (Dec. 2021). url: http://arxiv.org/abs/2112.
10751.
[ESL21] B. Eysenbach, R. Salakhutdinov, and S. Levine. “C-Learning: Learning to Achieve Goals via
Recursive Classification”. In: ICLR. 2021. url: https://openreview.net/pdf?id=tc5qisoB-
C.
[Esp+18] L. Espeholt et al. “IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-
Learner Architectures”. en. In: ICML. PMLR, July 2018, pp. 1407–1416. url: https : / /
proceedings.mlr.press/v80/espeholt18a.html.
[Eys+20] B. Eysenbach, X. Geng, S. Levine, and R. Salakhutdinov. “Rewriting History with Inverse RL:
Hindsight Inference for Policy Improvement”. In: NIPS. Feb. 2020.
[Eys+21] B. Eysenbach, A. Khazatsky, S. Levine, and R. Salakhutdinov. “Mismatched No More: Joint
Model-Policy Optimization for Model-Based RL”. In: (2021). arXiv: 2110.02758 [cs.LG]. url:
http://arxiv.org/abs/2110.02758.
[Eys+22] B. Eysenbach, A. Khazatsky, S. Levine, and R. Salakhutdinov. “Mismatched No More: Joint
Model-Policy Optimization for Model-Based RL”. In: NIPS. 2022.
[Far+18] G. Farquhar, T. Rocktäschel, M. Igl, and S. Whiteson. “TreeQN and ATreeC: Differentiable
Tree-Structured Models for Deep Reinforcement Learning”. In: ICLR. Feb. 2018. url: https:
//openreview.net/pdf?id=H1dh6Ax0Z.
[Far+23] J. Farebrother, J. Greaves, R. Agarwal, C. Le Lan, R. Goroshin, P. S. Castro, and M. G.
Bellemare. “Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks”. In:
ICLR. 2023. url: https://openreview.net/pdf?id=oGDKSt9JrZi.
[Far+24] J. Farebrother et al. “Stop regressing: Training value functions via classification for scalable
deep RL”. In: arXiv [cs.LG] (Mar. 2024). url: http://arxiv.org/abs/2403.03950.
[Far+25] J. Farebrother, M. Pirotta, A. Tirinzoni, R. Munos, A. Lazaric, and A. Touati. “Temporal
Difference Flows”. In: ICML. Mar. 2025. url: https://arxiv.org/abs/2503.09817.
[FC24] J. Farebrother and P. S. Castro. “CALE: Continuous Arcade Learning Environment”. In: NIPS.
Oct. 2024. url: https://arxiv.org/abs/2410.23810.
[Fen+25] S. Feng, X. Kong, S. Ma, A. Zhang, D. Yin, C. Wang, R. Pang, and Y. Yang. “Step-by-Step
Reasoning for Math Problems via Twisted Sequential Monte Carlo”. In: ICLR. 2025. url:
https://openreview.net/forum?id=Ze4aPP0tIn.
[FG21] S. Fujimoto and S. s. Gu. “A Minimalist Approach to Offline Reinforcement Learning”. In:
NIPS. Vol. 34. Dec. 2021, pp. 20132–20145. url: https://proceedings.neurips.cc/paper_
files/paper/2021/file/a8166da05c5a094f7dc03724b41886e5-Paper.pdf.
[FHM18] S. Fujimoto, H. van Hoof, and D. Meger. “Addressing Function Approximation Error in Actor-
Critic Methods”. In: ICLR. 2018. url: http://arxiv.org/abs/1802.09477.
[FL+18] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau. “An Introduction
to Deep Reinforcement Learning”. In: Foundations and Trends in Machine Learning 11.3 (2018).
url: http://arxiv.org/abs/1811.12560.
[FLA16] C. Finn, S. Levine, and P. Abbeel. “Guided Cost Learning: Deep Inverse Optimal Control via
Policy Optimization”. In: ICML. 2016, pp. 49–58.
[FLL18] J. Fu, K. Luo, and S. Levine. “Learning Robust Rewards with Adverserial Inverse Reinforcement
Learning”. In: ICLR. 2018.
[FMR25] D. J. Foster, Z. Mhammedi, and D. Rohatgi. “Is a Good Foundation Necessary for Efficient
Reinforcement Learning? The Computational Role of the Base Model in Exploration”. In: (Mar.
2025). url: http://arxiv.org/abs/2503.07453.
[For+18] M. Fortunato et al. “Noisy Networks for Exploration”. In: ICLR. 2018. url: http://arxiv.
org/abs/1706.10295.
[FPP04] N. Ferns, P. Panangaden, and D. Precup. “Metrics for finite Markov decision processes”. en. In:
UAI. 2004. url: https://dl.acm.org/doi/10.5555/1036843.1036863.
[FR23] D. J. Foster and A. Rakhlin. “Foundations of reinforcement learning and interactive decision
making”. In: arXiv [cs.LG] (Dec. 2023). url: http://arxiv.org/abs/2312.16730.
[Fra+24] B. Frauenknecht, A. Eisele, D. Subhasish, F. Solowjow, and S. Trimpe. “Trust the Model Where
It Trusts Itself - Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption”. In:
ICML. June 2024. url: https://openreview.net/pdf?id=N0ntTjTfHb.
[Fre+24] B. Freed, T. Wei, R. Calandra, J. Schneider, and H. Choset. “Unifying Model-Based and
Model-Free Reinforcement Learning with Equivalent Policy Sets”. In: RL Conference. 2024.
url: https://rlj.cs.umass.edu/2024/papers/RLJ_RLC_2024_37.pdf.
[Fri03] K. Friston. “Learning and inference in the brain”. en. In: Neural Netw. 16.9 (2003), pp. 1325–1352.
url: http://dx.doi.org/10.1016/j.neunet.2003.06.005.
[Fri09] K. Friston. “The free-energy principle: a rough guide to the brain?” en. In: Trends Cogn. Sci.
13.7 (2009), pp. 293–301. url: http://dx.doi.org/10.1016/j.tics.2009.04.005.
[FS+19] H Francis Song et al. “V-MPO: On-Policy Maximum a Posteriori Policy Optimization for
Discrete and Continuous Control”. In: arXiv [cs.AI] (Sept. 2019). url: http://arxiv.org/
abs/1909.12238.
[FS25] G. Faria and N. A. Smith. “Sample, don’t search: Rethinking test-time alignment for language
models”. In: arXiv [cs.CL] (Apr. 2025). url: http://arxiv.org/abs/2504.03790.
[FSW23] M. Fellows, M. J. A. Smith, and S. Whiteson. “Why Target Networks Stabilise Temporal Differ-
ence Methods”. en. In: ICML. PMLR, July 2023, pp. 9886–9909. url: https://proceedings.
mlr.press/v202/fellows23a.html.
[Fu15] M. Fu, ed. Handbook of Simulation Optimization. 1st ed. Springer-Verlag New York, 2015. url:
http://www.springer.com/us/book/9781493913831.
[Fu+20] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: Datasets for Deep Data-Driven
Reinforcement Learning. arXiv:2004.07219. 2020.
[Fuj+19] S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau. “Benchmarking batch deep reinforcement
learning algorithms”. In: Deep RL Workshop NeurIPS. Oct. 2019. url: https://arxiv.org/
abs/1910.01708.
[Fur+21] H. Furuta, T. Kozuno, T. Matsushima, Y. Matsuo, and S. S. Gu. “Co-Adaptation of Algorithmic
and Implementational Innovations in Inference-based Deep Reinforcement Learning”. In: NIPS.
Mar. 2021. url: http://arxiv.org/abs/2103.17258.
[Gal+24] M. Gallici, M. Fellows, B. Ellis, B. Pou, I. Masmitja, J. N. Foerster, and M. Martin. “Simplifying
deep temporal difference learning”. In: ICML. July 2024.
[Gan+25a] K. Gandhi, A. Chakravarthy, A. Singh, N. Lile, and N. D. Goodman. “Cognitive behaviors that
enable self-improving reasoners, or, four habits of highly effective STaRs”. In: arXiv [cs.CL]
(Mar. 2025). url: http://arxiv.org/abs/2503.01307.
[Gan+25b] K. Gandhi, M. Y. Li, L. Goodyear, L. Li, A. Bhaskar, M. Zaman, and N. D. Goodman.
“BoxingGym: Benchmarking progress in automated experimental design and model discovery”.
In: arXiv [cs.LG] (Jan. 2025). url: http://arxiv.org/abs/2501.01540.
[Gar23] R. Garnett. Bayesian Optimization. Cambridge University Press, 2023. url: https : / /
bayesoptbook.com/.
[Gar+24] Q. Garrido, M. Assran, N. Ballas, A. Bardes, L. Najman, and Y. LeCun. “Learning and
leveraging world models in visual representation learning”. In: arXiv [cs.CV] (Mar. 2024). url:
http://arxiv.org/abs/2403.00504.
[GBS22] C. Grimm, A. Barreto, and S. Singh. “Approximate Value Equivalence”. In: NIPS. Oct. 2022.
url: https://openreview.net/pdf?id=S2Awu3Zn04v.
[GD22] S. Gronauer and K. Diepold. “Multi-agent deep reinforcement learning: a survey”. en. In: Artif.
Intell. Rev. 55.2 (Feb. 2022), pp. 895–943. url: https://dl.acm.org/doi/10.1007/s10462-
021-09996-w.
[GDG03] R. Givan, T. Dean, and M. Greig. “Equivalence notions and model minimization in Markov
decision processes”. en. In: Artif. Intell. 147.1-2 (July 2003), pp. 163–223. url: https://www.
sciencedirect.com/science/article/pii/S0004370202003764.
[GDWF22] J. Grudzien, C. A. S. De Witt, and J. Foerster. “Mirror Learning: A Unifying Framework of Policy
Optimisation”. In: ICML. Vol. 162. Proceedings of Machine Learning Research. PMLR, 2022,
pp. 7825–7844. url: https://proceedings.mlr.press/v162/grudzien22a/grudzien22a.
pdf.
[Geh+24] J. Gehring, K. Zheng, J. Copet, V. Mella, T. Cohen, and G. Synnaeve. “RLEF: Grounding code
LLMs in execution feedback with reinforcement learning”. In: arXiv [cs.CL] (Oct. 2024). url:
http://arxiv.org/abs/2410.02089.
[Ger18] S. J. Gershman. “Deconstructing the human algorithms for exploration”. en. In: Cognition 173
(Apr. 2018), pp. 34–42. url: https://www.sciencedirect.com/science/article/abs/pii/
S0010027717303359.
[Ger19] S. J. Gershman. “What does the free energy principle tell us about the brain?” In: Neurons,
Behavior, Data Analysis, and Theory (2019). url: http://arxiv.org/abs/1901.07945.
[GFZ24] J. Grigsby, L. Fan, and Y. Zhu. “AMAGO: Scalable in-context Reinforcement Learning for
adaptive agents”. In: ICLR. 2024.
[GGN22] S. K. S. Ghasemipour, S. S. Gu, and O. Nachum. “Why So Pessimistic? Estimating Uncertainties
for Offline RL through Ensembles, and Why Their Independence Matters”. In: NIPS. Oct. 2022.
url: https://openreview.net/pdf?id=z64kN1h1-rR.
[Gha+15] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar. “Bayesian Reinforcement Learning: A Survey”. en. In: Found. Trends® Mach. Learn. 8.5-6 (Nov. 2015), pp. 359–483. url:
https://arxiv.org/abs/1609.04436.
[Ghi+20] S. Ghiassian, A. Patterson, S. Garg, D. Gupta, A. White, and M. White. “Gradient temporal-
difference learning with Regularized Corrections”. In: ICML. July 2020.
[Gho+21] D. Ghosh, J. Rahme, A. Kumar, A. Zhang, R. P. Adams, and S. Levine. “Why Generalization
in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability”. In: NIPS. Vol. 34.
Dec. 2021, pp. 25502–25515. url: https://proceedings.neurips.cc/paper_files/paper/
2021/file/d5ff135377d39f1de7372c95c74dd962-Paper.pdf.
[Ghu+22] R. Ghugare, H. Bharadhwaj, B. Eysenbach, S. Levine, and R. Salakhutdinov. “Simplifying Model-
based RL: Learning Representations, Latent-space Models, and Policies with One Objective”.
In: ICLR. Sept. 2022. url: https://openreview.net/forum?id=MQcmfgRxf7a.
[Ghu+24] R. Ghugare, M. Geist, G. Berseth, and B. Eysenbach. “Closing the gap between TD learning
and supervised learning – A generalisation Point of View”. In: ICML. Jan. 2024. url: https:
//arxiv.org/abs/2401.11237.
[Git89] J. Gittins. Multi-armed Bandit Allocation Indices. Wiley, 1989.
[GK19] L. Graesser and W. L. Keng. Foundations of Deep Reinforcement Learning: Theory and Practice
in Python. en. 1 edition. Addison-Wesley Professional, 2019. url: https://www.amazon.com/
Deep-Reinforcement-Learning-Python-Hands/dp/0135172381.
[GM+24] J. Grau-Moya et al. “Learning Universal Predictors”. In: arXiv [cs.LG] (Jan. 2024). url:
https://arxiv.org/abs/2401.14953.
[Gor95] G. J. Gordon. “Stable Function Approximation in Dynamic Programming”. In: ICML. 1995,
pp. 261–268.
[Gra+10] T. Graepel, J. Quinonero-Candela, T. Borchert, and R. Herbrich. “Web-Scale Bayesian Click-
Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine”.
In: ICML. 2010.
[Gri+20a] J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In:
NIPS. June 2020. url: http://arxiv.org/abs/2006.07733.
[Gri+20b] C. Grimm, A. Barreto, S. Singh, and D. Silver. “The Value Equivalence Principle for Model-Based
Reinforcement Learning”. In: NIPS 33 (2020), pp. 5541–5552. url: https://proceedings.
neurips.cc/paper_files/paper/2020/file/3bb585ea00014b0e3ebe4c6dd165a358-Paper.
pdf.
[Gua+23] L. Guan, K. Valmeekam, S. Sreedharan, and S. Kambhampati. “Leveraging Pre-trained Large
Language Models to Construct and Utilize World Models for Model-based Task Planning”. In:
NIPS. May 2023. url: http://arxiv.org/abs/2305.14909.
[Gul+20] C. Gulcehre et al. RL Unplugged: Benchmarks for Offline Reinforcement Learning. arXiv:2006.13888.
2020.
[Guo+22] Z. D. Guo et al. “BYOL-Explore: Exploration by Bootstrapped Prediction”. In: NIPS. Oct.
2022. url: https://openreview.net/pdf?id=qHGCH75usg.
[GZG19] S. K. S. Ghasemipour, R. S. Zemel, and S. Gu. “A Divergence Minimization Perspective on
Imitation Learning Methods”. In: CORL. 2019, pp. 1259–1277.
[Haa+18a] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. “Soft Actor-Critic: Off-Policy Maximum
Entropy Deep Reinforcement Learning with a Stochastic Actor”. In: ICML. 2018. url: http:
//arxiv.org/abs/1801.01290.
[Haa+18b] T. Haarnoja et al. “Soft Actor-Critic Algorithms and Applications”. In: (2018). arXiv: 1812.
05905 [cs.LG]. url: http://arxiv.org/abs/1812.05905.
[Haf+19] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. “Learning
Latent Dynamics for Planning from Pixels”. In: ICML. 2019. url: http://arxiv.org/abs/
1811.04551.
[Haf+20] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. “Dream to Control: Learning Behaviors by
Latent Imagination”. In: ICLR. 2020. url: https://openreview.net/forum?id=S1lOTC4tDS.
[Haf+21] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. “Mastering Atari with discrete world models”.
In: ICLR. 2021.
[Haf+23] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. “Mastering Diverse Domains through World
Models”. In: arXiv [cs.AI] (Jan. 2023). url: http://arxiv.org/abs/2301.04104.
[Ham+21] J. B. Hamrick, A. L. Friesen, F. Behbahani, A. Guez, F. Viola, S. Witherspoon, T. Anthony, L.
Buesing, P. Veličković, and T. Weber. “On the role of planning in model-based deep reinforcement
learning”. In: ICLR. 2021. url: https://arxiv.org/abs/2011.04021.
[Ham+25] L. Hammond et al. “Multi-Agent Risks from Advanced AI”. In: arXiv [cs.MA] (Feb. 2025). url:
http://arxiv.org/abs/2502.14143.
[Han+19] S. Hansen, W. Dabney, A. Barreto, D. Warde-Farley, T. Van de Wiele, and V. Mnih. “Fast
Task Inference with Variational Intrinsic Successor Features”. In: ICLR. Sept. 2019. url:
https://openreview.net/pdf?id=BJeAHkrYDS.
[Han+23] N. Hansen, Z. Yuan, Y. Ze, T. Mu, A. Rajeswaran, H. Su, H. Xu, and X. Wang. “On Pre-Training
for Visuo-Motor Control: Revisiting a Learning-from-Scratch Baseline”. In: ICML. June 2023.
url: https://openreview.net/pdf?id=dvp30Hrijj.
[Har+16] A. Harutyunyan, M. G. Bellemare, T. Stepleton, and R. Munos. “Q(lambda) with Off-Policy
Corrections”. In: (2016).
[Har+18] J. Harb, P.-L. Bacon, M. Klissarov, and D. Precup. “When waiting is not an option: Learning
options with a deliberation cost”. en. In: AAAI 32.1 (Apr. 2018). url: https://ojs.aaai.
org/index.php/AAAI/article/view/11831.
[Has10] H. van Hasselt. “Double Q-learning”. In: NIPS. Ed. by J. D. Lafferty, C. K. I. Williams, J.
Shawe-Taylor, R. S. Zemel, and A. Culotta. Curran Associates, Inc., 2010, pp. 2613–2621. url:
http://papers.nips.cc/paper/3964-double-q-learning.pdf.
[Has+16] H. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver. “Learning values across many
orders of magnitude”. In: NIPS. Feb. 2016.
[HBZ04] E. A. Hansen, D. S. Bernstein, and S. Zilberstein. “Dynamic programming for partially observable
stochastic games”. en. In: Proceedings of the 19th national conference on Artifical intelligence.
AAAI’04. AAAI Press, July 2004, pp. 709–715. url: https://dl.acm.org/doi/10.5555/
1597148.1597262.
[HDCM15] A. Hallak, D. Di Castro, and S. Mannor. “Contextual Markov decision processes”. In: arXiv
[stat.ML] (Feb. 2015). url: http://arxiv.org/abs/1502.02259.
[HE16] J. Ho and S. Ermon. “Generative Adversarial Imitation Learning”. In: NIPS. 2016, pp. 4565–
4573.
[Hee+15] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. “Learning Continuous Control
Policies by Stochastic Value Gradients”. In: NIPS 28 (2015). url: https://proceedings.
neurips.cc/paper/2015/hash/148510031349642de5ca0c544f31b2ef-Abstract.html.
[Hes+18] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot,
M. Azar, and D. Silver. “Rainbow: Combining Improvements in Deep Reinforcement Learning”.
In: AAAI. 2018. url: http://arxiv.org/abs/1710.02298.
[Hes+19] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt. “Multi-task
deep reinforcement learning with PopArt”. In: AAAI. 2019.
[HGS16] H. van Hasselt, A. Guez, and D. Silver. “Deep Reinforcement Learning with Double Q-Learning”.
In: AAAI. AAAI’16. AAAI Press, 2016, pp. 2094–2100. url: http://dl.acm.org/citation.
cfm?id=3016100.3016191.
[HHA19] H. van Hasselt, M. Hessel, and J. Aslanides. “When to use parametric models in reinforcement
learning?” In: NIPS. 2019. url: http://arxiv.org/abs/1906.05243.
[HL04] D. R. Hunter and K. Lange. “A Tutorial on MM Algorithms”. In: The American Statistician 58
(2004), pp. 30–37.
[HL20] O. van der Himst and P. Lanillos. “Deep active inference for partially observable MDPs”. In:
ECML workshop on active inference. Sept. 2020. url: https://arxiv.org/abs/2009.03622.
[HLKT19] P. Hernandez-Leal, B. Kartal, and M. E. Taylor. “A survey and critique of multiagent deep
reinforcement learning”. In: Auton. Agent. Multi. Agent. Syst. (2019). url: http://link.
springer.com/10.1007/s10458-019-09421-1.
[HLS15] J. Heinrich, M. Lanctot, and D. Silver. “Fictitious Self-Play in Extensive-Form Games”. In:
ICML. 2015.
[HM20] M. Hosseini and A. Maida. “Hierarchical Predictive Coding Models in a Deep-Learning Frame-
work”. In: (2020). arXiv: 2005.03230 [cs.CV]. url: http://arxiv.org/abs/2005.03230.
[HMC00] S. Hart and A. Mas-Colell. “A simple adaptive procedure leading to correlated equilibrium”. en.
In: Econometrica 68.5 (Sept. 2000), pp. 1127–1150. url: https://www.jstor.org/stable/
2999445.
[HMDH20] C. C.-Y. Hsu, C. Mendler-Dünner, and M. Hardt. “Revisiting design choices in Proximal Policy
Optimization”. In: arXiv [cs.LG] (Sept. 2020). url: http://arxiv.org/abs/2009.10897.
[Hof+23] M. D. Hoffman, D. Phan, D. Dohan, S. Douglas, T. A. Le, A. Parisi, P. Sountsov, C. Sutton,
S. Vikram, and R. A. Saurous. “Training Chain-of-Thought via Latent-Variable Inference”. In:
NIPS. Jan. 2023. url: https://openreview.net/forum?id=7p1tOZ13La.
[Hon+10] A. Honkela, T. Raiko, M. Kuusela, M. Tornio, and J. Karhunen. “Approximate Riemannian
Conjugate Gradient Learning for Fixed-Form Variational Bayes”. In: JMLR 11.Nov (2010),
pp. 3235–3268. url: http://www.jmlr.org/papers/volume11/honkela10a/honkela10a.
pdf.
[Hon+23] M. Hong, H.-T. Wai, Z. Wang, and Z. Yang. “A two-timescale stochastic algorithm framework for
bilevel optimization: Complexity analysis and application to actor-critic”. en. In: SIAM J. Optim.
33.1 (Mar. 2023), pp. 147–180. url: https://epubs.siam.org/doi/10.1137/20M1387341.
[Hou+11] N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel. “Bayesian active learning for classifica-
tion and preference learning”. In: arXiv [stat.ML] (Dec. 2011). url: http://arxiv.org/abs/
1112.5745.
[HQC24] M. Hutter, D. Quarel, and E. Catt. An introduction to universal artificial intelligence. Chapman
and Hall, 2024. url: http://www.hutter1.net/ai/uaibook2.htm.
[HR11] R. Hafner and M. Riedmiller. “Reinforcement learning in feedback control: Challenges and
benchmarks from technical process control”. en. In: Mach. Learn. 84.1-2 (July 2011), pp. 137–169.
url: https://link.springer.com/article/10.1007/s10994-011-5235-x.
[HR17] C. Hoffmann and P. Rostalski. “Linear Optimal Control on Factor Graphs — A Message
Passing Perspective”. In: Intl. Federation of Automatic Control 50.1 (2017), pp. 6314–6319.
url: https://www.sciencedirect.com/science/article/pii/S2405896317313800.
[HS16] J. Heinrich and D. Silver. “Deep reinforcement learning from self-play in imperfect-information
games”. In: arXiv [cs.LG] (Mar. 2016). url: http://arxiv.org/abs/1603.01121.
[HS18] D. Ha and J. Schmidhuber. “World Models”. In: NIPS. 2018. url: http://arxiv.org/abs/
1803.10122.
[HS22] E. Hazan and K. Singh. “Introduction to online control”. In: arXiv [cs.LG] (Nov. 2022). url:
http://arxiv.org/abs/2211.09619.
[HSW22] N. A. Hansen, H. Su, and X. Wang. “Temporal Difference Learning for Model Predictive
Control”. en. In: ICML. June 2022, pp. 8387–8406. url: https://proceedings.mlr.press/
v162/hansen22a.html.
[HSW24] N. Hansen, H. Su, and X. Wang. “TD-MPC2: Scalable, Robust World Models for Continuous
Control”. In: ICLR. 2024. url: http://arxiv.org/abs/2310.16828.
[HT15] J. H. Huggins and J. B. Tenenbaum. “Risk and regret of hierarchical Bayesian learners”. In:
ICML. 2015. url: http://proceedings.mlr.press/v37/hugginsb15.html.
[HTB18] G. Z. Holland, E. J. Talvitie, and M. Bowling. “The effect of planning shape on Dyna-style
planning in high-dimensional state spaces”. In: arXiv [cs.AI] (June 2018). url: http://arxiv.
org/abs/1806.01825.
[Hu+20] Y. Hu, W. Wang, H. Jia, Y. Wang, Y. Chen, J. Hao, F. Wu, and C. Fan. “Learning to Utilize Shap-
ing Rewards: A New Approach of Reward Shaping”. In: NIPS 33 (2020), pp. 15931–15941. url:
https://proceedings.neurips.cc/paper_files/paper/2020/file/b710915795b9e9c02cf10d6d2bdb688c-
Paper.pdf.
[Hu+21] H. Hu, A. Lerer, N. Brown, and J. Foerster. “Learned Belief Search: Efficiently improving
policies in partially observable settings”. In: arXiv [cs.AI] (June 2021). url: http://arxiv.
org/abs/2106.09086.
[Hu+23] A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado.
“GAIA-1: A Generative World Model for Autonomous Driving”. In: arXiv [cs.CV] (Sept. 2023).
url: http://arxiv.org/abs/2309.17080.
[Hu+24] S. Hu, T. Huang, F. Ilhan, S. Tekin, G. Liu, R. Kompella, and L. Liu. A Survey on Large
Language Model-Based Game Agents. 2024. arXiv: 2404.02039 [cs.AI].
[Hua+24] S. Huang, M. Noukhovitch, A. Hosseini, K. Rasul, W. Wang, and L. Tunstall. “The N+
Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization”. In:
First Conference on Language Modeling. Aug. 2024. url: https://openreview.net/pdf?id=
kHO2ZTa8e3.
[Hua+25a] A. Huang, A. Block, Q. Liu, N. Jiang, A. Krishnamurthy, and D. J. Foster. “Is best-of-N the
best of them? Coverage, scaling, and optimality in inference-time alignment”. In: arXiv [cs.AI]
(Mar. 2025). url: http://arxiv.org/abs/2503.21878.
[Hua+25b] A. Huang, W. Zhan, T. Xie, J. D. Lee, W. Sun, A. Krishnamurthy, and D. J. Foster. “Correcting
the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared
Preference Optimization”. In: ICLR. 2025. url: https://openreview.net/forum?id=
hXm0Wu2U9K.
[Hub+21] T. Hubert, J. Schrittwieser, I. Antonoglou, M. Barekatain, S. Schmitt, and D. Silver. “Learning
and planning in complex action spaces”. In: arXiv [cs.LG] (Apr. 2021). url: http://arxiv.
org/abs/2104.06303.
[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions Based On Algorithmic Proba-
bility. en. Springer, 2005. url: http://www.hutter1.net/ai/uaibook.htm.
[Ich+23] B. Ichter et al. “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances”. en. In:
Conference on Robot Learning. PMLR, Mar. 2023, pp. 287–318. url: https://proceedings.
mlr.press/v205/ichter23a.html.
[ID19] S. Ivanov and A. D’yakonov. “Modern Deep Reinforcement Learning algorithms”. In: arXiv
[cs.LG] (June 2019). url: http://arxiv.org/abs/1906.10025.
[IW18] E. Imani and M. White. “Improving Regression Performance with Distributional Losses”. en.
In: ICML. PMLR, July 2018, pp. 2157–2166. url: https://proceedings.mlr.press/v80/
imani18a.html.
[Jad+17] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu.
“Reinforcement Learning with Unsupervised Auxiliary Tasks”. In: ICLR. 2017. url: https:
//openreview.net/forum?id=SJ6yPD5xg.
[Jad+19] M. Jaderberg et al. “Human-level performance in 3D multiplayer games with population-
based reinforcement learning”. en. In: Science 364.6443 (May 2019), pp. 859–865. url: https:
//www.science.org/doi/10.1126/science.aau6249.
[Jae00] H. Jaeger. “Observable operator models for discrete stochastic time series”. en. In: Neural
Comput. 12.6 (June 2000), pp. 1371–1398. url: https://direct.mit.edu/neco/article-
pdf/12/6/1371/814514/089976600300015411.pdf.
[Jan+19a] M. Janner, J. Fu, M. Zhang, and S. Levine. “When to Trust Your Model: Model-Based Policy
Optimization”. In: NIPS. 2019. url: http://arxiv.org/abs/1906.08253.
[Jan+19b] D. Janz, J. Hron, P. Mazur, K. Hofmann, J. M. Hernández-Lobato, and S. Tschiatschek.
“Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning”. In:
NIPS. Vol. 32. 2019. url: https://proceedings.neurips.cc/paper_files/paper/2019/
file/1b113258af3968aaf3969ca67e744ff8-Paper.pdf.
[Jan+22] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine. “Planning with Diffusion for Flexible
Behavior Synthesis”. In: ICML. May 2022. url: http://arxiv.org/abs/2205.09991.
[Jar+23] D. Jarrett, C. Tallec, F. Altché, T. Mesnard, R. Munos, and M. Valko. “Curiosity in Hindsight:
Intrinsic Exploration in Stochastic Environments”. In: ICML. June 2023. url: https://
openreview.net/pdf?id=fIH2G4fnSy.
[JCM24] M. Jones, P. Chang, and K. Murphy. “Bayesian online natural gradient (BONG)”. In: arXiv
(May 2024). url: http://arxiv.org/abs/2405.19681.
[JGP16] E. Jang, S. Gu, and B. Poole. “Categorical Reparameterization with Gumbel-Softmax”. In:
(2016). arXiv: 1611.01144 [stat.ML]. url: http://arxiv.org/abs/1611.01144.
[Ji+25] Y. Ji, J. Li, H. Ye, K. Wu, K. Yao, J. Xu, L. Mo, and M. Zhang. “Test-time compute:
From System-1 thinking to System-2 thinking”. In: arXiv [cs.AI] (Jan. 2025). url: http:
//arxiv.org/abs/2501.02497.
[Jia+15] N. Jiang, A. Kulesza, S. Singh, and R. Lewis. “The Dependence of Effective Planning Horizon
on Model Accuracy”. en. In: Proceedings of the 2015 International Conference on Autonomous
Agents and Multiagent Systems. AAMAS ’15. Richland, SC: International Foundation for
Autonomous Agents and Multiagent Systems, May 2015, pp. 1181–1189. url: https://dl.
acm.org/doi/10.5555/2772879.2773300.
[Jin+22] L. Jing, P. Vincent, Y. LeCun, and Y. Tian. “Understanding Dimensional Collapse in Contrastive
Self-supervised Learning”. In: ICLR. 2022. url: https://openreview.net/forum?id=
YevsQ05DEN7.
[JLL21] M. Janner, Q. Li, and S. Levine. “Offline Reinforcement Learning as One Big Sequence Modeling
Problem”. In: NIPS. June 2021.
[JM70] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier Press, 1970.
[JML20] M. Janner, I. Mordatch, and S. Levine. “Gamma-Models: Generative Temporal Difference
Learning for Infinite-Horizon Prediction”. In: NIPS. Vol. 33. 2020, pp. 1724–1735. url: https://
proceedings.neurips.cc/paper_files/paper/2020/file/12ffb0968f2f56e51a59a6beb37b2859-
Paper.pdf.
[JOA10] T. Jaksch, R. Ortner, and P. Auer. “Near-optimal regret bounds for reinforcement learning”. In:
JMLR 11.51 (2010), pp. 1563–1600. url: https://jmlr.org/papers/v11/jaksch10a.html.
[Jor+24] S. M. Jordan, A. White, B. C. da Silva, M. White, and P. S. Thomas. “Position: Benchmarking
is Limited in Reinforcement Learning Research”. In: ICML. June 2024. url: https://arxiv.
org/abs/2406.16241.
[JSJ94] T. Jaakkola, S. Singh, and M. Jordan. “Reinforcement Learning Algorithm for Partially Ob-
servable Markov Decision Problems”. In: NIPS. 1994.
[KAG19] A. Kirsch, J. van Amersfoort, and Y. Gal. “BatchBALD: Efficient and Diverse Batch Acquisition
for Deep Bayesian Active Learning”. In: NIPS. 2019. url: http://arxiv.org/abs/1906.08158.
[Kai+19] L. Kaiser et al. “Model-based reinforcement learning for Atari”. In: arXiv [cs.LG] (Mar. 2019).
url: http://arxiv.org/abs/1903.00374.
[Kak01] S. M. Kakade. “A Natural Policy Gradient”. In: NIPS. Vol. 14. 2001. url: https://proceedings.
neurips.cc/paper_files/paper/2001/file/4b86abe48d358ecf194c56c69108433e-Paper.
pdf.
[Kal+18] D. Kalashnikov et al. “QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic
Manipulation”. In: CORL. 2018. url: http://arxiv.org/abs/1806.10293.
[Kap+18] S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney. “Recurrent Experience Replay
in Distributed Reinforcement Learning”. In: ICLR. Sept. 2018. url: https://openreview.
net/pdf?id=r1lyTjAqYX.
[Kap+22] S. Kapturowski, V. Campos, R. Jiang, N. Rakicevic, H. van Hasselt, C. Blundell, and A. P.
Badia. “Human-level Atari 200x faster”. In: ICLR. Sept. 2022. url: https://openreview.net/
pdf?id=JtC6yOHRoJJ.
[KB09] G. Konidaris and A. Barto. “Skill Discovery in Continuous Reinforcement Learning Do-
mains using Skill Chaining”. In: Advances in Neural Information Processing Systems 22
(2009). url: https://proceedings.neurips.cc/paper_files/paper/2009/file/
e0cf1f47118daebc5b16269099ad7347-Paper.pdf.
[KD18] S. Kamthe and M. P. Deisenroth. “Data-Efficient Reinforcement Learning with Probabilistic
Model Predictive Control”. In: AISTATS. 2018. url: http://proceedings.mlr.press/v84/
kamthe18a/kamthe18a.pdf.
[Ke+19] L. Ke, S. Choudhury, M. Barnes, W. Sun, G. Lee, and S. Srinivasa. Imitation Learning as
f-Divergence Minimization. arXiv:1905.12888. 2019.
[KF88] D. M. Kilgour and N. M. Fraser. “A taxonomy of all ordinal 2 x 2 games”. en. In: Theory
Decis. 24.2 (Mar. 1988), pp. 99–117. url: https://link.springer.com/article/10.1007/
BF00132457.
[KGO12] H. J. Kappen, V. Gómez, and M. Opper. “Optimal control as a graphical model inference
problem”. In: Mach. Learn. 87.2 (2012), pp. 159–182. url: https://doi.org/10.1007/s10994-
012-5278-7.
[Khe+20] K. Khetarpal, Z. Ahmed, G. Comanici, D. Abel, and D. Precup. “What can I do here? A
Theory of Affordances in Reinforcement Learning”. In: Proceedings of the 37th International
Conference on Machine Learning. Ed. by H. D. Iii and A. Singh. Vol. 119. Proceedings of
Machine Learning Research. PMLR, 2020, pp. 5243–5253. url: https://proceedings.mlr.
press/v119/khetarpal20a.html.
[Kid+20] R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims. “MOReL: Model-Based Offline
Reinforcement Learning”. In: NIPS. Vol. 33. 2020, pp. 21810–21823. url: https://proceedings.
neurips.cc/paper_files/paper/2020/file/f7efa4f864ae9b88d43527f4b14f750f-Paper.
pdf.
[Kir+21] R. Kirk, A. Zhang, E. Grefenstette, and T. Rocktäschel. “A survey of zero-shot generalisation
in deep Reinforcement Learning”. In: JAIR (Nov. 2021). url: http://jair.org/index.php/
jair/article/view/14174.
[KL02] S. Kakade and J. Langford. “Approximately Optimal Approximate Reinforcement Learning”.
In: ICML. ICML ’02. Morgan Kaufmann Publishers Inc., 2002, pp. 267–274. url: http:
//dl.acm.org/citation.cfm?id=645531.656005.
[KL93] E. Kalai and E. Lehrer. “Rational learning leads to Nash equilibrium”. en. In: Econometrica
61.5 (Sept. 1993), p. 1019. url: https://www.jstor.org/stable/2951492.
[KLC98] L. P. Kaelbling, M. Littman, and A. Cassandra. “Planning and acting in Partially Observable
Stochastic Domains”. In: AIJ 101 (1998).
[Kli+24] M. Klissarov, P. D’Oro, S. Sodhani, R. Raileanu, P.-L. Bacon, P. Vincent, A. Zhang, and
M. Henaff. “Motif: Intrinsic motivation from artificial intelligence feedback”. In: ICLR. 2024.
[Kli+25] M. Klissarov, R Devon Hjelm, A. T. Toshev, and B. Mazoure. “On the Modeling Capabilities
of Large Language Models for Sequential Decision Making”. In: ICLR. 2025. url: https:
//openreview.net/forum?id=vodsIF3o7N.
[KLP11] L. P. Kaelbling and T Lozano-Pérez. “Hierarchical task and motion planning in the now”. In:
ICRA. 2011, pp. 1470–1477. url: http://dx.doi.org/10.1109/ICRA.2011.5980391.
[KMN99] M. Kearns, Y. Mansour, and A. Ng. “A Sparse Sampling Algorithm for Near-Optimal Planning
in Large Markov Decision Processes”. In: IJCAI. 1999.
[Kon+24] D. Kong, D. Xu, M. Zhao, B. Pang, J. Xie, A. Lizarraga, Y. Huang, S. Xie, and Y. N. Wu.
“Latent Plan Transformer for trajectory abstraction: Planning as latent space inference”. In:
NIPS. Feb. 2024.
[Kor+22] T. Korbak, H. Elsahar, G. Kruszewski, and M. Dymetman. “On Reinforcement Learning and
Distribution Matching for fine-tuning language models with no catastrophic forgetting”. In:
NIPS. June 2022. url: https://arxiv.org/abs/2206.00761.
[Kov+22] V. Kovařík, M. Schmid, N. Burch, M. Bowling, and V. Lisý. “Rethinking formal models
of partially observable multiagent decision making”. In: Artificial Intelligence (2022). url:
http://arxiv.org/abs/1906.11110.
[Koz+21] T. Kozuno, Y. Tang, M. Rowland, R. Munos, S. Kapturowski, W. Dabney, M. Valko, and
D. Abel. “Revisiting Peng’s Q-lambda for modern reinforcement learning”. In: ICML 139 (Feb.
2021). Ed. by M. Meila and T. Zhang, pp. 5794–5804. url: https://proceedings.mlr.press/
v139/kozuno21a/kozuno21a.pdf.
[KP19] K. Khetarpal and D. Precup. “Learning options with interest functions”. en. In: AAAI 33.01 (July
2019), pp. 9955–9956. url: https://ojs.aaai.org/index.php/AAAI/article/view/5114.
[KPB22] T. Korbak, E. Perez, and C. L. Buckley. “RL with KL penalties is better viewed as Bayesian
inference”. In: EMNLP. May 2022. url: http://arxiv.org/abs/2205.11275.
[KPL19] A. Kumar, X. B. Peng, and S. Levine. “Reward-Conditioned Policies”. In: arXiv [cs.LG] (Dec.
2019). url: http://arxiv.org/abs/1912.13465.
[KS02] M. Kearns and S. Singh. “Near-Optimal Reinforcement Learning in Polynomial Time”. en. In:
MLJ 49.2/3 (Nov. 2002), pp. 209–232. url: https://link.springer.com/article/10.1023/
A:1017984413808.
[KSS23] T. Kneib, A. Silbersdorff, and B. Säfken. “Rage Against the Mean – A Review of Distributional
Regression Approaches”. In: Econometrics and Statistics 26 (Apr. 2023), pp. 99–123. url:
https://www.sciencedirect.com/science/article/pii/S2452306221000824.
[Kuh51] H. W. Kuhn. “A Simplified Two-Person Poker”. en. In: Contributions to the Theory
of Games (AM-24), Volume I. Ed. by H. W. Kuhn and A. W. Tucker. Princeton: Princeton
University Press, Dec. 1951, pp. 97–104. url: https://www.degruyter.com/document/doi/
10.1515/9781400881727-010/html.
[Kum+19] A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine. “Stabilizing Off-Policy Q-Learning via
Bootstrapping Error Reduction”. In: NIPS. Vol. 32. 2019. url: https://proceedings.neurips.
cc/paper_files/paper/2019/file/c2073ffa77b5357a498057413bb09d3a-Paper.pdf.
[Kum+20] A. Kumar, A. Zhou, G. Tucker, and S. Levine. “Conservative Q-Learning for Offline Reinforce-
ment Learning”. In: NIPS. June 2020.
[Kum+23] A. Kumar, R. Agarwal, X. Geng, G. Tucker, and S. Levine. “Offline Q-Learning on Diverse
Multi-Task Data Both Scales And Generalizes”. In: ICLR. 2023. url: http://arxiv.org/abs/
2211.15144.
[Kum+24] S. Kumar, H. J. Jeon, A. Lewandowski, and B. Van Roy. “The Need for a Big World Simulator:
A Scientific Challenge for Continual Learning”. In: Finding the Frame: An RLC Workshop
for Examining Conceptual Frameworks. July 2024. url: https://openreview.net/pdf?id=
10XMwt1nMJ.
[Kur+19] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. “Model-Ensemble Trust-Region
Policy Optimization”. In: ICLR. 2019. url: http://arxiv.org/abs/1802.10592.
[KWW22] M. J. Kochenderfer, T. A. Wheeler, and K. Wray. Algorithms for Decision Making. The MIT
Press, 2022. url: https://github.com/sisl/algorithmsbook/.
[LA21] H. Liu and P. Abbeel. “APS: Active Pretraining with Successor Features”. en. In: ICML. PMLR,
July 2021, pp. 6736–6747. url: https://proceedings.mlr.press/v139/liu21b.html.
[Lai+21] H. Lai, J. Shen, W. Zhang, Y. Huang, X. Zhang, R. Tang, Y. Yu, and Z. Li. “On effective
scheduling of model-based reinforcement learning”. In: NIPS 34 (Nov. 2021). Ed. by M. Ranzato,
A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. W. Vaughan, pp. 3694–3705. url: https://
proceedings.neurips.cc/paper_files/paper/2021/hash/1e4d36177d71bbb3558e43af9577d70e-
Abstract.html.
[Lam+20] N. Lambert, B. Amos, O. Yadan, and R. Calandra. “Objective Mismatch in Model-based
Reinforcement Learning”. In: Conf. on Learning for Dynamics and Control (L4DC). Feb. 2020.
[Lam+24] N. Lambert et al. “TÜLU 3: Pushing frontiers in open language model post-training”. In: arXiv
[cs.CL] (Nov. 2024). url: http://arxiv.org/abs/2411.15124.
[Lam25] N. Lambert. Reinforcement Learning from Human Feedback. 2025. url: https://rlhfbook.
com/book.pdf.
[Lan+09] M. Lanctot, K. Waugh, M. Zinkevich, and M. Bowling. “Monte Carlo Sampling for Regret
Minimization in Extensive Games”. In: NIPS 22 (2009). url: https://proceedings.neurips.
cc/paper_files/paper/2009/file/00411460f7c92d2124a67ea0f4cb5f85-Paper.pdf.
[Lan+17] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Perolat, D. Silver, and T.
Graepel. “A unified game-theoretic approach to multiagent reinforcement learning”. In: NIPS.
Nov. 2017. url: https://arxiv.org/abs/1711.00832.
[Law+18] D. Lawson, G. Tucker, C. A. Naesseth, C. J. Maddison, R. P. Adams, and Y. W. Teh.
“Twisted Variational Sequential Monte Carlo”. In: BDL Workshop. 2018. url: https://
bayesiandeeplearning.org/2018/papers/111.pdf.
[Law+22] D. Lawson, A. Raventos, A. Warrington, and S. Linderman. “SIXO: Smoothing Inference with
Twisted Objectives”. In: Advances in Neural Information Processing Systems. Oct. 2022. url:
https://openreview.net/forum?id=bDyLgfvZ0qJ.
[LBS08] K. Leyton-Brown and Y. Shoham. Essentials of game theory: A concise, multidisciplinary
introduction. Synthesis lectures on artificial intelligence and machine learning. Cham: Springer
International Publishing, 2008. url: https://www.gtessentials.org/.
[Le+20] T. A. Le, A. R. Kosiorek, N. Siddharth, Y. W. Teh, and F. Wood. “Revisiting Reweighted Wake-
Sleep for Models with Stochastic Control Flow”. en. In: UAI. PMLR, Aug. 2020, pp. 1039–1049.
url: https://proceedings.mlr.press/v115/le20a.html.
[LeC22] Y. LeCun. A Path Towards Autonomous Machine Intelligence. 2022. url: https://openreview.
net/pdf?id=BZ5a1r-kVsf.
[Lee+22] K.-H. Lee et al. “Multi-Game Decision Transformers”. In: NIPS (May 2022),
pp. 27921–27936. url: http://dx.doi.org/10.48550/arXiv.2205.15241.
[Leh24] M. Lehmann. “The definitive guide to policy gradients in deep reinforcement learning: Theory,
algorithms and implementations”. In: arXiv [cs.LG] (Jan. 2024). url: http://arxiv.org/
abs/2401.13662.
[Lev18] S. Levine. “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review”.
In: (2018). arXiv: 1805.00909 [cs.LG]. url: http://arxiv.org/abs/1805.00909.
[Lev+18] A. Levy, G. Konidaris, R. Platt, and K. Saenko. “Learning Multi-Level Hierarchies with
Hindsight”. In: ICLR. Sept. 2018. url: https://openreview.net/pdf?id=ryzECoAcY7.
[Lev+20a] S. Levine, A. Kumar, G. Tucker, and J. Fu. “Offline Reinforcement Learning: Tutorial, Review,
and Perspectives on Open Problems”. In: (2020). arXiv: 2005.01643 [cs.LG]. url: http:
//arxiv.org/abs/2005.01643.
[Lev+20b] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline Reinforcement Learning: Tutorial, Review,
and Perspectives on Open Problems. arXiv:2005.01643. 2020.
[Lew+23] A. K. Lew, T. Zhi-Xuan, G. Grand, and V. K. Mansinghka. “Sequential Monte Carlo Steering
of Large Language Models using Probabilistic Programs”. In: arXiv [cs.AI] (June 2023). url:
http://arxiv.org/abs/2306.03081.
[LGR12] S. Lange, T. Gabel, and M. Riedmiller. “Batch reinforcement learning”. en. In: Adaptation,
Learning, and Optimization. Adaptation, learning, and optimization. Berlin, Heidelberg: Springer
Berlin Heidelberg, 2012, pp. 45–73. url: https://link.springer.com/chapter/10.1007/
978-3-642-27645-3_2.
[LHC24] C. Lu, S. Hu, and J. Clune. “Intelligent Go-Explore: Standing on the shoulders of giant
foundation models”. In: arXiv [cs.LG] (May 2024). url: http://arxiv.org/abs/2405.15143.
[LHP22] M. Lauri, D. Hsu, and J. Pajarinen. “Partially Observable Markov Decision Processes in Robotics:
A Survey”. In: IEEE Trans. Rob. (Sept. 2022). url: http://arxiv.org/abs/2209.10342.
[LHS13] T. Lattimore, M. Hutter, and P. Sunehag. “The Sample-Complexity of General Reinforcement
Learning”. en. In: ICML. PMLR, May 2013, pp. 28–36. url: https://proceedings.mlr.
press/v28/lattimore13.html.
[Li+10] L. Li, W. Chu, J. Langford, and R. E. Schapire. “A contextual-bandit approach to personalized
news article recommendation”. In: WWW. 2010.
[Li18] Y. Li. “Deep Reinforcement Learning”. In: (2018). arXiv: 1810.06339 [cs.LG]. url: http:
//arxiv.org/abs/1810.06339.
[Li23] S. E. Li. Reinforcement learning for sequential decision and optimal control. en. Singapore:
Springer Nature Singapore, 2023. url: https://link.springer.com/book/10.1007/978-
981-19-7784-8.
[Li+23] Z. Li, M. Lanctot, K. R. McKee, L. Marris, I. Gemp, D. Hennes, P. Muller, K. Larson, Y.
Bachrach, and M. P. Wellman. “Combining tree-search, generative models, and Nash bargaining
concepts in game-theoretic reinforcement learning”. In: arXiv [cs.AI] (Feb. 2023). url: http:
//arxiv.org/abs/2302.00797.
[Li+24a] H. Li, X. Yang, Z. Wang, X. Zhu, J. Zhou, Y. Qiao, X. Wang, H. Li, L. Lu, and J. Dai. “Auto
MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft”. In:
CVPR. 2024, pp. 16426–16435. url: https://openaccess.thecvf.com/content/CVPR2024/
papers/Li_Auto_MC-Reward_Automated_Dense_Reward_Design_with_Large_Language_
Models_CVPR_2024_paper.pdf.
[Li+24b] Z. Li, H. Liu, D. Zhou, and T. Ma. “Chain of thought empowers transformers to solve inherently
serial problems”. In: ICLR. Feb. 2024. url: https://arxiv.org/abs/2402.12875.
[Lil+16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra.
“Continuous control with deep reinforcement learning”. In: ICLR. 2016. url: http://arxiv.
org/abs/1509.02971.
[Lim+23] M. H. Lim, T. J. Becker, M. J. Kochenderfer, C. J. Tomlin, and Z. N. Sunberg. “Optimality
guarantees for particle belief approximation of POMDPs”. In: J. Artif. Intell. Res. (2023). url:
https://jair.org/index.php/jair/article/view/14525.
[Lin+19] C. Linke, N. M. Ady, M. White, T. Degris, and A. White. “Adapting behaviour via intrinsic
reward: A survey and empirical study”. In: J. Artif. Intell. Res. (June 2019). url: http:
//arxiv.org/abs/1906.07865.
[Lin+24a] J. Lin, Y. Du, O. Watkins, D. Hafner, P. Abbeel, D. Klein, and A. Dragan. “Learning to model
the world with language”. In: ICML. 2024.
[Lin+24b] Y.-A. Lin, C.-T. Lee, C.-H. Yang, G.-T. Liu, and S.-H. Sun. “Hierarchical Programmatic Option
Framework”. In: NIPS. Nov. 2024. url: https://openreview.net/pdf?id=FeCWZviCeP.
[Lin92] L.-J. Lin. “Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and
Teaching”. In: Mach. Learn. 8.3-4 (1992), pp. 293–321. url: https://doi.org/10.1007/
BF00992699.
[Lio+22] V. Lioutas, J. W. Lavington, J. Sefas, M. Niedoba, Y. Liu, B. Zwartsenberg, S. Dabiri, F.
Wood, and A. Scibior. “Critic Sequential Monte Carlo”. In: ICLR. Sept. 2022. url: https:
//openreview.net/pdf?id=ObtGcyKmwna.
[Lit94] M. Littman. “Markov games as a framework for multi-agent reinforcement learning”. en. In:
Machine Learning Proceedings 1994. Elsevier, Jan. 1994, pp. 157–163. url: https://www.
sciencedirect.com/science/article/abs/pii/B9781558603356500271.
[Liu+25a] Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin. R1-Zero-Like Training:
A Critical Perspective. Tech. rep. N. U. Singapore, 2025. url: https://github.com/sail-
sg/understand-r1-zero/blob/main/understand-r1-zero.pdf.
[Liu+25b] Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin. “Understanding
R1-zero-like training: A critical perspective”. In: arXiv [cs.LG] (Mar. 2025). url: http:
//arxiv.org/abs/2503.20783.
[LMLFP11] G. J. Laurent, L. Matignon, and N. Le Fort-Piat. “The world of independent learners is not
Markovian”. en. In: International Journal of Knowledge-Based and Intelligent Engineering
Systems (2011). url: https://hal.science/hal-00601941/document.
[LMW24] B. Li, N. Ma, and Z. Wang. “Rewarded Region Replay (R3) for policy learning with discrete
action space”. In: arXiv [cs.LG] (May 2024). url: http://arxiv.org/abs/2405.16383.
[Lor24] J. Lorraine. “Scalable nested optimization for deep learning”. In: arXiv [cs.LG] (July 2024).
url: http://arxiv.org/abs/2407.01526.
[Lou+25] J. Loula et al. “Syntactic and semantic control of large language models via sequential Monte
Carlo”. In: ICLR. Apr. 2025. url: https://arxiv.org/abs/2504.13139.
[LÖW21] T. van de Laar, A. Özçelikkale, and H. Wymeersch. “Application of the Free Energy Principle
to Estimation and Control”. In: IEEE Trans. Signal Process. 69 (2021), pp. 4234–4244. url:
http://dx.doi.org/10.1109/TSP.2021.3095711.
[LPC22] N. Lambert, K. Pister, and R. Calandra. “Investigating Compounding Prediction Errors in
Learned Dynamics Models”. In: arXiv [cs.LG] (Mar. 2022). url: http://arxiv.org/abs/
2203.09637.
[LR10] S. Lange and M. Riedmiller. “Deep auto-encoder neural networks in reinforcement learning”.
en. In: IJCNN. IEEE, July 2010, pp. 1–8. url: https://ieeexplore.ieee.org/abstract/
document/5596468.
[LS01] M. Littman and R. S. Sutton. “Predictive Representations of State”. In: NIPS 14 (2001). url:
https://proceedings.neurips.cc/paper_files/paper/2001/file/1e4d36177d71bbb3558e43af9577d70e-
Paper.pdf.
[LS19] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge, 2019.
[Lu+23] X. Lu, B. Van Roy, V. Dwaracherla, M. Ibrahimi, I. Osband, and Z. Wen. “Reinforcement Learn-
ing, Bit by Bit”. In: Found. Trends® Mach. Learn. (2023). url: https://www.nowpublishers.
com/article/Details/MAL-097.
[Luo+22] F.-M. Luo, T. Xu, H. Lai, X.-H. Chen, W. Zhang, and Y. Yu. “A survey on model-based
reinforcement learning”. In: arXiv [cs.LG] (June 2022). url: http://arxiv.org/abs/2206.
09328.
[LV06] F. Liese and I. Vajda. “On divergences and informations in statistics and information theory”.
In: IEEE Transactions on Information Theory 52.10 (2006), pp. 4394–4412.
[LWL06] L. Li, T. J. Walsh, and M. L. Littman. “Towards a Unified Theory of State Abstraction for
MDPs”. In: (2006). url: https://thomasjwalsh.net/pub/aima06Towards.pdf.
[LZZ22] M. Liu, M. Zhu, and W. Zhang. “Goal-conditioned reinforcement learning: Problems and
solutions”. In: IJCAI. Jan. 2022.
[Ma+24] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and
A. Anandkumar. “Eureka: Human-Level Reward Design via Coding Large Language Models”.
In: ICLR. 2024.
[MA93] A. W. Moore and C. G. Atkeson. “Prioritized Sweeping: Reinforcement Learning with Less
Data and Less Time”. In: Machine Learning 13.1 (1993), pp. 103–130.
[Mac+18a] M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling.
“Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for
General Agents”. In: J. Artif. Intell. Res. (2018). url: http://arxiv.org/abs/1709.06009.
[Mac+18b] M. C. Machado, C. Rosenbaum, X. Guo, M. Liu, G. Tesauro, and M. Campbell. “Eigenoption
Discovery through the Deep Successor Representation”. In: ICLR. Feb. 2018. url: https:
//openreview.net/pdf?id=Bk8ZcAxR-.
[Mac+23] M. C. Machado, A. Barreto, D. Precup, and M. Bowling. “Temporal Abstraction in Reinforcement
Learning with the Successor Representation”. In: JMLR 24.80 (2023), pp. 1–69. url: http:
//jmlr.org/papers/v24/21-1213.html.
[Mac+24] M. Macfarlane, E. Toledo, D. J. Byrne, P. Duckworth, and A. Laterre. “SPO: Sequential Monte
Carlo Policy Optimisation”. In: NIPS. Nov. 2024. url: https://openreview.net/pdf?id=
XKvYcPPH5G.
[Mae+09] H. Maei, C. Szepesvári, S. Bhatnagar, D. Precup, D. Silver, and R. S. Sutton. “Convergent
Temporal-Difference Learning with Arbitrary Smooth Function Approximation”. In: NIPS.
Vol. 22. 2009. url: https://proceedings.neurips.cc/paper_files/paper/2009/file/
3a15c7d0bbe60300a39f76f8a5ba6896-Paper.pdf.
[MAF22] V. Micheli, E. Alonso, and F. Fleuret. “Transformers are Sample-Efficient World Models”. In:
ICLR. Sept. 2022.
[MAF24] V. Micheli, E. Alonso, and F. Fleuret. “Efficient world models with context-aware tokenization”.
In: ICML. June 2024.
[Maj21] S. J. Majeed. “Abstractions of general reinforcement learning: An inquiry into the scalability
of generally intelligent agents”. PhD thesis. ANU, Dec. 2021. url: https://arxiv.org/abs/
2112.13404.
[Man+19] D. J. Mankowitz, N. Levine, R. Jeong, Y. Shi, J. Kay, A. Abdolmaleki, J. T. Springenberg,
T. Mann, T. Hester, and M. Riedmiller. “Robust Reinforcement Learning for Continuous
Control with Model Misspecification”. In: (2019). arXiv: 1906.07516 [cs.LG]. url: http:
//arxiv.org/abs/1906.07516.
[Mar10] J. Martens. “Deep learning via Hessian-free optimization”. In: ICML. 2010. url: http://www.
cs.toronto.edu/~asamir/cifar/HFO_James.pdf.
[Mar16] J. Martens. “Second-order optimization for neural networks”. PhD thesis. Toronto, 2016. url:
http://www.cs.toronto.edu/~jmartens/docs/thesis_phd_martens.pdf.
[Mar20] J. Martens. “New insights and perspectives on the natural gradient method”. In: JMLR (2020).
url: http://arxiv.org/abs/1412.1193.
[Mar21] J. Marino. “Predictive Coding, Variational Autoencoders, and Biological Connections”. en. In:
Neural Comput. 34.1 (2021), pp. 1–44. url: http://dx.doi.org/10.1162/neco_a_01458.
[Maz+22] P. Mazzaglia, T. Verbelen, O. Çatal, and B. Dhoedt. “The Free Energy Principle for Perception
and Action: A Deep Learning Perspective”. en. In: Entropy 24.2 (2022). url: http://dx.doi.
org/10.3390/e24020301.
[MBB20] M. C. Machado, M. G. Bellemare, and M. Bowling. “Count-based exploration with the successor
representation”. en. In: AAAI 34.04 (Apr. 2020), pp. 5125–5133. url: https://ojs.aaai.org/
index.php/AAAI/article/view/5955.
[McM+13] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E.
Davydov, D. Golovin, et al. “Ad click prediction: a view from the trenches”. In: KDD. 2013,
pp. 1222–1230.
[Men+23] W. Meng, Q. Zheng, G. Pan, and Y. Yin. “Off-Policy Proximal Policy Optimization”. en. In:
AAAI 37.8 (June 2023), pp. 9162–9170. url: https://ojs.aaai.org/index.php/AAAI/
article/view/26099.
[Met+17] L. Metz, J. Ibarz, N. Jaitly, and J. Davidson. “Discrete Sequential Prediction of Continuous
Actions for Deep RL”. In: (2017). arXiv: 1705.05035 [cs.LG]. url: http://arxiv.org/abs/
1705.05035.
[Met+22] Meta Fundamental AI Research Diplomacy Team (FAIR)† et al. “Human-level play in the game
of Diplomacy by combining language models with strategic reasoning”. en. In: Science 378.6624
(Dec. 2022), pp. 1067–1074. url: http://dx.doi.org/10.1126/science.ade9097.
[Mey22] S. Meyn. Control Systems and Reinforcement Learning. Cambridge, 2022. url: https://meyn.
ece.ufl.edu/2021/08/01/control-systems-and-reinforcement-learning/.
[MG15] J. Martens and R. Grosse. “Optimizing Neural Networks with Kronecker-factored Approximate
Curvature”. In: ICML. 2015. url: http://arxiv.org/abs/1503.05671.
[MGR18] H. Mania, A. Guy, and B. Recht. “Simple random search of static linear policies is competitive
for reinforcement learning”. In: NIPS. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman,
N. Cesa-Bianchi, and R. Garnett. Curran Associates, Inc., 2018, pp. 1800–1809. url: http:
//papers.nips.cc/paper/7451-simple-random-search-of-static-linear-policies-
is-competitive-for-reinforcement-learning.pdf.
[Mik+20] V. Mikulik, G. Delétang, T. McGrath, T. Genewein, M. Martic, S. Legg, and P. Ortega. “Meta-
trained agents implement Bayes-optimal agents”. In: NIPS 33 (2020), pp. 18691–18703. url:
https://proceedings.neurips.cc/paper_files/paper/2020/file/d902c3ce47124c66ce615d5ad9ba304f-
Paper.pdf.
[Mil20] B. Millidge. “Deep Active Inference as Variational Policy Gradients”. In: J. Mathematical
Psychology (2020). url: http://arxiv.org/abs/1907.03876.
[Mil+20] B. Millidge, A. Tschantz, A. K. Seth, and C. L. Buckley. “On the Relationship Between Active
Inference and Control as Inference”. In: International Workshop on Active Inference. 2020. url:
http://arxiv.org/abs/2006.12964.
[Min+24] M. J. Min, Y. Ding, L. Buratti, S. Pujar, G. Kaiser, S. Jana, and B. Ray. “Beyond Accuracy:
Evaluating Self-Consistency of Code Large Language Models with IdentityChain”. In: ICLR.
2024. url: https://openreview.net/forum?id=caW7LdAALh.
[ML25] D. Mittal and W. S. Lee. “Differentiable Tree Search Network”. In: ICLR. 2025. url: https:
//arxiv.org/abs/2401.11660.
[MLR24] D. McNamara, J. Loper, and J. Regier. “Sequential Monte Carlo for Inclusive KL Minimization
in Amortized Variational Inference”. en. In: AISTATS. PMLR, Apr. 2024, pp. 4312–4320. url:
https://proceedings.mlr.press/v238/mcnamara24a.html.
[MM90] D. Q. Mayne and H. Michalska. “Receding horizon control of nonlinear systems”. In: IEEE Trans.
Automat. Contr. 35.7 (1990), pp. 814–824.
[MMT17] C. J. Maddison, A. Mnih, and Y. W. Teh. “The Concrete Distribution: A Continuous Relaxation
of Discrete Random Variables”. In: ICLR. 2017. url: http://arxiv.org/abs/1611.00712.
[MMT24] S. Mannor, Y. Mansour, and A. Tamar. Reinforcement Learning: Foundations. 2024. url:
https://sites.google.com/corp/view/rlfoundations/home.
[Mni+15] V. Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540
(2015), pp. 529–533.
[Mni+16] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K.
Kavukcuoglu. “Asynchronous Methods for Deep Reinforcement Learning”. In: ICML. 2016. url:
http://arxiv.org/abs/1602.01783.
[Moe+23] T. M. Moerland, J. Broekens, A. Plaat, and C. M. Jonker. “Model-based Reinforcement
Learning: A Survey”. In: Foundations and Trends in Machine Learning 16.1 (2023), pp. 1–118.
url: https://arxiv.org/abs/2006.16712.
[Moe+25] A. Moeini, J. Wang, J. Beck, E. Blaser, S. Whiteson, R. Chandra, and S. Zhang. “A survey of
in-context reinforcement learning”. In: arXiv [cs.LG] (Feb. 2025). url: http://arxiv.org/
abs/2502.07978.
[Moh+20] S. Mohamed, M. Rosca, M. Figurnov, and A. Mnih. “Monte Carlo Gradient Estimation in
Machine Learning”. In: JMLR 21.132 (2020), pp. 1–62. url: http://jmlr.org/papers/v21/19-
346.html.
[Mor+17] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M.
Johanson, and M. Bowling. “DeepStack: Expert-level artificial intelligence in heads-up no-limit
poker”. en. In: Science 356.6337 (May 2017), pp. 508–513. url: http://dx.doi.org/10.1126/
science.aam6960.
[Mor63] T. Morimoto. “Markov Processes and the H-Theorem”. In: J. Phys. Soc. Jpn. 18.3 (1963),
pp. 328–331. url: https://doi.org/10.1143/JPSJ.18.328.
[Mos+24] R. J. Moss, A. Corso, J. Caers, and M. J. Kochenderfer. “BetaZero: Belief-state planning
for long-horizon POMDPs using learned approximations”. In: RL Conference. 2024. url:
https://arxiv.org/abs/2306.00249.
[MP+22] A. Mavor-Parker, K. Young, C. Barry, and L. Griffin. “How to Stay Curious while avoiding Noisy
TVs using Aleatoric Uncertainty Estimation”. en. In: ICML. PMLR, June 2022, pp. 15220–15240.
url: https://proceedings.mlr.press/v162/mavor-parker22a.html.
[MP95] R. D. McKelvey and T. R. Palfrey. “Quantal response equilibria for normal form games”. en.
In: Games Econ. Behav. 10.1 (July 1995), pp. 6–38. url: https://www.sciencedirect.com/
science/article/abs/pii/S0899825685710238.
[MP98] R. D. McKelvey and T. R. Palfrey. “Quantal response equilibria for extensive form games”. en. In:
Exp. Econ. 1.1 (June 1998), pp. 9–41. url: https://link.springer.com/article/10.1023/A:
1009905800005.
[MS14] J. Modayil and R. Sutton. “Prediction driven behavior: Learning predictions that drive fixed
responses”. In: National Conference on Artificial Intelligence. June 2014. url: https://www.
semanticscholar.org/paper/Prediction-Driven-Behavior%3A-Learning-Predictions-
Modayil-Sutton/22162abb8f5868938f8da391d3a1d603b3d8ac4c.
[MS24] W. Merrill and A. Sabharwal. “The Expressive Power of Transformers with Chain of Thought”.
In: ICLR. 2024. url: https://arxiv.org/abs/2310.07923.
[MSB21] B. Millidge, A. Seth, and C. L. Buckley. “Predictive Coding: a Theoretical and Experimental
Review”. In: (2021). arXiv: 2107.12979 [cs.AI]. url: http://arxiv.org/abs/2107.12979.
[Mun14] R. Munos. “From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to
Optimization and Planning”. In: Foundations and Trends in Machine Learning 7.1 (2014),
pp. 1–129. url: http://dx.doi.org/10.1561/2200000038.
[Mun+16] R. Munos, T. Stepleton, A. Harutyunyan, and M. G. Bellemare. “Safe and Efficient Off-Policy
Reinforcement Learning”. In: NIPS. 2016, pp. 1046–1054.
[Mur00] K. Murphy. A Survey of POMDP Solution Techniques. Tech. rep. Comp. Sci. Div., UC Berkeley,
2000. url: https://www.cs.ubc.ca/~murphyk/Papers/pomdp.pdf.
[Mur23] K. P. Murphy. Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023.
[MWS14] J. Modayil, A. White, and R. S. Sutton. “Multi-timescale nexting in a reinforcement learning
robot”. en. In: Adapt. Behav. 22.2 (Apr. 2014), pp. 146–160. url: https://sites.ualberta.
ca/~amw8/nexting.pdf.
[Nac+18] O. Nachum, S. Gu, H. Lee, and S. Levine. “Data-Efficient Hierarchical Reinforcement Learn-
ing”. In: NIPS. May 2018. url: https://proceedings.neurips.cc/paper/2018/hash/
e6384711491713d29bc63fc5eeb5ba4f-Abstract.html.
[Nac+19] O. Nachum, S. Gu, H. Lee, and S. Levine. “Near-Optimal Representation Learning for Hier-
archical Reinforcement Learning”. In: ICLR. 2019. url: https://openreview.net/pdf?id=
H1emus0qF7.
[Nai+20] A. Nair, A. Gupta, M. Dalal, and S. Levine. “AWAC: Accelerating Online Reinforcement
Learning with Offline Datasets”. In: arXiv [cs.LG] (June 2020). url: http://arxiv.org/abs/
2006.09359.
[Nai+21] A. Naik, Z. Abbas, A. White, and R. S. Sutton. “Towards Reinforcement Learning in the
Continuing Setting”. In: Never-Ending Reinforcement Learning (NERL) Workshop at ICLR. 2021.
url: https://drive.google.com/file/d/1xh7WjGP2VI_QdpjVWygRC1BuH6WB_gqi/view.
[Nak+23] M. Nakamoto, Y. Zhai, A. Singh, M. S. Mark, Y. Ma, C. Finn, A. Kumar, and S. Levine.
“Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning”. In: arXiv [cs.LG]
(Mar. 2023). url: http://arxiv.org/abs/2303.05479.
[NHR99] A. Ng, D. Harada, and S. Russell. “Policy invariance under reward transformations: Theory
and application to reward shaping”. In: ICML. 1999.
[Ni+24] T. Ni, B. Eysenbach, E. Seyedsalehi, M. Ma, C. Gehring, A. Mahajan, and P.-L. Bacon.
“Bridging State and History Representations: Understanding Self-Predictive RL”. In: ICLR. Jan.
2024. url: http://arxiv.org/abs/2401.08898.
[Nik+22] E. Nikishin, R. Abachi, R. Agarwal, and P.-L. Bacon. “Control-oriented model-based reinforce-
ment learning with implicit differentiation”. en. In: AAAI 36.7 (June 2022), pp. 7886–7894.
url: https://ojs.aaai.org/index.php/AAAI/article/view/20758.
[NLS19] C. A. Naesseth, F. Lindsten, and T. B. Schön. “Elements of Sequential Monte Carlo”. In:
Foundations and Trends in Machine Learning (2019). url: http://arxiv.org/abs/1903.
04797.
[NR00] A. Ng and S. Russell. “Algorithms for inverse reinforcement learning”. In: ICML. 2000.
[NWJ10] X. Nguyen, M. J. Wainwright, and M. I. Jordan. “Estimating Divergence Functionals and the
Likelihood Ratio by Convex Risk Minimization”. In: IEEE Trans. Inf. Theory 56.11 (2010),
pp. 5847–5861. url: http://dx.doi.org/10.1109/TIT.2010.2068870.
[OA14] F. A. Oliehoek and C. Amato. “Best Response Bayesian Reinforcement Learning for Multiagent
Systems with State Uncertainty”. en. In: Proceedings of the Ninth AAMAS Workshop on Multi-
Agent Sequential Decision Making in Uncertain Domains (MSDM). University of Liverpool,
2014. url: https://livrepository.liverpool.ac.uk/3000453/1/Oliehoek14MSDM.pdf.
[OA16] F. A. Oliehoek and C. Amato. A Concise Introduction to Decentralized POMDPs. en. 1st ed.
SpringerBriefs in Intelligent Systems. Cham, Switzerland: Springer International Publishing,
June 2016. url: https://www.fransoliehoek.net/docs/OliehoekAmato16book.pdf.
[OCD21] G. Ostrovski, P. S. Castro, and W. Dabney. “The Difficulty of Passive Learning in Deep Reinforce-
ment Learning”. In: NIPS. Vol. 34. Dec. 2021, pp. 23283–23295. url: https://proceedings.
neurips.cc/paper_files/paper/2021/file/c3e0c62ee91db8dc7382bde7419bb573-Paper.
pdf.
[OK22] A. Ororbia and D. Kifer. “The neural coding framework for learning generative models”. en. In:
Nat. Commun. 13.1 (Apr. 2022), p. 2064. url: https://www.nature.com/articles/s41467-
022-29632-7.
[Oqu+24] M. Oquab et al. “DINOv2: Learning Robust Visual Features without Supervision”. In: Trans-
actions on Machine Learning Research (2024). url: https://openreview.net/forum?id=
a68SUt6zFt.
[ORVR13] I. Osband, D. Russo, and B. Van Roy. “(More) Efficient Reinforcement Learning via Posterior
Sampling”. In: NIPS. 2013. url: http://arxiv.org/abs/1306.0940.
[Osb+19] I. Osband, B. Van Roy, D. J. Russo, and Z. Wen. “Deep exploration via randomized value
functions”. In: JMLR 20.124 (2019), pp. 1–62. url: http://jmlr.org/papers/v20/18-
339.html.
[Osb+23a] I. Osband, Z. Wen, S. M. Asghari, V. Dwaracherla, M. Ibrahimi, X. Lu, and B. Van Roy.
“Approximate Thompson Sampling via Epistemic Neural Networks”. en. In: UAI. PMLR, July
2023, pp. 1586–1595. url: https://proceedings.mlr.press/v216/osband23a.html.
[Osb+23b] I. Osband, Z. Wen, S. M. Asghari, V. Dwaracherla, M. Ibrahimi, X. Lu, and B. Van Roy.
“Epistemic Neural Networks”. In: NIPS. 2023. url: https://proceedings.neurips.cc/paper_
files/paper/2023/file/07fbde96bee50f4e09303fd4f877c2f3-Paper-Conference.pdf.
[OSL17] J. Oh, S. Singh, and H. Lee. “Value Prediction Network”. In: NIPS. July 2017.
[OT22] M. Okada and T. Taniguchi. “DreamingV2: Reinforcement learning with discrete world models
without reconstruction”. en. In: 2022 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS). IEEE, Oct. 2022, pp. 985–991. url: https://ieeexplore.ieee.org/
abstract/document/9981405.
[OVR17] I. Osband and B. Van Roy. “Why is posterior sampling better than optimism for reinforcement
learning?” In: ICML. 2017, pp. 2701–2710.
[Par+23] J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. “Generative
agents: Interactive simulacra of human behavior”. en. In: Proceedings of the 36th Annual ACM
Symposium on User Interface Software and Technology. New York, NY, USA: ACM, Oct. 2023.
url: https://dl.acm.org/doi/10.1145/3586183.3606763.
[Par+24a] S. Park, K. Frans, B. Eysenbach, and S. Levine. “OGBench: Benchmarking Offline Goal-
Conditioned RL”. In: arXiv [cs.LG] (Oct. 2024). url: http://arxiv.org/abs/2410.20092.
[Par+24b] S. Park, K. Frans, S. Levine, and A. Kumar. “Is value learning really the main bottleneck in
offline RL?” In: NIPS. June 2024. url: https://arxiv.org/abs/2406.09329.
[Pat+17] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. “Curiosity-driven Exploration by Self-
supervised Prediction”. In: ICML. 2017. url: http://arxiv.org/abs/1705.05363.
[Pat+22] S. Pateria, B. Subagdja, A.-H. Tan, and C. Quek. “Hierarchical Reinforcement Learning: A
comprehensive survey”. en. In: ACM Comput. Surv. 54.5 (June 2022), pp. 1–35. url: https:
//dl.acm.org/doi/10.1145/3453160.
[Pat+24] A. Patterson, S. Neumann, M. White, and A. White. “Empirical design in reinforcement
learning”. In: JMLR (2024). url: http://arxiv.org/abs/2304.01315.
[PB+14] N. Parikh, S. Boyd, et al. “Proximal algorithms”. In: Foundations and Trends in Optimization
1.3 (2014), pp. 127–239.
[PCA21] G. Papoudakis, F. Christianos, and S. V. Albrecht. “Agent modelling under partial observability
for deep reinforcement learning”. In: NIPS. 2021. url: https://arxiv.org/abs/2006.09447.
[Pea94] B. A. Pearlmutter. “Fast Exact Multiplication by the Hessian”. In: Neural Comput. 6.1 (1994),
pp. 147–160. url: https://doi.org/10.1162/neco.1994.6.1.147.
[Pen+19] X. B. Peng, A. Kumar, G. Zhang, and S. Levine. “Advantage-weighted regression: Simple
and scalable off-policy reinforcement learning”. In: arXiv [cs.LG] (Sept. 2019). url: http:
//arxiv.org/abs/1910.00177.
[Pet08] A. S. I. R. Petersen. “Formulas for Discrete Time LQR, LQG, LEQG and Minimax LQG
Optimal Control Problems”. In: IFAC Proceedings Volumes 41.2 (Jan. 2008), pp. 8773–8778.
url: https://www.sciencedirect.com/science/article/pii/S1474667016403629.
[Pfa+25] D. Pfau, I. Davies, D. Borsa, J. Araujo, B. Tracey, and H. van Hasselt. “Wasserstein Policy
Optimization”. In: ICML. May 2025. url: https://arxiv.org/abs/2505.00663.
[Pic+19] A. Piche, V. Thomas, C. Ibrahim, Y. Bengio, and C. Pal. “Probabilistic Planning with Sequential
Monte Carlo methods”. In: ICLR. 2019. url: https://openreview.net/pdf?id=ByetGn0cYX.
[Pis+22] M. Pislar, D. Szepesvari, G. Ostrovski, D. L. Borsa, and T. Schaul. “When should agents
explore?” In: ICLR. 2022. url: https://openreview.net/pdf?id=dEwfxt14bca.
[PKP21] A. Plaat, W. Kosters, and M. Preuss. “High-Accuracy Model-Based Reinforcement Learning, a
Survey”. In: (2021). arXiv: 2107.08241 [cs.LG]. url: http://arxiv.org/abs/2107.08241.
[Pla22] A. Plaat. Deep reinforcement learning, a textbook. Berlin, Germany: Springer, Jan. 2022. url:
https://link.springer.com/10.1007/978-981-19-0638-1.
[PLG23] B. Prystawski, M. Y. Li, and N. D. Goodman. “Why think step-by-step? Reasoning emerges
from the locality of experience”. In: NIPS. Apr. 2023, pp. 70926–70947. url:
https://proceedings.neurips.cc/paper_files/paper/2023/hash/e0af79ad53a336b4c4b4f7e2a68eb609-
Abstract-Conference.html.
[PMB22] K. Paster, S. McIlraith, and J. Ba. “You can’t count on luck: Why decision transformers and
RvS fail in stochastic environments”. In: NIPS. May 2022. url: http://arxiv.org/abs/2205.
15967.
[Pom89] D. Pomerleau. “ALVINN: An Autonomous Land Vehicle in a Neural Network”. In: NIPS. 1989,
pp. 305–313.
[Pow19] W. B. Powell. “From Reinforcement Learning to Optimal Control: A unified framework for
sequential decisions”. In: arXiv [cs.AI] (Dec. 2019). url: http://arxiv.org/abs/1912.03513.
[Pow22] W. B. Powell. Reinforcement Learning and Stochastic Optimization: A Unified Framework
for Sequential Decisions. en. 1st ed. Wiley, Mar. 2022. url: https : / / www . amazon . com /
Reinforcement-Learning-Stochastic-Optimization-Sequential/dp/1119815037.
[PR12] W. B. Powell and I. O. Ryzhov. Optimal Learning. Wiley Series in Probability and Statistics.
http://optimallearning.princeton.edu/. Hoboken, NJ: Wiley-Blackwell, Mar. 2012. url: https:
//castle.princeton.edu/wp-content/uploads/2019/02/Powell-OptimalLearningWileyMarch112018.
pdf.
[PS07] J. Peters and S. Schaal. “Reinforcement Learning by Reward-Weighted Regression for Opera-
tional Space Control”. In: ICML. 2007, pp. 745–750.
[PSS00] D. Precup, R. S. Sutton, and S. P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation”.
In: ICML. ICML ’00. Morgan Kaufmann Publishers Inc., 2000, pp. 759–766. url: http:
//dl.acm.org/citation.cfm?id=645529.658134.
[PT87] C. Papadimitriou and J. Tsitsiklis. “The complexity of Markov decision processes”. In: Mathe-
matics of Operations Research 12.3 (1987), pp. 441–450.
[Pte+24] M. Pternea, P. Singh, A. Chakraborty, Y. Oruganti, M. Milletari, S. Bapat, and K. Jiang.
“The RL/LLM taxonomy tree: Reviewing synergies between Reinforcement Learning and Large
Language Models”. In: JAIR (Feb. 2024). url: https://www.jair.org/index.php/jair/
article/view/15960.
[Pur+25] I. Puri, S. Sudalairaj, G. Xu, K. Xu, and A. Srivastava. “A probabilistic inference approach to
inference-time scaling of LLMs using particle-based Monte Carlo methods”. In: arXiv [cs.LG]
(Feb. 2025). url: http://arxiv.org/abs/2502.01618.
[Put94] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley,
1994.
[PW94] J. Peng and R. J. Williams. “Incremental Multi-Step Q-Learning”. In: Machine Learning
Proceedings. Elsevier, Jan. 1994, pp. 226–232. url: http://dx.doi.org/10.1016/B978-1-
55860-335-6.50035-0.
[QPC21] J. Queeney, I. C. Paschalidis, and C. G. Cassandras. “Generalized Proximal Policy Optimization
with Sample Reuse”. In: NIPS. Oct. 2021.
[QPC24] J. Queeney, I. C. Paschalidis, and C. G. Cassandras. “Generalized Policy Improvement algorithms
with theoretically supported sample reuse”. In: IEEE Trans. Automat. Contr. (2024). url:
http://arxiv.org/abs/2206.13714.
[Raf+23] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. “Direct Preference
Optimization: Your language model is secretly a reward model”. In: arXiv [cs.LG] (May 2023).
url: http://arxiv.org/abs/2305.18290.
[Raf+24] R. Rafailov et al. “D5RL: Diverse datasets for data-driven deep reinforcement learning”. In:
RLC. Aug. 2024. url: https://arxiv.org/abs/2408.08441.
[Raj+17] A. Rajeswaran, K. Lowrey, E. Todorov, and S. Kakade. “Towards generalization and simplicity
in continuous control”. In: NIPS. Mar. 2017.
[Rao10] A. V. Rao. “A Survey of Numerical Methods for Optimal Control”. In: Adv. Astronaut. Sci.
135.1 (2010).
[Ras+18] T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. Foerster, and S. Whiteson. “QMIX:
Monotonic value function factorisation for deep multi-agent reinforcement learning”. In: ICML.
Mar. 2018. url: https://arxiv.org/abs/1803.11485.
[RB12] S. Ross and J. A. Bagnell. “Agnostic system identification for model-based reinforcement
learning”. In: ICML. Mar. 2012.
[RB99] R. P. Rao and D. H. Ballard. “Predictive coding in the visual cortex: a functional interpretation
of some extra-classical receptive-field effects”. en. In: Nat. Neurosci. 2.1 (1999), pp. 79–87. url:
http://dx.doi.org/10.1038/4580.
[RCdP07] S. Ross, B. Chaib-draa, and J. Pineau. “Bayes-Adaptive POMDPs”. In: NIPS 20 (2007). url:
https://proceedings.neurips.cc/paper_files/paper/2007/file/3b3dbaf68507998acd6a5a5254ab2d76-
Paper.pdf.
[Rec19] B. Recht. “A Tour of Reinforcement Learning: The View from Continuous Control”. In: Annual
Review of Control, Robotics, and Autonomous Systems 2 (2019), pp. 253–279. url: http:
//arxiv.org/abs/1806.09460.
[Ree+22] S. Reed et al. “A Generalist Agent”. In: TMLR (May 2022). url: https://arxiv.org/abs/
2205.06175.
[Ren+24] A. Z. Ren, J. Lidard, L. L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai,
and M. Simchowitz. “Diffusion Policy Policy Optimization”. In: arXiv [cs.RO] (Aug. 2024). url:
http://arxiv.org/abs/2409.00588.
[RFP15] I. O. Ryzhov, P. I. Frazier, and W. B. Powell. “A new optimal stepsize for approximate dynamic
programming”. en. In: IEEE Trans. Automat. Contr. 60.3 (Mar. 2015), pp. 743–758. url:
https://castle.princeton.edu/Papers/Ryzhov-OptimalStepsizeforADPFeb242015.pdf.
[RG66] A. Rapoport and M. Guyer. “A Taxonomy of 2 X 2 Games”. In: General Systems: Yearbook of the Society for General Systems Research 11 (1966), pp. 203–214.
[RGB11] S. Ross, G. J. Gordon, and J. A. Bagnell. “A reduction of imitation learning and structured
prediction to no-regret online learning”. In: AISTATS. 2011.
[RHH23] J. Robine, M. Höftmann, and S. Harmeling. “A simple framework for self-supervised learning
of sample-efficient world models”. In: NIPS SSL Workshop. 2023, pp. 17–18. url: https :
//sslneurips23.github.io/paper_pdfs/paper_44.pdf.
[Rie05] M. Riedmiller. “Neural fitted Q iteration – first experiences with a data efficient neural reinforce-
ment learning method”. en. In: ECML. Lecture notes in computer science. 2005, pp. 317–328.
url: https://link.springer.com/chapter/10.1007/11564096_32.
[Rie+18] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Wiele, V. Mnih, N. Heess,
and J. T. Springenberg. “Learning by Playing – Solving Sparse Reward Tasks from Scratch”. en.
In: ICML. PMLR, July 2018, pp. 4344–4353. url: https://proceedings.mlr.press/v80/
riedmiller18a.html.
[Rin21] M. Ring. “Representing knowledge as predictions (and state as knowledge)”. In: arXiv [cs.AI]
(Dec. 2021). url: http://arxiv.org/abs/2112.06336.
[RJ22] A. Rao and T. Jelvis. Foundations of Reinforcement Learning with Applications in Finance.
Chapman and Hall/ CRC, 2022. url: https://github.com/TikhonJelvis/RL-book.
[RK04] R. Rubinstein and D. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial
Optimization, Monte-Carlo Simulation, and Machine Learning. Springer-Verlag, 2004.
[RLT18] M. Riemer, M. Liu, and G. Tesauro. “Learning Abstract Options”. In: NIPS 31 (2018). url:
https://proceedings.neurips.cc/paper_files/paper/2018/file/cdf28f8b7d14ab02d12a2329d71e4079-
Paper.pdf.
[RMD22] J. B. Rawlings, D. Q. Mayne, and M. M. Diehl. Model Predictive Control: Theory, Computa-
tion, and Design (2nd ed). en. Nob Hill Publishing, LLC, Sept. 2022. url: https://sites.
engineering.ucsb.edu/~jbraw/mpc/MPC-book-2nd-edition-1st-printing.pdf.
[RMK20] A. Rajeswaran, I. Mordatch, and V. Kumar. “A game theoretic framework for model based
reinforcement learning”. In: ICML. 2020.
[RN94] G. A. Rummery and M. Niranjan. On-Line Q-Learning Using Connectionist Systems. Tech. rep. Cambridge Univ. Engineering Dept., 1994.
[RP+24] B. Romera-Paredes et al. “Mathematical discoveries from program search with large language
models”. In: Nature (2024).
[RR14] D. Russo and B. V. Roy. “Learning to Optimize via Posterior Sampling”. In: Math. Oper. Res.
39.4 (2014), pp. 1221–1243.
[RTV12] K. Rawlik, M. Toussaint, and S. Vijayakumar. “On stochastic optimal control and reinforce-
ment learning by approximate inference”. In: Robotics: Science and Systems VIII. Robotics:
Science and Systems Foundation, 2012. url: https://blogs.cuit.columbia.edu/zp2130/files/2019/03/On_Stochasitc_Optimal_Control_and_Reinforcement_Learning_by_Approximate_Inference.pdf.
[Rub97] R. Y. Rubinstein. “Optimization of computer simulation models with rare events”. In: Eur. J.
Oper. Res. 99.1 (1997), pp. 89–112. url: http://www.sciencedirect.com/science/article/
pii/S0377221796003852.
[Rud+25] M. Rudolph, N. Lichtle, S. Mohammadpour, A. Bayen, J. Z. Kolter, A. Zhang, G. Farina,
E. Vinitsky, and S. Sokota. “Reevaluating policy gradient methods for imperfect-information
games”. In: arXiv [cs.LG] (Feb. 2025). url: http://arxiv.org/abs/2502.08938.
[Rus+18] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen. “A Tutorial on Thompson
Sampling”. In: Foundations and Trends in Machine Learning 11.1 (2018), pp. 1–96. url:
http://dx.doi.org/10.1561/2200000070.
[Rus19] S. Russell. Human Compatible: Artificial Intelligence and the Problem of Control. en. Kin-
dle. Viking, 2019. url: https://www.amazon.com/Human-Compatible-Artificial-Intelligence-Problem-ebook/dp/B07N5J5FTS/ref=zg_bs_3887_4?_encoding=UTF8&psc=1&refRID=0JE0ST011W4K15PTFZAT.
[RW91] S. Russell and E. Wefald. “Principles of metareasoning”. en. In: Artif. Intell. 49.1-3 (May 1991),
pp. 361–395. url: http://dx.doi.org/10.1016/0004-3702(91)90015-C.
[Ryu+20] M. Ryu, Y. Chow, R. Anderson, C. Tjandraatmadja, and C. Boutilier. “CAQL: Continuous
Action Q-Learning”. In: ICLR. 2020. url: https://openreview.net/forum?id=BkxXe0Etwr.
[Saj+21] N. Sajid, P. J. Ball, T. Parr, and K. J. Friston. “Active Inference: Demystified and Compared”.
en. In: Neural Comput. 33.3 (Mar. 2021), pp. 674–712. url: https://web.archive.org/web/
20210628163715id_/https://discovery.ucl.ac.uk/id/eprint/10119277/1/Friston_
neco_a_01357.pdf.
[Sal+17] T. Salimans, J. Ho, X. Chen, and I. Sutskever. “Evolution Strategies as a Scalable Alternative
to Reinforcement Learning”. In: (2017). arXiv: 1703.03864 [stat.ML]. url: http://arxiv.
org/abs/1703.03864.
[Sal+23] T. Salvatori, A. Mali, C. L. Buckley, T. Lukasiewicz, R. P. N. Rao, K. Friston, and A. Ororbia.
“Brain-inspired computational intelligence via predictive coding”. In: arXiv [cs.AI] (Aug. 2023).
url: http://arxiv.org/abs/2308.07870.
[Sal+24] T. Salvatori, Y. Song, Y. Yordanov, B. Millidge, L. Sha, C. Emde, Z. Xu, R. Bogacz, and
T. Lukasiewicz. “A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding
Networks”. In: ICLR. Oct. 2024. url: https://openreview.net/pdf?id=RyUvzda8GH.
[Sam+19] M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C.-M.
Hung, P. H. S. Torr, J. Foerster, and S. Whiteson. “The StarCraft Multi-Agent Challenge”. In:
arXiv [cs.LG] (Feb. 2019). url: http://arxiv.org/abs/1902.04043.
[SB18] R. Sutton and A. Barto. Reinforcement learning: an introduction (2nd edn). MIT Press, 2018.
[Sch10] J. Schmidhuber. “Formal Theory of Creativity, Fun, and Intrinsic Motivation”. In: IEEE Trans.
Autonomous Mental Development 2 (2010). url: http://people.idsia.ch/~juergen/ieeecreative.pdf.
[Sch+15a] T. Schaul, D. Horgan, K. Gregor, and D. Silver. “Universal Value Function Approximators”. en.
In: ICML. PMLR, June 2015, pp. 1312–1320. url: https://proceedings.mlr.press/v37/
schaul15.html.
[Sch+15b] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. “Trust Region Policy Optimiza-
tion”. In: ICML. 2015. url: http://arxiv.org/abs/1502.05477.
[Sch+16a] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. “Prioritized Experience Replay”. In: ICLR.
2016. url: http://arxiv.org/abs/1511.05952.
[Sch+16b] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. “High-Dimensional Continuous
Control Using Generalized Advantage Estimation”. In: ICLR. 2016. url: http://arxiv.org/
abs/1506.02438.
[Sch+17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. “Proximal Policy Optimization
Algorithms”. In: (2017). arXiv: 1707.06347 [cs.LG]. url: http://arxiv.org/abs/1707.
06347.
[Sch+19] M. Schlegel, W. Chung, D. Graves, J. Qian, and M. White. “Importance Resampling for
Off-policy Prediction”. In: NIPS. June 2019. url: https://arxiv.org/abs/1906.04328.
[Sch19] J. Schmidhuber. “Reinforcement learning Upside Down: Don’t predict rewards – just map them
to actions”. In: arXiv [cs.AI] (Dec. 2019). url: http://arxiv.org/abs/1912.02875.
[Sch+20] J. Schrittwieser et al. “Mastering Atari, Go, Chess and Shogi by Planning with a Learned
Model”. In: Nature (2020). url: http://arxiv.org/abs/1911.08265.
[Sch+21a] M. Schmid et al. “Student of Games: A unified learning algorithm for both perfect and imperfect
information games”. In: Sci. Adv. (Dec. 2021). url: https://www.science.org/doi/10.1126/
sciadv.adg3256.
[Sch+21b] M. Schwarzer, A. Anand, R. Goel, R Devon Hjelm, A. Courville, and P. Bachman. “Data-
Efficient Reinforcement Learning with Self-Predictive Representations”. In: ICLR. 2021. url:
https://openreview.net/pdf?id=uCQfPZwRaUu.
[Sch+23a] I. Schubert, J. Zhang, J. Bruce, S. Bechtle, E. Parisotto, M. Riedmiller, J. T. Springenberg,
A. Byravan, L. Hasenclever, and N. Heess. “A Generalist Dynamics Model for Control”. In:
arXiv [cs.AI] (May 2023). url: http://arxiv.org/abs/2305.10912.
[Sch+23b] M. Schwarzer, J. Obando-Ceron, A. Courville, M. Bellemare, R. Agarwal, and P. S. Castro.
“Bigger, Better, Faster: Human-level Atari with human-level efficiency”. In: ICML. May 2023.
url: http://arxiv.org/abs/2305.19452.
[Sch+24] M. Schneider, R. Krug, N. Vaskevicius, L. Palmieri, and J. Boedecker. “The Surprising Ineffec-
tiveness of Pre-Trained Visual Representations for Model-Based Reinforcement Learning”. In:
NIPS. Nov. 2024. url: https://openreview.net/pdf?id=LvAy07mCxU.
[Sco10] S. Scott. “A modern Bayesian look at the multi-armed bandit”. In: Applied Stochastic Models
in Business and Industry 26 (2010), pp. 639–658.
[Sei+16] H. van Seijen, A. Rupam Mahmood, P. M. Pilarski, M. C. Machado, and R. S. Sutton. “True
Online Temporal-Difference Learning”. In: JMLR (2016). url: http://jmlr.org/papers/
volume17/15-599/15-599.pdf.
[Sey+22] T. Seyde, P. Werner, W. Schwarting, I. Gilitschenski, M. Riedmiller, D. Rus, and M. Wulfmeier.
“Solving Continuous Control via Q-learning”. In: ICLR. Sept. 2022. url: https://openreview.
net/pdf?id=U5XOGxAgccS.
[Sgf] “Simple, Good, Fast: Self-Supervised World Models Free of Baggage”. In: The Thirteenth
International Conference on Learning Representations. Oct. 2024. url: https://openreview.
net/pdf?id=yFGR36PLDJ.
[Sha+20] R. Shah, P. Freire, N. Alex, R. Freedman, D. Krasheninnikov, L. Chan, M. D. Dennis, P.
Abbeel, A. Dragan, and S. Russell. “Benefits of Assistance over Reward Learning”. In: NIPS
Workshop. 2020. url: https://aima.cs.berkeley.edu/~russell/papers/neurips20ws-
assistance.pdf.
[Sha+24] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. “DeepSeek-
Math: Pushing the limits of mathematical reasoning in open language models”. In: arXiv [cs.CL]
(Feb. 2024). url: http://arxiv.org/abs/2402.03300.
[Sha53] L. S. Shapley. “Stochastic games”. en. In: Proc. Natl. Acad. Sci. U. S. A. 39.10 (Oct. 1953),
pp. 1095–1100. url: https://pnas.org/doi/full/10.1073/pnas.39.10.1095.
[SHS20] S. Schmitt, M. Hessel, and K. Simonyan. “Off-Policy Actor-Critic with Shared Experience
Replay”. en. In: ICML. PMLR, Nov. 2020, pp. 8545–8554. url: https://proceedings.mlr.
press/v119/schmitt20a.html.
[Sie+20] N. Siegel, J. T. Springenberg, F. Berkenkamp, A. Abdolmaleki, M. Neunert, T. Lampe, R.
Hafner, N. Heess, and M. Riedmiller. “Keep Doing What Worked: Behavior Modelling Priors
for Offline Reinforcement Learning”. In: ICLR. 2020. url: https://openreview.net/pdf?id=
rke7geHtwH.
[Sil+14] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. “Deterministic
Policy Gradient Algorithms”. In: ICML. ICML’14. JMLR.org, 2014, pp. I–387–I–395. url:
http://dl.acm.org/citation.cfm?id=3044805.3044850.
[Sil+16] D. Silver et al. “Mastering the game of Go with deep neural networks and tree search”. en. In:
Nature 529.7587 (2016), pp. 484–489. url: http://dx.doi.org/10.1038/nature16961.
[Sil+17a] D. Silver et al. “Mastering the game of Go without human knowledge”. en. In: Nature 550.7676
(2017), pp. 354–359. url: http://dx.doi.org/10.1038/nature24270.
[Sil+17b] D. Silver et al. “The predictron: end-to-end learning and planning”. In: ICML. 2017. url:
https://openreview.net/pdf?id=BkJsCIcgl.
[Sil18] D. Silver. Lecture 9: Exploration and Exploitation. 2018. url: http://www0.cs.ucl.ac.uk/
staff/d.silver/web/Teaching_files/XX.pdf.
[Sil+18] D. Silver et al. “A general reinforcement learning algorithm that masters chess, shogi, and Go
through self-play”. en. In: Science 362.6419 (2018), pp. 1140–1144. url: http://dx.doi.org/
10.1126/science.aar6404.
[Sil+21] D. Silver, S. Singh, D. Precup, and R. S. Sutton. “Reward is enough”. en. In: Artif. Intell.
299.103535 (Oct. 2021), p. 103535. url: https://www.sciencedirect.com/science/article/
pii/S0004370221000862.
[Sin+00] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári. “Convergence Results for Single-
Step On-Policy Reinforcement-Learning Algorithms”. In: MLJ 38.3 (2000), pp. 287–308. url:
https://doi.org/10.1023/A:1007678930559.
[SK18] Z. Sunberg and M. Kochenderfer. “Online algorithms for POMDPs with continuous state, action,
and observation spaces”. In: ICAPS. 2018. url: https://arxiv.org/abs/1709.06196.
[Ska+22] J. Skalse, N. H. R. Howe, D. Krasheninnikov, and D. Krueger. “Defining and characterizing
reward hacking”. In: NIPS. Sept. 2022.
[SKM00] S. P. Singh, M. J. Kearns, and Y. Mansour. “Nash Convergence of Gradient Dynamics in
General-Sum Games”. en. In: UAI. June 2000, pp. 541–548. url: https://dl.acm.org/doi/
10.5555/647234.719924.
[SKM18] S. Schwöbel, S. Kiebel, and D. Marković. “Active Inference, Belief Propagation, and the Bethe
Approximation”. en. In: Neural Comput. 30.9 (2018), pp. 2530–2567. url: http://dx.doi.
org/10.1162/neco_a_01108.
[SLB08] Y. Shoham and K. Leyton-Brown. Multiagent systems: Algorithmic, game-theoretic, and logical
foundations. Cambridge, England: Cambridge University Press, Dec. 2008. url: https://www.
masfoundations.org/.
[Sli19] A. Slivkins. “Introduction to Multi-Armed Bandits”. In: Foundations and Trends in Machine
Learning (2019). url: http://arxiv.org/abs/1904.07272.
[Smi+23] F. B. Smith, A. Kirsch, S. Farquhar, Y. Gal, A. Foster, and T. Rainforth. “Prediction-Oriented
Bayesian Active Learning”. In: AISTATS. Apr. 2023. url: http://arxiv.org/abs/2304.
08151.
[Sne+24] C. Snell, J. Lee, K. Xu, and A. Kumar. “Scaling LLM test-time compute optimally can be
more effective than scaling model parameters”. In: arXiv [cs.LG] (Aug. 2024). url: http:
//arxiv.org/abs/2408.03314.
[Sok+21] S. Sokota, E. Lockhart, F. Timbers, E. Davoodi, R. D’Orazio, N. Burch, M. Schmid, M. Bowling,
and M. Lanctot. “Solving common-payoff games with approximate policy iteration”. In: AAAI.
Jan. 2021. url: https://arxiv.org/abs/2101.04237.
[Sok+22] S. Sokota, R. D’Orazio, J. Zico Kolter, N. Loizou, M. Lanctot, I. Mitliagkas, N. Brown, and
C. Kroer. “A Unified Approach to Reinforcement Learning, Quantal Response Equilibria, and
Two-Player Zero-Sum Games”. In: ICLR. Sept. 2022. url: https://openreview.net/forum?
id=DpE5UYUQzZH.
[Sok+23] S. Sokota, G. Farina, D. J. Wu, H. Hu, K. A. Wang, J. Z. Kolter, and N. Brown. “The
update-equivalence framework for decision-time planning”. In: arXiv [cs.AI] (Apr. 2023). url:
http://arxiv.org/abs/2304.13138.
[Sol64] R. J. Solomonoff. “A formal theory of inductive inference. Part I”. In: Information and Control
7.1 (Mar. 1964), pp. 1–22. url: https://www.sciencedirect.com/science/article/pii/
S0019995864902232.
[Son98] E. D. Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems. 2nd.
Vol. 6. Texts in Applied Mathematics. Springer, 1998.
[Spi+24] B. A. Spiegel, Z. Yang, W. Jurayj, B. Bachmann, S. Tellex, and G. Konidaris. “Informing
Reinforcement Learning Agents by Grounding Language to Markov Decision Processes”. In:
Workshop on Training Agents with Foundation Models at RLC 2024. Aug. 2024. url: https:
//openreview.net/pdf?id=uFm9e4Ly26.
[Spr17] M. W. Spratling. “A review of predictive coding algorithms”. en. In: Brain Cogn. 112 (2017),
pp. 92–97. url: http://dx.doi.org/10.1016/j.bandc.2015.11.003.
[SPS99] R. S. Sutton, D. Precup, and S. Singh. “Between MDPs and semi-MDPs: A framework for
temporal abstraction in reinforcement learning”. In: Artif. Intell. 112.1 (Aug. 1999), pp. 181–211.
url: http://www.sciencedirect.com/science/article/pii/S0004370299000521.
[Sri+18] S. Srinivasan, M. Lanctot, V. Zambaldi, J. Pérolat, K. Tuyls, R. Munos, and M. Bowling.
“Actor-Critic Policy Optimization in Partially Observable Multiagent Environments”. In: NIPS.
2018.
[SS21] D. Schmidt and T. Schmied. “Fast and Data-Efficient Training of Rainbow: an Experimental
Study on Atari”. In: Deep RL Workshop NeurIPS 2021. Dec. 2021. url: https://openreview.
net/pdf?id=GvM7A3cv63M.
[SS25] D. Silver and R. S. Sutton. Welcome to the era of experience. 2025. url: https://storage.
googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%
20Paper.pdf.
[SSM08] R. S. Sutton, C. Szepesvári, and H. R. Maei. “A convergent O(n) algorithm for off-policy temporal-
difference learning with linear function approximation”. en. In: NIPS. NIPS’08. Red Hook, NY,
USA: Curran Associates Inc., Dec. 2008, pp. 1609–1616. url: https://proceedings.neurips.
cc/paper_files/paper/2008/file/e0c641195b27425bb056ac56f8953d24-Paper.pdf.
[SSTH22] S. Schmitt, J. Shawe-Taylor, and H. van Hasselt. “Chaining value functions for off-policy learning”.
en. In: AAAI. Vol. 36. Association for the Advancement of Artificial Intelligence (AAAI), June
2022, pp. 8187–8195. url: https://ojs.aaai.org/index.php/AAAI/article/view/20792.
[SSTVH23] S. Schmitt, J. Shawe-Taylor, and H. Van Hasselt. “Exploration via Epistemic Value Estimation”.
en. In: AAAI 37.8 (June 2023), pp. 9742–9751. url: https://ojs.aaai.org/index.php/
AAAI/article/view/26164.
[Str00] M. Strens. “A Bayesian Framework for Reinforcement Learning”. In: ICML. 2000.
[Sub+22] J. Subramanian, A. Sinha, R. Seraj, and A. Mahajan. “Approximate information state for
approximate planning and reinforcement learning in partially observed systems”. In: JMLR
23.12 (2022), pp. 1–83. url: http://jmlr.org/papers/v23/20-1165.html.
[Sun+17] P. Sunehag et al. “Value-decomposition networks for cooperative multi-agent learning”. In: arXiv
[cs.AI] (June 2017). url: http://arxiv.org/abs/1706.05296.
[Sun+24] F.-Y. Sun, S. I. Harini, A. Yi, Y. Zhou, A. Zook, J. Tremblay, L. Cross, J. Wu, and N. Haber.
“FactorSim: Generative Simulation via Factorized Representation”. In: NIPS. Nov. 2024. url:
https://openreview.net/forum?id=wBzvYh3PRA.
[Sut04] R. Sutton. “The reward hypothesis”. In: (2004). url: http://incompleteideas.net/rlai.cs.
ualberta.ca/RLAI/rewardhypothesis.html.
[Sut+08] R. S. Sutton, C. Szepesvari, A. Geramifard, and M. P. Bowling. “Dyna-style planning with
linear function approximation and prioritized sweeping”. In: UAI. 2008.
[Sut+11] R. Sutton, J. Modayil, M. Delp, T. Degris, P. Pilarski, A. White, and D. Precup. “Horde: a scalable
real-time architecture for learning knowledge from unsupervised sensorimotor interaction”. In:
Adapt Agent Multi-agent Syst (May 2011), pp. 761–768. url: https://www.semanticscholar.
org/paper/Horde%3A-a-scalable-real-time-architecture-for-from-Sutton-Modayil/
50e9a441f56124b7b969e6537b66469a0e1aa707.
[Sut15] R. Sutton. Introduction to RL with function approximation. NIPS Tutorial. 2015. url: http://media.nips.cc/Conferences/2015/tutorialslides/SuttonIntroRL-nips-2015-tutorial.pdf.
[Sut88] R. Sutton. “Learning to predict by the methods of temporal differences”. In: Machine Learning
3.1 (1988), pp. 9–44.
[Sut90] R. S. Sutton. “Integrated Architectures for Learning, Planning, and Reacting Based on Ap-
proximating Dynamic Programming”. In: ICML. Ed. by B. Porter and R. Mooney. Morgan
Kaufmann, 1990, pp. 216–224. url: http://www.sciencedirect.com/science/article/
pii/B9781558601413500304.
[Sut95] R. S. Sutton. “TD models: Modeling the world at a mixture of time scales”. en. In: ICML.
Jan. 1995, pp. 531–539. url: https://www.sciencedirect.com/science/article/abs/pii/
B9781558603776500724.
[Sut96] R. S. Sutton. “Generalization in Reinforcement Learning: Successful Examples Using Sparse
Coarse Coding”. In: NIPS. Ed. by D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo. MIT
Press, 1996, pp. 1038–1044. url: http://papers.nips.cc/paper/1109-generalization-
in-reinforcement-learning-successful-examples-using-sparse-coarse-coding.pdf.
[Sut+99] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. “Policy Gradient Methods for Reinforcement
Learning with Function Approximation”. In: NIPS. 1999.
[SV10] D. Silver and J. Veness. “Monte-Carlo Planning in Large POMDPs”. In: Advances in Neural
Information Processing Systems. Vol. 23. 2010. url: https://proceedings.neurips.cc/
paper/2010/hash/edfbe1afcf9246bb0d40eb4d8027d90f-Abstract.html.
[SW06] J. E. Smith and R. L. Winkler. “The Optimizer’s Curse: Skepticism and Postdecision Surprise
in Decision Analysis”. In: Manage. Sci. 52.3 (2006), pp. 311–322.
[Sze10] C. Szepesvari. Algorithms for Reinforcement Learning. Morgan Claypool, 2010.
[Tam+16] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel. “Value Iteration Networks”. In: NIPS.
2016. url: http://arxiv.org/abs/1602.02867.
[Tan+23] Y. Tang et al. “Understanding Self-Predictive Learning for Reinforcement Learning”. In: ICML.
2023. url: https://proceedings.mlr.press/v202/tang23d/tang23d.pdf.
[Tan+24] H. Tang, K. Hu, J. P. Zhou, S. C. Zhong, W.-L. Zheng, X. Si, and K. Ellis. “Code Repair
with LLMs gives an Exploration-Exploitation Tradeoff”. In: NIPS. Nov. 2024. url: https:
//openreview.net/pdf?id=o863gX6DxA.
[TCG21] Y. Tian, X. Chen, and S. Ganguli. “Understanding self-supervised Learning Dynamics without
Contrastive Pairs”. In: ICML. Feb. 2021. url: http://arxiv.org/abs/2102.06810.
[Ten02] R. I. Brafman and M. Tennenholtz. “R-max – A General Polynomial Time Algorithm for Near-Optimal
Reinforcement Learning”. In: JMLR 3 (2002), pp. 213–231. url: http://www.ai.mit.edu/
projects/jmlr/papers/volume3/brafman02a/source/brafman02a.pdf.
[TG96] G. Tesauro and G. R. Galperin. “On-line Policy Improvement using Monte-Carlo Search”. In:
NIPS. 1996. url: https://arxiv.org/abs/2501.05407.
[Tha+22] S. Thakoor, M. Rowland, D. Borsa, W. Dabney, R. Munos, and A. Barreto. “Generalised
Policy Improvement with Geometric Policy Composition”. en. In: ICML. PMLR, June 2022,
pp. 21272–21307. url: https://proceedings.mlr.press/v162/thakoor22a.html.
[Tho33] W. R. Thompson. “On the Likelihood that One Unknown Probability Exceeds Another in View
of the Evidence of Two Samples”. In: Biometrika 25.3/4 (1933), pp. 285–294.
[Tim+20] F. Timbers, N. Bard, E. Lockhart, M. Lanctot, M. Schmid, N. Burch, J. Schrittwieser, T.
Hubert, and M. Bowling. “Approximate exploitability: Learning a best response in large games”.
In: arXiv [cs.LG] (Apr. 2020). url: http://arxiv.org/abs/2004.09677.
[TKE24] H. Tang, D. Y. Key, and K. Ellis. “WorldCoder, a Model-Based LLM Agent: Building World
Models by Writing Code and Interacting with the Environment”. In: NIPS. Nov. 2024. url:
https://openreview.net/pdf?id=QGJSXMhVaL.
[TL05] E. Todorov and W. Li. “A Generalized Iterative LQG Method for Locally-optimal Feedback
Control of Constrained Nonlinear Stochastic Systems”. In: ACC. 2005, pp. 300–306.
[TLO23] J. Tarbouriech, T. Lattimore, and B. O’Donoghue. “Probabilistic Inference in Reinforcement
Learning Done Right”. In: NIPS. Nov. 2023. url: https://openreview.net/pdf?id=9yQ2aaArDn.
[TMM19] C. Tessler, D. J. Mankowitz, and S. Mannor. “Reward Constrained Policy Optimization”. In:
ICLR. 2019. url: https://openreview.net/pdf?id=SkfrvsA9FX.
[TO21] A. Touati and Y. Ollivier. “Learning one representation to optimize all rewards”. In: NIPS. Mar.
2021. url: https://openreview.net/pdf?id=q_eWErV46er.
[Tom+20] M. Tomar, L. Shani, Y. Efroni, and M. Ghavamzadeh. “Mirror descent policy optimization”. In:
arXiv [cs.LG] (May 2020). url: http://arxiv.org/abs/2005.09814.
[Tom+22] T. Tomilin, T. Dai, M. Fang, and M. Pechenizkiy. “LevDoom: A benchmark for generalization
on level difficulty in reinforcement learning”. In: 2022 IEEE Conference on Games (CoG). IEEE,
Aug. 2022. url: https://ieee-cog.org/2022/assets/papers/paper_30.pdf.
[Tom+23] M. Tomar, U. A. Mishra, A. Zhang, and M. E. Taylor. “Learning Representations for Pixel-based
Control: What Matters and Why?” In: Transactions on Machine Learning Research (2023).
url: https://openreview.net/pdf?id=wIXHG8LZ2w.
[Tom+24] M. Tomar, P. Hansen-Estruch, P. Bachman, A. Lamb, J. Langford, M. E. Taylor, and S. Levine.
“Video Occupancy Models”. In: arXiv [cs.CV] (June 2024). url: http://arxiv.org/abs/2407.
09533.
[Tou09] M. Toussaint. “Robot Trajectory Optimization using Approximate Inference”. In: ICML. 2009,
pp. 1049–1056.
[Tou14] M. Toussaint. Bandits, Global Optimization, Active Learning, and Bayesian RL – understanding
the common ground. Autonomous Learning Summer School. 2014. url: https://www.user.
tu-berlin.de/mtoussai/teaching/14-BanditsOptimizationActiveLearningBayesianRL.
pdf.
[TR97] J. Tsitsiklis and B. V. Roy. “An analysis of temporal-difference learning with function approxi-
mation”. In: IEEE Trans. on Automatic Control 42.5 (1997), pp. 674–690.
[TRO23] A. Touati, J. Rapin, and Y. Ollivier. “Does Zero-Shot Reinforcement Learning Exist?” In: ICLR.
2023. url: https://openreview.net/forum?id=MYEap_OcQI.
[TS06] M. Toussaint and A. Storkey. “Probabilistic inference for solving discrete and continuous state
Markov Decision Processes”. In: ICML. 2006, pp. 945–952.
[TS11] E. Talvitie and S. Singh. “Learning to make predictions in partially observable environments
without a generative model”. en. In: JAIR 42 (Sept. 2011), pp. 353–392. url: https://jair.
org/index.php/jair/article/view/10729.
[Tsc+20] A. Tschantz, B. Millidge, A. K. Seth, and C. L. Buckley. “Reinforcement learning through
active inference”. In: ICLR workshop on “Bridging AI and Cognitive Science”. Feb. 2020.
[Tsc+23] A. Tschantz, B. Millidge, A. K. Seth, and C. L. Buckley. “Hybrid predictive coding: Inferring,
fast and slow”. en. In: PLoS Comput. Biol. 19.8 (Aug. 2023), e1011280. url: https : / /
journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1011280&
type=printable.
[Tsi+17] P. A. Tsividis, T. Pouncy, J. L. Xu, J. B. Tenenbaum, and S. J. Gershman. “Human Learning
in Atari”. en. In: AAAI Spring Symposium Series. 2017. url: https://www.aaai.org/ocs/
index.php/SSS/SSS17/paper/viewPaper/15280.
[TVR97] J. N. Tsitsiklis and B. Van Roy. “An analysis of temporal-difference learning with function
approximation”. en. In: IEEE Trans. Automat. Contr. 42.5 (May 1997), pp. 674–690. url:
https://ieeexplore.ieee.org/abstract/document/580874.
[TWM25] Y. Tang, S. Wang, and R. Munos. “Learning to chain-of-thought with Jensen’s evidence lower
bound”. In: arXiv [cs.LG] (Mar. 2025). url: http://arxiv.org/abs/2503.19618.
[Unk24] Unknown. “Beyond The Rainbow: High Performance Deep Reinforcement Learning On A
Desktop PC”. In: (Oct. 2024). url: https://openreview.net/pdf?id=0ydseYDKRi.
[Val00] H. Valpola. “Bayesian Ensemble Learning for Nonlinear Factor Analysis”. PhD thesis. Helsinki
University of Technology, 2000. url: https://users.ics.aalto.fi/harri/thesis/valpola_
thesis.ps.gz.
[van+18] H. van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil. Deep Reinforcement
Learning and the Deadly Triad. arXiv:1812.02648. 2018.
[Vas+21] S. Vaswani, O. Bachem, S. Totaro, R. Mueller, S. Garg, M. Geist, M. C. Machado, P. S. Castro,
and N. L. Roux. “A general class of surrogate functions for stable and efficient reinforcement
learning”. In: arXiv [cs.LG] (Aug. 2021). url: http://arxiv.org/abs/2108.05828.
[VBW15] S. S. Villar, J. Bowden, and J. Wason. “Multi-armed Bandit Models for the Optimal Design
of Clinical Trials: Benefits and Challenges”. en. In: Stat. Sci. 30.2 (2015), pp. 199–215. url:
http://dx.doi.org/10.1214/14-STS504.
[Vee+19] V. Veeriah, M. Hessel, Z. Xu, J. Rajendran, R. L. Lewis, J. Oh, H. P. van Hasselt, D. Silver, and S.
Singh. “Discovery of Useful Questions as Auxiliary Tasks”. In: NIPS. Vol. 32. 2019. url: https://
proceedings.neurips.cc/paper_files/paper/2019/file/10ff0b5e85e5b85cc3095d431d8c08b4-
Paper.pdf.
[Ven+24] D. Venuto, S. N. Islam, M. Klissarov, D. Precup, S. Yang, and A. Anand. “Code as re-
ward: Empowering reinforcement learning with VLMs”. In: ICML. Feb. 2024. url: https:
//openreview.net/forum?id=6P88DMUDvH.
[Vez+17] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu.
“FeUdal Networks for Hierarchical Reinforcement Learning”. en. In: ICML. PMLR, July 2017,
pp. 3540–3549. url: https://proceedings.mlr.press/v70/vezhnevets17a.html.
[Vil+22] A. R. Villaflor, Z. Huang, S. Pande, J. M. Dolan, and J. Schneider. “Addressing Optimism
Bias in Sequence Modeling for Reinforcement Learning”. en. In: ICML. PMLR, June 2022,
pp. 22270–22283. url: https://proceedings.mlr.press/v162/villaflor22a.html.
[Vin+19] O. Vinyals et al. “Grandmaster level in StarCraft II using multi-agent reinforcement learning”.
en. In: Nature 575.7782 (Nov. 2019), pp. 350–354. url: http://dx.doi.org/10.1038/s41586-
019-1724-z.
[VPG20] N. Vieillard, O. Pietquin, and M. Geist. “Munchausen Reinforcement Learning”. In: NIPS.
Vol. 33. 2020, pp. 4235–4246. url: https://proceedings.neurips.cc/paper_files/paper/
2020/file/2c6a0bae0f071cbbf0bb3d5b11d90a82-Paper.pdf.
[Vri+25] B. de Vries et al. “Expected Free Energy-based planning as variational inference”. In: arXiv
[stat.ML] (Apr. 2025). url: http://arxiv.org/abs/2504.14898.
[Wag+19] N. Wagener, C.-A. Cheng, J. Sacks, and B. Boots. “An online learning approach to model
predictive control”. In: Robotics: Science and Systems. Feb. 2019. url: https://arxiv.org/
abs/1902.08967.
[Wan+16] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. “Dueling Network
Architectures for Deep Reinforcement Learning”. In: ICML. 2016. url: http://proceedings.
mlr.press/v48/wangf16.pdf.
[Wan+19] T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel,
and J. Ba. “Benchmarking Model-Based Reinforcement Learning”. In: arXiv [cs.LG] (July 2019).
url: http://arxiv.org/abs/1907.02057.
[Wan+22] T. Wang, S. S. Du, A. Torralba, P. Isola, A. Zhang, and Y. Tian. “Denoised MDPs: Learning
World Models Better Than the World Itself”. In: ICML. June 2022. url: http://arxiv.org/
abs/2206.15477.
[Wan+24a] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar.
“Voyager: An Open-Ended Embodied Agent with Large Language Models”. In: TMLR (2024).
url: https://openreview.net/forum?id=ehfRiF0R3a.
[Wan+24b] S. Wang, S. Liu, W. Ye, J. You, and Y. Gao. “EfficientZero V2: Mastering discrete and continuous
control with limited data”. In: arXiv [cs.LG] (Mar. 2024). url: http://arxiv.org/abs/2403.
00564.
[Wan+25] Z. Wang et al. “RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforce-
ment learning”. In: arXiv [cs.LG] (Apr. 2025). url: http://arxiv.org/abs/2504.20073.
[WAT17] G. Williams, A. Aldrich, and E. A. Theodorou. “Model Predictive Path Integral Control: From
Theory to Parallel Computation”. In: J. Guid. Control Dyn. 40.2 (Feb. 2017), pp. 344–357. url:
https://doi.org/10.2514/1.G001921.
[Wat+21] J. Watson, H. Abdulsamad, R. Findeisen, and J. Peters. “Stochastic Control through Approxi-
mate Bayesian Input Inference”. In: arXiv (2021). url: http://arxiv.org/abs/2105.07693.
[Wau+15] K. Waugh, D. Morrill, J. Bagnell, and M. Bowling. “Solving games with functional regret
estimation”. en. In: AAAI 29.1 (Feb. 2015). url: https://ojs.aaai.org/index.php/AAAI/
article/view/9445.
[WCM24] C. Wang, Y. Chen, and K. Murphy. “Model-based Policy Optimization under Approximate
Bayesian Inference”. en. In: AISTATS. PMLR, Apr. 2024, pp. 3250–3258. url: https://proceedings.mlr.press/v238/wang24g.html.
[WD92] C. Watkins and P. Dayan. “Q-learning”. In: Machine Learning 8.3 (1992), pp. 279–292.
[Web+17] T. Weber et al. “Imagination-Augmented Agents for Deep Reinforcement Learning”. In: NIPS.
2017. url: http://arxiv.org/abs/1707.06203.
[Wei+22] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. “Chain of Thought
Prompting Elicits Reasoning in Large Language Models”. In: arXiv [cs.CL] (Jan. 2022). url:
http://arxiv.org/abs/2201.11903.
[Wei+24] R. Wei, N. Lambert, A. McDonald, A. Garcia, and R. Calandra. “A unified view on solving
objective mismatch in model-based Reinforcement Learning”. In: Trans. on Machine Learning
Research (2024). url: https://openreview.net/forum?id=tQVZgvXhZb.
[Wen18a] L. Weng. “A (Long) Peek into Reinforcement Learning”. In: lilianweng.github.io (2018). url:
https://lilianweng.github.io/posts/2018-02-19-rl-overview/.
[Wen18b] L. Weng. “Policy Gradient Algorithms”. In: lilianweng.github.io (2018). url: https : / /
lilianweng.github.io/posts/2018-04-08-policy-gradient/.
[WHT19] Y. Wang, H. He, and X. Tan. “Truly Proximal Policy Optimization”. In: UAI. 2019. url:
http://auai.org/uai2019/proceedings/papers/21.pdf.
[WHZ23] Z. Wang, J. J. Hunt, and M. Zhou. “Diffusion Policies as an Expressive Policy Class for Offline
Reinforcement Learning”. In: ICLR. 2023. url: https://openreview.net/pdf?id=AHvFDPi-
FA.
[Wie03] E. Wiewiora. “Potential-Based Shaping and Q-Value Initialization are Equivalent”. In: JAIR.
2003. url: https://jair.org/index.php/jair/article/view/10338.
[Wil+17] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou.
“Information theoretic MPC for model-based reinforcement learning”. In: ICRA. IEEE, May
2017, pp. 1714–1721. url: https://ieeexplore.ieee.org/document/7989202.
[Wil92] R. J. Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement
learning”. In: MLJ 8.3-4 (1992), pp. 229–256.
[WIP20] J. Watson, A. Imohiosen, and J. Peters. “Active Inference or Control as Inference? A Unifying
View”. In: International Workshop on Active Inference. 2020. url: http://arxiv.org/abs/
2010.00262.
[Wit+20] C. S. de Witt, T. Gupta, D. Makoviichuk, V. Makoviychuk, P. H. S. Torr, M. Sun, and S.
Whiteson. “Is independent learning all you need in the StarCraft multi-agent challenge?” In:
arXiv [cs.AI] (Nov. 2020). url: http://arxiv.org/abs/2011.09533.
[WNS21] Y. Wan, A. Naik, and R. S. Sutton. “Learning and planning in average-reward Markov decision
processes”. In: ICML. 2021. url: https://arxiv.org/abs/2006.16318.
[Won+22] A. Wong, T. Bäck, A. V. Kononova, and A. Plaat. “Deep multiagent reinforcement learning:
challenges and directions”. en. In: Artif. Intell. Rev. 56.6 (Oct. 2022), pp. 5023–5056. url:
https://link.springer.com/article/10.1007/s10462-022-10299-x.
[Won+23] L. Wong, J. Mao, P. Sharma, Z. S. Siegel, J. Feng, N. Korneev, J. B. Tenenbaum, and J. Andreas.
“Learning adaptive planning representations with natural language guidance”. In: arXiv [cs.AI]
(Dec. 2023). url: http://arxiv.org/abs/2312.08566.
[Wu+17] Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. “Scalable trust-region method for deep
reinforcement learning using Kronecker-factored approximation”. In: NIPS. 2017. url: https:
//arxiv.org/abs/1708.05144.
[Wu+21] Y. Wu, S. Zhai, N. Srivastava, J. Susskind, J. Zhang, R. Salakhutdinov, and H. Goh. “Uncertainty
Weighted Actor-critic for offline Reinforcement Learning”. In: ICML. May 2021. url: https:
//arxiv.org/abs/2105.08140.
[Wu+22] P. Wu, A. Escontrela, D. Hafner, K. Goldberg, and P. Abbeel. “DayDreamer: World Models
for Physical Robot Learning”. In: (June 2022). arXiv: 2206.14176 [cs.RO]. url: http://arxiv.org/abs/2206.14176.
[Wu+23] G. Wu, W. Fang, J. Wang, P. Ge, J. Cao, Y. Ping, and P. Gou. “Dyna-PPO reinforcement
learning with Gaussian process for the continuous action decision-making in autonomous driving”.
en. In: Appl. Intell. 53.13 (July 2023), pp. 16893–16907. url: https://link.springer.com/
article/10.1007/s10489-022-04354-x.
[Wur+22] P. R. Wurman et al. “Outracing champion Gran Turismo drivers with deep reinforcement
learning”. en. In: Nature 602.7896 (Feb. 2022), pp. 223–228. url: https://www.researchgate.
net/publication/358484368_Outracing_champion_Gran_Turismo_drivers_with_deep_
reinforcement_learning.
[Xu+17] C. Xu, T. Qin, G. Wang, and T.-Y. Liu. “Reinforcement learning for learning rate control”. In:
arXiv [cs.LG] (May 2017). url: http://arxiv.org/abs/1705.11159.
[Xu+22] T. Xu, Z. Yang, Z. Wang, and Y. Liang. “A Unifying Framework of Off-Policy General
Value Function Evaluation”. In: NIPS. Oct. 2022. url: https://openreview.net/pdf?id=
LdKdbHw3A_6.
[Xu+25] F. Xu et al. “Towards large reasoning models: A survey of reinforced reasoning with Large
Language Models”. In: arXiv [cs.AI] (Jan. 2025). url: http://arxiv.org/abs/2501.09686.
[Yan+23] M. Yang, D. Schuurmans, P. Abbeel, and O. Nachum. “Dichotomy of control: Separating
what you can control from what you cannot”. In: ICLR. Vol. abs/2210.13435. 2023. url:
https://github.com/google-research/google-research/tree/.
[Yan+24] S. Yang, Y. Du, S. K. S. Ghasemipour, J. Tompson, L. P. Kaelbling, D. Schuurmans, and
P. Abbeel. “Learning Interactive Real-World Simulators”. In: ICLR. 2024. url: https://
openreview.net/pdf?id=sFyTZEqmUY.
[Yao+22] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. “ReAct: Synergizing
Reasoning and Acting in Language Models”. In: ICLR. Sept. 2022. url: https://openreview.
net/pdf?id=WE_vluYUL-X.
[Ye+21] W. Ye, S. Liu, T. Kurutach, P. Abbeel, and Y. Gao. “Mastering Atari games with limited data”.
In: NIPS. Oct. 2021.
[Yer+23] T. Yerxa, Y. Kuang, E. P. Simoncelli, and S. Chung. “Learning efficient coding of natural
images with maximum manifold capacity representations”. In: NIPS abs/2303.03307 (2023),
pp. 24103–24128. url: https://proceedings.neurips.cc/paper_files/paper/2023/hash/
4bc6e94f2308c888fb69626138a2633e-Abstract-Conference.html.
[YKSR23] T. Yamagata, A. Khalil, and R. Santos-Rodriguez. “Q-learning Decision Transformer: Leveraging
Dynamic Programming for conditional sequence modelling in offline RL”. In: ICML. 2023. url:
https://arxiv.org/abs/2209.03993.
[Yu17] H. Yu. “On convergence of some gradient-based temporal-differences algorithms for off-policy
learning”. In: arXiv [cs.LG] (Dec. 2017). url: http://arxiv.org/abs/1712.09652.
[Yu+20a] T. Yu, G. Thomas, L. Yu, S. Ermon, J. Y. Zou, S. Levine, C. Finn, and T. Ma. “MOPO: Model-
based Offline Policy Optimization”. In: NIPS. Vol. 33. 2020, pp. 14129–14142. url: https://
proceedings.neurips.cc/paper_files/paper/2020/hash/a322852ce0df73e204b7e67cbbef0d0a-
Abstract.html.
[Yu+20b] Y. Yu, K. H. R. Chan, C. You, C. Song, and Y. Ma. “Learning diverse and discriminative
representations via the principle of Maximal Coding Rate Reduction”. In: NIPS abs/2006.08558
(June 2020), pp. 9422–9434. url: https://proceedings.neurips.cc/paper_files/paper/
2020/hash/6ad4174eba19ecb5fed17411a34ff5e6-Abstract.html.
[Yu+22] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu. “The surprising effectiveness
of PPO in cooperative, multi-agent games”. In: NeurIPS 2022 Datasets and Benchmarks. 2022.
url: https://arxiv.org/abs/2103.01955.
[Yu+23] C. Yu, N. Burgess, M. Sahani, and S. Gershman. “Successor-Predecessor Intrinsic Exploration”. In:
NIPS. Vol. abs/2305.15277. Curran Associates, Inc., May 2023, pp. 73021–73038. url: https://
proceedings.neurips.cc/paper_files/paper/2023/hash/e6f2b968c4ee8ba260cd7077e39590dd-
Abstract-Conference.html.
[Yu+25] Q. Yu et al. “DAPO: An open-source LLM reinforcement learning system at scale”. In: arXiv
[cs.LG] (Mar. 2025). url: http://arxiv.org/abs/2503.14476.
[Yua22] M. Yuan. “Intrinsically-motivated reinforcement learning: A brief introduction”. In: arXiv
[cs.LG] (Mar. 2022). url: http://arxiv.org/abs/2203.02298.
[Yua+24] M. Yuan, R. C. Castanyer, B. Li, X. Jin, G. Berseth, and W. Zeng. “RLeXplore: Accelerating
research in intrinsically-motivated reinforcement learning”. In: arXiv [cs.LG] (May 2024). url:
http://arxiv.org/abs/2405.19548.
[Yue+25] Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y. Yue, S. Song, and G. Huang. “Does reinforcement
Learning really incentivize reasoning capacity in LLMs beyond the base model?” In: arXiv
[cs.AI] (Apr. 2025). url: http://arxiv.org/abs/2504.13837.
[YW20] Y. Yang and J. Wang. “An overview of multi-agent reinforcement learning from game theoretical
perspective”. In: arXiv [cs.MA] (Nov. 2020). url: http://arxiv.org/abs/2011.00583.
[YWW25] M. Yin, M. Wang, and Y.-X. Wang. “On the statistical complexity for offline and low-adaptive
reinforcement learning with structures”. In: arXiv [cs.LG] (Jan. 2025). url: http://arxiv.
org/abs/2501.02089.
[YZ22] Y. Yang and P. Zhai. “Click-through rate prediction in online advertising: A literature review”.
In: Inf. Process. Manag. 59.2 (2022), p. 102853. url: https://www.sciencedirect.com/
science/article/pii/S0306457321003241.
[ZABD10] B. D. Ziebart, J Andrew Bagnell, and A. K. Dey. “Modeling Interaction via the Principle of
Maximum Causal Entropy”. In: ICML. 2010. url: https://www.cs.uic.edu/pub/Ziebart/
Publications/maximum-causal-entropy.pdf.
[Zbo+21] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny. “Barlow Twins: Self-Supervised Learning
via Redundancy Reduction”. In: ICML. Mar. 2021. url: https://arxiv.org/abs/2103.03230.
[Zel+24] E. Zelikman, G. Harik, Y. Shao, V. Jayasiri, N. Haber, and N. D. Goodman. “Quiet-STaR:
Language Models Can Teach Themselves to Think Before Speaking”. In: arXiv [cs.CL] (Mar.
2024). url: http://arxiv.org/abs/2403.09629.
[Zha+18] R. Zhang, C. Chen, C. Li, and L. Carin. “Policy Optimization as Wasserstein Gradient Flows”. en.
In: ICML. July 2018, pp. 5737–5746. url: https://proceedings.mlr.press/v80/zhang18a.
html.
[Zha+19] S. Zhang, B. Liu, H. Yao, and S. Whiteson. “Provably convergent two-timescale off-policy actor-
critic with function approximation”. In: ICML 119 (Nov. 2019). Ed. by H. Daumé III and A. Singh,
pp. 11204–11213. url: https://proceedings.mlr.press/v119/zhang20s/zhang20s.pdf.
[Zha+21] A. Zhang, R. T. McAllister, R. Calandra, Y. Gal, and S. Levine. “Learning Invariant Repre-
sentations for Reinforcement Learning without Reconstruction”. In: ICLR. 2021. url: https:
//openreview.net/pdf?id=-2FCwDKRREu.
[Zha+23a] J. Zhang, J. T. Springenberg, A. Byravan, L. Hasenclever, A. Abdolmaleki, D. Rao, N. Heess,
and M. Riedmiller. “Leveraging Jumpy Models for Planning and Fast Learning in Robotic
Domains”. In: arXiv [cs.RO] (Feb. 2023). url: http://arxiv.org/abs/2302.12617.
[Zha+23b] W. Zhang, G. Wang, J. Sun, Y. Yuan, and G. Huang. “STORM: Efficient Stochastic Transformer
based world models for reinforcement learning”. In: arXiv [cs.LG] (Oct. 2023). url: http:
//arxiv.org/abs/2310.09615.
[Zha+24a] S. Zhao, R. Brekelmans, A. Makhzani, and R. Grosse. “Probabilistic inference in language
models via twisted Sequential Monte Carlo”. In: ICML. Apr. 2024. url: https://arxiv.org/
abs/2404.17546.
[Zha+24b] S. Zhao, R. Brekelmans, A. Makhzani, and R. B. Grosse. “Probabilistic Inference in Language
Models via Twisted Sequential Monte Carlo”. In: ICML. June 2024. url: https://openreview.
net/pdf?id=frA0NNBS1n.
[Zha+25] A. Zhao et al. “Absolute Zero: Reinforced self-play reasoning with zero data”. In: arXiv [cs.LG]
(May 2025). url: http://arxiv.org/abs/2505.03335.
[Zhe+22a] L. Zheng, T. Fiez, Z. Alumbaugh, B. Chasnov, and L. J. Ratliff. “Stackelberg actor-critic: Game-
theoretic reinforcement learning algorithms”. en. In: AAAI 36.8 (June 2022), pp. 9217–9224.
url: https://ojs.aaai.org/index.php/AAAI/article/view/20908.
[Zhe+22b] R. Zheng, X. Wang, H. Xu, and F. Huang. “Is Model Ensemble Necessary? Model-based RL
via a Single Model with Lipschitz Regularized Value Function”. In: ICLR. Sept. 2022. url:
https://openreview.net/pdf?id=hNyJBk3CwR.
[Zhe+24] Q. Zheng, M. Henaff, A. Zhang, A. Grover, and B. Amos. “Online intrinsic rewards for decision
making agents from large language model feedback”. In: arXiv [cs.LG] (Oct. 2024). url:
http://arxiv.org/abs/2410.23022.
[Zho+24a] G. Zhou, H. Pan, Y. LeCun, and L. Pinto. “DINO-WM: World models on pre-trained visual
features enable zero-shot planning”. In: arXiv [cs.RO] (Nov. 2024). url: http://arxiv.org/
abs/2411.04983.
[Zho+24b] G. Zhou, S. Swaminathan, R. V. Raju, J. S. Guntupalli, W. Lehrach, J. Ortiz, A. Dedieu,
M. Lázaro-Gredilla, and K. Murphy. “Diffusion Model Predictive Control”. In: arXiv [cs.LG]
(Oct. 2024). url: http://arxiv.org/abs/2410.05364.
[Zho+24c] Z. Zhou, B. Hu, C. Zhao, P. Zhang, and B. Liu. “Large language model as a policy teacher for
training reinforcement learning agents”. In: IJCAI. 2024. url: https://arxiv.org/abs/2311.
13373.
[ZHR24] H. Zhu, B. Huang, and S. Russell. “On representation complexity of model-based and model-free
reinforcement learning”. In: ICLR. 2024.
[Zie+08] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. “Maximum Entropy Inverse Rein-
forcement Learning”. In: AAAI. 2008, pp. 1433–1438.
[Zin+07] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. “Regret minimization in games with
incomplete information”. en. In: NIPS. Dec. 2007, pp. 1729–1736. url: https://dl.acm.org/
doi/10.5555/2981562.2981779.
[Zin+21] L. Zintgraf, S. Schulze, C. Lu, L. Feng, M. Igl, K. Shiarlis, Y. Gal, K. Hofmann, and S. Whiteson.
“VariBAD: Variational Bayes-Adaptive Deep RL via meta-learning”. In: J. Mach. Learn. Res.
22.289 (2021), 289:1–289:39. url: https://www.jmlr.org/papers/volume22/21-0657/21-
0657.pdf.
[Zit+23] B. Zitkovich et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic
Control”. en. In: Conference on Robot Learning. PMLR, Dec. 2023, pp. 2165–2183. url:
https://proceedings.mlr.press/v229/zitkovich23a.html.
[ZS22] N. Zucchet and J. Sacramento. “Beyond backpropagation: Bilevel optimization through implicit
differentiation and equilibrium propagation”. en. In: Neural Comput. 34.12 (Nov. 2022), pp. 2309–
2346. url: https://direct.mit.edu/neco/article-pdf/34/12/2309/2057431/neco_a_
01547.pdf.
[ZSE24] C. Zheng, R. Salakhutdinov, and B. Eysenbach. “Contrastive Difference Predictive Coding”.
In: The Twelfth International Conference on Learning Representations. 2024. url: https:
//openreview.net/pdf?id=0akLDTFR9x.
[ZW19] S. Zhang and S. Whiteson. “DAC: The Double Actor-Critic Architecture for Learning Options”.
In: NIPS 32 (2019). url: https://proceedings.neurips.cc/paper_files/paper/2019/
file/4f284803bd0966cc24fa8683a34afc6e-Paper.pdf.
[ZWG22] E. Zelikman, Y. Wu, and N. D. Goodman. “STaR: Bootstrapping Reasoning With Reasoning”.
In: NIPS. Mar. 2022. url: https://arxiv.org/abs/2203.14465.