Tutorial: Deep Reinforcement Learning
David Silver, Google DeepMind
Outline
Introduction to Deep Learning
Introduction to Reinforcement Learning
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Reinforcement Learning in a nutshell
RL is a general-purpose framework for decision-making
I RL is for an agent with the capacity to act
I Each action influences the agent's future state
I Success is measured by a scalar reward signal
I Goal: select actions to maximise future reward
Deep Learning in a nutshell
DL is a general-purpose framework for representation learning
I Given an objective
I Learn the representation required to achieve that objective
I Directly from raw inputs
I Using minimal domain knowledge
Deep Reinforcement Learning: AI = RL + DL
We seek a single agent which can solve any human-level task
I RL defines the objective
I DL gives the mechanism
I RL + DL = general intelligence
Examples of Deep RL @DeepMind
I Play games: Atari, poker, Go, ...
I Explore worlds: 3D worlds, Labyrinth, ...
I Control physical systems: manipulate, walk, swim, ...
I Interact with users: recommend, optimise, personalise, ...
Outline
Introduction to Deep Learning
Introduction to Reinforcement Learning
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Deep Representations
I A deep representation is a composition of many functions
      x → h1 → h2 → ... → hn → y → l
  with weights w1, ..., wn
I Its gradient can be backpropagated by the chain rule
      ∂l/∂x = ∂h1/∂x ∘ ∂h2/∂h1 ∘ ... ∘ ∂y/∂hn ∘ ∂l/∂y
  and likewise for the weight gradients ∂l/∂w1, ..., ∂l/∂wn
Deep Neural Network
A deep neural network is typically composed of:
I Linear transformations
hk+1 = Whk
I Non-linear activation functions
hk+2 = f (hk+1 )
I A loss function on the output, e.g.
I Mean-squared error l = ||y* − y||²
I Log likelihood l = log P[y*]
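To make these pieces concrete, here is a minimal NumPy sketch (not from the tutorial; all names are illustrative) of a two-layer network with a ReLU activation and an MSE loss, with the gradients backpropagated by hand via the chain rule:

```python
import numpy as np

def forward(x, W1, W2, y_star):
    """Two linear layers with a ReLU in between, plus an MSE loss."""
    h1 = W1 @ x                       # linear transformation
    h2 = np.maximum(h1, 0.0)          # non-linear activation (ReLU)
    y  = W2 @ h2                      # linear output layer
    l  = np.sum((y_star - y) ** 2)    # mean-squared error loss
    return h1, h2, y, l

def backward(x, W1, W2, y_star, h1, h2, y):
    """Backpropagate the loss gradient through the composition by the chain rule."""
    dl_dy  = -2.0 * (y_star - y)      # ∂l/∂y
    dl_dW2 = np.outer(dl_dy, h2)      # ∂l/∂W2 = ∂l/∂y · ∂y/∂W2
    dl_dh2 = W2.T @ dl_dy             # ∂l/∂h2
    dl_dh1 = dl_dh2 * (h1 > 0)        # gradient through the ReLU
    dl_dW1 = np.outer(dl_dh1, x)      # ∂l/∂W1 = ∂l/∂h1 · ∂h1/∂W1
    return dl_dW1, dl_dW2
```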
Training Neural Networks by Stochastic Gradient Descent
I Sample gradient of expected loss L(w) = E[l]
      E[ ∂l/∂w ] = ∂L(w)/∂w
I Adjust w down the sampled gradient
      Δw ∝ −∂l/∂w
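A minimal sketch of the stochastic gradient descent idea, assuming (purely for illustration) a toy linear model with a squared-error loss: each sampled gradient is an unbiased estimate of ∂L(w)/∂w, and w is adjusted down it.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(4)
alpha = 0.01                                   # step size

for step in range(5000):
    # draw one sample; its gradient ∂l/∂w is an unbiased estimate of ∂L(w)/∂w
    x = rng.standard_normal(4)
    y_star = x @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal()
    y = w @ x                                  # prediction
    dl_dw = -2.0 * (y_star - y) * x            # gradient of the squared error
    w -= alpha * dl_dw                         # adjust w down the sampled gradient
```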
Weight Sharing
Recurrent neural network shares weights between time-steps
      (diagram: ... → ht → ht+1 → ..., with inputs xt, xt+1, outputs yt, yt+1, and the same weights w applied at every time-step)
Convolutional neural network shares weights between local regions
      (diagram: the same filter weights w1, w2 are applied at every local region of the input x to produce feature maps h1, h2)
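A small illustrative sketch (assumed, not from the tutorial) of weight sharing in a recurrent network: the same weight matrices are reused at every time-step when the network is unrolled. Convolutional sharing is analogous, with the same filter reused at every spatial location.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, h0):
    """Unroll an RNN over a sequence: the SAME weights (W_h, W_x) are applied at every step."""
    h, hs = h0, []
    for x_t in xs:                           # xs is a list of input vectors x_t
        h = np.tanh(W_h @ h + W_x @ x_t)     # shared weights across time-steps
        hs.append(h)
    return hs                                # hidden states h_t for each time-step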
Outline
Introduction to Deep Learning
Introduction to Reinforcement Learning
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Many Faces of Reinforcement Learning
(diagram) Reinforcement learning sits at the intersection of many fields: computer science (machine learning), engineering (optimal control), neuroscience (reward system), psychology (classical/operant conditioning), mathematics (operations research), and economics (game theory, rationality).
Agent and Environment
I At each step t the agent:
  I Executes action at
  I Receives observation ot
  I Receives scalar reward rt
I The environment:
  I Receives action at
  I Emits observation ot+1
  I Emits scalar reward rt+1
State
I Experience is a sequence of observations, actions, rewards
o1, r1, a1, ..., at−1, ot, rt
I The state is a summary of experience
st = f(o1, r1, a1, ..., at−1, ot, rt)
I In a fully observed environment
st = f (ot )
Major Components of an RL Agent
I An RL agent may include one or more of these components:
I Policy: agent's behaviour function
I Value function: how good is each state and/or action
I Model: agent's representation of the environment
Policy
I A policy is the agent's behaviour
I It is a map from state to action:
  I Deterministic policy: a = π(s)
  I Stochastic policy: π(a|s) = P[a|s]
Value Function
I A value function is a prediction of future reward
I How much reward will I get from action a in state s?
I Q-value function gives expected total reward
  I from state s and action a
  I under policy π
  I with discount factor γ
      Q^π(s, a) = E[ rt+1 + γ rt+2 + γ² rt+3 + ... | s, a ]
I Value functions decompose into a Bellman equation
      Q^π(s, a) = E_{s',a'}[ r + γ Q^π(s', a') | s, a ]
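As a small illustration (a sketch under the definitions above, not code from the tutorial), the quantity inside the expectation is just a discounted sum of rewards along one sampled trajectory; Q^π(s, a) is its expectation over trajectories that start with (s, a) and then follow π.

```python
def discounted_return(rewards, gamma=0.99):
    """r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... for one sampled trajectory."""
    g = 0.0
    for r in reversed(rewards):   # accumulate backwards so each reward gets the right discount
        g = r + gamma * g
    return g

# e.g. discounted_return([1.0, 0.0, 2.0], gamma=0.9) == 1.0 + 0.9*0.0 + 0.81*2.0
```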
Optimal Value Functions
I An optimal value function is the maximum achievable value
      Q*(s, a) = max_π Q^π(s, a) = Q^{π*}(s, a)
I Once we have Q* we can act optimally,
      π*(s) = argmax_a Q*(s, a)
I Optimal value maximises over all decisions. Informally:
      Q*(s, a) = rt+1 + γ max_{at+1} rt+2 + γ² max_{at+2} rt+3 + ...
               = rt+1 + γ max_{at+1} Q*(st+1, at+1)
I Formally, optimal values decompose into a Bellman equation
      Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]
Value Function Demo
Model
I Model is learnt from experience
I Acts as proxy for environment
I Planner interacts with model
  I e.g. using lookahead search
Approaches To Reinforcement Learning
Value-based RL
I Estimate the optimal value function Q*(s, a)
I This is the maximum value achievable under any policy
Policy-based RL
I Search directly for the optimal policy
I This is the policy achieving maximum future reward
Model-based RL
I Build a model of the environment
I Plan (e.g. by lookahead) using model
Deep Reinforcement Learning
I Use deep neural networks to represent
I Value function
I Policy
I Model
I Optimise loss function by stochastic gradient descent
Outline
Introduction to Deep Learning
Introduction to Reinforcement Learning
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Q-Networks
Represent value function by Q-network with weights w
      Q(s, a, w) ≈ Q*(s, a)
(diagram: the network either takes s and a as input and outputs a single value Q(s,a,w), or takes s as input and outputs a vector Q(s,a1,w), ..., Q(s,am,w) with one entry per action)
Q-Learning
I Optimal Q-values should obey the Bellman equation
      Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]
I Treat the right-hand side  r + γ max_{a'} Q(s', a', w)  as a target
I Minimise MSE loss by stochastic gradient descent
      l = ( r + γ max_{a'} Q(s', a', w) − Q(s, a, w) )²
I Converges to Q* using a table-lookup representation
I But diverges using neural networks due to:
I Correlations between samples
I Non-stationary targets
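A minimal sketch of one Q-learning update with function approximation, assuming (for brevity) a linear Q-function Q(s, a, w) = w_a · φ(s) rather than a deep network; all names are illustrative.

```python
import numpy as np

def q_learning_step(phi, a, r, phi_next, done, W, gamma=0.99, alpha=0.01):
    """One SGD step on l = (r + γ max_a' Q(s',a',w) − Q(s,a,w))² for a linear Q-function."""
    q_sa = W[a] @ phi                                          # Q(s, a, w)
    target = r if done else r + gamma * np.max(W @ phi_next)   # bootstrapped target
    td_error = target - q_sa
    W[a] += alpha * td_error * phi                             # gradient step on the taken action's weights
    return td_error
```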
Deep Q-Networks (DQN): Experience Replay
To remove correlations, build a data-set from the agent's own experience
      s1, a1, r2, s2
      s2, a2, r3, s3
      s3, a3, r4, s4        →  store transitions (s, a, r, s')
      ...
      st, at, rt+1, st+1
Sample experiences from the data-set and apply the update
      l = ( r + γ max_{a'} Q(s', a', w⁻) − Q(s, a, w) )²
To deal with non-stationarity, the target parameters w⁻ are held fixed
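A minimal sketch of an experience-replay buffer (illustrative, not the DQN source): transitions are stored as they occur and later sampled uniformly at random, which breaks the temporal correlations; the target parameters w⁻ would simply be a periodically-updated copy of w.

```python
from collections import deque
import numpy as np

class ReplayBuffer:
    """Store transitions (s, a, r, s', done) and sample decorrelated minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experience is discarded automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        idx = np.random.randint(len(self.buffer), size=batch_size)
        s, a, r, s_next, done = zip(*(self.buffer[i] for i in idx))
        return [np.array(x) for x in (s, a, r, s_next, done)]

# fixed target network: copy w into the target parameters w⁻ every N updates, e.g.
#   if step % 10_000 == 0: w_target = w.copy()
```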
Deep Reinforcement Learning in Atari
(diagram: the agent observes the game screen as state st, selects a joystick action at, and receives the change in score as reward rt)
DQN in Atari
I End-to-end learning of values Q(s, a) from pixels s
I Input state s is stack of raw pixels from last 4 frames
I Output is Q(s, a) for 18 joystick/button positions
I Reward is change in score for that step
Network architecture and hyperparameters fixed across all games
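An illustrative sketch of the frame-stacking part of the input pipeline (the full DQN preprocessing also grayscales and downsamples frames; this only shows stacking the last 4 frames so the state captures short-term motion):

```python
from collections import deque
import numpy as np

class FrameStack:
    """Stack the last k frames to form the input state."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        for _ in range(self.k):
            self.frames.append(frame)
        return np.stack(self.frames, axis=0)   # shape: (k, H, W)

    def step(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames, axis=0)
```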
DQN Results in Atari
DQN Atari Demo
DQN paper:
www.nature.com/articles/nature14236
DQN source code:
sites.google.com/a/deepmind.com/dqn/
Improvements since Nature DQN
I Double DQN: Remove upward bias caused by max_a Q(s, a, w)
  I Current Q-network w is used to select actions
  I Older Q-network w⁻ is used to evaluate actions
      l = ( r + γ Q(s', argmax_{a'} Q(s', a', w), w⁻) − Q(s, a, w) )²
I Prioritised replay: Weight experience according to surprise
  I Store experience in priority queue according to DQN error
      | r + γ max_{a'} Q(s', a', w⁻) − Q(s, a, w) |
I Duelling network: Split Q-network into two channels
  I Action-independent value function V(s, v)
  I Action-dependent advantage function A(s, a, w)
      Q(s, a) = V(s, v) + A(s, a, w)
I Combined algorithm: 3x mean Atari score vs Nature DQN
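A small sketch of the Double DQN target computation (illustrative names: q_next_online and q_next_target stand for the vectors Q(s', ·, w) and Q(s', ·, w⁻)):

```python
import numpy as np

def double_dqn_target(r, q_next_online, q_next_target, gamma=0.99, done=False):
    """Select a' with the current network w, evaluate it with the older network w⁻."""
    if done:
        return r
    a_star = np.argmax(q_next_online)           # action selection: argmax_a' Q(s', a', w)
    return r + gamma * q_next_target[a_star]    # action evaluation: Q(s', a*, w⁻)
```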
Gorila (General Reinforcement Learning Architecture)
I 10x faster than Nature DQN on 38 out of 49 Atari games
I Applied to recommender systems within Google
Asynchronous Reinforcement Learning
I Exploits multithreading of standard CPU
I Execute many instances of agent in parallel
I Network parameters shared between threads
I Parallelism decorrelates data
I Viable alternative to experience replay
I Similar speedup to Gorila - on a single machine!
Outline
Introduction to Deep Learning
Introduction to Reinforcement Learning
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Deep Policy Networks
I Represent policy by deep network with weights u
      a = π(s, u)  (deterministic)   or   a ∼ π(a|s, u)  (stochastic)
I Define objective function as total discounted reward
      L(u) = E[ r1 + γ r2 + γ² r3 + ... | π(·, u) ]
I Optimise objective end-to-end by SGD
I i.e. Adjust policy parameters u to achieve more reward
Policy Gradients
How to make high-value actions more likely:
I The gradient of a stochastic policy π(a|s, u) is given by
      ∂L(u)/∂u = E[ ∂log π(a|s, u)/∂u · Q^π(s, a) ]
I The gradient of a deterministic policy a = π(s) is given by
      ∂L(u)/∂u = E[ ∂Q^π(s, a)/∂a · ∂a/∂u ]
  I if a is continuous and Q is differentiable
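A minimal sketch of the stochastic policy-gradient update, assuming (for illustration) a softmax policy over linear action preferences and a Q(s, a) estimate supplied from elsewhere (e.g. a sampled return or a critic); all names are illustrative.

```python
import numpy as np

def softmax_policy(theta, phi):
    """Stochastic policy π(a|s,u): softmax over linear action preferences."""
    prefs = theta @ phi                       # one preference per action
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def policy_gradient_step(theta, phi, a, q_value, alpha=0.01):
    """Ascend E[ ∂log π(a|s,u)/∂u · Q(s,a) ] using one sampled (s, a, Q) triple."""
    p = softmax_policy(theta, phi)
    grad_log_pi = -np.outer(p, phi)           # ∂log π(a|s)/∂θ: −p_b·φ for every action b ...
    grad_log_pi[a] += phi                     # ... plus φ for the action actually taken
    theta += alpha * q_value * grad_log_pi
    return theta
```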
Actor-Critic Algorithm
I Estimate value function Q(s, a, w) ≈ Q^π(s, a)
I Update policy parameters u by stochastic gradient ascent
      ∂l/∂u = ∂log π(a|s, u)/∂u · Q(s, a, w)
   or
      ∂l/∂u = ∂Q(s, a, w)/∂a · ∂a/∂u
Asynchronous Advantage Actor-Critic (A3C)
I Estimate state-value function
      V(s, v) ≈ E[ rt+1 + γ rt+2 + ... | s ]
I Q-value estimated by an n-step sample
      qt = rt+1 + γ rt+2 + ... + γⁿ⁻¹ rt+n + γⁿ V(st+n, v)
I Actor is updated towards target
      ∂lu/∂u = ∂log π(at|st, u)/∂u · (qt − V(st, v))
I Critic is updated to minimise MSE w.r.t. target
      lv = (qt − V(st, v))²
I 4x mean Atari score vs Nature DQN
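A small sketch of the n-step return used above (the bootstrap value V(st+n, v) comes from the critic; names are illustrative), together with the advantage qt − V(st, v) that drives both the actor and critic updates:

```python
def n_step_q(rewards, v_bootstrap, gamma=0.99):
    """q_t = r_{t+1} + γ r_{t+2} + ... + γ^{n-1} r_{t+n} + γ^n V(s_{t+n}, v)."""
    q = v_bootstrap
    for r in reversed(rewards):      # rewards = [r_{t+1}, ..., r_{t+n}]
        q = r + gamma * q
    return q

def advantage(q_t, v_st):
    """q_t − V(s_t, v): scales the actor's log-probability gradient; its square is the critic loss."""
    return q_t - v_st
```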
Deep Reinforcement Learning in Labyrinth
A3C in Labyrinth
(diagram: a recurrent network receives raw observations ot−1, ot, ot+1, maintains an internal state st, and at each step outputs a policy π(a|st) and a value V(st))
I End-to-end learning of softmax policy π(a|st) from pixels
I Observations ot are raw pixels from current frame
I State st = f(o1, ..., ot) is a recurrent neural network (LSTM)
I Outputs both value V(s) and softmax over actions π(a|s)
I Task is to collect apples (+1 reward) and escape (+10 reward)
A3C Labyrinth Demo
Demo:
www.youtube.com/watch?v=nMR5mjCFZCw&feature=youtu.be
Labyrinth source code (coming soon):
sites.google.com/a/deepmind.com/labyrinth/
Deep Reinforcement Learning with Continuous Actions
How can we deal with high-dimensional continuous action spaces?
I Can't easily compute max_a Q(s, a)
I Actor-critic algorithms learn without a max
I Q-values are differentiable w.r.t. a
I Deterministic policy gradients exploit knowledge of ∂Q/∂a
Deep DPG
DPG is the continuous analogue of DQN
I Experience replay: build data-set from agent's experience
I Critic estimates value of current policy by DQN
      lw = ( r + γ Q(s', π(s', u⁻), w⁻) − Q(s, a, w) )²
  To deal with non-stationarity, target parameters u⁻, w⁻ are held fixed
I Actor updates policy in direction that improves Q
      ∂lu/∂u = ∂Q(s, a, w)/∂a · ∂a/∂u
I In other words critic provides loss function for actor
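A minimal sketch of the deterministic policy-gradient step for the actor, assuming (purely for illustration) a linear actor a = U·φ(s) and a critic whose gradient with respect to the action, ∂Q/∂a, has already been computed:

```python
import numpy as np

def dpg_actor_update(U, dQ_da, phi, alpha=1e-3):
    """Chain rule through the action: ∂Q/∂U = ∂Q/∂a · ∂a/∂U, with a = U @ phi(s)."""
    grad_U = np.outer(dQ_da, phi)    # row i is dQ/da_i * φ(s)
    return U + alpha * grad_U        # gradient ascent: adjust the policy to increase Q
```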
DPG in Simulated Physics
I Physics domains are simulated in MuJoCo
I End-to-end learning of control policy from raw pixels s
I Input state s is stack of raw pixels from last 4 frames
I Two separate convnets are used for Q and π
I Policy π is adjusted in direction that most improves Q
(diagram: the actor network π(s) outputs action a, which is fed into the critic network Q(s, a))
DPG in Simulated Physics Demo
I Demo: DPG from pixels
A3C in Simulated Physics Demo
I Asynchronous RL is viable alternative to experience replay
I Train a hierarchical, recurrent locomotion controller
I Retrain controller on more challenging tasks
Fictitious Self-Play (FSP)
Can deep RL find Nash equilibria in multi-agent games?
I Q-network learns best response to opponent policies
I By applying DQN with experience replay
I c.f. fictitious play
I Policy network π(a|s, u) learns an average of best responses
      ∂l/∂u = ∂log π(a|s, u)/∂u
I Actions a are sampled from a mixture of the policy network and the best response
Neural FSP in Texas Hold'em Poker
I Heads-up limit Texas Hold'em
I NFSP with raw inputs only (no prior knowledge of Poker)
I vs SmooCT (3x medal winner 2015, handcrafted knowledge)
(plot: win rate in mbb/h against training iterations, comparing SmooCT with NFSP's best-response, greedy-average, and average strategies)
Outline
Introduction to Deep Learning
Introduction to Reinforcement Learning
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Learning Models of the Environment
I Demo: generative model of Atari
I Challenging to plan due to compounding errors
I Errors in the transition model compound over the trajectory
I Planning trajectories differ from executed trajectories
I At end of long, unusual trajectory, rewards are totally wrong
Deep Reinforcement Learning in Go
What if we have a perfect model? e.g. game rules are known
AlphaGo paper:
www.nature.com/articles/nature16961
AlphaGo resources:
deepmind.com/alphago/
Conclusion
I General, stable and scalable RL is now possible
I Using deep networks to represent value, policy, model
I Successful in Atari, Labyrinth, Physics, Poker, Go
I Using a variety of deep RL paradigms