01 Module 1: Early Reinforcement Learning

Learning Objectives
● Understand the History of Reinforcement Learning
○ Value Iteration
○ Policy Iteration
○ TD-Learning
○ Q-Learning
Agenda
History Overview
Value Iteration
Policy Iteration
TD(Lambda)
Q-Learning
An RL Timeline
[Timeline: Value Iteration, Policy Iteration, TD(λ), Q-Learning, TD-Gammon, AlphaGo]
A Simpler Lake
Markov Decision Process (MDP): State, Reward
[Diagram: a four-state lake. States 0 and 1 sit on top; States 2 (reward -1) and 3 (reward +1) sit below. Actions a0, a1, a2 move the agent between them.]
Bellman Equation
[Figure: a sequence of rewards received over time: $100 now, then $0 at every later step.]

Bellman Equation
The Policy π: maps each State to an Action.
The optimal policy π*: maps each State to the best Action.

Bellman Equation
V^π*(s) = max_a { R(s,a) + γ V^π*(s′) }
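To make the backup concrete, here is a minimal sketch of the deterministic Bellman update for the simple lake. The transition and reward tables below are assumptions read off the diagram (a1 moves toward the -1/+1 row, a0 and a2 move sideways); only the update line itself comes from the equation above.

# Minimal sketch: one deterministic Bellman backup on the simple lake.
# Transitions and rewards are assumed from the diagram, not taken from course code.
GAMMA = 0.9

next_state = {0: {"a1": 2, "a2": 1},   # from State 0: a1 drops into the -1 hole, a2 moves to State 1
              1: {"a1": 3, "a0": 0}}   # from State 1: a1 reaches the +1 goal, a0 moves back to State 0
reward     = {0: {"a1": -1, "a2": 0},
              1: {"a1": +1, "a0": 0}}

def bellman_backup(state, values):
    """V(s) = max_a { R(s,a) + gamma * V(s') } for a deterministic lake."""
    return max(reward[state][a] + GAMMA * values[next_state[state][a]]
               for a in next_state[state])

values = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}
print(bellman_backup(1, values))   # 1.0 -> action a1 reaches the +1 reward
print(bellman_backup(0, values))   # 0.0 -> better to sidestep the -1 hole than to enter it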
Simple Lake Value
[Diagram: the four-state lake again, with actions a0, a1, a2 and the -1 and +1 rewards in States 2 and 3.]

[Worked example: value iteration on the simple lake. All state values start at 0. Each pass applies the Bellman backup to every state: State 1's best action becomes a1 (it reaches the +1 reward) and State 0's best action becomes a2 (it moves toward State 1 while avoiding the -1 hole). After two to three iterations the values stop changing.]
Value Iteration Code
import numpy as np

LAKE = np.array([[ 0,  0,  0,  0],
                 [ 0, -1,  0, -1],
                 [ 0,  0,  0, -1],
                 [-1,  0,  0,  1]])
LAKE_WIDTH = len(LAKE[0])
LAKE_HEIGHT = len(LAKE)

def iterate_value(current_values):   # function name assumed; the slide omits the def line
    """Finds the future state values for an array of current states.

    Args:
        current_values (int array): the value of current states.

    Returns:
        prime_values (int array): The value of states based on future states.
        policies (int array): The recommended action to take in a state.
    """
    prime_values = []
    policies = []
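The slide stops after initializing the two output lists. One way the body might continue is sketched below; the deterministic move-or-stay transition and the helper name get_next_state are assumptions, and the sketch reuses LAKE, LAKE_WIDTH, LAKE_HEIGHT and np from above.

DISCOUNT = 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def get_next_state(row, col, action):
    """Deterministic move; bumping into the edge leaves the agent in place."""
    new_row = min(max(row + action[0], 0), LAKE_HEIGHT - 1)
    new_col = min(max(col + action[1], 0), LAKE_WIDTH - 1)
    return new_row, new_col

def iterate_value(current_values):
    prime_values = []
    policies = []
    for state in range(LAKE_HEIGHT * LAKE_WIDTH):
        row, col = state // LAKE_WIDTH, state % LAKE_WIDTH
        action_values = []
        for action in ACTIONS:
            new_row, new_col = get_next_state(row, col, action)
            new_state = new_row * LAKE_WIDTH + new_col
            # Deterministic Bellman backup: R(s, a) + gamma * V(s')
            action_values.append(LAKE[new_row, new_col] + DISCOUNT * current_values[new_state])
        prime_values.append(max(action_values))
        policies.append(int(np.argmax(action_values)))
    return prime_values, policies

# Sweep until the values settle:
values = [0] * (LAKE_HEIGHT * LAKE_WIDTH)
for _ in range(30):
    values, policies = iterate_value(values)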
Lake Iteration
[Figure: state values on the 4×4 lake after successive iterations (3 through 6, and 21), with the -1 holes and +1 goal fixed. Discounted values (.9, .81, .73, .66, .59, .53, ...) spread back from the goal, and the final panel shows the resulting optimal policy.]
Agenda
History Overview
Value Iteration
Policy Iteration
TD(Lambda)
Q-Learning
Probabilities and Slipping
[Diagram: on slippery ice each move is uncertain; the agent ends up on one of three paths, each with 33% probability.]

Slippery Simple Lake
[Diagram: the four-state lake with slipping. From State 0 the chosen action (e.g. a0) reaches the intended state only part of the time; each possible path has a 33% probability, and some paths slide the agent into State 2 and its -1 reward.]
Bellman Equation
V^π*(s) = max_a { R(s,a) + γ Σ_s′ P(s′|s,a) V^π*(s′) }
(new addition: the transition-probability weighting Σ_s′ P(s′|s,a))
Weighting State Prime

Σ_s′ P(s′|s,a) V^π*(s′) = .33 ∙ V(Counter Clockwise) + .33 ∙ V(Forward) + .33 ∙ V(Clockwise)

[Diagram: from State 0, a chosen action can send the agent counter-clockwise, forward, or clockwise, each with 33% probability; one of those paths lands in State 2 with its -1 reward.]

Action   Counter Clockwise   Forward    Clockwise   Weighted sum
a0       s2 (-1)             s0 (0)     s0 (0)      -.33
a1       s1 (0)              s2 (-1)    s0 (0)      -.33
a2       s0 (0)              s1 (0)     s2 (-1)     -.33
a3       s0 (0)              s0 (0)     s1 (0)        0
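The last column can be reproduced directly from the table; here is a minimal sketch using the slide's transition table and the state values V(s0) = 0, V(s1) = 0, V(s2) = -1.

# Expected next-state value per action in the slippery lake (sketch).
P_SLIP = 1 / 3                                  # counter-clockwise, forward, clockwise are equally likely
values = {"s0": 0.0, "s1": 0.0, "s2": -1.0}

# action -> landing state for each slip direction (counter-clockwise, forward, clockwise)
transitions = {"a0": ("s2", "s0", "s0"),
               "a1": ("s1", "s2", "s0"),
               "a2": ("s0", "s1", "s2"),
               "a3": ("s0", "s0", "s1")}

for action, landings in transitions.items():
    expected = sum(P_SLIP * values[s_prime] for s_prime in landings)
    print(action, round(expected, 2))           # a0 -0.33, a1 -0.33, a2 -0.33, a3 0.0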
Value Iteration Complexity

O(s² ∙ a ∙ s′)

● For each state, compare each action by weighting each new state s′.
● Repeat up to the total number of states.
[Diagram: a tree expanding each state s0, s1, s2 into its actions a0, a1, and each action into its possible next states s′0, s′1.]
Policy Iteration
[Worked example: policy iteration on the simple lake. Start from an initial policy, evaluate every state's value under that policy, then improve the policy by choosing the greedy action against those values. The evaluate/improve loop repeats until the policy stops changing, here by Iteration 2.]
Modified Policy Iteration Code
def iterate_policy(current_values, current_policies):
    """Finds the future state values for an array of current states.

    Args:
        current_values (int array): the value of current states.
        current_policies (int array): a list where each cell is the recommended
            action for the state matching its index.

    Returns:
        next_values (int array): The value of states based on future states.
        next_policies (int array): The recommended action to take in a state.
    """
    next_values = find_future_values(current_values, current_policies)
    next_policies = find_best_policy(next_values)
    return next_values, next_policies
Modified Policy Iteration Code
def find_future_values(current_values, current_policies):
    """Finds the next set of future values based on the current policy."""
    next_values = []
    ...                                   # evaluation body elided on the slide
    return next_values

def find_best_policy(next_values):        # signature taken from the call in iterate_policy
    ...                                   # greedy-improvement body elided on the slide
    next_policies.append(best_policy)
    return next_policies
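A sketch of how these pieces might be driven until the policy stabilizes (assuming the elided helper bodies are filled in; the state count and the all-zeros starting policy are assumptions):

NUM_STATES = LAKE_HEIGHT * LAKE_WIDTH
values = [0] * NUM_STATES
policies = [0] * NUM_STATES                # start from an arbitrary policy

while True:
    values, new_policies = iterate_policy(values, policies)
    if new_policies == policies:           # a stable policy is an optimal policy
        break
    policies = new_policies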
Modified Policy Iteration Complexity
O(s² ∙ s′ + s² ∙ a ∙ s′)
[Figure: policy iteration on the 4×4 lake; state values after each evaluate/improve pass, converging to the same optimal policy as value iteration but in fewer passes.]
Value Iteration vs Policy Iteration

                         Value Iteration   Policy Iteration
Mathematically precise         ✓                  x
Fewer iterations               x                  ✓
An RL Timeline
[Timeline: Value Iteration, Policy Iteration, TD(λ), Q-Learning, TD-Gammon, AlphaGo]
A Random Walk (γ = 1)
A coin flip moves the agent one state Left (Tails, 50%) or Right (Heads, 50%) along the chain A-G; a walk ends when it reaches either end, and only the right end pays +1.

                   A     B     C     D     E     F     G
Rewards           +0                                  +1
Value Iteration    0    .17   .33   .5    .67   .83    0

(The .17 through .83 are the true state values 1/6 through 5/6, found here by value iteration.)
TD(0)
● New variable: the learning rate α
TD(0) Random Walk (γ = 1, α = .5)

                   A     B     C     D     E     F     G
Rewards           +0                                  +1
Value Iteration    0    .17   .33   .5    .67   .83    0
TD(0)              0     0     0     0     0     0     0

[Animation: the agent starts near the middle of the chain and flips a coin each step (Tails = Left, Heads = Right, 50% each). The TD(0) estimates stay at zero until the first step into G, which pulls V(F) toward the +1 reward: V(F) = 0 + .5(1 + 0 - 0) = .5. On a later episode a step from E into F pulls V(E) toward V(F): V(E) = 0 + .5(0 + .5 - 0) = .25. Episode by episode, the reward information propagates backward along the chain toward the value-iteration values.]
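Those updates follow the standard TD(0) rule, V(s) ← V(s) + α[r + γ V(s′) - V(s)]. Below is a minimal simulation sketch of the random walk; the episode bookkeeping (start state, termination check) is an assumption, while the update line is the rule itself.

import random

ALPHA, GAMMA = 0.5, 1.0
STATES = list("ABCDEFG")                    # A and G end the walk
V = {s: 0.0 for s in STATES}

def run_episode(V, start="D"):
    """One random walk with a TD(0) update after every step."""
    i = STATES.index(start)
    while STATES[i] not in ("A", "G"):
        j = i + random.choice([-1, 1])      # tails = left, heads = right
        reward = 1.0 if STATES[j] == "G" else 0.0
        s, s_prime = STATES[i], STATES[j]
        # TD(0): move V(s) a fraction alpha toward the one-step target
        V[s] += ALPHA * (reward + GAMMA * V[s_prime] - V[s])
        i = j

for _ in range(100):
    run_episode(V)
print(V)   # noisily approaches the value-iteration values .17 ... .83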
TD(λ) Random Walk: Eligibility Traces (γ = .9, α = .5)

                   A     B     C     D       E     F     G
Rewards           +0                                    +1
Eligibility        0     0     0   (λγ)²    λγ     1     0
TD(1)              0     0     0     0       0    .5     0

[Animation: each state the agent visits has its eligibility set to 1, and every trace decays by a factor of λγ per step, so after passing through D, E, F the traces are (λγ)², λγ, 1. When a TD error occurs, every state is updated in proportion to its eligibility, so earlier states share in the update; the slides note one such earlier state being changed by -.025.]
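A minimal sketch of TD(λ) with replacing eligibility traces on the same random walk; as before, the episode bookkeeping is an assumption, and the trace handling follows the decay rule shown on the slides.

import random

ALPHA, GAMMA, LAMBDA = 0.5, 0.9, 1.0
STATES = list("ABCDEFG")
V = {s: 0.0 for s in STATES}

def run_episode_td_lambda(V, start="D"):
    """One episode of TD(lambda) with replacing eligibility traces."""
    eligibility = {s: 0.0 for s in STATES}
    i = STATES.index(start)
    while STATES[i] not in ("A", "G"):
        j = i + random.choice([-1, 1])
        reward = 1.0 if STATES[j] == "G" else 0.0
        s, s_prime = STATES[i], STATES[j]
        td_error = reward + GAMMA * V[s_prime] - V[s]
        eligibility[s] = 1.0                                    # the state just visited is fully eligible
        for state in STATES:
            V[state] += ALPHA * td_error * eligibility[state]   # every eligible state shares the error
            eligibility[state] *= LAMBDA * GAMMA                # traces decay by lambda * gamma each step
        i = j

run_episode_td_lambda(V)
print(V)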
Agenda
History Overview
Value Iteration
Policy Iteration
TD(Lambda)
Q-Learning
The Q Table (γ = .9, α = .5)

State    a0     a1     a2     a3
  0       0      0      0      0
  1       0      0      0      0
  2       0      0      0      0
  3       0      0      0      0

[Animation: the table starts at all zeros. After the agent's first step from State 0 lands in the -1 hole, the corresponding entry in State 0's row is updated to -.5 (that is, α ∙ -1).]
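That -.5 comes from the standard Q-learning update, Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]. Below is a minimal sketch on a 4-state, 4-action table; the specific transition used in the example is an assumption.

import numpy as np

GAMMA, ALPHA = 0.9, 0.5
q_table = np.zeros((4, 4))                  # 4 states x 4 actions, all zeros to start

def q_update(q_table, state, action, reward, new_state):
    """One Q-learning update toward the bootstrapped target."""
    target = reward + GAMMA * q_table[new_state].max()
    q_table[state, action] += ALPHA * (target - q_table[state, action])

# Example: from state 0, one action slides the agent into the -1 hole (state 2).
q_update(q_table, state=0, action=1, reward=-1, new_state=2)
print(q_table[0])                           # [ 0.  -0.5  0.   0. ]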
Deep Q Learning
# Greedy action selection from the Q table (fragment shown on the slide):
action_values = q_table[state_row]                                 # Q values for every action in this state
max_indexes = np.argwhere(action_values == action_values.max())    # indexes of the best action(s)
...
return action
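One way the elided lines might finish the selection, breaking ties among equally good actions at random and exploring with some probability; the get_action name and the random_rate handling are assumptions (a RANDOM_RATE constant does appear on the next slide).

import numpy as np

def get_action(q_table, state_row, random_rate=0.1):
    """Pick an action for the current state: mostly greedy, occasionally random."""
    if np.random.random() < random_rate:
        return np.random.randint(q_table.shape[1])                 # explore: any action
    action_values = q_table[state_row]
    max_indexes = np.argwhere(action_values == action_values.max()).flatten()
    action = np.random.choice(max_indexes)                         # break ties among the best actions at random
    return action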
Anatomy of an Agent
import gym

EPISODES = 1000
agent = AGENT(NUM_STATES, NUM_ACTIONS, DISCOUNT, LEARNING_RATE, RANDOM_RATE)
environment = gym.make('FrozenLake-v0')
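A sketch of how these pieces might be wired into a training loop, assuming the classic gym API used by FrozenLake-v0 (reset() returns the state; step() returns state, reward, done, info) and assuming the agent exposes get_action and update methods:

for episode in range(EPISODES):
    state = environment.reset()
    done = False
    while not done:
        action = agent.get_action(state)                        # choose an action from the Q table
        new_state, reward, done, info = environment.step(action)
        agent.update(state, action, reward, new_state)          # apply the Q-learning update
        state = new_state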
● Unknown Environment
● No knowledge or experience

Reward
● Distinguishes "bad" and "good" decisions
● Developer sets rewards and penalties
[Diagram: the slippery lake again; each path has a 33% probability and the holes carry a -1 penalty.]
Interaction ⇒ Knowledge
1. Agent
2. Environment
3. State
4. Reward
[Diagram: the Agent sends an Action through the Operator to the Environment and receives back a State and a Reward.]
DRL Trading Algorithm Components
1. Agent
2. Environment
3. State
4. Reward
[Diagram: the same Agent, Operator, Environment loop, now applied to trading.]
DRL Agent
● Agent = Trader
○ Brokerage Account
○ Monitors Market
○ Makes Trade Decisions

Agent/Algo Methodology
1. Make Trading Decision ⇒ Order (Filled or Not Filled?)
2. Assess New Market Conditions
3. Make Decision
⇒ New Order?
⇒ Change Order?
⇒ Do Nothing?
[Diagram: the Agent, Operator, Environment loop applied to managing orders.]
DRL Environment
● Market(s)
● Other agents (algos and humans)
[Diagram: Markets, Other Agents, and the Order Book make up the Environment the Agent observes.]

State
● Market Conditions (only partially knowable by Agent)
● Unknowable:
○ Number of other agents
○ Their actions and positions
○ Strategy
○ Implementation
● Human-machine symbiosis often breaks down and performs poorly
Builds on Successful ML Techniques
● One of the main challenges is selecting unbiased, representative financial data

Screencast