01 Module 1: Early Reinforcement Learning

Learning Objectives
● Understand the History of Reinforcement Learning
○ Value Iteration
○ Policy Iteration
○ TD-Learning
○ Q-Learning
Agenda
History Overview
Value Iteration
Policy Iteration
TD(Lambda)
Q-Learning
An RL Timeline
[Timeline: Value Iteration, Policy Iteration, TD(λ), Q-Learning, TD-Gammon, AlphaGo]
A Simpler Lake
Markov Decision Process (MDP): State, Reward
[Diagram: a four-state lake. States 0 and 1 sit on top; States 2 (reward -1) and 3 (reward +1) sit below. Actions a0, a1, a2 move the agent between them.]
Bellman Equation
[Figure: a sequence of rewards received over time: $100 now, then $0 at every later step.]

Bellman Equation
The Policy π: maps each State to an Action.
The optimal policy π*: maps each State to the best Action.

Bellman Equation
V^π*(s) = max_a { R(s,a) + γ V^π*(s′) }
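To make the backup concrete, here is a minimal sketch of the deterministic Bellman update for the simple lake. The transition and reward tables below are assumptions read off the diagram (a1 moves toward the -1/+1 row, a0 and a2 move sideways); only the update line itself comes from the equation above.

# Minimal sketch: one deterministic Bellman backup on the simple lake.
# Transitions and rewards are assumed from the diagram, not taken from course code.
GAMMA = 0.9

next_state = {0: {"a1": 2, "a2": 1},   # from State 0: a1 drops into the -1 hole, a2 moves to State 1
              1: {"a1": 3, "a0": 0}}   # from State 1: a1 reaches the +1 goal, a0 moves back to State 0
reward     = {0: {"a1": -1, "a2": 0},
              1: {"a1": +1, "a0": 0}}

def bellman_backup(state, values):
    """V(s) = max_a { R(s,a) + gamma * V(s') } for a deterministic lake."""
    return max(reward[state][a] + GAMMA * values[next_state[state][a]]
               for a in next_state[state])

values = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}
print(bellman_backup(1, values))   # 1.0 -> action a1 reaches the +1 reward
print(bellman_backup(0, values))   # 0.0 -> better to sidestep the -1 hole than to enter it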
Simple Lake Value
[Diagram: the four-state lake again, with actions a0, a1, a2 and the -1 and +1 rewards in States 2 and 3.]

[Worked example: value iteration on the simple lake. All state values start at 0. Each pass applies the Bellman backup to every state: State 1's best action becomes a1 (it reaches the +1 reward) and State 0's best action becomes a2 (it moves toward State 1 while avoiding the -1 hole). After two to three iterations the values stop changing.]
Value Iteration Code
import numpy as np

LAKE = np.array([[ 0,  0,  0,  0],
                 [ 0, -1,  0, -1],
                 [ 0,  0,  0, -1],
                 [-1,  0,  0,  1]])
LAKE_WIDTH = len(LAKE[0])
LAKE_HEIGHT = len(LAKE)

def iterate_value(current_values):   # function name assumed; the slide omits the def line
    """Finds the future state values for an array of current states.

    Args:
        current_values (int array): the value of current states.

    Returns:
        prime_values (int array): The value of states based on future states.
        policies (int array): The recommended action to take in a state.
    """
    prime_values = []
    policies = []
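The slide stops after initializing the two output lists. One way the body might continue is sketched below; the deterministic move-or-stay transition and the helper name get_next_state are assumptions, and the sketch reuses LAKE, LAKE_WIDTH, LAKE_HEIGHT and np from above.

DISCOUNT = 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def get_next_state(row, col, action):
    """Deterministic move; bumping into the edge leaves the agent in place."""
    new_row = min(max(row + action[0], 0), LAKE_HEIGHT - 1)
    new_col = min(max(col + action[1], 0), LAKE_WIDTH - 1)
    return new_row, new_col

def iterate_value(current_values):
    prime_values = []
    policies = []
    for state in range(LAKE_HEIGHT * LAKE_WIDTH):
        row, col = state // LAKE_WIDTH, state % LAKE_WIDTH
        action_values = []
        for action in ACTIONS:
            new_row, new_col = get_next_state(row, col, action)
            new_state = new_row * LAKE_WIDTH + new_col
            # Deterministic Bellman backup: R(s, a) + gamma * V(s')
            action_values.append(LAKE[new_row, new_col] + DISCOUNT * current_values[new_state])
        prime_values.append(max(action_values))
        policies.append(int(np.argmax(action_values)))
    return prime_values, policies

# Sweep until the values settle:
values = [0] * (LAKE_HEIGHT * LAKE_WIDTH)
for _ in range(30):
    values, policies = iterate_value(values)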
Lake Iteration
[Figure: state values on the 4×4 lake after successive iterations (3 through 6, and 21), with the -1 holes and +1 goal fixed. Discounted values (.9, .81, .73, .66, .59, .53, ...) spread back from the goal, and the final panel shows the resulting optimal policy.]
Agenda
History Overview
Value Iteration
Policy Iteration
TD(Lambda)
Q-Learning
Probabilities and Slipping
[Diagram: on slippery ice each move is uncertain; the agent ends up on one of three paths, each with 33% probability.]

Slippery Simple Lake
[Diagram: the four-state lake with slipping. From State 0 the chosen action (e.g. a0) reaches the intended state only part of the time; each possible path has a 33% probability, and some paths slide the agent into State 2 and its -1 reward.]
Bellman Equation
V^π*(s) = max_a { R(s,a) + γ Σ_s′ P(s′|s,a) V^π*(s′) }
(new addition: the transition-probability weighting Σ_s′ P(s′|s,a))
Weighting State Prime

Σ_s′ P(s′|s,a) V^π*(s′) = .33 ∙ V(Counter Clockwise) + .33 ∙ V(Forward) + .33 ∙ V(Clockwise)

[Diagram: from State 0, a chosen action can send the agent counter-clockwise, forward, or clockwise, each with 33% probability; one of those paths lands in State 2 with its -1 reward.]

Action   Counter Clockwise   Forward    Clockwise   Weighted sum
a0       s2 (-1)             s0 (0)     s0 (0)      -.33
a1       s1 (0)              s2 (-1)    s0 (0)      -.33
a2       s0 (0)              s1 (0)     s2 (-1)     -.33
a3       s0 (0)              s0 (0)     s1 (0)        0
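The last column can be reproduced directly from the table; here is a minimal sketch using the slide's transition table and the state values V(s0) = 0, V(s1) = 0, V(s2) = -1.

# Expected next-state value per action in the slippery lake (sketch).
P_SLIP = 1 / 3                                  # counter-clockwise, forward, clockwise are equally likely
values = {"s0": 0.0, "s1": 0.0, "s2": -1.0}

# action -> landing state for each slip direction (counter-clockwise, forward, clockwise)
transitions = {"a0": ("s2", "s0", "s0"),
               "a1": ("s1", "s2", "s0"),
               "a2": ("s0", "s1", "s2"),
               "a3": ("s0", "s0", "s1")}

for action, landings in transitions.items():
    expected = sum(P_SLIP * values[s_prime] for s_prime in landings)
    print(action, round(expected, 2))           # a0 -0.33, a1 -0.33, a2 -0.33, a3 0.0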
Value Iteration Complexity

O(s² ∙ a ∙ s′)

● For each state, compare each action by weighting each new state s′.
● Repeat up to the total number of states.
[Diagram: a tree expanding each state s0, s1, s2 into its actions a0, a1, and each action into its possible next states s′0, s′1.]
Policy Iteration
[Worked example: policy iteration on the simple lake. Start from an initial policy, evaluate every state's value under that policy, then improve the policy by choosing the greedy action against those values. The evaluate/improve loop repeats until the policy stops changing, here by Iteration 2.]
Modified Policy Iteration Code
def iterate_policy(current_values, current_policies):
    """Finds the future state values for an array of current states.

    Args:
        current_values (int array): the value of current states.
        current_policies (int array): a list where each cell is the recommended
            action for the state matching its index.

    Returns:
        next_values (int array): The value of states based on future states.
        next_policies (int array): The recommended action to take in a state.
    """
    next_values = find_future_values(current_values, current_policies)
    next_policies = find_best_policy(next_values)
    return next_values, next_policies
Modified Policy Iteration Code
def find_future_values(current_values, current_policies):
    """Finds the next set of future values based on the current policy."""
    next_values = []
    ...                                   # evaluation body elided on the slide
    return next_values

def find_best_policy(next_values):        # signature taken from the call in iterate_policy
    ...                                   # greedy-improvement body elided on the slide
    next_policies.append(best_policy)
    return next_policies
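A sketch of how these pieces might be driven until the policy stabilizes (assuming the elided helper bodies are filled in; the state count and the all-zeros starting policy are assumptions):

NUM_STATES = LAKE_HEIGHT * LAKE_WIDTH
values = [0] * NUM_STATES
policies = [0] * NUM_STATES                # start from an arbitrary policy

while True:
    values, new_policies = iterate_policy(values, policies)
    if new_policies == policies:           # a stable policy is an optimal policy
        break
    policies = new_policies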
Modified Policy Iteration Complexity
O(s² ∙ s′ + s² ∙ a ∙ s′)
[Figure: policy iteration on the 4×4 lake; state values after each evaluate/improve pass, converging to the same optimal policy as value iteration but in fewer passes.]
Value Iteration vs Policy Iteration

                         Value Iteration   Policy Iteration
Mathematically precise         ✓                  x
Fewer iterations               x                  ✓
An RL Timeline
[Timeline: Value Iteration, Policy Iteration, TD(λ), Q-Learning, TD-Gammon, AlphaGo]
A Random Walk (γ = 1)
A coin flip moves the agent one state Left (Tails, 50%) or Right (Heads, 50%) along the chain A-G; a walk ends when it reaches either end, and only the right end pays +1.

                   A     B     C     D     E     F     G
Rewards           +0                                  +1
Value Iteration    0    .17   .33   .5    .67   .83    0

(The .17 through .83 are the true state values 1/6 through 5/6, found here by value iteration.)
TD(0)
● New variable: the learning rate α
TD(0) Random Walk (γ = 1, α = .5)

                   A     B     C     D     E     F     G
Rewards           +0                                  +1
Value Iteration    0    .17   .33   .5    .67   .83    0
TD(0)              0     0     0     0     0     0     0

[Animation: the agent starts near the middle of the chain and flips a coin each step (Tails = Left, Heads = Right, 50% each). The TD(0) estimates stay at zero until the first step into G, which pulls V(F) toward the +1 reward: V(F) = 0 + .5(1 + 0 - 0) = .5. On a later episode a step from E into F pulls V(E) toward V(F): V(E) = 0 + .5(0 + .5 - 0) = .25. Episode by episode, the reward information propagates backward along the chain toward the value-iteration values.]
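Those updates follow the standard TD(0) rule, V(s) ← V(s) + α[r + γ V(s′) - V(s)]. Below is a minimal simulation sketch of the random walk; the episode bookkeeping (start state, termination check) is an assumption, while the update line is the rule itself.

import random

ALPHA, GAMMA = 0.5, 1.0
STATES = list("ABCDEFG")                    # A and G end the walk
V = {s: 0.0 for s in STATES}

def run_episode(V, start="D"):
    """One random walk with a TD(0) update after every step."""
    i = STATES.index(start)
    while STATES[i] not in ("A", "G"):
        j = i + random.choice([-1, 1])      # tails = left, heads = right
        reward = 1.0 if STATES[j] == "G" else 0.0
        s, s_prime = STATES[i], STATES[j]
        # TD(0): move V(s) a fraction alpha toward the one-step target
        V[s] += ALPHA * (reward + GAMMA * V[s_prime] - V[s])
        i = j

for _ in range(100):
    run_episode(V)
print(V)   # noisily approaches the value-iteration values .17 ... .83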
TD(λ) Random Walk: Eligibility Traces (γ = .9, α = .5)

                   A     B     C     D       E     F     G
Rewards           +0                                    +1
Eligibility        0     0     0   (λγ)²    λγ     1     0
TD(1)              0     0     0     0       0    .5     0

[Animation: each state the agent visits has its eligibility set to 1, and every trace decays by a factor of λγ per step, so after passing through D, E, F the traces are (λγ)², λγ, 1. When a TD error occurs, every state is updated in proportion to its eligibility, so earlier states share in the update; the slides note one such earlier state being changed by -.025.]
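A minimal sketch of TD(λ) with replacing eligibility traces on the same random walk; as before, the episode bookkeeping is an assumption, and the trace handling follows the decay rule shown on the slides.

import random

ALPHA, GAMMA, LAMBDA = 0.5, 0.9, 1.0
STATES = list("ABCDEFG")
V = {s: 0.0 for s in STATES}

def run_episode_td_lambda(V, start="D"):
    """One episode of TD(lambda) with replacing eligibility traces."""
    eligibility = {s: 0.0 for s in STATES}
    i = STATES.index(start)
    while STATES[i] not in ("A", "G"):
        j = i + random.choice([-1, 1])
        reward = 1.0 if STATES[j] == "G" else 0.0
        s, s_prime = STATES[i], STATES[j]
        td_error = reward + GAMMA * V[s_prime] - V[s]
        eligibility[s] = 1.0                                    # the state just visited is fully eligible
        for state in STATES:
            V[state] += ALPHA * td_error * eligibility[state]   # every eligible state shares the error
            eligibility[state] *= LAMBDA * GAMMA                # traces decay by lambda * gamma each step
        i = j

run_episode_td_lambda(V)
print(V)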
Agenda
History Overview
Value Iteration
Policy Iteration
TD(Lambda)
Q-Learning
The Q Table (γ = .9, α = .5)

State    a0     a1     a2     a3
  0       0      0      0      0
  1       0      0      0      0
  2       0      0      0      0
  3       0      0      0      0

[Animation: the table starts at all zeros. After the agent's first step from State 0 lands in the -1 hole, the corresponding entry in State 0's row is updated to -.5 (that is, α ∙ -1).]
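That -.5 comes from the standard Q-learning update, Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]. Below is a minimal sketch on a 4-state, 4-action table; the specific transition used in the example is an assumption.

import numpy as np

GAMMA, ALPHA = 0.9, 0.5
q_table = np.zeros((4, 4))                  # 4 states x 4 actions, all zeros to start

def q_update(q_table, state, action, reward, new_state):
    """One Q-learning update toward the bootstrapped target."""
    target = reward + GAMMA * q_table[new_state].max()
    q_table[state, action] += ALPHA * (target - q_table[state, action])

# Example: from state 0, one action slides the agent into the -1 hole (state 2).
q_update(q_table, state=0, action=1, reward=-1, new_state=2)
print(q_table[0])                           # [ 0.  -0.5  0.   0. ]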
Deep Q Learning
# Greedy action selection from the Q table (fragment shown on the slide):
action_values = q_table[state_row]                                 # Q values for every action in this state
max_indexes = np.argwhere(action_values == action_values.max())    # indexes of the best action(s)
...
return action
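One way the elided lines might finish the selection, breaking ties among equally good actions at random and exploring with some probability; the get_action name and the random_rate handling are assumptions (a RANDOM_RATE constant does appear on the next slide).

import numpy as np

def get_action(q_table, state_row, random_rate=0.1):
    """Pick an action for the current state: mostly greedy, occasionally random."""
    if np.random.random() < random_rate:
        return np.random.randint(q_table.shape[1])                 # explore: any action
    action_values = q_table[state_row]
    max_indexes = np.argwhere(action_values == action_values.max()).flatten()
    action = np.random.choice(max_indexes)                         # break ties among the best actions at random
    return action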
Anatomy of an Agent
import gym

EPISODES = 1000
agent = AGENT(NUM_STATES, NUM_ACTIONS, DISCOUNT, LEARNING_RATE, RANDOM_RATE)
environment = gym.make('FrozenLake-v0')
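A sketch of how these pieces might be wired into a training loop, assuming the classic gym API used by FrozenLake-v0 (reset() returns the state; step() returns state, reward, done, info) and assuming the agent exposes get_action and update methods:

for episode in range(EPISODES):
    state = environment.reset()
    done = False
    while not done:
        action = agent.get_action(state)                        # choose an action from the Q table
        new_state, reward, done, info = environment.step(action)
        agent.update(state, action, reward, new_state)          # apply the Q-learning update
        state = new_state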
● Unknown Environment
● No knowledge or experience

Reward
● Distinguishes "bad" and "good" decisions
● Developer sets rewards and penalties
[Diagram: the slippery lake again; each path has a 33% probability and the holes carry a -1 penalty.]
Interaction ⇒ Knowledge
1. Agent
2. Environment
3. State
4. Reward
[Diagram: the Agent sends an Action through the Operator to the Environment and receives back a State and a Reward.]
DRL Trading Algorithm Components
1. Agent
2. Environment
3. State
4. Reward
[Diagram: the same Agent, Operator, Environment loop, now applied to trading.]
DRL Agent
● Agent = Trader
○ Brokerage Account
○ Monitors Market
○ Makes Trade Decisions

Agent/Algo Methodology
1. Make Trading Decision ⇒ Order (Filled or Not Filled?)
2. Assess New Market Conditions
3. Make Decision
⇒ New Order?
⇒ Change Order?
⇒ Do Nothing?
[Diagram: the Agent, Operator, Environment loop applied to managing orders.]
DRL Environment
● Market(s)
● Other agents (algos and humans)
[Diagram: Markets, Other Agents, and the Order Book make up the Environment the Agent observes.]

State
● Market Conditions (only partially knowable by Agent)
● Unknowable:
○ Number of other agents
○ Their actions and positions
○ Strategy
○ Implementation
● Human-machine symbiosis often breaks down and performs poorly
Builds on Successful ML Techniques
● One of the main challenges is selecting unbiased, representative financial data

Screencast