01 Module 1: Early Reinforcement Learning

Early Reinforcement Learning
Learning Objectives
● Understand the History of
Reinforcement Learning

○ Value Iteration

○ Policy Iteration

○ TD-Learning

○ Q-Learning
Agenda
History Overview

Value Iteration

Policy Iteration

TD(Lambda)

Q-Learning
An RL Timeline

1957: Value Iteration
1960: Policy Iteration
1988: TD-Learning
1989: Q-Learning
1992: TD-Gammon, REINFORCE
2013: DQNs
2016: A3C, AlphaGo

A Simple Story
Frozen Lake
Agenda
History Overview

Value Iteration

Policy Iteration

TD(Lambda)

Q-Learning
A Simpler Lake
Markov Decision Process (MDP)

[Diagram: taking action a2 from state 0 transitions to state 1 and returns a reward of +1, illustrating state, action, and reward.]
A Simpler Lake

[Diagram: a four-state lake MDP. States 0 and 1 sit on top, states 2 and 3 below. Action a2 connects states 0 and 1, action a1 leads down into state 2 (reward -1) and state 3 (reward +1), and action a0 is also available from state 0.]
Bellman Equation

V(s) = R(s,a) + γV(s′)

V(s): value of the current state. R(s,a): reward received. γV(s′): discounted value of the future state.
The Discount Factor (γ)

γ = 1:   $100 today, $100 tomorrow, $100 in 2 days, $100 in 3 days, $100 in 4 days
γ = .5:  $100 today, $50 tomorrow, $25 in 2 days, $12.50 in 3 days, $6.25 in 4 days
γ = 0:   $100 today, $0 tomorrow, $0 in 2 days, $0 in 3 days, $0 in 4 days
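As a quick check of the table above: each future payout is just scaled by γ raised to the number of days away. A minimal sketch (the function name is ours, not from the slides):

def discounted_value(amount, gamma, days_away):
    """Value today of a payout received `days_away` days from now."""
    return amount * gamma ** days_away

for gamma in (1, 0.5, 0):
    row = [discounted_value(100, gamma, d) for d in range(5)]
    print(gamma, row)
# gamma = 0.5 prints [100.0, 50.0, 25.0, 12.5, 6.25], matching the middle row.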
Bellman Equation

V(s) = R(s,a) + 𝛾V(s′)


The Policy

State → π → Action

The Policy

State → π* → Action
Bellman Equation

V^π*(s) = max_a{ R(s,a) + γV^π*(s′) }

(the max over actions and the optimal policy π* are the new additions)
Simple Lake Value

[Diagram: the four-state lake MDP described above, with action a2 between states 0 and 1 and action a1 leading to the -1 reward (state 2) and the +1 reward (state 3).]
Simple Lake Value

State Map:        Current Value:
0 1               0 0
2 3               0 0

Policy Map:       Prime Value:
? ?               0 0
- -               0 0
Simple Lake Value (filling in the policy and prime values)

With every current value still 0, the best action found for state 0 is a2 (toward state 1, avoiding the -1 in state 2), and the best action for state 1 is a1 (into state 3, which pays +1), so state 1's prime value becomes 1:

Policy Map:       Prime Value:
a2 a1             0 1
-  -              0 0
Simple Lake Value (γ = .9)

State Map:        Current Value:
0 1               0 0
2 3               0 0

Policy Map:       Prime Value:
a2 a1             0 1
-  -              0 0
Simple Lake Value (more accurate, γ = .9)

State Map:        Current Value:
0 1               0 0
2 3               0 0

Policy Map:       Prime Value:       Rewards:
a2 a1             0 0                0 1
-  -              0 0                0 0
Simple Lake Value (1 iteration)

State Map:        Current Value:
0 1               0 .9
2 3               0 0

Policy Map:       Prime Value:
a2 a1             0 0
-  -              0 0
Simple Lake Value (2 and 3 iterations)

State Map:        Current Value:
0 1               .81 .9
2 3               0   0

Policy Map:       Prime Value:
a2 a1             0 0
-  -              0 0
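A quick check of those numbers under this course's update (the code below multiplies the prime values by γ after each sweep): after one sweep state 1 is worth γ · 1 = .9, and after the next sweep state 0 is worth γ · (γ · 1) = .81. A tiny replay of the toy 2x2 example, with the dictionary bookkeeping being ours:

GAMMA = 0.9

v = {0: 0.0, 1: 0.0}           # non-terminal state values
for sweep in range(3):
    prime = {0: v[1], 1: 1.0}  # best neighbor: state 0 looks at state 1, state 1 at the +1 goal
    v = {s: GAMMA * p for s, p in prime.items()}
    print(sweep + 1, v)
# After sweep 1: {0: 0.0, 1: 0.9}; after sweeps 2 and 3: {0: 0.81, 1: 0.9}, matching the slides.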
Value Iteration Code
import numpy as np

LAKE = np.array([[0, 0, 0, 0],
                 [0, -1, 0, -1],
                 [0, 0, 0, -1],
                 [-1, 0, 0, 1]])
LAKE_WIDTH = len(LAKE[0])
LAKE_HEIGHT = len(LAKE)

DISCOUNT = .9  # Change me to be a value between 0 and 1.
DELTA = .0001  # I must be sufficiently small.

current_values = np.zeros_like(LAKE, dtype=float)  # float, so fractional values survive
change = np.inf  # make sure the loop runs at least once

while change > DELTA:
    prime_values, policies = iterate_value(current_values)
    old_values = np.copy(current_values)
    current_values = DISCOUNT * prime_values
    change = np.sum(np.abs(old_values - current_values))
Value Iteration Code
def iterate_value(current_values):
    """Finds the future state values for an array of current states.

    Args:
        current_values (float array): the value of current states.

    Returns:
        prime_values (float array): The value of states based on future states.
        policies (int array): The recommended action to take in a state.
    """
    prime_values = []
    policies = []

    for state in STATE_RANGE:
        value, policy = get_max_neighbor(state, current_values)
        prime_values.append(value)
        policies.append(policy)

    prime_values = np.array(prime_values).reshape((LAKE_HEIGHT, LAKE_WIDTH))
    return prime_values, policies
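The loop above relies on STATE_RANGE and get_max_neighbor, which the slides reference but do not show. A minimal sketch of what they might look like for this grid, assuming four moves (left, down, right, up) and that stepping off the grid keeps the agent in place; the details are ours, not the course's:

STATE_RANGE = range(LAKE_HEIGHT * LAKE_WIDTH)
ACTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # (dx, dy): left, down, right, up

def get_max_neighbor(state, current_values):
    """Returns the best neighboring value and the action that reaches it (sketch)."""
    state_x, state_y = state % LAKE_WIDTH, state // LAKE_WIDTH
    if LAKE[state_y, state_x]:                # terminal cell: its value is its reward
        return LAKE[state_y, state_x], -1
    best_value, best_policy = -np.inf, -1
    for policy, (dx, dy) in enumerate(ACTIONS):
        x = min(max(state_x + dx, 0), LAKE_WIDTH - 1)    # clip to the grid
        y = min(max(state_y + dy, 0), LAKE_HEIGHT - 1)
        neighbor_value = LAKE[y, x] if LAKE[y, x] else current_values[y, x]
        if neighbor_value > best_value:
            best_value, best_policy = neighbor_value, policy
    return best_value, best_policy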
Value Iteration Code

[Figure: running the code on the 4x4 lake. The value map is shown after iterations 1 through 6, with values such as .81 and .9 appearing next to the goal, alongside the resulting optimal policy.]
Agenda
History Overview

Value Iteration

Policy Iteration

TD(Lambda)

Q-Learning
Probabilities and Slipping

[Figure: on the slippery lake, each intended move has only a 33% chance of going as planned; the agent slips to either side with 33% probability each.]
Slippery Simple Lake

[Diagram: the simple lake MDP with slipping. Each action now splits into three possible paths of 33% each, so more than one action can end up in state 2 and collect its -1 reward.]
Bellman Equation (with transition probabilities)

V^π*(s) = max_a{ R(s,a) + γ · Σ_s′ P(s′|s,a) · V^π*(s′) }

(the probability-weighted sum over future states is the new addition)
Weighting State Prime

Σ_s′ P(s′|s,a) V^π*(s′)

Because of slipping, each action has three possible landing states (counter clockwise, forward, clockwise), each with probability 33%:

Action   Counter Clockwise   Forward   Clockwise
a0       s2                  s0        s0
a1       s1                  s2        s0
a2       s0                  s1        s2
a3       s0                  s0        s1
Weighting State Prime

Σ_s′ P(s′|s,a) V^π*(s′) = .33 · V(Counter Clockwise) + .33 · V(Forward) + .33 · V(Clockwise)

Action   Counter Clockwise   Forward   Clockwise   V(CCW)   V(Forward)   V(CW)   Weighted Total
a0       s2                  s0        s0          -1       0            0       -.33
a1       s1                  s2        s0          0        -1           0       -.33
a2       s0                  s1        s2          0        0            -1      -.33
a3       s0                  s0        s1          0        0            0       0
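A minimal sketch of that weighting for the table above. The state values and transition outcomes are the toy ones from the slide; the helper name and dictionaries are ours:

SLIP_PROB = 1 / 3  # counter clockwise, forward, and clockwise are equally likely

# Value of each possible landing state in the toy example: s2 is the -1 hole.
state_values = {"s0": 0.0, "s1": 0.0, "s2": -1.0}

# Where each action can land: (counter clockwise, forward, clockwise).
outcomes = {
    "a0": ("s2", "s0", "s0"),
    "a1": ("s1", "s2", "s0"),
    "a2": ("s0", "s1", "s2"),
    "a3": ("s0", "s0", "s1"),
}

def weighted_value(action):
    """Sum over s' of P(s'|s,a) * V(s') for one action."""
    return sum(SLIP_PROB * state_values[s_prime] for s_prime in outcomes[action])

for action in outcomes:
    print(action, round(weighted_value(action), 2))
# a0, a1, and a2 each come out to about -.33; a3 comes out to 0, matching the table.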
Value Iteration Complexity

O(s²as′)

For each state, compare each action by weighting each possible new state, and repeat up to the total number of states.

[Diagram: states s0, s1, s2 branching into actions a0 and a1, each branching into weighted next states s′0 and s′1.]
Policy Iteration

Policy Iteration starts from an initial policy (here action 1, a1, in both states), evaluates it, and then improves it:

State Map:        Current Value:
0 1               0 0
2 3               0 0

Policy Map:       Prime Value:
1 1               0 0
- -               0 0
Policy Iteration

Evaluating that policy: a1 drops state 0 into the -1 and state 1 into the +1, so the prime values are -1 and 1:

State Map:        Current Value:
0 1               0 0
2 3               0 0

Policy Map:       Prime Value:
1 1               -1 1
- -                0 0
Policy Iteration (γ = .9)

State Map:        Current Value:
0 1               0 0
2 3               0 0

Policy Map:       Prime Value:
1 1               -1 1
- -                0 0
Policy Iteration

Applying the discount gives current values of -.9 and .9:

State Map:        Current Value:
0 1               -.9 .9
2 3                0  0

Policy Map:       Prime Value:
1 1               0 0
- -               0 0
Policy Iteration

The improvement step then switches state 0 to action 2 (a2, toward state 1), while state 1 keeps action 1:

State Map:        Current Value:
0 1               -.9 .9
2 3                0  0

Policy Map:       Prime Value:
2 1               -1 1
- -                0 0
Policy Iteration (Iteration 2)

State Map:        Current Value:
0 1               .81 .9
2 3               0   0

Policy Map:       Prime Value:
2 1               0 0
- -               0 0
Modified Policy Iteration Code
def iterate_policy(current_values, current_policies):
    """Finds the future state values for an array of current states.

    Args:
        current_values (float array): the value of current states.
        current_policies (int array): a list where each cell is the recommended
            action for the state matching its index.

    Returns:
        next_values (float array): The value of states based on future states.
        next_policies (int array): The recommended action to take in a state.
    """
    next_values = find_future_values(current_values, current_policies)
    next_policies = find_best_policy(next_values)
    return next_values, next_policies
Modified Policy Iteration Code
def find_future_values(current_values, current_policies):
    """Finds the next set of future values based on the current policy."""
    next_values = []

    for state in STATE_RANGE:
        current_policy = current_policies[state]
        state_x, state_y = get_state_coordinates(state)

        # If the cell has something other than 0, it's a terminal state.
        value = LAKE[state_y, state_x]
        if not value:
            value = get_neighbor_value(
                state_x, state_y, current_values, current_policy)
        next_values.append(value)

    return np.array(next_values).reshape((LAKE_HEIGHT, LAKE_WIDTH))
Modified Policy Iteration Code
def find_best_policy(next_values):
    """Finds the best policy given a value mapping."""
    next_policies = []

    for state in STATE_RANGE:
        state_x, state_y = get_state_coordinates(state)

        # No policy or best value yet.
        max_value = -np.inf
        best_policy = -1

        if not LAKE[state_y, state_x]:
            for policy in ACTION_RANGE:
                neighbor_value = get_neighbor_value(
                    state_x, state_y, next_values, policy)
                if neighbor_value > max_value:
                    max_value = neighbor_value
                    best_policy = policy

        next_policies.append(best_policy)
    return next_policies
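The slides show the per-iteration functions but not the outer loop that calls them. A minimal sketch of what that driver might look like, assuming the same γ-scaling convention as the value iteration driver above and stopping when the policy no longer changes (the convergence condition stated in the comparison table below):

current_values = np.zeros((LAKE_HEIGHT, LAKE_WIDTH), dtype=float)
current_policies = [0] * (LAKE_HEIGHT * LAKE_WIDTH)  # start from an arbitrary policy

while True:
    next_values, next_policies = iterate_policy(current_values, current_policies)
    if next_policies == current_policies:   # converged: the policy stopped changing
        break
    current_values = DISCOUNT * next_values
    current_policies = next_policies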
Modified Policy Iteration Complexity

O(s²s′ + s²as′)

The first term: we still need the weighted sum over future states to calculate each state's value. The second term: finding the new policy is pretty much the same work as Value Iteration.
Value Iteration vs Policy Iteration

[Figure: side-by-side runs on the 4x4 lake. Value Iteration needs around 7 iterations of gradually refined value maps to reach the optimal policy; Policy Iteration arrives at the same optimal policy in noticeably fewer iterations.]
Value Iteration vs Policy Iteration

Property                           Value Iteration          Policy Iteration
Mathematically precise             ✓                        x
Fewer iterations                   x                        ✓
Less computation per iteration     ✓                        x
Convergence condition              Little change in value   No change in policy


Agenda
History Overview

Value Iteration

Policy Iteration

TD(Lambda)

Q-Learning
An RL Timeline

1957: Value Iteration
1960: Policy Iteration
1988: TD-Learning
1989: Q-Learning
1992: TD-Gammon, REINFORCE
2013: DQNs
2016: A3C, AlphaGo


A Random Walk

Tails → step Left (50%), Heads → step Right (50%)

                 A    B    C    D    E    F    G
Rewards          +0                            +1
γ = 1

Value Iteration  0    .17  .33  .5   .67  .83  0
TD(0)

V(s) = R(s,a) + γV(s′)

V(s_{t-1}) = V(s_{t-1}) + α_t( R(s_{t-1}, a) + γ·V(s_t) - V(s_{t-1}) )

New variable: α, the learning rate.
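A minimal sketch of that update as code. The function name and the dictionary-based value table are ours, not the course's:

def td0_update(values, prev_state, reward, state, alpha=0.5, gamma=1.0):
    """One TD(0) backup: move V(prev_state) toward reward + gamma * V(state)."""
    td_target = reward + gamma * values[state]
    td_error = td_target - values[prev_state]
    values[prev_state] += alpha * td_error
    return values

values = {s: 0.0 for s in "ABCDEFG"}
td0_update(values, prev_state="F", reward=1, state="G")
print(values["F"])  # 0.5, the first update shown on the random walk slides below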
TD(0) Random Walk (γ = 1, α = .5)

Tails → step Left (50%), Heads → step Right (50%)

                 A    B    C    D    E    F    G
Rewards          +0                            +1
Value Iteration  0    .17  .33  .5   .67  .83  0
TD(0)            0    0    0    0    0    0    0
TD(0) Random Walk (γ = 1, α = .5)

Starting from D, the walk wanders left and right (Tails → left, Heads → right); none of the estimates change until it steps from F into the terminal state G and collects the +1 reward:

V(s_{t-1}) = V(s_{t-1}) + α_t( R(s_{t-1}, a) + γ·V(s_t) - V(s_{t-1}) )
V(F)       = 0 + .5 · ( 1 + 1·0 - 0 ) = .5

                 A    B    C    D    E    F    G
TD(0)            0    0    0    0    0    .5   0
TD(0) Random Walk (γ = 1, α = .5), second episode

A second walk from D again changes nothing until a Heads step from E into F pulls V(E) toward V(F) = .5:

V(E) = 0 + .5 · ( 0 + 1 · .5 - 0 ) = .25

                 A    B    C    D    E    F    G
TD(0)            0    0    0    0    .25  .5   0
A Tails step from F back to E then updates V(F) toward V(E):

V(s_{t-1}) = V(s_{t-1}) + α_t( R(s_{t-1}, a) + γ·V(s_t) - V(s_{t-1}) )
V(F)       = .5 + .5 · ( 0 + 1 · .25 - .5 ) = .375

                 A    B    C    D    E    F     G
TD(0)            0    0    0    0    .25  .375  0


Another Tails step, from E to D, updates V(E) toward V(D) = 0:

V(E) = .25 + .5 · ( 0 + 1·0 - .25 ) = .125

                 A    B    C    D    E     F     G
TD(0)            0    0    0    0    .125  .375  0


After further steps and episodes, the estimates propagate back through the earlier states:

                 A    B     C     D     E     F     G
Value Iteration  0    .17   .33   .5    .67   .83   0
TD(0)            0    .016  .031  .063  .125  .375  0
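For completeness, a small sketch that runs many episodes of this walk, applying the same TD(0) backup inline. With enough episodes the estimates approach the Value Iteration row (roughly .17 through .83). The episode count, step size, and seed are arbitrary choices of ours, not from the slides:

import random

def run_td0_random_walk(episodes=5000, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) on the A..G random walk: +1 for exiting right, 0 for exiting left (sketch)."""
    random.seed(seed)
    states = list("ABCDEFG")
    values = {s: 0.0 for s in states}
    for _ in range(episodes):
        i = states.index("D")                      # every episode starts in the middle
        while 0 < i < len(states) - 1:
            j = i + random.choice((-1, 1))         # fair coin: left or right
            reward = 1.0 if states[j] == "G" else 0.0
            values[states[i]] += alpha * (reward + gamma * values[states[j]] - values[states[i]])
            i = j
    return values

print(run_td0_random_walk())  # B..F settle near .17, .33, .5, .67, .83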


TD(1) Random Walk (γ = .9, α = .5)

                 A    B    C    D    E    F    G
Rewards          +0                            +1
Eligibility      0    0    0    1    0    0    0
TD(1)            0    0    0    0    0    .5   0

After a step from D to E, D's eligibility decays by γ and E's is set to 1:

                 A    B    C    D    E    F    G
Eligibility      0    0    0    γ    1    0    0
TD(1)            0    0    0    0    0    .5   0
The next TD error is applied to every state in proportion to its eligibility:

                 A    B    C    D     E     F    G
Eligibility      0    0    0    γ     1     0    0
TD(1)            0    0    0    .202  .225  .5   0
The traces decay again as the walk moves on to F:

                 A    B    C    D     E     F     G
Eligibility      0    0    0    γ²    γ     1     0
TD(1)            0    0    0    .202  .225  .5    0

The next update then touches D, E, and F together:

                 A    B    C    D     E     F     G
Eligibility      0    0    0    γ²    γ     1     0
TD(1)            0    0    0    .182  .202  .475  0

(F changed by -.025; D and E changed by proportionally less, scaled by their eligibilities.)
TD(λ) Random Walk (λ = 1, γ = .9, α = .5)

With TD(λ), eligibility decays by λγ instead of γ; at λ = 1 this matches TD(1) exactly:

                 A    B    C    D      E     F     G
Rewards          +0                                +1
Eligibility      0    0    0    (λγ)²  λγ    1     0
TD(1)            0    0    0    .182   .202  .475  0

(Changed by -.025)
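A minimal sketch of this accumulating-trace update for a single step. The function and variable names are ours; the slides only show the resulting numbers:

def td_lambda_step(values, eligibility, prev_state, reward, state,
                   alpha=0.5, gamma=0.9, lam=1.0):
    """One TD(lambda) step: decay all traces, bump the trace of the state just left,
    then apply the TD error to every state in proportion to its trace (sketch)."""
    for s in eligibility:
        eligibility[s] *= lam * gamma            # decay existing traces
    eligibility[prev_state] = 1.0                # the state we just left is fully eligible
    td_error = reward + gamma * values[state] - values[prev_state]
    for s in values:
        values[s] += alpha * td_error * eligibility[s]
    return values, eligibility

states = list("ABCDEFG")
values = {s: 0.0 for s in states}
values["F"] = 0.5
eligibility = {s: 0.0 for s in states}
eligibility["D"] = 1.0                           # as on the first TD(1) slide
# Step from E to F: D's trace decays to gamma, E's becomes 1, and the TD error
# (0 + .9 * .5 - 0 = .45) updates E and D together.
values, eligibility = td_lambda_step(values, eligibility, prev_state="E", reward=0.0, state="F")
print(values["E"], values["D"])  # approximately 0.225 and 0.2025, matching the slide's .225 and .202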
Agenda
History Overview

Value Iteration

Policy Iteration

TD(Lambda)

Q-Learning
The Q Table

Q-table
State   Left   Down   Right   Up
0       0      0      0       0
1       0      0      0       0
2       0      0      0       0
3       0      0      0       0
The Q Table (γ = .9, α = .5)

After one Down move from state 0, the table records a value of -.5 for that state-action pair:

Q-table
State   Left   Down   Right   Up
0       0      -.5    0       0
1       0      0      0       0
2       0      0      0       0
3       0      0      0       0
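A quick check of where that -.5 comes from, using the Q-learning update on the next slide with the slide's γ = .9 and α = .5, and assuming the Down move from state 0 earned a -1 reward:

alpha, gamma = 0.5, 0.9
q_old = 0.0                      # Q(0, Down) before the update
reward = -1.0                    # assumed: stepping Down from state 0 hits a -1 cell
max_q_next = 0.0                 # all Q values in the next state are still 0
q_new = q_old + alpha * (reward + gamma * max_q_next - q_old)
print(q_new)                     # -0.5, the entry shown in the table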
Deep Q Learning

V(s_{t-1}) = V(s_{t-1}) + α_t( R(s_{t-1}, a_{t-1}) + γ·V(s_t) - V(s_{t-1}) )

Q(s_t, a_t) = Q(s_t, a_t) + α_t( r_t + γ·max_a{ Q(s_{t+1}, a) } - Q(s_t, a_t) )
Anatomy of an Agent
class Agent():
    def __init__(self, num_states, num_actions, discount, learning_rate):
        ...

    def update_q(self, state, action, reward, state_prime):
        ...

    def act(self, state):
        ...
Anatomy of an Agent
class Agent():
    def __init__(self, num_states, num_actions, discount, learning_rate):
        self.discount = discount
        self.learning_rate = learning_rate
        self.q_table = np.zeros((num_states, num_actions))

    def update_q(self, state, action, reward, state_prime):
        ...

    def act(self, state):
        ...
Anatomy of an Agent
class Agent():
    def __init__(self, num_states, num_actions, discount, learning_rate):
        self.discount = discount
        self.learning_rate = learning_rate
        self.q_table = np.zeros((num_states, num_actions))

    def update_q(self, state, action, reward, state_prime):
        alpha = self.learning_rate
        future_value = reward + self.discount * np.max(self.q_table[state_prime])
        old_value = self.q_table[state, action]
        self.q_table[state, action] = old_value + alpha * (future_value - old_value)

    def act(self, state):
        ...
Anatomy of an Agent
class Agent():
    def __init__(self, num_states, num_actions, discount, learning_rate):
        self.discount = discount
        self.learning_rate = learning_rate
        self.q_table = np.zeros((num_states, num_actions))

    def update_q(self, state, action, reward, state_prime):
        alpha = self.learning_rate
        future_value = reward + self.discount * np.max(self.q_table[state_prime])
        old_value = self.q_table[state, action]
        self.q_table[state, action] = old_value + alpha * (future_value - old_value)

    def act(self, state):
        # Break ties randomly between equally valued actions.
        action_values = self.q_table[state]
        max_indexes = np.argwhere(action_values == action_values.max())
        max_indexes = np.squeeze(max_indexes, axis=-1)
        action = np.random.choice(max_indexes)
        return action
On Purpose Mistakes?
Anatomy of an Agent
class Agent():
    def __init__(self, ..., learning_rate, random_rate):
        ...
        self.num_actions = num_actions
        self.random_rate = random_rate  # I'm between 0 and 1.

    def update_q(self, state, action, reward, state_prime):
        ...

    def act(self, state, training=True):
        # With probability random_rate, explore by taking a random action.
        if random.random() < self.random_rate and training:
            return random.randint(0, self.num_actions - 1)

        action_values = self.q_table[state]
        max_indexes = np.argwhere(action_values == action_values.max())
        ...
        return action
Anatomy of an Agent
EPISODES = 1000
agent = Agent(NUM_STATES, NUM_ACTIONS, DISCOUNT, LEARNING_RATE, RANDOM_RATE)
environment = gym.make('FrozenLake-v0')

def play_game(environment, agent):
    state = environment.reset()
    done = False

    while not done:
        action = agent.act(state)
        state_prime, reward, done, info = environment.step(action)
        agent.update_q(state, action, reward, state_prime)
        state = state_prime

for episode in range(EPISODES):
    play_game(environment, agent)
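One natural follow-up, not shown on the slides, is to check how often the trained agent actually reaches the goal, using the training=False path of act so it stops exploring. A sketch under those assumptions:

def evaluate(environment, agent, episodes=100):
    """Fraction of evaluation episodes that end with the +1 reward (sketch)."""
    wins = 0
    for _ in range(episodes):
        state = environment.reset()
        done = False
        while not done:
            action = agent.act(state, training=False)   # greedy: no random exploration
            state, reward, done, info = environment.step(action)
        wins += reward                                   # FrozenLake pays 1 only at the goal
    return wins / episodes

print(evaluate(environment, agent))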
File Name: T-AIFORF-I-p3_M1_l10_benefits_of_using_reinforcement_learning_in_your_trading_strategy_part1

Content Type: Video - Lecture Presenter

Presenter: Jack Farmer


Benefits of Reinforcement
Learning in Your Trading
Strategy
Learning Objectives
● Understand the difference
between deep learning (DL) and
deep reinforcement learning
(DRL)
● Identify the components of a
deep reinforcement learning
trading strategy
● Identify the advantages of DRL
that can help it improve the
efficiency and performance of
quantitative strategies
Agenda
What is Deep Reinforcement
Learning?

How to Use DRL in Trading


Strategies

DRL Advantages for Strategy


Efficiency and Performance
What is DRL?
● Naive Agent

● Unknown Environment

● No knowledge or experience

● Goal is to collect information


by taking actions
DRL Agent
● Tests state spaces
● Action ⇒ Reaction? = New State?
● Needs input to distinguish between "bad" and "good" decisions
● Developer sets rewards and penalties
● Interaction ⇒ Knowledge ⇒ Better Decisions ⇒ Max Reward

[Diagram: the slippery lake MDP from earlier, used to illustrate state, action, reward, and path probabilities.]
DRL Agent vs DL Agent
● DRL Agents given a high degree of
freedom

● Build on and develop initial logic


based on experience

● Become independent operators with


their own experience-based logic

● Can extend beyond developer’s


knowledge and solve more
complex problems
Agenda
What is Deep Reinforcement
Learning?

How to Use DRL in Trading


Strategies

DRL Advantages for Strategy


Efficiency and Performance
Trading Challenges
● Strategies require error-free handling of large volumes of data
● Agents' actions may result in longer-term consequences that other ML techniques are unable to measure
● Those actions also have short-term impacts on current market conditions, which makes the trading environment highly unpredictable
DRL Trading Algorithm Components

1. Agent
2. Environment
3. State
4. Reward

[Diagram: the agent-environment loop. The Agent sends an Action to the Environment; the Environment returns State and Reward to the Agent, with an Operator shown alongside the Reward.]
DRL Agent
● Agent = Trader
● Has access to a brokerage account
● Monitors market conditions
● Makes trading decisions
Agent/Algo Methodology
1. Make trading decision ⇒ order filled or not filled?
2. Assess new market conditions
3. Make a decision ⇒ new order? change order? do nothing?
DRL Environment
● Market(s)
● Other agents (algos and humans)
● Order book (public liquidity)
● Order execution strategies (hidden liquidity)
State
● Market conditions (only partially knowable by the Agent)
● Unknowable:
  ○ The number of other agents
  ○ Their actions and positions
  ○ Their order specifications
● Advantage gained from private information or tech superiority
DRL Reward
● Reward specification is key to the success of the trading algo
● Absolute reward maximization ⇒ high PnL volatility ⇒ unmanageable drawdowns
● The default optimization target is the Sharpe Ratio: Strategy Return / PnL Volatility
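A minimal sketch of that reward choice: scoring a window of strategy returns by their Sharpe ratio instead of their raw sum. The function name, the annualization factor, and the sample returns are ours, not from the lecture:

import numpy as np

def sharpe_reward(returns, periods_per_year=252, eps=1e-9):
    """Sharpe-style reward: mean return over PnL volatility, annualized (sketch)."""
    returns = np.asarray(returns, dtype=float)
    volatility = returns.std() + eps          # PnL volatility; eps avoids division by zero
    return np.sqrt(periods_per_year) * returns.mean() / volatility

daily_pnl = [0.002, -0.001, 0.0015, 0.0005, -0.0005]   # made-up daily strategy returns
print(round(sharpe_reward(daily_pnl), 2))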
File Name: T-AIFORF-I-p3_M1_l11_benefits_of_using_reinforcement_learning_in_your_trading_strategy_part2

Content Type: Video - Lecture Presenter

Presenter: Jack Farmer


Agenda
What is Deep Reinforcement
Learning?

How to Use DRL in Trading


Strategies

DRL Advantages for Strategy


Efficiency and Performance
DRL’s Key Advantages
1. The self-learning process is a good
match for a rapidly evolving market
environment

2. Brings more power and efficiency to


a dense and complex state space

3. It builds on machine learning


techniques that have already proven
successful in a variety of markets
Good Match for Markets
● Financial markets are dynamic and
turbulent structures
● Increased volatility and unstable
liquidity lead to periodic flash crashes
● Complex quantitative strategies and
technologically enhanced participants
create short-lived, hard to identify
patterns
● Historical data quickly becomes
irrelevant for predicting current
market movements
Good Match for Markets
● Even the most successful trading firms are being forced to adapt
● RenTech's RIDA fund has reduced the use of pattern-based strategies by over 60%¹
● Other hedge funds have also given up trend following as they struggle to replicate past returns

¹ Hedgefundresearch.com 2019
Good Match for Markets
● Automated strategies must be
flexible and not completely
dependent on past data

● DRL can learn on the go by doing,


just like humans, but faster

● DRL algos are getting better at taking


real-time decisions based on current
market conditions and the
immediate results of their actions
Power and Efficiency
● Traders must factor in many market
variables to make the set of
interconnected decisions that
comprise an order

● Price, size, order time, duration, and


type require decisions on:

○ What price to buy/sell?


○ What quantity?
○ How many orders?
○ Sequentially or simultaneously?
Power and Efficiency
● A medium frequency trading algo will reconsider its options every second*
● Each action results in orders with unique characteristics
● Financial markets are too complex for straightforward algorithms
● Their action space is continuously expanding with possible order combinations dependent on a dynamically changing market state

* "Idiosyncrasies and challenges of data driven learning in electronic trading" (JPM, November 30, 2018, https://arxiv.org/pdf/1811.09549.pdf)
Builds on Successful ML
Techniques
● Algo strategies consist of:

○ Strategy

○ Implementation

● Designed by trader and


implemented by a machine

● Human-machine symbiosis
often breaks down and
performs poorly
Builds on Successful ML
Techniques
● One of the main challenges is selecting unbiased, representative financial data
● Although widely recognized, this task is often poorly implemented (usually by the trader)
● With advances in DRL we are getting closer to an autonomous machine in charge of both strategy and implementation
Remaining Challenges to Creating a DRL Trader
● DRL still requires millions of test scenarios to trade profitably and is dependent on an operator to structure rewards
● Reward design is tricky and has the potential to make or break a trading system
● Still, we are closer to full automation than ever before
Lab
Use Deep Q Framework
for a Buy/Sell Strategy
Lab Objectives


Screencast
