Policy-Based Reinforcement Learning

Shusen Wang
Policy Function Approximation
Policy Function π(a|s)

• The policy function π(a|s) is a probability density function (PDF).
• It takes the state s as input.
• It outputs a probability for each action, e.g.,
  π(left|s) = 0.2,
  π(right|s) = 0.1,
  π(up|s) = 0.7.
• The action a is then randomly drawn from this distribution.
Can we directly learn a policy function π(a|s)?

• If there are only a few states and actions, then yes, we can.
• Draw a table (matrix) and learn the entries:

            | Action a_1 | Action a_2 | Action a_3 | Action a_4 | ⋯
  State s_1 |            |            |            |            |
  State s_2 |            |            |            |            |
  State s_3 |            |            |            |            |

• What if there are too many (or infinitely many) states or actions?

Policy Network π(a|s; θ)

Policy network: use a neural net to approximate π(a|s).
• Use the policy network π(a|s; θ) to approximate π(a|s).
• θ: the trainable parameters of the neural net.

[Figure: the state s_t is passed through Conv layers to extract a feature, then a Dense layer and a Softmax output the action probabilities, e.g., "left": 0.2, "right": 0.1, "up": 0.7.]
• Σ_{a∈𝒜} π(a|s; θ) = 1.
• Here, 𝒜 = {"left", "right", "up"} is the set of all actions.
• That is why we use the softmax activation.
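As a concrete illustration, here is a minimal sketch of such a policy network in PyTorch (an assumption; the slides do not name a framework), with the Conv feature extractor simplified to a small fully connected layer:

```python
# A minimal sketch of a policy network pi(a|s; theta): state in, softmax probabilities out.
# The dimensions (state_dim=4, 3 actions) are made-up values for illustration.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),    # stands in for the Conv "feature" stage
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # the Dense layer
            nn.Softmax(dim=-1),              # probabilities sum to 1 over all actions
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)               # pi(. | s; theta)

# Sample an action a ~ pi(. | s; theta) from the output distribution.
policy = PolicyNetwork(state_dim=4, num_actions=3)
s = torch.randn(1, 4)                        # a made-up state
probs = policy(s)                            # e.g. [[0.2, 0.1, 0.7]]
action = torch.distributions.Categorical(probs).sample()
```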
State-Value Function Approximation
Action-Value Function

Definition: Discounted return.
• U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + γ³·R_{t+3} + ⋯

• The return depends on the actions A_t, A_{t+1}, A_{t+2}, ⋯ and the states S_t, S_{t+1}, S_{t+2}, ⋯
• Actions are random: ℙ(A = a | S = s) = π(a|s). (Policy function.)
• States are random: ℙ(S′ = s′ | S = s, A = a) = p(s′|s, a). (State transition.)
Definition: Action-value function.
• Q_π(s_t, a_t) = 𝔼[U_t | S_t = s_t, A_t = a_t].
• The expectation is taken w.r.t. the actions A_{t+1}, A_{t+2}, A_{t+3}, ⋯ and the states S_{t+1}, S_{t+2}, S_{t+3}, ⋯
State-Value Function

Definition: State-value function.
• V_π(s_t) = 𝔼_A[Q_π(s_t, A)] = Σ_a π(a|s_t) · Q_π(s_t, a).
• This integrates out the action A ~ π(·|s_t).
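To make the definition concrete, here is a small numerical sketch (with made-up values for π(·|s) and Q_π(s, ·)) showing that the weighted sum Σ_a π(a|s)·Q_π(s, a) is exactly the expectation of Q_π(s, A) over actions sampled from the policy:

```python
# V_pi(s) = E_A[ Q_pi(s, A) ] = sum_a pi(a|s) * Q_pi(s, a), checked by Monte-Carlo sampling.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.1, 0.7])        # pi(a|s) for actions {left, right, up}
q = np.array([1.0, -0.5, 2.0])        # assumed Q_pi(s, a) values

v_exact = np.sum(pi * q)              # integrate out the action: V_pi(s)

# The same quantity as an average of Q_pi(s, A) over sampled actions A ~ pi(.|s).
actions = rng.choice(len(pi), size=100_000, p=pi)
v_mc = q[actions].mean()

print(v_exact, v_mc)                  # v_mc approaches v_exact as the sample grows
```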


Policy-Based Reinforcement Learning

Definition: State-value function.
• V_π(s_t) = 𝔼_A[Q_π(s_t, A)] = Σ_a π(a|s_t) · Q_π(s_t, a).

Approximate the state-value function:
• Approximate the policy function π(a|s_t) by the policy network π(a|s_t; θ).
• Approximate the value function V_π(s_t) by
  V(s_t; θ) = Σ_a π(a|s_t; θ) · Q_π(s_t, a).
Policy-based learning: learn θ that maximizes J(θ) = 𝔼_S[V(S; θ)].

How to improve θ? Policy gradient ascent!
• Observe a state s.
• Update the policy by: θ ← θ + β · ∂V(s; θ)/∂θ.
• The derivative ∂V(s; θ)/∂θ is called the policy gradient.
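Here is a minimal numerical sketch of one such ascent step, assuming (hypothetically) that the true Q_π(s, a) values for the current state are known and that θ parameterizes the action probabilities directly through a softmax:

```python
# One step of policy gradient ascent on V(s; theta) = sum_a pi(a|s; theta) * Q_pi(s, a).
import torch

beta = 0.1                                  # learning rate for gradient ascent
q_values = torch.tensor([1.0, -0.5, 2.0])   # assumed (made-up) Q_pi(s, a) for 3 actions
theta = torch.zeros(3, requires_grad=True)  # logits: pi(a|s; theta) = softmax(theta)

V = torch.sum(torch.softmax(theta, dim=-1) * q_values)  # V(s; theta)
V.backward()                                            # compute dV(s; theta)/dtheta

with torch.no_grad():
    theta += beta * theta.grad              # ASCENT: theta <- theta + beta * dV/dtheta
    theta.grad.zero_()                      # clear the gradient before the next step
```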
Policy Gradient

Reference

• Sutton et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 2000.
Definition: Approximate state-value function.
• V(s; θ) = Σ_a π(a|s; θ) · Q_π(s, a).

Policy gradient: the derivative of V(s; θ) w.r.t. θ.

• ∂V(s; θ)/∂θ = ∂[ Σ_a π(a|s; θ) · Q_π(s, a) ] / ∂θ
              = Σ_a ∂[ π(a|s; θ) · Q_π(s, a) ] / ∂θ        (Push the derivative inside the summation.)
              = Σ_a ∂π(a|s; θ)/∂θ · Q_π(s, a).             (Pretend Q_π is independent of θ; this may not be true.)
The last expression is the policy gradient.

Note: This derivation is over-simplified and not rigorous.
The policy gradient can also be written as an expectation:

• ∂V(s; θ)/∂θ = Σ_a ∂π(a|s; θ)/∂θ · Q_π(s, a)
              = Σ_a π(a|s; θ) · ∂log π(a|s; θ)/∂θ · Q_π(s, a)
              = 𝔼_A[ ∂log π(A|s; θ)/∂θ · Q_π(s, A) ].

• Chain rule: ∂log π(θ)/∂θ = (1/π(θ)) · ∂π(θ)/∂θ.
• ⇒ π(θ) · ∂log π(θ)/∂θ = π(θ) · (1/π(θ)) · ∂π(θ)/∂θ = ∂π(θ)/∂θ, which justifies the second equality.

The expectation is taken w.r.t. the random variable A ~ π(·|s; θ).
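The sketch below checks on a toy problem that the two forms of the policy gradient agree: the derivative of Σ_a π(a|s; θ)·Q_π(s, a) (via autograd) versus the exact expectation Σ_a π(a)·∂log π(a)/∂θ·Q_π(s, a). The Q values and logits are made-up, and the use of PyTorch is an assumption:

```python
import torch

q = torch.tensor([1.0, -0.5, 2.0])                  # assumed Q_pi(s, a)
theta = torch.tensor([0.3, -0.2, 0.1], requires_grad=True)

# Form A: differentiate V(s; theta) = sum_a pi(a|s; theta) * Q_pi(s, a) directly.
V = torch.sum(torch.softmax(theta, dim=-1) * q)
grad_A, = torch.autograd.grad(V, theta)

# Form B: sum_a pi(a|s; theta) * dlog pi(a|s; theta)/dtheta * Q_pi(s, a), term by term.
grad_B = torch.zeros_like(theta)
pi = torch.softmax(theta, dim=-1).detach()
for a in range(3):
    logp = torch.log_softmax(theta, dim=-1)[a]
    g_a, = torch.autograd.grad(logp, theta)
    grad_B += pi[a] * g_a * q[a]

print(torch.allclose(grad_A, grad_B))               # True (up to numerical precision)
```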
Calculate Policy Gradient

Policy gradient: ∂V(s; θ)/∂θ = 𝔼_{A ~ π(·|s; θ)}[ ∂log π(A|s; θ)/∂θ · Q_π(s, A) ].

1. Randomly sample an action â according to π(·|s; θ).
2. Calculate g(â, θ) = ∂log π(â|s; θ)/∂θ · Q_π(s, â).
   • By the definition of g, 𝔼_A[g(A, θ)] = ∂V(s; θ)/∂θ.
   • So g(â, θ) is an unbiased estimate of ∂V(s; θ)/∂θ.
3. Use g(â, θ) as an approximation to the policy gradient ∂V(s; θ)/∂θ.
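The sketch below (made-up Q values and logits, PyTorch autograd as an assumption) illustrates step 2 and the unbiasedness claim: averaging many sampled estimates g(â, θ) approaches the exact gradient.

```python
import torch

q = torch.tensor([1.0, -0.5, 2.0])                     # assumed Q_pi(s, a)
theta = torch.tensor([0.3, -0.2, 0.1], requires_grad=True)

# Exact gradient of V(s; theta) = sum_a pi(a|s; theta) * Q_pi(s, a).
V = torch.sum(torch.softmax(theta, dim=-1) * q)
exact_grad, = torch.autograd.grad(V, theta)

# Average of sampled estimates g(a_hat, theta) = dlog pi(a_hat|s; theta)/dtheta * Q_pi(s, a_hat).
dist = torch.distributions.Categorical(logits=theta.detach())
total = torch.zeros_like(theta)
n = 5_000
for _ in range(n):
    a_hat = dist.sample()                               # a_hat ~ pi(.|s; theta)
    logp = torch.log_softmax(theta, dim=-1)[a_hat]
    g, = torch.autograd.grad(logp, theta)
    total += g * q[a_hat]                               # one sample of g(a_hat, theta)

print(exact_grad, total / n)                            # the two should be close
```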
Update policy network using policy gradient
Algorithm

1. Observe the state s_t.
2. Randomly sample an action a_t according to π(·|s_t; θ_t).
3. Compute q_t ≈ Q_π(s_t, a_t) (some estimate). How?
4. Differentiate the policy network: d_{θ,t} = ∂log π(a_t|s_t; θ)/∂θ, evaluated at θ = θ_t.
5. (Approximate) policy gradient: g(a_t, θ_t) = q_t · d_{θ,t}.
6. Update the policy network: θ_{t+1} = θ_t + β · g(a_t, θ_t).
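A minimal sketch of one such update (steps 2 and 4–6) in PyTorch is shown below. The small policy net, the learning rate, and the placeholder value of q_t are assumptions; in practice q_t comes from Option 1 or Option 2 described next.

```python
import torch
import torch.nn as nn

# Tiny stand-in for the Conv+Dense+Softmax policy network from earlier (made-up sizes).
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3), nn.Softmax(dim=-1))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)   # lr plays the role of beta

def policy_gradient_step(s_t: torch.Tensor, q_t: float) -> torch.Tensor:
    probs = policy(s_t)                              # pi(.|s_t; theta_t)
    dist = torch.distributions.Categorical(probs)
    a_t = dist.sample()                              # step 2: a_t ~ pi(.|s_t; theta_t)
    loss = -q_t * dist.log_prob(a_t)                 # steps 4-5: grad of -loss is q_t * dlog pi/dtheta
    optimizer.zero_grad()
    loss.backward()                                  # step 4: differentiate the policy network
    optimizer.step()                                 # step 6: ascend V by descending the loss
    return a_t

a = policy_gradient_step(torch.randn(4), q_t=1.5)    # q_t here is only a placeholder estimate
```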
How to compute q_t in step 3? Option 1: REINFORCE.

• Play the game to the end and generate the trajectory:
  s_1, a_1, r_1, s_2, a_2, r_2, ⋯, s_n, a_n, r_n.
• Compute the discounted return u_t = Σ_{k=t}^{n} γ^{k−t} · r_k, for all t.
• Since Q_π(s_t, a_t) = 𝔼[U_t], we can use u_t to approximate Q_π(s_t, a_t).
• ⇒ Use q_t = u_t.
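A small sketch of the return computation, done backwards in one pass over a finished trajectory (the reward list and γ are made-up values):

```python
# u_t = sum_{k=t}^{n} gamma^(k-t) * r_k, computed via u_t = r_t + gamma * u_{t+1}.
from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    u = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        u = rewards[t] + gamma * u        # u_t = r_t + gamma * u_{t+1}
        returns[t] = u
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))   # REINFORCE uses q_t = u_t
```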
Option 2: Approximate Q_π using a neural network.
• This leads to the actor-critic method.
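A very rough sketch of where q_t would come from in Option 2: a separate "critic" network that outputs one value per action, so q_t is read off as q(s_t; w)[a_t]. How the critic itself is trained is not covered in these slides; the network below is only an assumed illustration.

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))  # q(s, .; w) for 3 actions

s_t = torch.randn(4)             # made-up state
a_t = torch.tensor(2)            # made-up sampled action
q_t = critic(s_t)[a_t].item()    # the critic's estimate replaces u_t in step 3 of the algorithm
```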
Summary
Policy-Based Learning

• If a good policy function π is known, the agent can be controlled by the policy: randomly sample a_t ~ π(·|s_t).
• Approximate the policy function π(a|s) by the policy network π(a|s; θ).
• Learn the policy network with the policy gradient algorithm.
• The policy gradient algorithm learns θ that maximizes 𝔼_S[V(S; θ)].
Thank you!
