Policy Approximation and Policy Gradient Theorem
Policy Approximation
In policy gradient methods, the policy can be parameterized in a variety of ways, as long as π(a|s,θ) is differentiable with respect to its parameters θ. This allows the policy to be improved directly with gradient-based optimization techniques.
The goal is to learn a good policy that maximizes long-term rewards. Policy approximation
refers to approximating the policy (the action-selection rule) using a function that can be
easily adjusted and improved over time.
Common Parameterization of Policies
A common method for discrete action spaces is to use action preferences for each state-
action pair, denoted as h(s,a,θ), where θ represents the policy parameters. These
preferences are converted into probabilities using a soft-max function:
π(a|s,θ) = exp(h(s,a,θ)) / Σ_b exp(h(s,b,θ))
This soft-max function ensures that actions with higher preferences have higher
probabilities of being chosen, while all probabilities sum to 1.
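As a minimal sketch, assuming linear action preferences h(s,a,θ) = θᵀx(s,a) over hand-made feature vectors (the features and numbers below are purely illustrative):

import numpy as np

def softmax_policy(theta, features):
    """Return pi(a|s,theta) for one state from linear action preferences.

    features has shape (num_actions, num_params): one feature vector
    x(s, a) per action for the current state s.
    """
    h = features @ theta            # action preferences h(s, a, theta)
    h = h - h.max()                 # subtract the max for numerical stability
    exp_h = np.exp(h)
    return exp_h / exp_h.sum()      # soft-max: probabilities sum to 1

# Illustrative use: 3 actions, 2 parameters.
theta = np.array([1.0, -0.5])
features = np.array([[1.0, 0.0],    # x(s, a0)
                     [0.0, 1.0],    # x(s, a1)
                     [1.0, 1.0]])   # x(s, a2)
probs = softmax_policy(theta, features)
print(probs, probs.sum())           # higher preference -> higher probability; sums to 1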
Advantages of Using Soft-Max Policy Approximation
1. Approaching Deterministic Policies: Because the action preferences are unbounded, the preference for the best action can be driven arbitrarily far above the others, so the soft-max policy can approach a deterministic one over time; an ϵ-greedy policy, by contrast, always selects non-greedy actions with probability ϵ (see the sketch after this list).
2. Selection of Actions with Arbitrary Probabilities: The policy can assign any probabilities to the actions, so genuinely stochastic policies can be learned; this is useful in problems with imperfect information, such as Poker, where the best policy is stochastic.
3. Easier to Approximate in Some Problems: Policies may be easier to model than action-
value functions in certain environments.
4. Injecting Prior Knowledge: Parameterizing the policy can incorporate domain knowledge
into the learning process.
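As a quick illustration of point 1, the sketch below (two actions in a single hypothetical state) shows that as the preference gap grows, the soft-max probability of the preferred action approaches 1, while an ϵ-greedy policy with ϵ = 0.1 is capped at 0.95:

import numpy as np

def softmax(h):
    """Numerically stable soft-max over a vector of action preferences."""
    z = np.exp(h - h.max())
    return z / z.sum()

eps = 0.1
for gap in (0.0, 1.0, 3.0, 10.0):
    p_softmax = softmax(np.array([gap, 0.0]))[0]
    p_eps_greedy = 1 - eps + eps / 2      # two actions: greedy action capped at 0.95
    print(f"gap={gap:4.1f}  soft-max P(best)={p_softmax:.3f}  "
          f"eps-greedy P(best)={p_eps_greedy:.2f}")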
Example 13.1: Stochastic Policy in a Simple Corridor Gridworld
In this example, the agent must navigate a short corridor with two actions, move left or move right, and in one state the effect of the two actions is reversed. Because all states appear identical under the function approximation, action-value methods with ϵ-greedy selection are limited to near-deterministic behavior, while policy gradient methods can learn the stochastic policy that maximizes the expected return.
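A minimal simulation sketch of this example, assuming the short-corridor dynamics described in Sutton and Barto's Example 13.1 (three non-terminal states, a reward of -1 per step, and a middle state that reverses the effect of both actions); it estimates the expected return for several fixed probabilities of moving right:

import random

def run_episode(p_right, max_steps=1000):
    """Roll out one episode in the short corridor and return the undiscounted return.

    States 0, 1, 2 are non-terminal and state 3 is the goal; every step costs -1,
    and in state 1 the effect of each action is reversed.
    """
    state, ret = 0, 0
    for _ in range(max_steps):
        ret -= 1
        go_right = random.random() < p_right
        if state == 1:                    # the "switched" state
            go_right = not go_right
        state = state + 1 if go_right else max(0, state - 1)
        if state == 3:                    # reached the goal
            return ret
    return ret

# Estimate J(theta) = E[G_0] for a few fixed values of P(right).
for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    avg = sum(run_episode(p) for _ in range(2000)) / 2000
    print(f"P(right) = {p:.2f}   estimated return = {avg:.1f}")

The estimates peak at an intermediate probability of moving right, so neither deterministic extreme is best; under these assumed dynamics the analytic optimum is at P(right) = 2 − √2 ≈ 0.59, a genuinely stochastic policy that ϵ-greedy selection with a small ϵ cannot represent.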
Key Takeaways
• Policy parameterization via soft-max allows for creating deterministic or stochastic
policies.
• Policy-based methods excel when stochastic behavior is optimal.
• Policy-based methods can learn faster when the policy is a simpler function to approximate than the action-value function.
• Prior knowledge about optimal policies can improve learning efficiency.
The Policy Gradient Theorem
The policy gradient theorem provides an expression for the gradient of a performance measure (the expected return) with respect to the policy parameters, which can then be used to improve the policy by gradient ascent. This is especially useful in problems with large or continuous state and action spaces.
Reinforcement Learning Setup
In RL, an agent interacts with the environment and learns to take actions based on
observations. The policy π(a|s,θ) defines action probabilities, and the goal is to maximize
the expected return.
Objective of Policy Gradient Methods
To improve the policy πθ, we compute (or estimate from experience) the gradient of the performance measure J(θ) with respect to θ and update the parameters by taking a step in that direction (gradient ascent).
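In symbols, assuming the episodic setting where the performance measure is taken to be the value of the start state s_0, the objective and the gradient-ascent update are:

\[
J(\theta) \doteq v_{\pi_\theta}(s_0),
\qquad
\theta_{t+1} = \theta_t + \alpha \, \widehat{\nabla J(\theta_t)},
\]

where \(\widehat{\nabla J(\theta_t)}\) is a stochastic estimate whose expectation approximates the true gradient and \(\alpha\) is the step size.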
Deriving the Policy Gradient Theorem
The difficulty is that J(θ) depends not only on the action probabilities but also on the distribution of states visited under the policy, and the effect of θ on that state distribution is typically unknown. The theorem resolves this by expressing ∇θJ(θ) purely in terms of the action probabilities and action values, with no derivative of the state distribution required.
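For reference, the episodic form of the theorem (as in Sutton and Barto, Chapter 13), where μ is the on-policy state distribution and q_π the action-value function under the current policy:

\[
\nabla_\theta J(\theta) \;\propto\; \sum_{s} \mu(s) \sum_{a} q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta)
\;=\; \mathbb{E}_\pi\!\left[\, q_\pi(S_t, A_t)\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \,\right].
\]

The key point is that the right-hand side involves no derivative of the state distribution, so the gradient can be estimated from samples of on-policy experience.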