Policy Approximation and Policy Gradient Theorem
Policy Approximation
In policy gradient methods, the policy can be parameterized in a variety of ways, as long as π(a|s,θ) is differentiable with respect to its parameters θ. This allows the policy to be improved directly with gradient-based optimization techniques.
The goal is to learn a good policy that maximizes long-term rewards. Policy approximation
refers to approximating the policy (the action-selection rule) using a function that can be
easily adjusted and improved over time.
Common Parameterization of Policies
A common method for discrete action spaces is to use action preferences for each state-
action pair, denoted as h(s,a,θ), where θ represents the policy parameters. These
preferences are converted into probabilities using a soft-max function:
π(a|s,θ) = exp(h(s,a,θ)) / Σ_b exp(h(s,b,θ))
This soft-max function ensures that actions with higher preferences have higher
probabilities of being chosen, while all probabilities sum to 1.
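As a minimal sketch, assuming linear action preferences h(s,a,θ) = θᵀx(s,a) over hand-made feature vectors (the features and numbers below are purely illustrative):

import numpy as np

def softmax_policy(theta, features):
    """Return pi(a|s,theta) for one state from linear action preferences.

    features has shape (num_actions, num_params): one feature vector
    x(s, a) per action for the current state s.
    """
    h = features @ theta            # action preferences h(s, a, theta)
    h = h - h.max()                 # subtract the max for numerical stability
    exp_h = np.exp(h)
    return exp_h / exp_h.sum()      # soft-max: probabilities sum to 1

# Illustrative use: 3 actions, 2 parameters.
theta = np.array([1.0, -0.5])
features = np.array([[1.0, 0.0],    # x(s, a0)
                     [0.0, 1.0],    # x(s, a1)
                     [1.0, 1.0]])   # x(s, a2)
probs = softmax_policy(theta, features)
print(probs, probs.sum())           # higher preference -> higher probability; sums to 1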
Advantages of Using Soft-Max Policy Approximation
1. Approaching Deterministic Policies: Because the action preferences are unbounded, the preference for the best action can be driven arbitrarily far above the others, so the soft-max policy can approach a deterministic one over time; an ϵ-greedy policy, by contrast, always selects non-greedy actions with probability ϵ (see the sketch after this list).
2. Selection of Actions with Arbitrary Probabilities: The policy can assign any probabilities to the actions, so genuinely stochastic policies can be learned; this is useful in problems with imperfect information, such as Poker, where the best policy is stochastic.
3. Easier to Approximate in Some Problems: Policies may be easier to model than action-
value functions in certain environments.
4. Injecting Prior Knowledge: Parameterizing the policy can incorporate domain knowledge
into the learning process.
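As a quick illustration of point 1, the sketch below (two actions in a single hypothetical state) shows that as the preference gap grows, the soft-max probability of the preferred action approaches 1, while an ϵ-greedy policy with ϵ = 0.1 is capped at 0.95:

import numpy as np

def softmax(h):
    """Numerically stable soft-max over a vector of action preferences."""
    z = np.exp(h - h.max())
    return z / z.sum()

eps = 0.1
for gap in (0.0, 1.0, 3.0, 10.0):
    p_softmax = softmax(np.array([gap, 0.0]))[0]
    p_eps_greedy = 1 - eps + eps / 2      # two actions: greedy action capped at 0.95
    print(f"gap={gap:4.1f}  soft-max P(best)={p_softmax:.3f}  "
          f"eps-greedy P(best)={p_eps_greedy:.2f}")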
Example 13.1: Stochastic Policy in a Simple Corridor Gridworld
In this example, the agent must navigate a short corridor with two actions, move left or move right, and in one state the effect of the two actions is reversed. Because all states appear identical under the function approximation, action-value methods with ϵ-greedy selection are limited to near-deterministic behavior, while policy gradient methods can learn the stochastic policy that maximizes the expected return.
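A minimal simulation sketch of this example, assuming the short-corridor dynamics described in Sutton and Barto's Example 13.1 (three non-terminal states, a reward of -1 per step, and a middle state that reverses the effect of both actions); it estimates the expected return for several fixed probabilities of moving right:

import random

def run_episode(p_right, max_steps=1000):
    """Roll out one episode in the short corridor and return the undiscounted return.

    States 0, 1, 2 are non-terminal and state 3 is the goal; every step costs -1,
    and in state 1 the effect of each action is reversed.
    """
    state, ret = 0, 0
    for _ in range(max_steps):
        ret -= 1
        go_right = random.random() < p_right
        if state == 1:                    # the "switched" state
            go_right = not go_right
        state = state + 1 if go_right else max(0, state - 1)
        if state == 3:                    # reached the goal
            return ret
    return ret

# Estimate J(theta) = E[G_0] for a few fixed values of P(right).
for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    avg = sum(run_episode(p) for _ in range(2000)) / 2000
    print(f"P(right) = {p:.2f}   estimated return = {avg:.1f}")

The estimates peak at an intermediate probability of moving right, so neither deterministic extreme is best; under these assumed dynamics the analytic optimum is at P(right) = 2 − √2 ≈ 0.59, a genuinely stochastic policy that ϵ-greedy selection with a small ϵ cannot represent.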
Key Takeaways
• Policy parameterization via soft-max allows for creating deterministic or stochastic
policies.
• Policy-based methods excel when stochastic behavior is optimal.
• Policy-based methods can learn faster when the policy is a simpler function to approximate than the action-value function.
• Prior knowledge about optimal policies can improve learning efficiency.
The Policy Gradient Theorem
The policy gradient theorem provides an expression for the gradient of a performance measure (the expected return) with respect to the policy parameters, which can then be used to improve the policy by gradient ascent. This is especially useful in problems with large or continuous state and action spaces.
Reinforcement Learning Setup
In RL, an agent interacts with the environment and learns to take actions based on
observations. The policy π(a|s,θ) defines action probabilities, and the goal is to maximize
the expected return.
Objective of Policy Gradient Methods
To improve the policy πθ, we compute (or estimate from experience) the gradient of the performance measure J(θ) with respect to θ and update the parameters by taking a step in that direction (gradient ascent).
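In symbols, assuming the episodic setting where the performance measure is taken to be the value of the start state s_0, the objective and the gradient-ascent update are:

\[
J(\theta) \doteq v_{\pi_\theta}(s_0),
\qquad
\theta_{t+1} = \theta_t + \alpha \, \widehat{\nabla J(\theta_t)},
\]

where \(\widehat{\nabla J(\theta_t)}\) is a stochastic estimate whose expectation approximates the true gradient and \(\alpha\) is the step size.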
Deriving the Policy Gradient Theorem
The difficulty is that J(θ) depends not only on the action probabilities but also on the distribution of states visited under the policy, and the effect of θ on that state distribution is typically unknown. The theorem resolves this by expressing ∇θJ(θ) purely in terms of the action probabilities and action values, with no derivative of the state distribution required.
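For reference, the episodic form of the theorem (as in Sutton and Barto, Chapter 13), where μ is the on-policy state distribution and q_π the action-value function under the current policy:

\[
\nabla_\theta J(\theta) \;\propto\; \sum_{s} \mu(s) \sum_{a} q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta)
\;=\; \mathbb{E}_\pi\!\left[\, q_\pi(S_t, A_t)\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \,\right].
\]

The key point is that the right-hand side involves no derivative of the state distribution, so the gradient can be estimated from samples of on-policy experience.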