
Journal of Machine Learning Research 26 (2025) 1-49 Submitted 5/24; Published 2/25

Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF

Han Shen [email protected]


Department of Electrical, Computer and System Engineering
Rensselaer Polytechnic Institute
Troy, NY 12180, USA
Zhuoran Yang [email protected]
Department of Statistics and Data Science
Yale University
New Haven, CT 06520, USA
Tianyi Chen [email protected]
Department of Electrical, Computer and System Engineering
Rensselaer Polytechnic Institute
Troy, NY 12180, USA

Editor: Tor Lattimore

Abstract

Bilevel optimization has recently been applied to many machine learning tasks. However, its applications have been restricted to the supervised learning setting, where static objective functions with benign structures are considered. Bilevel problems such as incentive design, inverse reinforcement learning (RL), and RL from human feedback (RLHF) are often modeled with dynamic objective functions that go beyond simple static structures, which poses significant challenges to existing bilevel solutions. To tackle this new class of bilevel problems, we introduce the first principled algorithmic framework for solving bilevel RL problems through the lens of penalty formulation. We provide theoretical studies of the problem landscape and its penalty-based (policy) gradient algorithms. We demonstrate the effectiveness of our algorithms via simulations in Stackelberg Markov games, RL from human feedback, and incentive design.

Keywords: bilevel optimization, reinforcement learning, Stackelberg games

1 Introduction
Bilevel optimization (BLO) has emerged as an effective framework in machine learning. In a
nutshell, BLO involves two coupled optimization problems at the upper and lower levels,
with different decision variables denoted by x and y, respectively.
The lower-level problem acts as a constraint for the upper-level problem; e.g., in the upper level,
we minimize a function f (x, y) subject to the constraint that y is a solution to the lower-level
Symbol † denotes equal contribution. Preliminary results in this paper were presented in part at the 2024
International Conference on Machine Learning (Shen et al., 2024).

©2025 Han Shen and Zhuoran Yang and Tianyi Chen.


License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at
http://jmlr.org/papers/v26/24-0720.html.

problem determined by x, i.e., y ∈ Y ∗ (x). Here Y ∗ (x) is the solution set of the lower-level
problem determined by x.
BLO enjoys a wide range of applications in machine learning, including hyper-parameter
optimization (Franceschi et al., 2018), meta-learning (Finn et al., 2017; Rajeswaran et al.,
2019; Chen et al., 2023a), continual learning (Borsos et al., 2020), and adversarial learning
(Jiang et al., 2021). Existing applications mostly concentrate on supervised learning settings;
thus research on BLO has been predominantly confined to the static optimization setting
(Franceschi et al., 2017), wherein the objective functions of both the upper- and lower-level
problems are (strongly) convex. However, this setting is insufficient to model more
complex game-theoretic behaviors with sequential decision-making.
Reinforcement learning (RL) (Sutton and Barto, 2018) is a principled framework for
sequential decision-making problems and has achieved tremendous empirical success in recent
years (Silver et al., 2017; Ouyang et al., 2022). In this work, we study the BLO problem
in the context of RL, where the lower-level problem is an RL problem and the upper-level
problem can be either smooth optimization or RL. Specifically, in the lower-level problem,
the follower solves a Markov decision process (MDP) determined by the leader’s decision x,
and returns an optimal policy of this MDP to the leader, known as the best response policy.
The leader aims to maximize its objective function, subject to the constraint that the follower
always adopts the best response policy. This formulation of bilevel RL encompasses a range
of applications such as Stackelberg Markov games (Stackelberg, 1952), reward learning (Hu
et al., 2020), and RL from human feedback (RLHF) (Christiano et al., 2017). As an example,
in RL from human feedback, the leader designs a reward rx for the follower’s MDP, with
the goal that the resulting optimal policy yields the desired behavior of the leader.
Despite its various applications, the bilevel RL problem is difficult to solve. Broadly
speaking, the main technical challenge lies in handling the constraint, i.e., the lower-level
problem. The lower-level problem of bilevel RL extends from static smooth optimization to
policy optimization in RL, and thus faces significant technical challenges. Such an extension
loses several optimization structures, such as strong convexity and the uniform Polyak-Lojasiewicz
(PL) condition, which are critical for existing BLO algorithms (Ghadimi and Wang, 2018;
Shen and Chen, 2023). Specifically, there are two mainstream approaches for BLO: (a)
implicit gradient or iterative differentiation methods; and, (b) penalty-based methods. In
(a), it is typically assumed that the lower-level objective function is strongly convex (Ji et al.,
2021b; Chen et al., 2021), and thus its optimal solution Y ∗ (x) is unique. Then methods
in (a) are essentially gradient-descent methods for the hyperobjective f (x, Y ∗ (x)), where
the gradient of Y ∗ (x) can be computed using the implicit function theorem. However, in
our bilevel RL case, the lower-level objective function is the discounted return in MDP,
which is known to be non-convex (Agarwal et al., 2020). Thus, the hyperobjective and its
gradient are not well-defined. In (b), the bilevel problem is reformulated as a single-level
problem by adding a penalty term of the lower-level sub-optimality to the leader’s objective
function. The penalty reformulation approach has been studied in (Ye, 2012; Shen and Chen,
2023; Ye et al., 2022; Kwon et al., 2023) under the assumption that the lower-level objective
function satisfies certain PL conditions. Unfortunately, when it comes to bilevel RL, the
lower-level discounted return objective does not satisfy these conditions. To develop the
penalty approach for bilevel RL problems, it is unclear (i) what an appropriate penalty
function is; (ii) how the solution to the reformulated problem relates to the original bilevel
problem; and (iii) how to solve the reformulated problem. Therefore, directly applying
existing BLO methods to bilevel RL is not straightforward, and new theories and algorithms
tailored to the RL lower-level problem are needed, which are the subject of this paper.

1.1 Our contributions


To this end, we propose a novel algorithm that extends the idea of the penalty-based BLO
algorithm of Shen and Chen (2023) to tackle the specific challenges of bilevel RL. Our approach
includes the design of two tailored penalty functions: value penalty and Bellman penalty,
which are crafted to capture the optimality condition of the lower-level RL problem. In
addition, leveraging the geometry of the policy optimization problem, we prove that an
approximate solution to our reformulated problem is also an effective solution to the original
bilevel problem. Furthermore, we establish the differentiability of the reformulated problem,
and propose a first-order policy-gradient-based algorithm. To our knowledge, we establish the
first provably convergent first-order algorithm for bilevel RL. Lastly, we conduct experiments
on example applications covered by our framework, including the Stackelberg game and RL
from human feedback tasks.
Table 1: Summary of main convergence theorems. †: the lower bound on the penalty
constant λ that guarantees the corresponding lower-level optimality gap (value function gap, Bellman
gap, or NI function value) is smaller than the accuracy δ_value, δ_bellman, or δ_NI, respectively;
these optimality gaps will be introduced in their respective sections. ‡: Here ϵ is the accuracy.

                                  Section 3 & Section 4                        Section 5
  Lower-level problem             single-agent RL                              two-player zero-sum
  Upper-level problem             general objective                            general objective
  Penalty functions               Value or Bellman penalty                     Nikaido-Isoda (NI)
  Penalty constant λ              Ω(δ_value^{-1})† or Ω(δ_bellman^{-0.5})†     Ω(δ_NI^{-1})†
  Inner-loop oracle algorithm     Policy mirror descent (PMD)                  Policy mirror descent (PMD)
  Iteration complexity            O(λ ϵ^{-1} log(λ²/ϵ))‡                       O(λ ϵ^{-1} log(λ²/ϵ))‡

1.2 Related works


Bilevel optimization. The bilevel optimization problem dates back to (Stackelberg,
1952). Gradient-based bilevel optimization methods have gained growing popularity
in the machine learning area; see, e.g., (Sabach and Shtern, 2017; Franceschi et al., 2018;
Liu et al., 2020). A prominent branch of gradient-based bilevel optimization is based on
the implicit gradient (IG) theorem. The IG based methods have been widely studied under
a strongly convex lower-level function, see, e.g., (Pedregosa, 2016; Ghadimi and Wang,
2018; Hong et al., 2023; Ji et al., 2021a; Chen et al., 2021; Khanduri et al., 2021; Shen and
Chen, 2022; Li et al., 2022; Xiao et al., 2023b; Giovannelli et al., 2022; Chen et al., 2023b;
Yang et al., 2023). The iterative differentiation (ITD) methods, which can be viewed as an
iterative relaxation of the IG methods, have been studied in, e.g., (Maclaurin et al., 2015;
Franceschi et al., 2017; Nichol et al., 2018; Shaban et al., 2019; Liu et al., 2021b, 2022; Bolte
et al., 2022; Grazzi et al., 2020; Ji et al., 2022; Shen and Chen, 2022). However, in our case


Table 2: A holistic comparison between the bilevel RL in this work and the general
penalty-based bilevel optimization (OPT) (e.g., (Shen and Chen, 2023; Kwon et al., 2023)),
where "GD" stands for gradient descent and "PMD" stands for policy mirror descent.

                            Supervised penalty-based bilevel OPT            This work on penalty-based bilevel RL
  Problem application       hyperparameter OPT, adversarial training,       Stackelberg Markov game, RL from
                            continual learning, etc.                        preference, incentive design, etc.
  Penalty reformulation     Value penalty with assumed property             Value/Bellman/NI penalty with proven property
  Algorithm                 Gradient directly accessible                    Need to derive closed-form gradient and estimate it
  Iteration complexity      Õ(λϵ^{-1}) with inner-loop GD                   Õ(λϵ^{-1}) with inner-loop PMD

the lower-level objective is the discounted return which is known to be non-convex (Agarwal
et al., 2020). Thus it is difficult to apply the aforementioned methods here.

The penalty relaxation of the bilevel optimization problem, early studies of which date
back to (Clarke, 1983; Luo et al., 1996), has gained renewed interest from researchers
recently (see, e.g., (Shen and Chen, 2023; Ye et al., 2022; Lu and Mei, 2023; Kwon et al.,
2023; Xiao et al., 2023a)). Theoretical results for this branch of work are established under
certain lower-level error-bound conditions (e.g., uniform Polyak-Lojasiewicz inequalities)
weaker than strong convexity. In our case, the lower-level discounted return objective does
not satisfy those uniform error bounds. Therefore, the established penalty reformulations
may not be directly applied here. See Table 2 for a more detailed comparison between this
work and the general penalty-based bilevel optimization.

Applications of bilevel RL. The bilevel RL formulation considered in this work


covers several applications including reward shaping (Hu et al., 2020; Zou et al., 2019),
reinforcement learning from preference (Christiano et al., 2017; Xu et al., 2020; Pacchiano
et al., 2021), Stackelberg Markov game (Liu et al., 2021a; Song et al., 2023), incentive design
(Chen et al., 2016), etc. A concurrent work (Chakraborty et al., 2024) studies the policy
alignment problem and introduces a corrected reward learning objective for RLHF that
leads to strong performance gains. The PARL algorithm in (Chakraborty et al., 2024) is
based on the implicit-gradient bilevel optimization method, which requires strong convexity
of the lower-level objective. Moreover, PARL uses second-order derivatives of the
RL objective, while our proposed algorithm is fully first-order.

2 Problem Formulations

In this section, we will first introduce the generic bilevel RL formulation. Then we will show
several specific applications of the generic bilevel RL problem.


2.1 Bilevel reinforcement learning formulation


Reinforcement learning studies the problem where an agent aims to find a policy that
maximizes its accumulated reward under the environment's dynamics. In such a problem, the
reward function and the dynamics are fixed given the agent's policy. In contrast, in the problems
that we are about to study, the reward or the dynamics oftentimes depend on another
decision variable, e.g., the reward is parameterized by a neural network in RLHF; or in the
Stackelberg game, both the reward and the dynamics are affected by the leader's policy.
Tailoring to this, we first define a so-called parameterized MDP. Given the parameter
x ∈ R^{dx}, define a parameterized MDP as Mτ(x) := {S, A, rx, Px, τh}, where S is a finite
state space; A is a finite action space; rx(s, a) is the parameterized reward given the state-action
pair (s, a) ∈ S × A; Px is a parameterized transition distribution that specifies Px(s′|s, a), the
probability of transiting to s′ given (s, a); a policy π specifies π(a|s), the probability
of taking action a given state s; and τh is the regularization: τ ≥ 0 and h = (h_s)_{s∈S}, where
each h_s : ∆(A) → R_+ is a strongly-convex regularization function given s. When τ = 0,
Mτ(x) is an unregularized MDP.
Given a policy π, the value function of Mτ(x) under π is defined as

    V^{π}_{Mτ(x)}(s) := E[ ∑_{t=0}^{∞} γ^t ( rx(s_t, a_t) − τ h_{s_t}(π(s_t)) ) | s_0 = s, π ]        (2.1)

where γ ∈ [0, 1), π(s) := π(·|s) ∈ ∆(A), and the expectation is taken over the trajectory
(s_0, a_0 ∼ π(s_0), s_1 ∼ Px(·|s_0, a_0), . . .). Given a state distribution ρ, we write V^{π}_{Mτ(x)}(ρ) =
E_{s∼ρ}[V^{π}_{Mτ(x)}(s)]. We also define the Q function as

    Q^{π}_{Mτ(x)}(s, a) := rx(s, a) + γ E_{s′∼Px(·|s,a)}[ V^{π}_{Mτ(x)}(s′) ]        (2.2)

and P^{π}_x(s_t = s|s_0) as the probability of reaching state s at time t given initial state s_0 under
the transition distribution Px and policy π. The probability P^{π}_x(s_t = s|s_0, a_0) is defined
similarly.
Suppose the policy π is parameterized by y ∈ Y ⊆ R^{dy}. We define the policy class as
Π := {πy : y ∈ Y} ⊆ ∆(A)^{|S|}. For Mτ(x), its optimal policy is denoted as πy∗(x) ∈ Π,
satisfying V^{πy∗(x)}_{Mτ(x)}(s) ≥ V^{π}_{Mτ(x)}(s) for any π ∈ Π and s. With a function f : R^{dx} × R^{dy} → R,
we are interested in the following bilevel RL problem

    BM :  min_{x,y} f(x, y),  s.t.  x ∈ X,  y ∈ Y∗(x) := argmin_{y∈Y} −V^{πy}_{Mτ(x)}(ρ)        (2.3)

where X ⊆ R^{dx} and Y ⊆ R^{dy} are compact convex sets, and ρ is a given state distribution
with ρ(s) > 0 on S. The name 'bilevel' refers to the nested structure in the optimization
problem: in the upper level, a function f(x, y) is minimized subject to the lower-level
optimality constraint that πy is the optimal policy for Mτ(x).
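To make these definitions concrete, the following minimal numpy sketch (ours, not from the paper) evaluates V^{π}_{Mτ(x)}(ρ) for a small tabular instance, assuming the parameter x simply indexes a reward table rx and each h_s is the negative entropy:

```python
import numpy as np

def regularized_value(P, r_x, pi, gamma, tau, rho):
    """Evaluate V^pi_{M_tau(x)}(rho) for a tabular parameterized MDP (a sketch).

    P    : transitions, shape (S, A, S)
    r_x  : parameterized reward table, shape (S, A)   (x is identified with this table)
    pi   : policy, shape (S, A), rows on the simplex
    tau  : regularization weight; h_s is taken to be the negative entropy (a strongly convex choice)
    rho  : initial state distribution, shape (S,)
    """
    S, A, _ = P.shape
    h = np.sum(pi * np.log(pi + 1e-12), axis=1)          # h_s(pi(s)): negative entropy
    r_pi = np.sum(pi * r_x, axis=1) - tau * h            # per-state regularized reward
    P_pi = np.einsum("sa,sap->sp", pi, P)                # state-to-state kernel under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)  # V = r_pi + gamma * P_pi V
    return float(rho @ V)
```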

2.2 Applications of bilevel reinforcement learning


Next we show several example applications that can be modeled by a bilevel RL problem.


Stackelberg Markov game. Consider a Markov game where at each time step, a
leader and a follower observe the state and take actions simultaneously. Then, according to
the current state and actions, the leader and the follower receive rewards, and the game transits
to the next state. Such a game can be defined as M^g_τ := {S, A_l, A_f, r_l, r_f, P, τh_l, τh_f},
where S is the state space; A_l and A_f are the leader's and the follower's action spaces; r_l(s, a_l, a_f)
and r_f(s, a_l, a_f) are respectively the leader's and the follower's rewards given (s, a_l, a_f) ∈
S × A_l × A_f; P(s′|s, a_l, a_f) is the probability of transiting to state s′ given (s, a_l, a_f); the
leader's/follower's policy πx/πy defines πx(a_l|s)/πy(a_f|s), the probability of choosing action
a_l/a_f given state s; and τh_l, τh_f are the regularization functions for πx and πy, respectively.
Define the leader's/follower's value functions as

    V^{πx,πy}_⋆(s) := E[ ∑_{t=0}^{∞} γ^t ( r_⋆(s_t, a_{l,t}, a_{f,t}) − τ h_{⋆,s_t}(π_⋆(s_t)) ) | s_0 = s, πx, πy ],   ⋆ = l or f

where γ ∈ [0, 1), π_⋆(s) := π_⋆(·|s) ∈ ∆(A_⋆), and the expectation is taken over the trajectory
(s_0, a_{l,0} ∼ πx(s_0), a_{f,0} ∼ πy(s_0), s_1 ∼ P(s_0, a_{l,0}, a_{f,0}), . . .). The Q function is defined as

    Q^{πx,πy}_⋆(s, a_l, a_f) := r_⋆(s, a_l, a_f) + γ E_{s′∼P(s,a_l,a_f)}[ V^{πx,πy}_⋆(s′) ].

The follower's objective is to find a best-response policy to the leader's policy, while the
leader aims to find a best response to the follower's best response. The Stackelberg Markov
game can be formulated as

    min_{x,y} −V^{πx,πy}_l(ρ),  s.t.  x ∈ X,  y ∈ argmin_{y∈Y} −V^{πx,πy}_f(ρ).        (2.4)

With the proof deferred to Appendix B.1, (2.4) can be viewed as a bilevel RL problem
with f(x, y) = −V^{πx,πy}_l(ρ) and an Mτ(x) in which rx(s, a_f) = E_{a_l∼πx(s)}[r_l(s, a_l, a_f)] and
Px(·|s, a_f) = E_{a_l∼πx(s)}[P(·|s, a_l, a_f)].
Reinforcement learning from human feedback (RLHF). In the RLHF setting,
the agent learns a task without knowing the true reward function. Instead, humans evaluate
pairs of state-action segments, and for each pair, they label the segment they prefer. The
agent’s goal is to learn the task well with a limited amount of labeled pairs.
The original framework of deep RL from human feedback in (Christiano et al., 2017) (we
call it DRLHF) consists of two possibly asynchronous learning processes: reward learning
from labeled pairs and RL from learned rewards. In short, we maintain a buffer of labeled
segment pairs {(d_0, l_0, d_1, l_1)_i}_i, where each segment d = (s_t, a_t, . . . , s_{t+T}, a_{t+T}) is collected
with the agent's policy πy, and l_0, l_1 are the labels (e.g., l_1 = 1, l_0 = 0 indicates that segment d_1 is
preferred over d_0). DRLHF simultaneously learns a reward predictor rx from the data and
trains an RL agent using the learned reward. This process has a hierarchical structure and
can be reformulated as a bilevel RL problem:

    min_{x,y} −E_{d_0,d_1∼πy}[ l_0 log P(d_0 ≻ d_1 | rx) + l_1 log P(d_1 ≻ d_0 | rx) ],
    s.t.  y ∈ argmin_{y} −V^{πy}_{Mτ(x)}(ρ),


where P(d_0 ≻ d_1 | rx) is the probability of preferring d_0 over d_1 under reward rx, given by
the Bradley-Terry model:

    P(d_0 ≻ d_1 | rx) = exp( ∑_{(s_t,a_t)∈d_0} rx(s_t, a_t) ) / [ exp( ∑_{(s_t,a_t)∈d_0} rx(s_t, a_t) ) + exp( ∑_{(s_t,a_t)∈d_1} rx(s_t, a_t) ) ].        (2.5)
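For illustration, here is a minimal numpy sketch (ours) of the resulting upper-level reward-learning loss, where `seg_returns_k` stands for the summed predicted rewards ∑_{(s_t,a_t)∈d_k} rx(s_t, a_t) of the two segments in each labeled pair; the function and argument names are assumptions of this sketch, not the paper's:

```python
import numpy as np

def bradley_terry_nll(seg_returns_0, seg_returns_1, labels):
    """Negative log-likelihood of preference labels under the Bradley-Terry model (2.5).

    seg_returns_k : shape (N,), summed predicted reward of segment d_k in each labeled pair.
    labels        : shape (N,), 1 if segment d_1 is preferred, 0 if d_0 is preferred.
    """
    # P(d_0 > d_1 | r_x) = exp(R_0) / (exp(R_0) + exp(R_1)) = sigmoid(R_0 - R_1)
    logits = seg_returns_0 - seg_returns_1
    log_p0 = -np.logaddexp(0.0, -logits)   # log sigmoid(R_0 - R_1)
    log_p1 = -np.logaddexp(0.0, logits)    # log sigmoid(R_1 - R_0)
    return float(-np.mean((1 - labels) * log_p0 + labels * log_p1))
```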
Reward shaping. In RL tasks where the reward is difficult to learn from (e.g., the
reward signal is sparse and most states give zero reward), we can reshape the reward
to enable efficient policy learning while staying true to the original task. Given a task
specified by Mτ = {S, A, r, P, τh}, the reward shaping problem (Hu et al., 2020) seeks
to find a reshaped reward rx parameterized by x ∈ X such that the new MDP with rx
enables more efficient policy learning for the original task. We can define the new MDP as
Mτ(x) = {S, A, rx, P, τh} and formulate the problem as

    min_{x,y} −V^{πy}_{Mτ}(ρ),  s.t.  x ∈ X,  y ∈ argmin_{y∈Y} −V^{πy}_{Mτ(x)}(ρ),        (2.6)

which is a special case of bilevel RL.

3 Penalty Reformulation of Bilevel RL

A natural way to solve the bilevel RL problem BM is through reduction to a single-level
problem, that is, to find a single-level problem that shares its local/global solutions with
the original problem. Then, by solving the single-level problem, we can recover the original
solutions. In this section, we will perform a single-level reformulation of BM by penalizing
the upper-level objective with carefully chosen functions.
Specifically, we aim to find penalty functions p(x, y) such that the solutions of the
following problem recover the solutions of BM:

    BMλp :  min_{x,y} Fλ(x, y) := f(x, y) + λ p(x, y),  s.t.  x ∈ X,  y ∈ Y        (3.1)

where λ is the penalty constant.

3.1 Value penalty and its landscape property

In BM, the lower-level problem of finding the optimal policy πy can be rewritten as its
optimality condition: −V^{πy}_{Mτ(x)}(ρ) + max_{y∈Y} V^{πy}_{Mτ(x)}(ρ) = 0. Thus, BM can be rewritten as

    min_{x,y} f(x, y),  s.t.  x ∈ X, y ∈ Y,  −V^{πy}_{Mτ(x)}(ρ) + max_{y∈Y} V^{πy}_{Mτ(x)}(ρ) = 0.

A natural penalty function, which we call the value penalty, measures the lower-level optimality:

    p(x, y) = −V^{πy}_{Mτ(x)}(ρ) + max_{y∈Y} V^{πy}_{Mτ(x)}(ρ).        (3.2)

The value penalty specifies the following penalized problem

    BMλp :  min_{x,y} Fλ(x, y) = f(x, y) + λ( −V^{πy}_{Mτ(x)}(ρ) + max_{y∈Y} V^{πy}_{Mτ(x)}(ρ) ),  s.t.  x ∈ X, y ∈ Y.        (3.3)
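For intuition, here is a minimal numpy sketch (ours) of the value penalty (3.2) in the unregularized tabular case (τ = 0), where the inner maximization is solved by standard value iteration. This only illustrates the definition; it is not the estimator used later by the algorithm:

```python
import numpy as np

def value_penalty(P, r_x, pi_y, gamma, rho, iters=2000):
    """Value penalty p(x, y) = -V^{pi_y}(rho) + max_pi V^{pi}(rho) for an
    unregularized (tau = 0) tabular MDP; a sketch under these assumptions."""
    S, A, _ = P.shape
    # Optimal value via standard value iteration.
    V_star = np.zeros(S)
    for _ in range(iters):
        V_star = np.max(r_x + gamma * P @ V_star, axis=1)
    # Policy evaluation for pi_y.
    r_pi = np.sum(pi_y * r_x, axis=1)
    P_pi = np.einsum("sa,sap->sp", pi_y, P)
    V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return float(rho @ (V_star - V_pi))   # nonnegative by construction
```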
To capture the relation between solutions of BMλp and BM, we have the following lemma.


Lemma 1 (Relation on solutions) Consider choosing p as the value penalty in (3.2).
Assume there exists a constant C such that max_{x∈X,y∈Y} |f(x, y)| = C/2. Given accuracy δ > 0,
choose λ ≥ Cδ^{−1}. If (xλ, yλ) achieves an ϵ-minimum of BMλp, then it achieves an ϵ-minimum of the
relaxed BM:

    min_{x,y} f(x, y),  s.t.  x ∈ X, y ∈ Y,  −V^{πy}_{Mτ(x)}(ρ) + max_{y∈Y} V^{πy}_{Mτ(x)}(ρ) ≤ ϵλ        (3.4)

where ϵλ ≤ δ + λ^{−1}ϵ.

The proof is deferred to Appendix B.2. Perhaps one restriction of the above lemma is that it
requires the boundedness of f on X × Y. This assumption is usually mild in RL problems,
e.g., it is guaranteed in the Stackelberg game provided the reward functions are bounded.
Since BMλp is in general a non-convex problem, it is also of interest to connect the
local solutions of BMλp and BM. To achieve this, some structural condition is
required. Suppose we use the direct policy parameterization: y is a vector whose (s, a)-th element
is y_{s,a} = πy(a|s), and thus y = πy directly. Then we can prove the following structural
condition.

Lemma 2 (Gradient dominance) Given a convex policy class Π and any τ ≥ 0, it holds
for any π ∈ Π that

    max_{π′∈Π} ⟨∇_π V^{π}_{Mτ(x)}(ρ), π′ − π⟩ ≥ (1 − γ) min_s ρ(s) ( max_{π̃∈Π} V^{π̃}_{Mτ(x)}(ρ) − V^{π}_{Mτ(x)}(ρ) ).

See Appendix B.3 for a proof. A similar gradient dominance property was first proven in
(Agarwal et al., 2020, Lemma 4.1) for the unregularized MDPs. The above lemma is a
generalization of the result in (Agarwal et al., 2020) to a regularized case. Under such a
structure of the lower-level problem, we arrive at the following lemma capturing the relation
on local solutions.

Lemma 3 (Relation on local solutions) Consider using direct policy parameterization


and choosing p as the value penalty in (3.2). Assume f (x, ·) is L-Lipschitz-continuous on Y.
Given accuracy δ > 0, choose λ ≥ LCu δ −1 where Cu is a constant specified in the proof. If
(xλ , yλ ) is a local solution of BMλp , it is a local solution of the relaxed BM in (3.4) with
an ϵλ ≤ δ.

The proof can be found in Appendix B.4. Lemmas 1 and 3 suggest we can recover the
local/global solutions of the bilevel RL problem BM by locally/globally solving its penalty
reformulation BMλp with the value penalty.

3.2 Bellman penalty and its landscape property


Next, we introduce the Bellman penalty, which can be used as an alternative. To introduce this
penalty function, we consider a tabular policy (direct parameterization) πy, i.e., πy(·|s) = y_s
for all s and y = (y_s)_{s∈S} ∈ Y = Π. Then we can define the Bellman penalty as

    p(x, y) = g(x, y) − v(x)   where   v(x) := min_{y∈Y} g(x, y).        (3.5)

Here g(x, y) is defined as

    g(x, y) := E_{s∼ρ}[ ⟨y_s, q_s(x)⟩ + τ h_s(y_s) ],        (3.6)

where q_s(x) ∈ R^{|A|} is the vector of (negated) optimal Q functions, defined as

    q_s(x) = (q_{s,a}(x))_{a∈A}   where   q_{s,a}(x) := − max_{π∈Π} Q^{π}_{Mτ(x)}(s, a).        (3.7)
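Analogously, here is a minimal numpy sketch (ours) of the Bellman penalty (3.5) for a tabular MDP, assuming h_s is the negative entropy so that both q_s(x) (via soft Q-iteration) and v(x) have simple closed forms; again an illustration of the definition, not the paper's implementation:

```python
import numpy as np
from scipy.special import logsumexp

def bellman_penalty(P, r_x, pi_y, gamma, tau, rho, iters=2000):
    """Bellman penalty p(x, y) = g(x, y) - v(x) of (3.5), sketched for a tabular MDP
    with negative-entropy regularization (tau > 0) and Pi = all tabular policies."""
    S, A, _ = P.shape
    # Soft Q-iteration for the optimal regularized Q; q_s(x) = -Q^*(s, .).
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = tau * logsumexp(Q / tau, axis=1)       # optimal regularized state value
        Q = r_x + gamma * P @ V
    q = -Q
    h = np.sum(pi_y * np.log(pi_y + 1e-12), axis=1)           # h_s(y_s): negative entropy
    g = rho @ (np.sum(pi_y * q, axis=1) + tau * h)            # g(x, y) in (3.6)
    v = rho @ (-tau * logsumexp(-q / tau, axis=1))            # v(x) = min_y g(x, y), closed form
    return float(g - v)   # tau-strongly-convex in y and >= 0
```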

It is immediate that p(x, ·) is τ-strongly-convex uniformly for any x ∈ X by the 1-strong-convexity
of h_s, and that p(x, y) ≥ 0 by definition. Moreover, we can show that p(x, y) =
g(x, y) − v(x) is a suitable optimality metric for the lower-level RL problem in BM. Specifically,
we prove in the following lemma that the lower-level RL problem is solved whenever g(x, y) − v(x) is
minimized.
Lemma 4 Assume τ > 0. Then the following hold.

• Given any x ∈ X, the MDP Mτ(x) has a unique optimal policy πy∗(x), and we have
argmin_{y∈Y} g(x, y) = Y∗(x) = {πy∗(x)}. Therefore, BM can be rewritten as the following
problem with ϵ = 0:

    BMϵ :  min_{x,y} f(x, y),  s.t.  x ∈ X, y ∈ Y,  g(x, y) − v(x) ≤ ϵ.        (3.8)

• Assume f(x, ·) is L-Lipschitz-continuous on Y. More generally, for ϵ ≥ 0, BMϵ is an
ϵ-approximate problem of BM in the sense that: given any x ∈ X, any feasible policy yϵ
of BMϵ is ϵ-feasible for BM:

    ∥yϵ − πy∗(x)∥² ≤ τ^{−1}ϵ.

Moreover, let f∗ and fϵ∗ respectively be the optimal objective values of BM and BMϵ; then
we have |f∗ − fϵ∗| ≤ L√(τ^{−1}ϵ).

The proof is deferred to Appendix B.5. Based on Lemma 4, g(x, y) − v(x) is a suitable
optimality metric for the lower-level problem. It is then natural to consider whether we can
use it as a penalty function for the lower-level sub-optimality. The Bellman penalty specifies
the following penalized problem:

    BMλp :  min_{x,y} Fλ(x, y) = f(x, y) + λ( g(x, y) − v(x) ),  s.t.  x ∈ X, y ∈ Y.        (3.9)

We have the following result that captures the relation between the solutions of BMϵ and
BMλp, which proves that the Bellman penalty is indeed a suitable penalty function.

Lemma 5 (Relation on solutions) Consider choosing p as the Bellman penalty in (3.5).
Assume f(x, ·) is L-Lipschitz-continuous on Y. Given some accuracy δ > 0, choose
λ ≥ L√(τ^{−1}δ^{−1}). If (xλ, yλ) is a local/global solution of BMλp, then it is a local/global solution
of BMϵλ with ϵλ ≤ δ.

This lemma follows from Proposition 3 in (Shen and Chen, 2023) under the τ-strong-convexity
of g(x, ·), along with the assumptions in this lemma.


4 A Penalty-based Bilevel RL Algorithm


In the previous sections, we have introduced two penalty functions p(x, y) such that the
original problem BM can be approximately solved via solving BMλp . However, it is still
unclear how BMλp can be solved. One challenge is the differentiability of the penalty
function p(x, y) in (3.1). In this section, we will first study when Fλ (x, y) admits gradients
in the generic case, and we will show the specific gradient forms in each application. Based
on these results, we propose a penalty-based algorithm and further establish its convergence.

4.1 Differentiability of the value penalty


We first consider the value penalty

    p(x, y) = −V^{πy}_{Mτ(x)}(ρ) + max_{y∈Y} V^{πy}_{Mτ(x)}(ρ).

For the differentiability in y, we have ∇_y p(x, y) = −∇_y V^{πy}_{Mτ(x)}(ρ), which can be conveniently
evaluated with the policy gradient theorem. The issue lies in the differentiability of p(x, y)
with respect to x: p(x, y) may not be differentiable in x due to the optimality function
max_{y∈Y} V^{πy}_{Mτ(x)}(ρ). Fortunately, we will show that in the RL setting, p(·, y) admits a
closed-form gradient under the relatively mild assumptions below.
Assumption 1 Assume that

(a) ∇_x V^{πy}_{Mτ(x)}(ρ) is continuous in (x, y); and,

(b) given any x ∈ X and y, y′ ∈ Y∗(x), we have ∇_x V^{πy}_{Mτ(x)}(ρ) = ∇_x V^{πy′}_{Mτ(x)}(ρ).

Assumption 1 (a) is mild in the applications, and can often be guaranteed by a continuously
differentiable reward function rx. A sufficient condition for Assumption 1 (b) is that the optimal
policy of Mτ(x) on Π is unique, e.g., when πy = πy′ for all y, y′ ∈ Y∗(x). As indicated by
Lemma 4, this uniqueness is guaranteed when τ > 0.

Lemma 6 (Generic gradient form) Consider the value penalty p in (3.2). Suppose
Assumption 1 holds. Then p(x, y) is differentiable in x with the gradient

    ∇_x p(x, y) = −∇_x V^{πy}_{Mτ(x)}(ρ) + ∇_x V^{π}_{Mτ(x)}(ρ)|_{π=πy∗(x)}        (4.1)

where we recall that πy∗(x) is an optimal policy of Mτ(x) in the policy class Π = {πy : y ∈ Y}.

The proof can be found in Appendix C.1. Next, we apply the generic result from Lemma
6 to specify the exact gradient formula in the bilevel RL applications discussed in
Section 2.2.

Lemma 7 (Gradient form in the applications) Consider the value penalty p in (3.2).
The gradient of the penalty function in specific applications is listed below.

(a) RLHF/reward shaping: Assume rx is continuously differentiable and Assumption 1 (b)
holds. Then Lemma 6 holds and we have

    ∇_x p(x, y) = −E[ ∑_{t=0}^{∞} γ^t ∇rx(s_t, a_t) | ρ, πy ] + E[ ∑_{t=0}^{∞} γ^t ∇rx(s_t, a_t) | ρ, πy∗(x) ].


(b) Stackelberg game: Assume πx is continuously differentiable and Assumption 1 (b) holds.
Then Lemma 6 holds and we have

    ∇_x p(x, y) = −E[ ∑_{t=0}^{∞} γ^t Q̄^{πx,πy}_{f,t} ∇log πx(a_{l,t}|s_t) | s_0 = s, πx, πy ]
                  + E[ ∑_{t=0}^{∞} γ^t Q̄^{πx,πy∗(x)}_{f,t} ∇log πx(a_{l,t}|s_t) | s_0 = s, πx, πy∗(x) ]

where Q̄^{πx,πy}_{f,t} := Q^{πx,πy}_f(s_t, a_{l,t}, a_{f,t}) − τ h_{f,s_t}(πy(s_t)). Recall that in the Stackelberg setting,
πy∗(x) is the optimal follower policy given πx; and the expectation is taken over the
trajectory generated by πx, πy (or πy∗(x)), and P.
We defer the proof to Appendix C.2.
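As a concrete special case, the following sketch (ours) estimates the Lemma 7 (a) gradient by Monte Carlo for a tabular reward rx(s, a) = x[s, a] (a simplifying assumption), in which case ∇rx(s_t, a_t) is a one-hot table and the gradient reduces to a difference of discounted state-action visitation tables; `sample_traj` is an assumed rollout helper, not part of the paper:

```python
import numpy as np

def grad_x_value_penalty(sample_traj, pi_y, pi_star, x_shape, gamma, n_traj=100, T=200):
    """Monte-Carlo sketch of the Lemma 7(a) gradient for a tabular reward r_x(s, a) = x[s, a].
    `sample_traj(pi, T)` is an assumed helper that rolls out policy `pi` from s_0 ~ rho
    and returns a list [(s_0, a_0), (s_1, a_1), ...] of length T."""
    def discounted_visits(pi):
        G = np.zeros(x_shape)
        for _ in range(n_traj):
            for t, (s, a) in enumerate(sample_traj(pi, T)):
                G[s, a] += gamma ** t          # grad r_x(s_t, a_t) is one-hot at (s_t, a_t)
        return G / n_traj
    # grad_x p(x, y) = -E[sum_t gamma^t grad r_x | rho, pi_y] + E[sum_t gamma^t grad r_x | rho, pi_y*(x)]
    return discounted_visits(pi_star) - discounted_visits(pi_y)
```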

4.2 Differentiability of the Bellman penalty


For the Bellman penalty in (3.5), though it is straightforward to evaluate ∇_y p(x, y) =
∇_y g(x, y), the differentiability of p(x, y) in x is unclear. We next identify some sufficient
conditions that allow convenient evaluation of ∇_x p(x, y).

Assumption 2 Assume τ > 0 and the following hold:

(a) Given any (s, a), ∇_x Q^{π}_{Mτ(x)}(s, a) exists and is continuous in (x, π); and,

(b) either the discount factor γ = 0, or, given x ∈ X, for the MDP Mτ(x) the Markov
chain induced by any policy π ∈ Π is irreducible.¹

Assumption 2 (a) is mild and can be satisfied in the applications in Section 2.2. Assumption
2 (b) is a regularity assumption on the MDP (Mitrophanov, 2005), and is often assumed in
recent studies on policy gradient algorithms (see, e.g., (Wu et al., 2020; Qiu et al., 2021)).

Lemma 8 (Generic gradient form) Consider the Bellman penalty p in (3.5). Suppose
Assumption 2 holds. Then p(x, y) is differentiable with the gradient ∇_x p(x, y) = ∇_x g(x, y) −
∇v(x), where

    ∇_x g(x, y) = −E_{s∼ρ, a∼πy(s)}[ ∇_x Q^{π}_{Mτ(x)}(s, a) ]|_{π=πy∗(x)}        (4.2)

    ∇v(x) = −E_{s∼ρ, a∼πy∗(x)(s)}[ ∇_x Q^{π}_{Mτ(x)}(s, a) ]|_{π=πy∗(x)}        (4.3)

The proof can be found in Appendix C.3. The above lemma provides the form of the gradients
for the BM problem. Next, we show that Lemma 8 holds for the example applications in
Section 2.2 and then compute the closed form of the gradients.

Lemma 9 (Gradient form in the applications) Consider the Bellman penalty p(x, y)
in (3.5). The gradient forms for the bilevel RL applications are listed below.

¹ In Mτ(x), the Markov chain induced by policy π is irreducible if for any state s and initial state-action
pair (s_0, a_0), there exists a time step t such that P^{π}_x(s_t = s|s_0, a_0) > 0, where P^{π}_x(s_t = s|s_0, a_0) is the
probability of reaching s at time step t in MDP Mτ(x) under policy π.


(a) RLHF/reward shaping: Assume rx is continuously differentiable and Assumption 2 (b)
holds. Then Lemma 8 holds and we have

    ∇_x g(x, y) = −E[ ∑_{t=0}^{∞} γ^t ∇rx(s_t, a_t) | s_0 ∼ ρ, a_0 ∼ πy(s_0), πy∗(x) ],

    ∇v(x) = −E[ ∑_{t=0}^{∞} γ^t ∇rx(s_t, a_t) | s_0 ∼ ρ, πy∗(x) ]

where the expectation is taken over the trajectory generated by πy∗(x) and P.

(b) Stackelberg game: Assume πx is continuously differentiable and Assumption 2 (b) holds.
Then Lemma 8 holds and we have

    ∇_x g(x, y) = −E[ ∑_{t=0}^{∞} γ^t Q^{πx,πy∗(x)}_f(s_t, a_{l,t}, a_{f,t}) ∇log πx(a_{l,t}|s_t) | s_0 ∼ ρ, a_{f,0} ∼ πy(s_0), πx, πy∗(x) ]
                  + E[ ∑_{t=1}^{∞} γ^t τ h_{f,s_t}(πy∗(x)(s_t)) ∇log πx(a_{l,t}|s_t) | s_0 ∼ ρ, a_{f,0} ∼ πy(s_0), πx, πy∗(x) ],

    ∇v(x) = −E[ ∑_{t=0}^{∞} γ^t Q^{πx,πy∗(x)}_f(s_t, a_{l,t}, a_{f,t}) ∇log πx(a_{l,t}|s_t) | s_0 ∼ ρ, πx, πy∗(x) ]
            + E[ ∑_{t=1}^{∞} γ^t τ h_{f,s_t}(πy∗(x)(s_t)) ∇log πx(a_{l,t}|s_t) | s_0 ∼ ρ, πx, πy∗(x) ].

Recall that in the Stackelberg setting, πy∗(x) is the optimal follower policy given πx, and the
expectation is taken over the trajectory generated by πx, πy∗(x) and P.

The proof is deferred to Appendix C.4 due to space limitation.

4.3 A gradient-based algorithm and its convergence


In the previous subsections, we have addressed the challenges of evaluating ∇p(x, y), enabling
gradient-based methods to optimize Fλ(x, y) in (3.1). However, computing ∇p(x_k, y_k)
possibly requires an optimal policy πy∗(x_k) of the lower-level RL problem Mτ(x_k). Given
x_k, the lower-level RL problem can be solved with a wide range of algorithms, and we can
use an approximately optimal policy π̂_k ≈ πy∗(x_k) to compute the approximate
penalty gradient ∇̂p(x_k, y_k; π̂_k) ≈ ∇p(x_k, y_k). The explicit formula of ∇̂p(x_k, y_k; π̂_k) can be
obtained straightforwardly by replacing the optimal policy with its approximation π̂_k in the
formulas of ∇p(x_k, y_k) presented in Lemmas 7 and 9. We therefore defer the explicit
formula to Appendix C.7 for ease of reading.

Given ∇̂p(x_k, y_k; π̂_k), we approximate the gradient ∇Fλ(x_k, y_k) as ∇̂Fλ(x_k, y_k; π̂_k) :=
∇f(x_k, y_k) + λ∇̂p(x_k, y_k; π̂_k) and update

    (x_{k+1}, y_{k+1}) = Proj_Z[ (x_k, y_k) − α ∇̂Fλ(x_k, y_k; π̂_k) ]        (4.4)

where Z = X × Y. This optimization process is summarized in Algorithm 1.


We next study the convergence of PBRL. To bound the error of the update in Algorithm
1, we make the following assumption on the sub-optimality of the policy π̂k .


Algorithm 1 PBRL: Penalty-based Bilevel RL Gradient Descent

1: Select either the value or the Bellman penalty. Select (x_1, y_1) ∈ Z := X × Y. Select the step
   size α, penalty constant λ, and iteration number K.
2: for k = 1 to K do
3:   Given the RL problem Mτ(x_k), solve for an approximately optimal policy π̂_k ∈ Π.
4:   Compute the penalty's approximate gradient ∇̂p(x_k, y_k; π̂_k) ≈ ∇p(x_k, y_k).
5:   Compute the inexact gradient of Fλ as ∇̂Fλ(x_k, y_k; π̂_k) = ∇f(x_k, y_k) + λ∇̂p(x_k, y_k; π̂_k).
6:   (x_{k+1}, y_{k+1}) = Proj_Z[ (x_k, y_k) − α ∇̂Fλ(x_k, y_k; π̂_k) ].
7: end for
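For readability, here is a minimal Python sketch (ours) of the PBRL loop above, with the problem-specific pieces (lower-level RL solver, penalty-gradient estimator, projection) passed in as assumed callables rather than implemented:

```python
def pbrl(x, y, grad_f, grad_p_hat, solve_lower_rl, proj_Z, alpha, lam, K):
    """Sketch of Algorithm 1 (PBRL).  `solve_lower_rl(x)` returns an approximately optimal
    policy of M_tau(x) (e.g., by policy mirror descent), `grad_p_hat(x, y, pi_hat)` returns
    the approximate penalty gradient of Lemma 7 or 9, and `proj_Z` projects onto X x Y."""
    for _ in range(K):
        pi_hat = solve_lower_rl(x)                        # line 3: inner-loop RL oracle
        gp_x, gp_y = grad_p_hat(x, y, pi_hat)             # line 4: approximate penalty gradient
        gf_x, gf_y = grad_f(x, y)
        gx, gy = gf_x + lam * gp_x, gf_y + lam * gp_y     # line 5: inexact gradient of F_lambda
        x, y = proj_Z(x - alpha * gx, y - alpha * gy)     # line 6: projected gradient step
    return x, y
```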

Assumption 3 (Oracle accuracy) Given some accuracy ϵ_orac and step size α, assume
the following inequality holds:

    (1/K) ∑_{k=1}^{K} 20λ² ∥∇̂p(x_k, y_k; π̂_k) − ∇p(x_k, y_k)∥² ≤ ϵ_orac + (1/K) ∑_{k=1}^{K} (1/α²) ∥(x_{k+1}, y_{k+1}) − (x_k, y_k)∥².

This assumption only requires the running-average error to be upper bounded, which is
milder than requiring the error to be upper bounded at each iteration. A sufficient condition
for the above assumption is ∥π̂_k − πy∗(x_k)∥² ≤ ϵ_orac/(cλ²) for some constant c, which can be
achieved by the policy mirror descent algorithm (see, e.g., (Lan, 2023; Zhan et al., 2023))
with iteration complexity O(log(λ²/ϵ_orac)) (see a justification in Appendix C.7).
Furthermore, to guarantee worst-case convergence, the regularity condition that f and p
are Lipschitz-smooth is required. We thereby identify a set of sufficient conditions for the
value or Bellman penalty to be smooth.

Assumption 4 (Smoothness assumption) Assume that, given any (s, a), h_s(πy(s)) is L_h-Lipschitz-smooth
on Y, and Q^{πy}_{Mτ(x)}(s, a), V^{πy}_{Mτ(x)}(s) are L_v-Lipschitz-smooth on X × Y.

Assumption 4 is satisfied under a smooth rx and a smooth policy (e.g., the softmax policy (Mei
et al., 2020)), or under direct policy parameterization paired with a smooth regularization function
h_s. See a detailed justification in Appendix C.5.
Lemma 10 (Lipschitz smoothness of penalty functions) Under Assumptions 2 and
4, the value or Bellman penalty function p(x, y) is L_p-Lipschitz-smooth on X × Y with a
constant L_p specified in the proof.

We refer the reader to Appendix C.6 for the proof. Given the smoothness of the penalty terms,
we make a final regularity assumption on f.

Assumption 5 There exists a constant L_f such that f(x, y) is L_f-Lipschitz-smooth in (x, y).

The projected gradient is a commonly used metric in the convergence analysis of projected-gradient-type
algorithms (Ghadimi et al., 2016). Define the projected gradient of Fλ(x, y) as

    Gλ(x_k, y_k) := (1/α) [ (x_k, y_k) − (x̄_{k+1}, ȳ_{k+1}) ],        (4.5)

where (x̄_{k+1}, ȳ_{k+1}) := Proj_Z( (x_k, y_k) − α∇Fλ(x_k, y_k) ). Now we are ready to present the
convergence theorem of PBRL.

Theorem 11 (Convergence of PBRL) Consider running the PBRL algorithm. Suppose
Assumptions 2–5 hold. Choose step size α ≤ 1/(L_f + λL_p); then we have

    (1/K) ∑_{k=1}^{K} ∥Gλ(x_k, y_k)∥² ≤ 16 ( Fλ(x_1, y_1) − inf_{(x,y)∈Z} f(x, y) ) / (αK) + ϵ_orac.

See Appendix C.8 for the proof of the above theorem. At each outer iteration k, let
com(ϵ_orac) be the oracle's iteration complexity. Then the above theorem suggests that Algorithm
1 has an iteration complexity of O(λϵ^{−1} com(ϵ_orac)). When choosing the oracle as policy
mirror descent, so that com(ϵ_orac) = O(log(λ²/ϵ_orac)) (Lan, 2023; Zhan et al., 2023),
Algorithm 1 has an iteration complexity of Õ(λϵ^{−1}).

5 Bilevel RL with Lower-level Zero-sum Games


In the previous sections, we have introduced a penalty method to solve the bilevel RL
problem with a single-agent lower-level MDP. In this section, we seek to extend the previous
idea to the case where the lower-level problem is a zero-sum Markov game (Shapley, 1953;
Littman, 2001). We will first introduce the formulation of bilevel RL with a zero-sum Markov
game as the lower-level problem, and then propose its penalty reformulation with a suitable
penalty function. Finally, we establish the finite-time convergence for a projected policy
gradient-type bilevel RL algorithm.

5.1 Formulation
Given a parameter x ∈ R^{dx}, consider a parameterized two-player zero-sum Markov game
Mτ(x) = {S, A, rx, Px, τh}, where S is a finite state space; A = A1 × A2 is a finite joint action
space, with A1, A2 the action spaces of players 1 and 2, respectively; rx(s, a) (with a = (a1, a2)
the joint action) is player 1's parameterized reward, and player 2's reward is −rx; and the
parameterized transition distribution Px specifies Px(s′|s, a), the probability of the
next state being s′ given that the current state is s and the players take joint action a.
Furthermore, we let πi ∈ Πi denote player i's policy, where πi(ai|s) is the probability of
player i taking action ai given state s. Here Πi is the policy class of player i, which we assume
is a convex set. We let π ∈ Π = Π1 × Π2 denote the joint policy.
Let τ ≥ 0 be a regularization parameter and h_s a regularization function at state s ∈ S.
Given the joint policy π = (π1, π2), the (regularized) value function under π is defined as

    V^{π1,π2}_{Mτ(x)}(s) = V^{π}_{Mτ(x)}(s) := E[ ∑_{t=0}^{∞} γ^t ( rx(s_t, a_t) − τ h_{s_t}(π1(s_t)) + τ h_{s_t}(π2(s_t)) ) | s_0 = s, π ]        (5.1)

where the expectation is taken over the trajectory generated by a_t ∼ (π1(s_t), π2(s_t)) and
s_{t+1} ∼ Px(s_t, a_t). Given some state distribution ρ, we write V^{π}_{Mτ(x)}(ρ) = E_{s∼ρ}[V^{π}_{Mτ(x)}(s)].
We can also define the Q function as

    Q^{π1,π2}_{Mτ(x)}(s, a1, a2) = Q^{π}_{Mτ(x)}(s, a) := rx(s, a) + γ E_{s′∼Px(·|s,a)}[ V^{π}_{Mτ(x)}(s′) ].        (5.2)


With a state distribution ρ that satisfies min_s ρ(s) > 0, the set of ϵ-Nash equilibria (NE)
(Ding et al., 2022; Zhang et al., 2023; Ma et al., 2023) is defined as

    NEϵ(x) := { (π1, π2) ∈ Π : V^{π1,π2}_{Mτ(x)}(ρ) ≥ V^{π1′,π2}_{Mτ(x)}(ρ) − ϵ  ∀π1′ ∈ Π1,  and
                V^{π1,π2}_{Mτ(x)}(ρ) ≤ V^{π1,π2′}_{Mτ(x)}(ρ) + ϵ  ∀π2′ ∈ Π2 }.        (5.3)

The Nash equilibrium is then defined as an ϵ-NE with ϵ = 0, and NE(x) = NE_0(x).


Bilevel RL. In the bilevel RL problem, we are interested in finding the optimal parameter
x such that the Nash equilibrium induced by this parameter optimizes an objective
function f. The mathematical formulation is given as follows:

    min_{x,π} f(x, π),  s.t.  x ∈ X,  π ∈ NE(x).        (5.4)

In the above problem, we aim to find a parameter x and select among all the Nash
equilibria under x such that a certain loss function f is minimized. We next present a
motivating example for this general problem.
Motivating example: Incentive design. Adaptive incentive design (Ratliff et al.,
2019) involves an incentive designer who tries to manipulate self-interested agents by
modifying their payoffs with carefully designed incentive functions. In the case where the
agents are playing a zero-sum game, the incentive designer’s problem (Yang et al., 2021)
can be formulated as (5.4), given by

    min_{x,π} f(π) = −E_{π,P_id}[ ∑_{t=0}^{∞} γ^t ( r_id(s_t, a_t) − c(s_t) ) ]  s.t.  x ∈ X,  π ∈ NE(x)        (5.5)

where Pid (·|s, a) is the transition distribution of the designer; rid is the designer’s reward,
e.g., the social welfare reward (Yang et al., 2021); the function c(s) is the designer’s cost; the
expectation is taken over the trajectory generated by agents’ joint policy π and transition
Pid ; and in the lower level, the MDP Mτ (x) is parameterized by x via the incentive reward
rx . Note rx is the agents’ reward, which is designed by the designer to control the behavior
of the agents such that the designer’s reward given by rid and c is maximized.

5.2 The Nikaido-Isoda function as a penalty


Different from a static bilevel optimization problem, the problem in (5.4) does not have
an optimization problem in the lower level; instead, it has the more abstract constraint
π ∈ NE(x). Our first step is to recast the problem in (5.4) as a bilevel optimization
problem via an optimization reformulation of the Nash equilibrium seeking problem. In
doing so, we will use the Nikaido-Isoda (NI) function first introduced in (Nikaidô and Isoda,
1955). It takes a special form in two-player zero-sum games:

    ψ(x, π) := max_{π1∈Π1} V^{π1,π2}_{Mτ(x)}(ρ) − min_{π2∈Π2} V^{π1,π2}_{Mτ(x)}(ρ).        (5.6)
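For intuition, here is a minimal numpy sketch (ours) of (5.6) in the unregularized tabular case (τ = 0): each inner problem is a single-agent best response obtained by value iteration after averaging out the opponent's policy. This only illustrates the definition:

```python
import numpy as np

def ni_function(P, r_x, pi1, pi2, gamma, rho, iters=2000):
    """Sketch of the NI function (5.6) for an unregularized tabular zero-sum game.
    P has shape (S, A1, A2, S); r_x has shape (S, A1, A2); pi_i has shape (S, A_i)."""
    def best_response_value(opponent, player, maximize):
        # Reduce to a single-agent MDP for `player` by averaging out the opponent's policy.
        if player == 1:
            r = np.einsum("sb,sab->sa", opponent, r_x)
            Pm = np.einsum("sb,sabp->sap", opponent, P)
        else:
            r = np.einsum("sa,sab->sb", opponent, r_x)
            Pm = np.einsum("sa,sabp->sbp", opponent, P)
        V = np.zeros(P.shape[0])
        for _ in range(iters):                 # value iteration for the best response
            Q = r + gamma * Pm @ V
            V = Q.max(axis=1) if maximize else Q.min(axis=1)
        return float(rho @ V)
    # max over player 1's policy (pi2 fixed) minus min over player 2's policy (pi1 fixed)
    return best_response_value(pi2, 1, True) - best_response_value(pi1, 2, False)
```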

We have the following basic property of this function.


Lemma 12 (Bilevel formulation) Given any x and π ∈ Π, we have ψ(x, π) ≥ 0; ψ(x, π) ≤ 2ϵ if
π ∈ NEϵ(x); and π ∈ NEϵ(x) if ψ(x, π) ≤ ϵ. Therefore, (5.4) is equivalent to the following
bilevel optimization problem:

    BZ :  min_{x,π} f(x, π)  s.t.  x ∈ X,  π ∈ argmin_{π∈Π} ψ(x, π).        (5.7)

Proof From the definition (5.6), we have

    ψ(x, π) = ( −V^{π1,π2}_{Mτ(x)}(ρ) + max_{π1∈Π1} V^{π1,π2}_{Mτ(x)}(ρ) ) + ( V^{π1,π2}_{Mτ(x)}(ρ) − min_{π2∈Π2} V^{π1,π2}_{Mτ(x)}(ρ) ).        (5.8)

The result follows since both terms on the RHS of (5.8) are nonnegative on Π.
By the above lemma, ψ(x, π) is an optimality metric for the lower-level NE-seeking problem.
Therefore, it is natural to consider when ψ is a suitable penalty. Define the penalized
problem as

    BZλp :  min_{x,π} f(x, π) + λψ(x, π),  s.t.  x ∈ X,  π ∈ Π.        (5.9)

To relate BZλp with the original problem BZ, certain structure of ψ(x, ·) is required. A
special structure of ψ has been studied in previous works where each player's payoff is
non-Markovian (see, e.g., (Von Heusinger and Kanzow, 2009)). Those results rely on certain
monotonicity conditions on the payoff functions that do not hold in our Markovian setting.
Instead, inspired by the previously discussed single-agent case, we prove a gradient dominance
condition under the following assumption.
Assumption 6 Assume τ > 0 and that V^{π1,π2}_{Mτ(x)}(ρ) is continuously differentiable in (x, π1, π2).

For a justification of a stronger version of this assumption, please see Appendix D.3. Under
this assumption, we can prove the following key lemma.
Lemma 13 (Gradient dominance of ψ) If Assumption 6 holds, we have the following.

(a) The function ψ is differentiable with

    ∇_π ψ(x, π) = ( −∇_{π1} V^{π1,π2∗}_{Mτ(x)}(ρ), ∇_{π2} V^{π1∗,π2}_{Mτ(x)}(ρ) )

where π1∗ := argmax_{π1∈Π1} V^{π1,π2}_{Mτ(x)}(ρ) and π2∗ is defined similarly; and

    ∇_x ψ(x, π) = ∇_x V^{π1∗,π2}_{Mτ(x)}(ρ) − ∇_x V^{π1,π2∗}_{Mτ(x)}(ρ).

(b) There exists a constant µ = (1 − γ) min_s ρ(s) such that, given any x and τ > 0, ψ(x, π)
is µ-gradient-dominated in π:

    max_{π′∈Π} ⟨∇_π ψ(x, π), π − π′⟩ ≥ µ ψ(x, π),  ∀π ∈ Π.        (5.10)

Please see Appendix D.1 for the proof. The proof is based on the gradient dominance
condition of the single-agent setting in Lemma 2, along with the max-min special form of
the NI function.
With Lemma 13, we are ready to relate BZλp with BZ.


Lemma 14 (Relation on solutions) Suppose Assumption 6 holds and f(x, π) is L-Lipschitz-continuous
in π. Given accuracy δ > 0, choose λ ≥ δ^{−1}. If (xλ, πλ) is a local/global solution
of BZλp, then it is a local/global solution of the relaxed BZ with some ϵλ ≤ δ:

    BZϵ :  min_{x,π} f(x, π),  s.t.  x ∈ X,  π ∈ Π,  ψ(x, π) ≤ ϵλ.        (5.11)

The proof is deferred to Appendix D.2. The above lemma shows that one can recover the
local/global solution of an approximate problem of BZ by solving BZλp instead. To solve
BZλp, we next propose a projected-gradient-type update and establish its finite-time
convergence.

5.3 A policy gradient-based algorithm and its convergence analysis


To solve BZ, we consider a projected gradient update for its penalized problem
BZλp. To evaluate the gradient of the objective function in BZλp, one needs to evaluate ∇ψ(x, π). Note
that evaluating ∇_π ψ(x, π) requires the points π1∗(π2, x) and π2∗(π1, x) (defined in Lemma 13),
which are optimal policies of a fixed MDP given parameters (π2, x) and (π1, x), respectively.
There are various efficient algorithms to find the optimal policy of a regularized MDP.
Thus, we assume that at each iteration k, we have access to approximately optimal policies
π̂1^k ≈ π1∗(π2^k, x^k) and π̂2^k ≈ π2∗(π1^k, x^k) obtained by certain RL algorithms. With π̂^k = (π̂1^k, π̂2^k),
we denote the estimator of ∇ψ(x^k, π^k) as ∇̂ψ(x^k, π^k; π̂^k), whose definition follows
from Lemma 13 (a) with π̂1^k and π̂2^k in place of π1∗ and π2∗, respectively:

    ∇̂ψ(x^k, π^k; π̂^k) := ( ∇_x V^{π̂1^k,π2^k}_{Mτ(x^k)}(ρ) − ∇_x V^{π1^k,π̂2^k}_{Mτ(x^k)}(ρ),
                            −∇_{π1} V^{π1^k,π̂2^k}_{Mτ(x^k)}(ρ), ∇_{π2} V^{π̂1^k,π2^k}_{Mτ(x^k)}(ρ) ).
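For concreteness, here is a short sketch (ours) of how this estimator is assembled from two single-agent best responses, with `best_response` and `grad_V` as assumed oracles rather than concrete implementations:

```python
def grad_psi_hat(x, pi1, pi2, best_response, grad_V):
    """Sketch of the estimator of grad psi(x, pi) used in update (5.12).
    `best_response(x, opponent, player)` is an assumed RL oracle (e.g., policy mirror descent)
    returning an approximately optimal single-agent policy; `grad_V(x, p1, p2, wrt)` returns
    the gradient of V^{p1,p2}_{M_tau(x)}(rho) with respect to `wrt` in {"x", "pi1", "pi2"}."""
    pi1_hat = best_response(x, pi2, player=1)   # ~ argmax_{pi1'} V^{pi1', pi2}(rho)
    pi2_hat = best_response(x, pi1, player=2)   # ~ argmin_{pi2'} V^{pi1, pi2'}(rho)
    g_x = grad_V(x, pi1_hat, pi2, "x") - grad_V(x, pi1, pi2_hat, "x")
    g_pi1 = -grad_V(x, pi1, pi2_hat, "pi1")
    g_pi2 = grad_V(x, pi1_hat, pi2, "pi2")
    return g_x, (g_pi1, g_pi2)
```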

We then perform a projected gradient-type update with this estimator:

    (x^{k+1}, π^{k+1}) = Proj_Z[ (x^k, π^k) − α( ∇f(x^k, π^k) + λ∇̂ψ(x^k, π^k; π̂^k) ) ]        (5.12)

where Z = X × Π. We make the following assumption on the sub-optimality of π̂1^k and π̂2^k.

Assumption 7 (Oracle accuracy) Given some pre-defined accuracy ϵ_orac > 0 and the
step size α, assume the approximate policies π̂1^k and π̂2^k satisfy the following inequality:

    (1/K) ∑_{k=1}^{K} 20λ² ∥∇̂ψ(x^k, π^k; π̂^k) − ∇ψ(x^k, π^k)∥² ≤ ϵ_orac + (1/K) ∑_{k=1}^{K} (1/α²) ∥(x^{k+1}, π^{k+1}) − (x^k, π^k)∥².

The left-hand side of the above inequality can be upper bounded by the optimality gaps
of the approximately optimal policies {π̂1^k, π̂2^k}. Note that the policies {π̂1^k, π̂2^k} are not an
approximate NE. Instead, π̂1^k is player 1's approximately optimal policy on the Markov
model with parameter x^k when player 2 adopts π2^k. Thus, to obtain π̂1^k, one may use efficient
single-agent policy optimization algorithms. For example, when using the policy mirror
descent algorithm (Zhan et al., 2023), it takes an iteration complexity of O(log(λ²/ϵ_orac))
to solve for sufficiently accurate approximate policies. Similarly, π̂2^k is an approximately optimal


policy of player 2 on the Markov game with parameter xk , where player 1 adopts π1k . Thus
π̂2k can similarly be efficiently obtained using a standard single-agent policy optimization
algorithm. Furthermore, a more detailed justification of this assumption is provided in Appendix D.5.
We next identify sufficient conditions for the finite-time convergence in (5.12) as follows.

Assumption 8 (Smoothness assumption of ψ) Suppose Assumption 6 holds. Additionally,
assume the following hold.

(a) Given any s, V^{π}_{Mτ(x)}(s) is L_v-Lipschitz-smooth on X × Π;

(b) if the discount factor γ > 0, then, given x ∈ X, for any state s and initial state-action
pair (s_0, a_0) there exists t such that P^{π}_x(s_t = s|s_0, a_0) > 0, where P^{π}_x(s_t = s|s_0, a_0) is
the probability of reaching s at time t in the MDP Mτ(x) under the joint policy π.

Assumption 8 (a) can be satisfied under a smooth regularization function and smooth
parameterized functions rx and Px; see the justification in Appendix D.3. Assumption 8 (b)
is in the same spirit as Assumption 2 (b) in the single-agent case. Under Assumption 8, we
can prove that the NI function is Lipschitz-smooth.

Lemma 15 (Smoothness of ψ) Under Assumption 8, there exists a constant Lψ such
that ψ(x, π) is Lψ-Lipschitz-smooth on X × Π.

The proof of the above lemma can be found in Appendix D.4. With the above smoothness
condition, we are ready to establish the convergence result. Define the projected gradient of
the objective function in BZλp (5.9) as

    Gλ(x^k, π^k) := (1/α) [ (x^k, π^k) − (x̄^{k+1}, π̄^{k+1}) ],        (5.13)

where (x̄^{k+1}, π̄^{k+1}) := Proj_Z[ (x^k, π^k) − α( ∇f(x^k, π^k) + λ∇ψ(x^k, π^k) ) ]. Now we are ready
to present the convergence theorem of update (5.12).

Theorem 16 (Convergence of PBRL with zero-sum lower level) Consider running
update (5.12). Suppose Assumptions 5, 6, 7 and 8 hold. Choose step size α ≤ 1/(L_f + λLψ); then
we have

    (1/K) ∑_{k=1}^{K} ∥Gλ(x^k, π^k)∥² ≤ 16 ( f(x^1, π^1) + λψ(x^1, π^1) − inf_{(x,π)∈Z} f(x, π) ) / (αK) + ϵ_orac.        (5.14)

The proof is deferred to Appendix D.6. At each outer iteration k, let com(ϵ_orac) be the
oracle's iteration complexity. Then the above theorem suggests that update (5.12) has an
iteration complexity of O(λϵ^{−1} com(ϵ_orac)). As discussed under Assumption 7, one can
use policy mirror descent to solve for π̂1^k, π̂2^k when estimating ∇ψ(x^k, π^k), in which case
com(ϵ_orac) = O(log(λ²/ϵ_orac)). In such cases, update (5.12) has an iteration complexity of
Õ(λϵ^{−1}).

6 Simulation
In this section, we test the empirical performance of PBRL in different tasks.


[Figure 1: two panels plotting the leader's value function (left) and the follower's optimality gap (right) against environment steps for PBRL (value penalty), PBRL (Bellman penalty), and independent gradient.]

Figure 1: Stackelberg Markov games. The result is generated by running the algorithms in
10 random Stackelberg MDPs. The environment step is the total number of steps
taken in the MDP, and is therefore also proportional to the total samples used in
training. The leader's value function is V^{π_{x_k},π_{y_k}}_l(ρ), and the follower's optimality
gap is given by V^{π_{x_k},πy∗(x_k)}_f(ρ) − V^{π_{x_k},π_{y_k}}_f(ρ). A zero optimality gap means the
follower has found the best response to the leader.

6.1 Stackelberg Markov game

We first seek to solve the Stackelberg Markov game formulated as

    min_x −V^{πx,πy∗(x)}_l(ρ),  s.t.  x ∈ R^{dx},  πy∗(x) = argmin_{πy} −V^{πx,πy}_f(ρ),        (6.1)

where πx and πy are parameterized via the softmax function. Here the transition distribution
and rewards are randomly generated. The game has a state space of size |S| = 100, and the
leader's and follower's action spaces are of size |Al| = 5 and |Af| = 5, respectively. Each entry of the
rewards Rl, Rf ∈ R^{100×5×5} is uniformly sampled from [0, 1], and values smaller than 0.7
are set to 0 to promote sparsity. Each entry of the transition matrix is uniformly sampled from
[0, 1] and then normalized into a distribution.
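A sketch (ours) of this environment generation, following the description above; the exact experiment code is not reproduced in this section:

```python
import numpy as np

def random_stackelberg_mdp(S=100, Al=5, Af=5, sparsity=0.7, seed=0):
    """Random Stackelberg MDP: uniform rewards on [0, 1] with entries below 0.7 zeroed
    to promote sparsity, and uniformly sampled transition rows normalized to distributions."""
    rng = np.random.default_rng(seed)
    Rl = rng.uniform(0.0, 1.0, size=(S, Al, Af))
    Rf = rng.uniform(0.0, 1.0, size=(S, Al, Af))
    Rl[Rl < sparsity] = 0.0
    Rf[Rf < sparsity] = 0.0
    P = rng.uniform(0.0, 1.0, size=(S, Al, Af, S))
    P /= P.sum(axis=-1, keepdims=True)          # normalize each (s, a_l, a_f) row
    return Rl, Rf, P
```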
Baseline. We implement PBRL with both the value and the Bellman penalty, and compare
them with the independent policy gradient method (Daskalakis et al., 2020; Ding et al.,
2022). In the independent gradient method, each player myopically maximizes its own
value function, i.e., the leader maximizes V^{πx,πy}_l(ρ) while the follower maximizes V^{πx,πy}_f(ρ).
At each step k, the leader updates π_{x_k} with a one-step gradient of V^{πx,π_{y_k}}_l(ρ), while the follower
updates π_{y_k} with a one-step gradient of V^{π_{x_k},πy}_f(ρ). We test all algorithms across 10 randomly
generated MDPs.
We report the results in Figure 1. In the right panel, we can see that the follower's optimality
gap diminishes to zero, that is, the followers have found their optimal policies. Meanwhile, the
left panel reports the leader's total rewards for the three methods. Overall,
we find that PBRL with either the value or the Bellman penalty outperforms the independent
gradient: it can be observed from Figure 1 (left) that PBRL achieves a higher leader's return
than the independent gradient, with the value-penalty variant reaching the highest value.

6.2 Deep reinforcement learning from human feedback


We test our algorithm in RLHF, following the experiment setting in (Christiano et al., 2017);
see a description of the general RLHF setting in Section 2.2.


[Figure 2: three panels plotting episodic return against environment steps on Seaquest (2.1k labels), BeamRider (9k labels), and MsPacman (5.9k labels) for PBRL, DRLHF, and A2C (oracle).]

Figure 2: Performance on Atari games measured by the true reward. The 'episode return' is the
sum of true rewards in an episode. We average the episode returns over 5 consecutive
episodes. The 'environment steps' are the number of steps taken per worker in
policy optimization. We compare the performance of PBRL (ours) and DRLHF,
both with few labeled pairs, and A2C with the true reward.

Environment and preference collection. We conduct our experiments in the Arcade


Learning Environment (ALE) (Bellemare et al., 2013) through OpenAI gym. The ALE
provides the game designer's reward, which can be treated as the ground-truth reward. For
each pair of segments we collect, we assign the preference to whichever has the higher ground-truth
reward. This preference generation process allows us to benchmark our algorithm against
DRLHF, which also uses this process.
Baseline. We compare PBRL with DRLHF (Christiano et al., 2017) and A2C (A3C
(Mnih et al., 2016), but synchronous). We use the ground-truth reward to train the A2C agent
and treat A2C as an oracle algorithm. The oracle algorithm estimates a performance upper
bound for the other algorithms.
The results are reported in Figure 2. The first two games (Seaquest and BeamRider)
are also reported in (Christiano et al., 2017). For Seaquest, the asymptotic performances
of DRLHF and PBRL are similar, while DRLHF is more unstable in training. Similar
observations can also be made in the original DRLHF paper. For BeamRider and
MsPacman, we find that PBRL has an advantage over DRLHF in episode return:
PBRL achieves a higher best episode return than DRLHF and becomes comparable to
the oracle algorithm.

6.3 Incentive design


Here we test our algorithm on the following incentive design problem:

    min_{x,π} f(π) = −E_{π,P_id}[ ∑_{t=0}^{∞} γ^t r_id(s_t, a_t) ]  s.t.  π ∈ NE(x).        (6.2)

See a detailed description of this task in the motivating example of Section 5.1. We
have |S| = 10 and |A1| = |A2| = 5. The designer's transition P_id(·|s, a) and the lower-level
transition P(·|s, a) are randomly generated. The players' original reward r(s, a) and the
designer's reward r_id(s, a) are randomly generated in [0, 1]. The players' reward is
rx(s, a) = r(s, a) + 0.2σ(x(s, a)), where r is the original reward given by the environment
and σ(x(s, a)) is the incentive reward from the designer. Here σ is the sigmoid function and
x ∈ R^{|S|×|A1|×|A2|} is the incentive reward parameter. The policies π1, π2 are softmax policies.
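For concreteness, the incentive reward parameterization described above can be written as the following short helper (a sketch, not the experiment code):

```python
import numpy as np

def incentive_reward(r, x):
    """Players' reward in Section 6.3: r_x(s, a) = r(s, a) + 0.2 * sigmoid(x(s, a)),
    where r and x both have shape (S, A1, A2) and x is the designer's parameter."""
    return r + 0.2 / (1.0 + np.exp(-x))
```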


[Figure 3: two panels plotting the designer's reward (left) and the NE gap (right) against environment steps for PBRL, Meta-Gradient, and no incentive design.]

Figure 3: Incentive design. The result is generated by running the algorithms in 5 random
environments. The environment step is the total number of steps taken, and is
proportional to the total samples used. The designer's reward is f(π), the
expected cumulative designer reward. The NE gap is the estimated value
of the NI function ψ(x, π). A zero NE gap indicates that the players have achieved an
approximate Nash equilibrium under the incentive reward rx(s, a).

Baseline. We implement the PBRL update for the zero-sum lower level introduced in
Section 5 and compare it with the Meta-Gradient method (Yang et al., 2021). To exclude the
case where the original zero-sum game (with no incentive reward) already yields a high designer
reward, we also report the performance without incentive design, i.e., when σ(x(s, a))
is kept constant. This only returns an approximate NE of the lower-level zero-sum
problem without the incentive reward and therefore provides a baseline: the more effective an
algorithm's incentive reward is, the more it improves over this baseline.
Figure 3 (right) shows that both PBRL and Meta-Gradient find an approximate NE under
their respective incentive rewards r_x. Figure 3 (left) shows that the incentive rewards r_x of
both methods are effective, since the designer's reward f(π) of both methods exceeds the
baseline (green). In addition, PBRL outperforms Meta-Gradient because the incentive reward
r_x learned by PBRL guides the players toward a π ∈ NE(x) with a higher designer reward.

7 Conclusion
In this paper, we propose a penalty-based first-order algorithm for bilevel reinforcement
learning problems. In developing the algorithm, we provide results in three aspects: 1)
we identify penalty functions with suitable landscape properties such that the induced penalty
reformulation admits solutions of the original bilevel RL problem; 2) to develop a gradient-
based method, we study the differentiability of the penalty functions and derive their
closed-form gradients; 3) based on these findings, we propose the convergent PBRL
algorithm and evaluate it on the Stackelberg Markov game, RLHF, and incentive design.

Acknowledgments and Disclosure of Funding

The work was supported by the National Science Foundation (NSF) MoDL-SCALE project
2401297, NSF CAREER 2532349, NSF DMS project 2413243, and Cisco Research Award.


Table of Contents
A Preliminary results 22

B Proof in Section 2 and 3 24


B.1 Proof that Stackelberg Markov game is a bilevel RL problem . . . . . . . 24
B.2 Proof of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
B.3 Proof of Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
B.4 Proof of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
B.5 Proof of Lemma 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

C Proof in Section 4 29
C.1 Proof of Lemma 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
C.2 Proof of Lemma 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
C.3 Proof of Lemma 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
C.4 Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
C.5 Sufficient conditions of the smoothness assumption . . . . . . . . . . . . . 33
C.6 Proof of Lemma 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
C.7 Example gradient estimators of the penalty functions . . . . . . . . . . . . 37
C.8 Proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

D Proof in Section 5 40
D.1 Proof of Lemma 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
D.2 Proof of Lemma 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
D.3 Justification of the smoothness assumption in the two-player case . . . . . 41
D.4 Proof of Lemma 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
D.5 Gradient Estimator Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . 42
D.6 Proof of Theorem 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

E Additional Experiment Details 44


E.1 Stackelberg Markov game . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
E.2 Deep reinforcement learning from human feedback . . . . . . . . . . . . . 44
E.3 Incentive design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Appendix A. Preliminary results


Lemma 17 (Lipschitz continuous optimal policy) Given x ∈ X , consider the optimal
policies in a convex policy class Π of a parameterized MDP Mτ (x). Suppose Assumption 2


holds, τ > 0 and X is compact. Then the optimal policy πy∗(x) is unique and the following
inequality holds:

∥πy∗ (x) − πy∗ (x′ )∥ ≤ τ −1 CJ ∥x − x′ ∥, ∀x, x′ ∈ X (A.1)

where CJ is a constant specified in the proof.

Proof By Lemma 4, the optimal policy of Mτ(x) on a convex policy class Π is unique,
given by

    πy∗(q(x)) = argmin_{π∈Π} J(q(x), π) := E_{s∼ρ}[⟨π(s), q_s(x)⟩ + τ h_s(π(s))]        (A.2)

where recall that q(x) = (q_s(x))_{s∈S} with

    q_s(x) = ( − max_{π∈Π} Q^π_{Mτ(x)}(s, a) )_{a∈A}.        (A.3)

We overload the notation π∗ here with πy∗(q(x)), which equals πy∗(x). In (A.2), since
τ E_{s∼ρ}[h_s(π(s))] is τ-strongly convex in π on Π, πy∗(q(x)) satisfies (A.2) if and only if
it is a solution of the following parameterized variational inequality (VI)

    ⟨∇π J(q(x), π), π − π′⟩ ≤ 0,  ∀π′ ∈ Π        (A.4)

where

    ∇π J(q(x), π) = ( ρ(s) q_s(x) + τ ρ(s) ∇h_s(π(s)) )_{s∈S}.        (A.5)

First, it can be checked that ∇π J(q(x), π) is continuously differentiable at any (q(x), π).
Secondly, by the uniform strong convexity of J(q(x), ·), given any q(x), it holds that

    (π − π′)⊤ ∇²π J(q(x), πy∗(q(x))) (π − π′) ≥ τ−1 ∥π − π′∥².        (A.6)

Given these two properties of the VI, it then follows from (Dontchev and Rockafellar, 2009,
Theorem 2F.7) that the solution mapping πy∗ (q(x)) is τ −1 -Lipschitz-continuous locally at
any point q(x). Thus πy∗ (q(x)) is τ −1 -Lipschitz-continuous in q(x) globally, yielding

    ∥πy∗(q(x)) − πy∗(q(x′))∥ ≤ τ−1 ∥q(x) − q(x′)∥ ≤ τ−1 max_{x∈X} ∥∇q(x)∥ ∥x − x′∥ = τ−1 C_J ∥x − x′∥        (A.7)

where the second inequality follows from the fact that q(x) is continuously differentiable, which can be
checked by Lemma 8 under Assumption 2 together with the continuity of πy∗(x) proved earlier;
and C_J = max_{x∈X} ∥∇q(x)∥ is well defined by the compactness of X.


Appendix B. Proof in Section 2 and 3


B.1 Proof that Stackelberg Markov game is a bilevel RL problem
Lemma 18 (Stackelberg game cast as BM) The Stackelberg MDP from the follower's
viewpoint can be defined as a parametric MDP:

    Mτ(x) = { S, A_f, r_x(s, a_f) = E_{a_l∼π_x(s)}[r_f(s, a_l, a_f)], P_x(·|s, a_f) = E_{a_l∼π_x(s)}[P(·|s, a_l, a_f)], τ h_f }.

Then we have V_f^{π_x,π_y}(s) = V^{π_y}_{Mτ(x)}(s) for all s, and thus the original formulation of the
Stackelberg game in (2.4) can be rewritten as BM:

    SG :  min_{x,y} −V_l^{π_x,π_y}(ρ),   s.t.  x ∈ X,  y ∈ Y∗(x) = argmin_{y∈Y} −V^{π_y}_{Mτ(x)}(ρ).        (B.1)
Proof Recall that the follower's value function V_f^{π_x,π_y}(s) under the leader's policy π_x and
the follower's policy π_y is defined as

    V_f^{π_x,π_y}(s) = E[ ∑_{t=0}^∞ γ^t ( r_f(s_t, a_{l,t}, a_{f,t}) − τ h_{f,s_t}(π_y(s_t)) ) | s_0 = s, π_x, π_y ]        (B.2)

where the leader's action a_{l,t} ∼ π_x(s_t), the follower's action a_{f,t} ∼ π_y(s_t), and the state
transition follows s_{t+1} ∼ P(·|s_t, a_{l,t}, a_{f,t}).
It then follows from an expansion of the expectation in (B.2) that

    V_f^{π_x,π_y}(s) = E_{a_{l,0}∼π_x(s_0), a_{f,0}∼π_y(s_0)}[ r_f(s_0, a_{l,0}, a_{f,0}) − τ h_{f,s_0}(π_y(s_0)) | s_0 = s, π_x, π_y ]
        + γ E_{a_{l,0}∼π_x(s_0), a_{f,0}∼π_y(s_0), s_1∼P(s_0, a_{l,0}, a_{f,0}), a_{l,1}∼π_x(s_1), a_{f,1}∼π_y(s_1)}[ r_f(s_1, a_{l,1}, a_{f,1}) − τ h_{f,s_1}(π_y(s_1)) | s_0 = s, π_x, π_y ]
        + ...
        = E_{a_{f,0}∼π_y(s_0)}[ r_x(s_0, a_{f,0}) − τ h_{f,s_0}(π_y(s_0)) | s_0 = s, π_y ]
        + γ E_{a_{f,0}∼π_y(s_0), s_1∼P_x(s_0, a_{f,0}), a_{f,1}∼π_y(s_1)}[ r_x(s_1, a_{f,1}) − τ h_{f,s_1}(π_y(s_1)) | s_0 = s, π_y ] + ...
        = V^{π_y}_{Mτ(x)}(s)

where recall that P_x(s, a_f) = E_{a_l∼π_x(s)}[P(·|s, a_l, a_f)] and r_x(s, a_f) = E_{a_l∼π_x(s)}[r_f(s, a_l, a_f)]. Thus
we have V_f^{π_x,π_y}(s) = V^{π_y}_{Mτ(x)}(s) for all s. Therefore, the Stackelberg Markov game can be written
as BM.

B.2 Proof of Lemma 1


Proof Since (x_λ, y_λ) is an ϵ-minimum of BMλp, it holds for any x ∈ X and y ∈ Y that

    f(x_λ, y_λ) + λ( −V^{π_{y_λ}}_{Mτ(x_λ)}(ρ) + max_{y∈Y} V^{π_y}_{Mτ(x_λ)}(ρ) )
        ≤ f(x, y) + λ( −V^{π_y}_{Mτ(x)}(ρ) + max_{y∈Y} V^{π_y}_{Mτ(x)}(ρ) ) + ϵ.        (B.3)

Choosing x = x_λ and y ∈ Y(x_λ) in the above inequality and rearranging yields

    max_{y∈Y} V^{π_y}_{Mτ(x_λ)}(ρ) − V^{π_{y_λ}}_{Mτ(x_λ)}(ρ) ≤ (1/λ)( f(x_λ, y_λ) − f(x_λ, y) + ϵ ) ≤ (1/λ)( C + ϵ ) ≤ δ + λ^{−1} ϵ.        (B.4)

Define ϵ_λ := max_{y∈Y} V^{π_y}_{Mτ(x_λ)}(ρ) − V^{π_{y_λ}}_{Mτ(x_λ)}(ρ); then ϵ_λ ≤ δ + λ^{−1} ϵ. It follows from (B.3) that
for any x, y feasible for (3.4),

    f(x_λ, y_λ) ≤ f(x, y) + λ( −V^{π_y}_{Mτ(x)}(ρ) + max_{y∈Y} V^{π_y}_{Mτ(x)}(ρ) − ϵ_λ ) + ϵ ≤ f(x, y) + ϵ.        (B.5)

This completes the proof.

B.3 Proof of Lemma 2


Proof The following proof holds for any x and thus we omit x in the notations Mτ (x),
πy∗ (x) and Px in this proof. We first prove a policy gradient theorem for the regularized
MDP. From the Bellman equation, we have
X
π
VM τ
(s) = π(a|s)QπMτ (s, a) − τ hs (π(s)) (B.6)
a

Differentiating two sides of the equation with respect to π gives


X X
π π
∇VM τ
(s) = ∇π(a|s)Q Mτ (s, a) + π(a|s)∇QπMτ (s, a) − τ ∇π hs (π(s)). (B.7)
a a

By the definition of Q function, we have ∇QπMτ (s, a) = P(s′ |s, a)∇VM


π (s′ ). Substituting
P
s′ τ
this inequality into (B.7) yields
π
∇VM (s)
Xτ X
= ∇π(a|s)QπMτ (s, a) + Pπ (s1 = s′ |s0 = s)∇VM
π
τ
(s, a) − τ ∇π hs (π(s)) (B.8)
a s′

where P π (s1 = s′ |s0 = s) is the probability of s1 = s′ given s0 = s under policy π. Note


that the above inequality has a recursive structure; thus we can repeatedly apply it to
itself and obtain
π 1 X τ
∇VM (s) = Es̄∼dπs [ QπMτ (s̄, a)∇π(a|s̄)] + Es̄∼dπs [−∇π hs̄ (π(s̄))] (B.9)
τ
1−γ a
1 − γ

where dπs (s̄) := (1 − γ) t γ t P π (st = s̄|s0 = s) is the discounted visitation distribution.


P
Define dπMτ (s̄) := Es∼ρ [dπs (s̄)]. Since ∇π(a|s̄) = 1s̄,a where 1s̄,a is the indicator vector, we
have the regularized policy gradient given by
π 1  π
dMτ (s) QπMτ (s, ·) − τ ∇hs (π(s)) s∈S .

∇π V M (ρ) = (B.10)
τ
1−γ


Now we begin to prove the lemma. By the performance difference lemma (see e.g., (Lan,
2023, Lemma 2) and (Zhan et al., 2023, Lemma 5)), for any π ∈ Π, we have
π̃ π
max VM τ
(ρ) − VM τ
(ρ)
π̃∈Π
1 h i
= Es∼dπ∗ ⟨QπMτ (s, ·), πy∗ (s) − π(s)⟩ − τ hs (πy∗ (s)) + τ hs (π(s))
1−γ Mτ

1 h i
≤ Es∼dπ∗ ⟨QπMτ (s, ·), πy∗ (s) − π(s)⟩ − τ ⟨∇hs (π(s)), πy∗ (s) − π(s)⟩
1−γ Mτ

where the inequality follows from the convexity of hs . Continuing from the inequality, it
follows
π̃ π

(1 − γ) max VM τ
(ρ) − VM τ
(ρ)
π̃∈Π
h i
π ′ ′
≤ Es∼dπ∗ max ⟨Q Mτ (s, ·), π (s) − π(s)⟩ − τ ⟨∇h s (π(s)), π (s) − π(s)⟩
Mτ π ′ ∈Π
 π∗
dMτ (s) 
π ′ ′

= Es∼dπM max ⟨QMτ (s, ·), π (s) − π(s)⟩ − τ ⟨∇hs (π(s)), π (s) − π(s)⟩
τ dπMτ (s) π′ ∈Π
 π∗
dMτ 
π ′ ′

≤ Es∼dπM max ⟨Q Mτ (s, ·), π (s) − π(s)⟩ − τ ⟨∇h s (π(s)), π (s) − π(s)⟩
τ dπMτ ∞ π′ ∈Π
(B.11)
∗ ∗

Mτ (s) dπ

where the last inequality follows from dπ ≤ dπ and
Mτ (s) Mτ ∞
 
π ′ ′
max

⟨QM τ
(s, ·), π (s) − π(s)⟩ − τ ⟨∇hs (π(s)), π (s) − π(s)⟩
π ∈Π
≥ ⟨QπMτ (s, ·), π(s) − π(s)⟩ − τ ⟨∇hs (π(s)), π(s) − π(s)⟩ = 0. (B.12)

Continuing from (B.11), we have


π̃ π

(1 − γ) max VM τ
(ρ) − VM τ
(ρ)
π̃∈Π
 
1 π ′ ′
≤ max E d π ⟨QMτ (s, ·), π (s)−π(s)⟩−τ ⟨∇h s (π(s)), π (s)−π(s)⟩
(1−γ) mins ρ(s) π′ ∈Π Mτ
1
= max⟨∇π VM π
τ
(ρ), π ′ − π⟩ (B.13)
mins ρ(s) π′ ∈Π

where the inequality follows from (1 − γ)ρ(s) ≤ dπMτ (s) ≤ 1 for any s and π, and the equality
follows from (B.10). This proves the result.

B.4 Proof of Lemma 3


Proof Given xλ , point yλ satisfies the first-order stationary condition:

⟨∇y f (xλ , yλ ) + λ∇y p(xλ , yλ ), yλ − y ′ ⟩ ≤ 0, ∀y ′ ∈ Y (B.14)


which leads to

1
⟨∇y p(xλ , yλ ), yλ − y ′ ⟩ ≤ − ⟨∇y f (xλ , yλ ), yλ − y ′ ⟩
λ
L∥yλ − y ′ ∥ LCu
≤ ≤ , ∀y ′ ∈ Y (B.15)
λ λ

where Cu := maxy,y′ ∈Y ∥y − y ′ ∥ which is well defined by compactness of Y. For the LHS of


the above inequality, we have the following inequality hold
πy
min

⟨∇y p(xλ , yλ ), yλ − y ′ ⟩ = max

⟨∇y VMτλ(xλ ) (ρ), y ′ − yλ ⟩
y ∈Y y ∈Y
1 π πy 
≥ max VMyτ (xλ ) (ρ) − VMτλ(xλ ) (ρ) (B.16)
(1 − γ) mins ρ(s) y∈Y

where the last inequality follows since we are using the direct policy parameterization y = π,
together with Lemma 2.
Substituting (B.16) into (B.15) yields

π πy LCu
max VMyτ (xλ ) (ρ) − VMτλ(xλ ) (ρ) ≤ . (B.17)
y∈Y λ
πy π
Define ϵλ := −VMτλ(xλ ) (ρ) + maxy∈Y VMyτ (xλ ) (ρ) then ϵλ ≤ δ by choice of λ.
By local optimality of (xλ , yλ ), it holds for any x ∈ X , y ∈ Y and in the neighborhood of
(xλ , yλ ) that
πy π 
f (xλ , yλ ) + λ − VMτλ(xλ ) (ρ) + max VMyτ (xλ )
y∈Y
πy π 
≤ f (x, y) + λ − VMτ (x) (ρ) + max VMyτ (x) . (B.18)
y∈Y

From the above inequality, it holds for any (x, y) feasible for the relaxed BM in (3.4) and
in neighborhood of (xλ , yλ ) that

π π 
f (xλ , yλ ) ≤ f (x, y) + λ − VMyτ (x) (ρ) + max VMyτ (x) − ϵλ
y∈Y

≤ f (x, y) (B.19)

which proves the result.

B.5 Proof of Lemma 4


Proof We start with the first bullet. Define


VM τ (x)
π
(s) := max VM τ (x)
(s), Q∗Mτ (x) (s, a) := r(s, a) + γEs′ ∼Px (s,a) [VM

τ (x)
(s′ )].
π∈Π


Then it follows from the definition of the value function that for any s0 ,

VM τ (x)
(s0 )
h ∞
X i
γ t rx (st , at ) − τ hst (π(st )) s0 , π

= max E rx (s0 , a0 ) − τ hs0 (π(s0 )) +
π∈Π
t=1
h ∞ i
X
γ t rx (st , at )−τ hst (π(st )) s0 , a0 , π s0 , π
 
= max E rx (s0 , a0 )−τ hs0 (π(s0 ))+E
π∈Π
t=1

where the last equality follows from the law of total expectation. Continuing from above, we
have
h i

 π
VM τ (x)
(s 0 ) = max Ea0 ∼π(s0 ) rx (s 0 , a0 ) − τ hs 0 (π(s0 )) + γE V
s1 ∼Px (s0 ,a0 ) Mτ (x) (s 1 )
π∈Π
h  ∗ i
≤ max Ea0 ∼π(s0 ) r(s0 , a0 ) − τ hs0 (π(s0 )) + γEs1 ∼Px (s0 ,a0 ) VM τ (x)
(s 1 ) (B.20)
π∈Π

Given x, define a policy πy∗ = (πy∗ (s))s∈S ∈ Π via


h i
πy∗ (s0 ) := argmax Ea0 ∼π(s0 ) r(s0 , a0 ) − τ hs0 (π(s0 )) + γEs1 ∼Px (s0 ,a0 ) VM
 ∗
τ (x)
(s 1 )
π(s0 )

for any s0 ∈ S, where the argmax is a singleton following from the τ -strong convexity of τ h,
and we sometimes treat the singleton set as its element for convenience. Given the definition
of πy∗ , it then follows from (B.20) that

VM τ (x)
(s0 )
h  ∗ i
≤ Ea0 ∼πy∗ (s0 ) r(s0 , a0 ) − τ hs0 (π(s0 )) + γEs1 ∼Px (s0 ,a0 ) VM τ (x)
(s 1 )
h
≤ Ea0 ∼πy∗ (s0 ) r(s0 , a0 ) − τ hs0 (π(s0 ))
 ∗
i
+ γEs1 ∼Px (s0 ,a0 ),a1 ∼πy (s1 ) r(s1 , a1 ) − τ hs1 (π(s1 )) + γEs2 ∼Px (s1 ,a1 ) [VMτ (x) (s2 )]
∗ (B.21)

where the last inequality is a result of applying (B.20) twice. Continuing to recursively apply
(B.20) and then using the definition of VM π in (2.1) yield
τ (x)

∗ π∗
VM τ (x)
(s0 ) ≤ VMyτ (x) (s0 ), ∀s0 ∈ S (B.22)

which proves πy∗ is the optimal policy for Mτ (x). In addition, we have
h  π∗ i
πy∗ (s0 ) = argmax Ea0 ∼π(s0 ) r(s0 , a0 ) − τ hs0 (π(s0 )) + γEs1 ∼Px (s0 ,a0 ) VMyτ (x) (s1 )
π(s0 )
h π∗ i
y
= argmax Ea0 ∼π(s0 ) QMτ (x) (s0 , a0 ) − τ hs0 (π(s0 )) , ∀s0 . (B.23)
π(s0 )

Then we have πy∗ = arg miny∈Π g(x, y) and thus arg miny∈Π g(x, y) ∈ Y ∗ (x). To further prove
arg miny∈∆(A)|S| g(x, y) = Y ∗ (x), it then suffices to prove any other policy π ∈ Π different


from πy∗ is not optimal. Let s′0 be the state such that πy∗ (s′0 ) ̸= π(s′0 ). We have
h i
′ ′ ′
π
 ∗
VM τ (x) (s 0 ) ≤ Ea0 ∼π(s ′ ) r(s , a0 ) − τ hs′ (π(s )) + γEs ∼P (s′ ,a ) V
0 0 0 0 1 x 0 0 M τ (x) (s 1 )
h i
′ ∗ ′
 ∗
< Ea0 ∼πy∗ (s′0 ) r(s0 , a0 ) − τ hs′0 (πy (s0 )) + γEs1 ∼Px (s′0 ,a0 ) VMτ (x) (s1 )

= VM τ (x)
(s′0 ) (B.24)

where the last inequality follows from the strong convexity of h and the definition of πy∗ ; and
the last equality follows from πy∗ is the optimal policy. This proves the result.
Next we prove the second bullet. We have

∥yϵ − πy∗ ∥2 ≤ τ −1 g(x, yϵ ) − v(x) ≤ τ −1 ϵ



(B.25)

where the first inequality follows from τ -strong-convexity of g(x, ·). Next we prove |f ∗ −fϵ∗ | ≤
Lτ −1 ϵ. Let fϵ∗ = f (x∗ϵ , yϵ∗ ). We have

f (x∗ϵ , Y(x∗ϵ )) − f (x∗ϵ , yϵ∗ ) ≤ L∥yϵ∗ − Y(x∗ϵ )∥ ≤ L τ −1 ϵ (B.26)

where the last inequality follows from (B.25). The result follows from the fact that
f (x∗ϵ , Y(x∗ϵ )) ≥ f ∗ and f (x∗ϵ , yϵ∗ ) ≤ f ∗ .

Appendix C. Proof in Section 4


C.1 Proof of Lemma 6
We first introduce a generalized Danskin’s theorem as follows.

Lemma 19 (Generalized Danskin’s Theorem (Clarke, 1975)) Let F be a compact


set and let a continuous function ℓ : Rd × F 7→ R satisfy: 1) ∇x ℓ(x, y) is continuous in
(x, y); and 2) given any x, for any y, y ′ ∈ argmaxy∈F ℓ(x, y), ∇x ℓ(x, y) = ∇x ℓ(x, y ′ ). Then
let h(x) := maxy∈F ℓ(x, y), we have ∇h(x) = ∇x ℓ(x, y ∗ ) for any y ∗ ∈ argmaxy∈F ℓ(x, y).

Lemma 19 follows first from (Clarke, 1975, Theorem 2.1), whose conditions (a)–(d) are guaranteed
by Lemma 19's condition 1). Then by (Clarke, 1975, Theorem 2.1 (4)), the Clarke generalized
gradient set of h(x) = max_{y∈F} ℓ(x, y) is the convex hull of {∇x ℓ(x, y) : y ∈ argmax_{y∈F} ℓ(x, y)}.
It then follows from Lemma 19's condition 2) that this generalized gradient set is a singleton
{∇x ℓ(x, y∗)} for any y∗ ∈ argmax_{y∈F} ℓ(x, y). Finally, it follows from (Clarke, 1975,
Proposition 1.13) that h(x) is differentiable with gradient ∇x ℓ(x, y∗).
Now to prove Lemma 6, it suffices to prove

    ∇ max_{y∈Y} V^{π_y}_{Mτ(x)}(ρ) = ∇_x V^{π_{y∗}}_{Mτ(x)}(ρ)|_{y∗∈Y∗(x)}.

This holds by Assumption 1 and the generalized Danskin's theorem above, with
ℓ(x, y) = V^{π_y}_{Mτ(x)}(ρ).


C.2 Proof of Lemma 7


Proof (a). Under the assumptions in (a), Lemma 6 holds. It then follows from

    ∇_x V^{π_y}_{Mτ(x)}(ρ) = E[ ∑_{t=0}^∞ γ^t ∇r_x(s_t, a_t) | s_0 ∼ ρ, π_y ]        (C.1)

that the result holds.
(b). Given the follower’s policy πy , define the Stackelberg MDP from the leader’s view as
M(πy ) ={S, Al , rπy (s, al ) = Eaf ∼πy (s) [rf (s, af , al )] − τ hf,s (πy (s)),
Pπy (·|s, al ) = Eaf ∼πy (s) [P(·|s, al , af )]} (C.2)
Note M(πy ) does not include a regularization for its policy πx . By Lemma 18, we have the
π ,π
follower’s value function Vf x y (s) can be rewritten from the viewpoint that πy is the main
π ,π π
policy and πx is part of the follower’s MDP, that is, Vf x y (s) = VMyτ (x) (s). It can be proven
π ,π πx π πx
similarly that Vf x y (s) = VM(π y)
(s). Therefore, we have VMyτ (x) (s) = VM(π y)
(s) and
π πx
∇x VMyτ (x) (s) = ∇x VM(π y)
(s)
hX ∞ i
=E γ t QπM(π
x
y)
(s , a
t l,t )∇ log π (a |s
x l,t t ) s 0 = s, π x (C.3)
t=0
where the last equality follows from the policy gradient theorem (Sutton et al., 2000). We
have
QπM(π
x
y)
(s, al )
πx
= rπy (s, al ) + γEs′ ∼Pπy (s,al ) [VM(π y)
(s′ )]
π ,πy
= Eaf ∼πy (s) [rf (s, af , al )] − τ hf,s (πy (s)) + γEs′ ∼P(s,al ,af ),af ∼πy (s) [Vf x (s′ )]
π ,πy
= Eaf ∼πy (s) [Qf x (s, al , af )] − τ hf,s (πy (s)) (C.4)
π ,π
where the last equality follows from the definition of Qf x y (s, al , af ) in Section 2.2. Substi-
tuting the above equality into (C.3) yields
hX∞ i
πy π ,π
∇x VMτ (x) (s) = E γ t Qf x y (st , al,t , af,t )∇ log πx (al,t |st ) s0 = s, πx , πy
t=0

hX i
− τE γ t hf,st (πy (st ))∇ log πx (al,t |st ) s0 = s, πx , πy (C.5)
t=0
It then follows from Lemma 6 that
∇x p(x, y)
π π
= −∇x VMyτ (x) (ρ) + ∇x VMyτ (x) (ρ)|πy =πy∗ (x)
hX ∞ i
π ,π
γ t Qf x y (st , al,t , af,t )−τ hf,st (πy (st )) ∇ log πx (al,t |st ) s0 = s, πx , πy

= −E
t=0

πx ,π ∗ (x)
hX i
γ t Qf y (st , al,t , af,t )−τ hf,st (πy∗ (x)(st )) ∇ log πx (al,t |st ) s0 = s,πx ,πy∗ (x)

+E
t=0


where πy∗ (x) is the follower’s optimal policy given leader’s policy πx .

C.3 Proof of Lemma 8


Proof We first consider ∇x g(x, y). To prove ∇x g(x, y) exist, it suffices to show ∇qs,a (x)
exist for any (s, a). By Lemma 19, to show qs,a (x) = − maxπ∈Π QπMτ (x) (s, a) is differentiable,
it remains to show that argmaxπ∈Π QπMτ (x) (s, a) is a singleton. By Lemma 4, the optimal
policy of Mτ (x) is unique. Since the unique optimal policy πy∗ (x) ∈ argmaxπ∈Π QπMτ (x) (s, a),
π ∗ (x)
it suffices to show any policy π different from πy∗ (x) leads to QπMτ (x) (s, a) < QMy τ (x) (s, a).
Next, we prove this result.
By the uniqueness of the optimal policy, the policies different from πy∗ (x) are non-optimal,
π π ∗ (x)
that is, for any non-optimal π, there exists state s̄ such that VM τ (x)
(s̄) < VMyτ (x) (s̄). By the
Bellman equation, we have for any T ,
−1
h TX i
QπMτ (x) (s, a) =E γ t rx (st , at )|π, s0 = s, a0 = a
t=0
T π
+ γ EsT ∼Pxπ (·|s0 =s,a0 =a) [VM τ (x)
(sT )]
By the irreducible Markov chain assumption, there exists i such that Pxπ (si = s̄|s0 = s, a0 =
a) > 0. Choosing T = i in the above equality yields
QπMτ (x) (s, a)
i−1
hX i
=E γ rx (st , at )|π, s0 = s, a0 = a + γ i Esi ∼Pxπ (·|s0 =s,a0 =a) [VM
t π
τ (x)
(si )]
t=0
i−1
π ∗ (x)
hX i
<E γ t rx (st , at )|π, s0 = s, a0 = a + γ i Esi ∼Pxπ (·|s0 =s,a0 =a) [VMyτ (x) (si )]
t=0
πy∗ (x)
≤ QMτ (x) (s, a) (C.6)

π π ∗ (x)
where the first inequality follows from VM τ (x)
(s̄) < VMyτ (x) (s̄) and Pxπ (si = s̄|s0 = s, a0 =
a) > 0; and the last inequality follows from the optimality of πy∗ (x).
Given (C.6), we can conclude that qs,a (x) is differentiable with the gradient
∇qs,a (x) = −∇x QπMτ (x) (s, a)|π=πy∗ (x) . (C.7)
Then ∇x g(x, y) can be computed as
∇x g(x, y) = −Es∼ρ,a∼πy (s) ∇x QπMτ (x) (s, a)
 
π=πy∗ (x)
. (C.8)

Since g(x, ·) is smooth and strongly convex, we can use Danskins’ theorem to obtain
∇v(x) = ∇x g(x, y)|y=argminy∈Y g(x,y) = ∇x g(x, y)|y=πy∗ (x) (C.9)
where the last equality follows from Lemma 4. This completes the proof.


C.4 Proof of Lemma 9

Proof We first prove the first bullet. We have


hX i
∇x QπMτ (x) (s, a) =E γ t ∇x rx (st , at )|π, s0 = s, a0 = a . (C.10)
t=0

It can be checked that Assumption 2 holds and then ∇x g(x, y), ∇v(x) follow from Lemma 8
with (C.10).
We next prove the second bullet. By (C.5), we have


hX i
π π ,π
∇x VMyτ (x) (s) = E γ t Qf x y (st , al,t , af,t )∇ log πx (al,t |st ) s0 = s, πx , πy
t=0

hX i
− τE γ t hf,st (πy (st ))∇ log πx (al,t |st ) s0 = s, πx , πy (C.11)
t=0

Then

π π
∇x QMy τ (x) (s, af ) = ∇x rx (s, af ) + γEs′ ∼Px (s,af ) [VMyτ (x) (s′ )


π
= ∇x Eal ∼πx (s) [rl (s, al , af )] + γEal ∼πx (s),s′ ∼P(s,al ,af ) [VMyτ (x) (s′ )] (C.12)


where the last equality follows from the definition of Mτ (x) in Lemma 18. Using the log
trick, we can write

π
∇x QMy τ (x) (s, af )
h i
π ,π
= Eal ∼πx (s) r(s, al , af ) + γEs′ ∼P(s,al ,af ) [Vf x y (s′ )] ∇ log πx (al |s)


π
+ γEal ∼πx (s),s′ ∼P(s,al ,af ) [∇x VMyτ (x) (s′ )] (C.13)

Substituting (C.5) into the above equality yields

π
∇x QMy τ (x) (s, af )
h i
π ,π
= Eal ∼πx (s) r(s, al , af ) + γEs′ ∼P(s,al ,af ) [Vf x y (s′ )] ∇ log πx (al |s)



hX i
π ,π
+ γEal ∼πx (s),s′ ∼P(s,al ,af ) E γ t Qf x y (st , al,t , af,t )∇ log πx (al,t |st ) s0 = s′ , πx , πy
t=0

hX i
− τ γEal ∼πx (s),s′ ∼P(s,al ,af ) E γ t hf,st (πy (st ))∇ log πx (al,t |st ) s0 = s′ , πx , πy (C.14)
t=0


π ,πy
Using the definition of Qf x in the first term, and taking γ of the second and third term
inside the expectation gives
π
∇x QMy τ (x) (s, af )
 π ,π 
= Eal ∼πx (s) Qf x y (s, al , af )∇ log πx (al |s)

hX i
π ,π
+ Eal ∼πx (s),s′ ∼P(s,al ,af ) E γ t Qf x y (st , al,t , af,t )∇ log πx (al,t |st ) s1 = s′ , πx , πy
t=1

hX i
− τ Eal ∼πx (s),s′ ∼P(s,al ,af ) E γ t hf,st (πy (st ))∇ log πx (al,t |st ) s1 = s′ , πx , πy
t=1

Continuing from above, combining the first and second-term yields


π
∇x QMy τ (x) (s, af )
hX ∞ i
π ,π
= Eal ∼πx (s) E γ t Qf x y (st ,al,t ,af,t )∇log πx (al,t |st ) s0 = s,al,0 = al ,af,0 = af ,πx ,πy
t=0

hX i
− τ Eal ∼πx (s),s′ ∼P(s,al ,af ) E γ t hf,st (πy (st ))∇ log πx (al,t |st ) s1 = s′ , πx , πy
t=1

hX i
π ,π
=E γ t Qf x y (st , al,t , af,t )∇ log πx (al,t |st ) s0 = s, af,0 = af , πx , πy
t=0

hX i
− τ Eal ∼πx (s),s′ ∼P(s,al ,af ) E γ t hf,st (πy (st ))∇ log πx (al,t |st ) s1 = s′ , πx , πy
t=1

It can then be checked that Assumption 2 holds and the result follows from Lemma 8 and
(C.14).

C.5 Sufficient conditions of the smoothness assumption


Lemma 20 Suppose the following conditions hold.
P
(a) For any (s, a), the policy parameterization πy satisfies 1) a ∥∇πy (a|s)∥ ≤ Bπ ; and,
2) πy (a|s) is Ly -Lipschitz-smooth.
(b) If τ > 0 then: 1) for any s, assume |hs (πy (s))| ≤ Bh and ∥∇y hs (πy (s))∥ ≤ Bh′ on Y;
and, 2) hs (πy (s)) is Lh -Lipschitz-smooth on Y.
π
(c) For any (s, a, s′ ), we have for any x ∈ X that 1) |rx (s, a)| ≤ Br ; and, 2) VMyτ (x) (ρ) is
Lvx -Lipschitz-smooth on X uniformly for y ∈ Y.
π
Then it holds for any s that VMyτ (x) (s) is Lipschitz-smooth on X × Y:
π π ′
∥∇VMyτ (x) (s) − ∇VMyτ (x) (s)∥ ≤ max{Lvx , Lvy }∥(x, y) − (x′ , y ′ )∥

τ Bh′ Bπ +|A|Ly (Br +τ Bh ) τ (Bh′ +Lh )


 2 
where Lvy = O Bπ (Br +τ Bh )
(1−γ) 3 + (1−γ) 2 + 1−γ .


P
where a ∥∇πy (a|s)∥
Condition (a) holds for direct parameterization, P P ≤ |A| and Ly = 0; and
it also holds for softmax parameterization where a ∥∇πy (a|s)∥ = a πy (a|s)∥∇ log πy (a|s)∥ ≤
1 and Ly = 2. Condition (b) holds for a smooth composite of regularization function and
policy, e.g., softmax and entropy (Mei et al., 2020, Lemma 14), or direct policy with a
smooth regularization. function. Condition (c) 1) is guaranteed since X is compact and
rx is continuous, and 2) needs to be checked for specific applications. For example, in
π
RLHF/Reward shaping, it can be checked from the formula of ∇x VMyτ (x) (s) in Lemma 7
Lr
that there exists Lvx = 1−γ if rx is Lr -Lipschitz-smooth.
π
Proof We start the proof by showing VMyτ (x) (s) is Lipschitz-smooth in y on uniformly for
any x, that is
π π ′
∥∇y VMyτ (x) (s) − ∇y VMyτ (x) (s)∥ ≤ Lvy ∥y − y ′ ∥ (C.15)

where Lvy is a constant independent of x. By the regularized policy gradient in (B.9), we


have
π 1 hX
π
i
∇y VMyτ (x) (s) = Es̄∼dπs,x
y QMy τ (x) (s̄, a)∇πy (a|s̄)
1−γ a
τ
+ Es̄∼dπs,x [−∇y hs̄ (πy (s̄))] (C.16)
1−γ
πy t πy
(s̄) := (1 − γ) ∞
P
where ds,x t=0 γ Px (st = s̄|s0 = s) is the discounted visitation distribution,
π
and recall Px y (st = s̄|s0 = s) is the probability of reaching state s̄ at time step t under Px
and πy . Towards proving (C.15), we prove the following results:
π π π
(1) We have QMy τ (x) (s, a) is uniformly bounded, and VMyτ (x) (s) and QMy τ (x) (s, a) are Lipschitz
continuous in y uniformly for any x.
π
By the definition of QMy τ (x) (s, a), we have


π
X Br + τ Bh
|QMy τ (x) (s, a)| ≤ γ t |rx (st , at )| + τ |hst (πy (st ))| ≤ , (C.17)
1−γ
t=0

therefore it follows from (C.16) that

π Br + τ B h τ Bh′
∥∇y VMyτ (x) (s)∥ ≤ Bπ + (C.18)
(1 − γ)2 1−γ

Then by the definition of the Q function


π  π
QMy τ (x) (s, a) = rx (s, a) + γEs′ ∼Px (s,a) VMyτ (x) (s′ )


we have

π Br + τ B h τ Bh′ 
∥∇y QMy τ (x) (s, a)∥ ≤ γ Bπ + (C.19)
(1 − γ)2 1−γ
π
(2) We have ds0y,x (s) is Lipschitz-continuous in y uniformly for any x. Define M as a MDP
with τ = 0, r(s, a) = 1s which is an indicator function of s, and transition Px . Then we can


π
write ds0y,x (s) as
π π
X X
ds0y,x (s) = ds0y,x (s′ )πy (a′ |s′ )1s
s′ ∈S a′ ∈A
= Es∼dπs y,x ,a∼πy (s) [r(s, a)]
0
π
= (1 − γ)VMy (s0 )
π t πy
where the last equality follows from substituting in ds0y,x (s) = (1 − γ) ∞
P
t=0 γ Px (st = s|s0 ).
πy π
It then follows from (C.18) with τ = 0 (since VM (s0 ) has τ = 0) that ds0y,x (s) is also
uniformly Lipschitz continuous with constant Bπ :
π π ′
sup ∥ds0y,x (s) − ds0y,x (s)∥ ≤ Bπ ∥y − y ′ ∥. (C.20)
s∈S

To this end, we can decompose the difference as


π π ′
∇y VMyτ (x) (s) − ∇y VMyτ (x) (s)
1 hX
π
i 1 hX
π
i
= Es̄∼dπs,x
y QMy τ (x) (s̄, a)∇πy (a|s̄) − E πy ′ QMy τ (x) (s̄, a)∇πy (a|s̄)
1−γ a
1−γ s̄∼ds,x a
1 hX
π
i 1 hX π

i
+ E πy ′ QMy τ (x) (s̄, a)∇πy (a|s̄) − E πy ′ QMy τ (x) (s̄, a)∇πy (a|s̄)
1−γ s̄∼ds,x a 1−γ s̄∼ds,x a
1 hX π

i 1 hX π

i
+ E πy ′ QMy τ (x) (s̄, a)∇πy (a|s̄) − E πy ′ QMy τ (x) (s̄, a)∇πy′ (a|s̄)
1−γ s̄∼ds,x a 1−γ s̄∼ds,x a
τ τ
+ Es̄∼dπs,xy [−∇y hs̄ (πy (s̄))] − E πy′ [−∇y hs̄ (πy (s̄))]
1−γ 1−γ s̄∼ds,x
τ τ
+ E πy′ [−∇y hs̄ (πy (s̄))] − E πy′ [−∇y hs̄ (πy′ (s̄))]
1−γ s̄∼ds,x 1−γ s̄∼ds,x
Continuing from the above inequality, we have
π π ′
∥∇y VMyτ (x) (s) − ∇y VMyτ (x) (s)∥
1 πy πy ′ X πy
≤ 2 sup ∥ds,x (s) − ds,x (s)∥ sup QMτ (x) (s̄, a)∇πy (a|s̄)
1−γ s a
1 π π ′ X
+ sup QMy τ (x) (s̄, a) − QMy τ (x) (s̄, a) ∇πy (a|s̄)
1−γ a a
1 h π ′ X i
+ E πy′ sup QMy τ (x) (s̄, a) ∇πy (a|s̄) − ∇πy′ (a|s̄)
1 − γ s̄∼ds,x a a
τ πy πy ′
+ 2 sup ∥ds,x (s) − ds,x (s)∥ sup ∥∇y hs̄ (πy (s̄))∥
1−γ s
τ  
+ E πy′ ∥∇y hs̄ (πy (s̄)) − ∇y hs̄ (πy′ (s̄))∥ . (C.21)
1 − γ s̄∼ds,x
Then given the assumptions (a), (b) in this lemma, along with the (C.17)–(C.20), we can get
π π ′
∥∇y VMyτ (x) (s) − ∇y VMyτ (x) (s)∥ ≤ Lvy ∥y − y ′ ∥ (C.22)


τ Bh′ Bπ +|A|Ly (Br +τ Bh ) τ (Bh′ +Lh )


 
Bπ2 (Br +τ Bh )
where Lvy = O (1−γ)3
+ (1−γ)2
+ 1−γ . Thus we conclude

π π ′
∥∇VMyτ (x) (s) − ∇VMyτ (x) (s)∥2
π π ′ π ′ π ′
= ∥∇y VMyτ (x) (s) − ∇y VMyτ (x) (s)∥2 + ∥∇x VMyτ (x) (s) − ∇x VMyτ (x′ ) (s)∥2
≤ L2vy ∥y − y ′ ∥2 + L2vx ∥x − x′ ∥2 ≤ max{L2vy , L2vx }∥(x, y) − (x′ , y ′ )∥2 (C.23)

which proves the result.

C.6 Proof of Lemma 10


C.6.1 Smoothness of the value penalty
Proof Under the two assumptions, Lemma 17 holds and thus πy∗ (x) is unique and is τ −1 CJ -
Lipschitz continuous on X . Thus for any y, y ′ ∈ Y ∗ (x), we have πy = πy′ = πy∗ (x). With
Lemma 6, we have
π π
∥∇ max VMyτ (x) (ρ) − ∇ max VMyτ (x′ ) (ρ)∥
y∈Y y∈Y
π π
= ∥∇VM τ (x)
(ρ)|π=πy∗ (x) − ∇VM ′ (ρ)|π=π ∗ (x′ ) ∥
τ (x ) y

≤ Lv (∥x − x′ ∥ + ∥πy∗ (x) − πy∗ (x′ )∥)


≤ Lv (1 + τ −1 CJ )∥x − x′ ∥. (C.24)
π
It then follows from VMyτ (x) (ρ) is Lv -Lipschitz smooth that the value penalty is Lv (2+τ −1 CJ )-
Lipschitz smooth.

C.6.2 Smoothness of the Bellman penalty


Proof First note that Lemma 17 holds and thus πy∗ (x) is τ −1 CJ -Lipschitz continuous on X .
We have p(x, y) = g(x, y) − v(x) where

g(x, y) := Es∼ρ [⟨ys , qs (x)⟩ + τ hs (ys )]. (C.25)

By Lemma 8,

∇x g(x, y) = −Es∼ρ,a∼ys ∇x QπMτ (x) (s, a)


 
π=πy∗ (x)
. (C.26)

Since ∇x QπMτ (x) (s, a) is Lv -Lipschitz continuous by the assumption, and πy∗ (x) is τ −1 CJ -
Lipschitz continuous, we have ∇x g(x, y) is Lv (1 + τ −1 CJ )-Lipschitz continuous at x ∈ X
uniformly for any y. We also have ∇x g(x, y) is CJ -Lipschitz continuous at y ∈ Π uniformly for
any x ∈ X . Therefore, we conclude ∇x g(x, y) is (CJ + Lv (1 + τ −1 CJ ))-Lipschitz continuous
at (x, y) on X × Π.
Next we have
 
∇y g(x, y) = ρ(s)qs (x) + τ ρ(s)∇hs (ys ) . (C.27)
s∈S


Since qs is CJ -Lipschitz continuous, and hs is Lh -Lipschitz smooth, we have ∇y g(x, y) is


(CJ + Lh )-Lipschitz continuous at (x, y) on X × Π.
Collecting the Lipschitz continuity of ∇x g(x, y) and ∇y g(x, y) yields g(x, y) is Lipschitz
smooth with modulus Lg = 2CJ + Lv (1 + τ −1 CJ ) + Lh . Then we have

∥v(x) − v(x′ )∥ = ∥g(x, πy∗ (x)) − g(x′ , πy∗ (x′ ))∥ ≤ Lg (∥x − x′ ∥ + τ −1 CJ ∥x − x′ ∥). (C.28)

Then we have p(x, y) = g(x, y) − v(x) is Lipschitz smooth with modulus Lg (2 + τ −1 CJ ).


Together with the assumption that f is Lf -Lipschitz smooth gives Fλ is Lv -Lipschitz smooth
with Lv = Lf + λLg (2 + τ −1 CJ ).

C.7 Example gradient estimators of the penalty functions


In this section, we give examples of ∇̂p(x, y; π̂), an estimator of ∇p(x, y).
Value penalty. Consider choosing the value penalty p(x, y) = −V^{π_y}_{Mτ(x)}(ρ) + max_{y∈Y} V^{π_y}_{Mτ(x)}(ρ).
Then by Lemma 6, we have
π π
∇x p(x, y) = −∇x VMyτ (x) (ρ) + ∇x VM τ (x)
(ρ)|π=πy∗ (x)

where recall πy∗ (x) is the optimal policy of MDP Mτ (x) on the policy class Π = {πy : y ∈ Y}.
ˆ
A natural choice of ∇p(x, y; π̂) is then
 
ˆ π π
∇p(x, y; π̂) := − ∇x VMyτ (x) (ρ) + ∇x VM τ (x)
(ρ)|π=π̂ , ∇y p(x, y) (C.29)

By (Agarwal et al., 2020, Lemma D.3), there exists a constant Lv = 2γ|A|/(1 − γ)³ such that
V^π_{Mτ(x)}(ρ) is Lv-Lipschitz-smooth in π for any x. Then the estimation error can be quantified
by
ˆ
∥∇p(x, y; π̂) − ∇p(x, y)∥ ≤ Lv ∥πy∗ (x) − π̂∥. (C.30)

Therefore, the estimation error is upper bounded by the policy optimality gap ∥πy∗ (x) − π̂∥.
One may use efficient algorithms (e.g., policy mirror descent (Zhan et al., 2023)) to solve
for π̂, which has an iteration complexity of O(log(1/ϵ)) to achieve ∥πy∗ (x) − π̂∥ ≤ ϵ. Then
Assumption 3 is guaranteed with complexity O(log(λ2 /ϵorac )).
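For concreteness, the value-penalty estimator in (C.29) can be assembled as in the sketch below; `grad_x_value` (an estimator of the x-gradient of the value of a given policy in Mτ(x)) and `grad_y_penalty` are placeholders for whichever estimators are used, and `pi_hat` is the approximate optimal policy returned by the inner solver.

```python
def value_penalty_gradient(x, y, pi_hat, grad_x_value, grad_y_penalty):
    """Assemble the estimator (C.29) of the value-penalty gradient.

    grad_x_value(x, policy): estimate of the x-gradient of the value of `policy`
    in the MDP M_tau(x); grad_y_penalty(x, y): estimate of the y-gradient of p.
    """
    grad_x = -grad_x_value(x, y) + grad_x_value(x, pi_hat)
    grad_y = grad_y_penalty(x, y)
    return grad_x, grad_y
```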
Bellman penalty. Consider choosing the Bellman penalty p(x, y) = g(x, y) − v(x) where
recall g(x, y) = Es∼ρ [⟨ys , qs (x)⟩ + τ hs (ys )] and v(x) = miny∈Y g(x, y). Then by Lemma 8,
we have

∇x p(x, y) = −Es∼ρ,a∼πy (s) ∇x QπMτ (x) (s, a) π=π∗ (x)


 
y

+ Es∼ρ,a∼π(s) ∇x QπMτ (x) (s, a) π=π∗ (x)


 
(C.31)
y

ˆ
Therefore, a natural choice of ∇p(x, y; π̂) is then
ˆ
∇p(x, y; π̂)
 
:= − Es∼ρ,a∼πy (s) ∇x Qπ̂Mτ (x) (s, a) + Es∼ρ,a∼π̂(s) ∇x Qπ̂Mτ (x) (s, a) , ∇y p(x, y)
   


It then follows similarly to (C.30) that Assumption 3 is guaranteed with complexity


O(− log(ϵorac /λ2 )).
Example algorithms to get π̂. Finally, we also explicitly write down the update to obtain
π̂ to be self-contained. If we are using policy mirror descent, then at each outer-iteration k,
for i = 1, ...T where T is the inner iteration number, we run
n
πi 1 o
πki+1 (·|s) = argmin − ⟨p, QMk τ (x) (s, ·)⟩ + τ hs (p) + Dh (p, πki ; ξki ) , for any s ∈ S
p∈Π η

where η is a learning rate, Dh is the Bregman divergence, and ξki is given by


1 η πi
ξki+1 (s, a) = ξki (s, a) + QMk τ (x) (s, a). (C.32)
1 + ητ 1 + ητ

Finally, we set the last iterate π_k^{T+1}(·|s) as the approximate optimal policy π̂_k. For theoretical
reasons, we use this update in the analysis to obtain a fast rate, although in practice our scheme
is not limited to policy mirror descent. As a simple example, policy gradient-based algorithms
can also be used:

    ŷ_k^{i+1} = Proj_Y [ ŷ_k^i + η ∇_ŷ V^{π_{ŷ_k^i}}_{Mτ(x)}(ρ) ],   for i = 1, 2, . . . , T.        (C.33)

We use the last iterate as the approximate optimal policy parameter: π̂_k = π_{ŷ_k^{T+1}}. In
the above update, the policy gradient ∇_ŷ V^{π_{ŷ_k^i}}_{Mτ(x)}(ρ) can be estimated by a wide range of
algorithms, including the basic Reinforce (Baxter and Bartlett, 2001) and the advantage
actor-critic (Mnih et al., 2016).
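As a minimal sketch of the policy-gradient inner loop (C.33), assuming a direct (tabular) parameterization where Y is a product of simplices and an external routine that returns an estimate of the policy gradient, one could write the following; `estimate_policy_gradient` and the simplex projection are placeholders introduced for this illustration.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of each row of v onto the probability simplex."""
    u = np.sort(v, axis=-1)[..., ::-1]                  # sort each row in descending order
    css = np.cumsum(u, axis=-1) - 1.0
    ind = np.arange(1, v.shape[-1] + 1)
    cond = u - css / ind > 0
    rho = cond.sum(axis=-1, keepdims=True)              # number of active coordinates
    theta = np.take_along_axis(css, rho - 1, axis=-1) / rho
    return np.maximum(v - theta, 0.0)

def inner_policy_loop(y_hat, x, eta, T, estimate_policy_gradient):
    """Run T projected policy-gradient ascent steps on the lower-level MDP M_tau(x)."""
    for _ in range(T):
        grad = estimate_policy_gradient(x, y_hat)       # estimate of the policy gradient
        y_hat = project_to_simplex(y_hat + eta * grad)  # ascent step on the value
    return y_hat  # the last iterate serves as the approximate optimal policy
```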

C.8 Proof of Theorem 11


Proof In this proof, we write z = (x, y). Consider choosing either the value penalty or the
Bellman penalty, then Lemma 10 holds under the assumptions of this theorem. Therefore,
Fλ is Lλ -Lipschitz-smooth with Lλ = Lf + λLp . Then by Lipschitz-smoothness of Fλ , it
holds that

Fλ (zk+1 ) ≤ Fλ (zk ) + ⟨∇Fλ (zk ), zk+1 − zk ⟩ + ∥zk+1 − zk ∥2
2
α≤ L1

λ
ˆ λ (zk ; π̂k ), zk+1 − zk ⟩ + 1 ∥zk+1 − zk ∥2
Fλ (zk ) + ⟨∇F

ˆ
+ ⟨∇Fλ (zk ) − ∇Fλ (zk ; π̂k ), zk+1 − zk ⟩. (C.34)

Consider the second term in the RHS of (C.34). It is known that zk+1 can be written as

ˆ λ (zk ; π̂k ), z⟩ + 1 ∥z − zk ∥2 .
zk+1 = arg min⟨∇F
z∈Z 2α
By the first-order optimality condition of the above problem, it holds that

ˆ λ (zk ; π̂k ) + 1
⟨∇F (zk+1 − zk ), zk+1 − z⟩ ≤ 0, ∀z ∈ Z.
α


Since zk ∈ Z, we can choose z = zk in the above inequality and obtain

ˆ λ (zk ; π̂k ), zk+1 − zk ⟩ ≤ − 1 ∥zk+1 − zk ∥2 .


⟨∇F (C.35)
α
Consider the last term in the RHS of (C.34). By Young’s inequality, we first have

ˆ λ (zk ; π̂k ), zk+1 − zk ⟩


⟨∇Fλ (zk ) − ∇F
ˆ λ (zk ; π̂k )∥2 + 1 ∥zk+1 − zk ∥2
≤ α∥∇Fλ (zk ) − ∇F

ˆ 1
≤ αλ2 ∥∇p(zk ) − ∇p(z 2
k ; π̂k )∥ + ∥zk+1 − zk ∥2 (C.36)

Substituting (C.36) and (C.35) into (C.34) and rearranging the resulting inequality yield

1 ˆ
∥zk+1 − zk ∥2 ≤ Fλ (zk ) − Fλ (zk+1 ) + αλ2 ∥∇p(zk ) − ∇p(z 2
k ; π̂k )∥ . (C.37)

With z̄k+1 defined in (4.5), we have

∥z̄k+1 − zk ∥2 ≤ 2∥z̄k+1 − zk+1 ∥2 + 2∥zk+1 − zk ∥2


ˆ λ (zk ; π̂k )∥2 +2∥zk+1 −zk ∥2
≤ 2α2 ∥∇Fλ (zk ) − ∇F
ˆ
≤ 2α2 λ2 ∥∇p(zk ) − ∇p(z 2
k ; π̂k )∥ + 2∥zk+1 −zk ∥
2
(C.38)

where the second inequality uses non-expansiveness of ProjZ .


Together (C.37) and (C.38) imply

ˆ
∥z̄k+1 − zk ∥2 ≤ 10α2 λ2 ∥∇p(zk ) − ∇p(z 2
k ; π̂k )∥ + 8α(Fλ (zk ) − Fλ (zk+1 )).

Since p(x, y) ≥ 0, Fλ (z) ≥ inf z∈Z f (z) for any z ∈ Z. Taking a telescope sum of the above
inequality and using Gλ (zk ) = α1 (zk − z̄k+1 ) yield

K  K
X
2 8 Fλ (z1 ) − inf z∈Z f (z) X
ˆ
∥Gλ (zk )∥ ≤ + 10λ2 ∥∇p(zk ) − ∇p(z k ; π̂k )∥
2
α
k=1 k=1
 K
8 Fλ (z1 ) − inf z∈Z f (z) X 1 K
≤ + ∥Gλ (zk )∥2 + ϵorac (C.39)
α 2 2
k=1

where the last inequality follows from Assumption 3. Rearranging gives


K 
X
2 16 Fλ (z1 ) − inf z∈Z f (z)
∥Gλ (zk )∥ ≤ + Kϵorac . (C.40)
α
k=1

This proves the first inequality in this theorem. The result for OS follows similarly with
Fλ (y) being Lv -Lipschitz-smooth and ϵorac = 0 since no oracle is needed.


Appendix D. Proof in Section 5


D.1 Proof of Lemma 13
(a). Treating (π2 , x) (or (π1 , x)) as the parameter, it follows from Lemma 4 that π1∗ (or π2∗ )
is unique. Under Assumption 6, it then follows from Lemma 19 that (a) holds.
(b). We have

max

⟨∇π ψ(x, π), π − π ′ ⟩
π ∈Π
π ∗ ,π π ,π ∗
= max

2
⟨∇π2 VM1τ (x) (ρ), π2 − π2′ ⟩ + ⟨∇π1 − VM1τ (x)
2
(ρ), π1 − π1′ ⟩
π ∈Π
π ∗ ,π π ,π ∗
= max ⟨∇π2 − VM1τ (x)2
(ρ), π2′ − π2 ⟩ + max ⟨∇π1 VM1τ (x)
2
(ρ), π1′ − π1 ⟩
π2′ ∈Π2 π1′ ∈Π1
π ∗ ,π2 π ∗ ,π2 π ,π2∗ π ,π2∗
h  i
(ρ) + max VM1τ (x) (ρ) − VM1τ (x)

≥ µ max − VM1τ (x) (ρ) + VM1τ (x) (ρ)
π2 ∈Π2 π1 ∈Π1
h ∗
π1 ,π2 π1 ,π2∗ π ,π2∗ π ∗ ,π2 i
= µ VMτ (x) (ρ) − VMτ (x) (ρ) − − max VM1τ (x) (ρ) + min VM1τ (x) (ρ)
π1 ∈Π1 π2 ∈Π2

= µ ψ(x, π) − min ψ(x, π) (D.1)
π∈Π

where the inequality follows from Lemma 2.

D.2 Proof of Lemma 14


Proof Given xλ , point πλ satisfies the first-order stationary condition:

⟨∇π f (xλ , πλ ) + λ∇π ψ(xλ , πλ ), πλ − π ′ ⟩ ≤ 0, ∀π ′

which leads to
1
⟨∇π ψ(xλ , πλ ), πλ − π ′ ⟩ ≤ − ⟨∇π f (xλ , πλ ), πλ − π ′ ⟩
λ
L∥πλ − π ′ ∥ L
≤ ≤ , ∀π ′ . (D.2)
λ λ
Combining the above inequality with Lemma 13 5.10 yields
L
ψ(xλ , πλ ) ≤ .
λ
Define ϵλ := ψ(xλ , πλ ) then ϵλ ≤ δ by choice of λ.
Since (xλ , πλ ) is a local solution of BZ λp , it holds for any feasible (x, π) in the region where
(xλ , πλ ) attains its minimum that

f (xλ , πλ ) + λψ(xλ , πλ ) ≤ f (x, π) + λψ(x, π). (D.3)

From the above inequality, it holds for any (x, π) feasible for BMϵ and in the region that

f (xλ , πλ ) ≤ f (x, π) + λ ψ(x, π) − ϵλ ≤ f (x, π) (D.4)

which proves the result.


D.3 Justification of the smoothness assumption in the two-player case


Lemma 21 Consider the following conditions.
(a) For any s, assume |hs (π1 (s))| ≤ Bh and ∥∇hs (π1 (s))∥ ≤ Bh′ for any π1 ∈ Π1 , and;
hs (π1 (s)) is Lh -Lipschitz-smooth on Π1 . Also, assume this holds for player 2’s policy
π2 ∈ Π2 .
π1 ,π2
(b) For any (s, a1 , a2 ), we have for any x ∈ X that 1) |rx (s, a1 , a2 )| ≤ Br , and; 2) VM τ (x)
(ρ)

is Lvx -Lipschitz-smooth on X uniformly for (π1 , π2 ).
τ Bh′ |A| τ (Bh′ +Lh )
 
Then there exists a universal constant Lv = O |A|(B r +τ Bh )
(1−γ) 3 + (1−γ) 2 + 1−γ + L′
vx such
π1 ,π2
that VM τ (x)
(s) is Lv -Lipschitz-smooth on X × Π1 × Π2 .
Proof Recall the definition of the value function

hX i
π1 ,π2 π t

VM τ (x)
(s) = V Mτ (x) (s) = E γ rx (s t , at ) − τ hs t (π 1 (s t )) + τ hst (π 2 (s t )) s0 = s, π
t=0

where the expectation is taken over the trajectory generated by at ∼ (π1 (st ), π2 (st )) and
st+1 ∼ P(st , at ). When viewing π1 as the main policy and player 1 as the main player, we
can view (π2 , x) as the parameter of player 1’s MDP, where the reward function is given
π1 ,π2
by Ea2 ∼π2 (s) [rx (s, a1 , a2 )], and the transition is Ea2 ∼π2 (s) [P(·|s, a1 , a2 )]. To prove VM τ (x)
(s)
is Lipschitz-smooth in π1 , it is then natural to use the previous results on single-agent
parameterized MDP in Lemma 20. Specifically, we hope to use (C.22).
Under the assumptions of this lemma, for (C.22) to hold, we additionally need to check
Lemma 20 (a).
X X
∥∇π1 (a1 |s)∥ = ∥1a1 ,s ∥ = |A|, ∇2 π(a|s) = 0 thus Ly = 0. (D.5)
a a

Then there exists constant Lv,1 that


π1 ,π2 π ′ ,π
∥∇π1 VM τ (x)
(s) − ∇π1 VM1τ (x) 2
(s)∥ ≤ Lv,1 ∥π1 − π1′ ∥ (D.6)
τ Bh′ |A| τ (Bh′ +Lh )
 
where Lv,1 = O |A|(B r +τ Bh )
(1−γ) 3 + (1−γ) 2 + 1−γ , which is a uniform constant for any π2 , x.
Similarly, there exists a uniform constant Lv,2 that
π1 ,π2 π ,π ′
∥∇π2 VM τ (x)
(s) − ∇π2 VM1τ (x)
2
(s)∥ ≤ Lv,2 ∥π2 − π2′ ∥. (D.7)

The above two inequalities along with assumption (b) 2) gives


π1 ,π2 π ′ ,π ′ π1 ,π2 2 π ′ ,π
∥∇VM τ (x)
(s) − ∇VM1τ (x2 ′ ) (s)∥ ≤ ∥∇π1 VM τ (x)
(s) − ∇π1 VM1τ (x) (s)∥
π1 ,π2 π ,π ′
+ ∥∇π2 VM τ (x)
(s) − ∇π2 VM1τ (x)
2
(s)∥
π1 ,π2 π ,π ′
+ ∥∇x VM τ (x)
(s) − ∇x VM1τ (x)
2
(s)∥
≤ L′vx ∥x − x′ ∥ + Lv,1 ∥π1 − π1′ ∥ + Lv,2 ∥π2 − π2′ ∥. (D.8)
Then the result holds with Lv = max{L′vx , Lv,1 , Lv,2 }.


D.4 Proof of Lemma 15


π1 ,π2
Proof Denote π1∗ (x, π2 ) := arg maxπ1 ∈Π1 VM τ (x)
(ρ). Treating (x, π2 ) as the parameter of
player 1’s MDP, then we have Lemma 17 holds under Assumption 8. It then follows that
there exits a constant Lπ such that π1∗ (x, π2 ) is Lπ -Lipschitz-continuous. Similar results also
hold for π2∗ (x, π1 ). With Lemma 13 (a), we have

∥∇ψ(x, π) − ∇ψ(x′ , π ′ )∥2


π ,π ∗ (x,π1 ) π ′ ,π ∗ (x′ ,π1′ )
= ∥∇(x,π1 ) VM1τ (x)
2
(ρ) − ∇(x,π1 ) VM1τ (x2 ′ ) (ρ)∥2
π ∗ (x,π ),π2 π ∗ (x′ ,π ′ ),π2′
+ ∥∇(x,π2 ) VM1τ (x) 2 (ρ) − ∇(x,π2 ) VM1τ (x′ ) 2 (ρ)∥2
≤ L2v (2∥x − x′ ∥2 + ∥π1∗ (x, π2 ) − π1∗ (x′ , π2′ )∥2 + ∥π2∗ (x, π1 ) − π2∗ (x′ , π1′ )∥2 )
≤ 2L2v (1 + L2π )∥x − x′ ∥ + L2v L2π (∥π1 − π1′ ∥2 + ∥π2 − π2′ ∥2 ) (D.9)
p
which proves ∇ψ(x, π) is Lv 2(1 + L2π )-Lipschitz continuous.

D.5 Gradient Estimator Accuracy


We omit the iteration index k here. The following arguments hold for any iteration k.
Recall that
 
ˆ π̂1 ,π2 π1 ,π̂2 π1 ,π̂2 π̂1 ,π2
∇ψ(x, π; π̂) = ∇x VM τ (x)
(ρ) − ∇ V
x Mτ (x) (ρ), − ∇ V
π1 Mτ (x) (ρ), ∇ V
π2 Mτ (x) (ρ)

and the formula of ∇ψ is (see Lemma 13 (a)):

π ∗ ,π2 π ,π2∗ π ,π2∗ π ∗ ,π2


 
∇ψ(x, π) = ∇x VM1τ (x) (ρ) − ∇x VM1τ (x) (ρ), − ∇π1 VM1τ (x) (ρ), ∇π2 VM1τ (x) (ρ)

π1 ,π2
where π1∗ := argmaxπ1 ∈Π1 VM τ (x)
(ρ) and π2∗ defined similarly. It then follows that

ˆ
∥∇ψ(x, π; π̂) − ∇ψ(x, π)∥
π ∗ ,π
2 π̂1 ,π2 π ,π ∗
π1 ,π̂2
≤ 2∥∇VM1τ (x) (ρ) − ∇VM τ (x)
(ρ)∥ + 2∥∇VM1τ (x)
2
(ρ) − ∇VM τ (x)
(ρ)∥
≤ 2Lv (∥π̂1 − π1∗ ∥ + ∥π̂2 − π2∗ ∥) (D.10)

where the last inequality follows from Assumption 8 (a). Fixing x, π2 , it takes the policy
mirror descent algorithm (Zhan et al., 2023) an iteration complexity of O(− log ϵ) to solve
for a π̂1 such that ∥π̂1 − π1∗ ∥ ≤ ϵ (and similarly for π̂2 ). Therefore, the iteration complexity
to guarantee Assumption 7 is O(− log(ϵorac /λ2 )).

D.6 Proof of Theorem 16


Proof In this proof, we write z = (x, π). Given π̂, we also define

ˆ λ (z; π̂) := ∇f (z) + λ∇ψ(z;


∇F ˆ π̂). (D.11)


Thus the update we are analyzing can be written as


ˆ λ (z k ; π̂ k )
z k+1 = ProjZ z k − α∇F
 
(D.12)

Now we start proving the result. Under the assumptions, we have Fλ is Lλ -Lipschitz-smooth
with Lλ = Lf + λLψ , thus it holds that
Lλ k+1
Fλ (z k+1 ) ≤ Fλ (z k ) + ⟨∇Fλ (z k ), z k+1 − z k ⟩ + ∥z − z k ∥2
2
α≤ L1

λ
ˆ λ (z k ; π̂ k ), z k+1 − z k ⟩ + 1 ∥z k+1 − z k ∥2
Fλ (z k ) + ⟨∇F

k ˆ k
+ ⟨∇Fλ (z ) − ∇Fλ (z ; π̂ ), z k k+1 k
− z ⟩. (D.13)

Consider the second term in the RHS of (D.13). It is known that z k+1 can be written as

ˆ λ (z k ; π̂ k ), z⟩ + 1
z k+1 = arg min⟨∇F ∥z − z k ∥2 .
z∈Z 2α
By the first-order optimality condition of the above problem, it holds that

ˆ λ (z k ; π̂ k ) + 1 k+1
⟨∇F (z − z k ), z k+1 − z⟩ ≤ 0, ∀z ∈ Z.
α
Since z k ∈ Z, we can choose z = z k in the above inequality and obtain

ˆ λ (z k ; π̂ k ), z k+1 − z k ⟩ ≤ − 1 ∥z k+1 − z k ∥2 .
⟨∇F (D.14)
α
Consider the last term in the RHS of (D.13). By Young’s inequality, we first have
ˆ λ (z k ; π̂ k ), z k+1 − z k ⟩
⟨∇Fλ (z k ) − ∇F
ˆ λ (z k ; π̂ k )∥2 + 1 ∥z k+1 − z k ∥2
≤ α∥∇Fλ (z k ) − ∇F

2 k ˆ k k 2 1 k+1
≤ αλ ∥∇ψ(z ) − ∇ψ(z ; π̂ )∥ + ∥z − z k ∥2 (D.15)

Substituting (D.15) and (D.14) into (D.13) and rearranging the resulting inequality yield
1 k+1 ˆ
∥z − z k ∥2 ≤ Fλ (z k ) − Fλ (z k+1 ) + αλ2 ∥∇ψ(z k ) − ∇ψ(z k
; π̂ k )∥2 . (D.16)

With z̄ k+1 defined in (5.13), we have

∥z̄ k+1 − z k ∥2 ≤ 2∥z̄ k+1 − z k+1 ∥2 + 2∥z k+1 − z k ∥2


≤ 2α2 ∥∇Fλ (z k ) − ∇Fˆ λ (z k ; π̂ k )∥2 +2∥z k+1 −z k ∥2
ˆ
≤ 2α2 λ2 ∥∇ψ(z k ) − ∇ψ(z k
; π̂ k )∥2 + 2∥z k+1 −z k ∥2 (D.17)

where the second inequality uses non-expansiveness of ProjZ .


Together (D.16) and (D.17) imply
ˆ
∥z̄ k+1 − z k ∥2 ≤ 10α2 λ2 ∥∇ψ(z k ) − ∇ψ(z k
; π̂ k )∥2 + 8α(Fλ (z k ) − Fλ (z k+1 )).


Since p(x, y) ≥ 0, Fλ (z) ≥ inf z∈Z f (z) for any z ∈ Z. Taking a telescope sum of the above
inequality and using Gλ (z k ) = α1 (z k − z̄ k+1 ) yield

K  K
X
k 2 8 Fλ (z1 ) − inf z∈Z f (z) X
ˆ
∥Gλ (z )∥ ≤ + 10λ2 ∥∇ψ(z k ) − ∇ψ(z k
; π̂ k )∥2
α
k=1 k=1
 K
8 Fλ (z1 ) − inf z∈Z f (z) X 1 K
≤ + ∥Gλ (z k )∥2 + ϵorac (D.18)
α 2 2
k=1

where the last inequality follows from Assumption 7. Rearranging gives


K 
X
k 2 16 Fλ (z1 ) − inf z∈Z f (z)
∥Gλ (z )∥ ≤ + Kϵorac . (D.19)
α
k=1

This proves the theorem.

Appendix E. Additional Experiment Details


E.1 Stackelberg Markov game
For the independent policy gradient method (Daskalakis et al., 2020; Ding et al., 2022), we
set the learning rate to 0.1, and both the follower and the leader use Monte Carlo sampling
with trajectory length 5 and batch size 16 to estimate the policy gradient. For the PBRL
algorithms, we estimate a near-optimal policy π̂ at each outer iteration by running the
policy gradient algorithm for T steps. For PBRL with the value penalty, we
set learning rate 0.1, penalty constant λ = 2, inner iteration number T = 1, and we use
Monte Carlo sampling with trajectory length 5 and batch size 16 to estimate the policy
gradient. For PBRL with the Bellman penalty, we use λ = 7 and inner iteration number
T = 10 instead.
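A minimal sketch of the PBRL outer loop with the value penalty, using the hyperparameters above, is given below; `policy_gradient_f`, `policy_gradient_p`, `inner_policy_update`, and `project` are placeholders for the Monte Carlo gradient estimators, the inner policy-gradient step that refreshes π̂, and the projection onto the feasible set, and are not part of a specific implementation.

```python
def pbrl_value_penalty(x, y, policy_gradient_f, policy_gradient_p,
                       inner_policy_update, project,
                       lam=2.0, alpha=0.1, T=1, K=1000):
    """Projected penalty-based bilevel RL update with the value penalty.

    Each outer iteration: (i) refresh the approximate lower-level optimal policy
    pi_hat with T inner policy-gradient steps; (ii) form the penalized gradient
    of F_lambda = f + lam * p; (iii) take a projected gradient step on (x, y).
    """
    pi_hat = y
    for _ in range(K):
        for _ in range(T):
            pi_hat = inner_policy_update(x, pi_hat)
        gfx, gfy = policy_gradient_f(x, y)            # gradient of the upper-level objective f
        gpx, gpy = policy_gradient_p(x, y, pi_hat)    # gradient of the value penalty p
        x = project(x - alpha * (gfx + lam * gpx))
        y = project(y - alpha * (gfy + lam * gpy))
    return x, y
```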

E.2 Deep reinforcement learning from human feedback


We conduct our experiments in the Arcade Learning Environment (ALE) (Bellemare et al.,
2013) by OpenAI gymnasium similar to (Mnih et al., 2016) and (Christiano et al., 2017).
For the Atari games, we use A2C, which is a synchronous version of (Mnih et al., 2016), as
the policy gradient estimator in both DRLHF and PBRL. The policy and the critic share a
common base model: The input is fed through 4 convolutional layers of size 8 × 8, 5 × 5,
4 × 4, 4 × 4, strides 4, 2, 1, 1 and number of filters 16, 32, 32, 32, with ReLU activation. This
is followed by a fully connected layer of output size 256 and a ReLU non-linearity. The
output of the base model is fed to a fully connected layer with scalar output as a critic,
and another fully connected layer of action space size as policy. The reward predictor has
the same input (84 × 84 × 4 stacked image) as the actor-critic. The input is fed through
4 convolutional layers of size 7 × 7, 5 × 5, 3 × 3, 3 × 3, strides 3, 2, 1, 1 with 16 filters each
and ReLU activation. It is followed by a fully connected layer of size 64, ReLU activation,
and another fully connected layer of action space size that gives the reward function. We

use random dropout (probability 0.5) between the fully connected layers to prevent over-fitting
(only in the reward predictor). The reward predictor and the policy are trained synchronously.
The reward predictor is updated for one epoch every 300 A2C updates.
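A sketch of this reward-predictor architecture in PyTorch is given below; it reflects the layer sizes described above, while initialization and other framework details are illustrative rather than an exact reproduction of our implementation.

```python
import torch.nn as nn

class RewardPredictor(nn.Module):
    """Reward predictor for 84x84x4 stacked Atari frames (see the description above)."""

    def __init__(self, num_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=7, stride=3), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, stride=1), nn.ReLU(),
        )
        # For an 84x84 input, the conv stack above produces a 7x7 feature map.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 7 * 7, 64), nn.ReLU(),
            nn.Dropout(p=0.5),            # dropout between the fully connected layers
            nn.Linear(64, num_actions),   # one predicted reward per action
        )

    def forward(self, obs):
        # obs: tensor of shape (batch, 4, 84, 84) with pixel values scaled to [0, 1]
        return self.head(self.conv(obs))
```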
We compare trajectories of 25 time steps. At the start of training, we collect 576 pairs of
trajectories and warm up the reward predictor for 500 epochs. After training starts, we
collect 16 new pairs per reward learning epoch. We only keep the last 3000 pairs in a buffer.
For policy learning, we set the actor-critic learning rate to 0.0003, the entropy coefficient to 0.01,
the actor-critic batch size to 16, and the initial upper-level loss coefficient to 0.001, which decays
every 3000 actor-critic gradient steps. We find that learning is very sensitive to this coefficient,
so we generally select it so that the upper-level loss converges stably. For reward
learning, we set the reward predictor learning rate to 0.0003, the reward predictor batch size to 64, and
the reward predictor is trained for one epoch every 500 actor-critic gradient steps. For
Beamrider, we change the actor-critic learning rate to 7 × 10−5 .

E.3 Incentive design


For the PBRL algorithms, we set the learning rate as 0.1 and a penalty constant λ = 4.
The policy gradients are given by Monte Carlo sampling with trajectory length 5 and batch
size 24. To obtain π̂1k , π̂2k at each outer iteration k, we run the policy gradient algorithm for
a single iteration with a learning rate 0.1 at every outer iteration. For the meta-gradient
method, we use the same learning rate, trajectory length and batch size as PBRL. The inner
iteration number is 1.

References
A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. Optimality and approximation with
policy gradient methods in markov decision processes. In Proc. of Conference on Learning
Theory, 2020.

J. Baxter and P. L Bartlett. Infinite-horizon policy-gradient estimation. journal of artificial


intelligence research, 15:319–350, 2001.

M. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment:


An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:
253–279, 2013.

J. Bolte, E. Pauwels, and S. Vaiter. Automatic differentiation of nonsmooth iterative


algorithms. In Proc. of Advances in Neural Information Processing Systems, 2022.

Z. Borsos, M. Mutny, and A. Krause. Coresets via bilevel optimization for continual learning
and streaming. In Proc. of Advances in Neural Information Processing Systems, 2020.

S. Chakraborty, A. Bedi, A. Koppel, D. Manocha, H. Wang, M. Wang, and F. Huang.


PARL: A unified framework for policy alignment in reinforcement learning. In Proc. of
International Conference on Learning Representations, 2024.


L. Chen, S. T. Jose, I. Nikoloska, S. Park, T. Chen, and O. Simeone. Learning with limited
samples: Meta-learning and applications to communication systems. Foundations and
Trends® in Signal Processing, 17(2):79–208, 2023a.

T. Chen, Y. Sun, and W. Yin. Tighter analysis of alternating stochastic gradient method
for stochastic nested problems. In Proc. of Advances in Neural Information Processing
Systems, 2021.

X. Chen, T. Xiao, and K. Balasubramanian. Optimal algorithms for stochastic bilevel


optimization under relaxed smoothness conditions. arXiv preprint arXiv:2306.12067,
2023b.

Z. Chen, Y. Liu, B. Zhou, and M. Tao. Caching incentive design in wireless d2d networks:
A stackelberg game approach. In IEEE International Conference on Communications,
pages 1–6, 2016.

P. F Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement


learning from human preferences. Proc. of Advances in Neural Information Processing
Systems, 2017.

F. Clarke. Optimization and non-smooth analysis. Wiley-Interscience, 1983.

F. H. Clarke. Generalized gradients and applications. Transactions of the American


Mathematical Society, 205:247–262, 1975.

C. Daskalakis, D. J Foster, and N. Golowich. Independent policy gradient methods for


competitive reinforcement learning. In Proc. of Advances in Neural Information Processing
Systems, 2020.

D. Ding, C. Wei, K. Zhang, and M. Jovanovic. Independent policy gradient for large-
scale markov potential games: Sharper rates, function approximation, and game-agnostic
convergence. In Proc. of International Conference on Machine Learning, 2022.

A. L Dontchev and R T. Rockafellar. Implicit functions and solution mappings: A view


from variational analysis, volume 616. Springer, 2009.

C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep
networks. In Proc. of International Conference on Machine Learning, 2017.

L. Franceschi, M. Donini, P. Frasconi, and M. Pontil. Forward and reverse gradient-based


hyperparameter optimization. In Proc. of International Conference on Machine Learning,
2017.

L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for


hyperparameter optimization and meta-learning. In Proc. of International Conference on
Machine Learning, 2018.

S. Ghadimi and M. Wang. Approximation methods for bilevel programming. arXiv preprint
arXiv:1802.02246, 2018.

46
Principled Penalty-based Methods for Bilevel RL

S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for


nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–
305, 2016.

T. Giovannelli, G. Kent, and L. Vicente. Inexact bilevel stochastic gradient methods for
constrained and unconstrained lower-level problems. arXiv preprint arXiv:2110.00604,
2022.

R. Grazzi, L. Franceschi, M. Pontil, and S. Salzo. On the iteration complexity of hypergradient


computation. In Proc. of International Conference on Machine Learning, pages 3748–3758,
2020.

M. Hong, H.-T. Wai, Z. Wang, and Z. Yang. A two-timescale framework for bilevel
optimization: Complexity analysis and application to actor-critic. SIAM Journal on
Optimization, 33(1), 2023.

Y. Hu, W. Wang, H. Jia, Y. Wang, Y. Chen, J. Hao, F. Wu, and C. Fan. Learning to utilize
shaping rewards: A new approach of reward shaping. In Proc. of Advances in Neural
Information Processing Systems, 2020.

K. Ji, J. Yang, and Y. Liang. Provably faster algorithms for bilevel optimization and
applications to meta-learning. In Proc. of International Conference on Machine Learning,
2021a.

K. Ji, J. Yang, and Y. Liang. Bilevel optimization: Convergence analysis and enhanced
design. In Proc. of International Conference on Machine Learning, 2021b.

K. Ji, M. Liu, Y. Liang, and L. Ying. Will bilevel optimizers benefit from loops. In Proc. of
Advances in Neural Information Processing Systems, 2022.

H. Jiang, Z. Chen, Y. Shi, B. Dai, and T. Zhao. Learning to defend by learning to attack.
In Proc. of International Conference on Artificial Intelligence and Statistics, 2021.

P. Khanduri, S. Zeng, M. Hong, H.-T. Wai, Z. Wang, and Z. Yang. A near-optimal algorithm
for stochastic bilevel optimization via double-momentum. In Proc. of Advances in Neural
Information Processing Systems, 2021.

J. Kwon, D. Kwon, S. Wright, and R. Nowak. On penalty methods for nonconvex bilevel
optimization and first-order stochastic approximation. arXiv preprint arXiv:2309.01753,
2023.

G. Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling
complexity, and generalized problem classes. Mathematical programming, 198(1), 2023.

J. Li, B. Gu, and H. Huang. A fully single loop algorithm for bilevel optimization without
hessian inverse. In Proc. of AAAI Conference on Artificial Intelligence, 2022.

M. Littman. Friend-or-foe q-learning in general-sum games. In Proc. of International


Conference on Machine Learning, 2001.


Q. Liu, T. Yu, Y. Bai, and C. Jin. A sharp analysis of model-based reinforcement learning
with self-play. In Proc. of International Conference on Machine Learning, 2021a.

R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang. A generic first-order algorithmic framework
for bi-level programming beyond lower-level singleton. In Proc. of International Conference
on Machine Learning, 2020.

R. Liu, Y. Liu, S. Zeng, and J. Zhang. Towards gradient-based bilevel optimization with
non-convex followers and beyond. In Proc. of Advances in Neural Information Processing
Systems, 2021b.

R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang. A general descent aggregation framework
for gradient-based bi-level optimization. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 45(1):38–57, 2022.

Z. Lu and S. Mei. First-order penalty methods for bilevel optimization. arXiv preprint
arXiv:2301.01716, 2023.

Z. Luo, J. Pang, and D. Ralph. Mathematical programs with equilibrium constraints.


Cambridge University Press, 1996.

S. Ma, Z. Chen, S. Zou, and Y. Zhou. Decentralized robust v-learning for solving markov
games with model uncertainty. Journal of Machine Learning Research, 24(371):1–40, 2023.

D. Maclaurin, D. Duvenaud, and R. Adams. Gradient-based hyperparameter optimization


through reversible learning. In Proc. of International Conference on Machine Learning,
2015.

J. Mei, C. Xiao, C. Szepesvari, and D. Schuurmans. On the global convergence rates


of softmax policy gradient methods. In Proc. of International Conference on Machine
Learning, 2020.

A Y. Mitrophanov. Sensitivity and convergence of uniformly ergodic markov chains. Journal


of Applied Probability, 42(4):1003–1014, 2005.

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and


K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proc. of
International Conference on Machine Learning, 2016.

A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv


preprint arXiv:1803.02999, 2018.

H. Nikaidô and K. Isoda. Note on non-cooperative convex games. 1955.

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal,


K. Slama, A. Ray, et al. Training language models to follow instructions with human
feedback. In Proc. of Advances in Neural Information Processing Systems, 2022.

A. Pacchiano, A. Saha, and J. Lee. Dueling rl: reinforcement learning with trajectory
preferences. arXiv preprint arXiv:2111.04850, 2021.


F. Pedregosa. Hyperparameter optimization with approximate gradient. In Proc. of


International Conference on Machine Learning, 2016.

S. Qiu, Z. Yang, J. Ye, and Z. Wang. On finite-time convergence of actor-critic algorithm.


IEEE Journal on Selected Areas in Information Theory, 2(2):652–664, 2021.

A. Rajeswaran, C. Finn, S. Kakade, and S. Levine. Meta-learning with implicit gradients.


In Proc. of Advances in Neural Information Processing Systems, 2019.

L. J Ratliff, R. Dong, S. Sekar, and T. Fiez. A perspective on incentive design: Challenges


and opportunities. Annual Review of Control, Robotics, and Autonomous Systems, 2:
305–338, 2019.

S. Sabach and S. Shtern. A first order method for solving convex bilevel optimization
problems. SIAM Journal on Optimization, 27(2):640–660, 2017.

A. Shaban, C. Cheng, N. Hatch, and By. Boots. Truncated back-propagation for bilevel
optimization. In Proc. of International Conference on Artificial Intelligence and Statistics,
2019.

L. Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):


1095–1100, 1953.

H. Shen and T. Chen. A single-timescale analysis for stochastic approximation with multiple
coupled sequences. In Proc. of Advances in Neural Information Processing Systems, 2022.

H. Shen and T. Chen. On penalty-based bilevel gradient descent method. arXiv preprint
arXiv:2302.05185, 2023.

H. Shen, Z. Yang, and T. Chen. Principled penalty-based methods for bilevel reinforcement
learning and rlhf. In Proc. of International Conference on Machine Learning, 2024.

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert,


L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge.
nature, 550(7676):354–359, 2017.

Z. Song, J. Lee, and Z. Yang. Can we find nash equilibria at a linear rate in markov games?
arXiv preprint arXiv:2303.03095, 2023.

H. Stackelberg. The Theory of Market Economy. Oxford University Press, 1952.

R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.

R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for


reinforcement learning with function approximation. In Proc. of Advances in Neural
Information Processing Systems, 2000.

A. Von Heusinger and C. Kanzow. Optimization reformulations of the generalized nash


equilibrium problem using nikaido-isoda-type functions. Computational Optimization and
Applications, 43:353–377, 2009.
