2.997 Spring 2004
February 4
Handout #1
Lecture Note 1
In this class we will study discrete-time stochastic systems. We can describe the evolution (dynamics) of
these systems by the following equation, which we call the system equation:

$x_{t+1} = f(x_t, a_t, w_t)$,     (1)

where $x_t \in S$, $a_t \in A_{x_t}$ and $w_t \in W$ denote the system state, decision and random disturbance at time
$t$, respectively. In words, the state of the system at time $t+1$ is a function $f$ of the state, the decision
and a random disturbance at time $t$. An important assumption of this class of models is that, conditioned
on the current state $x_t$, the distribution of future states $x_{t+1}, x_{t+2}, \dots$ is independent of the past states
$x_{t-1}, x_{t-2}, \dots$. This is the Markov property, which gives rise to the name Markov decision processes.
An alternative representation of the system dynamics is given through transition probability matrices: for
each state-action pair $(x, a)$, we let $P_a(x, y)$ denote the probability that the next state is $y$, given that the
current state is $x$ and the current action is $a$.
We are concerned with the problem of how to make decisions over time. In other words, we would like to
pick an action $a_t \in A_{x_t}$ at each time $t$. In real-world problems, this is typically done with some objective in
mind, such as minimizing costs, maximizing profits or rewards, or reaching a goal. Let $u(x, t)$ take values in
$A_x$, for each $x$. Then we can think of $u$ as a decision rule that prescribes an action from the set of available
actions $A_x$ based on the current time stage $t$ and current state $x$. We call $u$ a policy.
In this course, we will assess the quality of each policy based on costs that are accumulated additively
over time. More specifically, we assume that at each time stage $t$ a cost $g_{a_t}(x_t)$ is incurred. In the next
section, we describe some of the optimality criteria that will be used in this class when choosing a policy.
Based on the previous discussion, we characterize a Markov decision process by a tuple $(S, A_x, P_a(\cdot,\cdot), g_a(\cdot))$,
consisting of a state space, a set of actions associated with each state, transition probabilities and costs associated with each state-action pair. For simplicity, we will assume throughout the course that $S$ and $A_x$
are finite. Most results extend to the case of countably or uncountably infinite state and action spaces under
certain technical assumptions.
Optimality Criteria
In the previous section we described Markov decision processes, and introduced the notion that decisions
are made based on certain costs that must be minimized. We have established that, at each time stage $t$, a
cost $g_{a_t}(x_t)$ is incurred. In any given problem, we must define how costs at different time stages should be
combined. Some optimality criteria that will be used in the course are the following:

1. Finite-horizon total cost:

$E\left[ \sum_{t=0}^{T-1} g_{a_t}(x_t) \,\middle|\, x_0 = x \right]$     (2)
2. Average cost:

$\limsup_{T \to \infty} \frac{1}{T} E\left[ \sum_{t=0}^{T-1} g_{a_t}(x_t) \,\middle|\, x_0 = x \right]$     (3)

3. Discounted cost:

$E\left[ \sum_{t=0}^{\infty} \alpha^t g_{a_t}(x_t) \,\middle|\, x_0 = x \right]$,     (4)
where $\alpha \in (0, 1)$ is a discount factor expressing temporal preferences. The presence of a discount
factor is most intuitive in problems involving cash flows, where the value of the same nominal amount
of money at a later time stage is not the same as its value at an earlier time stage, since money at
the earlier stage can be invested at a risk-free interest rate and is therefore equivalent to a larger
nominal amount at a later stage. However, discounted costs also offer good approximations to the
other optimality criteria. In particular, it can be shown that, when the state and action spaces are
finite, there is a large enough $\bar{\alpha} < 1$ such that, for all $\alpha \in [\bar{\alpha}, 1)$, optimal policies for the discounted-cost
problem are also optimal for the average-cost problem. Moreover, the discounted-cost criterion tends
to lead to simplified analysis and algorithms.
Most of the focus of this class will be on discounted-cost problems.
Examples
Markov decision processes have a broad range of applications. We introduce some interesting applications
in the following.
Queueing Networks
Consider the queueing network in Figure 1. The network consists of three servers and two different external
arrival streams with fixed routes 1, 2, 3 and 4, 5, 6, 7, 8, forming a total of 8 queues of jobs at distinct processing stages. We
assume the service times are geometrically distributed: when a server $i$ devotes
a time step to serving a unit from queue $j$, there is a probability $\mu_{ij}$ that it will finish processing the unit in
that time step, independent of the past work done on the unit. Upon completion of that processing step, the
unit is moved to the next queue in its route, or out of the system if all processing steps have been completed.
New units arrive at the system in queues $j = 1, 4$ with probability $\lambda_j$ in any time step, independent of
previous arrivals.
[Figure 1: a multiclass queueing network with three machines (Machine 1, Machine 2, Machine 3) serving eight queues; external arrivals enter queues 1 and 4, and each machine serves its assigned queues with the corresponding service probabilities.]
A common choice for the state of this system is an 8-dimensional vector containing the queue lengths.
Since each server serves multiple queues, in each time step it is necessary to decide which queue each of
the different servers is going to serve. A decision of this type may be coded as an 8-dimensional vector $a$
indicating which queues are being served, satisfying the constraint that no more than one queue associated
with each server is being served, i.e., $a_i \in \{0, 1\}$, and $a_1 + a_3 + a_8 \le 1$, $a_2 + a_6 \le 1$, $a_4 + a_5 + a_7 \le 1$. We can
impose additional constraints on the choices of $a$ as desired, for instance considering only non-idling policies.
Policies are described by a mapping $u$ returning an allocation of server effort $a$ as a function of the system
state $x$. We represent the evolution of the queue lengths in terms of transition probabilities, i.e., the conditional
probabilities for the next state $x(t+1)$ given that the current state is $x(t)$ and the current action is $a(t)$.
For instance,

$P(x_1(t+1) = x_1(t) + 1 \mid x(t), a(t)) = \lambda_1$,

$P(x_3(t+1) = x_3(t) + 1, \; x_2(t+1) = x_2(t) - 1 \mid x(t), a(t)) = \mu_{22} \, I(x_2(t) > 0, a_2(t) = 1)$,

corresponding, respectively, to an arrival at queue 1, and to a departure from queue 2 together with an arrival
at queue 3. $I(\cdot)$ is the indicator function. Transition probabilities related to other events (e.g., a departure from queue 3, leaving the system) are defined
similarly.
We may consider costs of the form $g(x) = \sum_i x_i$, the total number of unfinished units in the system. For
instance, this is a reasonably common choice of cost for manufacturing systems, which are often modelled
as queueing networks.
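To make these dynamics concrete, the sketch below simulates one time step of such a network under a given allocation. The arrival probabilities, service probabilities and routing used here are illustrative placeholders (not the values in Figure 1).

```python
import random

# Hypothetical parameters: arrival probabilities for queues 1 and 4 (0-indexed 0 and 3),
# service-completion probabilities mu[j] for each queue, and the next queue in each route
# (None means the unit leaves the system). These values are placeholders for illustration.
lam = {0: 0.2, 3: 0.2}
mu = [0.6, 0.5, 0.7, 0.6, 0.5, 0.6, 0.5, 0.6]
next_queue = [1, 2, None, 4, 5, 6, 7, None]   # routes 1->2->3 and 4->5->6->7->8

def step(x, a):
    """One time step: x lists the 8 queue lengths, a[j] = 1 if queue j is served."""
    x = list(x)
    # service completions
    for j in range(8):
        if a[j] == 1 and x[j] > 0 and random.random() < mu[j]:
            x[j] -= 1
            if next_queue[j] is not None:
                x[next_queue[j]] += 1
    # external arrivals
    for j, p in lam.items():
        if random.random() < p:
            x[j] += 1
    return x

# Example: serve queues 1, 2 and 4 (one queue per machine) starting from an empty system.
print(step([0] * 8, [1, 1, 0, 1, 0, 0, 0, 0]))
```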
Tetris
Tetris is a computer game whose essential rule is to fit a sequence of geometrically different pieces, which fall
stochastically from the top of the screen, together so as to complete contiguous rows of blocks. Pieces arrive
sequentially and the shapes of successive pieces are independent and identically distributed. A falling piece can be
rotated and moved horizontally into a desired position. Note that the rotation and movement of a falling piece
must be scheduled and executed before it reaches the pile of pieces remaining at the bottom of the screen.
Once a piece reaches the pile, it must rest there and cannot be rotated or moved any further.
To put the Tetris game into the framework of Markov decision processes, one could define the state to
correspond to the current board configuration and the current falling piece. The decision at each time stage is where
to place the current falling piece. Transitions to the next board configuration follow deterministically from
the current state and action; transitions to the next falling piece are given by its distribution, which could
be, for instance, uniform over all piece types. Finally, we associate a reward with each state-action pair,
corresponding to the points awarded for the number of rows eliminated.
Portfolio Allocation
Portfolio allocation deals with the question of how to invest a certain amount of wealth among a collection
of assets. One could define the state as the wealth at each time period. More specifically, let $x_0$ denote the
initial wealth and $x_t$ the accumulated wealth at time period $t$. Assume there are $n$ risky assets, with
random rates of return $e_1, \dots, e_n$. Investors distribute fractions $a = (a_1, \dots, a_n)$ of their wealth
among the $n$ assets, and consume the remaining fraction $1 - \sum_{i=1}^{n} a_i$. The evolution of the wealth $x_t$ is given
by

$x_{t+1} = \sum_{i=1}^{n} a_{it} e_i x_t$.

Therefore, transition probabilities can be derived from the distribution of the rates of return of the risky
assets. We associate with each state-action pair $(x, a)$ a reward $g_a(x) = x \left( 1 - \sum_{i=1}^{n} a_i \right)$, corresponding to the
amount of wealth consumed.
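As a quick illustration of these dynamics, a minimal simulation sketch is given below; the return distribution and the allocation are arbitrary placeholders, not part of the model above.

```python
import random

def simulate_wealth(x0, alloc, T):
    """Simulate x_{t+1} = sum_i a_i * e_i * x_t for T periods.

    alloc: fractions invested in each asset (the rest is consumed each period).
    Returns the wealth trajectory and the total consumption reward sum_t x_t*(1 - sum_i a_i).
    """
    consumed_fraction = 1.0 - sum(alloc)
    x, wealth, reward = x0, [x0], 0.0
    for _ in range(T):
        reward += x * consumed_fraction           # g_a(x) = x (1 - sum_i a_i)
        # placeholder return model: each asset returns between 0.9 and 1.2 per period
        returns = [random.uniform(0.9, 1.2) for _ in alloc]
        x = sum(a * e * x for a, e in zip(alloc, returns))
        wealth.append(x)
    return wealth, reward

trajectory, total_reward = simulate_wealth(x0=100.0, alloc=[0.3, 0.3, 0.3], T=5)
print(trajectory, total_reward)
```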
Finding a policy that minimizes the finite-horizon cost corresponds to solving the following optimization
problem:

$\min_{u(\cdot,\cdot)} E\left[ \sum_{t=0}^{T-1} g_{u(x_t,t)}(x_t) \,\middle|\, x_0 = x \right]$     (5)
A naive approach to solving (5) is to enumerate all possible policies $u(x, t)$, evaluate the corresponding
expected cost, and choose the policy that minimizes it. However, note that the number of policies grows
exponentially in the number of states and time stages. A central idea in dynamic programming is that the
computation required to find an optimal policy can be greatly reduced by noting that (5) can be rewritten
as follows:
$\min_{a \in A_x} \left[ g_a(x) + \sum_{y \in S} P_a(x, y) \min_{u(\cdot,\cdot)} E\left[ \sum_{t=1}^{T-1} g_{u(x_t,t)}(x_t) \,\middle|\, x_1 = y \right] \right]$.     (6)
Define the optimal cost-to-go from stage $t_0$:

$J^*(x, t_0) = \min_{u(\cdot,\cdot)} E\left[ \sum_{t=t_0}^{T-1} g_{u(x_t,t)}(x_t) \,\middle|\, x_{t_0} = x \right]$.

It is clear from (6) that, if we know $J^*(\cdot, t_0 + 1)$, we can easily find $J^*(x, t_0)$ by solving

$J^*(x, t_0) = \min_{a \in A_x} \left[ g_a(x) + \sum_{y \in S} P_a(x, y) J^*(y, t_0 + 1) \right]$.     (7)

Moreover, (6) suggests that an optimal action at state $x$ and time $t_0$ is simply one that minimizes the
right-hand side in (7). It is easy to verify that this is the case by using backwards induction.
We call $J^*(x, t)$ the cost-to-go function. It can be found recursively by noting that

$J^*(x, T-1) = \min_{a \in A_x} g_a(x)$

and applying (7) backwards from $t = T-2$ down to $t = 0$.
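A compact sketch of this backward induction on a generic finite MDP is given below; the data layout (g[a][x], P[a][(x, y)]) and the function name are hypothetical choices for illustration.

```python
def backward_induction(S, A, g, P, T):
    """Finite-horizon DP: returns J[t][x] and an optimal policy u[t][x].

    S: list of states; A[x]: list of actions for state x;
    g[a][x]: one-stage cost; P[a][(x, y)]: transition probability (0 if absent).
    """
    J = [{x: 0.0 for x in S} for _ in range(T + 1)]       # J[T] = 0 (terminal cost)
    u = [{x: None for x in S} for _ in range(T)]
    for t in range(T - 1, -1, -1):                        # apply (7) backwards
        for x in S:
            best_cost, best_a = float("inf"), None
            for a in A[x]:
                cost = g[a][x] + sum(P[a].get((x, y), 0.0) * J[t + 1][y] for y in S)
                if cost < best_cost:
                    best_cost, best_a = cost, a
            J[t][x], u[t][x] = best_cost, best_a
    return J, u
```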
Based on the discussion for the finite-horizon problem, we may conjecture that an optimal decision for the
infinite-horizon, discounted-cost problem may be found as follows. Define the cost-to-go function

$J^*(x, t_0) = \min_{u(\cdot,\cdot)} E\left[ \sum_{t=t_0}^{\infty} \alpha^{t-t_0} g_{u(x_t,t)}(x_t) \,\middle|\, x_{t_0} = x \right]$,     (8)

and take, at state $x$ and time $t_0$, an action attaining

$\min_{a \in A_x} \left[ g_a(x) + \alpha \sum_{y \in S} P_a(x, y) J^*(y, t_0 + 1) \right]$.     (9)

We may also conjecture that, as in the finite-horizon case, $J^*(x, t)$ satisfies a recursive relation of the form

$J^*(x, t) = \min_{a \in A_x} \left[ g_a(x) + \alpha \sum_{y \in S} P_a(x, y) J^*(y, t + 1) \right]$.

The first thing to note in the infinite-horizon case is that, based on expression (8), we have $J^*(x, t) =
J^*(x, t') = J^*(x)$ for all $t$ and $t'$. Indeed, note that, for every $u$,

$E\left[ \sum_{t=t_0}^{\infty} \alpha^{t-t_0} g_{u(x_t,t)}(x_t) \,\middle|\, x_{t_0} = x \right] = E\left[ \sum_{t=0}^{\infty} \alpha^{t} g_{u(x_t,t)}(x_t) \,\middle|\, x_{0} = x \right]$.

Intuitively, since transition probabilities $P_u(x, y)$ do not depend on time, infinite-horizon problems look the
same regardless of the value of the initial time stage $t$, as long as the initial state is the same.
Note also that, since $J^*(x, t) = J^*(x)$, we can also infer from (9) that the optimal policy $u^*(x, t)$ does
not depend on the current stage $t$, so that $u^*(x, t) = u^*(x)$ for some function $u^*(\cdot)$. We call policies that do
not depend on the time stage stationary. Finally, $J^*$ must satisfy the following equation (Bellman's equation):

$J^*(x) = \min_{a \in A_x} \left[ g_a(x) + \alpha \sum_{y \in S} P_a(x, y) J^*(y) \right]$.
We now introduce some shorthand notation. For every stationary policy $u$, we let $g_u$ denote the vector with
entries $g_{u(x)}(x)$, and $P_u$ denote the matrix with entries $P_{u(x)}(x, y)$. We define the dynamic programming
operators $T_u$ and $T$ as follows. For every function $J : S \to \mathbb{R}$, we have

$T_u J = g_u + \alpha P_u J$,

and

$T J = \min_u T_u J$.
Lemma 1 (Monotonicity) If $J \le \bar{J}$, then $T_u J \le T_u \bar{J}$ for every $u$, and $T J \le T \bar{J}$.

Proof
First,

$T_u J = g_u + \alpha P_u J \le g_u + \alpha P_u \bar{J} = T_u \bar{J}$.

Now let $\bar{u}$ be such that $T \bar{J} = T_{\bar{u}} \bar{J}$. Then

$T J \le T_{\bar{u}} J \le T_{\bar{u}} \bar{J} = T \bar{J}$.

Lemma 2 (Offset) For every scalar $k$, we have

$T (J + k e) = T J + \alpha k e$,

which follows from $\sum_{y \in S} P_u(x, y) = 1$ for every $u$ and $x$.

Lemma 3 (Maximum-Norm Contraction) For all $J$ and $\bar{J}$, we have $\| T J - T \bar{J} \|_\infty \le \alpha \| J - \bar{J} \|_\infty$.

Proof
First, we have

$J \le \bar{J} + \| J - \bar{J} \|_\infty e$.

We now have

$T J \le T (\bar{J} + \| J - \bar{J} \|_\infty e) = T \bar{J} + \alpha \| J - \bar{J} \|_\infty e$,

so that $T J - T \bar{J} \le \alpha \| J - \bar{J} \|_\infty e$.

The first inequality follows from monotonicity and the second from the offset property of $T$. Since $J$ and $\bar{J}$
are arbitrary, we conclude by the same reasoning that $T \bar{J} - T J \le \alpha \| J - \bar{J} \|_\infty e$. The lemma follows.
February 9
Handout #2
Lecture Note 2
$A_x$ denotes a finite set of actions for state $x \in S$;
$g_a(x)$ denotes the finite time-stage cost for action $a \in A_x$ and state $x \in S$;
$P_a(x, y)$ denotes the transition probability when the action taken is $a \in A_x$, the current state is $x$, and
the next state is $y$.
Let $u(x, t)$ denote the policy for state $x$ at time $t$ and, similarly, let $u(x)$ denote the stationary policy for
state $x$. Taking the stationary policy $u(x)$ into consideration, we introduce the following notation

$g_u(x) \equiv g_{u(x)}(x)$,
$P_u(x, y) \equiv P_{u(x)}(x, y)$,

to represent the cost function and transition probabilities under policy $u(x)$.
In the previous lecture, we defined the discounted-cost, infinite-horizon cost-to-go function as

$J^*(x) = \min_u E\left[ \sum_{t=0}^{\infty} \alpha^t g_u(x_t) \,\middle|\, x_0 = x \right]$,

which satisfies Bellman's equation

$J^*(x) = \min_{a \in A_x} \left[ g_a(x) + \alpha \sum_{y \in S} P_a(x, y) J^*(y) \right]$.
Value Iteration
We now show that value iteration converges: for any $J_0$, $T^K J_0 \to J^*$ as $K \to \infty$.

Proof
Since $J_0(\cdot)$ and $g_\cdot(\cdot)$ are finite, there exists a real number $M$ satisfying
$|J_0(x)| \le M$ and $|g_a(x)| \le M$ for all $a \in A_x$ and $x \in S$. Then we have, for every integer $K \ge 1$ and real
number $\alpha \in (0, 1)$,

$J_K(x) = (T^K J_0)(x) = \min_u E\left[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) + \alpha^K J_0(x_K) \,\middle|\, x_0 = x \right]
\le \min_u E\left[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) \,\middle|\, x_0 = x \right] + \alpha^K M$.

From

$J^*(x) = \min_u E\left[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) + \sum_{t=K}^{\infty} \alpha^t g_u(x_t) \,\middle|\, x_0 = x \right]$,

we have

$(T^K J_0)(x) - J^*(x)
\le \min_u E\left[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) + \alpha^K J_0(x_K) \,\middle|\, x_0 = x \right]
- \min_u E\left[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) + \sum_{t=K}^{\infty} \alpha^t g_u(x_t) \,\middle|\, x_0 = x \right]$
$\le E\left[ \sum_{t=0}^{K-1} \alpha^t g_{u^*}(x_t) + \alpha^K J_0(x_K) \,\middle|\, x_0 = x \right]
- E\left[ \sum_{t=0}^{K-1} \alpha^t g_{u^*}(x_t) + \sum_{t=K}^{\infty} \alpha^t g_{u^*}(x_t) \,\middle|\, x_0 = x \right]$
$\le \max_u E\left[ \alpha^K |J_0(x_K)| + \sum_{t=K}^{\infty} \alpha^t |g_u(x_t)| \,\middle|\, x_0 = x \right]$
$\le \alpha^K M \left( 1 + \frac{1}{1 - \alpha} \right)$,

where $u^*$ is the policy minimizing the second term in the first line. We can bound $J^*(x) - (T^K J_0)(x) \le
\alpha^K M (1 + 1/(1-\alpha))$ by using the same reasoning. It follows that $T^K J_0$ converges to $J^*$ as $K$ goes to infinity.
Alternative Proof
We prove the statement by showing that $T^k J$ is a Cauchy sequence in $\mathbb{R}^n$.¹ Observe that

$\| T^{k+1} J_0 - T^k J_0 \|_\infty \le \alpha \| T^k J_0 - T^{k-1} J_0 \|_\infty \le \alpha^k \| T J_0 - J_0 \|_\infty \to 0$ as $k \to \infty$,

and therefore

$\| T^{k+m} J - T^k J \|_\infty
= \left\| \sum_{n=0}^{m-1} \left( T^{k+n+1} J - T^{k+n} J \right) \right\|_\infty
\le \sum_{n=0}^{m-1} \| T^{k+n+1} J - T^{k+n} J \|_\infty
\le \sum_{n=0}^{m-1} \alpha^{k+n} \| T J - J \|_\infty \to 0$

as $k, m \to \infty$.

From the above, we know that $\| T J - J^* \|_\infty \le \alpha \| J - J^* \|_\infty$. Therefore, the value iteration algorithm
converges to $J^*$. Furthermore, we notice that $J^*$ is the fixed point of the operator $T$, i.e., $J^* = T J^*$.
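As a concrete reference, a minimal tabular implementation of the value iteration analyzed above might look as follows; the data layout (dictionaries keyed by states and actions) and the stopping tolerance are illustrative choices.

```python
def value_iteration(S, A, g, P, alpha, tol=1e-8):
    """Iterate J <- TJ until the sup-norm change is below tol.

    S: states; A[x]: actions; g[a][x]: cost; P[a][(x, y)]: transition probability.
    Returns an approximation of J* and a greedy policy with respect to it.
    """
    J = {x: 0.0 for x in S}
    while True:
        TJ = {x: min(g[a][x] + alpha * sum(P[a].get((x, y), 0.0) * J[y] for y in S)
                     for a in A[x])
              for x in S}
        if max(abs(TJ[x] - J[x]) for x in S) < tol:
            break
        J = TJ
    policy = {x: min(A[x], key=lambda a: g[a][x] + alpha *
                     sum(P[a].get((x, y), 0.0) * J[y] for y in S))
              for x in S}
    return J, policy
```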
We next introduce another value iteration algorithm.
3.1

We next consider a variant of value iteration (Gauss-Seidel value iteration) in which states are updated one at a time, in increasing order, and each update immediately uses the values already refreshed in the current sweep:

$J_{K+1}(x) = \min_{a \in A_x} \left[ g_a(x) + \alpha \sum_{y < x} P_a(x, y) J_{K+1}(y) + \alpha \sum_{y \ge x} P_a(x, y) J_K(y) \right]$,     (1)

where $J_{K+1}(y)$ is used for states $y < x$ (updated already) and $J_K(y)$ for states $y \ge x$ (not updated yet). We
denote by $F$ the operator mapping $J_K$ to $J_{K+1}$.
Does the operator $F$ satisfy the maximum-norm contraction property? We answer this question with the following lemma.
¹ A sequence $x_n$ in a metric space $X$ is said to be a Cauchy sequence if for every $\epsilon > 0$ there exists an integer $N$ such that
$\| x_n - x_m \| \le \epsilon$ if $m, n \ge N$. Furthermore, in $\mathbb{R}^n$, every Cauchy sequence converges.
Lemma 1 $\| F J - F \bar{J} \|_\infty \le \alpha \| J - \bar{J} \|_\infty$.

Proof
$|(F J)(1) - (F \bar{J})(1)| = |(T J)(1) - (T \bar{J})(1)| \le \alpha \| J - \bar{J} \|_\infty$,

$|(F J)(2) - (F \bar{J})(2)| \le \alpha \max\left\{ |(F J)(1) - (F \bar{J})(1)|, |J(2) - \bar{J}(2)|, \dots, |J(|S|) - \bar{J}(|S|)| \right\} \le \alpha \| J - \bar{J} \|_\infty$.

Repeating the same reasoning for $x = 3, \dots$, we can show by induction that $|(F J)(x) - (F \bar{J})(x)| \le \alpha \| J - \bar{J} \|_\infty$
for all $x \in S$. Hence, we conclude $\| F J - F \bar{J} \|_\infty \le \alpha \| J - \bar{J} \|_\infty$.
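For comparison with the earlier sketch, an in-place sweep implementing update (1) might look as follows (same hypothetical data layout).

```python
def gauss_seidel_sweep(S, A, g, P, alpha, J):
    """One sweep of update (1): states are processed in order and each update
    immediately reuses the values already refreshed in this sweep (J is modified in place)."""
    for x in S:                       # increasing state order
        J[x] = min(g[a][x] + alpha * sum(P[a].get((x, y), 0.0) * J[y] for y in S)
                   for a in A[x])     # J[y] already holds the new value for y < x
    return J
```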
February 11
Handout #4
Lecture Note 3
Value Iteration

Lemma 1 Let $u_k$ be the greedy policy with respect to $J_k$, i.e., $T_{u_k} J_k = T J_k$. Then

$\| J_{u_k} - J_k \|_\infty \le \frac{1}{1 - \alpha} \| T J_k - J_k \|_\infty$.
Proof:

$J_{u_k} - J_k = (I - \alpha P_{u_k})^{-1} g_{u_k} - J_k$
$= (I - \alpha P_{u_k})^{-1} (g_{u_k} + \alpha P_{u_k} J_k - J_k)$
$= (I - \alpha P_{u_k})^{-1} (T J_k - J_k)$
$= \sum_{t=0}^{\infty} \alpha^t P_{u_k}^t (T J_k - J_k)$
$\le \sum_{t=0}^{\infty} \alpha^t P_{u_k}^t e \| T J_k - J_k \|_\infty$
$= \sum_{t=0}^{\infty} \alpha^t e \| T J_k - J_k \|_\infty$
$= \frac{e}{1 - \alpha} \| T J_k - J_k \|_\infty$,

where $I$ is the identity matrix and $e$ is a vector of ones of appropriate dimension. The third
equality comes from $T J_k = g_{u_k} + \alpha P_{u_k} J_k$, i.e., $u_k$ is the greedy policy w.r.t. $J_k$, and the fourth equality holds
because $(I - \alpha P_{u_k})^{-1} = \sum_{t=0}^{\infty} \alpha^t P_{u_k}^t$. By switching $J_{u_k}$ and $J_k$, we can obtain $J_k - J_{u_k} \le \frac{e}{1-\alpha} \| T J_k - J_k \|_\infty$,
and hence conclude

$|J_{u_k} - J_k| \le \frac{e}{1 - \alpha} \| T J_k - J_k \|_\infty$,

or, equivalently,

$\| J_{u_k} - J_k \|_\infty \le \frac{1}{1 - \alpha} \| T J_k - J_k \|_\infty$.     $\square$
Theorem 1

$\| J_{u_k} - J^* \|_\infty \le \frac{2}{1 - \alpha} \| J_k - J^* \|_\infty$.

Proof:

$\| J_{u_k} - J^* \|_\infty = \| J_{u_k} - J_k + J_k - J^* \|_\infty$
$\le \| J_{u_k} - J_k \|_\infty + \| J_k - J^* \|_\infty$
$\le \frac{1}{1 - \alpha} \| T J_k - J^* + J^* - J_k \|_\infty + \| J_k - J^* \|_\infty$
$\le \frac{1}{1 - \alpha} \left( \alpha \| J_k - J^* \|_\infty + \| J_k - J^* \|_\infty \right) + \| J_k - J^* \|_\infty$
$= \frac{2}{1 - \alpha} \| J_k - J^* \|_\infty$.

The second inequality comes from Lemma 1 and the third inequality holds by the contraction principle. $\square$
Before proving the main theorem of this section, we introduce the following useful lemma.
Lemma 2 If J T J, then J J . If J T J, then J J .
Proof: Suppose that J T J. Applying operator T on both sides repeatedly k 1 times and by the
monotonicity property of T , we have
J T J T 2 J T k J.
For suciently large k, T k J approaches to J . We hence conclude J J . The other statement follows the
same argument.
2
Theorem 2 Let $u^*$ be a greedy policy with respect to $J^*$, that
is, $J^* = T J^* = T_{u^*} J^*$. Then the stationary policy $u^*$ is optimal among all (possibly non-stationary) policies.

Proof: For any policy $\bar{u} = (u_1, u_2, \dots)$, let $J_{\bar{u},k} = T_{u_1} T_{u_2} \cdots T_{u_k} J^*$ denote the expected cost of following $\bar{u}$
for $k$ stages with terminal cost $J^*$. Then

$\| J_{\bar{u}} - J_{\bar{u},k} \|_\infty \le M \left( 1 + \frac{1}{1 - \alpha} \right) \alpha^k \to 0$ as $k \to \infty$.

If $\bar{u} = (u^*, u^*, \dots)$, then

$J_{\bar{u},k} = T_{u^*}^k J^* = T_{u^*}^{k-1} (T J^*) = T_{u^*}^{k-1} J^* = \dots = J^*$,

and therefore $J_{u^*} = J^*$. For any other policy, for all $k$,

$J_{\bar{u}} \ge J_{\bar{u},k} - M \left( 1 + \frac{1}{1-\alpha} \right) \alpha^k$
$= T_{u_1} \cdots T_{u_k} J^* - M \left( 1 + \frac{1}{1-\alpha} \right) \alpha^k$
$\ge T_{u_1} \cdots T_{u_{k-1}} T J^* - M \left( 1 + \frac{1}{1-\alpha} \right) \alpha^k$
$= T_{u_1} \cdots T_{u_{k-1}} J^* - M \left( 1 + \frac{1}{1-\alpha} \right) \alpha^k$
$\ge \dots \ge J^* - M \left( 1 + \frac{1}{1-\alpha} \right) \alpha^k$.

Letting $k \to \infty$ gives $J_{\bar{u}} \ge J^*$, so $u^*$ is optimal.
Policy Iteration

Policy iteration generates a sequence of policies $u_0, u_1, u_2, \dots$ as follows:

1. Start with an arbitrary policy $u_0$ and set $k = 0$.
2. Policy evaluation: compute $J_{u_k}$ by solving $J_{u_k} = g_{u_k} + \alpha P_{u_k} J_{u_k}$.
3. Policy improvement: choose $u_{k+1}$ greedy with respect to $J_{u_k}$, i.e., $T_{u_{k+1}} J_{u_k} = T J_{u_k}$; set $k = k + 1$ and return to step 2.

Theorem 3 Each policy generated by policy iteration is an improvement over the previous one, i.e., $J_{u_{k+1}} \le J_{u_k}$,
with strict inequality in at least one state unless $u_k$ is optimal.

Proof: If $u_k$ is optimal, then we are done. Now suppose that $u_k$ is not optimal. Then

$T J_{u_k} \le T_{u_k} J_{u_k} = J_{u_k}$,

with strict inequality for at least one state $x$. Since $T_{u_{k+1}} J_{u_k} = T J_{u_k}$ and $J_{u_k} = T_{u_k} J_{u_k}$, we have

$J_{u_k} = T_{u_k} J_{u_k} \ge T J_{u_k} = T_{u_{k+1}} J_{u_k} \ge T_{u_{k+1}}^n J_{u_k} \to J_{u_{k+1}}$ as $n \to \infty$.

Therefore, policy $u_{k+1}$ is an improvement over policy $u_k$. $\square$

In step 2, we solve $J_{u_k} = g_{u_k} + \alpha P_{u_k} J_{u_k}$, which may require a significant amount of computation. We
thus introduce another algorithm which requires less computation per iteration.
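A minimal sketch of exact policy iteration, assuming a tabular layout with numpy arrays indexed by state and action (an illustrative choice, not prescribed by the notes), is given below.

```python
import numpy as np

def policy_iteration(n_states, n_actions, g, P, alpha):
    """Exact policy iteration on a tabular MDP.

    g[a][x]: one-stage cost; P[a][x, y]: transition matrix for action a (numpy arrays).
    Returns an optimal policy u and its cost-to-go J_u.
    """
    u = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # policy evaluation: solve (I - alpha * P_u) J = g_u
        P_u = np.array([P[u[x]][x] for x in range(n_states)])
        g_u = np.array([g[u[x]][x] for x in range(n_states)])
        J = np.linalg.solve(np.eye(n_states) - alpha * P_u, g_u)
        # policy improvement: greedy with respect to J
        Q = np.array([[g[a][x] + alpha * P[a][x] @ J for a in range(n_actions)]
                      for x in range(n_states)])
        u_new = Q.argmin(axis=1)
        if np.array_equal(u_new, u):
            return u, J
        u = u_new
```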
3.1 Asynchronous Policy Iteration

1. Start with a function $J_0$, a policy $u_0$, and $k = 0$.
2. Select a subset $S_k \subset S$ and perform one of the following:
(a) value update: $J_{k+1}(x) = (T_{u_k} J_k)(x)$ for $x \in S_k$, $J_{k+1}(x) = J_k(x)$ otherwise, and $u_{k+1} = u_k$;
(b) policy update: $u_{k+1}(x)$ greedy with respect to $J_k$ for $x \in S_k$, i.e., $(T_{u_{k+1}} J_k)(x) = (T J_k)(x)$, $u_{k+1}(x) = u_k(x)$ otherwise, and $J_{k+1} = J_k$.
3. $k = k + 1$; go back to step 2.
Theorem 4 If $T_{u_0} J_0 \le J_0$ and infinitely many value and policy updates are performed on each state, then
$\lim_{k \to \infty} J_k = J^*$.

Proof: We prove this theorem in two steps. First, we will show that

$J^* \le J_{k+1} \le J_k, \quad \forall k$.

This implies that $J_k$ is a nonincreasing sequence. Since $J_k$ is bounded below by $J^*$, $J_k$ converges to some
value, i.e., $J_k \to \bar{J}$ as $k \to \infty$. Next, we will show that $J_k$ converges to $J^*$, i.e., $\bar{J} = J^*$.
Lemma 3 If $T_{u_0} J_0 \le J_0$, the sequence $J_k$ generated by asynchronous policy iteration converges.

Proof: We start by showing that, if $T_{u_k} J_k \le J_k$, then $T_{u_{k+1}} J_{k+1} \le J_{k+1} \le J_k$. Suppose we have a value
update. Then

$J_{k+1}(x) = (T_{u_k} J_k)(x) \le J_k(x)$ for $x \in S_k$, and $J_{k+1}(x) = J_k(x)$ for $x \notin S_k$,

so that $J_{k+1} \le J_k$ and, since $u_{k+1} = u_k$, $T_{u_{k+1}} J_{k+1} = T_{u_k} J_{k+1} \le T_{u_k} J_k \le J_{k+1}$ (the last inequality holds
componentwise by the two cases above).
Now suppose that we have a policy update. Then $J_{k+1} = J_k$. Moreover, for $x \in S_k$, we have

$(T_{u_{k+1}} J_{k+1})(x) = (T_{u_{k+1}} J_k)(x) = (T J_k)(x) \le (T_{u_k} J_k)(x) \le J_k(x) = J_{k+1}(x)$.

The first equality follows from $J_k = J_{k+1}$, the second equality and first inequality follow from the fact that
$u_{k+1}(x)$ is greedy with respect to $J_k$ for $x \in S_k$, the second inequality follows from the induction hypothesis,
and the third equality follows from $J_k = J_{k+1}$. For $x \notin S_k$, we have

$(T_{u_{k+1}} J_{k+1})(x) = (T_{u_k} J_k)(x) \le J_k(x) = J_{k+1}(x)$.

The equalities follow from $J_k = J_{k+1}$ and $u_{k+1}(x) = u_k(x)$ for $x \notin S_k$, and the inequality follows from the
induction hypothesis.
Since by hypothesis $T_{u_0} J_0 \le J_0$, we conclude that $J_k$ is a nonincreasing sequence. Moreover, we have
$T_{u_k} J_k \le J_k$, hence $J_k \ge J_{u_k} \ge J^*$, so that $J_k$ is bounded below. It follows that $J_k$ converges to some limit
$\bar{J}$. $\square$
Lemma 4 Suppose that $J_k \to \bar{J}$, where $J_k$ is generated by asynchronous policy iteration, and suppose that
there are infinitely many value and policy updates at each state. Then $\bar{J} = J^*$.

Proof: First note that, since $T J_k \le J_k$, by continuity of the operator $T$ we must have $T \bar{J} \le \bar{J}$. Now
suppose that $(T \bar{J})(x) < \bar{J}(x)$ for some state $x$. Then, by continuity, there is an iteration index $\bar{k}$ such that
$(T J_k)(x) < \bar{J}(x)$ for all $k \ge \bar{k}$. Let $k'' > k' > \bar{k}$ correspond to iterations of the asynchronous policy iteration
algorithm such that there is a policy update at state $x$ at iteration $k'$, a value update at state $x$ at iteration
$k''$, and no updates at state $x$ in iterations $k' < k < k''$. Such iterations are guaranteed to exist since
there are infinitely many value and policy update iterations at each state. Then we have $u_{k''}(x) = u_{k'+1}(x)$,
$J_{k''}(x) = J_{k'}(x)$, and

$J_{k''+1}(x) = (T_{u_{k''}} J_{k''})(x) = (T_{u_{k'+1}} J_{k''})(x) \le (T_{u_{k'+1}} J_{k'})(x) = (T J_{k'})(x) < \bar{J}(x)$.

The first equality holds because there is a value update at state $x$ at iteration $k''$, the second equality holds
because $u_{k''}(x) = u_{k'+1}(x)$, the first inequality holds because $J_k$ is decreasing and $T_{u_{k'+1}}$ is monotone, and
the third equality holds because there is a policy update at state $x$ at iteration $k'$.
We have concluded that $J_{k''+1}(x) < \bar{J}(x)$. However, by hypothesis $J_k \ge \bar{J}$ for all $k$, so we have a contradiction, and it must
follow that $T \bar{J} = \bar{J}$, so that $\bar{J} = J^*$. $\square$
February 17
Handout #6
Lecture Note 4
Average-cost Problems

In average-cost problems, we evaluate each policy $u$ according to its long-run average cost

$J_u(x) = \limsup_{T \to \infty} \frac{1}{T} E\left[ \sum_{t=0}^{T-1} g_u(x_t) \,\middle|\, x_0 = x \right]$.     (1)

Since the state space is finite, it can be shown that the $\limsup$ can actually be replaced with $\lim$ for any
stationary policy. In the previous lectures, we first found the cost-to-go functions $J^*(x)$ (for discounted
problems) or $J^*(x, t)$ (for finite-horizon problems) and then found the optimal policy through the cost-to-go
functions. However, in the average-cost problem, $J_u(x)$ does not offer enough information for an optimal
policy to be found; in particular, in most cases of interest we will have $J_u(x) = \lambda_u$ for some scalar $\lambda_u$, for
all $x$, so that it does not allow us to distinguish the value of being in each state.
We will start by deriving some intuition based on finite-horizon problems. Consider a set of states
$S = \{x_1, x_2, \dots, x^*, \dots, x_n\}$. Starting from some initial state $x$, the states are visited in a sequence, say

$x, \dots, x^*, \dots, x^*, \dots, x^*, \dots$,

where the segment before the first visit to $x^*$ is associated with a residual cost $h(x)$ and each subsequent
cycle between visits to $x^*$ has an average cost $\lambda^1_u, \lambda^2_u, \dots$.
Let $T_i(x)$, $i = 1, 2, \dots$, be the stages corresponding to the $i$th visit to state $x^*$, starting at state $x$. Let

$\lambda^i_u(x) = E\left[ \frac{ \sum_{t=T_i(x)}^{T_{i+1}(x)-1} g_u(x_t) }{ T_{i+1}(x) - T_i(x) } \right]$.

Intuitively, we must have that $\lambda^i_u(x)$ is independent of the initial state $x$ and that $\lambda^i_u(x) = \lambda^j_u(x)$, since we have the same
transition probabilities whenever we start a new cycle in state $x^*$. Going back to the definition of the function

$J^*(x, T) = \min_u E\left[ \sum_{t=0}^{T} g_u(x_t) \,\middle|\, x_0 = x \right]$,

this suggests the approximation

$J^*(x, T) \approx \lambda^*(x) T + h^*(x)$, as $T \to \infty$.     (2)

Note that, since $\lambda^*(x)$ is independent of the initial state, we can rewrite the approximation as

$J^*(x, T) = \lambda^* T + h^*(x) + o(T)$, as $T \to \infty$,     (3)

where the term $h^*(x)$ can be interpreted as a residual cost that depends on the initial state $x$ and will be referred
to as the differential cost function. It can be shown that

$h^*(x) = E\left[ \sum_{t=0}^{T_1(x)-1} \left( g_u(x_t) - \lambda^* \right) \right]$.
We can now speculate about a version of Bellman's equation for computing $\lambda^*$ and $h^*$. Approximating
$J^*(x, T)$ as in (3) and substituting it into the finite-horizon recursion, we have

$\lambda^* + h^*(x) \approx \min_{a \in A_x} \left[ g_a(x) + \sum_{y \in S} P_a(x, y) h^*(y) \right]$.     (4)

For the average-cost setting, define the operators $T_u h = g_u + P_u h$ and $T h = \min_u T_u h$. These operators are
monotone: let $h \le \bar{h}$ be arbitrary; then $T h \le T \bar{h}$ (and $T_u h \le T_u \bar{h}$).
Notice, however, that the contraction principle does not hold for $T h = \min_u T_u h$.
Bellman's Equation

The average-cost Bellman's equation is

$\lambda e + h = T h$, i.e., $\lambda + h(x) = \min_{a \in A_x} \left[ g_a(x) + \sum_{y \in S} P_a(x, y) h(y) \right]$ for all $x$.     (5)

Before examining the existence of solutions to Bellman's equation, we show that a solution of
Bellman's equation yields an optimal policy, via the following theorem.

Theorem 1 Suppose that $\lambda^*$ and $h^*$ satisfy Bellman's equation. Let $u^*$ be greedy with respect to $h^*$, i.e.,
$T h^* = T_{u^*} h^*$. Then

$J_{u^*}(x) = \lambda^*, \quad \forall x$,

and

$J_{u^*}(x) \le J_u(x), \quad \forall u$.
Proof: Let $u = (u_1, u_2, \dots)$ be an arbitrary policy and let $N$ be arbitrary. Then

$T_{u_{N-1}} h^* \ge T h^* = \lambda^* e + h^*$,

and

$T_{u_{N-2}} T_{u_{N-1}} h^* \ge T_{u_{N-2}} (h^* + \lambda^* e) = T_{u_{N-2}} h^* + \lambda^* e \ge T h^* + \lambda^* e = h^* + 2 \lambda^* e$.

Repeating the argument,

$T_{u_1} T_{u_2} \cdots T_{u_{N-1}} h^* \ge (N-1) \lambda^* e + h^*$.

Thus, we have

$E\left[ \sum_{t=0}^{N-2} g_u(x_t) + h^*(x_{N-1}) \,\middle|\, x_0 = x \right] \ge (N-1) \lambda^* + h^*(x)$.

By dividing both sides by $N$ and taking the limit as $N$ approaches infinity, we have¹

$J_u \ge \lambda^* e$.

Taking $u = (u^*, u^*, u^*, \dots)$, all the inequalities above become equalities. Thus

$\lambda^* e = J_{u^*}$.     $\square$
This theorem says that, if Bellman's equation has a solution, then we can obtain an optimal policy from it.
Note that, if $(\lambda^*, h^*)$ is a solution to Bellman's equation, then $(\lambda^*, h^* + ke)$ is also a solution, for every
scalar $k$. Hence, if Bellman's equation in (5) has a solution, then it has infinitely many solutions. However,
unlike the case of discounted-cost and finite-horizon problems, the average-cost Bellman's equation does not
necessarily have a solution. In particular, the previous theorem implies that, if a solution exists, then the
average cost $J_u(x)$ is the same for all initial states. It is easy to come up with examples where this is not
the case. For instance, consider the case when the transition probability matrix is the identity, i.e., each state
transitions to itself every time, and each state incurs a different cost $g(\cdot)$. Then the average cost depends
on the initial state, so no solution to Bellman's equation can exist. Hence, Bellman's equation does not
always have a solution.
¹ Recall that $J_u(x) = \limsup_{N \to \infty} \frac{1}{N} E\left[ \sum_{t=0}^{N-1} g_u(x_t) \,\middle|\, x_0 = x \right]$.
February 18
Handout #7
Lecture Note 5
In this lecture, we will show that optimal policies for discounted-cost problems with a large enough discount
factor are also optimal for average-cost problems. The analysis will also show that, if the optimal average
cost is the same for all initial states, then the average-cost Bellman's equation has a solution.
Note that the optimal average cost is independent of the initial state. Recall that

$J_u(x) = \limsup_{N \to \infty} \frac{1}{N} E\left[ \sum_{t=0}^{N-1} g_u(x_t) \,\middle|\, x_0 = x \right]$,

or, equivalently,

$J_u = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P_u^t g_u$.
We also let $J_{u,\alpha}$ denote the discounted cost-to-go function associated with policy $u$ when the discount factor
is $\alpha$, i.e.,

$J_{u,\alpha} = \sum_{t=0}^{\infty} \alpha^t P_u^t g_u = (I - \alpha P_u)^{-1} g_u$.

The following theorem formalizes the relationship between the discounted cost-to-go function and the average
cost.

Theorem 1 For every stationary policy $u$, there is $h_u$ such that

$J_{u,\alpha} = \frac{1}{1 - \alpha} J_u + h_u + O(|1 - \alpha|)$.     (1)

The proof is based on the expansion

$(I - \alpha P_u)^{-1} = \frac{1}{1 - \alpha} P_u^* + H_u + O(|1 - \alpha|)$,     (2)

where

$P_u^* = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P_u^t$,     (3)

$H_u = (I - P_u + P_u^*)^{-1} - P_u^*$,     (4)

and the matrices $P_u^*$ and $H_u$ satisfy

$P_u^* P_u = P_u P_u^* = P_u^* P_u^* = P_u^*$,     (5)

$P_u^* H_u = H_u P_u^* = 0$,     (6)

$P_u^* + H_u = I + P_u H_u$.     (7)

Proof: Let $M(\alpha) = (1 - \alpha)(I - \alpha P_u)^{-1} = (1 - \alpha) \sum_{t=0}^{\infty} \alpha^t P_u^t$. Each entry of $M$ satisfies

$|M(x, y)| = (1 - \alpha) \sum_{t=0}^{\infty} \alpha^t P_u^t(x, y) \le (1 - \alpha) \sum_{t=0}^{\infty} \alpha^t \cdot 1 = 1$,

and is a rational function $\frac{p(\alpha)}{q(\alpha)}$ of $\alpha$, bounded as $\alpha \to 1$. Hence $M$ admits an expansion around $\alpha = 1$ with
$M(1) = P_u^*$ and $H_u = -\frac{dM}{d\alpha}\big|_{\alpha=1}$, which gives

$(I - \alpha P_u)^{-1} = \frac{1}{1 - \alpha} P_u^* + H_u + O(|1 - \alpha|)$;

note also that $(1 - \alpha)(I - \alpha P_u)(I - \alpha P_u)^{-1} = (1 - \alpha) I$.
We now show that, for every $t \ge 1$, $P_u^t - P_u^* = (P_u - P_u^*)^t$. For $t = 1$, it is trivial. Suppose that the
result holds up to $n-1$, i.e., $P_u^{n-1} - P_u^* = (P_u - P_u^*)^{n-1}$. Then $(P_u - P_u^*)(P_u - P_u^*)^{n-1} = (P_u - P_u^*)(P_u^{n-1} -
P_u^*) = P_u^n - P_u P_u^* - P_u^* P_u^{n-1} + P_u^* P_u^* = P_u^n - P_u^*$. By induction, we have
$P_u^t - P_u^* = (P_u - P_u^*)^t$.
Now note that

$H_u = \lim_{\alpha \to 1} \frac{M(\alpha) - P_u^*}{1 - \alpha}$
$= \lim_{\alpha \to 1} \left[ (I - \alpha P_u)^{-1} - \frac{1}{1 - \alpha} P_u^* \right]$
$= \lim_{\alpha \to 1} \sum_{t=0}^{\infty} \alpha^t (P_u^t - P_u^*)$
$= I - P_u^* + \lim_{\alpha \to 1} \sum_{t=1}^{\infty} \alpha^t (P_u - P_u^*)^t$
$= \sum_{t=0}^{\infty} (P_u - P_u^*)^t - P_u^*$
$= (I - P_u + P_u^*)^{-1} - P_u^*$.

Hence $H_u = (I - P_u + P_u^*)^{-1} - P_u^*$.
We now show $P_u^* H_u = 0$. Observe

$P_u^* H_u = P_u^* \left[ (I - P_u + P_u^*)^{-1} - P_u^* \right] = \sum_{t=0}^{\infty} P_u^* (P_u - P_u^*)^t - P_u^* = P_u^* - P_u^* = 0$,

since $P_u^* (P_u - P_u^*)^t = 0$ for $t \ge 1$. Therefore, $P_u^* H_u = 0$.
Since $H_u + P_u^* = (I - P_u + P_u^*)^{-1}$ and $P_u^* H_u = 0$, we have $P_u^* + H_u = I + P_u H_u$.
By multiplying $P_u^k$ on both sides of $P_u^* + H_u = I + P_u H_u$ and summing over $k = 0, \dots, N-1$, we have

$N P_u^* + \sum_{k=0}^{N-1} P_u^k H_u = \sum_{k=0}^{N-1} P_u^k + \sum_{k=1}^{N} P_u^k H_u$,

or, equivalently,

$N P_u^* = \sum_{k=0}^{N-1} P_u^k + P_u^N H_u - H_u$,

which, after dividing by $N$ and letting $N \to \infty$, is consistent with (3).
Let $\pi_u$ denote a row of $P_u^*$. Then $\pi_u = \pi_u P_u$ and $\pi_u(x) = \sum_y \pi_u(y) P_u(y, x)$, so that $P(x_1 = x \mid x_0 \sim \pi_u) =
\pi_u(x)$. We can conclude that any row of the matrix $P_u^*$ is a stationary distribution for the Markov
chain under the policy $u$. However, does this observation mean that all rows of $P_u^*$ are identical?
Theorem 2

$J_{u,\alpha} = \frac{J_u}{1 - \alpha} + h_u + O(|1 - \alpha|)$, where $h_u = H_u g_u$.

Proof:

$J_{u,\alpha} = (I - \alpha P_u)^{-1} g_u$
$= \left[ \frac{P_u^*}{1 - \alpha} + H_u + O(|1 - \alpha|) \right] g_u$
$= \frac{P_u^* g_u}{1 - \alpha} + H_u g_u + O(|1 - \alpha|)$
$= \frac{1}{1 - \alpha} \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P_u^t g_u + \underbrace{h_u}_{= H_u g_u} + O(|1 - \alpha|)$
$= \frac{J_u}{1 - \alpha} + h_u + O(|1 - \alpha|)$.     $\square$
Blackwell Optimality
In this section, we will show that policies that are optimal for the discounted-cost criterion with large enough
discount factors are also optimal for the average-cost criterion. Indeed, we can actually strengthen the notion
of average-cost optimality and establish the existence of policies that are optimal for all large enough discount
factors.
Definition 1 (Blackwell Optimality) A stationary policy $u$ is called Blackwell optimal if there exists $\bar{\alpha} \in (0, 1)$
such that $u$ is optimal for all $\alpha \in [\bar{\alpha}, 1)$.
Theorem 3 There exists a stationary Blackwell optimal policy, and it is also optimal for the average-cost
problem among all stationary policies.

Proof: Since there are only finitely many policies, we must have for each state $x$ a policy $\bar{u}_x$ such that
$J_{\bar{u}_x,\alpha}(x) \le J_{u,\alpha}(x)$ for all policies $u$ and all large enough $\alpha$. If we take the policy $\bar{u}$ to be given by $\bar{u}(x) = \bar{u}_x(x)$, then $\bar{u}$
must satisfy Bellman's equation

$J_{\bar{u},\alpha} = \min_u \left\{ g_u + \alpha P_u J_{\bar{u},\alpha} \right\}$

for all large enough $\alpha$, so that $\bar{u}$ is Blackwell optimal; average-cost optimality among stationary policies then follows by letting $\alpha \to 1$ in Theorem 1.
Remark 1 It is actually possible to establish average-cost optimality of Blackwell optimal policies among
the set of all policies, not only stationary ones.

Remark 2 An algorithm for computing Blackwell optimal policies involves lexicographic optimization of $J_u$,
$h_u$ and higher-order terms in the Taylor expansion of $J_{u,\alpha}$.

Theorem 3 implies that, if the optimal average cost is the same regardless of the initial state, then the
average-cost Bellman's equation has a solution. Combined with Theorem 1 of the previous lecture, it follows
that this is a necessary and sufficient condition for existence of a solution to Bellman's equation.
Corollary 1 If $J_{u^*} = \lambda^* e$ for a Blackwell optimal policy $u^*$, then $\lambda^* e + h = T h$ has a solution $(\lambda^*, h_{u^*})$.

Indeed, using Theorem 1,

$\min_u \left\{ g_u + \alpha P_u J_{u^*,\alpha} \right\}
= \min_u \left\{ g_u + \alpha P_u \left( \frac{J_{u^*}}{1 - \alpha} + h_{u^*} + O(|1 - \alpha|) \right) \right\}$
$= \min_u \left\{ g_u + \alpha P_u \left( \frac{\lambda^* e}{1 - \alpha} + h_{u^*} + O(|1 - \alpha|) \right) \right\}$
$= \frac{\alpha \lambda^*}{1 - \alpha} e + \min_u \left\{ g_u + \alpha P_u h_{u^*} + O(|1 - \alpha|) \right\}$,

while the left-hand side equals $J_{u^*,\alpha} = \frac{\lambda^* e}{1 - \alpha} + h_{u^*} + O(|1 - \alpha|)$; cancelling the $\frac{1}{1-\alpha}$ terms and letting
$\alpha \to 1$ yields $\lambda^* e + h_{u^*} = T h_{u^*}$.
In the average-cost setting, existence of a solution to Bellman's equation actually depends on the structure
of transition probabilities in the system. Some sufficient conditions for the optimal average cost to be the
same regardless of the initial state are given below.

Definition 2 We say that two states $x, y$ communicate under policy $u$ if there are $k, k' \in \{1, 2, \dots\}$ such
that $P_u^k(x, y) > 0$ and $P_u^{k'}(y, x) > 0$.

Definition 4 We say that a state $x$ is transient under policy $u$ if, under $u$, it is only visited finitely many times
with probability 1.
Value Iteration

We want to compute

$\min_u \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P_u^t g_u$.

One way to obtain this value is to run the finite-horizon recursion for a finite but very large number of stages to
approximate the limit and speculate that such a limit is accurate. Hence we consider

$(T^k J)(x) = \min_u E\left[ \sum_{t=0}^{k-1} g_u(x_t) + J_0(x_k) \right]$.

Recall that $J^*(x, T) \approx \lambda^* T + h^*(x)$. Choose some state $\bar{x}$ and fix it; then

$J^*(x, T) - J^*(\bar{x}, T) \approx h^*(x) - h^*(\bar{x})$,

and

$h_k(x) = J^*(x, k) - k \lambda_k$, for some sequence $\lambda_1, \lambda_2, \dots$

Note that, since $(\lambda^*, h^* + ke)$ is a solution to Bellman's equation for all $k$ whenever $(\lambda^*, h^*)$ is a solution, we
can choose the value of a single state arbitrarily. Letting $h^*(\bar{x}) = 0$, we have the following commonly used
version of value iteration:

$h_{k+1}(x) = (T h_k)(x) - (T h_k)(\bar{x})$.     (8)

Theorem 5 Let $h_k$ be given by (8). Then if $h_k \to \bar{h}$, we have $\lambda^* = (T \bar{h})(\bar{x})$ and $h^* = \bar{h}$, i.e.,

$\lambda^* e + h^* = T h^*$.

Note that there must exist a solution to the average-cost Bellman's equation for value iteration to converge. However, it can be shown that existence of a solution is not a sufficient condition.
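A minimal sketch of the relative value iteration (8), reusing the hypothetical tabular layout of the earlier sketches, might look as follows; the fixed iteration count is an illustrative choice.

```python
def relative_value_iteration(S, A, g, P, x_ref, n_iter=1000):
    """Average-cost value iteration (8): h_{k+1}(x) = (T h_k)(x) - (T h_k)(x_ref).

    Returns the differential-cost estimate h and the average-cost estimate (T h)(x_ref).
    Data layout: g[a][x] is the one-stage cost, P[a][(x, y)] the transition probability.
    """
    def T(h):
        return {x: min(g[a][x] + sum(P[a].get((x, y), 0.0) * h[y] for y in S)
                       for a in A[x])
                for x in S}

    h = {x: 0.0 for x in S}
    for _ in range(n_iter):
        Th = T(h)
        offset = Th[x_ref]
        h = {x: Th[x] - offset for x in S}
    return h, T(h)[x_ref]
```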
February 23
Handout #9
Lecture Note 6
In the first part of this lecture, we will discuss the application of dynamic programming to the queueing
network introduced in [1], which illustrates several issues encountered in the application of dynamic programming to practical problems. In particular, we consider the issues that arise when value iteration is
applied to problems with a large or infinite state space.
The main points in [1], which we overview today, are the following:
Naive implementation of value iteration may lead to slow convergence and, in the case of infinite
state spaces, policies with infinite average cost in every iteration step, even though the iterates $J_k(x)$
converge pointwise to $J^*(x)$ for every state $x$;
Under certain conditions, with proper initialization $J_0$, we can have faster convergence and stability
guarantees;
In queueing networks, a proper $J_0$ can be found from well-known heuristics such as fluid model solutions.
We will illustrate these issues with examples involving queueing networks. For the generic results, including a proof of convergence of average-cost value iteration for MDPs with infinite state spaces, refer to
[1].
1.1

[Figure: the three-machine, eight-queue multiclass queueing network from Lecture 1 (Machines 1, 2 and 3), with external arrivals to queues 1 and 4.]

The cost is the total number of jobs in the system,

$g(x) = \sum_{i=1}^{N} x_i$,

and decisions are vectors $a \in \{0, 1\}^N$ indicating which queues are served.
The interpretation is as follows. At each time stage, at most one of the following events can happen: a new
job arrives at queue $i$ with probability $\lambda_i$; a job from queue $i$ that is currently being served has its service
completed, with probability $\mu_i$, and either moves to another queue or leaves the system, depending on the
structure of the network. Note that, at each time stage, a server may choose to process a job from any of the
queues associated with it. Therefore the decision $a$ encodes which queue is being processed at each server.
We refer to such a queueing network as multiclass because jobs at different queues have different service rates
and trajectories through the system.
As seen before, an optimal policy could be derived from the differential cost function $h^*$, which is the
solution of Bellman's equation:

$\lambda^* + h^* = T h^*$.

Consider using value iteration for estimating $h^*$. This requires some initial guess $h_0$. A common choice
is $h_0 = 0$; however, we will show that this can lead to slow convergence of $h_k$. Indeed, we know that $h^*$ is
equivalent to a quadratic, in the sense that it is bounded above and below by quadratic functions of the state.¹
With $h_0 = 0$,

$(T^k h_0)(x) = \min_u E\left[ \sum_{t=0}^{k-1} \sum_{i=1}^{N} x_i(t) \,\middle|\, x_0 = x \right]$.
Since at most one job arrives at queue $i$ per time step, with probability $\lambda_i$ (arrival), and the expected number of
departures $E[D_i(t)]$ is nonnegative (departure), we have

$E[x_i(1)] \le E[x_i(0)] + \lambda_i$,
$E[x_i(2)] \le E[x_i(0)] + 2\lambda_i$,
$\vdots$
$E[x_i(t)] \le E[x_i(0)] + t\lambda_i$.     (1)

¹ You will show this for the special case of a single queue with controlled service rate in problem set 2.
Thus,

$(T^k h_0)(x) \le \sum_{t=0}^{k-1} \sum_{i=1}^{N} \left( x_i(0) + t \lambda_i \right) = \sum_{i=1}^{N} \left[ k x_i(0) + \frac{k(k-1)}{2} \lambda_i \right]$.

This implies that $h_k(x)$ is upper bounded by a linear function of the state $x$. In order for it to approach a
quadratic function of $x$, the iteration number $k$ must have the same magnitude as $x$. It follows that, if the
state space is very large, convergence is slow. Moreover, if the state space is infinite, which is the case if
queues do not have finite buffers, only pointwise convergence of $h_k(x)$ to $h^*(x)$ can be ensured, but for every
$k$, there is some state $x$ such that $h_k(x)$ is a poor approximation to $h^*(x)$.
Example 1 (Single queue with controlled service rate) Consider a single queue with:
state $x$ defined as the queue length;
$P_a(x, x+1) = \lambda$ (arrival rate).
Simulation-based Methods

The dynamic programming algorithms studied so far have the following characteristics:
infinitely many value and/or policy updates are required at every state;
perfect information about the problem is required, i.e., we must know $g_a(x)$ and $P_a(x, y)$;
we must know how to compute greedy policies, and in particular compute expectations of the form

$\sum_{y \in S} P_a(x, y) J(y)$.     (2)

In realistic scenarios, each of these requirements may pose difficulties. When the state space is large,
performing updates infinitely often in every state may be prohibitive, or, even if it is feasible, a clever order
of visitation may considerably speed up convergence. In many cases, the system parameters are not known,
and instead one only has access to observations of the system. Finally, even if the transition probabilities
are known, computing expectations of the form (2) may be costly. In the next few lectures, we will study
simulation-based methods, which aim at alleviating these issues.
2.1

In asynchronous value iteration (AVI), at each iteration $k$ a subset of states $S_k \subset S$ is selected and

$J_{k+1}(x) = (T J_k)(x)$ for $x \in S_k$, $\quad J_{k+1}(x) = J_k(x)$ for $x \notin S_k$.
We have seen that, if every state has its value updated innitely many times, then the AVI converges
(see arguments in Problem set 1). The question remains as to whether convergence may be improved by
selecting states in a particular order, and whether we can dispense with the requirement of visiting every
state innitely many times.
We will consider a version of AVI where state updates are based on actual or simulated trajectories for
the system. It seems reasonable to expect that, if the system is often encountered at certain states, more
emphasis should be placed in obtaining accurate estimates and good actions for those states, motivating
performing value updates more often at those states. In the limit, it is clear that if a state is never visited,
under any policy, then the value of the cost-to-go function at such a state never comes into play in the
decision-making process, and no updates need to be performed for such a state at all. Based on the notion
that state trajectories contain information about which states are most relevant, we propose the following
version of AVI. We call it real-time value iteration (RTVI).
1. Take an arbitrary state x0 . Let k = 0.
2. Choose action uk in some fashion.
3. Let xk+1 = f (xk , uk , wk ) (recall from lecture 1 that f gives an alternative representation for state
transitions).
4. Let Jk+1 (xk+1 ) = (T Jk )(xk+1 ).
5. Let k = k + 1 and return to step 2.
2.2
Exploration vs. Exploitation
Note that there is still an element missing in the description of RTVI, namely, how to choose action u k . It
is easy to see that, if for every state x and y there is a policy u such that there is a positive probability of
reaching state y at some time stage, starting at state x, one way of choosing u k that ensures convergence of
RTVI is to select it randomly among all possible actions. This ensures that all states are visited innitely
often, and the convergence result proved for AVI holds for RTVI. However, if we are actually applying RTVI
as we run the system, we may not want to wait until RTVI converges before we start trying to use good
policies; we would like to use good policies as early as possible. A reasonable choice in this direction is to
take an action uk that is greedy with respect to the current estimate Jk of the optimal cost-to-go function.
4
In general, choosing uk greedily does not ensure convergence to the optimal policy. One possible failure
scenario is illustrated in Figure 2. Suppose that there is a subset of states B which is recurrent under an
optimal policy, and a disjoint subset of states A which is recurrent under another policy. If we start with a
guess J0 which is high enough at states outside region A, and always choose actions greedily, then an action
that never leads to states outside region A will be selected. Hence RTVI never has a chance of updating and
correcting the initial guess J0 at states in subset B, and in particular, the optimal policy is never achieved.
It turns out that, if we choose the initial value $J_0 \le J^*$, then greedy action selection performs well, as
shown in Fig. 2(b). We state this concept formally in the following theorem.
The previous discussion highlights a trade-off that is fundamental to learning algorithms: the conflict
between exploitation and exploration. In particular, there is usually tension between exploiting information
accumulated by previous learning steps and exploring different options, possibly at a certain cost, in order
to gather more information.
[Figure 2: comparison of initial values J_0 relative to J*; (b) initial value with J_0 less than or equal to J*.]
For states $x \in A$, define the operator

$(T^A J)(x) = \min_{a \in A_x} \left[ g_a(x) + \alpha \sum_{y \in A} P_a(x, y) J(y) + \alpha \sum_{y \notin A} P_a(x, y) J_0(y) \right]$.

Hence one could regard $J$ as a function from the set $A$ to $\mathbb{R}^{|A|}$. So $T^A$ is similar to the DP operator for the
subset $A$ of states, and

$\| T^A J - T^A \bar{J} \|_\infty \le \alpha \| J - \bar{J} \|_\infty$.
Therefore, RTVI is AVI over $A$, with every state visited infinitely many times. Thus,

$J_k(x) \to \bar{J}(x) = \begin{cases} \hat{J}(x), & \text{if } x \in A, \\ J_0(x), & \text{if } x \notin A, \end{cases}$

where $\hat{J}$ is the fixed point of $T^A$. Since the states $x \notin A$ are never visited, we must have

$P_a(x, y) = 0, \quad x \in A, \ y \notin A$,

where $a$ is greedy with respect to $\bar{J}$. Let $u$ be the greedy policy with respect to $\bar{J}$. Then

$\bar{J}(x) = g_u(x) + \alpha \sum_{y \in S} P_u(x, y) \bar{J}(y), \quad x \in A$.

Therefore, we conclude

$\bar{J}(x) = J_u(x) \ge J^*(x), \quad x \in A$,

and, since $J_0 \le J^*$ implies $J_k \le J^*$ for all $k$, also $\bar{J}(x) \le J^*(x)$, so that $J_u(x) = J^*(x)$ for all $x \in A$.
References
[1] R.-R. Chen and S. P. Meyn, "Value Iteration and Optimization of Multiclass Queueing Networks," Queueing
Systems, 32, pp. 65-97, 1999.
February 25
Handout #10
Lecture Note 7
Recall the real-time value iteration scheme of the previous lecture: simulate the system according to
$x_{k+1} = f(x_k, u_k, w_k)$, choose $u_k$ in some fashion, and update $J_{k+1}(x_{k+1}) = (T J_k)(x_{k+1})$.
2 Q-Learning

2.1 Q-Factors

Define

$J^*(x) = \min_{a \in A_x} Q^*(x, a)$,     (1)

where the Q-factors are given by

$Q^*(x, a) = g_a(x) + \alpha \sum_{y \in S} P_a(x, y) J^*(y)$.     (2)

We can interpret these equations as Bellman's equations for an MDP with expanded state space. We have
the original states $x \in S$, with associated sets of feasible actions $A_x$, and extra states $(x, a)$, $x \in S$, $a \in A_x$,
corresponding to state-action pairs, for which there is only one action available, and no decision must be
made. Note that, whenever we are in a state $x$ where a decision must be made, the system transitions
deterministically to state $(x, a)$ based on the state and the action $a$ chosen. Therefore we circumvent the need
to compute expectations over next states when extracting a greedy policy: a greedy action at $x$ is simply one
minimizing $Q^*(x, a)$. Combining (1) and (2), $Q^*$ is the fixed point of the operator $H$ defined by

$(H Q)(x, a) = g_a(x) + \alpha \sum_{y \in S} P_a(x, y) \min_{a'} Q(y, a')$.     (3)
It is easy to show that the operator $H$ has the same properties as the operator $T$ defined in previous lectures
for discounted-cost problems:

Monotonicity: for all $Q$ and $\bar{Q}$ such that $Q \le \bar{Q}$, $H Q \le H \bar{Q}$;

Offset: for all $Q$ and scalars $K$, $H(Q + K e) = H Q + \alpha K e$;

Contraction: for all $Q$ and $\bar{Q}$, $\| H Q - H \bar{Q} \|_\infty \le \alpha \| Q - \bar{Q} \|_\infty$.
2.2 Q-Learning

We now develop a real-time value iteration algorithm for computing $Q^*$. An algorithm analogous to RTVI
for computing the cost-to-go function is

$Q_{t+1}(x_t, a_t) = (H Q_t)(x_t, a_t)$.

However, this algorithm undermines the idea that Q-learning is motivated by situations where we do not
know the transition probabilities, since applying $H$ requires them. One may instead consider replacing the
expectation with the observed next state:

$Q_{t+1}(x_t, a_t) = g_{a_t}(x_t) + \alpha \min_{a'} Q_t(x_{t+1}, a')$.

However, note that such an algorithm should not be expected to converge; in particular, $\min_{a'} Q_t(x_{t+1}, a')$ is a
noisy estimate of $\sum_y P_{a_t}(x_t, y) \min_{a'} Q_t(y, a')$. We consider a small-step version of this scheme, where the
noise is attenuated:

$Q_{t+1}(x_t, a_t) = (1 - \gamma_t) Q_t(x_t, a_t) + \gamma_t \left[ g_{a_t}(x_t) + \alpha \min_{a'} Q_t(x_{t+1}, a') \right]$.     (4)

We will study the properties of (4) under the more general framework of stochastic approximations, which
are at the core of many simulation-based or real-time dynamic programming algorithms.
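A minimal tabular Q-learning sketch implementing update (4) is given below; the environment interface (env_step, actions), the step-size rule, and the epsilon-greedy exploration rule are illustrative assumptions, not part of the notes.

```python
import random
from collections import defaultdict

def q_learning(env_step, actions, alpha, n_steps, x0, eps=0.1):
    """Tabular Q-learning with the small-step update (4).

    env_step(x, a) must return (cost, next_state); actions(x) lists feasible actions.
    An epsilon-greedy rule keeps visiting all state-action pairs.
    """
    Q = defaultdict(float)
    visits = defaultdict(int)
    x = x0
    for _ in range(n_steps):
        if random.random() < eps:
            a = random.choice(actions(x))
        else:
            a = min(actions(x), key=lambda b: Q[(x, b)])
        cost, y = env_step(x, a)
        visits[(x, a)] += 1
        gamma = 1.0 / visits[(x, a)]                       # step sizes with sum = inf, sum of squares < inf
        target = cost + alpha * min(Q[(y, b)] for b in actions(y))
        Q[(x, a)] += gamma * (target - Q[(x, a)])          # update (4)
        x = y
    return Q
```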
Stochastic Approximation

Consider the iteration

$r_{t+1} = H r_t$.     (5)

Now suppose that we cannot compute $H r$ but have noisy estimates $(H r + w)$ with $E[w] = 0$. One alternative
is to approximate (5) by drawing several samples $H r + w_i$ and averaging them, in order to obtain an estimate
of $H r$. In this case, we would have

$r_{t+1} = \frac{1}{k} \sum_{i=1}^{k} (H r_t + w_i)$.

Letting

$r_t^{(i)} = \frac{1}{i} \sum_{j=1}^{i} (H r_t + w_j)$,

we have

$r_t^{(i+1)} = \frac{i}{i+1} r_t^{(i)} + \frac{1}{i+1} (H r_t + w_{i+1})$.

Therefore, $r_{t+1} = r_t^{(k)}$. Finally, we may consider replacing samples $H r_t + w_i$ with samples $H r_t^{(i-1)} + w_i$,
obtaining the final form

$r_{t+1} = (1 - \gamma_t) r_t + \gamma_t (H r_t + w_t)$.
A simple application of these ideas involves estimating the expected value of a random variable by drawing
i.i.d. samples.

Example 1 Let $v_1, v_2, \dots$ be i.i.d. random variables. Given

$r_{t+1} = \frac{t}{t+1} r_t + \frac{1}{t+1} v_{t+1}$,

$r_t$ is the sample mean of $v_1, \dots, v_t$, which corresponds to the stochastic approximation iteration with step size
$\gamma_t = \frac{1}{t+1}$; note that $\sum_{t=1}^{\infty} \gamma_t = \infty$ and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$.

The conditions

$\sum_{t=1}^{\infty} \gamma_t = \infty$     (6)

and

$\sum_{t=1}^{\infty} \gamma_t^2 < \infty$     (7)
are standard in stochastic approximation algorithms. A simple argument illustrates the need for condition
(6): if the total sum of step sizes is finite, the iterates $r_t$ are confined to a region around the initial guess $r_0$, so
that, if $r_0$ is far enough from any solution of $r = Hr$, the algorithm cannot possibly converge. Moreover,
since we have noisy estimates of $Hr$, convergence of $r_{t+1} = (1 - \gamma_t) r_t + \gamma_t H r_t + \gamma_t w_t$ requires that the noise
term $\gamma_t w_t$ decreases with time, motivating condition (7).
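The sample-mean iteration of Example 1 can be written in a few lines; the uniform sampling distribution below is just an illustrative choice.

```python
import random

def stochastic_approximation(sample, r0, n_steps):
    """Iterate r_{t+1} = (1 - g_t) r_t + g_t * sample(), with g_t = 1/(t+1).

    With sample() returning i.i.d. draws, this is exactly the running sample mean
    of Example 1, and the step sizes satisfy conditions (6) and (7)."""
    r = r0
    for t in range(1, n_steps + 1):
        gamma = 1.0 / (t + 1)
        r = (1 - gamma) * r + gamma * sample()
    return r

# Estimate E[v] for v uniform on [0, 1]; the iterate should approach 0.5.
print(stochastic_approximation(lambda: random.random(), r0=0.0, n_steps=100000))
```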
We will consider two approaches to analyzing the stochastic approximation algorithm

$r_{t+1} = (1 - \gamma_t) r_t + \gamma_t (H r_t + w_t) = r_t + \gamma_t (H r_t + w_t - r_t) = r_t + \gamma_t S(r_t, w_t)$,     (8)

where we define $S(r_t, w_t) = H r_t + w_t - r_t$. The two approaches are
1. Lyapunov function analysis
2. ODE approach
3.1
The question we try to answer is Does (8) converge? If so, where does it converge to?
We will rst illustrate the basic ideas of Lyapunov function analysis by considering a deterministic case.
3.1.1
Deterministic Case
In the deterministic case, we have $S(r, w) = S(r)$. Suppose there exists some unique $r^*$ such that

$S(r^*) = H r^* - r^* = 0$.

The basic idea is to show that a certain measure of distance between $r_t$ and $r^*$ is decreasing.

Example 2 Suppose that $F$ is a contraction with respect to $\| \cdot \|_2$, with contraction factor $\alpha$. Then

$r_{t+1} = r_t + \gamma_t (F r_t - r_t)$

converges.

Proof: Since $F$ is a contraction, there exists a unique $r^*$ s.t. $F r^* = r^*$. Let

$V(r) = \| r - r^* \|_2$.
We will show $V(r_{t+1}) \le V(r_t)$. Observe

$V(r_{t+1}) = \| r_{t+1} - r^* \|_2$
$= \| r_t + \gamma_t (F r_t - r_t) - r^* \|_2$
$= \| (1 - \gamma_t)(r_t - r^*) + \gamma_t (F r_t - r^*) \|_2$
$\le (1 - \gamma_t) \| r_t - r^* \|_2 + \gamma_t \| F r_t - r^* \|_2$
$\le (1 - \gamma_t) \| r_t - r^* \|_2 + \gamma_t \alpha \| r_t - r^* \|_2$
$= \| r_t - r^* \|_2 - (1 - \alpha) \gamma_t \| r_t - r^* \|_2$.

Since $V(r_t)$ is nonincreasing and bounded below, it converges to some $\epsilon \ge 0$, and $\| r_t - r^* \|_2 \ge \epsilon$ for all $t$. Hence

$\| r_{t+1} - r^* \|_2 \le \| r_t - r^* \|_2 - (1 - \alpha) \gamma_t \epsilon$
$\le \| r_{t-1} - r^* \|_2 - (1 - \alpha)(\gamma_t + \gamma_{t-1}) \epsilon$
$\vdots$
$\le \| r_0 - r^* \|_2 - (1 - \alpha) \epsilon \sum_{l=1}^{t} \gamma_l$.

Hence

$\epsilon \le \frac{\| r_0 - r^* \|_2}{(1 - \alpha) \sum_{l=1}^{t} \gamma_l} \to 0$ as $t \to \infty$,

and we thus have $\epsilon = 0$.     $\square$
We can isolate several key aspects in the convergence argument used for the example above:
1. We define a distance $V(r_t) \ge 0$ indicating how far $r_t$ is from a solution $r^*$ satisfying $S(r^*) = 0$;
2. We show that the distance is nonincreasing in $t$;
3. We show that the distance indeed converges to 0.
The argument also involves the basic result that every nonincreasing sequence bounded below converges,
used to show that the distance converges.
Motivated by these points, we introduce the notion of a Lyapunov function:

Definition 1 We call a function $V$ a Lyapunov function if $V$ satisfies
(a) $V(\cdot) \ge 0$;
(b) $(\nabla_r V(r))^T S(r) \le 0$;
(c) $\nabla V(r) = 0 \iff S(r) = 0$.
3.1.2 Stochastic Case

The argument used for convergence in the stochastic case parallels the argument used in the deterministic
case. Let $\mathcal{F}_t$ denote all information that is available at stage $t$, and let

$\bar{S}_t(r) = E\left[ S(r, w_t) \mid \mathcal{F}_t \right]$.

Then we require a Lyapunov function $V$ satisfying

$V(\cdot) \ge 0$,     (9)

$\nabla V(r)^T \bar{S}_t(r) \le -c V(r)$ for some $c > 0$,     (10)

$\| \nabla V(r) - \nabla V(\bar{r}) \|_2 \le L \| r - \bar{r} \|_2$,     (11)

$E\left[ \| S(r_t, w_t) \|_2^2 \mid \mathcal{F}_t \right] \le K_1 + K_2 V(r_t)$.     (12)

We will also use the following lemma. Let $X_t$, $Y_t$ and $Z_t$ be nonnegative
random variables with $\sum_{t=1}^{\infty} Y_t < \infty$ with probability 1. Suppose also that

$E\left[ X_{t+1} \mid X_i, Y_i, Z_i, i \le t \right] \le X_t + Y_t - Z_t$.

Then
1. $X_t$ converges to a limit with probability 1;
2. $\sum_{t=1}^{\infty} Z_t < \infty$.

Theorem 2 If (9), (10), (11), and (12) are satisfied and we have $\sum_{t=1}^{\infty} \gamma_t = \infty$ and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$, then
1. $V(r_t)$ converges;
2. $\lim_{t \to \infty} \nabla V(r_t) = 0$;
3. every limit point of $r_t$ is a stationary point of $V$.
March 1
Handout #11
Lecture Note 8
Recall from the previous lecture that, under conditions (9)-(12) and step sizes satisfying $\sum_{t=0}^{\infty} \gamma_t = \infty$
and $\sum_{t=0}^{\infty} \gamma_t^2 < \infty$, we have:
$V(r_t)$ converges;
$\lim_{t \to \infty} \nabla V(r_t) = 0$;
every limit point $\bar{r}$ of $r_t$ satisfies $\nabla V(\bar{r}) = 0$.

We will prove the convergence for the special case where $V(r) = \frac{1}{2} \| r - r^* \|_2^2$ for some $r^*$. In this case,
under $\sum_{t=0}^{\infty} \gamma_t = \infty$ and $\sum_{t=0}^{\infty} \gamma_t^2 < \infty$,

$r_t \to r^*$, w.p. 1.
Lemma 1 (Supermartingale Convergence) Let $X_t$, $Y_t$ and $Z_t$ be nonnegative
random variables with $\sum_{t=1}^{\infty} Y_t < \infty$ with probability 1. Suppose also that

$E\left[ X_{t+1} \mid \mathcal{F}_t \right] \le X_t + Y_t - Z_t$, w.p. 1.

Then
1. $X_t$ converges to a limit with probability 1;
2. $\sum_{t=1}^{\infty} Z_t < \infty$.

The key idea for the proof of Theorem 2 is to show that $V(r_t)$ is (essentially) a supermartingale, so that $V(r_t)$ converges,
and then to show that it converges to zero w.p. 1.
Proof: [Theorem 2]

$E\left[ V(r_{t+1}) \mid \mathcal{F}_t \right] = E\left[ \tfrac{1}{2} \| r_{t+1} - r^* \|_2^2 \mid \mathcal{F}_t \right]$
$= \tfrac{1}{2} (r_t - r^*)^T (r_t - r^*) + \gamma_t (r_t - r^*)^T E\left[ S_t \mid \mathcal{F}_t \right] + \tfrac{\gamma_t^2}{2} E\left[ S_t^T S_t \mid \mathcal{F}_t \right]$.

Since $V(r_t) = \tfrac{1}{2} \| r_t - r^* \|_2^2$, we have $\nabla V(r_t) = (r_t - r^*)$. Then

$E\left[ V(r_{t+1}) \mid \mathcal{F}_t \right]
= V(r_t) + \gamma_t \nabla V(r_t)^T E\left[ S_t \mid \mathcal{F}_t \right] + \tfrac{\gamma_t^2}{2} E\left[ \| S_t \|_2^2 \mid \mathcal{F}_t \right]$
$\le V(r_t) - \gamma_t c V(r_t) + \tfrac{\gamma_t^2}{2} \left( K_1 + K_2 V(r_t) \right)$
$= V(r_t) - \left( \gamma_t c - \tfrac{\gamma_t^2 K_2}{2} \right) V(r_t) + \tfrac{\gamma_t^2 K_1}{2}$.

Identify $X_t = V(r_t)$, $Z_t = \left( \gamma_t c - \tfrac{\gamma_t^2 K_2}{2} \right) V(r_t)$ and $Y_t = \tfrac{\gamma_t^2 K_1}{2}$. Since
$\sum_{t=0}^{\infty} \gamma_t^2 < \infty$, $\gamma_t$ must converge to zero, so that $Z_t \ge 0$ for all large enough $t$. Moreover,

$\sum_{t=0}^{\infty} Y_t = \frac{K_1}{2} \sum_{t=0}^{\infty} \gamma_t^2 < \infty$.

By the supermartingale convergence lemma, $V(r_t)$ converges and

$\sum_{t=0}^{\infty} \left( \gamma_t c - \tfrac{\gamma_t^2 K_2}{2} \right) V(r_t) < \infty$, w.p. 1.

Suppose that $V(r_t) \to \epsilon > 0$. Then, by the hypothesis that $\sum_{t=0}^{\infty} \gamma_t = \infty$ and $\sum_{t=0}^{\infty} \gamma_t^2 < \infty$, we must have

$\sum_{t=0}^{\infty} \left( \gamma_t c - \tfrac{\gamma_t^2 K_2}{2} \right) V(r_t) = \infty$,

which is a contradiction. Therefore

$\lim_{t \to \infty} \| r_t - r^* \|_2^2 = 0$ w.p. 1, i.e., $r_t \to r^*$ w.p. 1.     $\square$
Suppose that $F$ is a $\| \cdot \|_2$ contraction with factor $\alpha$. Suppose also that components $i_t$, $t = 1, 2, \dots$, are chosen i.i.d. with $P(i_t = i) =
\pi_i > 0$, and only the chosen component is updated. Then

$r_{t+1}(i) = r_t(i) + \gamma_t \pi_i \left( (F r_t)(i) - r_t(i) \right) + \gamma_t \underbrace{\left[ 1(i_t = i) - \pi_i \right] \left[ (F r_t)(i) - r_t(i) \right]}_{w_t(i)}$.

Define

$\Pi = \mathrm{diag}(\pi_1, \pi_2, \dots, \pi_n)$;

then

$r_{t+1} = r_t + \gamma_t \big( \underbrace{\Pi (F r_t - r_t)}_{E[S_t \mid \mathcal{F}_t]} + w_t \big)$.

Let $V(r) = \frac{1}{2} (r - r^*)^T \Pi^{-1} (r - r^*) \ge 0$. Then we have $\nabla V(r) = \Pi^{-1} (r - r^*)$. We also have

$\nabla V(r_t)^T E\left[ S_t \mid \mathcal{F}_t \right]
= (r_t - r^*)^T \Pi^{-1} \Pi (F r_t - r_t) = (r_t - r^*)^T (F r_t - r^* + r^* - r_t)$
$= -(r_t - r^*)^T (r_t - r^*) + (r_t - r^*)^T (F r_t - r^*)$
$\le - \| r_t - r^* \|_2^2 + \| r_t - r^* \|_2 \, \| F r_t - r^* \|_2$
$\le - \| r_t - r^* \|_2^2 + \alpha \| r_t - r^* \|_2^2 = -(1 - \alpha) \| r_t - r^* \|_2^2$.

We finally have

$E\left[ \| F r_t - r_t \|_2^2 \mid \mathcal{F}_t \right] = \| F r_t - r_t \|_2^2
\le \left( \| F r_t - r^* \|_2 + \| r_t - r^* \|_2 \right)^2
\le (1 + \alpha)^2 \| r_t - r^* \|_2^2
\le 2 (1 + \alpha)^2 \max_i \pi_i \, V(r_t)$,

so that the conditions of Theorem 2 hold and the iteration converges to $r^*$.
Q-learning

We can now write the Q-learning update in the stochastic approximation form:

$Q_{t+1}(x, a) = Q_t(x, a)
+ \gamma_t(x, a) \underbrace{\left[ g_a(x) + \alpha \sum_y P_a(x, y) \min_{a'} Q_t(y, a') - Q_t(x, a) \right]}_{(HQ_t)(x,a) - Q_t(x,a)}
+ \gamma_t(x, a) \underbrace{\left[ \alpha \min_{a'} Q_t(x_{t+1}, a') - \alpha \sum_y P_a(x, y) \min_{a'} Q_t(y, a') \right]}_{w_t}$,

where

$\gamma_t(x, a) = 0$ if $(x, a) \ne (x_t, a_t)$, $\quad \gamma_t(x_t, a_t) = \gamma_t$,

$E\left[ \gamma_t w_t \mid \mathcal{F}_t \right] = 0$, and $|w_t| \le 2 \alpha \| Q_t \|_\infty$.

Then, we have

$Q_{t+1} = Q_t + \gamma_t (H Q_t - Q_t) + \gamma_t w_t$.

We can use the following theorem to show that Q-learning converges, as long as every state-action
pair is visited infinitely many times.

Theorem 4 Let $r_{t+1}(i) = r_t(i) + \gamma_t(i) \left[ (H r_t)(i) - r_t(i) + w_t(i) \right]$. Then, if

$E\left[ w_t \mid \mathcal{F}_t \right] = 0$,

$\sum_{t=0}^{\infty} \gamma_t(i) = \infty$, $\sum_{t=0}^{\infty} \gamma_t(i)^2 < \infty$, $\forall i$, and

$H$ is a maximum-norm contraction,

then $r_t \to r^*$ w.p. 1 (where $H r^* = r^*$).
Comparing Theorems 2 and 4, note that, if H is a maximumnorm contraction, convergence occurs under
weaker conditions than if it is an Euclidean norm contraction.
Corollary 1 If $\sum_{t=0}^{\infty} \gamma_t(x, a) = \infty$ and $\sum_{t=0}^{\infty} \gamma_t(x, a)^2 < \infty$ for every state-action pair $(x, a)$, then $Q_t \to Q^*$
w.p. 1.
ODE Approach

Oftentimes, the behavior of $r_{t+1} = r_t + \gamma_t S(r_t, w_t)$ may be understood by analyzing the following ODE
instead:

$\dot{r}_t = E\left[ S(r_t, w_t) \right]$.

The main idea for the ODE approach is as follows. Look at intervals $[t_m, t_{m+1})$ such that

$\sum_{t=t_m}^{t_{m+1}-1} \gamma_t = \epsilon$, where $\epsilon$ is small,     (1)

and let $\bar{r}_m = r_{t_m}$. Then

$\bar{r}_{m+1} = r_{t_{m+1}} = \bar{r}_m + \sum_{t=t_m}^{t_{m+1}-1} \gamma_t S(r_t, w_t)$
$\approx \bar{r}_m + \sum_{t=t_m}^{t_{m+1}-1} \gamma_t S(\bar{r}_m, w_t) + O(\epsilon^2)$     (2)
$\approx \bar{r}_m + \epsilon \, E\left[ S(\bar{r}_m, w) \right] + O(\epsilon^2)$.     (3)

Therefore we can think of the stochastic scheme as a discrete version of the ODE:

$\bar{r}_{m+1} = \bar{r}_m + \epsilon E\left[ S(\bar{r}_m, w) \right] \quad \Longleftrightarrow \quad \dot{r} = E\left[ S(r, w) \right]$.

To make the argument rigorous, steps (1), (2) and (3) have to be justified.
March 3
Handout #12
Lecture Note 9
Recall the Q-learning update from the previous lecture:

$Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + \gamma_t \left[ g_{a_t}(x_t) + \alpha \min_{a'} Q_t(x_{t+1}, a') - Q_t(x_t, a_t) \right]$.

In this lecture we study the $E^3$ algorithm, which handles exploration and exploitation explicitly by partitioning the state space into:

known states: states in a set $N$ that have been visited sufficiently many times to ensure that the estimates of $P_a(x, y)$ and $g_a(x)$ are accurate
with high probability;

unknown states: all remaining states; an unknown state is moved to $N$ when it has been visited at least $m$ times, for some number $m$.
We introduce two MDPs, $\hat{M}_N$ and $M_N$. The MDP $\hat{M}_N$ is presented in Fig. 1. Its main characteristic is that
the unknown states from the original MDP are merged into a recurrent state $x_0$ with cost $g_a(x_0) = g_{\max}$, $\forall a$.
The other MDP, $M_N$, has the same structure as $\hat{M}_N$ but the estimated transition probabilities and costs are
replaced with their true values.
We now introduce the algorithm.

1.1 Algorithm

We will first consider a version of $E^3$ which assumes knowledge of $J^*$; the assumption will be lifted later.
The $E^3$ algorithm proceeds as follows.

1. Let $N = \emptyset$. Pick an arbitrary state $x_0$. Let $k = 0$.
2. If $x_k \notin N$, perform balanced wandering:
$a_k$ = action chosen fewest times at state $x_k$.
If $x_k \in N$, attempt exploitation or exploration on the known-state MDP $\hat{M}_N$ (returning $x_k$ and $\hat{M}_N$ when a near-optimal policy has been found).

[Figure 1: the Markov decision process $\hat{M}_N$, in which all unknown states are merged into a single recurrent state $x_0$.]
Theorem 1 With probability no less than $1 - \delta$, $E^3$ will stop after a number of actions and an amount of computation
time that are

$\mathrm{poly}\left( \frac{1}{\epsilon}, \frac{1}{\delta}, |S|, \frac{1}{1 - \alpha}, g_{\max} \right)$,

and return a state $x$ and a policy $u$ such that $J_u(x) \le J^*(x) + \epsilon$.
1.2
Main Points
(iv) If exploitation is not possible, then there is an exploration policy that reaches an unknown state after
T transitions with high probability.
To show the first main point, we consider the following lemma.

Lemma 1 Suppose a state $x$ has been visited at least $m$ times, with each action $a \in A_x$ having been executed
at least $\frac{m}{|A_x|}$ times. Then, if

$m = \mathrm{poly}\left( |S|, \frac{1}{\epsilon}, \frac{1}{1 - \alpha}, T, g_{\max}, \frac{1}{\delta}, \log \frac{1}{\delta}, \mathrm{var}(g) \right)$,

we have, w.p. $1 - \delta$,

$|\hat{P}_a(x, y) - P_a(x, y)| = O\left( \frac{\epsilon (1 - \alpha)^2}{|S| \, g_{\max}} \right)$,

$|\hat{g}_a(x) - g_a(x)| = O\left( \frac{\epsilon (1 - \alpha)^2}{|S| \, g_{\max}} \right)$.
The proof of this lemma is a direct application of the Chernoff bound, which states that, if $z_1, z_2, \dots$
are i.i.d. Bernoulli random variables, then

$\frac{1}{n} \sum_{i=1}^{n} z_i \to E z_1$     (SLLN)

and

$P\left( \left| \frac{1}{n} \sum_{i=1}^{n} z_i - E z_1 \right| > \epsilon \right) \le 2 \exp\left( -\frac{n \epsilon^2}{2} \right)$.

Note also that, after $(m - 1)|S|$ balanced wandering steps, at least one state will have to become known.
The main point iii(a) follows from the next lemma.

Lemma 2 For all policies $u$,

$J_{u,M_N}(x) \ge J_u(x)$, $\forall x$.

Proof: The claim is trivial for $x \notin N$, since $J_{u,M_N}(x) = \frac{g_{\max}}{1 - \alpha} \ge J_u(x)$. For the remaining states, note that every
trajectory incurs the same costs in $M$ and in $M_N$ until the first time $T$ it reaches an unknown state, and from
that point on $M_N$ incurs the maximum possible cost:

$J_u(x) = E\left[ \sum_{t=0}^{T-1} \alpha^t g_u(x_t) + \sum_{t=T}^{\infty} \alpha^t g_u(x_t) \right]
\le E\left[ \sum_{t=0}^{T-1} \alpha^t g_u(x_t) + \alpha^T \frac{g_{\max}}{1 - \alpha} \right] = J_{u,M_N}(x)$.     $\square$
To prove the main point iii(b), we first introduce the following definition.

Definition 1 Let $M$ and $\hat{M}$ be two MDPs on the same state and action spaces. $\hat{M}$ is a $\Delta$-approximation to $M$ if

$|\hat{P}_a(x, y) - P_a(x, y)| \le \Delta$ and $|\hat{g}_a(x) - g_a(x)| \le \Delta$.

Lemma 3 If

$T \ge \frac{1}{1 - \alpha} \log \frac{2 g_{\max}}{\epsilon (1 - \alpha)}$     (1)

and $\hat{M}$ is an $O\left( \frac{\epsilon (1 - \alpha)^2}{|S| \, g_{\max}} \right)$-approximation of $M$, then, $\forall u$,

$\| J_{u,\hat{M}} - J_{u,M} \|_\infty \le \epsilon$.
Sketch of proof: Take a policy $u$ and a start state $x$. We consider paths of length $T$ starting from $x$:

$p = (x_0, x_1, x_2, \dots, x_T)$.

Note that

$J_{u,M}(x) = \sum_p P_{u,M}(p) g_u(p) + E\left[ \sum_{t=T+1}^{\infty} \alpha^t g_u(x_t) \right]$,

where

$P_{u,M}(p) = P_{u,M}(x_0, x_1) P_{u,M}(x_1, x_2) \cdots P_{u,M}(x_{T-1}, x_T)$

is the probability of observing path $p$,

$g_u(p) = \sum_{t=0}^{T} \alpha^t g_u(x_t)$,

and the tail term satisfies

$E\left[ \sum_{t=T+1}^{\infty} \alpha^t g_u(x_t) \right] \le \frac{\alpha^{T+1} g_{\max}}{1 - \alpha}$.

We split the paths into two groups:

(a) paths containing at least one transition $(x_t, x_{t+1})$ in the set $R$ of transitions such that $P_u(x_t, x_{t+1}) \le \Delta$. Note that the
total probability associated with such paths is less than or equal to $\Delta |S| T$, since the probability of any
such path is less than or equal to $\Delta$, starting from each state in each transition there are at most $|S|$
possible small-probability transitions, and there are $T$ transitions where this can occur. Therefore

$\sum_{p \in R} P_u(p) g_u(p) \le \frac{g_{\max}}{1 - \alpha} \sum_{p \in R} P_u(p) \le \frac{\Delta |S| T g_{\max}}{1 - \alpha}$.

We can follow the same principle with the MDP $\hat{M}$ to conclude that

$\sum_{p \in R} \hat{P}_u(p) \hat{g}_u(p) \le \frac{(\Delta + \Delta) |S| T g_{\max}}{1 - \alpha}$.

Therefore, we have

$\left| \sum_{p \in R} \hat{P}_u(p) \hat{g}_u(p) - \sum_{p \in R} P_u(p) g_u(p) \right| \le \frac{(\Delta + 2\Delta) |S| T g_{\max}}{1 - \alpha}$.

(b) paths containing no transition in $R$. For such paths,

$(1 - \Delta)^T P_u(p) \le \hat{P}_u(p) \le (1 + \Delta)^T P_u(p)$,

so that

$(1 - \Delta)^T \left[ J_{u,T} - \Delta T \right] - \frac{\epsilon}{4} \le \hat{J}_{u,T} \le (1 + \Delta)^T \left[ J_{u,T} + \Delta T \right] + \frac{\epsilon}{4}$.

The theorem follows by considering an appropriate choice of $\Delta$.
The main point (iv) says that if exploitation is not possible, then exploration is. We show it by the
following lemma.

Lemma 4 For any $x \in N$, one of the following must hold:

(a) there exists $u$ in $M_N$ such that $J_{u,T}^{M_N}(x) \le J_T^*(x) + \epsilon$, or

(b) there exists $u$ such that the probability that a walk of $T$ steps will terminate outside $N$ exceeds
$\frac{\epsilon (1 - \alpha)}{g_{\max}}$.

Proof: Decompose the optimal $T$-step cost according to whether the corresponding path stays inside $N$:

$J_T^*(x) = \sum_{p \text{ inside } N} P_u^N(p) g_u(p) + \sum_{q \text{ leaving } N} P_u(q) g_u(q)$.

If (a) does not hold, the paths staying inside $N$ alone cannot account for $J_T^*(x) + \epsilon$, which implies

$\sum_{q \text{ leaving } N} P_u(q) g_u(q) > \epsilon$.

Therefore, since $g_u(q) \le \frac{g_{\max}}{1 - \alpha}$ for every path $q$,

$\sum_{q \text{ leaving } N} P_u(q) > \frac{\epsilon (1 - \alpha)}{g_{\max}}$,

which is (b).     $\square$
In order to complete the proof of Theorem 1 from the four lemmas above, we have to consider the
probabilities of two forms of failure:
failure to stop the algorithm with a near-optimal policy;
failure to perform enough exploration in a timely fashion.
The first point is addressed by Lemmas 1, 2 and 3, which establish that, if the algorithm stops, with high
probability the policy produced is near-optimal. The second point follows from Lemma 4, which shows that
each attempt to explore is successful with some non-negligible probability. By applying the Chernoff bound,
it can be shown that, after a number of attempts that is polynomial in the quantities of interest, exploration
will occur with high probability.
References
[1] M. Kearns and S. Singh, "Near-Optimal Reinforcement Learning in Polynomial Time," Machine Learning,
49(2), pp. 209-232, 2002.
March 8
Handout #13
Lecture Note 10
DP problems are centered around the cost-to-go function $J^*$ or the Q-factor $Q^*$. In certain problems, such as
linear-quadratic-Gaussian (LQG) systems, $J^*$ exhibits some structure which allows for its compact representation:

Example 1 In an LQG system, we have

$x_{k+1} = A x_k + B u_k + w_k$,
$g(x, u) = x^T D x + u^T E u$, $\quad x \in \mathbb{R}^n$,

where $x_k$ represents the system state, $u_k$ represents the control action, and $w_k$ is Gaussian noise. It can
be shown that the optimal policy is of the form

$u_k = L_k x_k$,

and the optimal cost-to-go function is of the form

$J^*(x) = x^T R x + S$, $\quad R \in \mathbb{R}^{n \times n}$, $S \in \mathbb{R}$,

where $R$ is a symmetric matrix. It follows that, if there are $n$ state variables (i.e., $x_k \in \mathbb{R}^n$), storing
$J^*$ requires storing $n(n+1)/2 + 1$ real numbers, corresponding to the matrix $R$ and the scalar $S$. The
computational time and storage space required is quadratic in the number of state variables.
In general, we are not as lucky as in the LQG case, and exact representation of $J^*$ requires that
it be stored as a lookup table, with one value per state. Therefore, the space required is proportional to the size of
the state space, which grows exponentially with the number of state variables. This problem, known as the
curse of dimensionality, makes dynamic programming intractable in the face of most problems of practical scale.

Example 2 Consider the game of Tetris, represented in Fig. 1. As seen in previous lectures, this game may
be represented as an MDP, and a possible choice of state is the pair $(B, P)$, in which $B \in \{0, 1\}^{n \times m}$ represents
the board configuration and $P$ represents the current falling piece. More specifically, we have $b(i, j) = 1$ if
position $(i, j)$ of the board is filled, and $b(i, j) = 0$ otherwise.
If there are $p$ different types of pieces, and the board has dimension $n \times m$, the number of states is on the
order of $p \cdot 2^{nm}$, which grows exponentially with $n$ and $m$.
Since exact solution of large-scale MDPs is intractable, we consider approximate solutions instead.
There are two directions for approximations which are directly based on dynamic programming principles:

(1) Approximation in the policy space

Suppose that we would like to find a policy minimizing the average cost in an MDP. We can pose this
as the optimization problem

$\min_{u \in U} \lambda(u)$,     (1)

where $U$ is the set of all possible policies. In principle, we could solve (1) by enumerating all policies and
choosing the one with the smallest value of $\lambda(u)$; however, note that the number of policies is exponential
in the number of states: we have $|U| = |A|^{|S|}$. If there is no special structure to $U$, this problem requires
even more computational time than solving Bellman's equation for the cost-to-go function. A possible
approach to approximating the solution is to transform problem (1) by considering only a tractable
subset of all policies:

$\min_{u \in F} \lambda(u)$,

where $F$ is a subset of the policy space. If $F$ has some appropriate format, e.g., we consider policies
that are parameterized by a continuous variable, we may be able to solve this problem without having
to enumerate all policies in the set, but by using some standard optimization method such as gradient
descent. Methods based on this idea are called approximations in the policy space, and will be studied
later on in this class.
(2) Cost-to-go Function Approximation

Another approach to approximating the dynamic programming solution is to approximate the cost-to-go
function. The underlying idea for cost-to-go function approximation is that $J^*$ has some structure that
allows for an approximate compact representation

$J^*(x) \approx \tilde{J}(x, r)$,

where $r$ is a parameter vector of relatively small dimension.
If we restrict ourselves to approximations of this form, the problem of computing and storing $J^*$ is reduced
to computing and storing the parameter $r$, which requires considerably less space. Some examples of
approximation architectures $\tilde{J}$ that may be considered are as follows:

Example 3

$\tilde{J}(x, r) = \cos(x^T r)$ — nonlinear in $r$;

$\tilde{J}(x, r) = r_0 + r_1^T x$ — linear in $r$;

$\tilde{J}(x, r) = r_0 + r_1^T \phi(x)$ — linear in $r$, for some feature mapping $\phi(\cdot)$.
In the next few lectures, we will focus on cost-to-go function approximation. Note that there are two
important preconditions to the development of an effective approximation. First, we need to choose a
parameterization J that can closely approximate the desired cost-to-go function. In this respect, a suitable
choice requires some practical experience or theoretical analysis that provides rough information on the shape
of the function to be approximated. Regularities associated with the function, for example, can guide the
choice of representation. Designing an approximation architecture is a problem-specific task and it is not
the main focus of this course; however, we provide some general guidelines and illustration via case studies
involving queueing problems. Second, given a parameterization for the cost-to-go function approximation,
we need an efficient algorithm that computes appropriate parameter values.
We will start by describing usual choices for approximation architectures.
Approximation Architectures
2.1
Neural Networks
A common choice of approximation architecture is a neural network, represented in the figure below.
The underlying idea is as follows: we first convert the original state $x$ into a vector $\bar{x} \in \mathbb{R}^n$, for some $n$. This
vector is used as the input to a linear layer of the neural network, which maps the input to a vector $y \in \mathbb{R}^m$,
for some $m$, such that $y_j = \sum_{i=1}^{n} r_{ij} \bar{x}_i$. The vector $y$ is then used as the input to a sigmoidal layer, which
outputs a vector $z \in \mathbb{R}^m$ with the property that $z_i = f(y_i)$, where $f(\cdot)$ is a sigmoidal
function. A sigmoidal function is any function with the following properties:
function is any function with the following properties:
1. monotonically increasing
2. differentiable
3. bounded
Fig. 3 represents a typical sigmoidal function.
The combination of a linear and a sigmoidal layer is called a perceptron, and a neural network consists
of a chain of one or more perceptrons (i.e., the output of a sigmoidal layer can be redirected to another
sigmoidal layer, and so on). Finally, the output of the neural network consists of a weighted sum of the
output z of the final layer:
X
r i zi .
g(z) =
i
rij
Input
Linear Layer
ri
+
Sigmoidal Layer
2.2
Another common choice for approximation architecture is based on partitioning of the state space. The
underlying idea is that similar states may be grouped together. For instance, in an MDP involving
continuous state variables (e.g., S = <2 ), one may consider partitioning the state space by using a grid (e.g.,
divide <2 in squares). The simplest case would be to use a uniform grid, and assume that the cost-to-go
function remains constant in each of the partitions. Alternatively, one may use functions that are linear in
each of the partitions, or splines, and so on. One may also consider other kinds of partition beyond uniform
grids representing the partitioning of the state space as a tree or using adaptive methods for choosing the
partitions, for instance.
2.3
Features
A special case of state space partitioning consists of mapping states to features, and considering approximations of the cost-to-go function that are functions of the features. The hope is that the feature would capture
aspects of the state that are relevant for the decision-making process and discard irrelevant details, thus
providing a more compact representation. At the same time, one would also hope that, with an appropriate
choice of features, the mapping from features to the (approximate) cost-to-go function would be smoother
than that from the original state to the cost-to-go function, thereby allowing for successful approximation
with architectures that are suitable for smooth mappings (e.g., polynomials). This process is represented
below.
(x), r).
State x features f (x) J(f
J (x) J (f (x)) such that f (x) J is smooth.
Example 4 Consider the tetris game. What features we should choose?
1. |h(i) h(i + 1)| (height)
2. how many holes
3. max h(i)
March 10
Handout #14
Lecture Note 11
In this lecture, we will consider the problem of supervised learning. The setup is as follows. We have
pairs (x, y), distributed according to a joint distribution P (x, y). We would like to describe the relationship
between x and y through some function f chosen from a set of available functions C, so that y f(x).
Ideally, we would choose f by solving
min
f
2
1
yi f(xi )
n i=1
(training error)
instead. It also seems that, the richer the class C is, the better the chance to correctly describe the relationship
between x and y. In this lecture, we will show that this is not the case, and the appropriate complexity of C
and the selection of a model for describing how x and y related must be guided by how much data is actually
available. This issue is illustrated in the following example.
Example
1.5
2.5
3.5
1.5
2.5
3.5
It seems intuitive in the previous example that a line may be the best description for the relationship
between x and y, even though a polynomial of degree 3 describes the data perfectly in both cases and no
linear function is able to describe the data perfectly in the second case. Is the intuition correct, and if so,
how can we decide on an appropriate representation, if relying solely on the training error does not seem
completely reasonable?
The essence of the problem is as follows. Ultimately, what we are interested in is the ability of our tted
curve to predict future data, rather than simply explaining the observed data. In other words, we would like
to choose a predictor that minimizes the expected error |y(x) y(x)| over all possible x. We call this the
test error. The average error over the data set is called the training error.
We will show that training error and test error can be related through a measure of the complexity of
the class of predictors being considered. Appropriate choice of a predictor will then be shown to require
balancing the training error and the complexity of the predictors being considered. Their relationship is
described in Fig. 1, where we plot test and training errors versus complexity of the predictor class C when
the number of samples is xed. The main diculty is that, as indicated in Fig. 1, there exists a tradeo
between the complexity and the errors, i.e., training error and the test error; while the approximation error
over the sampled points goes to zero as we consider richer approximation classes, the same is not true for
the test error, which we are ultimately interested in minimizing. This is due to the fact that, with only
nite amount of data and noisy observations yi , if the class C is too rich we may run into overtting
tting the noise in the observations, rather than the underlying structure linking x and y. This leads to poor
generalization from the training error to the test error.
We will investigate how bounds on the test error based on the training error and the complexity of C may
be developed for the special case of classication problems i.e., problems where y {1, +1}, which may
be seen as an indicating whether xi belongs in a certain set or not. The ideas and results easily generalize
to general function approximation.
2
Error
test error
training error
Optimal degree
3.1
Suppose that, given n samples (xi , yi ), i = 1, . . . , n, we need to choose a classier hi from a nite set of
classiers f1 , . . . , fd .
Dene
=
E[|y fk (x)|]
n
1
=
|yi fk (xi )|.
n i=1
(k)
n (k)
In words, (k) is the test error associated with classier fk , and n (k) is a random variable representing the
training error associated with classier fk over the samples (xi , yi ), i = 1, . . . , n. As described before, we
would like to nd k = arg mink (k), but cannot compute directly. Let us consider using instead
k = arg min n (k)
k
n (k) +
training error + ,
and
(k)
+
n (k)
n (k ) +
(k ) + 2.
In words, if the training error is close to the test error for all classiers fk , then using k instead of k is
near-optimal. But can we expect (1) to hold?
Observe that |yi fk (xi )| are i.i.d. Bernoulli random variables. From the strong law of large numbers,
we must have
n (k) (k) w.p. 1.
This means that, if there are sucient samples, (1) should be true. Having only nitely many samples, we
face two questions:
(1) How many samples are needed before we have high condence that n (k) is close to (k)?
(2) Can we show that n (k) approaches (k) equally fast for all fk C?
The rst question is resolved by the Cherno bound: For i.i.d. Bernoulli random variables x i , i = 1, . . . , n,
we have
n
1
P
xi Ex1 > 2 exp(2n2 )
n i=1
Moreover, since there are only nitely many functions in C, uniform convergence of n (k) to (k) follows
immediately:
P (k : |(k) (k)| > )
= P (k {|
(k) (k)| > })
d
k=1
P ({|
(k) (k)| > })
2d exp(2n2 ).
Therefore we have the following theorem.
Theorem 1 With probability at least 1 , the training set (xi , yi ), i = 1, . . . , n, will be such that
test error training error + (d, n, )
where
(d, n, ) =
1
2n
1
log 2d + log
.
Measures of Complexity
In Theorem 1, the error (d, n, ) is on the order of log d. In other words, the more classiers are under
consideration, the larger the bound on the dierence between the testing and training errors, and the
dierence grows as a function of log d. It follows that, for our purposes, log d captures the complexity of
C. It turns out that, in the case where there are innitely many classiers to choose from, i.e., m = , a
dierent notion of complexity leads to a bound similar to that in Theorem 1
How can we characterize complexity? There are several intuitive choices, such as the degrees of freedom
associated with functions in S or the length required to describe any function in that set (description length).
In certain cases, these notions can be shown to give rise to bounds relating the test error to the training error.
In this class, we will consider a measure of complexity that holds more generally the Vapnik-Chernovenkis
(VC) dimension.
4
4.1
VC dimension
The VC dimension is a property of a class C of functions i.e., for each set C, we have an associated measure
of complexity, dV C (C). dV C captures how much variability there is between dierent functions in C. The
underlying idea is as follows. Take n points x1 , . . . , xn , and consider binary vectors in {1, +1}n formed
by applying a function f C to (xi ). How many dierent vectors can we come up with? In other words,
consider the following matrix:
..
..
..
..
.
.
.
.
where fi C. How many distinct rows can this matrix have? This discussion leads to the notion of shattering
and to the denition of the VC dimension.
f C,
d
1
2n
V C log( dV C ) + 1 + log( 4 )
n
Moreover, a suitable extension to bounded real-valued functions, as opposed to functions taking value in
{1, +1}, can also be obtained. It is called the Pollard dimension and gives rise to results analogous to
Theorem 1 and 2.
Denition 3 Pollard dimension of C = {f (x)} = maxs dV C ({I(f (x) > s(x))})
5
Based on the previous results, we may consider the following approach to selecting a class of functions C
whose complexity is appropriate for the number of samples available. Suppose that we have several classes
C1 C2 . . . Cp . Note that complexity increases from C1 to Cp . We have classiers f1 , f2 , . . . , fp which
minimizes the training error n (fi ) within each class. Then, given a condence level , we may found upper
bounds on the test error (fi ) associated with each classier:
(fi ) n (fi ) + (dV C , n, ),
with probability at least 1 , and we can choose the classier fi that minimizes the above upper bound.
This approach is called structural risk minimization.
There are two diculties associated with structural risk minimization: rst, the upper bound provided
by Theorems 1 and 2 may be loose; second, it may be dicult to determine the VC dimension of a given
class of classiers, and rough estimates or upper bounds may have to be used instead. Still, this may be a
reasonable approach, if we have a limited amount of data. If we have a lot of data, an alternative approach is
as follows. We can split the data in three sets: a training set, a validation set and a test set. We can use the
training set to nd the classier fi within each class Ci that minimizes the training error; use the validation
set to estimate the test error of each selected classier fi , and choose the classier f from f1 , . . . , fp with the
smallest estimate; and nally, use the test set to generate an estimate of the test error associated with f.
March 12
Handout #16
Lecture Note 12
Recall that two tasks must be accomplished in order to for a value function approximation scheme to be
successful:
1. We must pick a good representation J, such that J () J(, r) for at least some parameter r;
2. We must pick a good parameter r, such that J (x) J(x, r).
Consider approximating J with a linear architecture, i.e., let
J(x, r) =
i (x)ri ,
t=1
= 1 . . .
|
|S|p given by
p .
|
Fig. 1 gives a geometric interpretation of value function approximation. We may think of J as a vector
in |S| ; by considering approximations of the form J = r, we restrict attention to the hyperplane J = r
in the same space. Given a norm (e.g., the Euclidean norm), an ideal value function approximation
algorithm would choose r minimizing J r; in other words, it would nd the projection r of J onto
the hyperplane. Note that J r is a natural measure for the quality of the approximation architecture,
since it is the best approximation error that can be attained by any algorithm given the choice of .
Algorithms for value function approximation found in the literature do not compute the projection r ,
since this is an intractable problem. Building on the knowledge that J satises Bellmans equation, value
function approximation typically involves adaptation of exact dynamic programming algorithms. For in
stance, drawing inspiration from value iteration, one might consider the following approximate value iteration
algorithm:
rk+1 = T rk ,
where is a projection operator which maps T rk back onto the hyperplane r.
Faced with the impossibility of computing the best approximation r , a relevant question for any
value function approximation algorithm A generating an approximation r A is how large J rA is in
comparison with J r . In particular, it would be desirable that, if the approximation architecture
is capable of producing a good approximation to J , then the approximation algorithm should be able to
produce a relatively good approximation.
Another important question concerns the choice of a norm used to measure approximation errors.
Recall that, ultimately, we are not interested in nding an approximation to J , but rather in nding a good
policy for the original decision problem. Therefore we would like to choose to reect the performance
associated with approximations to J .
State 2
J
x, r )
J(
J = r
State 1
State 3
Performance Bounds
We are interested in the following question. Let uJ be the greedy policy associated with an arbitrary
function J , and JuJ be the cost-to-go function associated with that policy. Can we relate the increase in
cost JuJ J to the approximation error J J ?
Recall the following theorem, from Lecture Note 3:
Theorem 1 Let J be arbitrary, uJ be a greedy policy with respect to J .1 Let JuJ be the cost-to-go function
for policy uJ . Then
2
JuJ J
J J .
1
The previous theorem suggests that choosing an approximation J that minimizes J J may be an
appropriate choice. Indeed, if J J is small, then we have a guarantee that the cost increase incurred
by using the (possibly sub-optimal) policy uJ is also relatively small. However, what about the reverse?
If J J is large, do we necessarily have a bad policy? Note that, for problems of practical scale, it
1 That
is unrealistic to expect that we could approximate J uniformly well over all states (which is required by
Theorem 1) or that we could nd a policy uJ that yields a cost-to-go uniformly close to J over all states.
The following example illustrates the notion that having a large error J J does not necessarily
lead to a bad policy. Moreover, minimizing J J may also lead to undesirable results.
Example 1 Consider a single queue with controlled service rate. Let x denote the queue length, B denote
the buer size, and
Pa (x, x + 1)
Pa (x, x 1)
(a),
Pa (B, B + 1)
0,
ga (x)
a, x = 0, 1, . . . , B 1
a, x = 1, 2, . . . , B,
= x + q(a)
Suppose that we are interested in minimizing the average cost in this problem. Then we would like to nd
an approximation to the dierential cost function h . Suppose that we consider only linear approximations:
r) = r1 + r2 x. At the top of Figure 1, we represent h and two possible approximations, h1 and h2 . h1
h(x,
Which one is a better approximation? Note that h1 yields smaller
is chosen so as to minimize h h.
approximation errors than h2 over large states, but yields large approximation errors over the whole state
space. In particular, as we increase the buer size B, it should lead to worse and worse approximation errors
in almost all states. h2 , on the other hand, has an interesting property, which we now describe. At the
bottom of Figure 1, we represent the stationary state distribution (x) encountered under the optimal policy.
Note that it decays exponentially with x, and large states are rarely visited. This suggests that, for practical
purposes, h2 may lead to a better policy, since it approximates h better than h1 over the set of states that
are visited almost all of the time.
What the previous example hints at is that, in the case of a large state space, it may be important to
consider errors J J that dierentiate between more or less important states. In the next section, we
will introduce the notion of weighted norms and present performance bounds that take state relevance into
account.
2.1
r)|
max |J (x) J(x,
r),
J J(,
r)1,
J J(,
r)|
max (x)|J (x) J(x,
xS
r)| ( > 0)
(x)|J (x) J(x,
xS
xS
r)| : x
E |J (x) J(x,
Theorem 2 Let J be such that J T J. Let JuJ be the cost-to-go of the greedy policy uJ . Then, for all
c > 0,
1
J J 1,c,J
JuJ J 1,c
1
3
h(x)
h1
h2
B
Dist. of x
x
Figure 2: Illustration of Performance Bounds for Example 1
where
1
T
T
= (1 )cT
cJ = (1 )c (I PuJ )
t Put J
t=0
or equivalently
c,J (x) = (1 )
c(y)
x S.
t=0
cT (JuJ J )
cT (JuJ J)
cT (I PuJ )1 guJ J)
cT (I PuJ )1 (T J J)
c T (I PuJ )1 (J J)
1
J J1,c,J
1
JuJ J 1,c
2
J J
1
1
J J 1,c,J .
1
The analysis presented in the previous sections is based on the greedy policy u J associated with approximation J. In order to use this policy, we need to compute
This step is typically done in real-time, as the system is being controlled. If the set of available actions A
is relatively small and the summation y Pa (x, y)J(y, r) can be computed relatively fast, then evaluating
(1) directly when an action at state x is needed is a feasible approach. However, if this is not the case,
alternative solutions must be considered. We describe a few:
If the action set is relatively small but there are many y s to sum over, we can estimate
Pa (x, y)J(y, r)
(2)
by sampling:
N
1
J(yi , r).
N i=1
(3)
In some cases, a very large number of samples may be required for the empirical average (3) to be a
reasonable estimate of (2). In these cases, the computation of (2) or (3) could be done oine, and
stored via Q-factors:
QJ(x, a) = ga (x) +
Pa (x, y)J(y, r)
y
Clearly, QJ (x, a) requires space proportional to the size of the state space, and cannot be stored
exactly. In the same spirit as value function approximation, we can approximate it with a parametric
representation:
Q(x, a, s) QJ(x, a)
We may nd an approximate parameter s based on the approximation J by solving
N
min
s
1
2
(Q(x, a, s) QJ(x, a))
N i=1
(oine)
or, alternatively, we could do value function approximation to approximate the Q-factor directly.
Finally, if the action space is too large, computing minaA Q(x, a, s) may be prohibitively expensive
as well. As an alternative, we may consider also a parametric representation for policies:
u(x) u(x, t)
We may nd an appropriate parameter t by tting u(x, t) to the greedy policy u(x), computed oine:
min
t
N
1
2
(u(xi ) u(xi , t))
N i=1
or consider algorithms which mix together value function approximation and policy approximation.
March 17
Handout #17
Lecture Note 13
Temporal-Dierence Learning
We now consider the problem of computing an appropriate parameter r, so that, given an approximation
architecture J(x, r), J(, r) J ().
A class of iterative methods are the so-called temporal-dierence learning algorithms, which generates a
series of approximations Jk = J(, rk ) as follows. Consider generating a trajectory (x1 , u1 , . . . , xk , uk ), where
uk is the greedy policy with respect to Jk . We then have the error/temporal dierences
dk = guk (xk ) + Jk (xk+1 , rk ) Jk (xk , rk ),
which represent an approximation to the Bellman error (T Jk )(xk ) Jk (xk ) at state xk . Based on the
temporal dierences, an intuitive way of updating the parameters rk is to make updates proportional to the
observed Bellman error/temporal dierence:
rk+1 = rk + k dk zk ,
where k is the step size and zk is called an eligibility vector it measures how much updates to each
component of the vector rk would aect the Bellman error.
To gather more intuition about how to choose the eligibility vector, we will consider the case of au
tonomous systems, i.e., systems that do not involve control. In this case, we can estimate the cost-to-go
function via sampling as follows. Suppose that we have a trajectory x1 , . . . , xn . Then we have
J (x1 )
n1 g(xt )
t=1
J (x2 )
n2 g(xt )
t=2
..
.
In other words, from a trajectory x1 , . . . , xn , we can derive pairs (xi , J(xi )), where J(xi ) is a noisy and
biased estimate of J (xi ). Therefore we may consider tting the approximation J(x, r) by minimizing the
empirical squared error:
n
2
t , r)
min
Jn (xt ) J(x
(1)
r
t=1
We derive an incremental, approximate version of (1). First note that Jn (xt ) could be updated incrementally
as follows:
Jn+1 (xt ) = Jn (xt ) + n+1t g(xn+1 )
(2)
Alternatively, we may use a small-step update of the form
n
+1
(3)
which makes Jn+1 (xt ) an average of the old estimate Jn (xt ) and the new estimate (2). Finally, we may
approximate (3) to have Jn (xt ) function d1 , d2 , . . . , dn :
n
jt g(xj ) Jn (xt ) =
j=t
jt dt .
j=t
Hence
Jn+1 (xt ) = Jn (xt ) +
n
+1
jt dj .
(4)
j=t
Finally, we may consider having the sum in (1) implemented incrementally, so that the previous temporal
dierences do not have to be stored:
Jn+1 (xt ) = Jn (xt ) + n+1t dn+1 .
Hence, in each time stage, we would like to nd rn minimizing
n
2
t , r) .
Jn (xt ) + nt dn J(x
min
r
(5)
t=1
Starting from the solution rn to the problem at stage n, we can approximate the solution of the problem at
stage n + 1 by updating rn+1 along the gradient of (5). This leads to
n
tn
rn+1 = rn +
r J (rn , xt ) dn+1 .
t=1
= rk + k zk dk
= zk1 + r J(xk , rk )
The algorithm above is known as T D(1). We have the generalization T D(), [0, 1].
rk+1
zk
= rk + k zk dk
= zk1 + r J(xk , rk )
TD()
Before analyzing the behavior of T D(), we are going to study a related, deterministic algorithm
approximate value iteration. The analysis of T D() will be based on interpreting it as a stochastic approxi
mation version of approximate value iteration.
(1 )
m T m+1 J,
for [0, 1)
m=0
T J
J ,
for = 1.
Lemma 1
(1 ) J J
,
T J T J
1
J = T J
J, J
The motivation for T is as follows. Recall that, in value iteration, we have Jk+1 = T Jk . However, we
could also implement value iteration with Jk+1 = T L Jk , which implies L steps look ahead. Finally, we can
have an update that is a weighted average over all possible values of L; Jk+1 = T Jk gives one such update.
In what follows, we are going to restrict attention to linear approximation architectures. Let
J(x, r) =
i (x)ri ,
and
i=1
1 (1)
1 (2)
..
.
1 (n)
J =
2 (1)
2 (2)
..
.
2 (n)
...
...
...
...
d(1) 0
... 0
d(2) . . . 0
0
D=
.
..
...
.
. . . ..
0
0
. . . d(n)
P (1)
P (2)
..
.
P (n)
where d : S (0, 1)S is a probability distribution over states. Dene the weighted Euclidean norms
J2,D = J T DJ =
d(x)J 2 (x)
xS
< J J >D
= J DJ =
T
d(x)J(x)J(x)
xS
J
State 2
T rk
J = r
rk+1 = T rk
State 1
rk
State 3
Figure 1: Approximate Value Iteration
(6)
< J, J J >D = 0
J22,D
J22,D
+ J
(7)
J22,D
(8)
T
T
..
.
T
rk
T rk
T rk
Figure 2: T rk must be inside the smaller square and T rk must be inside the circle, but T rk may
be outside the larger square and further away from J than rk .
This lemma was proved in Problem Set 2, for the special case where P (x, x) > 0 for some x.
We are now poised to prove the following central result used to derive a convergent version of T D():
Lemma 4 Suppose that the transition matrix P is irreducible and aperiodic. Let
1 0
... 0
2 . . . 0
0
,
D= .
.
.
.
.
.
.
.
... .
0
0
. . . |S|
where is the stationary distribution associated with P . Then
P J2,D J2,D .
Proof:
P J22,D
(x)
(x)
P (x, y)J(y)
xS
xS
(y)J 2 (y)
= J22,D
The rst inequality follows the Jensens inequality and the third equality holds because is a stationary
2
distribution.
1 0
2
0
D =
..
..
.
.
0
0
... 0
... 0
..
... .
. . . |S|
and is the stationary distribution of the transition matrix P . It follows that, if the projection is performed
with respect to 2,D , T becomes a contraction with respect to the same norm, and convergence of
T D() is guaranteed.
Lemma 5
(i)
(ii)
(iii)
2,D
T J T J2,D J J
(1
)
2,D
J J
T J T J2,D
1
(1 )
2,D
J J
T J T J2,D
1
7
Proof of (1)
2,D
T J T J
= g + P J (g + J)2,D
2,D
= P J P J
2,D
J J
Theorem 1 Let
rk+1 = T rk
and
D =
1
0
..
.
0
0
2
..
.
0
...
...
...
...
0
0
..
.
|S|
Then rk r with
r J 2,D K, J J 2,D .
Proof: Convergence follows from (iii). We have r = T r and J T J . Then
r J 22,D
2
r J + J J 2,D
r J 22,D + J J 22,D
T r
2
T J 22,D
+ J
(orthogonal)
2
J 2,D
(1 )
2
r J 22,D + J J 2,D
(1 )2
Therefore
r J 2,D
1
J J 2,D
1
2
March 29
Handout #18
Lecture Note 14
Convergence of T D()
In this lecture, we will continue to analyze the behavior of T D() for autonomous systems. We assume that
the system has stage costs g(x) and transition matrix P .
Recall that we want to approximate J by J
r. We nd successive approximations r0 , r1 , . . . by
applying T D():
rk+1
dk
zk
= rk + k dk zk
(1)
() (x )
(2)
(3)
=0
= (1 )
m T m+1 J,
m=0
< , J >D .
(4)
D=
1
0
..
.
0
0
2
..
.
0
...
...
...
...
0
0
..
.
|S|
(1)
1
1
J J 2,D
1 k2
We can think T D() as a stochastic approximations version of AVI. Recall that the main idea in stochastic
approximation algorithms is as follows. We would like to solve a system of equations r = Hr, but only have
access to noisy observations Hr = w for any given r. Then we attempt to solve r = Hr iteratively by
considering
rk+1 = rk + k (Hrk rk + wk ).
Hence in order to show that T D() is a stochastic approximations version of AVI, we would like to show
that
rk+1 = T rk rk + wk ,
for some noise wk .
The following lemma expresses (4) in a format that is more amenable to our analysis.
Lemma 1 The AVI equations (4) can be rewritten as
rk+1 = < , T rk >D ,
(5)
rk+1 = Ark + b,
(6)
or, equivalently,
where
T
A = (1 ) D
t+1
(P )
(7)
t=0
and
T
b= D
()t P t g.
(8)
t=0
Proof: (5) follows immediately from the denition of . Now note that
rk+1
= < , T rk >D
=
=
=
T DT rk
T
(1 ) D
(1 )T D
m=0
(P ) g +
= Ark +
m+1
m+1
rk
t=0
m (P )m+1 rk + (1 )
m=0
t=0
(P )t g
m=t
(P )t g
t=0
= Ark + b.
= A,
lim Ebk
= b,
Ak
bk
= zk g(xk ).
We will study the limit of EAk and Ebk . For all J , we have
k
k
lim E [zk Jk ] = lim E
()
(x )J (xk )
=0
lim E
() (xk )J (xk )
( P k (x, y) (y))
=0
=0
() (x0 )J (x )|x0
=0
() < , P J >D
=0
Letting
J (xk , xk+1 ) = (xk+1 ) (xk ),
we conclude that
lim EAk
() < , P +1 P >D +I
=0
=
=
=
=
T D
T D
T D
=0
=0
+1 P +1 T D
+1 P +1 T D
=0
P + I
P T D + I
=1
+1 P +1 T D
=0
(1 )
=0
+1 P +1
=0
= A.
3
+1 P +1
(9)
() < , P g >D
=0
= b.
If = 1, we have
() (g + P rk rk ) = J + (I P )1 (P Ir) = J r.
=0
If < 1, then
() P
=0
P (1 )
t=
=0
(1 )
=0
t P t
t=0
Thus
() P (g + P rk rk )
= (1 )
=0
(1 )
=0
t P t (g + P r r)
t=0
=0
t=0
t P t g + t+1 P t+1 r r
T t rk
= T rk rk
Therefore,
lim E [zk dk ] =< , T rk rk >D
Comparing Lemmas 1 and 2, it is clear that T D() can be seen as a stochastic approximations version
of AVI; in particular, TDs equations can be written as
rk+1 = rk + k (Ark + b rk + wk ),
where wk = (Ak A)rk + (bk b). If rk remains bounded, we should have limk Ewk = 0, so that the noise
is zeromean asymptotically. Note however that the noise is not independent of the past history, and in fact
follows a Markov chain, since matrices Ak and bk are functions of xk and xk+1 . This makes application of
the Lyapunov analysis for convergence of T D() dicult, and we must resort to the ODE analysis instead.
The next theorem provides the convergence result.
Theorem 2 Suppose that P is irreducible and aperiodic and that
rk r w.p.1, where r = T r .
4
k=1
k = and
k=1
k2 < . Then
2
(a)
k=1 k = ,
k=1 k <
(b) xk follows a Markov chain and has stationary distribution
(c) A = E[A(x)|x ] is negative denite, and b = E[b(x)|x ]
(d) E[A(xk )|x0 ] A Ck , x0 , k, and
E[b(xk )|x0 ] b Ck , x0 , k
Then rk r w.p.1, i.e., Ar + b = 0.
Sketch of Proof of Theorem 2 We verify that conditions (a)(d) of Theorem 3 are satised.
Conditions (a) and (b) are satised by assumption.
(c) For all r, we have
rT Ar
= rT < , (1 )
+1 P +1 r r >D
=0
= < r, (1 )
+1 P +1 r >D r22,D
=0
r22,D r22,D
( )
< 0
Hence, A is negative denite.
(d) We must consider the quantities
E[bk b]
This involves a comparison of E[zk (xk+1 ], E[zk (xk )] and E[zk g(xk )] with their limiting values as k goes
to innity. Let us focus on term zk (xk ); the other terms involve similar analysis. We have
kt
= E
()
(x
)
(x
)
|
x
=
x
t
k
0
t=0
zk
kt
(xt )()
(xk )|xt , t 0
t=
kt
(xt )()
(xk )|x0 = x
t=0
t=0
kt
(xt )()
(xk )|xt
t=
It follows from basic matrix theory that |P (xt = x|x0 ) (xt )| Ct , where corresponds to the second
highest eigenvalue of P, which is strictly less than one since P is irreducible and aperiodic. Therefore we
have
t=0
k
2
2
t=0
+ E
t= k
2 +1
t= k
2 +1
M ()k/2 + k/2 ,
for some M < . Moreover,
kt
(xt )()
t=
March 31
Handout #19
Lecture Note 15
In the last lecture, we have analyzed the behavior of T D() for approximating the costtogo function in
autonomous systems. Recall that much of the analysis was based on the idea of sampling states according to
their stationary distribution. This was done either explicitly, as was assumed in approximate value iteration,
or implicitly through the simulation or observation of system trajectories. It is unclear how this line of
analysis can be extended to general controlled systems. In the presence of multiple policies, in general
there are multiple stationary state distributions to be considered, and it is not clear which one should be
used. Moreover, the dynamic programming operator T may not be a contraction with respect to any such
distribution, which invalidates the argument used in the autonomous case. However, there is a special
class of control problems for which analysis of T D() along the same lines used in the case of autonomous
systems is successful. These problems are called optimal stopping problems, and are characterized by a tuple
(S, P : S S [0, 1], g0 : S , g1 : S ), with the following interpretation. The problem involves a
Markov decision process with state space S. In each state, there are two actions available: to stop (action
0) or to continue (action 1). Once action 0 (stop) is selected, the system is halted and a nal cost g0 (x) is
incurred, based on the nal state x. At each previous time stage, action 1 (continue) is selected and a cost
g1 (x) is incurred. In this case, the system transitions from state x to state y with probability P (x, y). Each
policy corresponds to a (random) stopping time u , which is given by
u = min{k : u(xk ) = 0}.
Example 1 (American Options) An American call option is an option to buy stock at a price K, called
the stock price, on or before an expiration date the last time period the option can be exercised. The state
of a such a problem is the stock price Pk . Exercising the option corresponds to the stop action and leads
to a reward max(0, Pk K); not exercising the option corresponds to the continue action and incurs no
costs or rewards.
Example 2 (The Secretary Problem) In the secretary problem, a manager needs to hire a secretary.
He interviews secretaries sequentially and must make a decision about hiring each one of them immediately
after the interview. Each interview incurs a certain cost for the hours spent meeting with the candidate, and
hiring a certain person incurs a reward that is a function of the persons abilities.
In the innitehorizon, discountedcost case, each policy is associated with a discounted costtogo
t
u
Ju (x) = E
g1 (t) + g0 (xu )|x0 = x .
t=0
(1)
P (J J)2,D
2,D ,
J J
where the last inequality follows from P J2,D J2,D by Jensens inequality and stationarity of .
1.1
T D()
In general control problems, storing Q may require substantially more space than storing J , since Q is
a function of stateaction pairs. However, in the case of optimal stopping problems, storing Q requires
essentially the same space as J , since the Q value of stopping is trivially equal to g0 (x). Hence in the case
of optimal stopping problems, we can set T D() to learn
Q = g1 + P J ,
the cost of choosing to continue and behaving optimally afterwards. Note that, assuming that onestage
costs g0 and g1 are known, we can derive an optimal policy from Q by comparing the cost of continuing
with the cost of stopping, which is simply g0 . We can express J in terms of Q as
J = min(g0 , Q ),
2
Let
HQ = g1 + P min(g0 , Q),
and
H Q = (1 )
(2)
t H t+1 Q.
t=0
= rk + k zk (H rk rk + wk ).
(3)
The following property of H implies that an analysis of T D() for optimal stopping problems along the same
lines of the analysis for autonomous systems suces for establishing convergence of the algorithm.
Lemma 2
2,D Q Q
2,D .
HQ HQ
Theorem 1 [Analogous to autonomous systems] Let rk be given by (3) and suppose that P is irreducible
and aperiodic. Then rk r w.p.1, where r satises
r = H r
and
r Q 2,D
where k =
(1)
1
1
Q Q 2,D ,
1 k2
(4)
We can also place a bound on the loss in the performance incurred by using a policy that is greedy with
respect to Q , rather than the optimal policy. Specically, consider the following stopping role based on
r .
stop,
if g0 (x) r (x)
u(x)
=
continue, otherwise
species a (random) stopping time , which is given by
u
= min{k : (xk )r g0 (xk )},
and the costtogo associated with u
is given by
1
J (x) = E
g1 (xt ) + g0 (x ) .
t=0
The following theorem establishes a bound on the expected dierence between J and the optimal costtogo
J .
3
Theorem 2
Q Q 2,D
(1 ) 1 K 2
be the cost of choosing to continue in the current time stage, followed by using policy u:
Proof: Let Q
= g1 + P J.
Q
Then we have
E[J(x0 ) J(x0 )|x0 ]
(x)|P (J J )(x)|
xS
= P J P J 1,
P J P J 2,
=
=
1
g1 + P J g1 + P J 2,
1
Q Q 2,
The inequality (5) follows from the fact that, for all J, we have
J21,
E[|J(x)| : x ]2
E[J(x)2 : x ]
= J22, ,
and K, given by
where the inequality is due to Jensens inequality. Now dene the operators H
HQ
= g1 + P KQ, where
Hr = Hr
HQ = Q
Q
= Q.
H
is a contraction with respect to 2, . Now we have
Moreover, it is also easy to show that H
Q 2,
Q
Hr
2,
Q Hr 2, + Q
Q
Hr
2,
= HQ Hr 2, + H
r 2,
Q r 2, + Q
Q 2, + Q r 2,
Q r 2, + Q
r 2,
2Q r 2, + Q
4
(5)
(6)
Thus,
Q 2,
Q
2
Q r 2, .
1
The theorem follows from Theorem 1 and equations (6) and (7).
1.2
(7)
2
(8)
where u represents the projection based on 2,u and uk is the greedy policy with respect to rk . Such
a scheme is a plausible approximation, for instance, for an approximate policy iteration based on T D()
trained with system trajectories:
1. Select a policy u0 . Let k = 0.
2. Fit Juk rk (e.g., via T D() for autonomous systems);
3. Choose uk+1 to be greedy with respect to rk . Let k = k + 1. Go back to step 2.
Note that step 2 in the approximate policy iteration presented above involves training over an innitely long
trajectory, with a single policy, in order to perform policy evaluation. Drawing inspiration from asynchronous
policy iteration, one may consider performing policy updates before the policy evaluation step is considered.
As it turns out, none of these algorithms is guaranteed to converge. In particular, approximate value iteration
(8) is not even guaranteed to have a xed point. For an analysis of approximate value iteration, see [1].
A special situation where AVI and T D() are guaranteed to converge occurs when the basis functions
are constant over partitions of the state space.
Theorem 3 Suppose that i (x) = 1{x Ai }, where Ai Aj = , i, j, i = j. Then
rk+1 = T rk
converges for any Euclidean projections .
Proof: We will show that, if is a Euclidean projection and i satisfy the assumption of the theorem, then
is a maximumnorm nonexpansion:
J J
.
J J
5
Let
(J )(x)
Ki
if x Ai , where
arg min
w(x) J (x) r
= Ki ,
=
xAi
Thus
w(x)
.
xAi w(x)
E J (x) J(x)|x wi
J J
It follows that T is a maximumnorm contraction, which ensures convergence of approximate value iteration.
2
References
[1] D.P. de Farias and B. Van Roy. On the existence of xed points for appproximate value iteration and
temporaldierence learning. Journal of Optimization Theory and Applications, 105(3), 2000.
April 5
Handout #20
Lecture Note 16
In previous lectures on dynamic programming, we have studied the value and policy iteration algorithms
for solving Bellmans equation. We now introduce a different algorithm, which is based on formulating the
dynamic programming problem as a linear program.
Consider the following optimization problem:
cT J
maxJ
subject to
(1)
T J J,
and suppose that vector c is strictly positive: c > 0. Recall the following lemma from Lecture 3:
Lemma 1 For any J such that T J J, we have J J .
It follows from the previous lemma that, whenever c > 0, J is the unique solution to problem (1). We refer
to this problem as the exact LP. Note however that, strictly speaking, this problem is not a linear program;
in particular, the constraints
(T J)(x) J(x)
(
min ga (x) +
a
(2)
X
y
Pa (x, y)J(y)
J(x)
are not linear in the variables J of the LP. However, (1) can easily be converted into an LP by noting that
each constraint (2) is equivalent to
X
ga (x) +
Pa (x, y)J(y) J(x), a Ax .
y
Note that the exact LP contains as many variables as the number of states in the system, and as many
constraints as the number of state-action pairs.
1.1
We can also find an optimal policy by solving the dual of the exact LP. For simplicity, we will consider
average-cost problems in this section, but the analysis and underlying ideas easily extend to the discountedcost case. The dual LP has an interesting interpretation, and it can be shown that solving it iteratively via
simplex or interior-point methods is equivalent to performing specific forms of policy iteration.
In the average-cost case, it can be shown that the dual LP is given as follows:
X
(x, a)ga (x)
min
(3)
(x, a), x
(4)
x,a
subject to
XX
y
X
a
(x, a) = 1
x,a
(x, a) 0, x, a
For simplicity, let us assume that the system is irreducible under every policy.
In order to analyze the dual LP, we consider the notion of randomized policies. So far, we have defined
a policy to be a mapping from states to actions; in other words, for every state x a policy u prescribes
a (deterministic) action u(x) A . Alternatively, we can consider an extended definition where, for any
state action, a policy u prescribes a probability u(x, a) for taking each action a A x . Each policy u is now
associated with a transition matrix Pu such that
X
Pu (x, y) =
u(x, a)Pa (x, y),
a
u (x) 0.
We can also verify that the state costs associated with policy u are given by
X
u(x, a)ga (x).
gu (x) =
a
With these notions in mind, it can be shown that the variables (x, a) for any feasible solution to the
dual LP can be interpreted as state-action frequencies for a randomized policy. Indeed, let
X
(x) =
(x, a),
(5)
a
and
u(x, a) =
(x, a)
,
(x)
(6)
if (x) > 0, and u(x, ) be an arbitrary distribution over Ax , otherwise. Note that in either case we have
(x, a) = (x)u(x, a),
and u(x, a) is a randomized policy. Then we can show that corresponds to the stationary state distribution
P
u , and x,a (x, a)ga (x, a) corresponds to the average cost u of policy u.
Lemma 2 For every feasible solution (x, a) of the dual LP, let (x) and u(x, a) be given by (??) and (6).
P
Then = u , and x,a (x, a)ga (x) corresponds to the average cost u of policy u.
2
XX
y
(x)
XX
(x, a)
1.
We conclude that is a stationary distribution associated with policy u. Since by assumption the system
is irreducible under every policy, each policy has a unique stationary distribution, and we have = u . We
now have
X
XX
(x, a)ga (x) =
u (x)u(x, a)ga (x)
x,a
u (x)gu (x)
u .
From the previous lemma, we conclude that each feasible solution of the dual LP is identified with a policy
u, and the variables (x, a) correspond to the probability of observing state x and action a, in steady state.
Consider using simplex or an interior-point method for solving the dual LP. Either method will generate a
sequence of feasible solutions 0 , 1 , . . . , with decreasing value of the objective function. Interpreting this
sequence with Lemma 2, we see that this is equivalent to generating a sequence of policies u 0 , u1 , . . . , with
decreasing average cost, and solving the LP corresponds to performing policy iteration.
maxr
subject to
(7)
T r r
As with the case of exact dynamic programming, the optimization problem (??) can be recast as a linear
program
max
cT r
s.t.
ga (x) +
yS
x S, a Ax .
We will refer to this problem as the approximate LP. Note that the approximate LP involves a potentially
much smaller number of variables it has one variable for each basis function. However, the number of
constraints remains as large as in the exact LP. Fortunately, most of the constraints become inactive, and
solutions to the linear program can be approximated efficiently, as we will show in future lectures.
2.1
State-Relevance Weights
In the exact LP, for any vector c with positive components, maximizing c T J yields J . In other words, the
choice of state-relevance weights does not influence the solution. The same statement does not hold for the
approximate LP. In fact, the choice of state-relevance weights may bear a significant impact on the quality
of the resulting approximation.
To motivate the role of state-relevance weights, let us start with a lemma that offers an interpretation of
their function in the approximate LP.
Lemma 3 A vector r solves
max
cT r
s.t.
T r r,
kJ rk1,c
s.t.
T r r.
Proof: It is clear that the approximate LP is equivalent to minimizing c T (J r) over all feasible r. For
all r such that T r r, we have r J , and cT (J r) = kJ rk1,c .
2
Lemma 3 suggests that the state-relevance weights may be used to control the quality of the approximation
to the cost-to-go function over different portions of the state space. Recall that ultimately we are interested
in generating good policies, rather than good approximations to the cost-to-go function, and ideally we would
like to choose c to reflect that objective. The following theorem, from Lecture 12, suggests certain choices
for state-relevance weights. Recall that
T,J = cT (I PuJ )1 .
Theorem 1 Let J : S 7 < be such that T J J. Then
kJuJ J k1,
1
kJ J k1,,J .
1
(8)
Contrasting Lemma 3 with the bound on the increase in costs (8) given by Theorem 1, we may want
to choose state-relevance weights c that capture the (discounted) frequency with which different states are
expected to be visited. Note that the frequency with which different states are visited in general depends on
the policy being used. One possibility is to have an iterative scheme, where the approximate LP is solved
multiple times with state-relevance weights adjusted according to the intermediate policies being generated.
Alternatively, a plausible conjecture is that some problems will exhibit structure making it relatively easy
4
J*
r *
J(2)
TJ > J
~
r
J(1)
J = r
For the approximate LP to be useful, it should deliver good approximations when the cost-to-go function is
near the span of selected basis functions. Figure 1 illustrates the issue. Consider an MDP with states 1 and
2. The plane represented in the figure corresponds to the space of all functions over the state space. The
shaded area is the feasible region of the exact LP, and J is the pointwise maximum over that region. In the
approximate LP, we restrict attention to the subspace J = r.
In Figure 1, the span of the basis functions comes relatively close to the optimal cost-to-go function J ;
if we were able to perform, for instance, a maximum-norm projection of J onto the subspace J = r, we
would obtain the reasonably good approximate cost-to-go function r . At the same time, the approximate
LP yields the approximate cost-to-go function
r. In this section, we develop bounds guaranteeing that
r
2
min kJ rk .
1 r
= e means
State 2
TJ
J
No feasible
r
Before proving Theorem 2, we state and prove the following auxiliary lemma:
Lemma 4 For all J, let
1+
J = J
kJ J k e
1
Then, we have
T J J
Proof: Let = kJ J k . Thus we have
1+
T J = T (J
e)
1
kT J T J k kJ J k =
J = r
r
State 1
Then
T J
=
=
(1 + )
e
1
(1 + )
e
J (1 + )e
1
(1 + )
1+
e (1 + )e
e
J +
1
1
J.
We are now ready to finish the proof of Theorem 2. Let r = arg minr kJ rk . Let = kJ r k .
Then by Lemma 4, we have
1+
r = r
e
1
is a feasible solution for the ALP. From Lemma 3, we have
kJ
rk1,c
kJ
rk1,c
1+
ek1,c
1
1+
kJ r k1,c +
1
1+
kJ r k +
1
1+
+
1
2
1
kJ r
Theorem 2 establishes that when the optimal cost-to-go function lies close to the span of the basis
functions, the approximate LP generates a good approximation. In particular, if the error min r kJ rk
goes to zero (e.g., as we make use of more and more basis functions) the error resulting from the approximate
LP also goes to zero.
Though the above bound offers some support for the linear programming approach, there are some
significant weaknesses:
1. The bound calls for an element of the span of the basis functions to exhibit uniformly low error over
all states. In practice, however, minr kJ rk is typically huge, especially for large-scale problems.
2. The bound does not take into account the choice of state-relevance weights. As demonstrated in the
previous section, these weights can significantly impact the approximation error. A sharp bound should
take them into account.
In the next lecture, we will show how the previous analysis can be generalized to take into account structure
about the underlying MDP and address the aforementioned issues.
7
April 7
Handout #21
Lecture Note 17
cT r
T r r
(ALP)
In the previous lecture, we proved the following result on the approximation error yielded by the ALP:
Theorem 1 If v = e for some v, then we have
J
r1,c
2
min J r .
1 r
(1)
Though the above bound oers some support for the linear programming approach, there are some signicant
weaknesses:
1. The bound calls for an element of the span of the basis functions to exhibit uniformly low error over
all states. In practice, however, minr J r is typically huge, especially for large-scale problems.
2. The bound does not take into account the choice of state-relevance weights. As demonstrated in the
previous section, these weights can signicantly impact the approximation error. A sharp bound should
take them into account.
In this lecture, we present a line of analysis that generalizes Theorem 1 and addresses the aforementioned
diculties.
Lyapunov Function
To set the stage for the development of an improved bound, let us establish some notation. First, we
introduce a weighted maximum norm, dened by
J, = max (x)|J(x)|,
xS
(2)
for any : S + . As opposed to the maximum norm employed in Theorem 1, this norm allows for uneven
weighting of errors across the state space.
We also introduce an operator H, dened by
yS
for all V : S . For any V , (HV )(x) represents the maximum expected value of V (Y ) if the current state
is x and Y is a random variable representing the next state.
1
(HV )(x)
.
V (x)
(3)
(gu + Pu J)
(gu + Pu J)
(gu + Pu J)
(gu + Pu J)
= Pu (J J)
Pu |J J|
Therefore, x S,
(T J )(x) (T J)(x)
Pu (x, y)
|J (y) J(y)|
V (y)
V (y)
|J (y ) J(y )|
Pu (y) max
V (y)
y
V (y )
y
(HV )(x)
V (x)
Hence,
T J T J V V
and we have
We are now ready to state our main result. For any given function V mapping S to positive reals, we
use 1/V as shorthand for a function x 1/V (x).
Theorem 2 Let r be a solution of the approximate LP. Then, for any v K such that v is a Lyapunov
function,
2cT v
(5)
J
r1,c
min J r,1/v .
1 v r
Proof: Let
r = arg min J r, V1
r
and
= J r . V1 .
Let
r = r
1 + V
V
1 V
Then
T
r T r , Vt
V
r r , V1
(1 + V )
V , V1
1 V
(1 + V )
V
1 V
(6)
Moreover,
J r , V1 V J r , V1 = V
(7)
Thus,
1 + V
V
from (6)
1 V
1 + V
J V V V
V
from (7)
1 V
1 + V
V
r (1 + V )V V
1 V
1 + V
1 + V
=
r+
V (1 + V )V V
V
1 V
1 V
=
r
T
r
T r V
J
r1,c
=
=
=
(1 + V ) T
c(x)|J (x) (r )(x)| +
c V
1 V
x
(1 + V ) T
c(x)V (x) +
c V
1 V
x
(1 + V ) T
+
c V cT V
1 V
2cT V
1 V
Let us now discuss how this new theorem addresses the shortcomings of Theorem 1 listed in the previous
section. We treat in turn the two items from the aforementioned list.
1. The norm appearing in Theorem 1 is undesirable largely because it does not scale well with
problem size. In particular, for large problems, the optimal value function can take on huge values
over some (possibly infrequently visited) regions of the state space, and so can approximation errors
in such regions.
Observe that the maximum norm of Theorem 1 has been replaced in Theorem 2 by ,1/v . Hence,
the error at each state is now weighted by the reciprocal of the Lyapunov function value. This should
to some extent alleviate diculties arising in large problems. In particular, the Lyapunov function
should take on large values in undesirable regions of the state space regions where J is large.
Hence, division by the Lyapunov function acts as a normalizing procedure that scales down errors in
such regions.
4
2. As opposed to the bound of Theorem 1, the state-relevance weights do appear in our new bound. In
particular, there is a coecient cT v scaling the right-hand-side. In general, if the state-relevance
weights are chosen appropriately, we expect that cT v will be reasonably small and independent of
problem size. We defer to the next section further qualication of this statement and a discussion of
approaches to choosing c in contexts posed by concrete examples.
13
1
37
22
35
11
18
Machine 1
26
4
Machine 2
34
Machine 3
g(x) =
=
xi ,
d i=1
d
since the expected total number of jobs at time t cannot exceed the total number of jobs at time 0 plus the
expected number of arrivals between 0 and t, which is less than or equal to Adt. Therefore we have
t
|xt |
=
Ex
t Ex [|xt |]
t=0
t=0
t (|x| + Adt)
t=0
Ad
|x|
+
.
1 (1 )2
(8)
The rst equality holds because |xt | 0 for all t; by the monotone convergence theorem, we can interchange
the expectation and the summation. We conclude from (8) that the optimal value function in the innite
buer case should be bounded above by a linear function of the state; in particular,
0 J (x)
1
|x| + 0 ,
d
J ,1/V
1 |x| + d0
x0
|x| + dC
0
1 + ,
C
max
and the bound above is independent of the number of queues in the system.
Now let us study V . We have
|x| + Ad
(HV )(x)
+C
d
A
V (x) + |x|
d +C
A
V (x) +
,
C
and it is clear that, for C suciently large and independent of d, there is a < 1 independent of d such that
HV V , and therefore 11V is uniformly bounded on d.
Finally, let us consider cT V . We expect that under some stability assumptions, the tail of the steady-state
d
1
distribution will have an upper bound with geometric decay [1] and we take c(x) = 1
|x| . The
B+1
state-relevance weights c are equivalent to the conditional joint distribution of d independent and identically
distributed geometric random variables conditioned on the event that they are all less than B + 1. Therefore,
d
T
Xi + C Xi < B + 1, i = 1, ..., d
c V = E
d i=1
<
E [X1 ] + C
=
+ C,
1
where Xi , i = 1, ..., d are identically distributed geometric random variables with parameter 1 . It follows
that cT V is uniformly bounded over the number of queues.
References
[1] D. Bertsimas, D. Gamarnik, and J.N. Tsitsiklis. Performance of multiclass Markovian queueing networks
via piecewise linear Lyapunov functions. Annals of Applied Probability, 11(4):13841428, 2001.
April 12
Handout #21
Lecture Note 18
While the ALP may involve only a small number of variables, there is a potentially intractable number of
constraints one per state-action pair. As such, we cannot in general expect to solve the ALP exactly. The
focus of this paper is on a tractable approximation to the ALP: the reduced linear program (RLP).
Generation of an RLP relies on three objects: (1) a constraint sample size m, (2) a probability measure
over the set of state-action pairs, and (3) a bounding set N K . The probability measure represents
a distribution from which we will sample constraints. In particular, we consider a set X of m state-action
pairs, each independently sampled according to . The set N is a parameter that restricts the magnitude
r. The RLP is dened by
of the RLP solution. This set should be chosen such that it contains
maximize
subject to
cT r
(x, a) X
(1)
r1,c J
r1,c 1 ,
Pr J
where > 0 is an error tolerance parameter and > 0 parameterizes a level of condence 1 . This paper
focusses on understanding the sample size m needed in order to meet such a requirement.
1.1
To apply the RLP, given a problem instance, one must select parameters m, , and N . In order for the
RLP to be practically solvable, the sample size m must be tractable. Results of our analysis suggest that if
and N are well-chosen, an error tolerance of can be accommodated with condence 1 given a sample
size m that grows as a polynomial in K, 1/, and log 1/, and is independent of the total number of ALP
constraints.
Our analysis is carried out in two parts:
1. Sample complexity of near-feasibility. The rst part of our analysis applies to constraint sampling
in general linear programs not just the ALP. Suppose that we are given a set of linear constraints
zT r + z 0, z Z,
on variables r K , a probability measure on Z, and a desired error tolerance and condence
1 . Let z1 , z2 , . . . be independent identically distributed samples drawn from Z according to . We
will establish that there is a sample size
1
1
1
m=O
K ln + ln
such that, with probability at least 1 , there exists a subset Z Z of measure (Z) 1 such
that every vector r satisfying
zTi r + zi 0, i = 1, . . . , m,
also satises
zTi r + zi 0,
z Z.
We refer to the latter criterion as near-feasibility nearly all the constraints are satised. The main
point of this part of the analysis is that near-feasibility can be obtained with high-condence through
imposing a tractable number m of samples.
2. Sample complexity of a good approximation. We would like the the error J
r1,c of an
1
A
A
,
m=O
K ln
+ ln
(1 )
(1 )
in the literature, as discussed in the following literature review. The signicance of our results is that they
suggest viability of the linear programming approach to approximate dynamic programming even in the
absence of such favorable special structure.
y : yT r + y < 0 .
(2)
12
2
4
K ln
+ ln
,
m
(3)
y : yT r + y < 0
(4)
This theorem implies that, even without any special knowledge about the constraints, we can ensure nearfeasibility, with high probability, through imposing a tractable subset of constraints. The result follows
immediately from Corollary 8.4.2 on page 95 of [1] and the fact that the collection of sets {{(, )| T r +
0}|r K } has VC-dimension K, as established in [2]. The main ideas for the proof are as follows:
1 if zT r + kz 0
We dene, for each r, a function fr : Z {0, 1}, given by r fr (z) =
0 otherwise
We are interested in nding r such that
E fr (z) 1,
E fr (z)
1
fr (z)
fr (zi ) = E
m i=1
fr | ,
|E fr E
then for all r FZ ,
(5)
fr = 1 E 1 , r F
E
Z
From the VC-dimension and supervised learning lecture, we know that there is a way of ensuring (5)
if f C.
The nal part of the proof comes from verifying that C has VC-dimension less than or equal to p.
Theorem 1 may be perceived as a puzzling result: the number of sampled constraints necessary for a good
approximation of a set of constraints indexed by z Z depends only on the number of variables involved in
these constraints and not on the set Z. Some geometric intuition can be derived as follows. The constraints
are fully characterized by vectors [zT z ] of dimension equal to the number of variables plus one. Since
near-feasibility involves only consideration of whether constraints are violated, and not the magnitude of
violations, we may assume without loss of generality that [zT z ] = 1, for an arbitrary norm. Hence
constraints can be thought of as vectors in a low-dimensional unit sphere. After a large number of constraints
is sampled, they are likely to form a cover for the original set of constraints i.e., any other constraint
is close to one of the already sampled ones, so that the sampled constraints cover the set of constraints.
The number of sampled constraints necessary in order to have a cover for the original set of constraints
is bounded above by the number of sampled vectors necessary to form a cover to the unit sphere, which
naturally depends only on the dimension of the sphere, or alternatively, on the number of variables involved
in the constraints.
1.5
0.5
0.5
0.5
1.5
Figure 1: A feasible region dened by a large number of redundant constraints. Removing all but a random
sample of constraints is likely to bring about a signicant change the solution of the associated linear program.
In this section, we investigate the impact of using the RLP instead of the ALP on the error in the approxima
tion of the cost-to-go function. We show in Theorem 2 that, by sampling a tractable number of constraints,
the approximation error yielded by the RLP is comparable to the error yielded by the ALP.
The proof of Theorem 2 relies on special structure of the ALP. Indeed, it is easy to see that such a result
cannot hold for general linear programs. For instance, consider a linear program with two variables, which
are to be selected from the feasible region illustrated in Figure 1. If we remove all but a small random sample
of the constraints, the new solution to the linear program is likely to be far from the solution to the original
linear program. In fact, one can construct examples where the solution to a linear program is changed by
an arbitrary amount by relaxing just one constraint.
Let us introduce certain constants and functions involved in our error bound. We rst dene a family of
probability distributions on the state space S, given by
Tu = (1 )cT (I Pu )1 ,
(6)
for each policy u. Note that, if c is a probability distribution, u (x)/(1 ) is the expected discounted
number of visits to state x under policy u, if the initial state is distributed according to c. Furthermore,
lim1 u (x) is a stationary distribution associated with policy u. We interpret u as a measure of the
relative importance of states under policy u.
We will make use of a Lyapunov function V : S + , which is dened as follows.
Denition 1 (Lyapunov function) A function V : S + is called a Lyapunov function if there is a
scalar V < 1 and an optimal policy u such that
Pu V V V.
(7)
Our denition of a Lyapunov function is similar to that found in the previous lecture, with the dierence
that here the Lyapunov inequality (7) must hold only for an optimal policy, whereas in the previous lecture
it must hold simultaneously for all policies.
Lemma 1 Let V be a Lyapunov function for an optimal policy u . Then Tu is a contraction with respect
to ,1/V .
Proof: Let J and J be two arbitrary vectors in |S| . Then
Tu J Tu J = Pu (J J) J J,1/V Pu V J J,1/V V V.
For any Lyapunov function V , we also dene another family of probability distributions on the state
space S, given by
u (x)V (x)
u,V (x) =
.
(8)
Tu V
We also dene a distribution over state-action pairs
u,V (x, a) =
u,V (x)
, a Ax .
|Ax |
and
=
Tu V
sup J r,1/V .
cT J rN
(9)
We now present the main result of the paper a bound on the approximation error introduced by
constraint sampling.
Theorem 2 Let and be scalars in (0, 1). Let u be an optimal policy and X be a (random) set of m stateaction pairs sampled independently according to the distribution u ,V (x, a), for some Lyapunov function V ,
where
16A
48A
2
m
K ln
+ ln
,
(10)
(1 )
(1 )
Let r be an optimal solution of the ALP that is in N , and let r be an optimal solution of the corresponding
RLP. If r N then, with probability at least 1 , we have
J
r1,c J
r1,c + J 1,c .
(11)
Proof: From Theorem 1, given a sample size m, we have, with probability no less than 1 ,
(1 )
4A
u ,V ({(x, a) : (Ta
r)(x) < (
r)(x)})
u ,V (x)
1(Ta r)(x)<(r)(x)
|Ax |
xS
aAx
1
u ,V (x)1(Tu r)(x)<(r)(x) .
A
xS
(12)
=
=
=
=
cT (I Pu )1 (gu (I Pu )
r)
cT (I Pu )1 |gu (I Pu )
r|
T
1
(gu (I Pu )
r) + (gu (I Pu )
r)
c (I Pu )
r) (gu (I Pu )
r) +
cT (I Pu )1 (gu (I Pu )
+2 (gu (I Pu )
r)
cT (I Pu )1 gu (I Pu )
r + 2 (Tu
r
r)
cT (J
r) + 2cT (I Pu )1 (Tu
r
r) .
(13)
n Pun 0,
n=0
(I Pu )1 (gu (I Pu )
r) (I Pu )1 |(gu (I Pu )
r)|
r)| .
= (I Pu )1 |(gu (I Pu )
Now let r be any optimal solution of the ALP1 . Clearly, r is feasible for the RLP. Since r
is the optimal
solution of the same problem, we have cT
r and
r cT
cT (J
r)
cT (J
r)
r1,c ,
J
(14)
therefore we just need to show that the second term in (13) is small to guarantee that the performance of
the RLP is not much worse than that of the ALP.
1 Note that all optimal solutions of the ALP yield the same approximation error J r
1,c , hence the error bound (11)
is independent of the choice of r.
Now
2cT (I Pu )1 (Tu
r
r)
=
=
T (Tu
r
r)
1 u
2
u (x) ((
r)(x) (Tu
r)(x)) 1(Tu r)(x)<(r)(x)
1
xS
2 (
r)(x) (Tu
r)(x)
u (x)V (x)1(Tu r)(x)<(r)(x)
1
V (x)
xS
2Tu V
Tu
r,1/V
u ,V (x)1(Tu r)(x)<(r)(x)
r
1
xS
T
V Tu
r
r,1/V
2 u
T
r J ,1/V + J
r,1/V )
V (Tu
2 u
T
V (1 + V )J
r,1/V
2 u
J 1,c ,
with probability greater than or equal to 1 , where second inequality follows from (12) and the fourth
inequality follows from Lemma 1. The error bound (11) then follows from (13) and (14).
Three aspects of Theorem 2 deserve further consideration. The rst of them is the dependence of the
number of sampled constraints (10) on . Two parameters of the RLP inuence the behavior of : the
Lyapunov function V and the bounding set N . Graceful scaling of the sample complexity bound depends
on the ability to make appropriate choices for these parameters.
The number of sampled constraints also grows polynomially with the maximum number of actions avail
able per state A, which makes the proposed approach inapplicable to problems with a large number of actions
per state. It can be shown that complexity in the action space can be exchanged for complexity in the state
space, so that such problems can be recast in a format that is amenable to our approach.
Finally, a major weakness of Theorem 2 is that it relies on sampling constraints according to the distri
bution u ,V . In general, u ,V is not known, and constraints must be sampled according to an alternative
distribution . Suppose that (x, a) = (x)/|Ax | for some state distribution . If is similar to u ,V ,
one might hope that the error bound (11) holds with a number of samples m close to the number suggested
in the theorem. We discuss two possible motivations for this:
1. It is conceivable that sampling constraints according to leads to a small value of
u ,V ({x : (
r)(x) (Tu
r)(x)}) (1 )/2,
with high probability, even though u ,V is not identical to . This would lead to a graceful sample
complexity bound, along the lines of (10). Establishing such a guarantee is closely related to the
problem of computational learning when the training and testing distributions dier.
2. If
Tu (Tu r r) C
T (Tu r r) ,
for some scalar C and all r, where
(x) =
(x)/V (x)
,
yS (y)/V (y)
16AC
48AC
2
m
K ln
+ ln
,
(1 )
(1 )
samples. It is conceivable that this will be true for a reasonably small value of C in relevant contexts.
How to choose is an open question, and most likely to be addressed adequately having in mind the
particular application at hand. As a simple heuristic, noting that u (x) c(x) as 0, one might choose
(x) = c(x)V (x).
References
[1] D. Anthony and N. Biggs. Computational Learning Theory. Cambridge University Press, 1992.
[2] R.M. Dudley. Central limit theorems for empirical measures. Annals of Probability, 6(6):899928, 1978.
April 14
Handout #23
Lecture Note 19
Overview
In the previous lecture, we studied constraint sampling as a generic approach to dealing with the large
number of constraints in the approximate LP, and showed that, by sampling a number of constraints that
is polynomial on the number of variables in the LP, it is possible to ensure that almost all constraints will
be satised, with high probability. The hope that an LP with a large number of constraints and a small
number of variables may be solved eciently either exactly or approximately stems from the fact that
many of the constraints should be redundant; in particular, it is known that only a number of constraints
equal to the number of variables is binding at the optimal solution. This gives hope that, at least in certain
problem-specic situations, other approaches besides constraint sampling may be used for dealing with the
large number of constraints. In particular, the following approaches may be possible:
We may be able to replace the original constraints T r r with an equivalent set of constraints
Ai r bi , i = 1, . . . , N where N is small;
Constraint Generation We may be able to solve the LP exactly without including all constraints by
solving it incrementally, as follows:
start with small subset of constraints
solve smaller LP
add one or more violated constraints
repeat until no violated constraints can be found.
Both approaches can be found in the literature; e.g., Morrison and Kumar [2] replace the exponentially many
constraints in the approximate LP with a manageable number of constraints in problems involving queueing
networks, and Grotschel and Holland [1] solve travelling salesman problems involving up to 260 constraints
by doing constraint generation. In todays lecture, we will study factored MDPs, a reasonably general class
of MDPs that lends itself well to both approaches.
Factored MDPs
The underlying idea in factored MDPs is that many high-dimensional MDPs are actually generated by
systems with many parts that are weakly interconnected. Each part i has an associated state variable Xi ,
so that the full state of the system is described by a vector (X1 , . . . , Xn ). We assume that costs are factored,
i.e.,
g(x) =
gj (XZj ),
(1)
j
where Zj {1, . . . , n} and XZj indicates a (hopefully small) subset of the state variables. Moreover, we also
assume that transition probabilities are factored, i.e.,
Pa (Xi (t + 1)|X(t)) = Pa (Xi (t + 1)|XZi (t)) i,
where once again Zi {1, . . . , n} and XZi indicates a (hopefully small) subset of the state variables. In
words, one way of viewing this assumptions is that costs are mostly local to the various parts of the system,
and dynamics are also mostly local, with each state variable being aected only the subset of state variables
it interacts with. Note that, in the long run, if all state variables are directly or indirectly interconnected,
the evolution of a particular state variable may still be aected by all others.
A common way of representing factored MDPs is through a dynamic Bayesian network, as shown in
Figure 1. The nodes at the left and right represent the state variables in subsequent time steps, and arcs
indicate the dependencies between state variables across time steps. This gure may be generalized to include
dependencies within the same time step.
g1 (x1 (t))
x1 (t)
g1 (x1 (t + 1))
x1 (t + 1)
x2 (t)
x2 (t + 1)
x3 (t)
x3 (t + 1)
xn (t)
xn (t + 1)
Figure 1: Each state is inuenced by a small subset of states in every time stage
Example 1 Consider the queueing network represented in Figure 2. With our usual choice of costs ga (x) =
i xi , corresponding to the total number of jobs in the system, it is clear that stage costs are factored.
Moreover, transition probabilities are also factored; for instance, we have
Pa (x2 (t + 1)|x(t)) = Pa1 ,a2 (x2 (t + 1)|x1 (t), x2 (t)),
since the number of jobs in queue 2 in the next time step is determined exclusively by potential departures
from queue 1 which depend only on x1 (t) and a1 (t) and potential departures from queue 2 which
depend only on x2 (t) and a2 (t).
2
x1
x2
a1
a2
a3
J (x)
i (xwi )ri J(x), Wi {1, . . . , n}.
Note that, in general, the optimal cost-to-go function J does not have an exact factored representation.
However, factored approximations are appealing both because, if the system is indeed only loosely intercon
nected, we can expect J to be roughly factored, and perhaps most importantly, factored approximations
give rise to decentralized policies. Indeed, note that Q factors associated with a factored approximation J
are also factored:
Q(x, a) = ga (x) +
Pa (x; y)J(y)
y
= ga (x) +
Pa (x1 , . . . , xn ; y1 , . . . , yn )
y1 ,...,yn
= ga (x) +
= ga (x) +
yWi
yWi
(yWi )ri
= ga (x) + f (xZi ; r)
since yWi (t + 1) is only inuenced by a subset XZi (t) of X(t).
ga (x) +
Pa (x, y)(y)r (x)r
y
can be dealt with eciently when we consider factored MDPs with factored cost-to-go function approxima
tions. For simplicity, let us denote each state-action pair (x, a) by a vector-valued variable t = (t1 , t2 , . . . , tm ) =
(x1 , . . . , xn , a1 , . . . , ap ). Then we are interested in dealing with a set of constraints
(2)
fi (tWi , r) 0, t.
i
The main diculty is that t can take on an unmanageably large number of values as many as the number of
state-action pairs in the system. We will show that these constraints can be replaced by a smaller, equivalent
subset. Moreover, we will show identifying a violated constraint can be done eciently, which allows for
using constraint generation schemes.
The rst step is to rewrite (2) as
max
fi (tWi , r) 0.
t
Consider solving the maximization problem above for a xed value of r. The naive approach is to enumerate
all possible values of t and take the one leading to maximum value of the objective. However, since each of
the terms fi (tWi , r) depends only on a subset tWi of the components of t, the problem can be solved more
eciently via variable elimination. We illustrate the procedure through the following example.
Example 2 Consider
max f1 (t1 , t2 ) + f2 (t2 , t3 ) + f3 (t2 , t4 ) + f4 (t3 , t4 ).
For simplicity, assume that ti {0, 1}, for i = 1, 2, 3, 4. If we solve the optimization problem above by
enumerating all possible solutions, there are on the order of O(24 ) operations. Consider optimizing over one
variable at a time, as follows:
1. Eliminate variable t2 : For each possible value of t2 ,t3 , we nd
e1 (t2 , t3 ) = max f3 (t2 , t4 ) + f4 (t3 , t4 ),
t4
t1 ,t2 ,t3
The previous example suggests variable elimination as an ecient approach to verifying whether a can
didate solution r is feasible for all constraints, and identifying a violated constraint if that is not the case.
Therefore constraint generation can be implemented eciently when we consider factored MDPs with fac
tored cost-to-go function approximations. Moreover, the procedure described in the example can also be
used to generate a smaller set of constraints, if we introduce new variables in the LP. Indeed, let ei (tZi ) be
each of the functions involved in the scheme (including the original functions f ). Each function is given by
ei (tZi ) = max
ek (tZi , tji ).
t ji
kKi
i
For each function ei , we introduce a set of variables utei , where each ti corresponds to one possible assignment
to variables tZi ; for instance, in the example above, we would have variables
01
10
11
u00
e1 , ue1 , ue1 , ue1 ,
associated with function e1 (t2 , t3 ) and all possible assignments for variables t2 and t3 .
With this new denition, the original constraints can be replaced with
i
utei
kKi
References
[1] M. Grotschel and O. Holland. Solution of large-scale symmetric travelling salesman problems. Mathe
matical Programming, 51:141202, 1991.
[2] J.R. Morrison and P.R. Kumar. New linear program performance bounds for queueing networks. Journal
of Optimization Theory and Applications, 100(3):575597, 1999.
April 21
Handout #24
Lecture Note 20
So far, we have focused on nding an optimal or good policy indirectly, by solving Bellmans equation either
exactly or approximately. In this lecture, we will consider algorithms that search for a good policy directly.
We will focus on averagecost problems. Recall that one approach to nding an averagecost optimal
policy is to solve Bellmans equation
e + h = T h.
Under certain technical conditions, ensuring that the optimal average cost is the same regardless of the initial
state in the system, it can be shown that Bellmans equation has a solution ( , h ), corresponding to the
optimal average cost, and h is the dierential cost function, from which an optimal policy can be derived.
An alternative to solving Bellmans equation is to consider searching over the space of policies directly, i.e.,
solving the problem
min (u),
(1)
uU
where (u) is the average cost associated with policy u and U is the set of all admissible policies. In the
past, we have been most focused on policies that are stationary and deterministic; in other words, if S is
the state space and A is the action space (consider a common action space across states, for simplicity), we
have considered the set of policies u : S A, which prescribe an action u(x) for each state x. Note that, if
U is the set of all deterministic and stationary policies, we have |U | = |A||S | , so that problem (1) involves
optimization over a nite and exponentially large set (in fact, |U | grows exponentially in the size of the state
space, or doubleexponentially in the dimension of the state space!).
In order to make searching directly in the policy space tractable, we are going to consider restricting the
set of policies U in (1). Specically, we are going to let U be a set of parameterized policies:
U = {u : K },
where each policy u corresponds to a randomized and stationary policy, i.e., u (x, a) gives the probability
of taking action a given that the state is x. We let g , P and () denote the stage costs and transition
probability matrix associated with policy u :
g (x) =
ga (x)u (x, a)
a
P (x, y)
for this problem could compare Rt with a certain threshold i and only accept a new request of type i if
Rt i .
American Options
Consider the problem of when to exercise the option to buy a certain stock at a prespecied price K. A
possible threshold policy is to exercise the option (i.e., buy the stock) if the market price at time t, Pt , is
larger that a threshold t .
Once we restrict attention to the class of policies parameterized by a real vector, problem (1) becomes a
standard nonlinear optimization problem:
min ().
(2)
k
With appropriate smoothness conditions, we can nd a local optimum of (2) by doing gradient descent:
k+1 = k k (k ).
(3)
In the next few lectures, we will show that biased and unbiased estimates of the gradient () can be
computed from system trajectories, giving rise to simulationbased gradient descent methods.
1.1
We rst introduce assumptions that ensure the existence and dierentiability of ().
Assumption 1 Let = {P | k } and be the closure of . The Markov Chain associated with P is
irreducible and there exists x that is recurrent for every P .
Assumption 2 P (x, y) and g (x) are bounded, twice dierentiable, with bounded rst and second deriva
tives.
Lemma 1 Under Assumption 1 and 2, for every policy there is a unique stationary distribution satis
fying T = T P , T e = 1, and () = T g . Moreover, () and are dierentiable.
In order to develop a simulationbased method for generating estimates of the gradient, we will show
that () can be written as the expected value of certain functions of pairs of states (x, y), where (x, y) is
distributed according to the stationary distribution (x)P (x, y) of pairs of consecutive states.
First observe that
() = T g + T g
(4)
It is clear that the secibd term T g can be estimated via simulation; in particular, we know that
T 1
1
g (xt ).
T T
t=0
T g = lim
Hence if we run the system with policy , and generate a suciently long trajectory x1 , x2 , . . . , xT , we can
set
T
1
g (xt ).
(x) g (x)
T t=1
x
The key insight is that the rst term in (4), T g , can also also be estimated via simulation. In order to
show that, we start with the following theorem.
2
(5)
(6)
(7)
2
It is still not clear how to easily compute (5) from the system trajectory. Note that
T 1
1
(
P (xt , y)h (y)),
T T
t=0 y
T P h = lim
which suggests averaging f (x) = y P (x, y)h (y) over a system trajectory x0 , x1 , . . . , xT , however this
gives rise to two diculties: rst, we must perform a summation over y in each step, which may involve
a large number of operations; second, we do not know h (y). We can get around the rst diculty by
employing an artifact known as the likelihood ratio method. Indeed, let
L (x, y) =
P (x, y)
.
P (x, y)
T P h =
(x)
P (x, y)h (y)
x
and assuming that we can compute or estimate h , we can estimate (5) from a trajectory x0 , x1 , xT by
considering
T 1
1
T P h
L (xt , xt+1 )h (xt+1 ).
T t=0
Our last step will be to show that we can get unbiased estimates of h by looking at cycles in the system
trajectory between visits to the recurrent state x . This follows from the following observation, which was
proved in Problem Set 2:
Theorem 2 Amazing Fact 2
Let x be a recurrent state under policy . Let
T = min{t > 0 : xt = x }
Then
h (x)
T 1
(g (xt ) ())|x0 = x
t=0
h (x )
tm+1 1
m , n = tm + 1, . . . , tm+1 1,
g (xt )
t=n
and
m) =
F (
tm+1 1
n=tm
m ) gives a biased estimate of (), where the bias is on the order of O(|()
m |):
Then F (
Theorem 3
= E (T )() + G() ()
E Fm ()
where G() is a bounded function.
We can update the policy by letting
m+1
m+1
m)
= m m Fm (
tm+1 1
m + m
m
gm (xn )
=
n=tm
where > 0.
Assumption 4
m=1
m = and
m=1
2
m
< .
w.p. 1.
April 26
Handout #25
Lecture Note 21
1.1
Recall that we are interested in nding K such that () = 0, i,e, the policy parameterized policy u
corresponds to a local average cost minimum among the class of parameterized policies under consideration.
In the previous lecture, we proposed an algorithm for performing gradient descent based on system trajec
tories. We assume that there is a state x that is recurrent under all policies u . The algorithm generates a
series of policies 1 , 2 , . . . , which are updated whenever the system visits state x . The algorithm is given
as follows:
1. Let 0 be the initial policy. Assume (for simplicity) that the initial state is x0 = x . Let m = 0,
tm = 0.
2. Generate a trajectory xtm +1 , xtm +2 , . . . , xtm+1 according to policy um , where
tm+1 = inf{t > tm : xt = x }.
3. Let
h (xn )
tm+1 1
m , n = tm + 1, . . . , tm+1 1
g (xt )
t=n
m)
F (
tm+1 1
n=tm
m+1
m+1
m)
= m m Fm (
tm+1 1
m + m
m
gm (xn )
=
n=tm
Assumption 4
m=1
m = and
m=1
2
m
< .
w.p. 1.
The proof for the stochastic algorithm follows a similar argument. It turns out that neither the ODE or
Lyapunov function approaches apply directly, and a customized, lengthy argument must be developed. The
full proof can be found in [1].
For the convergence of t , we discuss two cases:
(1) 0 (0 )
In this case, we rst argue that t (t ). Indeed, suppose 0 = (t0 ) for some t0 . Then either
(t0 ) = 0, and the ODE reaches an equilibrium, or (t0 ) < 0 and t0 = 0. We conclude that
t (t ), t.
From the above discussion, we conclude that t is nonincreasing and bounded. Therefore, t converges.
(2) 0 < (0 )
We have two possible situations:
(i) t < (t ), t
(ii) t0 = (t0 ) for some t0 In this case, we are back to case (1).
We conclude that t converges, and thus ((t ) t ) 0. Therefore, t (t ) asymptotically, and
(t ) 0.
2
1.2
We now develop a version of gradient descent where the policy is updated in every time step, rather than
only at visits to state x . The algorithm has the advantage of being simpler and potentially faster.
First note that Fm can be computed incrementally between visits to state x :
Fm (, )
tm+1 1
n=tm
tm+1 1
= g (x ) +
n=tm +1
tm+1 1
= g (x ) +
tn+1 1
n=tm +1
tm+1 1
= g (x ) +
L (xk1 , xk ) + g (xn )
g (xk )
k=n
g (xn ) +
n=tm +1
tm+1 1
= g (x ) +
tm +1
zn
g (xn ) + g (xn )
n=tm +1
where
zn =
k=tm +1
k )zk
k+1 = k k g (xk ) + (gk (xk )
0
if xk+1 = x
zk+1 =
zk + L (xk , xk+1 ) otherwise
Assumption 5 Let P = {P : k } and P be the closure of P. Then there exists N0 such that,
(P1 , P2 , . . . , PN0 ), Pi P , x,
N0
n
Pl (x, x ) > 0.
n=1 l=1
Assumption 6
k = ,
k2 < , k k1 , and
n+t
k=n (n
w.p.1.
The idea behind the proof of Theorem 2 is that, due to the assumptions on the step sizes k (Assumption
6, eventually changes in the policy m made between two consecutive visits to state x are negligible, and
the algorithm behaves very similarly to the oine version. Assumption 5 is required in order to guarantee
that the time between visits to state x remains small, even as the policy is not stationary.
1.3
In both the oine and online unbiased gradient descent algorithms, the variance of the estimates depends
on the variance of the times between visits to state x , which can be very large depending on the system.
=
().
The decrease in variance is traded against a potential bias in the estimate, i.e., we have E()
Note that a small amount of bias may still be acceptable since it should suce to have estimates that have
positive inner product with the true gradient, in order for the algorithm to converge:
E(),
() > 0.
t
J, (x) = E
g(xt )|x0 = x
t=0
Then we have
Theorem 3
() = (1 )T J, + T P J,
()
= T P, we have
T = T P + T P .
Hence,
()
= T J, T P J,
= T J, T J, + T P J,
=
(1 )T J, + T P J,
2
The following theorem shows that () can be used as an approximation to (), if is reasonably
close to one.
Theorem 4 Let () = T P J, . Then
lim () = ()
Proof: We have
J, =
()
e + h + O(|1 |).
1
Therefore,
(1
)T J,
=
=
()
(1
e + h + O(|1 |)
1
()
(1 )T
e + (1 )T (h + O(|1 |))
1
T
)
0 as 1
= ()T e + O(|1 |)
But T e = 1, we have T e = 0. Therefore,
(1 )T J, = 0 + O(|1 |) 0 as 1.
2
V ar J, = O
and
1
(1 )2
1
(g(xk )zk+1 k )
k+1
= zk + L (xk , xk+1 )
= k +
Then it can be shown that k (), if the policy is held xed. The gradient estimate k can be used
for updating the policy in an oine or online fashion, just as with the unbiased gradient descent algorithms.
Assumption 7
2. |g(xk )| B, x
3. |L (x, y)| B, x, y
Theorem 5 Under Assumption 7, we have
lim k (),
w.p.1.
References
[1] P. Marbach and J.N. Tsitsiklis. Simulationbased optimization of Markov reward processes. IEEE
Transactions on Automatic Control, 46(2):191209, 2001.
Ses #3
Handout #3
Problem Set 1
1. Suppose that an investor wants to maximize its expected wealth E[wT ] at time stage T. There are two
investment options available: a xedinterest savings account with interest rate of 3% in each time stage,
or a stock whose price pt uctuates according to P (pt+1 = (1 + ru )pt ) = P (pt+1 = (1 rd )pt ) = 0.5,
where 0 rd 1 and ru 0. The investor must decide at each time stage which fraction of its current
wealth to invest in each option.
(a) Suppose that there are no transaction costs involved in buying or selling stock. What is the optimal
policy? Suppose that, instead of maximizing E[wT ], the investor wants to maximize E[log wT ].
What is the optimal policy? Do you expect the investor to be more or less conservative in this
case?
(b) Suppose that buying or selling stock involves a transaction cost of 0.5% of the transaction value.
Formulate this problem as an MDP, and write its Bellmans equation.
(c) Solve the problem numerically with T = 20, rd = 0.9 and ru = 1.2, for both the case of maximizing
E[wT ] and maximizing E[log wT ]. Solve it again with rd = 0.7 and ru = 1.4. Analyze your results.
2. Let M1 , M2 , . . . , Mn be matrices with Mi having ri1 rows and ri columns for i = 1, 2, . . . , n and some
positive integers r0 , r1 , . . . , rn . The problem is to choose the order for multiplying the matrices that
minimizes the number of multiplications needed to compute the product M1 M2 . . . Mn . Assume that
matrices are multiplied in the usual way.
(a) Formulate the problem as an MDP such that using backwards induction for nding the optimal
order in which to multiply the matrices requires O(n3 ) operations.
(b) Find the optimal order in which to multiply the matrices when n = 4 and (r0 , r1 , r2 , r3 , r4 ) =
(10, 30, 70, 2, 100).
(c) Using the same numerical values of part (b), solve the problem where the objective is instead to
maximize the number of multiplications. What is the ratio of the maximum to minimum number
of multiplications?
3. Show that GaussSeidel value iteration still converges to J if states are chosen in an arbitrary order,
as long as each state is visited innitely many times.
4. Let c (0, 1]|S| satisfy
c(x) = 1.
xS
t Put .
t=0
(b) Let u = (1 )cT (I Pu )1 . Show that u is a probability distribution over S and u (x) > 0
for all x.
1
(x)|J(x)|.
xS
k
T J0 J0 .
1
(g) In class, we have proved a bound on JuJk J . Compare the guarantees oered by the
algorithm when the stopping criterion is JuJk J 1,c versus JuJk J . Which
one oers stronger guarantees? In which case does the algorithm stop rst? Can you think of
situations where it may make sense to use the rst criterion?
Ses #6
Handout #8
Problem Set 2
1. Consider an MDP with a goal state x
, and suppose that for every other state y and policy u, there is
k
) > 0. We will analyze the problem known as stochastic shortest path,
k {1, 2, . . . } such that Pu (y, x
, given that
dened as follows. For every state x, let T (x) denote the rst time stage t such that xt = x
x0 = x. Then the objective is to choose a policy u that minimizes
T (x)
E
gu (xt )|x0 = x .
t=0
(a) Dene a costtogo function for this problem and write the corresponding Bellmans equation.
(b) Show that Pu (T (x) > t) t/|S| , for some 0 < < 1 and all x.
(c) Show that Bellmans equation has a unique solution corresponding to the optimal costtogo
function and leads to an optimal policy.
(d) Consider nding the policy that minimizes the average cost in this MDP, and assume that we have
x) = 0. Show that h may be interpreted as
chosen the dierential cost function h such that h (
the costtogo function in a stochastic shortest path problem.
(e) Dene operators Tu and T for the stochastic shortest path problem. Show that they satisfy the
monotonicity property. Is the oset property satised? Why?
(f) (bonus) Dene the weighted maximum norm to be given by
J, = max (x)|J(x)|,
x
for some positive vector . Show that Tu and T are contractions with respect to some weighted
maximum norm contraction.
2. Consider the problem of controlling the service rate in a single queue, based on the current queue
length x. In any time stage, at most one of two types of events may happen: a new job arrives at the
queue with probability , or a job departs the queue with probability 1 + a2 , where a {0, 1} is the
current action. At each time stage, a cost ga (x) = (1 + a)x is incurred. The objective is to choose a
policy so as to minimize the average cost.
(a) Model this problem as an MDP. Write Bellmans equation and dene the operator T .
(b) Show that the dierential cost function h is convex.
(c) Show that the optimal policy is of the form u (x) = 0 if and only if x x
for some x.
(d) Take the dierential cost function h such that h (0) = 0. Show that there is such that
x2 h 1 x2 .
3. Consider an MDP with two states 0 and 1. Upon entering state 0 the system stays there permanently
at no cost. In state 1 there is a choice of staying there at no cost or moving to state 0 at cost 1. Show
that every policy is average cost optimal, but the only stationary policy that is Blackwell optimal is
the one that keeps the system in the state it currently is.
1
T (
x)1
(x) = E
(xt = x)|x0 = x
t=0
E[T
(
t=0
lim
Ses #11
Handout #13
Problem Set 3
1. Give an example where Qlearning is implemented with greedy policies (i.e., ut = mina Qt (xt , a)) and
fails to converge. How can it be modied so that convergence is ensured?
2. Suppose operator T is a contraction with respect to 2 . Does GaussSeidel value iteration converge?
2 for all J, J and there is a unique J such that
2 J J
3. Suppose operator F satises F J F J
J = F J .
(a) Let G J = (1 )J + F J. Show that there is (0, 1) such that G J J 2 < J J 2 .
(b) Consider Jt = F Jt Jt . Show that Jt converges to J .
J J
for all J, J and there is a unique J
4. (bonus) Suppose operator F satises F J F J
such that J = F J . Consider Jt = F Jt Jt . Show that Jt converges to J
Ses #16
Handout #20
Problem Set 4
1. Consider an MDP where actions A are vectors (A1 , . . . , An ) An , for some set A. Therefore in each
time stage the number of actions to be considered is exponential in the number n of action variables.
Show that this MDP can be converted into an equivalent one with A actions in each time stage but
a larger state space. (This problem shows that complexity in the action space can be traded for
complexity in the state space, which is addressed by value function approximation methods.)
2. Show that the VC dimension of the class of rectangles in d is 2d.
3. Another value function approximation algorithm based on temporal dierences is called least squares
policy evaluation (LSPE). We successively approximate the costtogo function J by J rk , k =
1, 2, . . . . Recall that (x) is the row vector whose ith entry corresponds to i (x). Dene the temporal
dierence relative to approximation rk :
dk (x, y) = g(x) + (y)rk (x)rk .
Then LSPE updates rk based on
rk
argmin
r
rk+1
m=0
2
lm
()
dk (xl , xl+1 )
l=m
= rk + (
rk rk ).
(xm )(xm ) ,
m=0
Ak
m=0
bk
zm
=
=
zm g(xk ),
m=0
()ml (xl ).
l=0
lim Ebk
lim EBk
= A = D(P I)
(P )m ,
m=0
= b = T D
m=0
= B = T D.
(P )m g,
Ses #20
Handout #24
Problem Set 5
1. Consider the average-cost LP:
max,h
s.t.
e + h T h,
where T h = minu gu + Pu h.
(a) Suppose that there is a unique optimal policy u , with a single class of recurrent states R. Show
where is the optimal average cost and
that the optimal solution of the LP is given by ( , h),
h(x)
= h (x) for all x R.
to the LP such that at
(b) Provide an example of an MDP such that there is an optimal solution h
denote its average cost, and h denote its stationary state distribution. Show that
h = hT (T h h ) T h h 1,h .
3. Let h be such that
T h h + e,
cT (T h h e) cT (h r) +
1 T
c h , 0.
maxr
T r r + e.
s.t.
Show that
2cT v
1 T
min h r,1/v +
c h .
r
1
(d) Suppose that v = e for some v. Let R() denote the set of optimal solutions to
cT (T
r
r e)
cT r
maxr
s.t.
T r r + e.
be arbitrary. Show that if ur is a greedy policy with respect to r, for some r R(),
Let and
Decentralized Strategies
for the assignment
problem
Hariharan Lakshmanan
Dynamic networks
Changing network topology example
wireless sensor networks.
Change is usually undirected
Sometimes changes need to be directed
example Mobile robots for search and
rescue operations
Related work
Chang et.al. applied a reinforcement
learning approach to learn node movement
policy to optimize long-term system
routing performance
Goldenberg et.al proposed a network
mobility control model for improving
system communication performance
Example continued
One source and two receivers
s
Decentralized assignment
problem
Initial configuration
s
Problem formulation
min max xij d ij i
j
Methodology
Simulator written Currently does not
communicate with neighbors
Uses Dynamic programming to solve local
assignment problems
Results
Converges to a feasible solution for the
limited problems tested so far.
Performance depends on the initial
configuration
Example
The green circles indicate destination
points and the blue circles represent nodes
1
Example continued
Example continued
Example continued
Re-Solving the assignment problem
periodically led to convergence
1
2
References
Y. Chang, T. H., L. P. Kaelbling (2003).
Mobilized ad-hoc networks: A reinforcement
learning approach, MIT AI Laboratory
D. Goldberg, J. L., A.S. Morse, B.E.Rosen, Y.R.
Yang (December, 2003). Towards mobility as a
network control primitive, Yale University
Nelson Uhan
Introduction
scheduling problems using techniques from combinatorial optimization. Bertnon [BC99] focus on a dierent class of stochastic scheduling
sekas and Casta
problems, the quiz problem and its variations. They show how rollout algorithms can be implemented eciently, and present experimental evidence that
the performance of rollout policies is near-optimal.
For this project, we consider a problem with one machine and an arbitrary
normalized regular and additive objective function. We recast our nite-horizon
decision problem into a stochastic shortest path problem. We show that for a
relaxed formulation of our problem, the error of the solution obtained by the approximate linear programming approach to dynamic programming as presented
in de Farias and Van Roy [dV03] is uniformly bounded over the number of jobs
that need to be scheduled, provided that the expected job processing times are
nite. Finally, we argue using results from dynamic programming that the approximate solution for the relaxed formulation of our problem is also not that
far away from the optimal solution of the original problem.
The problem
1 n1
where C (i) denotes the time of the ith job completion (C (0) = 0), Ri is the set
of jobs remaining to be processed at the time of the ith job completion, and h
is a set function such that h () = 0. Such an objective function is said to be
additive. The function h can be interpreted as the holding cost per unit time
on the set of uncompleted jobs. We also assume that is nondecreasing in the
job completion times.
Many common objective functions
in scheduling are nondecreasing and ad
ditive. For example, h (S) = jS wj for all S N generates the normalized
2.1
We can formulate this problem as a nite horizon MDP with a nite state space
S. For each state x S, there is a set of available actions Ax . Taking action
2
aAx
2.2
yS
Since our state space is exponential in size, we cannot hope to solve our problem
exactly using dynamic programming methods. All of the major methods for approximate dynamic programming consider innite-horizon discounted cost problems. In order to use these methods for our nite-horizon stochastic scheduling
problem, we recast our problem into a stochastic shortest path (SSP) problem. We refer to the following formulation of our problem as the original SSP
formulation.
We introduce a terminating state x
with the following property: only the
in one step.
states that have no more remaining jobs (Rx = ) can reach x
, the
Observe that for all states such that Rx = and at the terminating state x
set of actions available Ax is empty, and therefore the transition probabilities
and time stage costs involving x
are not aected by what action is taken. The
transition probabilities involving x
are
1 x such that Rx =
Pa (x, x)
= 1 if x = x
0 otherwise
3
0 x such that Rx =
ga (x, x) =
0 if x = x.
T (x)
J (x) = min E
gu (xt , xt+1 ) x0 = x .
u
t=0
where T (x) is the time stage when the system reaches the terminating state.
Recall that a stationary policy u is called a proper policy if, when using this
policy, there is positive probability that the terminating state will be reached
after a nite number of stages. Also recall that if all stationary policies are
proper, then the cost-to-go function for the SSP problem is the unique solution
to Bellmans equation [B01]. Since any policy in our scheduling problem requires
one additional job to be scheduled at each time stage, we must always reach the
terminating state with probability 1, regardless of the policy used. Therefore,
to solve our problem exactly, we can solve Bellmans equation.
3.1
maximize
subject to
Note that this optimization problem can be recast as the following linear program:
maximize
cT J
subject to
ga (x) +
yS
|
|
= 1 K ,
|
|
cT r
subject to
T r r.
(1)
This problem can be recast as a linear program just as in the ELP approach.
Given a solution r, we hopefully can obtain a good policy by using the greedy
r,
policy with respect to
r ) (y) .
Pa (x, y) (
u (x) = arg min ga (x) +
aAx
yS
de Farias and Van Roy [dV03] proved the following bound on the error of the
ALP solution.
Theorem 3.1. (de Farias and Van Roy 2003) Let r be a solution of the approximate LP (1), and (0, 1). Then, for any v RK such that (v) (x) > 0
for all x S and Hv < v,
J ,
r1,c
2cT v
min J , r,1/v
1 v r
where
(Hv) (x) = max
aAx
and
v = max
x
3.2
yS
(Hv) (x)
.
(v) (x)
T (x)
J , (x) = min E
t gu (xt , xt+1 ) x0 = x .
u
t=0
For this relaxation, we show that the error of the approximate linear programming solution is uniformly bounded over the number of jobs to be scheduled.
5
Theorem 3.2. Assume that the holding cost h (S) is bounded above by M for
all subsets S of N . Let r be the ALP solution to the -relaxed SSP formulation
of the stochastic scheduling problem. For (0, 1),
J ,
r1,c
2M maxiN E [pi ]
.
1
Proof. Due to the nature of our problem, we know that for any state x, T (x) =
= 0 for all x such that Rx = , and ga (
x, x)
=
|Rx | + 1. However, since ga (x, x)
0, when computing the cost-to-go function at state x, we only need to consider
time stage costs for time stages 0 through |Rx | 1. Since the holding cost h is
bounded by above by M for all subsets of N , for some policy u we have
|Rx |1
J , (x) = min E
t gu(xt ) (xt , xt+1 ) x0 = x
t=0
|Rx |1
E
t gu(xt ) (xt , xt+1 ) x0 = x
t=0
|Rx |1
E
gu(xt ) (xt , xt+1 ) x0 = x
t=0
|Rx |1
1
t=0
|Rx |1
1
=E
h (Rxt ) pu(xt ) x0 = x
n
t=0
|Rx |1
1
E
M pu(xt ) x0 = x
n
t=0
M
=
E [pi ]
n
iRx
yS
= k max
aAx
yS
1
Pa (x, y)
E [pi ]
n
iRy
k
k
=
E [pi ] E [pa ]
n
n
iRx
E [pa ]
= V (x)
iRx E [pi ]
< V (x)
|J , (x)|
xS
V (x)
M
iRx pi
n E
max k
xS
iRx pi
n
M
=
k
= max
k
c (x) V (x) =
c (x)
E [pi ]
n
xS
xS
iRx
1
=k
c (x)
E [pi ]
n
xS
iRx
k
c (x) max E [pi ]
xS
iN
= k max E [pi ] .
iN
This is uniformly bounded over the number of jobs, provided that the expected
processing
times of the jobs are bounded. Therefore, by Theorem 3.1, provided
J ,
r1,c
3.3
E [pa ]
.
(HV ) (x) = V (x) 1
iRx E [pi ]
As the number of jobs n , the ratio of the expected processing time of any
one job to the sum of the expected processing times of all jobs goes to zero.
Therefore, by using the methods from the proof of Theorem 3.2, we cannot nd
a < 1 such that HV V regardless of the number of jobs.
However, since the ALP solution for the -relaxed formulation is not that far
away from the optimal solution to the -relaxed formulation, we can show that
the ALP solution obtained for the -relaxed SSP formulation is also not that
far away from the optimal solution of the original SSP formulation, depending
on the value of . First, we briey present a well-known result from dynamic
programming that relates the costs of the innite-horizon discounted cost and
average-cost problems.
Lemma 3.1. For any stationary policy u and (0, 1), we have
J,u =
where
J,u =
Ju
+ hu + O (|1 |)
1
k Puk gu ,
Ju =
k=0
hu is a vector satisfying
N 1
1 k
lim
Pu
N N
k=0
gu ,
Ju + hu = gu + Pu hu ,
and O (|1 |) is an -dependent vector such that
lim O (|1 |) = 0.
|J J
= O (|1 |)
| O (|1 |)
where J is the cost of using the optimal greedy policy for the -relaxed SSP
formulation in the original SSP formulation (hu in Lemma 3.1). Since the costto-go functions for the original and -relaxed SSP formulations are not that far
8
apart when is large, the ALP solution for the -relaxed SSP formulation is
not that far away from the optimal solution to the original SSP formulation
when is large.
Corollary 3.1. Let r be the optimal solution to the ALP for the -relaxed SSP
formulation. Then for (0, 1),
J
r1,c
2M maxiN E [pi ]
cT O (|1 |) .
1
Proof. The result follows immediately from Theorem 3.2 and the arguments
above:
J
r1,c J J , 1,c + J ,
r1,c
2H maxiN E [pi ]
1
2H maxiN E [pi ]
cT O (|1 |)
cT |J J , | +
Conclusion
for our stochastic scheduling problem. It would be nice to see how the ALP
approach performs in practice, and whether or not in reality ALP solves the
original SSP formulation (with = 1) poorly as the size of the problem grows.
Finally, the ALP method relies on solving a linear program with a constraint
for every state-action pair, which in our problem would result in an extremely
large optimization problem. One might investigate how the constraint sampling
LP approach to dynamic programming [dV01] performs with the stochastic
scheduling problem we studied, both theoretically and practically.
References
[B01]
[BC99]
[d04]
[dV03]
de Farias, D. P., B. Van Roy (2003). The linear programming approach to approximate dynamic programming. Operations Research
51, pp. 850-865.
[dV01]
Rothkopf, M. H. (1966). Scheduling with random service times. Management Science 12, pp. 703-713.
[U96]
Uetz, M. (1996). Algorithms for deterministic and stochastic schedulur Mathematik, Technische Univering. Ph.D. dissertation, Institut f
at Berlin, Germany.
sit
10
Scheduling
2.997 Project
Outline
Scheduling Problems
Given a set of tasks and limited resources, we need to eciently use the
resources so that a certain performance measure is optimized
Scheduling is everywhere:
computer networks, etc.
manufacturing,
project management,
Stochastic Scheduling
Problem Denition
Set of jobs N = {1, . . . , n}
1 machine
Processing time of job i: discrete probability distribution pi
pi and pj pairwise stochastically independent for all i = j
The jobs have to be scheduled nonpreemptively
Objective: minimize
C (1), . . . , C (n)
n1
1
=
h (Ri) C (i+1) C (i)
n i=0
6
C (1), . . . , C (n)
n1
1
=
h (Ri) C (i+1) C (i)
n i=0
MDP formulation
Finite horizon, nite state space
State of the system:
x = (Cmax (x) , Rx) S
Cmax (x) is the completion time of the last job completed
Rx is the set of jobs remaining to be scheduled at state x
Note that the size of the state space is exponential in the number of
jobs.
Action at state x is the next job to be processed: a Ax Rx
8
pa (t)
Pa (x, y) =
0
aAx
yS
, t = 0, 1, . . . , n 1
10
1
Pa (x, x)
= 1
0
Time-stage costs involving x:
0
ga (x, x
) =
0
x:
x such that Rx =
if x = x
otherwise
x such that Rx =
if x = x.
11
T (x)
J (x) = min E
gu (xt, xt+1) x0 = x .
u
t=0
T (x) = time stage when the system reaches the terminating state
with probability 1
The cost-to-go function for the SSP problem is the unique solution
to Bellmans equation
12
cT J
subject to
TJ J
(c > 0)
maximize
cT r
subject to
T r r
(c > 0)
Theorem 1. (de Farias and Van Roy 2003) Let r be a solution of the
approximate LP. Then, for any v RK such that (v) (x) > 0 for all x S
and Hv < v,
2cT v
J
r 1,c
min J r
,1/v
1 v r
where
(Hv) (x) = max
and
v
aAx
yS
(Hv) (x)
= max
.
x
(v) (x)
14
T (x)
,
t
J (x) = min E
gu (xt, xt+1) x0 = x
u
t=0
For this relaxation, we show that the error of the ALP solution is
uniformly bounded over the number of jobs to be scheduled.
15
Main Result
Theorem 2. Assume that the holding cost h (S) M for all subsets S
of N . Let r be the ALP solution to the -relaxed SSP formulation of the
stochastic scheduling problem. For (0, 1),
J ,
r1,c
2M maxiN E [pi]
.
1
16
Outline of Proof
The cost-to-go function is
T (x)
,
t
J (x) = min E
gu(xt) (xt, xt+1) x0 = x
u
t=0
where
1
g (xt, xt+1) = h (Rx) (Cmax (xt+1) Cmax (xt))
n
Recall h is bounded from above by M . After some algebraic manipulation,
this quantity is found to be
M
E [pi]
n
iRx
17
k
V (x) =
E [pi]
n
iRx
V is a Lyapunov function
< 1 independent of n such that HV V
Also,
min J , r,1/V
r
M
k
18
xS
iN
2M maxiN E [pi]
1
19
ALP approach has an error bound for our relaxed stochastic scheduling
problem that does not grow with the number of jobs to be scheduled
20
Questions?
21
Abstract. The linear programming approach to approximate dynamic programming was introduced in [1]. Whereas the state relevance weight (i.e. the
cost vector) of the linear program does not matter for exact dynamic programming, it is not the case for approximate dynamic programming. In this paper,
we address the issue of selecting an appropriate state relevant weight in the
case of approximate dynamic programming. In particular, we want to choose
c so that there is a practical control of the approximate policy performance by
the capability of the approximation architecture. We present here some theoretical results and more practical guidelines to select a good state relevance
vector.
1. Introduction
The linear programming approach to approximate dynamic programming was
introduced in [1], and it is reviewed quickly in Section 2. Whereas the state relevance weight (i.e. the cost vector) of the linear program does not matter for exact
dynamic programming, it is not the case for approximate dynamic programming.
There are no guidelines in the literature to select an appropriate state relevant
weight in the case of approximate dynamic programming. In Section 3, we propose
to use available performance bounds on the suboptimal policy based on the approximate linear program to build a criterion for choosing the state relevance weight c.
We characterize appropriate state relevance weights as solutions of an optimization
problem (P). However, (P) cannot be solved easily so that we look for suboptimal
solutions in Section 4, in particular we prove in Section 5 that under some technical assumptions we can choose c as a probability distribution. Finally, we establish
some practical necessary conditions to choose c; one of them suggesting to reinforce
the linear program for approximate dynamic programming.
1.1. Finite Markov decision process framework. In this paper, we consider
nite Markov decision process (MDP): they have a nite state space S and a nite
control space U (x) for each state x in S. Let gu (x) be the expected immediate
cost of applying control u in state x. Pu (x, y) denotes the transition probability
from state x to state y under control u U (x). The objective
of the controller is
t
to minimize the -discounted cost E
t0 gu(t) (xt )|x0 .
First, observe that it is possible to transform any nite Markov decision process
with nitely many controls in another one where the immediate cost of an action is
the same for all actions. Indeed, consider the MDP comprising the original MDP
states plus one state for each state-action pair. In this MDP, the controller rst
chooses a control and the system moves in the corresponding state-action pair.
Date: May 15, 2004.
1
From there, the system incurs the cost corresponding to the state-action pair and
follows the original dynamics to the next state. Figure 1 provides a simple example
of the transformation of an MDP into another one with same immediate cost at
each state.
ga
x
x, a
y1
b
gb
ga
y1
a
b
y2
x, b
gb
y2
k+
min cT J
JR|S|
J(x) g(x) +
yS
Unfortunately, the linear program (LP) is often enormous with as many variables
as the cardinality of the state space S and as many constraints as there are stateaction pairs. Hence, (LP) is often intractable. Moreover, even storing a cost-to-go
vector J as a lookup table might not be amenable for large state space.
One approach to deal with this curse of dimensionality is to approximate J (x)
(x)r, where r Rm (usually m |S|) and (x) = (1 (x), . . . , m (x)) are given
feature vectors.
HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?
3
min cT r
rRm
(x)r g(x) +
yS
Notice that (ALP) has only m variables, but still as many constraints as (LP).
Hence, some large scale optimization technics are needed to solve (ALP), or alternatively [3] showed that constraints sampling is a viable approach to solve this
problem.
On the contrary to the case of the exact linear program (LP), there is no
guarantee that the optimal solution r(c) is independent of the choice of c > 0. The
objective of this paper is to provide a methodology to choose c, but to motivate
our criterion to select c, we rst need to introduce two performance bounds.
3. Two performance bounds
3.1. A general performance bound.
First, let us relate J to the cost-to-go of the policy, which is greedy with respect
to J, where J is any approximation of J .
Let J R|S | and uJ be the greedy policy with respect to J, i.e.
uJ (x) = argminuU (x) gu (x) +
Pu (x, y)J(y) .
yS
r(c)1,,ur(c) with J
the ALP policy by the architecture capability to approximate J . Hence, we would
like to nd a state relevance weight c > 0 that makes this guarantee as sharp as
possible.
In other words, the state relevance weight should be chosen so that
(P ) :
min cT v
c>0
T
,u
r(c)
:= (1 ) T (I Pur(c) )1 cT ,
(3.3)
Jur(c) J 1,
2cT v
J r ,1/v ,
1 v
by combining the bounds (3.1) and (3.2), and (P) tries to make the factor of the
right-hand side as small as possible.
Recall that r(c) depends on c so that we have a circular dependence between c
and r(c).
,ur(c)
r(c)
ur(c)
xT A1 = 1 xT
1
If T = T Pur(c) , then T = 1
T (I Pur(c) ). Hence, T = (1 ) T (I
Pur(c) )1 cT , and the optimal solution of (P) is c = .
In the following section, we give derive some simple feasible points but their
performance with respect to the objective of (P) can be very poor. Then in Section
5, we try to obtain better feasible points of (P), namely probability distributions.
HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?
5
21T v
J r ,1/v
1 v
2cT v
J r ,1/v .
1 v
If the state relevance weight c of the ALP could be chosen such that
(5.1)
cT = ,ur(c) := (1 ) T (I Pur(c) )1 ,
(3.3) would hold. Indeed, c veries (5.1) if and only if c is a probability distribution
that is feasible for (P). We hope that in this case, the bound (3.3) is practical.
5.1. A theoretical algorithm.
5.1.1. A naive algorithm.
A naive algorithm is as follows.
Algorithm A
(1) Start with k = 0 and any vector c0 0 such that xS c(x) = 1.
(2) Solve (ALP) for ck and let r(ck ) be any optimal solution.
r(ck ).
(3) Compute k := ,uk , where uk := u
r(ck ) is greedy with respect to
(4) Set ck+1 = k , do k = k + 1 and go back to 2
M
r
F
U
M
c r(c) r(c) ur(c) ,ur(c) , or, in a more compact fashion: c
,ur(c) .
Denition 5.1. Dene P = {p R|S| | p(x) 0,
xS p(x) = 1} as the space of
probabilities distributions. It is a compact, convex set.
Notice that is a mapping from P in itself, where P is the set of probability
distribution over S, i.e.
Lemma 5.2. c is a xed point of c veries (5.1)
However, it is not clear that the algorithm A has a xed point, and whether ck
converges. If the mapping was continuous from P to P , Brouwers theorem would
guarantee the existence of a xed point.
M
r
F
U
However, in the chain c r(c) r(c) ur(c) ,ur(c) , some of the
functions may not be continuous so that M needs not be continuous.
The function F is just a matrix multiplication. Therefore it is continuous.
1
The function
= (1
t t M : (Pu , gu ) (u) = (1 ) (I Pu )
Pu is also continuous.
)
t
HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?
7
Lemma 5.4. Let f be a bounded piecewise constant mapping from some vector space
E to F. Let g be a continuous function
from F to R that has nite integral.
Then, the function f : x f (x + y).g(y)dy is a well dened, continuous
F
function from E to F.
r[c]] is continuous in c.
c E[
r
r[c]] can also be written c r(c+ c0 ).g(c0 )dc0 and r is a piecewise
Proof. c E[
RN
constant, bounded function. Using the previous lemma gets the result.
Smoothing U
The function U is also discontinuous, in the same fashion as the function r. We
therefore use the same trick, but in a slightly dierent way. Instead of using
deterministic greedy policies, we use randomized, -greedy policies
Denition 5.6. Let > 0. The -greedy policy with respect to J is a randomized
policy u for which the action a is chosen in state x with probability
exp[(gu (x) + Pu (x)J)/]
.
uJ (u, x) =
aU (x) exp[(ga (x) + Pa (x)J)/]
[4] provides various continuity results for the greedy policies, which we will
use.
Proposition 5.7. limsup|T J(x) T J(x)| = 0
0 J,x
Denition 5.9. The randomized function (v, ) is dened by the following chain
of functions:
(v,)
r
F
c r(c) r(c) u (r(c)) ,u (r(c)) , or, in a compact fashion:
(v,)
c ,u (r(c))
Proposition 5.10. (v, ) is continuous from P to P.
Denition 5.11. Let v and be some positive numbers. The algorithm A(v, ) is:
1) Start from some c0 in P, and set k = 0.
2) Do ck+1 = (v, ) (ck )
3) Set k = k + 1 and go to 2.
Theorem 5.12. (v, ) has at least one xed point c(v, ) P
Proof. (v, ) is a continuous function on a compact, convex set. By application of
Brouwers theorem, it has a xed point.
Remark 5.13. Saying that (v, ) has at least one xed point is equivalent to saying
that A(v, ) has at least one xed point
However, the ck produced by the algorithm may still fail to converge so that
A(v, ) does not provide the value of a xed point.
5.1.3. Existence of xed point for the original algorithm A.. In this part, we will
use the previous theorem asserting the existence of a xed point to the algorithm
that holds for all variance v > 0 and all > 0 to show that there exists a xed
point to the original algorithm A.
Theorem 5.14. For any pair (vk , k ) in R2 with (vk , k ) > 0, denote Ck the set of
xed points of the algorithm A(vk , k ), which is not empty by Theorem 5.12.
If there is a sequence (vk , k )k0 of such pairs with (vk , k ) (0, 0), such that
there is an accumulation point c of the set Ck that yields a unique optimum if used
as a state relevance vector in (ALP), then c 0 is a probability distribution that
veries
:= (1 ) T (I Pur(c) )1
cT = ,u
r(c)
(5.2)
ck = ,uk
r(ck )
== (1 ) T (I Puk
)1 .
r (ck )
(5.4)
)1
r (c)
|S |
Recall that a -greedy policy u
chooses control u in
J with respect to J R
state x with probability
uJ (u, x) =
(5.5)
k+
r (c)
)1 = (1) T (I Pur(c) )1 .
HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?
9
(1 )1 cT (I Pur(c) )v
(1 )1 cT (I Puv )v
max cT r
rRm
T r r
cT = (1 ) T (I Pur(c) )
The last constraints are hard to deal with, but we can derive more tractable necessary conditions. In particular, the next proposition shows that they imply a system
of linear equations.
Proposition 5.17. If c veries (5.1), then the following system of linear equations
holds
(5.7)
(1 ) T
r(c) cT (I Pu )R, u U (x)
Pur(c)
r(c) Pu
r(c), u U (x)
10
Hence,
r(c) Pu
r(c), u U (x)
Pur(c)
(I Pur(c) )
r(c) (I Pu )
r(c), u U (x)
r(c) (I Pu
)1 (I Pu )
r(c), u U (x)
r(c)
The last equivalence follows from (I Pur(c) )1 =
ing both sides of the last equation by (1 ) T ,
t0
(1 ) T
r(c) (1 ) T (I Pu
)1 (I Pu )
r(c), u U (x)
r(c)
,u
r (c)
max cT r
rRm
T r r
(1 ) T r cT (I Pu )r, u U (x)
Notice that the last constraint enforces the equality ,ur(c) = c only on the
subspace {(I Pu )
r(c), u U }, whereas we need this condition to hold for
r(c)1,,r(c) =
r(c), r(c) being an optimal solution of (RALP) so that J
J
J
r(c)1,c .
6. Conclusion
We presented some new results for the choice of the state relevance weight c in the
approximate linear program. The criterion to choose c hinges on two performance
bounds that control the suboptimality of the ALP policy. However, these results
remain preliminary, in particular how to tailor the state relevance weight to the
problem setting remains an open question.
7. appendix
7.1. Insights on ,u .
By denition,
(7.1)
T,u := (1 ) T (I Pu )1 = (1 )
t T Put .
t0
Hence, ,u is a geometric average of the presence probability over the state space
after t transitions under policy u starting from the distribution . When Pu irreducible, limt+ T Put = uT , where u is the steady-state distribution of Pu , i,e,
uT = uT Pu . Thus we can wonder how far is u from ,u . We show now that in
general ,u is further away from u than .
Since u is also a left eigenvalue of (1 )(I Pu )1 , we have
T,u uT = ( u )T (1 )(I Pu )1 .
HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?
11
Linear programming
Let T be the DP operator for -discounted problem:
TJ=minu g + PuJ.
By monotonicity of T, J TJ J TJ TkJ J*.
Linear programming approach to DP:
For all c>0, J* unique optimal solution of
(LP): max cTx s.t. J(x) g(x)+ Pu(x,y)J(y), "(x,u)
E | J u J ( x) J * ( x) |; x ~ = J u J J *
where ,u = (1 ) ( I Pu )
T
1,
J J*
1, ,u J
1,c
2cT v
min J * r
1 v r
,1 / v
Compare with
J u r* J *
1,
r * J *
1, ,u J
1,c
2cT v
min J * r
1 v r
,1 / v
Compare with
J u r* J *
1,
r * J *
1, ,u J
Simple bounds
We want J * r * 1, K J * r * 1,c , K > 0
T
c
2
v
to yield J * J u
min J * r
K
,1 / v
1,
1 v r
This relation follows from ,u Kc
But r* depends implicitly on c via (ALP)
1. Trivially, c:=1. But poor bound for large state space
2. Algorithm using r*(c)=r*(Kc) for any K>0.
,u
r*
r*
r *
Reinforced ALP
Would like to solve (ALP) with the additional constraint
c = ,ur* = (1 ) T ( I Pur* ) 1
T
Conclusions
Some simple bounds on the (ALP) policy but
not necessarily tight.
Theoretical algorithm to find c as a probability
distribution.
Some insight in the role of c in (ALP)
Need practical algorithms depending on and
the Markov chain.