2.997 Spring 2004
February 4
Handout #1
Lecture Note 1
In this class we will study discrete-time stochastic systems. We can describe the evolution (dynamics) of
these systems by the following equation, which we call the system equation:

$x_{t+1} = f(x_t, a_t, w_t)$,     (1)

where $x_t \in S$, $a_t \in A_{x_t}$ and $w_t \in W$ denote the system state, decision and random disturbance at time
$t$, respectively. In words, the state of the system at time $t+1$ is a function $f$ of the state, the decision
and a random disturbance at time $t$. An important assumption of this class of models is that, conditioned
on the current state $x_t$, the distribution of future states $x_{t+1}, x_{t+2}, \dots$ is independent of the past states
$x_{t-1}, x_{t-2}, \dots$. This is the Markov property, which gives rise to the name Markov decision processes.
An alternative representation of the system dynamics is given through transition probability matrices: for
each state-action pair $(x, a)$, we let $P_a(x, y)$ denote the probability that the next state is $y$, given that the
current state is $x$ and the current action is $a$.
We are concerned with the problem of how to make decisions over time. In other words, we would like to
pick an action $a_t \in A_{x_t}$ at each time $t$. In real-world problems, this is typically done with some objective in
mind, such as minimizing costs, maximizing profits or rewards, or reaching a goal. Let $u(x, t)$ take values in
$A_x$, for each $x$. Then we can think of $u$ as a decision rule that prescribes an action from the set of available
actions $A_x$ based on the current time stage $t$ and current state $x$. We call $u$ a policy.
In this course, we will assess the quality of each policy based on costs that are accumulated additively
over time. More specifically, we assume that at each time stage $t$ a cost $g_{a_t}(x_t)$ is incurred. In the next
section, we describe some of the optimality criteria that will be used in this class when choosing a policy.
Based on the previous discussion, we characterize a Markov decision process by a tuple $(S, A_x, P_a(\cdot,\cdot), g_a(\cdot))$,
consisting of a state space, a set of actions associated with each state, transition probabilities and costs associated with each state-action pair. For simplicity, we will assume throughout the course that $S$ and $A_x$
are finite. Most results extend to the case of countably or uncountably infinite state and action spaces under
certain technical assumptions.
Optimality Criteria
In the previous section we described Markov decision processes, and introduced the notion that decisions
are made based on certain costs that must be minimized. We have established that, at each time stage $t$, a
cost $g_{a_t}(x_t)$ is incurred. In any given problem, we must define how costs at different time stages should be
combined. Some optimality criteria that will be used in the course are the following:

1. Finite-horizon total cost:

$E\left[ \sum_{t=0}^{T-1} g_{a_t}(x_t) \,\middle|\, x_0 = x \right]$     (2)
2. Average cost:

$\limsup_{T \to \infty} \frac{1}{T} E\left[ \sum_{t=0}^{T-1} g_{a_t}(x_t) \,\middle|\, x_0 = x \right]$     (3)

3. Discounted cost:

$E\left[ \sum_{t=0}^{\infty} \alpha^t g_{a_t}(x_t) \,\middle|\, x_0 = x \right]$,     (4)
where $\alpha \in (0, 1)$ is a discount factor expressing temporal preferences. The presence of a discount
factor is most intuitive in problems involving cash flows, where the value of the same nominal amount
of money at a later time stage is not the same as its value at an earlier time stage, since money at
the earlier stage can be invested at a risk-free interest rate and is therefore equivalent to a larger
nominal amount at a later stage. However, discounted costs also offer good approximations to the
other optimality criteria. In particular, it can be shown that, when the state and action spaces are
finite, there is a large enough $\bar{\alpha} < 1$ such that, for all $\alpha \in [\bar{\alpha}, 1)$, optimal policies for the discounted-cost
problem are also optimal for the average-cost problem. Moreover, the discounted-cost criterion tends
to lead to simplified analysis and algorithms.
Most of the focus of this class will be on discounted-cost problems.
Examples
Markov decision processes have a broad range of applications. We introduce some interesting applications
in the following.
Queueing Networks
Consider the queueing network in Figure 1. The network consists of three servers and two different external
arrival streams with fixed routes 1, 2, 3 and 4, 5, 6, 7, 8, forming a total of 8 queues of jobs at distinct processing stages. We
assume the service times are geometrically distributed: when a server $i$ devotes
a time step to serving a unit from queue $j$, there is a probability $\mu_{ij}$ that it will finish processing the unit in
that time step, independent of the past work done on the unit. Upon completion of that processing step, the
unit is moved to the next queue in its route, or out of the system if all processing steps have been completed.
New units arrive at the system in queues $j = 1, 4$ with probability $\lambda_j$ in any time step, independent of
previous arrivals.
[Figure 1: a multiclass queueing network with three machines (Machine 1, Machine 2, Machine 3) serving eight queues; external arrivals enter queues 1 and 4, and each machine serves its assigned queues with the corresponding service probabilities.]
A common choice for the state of this system is an 8-dimensional vector containing the queue lengths.
Since each server serves multiple queues, in each time step it is necessary to decide which queue each of
the different servers is going to serve. A decision of this type may be coded as an 8-dimensional vector $a$
indicating which queues are being served, satisfying the constraint that no more than one queue associated
with each server is being served, i.e., $a_i \in \{0, 1\}$, and $a_1 + a_3 + a_8 \le 1$, $a_2 + a_6 \le 1$, $a_4 + a_5 + a_7 \le 1$. We can
impose additional constraints on the choices of $a$ as desired, for instance considering only non-idling policies.
Policies are described by a mapping $u$ returning an allocation of server effort $a$ as a function of the system
state $x$. We represent the evolution of the queue lengths in terms of transition probabilities, i.e., the conditional
probabilities for the next state $x(t+1)$ given that the current state is $x(t)$ and the current action is $a(t)$.
For instance,

$P(x_1(t+1) = x_1(t) + 1 \mid x(t), a(t)) = \lambda_1$,

$P(x_3(t+1) = x_3(t) + 1, \; x_2(t+1) = x_2(t) - 1 \mid x(t), a(t)) = \mu_{22} \, I(x_2(t) > 0, a_2(t) = 1)$,

corresponding, respectively, to an arrival at queue 1, and to a departure from queue 2 together with an arrival
at queue 3. $I(\cdot)$ is the indicator function. Transition probabilities related to other events (e.g., a departure from queue 3, leaving the system) are defined
similarly.
We may consider costs of the form $g(x) = \sum_i x_i$, the total number of unfinished units in the system. For
instance, this is a reasonably common choice of cost for manufacturing systems, which are often modelled
as queueing networks.
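To make these dynamics concrete, the sketch below simulates one time step of such a network under a given allocation. The arrival probabilities, service probabilities and routing used here are illustrative placeholders (not the values in Figure 1).

```python
import random

# Hypothetical parameters: arrival probabilities for queues 1 and 4 (0-indexed 0 and 3),
# service-completion probabilities mu[j] for each queue, and the next queue in each route
# (None means the unit leaves the system). These values are placeholders for illustration.
lam = {0: 0.2, 3: 0.2}
mu = [0.6, 0.5, 0.7, 0.6, 0.5, 0.6, 0.5, 0.6]
next_queue = [1, 2, None, 4, 5, 6, 7, None]   # routes 1->2->3 and 4->5->6->7->8

def step(x, a):
    """One time step: x lists the 8 queue lengths, a[j] = 1 if queue j is served."""
    x = list(x)
    # service completions
    for j in range(8):
        if a[j] == 1 and x[j] > 0 and random.random() < mu[j]:
            x[j] -= 1
            if next_queue[j] is not None:
                x[next_queue[j]] += 1
    # external arrivals
    for j, p in lam.items():
        if random.random() < p:
            x[j] += 1
    return x

# Example: serve queues 1, 2 and 4 (one queue per machine) starting from an empty system.
print(step([0] * 8, [1, 1, 0, 1, 0, 0, 0, 0]))
```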
Tetris
Tetris is a computer game whose essential rule is to fit a sequence of geometrically different pieces, which fall
stochastically from the top of the screen, together so as to complete contiguous rows of blocks. Pieces arrive
sequentially and the shapes of successive pieces are independent and identically distributed. A falling piece can be
rotated and moved horizontally into a desired position. Note that the rotation and movement of a falling piece
must be scheduled and executed before it reaches the pile of pieces remaining at the bottom of the screen.
Once a piece reaches the pile, it must rest there and cannot be rotated or moved any further.
To put the Tetris game into the framework of Markov decision processes, one could define the state to
correspond to the current board configuration and the current falling piece. The decision at each time stage is where
to place the current falling piece. Transitions to the next board configuration follow deterministically from
the current state and action; transitions to the next falling piece are given by its distribution, which could
be, for instance, uniform over all piece types. Finally, we associate a reward with each state-action pair,
corresponding to the points awarded for the number of rows eliminated.
Portfolio Allocation
Portfolio allocation deals with the question of how to invest a certain amount of wealth among a collection
of assets. One could define the state as the wealth at each time period. More specifically, let $x_0$ denote the
initial wealth and $x_t$ the accumulated wealth at time period $t$. Assume there are $n$ risky assets, with
random rates of return $e_1, \dots, e_n$. Investors distribute fractions $a = (a_1, \dots, a_n)$ of their wealth
among the $n$ assets, and consume the remaining fraction $1 - \sum_{i=1}^{n} a_i$. The evolution of the wealth $x_t$ is given
by

$x_{t+1} = \sum_{i=1}^{n} a_{it} e_i x_t$.

Therefore, transition probabilities can be derived from the distribution of the rates of return of the risky
assets. We associate with each state-action pair $(x, a)$ a reward $g_a(x) = x \left( 1 - \sum_{i=1}^{n} a_i \right)$, corresponding to the
amount of wealth consumed.
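As a quick illustration of these dynamics, a minimal simulation sketch is given below; the return distribution and the allocation are arbitrary placeholders, not part of the model above.

```python
import random

def simulate_wealth(x0, alloc, T):
    """Simulate x_{t+1} = sum_i a_i * e_i * x_t for T periods.

    alloc: fractions invested in each asset (the rest is consumed each period).
    Returns the wealth trajectory and the total consumption reward sum_t x_t*(1 - sum_i a_i).
    """
    consumed_fraction = 1.0 - sum(alloc)
    x, wealth, reward = x0, [x0], 0.0
    for _ in range(T):
        reward += x * consumed_fraction           # g_a(x) = x (1 - sum_i a_i)
        # placeholder return model: each asset returns between 0.9 and 1.2 per period
        returns = [random.uniform(0.9, 1.2) for _ in alloc]
        x = sum(a * e * x for a, e in zip(alloc, returns))
        wealth.append(x)
    return wealth, reward

trajectory, total_reward = simulate_wealth(x0=100.0, alloc=[0.3, 0.3, 0.3], T=5)
print(trajectory, total_reward)
```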
Finding a policy that minimizes the finite-horizon cost corresponds to solving the following optimization
problem:

$\min_{u(\cdot,\cdot)} E\left[ \sum_{t=0}^{T-1} g_{u(x_t,t)}(x_t) \,\middle|\, x_0 = x \right]$     (5)
A naive approach to solving (5) is to enumerate all possible policies $u(x, t)$, evaluate the corresponding
expected cost, and choose the policy that minimizes it. However, note that the number of policies grows
exponentially in the number of states and time stages. A central idea in dynamic programming is that the
computation required to find an optimal policy can be greatly reduced by noting that (5) can be rewritten
as follows:
$\min_{a \in A_x} \left[ g_a(x) + \sum_{y \in S} P_a(x, y) \min_{u(\cdot,\cdot)} E\left[ \sum_{t=1}^{T-1} g_{u(x_t,t)}(x_t) \,\middle|\, x_1 = y \right] \right]$.     (6)
Define the optimal cost-to-go from stage $t_0$:

$J^*(x, t_0) = \min_{u(\cdot,\cdot)} E\left[ \sum_{t=t_0}^{T-1} g_{u(x_t,t)}(x_t) \,\middle|\, x_{t_0} = x \right]$.

It is clear from (6) that, if we know $J^*(\cdot, t_0 + 1)$, we can easily find $J^*(x, t_0)$ by solving

$J^*(x, t_0) = \min_{a \in A_x} \left[ g_a(x) + \sum_{y \in S} P_a(x, y) J^*(y, t_0 + 1) \right]$.     (7)

Moreover, (6) suggests that an optimal action at state $x$ and time $t_0$ is simply one that minimizes the
right-hand side in (7). It is easy to verify that this is the case by using backwards induction.
We call $J^*(x, t)$ the cost-to-go function. It can be found recursively by noting that

$J^*(x, T-1) = \min_{a \in A_x} g_a(x)$

and applying (7) backwards from $t = T-2$ down to $t = 0$.
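A compact sketch of this backward induction on a generic finite MDP is given below; the data layout (g[a][x], P[a][(x, y)]) and the function name are hypothetical choices for illustration.

```python
def backward_induction(S, A, g, P, T):
    """Finite-horizon DP: returns J[t][x] and an optimal policy u[t][x].

    S: list of states; A[x]: list of actions for state x;
    g[a][x]: one-stage cost; P[a][(x, y)]: transition probability (0 if absent).
    """
    J = [{x: 0.0 for x in S} for _ in range(T + 1)]       # J[T] = 0 (terminal cost)
    u = [{x: None for x in S} for _ in range(T)]
    for t in range(T - 1, -1, -1):                        # apply (7) backwards
        for x in S:
            best_cost, best_a = float("inf"), None
            for a in A[x]:
                cost = g[a][x] + sum(P[a].get((x, y), 0.0) * J[t + 1][y] for y in S)
                if cost < best_cost:
                    best_cost, best_a = cost, a
            J[t][x], u[t][x] = best_cost, best_a
    return J, u
```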
Based on the discussion for the finite-horizon problem, we may conjecture that an optimal decision for the
infinite-horizon, discounted-cost problem may be found as follows. Define the cost-to-go function

$J^*(x, t_0) = \min_{u(\cdot,\cdot)} E\left[ \sum_{t=t_0}^{\infty} \alpha^{t-t_0} g_{u(x_t,t)}(x_t) \,\middle|\, x_{t_0} = x \right]$,     (8)

and take, at state $x$ and time $t_0$, an action attaining

$\min_{a \in A_x} \left[ g_a(x) + \alpha \sum_{y \in S} P_a(x, y) J^*(y, t_0 + 1) \right]$.     (9)

We may also conjecture that, as in the finite-horizon case, $J^*(x, t)$ satisfies a recursive relation of the form

$J^*(x, t) = \min_{a \in A_x} \left[ g_a(x) + \alpha \sum_{y \in S} P_a(x, y) J^*(y, t + 1) \right]$.

The first thing to note in the infinite-horizon case is that, based on expression (8), we have $J^*(x, t) =
J^*(x, t') = J^*(x)$ for all $t$ and $t'$. Indeed, note that, for every $u$,

$E\left[ \sum_{t=t_0}^{\infty} \alpha^{t-t_0} g_{u(x_t,t)}(x_t) \,\middle|\, x_{t_0} = x \right] = E\left[ \sum_{t=0}^{\infty} \alpha^{t} g_{u(x_t,t)}(x_t) \,\middle|\, x_{0} = x \right]$.

Intuitively, since transition probabilities $P_u(x, y)$ do not depend on time, infinite-horizon problems look the
same regardless of the value of the initial time stage $t$, as long as the initial state is the same.
Note also that, since $J^*(x, t) = J^*(x)$, we can also infer from (9) that the optimal policy $u^*(x, t)$ does
not depend on the current stage $t$, so that $u^*(x, t) = u^*(x)$ for some function $u^*(\cdot)$. We call policies that do
not depend on the time stage stationary. Finally, $J^*$ must satisfy the following equation (Bellman's equation):

$J^*(x) = \min_{a \in A_x} \left[ g_a(x) + \alpha \sum_{y \in S} P_a(x, y) J^*(y) \right]$.
We now introduce some shorthand notation. For every stationary policy $u$, we let $g_u$ denote the vector with
entries $g_{u(x)}(x)$, and $P_u$ denote the matrix with entries $P_{u(x)}(x, y)$. We define the dynamic programming
operators $T_u$ and $T$ as follows. For every function $J : S \to \mathbb{R}$, we have

$T_u J = g_u + \alpha P_u J$,

and

$T J = \min_u T_u J$.
Lemma 1 (Monotonicity) If $J \le \bar{J}$, then $T_u J \le T_u \bar{J}$ for every $u$, and $T J \le T \bar{J}$.

Proof
First,

$T_u J = g_u + \alpha P_u J \le g_u + \alpha P_u \bar{J} = T_u \bar{J}$.

Now let $\bar{u}$ be such that $T \bar{J} = T_{\bar{u}} \bar{J}$. Then

$T J \le T_{\bar{u}} J \le T_{\bar{u}} \bar{J} = T \bar{J}$.

Lemma 2 (Offset) For every scalar $k$, we have

$T (J + k e) = T J + \alpha k e$,

which follows from $\sum_{y \in S} P_u(x, y) = 1$ for every $u$ and $x$.

Lemma 3 (Maximum-Norm Contraction) For all $J$ and $\bar{J}$, we have $\| T J - T \bar{J} \|_\infty \le \alpha \| J - \bar{J} \|_\infty$.

Proof
First, we have

$J \le \bar{J} + \| J - \bar{J} \|_\infty e$.

We now have

$T J \le T (\bar{J} + \| J - \bar{J} \|_\infty e) = T \bar{J} + \alpha \| J - \bar{J} \|_\infty e$,

so that $T J - T \bar{J} \le \alpha \| J - \bar{J} \|_\infty e$.

The first inequality follows from monotonicity and the second from the offset property of $T$. Since $J$ and $\bar{J}$
are arbitrary, we conclude by the same reasoning that $T \bar{J} - T J \le \alpha \| J - \bar{J} \|_\infty e$. The lemma follows.
February 9
Handout #2
Lecture Note 2
$A_x$ denotes a finite set of actions for state $x \in S$;
$g_a(x)$ denotes the finite time-stage cost for action $a \in A_x$ and state $x \in S$;
$P_a(x, y)$ denotes the transition probability when the action taken is $a \in A_x$, the current state is $x$, and
the next state is $y$.
Let $u(x, t)$ denote the policy for state $x$ at time $t$ and, similarly, let $u(x)$ denote the stationary policy for
state $x$. Taking the stationary policy $u(x)$ into consideration, we introduce the following notation

$g_u(x) \equiv g_{u(x)}(x)$,
$P_u(x, y) \equiv P_{u(x)}(x, y)$,

to represent the cost function and transition probabilities under policy $u(x)$.
In the previous lecture, we defined the discounted-cost, infinite-horizon cost-to-go function as

$J^*(x) = \min_u E\left[ \sum_{t=0}^{\infty} \alpha^t g_u(x_t) \,\middle|\, x_0 = x \right]$,

which satisfies Bellman's equation

$J^*(x) = \min_{a \in A_x} \left[ g_a(x) + \alpha \sum_{y \in S} P_a(x, y) J^*(y) \right]$.
Value Iteration
We now show that value iteration converges: for any $J_0$, $T^K J_0 \to J^*$ as $K \to \infty$.

Proof
Since $J_0(\cdot)$ and $g_\cdot(\cdot)$ are finite, there exists a real number $M$ satisfying
$|J_0(x)| \le M$ and $|g_a(x)| \le M$ for all $a \in A_x$ and $x \in S$. Then we have, for every integer $K \ge 1$ and real
number $\alpha \in (0, 1)$,

$J_K(x) = (T^K J_0)(x) = \min_u E\left[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) + \alpha^K J_0(x_K) \,\middle|\, x_0 = x \right]
\le \min_u E\left[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) \,\middle|\, x_0 = x \right] + \alpha^K M$.

From

$J^*(x) = \min_u E\left[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) + \sum_{t=K}^{\infty} \alpha^t g_u(x_t) \,\middle|\, x_0 = x \right]$,

we have

$(T^K J_0)(x) - J^*(x)
\le \min_u E\left[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) + \alpha^K J_0(x_K) \,\middle|\, x_0 = x \right]
- \min_u E\left[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) + \sum_{t=K}^{\infty} \alpha^t g_u(x_t) \,\middle|\, x_0 = x \right]$
$\le E\left[ \sum_{t=0}^{K-1} \alpha^t g_{u^*}(x_t) + \alpha^K J_0(x_K) \,\middle|\, x_0 = x \right]
- E\left[ \sum_{t=0}^{K-1} \alpha^t g_{u^*}(x_t) + \sum_{t=K}^{\infty} \alpha^t g_{u^*}(x_t) \,\middle|\, x_0 = x \right]$
$\le \max_u E\left[ \alpha^K |J_0(x_K)| + \sum_{t=K}^{\infty} \alpha^t |g_u(x_t)| \,\middle|\, x_0 = x \right]$
$\le \alpha^K M \left( 1 + \frac{1}{1 - \alpha} \right)$,

where $u^*$ is the policy minimizing the second term in the first line. We can bound $J^*(x) - (T^K J_0)(x) \le
\alpha^K M (1 + 1/(1-\alpha))$ by using the same reasoning. It follows that $T^K J_0$ converges to $J^*$ as $K$ goes to infinity.
Alternative Proof
We prove the statement by showing that $T^k J$ is a Cauchy sequence in $\mathbb{R}^n$.¹ Observe that

$\| T^{k+1} J_0 - T^k J_0 \|_\infty \le \alpha \| T^k J_0 - T^{k-1} J_0 \|_\infty \le \alpha^k \| T J_0 - J_0 \|_\infty \to 0$ as $k \to \infty$,

and therefore

$\| T^{k+m} J - T^k J \|_\infty
= \left\| \sum_{n=0}^{m-1} \left( T^{k+n+1} J - T^{k+n} J \right) \right\|_\infty
\le \sum_{n=0}^{m-1} \| T^{k+n+1} J - T^{k+n} J \|_\infty
\le \sum_{n=0}^{m-1} \alpha^{k+n} \| T J - J \|_\infty \to 0$

as $k, m \to \infty$.

From the above, we know that $\| T J - J^* \|_\infty \le \alpha \| J - J^* \|_\infty$. Therefore, the value iteration algorithm
converges to $J^*$. Furthermore, we notice that $J^*$ is the fixed point of the operator $T$, i.e., $J^* = T J^*$.
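As a concrete reference, a minimal tabular implementation of the value iteration analyzed above might look as follows; the data layout (dictionaries keyed by states and actions) and the stopping tolerance are illustrative choices.

```python
def value_iteration(S, A, g, P, alpha, tol=1e-8):
    """Iterate J <- TJ until the sup-norm change is below tol.

    S: states; A[x]: actions; g[a][x]: cost; P[a][(x, y)]: transition probability.
    Returns an approximation of J* and a greedy policy with respect to it.
    """
    J = {x: 0.0 for x in S}
    while True:
        TJ = {x: min(g[a][x] + alpha * sum(P[a].get((x, y), 0.0) * J[y] for y in S)
                     for a in A[x])
              for x in S}
        if max(abs(TJ[x] - J[x]) for x in S) < tol:
            break
        J = TJ
    policy = {x: min(A[x], key=lambda a: g[a][x] + alpha *
                     sum(P[a].get((x, y), 0.0) * J[y] for y in S))
              for x in S}
    return J, policy
```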
We next introduce another value iteration algorithm.
3.1

We next consider a variant of value iteration (Gauss-Seidel value iteration) in which states are updated one at a time, in increasing order, and each update immediately uses the values already refreshed in the current sweep:

$J_{K+1}(x) = \min_{a \in A_x} \left[ g_a(x) + \alpha \sum_{y < x} P_a(x, y) J_{K+1}(y) + \alpha \sum_{y \ge x} P_a(x, y) J_K(y) \right]$,     (1)

where $J_{K+1}(y)$ is used for states $y < x$ (updated already) and $J_K(y)$ for states $y \ge x$ (not updated yet). We
denote by $F$ the operator mapping $J_K$ to $J_{K+1}$.
Does the operator $F$ satisfy the maximum-norm contraction property? We answer this question with the following lemma.
¹ A sequence $x_n$ in a metric space $X$ is said to be a Cauchy sequence if for every $\epsilon > 0$ there exists an integer $N$ such that
$\| x_n - x_m \| \le \epsilon$ if $m, n \ge N$. Furthermore, in $\mathbb{R}^n$, every Cauchy sequence converges.
Lemma 1 $\| F J - F \bar{J} \|_\infty \le \alpha \| J - \bar{J} \|_\infty$.

Proof
$|(F J)(1) - (F \bar{J})(1)| = |(T J)(1) - (T \bar{J})(1)| \le \alpha \| J - \bar{J} \|_\infty$,

$|(F J)(2) - (F \bar{J})(2)| \le \alpha \max\left\{ |(F J)(1) - (F \bar{J})(1)|, |J(2) - \bar{J}(2)|, \dots, |J(|S|) - \bar{J}(|S|)| \right\} \le \alpha \| J - \bar{J} \|_\infty$.

Repeating the same reasoning for $x = 3, \dots$, we can show by induction that $|(F J)(x) - (F \bar{J})(x)| \le \alpha \| J - \bar{J} \|_\infty$
for all $x \in S$. Hence, we conclude $\| F J - F \bar{J} \|_\infty \le \alpha \| J - \bar{J} \|_\infty$.
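For comparison with the earlier sketch, an in-place sweep implementing update (1) might look as follows (same hypothetical data layout).

```python
def gauss_seidel_sweep(S, A, g, P, alpha, J):
    """One sweep of update (1): states are processed in order and each update
    immediately reuses the values already refreshed in this sweep (J is modified in place)."""
    for x in S:                       # increasing state order
        J[x] = min(g[a][x] + alpha * sum(P[a].get((x, y), 0.0) * J[y] for y in S)
                   for a in A[x])     # J[y] already holds the new value for y < x
    return J
```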
February 11
Handout #4
Lecture Note 3
Value Iteration

Lemma 1 Let $u_k$ be the greedy policy with respect to $J_k$, i.e., $T_{u_k} J_k = T J_k$. Then

$\| J_{u_k} - J_k \|_\infty \le \frac{1}{1 - \alpha} \| T J_k - J_k \|_\infty$.
Proof:

$J_{u_k} - J_k = (I - \alpha P_{u_k})^{-1} g_{u_k} - J_k$
$= (I - \alpha P_{u_k})^{-1} (g_{u_k} + \alpha P_{u_k} J_k - J_k)$
$= (I - \alpha P_{u_k})^{-1} (T J_k - J_k)$
$= \sum_{t=0}^{\infty} \alpha^t P_{u_k}^t (T J_k - J_k)$
$\le \sum_{t=0}^{\infty} \alpha^t P_{u_k}^t e \| T J_k - J_k \|_\infty$
$= \sum_{t=0}^{\infty} \alpha^t e \| T J_k - J_k \|_\infty$
$= \frac{e}{1 - \alpha} \| T J_k - J_k \|_\infty$,

where $I$ is the identity matrix and $e$ is a vector of ones of appropriate dimension. The third
equality comes from $T J_k = g_{u_k} + \alpha P_{u_k} J_k$, i.e., $u_k$ is the greedy policy w.r.t. $J_k$, and the fourth equality holds
because $(I - \alpha P_{u_k})^{-1} = \sum_{t=0}^{\infty} \alpha^t P_{u_k}^t$. By switching $J_{u_k}$ and $J_k$, we can obtain $J_k - J_{u_k} \le \frac{e}{1-\alpha} \| T J_k - J_k \|_\infty$,
and hence conclude

$|J_{u_k} - J_k| \le \frac{e}{1 - \alpha} \| T J_k - J_k \|_\infty$,

or, equivalently,

$\| J_{u_k} - J_k \|_\infty \le \frac{1}{1 - \alpha} \| T J_k - J_k \|_\infty$.     $\square$
Theorem 1

$\| J_{u_k} - J^* \|_\infty \le \frac{2}{1 - \alpha} \| J_k - J^* \|_\infty$.

Proof:

$\| J_{u_k} - J^* \|_\infty = \| J_{u_k} - J_k + J_k - J^* \|_\infty$
$\le \| J_{u_k} - J_k \|_\infty + \| J_k - J^* \|_\infty$
$\le \frac{1}{1 - \alpha} \| T J_k - J^* + J^* - J_k \|_\infty + \| J_k - J^* \|_\infty$
$\le \frac{1}{1 - \alpha} \left( \alpha \| J_k - J^* \|_\infty + \| J_k - J^* \|_\infty \right) + \| J_k - J^* \|_\infty$
$= \frac{2}{1 - \alpha} \| J_k - J^* \|_\infty$.

The second inequality comes from Lemma 1 and the third inequality holds by the contraction principle. $\square$
Before proving the main theorem of this section, we introduce the following useful lemma.
Lemma 2 If J T J, then J J . If J T J, then J J .
Proof: Suppose that J T J. Applying operator T on both sides repeatedly k 1 times and by the
monotonicity property of T , we have
J T J T 2 J T k J.
For suciently large k, T k J approaches to J . We hence conclude J J . The other statement follows the
same argument.
2
Theorem 2 Let $u^*$ be a greedy policy with respect to $J^*$, that
is, $J^* = T J^* = T_{u^*} J^*$. Then the stationary policy $u^*$ is optimal among all (possibly non-stationary) policies.

Proof: For any policy $\bar{u} = (u_1, u_2, \dots)$, let $J_{\bar{u},k} = T_{u_1} T_{u_2} \cdots T_{u_k} J^*$ denote the expected cost of following $\bar{u}$
for $k$ stages with terminal cost $J^*$. Then

$\| J_{\bar{u}} - J_{\bar{u},k} \|_\infty \le M \left( 1 + \frac{1}{1 - \alpha} \right) \alpha^k \to 0$ as $k \to \infty$.

If $\bar{u} = (u^*, u^*, \dots)$, then

$J_{\bar{u},k} = T_{u^*}^k J^* = T_{u^*}^{k-1} (T J^*) = T_{u^*}^{k-1} J^* = \dots = J^*$,

and therefore $J_{u^*} = J^*$. For any other policy, for all $k$,

$J_{\bar{u}} \ge J_{\bar{u},k} - M \left( 1 + \frac{1}{1-\alpha} \right) \alpha^k$
$= T_{u_1} \cdots T_{u_k} J^* - M \left( 1 + \frac{1}{1-\alpha} \right) \alpha^k$
$\ge T_{u_1} \cdots T_{u_{k-1}} T J^* - M \left( 1 + \frac{1}{1-\alpha} \right) \alpha^k$
$= T_{u_1} \cdots T_{u_{k-1}} J^* - M \left( 1 + \frac{1}{1-\alpha} \right) \alpha^k$
$\ge \dots \ge J^* - M \left( 1 + \frac{1}{1-\alpha} \right) \alpha^k$.

Letting $k \to \infty$ gives $J_{\bar{u}} \ge J^*$, so $u^*$ is optimal.
Policy Iteration

Policy iteration generates a sequence of policies $u_0, u_1, u_2, \dots$ as follows:

1. Start with an arbitrary policy $u_0$ and set $k = 0$.
2. Policy evaluation: compute $J_{u_k}$ by solving $J_{u_k} = g_{u_k} + \alpha P_{u_k} J_{u_k}$.
3. Policy improvement: choose $u_{k+1}$ greedy with respect to $J_{u_k}$, i.e., $T_{u_{k+1}} J_{u_k} = T J_{u_k}$; set $k = k + 1$ and return to step 2.

Theorem 3 Each policy generated by policy iteration is an improvement over the previous one, i.e., $J_{u_{k+1}} \le J_{u_k}$,
with strict inequality in at least one state unless $u_k$ is optimal.

Proof: If $u_k$ is optimal, then we are done. Now suppose that $u_k$ is not optimal. Then

$T J_{u_k} \le T_{u_k} J_{u_k} = J_{u_k}$,

with strict inequality for at least one state $x$. Since $T_{u_{k+1}} J_{u_k} = T J_{u_k}$ and $J_{u_k} = T_{u_k} J_{u_k}$, we have

$J_{u_k} = T_{u_k} J_{u_k} \ge T J_{u_k} = T_{u_{k+1}} J_{u_k} \ge T_{u_{k+1}}^n J_{u_k} \to J_{u_{k+1}}$ as $n \to \infty$.

Therefore, policy $u_{k+1}$ is an improvement over policy $u_k$. $\square$

In step 2, we solve $J_{u_k} = g_{u_k} + \alpha P_{u_k} J_{u_k}$, which may require a significant amount of computation. We
thus introduce another algorithm which requires less computation per iteration.
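A minimal sketch of exact policy iteration, assuming a tabular layout with numpy arrays indexed by state and action (an illustrative choice, not prescribed by the notes), is given below.

```python
import numpy as np

def policy_iteration(n_states, n_actions, g, P, alpha):
    """Exact policy iteration on a tabular MDP.

    g[a][x]: one-stage cost; P[a][x, y]: transition matrix for action a (numpy arrays).
    Returns an optimal policy u and its cost-to-go J_u.
    """
    u = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # policy evaluation: solve (I - alpha * P_u) J = g_u
        P_u = np.array([P[u[x]][x] for x in range(n_states)])
        g_u = np.array([g[u[x]][x] for x in range(n_states)])
        J = np.linalg.solve(np.eye(n_states) - alpha * P_u, g_u)
        # policy improvement: greedy with respect to J
        Q = np.array([[g[a][x] + alpha * P[a][x] @ J for a in range(n_actions)]
                      for x in range(n_states)])
        u_new = Q.argmin(axis=1)
        if np.array_equal(u_new, u):
            return u, J
        u = u_new
```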
3.1 Asynchronous Policy Iteration

1. Start with a function $J_0$, a policy $u_0$, and $k = 0$.
2. Select a subset $S_k \subset S$ and perform one of the following:
(a) value update: $J_{k+1}(x) = (T_{u_k} J_k)(x)$ for $x \in S_k$, $J_{k+1}(x) = J_k(x)$ otherwise, and $u_{k+1} = u_k$;
(b) policy update: $u_{k+1}(x)$ greedy with respect to $J_k$ for $x \in S_k$, i.e., $(T_{u_{k+1}} J_k)(x) = (T J_k)(x)$, $u_{k+1}(x) = u_k(x)$ otherwise, and $J_{k+1} = J_k$.
3. $k = k + 1$; go back to step 2.
Theorem 4 If $T_{u_0} J_0 \le J_0$ and infinitely many value and policy updates are performed on each state, then
$\lim_{k \to \infty} J_k = J^*$.

Proof: We prove this theorem in two steps. First, we will show that

$J^* \le J_{k+1} \le J_k, \quad \forall k$.

This implies that $J_k$ is a nonincreasing sequence. Since $J_k$ is bounded below by $J^*$, $J_k$ converges to some
value, i.e., $J_k \to \bar{J}$ as $k \to \infty$. Next, we will show that $J_k$ converges to $J^*$, i.e., $\bar{J} = J^*$.
Lemma 3 If $T_{u_0} J_0 \le J_0$, the sequence $J_k$ generated by asynchronous policy iteration converges.

Proof: We start by showing that, if $T_{u_k} J_k \le J_k$, then $T_{u_{k+1}} J_{k+1} \le J_{k+1} \le J_k$. Suppose we have a value
update. Then

$J_{k+1}(x) = (T_{u_k} J_k)(x) \le J_k(x)$ for $x \in S_k$, and $J_{k+1}(x) = J_k(x)$ for $x \notin S_k$,

so that $J_{k+1} \le J_k$ and, since $u_{k+1} = u_k$, $T_{u_{k+1}} J_{k+1} = T_{u_k} J_{k+1} \le T_{u_k} J_k \le J_{k+1}$ (the last inequality holds
componentwise by the two cases above).
Now suppose that we have a policy update. Then $J_{k+1} = J_k$. Moreover, for $x \in S_k$, we have

$(T_{u_{k+1}} J_{k+1})(x) = (T_{u_{k+1}} J_k)(x) = (T J_k)(x) \le (T_{u_k} J_k)(x) \le J_k(x) = J_{k+1}(x)$.

The first equality follows from $J_k = J_{k+1}$, the second equality and first inequality follow from the fact that
$u_{k+1}(x)$ is greedy with respect to $J_k$ for $x \in S_k$, the second inequality follows from the induction hypothesis,
and the third equality follows from $J_k = J_{k+1}$. For $x \notin S_k$, we have

$(T_{u_{k+1}} J_{k+1})(x) = (T_{u_k} J_k)(x) \le J_k(x) = J_{k+1}(x)$.

The equalities follow from $J_k = J_{k+1}$ and $u_{k+1}(x) = u_k(x)$ for $x \notin S_k$, and the inequality follows from the
induction hypothesis.
Since by hypothesis $T_{u_0} J_0 \le J_0$, we conclude that $J_k$ is a nonincreasing sequence. Moreover, we have
$T_{u_k} J_k \le J_k$, hence $J_k \ge J_{u_k} \ge J^*$, so that $J_k$ is bounded below. It follows that $J_k$ converges to some limit
$\bar{J}$. $\square$
Lemma 4 Suppose that $J_k \to \bar{J}$, where $J_k$ is generated by asynchronous policy iteration, and suppose that
there are infinitely many value and policy updates at each state. Then $\bar{J} = J^*$.

Proof: First note that, since $T J_k \le J_k$, by continuity of the operator $T$ we must have $T \bar{J} \le \bar{J}$. Now
suppose that $(T \bar{J})(x) < \bar{J}(x)$ for some state $x$. Then, by continuity, there is an iteration index $\bar{k}$ such that
$(T J_k)(x) < \bar{J}(x)$ for all $k \ge \bar{k}$. Let $k'' > k' > \bar{k}$ correspond to iterations of the asynchronous policy iteration
algorithm such that there is a policy update at state $x$ at iteration $k'$, a value update at state $x$ at iteration
$k''$, and no updates at state $x$ in iterations $k' < k < k''$. Such iterations are guaranteed to exist since
there are infinitely many value and policy update iterations at each state. Then we have $u_{k''}(x) = u_{k'+1}(x)$,
$J_{k''}(x) = J_{k'}(x)$, and

$J_{k''+1}(x) = (T_{u_{k''}} J_{k''})(x) = (T_{u_{k'+1}} J_{k''})(x) \le (T_{u_{k'+1}} J_{k'})(x) = (T J_{k'})(x) < \bar{J}(x)$.

The first equality holds because there is a value update at state $x$ at iteration $k''$, the second equality holds
because $u_{k''}(x) = u_{k'+1}(x)$, the first inequality holds because $J_k$ is decreasing and $T_{u_{k'+1}}$ is monotone, and
the third equality holds because there is a policy update at state $x$ at iteration $k'$.
We have concluded that $J_{k''+1}(x) < \bar{J}(x)$. However, by hypothesis $J_k \ge \bar{J}$ for all $k$, so we have a contradiction, and it must
follow that $T \bar{J} = \bar{J}$, so that $\bar{J} = J^*$. $\square$
February 17
Handout #6
Lecture Note 4
Average-cost Problems

In average-cost problems, we evaluate each policy $u$ according to its long-run average cost

$J_u(x) = \limsup_{T \to \infty} \frac{1}{T} E\left[ \sum_{t=0}^{T-1} g_u(x_t) \,\middle|\, x_0 = x \right]$.     (1)

Since the state space is finite, it can be shown that the $\limsup$ can actually be replaced with $\lim$ for any
stationary policy. In the previous lectures, we first found the cost-to-go functions $J^*(x)$ (for discounted
problems) or $J^*(x, t)$ (for finite-horizon problems) and then found the optimal policy through the cost-to-go
functions. However, in the average-cost problem, $J_u(x)$ does not offer enough information for an optimal
policy to be found; in particular, in most cases of interest we will have $J_u(x) = \lambda_u$ for some scalar $\lambda_u$, for
all $x$, so that it does not allow us to distinguish the value of being in each state.
We will start by deriving some intuition based on finite-horizon problems. Consider a set of states
$S = \{x_1, x_2, \dots, x^*, \dots, x_n\}$. Starting from some initial state $x$, the states are visited in a sequence, say

$x, \dots, x^*, \dots, x^*, \dots, x^*, \dots$,

where the segment before the first visit to $x^*$ is associated with a residual cost $h(x)$ and each subsequent
cycle between visits to $x^*$ has an average cost $\lambda^1_u, \lambda^2_u, \dots$.
Let $T_i(x)$, $i = 1, 2, \dots$, be the stages corresponding to the $i$th visit to state $x^*$, starting at state $x$. Let

$\lambda^i_u(x) = E\left[ \frac{ \sum_{t=T_i(x)}^{T_{i+1}(x)-1} g_u(x_t) }{ T_{i+1}(x) - T_i(x) } \right]$.

Intuitively, we must have that $\lambda^i_u(x)$ is independent of the initial state $x$ and that $\lambda^i_u(x) = \lambda^j_u(x)$, since we have the same
transition probabilities whenever we start a new cycle in state $x^*$. Going back to the definition of the function

$J^*(x, T) = \min_u E\left[ \sum_{t=0}^{T} g_u(x_t) \,\middle|\, x_0 = x \right]$,

this suggests the approximation

$J^*(x, T) \approx \lambda^*(x) T + h^*(x)$, as $T \to \infty$.     (2)

Note that, since $\lambda^*(x)$ is independent of the initial state, we can rewrite the approximation as

$J^*(x, T) = \lambda^* T + h^*(x) + o(T)$, as $T \to \infty$,     (3)

where the term $h^*(x)$ can be interpreted as a residual cost that depends on the initial state $x$ and will be referred
to as the differential cost function. It can be shown that

$h^*(x) = E\left[ \sum_{t=0}^{T_1(x)-1} \left( g_u(x_t) - \lambda^* \right) \right]$.
We can now speculate about a version of Bellman's equation for computing $\lambda^*$ and $h^*$. Approximating
$J^*(x, T)$ as in (3) and substituting it into the finite-horizon recursion, we have

$\lambda^* + h^*(x) \approx \min_{a \in A_x} \left[ g_a(x) + \sum_{y \in S} P_a(x, y) h^*(y) \right]$.     (4)

For the average-cost setting, define the operators $T_u h = g_u + P_u h$ and $T h = \min_u T_u h$. These operators are
monotone: let $h \le \bar{h}$ be arbitrary; then $T h \le T \bar{h}$ (and $T_u h \le T_u \bar{h}$).
Notice, however, that the contraction principle does not hold for $T h = \min_u T_u h$.
Bellman's Equation

The average-cost Bellman's equation is

$\lambda e + h = T h$, i.e., $\lambda + h(x) = \min_{a \in A_x} \left[ g_a(x) + \sum_{y \in S} P_a(x, y) h(y) \right]$ for all $x$.     (5)

Before examining the existence of solutions to Bellman's equation, we show that a solution of
Bellman's equation yields an optimal policy, via the following theorem.

Theorem 1 Suppose that $\lambda^*$ and $h^*$ satisfy Bellman's equation. Let $u^*$ be greedy with respect to $h^*$, i.e.,
$T h^* = T_{u^*} h^*$. Then

$J_{u^*}(x) = \lambda^*, \quad \forall x$,

and

$J_{u^*}(x) \le J_u(x), \quad \forall u$.
Proof: Let $u = (u_1, u_2, \dots)$ be an arbitrary policy and let $N$ be arbitrary. Then

$T_{u_{N-1}} h^* \ge T h^* = \lambda^* e + h^*$,

and

$T_{u_{N-2}} T_{u_{N-1}} h^* \ge T_{u_{N-2}} (h^* + \lambda^* e) = T_{u_{N-2}} h^* + \lambda^* e \ge T h^* + \lambda^* e = h^* + 2 \lambda^* e$.

Repeating the argument,

$T_{u_1} T_{u_2} \cdots T_{u_{N-1}} h^* \ge (N-1) \lambda^* e + h^*$.

Thus, we have

$E\left[ \sum_{t=0}^{N-2} g_u(x_t) + h^*(x_{N-1}) \,\middle|\, x_0 = x \right] \ge (N-1) \lambda^* + h^*(x)$.

By dividing both sides by $N$ and taking the limit as $N$ approaches infinity, we have¹

$J_u \ge \lambda^* e$.

Taking $u = (u^*, u^*, u^*, \dots)$, all the inequalities above become equalities. Thus

$\lambda^* e = J_{u^*}$.     $\square$
This theorem says that, if Bellman's equation has a solution, then we can obtain an optimal policy from it.
Note that, if $(\lambda^*, h^*)$ is a solution to Bellman's equation, then $(\lambda^*, h^* + ke)$ is also a solution, for every
scalar $k$. Hence, if Bellman's equation in (5) has a solution, then it has infinitely many solutions. However,
unlike the case of discounted-cost and finite-horizon problems, the average-cost Bellman's equation does not
necessarily have a solution. In particular, the previous theorem implies that, if a solution exists, then the
average cost $J_u(x)$ is the same for all initial states. It is easy to come up with examples where this is not
the case. For instance, consider the case when the transition probability matrix is the identity, i.e., each state
transitions to itself every time, and each state incurs a different cost $g(\cdot)$. Then the average cost depends
on the initial state, so no solution to Bellman's equation can exist. Hence, Bellman's equation does not
always have a solution.
¹ Recall that $J_u(x) = \limsup_{N \to \infty} \frac{1}{N} E\left[ \sum_{t=0}^{N-1} g_u(x_t) \,\middle|\, x_0 = x \right]$.
February 18
Handout #7
Lecture Note 5
In this lecture, we will show that optimal policies for discounted-cost problems with a large enough discount
factor are also optimal for average-cost problems. The analysis will also show that, if the optimal average
cost is the same for all initial states, then the average-cost Bellman's equation has a solution.
Note that the optimal average cost is independent of the initial state. Recall that

$J_u(x) = \limsup_{N \to \infty} \frac{1}{N} E\left[ \sum_{t=0}^{N-1} g_u(x_t) \,\middle|\, x_0 = x \right]$,

or, equivalently,

$J_u = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P_u^t g_u$.
We also let $J_{u,\alpha}$ denote the discounted cost-to-go function associated with policy $u$ when the discount factor
is $\alpha$, i.e.,

$J_{u,\alpha} = \sum_{t=0}^{\infty} \alpha^t P_u^t g_u = (I - \alpha P_u)^{-1} g_u$.

The following theorem formalizes the relationship between the discounted cost-to-go function and the average
cost.

Theorem 1 For every stationary policy $u$, there is $h_u$ such that

$J_{u,\alpha} = \frac{1}{1 - \alpha} J_u + h_u + O(|1 - \alpha|)$.     (1)

The proof is based on the expansion

$(I - \alpha P_u)^{-1} = \frac{1}{1 - \alpha} P_u^* + H_u + O(|1 - \alpha|)$,     (2)

where

$P_u^* = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P_u^t$,     (3)

$H_u = (I - P_u + P_u^*)^{-1} - P_u^*$,     (4)

and the matrices $P_u^*$ and $H_u$ satisfy

$P_u^* P_u = P_u P_u^* = P_u^* P_u^* = P_u^*$,     (5)

$P_u^* H_u = H_u P_u^* = 0$,     (6)

$P_u^* + H_u = I + P_u H_u$.     (7)

Proof: Let $M(\alpha) = (1 - \alpha)(I - \alpha P_u)^{-1} = (1 - \alpha) \sum_{t=0}^{\infty} \alpha^t P_u^t$. Each entry of $M$ satisfies

$|M(x, y)| = (1 - \alpha) \sum_{t=0}^{\infty} \alpha^t P_u^t(x, y) \le (1 - \alpha) \sum_{t=0}^{\infty} \alpha^t \cdot 1 = 1$,

and is a rational function $\frac{p(\alpha)}{q(\alpha)}$ of $\alpha$, bounded as $\alpha \to 1$. Hence $M$ admits an expansion around $\alpha = 1$ with
$M(1) = P_u^*$ and $H_u = -\frac{dM}{d\alpha}\big|_{\alpha=1}$, which gives

$(I - \alpha P_u)^{-1} = \frac{1}{1 - \alpha} P_u^* + H_u + O(|1 - \alpha|)$;

note also that $(1 - \alpha)(I - \alpha P_u)(I - \alpha P_u)^{-1} = (1 - \alpha) I$.
We now show that, for every $t \ge 1$, $P_u^t - P_u^* = (P_u - P_u^*)^t$. For $t = 1$, it is trivial. Suppose that the
result holds up to $n-1$, i.e., $P_u^{n-1} - P_u^* = (P_u - P_u^*)^{n-1}$. Then $(P_u - P_u^*)(P_u - P_u^*)^{n-1} = (P_u - P_u^*)(P_u^{n-1} -
P_u^*) = P_u^n - P_u P_u^* - P_u^* P_u^{n-1} + P_u^* P_u^* = P_u^n - P_u^*$. By induction, we have
$P_u^t - P_u^* = (P_u - P_u^*)^t$.
Now note that

$H_u = \lim_{\alpha \to 1} \frac{M(\alpha) - P_u^*}{1 - \alpha}$
$= \lim_{\alpha \to 1} \left[ (I - \alpha P_u)^{-1} - \frac{1}{1 - \alpha} P_u^* \right]$
$= \lim_{\alpha \to 1} \sum_{t=0}^{\infty} \alpha^t (P_u^t - P_u^*)$
$= I - P_u^* + \lim_{\alpha \to 1} \sum_{t=1}^{\infty} \alpha^t (P_u - P_u^*)^t$
$= \sum_{t=0}^{\infty} (P_u - P_u^*)^t - P_u^*$
$= (I - P_u + P_u^*)^{-1} - P_u^*$.

Hence $H_u = (I - P_u + P_u^*)^{-1} - P_u^*$.
We now show $P_u^* H_u = 0$. Observe

$P_u^* H_u = P_u^* \left[ (I - P_u + P_u^*)^{-1} - P_u^* \right] = \sum_{t=0}^{\infty} P_u^* (P_u - P_u^*)^t - P_u^* = P_u^* - P_u^* = 0$,

since $P_u^* (P_u - P_u^*)^t = 0$ for $t \ge 1$. Therefore, $P_u^* H_u = 0$.
Since $H_u + P_u^* = (I - P_u + P_u^*)^{-1}$ and $P_u^* H_u = 0$, we have $P_u^* + H_u = I + P_u H_u$.
By multiplying $P_u^k$ on both sides of $P_u^* + H_u = I + P_u H_u$ and summing over $k = 0, \dots, N-1$, we have

$N P_u^* + \sum_{k=0}^{N-1} P_u^k H_u = \sum_{k=0}^{N-1} P_u^k + \sum_{k=1}^{N} P_u^k H_u$,

or, equivalently,

$N P_u^* = \sum_{k=0}^{N-1} P_u^k + P_u^N H_u - H_u$,

which, after dividing by $N$ and letting $N \to \infty$, is consistent with (3).
Let $\pi_u$ denote a row of $P_u^*$. Then $\pi_u = \pi_u P_u$ and $\pi_u(x) = \sum_y \pi_u(y) P_u(y, x)$, so that $P(x_1 = x \mid x_0 \sim \pi_u) =
\pi_u(x)$. We can conclude that any row of the matrix $P_u^*$ is a stationary distribution for the Markov
chain under the policy $u$. However, does this observation mean that all rows of $P_u^*$ are identical?
Theorem 2

$J_{u,\alpha} = \frac{J_u}{1 - \alpha} + h_u + O(|1 - \alpha|)$, where $h_u = H_u g_u$.

Proof:

$J_{u,\alpha} = (I - \alpha P_u)^{-1} g_u$
$= \left[ \frac{P_u^*}{1 - \alpha} + H_u + O(|1 - \alpha|) \right] g_u$
$= \frac{P_u^* g_u}{1 - \alpha} + H_u g_u + O(|1 - \alpha|)$
$= \frac{1}{1 - \alpha} \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P_u^t g_u + \underbrace{h_u}_{= H_u g_u} + O(|1 - \alpha|)$
$= \frac{J_u}{1 - \alpha} + h_u + O(|1 - \alpha|)$.     $\square$
Blackwell Optimality
In this section, we will show that policies that are optimal for the discounted-cost criterion with large enough
discount factors are also optimal for the average-cost criterion. Indeed, we can actually strengthen the notion
of average-cost optimality and establish the existence of policies that are optimal for all large enough discount
factors.
Definition 1 (Blackwell Optimality) A stationary policy $u$ is called Blackwell optimal if there exists $\bar{\alpha} \in (0, 1)$
such that $u$ is optimal for all $\alpha \in [\bar{\alpha}, 1)$.
Theorem 3 There exists a stationary Blackwell optimal policy, and it is also optimal for the average-cost
problem among all stationary policies.

Proof: Since there are only finitely many policies, we must have for each state $x$ a policy $\bar{u}_x$ such that
$J_{\bar{u}_x,\alpha}(x) \le J_{u,\alpha}(x)$ for all policies $u$ and all large enough $\alpha$. If we take the policy $\bar{u}$ to be given by $\bar{u}(x) = \bar{u}_x(x)$, then $\bar{u}$
must satisfy Bellman's equation

$J_{\bar{u},\alpha} = \min_u \left\{ g_u + \alpha P_u J_{\bar{u},\alpha} \right\}$

for all large enough $\alpha$, so that $\bar{u}$ is Blackwell optimal; average-cost optimality among stationary policies then follows by letting $\alpha \to 1$ in Theorem 1.
Remark 1 It is actually possible to establish average-cost optimality of Blackwell optimal policies among
the set of all policies, not only stationary ones.

Remark 2 An algorithm for computing Blackwell optimal policies involves lexicographic optimization of $J_u$,
$h_u$ and higher-order terms in the Taylor expansion of $J_{u,\alpha}$.

Theorem 3 implies that, if the optimal average cost is the same regardless of the initial state, then the
average-cost Bellman's equation has a solution. Combined with Theorem 1 of the previous lecture, it follows
that this is a necessary and sufficient condition for existence of a solution to Bellman's equation.
Corollary 1 If $J_{u^*} = \lambda^* e$ for a Blackwell optimal policy $u^*$, then $\lambda^* e + h = T h$ has a solution $(\lambda^*, h_{u^*})$.

Indeed, using Theorem 1,

$\min_u \left\{ g_u + \alpha P_u J_{u^*,\alpha} \right\}
= \min_u \left\{ g_u + \alpha P_u \left( \frac{J_{u^*}}{1 - \alpha} + h_{u^*} + O(|1 - \alpha|) \right) \right\}$
$= \min_u \left\{ g_u + \alpha P_u \left( \frac{\lambda^* e}{1 - \alpha} + h_{u^*} + O(|1 - \alpha|) \right) \right\}$
$= \frac{\alpha \lambda^*}{1 - \alpha} e + \min_u \left\{ g_u + \alpha P_u h_{u^*} + O(|1 - \alpha|) \right\}$,

while the left-hand side equals $J_{u^*,\alpha} = \frac{\lambda^* e}{1 - \alpha} + h_{u^*} + O(|1 - \alpha|)$; cancelling the $\frac{1}{1-\alpha}$ terms and letting
$\alpha \to 1$ yields $\lambda^* e + h_{u^*} = T h_{u^*}$.
In the average-cost setting, existence of a solution to Bellman's equation actually depends on the structure
of transition probabilities in the system. Some sufficient conditions for the optimal average cost to be the
same regardless of the initial state are given below.

Definition 2 We say that two states $x, y$ communicate under policy $u$ if there are $k, k' \in \{1, 2, \dots\}$ such
that $P_u^k(x, y) > 0$ and $P_u^{k'}(y, x) > 0$.

Definition 4 We say that a state $x$ is transient under policy $u$ if, under $u$, it is only visited finitely many times
with probability 1.
Value Iteration

We want to compute

$\min_u \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P_u^t g_u$.

One way to obtain this value is to run the finite-horizon recursion for a finite but very large number of stages to
approximate the limit and speculate that such a limit is accurate. Hence we consider

$(T^k J)(x) = \min_u E\left[ \sum_{t=0}^{k-1} g_u(x_t) + J_0(x_k) \right]$.

Recall that $J^*(x, T) \approx \lambda^* T + h^*(x)$. Choose some state $\bar{x}$ and fix it; then

$J^*(x, T) - J^*(\bar{x}, T) \approx h^*(x) - h^*(\bar{x})$,

and

$h_k(x) = J^*(x, k) - k \lambda_k$, for some sequence $\lambda_1, \lambda_2, \dots$

Note that, since $(\lambda^*, h^* + ke)$ is a solution to Bellman's equation for all $k$ whenever $(\lambda^*, h^*)$ is a solution, we
can choose the value of a single state arbitrarily. Letting $h^*(\bar{x}) = 0$, we have the following commonly used
version of value iteration:

$h_{k+1}(x) = (T h_k)(x) - (T h_k)(\bar{x})$.     (8)

Theorem 5 Let $h_k$ be given by (8). Then if $h_k \to \bar{h}$, we have $\lambda^* = (T \bar{h})(\bar{x})$ and $h^* = \bar{h}$, i.e.,

$\lambda^* e + h^* = T h^*$.

Note that there must exist a solution to the average-cost Bellman's equation for value iteration to converge. However, it can be shown that existence of a solution is not a sufficient condition.
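A minimal sketch of the relative value iteration (8), reusing the hypothetical tabular layout of the earlier sketches, might look as follows; the fixed iteration count is an illustrative choice.

```python
def relative_value_iteration(S, A, g, P, x_ref, n_iter=1000):
    """Average-cost value iteration (8): h_{k+1}(x) = (T h_k)(x) - (T h_k)(x_ref).

    Returns the differential-cost estimate h and the average-cost estimate (T h)(x_ref).
    Data layout: g[a][x] is the one-stage cost, P[a][(x, y)] the transition probability.
    """
    def T(h):
        return {x: min(g[a][x] + sum(P[a].get((x, y), 0.0) * h[y] for y in S)
                       for a in A[x])
                for x in S}

    h = {x: 0.0 for x in S}
    for _ in range(n_iter):
        Th = T(h)
        offset = Th[x_ref]
        h = {x: Th[x] - offset for x in S}
    return h, T(h)[x_ref]
```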
February 23
Handout #9
Lecture Note 6
In the first part of this lecture, we will discuss the application of dynamic programming to the queueing
network introduced in [1], which illustrates several issues encountered in the application of dynamic programming to practical problems. In particular, we consider the issues that arise when value iteration is
applied to problems with a large or infinite state space.
The main points in [1], which we overview today, are the following:
Naive implementation of value iteration may lead to slow convergence and, in the case of infinite
state spaces, policies with infinite average cost in every iteration step, even though the iterates $J_k(x)$
converge pointwise to $J^*(x)$ for every state $x$;
Under certain conditions, with proper initialization $J_0$, we can have faster convergence and stability
guarantees;
In queueing networks, a proper $J_0$ can be found from well-known heuristics such as fluid model solutions.
We will illustrate these issues with examples involving queueing networks. For the generic results, including a proof of convergence of average-cost value iteration for MDPs with infinite state spaces, refer to
[1].
1.1

[Figure: the three-machine, eight-queue multiclass queueing network from Lecture 1 (Machines 1, 2 and 3), with external arrivals to queues 1 and 4.]

The cost is the total number of jobs in the system,

$g(x) = \sum_{i=1}^{N} x_i$,

and decisions are vectors $a \in \{0, 1\}^N$ indicating which queues are served.
The interpretation is as follows. At each time stage, at most one of the following events can happen: a new
job arrives at queue $i$ with probability $\lambda_i$; a job from queue $i$ that is currently being served has its service
completed, with probability $\mu_i$, and either moves to another queue or leaves the system, depending on the
structure of the network. Note that, at each time stage, a server may choose to process a job from any of the
queues associated with it. Therefore the decision $a$ encodes which queue is being processed at each server.
We refer to such a queueing network as multiclass because jobs at different queues have different service rates
and trajectories through the system.
As seen before, an optimal policy could be derived from the differential cost function $h^*$, which is the
solution of Bellman's equation:

$\lambda^* + h^* = T h^*$.

Consider using value iteration for estimating $h^*$. This requires some initial guess $h_0$. A common choice
is $h_0 = 0$; however, we will show that this can lead to slow convergence of $h_k$. Indeed, we know that $h^*$ is
equivalent to a quadratic, in the sense that it is bounded above and below by quadratic functions of the state.¹
With $h_0 = 0$,

$(T^k h_0)(x) = \min_u E\left[ \sum_{t=0}^{k-1} \sum_{i=1}^{N} x_i(t) \,\middle|\, x_0 = x \right]$.
Since at most one job arrives at queue $i$ per time step, with probability $\lambda_i$ (arrival), and the expected number of
departures $E[D_i(t)]$ is nonnegative (departure), we have

$E[x_i(1)] \le E[x_i(0)] + \lambda_i$,
$E[x_i(2)] \le E[x_i(0)] + 2\lambda_i$,
$\vdots$
$E[x_i(t)] \le E[x_i(0)] + t\lambda_i$.     (1)

¹ You will show this for the special case of a single queue with controlled service rate in problem set 2.
Thus,

$(T^k h_0)(x) \le \sum_{t=0}^{k-1} \sum_{i=1}^{N} \left( x_i(0) + t \lambda_i \right) = \sum_{i=1}^{N} \left[ k x_i(0) + \frac{k(k-1)}{2} \lambda_i \right]$.

This implies that $h_k(x)$ is upper bounded by a linear function of the state $x$. In order for it to approach a
quadratic function of $x$, the iteration number $k$ must have the same magnitude as $x$. It follows that, if the
state space is very large, convergence is slow. Moreover, if the state space is infinite, which is the case if
queues do not have finite buffers, only pointwise convergence of $h_k(x)$ to $h^*(x)$ can be ensured, but for every
$k$, there is some state $x$ such that $h_k(x)$ is a poor approximation to $h^*(x)$.
Example 1 (Single queue with controlled service rate) Consider a single queue with:
state $x$ defined as the queue length;
$P_a(x, x+1) = \lambda$ (arrival rate).
Simulation-based Methods

The dynamic programming algorithms studied so far have the following characteristics:
infinitely many value and/or policy updates are required at every state;
perfect information about the problem is required, i.e., we must know $g_a(x)$ and $P_a(x, y)$;
we must know how to compute greedy policies, and in particular compute expectations of the form

$\sum_{y \in S} P_a(x, y) J(y)$.     (2)

In realistic scenarios, each of these requirements may pose difficulties. When the state space is large,
performing updates infinitely often in every state may be prohibitive, or, even if it is feasible, a clever order
of visitation may considerably speed up convergence. In many cases, the system parameters are not known,
and instead one only has access to observations of the system. Finally, even if the transition probabilities
are known, computing expectations of the form (2) may be costly. In the next few lectures, we will study
simulation-based methods, which aim at alleviating these issues.
2.1

In asynchronous value iteration (AVI), at each iteration $k$ a subset of states $S_k \subset S$ is selected and

$J_{k+1}(x) = (T J_k)(x)$ for $x \in S_k$, $\quad J_{k+1}(x) = J_k(x)$ for $x \notin S_k$.
We have seen that, if every state has its value updated innitely many times, then the AVI converges
(see arguments in Problem set 1). The question remains as to whether convergence may be improved by
selecting states in a particular order, and whether we can dispense with the requirement of visiting every
state innitely many times.
We will consider a version of AVI where state updates are based on actual or simulated trajectories for
the system. It seems reasonable to expect that, if the system is often encountered at certain states, more
emphasis should be placed in obtaining accurate estimates and good actions for those states, motivating
performing value updates more often at those states. In the limit, it is clear that if a state is never visited,
under any policy, then the value of the cost-to-go function at such a state never comes into play in the
decision-making process, and no updates need to be performed for such a state at all. Based on the notion
that state trajectories contain information about which states are most relevant, we propose the following
version of AVI. We call it real-time value iteration (RTVI).
1. Take an arbitrary state x0 . Let k = 0.
2. Choose action uk in some fashion.
3. Let xk+1 = f (xk , uk , wk ) (recall from lecture 1 that f gives an alternative representation for state
transitions).
4. Let Jk+1 (xk+1 ) = (T Jk )(xk+1 ).
5. Let k = k + 1 and return to step 2.
2.2
Exploration vs. Exploitation
Note that there is still an element missing in the description of RTVI, namely, how to choose action u k . It
is easy to see that, if for every state x and y there is a policy u such that there is a positive probability of
reaching state y at some time stage, starting at state x, one way of choosing u k that ensures convergence of
RTVI is to select it randomly among all possible actions. This ensures that all states are visited innitely
often, and the convergence result proved for AVI holds for RTVI. However, if we are actually applying RTVI
as we run the system, we may not want to wait until RTVI converges before we start trying to use good
policies; we would like to use good policies as early as possible. A reasonable choice in this direction is to
take an action uk that is greedy with respect to the current estimate Jk of the optimal cost-to-go function.
4
In general, choosing uk greedily does not ensure convergence to the optimal policy. One possible failure
scenario is illustrated in Figure 2. Suppose that there is a subset of states B which is recurrent under an
optimal policy, and a disjoint subset of states A which is recurrent under another policy. If we start with a
guess J0 which is high enough at states outside region A, and always choose actions greedily, then an action
that never leads to states outside region A will be selected. Hence RTVI never has a chance of updating and
correcting the initial guess J0 at states in subset B, and in particular, the optimal policy is never achieved.
It turns out that, if we choose the initial value $J_0 \le J^*$, then greedy action selection performs well, as
shown in Fig. 2(b). We state this concept formally in the following theorem.
The previous discussion highlights a trade-off that is fundamental to learning algorithms: the conflict
between exploitation and exploration. In particular, there is usually tension between exploiting information
accumulated by previous learning steps and exploring different options, possibly at a certain cost, in order
to gather more information.
[Figure 2: comparison of initial values J_0 relative to J*; (b) initial value with J_0 less than or equal to J*.]
For states $x \in A$, define the operator

$(T^A J)(x) = \min_{a \in A_x} \left[ g_a(x) + \alpha \sum_{y \in A} P_a(x, y) J(y) + \alpha \sum_{y \notin A} P_a(x, y) J_0(y) \right]$.

Hence one could regard $J$ as a function from the set $A$ to $\mathbb{R}^{|A|}$. So $T^A$ is similar to the DP operator for the
subset $A$ of states, and

$\| T^A J - T^A \bar{J} \|_\infty \le \alpha \| J - \bar{J} \|_\infty$.
Therefore, RTVI is AVI over $A$, with every state visited infinitely many times. Thus,

$J_k(x) \to \bar{J}(x) = \begin{cases} \hat{J}(x), & \text{if } x \in A, \\ J_0(x), & \text{if } x \notin A, \end{cases}$

where $\hat{J}$ is the fixed point of $T^A$. Since the states $x \notin A$ are never visited, we must have

$P_a(x, y) = 0, \quad x \in A, \ y \notin A$,

where $a$ is greedy with respect to $\bar{J}$. Let $u$ be the greedy policy with respect to $\bar{J}$. Then

$\bar{J}(x) = g_u(x) + \alpha \sum_{y \in S} P_u(x, y) \bar{J}(y), \quad x \in A$.

Therefore, we conclude

$\bar{J}(x) = J_u(x) \ge J^*(x), \quad x \in A$,

and, since $J_0 \le J^*$ implies $J_k \le J^*$ for all $k$, also $\bar{J}(x) \le J^*(x)$, so that $J_u(x) = J^*(x)$ for all $x \in A$.
References
[1] R.-R. Chen and S. P. Meyn, "Value Iteration and Optimization of Multiclass Queueing Networks," Queueing
Systems, 32, pp. 65-97, 1999.
February 25
Handout #10
Lecture Note 7
Recall the real-time value iteration scheme of the previous lecture: simulate the system according to
$x_{k+1} = f(x_k, u_k, w_k)$, choose $u_k$ in some fashion, and update $J_{k+1}(x_{k+1}) = (T J_k)(x_{k+1})$.
2 Q-Learning

2.1 Q-Factors

Define

$J^*(x) = \min_{a \in A_x} Q^*(x, a)$,     (1)

where the Q-factors are given by

$Q^*(x, a) = g_a(x) + \alpha \sum_{y \in S} P_a(x, y) J^*(y)$.     (2)

We can interpret these equations as Bellman's equations for an MDP with expanded state space. We have
the original states $x \in S$, with associated sets of feasible actions $A_x$, and extra states $(x, a)$, $x \in S$, $a \in A_x$,
corresponding to state-action pairs, for which there is only one action available, and no decision must be
made. Note that, whenever we are in a state $x$ where a decision must be made, the system transitions
deterministically to state $(x, a)$ based on the state and the action $a$ chosen. Therefore we circumvent the need
to compute expectations over next states when extracting a greedy policy: a greedy action at $x$ is simply one
minimizing $Q^*(x, a)$. Combining (1) and (2), $Q^*$ is the fixed point of the operator $H$ defined by

$(H Q)(x, a) = g_a(x) + \alpha \sum_{y \in S} P_a(x, y) \min_{a'} Q(y, a')$.     (3)
It is easy to show that the operator $H$ has the same properties as the operator $T$ defined in previous lectures
for discounted-cost problems:

Monotonicity: for all $Q$ and $\bar{Q}$ such that $Q \le \bar{Q}$, $H Q \le H \bar{Q}$;

Offset: for all $Q$ and scalars $K$, $H(Q + K e) = H Q + \alpha K e$;

Contraction: for all $Q$ and $\bar{Q}$, $\| H Q - H \bar{Q} \|_\infty \le \alpha \| Q - \bar{Q} \|_\infty$.
2.2 Q-Learning

We now develop a real-time value iteration algorithm for computing $Q^*$. An algorithm analogous to RTVI
for computing the cost-to-go function is

$Q_{t+1}(x_t, a_t) = (H Q_t)(x_t, a_t)$.

However, this algorithm undermines the idea that Q-learning is motivated by situations where we do not
know the transition probabilities, since applying $H$ requires them. One may instead consider replacing the
expectation with the observed next state:

$Q_{t+1}(x_t, a_t) = g_{a_t}(x_t) + \alpha \min_{a'} Q_t(x_{t+1}, a')$.

However, note that such an algorithm should not be expected to converge; in particular, $\min_{a'} Q_t(x_{t+1}, a')$ is a
noisy estimate of $\sum_y P_{a_t}(x_t, y) \min_{a'} Q_t(y, a')$. We consider a small-step version of this scheme, where the
noise is attenuated:

$Q_{t+1}(x_t, a_t) = (1 - \gamma_t) Q_t(x_t, a_t) + \gamma_t \left[ g_{a_t}(x_t) + \alpha \min_{a'} Q_t(x_{t+1}, a') \right]$.     (4)

We will study the properties of (4) under the more general framework of stochastic approximations, which
are at the core of many simulation-based or real-time dynamic programming algorithms.
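A minimal tabular Q-learning sketch implementing update (4) is given below; the environment interface (env_step, actions), the step-size rule, and the epsilon-greedy exploration rule are illustrative assumptions, not part of the notes.

```python
import random
from collections import defaultdict

def q_learning(env_step, actions, alpha, n_steps, x0, eps=0.1):
    """Tabular Q-learning with the small-step update (4).

    env_step(x, a) must return (cost, next_state); actions(x) lists feasible actions.
    An epsilon-greedy rule keeps visiting all state-action pairs.
    """
    Q = defaultdict(float)
    visits = defaultdict(int)
    x = x0
    for _ in range(n_steps):
        if random.random() < eps:
            a = random.choice(actions(x))
        else:
            a = min(actions(x), key=lambda b: Q[(x, b)])
        cost, y = env_step(x, a)
        visits[(x, a)] += 1
        gamma = 1.0 / visits[(x, a)]                       # step sizes with sum = inf, sum of squares < inf
        target = cost + alpha * min(Q[(y, b)] for b in actions(y))
        Q[(x, a)] += gamma * (target - Q[(x, a)])          # update (4)
        x = y
    return Q
```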
Stochastic Approximation

Consider the iteration

$r_{t+1} = H r_t$.     (5)

Now suppose that we cannot compute $H r$ but have noisy estimates $(H r + w)$ with $E[w] = 0$. One alternative
is to approximate (5) by drawing several samples $H r + w_i$ and averaging them, in order to obtain an estimate
of $H r$. In this case, we would have

$r_{t+1} = \frac{1}{k} \sum_{i=1}^{k} (H r_t + w_i)$.

Letting

$r_t^{(i)} = \frac{1}{i} \sum_{j=1}^{i} (H r_t + w_j)$,

we have

$r_t^{(i+1)} = \frac{i}{i+1} r_t^{(i)} + \frac{1}{i+1} (H r_t + w_{i+1})$.

Therefore, $r_{t+1} = r_t^{(k)}$. Finally, we may consider replacing samples $H r_t + w_i$ with samples $H r_t^{(i-1)} + w_i$,
obtaining the final form

$r_{t+1} = (1 - \gamma_t) r_t + \gamma_t (H r_t + w_t)$.
A simple application of these ideas involves estimating the expected value of a random variable by drawing
i.i.d. samples.

Example 1 Let $v_1, v_2, \dots$ be i.i.d. random variables. Given

$r_{t+1} = \frac{t}{t+1} r_t + \frac{1}{t+1} v_{t+1}$,

$r_t$ is the sample mean of $v_1, \dots, v_t$, which corresponds to the stochastic approximation iteration with step size
$\gamma_t = \frac{1}{t+1}$; note that $\sum_{t=1}^{\infty} \gamma_t = \infty$ and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$.

The conditions

$\sum_{t=1}^{\infty} \gamma_t = \infty$     (6)

and

$\sum_{t=1}^{\infty} \gamma_t^2 < \infty$     (7)
are standard in stochastic approximation algorithms. A simple argument illustrates the need for condition
(6): if the total sum of step sizes is finite, the iterates $r_t$ are confined to a region around the initial guess $r_0$, so
that, if $r_0$ is far enough from any solution of $r = Hr$, the algorithm cannot possibly converge. Moreover,
since we have noisy estimates of $Hr$, convergence of $r_{t+1} = (1 - \gamma_t) r_t + \gamma_t H r_t + \gamma_t w_t$ requires that the noise
term $\gamma_t w_t$ decreases with time, motivating condition (7).
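The sample-mean iteration of Example 1 can be written in a few lines; the uniform sampling distribution below is just an illustrative choice.

```python
import random

def stochastic_approximation(sample, r0, n_steps):
    """Iterate r_{t+1} = (1 - g_t) r_t + g_t * sample(), with g_t = 1/(t+1).

    With sample() returning i.i.d. draws, this is exactly the running sample mean
    of Example 1, and the step sizes satisfy conditions (6) and (7)."""
    r = r0
    for t in range(1, n_steps + 1):
        gamma = 1.0 / (t + 1)
        r = (1 - gamma) * r + gamma * sample()
    return r

# Estimate E[v] for v uniform on [0, 1]; the iterate should approach 0.5.
print(stochastic_approximation(lambda: random.random(), r0=0.0, n_steps=100000))
```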
We will consider two approaches to analyzing the stochastic approximation algorithm

$r_{t+1} = (1 - \gamma_t) r_t + \gamma_t (H r_t + w_t) = r_t + \gamma_t (H r_t + w_t - r_t) = r_t + \gamma_t S(r_t, w_t)$,     (8)

where we define $S(r_t, w_t) = H r_t + w_t - r_t$. The two approaches are
1. Lyapunov function analysis
2. ODE approach
3.1
The question we try to answer is Does (8) converge? If so, where does it converge to?
We will rst illustrate the basic ideas of Lyapunov function analysis by considering a deterministic case.
3.1.1
Deterministic Case
In the deterministic case, we have $S(r, w) = S(r)$. Suppose there exists some unique $r^*$ such that

$S(r^*) = H r^* - r^* = 0$.

The basic idea is to show that a certain measure of distance between $r_t$ and $r^*$ is decreasing.

Example 2 Suppose that $F$ is a contraction with respect to $\| \cdot \|_2$, with contraction factor $\alpha$. Then

$r_{t+1} = r_t + \gamma_t (F r_t - r_t)$

converges.

Proof: Since $F$ is a contraction, there exists a unique $r^*$ s.t. $F r^* = r^*$. Let

$V(r) = \| r - r^* \|_2$.
We will show $V(r_{t+1}) \le V(r_t)$. Observe

$V(r_{t+1}) = \| r_{t+1} - r^* \|_2$
$= \| r_t + \gamma_t (F r_t - r_t) - r^* \|_2$
$= \| (1 - \gamma_t)(r_t - r^*) + \gamma_t (F r_t - r^*) \|_2$
$\le (1 - \gamma_t) \| r_t - r^* \|_2 + \gamma_t \| F r_t - r^* \|_2$
$\le (1 - \gamma_t) \| r_t - r^* \|_2 + \gamma_t \alpha \| r_t - r^* \|_2$
$= \| r_t - r^* \|_2 - (1 - \alpha) \gamma_t \| r_t - r^* \|_2$.

Since $V(r_t)$ is nonincreasing and bounded below, it converges to some $\epsilon \ge 0$, and $\| r_t - r^* \|_2 \ge \epsilon$ for all $t$. Hence

$\| r_{t+1} - r^* \|_2 \le \| r_t - r^* \|_2 - (1 - \alpha) \gamma_t \epsilon$
$\le \| r_{t-1} - r^* \|_2 - (1 - \alpha)(\gamma_t + \gamma_{t-1}) \epsilon$
$\vdots$
$\le \| r_0 - r^* \|_2 - (1 - \alpha) \epsilon \sum_{l=1}^{t} \gamma_l$.

Hence

$\epsilon \le \frac{\| r_0 - r^* \|_2}{(1 - \alpha) \sum_{l=1}^{t} \gamma_l} \to 0$ as $t \to \infty$,

and we thus have $\epsilon = 0$.     $\square$
We can isolate several key aspects in the convergence argument used for the example above:
1. We define a distance $V(r_t) \ge 0$ indicating how far $r_t$ is from a solution $r^*$ satisfying $S(r^*) = 0$;
2. We show that the distance is nonincreasing in $t$;
3. We show that the distance indeed converges to 0.
The argument also involves the basic result that every nonincreasing sequence bounded below converges,
used to show that the distance converges.
Motivated by these points, we introduce the notion of a Lyapunov function:

Definition 1 We call a function $V$ a Lyapunov function if $V$ satisfies
(a) $V(\cdot) \ge 0$;
(b) $(\nabla_r V(r))^T S(r) \le 0$;
(c) $\nabla V(r) = 0 \iff S(r) = 0$.
3.1.2 Stochastic Case

The argument used for convergence in the stochastic case parallels the argument used in the deterministic
case. Let $\mathcal{F}_t$ denote all information that is available at stage $t$, and let

$\bar{S}_t(r) = E\left[ S(r, w_t) \mid \mathcal{F}_t \right]$.

Then we require a Lyapunov function $V$ satisfying

$V(\cdot) \ge 0$,     (9)

$\nabla V(r)^T \bar{S}_t(r) \le -c V(r)$ for some $c > 0$,     (10)

$\| \nabla V(r) - \nabla V(\bar{r}) \|_2 \le L \| r - \bar{r} \|_2$,     (11)

$E\left[ \| S(r_t, w_t) \|_2^2 \mid \mathcal{F}_t \right] \le K_1 + K_2 V(r_t)$.     (12)

We will also use the following lemma. Let $X_t$, $Y_t$ and $Z_t$ be nonnegative
random variables with $\sum_{t=1}^{\infty} Y_t < \infty$ with probability 1. Suppose also that

$E\left[ X_{t+1} \mid X_i, Y_i, Z_i, i \le t \right] \le X_t + Y_t - Z_t$.

Then
1. $X_t$ converges to a limit with probability 1;
2. $\sum_{t=1}^{\infty} Z_t < \infty$.

Theorem 2 If (9), (10), (11), and (12) are satisfied and we have $\sum_{t=1}^{\infty} \gamma_t = \infty$ and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$, then
1. $V(r_t)$ converges;
2. $\lim_{t \to \infty} \nabla V(r_t) = 0$;
3. every limit point of $r_t$ is a stationary point of $V$.
March 1
Handout #11
Lecture Note 8
Recall from the previous lecture that, under conditions (9)-(12) and step sizes satisfying $\sum_{t=0}^{\infty} \gamma_t = \infty$
and $\sum_{t=0}^{\infty} \gamma_t^2 < \infty$, we have:
$V(r_t)$ converges;
$\lim_{t \to \infty} \nabla V(r_t) = 0$;
every limit point $\bar{r}$ of $r_t$ satisfies $\nabla V(\bar{r}) = 0$.

We will prove the convergence for the special case where $V(r) = \frac{1}{2} \| r - r^* \|_2^2$ for some $r^*$. In this case,
under $\sum_{t=0}^{\infty} \gamma_t = \infty$ and $\sum_{t=0}^{\infty} \gamma_t^2 < \infty$,

$r_t \to r^*$, w.p. 1.
Lemma 1 (Supermartingale Convergence) Let $X_t$, $Y_t$ and $Z_t$ be nonnegative
random variables with $\sum_{t=1}^{\infty} Y_t < \infty$ with probability 1. Suppose also that

$E\left[ X_{t+1} \mid \mathcal{F}_t \right] \le X_t + Y_t - Z_t$, w.p. 1.

Then
1. $X_t$ converges to a limit with probability 1;
2. $\sum_{t=1}^{\infty} Z_t < \infty$.

The key idea for the proof of Theorem 2 is to show that $V(r_t)$ is (essentially) a supermartingale, so that $V(r_t)$ converges,
and then to show that it converges to zero w.p. 1.
Proof: [Theorem 2]

$E\left[ V(r_{t+1}) \mid \mathcal{F}_t \right] = E\left[ \tfrac{1}{2} \| r_{t+1} - r^* \|_2^2 \mid \mathcal{F}_t \right]$
$= \tfrac{1}{2} (r_t - r^*)^T (r_t - r^*) + \gamma_t (r_t - r^*)^T E\left[ S_t \mid \mathcal{F}_t \right] + \tfrac{\gamma_t^2}{2} E\left[ S_t^T S_t \mid \mathcal{F}_t \right]$.

Since $V(r_t) = \tfrac{1}{2} \| r_t - r^* \|_2^2$, we have $\nabla V(r_t) = (r_t - r^*)$. Then

$E\left[ V(r_{t+1}) \mid \mathcal{F}_t \right]
= V(r_t) + \gamma_t \nabla V(r_t)^T E\left[ S_t \mid \mathcal{F}_t \right] + \tfrac{\gamma_t^2}{2} E\left[ \| S_t \|_2^2 \mid \mathcal{F}_t \right]$
$\le V(r_t) - \gamma_t c V(r_t) + \tfrac{\gamma_t^2}{2} \left( K_1 + K_2 V(r_t) \right)$
$= V(r_t) - \left( \gamma_t c - \tfrac{\gamma_t^2 K_2}{2} \right) V(r_t) + \tfrac{\gamma_t^2 K_1}{2}$.

Identify $X_t = V(r_t)$, $Z_t = \left( \gamma_t c - \tfrac{\gamma_t^2 K_2}{2} \right) V(r_t)$ and $Y_t = \tfrac{\gamma_t^2 K_1}{2}$. Since
$\sum_{t=0}^{\infty} \gamma_t^2 < \infty$, $\gamma_t$ must converge to zero, so that $Z_t \ge 0$ for all large enough $t$. Moreover,

$\sum_{t=0}^{\infty} Y_t = \frac{K_1}{2} \sum_{t=0}^{\infty} \gamma_t^2 < \infty$.

By the supermartingale convergence lemma, $V(r_t)$ converges and

$\sum_{t=0}^{\infty} \left( \gamma_t c - \tfrac{\gamma_t^2 K_2}{2} \right) V(r_t) < \infty$, w.p. 1.

Suppose that $V(r_t) \to \epsilon > 0$. Then, by the hypothesis that $\sum_{t=0}^{\infty} \gamma_t = \infty$ and $\sum_{t=0}^{\infty} \gamma_t^2 < \infty$, we must have

$\sum_{t=0}^{\infty} \left( \gamma_t c - \tfrac{\gamma_t^2 K_2}{2} \right) V(r_t) = \infty$,

which is a contradiction. Therefore

$\lim_{t \to \infty} \| r_t - r^* \|_2^2 = 0$ w.p. 1, i.e., $r_t \to r^*$ w.p. 1.     $\square$
Suppose that $F$ is a $\| \cdot \|_2$ contraction with factor $\alpha$. Suppose also that components $i_t$, $t = 1, 2, \dots$, are chosen i.i.d. with $P(i_t = i) =
\pi_i > 0$, and only the chosen component is updated. Then

$r_{t+1}(i) = r_t(i) + \gamma_t \pi_i \left( (F r_t)(i) - r_t(i) \right) + \gamma_t \underbrace{\left[ 1(i_t = i) - \pi_i \right] \left[ (F r_t)(i) - r_t(i) \right]}_{w_t(i)}$.

Define

$\Pi = \mathrm{diag}(\pi_1, \pi_2, \dots, \pi_n)$;

then

$r_{t+1} = r_t + \gamma_t \big( \underbrace{\Pi (F r_t - r_t)}_{E[S_t \mid \mathcal{F}_t]} + w_t \big)$.

Let $V(r) = \frac{1}{2} (r - r^*)^T \Pi^{-1} (r - r^*) \ge 0$. Then we have $\nabla V(r) = \Pi^{-1} (r - r^*)$. We also have

$\nabla V(r_t)^T E\left[ S_t \mid \mathcal{F}_t \right]
= (r_t - r^*)^T \Pi^{-1} \Pi (F r_t - r_t) = (r_t - r^*)^T (F r_t - r^* + r^* - r_t)$
$= -(r_t - r^*)^T (r_t - r^*) + (r_t - r^*)^T (F r_t - r^*)$
$\le - \| r_t - r^* \|_2^2 + \| r_t - r^* \|_2 \, \| F r_t - r^* \|_2$
$\le - \| r_t - r^* \|_2^2 + \alpha \| r_t - r^* \|_2^2 = -(1 - \alpha) \| r_t - r^* \|_2^2$.

We finally have

$E\left[ \| F r_t - r_t \|_2^2 \mid \mathcal{F}_t \right] = \| F r_t - r_t \|_2^2
\le \left( \| F r_t - r^* \|_2 + \| r_t - r^* \|_2 \right)^2
\le (1 + \alpha)^2 \| r_t - r^* \|_2^2
\le 2 (1 + \alpha)^2 \max_i \pi_i \, V(r_t)$,

so that the conditions of Theorem 2 hold and the iteration converges to $r^*$.
Q-learning

We can now write the Q-learning update in the stochastic approximation form:

$Q_{t+1}(x, a) = Q_t(x, a)
+ \gamma_t(x, a) \underbrace{\left[ g_a(x) + \alpha \sum_y P_a(x, y) \min_{a'} Q_t(y, a') - Q_t(x, a) \right]}_{(HQ_t)(x,a) - Q_t(x,a)}
+ \gamma_t(x, a) \underbrace{\left[ \alpha \min_{a'} Q_t(x_{t+1}, a') - \alpha \sum_y P_a(x, y) \min_{a'} Q_t(y, a') \right]}_{w_t}$,

where

$\gamma_t(x, a) = 0$ if $(x, a) \ne (x_t, a_t)$, $\quad \gamma_t(x_t, a_t) = \gamma_t$,

$E\left[ \gamma_t w_t \mid \mathcal{F}_t \right] = 0$, and $|w_t| \le 2 \alpha \| Q_t \|_\infty$.

Then, we have

$Q_{t+1} = Q_t + \gamma_t (H Q_t - Q_t) + \gamma_t w_t$.

We can use the following theorem to show that Q-learning converges, as long as every state-action
pair is visited infinitely many times.

Theorem 4 Let $r_{t+1}(i) = r_t(i) + \gamma_t(i) \left[ (H r_t)(i) - r_t(i) + w_t(i) \right]$. Then, if

$E\left[ w_t \mid \mathcal{F}_t \right] = 0$,

$\sum_{t=0}^{\infty} \gamma_t(i) = \infty$, $\sum_{t=0}^{\infty} \gamma_t(i)^2 < \infty$, $\forall i$, and

$H$ is a maximum-norm contraction,

then $r_t \to r^*$ w.p. 1 (where $H r^* = r^*$).
Comparing Theorems 2 and 4, note that, if H is a maximumnorm contraction, convergence occurs under
weaker conditions than if it is an Euclidean norm contraction.
Corollary 1 If $\sum_{t=0}^{\infty} \gamma_t(x, a) = \infty$ and $\sum_{t=0}^{\infty} \gamma_t(x, a)^2 < \infty$ for every state-action pair $(x, a)$, then $Q_t \to Q^*$
w.p. 1.
ODE Approach

Oftentimes, the behavior of $r_{t+1} = r_t + \gamma_t S(r_t, w_t)$ may be understood by analyzing the following ODE
instead:

$\dot{r}_t = E\left[ S(r_t, w_t) \right]$.

The main idea for the ODE approach is as follows. Look at intervals $[t_m, t_{m+1})$ such that

$\sum_{t=t_m}^{t_{m+1}-1} \gamma_t = \epsilon$, where $\epsilon$ is small,     (1)

and let $\bar{r}_m = r_{t_m}$. Then

$\bar{r}_{m+1} = r_{t_{m+1}} = \bar{r}_m + \sum_{t=t_m}^{t_{m+1}-1} \gamma_t S(r_t, w_t)$
$\approx \bar{r}_m + \sum_{t=t_m}^{t_{m+1}-1} \gamma_t S(\bar{r}_m, w_t) + O(\epsilon^2)$     (2)
$\approx \bar{r}_m + \epsilon \, E\left[ S(\bar{r}_m, w) \right] + O(\epsilon^2)$.     (3)

Therefore we can think of the stochastic scheme as a discrete version of the ODE:

$\bar{r}_{m+1} = \bar{r}_m + \epsilon E\left[ S(\bar{r}_m, w) \right] \quad \Longleftrightarrow \quad \dot{r} = E\left[ S(r, w) \right]$.

To make the argument rigorous, steps (1), (2) and (3) have to be justified.
March 3
Handout #12
Lecture Note 9
Recall the Q-learning update from the previous lecture:

$Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + \gamma_t \left[ g_{a_t}(x_t) + \alpha \min_{a'} Q_t(x_{t+1}, a') - Q_t(x_t, a_t) \right]$.

In this lecture we study the $E^3$ algorithm, which handles exploration and exploitation explicitly by partitioning the state space into:

known states: states in a set $N$ that have been visited sufficiently many times to ensure that the estimates of $P_a(x, y)$ and $g_a(x)$ are accurate
with high probability;

unknown states: all remaining states; an unknown state is moved to $N$ when it has been visited at least $m$ times, for some number $m$.
We introduce two MDPs, $\hat{M}_N$ and $M_N$. The MDP $\hat{M}_N$ is presented in Fig. 1. Its main characteristic is that
the unknown states from the original MDP are merged into a recurrent state $x_0$ with cost $g_a(x_0) = g_{\max}$, $\forall a$.
The other MDP, $M_N$, has the same structure as $\hat{M}_N$ but the estimated transition probabilities and costs are
replaced with their true values.
We now introduce the algorithm.

1.1 Algorithm

We will first consider a version of $E^3$ which assumes knowledge of $J^*$; the assumption will be lifted later.
The $E^3$ algorithm proceeds as follows.

1. Let $N = \emptyset$. Pick an arbitrary state $x_0$. Let $k = 0$.
2. If $x_k \notin N$, perform balanced wandering:
$a_k$ = action chosen fewest times at state $x_k$.
If $x_k \in N$, attempt exploitation or exploration on the known-state MDP $\hat{M}_N$ (returning $x_k$ and $\hat{M}_N$ when a near-optimal policy has been found).

[Figure 1: the Markov decision process $\hat{M}_N$, in which all unknown states are merged into a single recurrent state $x_0$.]
Theorem 1 With probability no less than $1 - \delta$, $E^3$ will stop after a number of actions and an amount of computation
time that are

$\mathrm{poly}\left( \frac{1}{\epsilon}, \frac{1}{\delta}, |S|, \frac{1}{1 - \alpha}, g_{\max} \right)$,

and return a state $x$ and a policy $u$ such that $J_u(x) \le J^*(x) + \epsilon$.
1.2
Main Points
(iv) If exploitation is not possible, then there is an exploration policy that reaches an unknown state after
T transitions with high probability.
To show the first main point, we consider the following lemma.

Lemma 1 Suppose a state $x$ has been visited at least $m$ times, with each action $a \in A_x$ having been executed
at least $\frac{m}{|A_x|}$ times. Then, if

$m = \mathrm{poly}\left( |S|, \frac{1}{\epsilon}, \frac{1}{1 - \alpha}, T, g_{\max}, \frac{1}{\delta}, \log \frac{1}{\delta}, \mathrm{var}(g) \right)$,

we have, w.p. $1 - \delta$,

$|\hat{P}_a(x, y) - P_a(x, y)| = O\left( \frac{\epsilon (1 - \alpha)^2}{|S| \, g_{\max}} \right)$,

$|\hat{g}_a(x) - g_a(x)| = O\left( \frac{\epsilon (1 - \alpha)^2}{|S| \, g_{\max}} \right)$.
The proof of this lemma is a direct application of the Chernoff bound, which states that, if $z_1, z_2, \dots$
are i.i.d. Bernoulli random variables, then

$\frac{1}{n} \sum_{i=1}^{n} z_i \to E z_1$     (SLLN)

and

$P\left( \left| \frac{1}{n} \sum_{i=1}^{n} z_i - E z_1 \right| > \epsilon \right) \le 2 \exp\left( -\frac{n \epsilon^2}{2} \right)$.

Note also that, after $(m - 1)|S|$ balanced wandering steps, at least one state will have to become known.
The main point iii(a) follows from the next lemma.

Lemma 2 For all policies $u$,

$J_{u,M_N}(x) \ge J_u(x)$, $\forall x$.

Proof: The claim is trivial for $x \notin N$, since $J_{u,M_N}(x) = \frac{g_{\max}}{1 - \alpha} \ge J_u(x)$. For the remaining states, note that every
trajectory incurs the same costs in $M$ and in $M_N$ until the first time $T$ it reaches an unknown state, and from
that point on $M_N$ incurs the maximum possible cost:

$J_u(x) = E\left[ \sum_{t=0}^{T-1} \alpha^t g_u(x_t) + \sum_{t=T}^{\infty} \alpha^t g_u(x_t) \right]
\le E\left[ \sum_{t=0}^{T-1} \alpha^t g_u(x_t) + \alpha^T \frac{g_{\max}}{1 - \alpha} \right] = J_{u,M_N}(x)$.     $\square$
To prove the main point iii(b), we first introduce the following definition.

Definition 1 Let $M$ and $\hat{M}$ be two MDPs on the same state and action spaces. $\hat{M}$ is a $\Delta$-approximation to $M$ if

$|\hat{P}_a(x, y) - P_a(x, y)| \le \Delta$ and $|\hat{g}_a(x) - g_a(x)| \le \Delta$.

Lemma 3 If

$T \ge \frac{1}{1 - \alpha} \log \frac{2 g_{\max}}{\epsilon (1 - \alpha)}$     (1)

and $\hat{M}$ is an $O\left( \frac{\epsilon (1 - \alpha)^2}{|S| \, g_{\max}} \right)$-approximation of $M$, then, $\forall u$,

$\| J_{u,\hat{M}} - J_{u,M} \|_\infty \le \epsilon$.
Sketch of proof: Take a policy $u$ and a start state $x$. We consider paths of length $T$ starting from $x$:

$p = (x_0, x_1, x_2, \dots, x_T)$.

Note that

$J_{u,M}(x) = \sum_p P_{u,M}(p) g_u(p) + E\left[ \sum_{t=T+1}^{\infty} \alpha^t g_u(x_t) \right]$,

where

$P_{u,M}(p) = P_{u,M}(x_0, x_1) P_{u,M}(x_1, x_2) \cdots P_{u,M}(x_{T-1}, x_T)$

is the probability of observing path $p$,

$g_u(p) = \sum_{t=0}^{T} \alpha^t g_u(x_t)$,

and the tail term satisfies

$E\left[ \sum_{t=T+1}^{\infty} \alpha^t g_u(x_t) \right] \le \frac{\alpha^{T+1} g_{\max}}{1 - \alpha}$.

We split the paths into two groups:

(a) paths containing at least one transition $(x_t, x_{t+1})$ in the set $R$ of transitions such that $P_u(x_t, x_{t+1}) \le \Delta$. Note that the
total probability associated with such paths is less than or equal to $\Delta |S| T$, since the probability of any
such path is less than or equal to $\Delta$, starting from each state in each transition there are at most $|S|$
possible small-probability transitions, and there are $T$ transitions where this can occur. Therefore

$\sum_{p \in R} P_u(p) g_u(p) \le \frac{g_{\max}}{1 - \alpha} \sum_{p \in R} P_u(p) \le \frac{\Delta |S| T g_{\max}}{1 - \alpha}$.

We can follow the same principle with the MDP $\hat{M}$ to conclude that

$\sum_{p \in R} \hat{P}_u(p) \hat{g}_u(p) \le \frac{(\Delta + \Delta) |S| T g_{\max}}{1 - \alpha}$.

Therefore, we have

$\left| \sum_{p \in R} \hat{P}_u(p) \hat{g}_u(p) - \sum_{p \in R} P_u(p) g_u(p) \right| \le \frac{(\Delta + 2\Delta) |S| T g_{\max}}{1 - \alpha}$.

(b) paths containing no transition in $R$. For such paths,

$(1 - \Delta)^T P_u(p) \le \hat{P}_u(p) \le (1 + \Delta)^T P_u(p)$,

so that

$(1 - \Delta)^T \left[ J_{u,T} - \Delta T \right] - \frac{\epsilon}{4} \le \hat{J}_{u,T} \le (1 + \Delta)^T \left[ J_{u,T} + \Delta T \right] + \frac{\epsilon}{4}$.

The theorem follows by considering an appropriate choice of $\Delta$.
The main point (iv) says that if exploitation is not possible, then exploration is. We show it by the
following lemma.

Lemma 4 For any $x \in N$, one of the following must hold:

(a) there exists $u$ in $M_N$ such that $J_{u,T}^{M_N}(x) \le J_T^*(x) + \epsilon$, or

(b) there exists $u$ such that the probability that a walk of $T$ steps will terminate outside $N$ exceeds
$\frac{\epsilon (1 - \alpha)}{g_{\max}}$.

Proof: Decompose the optimal $T$-step cost according to whether the corresponding path stays inside $N$:

$J_T^*(x) = \sum_{p \text{ inside } N} P_u^N(p) g_u(p) + \sum_{q \text{ leaving } N} P_u(q) g_u(q)$.

If (a) does not hold, the paths staying inside $N$ alone cannot account for $J_T^*(x) + \epsilon$, which implies

$\sum_{q \text{ leaving } N} P_u(q) g_u(q) > \epsilon$.

Therefore, since $g_u(q) \le \frac{g_{\max}}{1 - \alpha}$ for every path $q$,

$\sum_{q \text{ leaving } N} P_u(q) > \frac{\epsilon (1 - \alpha)}{g_{\max}}$,

which is (b).     $\square$
In order to complete the proof of Theorem 1 from the four lemmas above, we have to consider the
probabilities of two forms of failure:
failure to stop the algorithm with a near-optimal policy;
failure to perform enough exploration in a timely fashion.
The first point is addressed by Lemmas 1, 2 and 3, which establish that, if the algorithm stops, with high
probability the policy produced is near-optimal. The second point follows from Lemma 4, which shows that
each attempt to explore is successful with some non-negligible probability. By applying the Chernoff bound,
it can be shown that, after a number of attempts that is polynomial in the quantities of interest, exploration
will occur with high probability.
References
[1] M. Kearns and S. Singh, "Near-Optimal Reinforcement Learning in Polynomial Time," Machine Learning,
49(2), pp. 209-232, 2002.
March 8
Handout #13
Lecture Note 10
DP problems are centered around the cost-to-go function $J^*$ or the Q-factor $Q^*$. In certain problems, such as
linear-quadratic-Gaussian (LQG) systems, $J^*$ exhibits some structure which allows for its compact representation:

Example 1 In an LQG system, we have

$x_{k+1} = A x_k + B u_k + w_k$,
$g(x, u) = x^T D x + u^T E u$, $\quad x \in \mathbb{R}^n$,

where $x_k$ represents the system state, $u_k$ represents the control action, and $w_k$ is Gaussian noise. It can
be shown that the optimal policy is of the form

$u_k = L_k x_k$,

and the optimal cost-to-go function is of the form

$J^*(x) = x^T R x + S$, $\quad R \in \mathbb{R}^{n \times n}$, $S \in \mathbb{R}$,

where $R$ is a symmetric matrix. It follows that, if there are $n$ state variables (i.e., $x_k \in \mathbb{R}^n$), storing
$J^*$ requires storing $n(n+1)/2 + 1$ real numbers, corresponding to the matrix $R$ and the scalar $S$. The
computational time and storage space required is quadratic in the number of state variables.
In general, we are not as lucky as in the LQG case, and exact representation of $J^*$ requires that
it be stored as a lookup table, with one value per state. Therefore, the space required is proportional to the size of
the state space, which grows exponentially with the number of state variables. This problem, known as the
curse of dimensionality, makes dynamic programming intractable in the face of most problems of practical scale.

Example 2 Consider the game of Tetris, represented in Fig. 1. As seen in previous lectures, this game may
be represented as an MDP, and a possible choice of state is the pair $(B, P)$, in which $B \in \{0, 1\}^{n \times m}$ represents
the board configuration and $P$ represents the current falling piece. More specifically, we have $b(i, j) = 1$ if
position $(i, j)$ of the board is filled, and $b(i, j) = 0$ otherwise.
If there are $p$ different types of pieces, and the board has dimension $n \times m$, the number of states is on the
order of $p \cdot 2^{nm}$, which grows exponentially with $n$ and $m$.
Since exact solution of large-scale MDPs is intractable, we consider approximate solutions instead.
There are two directions for approximations which are directly based on dynamic programming principles:

(1) Approximation in the policy space

Suppose that we would like to find a policy minimizing the average cost in an MDP. We can pose this
as the optimization problem

$\min_{u \in U} \lambda(u)$,     (1)

where $U$ is the set of all possible policies. In principle, we could solve (1) by enumerating all policies and
choosing the one with the smallest value of $\lambda(u)$; however, note that the number of policies is exponential
in the number of states: we have $|U| = |A|^{|S|}$. If there is no special structure to $U$, this problem requires
even more computational time than solving Bellman's equation for the cost-to-go function. A possible
approach to approximating the solution is to transform problem (1) by considering only a tractable
subset of all policies:

$\min_{u \in F} \lambda(u)$,

where $F$ is a subset of the policy space. If $F$ has some appropriate format, e.g., we consider policies
that are parameterized by a continuous variable, we may be able to solve this problem without having
to enumerate all policies in the set, but by using some standard optimization method such as gradient
descent. Methods based on this idea are called approximations in the policy space, and will be studied
later on in this class.
(2) Cost-to-go Function Approximation

Another approach to approximating the dynamic programming solution is to approximate the cost-to-go
function. The underlying idea for cost-to-go function approximation is that $J^*$ has some structure that
allows for an approximate compact representation

$J^*(x) \approx \tilde{J}(x, r)$,

where $r$ is a parameter vector of relatively small dimension.
If we restrict ourselves to approximations of this form, the problem of computing and storing $J^*$ is reduced
to computing and storing the parameter $r$, which requires considerably less space. Some examples of
approximation architectures $\tilde{J}$ that may be considered are as follows:

Example 3

$\tilde{J}(x, r) = \cos(x^T r)$ — nonlinear in $r$;

$\tilde{J}(x, r) = r_0 + r_1^T x$ — linear in $r$;

$\tilde{J}(x, r) = r_0 + r_1^T \phi(x)$ — linear in $r$, for some feature mapping $\phi(\cdot)$.
In the next few lectures, we will focus on cost-to-go function approximation. Note that there are two
important preconditions to the development of an effective approximation. First, we need to choose a
parameterization J that can closely approximate the desired cost-to-go function. In this respect, a suitable
choice requires some practical experience or theoretical analysis that provides rough information on the shape
of the function to be approximated. Regularities associated with the function, for example, can guide the
choice of representation. Designing an approximation architecture is a problem-specific task and it is not
the main focus of this course; however, we provide some general guidelines and illustration via case studies
involving queueing problems. Second, given a parameterization for the cost-to-go function approximation,
we need an efficient algorithm that computes appropriate parameter values.
We will start by describing usual choices for approximation architectures.
Approximation Architectures
2.1
Neural Networks
A common choice of approximation architecture is a neural network, represented in the figure below.
The underlying idea is as follows: we first convert the original state $x$ into a vector $\bar{x} \in \mathbb{R}^n$, for some $n$. This
vector is used as the input to a linear layer of the neural network, which maps the input to a vector $y \in \mathbb{R}^m$,
for some $m$, such that $y_j = \sum_{i=1}^{n} r_{ij} \bar{x}_i$. The vector $y$ is then used as the input to a sigmoidal layer, which
outputs a vector $z \in \mathbb{R}^m$ with the property that $z_i = f(y_i)$, where $f(\cdot)$ is a sigmoidal
function. A sigmoidal function is any function with the following properties:
function is any function with the following properties:
1. monotonically increasing
2. differentiable
3. bounded
Fig. 3 represents a typical sigmoidal function.
The combination of a linear and a sigmoidal layer is called a perceptron, and a neural network consists
of a chain of one or more perceptrons (i.e., the output of a sigmoidal layer can be redirected to another
sigmoidal layer, and so on). Finally, the output of the neural network consists of a weighted sum of the
output z of the final layer:
X
r i zi .
g(z) =
i
rij
Input
Linear Layer
ri
+
Sigmoidal Layer
2.2
Another common choice for approximation architecture is based on partitioning of the state space. The
underlying idea is that similar states may be grouped together. For instance, in an MDP involving
continuous state variables (e.g., S = <2 ), one may consider partitioning the state space by using a grid (e.g.,
divide <2 in squares). The simplest case would be to use a uniform grid, and assume that the cost-to-go
function remains constant in each of the partitions. Alternatively, one may use functions that are linear in
each of the partitions, or splines, and so on. One may also consider other kinds of partition beyond uniform
grids representing the partitioning of the state space as a tree or using adaptive methods for choosing the
partitions, for instance.
2.3
Features
A special case of state space partitioning consists of mapping states to features, and considering approximations of the cost-to-go function that are functions of the features. The hope is that the feature would capture
aspects of the state that are relevant for the decision-making process and discard irrelevant details, thus
providing a more compact representation. At the same time, one would also hope that, with an appropriate
choice of features, the mapping from features to the (approximate) cost-to-go function would be smoother
than that from the original state to the cost-to-go function, thereby allowing for successful approximation
with architectures that are suitable for smooth mappings (e.g., polynomials). This process is represented
below.
(x), r).
State x features f (x) J(f
J (x) J (f (x)) such that f (x) J is smooth.
Example 4 Consider the tetris game. What features we should choose?
1. |h(i) h(i + 1)| (height)
2. how many holes
3. max h(i)
March 10
Handout #14
Lecture Note 11
In this lecture, we will consider the problem of supervised learning. The setup is as follows. We have
pairs (x, y), distributed according to a joint distribution P (x, y). We would like to describe the relationship
between x and y through some function f chosen from a set of available functions C, so that y f(x).
Ideally, we would choose f by solving
min
f
2
1
yi f(xi )
n i=1
(training error)
instead. It also seems that, the richer the class C is, the better the chance to correctly describe the relationship
between x and y. In this lecture, we will show that this is not the case, and the appropriate complexity of C
and the selection of a model for describing how x and y related must be guided by how much data is actually
available. This issue is illustrated in the following example.
Example
1.5
2.5
3.5
1.5
2.5
3.5
It seems intuitive in the previous example that a line may be the best description for the relationship
between x and y, even though a polynomial of degree 3 describes the data perfectly in both cases and no
linear function is able to describe the data perfectly in the second case. Is the intuition correct, and if so,
how can we decide on an appropriate representation, if relying solely on the training error does not seem
completely reasonable?
The essence of the problem is as follows. Ultimately, what we are interested in is the ability of our tted
curve to predict future data, rather than simply explaining the observed data. In other words, we would like
to choose a predictor that minimizes the expected error |y(x) y(x)| over all possible x. We call this the
test error. The average error over the data set is called the training error.
We will show that training error and test error can be related through a measure of the complexity of
the class of predictors being considered. Appropriate choice of a predictor will then be shown to require
balancing the training error and the complexity of the predictors being considered. Their relationship is
described in Fig. 1, where we plot test and training errors versus complexity of the predictor class C when
the number of samples is xed. The main diculty is that, as indicated in Fig. 1, there exists a tradeo
between the complexity and the errors, i.e., training error and the test error; while the approximation error
over the sampled points goes to zero as we consider richer approximation classes, the same is not true for
the test error, which we are ultimately interested in minimizing. This is due to the fact that, with only
nite amount of data and noisy observations yi , if the class C is too rich we may run into overtting
tting the noise in the observations, rather than the underlying structure linking x and y. This leads to poor
generalization from the training error to the test error.
We will investigate how bounds on the test error based on the training error and the complexity of C may
be developed for the special case of classication problems i.e., problems where y {1, +1}, which may
be seen as an indicating whether xi belongs in a certain set or not. The ideas and results easily generalize
to general function approximation.
2
Error
test error
training error
Optimal degree
3.1
Suppose that, given n samples (xi , yi ), i = 1, . . . , n, we need to choose a classier hi from a nite set of
classiers f1 , . . . , fd .
Dene
=
E[|y fk (x)|]
n
1
=
|yi fk (xi )|.
n i=1
(k)
n (k)
In words, (k) is the test error associated with classier fk , and n (k) is a random variable representing the
training error associated with classier fk over the samples (xi , yi ), i = 1, . . . , n. As described before, we
would like to nd k = arg mink (k), but cannot compute directly. Let us consider using instead
k = arg min n (k)
k
n (k) +
training error + ,
and
(k)
+
n (k)
n (k ) +
(k ) + 2.
In words, if the training error is close to the test error for all classiers fk , then using k instead of k is
near-optimal. But can we expect (1) to hold?
Observe that |yi fk (xi )| are i.i.d. Bernoulli random variables. From the strong law of large numbers,
we must have
n (k) (k) w.p. 1.
This means that, if there are sucient samples, (1) should be true. Having only nitely many samples, we
face two questions:
(1) How many samples are needed before we have high condence that n (k) is close to (k)?
(2) Can we show that n (k) approaches (k) equally fast for all fk C?
The rst question is resolved by the Cherno bound: For i.i.d. Bernoulli random variables x i , i = 1, . . . , n,
we have
n
1
P
xi Ex1 > 2 exp(2n2 )
n i=1
Moreover, since there are only nitely many functions in C, uniform convergence of n (k) to (k) follows
immediately:
P (k : |(k) (k)| > )
= P (k {|
(k) (k)| > })
d
k=1
P ({|
(k) (k)| > })
2d exp(2n2 ).
Therefore we have the following theorem.
Theorem 1 With probability at least 1 , the training set (xi , yi ), i = 1, . . . , n, will be such that
test error training error + (d, n, )
where
(d, n, ) =
1
2n
1
log 2d + log
.
Measures of Complexity
In Theorem 1, the error (d, n, ) is on the order of log d. In other words, the more classiers are under
consideration, the larger the bound on the dierence between the testing and training errors, and the
dierence grows as a function of log d. It follows that, for our purposes, log d captures the complexity of
C. It turns out that, in the case where there are innitely many classiers to choose from, i.e., m = , a
dierent notion of complexity leads to a bound similar to that in Theorem 1
How can we characterize complexity? There are several intuitive choices, such as the degrees of freedom
associated with functions in S or the length required to describe any function in that set (description length).
In certain cases, these notions can be shown to give rise to bounds relating the test error to the training error.
In this class, we will consider a measure of complexity that holds more generally the Vapnik-Chernovenkis
(VC) dimension.
4
4.1
VC dimension
The VC dimension is a property of a class C of functions i.e., for each set C, we have an associated measure
of complexity, dV C (C). dV C captures how much variability there is between dierent functions in C. The
underlying idea is as follows. Take n points x1 , . . . , xn , and consider binary vectors in {1, +1}n formed
by applying a function f C to (xi ). How many dierent vectors can we come up with? In other words,
consider the following matrix:
..
..
..
..
.
.
.
.
where fi C. How many distinct rows can this matrix have? This discussion leads to the notion of shattering
and to the denition of the VC dimension.
f C,
d
1
2n
V C log( dV C ) + 1 + log( 4 )
n
Moreover, a suitable extension to bounded real-valued functions, as opposed to functions taking value in
{1, +1}, can also be obtained. It is called the Pollard dimension and gives rise to results analogous to
Theorem 1 and 2.
Denition 3 Pollard dimension of C = {f (x)} = maxs dV C ({I(f (x) > s(x))})
5
Based on the previous results, we may consider the following approach to selecting a class of functions C
whose complexity is appropriate for the number of samples available. Suppose that we have several classes
C1 C2 . . . Cp . Note that complexity increases from C1 to Cp . We have classiers f1 , f2 , . . . , fp which
minimizes the training error n (fi ) within each class. Then, given a condence level , we may found upper
bounds on the test error (fi ) associated with each classier:
(fi ) n (fi ) + (dV C , n, ),
with probability at least 1 , and we can choose the classier fi that minimizes the above upper bound.
This approach is called structural risk minimization.
There are two diculties associated with structural risk minimization: rst, the upper bound provided
by Theorems 1 and 2 may be loose; second, it may be dicult to determine the VC dimension of a given
class of classiers, and rough estimates or upper bounds may have to be used instead. Still, this may be a
reasonable approach, if we have a limited amount of data. If we have a lot of data, an alternative approach is
as follows. We can split the data in three sets: a training set, a validation set and a test set. We can use the
training set to nd the classier fi within each class Ci that minimizes the training error; use the validation
set to estimate the test error of each selected classier fi , and choose the classier f from f1 , . . . , fp with the
smallest estimate; and nally, use the test set to generate an estimate of the test error associated with f.
March 12
Handout #16
Lecture Note 12
Recall that two tasks must be accomplished in order to for a value function approximation scheme to be
successful:
1. We must pick a good representation J, such that J () J(, r) for at least some parameter r;
2. We must pick a good parameter r, such that J (x) J(x, r).
Consider approximating J with a linear architecture, i.e., let
J(x, r) =
i (x)ri ,
t=1
= 1 . . .
|
|S|p given by
p .
|
Fig. 1 gives a geometric interpretation of value function approximation. We may think of J as a vector
in |S| ; by considering approximations of the form J = r, we restrict attention to the hyperplane J = r
in the same space. Given a norm (e.g., the Euclidean norm), an ideal value function approximation
algorithm would choose r minimizing J r; in other words, it would nd the projection r of J onto
the hyperplane. Note that J r is a natural measure for the quality of the approximation architecture,
since it is the best approximation error that can be attained by any algorithm given the choice of .
Algorithms for value function approximation found in the literature do not compute the projection r ,
since this is an intractable problem. Building on the knowledge that J satises Bellmans equation, value
function approximation typically involves adaptation of exact dynamic programming algorithms. For in
stance, drawing inspiration from value iteration, one might consider the following approximate value iteration
algorithm:
rk+1 = T rk ,
where is a projection operator which maps T rk back onto the hyperplane r.
Faced with the impossibility of computing the best approximation r , a relevant question for any
value function approximation algorithm A generating an approximation r A is how large J rA is in
comparison with J r . In particular, it would be desirable that, if the approximation architecture
is capable of producing a good approximation to J , then the approximation algorithm should be able to
produce a relatively good approximation.
Another important question concerns the choice of a norm used to measure approximation errors.
Recall that, ultimately, we are not interested in nding an approximation to J , but rather in nding a good
policy for the original decision problem. Therefore we would like to choose to reect the performance
associated with approximations to J .
State 2
J
x, r )
J(
J = r
State 1
State 3
Performance Bounds
We are interested in the following question. Let uJ be the greedy policy associated with an arbitrary
function J , and JuJ be the cost-to-go function associated with that policy. Can we relate the increase in
cost JuJ J to the approximation error J J ?
Recall the following theorem, from Lecture Note 3:
Theorem 1 Let J be arbitrary, uJ be a greedy policy with respect to J .1 Let JuJ be the cost-to-go function
for policy uJ . Then
2
JuJ J
J J .
1
The previous theorem suggests that choosing an approximation J that minimizes J J may be an
appropriate choice. Indeed, if J J is small, then we have a guarantee that the cost increase incurred
by using the (possibly sub-optimal) policy uJ is also relatively small. However, what about the reverse?
If J J is large, do we necessarily have a bad policy? Note that, for problems of practical scale, it
1 That
is unrealistic to expect that we could approximate J uniformly well over all states (which is required by
Theorem 1) or that we could nd a policy uJ that yields a cost-to-go uniformly close to J over all states.
The following example illustrates the notion that having a large error J J does not necessarily
lead to a bad policy. Moreover, minimizing J J may also lead to undesirable results.
Example 1 Consider a single queue with controlled service rate. Let x denote the queue length, B denote
the buer size, and
Pa (x, x + 1)
Pa (x, x 1)
(a),
Pa (B, B + 1)
0,
ga (x)
a, x = 0, 1, . . . , B 1
a, x = 1, 2, . . . , B,
= x + q(a)
Suppose that we are interested in minimizing the average cost in this problem. Then we would like to nd
an approximation to the dierential cost function h . Suppose that we consider only linear approximations:
r) = r1 + r2 x. At the top of Figure 1, we represent h and two possible approximations, h1 and h2 . h1
h(x,
Which one is a better approximation? Note that h1 yields smaller
is chosen so as to minimize h h.
approximation errors than h2 over large states, but yields large approximation errors over the whole state
space. In particular, as we increase the buer size B, it should lead to worse and worse approximation errors
in almost all states. h2 , on the other hand, has an interesting property, which we now describe. At the
bottom of Figure 1, we represent the stationary state distribution (x) encountered under the optimal policy.
Note that it decays exponentially with x, and large states are rarely visited. This suggests that, for practical
purposes, h2 may lead to a better policy, since it approximates h better than h1 over the set of states that
are visited almost all of the time.
What the previous example hints at is that, in the case of a large state space, it may be important to
consider errors J J that dierentiate between more or less important states. In the next section, we
will introduce the notion of weighted norms and present performance bounds that take state relevance into
account.
2.1
r)|
max |J (x) J(x,
r),
J J(,
r)1,
J J(,
r)|
max (x)|J (x) J(x,
xS
r)| ( > 0)
(x)|J (x) J(x,
xS
xS
r)| : x
E |J (x) J(x,
Theorem 2 Let J be such that J T J. Let JuJ be the cost-to-go of the greedy policy uJ . Then, for all
c > 0,
1
J J 1,c,J
JuJ J 1,c
1
3
h(x)
h1
h2
B
Dist. of x
x
Figure 2: Illustration of Performance Bounds for Example 1
where
1
T
T
= (1 )cT
cJ = (1 )c (I PuJ )
t Put J
t=0
or equivalently
c,J (x) = (1 )
c(y)
x S.
t=0
cT (JuJ J )
cT (JuJ J)
cT (I PuJ )1 guJ J)
cT (I PuJ )1 (T J J)
c T (I PuJ )1 (J J)
1
J J1,c,J
1
JuJ J 1,c
2
J J
1
1
J J 1,c,J .
1
The analysis presented in the previous sections is based on the greedy policy u J associated with approximation J. In order to use this policy, we need to compute
This step is typically done in real-time, as the system is being controlled. If the set of available actions A
is relatively small and the summation y Pa (x, y)J(y, r) can be computed relatively fast, then evaluating
(1) directly when an action at state x is needed is a feasible approach. However, if this is not the case,
alternative solutions must be considered. We describe a few:
If the action set is relatively small but there are many y s to sum over, we can estimate
Pa (x, y)J(y, r)
(2)
by sampling:
N
1
J(yi , r).
N i=1
(3)
In some cases, a very large number of samples may be required for the empirical average (3) to be a
reasonable estimate of (2). In these cases, the computation of (2) or (3) could be done oine, and
stored via Q-factors:
QJ(x, a) = ga (x) +
Pa (x, y)J(y, r)
y
Clearly, QJ (x, a) requires space proportional to the size of the state space, and cannot be stored
exactly. In the same spirit as value function approximation, we can approximate it with a parametric
representation:
Q(x, a, s) QJ(x, a)
We may nd an approximate parameter s based on the approximation J by solving
N
min
s
1
2
(Q(x, a, s) QJ(x, a))
N i=1
(oine)
or, alternatively, we could do value function approximation to approximate the Q-factor directly.
Finally, if the action space is too large, computing minaA Q(x, a, s) may be prohibitively expensive
as well. As an alternative, we may consider also a parametric representation for policies:
u(x) u(x, t)
We may nd an appropriate parameter t by tting u(x, t) to the greedy policy u(x), computed oine:
min
t
N
1
2
(u(xi ) u(xi , t))
N i=1
or consider algorithms which mix together value function approximation and policy approximation.
March 17
Handout #17
Lecture Note 13
Temporal-Dierence Learning
We now consider the problem of computing an appropriate parameter r, so that, given an approximation
architecture J(x, r), J(, r) J ().
A class of iterative methods are the so-called temporal-dierence learning algorithms, which generates a
series of approximations Jk = J(, rk ) as follows. Consider generating a trajectory (x1 , u1 , . . . , xk , uk ), where
uk is the greedy policy with respect to Jk . We then have the error/temporal dierences
dk = guk (xk ) + Jk (xk+1 , rk ) Jk (xk , rk ),
which represent an approximation to the Bellman error (T Jk )(xk ) Jk (xk ) at state xk . Based on the
temporal dierences, an intuitive way of updating the parameters rk is to make updates proportional to the
observed Bellman error/temporal dierence:
rk+1 = rk + k dk zk ,
where k is the step size and zk is called an eligibility vector it measures how much updates to each
component of the vector rk would aect the Bellman error.
To gather more intuition about how to choose the eligibility vector, we will consider the case of au
tonomous systems, i.e., systems that do not involve control. In this case, we can estimate the cost-to-go
function via sampling as follows. Suppose that we have a trajectory x1 , . . . , xn . Then we have
J (x1 )
n1 g(xt )
t=1
J (x2 )
n2 g(xt )
t=2
..
.
In other words, from a trajectory x1 , . . . , xn , we can derive pairs (xi , J(xi )), where J(xi ) is a noisy and
biased estimate of J (xi ). Therefore we may consider tting the approximation J(x, r) by minimizing the
empirical squared error:
n
2
t , r)
min
Jn (xt ) J(x
(1)
r
t=1
We derive an incremental, approximate version of (1). First note that Jn (xt ) could be updated incrementally
as follows:
Jn+1 (xt ) = Jn (xt ) + n+1t g(xn+1 )
(2)
Alternatively, we may use a small-step update of the form
n
+1
(3)
which makes Jn+1 (xt ) an average of the old estimate Jn (xt ) and the new estimate (2). Finally, we may
approximate (3) to have Jn (xt ) function d1 , d2 , . . . , dn :
n
jt g(xj ) Jn (xt ) =
j=t
jt dt .
j=t
Hence
Jn+1 (xt ) = Jn (xt ) +
n
+1
jt dj .
(4)
j=t
Finally, we may consider having the sum in (1) implemented incrementally, so that the previous temporal
dierences do not have to be stored:
Jn+1 (xt ) = Jn (xt ) + n+1t dn+1 .
Hence, in each time stage, we would like to nd rn minimizing
n
2
t , r) .
Jn (xt ) + nt dn J(x
min
r
(5)
t=1
Starting from the solution rn to the problem at stage n, we can approximate the solution of the problem at
stage n + 1 by updating rn+1 along the gradient of (5). This leads to
n
tn
rn+1 = rn +
r J (rn , xt ) dn+1 .
t=1
= rk + k zk dk
= zk1 + r J(xk , rk )
The algorithm above is known as T D(1). We have the generalization T D(), [0, 1].
rk+1
zk
= rk + k zk dk
= zk1 + r J(xk , rk )
TD()
Before analyzing the behavior of T D(), we are going to study a related, deterministic algorithm
approximate value iteration. The analysis of T D() will be based on interpreting it as a stochastic approxi
mation version of approximate value iteration.
(1 )
m T m+1 J,
for [0, 1)
m=0
T J
J ,
for = 1.
Lemma 1
(1 ) J J
,
T J T J
1
J = T J
J, J
The motivation for T is as follows. Recall that, in value iteration, we have Jk+1 = T Jk . However, we
could also implement value iteration with Jk+1 = T L Jk , which implies L steps look ahead. Finally, we can
have an update that is a weighted average over all possible values of L; Jk+1 = T Jk gives one such update.
In what follows, we are going to restrict attention to linear approximation architectures. Let
J(x, r) =
i (x)ri ,
and
i=1
1 (1)
1 (2)
..
.
1 (n)
J =
2 (1)
2 (2)
..
.
2 (n)
...
...
...
...
d(1) 0
... 0
d(2) . . . 0
0
D=
.
..
...
.
. . . ..
0
0
. . . d(n)
P (1)
P (2)
..
.
P (n)
where d : S (0, 1)S is a probability distribution over states. Dene the weighted Euclidean norms
J2,D = J T DJ =
d(x)J 2 (x)
xS
< J J >D
= J DJ =
T
d(x)J(x)J(x)
xS
J
State 2
T rk
J = r
rk+1 = T rk
State 1
rk
State 3
Figure 1: Approximate Value Iteration
(6)
< J, J J >D = 0
J22,D
J22,D
+ J
(7)
J22,D
(8)
T
T
..
.
T
rk
T rk
T rk
Figure 2: T rk must be inside the smaller square and T rk must be inside the circle, but T rk may
be outside the larger square and further away from J than rk .
This lemma was proved in Problem Set 2, for the special case where P (x, x) > 0 for some x.
We are now poised to prove the following central result used to derive a convergent version of T D():
Lemma 4 Suppose that the transition matrix P is irreducible and aperiodic. Let
1 0
... 0
2 . . . 0
0
,
D= .
.
.
.
.
.
.
.
... .
0
0
. . . |S|
where is the stationary distribution associated with P . Then
P J2,D J2,D .
Proof:
P J22,D
(x)
(x)
P (x, y)J(y)
xS
xS
(y)J 2 (y)
= J22,D
The rst inequality follows the Jensens inequality and the third equality holds because is a stationary
2
distribution.
1 0
2
0
D =
..
..
.
.
0
0
... 0
... 0
..
... .
. . . |S|
and is the stationary distribution of the transition matrix P . It follows that, if the projection is performed
with respect to 2,D , T becomes a contraction with respect to the same norm, and convergence of
T D() is guaranteed.
Lemma 5
(i)
(ii)
(iii)
2,D
T J T J2,D J J
(1
)
2,D
J J
T J T J2,D
1
(1 )
2,D
J J
T J T J2,D
1
7
Proof of (1)
2,D
T J T J
= g + P J (g + J)2,D
2,D
= P J P J
2,D
J J
Theorem 1 Let
rk+1 = T rk
and
D =
1
0
..
.
0
0
2
..
.
0
...
...
...
...
0
0
..
.
|S|
Then rk r with
r J 2,D K, J J 2,D .
Proof: Convergence follows from (iii). We have r = T r and J T J . Then
r J 22,D
2
r J + J J 2,D
r J 22,D + J J 22,D
T r
2
T J 22,D
+ J
(orthogonal)
2
J 2,D
(1 )
2
r J 22,D + J J 2,D
(1 )2
Therefore
r J 2,D
1
J J 2,D
1
2
March 29
Handout #18
Lecture Note 14
Convergence of T D()
In this lecture, we will continue to analyze the behavior of T D() for autonomous systems. We assume that
the system has stage costs g(x) and transition matrix P .
Recall that we want to approximate J by J
r. We nd successive approximations r0 , r1 , . . . by
applying T D():
rk+1
dk
zk
= rk + k dk zk
(1)
() (x )
(2)
(3)
=0
= (1 )
m T m+1 J,
m=0
< , J >D .
(4)
D=
1
0
..
.
0
0
2
..
.
0
...
...
...
...
0
0
..
.
|S|
(1)
1
1
J J 2,D
1 k2
We can think T D() as a stochastic approximations version of AVI. Recall that the main idea in stochastic
approximation algorithms is as follows. We would like to solve a system of equations r = Hr, but only have
access to noisy observations Hr = w for any given r. Then we attempt to solve r = Hr iteratively by
considering
rk+1 = rk + k (Hrk rk + wk ).
Hence in order to show that T D() is a stochastic approximations version of AVI, we would like to show
that
rk+1 = T rk rk + wk ,
for some noise wk .
The following lemma expresses (4) in a format that is more amenable to our analysis.
Lemma 1 The AVI equations (4) can be rewritten as
rk+1 = < , T rk >D ,
(5)
rk+1 = Ark + b,
(6)
or, equivalently,
where
T
A = (1 ) D
t+1
(P )
(7)
t=0
and
T
b= D
()t P t g.
(8)
t=0
Proof: (5) follows immediately from the denition of . Now note that
rk+1
= < , T rk >D
=
=
=
T DT rk
T
(1 ) D
(1 )T D
m=0
(P ) g +
= Ark +
m+1
m+1
rk
t=0
m (P )m+1 rk + (1 )
m=0
t=0
(P )t g
m=t
(P )t g
t=0
= Ark + b.
= A,
lim Ebk
= b,
Ak
bk
= zk g(xk ).
We will study the limit of EAk and Ebk . For all J , we have
k
k
lim E [zk Jk ] = lim E
()
(x )J (xk )
=0
lim E
() (xk )J (xk )
( P k (x, y) (y))
=0
=0
() (x0 )J (x )|x0
=0
() < , P J >D
=0
Letting
J (xk , xk+1 ) = (xk+1 ) (xk ),
we conclude that
lim EAk
() < , P +1 P >D +I
=0
=
=
=
=
T D
T D
T D
=0
=0
+1 P +1 T D
+1 P +1 T D
=0
P + I
P T D + I
=1
+1 P +1 T D
=0
(1 )
=0
+1 P +1
=0
= A.
3
+1 P +1
(9)
() < , P g >D
=0
= b.
If = 1, we have
() (g + P rk rk ) = J + (I P )1 (P Ir) = J r.
=0
If < 1, then
() P
=0
P (1 )
t=
=0
(1 )
=0
t P t
t=0
Thus
() P (g + P rk rk )
= (1 )
=0
(1 )
=0
t P t (g + P r r)
t=0
=0
t=0
t P t g + t+1 P t+1 r r
T t rk
= T rk rk
Therefore,
lim E [zk dk ] =< , T rk rk >D
Comparing Lemmas 1 and 2, it is clear that T D() can be seen as a stochastic approximations version
of AVI; in particular, TDs equations can be written as
rk+1 = rk + k (Ark + b rk + wk ),
where wk = (Ak A)rk + (bk b). If rk remains bounded, we should have limk Ewk = 0, so that the noise
is zeromean asymptotically. Note however that the noise is not independent of the past history, and in fact
follows a Markov chain, since matrices Ak and bk are functions of xk and xk+1 . This makes application of
the Lyapunov analysis for convergence of T D() dicult, and we must resort to the ODE analysis instead.
The next theorem provides the convergence result.
Theorem 2 Suppose that P is irreducible and aperiodic and that
rk r w.p.1, where r = T r .
4
k=1
k = and
k=1
k2 < . Then
2
(a)
k=1 k = ,
k=1 k <
(b) xk follows a Markov chain and has stationary distribution
(c) A = E[A(x)|x ] is negative denite, and b = E[b(x)|x ]
(d) E[A(xk )|x0 ] A Ck , x0 , k, and
E[b(xk )|x0 ] b Ck , x0 , k
Then rk r w.p.1, i.e., Ar + b = 0.
Sketch of Proof of Theorem 2 We verify that conditions (a)(d) of Theorem 3 are satised.
Conditions (a) and (b) are satised by assumption.
(c) For all r, we have
rT Ar
= rT < , (1 )
+1 P +1 r r >D
=0
= < r, (1 )
+1 P +1 r >D r22,D
=0
r22,D r22,D
( )
< 0
Hence, A is negative denite.
(d) We must consider the quantities
E[bk b]
This involves a comparison of E[zk (xk+1 ], E[zk (xk )] and E[zk g(xk )] with their limiting values as k goes
to innity. Let us focus on term zk (xk ); the other terms involve similar analysis. We have
kt
= E
()
(x
)
(x
)
|
x
=
x
t
k
0
t=0
zk
kt
(xt )()
(xk )|xt , t 0
t=
kt
(xt )()
(xk )|x0 = x
t=0
t=0
kt
(xt )()
(xk )|xt
t=
It follows from basic matrix theory that |P (xt = x|x0 ) (xt )| Ct , where corresponds to the second
highest eigenvalue of P, which is strictly less than one since P is irreducible and aperiodic. Therefore we
have
t=0
k
2
2
t=0
+ E
t= k
2 +1
t= k
2 +1
M ()k/2 + k/2 ,
for some M < . Moreover,
kt
(xt )()
t=
March 31
Handout #19
Lecture Note 15
In the last lecture, we have analyzed the behavior of T D() for approximating the costtogo function in
autonomous systems. Recall that much of the analysis was based on the idea of sampling states according to
their stationary distribution. This was done either explicitly, as was assumed in approximate value iteration,
or implicitly through the simulation or observation of system trajectories. It is unclear how this line of
analysis can be extended to general controlled systems. In the presence of multiple policies, in general
there are multiple stationary state distributions to be considered, and it is not clear which one should be
used. Moreover, the dynamic programming operator T may not be a contraction with respect to any such
distribution, which invalidates the argument used in the autonomous case. However, there is a special
class of control problems for which analysis of T D() along the same lines used in the case of autonomous
systems is successful. These problems are called optimal stopping problems, and are characterized by a tuple
(S, P : S S [0, 1], g0 : S , g1 : S ), with the following interpretation. The problem involves a
Markov decision process with state space S. In each state, there are two actions available: to stop (action
0) or to continue (action 1). Once action 0 (stop) is selected, the system is halted and a nal cost g0 (x) is
incurred, based on the nal state x. At each previous time stage, action 1 (continue) is selected and a cost
g1 (x) is incurred. In this case, the system transitions from state x to state y with probability P (x, y). Each
policy corresponds to a (random) stopping time u , which is given by
u = min{k : u(xk ) = 0}.
Example 1 (American Options) An American call option is an option to buy stock at a price K, called
the stock price, on or before an expiration date the last time period the option can be exercised. The state
of a such a problem is the stock price Pk . Exercising the option corresponds to the stop action and leads
to a reward max(0, Pk K); not exercising the option corresponds to the continue action and incurs no
costs or rewards.
Example 2 (The Secretary Problem) In the secretary problem, a manager needs to hire a secretary.
He interviews secretaries sequentially and must make a decision about hiring each one of them immediately
after the interview. Each interview incurs a certain cost for the hours spent meeting with the candidate, and
hiring a certain person incurs a reward that is a function of the persons abilities.
In the innitehorizon, discountedcost case, each policy is associated with a discounted costtogo
t
u
Ju (x) = E
g1 (t) + g0 (xu )|x0 = x .
t=0
(1)
P (J J)2,D
2,D ,
J J
where the last inequality follows from P J2,D J2,D by Jensens inequality and stationarity of .
1.1
T D()
In general control problems, storing Q may require substantially more space than storing J , since Q is
a function of stateaction pairs. However, in the case of optimal stopping problems, storing Q requires
essentially the same space as J , since the Q value of stopping is trivially equal to g0 (x). Hence in the case
of optimal stopping problems, we can set T D() to learn
Q = g1 + P J ,
the cost of choosing to continue and behaving optimally afterwards. Note that, assuming that onestage
costs g0 and g1 are known, we can derive an optimal policy from Q by comparing the cost of continuing
with the cost of stopping, which is simply g0 . We can express J in terms of Q as
J = min(g0 , Q ),
2
Let
HQ = g1 + P min(g0 , Q),
and
H Q = (1 )
(2)
t H t+1 Q.
t=0
= rk + k zk (H rk rk + wk ).
(3)
The following property of H implies that an analysis of T D() for optimal stopping problems along the same
lines of the analysis for autonomous systems suces for establishing convergence of the algorithm.
Lemma 2
2,D Q Q
2,D .
HQ HQ
Theorem 1 [Analogous to autonomous systems] Let rk be given by (3) and suppose that P is irreducible
and aperiodic. Then rk r w.p.1, where r satises
r = H r
and
r Q 2,D
where k =
(1)
1
1
Q Q 2,D ,
1 k2
(4)
We can also place a bound on the loss in the performance incurred by using a policy that is greedy with
respect to Q , rather than the optimal policy. Specically, consider the following stopping role based on
r .
stop,
if g0 (x) r (x)
u(x)
=
continue, otherwise
species a (random) stopping time , which is given by
u
= min{k : (xk )r g0 (xk )},
and the costtogo associated with u
is given by
1
J (x) = E
g1 (xt ) + g0 (x ) .
t=0
The following theorem establishes a bound on the expected dierence between J and the optimal costtogo
J .
3
Theorem 2
Q Q 2,D
(1 ) 1 K 2
be the cost of choosing to continue in the current time stage, followed by using policy u:
Proof: Let Q
= g1 + P J.
Q
Then we have
E[J(x0 ) J(x0 )|x0 ]
(x)|P (J J )(x)|
xS
= P J P J 1,
P J P J 2,
=
=
1
g1 + P J g1 + P J 2,
1
Q Q 2,
The inequality (5) follows from the fact that, for all J, we have
J21,
E[|J(x)| : x ]2
E[J(x)2 : x ]
= J22, ,
and K, given by
where the inequality is due to Jensens inequality. Now dene the operators H
HQ
= g1 + P KQ, where
Hr = Hr
HQ = Q
Q
= Q.
H
is a contraction with respect to 2, . Now we have
Moreover, it is also easy to show that H
Q 2,
Q
Hr
2,
Q Hr 2, + Q
Q
Hr
2,
= HQ Hr 2, + H
r 2,
Q r 2, + Q
Q 2, + Q r 2,
Q r 2, + Q
r 2,
2Q r 2, + Q
4
(5)
(6)
Thus,
Q 2,
Q
2
Q r 2, .
1
The theorem follows from Theorem 1 and equations (6) and (7).
1.2
(7)
2
(8)
where u represents the projection based on 2,u and uk is the greedy policy with respect to rk . Such
a scheme is a plausible approximation, for instance, for an approximate policy iteration based on T D()
trained with system trajectories:
1. Select a policy u0 . Let k = 0.
2. Fit Juk rk (e.g., via T D() for autonomous systems);
3. Choose uk+1 to be greedy with respect to rk . Let k = k + 1. Go back to step 2.
Note that step 2 in the approximate policy iteration presented above involves training over an innitely long
trajectory, with a single policy, in order to perform policy evaluation. Drawing inspiration from asynchronous
policy iteration, one may consider performing policy updates before the policy evaluation step is considered.
As it turns out, none of these algorithms is guaranteed to converge. In particular, approximate value iteration
(8) is not even guaranteed to have a xed point. For an analysis of approximate value iteration, see [1].
A special situation where AVI and T D() are guaranteed to converge occurs when the basis functions
are constant over partitions of the state space.
Theorem 3 Suppose that i (x) = 1{x Ai }, where Ai Aj = , i, j, i = j. Then
rk+1 = T rk
converges for any Euclidean projections .
Proof: We will show that, if is a Euclidean projection and i satisfy the assumption of the theorem, then
is a maximumnorm nonexpansion:
J J
.
J J
5
Let
(J )(x)
Ki
if x Ai , where
arg min
w(x) J (x) r
= Ki ,
=
xAi
Thus
w(x)
.
xAi w(x)
E J (x) J(x)|x wi
J J
It follows that T is a maximumnorm contraction, which ensures convergence of approximate value iteration.
2
References
[1] D.P. de Farias and B. Van Roy. On the existence of xed points for appproximate value iteration and
temporaldierence learning. Journal of Optimization Theory and Applications, 105(3), 2000.
April 5
Handout #20
Lecture Note 16
In previous lectures on dynamic programming, we have studied the value and policy iteration algorithms
for solving Bellmans equation. We now introduce a different algorithm, which is based on formulating the
dynamic programming problem as a linear program.
Consider the following optimization problem:
cT J
maxJ
subject to
(1)
T J J,
and suppose that vector c is strictly positive: c > 0. Recall the following lemma from Lecture 3:
Lemma 1 For any J such that T J J, we have J J .
It follows from the previous lemma that, whenever c > 0, J is the unique solution to problem (1). We refer
to this problem as the exact LP. Note however that, strictly speaking, this problem is not a linear program;
in particular, the constraints
(T J)(x) J(x)
(
min ga (x) +
a
(2)
X
y
Pa (x, y)J(y)
J(x)
are not linear in the variables J of the LP. However, (1) can easily be converted into an LP by noting that
each constraint (2) is equivalent to
X
ga (x) +
Pa (x, y)J(y) J(x), a Ax .
y
Note that the exact LP contains as many variables as the number of states in the system, and as many
constraints as the number of state-action pairs.
1.1
We can also find an optimal policy by solving the dual of the exact LP. For simplicity, we will consider
average-cost problems in this section, but the analysis and underlying ideas easily extend to the discountedcost case. The dual LP has an interesting interpretation, and it can be shown that solving it iteratively via
simplex or interior-point methods is equivalent to performing specific forms of policy iteration.
In the average-cost case, it can be shown that the dual LP is given as follows:
X
(x, a)ga (x)
min
(3)
(x, a), x
(4)
x,a
subject to
XX
y
X
a
(x, a) = 1
x,a
(x, a) 0, x, a
For simplicity, let us assume that the system is irreducible under every policy.
In order to analyze the dual LP, we consider the notion of randomized policies. So far, we have defined
a policy to be a mapping from states to actions; in other words, for every state x a policy u prescribes
a (deterministic) action u(x) A . Alternatively, we can consider an extended definition where, for any
state action, a policy u prescribes a probability u(x, a) for taking each action a A x . Each policy u is now
associated with a transition matrix Pu such that
X
Pu (x, y) =
u(x, a)Pa (x, y),
a
u (x) 0.
We can also verify that the state costs associated with policy u are given by
X
u(x, a)ga (x).
gu (x) =
a
With these notions in mind, it can be shown that the variables (x, a) for any feasible solution to the
dual LP can be interpreted as state-action frequencies for a randomized policy. Indeed, let
X
(x) =
(x, a),
(5)
a
and
u(x, a) =
(x, a)
,
(x)
(6)
if (x) > 0, and u(x, ) be an arbitrary distribution over Ax , otherwise. Note that in either case we have
(x, a) = (x)u(x, a),
and u(x, a) is a randomized policy. Then we can show that corresponds to the stationary state distribution
P
u , and x,a (x, a)ga (x, a) corresponds to the average cost u of policy u.
Lemma 2 For every feasible solution (x, a) of the dual LP, let (x) and u(x, a) be given by (??) and (6).
P
Then = u , and x,a (x, a)ga (x) corresponds to the average cost u of policy u.
2
XX
y
(x)
XX
(x, a)
1.
We conclude that is a stationary distribution associated with policy u. Since by assumption the system
is irreducible under every policy, each policy has a unique stationary distribution, and we have = u . We
now have
X
XX
(x, a)ga (x) =
u (x)u(x, a)ga (x)
x,a
u (x)gu (x)
u .
From the previous lemma, we conclude that each feasible solution of the dual LP is identified with a policy
u, and the variables (x, a) correspond to the probability of observing state x and action a, in steady state.
Consider using simplex or an interior-point method for solving the dual LP. Either method will generate a
sequence of feasible solutions 0 , 1 , . . . , with decreasing value of the objective function. Interpreting this
sequence with Lemma 2, we see that this is equivalent to generating a sequence of policies u 0 , u1 , . . . , with
decreasing average cost, and solving the LP corresponds to performing policy iteration.
maxr
subject to
(7)
T r r
As with the case of exact dynamic programming, the optimization problem (??) can be recast as a linear
program
max
cT r
s.t.
ga (x) +
yS
x S, a Ax .
We will refer to this problem as the approximate LP. Note that the approximate LP involves a potentially
much smaller number of variables it has one variable for each basis function. However, the number of
constraints remains as large as in the exact LP. Fortunately, most of the constraints become inactive, and
solutions to the linear program can be approximated efficiently, as we will show in future lectures.
2.1
State-Relevance Weights
In the exact LP, for any vector c with positive components, maximizing c T J yields J . In other words, the
choice of state-relevance weights does not influence the solution. The same statement does not hold for the
approximate LP. In fact, the choice of state-relevance weights may bear a significant impact on the quality
of the resulting approximation.
To motivate the role of state-relevance weights, let us start with a lemma that offers an interpretation of
their function in the approximate LP.
Lemma 3 A vector r solves
max
cT r
s.t.
T r r,
kJ rk1,c
s.t.
T r r.
Proof: It is clear that the approximate LP is equivalent to minimizing c T (J r) over all feasible r. For
all r such that T r r, we have r J , and cT (J r) = kJ rk1,c .
2
Lemma 3 suggests that the state-relevance weights may be used to control the quality of the approximation
to the cost-to-go function over different portions of the state space. Recall that ultimately we are interested
in generating good policies, rather than good approximations to the cost-to-go function, and ideally we would
like to choose c to reflect that objective. The following theorem, from Lecture 12, suggests certain choices
for state-relevance weights. Recall that
T,J = cT (I PuJ )1 .
Theorem 1 Let J : S 7 < be such that T J J. Then
kJuJ J k1,
1
kJ J k1,,J .
1
(8)
Contrasting Lemma 3 with the bound on the increase in costs (8) given by Theorem 1, we may want
to choose state-relevance weights c that capture the (discounted) frequency with which different states are
expected to be visited. Note that the frequency with which different states are visited in general depends on
the policy being used. One possibility is to have an iterative scheme, where the approximate LP is solved
multiple times with state-relevance weights adjusted according to the intermediate policies being generated.
Alternatively, a plausible conjecture is that some problems will exhibit structure making it relatively easy
4
J*
r *
J(2)
TJ > J
~
r
J(1)
J = r
For the approximate LP to be useful, it should deliver good approximations when the cost-to-go function is
near the span of selected basis functions. Figure 1 illustrates the issue. Consider an MDP with states 1 and
2. The plane represented in the figure corresponds to the space of all functions over the state space. The
shaded area is the feasible region of the exact LP, and J is the pointwise maximum over that region. In the
approximate LP, we restrict attention to the subspace J = r.
In Figure 1, the span of the basis functions comes relatively close to the optimal cost-to-go function J ;
if we were able to perform, for instance, a maximum-norm projection of J onto the subspace J = r, we
would obtain the reasonably good approximate cost-to-go function r . At the same time, the approximate
LP yields the approximate cost-to-go function
r. In this section, we develop bounds guaranteeing that
r
2
min kJ rk .
1 r
= e means
State 2
TJ
J
No feasible
r
Before proving Theorem 2, we state and prove the following auxiliary lemma:
Lemma 4 For all J, let
1+
J = J
kJ J k e
1
Then, we have
T J J
Proof: Let = kJ J k . Thus we have
1+
T J = T (J
e)
1
kT J T J k kJ J k =
J = r
r
State 1
Then
T J
=
=
(1 + )
e
1
(1 + )
e
J (1 + )e
1
(1 + )
1+
e (1 + )e
e
J +
1
1
J.
We are now ready to finish the proof of Theorem 2. Let r = arg minr kJ rk . Let = kJ r k .
Then by Lemma 4, we have
1+
r = r
e
1
is a feasible solution for the ALP. From Lemma 3, we have
kJ
rk1,c
kJ
rk1,c
1+
ek1,c
1
1+
kJ r k1,c +
1
1+
kJ r k +
1
1+
+
1
2
1
kJ r
Theorem 2 establishes that when the optimal cost-to-go function lies close to the span of the basis
functions, the approximate LP generates a good approximation. In particular, if the error min r kJ rk
goes to zero (e.g., as we make use of more and more basis functions) the error resulting from the approximate
LP also goes to zero.
Though the above bound offers some support for the linear programming approach, there are some
significant weaknesses:
1. The bound calls for an element of the span of the basis functions to exhibit uniformly low error over
all states. In practice, however, minr kJ rk is typically huge, especially for large-scale problems.
2. The bound does not take into account the choice of state-relevance weights. As demonstrated in the
previous section, these weights can significantly impact the approximation error. A sharp bound should
take them into account.
In the next lecture, we will show how the previous analysis can be generalized to take into account structure
about the underlying MDP and address the aforementioned issues.
7
April 7
Handout #21
Lecture Note 17
cT r
T r r
(ALP)
In the previous lecture, we proved the following result on the approximation error yielded by the ALP:
Theorem 1 If v = e for some v, then we have
J
r1,c
2
min J r .
1 r
(1)
Though the above bound oers some support for the linear programming approach, there are some signicant
weaknesses:
1. The bound calls for an element of the span of the basis functions to exhibit uniformly low error over
all states. In practice, however, minr J r is typically huge, especially for large-scale problems.
2. The bound does not take into account the choice of state-relevance weights. As demonstrated in the
previous section, these weights can signicantly impact the approximation error. A sharp bound should
take them into account.
In this lecture, we present a line of analysis that generalizes Theorem 1 and addresses the aforementioned
diculties.
Lyapunov Function
To set the stage for the development of an improved bound, let us establish some notation. First, we
introduce a weighted maximum norm, dened by
J, = max (x)|J(x)|,
xS
(2)
for any : S + . As opposed to the maximum norm employed in Theorem 1, this norm allows for uneven
weighting of errors across the state space.
We also introduce an operator H, dened by
yS
for all V : S . For any V , (HV )(x) represents the maximum expected value of V (Y ) if the current state
is x and Y is a random variable representing the next state.
1
(HV )(x)
.
V (x)
(3)
(gu + Pu J)
(gu + Pu J)
(gu + Pu J)
(gu + Pu J)
= Pu (J J)
Pu |J J|
Therefore, x S,
(T J )(x) (T J)(x)
Pu (x, y)
|J (y) J(y)|
V (y)
V (y)
|J (y ) J(y )|
Pu (y) max
V (y)
y
V (y )
y
(HV )(x)
V (x)
Hence,
T J T J V V
and we have
We are now ready to state our main result. For any given function V mapping S to positive reals, we
use 1/V as shorthand for a function x 1/V (x).
Theorem 2 Let r be a solution of the approximate LP. Then, for any v K such that v is a Lyapunov
function,
2cT v
(5)
J
r1,c
min J r,1/v .
1 v r
Proof: Let
r = arg min J r, V1
r
and
= J r . V1 .
Let
r = r
1 + V
V
1 V
Then
T
r T r , Vt
V
r r , V1
(1 + V )
V , V1
1 V
(1 + V )
V
1 V
(6)
Moreover,
J r , V1 V J r , V1 = V
(7)
Thus,
1 + V
V
from (6)
1 V
1 + V
J V V V
V
from (7)
1 V
1 + V
V
r (1 + V )V V
1 V
1 + V
1 + V
=
r+
V (1 + V )V V
V
1 V
1 V
=
r
T
r
T r V
J
r1,c
=
=
=
(1 + V ) T
c(x)|J (x) (r )(x)| +
c V
1 V
x
(1 + V ) T
c(x)V (x) +
c V
1 V
x
(1 + V ) T
+
c V cT V
1 V
2cT V
1 V
Let us now discuss how this new theorem addresses the shortcomings of Theorem 1 listed in the previous
section. We treat in turn the two items from the aforementioned list.
1. The norm appearing in Theorem 1 is undesirable largely because it does not scale well with
problem size. In particular, for large problems, the optimal value function can take on huge values
over some (possibly infrequently visited) regions of the state space, and so can approximation errors
in such regions.
Observe that the maximum norm of Theorem 1 has been replaced in Theorem 2 by ,1/v . Hence,
the error at each state is now weighted by the reciprocal of the Lyapunov function value. This should
to some extent alleviate diculties arising in large problems. In particular, the Lyapunov function
should take on large values in undesirable regions of the state space regions where J is large.
Hence, division by the Lyapunov function acts as a normalizing procedure that scales down errors in
such regions.
4
2. As opposed to the bound of Theorem 1, the state-relevance weights do appear in our new bound. In
particular, there is a coecient cT v scaling the right-hand-side. In general, if the state-relevance
weights are chosen appropriately, we expect that cT v will be reasonably small and independent of
problem size. We defer to the next section further qualication of this statement and a discussion of
approaches to choosing c in contexts posed by concrete examples.
13
1
37
22
35
11
18
Machine 1
26
4
Machine 2
34
Machine 3
g(x) =
=
xi ,
d i=1
d
since the expected total number of jobs at time t cannot exceed the total number of jobs at time 0 plus the
expected number of arrivals between 0 and t, which is less than or equal to Adt. Therefore we have
t
|xt |
=
Ex
t Ex [|xt |]
t=0
t=0
t (|x| + Adt)
t=0
Ad
|x|
+
.
1 (1 )2
(8)
The rst equality holds because |xt | 0 for all t; by the monotone convergence theorem, we can interchange
the expectation and the summation. We conclude from (8) that the optimal value function in the innite
buer case should be bounded above by a linear function of the state; in particular,
0 J (x)
1
|x| + 0 ,
d
J ,1/V
1 |x| + d0
x0
|x| + dC
0
1 + ,
C
max
and the bound above is independent of the number of queues in the system.
Now let us study V . We have
|x| + Ad
(HV )(x)
+C
d
A
V (x) + |x|
d +C
A
V (x) +
,
C
and it is clear that, for C suciently large and independent of d, there is a < 1 independent of d such that
HV V , and therefore 11V is uniformly bounded on d.
Finally, let us consider cT V . We expect that under some stability assumptions, the tail of the steady-state
d
1
distribution will have an upper bound with geometric decay [1] and we take c(x) = 1
|x| . The
B+1
state-relevance weights c are equivalent to the conditional joint distribution of d independent and identically
distributed geometric random variables conditioned on the event that they are all less than B + 1. Therefore,
d
T
Xi + C Xi < B + 1, i = 1, ..., d
c V = E
d i=1
<
E [X1 ] + C
=
+ C,
1
where Xi , i = 1, ..., d are identically distributed geometric random variables with parameter 1 . It follows
that cT V is uniformly bounded over the number of queues.
References
[1] D. Bertsimas, D. Gamarnik, and J.N. Tsitsiklis. Performance of multiclass Markovian queueing networks
via piecewise linear Lyapunov functions. Annals of Applied Probability, 11(4):13841428, 2001.
April 12
Handout #21
Lecture Note 18
While the ALP may involve only a small number of variables, there is a potentially intractable number of
constraints one per state-action pair. As such, we cannot in general expect to solve the ALP exactly. The
focus of this paper is on a tractable approximation to the ALP: the reduced linear program (RLP).
Generation of an RLP relies on three objects: (1) a constraint sample size m, (2) a probability measure
over the set of state-action pairs, and (3) a bounding set N K . The probability measure represents
a distribution from which we will sample constraints. In particular, we consider a set X of m state-action
pairs, each independently sampled according to . The set N is a parameter that restricts the magnitude
r. The RLP is dened by
of the RLP solution. This set should be chosen such that it contains
maximize
subject to
cT r
(x, a) X
(1)
r1,c J
r1,c 1 ,
Pr J
where > 0 is an error tolerance parameter and > 0 parameterizes a level of condence 1 . This paper
focusses on understanding the sample size m needed in order to meet such a requirement.
1.1
To apply the RLP, given a problem instance, one must select parameters m, , and N . In order for the
RLP to be practically solvable, the sample size m must be tractable. Results of our analysis suggest that if
and N are well-chosen, an error tolerance of can be accommodated with condence 1 given a sample
size m that grows as a polynomial in K, 1/, and log 1/, and is independent of the total number of ALP
constraints.
Our analysis is carried out in two parts:
1. Sample complexity of near-feasibility. The rst part of our analysis applies to constraint sampling
in general linear programs not just the ALP. Suppose that we are given a set of linear constraints
zT r + z 0, z Z,
on variables r K , a probability measure on Z, and a desired error tolerance and condence
1 . Let z1 , z2 , . . . be independent identically distributed samples drawn from Z according to . We
will establish that there is a sample size
1
1
1
m=O
K ln + ln
such that, with probability at least 1 , there exists a subset Z Z of measure (Z) 1 such
that every vector r satisfying
zTi r + zi 0, i = 1, . . . , m,
also satises
zTi r + zi 0,
z Z.
We refer to the latter criterion as near-feasibility nearly all the constraints are satised. The main
point of this part of the analysis is that near-feasibility can be obtained with high-condence through
imposing a tractable number m of samples.
2. Sample complexity of a good approximation. We would like the the error J
r1,c of an
1
A
A
,
m=O
K ln
+ ln
(1 )
(1 )
in the literature, as discussed in the following literature review. The signicance of our results is that they
suggest viability of the linear programming approach to approximate dynamic programming even in the
absence of such favorable special structure.
y : yT r + y < 0 .
(2)
12
2
4
K ln
+ ln
,
m
(3)
y : yT r + y < 0
(4)
This theorem implies that, even without any special knowledge about the constraints, we can ensure nearfeasibility, with high probability, through imposing a tractable subset of constraints. The result follows
immediately from Corollary 8.4.2 on page 95 of [1] and the fact that the collection of sets {{(, )| T r +
0}|r K } has VC-dimension K, as established in [2]. The main ideas for the proof are as follows:
1 if zT r + kz 0
We dene, for each r, a function fr : Z {0, 1}, given by r fr (z) =
0 otherwise
We are interested in nding r such that
E fr (z) 1,
E fr (z)
1
fr (z)
fr (zi ) = E
m i=1
fr | ,
|E fr E
then for all r FZ ,
(5)
fr = 1 E 1 , r F
E
Z
From the VC-dimension and supervised learning lecture, we know that there is a way of ensuring (5)
if f C.
The nal part of the proof comes from verifying that C has VC-dimension less than or equal to p.
Theorem 1 may be perceived as a puzzling result: the number of sampled constraints necessary for a good
approximation of a set of constraints indexed by z Z depends only on the number of variables involved in
these constraints and not on the set Z. Some geometric intuition can be derived as follows. The constraints
are fully characterized by vectors [zT z ] of dimension equal to the number of variables plus one. Since
near-feasibility involves only consideration of whether constraints are violated, and not the magnitude of
violations, we may assume without loss of generality that [zT z ] = 1, for an arbitrary norm. Hence
constraints can be thought of as vectors in a low-dimensional unit sphere. After a large number of constraints
is sampled, they are likely to form a cover for the original set of constraints i.e., any other constraint
is close to one of the already sampled ones, so that the sampled constraints cover the set of constraints.
The number of sampled constraints necessary in order to have a cover for the original set of constraints
is bounded above by the number of sampled vectors necessary to form a cover to the unit sphere, which
naturally depends only on the dimension of the sphere, or alternatively, on the number of variables involved
in the constraints.
1.5
0.5
0.5
0.5
1.5
Figure 1: A feasible region dened by a large number of redundant constraints. Removing all but a random
sample of constraints is likely to bring about a signicant change the solution of the associated linear program.
In this section, we investigate the impact of using the RLP instead of the ALP on the error in the approxima
tion of the cost-to-go function. We show in Theorem 2 that, by sampling a tractable number of constraints,
the approximation error yielded by the RLP is comparable to the error yielded by the ALP.
The proof of Theorem 2 relies on special structure of the ALP. Indeed, it is easy to see that such a result
cannot hold for general linear programs. For instance, consider a linear program with two variables, which
are to be selected from the feasible region illustrated in Figure 1. If we remove all but a small random sample
of the constraints, the new solution to the linear program is likely to be far from the solution to the original
linear program. In fact, one can construct examples where the solution to a linear program is changed by
an arbitrary amount by relaxing just one constraint.
Let us introduce certain constants and functions involved in our error bound. We rst dene a family of
probability distributions on the state space S, given by
Tu = (1 )cT (I Pu )1 ,
(6)
for each policy u. Note that, if c is a probability distribution, u (x)/(1 ) is the expected discounted
number of visits to state x under policy u, if the initial state is distributed according to c. Furthermore,
lim1 u (x) is a stationary distribution associated with policy u. We interpret u as a measure of the
relative importance of states under policy u.
We will make use of a Lyapunov function V : S + , which is dened as follows.
Denition 1 (Lyapunov function) A function V : S + is called a Lyapunov function if there is a
scalar V < 1 and an optimal policy u such that
Pu V V V.
(7)
Our denition of a Lyapunov function is similar to that found in the previous lecture, with the dierence
that here the Lyapunov inequality (7) must hold only for an optimal policy, whereas in the previous lecture
it must hold simultaneously for all policies.
Lemma 1 Let V be a Lyapunov function for an optimal policy u . Then Tu is a contraction with respect
to ,1/V .
Proof: Let J and J be two arbitrary vectors in |S| . Then
Tu J Tu J = Pu (J J) J J,1/V Pu V J J,1/V V V.
For any Lyapunov function V , we also dene another family of probability distributions on the state
space S, given by
u (x)V (x)
u,V (x) =
.
(8)
Tu V
We also dene a distribution over state-action pairs
u,V (x, a) =
u,V (x)
, a Ax .
|Ax |
and
=
Tu V
sup J r,1/V .
cT J rN
(9)
We now present the main result of the paper a bound on the approximation error introduced by
constraint sampling.
Theorem 2 Let and be scalars in (0, 1). Let u be an optimal policy and X be a (random) set of m stateaction pairs sampled independently according to the distribution u ,V (x, a), for some Lyapunov function V ,
where
16A
48A
2
m
K ln
+ ln
,
(10)
(1 )
(1 )
Let r be an optimal solution of the ALP that is in N , and let r be an optimal solution of the corresponding
RLP. If r N then, with probability at least 1 , we have
J
r1,c J
r1,c + J 1,c .
(11)
Proof: From Theorem 1, given a sample size m, we have, with probability no less than 1 ,
(1 )
4A
u ,V ({(x, a) : (Ta
r)(x) < (
r)(x)})
u ,V (x)
1(Ta r)(x)<(r)(x)
|Ax |
xS
aAx
1
u ,V (x)1(Tu r)(x)<(r)(x) .
A
xS
(12)
=
=
=
=
cT (I Pu )1 (gu (I Pu )
r)
cT (I Pu )1 |gu (I Pu )
r|
T
1
(gu (I Pu )
r) + (gu (I Pu )
r)
c (I Pu )
r) (gu (I Pu )
r) +
cT (I Pu )1 (gu (I Pu )
+2 (gu (I Pu )
r)
cT (I Pu )1 gu (I Pu )
r + 2 (Tu
r
r)
cT (J
r) + 2cT (I Pu )1 (Tu
r
r) .
(13)
n Pun 0,
n=0
(I Pu )1 (gu (I Pu )
r) (I Pu )1 |(gu (I Pu )
r)|
r)| .
= (I Pu )1 |(gu (I Pu )
Now let r be any optimal solution of the ALP1 . Clearly, r is feasible for the RLP. Since r
is the optimal
solution of the same problem, we have cT
r and
r cT
cT (J
r)
cT (J
r)
r1,c ,
J
(14)
therefore we just need to show that the second term in (13) is small to guarantee that the performance of
the RLP is not much worse than that of the ALP.
1 Note that all optimal solutions of the ALP yield the same approximation error J r
1,c , hence the error bound (11)
is independent of the choice of r.
Now
2cT (I Pu )1 (Tu
r
r)
=
=
T (Tu
r
r)
1 u
2
u (x) ((
r)(x) (Tu
r)(x)) 1(Tu r)(x)<(r)(x)
1
xS
2 (
r)(x) (Tu
r)(x)
u (x)V (x)1(Tu r)(x)<(r)(x)
1
V (x)
xS
2Tu V
Tu
r,1/V
u ,V (x)1(Tu r)(x)<(r)(x)
r
1
xS
T
V Tu
r
r,1/V
2 u
T
r J ,1/V + J
r,1/V )
V (Tu
2 u
T
V (1 + V )J
r,1/V
2 u
J 1,c ,
with probability greater than or equal to 1 , where second inequality follows from (12) and the fourth
inequality follows from Lemma 1. The error bound (11) then follows from (13) and (14).
Three aspects of Theorem 2 deserve further consideration. The rst of them is the dependence of the
number of sampled constraints (10) on . Two parameters of the RLP inuence the behavior of : the
Lyapunov function V and the bounding set N . Graceful scaling of the sample complexity bound depends
on the ability to make appropriate choices for these parameters.
The number of sampled constraints also grows polynomially with the maximum number of actions avail
able per state A, which makes the proposed approach inapplicable to problems with a large number of actions
per state. It can be shown that complexity in the action space can be exchanged for complexity in the state
space, so that such problems can be recast in a format that is amenable to our approach.
Finally, a major weakness of Theorem 2 is that it relies on sampling constraints according to the distri
bution u ,V . In general, u ,V is not known, and constraints must be sampled according to an alternative
distribution . Suppose that (x, a) = (x)/|Ax | for some state distribution . If is similar to u ,V ,
one might hope that the error bound (11) holds with a number of samples m close to the number suggested
in the theorem. We discuss two possible motivations for this:
1. It is conceivable that sampling constraints according to leads to a small value of
u ,V ({x : (
r)(x) (Tu
r)(x)}) (1 )/2,
with high probability, even though u ,V is not identical to . This would lead to a graceful sample
complexity bound, along the lines of (10). Establishing such a guarantee is closely related to the
problem of computational learning when the training and testing distributions dier.
2. If
Tu (Tu r r) C
T (Tu r r) ,
for some scalar C and all r, where
(x) =
(x)/V (x)
,
yS (y)/V (y)
16AC
48AC
2
m
K ln
+ ln
,
(1 )
(1 )
samples. It is conceivable that this will be true for a reasonably small value of C in relevant contexts.
How to choose is an open question, and most likely to be addressed adequately having in mind the
particular application at hand. As a simple heuristic, noting that u (x) c(x) as 0, one might choose
(x) = c(x)V (x).
References
[1] D. Anthony and N. Biggs. Computational Learning Theory. Cambridge University Press, 1992.
[2] R.M. Dudley. Central limit theorems for empirical measures. Annals of Probability, 6(6):899928, 1978.
April 14
Handout #23
Lecture Note 19
Overview
In the previous lecture, we studied constraint sampling as a generic approach to dealing with the large
number of constraints in the approximate LP, and showed that, by sampling a number of constraints that
is polynomial on the number of variables in the LP, it is possible to ensure that almost all constraints will
be satised, with high probability. The hope that an LP with a large number of constraints and a small
number of variables may be solved eciently either exactly or approximately stems from the fact that
many of the constraints should be redundant; in particular, it is known that only a number of constraints
equal to the number of variables is binding at the optimal solution. This gives hope that, at least in certain
problem-specic situations, other approaches besides constraint sampling may be used for dealing with the
large number of constraints. In particular, the following approaches may be possible:
We may be able to replace the original constraints T r r with an equivalent set of constraints
Ai r bi , i = 1, . . . , N where N is small;
Constraint Generation We may be able to solve the LP exactly without including all constraints by
solving it incrementally, as follows:
start with small subset of constraints
solve smaller LP
add one or more violated constraints
repeat until no violated constraints can be found.
Both approaches can be found in the literature; e.g., Morrison and Kumar [2] replace the exponentially many
constraints in the approximate LP with a manageable number of constraints in problems involving queueing
networks, and Grotschel and Holland [1] solve travelling salesman problems involving up to 260 constraints
by doing constraint generation. In todays lecture, we will study factored MDPs, a reasonably general class
of MDPs that lends itself well to both approaches.
Factored MDPs
The underlying idea in factored MDPs is that many high-dimensional MDPs are actually generated by
systems with many parts that are weakly interconnected. Each part i has an associated state variable Xi ,
so that the full state of the system is described by a vector (X1 , . . . , Xn ). We assume that costs are factored,
i.e.,
g(x) =
gj (XZj ),
(1)
j
where Zj {1, . . . , n} and XZj indicates a (hopefully small) subset of the state variables. Moreover, we also
assume that transition probabilities are factored, i.e.,
Pa (Xi (t + 1)|X(t)) = Pa (Xi (t + 1)|XZi (t)) i,
where once again Zi {1, . . . , n} and XZi indicates a (hopefully small) subset of the state variables. In
words, one way of viewing this assumptions is that costs are mostly local to the various parts of the system,
and dynamics are also mostly local, with each state variable being aected only the subset of state variables
it interacts with. Note that, in the long run, if all state variables are directly or indirectly interconnected,
the evolution of a particular state variable may still be aected by all others.
A common way of representing factored MDPs is through a dynamic Bayesian network, as shown in
Figure 1. The nodes at the left and right represent the state variables in subsequent time steps, and arcs
indicate the dependencies between state variables across time steps. This gure may be generalized to include
dependencies within the same time step.
g1 (x1 (t))
x1 (t)
g1 (x1 (t + 1))
x1 (t + 1)
x2 (t)
x2 (t + 1)
x3 (t)
x3 (t + 1)
xn (t)
xn (t + 1)
Figure 1: Each state is inuenced by a small subset of states in every time stage
Example 1 Consider the queueing network represented in Figure 2. With our usual choice of costs ga (x) =
i xi , corresponding to the total number of jobs in the system, it is clear that stage costs are factored.
Moreover, transition probabilities are also factored; for instance, we have
Pa (x2 (t + 1)|x(t)) = Pa1 ,a2 (x2 (t + 1)|x1 (t), x2 (t)),
since the number of jobs in queue 2 in the next time step is determined exclusively by potential departures
from queue 1 which depend only on x1 (t) and a1 (t) and potential departures from queue 2 which
depend only on x2 (t) and a2 (t).
2
x1
x2
a1
a2
a3
J (x)
i (xwi )ri J(x), Wi {1, . . . , n}.
Note that, in general, the optimal cost-to-go function J does not have an exact factored representation.
However, factored approximations are appealing both because, if the system is indeed only loosely intercon
nected, we can expect J to be roughly factored, and perhaps most importantly, factored approximations
give rise to decentralized policies. Indeed, note that Q factors associated with a factored approximation J
are also factored:
Q(x, a) = ga (x) +
Pa (x; y)J(y)
y
= ga (x) +
Pa (x1 , . . . , xn ; y1 , . . . , yn )
y1 ,...,yn
= ga (x) +
= ga (x) +
yWi
yWi
(yWi )ri
= ga (x) + f (xZi ; r)
since yWi (t + 1) is only inuenced by a subset XZi (t) of X(t).
ga (x) +
Pa (x, y)(y)r (x)r
y
can be dealt with eciently when we consider factored MDPs with factored cost-to-go function approxima
tions. For simplicity, let us denote each state-action pair (x, a) by a vector-valued variable t = (t1 , t2 , . . . , tm ) =
(x1 , . . . , xn , a1 , . . . , ap ). Then we are interested in dealing with a set of constraints
(2)
fi (tWi , r) 0, t.
i
The main diculty is that t can take on an unmanageably large number of values as many as the number of
state-action pairs in the system. We will show that these constraints can be replaced by a smaller, equivalent
subset. Moreover, we will show identifying a violated constraint can be done eciently, which allows for
using constraint generation schemes.
The rst step is to rewrite (2) as
max
fi (tWi , r) 0.
t
Consider solving the maximization problem above for a xed value of r. The naive approach is to enumerate
all possible values of t and take the one leading to maximum value of the objective. However, since each of
the terms fi (tWi , r) depends only on a subset tWi of the components of t, the problem can be solved more
eciently via variable elimination. We illustrate the procedure through the following example.
Example 2 Consider
max f1 (t1 , t2 ) + f2 (t2 , t3 ) + f3 (t2 , t4 ) + f4 (t3 , t4 ).
For simplicity, assume that ti {0, 1}, for i = 1, 2, 3, 4. If we solve the optimization problem above by
enumerating all possible solutions, there are on the order of O(24 ) operations. Consider optimizing over one
variable at a time, as follows:
1. Eliminate variable t2 : For each possible value of t2 ,t3 , we nd
e1 (t2 , t3 ) = max f3 (t2 , t4 ) + f4 (t3 , t4 ),
t4
t1 ,t2 ,t3
The previous example suggests variable elimination as an ecient approach to verifying whether a can
didate solution r is feasible for all constraints, and identifying a violated constraint if that is not the case.
Therefore constraint generation can be implemented eciently when we consider factored MDPs with fac
tored cost-to-go function approximations. Moreover, the procedure described in the example can also be
used to generate a smaller set of constraints, if we introduce new variables in the LP. Indeed, let ei (tZi ) be
each of the functions involved in the scheme (including the original functions f ). Each function is given by
ei (tZi ) = max
ek (tZi , tji ).
t ji
kKi
i
For each function ei , we introduce a set of variables utei , where each ti corresponds to one possible assignment
to variables tZi ; for instance, in the example above, we would have variables
01
10
11
u00
e1 , ue1 , ue1 , ue1 ,
associated with function e1 (t2 , t3 ) and all possible assignments for variables t2 and t3 .
With this new denition, the original constraints can be replaced with
i
utei
kKi
References
[1] M. Grotschel and O. Holland. Solution of large-scale symmetric travelling salesman problems. Mathe
matical Programming, 51:141202, 1991.
[2] J.R. Morrison and P.R. Kumar. New linear program performance bounds for queueing networks. Journal
of Optimization Theory and Applications, 100(3):575597, 1999.
April 21
Handout #24
Lecture Note 20
So far, we have focused on nding an optimal or good policy indirectly, by solving Bellmans equation either
exactly or approximately. In this lecture, we will consider algorithms that search for a good policy directly.
We will focus on averagecost problems. Recall that one approach to nding an averagecost optimal
policy is to solve Bellmans equation
e + h = T h.
Under certain technical conditions, ensuring that the optimal average cost is the same regardless of the initial
state in the system, it can be shown that Bellmans equation has a solution ( , h ), corresponding to the
optimal average cost, and h is the dierential cost function, from which an optimal policy can be derived.
An alternative to solving Bellmans equation is to consider searching over the space of policies directly, i.e.,
solving the problem
min (u),
(1)
uU
where (u) is the average cost associated with policy u and U is the set of all admissible policies. In the
past, we have been most focused on policies that are stationary and deterministic; in other words, if S is
the state space and A is the action space (consider a common action space across states, for simplicity), we
have considered the set of policies u : S A, which prescribe an action u(x) for each state x. Note that, if
U is the set of all deterministic and stationary policies, we have |U | = |A||S | , so that problem (1) involves
optimization over a nite and exponentially large set (in fact, |U | grows exponentially in the size of the state
space, or doubleexponentially in the dimension of the state space!).
In order to make searching directly in the policy space tractable, we are going to consider restricting the
set of policies U in (1). Specically, we are going to let U be a set of parameterized policies:
U = {u : K },
where each policy u corresponds to a randomized and stationary policy, i.e., u (x, a) gives the probability
of taking action a given that the state is x. We let g , P and () denote the stage costs and transition
probability matrix associated with policy u :
g (x) =
ga (x)u (x, a)
a
P (x, y)
for this problem could compare Rt with a certain threshold i and only accept a new request of type i if
Rt i .
American Options
Consider the problem of when to exercise the option to buy a certain stock at a prespecied price K. A
possible threshold policy is to exercise the option (i.e., buy the stock) if the market price at time t, Pt , is
larger that a threshold t .
Once we restrict attention to the class of policies parameterized by a real vector, problem (1) becomes a
standard nonlinear optimization problem:
min ().
(2)
k
With appropriate smoothness conditions, we can nd a local optimum of (2) by doing gradient descent:
k+1 = k k (k ).
(3)
In the next few lectures, we will show that biased and unbiased estimates of the gradient () can be
computed from system trajectories, giving rise to simulationbased gradient descent methods.
1.1
We rst introduce assumptions that ensure the existence and dierentiability of ().
Assumption 1 Let = {P | k } and be the closure of . The Markov Chain associated with P is
irreducible and there exists x that is recurrent for every P .
Assumption 2 P (x, y) and g (x) are bounded, twice dierentiable, with bounded rst and second deriva
tives.
Lemma 1 Under Assumption 1 and 2, for every policy there is a unique stationary distribution satis
fying T = T P , T e = 1, and () = T g . Moreover, () and are dierentiable.
In order to develop a simulationbased method for generating estimates of the gradient, we will show
that () can be written as the expected value of certain functions of pairs of states (x, y), where (x, y) is
distributed according to the stationary distribution (x)P (x, y) of pairs of consecutive states.
First observe that
() = T g + T g
(4)
It is clear that the secibd term T g can be estimated via simulation; in particular, we know that
T 1
1
g (xt ).
T T
t=0
T g = lim
Hence if we run the system with policy , and generate a suciently long trajectory x1 , x2 , . . . , xT , we can
set
T
1
g (xt ).
(x) g (x)
T t=1
x
The key insight is that the rst term in (4), T g , can also also be estimated via simulation. In order to
show that, we start with the following theorem.
2
(5)
(6)
(7)
2
It is still not clear how to easily compute (5) from the system trajectory. Note that
T 1
1
(
P (xt , y)h (y)),
T T
t=0 y
T P h = lim
which suggests averaging f (x) = y P (x, y)h (y) over a system trajectory x0 , x1 , . . . , xT , however this
gives rise to two diculties: rst, we must perform a summation over y in each step, which may involve
a large number of operations; second, we do not know h (y). We can get around the rst diculty by
employing an artifact known as the likelihood ratio method. Indeed, let
L (x, y) =
P (x, y)
.
P (x, y)
T P h =
(x)
P (x, y)h (y)
x
and assuming that we can compute or estimate h , we can estimate (5) from a trajectory x0 , x1 , xT by
considering
T 1
1
T P h
L (xt , xt+1 )h (xt+1 ).
T t=0
Our last step will be to show that we can get unbiased estimates of h by looking at cycles in the system
trajectory between visits to the recurrent state x . This follows from the following observation, which was
proved in Problem Set 2:
Theorem 2 Amazing Fact 2
Let x be a recurrent state under policy . Let
T = min{t > 0 : xt = x }
Then
h (x)
T 1
(g (xt ) ())|x0 = x
t=0
h (x )
tm+1 1
m , n = tm + 1, . . . , tm+1 1,
g (xt )
t=n
and
m) =
F (
tm+1 1
n=tm
m ) gives a biased estimate of (), where the bias is on the order of O(|()
m |):
Then F (
Theorem 3
= E (T )() + G() ()
E Fm ()
where G() is a bounded function.
We can update the policy by letting
m+1
m+1
m)
= m m Fm (
tm+1 1
m + m
m
gm (xn )
=
n=tm
where > 0.
Assumption 4
m=1
m = and
m=1
2
m
< .
w.p. 1.
April 26
Handout #25
Lecture Note 21
1.1
Recall that we are interested in nding K such that () = 0, i,e, the policy parameterized policy u
corresponds to a local average cost minimum among the class of parameterized policies under consideration.
In the previous lecture, we proposed an algorithm for performing gradient descent based on system trajec
tories. We assume that there is a state x that is recurrent under all policies u . The algorithm generates a
series of policies 1 , 2 , . . . , which are updated whenever the system visits state x . The algorithm is given
as follows:
1. Let 0 be the initial policy. Assume (for simplicity) that the initial state is x0 = x . Let m = 0,
tm = 0.
2. Generate a trajectory xtm +1 , xtm +2 , . . . , xtm+1 according to policy um , where
tm+1 = inf{t > tm : xt = x }.
3. Let
h (xn )
tm+1 1
m , n = tm + 1, . . . , tm+1 1
g (xt )
t=n
m)
F (
tm+1 1
n=tm
m+1
m+1
m)
= m m Fm (
tm+1 1
m + m
m
gm (xn )
=
n=tm
Assumption 4
m=1
m = and
m=1
2
m
< .
w.p. 1.
The proof for the stochastic algorithm follows a similar argument. It turns out that neither the ODE or
Lyapunov function approaches apply directly, and a customized, lengthy argument must be developed. The
full proof can be found in [1].
For the convergence of t , we discuss two cases:
(1) 0 (0 )
In this case, we rst argue that t (t ). Indeed, suppose 0 = (t0 ) for some t0 . Then either
(t0 ) = 0, and the ODE reaches an equilibrium, or (t0 ) < 0 and t0 = 0. We conclude that
t (t ), t.
From the above discussion, we conclude that t is nonincreasing and bounded. Therefore, t converges.
(2) 0 < (0 )
We have two possible situations:
(i) t < (t ), t
(ii) t0 = (t0 ) for some t0 In this case, we are back to case (1).
We conclude that t converges, and thus ((t ) t ) 0. Therefore, t (t ) asymptotically, and
(t ) 0.
2
1.2
We now develop a version of gradient descent where the policy is updated in every time step, rather than
only at visits to state x . The algorithm has the advantage of being simpler and potentially faster.
First note that Fm can be computed incrementally between visits to state x :
Fm (, )
tm+1 1
n=tm
tm+1 1
= g (x ) +
n=tm +1
tm+1 1
= g (x ) +
tn+1 1
n=tm +1
tm+1 1
= g (x ) +
L (xk1 , xk ) + g (xn )
g (xk )
k=n
g (xn ) +
n=tm +1
tm+1 1
= g (x ) +
tm +1
zn
g (xn ) + g (xn )
n=tm +1
where
zn =
k=tm +1
k )zk
k+1 = k k g (xk ) + (gk (xk )
0
if xk+1 = x
zk+1 =
zk + L (xk , xk+1 ) otherwise
Assumption 5 Let P = {P : k } and P be the closure of P. Then there exists N0 such that,
(P1 , P2 , . . . , PN0 ), Pi P , x,
N0
n
Pl (x, x ) > 0.
n=1 l=1
Assumption 6
k = ,
k2 < , k k1 , and
n+t
k=n (n
w.p.1.
The idea behind the proof of Theorem 2 is that, due to the assumptions on the step sizes k (Assumption
6, eventually changes in the policy m made between two consecutive visits to state x are negligible, and
the algorithm behaves very similarly to the oine version. Assumption 5 is required in order to guarantee
that the time between visits to state x remains small, even as the policy is not stationary.
1.3
In both the oine and online unbiased gradient descent algorithms, the variance of the estimates depends
on the variance of the times between visits to state x , which can be very large depending on the system.
=
().
The decrease in variance is traded against a potential bias in the estimate, i.e., we have E()
Note that a small amount of bias may still be acceptable since it should suce to have estimates that have
positive inner product with the true gradient, in order for the algorithm to converge:
E(),
() > 0.
t
J, (x) = E
g(xt )|x0 = x
t=0
Then we have
Theorem 3
() = (1 )T J, + T P J,
()
= T P, we have
T = T P + T P .
Hence,
()
= T J, T P J,
= T J, T J, + T P J,
=
(1 )T J, + T P J,
2
The following theorem shows that () can be used as an approximation to (), if is reasonably
close to one.
Theorem 4 Let () = T P J, . Then
lim () = ()
Proof: We have
J, =
()
e + h + O(|1 |).
1
Therefore,
(1
)T J,
=
=
()
(1
e + h + O(|1 |)
1
()
(1 )T
e + (1 )T (h + O(|1 |))
1
T
)
0 as 1
= ()T e + O(|1 |)
But T e = 1, we have T e = 0. Therefore,
(1 )T J, = 0 + O(|1 |) 0 as 1.
2
V ar J, = O
and
1
(1 )2
1
(g(xk )zk+1 k )
k+1
= zk + L (xk , xk+1 )
= k +
Then it can be shown that k (), if the policy is held xed. The gradient estimate k can be used
for updating the policy in an oine or online fashion, just as with the unbiased gradient descent algorithms.
Assumption 7
2. |g(xk )| B, x
3. |L (x, y)| B, x, y
Theorem 5 Under Assumption 7, we have
lim k (),
w.p.1.
References
[1] P. Marbach and J.N. Tsitsiklis. Simulationbased optimization of Markov reward processes. IEEE
Transactions on Automatic Control, 46(2):191209, 2001.
Ses #3
Handout #3
Problem Set 1
1. Suppose that an investor wants to maximize its expected wealth E[wT ] at time stage T. There are two
investment options available: a xedinterest savings account with interest rate of 3% in each time stage,
or a stock whose price pt uctuates according to P (pt+1 = (1 + ru )pt ) = P (pt+1 = (1 rd )pt ) = 0.5,
where 0 rd 1 and ru 0. The investor must decide at each time stage which fraction of its current
wealth to invest in each option.
(a) Suppose that there are no transaction costs involved in buying or selling stock. What is the optimal
policy? Suppose that, instead of maximizing E[wT ], the investor wants to maximize E[log wT ].
What is the optimal policy? Do you expect the investor to be more or less conservative in this
case?
(b) Suppose that buying or selling stock involves a transaction cost of 0.5% of the transaction value.
Formulate this problem as an MDP, and write its Bellmans equation.
(c) Solve the problem numerically with T = 20, rd = 0.9 and ru = 1.2, for both the case of maximizing
E[wT ] and maximizing E[log wT ]. Solve it again with rd = 0.7 and ru = 1.4. Analyze your results.
2. Let M1 , M2 , . . . , Mn be matrices with Mi having ri1 rows and ri columns for i = 1, 2, . . . , n and some
positive integers r0 , r1 , . . . , rn . The problem is to choose the order for multiplying the matrices that
minimizes the number of multiplications needed to compute the product M1 M2 . . . Mn . Assume that
matrices are multiplied in the usual way.
(a) Formulate the problem as an MDP such that using backwards induction for nding the optimal
order in which to multiply the matrices requires O(n3 ) operations.
(b) Find the optimal order in which to multiply the matrices when n = 4 and (r0 , r1 , r2 , r3 , r4 ) =
(10, 30, 70, 2, 100).
(c) Using the same numerical values of part (b), solve the problem where the objective is instead to
maximize the number of multiplications. What is the ratio of the maximum to minimum number
of multiplications?
3. Show that GaussSeidel value iteration still converges to J if states are chosen in an arbitrary order,
as long as each state is visited innitely many times.
4. Let c (0, 1]|S| satisfy
c(x) = 1.
xS
t Put .
t=0
(b) Let u = (1 )cT (I Pu )1 . Show that u is a probability distribution over S and u (x) > 0
for all x.
1
(x)|J(x)|.
xS
k
T J0 J0 .
1
(g) In class, we have proved a bound on JuJk J . Compare the guarantees oered by the
algorithm when the stopping criterion is JuJk J 1,c versus JuJk J . Which
one oers stronger guarantees? In which case does the algorithm stop rst? Can you think of
situations where it may make sense to use the rst criterion?
Ses #6
Handout #8
Problem Set 2
1. Consider an MDP with a goal state x
, and suppose that for every other state y and policy u, there is
k
) > 0. We will analyze the problem known as stochastic shortest path,
k {1, 2, . . . } such that Pu (y, x
, given that
dened as follows. For every state x, let T (x) denote the rst time stage t such that xt = x
x0 = x. Then the objective is to choose a policy u that minimizes
T (x)
E
gu (xt )|x0 = x .
t=0
(a) Dene a costtogo function for this problem and write the corresponding Bellmans equation.
(b) Show that Pu (T (x) > t) t/|S| , for some 0 < < 1 and all x.
(c) Show that Bellmans equation has a unique solution corresponding to the optimal costtogo
function and leads to an optimal policy.
(d) Consider nding the policy that minimizes the average cost in this MDP, and assume that we have
x) = 0. Show that h may be interpreted as
chosen the dierential cost function h such that h (
the costtogo function in a stochastic shortest path problem.
(e) Dene operators Tu and T for the stochastic shortest path problem. Show that they satisfy the
monotonicity property. Is the oset property satised? Why?
(f) (bonus) Dene the weighted maximum norm to be given by
J, = max (x)|J(x)|,
x
for some positive vector . Show that Tu and T are contractions with respect to some weighted
maximum norm contraction.
2. Consider the problem of controlling the service rate in a single queue, based on the current queue
length x. In any time stage, at most one of two types of events may happen: a new job arrives at the
queue with probability , or a job departs the queue with probability 1 + a2 , where a {0, 1} is the
current action. At each time stage, a cost ga (x) = (1 + a)x is incurred. The objective is to choose a
policy so as to minimize the average cost.
(a) Model this problem as an MDP. Write Bellmans equation and dene the operator T .
(b) Show that the dierential cost function h is convex.
(c) Show that the optimal policy is of the form u (x) = 0 if and only if x x
for some x.
(d) Take the dierential cost function h such that h (0) = 0. Show that there is such that
x2 h 1 x2 .
3. Consider an MDP with two states 0 and 1. Upon entering state 0 the system stays there permanently
at no cost. In state 1 there is a choice of staying there at no cost or moving to state 0 at cost 1. Show
that every policy is average cost optimal, but the only stationary policy that is Blackwell optimal is
the one that keeps the system in the state it currently is.
1
T (
x)1
(x) = E
(xt = x)|x0 = x
t=0
E[T
(
t=0
lim
Ses #11
Handout #13
Problem Set 3
1. Give an example where Qlearning is implemented with greedy policies (i.e., ut = mina Qt (xt , a)) and
fails to converge. How can it be modied so that convergence is ensured?
2. Suppose operator T is a contraction with respect to 2 . Does GaussSeidel value iteration converge?
2 for all J, J and there is a unique J such that
2 J J
3. Suppose operator F satises F J F J
J = F J .
(a) Let G J = (1 )J + F J. Show that there is (0, 1) such that G J J 2 < J J 2 .
(b) Consider Jt = F Jt Jt . Show that Jt converges to J .
J J
for all J, J and there is a unique J
4. (bonus) Suppose operator F satises F J F J
such that J = F J . Consider Jt = F Jt Jt . Show that Jt converges to J
Ses #16
Handout #20
Problem Set 4
1. Consider an MDP where actions A are vectors (A1 , . . . , An ) An , for some set A. Therefore in each
time stage the number of actions to be considered is exponential in the number n of action variables.
Show that this MDP can be converted into an equivalent one with A actions in each time stage but
a larger state space. (This problem shows that complexity in the action space can be traded for
complexity in the state space, which is addressed by value function approximation methods.)
2. Show that the VC dimension of the class of rectangles in d is 2d.
3. Another value function approximation algorithm based on temporal dierences is called least squares
policy evaluation (LSPE). We successively approximate the costtogo function J by J rk , k =
1, 2, . . . . Recall that (x) is the row vector whose ith entry corresponds to i (x). Dene the temporal
dierence relative to approximation rk :
dk (x, y) = g(x) + (y)rk (x)rk .
Then LSPE updates rk based on
rk
argmin
r
rk+1
m=0
2
lm
()
dk (xl , xl+1 )
l=m
= rk + (
rk rk ).
(xm )(xm ) ,
m=0
Ak
m=0
bk
zm
=
=
zm g(xk ),
m=0
()ml (xl ).
l=0
lim Ebk
lim EBk
= A = D(P I)
(P )m ,
m=0
= b = T D
m=0
= B = T D.
(P )m g,
Ses #20
Handout #24
Problem Set 5
1. Consider the average-cost LP:
max,h
s.t.
e + h T h,
where T h = minu gu + Pu h.
(a) Suppose that there is a unique optimal policy u , with a single class of recurrent states R. Show
where is the optimal average cost and
that the optimal solution of the LP is given by ( , h),
h(x)
= h (x) for all x R.
to the LP such that at
(b) Provide an example of an MDP such that there is an optimal solution h
denote its average cost, and h denote its stationary state distribution. Show that
h = hT (T h h ) T h h 1,h .
3. Let h be such that
T h h + e,
cT (T h h e) cT (h r) +
1 T
c h , 0.
maxr
T r r + e.
s.t.
Show that
2cT v
1 T
min h r,1/v +
c h .
r
1
(d) Suppose that v = e for some v. Let R() denote the set of optimal solutions to
cT (T
r
r e)
cT r
maxr
s.t.
T r r + e.
be arbitrary. Show that if ur is a greedy policy with respect to r, for some r R(),
Let and
Decentralized Strategies
for the assignment
problem
Hariharan Lakshmanan
Dynamic networks
Changing network topology example
wireless sensor networks.
Change is usually undirected
Sometimes changes need to be directed
example Mobile robots for search and
rescue operations
Related work
Chang et.al. applied a reinforcement
learning approach to learn node movement
policy to optimize long-term system
routing performance
Goldenberg et.al proposed a network
mobility control model for improving
system communication performance
Example continued
One source and two receivers
s
Decentralized assignment
problem
Initial configuration
s
Problem formulation
min max xij d ij i
j
Methodology
Simulator written Currently does not
communicate with neighbors
Uses Dynamic programming to solve local
assignment problems
Results
Converges to a feasible solution for the
limited problems tested so far.
Performance depends on the initial
configuration
Example
The green circles indicate destination
points and the blue circles represent nodes
1
Example continued
Example continued
Example continued
Re-Solving the assignment problem
periodically led to convergence
1
2
References
Y. Chang, T. H., L. P. Kaelbling (2003).
Mobilized ad-hoc networks: A reinforcement
learning approach, MIT AI Laboratory
D. Goldberg, J. L., A.S. Morse, B.E.Rosen, Y.R.
Yang (December, 2003). Towards mobility as a
network control primitive, Yale University
Nelson Uhan
Introduction
scheduling problems using techniques from combinatorial optimization. Bertnon [BC99] focus on a dierent class of stochastic scheduling
sekas and Casta
problems, the quiz problem and its variations. They show how rollout algorithms can be implemented eciently, and present experimental evidence that
the performance of rollout policies is near-optimal.
For this project, we consider a problem with one machine and an arbitrary
normalized regular and additive objective function. We recast our nite-horizon
decision problem into a stochastic shortest path problem. We show that for a
relaxed formulation of our problem, the error of the solution obtained by the approximate linear programming approach to dynamic programming as presented
in de Farias and Van Roy [dV03] is uniformly bounded over the number of jobs
that need to be scheduled, provided that the expected job processing times are
nite. Finally, we argue using results from dynamic programming that the approximate solution for the relaxed formulation of our problem is also not that
far away from the optimal solution of the original problem.
The problem
1 n1
where C (i) denotes the time of the ith job completion (C (0) = 0), Ri is the set
of jobs remaining to be processed at the time of the ith job completion, and h
is a set function such that h () = 0. Such an objective function is said to be
additive. The function h can be interpreted as the holding cost per unit time
on the set of uncompleted jobs. We also assume that is nondecreasing in the
job completion times.
Many common objective functions
in scheduling are nondecreasing and ad
ditive. For example, h (S) = jS wj for all S N generates the normalized
2.1
We can formulate this problem as a nite horizon MDP with a nite state space
S. For each state x S, there is a set of available actions Ax . Taking action
2
aAx
2.2
yS
Since our state space is exponential in size, we cannot hope to solve our problem
exactly using dynamic programming methods. All of the major methods for approximate dynamic programming consider innite-horizon discounted cost problems. In order to use these methods for our nite-horizon stochastic scheduling
problem, we recast our problem into a stochastic shortest path (SSP) problem. We refer to the following formulation of our problem as the original SSP
formulation.
We introduce a terminating state x
with the following property: only the
in one step.
states that have no more remaining jobs (Rx = ) can reach x
, the
Observe that for all states such that Rx = and at the terminating state x
set of actions available Ax is empty, and therefore the transition probabilities
and time stage costs involving x
are not aected by what action is taken. The
transition probabilities involving x
are
1 x such that Rx =
Pa (x, x)
= 1 if x = x
0 otherwise
3
0 x such that Rx =
ga (x, x) =
0 if x = x.
T (x)
J (x) = min E
gu (xt , xt+1 ) x0 = x .
u
t=0
where T (x) is the time stage when the system reaches the terminating state.
Recall that a stationary policy u is called a proper policy if, when using this
policy, there is positive probability that the terminating state will be reached
after a nite number of stages. Also recall that if all stationary policies are
proper, then the cost-to-go function for the SSP problem is the unique solution
to Bellmans equation [B01]. Since any policy in our scheduling problem requires
one additional job to be scheduled at each time stage, we must always reach the
terminating state with probability 1, regardless of the policy used. Therefore,
to solve our problem exactly, we can solve Bellmans equation.
3.1
maximize
subject to
Note that this optimization problem can be recast as the following linear program:
maximize
cT J
subject to
ga (x) +
yS
|
|
= 1 K ,
|
|
cT r
subject to
T r r.
(1)
This problem can be recast as a linear program just as in the ELP approach.
Given a solution r, we hopefully can obtain a good policy by using the greedy
r,
policy with respect to
r ) (y) .
Pa (x, y) (
u (x) = arg min ga (x) +
aAx
yS
de Farias and Van Roy [dV03] proved the following bound on the error of the
ALP solution.
Theorem 3.1. (de Farias and Van Roy 2003) Let r be a solution of the approximate LP (1), and (0, 1). Then, for any v RK such that (v) (x) > 0
for all x S and Hv < v,
J ,
r1,c
2cT v
min J , r,1/v
1 v r
where
(Hv) (x) = max
aAx
and
v = max
x
3.2
yS
(Hv) (x)
.
(v) (x)
T (x)
J , (x) = min E
t gu (xt , xt+1 ) x0 = x .
u
t=0
For this relaxation, we show that the error of the approximate linear programming solution is uniformly bounded over the number of jobs to be scheduled.
5
Theorem 3.2. Assume that the holding cost h (S) is bounded above by M for
all subsets S of N . Let r be the ALP solution to the -relaxed SSP formulation
of the stochastic scheduling problem. For (0, 1),
J ,
r1,c
2M maxiN E [pi ]
.
1
Proof. Due to the nature of our problem, we know that for any state x, T (x) =
= 0 for all x such that Rx = , and ga (
x, x)
=
|Rx | + 1. However, since ga (x, x)
0, when computing the cost-to-go function at state x, we only need to consider
time stage costs for time stages 0 through |Rx | 1. Since the holding cost h is
bounded by above by M for all subsets of N , for some policy u we have
|Rx |1
J , (x) = min E
t gu(xt ) (xt , xt+1 ) x0 = x
t=0
|Rx |1
E
t gu(xt ) (xt , xt+1 ) x0 = x
t=0
|Rx |1
E
gu(xt ) (xt , xt+1 ) x0 = x
t=0
|Rx |1
1
t=0
|Rx |1
1
=E
h (Rxt ) pu(xt ) x0 = x
n
t=0
|Rx |1
1
E
M pu(xt ) x0 = x
n
t=0
M
=
E [pi ]
n
iRx
yS
= k max
aAx
yS
1
Pa (x, y)
E [pi ]
n
iRy
k
k
=
E [pi ] E [pa ]
n
n
iRx
E [pa ]
= V (x)
iRx E [pi ]
< V (x)
|J , (x)|
xS
V (x)
M
iRx pi
n E
max k
xS
iRx pi
n
M
=
k
= max
k
c (x) V (x) =
c (x)
E [pi ]
n
xS
xS
iRx
1
=k
c (x)
E [pi ]
n
xS
iRx
k
c (x) max E [pi ]
xS
iN
= k max E [pi ] .
iN
This is uniformly bounded over the number of jobs, provided that the expected
processing
times of the jobs are bounded. Therefore, by Theorem 3.1, provided
J ,
r1,c
3.3
E [pa ]
.
(HV ) (x) = V (x) 1
iRx E [pi ]
As the number of jobs n , the ratio of the expected processing time of any
one job to the sum of the expected processing times of all jobs goes to zero.
Therefore, by using the methods from the proof of Theorem 3.2, we cannot nd
a < 1 such that HV V regardless of the number of jobs.
However, since the ALP solution for the -relaxed formulation is not that far
away from the optimal solution to the -relaxed formulation, we can show that
the ALP solution obtained for the -relaxed SSP formulation is also not that
far away from the optimal solution of the original SSP formulation, depending
on the value of . First, we briey present a well-known result from dynamic
programming that relates the costs of the innite-horizon discounted cost and
average-cost problems.
Lemma 3.1. For any stationary policy u and (0, 1), we have
J,u =
where
J,u =
Ju
+ hu + O (|1 |)
1
k Puk gu ,
Ju =
k=0
hu is a vector satisfying
N 1
1 k
lim
Pu
N N
k=0
gu ,
Ju + hu = gu + Pu hu ,
and O (|1 |) is an -dependent vector such that
lim O (|1 |) = 0.
|J J
= O (|1 |)
| O (|1 |)
where J is the cost of using the optimal greedy policy for the -relaxed SSP
formulation in the original SSP formulation (hu in Lemma 3.1). Since the costto-go functions for the original and -relaxed SSP formulations are not that far
8
apart when is large, the ALP solution for the -relaxed SSP formulation is
not that far away from the optimal solution to the original SSP formulation
when is large.
Corollary 3.1. Let r be the optimal solution to the ALP for the -relaxed SSP
formulation. Then for (0, 1),
J
r1,c
2M maxiN E [pi ]
cT O (|1 |) .
1
Proof. The result follows immediately from Theorem 3.2 and the arguments
above:
J
r1,c J J , 1,c + J ,
r1,c
2H maxiN E [pi ]
1
2H maxiN E [pi ]
cT O (|1 |)
cT |J J , | +
Conclusion
for our stochastic scheduling problem. It would be nice to see how the ALP
approach performs in practice, and whether or not in reality ALP solves the
original SSP formulation (with = 1) poorly as the size of the problem grows.
Finally, the ALP method relies on solving a linear program with a constraint
for every state-action pair, which in our problem would result in an extremely
large optimization problem. One might investigate how the constraint sampling
LP approach to dynamic programming [dV01] performs with the stochastic
scheduling problem we studied, both theoretically and practically.
References
[B01]
[BC99]
[d04]
[dV03]
de Farias, D. P., B. Van Roy (2003). The linear programming approach to approximate dynamic programming. Operations Research
51, pp. 850-865.
[dV01]
Rothkopf, M. H. (1966). Scheduling with random service times. Management Science 12, pp. 703-713.
[U96]
Uetz, M. (1996). Algorithms for deterministic and stochastic schedulur Mathematik, Technische Univering. Ph.D. dissertation, Institut f
at Berlin, Germany.
sit
10
Scheduling
2.997 Project
Outline
Scheduling Problems
Given a set of tasks and limited resources, we need to eciently use the
resources so that a certain performance measure is optimized
Scheduling is everywhere:
computer networks, etc.
manufacturing,
project management,
Stochastic Scheduling
Problem Denition
Set of jobs N = {1, . . . , n}
1 machine
Processing time of job i: discrete probability distribution pi
pi and pj pairwise stochastically independent for all i = j
The jobs have to be scheduled nonpreemptively
Objective: minimize
C (1), . . . , C (n)
n1
1
=
h (Ri) C (i+1) C (i)
n i=0
6
C (1), . . . , C (n)
n1
1
=
h (Ri) C (i+1) C (i)
n i=0
MDP formulation
Finite horizon, nite state space
State of the system:
x = (Cmax (x) , Rx) S
Cmax (x) is the completion time of the last job completed
Rx is the set of jobs remaining to be scheduled at state x
Note that the size of the state space is exponential in the number of
jobs.
Action at state x is the next job to be processed: a Ax Rx
8
pa (t)
Pa (x, y) =
0
aAx
yS
, t = 0, 1, . . . , n 1
10
1
Pa (x, x)
= 1
0
Time-stage costs involving x:
0
ga (x, x
) =
0
x:
x such that Rx =
if x = x
otherwise
x such that Rx =
if x = x.
11
T (x)
J (x) = min E
gu (xt, xt+1) x0 = x .
u
t=0
T (x) = time stage when the system reaches the terminating state
with probability 1
The cost-to-go function for the SSP problem is the unique solution
to Bellmans equation
12
cT J
subject to
TJ J
(c > 0)
maximize
cT r
subject to
T r r
(c > 0)
Theorem 1. (de Farias and Van Roy 2003) Let r be a solution of the
approximate LP. Then, for any v RK such that (v) (x) > 0 for all x S
and Hv < v,
2cT v
J
r 1,c
min J r
,1/v
1 v r
where
(Hv) (x) = max
and
v
aAx
yS
(Hv) (x)
= max
.
x
(v) (x)
14
T (x)
,
t
J (x) = min E
gu (xt, xt+1) x0 = x
u
t=0
For this relaxation, we show that the error of the ALP solution is
uniformly bounded over the number of jobs to be scheduled.
15
Main Result
Theorem 2. Assume that the holding cost h (S) M for all subsets S
of N . Let r be the ALP solution to the -relaxed SSP formulation of the
stochastic scheduling problem. For (0, 1),
J ,
r1,c
2M maxiN E [pi]
.
1
16
Outline of Proof
The cost-to-go function is
T (x)
,
t
J (x) = min E
gu(xt) (xt, xt+1) x0 = x
u
t=0
where
1
g (xt, xt+1) = h (Rx) (Cmax (xt+1) Cmax (xt))
n
Recall h is bounded from above by M . After some algebraic manipulation,
this quantity is found to be
M
E [pi]
n
iRx
17
k
V (x) =
E [pi]
n
iRx
V is a Lyapunov function
< 1 independent of n such that HV V
Also,
min J , r,1/V
r
M
k
18
xS
iN
2M maxiN E [pi]
1
19
ALP approach has an error bound for our relaxed stochastic scheduling
problem that does not grow with the number of jobs to be scheduled
20
Questions?
21
Abstract. The linear programming approach to approximate dynamic programming was introduced in [1]. Whereas the state relevance weight (i.e. the
cost vector) of the linear program does not matter for exact dynamic programming, it is not the case for approximate dynamic programming. In this paper,
we address the issue of selecting an appropriate state relevant weight in the
case of approximate dynamic programming. In particular, we want to choose
c so that there is a practical control of the approximate policy performance by
the capability of the approximation architecture. We present here some theoretical results and more practical guidelines to select a good state relevance
vector.
1. Introduction
The linear programming approach to approximate dynamic programming was
introduced in [1], and it is reviewed quickly in Section 2. Whereas the state relevance weight (i.e. the cost vector) of the linear program does not matter for exact
dynamic programming, it is not the case for approximate dynamic programming.
There are no guidelines in the literature to select an appropriate state relevant
weight in the case of approximate dynamic programming. In Section 3, we propose
to use available performance bounds on the suboptimal policy based on the approximate linear program to build a criterion for choosing the state relevance weight c.
We characterize appropriate state relevance weights as solutions of an optimization
problem (P). However, (P) cannot be solved easily so that we look for suboptimal
solutions in Section 4, in particular we prove in Section 5 that under some technical assumptions we can choose c as a probability distribution. Finally, we establish
some practical necessary conditions to choose c; one of them suggesting to reinforce
the linear program for approximate dynamic programming.
1.1. Finite Markov decision process framework. In this paper, we consider
nite Markov decision process (MDP): they have a nite state space S and a nite
control space U (x) for each state x in S. Let gu (x) be the expected immediate
cost of applying control u in state x. Pu (x, y) denotes the transition probability
from state x to state y under control u U (x). The objective
of the controller is
t
to minimize the -discounted cost E
t0 gu(t) (xt )|x0 .
First, observe that it is possible to transform any nite Markov decision process
with nitely many controls in another one where the immediate cost of an action is
the same for all actions. Indeed, consider the MDP comprising the original MDP
states plus one state for each state-action pair. In this MDP, the controller rst
chooses a control and the system moves in the corresponding state-action pair.
Date: May 15, 2004.
1
From there, the system incurs the cost corresponding to the state-action pair and
follows the original dynamics to the next state. Figure 1 provides a simple example
of the transformation of an MDP into another one with same immediate cost at
each state.
ga
x
x, a
y1
b
gb
ga
y1
a
b
y2
x, b
gb
y2
k+
min cT J
JR|S|
J(x) g(x) +
yS
Unfortunately, the linear program (LP) is often enormous with as many variables
as the cardinality of the state space S and as many constraints as there are stateaction pairs. Hence, (LP) is often intractable. Moreover, even storing a cost-to-go
vector J as a lookup table might not be amenable for large state space.
One approach to deal with this curse of dimensionality is to approximate J (x)
(x)r, where r Rm (usually m |S|) and (x) = (1 (x), . . . , m (x)) are given
feature vectors.
HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?
3
min cT r
rRm
(x)r g(x) +
yS
Notice that (ALP) has only m variables, but still as many constraints as (LP).
Hence, some large scale optimization technics are needed to solve (ALP), or alternatively [3] showed that constraints sampling is a viable approach to solve this
problem.
On the contrary to the case of the exact linear program (LP), there is no
guarantee that the optimal solution r(c) is independent of the choice of c > 0. The
objective of this paper is to provide a methodology to choose c, but to motivate
our criterion to select c, we rst need to introduce two performance bounds.
3. Two performance bounds
3.1. A general performance bound.
First, let us relate J to the cost-to-go of the policy, which is greedy with respect
to J, where J is any approximation of J .
Let J R|S | and uJ be the greedy policy with respect to J, i.e.
uJ (x) = argminuU (x) gu (x) +
Pu (x, y)J(y) .
yS
r(c)1,,ur(c) with J
the ALP policy by the architecture capability to approximate J . Hence, we would
like to nd a state relevance weight c > 0 that makes this guarantee as sharp as
possible.
In other words, the state relevance weight should be chosen so that
(P ) :
min cT v
c>0
T
,u
r(c)
:= (1 ) T (I Pur(c) )1 cT ,
(3.3)
Jur(c) J 1,
2cT v
J r ,1/v ,
1 v
by combining the bounds (3.1) and (3.2), and (P) tries to make the factor of the
right-hand side as small as possible.
Recall that r(c) depends on c so that we have a circular dependence between c
and r(c).
,ur(c)
r(c)
ur(c)
xT A1 = 1 xT
1
If T = T Pur(c) , then T = 1
T (I Pur(c) ). Hence, T = (1 ) T (I
Pur(c) )1 cT , and the optimal solution of (P) is c = .
In the following section, we give derive some simple feasible points but their
performance with respect to the objective of (P) can be very poor. Then in Section
5, we try to obtain better feasible points of (P), namely probability distributions.
HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?
5
21T v
J r ,1/v
1 v
2cT v
J r ,1/v .
1 v
If the state relevance weight c of the ALP could be chosen such that
(5.1)
cT = ,ur(c) := (1 ) T (I Pur(c) )1 ,
(3.3) would hold. Indeed, c veries (5.1) if and only if c is a probability distribution
that is feasible for (P). We hope that in this case, the bound (3.3) is practical.
5.1. A theoretical algorithm.
5.1.1. A naive algorithm.
A naive algorithm is as follows.
Algorithm A
(1) Start with k = 0 and any vector c0 0 such that xS c(x) = 1.
(2) Solve (ALP) for ck and let r(ck ) be any optimal solution.
r(ck ).
(3) Compute k := ,uk , where uk := u
r(ck ) is greedy with respect to
(4) Set ck+1 = k , do k = k + 1 and go back to 2
M
r
F
U
M
c r(c) r(c) ur(c) ,ur(c) , or, in a more compact fashion: c
,ur(c) .
Denition 5.1. Dene P = {p R|S| | p(x) 0,
xS p(x) = 1} as the space of
probabilities distributions. It is a compact, convex set.
Notice that is a mapping from P in itself, where P is the set of probability
distribution over S, i.e.
Lemma 5.2. c is a xed point of c veries (5.1)
However, it is not clear that the algorithm A has a xed point, and whether ck
converges. If the mapping was continuous from P to P , Brouwers theorem would
guarantee the existence of a xed point.
M
r
F
U
However, in the chain c r(c) r(c) ur(c) ,ur(c) , some of the
functions may not be continuous so that M needs not be continuous.
The function F is just a matrix multiplication. Therefore it is continuous.
1
The function
= (1
t t M : (Pu , gu ) (u) = (1 ) (I Pu )
Pu is also continuous.
)
t
HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?
7
Lemma 5.4. Let f be a bounded piecewise constant mapping from some vector space
E to F. Let g be a continuous function
from F to R that has nite integral.
Then, the function f : x f (x + y).g(y)dy is a well dened, continuous
F
function from E to F.
r[c]] is continuous in c.
c E[
r
r[c]] can also be written c r(c+ c0 ).g(c0 )dc0 and r is a piecewise
Proof. c E[
RN
constant, bounded function. Using the previous lemma gets the result.
Smoothing U
The function U is also discontinuous, in the same fashion as the function r. We
therefore use the same trick, but in a slightly dierent way. Instead of using
deterministic greedy policies, we use randomized, -greedy policies
Denition 5.6. Let > 0. The -greedy policy with respect to J is a randomized
policy u for which the action a is chosen in state x with probability
exp[(gu (x) + Pu (x)J)/]
.
uJ (u, x) =
aU (x) exp[(ga (x) + Pa (x)J)/]
[4] provides various continuity results for the greedy policies, which we will
use.
Proposition 5.7. limsup|T J(x) T J(x)| = 0
0 J,x
Denition 5.9. The randomized function (v, ) is dened by the following chain
of functions:
(v,)
r
F
c r(c) r(c) u (r(c)) ,u (r(c)) , or, in a compact fashion:
(v,)
c ,u (r(c))
Proposition 5.10. (v, ) is continuous from P to P.
Denition 5.11. Let v and be some positive numbers. The algorithm A(v, ) is:
1) Start from some c0 in P, and set k = 0.
2) Do ck+1 = (v, ) (ck )
3) Set k = k + 1 and go to 2.
Theorem 5.12. (v, ) has at least one xed point c(v, ) P
Proof. (v, ) is a continuous function on a compact, convex set. By application of
Brouwers theorem, it has a xed point.
Remark 5.13. Saying that (v, ) has at least one xed point is equivalent to saying
that A(v, ) has at least one xed point
However, the ck produced by the algorithm may still fail to converge so that
A(v, ) does not provide the value of a xed point.
5.1.3. Existence of xed point for the original algorithm A.. In this part, we will
use the previous theorem asserting the existence of a xed point to the algorithm
that holds for all variance v > 0 and all > 0 to show that there exists a xed
point to the original algorithm A.
Theorem 5.14. For any pair (vk , k ) in R2 with (vk , k ) > 0, denote Ck the set of
xed points of the algorithm A(vk , k ), which is not empty by Theorem 5.12.
If there is a sequence (vk , k )k0 of such pairs with (vk , k ) (0, 0), such that
there is an accumulation point c of the set Ck that yields a unique optimum if used
as a state relevance vector in (ALP), then c 0 is a probability distribution that
veries
:= (1 ) T (I Pur(c) )1
cT = ,u
r(c)
(5.2)
ck = ,uk
r(ck )
== (1 ) T (I Puk
)1 .
r (ck )
(5.4)
)1
r (c)
|S |
Recall that a -greedy policy u
chooses control u in
J with respect to J R
state x with probability
uJ (u, x) =
(5.5)
k+
r (c)
)1 = (1) T (I Pur(c) )1 .
HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?
9
(1 )1 cT (I Pur(c) )v
(1 )1 cT (I Puv )v
max cT r
rRm
T r r
cT = (1 ) T (I Pur(c) )
The last constraints are hard to deal with, but we can derive more tractable necessary conditions. In particular, the next proposition shows that they imply a system
of linear equations.
Proposition 5.17. If c veries (5.1), then the following system of linear equations
holds
(5.7)
(1 ) T
r(c) cT (I Pu )R, u U (x)
Pur(c)
r(c) Pu
r(c), u U (x)
10
Hence,
r(c) Pu
r(c), u U (x)
Pur(c)
(I Pur(c) )
r(c) (I Pu )
r(c), u U (x)
r(c) (I Pu
)1 (I Pu )
r(c), u U (x)
r(c)
The last equivalence follows from (I Pur(c) )1 =
ing both sides of the last equation by (1 ) T ,
t0
(1 ) T
r(c) (1 ) T (I Pu
)1 (I Pu )
r(c), u U (x)
r(c)
,u
r (c)
max cT r
rRm
T r r
(1 ) T r cT (I Pu )r, u U (x)
Notice that the last constraint enforces the equality ,ur(c) = c only on the
subspace {(I Pu )
r(c), u U }, whereas we need this condition to hold for
r(c)1,,r(c) =
r(c), r(c) being an optimal solution of (RALP) so that J
J
J
r(c)1,c .
6. Conclusion
We presented some new results for the choice of the state relevance weight c in the
approximate linear program. The criterion to choose c hinges on two performance
bounds that control the suboptimality of the ALP policy. However, these results
remain preliminary, in particular how to tailor the state relevance weight to the
problem setting remains an open question.
7. appendix
7.1. Insights on ,u .
By denition,
(7.1)
T,u := (1 ) T (I Pu )1 = (1 )
t T Put .
t0
Hence, ,u is a geometric average of the presence probability over the state space
after t transitions under policy u starting from the distribution . When Pu irreducible, limt+ T Put = uT , where u is the steady-state distribution of Pu , i,e,
uT = uT Pu . Thus we can wonder how far is u from ,u . We show now that in
general ,u is further away from u than .
Since u is also a left eigenvalue of (1 )(I Pu )1 , we have
T,u uT = ( u )T (1 )(I Pu )1 .
HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?
11
Linear programming
Let T be the DP operator for -discounted problem:
TJ=minu g + PuJ.
By monotonicity of T, J TJ J TJ TkJ J*.
Linear programming approach to DP:
For all c>0, J* unique optimal solution of
(LP): max cTx s.t. J(x) g(x)+ Pu(x,y)J(y), "(x,u)
E | J u J ( x) J * ( x) |; x ~ = J u J J *
where ,u = (1 ) ( I Pu )
T
1,
J J*
1, ,u J
1,c
2cT v
min J * r
1 v r
,1 / v
Compare with
J u r* J *
1,
r * J *
1, ,u J
1,c
2cT v
min J * r
1 v r
,1 / v
Compare with
J u r* J *
1,
r * J *
1, ,u J
Simple bounds
We want J * r * 1, K J * r * 1,c , K > 0
T
c
2
v
to yield J * J u
min J * r
K
,1 / v
1,
1 v r
This relation follows from ,u Kc
But r* depends implicitly on c via (ALP)
1. Trivially, c:=1. But poor bound for large state space
2. Algorithm using r*(c)=r*(Kc) for any K>0.
,u
r*
r*
r *
Reinforced ALP
Would like to solve (ALP) with the additional constraint
c = ,ur* = (1 ) T ( I Pur* ) 1
T
Conclusions
Some simple bounds on the (ALP) policy but
not necessarily tight.
Theoretical algorithm to find c as a probability
distribution.
Some insight in the role of c in (ALP)
Need practical algorithms depending on and
the Markov chain.