Cross-Entropy Method For Reinforcement Learning
Steijn Kistemaker
Heyendaalseweg 127 A1-01
6525 AJ Nijmegen
[email protected]
Supervised by:
Frans Oliehoek
Intelligent Autonomous Systems group
Informatics Institute
University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam
[email protected]
Shimon Whiteson
Intelligent Autonomous Systems group
Informatics Institute
University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam
[email protected]
Contents
1 Introduction
1.1 Motivation
1.2 Standard Tetris
1.3 Generalized Tetris in short
1.4 Overview of thesis
2 Background
2.1 Markov Decision Process
2.2 Reinforcement Learning
2.3 Cross-Entropy Method
3 Feature-based Approach to Tetris
4 Generalized Tetris
4.1 Generalized Tetris in the Reinforcement Learning Competition
4.2 Learning the block distribution
4.3 Generalized Value Functions
4.3.1 Generalized Linear Architecture
4.3.2 Generalized Tilecoded Architecture
5 Experiments
6 Conclusion
7 Discussion
1 Introduction
1.1 Motivation
In everyday life we are constantly performing all kinds of tasks, from very simple ones to complex ones like driving a car. When driving a car we observe the world and respond (act) to it. However, defining in advance how we should react in every possible situation would be a hard, perhaps impossible, task. Such tasks are therefore not suited to a hand-coded approach. Reinforcement learning is a technique for dealing with problems in which optimal actions have to be taken to maximize a long-term reward and the solution of the problem is not clearly defined. The tasks are often formulated as a Markov Decision Process, where decision making is modeled as a mapping from states to actions. To stimulate research in this field an annual Reinforcement Learning Competition [1] is held to compare RL techniques in different domains. This year one of those domains is Tetris, which belongs to the class of tasks with a large state space in which choosing the optimal action is hard. I have done my research in this domain because it lends itself to benchmarking RL techniques and, because of the large state space, it scales up to a real-world sized problem.
Figure 1: From left to right: A block falling, the construction of a filled line, the
removal of the line and introduction of a new block
Tetris is a falling blocks game where blocks of different shapes called tetro-
minoes have to be aligned on the playing field to create horizontal lines without
gaps. A standard field has width 10 and height 20 and uses 7 tetrominoes (see Fig.
2) which occur with equal probability. We will use the term ’block’ for tetromino
from now on. Standard Tetris also shows the next block, which gives the player the advantage of taking it into account when placing the current block. The goal of the game is to remove as many lines as possible by moving the blocks sideways and rotating them in 90 degree steps to place them in the desired position [8]. While doing these rotations and translations the block drops down at a discrete fixed rate, so all actions must be completed before the block hits an obstruction. When a block is placed such that a line is filled completely, the line is removed and the complete contents of the field above it drop one line down. The game ends when the filled cells in the playing field reach the top of the field so that no new block can enter.
Even though the concept and rules of Tetris are simple, playing a good game of
Tetris requires a lot of practice and skill. Extensive research has been done on the
Figure 2: From left to right, top to bottom: I-block, J-block (or left gun), L-block
(or right gun), O-block (or square), S-block (or right snake), T-block, Z-block (or
left snake)
game of Tetris in a mathematical sense [11, 5, 7] and in AI research as well [8, 16].
Burgiel shows that a sequence of alternating S and Z blocks cannot be survived [5]. Given an infinite random sequence of blocks, every possible finite subsequence will eventually occur, which proves that any game must ultimately end: no player, however good, could go on forever. Demaine et al. have furthermore shown that finding the optimal placement of blocks is NP-hard, even if the complete sequence of blocks is known in advance [7]. This makes Tetris a problem well suited to reinforcement learning techniques, and to learning techniques in general.
In section 3 we describe our feature-based approach to standard Tetris; in section 4 we introduce generalized Tetris and how we plan to solve the new problems it poses. In section 5 we introduce our experiments, the measures used to validate these experiments, and the results themselves. In section 6 we state our general findings and conclude about the effectiveness of the approach taken, ending with section 7, which discusses the validity of our results and introduces some interesting unanswered questions for further research.
2 Background
2.1 Markov Decision Process
The Markov decision process (MDP) is a mathematical framework for modeling
tasks that have the Markov property. This Markov property defines a memory-less
system where the current state summarizes all past experiences in a compact way
so that all relevant information for decision making is retained [15]. The transition
probability of going from a certain state to the next state is thus independent of the
previous states.
The Markov decision process can be defined by states s ∈ S, actions a ∈ A and the state transition function, given by (1), which gives the probability of going from state s to state s' under action a:

P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \} \qquad (1)

For every state transition at time t a reward r_t is given, where the expected value of the next reward is given by (2):

R^a_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \} \qquad (2)
The state s is the collection of sensations received by the decision maker. This is not restricted to immediate (sensory) input, but can also include internal state retained by the decision maker. We can now define the Markov property more precisely: the decision process is said to have the Markov property if it satisfies (3), or (4) when rewards are taken into account [15].

\Pr\{ s_{t+1} = s' \mid s_t, a_t \} = \Pr\{ s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0 \} \qquad (3)

\Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t \} = \Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, r_1, s_0, a_0 \} \qquad (4)
2.2 Reinforcement Learning
In reinforcement learning an agent learns by interacting with its environment: it takes actions, observes the resulting positive or negative reward, and adjusts its preference for the taken action. The tasks in reinforcement learning are often modelled as MDPs in which optimal actions over the state space S have to be found to optimize the total return over the process.
In Fig. 3 we can see the interaction between the agent and the environment. At timestep t the agent is in state s_t and takes an action a_t ∈ A(s_t), where A(s_t) is the set of possible actions in s_t. It then transitions to state s_{t+1} and receives a reward r_{t+1} ∈ ℝ. Based on this interaction the agent must learn the optimal policy π : S → A that optimizes the total return

R_t = r_{t+1} + r_{t+2} + r_{t+3} + \ldots + r_T

for an episodic task (with an end state at timestep T), or

R_t = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+k+1}, \qquad 0 \le \gamma \le 1,

for a continuing task.
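As a small numerical illustration (not from the original text): with a constant reward of 1 per step and a discount factor of γ = 0.9, the discounted return of a continuing task is

R_t = \sum_{k=0}^{\infty} 0.9^k \cdot 1 = \frac{1}{1 - 0.9} = 10,

so rewards far in the future contribute progressively less to the return.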
The task of the agent is to learn the optimal policy π* by learning the optimal state-value function V*(s) = max_π V^π(s) for all s ∈ S.
In reinforcement learning several techniques have been developed to learn this value function. The learned value function V(s) is classically represented by a table with a value stored for each state. Because real-world problems can have a state space too large
to represent by a table several techniques have been developed to approximate the
value for each state. This can be done by using a linear value function with weights
ω and features φ defining the value of a state as in (6) for n features. These weights
can then be learned by gradient-descent methods or parameter optimization tech-
niques such as genetic algorithms [10] and the Cross-Entropy Method [6].
V(s) = \sum_{k=1}^{n} \omega_k \cdot \phi_k(s) \qquad (6)
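To make the relation between (6) and the Cross-Entropy Method concrete, the following is a minimal Python sketch, not the implementation used in this thesis: the evaluate function, which should play one or more games with a given weight vector and return the average score, is assumed to exist elsewhere, and all numeric settings are illustrative.

import numpy as np

def linear_value(weights, features):
    # V(s) = sum_k w_k * phi_k(s), as in (6)
    return float(np.dot(weights, features))

def cross_entropy_search(evaluate, n_features, iterations=50,
                         population=100, elite_frac=0.1, noise=4.0):
    # evaluate(weights) -> average game score of an agent using these weights (assumed helper)
    mean = np.zeros(n_features)
    std = np.ones(n_features) * 100.0
    n_elite = int(population * elite_frac)
    for _ in range(iterations):
        # sample a population of weight vectors from the current distribution
        samples = np.random.randn(population, n_features) * std + mean
        scores = np.array([evaluate(w) for w in samples])
        # keep the best-scoring fraction (the elite) and refit the distribution to it
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        # the constant noise term keeps the distribution from collapsing too early,
        # in the spirit of the noisy cross-entropy method [16]
        std = elite.std(axis=0) + noise
    return mean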
The action space is defined as two discrete-valued variables determining the final translation and final rotation.¹ When the block is placed in its final position, the block is dropped down. With all the dynamics known we can predict the state the system will be in after we have finished this action. Because we can enumerate all possible actions, we can also enumerate all possible afterstates (the so-called look ahead). In combination with an afterstate evaluation function we can choose the 'best' afterstate, and thus the best action. The pseudocode for this is given below.

¹ This is different from the game of Tetris as normally played, where the block has to be manipulated while it is falling down.
Main Algorithm(currentfield)
1  for each step
2  do
3     if new block entered
4        actionTuple ← findBestAfterState(currentfield, newBlock)
5        return actionTuple

findBestAfterState(currentfield, newBlock)
1  maxscore ← −∞
2  for all translations t
3  do
4     for all rotations r
5     do
6        resultingfield ← afterState(newBlock, currentfield, t, r)
7        score ← scoreField(resultingfield)
8        if score > maxscore
9           bestAction ← {t, r}
10          maxscore ← score
11 return bestAction
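The same afterstate search in Python, purely as an illustrative sketch: the board representation, the helpers after_state and score_field, and the translations and rotations methods on the block object are assumptions, not code from this thesis.

def find_best_after_state(field, block, score_field, after_state):
    # Enumerate all final translations and rotations, score the resulting
    # afterstates, and return the best (translation, rotation) pair with its score.
    best_score = float("-inf")
    best_action = None
    for t in block.translations(field.width):        # assumed helper: legal horizontal positions
        for r in block.rotations():                  # assumed helper: distinct rotations of this block
            result = after_state(block, field, t, r) # assumed helper: field after the drop, or None
            if result is None:
                continue
            score = score_field(result)
            if score > best_score:
                best_score = score
                best_action = (t, r)
    return best_action, best_score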
1. Pile Height: The row (height) of the highest occupied cell in the
board.
² The "Blocks" feature is the count of the number of filled cells in the playing field. However, since every block adds 4 filled cells to the field, this feature has the same value for all compared fields whenever no lines are removed; any difference between fields can thus only be caused by the removal of lines. The "Removed Lines" feature already encodes this more directly, so the "Blocks" feature is rendered obsolete.
2. Holes: The number of all unoccupied cells that have at least one occupied cell above them.
3. Connected Holes: Same as Holes above, however vertically
connected unoccupied cells only count as one hole.
4. Removed Lines: The number of lines that were cleared in the
last step to get to the current board.
5. Altitude Difference: The difference between the highest
occupied and lowest free cell that are directly reachable from
the top.
6. Maximum Well Depth: The depth of the deepest well (with a
width of one) on the board.
7. Sum of all Wells (CF): Sum of all wells on the board.
8. Weighted Blocks (CF): Sum of filled cells but cells in row n
count n-times as much as blocks in row 1 (counting from
bottom to top).
9. Landing Height (PD): The height at which the last tetromino has been placed.
10. Row Transitions (PD): Sum of all horizontal
occupied/unoccupied-transitions on the board. The outside to
the left and right counts as occupied.
11. Column Transitions (PD): As Row Transitions above, but
counts vertical transitions. The outside below the game-board is
considered occupied.
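To illustrate how a few of these features might be computed, a small sketch follows; it assumes the board is represented as a 2D boolean numpy array with row 0 at the top, which is an assumption for illustration only and not the representation used in this thesis.

import numpy as np

def pile_height(board):
    # Feature 1: row (height) of the highest occupied cell; board[row, col], row 0 = top.
    rows_with_blocks = np.where(board.any(axis=1))[0]
    return board.shape[0] - int(rows_with_blocks.min()) if rows_with_blocks.size else 0

def holes(board):
    # Feature 2: unoccupied cells with at least one occupied cell above them.
    count = 0
    for col in range(board.shape[1]):
        seen_filled = False
        for row in range(board.shape[0]):
            if board[row, col]:
                seen_filled = True
            elif seen_filled:
                count += 1
    return count

def row_transitions(board):
    # Feature 10: horizontal occupied/unoccupied transitions; the border counts as occupied.
    count = 0
    for row in board:
        padded = np.concatenate(([True], row, [True]))
        count += int(np.sum(padded[:-1] != padded[1:]))
    return count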
This value function has been proven to work well when trained on one MDP
setting. We will, however, apply it to a set of MDPs with different characteristics.
The optimal weights for every MDP separately could be different, so we want to
learn a general weight set that performs well on all MDPs in the testset.
Table 1: Overview of learning techniques applied to Standard Tetris and their re-
sults in chronological order.
4 Generalized Tetris
4.1 Generalized Tetris in the Reinforcement Learning Competition
The competition domain given by the Reinforcement Learning Competition is based
on the Van Roy (1995) specification of the Tetris problem. The competition domain
differs, however, from it and from standard Tetris in a few significant ways. In Van Roy's specification the agent chooses a translation and rotation to specify the block's final position. In the competition version, however, the agent chooses to rotate or translate the block one space at a time as it drops down. This gives more control over the block (as in standard Tetris), but limits which positions can be reached from the top position.⁴ Although it is possible to control the block at every timestep, we decided not to use this feature and instead act as if a final translation and rotation can be chosen at the top when a new block enters the field, followed by a drop down. Although this excludes some placements, it makes the action space a lot simpler. We use an afterstate evaluation as described earlier: the highest scoring afterstate is selected, and the sub-actions for translating, rotating and dropping to reach that afterstate are planned.
Each MDP in the competition domain is defined by 9 parameters, of which 2 define the size of the playing field (directly observable in each state) and 7 define the probabilities of each of the 7 different blocks occurring (we will call this the block probability). These block probabilities are not known in advance and are not directly observable. The field size and block probabilities can also vary from MDP to MDP. In standard Tetris the block probabilities are uniformly distributed, but in the competition domain they are drawn from a non-uniform distribution.
⁴ As opposed to standard Tetris, where multiple actions can be performed per timestep in which the block drops one line down.
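Because the block probabilities must be estimated from the blocks actually observed, a straightforward empirical estimator could look like the sketch below; this is an illustrative assumption in the spirit of section 4.2, not the estimator used in this thesis. Such an estimate can supply the P(b) term used in the two ply look ahead shown next.

from collections import Counter

class BlockDistributionEstimator:
    # Maintains empirical relative frequencies of the 7 block types.
    def __init__(self, n_block_types=7):
        self.counts = Counter()
        self.n_block_types = n_block_types

    def observe(self, block_type):
        self.counts[block_type] += 1

    def probability(self, block_type):
        total = sum(self.counts.values())
        if total == 0:
            # uniform prior before any block has been observed
            return 1.0 / self.n_block_types
        return self.counts[block_type] / total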
findBestAfterStateTwoPly(currentfield, newBlock)
1  maxscore ← −∞
2  for all translations t
3  do
4     for all rotations r
5     do
6        resultingfield ← afterState(newBlock, currentfield, t, r)
7        score ← 0
8        for all blocks b
9        do
10          scoreThisBlock ← score of the best afterstate found by findBestAfterState(resultingfield, b)
11          score ← score + P(b) · scoreThisBlock
12       if score > maxscore
13          bestAction ← {t, r}
14          maxscore ← score
15 return bestAction
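The same idea in Python, building on the earlier find_best_after_state sketch and under the same assumptions about the board and helper functions; again this is an illustration, not the thesis' code.

def find_best_after_state_two_ply(field, block, blocks, block_probs,
                                  score_field, after_state):
    # Two ply look ahead: for every placement of the current block, compute the
    # expected score of the best follow-up placement over all possible next
    # blocks, weighted by their (estimated) probabilities P(b).
    best_score = float("-inf")
    best_action = None
    for t in block.translations(field.width):
        for r in block.rotations():
            result = after_state(block, field, t, r)
            if result is None:
                continue
            expected = 0.0
            for b in blocks:
                _, score_b = find_best_after_state(result, b, score_field, after_state)
                if score_b == float("-inf"):
                    # no legal placement for b: treat this afterstate as game over
                    expected = float("-inf")
                    break
                expected += block_probs[b] * score_b
            if expected > best_score:
                best_score = expected
                best_action = (t, r)
    return best_action, best_score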
For this value function, however, we assume a linear dependency between each feature and each parameter, which might not be the case. Although this value function is much more expressive than the naive linear function, it will fail to express pairwise dependencies correctly when they are not linear.
V(s) = \sum_{i=1}^{|\phi|} \sum_{j=1}^{|\pi|}
\begin{pmatrix} \mathbb{1}[\pi_j < 0.10] \\ \mathbb{1}[0.10 \le \pi_j \le 0.18] \\ \mathbb{1}[\pi_j > 0.18] \end{pmatrix}^{T}
\begin{pmatrix} \omega_{i,j,1} \\ \omega_{i,j,2} \\ \omega_{i,j,3} \end{pmatrix}
\cdot \phi_i(s)
+ \sum_{l=1}^{|\phi|} \omega_l \cdot \phi_l(s) \qquad (12)
Using this value function the influence of each parameter on each feature can be captured without assuming linearity, because a separate weight is learned for each range of a parameter value paired with each feature. However, increasing the complexity of the value function makes it harder to find the optimal weights, which could have a negative effect on its learning abilities.
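A small sketch of how the range-binned weights of (12) might be applied; the weight array layout is an assumption made purely for illustration.

import numpy as np

def parameter_bin(p):
    # The three parameter ranges used in (12): below 0.10, 0.10-0.18, above 0.18.
    if p < 0.10:
        return 0
    if p <= 0.18:
        return 1
    return 2

def tilecoded_value(features, params, pair_weights, linear_weights):
    # features: phi(s), length |phi|
    # params: estimated block probabilities pi, length |pi|
    # pair_weights: array of shape (|phi|, |pi|, 3), one weight per feature,
    #               parameter and parameter range
    # linear_weights: the plain linear part, length |phi|
    value = float(np.dot(linear_weights, features))
    for i, phi_i in enumerate(features):
        for j, p in enumerate(params):
            value += pair_weights[i, j, parameter_bin(p)] * phi_i
    return value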
Table 2: Overview of the six training MDPs and their parameters. All block probabilities are the empirically determined relative occurrences after 10,000 blocks, rounded to 4 decimals.
Figure 5: Results for one ply vs two ply look ahead. Shown are the averaged scores
per MDP over 10 performance runs with 3 weightsets.
meaning that it will score much better on average, but is still able to play games with really low scores. Because the range of possible scores for the two ply look ahead is much larger, this results in a larger performance variance. The fact that the training variance also increases indicates that 10 runs per MDP might still not be enough to get a good (consistent) estimate of the capabilities of the agent.
Table 3: The variance measures for one ply vs two ply look ahead.
stabilize. By increasing the threshold we can perform training runs within practical time.
All three value functions had 10 training runs, resulting in a total of 30 learned
weightsets. After learning we performed performance test runs where the complete
testset of MDPs was run, and every MDP was played 10 times by each agent. We
wanted to compare these results with each other and with the specialized training
runs. To compare the learned weightsets with the specialized one ply training runs
we have measured the average performance on a single MDP as a percentage of
the scores of the specialized training runs (as described in the previous section).
To compare the different value functions with each other we have used the RLC
Metric. The scores reported are the sum over all MDPs, where the score per MDP
was the average score over the 10 performance runs.
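As a concrete illustration of the arithmetic behind these comparisons (a sketch only; the data layout is an assumption, not the thesis' code):

import numpy as np

def summarize_performance(scores_per_mdp, specialized_scores):
    # scores_per_mdp: dict mapping MDP id -> list of scores over the performance runs
    # specialized_scores: dict mapping MDP id -> average score of the specialized
    #                     one ply training run on that MDP
    per_mdp_average = {m: float(np.mean(s)) for m, s in scores_per_mdp.items()}
    total = sum(per_mdp_average.values())  # summed over MDPs, as in the RLC-style comparison
    percentage_of_specialized = {
        m: 100.0 * per_mdp_average[m] / specialized_scores[m]
        for m in per_mdp_average
    }
    return total, percentage_of_specialized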
Figure 6: Per MDP results of the 10 naive linear learned weightsets as a percentage
of the specialized one ply scores.
Figure 8 shows how well each value function was able to maximize the score as set by the RLC metric. In this figure the average scores obtained over the 10 training runs per value function are shown together with their standard deviations. We also show the maximum score obtained per value function. We can see that although the Naive Linear architecture scores better as a percentage of the specialized training runs, the Pairwise Linear architecture scores better on the RLC metric. We can also see that the average result of the Pairwise Tilecoded architecture compares poorly to the other value functions, while its best training run is much closer to the best of the other two. This is also reflected in the high variance of the scores obtained with the Pairwise Tilecoded architecture (as denoted by the error bar).
6 Conclusion
Looking back on our results we can say with high confidence that the two ply look ahead gives a significant benefit, even though the next block is not known. This, however, comes at a great cost in computational complexity. If this cost can be reduced, for example by a heuristic pruning method, a two ply look ahead could be used effectively.
The results of the three different value functions are less clear-cut. Although the Pairwise Linear architecture outperforms both other value functions, we also observe that the Naive Linear architecture outperforms the Pairwise Tilecoded architecture. Because this last value function is at least as expressive as the Naive Linear value function, we suspect that the large number of weights to be learned made the Pairwise Tilecoded architecture too complex to train. The large variance in its scores supports this idea, because it shows that although the average results are lower than the others, it can find solutions that compete with the best of the other value functions.
Figure 8: In light blue: Scores obtained using the RLC metric for the three different
value functions. The error bars denote the standard deviation over the 10 training
runs. In dark blue: The best score over 10 training runs.
7 Discussion
Although we have found that the Pairwise Linear value function performs best in our setting, we have made some choices that could strongly affect our results. Due to time limitations we capped the number of games per MDP per agent at 2 games and used the average of these scores as the score of that agent on that MDP. Tetris, however, is known to have a large variance in its scores, which means that only 2 games per MDP are probably not enough to determine the performance of an agent with high confidence. This can negatively affect the learning curve, because badly playing agents can play 2 good games and get selected into the elite population, or well playing agents can play 2 bad games and not get selected. We think that, when more time is available, increasing the number of games per MDP would result in a more reliable selection and therefore better learning.
V(s) = \sum_{i=1}^{|\phi|} \sum_{\substack{j,k \in C(j,k) \\ j \in \{1,\ldots,|\pi|\},\, k \in \{1,\ldots,|\pi|\}}}
\begin{pmatrix} \mathbb{1}[\pi_k < 0.10] \\ \mathbb{1}[0.10 \le \pi_k \le 0.18] \\ \mathbb{1}[\pi_k > 0.18] \end{pmatrix}^{T}
\begin{pmatrix} \omega_{i,j,k,1} & \omega_{i,j,k,2} & \omega_{i,j,k,3} \\
\omega_{i,j,k,4} & \omega_{i,j,k,5} & \omega_{i,j,k,6} \\
\omega_{i,j,k,7} & \omega_{i,j,k,8} & \omega_{i,j,k,9} \end{pmatrix}
\begin{pmatrix} \mathbb{1}[\pi_j < 0.10] \\ \mathbb{1}[0.10 \le \pi_j \le 0.18] \\ \mathbb{1}[\pi_j > 0.18] \end{pmatrix}
\cdot \phi_i(s)
+ \sum_{l=1}^{|\phi|} \omega_l \cdot \phi_l(s) \qquad (13)
This idea can be extended to an N-dimensional tile-coded parameter space, where N is the number of parameters and a weight for each feature is given by the complete combination of all N parameters. However, increasing the number of weights makes it ever more complex to learn the right values. There would also have to be a set of training MDPs with sufficient variation to learn all these weights and not leave combinations unlearned. Nevertheless, it would be interesting to investigate whether a generalized policy could be learned this way when such a sufficient training set is available.
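A hypothetical sketch of such an N-dimensional extension; the 0.10/0.18 range boundaries are reused from (12) and (13), and the weight array layout is an illustrative assumption rather than a proposed implementation.

import numpy as np

def parameter_bin(p):
    # same 0.10 / 0.18 boundaries as in (12) and (13), purely for illustration
    return 0 if p < 0.10 else (1 if p <= 0.18 else 2)

def nd_tilecoded_value(features, params, weights, linear_weights):
    # weights: array of shape (|phi|, 3, 3, ..., 3) with one 3-way bin per parameter,
    # i.e. a weight for every feature combined with every joint parameter-range cell
    value = float(np.dot(linear_weights, features))
    bins = tuple(parameter_bin(p) for p in params)
    for i, phi_i in enumerate(features):
        value += weights[(i,) + bins] * phi_i
    return value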
References
[1] Reinforcement learning competition 2008. http://www.rl-competition.org.
[3] Böhm, Kókai, and Mandl. Evolving a heuristic function for the game of Tetris. Proc. Lernen, Wissensentdeckung und Adaptivität, 2004.
[9] Farias and van Roy. Tetris: A study of randomized constraint sampling, pages
189–201. Springer-Verlag, 2006.
[14] Ramon and Driessens. On the numeric stability of Gaussian processes regression for relational reinforcement learning. In ICML-2004 Workshop on Relational Reinforcement Learning, 2004.
[15] Sutton and Barto. Reinforcement Learning: An Introduction. The MIT Press,
1998.
[16] Szita and Lőrincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 2006.
[17] Tsitsiklis and van Roy. Feature-based methods for large scale dynamic pro-
gramming. Machine Learning, 1996.