
Generating a Periodic Pattern for VLIW

Cristina Barrado, Jesús Labarta, Eduard Ayguadé and Mateo Valero


Departament d’Arquitectura de Computadors
Universitat Politècnica de Catalunya, Barcelona, Spain
Email: {cristina, jesus, eduard, mateo}@ac.upc.es

Abstract
Fine-grain parallelism available in VLIW and superscalar processors can be mainly exploited in
computationally intensive loops. Aggressive scheduling techniques are required to fully exploit this
parallelism. In this paper we present a new Software Pipelining technique based on Graph Traverse
Scheduling, a parallelizing technique originally proposed for multiprocessor systems that generates
parallel threads automatically using a hamiltonian recurrence in the dependence graph of the loop.
Explicit synchronizations required in multiprocessors to guarantee data dependences are now
substituted in the VLIW architecture by the correct allocation of loop operations and nop-operations
in the lock-step execution. The technique proposed here shows how efficient VLIW code can be automatically generated using a hamiltonian recurrence in the dependence graph. The NP-hardness of the scheduling problem is restricted here to a problem of smaller size than in related techniques. The dependence graph, extended with a scheduling recurrence, describes the characteristics of the schedule: the number of functional units required and the efficiency achieved, in terms of parallelism, can be known prior to generating the physical schedule. In this paper we consider single-nested loops
without conditionals and multifunctional processing units with a unit latency; extensions for multi-
cycle processing units are straightforward. Finally we also show how other Software Pipelining
techniques can be interpreted by means of scheduling recurrences in the dependence graph. In this
sense our technique encompasses these other techniques, since the schedules they obtain can also be achieved with the technique described here.
Keywords: VLIW, Loop Scheduling, Instruction Level Parallelism, Software Pipelining.

1 Introduction
VLIW is one of the firm alternatives for future processor design. These processors offer high instruction-level parallelism, and their compilers play an important role in achieving high utilization of
the functional units. The potential fine-grain parallelism available in VLIW and superscalar
processors can be effectively exploited in computationally intensive loops. The scheduling of
statements inside a basic block has been shown to be a poor approach to extract parallelism out of
sequential programs in high instruction-level parallelism architectures [HePa90]. In order to feed
the functional units with more operations, the scheduler has to consider statements across basic
block boundaries. Since the main part of the execution time of numerical applications is spent in
loops, their parallelization is the objective of most parallelizing compilers. Most of them fail
when trying to obtain parallelism out of loops with tight recurrences in the dependence graph and
under resource constraints. The problem of finding the maximum parallelism is NP-Complete
[GaJo79] and existing compilers use greedy algorithms to find a solution and apply heuristics
when the optimal solution is not found.
Loop Unrolling and Software Pipelining are the most extended techniques to parallelize loops
for synchronous architectures. Loop Unrolling was introduced with Trace Scheduling [Fish81] in order to construct larger basic blocks and thus have more opportunities for operation compaction. Software Pipelining [Lam88] consists of overlapping successive loop iterations by initiating a new iteration
before prior iterations have completed. The initiation of successive iterations is done at a constant
rate, named initiation interval (II). The number of resources and the recurrences in the dependence
graph (DG) limit the value of the II. All the iterations follow the same pattern and the heuristics
used for searching such pattern lead to different implementations of the technique. For instance,
Perfect Pipelining [AiNi88] and Modulo Scheduling [RaGl81] are two of the approaches to do
Software Pipelining under resource constraints.
In Perfect Pipelining the functional units can execute any type of operation and have unit
latency. The method consists of scheduling all the dependence-free operations of any iteration using a greedy approach until a periodic pattern is found. They show that this pattern is always found with a search algorithm of cost quadratic in the number of nodes of the graph. The schedule
generated is optimal from the point of view of the execution time, but constraints on the number
of resources are not considered. Once the pattern is found, some transformations based on simple
heuristics are applied to reduce the number of functional units needed.
Modulo Scheduling is targeted to more complex architectures. It considers that the functional
units are pipelined and have specific functionalities. First, the lower bound of the II (Minimum
Initiation Interval) is computed as the maximum of the ResMII (MII dictated by resource
constraints) and the RecMII (MII dictated by precedence or dependence constraints). Operations
are chosen using some topological approach and a greedy algorithm is used to schedule them
preserving data dependencies and resource utilization with operations already scheduled. The
statements scheduled are allocated inside the resource reservation table (an array of size equal to
the number of resources times the MII). Only one statement can be allocated in each slot of the
table, indicating that every MII clock cycles a new iteration of this statement is initiated. Heuristics
are used when collisions in the resource reservation table occur: when a statement cannot fit in
any slot due to resource or precedence constraints, the MII is increased and the process repeated
again. At the end of the process, the reservation table defines the set of VLIW instructions that
form the loop body.
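To make the bounds concrete, the MII computation can be sketched as follows (a minimal sketch of ours, not the authors' code; it assumes unit-latency operations and uniform multifunctional units, and all names are hypothetical):

from math import ceil

def minimum_initiation_interval(n_ops, n_units, recurrences):
    # ResMII: with uniform units, each unit issues at most one operation
    # per cycle, so at least ceil(#operations / #units) cycles are needed
    res_mii = ceil(n_ops / n_units)
    # RecMII: each recurrence c forces ceil(latency(c) / distance(c));
    # recurrences is a list of (total_latency, total_distance) pairs,
    # which with unit latency become (|c|, w(c))
    rec_mii = max(ceil(lat / dist) for lat, dist in recurrences)
    return max(res_mii, rec_mii)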
Software pipelining is also used to schedule loop operations for superscalar architectures. The
schedule obtained for VLIW is also valid for superscalar architectures if the long-instructions are
split into several consecutive RISC instructions that are issued at the same clock cycle.
Some Software Pipelining algorithms have been specially developed for this type of architecture.
For instance, Circular Scheduling [Jain91] does software pipelining by manipulating the
dependence graph. The algorithm iterates until a local optimal solution is found: at each iteration,
a node from the top of the graph is circled. A copy of the circled statement will constitute part of
the prologue of the software pipelined loop, the remaining statements are copied in the epilogue
and the number of iterations is reduced by one. In the core of the software pipelined loop,
statements belonging to different iterations will appear.
In this paper we present a new software pipelining technique. It is based on Graph Traverse
Scheduling (GTS), a parallelizing technique presented in [ALTL91] and evaluated in [BLB94] for
the Alliant/FX architecture. The method assumes the existence of a hamiltonian recurrence in the
dependence graph (a recurrence traversing all the nodes in the graph). The generation of optimal
hamiltonian recurrences (according to parallelism and/or resource constraints) is not the objective
of this paper and can be found in [BLAV95b]. The hamiltonian recurrence is obtained (if it is not available or does not fulfil the constraints) by adding new dummy edges to the dependence graph DG
of the loop; this graph is named extended dependence graph (EDG). Any characteristic of the loop
schedule can be obtained from EDG before generating it. For instance, the number of functional
units required by the schedule is the weight of the hamiltonian recurrence; the efficiency achieved
in terms of parallelism, is determined by the parallelism of the most restrictive recurrence in EDG.
Once the hamiltonian recurrence is obtained by applying some heuristics, the NP-complete
part of the scheduling problem has already been solved. The method presented here extracts the
full inherent parallelism of the given EDG in polynomial time cost. When the periodicity of the
scheduling is not integral, most software pipelining techniques discard some rational fraction of
the parallelism by rounding the MII to the ceiling. Other techniques find full parallelism by
unrolling the loop before scheduling. The unrolling factor they need is equal to the denominator of
the MII [Hane94]. In our method unrolling may also be needed, but the unrolling factor is always less than or equal to theirs and, most importantly, does not increase the input size of the NP-complete part because this part has already been solved.
In this paper we assume the following architectural features: VLIW processors with
multifunctional functional units of unit latency. In this architectural model, synchronizations can
be enforced by properly allocating operations to the long instructions that constitute the loop body
and their lock-step execution. The method automatically generates optimal VLIW code using a
hamiltonian recurrence in the dependence graph.
The organization of the paper is as follows. Section 2 reviews the main aspects of GTS to make the technique presented here more comprehensible. Section 3 introduces the intuitive idea of our
approach by using an example. We will show how the number of idle clock-cycles of a schedule
can be computed and how they can be represented in the EDG. Section 4 presents the new
technique and the steps to find a schedule and VLIW code. In Section 5 we show how the schedules
generated by other software pipelining techniques can be modelled by means of scheduling
recurrences in the EDG. Some concluding remarks are given in Section 6.

2 GTS for Multiprocessors


The basis of our new Software Pipelining technique is Graph Traverse Scheduling. GTS was first
targeted to parallelize loops for multiprocessor and vector architectures. In this section we
summarize the key ideas used to generate code for multiprocessors. More details on the automatic
code generation can be found elsewhere [ALTL91]. Let L be a single nested loop do i=i0,N
including a set of statements S1,...,Sn in the loop body. The execution order constraints of loop
operations are usually represented by the dependence graph DG of the loop L; DG =(V, E) is a
directed multigraph. V is the set of vertices of DG: each vertex x ∈V represents a statement Sx of
the loop. E is the set of the edges of DG: each edge (x,y) ∈ E represents a data dependence between
statements Sx and Sy: statement Sy cannot be executed until Sx finishes. A function w:E→N is
used to associate a weight w(x,y) to each edge: the distance of the dependence or number of
iterations the dependence extends across. Edges can form cycles in the graph that we name
recurrences. The weight of a recurrence c, w(c), is the sum of the weights of all the edges that
compose it and its length, |c|, is the number of nodes it traverses. A hamiltonian recurrence is a
cycle in DG that visits once each node in the set V.
GTS assumes the existence of a hamiltonian recurrence in the DG to schedule loop operations.
Let Rsch (Scheduling Recurrence) be such a hamiltonian recurrence and let ω be its weight. GTS
generates a set of ω threads that follow the same periodic pattern. The allocation of operations to
each thread is done by assigning an initial operation to each thread and then assigning operations
by following the nodes of the Rsch repeatedly until the end of the loop. Figure 1-a) shows the
dependence graph for our running example. The number of threads generated using the
hamiltonian Rsch of the example (with ω=5) is also 5 (from t0 to t4). Figure 1-b) shows the
sequence of operations executed by each thread assuming i0=1. Notice how the Rsch characterizes
the sequence of operations of each thread: the weight of the incoming edge (belonging to Rsch) to
any node x defines the set of dependence free iterations of statement Sx. These are the initial
operations that are allocated to threads. The allocation of these initial operations is done so that
consecutive iterations of any statement are allocated to consecutive threads and that two
consecutive instances of a statement allocated to a thread are ω iterations apart.
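This traversal is easy to express in code. The following sketch is ours; the Rsch edge weights used in the usage comment (w(1,2)=1, w(2,3)=3, w(3,1)=1) are assumptions reconstructed from Figure 1-b:

def gts_thread(rsch_order, w_in, start_node, start_iter, length):
    # rsch_order: nodes along the hamiltonian recurrence, e.g. [1, 2, 3]
    # w_in[x]: weight of the Rsch edge entering node x
    # From operation Sx^i the thread proceeds to Sy^(i + w(x,y)),
    # where (x,y) is the Rsch edge leaving x
    pos = rsch_order.index(start_node)
    node, i, seq = start_node, start_iter, []
    for _ in range(length):
        seq.append((node, i))
        pos = (pos + 1) % len(rsch_order)
        node = rsch_order[pos]
        i += w_in[node]
    return seq

# Assumed running example: Rsch 1→2→3→1 with w(1,2)=1, w(2,3)=3, w(3,1)=1.
# Thread t1 starts at the dependence-free operation S3^1:
# gts_thread([1, 2, 3], {1: 1, 2: 1, 3: 3}, 3, 1, 6)
#   → [(3,1), (1,2), (2,3), (3,6), (1,7), (2,8)], matching Figure 1-b.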
Threads (time flows downwards; each column is the sequence of operations of one thread, with i0=1):

t0    t1    t2    t3    t4
S11   S31   S32   S33   S21
S22   S12   S13   S14   S34
S35   S23   S24   S25   S15
S16   S36   S37   S38   S26
S27   S17   S18   S19   S39
S310  S28   S29   S210  S110
:     :     :     :     :

Figure 1: GTS Schedule for Multiprocessors. a) DG of the running example, with the hamiltonian Rsch and the synchronization edges; b) the schedule above.

Edges that do not belong to the Rsch must be enforced by means of synchronization in order
to ensure the semantics of the original sequential loop. Semaphores are used in GTS to guarantee
the execution ordering of dependent operations. A signal statement is generated after the statement
corresponding to the source of the edge, and a wait statement is introduced before the statement
corresponding to the sink node. The synchronization introduced follows a periodic pattern: the
number of consecutive threads crossed by each synchronization is a constant value that can be
statically computed from the weights of the edges in DG. For instance, in the example of Figure 1,
two edges must be explicitly synchronized: (3,2) and (2,1). Any two dependent operations due to
them are allocated 4 and 3 threads (modulo 5) apart, respectively.

The parallelism obtained with GTS is constrained by the recurrences of the DG. The inherent
parallelism of a graph, Π, can be computed statically from them [ALTL91]. Each recurrence c in
the DG represents a constraint that can be quantified by the parallelism per statement of c, Πs(c),
computed as shown in expression (2.1). The recurrence cr with the lowest parallelism per statement
is the most restrictive recurrence (also known as the critical circuit) and defines the maximum
parallelism Π of the loop. The value Πs gives an intuitive idea of the average number of operations
of any statement that can be executed on each clock-cycle considering dependence constraints.

Π = n × Πs
Πs = min_c Πs(c)        (over all recurrences c of G)
Πs(c) = w(c) / |c|                                          (2.1)
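As an illustrative sketch (ours, with hypothetical names), expression (2.1) can be evaluated exactly with rational arithmetic once the recurrences of the graph are enumerated:

from fractions import Fraction

def pi_per_statement(recurrences):
    # recurrences: one (weight, length) pair per recurrence c of G,
    # where weight = w(c) and length = |c|
    return min(Fraction(w, l) for w, l in recurrences)

# Assumed recurrences of the running example: the Rsch with (w, |c|) = (5, 3)
# and the recurrence <(1,2),(2,1)> with (3, 2):
# pi_per_statement([(5, 3), (3, 2)]) → Fraction(3, 2), and Π = n × Πs = 9/2.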

3 Description of the Method
The objective of this section is to show how a Rsch can be used to schedule loops for a horizontal
machine. The architectural model assumed in this paper considers a given number of
multifunctional units with unit latency. The intuitive idea of our approach is introduced using the
same example in Figure 1. Although we generate a loop schedule during the explanation of the
process, the VLIW code can be directly generated from the EDG (the extended dependence graph
resulting from adding Rsch edges to the DG). The details of the algorithms are found in Section 4.
Our scheduling is based on one or several scheduling recurrences in the EDG. In this paper we
focus on a single hamiltonian scheduling recurrence (Rsch) with weight ω (the number of
functional units). The Rsch generates ω vertical patterns that are assigned to the functional units
in the intermediate schedule. The sequence of loop operations in each pattern is the same sequence
of operations that are assigned to each thread by GTS, as described in Section 2. In order to
minimize the number of instructions in the body of the VLIW loop, the operations that are finally
assigned to each physical functional unit are not necessarily the same ones that are allocated in this
intermediate schedule (because the operations allocated to the same long instruction can be
executed by any of the multifunctional units).
The edges of the DG are dependences that must be preserved. Some of them may be redundant
with other edges and paths in the Rsch and thus preserved by the schedule itself. Algorithms for
eliminating redundant dependences are shown in [MiPa87], [KrSa91]. Non-redundant dependences correspond to data that are generated in one functional unit and consumed in another, and are named synchronizing edges. The proper allocation of statements in the VLIW lock-step
execution is the way of preserving these dependences.
The edges of the most restrictive recurrence, cr, are the dependences that limit the maximum
speed at which operations can be executed. The operations that do not belong to the most restrictive
recurrence sometimes have to be delayed in order to balance the speed of all the recurrences. This
waiting-time is easily introduced in VLIW architectures by using nop-operations. They can be
represented in the Rsch by including some "empty" nodes. The parallelism of the Rsch is reduced
since its number of nodes is increased while its weight is preserved. The number of nodes
introduced is computed in order to achieve the same Πs for both recurrences Rsch and cr (and
consequently both will execute at the same speed).
The number of empty nodes that have to be included in the Rsch is not necessarily an integral
value; hence we may need to unroll the Rsch (vertical unrolling described later). The relation
between the number of empty nodes and the number of unrollings of the Rsch required can be
easily computed from the relation between the number of operative and wasted clock-cycles,
which is also the parallelism of the schedule. Figure 2 shows the schedule generated for the
hamiltonian EDG of our working example. The relation here between the number of empty nodes
and the number of unrollings is 1/3, that is, the Rsch has to be unrolled 3 times and one empty node
has to be inserted. However, as Figure 2-b) shows, we keep the representation of the graph without
unrolling and include a fraction of the empty node.
Figure 2: Vertical periodic pattern Rsch•u. a) EDG of the working example; b) the Rsch•3, represented without unrolling and including a fraction (e/3) of the empty node; c) the resulting pattern; d) schedule using the Rsch•3 on functional units 0..4, with arrows marking the initiation delay of each pattern.

Empty nodes have to be allocated in the Rsch in a non-critical path, that is, outside the most
restrictive recurrence. In the example, where the most restrictive recurrence is composed of edges (1,2) and (2,1), the empty node has been allocated after node 3 and in the first body of the unrolled Rsch, as Figure 2-c) shows.
In order to preserve the synchronization edges, operations must be correctly allocated in the
lock-step execution. This can be achieved in the scheduling by appropriately delaying the initiation
of the pattern, which can be directly computed from the EDG. In Figure 2, the initiation delay of
each pattern is shown with the arrows. Observe that the proposed delays guarantee all dependences
represented by synchronization edges. The operations executed before the initiation of the pattern
correspond to an incomplete pattern. The allocation of such operations can be done by traversing
the Rsch backwards.

The last conceptual step consists of generating the VLIW code from the schedule. The VLIW code consists of a prologue, a core and an epilogue. We focus on obtaining the core, since the prologue and epilogue are a partial execution of the operations of the core and can be done
using predicated execution [RSP92]. The number of long-instructions of the core is equal to the
number of nodes of the most restrictive recurrence. Any horizontal slice of this size in the schedule
can be used as core. Each row of the slice is one of the long-instructions that form the core of the
software pipelined loop. The iteration instances of each statement have to be related to the iteration
index of the loop. In the example, the set of VLIW instructions that form the kernel of the loop is
the slice composed by 2 consecutive rows of the schedule. Every iteration of the software pipelined
loop represent the execution of 3 iterations of the original loop. In Figure 3 the obtention of the
long-instructions of the software pipelined loop core is shown.

Figure 3: Obtention of the VLIW loop. a) schedule, from which a horizontal slice of |cr| = 2 consecutive rows is taken; b) core of the VLIW loop, formed by the long-instructions (S2i+1 S1i+1 - S3i+2 S2i) and (S3i+4 S2i+2 S1i+2 S1i+3 S3i+3), with the index incremented by 3 on every iteration (i = i+3).

Observe that the execution of the VLIW loop does not correspond exactly to the
intermediate schedule we used for the intuitive presentation of the method, but considering that
operations can be moved horizontally the result is the same.

Notice in Figure 3 that the II resulting from this schedule is not a constant value: sometimes it is equal to 1 and sometimes equal to 0; however, the average value is 2/3. The reason is that the
parallelism per statement (Πs) of the EDG is also a non integral value (the Πs defined in (2.1) is
the inverse of the lower bound of the II for precedence constraints, RecMII, defined in [Lam88]).
Since our technique exploits full parallelism, 3 complete iterations are executed every 2 clock-
cycles.
4 VLIW Schedulings from a Hamiltonian Recurrence
The automatic process for scheduling loops for VLIW using a Rsch is divided into four steps. In
section 4.1 we show the first one: how to compute the number of empty nodes needed to represent
the idle clock-cycles of the schedule generated. In section 4.2 the conditions for the correct
allocation of such nodes are presented. We will show that Rsch must hold the following condition
in order to always find a valid solution: all the edges of the most restrictive recurrence but one must
be introduced in the Rsch. In section 4.3 we present the algorithm that computes the initiation delay
of each pattern. The generation of the loop core is direct once the initiation delay and the pattern
of each functional unit are known.

4.1 Empty Nodes and Unrolling


The objective of this section is the computation of the number of wasted clock-cycles relative to
the operative clock-cycles of the schedule. The schedule we want to generate must preserve the
inherent parallelism of the EDG and use a number ω of functional units. The average number of
useful operations done each clock-cycle must be equal to the parallelism of the loop. The wasted
clock-cycles, where functional units are executing nop-operations, are represented in the Rsch as
empty nodes. Their number (e) depends on the parallelism of the EDG and, if it is not an integral value, then the loop has to be unrolled u times. The unrolling of the Rsch will be
represented in the EDG with Rsch•u.
The concept of unrolling here is understood as vertical unrolling, in contrast with the horizontal
unrolling that other techniques apply. In Loop Unrolling and in Software Pipelining unrolling is
understood as the technique that transforms a loop in another that has a longer loop body,
composed of a number of consecutive iterations. The objective is to have more operations that can
be compacted in the horizontal long-instructions. It is done before scheduling. The result is a
transformed loop with a fixed number of VLIW instructions that hold more than one instance of
every statement of the previous loop. We call this type of unrolling horizontal unrolling because
the objective is to have all the unrolled operations as horizontally compacted as possible. The
(horizontal) unrolling factor in Software Pipelining is equal to w(cr). In the technique we are
presenting the horizontal unrolling is achieved by the generation of the ω parallel patterns. In
contrast, the vertical unrolling is applied to the pattern fixed by the Rsch and not to consecutive
loop bodies. It is done after the Rsch is found, and does not increase the input size of the NP-
complete scheduling problem. Finally, the objective is not to horizontally compact the operations
that belong to the same vertical pattern.
The parallelism of a scheduling can be computed as the number of operations done divided by
the required time. If we consider the scheduling generated using one Rsch•u we have that the total
number of operations executed is the number of functional units, ω, times the number of operations
executed by each functional unit: n*u (where n is the number of statements of a loop body). The
time required to execute all these operations is equal to the length of the Rsch•u.
Π = #operations / time = (n × u × ω) / ((n × u) + e)        (4.1)
Making expression (4.1) equal to the expression of the parallelism found in (2.1), we have the
minimum integral values of u and e given in expression (4.2).
u_min = k⁻¹ × w(cr)
e_min = k⁻¹ × (ω × |cr| − n × w(cr))                        (4.2)
where k = gcd(w(cr), ω × |cr| − n × w(cr))
and cr is the most restrictive recurrence
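Expression (4.2) translates directly into code; a minimal sketch (ours, hypothetical names):

from math import gcd

def unrolling_and_empties(n, omega, w_cr, len_cr):
    # n: statements per loop body; omega: weight ω of the Rsch (= #units)
    # w_cr, len_cr: weight w(cr) and length |cr| of the most restrictive
    # recurrence cr
    k = gcd(w_cr, omega * len_cr - n * w_cr)
    u = w_cr // k                           # minimum vertical unrolling
    e = (omega * len_cr - n * w_cr) // k    # minimum number of empty nodes
    return u, e

# Example of Figure 2-a): unrolling_and_empties(3, 5, 3, 2) → (3, 1)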

Observe that the number of unrollings needed to find the vertical pattern is at most the weight of the most restrictive recurrence, w(cr), which in most realistic cases is 1. Also, the number u of vertical unrollings required is less than or equal to the number of (horizontal) unrollings applied by other Software Pipelining techniques before scheduling.
For example, in the EDG of Figure 2-a) the most restrictive cycle is <(1,2),(2,1)> which has a
Πs equal to 3/2 (w(cr) is 3 and |cr| is 2). The Rsch has a weight, ω, equal to 5 and the total number
of nodes, n, is 3. The computation of e and u for this graph is:
gcd(w(cr), ω × |cr| − n × w(cr)) = gcd(3, 10 − 9) = 1
u = 3
e = 1

4.2 Generation of the Rsch•u


The objective of this section is to find a Rsch•u that can be directly used for scheduling. We know
the values u and e but not the allocation of the e empty nodes in the unrolled Rsch. The two
following conditions must hold in order to have a valid Rsch•u:
•1st, the allocation of the e empty nodes must preserve the parallelism of the graph.
•2nd, the e empty nodes must be distributed uniformly across the u loop bodies of the Rsch•u.
The parallelism of the graph is not reduced if the empty nodes are introduced in a subpath of
the Rsch that is not part of the most restrictive recurrence. The mechanism, shown in Figure 4,
consists of substituting one of the edges of the Rsch, say edge (i,j), with the sequence of edges (i,nop0)... (nope/u, j), all of them of weight equal to 0 except for the first one, which keeps the weight of the original edge (i,j). In each one of the u loop bodies of the Rsch•u we introduce
e/u empty operations. If the value e/u is not integral then some unrolled bodies will have one more
empty node than the rest. For example, in Figure 4-c), where u is equal to 3 and e is equal to 4, the
first unrolled body, u0, has 2 empty operations and the next two, u1 and u2, have only 1.
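The distribution of the e empty nodes over the u bodies can be sketched as follows (our sketch; the tuple-based nop naming is hypothetical):

def insert_empty_nodes(rsch_edges, u, e, split_edge):
    # rsch_edges: (src, dst, weight) triples of one Rsch body
    # split_edge: the Rsch edge (i, j) lying outside the most restrictive
    # recurrence, where the nop chain may be inserted
    bodies = []
    for b in range(u):
        nops = e // u + (1 if b < e % u else 0)   # spread e over u bodies
        edges = []
        for (s, d, w) in rsch_edges:
            if (s, d) == split_edge and nops > 0:
                chain = [s] + [("nop", b, k) for k in range(nops)] + [d]
                edges.append((chain[0], chain[1], w))   # keeps weight w(i,j)
                edges += [(chain[k], chain[k + 1], 0)
                          for k in range(1, nops + 1)]  # zero-weight edges
            else:
                edges.append((s, d, w))
        bodies.append(edges)
    return bodies

# With e = 4 and u = 3, body u0 receives two empty nodes and bodies u1
# and u2 one each, as in Figure 4-c).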

The mechanism presented always works if the Rsch fulfils the property given in Lemma 1.

Lemma 1: If Rsch contains all the edges of the most restrictive recurrence but one and e, u are
defined as in expression (4.2), then the e empty nodes can always be inserted in the u times
unrolled Rsch with no penalty for the parallelism. [BLAV95a]

Figure 4: Inserting empty nodes in Rsch•u. a) edge (i,j) of the Rsch, outside the most restrictive recurrence cr; b) the edge substituted by the chain (i,nop0)...(nope/u,j), where the first edge keeps the weight w(i,j) and the rest have weight 0; c) the resulting pattern when e = 4 and u = 3: body u0 contains two empty operations and bodies u1 and u2 contain one each.

4.3 Computing the Initiation Delays


In this section we present the third pass of the algorithm: the computation of the initiation delays of the pattern. When the most restrictive recurrence is not the Rsch, the algorithm is the one
shown in Figure 5. If the most restrictive recurrence is the Rsch then we have a particular case
where e is equal to 0 and u is equal to 1. The computation of the initial delays in this case is shown
in Figure 7.
4.3.1 Computing Delays under Dependence Constraints
Let us assume that we have a hamiltonian EDG where the most restrictive recurrence is not the Rsch; then the computed value of e is greater than 0. Assume the Rsch holds the condition of Lemma 1: if cr=<e0,e1,...,e|cr|-1> is the most restrictive recurrence, then all the edges but one, say e0, e1,...,e|cr|-2, also belong to the Rsch. Then, by Lemma 1, we know that the Rsch•u can always be
generated. The synchronization edge e|cr|-1 must be preserved by correctly delaying the initiation
of the periodic pattern generated using the Rsch•u.
We can assume, without loss of generality, that the periodic pattern starts with statement S1 and
that the first functional unit starts the execution of such pattern with statement S1i0 at time 0. In
order to preserve the edge e|cr|-1 we need to start the execution of the pattern in the rest of the
functional units with a certain delay. The algorithm in Figure 5 computes the initiation delay and
the iteration instance of the first S1 for all the functional units. We assume that the expression of

the Πs(cr) is normalized, thus w(cr) and |cr| are relatively prime numbers.

M = gcd(ω, w(cr))
for s := 0 to M-1 do
    for α := 0 to (ω/M)-1 do
        t = (s + α × w(cr)) mod ω
        delay[t] = α × |cr| + s × |cr| / w(cr)
        iter[t] = i0 + s + α × w(cr)

Figure 5: Algorithm to compute delays under recurrence constraints

In Lemma 2 we show that the generated schedule using the computed delays is valid because
it preserves all the data dependences.

Lemma 2: Given a Rsch•u that contains all the edges of the most restrictive recurrence but
one, the edge e|cr|-1: if this edge is preserved through the introduction of delays, then every other edge
not included in the Rsch is also preserved [BLAV95a].

We can check that, applying the algorithm to the example of Figure 2, the results are the same as the initiation delays shown in Figure 2-d) with the arrows:

t 0 3 1 4 2
delay[t] 0 2 4 6 8
iter[t] 1 4 7 10 13
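A direct transcription of the algorithm in Figure 5 (our sketch; floor division is assumed where the pseudocode leaves the rounding of s × |cr| / w(cr) implicit — in this example M = 1, so s = 0 and the term vanishes):

from math import gcd

def initiation_delays(omega, w_cr, len_cr, i0):
    delay, first_iter = [0] * omega, [0] * omega
    M = gcd(omega, w_cr)
    for s in range(M):
        for a in range(omega // M):
            t = (s + a * w_cr) % omega
            delay[t] = a * len_cr + (s * len_cr) // w_cr
            first_iter[t] = i0 + s + a * w_cr
    return delay, first_iter

# Example of Figure 2 (ω=5, w(cr)=3, |cr|=2, i0=1):
# delay = [0, 4, 8, 2, 6] and first_iter = [1, 7, 13, 4, 10], i.e. delays
# 0, 2, 4, 6, 8 for units t = 0, 3, 1, 4, 2, as in the table above.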

4.3.2 Computing Delays under Resource Constraints


Under resource constraints the parallelism is limited by the number of functional units. Since the
number of functional units is equal to the weight of Rsch (ω) and the number of nodes of the graph
is n, the Πs limited by the resources is equal to ω/n, which is also the Πs computed for the Rsch. In consequence, the Rsch is now the most restrictive recurrence.
In this particular case no empty nodes have to be included in the Rsch because the resources
are fully used during the execution of the loop. We can check this feature with the expression (4.2):
when Πs(cr) is equal to ω/n then e is 0. In this section we will show a new algorithm for computing
the initiation delays when the parallelism is restricted by the Rsch.
Suppose we have an architecture with 4 functional units. For the DG of Figure 6-a) we can use
the EDG shown in Figure 6-b) to schedule the loop efficiently.

Figure 6: Generation of a Rsch under resource constraints. a) dependence graph on nodes A..E (Πs = 1); b) EDG with a Rsch for 4 functional units (Πs = 4/5).

The algorithm for computing the delays is shown in Figure 7. We can observe that the initiation
delays depend on the value Πs.

for t := 0 to ω-1 do
    delay[t] = t × n / ω        ; Πs is equal to ω/n
    iter[t] = i0 + t

Figure 7: Algorithm to compute delays under resource constraints
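The corresponding sketch for Figure 7 (ours; floor division assumed for t × n / ω):

def initiation_delays_resources(omega, n, i0):
    # the Rsch is the most restrictive recurrence: Πs = ω/n and e = 0
    delay = [(t * n) // omega for t in range(omega)]
    first_iter = [i0 + t for t in range(omega)]
    return delay, first_iter

# EDG of Figure 6-b) (ω=4, n=5, i0=1): delay = [0, 1, 2, 3] and
# first_iter = [1, 2, 3, 4], matching Figure 8.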

In Lemma 3 we show that the generated schedule using the computed delays is valid because
it preserves all the data dependences.

Lemma 3: Given a Rsch of weight ω for scheduling a loop, where the most restrictive
recurrence of the extended dependence graph is the Rsch, the schedule generated after applying
the algorithm of Figure 7 preserves all the synchronization edges.

For the EDG of Figure 6-b) we have the initiation delays and the schedule shown in Figure 8.

t        0  1  2  3
delay[t] 0  1  2  3
iter[t]  1  2  3  4

Figure 8: Delays and schedule for the EDG of Figure 6-b); operations A..E are issued on the four functional units with the initiation delays given above.

5 Interpretation of Other VLIW Schedules
In this section we show how some Software Pipelining techniques can be modeled using one or
several Rsch, with the objective of showing that GTS can be seen as a technique that generalizes
all other techniques. Software Pipelining schedules result in a horizontal periodic pattern while a
Rsch models a vertical periodic pattern. Any schedule can always be interpreted as one or more scheduling recurrences in the EDG by connecting with dummy edges the nodes that correspond to the operations that each functional unit periodically executes. These dummy edges will form a
number of recurrences that can be used as scheduling recurrences. Hence, the schedule generated
with other techniques can always be obtained with the proposed technique if the appropriate Rsch
is used. The idea is to show that the optimality of a scheduling depends on the heuristics used for
finding a Rsch, but once a Rsch is found the problem is solved. Using this approach, the scheduling
problem is reduced to a graph problem. Comparative studies between different heuristics can be
easily done if the schedule is represented as a Rsch in the DG.
We show with an example how the unrolling of the horizontal pattern and its manipulation can lead to the vertical periodic pattern we look for. Suppose the DG shown in Figure 9, which
has a Πs equal to 1. The minimum number of resources needed to execute the loop with full
parallelism is 3.

Figure 9: DG of a loop (three statements, Πs = 1).

Different techniques may schedule this loop in several ways. Figure 10 presents three valid
Software Pipelining schedules for this loop. The first one can be achieved using Perfect Pipelining [AiNi88] under no resource constraints. The second one can be obtained with Circular Scheduling [Jain91]; for this loop this is the optimal schedule, since it uses the minimum number of resources and fully exploits the inherent parallelism. If only two resources are available, the optimal schedule can be generated by unrolling the loop body twice and then applying Modulo Scheduling [RaGl81]. The VLIW code generated is shown in Figure 10-c). Observe that in this case the inherent parallelism of the loop cannot be exploited due to resource constraints.
Figure 10: VLIW Schedulings. a) greedy schedule; b) Circular Scheduling, whose core executes (S3i-1 S2i-1 S1i) for i = 3..N; c) schedule under resource constraints, with the loop body unrolled twice. Each schedule comprises a prologue, a loop core and an epilogue.

The method used to transform the horizontal pattern generated by a Software Pipelining technique into the vertical patterns represented in the EDG is described with an example. Consider
the loop schedule of Figure 10-b) and its iteration space generated in Figure 11-b1) after unrolling
the horizontal pattern a number of times. Each column of the iteration space follows a periodic
pattern. The three Rsch of Figure 11-c1) model these three vertical patterns. Furthermore, the
iteration space of Figure 11-b2) is equivalent to the one in Figure 11-b1) if we move statements on
the same long-instruction from one functional unit to another. Figure 11-b2) shows in dashed the
vertical pattern that models the schedule. Here, the Rsch found is a hamiltonian recurrence.

Figure 11: Transforming the horizontal to a vertical pattern. a) the loop core DO i=3,N: S3i-1 S2i-1 S1i ENDDO; b1) its iteration space after unrolling the horizontal pattern; c1) the three Rsch that model the three vertical patterns of b1); b2) an equivalent iteration space obtained by moving statements of the same long-instruction across functional units, with the vertical pattern shown dashed; c2) the resulting hamiltonian Rsch.

A possible vertical interpretation of the three schedules of Figure 10 is presented with the three
EDGs of Figure 12. Edges in solid line are the edges that belong to the Rsch while dashed line
edges (synchronizing edges) are data dependences across functional units. The weight of the Rsch
and the parallelism of each schedule are also shown together with the EDGs. In Figure 12-a) one
empty e-node has been introduced in the Rsch. This node represents the empty slots found in the
schedule of Figure 10-a). Thus, the vertical pattern modeled with this Rsch is {S1i, e, S3i-1, S2i+2}
incrementing i by 4 each time.

Figure 12: Modeling schedules with a Rsch. Solid edges belong to the Rsch; dashed edges are synchronizing edges. a) w(Rsch)=4, Πs=1; b) w(Rsch)=3, Πs=1; c) w(Rsch)=2, Πs=2/3.

6 Conclusions

In this paper a novel loop scheduling technique for VLIW architectures has been presented. The
starting point of the method is the dependence graph of the loop extended with a hamiltonian
recurrence, Rsch, used for scheduling purposes. The technique proposed generates VLIW code
automatically from this extended graph. It is simple to implement and its computational cost is
quadratic in the size of the graph. If |V| is the number of nodes of the graph and |E| the number of
edges, then the cost of the method is as follows: The computation of Πs is O(|V||E|) using the
algorithm presented in [Karp78]; e and u are computed in O(1); the allocation of the empty nodes in the Rsch•u can be done in O(u²|V||E|) using a variant of Karp's algorithm to find the non-restrictive edges of the unrolled Rsch where empties can be inserted; the delay computation
algorithm and the generation of the VLIW code have a constant cost because they do not depend
on the size of the graph.
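For reference, the O(|V||E|) computation of Πs can be sketched with Karp's minimum cycle mean algorithm [Karp78] (our sketch; it uses the standard variant that starts all zero-length walk distances at 0, which is valid for the strongly connected dependence graphs considered here):

from fractions import Fraction

def min_cycle_mean(n_nodes, edges):
    # edges: (u, v, w) triples; returns min over cycles c of w(c)/|c|, i.e. Πs
    INF = float('inf')
    # D[k][v]: minimum weight of a walk of exactly k edges ending in v
    D = [[INF] * n_nodes for _ in range(n_nodes + 1)]
    D[0] = [0] * n_nodes
    for k in range(1, n_nodes + 1):
        for (u, v, w) in edges:
            if D[k - 1][u] + w < D[k][v]:
                D[k][v] = D[k - 1][u] + w
    best = None
    for v in range(n_nodes):
        if D[n_nodes][v] == INF:
            continue
        worst = max(Fraction(D[n_nodes][v] - D[k][v], n_nodes - k)
                    for k in range(n_nodes) if D[k][v] < INF)
        best = worst if best is None else min(best, worst)
    return best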
The code generated fully exploits the inherent parallelism of the loop even for non-integral values of it. Most compilers do Software Pipelining setting the MII to the ceiling of its computed lower bound, losing a fractional part of the parallelism. In order to achieve the optimal rational MII = a/b, the loop must be (horizontally) unrolled b times before scheduling the operations, thus increasing the size of an NP-Complete problem by a factor of b. In the approach
presented here, (vertical) unrolling of the Rsch is sometimes needed, but the unrolling factor required is always less than or equal to the horizontal unrolling factor b. More importantly, the problem size is incremented only after the NP-Complete part of the problem (the generation of the optimal Rsch) has been solved.
Another advantage of this approach is that all the information of the loop and the
characteristics of a certain schedule can be studied on the extended dependence graph without requiring its physical generation (the number of functional units required and the parallelism
achieved). Other techniques are based on an iterative process and heuristics to measure how far
from the optimal they are, but they ignore the characteristics of the final schedule until they
generate it.
Other scheduling techniques that result in a periodic pattern can be interpreted and analyzed
by means of GTS. The operations that each functional unit periodically executes can be represented with one or several scheduling recurrences that, together with the DG, form the EDG that models the schedule. This unique representation of any schedule by means of an EDG allows a detailed study of some characteristics of the schedule without the need of generating it. Comparisons between different schedules can be done by comparing their EDGs.
An algorithm that generates an optimal hamiltonian recurrence, from the point of view of
parallelism achieved, is presented in [BLAV95b]. The Rsch can also be found trying to optimize the lifetime of temporary variables in order to reduce register requirements. If functional units have specific functionalities, then several non-hamiltonian scheduling recurrences are needed, each traversing the nodes that can be executed by the corresponding functional unit. The method for automatic generation of VLIW code presented in this paper uses a single hamiltonian EDG. Extensions for EDGs with several non-hamiltonian scheduling recurrences are currently under study.

Acknowledgments
This work has been supported by the Ministry of Education of Spain under contracts TIC880/92
and TIC0429/95 and by the CEPBA (the European Center for Parallelism of Barcelona).

References
[AiNi88] A. Aiken and A. Nicolau. "Perfect Pipelining: A new Loop Parallelization
Technique", Proc. of the European Symp. on Programming, March 1988.
[ALTL91] E. Ayguadé, J. Labarta, J. Torres, J.M. LLabería and M. Valero, “Parallelism
Evaluation and Partitioning of Nested Loops for Shared Memory Multiprocessors”,
chap. 11 of Advances in Languages and Compilers for Parallel Processing, PITMAN,
1991.
[BLB94] C. Barrado, J. Labarta and P. Borensztejn, "Implementation of GTS", Parallel
Architectures and Languages Europe (PARLE), pp. 565-576, 1994.
[BLAV95a] C. Barrado, J. Labarta, E. Ayguadé and M. Valero. "Generating a periodic pattern for
VLIW". DAC/CEPBA Tech. Report. No.95/06.
[BLAV95b] C. Barrado, J. Labarta, E. Ayguadé and M. Valero. "Searching for an Optimal
Hamiltonian Scheduling Recurrence on a Dependence Graph". DAC/CEPBA Tech.
Report. No.95/11.
[Fish81] J.A. Fisher. "Trace Scheduling: A Technique for Global Microcode Compaction".
IEEE Transactions on Computers, Vol. C-30, No. 7, pp. 478-490, July 1981.
[GaJo79] M.R. Garey and D.S. Johnson. "Computers and Intractability: A Guide to the Theory
of NP-Completeness". W.H. Freeman and Company, 1979.
[HePa90] J. Hennessy, D. Patterson. "Computer Architecture, A Quantitative Approach".
Morgan Kaufmann Publishers Inc. 1990.
[Hane94] C. Hanen. "Study of a NP-hard cyclic scheduling problem: The recurrent job-shop".
European Journal of Operational Research 72, pp. 88-101, 1994.
[Jain91] S. Jain. "Circular Scheduling: A New Technique to Perform Software Pipelining".
ACM SIGPLAN’91 Conf. on Prog. Lang. Design and Implementation, pp. 219-228,
June 1991.
[Karp78] R. Karp. "A Characterization of the Minimum Cycle Mean in a Digraph". Discrete
Mathematics 23(1978), pp. 309-311. North-Holland Publishing Company, 1978.
[KrSa91] V.P. Krothapalli and P. Sadayappan. "Removal of Redundant Dependences in
DOACROSS Loops with Constant Dependences", IEEE Transactions on Parallel and
Distributed Systems, Vol. 2, No. 3, pp. 281-389, July 1991.
[Lam88] M. Lam. "Software Pipelining: An Effective Technique for VLIW Machines", Proc.
of the SIGPLAN’88 Conf. on Programming Language Design and Implementation,
pp. 318-328, June 1988.
[MiPa87] S. Midkiff and D. Padua. "Compiler Algorithms for Synchronization", IEEE
Transactions on Computers, Vol. C-36, No. 12, pp. 1485-1495, Dec. 1987.
[RaGl81] B.R. Rau and C.D. Glaeser. "Some Scheduling Techniques and an Easy Schedulable
Horizontal Architecture for High Performance Scientific Computing". Proc. 14th
Annual Microprogramming Workshop, Oct. 1981.
[RSP92] B.R. Rau, M.S. Schlansker, P.P. Tirumalai. "Code Generation Schema for Modulo
Scheduled Loops". IEEE Micro-25, pp.158-169, Sept 1992.
