EE663: Optimizing Compilers
Prof. R. Eigenmann
Purdue University
School of Electrical and Computer Engineering
Spring 2012
https://engineering.purdue.edu/~eigenman/ECE663/
EE663, Spring 2012 Slide 1
I. Motivation and Introduction:
Optimizing Compilers are in the Center of the
(Software) Universe
They translate increasingly advanced human interfaces (programming
languages) onto increasingly complex target machines
[Diagram] Today: human (programming) languages such as C, C++, Java, and Fortran
are translated onto workstations, multicores, and HPC systems.
Tomorrow (the grand challenge): problem specification languages translated onto
globally distributed/cloud resources.
Processors have multiple cores. Parallelization is a key optimization.
EE663, Spring 2012 Slide 2
Issues in Optimizing /
Parallelizing Compilers
The Goal:
• We would like to run standard (C, C++, Java,
Fortran) programs on common parallel
computers
leads to the following high-level issues:
• How to detect parallelism?
• How to map parallelism onto the machine?
• How to create a good compiler architecture?
EE663, Spring 2012 Slide 3
Detecting Parallelism
• Program analysis techniques
• Data dependence analysis
• Dependence removing techniques
• Parallelization in the presence of
dependences
• Runtime dependence detection
EE663, Spring 2012 Slide 4
Mapping Parallelism onto the
Machine
• Exploiting parallelism at many levels
– Multiprocessors and multi-cores (our focus)
– Distributed computers (clusters or global
networks)
– Heterogeneous architectures
– Instruction-level parallelism
– Vector machines
• Exploiting memory organizations
– Data placement
– Locality enhancement
– Data communication
EE663, Spring 2012 Slide 5
Architecting a Compiler
• Compiler generator languages and tools
• Internal representations
• Implementing analysis and transformation
techniques
• Orchestrating compiler techniques (when to
apply which technique)
• Benchmarking and performance evaluation
EE663, Spring 2004 Slide 6
Parallelizing Compiler Books and
Survey Papers
Books:
• Michael Wolfe: High-Performance Compilers for Parallel Computing (1996)
• Utpal Banerjee: several books on Data Dependence Analysis and Transformations
• Ken Kennedy, John Allen: Optimizing Compilers for Modern Architectures: A
Dependence-based Approach (2001)
• Zima, H. and Chapman, B., Supercompilers for parallel and vector computers (1990)
• Scheduling and automatic Parallelization, Darte, A., Robert Y., and Vivien, F., (2000)
Survey Papers:
• Rudolf Eigenmann and Jay Hoeflinger, Parallelizing and Vectorizing Compilers, Wiley
Encyclopedia of Electrical Engineering, John Wiley & Sons, Inc., 2001
• Utpal Banerjee, Rudolf Eigenmann, Alexandru Nicolau, and David Padua. Automatic
Program Parallelization. Proceedings of the IEEE, 81(2), February 1993.
• David F. Bacon, Susan L. Graham, and Oliver J. Sharp, Compiler transformations for
high-performance computing, ACM Computing Surveys (CSUR), Volume 26, Issue 4,
December 1994, pages 345-420.
EE663, Spring 2012 Slide 7
Course Approach
There are many schools on optimizing compilers.
Our approach is performance-driven.
We will discuss:
– Performance of parallelization techniques
– Analysis and Transformation techniques in the
Cetus compiler (for multiprocessors/cores)
– Additional transformations (for GPGPUs and other
architectures)
– Compiler infrastructure considerations
EE663, Spring 2012 Slide 8
The Heart of Automatic
Parallelization
Data Dependence Testing
If a loop does not have data dependences
between any two iterations then it can be
safely executed in parallel
In science/engineering applications, loop
parallelism is most important. In non-
numerical programs other control structures
are also important
EE663, Spring 2012 Slide 9
Data Dependence Tests:
Motivating Examples
Loop Parallelization: Can the iterations of this loop be run concurrently?
   DO i=1,100,2
      B(2*i) = ...
      ... = B(2*i) + B(3*i)
   ENDDO
DD testing is needed to detect parallelism.

Statement Reordering: Can these two statements be swapped?
   DO i=1,100,2
      B(2*i) = ...
      ... = B(3*i)
   ENDDO
DD testing is important not just for detecting parallelism.
A data dependence exists between two adjacent data references iff:
• both references access the same storage location and
• at least one of them is a write access
EE663, Spring 2012 Slide 10
Data Dependence Tests: Concepts
Terms for data dependences between statements of loop iterations.
• Distance (vector): indicates how many iterations apart are source
and sink of dependence.
• Direction (vector): is basically the sign of the distance. There are
different notations: (<,=,>) or (-1,0,+1) meaning dependence (from
earlier to later, within the same, from later to earlier) iteration.
• Loop-carried (or cross-iteration) dependence vs. non-loop-carried
(or loop-independent) dependence: indicates whether a dependence
exists across iterations or within one iteration.
– For detecting parallel loops, only cross-iteration dependences matter.
– Loop-independent ("=" direction) dependences are relevant for optimizations
such as statement reordering and loop distribution.
EE663, Spring 2004 Slide 11
Data Dependence Tests: Concepts
• Iteration space graphs: the un-abstracted form of a dependence
graph, with one node per statement instance.
Example:
   DO i=1,n
      DO j=1,m
         a(i,j) = a(i-1,j-2)+b(i,j)
      ENDDO
   ENDDO
Each iteration (i,j) reads the element written in iteration (i-1,j-2), so the
dependence has distance vector (1,2) and direction vector (<,<).
[Iteration space graph: nodes laid out along the i and j axes; arrows show the
dependences and the sequential execution order.]
EE663, Spring 2004 Slide 12
Data Dependence Tests:
Formulation of the
Data-dependence problem
   DO i=1,n
      a(4*i) = . . .
      . . . = a(2*i+1)
   ENDDO
The question to answer:
   can 4*i1 ever be equal to 2*i2+1 within i1, i2 ∈ [1,n] ?
Note that the iterations at which the two expressions are equal
may differ. To express this fact, we choose the notation i1, i2.
Let us generalize a bit: given
• two subscript functions f and g, and
• loop bounds lower, upper,
Does
f(i1) = g(i2) have a solution such that
lower ≤ i1, i2 ≤ upper ?
EE663, Spring 2004 Slide 13
This course would now be finished if:
• the mathematical formulation of the data dependence
problem had an accurate and fast solution, and
• there were enough loops in programs without data
dependences, and
• dependence-free code could be executed by today’s
parallel machines directly and efficiently, and
• engineering these techniques into a production
compiler were straightforward.
There are enough hard problems to fill several courses!
EE663, Spring 2012 Slide 14
II. Performance of Basic
Automatic Program
Parallelization
EE663, Spring 2012 Slide 15
Two Decades of Parallelizing
Compilers
A performance study at the beginning of the 1990s (the Blume study)
Analyzed the performance of state-of-the-art parallelizers and
vectorizers using the Perfect Benchmarks.
William Blume and Rudolf Eigenmann, Performance Analysis of
Parallelizing Compilers on the Perfect Benchmarks Programs, IEEE
Transactions on Parallel and Distributed Systems, 3(6), November 1992,
pages 643--656.
Good reasons for starting two decades back:
• We will learn simple techniques first.
• We will see how parallelization techniques have evolved
• We will see that extensions of the important techniques back then are still the
important techniques today.
EE663, Spring 2012 Slide 16
Overall Performance
of parallelizers in 1990
Speedup on
8 processors
with 4-stage
vector units
EE663, Spring 2012 Slide 17
Performance of Individual Techniques
EE663, Spring 2012 Slide 18
Transformations measured in
the “Blume Study”
• Scalar expansion
• Reduction parallelization
• Induction variable substitution
• Loop interchange
• Forward Substitution
• Stripmining
• Loop synchronization
• Recurrence substitution
EE663, Spring 2012 Slide 19
Scalar Expansion and Privatization

Serial loop with output, flow, and anti dependences on t:
   DO j=1,n
      t = a(j)+b(j)
      c(j) = t + t**2
   ENDDO

Privatization:
   DO PARALLEL j=1,n
      PRIVATE t
      t = a(j)+b(j)
      c(j) = t + t**2
   ENDDO

Expansion:
   DO PARALLEL j=1,n
      t0(j) = a(j)+b(j)
      c(j) = t0(j) + t0(j)**2
   ENDDO

We assume a shared-memory model:
• by default, data is shared, i.e., all processors can see and modify it
• processors share the work of parallel loops
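A minimal C/OpenMP sketch of the same privatization, assuming arrays a, b, c of
length n (all names are illustrative); making t private removes the loop-carried
anti and output dependences:

   #include <omp.h>

   void privatized(const double *a, const double *b, double *c, int n)
   {
       double t;
       /* t is private to each thread, so no two iterations share it */
       #pragma omp parallel for private(t)
       for (int j = 0; j < n; j++) {
           t = a[j] + b[j];
           c[j] = t + t * t;
       }
   }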
EE663, Spring 2012 Slide 20
Parallel Loop Syntax and
Semantics in OpenMP
C:
   #pragma omp parallel for
   for (i=lb; i<=ub; i++) {
      <loop body code>
   }

Fortran:
   !$OMP PARALLEL PRIVATE(<private data>)
   <preamble code>
   !$OMP DO
   DO i = lb, ub
      <loop body code>
   ENDDO
   !$OMP END DO
   <postamble code>
   !$OMP END PARALLEL

The same code is executed by all participating processors (threads);
the work (iterations) of the loop is shared by the participating threads.
EE663, Spring 2012 Slide 21
Reduction Parallelization
Serial loop with a loop-carried flow (and anti) dependence on sum:
   DO j=1,n
      sum = sum + a(j)
   ENDDO

Using the OpenMP reduction clause:
   !$OMP PARALLEL DO
   !$OMP+REDUCTION(+:sum)
   DO j=1,n
      sum = sum + a(j)
   ENDDO

Transformed (privatized) reduction:
   !$OMP PARALLEL PRIVATE (s)
   s=0
   !$OMP DO
   DO j=1,n
      s = s + a(j)
   ENDDO
   !$OMP END DO
   !$OMP ATOMIC
   sum=sum+s
   !$OMP END PARALLEL
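For comparison, a hedged C/OpenMP sketch of the same reduction using the
reduction clause (array a and length n are illustrative):

   double sum_reduction(const double *a, int n)
   {
       double sum = 0.0;
       /* the compiler/runtime creates the private partial sums and the
          final combining step shown above */
       #pragma omp parallel for reduction(+:sum)
       for (int i = 0; i < n; i++)
           sum = sum + a[i];
       return sum;
   }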
EE663, Spring 2012 Slide 22
Induction Variable Substitution
Serial loop with a loop-carried flow dependence on ind:
   ind = ind0
   DO j = 1,n
      a(ind) = b(j)
      ind = ind+k
   ENDDO

After substitution:
   ind = ind0
   DO PARALLEL j = 1,n
      a(ind0+k*(j-1)) = b(j)
   ENDDO

Note, this is the reverse of strength reduction, an important
transformation in classical (code-generating) compilers:
   real d(20,100)
   DO j=1,n
      d(1,j)=0
   ENDDO

Address computation in closed form:
   loop:
      ...
      R0 ← &d+20*j
      (R0) ← 0
      ...
      jump loop

After strength reduction:
   R0 ← &d
   loop:
      ...
      (R0) ← 0
      ...
      R0 ← R0+20
      jump loop
EE663, Spring 2012 Slide 23
Forward Substitution
Before (dependences through m, a, and b):
   m = n+1
   …
   DO j=1,n
      a(j) = a(j+m)
   ENDDO

   a = x*y
   b = a+2
   c = b+4

After forward substitution (no dependences):
   m = n+1
   …
   DO j=1,n
      a(j) = a(j+n+1)
   ENDDO

   a = x*y
   b = x*y+2
   c = x*y + 6
EE663, Spring 2012 Slide 24
Stripmining
[The iteration space 1..n is divided into strips of size strip.]

Original loop:
   DO j=1,n
      a(j) = b(j)
   ENDDO

Stripmined loop:
   DO i=1,n,strip
      DO j=i,min(i+strip-1,n)
         a(j) = b(j)
      ENDDO
   ENDDO

There are many variants of stripmining
(sometimes called loop blocking)
EE663, Spring 2012 Slide 25
Loop Synchronization
Serial loop:
   DO j=1,n
      a(j) = b(j)
      c(j) = a(j)+a(j-1)
   ENDDO

With loop synchronization:
   DOACROSS j=1,n
      a(j) = b(j)
      post(current_iteration)
      wait(current_iteration-1)
      c(j) = a(j)+a(j-1)
   ENDDO
EE663, Spring 2012 Slide 26
Recurrence Substitution
   DO j=1,n
      a(j) = c0+c1*a(j)+c2*a(j-1)+c3*a(j-2)
   ENDDO
is replaced by a call to a parallel recurrence solver:
   call rec_solver(a(1),n,c0,c1,c2,c3)

Basic idea of the recurrence solver, shown for
   DO j=1,40
      a(j) = a(j) + a(j-1)
   ENDDO
The iteration space is split into chunks that are solved concurrently:
   DO j=1,10 | DO j=11,20 | DO j=21,30 | DO j=31,40
and the chunks are then corrected for the error introduced at their boundaries:
   Error: 0 | ∆a(10) | ∆a(10)+∆a(20) | ∆a(10)+∆a(20)+∆a(30)
EE663, Spring 2012 Slide 27
Loop Interchange
Before:
   DO i=1,n
      DO j=1,m
         a(i,j) = a(i,j)+a(i,j-1)
      ENDDO
   ENDDO

After interchange:
   DO j=1,m
      DO i=1,n
         a(i,j) = a(i,j)+a(i,j-1)
      ENDDO
   ENDDO
• stride-1 references increase cache locality
– read: increase spatial locality
– write: avoid false sharing
• scheduling of outer loop is important (consider original loop nest):
– cyclic: no locality w.r.t. the i loop
– block schedule: there may be some locality
– dynamic scheduling: chunk scheduling desirable
• cache organization is important
• parallelism at outer position reduces loop fork/join overhead
EE663, Spring 2012 Slide 28
Effect of Loop Interchange
Example: speedups of the most time-consuming loops
in the ARC2D benchmark on a 4-core machine, with
loop interchange applied in the process of parallelization.
[Bar chart of speedups (0 to 10) for the loops STEPFX DO230, STEPFX DO210,
XPENTA DO11, and FILERX DO39.]
EE663, Spring 2012 Slide 29
Execution Scheme for Parallel Loops
1. Architecture supports parallel loops. Example: Alliant FX/8 (1980s)
   – machine instruction for parallel loop
   – HW concurrency bus supports loop scheduling

Source:
   a=0
   ! DO PARALLEL
   DO i=1,n
      b(i) = 2
   ENDDO
   b=3

Generated code (D7 is reserved for the loop variable; it starts at 0):
   store #0,<a>
   load <n>,D6
   sub 1,D6
   load &b,A1
   cdoall D6
      store #2,A1(D7.r)
   endcdoall
   store #3,<b>
EE663, Spring 2012 Slide 30
Execution Scheme for Parallel Loops
2. Microtasking scheme (dates back to early IBM mainframes)

[Diagram: processors p1–p4 alternate between sequential sections (only p1 works)
and parallel sections; helper tasks are created once (init_helper_tasks), woken up
at the start of each parallel loop (wakeup_helpers), and put back to sleep at its
end (sleep_helpers).]

Problem: parallel loop startup must be very fast.
   microtask startup: ~1 µs
   pthreads startup: up to 100 µs
EE663, Spring 2012 Slide 31
Compiler Transformation and Runtime
Function for the Microtasking Scheme
Source:
   a=0
   ! DO PARALLEL
   DO i=1,n
      b(i) = 2
   ENDDO
   b=3

Transformed code:
   call init_microtasking()   // once at program start
   ...
   a=0
   call loop_scheduler(loopsub,i,1,n,b)
   b=3

   subroutine loopsub(mytask,lb,ub,b)
   DO i=lb,ub
      b(i) = 2
   ENDDO
   END

Runtime functions:
   Master task (loop_scheduler):
      partition loop iterations
      fill in the control blocks (shared data): loopsub, lb, ub, sh_var, flag
      wakeup helpers
      call loopsub(...)
      barrier (all flags reset)
      return

   Helper task:
      loop:
         wait for flag
         call loopsub(id,lb,ub,sh_var)
         reset flag
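A hedged C/pthreads sketch of such a microtasking-style runtime. The helpers are
created once and woken per parallel loop by setting a flag; all names
(NUM_HELPERS, loopsub, loop_scheduler, ...) are illustrative, and the busy-wait
synchronization is simplified compared to a production scheduler:

   #include <pthread.h>
   #include <stdatomic.h>
   #include <stdio.h>

   #define NUM_HELPERS 3                 /* master + 3 helpers = 4 workers */
   #define N 100

   static double b[N + 1];

   static void loopsub(int lb, int ub)   /* compiler-generated loop body   */
   {
       for (int i = lb; i <= ub; i++)
           b[i] = 2.0;
   }

   /* control blocks (shared data) */
   static atomic_int go[NUM_HELPERS];    /* wakeup flag per helper         */
   static atomic_int done_count;         /* helpers done with their chunk  */
   static int lbs[NUM_HELPERS], ubs[NUM_HELPERS];

   static void *helper(void *arg)
   {
       int id = *(int *)arg;
       for (;;) {                        /* helpers live for the whole run */
           while (!atomic_load(&go[id])) /* wait for flag                  */
               ;
           atomic_store(&go[id], 0);     /* reset flag                     */
           loopsub(lbs[id], ubs[id]);
           atomic_fetch_add(&done_count, 1);
       }
       return NULL;
   }

   static void loop_scheduler(int lb, int ub)      /* called by the master */
   {
       int total = ub - lb + 1;
       int chunk = (total + NUM_HELPERS) / (NUM_HELPERS + 1);
       atomic_store(&done_count, 0);
       int next = lb;
       for (int id = 0; id < NUM_HELPERS; id++) {  /* partition iterations */
           lbs[id] = next;
           ubs[id] = (next + chunk - 1 > ub) ? ub : next + chunk - 1;
           next = ubs[id] + 1;
           atomic_store(&go[id], 1);               /* wakeup helper        */
       }
       loopsub(next, ub);                /* master executes the last chunk */
       while (atomic_load(&done_count) < NUM_HELPERS)   /* barrier         */
           ;
   }

   int main(void)
   {
       pthread_t t[NUM_HELPERS];
       int ids[NUM_HELPERS];
       for (int id = 0; id < NUM_HELPERS; id++) {  /* init_microtasking()  */
           ids[id] = id;
           pthread_create(&t[id], NULL, helper, &ids[id]);
       }
       loop_scheduler(1, N);             /* the parallel loop DO i=1,n     */
       printf("b[50] = %g\n", b[50]);
       return 0;
   }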
EE663, Spring 2012 Slide 32
III. Performance of Advanced
Parallelization
EE663, Spring 2012 Slide 33
Manual Improvements of the
Perfect Benchmarks (1995)
Same information as on Slide 17.

Rudolf Eigenmann, Jay Hoeflinger, and David Padua,
On the Automatic Parallelization of the Perfect Benchmarks.
IEEE Transactions on Parallel and Distributed Systems,
volume 9, number 1, January 1998, pages 5-23.

[Bar chart of speedups; annotations: (a) eliminated file I/O,
(b) parallelized random number generator.]
EE663, Spring 2012 Slide 34
Performance of Individual
Techniques in Manually
Improved Programs (1995)
Performance loss when disabling individual techniques (Cedar machine)
EE663, Spring 2012 Slide 35
Overall Performance of the
Cetus and ICC Compilers (2011)
[Bar chart of speedups]
NAS (Class A) Benchmarks on 8-core x86 processor
EE663, Spring 2012 Slide 36
Performance of Individual
Cetus Techniques (2011)
[Bar chart of speedups]
NAS Benchmarks (Class A) on 8-core x86 processor
EE663, Spring 2012 Slide 37
IV. Analysis and
Transformation Techniques
• 1 Data-dependence analysis
• 2 Parallelism enabling transformations
• 3 Techniques for multiprocessors/multicores
• 4 Advanced program analysis
• 5 Dynamic decision making
• 6 Techniques for vector architectures
• 7 Techniques for heterogeneous multicores
• 8 Techniques for distributed-memory machines
EE663, Spring 2012 Slide 38
IV.1 Data Dependence Testing
Earlier, we have considered the simple case of a
1-dimensional array enclosed by a single loop:
   DO i=1,n
      a(4*i) = . . .
      . . . = a(2*i+1)
   ENDDO
The question to answer:
   can 4*i1 ever be equal to 2*i2+1 within i1, i2 ∈ [1,n] ?
In general: given
• two subscript functions f and g and
• loop bounds lower, upper.
Does
f(i1) = g(i2) have a solution such that
lower ≤ i1, i2 ≤ upper ?
EE663, Spring 2012 Slide 39
DDTests: doubly-nested loops
• Multiple loop indices:
DO i=1,n
DO j=1,m
X(a1*i + b1*j + c1) = . . .
. . . = X(a2*i + b2*j + c2)
ENDDO
ENDDO
dependence problem:
a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1
1 ≤ i1, i2 ≤ n
1 ≤ j1, j2 ≤ m
Almost all DD tests expect the coefficients ax to be integer constants.
Such subscript expressions are called affine.
EE663, Spring 2012 Slide 40
DDTests: even more complexity
• Multiple loop indices, multi-dimensional array:
DO i=1,n
DO j=1,m
X(a1*i + b1*j + c1, d1*i + e1*j + f1) = . . .
. . . = X(a2*i + b2*j + c2, d2*i + e2*j + f2)
ENDDO
ENDDO
dependence problem:
a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1
d1*i1 - d2*i2 + e1*j1 - e2*j2 = f2 - f1
1 ≤ i1, i2 ≤ n
1 ≤ j1, j2 ≤ m
EE663, Spring 2012 Slide 41
Data Dependence Tests:
The Simple Case
Note: variables i1, i2 are integers → diophantine equations.
The equation a*i1 - b*i2 = c has a solution if and only if
gcd(a,b) (evenly) divides c.
In our example this means: gcd(4,2)=2, which does not
divide 1, and thus there is no dependence.
If there is a solution, we can test if it lies within the loop
bounds. If not, then there is no dependence.
EE663, Spring 2012 Slide 42
Performing the GCD Test
• The diophantine equation
a1*i1 + a2*i2 +...+ an*in = c
has a solution iff gcd(a1,a2,...,an) evenly divides c
Examples:
15*i +6*j -9*k = 12 has a solution gcd=3
2*i + 7*j = 3 has a solution gcd=1
9*i + 3*j + 6*k = 5 has no solution gcd=3
Euclid's algorithm: find gcd(a,b)
   Repeat
      a ← a mod b
      swap a,b
   Until b=0   → the resulting a is the gcd
For more than two numbers: gcd(a,b,c) = gcd(a,gcd(b,c))
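A minimal C sketch of the GCD test under the formulation above (the coefficient
names a1, a2 and the example values are illustrative):

   #include <stdio.h>
   #include <stdlib.h>

   static int gcd(int a, int b)              /* Euclid's algorithm */
   {
       a = abs(a); b = abs(b);
       while (b != 0) { int t = a % b; a = b; b = t; }
       return a;
   }

   /* Returns 1 if a1*i1 - a2*i2 = c can have an integer solution
      (a dependence is possible), 0 if the GCD test proves independence. */
   static int gcd_test(int a1, int a2, int c)
   {
       int g = gcd(a1, a2);
       if (g == 0) return c == 0;            /* both coefficients are zero */
       return c % g == 0;
   }

   int main(void)
   {
       /* a(4*i) = ...  versus  ... = a(2*i+1):  4*i1 - 2*i2 = 1 */
       printf("dependence possible: %d\n", gcd_test(4, 2, 1));   /* prints 0 */
       return 0;
   }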
EE663, Spring 2012 Slide 43
Other Data Dependence Tests
• The GCD test is simple but not accurate
• Other tests
– Banerjee(-Wolfe) test: widely used test
– Power Test: improvement over Banerjee test
– Omega test: “precise” test, most accurate for
linear subscripts
– Range test: handles non-linear and symbolic
subscripts
– many variants of these tests
EE663, Spring 2012 Slide 44
The Banerjee(-Wolfe) Test
Basic idea:
if the total subscript range accessed by ref1
does not overlap with the range accessed
by ref2, then ref1 and ref2 are
independent.
   DO j=1,100
      a(j) = …          ranges accessed: [1:100]
      … = a(j+200)                       [201:300]
   ENDDO
The ranges do not overlap ⇒ independent.
EE663, Spring 2012 Slide 45
Mathematical Formulation of the
Test – Banerjee’s Inequalities
Example from the previous slide: a dependence requires j1 - j2 = 200, but over
1 ≤ j1, j2 ≤ 100 the difference is bounded by
   Min: 1-100 = -99     Max: 100-1 = 99
so the equation has no solution within the bounds.

The general case of a doubly-nested loop and a single subscript, as shown on Slide 40:
   a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1
Assuming positive coefficients:
   Min of a1*i1 - a2*i2 : a1 - a2*n      Max: a1*n - a2
   Min of b1*j1 - b2*j2 : b1 - b2*m      Max: b1*m - b2
A dependence is possible only if c2 - c1 lies between the sum of the Mins
and the sum of the Maxs.

Multiple dimensions: apply the test separately on each subscript, or linearize.
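A hedged C sketch of this bounds check for the two-loop, single-subscript case,
ignoring direction vectors (helper names term_min/term_max and
banerjee_independent are illustrative):

   static long term_min(long coef, long lb, long ub)  /* min of coef*x on [lb,ub] */
   {
       return coef >= 0 ? coef * lb : coef * ub;
   }
   static long term_max(long coef, long lb, long ub)  /* max of coef*x on [lb,ub] */
   {
       return coef >= 0 ? coef * ub : coef * lb;
   }

   /* Returns 1 if the bounds prove independence for
      a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1, 1<=i1,i2<=n, 1<=j1,j2<=m. */
   static int banerjee_independent(long a1, long a2, long b1, long b2,
                                   long c1, long c2, long n, long m)
   {
       long lo = term_min(a1,1,n) + term_min(-a2,1,n)
               + term_min(b1,1,m) + term_min(-b2,1,m);
       long hi = term_max(a1,1,n) + term_max(-a2,1,n)
               + term_max(b1,1,m) + term_max(-b2,1,m);
       long rhs = c2 - c1;
       return rhs < lo || rhs > hi;
   }

For the a(j)/a(j+200) example above, banerjee_independent(1,1,0,0, 0,200, 100,1)
computes the bounds [-99, 99] and reports independence.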
EE663, Spring 2012 Slide 46
Banerjee(-Wolfe) Test continued
Weakness of the test:
Consider this flow dependence
   DO j=1,100
      a(j) = …          ranges accessed: [1:100]
      … = a(j+5)                         [6:105]
   ENDDO
The ranges overlap, so the test cannot report independence — yet is there
really a dependence that matters?

We did not take into consideration that only loop-carried
dependences matter for parallelization.
A loop-carried flow dependence only exists if a write in some
iteration, j1, conflicts with a read in some later iteration, j2 > j1.
EE663, Spring 2012 Slide 47
Using Dependence Direction Information
in the Banerjee(-Wolfe) Test
Idea for overcoming the weakness:
for loop-carried dependences, make use of the fact
that j in ref2 is greater than j in ref1.

Still considering the potential flow dependence from a(j) to a(j+5):
   DO j=1,100
      a(j) = …
      … = a(j+5)
   ENDDO
Ranges accessed by iteration j1 and any other iteration j2, where j1 < j2:
   a(j) in iteration j1:             [j1]
   a(j+5) in iterations j2 > j1:     [j1+6:105]
⇒ independent for the “>” direction

Clearly, this loop has a dependence. But it is
an anti-dependence from a(j+5) to a(j).
This is commonly referred to as the
Banerjee test with direction vectors.
EE663, Spring 2012 Slide 48
DD Testing with Direction Vectors
Considering direction vectors can increase the complexity of the DD test
substantially. For long vectors (corresponding to deeply-nested
loops), there are many possible combinations of directions.
A direction vector (d1,d2,…,dn) has one entry per loop of the nest; each entry
di is one of <, =, > or * (meaning "any direction"). The possible vectors form
a tree rooted at (*,*,…,*).
A possible algorithm:
1. try (*,*…*) , i.e., do not consider directions
2. (if not independent) try (<,*,*…*), (=,*,*…*)
3. (if still not independent) try (<,<,*…*),(<,>,*…*) ,(<,=,*…*)
(=,=,*…*), (=,<,*…*)
...
(This forms a tree)
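A hedged, self-contained C sketch of this tree refinement. The underlying test
dd_test() is a stub standing in for, e.g., a Banerjee test restricted to a
direction vector, and all names are illustrative:

   #include <stdio.h>

   enum Dir { ANY, LT, EQ, GT };             /* '*', '<', '=', '>' */

   /* Stand-in for the real test; the stub always answers "dependence possible"
      so that the traversal of the tree is visible when running the sketch.   */
   static int dd_test(const enum Dir dv[], int n) { (void)dv; (void)n; return 1; }

   static void report_dependence(const enum Dir dv[], int n)
   {
       static const char sym[] = { '*', '<', '=', '>' };
       for (int i = 0; i < n; i++) putchar(sym[dv[i]]);
       putchar('\n');
   }

   static void refine(enum Dir dv[], int level, int n)
   {
       if (!dd_test(dv, n)) return;          /* independent under dv: prune subtree */
       if (level == n) { report_dependence(dv, n); return; }
       /* try '<', '=', '>' at this level (a real driver can skip the redundant
          '>' branch at the outermost level, as in the algorithm above) */
       for (int d = LT; d <= GT; d++) {
           dv[level] = (enum Dir)d;
           refine(dv, level + 1, n);
       }
       dv[level] = ANY;                      /* restore '*' for the caller */
   }

   int main(void)
   {
       enum Dir dv[2] = { ANY, ANY };        /* doubly-nested loop: start at (*,*) */
       refine(dv, 0, 2);                     /* prints the 9 fully refined vectors */
       return 0;
   }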
EE663, Spring 2012 Slide 49
Data-dependence Test Driver
procedure DataDependenceAnalysis( PROG )
input : Program representing all source files: PROG
output : Data dependence graph containing dependence arcs DDG
// Collect all FOR loops meeting eligibility
// Checks: Canonical, FunctionCall, ControlFlowModifier
ELIGIBLE_LOOPS = getOutermostEligibleLoops( PROG )
foreach LOOP in ELIGIBLE_LOOPS
// Obtain lower bounds, upper bounds and loop steps
// for this loop and all enclosed loops i.e. the loop-nest
// Substitute symbolic information if available,
LOOP_INFO = collectLoopInformation( LOOP and enclosed nest )
// Collect all array access expressions appearing within the
// body of this loop, this includes enclosed loops and non-perfectly
// nested statements
ACCESSES = collectArrayAccesses( LOOP and enclosed nest )
// Traverse all array accesses, test relevant pairs and
// create a set of dependence arcs for the loop-nest
LOOP_DDG = runDependenceTest( LOOP_INFO, ACCESSES )
// Add loop dependence graph to the program-wide DDG
// The program-wide DDG is initially empty
DDG += LOOP_DDG
// return the program-wide data dependence graph once all loops are done
return DDG
Slide 50
Data-dependence Test Driver (continued)
procedure runDependenceTest( LOOP_INFO, ACCESSES )
input : Loop information for the current loop nest LOOP_INFO
List of array access expressions, ACCESSES
output : Loop data dependence graph LOOP_DDG
foreach ARRAY_1 in ACCESSES of type write
// Obtain alias information i.e. aliases to this array name
// Alias information in Cetus is generated through points-to analysis
ALIAS_SET = getAliases( ARRAY_1 )
// Collect all expressions/references to the same array from the entire list of accesses
TEST_LIST = getOtherReferences( ALIAS_SET, ACCESSES )
foreach ARRAY_2 in TEST_LIST
// Obtain the common loops enclosing the pair
COMMON_NEST = getCommonNest( ARRAY_1, ARRAY_2 )
// Possibly empty set of direction vectors under which
// dependence exists is returned by the test
DV_SET = testAccessPair( ARRAY_1, ARRAY_2, COMMON_NEST, LOOP_INFO )
foreach DV in DV_SET
// Create arc from source to sink
DEP_ARC = buildDependenceArc( ARRAY_1, ARRAY_2, DV )
// Build the loop dependence graph by accumulating all arcs
LOOP_DDG += DEP_ARC
// All expressions have been tested, return the loop dependence graph
return LOOP_DDG
Slide 51
Data-dependence Test Driver (continued)
procedure testAccessPair( A1, A2, COMMON_NEST, LOOP_INFO)
input : Pair of array accesses to be tested A1 and A2
Nest of common enclosing loops COMMON_NEST
Information for these loops LOOP_INFO
output : Possibly empty set of direction vectors under
which dependence exists DV_SET
// Partition the subscripts of the array accesses into dimension pairs
// Coupled subscripts may be handled
PARTITIONS = partitionSubscripts( A1, A2, COMMON_NEST )
foreach PARTITION in PARTITIONS
// Depending on the number of loop index variables in the partition,
// use the corresponding test.
if( ZIV ) // zero index variables ZIV
DVs = simpleZIVTest( PARTITION )
else // single or multi-loop index variables: SIV, MIV
// traverse and prune over tree of direction vectors, collect DVs where
// dependence exists (traversal not shown here)
foreach DV in DV_TREE using prune
// In Cetus, the MIV test is performed using Banerjee or Range test
DVs += MIVTest( PARTITION, DV, COMMON_NEST, LOOP_INFO )
// Merge DVs for all partitions
DV_SET = merge( DVs )
return DV_SET
Slide 52
Non-linear and Symbolic DD Testing
Weakness of most data dependence tests:
subscripts and loop bounds must be affine,
i.e., linear with integer-constant coefficients
Approach of the Range Test:
capture subscript ranges symbolically
compare ranges: find their upper and lower bounds
by determining monotonicity. Monotonically
increasing/decreasing ranges can be compared by
comparing their upper and lower bounds.
EE663, Spring 2012 Slide 53
The Range Test
Basic idea :
1. Find the range of array accesses made in a given
loop iteration j => r(j).
2. If r(j) does not overlap with r(j+1) then there is no
cross-iteration dependence
Symbolic comparison of ranges r1 and r2:
max(r1)<min(r2) OR min(r1)>max(r2) => no overlap
Example: testing independence of the outer loop:
   DO i=1,n
      DO j=1,m
         A(i*m+j) = 0
      ENDDO
   ENDDO
Range of A accessed in iteration ix:    [ix*m+1 : (ix+1)*m]      (upper bound ubx)
Range of A accessed in iteration ix+1:  [(ix+1)*m+1 : (ix+2)*m]  (lower bound lbx+1)
ubx < lbx+1 ⇒ no cross-iteration dependence
EE663, Spring 2012 Slide 54
Range Test continued
(we need powerful expression manipulation and comparison utilities)

   DO i1=L1,U1
      ...
      DO in=Ln,Un
         A(f(i1,...,in)) = ...
         ... = A(g(i1,...,in))
      ENDDO
      ...
   ENDDO

Assume f,g are monotonically increasing w.r.t. all ix.
To find the upper bound of the access range at loop k, 1<k<n:
successively substitute ix with Ux, x = {n, n-1, ..., k-1};
the lower bound is computed analogously.
If f,g are monotonically decreasing w.r.t. some iy,
then substitute Ly when computing the upper bound.

Determining monotonicity: consider d = f(...,ik,...) - f(...,ik-1,...)
   If d>0 (for all values of ik) then f is monotonically increasing w.r.t. k
   If d<0 (for all values of ik) then f is monotonically decreasing w.r.t. k
(this is where we need range analysis)

What about symbolic coefficients?
• in many cases they cancel out
• if not, find their range (i.e., all possible values they can assume at this point
  in the program), and replace them by the upper or lower bound of the range.
EE663, Spring 2012 Slide 55
Handling Non-contiguous
Ranges
   DO i1=1,u1
      DO i2=1,u2
         A(n*i1+m*i2) = …
      ENDDO
   ENDDO

The basic Range Test finds independence of the outer loop
if n >= u2 and m=1, but not if n=1 and m >= u1.

Idea:
- temporarily (during program analysis) interchange the loops,
- test independence,
- interchange back
Issues:
• legality of loop interchanging,
• change of parallelism as a result of loop interchanging
EE663, Spring 2012 Slide 56
Some Engineering Tasks and
Questions for DD Test Pass Writers
- Start with the simple case: linear (affine) subscripts, single nests with 1-dim arrays. Subscript
and loop bounds are integer constants. Stride 1 loop, lower bound =1
- Deal with multiple array dims and loop nests
- Add capabilities for non-stride-1 loops and lower bounds ≠1
- How to deal with symbolic subscript coefficients and bounds
- Ignore dependences in private variables and reductions
- Generate DD vectors
- Mark parallel loops
- Things to think about:
-- how to handle loop-variant coefficients
-- how to deal with private, reduction, induction variables
-- how to represent DD information
-- how to display the DD info
-- how to deal with non-parallelizable loops (IO op, function calls, other?)
-- how to find eligible DO loops?
-- how to find eligible loop bounds, array subscripts?
-- what is the result of the pass? Generate DD info or set parallel loop flags?
-- what symbolic analysis capabilities are needed?
EE663, Spring 2012 Slide 57
Data-Dependence Test, References
• Banerjee/Wolfe test
– M. Wolfe, U. Banerjee, "Data Dependence and its Application to Parallel
Processing", Int. J. of Parallel Programming, Vol. 16, No. 2, pp. 137-178,
1987
• Power Test
– M. Wolfe and C.W. Tseng, The Power Test for Data Dependence, IEEE
Transactions on Parallel and Distributed Systems, IEEE Computer Society,
3(5), 591-601, 1992.
• Range test
– William Blume and Rudolf Eigenmann. Non-Linear and Symbolic Data
Dependence Testing, IEEE Transactions on Parallel and Distributed
Systems, Volume 9, Number 12, pages 1180-1194, December 1998.
• Omega test
– William Pugh. The Omega test: a fast and practical integer programming
algorithm for dependence analysis. Proceedings of the 1991 ACM/IEEE
Conference on Supercomputing, 1991.
• I Test
– Xiangyun Kong, David Klappholz, and Kleanthis Psarris, "The I Test: A New
Test for Subscript Data Dependence," Proceedings of the 1990 International
Conference on Parallel Processing, Vol. II, pages 204-211, August 1990.
EE663, Spring 2012 Slide 58
IV.2 Parallelism Enabling
Techniques
EE663, Spring 2012 Slide 59
Advanced Privatization
Scalar privatization (removes the loop-carried anti dependence on t):
   DO i=1,n
      t = A(i)+B(i)
      C(i) = t + t**2
   ENDDO
becomes
   !$OMP PARALLEL DO
   !$OMP+PRIVATE(t)
   DO i=1,n
      t = A(i)+B(i)
      C(i) = t + t**2
   ENDDO

Array privatization:
   DO j=1,n
      t(1:m) = A(j,1:m)+B(j)
      C(j,1:m) = t(1:m) + t(1:m)**2
   ENDDO
becomes
   !$OMP PARALLEL DO
   !$OMP+PRIVATE(t)
   DO j=1,n
      t(1:m) = A(j,1:m)+B(j)
      C(j,1:m) = t(1:m) + t(1:m)**2
   ENDDO
EE663, Spring 2012 Slide 60
Array Privatization
Examples requiring array privatization:
   k=5
   DO j=1,n
      t(1:10) = A(j,1:10)+B(j)
      C(j,iv) = t(k)
      t(11:m) = A(j,11:m)+B(j)
      C(j,1:m) = t(1:m)
   ENDDO

   DO j=1,n
      IF (cond(j))
         t(1:m) = A(j,1:m)+B(j)
         C(j,1:m) = t(1:m) + t(1:m)**2
      ENDIF
      D(j,1) = t(1)
   ENDDO

Capabilities needed for Array Privatization:
• array Def-Use Analysis
• combining and intersecting subscript ranges
• representing subscript ranges
• representing conditionals under which sections are defined/used
• if ranges are too complex to represent: overestimate Uses, underestimate Defs
EE663, Spring 2012 Slide 61
Array Privatization continued
Array privatization algorithm:
• For each loop nest:
– iterate from innermost to outermost loop:
• for each statement in the loop
– Find array definitions; add them to the existing
definitions in this loop.
– find array uses; if they are covered by a definition,
mark this array section as privatizable for this loop,
otherwise mark it as upward-exposed in this loop;
• aggregate defined and upward-exposed uses (expand
from range per-iteration to entire iteration space); record
them as Defs and Uses for this loop
EE663, Spring 2012 Slide 62
Some Engineering Tasks and
Questions for Privatization Pass Writers
• Start with scalar privatization
• Next step: array privatization with simple ranges (contiguous; no range
merge) and singly-nested loops
• Deal with multiply-nested loops (-> range aggregation)
• Add capabilities for merging ranges
• Implement advanced range representation (symbolic bounds, non-
contiguous ranges)
• Deal with conditional definitions and uses (too advanced for this course)
• Things to think about
– what symbolic analysis capabilities are needed?
– how to represent advanced ranges?
– how to deal with loop-variant subscript terms?
– how to represent private variables?
EE663, Spring 2012 Slide 63
Array Privatization,
References
• Peng Tu and D. Padua. Automatic Array Privatization.
Languages and Compilers for Parallel Computing. Lecture
Notes in Computer Science 768, U. Banerjee, D. Gelernter, A.
Nicolau, and D. Padua (Eds.), Springer-Verlag, 1994.
• Zhiyuan Li, Array Privatization for Parallel Execution of Loops,
Proceedings of the 1992 ACM International Conference on
Supercomputing.
EE663, Spring 2012 Slide 64
Reduction Parallelization

Scalar reduction (loop-carried flow dependence on sum):
   DO i=1,n
      sum = sum + A(i)
   ENDDO

Note, OpenMP has a reduction clause, so only reduction recognition is needed:
   !$OMP PARALLEL DO
   !$OMP+REDUCTION(+:sum)
   DO i=1,n
      sum = sum + A(i)
   ENDDO

Privatized reduction implementation:
   !$OMP PARALLEL PRIVATE(s)
   s=0
   !$OMP DO
   DO i=1,n
      s=s+A(i)
   ENDDO
   !$OMP ATOMIC
   sum = sum+s
   !$OMP END PARALLEL

Expanded reduction implementation:
   DO i=1,num_proc
      s(i)=0
   ENDDO
   !$OMP PARALLEL DO
   DO i=1,n
      s(my_proc)=s(my_proc)+A(i)
   ENDDO
   DO i=1,num_proc
      sum=sum+s(i)
   ENDDO
EE663, Spring 2012                                               Slide 65
Parallelizing Array Reductions
Array Reductions (a.k.a. irregular or histogram reductions):
   DIMENSION sum(m)
   DO i=1,n
      sum(expr) = sum(expr) + A(i)
   ENDDO

Privatized reduction implementation:
   DIMENSION sum(m),s(m)
   !$OMP PARALLEL PRIVATE(s)
   s(1:m)=0
   !$OMP DO
   DO i=1,n
      s(expr)=s(expr)+A(i)
   ENDDO
   !$OMP ATOMIC
   sum(1:m) = sum(1:m)+s(1:m)
   !$OMP END PARALLEL

Expanded reduction implementation:
   DIMENSION sum(m),s(m,#proc)
   !$OMP PARALLEL DO
   DO i=1,m
      DO j=1,#proc
         s(i,j)=0
      ENDDO
   ENDDO
   !$OMP PARALLEL DO
   DO i=1,n
      s(expr,my_proc)=s(expr,my_proc)+A(i)
   ENDDO
   !$OMP PARALLEL DO
   DO i=1,m
      DO j=1,#proc
         sum(i)=sum(i)+s(i,j)
      ENDDO
   ENDDO

Note, OpenMP 1.0 does not support such array reductions.
EE663, Spring 2012 Slide 66
Recognizing Reductions
Recognition Criteria:
1. the loop may contain one or more reduction
statements of the form X=X ⊗ expr ,
where
• X is either scalar or an array expression, a[sub]
(sub must be the same on LHS and RHS)
• ⊗ is a reduction operation, such as +, *, min, max
2. X must not be used in any non-reduction statement
of the loop, nor in expr
EE663, Spring 2012 Slide 67
Reduction Recognition Algorithm

procedure RecognizeSumReductions (L)
Input : Loop L
Output: reduction annotations for loop L, inserted in the IR
REDUCTION = {} // set of candidate reduction expressions
REF = {} // set of non-reduction variables referenced in L
foreach stmt in L
localREFs = findREF(stmt) // gather all variables referenced in stmt
if (stmt is AssignmentStatement)
candidate = lhs_expr(stmt)
increment = rhs_expr(stmt) – candidate // symbolic subtraction
if ( !(baseSymbol(candidate) in findREF(increment)) ) // criterion1 is satisfied
REDUCTION = REDUCTION ∪ candidate
localREFs = findREF(increment) // all variables referenced in inc. expr.
REF = REF ∪ localREFs // collect non-reduction variables for criterion 2
foreach expr in REDUCTION
if ( ! (baseSymbol(expr) in REF) ) // criterion 2 is satisfied
if (expr is ArrayAccess AND expr.subscript is loop-variant)
CreateAnnotation(sum-reduction, ARRAY, expr)
else
CreateAnnotation(sum-reduction, SCALAR, expr)
end procedure Slide 68
Reduction Compiler Passes
Reduction recognition and parallelization compiler passes:
   Induction variable recognition
   Reduction recognition        ← recognizes and annotates reduction variables
   Privatization
   Data dependence test
   Loop parallelization
   <mapping passes>
   Profitability decision
   Reduction parallelization    ← performs the reduction transformation
EE663, Spring 2012 Slide 69
Performance Considerations
for Reduction Parallelization
• Parallelized reductions execute substantially more code than
their serial versions ⇒ overhead if the reduction (n) is small.
• In many cases (for large reductions) initialization and sum-up
are insignificant.
• False sharing can occur, especially in expanded reductions, if
multiple processors use adjacent array elements of the
temporary reduction array (s).
• Expanded reductions exhibit more parallelism in the sum-up
operation.
• Potential overhead in initialization, sum-up, and memory used
for large, sparse array reductions ⇒ compression schemes can
become useful.
EE663, Spring 2012 Slide 70
Induction Variable Substitution
Serial loop with a loop-carried flow dependence on ind:
   ind = k
   DO i=1,n
      ind = ind + 2
      A(ind) = B(i)
   ENDDO

After substitution:
   ind = k
   Parallel DO i=1,n
      A(k+2*i) = B(i)
   ENDDO

This is the simple case of an induction variable.
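A hedged C sketch of the same transformation (k, n, A, B are illustrative):
the "before" version carries a flow dependence on ind, while the closed-form
"after" version can run in parallel:

   void before(double *A, const double *B, int n, int k)
   {
       int ind = k;
       for (int i = 1; i <= n; i++) {
           ind = ind + 2;            /* induction statement */
           A[ind] = B[i];
       }
   }

   void after(double *A, const double *B, int n, int k)
   {
       #pragma omp parallel for
       for (int i = 1; i <= n; i++)
           A[k + 2*i] = B[i];        /* closed form: no cross-iteration dependence */
   }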
EE663, Spring 2012 Slide 71
Generalized Induction Variables
Increment depends on the loop index:
   ind=k
   DO j=1,n
      ind = ind + j
      A(ind) = B(j)
   ENDDO
becomes
   ind=k
   Parallel DO j=1,n
      A(k+(j**2+j)/2) = B(j)
   ENDDO

Further examples — coupled induction variables, and a triangular loop nest:
   DO i=1,n
      ind1 = ind1 + 1
      ind2 = ind2 + ind1
      A(ind2) = B(i)
   ENDDO

   DO i=1,n
      DO j=1,i
         ind = ind + 1
         A(ind) = B(i)
      ENDDO
   ENDDO
EE663, Spring 2012 Slide 72
Recognizing GIVs
• Pattern Matching:
– find induction statements in a loop nest of the form
iv=iv+expr or iv=iv*expr, where iv is a scalar integer.
– expr must be loop-invariant or another induction variable
(there must not be cyclic relationships among IVs)
– iv must not be assigned in a non-induction statement
• Abstract interpretation: find symbolic increments
of iv per loop iteration
• SSA-based recognition
EE663, Spring 2012 Slide 73
GIV Closed-form Computation and
Substitution Algorithm
Loop structure L0 (statement types: I = induction statement, L = inner loop, U = use of iv):
   For j: 1..ub
      …
      S1: iv=iv+exp          (type I)
      …
      S2: loop using iv      (type L)
      …
      S3: stmt using iv      (type U)
      …
   Rof

Step 1: find the increment relative to the start of loop L
   FindIncrement(L)
      inc=0
      foreach si of type I,L
         if type(si)=I   inc += exp
         else /* L */    inc += FindIncrement(si)
         inc_after[si] = inc
      inc_into_loop[L] = ∑_{1..j-1}(inc)    ; inc may depend on j
      return ∑_{1..ub}(inc)

Step 2: substitute IV
   Replace(L, initval)
      val = initval + inc_into_loop[L]
      foreach si of type I,L,U
         if type(si)=L    Replace(si, val)
         if type(si)=L,I  val = initval + inc_into_loop[L] + inc_after[si]
         if type(si)=U    Substitute(si, iv, val)

Main:
   totalinc = FindIncrement(L0)
   Replace(L0, iv)
   InsertStatement("iv = iv+totalinc")      ; insert this statement if iv is live-out

For coupled GIVs: begin with the independent iv.
EE663, Spring 2012 Slide 74
Induction Variables, References
• B. Pottenger and R. Eigenmann. Idiom Recognition in the Polaris
Parallelizing Compiler. ACM Int. Conf. on Supercomputing (ICS'95),
June 1995.
• Mohammad R. Haghighat, Constantine D. Polychronopoulos, Symbolic
analysis for parallelizing compilers, ACM Transactions on Programming
Languages and Systems (TOPLAS), v.18 n.4, p.477-518, July 1996.
• Michael P. Gerlek, Eric Stoltz, Michael Wolfe, Beyond induction
variables: detecting and classifying sequences using a demand-driven
SSA form, ACM Transactions on Programming Languages and
Systems (TOPLAS), v.17 n.1, p.85-122, Jan. 1995.
EE663, Spring 2012 Slide 75
Loop Skewing
Original loop:
   DO i=1,4
      DO j=1,6
         A(i,j)= A(i-1,j-1)
      ENDDO
   ENDDO

Skewed loop:
   !$OMP PARALLEL DO
   DO set=1,9
      i = max(5-set,1)
      j = max(-3+set,1)
      setsize = min(4,5-abs(set-5))
      DO k=0,setsize-1
         A(i+k,j+k)=A(i-1+k,j-1+k)
      ENDDO
   ENDDO

[Iteration space graph over i and j: shaded regions show sets of iterations in the
transformed code that can be executed in parallel.]
EE663, Spring 2012                                               Slide 76
Loop Skewing for the
Wavefront Method
   DO i=2,n-1
      DO j=2,n-1
         A(i,j)= (A(i+1,j) +A(i-1,j)
                 +A(i,j+1) +A(i,j-1))/4
      ENDDO
   ENDDO

After skewing, the outer loop is serial and the inner loop is parallel:
   DO j=4, n+n-2
      DOALL i= max(2, j-n+1), min(n-1, j-2)
         A(i, j-i) = (A(i+1, j-i) + A(i-1, j-i)
                     +A(i, j+1-i) + A(i, j-1-i))/4
      ENDDO
   ENDDO

[Iteration space graph over i and j: the parallel wavefronts are the anti-diagonals.]
EE663, Spring 2012 Slide 77
IV.3 Techniques for
Multiprocessors:
Mapping Parallelism to Shared-memory
Machines
EE663, Spring 2012 Slide 78
Loop Fusion and Distribution
Two loops:
   DO i=1,n
      A(i) = B(i)
   ENDDO
   DO i=1,n
      C(i) = A(i-1)+D(i)
   ENDDO

Fused loop:
   DO i=1,n
      A(i) = B(i)
      C(i) = A(i-1) + D(i)
   ENDDO

• Loop fusion is the reverse of loop distribution (fission)
• Fusion reduces the loop fork/join overhead and enhances data affinity
• Distribution inserts a barrier synchronization between parallel loops
• Both transformations reorder computation
• Legality: dependences in the fused loop must be lexically forward
EE663, Spring 2012 Slide 79
Loop Distribution Enables
Other Techniques
   DO i=1,n
      A(i) = B(i)+A(i-1)
      DO j=1,m
         D(i,j)=E(i,j)
      ENDDO
   ENDDO

Loop distribution
• enables interchange
• separates out partial parallelism:

   DO i=1,n
      A(i) = B(i)+A(i-1)
   ENDDO
   DOALL j=1,m
      DO i=1,n
         D(i,j)=E(i,j)
      ENDDO
   ENDDO

In a program with multiply-nested loops, there can be a large number of
possible program variants obtained through distribution and interchanging.
EE663, Spring 2012 Slide 80
Enforcing Data Dependence
Criterion for correct transformation and execution of a
computation involving a data dependence with direction vector
v : (=,…,<,…,*):
Let Ls be the outermost loop with a non-“=” DD direction.
– Ls must be executed serially
– The direction at Ls must be “<”
The same rule applies to all dependences.

Note that a data dependence is defined with respect to an ordered
execution. For autoparallelization, this is the serial program order.
User-defined, fully parallel loops by definition do not have cross-iteration
dependences. Legality rules for transforming already parallel programs are
different.
EE663, Spring 2012 Slide 81
Loop Interchange
Legality of Loop interchange and resulting parallelism can be
tested with the above rules:
After loop interchange, the two conditions must still hold.
Example 1, before interchange:
   DO i=1,n
      DOALL j=1,m
         A(i,j) = A(i-1,j)
      ENDDO
   ENDDO
after interchange:
   DOALL j=1,m
      DO i=1,n
         A(i,j) = A(i-1,j)
      ENDDO
   ENDDO

Example 2, before interchange:
   DOALL i=1,n
      DO j=1,m
         A(i,j) = A(i-1,j-1)
      ENDDO
   ENDDO
after interchange:
   DOALL j=1,m
      DO i=1,n
         A(i,j) = A(i-1,j-1)
      ENDDO
   ENDDO
EE663, Spring 2012 Slide 82
Loop Coalescing
a.k.a. loop collapsing
Before:
   PARALLEL DO i=1,n
      DO j=1,m
         A(i,j) = B(i,j)
      ENDDO
   ENDDO

After coalescing:
   PARALLEL DO ij=1,n*m
      i = 1 + (ij-1) DIV m
      j = 1 + (ij-1) MOD m
      A(i,j) = B(i,j)
   ENDDO

Loop coalescing
• can increase the number of iterations of a parallel loop → better load balancing
• adds additional computation → overhead
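A hedged C sketch of loop coalescing (n, m, A, B illustrative); the second
version uses OpenMP's collapse clause, which performs the same transformation
automatically:

   void coalesced_by_hand(double *A, const double *B, int n, int m)
   {
       #pragma omp parallel for
       for (int ij = 0; ij < n * m; ij++) {
           int i = ij / m;                   /* recover the original indices */
           int j = ij % m;
           A[i*m + j] = B[i*m + j];
       }
   }

   void coalesced_by_collapse(double *A, const double *B, int n, int m)
   {
       #pragma omp parallel for collapse(2)
       for (int i = 0; i < n; i++)
           for (int j = 0; j < m; j++)
               A[i*m + j] = B[i*m + j];
   }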
EE663, Spring 2012 Slide 83
Loop Blocking/Tiling
Original loop:
   DO j=1,m
      DO i=1,n
         B(i,j)=A(i,j)+A(i,j-1)
      ENDDO
   ENDDO

Blocked (tiled) loop:
   DO PARALLEL i1=1,n,block
      DO j=1,m
         DO i=i1,min(i1+block-1,n)
            B(i,j)=A(i,j)+A(i,j-1)
         ENDDO
      ENDDO
   ENDDO

[Iteration space over i and j: each processor p1…p4 works on a block of i
values and traverses all j within its block.]

This is basically the same transformation as
stripmining, but followed by loop interchanging.
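A hedged C/OpenMP sketch of the blocked loop above, with the arrays stored
row-major and flattened (BLOCK, n, m, A, B illustrative). Each thread handles a
strip of i values and reuses elements of A across consecutive j iterations while
its strip stays in cache:

   #define BLOCK 64

   void blocked(double *B, const double *A, int n, int m)
   {
       #pragma omp parallel for
       for (int i1 = 0; i1 < n; i1 += BLOCK)        /* parallel strip loop  */
           for (int j = 1; j < m; j++)
               for (int i = i1; i < i1 + BLOCK && i < n; i++)
                   B[i*m + j] = A[i*m + j] + A[i*m + (j-1)];
   }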
EE663, Spring 2012 Slide 84
Loop Blocking/Tiling continued

The same loop expressed with OpenMP; the NOWAIT clause avoids the barrier
after each inner worksharing loop:
   !$OMP PARALLEL
   DO j=1,m
      !$OMP DO
      DO i=1,n
         B(i,j)=A(i,j)+A(i,j-1)
      ENDDO
      !$OMP ENDDO NOWAIT
   ENDDO
   !$OMP END PARALLEL

[Iteration space over i and j: each processor p1…p4 works on its block of i
values across all j.]
EE663, Spring 2012 Slide 85
Choosing the Block Size
The block size must be small enough so that all data references
between the use and the reuse fit in cache.

   DO j=1,m
      DO k=1,block
         … (r1 data references)
         … = A(k,j) + A(k,j-d)
         … (r2 data references)
      ENDDO
   ENDDO

Number of references made between the access A(k,j) and the access A(k,j-d)
when referencing the same memory location: (r1+r2+3)*d*block
   ⇒ block < cachesize / ((r1+r2+3)*d)

If the cache is shared, all cores use it simultaneously. Hence the
effective cache size appears smaller:
   block < cachesize / ((r1+r2+3)*d*num_cores)
Reference: Zhelong Pan, Brian Armstrong, Hansang Bae and Rudolf Eigenmann,
On the Interaction of Tiling and Automatic Parallelization, First International
Workshop on OpenMP (Wompat), 2005.
EE663, Spring 2012 Slide 86
Multi-level Parallelism from
Single Loops
   DO i=1,n
      A(i) = B(i)
   ENDDO

Stripmining for multi-level parallelism:
   PARALLEL DO (inter-cluster) i1=1,n,strip
      PARALLEL DO (intra-cluster) i=i1,min(i1+strip-1,n)
         A(i) = B(i)
      ENDDO
   ENDDO

[Target architecture: clusters of processors (P), each cluster with its own
memory (M).]
EE663, Spring 2012 Slide 87
References
• High Performance Compilers for Parallel
Computing, Michael Wolfe, Addison-Wesley, ISBN
0-8053-2730-4.
• Optimizing Compilers for Modern Architectures: A
Dependence-based Approach, Ken Kennedy and
John R. Allen, Morgan Kaufmann Publishers, ISBN
1558602860
EE663, Spring 2012 Slide 88
IV.4 Advanced Program
Analysis
EE663, Spring 2012 Slide 89
Interprocedural Analysis
• Most compiler techniques work intra-
procedurally
• Ideally, inter-procedural analyses and
transformations would be available
• In practice: inter-procedural operation of basic
analyses works well
• Inline expansion helps but is no silver bullet
EE663, Spring 2012 Slide 90
Interprocedural Constant
Propagation
Making constant values of variables
known across subroutine calls
   Subroutine A
      j = 150
      call B(j)
   END

   Subroutine B(m)
      DO i=1,100
         X(i)=X(i+m)
      ENDDO
   END

Knowing that m>100 allows the loop in B to be parallelized.
EE663, Spring 2012 Slide 91
An Algorithm for Interprocedural
Constant Propagation
Intra-procedural part:
determine jump functions for all subroutines
   Subroutine X(a,b,c)
      e = 10
      d = b+2
      call Y(c)                JY,1 = c
      f = b*2
      call Z(a,d,c,e,f)        JZ,1 = a   (jump function of the first parameter)
   END                         JZ,2 = b+2
                               JZ,3 = ⊥   (called bottom, meaning non-constant;
                                           c may have been modified by the call to Y)
                               JZ,4 = 10
                               JZ,5 = ⊥   (b*2 is not of the supported form P+const)
• Mechanism for finding jump functions: (local) forward substitution and
interprocedural MAYMOD information.
• Here we assume the compiler supports jump functions of the form
P+const (P is a subroutine parameter of the callee).
EE663, Spring 2012 Slide 92
Constant Propagation Algorithm:
Interprocedural Part
1. initialize all formal parameters to the value T (called top = not yet known)
2. for all jump functions:
– if it is ⊥: set formal parameter value to ⊥ (called bottom = unknown)
– if it is constant and the value of the formal parameter is the same
constant or T : set the value to this constant
3. put all formal parameters on a work queue
4. repeat until the queue is empty: take a parameter from the queue;
for all jump functions that contain this parameter:
• determine the value of the target parameter of this jump function.
Set it to this value, or to ⊥ if it is different from a previously set
value.
• if the value of the target parameter changes, put this parameter
on the queue
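A minimal C sketch of the lattice value and the meet operation used in steps 2
and 4 (the type and function names are illustrative):

   typedef enum { TOP, CONST, BOTTOM } Kind;   /* top = not yet known,
                                                  bottom = non-constant */
   typedef struct { Kind kind; long value; } LatticeVal;

   static LatticeVal meet(LatticeVal a, LatticeVal b)
   {
       if (a.kind == TOP) return b;            /* top meets x -> x            */
       if (b.kind == TOP) return a;
       if (a.kind == CONST && b.kind == CONST && a.value == b.value)
           return a;                           /* equal constants stay const  */
       return (LatticeVal){ BOTTOM, 0 };       /* conflicting values -> bottom */
   }

Whenever a formal parameter's value changes under meet (e.g., from top to a
constant, or from a constant to bottom), it is put back on the work queue, as in
step 4 above.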
EE663, Spring 2012 Slide 93
Examples of Constant Propagation
Example 1:
   x=3
   Call SubY(x)

   Subroutine SubY(a)
      … = ….a…

Example 2:
   x=3
   Call SubY(x)

   Subroutine SubY(a)
      b = a+2
      Call SubZ(b)

   Subroutine SubZ(e)
      … = … e….

Example 3 — consider what happens if t=6, and if t=7:
   x=3
   Call SubY(x)
   t=6 (or t=7)
   Call SubU(t)

   Subroutine SubY(a)
      b = a+2
      Call SubZ(b)

   Subroutine SubU(c)
      d = c-1
      Call SubZ(d)

   Subroutine SubZ(e)
      … = … e….

(With t=6, both call sites pass 5 to SubZ, so e is the constant 5;
with t=7 the two call sites disagree and e becomes ⊥.)
EE663, Spring 2012 Slide 94
Interprocedural
Data-Dependence Analysis
• Motivational examples:
Example 1:
   DO i=1,n
      call clear(a,i)
   ENDDO

   Subroutine clear(x,j)
      x(j) = 0
   END

Example 2:
   DO i=1,n
      a(i) = b(i)
      call dupl(a,i)
   ENDDO

   Subroutine dupl(x,j)
      x(j) = 2*x(j)
   END

Example 3:
   DO k=1,m
      DO i=1,n
         a(i,k) = math(i,k)
         call smooth(a(i,k))
      ENDDO
   ENDDO

   Subroutine smooth(x,j)
      x(j) = (x(j-1)+x(j)+x(j+1))/3
   END
EE663, Spring 2012 Slide 95
Interprocedural
Data-Dependence Analysis
• Overall strategy:
– subroutine inlining
– move loop into called subroutine
– collect array access information in callee
and use in the analysis of the caller
→ will be discussed in more detail
EE663, Spring 2012 Slide 96
Interprocedural
Data-Dependence Analysis
• Representing array access information
– summary information
• [low:high] or [low:high:stride]
• sets of the above
– exact representation
• essentially all loop bound and subscript information is
captured
– representation of multiple subscripts
• separate representation
• linearized
EE663, Spring 2012 Slide 97
Interprocedural
Data-Dependence Analysis
• Reshaping arrays
– simple conversion
• matching subarray or 2-D→1-D
– exact reshaping with div and mod
– linearizing both arrays
– equivalencing the two shapes
• can be used in subroutine inlining
Important: reshaping may lose the implicit
assertion that array bounds are not violated!
EE663, Spring 2012 Slide 98
Symbolic Analysis
• Expression manipulation techniques
– Expression simplification/normalization
– Expression comparison
– Symbolic arithmetic
• Range analysis
– Find lower/upper bounds of variable values at a
given statement
• For each statement and variable, or
• Demand-driven, for a given statement and variable
EE663, Spring 2012 Slide 99
Symbolic Range Analysis
Example
int foo(int k)
{
   []
   int i, j;
   []
   double a;
   []
   for ( i=0; i<10; ++i ) {
      [0<=i<=9]
      a=(0.5*i);
   }
   [i=10]
   j=(i+k);
   [i=10, j=(i+k)]
   return j;
}
EE663, Spring 2012 Slide 100
Alias Analysis
Find references to the same storage by different names
⇒ Program analyses and transformations must consider all these
names
Simple case: different named variables allocated in same
storage location
• Fortran Equivalence statement
• Same variable passed to subroutine by-reference as two
different parameters (can happen in Fortran and C++, but
not in C)
• Global variable also passed as subroutine parameter
EE663, Spring 2012 Slide 101
Pointer Alias Analysis
• More complex: variables pointed to by named pointers
– p=&a; q=&a => *p, *q are aliases
– Same variable passed to C subroutines via pointer
• Most complex: pointers between dynamic data structure
objects
– This is commonly referred to as shape analysis
EE663, Spring 2012 Slide 102
Is Alias Analysis in Parallelizing
Compilers Important?
• Fortran77: alias analysis is simple/absent
– By Fortran rule, aliased subroutine parameters must not be
written to
– there are no pointers
• C programs: alias analysis is a must
– Pointers, pointer arithmetic
– No Fortran-like rule about subroutine parameters
– Without alias information, compilers would have to be very
conservative => big loss of parallelism
– Classical science/engineering applications do not have
dynamic data structures => no shape analysis needed
EE663, Spring 2012 Slide 103
IV.5 Dynamic Decision
Support
EE663, Spring 2012 Slide 104
Achilles’ Heel of Compilers
Big compiler limitations:
– Insufficient compile-time knowledge
• Input data
• Architecture parameters (e.g., cache size)
• Memory layout
– Even if this information is known: Performance models too
complex
Effect:
– Unknown profitability of optimizations
– Inconsistent performance behavior
– Conservative behavior of compilers
– Many compiler options
– Users need to experiment with options
EE663, Spring 2012 Slide 105
Multi-version Code
   IF (d>n) THEN
      PARALLEL DO i=1,n
         a(i) = a(i+d)
      ENDDO
   ELSE
      DO i=1,n
         a(i) = a(i+d)
      ENDDO
   ENDIF

Limitations
• Less readable
• Additional code
• Not feasible for all optimizations
• Combinatorial explosion when trying to apply to many optimization decisions
EE663, Spring 2012 Slide 106
Profiling
• Gather missing information in a profile run
– Compiler instruments code that gathers at runtime
information needed for optimization decisions
• Use the gathered profile information for improved
decision making in a second compiler invocation
• Training vs. production data
• Initially used for branch prediction. Now increasingly
used to guide additional transformations.
• Requires a compiler performance model
EE663, Spring 2012 Slide 107
Autotuning – Empirical Tuning
Try many optimization variants; pick the best at runtime.

[Tuning cycle: Search Space Navigation → Version Generation → Runtime
Evaluation → back to Search Space Navigation]

• No compiler performance model needed
• Optimization decisions based on true execution time
• Dependence on training data (same as profiling)
• Potentially huge search space
• Whole-program vs. section-level tuning

Many active research projects.
EE663, Spring 2012 Slide 108
IV.6 Techniques for Vector
Machines
EE663, Spring 2012 Slide 109
Vector Instructions
A vector instruction operates on a number of
data elements at once.
Example: vadd va,vb,vc,32
vector operation of length 32 on vector registers va,vb, and vc
– va,vb,vc can be
• Special cpu registers or memory → classical
supercomputers
• Regular registers, subdivided into shorter partitions (e.g.,
64bit register split 8-way) → multi-media extensions
– The operations on the different vector elements
can overlap → vector pipelining
EE663, Spring 2012 Slide 110
Applications of Vector
Operations
• Science/engineering applications are typically
regular with large loop iteration counts.
This was ideal for classical supercomputers, which
had long vectors (up to 256; vector pipeline startup
was costly).
• Graphics applications can exploit “multi-
media” register features and instruction sets.
EE663, Spring 2012 Slide 111
Basic Vector Transformation
   DO i=1,n
      A(i) = B(i)+C(i)
   ENDDO
becomes
   A(1:n)=B(1:n)+C(1:n)

   DO i=1,n
      A(i) = B(i)+C(i)
      C(i-1) = D(i)**2
   ENDDO
becomes
   A(1:n)=B(1:n)+C(1:n)
   C(0:n-1)=D(1:n)**2

The triplet notation is interpreted to mean "vector operation". Notice that this
is not (necessarily) the same meaning as in Fortran 90.
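For comparison, a hedged C sketch of the first loop using OpenMP's simd
directive, which asks the compiler to emit vector instructions (a, b, c, n are
illustrative):

   void vector_add(double *a, const double *b, const double *c, int n)
   {
       #pragma omp simd
       for (int i = 0; i < n; i++)
           a[i] = b[i] + c[i];
   }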
EE663, Spring 2012 Slide 112
Distribution and Vectorization
The transformation done on the previous slide involves loop distribution. Loop
distribution reorders computation and is thus subject to data dependence
constraints.

   DO i=1,n
      A(i) = B(i)+C(i)
      D(i) = A(i)+A(i-1)        (loop-carried dependence on A)
   ENDDO

loop distribution:
   DO i=1,n
      A(i) = B(i)+C(i)
   ENDDO
   DO i=1,n
      D(i) = A(i)+A(i-1)
   ENDDO

vectorization:
   A(1:n)=B(1:n)+C(1:n)
   D(1:n)=A(1:n)+A(0:n-1)

The transformation is not legal if there is a lexically-backward dependence:
   DO i=1,n
      A(i) = B(i)+C(i)          (loop-carried dependence from C(i+1) to C(i))
      C(i+1) = D(i)**2
   ENDDO
Statement reordering may help resolve the problem. However, this is
not possible if there is a dependence cycle.
EE663, Spring 2012 Slide 113
Vectorization Needs
Expansion
... as opposed to privatization

   DO i=1,n
      t = A(i)+B(i)
      C(i) = t + t**2
   ENDDO

expansion:
   DO i=1,n
      T(i) = A(i)+B(i)
      C(i) = T(i) + T(i)**2
   ENDDO

vectorization:
   T(1:n) = A(1:n)+B(1:n)
   C(1:n) = T(1:n)+T(1:n)**2
EE663, Spring 2012 Slide 114
Conditional Vectorization
DO i=1,n
IF (A(i) < 0) A(i)=-A(i)
ENDDO
conditional vectorization
WHERE (A(1:n) < 0) A(1:n)=-A(1:n)
EE663, Spring 2012 Slide 115
Stripmining for Vectorization
   DO i=1,n
      A(i) = B(i)
   ENDDO

stripmining:
   DO i1=1,n,32
      DO i=i1,min(i1+31,n)
         A(i) = B(i)
      ENDDO
   ENDDO

Stripmining turns a single loop into a doubly-nested loop for two-level parallelism.
It also needs to be done by the code-generating compiler to split an operation into
chunks of the available vector length.
EE663, Spring 2012 Slide 116
IV.7 Compiling for
Heterogeneous
Architectures
EE663, Spring 2012 Slide 117
Why Heterogeneous
Architectures?
• Performance
– Fast uniprocessor best for serial code
– Many simple cores best for highly parallel code
– Special-purpose architectures for accelerating
certain code patterns
• E.g., math co-processor
• Energy
– Same arguments hold for power savings
EE663, Spring 2012 Slide 118
Examples of Accelerators
• nvidia GPGPU
• IBM Cell
• Intel MIC
• FPGAs
• Crypto processor
• Network processor
• Video Encoder/decoder

Accelerators are typically used as co-processors:
• CPU + accelerator = heterogeneous
• Shared or distributed address space
EE663, Spring 2012 Slide 119
Accelerator Architecture
Example GPGPU:
• Address space is separate from the CPU
• Complex memory hierarchy
• Large number of cores
• Multithreaded SIMD execution
• Optimized for coalesced (stride-1) accesses

[CUDA memory model: a grid of thread blocks runs on the GPU; each thread has
registers and local memory; each thread block has shared memory; all blocks
access global memory, texture memory (with a dedicated cache), and constant
memory (with a dedicated cache). The CPU communicates with the GPU's global
memory.]
EE663, Spring 2012 Slide 120
Compiler Optimizations for
GPGPUs
• Optimizing GPU Global Memory Accesses
– Parallel Loop Swap
– Loop Collapsing
– Matrix Transpose
• Exploiting GPU On-chip Memories
• Optimizing CPU-GPU Data Movement
– Resident GPU Variable Analysis
– Live CPU Variable Analysis
– Memory Transfer Promotion Optimization
R. Eigenmann, Programming Models and Compilers for Accelerators Slide 121
Parallel Loop-Swap
Transformation

Input OpenMP code:
   #pragma omp parallel for
   for(i=0; i<N; i++)
      for(k=0; k<N; k++)
         A[i][k] = B[i][k];

Optimized OpenMP code:
   #pragma omp parallel for schedule(static, 1)
   for(k=0; k<N; k++)
      for(i=0; i<N; i++)
         A[i][k] = B[i][k];

[Memory access diagrams: in the input code, at a given time the threads T0..T3
access elements that are a full row apart in global memory; after the loop swap,
the threads access adjacent elements, so the accesses coalesce.]
R. Eigenmann, Programming Models and Compilers for Accelerators Slide 122
Loop Collapsing
Transformation

Input OpenMP code:
   #pragma omp parallel for
   for(i=0; i<n_rows; i++)
      for(k=rptr[i]; k<rptr[i+1]; k++)
         w[i] += A[k]*p[col[k]];

Optimized OpenMP code:
   #pragma omp parallel
   #pragma omp for collapse(2) schedule(static, 1)
   for(i=0; i<n_rows; i++)
      for(k=rptr[i]; k<rptr[i+1]; k++)
         w[i] += A[k]*p[col[k]];

[Memory access diagrams: before collapsing, threads T0..T3 each walk one row of
the sparse matrix; after collapsing, consecutive threads access consecutive
elements of A, so the global memory accesses coalesce.]
R. Eigenmann, Programming Models and Compilers for Accelerators Slide 123
Matrix-Transpose Transformation
   float d[N][M];
   ...
   <transpose d on transfer to GPU>

   // kernel function:
   float d[M][N];
   #pragma omp parallel for
   for(k=0; k<N; k++)
      for(i=0; i<M; i++)
         … d[i][k] …;

[Memory access diagrams: without the transpose, threads T0..T3 access elements
that are a row apart; with the transposed layout, the threads access adjacent
elements at a given time, so the accesses coalesce.]
R. Eigenmann, Programming Models and Compilers for Accelerators Slide 124
Techniques to Exploit GPU
On-chip Memories
Caching Strategies
Variable Type Caching Strategy
R/O shared scalar w/o locality SM
R/O shared scalar w/ locality SM, CM, Reg
R/W shared scalar w/ locality Reg, SM
R/W shared array element w/ locality Reg
R/O 1-dimensional shared array w/ locality TM
R/W private array w/ locality SM
Reg: Registers CM: Constant Memory
SM: Shared Memory TM: Texture Memory
R. Eigenmann, Programming Models and Compilers for Accelerators Slide 125
Techniques to Optimize Data
Movement between CPU and GPU
• Resident GPU Variable Analysis
– Up-to-date data in GPU global memory:
do not copy again from CPU.
• Live CPU Variable Analysis
– After a kernel finishes:
only transfer live CPU variables from GPU to
CPU.
• Memory Transfer Promotion Optimization
– Find optimal points to insert necessary memory
transfers
R. Eigenmann, Programming Models and Compilers for Accelerators Slide 126
GPGPU Performance Relative
to CPU
EE663, Spring 2012 Slide 127
[Bar charts of speedups relative to the CPU omitted.]

Importance of Individual Optimizations

[Bar charts showing the effect of the individual optimizations omitted.]

EE663, Spring 2012
IV.8 Techniques Specific
to Distributed-memory
Machines
EE663, Spring 2012 Slide 129
Execution Scheme on a
Distributed-Memory Machine
Typical execution scheme:
• All nodes execute the same program
• The program uses node_id to select the subcomputation to execute on each
  participating processor and the data to access.

For example,
   DO i=1,n
      ...
   ENDDO
becomes
   mystrip = ⎡n/max_nodes⎤
   lb = node_id*mystrip + 1
   ub = min(lb+mystrip-1,n)
   DO i=lb,ub
      ...
   ENDDO

Open questions: how to place and access data? how/when to synchronize?

[Architecture: nodes, each with a processor P and its own memory M,
connected by a network.]

This is called the Single-Program-Multiple-Data (SPMD) execution scheme.
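A hedged C/MPI sketch of this SPMD scheme (the array size N and the work in the
loop body are illustrative). Every rank runs the same program and uses its rank
to pick its strip of iterations; a collective combines the partial results:

   #include <mpi.h>
   #include <stdio.h>

   #define N 1000

   int main(int argc, char **argv)
   {
       int node_id, max_nodes;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &node_id);
       MPI_Comm_size(MPI_COMM_WORLD, &max_nodes);

       int mystrip = (N + max_nodes - 1) / max_nodes;   /* ceil(N/max_nodes) */
       int lb = node_id * mystrip + 1;
       int ub = lb + mystrip - 1 < N ? lb + mystrip - 1 : N;

       double local_sum = 0.0, global_sum;
       for (int i = lb; i <= ub; i++)        /* this node's share of DO i=1,N */
           local_sum += (double)i;

       /* Placement/communication of shared data is the hard part; here a
          simple collective combines the partial results on rank 0. */
       MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                  MPI_COMM_WORLD);
       if (node_id == 0) printf("sum = %f\n", global_sum);
       MPI_Finalize();
       return 0;
   }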
EE663, Spring 2012 Slide 130
Data Placement
Single owner:
• Data is distributed onto the participating
processors’ memories
Replication:
• Multiple versions of the data are placed
on some or all nodes.
EE663, Spring 2012 Slide 131
Data Distribution Schemes
The numbers indicate the node of a 4-processor distributed-memory machine
on which the array section is placed:

   block distribution:        1 | 2 | 3 | 4
   cyclic distribution:       1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
   block-cyclic distribution: 1 | 2 | 3 | 4 | 1 | ...
   indexed distribution:      IND(1) IND(2) IND(3) IND(4) IND(5) ...  (index array)

Automatic data distribution is difficult because it is a global optimization.
EE663, Spring 2012 Slide 132
Message Generation
for single-owner placement
EXAMPLE
   DO i=1,n
      B(i) = A(i)+A(i-1)
   ENDDO

message generation:
   send (A(ub),my_proc+1)
   receive (A(lb-1),my_proc-1)
   DO i=lb,ub
      B(i) = A(i)+A(i-1)
   ENDDO

• lb,ub determine the iterations assigned to each processor.
• data uses block distribution and matches the iteration distribution
• my_proc is the current processor number

Compilers for languages such as HPF (High-Performance
Fortran) have explored these ideas extensively.
EE663, Spring 2012 Slide 133
Owner-computes Scheme
In general, the elements accessed by a processor are different from the elements
owned by this processor as defined by the data distribution.

   DO i=1,n
      A(i)=B(i)+B(i-m)
      C(ind(i))=D(ind2(i))
   ENDDO

Owner-computes transformation:
   DO i=1,n
      send/receive what's necessary
      IF I_own(A(i)) THEN
         A(i) = B(i)+B(i-m)
      ENDIF
      send/receive what's necessary
      IF I_own(C(ind(i))) THEN
         C(ind(i))=D(ind2(i))
      ENDIF
   ENDDO

• nodes execute those iterations and statements whose LHS they own
• first they receive needed RHS elements from remote nodes
• nodes need to send all elements needed by other nodes

The example shows the basic idea only. Compiler optimizations are needed!
EE663, Spring 2012 Slide 134
Compiler Optimizations
for the raw owner computes scheme
• Eliminate conditional execution
– combine if statements with same condition
– reduce iteration space if possible
• Aggregate communication
– combine small messages into larger ones
– tradeoff: delaying a message enables message
aggregation but increases the message latency.
• Message Prefetch
– moving send operations earlier in order to reduce
message latencies.
there is a large number of research papers describing such techniques
EE663, Spring 2012 Slide 135
Message Generation for
Virtual Data Replication
[Execution timeline: fully parallel sections with local reads and writes
alternate with phases that broadcast the written data to all nodes.
Optimization: reduce the broadcast operations to the necessary
point-to-point communication.]

Advantages:
• Fully parallel sections with local reads and writes
• Easier message set computation (no partitioning per processor needed)
Disadvantages:
• Not data-scalable
• More write operations necessary (but collective communication can be used)
EE663, Spring 2012 Slide 136
7 Techniques for
Instruction-Level
Parallelization
EE663, Spring 2012 Slide 137
Implicit vs. Explicit ILP
Implicit ILP: ISA is the same as for sequential
programs.
– most processors today employ a certain degree of
implicit ILP
– parallelism detection is entirely done by the hardware
– compiler can assist ILP by arranging the code so that
the detection gets easier.
EE663, Spring 2012 Slide 138
Implicit vs. Explicit ILP
Explicit ILP: ISA expresses parallelism.
– parallelism is detected by the compiler
– parallelism is expressed in the form of
• VLIW (very long instruction words): packing several instructions
into one long word
• EPIC (Explicitly Parallel Instruction Computing): bundles of (up
to three) instructions are issued. Dependence bits can be
specified.
Used in Intel/HP IA-64 architecture. The processor also
supports predication, early (speculative) loads, prepare-to-
branch, rotating registers.
EE663, Spring 2012 Slide 139
Trace Scheduling
(invented for VLIW processors, still a useful terminology)
Two big issues must be solved by all approaches:
1. Identifying the instruction sequence that will be inspected for ILP
   (trace selection). Main obstacle: branches.
2. Reordering instructions so that machine resources are exploited
   efficiently (trace compaction).
Together, the two steps form trace scheduling.
EE663, Spring 2012 Slide 140
Trace Selection
• It is important to have a large instruction window (block) within
which the compiler can find parallelism.
• Branches are the problem. Instruction pipelines have to be
flushed/squashed at branches
• Possible remedies:
– eliminate branches
– code motion can increase block size
– block can contain out-branches with low probability
– predicated execution
EE663, Spring 2012 Slide 141
Branch Elimination
• Example:
Before:
      comp R0 R1
      bne  L1:
      bra  L2:
   L1: . . .
      . . .
   L2: . . .

After:
      comp R0 R1
      beq  L2:
   L1: . . .
      . . .
   L2: . . .
EE663, Spring 2012 Slide 142
Code Motion
[Control-flow diagrams: instruction I1 is moved above or below branches on
conditions c1 and c2, merging copies of I1 and removing subtrees.]
Code motion can increase window sizes and eliminate subtrees.
EE663, Spring 2012 Slide 143
Predicated Execution
   IF (a>0) THEN
      b=a
   ELSE
      b=-a
   ENDIF

With predication:
   p = a>0      ; assignment of predicate
   p: b=a       ; executed if predicate true
  !p: b=-a      ; executed if predicate false

Predication
• increases the window size for analyzing and exploiting parallelism
• increases the number of instructions "executed"
These are opposite demands!
Compare this technique to conditional vectorization.
EE663, Spring 2012 Slide 144
Dependence-removing ILP
Techniques
Induction chain:
   ind = i0
   ...
   ind = ind+1      (each statement depends on the previous value)
   ...
   ind = ind+1
becomes
   ind = i0
   ...
   ind = i0+1
   ...
   ind = i0+2

Sum chain:
   sum = sum+expr1
   ...
   sum = sum+expr2
   ...
   sum = sum+expr3
   ...
   sum = sum+expr4
becomes
   s1=expr1
   ...
   s1=s1+expr2
   ...
   s2=expr3
   ...
   s2=s2+expr4
   ...
   sum=sum+s1+s2

The blocks of statements in the transformed code are independent of each
other and can be executed as parallel instructions.
EE663, Spring 2012 Slide 145
Speculative ILP
Speculation is performed by the architecture in various forms
– Superscalar processors: compiler only has to deal with the
performance model. ISA is the same as for non-speculative
processors
– Multiscalar processors: (research only) compiler defines tasks that
the hardware can try to execute speculatively in parallel. Other than
task boundaries, the ISA is the same.
References:
• Task Selection for a Multiscalar Processor, T. N. Vijaykumar and
Gurindar S. Sohi, The 31st International Symposium on
Microarchitecture (MICRO-31), pp. 81-92, December 1998.
• Reference Idempotency Analysis: A Framework for Optimizing
Speculative Execution, Seon-Wook Kim, Chong-Liang Ooi, Rudolf
Eigenmann, Babak Falsafi, and T.N. Vijaykumar,, In Proc. of
PPOPP'01, Symposium on Principles and Practice of Parallel
Programming, 2001.
EE663, Spring 2012 Slide 146
Compiler Model of Explicit
Speculative Parallel Execution
(Multiscalar Processor)

• Overall Execution: speculative threads choose and start the
  execution of any predicted next thread.
• Data Dependence and Control Flow Violations lead to roll-backs.
• Final Execution: satisfies all cross-segment flow and control
  dependences.
• Data Access: Writes go to thread-private speculative storage.
  Reads read from an ancestor thread or from memory.
• Dependence Tracking: Data Flow and Control Flow dependences
  are detected directly and lead to roll-backs. Anti and Output
  dependences are satisfied via speculative storage.
• Segment Commit: Correctly executed threads (i.e., their final
  execution) commit their speculative storage to memory, in
  sequential order.
EE663, Spring 2004 Slide 147