EE663: Optimizing Compilers
Prof. R. Eigenmann
Purdue University
School of Electrical and Computer Engineering
Spring 2012
https://engineering.purdue.edu/~eigenman/ECE663/
EE663, Spring 2012 Slide 1
I. Motivation and Introduction:
Optimizing Compilers are in the Center of the
(Software) Universe
They translate increasingly advanced human interfaces (programming
languages) onto increasingly complex target machines
[Diagram] Today: human (programming) languages such as C, C++, Java, and Fortran
are translated onto workstations, multicores, and HPC systems.
Tomorrow (the grand challenge): problem specification languages translated onto
globally distributed/cloud resources.
Processors have multiple cores. Parallelization is a key optimization.
EE663, Spring 2012 Slide 2
Issues in Optimizing /
Parallelizing Compilers
The Goal:
• We would like to run standard (C, C++, Java,
Fortran) programs on common parallel
computers
leads to the following high-level issues:
• How to detect parallelism?
• How to map parallelism onto the machine?
• How to create a good compiler architecture?
EE663, Spring 2012 Slide 3
Detecting Parallelism
• Program analysis techniques
• Data dependence analysis
• Dependence removing techniques
• Parallelization in the presence of
dependences
• Runtime dependence detection
EE663, Spring 2012 Slide 4
Mapping Parallelism onto the
Machine
• Exploiting parallelism at many levels
– Multiprocessors and multi-cores (our focus)
– Distributed computers (clusters or global
networks)
– Heterogeneous architectures
– Instruction-level parallelism
– Vector machines
• Exploiting memory organizations
– Data placement
– Locality enhancement
– Data communication
EE663, Spring 2012 Slide 5
Architecting a Compiler
• Compiler generator languages and tools
• Internal representations
• Implementing analysis and transformation
techniques
• Orchestrating compiler techniques (when to
apply which technique)
• Benchmarking and performance evaluation
EE663, Spring 2004 Slide 6
Parallelizing Compiler Books and
Survey Papers
Books:
• Michael Wolfe: High-Performance Compilers for Parallel Computing (1996)
• Utpal Banerjee: several books on Data Dependence Analysis and Transformations
• Ken Kennedy, John Allen: Optimizing Compilers for Modern Architectures: A
Dependence-based Approach (2001)
• Zima, H. and Chapman, B., Supercompilers for parallel and vector computers (1990)
• Scheduling and automatic Parallelization, Darte, A., Robert Y., and Vivien, F., (2000)
Survey Papers:
• Rudolf Eigenmann and Jay Hoeflinger, Parallelizing and Vectorizing Compilers, Wiley
Encyclopedia of Electrical Engineering, John Wiley & Sons, Inc., 2001
• Utpal Banerjee, Rudolf Eigenmann, Alexandru Nicolau, and David Padua. Automatic
Program Parallelization. Proceedings of the IEEE, 81(2), February 1993.
• David F. Bacon, Susan L. Graham, and Oliver J. Sharp, Compiler transformations for
high-performance computing, ACM Computing Surveys (CSUR), Volume 26, Issue 4,
December 1994, pages 345-420.
EE663, Spring 2012 Slide 7
Course Approach
There are many schools on optimizing compilers.
Our approach is performance-driven.
We will discuss:
– Performance of parallelization techniques
– Analysis and Transformation techniques in the
Cetus compiler (for multiprocessors/cores)
– Additional transformations (for GPGPUs and other
architectures)
– Compiler infrastructure considerations
EE663, Spring 2012 Slide 8
The Heart of Automatic
Parallelization
Data Dependence Testing
If a loop does not have data dependences
between any two iterations then it can be
safely executed in parallel
In science/engineering applications, loop
parallelism is most important. In non-
numerical programs other control structures
are also important
EE663, Spring 2012 Slide 9
Data Dependence Tests:
Motivating Examples
Loop Parallelization: Can the iterations of this loop be run concurrently?
   DO i=1,100,2
      B(2*i) = ...
      ... = B(2*i) + B(3*i)
   ENDDO
DD testing is needed to detect parallelism.

Statement Reordering: Can these two statements be swapped?
   DO i=1,100,2
      B(2*i) = ...
      ... = B(3*i)
   ENDDO
DD testing is important not just for detecting parallelism.
A data dependence exists between two adjacent data references iff:
• both references access the same storage location and
• at least one of them is a write access
EE663, Spring 2012 Slide 10
Data Dependence Tests: Concepts
Terms for data dependences between statements of loop iterations.
• Distance (vector): indicates how many iterations apart are source
and sink of dependence.
• Direction (vector): is basically the sign of the distance. There are
different notations: (<,=,>) or (-1,0,+1) meaning dependence (from
earlier to later, within the same, from later to earlier) iteration.
• Loop-carried (or cross-iteration) dependence vs. non-loop-carried
(or loop-independent) dependence: indicates whether a dependence
exists across iterations or within one iteration.
– For detecting parallel loops, only cross-iteration dependences matter.
– Loop-independent ("=" direction) dependences are relevant for optimizations
such as statement reordering and loop distribution.
EE663, Spring 2004 Slide 11
Data Dependence Tests: Concepts
• Iteration space graphs: the un-abstracted form of a dependence
graph, with one node per statement instance.
Example:
   DO i=1,n
      DO j=1,m
         a(i,j) = a(i-1,j-2)+b(i,j)
      ENDDO
   ENDDO
Each iteration (i,j) reads the element written in iteration (i-1,j-2), so the
dependence has distance vector (1,2) and direction vector (<,<).
[Iteration space graph: nodes laid out along the i and j axes; arrows show the
dependences and the sequential execution order.]
EE663, Spring 2004 Slide 12
Data Dependence Tests:
Formulation of the
Data-dependence problem
   DO i=1,n
      a(4*i) = . . .
      . . . = a(2*i+1)
   ENDDO
The question to answer:
   can 4*i1 ever be equal to 2*i2+1 within i1, i2 ∈ [1,n] ?
Note that the iterations at which the two expressions are equal
may differ. To express this fact, we choose the notation i1, i2.
Let us generalize a bit: given
• two subscript functions f and g, and
• loop bounds lower, upper,
Does
f(i1) = g(i2) have a solution such that
lower ≤ i1, i2 ≤ upper ?
EE663, Spring 2004 Slide 13
This course would now be finished if:
• the mathematical formulation of the data dependence
problem had an accurate and fast solution, and
• there were enough loops in programs without data
dependences, and
• dependence-free code could be executed by today’s
parallel machines directly and efficiently, and
• engineering these techniques into a production
compiler were straightforward.
There are enough hard problems to fill several courses!
EE663, Spring 2012 Slide 14
II. Performance of Basic
Automatic Program
Parallelization
EE663, Spring 2012 Slide 15
Two Decades of Parallelizing
Compilers
A performance study at the beginning of the 1990s (the Blume study)
Analyzed the performance of state-of-the-art parallelizers and
vectorizers using the Perfect Benchmarks.
William Blume and Rudolf Eigenmann, Performance Analysis of
Parallelizing Compilers on the Perfect Benchmarks Programs, IEEE
Transactions on Parallel and Distributed Systems, 3(6), November 1992,
pages 643--656.
Good reasons for starting two decades back:
• We will learn simple techniques first.
• We will see how parallelization techniques have evolved
• We will see that extensions of the important techniques back then are still the
important techniques today.
EE663, Spring 2012 Slide 16
Overall Performance
of parallelizers in 1990
Speedup on
8 processors
with 4-stage
vector units
EE663, Spring 2012 Slide 17
Performance of Individual Techniques
EE663, Spring 2012 Slide 18
Transformations measured in
the “Blume Study”
• Scalar expansion
• Reduction parallelization
• Induction variable substitution
• Loop interchange
• Forward Substitution
• Stripmining
• Loop synchronization
• Recurrence substitution
EE663, Spring 2012 Slide 19
Scalar Expansion and Privatization

Serial loop with output, flow, and anti dependences on t:
   DO j=1,n
      t = a(j)+b(j)
      c(j) = t + t**2
   ENDDO

Privatization:
   DO PARALLEL j=1,n
      PRIVATE t
      t = a(j)+b(j)
      c(j) = t + t**2
   ENDDO

Expansion:
   DO PARALLEL j=1,n
      t0(j) = a(j)+b(j)
      c(j) = t0(j) + t0(j)**2
   ENDDO

We assume a shared-memory model:
• by default, data is shared, i.e., all processors can see and modify it
• processors share the work of parallel loops
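A minimal C/OpenMP sketch of the same privatization, assuming arrays a, b, c of
length n (all names are illustrative); making t private removes the loop-carried
anti and output dependences:

   #include <omp.h>

   void privatized(const double *a, const double *b, double *c, int n)
   {
       double t;
       /* t is private to each thread, so no two iterations share it */
       #pragma omp parallel for private(t)
       for (int j = 0; j < n; j++) {
           t = a[j] + b[j];
           c[j] = t + t * t;
       }
   }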
EE663, Spring 2012 Slide 20
Parallel Loop Syntax and
Semantics in OpenMP
C:
   #pragma omp parallel for
   for (i=lb; i<=ub; i++) {
      <loop body code>
   }

Fortran:
   !$OMP PARALLEL PRIVATE(<private data>)
   <preamble code>
   !$OMP DO
   DO i = lb, ub
      <loop body code>
   ENDDO
   !$OMP END DO
   <postamble code>
   !$OMP END PARALLEL

The same code is executed by all participating processors (threads);
the work (iterations) of the loop is shared by the participating threads.
EE663, Spring 2012 Slide 21
Reduction Parallelization
Serial loop with a loop-carried flow (and anti) dependence on sum:
   DO j=1,n
      sum = sum + a(j)
   ENDDO

Using the OpenMP reduction clause:
   !$OMP PARALLEL DO
   !$OMP+REDUCTION(+:sum)
   DO j=1,n
      sum = sum + a(j)
   ENDDO

Transformed (privatized) reduction:
   !$OMP PARALLEL PRIVATE (s)
   s=0
   !$OMP DO
   DO j=1,n
      s = s + a(j)
   ENDDO
   !$OMP END DO
   !$OMP ATOMIC
   sum=sum+s
   !$OMP END PARALLEL
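For comparison, a hedged C/OpenMP sketch of the same reduction using the
reduction clause (array a and length n are illustrative):

   double sum_reduction(const double *a, int n)
   {
       double sum = 0.0;
       /* the compiler/runtime creates the private partial sums and the
          final combining step shown above */
       #pragma omp parallel for reduction(+:sum)
       for (int i = 0; i < n; i++)
           sum = sum + a[i];
       return sum;
   }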
EE663, Spring 2012 Slide 22
Induction Variable Substitution
Serial loop with a loop-carried flow dependence on ind:
   ind = ind0
   DO j = 1,n
      a(ind) = b(j)
      ind = ind+k
   ENDDO

After substitution:
   ind = ind0
   DO PARALLEL j = 1,n
      a(ind0+k*(j-1)) = b(j)
   ENDDO

Note, this is the reverse of strength reduction, an important
transformation in classical (code-generating) compilers:
   real d(20,100)
   DO j=1,n
      d(1,j)=0
   ENDDO

Address computation in closed form:
   loop:
      ...
      R0 ← &d+20*j
      (R0) ← 0
      ...
      jump loop

After strength reduction:
   R0 ← &d
   loop:
      ...
      (R0) ← 0
      ...
      R0 ← R0+20
      jump loop
EE663, Spring 2012 Slide 23
Forward Substitution
Before (dependences through m, a, and b):
   m = n+1
   …
   DO j=1,n
      a(j) = a(j+m)
   ENDDO

   a = x*y
   b = a+2
   c = b+4

After forward substitution (no dependences):
   m = n+1
   …
   DO j=1,n
      a(j) = a(j+n+1)
   ENDDO

   a = x*y
   b = x*y+2
   c = x*y + 6
EE663, Spring 2012 Slide 24
Stripmining
[The iteration space 1..n is divided into strips of size strip.]

Original loop:
   DO j=1,n
      a(j) = b(j)
   ENDDO

Stripmined loop:
   DO i=1,n,strip
      DO j=i,min(i+strip-1,n)
         a(j) = b(j)
      ENDDO
   ENDDO

There are many variants of stripmining
(sometimes called loop blocking)
EE663, Spring 2012 Slide 25
Loop Synchronization
Serial loop:
   DO j=1,n
      a(j) = b(j)
      c(j) = a(j)+a(j-1)
   ENDDO

With loop synchronization:
   DOACROSS j=1,n
      a(j) = b(j)
      post(current_iteration)
      wait(current_iteration-1)
      c(j) = a(j)+a(j-1)
   ENDDO
EE663, Spring 2012 Slide 26
Recurrence Substitution
   DO j=1,n
      a(j) = c0+c1*a(j)+c2*a(j-1)+c3*a(j-2)
   ENDDO
is replaced by a call to a parallel recurrence solver:
   call rec_solver(a(1),n,c0,c1,c2,c3)

Basic idea of the recurrence solver, shown for
   DO j=1,40
      a(j) = a(j) + a(j-1)
   ENDDO
The iteration space is split into chunks that are solved concurrently:
   DO j=1,10 | DO j=11,20 | DO j=21,30 | DO j=31,40
and the chunks are then corrected for the error introduced at their boundaries:
   Error: 0 | ∆a(10) | ∆a(10)+∆a(20) | ∆a(10)+∆a(20)+∆a(30)
EE663, Spring 2012 Slide 27
Loop Interchange
Before:
   DO i=1,n
      DO j=1,m
         a(i,j) = a(i,j)+a(i,j-1)
      ENDDO
   ENDDO

After interchange:
   DO j=1,m
      DO i=1,n
         a(i,j) = a(i,j)+a(i,j-1)
      ENDDO
   ENDDO
• stride-1 references increase cache locality
– read: increase spatial locality
– write: avoid false sharing
• scheduling of outer loop is important (consider original loop nest):
– cyclic: no locality w.r.t. the i loop
– block schedule: there may be some locality
– dynamic scheduling: chunk scheduling desirable
• cache organization is important
• parallelism at outer position reduces loop fork/join overhead
EE663, Spring 2012 Slide 28
Effect of Loop Interchange
Example: speedups of the most time-consuming loops
in the ARC2D benchmark on a 4-core machine, with
loop interchange applied in the process of parallelization.
[Bar chart of speedups (0 to 10) for the loops STEPFX DO230, STEPFX DO210,
XPENTA DO11, and FILERX DO39.]
EE663, Spring 2012 Slide 29
Execution Scheme for Parallel Loops
1. Architecture supports parallel loops. Example: Alliant FX/8 (1980s)
   – machine instruction for parallel loop
   – HW concurrency bus supports loop scheduling

Source:
   a=0
   ! DO PARALLEL
   DO i=1,n
      b(i) = 2
   ENDDO
   b=3

Generated code (D7 is reserved for the loop variable; it starts at 0):
   store #0,<a>
   load <n>,D6
   sub 1,D6
   load &b,A1
   cdoall D6
      store #2,A1(D7.r)
   endcdoall
   store #3,<b>
EE663, Spring 2012 Slide 30
Execution Scheme for Parallel Loops
2. Microtasking scheme (dates back to early IBM mainframes)

[Diagram: processors p1–p4 alternate between sequential sections (only p1 works)
and parallel sections; helper tasks are created once (init_helper_tasks), woken up
at the start of each parallel loop (wakeup_helpers), and put back to sleep at its
end (sleep_helpers).]

Problem: parallel loop startup must be very fast.
   microtask startup: ~1 µs
   pthreads startup: up to 100 µs
EE663, Spring 2012 Slide 31
Compiler Transformation and Runtime
Function for the Microtasking Scheme
Source:
   a=0
   ! DO PARALLEL
   DO i=1,n
      b(i) = 2
   ENDDO
   b=3

Transformed code:
   call init_microtasking()   // once at program start
   ...
   a=0
   call loop_scheduler(loopsub,i,1,n,b)
   b=3

   subroutine loopsub(mytask,lb,ub,b)
   DO i=lb,ub
      b(i) = 2
   ENDDO
   END

Runtime functions:
   Master task (loop_scheduler):
      partition loop iterations
      fill in the control blocks (shared data): loopsub, lb, ub, sh_var, flag
      wakeup helpers
      call loopsub(...)
      barrier (all flags reset)
      return

   Helper task:
      loop:
         wait for flag
         call loopsub(id,lb,ub,sh_var)
         reset flag
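A hedged C/pthreads sketch of such a microtasking-style runtime. The helpers are
created once and woken per parallel loop by setting a flag; all names
(NUM_HELPERS, loopsub, loop_scheduler, ...) are illustrative, and the busy-wait
synchronization is simplified compared to a production scheduler:

   #include <pthread.h>
   #include <stdatomic.h>
   #include <stdio.h>

   #define NUM_HELPERS 3                 /* master + 3 helpers = 4 workers */
   #define N 100

   static double b[N + 1];

   static void loopsub(int lb, int ub)   /* compiler-generated loop body   */
   {
       for (int i = lb; i <= ub; i++)
           b[i] = 2.0;
   }

   /* control blocks (shared data) */
   static atomic_int go[NUM_HELPERS];    /* wakeup flag per helper         */
   static atomic_int done_count;         /* helpers done with their chunk  */
   static int lbs[NUM_HELPERS], ubs[NUM_HELPERS];

   static void *helper(void *arg)
   {
       int id = *(int *)arg;
       for (;;) {                        /* helpers live for the whole run */
           while (!atomic_load(&go[id])) /* wait for flag                  */
               ;
           atomic_store(&go[id], 0);     /* reset flag                     */
           loopsub(lbs[id], ubs[id]);
           atomic_fetch_add(&done_count, 1);
       }
       return NULL;
   }

   static void loop_scheduler(int lb, int ub)      /* called by the master */
   {
       int total = ub - lb + 1;
       int chunk = (total + NUM_HELPERS) / (NUM_HELPERS + 1);
       atomic_store(&done_count, 0);
       int next = lb;
       for (int id = 0; id < NUM_HELPERS; id++) {  /* partition iterations */
           lbs[id] = next;
           ubs[id] = (next + chunk - 1 > ub) ? ub : next + chunk - 1;
           next = ubs[id] + 1;
           atomic_store(&go[id], 1);               /* wakeup helper        */
       }
       loopsub(next, ub);                /* master executes the last chunk */
       while (atomic_load(&done_count) < NUM_HELPERS)   /* barrier         */
           ;
   }

   int main(void)
   {
       pthread_t t[NUM_HELPERS];
       int ids[NUM_HELPERS];
       for (int id = 0; id < NUM_HELPERS; id++) {  /* init_microtasking()  */
           ids[id] = id;
           pthread_create(&t[id], NULL, helper, &ids[id]);
       }
       loop_scheduler(1, N);             /* the parallel loop DO i=1,n     */
       printf("b[50] = %g\n", b[50]);
       return 0;
   }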
EE663, Spring 2012 Slide 32
III. Performance of Advanced
Parallelization
EE663, Spring 2012 Slide 33
Manual Improvements of the
Perfect Benchmarks (1995)
Same information as on Slide 17.

Rudolf Eigenmann, Jay Hoeflinger, and David Padua,
On the Automatic Parallelization of the Perfect Benchmarks.
IEEE Transactions on Parallel and Distributed Systems,
volume 9, number 1, January 1998, pages 5-23.

[Bar chart of speedups; annotations: (a) eliminated file I/O,
(b) parallelized random number generator.]
EE663, Spring 2012 Slide 34
Performance of Individual
Techniques in Manually
Improved Programs (1995)
Performance loss when disabling individual techniques (Cedar machine)
EE663, Spring 2012 Slide 35
Overall Performance of the
Cetus and ICC Compilers (2011)
[Bar chart of speedups]
NAS (Class A) Benchmarks on 8-core x86 processor
EE663, Spring 2012 Slide 36
Performance of Individual
Cetus Techniques (2011)
[Bar chart of speedups]
NAS Benchmarks (Class A) on 8-core x86 processor
EE663, Spring 2012 Slide 37
IV. Analysis and
Transformation Techniques
• 1 Data-dependence analysis
• 2 Parallelism enabling transformations
• 3 Techniques for multiprocessors/multicores
• 4 Advanced program analysis
• 5 Dynamic decision making
• 6 Techniques for vector architectures
• 7 Techniques for heterogeneous multicores
• 8 Techniques for distributed-memory machines
EE663, Spring 2012 Slide 38
IV.1 Data Dependence Testing
Earlier, we have considered the simple case of a
1-dimensional array enclosed by a single loop:
   DO i=1,n
      a(4*i) = . . .
      . . . = a(2*i+1)
   ENDDO
The question to answer:
   can 4*i1 ever be equal to 2*i2+1 within i1, i2 ∈ [1,n] ?
In general: given
• two subscript functions f and g and
• loop bounds lower, upper.
Does
f(i1) = g(i2) have a solution such that
lower ≤ i1, i2 ≤ upper ?
EE663, Spring 2012 Slide 39
DDTests: doubly-nested loops
• Multiple loop indices:
DO i=1,n
DO j=1,m
X(a1*i + b1*j + c1) = . . .
. . . = X(a2*i + b2*j + c2)
ENDDO
ENDDO
dependence problem:
a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1
1 ≤ i1, i2 ≤ n
1 ≤ j1, j2 ≤ m
Almost all DD tests expect the coefficients ax to be integer constants.
Such subscript expressions are called affine.
EE663, Spring 2012 Slide 40
DDTests: even more complexity
• Multiple loop indices, multi-dimensional array:
DO i=1,n
DO j=1,m
X(a1*i + b1*j + c1, d1*i + e1*j + f1) = . . .
. . . = X(a2*i + b2*j + c2, d2*i + e2*j + f2)
ENDDO
ENDDO
dependence problem:
a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1
d1*i1 - d2*i2 + e1*j1 - e2*j2 = f2 - f1
1 ≤ i1, i2 ≤ n
1 ≤ j1, j2 ≤ m
EE663, Spring 2012 Slide 41
Data Dependence Tests:
The Simple Case
Note: variables i1, i2 are integers → diophantine equations.
The equation a*i1 - b*i2 = c has a solution if and only if
gcd(a,b) (evenly) divides c.
In our example this means: gcd(4,2)=2, which does not
divide 1, and thus there is no dependence.
If there is a solution, we can test if it lies within the loop
bounds. If not, then there is no dependence.
EE663, Spring 2012 Slide 42
Performing the GCD Test
• The diophantine equation
a1*i1 + a2*i2 +...+ an*in = c
has a solution iff gcd(a1,a2,...,an) evenly divides c
Examples:
15*i +6*j -9*k = 12 has a solution gcd=3
2*i + 7*j = 3 has a solution gcd=1
9*i + 3*j + 6*k = 5 has no solution gcd=3
Euclid's algorithm: find gcd(a,b)
   Repeat
      a ← a mod b
      swap a,b
   Until b=0   → the resulting a is the gcd
For more than two numbers: gcd(a,b,c) = gcd(a,gcd(b,c))
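A minimal C sketch of the GCD test under the formulation above (the coefficient
names a1, a2 and the example values are illustrative):

   #include <stdio.h>
   #include <stdlib.h>

   static int gcd(int a, int b)              /* Euclid's algorithm */
   {
       a = abs(a); b = abs(b);
       while (b != 0) { int t = a % b; a = b; b = t; }
       return a;
   }

   /* Returns 1 if a1*i1 - a2*i2 = c can have an integer solution
      (a dependence is possible), 0 if the GCD test proves independence. */
   static int gcd_test(int a1, int a2, int c)
   {
       int g = gcd(a1, a2);
       if (g == 0) return c == 0;            /* both coefficients are zero */
       return c % g == 0;
   }

   int main(void)
   {
       /* a(4*i) = ...  versus  ... = a(2*i+1):  4*i1 - 2*i2 = 1 */
       printf("dependence possible: %d\n", gcd_test(4, 2, 1));   /* prints 0 */
       return 0;
   }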
EE663, Spring 2012 Slide 43
Other Data Dependence Tests
• The GCD test is simple but not accurate
• Other tests
– Banerjee(-Wolfe) test: widely used test
– Power Test: improvement over Banerjee test
– Omega test: “precise” test, most accurate for
linear subscripts
– Range test: handles non-linear and symbolic
subscripts
– many variants of these tests
EE663, Spring 2012 Slide 44
The Banerjee(-Wolfe) Test
Basic idea:
if the total subscript range accessed by ref1
does not overlap with the range accessed
by ref2, then ref1 and ref2 are
independent.
   DO j=1,100
      a(j) = …          ranges accessed: [1:100]
      … = a(j+200)                       [201:300]
   ENDDO
The ranges do not overlap ⇒ independent.
EE663, Spring 2012 Slide 45
Mathematical Formulation of the
Test – Banerjee’s Inequalities
Example from the previous slide: a dependence requires j1 - j2 = 200, but over
1 ≤ j1, j2 ≤ 100 the difference is bounded by
   Min: 1-100 = -99     Max: 100-1 = 99
so the equation has no solution within the bounds.

The general case of a doubly-nested loop and a single subscript, as shown on Slide 40:
   a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1
Assuming positive coefficients:
   Min of a1*i1 - a2*i2 : a1 - a2*n      Max: a1*n - a2
   Min of b1*j1 - b2*j2 : b1 - b2*m      Max: b1*m - b2
A dependence is possible only if c2 - c1 lies between the sum of the Mins
and the sum of the Maxs.

Multiple dimensions: apply the test separately on each subscript, or linearize.
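A hedged C sketch of this bounds check for the two-loop, single-subscript case,
ignoring direction vectors (helper names term_min/term_max and
banerjee_independent are illustrative):

   static long term_min(long coef, long lb, long ub)  /* min of coef*x on [lb,ub] */
   {
       return coef >= 0 ? coef * lb : coef * ub;
   }
   static long term_max(long coef, long lb, long ub)  /* max of coef*x on [lb,ub] */
   {
       return coef >= 0 ? coef * ub : coef * lb;
   }

   /* Returns 1 if the bounds prove independence for
      a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1, 1<=i1,i2<=n, 1<=j1,j2<=m. */
   static int banerjee_independent(long a1, long a2, long b1, long b2,
                                   long c1, long c2, long n, long m)
   {
       long lo = term_min(a1,1,n) + term_min(-a2,1,n)
               + term_min(b1,1,m) + term_min(-b2,1,m);
       long hi = term_max(a1,1,n) + term_max(-a2,1,n)
               + term_max(b1,1,m) + term_max(-b2,1,m);
       long rhs = c2 - c1;
       return rhs < lo || rhs > hi;
   }

For the a(j)/a(j+200) example above, banerjee_independent(1,1,0,0, 0,200, 100,1)
computes the bounds [-99, 99] and reports independence.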
EE663, Spring 2012 Slide 46
Banerjee(-Wolfe) Test continued
Weakness of the test:
Consider this flow dependence
   DO j=1,100
      a(j) = …          ranges accessed: [1:100]
      … = a(j+5)                         [6:105]
   ENDDO
The ranges overlap, so the test cannot report independence — yet is there
really a dependence that matters?

We did not take into consideration that only loop-carried
dependences matter for parallelization.
A loop-carried flow dependence only exists if a write in some
iteration, j1, conflicts with a read in some later iteration, j2 > j1.
EE663, Spring 2012 Slide 47
Using Dependence Direction Information
in the Banerjee(-Wolfe) Test
Idea for overcoming the weakness:
for loop-carried dependences, make use of the fact
that j in ref2 is greater than j in ref1.

Still considering the potential flow dependence from a(j) to a(j+5):
   DO j=1,100
      a(j) = …
      … = a(j+5)
   ENDDO
Ranges accessed by iteration j1 and any other iteration j2, where j1 < j2:
   a(j) in iteration j1:             [j1]
   a(j+5) in iterations j2 > j1:     [j1+6:105]
⇒ independent for the “>” direction

Clearly, this loop has a dependence. But it is
an anti-dependence from a(j+5) to a(j).
This is commonly referred to as the
Banerjee test with direction vectors.
EE663, Spring 2012 Slide 48
DD Testing with Direction Vectors
Considering direction vectors can increase the complexity of the DD test
substantially. For long vectors (corresponding to deeply-nested
loops), there are many possible combinations of directions.
A direction vector (d1,d2,…,dn) has one entry per loop of the nest; each entry
di is one of <, =, > or * (meaning "any direction"). The possible vectors form
a tree rooted at (*,*,…,*).
A possible algorithm:
1. try (*,*…*) , i.e., do not consider directions
2. (if not independent) try (<,*,*…*), (=,*,*…*)
3. (if still not independent) try (<,<,*…*),(<,>,*…*) ,(<,=,*…*)
(=,=,*…*), (=,<,*…*)
...
(This forms a tree)
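A hedged, self-contained C sketch of this tree refinement. The underlying test
dd_test() is a stub standing in for, e.g., a Banerjee test restricted to a
direction vector, and all names are illustrative:

   #include <stdio.h>

   enum Dir { ANY, LT, EQ, GT };             /* '*', '<', '=', '>' */

   /* Stand-in for the real test; the stub always answers "dependence possible"
      so that the traversal of the tree is visible when running the sketch.   */
   static int dd_test(const enum Dir dv[], int n) { (void)dv; (void)n; return 1; }

   static void report_dependence(const enum Dir dv[], int n)
   {
       static const char sym[] = { '*', '<', '=', '>' };
       for (int i = 0; i < n; i++) putchar(sym[dv[i]]);
       putchar('\n');
   }

   static void refine(enum Dir dv[], int level, int n)
   {
       if (!dd_test(dv, n)) return;          /* independent under dv: prune subtree */
       if (level == n) { report_dependence(dv, n); return; }
       /* try '<', '=', '>' at this level (a real driver can skip the redundant
          '>' branch at the outermost level, as in the algorithm above) */
       for (int d = LT; d <= GT; d++) {
           dv[level] = (enum Dir)d;
           refine(dv, level + 1, n);
       }
       dv[level] = ANY;                      /* restore '*' for the caller */
   }

   int main(void)
   {
       enum Dir dv[2] = { ANY, ANY };        /* doubly-nested loop: start at (*,*) */
       refine(dv, 0, 2);                     /* prints the 9 fully refined vectors */
       return 0;
   }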
EE663, Spring 2012 Slide 49
Data-dependence Test Driver
procedure DataDependenceAnalysis( PROG )
input : Program representing all source files: PROG
output : Data dependence graph containing dependence arcs DDG
// Collect all FOR loops meeting eligibility
// Checks: Canonical, FunctionCall, ControlFlowModifier
ELIGIBLE_LOOPS = getOutermostEligibleLoops( PROG )
foreach LOOP in ELIGIBLE_LOOPS
// Obtain lower bounds, upper bounds and loop steps
// for this loop and all enclosed loops i.e. the loop-nest
// Substitute symbolic information if available,
LOOP_INFO = collectLoopInformation( LOOP and enclosed nest )
// Collect all array access expressions appearing within the
// body of this loop, this includes enclosed loops and non-perfectly
// nested statements
ACCESSES = collectArrayAccesses( LOOP and enclosed nest )
// Traverse all array accesses, test relevant pairs and
// create a set of dependence arcs for the loop-nest
LOOP_DDG = runDependenceTest( LOOP_INFO, ACCESSES )
// Add loop dependence graph to the program-wide DDG
// The program-wide DDG is initially empty
DDG += LOOP_DDG
// return the program-wide data dependence graph once all loops are done
return DDG
Slide 50
Data-dependence Test Driver (continued)
procedure runDependenceTest( LOOP_INFO, ACCESSES )
input : Loop information for the current loop nest LOOP_INFO
List of array access expressions, ACCESSES
output : Loop data dependence graph LOOP_DDG
foreach ARRAY_1 in ACCESSES of type write
// Obtain alias information i.e. aliases to this array name
// Alias information in Cetus is generated through points-to analysis
ALIAS_SET = getAliases( ARRAY_1 )
// Collect all expressions/references to the same array from the entire list of accesses
TEST_LIST = getOtherReferences( ALIAS_SET, ACCESSES )
foreach ARRAY_2 in TEST_LIST
// Obtain the common loops enclosing the pair
COMMON_NEST = getCommonNest( ARRAY_1, ARRAY_2 )
// Possibly empty set of direction vectors under which
// dependence exists is returned by the test
DV_SET = testAccessPair( ARRAY_1, ARRAY_2, COMMON_NEST, LOOP_INFO )
foreach DV in DV_SET
// Create arc from source to sink
DEP_ARC = buildDependenceArc( ARRAY_1, ARRAY_2, DV )
// Build the loop dependence graph by accumulating all arcs
LOOP_DDG += DEP_ARC
// All expressions have been tested, return the loop dependence graph
return LOOP_DDG
Slide 51
Data-dependence Test Driver (continued)
procedure testAccessPair( A1, A2, COMMON_NEST, LOOP_INFO)
input : Pair of array accesses to be tested A1 and A2
Nest of common enclosing loops COMMON_NEST
Information for these loops LOOP_INFO
output : Possibly empty set of direction vectors under
which dependence exists DV_SET
// Partition the subscripts of the array accesses into dimension pairs
// Coupled subscripts may be handled
PARTITIONS = partitionSubscripts( A1, A2, COMMON_NEST )
foreach PARTITION in PARTITIONS
// Depending on the number of loop index variables in the partition,
// use the corresponding test.
if( ZIV ) // zero index variables ZIV
DVs = simpleZIVTest( PARTITION )
else // single or multi-loop index variables: SIV, MIV
// traverse and prune over tree of direction vectors, collect DVs where
// dependence exists (traversal not shown here)
foreach DV in DV_TREE using prune
// In Cetus, the MIV test is performed using Banerjee or Range test
DVs += MIVTest( PARTITION, DV, COMMON_NEST, LOOP_INFO )
// Merge DVs for all partitions
DV_SET = merge( DVs )
return DV_SET
Slide 52
Non-linear and Symbolic DD Testing
Weakness of most data dependence tests:
subscripts and loop bounds must be affine,
i.e., linear with integer-constant coefficients
Approach of the Range Test:
capture subscript ranges symbolically
compare ranges: find their upper and lower bounds
by determining monotonicity. Monotonically
increasing/decreasing ranges can be compared by
comparing their upper and lower bounds.
EE663, Spring 2012 Slide 53
The Range Test
Basic idea :
1. Find the range of array accesses made in a given
loop iteration j => r(j).
2. If r(j) does not overlap with r(j+1) then there is no
cross-iteration dependence
Symbolic comparison of ranges r1 and r2:
max(r1)<min(r2) OR min(r1)>max(r2) => no overlap
Example: testing independence of the outer loop:
   DO i=1,n
      DO j=1,m
         A(i*m+j) = 0
      ENDDO
   ENDDO
Range of A accessed in iteration ix:    [ix*m+1 : (ix+1)*m]      (upper bound ubx)
Range of A accessed in iteration ix+1:  [(ix+1)*m+1 : (ix+2)*m]  (lower bound lbx+1)
ubx < lbx+1 ⇒ no cross-iteration dependence
EE663, Spring 2012 Slide 54
Range Test continued
(we need powerful expression manipulation and comparison utilities)

   DO i1=L1,U1
      ...
      DO in=Ln,Un
         A(f(i1,...,in)) = ...
         ... = A(g(i1,...,in))
      ENDDO
      ...
   ENDDO

Assume f,g are monotonically increasing w.r.t. all ix.
To find the upper bound of the access range at loop k, 1<k<n:
successively substitute ix with Ux, x = {n, n-1, ..., k-1};
the lower bound is computed analogously.
If f,g are monotonically decreasing w.r.t. some iy,
then substitute Ly when computing the upper bound.

Determining monotonicity: consider d = f(...,ik,...) - f(...,ik-1,...)
   If d>0 (for all values of ik) then f is monotonically increasing w.r.t. k
   If d<0 (for all values of ik) then f is monotonically decreasing w.r.t. k
(this is where we need range analysis)

What about symbolic coefficients?
• in many cases they cancel out
• if not, find their range (i.e., all possible values they can assume at this point
  in the program), and replace them by the upper or lower bound of the range.
EE663, Spring 2012 Slide 55
Handling Non-contiguous
Ranges
   DO i1=1,u1
      DO i2=1,u2
         A(n*i1+m*i2) = …
      ENDDO
   ENDDO

The basic Range Test finds independence of the outer loop
if n >= u2 and m=1, but not if n=1 and m >= u1.

Idea:
- temporarily (during program analysis) interchange the loops,
- test independence,
- interchange back
Issues:
• legality of loop interchanging,
• change of parallelism as a result of loop interchanging
EE663, Spring 2012 Slide 56
Some Engineering Tasks and
Questions for DD Test Pass Writers
- Start with the simple case: linear (affine) subscripts, single nests with 1-dim arrays. Subscript
and loop bounds are integer constants. Stride 1 loop, lower bound =1
- Deal with multiple array dims and loop nests
- Add capabilities for non-stride-1 loops and lower bounds ≠1
- How to deal with symbolic subscript coefficients and bounds
- Ignore dependences in private variables and reductions
- Generate DD vectors
- Mark parallel loops
- Things to think about:
-- how to handle loop-variant coefficients
-- how to deal with private, reduction, induction variables
-- how to represent DD information
-- how to display the DD info
-- how to deal with non-parallelizable loops (IO op, function calls, other?)
-- how to find eligible DO loops?
-- how to find eligible loop bounds, array subscripts?
-- what is the result of the pass? Generate DD info or set parallel loop flags?
-- what symbolic analysis capabilities are needed?
EE663, Spring 2012 Slide 57
Data-Dependence Test, References
• Banerjee/Wolfe test
– M. Wolfe, U. Banerjee, "Data Dependence and its Application to Parallel
Processing", Int. J. of Parallel Programming, Vol. 16, No. 2, pp. 137-178,
1987
• Power Test
– M. Wolfe and C.W. Tseng, The Power Test for Data Dependence, IEEE
Transactions on Parallel and Distributed Systems, IEEE Computer Society,
3(5), 591-601, 1992.
• Range test
– William Blume and Rudolf Eigenmann. Non-Linear and Symbolic Data
Dependence Testing, IEEE Transactions on Parallel and Distributed
Systems, Volume 9, Number 12, pages 1180-1194, December 1998.
• Omega test
– William Pugh. The Omega test: a fast and practical integer programming
algorithm for dependence analysis. Proceedings of the 1991 ACM/IEEE
Conference on Supercomputing, 1991.
• I Test
– Xiangyun Kong, David Klappholz, and Kleanthis Psarris, "The I Test: A New
Test for Subscript Data Dependence," Proceedings of the 1990 International
Conference on Parallel Processing, Vol. II, pages 204-211, August 1990.
EE663, Spring 2012 Slide 58
IV.2 Parallelism Enabling
Techniques
EE663, Spring 2012 Slide 59
Advanced Privatization
Scalar privatization (removes the loop-carried anti dependence on t):
   DO i=1,n
      t = A(i)+B(i)
      C(i) = t + t**2
   ENDDO
becomes
   !$OMP PARALLEL DO
   !$OMP+PRIVATE(t)
   DO i=1,n
      t = A(i)+B(i)
      C(i) = t + t**2
   ENDDO

Array privatization:
   DO j=1,n
      t(1:m) = A(j,1:m)+B(j)
      C(j,1:m) = t(1:m) + t(1:m)**2
   ENDDO
becomes
   !$OMP PARALLEL DO
   !$OMP+PRIVATE(t)
   DO j=1,n
      t(1:m) = A(j,1:m)+B(j)
      C(j,1:m) = t(1:m) + t(1:m)**2
   ENDDO
EE663, Spring 2012 Slide 60
Array Privatization
Examples requiring array privatization:
   k=5
   DO j=1,n
      t(1:10) = A(j,1:10)+B(j)
      C(j,iv) = t(k)
      t(11:m) = A(j,11:m)+B(j)
      C(j,1:m) = t(1:m)
   ENDDO

   DO j=1,n
      IF (cond(j))
         t(1:m) = A(j,1:m)+B(j)
         C(j,1:m) = t(1:m) + t(1:m)**2
      ENDIF
      D(j,1) = t(1)
   ENDDO

Capabilities needed for Array Privatization:
• array Def-Use Analysis
• combining and intersecting subscript ranges
• representing subscript ranges
• representing conditionals under which sections are defined/used
• if ranges are too complex to represent: overestimate Uses, underestimate Defs
EE663, Spring 2012 Slide 61
Array Privatization continued
Array privatization algorithm:
• For each loop nest:
– iterate from innermost to outermost loop:
• for each statement in the loop
– Find array definitions; add them to the existing
definitions in this loop.
– find array uses; if they are covered by a definition,
mark this array section as privatizable for this loop,
otherwise mark it as upward-exposed in this loop;
• aggregate defined and upward-exposed uses (expand
from range per-iteration to entire iteration space); record
them as Defs and Uses for this loop
EE663, Spring 2012 Slide 62
Some Engineering Tasks and
Questions for Privatization Pass Writers
• Start with scalar privatization
• Next step: array privatization with simple ranges (contiguous; no range
merge) and singly-nested loops
• Deal with multiply-nested loops (-> range aggregation)
• Add capabilities for merging ranges
• Implement advanced range representation (symbolic bounds, non-
contiguous ranges)
• Deal with conditional definitions and uses (too advanced for this course)
• Things to think about
– what symbolic analysis capabilities are needed?
– how to represent advanced ranges?
– how to deal with loop-variant subscript terms?
– how to represent private variables?
EE663, Spring 2012 Slide 63
Array Privatization,
References
• Peng Tu and D. Padua. Automatic Array Privatization.
Languages and Compilers for Parallel Computing. Lecture
Notes in Computer Science 768, U. Banerjee, D. Gelernter, A.
Nicolau, and D. Padua (Eds.), Springer-Verlag, 1994.
• Zhiyuan Li, Array Privatization for Parallel Execution of Loops,
Proceedings of the 1992 ACM International Conference on
Supercomputing.
EE663, Spring 2012 Slide 64
Reduction Parallelization

Scalar reduction (loop-carried flow dependence on sum):
   DO i=1,n
      sum = sum + A(i)
   ENDDO

Note, OpenMP has a reduction clause, so only reduction recognition is needed:
   !$OMP PARALLEL DO
   !$OMP+REDUCTION(+:sum)
   DO i=1,n
      sum = sum + A(i)
   ENDDO

Privatized reduction implementation:
   !$OMP PARALLEL PRIVATE(s)
   s=0
   !$OMP DO
   DO i=1,n
      s=s+A(i)
   ENDDO
   !$OMP ATOMIC
   sum = sum+s
   !$OMP END PARALLEL

Expanded reduction implementation:
   DO i=1,num_proc
      s(i)=0
   ENDDO
   !$OMP PARALLEL DO
   DO i=1,n
      s(my_proc)=s(my_proc)+A(i)
   ENDDO
   DO i=1,num_proc
      sum=sum+s(i)
   ENDDO
EE663, Spring 2012                                               Slide 65
Parallelizing Array Reductions
Array Reductions (a.k.a. irregular or histogram reductions):
   DIMENSION sum(m)
   DO i=1,n
      sum(expr) = sum(expr) + A(i)
   ENDDO

Privatized reduction implementation:
   DIMENSION sum(m),s(m)
   !$OMP PARALLEL PRIVATE(s)
   s(1:m)=0
   !$OMP DO
   DO i=1,n
      s(expr)=s(expr)+A(i)
   ENDDO
   !$OMP ATOMIC
   sum(1:m) = sum(1:m)+s(1:m)
   !$OMP END PARALLEL

Expanded reduction implementation:
   DIMENSION sum(m),s(m,#proc)
   !$OMP PARALLEL DO
   DO i=1,m
      DO j=1,#proc
         s(i,j)=0
      ENDDO
   ENDDO
   !$OMP PARALLEL DO
   DO i=1,n
      s(expr,my_proc)=s(expr,my_proc)+A(i)
   ENDDO
   !$OMP PARALLEL DO
   DO i=1,m
      DO j=1,#proc
         sum(i)=sum(i)+s(i,j)
      ENDDO
   ENDDO

Note, OpenMP 1.0 does not support such array reductions.
EE663, Spring 2012 Slide 66
Recognizing Reductions
Recognition Criteria:
1. the loop may contain one or more reduction
statements of the form X=X ⊗ expr ,
where
• X is either scalar or an array expression, a[sub]
(sub must be the same on LHS and RHS)
• ⊗ is a reduction operation, such as +, *, min, max
2. X must not be used in any non-reduction statement
of the loop, nor in expr
EE663, Spring 2012 Slide 67
Reduction Recognition Algorithm

procedure RecognizeSumReductions (L)
Input : Loop L
Output: reduction annotations for loop L, inserted in the IR
REDUCTION = {} // set of candidate reduction expressions
REF = {} // set of non-reduction variables referenced in L
foreach stmt in L
localREFs = findREF(stmt) // gather all variables referenced in stmt
if (stmt is AssignmentStatement)
candidate = lhs_expr(stmt)
increment = rhs_expr(stmt) – candidate // symbolic subtraction
if ( !(baseSymbol(candidate) in findREF(increment)) ) // criterion1 is satisfied
REDUCTION = REDUCTION ∪ candidate
localREFs = findREF(increment) // all variables referenced in inc. expr.
REF = REF ∪ localREFs // collect non-reduction variables for criterion 2
foreach expr in REDUCTION
if ( ! (baseSymbol(expr) in REF) ) // criterion 2 is satisfied
if (expr is ArrayAccess AND expr.subscript is loop-variant)
CreateAnnotation(sum-reduction, ARRAY, expr)
else
CreateAnnotation(sum-reduction, SCALAR, expr)
end procedure Slide 68
Reduction Compiler Passes
Reduction recognition and parallelization compiler passes:
   Induction variable recognition
   Reduction recognition        ← recognizes and annotates reduction variables
   Privatization
   Data dependence test
   Loop parallelization
   <mapping passes>
   Profitability decision
   Reduction parallelization    ← performs the reduction transformation
EE663, Spring 2012 Slide 69
Performance Considerations
for Reduction Parallelization
• Parallelized reductions execute substantially more code than
their serial versions ⇒ overhead if the reduction (n) is small.
• In many cases (for large reductions) initialization and sum-up
are insignificant.
• False sharing can occur, especially in expanded reductions, if
multiple processors use adjacent array elements of the
temporary reduction array (s).
• Expanded reductions exhibit more parallelism in the sum-up
operation.
• Potential overhead in initialization, sum-up, and memory used
for large, sparse array reductions ⇒ compression schemes can
become useful.
EE663, Spring 2012 Slide 70
Induction Variable Substitution
Serial loop with a loop-carried flow dependence on ind:
   ind = k
   DO i=1,n
      ind = ind + 2
      A(ind) = B(i)
   ENDDO

After substitution:
   ind = k
   Parallel DO i=1,n
      A(k+2*i) = B(i)
   ENDDO

This is the simple case of an induction variable.
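A hedged C sketch of the same transformation (k, n, A, B are illustrative):
the "before" version carries a flow dependence on ind, while the closed-form
"after" version can run in parallel:

   void before(double *A, const double *B, int n, int k)
   {
       int ind = k;
       for (int i = 1; i <= n; i++) {
           ind = ind + 2;            /* induction statement */
           A[ind] = B[i];
       }
   }

   void after(double *A, const double *B, int n, int k)
   {
       #pragma omp parallel for
       for (int i = 1; i <= n; i++)
           A[k + 2*i] = B[i];        /* closed form: no cross-iteration dependence */
   }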
EE663, Spring 2012 Slide 71
Generalized Induction Variables
Increment depends on the loop index:
   ind=k
   DO j=1,n
      ind = ind + j
      A(ind) = B(j)
   ENDDO
becomes
   ind=k
   Parallel DO j=1,n
      A(k+(j**2+j)/2) = B(j)
   ENDDO

Further examples — coupled induction variables, and a triangular loop nest:
   DO i=1,n
      ind1 = ind1 + 1
      ind2 = ind2 + ind1
      A(ind2) = B(i)
   ENDDO

   DO i=1,n
      DO j=1,i
         ind = ind + 1
         A(ind) = B(i)
      ENDDO
   ENDDO
EE663, Spring 2012 Slide 72
Recognizing GIVs
• Pattern Matching:
– find induction statements in a loop nest of the form
iv=iv+expr or iv=iv*expr, where iv is a scalar integer.
– expr must be loop-invariant or another induction variable
(there must not be cyclic relationships among IVs)
– iv must not be assigned in a non-induction statement
• Abstract interpretation: find symbolic increments
of iv per loop iteration
• SSA-based recognition
EE663, Spring 2012 Slide 73
GIV Closed-form Computation and
Substitution Algorithm
Loop structure L0 (statement types: I = induction statement, L = inner loop, U = use of iv):
   For j: 1..ub
      …
      S1: iv=iv+exp          (type I)
      …
      S2: loop using iv      (type L)
      …
      S3: stmt using iv      (type U)
      …
   Rof

Step 1: find the increment relative to the start of loop L
   FindIncrement(L)
      inc=0
      foreach si of type I,L
         if type(si)=I   inc += exp
         else /* L */    inc += FindIncrement(si)
         inc_after[si] = inc
      inc_into_loop[L] = ∑_{1..j-1}(inc)    ; inc may depend on j
      return ∑_{1..ub}(inc)

Step 2: substitute IV
   Replace(L, initval)
      val = initval + inc_into_loop[L]
      foreach si of type I,L,U
         if type(si)=L    Replace(si, val)
         if type(si)=L,I  val = initval + inc_into_loop[L] + inc_after[si]
         if type(si)=U    Substitute(si, iv, val)

Main:
   totalinc = FindIncrement(L0)
   Replace(L0, iv)
   InsertStatement("iv = iv+totalinc")      ; insert this statement if iv is live-out

For coupled GIVs: begin with the independent iv.
EE663, Spring 2012 Slide 74
Induction Variables, References
• B. Pottenger and R. Eigenmann. Idiom Recognition in the Polaris
Parallelizing Compiler. ACM Int. Conf. on Supercomputing (ICS'95),
June 1995.
• Mohammad R. Haghighat, Constantine D. Polychronopoulos, Symbolic
analysis for parallelizing compilers, ACM Transactions on Programming
Languages and Systems (TOPLAS), v.18 n.4, p.477-518, July 1996.
• Michael P. Gerlek, Eric Stoltz, Michael Wolfe, Beyond induction
variables: detecting and classifying sequences using a demand-driven
SSA form, ACM Transactions on Programming Languages and
Systems (TOPLAS), v.17 n.1, p.85-122, Jan. 1995.
EE663, Spring 2012 Slide 75
Loop Skewing
Original loop:
   DO i=1,4
      DO j=1,6
         A(i,j)= A(i-1,j-1)
      ENDDO
   ENDDO

Skewed loop:
   !$OMP PARALLEL DO
   DO set=1,9
      i = max(5-set,1)
      j = max(-3+set,1)
      setsize = min(4,5-abs(set-5))
      DO k=0,setsize-1
         A(i+k,j+k)=A(i-1+k,j-1+k)
      ENDDO
   ENDDO

[Iteration space graph over i and j: shaded regions show sets of iterations in the
transformed code that can be executed in parallel.]
EE663, Spring 2012                                               Slide 76
Loop Skewing for the
Wavefront Method
   DO i=2,n-1
      DO j=2,n-1
         A(i,j)= (A(i+1,j) +A(i-1,j)
                 +A(i,j+1) +A(i,j-1))/4
      ENDDO
   ENDDO

After skewing, the outer loop is serial and the inner loop is parallel:
   DO j=4, n+n-2
      DOALL i= max(2, j-n+1), min(n-1, j-2)
         A(i, j-i) = (A(i+1, j-i) + A(i-1, j-i)
                     +A(i, j+1-i) + A(i, j-1-i))/4
      ENDDO
   ENDDO

[Iteration space graph over i and j: the parallel wavefronts are the anti-diagonals.]
EE663, Spring 2012 Slide 77
IV.3 Techniques for
Multiprocessors:
Mapping Parallelism to Shared-memory
Machines
EE663, Spring 2012 Slide 78
Loop Fusion and Distribution
Two loops:
   DO i=1,n
      A(i) = B(i)
   ENDDO
   DO i=1,n
      C(i) = A(i-1)+D(i)
   ENDDO

Fused loop:
   DO i=1,n
      A(i) = B(i)
      C(i) = A(i-1) + D(i)
   ENDDO

• Loop fusion is the reverse of loop distribution (fission)
• Fusion reduces the loop fork/join overhead and enhances data affinity
• Distribution inserts a barrier synchronization between parallel loops
• Both transformations reorder computation
• Legality: dependences in the fused loop must be lexically forward
EE663, Spring 2012 Slide 79
Loop Distribution Enables
Other Techniques
   DO i=1,n
      A(i) = B(i)+A(i-1)
      DO j=1,m
         D(i,j)=E(i,j)
      ENDDO
   ENDDO

Loop distribution
• enables interchange
• separates out partial parallelism:

   DO i=1,n
      A(i) = B(i)+A(i-1)
   ENDDO
   DOALL j=1,m
      DO i=1,n
         D(i,j)=E(i,j)
      ENDDO
   ENDDO

In a program with multiply-nested loops, there can be a large number of
possible program variants obtained through distribution and interchanging.
EE663, Spring 2012 Slide 80
Enforcing Data Dependence
Criterion for correct transformation and execution of a
computation involving a data dependence with direction vector
v : (=,…,<,…,*):
Let Ls be the outermost loop with a non-“=” DD direction.
– Ls must be executed serially
– The direction at Ls must be “<”
The same rule applies to all dependences.

Note that a data dependence is defined with respect to an ordered
execution. For autoparallelization, this is the serial program order.
User-defined, fully parallel loops by definition do not have cross-iteration
dependences. Legality rules for transforming already parallel programs are
different.
EE663, Spring 2012 Slide 81
Loop Interchange
Legality of Loop interchange and resulting parallelism can be
tested with the above rules:
After loop interchange, the two conditions must still hold.
Example 1, before interchange:
   DO i=1,n
      DOALL j=1,m
         A(i,j) = A(i-1,j)
      ENDDO
   ENDDO
after interchange:
   DOALL j=1,m
      DO i=1,n
         A(i,j) = A(i-1,j)
      ENDDO
   ENDDO

Example 2, before interchange:
   DOALL i=1,n
      DO j=1,m
         A(i,j) = A(i-1,j-1)
      ENDDO
   ENDDO
after interchange:
   DOALL j=1,m
      DO i=1,n
         A(i,j) = A(i-1,j-1)
      ENDDO
   ENDDO
EE663, Spring 2012 Slide 82
Loop Coalescing
a.k.a. loop collapsing
Before:
   PARALLEL DO i=1,n
      DO j=1,m
         A(i,j) = B(i,j)
      ENDDO
   ENDDO

After coalescing:
   PARALLEL DO ij=1,n*m
      i = 1 + (ij-1) DIV m
      j = 1 + (ij-1) MOD m
      A(i,j) = B(i,j)
   ENDDO

Loop coalescing
• can increase the number of iterations of a parallel loop → better load balancing
• adds additional computation → overhead
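A hedged C sketch of loop coalescing (n, m, A, B illustrative); the second
version uses OpenMP's collapse clause, which performs the same transformation
automatically:

   void coalesced_by_hand(double *A, const double *B, int n, int m)
   {
       #pragma omp parallel for
       for (int ij = 0; ij < n * m; ij++) {
           int i = ij / m;                   /* recover the original indices */
           int j = ij % m;
           A[i*m + j] = B[i*m + j];
       }
   }

   void coalesced_by_collapse(double *A, const double *B, int n, int m)
   {
       #pragma omp parallel for collapse(2)
       for (int i = 0; i < n; i++)
           for (int j = 0; j < m; j++)
               A[i*m + j] = B[i*m + j];
   }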
EE663, Spring 2012 Slide 83
Loop Blocking/Tiling
Original loop:
   DO j=1,m
      DO i=1,n
         B(i,j)=A(i,j)+A(i,j-1)
      ENDDO
   ENDDO

Blocked (tiled) loop:
   DO PARALLEL i1=1,n,block
      DO j=1,m
         DO i=i1,min(i1+block-1,n)
            B(i,j)=A(i,j)+A(i,j-1)
         ENDDO
      ENDDO
   ENDDO

[Iteration space over i and j: each processor p1…p4 works on a block of i
values and traverses all j within its block.]

This is basically the same transformation as
stripmining, but followed by loop interchanging.
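A hedged C/OpenMP sketch of the blocked loop above, with the arrays stored
row-major and flattened (BLOCK, n, m, A, B illustrative). Each thread handles a
strip of i values and reuses elements of A across consecutive j iterations while
its strip stays in cache:

   #define BLOCK 64

   void blocked(double *B, const double *A, int n, int m)
   {
       #pragma omp parallel for
       for (int i1 = 0; i1 < n; i1 += BLOCK)        /* parallel strip loop  */
           for (int j = 1; j < m; j++)
               for (int i = i1; i < i1 + BLOCK && i < n; i++)
                   B[i*m + j] = A[i*m + j] + A[i*m + (j-1)];
   }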
EE663, Spring 2012 Slide 84
Loop Blocking/Tiling continued

The same loop expressed with OpenMP; the NOWAIT clause avoids the barrier
after each inner worksharing loop:
   !$OMP PARALLEL
   DO j=1,m
      !$OMP DO
      DO i=1,n
         B(i,j)=A(i,j)+A(i,j-1)
      ENDDO
      !$OMP ENDDO NOWAIT
   ENDDO
   !$OMP END PARALLEL

[Iteration space over i and j: each processor p1…p4 works on its block of i
values across all j.]
EE663, Spring 2012 Slide 85
Choosing the Block Size
The block size must be small enough so that all data references
between the use and the reuse fit in cache.

   DO j=1,m
      DO k=1,block
         … (r1 data references)
         … = A(k,j) + A(k,j-d)
         … (r2 data references)
      ENDDO
   ENDDO

Number of references made between the access A(k,j) and the access A(k,j-d)
when referencing the same memory location: (r1+r2+3)*d*block
   ⇒ block < cachesize / ((r1+r2+3)*d)

If the cache is shared, all cores use it simultaneously. Hence the
effective cache size appears smaller:
   block < cachesize / ((r1+r2+3)*d*num_cores)
Reference: Zhelong Pan, Brian Armstrong, Hansang Bae and Rudolf Eigenmann,
On the Interaction of Tiling and Automatic Parallelization, First International
Workshop on OpenMP (Wompat), 2005.
EE663, Spring 2012 Slide 86
Multi-level Parallelism from
Single Loops
   DO i=1,n
      A(i) = B(i)
   ENDDO

Stripmining for multi-level parallelism:
   PARALLEL DO (inter-cluster) i1=1,n,strip
      PARALLEL DO (intra-cluster) i=i1,min(i1+strip-1,n)
         A(i) = B(i)
      ENDDO
   ENDDO

[Target architecture: clusters of processors (P), each cluster with its own
memory (M).]
EE663, Spring 2012 Slide 87
References
• High Performance Compilers for Parallel
Computing, Michael Wolfe, Addison-Wesley, ISBN
0-8053-2730-4.
• Optimizing Compilers for Modern Architectures: A
Dependence-based Approach, Ken Kennedy and
John R. Allen, Morgan Kaufmann Publishers, ISBN
1558602860
EE663, Spring 2012 Slide 88
IV.4 Advanced Program
Analysis
EE663, Spring 2012 Slide 89
Interprocedural Analysis
• Most compiler techniques work intra-
procedurally
• Ideally, inter-procedural analyses and
transformations would be available
• In practice: inter-procedural operation of basic
analyses works well
• Inline expansion helps but is no silver bullet
EE663, Spring 2012 Slide 90
Interprocedural Constant
Propagation
Making constant values of variables
known across subroutine calls
   Subroutine A
      j = 150
      call B(j)
   END

   Subroutine B(m)
      DO i=1,100
         X(i)=X(i+m)
      ENDDO
   END

Knowing that m>100 allows the loop in B to be parallelized.
EE663, Spring 2012 Slide 91
An Algorithm for Interprocedural
Constant Propagation
Intra-procedural part:
determine jump functions for all subroutines
   Subroutine X(a,b,c)
      e = 10
      d = b+2
      call Y(c)                JY,1 = c
      f = b*2
      call Z(a,d,c,e,f)        JZ,1 = a   (jump function of the first parameter)
   END                         JZ,2 = b+2
                               JZ,3 = ⊥   (called bottom, meaning non-constant;
                                           c may have been modified by the call to Y)
                               JZ,4 = 10
                               JZ,5 = ⊥   (b*2 is not of the supported form P+const)
• Mechanism for finding jump functions: (local) forward substitution and
interprocedural MAYMOD information.
• Here we assume the compiler supports jump functions of the form
P+const (P is a subroutine parameter of the callee).
EE663, Spring 2012 Slide 92
Constant Propagation Algorithm:
Interprocedural Part
1. initialize all formal parameters to the value T (called top = not yet known)
2. for all jump functions:
– if it is ⊥: set formal parameter value to ⊥ (called bottom = unknown)
– if it is constant and the value of the formal parameter is the same
constant or T : set the value to this constant
3. put all formal parameters on a work queue
4. repeat until the queue is empty: take a parameter from the queue;
for all jump functions that contain this parameter:
• determine the value of the target parameter of this jump function.
Set it to this value, or to ⊥ if it is different from a previously set
value.
• if the value of the target parameter changes, put this parameter
on the queue
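A minimal C sketch of the lattice value and the meet operation used in steps 2
and 4 (the type and function names are illustrative):

   typedef enum { TOP, CONST, BOTTOM } Kind;   /* top = not yet known,
                                                  bottom = non-constant */
   typedef struct { Kind kind; long value; } LatticeVal;

   static LatticeVal meet(LatticeVal a, LatticeVal b)
   {
       if (a.kind == TOP) return b;            /* top meets x -> x            */
       if (b.kind == TOP) return a;
       if (a.kind == CONST && b.kind == CONST && a.value == b.value)
           return a;                           /* equal constants stay const  */
       return (LatticeVal){ BOTTOM, 0 };       /* conflicting values -> bottom */
   }

Whenever a formal parameter's value changes under meet (e.g., from top to a
constant, or from a constant to bottom), it is put back on the work queue, as in
step 4 above.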
EE663, Spring 2012 Slide 93
Examples of Constant Propagation
Example 1:
   x=3
   Call SubY(x)

   Subroutine SubY(a)
      … = ….a…

Example 2:
   x=3
   Call SubY(x)

   Subroutine SubY(a)
      b = a+2
      Call SubZ(b)

   Subroutine SubZ(e)
      … = … e….

Example 3 — consider what happens if t=6, and if t=7:
   x=3
   Call SubY(x)
   t=6 (or t=7)
   Call SubU(t)

   Subroutine SubY(a)
      b = a+2
      Call SubZ(b)

   Subroutine SubU(c)
      d = c-1
      Call SubZ(d)

   Subroutine SubZ(e)
      … = … e….

(With t=6, both call sites pass 5 to SubZ, so e is the constant 5;
with t=7 the two call sites disagree and e becomes ⊥.)
EE663, Spring 2012 Slide 94
Interprocedural
Data-Dependence Analysis
• Motivational examples:
Example 1:
   DO i=1,n
      call clear(a,i)
   ENDDO

   Subroutine clear(x,j)
      x(j) = 0
   END

Example 2:
   DO i=1,n
      a(i) = b(i)
      call dupl(a,i)
   ENDDO

   Subroutine dupl(x,j)
      x(j) = 2*x(j)
   END

Example 3:
   DO k=1,m
      DO i=1,n
         a(i,k) = math(i,k)
         call smooth(a(i,k))
      ENDDO
   ENDDO

   Subroutine smooth(x,j)
      x(j) = (x(j-1)+x(j)+x(j+1))/3
   END
EE663, Spring 2012 Slide 95
Interprocedural
Data-Dependence Analysis
• Overall strategy:
– subroutine inlining
– move loop into called subroutine
– collect array access information in callee
and use in the analysis of the caller
→ will be discussed in more detail
EE663, Spring 2012 Slide 96
Interprocedural
Data-Dependence Analysis
• Representing array access information
– summary information
• [low:high] or [low:high:stride]
• sets of the above
– exact representation
• essentially all loop bound and subscript information is
captured
– representation of multiple subscripts
• separate representation
• linearized
EE663, Spring 2012 Slide 97
Interprocedural
Data-Dependence Analysis
• Reshaping arrays
– simple conversion
• matching subarray or 2-D→1-D
– exact reshaping with div and mod
– linearizing both arrays
– equivalencing the two shapes
• can be used in subroutine inlining
Important: reshaping may lose the implicit
assertion that array bounds are not violated!
EE663, Spring 2012 Slide 98
Symbolic Analysis
• Expression manipulation techniques
– Expression simplification/normalization
– Expression comparison
– Symbolic arithmetic
• Range analysis
– Find lower/upper bounds of variable values at a
given statement
• For each statement and variable, or
• Demand-driven, for a given statement and variable
EE663, Spring 2012 Slide 99
Symbolic Range Analysis
Example
int foo(int k)
{
   []
   int i, j;
   []
   double a;
   []
   for ( i=0; i<10; ++i ) {
      [0<=i<=9]
      a=(0.5*i);
   }
   [i=10]
   j=(i+k);
   [i=10, j=(i+k)]
   return j;
}
EE663, Spring 2012 Slide 100
Alias Analysis
Find references to the same storage by different names
⇒ Program analyses and transformations must consider all these
names
Simple case: different named variables allocated in same
storage location
• Fortran Equivalence statement
• Same variable passed to subroutine by-reference as two
different parameters (can happen in Fortran and C++, but
not in C)
• Global variable also passed as subroutine parameter
EE663, Spring 2012 Slide 101
Pointer Alias Analysis
• More complex: variables pointed to by named pointers
– p=&a; q=&a => *p, *q are aliases
– Same variable passed to C subroutines via pointer
• Most complex: pointers between dynamic data structure
objects
– This is commonly referred to as shape analysis
EE663, Spring 2012 Slide 102
Is Alias Analysis in Parallelizing
Compilers Important?
• Fortran77: alias analysis is simple/absent
– By Fortran rule, aliased subroutine parameters must not be
written to
– there are no pointers
• C programs: alias analysis is a must
– Pointers, pointer arithmetic
– No Fortran-like rule about subroutine parameters
– Without alias information, compilers would have to be very
conservative => big loss of parallelism
– Classical science/engineering applications do not have
dynamic data structures => no shape analysis needed
EE663, Spring 2012 Slide 103
IV.5 Dynamic Decision
Support
EE663, Spring 2012 Slide 104
Achilles’ Heel of Compilers
Big compiler limitations:
– Insufficient compile-time knowledge
• Input data
• Architecture parameters (e.g., cache size)
• Memory layout
– Even if this information is known: Performance models too
complex
Effect:
– Unknown profitability of optimizations
– Inconsistent performance behavior
– Conservative behavior of compilers
– Many compiler options
– Users need to experiment with options
EE663, Spring 2012 Slide 105
Multi-version Code
   IF (d>n) THEN
      PARALLEL DO i=1,n
         a(i) = a(i+d)
      ENDDO
   ELSE
      DO i=1,n
         a(i) = a(i+d)
      ENDDO
   ENDIF

Limitations
• Less readable
• Additional code
• Not feasible for all optimizations
• Combinatorial explosion when trying to apply to many optimization decisions
EE663, Spring 2012 Slide 106
Profiling
• Gather missing information in a profile run
– Compiler instruments code that gathers at runtime
information needed for optimization decisions
• Use the gathered profile information for improved
decision making in a second compiler invocation
• Training vs. production data
• Initially used for branch prediction. Now increasingly
used to guide additional transformations.
• Requires a compiler performance model
EE663, Spring 2012 Slide 107
Autotuning – Empirical Tuning
Try many optimization variants; pick the best at runtime.

[Tuning cycle: Search Space Navigation → Version Generation → Runtime
Evaluation → back to Search Space Navigation]

• No compiler performance model needed
• Optimization decisions based on true execution time
• Dependence on training data (same as profiling)
• Potentially huge search space
• Whole-program vs. section-level tuning

Many active research projects.
EE663, Spring 2012 Slide 108
IV.6 Techniques for Vector
Machines
EE663, Spring 2012 Slide 109
Vector Instructions
A vector instruction operates on a number of
data elements at once.
Example: vadd va,vb,vc,32
vector operation of length 32 on vector registers va,vb, and vc
– va,vb,vc can be
• Special cpu registers or memory → classical
supercomputers
• Regular registers, subdivided into shorter partitions (e.g.,
64bit register split 8-way) → multi-media extensions
– The operations on the different vector elements
can overlap → vector pipelining
EE663, Spring 2012 Slide 110
Applications of Vector
Operations
• Science/engineering applications are typically
regular with large loop iteration counts.
This was ideal for classical supercomputers, which
had long vectors (up to 256; vector pipeline startup
was costly).
• Graphics applications can exploit “multi-
media” register features and instruction sets.
EE663, Spring 2012 Slide 111
Basic Vector Transformation
   DO i=1,n
      A(i) = B(i)+C(i)
   ENDDO
becomes
   A(1:n)=B(1:n)+C(1:n)

   DO i=1,n
      A(i) = B(i)+C(i)
      C(i-1) = D(i)**2
   ENDDO
becomes
   A(1:n)=B(1:n)+C(1:n)
   C(0:n-1)=D(1:n)**2

The triplet notation is interpreted to mean "vector operation". Notice that this
is not (necessarily) the same meaning as in Fortran 90.
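For comparison, a hedged C sketch of the first loop using OpenMP's simd
directive, which asks the compiler to emit vector instructions (a, b, c, n are
illustrative):

   void vector_add(double *a, const double *b, const double *c, int n)
   {
       #pragma omp simd
       for (int i = 0; i < n; i++)
           a[i] = b[i] + c[i];
   }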
EE663, Spring 2012 Slide 112
Distribution and Vectorization
The transformation done on the previous slide involves loop distribution. Loop
distribution reorders computation and is thus subject to data dependence
constraints.

   DO i=1,n
      A(i) = B(i)+C(i)
      D(i) = A(i)+A(i-1)        (loop-carried dependence on A)
   ENDDO

loop distribution:
   DO i=1,n
      A(i) = B(i)+C(i)
   ENDDO
   DO i=1,n
      D(i) = A(i)+A(i-1)
   ENDDO

vectorization:
   A(1:n)=B(1:n)+C(1:n)
   D(1:n)=A(1:n)+A(0:n-1)

The transformation is not legal if there is a lexically-backward dependence:
   DO i=1,n
      A(i) = B(i)+C(i)          (loop-carried dependence from C(i+1) to C(i))
      C(i+1) = D(i)**2
   ENDDO
Statement reordering may help resolve the problem. However, this is
not possible if there is a dependence cycle.
EE663, Spring 2012 Slide 113
Vectorization Needs
Expansion
... as opposed to privatization

   DO i=1,n
      t = A(i)+B(i)
      C(i) = t + t**2
   ENDDO

expansion:
   DO i=1,n
      T(i) = A(i)+B(i)
      C(i) = T(i) + T(i)**2
   ENDDO

vectorization:
   T(1:n) = A(1:n)+B(1:n)
   C(1:n) = T(1:n)+T(1:n)**2
EE663, Spring 2012 Slide 114
Conditional Vectorization
DO i=1,n
IF (A(i) < 0) A(i)=-A(i)
ENDDO
conditional vectorization
WHERE (A(1:n) < 0) A(1:n)=-A(1:n)
EE663, Spring 2012 Slide 115
Stripmining for Vectorization
   DO i=1,n
      A(i) = B(i)
   ENDDO

stripmining:
   DO i1=1,n,32
      DO i=i1,min(i1+31,n)
         A(i) = B(i)
      ENDDO
   ENDDO

Stripmining turns a single loop into a doubly-nested loop for two-level parallelism.
It also needs to be done by the code-generating compiler to split an operation into
chunks of the available vector length.
EE663, Spring 2012 Slide 116
IV.7 Compiling for
Heterogeneous
Architectures
EE663, Spring 2012 Slide 117
Why Heterogeneous
Architectures?
• Performance
– Fast uniprocessor best for serial code
– Many simple cores best for highly parallel code
– Special-purpose architectures for accelerating
certain code patterns
• E.g., math co-processor
• Energy
– Same arguments hold for power savings
EE663, Spring 2012 Slide 118
Examples of Accelerators
• nvidia GPGPU
• IBM Cell
• Intel MIC
• FPGAs
• Crypto processor
• Network processor
• Video Encoder/decoder

Accelerators are typically used as co-processors:
• CPU + accelerator = heterogeneous
• Shared or distributed address space
EE663, Spring 2012 Slide 119
Accelerator Architecture
Example GPGPU:
• Address space is separate from the CPU
• Complex memory hierarchy
• Large number of cores
• Multithreaded SIMD execution
• Optimized for coalesced (stride-1) accesses

[CUDA memory model: a grid of thread blocks runs on the GPU; each thread has
registers and local memory; each thread block has shared memory; all blocks
access global memory, texture memory (with a dedicated cache), and constant
memory (with a dedicated cache). The CPU communicates with the GPU's global
memory.]
EE663, Spring 2012 Slide 120
Compiler Optimizations for
GPGPUs
• Optimizing GPU Global Memory Accesses
– Parallel Loop Swap
– Loop Collapsing
– Matrix Transpose
• Exploiting GPU On-chip Memories
• Optimizing CPU-GPU Data Movement
– Resident GPU Variable Analysis
– Live CPU Variable Analysis
– Memory Transfer Promotion Optimization
R. Eigenmann, Programming Models and Compilers for Accelerators Slide 121
Parallel Loop-Swap
Transformation

Input OpenMP code:
   #pragma omp parallel for
   for(i=0; i<N; i++)
      for(k=0; k<N; k++)
         A[i][k] = B[i][k];

Optimized OpenMP code:
   #pragma omp parallel for schedule(static, 1)
   for(k=0; k<N; k++)
      for(i=0; i<N; i++)
         A[i][k] = B[i][k];

[Memory access diagrams: in the input code, at a given time the threads T0..T3
access elements that are a full row apart in global memory; after the loop swap,
the threads access adjacent elements, so the accesses coalesce.]
R. Eigenmann, Programming Models and Compilers for Accelerators Slide 122
Loop Collapsing
Transformation

Input OpenMP code:
   #pragma omp parallel for
   for(i=0; i<n_rows; i++)
      for(k=rptr[i]; k<rptr[i+1]; k++)
         w[i] += A[k]*p[col[k]];

Optimized OpenMP code:
   #pragma omp parallel
   #pragma omp for collapse(2) schedule(static, 1)
   for(i=0; i<n_rows; i++)
      for(k=rptr[i]; k<rptr[i+1]; k++)
         w[i] += A[k]*p[col[k]];

[Memory access diagrams: before collapsing, threads T0..T3 each walk one row of
the sparse matrix; after collapsing, consecutive threads access consecutive
elements of A, so the global memory accesses coalesce.]
R. Eigenmann, Programming Models and Compilers for Accelerators Slide 123
Matrix-Transpose Transformation
   float d[N][M];
   ...
   <transpose d on transfer to GPU>

   // kernel function:
   float d[M][N];
   #pragma omp parallel for
   for(k=0; k<N; k++)
      for(i=0; i<M; i++)
         … d[i][k] …;

[Memory access diagrams: without the transpose, threads T0..T3 access elements
that are a row apart; with the transposed layout, the threads access adjacent
elements at a given time, so the accesses coalesce.]
R. Eigenmann, Programming Models and Compilers for Accelerators Slide 124
Techniques to Exploit GPU
On-chip Memories
Caching Strategies
Variable Type Caching Strategy
R/O shared scalar w/o locality SM
R/O shared scalar w/ locality SM, CM, Reg
R/W shared scalar w/ locality Reg, SM
R/W shared array element w/ locality Reg
R/O 1-dimensional shared array w/ locality TM
R/W private array w/ locality SM
Reg: Registers CM: Constant Memory
SM: Shared Memory TM: Texture Memory
R. Eigenmann, Programming Models and Compilers for Accelerators Slide 125
Techniques to Optimize Data
Movement between CPU and GPU
• Resident GPU Variable Analysis
– Up-to-date data in GPU global memory:
do not copy again from CPU.
• Live CPU Variable Analysis
– After a kernel finishes:
only transfer live CPU variables from GPU to
CPU.
• Memory Transfer Promotion Optimization
– Find optimal points to insert necessary memory
transfers
R. Eigenmann, Programming Models and Compilers for Accelerators Slide 126
GPGPU Performance Relative
to CPU
EE663, Spring 2012 Slide 127
[Bar charts of speedups relative to the CPU omitted.]

Importance of Individual Optimizations

[Bar charts showing the effect of the individual optimizations omitted.]

EE663, Spring 2012
IV.8 Techniques Specific
to Distributed-memory
Machines
EE663, Spring 2012 Slide 129
Execution Scheme on a
Distributed-Memory Machine
Typical execution scheme:
• All nodes execute the same program
• The program uses node_id to select the subcomputation to execute on each
  participating processor and the data to access.

For example,
   DO i=1,n
      ...
   ENDDO
becomes
   mystrip = ⎡n/max_nodes⎤
   lb = node_id*mystrip + 1
   ub = min(lb+mystrip-1,n)
   DO i=lb,ub
      ...
   ENDDO

Open questions: how to place and access data? how/when to synchronize?

[Architecture: nodes, each with a processor P and its own memory M,
connected by a network.]

This is called the Single-Program-Multiple-Data (SPMD) execution scheme.
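A hedged C/MPI sketch of this SPMD scheme (the array size N and the work in the
loop body are illustrative). Every rank runs the same program and uses its rank
to pick its strip of iterations; a collective combines the partial results:

   #include <mpi.h>
   #include <stdio.h>

   #define N 1000

   int main(int argc, char **argv)
   {
       int node_id, max_nodes;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &node_id);
       MPI_Comm_size(MPI_COMM_WORLD, &max_nodes);

       int mystrip = (N + max_nodes - 1) / max_nodes;   /* ceil(N/max_nodes) */
       int lb = node_id * mystrip + 1;
       int ub = lb + mystrip - 1 < N ? lb + mystrip - 1 : N;

       double local_sum = 0.0, global_sum;
       for (int i = lb; i <= ub; i++)        /* this node's share of DO i=1,N */
           local_sum += (double)i;

       /* Placement/communication of shared data is the hard part; here a
          simple collective combines the partial results on rank 0. */
       MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                  MPI_COMM_WORLD);
       if (node_id == 0) printf("sum = %f\n", global_sum);
       MPI_Finalize();
       return 0;
   }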
EE663, Spring 2012 Slide 130
Data Placement
Single owner:
• Data is distributed onto the participating
processors’ memories
Replication:
• Multiple versions of the data are placed
on some or all nodes.
EE663, Spring 2012 Slide 131
Data Distribution Schemes
The numbers indicate the node of a 4-processor distributed-memory machine
on which the array section is placed:

   block distribution:        1 | 2 | 3 | 4
   cyclic distribution:       1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
   block-cyclic distribution: 1 | 2 | 3 | 4 | 1 | ...
   indexed distribution:      IND(1) IND(2) IND(3) IND(4) IND(5) ...  (index array)

Automatic data distribution is difficult because it is a global optimization.
EE663, Spring 2012 Slide 132
Message Generation
for single-owner placement
EXAMPLE
   DO i=1,n
      B(i) = A(i)+A(i-1)
   ENDDO

message generation:
   send (A(ub),my_proc+1)
   receive (A(lb-1),my_proc-1)
   DO i=lb,ub
      B(i) = A(i)+A(i-1)
   ENDDO

• lb,ub determine the iterations assigned to each processor.
• data uses block distribution and matches the iteration distribution
• my_proc is the current processor number

Compilers for languages such as HPF (High-Performance
Fortran) have explored these ideas extensively.
EE663, Spring 2012 Slide 133
Owner-computes Scheme
In general, the elements accessed by a processor are different from the elements
owned by this processor as defined by the data distribution.

   DO i=1,n
      A(i)=B(i)+B(i-m)
      C(ind(i))=D(ind2(i))
   ENDDO

Owner-computes transformation:
   DO i=1,n
      send/receive what's necessary
      IF I_own(A(i)) THEN
         A(i) = B(i)+B(i-m)
      ENDIF
      send/receive what's necessary
      IF I_own(C(ind(i))) THEN
         C(ind(i))=D(ind2(i))
      ENDIF
   ENDDO

• nodes execute those iterations and statements whose LHS they own
• first they receive needed RHS elements from remote nodes
• nodes need to send all elements needed by other nodes

The example shows the basic idea only. Compiler optimizations are needed!
EE663, Spring 2012 Slide 134
Compiler Optimizations
for the raw owner computes scheme
• Eliminate conditional execution
– combine if statements with same condition
– reduce iteration space if possible
• Aggregate communication
– combine small messages into larger ones
– tradeoff: delaying a message enables message
aggregation but increases the message latency.
• Message Prefetch
– moving send operations earlier in order to reduce
message latencies.
there is a large number of research papers describing such techniques
EE663, Spring 2012 Slide 135
Message Generation for
Virtual Data Replication
[Execution timeline: fully parallel sections with local reads and writes
alternate with phases that broadcast the written data to all nodes.
Optimization: reduce the broadcast operations to the necessary
point-to-point communication.]

Advantages:
• Fully parallel sections with local reads and writes
• Easier message set computation (no partitioning per processor needed)
Disadvantages:
• Not data-scalable
• More write operations necessary (but collective communication can be used)
EE663, Spring 2012 Slide 136
7 Techniques for
Instruction-Level
Parallelization
EE663, Spring 2012 Slide 137
Implicit vs. Explicit ILP
Implicit ILP: ISA is the same as for sequential
programs.
– most processors today employ a certain degree of
implicit ILP
– parallelism detection is entirely done by the hardware
– compiler can assist ILP by arranging the code so that
the detection gets easier.
EE663, Spring 2012 Slide 138
Implicit vs. Explicit ILP
Explicit ILP: ISA expresses parallelism.
– parallelism is detected by the compiler
– parallelism is expressed in the form of
• VLIW (very long instruction words): packing several instructions
into one long word
• EPIC (Explicitly Parallel Instruction Computing): bundles of (up
to three) instructions are issued. Dependence bits can be
specified.
Used in Intel/HP IA-64 architecture. The processor also
supports predication, early (speculative) loads, prepare-to-
branch, rotating registers.
EE663, Spring 2012 Slide 139
Trace Scheduling
(invented for VLIW processors, still a useful terminology)
Two big issues must be solved by all approaches:
1. Identifying the instruction sequence that will be inspected for ILP
   (trace selection). Main obstacle: branches.
2. Reordering instructions so that machine resources are exploited
   efficiently (trace compaction).
Together, the two steps form trace scheduling.
EE663, Spring 2012 Slide 140
Trace Selection
• It is important to have a large instruction window (block) within
which the compiler can find parallelism.
• Branches are the problem. Instruction pipelines have to be
flushed/squashed at branches
• Possible remedies:
– eliminate branches
– code motion can increase block size
– block can contain out-branches with low probability
– predicated execution
EE663, Spring 2012 Slide 141
Branch Elimination
• Example:
Before:
      comp R0 R1
      bne  L1:
      bra  L2:
   L1: . . .
      . . .
   L2: . . .

After:
      comp R0 R1
      beq  L2:
   L1: . . .
      . . .
   L2: . . .
EE663, Spring 2012 Slide 142
Code Motion
[Control-flow diagrams: instruction I1 is moved above or below branches on
conditions c1 and c2, merging copies of I1 and removing subtrees.]
Code motion can increase window sizes and eliminate subtrees.
EE663, Spring 2012 Slide 143
Predicated Execution
   IF (a>0) THEN
      b=a
   ELSE
      b=-a
   ENDIF

With predication:
   p = a>0      ; assignment of predicate
   p: b=a       ; executed if predicate true
  !p: b=-a      ; executed if predicate false

Predication
• increases the window size for analyzing and exploiting parallelism
• increases the number of instructions "executed"
These are opposite demands!
Compare this technique to conditional vectorization.
EE663, Spring 2012 Slide 144
Dependence-removing ILP
Techniques
Induction chain:
   ind = i0
   ...
   ind = ind+1      (each statement depends on the previous value)
   ...
   ind = ind+1
becomes
   ind = i0
   ...
   ind = i0+1
   ...
   ind = i0+2

Sum chain:
   sum = sum+expr1
   ...
   sum = sum+expr2
   ...
   sum = sum+expr3
   ...
   sum = sum+expr4
becomes
   s1=expr1
   ...
   s1=s1+expr2
   ...
   s2=expr3
   ...
   s2=s2+expr4
   ...
   sum=sum+s1+s2

The blocks of statements in the transformed code are independent of each
other and can be executed as parallel instructions.
EE663, Spring 2012 Slide 145
Speculative ILP
Speculation is performed by the architecture in various forms
– Superscalar processors: compiler only has to deal with the
performance model. ISA is the same as for non-speculative
processors
– Multiscalar processors: (research only) compiler defines tasks that
the hardware can try to execute speculatively in parallel. Other than
task boundaries, the ISA is the same.
References:
• Task Selection for a Multiscalar Processor, T. N. Vijaykumar and
Gurindar S. Sohi, The 31st International Symposium on
Microarchitecture (MICRO-31), pp. 81-92, December 1998.
• Reference Idempotency Analysis: A Framework for Optimizing
Speculative Execution, Seon-Wook Kim, Chong-Liang Ooi, Rudolf
Eigenmann, Babak Falsafi, and T.N. Vijaykumar,, In Proc. of
PPOPP'01, Symposium on Principles and Practice of Parallel
Programming, 2001.
EE663, Spring 2012 Slide 146
Compiler Model of Explicit
Speculative Parallel Execution
(Multiscalar Processor)

• Overall Execution: speculative threads choose and start the
  execution of any predicted next thread.
• Data Dependence and Control Flow Violations lead to roll-backs.
• Final Execution: satisfies all cross-segment flow and control
  dependences.
• Data Access: Writes go to thread-private speculative storage.
  Reads read from an ancestor thread or from memory.
• Dependence Tracking: Data Flow and Control Flow dependences
  are detected directly and lead to roll-backs. Anti and Output
  dependences are satisfied via speculative storage.
• Segment Commit: Correctly executed threads (i.e., their final
  execution) commit their speculative storage to memory, in
  sequential order.
EE663, Spring 2004 Slide 147