PGI Compilers
Tools for Scientists and Engineers
September 2006
www.pgroup.com
Outline of Today's Topics
Introduction to PGI Compilers and Tools
Documentation, Getting Help
Basic Compiler Options
Optimization Strategies
6.2 Features and Roadmap
Questions and Answers
PGI Compilers and Tools, features
Optimization: state-of-the-art vector, parallel, IPA, feedback; cross-platform AMD & Intel, 32/64-bit, Linux & Windows; PGI Unified Binary for AMD and Intel processors
Tools: integrated OpenMP/MPI debug & profile, IDE integration
Parallel: MPI, OpenMP 2.5, auto-parallel for multi-core
Comprehensive OS support: Red Hat 7.3-9.0, RHEL 3.0/4.0, Fedora Core 2/3/4/5, SuSE 7.1-10.1, SLES 8/9/10, Windows XP, Windows x64
PGI Tools Enable Developers to:
View x64 as a unified CPU architecture
Extract peak performance from x64 CPUs
Ride innovation waves from both Intel and AMD
Use a single source base and toolset across Linux and Windows
Develop, debug, and tune parallel applications for multi-core, multi-core SMP, and clustered multi-core SMP
PGI Documentation and Support
PGI-provided documentation
PGI User Forums at www.pgroup.com
PGI FAQs, Tips & Techniques pages
Web support, a form-based system similar to email support
Fax support
PGI Docs & Support, cont.
Legacy phone support, direct access, etc.
PGI download web page
PGI prepared/personalized training
PGI ISV program
PGI Premier Service program
PGI Basic Compiler Options
Basic usage
Language dialects
Target architectures
Debugging aids
Optimization switches
PGI Basic Compiler Usage
A compiler driver interprets options and invokes preprocessors, compilers, assembler, linker, etc.
Options precedence: if options conflict, the last option on the command line takes precedence
Use -Minfo to see a listing of optimizations and transformations performed by the compiler
Use -help to list all options or see details on how to use a given option, e.g. pgf90 -Mvect -help
Use man pages for more details on options, e.g. man pgf90
Use -v to see under the hood
Flags to support language dialects
Fortran
pgf77, pgf90, pgf95, pghpf tools
Suffixes .f, .F, .for, .fpp, .f90, .F90, .f95, .F95, .hpf, .HPF
-Mextend, -Mfixed, -Mfreeform
Type size: -i2, -i4, -i8, -r4, -r8, etc.
-Mcray, -Mbyteswapio, -Mupcase, -Mnomain, -Mrecursive, etc.

C/C++
pgcc, pgCC (aka pgcpp)
Suffixes .c, .C, .cc, .cpp, .i
-B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt
-Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs
Specifying the target architecture
Not an issue on the XT3; it defaults to the type of processor/OS you are running on. Use the -tp switch:
-tp k8-64, -tp p7-64, or -tp core2-64 for 64-bit code
-tp amd64e for AMD Opteron Rev E or later
-tp x64 for a unified binary
-tp k8-32, k7, p7, piv, piii, p6, p5, px for 32-bit code
Flags for debugging aids
-g generates symbolic debug information used by a debugger
-gopt generates debug information in the presence of optimization
-Mbounds adds array bounds checking
-v gives verbose output, useful for debugging system or build problems
-Mlist will generate a listing
-Minfo provides feedback on optimizations made by the compiler
-S or -Mkeepasm to see the exact assembly generated
Basic optimization switches
Traditional optimization is controlled through -O[<n>], where n is 0 to 4
The -fast switch combines a common set of options into one simple switch; it is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre
For -Munroll, c:<n> specifies completely unrolling loops with this loop count or less; -Munroll=n:<m> says unroll other loops m times
-Mnoframe does not set up a stack frame
-Mlre is loop-carried redundancy elimination
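The transformation behind -Munroll can be sketched in plain C. This is a hand-written illustration, not compiler output; the function names are invented for this example:

```c
#include <stddef.h>

/* Straightforward reduction: one add per iteration, and a
   loop-carried dependence on "sum" serializes the adds. */
float sum_simple(const float *a, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled 4x, roughly the shape -Munroll=n:4 produces: four
   partial sums break the dependence chain and expose
   instruction-level parallelism; a cleanup loop handles the
   leftover iterations when n is not a multiple of 4. */
float sum_unrolled(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)        /* cleanup loop */
        sum += a[i];
    return sum;
}
```

With c:1, the compiler applies the complete-unroll form only to loops whose trip count is known and at most 1, eliminating the loop entirely.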
Basic optimization switches, cont.
The -fastsse switch is commonly used; it extends -fast to SSE hardware and vectorization
-fastsse is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre (-fast) plus -Mvect=sse, -Mscalarsse, -Mcache_align, -Mflushz
-Mcache_align aligns top-level arrays and objects on cache-line boundaries
-Mflushz flushes SSE denormal numbers to zero
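The denormals that -Mflushz targets can be produced in a few lines of C. This is an illustrative sketch (the function name is made up here); without flush-to-zero, arithmetic on such values can be far slower than on normal floats:

```c
#include <float.h>
#include <math.h>

/* Halving the smallest normal float yields a subnormal (denormal)
   value.  On SSE hardware, operations on subnormals typically take
   a slow microcoded path; -Mflushz sets the SSE control bits so
   these values are treated as 0.0 instead, trading a tiny amount
   of accuracy near zero for consistent speed. */
int half_min_is_subnormal(void) {
    volatile float x = FLT_MIN;   /* smallest normal float */
    volatile float y = x / 2.0f;  /* subnormal when FTZ is off */
    return fpclassify(y) == FP_SUBNORMAL;
}
```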
Optimization Strategies
Establish a workload
Optimize from the top down
Use proper tools and methods
Processor-level optimizations, parallel methods
Different flags/features for different types of code
Node level tuning
Vectorization: packed SSE instructions maximize performance
Interprocedural Analysis (IPA): use it! motivating examples
Function Inlining: especially important for C and C++
Parallelization: for Cray XD1 and multi-core processors
Miscellaneous Optimizations: hit or miss, but worth a try
Vectorizable F90 Array Syntax (data is REAL*4)

350 !
351 ! Initialize vertex, similarity and coordinate arrays
352 !
353       Do Index = 1, NodeCount
354          IX = MOD (Index - 1, NodesX) + 1
355          IY = ((Index - 1) / NodesX) + 1
356          CoordX (IX, IY) = Position (1) + (IX - 1) * StepX
357          CoordY (IX, IY) = Position (2) + (IY - 1) * StepY
358          JetSim (Index) = SUM (Graph (:, :, Index) * &
359      &      GaborTrafo (:, :, CoordX(IX,IY), CoordY(IX,IY)))
360          VertexX (Index) = MOD (Params%Graph%RandomIndex (Index) - 1, NodesX) + 1
361          VertexY (Index) = ((Params%Graph%RandomIndex (Index) - 1) / NodesX) + 1
362       End Do
The inner loop at line 358 is vectorizable and can use packed SSE instructions
-fastsse to Enable SSE Vectorization, -Minfo to List Optimizations to stderr

% pgf95 -fastsse -Mipa=fast -Minfo -S graphRoutines.f90
localmove:
   334, Loop unrolled 1 times (completely unrolled)
   343, Loop unrolled 2 times (completely unrolled)
   358, Generated an alternate loop for the inner loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
Scalar SSE:

.LB6_668:                 # lineno: 358
        movss  -12(%rax),%xmm2
        movss  -4(%rax),%xmm3
        subl   $1,%edx
        mulss  -12(%rcx),%xmm2
        addss  %xmm0,%xmm2
        mulss  -4(%rcx),%xmm3
        movss  -8(%rax),%xmm0
        mulss  -8(%rcx),%xmm0
        addss  %xmm0,%xmm2
        movss  (%rax),%xmm0
        addq   $16,%rax
        addss  %xmm3,%xmm2
        mulss  (%rcx),%xmm0
        addq   $16,%rcx
        testl  %edx,%edx
        addss  %xmm0,%xmm2
        movaps %xmm2,%xmm0
        jg     .LB6_625
Vector SSE:

.LB6_1245:                # lineno: 358
        movlps     (%rdx,%rcx),%xmm2
        subl       $8,%eax
        movlps     16(%rcx,%rdx),%xmm3
        prefetcht0 64(%rcx,%rsi)
        prefetcht0 64(%rcx,%rdx)
        movhps     8(%rcx,%rdx),%xmm2
        mulps      (%rsi,%rcx),%xmm2
        movhps     24(%rcx,%rdx),%xmm3
        addps      %xmm2,%xmm0
        mulps      16(%rcx,%rsi),%xmm3
        addq       $32,%rcx
        testl      %eax,%eax
        addps      %xmm3,%xmm0
        jg         .LB6_1245
Facerec Scalar: 104.2 sec Facerec Vector: 84.3 sec
Vectorizable C Code Fragment?
217  void func4(float *u1, float *u2, float *u3,
221     for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx+NE-1; i++)
222        u3[i] += clz * (p1[i] + p2[i]);
223     for (i = -NI+1; i < nx+NE-1; i++) {
224        float vdt = v[i] * dt;
225        u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226     }

% pgcc -fastsse -Minfo functions.c
func4:
   221, Loop unrolled 4 times
   221, Loop not vectorized due to data dependency
   223, Loop not vectorized due to data dependency
Pointer Arguments Inhibit Vectorization
217  void func4(float *u1, float *u2, float *u3,
221     for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx+NE-1; i++)
222        u3[i] += clz * (p1[i] + p2[i]);
223     for (i = -NI+1; i < nx+NE-1; i++) {
224        float vdt = v[i] * dt;
225        u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226     }

% pgcc -fastsse -Msafeptr -Minfo functions.c
func4:
   221, Generated vector SSE code for inner loop
        Generated 3 prefetch instructions for this loop
   223, Unrolled inner loop 4 times
C Constant Inhibits Vectorization
217  void func4(float *u1, float *u2, float *u3,
221     for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx+NE-1; i++)
222        u3[i] += clz * (p1[i] + p2[i]);
223     for (i = -NI+1; i < nx+NE-1; i++) {
224        float vdt = v[i] * dt;
225        u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226     }

% pgcc -fastsse -Msafeptr -Mfcon -Minfo functions.c
func4:
   221, Generated vector SSE code for inner loop
        Generated 3 prefetch instructions for this loop
   223, Generated vector SSE code for inner loop
        Generated 4 prefetch instructions for this loop
-Msafeptr Option and Pragma
-M[no]safeptr[=all | arg | auto | dummy | local | static | global]
   all      All pointers are safe
   arg      Argument pointers are safe
   local    Local pointers are safe
   static   Static local pointers are safe
   global   Global pointers are safe

#pragma [scope] [no]safeptr={arg | local | global | static | all}
where scope is global, routine, or loop
Common Barriers to SSE Vectorization
Potential dependencies and C pointers: give the compiler more info with -Msafeptr, pragmas, or the restrict type qualifier
Function calls: try inlining with -Minline or -Mipa=inline
Type conversions: manually convert constants or use flags
Large number of statements: try -Mvect=nosizelimit
Too few iterations: usually better to unroll the loop
Real dependencies: must restructure the loop, if possible
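The restrict route can be sketched as follows. This is an illustrative C99 function loosely modeled on the earlier fragment, not the original source:

```c
/* Without restrict, the compiler must assume u3 might alias p1 or
   p2, so a store to u3[i] could change later loads -- forcing
   scalar code.  The C99 restrict qualifier promises the arrays do
   not overlap, which has the same effect as -Msafeptr=arg but only
   for this one function. */
void add_scaled(float *restrict u3,
                const float *restrict p1,
                const float *restrict p2,
                float clz, int n) {
    for (int i = 0; i < n; i++)
        u3[i] += clz * (p1[i] + p2[i]);
}
```

Unlike the command-line switch, restrict is visible in the source and travels with the code to other compilers.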
Barriers to Efficient Execution of Vector SSE Loops
Not enough work: vectors are too short
Vectors not aligned to a cache-line boundary
Non-unity strides
Code bloat if altcode is generated
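The stride issue is easy to see in C. A minimal sketch (invented function names) contrasting the two access patterns:

```c
#include <stddef.h>

/* Unit stride: consecutive elements, so one cache line feeds
   several iterations and the loads map directly onto packed SSE
   moves. */
float sum_stride1(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Stride 8: touches one float out of every 32 bytes.  Each load is
   effectively a fresh cache-line access, and the elements cannot
   be gathered into a packed register cheaply, so even when such a
   loop vectorizes it rarely runs faster. */
float sum_stride8(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i += 8)
        s += a[i];
    return s;
}
```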
What can Interprocedural Analysis and Optimization with -Mipa do for You?
Interprocedural constant propagation
Pointer disambiguation
Alignment detection, alignment propagation
Global variable mod/ref detection
F90 shape propagation
Function inlining
IPA optimization of libraries, including inlining
Effect of IPA on the WUPWISE Benchmark
PGF95 Compiler Options           Execution Time in Seconds
-fastsse                         156.49
-fastsse -Mipa=fast              121.65
-fastsse -Mipa=fast,inline        91.72
-Mipa=fast => constant propagation => the compiler sees that the complex matrices are all 4x3 => completely unrolls the loops
-Mipa=fast,inline => the small matrix multiplies are all inlined
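The WUPWISE effect can be sketched in C. This is not the benchmark's actual source; the names and the 4x3 shape are illustrative, showing why propagated constant dimensions let the optimizer fully unroll:

```c
/* General version: m and n are runtime values, so the compiler
   must keep both loops and their trip-count tests. */
void matvec(const float *a, const float *x, float *y, int m, int n) {
    for (int i = 0; i < m; i++) {
        y[i] = 0.0f;
        for (int j = 0; j < n; j++)
            y[i] += a[i * n + j] * x[j];
    }
}

/* After IPA constant propagation the call site is known to pass
   m=4, n=3; inlining then hands the optimizer this form, which it
   can collapse to 12 multiply-adds with no loop overhead at all. */
static inline void matvec_4x3(const float a[12], const float x[3],
                              float y[4]) {
    matvec(a, x, y, 4, 3);   /* constant dimensions, fully unrollable */
}
```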
Using Interprocedural Analysis
Must be used at both compile time and link time
Non-disruptive to the development process: edit/build/run
Speed-ups of 5%-10% are common
-Mipa=safe:<name> safe to optimize functions which call or are called from unknown function/library name
-Mipa=libopt perform IPA optimizations on libraries
-Mipa=libinline perform IPA inlining from libraries
Explicit Function Inlining
-Minline[=[lib:]<inlib> | [name:]<func> | except:<func> | size:<n> | levels:<n>]
   [lib:]<inlib>    Inline extracted functions from inlib
   [name:]<func>    Inline function func
   except:<func>    Do not inline function func
   size:<n>         Inline only functions smaller than n statements (approximate)
   levels:<n>       Inline n levels of functions

For C++ codes, PGI recommends IPA-based inlining or -Minline=levels:10!
Other C++ recommendations
Encapsulation, data hiding: small functions, inline!
Exception handling: use --no_exceptions until 7.0
Overloaded operators, overloaded functions: okay
Pointer chasing: -Msafeptr, the restrict qualifier, 32 bits?
Templates, generic programming: now okay
Inheritance, polymorphism, virtual functions: runtime lookup or check, no inlining, potential performance penalties
SMP Parallelization
-Mconcur for auto-parallelization on multi-core
The compiler strives for parallel outer loops with vector SSE inner loops
-Mconcur=innermost forces a vector/parallel innermost loop
-Mconcur=cncall enables parallelization of loops with calls
-mp to enable the OpenMP 2.5 parallel programming model
See the PGI User's Guide or the OpenMP 2.5 standard
OpenMP programs compiled without -mp just work
Not supported on Cray XT3; it would require some custom work
-Mconcur and -mp can be used together!
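The loop-nest shape both -Mconcur and OpenMP target can be sketched in C. The function below is a hypothetical example: a parallel outer loop over rows with a unit-stride, vectorizable inner loop over columns. Compiled without OpenMP support the pragma is ignored and the code simply runs serially:

```c
/* Parallel outer loop, vector inner loop: each outer iteration
   writes a disjoint row of "out", so rows can be distributed
   across cores while the inner loop uses packed SSE moves. */
void smooth(float *out, const float *in, int rows, int cols) {
    #pragma omp parallel for
    for (int i = 1; i < rows - 1; i++)
        for (int j = 0; j < cols; j++)
            out[i * cols + j] = 0.5f * (in[(i - 1) * cols + j] +
                                        in[(i + 1) * cols + j]);
}
```

This is also why the compiler prefers parallelizing the outermost loop: one fork/join amortized over the most work, with the contiguous dimension left free for vectorization.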
MGRID Benchmark Main Loop

      DO 10 I3=2,N-1
      DO 10 I2=2,N-1
      DO 10 I1=2,N-1
 10   R(I1,I2,I3) = V(I1,I2,I3)
     &  -A(0)*( U(I1,I2,I3) )
     &  -A(1)*( U(I1-1,I2,I3)+U(I1+1,I2,I3)
     &         +U(I1,I2-1,I3)+U(I1,I2+1,I3)
     &         +U(I1,I2,I3-1)+U(I1,I2,I3+1) )
     &  -A(2)*( U(I1-1,I2-1,I3)+U(I1+1,I2-1,I3)
     &         +U(I1-1,I2+1,I3)+U(I1+1,I2+1,I3)
     &         +U(I1,I2-1,I3-1)+U(I1,I2+1,I3-1)
     &         +U(I1,I2-1,I3+1)+U(I1,I2+1,I3+1)
     &         +U(I1-1,I2,I3-1)+U(I1-1,I2,I3+1)
     &         +U(I1+1,I2,I3-1)+U(I1+1,I2,I3+1) )
     &  -A(3)*( U(I1-1,I2-1,I3-1)+U(I1+1,I2-1,I3-1)
     &         +U(I1-1,I2+1,I3-1)+U(I1+1,I2+1,I3-1)
     &         +U(I1-1,I2-1,I3+1)+U(I1+1,I2-1,I3+1)
     &         +U(I1-1,I2+1,I3+1)+U(I1+1,I2+1,I3+1) )
Auto-parallel MGRID: overall speed-up is 40% on dual-core AMD Opteron

% pgf95 -fastsse -Mipa=fast,inline -Minfo -Mconcur mgrid.f
resid:
   ...
   189, Parallel code for non-innermost loop activated if loop count >= 33; block distribution
   291, 4 loop-carried redundant expressions removed with 12 operations and 16 arrays
        Generated vector SSE code for inner loop
        Generated 8 prefetch instructions for this loop
        Generated vector SSE code for inner loop
        Generated 8 prefetch instructions for this loop
Miscellaneous Optimizations (1)
-Mfprelaxed: single-precision sqrt, rsqrt, and div performed using a reduced-precision reciprocal approximation
-lacml and -lacml_mp: link in the AMD Core Math Library
-Mprefetch=d:<p>,n:<q>: control the prefetching distance and the max number of prefetch instructions per loop
-tp k8-32 can result in a big performance win on some C/C++ codes that don't require > 2GB addressing; pointer and long data become 32 bits
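The reciprocal-approximation idea behind -Mfprelaxed can be sketched in C. This is a software illustration using the well-known bit-level estimate; the compiler itself would use the hardware's low-precision estimate instruction (RSQRTSS) as the starting point, but the refinement step is the same idea:

```c
#include <math.h>
#include <string.h>

/* Reduced-precision rsqrt: start from a crude estimate of
   1/sqrt(x), then apply one Newton-Raphson step
   r' = r * (1.5 - 0.5*x*r*r).  The result is accurate to roughly
   0.2%, traded for avoiding a full-precision sqrt and divide. */
float approx_rsqrt(float x) {
    float half = 0.5f * x, r;
    unsigned int i;
    memcpy(&i, &x, sizeof i);        /* reinterpret the float's bits */
    i = 0x5f3759df - (i >> 1);       /* crude 1/sqrt(x) estimate */
    memcpy(&r, &i, sizeof r);
    r = r * (1.5f - half * r * r);   /* one Newton-Raphson refinement */
    return r;
}
```

This is exactly the accuracy trade-off the flag makes: codes that need the last few bits of IEEE precision should not use it, which is why it is not part of -fastsse.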
Miscellaneous Optimizations (2)
-O3: more aggressive hoisting and scalar replacement; not part of -fastsse, so always time your code to make sure it's faster
For C++ codes: --no_exceptions, -Minline=levels:10
-M[no]movnt: disable / force non-temporal moves
-V[version] to switch between PGI releases at the file level
-Mvect=noaltcode: disable multiple versions of loops
What's New in PGI 6.2
Industry-leading SPECFP06 and SPECINT06 performance
PGI Visual Fortran for Windows x64 & Windows XP
Full-featured PGI Workstation/Server for 32-bit Windows XP
PGI Unified Binary performance enhancements
More gcc extensions/compatibility
New SSE intrinsics
PGI CDK ROLL for ROCKS clusters
MPICH1 and MPICH2 support in the PGI CDK
Incremental debugger/profiler enhancements
Limited tuning for Intel Core2 (Woodcrest et al.)
PGI Visual Fortran 6.2
Deep integration with Visual Studio 2005
PGI-custom Fortran-aware text editor: syntax coloring, keyword completion, Fortran 95 intrinsics tips
PGI-custom project system and icons
PGI-custom property pages
One-touch project build/execute
PGI Unified Binary executables
Auto-parallel for multi-core CPUs
Native OpenMP 2.5 parallelization
World-class performance
64-bit Windows x64 support
32-bit Windows 2000/XP support
MS Visual C++ interoperability: mixed VC++ / PGI Fortran applications
PGI-custom parallel F95 debug engine: OpenMP 2.5 / threads debugging
Just-in-time debugging features
DVF/CVF compatibility features
Win32 API support
Optimization/support for AMD64 and Intel EM64T
DEC/IBM/Cray compatibility features
cpp-compatible pre-processing
Visual Studio 2005 bundled*
MSDN Library bundled*
GUI parallel debugging/profiling*
Assembly-optimized BLAS/LAPACK/FFTs*
Boxed CD-ROM/manuals media kit*
Complete (Visual Studio bundled) and Standard (no Visual Studio) versions
*PVF Workstation Complete only
On the PGI Roadmap
PGI Unified Binary directives and enhancements
Aggressive Intel Core2 and next-generation AMD64 tuning
Industry-leading SPECFP06 and SPECINT06 performance on Linux/Windows/AMD/Intel/32/64
Incremental PGDBG enhancements, improved C++ support
MPI debugging/profiling for Windows x64 CCS clusters
All-new cross-platform PGPROF performance profiler
Fortran 2003/C99 language features
GCC front-end compatibility, g++ interoperability
PGC++ tuning, PGC++/VC++ interoperability
Windows SUA and Apple/MacOS platform support
De facto standard scalable C/Fortran language/tools extensions
Questions?
Reach me at [email protected]
Thanks for your time