PGI Compilers
Tools for Scientists and Engineers
September 2006
www.pgroup.com
Outline of Today's Topics
Introduction to PGI Compilers and Tools
Documentation, Getting Help
Basic Compiler Options
Optimization Strategies
6.2 Features and Roadmap
Questions and Answers
PGI Compilers and Tools, features
Optimization: state-of-the-art vector, parallel, IPA, feedback; cross-platform AMD & Intel, 32/64-bit, Linux & Windows; PGI Unified Binary for AMD and Intel processors
Tools: integrated OpenMP/MPI debug & profile, IDE integration
Parallel: MPI, OpenMP 2.5, auto-parallel for multi-core
Comprehensive OS support: Red Hat 7.3-9.0, RHEL 3.0/4.0, Fedora Core 2/3/4/5, SuSE 7.1-10.1, SLES 8/9/10, Windows XP, Windows x64
PGI Tools Enable Developers to:
View x64 as a unified CPU architecture
Extract peak performance from x64 CPUs
Ride innovation waves from both Intel and AMD
Use a single source base and toolset across Linux and Windows
Develop, debug, and tune parallel applications for multi-core, multi-core SMP, and clustered multi-core SMP
PGI Documentation and Support
PGI-provided documentation
PGI User Forums at www.pgroup.com
PGI FAQs, Tips & Techniques pages
Web support, a form-based system similar to email support
Fax support
PGI Docs & Support, cont.
Legacy phone support, direct access, etc.
PGI download web page
PGI prepared/personalized training
PGI ISV program
PGI Premier Service program
PGI Basic Compiler Options
Basic usage
Language dialects
Target architectures
Debugging aids
Optimization switches
PGI Basic Compiler Usage
A compiler driver interprets options and invokes preprocessors, compilers, assembler, linker, etc.
Options precedence: if options conflict, the last option on the command line takes precedence
Use -Minfo to see a listing of optimizations and transformations performed by the compiler
Use -help to list all options or see details on how to use a given option, e.g. pgf90 -Mvect -help
Use man pages for more details on options, e.g. man pgf90
Use -v to see under the hood
Flags to support language dialects
Fortran
pgf77, pgf90, pgf95, pghpf tools
Suffixes .f, .F, .for, .fpp, .f90, .F90, .f95, .F95, .hpf, .HPF
-Mextend, -Mfixed, -Mfreeform
Type size: -i2, -i4, -i8, -r4, -r8, etc.
-Mcray, -Mbyteswapio, -Mupcase, -Mnomain, -Mrecursive, etc.

C/C++
pgcc, pgCC (aka pgcpp)
Suffixes .c, .C, .cc, .cpp, .i
-B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt
-Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs
Specifying the target architecture
Not an issue on the XT3; it defaults to the type of processor/OS you are running on. Use the -tp switch:
-tp k8-64, -tp p7-64, or -tp core2-64 for 64-bit code
-tp amd64e for AMD Opteron Rev E or later
-tp x64 for a unified binary
-tp k8-32, k7, p7, piv, piii, p6, p5, px for 32-bit code
Flags for debugging aids
-g generates symbolic debug information used by a debugger
-gopt generates debug information in the presence of optimization
-Mbounds adds array bounds checking
-v gives verbose output, useful for debugging system or build problems
-Mlist will generate a listing
-Minfo provides feedback on optimizations made by the compiler
-S or -Mkeepasm to see the exact assembly generated
Basic optimization switches
Traditional optimization is controlled through -O[<n>], where n is 0 to 4
The -fast switch combines a common set of options into one simple switch; it is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre
For -Munroll, c:<n> specifies completely unrolling loops with this loop count or less; -Munroll=n:<m> says unroll other loops m times
-Mnoframe does not set up a stack frame
-Mlre is loop-carried redundancy elimination
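The transformation behind -Munroll can be sketched in plain C. This is a hand-written illustration, not compiler output; the function names are invented for this example:

```c
#include <stddef.h>

/* Straightforward reduction: one add per iteration, and a
   loop-carried dependence on "sum" serializes the adds. */
float sum_simple(const float *a, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled 4x, roughly the shape -Munroll=n:4 produces: four
   partial sums break the dependence chain and expose
   instruction-level parallelism; a cleanup loop handles the
   leftover iterations when n is not a multiple of 4. */
float sum_unrolled(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)        /* cleanup loop */
        sum += a[i];
    return sum;
}
```

With c:1, the compiler applies the complete-unroll form only to loops whose trip count is known and at most 1, eliminating the loop entirely.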
Basic optimization switches, cont.
The -fastsse switch is commonly used; it extends -fast to SSE hardware and vectorization
-fastsse is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre (-fast) plus -Mvect=sse, -Mscalarsse, -Mcache_align, -Mflushz
-Mcache_align aligns top-level arrays and objects on cache-line boundaries
-Mflushz flushes SSE denormal numbers to zero
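The denormals that -Mflushz targets can be produced in a few lines of C. This is an illustrative sketch (the function name is made up here); without flush-to-zero, arithmetic on such values can be far slower than on normal floats:

```c
#include <float.h>
#include <math.h>

/* Halving the smallest normal float yields a subnormal (denormal)
   value.  On SSE hardware, operations on subnormals typically take
   a slow microcoded path; -Mflushz sets the SSE control bits so
   these values are treated as 0.0 instead, trading a tiny amount
   of accuracy near zero for consistent speed. */
int half_min_is_subnormal(void) {
    volatile float x = FLT_MIN;   /* smallest normal float */
    volatile float y = x / 2.0f;  /* subnormal when FTZ is off */
    return fpclassify(y) == FP_SUBNORMAL;
}
```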
Optimization Strategies
Establish a workload
Optimize from the top down
Use proper tools and methods
Processor-level optimizations, parallel methods
Different flags/features for different types of code
Node level tuning
Vectorization: packed SSE instructions maximize performance
Interprocedural Analysis (IPA): use it! motivating examples
Function Inlining: especially important for C and C++
Parallelization: for Cray XD1 and multi-core processors
Miscellaneous Optimizations: hit or miss, but worth a try
Vectorizable F90 Array Syntax (data is REAL*4)

350 !
351 ! Initialize vertex, similarity and coordinate arrays
352 !
353       Do Index = 1, NodeCount
354          IX = MOD (Index - 1, NodesX) + 1
355          IY = ((Index - 1) / NodesX) + 1
356          CoordX (IX, IY) = Position (1) + (IX - 1) * StepX
357          CoordY (IX, IY) = Position (2) + (IY - 1) * StepY
358          JetSim (Index) = SUM (Graph (:, :, Index) * &
359      &      GaborTrafo (:, :, CoordX(IX,IY), CoordY(IX,IY)))
360          VertexX (Index) = MOD (Params%Graph%RandomIndex (Index) - 1, NodesX) + 1
361          VertexY (Index) = ((Params%Graph%RandomIndex (Index) - 1) / NodesX) + 1
362       End Do
The inner loop at line 358 is vectorizable and can use packed SSE instructions
-fastsse to Enable SSE Vectorization, -Minfo to List Optimizations to stderr

% pgf95 -fastsse -Mipa=fast -Minfo -S graphRoutines.f90
localmove:
   334, Loop unrolled 1 times (completely unrolled)
   343, Loop unrolled 2 times (completely unrolled)
   358, Generated an alternate loop for the inner loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
Scalar SSE:

.LB6_668:                 # lineno: 358
        movss  -12(%rax),%xmm2
        movss  -4(%rax),%xmm3
        subl   $1,%edx
        mulss  -12(%rcx),%xmm2
        addss  %xmm0,%xmm2
        mulss  -4(%rcx),%xmm3
        movss  -8(%rax),%xmm0
        mulss  -8(%rcx),%xmm0
        addss  %xmm0,%xmm2
        movss  (%rax),%xmm0
        addq   $16,%rax
        addss  %xmm3,%xmm2
        mulss  (%rcx),%xmm0
        addq   $16,%rcx
        testl  %edx,%edx
        addss  %xmm0,%xmm2
        movaps %xmm2,%xmm0
        jg     .LB6_625
Vector SSE:

.LB6_1245:                # lineno: 358
        movlps     (%rdx,%rcx),%xmm2
        subl       $8,%eax
        movlps     16(%rcx,%rdx),%xmm3
        prefetcht0 64(%rcx,%rsi)
        prefetcht0 64(%rcx,%rdx)
        movhps     8(%rcx,%rdx),%xmm2
        mulps      (%rsi,%rcx),%xmm2
        movhps     24(%rcx,%rdx),%xmm3
        addps      %xmm2,%xmm0
        mulps      16(%rcx,%rsi),%xmm3
        addq       $32,%rcx
        testl      %eax,%eax
        addps      %xmm3,%xmm0
        jg         .LB6_1245
Facerec Scalar: 104.2 sec Facerec Vector: 84.3 sec
Vectorizable C Code Fragment?
217  void func4(float *u1, float *u2, float *u3,
221     for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx+NE-1; i++)
222        u3[i] += clz * (p1[i] + p2[i]);
223     for (i = -NI+1; i < nx+NE-1; i++) {
224        float vdt = v[i] * dt;
225        u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226     }

% pgcc -fastsse -Minfo functions.c
func4:
   221, Loop unrolled 4 times
   221, Loop not vectorized due to data dependency
   223, Loop not vectorized due to data dependency
Pointer Arguments Inhibit Vectorization
217  void func4(float *u1, float *u2, float *u3,
221     for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx+NE-1; i++)
222        u3[i] += clz * (p1[i] + p2[i]);
223     for (i = -NI+1; i < nx+NE-1; i++) {
224        float vdt = v[i] * dt;
225        u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226     }

% pgcc -fastsse -Msafeptr -Minfo functions.c
func4:
   221, Generated vector SSE code for inner loop
        Generated 3 prefetch instructions for this loop
   223, Unrolled inner loop 4 times
C Constant Inhibits Vectorization
217  void func4(float *u1, float *u2, float *u3,
221     for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx+NE-1; i++)
222        u3[i] += clz * (p1[i] + p2[i]);
223     for (i = -NI+1; i < nx+NE-1; i++) {
224        float vdt = v[i] * dt;
225        u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226     }

% pgcc -fastsse -Msafeptr -Mfcon -Minfo functions.c
func4:
   221, Generated vector SSE code for inner loop
        Generated 3 prefetch instructions for this loop
   223, Generated vector SSE code for inner loop
        Generated 4 prefetch instructions for this loop
-Msafeptr Option and Pragma
-M[no]safeptr[=all | arg | auto | dummy | local | static | global]
   all      All pointers are safe
   arg      Argument pointers are safe
   local    Local pointers are safe
   static   Static local pointers are safe
   global   Global pointers are safe

#pragma [scope] [no]safeptr={arg | local | global | static | all}
where scope is global, routine, or loop
Common Barriers to SSE Vectorization
Potential dependencies and C pointers: give the compiler more info with -Msafeptr, pragmas, or the restrict type qualifier
Function calls: try inlining with -Minline or -Mipa=inline
Type conversions: manually convert constants or use flags
Large number of statements: try -Mvect=nosizelimit
Too few iterations: usually better to unroll the loop
Real dependencies: must restructure the loop, if possible
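The restrict route can be sketched as follows. This is an illustrative C99 function loosely modeled on the earlier fragment, not the original source:

```c
/* Without restrict, the compiler must assume u3 might alias p1 or
   p2, so a store to u3[i] could change later loads -- forcing
   scalar code.  The C99 restrict qualifier promises the arrays do
   not overlap, which has the same effect as -Msafeptr=arg but only
   for this one function. */
void add_scaled(float *restrict u3,
                const float *restrict p1,
                const float *restrict p2,
                float clz, int n) {
    for (int i = 0; i < n; i++)
        u3[i] += clz * (p1[i] + p2[i]);
}
```

Unlike the command-line switch, restrict is visible in the source and travels with the code to other compilers.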
Barriers to Efficient Execution of Vector SSE Loops
Not enough work: vectors are too short
Vectors not aligned to a cache-line boundary
Non-unity strides
Code bloat if altcode is generated
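The stride issue is easy to see in C. A minimal sketch (invented function names) contrasting the two access patterns:

```c
#include <stddef.h>

/* Unit stride: consecutive elements, so one cache line feeds
   several iterations and the loads map directly onto packed SSE
   moves. */
float sum_stride1(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Stride 8: touches one float out of every 32 bytes.  Each load is
   effectively a fresh cache-line access, and the elements cannot
   be gathered into a packed register cheaply, so even when such a
   loop vectorizes it rarely runs faster. */
float sum_stride8(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i += 8)
        s += a[i];
    return s;
}
```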
What can Interprocedural Analysis and Optimization with -Mipa do for You?
Interprocedural constant propagation
Pointer disambiguation
Alignment detection, alignment propagation
Global variable mod/ref detection
F90 shape propagation
Function inlining
IPA optimization of libraries, including inlining
Effect of IPA on the WUPWISE Benchmark
PGF95 Compiler Options           Execution Time in Seconds
-fastsse                         156.49
-fastsse -Mipa=fast              121.65
-fastsse -Mipa=fast,inline        91.72
-Mipa=fast => constant propagation => the compiler sees that the complex matrices are all 4x3 => completely unrolls the loops
-Mipa=fast,inline => the small matrix multiplies are all inlined
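The WUPWISE effect can be sketched in C. This is not the benchmark's actual source; the names and the 4x3 shape are illustrative, showing why propagated constant dimensions let the optimizer fully unroll:

```c
/* General version: m and n are runtime values, so the compiler
   must keep both loops and their trip-count tests. */
void matvec(const float *a, const float *x, float *y, int m, int n) {
    for (int i = 0; i < m; i++) {
        y[i] = 0.0f;
        for (int j = 0; j < n; j++)
            y[i] += a[i * n + j] * x[j];
    }
}

/* After IPA constant propagation the call site is known to pass
   m=4, n=3; inlining then hands the optimizer this form, which it
   can collapse to 12 multiply-adds with no loop overhead at all. */
static inline void matvec_4x3(const float a[12], const float x[3],
                              float y[4]) {
    matvec(a, x, y, 4, 3);   /* constant dimensions, fully unrollable */
}
```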
Using Interprocedural Analysis
Must be used at both compile time and link time
Non-disruptive to the development process: edit/build/run
Speed-ups of 5%-10% are common
-Mipa=safe:<name> safe to optimize functions which call or are called from unknown function/library name
-Mipa=libopt perform IPA optimizations on libraries
-Mipa=libinline perform IPA inlining from libraries
Explicit Function Inlining
-Minline[=[lib:]<inlib> | [name:]<func> | except:<func> | size:<n> | levels:<n>]
   [lib:]<inlib>    Inline extracted functions from inlib
   [name:]<func>    Inline function func
   except:<func>    Do not inline function func
   size:<n>         Inline only functions smaller than n statements (approximate)
   levels:<n>       Inline n levels of functions

For C++ codes, PGI recommends IPA-based inlining or -Minline=levels:10!
Other C++ recommendations
Encapsulation, data hiding: small functions, inline!
Exception handling: use --no_exceptions until 7.0
Overloaded operators, overloaded functions: okay
Pointer chasing: -Msafeptr, the restrict qualifier, 32 bits?
Templates, generic programming: now okay
Inheritance, polymorphism, virtual functions: runtime lookup or check, no inlining, potential performance penalties
SMP Parallelization
-Mconcur for auto-parallelization on multi-core
The compiler strives for parallel outer loops with vector SSE inner loops
-Mconcur=innermost forces a vector/parallel innermost loop
-Mconcur=cncall enables parallelization of loops with calls
-mp to enable the OpenMP 2.5 parallel programming model
See the PGI User's Guide or the OpenMP 2.5 standard
OpenMP programs compiled without -mp just work
Not supported on Cray XT3; it would require some custom work
-Mconcur and -mp can be used together!
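The loop-nest shape both -Mconcur and OpenMP target can be sketched in C. The function below is a hypothetical example: a parallel outer loop over rows with a unit-stride, vectorizable inner loop over columns. Compiled without OpenMP support the pragma is ignored and the code simply runs serially:

```c
/* Parallel outer loop, vector inner loop: each outer iteration
   writes a disjoint row of "out", so rows can be distributed
   across cores while the inner loop uses packed SSE moves. */
void smooth(float *out, const float *in, int rows, int cols) {
    #pragma omp parallel for
    for (int i = 1; i < rows - 1; i++)
        for (int j = 0; j < cols; j++)
            out[i * cols + j] = 0.5f * (in[(i - 1) * cols + j] +
                                        in[(i + 1) * cols + j]);
}
```

This is also why the compiler prefers parallelizing the outermost loop: one fork/join amortized over the most work, with the contiguous dimension left free for vectorization.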
MGRID Benchmark Main Loop

      DO 10 I3=2,N-1
      DO 10 I2=2,N-1
      DO 10 I1=2,N-1
 10   R(I1,I2,I3) = V(I1,I2,I3)
     &  -A(0)*( U(I1,I2,I3) )
     &  -A(1)*( U(I1-1,I2,I3)+U(I1+1,I2,I3)
     &         +U(I1,I2-1,I3)+U(I1,I2+1,I3)
     &         +U(I1,I2,I3-1)+U(I1,I2,I3+1) )
     &  -A(2)*( U(I1-1,I2-1,I3)+U(I1+1,I2-1,I3)
     &         +U(I1-1,I2+1,I3)+U(I1+1,I2+1,I3)
     &         +U(I1,I2-1,I3-1)+U(I1,I2+1,I3-1)
     &         +U(I1,I2-1,I3+1)+U(I1,I2+1,I3+1)
     &         +U(I1-1,I2,I3-1)+U(I1-1,I2,I3+1)
     &         +U(I1+1,I2,I3-1)+U(I1+1,I2,I3+1) )
     &  -A(3)*( U(I1-1,I2-1,I3-1)+U(I1+1,I2-1,I3-1)
     &         +U(I1-1,I2+1,I3-1)+U(I1+1,I2+1,I3-1)
     &         +U(I1-1,I2-1,I3+1)+U(I1+1,I2-1,I3+1)
     &         +U(I1-1,I2+1,I3+1)+U(I1+1,I2+1,I3+1) )
Auto-parallel MGRID: overall speed-up is 40% on dual-core AMD Opteron

% pgf95 -fastsse -Mipa=fast,inline -Minfo -Mconcur mgrid.f
resid:
   ...
   189, Parallel code for non-innermost loop activated if loop count >= 33; block distribution
   291, 4 loop-carried redundant expressions removed with 12 operations and 16 arrays
        Generated vector SSE code for inner loop
        Generated 8 prefetch instructions for this loop
        Generated vector SSE code for inner loop
        Generated 8 prefetch instructions for this loop
Miscellaneous Optimizations (1)
-Mfprelaxed: single-precision sqrt, rsqrt, and div performed using a reduced-precision reciprocal approximation
-lacml and -lacml_mp: link in the AMD Core Math Library
-Mprefetch=d:<p>,n:<q>: control the prefetching distance and the max number of prefetch instructions per loop
-tp k8-32 can result in a big performance win on some C/C++ codes that don't require > 2GB addressing; pointer and long data become 32 bits
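The reciprocal-approximation idea behind -Mfprelaxed can be sketched in C. This is a software illustration using the well-known bit-level estimate; the compiler itself would use the hardware's low-precision estimate instruction (RSQRTSS) as the starting point, but the refinement step is the same idea:

```c
#include <math.h>
#include <string.h>

/* Reduced-precision rsqrt: start from a crude estimate of
   1/sqrt(x), then apply one Newton-Raphson step
   r' = r * (1.5 - 0.5*x*r*r).  The result is accurate to roughly
   0.2%, traded for avoiding a full-precision sqrt and divide. */
float approx_rsqrt(float x) {
    float half = 0.5f * x, r;
    unsigned int i;
    memcpy(&i, &x, sizeof i);        /* reinterpret the float's bits */
    i = 0x5f3759df - (i >> 1);       /* crude 1/sqrt(x) estimate */
    memcpy(&r, &i, sizeof r);
    r = r * (1.5f - half * r * r);   /* one Newton-Raphson refinement */
    return r;
}
```

This is exactly the accuracy trade-off the flag makes: codes that need the last few bits of IEEE precision should not use it, which is why it is not part of -fastsse.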
Miscellaneous Optimizations (2)
-O3: more aggressive hoisting and scalar replacement; not part of -fastsse, so always time your code to make sure it's faster
For C++ codes: --no_exceptions, -Minline=levels:10
-M[no]movnt: disable / force non-temporal moves
-V[version] to switch between PGI releases at the file level
-Mvect=noaltcode: disable multiple versions of loops
What's New in PGI 6.2
Industry-leading SPECFP06 and SPECINT06 performance
PGI Visual Fortran for Windows x64 & Windows XP
Full-featured PGI Workstation/Server for 32-bit Windows XP
PGI Unified Binary performance enhancements
More gcc extensions/compatibility
New SSE intrinsics
PGI CDK ROLL for ROCKS clusters
MPICH1 and MPICH2 support in the PGI CDK
Incremental debugger/profiler enhancements
Limited tuning for Intel Core2 (Woodcrest et al.)
PGI Visual Fortran 6.2
Deep integration with Visual Studio 2005
PGI-custom Fortran-aware text editor: syntax coloring, keyword completion, Fortran 95 intrinsics tips
PGI-custom project system and icons
PGI-custom property pages
One-touch project build/execute
PGI Unified Binary executables
Auto-parallel for multi-core CPUs
Native OpenMP 2.5 parallelization
World-class performance
64-bit Windows x64 support
32-bit Windows 2000/XP support
MS Visual C++ interoperability: mixed VC++ / PGI Fortran applications
PGI-custom parallel F95 debug engine: OpenMP 2.5 / threads debugging
Just-in-time debugging features
DVF/CVF compatibility features
Win32 API support
Optimization/support for AMD64 and Intel EM64T
DEC/IBM/Cray compatibility features
cpp-compatible pre-processing
Visual Studio 2005 bundled*
MSDN Library bundled*
GUI parallel debugging/profiling*
Assembly-optimized BLAS/LAPACK/FFTs*
Boxed CD-ROM/manuals media kit*
Complete (Visual Studio bundled) and Standard (no Visual Studio) versions
*PVF Workstation Complete only
On the PGI Roadmap
PGI Unified Binary directives and enhancements
Aggressive Intel Core2 and next-generation AMD64 tuning
Industry-leading SPECFP06 and SPECINT06 performance on Linux/Windows/AMD/Intel/32/64
Incremental PGDBG enhancements, improved C++ support
MPI debugging/profiling for Windows x64 CCS clusters
All-new cross-platform PGPROF performance profiler
Fortran 2003/C99 language features
GCC front-end compatibility, g++ interoperability
PGC++ tuning, PGC++/VC++ interoperability
Windows SUA and Apple/MacOS platform support
De facto standard scalable C/Fortran language/tools extensions
Questions?
Reach me at [email protected]
Thanks for your time