
Introduction to

GPU Programming
with CUDA and OpenACC

Alabama Supercomputer Center


Alabama Research and Education Network 1
Contents Topics

§ Why GPU chips and CUDA?
§ GPU chip architecture overview
§ CUDA programming
§ Queue system commands
§ Other GPU programming options
§ OpenACC programming
§ Comparing GPUs to other processors

2
What is a GPU chip? GPU

§ A Graphics Processing Unit (GPU) chip is an adaptation of
the technology in a video rendering chip to be used as a
math coprocessor.
§ The earliest graphics cards simply mapped memory bytes to
screen pixels – e.g., the Apple ][ in 1980.
§ The next generation of graphics cards (1990s) had 2D
rendering capabilities for rendering lines and shaded areas.
§ Graphics cards started accelerating 3D rendering with
standards like OpenGL and DirectX in the early 2000s.
§ The most recent graphics cards have programmable
processors, so that game physics can be offloaded from the
main processor to the GPU.
§ A series of GPU chips sometimes called GPGPU (General
Purpose GPU) chips have double precision capability so that
they can be used as math coprocessors.
3
Why GPUs? GPU

[Figure: comparison of peak theoretical GFLOPs and memory
bandwidth for NVIDIA GPUs and Intel CPUs over the past few
years. Graphs from the NVIDIA CUDA C Programming Guide 4.0.]

4
CUDA Programming Language CUDA
GPU chips are massively multithreaded, manycore
SIMD processors.

SIMD stands for Single Instruction Multiple Data.

Previously, GPU chips were programmed using standard
graphics APIs (DirectX, OpenGL).

CUDA, an extension of C, is the most popular GPU
programming language. CUDA can also be called
from a C++ program.

The CUDA standard has no Fortran support, but the
Portland Group sells a third-party CUDA Fortran compiler.
5
Nvidia GPU Models Chips
§  T10
–  30 multiprocessors, each with:
   8 single precision thread processors,
   2 special function units, and
   a double precision unit
–  1.3 GHz clock
–  240 cores per chip
–  1036.8 GFLOP single precision
–  86.4 GFLOP double precision

§  Fermi (T20)
–  14 multiprocessors, each with:
   32 thread processors that do single & double add/multiply
   (2 clock ticks per double precision operation), and
   4 special function units
–  1.15 GHz clock
–  Faster memory bus
–  Multiple kernels (subroutines) can run at once
–  448 cores per chip
–  1288 GFLOP single precision
–  515.2 GFLOP double precision

§  Kepler (K20)
–  13 multiprocessors, each with:
   192 single precision thread processors,
   64 double precision thread processors, and
   32 special function units
–  0.706 GHz clock
–  Threads can spawn new threads (recursion)
–  Multiple CPU cores can access the GPU simultaneously
–  2496 cores per chip
–  3520 GFLOP single precision
–  1170 GFLOP double precision

6
GPU Programming Example CUDA

// CPU only matrix add

int main() {
  int i, j;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      C[i][j] = A[i][j] + B[i][j];
    }
  }
}

// GPU kernel

__global__ void gpu(float A[N][N], float B[N][N], float C[N][N]) {
  int i = threadIdx.x;
  int j = threadIdx.y;
  C[i][j] = A[i][j] + B[i][j];
}

int main() {
  dim3 dimBlk(N, N);
  gpu<<<1, dimBlk>>>(A, B, C);
}
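
The kernel above uses a single thread block, so N is capped by the
per-block thread limit (512 threads on the T10). A minimal sketch of how
the same addition scales across a grid of blocks, using flattened arrays
(the kernel name, launch parameters, and bounds guard are illustrative):

__global__ void gpuGrid(const float *A, const float *B, float *C, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // global row index
  int j = blockIdx.y * blockDim.y + threadIdx.y;   // global column index
  if (i < n && j < n)                              // guard the ragged edge
    C[i * n + j] = A[i * n + j] + B[i * n + j];
}

// launch: dim3 blk(16, 16); dim3 grd((n + 15) / 16, (n + 15) / 16);
//         gpuGrid<<<grd, blk>>>(A, B, C, n);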
7
GPU Execution Model Code
Software → Hardware

§ Thread → Thread Processor: a thread is a single execution
of a kernel, and all threads execute the same code.

§ Thread Block → Multiprocessor: threads within a block have
access to shared memory for local cooperation.

§ Grid → Device: a kernel is launched as a grid of
independent thread blocks, and only a single kernel
executes at a time (on the T10).

8
SIMD Programming CUDA

1. Copy an array of data to the GPU.
2. Call the GPU, specifying the dimensions of the thread
blocks and the number of thread blocks (called a grid).
3. All processors execute the same subroutine, each on
a different element of the array.
4. The individual processors can choose different
branch paths. However, there is a performance
penalty as some wait while others are executing their
branch path.
5. Copy the array of data back out to the CPU.
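
A minimal host-side sketch of steps 1, 2, and 5, assuming a kernel
gpu that takes device pointers (the h_/d_ names and sizes are
illustrative):

float *d_A, *d_B, *d_C;
size_t bytes = N * N * sizeof(float);
cudaMalloc((void **)&d_A, bytes);                      // allocate device memory
cudaMalloc((void **)&d_B, bytes);
cudaMalloc((void **)&d_C, bytes);
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);   // step 1: copy data in
cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
dim3 dimBlk(N, N);
gpu<<<1, dimBlk>>>(d_A, d_B, d_C);                     // step 2: launch the kernel
cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);   // step 5: copy results out
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);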

GPU programming is more closely tied to chip
architecture than conventional languages are.
9
Multiple types of memory
help optimize performance Code
Motherboard
Page locked host memory – This allows the GPU to see the memory
on the motherboard. This is the slowest to access, but allows the GPU
to access the largest memory space.

GPU chip
Global memory – Visible to all multiprocessors on the GPU chip.
Constant memory – Device memory that is read-only to the thread
processors, with faster access than global memory.
Texture & Surface memory – Lower latency for reads to adjacent
array elements.

Multiprocessor
Shared memory – Shared between thread processors on the same
multiprocessor.

Thread processor
Local memory – accessible to the thread processor only. This is
where local variables are stored.
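
A minimal sketch of how these memory spaces appear in CUDA C (the
kernel and variable names are illustrative):

__constant__ float coeff[16];                 // constant memory: read-only on device

__global__ void scale(const float *g_in, float *g_out) {  // g_* live in global memory
    __shared__ float tile[256];               // shared memory: per-multiprocessor
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = g_in[i];                        // x is a local variable (registers/local memory)
    tile[threadIdx.x] = x * coeff[0];
    __syncthreads();                          // cooperate within the block
    g_out[i] = tile[threadIdx.x];
}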
10
Calling CUDA from C++ CUDA

§ #include <cuda_runtime.h>

§ The function call in file.cpp calls a function in file.cu
which is declared
extern "C" void function();

§ That function in turn calls a function declared
__global__ void function()

CAUTION: The C++ program must be named file.cpp
(not file.cc). Files named with extension .cc can be
erased by the make process.
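
A minimal sketch of this wrapper pattern (the file names and the kernel
body are illustrative):

// file.cu
__global__ void kernel() {
    // ... device code ...
}

extern "C" void function() {     // C linkage, callable from file.cpp
    kernel<<<1, 32>>>();
    cudaDeviceSynchronize();     // wait for the kernel to finish
}

// file.cpp
extern "C" void function();      // matching declaration

int main() {
    function();
    return 0;
}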

11
Double Precision Support CUDA
§ Double precision on GPUs is true 64 bit. Double precision
on x86 chips is 80 bit extended double precision.
[Bar chart: double precision bits – GPU 64, x86_64 80]

§ For double precision you must specify the architecture
like this
-arch=compute_13 -code=sm_13

§ Use double precision variables in the Makefile like
CUFILES_sm_13

§ For double precision, do NOT use -use_fast_math

§ The FLOPS ratings show that the Fermi chips should be
about 6X the performance of T10 chips for double
precision operations. However, the Fermi chips have an
8:1 ratio of thread processors to special function units,
and the T10 chips have a 4:1 ratio. One of our tests that
utilizes double precision and special functions showed a
2.5X improvement in speed in going from T10 to Fermi
chips.
12
CUDA SDK Directory Tree CUDA
§  Unlike most compilers, CUDA is designed to work within a directory
tree, rather than having source code, object and executable in the
same directory. The common tools in that directory tree must be
compiled before compiling your own programs.

§  The source code goes in
~/CUDA_SDK_4.0/C/src/MYPROGRAM

§  The object files get put in
~/CUDA_SDK_4.0/C/src/MYPROGRAM/obj/x86_64/release

§  The executables get put in
~/CUDA_SDK_4.0/C/bin/linux/release

§  The Makefile sets just a few variables, then loads a complex make
process with the command
include ../../common/common.mk
13
Using nvcc outside the directory tree CUDA

§ Compiling from within the CUDA directory tree is not
always desired.
§ Using the nvcc compiler directly (not with the
provided Makefile) is supported in versions 3.0 and 5.0,
but not in version 4.0.
§ The available compile flags can be found with the
command "nvcc --help".
§ An error-free compile and link does not necessarily
make a functioning executable.
§ In order to find out how the default build process works,
type the following

make clean
make -n

14
Changes between CUDA versions CUDA
§ CUDA is still evolving. Here are some of the things that
have changed from version 3 to 4 to 5 to 6:

§ Makefile format
§ Compile commands
§ Mechanisms for error trapping.
§ Header files
§ nvcc switched from using GCC to using LLVM
§ Processor support
§ MPI integration
§ Easier memory management called “unified memory”
§ C++ 11 support
§ Template support
§ New GPU based math libraries
15
Catching Error Messages CUDA

WARNING
By default, NO run time error messages are generated!

§ In order to generate error messages, the following steps
should be followed.

§ The .cu file should include
#include "cutil_inline.h"

§ Allocate data with cutilSafeCall, like this
cutilSafeCall(cudaMalloc(&Data, numx*numy*sizeof(double)));

§ Immediately after running a function on the GPU, call
cutilCheckMsg("MYFUNCTION<<<>>> failed\n");
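
The cutil macros above ship with the SDK sample tree. For code built
outside it, a plain runtime-API equivalent can be sketched like this (the
CHECK macro is an assumption for illustration, not an SDK facility):

#include <stdio.h>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess)                                      \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
    } while (0)

/* usage:
   CHECK(cudaMalloc((void **)&Data, numx * numy * sizeof(double)));
   myfunction<<<grid, block>>>(...);
   CHECK(cudaGetLastError());   // catches launch failures
*/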
16
Common Error Messages CUDA

CAUTION: Error messages are not very indicative of the
problem.

§ An error like this might mean you forgot to include the
header file for CUDA
myprogram.cc:81: error: expected constructor,
destructor, or type conversion before 'void'

§ An error like this indicates you have exceeded a limit,
like the maximum number of threads per block.
(9) invalid configuration argument

§ If you get an error saying that -lcuda can't be found, it
means that the compile must be done on one of the
nodes with GPU chips installed.
17
Common Error Messages CUDA

§ Some things that you would expect to be compile time
errors will show up as run time errors; for example,
incorrectly passing arguments to functions.

§ You can get the following error because of a thread
synchronization problem. Putting in
cudaDeviceSynchronize() calls can fix the problem.
==31102== Error: Internal profiling error
1719:999

§ If you are getting memory errors, try calling the program
like this.
cuda-memcheck program [arguments]

18
What algorithms work well on GPUs Code

§ Doing the same calculation with many pieces of input
data.

§ The number of processing steps should be at least an
order of magnitude greater than the number of pieces of
input/output data.

§ Single precision performance is better than double
precision.

§ Algorithms where most of the cores will follow the same
branch paths most of the time.

§ Algorithms that require little if any communication
between threads.
19
Adoption of GPUs at the
Alabama Supercomputer Center Code
Good
§ Recent versions of Amber perform well on GPUs and are
being used for production work.
§ Several universities have integrated GPU programming into
the curriculum.
§ About 5% of the applications at the Alabama Supercomputer
Center have GPU versions.

Disappointing
§ Early tests with BEAST, NAMD, and Quantum Espresso are
less than exciting. Not all algorithms are converted to GPU.

Status
§ The GPU offering remains a small test bed of 8 T10 chips, 8
Fermi chips, and 16 Kepler K20 chips.

20
Performance Optimization Code
§ Utilize the type of memory that will give the best
performance for the algorithm.

§ The chip is made for zero-latency thread swapping, so that
a different warp (a group of usually 32 threads) can run while
one warp is waiting on I/O, the special function units, or the
double precision units. Thus it is often best to have more
threads than thread processors.

§ The best number of threads per block depends on the program,
but should be a multiple of 32, such as 64, 128, 192, 256, or 768.

§ The grid size should be at least the number of
multiprocessors, and also works well as a multiple of the
number of multiprocessors.

§ If __syncthreads() slows the code, use more, smaller
blocks.
21
Mandelbrot Test Code
§ This is a single precision
Mandelbrot diagram generator
that is used as a simple parallel
programming example.
§ The large test run took 1 minute,
34 seconds on a single 2.26 GHz
Nehalem processor.
§ The same test took 9 seconds on
a T10 GPU after minimal
optimization of thread blocks.
§ This is a 10x speed-up, but not
the 100x that marketing claims
suggest is possible.
§ In this case, the conditional
do-while inner loop probably
caused some cores to sit idle,
waiting for the rest to reach their
break points.
22
Validation of Results Code

§ Validation is usually done using a gold kernel and
sometimes a silver kernel.

§ Gold kernel – data processed on the CPU with carefully
checked output. You compare the CUDA output to the
gold output to make sure the numerical accuracy is
within acceptable limits.

§ Silver kernel – data processed on the GPU without
optimization or algorithmic enhancements. This is the
first step in a GPU implementation. Comparing the
optimized kernel to the silver kernel shows whether the
optimization reduced accuracy.

§ Both of these usually use the simplest, most naive
version of the algorithm (e.g., rectangle rule integration).
23
Other CUDA Tools CUDA
§ CUDA Memory Checker (cuda-memcheck) can be used
to find memory violations
§ CUDA debugger (cuda-gdb) is an extension of the GNU
debugger for Linux
§ NVIDIA Parallel Nsight is a debugger for Microsoft
Visual Studio
§ CUDA Visual Profiler

24
CUDA References Doc
§ On the Alabama Supercomputer Center
systems, documentation is in the
directory /opt/asn/doc/gpu
–  Start with README.txt and TIPS.txt
–  CUDA_C_Getting_Started_Linux.pdf
–  CUDA_C_Programming_Guide.pdf
–  CUDA_C_Best_Practices_Guide.pdf
–  Examples are in the portland_accelerator and
portland_cuda_fortran directories
–  There is more information in the supplmental_docs
directory

§ A good introduction to CUDA programming:
–  "CUDA BY EXAMPLE" by J. Sanders, E. Kandrot,
Addison Wesley, 2011.

25
GPUs & the Queue System ASC

§ The queue system at the Alabama Supercomputer
Center has a couple of commands for submitting work to
the queues.

§ The "gpu_interactive" command opens an
interactive session on a GPU node. This should be
used for compiling only if the code will not compile on the
login node.

§ The "run_gpu" command is used for submitting all
production work to the queue.

§ Only one GPU is available to a job. This is a policy
restriction due to the limited number of GPU chips
available.

26
Other GPU Programming Options Code
§ PGI Accelerator is a commercial compiler that allows
programming NVIDIA GPUs with OpenACC, a syntax
similar to OpenMP.

§ OpenMP is starting to release GPU features.

§ OpenCL is a language under development for parallel
programming of many different hardware architectures
with a common syntax.
§ There are CUDA plugins for Python, Matlab, and
Mathematica.
§ Math libraries
–  cuSOLVER (BLAS, LAPACK)
–  cuFFT
–  NVIDIA Performance Primitives library – NPP
–  GPULib
–  FLAGON – Fortran-9x library
–  Thrust (C++11)
§ Several more came and went already.

27
OpenACC Example OpenACC

§ OpenACC is easier to program than CUDA,
but less efficient, so the program won't run as fast.

// OpenACC matrix add

int main() {
  int i, j;
  #pragma acc kernels loop gang(32), vector(16)
  for (i = 0; i < N; i++) {
    #pragma acc loop gang(16), vector(32)
    for (j = 0; j < N; j++) {
      C[i][j] = A[i][j] + B[i][j];
    }
  }
}

28
Common OpenACC directives OpenACC
§ OpenACC directives in C and C++
#pragma acc DIRECTIVE
§ OpenACC directives in Fortran
!$acc DIRECTIVE
lines of Fortran code
!$acc end DIRECTIVE
§ Directive to attempt automatic parallelization
#pragma acc kernels
§ Directive to parallelize the next loop
#pragma acc parallel loop
§ Directive to specify which variables are copied, and
which are local
#pragma acc data copy(A), create(Anew)

The data directive is often needed to cut out data bottlenecks
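
A minimal sketch combining these directives on the earlier matrix add
(untuned and illustrative; it assumes A, B, and C have sizes known at
compile time):

#pragma acc data copyin(A, B) copyout(C)   // stage the arrays on the GPU once
{
  #pragma acc parallel loop
  for (int i = 0; i < N; i++) {
    #pragma acc loop
    for (int j = 0; j < N; j++) {
      C[i][j] = A[i][j] + B[i][j];
    }
  }
}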


29
Compiling and Running OpenACC

§ Typical compile command for C
pgcc -acc -Minfo=accel -ta=nvidia -o file file.c

§ Environment variable to print GPU use information at
run time
export PGI_ACC_TIME=1
The program runs slightly slower with this turned on.

§ Environment variable to print out information about data
transfers to the GPU at run time
export PGI_ACC_NOTIFY=3
This slows down execution significantly.

30
Ideal cases for OpenACC OpenACC

§ Programs where one or a few small sections of the
program are responsible for most of the CPU time.

§ Loops with many iterations.
§ Loops with no data dependencies between iterations.
§ Loops that work on many elements of large arrays.
§ Loops where functions can be inlined.

§ Conditional statements are OK, but better if you can
guess in advance which batches of data will follow the
same branch.

§ Portland Group compilers create programs with code
for three generations of GPUs: Tesla, Fermi, & Kepler.

31
What Does NOT work well OpenACC

§ Loops with IO statements.
§ Loops with early exits, including do-while loops.
§ Loops with many branches to other functions.
§ Pointer arithmetic.

Confusingly, a failed compile creates a single-processor
executable.

32
OpenACC vs. CUDA Code

§ CUDA creates software for NVIDIA GPUs only. OpenACC
can program GPUs, Opteron, ATI, APUs, Xeon, and
Xeon Phi.
§ OpenACC does loop-level parallelization. CUDA
parallelizes at the subroutine level.
§ OpenACC is easier to use for writing new programs or
adapting existing code.
§ CUDA is currently used more widely.
§ Some algorithms can be implemented in CUDA, but not
in OpenACC, e.g., recursion or early-exit loops.
§ OpenACC is newer (version 2.0 is out). CUDA is on
version 7.
§ Both are still undergoing significant changes.

§ CUDA programs usually run faster (perhaps 30%).

33
OpenACC documentation Doc
§ Look at the Getting Started documentation and videos at
openacc-standard.org
§ https://developer.nvidia.com/content/openacc-example-part-1
§ The PGI Accelerator Compilers OpenACC Getting Started Guide
http://www.pgroup.com/doc/openACC_gs.pdf
§ There are example programs in the directories
/opt/asn/doc/pgi/accelerator_examples
/opt/asn/doc/pgi/openacc_example
§ There are tips for best results in the file
/opt/asn/doc/gpu/openacc_tips.txt
§ OpenACC 2.0 examples are at
http://devblogs.nvidia.com/parallelforall/7-powerful-new-features-openacc-2-0/

Unfortunately, once you get past the introductory
documentation, you will need to read the OpenACC technical
specifications and ask questions on user forums to maximize
performance with OpenACC.

34
Comparing GPUs to other types of processors

35
Vector/SIMD extensions SSE
•  4 x86 ops to add two single precision, four-component vectors:

vector_result.x = vector_1.x + vector_2.x;
vector_result.y = vector_1.y + vector_2.y;
vector_result.z = vector_1.z + vector_2.z;
vector_result.w = vector_1.w + vector_2.w;

•  Using 128-bit SSE registers, pack the vector components into a single
register per vector to reduce this from 4 scalar addition ops to a single
SSE vector addition.
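
A sketch of the packed version using SSE intrinsics, assuming each
vector is a struct of four contiguous, 16-byte-aligned floats (x, y, z, w):

#include <xmmintrin.h>

__m128 v1 = _mm_load_ps(&vector_1.x);       // pack x, y, z, w into one 128-bit register
__m128 v2 = _mm_load_ps(&vector_2.x);
__m128 vr = _mm_add_ps(v1, v2);             // one vector add replaces four scalar adds
_mm_store_ps(&vector_result.x, vr);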

•  Intel’s Sandy Bridge architecture (used in UV) introduced AVX


instructions that further widens vector data path from 128 to 256 bits,
potentially resulting in up to a 2x performance improvement for some
applications

36
FLOPS vs Chip Architecture Chips

§ The FLOPS (FLoating point Operations Per Second)
rating is NOT a good comparison of GPU performance
relative to conventional processor performance.
–  The FLOPS rating is usually a poor way to compare any types of chips.

§ The FLOPS rating for conventional processors includes
the vector math circuitry for SSE instructions. If your
program cannot use SSE instructions, a conventional
processor may under-perform its FLOPS rating, while the
GPU may come closer to its rated FLOPS performance.

§ If your program has significant communication between
threads, or different threads take different branch paths,
the GPU may do worse than the FLOPS ratings suggest.

37
GPU vs. Xeon vs. Xeon Phi Chips
§ Xeon Phi is a processor with 57-61 x86-compatible cores
running at 1.053 to 1.238 GHz.

§ Xeon Phi is NOT a chip with a bunch of Xeon processor cores.
The cores on Phi are less powerful (about 1/5 the speed).

§ Xeon Phi is a new chip architecture called MIC (Many Integrated
Cores). The next MIC chip will be Knights Landing.

§ OpenMP-parallelized software will run on Xeon Phi, but runs
faster if you do some work to manage memory access
bottlenecks.

§ Xeon Phi has SSE vector mathematics instructions. GPUs do
not do vector math.
38
Summary Done
§ There is a lot of interest in the HPC community about using GPU
chips because GPUs can give 10-300 fold the processing capacity
for the dollar spent on hardware... provided you have invested the
effort to port the software to that architecture.
§ GPUs are easier to program than other coprocessor technologies
(i.e. FPGAs).
§ The GPGPU programming market is currently dominated by Nvidia
chips and the CUDA programming language.
§ CUDA is the most mature of the GPU programming options, but still
an early stage technology.
§ OpenACC is increasing in popularity.
§ CUDA is more closely tied to hardware than higher level languages
like C++.
§ Many experts predict that OpenCL could become the preferred GPU
programming method if future versions achieve the intended goal of
being a “write once – run anywhere” parallel language.
39
Alabama
Supercomputer
Authority

State of Alabama Leader and Trusted Partner for Technology 40