High Performance Computing for
Data Intensive and Complex AI Applications
OpenACC Programming
(Theory & Hands on)
by
Dr. S. Devi Mahalakshmi, Professor/CSE
Mepco Schlenk Engineering College, Sivakasi
[email protected]
1 August 2025 1
Index
• GPU overview
  - Introduction
  - GPU v/s CPU
  - Execution model
  - GPGPU overview
  - NVIDIA Tesla V100
  - GPGPU programming models
• OpenACC
  - Introduction
  - Execution model
  - Levels of parallelism
  - Directive syntax
  - Compute constructs
  - Loop and kernels
  - Data directives
• CUDA
  - Grids, blocks, threads
  - CUDA programs
2
GP-GPU
• General Purpose GPU (GPGPU): using a GPU for general-purpose scientific and engineering computing
• Processing of non-graphical entities
• Targets computationally intensive, data-parallel tasks to reduce execution time
3
GPU v/s CPU
Architectural Differences
[Figure: CPU vs GPU block diagrams showing control logic, ALU, cache and DRAM]
CPU:
- Fewer than 20 cores
- 1-2 threads per core
- Latency is hidden by large caches
GPU:
- 512 cores
- 10s to 100s of threads per core
- Latency is hidden by fast context switching
GPUs don’t run without CPUs
4
Execution model
5
Programming models for GPGPU
Low-level programming (maximum flexibility):
• CUDA – Compute Unified Device Architecture
• OpenCL – Open Computing Language
Directives (easily accelerate applications):
• OpenMP – Open Multiprocessing
• OpenACC – Open Accelerators
6
OpenACC
7
OpenACC
What is OpenACC?
• OpenACC (for Open Accelerators) is a programming standard for parallel computing on accelerators (mostly NVIDIA GPUs).
• It is designed to simplify GPU programming.
• The basic approach is to insert special comments (directives) into the code so as
to offload computation onto GPUs and parallelize the code at the level of GPU
(CUDA) cores.
• It is possible for programmers to create an efficient parallel OpenACC code with
only minor modifications to a serial CPU code.
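For illustration, a minimal sketch (not taken from the slides) of how a single directive offloads a serial loop; the size n and the arrays a and b are assumed names:

// serial loop offloaded to the GPU with one OpenACC directive
// (illustrative sketch: n, a and b are assumed names)
#pragma acc parallel loop
for (int i = 0; i < n; i++)
    b[i] = 2.0f * a[i];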
8
OpenACC
What is OpenACC?
• A set of compiler directives that allow code regions to be offloaded from a host CPU to an accelerator such as a GPU
• High level GPU programming
• Similar to OpenMP directives
• Works for Fortran, C, C++
• Portable across different platforms and compilers
• Compilers supported - PGI, CRAY, GCC
9
NVIDIA GPU (CUDA) Task Granularity
GPU device -- CUDA grids:
Kernels/grids are assigned to a device.
• Streaming Multiprocessor (SM) -- CUDA thread blocks:
Blocks are assigned to an SM.
• CUDA cores -- CUDA threads:
Threads are assigned to cores.
• Warp: a unit that consists of 32 threads.
• Blocks are divided into warps.
• The SM executes threads at warp granularity.
• The warp size may change in future GPU generations.
10
OpenACC Task Granularity
• Gang --- block
• Worker – warp
• Vector – thread
Syntax for C
#pragma acc kernels loop gang(n) worker(m) vector(k)
#pragma acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
Syntax for Fortran
!$acc kernels loop gang(n) worker(m) vector(k)
!$acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
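As a hedged illustration (not from the slides), these clauses could be applied to a simple loop as follows; the loop body, array names and the values 100, 4 and 32 are assumptions:

// request 100 gangs, 4 workers per gang, and a vector length of 32
#pragma acc parallel loop num_gangs(100) num_workers(4) vector_length(32)
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];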
11
Levels of Parallelism
12
Key concepts
• Vector: threads work in SIMT fashion
Individual tasks that are executed in parallel on the GPU
Threads are organized into warps, which are groups of 32 threads each
All threads within a warp execute together in lockstep on an SM
• Worker: a group of threads (a warp) that can be scheduled and executed on a streaming multiprocessor (SM) within the GPU
• Gang: workers are organized into gangs; multiple gangs work independently
13
OpenACC
What are compiler directives?
• The directives tell the compiler or runtime to:
• Generate parallel code for the GPU
• Allocate GPU memory and copy input data
• Execute the parallel code on the GPU
• Copy output data back to the CPU and deallocate GPU memory
14
OpenACC Directive syntax
•C
#pragma acc directive [clause [[,] clause]…]
often followed by a structured code block
• Fortran
!$acc directive [clause [[,] clause]…]
often paired with a matching end directive surrounding a structured code block
!$acc end directive
15
OpenACC Syntax
•#pragma - Gives instructions to the compiler on how to compile the code
• acc - Informs the compiler that code is to be executed using OpenACC
• directives - Commands in OpenACC for altering our code
• clauses - Specifiers or additions to directives
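Putting these pieces together, an annotated example directive (a sketch; the clause value 32, the loop and the array names are assumptions):

/* #pragma       -> instruction to the compiler
   acc           -> marks an OpenACC construct
   parallel loop -> the directive
   num_gangs(32) -> a clause (32 is an assumed value) */
#pragma acc parallel loop num_gangs(32)
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];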
16
Execution Model
• The program runs on the host CPU
• The host offloads compute-intensive regions (kernels) and the related data to the accelerator (GPU)
• The compute kernels are executed by the GPU
17
PGI Compiler basics
• The command to compile C code is ‘pgcc’
• The command to compile C++ code is ‘pgc++’
• The command to compile Fortran code is ‘pgfortran’
$ pgcc main.c
$ pgc++ main.cpp
$ pgfortran main.f90
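To enable OpenACC code generation, the PGI compilers typically also need the -acc flag (and -Minfo=accel to print accelerator feedback); the file names are placeholders:

$ pgcc -acc -Minfo=accel main.c
$ pgc++ -acc -Minfo=accel main.cpp
$ pgfortran -acc -Minfo=accel main.f90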
18
OpenACC Directive for parallelism
Two different approaches for defining parallel regions
kernels
• Defines a region to be translated into a sequence of kernels executed in order on an accelerator
• Work-sharing parallelism is defined automatically for the separate kernels
parallel
• Defines a region to be executed on an accelerator
• Work-sharing parallelism has to be defined manually
With similar work sharing, both approaches can perform equally well
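As a sketch (the loop and array names are assumptions), the same loop expressed with each construct:

/* kernels: the compiler decides what is safe to parallelize */
#pragma acc kernels
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];

/* parallel loop: the programmer asserts the loop is safe to parallelize */
#pragma acc parallel loop
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];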
19
20
21
Example: Compute a*x + y, where x and y are vectors, and a is a scalar
22
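The code for this slide did not survive extraction; a minimal sketch consistent with the compiler output on the next slide (the array size 1000 is inferred from that output, the variable names are assumptions):

#define N 1000
float a = 2.0f, x[N], y[N];
/* ... initialize x and y ... */
#pragma acc kernels
for (int i = 0; i < N; i++)   /* y = a*x + y */
    y[i] = a * x[i] + y[i];
/* the compiler generates copyin(x[:N]) and copy(y[:N]) automatically */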
Analysis of the compiling output
$ pgcc -acc -Minfo=accel saxpy_array.c -o saxpy_array
main:
17, Generating copyin(x[:1000])
Generating copy(y[:1000])
19, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
19, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
• Accelerator kernel is generated.
• The loop computation is offloaded to (Tesla) GPU and is parallelized.
• The keywords copy and copyin are involved with data transfer.
• The keywords gang and vector are involved with task granularity.
23
Data dependency
• The loop is not parallelized if there is data dependency. For example,
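The example itself was an image and is not reproduced here; a typical loop-carried dependence (array names assumed) looks like:

/* each iteration reads the value written by the previous one,
   so the iterations cannot run in parallel */
#pragma acc kernels
for (int i = 1; i < n; i++)
    a[i] = a[i - 1] + b[i];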
• The compiling output (not reproduced here) reports that the loop cannot be parallelized because of the loop-carried dependence.
24
25
26
kernels vs. parallel loop
kernels
• More implicit.
• Gives the compiler more freedom to find and map parallelism.
• Compiler performs parallel analysis and parallelizes what it believes
safe.
parallel
• More explicit.
• Requires analysis by programmer to ensure safe parallelism
• Straightforward path from OpenMP
27
28
kernels vs. parallel (2)
• Parallelize a code block containing two loops:
kernels: generates two kernels; there is an implicit barrier between the two loops, so the second loop starts only after the first loop ends.
parallel: generates one kernel; there is no barrier between the two loops, so the second loop may start before the first loop ends. (This is different from OpenMP.)
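A sketch of such a block (loops and array names are assumptions): with kernels each loop becomes its own kernel, while with parallel both loops live inside a single kernel:

/* kernels: two kernels are generated, one per loop,
   with an implicit wait between them */
#pragma acc kernels
{
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
    for (int i = 0; i < n; i++)
        c[i] = b[i] + 1.0f;
}

/* parallel: one kernel; each loop needs its own loop directive
   and there is no barrier between the loops */
#pragma acc parallel
{
    #pragma acc loop
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
    #pragma acc loop
    for (int i = 0; i < n; i++)
        c[i] = b[i] + 1.0f;
}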
29
Clauses in kernel directives
Clauses in parallel directives
• if clause
Optional clause to decide if code should be executed on accelerator or host
• async clause
Specifies that the parallel or kernels region should be executed asynchronously
The host can use the async clause's integer expression to test or wait for completion with the wait directive
• num_gangs clause
Specifies the number of gangs that will be executed in the accelerator parallel
region
• num_workers clause
Specifies the number of workers within each gang for an accelerator parallel region
31
Clauses in parallel directive
• vector_length clause
Specifies the vector length to use for the vector or SIMD
operations within each worker of a gang
• private clause
A copy of each item on the list will be created for each gang
syntax: private(var1, var2, var3, ...)
• Helps avoid race conditions
• firstprivate clause
A copy of each item on the list will be created for each gang and
initialized with the value of the item in the host
32
Clauses in parallel directive
reduction clause
• Specifies a reduction operation to be performed across gangs using a private copy for each gang
• A private copy of the affected variable is generated for each loop iteration
• All of those private copies are then reduced into one final result, which is returned from the region
• Syntax
reduction(operator:variable)
33
Reduction clause
What is reduction and why is it necessary?
• In the given example, the variable dt can be modified by multiple workers (warps) simultaneously. This is called a data race condition.
• If a data race happens, an incorrect result may be returned.
• To avoid the data race, a reduction clause is required to protect the variable concerned.
• Fortunately, the compiler is often smart enough to create a reduction kernel and avoid the data race automatically.
• It is still a good habit to explicitly specify reduction operators and variables.
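The example the slide refers to was an image; a minimal sketch of a reduction on a variable named dt (the loop body and array names are assumptions) could be:

double dt = 0.0;
/* without the reduction clause, every gang/worker would update dt
   concurrently, causing a data race */
#pragma acc parallel loop reduction(+:dt)
for (int i = 0; i < n; i++)
    dt += a[i] * b[i];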
34
Clauses in parallel and kernel
directives
Syntax for C
#pragma acc kernels loop gang(n) worker(m) vector(k)
#pragma acc parallel loop num_gangs(n) num_workers(m)
vector_length(k)
Syntax for Fortran
!$acc kernels loop gang(n) worker(m) vector(k)
!$acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
Data Directive
Data Directive
Specified by:
#pragma acc data [clause [, clause]…] new-line
structured block
• Defines scalars, arrays and subarrays to be allocated in the accelerator
memory for the duration of the region
• Can be used to control whether data should be copied to or from the host
36
Data Directive clauses
Clauses for data directive
• if( condition )
• copy( list )
• copyin( list )
• copyout( list )
• create( list )
• present( list )
37
Data Directive clauses
copy(list):
Allocates memory on the GPU, copies data from the host to the GPU when entering the region, and copies data back to the host when exiting the region.
copyin(list):
Allocates memory on the GPU and copies data from the host to the GPU when entering the region.
copyout(list):
Allocates memory on the GPU and copies data back to the host when exiting the region.
create(list):
Allocates memory on the GPU but does not copy data to or from the device.
present(list):
The listed variables are already present on the device, so no allocation or copy is performed.
38
Data Directive clauses
• Syntax for C
#pragma acc data copy(a[0:size]) copyin(b[0:size]) copyout(c[0:size]) create(d[0:size]) present(d[0:size])
• Syntax for Fortran
!$acc data copy(a(0:size)) copyin(b(0:size)) copyout(c(0:size)) create(d(0:size)) present(d(0:size))
!$acc end data
39
Data Directive- Example
#include <stdio.h>
#define n 100   /* illustrative array size */
int main()
{
    int i, a[n], b[n], result[n];
    for (i = 0; i < n; i++)
    {
        a[i] = i + 1;
        b[i] = n - i;
    }
40
Data Directive- Example
#pragma acc data copyin(a, b) copyout(result)
{
#pragma acc parallel loop present(a, b, result)
for (i = 0; i < n; i++)
{
result[i] = a[i] + b[i];
} }
    for (i = 0; i < n; i++)
        printf("%d %d %d\t", a[i], b[i], result[i]);
    printf("\n");
    return 0;
}
41
Example
#pragma acc data copy(vecA, vecB, vecC)
{
#pragma acc parallel loop
for (i = 0; i < NX; i++)
{
vecC[i] = vecA[i] + vecB[i];
}
}
}   /* closes the enclosing function, which is not shown in this fragment */
42
Thank You
43