
High Performance Computing for
Data Intensive and Complex AI Applications

OpenACC Programming
(Theory & Hands-on)
by
Dr. S. Devi Mahalakshmi, Professor/CSE
Mepco Schlenk Engineering College, Sivakasi
[email protected]

1 August 2025 1
Index
• • OpenACC • CUDA
GPU overview
• Introduction
• Introduction
• GPU v/s CPU
• Execution model
• Execution model
• GPGPU overview • Levels of
• CUDA-
parallelism Grids,Blocks,Threa
• NVIDIA tesla V100
• Directive syntax ds
• GPGPU-Programming • Compute • Cuda Programs
models constructs
• Loop and Kernels

• Data directives

2
GP-GPU
• General Purpose GPU (GPGPU) - using the GPU for general-purpose scientific and engineering computing
• Processing of non-graphical entities
• Computationally intensive, data-parallel tasks, executed to reduce time to solution

3
GPU v/s CPU: Architectural Differences

[Figure: CPU vs GPU block diagrams showing ALUs, control logic, cache and DRAM]

CPU:
• Less than 20 cores
• 1-2 threads per core
• Latency is hidden by large cache
GPU:
• 512 cores
• 10s to 100s of threads per core
• Latency is hidden by fast context switching

GPUs don’t run without CPUs

4
Execution model

5
Programming models for GPGPU
Low-level programming (maximum flexibility):
• CUDA – Compute Unified Device Architecture
• OpenCL – Open Computing Language
Directives (easily accelerate applications):
• OpenMP – Open Multiprocessing
• OpenACC – Open Accelerators

6
OpenACC

7
OpenACC

What is OpenACC?
• OpenACC (for Open Accelerators) is a programming standard for parallel
computing on accelerators (mostly NVIDIA GPUs).
• It is designed to simplify GPU programming.
• The basic approach is to insert special comments (directives) into the code so as
to offload computation onto GPUs and parallelize the code at the level of GPU
(CUDA) cores.
• It is possible for programmers to create an efficient parallel OpenACC code with
only minor modifications to a serial CPU code.

8
OpenACC
What is OpenACC?
• A set of compiler directives that allow code regions to be offloaded
from a host CPU to a GPU
• High level GPU programming
• Similar to OpenMP directives
• Works for Fortran, C, C++
• Portable across different platforms and compilers
• Compilers supported - PGI, CRAY, GCC

9
NVIDIA GPU (CUDA) Task Granularity
GPU device -- CUDA grids:
Kennels/grids are assigned to a device.
• Streaming Multiprocessor (SM) -- CUDA thread blocks:
Blocks are assigned to a SM.
• CUDA cores -- CUDA threads:
Threads are assigned to a core.
• Warp: a unit that consists of 32 threads.
• Blocks are divided into warps.
• The SM executes threads at warp granularity.
• The warp size can be changed in the future.
10
OpenACC Task Granularity
• Gang --- block
• Worker – warp
• Vector – thread

Syntax for C
#pragma acc kernels loop gang(n) worker(m) vector(k)
#pragma acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
Syntax for Fortran
!$acc kernels loop gang(n) worker(m) vector(k)
!$acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
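For example (the sizes below are illustrative assumptions, not values from the slides), a loop can be mapped explicitly onto gangs, workers and vector lanes:

void scale(int n, float a, float *x)
{
    /* gang ~ thread block, worker ~ warp, vector lane ~ thread; sizes are illustrative */
    #pragma acc parallel loop num_gangs(256) num_workers(4) vector_length(32)
    for (int i = 0; i < n; i++)
        x[i] = a * x[i];
}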
11
Levels of Parallelism

12
Key concepts
• Vector: threads work in SIMT fashion
 Individual tasks that are executed in parallel on the GPU
 Threads are organized into warps, which are groups of 32 threads each
 All threads within a warp execute together in lockstep
• Worker: a group of threads that can be scheduled and executed on a streaming
multiprocessor (SM) within the GPU
• Gang: workers are organized into gangs; multiple gangs work independently

13
OpenACC

What are compiler directives?

• The directives tell the compiler or runtime to:
• Generate parallel code for the GPU
• Allocate GPU memory and copy input data
• Execute the parallel code on the GPU
• Copy output data back to the CPU and deallocate GPU memory (all four steps are sketched below)
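A minimal sketch (function and array names are illustrative) of how a single directive triggers all four of those steps:

void vec_add(int n, const float *a, const float *b, float *c)
{
    /* copyin: allocate a, b on the GPU and copy the input data in
       copyout: allocate c on the GPU and copy the result back, then free GPU memory
       the loop itself is compiled into a parallel GPU kernel */
    #pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}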

14
OpenACC Directive syntax
• C
#pragma acc directive [clause [,] clause] …]…
often followed by a structured code block
• Fortran
!$acc directive [clause [,] clause] …]...
often paired with a matching end directive surrounding a structured
code block
!$acc end directive
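A concrete instance in C (the clause choice and names are illustrative): the directive and its clauses apply to the structured block, here the for loop, that follows.

void double_vec(int n, const float *x, float *y)
{
    /* directive + clauses apply to the loop that follows */
    #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = 2.0f * x[i];
}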

15
OpenACC Syntax
• #pragma - Gives instructions to the compiler on how to compile the code
• acc - Informs the compiler that the code is to be executed using OpenACC
• directives - Commands in OpenACC for altering our code
• clauses - Specifiers or additions to directives

16
Execution Model
• Program runs on the host CPU
• Host offloads compute-intensive regions (kernels) and related data to the accelerator GPU
• Compute kernels are executed by the GPU
17
PGI Compiler basics
• The command to compile C code is ‘pgcc’

• The command to compile C++ code is ‘pgc++’

• The command to compile Fortran code is ‘pgfortran’

$ pgcc main.c

$ pgc++ main.cpp

$ pgfortran main.f90
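To enable OpenACC offloading and print what the compiler did with each loop, the PGI compilers also accept the -acc and -Minfo=accel options (a typical invocation, not taken from the slide):

$ pgcc -acc -Minfo=accel main.c

$ pgfortran -acc -Minfo=accel main.f90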

18
OpenACC Directives for parallelism

Two different approaches for defining parallel regions

kernels
• Defines a region to be converted into a sequence of kernels to be executed in
order on the accelerator
• Work-sharing parallelism is defined automatically for the separate kernels
parallel
• Defines a region to be executed on the accelerator
• Work-sharing parallelism has to be defined manually

With similar work sharing, both approaches can perform equally well
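A minimal sketch of the same loop written both ways (the vector-add body and function names are illustrative):

void add_kernels(int n, const float *x, float *y)
{
    /* kernels: the compiler analyses the region and decides what to parallelize */
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

void add_parallel(int n, const float *x, float *y)
{
    /* parallel loop: the programmer asserts that the loop is safe to parallelize */
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}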


19
20
21
Example: Compute a*x + y, where x and y are vectors, and a is a scalar
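A minimal sketch of such a program (the array size of 1000 and the copyin/copy clauses match the compiler output on the next slide; the initial values and the print statement are assumptions):

#include <stdio.h>
#define N 1000

int main(void)
{
    float a = 2.0f, x[N], y[N];
    int i;
    for (i = 0; i < N; i++) { x[i] = i; y[i] = 1.0f; }

    /* Offload the loop; x is only read (copyin), y is read and written (copy) */
    #pragma acc kernels copyin(x[0:N]) copy(y[0:N])
    for (i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f, y[N-1] = %f\n", y[0], y[N - 1]);
    return 0;
}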

22
Analysis of the compiling output
$ pgcc -acc -Minfo=accel saxpy_array.c -o saxpy_array
main:
17, Generating copyin(x[:1000])
    Generating copy(y[:1000])
19, Loop is parallelizable
    Accelerator kernel generated
    Generating Tesla code
    19, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

• An accelerator kernel is generated.
• The loop computation is offloaded to the (Tesla) GPU and is parallelized.
• The keywords copy and copyin concern data transfer.
• The keywords gang and vector concern task granularity.
23
Data dependency
• The loop is not parallelized if there is a data dependency between iterations. For example:
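A sketch of such a loop (illustrative code, not the slide's original listing; the function name is hypothetical): each iteration needs the result of the previous one, so the iterations cannot run concurrently, and the compiler typically reports a loop-carried dependence instead of generating a parallel kernel.

void running_sum(int n, const float *x, float *y)
{
    #pragma acc kernels
    for (int i = 1; i < n; i++)
        y[i] = y[i - 1] + x[i];   /* loop-carried dependence on y[i-1] */
}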

• The compiling output:

24
25
26
Kernel vs. parallel loop
kernels
• More implicit.
• Gives the compiler more freedom to find and map parallelism.
• Compiler performs parallel analysis and parallelizes what it believes
safe.
parallel
• More explicit.
• Requires analysis by the programmer to ensure safe parallelism
• Straightforward path from OpenMP
27
28
kernels vs. parallel (2)
• Parallelize a code block with two loops (see the sketch below):
parallel
 Generates one kernel
 There is no barrier between the two loops: the second loop may start before the first loop ends. (This is different from OpenMP.)
kernels
 Generates two kernels
 There is an implicit barrier between the two loops: the second loop will start after the first loop ends.
29
Clauses in kernels and parallel directives
• if clause
Optional clause that decides whether the code should be executed on the accelerator or on the host
• async clause
Specifies that the parallel or kernels region should be executed asynchronously
The host can later test or wait for completion (with the wait directive) using the integer expression given in the async clause
• num_gangs clause
Specifies the number of gangs that will execute the accelerator parallel region
• num_workers clause
Specifies the number of workers within each gang for an accelerator parallel region
31
Clauses in parallel directive
• vector_length clause
Specifies the vector length to use for the vector or SIMD operations within each worker of a gang
• private clause
A copy of each item in the list will be created for each gang
Syntax: private(var1, var2, var3, ...)
Avoids race conditions (see the sketch below)
• firstprivate clause
A copy of each item in the list will be created for each gang and initialized with the value of the item on the host
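An illustrative sketch of private (the temporary variable and function name are assumptions):

void square_plus_one(int n, const float *x, float *y)
{
    float tmp;
    /* each gang/worker executing the loop gets its own copy of tmp, so there is no race on it */
    #pragma acc parallel loop private(tmp)
    for (int i = 0; i < n; i++) {
        tmp  = x[i] * x[i];
        y[i] = tmp + 1.0f;
    }
}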
32
Clauses in parallel directive
reduction clause
• Specifies a reduction operation to be performed across gangs using a
private copy for each gang
• A private copy of the affected variable is generated for each loop
iteration
• All of those private copies are reduced into one final result, which is
returned from the region
• Syntax
reduction(operator:variable)
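A minimal sketch of a sum reduction, here a dot product (variable and function names are illustrative):

float dot(int n, const float *x, const float *y)
{
    float sum = 0.0f;
    /* each gang keeps a private partial sum; the partials are combined when the region ends */
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}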

33
Reduction clause
What is reduction and why is it necessary?
• In the given example, the variable dt can be modified by multiple workers
(warps) simultaneously. This is called a data race condition.
• If a data race happens, an incorrect result may be returned.
• To avoid a data race, a reduction clause is required to protect the affected
variable.
• Fortunately, the compiler is often smart enough to create a reduction kernel and
avoid the data race automatically.
• Still, it is a good habit to explicitly specify reduction operators and variables.

34
Clauses in parallel and kernel directives
Syntax for C
#pragma acc kernels loop gang(n) worker(m) vector(k)
#pragma acc parallel loop num_gangs(n) num_workers(m) vector_length(k)

Syntax for Fortran
!$acc kernels loop gang(n) worker(m) vector(k)
!$acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
Data Directive
Specified by:
#pragma acc data [clause [, clause]...] new-line
structured block

• Defines scalars, arrays and subarrays to be allocated in the accelerator memory for the duration of the region
• Can be used to control if data should be copied in or out from the host

36
Data Directive clauses
Clauses for data directive
• if( condition )
• copy( list )
• copyin( list )
• copyout( list )
• create( list )
• present( list )

37
Data Directive clauses
copy(list):
Allocates memory on the GPU, copies data from the host to the GPU when entering the
region, and copies data back to the host when exiting the region.
copyin(list):
Allocates memory on the GPU and copies data from the host to the GPU when entering the
region.
copyout(list):
Allocates memory on the GPU and copies data back to the host when exiting the region.
create(list):
Allocates memory on the GPU but does not copy data to or from the device.
present(list):
The listed variables are already present on the device, so no allocation or copy is performed.
38
Data Directive clauses
• Syntax for C
#pragma acc data copy(a[0:size]) copyin(b[0:size]) copyout(c[0:size]) create(d[0:size]) present(e[0:size])

• Syntax for Fortran
!$acc data copy(a(0:size)) copyin(b(0:size)) copyout(c(0:size)) create(d(0:size)) present(e(0:size))
!$acc end data

39
Data Directive - Example
#include <stdio.h>
#define n 100   /* assumed array size; the value was not given on the slide */

int main()
{
    int i, a[n], b[n], result[n];
    for (i = 0; i < n; i++)
    {
        a[i] = i + 1;
        b[i] = n - i;
    }

40
Data Directive - Example
#pragma acc data copyin(a, b) copyout(result)
{
    #pragma acc parallel loop present(a, b, result)
    for (i = 0; i < n; i++)
    {
        result[i] = a[i] + b[i];
    }
}
for (i = 0; i < n; i++)
    printf("%d %d %d\t", a[i], b[i], result[i]);
printf("\n");
return 0;
}
41
Example
#pragma acc data copy(vecA, vecB, vecC)
{
    #pragma acc parallel loop
    for (i = 0; i < NX; i++)
    {
        vecC[i] = vecA[i] + vecB[i];
    }
}
42
Thank You

43
