High Performance Computing for
Data Intensive and Complex AI Applications
OpenACC Programming
(Theory & Hands on)
by
Dr. S. Devi Mahalakshmi, Professor/CSE
Mepco Schlenk Engineering College, Sivakasi
[email protected]
1 August 2025 1
Index
• GPU overview
  - Introduction
  - GPU v/s CPU
  - Execution model
  - GPGPU overview
  - NVIDIA Tesla V100
  - GPGPU programming models
• OpenACC
  - Introduction
  - Execution model
  - Levels of parallelism
  - Directive syntax
  - Compute constructs
  - Loop and kernels
  - Data directives
• CUDA
  - Grids, blocks, threads
  - CUDA programs
2
GP-GPU
• General Purpose GPU (GPGPU): using a GPU for general-purpose scientific and engineering computing
• Processing of non-graphical entities
• Targets computationally intensive, data-parallel tasks to reduce execution time
3
GPU v/s CPU
Architectural Differences
[Figure: CPU vs GPU block diagrams showing control logic, ALU, cache and DRAM]
CPU:
- Fewer than 20 cores
- 1-2 threads per core
- Latency is hidden by large caches
GPU:
- 512 cores
- 10s to 100s of threads per core
- Latency is hidden by fast context switching
GPUs don’t run without CPUs
4
Execution model
5
Programming models for GPGPU
Low-level programming (maximum flexibility):
• CUDA – Compute Unified Device Architecture
• OpenCL – Open Computing Language
Directives (easily accelerate applications):
• OpenMP – Open Multiprocessing
• OpenACC – Open Accelerators
6
OpenACC
7
OpenACC
What is OpenACC?
• OpenACC (for Open Accelerators) is a programming standard for parallel computing on accelerators (mostly NVIDIA GPUs).
• It is designed to simplify GPU programming.
• The basic approach is to insert special comments (directives) into the code so as
to offload computation onto GPUs and parallelize the code at the level of GPU
(CUDA) cores.
• It is possible for programmers to create an efficient parallel OpenACC code with
only minor modifications to a serial CPU code.
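For illustration, a minimal sketch (not taken from the slides) of how a single directive offloads a serial loop; the size n and the arrays a and b are assumed names:

// serial loop offloaded to the GPU with one OpenACC directive
// (illustrative sketch: n, a and b are assumed names)
#pragma acc parallel loop
for (int i = 0; i < n; i++)
    b[i] = 2.0f * a[i];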
8
OpenACC
What is OpenACC?
• A set of compiler directives that allow code regions to be offloaded from a host CPU to an accelerator such as a GPU
• High level GPU programming
• Similar to OpenMP directives
• Works for Fortran, C, C++
• Portable across different platforms and compilers
• Compilers supported - PGI, CRAY, GCC
9
NVIDIA GPU (CUDA) Task Granularity
GPU device -- CUDA grids:
Kernels/grids are assigned to a device.
• Streaming Multiprocessor (SM) -- CUDA thread blocks:
Blocks are assigned to an SM.
• CUDA cores -- CUDA threads:
Threads are assigned to cores.
• Warp: a unit that consists of 32 threads.
• Blocks are divided into warps.
• The SM executes threads at warp granularity.
• The warp size may change in future GPU generations.
10
OpenACC Task Granularity
• Gang --- block
• Worker – warp
• Vector – thread
Syntax for C
#pragma acc kernels loop gang(n) worker(m) vector(k)
#pragma acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
Syntax for Fortran
!$acc kernels loop gang(n) worker(m) vector(k)
!$acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
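As a hedged illustration (not from the slides), these clauses could be applied to a simple loop as follows; the loop body, array names and the values 100, 4 and 32 are assumptions:

// request 100 gangs, 4 workers per gang, and a vector length of 32
#pragma acc parallel loop num_gangs(100) num_workers(4) vector_length(32)
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];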
11
Levels of Parallelism
12
Key concepts
• Vector: threads work in SIMT fashion
Individual tasks that are executed in parallel on the GPU
Threads are organized into warps, which are groups of 32 threads each
All threads within a warp execute together in lockstep on an SM
• Worker: a group of threads (a warp) that can be scheduled and executed on a streaming multiprocessor (SM) within the GPU
• Gang: workers are organized into gangs; multiple gangs work independently
13
OpenACC
What are compiler directives?
• The directives tell the compiler or runtime to:
• Generate parallel code for the GPU
• Allocate GPU memory and copy input data
• Execute the parallel code on the GPU
• Copy output data back to the CPU and deallocate GPU memory
14
OpenACC Directive syntax
•C
#pragma acc directive [clause [[,] clause]…]
often followed by a structured code block
• Fortran
!$acc directive [clause [[,] clause]…]
often paired with a matching end directive surrounding a structured code block
!$acc end directive
15
OpenACC Syntax
•#pragma - Gives instructions to the compiler on how to compile the code
• acc - Informs the compiler that code is to be executed using OpenACC
• directives - Commands in OpenACC for altering our code
• clauses - Specifiers or additions to directives
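Putting these pieces together, an annotated example directive (a sketch; the clause value 32, the loop and the array names are assumptions):

/* #pragma       -> instruction to the compiler
   acc           -> marks an OpenACC construct
   parallel loop -> the directive
   num_gangs(32) -> a clause (32 is an assumed value) */
#pragma acc parallel loop num_gangs(32)
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];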
16
Execution Model
• The program runs on the host CPU
• The host offloads compute-intensive regions (kernels) and the related data to the accelerator (GPU)
• The compute kernels are executed by the GPU
17
PGI Compiler basics
• The command to compile C code is ‘pgcc’
• The command to compile C++ code is ‘pgc++’
• The command to compile Fortran code is ‘pgfortran’
$ pgcc main.c
$ pgc++ main.cpp
$ pgfortran main.f90
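To enable OpenACC code generation, the PGI compilers typically also need the -acc flag (and -Minfo=accel to print accelerator feedback); the file names are placeholders:

$ pgcc -acc -Minfo=accel main.c
$ pgc++ -acc -Minfo=accel main.cpp
$ pgfortran -acc -Minfo=accel main.f90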
18
OpenACC Directive for parallelism
Two different approaches for defining parallel regions
kernels
• Defines a region to be translated into a sequence of kernels executed in order on an accelerator
• Work-sharing parallelism is defined automatically for the separate kernels
parallel
• Defines a region to be executed on an accelerator
• Work-sharing parallelism has to be defined manually
With similar work sharing, both approaches can perform equally well
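As a sketch (the loop and array names are assumptions), the same loop expressed with each construct:

/* kernels: the compiler decides what is safe to parallelize */
#pragma acc kernels
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];

/* parallel loop: the programmer asserts the loop is safe to parallelize */
#pragma acc parallel loop
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];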
19
20
21
Example: Compute a*x + y, where x and y are vectors, and a is a scalar
22
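The code for this slide did not survive extraction; a minimal sketch consistent with the compiler output on the next slide (the array size 1000 is inferred from that output, the variable names are assumptions):

#define N 1000
float a = 2.0f, x[N], y[N];
/* ... initialize x and y ... */
#pragma acc kernels
for (int i = 0; i < N; i++)   /* y = a*x + y */
    y[i] = a * x[i] + y[i];
/* the compiler generates copyin(x[:N]) and copy(y[:N]) automatically */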
Analysis of the compiling output
$ pgcc -acc -Minfo=accel saxpy_array.c -o saxpy_array
main:
17, Generating copyin(x[:1000])
Generating copy(y[:1000])
19, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
19, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
• Accelerator kernel is generated.
• The loop computation is offloaded to (Tesla) GPU and is parallelized.
• The keywords copy and copyin are involved with data transfer.
• The keywords gang and vector are involved with task granularity.
23
Data dependency
• The loop is not parallelized if there is data dependency. For example,
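The example itself was an image and is not reproduced here; a typical loop-carried dependence (array names assumed) looks like:

/* each iteration reads the value written by the previous one,
   so the iterations cannot run in parallel */
#pragma acc kernels
for (int i = 1; i < n; i++)
    a[i] = a[i - 1] + b[i];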
• The compiling output (not reproduced here) reports that the loop cannot be parallelized because of the loop-carried dependence.
24
25
26
kernels vs. parallel loop
kernels
• More implicit.
• Gives the compiler more freedom to find and map parallelism.
• Compiler performs parallel analysis and parallelizes what it believes
safe.
parallel
• More explicit.
• Requires analysis by programmer to ensure safe parallelism
• Straightforward path from OpenMP
27
28
kernels vs. parallel (2)
• Parallelize a code block containing two loops:
kernels: generates two kernels; there is an implicit barrier between the two loops, so the second loop starts only after the first loop ends.
parallel: generates one kernel; there is no barrier between the two loops, so the second loop may start before the first loop ends. (This is different from OpenMP.)
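A sketch of such a block (loops and array names are assumptions): with kernels each loop becomes its own kernel, while with parallel both loops live inside a single kernel:

/* kernels: two kernels are generated, one per loop,
   with an implicit wait between them */
#pragma acc kernels
{
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
    for (int i = 0; i < n; i++)
        c[i] = b[i] + 1.0f;
}

/* parallel: one kernel; each loop needs its own loop directive
   and there is no barrier between the loops */
#pragma acc parallel
{
    #pragma acc loop
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
    #pragma acc loop
    for (int i = 0; i < n; i++)
        c[i] = b[i] + 1.0f;
}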
29
Clauses in kernel directives
Clauses in parallel directives
• if clause
Optional clause to decide if code should be executed on accelerator or host
• async clause
Specifies that the parallel or kernels region should be executed asynchronously
The host can use the async clause's integer expression to test or wait for completion with the wait directive
• num_gangs clause
Specifies the number of gangs that will be executed in the accelerator parallel
region
• num_workers clause
Specifies the number of workers within each gang for an accelerator parallel region
31
Clauses in parallel directive
• vector_length clause
Specifies the vector length to use for the vector or SIMD
operations within each worker of a gang
• private clause
A copy of each item on the list will be created for each gang
syntax: private(var1, var2, var3, ...)
• Helps avoid race conditions
• firstprivate clause
A copy of each item on the list will be created for each gang and
initialized with the value of the item in the host
32
Clauses in parallel directive
reduction clause
• Specifies a reduction operation to be performed across gangs using a private copy for each gang
• A private copy of the affected variable is generated for each loop iteration
• All of those private copies are then reduced into one final result, which is returned from the region
• Syntax
reduction(operator:variable)
33
Reduction clause
What is reduction and why is it necessary?
• In the given example, the variable dt can be modified by multiple workers (warps) simultaneously. This is called a data race condition.
• If a data race happens, an incorrect result may be returned.
• To avoid the data race, a reduction clause is required to protect the variable concerned.
• Fortunately, the compiler is often smart enough to create a reduction kernel and avoid the data race automatically.
• It is still a good habit to explicitly specify reduction operators and variables.
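The example the slide refers to was an image; a minimal sketch of a reduction on a variable named dt (the loop body and array names are assumptions) could be:

double dt = 0.0;
/* without the reduction clause, every gang/worker would update dt
   concurrently, causing a data race */
#pragma acc parallel loop reduction(+:dt)
for (int i = 0; i < n; i++)
    dt += a[i] * b[i];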
34
Clauses in parallel and kernel
directives
Syntax for C
#pragma acc kernels loop gang(n) worker(m) vector(k)
#pragma acc parallel loop num_gangs(n) num_workers(m)
vector_length(k)
Syntax for Fortran
!$acc kernels loop gang(n) worker(m) vector(k)
!$acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
Data Directive
Data Directive
Specified by:
#pragma acc data [clause [, clause]…] new-line
structured block
• Defines scalars, arrays and subarrays to be allocated in the accelerator
memory for the duration of the region
• Can be used to control whether data should be copied to or from the host
36
Data Directive clauses
Clauses for data directive
• if( condition )
• copy( list )
• copyin( list )
• copyout( list )
• create( list )
• present( list )
37
Data Directive clauses
copy(list):
Allocates memory on the GPU, copies data from the host to the GPU when entering the region, and copies data back to the host when exiting the region.
copyin(list):
Allocates memory on the GPU and copies data from the host to the GPU when entering the region.
copyout(list):
Allocates memory on the GPU and copies data back to the host when exiting the region.
create(list):
Allocates memory on the GPU but does not copy data to or from the device.
present(list):
The listed variables are already present on the device, so no allocation or copy is performed.
38
Data Directive clauses
• Syntax for C
#pragma acc data copy(a[0:size]) copyin(b[0:size]) copyout(c[0:size]) create(d[0:size]) present(d[0:size])
• Syntax for Fortran
!$acc data copy(a(0:size)) copyin(b(0:size)) copyout(c(0:size)) create(d(0:size)) present(d(0:size))
!$acc end data
39
Data Directive- Example
#include <stdio.h>
#define n 100   /* illustrative array size */
int main()
{
    int i, a[n], b[n], result[n];
    for (i = 0; i < n; i++)
    {
        a[i] = i + 1;
        b[i] = n - i;
    }
40
Data Directive- Example
#pragma acc data copyin(a, b) copyout(result)
{
#pragma acc parallel loop present(a, b, result)
for (i = 0; i < n; i++)
{
result[i] = a[i] + b[i];
} }
    for (i = 0; i < n; i++)
        printf("%d %d %d\t", a[i], b[i], result[i]);
    printf("\n");
    return 0;
}
41
Example
#pragma acc data copy(vecA, vecB, vecC)
{
#pragma acc parallel loop
for (i = 0; i < NX; i++)
{
vecC[i] = vecA[i] + vecB[i];
}
}
}   /* closes the enclosing function, which is not shown in this fragment */
42
Thank You
43