CUDA for Parallel Computing Experts

This document summarizes CUDA tricks and specialized libraries. It discusses parallel scan algorithms like prefix sum that are efficient on GPUs. These scans can be used to build applications involving sorting, sparse matrix operations, and stream compaction. The document also introduces libraries like CUDPP, CUFFT, and CUBLAS that provide common primitive operations and linear algebra routines optimized for CUDA programming.

Uploaded by

Luis Carlos


CUDA Tricks

Presented by
Damodaran Ramani
Synopsis
• Scan Algorithm
• Applications
• Specialized Libraries
  • CUDPP: CUDA Data Parallel Primitives Library
  • Thrust: a Template Library for CUDA Applications
  • CUDA FFT and BLAS libraries for the GPU

References
• Scan Primitives for GPU Computing. Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens
• Presentation on scan primitives by Gary J. Katz, based on the article "Parallel Prefix Sum (Scan) with CUDA" by Harris, Sengupta, and Owens (GPU Gems 3, Chapter 39)
Introduction
• GPUs are massively parallel processors
• Programmable parts of the graphics pipeline operate on primitives (vertices, fragments)
• These primitive programs spawn a thread for each primitive to keep the parallel processors full
• Stream programming model (particle systems, image processing, grid-based fluid simulations, and dense matrix algebra)
• A fragment program operating on n fragments makes O(n) memory accesses
• Problems arise when access requirements are complex (e.g., prefix sum: O(n^2))
Prefix-Sum Example
• in: 3 1 7 0 4 1 6 3
• out: 0 3 4 11 11 15 16 22
Trivial Sequential Implementation
/* Exclusive prefix sum: out[i] holds the sum of in[0..i-1]. Assumes n >= 1. */
void scan(int* in, int* out, int n)
{
    out[0] = 0;
    for (int i = 1; i < n; i++)
        out[i] = in[i-1] + out[i-1];
}
Scan: An Efficient Parallel Primitive
• Interested in finding efficient solutions to parallel problems in which each output requires global knowledge of the inputs.
• Why CUDA? (General load-store memory architecture, on-chip shared memory, thread synchronization)
Threads & Blocks
• GeForce 8800 GTX (16 multiprocessors, 8 processors each)
• CUDA structures GPU programs into parallel thread blocks of up to 512 SIMD-parallel threads.
• Programmers specify the number of thread blocks and threads per block, and the hardware and drivers map thread blocks to parallel multiprocessors on the GPU.
• Within a thread block, threads can communicate through shared memory and cooperate through synchronization.
• Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks (complex programming, large performance benefits).
The Scan Operator
• Definition:
• The scan operation takes a binary associative operator $\oplus$ with identity $I$, and an array of $n$ elements
$[a_0, a_1, \ldots, a_{n-1}]$
and returns the array
$[I, a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-2})]$

Types: inclusive, exclusive, forward, backward
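The definition and the scan types can be illustrated with a minimal sequential sketch in C. The names `scan_op`, `op_add`, `op_max`, and the three scan functions are hypothetical and chosen only for this illustration; they come from neither the slides nor any library.

```c
#include <assert.h>
#include <stddef.h>

/* A binary associative operator, e.g. addition or maximum (illustrative type). */
typedef int (*scan_op)(int, int);

int op_add(int a, int b) { return a + b; }
int op_max(int a, int b) { return a > b ? a : b; }

/* Exclusive forward scan: out[i] = identity (+) in[0] (+) ... (+) in[i-1]. */
void scan_exclusive(const int *in, int *out, size_t n, scan_op op, int identity)
{
    assert(op != NULL);
    int acc = identity;
    for (size_t i = 0; i < n; i++) {
        out[i] = acc;            /* write before folding in in[i] */
        acc = op(acc, in[i]);
    }
}

/* Inclusive forward scan: out[i] = in[0] (+) ... (+) in[i]. */
void scan_inclusive(const int *in, int *out, size_t n, scan_op op, int identity)
{
    assert(op != NULL);
    int acc = identity;
    for (size_t i = 0; i < n; i++) {
        acc = op(acc, in[i]);    /* fold in in[i] before writing */
        out[i] = acc;
    }
}

/* Exclusive backward scan: out[i] = in[i+1] (+) ... (+) in[n-1] (+) identity. */
void scan_exclusive_backward(const int *in, int *out, size_t n, scan_op op, int identity)
{
    assert(op != NULL);
    int acc = identity;
    for (size_t i = n; i-- > 0; ) {
        out[i] = acc;
        acc = op(in[i], acc);
    }
}
```

Calling `scan_exclusive` with `op_add` and identity 0 on the input 3 1 7 0 4 1 6 3 reproduces the prefix-sum example; the inclusive variant shifts every result one position to the left and appends the total.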

Parallel Scan
for(d = 1; d <= log2(n); d++)
    for all k in parallel
        if (k >= 2^(d-1))
            x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
        else
            x[out][k] = x[in][k]

Complexity O(n log2 n)
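The double-buffered loop above can be simulated on the CPU as a rough sketch. The function name `naive_scan_inclusive` is hypothetical, each inner-loop iteration stands in for one parallel thread k, and the variable `offset` plays the role of 2^(d-1). Written this way, the algorithm yields an inclusive scan; an exclusive scan would follow by shifting the result right by one and inserting the identity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Naive O(n log n) inclusive scan, simulated sequentially.
 * Returns 0 on success, -1 if scratch memory cannot be allocated. */
int naive_scan_inclusive(int *x, size_t n)
{
    assert(x != NULL || n == 0);
    if (n == 0)
        return 0;

    int *in = malloc(n * sizeof *in);
    int *out = malloc(n * sizeof *out);
    if (in == NULL || out == NULL) {
        free(in);
        free(out);
        return -1;
    }
    memcpy(in, x, n * sizeof *in);

    /* offset = 2^(d-1) for d = 1, 2, ... while offset < n */
    for (size_t offset = 1; offset < n; offset *= 2) {
        /* "for all k in parallel": every k reads only from `in` */
        for (size_t k = 0; k < n; k++)
            out[k] = (k >= offset) ? in[k - offset] + in[k] : in[k];

        /* swap the roles of the two buffers for the next step */
        int *tmp = in;
        in = out;
        out = tmp;
    }

    memcpy(x, in, n * sizeof *in);
    free(in);
    free(out);
    return 0;
}
```

Two buffers are used because, on a GPU, a thread must not overwrite a value that another thread still needs to read in the same step.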
A Work-Efficient Parallel Scan
• Goal is a parallel scan that is O(n) instead of O(n log2 n)
• Solution:
  • Balanced trees: build a binary tree on the input data and sweep it to and from the root.
  A binary tree with n leaves has d = log2 n levels, and each level d has 2^d nodes.
  One add is performed per node, therefore O(n) adds on a single traversal of the tree.
O(n) unsegmented scan
• Reduce/Up-Sweep
for(d = 0; d <= log2(n) - 1; d++)
    for all k = 0; k < n-1; k += 2^(d+1) in parallel
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

• Down-Sweep
x[n-1] = 0;
for(d = log2(n) - 1; d >= 0; d--)
    for all k = 0; k < n-1; k += 2^(d+1) in parallel
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
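The two sweeps can be checked with a small sequential simulation in C. This is a sketch under the assumption that n is a power of two; the function name `blelloch_scan_exclusive` is hypothetical, and `stride` stands for 2^(d+1). The inner loops are the parts that would run in parallel on the GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Work-efficient exclusive scan, in place. Assumes n is a power of two. */
void blelloch_scan_exclusive(int *x, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);

    /* Up-sweep (reduce): build partial sums up the tree. */
    for (size_t stride = 2; stride <= n; stride *= 2) {
        for (size_t k = 0; k < n; k += stride)
            x[k + stride - 1] += x[k + stride / 2 - 1];
    }

    /* Clear the root before sweeping back down. */
    x[n - 1] = 0;

    /* Down-sweep: pass each node's value to its left child and
     * the sum of that value and the old left child to its right child. */
    for (size_t stride = n; stride >= 2; stride /= 2) {
        for (size_t k = 0; k < n; k += stride) {
            int t = x[k + stride / 2 - 1];
            x[k + stride / 2 - 1] = x[k + stride - 1];
            x[k + stride - 1] += t;
        }
    }
}
```

Each sweep performs n - 1 additions in total, which is where the O(n) work bound comes from.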
Tree analogy (down-sweep on 8 elements)

After up-sweep:   x0  ∑(x0..x1)  x2  ∑(x0..x3)  x4  ∑(x4..x5)  x6  ∑(x0..x7)
Set last to 0:    x0  ∑(x0..x1)  x2  ∑(x0..x3)  x4  ∑(x4..x5)  x6  0
After d = 2:      x0  ∑(x0..x1)  x2  0  x4  ∑(x4..x5)  x6  ∑(x0..x3)
After d = 1:      x0  0  x2  ∑(x0..x1)  x4  ∑(x0..x3)  x6  ∑(x0..x5)
After d = 0:      0  x0  ∑(x0..x1)  ∑(x0..x2)  ∑(x0..x3)  ∑(x0..x4)  ∑(x0..x5)  ∑(x0..x6)
O(n) Segmented Scan
• Up-Sweep
• Down-Sweep
Features of segmented scan
• 3 times slower than unsegmented scan
• Useful for building a broad variety of applications which are not possible with unsegmented scan.
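The slides do not spell out what a segmented scan computes, so the following is only a sequential reference sketch in C. It assumes the common head-flag representation, in which a nonzero flag marks the first element of each segment. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Segmented exclusive sum scan (sequential reference).
 * head[i] != 0 marks the start of a new segment (assumed representation);
 * the running sum restarts from 0 at every segment head. */
void segmented_scan_exclusive(const int *in, const unsigned char *head,
                              int *out, size_t n)
{
    assert(n == 0 || (in != NULL && head != NULL && out != NULL));
    int acc = 0;
    for (size_t i = 0; i < n; i++) {
        if (head[i])
            acc = 0;        /* new segment: reset to the identity */
        out[i] = acc;
        acc += in[i];
    }
}
```

With segments [3 1 7] [0 4 1] [6 3], the result is [0 3 4] [0 0 4] [0 6]: one independent exclusive scan per segment, computed in a single pass over the whole array.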
Primitives built on scan
• Enumerate
  • enumerate([t f f t f t t]) = [0 1 1 1 2 2 3]
  • Exclusive scan of input vector
• Distribute (copy)
  • distribute([a b c][d e]) = [a a a][d d]
  • Inclusive scan of input vector
• Split and split-and-segment
  Split divides the input vector into two pieces, with all the elements marked false on the left side of the output vector and all the elements marked true on the right.
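Enumerate and split can be sketched sequentially in C as follows. The function names and signatures are hypothetical. The running counters in `split` are the values an exclusive scan of the flags would produce, here computed on the fly, and the sketch assumes split should keep the relative order within each half.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* enumerate: idx[i] = number of true flags before position i
 * (an exclusive scan of the flags). Returns the total number of trues. */
size_t enumerate(const bool *flags, size_t *idx, size_t n)
{
    assert(n == 0 || (flags != NULL && idx != NULL));
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        idx[i] = count;
        count += flags[i] ? 1 : 0;
    }
    return count;
}

/* split: elements flagged false go to the left of out, elements flagged
 * true go to the right; order within each half is preserved. */
void split(const int *in, const bool *flags, int *out, size_t n)
{
    assert(n == 0 || (in != NULL && flags != NULL && out != NULL));
    size_t falses = 0;
    for (size_t i = 0; i < n; i++)
        falses += flags[i] ? 0 : 1;

    size_t next_false = 0;       /* scan of the false flags */
    size_t next_true = falses;   /* trues start after all falses */
    for (size_t i = 0; i < n; i++) {
        if (flags[i])
            out[next_true++] = in[i];
        else
            out[next_false++] = in[i];
    }
}
```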
Applications
• Quicksort
• Sparse Matrix-Vector Multiply
• Tridiagonal Matrix Solvers and Fluid Simulation
• Radix Sort
• Stream Compaction
• Summed-Area Tables
Quicksort
Sparse Matrix-Vector Multiplication
Stream Compaction
Definition:
• Extracts the elements of interest from an array and places them contiguously in a new array
• Uses:
  • Collision Detection
  • Sparse Matrix Compression
Input:  A B A D D E C F B
Output: A B A C B
Stream Compaction
Input (we want to preserve the gray elements):  A B A D D E C F B
Set a '1' for each gray input:                  1 1 1 0 0 0 1 0 1
Scan:                                           0 1 2 3 3 3 3 4 4
Scatter gray inputs to output using the scan result as scatter address:
Output:   A B A C B
Address:  0 1 2 3 4
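The flag, scan, and scatter steps can be written as a short sequential sketch in C. The function `compact` and its parameters are hypothetical; the caller supplies the `addr` scratch array so that the scan result stays visible as a separate step.

```c
#include <assert.h>
#include <stddef.h>

/* Stream compaction in two phases (sequential sketch).
 * keep[i] != 0 marks the elements to preserve.
 * addr must have room for n entries; out must have room for all kept elements.
 * Returns the number of elements written to out. */
size_t compact(const char *in, const unsigned char *keep,
               size_t *addr, char *out, size_t n)
{
    assert(n == 0 || (in != NULL && keep != NULL && addr != NULL && out != NULL));

    /* Phase 1: exclusive scan of the flags gives each kept element its address. */
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        addr[i] = total;
        total += keep[i] ? 1 : 0;
    }

    /* Phase 2: scatter the kept elements to their scan addresses. */
    for (size_t i = 0; i < n; i++) {
        if (keep[i])
            out[addr[i]] = in[i];
    }
    return total;
}
```

On the GPU both phases are data-parallel: the scan is the primitive discussed earlier, and every scatter write goes to a distinct address, so no two threads conflict.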
Radix Sort Using Scan
Input array:                              100 111 010 110 011 101 001 000
b = least significant bit:                0   1   0   0   1   1   1   0
e = insert a 1 for all false sort keys:   1   0   1   1   0   0   0   1
f = scan the 1s:                          0   1   1   2   3   3   3   3

Total Falses = e[n-1] + f[n-1] = 1 + 3 = 4

t = index - f + Total Falses:             4   4   5   5   5   6   7   8
d = b ? t : f:                            0   4   1   2   5   6   7   3

Scatter input using d as scatter address:
Output array:                             100 010 110 000 111 011 101 001
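One pass of this scheme, and a full sort built from repeated passes, might look like the following sequential sketch in C. The function names `split_by_bit` and `radix_sort` are hypothetical; the variables `f`, `t`, and `d` follow the table above. The full sort processes bits from least to most significant and relies on each pass being stable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One split pass: keys whose selected bit is 0 go first, then keys whose
 * bit is 1, each group in its original order. f is scratch space of size n. */
void split_by_bit(const unsigned *in, unsigned *out, size_t *f,
                  size_t n, unsigned bit)
{
    /* e[i] = 1 where the bit is 0; f = exclusive scan of e. */
    size_t total_falses = 0;
    for (size_t i = 0; i < n; i++) {
        f[i] = total_falses;
        total_falses += ((in[i] >> bit) & 1u) ? 0 : 1;
    }

    for (size_t i = 0; i < n; i++) {
        unsigned b = (in[i] >> bit) & 1u;
        size_t t = i - f[i] + total_falses;  /* address if the bit is 1 */
        size_t d = b ? t : f[i];             /* scatter address */
        out[d] = in[i];
    }
}

/* Sort keys by their lowest num_bits bits. Returns 0 on success, -1 if
 * scratch memory cannot be allocated. */
int radix_sort(unsigned *keys, size_t n, unsigned num_bits)
{
    assert(num_bits <= 8 * sizeof(unsigned));
    if (n == 0)
        return 0;

    unsigned *tmp = malloc(n * sizeof *tmp);
    size_t *f = malloc(n * sizeof *f);
    if (tmp == NULL || f == NULL) {
        free(tmp);
        free(f);
        return -1;
    }
    for (unsigned bit = 0; bit < num_bits; bit++) {
        split_by_bit(keys, tmp, f, n, bit);
        memcpy(keys, tmp, n * sizeof *keys);
    }
    free(tmp);
    free(f);
    return 0;
}
```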
Specialized Libraries
• CUDPP: CUDA Data Parallel Primitives Library
• CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum ("scan"), parallel sort and parallel reduction.
CUDPP_DLL CUDPPResult cudppSparseMatrixVectorMultiply(CUDPPHandle sparseMatrixHandle, void * d_y, const void * d_x)
Perform matrix-vector multiply y = A*x for arbitrary sparse matrix A and vector x.
CUDPPScanConfig config;
config.direction = CUDPP_SCAN_FORWARD;
config.exclusivity = CUDPP_SCAN_EXCLUSIVE;
config.op = CUDPP_ADD;
config.datatype = CUDPP_FLOAT;
config.maxNumElements = numElements;
config.maxNumRows = 1;
config.rowPitch = 0;
cudppInitializeScan(&config);
cudppScan(d_odata, d_idata, numElements, &config);
CUFFT
• Fewer than 8192 elements: slower than FFTW
• More than 8192 elements: 5x speedup over threaded FFTW and 10x over serial FFTW.
CUBLAS
• CUDA Basic Linear Algebra Subroutines
• Saxpy, conjugate gradient, linear solvers.
• 3D reconstruction of planetary nebulae.
  • http://graphics.tu-bs.de/publications/Fernandez08TechReport.pdf
• GPU variant 100 times faster than CPU version
• Matrix size is limited by graphics card memory and texture size.
• Although taking advantage of sparse matrices will help reduce memory consumption, sparse matrix storage is not implemented by CUBLAS.
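For readers unfamiliar with the term, saxpy is the single-precision BLAS operation y = alpha * x + y. The plain C loop below is only a CPU reference for what the routine computes; `saxpy_ref` is a hypothetical name and is not the CUBLAS entry point.

```c
#include <assert.h>
#include <stddef.h>

/* Reference saxpy: y[i] = alpha * x[i] + y[i] for i in [0, n).
 * Every element is independent, which is why the operation maps
 * naturally to one GPU thread per element. */
void saxpy_ref(size_t n, float alpha, const float *x, float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```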
Useful Links
• http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/
• http://developer.download.nvidia.com/compute/cuda/2_0/docs/CUBLAS_Library_2.0.pdf
• http://gpgpu.org/developer/cudpp
• http://gpgpu.org/2009/05/31/thrust
