
Programming GPUs with CUDA

John Mellor-Crummey

Department of Computer Science


Rice University

[email protected]

COMP 422 Lecture 21 12 April 2011


Why GPUs?

• Two major trends


—GPU performance is pulling away from traditional processors
– ~10x memory bandwidth & floating point ops

—availability of general (non-graphics) programming interfaces


• GPU in every PC and workstation
—massive volume, potentially broad impact
Figure Credit: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 2.0 2
NVidia Tesla GPU

A similar Tesla S870 server is in badlands.rcsg.rice.edu (installed March 2008)

                           Tesla (G80)    Tesla2 (GT200)
CUDA Cores                 128            240
Processor Clock            1.69 GHz       1.47 GHz
Floating Point Precision   IEEE 754 SP    IEEE 754 DP
Dedicated Memory           512 MB         1 GB GDDR3
Memory Clock               1.1 GHz        1.2 GHz
Memory Interface Width     256-bit        512-bit
Memory Bandwidth           70.4 GB/s      159 GB/s

Figure Credit: http://images.nvidia.com/products/tesla_c870/Tesla_C870_F_med.png 3


GPGPU?

• General Purpose computation using GPU


—applications beyond 3D graphics
—typically, data-intensive science and engineering applications
• Data-intensive algorithms leverage GPU attributes
—large data arrays, streaming throughput
—fine-grain SIMD parallelism
—low-latency floating point computation

4
GPGPU Programming of Yesteryear

• Stream-based programming model


• Express algorithms in terms of graphics operations
—use GPU pixel shaders as general-purpose SP floating point units
• Directly exploit
—pixel shaders
—vertex shaders
—video memory

• Threads interact through off-chip video memory
• Example: GPUSort (Govindaraju, Manocha; 2005)
Figure Credits: Dongho Kim, School of Media, Soongsil University 5
Fragment from GPUSort

//invert the other half of the bitonic array and merge


glBegin(GL_QUADS);
for(int start=0; start<num_quads; start++) {
    glTexCoord2f(s+width,0);
    glVertex2f(s,0);
    glTexCoord2f(s+width/2,0);
    glVertex2f(s+width/2,0);
    glTexCoord2f(s+width/2,Height);
    glVertex2f(s+width/2,Height);
    glTexCoord2f(s+width,Height);
    glVertex2f(s,Height);
    s += width;
}
glEnd();

(Govindaraju, Manocha; 2005)


6
CUDA
CUDA = Compute Unified Device Architecture
• Software platform for parallel computing on Nvidia GPUs
—introduced in 2006
—Nvidia’s repositioning of GPUs as versatile compute devices
• C plus a few simple extensions
—write a program for one thread
—instantiate for many parallel threads
—familiar language; simple data-parallel extensions
• CUDA is a scalable parallel programming model
—runs on any number of processors without recompiling

Slide credit: Patrick LeGresley, NVidia 7


Tesla GPU Architecture Abstraction

• NVidia GeForce 8 architecture


—128 CUDA cores (AKA programmable pixel shaders)
– 8 thread processor clusters (TPC)
– 2 streaming multiprocessors (SM) per TPC
– 8 streaming processors (SP) per SM


Figure Credit:
http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf
8
Introducing Fermi

• 512 CUDA cores


• Configurable L1 data cache
• 8x peak DP perf over Tesla 2
—IEEE 754-2008 FP standard

• GigaThread Engine
—concurrent kernel exec

• Full C++ support


• Unified address space
• Debugger support
• ECC support

Figure Credit:
http://www.nvidia.com/content/PDF/fermi_white_papers/
NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi Streaming Multiprocessor 9
GPU Comparison Summary

Figure Credit: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf 10


Why CUDA?

• Business rationale
—opportunity for Nvidia to sell more chips
– extend the demand from graphics into HPC
—insurance against uncertain future for discrete GPUs
– both Intel and AMD aim to integrate GPUs on future microprocessors

• Technical rationale
—hides GPU architecture behind the programming API
– programmers never write “directly to the metal”
  insulates programmers from details of GPU hardware
– enables Nvidia to change GPU architecture completely and transparently
  preserves the investment in CUDA programs
—simplifies the programming of multithreaded hardware
– CUDA automatically manages threads

11
CUDA Design Goals

• Support heterogeneous parallel programming (CPU + GPU)


• Scale to hundreds of cores, thousands of parallel threads
• Enable programmer to focus on parallel algorithms
—not GPU characteristics, programming language, work scheduling, ...

12
CUDA Software Stack for Heterogeneous Computing

Figure Credit: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 1.1 13
Key CUDA Abstractions

• Hierarchy of concurrent threads


• Lightweight synchronization primitives
• Shared memory model for cooperating threads

14
Hierarchy of Concurrent Threads

• Parallel kernels composed of many threads


—all threads execute same sequential program
—use parallel threads rather than sequential loops

• Threads are grouped into thread blocks


—threads in block can sync and share memory

• Blocks are grouped into grids


—threads and blocks have unique IDs
– threadIdx: 1D, 2D, or 3D
– blockIdx: 1D or 2D
—simplifies addressing when processing multidimensional data
Slide credit: Patrick LeGresley, NVidia 15
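
A minimal sketch (not from the slides) of how 2D thread and block IDs simplify addressing multidimensional data; the kernel name scale2d and its parameters are illustrative:

__global__ void scale2d(float *a, int width, int height, float s) {
    // each thread derives its own (x, y) coordinate from the built-in IDs
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        a[y * width + x] *= s;    // row-major indexing into the 2D array
}

// host side: a 2D grid of 2D blocks, e.g. 16x16 threads per block
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale2d<<<grid, block>>>(d_a, width, height, 2.0f);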
CUDA Programming Example
Computing y = ax + y with a serial loop

// Host code
void saxpy_serial(int n, float alpha, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
// invoke serial saxpy kernel
saxpy_serial(n, 2.0, x, y);

Computing y = ax + y in parallel using CUDA

// Device code
__global__
void saxpy_parallel(int n, float alpha, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i] + y[i];
}
// Host code: invoke parallel saxpy kernel (256 threads per block)
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
16
Synchronization and Coordination

• Threads within a block may synchronize with barriers


—... step 1 ...
—__syncthreads();
—... step 2 ...
• Blocks can coordinate via atomic memory operations
—e.g. increment shared queue pointer with atomicInc()
• Implicit barrier between kernels launched by host
—vec_minus<<<nblocks, blksize>>>(a, b, c);
—vec_dot<<<nblocks, blksize>>>(c, c);

17
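
A sketch (not from the slides) of the shared-queue-pointer idiom mentioned above, using atomicInc; the names enqueue_positive, queue, and tail are illustrative:

__global__ void enqueue_positive(const float *data, int n, int *queue, unsigned int *tail) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f) {
        // atomically claim a unique slot; threads from any block may enqueue concurrently
        unsigned int slot = atomicInc(tail, 0xFFFFFFFFu);
        queue[slot] = i;
    }
}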
CPU vs. GPGPU vs. CUDA
Comparing the abstract models

Figure Credit: http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf


18
CUDA Memory Model

Figure credits: Patrick LeGresley, NVidia 19


Memory Model (Continued)

Figure credit: Patrick LeGresley, NVidia 20


Memory Access Latencies

• Register – dedicated HW - single cycle


• Shared Memory – dedicated HW - single cycle
• Local Memory – DRAM, no cache - *slow*
• Global Memory – DRAM, no cache - *slow*
• Constant Memory – DRAM, cached, 1…10s…100s of cycles,
—depends on cache locality
• Texture Memory – DRAM, cached, 1…10s…100s of cycles
—depends on cache locality
• Instruction Memory (invisible) – DRAM, cached

21
Minimal Extensions to C

• Declaration specifiers to indicate where things live


—functions
—__global__ void KernelFunc(...); // kernel callable from host
– must return void
—__device__ float DeviceFunc(...); // function callable on device
– no recursion
– no static variables within function
—__host__ float HostFunc(); // only callable on host
—variables (next slide)
• Extend function invocation syntax for parallel kernel launch
—KernelFunc<<<500, 128>>>(...); // 500 blocks, 128 threads each
• Built-in variables for thread identification in kernels
—dim3 threadIdx; dim3 blockIdx; dim3 blockDim;

22
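
A short illustrative sketch (not from the slides) combining these specifiers: a __device__ helper called from a __global__ kernel, launched with the <<<...>>> syntax; all names are hypothetical:

__device__ float square(float x) {    // callable only from device code
    return x * x;
}

__global__ void sum_of_squares(int n, const float *x, float *out) {  // launched from host code
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(x[i]);
}

// host: sum_of_squares<<<(n + 127) / 128, 128>>>(n, d_x, d_out);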
Invoking a Kernel Function

• Call kernel function with an execution configuration

• Any call to a kernel function is asynchronous


—explicit synchronization is needed to block
• cudaThreadSynchronize() forces runtime to wait until all
preceding device tasks have finished
• Within kernel, declare dynamically sized shared memory as
—extern __shared__ int shared[];
23
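
An illustrative sketch (not from the slides) of an asynchronous launch with dynamically sized shared memory and explicit synchronization; KernelFunc, d_data, and nblocks are assumed names:

// third launch parameter: bytes of dynamic shared memory per block,
// visible inside the kernel as: extern __shared__ int shared[];
KernelFunc<<<nblocks, 128, 128 * sizeof(int)>>>(d_data, n);   // returns to the host immediately

cudaThreadSynchronize();   // block the host until all preceding device work completes

Later CUDA releases deprecate cudaThreadSynchronize() in favor of cudaDeviceSynchronize().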
CUDA Variable Declarations

• __device__ is optional with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
—except arrays: reside in local memory
• Pointers can only point to memory allocated or declared in
global memory
—allocated on the host and passed to the kernel
– __global__ void Kernelfunc(float *ptr)
—address obtained for a global variable: float *ptr = &GlobalVar 24
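
An illustrative sketch (not from the slides) of where variables live under these rules; the names coeff, global_scale, and apply are hypothetical:

__constant__ float coeff[16];     // constant memory; set from the host with cudaMemcpyToSymbol
__device__ float global_scale;    // global memory, visible to all threads

__global__ void apply(float *data, int n) {
    __shared__ float tile[256];   // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float t;                      // automatic scalar: resides in a register
    if (i < n) {
        tile[threadIdx.x] = data[i];
        t = tile[threadIdx.x] * coeff[0] * global_scale;
        data[i] = t;
    }
}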
Using Per Block Shared Memory

• Variables shared across block


—__shared__ int *begin, *end;
• Scratchpad memory
—__shared__ int scratch[blocksize];
—scratch[threadIdx.x] = begin[threadIdx.x];
—// … compute on scratch values …
—begin[threadIdx.x] = scratch[threadIdx.x];
• Communicating values between threads
—scratch[threadIdx.x] = begin[threadIdx.x];
—__syncthreads();
—int left = scratch[threadIdx.x - 1];

25
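
Putting the fragments above together into one complete kernel; this is a sketch (not from the slides) assuming a fixed block size BLOCKSIZE and ignoring what happens at block boundaries:

#define BLOCKSIZE 256

__global__ void shift_left(const int *begin, int *out, int n) {
    __shared__ int scratch[BLOCKSIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        scratch[threadIdx.x] = begin[i];    // stage data in on-chip shared memory
    __syncthreads();                        // make all loads visible to the whole block
    if (i < n && threadIdx.x > 0)
        out[i] = scratch[threadIdx.x - 1];  // read a neighbor's value from shared memory
}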
Features Available in GPU Code

• Special variables for thread identification in kernels


—dim3 threadIdx; dim3 blockIdx; dim3 blockDim;
• Intrinsics that expose specific operations in kernel code
—__syncthreads(); // barrier synchronization
• Standard math library operations
—exponentiation, truncation and rounding, trigonometric
functions, min/max/abs, log, quotient/remainder, etc.
• Atomic memory operations
—atomicAdd, atomicMin, atomicAnd, atomicCAS, etc.

26
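
A small sketch (not from the slides) using atomicAdd to build a histogram in global memory; it assumes a device with global atomics (compute capability 1.1 or later), and the names are illustrative:

__global__ void histogram(const unsigned char *in, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1u);   // concurrent updates to the same bin are serialized safely
}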
Runtime Support

• Memory management for pointers to GPU memory


—cudaMalloc(), cudaFree()
• Copying from host to/from device, device to device
—cudaMemcpy(), cudaMemcpy2D()

27
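
A brief sketch (not from the slides) of the 2D variants; cudaMallocPitch is not listed above but is the usual companion of cudaMemcpy2D, and the names and dimensions are illustrative:

float *d_img;
size_t pitch;
// allocate a padded (pitched) 2D array on the device so each row starts aligned
cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);
// copy a width x height image from tightly packed host rows into the pitched device array
cudaMemcpy2D(d_img, pitch, h_img, width * sizeof(float),
             width * sizeof(float), height, cudaMemcpyHostToDevice);
// ... launch kernels that use d_img ...
cudaFree(d_img);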
More Complete Example: Vector Addition

(kernel code shown as a figure in the original slides)

28
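
A minimal sketch of a typical CUDA vector-addition kernel to stand in for that figure:

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        c[i] = a[i] + b[i];
}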
Vector Addition Host Code

29
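
The host code is likewise shown only as a figure in the original slides; a minimal sketch of the usual allocate / copy / launch / copy-back sequence for the vecAdd kernel above:

int n = 1 << 20;
size_t bytes = n * sizeof(float);
float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
float *d_a, *d_b, *d_c;
// ... initialize h_a and h_b ...

cudaMalloc((void**)&d_a, bytes);
cudaMalloc((void**)&d_b, bytes);
cudaMalloc((void**)&d_c, bytes);

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

int threads = 256;
int blocks = (n + threads - 1) / threads;
vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // implicitly waits for the kernel

cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
free(h_a); free(h_b); free(h_c);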
Extended C Summary

30
Compiling CUDA

31
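
The compilation flow appears only as a figure; in practice a single nvcc invocation compiles the mixed host/device source. A typical command line (file name and architecture flag are illustrative):

nvcc -O2 -arch=sm_13 -o vecadd vecadd.cu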
Ideal CUDA programs

• High intrinsic parallelism


—e.g. per-element operations
• Minimal communication (if any) between threads
—limited synchronization
• High ratio of arithmetic to memory operations
• Few control flow statements
—SIMD execution
– divergent paths among threads in a block may be serialized (costly)
– compiler may replace conditional instructions by predicated
operations to reduce divergence

32
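
An illustrative sketch (not from the slides) of branch divergence within a block; the function name and math operations are arbitrary:

__global__ void divergent_example(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) {
        x[i] = expf(x[i]);                  // even threads in a warp take this path ...
    } else {
        x[i] = logf(fabsf(x[i]) + 1.0f);    // ... odd threads take this one, so the two
    }                                       // halves of the warp execute serially
    // a tiny conditional like the following is often compiled to predicated
    // instructions instead, avoiding serialization:
    // x[i] = (x[i] > 0.0f) ? x[i] : 0.0f;
}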
CUDA Matrix Multiply: Host Code

33
CUDA Matrix Multiply: Device Code

34
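
The device code is shown only as a figure; a minimal sketch of a straightforward (non-tiled) matrix multiply for square N x N matrices, computing one output element per thread:

__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];   // dot product of a row of A and a column of B
        C[row * N + col] = sum;
    }
}

// host launch (illustrative): dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (N + 15) / 16);
// matmul<<<grid, block>>>(d_A, d_B, d_C, N);

A tiled version using the per-block shared memory idiom shown earlier would reduce global memory traffic.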
Optimization Considerations
• Kernel optimizations
—make use of shared memory
—minimize use of divergent control flow
– SIMD execution must follow all paths taken within a thread group
—use intrinsic instructions when possible
– exploit the hardware support behind them

• CPU/GPU interaction
—maximize PCIe throughput
—use asynchronous memory copies
• Key resource considerations for Tesla GPUs
—max 512 threads per block
—up to 8 blocks per SM
—8K registers per SM (16K for Tesla2)
—16 KB shared mem per SM
—16 KB local mem per thread
—64 KB of constant mem
—use the nvcc option -maxrregcount=<regs> to limit the number of registers used per thread
35
GPU Application Domains

36
CUDA Resources

• General information about CUDA


—www.nvidia.com/object/cuda_home.html
• Nvidia GPUs compatible with CUDA
—www.nvidia.com/object/cuda_learn_products.html
• CUDA sample source code
—www.nvidia.com/object/cuda_get_samples.html
• Download the CUDA SDK
—www.nvidia.com/object/cuda_get.html

37
CUDA Alternative: OpenCL

• Emerging framework for writing programs that execute on heterogeneous platforms, including CPUs, GPUs, etc.
—supports both task and data parallelism
—based on subset of ISO C99 with extensions for parallelism
—numerics based on IEEE 754 floating point standard
—interoperates efficiently with graphics APIs, e.g. OpenGL
• OpenCL managed by non-profit Khronos Group
• Initial specification approved for public release Dec. 8, 2008
—specification 1.0.33 released Feb 4, 2009

38
OpenCL Kernel Example: 1D FFT

39
OpenCL Host Program: 1D FFT

40
Device Programming Abstractions

• CPU
—single-threaded, serial instruction stream
– superscalar: manage pipelines for multiple functional units
– SIMD short vector operations: 3-4 operations per cycle
—data in cache or memory
• GPGPU
—use GPU pixel shaders as general purpose processors
—operate on data in video memory
—threads interact with each other through off-chip memory
• CUDA
—automatically manages threads
—divides data set into smaller chunks stored in on-chip memory
– reduces the need to access off-chip memory, which improves performance
—multiple threads can share each chunk
41
References

• Patrick LeGresley, High Performance Computing with CUDA, Stanford


University Colloquium, October 2008, http://www.stanford.edu/dept/ICME/docs/seminars/LeGresley-2008-10-27.pdf
• Vivek Sarkar. Introduction to General-Purpose computation on GPUs
(GPGPUs), COMP 635, September 2007
• Rob Farber. CUDA, Supercomputing for the Masses, Parts 1-11, Dr. Dobb’s
Portal, http://www.ddj.com/architect/207200659, April 2008-March 2009.
• Tom Halfhill. Parallel Processing with CUDA, Microprocessor Report,
January 2008.
• N. Govindaraju et al. A cache-efficient sorting algorithm for database and
data mining computations using graphics processors. http://gamma.cs.unc.edu/SORT/gpusort.pdf
• http://defectivecompass.wordpress.com/2006/06/25/learning-from-gpusort
• http://en.wikipedia.org/wiki/OpenCL
• http://www.khronos.org/opencl
—http://www.khronos.org/files/opencl-quick-reference-card.pdf 42
