
Programming GPUs with CUDA

John Mellor-Crummey

Department of Computer Science


Rice University

[email protected]

COMP 422 Lecture 21 12 April 2011


Why GPUs?

• Two major trends


—GPU performance is pulling away from traditional processors
– ~10x memory bandwidth & floating point ops

—availability of general (non-graphics) programming interfaces


• GPU in every PC and workstation
—massive volume, potentially broad impact
Figure Credit: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 2.0 2
NVidia Tesla GPU

A similar Tesla S870 server is in badlands.rcsg.rice.edu (installed March 2008)

                           Tesla (G80)    Tesla2 (GT200)
CUDA Cores                 128            240
Processor Clock            1.69 GHz       1.47 GHz
Floating Point Precision   IEEE 754 SP    IEEE 754 DP
Dedicated Memory           512 MB         1 GB GDDR3
Memory Clock               1.1 GHz        1.2 GHz
Memory Interface Width     256-bit        512-bit
Memory Bandwidth           70.4 GB/s      159 GB/s

Figure Credit: http://images.nvidia.com/products/tesla_c870/Tesla_C870_F_med.png 3


GPGPU?

• General Purpose computation using GPU


—applications beyond 3D graphics
—typically, data-intensive science and engineering applications
• Data-intensive algorithms leverage GPU attributes
—large data arrays, streaming throughput
—fine-grain SIMD parallelism
—low-latency floating point computation

4
GPGPU Programming of Yesteryear

• Stream-based programming model


• Express algorithms in terms of graphics operations
—use GPU pixel shaders as general-purpose SP floating point units
• Directly exploit
—pixel shaders
—vertex shaders
—video memory

• Threads interact through off-chip video memory
• Example: GPUSort (Govindaraju, Manocha; 2005)
Figure Credits: Dongho Kim, School of Media, Soongsil University 5
Fragment from GPUSort

//invert the other half of the bitonic array and merge


glBegin(GL_QUADS);
for(int start=0; start<num_quads; start++) {
    glTexCoord2f(s+width,0);
    glVertex2f(s,0);
    glTexCoord2f(s+width/2,0);
    glVertex2f(s+width/2,0);
    glTexCoord2f(s+width/2,Height);
    glVertex2f(s+width/2,Height);
    glTexCoord2f(s+width,Height);
    glVertex2f(s,Height);
    s += width;
}
glEnd();

(Govindaraju, Manocha; 2005)


6
CUDA
CUDA = Compute Unified Device Architecture
• Software platform for parallel computing on Nvidia GPUs
—introduced in 2006
—Nvidia’s repositioning of GPUs as versatile compute devices
• C plus a few simple extensions
—write a program for one thread
—instantiate for many parallel threads
—familiar language; simple data-parallel extensions
• CUDA is a scalable parallel programming model
—runs on any number of processors without recompiling

Slide credit: Patrick LeGresley, NVidia 7


Tesla GPU Architecture Abstraction

• NVidia GeForce 8 architecture


—128 CUDA cores (AKA programmable pixel shaders)
– 8 thread processor clusters (TPC)
– 2 streaming multiprocessors (SM) per TPC
– 8 streaming processors (SP) per SM


Figure Credit:
http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf
8
Introducing Fermi

• 512 CUDA cores


• Configurable L1 data cache
• 8x peak DP perf over Tesla 2
—IEEE 754-2008 FP standard

• GigaThread Engine
—concurrent kernel exec

• Full C++ support


• Unified address space
• Debugger support
• ECC support

Figure Credit:
http://www.nvidia.com/content/PDF/fermi_white_papers/
NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi Streaming Multiprocessor 9
GPU Comparison Summary

Figure Credit: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf 10


Why CUDA?

• Business rationale
—opportunity for Nvidia to sell more chips
– extend the demand from graphics into HPC
—insurance against uncertain future for discrete GPUs
– both Intel and AMD aim to integrate GPUs on future microprocessors

• Technical rationale
—hides GPU architecture behind the programming API
– programmers never write “directly to the metal”
  insulates programmers from details of GPU hardware
– enables Nvidia to change GPU architecture completely and transparently
  preserves the investment in CUDA programs
—simplifies the programming of multithreaded hardware
– CUDA automatically manages threads

11
CUDA Design Goals

• Support heterogeneous parallel programming (CPU + GPU)


• Scale to hundreds of cores, thousands of parallel threads
• Enable programmer to focus on parallel algorithms
—not GPU characteristics, programming language, work scheduling, ...

12
CUDA Software Stack for Heterogeneous Computing

Figure Credit: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 1.1 13
Key CUDA Abstractions

• Hierarchy of concurrent threads


• Lightweight synchronization primitives
• Shared memory model for cooperating threads

14
Hierarchy of Concurrent Threads

• Parallel kernels composed of many threads


—all threads execute same sequential program
—use parallel threads rather than sequential loops

• Threads are grouped into thread blocks


—threads in block can sync and share memory

• Blocks are grouped into grids


—threads and blocks have unique IDs
– threadIdx: 1D, 2D, or 3D
– blockIdx: 1D or 2D
—simplifies addressing when processing multidimensional data
Slide credit: Patrick LeGresley, NVidia 15
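
A minimal sketch (not from the slides) of how 2D thread and block IDs simplify addressing multidimensional data; the kernel name scale2d and its parameters are illustrative:

__global__ void scale2d(float *a, int width, int height, float s) {
    // each thread derives its own (x, y) coordinate from the built-in IDs
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        a[y * width + x] *= s;    // row-major indexing into the 2D array
}

// host side: a 2D grid of 2D blocks, e.g. 16x16 threads per block
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale2d<<<grid, block>>>(d_a, width, height, 2.0f);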
CUDA Programming Example
Computing y = ax + y with a serial loop

// Host code
void saxpy_serial(int n, float alpha, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
// invoke serial saxpy kernel
saxpy_serial(n, 2.0, x, y);

Computing y = ax + y in parallel using CUDA

// Device code
__global__
void saxpy_parallel(int n, float alpha, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i] + y[i];
}
// Host code: invoke parallel saxpy kernel (256 threads per block)
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
16
Synchronization and Coordination

• Threads within a block may synchronize with barriers


—... step 1 ...
—__syncthreads();
—... step 2 ...
• Blocks can coordinate via atomic memory operations
—e.g. increment shared queue pointer with atomicInc()
• Implicit barrier between kernels launched by host
—vec_minus<<<nblocks, blksize>>>(a, b, c);
—vec_dot<<<nblocks, blksize>>>(c, c);

17
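
A sketch (not from the slides) of the shared-queue-pointer idiom mentioned above, using atomicInc; the names enqueue_positive, queue, and tail are illustrative:

__global__ void enqueue_positive(const float *data, int n, int *queue, unsigned int *tail) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f) {
        // atomically claim a unique slot; threads from any block may enqueue concurrently
        unsigned int slot = atomicInc(tail, 0xFFFFFFFFu);
        queue[slot] = i;
    }
}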
CPU vs. GPGPU vs. CUDA
Comparing the abstract models

Figure Credit: http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf


18
CUDA Memory Model

Figure credits: Patrick LeGresley, NVidia 19


Memory Model (Continued)

Figure credit: Patrick LeGresley, NVidia 20


Memory Access Latencies

• Register – dedicated HW - single cycle


• Shared Memory – dedicated HW - single cycle
• Local Memory – DRAM, no cache - *slow*
• Global Memory – DRAM, no cache - *slow*
• Constant Memory – DRAM, cached, 1…10s…100s of cycles,
—depends on cache locality
• Texture Memory – DRAM, cached, 1…10s…100s of cycles
—depends on cache locality
• Instruction Memory (invisible) – DRAM, cached

21
Minimal Extensions to C

• Declaration specifiers to indicate where things live


—functions
—__global__ void KernelFunc(...); // kernel callable from host
– must return void
—__device__ float DeviceFunc(...); // function callable on device
– no recursion
– no static variables within function
—__host__ float HostFunc(); // only callable on host
—variables (next slide)
• Extend function invocation syntax for parallel kernel launch
—KernelFunc<<<500, 128>>>(...); // 500 blocks, 128 threads each
• Built-in variables for thread identification in kernels
—dim3 threadIdx; dim3 blockIdx; dim3 blockDim;

22
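
A short illustrative sketch (not from the slides) combining these specifiers: a __device__ helper called from a __global__ kernel, launched with the <<<...>>> syntax; all names are hypothetical:

__device__ float square(float x) {    // callable only from device code
    return x * x;
}

__global__ void sum_of_squares(int n, const float *x, float *out) {  // launched from host code
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(x[i]);
}

// host: sum_of_squares<<<(n + 127) / 128, 128>>>(n, d_x, d_out);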
Invoking a Kernel Function

• Call kernel function with an execution configuration

• Any call to a kernel function is asynchronous


—explicit synchronization is needed to block
• cudaThreadSynchronize() forces runtime to wait until all
preceding device tasks have finished
• Within kernel, declare dynamically sized shared memory as
—extern __shared__ int shared[];
23
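
An illustrative sketch (not from the slides) of an asynchronous launch with dynamically sized shared memory and explicit synchronization; KernelFunc, d_data, and nblocks are assumed names:

// third launch parameter: bytes of dynamic shared memory per block,
// visible inside the kernel as: extern __shared__ int shared[];
KernelFunc<<<nblocks, 128, 128 * sizeof(int)>>>(d_data, n);   // returns to the host immediately

cudaThreadSynchronize();   // block the host until all preceding device work completes

Later CUDA releases deprecate cudaThreadSynchronize() in favor of cudaDeviceSynchronize().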
CUDA Variable Declarations

• __device__ is optional with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
—except arrays: reside in local memory
• Pointers can only point to memory allocated or declared in
global memory
—allocated on the host and passed to the kernel
– __global__ void Kernelfunc(float *ptr)
—address obtained for a global variable: float *ptr = &GlobalVar 24
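
An illustrative sketch (not from the slides) of where variables live under these rules; the names coeff, global_scale, and apply are hypothetical:

__constant__ float coeff[16];     // constant memory; set from the host with cudaMemcpyToSymbol
__device__ float global_scale;    // global memory, visible to all threads

__global__ void apply(float *data, int n) {
    __shared__ float tile[256];   // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float t;                      // automatic scalar: resides in a register
    if (i < n) {
        tile[threadIdx.x] = data[i];
        t = tile[threadIdx.x] * coeff[0] * global_scale;
        data[i] = t;
    }
}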
Using Per Block Shared Memory

• Variables shared across block


—__shared__ int *begin, *end;
• Scratchpad memory
—__shared__ int scratch[blocksize];
—scratch[threadIdx.x] = begin[threadIdx.x];
—// … compute on scratch values …
—begin[threadIdx.x] = scratch[threadIdx.x];
• Communicating values between threads
—scratch[threadIdx.x] = begin[threadIdx.x];
—__syncthreads();
—int left = scratch[threadIdx.x - 1];

25
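
Putting the fragments above together into one complete kernel; this is a sketch (not from the slides) assuming a fixed block size BLOCKSIZE and ignoring what happens at block boundaries:

#define BLOCKSIZE 256

__global__ void shift_left(const int *begin, int *out, int n) {
    __shared__ int scratch[BLOCKSIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        scratch[threadIdx.x] = begin[i];    // stage data in on-chip shared memory
    __syncthreads();                        // make all loads visible to the whole block
    if (i < n && threadIdx.x > 0)
        out[i] = scratch[threadIdx.x - 1];  // read a neighbor's value from shared memory
}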
Features Available in GPU Code

• Special variables for thread identification in kernels


—dim3 threadIdx; dim3 blockIdx; dim3 blockDim;
• Intrinsics that expose specific operations in kernel code
—__syncthreads(); // barrier synchronization
• Standard math library operations
—exponentiation, truncation and rounding, trigonometric
functions, min/max/abs, log, quotient/remainder, etc.
• Atomic memory operations
—atomicAdd, atomicMin, atomicAnd, atomicCAS, etc.

26
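
A small sketch (not from the slides) using atomicAdd to build a histogram in global memory; it assumes a device with global atomics (compute capability 1.1 or later), and the names are illustrative:

__global__ void histogram(const unsigned char *in, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1u);   // concurrent updates to the same bin are serialized safely
}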
Runtime Support

• Memory management for pointers to GPU memory


—cudaMalloc(), cudaFree()
• Copying from host to/from device, device to device
—cudaMemcpy(), cudaMemcpy2D()

27
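
A brief sketch (not from the slides) of the 2D variants; cudaMallocPitch is not listed above but is the usual companion of cudaMemcpy2D, and the names and dimensions are illustrative:

float *d_img;
size_t pitch;
// allocate a padded (pitched) 2D array on the device so each row starts aligned
cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);
// copy a width x height image from tightly packed host rows into the pitched device array
cudaMemcpy2D(d_img, pitch, h_img, width * sizeof(float),
             width * sizeof(float), height, cudaMemcpyHostToDevice);
// ... launch kernels that use d_img ...
cudaFree(d_img);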
More Complete Example: Vector Addition

(kernel code shown as a figure in the original slides)

28
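
A minimal sketch of a typical CUDA vector-addition kernel to stand in for that figure:

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        c[i] = a[i] + b[i];
}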
Vector Addition Host Code

29
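
The host code is likewise shown only as a figure in the original slides; a minimal sketch of the usual allocate / copy / launch / copy-back sequence for the vecAdd kernel above:

int n = 1 << 20;
size_t bytes = n * sizeof(float);
float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
float *d_a, *d_b, *d_c;
// ... initialize h_a and h_b ...

cudaMalloc((void**)&d_a, bytes);
cudaMalloc((void**)&d_b, bytes);
cudaMalloc((void**)&d_c, bytes);

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

int threads = 256;
int blocks = (n + threads - 1) / threads;
vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // implicitly waits for the kernel

cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
free(h_a); free(h_b); free(h_c);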
Extended C Summary

30
Compiling CUDA

31
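
The compilation flow appears only as a figure; in practice a single nvcc invocation compiles the mixed host/device source. A typical command line (file name and architecture flag are illustrative):

nvcc -O2 -arch=sm_13 -o vecadd vecadd.cu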
Ideal CUDA programs

• High intrinsic parallelism


—e.g. per-element operations
• Minimal communication (if any) between threads
—limited synchronization
• High ratio of arithmetic to memory operations
• Few control flow statements
—SIMD execution
– divergent paths among threads in a block may be serialized (costly)
– compiler may replace conditional instructions by predicated
operations to reduce divergence

32
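
An illustrative sketch (not from the slides) of branch divergence within a block; the function name and math operations are arbitrary:

__global__ void divergent_example(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) {
        x[i] = expf(x[i]);                  // even threads in a warp take this path ...
    } else {
        x[i] = logf(fabsf(x[i]) + 1.0f);    // ... odd threads take this one, so the two
    }                                       // halves of the warp execute serially
    // a tiny conditional like the following is often compiled to predicated
    // instructions instead, avoiding serialization:
    // x[i] = (x[i] > 0.0f) ? x[i] : 0.0f;
}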
CUDA Matrix Multiply: Host Code

33
CUDA Matrix Multiply: Device Code

34
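
The device code is shown only as a figure; a minimal sketch of a straightforward (non-tiled) matrix multiply for square N x N matrices, computing one output element per thread:

__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];   // dot product of a row of A and a column of B
        C[row * N + col] = sum;
    }
}

// host launch (illustrative): dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (N + 15) / 16);
// matmul<<<grid, block>>>(d_A, d_B, d_C, N);

A tiled version using the per-block shared memory idiom shown earlier would reduce global memory traffic.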
Optimization Considerations
• Kernel optimizations
—make use of shared memory
—minimize use of divergent control flow
– SIMD execution must follow all paths taken within a thread group
—use intrinsic instructions when possible
– exploit the hardware support behind them

• CPU/GPU interaction
—maximize PCIe throughput
—use asynchronous memory copies
• Key resource considerations for Tesla GPUs
—max 512 threads per block
—up to 8 blocks per SM
—8K registers per SM (16K for Tesla2)
—16 KB shared mem per SM
—16 KB local mem per thread
—64 KB of constant mem
—use the nvcc option -maxrregcount=<regs> to limit the number of registers used per thread
35
GPU Application Domains

36
CUDA Resources

• General information about CUDA


—www.nvidia.com/object/cuda_home.html
• Nvidia GPUs compatible with CUDA
—www.nvidia.com/object/cuda_learn_products.html
• CUDA sample source code
—www.nvidia.com/object/cuda_get_samples.html
• Download the CUDA SDK
—www.nvidia.com/object/cuda_get.html

37
CUDA Alternative: OpenCL

• Emerging framework for writing programs that execute on heterogeneous platforms, including CPUs, GPUs, etc.
—supports both task and data parallelism
—based on subset of ISO C99 with extensions for parallelism
—numerics based on IEEE 754 floating point standard
—interoperates efficiently with graphics APIs, e.g. OpenGL
• OpenCL managed by non-profit Khronos Group
• Initial specification approved for public release Dec. 8, 2008
—specification 1.0.33 released Feb 4, 2009

38
OpenCL Kernel Example: 1D FFT

39
OpenCL Host Program: 1D FFT

40
Device Programming Abstractions

• CPU
—single-threaded, serial instruction stream
– superscalar: manage pipelines for multiple functional units
– SIMD short vector operations: 3-4 operations per cycle
—data in cache or memory
• GPGPU
—use GPU pixel shaders as general purpose processors
—operate on data in video memory
—threads interact with each other through off-chip memory
• CUDA
—automatically manages threads
—divides data set into smaller chunks stored in on-chip memory
– reduces the need to access off-chip memory, which improves performance
—multiple threads can share each chunk
41
References

• Patrick LeGresley, High Performance Computing with CUDA, Stanford


University Colloquium, October 2008, http://www.stanford.edu/dept/ICME/docs/seminars/LeGresley-2008-10-27.pdf
• Vivek Sarkar. Introduction to General-Purpose computation on GPUs
(GPGPUs), COMP 635, September 2007
• Rob Farber. CUDA, Supercomputing for the Masses, Parts 1-11, Dr. Dobb’s
Portal, http://www.ddj.com/architect/207200659, April 2008-March 2009.
• Tom Halfhill. Parallel Processing with CUDA, Microprocessor Report,
January 2008.
• N. Govindaraju et al. A cache-efficient sorting algorithm for database and
data mining computations using graphics processors. http://gamma.cs.unc.edu/SORT/gpusort.pdf
• http://defectivecompass.wordpress.com/2006/06/25/learning-from-gpusort
• http://en.wikipedia.org/wiki/OpenCL
• http://www.khronos.org/opencl
—http://www.khronos.org/files/opencl-quick-reference-card.pdf 42
