
UNIT III PROGRAMMING GPUs

GPU Architectures – Data Parallelism – CUDA Basics – CUDA Program Structure – Threads, Blocks, Grids
– Memory Handling.

I. GPU Architectures
• GPU architecture is mainly driven by the following key factors:
1. Amount of data processed at one time (Parallel processing).
2. Processing speed on each data element (Clock frequency).
3. Amount of data transferred at one time (Memory bandwidth).
4. Time for each data element to be transferred (Memory latency).
• To begin with, let us look at the main design distinctions between a CPU and a GPU.
• A CPU consists of a few large cores with large caches and sophisticated control units, optimized for serial performance.
• A GPU, in contrast, consists of a large number of small cores running many lightweight threads, with small caches and minimal control logic, optimized for execution throughput.
• A GPU provides much higher instruction throughput and memory bandwidth than a CPU within a similar price and power envelope.

• GPU architecture focuses on keeping the available cores busy and is less concerned with low-latency cache access.
• In a generic many-core GPU, less chip area is devoted to control logic and caches, and a large number of transistors are devoted to parallel data processing. The following points describe this architecture.
1) The GPU consists of multiple Processor Clusters (PCs).

2) Each Processor Cluster (PC) contains multiple Streaming Multiprocessors (SMs).

3) Each Streaming Multiprocessor (SM) has a number of Streaming Processors (SPs), also known as cores, that share control logic and an L1 (level 1) instruction cache.

4) Each SM uses a dedicated L1 (level 1) cache and a shared L2 (level 2) cache before pulling data from global memory, i.e. Graphics Double Data Rate (GDDR) DRAM.

5) The number of SMs, and the number of cores per SM, varies with the targeted price and market segment of the GPU.

6) The global memory of a GPU consists of multiple gigabytes of DRAM. The growing size of global memory allows data to be kept on the GPU for longer, reducing transfers to and from the CPU.

7) GPU architecture is tolerant of memory latency; higher bandwidth makes up for it.

8) Compared with a CPU, a GPU works with fewer and smaller cache levels, because more of its transistors are dedicated to computation rather than to hiding memory access time.

9) The memory bus is optimized for bandwidth, allowing it to serve a large number of ALUs simultaneously.

10) GPU architecture is optimized for data-parallel, throughput-oriented computation.

11) To execute tasks in parallel, work is scheduled at the Processor Cluster (PC) or Streaming Multiprocessor (SM) level.
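
The architectural parameters listed above (SM count, global memory size, memory bus width) differ from GPU to GPU. As a minimal sketch, a program can query them at runtime through the CUDA runtime API; the values printed depend entirely on the installed device.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);

        for (int d = 0; d < deviceCount; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);

            printf("Device %d: %s\n", d, prop.name);
            // Number of Streaming Multiprocessors (SMs) on this GPU
            printf("  SM count             : %d\n", prop.multiProcessorCount);
            // Size of global (GDDR) memory
            printf("  Global memory (GB)   : %.1f\n",
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
            // On-chip shared memory available to one thread block
            printf("  Shared mem per block : %zu bytes\n", prop.sharedMemPerBlock);
            // Bus width is one factor behind the high memory bandwidth
            printf("  Memory bus width     : %d bits\n", prop.memoryBusWidth);
        }
        return 0;
    }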

II. Data Parallelism


• Data parallelism is a key concept in parallel computing in which a dataset is divided into smaller chunks and the same operation is performed concurrently on each chunk. This approach leverages the ability to process multiple data elements simultaneously, significantly speeding up computations.
• CPUs use task parallelism, wherein:
a. Multiple tasks map to multiple threads, and different tasks run different instructions.
b. Threads are generally heavyweight.
c. Programming is done for the individual thread.
• GPUs, in contrast, use data parallelism, wherein:
a. The same instruction is executed on different data.
b. Threads are generally lightweight.
c. Programming is done for batches of threads (e.g. one pixel shader per group of pixels).
• In data parallelism, performance improvement is achieved by applying the same small set of operations iteratively over multiple streams of data.
• It is simply a way of executing an application in parallel on multiple processors.
• The goal is to scale processing throughput by decomposing the data set into concurrent processing streams, all performing the same set of operations.
• The CPU application manages the GPU and uses it to offload specific computations.
• GPU code is encapsulated in parallel routines called kernels.
• The CPU executes the main program, which prepares the input data for GPU processing, invokes the kernel on the GPU, and then obtains the results after the kernel terminates.
• A GPU kernel maintains its own application state. A kernel is written as an ordinary sequential function, but it is executed in parallel by thousands of GPU threads.
• Data parallelism is achieved in SIMD (Single Instruction, Multiple Data) mode.
• In SIMD mode, an instruction is decoded only once, and multiple ALUs perform the work on multiple data elements in parallel.
• Either a single controller drives the parallel data operations, or multiple threads perform the same work on individual compute nodes (SPMD, Single Program Multiple Data).
• SIMD parallelism enhances the performance of computationally intensive applications that execute the same operation on distinct elements of a dataset.
• Modern applications process large amounts of data, which incurs significant execution time on sequential computers.
• Data parallelism is used to advantage in applications such as image processing, computer graphics, and linear algebra routines such as matrix multiplication; a minimal kernel sketch follows this list.
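
As a concrete illustration of data parallelism, the following is a minimal sketch of a CUDA vector-addition kernel (the names vecAdd, a, b and c are illustrative). Every thread executes the same instruction stream, but the index it computes from its block and thread IDs points it at a different element of the data.

    // Executed by thousands of GPU threads, each on a different element.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        // Global index of this thread within the whole launch
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                 // guard: the last block may contain extra threads
            c[i] = a[i] + b[i];    // same operation, different data element
    }

The kernel itself reads like ordinary sequential C; the parallelism comes from launching it over many threads, as shown in the program sketch in Section III.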

III. CUDA Hardware & CUDA Basics

CUDA Hardware:
• CUDA (Compute Unified Device Architecture) is a scalable parallel computing platform and programming model for general-purpose computing on GPUs.
• CUDA was introduced by NVIDIA in 2006.
• CUDA is a data-parallel extension to the C/C++ languages and an API for parallel programming.
• The CUDA parallel programming model has three key abstractions:
o (1) a hierarchy of thread groups,
o (2) shared memories, and
o (3) barrier synchronization.
• The programmer or compiler decomposes large computing problems into many small problems that can be solved in parallel.
• Programs written using CUDA harness the power of the GPU and thereby increase computing performance.
• In GPU-accelerated applications, the sequential part of the workload runs on the CPU (which is optimized for single-threaded performance), while the compute-intensive portion runs on thousands of GPU cores in parallel.
• Using CUDA, developers can utilize the power of GPUs to perform general computing tasks such as matrix multiplication and other linear algebra operations, instead of only graphics calculations.
• With CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB (and through APIs such as DirectCompute), and express parallelism through extensions in the form of a few basic keywords; a minimal program sketch follows this list.
• At a high level, a graphics card with a many-core GPU and high-speed graphics device memory sits inside a standard PC or server with one or two multicore CPUs.
• The GPU consists of multiple Streaming Multiprocessors (SMs), and each SM has a number of Streaming Processors (SPs), also known as cores. Each SM uses a dedicated L1 cache and a shared L2 cache, as described in Section I.
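
Putting the pieces together, a typical GPU-accelerated program follows the pattern described above: the CPU prepares the input, copies it to GPU global memory, launches a kernel over a grid of thread blocks, and copies the result back. The sketch below reuses the illustrative vecAdd kernel from Section II; the size N and launch configuration are examples, and error checking is omitted for brevity.

    #include <cuda_runtime.h>

    #define N 1024

    // Kernel: one thread per output element (same as the Section II sketch)
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        size_t bytes = N * sizeof(float);
        float h_a[N], h_b[N], h_c[N];                  // host (CPU) arrays
        for (int i = 0; i < N; ++i) { h_a[i] = i; h_b[i] = 2 * i; }

        float *d_a, *d_b, *d_c;                        // device (GPU) arrays
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);

        // 1. CPU prepares input data and copies it to GPU global memory
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // 2. CPU launches the kernel: a grid of blocks, each with many threads
        int threadsPerBlock = 256;
        int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
        vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, N);

        // 3. CPU obtains the results after the kernel terminates
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }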
CUDA Basics:

IV. CUDA Program Structure

V. Threads, Blocks and Grids

VI. Memory Handling
