
UNIT III PROGRAMMING GPUs

GPU Architectures – Data Parallelism – CUDA Basics – CUDA Program Structure – Threads, Blocks, Grids
– Memory Handling.

I. GPU Architectures
• GPU architecture is mainly driven by the following key factors:
1. Amount of data processed at one time (Parallel processing).
2. Processing speed on each data element (Clock frequency).
3. Amount of data transferred at one time (Memory bandwidth).
4. Time for each data element to be transferred (Memory latency).
• To begin with, let us look at the main design distinctions between a CPU and a GPU.
• A CPU consists of a few large cores with large caches and sophisticated control units, optimized for serial performance.
• A GPU, in contrast, consists of a large number of small cores running many lightweight threads, with small caches and minimal control logic, optimized for execution throughput.
• A GPU provides much higher instruction throughput and memory bandwidth than a CPU within a similar price and power envelope.

• GPU architecture focuses on keeping the available cores busy and is less concerned with low-latency cache access.
• In a generic many-core GPU, less chip area is devoted to control logic and caches, and a large number of transistors are devoted to parallel data processing. The following points describe this architecture.
1) The GPU consists of multiple Processor Clusters (PCs).

2) Each Processor Cluster (PC) contains multiple Streaming Multiprocessors (SMs).

3) Each Streaming Multiprocessor (SM) has a number of Streaming Processors (SPs), also known as cores, that share control logic and an L1 (level 1) instruction cache.

4) Each SM uses a dedicated L1 (level 1) cache and a shared L2 (level 2) cache before pulling data from global memory, i.e. Graphics Double Data Rate (GDDR) DRAM.

5) The number of SMs, and the number of cores per SM, varies with the targeted price and market segment of the GPU.

6) The global memory of a GPU consists of multiple gigabytes of DRAM. The growing size of global memory allows data to be kept on the GPU for longer, reducing transfers to and from the CPU.

7) GPU architecture is tolerant of memory latency; higher bandwidth makes up for it.

8) Compared with a CPU, a GPU works with fewer and smaller cache levels, because more of its transistors are dedicated to computation rather than to hiding memory access time.

9) The memory bus is optimized for bandwidth, allowing it to serve a large number of ALUs simultaneously.

10) GPU architecture is optimized for data-parallel, throughput-oriented computation.

11) To execute tasks in parallel, work is scheduled at the Processor Cluster (PC) or Streaming Multiprocessor (SM) level.
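
The architectural parameters listed above (SM count, global memory size, memory bus width) differ from GPU to GPU. As a minimal sketch, a program can query them at runtime through the CUDA runtime API; the values printed depend entirely on the installed device.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);

        for (int d = 0; d < deviceCount; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);

            printf("Device %d: %s\n", d, prop.name);
            // Number of Streaming Multiprocessors (SMs) on this GPU
            printf("  SM count             : %d\n", prop.multiProcessorCount);
            // Size of global (GDDR) memory
            printf("  Global memory (GB)   : %.1f\n",
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
            // On-chip shared memory available to one thread block
            printf("  Shared mem per block : %zu bytes\n", prop.sharedMemPerBlock);
            // Bus width is one factor behind the high memory bandwidth
            printf("  Memory bus width     : %d bits\n", prop.memoryBusWidth);
        }
        return 0;
    }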

II. Data Parallelism


• Data parallelism is a key concept in parallel computing in which a dataset is divided into smaller chunks and the same operation is performed concurrently on each chunk. This approach leverages the ability to process multiple data elements simultaneously, significantly speeding up computations.
• CPUs use task parallelism, wherein:
a. Multiple tasks map to multiple threads, and different tasks run different instructions.
b. Threads are generally heavyweight.
c. Programming is done for the individual thread.
• GPUs, in contrast, use data parallelism, wherein:
a. The same instruction is executed on different data.
b. Threads are generally lightweight.
c. Programming is done for batches of threads (e.g. one pixel shader per group of pixels).
• In data parallelism, performance improvement is achieved by applying the same small set of operations iteratively over multiple streams of data.
• It is simply a way of executing an application in parallel on multiple processors.
• The goal is to scale processing throughput by decomposing the data set into concurrent processing streams, all performing the same set of operations.
• The CPU application manages the GPU and uses it to offload specific computations.
• GPU code is encapsulated in parallel routines called kernels.
• The CPU executes the main program, which prepares the input data for GPU processing, invokes the kernel on the GPU, and then obtains the results after the kernel terminates.
• A GPU kernel maintains its own application state. A kernel is written as an ordinary sequential function, but it is executed in parallel by thousands of GPU threads.
• Data parallelism is achieved in SIMD (Single Instruction, Multiple Data) mode.
• In SIMD mode, an instruction is decoded only once, and multiple ALUs perform the work on multiple data elements in parallel.
• Either a single controller drives the parallel data operations, or multiple threads perform the same work on individual compute nodes (SPMD, Single Program Multiple Data).
• SIMD parallelism enhances the performance of computationally intensive applications that execute the same operation on distinct elements of a dataset.
• Modern applications process large amounts of data, which incurs significant execution time on sequential computers.
• Data parallelism is used to advantage in applications such as image processing, computer graphics, and linear algebra routines such as matrix multiplication; a minimal kernel sketch follows this list.
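
As a concrete illustration of data parallelism, the following is a minimal sketch of a CUDA vector-addition kernel (the names vecAdd, a, b and c are illustrative). Every thread executes the same instruction stream, but the index it computes from its block and thread IDs points it at a different element of the data.

    // Executed by thousands of GPU threads, each on a different element.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        // Global index of this thread within the whole launch
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                 // guard: the last block may contain extra threads
            c[i] = a[i] + b[i];    // same operation, different data element
    }

The kernel itself reads like ordinary sequential C; the parallelism comes from launching it over many threads, as shown in the program sketch in Section III.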

III. CUDA Hardware & CUDA Basics

CUDA Hardware:
• CUDA (Compute Unified Device Architecture) is a scalable parallel computing platform and programming model for general-purpose computing on GPUs.
• CUDA was introduced by NVIDIA in 2006.
• CUDA is a data-parallel extension to the C/C++ languages and an API for parallel programming.
• The CUDA parallel programming model has three key abstractions:
o (1) a hierarchy of thread groups,
o (2) shared memories, and
o (3) barrier synchronization.
• The programmer or compiler decomposes large computing problems into many small problems that can be solved in parallel.
• Programs written using CUDA harness the power of the GPU and thereby increase computing performance.
• In GPU-accelerated applications, the sequential part of the workload runs on the CPU (which is optimized for single-threaded performance), while the compute-intensive portion runs on thousands of GPU cores in parallel.
• Using CUDA, developers can utilize the power of GPUs to perform general computing tasks such as matrix multiplication and other linear algebra operations, instead of only graphics calculations.
• With CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB (and through APIs such as DirectCompute), and express parallelism through extensions in the form of a few basic keywords; a minimal program sketch follows this list.
• At a high level, a graphics card with a many-core GPU and high-speed graphics device memory sits inside a standard PC or server with one or two multicore CPUs.
• The GPU consists of multiple Streaming Multiprocessors (SMs), and each SM has a number of Streaming Processors (SPs), also known as cores. Each SM uses a dedicated L1 cache and a shared L2 cache, as described in Section I.
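
Putting the pieces together, a typical GPU-accelerated program follows the pattern described above: the CPU prepares the input, copies it to GPU global memory, launches a kernel over a grid of thread blocks, and copies the result back. The sketch below reuses the illustrative vecAdd kernel from Section II; the size N and launch configuration are examples, and error checking is omitted for brevity.

    #include <cuda_runtime.h>

    #define N 1024

    // Kernel: one thread per output element (same as the Section II sketch)
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        size_t bytes = N * sizeof(float);
        float h_a[N], h_b[N], h_c[N];                  // host (CPU) arrays
        for (int i = 0; i < N; ++i) { h_a[i] = i; h_b[i] = 2 * i; }

        float *d_a, *d_b, *d_c;                        // device (GPU) arrays
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);

        // 1. CPU prepares input data and copies it to GPU global memory
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // 2. CPU launches the kernel: a grid of blocks, each with many threads
        int threadsPerBlock = 256;
        int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
        vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, N);

        // 3. CPU obtains the results after the kernel terminates
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }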
CUDA Basics:

IV. CUDA Program Structure

V. Threads, Blocks and Grids

VI. Memory Handling
