
COMP206 – Computer Architecture

Lecture #14 – HW Accelerators (GPUs)

Buse Yılmaz, PhD.
Dept. of Computer Engineering, MEF University

Slides are original slides of the following sources, with modifications:
- Computer Organization and Architecture: Designing for Performance (10th Ed.) – William Stallings
- EuroCC Seminar – 2022 (several speakers)
- BAŞARIM’22 – Ümit Çatalyürek’s presentation
- HPC Talk – Buse Yılmaz, İSÜ’22

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
General-Purpose Graphic Processing Units (GPGPU)
• Mostly used for running applications that are heavy on graphics
  - 3D modeling software
  - VDI infrastructures
  - Game graphics, simulations, etc.

• Today, GPGPUs accelerate computational workloads in modern High Performance Computing (HPC) landscapes
• GPUs are
  - a type of accelerator
  - coupled with a CPU
  - not capable of running an OS
  - the devices to which we offload CUDA code

https://core.vmware.com/resource/exploring-gpu-architecture#section1
General-Purpose Graphic Processing Units (GPGPU)
• CPU: minimize latency
  - Optimized to finish a task with as low a latency as possible, while keeping the ability to quickly switch between operations

• GPU: maximize throughput
  - The GPU has many more cores to process a task; the motivation is to push as many tasks as possible through its internals at once
  - It puts all available cores to work and is less focused on low-latency cache memory access

https://core.vmware.com/resource/exploring-gpu-architecture#section1
Figure 19.2 CPU vs. GPU Silicon Area/Transistor Dedication (the CPU die is dominated by control logic and cache; the GPU die is dominated by many ALUs with small control and cache areas; each chip is attached to its own DRAM)

CPU: Control logic and cache memory make up the majority of the CPU’s real estate.

GPU: uses a massively parallel SIMD (single instruction, multiple data) architecture to perform mainly mathematical operations: it runs the same thread of code on large amounts of data.

The GPU doesn’t require the same complex capabilities as the CPU’s control logic (out-of-order execution, branch prediction, data-hazard handling, etc.).

GPUs are able to hide memory latency by managing the execution of more threads than there are available processor cores.
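As a minimal sketch of this execution model (an assumed example, not from the slides), the same kernel body is applied to every element of a large array, and far more threads are launched than there are physical cores, so the SMs always have warps ready to run while other warps wait on memory:

// SIMT in practice: the same code runs for every data element;
// each thread processes exactly one element of a large array.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void) {
    int n = 1 << 22;                        // ~4M elements
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launch far more threads than cores: the SM switches to warps that are
    // ready while other warps wait on memory (latency hiding).
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}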
CPU vs. GPU

EuroCC Seminar – 2022 (several speakers)

CPU

https://core.vmware.com/resource/exploring-gpu-architecture#section1

GPU
CUDA core = Streaming Processor (shader unit)

A Streaming Multiprocessor (SM) contains streaming processors (SPs)

An SM executes one instruction at a time on all of its SPs

Warp: the basic unit of execution (32 threads)

https://core.vmware.com/resource/exploring-gpu-architecture#section4
Threads are uniformly bundled into thread blocks
Grid: the number of blocks per kernel launch
Grid / block dimensions: 1D, 2D, or 3D; the grid and its blocks need not have the same dimensions

Block: assigned to only one of the GPU’s several streaming multiprocessors (SMs).
A block is never split between SMs, and it consists of warps.

There is a maximum number of threads per block that an SM will accept (otherwise the code won’t compile).

Distribute the load as uniformly as possible (the number of thread blocks launched should be no less than the number of SMs on the GPU).
Finding the optimum configuration can be a very time-consuming and daunting process; a launch-configuration sketch follows the figure below.

NVIDIA H100: thread block clusters (a granularity larger than a single thread block on a single SM)

Figure 19.1 Relationship Among Threads, Blocks, and a Grid (a grid of blocks Block(0,0)–Block(2,1); each block, e.g. Block(1,1), contains threads Thread(0,0)–Thread(3,2))
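As a small illustrative sketch (names and sizes assumed, not from the slides), grid and block dimensions are chosen at launch time with dim3, and each thread recovers its global position from blockIdx, blockDim, and threadIdx:

// Hypothetical 2D example: one thread per pixel of a width x height image.
__global__ void invert(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index in the grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index in the grid
    if (x < width && y < height)                     // guard the ragged edge
        img[y * width + x] = 255 - img[y * width + x];
}

int main(void) {
    int width = 1920, height = 1080;
    unsigned char *d_img;
    cudaMalloc(&d_img, width * height);

    // 2D blocks of 16x16 = 256 threads; enough 2D blocks to cover the image.
    // The blocks are distributed across the GPU's SMs.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    invert<<<grid, block>>>(d_img, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_img);
    return 0;
}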
Table 19.1 CUDA Terms to GPU’s Hardware Components Equivalence Mapping

EuroCC Seminar – 2022 (several speakers)

EuroCC Seminar – 2022 (several speakers)
Memory Management

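The memory-management slide itself is a figure; as a minimal hedged sketch of the host-side pattern it refers to (allocate on the device, copy in, compute, copy back, free), with all names assumed:

#include <stdlib.h>

int main(void) {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float*)malloc(bytes);   // host (CPU) memory
    float *d_a;
    cudaMalloc(&d_a, bytes);              // device (GPU) memory

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernels that read/write d_a here ...
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(d_a);
    free(h_a);
    return 0;
}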
Streaming Multiprocessor (SM)
• Thousands of registers that can be partitioned among threads of execution

• Several caches:
  - Shared memory for fast data interchange between threads (see the sketch below)
  - Constant cache for fast broadcast of reads from constant memory
  - Texture cache to aggregate bandwidth from texture memory
  - L1 cache to reduce latency to local or global memory

• Warp schedulers that can quickly switch contexts between threads and issue instructions to warps that are ready to execute

• Execution cores for integer and floating-point operations:
  - Integer and single-precision floating-point operations
  - Double-precision floating point
  - Special Function Units (SFUs) for single-precision floating-point transcendental functions (NVIDIA Fermi & Kepler)

CUDA Handbook: A Comprehensive Guide to GPU Programming – Nicholas Wilt, published Jun 12, 2013 by Addison-Wesley Professional.
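A minimal sketch (an assumed example, not from the slides) of using shared memory for data interchange between the threads of one block: each block computes a partial sum of its 256 input elements.

// Each thread block sums 256 elements using shared memory, the per-SM
// scratchpad used for fast data interchange between threads of a block.
__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[256];
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                          // make all loads visible to the block

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                      // finish each reduction step together
    }
    if (tid == 0) out[blockIdx.x] = buf[0];   // one partial sum per block
}

int main(void) {
    const int n = 1 << 16, block = 256, grid = n / block;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, grid * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    blockSum<<<grid, block>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}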
Figure 19.5 Single SM Architecture (Fermi): instruction cache; two warp schedulers, each with a dispatch unit; a 32k x 32-bit register file; 32 CUDA cores (each containing a dispatch port, operand collector, an FP unit, an integer unit, and a result queue); 16 load/store (Ld/St) units; 4 Special Function Units (SFUs); an interconnect network; 64-kB shared memory/L1 cache; and a uniform cache.

Special Function Units (SFUs), quoting the NVIDIA White Paper on Fermi, "execute transcendental instructions such as sin, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock."
https://forums.developer.nvidia.com/t/fermi-and-kepler-gpu-special-function-units/28345


Figure 19.6 Dual Warp Schedulers and Instruction Dispatch Units Run Example: each of the two warp schedulers picks a warp that is ready and its instruction dispatch unit issues that warp’s next instruction, e.g. (time running downward)
  Scheduler 0: Warp 8 instruction 11, Warp 2 instruction 42, Warp 14 instruction 95, Warp 8 instruction 12, Warp 14 instruction 96, Warp 2 instruction 43
  Scheduler 1: Warp 9 instruction 11, Warp 3 instruction 33, Warp 15 instruction 95, Warp 9 instruction 12, Warp 3 instruction 34, Warp 15 instruction 96
EuroCC Seminar – 2022 (several speakers)

Compute Capabilities for some GPUs

EuroCC Seminar – 2022 (several speakers)
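The compute-capability table itself is shown as a figure; as a hedged sketch, the same major.minor information can be queried at runtime from the CUDA runtime API:

#include <stdio.h>

int main(void) {
    // Query each visible GPU's compute capability (major.minor) at runtime.
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d, %d SMs\n",
               d, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}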
Tensor Cores



Physical attributes such as size and configuration, and the arrangement of its components
https://www.nvidia.com/en-us/data-center/tensor-cores/
https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/
Occupancy – Saturate the GPU
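The occupancy slides themselves are figures; as a hedged sketch of how occupancy can be inspected programmatically (the kernel name and block size are assumed), the CUDA occupancy API reports how many blocks of a given size can be resident on one SM for a particular kernel:

#include <stdio.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void) {
    int device = 0, blockSize = 256, maxBlocksPerSM = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // How many blocks of `blockSize` threads can be resident on one SM for
    // this kernel, given its register and shared-memory usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, saxpy, blockSize, 0);

    float occupancy = (float)(maxBlocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("Occupancy: %.0f%% (%d blocks of %d threads per SM)\n",
           occupancy * 100.0f, maxBlocksPerSM, blockSize);
    return 0;
}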
Compute Unified Device Architecture (CUDA)
• A parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce
• CUDA C is a C/C++-based language
• A program can be divided into three general sections:
  - Code to be run on the host (CPU)
  - Code to be run on the device (GPU)
  - The code related to the transfer of data between the host and the device

• The data-parallel code to be run on the GPU is called a kernel
  - It typically has few to no branching statements
  - Branching statements in the kernel result in serial execution of the threads in the GPU hardware (see the divergence sketch below)

• A thread is a single instance of the kernel function
  - The programmer defines the number of threads launched when the kernel function is called
  - The total number of threads defined is typically in the thousands, to maximize the utilization of the GPU processor cores as well as the available speedup
  - The programmer specifies how these threads are to be bundled
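As a minimal illustrative sketch (assumed example, not from the slides) of why branching hurts: when threads of the same warp take different paths, the warp executes both paths one after the other, masking off the threads that did not take the current path.

// Threads of the same warp diverge below, so the warp runs the if-branch
// and the else-branch serially, with inactive threads masked off.
__global__ void divergent(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = 2 * i;        // executed first, odd-indexed threads idle
    else
        out[i] = 2 * i + 1;    // executed second, even-indexed threads idle
}

int main(void) {
    int *d_out;
    cudaMalloc(&d_out, 256 * sizeof(int));
    divergent<<<1, 256>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}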
CUDA API example

EuroCC Seminar – 2022 (several speakers)

CUDA Hello World

EuroCC Seminar – 2022 (several speakers)

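The original slide shows the example as an image; a minimal stand-in sketch of a CUDA "Hello World" (assumed, not the exact code from the seminar) looks like this:

#include <stdio.h>

// Each GPU thread prints its own identity.
__global__ void hello() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    hello<<<2, 4>>>();          // 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // wait for the device-side printf to finish
    return 0;
}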
Simple CUDA Code
#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y) {  // Single-precision A*X Plus Y
  // index of the thread within its thread block and of the thread block within the grid, respectively
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void) {
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  // <<< number of thread blocks in the grid, number of threads in a thread block >>>
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/
CUDA Computing

• CUDA compiler: nvcc

• ISA: Parallel Thread Execution (PTX)
  - programming model
  - explicitly parallel: a PTX program specifies the execution of a given thread of a parallel thread array. A cooperative thread array (CTA) is an array of threads that execute a kernel concurrently or in parallel

https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html
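As a brief usage sketch (the file name saxpy.cu is assumed), the SAXPY example above can be compiled, run, and lowered to PTX with nvcc:

# compile and run the SAXPY example
nvcc saxpy.cu -o saxpy
./saxpy

# emit the intermediate PTX for inspection instead of building an executable
nvcc -ptx saxpy.cu -o saxpy.ptx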
Parallel Thread Execution (PTX)

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#syntax

https://docs.nvidia.com/cuda/parallel-thread-execution/
