COMP206 – Computer Architecture
Lecture #14 – HW Accelerators (GPUs)
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Buse Yılmaz, PhD.
Dept. of Computer Engineering, MEF University
Slides are based on the original slides of the following sources, with modifications:
Computer Organization and Architecture: Designing for Performance (10th Ed.) – William Stallings
EuroCC Seminar – 2022 (several speakers)
BAŞARIM’22 Ümit Çatalyürek’s presentation
HPC Talk – Buse Yılmaz, İSÜ’22
General-Purpose Graphics Processing Units (GPGPU)
• Mostly used for running graphics-heavy applications:
3D modeling software
VDI infrastructures
Game graphics, simulations, etc.
• Today, GPGPUs accelerate computational workloads in modern High Performance Computing (HPC) landscapes
• GPUs are:
a type of accelerator
coupled with a CPU
not capable of running an OS on their own
the targets we offload CUDA code to
https://core.vmware.com/resource/exploring-gpu-architecture#section1
General-Purpose Graphics Processing Units (GPGPU)
• CPU: minimize latency
Optimized to finish a task with as low a latency as possible, while keeping the ability to quickly switch between operations
• GPU: maximize throughput
The GPU has many more cores to process a task; the motivation is to push as many tasks as possible through its internals at once
It puts all available cores to work and is less focused on low-latency cache memory access
https://core.vmware.com/resource/exploring-gpu-architecture#section1
Figure 19.2 CPU vs. GPU Silicon Area/Transistor Dedication
CPU: Control logic and cache memory make up the majority of the CPU’s real
estate.
GPU: uses a massively parallel SIMD (single instruction, multiple data) architecture to perform mainly mathematical operations: it runs the same thread of code on large amounts of data
The GPU doesn’t require the same complex capabilities as the CPU’s control logic (out-of-order execution, branch prediction, data hazards, etc.)
GPUs are able to hide memory latency by managing the execution of more
threads than available processor cores
CPU vs. GPU
EuroCC Seminar – 2022 (several speakers)
CPU
https://core.vmware.com/resource/exploring-gpu-architecture#section1
GPU
CUDA core = Streaming Processor (SP, shader unit)
A streaming multiprocessor (SM) contains streaming processors (SPs)
An SM executes one instruction at a time on all of its SPs
Warp: basic unit of execution (32 threads)
https://core.vmware.com/resource/exploring-gpu-architecture#section4
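Warp size and SM count are hardware properties that can be queried at run time. A minimal sketch, assuming the CUDA runtime API and a CUDA-capable device 0:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0 (assumed present)

    printf("Device: %s\n", prop.name);
    printf("Streaming multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Warp size (threads): %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}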
Threads are uniformly bundled into thread blocks
Grid: the number of blocks per kernel launch
Grid / block dimensions: 1D, 2D, or 3D; they need not have the same dimensions (see the launch sketch after the figure caption below)
Block: assigned to only one of the GPU’s several streaming multiprocessors (SMs). A block is never split between SMs, and it consists of warps.
There is a maximum number of threads per block that an SM will accept (code won’t compile)
Distribute the load as uniformly as possible (the number of thread blocks launched should be no less than the number of SMs on the GPU.)
Finding the optimum configuration can be a very time-consuming and daunting process.
NVIDIA H100: thread block clusters (a granularity larger than a single thread block on a single SM)
Figure 19.1 Relationship Among Threads, Blocks, and a Grid
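As a concrete illustration of the grid/block/thread hierarchy in Figure 19.1, a hedged launch-configuration sketch (kernel name, matrix size, and block shape are illustrative choices, not from the slides):

#include <cuda_runtime.h>

#define WIDTH  1024   // illustrative matrix size
#define HEIGHT  768

// Each thread scales one element of a WIDTH x HEIGHT matrix.
__global__ void scale2D(float *data, float factor) {
    // Global 2D index = block offset + thread offset within the block.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < WIDTH && row < HEIGHT)        // guard threads that fall outside the matrix
        data[row * WIDTH + col] *= factor;
}

int main(void) {
    float *d_data;
    cudaMalloc(&d_data, WIDTH * HEIGHT * sizeof(float));

    dim3 block(16, 16);                           // 256 threads per block, a multiple of the warp size
    dim3 grid((WIDTH  + block.x - 1) / block.x,   // enough blocks to cover every element
              (HEIGHT + block.y - 1) / block.y);
    scale2D<<<grid, block>>>(d_data, 2.0f);       // a grid of blocks of threads, as in Figure 19.1
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}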
Table 19.1 CUDA Terms to GPU’s Hardware Components Equivalence Mapping
EuroCC Seminar – 2022 (several speakers)
Memory Management
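The memory-management slide itself is an image; a minimal sketch of the explicit host/device transfer pattern it refers to (buffer name and size are illustrative), assuming the CUDA runtime API:

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const int N = 1024;                    // illustrative element count
    size_t bytes = N * sizeof(float);

    float *h_buf = (float*)malloc(bytes);  // host (CPU) memory
    float *d_buf;
    cudaMalloc(&d_buf, bytes);             // device (GPU) global memory

    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernels that operate on d_buf ...
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);   // device -> host

    // Alternative: cudaMallocManaged() allocates unified memory that the CUDA
    // driver migrates between host and device on demand.
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}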
Streaming Multiprocessor (SM)
• Thousands of registers that can be partitioned among threads of
execution
• Several caches:
Shared memory for fast data interchange between threads
Constant cache for fast broadcast of reads from constant memory
Texture cache to aggregate bandwidth from texture memory
L1 cache to reduce latency to local or global memory
• Warp schedulers that can quickly switch contexts between
threads and issue instructions to warps that are ready to execute
• Execution cores for integer and floating-point operations:
Integer and single-precision floating point operations
Double-precision floating point
Special Function Units (SFUs) for single-precision floating-point transcendental functions (NVIDIA Fermi & Kepler)
CUDA Handbook: A Comprehensive Guide to GPU Programming, Nicholas Wilt, Addison-Wesley Professional, 2013.
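Shared memory is the part of the SM’s memory most directly exposed to the programmer. A hedged sketch (kernel name and block size are illustrative) of a block-level sum that stages data in __shared__ memory so the threads of one block can exchange it quickly:

#include <cuda_runtime.h>

#define BLOCK 256   // threads per block (illustrative)

// Each block reduces BLOCK consecutive elements of 'in' to one partial sum in 'out'.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK];            // on-chip shared memory, one copy per block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                         // wait until every thread has loaded its element

    // Tree reduction inside the block: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];           // one partial sum per block
}

int main(void) {
    const int N = 1 << 20;
    const int blocks = (N + BLOCK - 1) / BLOCK;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    blockSum<<<blocks, BLOCK>>>(d_in, d_out, N);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}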
Figure 19.5 Single SM Architecture
(The figure shows a single SM with: an instruction cache; two warp schedulers, each with a dispatch unit; a 32k x 32-bit register file; CUDA cores, each containing a dispatch port, an operand collector, an FP unit, an Int unit, and a result queue; load/store (Ld/St) units; Special Function Units (SFUs); an interconnect network; 64-kB shared memory/L1 cache; and a uniform cache.)
Special Function Units (SFUs), quoting the NVIDIA white paper on Fermi, "execute transcendental instructions such as sin, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock"
https://forums.developer.nvidia.com/t/fermi-and-kepler-gpu-special-function-units/28345
Figure 19.6 Dual Warp Schedulers and Instruction Dispatch Units Run Example
(Each column is one warp scheduler with its instruction dispatch unit; time runs downward.)

WARP Scheduler / Dispatch Unit    WARP Scheduler / Dispatch Unit
Warp 8 instruction 11             Warp 9 instruction 11
Warp 2 instruction 42             Warp 3 instruction 33
Warp 14 instruction 95            Warp 15 instruction 95
Warp 8 instruction 12             Warp 9 instruction 12
Warp 14 instruction 96            Warp 3 instruction 34
Warp 2 instruction 43             Warp 15 instruction 96
EuroCC Seminar – 2022 (several speakers)
Compute Capabilities for some GPUs
EuroCC Seminar – 2022 (several speakers)
Tensor Cores
Physical attributes such as size and configuration, and the arrangement of its components
https://www.nvidia.com/en-us/data-center/tensor-cores/
https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/
Occupancy – Saturate the GPU
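The occupancy slides are image-only; as a hedged illustration of the idea (kernel and block size below are illustrative), the CUDA runtime can report how many blocks of a given kernel fit on one SM, which bounds how well a launch saturates the GPU:

#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative kernel; the occupancy query only looks at its resource usage.
__global__ void dummyKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = (float)i;
}

int main(void) {
    int blockSize = 256;                     // illustrative threads per block
    int blocksPerSM = 0;

    // How many resident blocks of dummyKernel fit on one SM at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Occupancy bound = resident warps / maximum warps an SM can hold.
    int residentWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Blocks per SM: %d, occupancy bound: %.0f%%\n",
           blocksPerSM, 100.0 * residentWarps / maxWarps);
    return 0;
}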
Compute Unified Device Architecture (CUDA)
• A parallel computing platform and programming model created by NVIDIA and
implemented by the graphics processing units (GPUs) that they produce
• CUDA C is a C/C++ based language
• Program can be divided into three general sections
Code to be run on the host (CPU)
Code to be run on the device (GPU)
The code related to the transfer of data between the host and the device
The data-parallel code to be run on the GPU is called a kernel
Typically will have few to no branching statements
Branching statements in the kernel result in serialized execution of the diverging threads within a warp in the GPU hardware (see the divergence sketch after this list)
A thread is a single instance of the kernel function
The programmer defines the number of threads launched when the kernel
function is called
The total number of threads defined is typically in the thousands, to maximize the utilization of the GPU processor cores as well as the available speedup
The programmer specifies how these threads are to be bundled
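A hedged sketch of the branching point above (kernel and sizes are illustrative): when threads of the same warp take different sides of an if/else, the hardware executes the two paths one after the other, masking off the inactive threads.

#include <cuda_runtime.h>

__global__ void divergent(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Even and odd threads of the same 32-thread warp take different branches,
    // so the warp runs the 'if' path and then the 'else' path serially.
    if (threadIdx.x % 2 == 0)
        out[i] = 2.0f * i;
    else
        out[i] = 0.5f * i;
    // A condition that is uniform across the warp (e.g. one based on blockIdx.x)
    // would not cause divergence.
}

int main(void) {
    float *d_out;
    cudaMalloc(&d_out, 1024 * sizeof(float));
    divergent<<<4, 256>>>(d_out);            // 4 blocks x 256 threads cover 1024 elements
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}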
CUDA API example
EuroCC Seminar – 2022 (several speakers)
CUDA Hello World
EuroCC Seminar – 2022 (several speakers)
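The Hello World slide itself is an image; a minimal sketch of the usual CUDA hello-world pattern (assuming a device of compute capability 2.0 or newer, which supports printf from device code):

#include <stdio.h>

// Kernel: every launched thread prints its block and thread index.
__global__ void helloFromGPU(void) {
    printf("Hello World from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    helloFromGPU<<<2, 4>>>();      // 2 blocks x 4 threads = 8 messages
    cudaDeviceSynchronize();       // wait for the GPU before the program exits
    return 0;
}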
Simple CUDA Code
#include <stdio.h>
#include <math.h>
__global__
void saxpy(int n, float a, float *x, float *y) { // Single-precision A*X Plus Y
// index of the thread within its thread block and the thread block within the grid, respectively.
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
int main(void){
int N = 1<<20;
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
// Perform SAXPY on 1M elements
// <<< number of thread blocks in the grid, the number of threads in a thread block>>>
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
printf("Max error: %f\n", maxError);
cudaFree(d_x);
cudaFree(d_y);
free(x);
free(y);
}
https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/
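One practical note on the code above: kernel launches and asynchronous calls do not report failures by themselves. A hedged sketch of a common checking pattern (the macro name is illustrative) that could be dropped into the example:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative helper: abort with a readable message if a CUDA runtime call failed.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(1);                                                       \
        }                                                                  \
    } while (0)

// Usage after a kernel launch (the launch itself returns no status):
//   saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
//   CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised during execution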
CUDA Computing
• CUDA compiler: nvcc
• ISA: Parallel Thread Execution (PTX)
The programming model is explicitly parallel: a PTX program specifies the execution of a given thread of a parallel thread array. A cooperative thread array (CTA) is an array of threads that execute a kernel concurrently or in parallel
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html
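For reference, a hedged sketch of how this fits together on the command line (the file name is illustrative; the flags are standard nvcc options documented at the link above):

# Compile a CUDA source file into an executable.
nvcc saxpy.cu -o saxpy

# Stop at the PTX stage so the generated virtual ISA can be inspected.
nvcc -ptx saxpy.cu -o saxpy.ptx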
Parallel Thread Execution (PTX)
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#syntax
https://docs.nvidia.com/cuda/parallel-thread-execution/