GPU Computing CIS-543
Lecture 08: CUDA Memory Model
Dr. Muhammad Abid,
DCIS, PIEAS
Aside
Arrangement of threads in a thread block (i.e., the mapping of threads to data):
CUDA execution units do not care about it.
Memory performance, however, depends strongly on the arrangement of threads in a thread block.
Example: a 2D thread block of 1024 threads can be arranged as 32 x 32, 16 x 64, or 1 x 1024. Why does one execution configuration perform better than another? The memory access pattern.
Introducing the CUDA Memory Model
The performance of many HPC applications is limited by memory bandwidth, i.e., how rapidly they can load and store data.
Computing systems employ a memory hierarchy so that memory appears both large and fast to applications that exhibit the principle of locality.
CUDA Memory Model
To programmers, there are generally two
classifications of memory:
Programmable: You explicitly control what data is
placed in programmable memory.
Non-programmable: You have no control over data placement and rely on automatic techniques to achieve good performance, e.g., the CPU's L1/L2 caches.
The CUDA memory model exposes several types of programmable memory:
Registers, shared memory, local memory, constant memory, global memory
CUDA Memory Model
Each thread can:
R/W per-thread registers and local memory
R/W per-block shared memory
R/W per-grid global memory (~500 cycles)
Read per-grid constant and texture memory
CUDA Memory Model
Each memory has a different scope, lifetime,
and caching behavior.
Variable declaration     Memory    Scope    Lifetime       Declared
int LocalVar;            register  thread   thread         inside kernel
float var[100];          local     thread   thread         inside kernel
__shared__ int Var;      shared    block    block          inside kernel
__device__ int Var;      global    grid     application    outside kernel
__constant__ int Var;    constant  grid     application    outside kernel
Registers
Registers are the fastest memory space on a GPU.
An automatic variable declared in a kernel
without any type qualifier is generally stored
in a register.
Register variables are private to each thread.
Registers
A kernel typically uses registers to hold
frequently accessed thread-private variables.
Register limit per thread: 63 (Fermi), 255 (Kepler)
nvcc -Xptxas -v displays, for each kernel:
number of registers used
bytes of shared memory
bytes of constant memory
bytes of spill loads/stores
bytes of stack frame
Register Spilling
If a kernel uses more registers than the hardware limit, the excess registers spill over to local memory. This register spilling can have adverse performance consequences.
The nvcc compiler uses heuristics to minimize register usage and avoid register spilling. You can optionally aid these heuristics by providing additional information to the compiler for each kernel in the form of launch bounds, as in the sketch below.
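A minimal sketch of a kernel annotated with __launch_bounds__; the kernel name and the two bound values (maximum threads per block, minimum resident blocks per SM) are illustrative, not tuned numbers:

__global__ void
__launch_bounds__(256, 4)    // illustrative: at most 256 threads/block, at least 4 resident blocks/SM
boundedKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];    // trivial body; the point is the qualifier
}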
Local Memory
Located in device memory, so it has high latency and low bandwidth
Requires efficient memory access patterns
Variables the compiler is likely to place in local memory:
Local arrays referenced with indices whose values cannot be determined at compile time
Large local structures or arrays that would consume too much register space
Any variable that does not fit within the kernel's register limit
Local Memory
For GPUs with compute capability >=2.0,
local memory data is also cached in a per-
SM L1 and per-device L2 cache
Shared Memory
On-chip memory: low latency, high bandwidth
Declared with the __shared__ qualifier in a kernel
Partitioned among thread blocks
More shared memory per thread block means fewer resident thread blocks per SM, and hence fewer active warps
The basic means of inter-thread communication within a thread block (see the sketch below)
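A minimal sketch of a statically sized shared-memory tile used for communication within a block; the tile size, kernel name, and the block-local reversal are illustrative (launch with blockDim.x == TILE and a data length that is a multiple of TILE):

#define TILE 256                      // illustrative: must match the block size

__global__ void reverseInBlock(int *d_data)
{
    __shared__ int tile[TILE];        // one tile per thread block
    int t = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + t;

    tile[t] = d_data[idx];            // each thread stages one element
    __syncthreads();                  // make the whole tile visible to the block

    d_data[idx] = tile[blockDim.x - 1 - t];   // threads exchange data through the tile
}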
Shared Memory
On-chip memory is partitioned between the L1 cache and shared memory for each SM
The partitioning can be configured dynamically, on a per-kernel basis, at runtime using:
cudaError_t cudaFuncSetCacheConfig(const void* func, enum cudaFuncCache cacheConfig);
cacheConfig values: cudaFuncCachePreferNone (default), cudaFuncCachePreferShared, cudaFuncCachePreferL1, cudaFuncCachePreferEqual
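A hedged usage sketch; myKernel is a placeholder name for an existing __global__ function:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *d) { d[threadIdx.x] += 1.0f; }   // placeholder kernel

int main()
{
    // Favor shared memory over L1 for this kernel; applies to its subsequent launches.
    cudaError_t err = cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    if (err != cudaSuccess)
        printf("cudaFuncSetCacheConfig failed: %s\n", cudaGetErrorString(err));
    return 0;
}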
Constant Memory
Read-only (from device code) memory located in device memory
Cached in a per-SM constant cache
Declared at global scope, outside of any kernel, with the __constant__ qualifier
64 KB of constant memory for all compute capabilities
Constant Memory
Constant memory performs best when all threads in a warp read from the same address, because a single read from constant memory is broadcast to all threads in the warp (see the sketch below).
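A minimal sketch combining the two points above: a file-scope __constant__ declaration written from the host with cudaMemcpyToSymbol; the array name and size are illustrative:

#include <cuda_runtime.h>

__constant__ float coef[9];               // declared at file scope, outside any kernel

__global__ void applyCoef(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // All threads in a warp read coef[0]: a single broadcast from the constant cache.
    if (i < n) out[i] = coef[0] * out[i];
}

void setupCoef(const float *h_coef)
{
    // The host writes constant memory through the runtime, not from a kernel.
    cudaMemcpyToSymbol(coef, h_coef, 9 * sizeof(float));
}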
Global Memory
Located in device memory
High latency, large in size, and the most commonly used memory on a GPU
Can be allocated statically or dynamically
To allocate dynamically, use cudaMalloc() (see the sketch below)
To allocate statically, use the __device__ qualifier and declare the variable outside of any kernel:
__device__ int vec[1000];
Scope: all threads running on the GPU can R/W it
Lifetime: the application
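A minimal sketch of dynamic global-memory allocation; the size and names are illustrative:

#include <cuda_runtime.h>

int main()
{
    const int n = 1000;                               // illustrative size
    float *d_vec = nullptr;

    cudaMalloc((void **)&d_vec, n * sizeof(float));   // dynamic global-memory allocation
    cudaMemset(d_vec, 0, n * sizeof(float));          // device-side initialization
    // ... launch kernels that read/write d_vec ...
    cudaFree(d_vec);
    return 0;
}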
Global Memory
Accessed via naturally aligned 32-byte, 64-byte, or 128-byte memory transactions
When a warp performs a memory load/store, the number of transactions required to satisfy the request typically depends on two factors:
Distribution of memory addresses across the threads of the warp
Alignment of memory addresses per transaction
In general, the more transactions a request needs, the more likely it is that unused bytes are transferred, reducing throughput efficiency.
GPU Caches
Per-SM caches:
L1: caches local and global memory accesses and register spills; caching of global loads can be disabled; stores are not cached in L1
Read-only constant cache: caches constant memory
Read-only cache: caches texture memory, and also global loads on compute capability 3.5 and higher
Per-device cache, shared by all SMs:
L2: serves all load, store, and texture requests and provides efficient, high-speed data sharing across the GPU
GPU Caches
Figure: GPU memory and cache hierarchy (cached in L1 only on compute capability 2.x devices).
Pinned or Page-locked Memory
The C malloc() function allocates pageable memory, which is subject to page-fault operations.
The GPU cannot safely access data in pageable host memory because it has no control over when the host operating system may physically move that data.
Pinned or Page-locked Memory
The CUDA runtime allows you to allocate pinned host memory directly (see the usage sketch below) using:
cudaError_t cudaMallocHost(void **devPtr, size_t count);
cudaError_t cudaFreeHost(void *ptr);
Pinned memory can be read and written with much higher bandwidth than pageable memory.
Excessive allocation of pinned memory may degrade host system performance.
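A minimal usage sketch; the size is illustrative:

#include <cuda_runtime.h>

int main()
{
    const size_t nBytes = 1 << 22;                    // illustrative: 4 MB
    float *h_pinned = nullptr;

    cudaMallocHost((void **)&h_pinned, nBytes);       // page-locked host allocation
    // ... fill h_pinned, then cudaMemcpy to/from the device at full bus bandwidth ...
    cudaFreeHost(h_pinned);
    return 0;
}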
Zero-Copy Memory
Both the host and the device can access zero-copy memory.
Zero-copy memory is pinned memory mapped into both the device address space and the host address space.
Use the following functions to create a mapped, pinned memory region and obtain its device pointer (see the sketch below):
cudaError_t cudaHostAlloc(void **pHost, size_t count, unsigned int flags);
cudaError_t cudaFreeHost(void *ptr);
cudaError_t cudaHostGetDevicePointer(void **pDevice, void *pHost, unsigned int flags);
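A minimal sketch assuming a system without UVA (the later UVA slides remove the cudaHostGetDevicePointer step); sizes and names are illustrative:

#include <cuda_runtime.h>

int main()
{
    const size_t nBytes = 1024 * sizeof(float);       // illustrative size
    float *h_zero = nullptr, *d_zero = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);            // allow mapping host memory

    // Mapped, pinned host allocation visible to the device.
    cudaHostAlloc((void **)&h_zero, nBytes, cudaHostAllocMapped);

    // Device pointer for the same physical memory (flags must be 0).
    cudaHostGetDevicePointer((void **)&d_zero, h_zero, 0);

    // ... pass d_zero to a kernel; every access crosses the PCIe bus ...
    cudaFreeHost(h_zero);
    return 0;
}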
Zero-Copy Memory
Advantages of using zero-copy memory in CUDA kernels:
Leveraging host memory when there is insufficient device memory
Avoiding explicit data transfer between the host and device
Sharing data between the host and device
Zero-Copy Memory
Disadvantage:
Using zero-copy memory as a supplement to device memory for frequent read/write operations will significantly slow performance. Because every memory transaction to mapped memory must pass over the PCIe bus, a significant amount of latency is added even compared to global memory.
Aside: Zero-copy
Two common categories of heterogeneous
computing system architectures:
Integrated and discrete.
In integrated architectures, CPUs and GPUs are
fused onto a single die and physically share main
memory. In this architecture, zero-copy memory is
more likely to benefit both performance and
programmability because no copies over the PCIe
bus are necessary.
Aside: Zero-copy
For discrete systems with devices connected to the host via the PCIe bus, zero-copy memory is advantageous only in special cases. Be careful not to overuse zero-copy memory: device kernels that read from zero-copy memory can be very slow due to its high latency.
Unified Virtual Addressing (UVA)
UVA provides a single virtual memory
address space for all processors in the
system.
Host memory and device memory share a
single virtual address space
Unified Virtual Addressing (UVA)
Under UVA, pinned host memory allocated with cudaHostAlloc() has identical host and device pointers, so you can pass the returned pointer directly to a kernel function (see the sketch below).
Without UVA you would:
Allocate mapped, pinned host memory.
Acquire the device pointer to the mapped, pinned memory using a CUDA runtime function.
Pass the device pointer to your kernel.
With UVA, there is no need to acquire the device pointer or manage two pointers to what is physically the same data.
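A minimal sketch under UVA; the kernel, size, and launch configuration are illustrative:

#include <cuda_runtime.h>

__global__ void incr(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1024;                               // illustrative size
    float *h_data = nullptr;

    cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);

    // Under UVA the host pointer is also valid on the device:
    // no cudaHostGetDevicePointer() call is needed.
    incr<<<(n + 255) / 256, 256>>>(h_data, n);
    cudaDeviceSynchronize();

    cudaFreeHost(h_data);
    return 0;
}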
Memory Access Pattern
Memory access patterns determine how efficiently the device uses memory bandwidth.
This applies to all memory types that reside in device memory, e.g., global, local, constant, and texture memory.
CUDA applications use global memory heavily, so applications must optimize their global memory access patterns.
Like instruction issue and execution, memory operations are issued on a per-warp basis.
Memory Access Pattern
Two key characteristics of a memory access pattern:
Aligned memory accesses
Coalesced memory accesses
Aligned memory accesses occur when the first address of a device memory transaction is an even multiple of the cache granularity used to service the transaction (either 32 bytes for the L2 cache or 128 bytes for the L1 cache).
Performing a misaligned load wastes bandwidth.
Memory Access Pattern
Coalesced memory accesses occur when all 32 threads in a warp access a contiguous chunk of memory.
Aligned, coalesced memory accesses are ideal: they maximize global memory throughput.
Memory Access Pattern
Figure: (a) aligned and coalesced accesses; (b) misaligned and uncoalesced accesses.
Device Memory Reads
In an SM, data is pipelined through one of three cache/buffer paths, depending on the type of device memory being referenced:
the L1/L2 cache, the constant cache, or the read-only cache.
L1 Caching of Global Loads
Check whether the GPU supports caching of global loads in L1 using the globalL1CacheSupported field of the cudaDeviceProp structure (see the query sketch below).
If the GPU supports it, L1 caching of global loads can be disabled/enabled at compile time using:
nvcc -Xptxas -dlcm=cg (disable)
nvcc -Xptxas -dlcm=ca (enable)
With L1 caching: 128-byte memory transactions
Without L1 caching: memory transactions of 1, 2, or 4 segments; segment size is 32 bytes
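A minimal query sketch; device 0 is an illustrative choice:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // query device 0

    printf("globalL1CacheSupported: %d\n", prop.globalL1CacheSupported);
    return 0;
}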
Cached Global Loads
Cached loads pass through the L1 cache and are serviced by device memory transactions at the granularity of an L1 cache line (128 bytes).
Figure: both accesses are aligned and coalesced; 100% load efficiency.
Cached Global Loads
Figure: coalesced but not aligned; 50% load efficiency.
Figure: all threads access the same address; 3.125% load efficiency.
Figure: addresses can fall across N cache lines, where 0 < N <= 32.
UnCached Global Loads
Uncached loads do not pass through the L1 cache.
They are performed at the granularity of memory segments (32 bytes), not cache lines (128 bytes).
These more fine-grained loads can give better bus utilization for misaligned or uncoalesced memory accesses.
UnCached Global Loads
Figure: both accesses are aligned and coalesced; 100% load efficiency. Each requires one memory transaction of 4 segments (4 x 32 B = 128 B).
UnCached Global Loads
Figure: not aligned on a 128 B boundary but coalesced; 3 transactions with 5 segments in total; 80% load efficiency.
Figure: all threads access the same address; 12.5% load efficiency.
Figure: addresses can fall across N segments, where 0 < N <= 32.
Memory Access Pattern
Aligned and coalesced:
Data[threadIdx.x]
Misaligned and/or uncoalesced, depending on the offset:
Data[threadIdx.x + offset]
If offset = N * 32, where 32 is the warp size and N is an integer (0, 1, 2, 3, ...), the access is aligned and coalesced (see the sketch below).
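An illustrative kernel whose access pattern depends on the offset argument; the names are placeholders:

__global__ void readOffset(float *out, const float *in, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int k = i + offset;            // offset = 0 or a multiple of 32 -> aligned, coalesced
    if (k < n) out[i] = in[k];     // other offsets -> misaligned warp loads
}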
Global Memory Writes
Stores are not cached in L1 but rather in L2
Performed at a 32-byte segment granularity.
Memory transactions can be 1, 2, or 4 segments at a time.
Array of Structures versus Structure
of Arrays
struct innerStruct {
float x;
float y;
};
struct innerArray {
float x[N];
float y[N];
};
Array of Structures versus Structure
of Arrays
Storing the data in SoA fashion makes full
use of GPU memory bandwidth. Because
there is no interleaving of elements, the SoA
layout on the GPU provides coalesced
memory accesses and can achieve more
efficient global memory utilization.
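A sketch contrasting kernels over the two layouts from the earlier slide; N, the kernel names, and the element-wise add are illustrative:

#define N 1024                                        // illustrative element count

struct innerStruct { float x; float y; };             // AoS element
struct innerArray  { float x[N]; float y[N]; };       // SoA container

__global__ void addAoS(innerStruct *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Neighboring threads touch addresses 8 bytes apart, so each transaction also
    // carries interleaved y values that the load of x does not need.
    if (i < N) data[i].x += data[i].y;
}

__global__ void addSoA(innerArray *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Neighboring threads read consecutive floats from data->x: coalesced access.
    if (i < N) data->x[i] += data->y[i];
}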
Memory Optimization
While optimizing your applications for
memory performance, pay attention to:
Aligned and coalesced memory accesses
Sufficient concurrent memory operations to hide
memory latency
Increasing the number of independent memory operations performed within each thread
Exposing sufficient parallelism to each SM via the kernel execution configuration
Unified Memory (UM)
Unified Memory creates a pool of managed memory that is accessible on both the CPU and GPU through the same address.
It supports automatic migration of data between host and device.
UM depends on UVA support, which enables the host and device to use the same pointer.
UVA itself does not migrate data between host and device; that capability is unique to Unified Memory.
Unified Memory (UM)
Advantages:
No need to manage separate host and device memories
No need for explicit data transfers between host and device
Maximizes programmer productivity; code is easier to maintain
Unified Memory (UM)
float *A, *B, *gpuRef;

// Managed allocations: each pointer is valid on both host and device.
cudaMallocManaged((void **)&A, nBytes);
cudaMallocManaged((void **)&B, nBytes);
cudaMallocManaged((void **)&gpuRef, nBytes);

initialData(A, nxy);              // host initializes managed memory directly
initialData(B, nxy);

sumMatrixGPU<<<grid, block>>>(A, B, gpuRef, nx, ny);
cudaDeviceSynchronize();          // wait before the host reads gpuRef

cudaFree(A); cudaFree(B); cudaFree(gpuRef);
Read-only cache
The read-only cache was originally reserved for texture memory loads.
For GPUs of compute capability 3.5 and higher, it can also serve global memory loads as an alternative to the L1 cache.
The granularity of loads through the read-only cache is 32 bytes.
Read-only cache
There are two ways to direct global memory reads through the read-only cache (a combined sketch follows):
using the intrinsic function __ldg: out[idx] = __ldg(&in[idx]);
using qualifiers on the pointer being dereferenced:
__global__ void copyKernel(int * __restrict__ out, const int * __restrict__ in)
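A minimal combined sketch of a copy kernel whose loads go through the read-only cache (requires compute capability 3.5 or higher); the names and the bounds check are illustrative:

__global__ void copyKernel(float * __restrict__ out,
                           const float * __restrict__ in, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = __ldg(&in[idx]);   // explicit read-only load intrinsic
}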