May 8-11, 2017 | Silicon Valley
CUDA 9 AND BEYOND
Mark Harris, May 10, 2017
INTRODUCING CUDA 9

BUILT FOR VOLTA
  Tesla V100: New GPU Architecture
  Tensor Cores
  NVLink
  Independent Thread Scheduling

FASTER LIBRARIES
  cuBLAS for Deep Learning
  NPP for Image Processing
  cuFFT for Signal Processing

COOPERATIVE THREAD GROUPS
  Flexible Thread Groups
  Efficient Parallel Algorithms
  Synchronize Across Thread Blocks in a Single GPU or Multi-GPUs

DEVELOPER TOOLS & PLATFORM UPDATES
  Faster Compile Times
  Unified Memory Profiling
  NVLink Visualization
  New OS and Compiler Support
INTRODUCING TESLA V100
The Fastest and Most Productive GPU for Deep Learning and HPC

  Volta Architecture: Most Productive GPU
  Improved NVLink & HBM2: Efficient Bandwidth
  Volta MPS: Inference Utilization
  Improved SIMT Model: New Algorithms
  Tensor Core: 120 Programmable TFLOPS Deep Learning
ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers

Summit Supercomputer: 200+ PetaFlops, ~3,400 Nodes, 10 Megawatts

[Chart: Volta HPC Application Performance, Relative to Tesla P100]
System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, with 1X Tesla P100 or V100. V100 measured on pre-production hardware.
FASTER LIBRARIES
CUDA 9: WHAT'S NEW IN LIBRARIES

VOLTA PLATFORM SUPPORT
  Utilize Volta Tensor Cores (cuBLAS)
  Volta-optimized GEMMs (cuBLAS)
  Out-of-box performance on Volta (all libraries)

PERFORMANCE
  Deep Learning: GEMM optimizations for RNNs (cuBLAS); faster image processing (NPP)
  Scientific Computing: FFT optimizations across various sizes (cuFFT)

NEW ALGORITHMS
  Multi-GPU dense & sparse solvers, dense eigenvalue & SVD (cuSOLVER)
  Breadth-first search, clustering, triangle counting, extraction & contraction (nvGRAPH)

IMPROVED USER EXPERIENCE
  New install package for CUDA Libraries (library-only meta package)
  Modular NPP with small footprint, support for image batching
cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply

[Charts: relative GEMM performance vs. matrix size (M=N=K = 512, 1024, 2048, 4096)]
  cuBLAS Single Precision (FP32): V100 (CUDA 9) up to 1.8x over P100 (CUDA 8)
  cuBLAS Mixed Precision (FP16 input, FP32 compute): V100 Tensor Cores (CUDA 9) up to 9.3x over P100 (CUDA 8)

Note: pre-production Tesla V100 and pre-release CUDA 9; CUDA 8 GA release.
Learn More
Connect with the Experts

H7129: Accelerated Libraries: cuFFT, cuSPARSE, cuSOLVER, nvGRAPH
  Wednesday, 4pm, Lower Level Pod B

S7121: Jacobi-Based Eigenvalue Solver on GPU (cuSOLVER)
  Lung Sheng Chien
  Tuesday, May 9, 11:00 AM - 11:25 AM, Marriott Salon 3
COOPERATIVE GROUPS
COOPERATIVE GROUPS
Flexible and Scalable Thread Synchronization and Communication

  Define, synchronize, and partition groups of cooperating threads
  Clean composition across software boundaries
  Optimize for hardware fast path
  Scalable from a few threads to all running threads
  Deploy everywhere: Kepler and newer GPUs*
  Supported by CUDA developer tools

* Note: Multi-Block and Multi-Device Cooperative Groups are only supported on Pascal and above GPUs
SYNCHRONIZE AT ANY SCALE
Three Key Capabilities

  FLEXIBLE GROUPS: define, partition, and synchronize arbitrary groups of threads
  WHOLE-GRID SYNCHRONIZATION: synchronize multiple thread blocks within a single GPU
  MULTI-GPU SYNCHRONIZATION: synchronize thread blocks across multiple GPUs
COOPERATIVE GROUPS BASICS
Flexible, Explicit Synchronization

Thread groups are explicit objects in your program:

    thread_group block = this_thread_block();

You can synchronize threads in a group:

    block.sync();

Create new groups by partitioning existing groups:

    thread_group tile32 = tiled_partition(block, 32);
    thread_group tile4  = tiled_partition(tile32, 4);

Partitioned groups can also synchronize:

    tile4.sync();

Note: these calls are in the cooperative_groups:: namespace.
EXAMPLE: PARALLEL REDUCTION
Composable, Robust and Efficient

Per-Block:
    g = this_thread_block();
    reduce(g, ptr, myVal);

Per-Warp:
    g = tiled_partition<32>(this_thread_block());
    reduce(g, ptr, myVal);

    __device__ int reduce(thread_group g, int *x, int val) {
        int lane = g.thread_rank();
        for (int i = g.size() / 2; i > 0; i /= 2) {
            x[lane] = val;
            g.sync();                          // wait for all threads to store
            if (lane < i) val += x[lane + i];  // only the low half reads its partner
            g.sync();                          // wait for all threads to load
        }
        return val;                            // thread 0 holds the full sum
    }
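For context, a minimal driver sketch for the reduce() helper above, assuming a block-wide reduction into a global accumulator; the kernel name, the 256-thread block size, and the scratch buffer are illustrative, not from the slides:

    #include <cooperative_groups.h>
    using namespace cooperative_groups;

    __global__ void sumKernel(const int *in, int *out) {
        __shared__ int scratch[256];           // scratch sized to the block (assumes blockDim.x <= 256)

        thread_block block = this_thread_block();
        int myVal = in[block.group_index().x * block.size() + block.thread_rank()];

        // Block-wide reduction using the reduce() helper above;
        // thread 0 of each block holds that block's sum.
        int blockSum = reduce(block, scratch, myVal);
        if (block.thread_rank() == 0)
            atomicAdd(out, blockSum);
    }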
LAUNCHING COOPERATIVE KERNELS
Three Synchronization Scales

  Block or Sub-Block Sync: launch with <<<>>> or cudaLaunchKernel()
  Multi-Block Sync:        launch with cudaLaunchCooperativeKernel()
  Multi-Device Sync:       launch with cudaLaunchCooperativeKernelMultiDevice()
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups

    // threads update particles in parallel
    integrate<<<blocks, threads, 0, stream>>>(particles);

[Diagram: particles 0-7 during the integration phase]
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups

    // threads update particles in parallel
    integrate<<<blocks, threads, 0, s>>>(particles);

    // Collide each particle with others in neighborhood
    collide<<<blocks, threads, 0, s>>>(particles);

[Diagram: particles 0-7 during the collision phase]

Note the change in how threads map to particles in the acceleration data structure.
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups

    // threads update particles in parallel
    integrate<<<blocks, threads, 0, s>>>(particles);

    // Note: implicit sync between kernel launches

    // Collide each particle with others in neighborhood
    collide<<<blocks, threads, 0, s>>>(particles);

[Diagram: thread-to-particle mapping for the integration and collision phases]

Note the change in how threads map to particles in the acceleration data structure.
WHOLE-GRID COOPERATION
Particle Simulation Update in a Single Kernel

    __global__ void particleSim(Particle *p, int N) {
        grid_group g = this_grid();

        for (int i = g.thread_rank(); i < N; i += g.size())
            integrate(p[i]);

        g.sync();  // Sync whole grid!

        for (int i = g.thread_rank(); i < N; i += g.size())
            collide(p[i], p, N);
    }

Launch using cudaLaunchCooperativeKernel(…)
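A minimal host-side sketch of that launch, assuming the particleSim kernel above. Grid-wide sync requires every block to be co-resident, so the grid is sized from occupancy; the 256-thread block size and helper name are illustrative:

    void launchParticleSim(Particle *d_particles, int N) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // All blocks must be resident for g.sync(), so size the grid from occupancy.
        int threads = 256, blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, particleSim, threads, 0);

        dim3 grid(blocksPerSM * prop.multiProcessorCount), block(threads);
        void *args[] = { &d_particles, &N };
        cudaLaunchCooperativeKernel((void *)particleSim, grid, block, args, 0, 0);
        cudaDeviceSynchronize();
    }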
MULTI-GPU COOPERATION
Large-scale Multi-GPU Simulation in a Single Kernel

    __global__ void particleSim(Particle *p, int N) {
        multi_grid_group g = this_multi_grid();

        for (int i = g.thread_rank(); i < N; i += g.size())
            integrate(p[i]);

        g.sync();  // Sync all GPUs!

        for (int i = g.thread_rank(); i < N; i += g.size())
            collide(p[i], p, N);
    }

Launch using cudaLaunchCooperativeKernelMultiDevice(…)
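A rough sketch of the multi-device launch, assuming the same kernel and one launch descriptor per GPU over a shared managed buffer. The cudaLaunchParams usage follows the CUDA 9 runtime API; the grid/block sizing and per-device stream handling are simplified for illustration:

    #include <vector>

    void launchMultiGpuSim(Particle *p, int N, int numDevices, dim3 grid, dim3 block) {
        std::vector<cudaLaunchParams> launches(numDevices);
        std::vector<cudaStream_t>     streams(numDevices);
        void *args[] = { &p, &N };                 // same managed buffer seen by every GPU (assumption)

        for (int dev = 0; dev < numDevices; ++dev) {
            cudaSetDevice(dev);
            cudaStreamCreate(&streams[dev]);       // one non-default stream per device

            launches[dev].func      = (void *)particleSim;
            launches[dev].gridDim   = grid;
            launches[dev].blockDim  = block;
            launches[dev].args      = args;
            launches[dev].sharedMem = 0;
            launches[dev].stream    = streams[dev];
        }
        cudaLaunchCooperativeKernelMultiDevice(launches.data(), numDevices, 0);
    }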
ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model

Volta Independent Thread Scheduling:
  Program familiar algorithms and data structures in a natural way
  Flexible thread grouping and synchronization
  Use explicit synchronization; don't rely on implicit convergence

CUDA 9 provides a fully explicit synchronization model.
ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model

Eliminate implicit warp-synchronous programming on all architectures:
  Use explicit synchronization
  Focus synchronization granularity with Cooperative Groups
  Transition to the new *_sync() primitives:
    __shfl_sync(), __ballot_sync(), __any_sync(), __all_sync(), __activemask()

CUDA 9 deprecates the non-synchronizing __shfl(), __ballot(), __any(), __all().
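As a small illustration of the migration, a sketch of a warp-level sum written with the new synchronizing shuffle; the helper name is illustrative, and the full-warp mask assumes all 32 lanes participate:

    __device__ int warpSum(int val) {
        const unsigned mask = 0xffffffff;                 // all 32 lanes participate (assumption)
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(mask, val, offset);   // was __shfl_down(val, offset) before CUDA 9
        return val;                                       // lane 0 holds the warp total
    }

When only a subset of the warp is active, derive the mask explicitly (for example with __activemask() or a coalesced_group) rather than relying on implicit convergence.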
Learn More

S7622: Cooperative Groups
  Kyrylo Perelygin and Yuan Lin
  Wednesday, 4pm, Marriott Ballroom 3
DEVELOPER TOOLS
UNIFIED MEMORY PROFILING
Correlate CPU Page Faults with Source

[Screenshot: page fault correlation with source code in the profiler]
NEW UNIFIED MEMORY EVENTS
Visualize Virtual Memory Activity

  Memory Thrashing
  Page Throttling
  Remote Map
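For context, a sketch of the kind of managed-memory access pattern these timeline events help diagnose: CPU and GPU alternately touching the same cudaMallocManaged allocation, which can show up as thrashing unless the data is prefetched. The kernel, loop count, and sizes are illustrative:

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    void run(int n) {
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));

        for (int iter = 0; iter < 10; ++iter) {
            for (int i = 0; i < n; ++i) x[i] += 1.0f;    // CPU touch: pages migrate to the host
            scale<<<(n + 255) / 256, 256>>>(x, n);       // GPU touch: pages fault back to the device
            cudaDeviceSynchronize();
        }
        // Explicit prefetching (cudaMemPrefetchAsync) can avoid the back-and-forth migration.
        cudaFree(x);
    }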
Learn More
S7495: Optimizing Application Performance
with CUDA Profiling Tools
Rahul Dhoot, Sanjiv Satoor, Mayank Jain
Thursday, 10am Marriott Ballroom 3
S7824: Developer Tools update in CUDA 9.0
Rafael Campana
Wednesday, 4pm 212A
26
THE BEYOND SECTION
FUTURE: UNIFIED SYSTEM ALLOCATOR
Allocate unified memory using standard malloc

  Removes CUDA-specific allocator restrictions
  Data movement is transparently handled
  Requires operating system support: HMM Linux kernel module

CUDA 8 Code with System Allocator:

    void sortfile(FILE *fp, int N) {
        char *data;
        // Allocate memory using any standard allocator
        data = (char *) malloc(N * sizeof(char));

        fread(data, 1, N, fp);

        sort<<<...>>>(data, N, 1, compare);

        use_data(data);

        // Free the allocated memory
        free(data);
    }

Learn More: HMM, Session 7764, John Hubbard, Wednesday 4pm, Room 211B
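For contrast, a sketch of what the same routine looks like today with the CUDA-specific Unified Memory allocator, cudaMallocManaged; the launch configuration and synchronization are added here for illustration only:

    void sortfile_managed(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);                    // CUDA-specific allocation instead of malloc

        fread(data, 1, N, fp);

        sort<<<(N + 255) / 256, 256>>>(data, N, 1, compare);   // illustrative launch configuration
        cudaDeviceSynchronize();                        // wait before the CPU reads the results

        use_data(data);
        cudaFree(data);
    }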
USING TENSOR CORES

Volta Optimized Frameworks and Libraries: NVIDIA cuDNN, cuBLAS, TensorRT
CUDA C++: Warp-Level Matrix Operations

    __device__ void tensor_op_16_16_16(
        float *d, half *a, half *b, float *c)
    {
        wmma::fragment<matrix_a, …> Amat;
        wmma::fragment<matrix_b, …> Bmat;
        wmma::fragment<matrix_c, …> Cmat;

        wmma::load_matrix_sync(Amat, a, 16);
        wmma::load_matrix_sync(Bmat, b, 16);
        wmma::fill_fragment(Cmat, 0.0f);

        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

        wmma::store_matrix_sync(d, Cmat, 16,
                                wmma::row_major);
    }
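Filling in the template parameters the slide elides, a sketch of the same operation against the nvcuda::wmma interface that shipped with CUDA 9; the B-matrix layout and accumulator choices here are assumptions for illustration:

    #include <mma.h>
    using namespace nvcuda;

    __device__ void tensor_op_16_16_16(float *d, const half *a, const half *b)
    {
        // 16x16x16 fragments: A and B in FP16, accumulator in FP32
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> Bmat;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

        wmma::fill_fragment(Cmat, 0.0f);
        wmma::load_matrix_sync(Amat, a, 16);
        wmma::load_matrix_sync(Bmat, b, 16);

        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);       // Cmat = Amat * Bmat + Cmat

        wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
    }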
TENSOR CORE
Mixed Precision Matrix Math on 4x4 Matrices

    D = A * B + C

  A, B: FP16
  C, D: FP16 or FP32
TENSOR CORE COORDINATION
Full-Warp 16x16 Matrix Math

  Warp-synchronizing operation for cooperative matrix math
  Aggregate Matrix Multiply and Accumulate for 16x16 matrices
  Result distributed across the warp
CUDA TENSOR CORE PROGRAMMING
16x16x16 Warp Matrix Multiply and Accumulate (WMMA)

    D = A * B + C

  A, B: FP16
  C, D: FP16 or FP32
CUDA TENSOR CORE PROGRAMMING
New WMMA Datatypes

Per-thread fragments hold components of matrices for use with Tensor Cores:

    wmma::fragment<matrix_a, …> Amat;
CUDA TENSOR CORE PROGRAMMING
New WMMA Load and Store Operations

Warp-level operation to fetch components of matrices into fragments:

    wmma::load_matrix_sync(Amat, a, stride);
CUDA TENSOR CORE PROGRAMMING
New WMMA Matrix Multiply and Accumulate Operation

Warp-level operation to perform the matrix multiply and accumulate:

    wmma::mma_sync(Dmat, Amat, Bmat, Cmat);
CUDA TENSOR CORE PROGRAMMING
New WMMA Load and Store Operations

Warp-level operation to store the result fragment back to memory:

    wmma::store_matrix_sync(d, Dmat, stride);
FUTURE COOPERATIVE GROUPS
Volta Enables Greater Flexibility

Partition using an arbitrary label:

    // Four groups of threads with the same computed value
    int label = foo() % 4;
    thread_group block = partition(this_thread_block(), label);

Use with care: random groups can lead to SIMT execution inefficiency.
FUTURE COOPERATIVE GROUPS
Library of Collective Algorithms

Reductions, sorting, prefix sum (scan), etc.

    // collective key-value sort using all threads in the block
    cooperative_groups::sort(this_thread_block(), myValues, myKeys);

    // collective scan-based allocate across the block
    int sz = myAllocationSize();   // amount each thread wants
    int offset = cooperative_groups::exclusive_scan(this_thread_block(), sz);

Note: preliminary API sketch
May 8-11, 2017 | Silicon Valley
CUDA 9 AND BEYOND
#GTC17
http://developer.nvidia.com/cuda-toolkit
http://parallelforall.com
[email protected]
@harrism
BACKUP
THREAD GROUPS INTERFACE

A thread can access the size of its group and its index (rank) in the group:

    thread_group group = this_thread_block();    // intrinsic group

    int index = group.thread_rank();
    int size  = group.size();

Thread block groups are a special type with more functions:

    thread_block block = this_thread_block();

    int index = block.thread_rank();             // linear index
    dim3 tid  = block.thread_index();            // equivalent to threadIdx (3D)
    dim3 bid  = block.group_index();             // equivalent to blockIdx (3D)
DISCOVERED CONCURRENCY
Simple, Robust Cooperation Within Warps

CUDA 8:

    __device__ int atomicAggInc(int *p)
    {
        unsigned mask = __ballot(1);
        unsigned total = __popc(mask);
        unsigned prefix = __popc(mask & __lanemask_lt());
        int lane = __ffs(mask) - 1;
        int offset = 0;
        if (prefix == 0)
            offset = atomicAdd(p, total);
        return prefix + __shfl_sync(mask, offset, lane);
    }

CUDA 9 Cooperative Groups:

    __device__ int atomicAggInc(int *p)
    {
        coalesced_group g = coalesced_threads();
        int prev = 0;
        if (g.thread_rank() == 0)
            prev = atomicAdd(p, g.size());
        return g.thread_rank() + g.shfl(prev, 0);
    }

coalesced_threads() returns the group of threads that called it together (often a warp).
coalesced_group supports warp shfl().
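A typical use, sketched: aggregating per-warp atomics when appending selected elements to an output buffer. The kernel and predicate here are illustrative, not from the slides:

    __global__ void filterPositive(const int *in, int *out, int *count, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] > 0) {
            // Only threads passing the predicate reach this point;
            // they form the coalesced group inside atomicAggInc.
            int dst = atomicAggInc(count);
            out[dst] = in[i];
        }
    }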
FUTURE COOPERATIVE GROUPS
Volta Enables Greater Flexibility

Partition using an arbitrary predicate or label:

    // Group of first four threads of all warps
    auto tile = tiled_partition<32>(this_thread_block());
    thread_group block = partition(this_thread_block(), tile.thread_rank() < 4);

    // Four groups of threads with the same computed value
    int label = foo() % 4;
    thread_group block = labeled_partition(this_thread_block(), label);

Use with care: random groups can lead to SIMT execution inefficiency.
NPP IMAGE PROCESSING PRIMITIVES
Redesigned NPP boosts performance with smaller footprint

  Over 2500 accelerated image, video & computer vision primitives
  CUDA 9 streamlines the NPP library
  Small memory footprint
  Image batching support

[Chart (placeholder, results to be updated): NPP Image Processing, 20-100x speedup on Tesla V100 vs. IPP on Xeon E5-2690 (Broadwell), across Morphological Operations, JPEG, Geometry Transforms, Filters, Color Processing]
EXAMPLE: PARTICLE SIMULATION

[Diagram: Phase 1: Integration; Phase 2: Collision Detection]