May 8-11, 2017 | Silicon Valley
CUDA 9 AND BEYOND
Mark Harris, May 10, 2017
INTRODUCING CUDA 9

BUILT FOR VOLTA
  Tesla V100: New GPU Architecture
  Tensor Cores
  NVLink
  Independent Thread Scheduling

FASTER LIBRARIES
  cuBLAS for Deep Learning
  NPP for Image Processing
  cuFFT for Signal Processing

COOPERATIVE THREAD GROUPS
  Flexible Thread Groups
  Efficient Parallel Algorithms
  Synchronize Across Thread Blocks in a Single GPU or Multi-GPUs

DEVELOPER TOOLS & PLATFORM UPDATES
  Faster Compile Times
  Unified Memory Profiling
  NVLink Visualization
  New OS and Compiler Support
INTRODUCING TESLA V100
The Fastest and Most Productive GPU for Deep Learning and HPC

  Volta Architecture: Most Productive GPU
  Improved NVLink & HBM2: Efficient Bandwidth
  Volta MPS: Inference Utilization
  Improved SIMT Model: New Algorithms
  Tensor Core: 120 Programmable TFLOPS Deep Learning
ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers

Summit Supercomputer: 200+ PetaFlops, ~3,400 Nodes, 10 Megawatts

[Chart: Volta HPC Application Performance, Relative to Tesla P100]
System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, with 1X Tesla P100 or V100. V100 measured on pre-production hardware.
FASTER LIBRARIES
CUDA 9: WHAT'S NEW IN LIBRARIES

VOLTA PLATFORM SUPPORT
  Utilize Volta Tensor Cores (cuBLAS)
  Volta-optimized GEMMs (cuBLAS)
  Out-of-box performance on Volta (all libraries)

PERFORMANCE
  Deep Learning: GEMM optimizations for RNNs (cuBLAS); faster image processing (NPP)
  Scientific Computing: FFT optimizations across various sizes (cuFFT)

NEW ALGORITHMS
  Multi-GPU dense & sparse solvers, dense eigenvalue & SVD (cuSOLVER)
  Breadth-first search, clustering, triangle counting, extraction & contraction (nvGRAPH)

IMPROVED USER EXPERIENCE
  New install package for CUDA Libraries (library-only meta package)
  Modular NPP with small footprint, support for image batching
cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply

[Charts: relative GEMM performance vs. matrix size (M=N=K = 512, 1024, 2048, 4096)]
  cuBLAS Single Precision (FP32): V100 (CUDA 9) up to 1.8x over P100 (CUDA 8)
  cuBLAS Mixed Precision (FP16 input, FP32 compute): V100 Tensor Cores (CUDA 9) up to 9.3x over P100 (CUDA 8)

Note: pre-production Tesla V100 and pre-release CUDA 9; CUDA 8 GA release.
Learn More
Connect with the Experts

H7129: Accelerated Libraries: cuFFT, cuSPARSE, cuSOLVER, nvGRAPH
  Wednesday, 4pm, Lower Level Pod B

S7121: Jacobi-Based Eigenvalue Solver on GPU (cuSOLVER)
  Lung Sheng Chien
  Tuesday, May 9, 11:00 AM - 11:25 AM, Marriott Salon 3
COOPERATIVE GROUPS
COOPERATIVE GROUPS
Flexible and Scalable Thread Synchronization and Communication

  Define, synchronize, and partition groups of cooperating threads
  Clean composition across software boundaries
  Optimize for hardware fast path
  Scalable from a few threads to all running threads
  Deploy everywhere: Kepler and newer GPUs*
  Supported by CUDA developer tools

* Note: Multi-Block and Multi-Device Cooperative Groups are only supported on Pascal and above GPUs
SYNCHRONIZE AT ANY SCALE
Three Key Capabilities

  FLEXIBLE GROUPS: define, partition, and synchronize arbitrary groups of threads
  WHOLE-GRID SYNCHRONIZATION: synchronize multiple thread blocks within a single GPU
  MULTI-GPU SYNCHRONIZATION: synchronize thread blocks across multiple GPUs
COOPERATIVE GROUPS BASICS
Flexible, Explicit Synchronization

Thread groups are explicit objects in your program:

    thread_group block = this_thread_block();

You can synchronize threads in a group:

    block.sync();

Create new groups by partitioning existing groups:

    thread_group tile32 = tiled_partition(block, 32);
    thread_group tile4  = tiled_partition(tile32, 4);

Partitioned groups can also synchronize:

    tile4.sync();

Note: these calls are in the cooperative_groups:: namespace.
EXAMPLE: PARALLEL REDUCTION
Composable, Robust and Efficient

Per-Block:
    g = this_thread_block();
    reduce(g, ptr, myVal);

Per-Warp:
    g = tiled_partition<32>(this_thread_block());
    reduce(g, ptr, myVal);

    __device__ int reduce(thread_group g, int *x, int val) {
        int lane = g.thread_rank();
        for (int i = g.size() / 2; i > 0; i /= 2) {
            x[lane] = val;
            g.sync();                          // wait for all threads to store
            if (lane < i) val += x[lane + i];  // only the low half reads its partner
            g.sync();                          // wait for all threads to load
        }
        return val;                            // thread 0 holds the full sum
    }
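For context, a minimal driver sketch for the reduce() helper above, assuming a block-wide reduction into a global accumulator; the kernel name, the 256-thread block size, and the scratch buffer are illustrative, not from the slides:

    #include <cooperative_groups.h>
    using namespace cooperative_groups;

    __global__ void sumKernel(const int *in, int *out) {
        __shared__ int scratch[256];           // scratch sized to the block (assumes blockDim.x <= 256)

        thread_block block = this_thread_block();
        int myVal = in[block.group_index().x * block.size() + block.thread_rank()];

        // Block-wide reduction using the reduce() helper above;
        // thread 0 of each block holds that block's sum.
        int blockSum = reduce(block, scratch, myVal);
        if (block.thread_rank() == 0)
            atomicAdd(out, blockSum);
    }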
LAUNCHING COOPERATIVE KERNELS
Three Synchronization Scales

  Block or Sub-Block Sync: launch with <<<>>> or cudaLaunchKernel()
  Multi-Block Sync:        launch with cudaLaunchCooperativeKernel()
  Multi-Device Sync:       launch with cudaLaunchCooperativeKernelMultiDevice()
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups

    // threads update particles in parallel
    integrate<<<blocks, threads, 0, stream>>>(particles);

[Diagram: particles 0-7 during the integration phase]
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups

    // threads update particles in parallel
    integrate<<<blocks, threads, 0, s>>>(particles);

    // Collide each particle with others in neighborhood
    collide<<<blocks, threads, 0, s>>>(particles);

[Diagram: particles 0-7 during the collision phase]

Note the change in how threads map to particles in the acceleration data structure.
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups

    // threads update particles in parallel
    integrate<<<blocks, threads, 0, s>>>(particles);

    // Note: implicit sync between kernel launches

    // Collide each particle with others in neighborhood
    collide<<<blocks, threads, 0, s>>>(particles);

[Diagram: thread-to-particle mapping for the integration and collision phases]

Note the change in how threads map to particles in the acceleration data structure.
WHOLE-GRID COOPERATION
Particle Simulation Update in a Single Kernel

    __global__ void particleSim(Particle *p, int N) {
        grid_group g = this_grid();

        for (int i = g.thread_rank(); i < N; i += g.size())
            integrate(p[i]);

        g.sync();  // Sync whole grid!

        for (int i = g.thread_rank(); i < N; i += g.size())
            collide(p[i], p, N);
    }

Launch using cudaLaunchCooperativeKernel(…)
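A minimal host-side sketch of that launch, assuming the particleSim kernel above. Grid-wide sync requires every block to be co-resident, so the grid is sized from occupancy; the 256-thread block size and helper name are illustrative:

    void launchParticleSim(Particle *d_particles, int N) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // All blocks must be resident for g.sync(), so size the grid from occupancy.
        int threads = 256, blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, particleSim, threads, 0);

        dim3 grid(blocksPerSM * prop.multiProcessorCount), block(threads);
        void *args[] = { &d_particles, &N };
        cudaLaunchCooperativeKernel((void *)particleSim, grid, block, args, 0, 0);
        cudaDeviceSynchronize();
    }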
MULTI-GPU COOPERATION
Large-scale Multi-GPU Simulation in a Single Kernel

    __global__ void particleSim(Particle *p, int N) {
        multi_grid_group g = this_multi_grid();

        for (int i = g.thread_rank(); i < N; i += g.size())
            integrate(p[i]);

        g.sync();  // Sync all GPUs!

        for (int i = g.thread_rank(); i < N; i += g.size())
            collide(p[i], p, N);
    }

Launch using cudaLaunchCooperativeKernelMultiDevice(…)
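A rough sketch of the multi-device launch, assuming the same kernel and one launch descriptor per GPU over a shared managed buffer. The cudaLaunchParams usage follows the CUDA 9 runtime API; the grid/block sizing and per-device stream handling are simplified for illustration:

    #include <vector>

    void launchMultiGpuSim(Particle *p, int N, int numDevices, dim3 grid, dim3 block) {
        std::vector<cudaLaunchParams> launches(numDevices);
        std::vector<cudaStream_t>     streams(numDevices);
        void *args[] = { &p, &N };                 // same managed buffer seen by every GPU (assumption)

        for (int dev = 0; dev < numDevices; ++dev) {
            cudaSetDevice(dev);
            cudaStreamCreate(&streams[dev]);       // one non-default stream per device

            launches[dev].func      = (void *)particleSim;
            launches[dev].gridDim   = grid;
            launches[dev].blockDim  = block;
            launches[dev].args      = args;
            launches[dev].sharedMem = 0;
            launches[dev].stream    = streams[dev];
        }
        cudaLaunchCooperativeKernelMultiDevice(launches.data(), numDevices, 0);
    }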
ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model

Volta Independent Thread Scheduling:
  Program familiar algorithms and data structures in a natural way
  Flexible thread grouping and synchronization
  Use explicit synchronization; don't rely on implicit convergence

CUDA 9 provides a fully explicit synchronization model.
ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model

Eliminate implicit warp-synchronous programming on all architectures:
  Use explicit synchronization
  Focus synchronization granularity with Cooperative Groups
  Transition to the new *_sync() primitives:
    __shfl_sync(), __ballot_sync(), __any_sync(), __all_sync(), __activemask()

CUDA 9 deprecates the non-synchronizing __shfl(), __ballot(), __any(), __all().
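As a small illustration of the migration, a sketch of a warp-level sum written with the new synchronizing shuffle; the helper name is illustrative, and the full-warp mask assumes all 32 lanes participate:

    __device__ int warpSum(int val) {
        const unsigned mask = 0xffffffff;                 // all 32 lanes participate (assumption)
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(mask, val, offset);   // was __shfl_down(val, offset) before CUDA 9
        return val;                                       // lane 0 holds the warp total
    }

When only a subset of the warp is active, derive the mask explicitly (for example with __activemask() or a coalesced_group) rather than relying on implicit convergence.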
Learn More

S7622: Cooperative Groups
  Kyrylo Perelygin and Yuan Lin
  Wednesday, 4pm, Marriott Ballroom 3
DEVELOPER TOOLS
UNIFIED MEMORY PROFILING
Correlate CPU Page Faults with Source

[Screenshot: page fault correlation with source code in the profiler]
NEW UNIFIED MEMORY EVENTS
Visualize Virtual Memory Activity

  Memory Thrashing
  Page Throttling
  Remote Map
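For context, a sketch of the kind of managed-memory access pattern these timeline events help diagnose: CPU and GPU alternately touching the same cudaMallocManaged allocation, which can show up as thrashing unless the data is prefetched. The kernel, loop count, and sizes are illustrative:

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    void run(int n) {
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));

        for (int iter = 0; iter < 10; ++iter) {
            for (int i = 0; i < n; ++i) x[i] += 1.0f;    // CPU touch: pages migrate to the host
            scale<<<(n + 255) / 256, 256>>>(x, n);       // GPU touch: pages fault back to the device
            cudaDeviceSynchronize();
        }
        // Explicit prefetching (cudaMemPrefetchAsync) can avoid the back-and-forth migration.
        cudaFree(x);
    }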
Learn More
S7495: Optimizing Application Performance
with CUDA Profiling Tools
Rahul Dhoot, Sanjiv Satoor, Mayank Jain
Thursday, 10am Marriott Ballroom 3
S7824: Developer Tools update in CUDA 9.0
Rafael Campana
Wednesday, 4pm 212A
26
THE BEYOND SECTION
FUTURE: UNIFIED SYSTEM ALLOCATOR
Allocate unified memory using standard malloc

  Removes CUDA-specific allocator restrictions
  Data movement is transparently handled
  Requires operating system support: HMM Linux kernel module

CUDA 8 Code with System Allocator:

    void sortfile(FILE *fp, int N) {
        char *data;
        // Allocate memory using any standard allocator
        data = (char *) malloc(N * sizeof(char));

        fread(data, 1, N, fp);

        sort<<<...>>>(data, N, 1, compare);

        use_data(data);

        // Free the allocated memory
        free(data);
    }

Learn More: HMM, Session 7764, John Hubbard, Wednesday 4pm, Room 211B
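For contrast, a sketch of what the same routine looks like today with the CUDA-specific Unified Memory allocator, cudaMallocManaged; the launch configuration and synchronization are added here for illustration only:

    void sortfile_managed(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);                    // CUDA-specific allocation instead of malloc

        fread(data, 1, N, fp);

        sort<<<(N + 255) / 256, 256>>>(data, N, 1, compare);   // illustrative launch configuration
        cudaDeviceSynchronize();                        // wait before the CPU reads the results

        use_data(data);
        cudaFree(data);
    }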
USING TENSOR CORES

Volta Optimized Frameworks and Libraries: NVIDIA cuDNN, cuBLAS, TensorRT
CUDA C++: Warp-Level Matrix Operations

    __device__ void tensor_op_16_16_16(
        float *d, half *a, half *b, float *c)
    {
        wmma::fragment<matrix_a, …> Amat;
        wmma::fragment<matrix_b, …> Bmat;
        wmma::fragment<matrix_c, …> Cmat;

        wmma::load_matrix_sync(Amat, a, 16);
        wmma::load_matrix_sync(Bmat, b, 16);
        wmma::fill_fragment(Cmat, 0.0f);

        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

        wmma::store_matrix_sync(d, Cmat, 16,
                                wmma::row_major);
    }
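Filling in the template parameters the slide elides, a sketch of the same operation against the nvcuda::wmma interface that shipped with CUDA 9; the B-matrix layout and accumulator choices here are assumptions for illustration:

    #include <mma.h>
    using namespace nvcuda;

    __device__ void tensor_op_16_16_16(float *d, const half *a, const half *b)
    {
        // 16x16x16 fragments: A and B in FP16, accumulator in FP32
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> Bmat;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

        wmma::fill_fragment(Cmat, 0.0f);
        wmma::load_matrix_sync(Amat, a, 16);
        wmma::load_matrix_sync(Bmat, b, 16);

        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);       // Cmat = Amat * Bmat + Cmat

        wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
    }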
TENSOR CORE
Mixed Precision Matrix Math on 4x4 Matrices

    D = A * B + C

  A, B: FP16
  C, D: FP16 or FP32
TENSOR CORE COORDINATION
Full-Warp 16x16 Matrix Math

  Warp-synchronizing operation for cooperative matrix math
  Aggregate Matrix Multiply and Accumulate for 16x16 matrices
  Result distributed across the warp
CUDA TENSOR CORE PROGRAMMING
16x16x16 Warp Matrix Multiply and Accumulate (WMMA)

    D = A * B + C

  A, B: FP16
  C, D: FP16 or FP32
CUDA TENSOR CORE PROGRAMMING
New WMMA Datatypes

Per-thread fragments hold components of matrices for use with Tensor Cores:

    wmma::fragment<matrix_a, …> Amat;
CUDA TENSOR CORE PROGRAMMING
New WMMA Load and Store Operations

Warp-level operation to fetch components of matrices into fragments:

    wmma::load_matrix_sync(Amat, a, stride);
CUDA TENSOR CORE PROGRAMMING
New WMMA Matrix Multiply and Accumulate Operation

Warp-level operation to perform the matrix multiply and accumulate:

    wmma::mma_sync(Dmat, Amat, Bmat, Cmat);
CUDA TENSOR CORE PROGRAMMING
New WMMA Load and Store Operations

Warp-level operation to store the result fragment back to memory:

    wmma::store_matrix_sync(d, Dmat, stride);
FUTURE COOPERATIVE GROUPS
Volta Enables Greater Flexibility

Partition using an arbitrary label:

    // Four groups of threads with the same computed value
    int label = foo() % 4;
    thread_group block = partition(this_thread_block(), label);

Use with care: random groups can lead to SIMT execution inefficiency.
FUTURE COOPERATIVE GROUPS
Library of Collective Algorithms

Reductions, sorting, prefix sum (scan), etc.

    // collective key-value sort using all threads in the block
    cooperative_groups::sort(this_thread_block(), myValues, myKeys);

    // collective scan-based allocate across the block
    int sz = myAllocationSize();   // amount each thread wants
    int offset = cooperative_groups::exclusive_scan(this_thread_block(), sz);

Note: preliminary API sketch
May 8-11, 2017 | Silicon Valley
CUDA 9 AND BEYOND
#GTC17
http://developer.nvidia.com/cuda-toolkit
http://parallelforall.com
[email protected]
@harrism
BACKUP
THREAD GROUPS INTERFACE

A thread can access the size of its group and its index (rank) in the group:

    thread_group group = this_thread_block();    // intrinsic group

    int index = group.thread_rank();
    int size  = group.size();

Thread block groups are a special type with more functions:

    thread_block block = this_thread_block();

    int index = block.thread_rank();             // linear index
    dim3 tid  = block.thread_index();            // equivalent to threadIdx (3D)
    dim3 bid  = block.group_index();             // equivalent to blockIdx (3D)
DISCOVERED CONCURRENCY
Simple, Robust Cooperation Within Warps

CUDA 8:

    __device__ int atomicAggInc(int *p)
    {
        unsigned mask = __ballot(1);
        unsigned total = __popc(mask);
        unsigned prefix = __popc(mask & __lanemask_lt());
        int lane = __ffs(mask) - 1;
        int offset = 0;
        if (prefix == 0)
            offset = atomicAdd(p, total);
        return prefix + __shfl_sync(mask, offset, lane);
    }

CUDA 9 Cooperative Groups:

    __device__ int atomicAggInc(int *p)
    {
        coalesced_group g = coalesced_threads();
        int prev = 0;
        if (g.thread_rank() == 0)
            prev = atomicAdd(p, g.size());
        return g.thread_rank() + g.shfl(prev, 0);
    }

coalesced_threads() returns the group of threads that called it together (often a warp).
coalesced_group supports warp shfl().
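A typical use, sketched: aggregating per-warp atomics when appending selected elements to an output buffer. The kernel and predicate here are illustrative, not from the slides:

    __global__ void filterPositive(const int *in, int *out, int *count, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] > 0) {
            // Only threads passing the predicate reach this point;
            // they form the coalesced group inside atomicAggInc.
            int dst = atomicAggInc(count);
            out[dst] = in[i];
        }
    }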
FUTURE COOPERATIVE GROUPS
Volta Enables Greater Flexibility

Partition using an arbitrary predicate or label:

    // Group of first four threads of all warps
    auto tile = tiled_partition<32>(this_thread_block());
    thread_group block = partition(this_thread_block(), tile.thread_rank() < 4);

    // Four groups of threads with the same computed value
    int label = foo() % 4;
    thread_group block = labeled_partition(this_thread_block(), label);

Use with care: random groups can lead to SIMT execution inefficiency.
NPP IMAGE PROCESSING PRIMITIVES
Redesigned NPP boosts performance with smaller footprint

  Over 2500 accelerated image, video & computer vision primitives
  CUDA 9 streamlines the NPP library
  Small memory footprint
  Image batching support

[Chart (placeholder, results to be updated): NPP Image Processing, 20-100x speedup on Tesla V100 vs. IPP on Xeon E5-2690 (Broadwell), across Morphological Operations, JPEG, Geometry Transforms, Filters, Color Processing]
EXAMPLE: PARTICLE SIMULATION

[Diagram: Phase 1: Integration; Phase 2: Collision Detection]