COMP206 – Computer Architecture
Lecture #14 – HW Accelerators (GPUs)
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Buse Yılmaz, PhD.
Dept. of Computer Engineering, MEF University
Slides are based on the original slides of the following sources, with modifications:
Computer Organization and Architecture: Designing for Performance (10th Ed.) – William Stallings
EuroCC Seminar – 2022 (several speakers)
BAŞARIM’22 Ümit Çatalyürek’s presentation
HPC Talk – Buse Yılmaz, İSÜ’22
General-Purpose Graphics Processing Units (GPGPU)
• Mostly used for running graphics-heavy applications:
3D modeling software
VDI infrastructures
Game graphics, simulations, etc.
• Today, GPGPUs accelerate computational workloads in modern High Performance Computing (HPC) landscapes
• GPUs are:
a type of accelerator
coupled with a CPU
not capable of running an OS on their own
the targets we offload CUDA code to
https://core.vmware.com/resource/exploring-gpu-architecture#section1
General-Purpose Graphics Processing Units (GPGPU)
• CPU: minimize latency
Optimized to finish a task with as low a latency as possible, while keeping the ability to quickly switch between operations
• GPU: maximize throughput
The GPU has many more cores to process a task; the motivation is to push as many tasks as possible through its internals at once
It puts all available cores to work and is less focused on low-latency cache memory access
https://core.vmware.com/resource/exploring-gpu-architecture#section1
Figure 19.2 CPU vs. GPU Silicon Area/Transistor Dedication
CPU: Control logic and cache memory make up the majority of the CPU’s real
estate.
GPU: uses a massively parallel SIMD (single instruction, multiple data) architecture to perform mainly mathematical operations: it runs the same thread of code on large amounts of data
The GPU doesn’t require the same complex capabilities as the CPU’s control logic (out-of-order execution, branch prediction, data hazards, etc.)
GPUs are able to hide memory latency by managing the execution of more
threads than available processor cores
CPU vs. GPU
EuroCC Seminar – 2022 (several speakers)
CPU
https://core.vmware.com/resource/exploring-gpu-architecture#section1
GPU
CUDA core = Streaming Processor (SP, shader unit)
A streaming multiprocessor (SM) contains streaming processors (SPs)
An SM executes one instruction at a time on all of its SPs
Warp: basic unit of execution (32 threads)
https://core.vmware.com/resource/exploring-gpu-architecture#section4
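Warp size and SM count are hardware properties that can be queried at run time. A minimal sketch, assuming the CUDA runtime API and a CUDA-capable device 0:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0 (assumed present)

    printf("Device: %s\n", prop.name);
    printf("Streaming multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Warp size (threads): %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}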
Threads are uniformly bundled into thread blocks
Grid: the number of blocks per kernel launch
Grid / block dimensions: 1D, 2D, or 3D; they need not have the same dimensions (see the launch sketch after the figure caption below)
Block: assigned to only one of the GPU’s several streaming multiprocessors (SMs). A block is never split between SMs, and it consists of warps.
There is a maximum number of threads per block that an SM will accept (code won’t compile)
Distribute the load as uniformly as possible (the number of thread blocks launched should be no less than the number of SMs on the GPU.)
Finding the optimum configuration can be a very time-consuming and daunting process.
NVIDIA H100: thread block clusters (a granularity larger than a single thread block on a single SM)
Figure 19.1 Relationship Among Threads, Blocks, and a Grid
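As a concrete illustration of the grid/block/thread hierarchy in Figure 19.1, a hedged launch-configuration sketch (kernel name, matrix size, and block shape are illustrative choices, not from the slides):

#include <cuda_runtime.h>

#define WIDTH  1024   // illustrative matrix size
#define HEIGHT  768

// Each thread scales one element of a WIDTH x HEIGHT matrix.
__global__ void scale2D(float *data, float factor) {
    // Global 2D index = block offset + thread offset within the block.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < WIDTH && row < HEIGHT)        // guard threads that fall outside the matrix
        data[row * WIDTH + col] *= factor;
}

int main(void) {
    float *d_data;
    cudaMalloc(&d_data, WIDTH * HEIGHT * sizeof(float));

    dim3 block(16, 16);                           // 256 threads per block, a multiple of the warp size
    dim3 grid((WIDTH  + block.x - 1) / block.x,   // enough blocks to cover every element
              (HEIGHT + block.y - 1) / block.y);
    scale2D<<<grid, block>>>(d_data, 2.0f);       // a grid of blocks of threads, as in Figure 19.1
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}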
Table 19.1 CUDA Terms to GPU’s Hardware Components Equivalence Mapping
EuroCC Seminar – 2022 (several speakers)
Memory Management
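The memory-management slide itself is an image; a minimal sketch of the explicit host/device transfer pattern it refers to (buffer name and size are illustrative), assuming the CUDA runtime API:

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const int N = 1024;                    // illustrative element count
    size_t bytes = N * sizeof(float);

    float *h_buf = (float*)malloc(bytes);  // host (CPU) memory
    float *d_buf;
    cudaMalloc(&d_buf, bytes);             // device (GPU) global memory

    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernels that operate on d_buf ...
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);   // device -> host

    // Alternative: cudaMallocManaged() allocates unified memory that the CUDA
    // driver migrates between host and device on demand.
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}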
Streaming Multiprocessor (SM)
• Thousands of registers that can be partitioned among threads of
execution
• Several caches:
Shared memory for fast data interchange between threads
Constant cache for fast broadcast of reads from constant memory
Texture cache to aggregate bandwidth from texture memory
L1 cache to reduce latency to local or global memory
• Warp schedulers that can quickly switch contexts between
threads and issue instructions to warps that are ready to execute
• Execution cores for integer and floating-point operations:
Integer and single-precision floating point operations
Double-precision floating point
Special Function Units (SFUs) for single-precision floating-point transcendental functions (NVIDIA Fermi & Kepler)
CUDA Handbook: A Comprehensive Guide to GPU Programming, Nicholas Wilt, Addison-Wesley Professional, 2013.
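Shared memory is the part of the SM’s memory most directly exposed to the programmer. A hedged sketch (kernel name and block size are illustrative) of a block-level sum that stages data in __shared__ memory so the threads of one block can exchange it quickly:

#include <cuda_runtime.h>

#define BLOCK 256   // threads per block (illustrative)

// Each block reduces BLOCK consecutive elements of 'in' to one partial sum in 'out'.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK];            // on-chip shared memory, one copy per block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                         // wait until every thread has loaded its element

    // Tree reduction inside the block: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];           // one partial sum per block
}

int main(void) {
    const int N = 1 << 20;
    const int blocks = (N + BLOCK - 1) / BLOCK;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    blockSum<<<blocks, BLOCK>>>(d_in, d_out, N);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}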
Figure 19.5 Single SM Architecture
(The figure shows a single SM with: an instruction cache; two warp schedulers, each with a dispatch unit; a 32k x 32-bit register file; CUDA cores, each containing a dispatch port, an operand collector, an FP unit, an Int unit, and a result queue; load/store (Ld/St) units; Special Function Units (SFUs); an interconnect network; 64-kB shared memory/L1 cache; and a uniform cache.)
Special Function Units (SFUs), quoting the NVIDIA white paper on Fermi, "execute transcendental instructions such as sin, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock"
https://forums.developer.nvidia.com/t/fermi-and-kepler-gpu-special-function-units/28345
Figure 19.6 Dual Warp Schedulers and Instruction Dispatch Units Run Example
(Each column is one warp scheduler with its instruction dispatch unit; time runs downward.)

WARP Scheduler / Dispatch Unit    WARP Scheduler / Dispatch Unit
Warp 8 instruction 11             Warp 9 instruction 11
Warp 2 instruction 42             Warp 3 instruction 33
Warp 14 instruction 95            Warp 15 instruction 95
Warp 8 instruction 12             Warp 9 instruction 12
Warp 14 instruction 96            Warp 3 instruction 34
Warp 2 instruction 43             Warp 15 instruction 96
EuroCC Seminar – 2022 (several speakers)
Compute Capabilities for some GPUs
EuroCC Seminar – 2022 (several speakers)
Tensor Cores
Physical attributes such as size and configuration, and the arrangement of its components
https://www.nvidia.com/en-us/data-center/tensor-cores/
https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/
Occupancy – Saturate the GPU
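The occupancy slides are image-only; as a hedged illustration of the idea (kernel and block size below are illustrative), the CUDA runtime can report how many blocks of a given kernel fit on one SM, which bounds how well a launch saturates the GPU:

#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative kernel; the occupancy query only looks at its resource usage.
__global__ void dummyKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = (float)i;
}

int main(void) {
    int blockSize = 256;                     // illustrative threads per block
    int blocksPerSM = 0;

    // How many resident blocks of dummyKernel fit on one SM at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Occupancy bound = resident warps / maximum warps an SM can hold.
    int residentWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Blocks per SM: %d, occupancy bound: %.0f%%\n",
           blocksPerSM, 100.0 * residentWarps / maxWarps);
    return 0;
}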
Compute Unified Device Architecture (CUDA)
• A parallel computing platform and programming model created by NVIDIA and
implemented by the graphics processing units (GPUs) that they produce
• CUDA C is a C/C++ based language
• Program can be divided into three general sections
Code to be run on the host (CPU)
Code to be run on the device (GPU)
The code related to the transfer of data between the host and the device
The data-parallel code to be run on the GPU is called a kernel
Typically will have few to no branching statements
Branching statements in the kernel result in serialized execution of the diverging threads within a warp in the GPU hardware (see the divergence sketch after this list)
A thread is a single instance of the kernel function
The programmer defines the number of threads launched when the kernel
function is called
The total number of threads defined is typically in the thousands, to maximize the utilization of the GPU processor cores as well as the available speedup
The programmer specifies how these threads are to be bundled
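A hedged sketch of the branching point above (kernel and sizes are illustrative): when threads of the same warp take different sides of an if/else, the hardware executes the two paths one after the other, masking off the inactive threads.

#include <cuda_runtime.h>

__global__ void divergent(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Even and odd threads of the same 32-thread warp take different branches,
    // so the warp runs the 'if' path and then the 'else' path serially.
    if (threadIdx.x % 2 == 0)
        out[i] = 2.0f * i;
    else
        out[i] = 0.5f * i;
    // A condition that is uniform across the warp (e.g. one based on blockIdx.x)
    // would not cause divergence.
}

int main(void) {
    float *d_out;
    cudaMalloc(&d_out, 1024 * sizeof(float));
    divergent<<<4, 256>>>(d_out);            // 4 blocks x 256 threads cover 1024 elements
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}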
CUDA API example
EuroCC Seminar – 2022 (several speakers)
CUDA Hello World
EuroCC Seminar – 2022 (several speakers)
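The Hello World slide itself is an image; a minimal sketch of the usual CUDA hello-world pattern (assuming a device of compute capability 2.0 or newer, which supports printf from device code):

#include <stdio.h>

// Kernel: every launched thread prints its block and thread index.
__global__ void helloFromGPU(void) {
    printf("Hello World from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    helloFromGPU<<<2, 4>>>();      // 2 blocks x 4 threads = 8 messages
    cudaDeviceSynchronize();       // wait for the GPU before the program exits
    return 0;
}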
Simple CUDA Code
#include <stdio.h>
#include <math.h>
__global__
void saxpy(int n, float a, float *x, float *y) { // Single-precision A*X Plus Y
// index of the thread within its thread block and the thread block within the grid, respectively.
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
int main(void){
int N = 1<<20;
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
// Perform SAXPY on 1M elements
// <<< number of thread blocks in the grid, the number of threads in a thread block>>>
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
printf("Max error: %f\n", maxError);
cudaFree(d_x);
cudaFree(d_y);
free(x);
free(y);
}
https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/
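One practical note on the code above: kernel launches and asynchronous calls do not report failures by themselves. A hedged sketch of a common checking pattern (the macro name is illustrative) that could be dropped into the example:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative helper: abort with a readable message if a CUDA runtime call failed.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(1);                                                       \
        }                                                                  \
    } while (0)

// Usage after a kernel launch (the launch itself returns no status):
//   saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
//   CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised during execution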
CUDA Computing
• CUDA compiler: nvcc
• ISA: Parallel Thread Execution (PTX)
The programming model is explicitly parallel: a PTX program specifies the execution of a given thread of a parallel thread array. A cooperative thread array (CTA) is an array of threads that execute a kernel concurrently or in parallel
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html
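For reference, a hedged sketch of how this fits together on the command line (the file name is illustrative; the flags are standard nvcc options documented at the link above):

# Compile a CUDA source file into an executable.
nvcc saxpy.cu -o saxpy

# Stop at the PTX stage so the generated virtual ISA can be inspected.
nvcc -ptx saxpy.cu -o saxpy.ptx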
Parallel Thread Execution (PTX)
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#syntax
https://docs.nvidia.com/cuda/parallel-thread-execution/