BCS702 Module 5 Textbook
In a typical SIMD system, each datapath carries out the test x[i] >= 0. Then the datapaths for which the test is true execute x[i] += 1, while those for which x[i] < 0 are idle. Then the roles of the datapaths are reversed: those for which x[i] >= 0 are idle while the other datapaths execute x[i] -= 2. See Table 6.1.
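For concreteness, here is the branch each datapath executes for its own element x[i]; the fragment is only an illustration of the test described above.

   if (x[i] >= 0)
      x[i] += 1;   /* datapaths with x[i] < 0 are idle here  */
   else
      x[i] -= 2;   /* datapaths with x[i] >= 0 are idle here */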
A typical GPU can be thought of as being composed of one or more SIMD processors. Nvidia GPUs are composed of Streaming Multiprocessors or SMs.¹ One SM can have several control units and many more datapaths. So an SM can be thought of as consisting of one or more SIMD processors. The SMs, however, operate asynchronously: there is no penalty if one branch of an if-else executes on one SM, and
1 The abbreviation that Nvidia uses for a streaming multiprocessor depends on the particular GPU mi-
croarchitecture. For example, Tesla and Fermi multiprocessors have SMs, Kepler multiprocessors have
SMXs, and Maxwell multiprocessors have SMMs. More recent GPUs have SMs. We’ll use SM, regard-
less of the microarchitecture.
the other executes on another SM. So in our preceding example, if all the threads with x[i] >= 0 were executing on one SM, and all the threads with x[i] < 0 were executing on another, the execution of our if-else example would require only two stages. (See Table 6.2.)
In Nvidia parlance, the datapaths are called cores, Streaming Processors, or SPs.
Currently,² one of the most powerful Nvidia processors has 82 SMs, and each SM has 128 SPs, for a total of 10,496 SPs. Since we use the term “core” to mean something
else when we’re discussing MIMD architectures, we’ll use SP to denote an Nvidia
datapath. Also note that Nvidia uses the term SIMT instead of SIMD. SIMT stands
for Single Instruction Multiple Thread, and the term is used because threads on an
SM that are executing the same instruction may not execute simultaneously: to hide
memory access latency, some threads may block while memory is accessed and other
threads, that have already accessed the data, may proceed with execution.
Each SM has a relatively small block of memory that is shared among its SPs. As
we’ll see, this memory can be accessed very quickly by the SPs. All of the SMs on a
single chip also have access to a much larger block of memory that is shared among
all the SPs. Accessing this memory is relatively slow. (See Fig. 6.1.)
The GPU and its associated memory are usually physically separate from the
CPU and its associated memory. In Nvidia documentation, the CPU together with its
associated memory is often called the host, and the GPU together with its memory
is called the device. In earlier systems the physical separation of host and device memories required that data be explicitly transferred between CPU memory and GPU memory. That is, a function was called that would transfer a block of data from host memory to device memory or vice versa. So, for example, data read from
a file by the CPU or output data generated by the GPU would have to be transferred
between the host and device with an explicit function call. However, in more recent
Nvidia systems (those with compute capability ≥ 3.0), the explicit transfers in the source code aren’t needed for correctness, although using them can improve overall performance. (See Fig. 6.2.)
2 Spring 2021.
FIGURE 6.1
Simplified block diagram of a GPU.
FIGURE 6.2
Simplified block diagram of a CPU and a GPU.
 1   #include <stdio.h>
 2   #include <cuda.h>   /* Header file for CUDA */
 3
 4   /* Device code: runs on GPU */
 5   __global__ void Hello(void) {
 6
 7      printf("Hello from thread %d!\n", threadIdx.x);
 8   }  /* Hello */
 9
10
11   /* Host code: runs on CPU */
12   int main(int argc, char* argv[]) {
13      int thread_count;   /* Number of threads to run on GPU */
14
15      thread_count = strtol(argv[1], NULL, 10);
16         /* Get thread_count from command line */
17
18      Hello <<<1, thread_count>>>();
19         /* Start thread_count threads on GPU */
20
21      cudaDeviceSynchronize();   /* Wait for GPU to finish */
22
23      return 0;
24   }  /* main */
Program 6.1: CUDA program that prints greetings from the threads.
$ ./cuda_hello 10
in triple angle brackets. If there were any arguments to the Hello function, we would
enclose them in the following parentheses.
The kernel specifies the code that each thread will execute. So each of our threads
will print a message
" Hello from thread %d\n"
The decimal int format specifier (%d) refers to the variable threadIdx.x. The struct
threadIdx is one of several variables defined by CUDA when a kernel is started. In
our example, the field x gives the relative index or rank of the thread that is executing.
So we use it to print a message containing the thread’s rank.
After a thread has printed its message, it terminates execution.
Notice that our kernel code uses the Single-Program Multiple-Data or SPMD
paradigm: each thread runs a copy of the same code on its own data. In this case,
the only thread-specific data is the thread rank stored in threadIdx.x.
One very important difference between the execution of an ordinary C function
and a CUDA kernel is that kernel execution is asynchronous. This means that the
call to the kernel on the host returns as soon as the host has notified the system
that it should start running the kernel, and even though the call in main has re-
turned, the threads executing the kernel may not have finished executing. The call to
cudaDeviceSynchronize in Line 21 forces the main function to wait until all the threads
executing the kernel have completed. If we omitted the call to cudaDeviceSynchronize,
our program could terminate before the threads produced any output, and it might ap-
pear that the kernel was never called.
When the host returns from the call to cudaDeviceSynchronize, the main function
then terminates as usual with return 0.
To summarize, then:
• Execution begins in main, which is running on the host.
• The number of threads is taken from the command line.
• The call to Hello starts the kernel.
• The <<<1, thread_count>>> in the call specifies that thread_count copies of the
kernel should be started on the device.
• When the kernel is started, the struct threadIdx is initialized by the system, and
in our example the field threadIdx.x contains the thread’s index or rank.
• Each thread prints its message and terminates.
• The call to cudaDeviceSynchronize in main forces the host to wait until all of the
threads have completed kernel execution before continuing and terminating.
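Before running the program, we need to compile it with Nvidia's nvcc compiler. A typical command sequence (the source file name cuda_hello.cu is an assumption) looks like this:

   $ nvcc -o cuda_hello cuda_hello.cu
   $ ./cuda_hello 10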
If thread_count is even, this kernel call will start a total of thread_count threads, and
the threads will be divided between the two SMs: thread_count/2 threads will run on
each SM. (What happens if thread_count is odd?)
CUDA organizes threads into blocks and grids. A thread block (or just a block
if the context makes it clear) is a collection of threads that run on a single SM. In a
kernel call the first value in the angle brackets specifies the number of thread blocks.
The second value is the number of threads in each thread block. So when we started
the kernel with
Hello < < <1 , thread_count > > >();
we were using one thread block, which consisted of thread_count threads, and, as a
consequence, we only used one SM.
We can modify our greetings program so that it uses a user-specified number of
blocks, each consisting of a user-specified number of threads. (See Program 6.2.) In
this program we get both the number of thread blocks and the number of threads in
each block from the command line. Now the kernel call starts blk_ct thread blocks,
each of which contains th_per_blk threads.
When the kernel is started, each block is assigned to an SM, and the threads in the
block are then run on that SM. The output is similar to the output from the original
program, except that now we’re using two system-defined variables: threadIdx.x and
blockIdx.x. As you’ve probably guessed, threadIdx.x gives a thread’s rank or index in
its block, and blockIdx.x gives a block’s rank in the grid.
A grid is just the collection of thread blocks started by a kernel. So a thread block
is composed of threads, and a grid is composed of thread blocks.
There are several built-in variables that a thread can use to get information on the
grid started by the kernel. The following four variables are structs that are initialized
in each thread’s memory when a kernel begins execution:
• threadIdx: the rank or index of the thread in its thread block.
• blockDim: the dimensions, shape, or size of the thread blocks.
• blockIdx: the rank or index of the block within the grid.
• gridDim: the dimensions, shape, or size of the grid.
 1   #include <stdio.h>
 2   #include <cuda.h>   /* Header file for CUDA */
 3
 4   /* Device code: runs on GPU */
 5   __global__ void Hello(void) {
 6
 7      printf("Hello from thread %d in block %d\n",
 8            threadIdx.x, blockIdx.x);
 9   }  /* Hello */
10
11
12   /* Host code: runs on CPU */
13   int main(int argc, char* argv[]) {
14      int blk_ct;       /* Number of thread blocks         */
15      int th_per_blk;   /* Number of threads in each block */
16
17      blk_ct = strtol(argv[1], NULL, 10);
18         /* Get number of blocks from command line */
19      th_per_blk = strtol(argv[2], NULL, 10);
20         /* Get number of threads per block from command line */
21
22      Hello <<<blk_ct, th_per_blk>>>();
23         /* Start blk_ct*th_per_blk threads on GPU */
24
25      cudaDeviceSynchronize();   /* Wait for GPU to finish */
26
27      return 0;
28   }  /* main */
Program 6.2: CUDA program that prints greetings from threads in multiple blocks.
5 Nvidia devices that have compute capability < 2 (see Section 6.7) only allow x- and y-dimensions in a
grid.
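A launch along these lines can be written with CUDA's dim3 type. The following sketch is an illustration chosen to match the numbers discussed next; the kernel name Hello, in particular, is an assumption.

   dim3 grid_dims(2, 3, 1);     /* 2 x 3 x 1 = 6 blocks in the grid */
   dim3 block_dims(4, 4, 4);    /* 4 x 4 x 4 = 64 threads per block */
   Hello <<<grid_dims, block_dims>>>();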
This should start a grid with 2 × 3 × 1 = 6 blocks, each of which has 4³ = 64 threads.
Note that all the blocks must have the same dimensions. More importantly, CUDA
requires that thread blocks be independent. So one thread block must be able to com-
plete its execution, regardless of the states of the other thread blocks: the thread
blocks can be executed sequentially in any order, or they can be executed in par-
allel. This ensures that the GPU can schedule a block to execute solely on the basis
of the state of that block: it doesn’t need to check on the state of any other block.6
6 With the introduction of CUDA 9 and the Pascal processor, it became possible to synchronize threads
in multiple blocks. See Subsection 7.1.13 and Exercise 7.6.
7 The values in this section are current as of spring 2021, but some of them may change when Nvidia
releases new GPUs and new versions of CUDA.
they fall in the range 0–7. CUDA no longer supports devices with compute capability
< 3.
For devices with compute capability > 1, the maximum number of threads per
block is 1024. For devices with compute capability 2.b, the maximum number of
threads that can be assigned to a single SM is 1536, and for devices with compute
capability > 2, the maximum is currently 2048. There are also limits on the sizes of
the dimensions in both blocks and grids. For example, for compute capability > 1,
the maximum x- or y-dimension is 1024, and the maximum z-dimension is 64. For
further information, see the appendix on compute capabilities in the CUDA C++
Programming Guide [11].
Nvidia also has names for the microarchitectures of its GPUs. Table 6.3 shows
the current list of architectures and some of their corresponding compute capabilities.
Somewhat confusingly, Nvidia also uses Tesla as the name for their products targeting
GPGPU.
We should note that Nvidia has a number of “product families” that can consist of
anything from an Nvidia-based graphics card to a “system on a chip,” which puts the main hardware components of a system, such as a mobile phone, in a single integrated circuit.
Finally, note that there are a number of versions of the CUDA API, and they do
not correspond to the compute capabilities of the different GPUs.
Program 6.3: Kernel and main function of a CUDA program that adds two vectors.
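The listing itself isn't reproduced here, but as a rough sketch (not necessarily the book's exact Program 6.3) the kernel might look like the following; the name Vec_add is an assumption, while my_elt and the arrays x, y, and z are taken from the discussion below.

   __global__ void Vec_add(
         const float  x[]  /* in  */,
         const float  y[]  /* in  */,
               float  z[]  /* out */,
         const int    n    /* in  */) {
      int my_elt = blockDim.x*blockIdx.x + threadIdx.x;

      /* The total number of threads may exceed n, so guard the access */
      if (my_elt < n)
         z[my_elt] = x[my_elt] + y[my_elt];
   }  /* Vec_add */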
When the kernel is started, the grid will contain a total of

   gridDim.x * blockDim.x

threads. So we can assign a unique “global” rank or index to each thread by using the formula

   my_elt = blockDim.x*blockIdx.x + threadIdx.x;

For example, if we have four blocks and five threads in each block, then the global ranks or indexes are shown in Table 6.4. In the kernel, we assign this global rank to my_elt and use this as the subscript for accessing each thread’s elements of the arrays x, y, and z.
Note that we’ve allowed for the possibility that the total number of threads may
not be exactly the same as the number of components of the vectors. So before car-
rying out the addition,
we first check that my_elt < n. For example, if we have n = 997, and we want at least two blocks with at least two threads per block, then, since 997 is prime, we can’t possibly have exactly 997 threads. Since this kernel needs to be executed by at least n threads, we must start more than 997. For example, we might use four blocks of 256 threads, and the last 27 threads in the last block would skip the line

   z[my_elt] = x[my_elt] + y[my_elt];
Note that if we needed to run our program on a system that didn’t support CUDA,
we could replace the kernel with a serial vector addition function. (See Program 6.4.)
So we can view the CUDA kernel as taking the serial for loop and assigning each
iteration to a different thread. This is often how we start the design process when
we want to parallelize a serial code for CUDA: assign the iterations of a loop to
individual threads.
Also note that if we apply Foster’s method to parallelizing the serial vector sum,
and we make the tasks the additions of the individual components, then we don’t
need to do anything for the communication and aggregation phases, and the mapping
phase simply assigns each addition to a thread.
1   void Serial_vec_add(
2         const float  x[]  /* in  */,
3         const float  y[]  /* in  */,
4               float  cz[] /* out */,
5         const int    n    /* in  */) {
6
7      for (int i = 0; i < n; i++)
8         cz[i] = x[i] + y[i];
9   }  /* Serial_vec_add */
6.8.2 Get_args
After declaring the variables, the main function calls a Get_args function, which re-
turns n, the number of elements in the arrays, blk_ct, the number of thread blocks,
and th_per_blk, the number of threads in each block. It gets these from the command
line. It also returns a char i_g. This tells the program whether the user will input x
and y or whether it should generate them using a random number generator. If the user
doesn’t enter the correct number of command line arguments, the function prints a
usage summary and terminates execution. Also if n is greater than the total number
of threads, it prints a message and terminates. (See Program 6.5.) Note that Get_args
is written in standard C, and it runs completely on the host.
The first three arrays are used on both the host and the device. The fourth array,
cz, is only used on the host: we use it to compute the vector sum with one core of
the host. We do this so that we can check the result computed on the device. (See
Program 6.6.)
First note that since cz is only used on the host, we allocate its storage using the
standard C library function malloc. For the other three arrays, we allocate storage in
Lines 9–11 using the CUDA function
__host__ cudaError_t cudaMallocManaged(
      void**    devPtr /* out */,
      size_t    size   /* in  */,
      unsigned  flags  /* in  */);
The __host__ qualifier is a CUDA addition to C, and it indicates that the function
should be called and run on the host. This is the default for functions in CUDA
 1   void Get_args(
 2         const int  argc          /* in  */,
 3         char*      argv[]        /* in  */,
 4         int*       n_p           /* out */,
 5         int*       blk_ct_p      /* out */,
 6         int*       th_per_blk_p  /* out */,
 7         char*      i_g           /* out */) {
 8      if (argc != 5) {
 9         /* Print an error message and exit */
10         ...
11      }
12
13      *n_p = strtol(argv[1], NULL, 10);
14      *blk_ct_p = strtol(argv[2], NULL, 10);
15      *th_per_blk_p = strtol(argv[3], NULL, 10);
16      *i_g = argv[4][0];
17
18      /* Is n > total thread count = blk_ct*th_per_blk? */
19      if (*n_p > (*blk_ct_p)*(*th_per_blk_p)) {
20         /* Print an error message and exit */
21         ...
22      }
23   }  /* Get_args */
Program 6.5: Get_args function from CUDA program that adds two vectors.
 1   void Allocate_vectors(
 2         float**  x_p  /* out */,
 3         float**  y_p  /* out */,
 4         float**  z_p  /* out */,
 5         float**  cz_p /* out */,
 6         int      n    /* in  */) {
 7
 8      /* x, y, and z are used on host and device */
 9      cudaMallocManaged(x_p, n*sizeof(float));
10      cudaMallocManaged(y_p, n*sizeof(float));
11      cudaMallocManaged(z_p, n*sizeof(float));
12
13      /* cz is only used on host */
14      *cz_p = (float*) malloc(n*sizeof(float));
15   }  /* Allocate_vectors */
Program 6.6: Array allocation function of CUDA program that adds two vectors.
programs, so it can be omitted when we’re writing our own functions, and they’ll
only be run on the host.
The return value, which has type cudaError_t, allows the function to return an er-
ror. Most CUDA functions return a cudaError_t value, and if you’re having problems
with your code, it is a very good idea to check it. However, always checking it tends
to clutter the code, and this can distract us from the main purpose of a program. So
in the code we discuss we’ll generally ignore cudaError_t return values.
The first argument is a pointer to a pointer: it refers to the pointer that’s being
allocated. The second argument specifies the number of bytes that should be allo-
cated. The flags argument controls which kernels can access the allocated memory.
It defaults to cudaMemAttachGlobal and can be omitted.
The function cudaMallocManaged is one of several CUDA memory allocation func-
tions. It allocates memory that will be automatically managed by the “unified memory
system.” This is a relatively recent addition to CUDA,8 and it allows a programmer
to write CUDA programs as if the host and device shared a single memory: pointers
referring to memory allocated with cudaMallocManaged can be used on both the device
and the host, even when the host and the device have separate physical memories. As
you can imagine this greatly simplifies programming, but there are some cautions.
Here are a few:
1. Unified memory requires a device with compute capability ≥ 3.0, and a 64-bit
host operating system.
2. On devices with compute capability < 6.0 memory allocated with
cudaMallocManaged cannot be simultaneously accessed by both the device and the
host. When a kernel is executing, it has exclusive access to memory allocated with
cudaMallocManaged.
3. Kernels that use unified memory can be slower than kernels that treat device mem-
ory as separate from host memory.
The last caution has to do with the transfer of data between the host and the device.
When a program uses unified memory, it is up to the system to decide when to transfer
from the host to the device or vice versa. In programs that explicitly transfer data, it is
up to the programmer to include code that implements the transfers, and she may be
able to exploit her knowledge of the code to do things that reduce the cost of transfers,
things such as omitting some transfers or overlapping data transfer with computation.
At the end of this section we’ll briefly discuss the modifications required if you
want to explicitly handle the transfers between host and device.
The function Init_vectors either reads x and y from stdin using scanf or generates
them using the C library function random. It uses the last command line argument i_g
to decide which it should do.
The Serial_vec_add function (Program 6.4) just adds x and y on the host using a
for loop. It stores the result in the host array cz.
The Two_norm_diff function computes the “distance” between the vector z com-
puted by the kernel and the vector cz computed by Serial_vec_add. So it takes the
difference between corresponding components of z and cz, squares them, adds the
squares, and takes the square root:
√( (z[0] − cz[0])² + (z[1] − cz[1])² + · · · + (z[n−1] − cz[n−1])² ).
 1   double Two_norm_diff(
 2         const float  z[]  /* in */,
 3         const float  cz[] /* in */,
 4         const int    n    /* in */) {
 5      double diff, sum = 0.0;
 6
 7      for (int i = 0; i < n; i++) {
 8         diff = z[i] - cz[i];
 9         sum += diff*diff;
10      }
11      return sqrt(sum);
12   }  /* Two_norm_diff */
Program 6.7: C function that finds the distance between two vectors.
The qualifier __device__ is a CUDA addition to C, and it indicates that the function
can be called from the device. So cudaFree can be called from the host or the device.
However, if a pointer is allocated on the device, it cannot be freed on the host, and
vice versa.
It’s important to note that unless memory allocated on the device is explicitly
freed by the program, it won’t be freed until the program terminates. So if a CUDA
program calls two (or more) kernels, and the memory used by the first kernel isn’t
explicitly freed before the second is called, it will remain allocated, regardless of
whether the second kernel actually uses it.
See Program 6.8.
 1   void Free_vectors(
 2         float*  x  /* in/out */,
 3         float*  y  /* in/out */,
 4         float*  z  /* in/out */,
 5         float*  cz /* in/out */) {
 6
 7      /* Allocated with cudaMallocManaged */
 8      cudaFree(x);
 9      cudaFree(y);
10      cudaFree(z);
11
12      /* Allocated with malloc */
13      free(cz);
14   }  /* Free_vectors */
9 If your device has compute capability ≥ 3.0, you can skip this section.
Program 6.9: Part of CUDA program that implements vector addition without unified
memory.
 1   void Allocate_vectors(
 2         float**  hx_p /* out */,
 3         float**  hy_p /* out */,
 4         float**  hz_p /* out */,
 5         float**  cz_p /* out */,
 6         float**  dx_p /* out */,
 7         float**  dy_p /* out */,
 8         float**  dz_p /* out */,
 9         int      n    /* in  */) {
10
11      /* dx, dy, and dz are used on device */
12      cudaMalloc(dx_p, n*sizeof(float));
13      cudaMalloc(dy_p, n*sizeof(float));
14      cudaMalloc(dz_p, n*sizeof(float));
15
16      /* hx, hy, hz, cz are used on host */
17      *hx_p = (float*) malloc(n*sizeof(float));
18      *hy_p = (float*) malloc(n*sizeof(float));
19      *hz_p = (float*) malloc(n*sizeof(float));
20      *cz_p = (float*) malloc(n*sizeof(float));
21   }  /* Allocate_vectors */
Program 6.10: Allocate_vectors function for CUDA vector addition program that
doesn’t use unified memory.
The first argument is a reference to a pointer that will be used on the device. The
second argument specifies the number of bytes to allocate on the device.
After we’ve initialized hx and hy on the host, we copy their contents over to the
device, storing the transferred contents in the memory allocated for dx and dy, respec-
tively. The copying is done in Lines 24–26 using the CUDA function cudaMemcpy:
__host__ cudaError_t cudaMemcpy(
      void*           dest   /* out */,
      const void*     source /* in  */,
      size_t          count  /* in  */,
      cudaMemcpyKind  kind   /* in  */);
This copies count bytes from the memory referred to by source into the memory
referred to by dest. The type of the kind argument, cudaMemcpyKind, is an enumer-
ated type defined by CUDA that specifies where the source and dest pointers are
located. For our purposes the two values of interest are cudaMemcpyHostToDevice and
cudaMemcpyDeviceToHost. The first indicates that we’re copying from the host to the
device, and the second indicates that we’re copying from the device to the host.
The call to the kernel in Line 28 uses the pointers dx, dy, and dz, because these are
addresses that are valid on the device.
After the call to the kernel, we copy the result of the vector addition from the
device to the host in Line 31 using cudaMemcpy again. A call to cudaMemcpy is syn-
chronous, so it waits for the kernel to finish executing before carrying out the transfer.
So in this version of vector addition we do not need to use cudaDeviceSynchronize to
ensure that the kernel has completed before proceeding.
After copying the result from the device back to the host, the program checks the
result, frees the memory allocated on the host and the device, and terminates. So for
this part of the program, the only difference from the original program is that we’re
freeing seven pointers instead of four. As before, the Free_vectors function frees the
storage allocated on the host with the C library function free. It uses cudaFree to free
the storage allocated on the device.
It’s likely that either the host will print -5 or the device will hang. The reason is that the address &sum is probably invalid on the device. So the dereference in the kernel statement

   *sum_p = x + y;

is an attempt to access an invalid memory location.
If your system doesn’t support unified memory, the same idea will work, but the
result will have to be explicitly copied from the device to the host:
__global__ void Add(int x, int y, int* sum_p) {
   *sum_p = x + y;
}  /* Add */
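For example, the host code might look something like the following sketch; the variable names and the <<<1, 1>>> launch are assumptions, but cudaMalloc, cudaMemcpy, and cudaFree are the standard CUDA calls for this.

   int main(void) {
      int  hsum;      /* result on the host   */
      int* dsum_p;    /* result on the device */

      cudaMalloc(&dsum_p, sizeof(int));
      Add <<<1, 1>>>(2, 3, dsum_p);     /* kernel stores its result in *dsum_p */
      cudaMemcpy(&hsum, dsum_p, sizeof(int), cudaMemcpyDeviceToHost);
      printf("The sum is %d\n", hsum);
      cudaFree(dsum_p);
      return 0;
   }  /* main */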
Note that in both the unified and non-unified memory settings, we’re returning a
single value from the device to the host.
If unified memory is available, another option is to use a global managed variable for the sum:

   __managed__ int sum;

The kernel then assigns its result directly to sum, and, after waiting for the kernel to finish, the host can print the result:

   cudaDeviceSynchronize();
   printf("After kernel: The sum is %d\n", sum);
   return 0;
The qualifier __managed__ declares sum to be a managed int that is accessible to all the
functions, regardless of whether they run on the host or the device. Since it’s man-
aged, the same restrictions apply to it that apply to managed variables allocated with
cudaMallocManaged. So this option is unavailable on systems with compute capability
< 3.0, and on systems with compute capability < 6.0, sum can’t be accessed on the
host while the kernel is running. So after the call to Add has started, the host can’t
access sum until after the call to cudaDeviceSynchronize has completed.
Since this last approach uses a global variable, it has the usual problem of reduced
modularity associated with global variables.
h = (b − a)/n.

The subinterval endpoints are then

   x_i = a + ih,

for i = 0, 1, 2, . . . , n − 1. To simplify the notation, we’ll also denote b, the right end point of the interval, as

   b = x_n = a + nh.
Recall that if a trapezoid has height h and base lengths c and d, then its area is

   (h/2)(c + d).

So if we think of the length of the subinterval [x_i, x_{i+1}] as the height of the ith trapezoid, and f(x_i) and f(x_{i+1}) as the two base lengths (see Fig. 3.4), then the area of the ith trapezoid is

   (h/2)[f(x_i) + f(x_{i+1})].

This gives us a total approximation of the area between the graph and the x-axis as

   (h/2)[f(x_0) + f(x_1)] + (h/2)[f(x_1) + f(x_2)] + · · · + (h/2)[f(x_{n−1}) + f(x_n)],

and we can rewrite this as

   h [ (f(a) + f(b))/2 + f(x_1) + f(x_2) + · · · + f(x_{n−1}) ].
We can implement this with the serial function shown in Program 6.11.
 1   float Serial_trap(
 2         const float  a  /* in */,
 3         const float  b  /* in */,
 4         const int    n  /* in */) {
 5      float x, h = (b-a)/n;
 6      float trap = 0.5*(f(a) + f(b));
 7
 8      for (int i = 1; i <= n-1; i++) {
 9         x = a + i*h;
10         trap += f(x);
11      }
12      trap = trap*h;
13
14      return trap;
15   }  /* Serial_trap */
Program 6.11: A serial function implementing the trapezoidal rule for a single CPU.
However, it’s immediately obvious that there are several problems here:
1. We haven’t initialized h or trap.
2. The my_i value can be too large or too small: the serial loop ranges from 1 up to
and including n − 1. The smallest value for my_i is 0 and the largest is the total
number of threads minus 1.
3. The variable trap must be shared among the threads. So the addition of my_trap
forms a race condition: when multiple threads try to update trap at roughly the
same time, one thread can overwrite another thread’s result, and the final value in
trap may be wrong. (For a discussion of race conditions, see Section 2.4.3.)
4. The variable trap in the serial code is returned by the function, and, as we’ve seen,
kernels must have void return type.
5. We see from the serial code that we need to multiply the total in trap by h after all
of the threads have added their results.
Program 6.12 shows how we might deal with these problems. In the following
sections, we’ll look at the rationales for the various choices we’ve made.
There are (at least) a couple of problems with these options: formal arguments to functions are private to the executing thread, and there is also the issue of thread synchronization.
Since trap must be shared, we can instead allocate storage for it on the host and pass a pointer to the memory location to the kernel. That is, we can do something like this:
/* Host code */
float* trap_p;
cudaMallocManaged(&trap_p, sizeof(float));
...
/* Call kernel */
...
/* After return from kernel */
*trap_p = h*(*trap_p);
When we do this, each thread will get its own copy of trap_p, but all of the copies of trap_p will refer to the same memory location. So *trap_p will be shared.
Note that using a pointer instead of a simple float also solves the problem of
returning the value of trap in Item 4.
A wrapper function
If you look at the code in Program 6.12, you’ll see that we’ve placed most of the
code we use before and after calling the kernel in a wrapper function, Trap_wrapper.
A wrapper function is a function whose main purpose is to call another function. It
can perform any preparation needed for the call. It can also perform any additional
work needed after the call.
As we noted earlier, having each thread add its result directly into *trap_p forms a race condition, and the actual value ultimately stored in *trap_p will be unpredictable. We’re solving this problem by using a special CUDA library function, atomicAdd, to carry out the addition.
An operation carried out by a thread is atomic if it appears to all the other threads
as if it were “indivisible.” So if another thread tries to access the result of the operation
or an operand used in the operation, the access will occur either before the operation
started or after the operation completed. Effectively, then, the operation appears to
consist of a single, indivisible, machine instruction.
As we saw earlier (see Section 2.4.3), addition is not ordinarily an atomic op-
eration: it consists of several machine instructions. So if one thread is executing an
addition, it’s possible for another thread to access the operands and the result while
the addition is in progress. Because of this, the CUDA library defines several atomic
addition functions. The one we’re using has the following syntax:
__device__ float atomicAdd(
      float*  float_p /* in/out */,
      float   val     /* in     */);
This atomically adds the contents of val to the contents of the memory referred to
by float_p and stores the result in the memory referred to by float_p. It returns the
value of the memory referred to by float_p at the beginning of the call. See Line 14
of Program 6.12.
The same approach can be used to time the serial trapezoidal rule:
GET_TIME(start);
trap = Serial_trap(a, b, n);
GET_TIME(finish);
printf("Elapsed time for cpu = %e seconds\n",
      finish-start);
Recall from the section on taking timings (Section 2.6.4) that we take a number
of timings, and we ordinarily report the minimum elapsed time. However, if the vast majority of the times are much greater than the minimum (e.g., more than 1% greater), then the minimum time may not be reproducible, and other users who run the program may get a time much larger than ours. When this happens, we report the mean or median of the elapsed times.

Table 6.5 Mean run-times for serial and CUDA trapezoidal rule (times are in ms).

                ARM            Nvidia     Intel      Nvidia GeForce
System          Cortex-A15     GK20A      Core i7    GTX Titan X
Clock           2.3 GHz        852 MHz    3.5 GHz    1.08 GHz
SMs, SPs                       1, 192                24, 3072
Run-time        33.6           20.7       4.48       3.08
Now when we ran this program on our hardware, there were a number of times
that were within 1% of the minimum time. However, we’ll be comparing the run-
times of this program with programs that had very few run-times within 1% of the
minimum. So for our discussion of implementing the trapezoidal rule using CUDA
(Sections 6.10–6.13), we’ll use the mean run-time, and the means are taken over at
least 50 executions.
When we run the serial trapezoidal and the CUDA trapezoidal rule functions
many times and take the means of the elapsed times, we get the results shown in Ta-
ble 6.5. These were taken using n = 2²⁰ = 1,048,576 trapezoids with f(x) = x² + 1, a = −3, and b = 3. The GPUs use 1024 blocks with 1024 threads per block for a
total of 1,048,576 threads. The 192 SPs of the GK20A are clearly much faster than a
fairly slow conventional processor, an ARM Cortex-A15, but a single core of an Intel
Core i7 is much faster than the GK20A. The 3072 SPs on a Titan X were 45% faster
than the single core of the Intel, but it would seem that with 3072 SPs, we should be
able to do better.
For example, suppose we have only 8 threads and one thread block. Then our threads are 0, 1, . . . , 7, and one of the threads will be the first to succeed with the call to atomicAdd. Say it’s thread 5. Then another thread will succeed. Say it’s thread 2. Continuing in this fashion we get a sequence of atomicAdds, one per thread. Table 6.6 shows how this might proceed over time. Here, we’re trying to keep the computations simple: we’re assuming that f(x) = 2x + 1, a = 0, and b = 8. So h = (8 − 0)/8 = 1, and the value referenced by trap_p at the start of the global sum is 0.5 × (f(a) + f(b)) = 0.5 × (1 + 17) = 9.
What’s important is that this approach may serialize the threads. So the com-
putation may require a sequence of 8 calculations. Fig. 6.3 illustrates a possible
computation.
So rather than have each thread wait for its turn to do an addition into ∗trap_p,
we can pair up the threads so that half of the “active” threads add their partial sum to
their partner’s partial sum. This gives us a structure that resembles a tree (or, perhaps
better, a shrub). See Fig. 6.4.
In our figures, we’ve gone from requiring a sequence of 8 consecutive additions to
a sequence of 4. More generally, if we double the number of threads and values (e.g.,
increase from 8 to 16), we’ll double the length of the sequence of additions using the
basic approach, while we’ll only add one using the second, tree-structured approach.
For example, if we increase the number of threads and values from 8 to 16, the first
approach requires a sequence of 16 additions, but the tree-structured approach only
requires 5. In fact, if there are t threads and t values, the first approach requires a
sequence of t additions, while the tree-structured approach requires log₂(t) + 1. For
example, if we have 1000 threads and values, we’ll go from 1000 communications
and sums using the basic approach to 11 using the tree-structured approach, and if
we have 1,000,000, we’ll go from 1,000,000 to 21!
There are two standard implementations of a tree-structured sum in CUDA. One implementation uses shared memory, and in devices with compute capability < 3 this is the only option, since the warp shuffle functions used by the other implementation require compute capability ≥ 3.0.
FIGURE 6.3
Basic sum.
FIGURE 6.4
Tree-structured sum.
We can think of the memory of a GPU as a hierarchy with three “levels.” At the bottom is the slowest, largest level: global
memory. In the middle is a faster, smaller level: shared memory. At the top is the
fastest, smallest level: the registers. For example, Table 6.7 gives some information
on relative sizes. Access times also increase dramatically. It takes on the order of 1
cycle to copy a 4-byte int from one register to another. Depending on the system it
can take up to an order of magnitude more time to copy from one shared memory
location to another, and it can take from two to three orders of magnitude more time
to copy from one global memory location to another.
An obvious question here: what about local variables? How much storage is avail-
able for them? And how fast is it? This depends on total available memory and
program memory usage. If there is enough storage, local variables are stored in regis-
ters. However, if there isn’t enough register storage, local variables are “spilled” to a
region of global memory that’s thread private, i.e., only the thread that owns the local
variables can access them.
So as long as we have sufficient register storage, we expect the performance of
a kernel to improve if we increase our use of registers and reduce our use of shared
and/or global memory. The catch, of course, is that the storage available in registers
is tiny compared to the storage available in shared and global memory.
The threads in a warp operate in SIMD fashion. So threads in different warps can
execute different statements with no penalty, while threads within the same warp
must execute the same statement. When the threads within a warp attempt to execute
different statements—e.g., they take different branches in an if −else statement—the
threads are said to have diverged. When divergent threads finish executing different
statements, and start executing the same statement, they are said to have converged.
The rank of a thread within a warp is called the thread’s lane, and it can be com-
puted using the formula
lane = threadIdx.x % warpSize;
The warp shuffle functions allow the threads in a warp to read from registers used
by another thread in the same warp. Let’s take a look at the one we’ll use to implement
a tree-structured sum of the values stored by the threads in a warp:¹⁰

__device__ float __shfl_down_sync(
      unsigned  mask   /* in */,
      float     var    /* in */,
      unsigned  diff   /* in */,
      int       width = warpSize /* in */);
The mask argument indicates which threads are participating in the call. A bit,
representing the thread’s lane, must be set for each participating thread to ensure
that all of the threads in the call have converged—i.e., arrived at the call—before
any thread begins executing the call to __shfl_down_sync. We’ll ordinarily use all the
threads in the warp. So we’ll usually define
10 Note that the syntax of the warp shuffles was changed in CUDA 9.0. So you may run across CUDA
programs that use the older syntax.
mask = 0xffffffff;

Recall that 0x denotes a hexadecimal (base 16) value and 0xf is 15 in decimal, which is 1111 in binary. So this value of mask is 32 1’s in binary, and it indicates that every thread
in the warp participates in the call to __shfl_down_sync. If the thread with lane l calls
__shfl_down_sync, then the value stored in var on the thread with
lane = l + diff
is returned on thread l. Since diff has type unsigned, it is ≥ 0. So the value that’s
returned is from a higher-ranked thread. Hence the name “shuffle down.”
We’ll only use width = warpSize, and since its default value is warpSize, we’ll omit
it from our calls.
There are several possible issues:
• What happens if thread l calls __shfl_down_sync but thread l + diff doesn’t? In
this case, the value returned by the call on thread l is undefined.
• What happens if thread l calls __shfl_down_sync but l + diff ≥ warpSize? In this
case the call will return the value in var already stored on thread l.
• What happens if thread l calls __shfl_down_sync, and l + diff < warpSize, but l + diff is greater than the largest lane in the warp? In other words, because the thread block size is not a multiple of warpSize, the last warp in the block has fewer than warpSize threads. Say there are w threads in the last warp, where 0 < w < warpSize. Then if

   l + diff ≥ w,

the value returned by the call is also undefined.
So to avoid undefined results, it’s best if
• All the threads in the warp call __shfl_down_sync, and
• All the warps have warpSize threads, or, equivalently, the thread block size
(blockDim.x) is a multiple of warpSize.
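Fig. 6.5 illustrates a warp-level sum built from __shfl_down_sync. A minimal sketch of such a function is shown below; the name Warp_sum matches later references in the text, but the body is an illustration rather than the book's exact listing.

   __device__ float Warp_sum(float var) {
      unsigned mask = 0xffffffff;   /* every lane in the warp participates */

      /* diff = 16, 8, 4, 2, 1: a tree-structured sum across the warp */
      for (unsigned diff = warpSize/2; diff > 0; diff = diff/2)
         var += __shfl_down_sync(mask, var, diff);

      /* Only lane 0 is guaranteed to end up with the full warp sum */
      return var;
   }  /* Warp_sum */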
FIGURE 6.5
Tree-structured sum using warp shuffle.
Fig. 6.5 shows how the function would operate if warpSize were 8. (The diagram
would be illegible if we used a warpSize of 32.) Perhaps the most confusing point in
the behavior of __shfl_down_sync is that when the lane ID
l + diff ≥ warpSize,
the call returns the value in the caller’s var. In the diagram this is shown by having
only one arrow entering the oval with the sum, and it’s labeled with the value just
calculated by the thread carrying out the sum. In the row corresponding to diff = 4
(the first row of sums), the threads with lane IDs l = 4, 5, 6, and 7 all have l + 4 ≥ 8.
So the call to __shfl_down_sync returns their current var values, 9, 11, 13, and 15, re-
spectively, and these values are doubled, because the return value of the call is added
into the calling thread’s variable var. Similar behavior occurs in the row correspond-
ing to the sums for diff = 2 and lane IDs l = 6 and 7, and in the last row when
diff = 1 for the thread with lane ID l = 7.
From a practical standpoint, it’s important to remember that this implementation
will only return the correct sum on the thread with lane ID 0. If all of the threads
need the result, we can use an alternative warp shuffle function, __shfl_xor. See Ex-
ercise 6.6.
FIGURE 6.6
Dissemination sum using shared memory.
Since the threads belonging to a single warp operate synchronously, we can im-
plement something very similar to a warp shuffle using shared memory instead of
registers.
__device__ float Shared_mem_sum(float shared_vals[]) {
   int my_lane = threadIdx.x % warpSize;
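   /* Sketch of the rest of the function (an assumption, consistent with the
      discussion that follows): on each pass every thread reads
      shared_vals[source] and adds it into shared_vals[my_lane]. */
   for (int diff = warpSize/2; diff > 0; diff = diff/2) {
      int source = (my_lane + diff) % warpSize;
      shared_vals[my_lane] += shared_vals[source];
   }

   /* Every thread ends up with the sum of all the values in the warp */
   return shared_vals[my_lane];
}  /* Shared_mem_sum */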
This should be called by all the threads in a warp, and the array shared_vals should be
stored in the shared memory of the SM that’s running the warp. Since the threads in
the warp are operating in SIMD fashion, they effectively execute the code of the
function in lockstep. So there’s no race condition in the updates to shared_vals:
all the threads read the values in shared_vals[source] before any thread updates
shared_vals[my_lane].
Technically speaking, this isn’t a tree-structured sum. It’s sometimes called a dis-
semination sum or dissemination reduction. Fig. 6.6 illustrates the copying and
additions that take place. Unlike the earlier figures, this figure doesn’t show the di-
rect contributions that a thread makes to its sums: including these lines would have
made the figure too difficult to read. Also note that every thread reads a value from
another thread in each pass through the for statement. After all these values have
been added in, every thread has the correct sum—not just thread 0. Although we
won’t need this for the trapezoidal rule, this can be useful in other applications. Fur-
thermore, in any cycle in which the threads in a warp are working, each thread either
executes the current instruction or it is idle. So the cost of having every thread exe-
cute the same instruction shouldn’t be any greater than having some of the threads
execute one instruction and the others idle.
An obvious question here is: how does Shared_mem_sum make use of Nvidia’s
shared memory? The answer is that it’s not required to use shared memory. The
function’s argument, the array shared_vals, could reside in either global memory or
shared memory. In either case, the function would return the sum of the elements of
shared_vals.
However, to get the best performance, the argument shared_vals should be defined
to be __shared__ in a kernel. For example, if we know that shared_vals will need to
store at most 32 floats in each thread block, we can add this definition to our kernel:

   __shared__ float shared_vals[32];

For each thread block this sets aside storage for a collection of 32 floats in the shared
memory of the SM to which the block is assigned.
Alternatively, if it isn’t known at compile time how much shared memory is needed, it can be declared as

   extern __shared__ float shared_vals[];

and when the kernel is called, a third argument can be included in the triple angle brackets specifying the size in bytes of the block of shared memory. For example, if we were using Shared_mem_sum in a trapezoidal rule program, we might call the kernel Dev_trap with

   Dev_trap <<<blk_ct, th_per_blk, th_per_blk*sizeof(float)>>>(...);

This would allocate storage for th_per_blk floats in the shared_vals array in each thread block.
Program 6.13: CUDA kernel implementing trapezoidal rule and using Warp_sum.
thread’s calculation directly into ∗trap_p, each warp (or, in this case, thread block)
calls the Warp_sum function (Fig. 6.5) to add the values computed by the threads in the
warp. Then, when the warp returns, thread (or lane) 0 adds the warp sum for its thread
block (result) into the global total. Since, in general, this version will use multiple
thread blocks, there will be multiple warp sums that need to be added to ∗trap_p.
So if we didn’t use atomicAdd, the addition of result to ∗trap_p would form a race
condition.
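As a rough sketch (not necessarily the book's exact Program 6.13), a kernel along these lines, assuming one 32-thread warp per block and the Warp_sum function sketched earlier, might be:

   __global__ void Dev_trap(
         const float  a       /* in     */,
         const float  b       /* in     */,
         const float  h       /* in     */,
         const int    n       /* in     */,
         float*       trap_p  /* in/out */) {
      int my_i = blockDim.x*blockIdx.x + threadIdx.x;
      float my_trap = 0.0f;

      if (0 < my_i && my_i < n)
         my_trap = f(a + my_i*h);

      /* Every thread in the warp calls Warp_sum; lane 0 gets the warp total */
      float result = Warp_sum(my_trap);

      if (threadIdx.x == 0)
         atomicAdd(trap_p, result);   /* one addition per thread block */
   }  /* Dev_trap */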
Program 6.14: CUDA kernel implementing trapezoidal rule and using shared mem-
ory.
The main difference from the warp shuffle version is that this kernel declares an array of shared memory in Line 7; it initializes this array in Lines 11 and 14; and, of course, the call to Shared_mem_sum is passed this array rather than a scalar register. Since we know at compile time how much storage we’ll need in shared_vals, we can define the array by simply preceding the ordinary C definition with the CUDA qualifier __shared__:

   __shared__ float shared_vals[WARPSZ];
Note that the CUDA-defined variable warpSize is not defined at compile time. So our program defines a preprocessor macro:

   #define WARPSZ 32
6.12.4 Performance
Of course, we want to see how the various implementations perform. (See Table 6.8.)
The problem is the same as the problem we ran earlier (see Table 6.5): we’re integrating f(x) = x² + 1 on the interval [−3, 3], and there are 2²⁰ = 1,048,576 trapezoids. However, since the thread block size is 32, we’re using 32,768 thread blocks (32 × 32,768 = 1,048,576).
Table 6.8 Mean run-times for trapezoidal rule using block size of 32 threads (times in ms).

                    ARM            Nvidia     Intel      Nvidia GeForce
System              Cortex-A15     GK20A      Core i7    GTX Titan X
Clock               2.3 GHz        852 MHz    3.5 GHz    1.08 GHz
SMs, SPs                           1, 192                24, 3072
Original            33.6           20.7       4.48       3.08
Warp Shuffle                       14.4                  0.210
Shared Memory                      15.0                  0.206
We see that on both systems and with both sum implementations, the new pro-
grams do significantly better than the original. For the GK20A, the warp shuffle
version runs in about 70% of the time of the original, and the shared memory version
runs in about 72% of the time of the original. For the Titan X, the improvements are
much more impressive: both versions run in less than 7% of the time of the original.
Perhaps most striking is the fact that on the Titan X, the warp shuffle is, on average,
slightly slower than the shared memory version.
6.13 CUDA trapezoidal rule III: blocks with more than one warp
Limiting ourselves to thread blocks with only 32 threads reduces the power and flex-
ibility of our CUDA programs. For example, devices with compute capability ≥ 2.0
can have blocks with as many as 1024 threads or 32 warps, and CUDA provides a
fast barrier that can be used to synchronize all the threads in a block. So if we limited
ourselves to only 32 threads in a block, we wouldn’t be using one of the most useful
features of CUDA: the ability to efficiently synchronize large numbers of threads.
So what would a “block” sum look like if we allowed ourselves to use blocks with
up to 1024 threads? We could use one of our existing warp sums to add the values
computed by the threads in each warp. Then we would have as many as 1024/32 = 32
warp sums, and we could use one warp in the thread block to add the warp sums.
Since two threads belong to the same warp if their ranks in the block have the
same quotient when divided by warpSize, to add the warp sums, we can use warp 0,
the threads with ranks 0, 1, . . . , 31 in the block.
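In code, a thread can find its warp and its lane within the warp like this (a small sketch; the variable names match the ones used later in this section):

   int my_warp = threadIdx.x / warpSize;   /* which warp within the block */
   int my_lane = threadIdx.x % warpSize;   /* rank within the warp        */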
6.13.1 __syncthreads
We might try to use the following pseudocode for finding the sum of the values com-
puted by all the threads in a block:
Each thread computes its contribution ;
Each warp adds its threads ’ contributions ;
Warp 0 in block adds warp sums ;
However, there’s a race condition. Do you see it? When warp 0 tries to compute the
total of the warp sums in the block, it doesn’t know whether all the warps in the
block have completed their sums. For example, suppose we have two warps, warp 0
and warp 1, each of which has 32 threads. Recall that the threads in a warp operate in
SIMD fashion: no thread in the warp proceeds to a new instruction until all the threads
in the warp have completed (or skipped) the current instruction. But the threads in
warp 0 can operate independently of the threads in warp 1. So if warp 0 finishes
computing its sum before warp 1 computes its sum, warp 0 could try to add warp 1’s
sum to its sum before warp 1 has finished, and, in this case, the block sum could be
incorrect.
So we must make sure that warp 0 doesn’t start adding up the warp sums until all
of the warps in the block are done. We can do this by using CUDA’s fast barrier:
__device__ void __syncthreads(void);
This will cause the threads in the thread block to wait in the call until all of the threads
have started the call. Using __syncthreads, we can modify our pseudocode so that the
race condition is avoided:
Each thread computes its contribution ;
Each warp adds its threads ’ contributions ;
__syncthreads ();
Warp 0 in block adds warp sums ;
Now warp 0 won’t be able to add the warp sums until every warp in the block has
completed its sum.
There are a couple of important caveats when we use __syncthreads. First, it’s
critical that all of the threads in the block execute the call. For example, if the block
contains at least two threads, and our code includes something like this:
int my_x = threadIdx.x;
if (my_x < blockDim.x/2)
   __syncthreads();
my_x++;
then only half the threads in the block will call __syncthreads, and these threads can’t
proceed until all the threads in the block have called __syncthreads. So they will wait
forever for the other threads to call __syncthreads.
The second caveat is that __syncthreads only synchronizes the threads in a
block. If a grid contains at least two blocks, and if all the threads in the grid call
__syncthreads then the threads in different blocks will continue to operate indepen-
dently of each other. So we can’t synchronize the threads in a general grid with
__syncthreads.
12 CUDA 9 includes an API that allows programs to define barriers across more general collections of threads than thread blocks, but defining a barrier across multiple thread blocks requires hardware support that’s not available in processors with compute capability < 6.
Now each warp will store its threads’ calculations in a subarray of thread_calcs:
float* shared_vals = thread_calcs + my_warp*warpSize;
In this setting a thread stores its contribution in the subarray referred to by shared_vals.
Calls to __syncthreads are fast, but they’re not free: every thread in the thread
block will have to wait until all the threads in the block have called __syncthreads. So
this can be costly. For example, if there are more threads in the block than there are
SPs in an SM, the threads in the block won’t all be able to execute simultaneously. So
some threads will be delayed reaching the second call to __syncthreads, and all of the
threads in the block will be delayed until the last thread is able to call __syncthreads.
So we should only call __syncthreads() when we have to.
Alternatively, each warp could store its warp sum in the “first” element of its
subarray:
float my_result = Shared_mem_sum(shared_vals);
if (my_lane == 0) shared_vals[0] = my_result;
__syncthreads();
...
It might at first appear that this would result in a race condition when the thread with
lane 0 attempts to update shared_vals, but the update is OK. Can you explain why?
CUDA divides the shared memory of an SM into 32 “banks” (16 for GPUs with compute capability < 2.0). This is done so
that the 32 threads in a warp can simultaneously access shared memory: the threads
in a warp can simultaneously access shared memory when each thread accesses a
different bank.
Table 6.9 illustrates the organization of thread_calcs. In the table, the columns
are banks, and the rows show the subscripts of consecutive elements of thread_calcs.
So the 32 threads in a warp can simultaneously access the 32 elements in any one of
the rows, or, more generally, if each thread access is to a different column.
When two or more threads access different elements in a single bank (or column in
the table), then those accesses must be serialized. So the problem with our approach
to saving the warp sums in elements 0, 32, 64, . . . , 992 is that these are all in the
same bank. So when we try to execute them, the GPU will serialize access, e.g.,
element 0 will be written, then element 32, then element 64, etc. So the writes will
take something like 32 times as long as it would if the 32 elements were stored in
different banks, e.g., a row of the table.
The details of bank access are a little complicated and some of the details depend
on the compute capability, but the main points are
• If each thread in a warp accesses a different bank, the accesses can happen simul-
taneously.
• If multiple threads access different memory locations in a single bank, the accesses
must be serialized.
• If multiple threads read the same memory location in a bank, the value read is
broadcast to the reading threads, and the reads are simultaneous.
The CUDA C++ Programming Guide [11] provides full details.
Thus we could exploit the use of the shared memory banks if we stored the results
in a contiguous subarray of shared memory. Since each thread block can use at least
16 Kbytes of shared memory, and our “current” definition of shared_vals only uses
at most 1024 floats or 4 Kbytes of shared memory, there is plenty of shared memory
available for storing 32 more floats.
So if we’re using shared memory warp sums, a simple solution is to declare two
arrays of shared memory: one for storing the computations made by each thread, and
another for storing the warp sums.
__shared__ float thread_calcs[MAX_BLKSZ];
__shared__ float warp_sum_arr[WARPSZ];
float* shared_vals = thread_calcs + my_warp*warpSize;
...
float my_result = Shared_mem_sum(shared_vals);
if (my_lane == 0) warp_sum_arr[my_warp] = my_result;
__syncthreads();
...
6.13.5 Finishing up
The remaining codes for the warp sum kernel and the shared memory sum kernel are
very similar. First warp 0 computes the sum of the elements in warp_sum_arr. Then
thread 0 in the block adds the block sum into the total across all the threads in the
grid using atomicAdd. Here’s the code for the shared memory sum:
if (my_warp == 0) {
   if (threadIdx.x >= blockDim.x/warpSize)
      warp_sum_arr[threadIdx.x] = 0.0;
   blk_result = Shared_mem_sum(warp_sum_arr);
}
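After this, thread 0 can add the block's result into the grid-wide total. A one-line sketch, assuming the running total is referenced by trap_p as in the earlier kernels:

   if (threadIdx.x == 0) atomicAdd(trap_p, blk_result);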
6.13.6 Performance
Before moving on, let’s take a final look at the run-times for our various versions of
the trapezoidal rule. (See Table 6.10.) The problem is the same: find the area under
Program 6.15: CUDA kernel implementing trapezoidal rule and using shared mem-
ory. This version can use large thread blocks.
Table 6.10 Mean run-times for trapezoidal rule using arbitrary block size (times in ms).

                              ARM            Nvidia     Intel      Nvidia GeForce
System                        Cortex-A15     GK20A      Core i7    GTX Titan X
Clock                         2.3 GHz        852 MHz    3.5 GHz    1.08 GHz
SMs, SPs                                     1, 192                24, 3072
Original                      33.6           20.7       4.48       3.08
Warp Shuffle, 32 ths/blk                     14.4                  0.210
Shared Memory, 32 ths/blk                    15.0                  0.206
Warp Shuffle                                 12.8                  0.141
Shared Memory                                14.3                  0.150
13 Technically, a bitonic sequence is either a sequence that first increases and then decreases, or it is a
sequence that can be converted to such a sequence by one or more circular shifts. For example, 3, 5, 4, 2, 1
is a bitonic sequence, since it increases and then decreases, but 5, 4, 2, 1, 3 is also a bitonic sequence, since
it can be converted to the first sequence by a circular shift.