Commit 53b4539
Author: Bernhard Kerbl
Improved assignment description
1 parent 6e70146 commit 53b4539

File tree
1 file changed: +11 −8 lines


17_CooperativeGroups/src/main.cu

Lines changed: 11 additions & 8 deletions

@@ -159,17 +159,20 @@ int main()
 	/*
 	Exercises:
 	1) Write a kernel where each thread first computes its ID in a register.
-	Within each group of 4 consecutive threads, threads should then share their
+	Within each group of 4 consecutive threads, threads should then share their
 	ID with all others, using shuffling. Write this kernel once with, once without
 	cooperative groups, and confirm correctness via output.
-	2) Launch a COOPERATIVE KERNEL and use grid-wide synchronization to make sure
+	2) Launch a COOPERATIVE KERNEL and use grid-wide synchronization to make sure
 	all threads in the entire grid are at the same point in the program. Can you
-	think of any use cases for this?
-	3) Write a simple program with the following tasks A, B, C, each with N threads.
-	In A, each thread t should compute and store t*t in its output A_out[t]. In B,
-	each thread t should compute A_out[N - t - 1] - t and store it in its output
-	B_out[t]. In C, each thread t should compute B_out[N - t - 1] + 4 and store it
-	in its output C_out[t]. Implement this once using one kernel for each task A,
+	think of any use cases for this? Your device will need to support the attribute
+	cudaDevAttrCooperativeLaunch for this, check if it has it before starting.
+	3) Write a simple program with the following tasks A, B, C, each with N threads.
+	In A, each thread t should compute and store t*t in its output A_out[t]. In B,
+	each thread t should compute A_out[N - t - 1] - t and store it in its output
+	B_out[t]. In C, each thread t should compute B_out[N - t - 1] + 4 and store it
+	in its output C_out[t]. Implement this once using one kernel for each task A,
 	and once with a single kernel that uses grid synchronization between tasks.
 	In the single kernel, do you need additional threadfences and/or volatiles?
+	Again, in order to do grid sync, your device will need to support the
+	cudaDevAttrCooperativeLaunch attribute, check if it has it before starting.
 	*/
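
For exercise 1, the two variants of the shuffle-based ID exchange could look like the following sketch. The kernel names are hypothetical, not part of the repository; the cooperative-groups version partitions each block into 4-wide tiles and uses the tile's `shfl`, while the plain version uses `__shfl_sync` with a shuffle width of 4 so lane indices wrap within each group of 4 consecutive threads.

```cuda
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// With cooperative groups: partition the block into tiles of 4 threads,
// then every thread reads the ID of each of its 3 tile neighbors.
__global__ void ShareIDsCG()
{
	int id = threadIdx.x + blockIdx.x * blockDim.x; // ID in a register
	auto tile = cg::tiled_partition<4>(cg::this_thread_block());
	for (int i = 0; i < 4; i++)
		printf("Thread %d sees %d\n", id, tile.shfl(id, i));
}

// Without cooperative groups: __shfl_sync with width 4 makes the source
// lane index relative to each 4-lane subsection of the warp.
__global__ void ShareIDsPlain()
{
	int id = threadIdx.x + blockIdx.x * blockDim.x; // ID in a register
	for (int i = 0; i < 4; i++)
		printf("Thread %d sees %d\n", id, __shfl_sync(0xFFFFFFFF, id, i, 4));
}
```

Both kernels should print identical ID sets per group of 4 threads, which is one way to "confirm correctness via output" as the exercise asks.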
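
For exercises 2 and 3, a minimal sketch of the newly described attribute check and a cooperative launch with grid-wide sync might look as follows (kernel and variable names are illustrative, not from the repository). It covers tasks A and B only; task C follows the same pattern with another `grid.sync()`.

```cuda
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Tasks A and B fused into one kernel: grid.sync() guarantees every
// thread in the grid has finished writing A_out before B reads it.
__global__ void TasksAB(int* A_out, int* B_out, int N)
{
	cg::grid_group grid = cg::this_grid();
	int t = grid.thread_rank();
	if (t < N) A_out[t] = t * t;                // task A
	grid.sync();                                 // grid-wide barrier
	if (t < N) B_out[t] = A_out[N - t - 1] - t;  // task B
}

int main()
{
	// Check cudaDevAttrCooperativeLaunch before starting, as the
	// exercise text requires.
	int supported = 0;
	cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, 0);
	if (!supported)
	{
		printf("Cooperative launch not supported on this device\n");
		return 0;
	}

	int N = 256;
	int *A_out, *B_out;
	cudaMalloc(&A_out, N * sizeof(int));
	cudaMalloc(&B_out, N * sizeof(int));

	// Cooperative kernels must go through cudaLaunchCooperativeKernel;
	// arguments are passed as an array of pointers.
	void* args[] = { &A_out, &B_out, &N };
	cudaLaunchCooperativeKernel((void*)TasksAB, dim3(1), dim3(N), args);
	cudaDeviceSynchronize();

	cudaFree(A_out);
	cudaFree(B_out);
	return 0;
}
```

Note that the grid must be small enough for all blocks to be resident simultaneously, otherwise `grid.sync()` would deadlock; `cudaOccupancyMaxActiveBlocksPerMultiprocessor` can be used to size the launch safely.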

0 commit comments

Comments
 (0)