CME 213, Introduction to parallel computing
Eric Darve
Spring 2025
Homework 3
Total number of points: 100 (+ 10 bonus).
This assignment builds on the previous assignment’s theme of examining memory access patterns. You
will implement a finite difference solver for the 2D heat diffusion equation in different ways to examine the
performance characteristics of different implementations. Please refer to the class and lecture slides while
working on this homework.
Background on the heat diffusion PDE. The heat diffusion PDE that we will be solving can be written
∂T/∂t = ∂²T/∂x² + ∂²T/∂y²
To solve this PDE, we are going to discretize both the temporal and the spatial derivatives. To do this, we
define a two-dimensional grid G_{i,j}, 1 ≤ i ≤ nx, 1 ≤ j ≤ ny,[1] where nx is the number of points on the
x-axis and ny is the number of points on the y-axis. At each time step, we will evaluate the temperature
and its derivatives at these grid points.
While we will consistently use a first-order discretization scheme for the temporal derivative, we will
use a 2nd, 4th, or 8th order discretization scheme for the spatial derivative.[2]
If we denote by T^t_{i,j} the temperature at time t at a point (i, j), the 2nd order spatial discretization scheme
can be written as

T^{t+1}_{i,j} = T^t_{i,j} + C^(x) (T^t_{i+1,j} − 2 T^t_{i,j} + T^t_{i−1,j}) + C^(y) (T^t_{i,j+1} − 2 T^t_{i,j} + T^t_{i,j−1}).
The constants C^(x) (xfcl in the code) and C^(y) (yfcl in the code) are called Courant numbers. They depend
on the temporal discretization step as well as the spatial discretization step. To ensure the stability of the
discretization scheme, they have to be less than the maximum value given by the Courant-Friedrichs-Lewy
condition.[3] The starter code we provide takes care of picking the temporal discretization step to maximize
the Courant numbers while ensuring stability.
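To make the notation concrete, here is a minimal sketch (assuming float data in a flat, row-major array of
width gx; this is not the starter-code implementation) of a single 2nd order update of an interior point:

// Sketch only: one 2nd order update of the interior point (i, j).
// curr holds T^t, next holds T^{t+1}; xfcl and yfcl are the Courant numbers.
int idx = j * gx + i;
next[idx] = curr[idx]
          + xfcl * (curr[idx + 1]  - 2.f * curr[idx] + curr[idx - 1])   // x direction
          + yfcl * (curr[idx + gx] - 2.f * curr[idx] + curr[idx - gx]); // y direction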
The starter code also contains host and device functions (stencil in CPUComputation.h and Stencil
in gpuStencil.cu, respectively) which contain the coefficients that go into the update equation. Therefore,
you do not need to figure out how to implement the different order updates. You only need to pass the
correct arguments when calling the device function.
Boundary conditions. The file BC.h in the starter code contains functions that will update the boundary
conditions and the temperature at points on the border for you. Therefore, you do not have to worry about
the size of the stencil as you approach the wall.[4]
[1] In fact, the size of the grid is gx × gy, but we will only update the interior region of size nx × ny.
[2] For instance, if we use a 4th order discretization scheme, we will express the derivative of T with respect to x at a point (i, j) using T^t_{i−2,j}, T^t_{i−1,j}, T^t_{i,j}, T^t_{i+1,j}, and T^t_{i+2,j}.
[3] If you are interested, you can read more about this at http://en.wikipedia.org/wiki/Courant-Friedrichs-Lewy_condition.
[4] A general way of dealing with this problem is to change the size of the stencil as you approach the wall. This is complicated, and we simplified the process for this homework.
Various implementations. In this programming assignment, we are going to implement two different
kernels (and you can do a third one for extra credit):
• Global (function gpuStencilGlobal): this kernel will use global memory. Each thread should update
exactly one point of the mesh. You should use a 1D grid and 1D blocks with nx × ny threads total (see
the indexing sketch after this list).
• Block (function gpuStencilBlock): this kernel will also use global memory. Each thread should
update numYPerStep points of the mesh (these points form a vertical line). You should use a 2D grid
with nx × (ny / numYPerStep) threads total.
• (Extra Credit) Shared (function gpuStencilShared): this kernel will use shared memory. A group of
threads should load a piece of the mesh into shared memory and then compute the new values of the
temperatures on the mesh. Each thread should load and update several elements.[5]
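As an illustration of the indexing involved, the following sketch shows how the first two kernels might map
threads to interior points (the variable names are assumptions, not the required signatures in gpuStencil.cu):

// Global: 1D grid of 1D blocks, one thread per interior point.
int tid = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. nx*ny - 1
if (tid < nx * ny) {
    int i = tid % nx;                              // interior x index
    int j = tid / nx;                              // interior y index
    // update point (i, j) exactly once
}

// Block: 2D grid of 2D blocks, each thread updates numYPerStep points
// along a vertical line.
int i  = blockIdx.x * blockDim.x + threadIdx.x;
int j0 = (blockIdx.y * blockDim.y + threadIdx.y) * numYPerStep;
if (i < nx) {
    for (int s = 0; s < numYPerStep && j0 + s < ny; ++s) {
        // update point (i, j0 + s)
    }
}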
Starter code. The starter code is composed of the following files (* means the file will not be submitted
by our script):
• *BC.h — This file contains the class boundary_conditions that will allow us to update the boundary
conditions during the simulation. Do not modify this file.
• *CPUComputation.h — This file contains functions related to the CPU implementation of the solver.
The CPU implementation should help you understand the solver and help write your GPU implemen-
tation. Do not modify this file.
• *Errors.h — This file contains the functions we will use to test your code. Specifically, these functions
will check for differences between the output of your implementation and the CPU implementation,
and write any errors to different file streams. Do not modify this file.
• gpuStencil.cu — This file contains kernels and wrapper functions that you need to implement.
Please fill in the sections marked TODO in this file.
• *Grid.h and *Grid.cu — These files contain the data structure that models the grid used to solve the
PDE. Do not modify these files.
• *hw3.sh and *run.sh — The script hw3.sh is used to submit jobs to the queue; it calls the script
run.sh, which runs the executable main with different combinations of parameters.
Do not modify these files.
• *main.cu — This file contains the main function and one test for each of the kernels we ask you to
implement. When you run the executable main, the main function will be called and will initialize
some global variables so that we may use the same starting grid for all three of our tests: global,
block, and shared. After initialization in main, the program will execute the tests in order. Note that if
you aren’t doing the extra credit (i.e., implementing the shared kernel), your code will always fail the
last test. Do not modify this file.
• Makefile — Running make will produce an executable main. Running make clean will remove the
build files as well as debug output. You may also change the nofma flag by commenting out line 30 of
the Makefile. This flag is further discussed below.
• mp1-util.h — This file contains utilities used in CPUComputation.h and Errors.h. Do not modify
this file.
[5] Note, however, that the threads that loaded data on the border of the small piece will stay idle during the computation step.
• *params.in — The parameters used in the computation are read from the file params.in. You will
need to modify this file (except the second line) for debugging, performance testing, and to answer
the questions (an example layout is sketched after this file list). Here is a list of the parameters in the
order that they appear in the file params.in:
int nx_, ny_; // number of grid points in each dimension
double lx_, ly_; // extent of physical domain in each dimension
int iters_; // number of iterations to do
int order_; // order of discretization scheme for the spatial derivative
• plot.py — This script can be used to generate plots for Question 1 by running the command
python plot.py name_of_your_csv.csv. You may need to replace python with python3, and the
script expects that the csv was generated using the member function saveStateToFile of the class
Grid. If you choose to use this script, you will need to run it in your own Python environment since
some required packages are not installed on cme213-login.stanford.edu.
• *simParams.h and *simParams.cpp — These files contain the data structure necessary to handle the
parameters of the problem. Do not modify these files.
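For reference, a params.in file following the parameter list above might look like the example below. The
layout shown (one line per parameter group, whitespace-separated values) is an assumption; use the provided
params.in as the authoritative template, and keep its second line unchanged.

256 256
1.0 1.0
2000
8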
Running the program. Type make to compile the code. Once this is done, you can use the command
$ srun -p gpu-turing --gres=gpu:1 ./main
to run the executable using the parameters in params.in, or just execute
$ sbatch hw3.sh
to run with all of the different parameter combinations you may need to answer the questions below.
The output produced by the program will contain:
• The time and bandwidth for the CPU implementation and the different GPU implementations.
• A report of the errors for the different implementations. If A is the CPU solution and B is the GPU
solution, the program will output:
– L2Ref = sqrt( (1 / (gx·gy)) Σ_{i=1..gy} Σ_{j=1..gx} A_{ij}² ), the L2 norm of the CPU solution.
– L2Inf = max_{1≤i≤gy, 1≤j≤gx} |A_{ij} − B_{ij}|, the L∞ norm of the error between the CPU solution and
the GPU solution.
– L2Err = sqrt( (1 / (gx·gy)) Σ_{i=1..gy} Σ_{j=1..gx} (A_{ij} − B_{ij})² ), the L2 norm of the error between the
CPU solution and the GPU solution.
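If you want to check the error yourself while debugging (Errors.h already computes these norms for the
graded runs), the L∞ error above can be computed along these lines, assuming A and B are flat, row-major
arrays of size gx × gy:

// Sketch: L-infinity error between CPU solution A and GPU solution B.
// Requires <cmath> (std::fabs) and <algorithm> (std::max).
double linf = 0.0;
for (int n = 0; n < gx * gy; ++n)
    linf = std::max(linf, std::fabs(A[n] - B[n]));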
The output of running the hw3.sh script can be found in the files result.txt, globalErrors.txt, and
sharedErrors.txt. With -gpu=nofma, the errors should be equal to 0 and the files globalErrors.txt (and
sharedErrors.txt if you implemented the shared kernel) should be empty. We will test your code with
this flag, so make sure that your code passes all of our tests with this flag set.
Without -gpu=nofma, typical ranges for the roundoff errors are:
• 10^−8 to 10^−7 for the L∞ error norm.
• 10^−9 to 10^−8 for the L2 error norm.
Question 1
(30 points) Implement the function gpuStencilGlobal that runs the solver using global memory, and create
3D surface plots of the temperature on a 256 × 256 grid at iterations 0, 1000, and 2000, using an 8th order
discretization scheme. You must also fill in the function gpuComputationGlobal. The
difference (in terms of the norms) between your implementation and the CPU implementation should be in
the expected range. The class Grid has a member function saveStateToFile to dump all the data of the
grid to a CSV file. You can call this function, for example, inside the for loop in gpuComputationGlobal to save the state
of the grid at a particular iteration. You can use plot.py or your own tools (Python, MATLAB, etc.) to
generate the plots and include them in your writeup.
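For example, a save inside the time-stepping loop might look like the sketch below (the loop variable, call
site, and filename argument are assumptions; check the actual signature of saveStateToFile in Grid.h, and
note that you may need to copy the grid data back from the device first, depending on how Grid manages its
memory):

// Sketch: dump the grid state at the iterations needed for the plots.
if (iter == 0 || iter == 1000 || iter == 2000) {
    grid.saveStateToFile("state_iter" + std::to_string(iter) + ".csv");
}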
Deliverables:
1. Code: The completed kernel gpuStencilGlobal, and completed function gpuComputationGlobal in
gpuStencil.cu.
2. Writeup: 3 surface plots of the temperature at iterations 0, 1000, and 2000 using an 8th order stencil.
Question 2
(35 points) Implement the functions gpuStencilBlock and gpuComputationBlock. You should use 2D blocks
and grids to implement the blocking strategy we talked about in class. You are responsible for choosing
block and grid dimensions as well as the value of numYPerStep to optimize the performance of your code.
The difference (in terms of the norms) between your implementation and the CPU implementation should
be in the expected range.
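As a starting point only (the names and values below are placeholders, and you are expected to tune them),
the launch configuration could be computed along these lines:

// Sketch: a 2D launch covering the nx-by-ny interior, with each thread
// handling numYPerStep vertically adjacent points.
constexpr int numYPerStep = 8;          // example value; tune it
dim3 threads(64, 8);                    // example block shape; tune it
dim3 blocks((nx + threads.x - 1) / threads.x,
            (ny + threads.y * numYPerStep - 1) / (threads.y * numYPerStep));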
Deliverables:
1. Code: The completed kernel gpuStencilBlock, and completed function gpuComputationBlock in
gpuStencil.cu.
Question 3
(15 points) Plot the bandwidth (in GB/s) as a function of the grid size (in MegaPoints, i.e. in millions of
points) for the following grid sizes: 256×256; 512×512; 1024×1024; 2048×2048; 4096×4096. You can choose
the number of iterations to run to generate your results. You should run enough iterations so that results
are stable and meaningful (clearly 1 is not enough).
Deliverables (writeup only):
1. For order = 8, plot the bandwidth for the 2 (or 3) different algorithms.
2. For the block implementation, plot the bandwidth for the 3 different orders.
3. Extra credit: If you implemented the shared algorithm, plot the bandwidth for the 3 different orders.
Use a log scale for the x-axis of your plots.
Question 4
(20 points) Which kernel (global, block, or shared) and which order (2, 4, or 8) gives the best performance?
Your answer may depend on the grid size. Explain your results from Question 3.
Deliverables (writeup only): Explain performance differences
1. among kernels,
2. from varying order,
3. from varying grid size.
Discussing the shared memory kernel is optional (extra credit).
Question 5
(Extra credit 10 points) Implement the function gpuStencilShared that runs the solver using shared memory.
You should also fill in the function gpuComputationShared. Note that you have to answer the questions
related to the shared memory implementation in Questions 3 and 4 to receive the full extra credit.
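Conceptually, the shared kernel follows a load-then-compute tiling pattern. The sketch below shows only the
general structure (the tile size is a placeholder choice, not a prescribed implementation):

// Sketch of the shared-memory tiling pattern with a halo of width order/2.
__shared__ float tile[32][32];          // placeholder tile size
// 1. Each thread cooperatively loads several elements of the tile from
//    global memory, including the halo around the tile's interior.
// 2. __syncthreads();
// 3. Threads mapped to interior (non-halo) tile points compute the update
//    from shared memory; halo-only threads stay idle during this step.
// 4. Write the updated interior points back to global memory.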
Deliverables:
1. Code: The completed kernel gpuStencilShared, and completed function gpuComputationShared in
gpuStencil.cu.
2. Writeup: Extra credit portion for Question 3 and Question 4.
A Submission instructions
1. For all questions that require explanations and answers besides source code, put those explanations
and answers in a separate single PDF file. Upload this file on Gradescope.
2. Submit your code by uploading a zip file on Gradescope. Here is the list of files we are expecting:
gpuStencil.cu
We will not evaluate any code in files not listed above. Make sure to keep all file names as they are.
B Advice and hints
• Make sure you understand each parameter of the problem by looking at the class simParams. In
particular, the difference between nx_ and gx_ is important to understand.
• Keep in mind where your data are when choosing kernel parameters. You will be reading from global
memory using the read-only cache (L1, 128-byte cache lines). Writes to global memory cannot be cached
in L1 and therefore have to go to the shared L2 cache, which has 32-byte cache lines (see the short
illustration after this list).
• Make sure your implementations can deal with rectangular grids as well as square ones.
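For instance (an illustrative sketch, not part of the starter code), with a row-major layout a warp reads a
contiguous segment of a row when consecutive threads differ only in their x index, which keeps global-memory
accesses coalesced:

// Coalesced: threads in a warp have consecutive threadIdx.x, hence
// consecutive x, hence contiguous addresses within one row of the grid.
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
float v = in[y * gx + x];   // one warp touches only a few cache lines
// In contrast, indexing so that consecutive threads land in different rows
// makes each thread pull in a separate cache line and wastes bandwidth.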