stabilize fv3core performance #91

rheacangeo · 2022-01-05T05:19:54Z

Purpose

Since we moved from fv3core to pace, performance has been unstable: c128_6ranks_baroclinic timesteps with the gtc:gt:gpu backend take anywhere from the expected 3.5 seconds up to 7 seconds. By experimenting with and without the flag, we identified the change in the default CUDA flags set in buildenv, specifically always setting CRAY_CUDA_MPS=1, as the likely source of this fluctuation. This flag lets multiple ranks share the same GPU, which we rely on for the 54-rank parallel tests (so our CI doesn't have to run on 54 nodes for each PR). For run_on_daint, however, the flag appears to cause inconsistent halo update performance, even though other slurm settings specify that each rank works on only 1 GPU. Performance analysis using CUDA_LAUNCH_BLOCKING showed that the fluctuations occurred primarily during halo updates, yet the halo update code has not changed functionally (though it has changed quite a bit cosmetically) since performance was stable in fv3core.

Code changes:

The performance job script that run_on_daint submits now sets CRAY_CUDA_MPS=0.
The PYTHONOPTIMIZE setting was also moved into the slurm script instead of being set in run_on_daint, so the batch script can be relaunched without rerunning run_standalone or run_on_daint.
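
For illustration only, the environment section of such a batch script might look roughly like the sketch below. This is not the actual script from this PR: the `#SBATCH` values, the PYTHONOPTIMIZE value, and the placeholder launch comment are all assumptions. Only the two variable names and the CRAY_CUDA_MPS=0 setting come from the description above.

```shell
#!/bin/bash
# Hypothetical excerpt of a performance batch script (values are assumptions).
#SBATCH --ntasks=6
#SBATCH --ntasks-per-node=1

# Override the buildenv default (CRAY_CUDA_MPS=1): performance runs use one
# rank per GPU, so GPU sharing between ranks is not needed here.
export CRAY_CUDA_MPS=0

# Set in the batch script itself so it can be resubmitted directly.
# Any non-empty value enables Python's -O behavior; "1" is an assumed choice.
export PYTHONOPTIMIZE=1

# Record the effective settings in the job log for later comparison.
echo "CRAY_CUDA_MPS=${CRAY_CUDA_MPS} PYTHONOPTIMIZE=${PYTHONOPTIMIZE}"

# The actual launch command would follow here (omitted in this sketch).
```

Keeping both settings in the batch script means a resubmission of the same file reproduces the same environment, and the echoed line makes it easy to confirm which flags a given timing run used.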


FlorianDeconinck

LGTM

My main issue with our current pipeline is that its format makes in-code comments complicated. This is a prime example of a place where CRAY_CUDA_MPS should be explained.

override buildenv default behavior to support multiple ranks using th…

3367823

…e same gpu for performance runs, it leads to inconsistent results

rheacangeo requested a review from FlorianDeconinck January 5, 2022 05:20

FlorianDeconinck approved these changes Jan 5, 2022


rheacangeo merged commit 7e9541f into main Jan 5, 2022

rheacangeo deleted the stabilize-fv3core-performance branch January 5, 2022 17:01
