
Conversation

@rheacangeo
Contributor

Purpose

Since we moved from fv3core to pace, performance has been unstable, ranging from the expected 3.5 seconds up to 7 seconds for c128_6ranks_baroclinic timesteps with the gtc:gt:gpu backend. Through experimentation with and without the flag, we identified the change in default CUDA flags set in buildenv, specifically always having CRAY_CUDA_MPS=1, as the likely source of this fluctuation. That flag allows multiple ranks to share the same GPU, which we do rely on for the 54-rank parallel tests (so our CI doesn't have to run 54 nodes on a PR). For run_on_daint, however, even though other slurm settings already assign each rank its own GPU, leaving MPS enabled appears to result in inconsistent halo-update performance. Performance analysis using CUDA_LAUNCH_BLOCKING showed that the fluctuations occur primarily during halo updates, yet the halo update code has not changed functionally (only cosmetically) since performance was stable in fv3core.
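For context, this is roughly the kind of diagnostic run used for that analysis; the script name and srun options below are placeholders for illustration, not the repository's actual invocation:

```bash
# Illustrative only: the real benchmark is driven by run_standalone /
# run_on_daint with their own arguments.
# CUDA_LAUNCH_BLOCKING=1 forces every kernel launch to be synchronous, so the
# per-region timers charge time to the region that actually spent it (here,
# the halo updates) instead of to whatever call happens to synchronize next.
export CUDA_LAUNCH_BLOCKING=1
srun --ntasks=6 python run_standalone.py --backend=gtc:gt:gpu
```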

Code changes:

  • The performance batch script that run_on_daint submits now specifies CRAY_CUDA_MPS=0 (see the sketch after this list).
  • The PYTHONOPTIMIZE setting was also moved into the slurm batch script, rather than being set in run_on_daint, so the batch script can be relaunched directly without rerunning run_standalone or run_on_daint.
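A minimal sketch of what the submitted batch script ends up containing, assuming a 6-rank run; the SBATCH options and the final srun line are placeholders, and only the two exported variables reflect this change:

```bash
#!/bin/bash -l
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=gpu

# Disable the CUDA Multi-Process Service for performance runs: each rank is
# already pinned to its own GPU here, and leaving MPS on is what caused the
# unstable halo-update timings.
export CRAY_CUDA_MPS=0

# Set here rather than in run_on_daint so the batch script can be resubmitted
# with sbatch directly, without rerunning run_standalone or run_on_daint.
export PYTHONOPTIMIZE=TRUE

srun python run_standalone.py --backend=gtc:gt:gpu  # placeholder invocation
```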

Contributor

@FlorianDeconinck left a comment


LGTM

My main issue with our current pipeline is that its format makes in-code comments complicated. This is a prime example where CRAY_CUDA_MPS should be explained in the code.

@rheacangeo rheacangeo merged commit 7e9541f into main Jan 5, 2022
@rheacangeo rheacangeo deleted the stabilize-fv3core-performance branch January 5, 2022 17:01