Purpose
Since we moved from fv3core to pace, performance has been unstable: c128_6ranks_baroclinic timesteps with the gtc:gt:gpu backend take anywhere from the expected 3.5 seconds up to 7 seconds. Experiments with and without the flag point to a change in the default CUDA flags set in buildenv, specifically always setting CRAY_CUDA_MPS=1, as the likely source of this fluctuation. This flag lets multiple ranks share the same GPU, which we rely on for the 54-rank parallel tests (so our CI doesn't have to run on 54 nodes for every PR). For run_on_daint, however, the flag appears to cause inconsistent halo update performance, even though other Slurm settings already restrict each rank to a single GPU. Performance analysis using CUDA_LAUNCH_BLOCKING showed that the fluctuations occur primarily during halo updates, yet the halo update code has not changed functionally (though it has changed quite a bit cosmetically) since performance was stable in fv3core.
Code changes: