Fix reduce_blocks_into_lanes race condition #1798
Merged
crcrpar merged 2 commits on Apr 26, 2024
eqy
approved these changes
Apr 19, 2024
eqy
reviewed
Apr 19, 2024
crcrpar
approved these changes
Apr 26, 2024
We are seeing numerical mismatches on GH and H100 when running the following unit tests: https://github.com/NVIDIA/apex/blob/master/tests/L0/run_optimizers/test_lamb.py#L251, https://github.com/NVIDIA/apex/blob/master/tests/L0/run_optimizers/test_lamb.py#L315. Running compute sanitizer on these tests reports races:

    root@6b09a87d0bbc:/opt/pytorch/apex/tests/L0/run_optimizers# compute-sanitizer --tool racecheck python test_lamb.py -v -k test_multi_params
    ========= COMPUTE-SANITIZER
    test_multi_params (__main__.TestFusedLAMB) ...
    ========= Error: Race reported between Read access at T1 reduce_block_into_lanes<float>(T1 *, T1, int, bool)+0x2b0 in /opt/pytorch/apex/csrc/type_shim.h:350
    =========     and Write access at T1 reduce_block_into_lanes<float>(T1 *, T1, int, bool)+0x5f0 in /opt/pytorch/apex/csrc/type_shim.h:333 [128 hazards]

Comparing with the PyTorch reduce_block_into_lanes, we find that one major difference is the location of the final __syncthreads(): https://github.com/pytorch/pytorch/blob/1ec05c769b7e1c6ab5ba75f86b4ae6d43d77ac96/aten/src/ATen/native/cuda/WeightNorm.cu#L96.

Looking at the usage (https://github.com/search?q=repo%3ANVIDIA%2Fapex%20reduce_block_into_lanes&type=code), we note that share_result=false is always used, so the final __syncthreads() is never executed in apex use cases. We therefore hypothesize that, in the unit tests, reduce_block_into_lanes is called multiple times back to back. Because there is no sync after the read at line 350, the write at line 333 from the second call can race ahead of the read at line 350 from the first call.
This PR attempts to fix the race by moving the final __syncthreads() to its proper location, so that it is always executed.
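
For illustration only, here is a minimal sketch of the pattern described above. It is not the apex code: `block_sum_sketch`, `two_reductions`, and `run_check` are hypothetical names, and the sketch assumes a 1-D block whose size is a power of two and at least 64, reducing into a single lane. It shows a shared-memory reduction that ends with an unconditional `__syncthreads()`, called twice back to back on the same buffer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical, simplified block-wide sum (NOT the apex implementation).
// Assumes a 1-D block whose size is a power of two and at least 64.
template <typename T>
__device__ T block_sum_sketch(T* smem, T val, bool share_result = false) {
  const int tid = threadIdx.x;

  smem[tid] = val;  // write phase: every thread stores into shared memory
  __syncthreads();

  // Tree reduction in shared memory down to 64 partial sums.
  for (int i = blockDim.x >> 1; i >= 64; i >>= 1) {
    if (tid < i) smem[tid] += smem[tid + i];
    __syncthreads();
  }

  T result = T(0);
  if (tid < 32) {
    result = smem[tid] + smem[tid + 32];  // read phase: first warp only
    for (int i = 16; i >= 1; i >>= 1)
      result += __shfl_down_sync(0xffffffffu, result, i);
  }

  if (share_result && tid == 0) smem[0] = result;

  // Unconditional barrier. Without it, other warps could return early and
  // start the write phase of a following call while the first warp is
  // still in the read phase of this one.
  __syncthreads();
  return result;  // the full sum is only meaningful in thread 0
}

// Two back-to-back reductions reusing the same shared buffer.
__global__ void two_reductions(const float* in, float* out) {
  __shared__ float smem[128];
  const float v = in[threadIdx.x];
  const float a = block_sum_sketch(smem, v);
  const float b = block_sum_sketch(smem, 2.0f * v);
  if (threadIdx.x == 0) { out[0] = a; out[1] = b; }
}

// Host-side check: 128 ones should sum to 128, and doubled to 256.
bool run_check() {
  const int n = 128;
  float h_in[n], h_out[2] = {0.0f, 0.0f};
  for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, 2 * sizeof(float));
  cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

  two_reductions<<<1, n>>>(d_in, d_out);
  cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_in);
  cudaFree(d_out);
  return h_out[0] == 128.0f && h_out[1] == 256.0f;
}

int main() {
  const bool ok = run_check();
  std::printf("%s\n", ok ? "ok" : "mismatch");
  return ok ? 0 : 1;
}
```

Removing the trailing barrier from this sketch and running it under `compute-sanitizer --tool racecheck` would be one way to check whether the same class of hazard is reported, though whether a mismatch is actually observed depends on warp scheduling.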
cc @eqy, @crcrpar