

Contributor

@malfet malfet commented Jul 1, 2020

Cherry-pick #40495 into 1.6

As part of debugging flaky ddp_under_dist_autograd tests, I realized
we were running into the following deadlock.

  1. Rank 0 enters DDP construction, holds the GIL, and waits for the broadcast
    issued during DDP construction.
  2. Rank 3 is a little slower and performs an RRef fetch call before entering
    DDP construction.
  3. The RRef fetch call is served on Rank 0 and tries to acquire the GIL there.
  4. This is a deadlock: Rank 0 is waiting for Rank 3 to enter the
    collective, and Rank 3 is waiting for Rank 0 to release the GIL.
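
The four steps above can be modeled with plain Python threads. This is only an illustrative sketch, not the actual change: the real fix lives inside PyTorch (the branch name suggests it releases the GIL), and everything here is a hypothetical stand-in. A `threading.Lock` plays the role of Rank 0's GIL, a `threading.Event` plays the role of the broadcast collective, and timeouts replace the indefinite hang so the script terminates. The function name `simulate_ranks` and its parameters are assumptions made for this example.

```python
import threading


def simulate_ranks(release_lock_while_waiting: bool, timeout: float = 0.2) -> bool:
    """Return True if both simulated ranks make progress, False if they stall.

    Hypothetical model: `gil` stands in for Rank 0's GIL and
    `rank3_in_collective` stands in for Rank 3 joining the broadcast.
    """
    gil = threading.Lock()
    rank0_holds_gil = threading.Event()
    rank3_in_collective = threading.Event()
    results = {}

    def rank0_ddp_construction():
        gil.acquire()
        rank0_holds_gil.set()
        if release_lock_while_waiting:
            # Fixed behaviour: drop the lock before blocking on the collective.
            gil.release()
            results["rank0"] = rank3_in_collective.wait(2 * timeout)
        else:
            # Buggy behaviour: block on the collective while holding the lock.
            try:
                results["rank0"] = rank3_in_collective.wait(2 * timeout)
            finally:
                gil.release()

    def rank3_rref_fetch_then_ddp():
        rank0_holds_gil.wait()
        # The RRef fetch executes on Rank 0, so it needs Rank 0's lock.
        acquired = gil.acquire(timeout=timeout)
        if acquired:
            gil.release()
            # Only after the fetch completes does Rank 3 enter the collective.
            rank3_in_collective.set()
        results["rank3"] = acquired

    threads = [
        threading.Thread(target=rank0_ddp_construction),
        threading.Thread(target=rank3_rref_fetch_then_ddp),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results["rank0"] and results["rank3"]


if __name__ == "__main__":
    print("lock held during wait:", simulate_ranks(False))
    print("lock released during wait:", simulate_ranks(True))
```

In the real system there are no timeouts, so the first configuration would hang instead of returning `False`.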

Test Plan:

  1. Ran ddp_under_dist_autograd 500 times.
  2. waitforbuildbot

Summary:
Pull Request resolved: pytorch#40495

ghstack-source-id: 106534442

Differential Revision: D22205180

fbshipit-source-id: 6afd55342e801b9edb9591ff25158a244a8ea66a
@dr-ci

dr-ci bot commented Jul 1, 2020

💊 CI failures summary and remediations

As of commit 6be6fd3 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_build is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.



@seemethere
Member

Going to go ahead and merge this

@seemethere seemethere merged commit b4b8f5b into pytorch:release/1.6 Jul 1, 2020
@seemethere seemethere added this to the 1.6.0 milestone Jul 1, 2020
@malfet malfet deleted the malfet/cherry-pick-release-gil-into-1.6 branch July 1, 2020 22:28